DA Savitz - Interpreting Epidemiologic Evidence - Strategies For Study Design and Analysis 2003
Epidemiologic
Evidence
Strategies for Study Design
and Analysis
DAVID A. SAVITZ
2003
Oxford New York
Auckland Bangkok Buenos Aires Cape Town Chennai
Dar es Salaam Delhi Hong Kong Istanbul Karachi Kolkata
Kuala Lumpur Madrid Melbourne Mexico City Mumbai
Nairobi São Paulo Shanghai Taipei Tokyo Toronto
Printed in the United States of America
on acid-free paper
PREFACE
There was no shortage of epidemiology books when I started writing this book
in the summer of 1996, and the intervening years have brought many new and
very useful ones. Just as it is uninspiring to do a study that ends up as one more
dot on a graph or one more line in a table, it was certainly not my goal to merely
add one more book to an overcrowded shelf. But there was a particular need that
did not seem to be very well addressed in books or journal articles—to bring to-
gether concepts and methods with real research findings in order to make in-
formed judgments about the results.
One of the most difficult tasks facing new students and even experienced prac-
titioners of epidemiology is to assess how much confidence one can have in a
given set of findings. As I discussed such issues with graduate students and col-
leagues, and contemplated my own data, it was difficult to find a reasoned, bal-
anced approach. It’s easy but uninformative to simply acknowledge epidemiol-
ogy’s many pitfalls, and not much more difficult to mount a generic defense of
the credibility of one’s findings. What was difficult was to find empirical tools
for assessing the study’s susceptibility to specific sources of error in ways that
could actually change preconceptions and go beyond intuition. This kind of ex-
amination is most fruitful when it can give good or bad news about one’s study
that was unexpected.
I knew that such approaches were out there because I found candidate tools
pleting the challenging task of identifying suitable illustrations from the pub-
lished literature. Nonetheless, these colleagues and others who may read the book
will undoubtedly find statements with which they disagree, so their help and ac-
knowledgment should not be seen as a blanket endorsement of the book’s con-
tents. Following the spirit espoused in this book, critical evaluation is always
needed and I welcome readers’ comments on any errors in logic or omissions of
potentially valuable strategies that they may find.
CONTENTS
1. Introduction
7. Confounding
Definition and Theoretical Background
Quantification of Potential Confounding
Evaluation of Confounding
Integrated Assessment of Potential Confounding
Index
1
INTRODUCTION
This book was written for both producers and consumers of epidemiologic re-
search, though a basic understanding of epidemiologic principles will be neces-
sary at the outset. Little of the technical material will be new to experienced epi-
demiologists, but I hope that my perspective on the application of those principles
to interpreting research results will be distinctive and useful. For the large and
growing group of consumers of epidemiology, which includes attorneys and
judges, risk assessors, policy experts, clinicians, and laboratory scientists, the
book is intended to go beyond the principles learned in an introductory course
or textbook of epidemiology by applying them to concrete issues and findings.
Although it is unlikely that ambiguous evidence can be made conclusive or that
controversies will be resolved directly by applying these principles, a careful con-
sideration of the underlying reasons for the ambiguity or opposing judgments in
a controversy can represent progress. By pinpointing where the evidence falls
short of certainty, we can give questions a sharper focus, leading to a clearer de-
scription of the state of knowledge at any point in time and thus helping iden-
tify the research that could contribute to a resolution of the uncertainty that re-
mains.
Those who are called upon to assess the meaning and persuasiveness of epi-
demiologic evidence have a variety of approaches to consider. There are formal
guidelines for judging causality for an observed association, which is defined as
The product of a careful evaluation of the study itself, drawing on the relevant
methodological and substantive literature, is an informed judgment about the plau-
sibility, direction, and strength of confounding, as well as specifying further re-
search that would narrow uncertainty about the impact of confounding. Even when
agreement among evaluators cannot be attained, the areas of disagreement should
move from global questions about study validity to successively narrower ques-
tions that are amenable to empirical evaluation. To move from general disagree-
ment about the credibility of a study’s findings to asking such focused, answerable
questions as whether a specific potential confounder has a sufficiently large asso-
ciation with disease to have markedly distorted the study results represents real
progress. The methodologic principles are needed to refine the questions that give
rise to uncertainty and controversy, which must then be integrated with substan-
tive knowledge about the phenomenon of interest to make informed judgments.
Much of this book focuses on providing that linkage between methodological prin-
ciples and substantive knowledge in order to evaluate findings more accurately.
The challenges in interpretation relate to sets of studies of a given topic as
well as individual results. For example, consistency across studies is often in-
terpreted as a simple dichotomy: consistency in findings is supportive of a causal
association and inconsistency is counter to it. But epidemiologic studies are rarely,
if ever, pure replications of one another, and thus differ for reasons other than
random error. When studies that have different methodologic features yield sim-
ilar results, it can be tentatively assumed that the potential biases associated with
those aspects of the study that differ have not introduced bias, and a causal in-
ference is thus strengthened. Consistency across studies with features that should,
under a causal hypothesis, yield different results suggests that some bias may be
operating. There may also be meaningful differences in results from similarly de-
signed studies conducted in different populations, suggesting that some impor-
tant cofactors, necessary for the exposure to exert its impact, are present in some
populations but not others. Clearly, when methodologically stronger studies pro-
duce different results than weaker ones, lack of consistency in results does not
argue against causality.
The book has been organized to the extent possible in the order that issues
arise. Chapter 2 sets the stage for evaluating epidemiologic evidence by clarify-
ing the expected product of epidemiologic research, defining the goals. Next, I
propose an overall strategy and philosophy for considering the quality of epi-
demiologic research findings (Chapter 3). The following chapters systematically
cover the design, conduct, and analysis issues that bear on study interpretation.
Beginning with Chapter 4 and continuing through Chapter 9, sources of sys-
tematic error in epidemiologic studies are examined. The rationale for dividing
the topics as the table of contents does warrants a brief explanation. Selection
bias refers to “error due to systematic differences in characteristics between those
who take part in a study and those who do not” (Last, 2001). It is the constitu-
tion of the study groups that is the potential source of such error. The construc-
tion of study groups is different in practice (though not in theory) in cohort and
case–control studies. In cohort studies, groups with differing exposure status are
identified and monitored for the occurrence of disease in order to compare dis-
ease incidence among them. In case–control studies, the sampling is based on
health outcome; those who have experienced the disease of interest are compared
to a sample from the population that gave rise to those cases of disease. Because
the groups are constituted in different ways for different purposes in the two de-
signs, the potential for selection bias is considered in separate chapters (Chap-
ters 4 and 5). Furthermore, given that one particular source of selection bias, that
due to non-participation, is so ubiquitous and often so large, Chapter 6 addresses
this problem in detail.
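As a minimal sketch of the contrast between the two designs, the following Python fragment computes the measure each design naturally yields. All counts are hypothetical and chosen only for illustration, and the function names are assumptions of this sketch rather than anything defined in the text. The cohort compares disease incidence across exposure groups; the case-control study samples on outcome and compares exposure odds between cases and a sample of the population that gave rise to them.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Cohort design: ratio of disease incidence in exposed vs. unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


def exposure_odds_ratio(cases_exposed, cases_unexposed,
                        controls_exposed, controls_unexposed):
    """Case-control design: odds of exposure among cases vs. among controls."""
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical cohort: 1,000 exposed and 1,000 unexposed people are followed.
cohort_rr = risk_ratio(exposed_cases=30, exposed_total=1000,
                       unexposed_cases=10, unexposed_total=1000)

# Hypothetical case-control study in the same population: all 40 cases are
# enrolled, and 400 controls are sampled from the 1,960 people who remained
# disease-free, in proportion to exposure (about 198 exposed, 202 unexposed).
case_control_or = exposure_odds_ratio(cases_exposed=30, cases_unexposed=10,
                                      controls_exposed=198, controls_unexposed=202)

print(f"Cohort risk ratio:       {cohort_rr:.2f}")
print(f"Case-control odds ratio: {case_control_or:.2f}")
```

With controls sampled without regard to exposure, the two measures are close in this hypothetical population. Changing the control counts so that exposed people are over- or under-represented shifts the odds ratio while the cohort comparison stays the same, which is one way to see why the constitution of the study groups is examined separately for each design.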
Confounding, in which there is a mixing of effects from multiple exposures,
is similar in many respects to selection bias, but its origins are natural as opposed
to arising from the way the study groups were constituted. Evaluating the pres-
ence, magnitude, and direction of confounding is the subject of Chapter 7.
The consideration of measurement error, though algebraically similar regardless
of what is being measured, is conceptually different and has different implications
depending on whether the exposure, broadly defined as the potential causal factor
of interest, or the disease, again broadly defined as the health outcome of
interest, is measured with error. The processes by which error arises (e.g.,
memory errors producing exposure misclassification, diagnostic errors produc-
ing disease misclassification) and their implications for bias in measures of as-
sociation make it necessary to separate the discussions of exposure (Chapter 8)
and disease (Chapter 9) misclassification.
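To make the consequences of exposure misclassification concrete, here is a small sketch under stated assumptions: a binary exposure, hypothetical true counts in a case-control table, and a measurement instrument whose sensitivity and specificity (0.8 and 0.9, both invented for the example) are the same for cases and controls. The helper names are hypothetical. The code computes the counts one would expect to observe and compares the resulting odds ratio with the true one.

```python
def expected_observed_counts(true_exposed, true_unexposed, sensitivity, specificity):
    """Expected counts classified as exposed / unexposed by an imperfect measure."""
    observed_exposed = true_exposed * sensitivity + true_unexposed * (1 - specificity)
    observed_unexposed = true_exposed * (1 - sensitivity) + true_unexposed * specificity
    return observed_exposed, observed_unexposed


def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Exposure odds ratio from a 2x2 table."""
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical true exposure distribution.
TRUE_CASES = (200, 800)      # (exposed, unexposed) among cases
TRUE_CONTROLS = (100, 900)   # (exposed, unexposed) among controls

# Assumed measurement quality, identical for cases and controls.
SENSITIVITY = 0.8
SPECIFICITY = 0.9

true_or = odds_ratio(*TRUE_CASES, *TRUE_CONTROLS)

observed_cases = expected_observed_counts(*TRUE_CASES, SENSITIVITY, SPECIFICITY)
observed_controls = expected_observed_counts(*TRUE_CONTROLS, SENSITIVITY, SPECIFICITY)
observed_or = odds_ratio(*observed_cases, *observed_controls)

print(f"True odds ratio:     {true_or:.2f}")
print(f"Observed odds ratio: {observed_or:.2f}")
```

In this particular setup the observed odds ratio falls between 1 and the true value. Letting the sensitivity or specificity differ between cases and controls (for example, if memory errors depend on disease status) can move the observed value in either direction, which the sketch can also be used to explore.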
The complex topic of random error (how it arises, how it affects study results,
and how it should be characterized) is addressed in Chapter 10. The sequence is
intentional. Random error is discussed after the other factors to help counter the
long-held view that it is the first or automatically the most important issue to
consider in evaluating epidemiologic evidence.
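One common way to characterize random error is a confidence interval around the measure of association. The sketch below uses the standard large-sample interval for a risk ratio computed on the log scale; this is an illustrative choice made here, not necessarily the approach taken in Chapter 10, and the counts are hypothetical. It compares two studies with the same point estimate but different sizes.

```python
import math


def risk_ratio_with_ci(exposed_cases, exposed_total,
                       unexposed_cases, unexposed_total, z=1.96):
    """Risk ratio with an approximate 95% confidence interval (log method)."""
    rr = (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)
    # Large-sample standard error of log(RR).
    se_log_rr = math.sqrt(1 / exposed_cases - 1 / exposed_total
                          + 1 / unexposed_cases - 1 / unexposed_total)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Two hypothetical studies with identical risks but a tenfold difference in size.
small_study = risk_ratio_with_ci(30, 1_000, 10, 1_000)
large_study = risk_ratio_with_ci(300, 10_000, 100, 10_000)

for label, (rr, lower, upper) in (("Small study", small_study),
                                  ("Large study", large_study)):
    print(f"{label}: RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The interval narrows as the study grows, but it describes random error only. It says nothing about selection bias, confounding, or misclassification, which is consistent with treating random error after those topics rather than before them.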
There are several increasingly popular approaches to the integration of
information from multiple studies, such as meta-analysis [defined as “a statistical
synthesis of the data from separate but similar studies” (Last, 2001)] and pooling
(combining data from multiple studies for reanalysis). These methods are discussed
in Chapter 11. Chapter 12 deals with the integration and summary of information
gained from the approaches covered in the previous chapters.
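As a rough illustration of what a statistical synthesis can look like, the sketch below combines three hypothetical study results with fixed-effect, inverse-variance weighting on the log scale. This is one widely used technique chosen for the example; the text does not say which methods Chapter 11 covers, and the study estimates and the function name are invented.

```python
import math

Z_95 = 1.96


def inverse_variance_pool(studies):
    """Fixed-effect pooled ratio from (estimate, lower95, upper95) tuples."""
    total_weight = 0.0
    weighted_sum = 0.0
    for estimate, lower, upper in studies:
        # Recover the standard error of the log ratio from the 95% interval.
        se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
        weight = 1 / se ** 2
        total_weight += weight
        weighted_sum += weight * math.log(estimate)
    pooled_log = weighted_sum / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    return (math.exp(pooled_log),
            math.exp(pooled_log - Z_95 * pooled_se),
            math.exp(pooled_log + Z_95 * pooled_se))


# Hypothetical relative risks with 95% confidence limits from three studies.
studies = [(1.5, 1.0, 2.25), (2.0, 1.2, 3.33), (1.2, 0.9, 1.6)]

pooled, pooled_lower, pooled_upper = inverse_variance_pool(studies)
print(f"Pooled estimate: {pooled:.2f} (95% CI {pooled_lower:.2f} to {pooled_upper:.2f})")
```

The pooled interval is narrower than that of any single study, but the calculation treats the studies as estimating one common effect and addresses random error only. Whether the studies are similar enough to combine is a separate judgment that the arithmetic cannot make.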
REFERENCES
TABLE 2–1. Levels of Inference from Epidemiologic Evidence and Attendant Concerns
INFERENCE REQUIREMENTS
as the ultimate goal for epidemiologic inquiry, it would be a mistake to use the
criterion of “improvement in public health” as the only test of whether the re-
search effort (from epidemiology or any other scientific discipline) has been suc-
cessful. Highly informative epidemiologic studies might exonerate agents sus-
pected of doing harm rather than implicate agents that cause disease. Excellent
research may show lack of benefit from an ostensibly promising therapeutic meas-
ure. Such research unquestionably informs public health practice and should ul-
timately improve health by directing our energies elsewhere, but the path from
information to benefit is not always direct.
Even where epidemiologic studies produce evidence of harm or benefit from
such agents as environmental pollutants, dietary constituents, or medications, the
link to action is an indirect one. The validity of the evidence is a necessary but
not sufficient goal for influencing decisions and policy; societal concerns based
on economics and politics, outside the scope of epidemiology, can and some-
times should override even definitive epidemiologic evidence. The goal of pub-
lic health, not epidemiology, is disease prevention (Savitz et al., 1999). Market-
ing of the discipline may benefit from claims that epidemiology prevents disease,
but many past public health successes have had little to do with epidemiology
and many past (and present) failures are not the result of inadequate epidemio-
logic research. Epidemiology constitutes only one very important component of
the knowledge base for public health practice; pertinent data are also contributed
by the basic biomedical sciences, sociology, economics, and anthropology, among
other disciplines. The metaphor of the community as patient is much more suit-
able to public health practice than to the scientific discipline of epidemiology.
Like clinical medicine, public health practice draws on many scientific disci-
plines and nonscientific considerations.
At the other extreme, the goals of epidemiologic research ought not to be con-
strained as so modest and technical in nature that even our successes have no
practical value. We could define the goal of epidemiology as the mechanical
process of gathering and analyzing data and generating statistical results, such
as odds ratios or regression coefficients, divorced from potential inferences and
applications. Theoretical and logistical challenges disappear one by one as the
benchmark is lowered successively. If a study’s intent is defined as assessment
of the association between the boxes checked on a questionnaire and the read-
ing on the dial of a machine for those individuals who are willing to provide the
information, then success can be guaranteed. We can undoubtedly locate pencils,
get some people to check boxes, find a machine that will give a reading, and cal-
culate measures of association. Focusing on the mechanical process of the re-
search is conservative and modest, traits valued by scientists, and averts the crit-
icism that comes when we attempt to make broader inferences from the data.
While in no way denigrating the importance of study execution (sharp pencils
may actually help to reduce errors in coding and data entry!), these mechanical
components are only a means to the more interesting and challenging end of ex-
tending knowledge that has the potential for biomedical and societal benefit.
Beyond the purely mechanical goal of conducting epidemiologic research and
generating data, expectations for epidemiology are sometimes couched in terms
of “measuring associations” or “producing leads,” with the suggestion that
scientific knowledge ultimately requires corroborative research in the basic or
clinical sciences. The implication is that our research methods are so hopelessly
flawed that even at its very best, epidemiology yields only promising leads or
hints at truth. In one sense, this view simultaneously undervalues epidemiology
and overvalues the other disciplines. Epidemiologic evidence, like that from all
scientific disciplines, is subject to error and misinterpretation. Because of com-
pensatory strengths and weaknesses, integrating epidemiologic evidence with that
produced by other disciplines is vital to drawing broader inferences regarding
causes and prevention of disease. Epidemiologists, however, can and do go well
beyond making agnostic statements about associations (ignoring causality) or
generating hypotheses for other scientists to pursue. Epidemiology produces ev-
idence, like other scientific disciplines, that contributes to causal inferences about
the etiology and prevention of disease in human populations.
For the purposes of this book, I define the goal for epidemiologic research as
the quantification of the causal relation between exposure and disease. Although
the research itself generates only statistical estimates of association, the standard
by which the validity of those measures of association is to be judged is their
ability to approximate the causal relation of interest. The utility of those esti-
mated associations in advancing science and ultimately public health generally
depends on the extent to which they provide meaningful information on the
underlying causal relations. As discussed below, the term exposure is really a
important to keep in mind that the operational measures are not the entities of
interest themselves (e.g., deposition of graphite on a form is not dietary intake,
a peak on a mass spectrometer printout is not DDT exposure), but serve as in-
direct indicators of broader, often more abstract, constructs.
A key issue in evaluating epidemiologic evidence is how effectively the op-
erational definitions approximate the constructs of ultimate interest. The meth-
ods of measurement of exposure and disease are critical components of epi-
demiologic study design. The concept of misclassification applies to all the
sources of error between the operational measure and the constructs of interest.
The most obvious and easily handled sources of misclassification are clerical er-
ror or faulty instrumentation, whereas failure to properly define the relevant con-
structs, failure to elicit the necessary data to reflect those constructs, and as-
sessment of exposure or disease in the wrong time period illustrate the more
subtle and often more important sources of misclassification. We would like to
measure causal relations between what is often an abstract construct of exposure
and disease, but the study yields a measure of association between an operational
measure of exposure and disease. The nature and magnitude of the disparity between
what we would like and what we have achieved call for careful scrutiny. Chapters 8
and 9 focus on examination of that gulf between the construct and the operational
measures of exposure and disease, respectively.
error. Such inferences require consideration of whether the desired constructs have
been captured accurately, more challenging for etiologic than descriptive purposes,
and whether confounding is present, a concern unique to causal inference.
A specific example illustrates the range of potential questions that can be applied
to epidemiologic data (Table 2–1) and how a given study may answer some questions
effectively and others rather poorly. Assume that the exposure under study is
participation in a regular mammography screening program and the disease of
interest is fatal breast cancer. Such a study has potential relevance to many
questions.
1. What is the mortality rate among women who participated in the mam-
mography screening program?
Answering this question requires, at a minimum, accurate data on participation
and mortality. Loss to follow-up can interfere with the accurate description of
the experience of women enrolled in the program, and accurate measurement of
breast cancer mortality is required for the mortality rate to be correct. Note
that accurate estimation of mortality does not require consideration of
confounding, information on breast cancer risk factors, or concern with
self-selection for participation in the program. The goal is largely a
descriptive one, accurately estimating a rate.
2. Is breast cancer mortality different in women who participated in the
mammography screening program than in women who did not participate?
Beyond the requirement of accurate measurement of participation and
mortality is the need to compare participants to nonparticipants. Note that
the question as stated does not raise questions of causality, but only makes
a comparison of rates. Even if we try to restrain our desire to make infer-
ences about the causal effect of participation, questions arise regarding a
suitable comparison group, and the desire to make broader inferences be-
comes increasingly difficult to escape. Women who did not participate
could, under this general statement of the goal, be any women who did not
do so, unrestricted by age, geography, breast cancer risk factors, etc. It is
rare to make comparisons without some degree of interest in measuring a
causal role for the attribute that distinguishes the groups. Claims of agnos-
ticism should be scrutinized carefully—are the investigators just trying to
forestall criticism by pretending to be uninterested in a causal inference?
Wouldn’t they really like to know what the breast cancer mortality rate
among participants in the screening program would have been if they had
not participated, and isn’t that the inference they will make based on the
nonparticipants?
3. Has participation in the mammography screening program caused a re-
duction in breast cancer mortality among those who participated in the
program?
This question directly tackles causality and thus encounters a new series
of methodologic concerns and questions. In the counterfactual conceptual-
ization of causality (Greenland & Robins, 1986), the goals for this com-
parison group are much more explicit than in Question 2. Now, we would
like to identify a group of women who reflect the breast cancer mortality
rate that women who participated in the mammography screening program
would have had if they had not in fact participated. (The comparison is
counterfactual in that the participants, by definition, did participate; we wish
to estimate the risk for those exact women had they not done so.) We can
operationalize the selection of comparison women in a variety of imperfect
ways but the conceptual goal is clear. Now, issues of self-selection, base-
line risk factors for breast cancer mortality in the two groups, and con-
founding must be considered, all of which threaten the validity of causal
inference.
4. Do mammography screening programs result in reduced mortality from
breast cancer?
This question moves beyond causal inference for the study population of
women who participated in the screening program, and now seeks a more
universal answer for other, larger populations. Even if the answer to Ques-
tion 3 is affirmative, subject to social and biological modifiers, the very
same program may not result in reduced mortality from breast cancer in
other populations. For example, the program would likely be ineffective
among very young women in whom breast cancer is very rare, and it may
be ineffective among women with a history of prior breast cancer where
the recurrence risk may demand a more frequent screening interval. In or-
der to address this question, a series of studies would need to be consid-
ered, examining the reasons for consistent or inconsistent findings and mak-
ing inferences about the universality or specificity of the association.
5. Is breast cancer screening an effective public health strategy for reducing
breast cancer mortality?
Given that we have been able to generate accurate information to answer
the preceding questions, this next level of inference goes well beyond the
epidemiologic data deeply into the realm of public health policy. This global
question, of paramount interest in applying epidemiology to guiding pub-
lic policy, requires examination of such issues as the protocol for this screen-
ing program in relation to other screening programs we might have adopted,
problems in recruiting women for screening programs, financial costs of
Answering the first question with confidence is readily within our grasp, and
taken literally, the second question as well. The ability to make accurate meas-
urements of exposure and disease is sufficient for addressing those narrow ques-
tions, and necessary but not sufficient for addressing the subsequent ones. The
second question concerns a comparison but tries at least to defer any causal
implications of the comparison: are the mortality rates different for participants
than for nonparticipants? The third question is of the kind on which much of this
book focuses, namely making causal inferences within a given study, spilling over
into the fourth question, which is a broader inference about a series of epidemi-
ologic studies. Question 5 goes well beyond epidemiology alone, though epi-
demiologic findings are clearly relevant to the broader policy judgment. The spe-
cific research question under consideration must be kept in focus, evaluating the
quality of evidence for a specific purpose rather than generically. Study designs
and data can only be evaluated based on the application of the evidence they gen-
erate, and the inferences that are to be made.
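The first two questions can be sketched numerically. The fragment below uses invented deaths and person-years for screening participants and nonparticipants; the function name and all figures are assumptions made for illustration. It computes a mortality rate for each group and the two simple descriptive comparisons between them.

```python
def mortality_rate(deaths, person_years, per=100_000):
    """Deaths per `per` person-years of follow-up."""
    return deaths / person_years * per


# Hypothetical follow-up data for the two groups.
participant_rate = mortality_rate(deaths=24, person_years=120_000)
nonparticipant_rate = mortality_rate(deaths=45, person_years=150_000)

# Question 1 is answered by the participant rate alone.
# Question 2 is answered by comparing the two rates.
rate_difference = participant_rate - nonparticipant_rate
rate_ratio = participant_rate / nonparticipant_rate

print(f"Participants:    {participant_rate:.1f} per 100,000 person-years")
print(f"Nonparticipants: {nonparticipant_rate:.1f} per 100,000 person-years")
print(f"Rate difference: {rate_difference:.1f}; rate ratio: {rate_ratio:.2f}")
```

Nothing in this calculation addresses Question 3. The nonparticipant rate stands in for what participants would have experienced without screening only if the groups are comparable, and the arithmetic is silent on self-selection and baseline risk.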
To make accurate assessments that extend beyond the individual women en-
rolled in the study, we need to ask whether those women on whom data are avail-
able provide an accurate representation of that segment of the study population
that is of interest (freedom from selection bias). For example, do the heavy al-
cohol users recruited into the study provide an accurate reflection of the risk of
spontaneous abortion among the larger target population of heavy alcohol users?
Are only the most health-conscious drinkers willing to enroll in such a study,
who engage in a series of other, risk-lowering behaviors, yielding spuriously low
rates of spontaneous abortion among heavy drinkers relative to the target popu-
lation of heavy drinkers? Selection among those in the exposure stratum that dis-
torts the rate of spontaneous abortion limits any inferences about the effects of
that exposure. Similarly, losses to follow-up may occur in a non-random man-
ner, for example, if those who choose to have elective abortions would have dif-
fered in their risk of spontaneous abortion had they allowed their pregnancies to
continue, and decisions regarding elective abortion may well differ in relation to
alcohol use. These losses may distort the measured rate of spontaneous abortion
within a stratum of alcohol use and introduce bias into the evaluation of a po-
tential causal effect of alcohol.
Beyond the potential for misrepresentation of alcohol use, which would dis-
tort the results for any conceivable purpose, choices regarding the index and tim-
ing of alcohol use must be examined. Even an accurate estimate of the rate of
spontaneous abortion in relation to average daily alcohol consumption would not
address consumption at specific time intervals around conception, the effects of
binge drinking, and the impact of specific beverage types, for example. The in-
ference might be perfectly accurate with respect to one index yet quite inaccu-
rate in fully characterizing the effect of alcohol consumption on risk of sponta-
neous abortion. If we were only interested in whether alcohol drinkers serve as
a suitable population for intensified medical surveillance for pregnancy loss dur-
ing prenatal care, for example, average daily alcohol intake might be quite ade-
quate even if it is not the index most relevant to etiology. Misclassification and
information bias are assessed relative to the etiologic hypothesis regarding an ef-
fect of alcohol on spontaneous abortion.
Technically, the study can only examine the statistical association between
verbal or written response to questions on alcohol use and hormonal, clinical, or
self-reported information pertinent to identifying spontaneous abortion. A truly
agnostic interpretation of such measures is almost unattainable, given that the
very act of making the comparisons suggests some desire to draw inferences
about the groups being compared. To make such inferences correctly about
whether alcohol causes spontaneous abortion, epidemiologic studies must be
scrutinized to assess the relationship between the operational definitions of ex-
posure and disease and the more abstract entities of interest. Criticism cannot be
avoided by claiming that a study of questionnaire responses and pathology records
has been conducted. Rather, the effectiveness of the operational measures as
proxies for the constructs of interest is a critical component of interpreting
epidemiologic evidence.
Our ultimate interest in many instances is in whether alterations in alcohol
consumption would result in alterations in risk of spontaneous abortion. That is,
would women who are high alcohol consumers during pregnancy assume the risk
of low alcohol consumers if they had reduced their alcohol consumption? To ad-
dress this question of causality, we would like to compare the risk among high
alcohol consumers to what their risk would have been had they been otherwise
identical but nondrinkers or low consumers of alcohol (Greenland & Robins,
1986). Our best estimate of that comparison is derived from the actual non-
alcohol consumers with some additional statistical adjustments to take into ac-
count factors thought to be related to spontaneous abortion that distinguish the
groups. Beyond all the measurement and selection issues within groups of dif-
fering alcohol use, we now have to ask about the comparability of the groups to
one another. Confounding is a concern that only arises when we wish to go be-
yond a description of the data and make hypothetical inferences about what would
happen if the exposed had not been exposed. The burden on the data rises in-
crementally as the goal progresses from pure description to causal inference.
CAUSAL INFERENCE
Those who conduct the research need not even provide a comprehensive or
definitive interpretation of their own data, an elusive goal as well. A perfectly
complete, unbiased, accurate representation of the study results is also unattain-
able. The principal goal of the investigators should be to reveal as much as pos-
sible about their study methods and results in order to help themselves and oth-
ers make appropriate use of their research findings. Given that the challenges to
causal inference can often be anticipated, data that bear on the threats to such in-
ference are needed to interpret the results. Many of the strategies suggested in
this book require anticipation of such a challenge at the time of study design and
execution. If a key confounder is known to be present, e.g., tobacco use in as-
sessing alcohol effects on bladder cancer, detailed cross-tabulations may be de-
sirable to help assess whether confounding has been successfully eliminated. If
the assessment of a construct poses special challenges, e.g., measurement of work-
place stress, then the instrument needs to be described in great detail along with
relevant data that bear on its validity. Ideally, this complete description of meth-
ods and results is conducted with the goal of sharing as much of the information
as possible that will assist in the interpretation by the investigators and others.
The formal discussion of those results in the published version of the study pro-
vides an evaluation of what the study means to the authors of the report. Al-
though the investigators have the first opportunity to make such an evaluation,
it would not be surprising if others among the thousands of reviewers of the pub-
lished evidence have more helpful insights or can bring greater objectivity to bear
in making the assessment. Rather than stacking the deck in providing results to
ensure that the only possible inferences are concordant with those of the origi-
nal researchers, those who can provide enough raw material for readers to make
different inferences should be commended for their full disclosure rather than
criticized for having produced findings that are subject to varying interpretations.
This pertains to revealing flaws in the study as well as fully elucidating the pat-
tern of findings. In the conventional structure of publication, the Methods and
Results are the basis for evaluation and inference; the Discussion is just a point
of view that the investigators happen to hold.
We sometimes have the opportunity to directly evaluate the role of potential
biases, for example, assessing whether a given measure of association has been
distorted by a specific confounding factor. Generating an adjusted measure of as-
sociation tells us whether potential confounding has actually occurred, and also
provides an estimate of what the association would be if confounding were ab-
sent. Note that the exercise of calculating a confounder-adjusted result is also in-
herently speculative and subject to error, for example, if the confounder is poorly
measured. An example of a concern that is typically less amenable to direct eval-
uation is the potential for selection bias due to non-response, usually evaluated
by comparing participants to nonparticipants. The hypothesis that the association
has been distorted by confounding or non-response is evaluated by generating
relevant data to guide those who review the evidence, including the authors, in
making an assessment of the likelihood and extent of such distortion.
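As a rough illustration of how an adjusted measure can differ from a crude one, the sketch below computes a crude risk ratio and a Mantel-Haenszel summary risk ratio across two strata of a single confounder. The counts, the function names, and the choice of the Mantel-Haenszel estimator are assumptions made for illustration; none of them comes from a study discussed here.

```python
def crude_risk_ratio(strata):
    """Risk ratio with all strata collapsed (confounder ignored)."""
    exp_cases = sum(s["exp_cases"] for s in strata)
    exp_total = sum(s["exp_total"] for s in strata)
    unexp_cases = sum(s["unexp_cases"] for s in strata)
    unexp_total = sum(s["unexp_total"] for s in strata)
    return (exp_cases / exp_total) / (unexp_cases / unexp_total)


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel summary risk ratio across confounder strata."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        total = s["exp_total"] + s["unexp_total"]
        numerator += s["exp_cases"] * s["unexp_total"] / total
        denominator += s["unexp_cases"] * s["exp_total"] / total
    return numerator / denominator


# Hypothetical data: within each stratum the risk is identical for exposed
# and unexposed persons, but exposure is far more common in the high-risk
# stratum, so the crude comparison is confounded.
strata = [
    {"exp_cases": 40, "exp_total": 100, "unexp_cases": 8, "unexp_total": 20},
    {"exp_cases": 2, "exp_total": 20, "unexp_cases": 10, "unexp_total": 100},
]

crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"crude RR = {crude:.2f}, adjusted RR = {adjusted:.2f}")
```

In this constructed example the crude risk ratio is well above 1.0 while the stratified estimate is null, so the entire crude association is attributable to the confounder. If the confounder were poorly measured, the strata themselves would be wrong and the adjusted value would inherit that error.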
In an ideal world, causal inference would be the end product of systematic
evaluation of alternative explanations for the data, with an unambiguous con-
clusion regarding the extent to which the measured association accurately reflects
the magnitude of the causal relationship. In practice, a series of uncertainties
precludes doing so with great confidence. The list of alternative explanations is
limited only by the imagination of critics, with insightful future reviewers always
having the potential to change the status of the evidence. Judgment of whether
a particular alternative explanation has truly been eliminated (or confirmed) is
itself subjective. Hypotheses of bias may be more directly testable than the hy-
pothesis of causality, but they remain challenging to definitively prove or dis-
prove. The culmination of the examination of individual contributors to bias is
a judgment of how plausible or strong the distortion is likely to be and how con-
fidently such an assertion can be made rather than a simple dichotomy of pres-
ent or absent. Thus, the answer to the ultimate question of whether the reported
association correctly measures the etiologic relationship will always be “maybe,”
with the goal of making an accurate assessment of where the evidence fits within
the wide spectrum that extends from the unattainable benchmarks of “yes”
or “no.”
The array of potential biases that limit certainty regarding whether an etiologic
association (or its absence) has been measured accurately is valuable in specify-
ing the frontier for advancement of research. If the major concerns remaining af-
ter the most recent study or series of studies can be clearly articulated, the agenda
for refinement in the next round of studies has been properly defined. If the some-
what mundane but real problem were one of small study size and imprecision,
then a larger study with the strengths of the previous ones would be suggested.
If uncertainty regarding the role of a key confounder contributes importantly to
a lack of resolution, then identifying a population free of such confounding (by
randomization or identifying favorable circumstances in which the exposure and
confounder are uncoupled) or accurate measurement and control of the con-
founder may be needed. Precisely the same process that is needed to judge the
strength of evidence yields insights into key features needed for subsequent stud-
ies to make progress.
The most challenging and intellectually stimulating aspect of interpreting epi-
demiologic evidence is in the process of assessing causality. A wide array of
methodological concerns must be considered, integrating the data from the study
or studies of interest with relevant substantive knowledge and theoretical princi-
ples. We are rarely able to fully dispose of threats to validity or find fatal flaws
that negate the evidence, leaving a list of potential biases falling somewhere along
the continuum of possibility. With this array of such considerations in mind, a
balanced, explicit judgment must be made. On the one hand, the need to make
There are two possible solutions to this dilemma: (1) The optimal, infeasible
solution is to ensure that published epidemiologic evidence is valid. (2) The al-
ternative is to continue to put forward fallible observations, debate their merits,
and seek a systematic, objective appraisal of the value of the information. The
remaining chapters of this book are devoted to that goal of organizing the scrutiny
and interpretation of epidemiologic evidence.
The importance of epidemiologic evidence in decision-making at a societal
and personal level is generally recognized, sometimes excessively so, but the
unique strengths of epidemiology in that regard are worth reiterating. Study of
the species of interest, humans, in its natural environment with all the associated
biological and behavioral diversity markedly reduces the need for extrapolation
relative to many experimental approaches with laboratory animals or cell cul-
tures. It has been suggested that experimental approaches to understanding hu-
man health obtain precise answers to the wrong questions whereas epidemiology
obtains imprecise answers to the right questions. Just as those who design ex-
periments seek to make the inferences as relevant as possible to the ultimate ap-
plications in public health and clinical medicine, epidemiologists must strive to
make their information as valid as possible, not losing the inherent strength of
studying free-living human populations.
REFERENCES
Davidoff F, DeAngelis CD, Drazen JM, Hoey J, Højgaard L, Horton R, Kotzin S, Nicholls
MG, Nylenna M, Overbeke AJPM, Sox HC, Van der Weyden MB, Wilkes MS. Spon-
sorship, authorship, and accountability. N Engl J Med 2001;345:825–827.
Greenland S. Probability versus Popper: An elaboration of the insufficiency of current
Popperian approaches for epidemiologic analysis. In Rothman KJ (ed), Causal infer-
ence. Chestnut Hill, MA: Epidemiology Resources Inc., 1988:95–104.
Greenland S, Robins JM. Identifiability, exchangeability, and epidemiologic confound-
ing. Int J Epidemiol 1986;15:413–419.
Hill AB. The environment and disease: association or causation? Proc Roy Soc Med
1965;58:295–300.
Lanes SF. The logic of causal inference. In Rothman KJ (ed), Causal inference. Chestnut
Hill, MA: Epidemiology Resources Inc., 1988:59–75.
Rothman KJ. Modern Epidemiology. Boston: Little, Brown and Co., 1986.
Rothman KJ, Poole C. Causation and causal inference. In D Schottenfeld, JF Fraumeni
Jr (eds), Cancer Epidemiology and Prevention, Second Edition. New York: Oxford
University Press, 1996:3–10.
Savitz DA, Andrews KW, Brinton LA. Occupation and cervical cancer. J Occup Envir
Med 1995;37:357–361.
Savitz DA, Olshan AF. Multiple comparisons and related issues in the interpretation of
epidemiologic data. Am J Epidemiol 1995;142:904–908.
Savitz DA, Poole C, Miller WC. Reassessing the role of epidemiology in public health.
Am J Pub Health 1999;89:1158–1161.
Validity of epidemiologic results refers broadly to the degree to which the in-
ferences drawn from a study are warranted (Last, 2001). The central goal in eval-
uating epidemiologic evidence is to accurately define the sources of uncertainty
and the probability of errors of varying magnitude affecting the results. Validity
cannot be established by affirmatively demonstrating its presence but rather by
systematically considering and eliminating, or more often reducing, the sources
of bias that detract from validity. The goal of this scrutiny is to quantify the un-
certainty resulting from potential biases, considering the probability that the dif-
ferent sources of potential bias have introduced varying magnitudes of distor-
tion. Part of this assessment can be made on theoretical grounds, but whenever
possible, pertinent data should be sought both inside and outside the study to as-
sess the likely magnitude of error. Sometimes the information needed to evalu-
ate a potential source of error is readily available, but in other instances research
has to be undertaken to determine to what extent the hypothesized source of er-
ror actually may have affected the study results. In fact, an important feature of
the data collection and data analysis effort should be to generate the information
needed to fairly and fully assess the validity of the results. In principle, with all
relevant data in hand from the set of pertinent studies, a comprehensive evalua-
tion of sources of uncertainty would yield a clear and accurate inference regard-
ing the present state of knowledge and identify specific methodologic issues that
1. Conclusions drawn from the results of epidemiologic studies are more likely
to be valid if the evaluation is truly comprehensive, enumerating and care-
fully considering all potentially important sources of bias. Although this
thorough examination may make the ultimate inferences more, not less,
equivocal, the conclusions will be more accurately linked to the evidence.
Drawing Inferences from Epidemiologic Evidence 31
Even (especially?) experts have preconceptions and blind spots, and may
well be prone to evaluating evidence based on an initial, subjective overview
and then maintaining consistency with their initial gut impressions. For ex-
ample, a study done by talented investigators at a prominent institution and
published in a prestigious journal may convey an initial image of certain
validity, but none of those characteristics provide assurance of accuracy nor
does their absence provide evidence that the results are in error. System-
atic scrutiny helps to ensure that important limitations are not overlooked
and ostensibly important limitations are examined to determine whether
they really are likely to have had a substantial impact on the study results.
2. Intellectual understanding of the phenomenon of interest and methodologic
issues is enhanced by a detailed, evidence-based examination. Even if ex-
perts were capable of taking unexplained shortcuts to reach an accurate as-
sessment of the state of knowledge (epidemiologic intuition), without un-
derstanding of the process by which the judgment was reached, the rest of
us would be deprived of the opportunity to develop those skills. Further-
more, the field of epidemiology makes progress based on the experience of
new applications, so that the scholarly foundations of the discipline are ad-
vanced only by methodically working through these steps over and over
again. Reaching the right conclusion about the meaning and certainty of
the evidence is of paramount importance, but it is also vital to understand
why it is correct and to elucidate the principles that should be applied to
other such issues that will inevitably arise in the future.
3. Research needs are revealed by describing specific deficiencies and uncer-
tainties in previous studies that can be remedied. Bottom line conclusions
reveal little about what should be done next—What constructive action fol-
lows from a global assessment that the evidence is weak or strong? By ex-
plaining the reasoning used to draw conclusions, a natural by-product is a
menu of candidate refinements for new studies. Quantifying the probabil-
ity and impact of sources of potential error helps to establish priorities for
research. The most plausible sources of bias that are the most capable of
producing substantial error are precisely the issues that need to be tackled
with the highest priority in future studies, whatever the current state of
knowledge. In practice, it is often only a few methodological issues that
predominate to limit the conclusiveness of a study or set of studies, but this
becomes clear only through systematic review and evaluation.
4. Reasons for disagreement among evaluators will only be revealed by defin-
ing the basis for their judgments. Multiple experts often examine the same
body of evidence and come to radically different conclusions, puzzling other
scholars and the public at large. If those who held opposing views would
articulate the component steps in their evaluations that generated their sum-
mary views, and characterize their interpretations of the evidence bearing
The need to strive for impartiality in the evaluation of evidence must be stressed,
partly because there are strong forces encouraging subjectivity. Among the most
vital, exciting aspects of epidemiology are its value in understanding how the
world we live in operates to affect health and the applicability of epidemiologic
evidence to policy. Epidemiologic research bears on the foods we eat, the med-
ications we take, our physical activity levels, and the most intimate aspects of
our sexual behavior, emotional ties, and whether there are health benefits to hav-
ing pets. Putting aside the scholarly arguments made in this book, I am sure every
reader “knows” something about what is beneficial and harmful, and it is diffi-
cult to overcome such insights with scientific evidence. (I don’t need epidemio-
logic research to convince me that there are profound health benefits from own-
ing pet dogs, and I am equally certain that pet cats are lacking in such value.)
Epidemiologic evidence bearing on such issues is not interpreted in a vacuum,
but rather intrudes on deeply held preconceptions based on our cultures, reli-
gions, and lifestyles. Judgments about epidemiologic evidence pertinent to pol-
icy inevitably collide with our political philosophy and social values. In fact, sus-
picion regarding the objectivity of the interpretation of evidence should arise
when researchers generate findings that are consistently seen as supporting
strongly held ideology.
Beyond the more global context for epidemiologic evidence, other challenges
to impartiality arise in the professional workplace. On a personal level, we may
not always welcome criticism of the quality of our own work or that of valued
colleagues and friends, or be quite as willing as we should be to accept the ex-
cellent work done by those we dislike. The ultimate revelation of an ad hominem
assessment of evidence lies in the statement that “I didn’t believe it until we saw
it in our own data.” Such self-esteem may have great psychological value but is
worrisome with regard to objectivity. Epidemiologists may also be motivated to
protect the prestige of the discipline, which can encourage us to overstate or un-
derstate the conclusiveness of a given research product. We may be tempted to
close ranks and defend our department or research team in the face of criticism,
especially from outsiders. Such behavior is admirable in many ways, but counter
to scientific neutrality.
Perhaps most significant for epidemiologists, who are often drawn to the field
by their strong conviction to promote public health agendas, is the temptation to
promote those public health agendas in part through their interpretations of sci-
entific evidence (Savitz et al., 1999). The often influential, practical implications
of epidemiology, the greatest strength of the discipline, can also be its greatest
pitfall to the extent that it detracts from dispassionate evaluation. The implica-
tions of the findings (quite separate from the scientific merits of the research it-
self) create incentives to reach a particular conclusion or at least to lean one way
or another in the face of true ambiguity. The greatest service epidemiologists can
provide those who must make policy decisions or just decide how to live their
lives is to offer an objective evaluation of the state of knowledge and let
the many other pertinent factors that bear on such decisions be distilled by the
policy maker or individual in the community, without being predigested by the
epidemiologist.
For example, advocates of restrictions on exposure to environmental tobacco
smoke may be inclined to interpret the evidence linking such exposures to lung
cancer as strong whereas the same evidence, in the hands of those who oppose such
restrictions, is viewed as weak. A recent review of funding sources and conclu-
sions in overviews of the epidemiologic evidence on this topic finds, not sur-
prisingly, that tobacco industry sponsorship is associated with a more skeptical
point of view (Barnes & Bero, 1998). Whereas judgment of the epidemiologic
evidence is (or should be) a matter for science, a position on the policy of re-
stricting public smoking is, by definition, in the realm of advocacy—public pol-
icy decisions require taking sides. However, the goal of establishing sound pub-
lic policy that advances public health is not well served by distorting the
epidemiologic evidence.
Fallible epidemiologic evidence on the health effects of environmental tobacco
smoke may well be combined with other lines of evidence and principles to jus-
tify restricted public smoking. Believing that public smoking should be curtailed
is a perfectly reasonable policy position but should not be used to retrofit the epi-
demiologic evidence linking environmental tobacco smoke to adverse health ef-
fects and exaggerate its strength. Similarly, strongly held views about individual
liberties may legitimately outweigh epidemiologic evidence supporting adverse
health effects of environmental tobacco smoke in some settings, and there is no
need to distort the epidemiologic evidence to justify such a policy position. As
discussed in Chapter 2, epidemiologic evidence is only one among many sources
of information to consider, so that limited epidemiologic evidence or even an ab-
sence of epidemiologic evidence does not preclude support for a policy of such
restriction nor does strong epidemiologic evidence dictate that such a policy must
The starting point for evaluating the validity of results is to calculate and pres-
ent estimates of the effect measure or measures of primary interest. This esti-
mate might be disease prevalence, a risk ratio or risk difference, or a quantita-
tive estimate of the dose-response function relating exposure to a health outcome.
In order to consider the extent to which the study has successfully measured what
it sought to measure, the key outcomes must be isolated for the most intense
scrutiny. The question of validity is operationalized by asking the degree to which
the estimated measure of effect is accurately representing what it purports to
measure.
The measure of interest is quantitative, not qualitative. Thus, the object of eval-
uation is not a statement of a conclusion, e.g., exposure does or does not cause
disease, or an association is or is not present. Instead, the product of the study
is a measurement of effect and quantification of the uncertainty in that estimate,
e.g., we estimate that the risk of disease is 2.2 times greater among exposed than
unexposed persons (with a 95% confidence interval of 1.3 to 3.7), or for each
unit change in exposure, the risk of disease rises by 5 cases per 1000 persons
per year (95% confidence interval of 1.2 to 8.8). Statement of the result in quan-
titative terms correctly presents the study as an effort to produce an accurate
measurement rather than to create the impression that studies generate dichoto-
mous results, e.g., the presence or absence of an association (Rothman, 1986).
The simplification into a dichotomous result, based either on statistical tests or
some arbitrary, subjective judgments about what magnitude of association is real
or important, hinders the goal of quantitative, objective evaluation.
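A minimal sketch of reporting a study product as an estimate with its uncertainty rather than as a verdict: the function below computes a risk ratio and an approximate 95% confidence interval from cohort counts. The counts are hypothetical, and the interval uses the common large-sample (Wald) approximation on the log scale, which is an assumption of this sketch rather than a method prescribed in the text.

```python
import math


def risk_ratio_with_ci(exp_cases, exp_total, unexp_cases, unexp_total, z=1.96):
    """Risk ratio and Wald-type confidence interval built on the log scale."""
    rr = (exp_cases / exp_total) / (unexp_cases / unexp_total)
    # Large-sample standard error of log(RR) for cumulative incidence data.
    se_log_rr = math.sqrt(
        1 / exp_cases - 1 / exp_total + 1 / unexp_cases - 1 / unexp_total
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical cohort: 30 of 100 exposed and 15 of 100 unexposed develop disease.
rr, lower, upper = risk_ratio_with_ci(30, 100, 15, 100)
print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The output is a measurement with a range, which can be compared with any value of interest; nothing in it declares an association present or absent.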
The alternative approach, driven by conventional frequentist statistical con-
cepts, is to focus on the benchmark of the null hypothesis, motivated perhaps by
a desire for neutral, restrained interpretation of evidence. In this framework, study
results are viewed solely to determine whether the data are sufficiently improb-
able under the null hypothesis to lead to rejection of the null hypothesis or a fail-
ure to reject the null hypothesis. For studies that generate ratio measures of ef-
fect, this is equivalent to asking whether we reject or fail to reject the null
hypothesis that the relative risk is 1.0. Rejecting the null hypothesis implies that
the relative risk takes on some other value but tells us no more than that. The
null value is just one hypothetical true value among many with which the data
can be contrasted, not the only or necessarily the most important one. The mag-
nitude of uncertainty in an estimated relative risk of 1.0 is not conceptually dif-
ferent than the uncertainty in estimates of 0.5 or 5.0. Also, focusing on the meas-
ure as the study product avoids the inaccurate impression that successful studies
yield large measures of effect and unsuccessful studies do not. Successful stud-
ies yield accurate measures of effect, as close as possible to the truth with less
uncertainty than unsuccessful ones.
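To make concrete the point that the null value is only one of many values with which the data can be contrasted, the sketch below takes the illustrative result quoted earlier (a relative risk of 2.2 with a 95% confidence interval of 1.3 to 3.7) and computes an approximate two-sided p-value against several hypothesized true values. Recovering the standard error from the published limits and using a normal approximation on the log scale are assumptions of this sketch, as are the function names.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Approximate standard error of log(RR) recovered from 95% limits."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def two_sided_p(estimate, se_log, hypothesized):
    """Normal-approximation p-value for any hypothesized relative risk."""
    z_score = (math.log(estimate) - math.log(hypothesized)) / se_log
    return math.erfc(abs(z_score) / math.sqrt(2))


se = se_from_ci(1.3, 3.7)
p_values = {rr0: two_sided_p(2.2, se, rr0) for rr0 in (0.5, 1.0, 1.5, 2.2, 3.0, 5.0)}
for rr0, p in p_values.items():
    print(f"hypothesized RR = {rr0}: p = {p:.3f}")
```

Under these assumptions the data are about as incompatible with a true relative risk of 5.0 as with the null value of 1.0, which a test of the null hypothesis alone would never reveal.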
The measure of interest is determined by the substantive study question, pre-
sented in common language rather than statistical jargon. For example, a study
may suggest that persons who drink 4 or more cups of coffee per day have 1.5
times the risk of a myocardial infarction compared to persons who drink fewer
than 4 cups of coffee per day, with a 95% confidence interval of 0.8 to 2.6. Study
products are not expressed in terms of well-fitting models or regression coeffi-
cients, nor should the results be distilled into test statistics such as t-statistics or
viously it is no longer a source of bias. On the other hand, attempts to make such
corrections can be ineffective or, at worst, harmful.
Data collection efforts often yield numerous analyses and contribute substan-
tively to many different lines of research (Savitz & Olshan, 1995b). For the pur-
poses of examining and discussing validity of measurement, each analysis and
each key measure must be considered separately. While there are techniques for
refining individual estimates based on an array of results that address random er-
ror (Greenland & Robins, 1991), for evaluating systematic biases, the focus is
not the study or the data set but rather the result, since a given study may well
yield accurate results for one question but erroneous results for others. Features
of the study uncovered through one analysis may bear positively or negatively
on the validity of other analyses using the same data set, in that the same sources
of bias can affect multiple estimates. However, the question of whether the study
product is accurate must be asked for each such product of the study.
Some confusion can arise in discussing accurate measurement of individual
variables versus accurate measures of association. If the study is designed to
measure disease prevalence, for example, the study product and object of scrutiny
is the prevalence measure. We ask about sources of distortion in the observed
relative to the unknown true value of disease prevalence. When the focus is on
measures of association, the measure of association is the key study product. Er-
rors in measurement of the pertinent exposure, disease, or confounders all may
produce distortion of the measure of association, but the focus is not on the meas-
urement of individual variables; it is on the estimate of the association itself.
Viewing the study product as a quantitative measure makes the evaluation of the
accuracy of the study equivalent to a quantitative evaluation of the accuracy of
the measure. Just as studies are not good or bad but fall along a continuum, the
study’s findings are not simply correct or incorrect but simultaneously
informative and fallible to varying degrees. This quantitative approach to the ex-
amination of bias is contrasted with an evaluation that treats biases as all-or-none
phenomena. If the product of a study is presented as a dichotomy, e.g., exposure
is/is not associated with disease, then sources of potential error are naturally ex-
amined with respect to whether or not they negate that association: Is the asso-
ciation (or lack of association) due to random error? Is the association due to re-
sponse bias? Is the association due to confounding? The search for error is viewed
as an effort to implicate or exonerate a series of potentially fatal flaws that could
negate the results of the study. Such an approach simplifies the discussion but it
does not make full use of the information available to draw the most appropri-
ate inference from the findings.
effect that is now positioned to account for the distortion resulting from each bias
and has a width that takes into account the possibility that the source of bias
yields varying magnitudes of error. Starting with the conventionally constructed
confidence interval derived solely from considerations of random error, there
might be a shift in the general placement of the interval to account for a given
form of bias being more likely to shift results upward than downward, but there
would also be a widening of the confidence interval, perhaps asymmetrically, de-
pending on how probable it is that varying magnitudes of distortion are present.
Additional sources of bias could, in principle, be brought into consideration, each
one providing a more accurate reflection of the estimate of the measure of ef-
fect, integrating considerations of bias and precision.
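The shifting and widening described above can be sketched with a simple simulation. Starting from the illustrative relative risk of 2.2 (95% confidence interval 1.3 to 3.7) used earlier, each draw combines random error with a bias factor drawn from an assumed distribution, and the percentiles of the corrected values form the integrated interval. The triangular bias distribution (between 1.0 and 2.0, most likely 1.3, representing a bias thought to inflate the estimate), the number of draws, and the function names are all hypothetical choices for illustration.

```python
import math
import random


def percentile(sorted_values, q):
    """Value at quantile q (between 0 and 1) of an already sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


def bias_adjusted_interval(estimate, se_log, n_draws=20000, seed=1):
    """Simulate an interval that reflects random error plus one source of bias."""
    rng = random.Random(seed)
    corrected = []
    for _ in range(n_draws):
        log_rr = rng.gauss(math.log(estimate), se_log)  # random error
        bias_factor = rng.triangular(1.0, 2.0, 1.3)     # assumed upward bias
        corrected.append(math.exp(log_rr - math.log(bias_factor)))
    corrected.sort()
    return tuple(percentile(corrected, q) for q in (0.025, 0.5, 0.975))


# Standard error recovered from the conventional 95% limits of 1.3 and 3.7.
se = (math.log(3.7) - math.log(1.3)) / (2 * 1.96)
low, median, high = bias_adjusted_interval(2.2, se)
print(f"bias-adjusted RR = {median:.2f} ({low:.2f} to {high:.2f})")
```

Because the assumed bias only inflates the estimate, the simulated interval sits lower than the conventional one, and because the size of the bias is itself uncertain, the interval is also wider. Different assumptions about the bias distribution would move and stretch the interval differently, which is the point of stating them explicitly.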
In practice, the ability to quantify sources of bias other than random error poses
a great challenge, but the conceptual benchmark remains useful. This attempt at
quantification reminds us that even in an ideal world, hypothetical biases would
not be proven present or absent, but their possible effect would be quantified,
and the estimated measure of effect would shift as a result. In some instances,
we may have the good fortune of finding that the plausible effects of
the bias are all negligible, enabling us to focus our energies elsewhere. When we
find that the bias is capable of introducing, or even likely to introduce, substantial distortion,
those are the biases that need to be countered in subsequent studies in order to
remove their effect or at least to more accurately account for their impact and
reduce uncertainty. The strategies of the following chapters are intended to help
in estimating the direction and magnitude of distortion resulting from various bi-
ases, focusing wherever possible on the use of empirical evaluation to help bridge
the gap between the ideal quantitative, probabilistic insights and what is often a
largely intuitive, informal characterization of the impact of bias that is commonly
applied at present. Collecting and analyzing additional data, conducting sensi-
tivity analyses, and incorporating information from outside the study of interest
are among the strategies that are proposed to help in this challenging mission.
Examination and critical evaluation of a study result should begin with an enu-
meration of the primary sources of vulnerability to error, either by the authors
as the first ones to see and evaluate the findings, or by the users of such infor-
mation. Although this seems obvious, there may be a temptation to focus on the
sources that are more easily quantified (e.g., nondifferential misclassification) or
to enumerate all conceivable biases, giving equal attention to all. Instead, the first
stage of evaluation, to ensure that the scrutiny is optimally allocated, should be
to identify the few possibilities for introducing large amounts of error. Project-
ing forward to conceptualize that unknown but ideal integrated measure of
are often not as compelling unless there are rather strong risk factors that are likely
to be associated with the exposure of interest. This situation is perhaps less uni-
versal than the problems of non-response and measurement error. In small stud-
ies, where some of the cells of interest contain fewer than 10 subjects, random er-
ror may be the overwhelming concern that limits the strength of study results.
For each study or set of studies under review, the critical issues, which are
generally few in number, should be specified for close scrutiny to avoid super-
ficial treatment of an extensive list of issues that mixes trivial with profound con-
cerns. These critical issues are distinguished by having a sufficiently high prob-
ability of having a quantitatively important influence on the estimated measure
of effect. If such candidates are considered in detail and found not to produce
distortion, the strength of the evidence would be markedly enhanced. These is-
sues are important enough to justify conducting new studies in which the po-
tential bias can be eliminated, sometimes simply requiring larger studies of sim-
ilar quality, or to suggest methodological research that would determine whether
these hypothetical problems have, in fact, distorted past studies.
of non-response more generally would also be of interest but even less directly
applicable. Integration of these diverse threads of information into an overall as-
sessment is a challenge and may well lead to discordant judgments.
A natural by-product of this effort is the identification of gaps in knowledge
that would help describe and quantify probabilities of biases that would distort
the findings in a specific direction of some specified magnitude. That is, in at-
tempting to implement the ambitious strategy of specifying, quantifying, and as-
sessing the probabilities of specific types of bias, limitations in our knowledge
will be revealed. Sometimes those limitations will be in the conceptual under-
standing of the phenomenon that precludes assessment of the potential bias, point-
ing toward need for further methodological work. Questions may arise regard-
ing such issues as the pattern of non-response typically associated with
random-digit dialing, which points toward empirical methodological research to
evaluate this technique in order to produce generalizable information. Often, the
solution must be found within the specific study, pointing toward further analy-
ses or additional data collection. Finally, and perhaps most importantly, the largest
and most likely potential biases in one study or in the set of studies suggest the
refinements required in the next attempt to address the hypothesized causal
relationship.
Even though the ambitious attempt to delineate and quantify biases as pro-
posed above will always fall short of success, the uncertainties revealed by the
effort will be constructive and specific. Instead of being left with such unhelp-
ful conclusions as “the evidence is weak” or “further studies are needed,” we are
more likely to end up with statements such as “the pattern of non-response is not
known with certainty, but if exposed, non-diseased persons are underrepresented
to a sizable extent, the true measure of association could be markedly smaller
than what was measured.” Moving from global, descriptive statements to spe-
cific, quantitative ones provides direction to the original investigators, future
researchers, and to those who must consider the literature as a basis for policy
decisions.
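The kind of specific statement quoted above can be backed by a small calculation. The sketch below applies the standard relation between an observed odds ratio and the participation proportions in the four exposure-by-disease groups; the observed odds ratio of 2.0, the participation proportions, and the function name are hypothetical values chosen for illustration.

```python
def selection_adjusted_odds_ratio(
    observed_or, s_exp_case, s_unexp_case, s_exp_noncase, s_unexp_noncase
):
    """Odds ratio corrected for differential participation.

    Each s_* argument is the assumed proportion of that group in the source
    population that ended up in the study.
    """
    selection_factor = (s_exp_case * s_unexp_noncase) / (
        s_unexp_case * s_exp_noncase
    )
    return observed_or / selection_factor


observed = 2.0

# Equal participation in all four groups leaves the estimate unchanged.
equal = selection_adjusted_odds_ratio(observed, 0.8, 0.8, 0.8, 0.8)

# Exposed, non-diseased persons participate at 50% instead of 80%.
under = selection_adjusted_odds_ratio(observed, 0.8, 0.8, 0.5, 0.8)

print(f"equal participation: OR = {equal:.2f}")
print(f"exposed non-cases underrepresented: OR = {under:.2f}")
```

Repeating the calculation over a range of plausible participation proportions turns "non-response may have biased the results" into a statement about how far, and in which direction, the estimate could move.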
(PCBs), might increase the risk of developing breast cancer. A major motivation
for such inquiry is the experimental evidence of carcinogenicity of these com-
pounds and the postulated effects of such compounds on estrogenic activity in
humans and other species (Davis et al., 1993). Prior to 1993, studies in humans
had generally been small and were based largely on comparisons of normal and
diseased breast tissue rather than on an evaluation of exposure levels in women
with and without breast cancer. Because the report by Wolff et al. (1993) was a
major milestone in the literature and stood essentially in isolation, it provides a
realistic illustration of the interpretive issues surrounding a specific epidemio-
logic study. The fact that a series of subsequent evaluations have been largely
negative (Hunter et al., 1997; Moysich et al., 1998; Millikan et al., 2000) does
not detract from the methodologic issues posed at the time when the initial study
was first published and evaluated.
In order to evaluate the possible association between exposure to persistent
organochloride compounds and breast cancer, Wolff et al. (1993) identified over
14,000 women who had been enrolled in a prospective cohort study between
1985 and 1991 that included collection of blood samples for long-term storage.
From this cohort, all 58 women who developed breast cancer and a sample of
171 controls who remained free of cancer had their sera analyzed for levels of
DDT, DDE, and PCBs. After adjustment for potential confounders (family his-
tory of breast cancer, lifetime history of lactation, and age at first full-term preg-
nancy), relative risks for the five quintiles of DDE were 1.0 (referent), 1.7, 4.4,
2.3, and 3.7. Confidence intervals were rather wide (e.g., for quintile 2, approx-
imately 0.4–6.8 as estimated from the graph, and for quintile 5, 1.0–13.5).
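Because confidence intervals for ratio measures are usually symmetric on the log scale, the quoted limits can be used to back out an approximate standard error and to check that each point estimate lies near the geometric mean of its limits. The sketch below assumes Wald-type 95% intervals, which is an assumption rather than something stated in the source. The helper names are illustrative.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def se_log_ratio(lower, upper, z=Z_95):
    """Approximate standard error of ln(ratio) implied by a Wald-type CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def geometric_midpoint(lower, upper):
    """Point estimate implied by limits that are symmetric on the log scale."""
    return math.sqrt(lower * upper)


# Relative risks and limits quoted in the text for quintiles 2 and 5 of DDE.
quoted = [("quintile 2", 1.7, 0.4, 6.8), ("quintile 5", 3.7, 1.0, 13.5)]

for label, rr, lower, upper in quoted:
    print(label,
          "reported RR:", rr,
          "implied midpoint:", round(geometric_midpoint(lower, upper), 2),
          "SE of ln(RR):", round(se_log_ratio(lower, upper), 2))
```

The implied midpoints fall close to the reported relative risks, and the large standard errors on the log scale put a number on how imprecise the quintile-specific estimates are.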
The focus here is on the critical interpretation of these results in terms of epi-
demiologic methods, but the contribution of this study to expanding interest in
the potential environmental influences on breast cancer generally is a notable
achievement with implications yet to be fully realized. The first step in examin-
ing these data is to define the result that is to be scrutinized for potential error.
Although PCBs were examined as well as DDT and DDE, we will focus on DDE
and breast cancer, for which the evidence was most suggestive of a positive as-
sociation. An entirely different set of criticisms might arise in evaluating the va-
lidity of the measured absence of association (or very small association) identi-
fied for PCBs.
There were three main calculations undertaken for DDE: a comparison of
means among cases versus controls (of dubious value as a measure of associa-
tion), adjusted odds ratios calculated across the five quintiles (as provided above),
and an estimated adjusted odds ratio for increasing exposure from the 10th to
90th percentile of 4.1 (95% confidence interval: 1.5–11.2), corresponding to an
assumed increase from 2.0 ng/mL to 19.1 ng/mL. Although the latter number
smoothes out the irregularities in the dose–response gradient that were seen across
the quintiles, and may mask non-linearity in the relationship, it provides a
directly distort the measured relative risk given that the controls did not
experience the disease of concern. Assessment of the validity of this hy-
pothesis requires examination of the literature on metabolism, storage,
and excretion of persistent organochlorides and an understanding of the
physiologic changes associated with the early stages of breast cancer. In-
dependent of this study, examining patterns of association for cases with
varying stages of disease might help to evaluate whether such bias oc-
curred, with the expectation that the bias would result in stronger influ-
ence among cases with more advanced disease and little or no influence
among cases with carcinoma in situ of the breast (Millikan et al., 1995).
Such a bias might also be expected to be strongest for cases diagnosed
close to the time of serum collection (when latent disease is more likely
to be present) as compared to cases diagnosed later relative to serum col-
lection.
2. Has lactation or childbearing confounded the measured association between
serum DDE and breast cancer? The investigators reported that lactation was
associated with a decreased risk of breast cancer (Wolff et al., 1993) as re-
ported by others, and that adjustment for lactation markedly increased the
relative risk. Lactation is known to be a major pathway to eliminating stored
organochlorides and thus causes lower measured levels of these compounds
in the body. Whatever exposure level was truly present prior to the period
of lactation, the level measured after lactation would be lower. If adjust-
ment affected the reported relative risk for the comparison of 10th to 90th
percentile of DDE to the same extent as it affected their categorical meas-
ure of relative risk of DDE, the odds ratio without adjustment for lactation
would have been around 2.4 instead of 4.1. Thus, the validity of the lacta-
tion-adjusted estimate warrants careful scrutiny (Longnecker & London,
1993).
If early-life DDE levels are etiologically important, lactation presumably
has artificially lowered later-life serum levels and introduced error relative
to the exposure of interest (prelactation levels). If lactation reduced the risk
of breast cancer (independent of its DDE-lowering influence), then lacta-
tion would be expected to introduce positive confounding and falsely ele-
vate the relative risk (Longnecker & London, 1993). Lactation would lower
the measure of exposure and lower breast cancer risk, so that failure to ad-
just for lactation would result in a spuriously elevated relative risk for DDE
and breast cancer, and adjustment for lactation would therefore lower the
relative risk. The reason for the opposite effect of adjustment for lactation
is not clear (Dubin et al., 1993), but it suggests that lactation history
was associated with a higher level of DDE rather than a lower level of
DDE in this population. The high proportion of nulliparous (and thus
never-lactating) women in the Wolff et al. (1993) study may influence the
Each of these issues could affect the true (unknown) measure of the relative
risk in comparison to the observed value of 4.1. We would like to be able to as-
sign probabilities to these alternative scenarios given that they have implications
for the interpretation of the study results. If these potential biases were incorpo-
rated, the distribution of values around the point estimate would not necessarily
be symmetrical, as is presumed for random error, but may take other shapes. For
example, metabolic effects of early disease seem more likely to artificially ele-
vate case serum DDE levels relative to controls rather than lower them, so that
the confidence interval might be weighted more on the lower relative risk end.
Lactation may require several curves to address its potential role according to the
alternative hypotheses. Insofar as it reflects a true confounder of the DDE/breast
cancer association, more refined measurement and adjustment for the relevant as-
pects of lactation might be predicted to further elevate the DDE/breast cancer as-
sociation in the Wolff et al. (1993) study (Greenland & Robins, 1985; Savitz &
Barón, 1989). As a marker only of reduced body burden of DDE, it should not
have been adjusted and thus the smaller relative risks reported without adjustment
may be more valid, making true values below 4.1 more compatible with the ob-
served results than values above 4.1. On the other hand, since the confounding
influence of lactation was counter to the expected direction (Longnecker & Lon-
don, 1993), we may wish to raise questions about the assessment of lactation or
DDE, and spread the probability curve more broadly in both directions.
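One way to sketch an asymmetric distribution of this kind is a simple simulation that combines random error with a one-sided bias factor. The example below takes the reported odds ratio of 4.1 and its confidence limits from the text. The bias range of 1.0 to 1.5 is purely hypothetical, chosen only to illustrate a bias that can inflate but not deflate the estimate. It is a minimal sketch and not a method used by the original investigators.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

OBSERVED_OR = 4.1               # reported 10th-to-90th percentile odds ratio
CI_LOWER, CI_UPPER = 1.5, 11.2  # reported 95% confidence limits
SE_LOG_OR = (math.log(CI_UPPER) - math.log(CI_LOWER)) / (2 * 1.96)

# ASSUMPTION (hypothetical): preclinical disease inflates the observed odds
# ratio by a factor between 1.0 (no bias) and 1.5, and never deflates it.
BIAS_MIN, BIAS_MAX = 1.0, 1.5


def simulate_adjusted_or(n_draws=20000):
    """Draw odds ratios reflecting random error plus a one-sided bias factor."""
    draws = []
    for _ in range(n_draws):
        log_or = random.gauss(math.log(OBSERVED_OR), SE_LOG_OR)  # random error
        bias_factor = random.uniform(BIAS_MIN, BIAS_MAX)         # systematic error
        draws.append(math.exp(log_or) / bias_factor)
    return sorted(draws)


draws = simulate_adjusted_or()
median = statistics.median(draws)
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(f"median {median:.1f}, 95% simulation interval {lower:.1f} to {upper:.1f}")
```

Under these assumptions the simulated distribution shifts toward values below 4.1. Replacing the one-sided bias range with a range that spans 1.0 would correspond to spreading the curve in both directions, as suggested for the lactation question.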
Evaluation of results through specifying and working through the consequences
of a series of potential biases, focusing on two principal ones in some detail, has
not answered the question of whether the measured association of DDT/DDE and
breast cancer was accurate, but it has helped to refine the question. Instead of
asking whether the study’s results are valid, we ask a series of more focused
and answerable questions that bear on the overall result. Does preclinical breast
cancer distort measured levels of serum DDE, and if so, in which direction? Is
lactation inversely related to breast cancer, independent of DDE? Is serum DDE
level a more accurate reflection of early-life exposure among non-lactating
women? Some of these questions point toward research outside of the scope of
epidemiology, but other approaches to addressing these questions would involve
identifying populations in which the threat to validity is much reduced. The lac-
tation issue could be examined in a population in which breastfeeding is absent,
not resolving the questions about lactation, DDE, and breast cancer, but ad-
dressing DDE and breast cancer without vulnerability to distortion by lactation.
These refined questions are, in principle, testable and would help to resolve the
questions raised by the Wolff et al. (1993) study. The critical evaluation of study
results should enhance intellectual grasp of the state of the literature, help us
judge the credibility of the measured association, and identify testable hypothe-
ses that would clarify a study’s results and advance knowledge of the issue.
REFERENCES
Barnes DE, Bero LA. Why review articles on the health effects of passive smoking reach
different conclusions. JAMA 1998;279:1566–1570.
Copeland KT, Checkoway H, McMichael AJ, Holbrook RH. Bias due to misclassifica-
tion in the estimation of relative risk. Am J Epidemiol 1977;105:488–495.
Davis DL, Bradlow HL, Wolff M, Woodruff T, Hoel DG, Anton-Culver H. Medical hy-
pothesis: xenoestrogens as preventable causes of breast cancer. Environ Health Per-
spect 1993;101:372–377.
Dubin N, Toniolo PG, Lee EW, Wolff MS. Response “Re: Blood levels of organochlo-
rine residues and risk of breast cancer.” J Natl Cancer Inst 1993;85:1696–1697.
Flanders WD, Drews CD, Kosinski AS. Methodology to correct for differential misclas-
sification. Epidemiology 1995;6:152–156.
Furberg H, Newman B, Moorman P, Millikan R. Lactation and breast cancer risk. Int J
Epidemiol 1999;28:396–402.
Greenland S. Response and follow-up bias in cohort studies. Am J Epidemiol 1977;
106:184–187.
Greenland S. Randomization, statistics, and causal inference. Epidemiology 1990;1:
421–429.
Greenland S, Criqui MH. Are case-control studies more vulnerable to response bias? Am
J Epidemiol 1981;114:175–177.
Greenland S, Robins JM. Confounding and misclassification. Am J Epidemiol 1985;
122:495–506.
Greenland S, Robins JM. Empirical-Bayes adjustments for multiple comparisons are
sometimes useful. Epidemiology 1991;2:244–251.
Hertz-Picciotto I. Invited commentary: Shifting the burden of proof regarding biases and
low-magnitude associations. Am J Epidemiol 2000;151:946–948.
Hunter DJ, Hankinson SE, Laden F, et al. Plasma organochlorine levels and risk of breast
cancer in a prospective study. N Engl J Med 1997;337:1253–1258.
Joffe M, Li Z. Male and female factors in fertility. Am J Epidemiol 1994;140:921–929.
Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic research: principles and quan-
titative methods. Belmont, CA: Lifetime Learning Publications, 1982.
Last JM. A Dictionary of Epidemiology, Fourth Edition. New York: Oxford University
Press, 2001;184–185.
Longnecker MP, London SJ. Re: Blood levels of organochlorine residues and risk of breast
cancer. J Natl Cancer Inst 1993;85:1696.
Millikan R, DeVoto E, Duell EJ, Tse C-K, Savitz DA, Beach J, Edmiston S, Jackson S,
Newman B. Dichlorodiphenyldichloroethene, polychlorinated biphenyls, and breast
cancer among African-American and white women in North Carolina. Cancer Epi-
demiol, Biomarkers Prev 2000;9:1233–1240.
Millikan R, Dressler L, Geradts J, Graham M. The importance of epidemiologic studies
of in-situ carcinoma of the breast. Breast Cancer Res Treat 1995;34:65–77.
Moysich KB, Ambrosone CB, Vena JE, et al. Environmental organochlorine exposure
and postmenopausal breast cancer risk. Cancer Epidemiol Biomarkers Prev
1998;7:181–188.
Newcomb PA, Storer BE, Longnecker MP, et al. Cancer of the breast in relation to lac-
tation history. N Engl J Med 1994;330:81–87.
Rothman KJ. Modern Epidemiology. Boston: Little, Brown, and Company, 1986:77.
Savitz DA, Barón AE. Estimating and correcting for confounder misclassification. Am J
Epidemiol 1989;129:1062–1071.
Savitz DA, Olshan AF. Re: “Male and female factors in infertility.” Am J Epidemiol
1995;141:1107–1108.
Savitz DA, Olshan AF. Multiple comparisons and related issues in the interpretation of
epidemiologic data. Am J Epidemiol 1995b;142:904–908.
Savitz DA, Poole C, Miller WC. Reassessing the role of epidemiology in public health.
Am J Pub Health 1999;89:1158–1161.
Wolff MS, Toniolo PG, Lee EW, Rivera M, Dubin N. Blood levels of organochlorine
residues and risk of breast cancer. J Natl Cancer Inst 1993;85:648–652.
4
SELECTION BIAS IN COHORT STUDIES
STUDY DESIGNS
Except for research that seeks simply to characterize the frequency of disease
occurrence in a population, epidemiologic studies make comparisons between
two or more groups. The goal is to draw inferences about possible causal rela-
tions between some attribute that may affect health, generically called exposure,
and some health outcome or state, generically called disease. The exposure may
be a biological property, such as a genotype or hormone level; an individual be-
havior, such as drug use or diet; or a social or environmental characteristic, such
as living in a high-crime neighborhood or belonging to a particular ethnic group.
Disease also covers many types of health events, including a biochemical or
physiologic state (for example, elevated low-density lipoprotein cholesterol), a
clinical disease (for example, gout), or an impairment in daily living (for example,
the inability to walk without assistance).
The study designs used to examine exposure–disease associations were clearly
organized by Morgenstern et al. (1993). All designs are intended to allow infer-
ences about whether exposure influences the occurrence of disease. Two axes
serve to define the universe of possible study designs. The first concerns the way
in which the health outcome is measured. The health event can be assessed in
the form of disease incidence, defined as the occurrence of disease in persons
tablishment of the exposed group places the burden on the investigator to iden-
tify an unexposed group that approximates the disease incidence that the exposed
group would have had if they had not been exposed. In the absence of any true
influence of the exposure on disease occurrence, the exposed and unexposed
groups should have equal disease incidence; confounding or selection bias is
said to be present if this condition is not met. Distinguishing between differences
in disease incidence due to the causal effect of exposure and differences in dis-
ease incidence due to selection bias or confounding is an essential challenge in
the interpretation of cohort studies. There is no direct way of isolating these con-
tributors to the observed differences (or lack of differences) in disease occur-
rence, but this chapter offers a number of tools to help in this assessment.
Because the key issue is comparability between two groups with varying ex-
posure rather than the actual constitution of either group, it is somewhat arbi-
trary to “blame” one of the groups when the two are not comparable. The dis-
cussion of cohort studies focuses on obtaining a non-exposed group that is
comparable to the exposed group, but the challenge could be viewed with equal
legitimacy as identifying an exposed group that is comparable to the non-exposed
one. Whether beginning with an exposed group and seeking a suitable unexposed
group or the reverse, what matters is their comparability.
The focus in this chapter is on examining whether the method by which the
groups were constituted has produced selection bias. The algorithm for generat-
ing the comparison groups is the subject of evaluation, though the actual consti-
tution of the group depends on both the algorithm as it is defined and the process
of implementing it to form the study groups. No matter how good the theoretical
properties of the selection mechanism, non-response or loss to follow-up can
introduce distortion or, under rather optimistic scenarios, correct errors resulting
from a faulty group definition. Non-response is a form of selection bias, but
because it is so pervasive and important in epidemiologic studies, a separate
chapter (Chapter 6) addresses that issue alone. Another source of error in
constituting the study groups, not discussed in this chapter, arises when the
mechanism for selecting subjects does not perform as intended because of random
processes. This possibility is accounted for in generating variance estimates and
confidence intervals, with random selection of subjects one of the few sources
of readily quantifiable sampling error in epidemiologic studies (Greenland, 1990)
(Chapter 10). In asking whether selection bias has arisen, it is useful to distin-
guish between a faulty mechanism for selection and a good mechanism that gen-
erated an aberrant result. This chapter focuses on the mechanism.
There are two closely related processes that introduce bias into the comparison
of exposed and unexposed subjects in cohort studies. When there is a distortion
due to the natural distribution of exposures in the population, the mixing of ef-
fects is referred to as confounding. When there is a distortion because of the way
in which our study groups were constituted, it is referred to as selection bias. In
our hypothetical cohort study of dietary fat intake and prostate cancer, we may
find that the highest consumers of dietary fat tend to be less physically active
than those who consume lower amounts of dietary fat. To the extent that physi-
cal activity influences the risk of disease, confounding would be present not be-
cause we have chosen the groups in some faulty manner, but simply because
these attributes go together in the study population. In contrast, if we chose our
high dietary fat consumers from the labor union retirees, and identified low fat
consumers from the local Sierra Club, men who are quite likely to be physically
active, there would be selection bias that results in part from the imbalance be-
tween the two groups with respect to physical activity, but also quite possibly
through a range of other less readily identified characteristics.
Confounding tends to be the focus when the source of non-comparability is
measurable at least in principle and can therefore be adjusted statistically. To the
extent that the source of non-comparability can be identified, whether it arises
naturally (confounding) or as the result of the manner in which the study groups
were chosen (selection bias), its effects can be mitigated by statistical ad-
justment. When the concern is with more fundamental features of the groups to
be compared and seems unlikely to be resolved through measurement of co-
variates and statistical control, we usually refer to the consequence of this non-
comparability as selection bias.
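When the source of non-comparability has been measured, the adjustment described above can be illustrated with a stratified analysis. The sketch below uses the dietary fat and physical activity example with entirely hypothetical counts, constructed so that dietary fat has no effect within either activity stratum. The Mantel-Haenszel summary risk ratio is one standard choice for such an adjustment. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """Counts for one level of the measured covariate (here, physical activity)."""
    exposed_cases: int
    exposed_total: int
    unexposed_cases: int
    unexposed_total: int


def crude_risk_ratio(strata):
    """Risk ratio that ignores the stratification variable."""
    exposed_risk = sum(s.exposed_cases for s in strata) / sum(s.exposed_total for s in strata)
    unexposed_risk = sum(s.unexposed_cases for s in strata) / sum(s.unexposed_total for s in strata)
    return exposed_risk / unexposed_risk


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel summary risk ratio across strata."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        total = s.exposed_total + s.unexposed_total
        numerator += s.exposed_cases * s.unexposed_total / total
        denominator += s.unexposed_cases * s.exposed_total / total
    return numerator / denominator


# HYPOTHETICAL counts: high-fat consumers ("exposed") are mostly inactive,
# and risk depends only on activity (10% if inactive, 5% if active).
strata = [
    Stratum(exposed_cases=80, exposed_total=800, unexposed_cases=20, unexposed_total=200),  # inactive
    Stratum(exposed_cases=10, exposed_total=200, unexposed_cases=40, unexposed_total=800),  # active
]

crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"crude RR = {crude:.2f}, activity-adjusted RR = {adjusted:.2f}")
```

In this constructed example the crude risk ratio is elevated solely because of the imbalance in physical activity, and stratification removes it. The same arithmetic cannot help when the groups differ in characteristics that were never measured, which is the situation usually labeled selection bias.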
The potential for selection bias depends entirely on the specific exposures and
diseases under investigation, since it is the relation between exposure and dis-
ease that is of interest. Groups that seem on intuitive grounds to be non-compa-
rable could still yield valid inferences regarding a particular exposure and dis-
ease, and groups that seem as though they would be almost perfectly suited for
comparison could be problematic. Similarly, some health outcomes seem almost
invariant with respect to the social and behavioral factors that influence many
types of disease, whereas other diseases are subject to a myriad of subtle (and
obvious) influences.
For example, incidence of acute lymphocytic leukemia in childhood varies at
most modestly in relation to social class, parental smoking, or any other expo-
sures or life circumstances examined to date (Chow et al., 1996). If we wished
to assess whether the incidence of childhood leukemia in the offspring of men
who received therapeutic ionizing radiation as treatment for cancer was increased,
the selection of an unexposed group of men might be less daunting since the vari-
ability in disease incidence appears to be independent of most potential deter-
minants studied thus far. That is, we might be reasonably confident that rates
from general population registries would be adequate or that data from men who
received medical treatments other than ionizing radiation would be suitable for
Multiple sources of information, both within and outside the study, can help in
the assessment of whether selection bias is likely to be present. These indicators
do not provide definitive answers regarding the probability that selection bias of
a given direction and magnitude is present, which is the desired goal. Nonethe-
less, by drawing upon multiple threads of information, the ability to address these
critical questions is markedly enhanced. While the goal is a fully defined distri-
bution of probabilities for bias of varying magnitude, a more realistic expecta-
tion for these tools is to begin to sketch out that information and use the dis-
parate pieces of information as fully and appropriately as possible. Not all the
tools suggested below are applicable in every study, but a more systematic con-
sideration of the repertoire of these approaches should yield insights that would
not otherwise be obtained. In some instances, the very lack of information needed
to apply the tool provides relevant information to help characterize the certainty
of the study’s findings and suggests approaches to develop better resources for
addressing the potential bias.
of association between exposure and disease. The reason to focus on the unexposed
group is that the exposed group’s disease rates may differ from an external popu-
lation either due to a true effect of the exposure or to the same sort of idiosyn-
crasies alluded to above with regard to the unexposed group. If the rate of disease
in the unexposed group differs substantially from an external reference population,
however, it is clearly not due to the exposure but due to some other characteristics
of the unexposed population.
An important challenge to implementing this approach is that only a few dis-
eases have standardized ascertainment protocols and readily available informa-
tion on frequency of disease occurrence in populations external to the study. For
example, cancer registries comprehensively document the occurrence of diag-
nosed disease using rigorous protocols and publish rates of disease on a regular
basis (Ries et al., 1996). When conducting a study of the possible carcinogenic-
ity of an industrial chemical, for example, in which the strategy is to compare
cancer incidence in exposed workers to workers without that exposure, it would
be informative to compare the cancer incidence observed among the unexposed
workers to the incidence in populations under surveillance as part of the Sur-
veillance, Epidemiology, and End Results (SEER) Program or other geographi-
cally suitable cancer registries. Some assurance that the rate difference or ratio
comparing the exposed to unexposed workers is valid would be provided if the
cancer rate in the unexposed workers were roughly comparable to that of the
general population. If a notable discrepancy were found between the unexposed
workers and the community’s incidence rate more generally, the suitability of
the unexposed workers serving as the referent for the exposed workers might be
called into question.
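The comparison of unexposed workers with registry rates is commonly summarized as a standardized incidence ratio: observed cases divided by the number expected if the group had experienced the reference rates. The sketch below is a minimal illustration with hypothetical person-years, reference rates, and case counts. It uses Byar's approximation for the confidence limits, which is a choice made here and not one taken from the text.

```python
import math


def expected_cases(person_years_by_age, reference_rates_per_100k):
    """Expected count if the study group experienced the reference rates."""
    return sum(person_years_by_age[age] * reference_rates_per_100k[age] / 100_000
               for age in person_years_by_age)


def standardized_incidence_ratio(observed, expected, z=1.96):
    """SIR with approximate 95% limits (Byar's approximation; needs observed > 0)."""
    sir = observed / expected
    lower = observed * (1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))) ** 3 / expected
    upper = ((observed + 1)
             * (1 - 1 / (9 * (observed + 1)) + z / (3 * math.sqrt(observed + 1))) ** 3
             / expected)
    return sir, lower, upper


# HYPOTHETICAL inputs: person-years among unexposed workers and registry
# rates per 100,000 person-years in the same age bands.
person_years = {"40-49": 20_000, "50-59": 15_000, "60-69": 5_000}
registry_rates = {"40-49": 100, "50-59": 300, "60-69": 700}
observed_cases = 70

expected = expected_cases(person_years, registry_rates)
sir, lower, upper = standardized_incidence_ratio(observed_cases, expected)
print(f"expected {expected:.0f}, SIR {sir:.2f} ({lower:.2f} to {upper:.2f})")
```

An interval that clearly excludes 1.0, as in this invented example, would prompt questions about whether the unexposed workers are a suitable referent, provided that disease ascertainment is comparable between the cohort and the registry.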
A critical assumption in applying this strategy is that the methods of ascer-
tainment need to be comparable between the unexposed group in the study and
the external referent population. For some outcomes, for example, overall
mortality rates, comparisons can be made with some confidence because the diagnosis
and comprehensiveness of ascertainment are likely to be comparable between
the unexposed subset of the cohort and an external population. However, for
many diseases, the frequency of occurrence depends heavily on the ascertain-
ment protocol, and the sophistication of methods in a focused research enterprise
will often exceed the quality of routinely collected data. Comparing an unex-
posed segment of the cohort in which disease is ascertained using one method
with an external referent population with a substantially different method of dis-
ease ascertainment is of little value, and observing discrepancies has little or no
bearing on the suitability of the unexposed as a referent. If the direction and ide-
ally the magnitude of the disparity between the methods of disease ascertainment
were well understood, such a comparison might help to provide some assurance
that the disparity in the disease occurrence across groups is at least in the ex-
pected direction.
As an illustration of the strategy and some of the complexity that can arise in
implementation, consider a study of the relationship between pesticide exposure
and reproductive outcomes in Colombia (Restrepo et al., 1990). Using
questionnaires to elicit self-reported reproductive health outcomes for female
workers and the wives of male workers, the prevalence of various adverse outcomes
was tabulated, comparing reproductive experiences before exposure onset to their
experiences after exposure onset (Table 4.1). In principle, the prevalence before
exposure should reflect a baseline risk, with the prevalence after onset of expo-
sure reflecting the potential effect of pesticides.
A number of the odds ratios are elevated, but concerns were raised by the au-
thors regarding the credibility of the findings based on the anomalously low fre-
quency of certain outcomes, most notably spontaneous abortion. Whereas one
would generally expect a risk of approximately 8%–12% based on self-reported
data, the results here show only 3.6% of pregnancies ending in spontaneous abor-
tions among female workers prior to exposure and 1.9% among wives of male
workers prior to their partners’ exposure. This could reflect in part the very low
risk expected for a selectively healthy population, by definition younger than
those individuals after exposure onset. Much of the aberrantly low prevalence of
spontaneous abortion is likely due to erroneous underreporting of events in both
groups, however, an issue of disease misclassification. Regardless, the strategy
of comparing outcome frequency among the unexposed to that of an external
population was informative.
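The odds ratios in Table 4.1 can be approximately reproduced from the tabulated risks, and the baseline risk can be set against the external expectation quoted above. The sketch below uses the female worker columns of Table 4.1. The function name and the 0.02 tolerance are illustrative choices, and small discrepancies are expected because the published percentages are rounded.

```python
def odds_ratio_from_risks(risk_before_pct, risk_after_pct):
    """Odds ratio comparing risk after exposure onset with risk before (both in %)."""
    p0 = risk_before_pct / 100
    p1 = risk_after_pct / 100
    return (p1 / (1 - p1)) / (p0 / (1 - p0))


# (risk before %, risk after %, reported OR) for female workers, from Table 4.1.
female_workers = {
    "Induced abortion": (1.46, 2.84, 1.96),
    "Spontaneous abortion": (3.55, 7.50, 2.20),
    "Premature baby": (6.20, 10.95, 1.86),
    "Stillbirth": (1.35, 1.34, 0.99),
    "Malformed baby": (3.78, 5.00, 1.34),
}

for outcome, (before, after, reported) in female_workers.items():
    computed = odds_ratio_from_risks(before, after)
    print(f"{outcome:22s} computed OR {computed:.2f} (reported {reported:.2f})")

# Baseline spontaneous abortion risk relative to the low end of the
# 8%-12% range expected from self-reported data.
baseline_fraction = female_workers["Spontaneous abortion"][0] / 8.0
print(f"baseline risk is {baseline_fraction:.0%} of the lowest expected value")
```

The baseline risk of spontaneous abortion is less than half of even the lowest externally expected value, which is the anomaly that led the authors to question the findings.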
In selected areas of the country, congenital malformations are comprehensively
tabulated and data on prevalence at birth are published. Information from vital
records, including birth records (birth weight, duration of gestation) and death
data used to generate cause-specific mortality, constitutes a national registry for
the United States and provides a readily available benchmark for studying health
events that can be identified in such records. Yet the ascertainment methods even
for these relatively well-defined conditions can differ markedly between a given
research protocol and the vital records protocol. For example, identification of
congenital malformations depends strongly on whether medically indicated abor-
tions are included or excluded as well as the frequency of such abortions, how
systematically newborn infants are examined, how far into life ascertainment is
continued (many malformations only become manifest some time after birth),
etc. Cumulatively, those differences can cause substantial differences between
groups that would be identical if monitored using a consistent protocol. Even for
monitoring gestational age or birth weight of infants, differences can arise based
on the algorithm for estimating conception date (e.g., the use of ultrasound ver-
sus reliance on last menstrual period for dating), and inclusion or exclusion of
marginally viable births.
For many chronic health conditions of interest, both diagnosed diseases and
less clearly defined symptoms, data from comparable populations may be un-
TABLE 4.1. Comparison of Prevalence Ratios for Reproductive Outcomes Before and After Onset of Potential Pesticide Exposure, Colombia

                         FEMALE WORKERS                         WIVES OF MALE WORKERS
                         Risks (%)                              Risks (%)
                         BEFORE    AFTER                        BEFORE    AFTER
PREGNANCY OUTCOME        EXPOSURE  EXPOSURE  OR    95% CI       EXPOSURE  EXPOSURE  OR    95% CI
Induced abortion          1.46      2.84     1.96  1.47–2.67     0.29      1.06     3.63  1.51–8.70
Spontaneous abortion      3.55      7.50     2.20  1.82–2.66     1.85      3.27     1.79  1.16–2.77
Premature baby            6.20     10.95     1.86  1.59–2.17     2.91      7.61     2.75  2.01–3.76
Stillbirth                1.35      1.34     0.99  0.66–1.48     1.01      0.89     0.87  0.42–1.83
Malformed baby            3.78      5.00     1.34  1.07–1.68     2.76      4.16     1.53  1.04–2.25

OR, odds ratio; CI, confidence interval.
Restrepo et al., 1990.
ies. Verification that such expected patterns are present within the study cohort
provides indirect evidence against some forms of selection bias as well as some
evidence against extreme measurement error.
This strategy is illustrated in a recent study of the possible role of anxiety and
depression in the etiology of spontaneous preterm labor and delivery (Dayan et al., 2002),
a topic for which results of previous studies have not led to firm conclusions. A
cohort of 634 pregnancies was identified during 1997–1998 in France, and women
were administered instruments to measure anxiety and depression as well as a
range of other known and suspected risk factors for preterm birth. In addition to
examining and presenting results for anxiety and depression, exposures of un-
known etiologic significance, the authors presented results for a range of factors
for which the associations are well established (Table 4.2). Despite the impreci-
sion in this relatively small cohort, increased risk associated with heavy smok-
ing, low prepregnancy body mass index, prior preterm delivery, and genitouri-
nary tract infection was confirmed. This does not guarantee that the results found
for anxiety and depression are certain to be correct, but it increases confidence
somewhat that the cohort is capable of generating results compatible with those
of most previous studies for other, more extensively examined, predictors of risk.
For demographic and social predictors, the internal comparisons help to assess
whether there has been some differential selection that has markedly distorted
the patterns of disease. If we conducted a study of osteoporosis in which African-
American women experienced comparable rates of osteoporosis or higher rates
than white women, we would be motivated to ask whether there had been some
unintended selection that yielded an aberrant group of African Americans or
whites or both. When men and women or young and old persons show the ex-
pected pattern of disease risk relative to one another, then it is less likely that
the pattern of selection differed dramatically in relation to gender or age. That
is, the young men who were enrolled show the disease patterns expected of young
men, and the older women who were enrolled show the disease patterns expected
of older women.
It is possible, of course, to select an entire cohort that has aberrant disease
rates but ones that are uniformly aberrant across all subgroups. We would thus
find the expected pattern by making such comparisons within the cohort. We
may find that all enrolled persons show a lower or higher than expected rate of
disease, but that subgroups differ as expected relative to one another. A uniformly
elevated or depressed rate of disease occurrence may well be less worrisome in
that the goal of the study is to make internal comparisons, i.e., among exposed
versus unexposed persons. The choice of the study setting or population may
yield groups with atypically high or low overall rates. For example, in almost
any randomized clinical trial the high degree of selectivity for enrollment in the
trial does not negate the validity of the comparison of those randomized to dif-
ferent treatment arms.
TABLE 4.2. Sociodemographic and Biomedical Characteristics of Pregnant Women,
France, 1987–1989

CHARACTERISTICS                              NO.      %
Age (years)
  <20                                         31     4.9
  20–34                                      516    81.4
  ≥35                                         87    13.7
Marital status
  Living alone                                71    11.2
  Married or cohabiting                      563    88.8
Ethnicity
  Europe                                     598    94.3
  Others                                      36     5.7
School education
  Primary school                              50     7.9
  Secondary school                           420    66.2
  Higher education                           164    25.9
Occupation
  Not employed                               243    38.3
  Lower level of employment                  262    41.3
  Middle and higher level of employment      129    20.3
Parity
  0                                          236    37.2
  1–2                                        333    52.5
  ≥3                                          65    10.3
than younger study participants, that deviation may be an indicator of the oper-
ation of selection bias, with the younger group generating the more valid result.
The challenge in interpretation is that selection bias across strata would pro-
duce the exact same pattern as effect measure modification across the same strata.
For example, although selection bias that is more extreme for elderly participants
would produce different measures of association than among younger partici-
pants, the stronger or weaker association among the elderly may reflect true ef-
fect modification in the absence of any bias. Elderly people may truly respond
to the putative causal agent differently than younger people. Of course, both se-
lection bias and effect measure modification can be operating, either in the same
direction or in opposite directions. Thus, the absence of such a pattern does not
persuasively rule out the potential for selection bias. Perhaps the elderly really
do experience a weaker association between exposure and disease, and selection
bias masks that pattern by increasing the strength of association among the eld-
erly and thereby eliminating the appearance of effect measure modification.
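A minimal sketch of why stratum-specific results alone cannot separate selection bias from effect measure modification. It assumes, purely for illustration, that selection bias acts as a multiplicative factor on the true relative risk within a stratum; the relative risks and bias factors are hypothetical.

```python
def observed_rr(true_rr, bias_factor):
    """Observed relative risk when selection multiplies the true value by bias_factor."""
    return true_rr * bias_factor

# Each scenario maps stratum -> (true RR, hypothetical selection-bias factor).
scenarios = {
    "bias mimics modification": {"younger": (2.0, 1.0), "elderly": (2.0, 0.75)},
    "true modification, no bias": {"younger": (2.0, 1.0), "elderly": (1.5, 1.0)},
    "bias masks modification": {"younger": (2.0, 1.0), "elderly": (1.5, 4 / 3)},
}

observed = {
    name: {stratum: round(observed_rr(rr, bias), 2)
           for stratum, (rr, bias) in strata.items()}
    for name, strata in scenarios.items()
}

for name, result in observed.items():
    print(f"{name}: {result}")
```

The first two scenarios yield identical observed patterns despite different underlying truths, and the third shows a true difference between strata hidden by bias confined to the elderly.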
In some instances, the mechanism thought to underlie selection bias may be
directly amenable to empirical evaluation. A classic selection bias is the healthy
worker effect in studies that compare health and mortality among industrial work-
ers with health and mortality patterns in the community population. The demand
for fitness at the time of hire and for sustained work in physically demanding
jobs gives rise to an employed group that is at lower risk of mortality from a
range of causes as compared to the general population (Checkoway et al., 1989),
literally through selection for employment. Consistent with the approach suggested
for examining selection bias, the more highly selected a subgroup is with
regard to the physical or other job demands that predict favorable health,
such as education or talent, the more extreme the discrepancy tends to be
(Checkoway et al., 1989). One might expect the magnitude of selection to be greater
for a job requiring intense physical labor, such as longshoreman, or one that requires
specialized talents, such as carpenter, as compared to jobs that are less demanding
physically (e.g., clerk) or professionally (e.g., janitor).
The effect of this selection for hire tends to diminish over time, presumably
because the good health that was required at the time of hire has faded. Those
chosen to be fit have become less fit relative to their peers, even though there is
typically still some level of selectivity for sustained employment (Checkoway et
al., 1989). Those who leave work before retirement age show evidence of se-
lectively unfavorable mortality, for example, in comparison to those who sustain
their employment. By elucidating the pattern and extent of the healthy worker
effect, our understanding of the phenomenon has markedly increased and there-
fore our ability to recognize and control its effects has been greatly enhanced. It
is difficult to identify any other form of selection bias that is so well understood
because addressing the healthy worker effect over the past 30 years has been fun-
damental to progress in studies of occupational disease. This elucidation of the
healthy worker effect has required extensive effort to dissect the process (Choi,
1992; Arrighi & Hertz-Picciotto, 1994), develop specialized analytic approaches
to minimize its impact (Steenland & Stayner, 1991), and recognize that failure
to account for the phenomenon adequately substantially weakens the validity of
research in occupational epidemiology.
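One way the healthy worker effect and its decline with time since hire might be examined is sketched below. The ratio of observed to expected deaths is a common summary in occupational cohort studies; the counts here are hypothetical and are not taken from Checkoway et al. (1989) or any other cited study.

```python
# Hypothetical counts: deaths observed among workers, and deaths expected if
# community mortality rates had applied, grouped by years since hire.
deaths_by_years_since_hire = {
    "0-4": {"observed": 30, "expected": 50.0},
    "5-14": {"observed": 90, "expected": 120.0},
    "15+": {"observed": 190, "expected": 200.0},
}

def observed_to_expected(observed, expected):
    """Ratio of observed to expected deaths; values below 1 indicate lower
    mortality among workers than in the community."""
    return observed / expected

ratios = {
    period: observed_to_expected(counts["observed"], counts["expected"])
    for period, counts in deaths_by_years_since_hire.items()
}

for period, ratio in ratios.items():
    print(f"{period} years since hire: observed/expected = {ratio:.2f}")
```

In this invented example the deficit in mortality is largest soon after hire and shrinks toward the community rate with longer follow-up, which is the pattern described above.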
In studies of prevalence, a particular form of selection bias concerns the loss
of potentially eligible individuals prior to the time of assessment. A study of fe-
male garment workers compared the prevalence of musculoskeletal disorders to
that of hospital workers who did not have jobs associated with the putative ergonomic
stressors (Punnett, 1996). Punnett reported a crude prevalence ratio of 1.9,
but was concerned with the possibility of a stronger causal effect that was masked
by the more affected garment workers selectively leaving employment, with those
remaining to be included in the prevalence study showing a lower prevalence of
the disorder. To address this possibility, she examined the incidence of new on-
set of pain in relation to the number of years prior to the survey (Fig. 4.1). This
figure demonstrates that the onset of musculoskeletal pain among garment work-
ers was markedly greater in the period proximal to the survey and rare in the ear-
lier years, consistent with the hypothesized attrition of workers whose pain on-
set was earlier. No such pattern was found among hospital workers. The
magnitude of selection, and thus selection bias, is least for the recent period prior
to the survey and much greater for the more temporally remote time period.
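The masking of a stronger effect by selective attrition can be sketched with simple arithmetic. The counts and the fraction of affected workers leaving are hypothetical and are not Punnett's data; the function name is likewise introduced only for this illustration.

```python
def prevalence_among_remaining(affected, unaffected,
                               frac_affected_leaving,
                               frac_unaffected_leaving=0.0):
    """Prevalence among workers still employed at the time of the survey."""
    remaining_affected = affected * (1 - frac_affected_leaving)
    remaining_unaffected = unaffected * (1 - frac_unaffected_leaving)
    return remaining_affected / (remaining_affected + remaining_unaffected)

# Hypothetical workforces of 1,000 each before any attrition.
garment = {"affected": 300, "unaffected": 700}
hospital = {"affected": 100, "unaffected": 900}

hospital_prevalence = prevalence_among_remaining(**hospital,
                                                 frac_affected_leaving=0.0)

# Prevalence ratio if nobody left employment.
pr_no_attrition = (
    prevalence_among_remaining(**garment, frac_affected_leaving=0.0)
    / hospital_prevalence
)

# Assumption: half of the affected garment workers left before the survey.
pr_with_attrition = (
    prevalence_among_remaining(**garment, frac_affected_leaving=0.5)
    / hospital_prevalence
)

print(f"prevalence ratio without attrition: {pr_no_attrition:.2f}")
print(f"prevalence ratio with attrition:    {pr_with_attrition:.2f}")
```

With these invented numbers, selective departure of affected garment workers pulls the surveyed prevalence ratio well below the value that would have been seen had everyone remained employed.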
Like other approaches to identifying bias, stratifying on indicators of severity
of selection bias is fallible, but can yield informative clues. Even in the absence
of a mechanistic understanding of the underlying process, hypothesizing a plausible
pathway for selection bias, examining results within strata of differing
vulnerability to the hypothesized selection, and observing whether there are
differences in the pattern of association across those strata provides valuable evidence
to assess the probability and magnitude of selection bias.
[FIGURE 4.1. Annual incidence of pain onset (vertical axis, 0.00 to 0.20) by number of
years before the survey period (horizontal axis, 10 to 1), plotted separately for
garment workers and hospital workers.]
tor vehicle injury due to seat belt use, likely correlated with sunscreen use as a
preventive health measure. We would be reminded to look carefully for other,
correlated preventive health measures that may lead to more (or less) favorable
patterns of melanoma incidence among sunscreen users, such as more frequent
examination by a physician. If the sunscreen users had disease patterns similar
to nonusers, except for the one of interest, i.e., melanoma, the potential for se-
lection bias would be reduced.
A recent report on the impact of fine particulate air pollution on mortality from
respiratory and cardiovascular disease, plausible consequences of such exposure,
also considered a residual set of deaths from other causes (Pope et al., 2002).
The extraordinarily large study of volunteers enrolled by the American Cancer
Society into the Cancer Prevention II Study, 1.2 million adults, provided the ba-
sis for this investigation. As is often the case with studies of this issue, the meas-
ures of association between pollutants and mortality are modest in magnitude but
highly precise, given the large population (Table 4.3). The categories of partic-
ular interest and plausibility, lung cancer and cardiopulmonary disease, showed
increments in risk of 6% to 13% per 10 µg/m³ over the time intervals examined,
contributing to an association with all-cause mortality that was present but lower
in magnitude. Once deaths from lung cancer and cardiopulmonary disease are
removed, the residual category showed essentially no association, as one might
expect from a conglomeration of other cancers, infectious diseases, injury mortality,
etc. That is, observing an association between fine particulate air pollution
and deaths from causes other than those most plausible would raise the serious
possibility that some selection bias for persons living in high exposure commu-
nities was operating and would suggest that the apparent effect of particulates on
lung cancer and cardiopulmonary diseases might be due to some non-specific as-
pect of living in more highly exposed communities.
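Because the associations above are expressed per 10 µg/m³, it may help to see how such a relative risk translates to other exposure contrasts. The sketch below assumes that the log relative risk is linear in exposure, which is an assumption made here for illustration; the function name is hypothetical, and only the 6% and 13% increments come from the text.

```python
import math

def rescale_relative_risk(rr, from_increment, to_increment):
    """Rescale a relative risk to a different exposure increment, assuming the
    log relative risk is linear in exposure."""
    return math.exp(math.log(rr) * to_increment / from_increment)

rescaled = {}
for rr_per_10 in (1.06, 1.13):
    rescaled[rr_per_10] = {
        increment: rescale_relative_risk(rr_per_10, 10, increment)
        for increment in (5, 10, 20)
    }

for rr_per_10, by_increment in rescaled.items():
    for increment, rr in by_increment.items():
        print(f"RR {rr_per_10} per 10 -> per {increment}: {rr:.3f}")
```

The same rescaling applies to a residual category with essentially no association: a relative risk of 1.0 remains 1.0 at any increment.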
TABLE 4.3. Adjusted Mortality Relative Risk Associated with a 10 µg/m³ Change in
Fine Particles Measuring Less Than 2.5 µm in Diameter, American Cancer Society
Cancer Prevention II Study

                              Adjusted RR (95% CI)*
CAUSE OF MORTALITY            1979–1983     1999–2000     AVERAGE
Like all criteria for assessing selection bias, this approach can also be mis-
leading. As already noted, diseases thought to be unrelated to exposure may turn
out to be causally related to the exposure, so that we would erroneously infer se-
lection bias when it is not present. Many if not all known causes of disease af-
fect more than one specific entity. Conversely, comparability for diseases other
than the one of interest is only indirectly pertinent to whether the exposure groups
are comparable for the disease of interest. A selection bias may be present or ab-
sent solely for the health outcome of interest, so that reassuring patterns for other
outcomes are misinterpreted as indicative of valid results for the outcome of in-
terest. The patterns of disease other than the one of interest are a flag to exam-
ine the issue further, not a definitive marker of the presence or absence of bias.
tinctive attributes that predict a reduced risk of coronary heart disease, beyond
any influence of their regimen of physical exertion. In an effort to isolate the ef-
fect of physical exertion and control for confounding, we would measure and ad-
just for suspected influential differences in diet, medication use, preventive health
care, etc. The goal is to make this group, which has been self-selected to be phys-
ically active, less deviant and better approximate the group that would have re-
sulted from randomized assignment of intense exercise. It seems likely that the
proposed adjustments would help to move the results in the desired direction, but
unlikely that the adjustments would be completely successful given that such
self-selection is an elusive construct. By observing the change in pattern of re-
sults with successive adjustments, information is generated to speculate about the
impact of the unattainable complete adjustment for the selection bias. If adjust-
ment diminishes the observed benefits of intense physical activity substantially
compared to the crude results, then the possibility that complete adjustment would
fully eliminate observed benefits is more plausible. We might infer that the ob-
served benefits in the unadjusted comparison are a result of incomplete control
for selection bias. In some instances, this problem of baseline comparability is
so severe as to demand a randomized study, despite the daunting logistical as-
pects of such an undertaking for exposures such as physical exercise. Short of
eliminating the problem, the goals of understanding its origins, reducing it, and
speculating about the impact of its elimination must suffice.
Similarly, understanding the health risks and benefits for women using post-
menopausal estrogen therapy has been extremely challenging, in that the women
who do and do not elect to use such therapy are not comparable in baseline risk
of cardiovascular disease. Matthews et al. (1996) compared future users of es-
trogen replacement therapy with future nonusers, and users were shown to have
a consistently favorable profile on a number of attributes (Table 4.4). Cohorts of
users and nonusers clearly would not reflect women at equal baseline risk of
coronary heart disease. Investigators who address the risks and benefits of es-
trogen replacement therapy are aware of this threat to validity, and make attempts
to adjust for a wide array of factors. For example, analyses of the Nurses Health
Study data (Stampfer et al., 1991) adjusted for a long list of candidate baseline
differences and still found a marked reduction in heart disease incidence among
users (Table 4.5). Whether comparability was truly achieved despite these efforts
is open to debate; recent findings from the Women’s Health Initiative suggest
the observational studies were in error (Writing Group for the Women’s Health
Initiative Investigators, 2002).
TABLE 4.4. Mean ± Standard Error Levels of Other Biologic Characteristics and
Health Behaviors of Premenopausal Examination of Subsequent Users and Nonusers of
Estrogen Replacement Therapy, Pittsburgh, Pennsylvania, 1983–1992

                      SUBSEQUENT    SUBSEQUENT    T-TEST
CHARACTERISTIC        USER          NONUSER       OR χ²       P VALUE
TABLE 4.5. [Title and outcome column headings are missing from this copy. Each of the
five outcome columns gives the number of cases and the RR (95% CI).]*

GROUP AND ADJUSTMENT                PERSON-YEARS  NO.  RR           NO.  RR           NO.  RR           NO.  RR           NO.  RR
                                                       (95% CI)          (95% CI)          (95% CI)          (95% CI)          (95% CI)
No hormone use                           179,194  250  1.0          129  1.0          123  1.0           56  1.0           19  1.0
Current hormone use                       73,532
  Adjusted for age                                 45  0.51          21  0.48          39  0.96          23  1.26           5  0.80
                                                       (0.37–0.70)       (0.31–0.74)       (0.67–1.37)       (0.78–2.02)       (0.30–2.10)
  Adjusted for age and risk factors                    0.56              0.61              0.97              1.46              0.53
                                                       (0.40–0.80)       (0.37–1.00)       (0.65–1.45)       (0.85–2.51)       (0.18–1.57)
Former hormone use                        85,128
  Adjusted for age                                110  0.91          55  0.84          62  1.00          34  1.14          12  1.42
                                                       (0.73–1.14)       (0.61–1.15)       (0.74–1.36)       (0.75–1.74)       (0.70–2.90)
  Adjusted for age and risk factors                    0.83              0.79              0.99              1.19              1.03
                                                       (0.65–1.05)       (0.56–1.10)       (0.72–1.36)       (0.77–1.86)       (0.47–2.25)

*Women with no hormone use served as the reference category in this analysis. The risk
factors included in the multivariate models were age (in five-year categories), cigarette
smoking (none, former, current [1 to 14, 15 to 24, and ≥25 cigarettes per day]),
hypertension (yes, no), diabetes (yes, no), high serum cholesterol level (yes, no), parental
myocardial infarction before the age of 60 (yes, no), Quetelet index (in five categories),
past use of oral contraceptives (yes, no), and time period (in five two-year periods).
RR, relative risk; CI, confidence interval.
Stampfer et al., 1991.
measures of association that differ from groups that are less susceptible. If we
find such a pattern, more stock should be placed in the findings for the exposed
group that is less vulnerable, and if we do not, then there is some evidence that
the hypothesized selection bias has not materialized and we can be somewhat
more confident that there is no need to subdivide the exposed group in that
manner.
For example, the potential effect of sexual activity in late pregnancy on risk
of preterm birth has been considered in a number of studies (Read & Klebanoff,
1993; Sayle et al., 2001). The comparison of sexually active to sexually inactive
women is fraught with potential for selection bias. Some women undoubtedly re-
frain from sexual activity due to discomfort or irritation associated with genital
tract infection, which may well be a marker of increased risk of preterm birth
(French & McGregor, 1997). Others may refrain from sexual activity because of
concerns associated with a history of previous poor pregnancy outcomes, a strong
predictor of subsequent adverse pregnancy outcome. For some women, the lack
of a partner may be the basis for abstinence, possibly correlated with lack of so-
cial support or economic stress. In order to try to isolate a subgroup of women
for whom the level of sexual activity is least susceptible to selection bias, analy-
ses may be restricted to women who are thought to have equal baseline risk
of preterm birth. In an attempt to reduce or eliminate selection bias, we would
eliminate those women who were told to refrain from sexual activity by their
physicians, based on perceived elevated risk, and those who experienced symptoms
associated with genital tract infection as the basis for remaining sexually
inactive. We might eliminate those who are unmarried or not living with a part-
ner. The goal is to find a subset of women for whom the allocation of sexual ac-
tivity is as random as possible, i.e., to simulate as closely as possible a random-
ized trial in which the allocation of exposure is independent of baseline risk,
accepting that some reasons for being unexposed are too closely tied to disease
risk to be informative.
An illustration of a specific candidate source of selection bias is provided by
studies of whether physical activity protects against depression. A key issue is
whether those who are inactive due to disability, and are more likely to be de-
pressed as a result of their disability, should be excluded from such studies. The
potential gain in validity from excluding disabled participants was examined by
Strawbridge et al. (2002) using data from the Alameda County Study, a prospec-
tive cohort study of nearly 7000 adults. The authors examined the 1947 adults
who were over age 50 and living in 1999 and had all essential data for address-
ing physical activity and depression. Disability was defined as the inability to
walk 0.25 miles, walk up 10 stairs without resting, stand from a stooping or kneel-
ing position, or stand after sitting in a chair. Considering various potential con-
founding factors, those who were more physically active had a 25% lower risk
of depression (Table 4.6). Excluding the 151 individuals who were disabled according
to the above criteria made no material difference in the results, with no
tendency whatsoever to move closer to the null value. In this example, it appeared
that omission of disabled persons was not necessary to enhance validity
and only produced a small loss in precision. Nonetheless, the strategy of restricting
to evaluate selection bias is well illustrated by this study, with regard
to methods, implementation, and interpretation.

TABLE 4.6. Sequential Logistic Regression Models Showing Relations Between 1994 Physical
Activity and Depression in 1994 and 1999, with Adjustments for Other Risk Factors Among
1947 Men and Women, Alameda County Study, California, 1994–1999

                                              Prevalent 1994 Depression   Incident 1999 Depression (Longitudinal Analyses)
                                              (cross-sectional analyses)  with 1994 Depressed Subjects Excluded
                                              with All Subjects Included  Disabled Included     Disabled Excluded
                                              (n = 1947)                  (n = 1802)            (n = 1651)
MODEL AND 1994 COVARIATES                     OR*    95% CI†              OR     95% CI         OR     95% CI
1. Age, sex, and ethnicity                    0.75   0.68, 0.84           0.75   0.66, 0.85     0.73   0.63, 0.85
2. Model 1 + education, financial strain,     0.78   0.70, 0.87           0.76   0.67, 0.87     0.75   0.65, 0.87
   and neighborhood problems
3. Model 2 + physical disability,† chronic    0.86   0.76, 0.96           0.82   0.72, 0.94     0.78   0.67, 0.91
   conditions, BMI, smoking, and alcohol
   consumption
4. Model 3 + no. of relatives, no. of         0.90   0.79, 1.01           0.83   0.73, 0.96     0.79   0.67, 0.92
   friends, and satisfaction with relations

*Odds ratios (OR) represent the approximate relative likelihood of being depressed associated
with a one-point increase in the physical activity scale. Because the incidence
rate for depression is relatively small (5.4%), the resulting odds ratios for the
longitudinal analyses closely approximate relative risks.
†This variable is omitted from models in which physically disabled subjects were excluded.
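The comparison across the sequential models in Table 4.6 can be summarized numerically. The sketch below takes the odds ratios from the table and computes how much of the log odds ratio from Model 1 is lost by Model 4, then checks whether restricting to nondisabled subjects moved any estimate toward the null. The attenuation measure is one simple choice made here for illustration, not a method attributed to Strawbridge et al. (2002).

```python
import math

# Odds ratios for Models 1 through 4, copied from Table 4.6.
odds_ratios = {
    "cross-sectional, all subjects": [0.75, 0.78, 0.86, 0.90],
    "incident, disabled included": [0.75, 0.76, 0.82, 0.83],
    "incident, disabled excluded": [0.73, 0.75, 0.78, 0.79],
}

def attenuation(or_reference, or_adjusted):
    """Fraction of the reference log odds ratio lost after further adjustment."""
    return 1 - math.log(or_adjusted) / math.log(or_reference)

attenuation_model4 = {
    analysis: attenuation(ors[0], ors[3])
    for analysis, ors in odds_ratios.items()
}

# For odds ratios below 1, moving toward the null means a larger odds ratio.
moved_toward_null = [
    excluded > included
    for included, excluded in zip(odds_ratios["incident, disabled included"],
                                  odds_ratios["incident, disabled excluded"])
]

for analysis, value in attenuation_model4.items():
    print(f"{analysis}: {value:.0%} of the Model 1 log OR lost by Model 4")
print("any model closer to the null after excluding disabled:",
      any(moved_toward_null))
```

Read this way, restriction did not shift any of the four models toward the null, which matches the interpretation given in the text.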
The evaluation of potential selection bias in cohort studies is much like the eval-
uation of confounding. We first specify known determinants of the health out-
come of interest, since it is only selection in relation to such factors that can gen-
erate erroneous comparisons. Those markers may be broad (e.g., social class,
geographic setting) or narrow (e.g., biological indices of risk, dietary con-
stituents). The question of selection bias is whether, conditional on adjustment
for known determinants of disease risk, the exposed and unexposed groups are
comparable except for differing exposure status. Since we cannot repeat the ex-
periment and assign the exposed population to a no exposure condition to meas-
ure their true baseline risk, we need to make statistical adjustments and ultimately
make a judgment regarding the comparability of the groups. Accepting the ex-
posed population as given, the challenge is to evaluate whether the unexposed
population has done its job, i.e., generated disease rates that approximate those
that would have been found in the exposed population had they lacked exposure.
A number of indirect tools can be applied to address the following questions:
Though none of these is a definitive test for selection bias, all bear on the prob-
ability of selection bias of varying magnitudes. An array of favorable responses
adds markedly to the evidence against selection bias, and responses suggestive of
selection bias would warrant more refined analysis or even further data collection
to examine the possibility. Again, the model of the healthy worker effect in occu-
pational epidemiology illustrates the fundamental importance of selection bias but
also how much progress can be made with concerted effort to elucidate and con-
trol the sources of bias. Most specific scenarios of selection bias can be postulated
and tested using the above tools, either diminishing the credibility of the results
through discovery that significant amounts of bias are likely to be present or
strengthening the credibility by refuting these hypothesized sources of bias.
REFERENCES
Arrighi HM, Hertz-Picciotto I. The evolving concept of the healthy worker survivor ef-
fect. Epidemiology 1994;5:189–196.
Checkoway H, Pearce N, Crawford-Brown DJ. Research methods in occupational epi-
demiology. New York: Oxford University Press, 1989:78–91.
Choi BCK. Definition, sources, magnitude, effect modifiers, and strategies of reduction
of the healthy worker effect. J Occup Med 1992;34:979–988.
Chow W-H, Linet MS, Liff JM, Greenberg RS. Cancers in children. In D Schottenfeld,
JF Fraumeni Jr (eds), Cancer Epidemiology and Prevention, Second Edition. New
York: Oxford University Press, 1996:1331–1369.
Dayan J, Creveuil C, Herlicoviez M, Herbel C, Baranger E, Savoye C, Thouin A. Role
of anxiety and depression in the onset of spontaneous preterm labor. Am J Epidemiol
2002;155:293–301.
Elwood M, Elwood H, Little J. Diet. In JM Elwood, J Little, and JH Elwood (eds), Epi-
demiology and Control of Neural Tube Defects. New York: Oxford University Press,
1992:521–602.
French JI, McGregor JA. Bacterial vaginosis: history, epidemiology, microbiology, sequelae,
diagnosis, and treatment. In Borchardt KA, Noble MA (eds), Sexually transmitted
diseases: Epidemiology, pathology, diagnosis, and treatment. Boca Raton,
Florida: CRC Press, 1997:3–39.
Greenland S. Randomization, statistics, and causal inference. Epidemiology 1990;1:
421–429.
Greenland S, Robins JM. Confounding and misclassification. Am J Epidemiol
1985;122:495–506.
Greenland S, Robins JM. Identifiability, exchangeability, and epidemiologic confound-
ing. Int J Epidemiol 1986;15:413–419.
Little J, Elwood H. Socio-economic status and occupation. In JM Elwood, J Little, and
JH Elwood (eds), Epidemiology and Control of Neural Tube Defects. New York: Ox-
ford University Press, 1992a:456–520.
Little J, Elwood M. Ethnic origin and migration. In JM Elwood, J Little, and JH Elwood
(eds), Epidemiology and Control of Neural Tube Defects. New York: Oxford Uni-
versity Press, 1992b:146–167.
Matthews KA, Kuller LH, Wing RR, Meilahn EN, Plantinga P. Prior to use of estrogen
replacement therapy, are users healthier than nonusers? Am J Epidemiol 1996;143:
971–978.
Miettinen OS. The “case-control” study: valid selection of subjects. J Chron Dis
1985;38:543–548.
Morgenstern H, Thomas D. Principles of study design in environmental epidemiology.
Environ Health Perspect 1993;101 (Suppl 4):23–38.
Pope CA III, Burnett RT, Thun MJ, Calle EE, Krewski D, Ito K, Thurston GD. Lung
cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution.
JAMA 2002;287:1132–1141.
Punnett L. Adjusting for the healthy worker selection effect in cross-sectional studies. Int
J Epidemiol 1996;25:1068–1076.
Read JS, Klebanoff MA. Sexual intercourse during pregnancy and preterm delivery: Ef-
fects of vaginal microorganisms. Am J Obstet Gynecol 1993;168:514–519.
Ries LAG, Hankey BF, Harras A, Devesa SS. Cancer incidence, mortality, and patient
survival in the United States. In D Schottenfeld, JF Fraumeni Jr (eds), Cancer Epi-
demiology and Prevention, Second Edition. New York: Oxford University Press,
1996:168–191.
Restrepo M, Muñoz N, Day NE, Parra JE, de Romero L, Nguyen-Dinh X. Prevalence of
adverse reproductive outcomes in a population occupationally exposed to pesticides
in Colombia. Scand J Work Environ Health 1990;16:232–238.
Rothman KJ. Modern Epidemiology. Boston: Little, Brown and Co., 1986.
Rothman KJ, Greenland S. Modern epidemiology, Second edition. Philadelphia: Lippincott-
Raven Publishers, 1998.
Savitz DA, Barón AE. Estimating and correcting for confounder misclassification. Am J
Epidemiol 1989;129:1062–1071.
Sayle AE, Savitz DA, Thorp JM Jr, Hertz-Picciotto I, Wilcox AJ. Sexual activity during
late pregnancy and risk of preterm delivery. Obstet Gynecol 2001;97:283–289.
Stampfer MJ, Colditz GA, Willett WC, Manson JE, Rosner B, Speizer FE, Hennekens
CH. Postmenopausal estrogen therapy and cardiovascular disease. N Engl J Med
1991;325:756–762.
Steenland K, Stayner L. The importance of employment status in occupational cohort mor-
tality studies. Epidemiology 1991;2:418–423.
Strawbridge WJ, Deleger S, Roberts RE, Kaplan GA. Physical activity reduces the risk
of subsequent depression for older adults. Am J Epidemiol 2002;156:328–334.
Writing Group for the Women’s Health Initiative Investigators. Risks and benefits of es-
trogen plus progestin in healthy postmenopausal women. Principal results from the
Women’s Health Initiative randomized controlled trial. JAMA 2002;288:321–333.
5
SELECTION BIAS IN CASE–CONTROL STUDIES
For many years, there was the misperception that case–control studies were fun-
damentally inferior to cohort designs, suffering from backward logic (health out-
come leading to exposure rather than exposure to health outcome). As the con-
ceptual basis for the design is more fully understood (Miettinen, 1985), it has
become clear that the only unique threat to validity is the susceptibility to se-
lection bias. The logistics of selecting and enrolling cases and controls are often
fundamentally different from each other, making the concern with selection bias
justifiably prominent.
CONTROL SELECTION
ciation between exposure and disease, and also does so by comparing two groups,
i.e., the exposed and the unexposed. Unfortunately, there is a temptation to select
cases and controls in a manner that mimics the selection of exposed and unexposed
subjects in a cohort study. The roles of an unexposed group in a cohort
study and a control group in a case–control study, however, are entirely different.
In a cohort study, the goal is to select an unexposed group that has the same
baseline risk of disease as the exposed group, apart from any effect of the exposure
itself. If that goal is met, then the disease experience of the unexposed
group provides a valid estimate of the disease risk the exposed persons would
have had if they had not been exposed (counterfactual comparison). Cohort stud-
ies attempt to mimic a randomized trial or experiment in which the exposure
of interest is manipulated to ensure, to the maximum extent possible, that the
exposed and unexposed are identical in all respects other than the exposure of
interest.
In a case–control study, by contrast, given a set of cases with the disease, the
goal is to select controls who approximate the exposure prevalence in the study
base, that is, the population experience that generated the cases. The key compar-
ison to assess whether the control group is a good one is not between the cases and
the controls, but between the controls and the study base they are intended to ap-
proximate. The available cases define the scope of the study base, namely the pop-
ulation experience that gave rise to that particular set of cases. Once defined clearly,
the challenge for control selection is unbiased sampling from that study base. If
this is done properly, then the case–control study will give results as valid as those
that would have been obtained from a cohort study of the same population, subject
to sampling error. It should be noted, however, that biases inherent in that underlying
cohort, such as selection bias associated with exposure allocation, would be
replicated in the case–control study sampled from that cohort.
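A minimal sketch of why unbiased sampling of controls from the study base recovers the cohort result. It uses expected counts rather than random draws, a hypothetical study base, and controls sampled from the entire base at the start of follow-up, so that the exposure odds ratio estimates the risk ratio. All names and numbers are assumptions made for illustration.

```python
# Hypothetical study base followed for a fixed period.
base = {"exposed": 20_000, "unexposed": 80_000}
risk = {"exposed": 0.02, "unexposed": 0.01}

# Cases arising from the base, and the risk ratio a cohort analysis would give.
cases = {group: base[group] * risk[group] for group in base}
cohort_risk_ratio = risk["exposed"] / risk["unexposed"]

def exposure_odds_ratio(case_counts, control_counts):
    """Exposure odds among cases divided by exposure odds among controls."""
    case_odds = case_counts["exposed"] / case_counts["unexposed"]
    control_odds = control_counts["exposed"] / control_counts["unexposed"]
    return case_odds / control_odds

def expected_controls(study_base, sampling_fraction):
    """Expected controls when every member of the base is equally likely to
    be sampled, regardless of exposure."""
    return {group: n * sampling_fraction for group, n in study_base.items()}

odds_ratios_by_fraction = {
    fraction: exposure_odds_ratio(cases, expected_controls(base, fraction))
    for fraction in (0.005, 0.012, 0.05)
}

print(f"cohort risk ratio: {cohort_risk_ratio:.2f}")
for fraction, value in odds_ratios_by_fraction.items():
    print(f"sampling fraction {fraction}: case-control OR = {value:.2f}")
```

The size of the control sample affects precision only; as long as sampling is unrelated to exposure, the expected odds ratio matches the cohort risk ratio.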
Consider two studies of the same issue, agricultural pesticide exposure and the
development of Parkinson’s disease. In the cohort study, we identify a large pop-
ulation of pesticide users to monitor the incidence of Parkinson’s disease and an
unexposed cohort that is free of such exposure. We would then compare the in-
cidence of Parkinson’s disease in the two groups. In the case–control study, as-
sume we have a roster of Parkinson’s disease cases from a large referral center
and select controls for comparison from the same geographic region as the cases,
in order to assess the prevalence of exposure to agricultural pesticides in each
group and thereby estimate the association. The methodologic challenge in the
cohort study is to identify an unexposed cohort that is as similar as possible to
the exposed group in all other factors that influence the risk of developing Parkin-
son’s disease, such as demographic characteristics, tobacco use, and family dis-
ease history. Bias arises to the extent that our unexposed group does not gener-
ate a valid estimate of the disease risk the pesticide-exposed persons would have
had absent that exposure.
Selection Bias in Case–Control Studies 83
Bias arises in case–control studies not because the cases and controls differ on
characteristics other than exposure but because the selected controls do not ac-
curately reflect exposure prevalence in the study base. In our efforts to choose
appropriate controls for the referred Parkinson’s disease cases, we need to first
ask what defines the study base that generated those cases: what are the geographic
scope and the socioeconomic and behavioral characteristics of the source population
for these cases, which health care providers refer patients to this center,
etc. Once we fully understand the source of those cases, we seek to sample with-
out bias from that study base. The critical comparison that defines whether we
have succeeded in obtaining a valid control group is not the comparison of con-
trols to Parkinson’s disease cases but the comparison of controls to the study
base that generated those cases. Only if the cases are effectively a random sam-
ple from the study base, that is, only if it is a foregone conclusion that there are
no predictors of disease, would the goal of making controls as similar as possi-
ble to cases be appropriate.
Properly selected controls have the same exposure prevalence, within the range
of sampling error, as the study base. Selection bias distinctive to case–control
studies arises when the cases and controls are not coherent relative to one an-
other (Miettinen, 1985), i.e., the groups do not come from the same study base.
Thus, the falsely seductive analogy from “exposed and unexposed should be alike
in all respects except exposure” in cohort studies to “cases and controls
should be alike in all respects except disease” in case–control studies is simply incorrect.
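When controls do not reflect exposure prevalence in the study base, the distortion can be expressed through selection probabilities. The sketch below uses hypothetical counts and selection probabilities and applies the standard result that the observed odds ratio equals the true odds ratio multiplied by a ratio of selection probabilities.

```python
def odds_ratio(case_counts, comparison_counts):
    """Exposure odds among cases divided by exposure odds in the comparison group."""
    return ((case_counts["exposed"] / case_counts["unexposed"])
            / (comparison_counts["exposed"] / comparison_counts["unexposed"]))

def apply_selection(counts, selection_prob):
    """Expected number selected in each exposure group."""
    return {group: counts[group] * selection_prob[group] for group in counts}

# Hypothetical cases and the study base that generated them.
true_cases = {"exposed": 400, "unexposed": 800}
study_base = {"exposed": 20_000, "unexposed": 80_000}

# Assumption: all cases are enrolled, but exposed members of the base are
# 1.5 times as likely as unexposed members to be selected as controls.
case_selection = {"exposed": 1.0, "unexposed": 1.0}
control_selection = {"exposed": 0.018, "unexposed": 0.012}

true_or = odds_ratio(true_cases, study_base)
observed_or = odds_ratio(apply_selection(true_cases, case_selection),
                         apply_selection(study_base, control_selection))

selection_bias_factor = (
    (case_selection["exposed"] * control_selection["unexposed"])
    / (case_selection["unexposed"] * control_selection["exposed"])
)

print(f"true OR: {true_or:.2f}")
print(f"observed OR: {observed_or:.2f}")
print(f"true OR x bias factor: {true_or * selection_bias_factor:.2f}")
```

Note that the bias here arises solely from controls misrepresenting exposure prevalence in the base, not from any dissimilarity between cases and controls on other characteristics.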
In many case–control studies, however, the very definition of the study base
is complex. In some instances, the study base is defined a priori, e.g., all persons
enrolled in a given health care plan for a defined period of time, or all persons
who reside in a given geographic area over some time period, and the challenge
is to accurately identify all cases of disease that arise from that study base
(Miettinen, 1985). Given that the investigator has chosen the study base, the con-
ceptual definition is clear, though practical aspects of sampling from that base
in an unbiased manner may still pose a challenge. Random sampling from a ge-
ographically defined population is often not easy, at least in settings in which
population rosters are lacking.
In other instances, a roster of cases is available, for example, from a given
medical practice or hospital, and even the conceptual definition of the study base
is unclear. The investigator must consider the entire set of attributes that are pre-
requisites to being enrolled as a case. The conceptual definition of the study base
producing cases may include whether symptoms come to attention, whether peo-
ple seek a diagnosis for those symptoms, whether they have access to medical
care, and who they choose as their health care provider (Savitz & Pearce, 1988).
Thus, the assessment of whether a particular mechanism of control selection has
generated an unbiased sample from the study base (Miettinen, 1985) requires
careful evaluation and informed judgment.
Obtaining perfectly coherent case and control groups from the same study base
guarantees that there will be no additional selection bias introduced in the
case–control sampling beyond whatever selection bias may be inherent in the un-
derlying cohort. The failure to do so, however, does not automatically produce
selection bias; it just introduces the possibility. In a cohort study, the ultimate
purpose of the unexposed group is to estimate the disease risk of the exposed
group absent exposure. In a case–control study, the purpose of the controls is to
generate an accurate estimate of the exposure prevalence in the study base that
gave rise to the cases. Given this goal, by good fortune or careful planning, a
control group that is not coherent with the cases may nevertheless generate a
valid estimate of exposure prevalence in the study base that gave rise to the cases.
If, for example, the exposure of interest in a case–control study of melanoma
among women were natural hair color (associated with skin pigmentation and
response to sunlight), and we knew that hair color was not related to gender, we
might well accept the exposure prevalence estimates among male controls in a
geographically defined study base as a valid estimate for female cases. In no
sense could we argue that the controls constitute a random sample from the study
base that produced the cases, which must be exclusively female, yet the expo-
sure prevalence of the controls would be a valid estimate of the exposure preva-
lence in that study base under the assumptions noted above.
A second consideration is that a control group can be well suited to address
one exposure and yet be biased for assessing others. If controls are sampled in
a valid manner from the proper study base, then they will generate accurate es-
timates of prevalence for all possible exposures in the study base, and thus
case–control comparisons of exposure prevalence will generate valid measures
of association. However, with deviations from the ideally constituted controls,
the potential for selection bias needs to be considered on an exposure-by-
exposure basis. In the above example of a case–control study of melanoma, males
would not serve well as controls for female cases in efforts to address the preva-
lence of sunscreen use or diet, let alone reproductive history and oral contra-
ceptive use. The question of whether the controls have generated a good esti-
mate of exposure prevalence in the study base, and thus a valid measure of the
exposure–disease association of concern, must be considered for each exposure
of interest.
Among the most challenging exposures to evaluate are those that are associ-
ated with social factors or discretionary individual behaviors, e.g., diet, exercise,
tobacco use. These characteristics are often susceptible to selection bias in that
they may well be related to inclination to seek medical care, source of medical
care, and willingness to voluntarily participate in studies. In contrast, if exposure
were determined solely by genetic factors, e.g., blood type or hair color, or by
circumstances not based on conscious decisions, e.g., public water source or eating at
a restaurant discovered to employ a carrier of hepatitis, then selection bias would be less likely.
Therefore, it is much easier to choose controls for studies of some exposures,
such as blood type, than others, such as psychological stress or diet.
In asking whether a particular method of control selection constitutes an unbiased
method of sampling from the study base, we can correct for intentionally
unbalanced sampling, e.g., stratified sampling by demographic attributes
or cluster sampling. Consideration of confounding may justify
manipulation of the sampling of controls to better approximate the distribution
of the confounding factor among cases. Such manipulation of control selection
is a form of intentional selection bias (Rothman, 1986), which is then removed
through statistical adjustment. When it is known that stratification and adjust-
ment for the confounding factor will be required to obtain valid results, then there
may be some benefit from manipulating the distribution of the confounding fac-
tor among the controls. If that stratified sampling makes the distribution of the
confounding factor among controls more similar to the distribution among cases,
then the stratified analysis will be more statistically efficient and thus generate
more precise results than if the distribution were markedly different among cases
and controls.
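
The adjustment step that removes intentionally introduced imbalance can be sketched with a Mantel-Haenszel summary odds ratio computed across strata of the confounding factor. The counts below are hypothetical and the function names are illustrative; the sketch shows only the stratified analysis, not the gain in precision from balanced sampling.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio.

    Each stratum is a tuple (a, b, c, d):
      a = exposed cases,    b = unexposed cases,
      c = exposed controls, d = unexposed controls.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def crude_or(strata):
    """Odds ratio from the table collapsed over all strata."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a * d) / (b * c)


# Hypothetical counts for two levels of a confounding factor.
strata = [
    (60, 20, 45, 30),  # confounder present
    (10, 30, 10, 60),  # confounder absent
]

adjusted = mantel_haenszel_or(strata)
crude = crude_or(strata)
print(f"Crude odds ratio:    {crude:.2f}")
print(f"Adjusted odds ratio: {adjusted:.2f}")
```

In this constructed example the stratum-specific odds ratios are both 2.0, so the summary estimate recovers that value while the collapsed table overstates it.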
For example, we may be interested in the question of whether coffee con-
sumption is associated with the risk of developing bladder cancer. We know that
tobacco use is a major determinant of bladder cancer and also that coffee con-
sumption and smoking tend to be positively associated. Thus, we can anticipate
that in our analysis of the association between coffee consumption and bladder
of the association between exposure and disease, but rather their composition rel-
ative to one another. Thus, there is no such thing as poorly constituted cases or
poorly constituted controls, only groups that are incoherent with one another. In
practice, once one group, cases or controls, has been operationally defined, then
the desired attributes of the other are defined and the challenge is a practical one
of meeting that conceptual goal. Miettinen (1985) coined the terms primary study
base and secondary study base. With a primary study base, the definition of the
population-time experience that produces the cases is explicitly demarcated by
calendar periods and geographic boundaries. In such instances, the challenge is
to fully ascertain the cases that arise from within that study base and to accu-
rately sample controls from that study base. A secondary base corresponds to a
given set of cases identified more by convenience, such as those that appear and
are diagnosed at a given hospital, posing the challenge of identifying a means of
properly sampling controls from the ill-defined study base.
In reality, there is a continuum of clarity in the definition of study bases, with
the goal being the identification of a study base that lends itself to selection of
coherent cases and controls. A choice can be made in the scope of the study base
itself that will make coherent case and control selection more or less difficult. It
may be more useful to focus on the identification of a coherent base for identifying
both cases and controls than to first focus on making case or control selection
alone as easy as possible and then worry about how to select the other
group. The ability to formally define the geographic and temporal scope of a
study base is less critical than the practical ability to identify all the cases that
are produced in a study base and to have some way to properly sample controls
from it.
Coherence may sometimes be achieved by restricting the constitution of one
of the groups to make the task easier. For example, in a study of pregnancy
outcome based in prenatal care clinics, the case group may include women who
began normal pregnancies and developed the disease of interest, e.g., preg-
nancy-induced hypertension, as well as women who began prenatal care else-
where and were referred to the study clinic because they developed medical
problems, including the one of interest. The source of referrals is very difficult
to identify with clarity, since it depends on financial incentives, patient and
physician preferences, etc. Therefore, one option would be to simply exclude
those referred from other prenatal care providers from the case group and
thereby from the study base itself, and instead study non-referred cases and a
sample of patients who enrolled in prenatal care at the study settings without
necessarily being at high risk. Note that the problem is not in identifying re-
ferred cases, which is straightforward, but rather in sampling from the ill-
defined pool of pregnant women who would, if they had developed health prob-
lems, have been referred to the study clinics. Restricting cases and controls to
women who began their care in the study clinics improves the ability to ensure
that they are coherent.
such data sources as drivers’ license rosters is becoming more limited. Further-
more, public wariness manifested by increased proportions of unlisted telephone
numbers and use of telephone answering machines to screen calls has made telephone-based
sampling more fallible. At their best, the available tools such as random-digit
dialing telephone sampling, neighborhood canvassing, or use of drivers’
license rosters are far from perfect, even before contending with the problems of
non-response that follow. Conceptually, a geographically defined study base is at-
tractive, but it may not be so on logistical grounds.
Sampling from the study base that generates patients for a particular hospital or
medical practice raises even more profound concerns. The case group is chosen
for convenience and constitutes the benchmark for coherent control sampling, but
the mechanisms for identifying and sampling from the study base are daunting.
Without being able to fully articulate the subtleties of medical care access, prefer-
ence, and care-seeking behavior, diseased controls are often chosen on the as-
sumption that they experience precisely the same selection forces as the cases of
interest. To argue that choosing patients hospitalized for non-malignant gastroin-
testinal disease, for example, constitutes a random sample from the population that
produced the cases of osteoporotic hip fracture may be unpersuasive on both the-
oretical and empirical grounds. Such strategies are rarely built on careful logic and
there is no way to evaluate directly whether they have succeeded or failed, even
though by good fortune they may yield valid results. Their potential value would
be enhanced if it could be demonstrated that the other diseases are not related to
the exposure of interest and that the sources of cases are truly identical.
Controls selected in still more conceptually convoluted ways, such as friend controls,
also offer no direct assurance that they represent the appropriate study
base. We need to ask whether friend controls would have been enrolled as cases
in the study had they developed the condition of interest and whether they con-
stitute a random sample of such persons. Viewed from a perspective of sampling,
it is not obvious that such methods will yield a representative sample with re-
spect to exposure. When the procedure seems like an odd way of sampling the
study base, attention should be focused on the ultimate question of whether the
controls are odd in the only way that actually matters—Do they reflect the ex-
posure prevalence in the study base that produced the cases? That question is
synonymous with “Is selection bias present?”
rollment as a control. Some of the relevant attributes are easily and routinely con-
sidered, such as age range for eligibility and geographic scope of residence, but
others may be more subtle and difficult to assess.
Calendar time is an often-neglected component in the definition of the study
base, particularly when cases were diagnosed over some period of time in the
past. Case registries for diseases such as cancer are a convenient resource for
mounting case–control studies, but obtaining an adequate number of cases often
necessitates including not just cases diagnosed subsequent to the initiation of the
study, but also some who were diagnosed and registered in the past. Because the
cases occurred over some period of historical calendar time, the study base from
which controls are recruited should accommodate the temporal aspects of case
eligibility if the exposure prevalence may change over time. At the extreme, if
we had enrolled cases of colon cancer aged 45–74 diagnosed in metropolitan At-
lanta, Georgia during 1992–1995, the roster of Atlanta residents aged 45–74 in
1990 or 1998 would not be coherent with the cases due to the changes that oc-
curred in the dynamic cohort of residents. The questions must be asked, “Were
all members of the study base eligible for control selection at the time of case
occurrence, considering time-varying factors such as residence, age, and health
status? Would the roster of potential controls available at the instant of case oc-
currence correspond to those from which controls were actually sampled?” In the
ideal study, we would have enrolled cases as they occurred, in the period
1992–1995. As each case occurred, controls would be randomly sampled from
the roster of the persons in the community. Note that the roster would be differ-
ent in 1990, 1992, and 1995 due to changes in age, in- and out-migration, and
death. Only if the prevalence of exposure is invariant, and thus the prevalence
in 1990, 1992, 1995 and every year in between is the same, can we safely assess
exposure among controls over a different period of time than for cases and thus
employ a non-coherent study base.
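
The idea of asking who was eligible for control selection at the instant of case occurrence can be sketched as a small roster filter. The roster, the `Resident` fields, and the helper names below are hypothetical; age is approximated as calendar year minus birth year, and the 45–74 age range follows the Atlanta example above.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resident:
    name: str
    birth_year: int
    moved_in: int                     # first year of residence in the study area
    moved_out: Optional[int] = None   # year of departure or death; None if still present


def eligible_at(person, year, min_age=45, max_age=74):
    """True if the person belonged to the study base in the given year."""
    age = year - person.birth_year    # approximate age, ignoring birth month
    in_area = person.moved_in <= year and (
        person.moved_out is None or year < person.moved_out
    )
    return in_area and min_age <= age <= max_age


def risk_set(roster, year):
    """Residents eligible to be sampled as controls for a case occurring in `year`."""
    return [p for p in roster if eligible_at(p, year)]


# Hypothetical roster of a dynamic community cohort.
roster = [
    Resident("A", 1940, 1970),
    Resident("B", 1946, 1980, moved_out=1994),
    Resident("C", 1950, 1985),
    Resident("D", 1930, 1996),
    Resident("E", 1919, 1960),
]

eligible_1992 = [p.name for p in risk_set(roster, 1992)]
eligible_1995 = [p.name for p in risk_set(roster, 1995)]
print("Eligible in 1992:", eligible_1992)
print("Eligible in 1995:", eligible_1995)

# Sample one control for a case diagnosed in 1992 from that year's roster.
rng = random.Random(0)
control = rng.sample(risk_set(roster, 1992), 1)[0]
print("Control sampled for a 1992 case:", control.name)
```

Even in this tiny roster the eligible set changes between 1992 and 1995 through aging and out-migration, which is why a roster drawn at a single later date need not be coherent with cases accrued over several years.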
Several studies addressing the potential association between elevated levels of
magnetic fields from electric power lines and childhood cancer selected controls,
at least in part, some years after the cancer cases had been diagnosed (Savitz
et al., 1988; London et al., 1991; Preston-Martin et al., 1996). Given the rarity of
childhood leukemia or brain cancer, accrual of a sufficient number of cases through
prospective monitoring of the population is challenging. This requires either
conducting the study in a very large population from which cases arise, as was done
in a study in large segments of the Midwest and eastern United States (Linet et al.,
1997) and in the United Kingdom (United Kingdom Childhood Cancer Study
Investigators, 1999), or sustaining an active study to accrue cases over many years.
It is much more efficient in time and expense to extend the case accrual period
into the past using historical cancer registry data rather than solely into the future.
For example, in the study conducted in Denver, data collection began in 1984,
yet cases diagnosed as early as 1976 were eligible (Savitz et al., 1988). Given a
case that occurred several years before data collection for the study began, for
example in 1976 or 1977, how do we properly select controls from the study
base of the past that no longer exists?
Imagine the fate of the roster of eligible controls for a case occurring eight
years before the start of the study. Over the intervening years, the population of
the geographic area has changed, some otherwise eligible children have died, and
those children of similar age as the case are now much older. Age is easily back-
calculated to determine who, based on current age, would have been eligible at
some point in the past. Past residence is a much more serious concern. Of all the
potentially eligible controls at the time of case occurrence, a small number of
children have died, but many may have moved out of the geographic area and
many new children have probably moved into the geographic area. One source
of ineligibility for control selection is rather easily addressed, namely whether
the potential control resided in the study area at the time in the past when the
case occurred. We can simply ask this present-day resident where they were liv-
ing at the relevant time in the past. The otherwise eligible potential control who
has left the geographic area, however, cannot readily be identified. We simply
have no convenient way of tracking and including those who moved elsewhere
but would have been eligible for sampling if we had been conducting our study
at the optimal time in the past. Thus, our ideal roster of controls is inaccessible
in the absence of true historical population registers, such as those that can be
reconstructed in Sweden (Feychting & Ahlbom, 1993).
In the Denver study of magnetic fields and cancer, we chose to restrict the
controls to those who resided in the area at the time of case diagnosis and con-
tinued to reside there at the time the study was initiated (Savitz et al., 1988; Savitz
& Kaune, 1993a). There have been concerns raised over the consequences of our
inability to fully sample from the study base and the resulting exclusion of res-
identially mobile controls (Jones et al., 1993). The magnitude of bias from this
restriction is difficult to assess directly. A suggested solution to this problem is
to restrict the cases comparably to those who remained residentially stable fol-
lowing their diagnosis. The logic of restricting on postdiagnosis experience, how-
ever, is questionable and it is quite possible that the reasons for moving would
differ among families who suffered from having a child diagnosed with cancer
compared to other families.
The optimal solution to the temporal incoherence of cases and controls is to
eliminate it. If there were rosters from times in the past, we could effectively
sample from the study base. As noted, records in Sweden allow reconstruction
of neighborhoods as they were configured in the past. Records from schools,
health care plans, birth records, town records, telephone directories, drivers’ li-
cense rosters, or voter registration lists are examples of data resources that allow
stepping back to the time of interest. Each has imperfections and potential sources
of selection bias, and the challenge of locating persons who are identified through
such historical rosters is apparent. Any archival information that allows for se-
lection from the desired historical population roster is worthy of serious consid-
eration. The only alternative is to mount studies in large enough populations to
permit control selection to be concurrent with case diagnosis.
Selection bias from non-concurrent case and control selection can arise when
trying to define a suitable sampling frame for cases whose disease began prior
to the initiation of data collection for the study. Often, cases diagnosed in the
past are combined with those newly diagnosed as the study progresses, leading
to subsets of concurrent (for newly diagnosed cases) and non-concurrent (for past
cases) controls. Even within the stratum of non-concurrent, the more remote in
time, the more susceptible to selection bias, whatever the exact mechanism that
produces it. Controls selected for marginally non-concurrent cases, e.g., those in
the past year, are less susceptible to this bias than controls selected for cases di-
agnosed in the more remote past, e.g., five years ago. Critiques of studies of res-
idential magnetic fields associated with power lines near homes and childhood
leukemia had appropriately raised concerns about the influence of selection bias
due to this phenomenon (Poole & Trichopoulos, 1991; Jones et al., 1993) in a
study in Denver, Colorado (Savitz et al., 1988). Specifically, the study data col-
lection began in 1983, with cases diagnosed over the period 1976–1983. The
most direct test to address all conceivable problems of non-concurrence is to
stratify cases by degree of non-concurrence using the year of diagnosis, i.e.,
1976–1979 (more non-concurrent) and 1980–1983 (less non-concurrent), to as-
sess the patterns of association in those groups. In this instance, the magnitude
of association was stronger, not weaker, for the most recently diagnosed cases
(Table 5.1), suggesting this form of selection bias was unlikely to have biased
the odds ratio upwards (Savitz & Kaune, 1993b).
TABLE 5.1. Stratum-Specific Results for High Wire Code Versus Low Wire Code:
Denver, Colorado, 1976–1983, Interviewed Subjects

                         Total Cancers       Leukemia            Brain Cancer
PARAMETER                OR    CI            OR    CI            OR    CI
Age at diagnosis
  0–4 years              1.9   0.9–4.2       3.9   1.4–10.6      1.0   0.2–5.3
  5–9 years              1.8   0.3–10.2      2.3   0.3–19.0      2.3   0.3–19.0
  10–14 years            2.9   1.0–8.5       4.1   0.9–19.0      6.2   1.0–38.2
Gender
  Male                   1.6   0.8–3.3       2.4   1.0–6.0       1.7   0.5–6.2
  Female                 3.3   1.2–9.1       7.0   1.9–26.3      2.4   0.5–11.2
Father’s education
  16 years               1.8   0.9–3.8       4.2   1.6–11.1      2.3   0.8–7.0
  6 years                2.3   0.8–6.2       2.5   0.7–8.9       —
Per capita income
  $7000/year             2.1   1.0–4.4       3.6   1.5–8.9       1.7   0.6–5.2
  $7000/year             1.8   1.1–3.1       2.8   0.7–11.4      1.8   0.2–18.5
Year of diagnosis
  Before 1980            1.3   0.5–3.5       1.8   0.5–6.3       0.9   0.2–5.0
  1980 or later          3.7   1.4–9.7       7.1   2.3–22.1      3.8   1.5–9.9
Residential stability
  Unstable               2.9   1.3–6.6       4.7   1.8–12.5      1.7   0.4–7.1
  Stable                 1.5   0.6–3.6       2.7   0.8–9.4       2.1   0.5–8.5
Residence type
  Single family          2.1   1.1–4.1       3.9   1.7–8.9       2.0   0.6–6.2
  Other                  2.0   0.4–11.2      2.8   0.3–28.3      1.4   0.1–13.6

OR, odds ratio; CI, confidence interval.
Savitz & Kaune, 1993.
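
The contrast between the two year-of-diagnosis strata in Table 5.1 can be examined with a rough calculation from the published odds ratios and intervals. The sketch below assumes, without confirmation from the source, that the intervals are 95% intervals symmetric on the log scale and that the two strata are independent; the function names are illustrative.

```python
import math


def se_log_or(lower, upper, z=1.96):
    """Approximate standard error of ln(OR) recovered from a 95% interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def compare_strata(or1, ci1, or2, ci2):
    """Return (ratio of odds ratios, z statistic, two-sided p-value)."""
    difference = math.log(or2) - math.log(or1)
    se = math.sqrt(se_log_or(*ci1) ** 2 + se_log_or(*ci2) ** 2)
    z = difference / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return or2 / or1, z, p


# Total cancers, year of diagnosis rows of Table 5.1.
ratio, z, p = compare_strata(1.3, (0.5, 3.5), 3.7, (1.4, 9.7))
print(f"Ratio of odds ratios (1980 or later vs. before 1980): {ratio:.2f}")
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

Under these assumptions the later stratum has an odds ratio nearly three times that of the earlier stratum, though the wide intervals mean the contrast itself is imprecisely estimated.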
First, we must focus on exactly how and why the cases ended up going to that
particular hospital, beyond the fact of their health condition. Location of resi-
dence or workplace is often influential. If there are a small number of cases who
came to that hospital for peculiar reasons, for example, they were visiting friends
who live in the area when their hip fracture occurred, we may prefer to exclude
them since the roster of such visitors would be virtually impossible to identify
for control selection purposes. The identity of the woman’s regular physician, or
whether she even has a regular physician, may influence the specific hospital she
would go to when hip fracture occurs. Financial aspects of their health care, such
as insurance plan or Medicare/Medicaid eligibility, could influence the patient’s
likely source of care.
All the steps that resulted in the identification of these cases contribute to the
definition of the study base, and therefore are elements to be considered in con-
stituting the sampling frame for selection of controls. If the geographic coverage
of the hospital is well defined, then only persons who reside in that area are part
of the study base. In fact, if the medical care system were based solely on ge-
ography, then residence would unambiguously determine source of medical care.
In the United States and many other settings, however, geography alone does not
determine the health care provider. The choice of physician, insurance coverage,
and physician and patient preferences are often relevant considerations in the se-
lection of a hospital. If medical practices constituted the network for patient re-
ferral, then the study base would consist of patients seen within those practices,
since they would have come to the study hospital in case of a hip fracture. If par-
ticular insurance plans direct patients to that hospital, then to be eligible as a con-
trol, the woman should be insured through such a plan. One reason that health
maintenance organizations are particularly attractive for epidemiologic research,
as noted above, is that the source population is unambiguously defined by en-
rollment in the plan. In the current, largely disorganized system of medical care
in the United States, anticipating who would go to a given health care facility
for a particular condition is a complex and only partially predictable process de-
pendent on physician inclination, whether alternative facilities are available, fi-
nances, and subjective considerations of physician and patient preferences.
In light of this complexity, the traditional approach is to try to identify other
health conditions that have a comparable source population (Miettinen, 1985),
even without being able to articulate just what is required to be a member of that
source population. Thus, we may speculate that women who come to the hospi-
tal for acute gastrointestinal conditions (gallstones, appendicitis) effectively rep-
resent a random sample from the study base, at least with respect to the expo-
sure of interest. The conditions that define the source of controls must be unrelated
to calcium intake, of course, for this strategy to be valid. Inability to operationally
define the study base and thereby circumscribe the roster of potential controls
constitutes a major disadvantage in evaluating the suitability of the chosen con-
trols. Some degree of luck is required for the sampling mechanism of choosing
persons with other diseases to yield an effectively random sample from the study
base, and there is no direct way to determine if one has succeeded. It is clearly
not a conceptually appropriate control group in terms of identifying and sam-
pling from a well-defined roster, and will yield a valid measure of association
only under the assumption that it nonetheless yields an accurate estimate of the
prevalence of the exposure of interest in the study base.
Choosing controls based on having other, specific health conditions presumes
that the exposure of interest has no direct or indirect positive or negative relation
to the controls’ diseases. The classic concern is that the exposures are as yet undis-
covered risk factors for the control disease, as occurred in choosing patients with
chronic bronchitis as controls in an early study of lung cancer and cigarette smok-
ing (Doll & Hill, 1950). At that time, it was believed that smoking was unlikely
to be related to bronchitis, so that choosing bronchitis patients would give a good
estimate of smoking prevalence in the source population. Given the epidemiolo-
gists’ well-founded belief that disease does not occur randomly, it is difficult to ar-
gue with confidence that a given exposure has no relation, direct or indirect, pos-
itive or negative, with a given disease. When we choose the presence of a disease
as the basis for sampling controls, and the exposure of interest is related through
any means, not necessarily causal, to the disease for which controls are sampled,
the estimate of exposure prevalence in the study base will be distorted.
There are also more subtle ways in which persons with illness may have ex-
posures that are not representative of those in the appropriate study base, espe-
cially when health behaviors or other aspects of lifestyle are involved. Continu-
ing with the interest in calcium intake and osteoporotic hip fracture, assume the
goal is to identify hospitalized patients whose past diet is representative of the
study base of patients with hip fracture and we have chosen patients with benign
gynecologic conditions. Assume that these hospitalized patients are truly mem-
bers of the study base, i.e., if they had experienced a hip fracture, they would
have become cases in the study. Thus, the only question is whether the sampling
mechanism of other diseases is suitable.
First, diet is likely to have an etiologic relationship with a wide range of con-
ditions, some of which have not yet been discovered. It is difficult to argue that
any health condition is certain to be free of dietary influences. Second, early or
preclinical illness is likely to alter diet subtly so that even in the absence of a
causal relationship with the other diseases, reported diet may be distorted. Even
when respondents are asked to report on diet at times in the more remote past,
they tend to be influenced by recent diet (Wu et al., 1988), and may well report
diet that has been altered by early or subclinical disease. Third, even if past diet
were truly representative of the study base, the reporting of it may be affected
by the presence of an unrelated illness. The psychological impact of the illness
may well cause patients to misreport diet. (The point concerns information bias
whether they fall within an expected range, with appropriate caution, exposure
prevalence among controls can sometimes be beneficially compared to the ex-
posure prevalence in external populations. Because typically we have less exten-
sive information concerning exposure prevalence than disease patterns in exter-
nal populations, however, the opportunity to apply this strategy in case–control
studies is more limited than the corresponding approach in cohort studies.
Controls are selected in a case–control study to provide an estimate of expo-
sure prevalence in the study base; for example, the proportion of women using
estrogen replacement therapy or the proportion of men aged 50–59 who eat five
or more servings of beef per week. If data from population surveys on preva-
lence of use of estrogen replacement therapy were available, for example, strat-
ified as needed by age, social class, and other important influences on patterns
of use, then a comparison could be made between exposure prevalence in the
study controls and exposure prevalence in the general population. Data are most
widely available for exposures of general interest, such as reproductive history,
use of medications, tobacco and alcohol use, diet, and certain social and eco-
nomic factors. Even when such data are available, however, the exact method of
measuring and reporting them may differ from the methods used in the case–
control study and thereby diminish the informativeness of the comparison. If per-
fectly suitable data were already available from an appropriate population, then
there would be no need to identify and collect information from controls at all.
At best, data from somewhat similar populations on roughly comparable exposures
can yield comparisons with the study controls that can identify gross aberrations.
If supplemental estrogen use in populations sociodemographically similar
to those in the study ranges from 10% to 20%, and the controls in our
study report 3% use or 53% use, we would have reason to look carefully at the
manner in which the controls were chosen (as well as our methods for ascer-
taining supplemental estrogen use). If we measured exposure in the reported range
or close to it, however, we would have some added confidence that the controls
are more likely to be appropriately constituted and the exposure was properly
assessed.
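
The comparison described above can be sketched as a simple screen: compute an interval for the exposure prevalence among controls and ask whether it overlaps the range reported for external populations. The control group size of 200 and the use of a Wilson interval are assumptions made for illustration; the 10% to 20% external range and the 3% and 53% values follow the example in the text.

```python
import math


def wilson_interval(exposed, total, z=1.96):
    """Wilson score interval for a proportion (approximately 95% by default)."""
    p = exposed / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def within_external_range(exposed, total, external_low, external_high):
    """True if the controls' interval overlaps the external prevalence range."""
    low, high = wilson_interval(exposed, total)
    return low <= external_high and high >= external_low


n_controls = 200  # hypothetical number of controls
for exposed in (6, 30, 106):  # 3%, 15%, and 53% reporting use
    ok = within_external_range(exposed, n_controls, 0.10, 0.20)
    verdict = "consistent with external data" if ok else "examine control selection and exposure measurement"
    print(f"{exposed / n_controls:.0%} use among controls: {verdict}")
```

A flag raised by such a screen indicates only that something deserves a closer look; it cannot distinguish a problem in control selection from a difference in how exposure was measured.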
Continuing with the example of estrogen use and endometrial cancer intro-
duced previously, Hulka et al. (1980) examined the question of whether select-
ing controls based on having had a D&C would produce an erroneous estimate
of the prevalence of estrogen use in the population. They examined three con-
trol selection strategies, one consisting of community controls (the “gold stan-
dard” for this purpose), the second consisting of women with other gynecologi-
cal conditions, and the third consisting of women who had undergone D&Cs. As
anticipated, relative to the community controls, the D&C controls had an inflated
prevalence of estrogen use, reflecting selection bias (Table 5.3). Approximately
35% of white women and 24% of African-American women had used estrogens
in the D&C group as compared to 27% and 8% of white and African-American
TABLE 5.3. Percent of Cases and Controls Reporting Any Estrogen Use, by Race: Case-Control Study of Endometrial Cancer and Exogenous Estrogen, North Carolina, 1970–1976

                                       CONTROLS
                  CASES                D&C                  GYNECOLOGY           COMMUNITY
                  TOTAL  ESTROGEN      TOTAL  ESTROGEN      TOTAL  ESTROGEN      TOTAL  ESTROGEN
RACE              NO.    NO.    %      NO.    NO.    %      NO.    NO.    %      NO.    NO.    %
White             186    61     32.8   208    72     34.6   153    35     22.9   236    64     27.1
African-American  70     7      10.0   108    26     24.1   71     9      12.7   85     7      8.2

D&C, dilatation and curettage.
Hulka et al., 1980.
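As a rough illustration of how the choice of control group changes the result, the counts in Table 5.3 can be turned into crude odds ratios for each control series. This is a minimal sketch: the function and variable names are illustrative, and the crude ratios computed here are not the estimates reported by Hulka et al. (1980).

```python
# Counts copied from Table 5.3 (Hulka et al., 1980): (total, estrogen users).
TABLE_5_3 = {
    "White": {
        "Cases": (186, 61),
        "D&C": (208, 72),
        "Gynecology": (153, 35),
        "Community": (236, 64),
    },
    "African-American": {
        "Cases": (70, 7),
        "D&C": (108, 26),
        "Gynecology": (71, 9),
        "Community": (85, 7),
    },
}


def crude_odds_ratio(case_total, case_exposed, control_total, control_exposed):
    """Crude exposure odds ratio comparing cases with one control group."""
    case_odds = case_exposed / (case_total - case_exposed)
    control_odds = control_exposed / (control_total - control_exposed)
    return case_odds / control_odds


def odds_ratios_by_control_group(table):
    """Return {(race, control_group): crude OR} for every control series."""
    results = {}
    for race, groups in table.items():
        case_total, case_exposed = groups["Cases"]
        for group, (total, exposed) in groups.items():
            if group == "Cases":
                continue
            results[(race, group)] = crude_odds_ratio(
                case_total, case_exposed, total, exposed
            )
    return results


CRUDE_ORS = odds_ratios_by_control_group(TABLE_5_3)

if __name__ == "__main__":
    for (race, group), value in CRUDE_ORS.items():
        print(f"{race:17s} {group:11s} crude OR = {value:.2f}")
```

With these counts the D&C series gives a lower crude odds ratio than the community series in both race groups, which is the direction expected if D&C controls overstate estrogen use in the study base.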
come from either the controls in the study or from the external population sur-
vey to which they are compared. Like many “red flags,” the disparity in expo-
sure prevalence is a trouble sign but not a definitive indicator that trouble is pres-
ent or just what has caused the trouble.
activity levels did not decline with advancing age among the controls and per-
haps even rose with advancing age. This would run counter to the expected pat-
terns of declining physical activity with advancing age, suggesting that we had
obtained a sample that was deviant among older age groups.
An empirical application of this strategy comes from a study of serum lycopene
(an antioxidant form of carotenoid found in fruits and vegetables) in relation to
the risk of prostate cancer (Vogt et al., 2002). A multicenter case–control study
was conducted in the late 1980s in Atlanta, Detroit, and 10 counties in New Jer-
sey. Controls were chosen through random-digit dialing for men under age 65
and through the Health Care Financing Administration records for men age
65 and older. Among a much larger pool of participants, 209 cases and 228 con-
trols had blood specimens analyzed for lycopenes. Serum lycopene was inversely
associated with risk of prostate cancer and found to be lower among African-
American controls as compared to white controls (Table 5.4). To corroborate
the plausibility of lower levels among African Americans (who experience a
markedly higher risk of prostate cancer generally), the authors examined perti-
nent data from the National Health and Nutrition Examination Survey. In fact,
there is strong confirmatory evidence that African Americans in the United States
do have lower lycopene levels than whites across the age spectrum (Fig. 5.1).
Other methodological concerns aside, this pattern provides evidence in support
of having enrolled reasonably representative African-American and white men
into the case–control study.
Internal comparisons could, of course, reveal the patterns that would be expected
based on prior information, but still have stratum-specific and overall exposure
prevalences that differ from those in the study base. If we recruited
our controls for the study of physical activity and myocardial infarction by ran-
dom digit dialing, and had a resulting preference for women who stayed at home
across the age spectrum, we might well over-sample physically inactive women
with some fraction of such women unable to maintain employment due to lim-
ited physical ability. The patterns by age might still be exactly as expected, but
with a selectively inactive sample within each stratum and therefore a biased
sample overall. Nonetheless, for at least some hypothesized mechanisms of se-
lection bias, we would expect the extent of it to vary across strata of other ex-
posure and disease predictors, and for those candidate pathways, examination of
exposure prevalence across subgroups may be useful.
[Figure 5.1: plot of serum lycopene (µg/dl) by age group (40-44 through 75-79 years) for Whites and Blacks.]
FIGURE 5.1. Median serum lycopene concentrations by age and race among males from
the Third National Health and Nutrition Examination Survey, 1988–1994 (Vogt et al.,
2002).
stratified by father’s education and per capita income in the family (Table 5.1).
The expected pattern of results would be to observe less bias in the upper edu-
cation and income groups. In this case, there is some tendency for the measures
of association to be less elevated among upper socioeconomic status participants
overall, but subject to the imprecision in stratified analyses, the odds ratios are
not markedly different across strata.
One of the challenges in interpreting the results of this aspect of potential se-
lection bias is the inability to distinguish between measures of association that
truly differ across subgroups (effect measure modification) and varying meas-
ures of association across strata that result from differential selection bias across
strata. In the above examples, if physical activity truly had a different effect on
risk of myocardial infarction among younger and older women, the exact same
pattern might be seen as the one that would result from selection bias that af-
fects younger and older women differently. Similarly, in the example of resi-
dential magnetic fields and childhood cancer, if there were some reason that the
pattern of risk truly differed by socioeconomic status, the effect of selection bias
could not be isolated from a genuine modification of the measure of association
across strata of education or income. If there were changes in the prevalence of
other conditions necessary for the causal process to operate or changes in the na-
ture of the exposure across socioeconomic groups, the exact same pattern of re-
sults could be found. As is often the case, outside evidence and insights need to
be applied in assessing the implications of apparent effect measure modification.
When we can identify and measure the process by which selection bias is thought
to operate, we can adjust for those determinants just as we adjust for confounders.
Some forms of selection bias can be viewed as unintentional stratified sampling,
exactly comparable to intentional stratified sampling as discussed earlier under
Selection of Controls from the Study Base if the selection acts to sample ran-
domly within the strata. Thus, if the method of sampling from the study base has
generated an excess (or deficit) of men, or younger people, or those who reside
in one county rather than another, we can readily stratify and adjust for those at-
tributes in the analysis. The question then is whether there is selection bias within
the strata, i.e., whether among young men sampled from a given county the ex-
posure prevalence is reflective of young men in that county.
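A minimal sketch of the point about unintentional stratified sampling, using invented counts: controls over-represent men but are sampled at random within each sex. The Mantel-Haenszel estimator is used here as one familiar way to stratify and adjust; the text does not prescribe a particular estimator, and all numbers and names below are hypothetical.

```python
# Hypothetical case-control data; each stratum is (a, b, c, d) =
# (exposed cases, unexposed cases, exposed controls, unexposed controls).
# Controls over-sample men (500 vs. 100) but are random within each sex:
# exposure prevalence is 50% among male and 20% among female controls.
STRATA = {
    "men": (200, 100, 250, 250),
    "women": (80, 160, 20, 80),
}


def odds_ratio(a, b, c, d):
    """Odds ratio for a single 2x2 table."""
    return (a * d) / (b * c)


def crude_odds_ratio(strata):
    """Collapse over strata, ignoring the unbalanced control sampling."""
    a, b, c, d = (sum(cells) for cells in zip(*strata.values()))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(strata):
    """Stratified summary odds ratio (Mantel-Haenszel weights)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
    return numerator / denominator


CRUDE = crude_odds_ratio(STRATA)
ADJUSTED = mantel_haenszel_odds_ratio(STRATA)

if __name__ == "__main__":
    for name, cells in STRATA.items():
        print(f"{name}: stratum OR = {odds_ratio(*cells):.2f}")
    print(f"crude OR = {CRUDE:.2f}, stratified OR = {ADJUSTED:.2f}")
```

In this constructed example the crude odds ratio is about 1.32, while both stratum-specific ratios and the stratified summary equal 2.0. The distortion is removed only because sampling was assumed to be random within each sex; selection within strata would not be corrected this way.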
In the conventional application of sampling principles, the question of selec-
tion bias is equivalent to a question of whether the exposure prevalence is cor-
rectly estimated within strata. In the above example using a drivers’ license reg-
istry to sample women for a case–control study of physical activity and
myocardial infarction, the sampling is distinctly non-random across age strata.
TABLE 5.5. Distribution of Sample Chosen by Random Digit Dialing and Census Population Aged 40–74 Years According to Whether Respondents Had Certain Screening Tests, Otsego County, New York, 1989

                                               RANDOM DIGIT
                                               DIALING SAMPLE    CENSUS POPULATION
SCREENING TEST                                 NO.    %          NO.      %
Had blood pressure checked in past 2 years
  Yes                                          306    89.7       13,403   86.1
  No                                           30     8.8        1503     9.7
  No response                                  5      1.5        657      4.2
  Total                                        341    100.0      15,563   100.0
Had cholesterol checked in past 2 years*
  Yes                                          230    67.4       8699     55.9
  No                                           102    29.9       5855     37.6
  No response                                  9      2.6        1009     6.5
  Total                                        341    99.9       15,563   100.0
Ever had stool test or rectal examination
for colon cancer
  Yes                                          174    51.0       7215     46.4
  No                                           155    45.5       7238     46.5
  No response                                  12     3.5        1110     7.1
  Total                                        341    100.0      15,563   100.0
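The comparison in Table 5.5 can be summarized as the difference, in percentage points, between the random digit dialing sample and the census population for each screening test. The sketch below uses the "Yes" counts and totals from the table; the variable and function names are illustrative only.

```python
# Totals and "Yes" counts from Table 5.5, ordered as
# (random digit dialing sample, census population).
# Non-respondents stay in the denominators, as in the table's percentages.
TOTALS = (341, 15563)
YES_COUNTS = {
    "blood pressure checked in past 2 years": (306, 13403),
    "cholesterol checked in past 2 years": (230, 8699),
    "stool test or rectal examination ever": (174, 7215),
}


def percent(count, total):
    """Percentage of a group answering 'Yes'."""
    return 100.0 * count / total


def prevalence_gaps(yes_counts, totals):
    """Percentage-point excess of the RDD sample over the census population."""
    rdd_total, census_total = totals
    return {
        test: percent(rdd_yes, rdd_total) - percent(census_yes, census_total)
        for test, (rdd_yes, census_yes) in yes_counts.items()
    }


GAPS = prevalence_gaps(YES_COUNTS, TOTALS)

if __name__ == "__main__":
    for test, gap in GAPS.items():
        print(f"{test}: {gap:+.1f} percentage points")
```

All three gaps are positive, with the largest (roughly 12 percentage points) for cholesterol screening, which fits the concern that the telephone sample over-represents people who receive more health screening.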
menopausal estrogen use and osteoporosis, there might be no causal relation be-
tween screening history and that outcome. If there were an association between
screening and use of estrogens, however, which is plausible, then the distortion
due to random digit dialing would require adjustment in the analysis. The con-
trol sampling mechanism would have generated an association with disease sta-
tus because of the overrepresentation of women who tend to have more health
screening (and may well have a higher prevalence of estrogen use as well).
The impact of factors that cannot be captured directly or completely can still
be addressed to some extent. We rarely have a precise measure of the source
of selection bias, e.g., proclivity to seek medical care, health consciousness,
or willingness to participate in telephone surveys. We may have markers,
however, that are at least associated to some extent with those attributes, for
example, insurance coverage, frequency of routine physical examinations, and
level of education. In the same manner that adjustment for an imperfectly meas-
ured confounder adjusts only partially, adjustment for these imperfect markers
would adjust partially for the selection bias. Not only might adjustment for such
factors yield a less biased measure of association, but the comparison of unad-
justed and adjusted measures of association would help to determine the direc-
tion of bias and estimate how large the residual effect is likely to be, analogous
to the examination of residual confounding (Savitz & Barón, 1989). If adjustment
for the proxy indicator shifted the measure of association in a given direction,
then we can safely expect that a refined measure of that attribute would
have shifted the measure of association even further in the same direction, and
if adjusting for the marker has a large impact on the measure of association, more
complete adjustment is likely to move the estimate farther still.
TABLE 5.6. Odds Ratios for Covariates in the Hip Fracture Study, Swedish Mammography Cohort, 1987–1995

                                   CASE–PATIENTS  CONTROLS  ODDS RATIO
VARIABLE                           n              n         (95% CI)
Physical activity
  Quartile 1 (lowest activity)     76             185       1.0 (reference)
  Quartile 2                       57             151       1.05 (0.66–1.67)
  Quartile 3                       68             324       0.58 (0.38–0.88)
  Quartile 4 (highest activity)    40             175       0.59 (0.36–0.97)
Menopausal age
  <49 years                        77             268       1.0 (reference)
  49–50 years                      107            305       1.35 (0.92–1.99)
  51–52 years                      26             129       0.71 (0.41–1.21)
  >52 years                        37             172       0.78 (0.48–1.27)
Menopausal status
  Premenopausal                    29             124       1.0 (reference)
  Postmenopausal                   218            750       1.10 (0.66–1.81)
Smoking status
  Never smoker                     161            653       1.0 (reference)
  Former smoker                    42             128       1.40 (0.90–2.18)
  Current smoker                   44             93        1.87 (1.18–2.98)
Hormone replacement therapy
  Never user                       229            741       1.0 (reference)
  Former user                      15             80        0.58 (0.32–1.08)
  Current user                     3              53        0.11 (0.02–0.49)
Oral contraceptives
  Never user                       232            812       1.0 (reference)
  Ever user                        15             62        0.73 (0.35–1.56)
Oral cortisone
  Never user                       227            815       1.0 (reference)
  Ever user                        17             53        1.18 (0.65–2.14)
Diabetes mellitus
  No diabetes                      219            835       1.0 (reference)
  Oral treatment of diabetes       17             30        2.45 (1.24–4.86)
  Insulin treatment of diabetes    11             9         5.10 (1.87–13.9)
impact imperfect selection mechanisms are likely to have on estimating its preva-
lence. Consideration of the deviations from the ideal method of sampling from
the study base should focus on the impact on exposure prevalence. Are the er-
roneously included or excluded segments of the study base likely to have an ex-
posure prevalence that differs from the properly constituted study base? If so, in
which direction and by how much?
Evaluation of control selection strategies that are not closely linked to a de-
fined study base, such as selection of hospital controls, must go directly to the
evaluation of whether the exposure prevalence that has been generated is likely
to be similar to that which would have been obtained by sampling the appropri-
ately defined study base. In other words, the mechanism of control selection is
so far removed from sampling the study base that we can only consider whether
it is likely to have yielded a valid result, not whether the mechanism was a good
one. A distinct disadvantage of such an approach is the difficulty in addressing
this question with empirical evidence.
Sometimes, the prevalence of exposure can be compared to suitable external
populations to determine whether the chosen controls are roughly similar to oth-
ers who have been appropriately constituted and measured. The pattern of ex-
posure distribution among the controls may be compared to known or expected
patterns to determine whether there is likely to have been a differential selection
bias among the controls. If there are segments of the study base that are likely
to have been sampled more effectively than others, measures of association should
be generated with stratification on those potential markers of the degree of se-
lection bias. The more valid result comes from the stratum in which selection
bias is least likely to have occurred. The influence of adjustment for markers,
both direct and indirect, of selection bias should be evaluated to determine the
direction and amount of influence of adjustment. Where there is an imperfect
proxy measure of the basis for selection bias rather than the exact measure of in-
terest, the influence of the unmeasured factor on the results should be estimated
and the bias controlled in the same manner as confounding. Replication of
known and strongly suspected exposure–disease associations should be attempted
and failures to do so considered in more detail. All these tools are suitable
for control selection mechanisms that attempt to choose directly from the
study base as well as for those mechanisms that do not. The more that the method
of control selection deviates from sampling the study base, however, the greater
the need for evidence that the result is valid.
Whereas a perfectly coherent set of controls for a given set of cases assures
freedom from selection bias, a non-coherent control group does not guarantee
that bias will occur. That depends entirely on the exposure of interest and whether
that exposure is related to the source of the lack of coherence. In the example
regarding residential mobility and magnetic field exposure from power lines,
there would be no selection bias if residential mobility were not associated with
nearby electrical wiring and residential magnetic fields. If the source of non-
coherence were unrelated to the exposure, then restricting the sampling to those
potential controls who did not change residences over the period between case
diagnosis and study conduct would introduce no bias. Sampling in an unbalanced
way from those who are residentially stable relative to those who are residen-
tially mobile is only a problem if residential mobility is related to the exposure
of interest. A group that is not coherent based on having omitted residentially
unstable members could still generate the correct estimate of exposure preva-
lence in the study base and thus the correct measure of association. Using as a
benchmark the ideal source of controls, critical evaluation must focus on the ex-
tent to which the less than ideal control group has generated a valid result. The
question that must be asked is whether omitting parts of the study base or in-
cluding experience outside the study base has distorted the estimate of exposure
prevalence.
REFERENCES
Doll R, Hill AB. Smoking and carcinoma of the lung: preliminary report. Br Med J
1950;739–748.
Feychting M, Ahlbom A. Magnetic fields and cancer in children residing near Swedish
high-voltage power lines. Am J Epidemiol 1993;138:467–481.
Horwitz RI, Feinstein AR. Alternative analytic methods for case-control studies of estro-
gens and endometrial cancer. N Engl J Med 1978;299:1089–1094.
Hulka BS, Grimson RC, Greenberg RG, Kaufman DG, Fowler WC Jr, Hogue CJR, Berger
GS, Pulliam CC. “Alternative” controls in a case-control study of endometrial cancer
and exogenous estrogen. Am J Epidemiol 1980;112:376–387.
Jones TL, Shih CH, Thurston DH, Ware BJ, Cole P. Selection bias from differential residential
mobility as an explanation for associations of wire codes with childhood cancer.
J Clin Epidemiol 1993;46:545–548.
Linet MS, Hatch EE, Kleinerman RA, Robison LL, Kaune WT, Friedman DR, Severson
RK, Haines CM, Hartsock CT, Niwa S, Wacholder S, Tarone RE. Residential expo-
sure to magnetic fields and acute lymphoblastic leukemia in children. N Engl J Med
1997;337:1–7.
London SJ, Thomas DC, Bowman JD, Sobel E, Cheng T-C, Peters JM. Exposure to res-
idential electric and magnetic fields and risk of childhood leukemia. Am J Epidemiol
1991;134:923–937.
Longnecker M. Alcoholic beverage consumption in relation to risk of breast cancer: meta-
analysis and review. Cancer Causes Control 1994;5:73–82.
Melhus H, Michaelsson K, Kindmark A, Bergström R, Holmberg L, Mallmin H, Wolk
A, Ljunghall S. Excessive dietary intake of Vitamin A is associated with reduced bone
mineral density and increased risk for hip fracture. Ann Intern Med 1998;129:770–778.
Miettinen OS. The “case-control” study: valid selection of subjects. J Chron Dis
1985;38:543–548.
Olson SH, Kelsey JL, Pearson TA, Levin B. Evaluation of random digit dialing as a
method of control selection in case-control studies. Am J Epidemiol 1992;135:
210–222.
Poole C, Trichopoulos D. Extremely-low frequency electric and magnetic fields and can-
cer. Cancer Causes Control 1991;2:267–276.
Preston-Martin S, Navidi W, Thomas D, Lee P-J, Bowman J, Pogoda J. Los Angeles study
of residential magnetic fields and childhood brain tumors. Am J Epidemiol 1996;143:
105–119.
Rothman KJ. Modern Epidemiology. Boston: Little, Brown and Co., 1986.
Savitz DA, Barón AE. Estimating and correcting for confounder misclassification. Am J
Epidemiol 1989;129:1062–1071.
Savitz DA, Kaune WT. Childhood cancer in relation to a modified residential wire code.
Environ Health Perspect 1993a;101:76–80.
Savitz DA, Kaune WT. Response: potential bias in Denver childhood cancer study. En-
viron Health Perspect 1993b;101:369–370.
Savitz DA, Pearce NE. Control selection with incomplete case ascertainment. Am J Epi-
demiol 1988;127:1109–1117.
Savitz DA, Wachtel H, Barnes FA, John EM, Tvrdik JG. Case-control study of childhood
cancer and exposure to 60-Hz magnetic fields. Am J Epidemiol 1988;128:21–38.
Singletary KW, Gapstur SM. Alcohol and breast cancer. Review of epidemiologic and
experimental evidence and potential mechanisms. JAMA 2001;286:2143–2151.
United Kingdom Childhood Cancer Study Investigators. Exposure to power-frequency
magnetic fields and the risk of childhood cancer. Lancet 1999;354:1925–1931.
Vogt TM, Mayne ST, Graubard BI, Swanson CA, Sowell AL, Schoenberg JB, Swanson
GM, Greenberg RS, Hoover RN, Hayes RB, Ziegler RG. Serum lycopene, other serum
carotenoids, and risk of prostate cancer in US Blacks and Whites. Am J Epidemiol
2002;155:1023–1032.
Wu ML, Whittemore AS, Jung DL. Error in reported dietary intakes. II. Long-term re-
call. Am J Epidemiol 1988;128:137–145.
6
BIAS DUE TO LOSS OF STUDY PARTICIPANTS
The previous chapters addressed the mechanism by which subjects were selected
and the distinctive biases that may result from the methods of selection in co-
hort and case–control studies. The focus in those chapters was on the manner in
which the groups were constituted, and whether, if implemented as designed, the
selection process would yield a valid measure of association. With a poor choice
for the non-exposed group in a cohort study, even flawless execution of that se-
lection method would yield biased results. Similarly, if the control group defined
for a case–control study fails to reflect the prevalence of the exposure of inter-
est in the study base, then selection bias is present regardless of how success-
fully we identify and recruit subjects from that poorly chosen sampling frame.
The concern in previous chapters was in the definition of the study groups, not
in the implementation of the selection method. In this chapter, we turn to the
potential for bias that arises in the implementation of the selection process,
focusing on the problems resulting from the inability of researchers to enroll and
follow the individuals who were chosen for the study. Even with a perfectly valid
plan that meets all the conceptual goals for a valid study, systematic loss from
the defined study groups is such a common source of bias that it warrants ex-
tended discussion.
Many of these problems with attrition are unique to studying free-living human
populations. Controlled laboratory experiments do not have to contend with ro-
dents moving and leaving no forwarding address, or bacteria that refuse to permit
the investigator to impose the potentially noxious exposure. If the organism and
experimental conditions are properly chosen and implemented, with the investiga-
tor in complete control, there is little room for the selective loss of subjects to yield
an erroneous result. On occasion, an outbreak of infection will disrupt laboratory
experiments or failures to follow the protocol will occur due to human or machine
error. However, the experimental control of the investigator is rather complete.
Contrast that tight experimental control with the typical situation in observational
epidemiology and, to a large extent, in experimental studies in human populations.
The investigator designates a study group of interest, for example, all
men diagnosed with ulcerative colitis in a given geographic area or a randomly
selected sample of children who receive medical care through a health mainte-
nance organization. Even with the good fortune of starting with complete rosters
of eligible participants, a rare situation in practice, there are multiple opportuni-
ties for losses in going from those desired to those who actually contribute data
to the final analysis. The disparity between the persons of interest and those who
are successfully enrolled in the study and provide the desired data poses a sig-
nificant threat to validity.
Complete documentation of study methods in epidemiology includes a com-
plete and honest accounting of eligible subjects and the numbers lost for various
reasons, culminating in the tally of those who were included in the final analy-
ses. This accounting is vital to quantifying the potential for biased results through
evaluation of the disparity between those sought for the study and those actually
in the study. Multiple processes contribute to those losses, with the reason for
the losses critical to evaluating the potential impact on the validity of the study
results. These losses are not failings of the investigators or research staff, but an
inherent and undesirable feature of studying human populations.
All other considerations equal, the smaller the volume of loss, the less sus-
ceptible the study is to erroneous results of a given magnitude. Also, the more
random the losses are, the less damage they do to the validity of results. A per-
fectly random pattern of loss only reduces precision and can, if the sampling
frame is large enough, be compensated by increasing the sampling fraction. For
example, if a computer error deleted every tenth subject from a randomly ordered
list, there would be no impact on validity, and increasing the sampling
fraction by about 11% (a factor of 1/0.9) would result in no loss in precision
either. In sharp contrast, loss
of 10% of eligible subjects because they could not be contacted by telephone is
a distinctly non-random process, not compensated by increasing the sampling
fraction.
The key question is whether those who remain in the study after losses are
systematically different in their key attributes (risk of disease in cohort studies,
prevalence of exposure in case–control studies) compared to those in the initial
sampling frame. Some mechanisms of loss are likely to be very close to random.
For example, in studies that recruit from patients in a clinic setting, sometimes
there are insufficient resources to recruit during all clinic hours so that otherwise
eligible patients are lost because of lack of staff coverage at particular times of
day or certain days of the week. Even for such ostensibly random sources of loss,
however, questions may be raised about whether subjects who come to a clinic
at inconvenient times (weekends, nights) are different from those who come at
times when staff are more readily available (weekdays).
In a recent study in which pregnant women were recruited in a prenatal care
setting, those lost due to missed opportunity to recruit differed somewhat
from women who were contacted, being more often young and less educated (Savitz et
al., 1999). We hypothesized that one of the reasons for our inability to contact
women in the clinic was that they had changed names or rescheduled visits on
short notice, events quite plausibly related (though indirectly) to risk of adverse
pregnancy outcome as a result of a less favorable demographic profile. In fact,
those women whom we were unable to contact in the clinic had a slightly higher
risk of preterm birth as compared to women we could speak to and attempt to
recruit for the study.
Mechanisms of loss that are based on the decisions or lifestyle of potential
participants, such as refusal, absence of access to a telephone, screening calls
with an answering machine, not being home during the day, or changing residences,
are more obviously non-random in ways that could well affect the study’s
result. Socioeconomic and demographic characteristics, behavioral tendencies,
exposures of interest, and disease risk are often intertwined. This same complex
set of factors is likely to extend to the determinants of the ability to be located,
the inclination to agree to be enrolled, and the decision to drop out once enrolled.
With a little imagination, the many correlates of “difficult to contact” or “un-
willing to contribute time to research” make such losses non-random with regard
to the exposures and health outcomes of interest to epidemiologists.
Table 6.1 illustrates some of the processes by which subjects may be lost across
the phases of a study, and suggests some of the underlying mechanisms that may
be operative. Not all of these phases apply to every study, nor is the list ex-
haustive. Limited data are available to empirically assess which reasons for losses
are more tolerable, i.e., closer to random losses, than others. Even when data on
the nature of such losses are available from other studies, the patterns of reasons
for loss and the implications for study validity are likely to vary across popula-
tions, time periods, and for different exposures and diseases of interest, making
it difficult to generalize. These processes are in large part cultural, sociological,
and psychological, so that universal predictors that apply to all humans are un-
Subject refusal
• Hostility towards research based on bad experience or limited understanding
• Poor health precludes provision of data
• Protective of privacy due to engagement in embarrassing or illegal behaviors
• Overburdened with work or family responsibilities and thus lacking in time
• Self-confidence to refuse requests from authorities
Missing data
• Refusal to provide information that is unusual, embarrassing, or illegal
• Exhaustion due to poor health that precludes completion of survey
• Poor communication skills or low educational attainment
worthy of attention in assessing the potential for bias. Most often, participant re-
fusal is the dominant reason for loss, and its familiarity and inevitability should
not be misinterpreted as an indication that it is benign. The magnitude of devi-
ation between those lost and those enrolled is ideally evaluated empirically, but
since this requires information that is often unavailable, indirect evidence may
be brought to bear on the issue. After enumerating and quantifying the many
sources of loss, priorities can be defined regarding which problems deserve
scrutiny, validation substudies, or sensitivity analyses.
Like other forms of selection bias, if the losses are related to measured at-
tributes, like age and educational level, but random within strata of those attrib-
utes, then adjustment for the measured factors will eliminate bias just as it elim-
inates confounding. That is, if refusals are more common among less educated
eligible subjects, but random within strata defined by education, then after ad-
justment for education, the bias due to non-participation will be reduced. Ques-
tions must be asked regarding whether the measured attributes (e.g., educational
level) adequately approximate the attributes of ultimate interest (e.g., proclivity
to participate in studies) in order to fully adjust for the potential bias. Even though
adjustment can ameliorate the bias, it is very unlikely to fully eliminate it.
The specific exposures and diseases under investigation must be scrutinized care-
fully in order to assess the potential for bias. The abstract question of whether those
available in the analysis are or are not representative of the desired study popula-
tion has little meaning without consideration of the particular characteristics of con-
cern. The guiding question is whether the omission of some eligible participants
affects the disease rate in a cohort study or the exposure prevalence in a case–
control study. In studying a disease that is closely associated with a number of
health behaviors, such as lung cancer or coronary heart disease, subjects lost due
to refusal to participate are likely to introduce distortion due to deviant smoking
and dietary habits, for example, relative to those who enroll. It has been found re-
peatedly across diverse modes of data collection that smokers tend to refuse study
participation more frequently than do nonsmokers (Criqui et al., 1978; Macera et
al., 1990; Psaty et al., 1994). In contrast, for diseases less closely related to such
behaviors, e.g., prostate cancer, the distortion due to refusals may well be less, or
perhaps we simply lack sufficient information at present to make an informed judg-
ment for such diseases. Analogously, when assessing the impact of losses due to
physician refusal to permit patient contact in a case–control study, we might have
little concern if the exposure of interest were a genetic variant, not likely to be di-
rectly related to physician judgment, whereas if the exposure were the level of psy-
chological stress, physician refusals could have disastrous consequences if the cases
perceived to have the highest stress levels were systematically eliminated.
Several years ago, the patterns of loss that would and would not produce bias
were clearly specified (Greenland, 1977; Greenland & Criqui, 1981). In a cohort
study, losses of subjects from the exposed and unexposed groups that are not
differential by disease status do not result in bias, even if losses are unequal for the
exposed and unexposed groups. Furthermore, even losses that are selective for per-
sons with (or without) the disease of interest do not introduce bias in ratio meas-
ures of association, so long as those disease-selective losses are quantitatively the
same in the exposed and unexposed groups. In case–control studies, losses that dis-
tort the exposure prevalence among cases and controls are tolerable so long as the
losses are comparably selective for the two groups. Even if exposed (or unexposed)
subjects are preferentially lost, so long as the magnitude of that preferential loss is
comparable in cases and controls, bias in the odds ratio will not result.
Only when the losses are differential by exposure and disease status is there
selection bias. That is, in cohort studies, the key question is whether there is a
preferential loss of diseased subjects that differs for the exposed and unexposed.
If each group loses 10% of diseased and 5% of non-diseased, there is no bias,
but if the exposed lose 10% of diseased and 10% of non-diseased and the unexposed
lose 5% of diseased and 10% of non-diseased, bias will result. Similarly,
bias will result if losses of subjects from a case–control study are related to
exposure and of different magnitude for cases and controls, for example, if 10% of
exposed subjects and 10% of non-exposed subjects are lost from the control group
whereas 5% of exposed subjects and 10% of non-exposed subjects are lost from the cases. The
harmful pattern can be summarized as occurring when response status acts as an
effect–modifier of the exposure–disease association. Under such circumstances,
the magnitude of the exposure–disease relation differs among those who partic-
ipate in the study and those who do not.
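The arithmetic behind these loss patterns can be checked directly. The following is a minimal sketch in Python, using a hypothetical cohort whose counts are invented purely for illustration, with the loss percentages taken from the cohort examples above. The odds ratio is used because the cancellation of retention fractions is exact for that measure; the function and variable names are assumptions made for this illustration.

```python
def odds_ratio_after_loss(cohort, loss):
    """Odds ratio among the subjects who remain after losses.

    cohort: counts keyed by (exposed, diseased) for all eligible subjects.
    loss:   fraction lost, keyed the same way.
    """
    kept = {cell: n * (1.0 - loss[cell]) for cell, n in cohort.items()}
    return (kept[(True, True)] * kept[(False, False)]) / (
        kept[(True, False)] * kept[(False, True)]
    )


# Hypothetical full cohort (illustrative counts only).
cohort = {
    (True, True): 200, (True, False): 800,    # exposed: diseased, non-diseased
    (False, True): 100, (False, False): 900,  # unexposed: diseased, non-diseased
}
no_loss = {cell: 0.0 for cell in cohort}

# Each exposure group loses 10% of diseased and 5% of non-diseased.
same_in_both = {
    (True, True): 0.10, (True, False): 0.05,
    (False, True): 0.10, (False, False): 0.05,
}
# Exposed lose 10% and 10%; unexposed lose 5% of diseased and 10% of non-diseased.
differs_by_exposure = {
    (True, True): 0.10, (True, False): 0.10,
    (False, True): 0.05, (False, False): 0.10,
}

true_or = odds_ratio_after_loss(cohort, no_loss)
or_same = odds_ratio_after_loss(cohort, same_in_both)
or_differs = odds_ratio_after_loss(cohort, differs_by_exposure)
print(f"true {true_or:.2f}, same pattern {or_same:.2f}, differing pattern {or_differs:.2f}")
```

In this sketch the first loss pattern leaves the odds ratio at its true value of 2.25, whereas the second pulls it down to about 2.13, because the retention fractions no longer cancel.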
Characterize Nonparticipants
A straightforward approach to assessing the potential impact of non-response on
measures of association is to characterize a sample of nonrespondents with re-
gard to key attributes of exposure and disease, as well as other important pre-
dictors of disease. A sample of subjects lost for each major reason (refused, not
traceable, etc.) is subjected to the intense effort required to obtain the desired in-
formation, anticipating at least partial success in obtaining information on some
of the potential participants who had been initially considered lost. This approach
is predicated on the assumption that there is a gradation of effort that can be ex-
pended to obtain participation and a corresponding gradation of difficulty in re-
cruiting potential respondents. Every study balances available resources with the
expected yield, and some limit must be placed on the amount of effort that can
be devoted to reaching and obtaining the participation of all eligible subjects.
In general, expanding the effort to locate or recruit nonparticipants will yield
some additional participants and data obtained from those recovered participants
is informative.
Subject refusal probably remains the predominant reason for losses in most
epidemiologic studies. Even after intensive efforts to persuade subjects to par-
ticipate in interviews or specimen collection have failed, uniquely talented, mo-
tivated interviewers can usually persuade a sizable proportion of subjects who
had initially refused to change their minds (refusal converters). Locating subjects
is even more clearly tied to resources expended. Commercial tracking compa-
nies typically have an explicit policy—the more money you are willing to spend
to locate a given person, the more intensive the effort, and the more likely it is
that they will be able to locate that person. Thus, after a reasonable, affordable
level of effort has been expended to locate subjects, a subset of formerly un-
traceable subjects can be subjected to more intensive tracking methods and lo-
cated to generate the desired data. The product of these refusal conversions or
intensive tracking efforts is information on formerly nonparticipating subjects
who can help us make some inferences about the otherwise eligible subjects who
remain nonparticipants.
Assuming that at least some of the former nonrespondents can be enrolled, the
goal is to characterize the exposure or health outcome of primary interest. In a
cohort study, the occurrence or non-occurrence of disease is central. After nor-
mal follow-up procedures, some fraction of the original cohort is likely to re-
main lost to follow-up. To evaluate the impact of that loss, a fraction of those
lost to follow-up would be located through more intensive means in order to de-
termine their disease outcomes. With that information in hand, formal correc-
tions can be made if one assumes that those former nonrespondents who were
found represent their counterparts who remain nonrespondents. For example, if
the disease incidence among the 10 subjects who were formerly lost to follow-
up but located through intensive effort were 20%, one might assume that of the
total 100 subjects lost to follow-up, 20 of them developed disease as well. Such
a calculation assumes that those who were converted from lost to found repre-
sent those who were permanently lost, which is subject to uncertainty. When the
recovered subjects constitute a complete roster of a randomly chosen subset, con-
fidence in generalizing to all those remaining lost is much greater than when
those who are recovered are simply the most easily recovered from the pool of
those initially lost and are thus likely to differ systematically from those who re-
main lost. Incorporating the data on those subjects who were found directly
reduces the non-response, whereas extrapolating from that subset that could be
found to the entire roster of those lost is more akin to a sensitivity analysis.
(“What if those lost had the disease experience of those we could find?”) Nev-
ertheless, the alternative assumption that nonrespondents are identical to re-
spondents is lacking in any empirical support and thus far more tenuous.
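The extrapolation in this example amounts to a single proportion applied to the full roster of losses. A minimal sketch follows, assuming the figures given above (10 recovered subjects, 20% incidence, 100 lost in total). The counts for subjects with complete follow-up are hypothetical and serve only to show how a corrected risk might be assembled; the function names are illustrative.

```python
def extrapolated_cases(n_lost, n_recovered, cases_among_recovered):
    """Expected cases among all subjects initially lost, assuming the
    recovered subset represents those who remain lost."""
    return cases_among_recovered * n_lost / n_recovered


def corrected_risk(cases_followed, n_followed, n_lost, n_recovered, cases_among_recovered):
    """Cumulative incidence for the full cohort under that same assumption."""
    projected = extrapolated_cases(n_lost, n_recovered, cases_among_recovered)
    return (cases_followed + projected) / (n_followed + n_lost)


# From the text: 10 of the 100 lost subjects were located; 20% (2) had disease.
projected_cases = extrapolated_cases(n_lost=100, n_recovered=10, cases_among_recovered=2)

# Hypothetical: 900 subjects with complete follow-up, 90 of whom developed disease.
risk = corrected_risk(cases_followed=90, n_followed=900,
                      n_lost=100, n_recovered=10, cases_among_recovered=2)
print(projected_cases, risk)
```

The result is only as good as the assumption that the recovered subjects represent those who remain lost, so it is best read as one scenario in a sensitivity analysis rather than as a correction.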
Beyond the reduction in non-response that results from these intensive follow-
up efforts, and the subsequent ability to estimate the association of interest with-
out such losses, this strategy can provide additional insight into the underlying
reasons for non-participation. In some cases, direct inquiry built into the follow-
up process can reveal the ways in which those who were included differ from
those who initially were not. In the case of subject refusal, information on social
and demographic attributes, occupation, medical history, etc. will help to describe
the patterns and potential bias, but also to better understand why they refused in
the first place. Former refusals can be queried directly regarding their reason for
having been reluctant to participate in the study. To the extent that honest an-
swers can be generated, there is the opportunity to examine whether study meth-
ods could be refined to improve response or at least add to the understanding of
the process that resulted in their having been lost.
Similarly, eligible subjects who were initially untraceable and then located can
be evaluated to reveal why they were untraceable or at least to characterize the
types of persons who fall into that category. Perhaps they were less likely to
use credit cards or more likely to be self-employed. Such general descriptors of
the lost individuals and informed speculation about the underlying process help
the investigator and reviewer to judge the potential for biased measures of asso-
ciation among the participants who were included. Also, information may be gen-
erated to indicate cost-effective approaches to reducing the magnitude of non-
response in the ongoing study or at least in future ones.
In comparing those who were lost to those who participated, investigators of-
ten focus on broad demographic attributes of the two groups because those are
most readily available. Unwarranted comfort may be taken when the demographic
profile of those lost is similar to those who were enrolled. Such a pattern is used
to infer that those lost are effectively a random sample of those enrolled, and
thus, on average, participants would generate measures of association equivalent
to those from all eligible subjects. Data sources on nonrespondents, such as
public records or city directories, typically provide some information on gender, age,
and sometimes occupation or educational level. Such descriptors do provide some
insight into the process by which subjects were lost, and provide limited data to
address the hypothesis that the loss process has generated a random sample from
give hints about those who moved and could not be found. Similarly, we might
expect that those who were reluctant to participate but were ultimately persuaded
to do so would fall in between the eager participants and those who chose not to
participate at all. Estimation of a quantitative dose-response function of nonpar-
ticipation and formal extrapolation to nonrespondents would be the ultimate goal,
but a qualitative assessment may be all that is feasible.
In mail surveys, the design typically calls for a series of steps to enhance re-
sponse (Dillman, 1978), each step yielding more respondents and depleting the
pool of refusals. Some respond directly to the initial questionnaire, some respond
only to reminder postcards, others respond only to repeat mailing of the ques-
tionnaire, continuing to those who must be interviewed by telephone because
they ignore all mailed material. It can be difficult in practice to determine ex-
actly which action precipitated a response, but through careful coding of ques-
tionnaires, monitoring mailing and receipt dates, and some inferences based on
those dates, a gradient of willingness to cooperate can be defined among the par-
ticipants. Those who promptly returned the questionnaire without a reminder are
at one end of that spectrum, and those who responded only after the most ex-
treme efforts, e.g., telephone calls, are at the other end of the spectrum.
The characteristics of those who responded at each stage can be examined,
both to describe them in broad social and demographic terms, but more impor-
tantly to determine their statuses with regard to the variables of primary interest
and the estimated measure of effect for that subgroup. In a cohort study in which
disease is ascertained by questionnaire, the proportion affected by the disease,
stratified by the effort required to elicit a response, would indicate whether the
ultimate nonrespondents were likely to have higher or lower disease rates than
the respondents. Stratifying by exposure might address the critical question of
whether exposed nonrespondents are likely to have a different health outcome
than unexposed nonrespondents. Similarly, in a case–control study, the exposure
prevalence can be assessed across strata of cooperativeness, separately for cases
and controls, to extrapolate and assess whether the ability to include the remaining
nonrespondents would be likely to change the pattern of results. In a sensitivity
analysis, the nonrespondents could be assumed to have the traits of reluctant re-
spondents, or a more extreme version of the reluctant respondent profile, and an
assessment made of the expected results for the full study population. In contrast
to making arbitrary and implausible extreme assumptions (e.g., all those missing
are exposed cases), the evaluation of reluctant respondents provides a basis for
much more credible estimates.
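One way to carry out such a sensitivity analysis in a case–control study is sketched below. All counts are hypothetical, the two response stages stand in for whatever gradient of effort a study records, and the function names are illustrative. Nonrespondents are simply assigned the exposure prevalence of the most reluctant respondents in their own group.

```python
def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


def totals(stages):
    """stages: list of (n_respondents, n_exposed), ordered from easy to reluctant."""
    exposed = sum(e for _, e in stages)
    return exposed, sum(n for n, _ in stages) - exposed


def totals_with_nonrespondents(stages, n_nonrespondents):
    """Add nonrespondents, assumed to share the exposure prevalence of the
    last (most reluctant) stage of respondents."""
    n_last, exposed_last = stages[-1]
    exposed, unexposed = totals(stages)
    extra_exposed = n_nonrespondents * exposed_last / n_last
    return exposed + extra_exposed, unexposed + n_nonrespondents - extra_exposed


# Hypothetical counts: (respondents, exposed) for prompt and reluctant stages.
cases = [(200, 80), (100, 50)]      # exposure prevalence 40%, then 50%
controls = [(200, 60), (100, 30)]   # exposure prevalence 30% in both stages

or_respondents = odds_ratio(*totals(cases), *totals(controls))
or_projected = odds_ratio(*totals_with_nonrespondents(cases, 100),
                          *totals_with_nonrespondents(controls, 200))
print(f"respondents only {or_respondents:.2f}, with projected nonrespondents {or_projected:.2f}")
```

With these invented counts the odds ratio moves from about 1.78 among respondents to about 1.91 once the projected nonrespondents are added. A more extreme version of the reluctant profile could be explored by replacing the last-stage prevalence with an assumed value.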
In a community survey in Montreal, Siemiatycki and Campbell (1984) examined
participant characteristics for those who responded at the first stage of the
survey that was conducted by mail, compared to the cumulation of first- and sec-
ond-stage responders, with second-stage participants only responding with fur-
ther contact, including a home interview if needed. As shown in Table 6.3, few
differences were noted between results based on the first stage alone versus the
first and second stages combined. Note, however, that this presentation of results
does not isolate the second-stage respondents, who in fact had a lower level of
education and were more likely to be current smokers, but not to a sufficient ex-
tent to influence the cumulative sample.
By isolating those responding in each of the stages, there is an opportunity to
extrapolate to those who did not participate at all. In studies that are large enough,
measures of association can be calculated for subgroups defined by the stage at
which they responded. In a study using a mailed questionnaire, for example, the
relative risk for those responding to the first questionnaire, the reminder post-
card, the second questionnaire, and the telephone interview can be calculated to
identify a gradient and estimate what the relative risk would be in those who did
not participate at all. Through this approach, it is possible to assess directly
whether inclusion of all eligible participants is likely to generate a relative risk
that is larger or smaller than that found for the participants alone, and even to
provide some quantitative basis for how much different it would be. If the gra-
dient of relative risks moving from easy to difficult response were 1.2, 1.4, 1.4,
and 1.7, one might guess that the relative risk for those who did not participate
would be approximately 1.8–2.0, and therefore conduct a sensitivity analysis un-
der that assumed value. Although this is subject to uncertainty, the alternative is
to assume nonrespondents are identical to respondents, and the data make that
assumption even less tenable.
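A guess of this kind can be made slightly more explicit by fitting a trend to the stage-specific estimates. The sketch below fits an ordinary least-squares line to the four relative risks quoted above and projects one step beyond the last stage. Treating the stages as equally spaced, and treating the nonrespondents as the next step on that scale, are both assumptions made only for illustration, and the helper name is hypothetical.

```python
def project_next(values):
    """Fit y = a + b*x by least squares to values at x = 1..n; return y at x = n+1."""
    n = len(values)
    xs = list(range(1, n + 1))
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n + 1)


# Relative risks from easiest to most difficult response, as in the text.
stage_rr = [1.2, 1.4, 1.4, 1.7]
projected_rr = project_next(stage_rr)
print(round(projected_rr, 2))
```

The fitted line rises by 0.15 per stage and projects to about 1.8, the lower end of the range suggested above. With only four imprecise points, the projection is a starting value for a sensitivity analysis and not an estimate in its own right.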
Telephone and in-person interviews have a comparable spectrum of difficulty,
though it may be less easily measured as compared to mailed questionnaires. In-
terviewed respondents vary greatly in their initial enthusiasm for the study, with
some requiring intensive efforts to persuade them to become involved and oth-
ers much more readily agreeable. The amount of persuasion that is required is
worth noting to facilitate extrapolation to those who could not be persuaded even
with intensive effort. The number of telephone calls required to reach a respon-
dent is another dimension of difficulty in recruitment, reflective of accessibility
or cooperativeness. Those who require many calls may give some insights into
those who would have required an infinite number of calls (i.e., those who were
never reached). Subjects who ultimately participate after one or more missed ap-
pointments may be reflective of those who repeatedly miss appointments and ul-
timately become nonrespondents for that reason. Because refusal to participate
has multiple etiologies, including lack of belief in the value of research, insuffi-
cient time, and inability to be located, there is some danger in seeking a single
dose-response function for recruitability. It may be more informative to separate
out the degrees of reluctance for each causal pathway and extrapolate for each
specific reason they could not be enrolled.
The other major source of loss is due to the inability to locate presumably el-
igible subjects. Efforts to locate individuals can be documented and placed into
ordinal categories of required effort. For example, some subjects are readily found
through the available address or telephone number, or they are listed in the tele-
phone directory at the expected address. Others can be traced through a for-
warding address or are found in directories at their new address. Further down
the list, some are found by contacting neighbors or by calling former employers.
Studies generally follow a well-defined sequential algorithm based on the nature
of the population and the impressions of the investigators regarding the most ef-
ficient approach. Whatever series of steps is followed, the documentation of what
was ultimately required to locate the subject should be noted and examined so
that the pattern of results can be extrapolated to those never found. Here the gra-
dient of traceability may be more obvious than for refusals, since those who move
multiple times or are lost would be expected to be similar to those who move
and are found, only more so.
An important caveat with this strategy is that ultimate nonrespondents may be
qualitatively different than reluctant respondents. That is, there may be a dis-
continuity in which easy and reluctant respondents are similar to one another and
intransigent nonrespondents are altogether different from either group. There is
no easy way to detect this, in that the most uncooperative subjects or those most
difficult to locate will remain nonparticipants regardless of the amount of effort
that is expended. To the extent that this is true, the sensitivity analysis that as-
sumes otherwise will be misleading, and this untestable possibility should tem-
per the degree to which extrapolation to nonrespondents is viewed as a solution
rather than an exploration of the problem.
good as actual data, and confidence in the validity of the results should be re-
duced due to the missing data, yet imputation can make it tempting to act as
though the non-response problem has been solved. In many national surveys
distributed by the National Center for Health Statistics, for example, the data
are complete but with an imputation “flag.” It is actually more convenient to
use the full data set, including imputed values, than to regenerate a data set in
which imputed values are treated as missing. Missing data should not be
hidden or disguised, because non-response needs to be known to make an
accurate assessment of the study’s susceptibility to bias. Even when presented
honestly as a sensitivity analysis (“What would the results be if we made the
following assumptions about nonparticipants?”), it is intuitively unappealing to
many to impute not just social or demographic factors but exposure or disease
status. Imputation seems to violate our ingrained demands for real data, ob-
jectively described.
On the other hand, once we determine that there are eligible subjects who have
not been enrolled, we are faced with a range of imperfect alternative approaches.
Inevitably subjects are lost, and the proportions are typically in the range of
20%–30% in even the most meticulously conducted studies, with losses of
40%–50% not uncommon. Given that the losses have occurred, we can (1) ig-
nore the losses altogether; (2) describe the losses by analyzing patterns among
those who provided data, using a variety of tools and logic described in previ-
ous sections to speculate about the impact of those losses on the study results;
or (3) impute data for nonparticipants, analyzing the combined set of subjects
with real and imputed data, and discussing the strengths and limitations of the
imputation.
When the analysis is restricted to those subjects who provided data, we are as-
suming that they are a random sample from the pool of eligibles and that results
are valid, though somewhat less precise, as a result of those losses. As discussed
above, that is a highly questionable assumption, with self-selection in particular
unlikely to be random. To the extent that participants are a non-random sample
with respect to key attributes of exposure or disease or both, the measures of as-
sociation may be distorted (Rothman & Greenland, 1998).
Imputation makes a different set of assumptions. A variety of techniques are
available, but all are built on the strategy of using known attributes to impute
unknown attributes. Regression techniques use information from all subjects with
complete data to develop predictive equations for each of the missing measures.
Those predictive models are then applied to those with some data available to
predict values for missing items, and those predicted values are substituted for
the missing ones. Another approach is to identify a set of subjects with complete
data who are similar to the lost subject with respect to known attributes and use
a randomly selected individual subject’s profile to substitute for the missing one.
In any event, the guess is an informed one, though not subject to verification.
The underlying assumption of imputation is that attributes can be predicted for
those who are nonparticipants based on information from participants. To the ex-
tent that this is in error, then the data generated by imputation will be incorrect
and the results using the imputed data will be biased.
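The second approach, substituting the value of a randomly selected similar subject, can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the records, field names, and matching variables are hypothetical, and the added "imputed" flag is an illustrative device so that measured and imputed values can still be told apart afterward.

```python
import random


def hot_deck_impute(records, match_keys, target_key, rng):
    """Return copies of records with missing target_key values (None) filled
    from a randomly chosen complete record that matches on match_keys."""
    donors = {}
    for rec in records:
        if rec[target_key] is not None:
            profile = tuple(rec[k] for k in match_keys)
            donors.setdefault(profile, []).append(rec[target_key])

    completed = []
    for rec in records:
        new = dict(rec, imputed=False)
        if new[target_key] is None:
            pool = donors.get(tuple(new[k] for k in match_keys))
            if pool:  # stays missing when no similar donor exists
                new[target_key] = rng.choice(pool)
                new["imputed"] = True
        completed.append(new)
    return completed


# Hypothetical records; None marks an exposure that was never obtained.
subjects = [
    {"age_group": "40-49", "sex": "F", "exposed": True},
    {"age_group": "40-49", "sex": "F", "exposed": True},
    {"age_group": "40-49", "sex": "F", "exposed": None},
    {"age_group": "60-69", "sex": "M", "exposed": None},
]
filled = hot_deck_impute(subjects, ["age_group", "sex"], "exposed", random.Random(0))
measured_only = [s for s in filled if not s["imputed"] and s["exposed"] is not None]
```

The third subject receives a donor's value, while the fourth has no comparable donor and remains missing. Nothing in the procedure verifies that the donor's value is correct for the subject who receives it.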
Given the two options, accepting missing information as missing and imput-
ing missing information, each with positive and negative aspects, perhaps the op-
timal approach is to do the imputation but retain the ability to examine results
for subjects with measured data and the combined population with measured plus
imputed data separately. In that way, the evaluator of the evidence, including the
investigators who generated it, can consider the plausibility of the assumptions
and make an informed judgment about which results are likely to be more valid.
Similar results under the two approaches provide evidence that bias due to
non-participation is less likely, whereas different results for complete and imputed
subjects suggest that the losses may have introduced distortion and, subject to
the level of confidence in the imputation process, that the more valid result is
obtained through imputation.
An evaluation of the ability to impute data and assess its impact on study
results is provided by Baris et al. (1999). They considered a study of mag-
netic fields and childhood leukemia conducted by the National Cancer Insti-
tute in which some of the children had lived in two or more homes. The “gold
standard” measure of exposure in such instances was considered to be the
time-weighted average of measurements in the two residences, but they were
interested in comparing more cost-efficient approaches to determine whether
information would be lost by failing to obtain measurements in both homes.
To examine this, they presented results based on various approaches to im-
puting the value of the unmeasured home using available information on one
measured home chosen as the longest-occupied or currently occupied (at the
time of the study).
Using the exposure indices presented in Table 6.4, the gold standard meas-
urements yielded odds ratios and 95% confidence intervals across the three
groups above 0.065 microTesla (μT), a measure of magnetic flux density, as
follows: 0.97 (0.52–1.81), 1.14 (0.63–2.08), and 1.81 (0.81–4.02), with a p-
value for trend of 0.2. Whether overall means for study subjects were used
for imputation (results on left) or case and control group-specific means were
used (results on right), there was severe attenuation of the pattern from using
imputed results. What had been a weak gradient, with some indication of higher
risk in the highest exposure stratum, disappeared upon using imputed data for
the second home. Even with a reasonable correlation of fields across homes,
the loss of information to characterize exposure–disease associations was sub-
stantial in this instance.
TABLE 6.4. Odds Ratios (95% CI) from Different Imputation Strategies Categorized According to Initial Cut Off Points of Magnetic Field Exposure, National Cancer Institute Childhood Leukemia Study
Relative Risk for Acute Lymphoblastic Leukaemia Calculated With: Control Mean Imputation*
Column groups: TWA Based on Longer Lived in Homes Plus Imputed Values for Shorter Lived in Homes; TWA Based on Current Lived in Homes Plus Imputed Values for Former Lived in Homes
Columns within each group, by EXPOSURE CATEGORIES: MEAN (μT), CASES, OR, 95% CI
Among the threats to the validity of epidemiologic studies, none is more ubiq-
uitous or severe than the problem of non-participation. By choosing to study free-
living humans, epidemiologists face an unavoidable loss due to the characteris-
tics distinguishing those who are eligible and those who provide all the desired
information. The loss of desired participants is rarely random with respect to the
attributes of interest, and thus the participants will often be non-representative
on key attributes and have serious potential to generate biased measures of
association.
The only unambiguous solution, free of assumptions, is to eliminate non-
response. Before embarking on indirect, fallible approaches to assessing or con-
trolling effects of nonparticipation, every effort should be made to eliminate or
at least minimize the magnitude of the problem. At the point of choosing a study
population, attention should be given to seeking study settings in which the losses
will be minimized. There is often some tension between studying highly selec-
tive, cooperative groups, perhaps distinctive for being of higher social class or
belonging to some motivated group (e.g., nurses), and the ultimate interest in ap-
plying findings to more diverse populations. Unless the study questions demand
inclusion of a range of persons who differ in availability and willingness to par-
ticipate, starting with a series of valid studies in highly selected populations may
be the preferred strategy.
To evaluate the impact of nonparticipation, the specific pathways of subject
loss need to be isolated, perhaps even more finely than is typically done in epi-
demiologic studies. Refusal due to distrust of medical researchers may have dif-
ferent implications than refusal due to lack of available time, for example. We
are concerned with systematic (non-random) losses and whether those losses are
related to the exposure and disease of concern. The more specificity with which
the mechanism of loss can be stated, the greater our opportunity to consider em-
pirically or theoretically the effect such losses will have on the study results.
Those losses most likely to be related to the occurrence of disease in a cohort
study or the prevalence of exposure in a case–control study deserve the greatest
scrutiny. The potential for distortion plus the magnitude of loss combine to de-
termine the importance of the phenomenon.
Within each of the pathways of loss, there are several parallel approaches to
assessing the impact on study results: Intensive effort can usually rescue some
proportion of those lost. This allows a comparison of the characteristics of those
successfully enrolled despite initial failure to those enrolled more readily. In turn,
if those who were rescued can be assumed to have traits in common with the
nonparticipants, the impact of non-response can be estimated. If there is a gra-
dient associated with the mechanism of loss, e.g., degree of reluctance or diffi-
culty of tracing, the pattern of results across that spectrum can be analyzed to
project to those who remained lost. Characteristics of those with lesser doses of
the tendency to be lost can be compared to those with greater amounts of that
same tendency and then extrapolated to the most extreme subset (who remain
lost). Subsets of the study base in which the loss was less severe can be exam-
ined and compared to subsets in which there was greater loss, both to assess the
pattern in measures of association across that spectrum as well as to generate re-
sults for subsets in which non-response bias is unlikely to be a major problem.
Finally, methods of imputation should be considered, with appropriate caveats,
to estimate what the results would have been without losses.
Many of these techniques depend on the foresight of the investigators in ac-
quiring and presenting relevant data. Without presenting the needed information
on study attrition, the reviewer may be left with much more generic speculation
about non-response and its impact. Published papers typically give some clues
at least regarding the reasons for loss and some characteristics of those not
participating. With those clues regarding the sources and magnitude of non-
participation, the key questions can at least be formulated and partial answers
provided. Non-participation represents a challenge in which greater effort is likely
to yield reduced risk of bias and revealing more information is certain to help
the user of information from the study to draw more valid inferences.
REFERENCES
Baris D, Linet MS, Tarone RE, Kleinerman RA, Hatch EE, Kaune WT, Robison LL, Lubin J, Wacholder S. Residential exposure to magnetic fields: an empirical assessment of alternative measurement strategies. Occup Environ Med 1999;56:562–566.
Criqui M, Barrett-Connor E, Austin M. Differences between respondents and non-respondents in a population-based cardiovascular disease study. Am J Epidemiol 1978;108:367–372.
Dillman DA. Mail and telephone surveys. The total design method. New York: John Wiley & Sons, 1978.
Greenland S. Response and follow-up bias in cohort studies. Am J Epidemiol 1977;106:184–187.
Greenland S, Criqui MH. Are case-control studies more vulnerable to response bias? Am J Epidemiol 1981;114:175–177.
Hatch EE, Kleinerman RA, Linet MS, Tarone RE, Kaune WT, Auvinen A, Baris D, Robison LL, Wacholder S. Do confounding or selection factors of residential wiring codes and magnetic fields distort findings of electromagnetic field studies? Epidemiology 2000;11:189–198.
Macera CA, Jackson KL, Davis DR, Kronenfeld JJ, Blair SN. Patterns of non-response to a mail survey. J Clin Epidemiol 1990;43:1427–1430.
Psaty BM, Cheadle A, Koepsell TD, Diehr P, Wickizer T, Curry S, VonKorff M, Perrin EB, Pearson DC, Wagner EH. Race- and ethnicity-specific characteristics of participants lost to follow-up in a telephone cohort. Am J Epidemiol 1994;140:161–171.
posure, to provide an estimate of what the experience of the exposed group would
have been absent the exposure. Ignoring various forms of selection and mea-
surement error and random processes, the reason that comparing the exposed to
the unexposed group would fail to accurately measure the causal effect of ex-
posure is confounding. That is, the unexposed group may have other influences
on disease, due to both measurable factors and unknown influences, which make
its disease experience an inaccurate reflection of what the exposed subjects them-
selves would have experienced had they not been exposed. Other disease deter-
minants have rendered the comparison of disease risk across exposure levels an
inaccurate reflection of the causal impact of exposure. This has been referred to
as non-exchangeability in that the exposed and unexposed are not exchangeable,
aside from any effect of the exposure itself.
There is an important distinction to be made between the concept of con-
founding as defined above and the definition of a confounder or confounding
variable. A confounding variable is a marker of the basis for non-comparability.
It provides at least a partial explanation for the underlying differences in disease
risk comparing the exposed and unexposed aside from the exposure itself. If we
wish to assess the influence of coffee drinking on the risk of bladder cancer, we
should be concerned that coffee drinkers and abstainers may not have compara-
ble baseline risk of disease independent of any effects of coffee itself, i.e., con-
founding is likely to be present. One important source of such non-comparability
would be attributable to the fact that persons who habitually drink different
amounts of coffee also tend to differ in cigarette smoking habits, and cigarette
smoking is a known cause of bladder cancer. Thus, we are concerned with cig-
arette smoking as a confounder or marker of the non-comparability among per-
sons who consume different amounts of coffee. Put in other terms, we would like
for the disease experience of the non-coffee drinkers in our study to accurately
reflect the disease experience that the coffee drinkers themselves would have had
if they had not been coffee drinkers. If smoking habits differ between the two
groups, however, then the consequences of coffee drinking will be mixed with
those of cigarette smoking and give an inaccurate representation of the effect of
coffee drinking as a result.
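The mixing of effects described here can be made concrete with a small numerical sketch. The cohort below is entirely hypothetical and is constructed so that coffee has no effect on bladder cancer risk within either smoking stratum. The Mantel-Haenszel summary risk ratio is used as one standard way of adjusting for a stratification variable; it is chosen here for illustration and is not prescribed by the discussion above.

```python
def risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)


def mantel_haenszel_rr(strata):
    """Summary risk ratio across strata of (cases_exp, n_exp, cases_unexp, n_unexp)."""
    num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata)
    den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata)
    return num / den


# Hypothetical cohort:
# (cases among drinkers, drinkers, cases among abstainers, abstainers)
smokers = (40, 800, 10, 200)      # risk 5% in both coffee groups
nonsmokers = (2, 200, 8, 800)     # risk 1% in both coffee groups

crude_rr = risk_ratio(40 + 2, 800 + 200, 10 + 8, 200 + 800)
adjusted_rr = mantel_haenszel_rr([smokers, nonsmokers])
print(f"crude {crude_rr:.2f}, smoking-adjusted {adjusted_rr:.2f}")
```

Because most coffee drinkers in this invented cohort are smokers, the crude risk ratio is about 2.33 even though the smoking-adjusted value is 1.0. Adjustment removes only the non-comparability marked by smoking; any other differences between drinkers and abstainers would remain.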
Because the concept of confounding based on the counterfactual model relies
on unobservable conditions, epidemiologists usually concentrate on the more
practical approach of searching for specific confounders that may affect the com-
parison of exposed and unexposed, and make extensive efforts to control for con-
founding. Although this effort is often fully justified and can help markedly to
remove bias, we should not lose appreciation for the underlying conceptual goal.
Exchangeability of exposed and unexposed is the ideal and the search for mark-
ers of non-exchangeability is undertaken to better approximate that ideal. Statis-
tical adjustment for confounding variables is simply a means toward that end.
The inability to identify plausible candidate confounding variables, for exam-
ple, or extensive efforts to control for known and suspected confounding vari-
ables by no means guarantees the absence of confounding. Doing one’s best is
laudable, but circumstances outside the control of the investigator often make
some degree of confounding inescapable. In observational studies in which ex-
posure cannot be assigned randomly, the attainment of exchangeability is a very
high aspiration. When randomization of exposure is feasible, the opportunity to
force the exposed and unexposed groups to be exchangeable is greatly enhanced.
The randomization process itself is precisely for the purpose of making sure that
the groups are exchangeable. If exposure is assigned randomly, those persons (or
rats or cell cultures) that receive the exposure should be exchangeable with those
that do not receive the exposure. Regardless of how extensive the measurement
and control of extraneous determinants of disease may be in observational stud-
ies, producing groups that are functionally randomized is a nearly unattainable
goal.
Consider efforts to truly isolate the effects of specific dietary practices, occu-
pational exposures, or sexual behavior. The challenges are apparent in that these
exposures are not incurred in any sense randomly. The choice of diet, job, or
sexual partner is integrally tied to many other dimensions of a person’s life, at
least some of which are also likely to affect the risk of disease. We obviously
cannot ethically or feasibly randomize such experiences, however, and thus must
accept the scientifically second-best approach of trying to understand and con-
trol for the influences of other associated factors. We must reflect carefully on
other known determinants of the health outcomes of interest that are likely to be
associated with the exposure of primary interest and make statistical adjustment
for those markers, simulating to the extent possible a situation in which the ex-
posures themselves had been randomly allocated. Accepting that the ideal is not
attainable in no way detracts from the incremental value of feasible approaches
to improve the degree of comparability of exposed and unexposed.
Statistical methods of adjusting for confounding variables are exercises in
which we estimate the results that would have been obtained had the exposure
groups been balanced for those other disease determinants even though, in fact,
they were not. For the extraneous disease determinant or confounder, which is
associated with the exposure of interest, we estimate the influence of exposure
on disease after statistically removing any effect of the extraneous exposures.
The most straightforward approach is to stratify on the potential confounding fac-
tor, creating subgroups in which the extraneous factor is not related to exposure.
In the above illustration, in which our interest was in isolating the effect of cof-
fee drinking on bladder cancer from that of cigarette smoking, we can stratify on
smoking status and estimate the coffee–bladder cancer association separately
among nonsmokers, among light smokers, and among heavy smokers. Within
each of those groups, cigarette smoking would not distort the association because
smoking is no longer related to coffee drinking. We can, if desired, then pool the
imagine that coffee drinkers were always smokers, and that the only non-coffee
drinkers were never smokers. We would be unable to extricate the effect of cof-
fee drinking from that of smoking and vice versa, even though in theory the ob-
served association with disease may be wholly due to one or the other or par-
tially due to both. One exposure might well confound the other but there would
be no opportunity to measure or control that confounding. Similarly, if some
condition is completely predictive of disease, such as exposure to asbestos and
the development of asbestosis, then we cannot in practice isolate that exposure
from others. We cannot answer the question, “Independent of asbestos exposure,
what is the effect of cigarette smoking on the development of asbestosis?” The
confounder–disease association is complete, so that we would be able to study
only the combination and perhaps consider factors that modify the association.
In practice, such extreme situations of no association and complete associa-
tion are rare. Potential confounding variables will more typically have some de-
gree of association with both the exposure and the disease and the strength of
those associations, taken together, determines the amount of confounding that is
present. In examining the two underlying associations, the stronger association
puts an upper bound on the amount of confounding that could be present and the
weaker association puts a lower bound on the amount of confounding that is plau-
sible. If one association is notably less well understood than the other, some in-
ferences may still be possible based on estimates for the one that is known.
In practice, much of the attention focuses on the confounder–disease associa-
tion, given that this association is often better understood than the confounder–
exposure association. Epidemiologists typically focus on the full spectrum of po-
tential causes of disease and less intensively on the ways in which exposures re-
late to one another. The strength of the confounder–disease association places an
upper bound on the amount of confounding that could be present, which will
reach that maximum value when the exposure and confounder are completely as-
sociated. That is, if we know that the risk ratio for the confounder and disease
is 2.0, then the most distortion that the confounder could produce is a doubling
of the risk. If we have no knowledge at all about the confounder–exposure as-
sociation, we might infer that an observed risk ratio for exposure of 1.5 could be
explained by confounding (i.e., the true risk ratio could be 1.0 with distortion
due to confounding accounting for the observed increase), a risk ratio of 2.0 is
unlikely to be fully explained (requiring a complete association between con-
founder and exposure), and a risk ratio of 2.5 could not possibly be elevated
solely due to confounding.
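The bound described above can be checked with a few lines of arithmetic. What follows is only a sketch under stated assumptions: a single dichotomous confounder, no effect modification, and a true exposure effect of 1.0. The function name confounding_rr and the confounder prevalences are hypothetical choices made for illustration; only the confounder–disease risk ratio of 2.0 comes from the text.

```python
def confounding_rr(rr_cd, p_exposed, p_unexposed):
    """Spurious risk ratio produced by a dichotomous confounder alone.

    rr_cd       -- confounder-disease risk ratio
    p_exposed   -- prevalence of the confounder among the exposed
    p_unexposed -- prevalence of the confounder among the unexposed
    """
    # Average risk in each group, relative to persons without the confounder.
    risk_exposed = p_exposed * rr_cd + (1.0 - p_exposed)
    risk_unexposed = p_unexposed * rr_cd + (1.0 - p_unexposed)
    return risk_exposed / risk_unexposed


RR_CD = 2.0  # confounder-disease risk ratio used in the text

# Hypothetical confounder prevalences (exposed, unexposed), running from no
# confounder-exposure association up to a complete association.
scenarios = [(0.3, 0.3), (0.5, 0.3), (0.8, 0.2), (1.0, 0.0)]
distortions = [confounding_rr(RR_CD, pe, pu) for pe, pu in scenarios]

for (pe, pu), d in zip(scenarios, distortions):
    print(f"prevalence {pe:.1f} vs {pu:.1f}: confounding RR = {d:.2f}")
```

Under these assumptions the distortion is absent when the confounder is unrelated to exposure, grows as the prevalences diverge, and reaches the confounder–disease risk ratio only when the confounder and exposure are completely associated.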
As reflected by its dependence on two underlying associations between the po-
tential confounding variable and disease and between the potential confounding
variable and exposure, the algebraic phenomenon of confounding is indirect rel-
ative to the exposure and disease of interest. In contrast to misclassification or
selection bias, which directly distorts the exposure or disease indicators and their
association by shifting the number of observations in the cells that define the
measure of effect, confounding is a step removed from exposure and disease. In
order for confounding to be substantial, both the underlying associations, not just
one of them, must be rather strong. Such situations can and do arise, but given
the paucity of strong known determinants for many diseases, illustrations of
strong confounding that produces spurious risk ratios on the order of 2.0 or more
are not common.
The amount of confounding is expressed in terms of its quantitative impact on
the exposure–disease association of interest. This confounding can be in either
direction, so it is most convenient to express it in terms of the extent to which
it distorts the unconfounded measure of association, regardless of whether that
unconfounded value is the null, positive, or negative. Note that the importance
of confounding is strictly a function of how much distortion it introduces, with
no relevance whatsoever to whether the magnitude of change in the confounded
compared to the unconfounded measure is statistically significant. Similarly, there
is no reason to subject the confounder–exposure or confounder–disease associa-
tions to statistical tests given that statistical testing does not help in any way to
evaluate whether confounding could occur, whether it has occurred, or how much
of it is likely to be present. The sole question concerns the magnitude, not the
precision, of the underlying associations.
The more relevant parameter to quantify confounding is the magnitude of de-
viation between the measure of association between exposure and disease with
confounding present versus the same measure of association with confounding
removed. The null value of the association is a convenient benchmark, but it is
not the only one. We often ask: “Given an observed association of a specified
magnitude in which confounding may be present, how plausible is it that
the true (unconfounded) association is the null value?” We might also ask: “Given
an observed null measure of association in which confounding may be present,
how likely is it that the unconfounded association takes on some other specific
value?” Based on previous literature or clinical or public health importance, we
might also ask: “How likely is it that the unconfounded association is as great
as 2.0 or as small as 1.5?”
The amount of confounding can also be expressed in terms of the confounding
risk ratio, which is the measure of distortion it introduces. This would be the
risk ratio which, when multiplied by the true (unconfounded) risk ratio would
yield the observed risk ratio, i.e., RR (confounding) × RR (true) = RR (observed).
If the true risk ratio were the null value of 1.0, then the observed risk
ratio would be solely an indication of confounding whether above the null or
below the null value. A truly positive risk ratio could be brought down to the
null value or beyond, and a truly inverse risk ratio (<1.0) could be spuriously
elevated to or beyond the null value. Quantitative speculation about the mag-
EVALUATION OF CONFOUNDING
ing, nor has randomized exposure assignment been employed to address the prob-
lem without needing to fully understand its origins. Our goal is to measure and
control for the attributes that will make these non-comparable groups as compa-
rable as possible. Viewed in this manner, effectiveness is not measured as a di-
chotomy, in which we succeed in eliminating confounding completely or fail to
have any beneficial impact, but should be viewed as a continuum in which we
mitigate confounding to varying degrees.
The conceptual challenge is to identify those characteristics of exposed and
unexposed subjects, other than exposure, which confer differing disease risks.
The underlying basis for the confounding may be such elusive constructs as so-
cioeconomic status or tendency to seek medical care. Undoubtedly, information
is lost as we operationalize these constructs into measures such as level of edu-
cation or engaging in preventive health behaviors. Just as for exposures of in-
terest, something is often lost in moving from the ideal to operational measures,
and although it is often difficult to quantify that loss, it is an important contrib-
utor to incomplete control of confounding. Misclassification at the level of con-
ceptualizing and operationalizing the construct of interest dilutes the ability to
control confounding through statistical adjustment.
The more familiar problems concern accuracy of measurement of the poten-
tial confounding variable and the way in which the variable is treated in the analy-
sis, e.g., categorized versus continuous, number of categories used. Errors arise
in all the ways considered in the discussion of exposure measurement (Chapter
8): clerical errors, misrepresentation on self-report, faulty instrumentation, etc.
In addition, for a given confounding variable, there is an optimal way of mea-
suring it and constructing it to maximize its association with disease, thus en-
hancing the extent to which confounding is controlled. If we are concerned about
a confounding effect of cigarette smoking in studying exposure to air pollution
and lung cancer, then we can measure tobacco smoking in a number of ways,
including “ever smoked,” “years of smoking,” “cigarettes per day,” or
“pack–years of smoking.” In choosing among these measures, the guiding prin-
ciple is to choose the one that best predicts risk of developing lung cancer, typ-
ically “pack–years of smoking.” By choosing the one most strongly related to
the health outcome, adjustment would be most complete, far better than if we re-
lied on a crude dichotomy such as “ever versus never smoked.”
There are different ways the confounding variable can be handled in the
statistical analysis phase, and the same goal applies: define the measure to
maximize its association with disease. A measure like “pack–years of smoking” could
be treated as a continuous measure and included in a logistic regression model in
which the relationship with disease is presumed to be log-linear. Alternatively,
it could be categorized into two or more levels, with many potential cutpoints,
or modeled using more flexible approaches such as spline regression (Greenland,
1995). All these options apply to assessing the exposure of primary interest as
well, but for the confounding variable, the goal is to remove its effect, not nec-
essarily to fully understand its effect.
Regardless of the underlying reasons for imperfect measurement of the sources
of confounding, the effect is the same: incomplete control for the confounding
that arises from the specific phenomenon it is intended to address. That is, what-
ever the amount of confounding that was originally present, only a fraction will
be removed through the adjustment efforts and the size of that fraction is de-
pendent on the quality of assessment (Greenland & Robins, 1985; Savitz & Barón,
1989). If a perfectly measured confounder completely adjusts for the distortion,
and a fully misclassified measure is of no benefit, adjustment for an imperfectly
measured confounder falls somewhere in between. The more the measure used
for adjustment deviates from the ideal, the less of the confounding is eliminated.
This dilution of confounder control can arise by poor selection of an operational
measure, measurement error, or inappropriate choices for categorization in the
analysis.
The magnitude of confounding present after attempts at adjustment depends
on both the magnitude of confounding originally present and the fraction of that
confounding that has been effectively controlled. Incomplete control of con-
founding due to imperfect assessment and measurement of the confounding vari-
ables is proportionate to the amount of confounding originally present. If the
amount of original confounding was substantial, then whatever the fraction that
was controlled, the amount that is not controlled, in absolute terms, may still be
of great concern. On the other hand, if the amount of confounding originally
present were small, which is often the case, then leaving some fraction of it un-
controlled would be of little concern in absolute terms.
Kaufman et al. (1997) provide a quantitative illustration of a common prob-
lem of inaccurate confounder measurement. There is often an interest in isolat-
ing the effects of race from the strongly associated socioeconomic factors that
both differ by race and are strongly associated with many health outcomes. The
goal is generally to isolate some biological differences between African Ameri-
cans and whites from their socioeconomic context. The challenge, of course, is
to effectively eliminate the influence of such a strong confounder despite its elu-
sive nature, often reverting to simplistic approaches such as adjusting for edu-
cation. In a simulation, Kaufman et al. (1997) illustrate four ways in which fre-
quently applied methods of adjustment for socioeconomic status fall short of
controlling confounding of racial differences. One simple one suffices to illus-
trate the point—residual confounding due to categorization.
Many health outcomes vary continuously by measures of socioeconomic sta-
tus, so that when a dichotomous measure or even one that has several levels is
used, there will be residual confounding within strata. This problem is exacer-
bated in the case of racial comparisons when the mean levels are markedly dif-
ferent in the two groups and no single cutpoint will do justice to the two
TABLE 7.1. The Spurious Association of African American Race with “Disease”
Owing to Categorization Bias in the Exposure Variable: Results for Simulations with
1000 African Americans, 1000 Whites*
β       MEAN CRUDE OR FOR        MEAN ADJUSTED OR FOR     REPETITIONS WITH
        AFRICAN AMERICAN RACE†   AFRICAN AMERICAN RACE‡   OR > 1.0 (%)§
0.10    1.07                     1.03                     60
0.20    1.13                     1.05                     70
0.30    1.19                     1.07                     74
0.40    1.27                     1.09                     81
0.50    1.33                     1.11                     86
0.60    1.40                     1.13                     90
0.70    1.47                     1.16                     92
0.80    1.53                     1.18                     94
0.90    1.59                     1.19                     96
1.00    1.65                     1.21                     97
*“Disease” generated randomly as: p(d) = e^(βz) / (1 + e^(βz)); 1000 repetitions
at each β; Z(white) ~ N(0.30, 1) and Z(African-American) ~ N(−0.30, 1),
representing “education.”
†From the model: logit(disease) = α + β1 race + ε, where race is coded
1 = African-American, 0 = white.
‡From the model: logit(disease) = α + β1 education + β2 race + ε, where education
is dichotomized at Z = 0 and race is coded 1 = African-American, 0 = white.
§The percentage of replications with adjusted odds ratios for African-American > 1.0.
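Residual confounding within the strata of a dichotomized confounder can be illustrated without random sampling by computing expected risks directly. The sketch below is not the simulation of Kaufman et al. (1997). It assumes two groups whose continuous confounder z is normally distributed with means of −0.30 and 0.30 and unit variance, and a logistic disease risk that falls as z rises. That direction is an assumption, chosen so that the group with the lower mean z has the higher risk, as odds ratios above 1.0 would imply. It also reports stratum-specific odds ratios rather than logistic regression estimates, so it should not be expected to reproduce the tabulated values. All function names are hypothetical.

```python
import math


def normal_pdf(z, mean):
    """Density of a normal distribution with the given mean and unit variance."""
    return math.exp(-0.5 * (z - mean) ** 2) / math.sqrt(2.0 * math.pi)


def risk(z, beta):
    """Disease probability; assumed here to fall as z ("education") rises."""
    return 1.0 / (1.0 + math.exp(beta * z))


def mean_risk(mean, beta, lo, hi, steps=4000):
    """Average disease risk among persons with z in [lo, hi] (midpoint rule)."""
    width = (hi - lo) / steps
    num = den = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * width
        w = normal_pdf(z, mean)
        num += w * risk(z, beta)
        den += w
    return num / den


def odds_ratio(p1, p0):
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


def crude_and_stratum_ors(beta, mean_low=-0.30, mean_high=0.30):
    """Odds ratios comparing the lower-z group with the higher-z group."""
    crude = odds_ratio(mean_risk(mean_low, beta, -8.0, 8.0),
                       mean_risk(mean_high, beta, -8.0, 8.0))
    # Dichotomize the confounder at z = 0 and compare groups within each stratum.
    below = odds_ratio(mean_risk(mean_low, beta, -8.0, 0.0),
                       mean_risk(mean_high, beta, -8.0, 0.0))
    above = odds_ratio(mean_risk(mean_low, beta, 0.0, 8.0),
                       mean_risk(mean_high, beta, 0.0, 8.0))
    return crude, below, above


crude, below, above = crude_and_stratum_ors(beta=1.0)
print(f"crude OR {crude:.2f}; within z<0 {below:.2f}; within z>=0 {above:.2f}")
```

In this setup group membership has no effect of its own, so any odds ratio above 1.0 within a stratum is residual confounding: the two groups still differ in their distribution of z on each side of the cutpoint.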
the unadjusted risk ratio was 2.0, consider two situations: in one, adjustment for
a marker of confounding yields a risk ratio of 1.8; in the other, adjustment
yields a risk ratio of 1.4. If asked to make an assessment of how likely it
is that a fully adjusted risk ratio would be at or close to the null value, it is more
plausible in the second than the first scenario. That is, an imperfectly measured
confounder that yields an adjusted risk ratio of 1.4 compared to the crude value
of 2.0 indicates a more substantial amount of confounding than if the adjustment
had yielded an adjusted risk ratio of 1.8, assuming that the quality of measure-
ment is roughly comparable.
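One way to make this reasoning concrete is to posit the fraction of the confounding that the imperfect marker removed and project what complete adjustment might yield. The sketch below assumes, purely for illustration, that the marker removes a fixed fraction of the confounding on the log risk-ratio scale. The function name and the fractions tried are hypothetical; the crude and adjusted risk ratios are those of the example above.

```python
def extrapolate_full_adjustment(crude_rr, adjusted_rr, fraction_removed):
    """Project the risk ratio that complete adjustment might yield.

    Assumes (hypothetically) that the imperfect marker removed
    `fraction_removed` of the confounding on the log risk-ratio scale.
    """
    if not 0.0 < fraction_removed <= 1.0:
        raise ValueError("fraction_removed must be in (0, 1]")
    # log(adjusted) = log(crude) - fraction * log(total confounding RR)
    return crude_rr * (adjusted_rr / crude_rr) ** (1.0 / fraction_removed)


CRUDE = 2.0
for adjusted in (1.8, 1.4):
    for fraction in (0.5, 0.75, 1.0):
        full = extrapolate_full_adjustment(CRUDE, adjusted, fraction)
        print(f"adjusted {adjusted:.1f}, {fraction:.0%} of confounding removed: "
              f"projected fully adjusted RR = {full:.2f}")
```

If the marker is assumed to have removed half of the confounding, an adjusted value of 1.8 projects to about 1.6, whereas an adjusted value of 1.4 projects to roughly the null. The projection is only as good as the assumed fraction, which in practice is a judgment about measurement quality.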
An illustration of this concern arises in assessing a potentially beneficial im-
pact of barrier contraception on the risk of pelvic inflammatory disease, which
can result in subsequent infertility. A potential confounding effect of sexually
transmitted disease history must be considered, given that choice of contracep-
tion may well be related to risk of acquiring a sexually transmitted infection, and
such infections are strong determinants of the risk of pelvic inflammatory dis-
ease. Virtually all approaches to measuring sexually transmitted infection are in-
complete, with self-report known to be somewhat inaccurate, but even biologic
measures are subject to uncertainty because they can only reflect prevalence at
a given point in time. Assume we have obtained self-reported information on
sexually transmitted diseases to be evaluated as a potential confounder of the
barrier contraception–pelvic inflammatory disease association. Further assume that
the unadjusted measure of association shows an inverse association with a risk
ratio of 0.5. If adjustment for self-reported sexually transmitted diseases increased
the risk ratio to only 0.6, we might argue that even with a perfect measure of
sexually transmitted diseases, the adjustment would be unlikely to go all the way
to the null value and perhaps 0.7 or 0.8 is the more accurate measure of associ-
ation. On the other hand, if the adjusted measure rose from 0.5 to 0.8, we might
infer that a more complete adjustment could well yield a risk ratio at or very
close to 1.0. Insight into the quality of the confounder measure (often known in
qualitative if not quantitative terms), unadjusted measure, and partially adjusted
measure (always available) helps to assess the extent of incompleteness in the
control of confounding and generate an estimate of what the (unknown) fully ad-
justed measure would be.
The ideal marker of confounding is presumed not to be available, because if
it were, it would be used in preference to any speculation about residual con-
founding from suboptimal measures. There is often the option of examining con-
founders of varying quality within the range available, however, which would
allow for assessing the impact of adjustment using markers of varying quality.
The impact of successive refinements in control of a particular source of con-
founding can be informative in estimating what impact full adjustment would
have, as described for exposure measures more generally in Chapter 8. No ad-
justment at all corresponds to a useless marker of confounding, and as the marker
gets better and better, more and more of the confounding is controlled, helping
to extrapolate to what the gold standard measure would accomplish.
For this reason, the opportunity to scrutinize unadjusted and adjusted meas-
ures is critical to assessing the extent to which confounding has been controlled.
Merely providing the results of a multivariate analysis, without being able to as-
sess what effect the adjustment for confounding variables had, limits the reader’s
ability to fully consider the amount of residual confounding that is likely to be
present. As noted above, an adjusted risk ratio of 1.4 has one interpretation
if the crude measure was 2.0 and control of confounding is likely to be
incomplete, and quite another if the crude measure was 1.5 and important potential
confounders have been well controlled and properly analyzed. Even if the adjusted measure
is conceded to be the best available, some attention needs to be paid to the source
and pattern of any confounding that has been removed.
In other cases, the basis for potential confounding is much more likely to be
universal, and thus be more readily extrapolated from one study to another. If
we are interested in the effect of one dietary constituent found in fruits and veg-
etables, for example, beta carotene, and concerned about confounding from other
micronutrients found in fruits and vegetables, for example, vitamin C, the po-
tential for confounding would be more universal. That is, if the same food prod-
ucts tend to contain multiple constituents or if people who consume one type of
fruit or vegetable tend to consume others, then the amount and direction of con-
founding observed in one study may be applied to other studies.
The information on confounder–disease associations will more often be gen-
eralizable from one study to another to the extent that it reflects a basic biolog-
ical link given that such relations are more likely to apply broadly. Once an as-
sociation has been firmly established in a set of studies, it is reasonable to assume
that the association would be observed in other populations unless known effect-
modifiers are operative to suggest the contrary. We can safely assume, for ex-
ample, that cigarette smoking will be related to risk of lung cancer in all popu-
lations, so that in seeking to isolate other causes of lung cancer, confounding by
tobacco use will be a concern.
If attempts to identify and control confounding fail to influence the measure
of effect, despite a strong empirical basis for believing that confounding should
be present, concerns arise about whether the study has successfully measured the
confounding variable of interest. In the above example of silica exposure and
lung cancer, if we attempted to measure cigarette smoking and found it to be un-
related to risk of lung cancer, and thus not a source of confounding, we should
question whether cigarette smoking had truly been measured and controlled. The
strategies for evaluating potential exposure misclassification (Chapter 8) apply
directly to confounding factors, which are just exposures other than those of pri-
mary interest.
To illustrate, when Wertheimer and Leeper (1979) first reported an associ-
ation between residential proximity to sources of magnetic field exposure and
childhood cancer, one of the challenges to a causal interpretation of the asso-
ciation was the potential for confounding. Because they had relied on public
records, there was no opportunity to interview the parents and assess a wide
range of potential confounding factors such as parental tobacco use, medica-
tions taken during pregnancy, and child’s diet. When a study of the same ex-
posure–disease association in the same community was undertaken roughly a
decade later (Savitz et al., 1988), it included extensive consideration of poten-
tial confounding factors, and found essentially no indications of confounding.
Although it is theoretically possible that undetected confounding due to those
factors was present in the earlier study, the later study makes that possibility
far less likely. That does not negate the possibility that both studies suffer from
confounding by as yet undiscovered risk factors for those cancers, but at least
the usual suspects are less of a concern in the earlier study based on the results
of the later study.
Ye et al. (2002) provide a quantitative illustration of the examination of con-
founding using evidence from previous studies. In a cohort study of alcohol abuse
and the risk of developing pancreatic cancer in the Swedish Inpatient Register,
information on smoking, a known risk factor for pancreatic cancer, was not avail-
able. The investigators applied indirect methods using the observed association
found for alcohol use (relative risk of 1.4). By assuming a relative risk for cur-
rent smoking and pancreatic cancer of 2.0, 80% prevalence of smoking among
alcoholics and 30% in the general population of Sweden, a true relative risk of
1.0 for alcohol use would rise to 1.4 solely from confounding by smoking. That
is, “The observed excess risk in our alcoholics without complications may be al-
most totally attributable to the confounding effect of smoking.” (Ye et al., 2002,
p. 238). Although this may not be as persuasive as having measured smoking in
their study and adjusting for it directly, the exercise provides valuable informa-
tion to help interpret their findings with some appreciation of the potential for
the role of confounding.
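The arithmetic behind this indirect adjustment can be reproduced from the figures quoted above. The sketch assumes a dichotomous confounder (current smoking) and no effect modification. The function and variable names are hypothetical, and the calculation is a generic external-adjustment formula rather than necessarily the exact procedure the investigators followed.

```python
def spurious_rr(rr_confounder, p_exposed, p_unexposed):
    """Risk ratio produced by confounding alone (true exposure effect = 1.0)."""
    risk_exposed = p_exposed * rr_confounder + (1.0 - p_exposed)
    risk_unexposed = p_unexposed * rr_confounder + (1.0 - p_unexposed)
    return risk_exposed / risk_unexposed


# Values quoted in the text.
smoking_rr = 2.0           # assumed smoking-pancreatic cancer relative risk
p_smoke_alcoholics = 0.80  # assumed smoking prevalence among alcoholics
p_smoke_general = 0.30     # assumed smoking prevalence in the general population
observed_rr = 1.4          # observed relative risk for alcohol abuse

bias = spurious_rr(smoking_rr, p_smoke_alcoholics, p_smoke_general)

# RR (observed) = RR (confounding) x RR (true), so divide out the bias.
implied_true_rr = observed_rr / bias

print(f"confounding RR = {bias:.2f}, implied true RR = {implied_true_rr:.2f}")
```

Under these inputs the confounding risk ratio is about 1.4, and dividing it out of the observed value leaves an implied true relative risk close to 1.0.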
To make appropriate use of information on confounding in studies other than
the one being evaluated, the phenomenon occurring in the other studies needs to
be fully understood. The ability to extrapolate the relations between the con-
founder and both exposure and disease should be carefully considered before as-
serting that the presence or absence of confounding in one study has direct im-
plications for another study. If the study settings are sociologically similar and
the study structures are comparable, such extrapolation may well be helpful in
making an informed judgment. Extrapolation is not a substitute for measurement
and control of confounding in the study itself, but speculation informed by pre-
vious research can be far superior to speculation without such guidance.
in that they are only markers of underlying etiologic relations. Geography per se
does not cause disease, though it may enhance the probability of exposure to an
infectious agent, and low income does not cause disease, though it can affect
availability of certain foods that prevent disease. Given that these are intention-
ally non-specific indicators of many potential exposures, and imperfectly reflec-
tive of any one exposure, underlying confounding by specific exposures will not
be fully controlled. Nevertheless, some insight would be gained regarding the
potential for substantial confounding to be present based on these broad proxy
measures.
Hypothetical scenarios can be described, indicating the strength of association
between the confounder, exposure, and disease required to yield various alter-
native measures of effect, in the same manner as described above for imperfectly
measured confounders. That is, the general marker can be viewed as a proxy con-
founder measure, and various candidate gold standard measures might be con-
sidered to ask about how much confounding may yet remain. As discussed pre-
viously, if control for the non-specific marker has no impact whatsoever, none
of the exposures it reflects are likely to have an impact, whereas if it does change
the measure of effect, we would expect that a sharper focus on the pertinent ex-
posure would yield a more sizable change in the effect measure.
Failing to control for true confounding factors in the face of ignorance is a
threat, but there is also a danger in controlling for anything that happens to be
measured whenever such adjustment changes the association of interest. When
little is known about the causes of a particular disease, unnecessary adjustment
for a broad range of factors that are not truly confounders results in a loss of pre-
cision under the assumption that the relations found reflect only random processes
(Day et al., 1980). Moreover, if only those factors that reduce the association of
interest are selected for adjustment, there will be bias toward the null value even
when all associations are the product of random error (Day et al., 1980). When
a lengthy list of candidate factors is screened for confounding, without adequate
attention to the plausibility of their having an association with the health out-
come based on mechanistic considerations and findings of prior studies, there is
a danger of finding associations by chance and producing an adjusted effect es-
timate that is more rather than less biased than the unadjusted one.
In order to evaluate the extent to which confounding may have biased the results
of an epidemiologic study, the conceptual basis for confounding must first be ex-
amined. The question of exchangeability of exposed and unexposed needs to be
posed for the specific research question under consideration: Do nonsmokers have
the risk of lung cancer that smokers would have had if they had not smoked? Do
women with low levels of calcium intake have the same risk of osteoporosis that
women with high levels of calcium intake would have had if their intakes had
been low? This question, posed in the grammatically awkward but technically
correct counterfactual manner, serves to focus interest on the phenomenon of
confounding rather than on available covariates, which are, at best, a means of
addressing and mitigating the lack of comparability of the groups. The goal is not to achieve
statistical control of covariates or to consider the longest possible list of poten-
tial confounders but rather to reduce or eliminate confounding. In some instances,
there is little or no confounding present, or the confounding is readily addressed
TABLE 7.2. Characteristics of the Study Population According to Their Consumption of Coffee, Norway
                                      COFFEE CONSUMPTION, CUPS PER DAY
                                      MEN                          WOMEN
                                      ≤2     3–4    5–6    ≥7      ≤2     3–4    5–6    ≥7
Number at risk                        2855   5599   6528   6753    2648   7350   6820   4420
Age (years)                           45.9   46.5   46.2   45.1    46.0   46.5   45.9   44.9
Menopause (% yes)                                                  27     30     27     24
Body mass index (g/cm2)               2.54   2.55   2.55   2.54    2.46   2.50   2.50   2.50
Smoke daily (% yes)                   23     33     47     66      17     23     38     57
No. of cigarettes per day             11.5   11.6   12.6   15.5    9.1    8.6    9.8    12.1
  (among cigarette smokers)
Total cholesterol (mmol/l)            5.95   6.20   6.29   6.47    6.01   6.19   6.27   6.37
Triglycerides (mmol/l)                2.17   2.21   2.13   2.07    1.50   1.47   1.48   1.47
Physical inactivity (% sedentary)     15     14     15     20      17     15     17     22
History of cardiovascular disease     11     11     10     8       11     10     9      8
  and/or diabetes (% yes)
Beer consumption in an ordinary       27     25     25     26      8      7      6      6
  week (% yes)
Wine/liquor consumption in an         26     28     27     32      11     10     12     13
  ordinary week (% yes)
Residence (% in Finnmark)             12     16     24     43      15     18     25     41
Stensvold & Jacobsen, 1994.
with one or two markers, and in other instances, even a comprehensive effort
with many potential confounders measured and adjusted will fall far short. By
focusing on the conceptual basis for confounding, attention to confounding is
less likely to be mistaken for elimination of confounding.
With the question clearly in mind regarding exchangeability, and reason to be-
lieve that the exposed and unexposed groups are not exchangeable, the next step
is to consider what attributes may serve as markers of the basis for non-
exchangeability. That is, if there is reason for concern about confounding in com-
paring exposed and unexposed, what are the underlying reasons for that con-
founding? Are the groups likely to differ in behaviors that go along with the one
under investigation? Might the groups tend to interpret symptoms differently and
have differential access to medical care? To the extent that the source of the con-
founding can be identified, it can be addressed through statistical methods. Again,
the question is not “What variables are available to examine for their effect on
the exposure–disease association?” but rather “What are the disease determinants
which may make the exposed and unexposed non-exchangeable?”
Having identified the factors that are thought to generate the confounding, we
now have to take on the challenge of measuring those factors for purposes of sta-
tistical control. Some sources of confounding are more easily measured than oth-
ers due to the clarity of the concept and accessibility of information. An assess-
ment of the effectiveness with which the construct of interest has been captured
is helpful in addressing the question of how effectively that source of confounding
has been controlled. Referring back to the underlying phenomenon of con-
founding, the availability and quality of markers should be contrasted with the
ideal set of markers one would like to have available to control this source of
confounding. If we are concerned with confounding by social class, and have
data only on education, and none on income, occupation, wealth, etc., then we
must acknowledge and contend with the inherent shortcoming in our control for
confounding by social class. The potential for incomplete control of confound-
ing draws upon substantive knowledge of the phenomenon, but also can be as-
sessed to some extent within the available data. The change in the measure of
effect resulting from the adjustment process can help in estimating what the meas-
ure of effect would have been had adjustment been complete.
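The shift in the measure of effect produced by adjusting for an available marker can be illustrated with a minimal sketch. The counts are hypothetical and chosen only for illustration, and the Mantel-Haenszel summary odds ratio is used here as one common adjustment method, not as the method of any study cited in this chapter. The function and variable names are likewise assumptions of the sketch.

```python
# Hypothetical counts, for illustration only; not data from any study cited here.
# Each stratum of the confounder marker (for example, education) is a 2x2 table:
# (exposed cases, exposed noncases, unexposed cases, unexposed noncases)
strata = [
    (10, 90, 20, 380),   # marker level 1
    (60, 140, 15, 85),   # marker level 2
]


def odds_ratio(a, b, c, d):
    """Odds ratio for a single 2x2 table."""
    return (a * d) / (b * c)


def crude_odds_ratio(tables):
    """Collapse all strata into one table, ignoring the marker."""
    a, b, c, d = (sum(cell) for cell in zip(*tables))
    return odds_ratio(a, b, c, d)


def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel summary odds ratio across strata of the marker."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
print(f"crude OR = {crude:.2f}, adjusted OR = {adjusted:.2f}")
```

With these invented counts the adjusted estimate falls well below the crude one. If the marker captures the underlying source of confounding only partially, the reasoning in the text suggests that fuller control would move the estimate further in the same direction.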
Once candidate confounders have been operationalized and measured, the im-
pact of statistical adjustment on the exposure–disease association can be assessed.
Note again how many considerations arise prior to this point in the evaluation,
all emanating from the concept of non-exchangeability. Errors in reasoning or
measurement between the concept and the covariate will diminish or negate the
effectiveness of control of confounding. Making statistical adjustments for
confounding is an exercise, a form of sensitivity analysis, in that new informa-
tion is generated to help in understanding the data and evaluating the hypothe-
sis of interest. In order for the adjusted measure of effect to be superior to the
REFERENCES
Day NE, Byar DP, Green SB. Overadjustment in case-control studies. Am J Epidemiol
1980;112:696–706.
Greenland S. Randomization, statistics, and causal inference. Epidemiology 1990;1:
421–429.
Greenland S. Dose-response and trend analysis in epidemiology: alternatives to categor-
ical analysis. Epidemiology 1995;6:356–365.
Greenland S, Robins JM. Confounding and misclassification. Am J Epidemiol 1985;
122:495–506.
Greenland S, Robins JM. Identifiability, exchangeability, and epidemiologic confound-
ing. Int J Epidemiol 1986;15:413–419.
Kaufman JS, Cooper RS, McGee DL. Socioeconomic status and health in blacks and
whites: the problem of residual confounding and the resiliency of race. Epidemiology
1997;8:621–628.
Last JM. A Dictionary of Epidemiology, Fourth Edition. New York: Oxford University
Press, 2001:37–38.
Savitz DA, Barón AE. Estimating and correcting for confounder misclassification. Am J
Epidemiol 1989;129:1062–1071.
Savitz DA, Wachtel H, Barnes FA, John EM, Tvrdik JG. Case-control study of childhood
cancer and exposure to 60-Hz magnetic fields. Am J Epidemiol 1988;128:21–38.
Stensvold I, Jacobsen BK. Coffee and cancer: a prospective study of 43,000 Norwegian
men and women. Cancer Causes Control 1994;5:401–408.
Wertheimer N, Leeper E. Electrical wiring configurations and childhood cancer. Am J
Epidemiol 1979;109:273–284.
Ye W, Lagergren J, Weiderpass E, Nyrén O, Adami H-O, Ekbom A. Alcohol abuse and
the risk of pancreatic cancer. Gut 2002;51:236–239.
8
MEASUREMENT AND CLASSIFICATION
OF EXPOSURE
Many of the concepts and much of the algebra of misclassification apply to
assessing and interpreting errors in the classification of both exposure and disease.
Differences arise based on the structure of epidemiologic studies, which are de-
signed to assess the impact of exposure on the development of disease and not
the reverse. Also, the sources of error and the ways in which disease and expo-
sure are assessed tend to be quite different, and thus the mechanisms by which
errors arise are different as well. Health care access, a determinant of diagnosis
of disease, does not correspond directly to exposure assessment, for example.
Health and disease, not exposure, are the focal points of epidemiology, so that
measurement of exposure is driven by its relevance to health. The degree of in-
terest in an exposure rises or falls as the possibility of having an influence on
health evolves, whereas the disease is an event with which we are inherently concerned,
whether or not a particular exposure is found to affect it. Once
an exposure has been clearly linked to disease, e.g., tobacco or asbestos, then it
becomes a legitimate target of epidemiologic inquiry even in isolation from stud-
ies of its health impact.
The range of exposures of interest is as broad as, perhaps even broader than, the
spectrum of health outcomes. Exposure, as defined here, includes exogenous
agents such as drugs, diet, and pollutants. It also includes genetic attributes that
affect ability to metabolize specific compounds; stable attributes such as height
issues considered below are really just amplifications of this basic goal—
measure the exposure that is most pertinent to the etiology of the disease of in-
terest. Though such a strategy may be obvious for exposures to chemicals or
viruses, it is equally applicable to psychological or social conditions that may in-
fluence the occurrence of disease. The overriding goal is to approximate the ex-
posure measure that contributes to the specific etiologic process under investi-
gation. Of course, in many if not all situations, the mechanisms of disease
causation will not be understood sufficiently to define the exact way in which
exposure may influence disease, but intelligent guesses should be within reach,
resulting in a range of hypothesized etiologic pathways that can be articulated to
define clearly the goals of exposure assessment.
In evaluating an exogenous agent such as exposure to DDT/DDE in the cau-
sation of breast cancer, we might begin by delineating the cascade leading from
exposure to disease: The use of DDT for pest control leads to persistent
environmental contamination, which leads to absorption and persistence in the body,
which leads to a biological response that may affect the risk of developing breast
cancer. Residues of DDT and its degradation product, DDE, are ubiquitous in
the soil due to historical use and persistent contamination.
One option for an exposure measure would be average DDT/DDE levels in
the county of current residence. Obviously, this is an indirect measure of expo-
sure: county averages may not apply directly to environmental levels where the
individual lives and works; the individuals of interest may not have always lived
in the county in which they resided at the time of their participation in the study;
and the present levels of contamination in the environment are likely to differ
from those of the past. For these reasons, this measure of environmental contamination
is not likely to approximate individual exposure levels very accurately. Refinements
in spatial resolution to finer geographic levels and incorporation of
individual residential history and historical levels of contamination in the area
would move us closer to the desired measure of exposure. Note that if our goal
was defined solely as the accurate characterization of the county’s average ex-
posure in the year 1995, we could carefully sample the environment, ensure that
the laboratory equipment for measuring DDT and DDE in soil samples was as
accurate as possible, and employ statistical methods that are optimal for charac-
terizing the geographic area appropriately. Nevertheless, without consideration
of historical changes, residential changes, and individual behaviors that influence
exposure and excretion, even the perfect geographic descriptor will not neces-
sarily provide a valuable indicator of the individual biologically effective expo-
sure. Accuracy must be defined in relation to a specific benchmark.
Measurement of present-day blood levels of DDT/DDE in women with breast
cancer and suitably selected controls would reflect absorption and excretion over
a lifetime, integrating over the many subtle behavioral determinants of contact
with contaminated soil, food, water, and air, and better reflecting the dose that
has the potential to affect development of cancer. The temporal aspects of ex-
posure relevant to disease etiology must still be considered, encouraging us to
evaluate what the ideal measure would be. Perhaps the ideal measure would be
a profile, over time, of DDT/DDE levels in breast tissue from puberty to the time
of diagnosis, or over some more circumscribed period in that window, e.g., the
interval 5–15 years past or the interval prior to first birth. By hypothesizing what
the biologically effective exposure measure would be, we could evaluate con-
ceptually, and to some extent empirically, the quality of our chosen measure of
current serum level. Research can examine how closely serum measures gener-
ally correspond to levels in breast tissue. Archived serum specimens from the
distant past can be examined to determine how past levels correspond to current
levels and what factors influence the degree of concordance over time. If our
only goal were to accurately measure the existing serum DDT/DDE levels, a nec-
essary but not sufficient criterion for the desired measure, then we need only en-
sure that the laboratory techniques are suitable. As important as the technical ac-
curacy of the chosen measure may be, the loss of information is often substantial
in going from the correctly ascertained exposure to the conceptually optimal
measure.
Defining the ideal exposure marker requires a focus on the exposure charac-
teristics that have the greatest potential for having a biologic influence on the
etiology of the disease. Often, exposure has many features of potential relevance,
such as timing, intensity, duration, and the specific agents within a group, each of
which requires decisions and hypotheses regarding the biologically relevant
form. As one moves outward from that unknown, biologically relevant form
of exposure, and incorporates sources of variability in exposure that are not rel-
evant to disease etiology, there is a loss of information that will tend to reduce
the strength of association with disease. Assuming DDT in breast tissue in the
interval 5–15 years past was capable of causing cancer, the association with
DDT in present-day serum would be somewhat weaker because present-day lev-
els would correlate imperfectly with integrated exposure over the desired time
period. The association with DDT in the environment would be weaker still for
the reasons noted previously. The quality of these surrogate measures, which
epidemiologic studies always rely on to one degree or another, affects the abil-
ity to identify causes of disease. We are still able to make progress in identify-
ing causes of disease even when we measure some aspect of exposure other than
the ideal, but only if the measure we choose is strongly correlated with the right
measure and the underlying association is sufficiently strong to be identifiable
despite this shortcoming. We will observe some diluted measure of association,
with the magnitude of dilution dependent on the degree of correlation between
the ideal and actual values (Lagakos, 1988). If the ideal exposure measure is
strongly related to disease, we may take comfort in still being able to observe
some modest effect of a surrogate indicator of exposure on disease, but if the
true relationship is weak, the effect may fall below a level at which it can be
detected at all.
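The dilution described above can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reproduction of any published analysis: the outcome is continuous and linearly related to the ideal exposure, the surrogate equals the ideal exposure plus independent random error, and all names and parameter values (true_slope, error_sd, and so on) are hypothetical. Under these assumptions the slope estimated with the surrogate is expected to shrink toward zero by roughly the squared correlation between the surrogate and the ideal measure, which is about one half for the values chosen here.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, true_slope=0.5, error_sd=1.0, seed=1):
    """Return the slopes estimated with the ideal and the surrogate exposure."""
    rng = random.Random(seed)
    # Ideal (biologically relevant) exposure, standardized.
    ideal = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Outcome depends on the ideal exposure plus unrelated variation.
    outcome = [true_slope * x + rng.gauss(0.0, 1.0) for x in ideal]
    # Surrogate = ideal exposure plus error unrelated to exposure or outcome.
    surrogate = [x + rng.gauss(0.0, error_sd) for x in ideal]
    return slope(ideal, outcome), slope(surrogate, outcome)


slope_ideal, slope_surrogate = simulate()
print(f"ideal measure: {slope_ideal:.2f}; surrogate measure: {slope_surrogate:.2f}")
```

Raising error_sd weakens the correlation between the surrogate and the ideal measure and pushes the estimated slope closer to zero, which is the pattern the text describes for present-day serum and, more severely, for environmental levels.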
As biological markers become increasingly diverse and accessible, there can
be confusion regarding where a refined exposure marker ends and an early dis-
ease marker begins. Indicators of a biological interaction between the exposure
and target tissue, e.g., formation of DNA adducts, illustrate exposure biomark-
ers that come very close to a biologic response relevant to, if not part of, the
process of disease causation. Biological changes indicative of early breast can-
cer would, empirically, be even more closely related to risk of clinical breast can-
cer than the most refined measure of exposure to DDT, but such events are in-
creasingly removed from the environmental source and not directly amenable to
intervention or prevention. Each step in the cascade from environmental agent
to clinical disease is of scientific interest and therefore worthy of elucidation, but
the conceptual distinction between exposure and disease needs to be retained for
considering measures to alter exposure to reduce risk of disease.
ical agent, study of caffeine from coffee alone would constitute underascertainment
of exposure, and the exposure that was assigned would be too low to the extent
that women were exposed to unmeasured sources of caffeine. A closely related
but distinct hypothesis, however, concerns the possible effect of coffee
on risk of miscarriage, in which constituents of the coffee other than caffeine are
considered as potential etiologic agents. To address this hypothesis, aggregation
of caffeinated and decaffeinated coffee would be justified. Under that hypothe-
sis, coffee alone is the appropriate entity to study. Once the hypothesis is clearly
formulated, then the ideal measure of exposure is defined, and the operational
approaches to assessing exposure can be compared with the ideal.
Nutritional epidemiology provides an abundance of opportunities for creating
exposure indices and demands clear hypotheses about the effective etiologic
agent. Much research has been focused on specific dietary constituents, such as
beta-carotene or dietary fiber, and with such hypotheses, the goal is to
comprehensively measure intake of that constituent. An alternative approach is to acknowledge
the multiplicity of constituents in foods, and develop hypotheses about fruit and
vegetable intake, for example, or even more holistically, hypotheses about dif-
ferent patterns of diet. A hypothesis about beta-carotene and lung cancer is dis-
tinct from a hypothesis about fruit and vegetable intake and lung cancer, for ex-
ample, with different demands on exposure assessment. Exposure indices must
be defined with sufficient precision to indicate which potential components of
exposure should be included and which should be excluded, and how the meas-
ure should be defined.
In practice, there are circumstances in which exposure is considered in groups
that are not optimal for considering etiology but are optimal for practical reasons
or for considering mitigation to reduce exposure. For example, there is increas-
ingly clear evidence that small particles in the air (particulate air pollution) ex-
acerbate chronic lung and heart disease and can cause premature mortality (Kat-
souyanni et al., 1997; Dominici et al., 2000). The size and chemical constituents
of those particles differ markedly, and their impact on human disease may differ
in relation to those characteristics as well. Technology now allows isolation
of small particles, those smaller than 10 microns or 2.5 microns in diameter, so that
it is feasible to regulate and monitor compliance with regulation for the particle
sizes thought to be most harmful to human health. It is not feasible, however, to
monitor the chemical constituents of those particles, and thus regulations do not consider particle
chemistry. We seek to reduce exposure to particulates, accepting that the effect
of the mixture of particles with greater and lesser amounts of harmful constituents
is accurately reflected in their average effect. Perhaps stronger associations would
be found for subsets of particles defined by their chemical constitution, but the
measured effect of particulates in the aggregate is still useful for identifying eti-
ology and suggesting beneficial mitigation of exposure.
research reports, the often implicit definitions of the “gold standard” need to be
extricated so that the actual methods of exposure assessment can be compared
to the ideal. Readers should be watchful for the temptation on the part of re-
searchers to state their goals in modest, attainable terms whereas the more etio-
logically appropriate index is less readily approximated. Problems can arise
in the choice of the ideal exposure measure as well as in implementing that
measure.
All operational exposure measures are related only indirectly to the exposure
of ultimate interest. Self-reported information clearly is removed from the eti-
ologically effective exposures in that the verbal utterance in an interview does
not constitute exposure nor does the checking of a box on a questionnaire con-
stitute the exposure that results in disease. Even biological markers are to vary-
ing degrees proxies for the disease-causing factor, with compromises made
with regard to the timing or site of assessment. Often, the measurement is
taken at the feasible rather than ideal time (e.g., present versus past), and the
assumption is made that the measure can be extrapolated to the critical time
for addressing disease etiology. Similarly, collection of accessible specimens
such as urine or blood is extrapolated to the exposure of interest in a less ac-
cessible site such as the kidneys or heart. As noted above, there is the ever-
present potential for lumping too broadly, combining irrelevant with relevant
exposures, or too narrowly, omitting key contributors to the exposure of
interest.
Epidemiologists are well aware that the use of imperfect exposure markers introduces
error, and that such imperfections should be viewed as opportunities for improvement
in study methods. Sometimes we are deceived, however, by advances in technology for
biological or environmental evaluation, believing that laboratory sophistication
automatically confers an advantage over the crudeness of self-report or paper
records. We can lose sight of the fact that even measures that employ sophisti-
cated technology also contain error relative to the desired information (Pearce et
al., 1995). Even if the technology for measuring environmental or biological spec-
imens is highly refined, the sources of error often arise at the point where the
times and sites of collection of samples are decided. Accepting that epidemiol-
ogy relies on indirect measures of the exposure of ultimate interest, the critical
questions concern how effective the proxies are and the impact of the misclas-
sification that they introduce relative to the unattainable “gold standard.” The
preferred option, of course, would be to obtain the ideal information and avoid
the need to evaluate error and assess its impact. Accepting that this can never be
achieved, there are a number of useful strategies for assessing the presence of
inventory. In either circumstance, with both the routine and superior measure
available for the subset, the values can be compared to assess how closely the
routine measure approximates the superior measure. With this information, esti-
mates and judgments can be made as to what the results would have been if the
superior measure had been applied routinely.
The more closely the routine exposure measure approximates the superior ex-
posure measure, the less exposure misclassification is likely to be present. If they
were perfectly correlated, then the “inferior” measure would simply represent a
more efficient approach to gathering exactly the same information. More typi-
cally, there will be some inaccuracy in the routine measure relative to the more
refined one, and the amount and pattern of inaccuracy can be assessed in a sub-
set of study subjects. What this effort will not reveal is how close or distant both
the routine and superior measures are from the ideal measure. They could be
close to one another, giving the impression that the routine measure is quite good,
yet both have substantial error relative to the unattainable ideal measure. Alter-
natively, there could be a major difference between the two, yet even the supe-
rior measure is far from optimal. What would be desirable but is rarely known
is not just the ordinal ranking of the quality of the alternative measures but the
absolute quality of each relative to the gold standard.
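For a dichotomous exposure, the comparison of routine and superior measures in a validation subset can be sketched as follows. The paired classifications are hypothetical, and the function name is an assumption of the sketch. Note that the resulting sensitivity and specificity describe the routine measure relative to the superior one only. They say nothing about how far either lies from the ideal measure.

```python
# Hypothetical validation subset: each pair is (routine, superior), 1 = exposed.
pairs = [(1, 1)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 15 + [(0, 0)] * 135


def agreement_with_superior(pairs):
    """Sensitivity and specificity of the routine measure, treating the
    superior measure as the reference (not as the truth)."""
    true_pos = sum(1 for routine, superior in pairs if routine == 1 and superior == 1)
    false_neg = sum(1 for routine, superior in pairs if routine == 0 and superior == 1)
    false_pos = sum(1 for routine, superior in pairs if routine == 1 and superior == 0)
    true_neg = sum(1 for routine, superior in pairs if routine == 0 and superior == 0)
    return {
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
    }


result = agreement_with_superior(pairs)
print(result)
```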
An important challenge to the interpretation of two measures that are presumed
to be ordinal in quality is that the allegedly superior measure may just be dif-
ferent without really being better. When the basic strategies are similar, e.g., diet
inventory or environmental measurements, and the superior measure has more
detail or covers a more extended time period, the probability that it is better in
absolute quality is quite high. When the approaches are qualitatively different,
however, e.g., self-report versus biological marker, there is less certainty that the
approaches can be rank-ordered as better and worse. Similarity between the su-
perior and routine measure may give a false assurance regarding the quality of
the routine measure if the allegedly superior measure really is not better. Worse
yet, the ostensibly superior measure could be worse, so that a disparity is mis-
interpreted entirely. The common temptation is to accept any biological marker
as superior to any form of self-report, and to downgrade confidence regarding
the quality of self-report when they are not in close agreement. Biological mark-
ers of exposure often reflect a precise measure for a very brief period around the
time of specimen collection, however, whereas self-report can represent an inte-
gration over long periods of time. Remembering that the ideal measure includes
consideration of the etiologically relevant time period, it is not certain that a pre-
cise measure for the wrong time period from a biological marker is superior to
an imprecise measure for the relevant time period based on self-report. Both
strategies need to be compared to the ideal measure to the extent possible.
Assuming that there is a clear gradient in quality, what such comparisons be-
tween the inferior and superior measures provide is the basis for a quantitative
assessment of the loss of information and expected reduction in measures of as-
sociation based on the inferior measure relative to the superior one (Lagakos,
1988; Armstrong, 1990). Using the readily calculated correlation coefficient be-
tween the two measures, an estimate can be generated regarding how much of
the information in the superior measure is lost by using the inferior one, using
the expression 1 − r², where r is the correlation coefficient. For example, if the
correlation is 0.5, this expression equals 0.75, so that 75% of the information
contained in the superior measure is lost by relying on the inferior one. Corre-
lations of 1 and 0 correspond to values of 0% being lost and 100% being lost.
Though a number of assumptions are made to justify this interpretation, it is a
useful general rule of thumb.
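The rule of thumb, one minus the squared correlation, is simple enough to tabulate. The short sketch below does so for a few illustrative correlations. The function name is an assumption, and the calculation inherits all of the simplifying assumptions noted in the text.

```python
def information_lost(r):
    """Approximate fraction of information lost, 1 - r**2, when an inferior
    measure with correlation r is used in place of the superior one."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    return 1.0 - r ** 2


for r in (1.0, 0.9, 0.5, 0.0):
    print(f"r = {r:.1f}: {information_lost(r):.0%} of the information lost")
```

Because the loss depends on the square of the correlation, even a seemingly respectable correlation of 0.5 leaves only a quarter of the information intact.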
An illustration of the loss of information in using an inferior compared to a
superior exposure measure was provided by Baris et al. (1999) in a methodologic
examination of exposure assignment methods for a study of residential magnetic
fields and childhood leukemia. When children lived in two homes over the course
of their lives, both were measured and integrated as a time-weighted average ex-
posure, considered the “gold standard” relative to measurements in a single home.
Various approaches to choosing the single home to measure were considered, in-
cluding the duration of occupancy and whether or not it was currently occupied.
Correlation coefficients between the gold standard and various surrogates (Table
8.1) indicate a range of values across the more limited measurement approaches,
from 0.55 to 0.95. Among the simpler indices, the home that was lived in for the
longer period provides the better proxy for the time-integrated measure, as
expected. Counter to expectation, however, when the measures of association were
computed using the various indices (Table 8.2), both former homes and currently
occupied homes yielded results closer to the operational “gold standard”
measure than did the longest lived-in home. The relative risk in the highest
exposure group was attenuated for all of the surrogate indices of exposure rela-
tive to the time-weighted average, suggesting that the extra expense of collect-
ing data on multiple homes was a worthwhile investment of resources.
posure measures, i.e., both positive, one measure positive and one measure neg-
ative, and both negative. Unless one or both of the measures is behaving very
strangely, it would be expected that these levels would correspond to a monot-
onic gradient of true exposure, less subject to misclassification than use of either
measure in isolation.
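The three-level grouping formed from two dichotomous measures can be written out explicitly. This is a minimal sketch with hypothetical names. It assumes only that each measure yields a positive or negative classification for each subject.

```python
def combined_level(measure_a_positive, measure_b_positive):
    """Return 2 if both measures are positive, 1 if exactly one is, 0 if neither."""
    return int(bool(measure_a_positive)) + int(bool(measure_b_positive))


LABELS = {
    2: "both positive",
    1: "one positive, one negative",
    0: "both negative",
}

# Hypothetical subjects: (measure A positive?, measure B positive?)
subjects = [(True, True), (True, False), (False, True), (False, False)]
levels = [combined_level(a, b) for a, b in subjects]
print([LABELS[level] for level in levels])
```

The ordered levels can then be examined for a gradient in the measure of association, which is what one would expect if they track a monotonic gradient of true exposure.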
When one of the data sources on exposure can be viewed as a “gold standard,”
it provides an opportunity to better understand and ultimately refine the routine
measure. For example, with a combination of self-reported exposure to environ-
mental tobacco smoke over long periods in the past, and short-term biochemical
markers, there is an opportunity to integrate the information to validate the self-
report. Self-reported exposure can be generated over the time frame of ultimate
interest, as well as for the brief time period reflected by the biochemical mea-
sure of exposure, i.e., the recent past. With that information and accepting the
biochemical marker as a gold standard for the recent past, predictive models can
be developed in which the self-reported information is optimized to estimate ac-
tual exposure. In the example of environmental tobacco smoke, self-report of ex-
posure in the preceding 24 or 48 hours might be queried, for which the bio-
chemical indicator would be a legitimate gold standard. With that quantitative
prediction model now in hand, the questionnaire components for the period of
etiologic relevance, typically a prolonged period in the past, can be weighted to
generate a more valid estimate of historical exposure. The data would be used to
determine which self-reported items are predictive of measured exposure to en-
vironmental tobacco smoke and the magnitude of the prediction, through the use
of regression equations. Although there is no direct way to demonstrate that this
extrapolation from prediction over short periods to prediction over long periods
is valid, the availability of a “gold standard” for brief periods offers some as-
surance. The development of the predictive model linking self-reported exposure
data to biological markers need not include all the study subjects and could be
done on similar but not identical populations. The relationship between perceived
experiences and actual exposure may well differ among different populations,
however, suggesting that the validation be done on the study population or a
group that is quite similar to the actual study subjects.
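The calibration strategy described above can be sketched with a single predictor. Everything here is hypothetical: the data, the variable names, and the simplification to one self-reported item, whereas an actual application would likely weight several questionnaire items. The regression is fitted for the recent period, where the biochemical marker is accepted as a gold standard, and the resulting equation is then applied to self-report for the period of etiologic relevance.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical validation data for the recent past: self-reported hours of
# exposure to environmental tobacco smoke in the preceding 48 hours, and the
# biochemical marker level (arbitrary units) measured at the same time.
recent_hours = [0, 1, 2, 4, 6, 8, 10]
marker_level = [0.2, 0.9, 1.1, 2.3, 2.9, 4.2, 4.8]

intercept, slope = fit_line(recent_hours, marker_level)


def predicted_marker(reported_hours):
    """Marker-equivalent exposure predicted from self-reported hours."""
    return intercept + slope * reported_hours


# Apply the same equation to self-report for a typical 48-hour period in the
# etiologically relevant past (an extrapolation that cannot be verified directly).
historical_estimate = predicted_marker(5)
print(f"slope = {slope:.2f}, historical estimate = {historical_estimate:.2f}")
```

As the text cautions, the step from prediction over short periods to prediction over long periods rests on an assumption, and the fitted equation may not transfer to populations unlike the one in which it was developed.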
Multiple exposure indicators also may be used when it is unclear which is the
most influential on the health outcome of interest. A sizable body of research has
addressed particulate air pollution and health, particularly morbidity and mortal-
ity from cardiovascular and respiratory disease. As the research has evolved, there
has been an increasing focus on the small particles, those less than 10 µm or 2.5 µm
in diameter. In a large cohort study of participants in the American Cancer
Society’s Cancer Prevention II Study, air pollution measures in metropolitan
areas were examined in relation to mortality through 1998 (Pope et al., 2002).
To examine the nature of the relationship between particulate air pollution and
mortality, a range of indices were considered, defined by calendar time of mea-
surement, particle size, and sulfate content (Fig. 8.1). These results suggest once
[Figure 8.1: panel A, All-Cause Mortality; panel B, Cardiopulmonary Mortality; vertical axis, RR (95% CI).]
FIGURE 8.1. Adjusted mortality relative risk (RR) ratio evaluated at subject-weighted mean
concentration of particulate and gaseous air pollution in metropolitan areas, American
Cancer Society Cancer Prevention II Study (Pope et al., 2002). PM2.5 indicates particles
measuring less than 2.5 µm in diameter; PM10, particles measuring less than 10 µm in
diameter; PM15, particles measuring less than 15 µm in diameter; PM15-2.5, particles
measuring between 2.5 and 15 µm in diameter; and CI, confidence interval.
again that the fine particles (<2.5 µm) are more clearly associated with mortality
from cardiopulmonary disease and lung cancer, and with total mortality, as compared
to inhalable coarse particles or total suspended particulates. The associations
for measurements in different calendar times are similar to one another, and
sulfate particles are similar to fine particles in their effects. Examination of mul-
tiple indices suggests that the associations are rather consistent across measure-
ment period but specific to fine particles.
hypertension and that all of the available measures reflect the magnitude of that
association imperfectly.
TABLE 8.3. Odds Ratios for Lipid-Adjusted DDE and PCBs and Breast Cancer, Stratified
by Parity and History of Breastfeeding, North Carolina, 1993–1996
                        CASES   CONTROLS   OR* (95% CI)        OR† (95% CI)
Nulliparous
  DDE‡
    <0.394                 41         25   Referent
    0.394 to 1.044         36         23   1.28 (0.59–2.78)    1.24 (0.54–2.82)
    ≥1.044                 35         17   1.87 (0.67–5.20)    1.48 (0.49–4.46)
  Total PCBs§
    <0.283                 37         25   Referent
    0.283 to 0.469         39         21   1.51 (0.69–3.33)    2.06 (0.88–4.85)
    ≥0.469                 36         19   1.74 (0.67–4.54)    1.62 (0.57–4.58)
what below the null. Such results, though suggestive at most, may be reflective
of the superiority of the serum measures as an exposure indicator for women who
had never breastfed and accurately reflect a small increase in risk associated with
the exposure.
As illustrated by the above example, information on predictors of accuracy
in exposure classification can be used to create homogeneous strata across
which the validity of exposure data should vary in predictable ways. All other
influences being equal, those strata in which the exposure data are better would
be expected to yield more accurate measures of association with disease than
those strata in which the exposure data are more prone to error. Identifying
gradients in the estimated validity of the exposure measure and examining
patterns of association across those gradients serve two purposes: they provide
useful information for evaluating the impact of exposure misclassification, and
they generate estimates for subsets of persons in whom the error is least
severe. Note that it is not helpful to adjust for indicators of data quality as
though they were confounding factors; rather, one should stratify and determine
whether measures of association differ across levels of hypothesized exposure
data quality.
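The stratification described above can be sketched in a few lines. The counts below are hypothetical and are not drawn from any study discussed here; the function name and the use of a Woolf (log-based) confidence interval are choices made only for illustration.

```python
import math


def odds_ratio_ci(exp_cases, unexp_cases, exp_controls, unexp_controls, z=1.96):
    """Odds ratio with a Woolf (log-based) confidence interval for one 2x2 table."""
    odds_ratio = (exp_cases * unexp_controls) / (unexp_cases * exp_controls)
    se_log_or = math.sqrt(
        1 / exp_cases + 1 / unexp_cases + 1 / exp_controls + 1 / unexp_controls
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: (exposed cases, unexposed cases,
#                       exposed controls, unexposed controls)
strata = {
    "higher-quality exposure data": (60, 40, 40, 60),
    "lower-quality exposure data": (55, 45, 48, 52),
}

# Report one estimate per stratum rather than a single adjusted estimate.
for label, counts in strata.items():
    odds_ratio, lower, upper = odds_ratio_ci(*counts)
    print(f"{label}: OR = {odds_ratio:.2f} ({lower:.2f}-{upper:.2f})")
```

In this invented example the stratum with better exposure data shows the stronger association, which is the pattern one would expect if misclassification were diluting the estimate in the other stratum.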
The quality of women’s self-reported information on reproductive history and
childhood social class was evaluated in a case–control study of Hodgkin’s dis-
ease in northern California using the traditional approach of reinterviewing some
time after the initial interview (Lin et al., 2002). Twenty-two cases and 24 con-
trols were reinterviewed approximately eight months after their initial interview,
and agreement was characterized by kappa coefficients for categorical variables
and intraclass correlation coefficients for continuous measures. Whereas cases
and controls showed similar agreement, education was rather strongly associated
with the magnitude of agreement (Table 8.4). Across virtually all the measures,
women who had more than a high school education showed better agreement
than women of lower educational level, suggesting that the more informative re-
sults from the main study would be found within the upper educational stratum.
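For readers less familiar with the kappa coefficient, a minimal sketch of its calculation from an interview-by-reinterview table follows. The tables are hypothetical and do not reproduce the data of Lin et al. (2002); they simply contrast a stratum with good agreement against one with poorer agreement.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table.

    table[i][j] is the number of respondents who gave category i at the
    first interview and category j at the reinterview.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, given the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical yes/no item, tabulated separately by educational level.
more_than_high_school = [[18, 2], [2, 18]]
high_school_or_less = [[14, 6], [6, 14]]

print(round(cohen_kappa(more_than_high_school), 2))
print(round(cohen_kappa(high_school_or_less), 2))
```

Observed agreement is 90% and 70% in the two hypothetical strata, and because chance agreement is 50% in both, kappa works out to 0.8 and 0.4 respectively.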
Non-specific markers of exposure data quality such as age or education may
also yield strata that differ in the magnitude of association for reasons other than
the one of interest. There may be true effect measure modification by those at-
tributes, in which exposure really has a different impact on the young compared
to the old, or there may be other biases related to non-response or disease mis-
classification that cause the association to differ. When the identification of per-
sons who differ in the quality of their exposure assignment is based on specific
factors related to exposure, such as having been chosen for a more thorough pro-
tocol or having attended a clinic or worked in a factory in which more extensive
exposure data were available, then the observed pattern of association is more
likely to be reflective of the accuracy of the exposure marker as opposed to other
correlates of the stratification factor.
is the pattern of drug use among friends. Therefore, it would be expected that
those who are using illicit drugs would also report having friends who do so.
Thus, to avoid at least part of the sensitivity and stigma associated with such be-
havior, questionnaires might include items pertaining to drug use among friends,
something that respondents may be more willing to admit to than their own drug
use. Such information can be used to determine whether the expected positive
association is found with self-reported drug use, and also to create a category of
uncertain drug use when the individual reports not using drugs but having friends
who do so.
Another illustration of ascertaining and using information on the predictors of
exposure is often applied in the assessment of use of therapeutic medications.
There are certain illnesses or symptoms that serve as the reasons for using those
medications, and the credibility of reports of drug use (or even non-use) can be
evaluated to some extent by acquiring information on the diseases that the drug
is used to treat. When a respondent reports having an illness that is a known
indication for a specific medication, along with recall of using that
medication, confidence is enhanced that they are accurately reporting the
medication use. Those who had an illness that should have resulted in use of the
drug but
did not report doing so, and those who reported using the medication but with-
out having reported an illness for which that medication is normally used, would
be assigned a less certain exposure status.
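The classification rule in the preceding paragraph can be written out explicitly. This is only a sketch: the function name and category labels are invented for illustration, and treating concordant non-use (no reported use, no reported indication) as more certain is an assumption that goes slightly beyond what the text states.

```python
def medication_exposure_certainty(reports_use: bool, reports_indication: bool) -> str:
    """Combine reported medication use with the reported illness it treats.

    Concordant reports are treated as more certain; discordant reports
    are flagged as less certain rather than being discarded.
    """
    if reports_use and reports_indication:
        return "exposed, more certain"
    if not reports_use and not reports_indication:
        # Assumption: concordant non-use is regarded as more credible.
        return "unexposed, more certain"
    if reports_use:
        return "exposed, less certain"  # use reported without an indication
    return "unexposed, less certain"  # indication reported without use


for use in (True, False):
    for indication in (True, False):
        print(use, indication, medication_exposure_certainty(use, indication))
```

Analyses could then be repeated with the less certain categories excluded or examined separately, in the spirit of the stratified approach described earlier.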
In a study of the potential association between serum selenium and the risk of
lung and prostate cancer among cigarette smokers, Goodman et al. (2001) pro-
vided a rather detailed analysis of predictors of serum selenium concentrations
(Table 8.5). In addition to addressing concerns with the comparability of collec-
tion and storage methods across study sites, they were able to corroborate the
expected reduction in serum selenium levels associated with intensity and
recency of cigarette smoking. Even though background knowledge offers limited
help in anticipating what patterns to expect, confirming the inverse association
with smoking adds confidence that the measurements were done properly and are
more likely to be capturing the desired exposure.
Even when the linkage of antecedent to exposure is less direct, as in the case
of social and demographic predictors, there may still be value in assessing ex-
posure predictors as a means of evaluating the accuracy of exposure information.
Weaker associations with exposure or those that are less certain will be less
contributory but can help to provide at least some minimal assurance that the
exposure information is reasonable. In assessing the consequences of otitis
media in children for subsequent development, the known positive association of
the exposure with attendance in day care and sibship size, and its patterns of
occurrence by age (Hardy & Fowler, 1993; Zeisel et al., 1999), may be helpful in
verifying that otitis media has been accurately documented. As always, when the
data conflict with prior expectations, the possibility that prior expectations
were wrong
Gender 0.30
Male 11.55 (0.07)
Female 11.75 (0.17)
Race 0.99
White 11.58 (0.07)
African American 11.57 (0.32)
Other/Unknown 11.64 (0.37)
are not fully understood or the study would be of little value. When data can be
obtained to assess whether strongly expected associations with exposure are
found, in the same sense as a positive control in experimental studies, confidence
regarding the validity of exposure measurement is enhanced. If expected associ-
ations are not found, serious concern is raised about whether the exposure is re-
flective of the construct that was intended.
In a cohort study designed to examine the effect of heavy alcohol use on the
risk of myocardial infarction, we might examine the association between our
measure of heavy alcohol use and risk of cirrhosis, motor vehicle injuries, or de-
pression, since these are established with certainty as being associated with heavy
alcohol use. If feasible, we might evaluate subclinical effects of alcohol on liver
function. If the study participants classified as heavy alcohol users did not show
any increased risk for health problems known to be associated with heavy alco-
hol use, the validity of our assessment of alcohol use would be called into ques-
tion. If the expected associations were found, then confidence in the accuracy of
the measure would be enhanced.
An examination of the effect of exercise on depression made use of known
consequences of physical activity to address the validity of the assessment, il-
lustrating this strategy. In a prospective cohort study of older men and women
in Southern California, the Rancho Bernardo Study (Kritz-Silverstein et al.,
2001), exercise was addressed solely with the question “Do you regularly engage
in strenuous exercise or hard physical labor?” Given the simplicity and brevity
of this query, it is reasonable to ask whether it has successfully isolated groups
with truly differing activity levels. Building on previous analyses from this co-
hort, they were able to refer back to an examination of predictors of cardiovas-
cular health. Reaven et al. (1990) examined a number of correlates and potential
consequences of self-reported engagement in physical activity (Table 8.6). There
was clear evidence of a lower heart rate and higher measure of HDL cholesterol
among men and women who responded affirmatively to the question about reg-
ular exercise. Though differences were not large, they do indicate some validity
to the question in distinguishing groups who truly differ in level of physical ac-
tivity. Note that the ability to make this inference depends on the knowledge that
activity is causally related to these measures, which must come from evidence
obtained outside the study.
Ideally, we would like to go beyond this qualitative use of the information to
increase or decrease confidence in the accuracy of exposure measures and
actually quantify the quality of exposure data. Few exposure–disease
associations are understood with sufficient precision, however, to enable us to
go beyond verifying their presence and examine whether the actual magnitude of
association is at the predicted level or whether exposure misclassification has
resulted in attenuation of the association. For the few exposure–disease
associations for which the magnitude of association is known with some
certainty, e.g., smoking and lung can-
TABLE 8.6. Relation Between Regular Exercise Status and Mean Plasma Lipid and
Lipoprotein Levels and Other Variables Adjusted for Age in Men and Women Aged 50
to 89 Who Were Not Currently Using Cholesterol-Lowering Medications, Rancho
Bernardo, California, 1984–1987

                                 Regular Exercise in          Regular Exercise in
                                 Men (n = 1019)               Women (n = 1273)
                                 NO      YES     P VALUE      NO      YES     P VALUE
Other Variables
  Age* (year)                    72.0    67.4    0.001        70.8    67.6    0.001
  BMI                            26.1    25.8    0.12         24.6    24.1    0.10
  WHR                            0.918   0.909   0.03         0.798   0.788   0.04
  Cigarettes (per day)           2.8     0.5     0.001        2.3     2.0     0.53
  Alcohol (mL/week)              131.6   115.0   0.14         78.2    81.3    0.69
  Heart rate (beats per minute)  61.9    58.9    0.001        64.5    62.0    0.001
  PME use (% who use)            —       —       —            27.0    34.0    0.03
disease risk, then, by definition, more of that exposure will result in a greater
probability of developing the disease. We will observe a dose-response gradient,
in which increasing exposure results in an increasing risk of disease. The re-
strictions and uncertainties inherent in this evaluation should be recognized
(Weiss, 1981). The critical aspect of the exposure that will yield increasing risk
of disease may not be obvious, with the default approach based solely on some
index of cumulative dose subject to uncertainty and error. What may be more
important than the total amount of exposure is the form of the exposure, its bi-
ological availability, peak exposures, when it occurs, etc. A rather detailed un-
derstanding of the biologic process linking exposure and disease is required to
quantify the relevant dose accurately. Furthermore, the shape of the dose-response
function, if one is present at all, will vary across levels of exposure, potentially
having a subthreshold range in which there is no response with increasing dose
as well as ranges in which the maximum response has been attained and dose no
longer matters. If the exposure levels available for study all fall below or
above the range in which disease risk responds to varying exposure, then no
dose-response gradient will be found. Nonetheless, the potential for identifying
gradients in disease risk in relation to varying levels of exposure is always
worthy of careful evaluation. When such a gradient is observed, it is
informative and can support a causal hypothesis, but when it is absent, a
causal association is by no means negated.
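A small numerical sketch may help to show why the absence of a gradient does not negate a causal association. The dose-response function below is entirely hypothetical (a flat subthreshold range, a linear rise, and a plateau), and all parameter values are invented for illustration.

```python
def risk(dose, baseline=0.01, threshold=10.0, saturation=50.0, max_excess=0.04):
    """Hypothetical dose-response: no effect below the threshold, a linear
    rise between threshold and saturation, and a plateau above saturation."""
    if dose <= threshold:
        return baseline
    if dose >= saturation:
        return baseline + max_excess
    return baseline + max_excess * (dose - threshold) / (saturation - threshold)


def observed_gradient(doses):
    """Risk difference between the highest and lowest dose groups studied."""
    return risk(max(doses)) - risk(min(doses))


# Three hypothetical studies that sample different parts of the dose range.
below_threshold = [1, 4, 8]
spanning_range = [5, 30, 60]
above_saturation = [55, 70, 90]

for name, doses in [
    ("all below threshold", below_threshold),
    ("spanning responsive range", spanning_range),
    ("all above saturation", above_saturation),
]:
    print(f"{name}: risk difference = {observed_gradient(doses):.3f}")
```

The exposure is causal in all three hypothetical studies, yet only the study whose exposure levels span the responsive range would observe a gradient.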
The hypothesized etiologic relationship under study should include at least a
general specification of the type of exposure that would be expected to increase
the risk of disease. The hypothesis and available data need to be scrutinized care-
fully for clues to the aspects of exposure that would be expected to generate
stronger relations with disease. Total amount of exposure, generally measured as
intensity integrated over time, is commonly used. Even in the absence of any
measures of intensity, the duration of exposure may be relevant. More complex
measures such as maximum intensity or average intensity over a given time pe-
riod may also be considered.
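The alternative dose measures just listed can be computed from the same exposure history. The sketch below assumes a hypothetical history recorded as episodes of constant intensity; the class and field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    years: float      # duration of the episode
    intensity: float  # hypothetical exposure units per year during the episode


def dose_metrics(history):
    """Summarize one person's exposure history under several dose definitions."""
    duration = sum(e.years for e in history)
    cumulative = sum(e.years * e.intensity for e in history)
    return {
        "duration": duration,
        "cumulative": cumulative,  # intensity integrated over time
        "peak": max(e.intensity for e in history),
        "average": cumulative / duration,
    }


# Two hypothetical people with identical cumulative exposure.
short_and_intense = [Episode(years=2, intensity=10), Episode(years=5, intensity=2)]
long_and_steady = [Episode(years=10, intensity=3)]

print(dose_metrics(short_and_intense))
print(dose_metrics(long_and_steady))
```

Both histories yield a cumulative dose of 30 units, but they differ in duration, peak, and average intensity, so the ranking of the two people depends on which measure the etiologic hypothesis favors.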
For example, assume we are evaluating the relationship between oral contra-
ceptive use and the development of breast cancer. If the hypothesis suggests that
dose be defined by cumulative amount of unopposed estrogen, we would exam-
ine specific formulations of oral contraception and the duration of time over
which those formulations were used. If the hypothesis concerns the suppression
of ovulation, then the total number of months of use might be relevant. If the
hypothesis concerns a permanent change in breast tissue brought about by oral
contraceptive use prior to first childbirth, we would construct a different dose
measure, one that is specific to the interval between becoming sexually active
and first pregnancy.
In a study of oral contraceptives and breast cancer in young women (<45
years of age), Brinton et al. (1995) examined several such dose measures. A
multisite case–control study enrolled 1648 breast cancer cases and 1505
controls, the latter identified through random-digit dialing, with complete data
for analysis of oral contraceptive use. To examine varying combinations of
duration of use and timing of use (by calendar time and by age), relative risks
were calculated for a number of different dose measures (Table 8.7). Except for
some indication of increased risk with more recent use, no striking patterns
relative to duration and timing of oral contraceptive use were identified.
There is no universally applicable optimal measure because the specific
features of the etiologic hypothesis lead to different ideal exposure measures.
Multiple hypotheses of interest can be evaluated in the same study, each
hypothesis leading to a specific exposure measure. In many cases, the various
indices of exposure will be highly correlated with one another. On the positive
side, even if we specify the wrong index of exposure (relative to the
etiologically effective measure), it will often be sufficiently correlated with
the correct one to observe a dose-response gradient. On the other hand,
correlations among candidate indices of exposure make it difficult or sometimes
impossible to isolate the critical aspect of exposure.

TABLE 8.7. Relative Risks and 95% Confidence Intervals of Breast Cancer by
Combined Measures of Oral Contraceptive Use Patterns: Women Younger Than 45
Years of Age, Multicenter Case-Control Study, 1990–1992

                               Used 6 months
                               to 5 years             Used 5–9 years         Used ≥10 years
                               NO.  RR (95% CI)       NO.  RR (95% CI)       NO.  RR (95% CI)
No. of Years Since First Use
  <15                          136  1.44 (1.1–1.9)     66  1.55 (1.0–2.3)     22  1.27 (0.7–2.4)
  15–19                        221  1.14 (0.9–1.4)    120  1.45 (1.1–2.0)     93  1.58 (1.1–2.2)
  ≥20                          292  1.29 (1.0–1.6)    190  1.10 (0.8–1.4)    119  1.11 (0.8–1.5)
No. of Years Since Last Use
  <5                            80  1.66 (1.1–2.4)     87  1.49 (1.0–2.1)    131  1.37 (1.0–1.8)
  5–9                           66  1.28 (0.9–1.9)     71  1.49 (1.0–2.2)     66  1.13 (0.8–1.7)
  ≥10                          503  1.21 (0.9–1.5)    218  1.14 (0.9–1.5)     37  1.34 (0.8–2.3)
Age at First Use, Years
  <18                           79  1.04 (0.7–1.5)     80  1.55 (1.1–2.2)     75  1.47 (1.0–2.2)
  18–21                        342  1.32 (1.1–1.6)    224  1.21 (0.9–1.5)    122  1.12 (0.8–1.5)
  ≥22                          228  1.29 (1.0–1.6)     72  1.21 (0.8–1.8)     37  1.68 (0.9–3.0)

*Adjusted for study site, age, race, number of births, and age at first birth.
All risks relative to women with no use of oral contraceptives or use for less
than 6 months (389 patients and 431 control subjects).
RR, relative risk; CI, confidence interval.
Brinton et al., 1995.
Definitions
Consideration of the pattern of error is critical to evaluating its likely impact on
measures of association. The most important question is whether the pattern of
error in exposure ascertainment varies in relation to disease status. If the nature
of the error is identical for persons with and without the disease of interest, the
misclassification is referred to as nondifferential. If the pattern of exposure mis-
classification varies in relation to disease status, it is called differential misclas-
sification. In the absence of other forms of error, nondifferential misclassifica-
tion of a dichotomous exposure indicator leads to a predictable bias towards the
null value for the measure of the exposure–disease association (Copeland et al.,
1977). A number of exceptions to this principle have been identified, for exam-
ple, multiple exposure levels in which misclassification may occur across non-
adjacent categories (Dosemeci et al., 1990) and categorization of a continuous
measure (Greenland, 1995). Nonetheless, the determination of whether disease
status is affecting the pattern of error in exposure assignment remains a critical
step in assessing the potential consequences of misclassification. When the qual-
ity of exposure assignment differs in relation to disease status, there are no read-
ily predictable consequences and the direction of bias needs to be assessed on a
case-by-case basis. In contrast, in many, even most, cases of nondifferential
misclassification, the bias tends to be toward the null value.
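The contrast between nondifferential and differential misclassification of a dichotomous exposure can be illustrated numerically. The true counts and the sensitivity and specificity values below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected observed counts (exposed, unexposed) after imperfect classification."""
    obs_exposed = sensitivity * exposed + (1 - specificity) * unexposed
    obs_unexposed = (1 - sensitivity) * exposed + specificity * unexposed
    return obs_exposed, obs_unexposed


def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


# Hypothetical true distribution: the true odds ratio is 3.0.
cases = (100, 100)     # (truly exposed, truly unexposed)
controls = (50, 150)
true_or = odds_ratio(*cases, *controls)

# Nondifferential: the same sensitivity and specificity in cases and controls.
nondifferential_or = odds_ratio(
    *misclassify(*cases, sensitivity=0.8, specificity=0.9),
    *misclassify(*controls, sensitivity=0.8, specificity=0.9),
)

# Differential: cases report true exposure more completely than controls.
differential_or = odds_ratio(
    *misclassify(*cases, sensitivity=0.95, specificity=0.9),
    *misclassify(*controls, sensitivity=0.7, specificity=0.9),
)

print(f"true OR            = {true_or:.2f}")
print(f"nondifferential OR = {nondifferential_or:.2f}")
print(f"differential OR    = {differential_or:.2f}")
```

With these invented values the nondifferential scenario pulls the odds ratio from 3.0 toward the null (about 2.2), whereas the differential scenario pushes it away from the null (about 3.3). Other differential patterns could just as easily bias in the opposite direction, which is why each case must be assessed individually.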
The source of exposure misclassification that arises from the disparity between
the ideal, etiologically relevant exposure indicator and the feasible, operational
definition of exposure would for the most part apply equally to persons with and
without disease. That is, inaccuracy in exposure assessment that results from in-
herent limitations in the ability to capture the construct of interest would usually
be independent of disease status. In contrast, the errors that arise going from the
operational definition to the acquisition of the relevant data through monitoring,
interviews, and specimen collection are often susceptible to distortion related to
the past or even future occurrence of disease. There are almost always dispari-
ties between the operational definition of exposure and the information on which
that assignment is based, and the processes by which such errors arise (recall,
perception, behaviors, physiology) often differ in relation to disease status.
evolving disease itself. Early stages of disease or disease precursors could alter
the measure of exposure and produce an association between measured exposure
and disease that has no etiologic significance.
Research on chlorinated hydrocarbons and breast cancer illustrates the con-
cerns that can arise about whether the disease or its treatment might affect mea-
sured serum levels of DDT, DDE, and other stored compounds (Ahlborg et al.,
1995). If, in fact, early stages of breast cancer result in the release of some of
the persistent organochlorines that are stored in fat tissue, then women who go
on to develop clinically apparent breast cancer will have had some period prior
to diagnosis during which blood levels of those compounds were elevated, solely
as a result of the early disease process. A prospective study that uses measure-
ments obtained in the period shortly before case diagnosis could be affected. In
a case–control study in which blood levels of organochlorines are measured af-
ter diagnosis and treatment of disease, there is even greater susceptibility to hav-
ing case–control differences arise as a result of metabolic changes among cases.
The problem arises when the measure of exposure is also, in part, reflecting the
consequence of the disease itself.
One way to evaluate the potential influence of disease on measures of expo-
sure is through a thorough understanding of the biological determinants of the
exposure marker, allowing assessment of whether the disease process would be
expected to modify the measure. Obviously, this requires substantive knowledge
about the often complex pathways affecting the measure and a thorough under-
standing of the biologic effects over the course of disease development. In the
case of breast cancer and chlorinated hydrocarbons, the pathways are quite com-
plex and make it difficult to predict the combined effect of the disease on me-
tabolism, influence of disease-associated weight loss, etc.
A better approach is to empirically assess the influence of disease on the ex-
posure marker, through obtaining measurements before the disease has occurred,
ideally even before the time when disease precursors would have been present,
as well as after disease onset. The interest is in exposure during the etiologic pe-
riod before the disease has begun to develop, so the measure prior to the onset
of the disease can be viewed as the “gold standard” and the accuracy of the mea-
sure taken after onset can be evaluated for its adequacy as a proxy. This requires
having stored specimens for diseases that are relatively rare and develop over
long periods of time, so that a sufficient number of prediagnosis measurements
are available. One study was able to use specimens obtained long before the man-
ifestation of disease (Krieger et al., 1994), but did not report any comparisons of
prediagnosis with postdiagnosis measures from the same women. Another study did
evaluate the effect of treatment for breast cancer on such markers (Gammon et
al., 1996), and verified that the values just prior to initiation of treatment were
quite similar to those observed shortly after treatment began, addressing at least
one of the hypothesized steps at which distortion might arise.
Even if the ideal data are not available, strategies exist for indirectly assess-
ing the likely impact, if any, of disease on the measure of exposure. The disease
may have varying degrees of severity, with influences on exposure measures
more likely for severe than mild forms of the disease. For example, breast can-
cer ranges from carcinoma in situ, which should have little if any widespread bi-
ologic effects, to metastatic disease, with profound systemic consequences. If the
association between chlorinated hydrocarbons and breast cancer were similar
across the spectrum of disease severity, it would be unlikely to merely reflect
metabolic changes associated with disease given that the severity of those meta-
bolic changes would be quite variable among breast cancer cases. Examining the
pattern of results across the spectrum of disease severity would reveal the extent
to which the disease process had altered measurements, with results for the least
severe disease (carcinoma in situ) most valid and the results for late-stage dis-
ease least valid.
The timing of exposure ascertainment relative to the onset of disease has been
examined as well to indicate the likely magnitude of distortion. A series of stud-
ies in the 1970s and early 1980s had suggested that low levels of serum choles-
terol were related to development of a number of types of cancer, with choles-
terol assessed prior to the onset of disease (Kritchevsky & Kritchevsky, 1992).
Nonetheless, there was great concern with the possibility that even cancer in its
early, preclinical stage may have affected the serum cholesterol levels. Under
this scenario, cholesterol levels measured within a limited period, say 6 to
12 months, before the diagnosis of cancer may have been low because of the
developing cancer itself. The approach taken to assess this problem has been to
examine risk stratified by time since the measurement in longitudinal studies,
in order to determine whether the pattern of association with disease differs
across
time. It would be more plausible that preclinical cancer that became manifest in
the first 6 months following cholesterol measurement had affected the choles-
terol measure than it would for cancers diagnosed 5 years or 10 years after cho-
lesterol measurement.
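Stratifying by time since measurement can be sketched as follows. The cohort is entirely hypothetical, and the data layout (one pair per person giving years of follow-up and whether cancer was diagnosed at the end of that time) is an assumption made for illustration.

```python
def window_summary(records, start, end):
    """Cancers and person-years within (start, end] years after measurement.

    records: (years_followed, developed_cancer) pairs, where years_followed
    is the time from cholesterol measurement to diagnosis or censoring.
    """
    events, person_years = 0, 0.0
    for years_followed, developed_cancer in records:
        time_in_window = min(years_followed, end) - start
        if time_in_window <= 0:
            continue  # follow-up ended before this window opened
        person_years += time_in_window
        if developed_cancer and years_followed <= end:
            events += 1
    return events, person_years


def rate_ratio(low_chol, high_chol, start, end):
    """Incidence rate ratio (low vs. high cholesterol) within one time window."""
    low_events, low_py = window_summary(low_chol, start, end)
    high_events, high_py = window_summary(high_chol, start, end)
    return (low_events / low_py) / (high_events / high_py)


# Hypothetical cohort: 100 people per group, followed for up to 10 years.
low_chol = [(0.4, True), (0.6, True), (0.8, True), (6.0, True)] + [(10.0, False)] * 96
high_chol = [(0.5, True), (5.0, True)] + [(10.0, False)] * 98

print(f"first year: {rate_ratio(low_chol, high_chol, 0, 1):.2f}")
print(f"years 1-10: {rate_ratio(low_chol, high_chol, 1, 10):.2f}")
```

In this invented cohort the excess risk among those with low cholesterol is confined to the first year after measurement (rate ratio about 3.0, versus about 1.0 thereafter), the pattern that would point toward an effect of preclinical disease on the measurement.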
Disease May Cause Exposure. For certain types of exposure, it is possible for
early disease not just to distort the measure of exposure, as described above, but
also to actually cause the exposure. This is especially problematic for exposures
that are closely linked to the early symptoms of the disease of concern, which
may include medications taken for those symptoms or variants of the symptoms
themselves. Prior to being diagnosed, early disease may lead to events or be-
haviors that can be mistakenly thought to have etiologic significance. The causal
sequence is one in which preclinical disease results in exposure, and then the dis-
ease evolves to become clinically recognized.
Among the candidate influences on the etiology of brain tumors are a number
of diseases or medications, including the role of epilepsy and drugs used to
control epilepsy (White et al., 1979; Shirts et al., 1986). It is clear that epilepsy
precedes the diagnosis of brain tumors, and medications commonly used to treat
epilepsy are more commonly taken prior to diagnosis by brain tumor cases than
controls in case–control studies. What is not clear is whether the early symptoms
of brain tumors, which are notoriously difficult to diagnose accurately in their
early stages, include epilepsy, such that the disease of interest (brain tumor) is
causing the exposure (epilepsy and its treatments). Similar issues arise in study-
ing medications used to treat early symptoms of a disease, e.g., over-the-counter
medications for gastrointestinal disturbance as a possible cause of colon cancer.
Similarly, undiagnosed chronic disease has the potential to distort certain ex-
posures of interest as potential causes, illustrated in a recent study of depression
as a potential influence on the risk of cancer. In a large cohort study in Denmark,
Dalton et al. (2002) evaluated depression and other affective disorders in
relation to the incidence of cancer. By stratifying their results by years of
follow-up (Table 8.8), they were able to consider the time course
of depression and cancer incidence to better understand the etiologic significance
of the results. Focusing on the results for the total cohort, note that brain cancer
risk was substantially elevated for the first year of follow-up only, returning to
near baseline thereafter. Although it could be hypothesized that depression is
causally related to brain cancer with a short latency, it seems much more likely
that undiagnosed brain cancer was a cause of the depressive symptoms given
what is known about the time course of cancer development. That is, the disease
of interest, brain cancer, undiagnosed at the time of entry into the cohort, may
well have caused the exposure of interest, depression. It is much less plausible
that undiagnosed brain cancer would become manifest only many years after entry
and distort measures of association years later, so the overall pattern is
suggestive of reverse causality.
TABLE 8.8. Standardized Incidence Ratios for All Types of Cancer Combined and
for Tobacco-Related Cancers, by Diagnostic Group, in Patients Hospitalized with
an Affective Disorder in Denmark, 1969–1993

                                 Portion of Follow-up Period
                                 Total                     First Year of Follow-up   1–9 Years of Follow-up    ≥10 Years of Follow-up
DIAGNOSIS                          OBS  SIR*  95% CI         OBS  SIR   95% CI         OBS  SIR   95% CI         OBS  SIR   95% CI
Total Cohort                      9922  1.05  1.03, 1.07     654  1.19  1.11, 1.29    4655  1.02  0.99, 1.05    4613  1.07  1.03, 1.10
  Tobacco-related cancers†        2813  1.21  1.16, 1.25     182  1.37  1.18, 1.59    1224  1.09  1.03, 1.16    1407  1.30  1.23, 1.37
  Non-tobacco-related cancers     7109  1.00  0.98, 1.02     472  1.14  1.04, 1.25    3431  1.00  0.97, 1.03    3206  0.99  0.95, 1.02
    Brain cancer                   277  1.18  1.00, 1.32      46  3.27  2.39, 4.36     142  1.24  1.04, 1.46      89  0.84  0.67, 1.03
    Other                         6832  0.99  0.97, 1.02     426  1.06  0.97, 1.17    3289  0.99  0.96, 1.02    3117  0.99  0.96, 1.03
Diagnostic Level‡
Bipolar psychosis                 1217  0.99  0.93, 1.03      62  1.16  0.89, 1.49     557  1.00  0.92, 1.09     598  0.94  0.87, 1.02
  Tobacco-related cancers†         292  0.92  0.82, 1.04      18  1.30  0.77, 2.06     133  0.94  0.78, 1.11     141  0.88  0.74, 1.04
  Non-tobacco-related cancers      925  1.00  0.93, 1.06      44  1.11  0.80, 1.48     424  1.02  0.93, 1.12     457  0.97  0.88, 1.06
    Brain cancer                    26  0.82  0.53, 1.20       4  2.72  0.73, 6.97      12  0.81  0.42, 1.42      10  0.64  0.31, 1.17
    Other                          899  1.00  0.94, 1.07      40  1.04  0.75, 1.42     412  1.03  0.93, 1.13     447  0.98  0.89, 1.07
Unipolar Psychosis                4345  0.98  0.95, 1.01     290  1.00  0.89, 1.12    2176  0.94  0.90, 0.98    1879  1.03  0.99, 1.08
  Tobacco-related cancers†        1144  1.05  0.99, 1.11      76  1.07  0.84, 1.34     538  0.95  0.87, 1.03     530  1.18  1.08, 1.29
  Non-tobacco-related cancers     3201  0.96  0.93, 1.00     214  0.98  0.85, 1.12    1638  0.94  0.90, 0.99    1349  0.98  0.93, 1.04
    Brain cancer                   119  1.19  0.99, 1.43      21  3.10  1.92, 4.75      58  1.11  0.84, 1.43      40  0.99  0.70, 1.34
    Other                         3082  0.96  0.92, 0.99     193  0.91  0.79, 1.05    1580  0.94  0.89, 0.99    1309  0.98  0.93, 1.04
Reactive Depression               2075  1.13  1.08, 1.18     184  1.62  1.39, 1.87     997  1.12  1.05, 1.19     894  1.07  1.00, 1.14
  Tobacco-related cancers†         663  1.41  1.30, 1.52      58  2.03  1.54, 2.62     297  1.32  1.17, 1.48     308  1.43  1.27, 1.59
  Non-tobacco-related cancers     1412  1.03  0.98, 1.09     126  1.48  1.23, 1.76     700  1.06  0.98, 1.14     586  0.95  0.87, 1.02
    Brain cancer                    59  1.20  0.92, 1.55      14  4.55  2.48, 7.63      29  1.21  0.81, 1.74      16  0.73  0.42, 1.18
    Other                         1353  1.03  0.97, 1.08     112  1.36  1.12, 1.64     671  1.05  0.97, 1.13     570  0.95  0.88, 1.03
Dysthymia                         2285  1.18  1.13, 1.23     118  1.32  1.09, 1.58     925  1.15  1.08, 1.23    1242  1.19  1.13, 1.26
  Tobacco-related cancers†         714  1.56  1.45, 1.68      30  1.58  1.07, 2.26     256  1.40  1.24, 1.59     428  1.68  1.52, 1.84
  Non-tobacco-related cancers     1571  1.06  1.01, 1.12      88  1.25  1.00, 1.54     669  1.07  0.99, 1.16     814  1.04  0.97, 1.11
    Brain cancer                    73  1.34  1.05, 1.68       7  2.54  1.02, 5.24      43  1.80  1.30, 2.43      23  0.82  0.52, 1.23
    Other                         1498  1.05  1.00, 1.11      81  1.19  0.95, 1.48     626  1.04  0.97, 1.13     791  1.04  0.97, 1.12

*Observed number of cases/expected number of cases. The expected number of cases
was the number of cancer cases expected on the basis of age-, sex-, and
calendar-year-specific incidence rates of first primary cancers in Denmark.
†Cancers of the buccal cavity, larynx, lung, esophagus, pancreas, kidney, and
urinary bladder.
‡Bipolar psychosis: ICD-8 codes 296.39, 296.19, and 298.19; unipolar psychosis:
ICD-8 codes 296.09, 296.29, 296.89, and 296.99; reactive depression: ICD-8 code
298.09; dysthymia: ICD-8 codes 300.49 and 301.19.
Obs, observed; SIR, standardized incidence ratio; CI, confidence interval;
ICD-8, International Classification of Diseases, Eighth Revision.
Dalton et al., 2002.
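As a reminder of how a standardized incidence ratio and its confidence interval are obtained, the sketch below uses the first-year brain cancer entry for the total cohort in Table 8.8 (46 observed cases, SIR 3.27). The expected count is not reported in the table and is back-calculated here from the rounded SIR, so it is only approximate. Byar's approximation to the Poisson interval is used for convenience; it is not necessarily the method used by Dalton et al. (2002).

```python
import math


def sir_with_byar_ci(observed, expected, z=1.96):
    """Standardized incidence ratio with Byar's approximate Poisson limits."""
    sir = observed / expected
    lower_count = observed * (
        1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))
    ) ** 3
    o1 = observed + 1
    upper_count = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3
    return sir, lower_count / expected, upper_count / expected


observed = 46
# Assumption: expected count recovered from the published, rounded SIR.
expected = observed / 3.27

sir, lower, upper = sir_with_byar_ci(observed, expected)
print(f"SIR = {sir:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

Under these assumptions the computed limits fall very close to the tabulated interval of 2.39 to 4.36. The same function could be applied to each follow-up period in turn to see how quickly the excess disappears.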
REFERENCES
Ahlbom A, Navier IL, Norell S, Olin R, Spannare B. Nonoccupational risk
indicators for astrocytomas in adults. Am J Epidemiol 1986;124:334–337.
Ahlborg UF, Lipworth L, Titus-Ernstoff L, Hsieh C-C, Hanberg A, Baron J,
Trichopoulos D, Adami H-O. Organochlorine compounds in relation to breast
cancer, endometrial cancer, and endometriosis: An assessment of the biological
and epidemiological evidence. Crit Rev Toxicol 1995;25:463–531.
Albert C, Mittleman M, Chae C, Lee I, Hennekens C, Manson J. Triggering of
sudden death from cardiac causes by vigorous exercise. N Engl J Med
2000;343:1355–1361.
Armstrong BG. The effects of measurement errors on relative risk regressions.
Am J Epidemiol 1990;132:1176–1184.
In this chapter, the challenges that arise in accurately identifying the occurrence
of disease, or more broadly, health outcomes, are considered. Also, considera-
tion is given to the methods for detecting such errors and quantifying their im-
pact on measures of association. Some amount of error in assessment is inevitable,
with the extent of such error dependent on the nature of the health outcome of
interest, the ability to apply definitive (sometimes invasive) tests to all partici-
pants, and human and instrument error associated with gathering and interpret-
ing the information needed to assess the outcome.
The most obvious source of disease misclassification is outright error, for ex-
ample, a clerical error in which a healthy person is assigned the wrong labora-
tory result and is incorrectly labeled as diseased. Conversely, an individual who
truly has a disease may be examined, but through clinician oversight, critical
symptoms are not queried or the appropriate diagnostic tests are not conducted.
This oversight results in a false negative diagnosis and the patient is erroneously
labeled as being free of disease. In the conventional table of a dichotomous dis-
ease cross-classified with exposure, the person who belongs in the diseased cell
is placed in the non-diseased cell or vice versa.
Measurement and Classification of Disease 207
There are many variations and complexities in disease ascertainment that re-
sult in some form of misclassification. One level beyond simple errors at the
individual level is error in the very definition of disease that is chosen and ap-
plied in the study. Sometimes the rules for disease ascertainment, even if fol-
lowed precisely, predictably generate false positive and false negative diag-
noses. For many diseases, such as rheumatoid arthritis or Alzheimer’s disease,
the diagnosis requires meeting a specified number of predesignated signs and
symptoms, which has the virtue of providing a systematic, objective method
of classification. Systematic and objective does not necessarily mean valid,
however, only reliable. At best, such criteria are probabilistic in nature, with
cases that are identified by meeting the diagnostic criteria likely or even very
likely to be diseased, and those assigned as noncases very likely to be free of
the disease. Even when the rules are followed precisely, however, some
proportion of false positives and false negatives is inevitable. Often, the
diagnosis based on a constellation of symptoms is truly the best that can be done on
living patients, but sometimes there is an opportunity to evaluate the validity
of the diagnosis at some future date when diagnostic information is refined.
Where a definitive diagnosis can be made after death, as in Alzheimer’s dis-
ease, the diagnostic accuracy based on applying clinical criteria can be quan-
tified. In other instances, such as many psychiatric diagnoses, there is no such
anchor of certainty so that the patient’s true status is unknown and the effec-
tiveness of the diagnostic criteria is not readily measured.
As a reminder that disease assignment can be a fallible process, expert com-
mittees periodically evaluate diagnostic criteria and modify them as concepts and
empirical evidence evolve. Clearly, either the initial criteria, the revised criteria,
or both had to contain systematic error relative to the unknown gold standard.
If, in fact, some of those signs and symptoms should not have been included as
criteria or if some components of the diagnosis that should have been included
were omitted, errors of diagnosis must have resulted from proper application of
the rules. Even an optimally designed checklist has probabilistic elements in that
a given constellation of symptoms cannot definitively lead to a specific diagno-
sis and exclude all others. Regardless of the origins of the problem, under this
scenario all relevant signs and symptoms will be elicited properly, recorded ac-
curately, and integrated according to the best rules available, yet a certain frac-
tion of persons labeled as diseased will not have the disease and some of those
labeled as non-diseased will actually have it.
Beyond the likelihood of error in properly applied operational definitions of
disease, the actual process by which individuals come to be identified as cases
in a study provides many additional opportunities for errors to arise. Typically,
a sequence of events must occur and decisions must be made for a person to become a diagnosed case, such
that overcoming one hurdle is required to proceed to the next on the pathway
leading to disease identification and inclusion as a case in the study. These
include not only technical considerations based on disease signs and symptoms but
also hurdles defined by recognition of symptoms and clinician insights.
False positives tend to be more readily identified and corrected than false neg-
atives. False positives are often less likely because those who meet a given mile-
stone on the path to diagnosis are evaluated more intensively at the next stage
and can thus be weeded out at any point along the way. On the other hand, fail-
ure to identify a person as needing further scrutiny in any step along the path-
way to diagnosis can result in elimination from further consideration and a false
negative diagnosis. Once a potential case of disease comes to the attention of the
health care system, substantial effort can be devoted to verifying the presence of
disease as a prelude to choosing among treatment options.
Optimally, a series of sequential screening efforts of increasing cost and so-
phistication is designed to end up with true cases, starting with a rather wide net
that is highly sensitive and not terribly specific, and proceeding to increasingly
specific measures. If the early stages of screening are not highly sensitive, then
false negative errors will occur and not have the opportunity to be identified given
that potential cases will leave the diagnostic pathway. Once such an error is made
in the early phase of identifying disease, it is inefficient for the massive num-
bers of presumptively negative individuals to be re-evaluated to confirm the ab-
sence of disease. People do not always recognize symptoms, they may fail to
seek treatment for the symptoms they do recognize, and even after seeking treat-
ment, they may not be diagnosed accurately. Each of these steps in reaching a
diagnosis constitutes the basis for terminating the search, affording numerous op-
portunities for underascertainment.
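The cumulative effect of these sequential hurdles can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that each step is passed with a fixed probability conditional on having passed the previous step. The step labels follow the text, but the probabilities are hypothetical and not drawn from any study.

```python
from math import prod


def overall_ascertainment(step_probabilities):
    """Probability that a true case passes every step on the diagnostic pathway.

    Each value is the (hypothetical) probability of passing that step,
    conditional on having passed all earlier steps.
    """
    return prod(step_probabilities)


# Hypothetical values, chosen only to illustrate the arithmetic.
steps = {
    "recognizes symptoms": 0.9,
    "seeks treatment": 0.8,
    "is diagnosed accurately": 0.9,
}

sensitivity = overall_ascertainment(steps.values())
print(f"Overall probability of being ascertained: {sensitivity:.3f}")
print(f"Proportion of true cases missed: {1 - sensitivity:.3f}")
```

With these made-up values, roughly one third of true cases would never be identified, even though no single step misses more than one in five.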
Many diseases, defined as a specific biologic condition, are simply not
amenable to comprehensive ascertainment for logistical reasons. For example, a
sizable proportion (perhaps 10%–15%) of couples are incapable of reproducing,
and some fraction of that infertility is attributable to blockage in the fallopian
tubes, preventing the transport of the ovum from the ovary to the uterus. This
specific biologic problem, obstruction of the fallopian tubes, is considered as the
disease of interest in this illustration. Under the right set of circumstances, such
couples will be accurately diagnosed as having infertility and further diagnostic
assessment will result in identification of occlusion of the fallopian tubes as the
underlying basis for the infertility. In order for this diagnosis to be made, how-
ever, the otherwise asymptomatic individual has to meet a number of criteria that
are dependent in part on choices and desires. The intent to have a child has to
be present. Persons who are not sexually active, those sexually active but using
contraception, and even those who engage in unprotected intercourse without the
intent of becoming pregnant would not define themselves as infertile and would
thus not undergo medical evaluation and diagnosis. Many women who in fact
have occlusion of the fallopian tubes are not diagnosed because they do not seek
the medical attention required to be diagnosed (Marchbanks et al., 1989).
function such as IQ score or the proportion below a threshold for mild retarda-
tion. On the other hand, in the case of certain viral infections during pregnancy,
such as rubella, severe losses of intellectual ability would be anticipated, so that
the proper disease endpoint for study is the dichotomy severe mental retardation
or absence of severe mental retardation. Alcohol consumption during pregnancy
may be capable of causing both a shift toward lower intellectual ability and cases
of severe mental retardation in the form of fetal alcohol syndrome, depending on
the dose. The consequences of inappropriately categorizing or failing to catego-
rize can be viewed as forms of misclassification.
A commonly encountered decision in selecting a disease endpoint is the opti-
mal level of aggregation or disaggregation, sometimes referred to as lumping ver-
sus splitting. The decision should be based on the entity that is most plausibly
linked to exposure, reflecting an understanding of the pathophysiology of the dis-
ease generally and the specific etiologic process that is hypothesized to link ex-
posure to the disease. There are unlimited opportunities for disaggregation, with
subdivisions of disease commonly based on severity (e.g., obesity), age at oc-
currence (e.g., premenopausal versus postmenopausal breast cancer), exact
anatomic location (e.g., brain tumors), microscopic characteristics (e.g., histo-
logic type of gastric cancer), clinical course (e.g., Alzheimer’s disease), molec-
ular traits (e.g., cytogenetic types of leukemia), or prognosis (e.g., aggressive-
ness of prostate cancer). For any given disease, experts are continually deriving
new ways in which the disease might be subdivided, particularly through the use
of refined biological markers. There are a number of laudable but distinctive
goals for such refined groupings, including identification of subsets for selecting
appropriate treatment, assessing prognosis, or evaluating etiology. The goals of
the epidemiologic study must be matched to the classification system to identify
the optimal level of aggregation. There is no generic answer to the lumping ver-
sus splitting dilemma; it depends on the goal.
In principle, errors of excessive subdivision should only affect statistical power.
That is, subdividing on irrelevant features of the disease, such as the day of the
week on which it was diagnosed or the clinician’s zodiac sign, would not lead
to a loss of validity but only a loss of precision. It would be as though a random
number was assigned to each event and they were grouped on the basis of the
numerical assignment. By definition, selection based on an attribute that is un-
related to etiology constitutes a random sample. The loss of precision is not a
small concern, however, given that imprecision and random error are frequently
major barriers to distinguishing the signal from the noise. In studies of genetic
susceptibility markers, for example, imprecision resulting from an interest in ef-
fect modification places severe burdens on study size (Greenland, 1983) and can
result in a degree of imprecision that renders studies virtually uninformative. If
greater aggregation is biologically appropriate, more informative studies can be
conducted.
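The distinction between validity and precision under irrelevant subdivision can be sketched numerically. The example below uses hypothetical case-control counts and Woolf's log-based confidence interval, which is a choice made here for simplicity and not a method named in the text. The subset is constructed to have exactly the exposure split that a random one-seventh of the cases would have on average.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Return (odds ratio, lower limit, upper limit) using Woolf's log-based interval.

    a, b: exposed and unexposed cases; c, d: exposed and unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: all 700 cases compared with 700 controls.
all_cases = odds_ratio_with_ci(a=140, b=560, c=100, d=600)

# One-seventh of the cases (for example, those diagnosed on a given weekday),
# compared with the same controls.
weekday_cases = odds_ratio_with_ci(a=20, b=80, c=100, d=600)

for label, (or_, lo, hi) in [("All cases", all_cases), ("Weekday subset", weekday_cases)]:
    print(f"{label}: OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}, upper/lower = {hi / lo:.2f}")
```

In this constructed example the odds ratio is 1.5 in both analyses, but the interval for the subset is considerably wider.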
and the exposure prevalence among erroneously diagnosed false positive cases.
This mixing will yield a bias towards the null, giving the observed case group
an exposure prevalence between that of true cases and that of a random sample
of the study base, represented by the false positives. Only when there is no as-
sociation between exposure and disease, whereby cases would have the same ex-
posure prevalence as the study base, will no bias result. Given the risk of over-
whelming true cases with false positives when disease is rare, it is important in
case–control studies to seek the maximum level of specificity even at the expense
of some loss in sensitivity (Brenner & Savitz, 1990). Therefore, when results are
compared across varying levels of disease sensitivity and specificity (see the section
below, “Examine results across levels of diagnostic certainty”), the
most valid estimates are expected for the most restrictive, stringent disease
definitions. Given that only ratio measures of effect (odds ratios) can be assessed
in case–control studies, all of the comments about bias due to nondifferential
misclassification refer to the odds ratio.
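The dilution of the odds ratio by false positive cases can be sketched as follows. The function assumes that false positives have the same exposure prevalence as the study base, as described above, and that controls reflect that prevalence. The true odds ratio, the exposure prevalence, and the fractions of true cases are hypothetical inputs.

```python
def observed_odds_ratio(true_or, base_exposure_prev, fraction_true_cases):
    """Expected odds ratio when the case group mixes true cases with false positives.

    true_or: odds ratio among true cases (hypothetical input).
    base_exposure_prev: exposure prevalence in the study base and among controls.
    fraction_true_cases: share of the labeled case group that truly has the disease.
    """
    base_odds = base_exposure_prev / (1 - base_exposure_prev)
    true_case_odds = true_or * base_odds
    true_case_prev = true_case_odds / (1 + true_case_odds)
    # False positives are assumed to be exposed at the same rate as the study base.
    mixed_prev = (fraction_true_cases * true_case_prev
                  + (1 - fraction_true_cases) * base_exposure_prev)
    return (mixed_prev / (1 - mixed_prev)) / base_odds


for fraction in (1.0, 0.9, 0.7, 0.5):
    diluted = observed_odds_ratio(true_or=3.0, base_exposure_prev=0.2,
                                  fraction_true_cases=fraction)
    print(f"{fraction:.0%} of labeled cases are true cases: expected OR = {diluted:.2f}")
```

Under these assumptions, a true odds ratio of 3.0 is expected to fall to about 1.8 when only half of the labeled cases are truly diseased, and no bias arises when the true odds ratio is 1.0.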
In contrast, nondifferential underascertainment of disease in cohort studies does
not produce a bias in ratio measures of effect (risk ratios, odds ratios) (Poole,
1985; Rothman & Greenland, 1998). Assume that the disease identification mech-
anism, applied identically among exposed and unexposed subjects, successfully
identifies 80% of the cases that are truly present. The absolute rate of disease
will be 0.80 times its true value in both the exposed and unexposed groups. For
ratio measures of effect, the sampling fractions cancel out, such that there is no
bias—0.80 times the disease rate among exposed subjects divided by 0.80 times
the disease rate among unexposed subjects produces an unbiased estimate of the
risk ratio. Note the minimal assumptions required for this to be true: only dis-
ease underascertainment is present and it is identical in magnitude for exposed
and unexposed subjects. If these constraints can be met, either in the study de-
sign or by stratification in the analysis, then unbiased measures of relative risk
can be generated. In this situation, however, the measure of rate difference will
be biased, proportionately smaller by the amount of underascertainment. For a
given sampling fraction, for example, 0.80, the rate difference will be 0.80 times
its true value: 0.80 times the rate in the exposed minus 0.80 times the rate in the
unexposed equals 0.80 times the true difference.
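The arithmetic of this paragraph can be checked directly. In the sketch below the disease rates are hypothetical, and the only assumption about ascertainment is the one stated above: the same fraction of true cases (0.80) is identified in the exposed and unexposed groups, with no false positives.

```python
def observed_measures(rate_exposed, rate_unexposed, ascertained_fraction):
    """Return the observed (rate ratio, rate difference) under underascertainment.

    The same fraction of true cases is assumed to be identified in both groups.
    """
    observed_exposed = ascertained_fraction * rate_exposed
    observed_unexposed = ascertained_fraction * rate_unexposed
    return observed_exposed / observed_unexposed, observed_exposed - observed_unexposed


# Hypothetical true rates: 10 per 1,000 among exposed, 5 per 1,000 among unexposed.
true_ratio, true_difference = observed_measures(0.010, 0.005, ascertained_fraction=1.0)
obs_ratio, obs_difference = observed_measures(0.010, 0.005, ascertained_fraction=0.80)

print(f"Rate ratio: true {true_ratio:.2f}, observed {obs_ratio:.2f}")
print(f"Rate difference: true {true_difference:.4f}, observed {obs_difference:.4f}")
print(f"Observed / true difference: {obs_difference / true_difference:.2f}")
```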
For non-differential disease overascertainment, the consequences are the op-
posite with respect to ratio and difference measures, i.e., bias in ratio measures
but not in difference measures of effect. In contrast to underascertainment, in
which a constant fraction of the true cases are assumed to be missed, overascer-
tainment is not proportionate to the number of true cases but instead to the size
of the study base or denominator. That is, the observed disease incidence is the
sum of the true disease incidence and the incidence of overascertainment, with
the total number of false positive cases a function of the frequency of over-
ascertainment and the size of the study base. If the disease incidence due to
not the person has asbestosis would undoubtedly be affected by the presumed
history of asbestos exposure.
More subtle forms of exposure-driven diagnosis can also result in differential
disease misclassification, typically creating a spuriously strong association if the
exposure increases the probability of more accurate or complete diagnosis. To-
bacco use is a firmly established, major cause of a variety of diseases, including
cancer of the lung and bladder, coronary heart disease, and chronic obstructive
pulmonary disease. While several of these diseases have unambiguous diagnoses,
such as advanced lung cancer, there are others such as chronic bronchitis or
angina that can involve a certain amount of discretionary judgment on the part
of the physician. Integration of the complete array of relevant data to reach a fi-
nal diagnosis is likely to include consideration of the patient’s smoking history.
This can be viewed as good medical practice that takes advantage of epidemio-
logic insights since the probabilities of one diagnosis or another are truly altered
by the smoking history. Incorporation of the exposure history into the diagnostic
evaluation, however, may well result in a more complete assessment or even
overdiagnosis of diseases known to be related to tobacco among patients with a
smoking history. In some instances, the greater reluctance to diagnose a non-
smoker as having bronchitis and the greater willingness to diagnose a smoker as
having bronchitis may help make the diagnoses more accurate, but such expo-
sure-driven judgments also have the potential to introduce disease misclassifica-
tion that is differential in relation to smoking history.
Another way in which differential disease misclassification may arise is as a nat-
ural consequence of the exposure of interest rather than as a result of the behavior
of the affected individual or the predilections of the person making the diagnosis.
The exposure may alter our ability to accurately diagnose the disease. In examining
the causes of spontaneous abortion, for example, we have to contend with ambiguity
of diagnosis, particularly in the first few weeks of pregnancy. Early spontaneous
abortion may not be recognized at all or may be misinterpreted as a heavy menstrual
period, and conversely, a late, heavy menstrual period may be misinterpreted as a
spontaneous abortion. If we are interested in examining the influence of menstrual
cycle regularity or other menstrual bleeding characteristics on the risk of sponta-
neous abortion, a problem arises. Our ability to make accurate diagnoses will be
greatest for women who have regular menstrual cycles, which may be viewed as
the unexposed group, and the exposed women with irregular menstrual cycles will
be less accurately diagnosed. The solution to such a problem may reside in a di-
agnostic method such as evaluation of hormone metabolites in urine (Wilcox et al.,
1988), thus freeing the diagnosis from the influence of menstrual cycle regularity.
Such an approach to diagnosis eliminates the association between the exposure of
interest and the accuracy of disease diagnosis.
The many opportunities for differential misclassification should be considered
comprehensively, addressing all the ways in which exposure could directly or in-
postulated in assigning such labels is that if the absolute truth were known, those
classified as definite would contain the highest proportion of persons with the
disease present, those labeled as probable the next highest, and those called possible
the lowest proportion of truly diseased persons. The only inference
is ordinal, such that these three groups might contain 100%, 80%, and 60% who
truly have disease or 50%, 30%, and 10%. Although it would be more desirable
to be able to attach precise quantitative probabilities to these categories, even information
on the relative degree of certainty has value in interpreting study results.
The improved specificity in going from possible to probable to definite is
virtually certain to be accompanied by a loss in sensitivity, i.e., increasing numbers
of persons erroneously excluded (false negatives).
As noted previously, the greater concern in misclassification of relatively rare
diseases is usually with false positives, since even a modest loss in specificity
can result in overwhelming the few true positives with many false positives. Sen-
sitivity is critical for enhancing precision, since false negative cases are not con-
tributing to the pool of identified cases, but the infusion of a handful of false
negatives into a large group of true negatives would have little impact on mea-
sures of effect. Given those considerations, bias in measures of association should
be least when using the most stringent, restrictive case definitions and greatest
for the more uncertain, inclusive categories. Because specificity is presumably
highest for the most restrictive definition, when a gradient in measures of effect
across levels of diagnostic certainty is found, the most valid result is likely to be
for the most definite cases—whether those show the strongest or weakest mea-
sure of association.
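The asymmetry between false positives and false negatives for a rare disease can be made concrete with expected counts. The population size, prevalence, sensitivity, and specificity below are hypothetical values chosen only to show the pattern.

```python
def classify(population, prevalence, sensitivity, specificity):
    """Expected counts (true pos, false pos, false neg, true neg) for an imperfect test."""
    diseased = population * prevalence
    healthy = population - diseased
    true_pos = sensitivity * diseased
    false_neg = diseased - true_pos
    false_pos = (1 - specificity) * healthy
    true_neg = healthy - false_pos
    return true_pos, false_pos, false_neg, true_neg


# Hypothetical rare disease: 1 per 1,000 in a population of 100,000.
for sens, spec in [(0.90, 1.00), (0.90, 0.99), (0.70, 1.00)]:
    tp, fp, fn, tn = classify(100_000, 0.001, sens, spec)
    true_share_of_cases = tp / (tp + fp)
    diseased_share_of_noncases = fn / (fn + tn)
    print(f"sensitivity {sens:.2f}, specificity {spec:.2f}: "
          f"{true_share_of_cases:.1%} of labeled cases are true cases; "
          f"{diseased_share_of_noncases:.3%} of labeled noncases are diseased")
```

In this hypothetical population, a 1% loss of specificity leaves fewer than one in ten labeled cases truly diseased, whereas a drop in sensitivity from 90% to 70% contaminates the noncase group only trivially.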
If, in fact, the most certain cases yield the most valid results, one might ques-
tion the value of considering the other categories at all. In other words, why not
just set up a highly restrictive case definition at the outset and accept the loss of
some true cases in order to weed out a greater proportion of the non-cases? First,
the opportunity to examine a gradient in certainty of classification is informa-
tive. The contrast between results for more versus less certain cases generates a
spectrum of information that is helpful in assessing the magnitude and impact of
disease misclassification. Observing a risk ratio of 2.0 for definite cases in iso-
lation may be less informative than observing a risk ratio of 2.0 for definite cases,
1.5 for probable cases, and 1.2 for possible cases. The opportunity to assess the pattern
of risk in relation to diagnostic certainty is an incentive to retain strata of
cases that are less certain. Nevertheless, if resource limitations force us to com-
mit to only one case definition, then the more restrictive one would generally be
preferred. Second, our initial assessment of certainty of diagnosis may be in er-
ror, in which case we would lose precision and not gain validity by restricting
cases to the highest level of diagnostic certainty. Observing relative risks around
2.0 for definite, probable, and possible cases may suggest that all three categories
are equally likely to contain true positive cases, even if we do not know exactly
what the proportion is. If, in fact, there were empirical evidence that the groups
were similar with regard to their patterns of risk, aggregating them would be a
reasonable strategy for enhancing precision with little or no loss in validity.
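One way to see why a gradient such as 2.0, 1.5, and 1.2 is informative is to ask what mixture of true and false positive cases would produce it. The sketch below is a simplified model, not the author's method: it assumes that false positive diagnoses occur at the same rate in exposed and unexposed persons, and the fractions of true cases in each certainty category are hypothetical.

```python
def expected_observed_rr(true_rr, fraction_true_among_unexposed_cases):
    """Expected risk ratio when labeled cases mix true cases with false positives.

    Among the unexposed, labeled incidence is T + F, where T is true disease and
    F is false positive diagnosis. If exposure multiplies T by true_rr and leaves
    F unchanged, labeled incidence among the exposed is true_rr * T + F. Dividing
    by T + F gives the weighted average returned here.
    """
    p = fraction_true_among_unexposed_cases
    return p * true_rr + (1 - p) * 1.0


# Hypothetical fractions of unexposed labeled cases that truly have the disease.
certainty_levels = {"definite": 1.0, "probable": 0.5, "possible": 0.2}

for label, fraction in certainty_levels.items():
    rr = expected_observed_rr(true_rr=2.0, fraction_true_among_unexposed_cases=fraction)
    print(f"{label}: expected observed risk ratio = {rr:.1f}")
```

Under this model, a true risk ratio of 2.0 combined with the assumed fractions yields the 2.0, 1.5, and 1.2 pattern, whereas equal risk ratios across categories would point to similar fractions of true cases in all three.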
Many physiologic parameters have no true threshold for abnormal, so that the
more restrictive and extreme the cutpoint, the more likely it is that persons labeled
as abnormal will suffer clinical consequences. This phenomenon is clearly
illustrated in the case of semen parameters related to male infertility,
in which the cutpoints for normality are rather arbitrary relative to the clinical
consequences for fertility. In fact, as abnormal semen parameters are defined
with increasingly restrictive criteria, the certainty that there is a clinically
important functional abnormality present is enhanced. For example, when sperm
concentration reaches 0 per ml, conception will be impossible; when it is very
low, e.g., 5 million per ml, conception is very unlikely; and so on. In a study of infertility
patients treated in the Netherlands (Tielemans et al., 2002), the investigators
considered three case definitions of increasing restrictiveness, using a combination
of sperm concentration, motility, and morphology as characterized in
the table footnote (Table 9.1). As more stringent standards were applied to de-
fine sperm abnormality, the association with cigarette smoking became stronger
but also less precise. This pattern is consistent with a true effect of tobacco use
on semen parameters, but one that is diluted by misclassification using more lib-
eral disease definitions. The restricted study population referred to in the table
excluded those who were least likely to be aware of the cause of their infertility
as described in detail in the manuscript.
An important potential drawback to this approach is the possibility that a
strategy intended to establish levels of diagnostic certainty might instead re-
flect fundamentally different disease subsets that have different etiologies. For
example, definite cases with a key symptom required to label them as certain
may actually have a different condition than those lacking that symptom and
labeled possible. Severity, often used to establish certainty, may also be asso-
ciated with qualitative differences in etiology, as illustrated earlier in this chap-
ter for mental retardation. Severe mental retardation is diagnosed with greater
certainty than mild mental retardation, but may well also represent a funda-
mentally different entity with a different pathogenesis and different determi-
nants. If a study were conducted that found a given exposure was associated
with a risk ratio of 2.0 for severe mental retardation and 1.5 for mild mental
retardation, it would not necessarily correspond to more disease misclassifica-
tion among those individuals labeled as having mild mental retardation. Both
results could be perfectly valid, reflecting differing magnitudes of association
with different health outcomes. The judgment about whether the groups iden-
tified as more or less certain to have the disease reflect the same entity with
differing quality or different entities has to be made based on substantive knowledge
about the exposures and diseases under investigation. The study results
alone will not make it clear which is operative if the magnitude of association
differs across groups.

TABLE 9.1. Estimated Odds Ratios for Abnormal Semen Parameters and Male Cigarette Smoking in the Total Study Population and the Restricted Population, the Netherlands, 1995–1996*

                    Total Population                                 Restricted Population
                    NO.†   OR     95% CI        OR     95% CI        NO.   OR     95% CI        OR     95% CI
Case Definition A‡
  Male smoking      153    1.25   0.88, 1.79    1.34   0.90, 2.00    51    1.98   0.96, 4.11    2.07   0.95, 4.51
  Female smoking    137                         0.86   0.58, 1.29    41                         0.90   0.42, 1.93
Case Definition B§
  Male smoking      75     1.69   1.11, 2.57    1.97   1.20, 3.24    23    2.30   1.00, 5.27    2.99   1.17, 7.67
  Female smoking    58                          0.73   0.44, 1.21    13                         0.51   0.19, 1.38
Case Definition C¶
  Male smoking      20     1.92   0.98, 3.74    2.62   1.22, 5.61    7     2.45   0.77, 7.81    4.58   1.20, 17.47
  Female smoking    12                          0.50   0.22, 1.14    1                          0.09   0.01, 0.87

*Smoking by the male partner was entered alone and simultaneously with female smoking into logistic regression models.
†Number of cases with the particular risk factor.
‡Sperm concentration below 20 × 10⁶/ml, less than 50% spermatozoa with forward progression and also less than 25% spermatozoa with rapid progression, less than 14% spermatozoa with normal forms, or abnormal values for more than one parameter.
§Sperm concentration below 5 × 10⁶/ml, less than 10% spermatozoa with forward progression, less than 5% spermatozoa with normal forms, or abnormal values for more than one parameter.
¶Azoospermia.
OR, odds ratio (calculations based on the control group with the following characteristics: sperm concentration of 20 × 10⁶/ml or more, 50% or more spermatozoa with forward progression or 25% or more spermatozoa with rapid progression, and 14% or more spermatozoa with normal forms); CI, confidence interval.
Tielemans et al., 2002.
ship under study, and have the same impact as any other source of false posi-
tives on the results.
If benzene exposure truly caused only one form of leukemia, acute myeloid
leukemia, as some have argued (Wong, 1995), then studies of benzene and
leukemia that include other forms, such as chronic myeloid leukemia and acute
lymphocytic leukemia would be expected to yield weaker ratio measures of as-
sociation. That weaker measure would accurately reflect the impact of benzene
on total leukemia, but would reflect a smaller magnitude than would be found
for acute myeloid leukemia alone. Those cases of other forms of leukemia would
act analogously to false positive cases of acute myeloid leukemia, diluting the
measured association. Under the hypothesis of an effect limited to acute myeloid
leukemia, the exposure pattern of cases of other types of leukemia would be identical
to that of persons free of disease. On the other hand, if multiple types of
leukemia are in fact affected by benzene, as suggested in a recent review (Savitz
& Andrews, 1997) and a report from a large cohort study (Hayes et al., 1997),
then restricting an already rare disease, leukemia, to the subset of acute myeloid
leukemia, is wasteful. Relative to studying all leukemias, there would be a substantial
loss of precision, and there may be no gain in specificity of association with
benzene exposure.
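The two competing scenarios for benzene can be contrasted with a small calculation. The sketch below treats the risk ratio for total leukemia as a weighted average of subtype-specific risk ratios, with weights given by each subtype's share of incidence among the unexposed. The risk ratios and the acute myeloid leukemia share are hypothetical and are not estimates from the studies cited.

```python
def total_disease_rr(subtype_rr, other_rr, subtype_share):
    """Risk ratio for all subtypes combined.

    subtype_share is the (hypothetical) fraction of total incidence among the
    unexposed that is due to the subtype of interest.
    """
    return subtype_share * subtype_rr + (1 - subtype_share) * other_rr


AML_SHARE = 0.3  # hypothetical share of unexposed leukemia incidence that is AML

scenarios = {
    "effect limited to AML": (3.0, 1.0),
    "effect on all leukemia types": (3.0, 3.0),
}

for name, (aml_rr, other_rr) in scenarios.items():
    combined = total_disease_rr(aml_rr, other_rr, AML_SHARE)
    print(f"{name}: AML risk ratio = {aml_rr:.1f}, total leukemia risk ratio = {combined:.1f}")
```

Under the first scenario the combined analysis understates the effect on the affected subtype. Under the second, restriction to the subtype gains nothing and, with the assumed share, would discard about 70% of the unexposed cases.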
Often we are faced with uncertainty and reasonable arguments that would sup-
port more than one approach to disease grouping. Rather than arbitrarily adopt-
ing one strategy, the best approach may be to examine the results under several
scenarios and consider what impact misclassification would be likely to have had
under those alternative assumptions. If, in fact, there is a causal association with
at least some subset of disease, then the analysis that is restricted to that subset
will show a stronger exposure–disease association than analyses that are more
inclusive. If there is reasonable doubt about whether etiologically distinctive sub-
sets of disease may be present, there is an incentive to present and evaluate re-
sults for those subsets. Should the subsets all yield similar measures of effect,
then one might infer that nothing was gained and the exposure has similar con-
sequences for all the subgroups of disease. On the other hand, generating data
for disease subsets is the only means for discovering that some subsets are af-
fected by the exposure whereas others are not.
For example, we hypothesized that among all cases of preterm delivery, dis-
tinctive clinical presentations may correspond to different etiologic mechanisms:
some occur following spontaneous onset of labor, some following spontaneous
rupture of the chorioamniotic membranes, and some result from medical inter-
ventions in response to health complications of the mother or fetus that require
early delivery, such as severe pre-eclampsia or fetal distress (Savitz et al., 1991).
If this is a valid basis for dividing cases to study etiology, then associations with
subsets of cases will be stronger than for the aggregation of all preterm delivery
cases. At present, the empirical evidence regarding such heterogeneity is mixed,
with some risk factors distinctive by subtype whereas other potential causes of
preterm birth appear to be associated with two or all three subgroups (Lang et
al., 1996; Berkowitz et al., 1998).
Some diseases are aggregations of subgroups, in a sense demanding consid-
eration of subtypes of a naturally heterogeneous entity. Brain cancer is defined
solely by the anatomic location of the tumor, with a wide range of histologic
types with varying prognosis and quite possibly varying etiology. In a rather so-
phisticated examination of the issue of magnetic field exposure and brain cancer
in a Canadian case–control study, Villeneuve et al. (2002) hypothesized that the
exposure acts as a tumor promoter and would thus show the strongest associa-
tion for the most aggressive subtypes of brain cancer. Subsets of brain cancer
were examined empirically (Table 9.2) and there was clear heterogeneity in pat-
TABLE 9.2. The Risk of Brain Cancer According to the Highest Average Level of
Occupational Magnetic Field Exposure Ever Received by Histological Type. Canadian
National Enhance Cancer Surveillance System, Male Participants, 1994–1997
HIGHEST AVERAGE
OCCUPATIONAL
EXPOSURE
MAGNETIC FIELDS ODDS ODDS
EVER RECEIVED CASES CONTROLS RATIO* 95% CI RATIO† 95% CI
Astrocytomas
0.3 T 163 160 1.0 1.0
0.3 T 51 54 0.93 0.60–1.44 0.93 0.59–1.47
0.6 T 12 16 0.61 0.26–1.49 0.59 0.24–1.45
Glioblastoma
Multiforme
0.3 T 143 156 1.0 1.0
0.3 T 55 42 1.50 0.91–2.46 1.48 0.89–2.47
0.6 T 18 6 5.50 1.22–24.8 5.36 1.16–24.78
Other
0.3 T 92 94 1.0 1.0
0.3 T 23 21 1.11 0.59–2.10 1.10 0.58–2.09
0.6 T 9 7 1.50 0.53–4.21 1.58 0.56–4.50
*Unadjusted odds ratio obtained from the conditional logistic model.
†The odds ratio was adjusted for occupational exposure to ionizing radiation and vinyl chloride.
‡Referent group.
Villeneuve et al., 2002.
226 INTERPRETING EPIDEMIOLOGIC EVIDENCE
terns of association across tumor groupings. A modest association was found for
brain cancer in the aggregate (relative risks of 1.3–1.4 in the highest exposure
category), with markedly stronger associations for the more aggressive subtype,
glioblastoma multiforme, with relative risks over 5.0. Whether this pattern re-
flects a causal effect or not, the heterogeneity in risk across subtypes provides
informative suggestions and helps to focus additional research that addresses the
same hypothesis or actually refine the hypothesis about whether and how mag-
netic fields might affect brain cancer.
As in many suggested approaches to epidemiologic data analysis, there is no
analysis that can discern the underlying truth. Hypotheses are proposed, results
are generated, and then interpretations are made, with greater information provided
when informative disease subsets can be isolated and considered. Several
caveats to this approach must be noted, however. Alternative grouping schemes
need to have a logical basis in order for the results to be interpretable. A plau-
sible theoretical foundation is needed for each approach to grouping that is then
examined in order for the association to have any broader meaning and to advance
understanding of disease etiology. To note that an arbitrarily chosen subset
of cases, such as those who came to the clinic on Tuesdays, shows a stronger
relationship to exposure than cases in the aggregate is of little help in evaluating
misclassification and understanding the causal process. Through random
processes, there will always be disease subsets more and less strongly related to
exposure, but to be worthy of evaluation, finding such heterogeneity or even the
absence of heterogeneity that might have been expected under some plausible
hypothesis should advance knowledge. In fact, random error becomes a much
greater problem for case subgroups than for the disease group in the aggregate,
simply due to a diminution of the numbers of cases in the analysis. Arbitrary,
excessive splitting of cases for analysis has the danger of generating false leads
based solely on random error. Nonetheless, except when imprecision is extreme,
it would often be preferable to have a less precise result for the subgroup of cases
that is truly affected by the exposure than to have a more precise result for a
broader aggregation of cases, some of which are affected by the exposure and
some of which are not.
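The cost of aggregation can be illustrated with a small hypothetical calculation. The sketch below assumes, purely for illustration, that the affected subtype makes up 20% of cases among the unexposed and has a true relative risk of 3.0, while all other cases are unrelated to exposure. These numbers are invented and the function name is not from any study cited here.

```python
def aggregate_relative_risk(subtype_fraction, subtype_rr):
    """Relative risk for all cases combined when only one subtype is affected.

    subtype_fraction: share of cases among the unexposed that belong to the
                      affected subtype (hypothetical input)
    subtype_rr:       true relative risk for the affected subtype; every
                      other subtype is assumed to have a relative risk of 1
    """
    # The aggregate relative risk is a weighted average of the
    # subtype-specific relative risks, weighted by the case mix
    # among the unexposed.
    return subtype_fraction * subtype_rr + (1 - subtype_fraction) * 1.0


rr_all_cases = aggregate_relative_risk(0.20, 3.0)
print(round(rr_all_cases, 2))  # 1.4
```

Under these assumptions a threefold increase in the affected subtype appears as a relative risk of only 1.4 for the disease in the aggregate, which is why a less precise estimate for the right subgroup can be more informative than a precise estimate for the diluted whole.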
are false positives. Ideally, a sufficient number of exposed and unexposed cases
can be evaluated to determine whether the proportion who are erroneously la-
beled as diseased is associated with or independent of exposure status (i.e.,
whether disease misclassification is differential or non-differential).
On the other hand, assuming disease is relatively rare, there are many persons
presumptively identified as disease-free, and subjecting each one to definitive di-
agnostic evaluation to correct false negatives is generally not feasible or neces-
sary. Instead, some sample of individuals identified as free of disease can be
evaluated with more definitive diagnostic tests to verify the absence of disease.
Often, the challenge is to screen a sufficient number of presumptive non-cases
to identify any false negatives or know that a sufficient number have been eval-
uated, even if no missed cases of disease are found.
With quantitative estimates of the frequency of disease misclassification, an
estimate of the magnitude of association in the absence of those errors can be
made through simple algebra (Kleinbaum et al., 1982). Within an exposure stra-
tum, for example, a certain proportion of those labeled as having the disease rep-
resent false positives and the correction for that false positive proportion is to
deplete the cell by that amount. If there were 100 persons classified as having
disease, and the false positive proportion were 8%, then 8 persons would be
moved to the no disease cell, leaving 92. If there were also some fraction of those
labeled disease-free who represent false negatives, then some number of persons
would need to be shifted from the no disease to the disease cell. Assuming that
there were originally 900 persons classified as free of disease and that 2% are
false negatives, then 18 persons would move across cells. The net change in the
proportion with disease would be from 100/1000 = 0.10 to 110/1000 = 0.11. A
comparable adjustment would be made in the other exposure strata to produce
adjusted measures of the rate ratio. More sophisticated methods are available that
account for the imprecision in the correction terms, and incorporate the precision
of that estimate in the variability of the adjusted measures (Rothman & Green-
land, 1998, pp. 353–355). Additional refinements would incorporate misclassi-
fication of exposure and adjustment for confounding factors, making the algebra
much more complex.
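The simple algebra for a single exposure stratum can be sketched as follows, using the numbers from the example above. The function name and arguments are illustrative, and the sketch ignores the imprecision of the correction terms and any misclassification of exposure.

```python
def correct_disease_counts(n_diseased, n_nondiseased,
                           false_pos_prop, false_neg_prop):
    """Shift misclassified persons between the disease and no-disease cells.

    false_pos_prop: proportion of those labeled diseased who are false positives
    false_neg_prop: proportion of those labeled disease-free who are false negatives
    """
    false_pos = n_diseased * false_pos_prop      # move out of the disease cell
    false_neg = n_nondiseased * false_neg_prop   # move into the disease cell
    corrected_diseased = n_diseased - false_pos + false_neg
    corrected_nondiseased = n_nondiseased - false_neg + false_pos
    return corrected_diseased, corrected_nondiseased


diseased, nondiseased = correct_disease_counts(100, 900, 0.08, 0.02)
corrected_risk = diseased / (diseased + nondiseased)
print(round(diseased), round(nondiseased), round(corrected_risk, 2))  # 110 890 0.11
```

Applying the same function to each exposure stratum, with stratum-specific error proportions if misclassification is differential, gives the corrected risks from which an adjusted ratio measure can be computed.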
This approach was applied to the evaluation of chronic obstructive pulmonary
disease in the Nurses’ Health Study (Barr et al., 2002). In such large cohort studies,
direct confirmation is infeasible due to the geographic dispersion of participants
and the lengthy interval over which diagnoses occur. Using self-reported
information from the questionnaire, women were classified as definite, probable,
or possible cases, depending on the amount of detail that they were able to pro-
vide to document the diagnosis of chronic obstructive pulmonary disease. A ran-
dom sample of 422 women who reported the disease was initially selected, and
medical records were obtained for 376 women to allow for the direct confirma-
tion of physician-diagnosed chronic obstructive pulmonary disease. The proportion
confirmed in this manner was 78%, with a greater proportion of those assigned
as definite confirmed than among those classified initially as probable or
possible (Table 9.3). Note that this reflects the opportunity to examine and con-
firm (or refute) self-reported diagnoses, but does not allow for assessment of false
negative reports, i.e., women who would be considered to have chronic obstruc-
tive pulmonary disease based on medical record review but who did not report
it on the questionnaire.
Definite 73 86 81 51 60 45 (18)
Probable 218 80 71 50 64 50 (19)
Possible 273 78 67 48 67 50 (19)
*Confirmed using medical records and uniform diagnostic criteria.
‡All three elements were not available for some participants.
§Among participants with pulmonary function test reports.
COPD, chronic obstructive pulmonary disease; FEV1, forced expiratory volume in 1 second.
Barr et al., 2002.
vious suggestion that inferred positive associations between estrogen use and en-
dometrial cancer were largely attributable to estrogen causing vaginal bleeding,
which led to more intensive diagnostic surveillance and detection of otherwise
subclinical cancers. In a case–control study, this would manifest itself as a spu-
riously high proportion of cases who had experienced bleeding compared to con-
trols, and was cited as the rationale for choosing controls who had themselves
experienced vaginal bleeding and would thus have been carefully scrutinized for
the presence of endometrial cancer.
Hulka et al. (1980) conducted a number of relevant analyses, including a sim-
ulation of the potential magnitude of such a bias, if present. In a hypothetical
population of 100,000 women age 50 and older, an estimate was made of the
proportion of women with asymptomatic cancer (the pool in which selective de-
tection could occur), the proportion of women using estrogen, and the number
of diagnosed endometrial cancer cases (who would be enrolled in a study). As
shown in Figure 9.1, only the women who fall into the intersection of those three
groups contribute to the bias, i.e., asymptomatic cancer, estrogen user, diagnosed
cases. In this scenario, there would be only 5 such cases. Their inclusion would
yield a relative risk of 3.9 comparing estrogen users to nonusers, and their ex-
clusion (eliminating the hypothesized detection bias) would yield a relative risk
of 3.7. Under a set of reasonable assumptions, the hypothesized bias is therefore
negligible in magnitude.
[Figure 9.1: Venn diagram of Groups I, II, and III. Among the 10,000 estrogen users (Group II), 145 are diagnosed cases only, 25 have asymptomatic cancer only, 5 fall in all three groups (labeled BIAS), and 9,825 are in neither of the other groups.]
FIGURE 9.1. Detection bias and endometrial cancer: diagnosed cases and asymptomatic
cases. Assuming a hypothetical population of 100,000 women age ≥50 years, three groups
are formed: Group I = the 5-year cumulative incidence of diagnosed cancer; Group II =
the 5-year period prevalence of estrogen use; Group III = the 5-year period prevalence
of asymptomatic cancer (Hulka et al., 1980).
NOTES TO FIGURE 9.1
1. Incidence of diagnosed endometrial cancer = 1/1000/year* × 5 years × 100,000
women = 500 diagnosed cases (Group I).
2. 5-year period prevalence of estrogen use = 10%† of 100,000 women = 10,000 women
having used estrogen (Group II).
3. 5-year period prevalence of asymptomatic cancers = 3/1000 (27,28) × 100,000
women = 300 asymptomatic cancers (Group III).
4. 30% of diagnosed cases used estrogen = 0.30 × 500 = 150.
5. 10%† of asymptomatic cases used estrogen = 0.10 × 300 = 30.
6. 20%‡ of estrogen users with previously asymptomatic cancer bled and became diagnosed
cases = 0.20 × 30 ≈ 5.
7. 6% of non-estrogen-using diagnosed cases were asymptomatic = 0.06 × 350 = 21.
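The relative risks of 3.9 and 3.7 quoted for this scenario can be reproduced from the counts in the notes to Figure 9.1. The sketch below assumes that the measure is a simple ratio of 5-year cumulative incidence in estrogen users versus nonusers. That reading is an assumption made for illustration rather than a documented account of the original calculation, and the variable names are invented.

```python
# Counts taken from the notes to Figure 9.1 (hypothetical population)
population = 100_000
estrogen_users = 10_000                      # note 2
nonusers = population - estrogen_users       # 90,000
diagnosed_cases = 500                        # note 1
cases_among_users = 150                      # note 4
cases_among_nonusers = diagnosed_cases - cases_among_users  # 350
detection_bias_cases = 5                     # three-way intersection in the figure


def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of cumulative incidence in the exposed to that in the unexposed."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


rr_with_bias = relative_risk(cases_among_users, estrogen_users,
                             cases_among_nonusers, nonusers)
rr_without_bias = relative_risk(cases_among_users - detection_bias_cases,
                                estrogen_users, cases_among_nonusers, nonusers)
print(round(rr_with_bias, 1), round(rr_without_bias, 1))  # 3.9 3.7
```

Removing the 5 cases attributed to selective detection changes the estimate only in the first decimal place, which is the basis for judging the hypothesized bias to be negligible under these assumptions.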
thereby bias measures of association. For that reason, another control group was
selected from the general population using drivers’ license records, referred to
as community controls. As shown in Table 9.4, the two control groups yielded
generally similar results for both men and women, subject to some imprecision.
For vegetable intake among men, results for the community controls tended to
show stronger inverse gradients than the colonoscopy-negative controls. Except
for juice intake among women, associations were generally absent or weak. In-
clusion of the two control groups, one of which was diagnosed accurately to be
free of polyps, allows evaluation of the impact, if any, of incomplete diagnosis
and selection bias, but only relative to each other and not relative to a true “gold
standard” that is free of either potential bias.
Even if a stratum cannot be created in which disease ascertainment is accu-
rate, perhaps subgroups can be isolated in which there is only non-differential
underascertainment and an absence of false positives. In such a study, ratio mea-
sures of association will be nearly unbiased (Rothman & Greenland, 1998). En-
suring the absence of overascertainment is essential for this approach to be
effective. By creating strata with varying degrees of underascertainment, infor-
mation is also generated for examining a dose-response gradient in potential for
bias due to differential underascertainment. That is, if we can define strata of
low, moderate, and high degrees of underascertainment, examining measures of
association across those strata may help to indicate whether bias is present.
Attempts to create strata with accurate disease ascertainment or non-differential
underascertainment will show that the exposure–disease association either
does or does not differ in relation to the presumed indicator of the magnitude
of non-differential disease underascertainment. When the estimated measure
of association differs, then as long as the basis for stratification is valid, the
more accurate measure comes from the stratum that is free of overascertainment
or influenced solely by non-differential underascertainment. When the results are
similar across strata, then there are several possible explanations. The effort to
isolate the subset that is affected only by non-differential underascertainment may
have been unsuccessful, i.e., all strata continue to suffer from such bias. Alter-
natively, there may have been no bias due to disease misclassification in the first
place so that stratification failed to generate the expected pattern.
TABLE 9.4. Multivariate-Adjusted Odds Ratios* for Colorectal Adenomas by Quintile of Fruit and Vegetable Intake for Women and Men, Minnesota Cancer Prevention Research Unit Case-Control Study, 1991–1994
            Mean Intake     Cases vs. Colonoscopy-Negative Controls Cases vs. Community Controls
FOOD GROUP  (Servings/week) Women               Men                 Women               Men
QUINTILE    WOMEN   MEN     OR    95% CI        OR    95% CI        OR    95% CI        OR    95% CI
Fruits
1           3.3     2.1     1.00                1.00                1.00                1.00
2           7.4     5.9     0.95  0.52, 1.72    0.79  0.46, 1.36    0.65  0.34, 1.25    0.73  0.43, 1.23
3           11.2    9.6     0.91  0.50, 1.63    1.06  0.61, 1.84    0.78  0.40, 1.52    1.00  0.59, 1.68
4           15.8    14.7    1.10  0.59, 2.05    0.79  0.44, 1.43    0.61  0.30, 1.20    0.62  0.36, 1.06
5           27.5    26.9    1.34  0.66, 2.69    0.66  0.35, 1.24    0.68  0.32, 1.43    0.75  0.41, 1.35
p trend                     0.54                0.16                0.29                0.44
Vegetables
1           10.1    8.8     1.00                1.00                1.00                1.00
2           17.6    15.1    1.12  0.62, 2.01    1.29  0.75, 2.23    1.08  0.56, 2.07    0.67  0.39, 1.13
3           23.8    20.2    1.16  0.62, 2.16    1.11  0.64, 1.93    0.86  0.44, 1.68    0.73  0.43, 1.26
4           31.6    27.1    2.26  1.23, 4.14    1.30  0.72, 2.34    1.34  0.69, 2.59    0.59  0.34, 1.03
5           51.4    44.7    1.70  0.87, 3.34    0.90  0.48, 1.69    1.40  0.67, 2.92    0.55  0.30, 0.98
p trend                     0.10                0.69                0.24                0.16
Juice
1           0.5     0.5     1.00                1.00                1.00                1.00
2           2.2     1.9     0.81  0.48, 1.39    1.53  0.86, 2.73    0.97  0.53, 1.78    1.16  0.67, 2.01
3           4.8     4.2     0.72  0.41, 1.27    1.24  0.73, 2.10    0.80  0.43, 1.51    0.83  0.51, 1.35
4           7.7     7.4     0.61  0.34, 1.09    0.88  0.52, 1.51    0.56  0.31, 1.03    0.75  0.45, 1.26
5           14.2    15.1    0.50  0.27, 0.92    0.98  0.55, 1.73    0.56  0.30, 1.06    0.97  0.56, 1.67
p trend                     0.02                0.97                0.04                0.58
Total Fruits and Vegetables
1           18.4    16.5    1.00                1.00                1.00                1.00
2           31.8    26.8    0.76  0.42, 1.38    0.80  0.46, 1.38    0.61  0.32, 1.18    0.76  0.45, 1.30
3           41.8    36.1    1.06  0.59, 1.92    1.05  0.60, 1.83    1.01  0.52, 1.94    0.95  0.56, 1.61
4           53.8    48.5    1.48  0.79, 2.78    0.82  0.44, 1.51    0.71  0.36, 1.38    0.46  0.27, 0.80
5           82.8    75.9    0.96  0.47, 1.96    0.61  0.31, 1.22    0.76  0.34, 1.66    0.60  0.32, 1.12
p trend                     0.79                0.40                0.86                0.20
*Adjusted for age (continuous), energy intake (continuous), fat intake (continuous), body mass index (continuous), smoking status (never, current, former), alcohol status (nondrinker,
former drinker, current drinkers consuming <1 drink/week, current drinkers consuming ≥1 drink/week), nonsteroidal antiinflammatory use (yes, no), multivitamin use
(yes, no), and hormone replacement therapy use (yes, no; in women only).
OR, odds ratio; CI, confidence interval.
Smith-Warner et al., 2002.
tainment is the only form of error. Finally, when misclassification makes it im-
possible to study the full spectrum of a disease entity, including the subset that
is highly susceptible to misclassification, the interest can sometimes be shifted
to a subset of the disease that is more tractable, i.e., the more severe and there-
fore more accurately diagnosed cases. Some degree of disease misclassification
is inherent in epidemiologic studies, but through careful evaluation of the source
and manifestation of the problems, the consequences can be mitigated or at least
understood for accurate interpretation of the study results.
REFERENCES
Barr RG, Herbstman J, Speizer FE, Camargo CA Jr. Validation of self-reported chronic
obstructive pulmonary disease in a cohort study of nurses. Am J Epidemiol 2002;155:
965–971.
Berkowitz GS, Blackmore-Prince C, Lapinski RH, Savitz DA. Risk factors for preterm
birth subtypes. Epidemiology 1998;9:279–285.
Brenner H, Savitz DA. The effects of sensitivity and specificity of case selection on va-
lidity, sample size, precision, and power in hospital-based case-control studies. Am J
Epidemiol 1990;132:181–192.
Greenland S. Tests for interaction in epidemiologic studies: a review and a study of power.
Stat Med 1983;2:243–251.
Hayes RB, Yin S-N, Dosemeci M, Li G-L, Wacholder S, Travis LB, Li C-Y, Rothman
N, Hoover RN, Linet MS. Benzene and the dose-related incidence of hematologic
neoplasms in China. J Natl Cancer Inst 1997;89:1065–1071.
Hulka BS, Grimson RC, Greenberg RG, Kaufman DG, Fowler WC Jr, Hogue CJR, Berger
GS, Pulliam CC. “Alternative” controls in a case-control study of endometrial cancer
and exogenous estrogen. Am J Epidemiol 1980;112:376–387.
Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic research: principles and quan-
titative methods. Belmont, CA: Lifetime Learning Publications, 1982.
Lang JM, Lieberman E, Cohen A. A comparison of risk factors for preterm labor and term
small-for-gestational-age birth. Epidemiology 1996;7:369–376.
Marchbanks PA, Peterson HB, Rubin GL, Wingo PA, and the Cancer and Steroid Hormone
Study Group. Research on infertility: definition makes a difference. Am J Epidemiol
1989;130:259–267.
Morgenstern H, Thomas D. Principles of study design in environmental epidemiology.
Environ Health Perspect 1993;101 (Suppl 4):23–38.
Poole C. Exception to the rule about nondifferential misclassification (abstract). Am J
Epidemiol 1985;122:508.
Rose G. Sick individuals and sick populations. Int J Epidemiol 1985;14:32–38.
Rothman KJ. Modern epidemiology. Boston: Little, Brown and Company, 1986.
Rothman KJ, Greenland S. Modern epidemiology, Second edition. Philadelphia:
Lippincott-Raven Publishers, 1998.
Rothman KJ, Greenland S. Modern epidemiology, Second edition. Philadelphia: Lippincott-
Raven Publishers, 1998:353–355.
Savitz DA, Andrews KW. Review of epidemiologic evidence on benzene and lymphatic
and hematopoietic cancers. Am J Ind Med 1997;31:287–295.
Savitz DA, Blackmore CA, Thorp JM. Epidemiology of preterm delivery: etiologic het-
erogeneity. Am J Obstet Gynecol 1991;164:467–471.
Savitz DA, Brett KM, Evans LE, Bowes W. Medically treated miscarriage among Black
and White women in Alamance County, North Carolina, 1988–1991. Am J Epidemiol
1994;139:1100–1106.
Schwartz J. Low-level lead exposure and children’s IQ: a metaanalysis and search for a
threshold. Environ Res 1994;66:42–55.
Smith-Warner SA, Elmer PJ, Fosdick L, Randall B, Bostick RM, Grandits G, Grambsch
P, Louis TA, Wood JR, Potter JD. Fruits, vegetables, and adenomatous polyps. The
Minnesota Cancer Prevention Research Unit case-control study. Am J Epidemiol
2002;155:1104–1113.
Tielemans E, Burdorf A, te Velde E, Weber R, van Kooij R, Heederik D. Sources of bias
in studies among infertility clinics. Am J Epidemiol 2002;156:86–92.
Villeneuve PJ, Johnson KC, Mao Y, Canadian Cancer Registries Epidemiology Research
Group. Brain cancer and occupational exposure to magnetic fields among men: results
from a Canadian population-based case–control study. Int J Epidemiol 2002;31:210–
217.
Vogt TM, Mayne ST, Graubard BI, Swanson CA, Sowell AL, Schoenberg JB, Swanson
GM, Greenberg RS, Hoover RN, Hayes RB, Ziegler RG. Serum lycopene, other serum
carotenoids, and risk of prostate cancer in US Blacks and Whites. Am J Epidemiol
2002;155:1023–1032.
Wilcox AJ, Weinberg C, O’Connor J, et al. Incidence of early loss of pregnancy. N Engl
J Med 1988;319:188–194.
Wong O. Risk of acute myeloid leukemia and multiple myeloma in workers exposed to
benzene. Occup Environ Med 1995;52:380–384.
10
RANDOM ERROR
Previous chapters considered the role of systematic error or bias in the evalua-
tion of epidemiologic evidence. In each case, specific phenomena were consid-
ered which would predictably result in erroneous estimates of effect, and the dis-
cussion was focused on evaluating whether the underlying conditions that would
lead to the bias were present. Hypotheses of bias constitute candidate explana-
tions, ideally specific and testable, for why the observed results might not reflect
the true causal relation or lack of relation between exposure and disease. Ran-
dom error is different in character, despite the fact that it also generates estimates
of effect that deviate from the correct measurement. The key difference is that
random error, by definition, does not operate through measurable, testable causal
pathways, making it a more elusive concept. We can ask why a coin does not
land as heads or tails exactly 50% of the time in a series of trials. Perhaps the
subtle differences in the way it was flipped or the wind speed and direction hap-
pened to favor heads or tails slightly. Although there is a physical process in-
volving movement and gravity that could in principle lead to a predictable result
on any given trial or series of trials, we ascribe the deviation of a series of outcomes
from the expected 50/50 split to random error.
The two pathways that lead to statistically predictable patterns of random er-
ror are sampling and random allocation. Sampling black and white balls from an
urn, there will be a predictable pattern of deviation from the true ratio in a
sume that small deviations between the true and measured values are more prob-
able than large deviations, and with some assumptions, we can even estimate the
probability that deviations of differing magnitudes between the measured and
true values will occur. Second, as the study size becomes larger, the probability
of a deviation of at least a given magnitude decreases, in contrast to all the other
forms of bias that have been discussed. If there is bias from non-response or con-
founding, increasing study size will not make it go away. None of these attri-
butes actually define the underlying phenomenon that generates random error,
but they do suggest ways to evaluate and minimize its possible influence on the
results of epidemiologic research.
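The second attribute, that deviations of a given size become less probable as study size grows, can be illustrated with the coin-flipping example. The sketch below computes the exact binomial probability that the observed proportion of heads differs from 0.5 by at least 0.10. The threshold and the study sizes are arbitrary choices for illustration.

```python
from math import comb


def prob_deviation_at_least(n, d, p=0.5):
    """Exact probability that the observed proportion in n trials
    differs from the true proportion p by at least d."""
    total = 0.0
    for k in range(n + 1):
        # The small tolerance guards against floating-point error when
        # the deviation is exactly d (e.g., 4 heads in 10 flips).
        if abs(k / n - p) >= d - 1e-12:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


for n in (10, 100, 1000):
    print(n, prob_deviation_at_least(n, 0.10))
```

The probability falls steadily as n increases. No comparable shrinkage occurs for a systematic error, which would shift the expected proportion itself rather than the scatter around it.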
null is motivated in part by looking for interesting findings (i.e., non-null results)
at the expense of focusing on the comprehensive, critical evaluation of all the
scientifically important findings. Surely, if the results are worth generating, given
all the effort that requires, they are worthy of evaluation in a comprehensive, sys-
tematic manner.
The credibility of the findings is a function of freedom from both systematic
and random error. Practically, we do need to focus our effort on the major sources
of distortion that operate, and in small studies, random error may well be at the
top of the list, whereas in larger studies, various forms of systematic error will
likely predominate. Perhaps the availability of a refined statistical framework for
examining random error that is far better developed and more widely applied
than those for other forms of error may tempt epidemiologists to give special at-
tention to this concern, like looking for lost keys where the light is brightest. In-
stead, we need to look for the lost keys in the location where they were most
likely lost.
Observational studies make the conventional tools for evaluation of random er-
ror far more tenuous than in experimental studies (Greenland, 1990). There are
two considerations that make classical frequentist statistics less
applicable to observational research: First, there is generally not a formal sampling
of participants from a defined roster, except in some instances in recruiting
cohorts or selecting controls in case–control studies. More often, the choice
of study setting and participants is based on a variety of scientific and logistical
decisions that do not resemble a random sampling procedure. Second, whereas
in experiments, the exposure or treatment is randomly assigned (as in laboratory
studies on animals or in randomized controlled trials in humans), in observational
studies, exposure occurs through a diverse and often ill-defined implicit alloca-
tion method.
The frequentist statistical framework for measuring and describing random er-
ror is based on sampling theory and the probability of obtaining deviant samples
or deviant allocation through purely random processes. The random sampling in-
volved in the allocation of the exposure or treatment is critical because one can
formally consider the universe of other ways in which the units could have been
sampled or assigned. That is, if we allocate 20 rats or 20 people to one of two
treatments, 10 per group, we can formally delineate all the ways treatment could
be distributed across those participants. Given random allocation, one can ask
the critical question of how probable is it, under some assumption about the truth,
that a perfectly unbiased method of assigning treatment has generated results that
deviate by specified amounts from that assumed, true value. We are not consid-
ering a biased method in assigning treatment but a perfectly valid one that will
still yield groups that are not perfectly balanced in regard to baseline risk, and
will occasionally generate assignments that are substantially deviant from equal
baseline risk. Sampling theory is used to quantify the probability of obtaining
deviant samples of any given magnitude.
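For the example of 20 units allocated 10 per group, the full set of possible allocations can be counted directly. The sketch below also asks how often a perfectly random allocation would produce a marked imbalance, under the purely hypothetical assumption that 6 of the 20 units are at high baseline risk. The count of 6 and the definition of "marked" (5 or 6 of the high-risk units in one group) are illustrative choices.

```python
from math import comb

n_units, n_per_group = 20, 10

# Number of distinct ways to choose which 10 of the 20 units get treatment A
n_allocations = comb(n_units, n_per_group)
print(n_allocations)  # 184756


def prob_high_risk_in_treatment(k, n_high_risk=6):
    """Probability that exactly k of the high-risk units land in treatment A
    when every allocation is equally likely (hypergeometric probability)."""
    ways = comb(n_high_risk, k) * comb(n_units - n_high_risk, n_per_group - k)
    return ways / n_allocations


# Marked imbalance: at least 5 of the 6 high-risk units in the same group,
# i.e., 0, 1, 5, or 6 of them in treatment A
p_imbalanced = sum(prob_high_risk_in_treatment(k) for k in (0, 1, 5, 6))
print(round(p_imbalanced, 3))  # 0.141
```

Because every allocation is equally likely under randomization, such probabilities follow from counting alone. It is this enumeration that has no clear counterpart when exposure arises from societal forces or individual choice.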
Even in the case of a randomized trial, however, a focus solely or primarily
on random error is justified only when other critical features of the study have
been properly designed and managed. Characteristics of the study such as biased
assessment of outcomes by participants or researchers, control for other deter-
minants of outcome that happen to be unequally distributed, and compliance with
study protocols need to be considered alongside the possibility that a perfectly
random procedure for allocation went awry. Nevertheless, in a carefully designed
trial, those other issues may be put to rest with greater confidence than in an ob-
servational study, giving greater relative importance to the role of random error.
In the case of observational studies, the model of random allocation and a dis-
tribution of potential imbalances in that allocation as the basis for the interpre-
tation of statistical results is simply not applicable (Greenland, 1990). That is,
the framework for generating estimates of variability due to random sampling
cannot be justified based on a random sampling or random assignment process.
It is difficult to develop a compelling rationale for treating exposures that result
from societal forces or individual choice as though they were randomly assigned,
and thus the interpretation of results that are predicated on such allocation must
be less formal, at a minimum. There are elements of chance in observational
studies, of course, such as selection of the study setting and time period. Such
qualitative elements of randomness, however, fall far short of a justification for
direct application of the technology that was generated for experiments in which
the exposure of interest is randomly allocated. The concept of random error still
applies to observational epidemiologic studies, in that we believe that random
forces will prevent us from measuring precisely the causal association of inter-
est, but its origins and thus its statistical properties are poorly defined. Nonethe-
less, like random error in experimental studies, the scatter introduced by random
error in observational studies is presumed to be symmetrical, and big studies suf-
fer less from the problem than small studies. Thus, in observational studies, there
is a desire to quantify the role of random error, despite recognition that we can-
not do so precisely.
We thus have a dilemma in which it is recognized that random error contributes
to observational studies but the dominant statistical framework was constructed
for other types of research. It is not surprising that epidemiologists have turned
to that approach as a means of addressing random error, nor is it surprising that
the lack of direct applicability of the framework is often neglected. The recom-
mended approach, subject to much-needed improvement, is to apply the tools
that are built on the random sampling paradigm in a flexible manner so as to in-
form judgments about the potential impact of random error on the study find-
ings, while recognizing that the product is at best a guideline or clue to inter-
pretation. As discussed in the section, “Interpretation of Confidence Intervals,”
confidence intervals should not be used to define boundaries or dichotomize re-
sults as compatible or incompatible with other findings, but to broadly charac-
terize the study’s precision and convey some notion of how influential random
error may have been (Poole, 2001). Probability values, used on a continuous scale
and cautiously interpreted, may also have value for the same purpose (Weinberg,
2001). What is indefensible in observational studies, and questionable even in
experimental ones, is a rigid, categorical interpretation of the results epitomized
by statistical significance testing or, equivalently, examination of confidence interval
boundaries. The evidence itself is not categorical: random error does not
act in an all-or-none fashion, the underlying assumptions for statistical testing
are not met, and any attempt to claim that the study results are “due to chance”
or “not due to chance” is unwarranted.
2001). The desire for a simple approach to categorize study findings as positive
(statistically significant) or negative (not statistically significant) is understand-
able, since the alternative is much less tidy.
The parallel approach to systematic error, in which we would categorize re-
sults as susceptible to misclassification versus immune from misclassification or
prone to selection bias versus free of selection bias would be no less absurd. It
is obvious that such methodologic issues do not occur in an all-or-none manner,
nor is the probability that they are present zero or one. Particularly when stud-
ies generate many findings, it is tempting to find some approach to help whittle
down the results into a more manageable short list of those that really deserve
scrutiny. The alternative to arbitrary dichotomization of findings is to scrutinize
each and every result for the information it provides to advance knowledge on
one narrowly defined hypothesis (Cole, 1979). If there is a narrowing of focus
to selected findings, it should be based on the magnitude of prior interest and the
quality of information generated by the study.
Unfortunately, as pervasive as statistical testing remains in epidemiology
(Savitz et al., 1994), the concepts of statistical testing are more entrenched in
other branches of research; communication with colleagues, as well as lawyers
and policy makers, often places unrelenting pressure on epidemiologists to gen-
erate and use statistical tests in the interpretation of their findings. Just as it is a
challenge to substitute a more valid, but complex, interpretation of results for
communication among epidemiologists, it is a challenge to explain to those out-
side the discipline what is required to make meaningful inferences in epidemi-
ology. The argument that “everyone else does it” is specious, but unfortunately
many epidemiologists are not so secure as to be comfortable in deviating from
the (inappropriate) norms of our colleagues working in more advanced scientific
disciplines.
Results of statistical tests are often posed and interpreted as asking whether
the results are likely to have occurred by chance alone. More formally, we estimate
the probability of having obtained results at least as divergent from the
null as those observed, under the assumption that the null hypothesis is
true. Then we ask whether the calculated p-value falls below the critical value,
typically 0.05, and if it does, we declare the results unlikely to have arisen by
chance alone if the null hypothesis is true. If we cannot conclude that the results
are unlikely to have arisen by chance alone—i.e., they do not pass this screen-
ing—a conclusion is drawn that the results could have arisen by chance alone
even if the null hypothesis is true. It is then inferred that no meaningful associ-
ation is present—i.e., what has been observed as a deviation from the null re-
flects random error, and therefore there is no need for discussion of other po-
tential biases that might have yielded this association. If the result passes the
screening and is found to be statistically significant, then an association may
be declared as established and further examined for contributing biases or the
The concern with random error has also been applied to the interpretation of the
array of results within a given study or, even more broadly, to a wider universe
of results from the same data collection effort or generated by the same investi-
gator. Here the concern as typically raised is not with the evaluation of precision
in a specific measure of association, which is the focus in generating a confi-
dence interval, but rather with using the broader array of results to help judge a
specific finding.
When examining an array of results, one can ask about the probability that
a specified number of those findings will appear to be statistically significantly
different from the null value. That is, a conventional critical p-value of 0.05
applies to a single observed result, but if one is interested in maintaining this
critical value for an array of results, for example 10 measures of association, then
the critical p-value for each of those measures must be smaller to ensure that the
critical p-value for the aggregate of findings remains 0.05, i.e., that there is a 5%
probability of at least one p-value falling below the per-measure critical value if
all the null hypotheses are true. The probability of obtaining one or more
statistically significant results under those conditions, assuming the tests are
independent, is 1 − (1 − alpha)^n, where alpha is the critical p-value applied to
each measure, typically 0.05, and n is the number of such calculations that are
made. Taking this experiment-wise error into account using the
Bonferroni correction is intended to be and is, of course, conservative. Fewer
findings will be declared statistically significant, but of course there is also a
tremendous loss of statistical power that results from making the critical p-
values more stringent. The formal statistical hypothesis that is addressed is how
likely is it that we would observe an array of findings that is as extreme or more
extreme than the array we obtained under the assumption that all associations
being explored are truly null? We assume that there are no associations present
in the whole array of data, just as we assume the individual null hypothesis is
true to generate a p-value for an individual finding.
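The arithmetic behind the experiment-wise error can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes independent tests and a universal null hypothesis, and the function names are hypothetical rather than drawn from any statistical package.

```python
def familywise_error(alpha: float, n: int) -> float:
    """Probability of at least one p-value below alpha among n independent
    tests when every null hypothesis is true: 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - alpha) ** n


def bonferroni_alpha(target: float, n: int) -> float:
    """Per-measure critical p-value under the Bonferroni correction."""
    return target / n


n_tests = 10
uncorrected = familywise_error(0.05, n_tests)   # each test judged at 0.05
per_test = bonferroni_alpha(0.05, n_tests)      # stricter per-test threshold
corrected = familywise_error(per_test, n_tests)
expected_false_positives = 0.05 * n_tests       # expected count under the universal null

print(f"experiment-wise error, uncorrected: {uncorrected:.3f}")
print(f"Bonferroni per-test critical value: {per_test:.4f}")
print(f"experiment-wise error, corrected:   {corrected:.4f}")
print(f"expected false positives at 0.05:   {expected_false_positives:.1f}")
```

With 10 measures each judged at 0.05, the chance of at least one nominally significant result is about 0.40, and the Bonferroni threshold brings the aggregate error slightly below 0.05, which is the sense in which the correction is conservative.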
Consideration of an array of findings is most often used to ask whether a given
result from a study that attains some critical p-value is nevertheless likely to have
arisen by chance. Consideration of the number of such calculations made in the
course of analyzing data from the study is used to address the question, “How
many statistically significant measures of association would we expect even if
none are truly present?” Assuming that the universal null hypothesis is correct,
and that there are truly no associations present in the data, the number of statis-
tically significant results that are generated will increase as the number of asso-
ciations that are examined increases. From this perspective, a result that emerges
from examination of many results is more likely to be a false positive finding
than a result that is generated in isolation or from a small array of results. Ac-
cording to the formal technology of generating p-values and assigning statistical
significance, this is certain to be true.
Despite the technical accuracy of this line of reasoning, there are a number of
serious problems with attempts to use the concept of multiple testing, as con-
ventionally applied, to interpret individual study results (Savitz & Olshan, 1995).
The relevant constellation of results that defines the universe of interest is arbi-
trarily defined, sometimes consisting of other results reported in a given publi-
cation, but it could just as legitimately be based on the results generated from
that data collection effort or by that investigator. Each measure of effect that is
generated addresses a different substantive question, and lumping those results
together simply because the same data collection effort produced them all makes
no logical sense. It would be no less arbitrary to group all the results of a given
ever, it is solely the quality of the evidence that determines its value, not the tim-
ing of data analysis or the mental processes of the investigator.
In practice, many who interpret epidemiologic data continue to put a great deal
of stock in the multiple comparisons issue. Arguments are made that a result
should be given more credence because it came from an a priori hypothesis or
less credence because it did not. Sometimes investigators will be careful to point
out that they had a question in mind at the inception of the study, or to make
special note of what the primary basis was for the study having been funded. Re-
viewers of manuscripts sometimes ask for specification of the primary purpose
of the study, particularly when presenting results that address issues other than
the one that justified initial funding of the research. At best, these are indirect
clues to suggest that the investigators may have been more thorough in assess-
ing the relevant background literature on the primary topic before the study as
compared to secondary interests. Perhaps that effort helped them to refine data
collection or to choose a study population that was especially well-suited to ad-
dress the study question. The only consequence of interest is in how the data
were affected, and it is the quality of the data that should be scrutinized rather
than the investigators’ knowledge and when it arose. If all that foresight and plan-
ning failed to result in a suitable study design, there should be no extra points
awarded for trying and if they were just lucky in having ideal data for an unan-
ticipated study question, then no points should be deducted for lack of foresight.
Analogous arguments are commonly made that a result should be given less
credence because it came from an ex post facto hypothesis, sometimes referred
to as data dredging or fishing. Such criticisms are without merit, other than per-
haps to focus attention to the possibility of substantive concerns. If a particular
finding has little context in the literature, then even with the addition of new sup-
porting evidence, the cumulative level of support is likely to remain modest. If
an issue was not anticipated in advance, this may result in lower quality data to
address the question, with inferior exposure or disease measurement or lack of
data on relevant potential confounders. Any such problems warrant close exam-
ination and criticism. What does not warrant scrutiny is how many analyses were
done, what other uses have been made of the data, or how and when the analy-
sis plans came about.
Where selective analyses and reporting of findings become important is in the
dissemination of findings through presentations at research meetings and espe-
cially in the published literature. Often, the primary hypothesis is of sufficient
importance that investigators will be motivated to publish the results regardless
of the outcome, whether confirming or refuting previous literature. On the other
hand, secondary hypotheses may well be dismissed quickly if the results are not
interesting (i.e., not positive), and thus the body of published literature is not
only incomplete but is a biased sample from the work that has been done (Dick-
ersin et al., 1987; Chalmers, 1990). If data are dredged not simply to glean all
that can be obtained from the data but to find positive results (at its worst, to
skim off statistically significant findings), then the available literature will pro-
vide a poor reflection of the true state of knowledge. At the level of the indi-
vidual investigator, results-based publication constitutes a disservice to science,
and decisions by editors to favor positive over null findings would exacerbate
the difficulty in getting the scientific literature to accurately reflect the truth.
Techniques have recently been proposed and are becoming more widespread
that make beneficial use of the array of findings from a study to improve each
of the estimates that are generated (Greenland & Robins, 1991; Greenland &
Poole, 1994). The goal is to take advantage of information from the constella-
tion of results to make more informed guesses regarding the direction and amount
of random error and thereby produce a set of revised estimates that are likely,
in the aggregate, to be closer to their true values. Estimates are
modified through empirical Bayes or Bayesian shrinkage: the highest and most
imprecise estimates in an array of results are likely to have been inflated by
random error, just as imprecise estimates far below the null value are likely to
have been reduced by it. By using this information on likely
direction and magnitude of random error, obtainable only from the constellation
of findings, outliers can be shrunk in light of presumed random error toward
more probable estimates of the magnitude of association. Instead of simply giv-
ing less attention to extreme, imprecise results informally, the findings from other
analyses of the data help to produce a better estimate. The nature of random
error, which will tend to generate some erroneously high and some erroneously
low measures of association, is exploited to make extremely high and low
imprecise values less extreme, since it is almost certain that random error has
introduced distortion that contributes to their being outliers. A better guess of their
values, free of random error, is that they would be closer to the null on average.
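The shrinkage idea can be illustrated with a deliberately simplified sketch. It is not the procedure of Greenland and Robins or Greenland and Poole; it assumes a normal model on the log relative risk scale, shrinks toward the precision-weighted mean of the array, and estimates the between-estimate variance by a method-of-moments formula. All input values are hypothetical.

```python
import math


def shrink(log_rrs, variances):
    """Shrink each log relative risk toward the precision-weighted mean.

    Returns (mean, tau2, shrunken estimates). tau2 is a method-of-moments
    estimate of the variance of the true effects across the array.
    """
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    mean = sum(w * b for w, b in zip(weights, log_rrs)) / w_sum
    q = sum(w * (b - mean) ** 2 for w, b in zip(weights, log_rrs))
    denom = w_sum - sum(w * w for w in weights) / w_sum
    tau2 = max(0.0, (q - (len(log_rrs) - 1)) / denom)
    # Imprecise estimates (large variance) receive a small weight on their
    # own observed value and are pulled most strongly toward the mean.
    shrunk = [mean + tau2 / (tau2 + v) * (b - mean)
              for b, v in zip(log_rrs, variances)]
    return mean, tau2, shrunk


# Hypothetical array of results: relative risks and standard errors of log RR.
rrs = [0.8, 1.0, 1.2, 1.5, 6.0]
ses = [0.15, 0.10, 0.12, 0.20, 0.90]

mean, tau2, shrunk = shrink([math.log(r) for r in rrs], [s * s for s in ses])
for rr, est in zip(rrs, shrunk):
    print(f"observed RR {rr:.2f} -> shrunken RR {math.exp(est):.2f}")
```

In this hypothetical array the very imprecise relative risk of 6.0 is pulled most of the way toward the center, while the precise estimates barely move.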
Random error is always present to some extent and generates error in estimating
measures of effect. Furthermore, larger studies suffer less distortion from ran-
dom error than smaller studies. Thus, some method is needed to quantify the
amount of random error that is present. Though the underlying rationale for con-
structing confidence intervals is based on the same sampling theory as p-values,
they are useful for characterizing the precision in a much more general way. The
statistical framework is used to help make an assessment of the role of random
error, but it is advisable to step back from the formal underpinnings and not at-
tempt to make fine distinctions about values within or outside the interval, to
screen results as including or excluding the null value, or to let the confidence
interval dominate the basis for drawing conclusions.
Starting with a known true value for the measure of interest, the confidence
interval is the set of values for which the p-value will be greater than 0.05 (for
a 95% confidence interval) in a test of whether the data deviate from the true
value. That is, assuming the point estimate is the correct value, all those values
up to the boundary of the interval would generate p-values of 0.05 or greater if
tested using the given study size. Obviously, as the study size gets larger, the
array of values that would generate p ≥ 0.05 becomes smaller and smaller, and
thus the confidence interval becomes narrower and narrower. A consequence of
the manner in which the interval is constructed is that the true value for the pa-
rameter of interest will be contained in such intervals in a specified proportion
of cases, 95% if that is the chosen coverage. Confidence intervals, despite be-
ing based on many of the same unrealistic assumptions as p-values and statis-
tical tests, are much more useful for quantifying random error, so long as they
are not merely used as substitutes for statistical tests. If one uses the formal sta-
tistical properties of confidence intervals, the bounds of the interval can be in-
terpreted as two-tailed statistical tests of the null hypothesis—if the lower bound
of a 95% confidence interval exceeds the null value, then the p-value is less
than 0.05. If the interpretation of the confidence interval is solely based on
whether that interval includes or excludes the null value, then it functions as a
statistical test and suffers from all the drawbacks associated with an arbitrary
dichotomization of the data. Instead, confidence intervals should be used as in-
terval estimates of the measure of interest, conveying a sense of the precision
of that estimate.
Only marginally better is the use of confidence intervals to define bounds of
compatibility, i.e., interpreting the interval as a range within which the true value is
likely to lie. In this view, the confidence interval is treated as a step function
(Poole, 1987), with values inside the interval presumed to be equally likely and
values outside the interval equally unlikely. In reality, the point estimate and val-
ues near it are far better estimates of the true value based on the data than val-
ues further away but still within the interval. Similarly, values toward the ex-
tremes of the interval and values just outside the interval are not meaningfully
different from one another either. A slight variant of the exclusionary interpre-
tation of confidence intervals is to compare confidence intervals from two or
more studies to assess whether they overlap. If they do overlap, the interpreta-
tion is that the results are statistically compatible and if they do not, the results
are believed to be statistically different from one another. If the goal is to test
the statistical difference in two estimates, or an estimate and a presumed true
value, there are more direct ways to do so. In fact, if such a difference is of par-
ticular interest, the point estimate of that difference can be calculated and a con-
fidence interval constructed around the estimated difference.
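One direct way to compare two estimates, sketched here under stated assumptions, is to work on the log scale and build an interval around the ratio of the two relative risks. The sketch assumes the two studies are independent and that each reported interval is a symmetric (Wald-type) 95% interval on the log scale; the numbers are illustrative and the function names are hypothetical.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def se_from_ci(lower: float, upper: float) -> float:
    """Recover the standard error of log RR from a 95% confidence interval."""
    return (math.log(upper) - math.log(lower)) / (2.0 * Z95)


def ratio_of_rrs(rr1, ci1, rr2, ci2):
    """Point estimate and 95% interval for RR2 / RR1 from two independent studies."""
    se = math.sqrt(se_from_ci(*ci1) ** 2 + se_from_ci(*ci2) ** 2)
    diff = math.log(rr2) - math.log(rr1)
    return math.exp(diff), math.exp(diff - Z95 * se), math.exp(diff + Z95 * se)


# Illustrative values: RR 1.0 (0.5 to 2.0) versus RR 1.5 (0.75 to 3.0).
ratio, lower, upper = ratio_of_rrs(1.0, (0.5, 2.0), 1.5, (0.75, 3.0))
print(f"ratio of RRs = {ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The interval around the ratio conveys how precisely the difference between the two studies is estimated, which the overlap of the two separate intervals does not.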
Confidence intervals are useful to convey a sense of the random variation in
the data, with a quantitative but informal interpretation of the precision of that
estimate. Values toward the center of the interval are more compatible with the
observed results than values toward the periphery, and the boundary point itself
is rather improbable and unworthy of special focus. The p-value function is a de-
scription of the random error that is present, and the bounds of the confidence
interval help to describe how peaked or flat the p-value function is around the
point estimate. In a large study, the p-value function will be highly peaked, with
the most probable values very close to the point estimate, whereas in a small
study, it will be flatter, with fairly probable values extending more widely above
and below the point estimate. The confidence interval, while ill-suited to mak-
ing fine distinctions near the extremes of the interval, is nonetheless very useful
for providing a rough indication of precision.
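The contrast between a peaked and a flat p-value function can be sketched numerically. The example below assumes a normal approximation on the log relative risk scale; the point estimate and the two standard errors are hypothetical, chosen only to represent a large and a small study.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def p_value(rr_hat: float, se_log_rr: float, rr_hypothesized: float) -> float:
    """Two-sided p-value for a hypothesized RR (normal approximation, log scale)."""
    z = abs(math.log(rr_hat) - math.log(rr_hypothesized)) / se_log_rr
    return math.erfc(z / math.sqrt(2.0))


rr_hat = 2.0
large_se, small_se = 0.10, 0.50  # hypothetical standard errors of log RR

# Evaluate the p-value function over a grid of hypothesized relative risks.
for rr0 in [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]:
    print(f"hypothesized RR {rr0:>4}: "
          f"large study p = {p_value(rr_hat, large_se, rr0):.3f}, "
          f"small study p = {p_value(rr_hat, small_se, rr0):.3f}")
```

In the large study the p-values fall away quickly on either side of 2.0, whereas in the small study values well above and below the point estimate remain fairly compatible with the data.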
The information provided by the confidence interval can be used to compare
the precision of different studies, again not focusing on the exact values but more
generally on the width of the interval. A simple measure, the confidence limit
ratio (Poole, 2001) has very attractive features for characterizing precision and
making comparisons across studies. Two studies with point estimates of a relative
risk of 2.0, one with a 95% confidence interval of 1.1 to 3.3, and the other
with a confidence interval of 0.9 to 4.3 (confidence limit ratios of 3.3/1.1 = 3.0
and 4.3/0.9 = 4.8) are different but not substantially different in their precision.
A study with a confidence interval of 0.4 to 9.6 (confidence limit ratio
9.6/0.4 = 24) is well worth distinguishing as less precise, perhaps so imprecise
as to render any inferences meaningless.
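The confidence limit ratio is simply the upper limit divided by the lower limit. A minimal sketch, using the three intervals from the paragraph above (the study labels are placeholders):

```python
def confidence_limit_ratio(lower: float, upper: float) -> float:
    """Upper confidence limit divided by lower confidence limit."""
    return upper / lower


intervals = {
    "first study": (1.1, 3.3),
    "second study": (0.9, 4.3),
    "third study": (0.4, 9.6),
}

for label, (lower, upper) in intervals.items():
    clr = confidence_limit_ratio(lower, upper)
    print(f"{label}: {lower} to {upper}, confidence limit ratio {clr:.1f}")
```

Because the ratio depends only on the width of the interval on a multiplicative scale, it can be compared across studies regardless of where their point estimates fall.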
Confidence intervals are helpful in comparing results from one study to results
from others, not formally testing them but assessing whether they are close to or
far from the point estimate, taking the imprecision of that point estimate and
overlap of intervals into account. Broadly characterizing the extent of overlap in
confidence interval coverage (not dichotomizing as overlapping versus non-over-
lapping) can be helpful in describing the similarity or differences in study find-
ings. For example, a study with a relative risk of 1.0 and a confidence interval
of 0.4 to 2.0 is fairly compatible with a study yielding a relative risk of 1.5 with
a confidence interval of 0.8 to 3.0. Two larger studies with narrower confidence
intervals but the same point estimates—for example, a relative risk of 1.0, with
a confidence interval of 0.8–1.2, and a relative risk of 1.5, with a confidence in-
terval of 1.1–2.2—would be interpreted as more distinctly different from one an-
other. Beyond the formal comparisons across studies, confidence intervals allow
for a rapid assessment of the magnitude of imprecision, especially to describe
whether it is at a level that calls into serious question the information value of
the study. Studies with cells of 2 or 3 observations, which tend to generate ex-
tremely imprecise estimates and confidence limit ratios of 10 or more, can quickly
be spotted and appropriately discounted. This informal interpretation of confidence
intervals, aided by calculation and examination of the confidence limit
ratio, is a valuable guide to imprecision.
estimates that are likely to be closer to the value that would have been obtained in
the absence of random error (Greenland & Robins, 1991). Instead of the usual sit-
uation in which the influence of random error is assumed to be symmetric (on the
appropriate scale) around the observed value, consideration of a distribution of re-
sults can suggest that certain values are deviantly high or low, likely to be due in
part to random processes. Therefore, the better estimate is not the original (naive)
estimate but one that is shrunk by some amount in the direction of the null value.
Such approaches go beyond quantifying random error and begin to compensate for
it to produce improved estimates, reflecting significant progress in refining epi-
demiologic data analysis. Efforts to incorporate other random elements such as
those that are part of misclassification are being developed, as well as more com-
prehensive approaches to the integration of random and systematic error.
As in other sciences, results from a single epidemiologic study are rarely if ever
sufficient to draw firm conclusions. No matter how much care is taken to avoid
biases and ensure adequate precision, idiosyncrasies inherent to any study render
its results fallible. If nothing else, the chosen study population may demonstrate
a different exposure–disease relation than other populations would show,
so that the bounds of inference would need to be tested. Furthermore, limitations
in methods ranging from subtle methodologic pitfalls to mundane clerical errors
or programming mistakes render conclusions even for the study population itself
subject to error. Therefore, rather than seeking a single, perfect study to provide
clear information on the phenomenon of interest, the full array of pertinent re-
sults from a series of imperfect studies needs to be considered in order to accu-
rately summarize the state of knowledge and draw conclusions.
Multiple studies provide an opportunity to evaluate patterns of results to draw
firmer conclusions. Not only can the hypothesis of a causal relation between ex-
posure and disease be examined using the full array of information, but hy-
potheses regarding study biases can be evaluated as well. The concept of repli-
cation reflects a narrow, incomplete subset of the issues that can be fruitfully
evaluated across a series of studies that address the same causal relationship. As
often applied, the search for replicability refers to a series of methodologically
similar studies which enables the reviewer to examine the role of random error
in accounting for different findings across those studies. Given that research de-
signs and study methods inevitably differ in epidemiology, however, the ques-
tion is not simply, “Are the studies consistent with one another?” but rather,
“What is the summary of evidence provided by this series of studies?” A series
of studies yielding inconsistent results may well provide strong support for a
causal inference when the methodologic features of those studies are scrutinized
and the subset of studies that support an association are methodologically
stronger, while those that fail to find an association are weaker. Similarly, con-
sistent evidence of an association may not support a causal relation if all the stud-
ies share the same bias that is likely to generate spurious indications of a posi-
tive association. In order to draw conclusions, the methods and results must be
considered in relation to one another, both within and across studies.
There has been a dramatic rise in interest and methodology for the formal, quan-
titative integration of evidence across studies, generally referred to as meta-
analysis (Petitti, 1994; Greenland, 1987, 1998). In the biomedical literature, much
of the motivation comes from a desire to integrate evidence across a series of
small clinical trials. The perceived problem that these tools were intended to ad-
dress is the inability of individual trials to have sufficient statistical power to de-
tect small benefits, whereas if the evidence could be integrated across studies,
statistical power would be enhanced. If subjected to formal tests of statistical sig-
nificance, which is the norm in assessing the outcome of a clinical trial, many
individual trials are too small to detect clinically important benefits as statisti-
cally significant. When such non-significant tendencies are observed across re-
peated studies, there is an interest in assessing what the evidence says when ag-
gregated. Note that the intended benefits were focused on reducing random error
through aggregation of results, implicitly or explicitly assuming that the indi-
vidual studies are otherwise compatible with regard to methods and freedom from
other potential study biases.
The value of this effort to synthesize rather than merely describe the array of
results presumes an emphasis on statistical hypothesis testing. A rigid interpre-
tation of statistical testing can and does lead to situations in which a series of
small studies, all pointing in the same direction, for example, a small benefit of
treatment, would lead to the conclusion that each of the studies found no effect
(based on significance testing). If the evidence from that same series of studies
were combined, and summarized with a pooled estimate of effect, evidence of a
statistically significant benefit would generate a very different conclusion than
the studies taken one at a time. Obviously, if a series of small studies shows sim-
ilar benefit, those who are less bound by adherence to statistical testing may well
infer that the treatment appears to confer a benefit without the need to assess the
statistical significance of the array of results. Those who wish to compare the ar-
ray of results to a critical p-value, however, are able to do so. In fact, as dis-
cussed below in the section on “Interpreting Consistency and Inconsistency,” the
consistency across studies with at least slightly different methods and the po-
tential for different biases might actually provide greater confidence of a true
benefit. Identically designed and conducted studies may share identical biases
and show similar effects across the studies due to those shared errors.
As discussed in Chapter 10, in well-designed and well-executed randomized
trials, the focus on random error as the primary source of erroneous inferences
may be justified. That is, if the principles of masked, objective assessment of
outcome are followed, and an effective randomization procedure is employed
to ensure that baseline risk does not differ across exposure groups, the major
threat to generating valid results is a failure of the random allocation mechanism
to yield groups that are comparable at baseline. Generating a p-value addresses the
probability that the random allocation mechanism has generated an aberrant sam-
ple under the assumption that there is no true difference between the groups.
Thus, repetition of the experiment under identical conditions can be used to ad-
dress and reduce the possibility that there is no benefit of treatment but the al-
location of exposure by groups has, by chance, generated such a pattern of re-
sults. A series of small, identical randomized trials will yield a distribution of
results, and the integration of results across those trials would provide the best
estimate of the true effect. In the series of small studies, the randomization itself
may not be effective, although the deviation in results from such randomization
should be symmetrical around the true value. Integration of information across
the studies should help to identify the true value around which the findings from
individual studies cluster.
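The gain in precision from integrating a series of small, comparable trials can be sketched with a fixed-effect, inverse-variance weighted average on the log scale. This is one common pooling approach, used here purely as an illustration; the five trial results are hypothetical, and the sketch assumes that random error is the only source of variation among them.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def pool_fixed_effect(log_rrs, ses):
    """Inverse-variance weighted mean of log relative risks and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, log_rrs)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def ci(log_rr, se):
    """95% confidence interval on the relative risk scale."""
    return math.exp(log_rr - Z95 * se), math.exp(log_rr + Z95 * se)


# Hypothetical small trials: relative risk and standard error of log RR.
rrs = [0.75, 0.85, 0.80, 0.78, 0.82]
ses = [0.20, 0.22, 0.18, 0.25, 0.21]
log_rrs = [math.log(r) for r in rrs]

for rr, b, se in zip(rrs, log_rrs, ses):
    lo, hi = ci(b, se)
    print(f"trial RR {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")

pooled, pooled_se = pool_fixed_effect(log_rrs, ses)
pooled_lo, pooled_hi = ci(pooled, pooled_se)
print(f"pooled RR {math.exp(pooled):.2f}, 95% CI {pooled_lo:.2f} to {pooled_hi:.2f}")
```

Each hypothetical trial has an interval that includes the null value, yet the pooled interval does not, which is the situation that motivated meta-analysis of small clinical trials.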
The randomized trial paradigm and assumptions have been articulated because
the direct application of this reasoning to observational studies is often prob-
lematic, sometimes severely so. Just as the framework of statistical hypothesis
testing has limited applicability to a single epidemiologic study, the framework
of synthetic meta-analysis has limited applicability to a set of observational
studies.
Observational studies are rarely if ever true replications of one another. The
populations in which the studies are conducted differ, and thus the presence of
potential effect-modifiers differs as well. The tools of measurement are rarely
identical, even for relatively simple constructs such as assessment of tobacco use
or occupation. Exact methods of selecting and recruiting subjects differ, and the
extent and pattern of nonparticipation varies. Susceptibility to confounding will
differ whenever the underlying mechanism of exposure assignment differs. Thus,
the opportunity to simply integrate results across a series of methodologically
identical studies does not exist in observational epidemiology. Glossing over these
differing features of study design and conduct, and pretending that only random
error accounts for variability among studies is more likely to generate mislead-
ing than helpful inferences.
Closely related to this concern is the central role assigned to statistical power
and random error in the interpretation of study results. The fundamental goal of
integrating results is to draw more valid conclusions by taking advantage of the
evidence from having several studies of a given topic rather than a single large
study. While enhanced precision from the larger number of subjects accrued in
multiple studies is an asset, the more valuable source of insight is often the op-
portunity to understand the influence of design features on study results. This
can only be achieved by having multiple studies of differing character and scru-
tinizing the pattern that emerges, not suppressing it through a single synthetic es-
timate. Imagine two situations, one with a single study of 5000 cases of disease
in a cohort of 1,000,000 persons, and the other a series of 10 studies with 500
cases each from cohorts of 100,000 persons. The single, extremely precise study
would offer limited opportunity to learn from the methodologic choices that were
made since a single protocol would have been followed. Differing approaches to
measurement of exposure and disease, control of confounding, and modification
of the estimated effect by covariates would be limited because of the lack of di-
versity in study methods. In contrast, the variation in methodologic decisions
among the 10 studies would provide an opportunity to assess the pattern of re-
sults in relation to methods. With variability in attributes across studies (viewed
as a limitation or barrier to deriving a single estimate), one can gain an under-
standing of how those study features influence the results (an advantage in eval-
uating hypotheses of bias and causality).
Study†      Magnetic field category, lowest to highest   Total  No measure*

Cases
Coghill          48      5      2      0      1      0      56            0
Dockerty         72      9      3      1      1      1      87           34
Feychting        30      1      1      2      0      4      38            0
Linet           403    152     41     20     13      9     638           46
London          110     30      5      9      4      4     162           68
McBride         174     77     32     11      1      2     297          102
Michaelis       150     17      3      3      3      0     176            0
Olsen           829      1      0      0      0      3     833            0
Savitz           24      7      2      3      0      0      36           62
Tomenius        129     16      5      0      0      3     153            0
Tynes           146      2      0      0      0      0     148            0
Verkasalo        30      1      0      0      1      0      32            3

Controls
Coghill          47      9      0      0      0      0      56            0
Dockerty         68     13      1      0      0      0      82           39
Feychting       488     26     18     10      2     10     554            0
Linet           407    144     41     17      5      6     620           69
London           99     28      6      2      2      6     143           89
McBride         194     96     28      5      3      3     329           70
Michaelis       372     29      7      4      0      2     414            0
Olsen          1658      3      2      2      0      1    1666            0
Savitz          155     28     10      3      2      0     198           67
Tomenius        546    119     24      4      2      3     698           21
Tynes          1941     25      7      5      4     22    2004            0
Verkasalo       300      9      6      4      0      1     320           30
*No measure for a residence at or before time of diagnosis (cases) or corresponding index date (for controls).
†See Greenland et al. (2000) for citations to original reports.
Greenland et al., 2000.
[Figure: floated case–control ratio (vertical axis, 0.3 to 1.1) plotted against magnetic field in microteslas (horizontal axis, 0.05 to 0.95).]
ual study could address, were the most supportive of a positive relation between
magnetic fields and leukemia. Note that the myriad limitations in the design of
individual studies, including potential response bias, exposure misclassification,
and confounding were not resolved through data pooling, but the sparseness of
the data in a range of the dose–response curve in which random error was a pro-
found limitation was overcome to some extent.
A technique that is often better suited to maximizing the information from a se-
ries of broadly comparable studies, which also requires access to the raw data or at
least to highly cooperative collaborators willing to undertake additional analyses, is
comparative analysis. In this approach, rather than integrating all the data into the
same analysis, parallel analyses are conducted using the most comparable statisti-
cal techniques possible. That is, instead of the usual situation in which different el-
igibility criteria are employed, different exposure categories are created, different
potential confounders are controlled, etc., the investigative team imposes identical
decision rules on all the studies that are to be included. Applying identical decision
rules and analytic methods removes those factors as candidate explanations for in-
consistent results and sharpens the focus on factors that remain different across stud-
ies such as the range of exposure observed or selective non-response.
The basic design and the data collection methods are not amenable to modi-
fication at the point of conducting comparative analysis, of course, except for
those that can be changed by imposing restrictions on the available data (e.g.,
more stringent eligibility criteria). The series of decisions that proceed from the
raw data to the final results are under the control of the investigators conducting
the comparative analysis. Methodologically, the extent to which the choices made
account for differences in the results can be evaluated empirically. When multi-
ple studies address the same issue in a similar manner and yield incompatible re-
sults, the possibility of artifactual differences resulting from the details of ana-
lytic methods needs to be entertained. Comparative analysis addresses this
hypothesized basis for differences in results directly, and either pinpoints the
source of disparity or demonstrates that such methodologic decisions were not
responsible for disparate findings. The opportunity to align the studies on a com-
mon scale of exposure and response yields an improved understanding of the ev-
idence generated by the series of studies.
Lubin and Boice (1997) conducted such an analysis to summarize the evidence
on residential radon and lung cancer. A series of eight pertinent case–control
studies had been conducted, but both the range of radon exposure evaluated and
the analytic approaches differed, in some cases, substantially, across studies. They
found that the radon dose range being addressed across the series of studies dif-
fered markedly (Fig. 11.2), and the results could be reconciled, in part, by more
formally taking the dose range into account. Each study had focused on the in-
ternal comparison of higher and lower exposure groups within their study set-
tings, yet the absolute radon exposure levels that the studies addressed were quite
distinctive, with the highest dose group ranging from approximately 150 Bq/m3
to approximately 450 Bq/m3. If in fact the studies in a lower dose range found
no increase in risk with higher exposure, and those in the higher dose range did
find such a pattern, it would be difficult to declare the studies inconsistent in
any meaningful sense. Their reported conclusions would differ, but the results
would be consistent with one another once placed on a common dose scale.
In this instance, the results were largely compatible when put on a common
scale, with a combined relative risk estimate for a dose of 150 Bq/m3 of 1.14
(Table 11.2). Even the combination of evidence across residential and occupa-
tional exposures from miner studies, where the doses are typically much higher,
showed compatible findings when put on a common dose scale. Few exposures
permit the quantification of dose to reconcile study findings, but this example
clearly illustrates the need to go beyond the notion of consistent or inconsistent
in gleaning the maximum information from a set of studies.
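The idea of placing studies on a common dose scale can be sketched numerically. The example below assumes a log-linear exposure–response model, RR(x) = exp(beta * x); that model, the function name, and the three study values are assumptions made here for illustration and are not taken from Lubin and Boice (1997).

```python
import math

def rr_at_reference(rr_observed, dose_observed, dose_reference=150.0):
    """Rescale a relative risk to a reference dose, assuming RR(x) = exp(beta * x)."""
    beta = math.log(rr_observed) / dose_observed  # log-RR per Bq/m3 (assumed model)
    return math.exp(beta * dose_reference)

# Hypothetical studies: (label, RR in the highest dose group, dose of that group in Bq/m3)
studies = [("Study A", 1.14, 150.0), ("Study B", 1.30, 300.0), ("Study C", 1.48, 450.0)]

for label, rr, dose in studies:
    print(f"{label}: reported RR {rr:.2f} at {dose:.0f} Bq/m3, "
          f"rescaled RR at 150 Bq/m3 = {rr_at_reference(rr, dose):.2f}")
```

Under this assumed model, studies whose reported relative risks look quite different can imply nearly the same relative risk once the dose range each one addressed is taken into account.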
Instead of data pooling, which is logistically difficult due to the need to obtain
raw data and conduct new analyses, epidemiologists have increasingly turned to
[Figure 11.2 panels: Winnipeg, Stockholm, Sweden, Missouri, New Jersey. Horizontal axis: radon concentration (Bq/m3). Vertical axis: relative risk.]
FIGURE 11.2. Relative risks (RRs) for radon concentration categories and fitted exposure–response models for each case–control study. Fitted lines are adjusted to pass through the quantitative [...]
TABLE 11.2. Estimates of the Relative Risk at 150 Bq/m3 and the 95% Confidence Interval for Each Study and for All Studies Combined, Meta-Analysis of Epidemiologic Studies of Residential Radon and Lung Cancer
Columns: STUDY; RR*; 95% CI; REPORTED IN ORIGINAL PAPER†
is markedly more precise than any of the original studies taken in isolation. Cyn-
ics suggest that the main goal of meta-analysis is to take a series of studies that
demonstrate an effect that is not statistically significant and combine the studies to
derive a summary estimate that is statistically significant, or worse yet, to take a
series of imprecise and invalid results and generate a highly precise invalid result.
One of the challenges in conducting such meta-analyses is to ensure that the
studies that are included are sufficiently compatible methodologically to make
the exercise of synthesizing a common estimate an informative process. The al-
gebraic technology will appear to work even when the studies being combined
are fundamentally incompatible with respect to the methods used to generate the
data. In practice, studies always differ from one another in potentially important
features such as study locale, response proportion, and control of confounding,
so that the decision to derive a summary estimate should be viewed at best as an
exercise, in the same spirit as a sensitivity analysis. The question that is
addressed by a meta-analysis is as follows: If none of the differing features of study
methods affected the results, what would be the best estimate of effect from this
set of studies? The value and credibility of that answer depend largely on the
credibility of the premises.
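The arithmetic behind a summary estimate can be illustrated with a minimal sketch of inverse-variance (fixed-effect) pooling, one common way such an estimate is derived. The function name and the three study results are hypothetical, and the standard errors are recovered from the 95% confidence limits on the assumption that the intervals are symmetric on the log scale.

```python
import math

def pooled_rr(estimates):
    """Inverse-variance (fixed-effect) summary of (rr, ci_low, ci_high) tuples."""
    weighted_sum = total_weight = 0.0
    for rr, low, high in estimates:
        se = (math.log(high) - math.log(low)) / (2 * 1.96)  # SE of log RR from the 95% CI
        weight = 1.0 / se ** 2
        weighted_sum += weight * math.log(rr)
        total_weight += weight
    log_rr = weighted_sum / total_weight
    se_pooled = math.sqrt(1.0 / total_weight)
    return (math.exp(log_rr),
            math.exp(log_rr - 1.96 * se_pooled),
            math.exp(log_rr + 1.96 * se_pooled))

# Hypothetical studies, each with a 95% CI that includes the null value of 1.0
studies = [(1.3, 0.9, 1.9), (1.2, 0.8, 1.8), (1.4, 0.9, 2.2)]
summary_rr, summary_low, summary_high = pooled_rr(studies)
print(f"Summary RR = {summary_rr:.2f} ({summary_low:.2f}, {summary_high:.2f})")
```

With these made-up inputs, the summary interval is narrower than any single study's interval and excludes the null even though each component interval includes it. Nothing in the calculation asks whether the studies are methodologically compatible, which is why the algebra appears to work regardless.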
An alternative to synthetic meta-analysis, which focuses on the derivation
of a single pooled estimate, is to apply the statistical tools of meta-analysis to
examine and better understand the sources of heterogeneity across the compo-
nent studies. By focusing on the variability in results as the object of study, we
can identify and quantify the influences of study methods and potential biases
on study findings, rather than assume that such methodologic features are
unimportant. The variability in study features, which is viewed as a nuisance when
seeking a summary estimate, is the raw material for exploratory meta-analysis
(Greenland, 1998).
In exploratory meta-analysis, the structural features of the study, such as lo-
cation, time period, population source, and the measures of study conduct such
as response proportion, masking of interviewers, and amount of missing infor-
mation, are treated as potential determinants (independent variables) of the study
results. Through exploratory meta-analysis, the manner in which study methods
influence study results can be quantified, perhaps the most important goal in eval-
uating a body of epidemiologic literature. In parallel with the approach to ex-
amining methods and results within a single study, the focus of previous chap-
ters, the same rationale applies to the examination of methods and results across
studies. Just as the insights from such analyses of potential biases within a study
help to assess the credibility of its findings, the pattern of results across a series
of studies helps to more fully understand the constellation of findings and its
meaning.
Sometimes, systematic examination of the pattern of results across studies
yields a clear pattern in which methodologic quality is predictive of the results.
That is, studies that are better on average tend to show stronger (or weaker) mea-
sures of association, suggesting where the truth may lie among existing results
or what might be expected by extrapolating to studies that are even better than
the studies conducted thus far. For example, if higher response proportions were
independently predictive of stronger associations, one would infer, all other things
being equal, that a stronger association would be expected if non-response could
be eliminated altogether. The studies with the higher response proportion are pre-
sumably yielding more valid results, all other things equal, and thus the obser-
vation that these studies yield stronger associations supports an association be-
ing truly present and stronger in magnitude than was observed even in the study
with the best response thus far. The opposite pattern, higher response proportions
predicting weaker association, would suggest that no association or only a weak
one is present. Heterogeneity of results across studies is being explained in a
manner that indicates both which results among completed studies are more likely
to be valid and the basis for projecting what would be found if the methodologic
limitation could be circumvented altogether. In meta-regression, such indepen-
dent effects of predictors can be examined with adjustment for other features of
the study that might be correlated with response proportions. With a sufficient
number of observations, multiple influences on study findings can be isolated
from one another.
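The response-proportion example can be sketched as a one-predictor meta-regression: a weighted least-squares fit of the log relative risk on response proportion, with each study weighted by the inverse of its variance. All study values below are hypothetical, and a single predictor is used only to keep the sketch short; an actual meta-regression would typically adjust for other study features.

```python
import math

def weighted_line(x, y, w):
    """Weighted least-squares intercept and slope of y on x."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical studies: response proportion, relative risk, SE of log RR
response = [0.55, 0.65, 0.75, 0.85]
rr = [1.1, 1.3, 1.5, 1.8]
se = [0.20, 0.15, 0.20, 0.25]

log_rr = [math.log(v) for v in rr]
weights = [1.0 / s ** 2 for s in se]
intercept, slope = weighted_line(response, log_rr, weights)

# Extrapolate to a study with complete response (response proportion = 1.0)
projected_rr = math.exp(intercept + slope * 1.0)
print(f"slope of log RR per unit response proportion = {slope:.2f}")
print(f"projected RR at 100% response = {projected_rr:.2f}")
```

In this made-up series, higher response proportions predict stronger associations, so the projection for a study free of non-response lies above the largest relative risk observed so far.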
Interpretation of the patterns revealed by exploratory meta-analysis is not al-
ways straightforward, of course, just as the corresponding relation between meth-
ods and results is not simple within individual studies. For example, one might
observe that studies conducted in Europe tend to yield different results (stronger
or weaker associations) than those conducted in North America. Neither group
is necessarily more valid, but this pattern would encourage closer scrutiny of is-
sues such as the methods of diagnosis, available tools for selecting controls in
case–control studies, cultural attitudes toward the exposure of interest, or even
the very nature of the exposure, which may well differ by geographic region.
Sometimes there are time trends in which results of studies differ systematically
as a function of the calendar period of study conduct, again subject to a variety
of possible explanations. Even when features of the studies that do not corre-
spond directly to indices of quality are predictive of results, much progress has
been made beyond simply noting that the studies are inconsistent. The product
of examining these attributes is refinement of the hypotheses that might explain
inconsistent results in the literature.
The requirements for the application of exploratory meta-analysis are sub-
stantial, and often not met for topics of interest. The key feature is having a suf-
ficient number of studies to conduct regression analyses that can examine and
isolate multiple determinants of interest. The number of available studies deter-
mines in part the feasibility of conducting meta-regression with multiple predic-
tors, just as the number of individual subjects does so in regression analyses of
Among the most complex issues in the interpretation of a body of scientific lit-
erature is the meaning of consistency and inconsistency in results across studies.
For those unwilling or unable to grapple with the details, a sweeping
pronouncement of “consistent” implies that the phenomenon is understood and all the
studies are pointing in the same (correct) direction. On the other hand, inconsis-
tency among studies may be interpreted as lack of support for the hypothesis, an
indication that the truth is unknown, or evidence that all the studies are subject
to error. The reality in either case is more complex. Examined in detail, a series
of studies addressing the same question will always show some inconsistencies,
regardless of the truth and the quality of the individual studies.
The search for consistency may derive from the expectations of laboratory ex-
periments in which replication is expected to yield identical results, subject only
to random error, if the phenomenon is operating as hypothesized. When differ-
ent researchers in different laboratories apply the same experimental conditions,
and they observe similar results, it suggests that the original finding is valid. The
ability to consistently generate a predicted result across settings strongly supports
the hypothesized phenomenon. In contrast, inconsistency across laboratories, for
example, or across technicians within a laboratory suggests some error has been
made in the experiment or that some essential element of the phenomenon has
not yet been identified. If the originally reported phenomenon cannot be
replicated despite multiple attempts to do so, then the original study is appro-
priately assumed to be in error. Certainly, epidemiologists seek confirmation, but
pure replication is never feasible in observational epidemiology in that the con-
ditions (populations, study methods) inevitably differ across studies.
Inconsistent Findings
As commonly applied, the criticism that studies are inconsistent has several im-
plications, all of them interpreted as suggesting that the hypothesized association
is not present: (1) No association is present, but random error or unmeasured bi-
ases have generated the appearance of an association in some but not all studies.
(2) No conclusions whatsoever can be drawn from the set of studies regarding
the presence or absence of an association. (3) The literature is methodologically
weak and pervasive methodologic problems are the source of the disparate study
findings. An equally tenable explanation is that the studies vary in quality and
that the strongest of the studies correctly identify the presence of an association
and the methodologically weaker ones do not, or vice versa. Unfortunately, the
observation of inconsistent results per se, without information on the character-
istics of the studies that generated the results and the nature of the inconsistency,
conveys very little information about the quality of the literature, whether infer-
ences are warranted, and what those inferences should be. Inconsistencies across
studies can arise for so many reasons that without further scrutiny the observa-
tion has little meaning.
Random error alone inevitably produces inconsistency in the exact measures
of effect across studies. If the overall association is strong, then such deviations
may not detract from the overall appearance of consistency. For example, if a
series of studies of tobacco use and lung cancer generate risk ratios of 7.0, 8.2,
and 10.0, we may legitimately interpret the results as consistent. In contrast, in
a range of associations much closer to the null value, or truly null associations,
fluctuation of equal magnitude might well convey the impression of inconsis-
tency. Risk ratios of 0.8, 1.1, and 1.5 could well be viewed as inconsistent, with
one positive and two negative studies, yet the studies may be estimating the same
parameter, differing only due to random error. When the precision of one or more
of the studies is limited, the potential for random error to create the impression
of inconsistency is enhanced. While the pursuit of substantive explanations for
inconsistent findings is worth undertaking, the less intellectually satisfying but
often plausible explanation of random error should also be seriously entertained.
Results that fluctuate within a relatively modest range do not suggest that the
studies are flawed, but rather may simply suggest that the true measure of the
association is somewhere toward the middle of the observed range and the scat-
ter reflects random error. Conversely, substantial variability in findings across
studies should not immediately be assumed to result from random error, but ran-
dom error should be included among the candidate contributors, particularly when
confidence intervals are wide.
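Whether scatter of this kind exceeds what random error alone would produce can be checked with a heterogeneity statistic. The sketch below applies Cochran's Q to the risk ratios of 0.8, 1.1, and 1.5 mentioned above; the standard errors of 0.3 on the log scale are an assumption introduced here, since the text gives no precision for these illustrative values.

```python
import math

def cochran_q(risk_ratios, standard_errors):
    """Return Cochran's Q and the inverse-variance pooled risk ratio."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    logs = [math.log(rr) for rr in risk_ratios]
    pooled_log = sum(w * y for w, y in zip(weights, logs)) / sum(weights)
    q = sum(w * (y - pooled_log) ** 2 for w, y in zip(weights, logs))
    return q, math.exp(pooled_log)

CHI2_CRITICAL_DF2 = 5.99  # 95th percentile of chi-square with 2 degrees of freedom

q, pooled = cochran_q([0.8, 1.1, 1.5], [0.3, 0.3, 0.3])  # SEs are assumed values
print(f"Q = {q:.2f} on 2 df, pooled RR = {pooled:.2f}")
print("compatible with random error alone" if q < CHI2_CRITICAL_DF2
      else "more scatter than random error alone would suggest")
```

With the assumed precision, Q falls well below the critical value, so the three estimates are compatible with a single underlying parameter. Much smaller standard errors would lead to the opposite reading of the same three numbers.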
Those who compile study results will sometimes tally the proportion of the
studies that generate positive or negative associations, or count the number of
studies that produce statistically significant associations. While there are ways to
infer whether the count of studies deviates from the expectation under the null
(Poole, 1997), it is far preferable to examine the actual measures of effect and
associated confidence intervals. To count the proportion of studies with relative
risks above or below the null sacrifices all information on the magnitude of ef-
fect and variation among the studies generating positive and inverse associations.
A focus on how many were statistically significant hopelessly confounds mag-
nitude of effect with precision. A series of studies with identical findings, for ex-
ample, all yielding risk ratios of 1.5, could well yield inconsistent findings with
regard to statistical significance due to varying study size alone. Variability in
study size is one easily understood basis for inconsistency due to its effect on
precision. As suggested in Chapter 10, statistical significance is of little value in
interpreting the results of individual studies, and the problems with using it are
compounded if applied to evaluating the consistency of a series of studies.
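The point about identical risk ratios can be made concrete. The sketch below assumes hypothetical cohort studies with equal numbers of exposed and unexposed subjects, a risk of 0.10 among the unexposed, and a risk ratio of exactly 1.5 in every study; only the study size varies. Expected counts are used in place of observed counts, and the function name and all values are illustrative.

```python
import math

def rr_confidence_interval(n_per_group, risk_unexposed=0.10, rr=1.5):
    """95% CI for a risk ratio from expected counts in a two-group cohort."""
    exposed_cases = n_per_group * risk_unexposed * rr
    unexposed_cases = n_per_group * risk_unexposed
    # Standard error of log RR for cumulative incidence data
    se = math.sqrt(1 / exposed_cases - 1 / n_per_group
                   + 1 / unexposed_cases - 1 / n_per_group)
    return (math.exp(math.log(rr) - 1.96 * se),
            math.exp(math.log(rr) + 1.96 * se))

for n in (100, 400, 2000):
    low, high = rr_confidence_interval(n)
    verdict = "statistically significant" if low > 1.0 else "not statistically significant"
    print(f"n = {n} per group: RR = 1.50 ({low:.2f}, {high:.2f}), {verdict}")
```

Under these assumptions the interval includes the null at 100 subjects per group and excludes it at 400 and 2,000 per group, so a tally of significant results would call these identical findings inconsistent.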
Another mechanism by which a series of methodologically sound studies could
yield inconsistent results is if the response to the agent in question truly differs
across populations, i.e., there is effect measure modification. For example, in a
series of studies of alcohol and breast cancer, one might find positive associa-
tions among premenopausal but not postmenopausal women, with both sets of
findings consistent and valid. Some studies may include all or a preponderance
of postmenopausal women and others predominantly premenopausal women. If
the effect of alcohol varies by menopausal status, then the summary findings of
those studies will differ as well. Whereas the understanding of breast cancer has
evolved to the point that there is recognition of the potential for distinctive risk
factors among premenopausal and postmenopausal women, for many other dis-
eases the distinctiveness of risk factors in subgroups of the population is far less
clear. Where sources of true heterogeneity are present, and the studies vary in
the proportions of participants in those heterogeneous groups, the results will
inevitably be inconsistent. All studies, however, may well be accurate in describing
an effect that occurs only or to a greater extent in one subpopulation.
This differing pattern of impact across populations is one illustration of effect
modification. In the above example, it is based on menopausal status. Analogous
heterogeneity of results might occur as a function of baseline risk. For example,
in studies of alcohol and breast cancer, Asian-American women, who generally
have lower risk, might have a different vulnerability to the effects of alcohol
compared to European-American women, who generally have higher risk. The
prevalence of concomitant risk factors might modify the effect of the exposure of
interest. If the frequency of delayed childbearing, which confers an increased risk
of breast cancer, differed across study populations and modified the effect of al-
cohol, the results would be heterogeneous across populations that differed in their
childbearing practices.
Where strong interaction is present, the potential for substantial heterogeneity
in study results is enhanced. For example, in studies examining the effect of al-
cohol intake on oral cancers, the prevalence of tobacco use in the population will
markedly influence the effect of alcohol. Because of the strong interaction be-
tween alcohol and tobacco in the etiology of oral cancer, the effect of alcohol
intake will be stronger where tobacco use is greatest. If there were complete in-
teraction, in which alcohol was influential only in the presence of tobacco use,
alcohol would have no effect in a tobacco-free population, and a very strong ef-
fect in a population consisting of all smokers. Even with less extreme interac-
tion and less extreme differences in the prevalence of tobacco use, there will be
some degree of inconsistency across studies in the observed effects of alcohol
use on oral cancer. If we were aware of this interaction, of course, we would ex-
amine the effects of alcohol within strata of tobacco use and determine whether
there is consistency within those homogeneous risk strata. On the other hand, if
unaware of the interaction and differing prevalence of tobacco use, we would
simply observe a series of inconsistent findings.
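The dependence of the observed alcohol effect on tobacco prevalence can be worked through for the extreme case of complete interaction. The sketch below assumes that alcohol raises risk only among tobacco users, that alcohol and tobacco use are independent in the population (so tobacco does not confound the alcohol comparison), and that the baseline risk and the two risk ratios take the hypothetical values shown.

```python
def alcohol_risk_ratio(tobacco_prevalence, base_risk=0.001,
                       rr_tobacco=3.0, rr_alcohol_given_tobacco=4.0):
    """Crude RR for alcohol when alcohol acts only in the presence of tobacco."""
    p = tobacco_prevalence
    # Alcohol nonusers: mixture of tobacco users and nonusers
    risk_without_alcohol = p * base_risk * rr_tobacco + (1 - p) * base_risk
    # Alcohol users: extra risk only in the tobacco-using fraction
    risk_with_alcohol = (p * base_risk * rr_tobacco * rr_alcohol_given_tobacco
                         + (1 - p) * base_risk)
    return risk_with_alcohol / risk_without_alcohol

for prevalence in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"tobacco prevalence {prevalence:.0%}: "
          f"observed RR for alcohol = {alcohol_risk_ratio(prevalence):.2f}")
```

Every value produced is a correct description of the effect of alcohol in a population with that prevalence of tobacco use, yet a reader unaware of the interaction would see only a series of discrepant risk ratios ranging from the null to the full stratum-specific effect.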
There is growing interest in genetic markers of susceptibility, particularly in
studies of cancer and other chronic diseases (Perera & Santella, 1993; Tockman
et al., 1993; Khoury, 1998). These markers reflect differences among individu-
als in the manner in which they metabolize exogenous exposures, and should
help to explain why some individuals and not others respond to exposure to the
same agent. If the proportion that is genetically susceptible varies across popu-
lations, then the measured and actual effect of the exogenous agent will vary as
well. These molecular markers of susceptibility are not conceptually different
from markers like menopausal status, ethnicity, or tobacco use, although the
measurement technology differs. All provide explanations for why a specific
agent may have real but inconsistent effects across populations.
Until this point, we have considered only inconsistent results among a set of
perfectly designed and conducted studies that differ from one another solely due
to random error or true differences in the effect. Introducing methodological lim-
itations and biases offers an additional set of potential explanations for incon-
sistent results. By definition, biases introduce error in the measure of effect.
Among an array of studies of a particular topic, if the extent and mix of biases
varies across studies, results will vary as well. That is, if some studies are free
of a particular form of bias and other studies are plagued to a substantial degree
by that bias, then results will be inconsistent across those sets of studies. Sus-
ceptibility to bias needs to be examined on a study-by-study basis, and consid-
ered among the candidate explanations for inconsistent results. In particular, if
there is a pattern in which the findings from studies that are most susceptible to
a potentially important bias differ from those of studies that are least suscepti-
ble, then the results will be inconsistent but highly informative. The studies that
are least susceptible to the bias would provide a more accurate measure of the
association.
In order to make an assessment of the role of bias in generating inconsistent
results, the study methods must be carefully scrutinized, putting results aside.
Depending on preconceptions about the true effect, there may be a temptation to
view those studies that generate positive or null results as methodologically su-
perior because they yielded the right answer. In fact, biases can distort results in
either direction, so that unless truth is known in advance, the results themselves
give little insight regarding the potential for bias in the study. Knowing that a
set of studies contains mixed positive and null findings tells us nothing about
which of them is more likely to be correct or whether all are valid or all are in
error. In particular, there is no logical reason to conclude from such an array of
results that the null findings are most likely to be correct, by default—mixed
findings do not provide evidence to support the hypothesis of no effect. The de-
mand on the interpreter of such evidence is to assess which are the stronger and
weaker studies and examine the patterns of results in relation to those method-
ologic attributes.
Consistent Findings
There are basically two ways to generate a series of consistent findings: they may
be consistently right or consistently wrong. When an array of studies generates
consistent findings, a reasonable inference might be that despite an array of po-
tential biases in the individual studies, the problems are not so severe as to pre-
vent the data from pointing in the direction of the truth. Hypothesized biases
within an individual study cannot be confirmed or refuted, but it may be possi-
ble to define a gradation of susceptibility to such biases across a series of stud-
ies. If a series of studies with differing strengths and limitations, and thus vary-
ing vulnerability to bias, all generate broadly comparable measures of association,
one might infer that the studies are all of sufficient quality to have accurately ap-
proximated the association of interest.
Unfortunately, it is also possible for a series of studies to generate consistently
incorrect findings. There are often similarities across studies in the design or
methods of conduct that could yield similarly erroneous results. For example, in
studies of a stigmatized behavior, such as cocaine use, in relation to pregnancy
outcome, there may be such severe underreporting as to yield null results across
a series of studies. On the other hand, cocaine use is strongly associated with
other adverse behaviors and circumstances that could confound the results, in-
cluding tobacco and alcohol use and sexually transmitted infection. These ten-
dencies may well hold across a wide range of populations. Thus, the observation
errors are uncovered and corrected, surely there are times that they escape
detection. These inadvertent sources of erroneous results are highly unlikely to
occur across multiple studies; a biased or dishonest data collector can
distort results from a single study, but it is not plausible that a series of studies on the
same topic would all suffer from such misfortune. The problems will increase
the dispersion of study findings, creating the potential for unexplained inconsis-
tencies.
Sometimes, under a causal hypothesis, some features of the study setting and
methods should influence the results, and if that does not occur, the consistent
evidence may argue against a causal association. One of the key ways in which
this can occur is as a function of dose. Where markedly different doses evaluated
across studies yield similar measures of association, the consistency may run
counter to a causal explanation. For example, oral contraceptives have changed
markedly in estrogen content over the years, with notably lower doses at present
than in the past. If we observed identical associations between oral
contraceptive use and thromboembolism for those past high doses and the pres-
ent low doses, as a form of consistency, we would be advised to call into ques-
tion whether we have accurately captured an effect of the oral contraceptives or
some selective factor associated with oral contraceptive users. That is, consis-
tency despite critical differences may suggest that a shared bias accounts for the
results.
When trying to evaluate the role of a specific potential bias, an array of stud-
ies offers the opportunity to examine the relationship between vulnerability to
those biases and the pattern of results. A single study always has the potential
for spurious findings due to its methods, both design and conduct. While close
scrutiny of that study is helpful in assessing the likelihood, direction, and mag-
nitude of potential bias, there are real limits in the strength of inference that can
be made from a single result in evaluating hypotheses of bias or causality. The
vulnerability to bias may be assessed in qualitative terms in an individual study,
but not with any anchor of certainty. In contrast, a series of studies that clearly
differ in their susceptibility to bias, ranging from highly vulnerable to virtually
immune, offers an opportunity to examine whether the gradient of susceptibility
corresponds to a gradient of results. If there is a clear pattern across studies, for
example, in which higher quality of exposure assessment corresponds to stronger
associations, then it may be inferred that a true association is present that is
diluted, to varying degrees, by exposure misclassification. On the other hand, if
the series of studies with differing quality of exposure assessment were invari-
ant in their results, a causal association would seem less likely to be present and
explanations of some consistent bias should be entertained.
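The expected gradient can be sketched for nondifferential misclassification of a dichotomous exposure in a cohort study. The function name, exposure prevalence, baseline risk, and the sensitivity and specificity values standing in for exposure assessment quality are all hypothetical.

```python
def observed_rr(true_rr, sensitivity, specificity, prevalence=0.3, base_risk=0.05):
    """Expected RR after nondifferential misclassification of a binary exposure."""
    risk_unexposed, risk_exposed = base_risk, base_risk * true_rr
    # Proportions of the cohort by true status and classified status
    exposed_called_exposed = prevalence * sensitivity
    unexposed_called_exposed = (1 - prevalence) * (1 - specificity)
    exposed_called_unexposed = prevalence * (1 - sensitivity)
    unexposed_called_unexposed = (1 - prevalence) * specificity
    risk_classified_exposed = (
        (exposed_called_exposed * risk_exposed
         + unexposed_called_exposed * risk_unexposed)
        / (exposed_called_exposed + unexposed_called_exposed))
    risk_classified_unexposed = (
        (exposed_called_unexposed * risk_exposed
         + unexposed_called_unexposed * risk_unexposed)
        / (exposed_called_unexposed + unexposed_called_unexposed))
    return risk_classified_exposed / risk_classified_unexposed

for quality in (0.7, 0.8, 0.9, 1.0):  # sensitivity = specificity = quality
    print(f"quality {quality:.1f}: true RR 2.0 observed as "
          f"{observed_rr(2.0, quality, quality):.2f}; "
          f"true RR 1.0 observed as {observed_rr(1.0, quality, quality):.2f}")
```

When a true association is present, the observed risk ratio climbs toward the true value as classification improves. When the true risk ratio is 1.0, the observed value stays at the null at every level of quality, which parallels the reasoning that results invariant to exposure assessment quality argue against a causal association.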
the new ones addressing specific deficiencies in those that precede them. Ideally,
each subsequent study would combine all the strengths of its predecessors and
remedy at least one limitation of the prior studies. In such a system for the evo-
lution of the literature, each study would address a non-causal explanation for an
observed association and either reveal that the previous studies had been in er-
ror because they had not been quite so thorough, or demonstrate that the hy-
pothesized source of bias was not influential. If the improvement resulted in a
major shift in the findings, the new study would suggest that the previous stud-
ies were deficient and suffered from a critical bias that had been identified and
eliminated. While discovering that a potentially important refinement had no im-
pact may seem like a disappointment, such an observation could be of profound
importance in addressing and eliminating a non-causal explanation. Assume, for
example, a key potential confounder was neglected in a series of studies, and the
next study provided precise measurement and tight control for that potential con-
founding factor, but the study results do not differ from those that came before.
It might be inferred that all the previous studies were also free from confound-
ing, though this could not be addressed directly within those studies. In this sim-
ple view of how studies evolve, exonerating a potential bias in one study negates
the possibility of that bias in the studies that came before, since the improve-
ments are cumulative.
We now move from this ideal model, in which studies build perfectly and log-
ically on one another through cumulative improvements, to a more realistic one
in which studies are stronger in some respects than previous ones, but are often
weaker in other respects. For logistical reasons, making one refinement tends to
incur sacrifices along other dimensions. Applying a demanding, detailed mea-
surement protocol to diminish the potential for information bias may well reduce
the response proportions and increase the potential for selection bias, for exam-
ple. Choosing a population that is at high risk for the disease and thus yields
good precision may incur some cost in terms of susceptibility to confounding.
Conducting a study that is extremely large may sacrifice rigor in the assessment
of exposure or disease. In epidemiology, as in life, there’s no “free lunch.”
The critical question is whether, when a series of studies with differing
strengths and weaknesses are conducted and consistent results are found, an in-
ference can be made that none of the apparent deficiencies distorted the results
and all are pointing in the correct direction. If one study is susceptible to con-
founding, another to bias in control selection, and yet another to exposure mea-
surement error, but all yield the same measure of effect, can one infer that none
of the hypothetical problems are of great importance? This question is different
from the situation in which a series of studies share the same deficiency, and
may yield consistent but biased results due to common unidentified problems. It
also differs from the situation in which one or more studies are free of all major
threats to validity.
There are two reasons that a series of studies with differing strengths and weak-
nesses may yield consistent results: either the disparate deficiencies all yield errors
of the same nature to produce consistently erroneous results or the deficiencies are
all modest in their impact and the studies yield consistently valid results. The dis-
tinction between these two possibilities requires a judgment based on the credibil-
ity of the two scenarios, specific to the substantive and methodologic issues under
consideration. Generally, it is unlikely that a series of studies with differing method-
ological strengths and weaknesses would yield similar results through different bi-
ases acting to influence findings in a similar manner. Faced with such a pattern, the
more plausible inference would be that the potential biases did not distort the re-
sults substantially and, thus, the series of studies are individually and collectively
quantifying the causal association between exposure and disease accurately.
It is also possible that despite the recognition of potential biases that were re-
solved in one or more of the studies, all the studies might share a remaining lim-
itation that has not been identified or is known but cannot readily be overcome.
For example, many exposures simply cannot be randomized for ethical or logis-
tical reasons, and thus no matter how much study designs vary, all address ex-
posure that is incurred due to circumstances outside the investigator’s control.
While the studies may vary with regard to some potential remediable biases be-
ing present or absent, all may continue to share an insurmountable, important
limitation. Consistency across such studies suggests that the weaknesses that vary
across studies are unlikely to account for spurious results but any deficiency that
the studies have in common may still do so. Only the variation in methodologic
strengths and weaknesses across existing studies can be examined, not the pos-
sible effects of improvements not yet undertaken.
Several conclusions can be drawn from this discussion of consistency and in-
consistency in aggregation of results across a series of epidemiologic studies.
First, without considering study quality and specific methodologic features of
the studies, there is little value in simply assessing the pattern of results. To
dichotomize studies as positive or null and examine whether a preponderance
fall in one category or another yields no credible insights about the literature.
This is the case whether those studies are consistent or inconsistent with one
another, though traditionally a series of studies that generate similar results are
viewed as correct whereas inconsistent results are viewed as inconclusive or
weak. If consistent, it needs to be asked whether the studies share a methodologic
deficiency. When such a flaw common to the studies cannot be detected, and
there is not a clear case for different errors producing bias in the same direc-
tion across the studies, then it is more reasonable to infer that they are likely
to be valid.
If inconsistent, there is a need to first consider whether there are specific
reasons to expect the results to differ across studies—for example, the study
populations differ in the prevalence of an important effect modifier. Evalua-
tion of the patterns continues with the examination of gradients of quality and
specific methodologic strengths and weaknesses among the studies. Such an
evaluation, with or without the tools of meta-analysis, yields hypotheses re-
garding the operation of biases. Which of the results are most likely to be valid
depends on the estimated effect of those different biases and a judgment about
what the findings would have been in the absence of the consequential biases.
Among the completed studies may well be one or more that have those desir-
able attributes.
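
As a rough illustration of examining results along a gradient of methodologic quality rather than counting positive and null studies, the sketch below pools hypothetical relative risks within strata defined by one design feature. The study values, the "exposure" feature, and the function names are all assumptions made for illustration, and the fixed-effect inverse-variance weighting is only one of several possible pooling choices.

```python
import math


def pooled_rr(studies):
    """Fixed-effect inverse-variance pooled relative risk with a 95% interval."""
    weights = [1.0 / s["se"] ** 2 for s in studies]
    total = sum(weights)
    mean_log = sum(w * math.log(s["rr"]) for w, s in zip(weights, studies)) / total
    se = math.sqrt(1.0 / total)
    return (math.exp(mean_log),
            math.exp(mean_log - 1.96 * se),
            math.exp(mean_log + 1.96 * se))


def pooled_by_feature(studies, feature):
    """Pool separately within each level of a methodologic feature."""
    groups = {}
    for s in studies:
        groups.setdefault(s[feature], []).append(s)
    return {level: pooled_rr(members) for level, members in groups.items()}


# Hypothetical studies: rr is the relative risk, se the standard error of log(rr).
studies = [
    {"rr": 1.8, "se": 0.25, "exposure": "self-report"},
    {"rr": 1.6, "se": 0.30, "exposure": "self-report"},
    {"rr": 1.1, "se": 0.20, "exposure": "measured"},
    {"rr": 1.0, "se": 0.35, "exposure": "measured"},
]

for level, (rr, low, high) in pooled_by_feature(studies, "exposure").items():
    print(f"{level}: RR {rr:.2f} ({low:.2f} to {high:.2f})")
```

A difference between strata in such a display does not itself establish which stratum is closer to the truth; it generates a hypothesis about bias that must then be judged on substantive grounds.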
The process for scrutinizing inconsistent findings helps to define the bound-
aries of current knowledge. By evaluating the methodologic details of studies
and their results, judgments can be made about the most important candidate ex-
planations for the inconsistency. Those are precisely the study design, conduct,
and analysis features that would help to resolve the controversy. One may also
infer that some potential sources of bias do not seem to be of consequence, and
therefore in assessing the tradeoffs inherent in designing and conducting research,
other threats to validity may be given greater attention even at the expense of the
bias now believed to have been ameliorated. This evaluation of the evidence gen-
erates not just an accurate assessment of the state of knowledge, but also some
indication of its certainty and the requirements for further research to enhance
its certainty.
This evaluation process involves judgments about the impact that hypothesized
biases are likely to have had on the results. Several approaches can be used to
make such judgments. Evidence from other studies that were free of such limi-
tations suggests whether there was any effect. Assessment of the direction of bias
and potential magnitude based on differing scenarios, or sensitivity analyses,
combined with information on the plausibility of those scenarios, generates use-
ful information. Methodological literature, in the form of theoretical or quanti-
tative consideration of biases or empirical evaluations of bias, is critical.
Careful evaluation of results across a series of studies is highly informative,
even when it does not provide definitive conclusions. Juxtaposing results from
studies with differing strengths and limitations, with careful consideration of the
relationship between design features and results, is essential both in drawing the
most accurate conclusions possible at a given point in time and in planning how
to advance the literature. Substantive understanding of the phenomenon inte-
grated with appreciation of the relation of study design features to potential bi-
ases is required. While far more subtle and challenging than the simple search
for consistency or its absence, it is also far more informative.
REFERENCES
Bradford-Hill AB. The environment and disease: association or causation? Proc Royal
Soc Med 1965;58:295–300.
Greenland S. Quantitative methods in the review of epidemiologic literature. Epidemiol
Rev 1987;9:1–30.
Greenland S. Can meta-analysis be salvaged? Am J Epidemiol 1994;140:783–787.
Greenland S. Meta-analysis. In Rothman KJ, Greenland S, Modern epidemiology.
Philadelphia: Lippincott-Raven Publishers, 1998;643–673.
Greenland S, Sheppard AR, Kaune WT, Poole C, Kelsh. A pooled analysis of magnetic
fields, wire codes, and childhood leukemia. Epidemiology 2000;11:624–634.
Holzman C, Paneth N. Maternal cocaine use during pregnancy and perinatal outcomes.
Epidemiol Rev 1994;16:315–334.
Khoury MJ. Genetic epidemiology. In Rothman KJ, Greenland S, Modern epidemiology.
Philadelphia: Lippincott-Raven Publishers, 1998;609–621.
Lubin JH, Boice JD Jr. Lung cancer risk from residential radon: meta-analysis of eight
epidemiologic studies. J Natl Cancer Inst 1997;89:49–57.
National Research Council. Possible health effects of exposure to residential electric and
magnetic fields. Committee on the Possible Effects of Electromagnetic Fields on Bi-
ological Systems, Board on Radiation Effects Research, Commission on Life Sciences,
National Research Council. Washington, DC: National Academy Press, 1997.
Perera FP, Santella R. Carcinogenesis. In Schulte PA, Perera FP (eds), Molecular epi-
demiology: principles and practices. San Diego: Academic Press, Inc., 1993;277–300.
Petitti DB. Meta-analysis, decision analysis, and cost-effectiveness analysis in medicine.
New York: Oxford University Press, 1994.
Poole C. One study, one vote: ballot counting in epidemiologic meta-analysis. Am J Epi-
demiol 1997;145:S85.
Portier CJ, Wolfe MS, eds. Assessment of health effects from exposure to power-line fre-
quency electric and magnetic fields. Working group report. In: US Department of
Health and Human Services, National Institute of Environmental Health Sciences, Na-
tional Institutes of Health publication no. 98-3981, 1998.
Tockman MS, Gupta PK, Pressman NJ, Mulshine JL. Biomarkers of pulmonary disease.
In Schulte PA, Perera FP (eds), Molecular epidemiology: principles and practices. San
Diego: Academic Press, Inc., 1993;443–468.
12
CHARACTERIZATION OF CONCLUSIONS
whether or not the evidence warrants a clear statement. Many of those who
wish to make use of the knowledge encourage or may even demand such clar-
ity, including policy makers, attorneys, journalists, or the general public, as
well as scientists. At a given point in the evolution of our understanding, the
only justifiable assessment may be a murky, complex one that includes alter-
nate scenarios and some probability that each is correct. Explaining such a state
of affairs is often tedious to those who are only mildly interested and may de-
mand a rather sophisticated understanding of epidemiologic methods to be fully
understood. Thus, there will often be pressure to provide a simpler inference,
placing the evidence on some scale of adjectives such as suggestive, strong, or
weak. For example, the International Agency for Research on Cancer has a
well-developed system for classifying carcinogens into such categories as prob-
able and possible (Vainio, 1992). A common question that reporters or the
lay public put to the technical expert, seeking to cut through the often complex
and seemingly wishy-washy conclusions, is “What would you do?” in light
of the new information. The idea is that good scientists prefer to remain non-
committal and focus on the points of uncertainty, but that they nevertheless
have the ability as human beings to wisely integrate the evidence so that their
personal application of the information reflects such wisdom. Behavioral de-
cisions of the scientist, like those of anyone else, will typically go well beyond
the assessment of the epidemiologic evidence and incorporate other lines of re-
search and even the personal values that affect life decisions, realms that even
the most expert epidemiologist is not uniquely qualified to address in a gener-
alizable manner.
Researchers who wish to justify their grant application or argue why their pub-
lication is useful may have the opposite temptation, seeking out or even exag-
gerating the uncertainty to make their contribution seem more important. No one
is free of incentives and values when it comes to the distractions from objectiv-
ity in the interpretation of evidence. While much attention is focused on those
with a direct financial interest, such as private corporations or the opposing par-
ties in legal conflicts, the sources of personal bias are much broader. A topic as
seemingly neutral as whether coronary heart disease has continued to decline in
the 1990s (Rosamond et al., 1998) raises great concern not just from researchers
who have found conflicting results, but from those who believe that more radi-
cal changes are needed in social systems to effect benefit, from those who ad-
minister programs intended to reduce coronary heart disease, and from those who
wish to defend or criticize the cost of medications associated with the preven-
tion and treatment of these diseases. No one is neutral. In this chapter, the basis
for drawing conclusions from epidemiologic data is discussed, and a range of
purposes and audiences with interest in those epidemiologic conclusions is con-
sidered.
APPLICATIONS OF EPIDEMIOLOGY
use medications, engage in sexual activity, and drive automobiles with or with-
out conclusive epidemiologic evidence. Because public health issues for which
epidemiologic evidence is relevant pervade society, many decisions made col-
lectively and individually could benefit from epidemiologic insights, even those
that are based on evolving information.
Epidemiologic evidence remains forever inconclusive in the sense that scien-
tific certainty is an elusive goal. The potential for erroneous conclusions can be
successively narrowed through increasingly refined studies, but there is no point
at which the potential for bias has been completely eliminated. Instead, the lines
of evidence bearing on a public health decision, with epidemiologic data pro-
viding one of the critical streams of information, reach a point where the prac-
tical benefits of further epidemiologic information and refinement are limited.
For well-studied issues such as the health hazards of asbestos or benefits of us-
ing seat belts, epidemiology has offered much if not all that it can for informing
the basic policy decision—asbestos exposure should be minimized; seat belt use
should be maximized. Further refining the risk estimates or extending the infor-
mation to previously unstudied subgroups is not without value, but the evidence
has accumulated to the point that major shifts in policy resulting from that new
knowledge are highly unlikely. (The effectiveness of interventions, for example,
may still benefit from epidemiologic scrutiny; perhaps refining the understand-
ing of dose–response functions for asbestos or the benefits of seat belt use in
conjunction with air bags could help to further refine policy.)
An example of a situation in which each bit of epidemiologic evidence is ben-
eficial to policy decisions is drinking water safety. The balancing of risks and
benefits in regard to drinking water treatment in the United States must incor-
porate the well-established risk of waterborne infection associated with inade-
quate treatment, the potential adverse effects of low levels of chlorination
by-products on cancer and reproductive health, and myriad economic and engi-
neering considerations associated with alternate approaches to providing drink-
ing water to the public (Savitz & Moe, 1997). Decisions about changing from
chlorination to other methods of treatment (e.g., ozonation), as well as decisions
regarding specific methods of drinking water chlorination, are based on a pre-
carious balance of epidemiologic, toxicologic, economic, and other considera-
tions, including public perception. Shifts in any of those considerations may well
lead to a modification of policy. For example, if the threat of waterborne infec-
tion were found to be greater than previously thought, or if the chlorination lev-
els required to reduce the threat of waterborne infection were found to be greater
than expected, support for increased levels of chlorine treatment would be
strengthened despite the present ambiguous indications regarding adverse health
effects of chlorination by-products. If the epidemiologic evidence linking chlo-
rination by-products to bladder and colon cancer were to be increased through
improved study methods, then the scales would be tipped toward efforts to de-
crease chlorination by-product levels, either through accepting some greater risk
of waterborne infection or through more sophisticated and expensive engineer-
ing approaches. Similarly, toxicologic research demonstrating toxicity at lower
levels of chlorination by-products than previously observed would tip the bal-
ance, as would engineering breakthroughs making alternatives to chlorination
cheaper or more effective. In this realm of policy, changes in the epidemiologic
evidence matter a great deal.
A counter example, no less controversial, in which refined epidemiologic ev-
idence is unlikely to have significant policy influence is active cigarette smok-
ing. While the full spectrum of health effects, dose–response relations, and re-
sponsible agents in tobacco smoke remain to be elucidated, the epidemiologic
evidence of a monumental public health burden from tobacco use is clear. The
policy controversies tend to focus more on uncertainty in the most effective meas-
ures to prevent adoption of smoking by teenagers, management of adverse im-
plications for tobacco farmers, and issues of personal freedom. Refining the epi-
demiologic evidence linking tobacco use with disease is not likely to have a major
impact on policy. The level of certainty for major tobacco-related diseases such
as lung and bladder cancer and coronary heart disease is so high that minor shifts
upward or downward from additional studies would still leave the certainty within
a range in which the epidemiology encourages policy that aggressively discourages
tobacco use. Whether or not tobacco smoking causes leukemia or delayed con-
ception is not going to have much effect on the overall policy. The epidemio-
logic evidence has, in a sense, done all it can with regard to the basic question
of health impact of active smoking on chronic disease, particularly lung and other
cancers and heart disease in adults. Nonetheless, the concerns with environmen-
tal tobacco smoke, potential impact of tobacco smoking on myriad other diseases
such as breast cancer and Alzheimer’s disease, the mechanisms of causation for
diseases known to be caused by smoking, and measures for reduction in smok-
ing call for additional epidemiologic research for these and other purposes.
Identifying policy decisions that would be affected by modest shifts in the epi-
demiologic evidence is one consideration in establishing priorities for the field.
Those decisions that are teetering between nearly equally acceptable alternatives
are the ones most likely to be tipped by enhancing the quality of the epidemio-
logic evidence, whatever the results of those additional studies might show.
Strengthening or weakening the degree of support would have bearing on pub-
lic health decisions. Research needs to continue even when the evidence is so
limited that a strikingly positive epidemiologic study would fail to tip the bal-
ance, since such research is building toward future influence on policy. At the
other end of the spectrum of knowledge, further understanding of well-
established relationships may clarify biological mechanisms or help to point the
way to discoveries involving other causative agents or other health conditions.
This framework suggests how epidemiologic evidence should be presented to
This relevance to policy and life in general is both a blessing and a curse to
the field of epidemiology. The opportunity to contribute to issues of societal con-
cern and issues that affect people’s daily lives is inspiring to practitioners of epi-
demiology and a principal incentive to fund research and disseminate the find-
ings. On the other hand, the hunger for answers to the questions epidemiologists
ask can also lead to public dissemination of incorrect findings, exaggerated claims
of certainty, or unwillingness to accept evidence that is counter to otherwise de-
sirable policy or lifestyle choices.
Some efforts have been made to catalog all forms of potential bias in epidemi-
ologic studies (Sackett, 1979), though such lists tend to become exercises in nam-
ing similar biases that arise in slightly different ways. Rather than serving as a
checklist to ensure all concerns have been considered, they tend to serve more
like a dictionary for looking up terms or as a demonstration of how fallible epi-
demiology is. Instead of such laundry lists of potential bias, a small number of
concerns typically predominate in evaluating a specific study or set of studies.
Rather than asking, without consideration of the specific phenomenon under study
or the design of that study, “What are the ways that the results could be in er-
ror?,” one needs to focus on the specifics of the phenomenon and design to de-
termine the major threats to validity. Because a serious evaluation of a single
source of potential bias is a painstaking process, involving a detailed, thought-
ful examination of data from within and outside the study, consideration of more
than a handful of such issues is impractical.
The uncertainty that limits conclusions based on the research is generally not
evenly distributed among dozens of limitations, each accounting for a small
amount of the potential error. More often, a handful of issues account for the
bulk of the uncertainty, and perhaps dozens more each contribute in presumably
minor, perhaps even offsetting, ways. Major concerns often include control se-
lection in case–control studies or misclassification of disease or exposure. Among
the myriad minor concerns are dishonesty by the investigators or data collectors,
data entry errors, or programming errors—the potential is always present, but
with reasonable attention, the probability of such problems having a sizable im-
pact on the results is minimal. With an initial triage to identify the few critical
concerns, resources can be brought to bear to examine the key problems in de-
tail, and plans can be made for the next study of the phenomenon to improve
upon one or more of the key limitations.
For the major sources of potential error, a critical review calls for speculation
about the direction and magnitude of bias. A thorough assessment includes ex-
amination of the patterns of results within the study data, where possible, as well
Even if we could successfully identify the small set of critical limitations, and
accurately characterize the potential for bias associated with each of those con-
cerns, we would still face the challenge of integrating this information into an
overall evaluation of a study or set of studies addressing the same hypothesis.
What we would really like to know is the probability distribution for the causal
relation between exposure and disease, that is, a quantitative assessment of the
probability that the causal impact of exposure on disease takes on alternate val-
ues. This assessment would integrate observed results with hypothesized biases.
Information of this quality and certainty is not readily obtained, but there are
simplified approaches to integration of evidence that move in that direction.
One simple but important question to address is the direction of bias most
likely to result from the array of methodological considerations. If all the sources
of bias are predicted to result in errors of a particular direction (overestimation
or underestimation of the effect), the presumption would be that the effects are
cumulative, and that more extreme biases may result than would be predicted
from each factor operating in isolation. The overall probability that the measured
value deviates from the true value in one direction is increased relative to the
probability that it deviates in the opposite direction, and the probability that the
magnitude of error is extreme would likewise be increased.
If the major sources of potential bias are in conflicting directions, some most
likely to inflate the measure of effect, others likely to reduce it, then the inte-
gration may increase the probability assigned to values closer to the one that was
observed. Because it is unknown which bias will predominate, there may be a
range on either side of the observed value with relatively high probabilities as-
signed, but extreme deviations are less plausible if the biases counteract one an-
other than if the biases act synergistically.
Only direction of bias has been considered so far, but of course the magnitude
of bias is critical as well. In some instances, there may be absolute bounds on
the amount of error that could be introduced; for example, if all nonrespondents
had certain attributes or if an exposure measure had no validity whatsoever, these
extreme cases would usually be assigned low probabilities. Quantification of the
magnitude of individual biases more generally would be valuable in assessing
their relative importance. Where multiple biases operate in the same direction, if
one predominates quantitatively over the others, assessment of its impact alone
may provide a reasonable characterization of the range of potential values. Where
biases tend to compensate for one another, the overall impact of the combina-
tion of biases would likely reside with the largest one, tempered to only a lim-
ited extent by the countervailing biases.
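
The reasoning about direction and magnitude can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: each bias is represented as a multiplicative factor on the relative risk (greater than 1 if it inflates the observed value, less than 1 if it deflates it), the factors are drawn independently and log-uniformly from hypothetical ranges, and random error is ignored. None of the numbers or function names come from the text.

```python
import math
import random
import statistics


def simulate_true_rr(observed_rr, bias_ranges, n=20000, seed=0):
    """Draw plausible true RRs by dividing the observed RR by sampled bias factors.

    bias_ranges is a list of (low, high) multiplicative bias factors;
    each factor is sampled uniformly on the log scale (an assumption).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        log_rr = math.log(observed_rr)
        for low, high in bias_ranges:
            log_rr -= rng.uniform(math.log(low), math.log(high))
        draws.append(math.exp(log_rr))
    return draws


def summarize(draws):
    """Median and central 95% range of the simulated true values."""
    cuts = statistics.quantiles(draws, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return {"median": statistics.median(draws), "low": cuts[0], "high": cuts[-1]}


observed = 2.0
# Two hypothetical biases that both inflate the observed relative risk.
same_direction = simulate_true_rr(observed, [(1.0, 1.5), (1.0, 1.3)])
# One inflating bias and one deflating bias of the same hypothetical sizes.
opposing = simulate_true_rr(observed, [(1.0, 1.5), (1 / 1.3, 1.0)])

print("same direction:", summarize(same_direction))
print("opposing:", summarize(opposing))
```

Under these assumptions, biases acting in the same direction shift the whole distribution away from the observed value, whereas opposing biases leave it centered nearer the observed value with a range on either side of it.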
The terminology used to provide the bottom line evaluation of potential bias
is important to consider. The evaluation should incorporate benchmarks of par-
ticular concern for differing responses to the evidence. The observed measure of
effect is one such benchmark, but others would include the null value and val-
ues that would be of substantial policy or clinical importance. If a doubling of
risk associated with exposure would reach some threshold for regulatory action,
for example, then information on the observed results and potential biases should
be brought together to address the probability that the true value meets or ex-
ceeds that threshold. Short of generating the full probability distribution of true
effects, some estimation of the probability that the observed measure is likely to
be correct, the probability that the null value is accurate, or the probability that
the measure of effect is in a range of policy or clinical relevance would be help-
ful in characterizing the state of the evidence.
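
One way to express such benchmark probabilities, short of a full probability distribution, is sketched below. It assumes that uncertainty about the true log relative risk is roughly normal, that random error and net bias are independent, and that net bias can be summarized by a mean and standard deviation on the log scale. The threshold of 2.0, the band taken to represent "near the null", and every input value are hypothetical.

```python
import math
from statistics import NormalDist


def benchmark_probabilities(observed_rr, se_log_rr, bias_log_mean, bias_log_sd,
                            threshold_rr=2.0, null_band=(0.9, 1.1)):
    """Approximate probabilities that the true RR meets selected benchmarks.

    The true log RR is modeled as normal, centered on the observed log RR
    minus the expected net bias, with random and bias variances added.
    """
    center = math.log(observed_rr) - bias_log_mean
    spread = math.sqrt(se_log_rr ** 2 + bias_log_sd ** 2)
    dist = NormalDist(center, spread)
    return {
        "at_or_above_threshold": 1.0 - dist.cdf(math.log(threshold_rr)),
        "near_null": dist.cdf(math.log(null_band[1])) - dist.cdf(math.log(null_band[0])),
        "above_null": 1.0 - dist.cdf(0.0),
    }


# Hypothetical study: observed RR of 2.4, with net bias expected to have
# inflated the estimate by about 20 percent, give or take.
probs = benchmark_probabilities(
    observed_rr=2.4, se_log_rr=0.20,
    bias_log_mean=math.log(1.2), bias_log_sd=0.15,
)
for name, value in probs.items():
    print(f"{name}: {value:.2f}")
```

The normal approximation is a convenience rather than a claim about how biases actually behave; the value of the exercise lies in forcing the assumed size and uncertainty of each bias to be stated explicitly.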
edge from a range of basic and clinical sciences, helps to assess the potential for
human health effects and will help in making a probabilistic judgment regarding
health risks or benefits in populations.
Epidemiologists have to guard against being defensive and thus insufficiently
attentive to evidence from other disciplines. Although basic biomedical disci-
plines often enjoy greater prestige and their practitioners sometimes fail to ap-
preciate the value of epidemiology, epidemiologists should not retaliate by un-
dervaluing those approaches. Critics from disciplines other than epidemiology
(and some epidemiologists) may place excessive weight on the more tightly con-
trolled but less directly relevant lines of research and insufficient weight on epi-
demiology in drawing overall conclusions. Similarly, epidemiologists and occa-
sionally other scientists mistakenly believe an integrated evaluation of the array
of relevant evidence should place all the weight on observed health patterns in
populations (i.e., epidemiology), and none on other approaches to evaluating and
understanding the causes of disease. The key point is that these other disciplines
and lines of evidence are not done solely to assist in the interpretation of epi-
demiologic evidence, i.e., evaluating the plausibility of the epidemiology, but
rather to help in making the broader evaluation of health risks and correspon-
ding policy decisions.
An argument can be made that perfectly constructed epidemiologic informa-
tion is the most relevant basis for assessing the presence or absence of human
health effects in populations. Epidemiology alone examines the relevant species
(humans) under the environmental conditions of concern (i.e., the actual expo-
sures, susceptibility distribution, and concomitant exposures of interest). There
is no extrapolation required with regard to species, agent, dose, etc. Epidemiol-
ogists alone have the capability of studying the people whose health is adversely
affected. Even if the detailed understanding of mechanisms of disease causation
requires other disciplines, the basic question of whether humans are being af-
fected would be answered by epidemiology—if only it could be done perfectly.
Epidemiology alone, however, is never sufficient for fully assessing human
health risks. The principal limitation is the uncertainty regarding epidemiologic
findings and the inescapable challenge and fallibility of inferring causal relations.
The very strength of epidemiology, studying humans in their natural environ-
ment, imposes limitations on causal inference that are only partially surmount-
able. While perfect epidemiologic evidence would provide a direct indication of
the true effects in human populations, epidemiologic evidence is never without
uncertainty, whether it points toward or away from a causal association. In prac-
tice, the complementary strengths of epidemiology and more basic biomedical
evidence always provide a clearer picture than could be generated by epidemi-
ology in isolation. Even where epidemiology has been most autonomous, for ex-
ample in identifying the substantial excess risks of lung cancer associated with
prolonged, heavy cigarette smoking, confidence was bolstered by identifying the
worrying the public unnecessarily, and that these assertions will later be proven
false, discrediting the field.
A more sanguine view is that thoughtful, non-rancorous debate is beneficial
for the field and helpful, not harmful, in characterizing the current state of knowl-
edge and its implications for policy. All sciences engage in such assertions and
challenges, from physics to anthropology, and the underlying science is not and
should not be called into question as a result. Nor should the controversy in epi-
demiology be viewed as evidence against the rigor and value of the discipline.
Presentation of new research findings is a form of debate, using data rather than
isolated logical arguments. New research is designed to enter into the ongoing
debate, hypothesizing that the proposed study will shift the evidence in one di-
rection or the other. Direct challenges in the form of debates at research meet-
ings, in editorial statements, or face-to-face in informal settings can only stimu-
late thinking and move the field forward; if spurious issues are raised they can
be put to rest as a result of being aired. Obviously, when the debate becomes
personal or ideological, progress will be inhibited and objectivity compromised.
If the benefit of controversy is clarity about the evidence and the identification
of strategies for enhancing the quality of evidence, emotion and ego detract from
the realization of such benefits.
In proposing principles for the evaluation of epidemiologic evidence, the in-
tent is not to provide a framework that will enable all knowledgeable, objective
epidemiologists to reach the same conclusions. That would imply that the truth
is known and if we could just sit down together, have the same base of infor-
mation available, and apply the same methodologic principles, we would reach
the same conclusions. Because the information is always incomplete to varying
degrees, we are extrapolating from what is currently known to what would be
known if the evidence were complete. Even the methodologic principles do not
ensure unanimity of interpretation, because the inference about potential biases
is just that—an informed guess about what might be. The goal of applying the
tools and principles summarized in this book is to change the debate from one
based on global impressions and subjective biases to one that is specific, in-
formed, and generates clarity regarding research needs, pinpointing issues for
further methodologic development.
Starting from the premise that all bodies of research are inconclusive to some
extent, the principal need is to determine precisely where the gaps are. State-
ments such as “The evidence is weak” or “Studies are inconsistent” or “Some-
thing is going on there” invite disagreement of an equally subjective and global
nature. A debate over conflicting, superficial impressions offers no benefit either
in clarifying where the current evidence stands or in pinpointing what is needed
to reach firmer conclusions. The basis for those conflicting inferences needs to
be elucidated. If the evidence is perceived to be weak by some and strong by
others based on, for example, differing views of the quality of exposure assess-
ment, then we need to reframe the debate by compiling available evidence re-
garding the quality of exposure assessment. We may find that validation studies
are the most pressing need to advance this avenue of research. If controversy is
based on adequacy of control for confounding, we have tools to address the plau-
sibility of substantial uncontrolled confounding. Placing the controversy on more
secure foundations will not produce agreement, and may even anchor both sides
more firmly in their original positions. Again, the goal is not consensus, but an
informative and constructive debate.
For the major issues that underlie differing conclusions (not all possible
methodologic issues), the evidence needs to be dissected. The evidence from
studies of the topic of concern, as well as ancillary methodological work and
other sources of extraneous information, should be brought to bear. Typically,
there will be some support for each of the opposing sides, and a disagreement
over the summary judgment may well remain after a complete examination of
the issues. What should emerge, however, is a much more specific argument that
helps to define empirical research that would alter the weight of evidence. Each
side may bring the preconception that the additional research will confirm what
they believe to be true, but making the argument specific and testable is a sig-
nificant step forward.
where (Rothman & Poole, 1996), the items that are enumerated provide clues to
potential for bias and there would be greater benefit in focusing directly on the
potential for bias than on these indirect markers.
Strong associations or those that show dose–response gradients, Hill’s first two
criteria, provide evidence against the association being entirely due to con-
founding, under the assumption that confounding influences are more likely to
be weak than strong and not likely to follow the exposure of interest in a dose–
response manner. Preferably, one can focus directly on the question of con-
founding, and gather all the evidence from within and outside the study to sug-
gest the presence or absence of distortion of varying magnitude. In fact, a strong
association may be influenced by confounding of modest magnitude and still re-
main a fairly strong association, whereas a weak association is more likely to be
obliterated. Hill’s rationale is very likely to be valid, but there are more strate-
gies for assessing confounding than he offered, and all are potentially valuable.
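
The arithmetic behind this point can be illustrated with the standard external-adjustment formula for a single binary confounder, which assumes no effect modification. In the sketch below the confounder prevalences, the confounder-disease relative risk, and the two observed relative risks are hypothetical values chosen only to show the contrast.

```python
def confounding_bias_factor(prev_exposed, prev_unexposed, rr_confounder):
    """Multiplicative distortion of the RR from one binary confounder.

    prev_exposed and prev_unexposed are the confounder's prevalence among
    the exposed and unexposed; rr_confounder is its relative risk for disease.
    """
    return ((prev_exposed * (rr_confounder - 1.0) + 1.0) /
            (prev_unexposed * (rr_confounder - 1.0) + 1.0))


def adjusted_rr(observed_rr, prev_exposed, prev_unexposed, rr_confounder):
    """Observed RR divided by the confounding bias factor."""
    return observed_rr / confounding_bias_factor(
        prev_exposed, prev_unexposed, rr_confounder)


# Hypothetical confounder: twice as common among the exposed (60% vs 30%)
# and associated with a doubling of disease risk.
for observed in (5.0, 1.25):
    print(observed, "->", round(adjusted_rr(observed, 0.6, 0.3, 2.0), 2))
```

With these inputs the bias factor is about 1.23, so an observed relative risk of 5.0 remains above 4 after adjustment, while an observed relative risk of 1.25 falls to roughly 1.0.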
Other considerations proposed by Hill—such as specificity, in which causal-
ity is more likely if there is a single agent influencing a single disease—are of
little value, except as a reminder to be watchful for selection bias or recall bias
that might produce spurious associations with an array of exposures or outcomes.
One approach suggested in this book is the examination of whether associations
are observed that are unlikely to be causal, and thus call the association of in-
terest into question. The menu of strategies is much broader than Hill’s brief list
would suggest.
The linkage to biological plausibility and coherence looks to evidence outside
of epidemiology for support, but the evidence from other disciplines should
be integrated with the epidemiologic evidence for an overall appraisal, not merely
used to challenge or buttress the epidemiology. Concern with temporal sequence
of exposure and disease is a reminder to be wary of the potential for disease to
influence the exposure marker, such as early stages of cancer possibly reducing
serum cholesterol levels (Kritchevsky & Kritchevsky, 1992), leading to an erro-
neous inference that low serum cholesterol is causally related to the development
of cancer.
Hill’s criteria are of value in helping to remind the interpreters of epidemio-
logic evidence to consider alternatives to causality, even when a positive asso-
ciation has been observed. A preferable approach to evaluating a study or set of
studies is to focus on sources of distortion in the measure of effect, isolating any
causal relation from distortion due to bias or random error. That process over-
laps directly with issues raised by Hill but attempts to be more direct in pin-
pointing sources of error that can then be tested. The Hill criteria were intended
to assess causality when an observation of an association had been made, not to
interpret epidemiologic evidence more generally. For null associations, for
example, the Hill criteria are not applicable, yet the results still call
for scrutiny and interpretation.
More extensive, formal schemes for assessing epidemiologic methods and ev-
idence have been proposed. In some instances, the goal is to enumerate attributes
that make epidemiologic studies credible (Chemical Manufacturers Association,
1991; Federal Focus Inc, 1996), focusing on methods or on the description of
methods. A recent proposal for the use of an “episcope” (Maclure & Schneeweiss,
2001) would be a welcome formalization of the evaluation of bias, very much
concordant with the ideas expressed here. Encouragement to provide informa-
tion that helps to make informed judgments of the epidemiologic evidence can
only be helpful. As the focus turns to making sense of the evidence that has been
generated, rigid application of the checklists becomes more problematic. A
skeptical interpretation of the resistance to checklists is that epidemiologists
prefer to shroud their expertise in some degree of mystery to ensure long-term
job security. So long as an epidemiologist, rather than a computer program, is
needed to interpret epidemiologic research, we remain a valued and relatively
scarce resource. There are more valid reasons, however, why such approaches can
serve at best as a reminder of the issues to be considered and not as an
algorithm for judgment.
The universe of issues that could potentially influence study results is quite
extensive, even though the considerations are often grouped into a few categories,
such as confounding, selection bias, and information bias. Within each of these,
the application to a specific topic, study, or set of studies involves dozens of
branches, tending toward very long lists of specific manifestations. The relative
importance of these concerns is based not on some generic property resulting
from the nature of the bias, but rather on the specific characteristics of
the phenomenon being addressed in a given study. The sociologic characteristics
of the population, the nature of exposure and disease measurement, the details
of study conduct, and the methods of statistical analysis all have direct bearing
on the potential for bias. The use of ancillary information from other relevant
studies adds yet another dimension to the evaluation, requiring inferences
about the applicability of data from those studies based, for example, on their
similarities to and differences from the population at hand. Integrative schemes
require some approach to weighting that synthesizes these highly distinctive
issues into scores, but even if the menu of items could be stipulated, their
relative importance could not. In contrast to any predesignated weighting scheme,
each situation calls for examination of the key concerns and a tentative
assignment of relative importance based on epidemiologic principles. In one set
of studies, non-response may be the dominant concern and deserve nearly all the
attention of
evaluators; in other studies, exposure measurement error may dominate over the
other issues. Any scheme that assigns generic relative weights to non-response
and to exposure measurement error is surely doomed to failure.
The alternative to an algorithm for evaluating epidemiologic evidence is
evaluation by experts. There is a loss in objectivity, in that one expert or set
of experts may well view the evidence differently than another expert or set of
experts. The goal throughout this book has been to identify the specific basis for
such variation, specifying the considerations that lead to a final judgment. Ide-
ally, experts attempt to examine the spectrum of methodologic concerns in rela-
tion to the results, identify sources of uncertainty, evaluate the plausibility and
implications of that uncertainty, and reach appropriate conclusions regarding the
strength of the evidence and key areas in need of refinement.
While discussions about epidemiologic evidence often focus on an assignment
of the proper adjective, such as strong, moderate, or weak evidence, in reality,
the assessment is made to determine whether a specific decision is justified. In
purely scientific terms, the bottom line questions concern the validity of study
results, distortions introduced in the measure of effect, and the probability that
the true association takes on different values. In applications of epidemiologic
evidence, the question is how the information bears on personal and policy de-
cisions. One might ask whether the epidemiologic evidence is strong enough to
impose a regulation or to modify one’s behavior, taking other scientific and non-
scientific considerations into account. This balancing of epidemiologic evidence
against other factors makes the judgment even more complex, less suitable for
checklists and algorithms, but it may be helpful in making the uses of the
epidemiology more explicit. The practical decision is not between differing views
of epidemiologic evidence, but rather between differing courses of action. If
that were made explicit, the incentive to portray weak epidemiologic evidence as
strong, or vice versa, in order to justify a decision would be removed, and
epidemiologists could focus on characterizing the evidence without the
distraction of how that assessment will be used. Clarity and accuracy in the
presentation of the evidence from epidemiology must be the overriding goal,
trusting that others will use the information wisely to chart the appropriate
course of action.
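The separation between characterizing the evidence and choosing a course of action can be sketched in a few lines. This is only an illustration under stated assumptions: the probabilities assigned to each value of the true relative risk, the baseline number of cases, the costs, and the function names are all hypothetical, and the sketch assumes for simplicity that regulation would remove every excess case.

```python
# Epidemiologist's contribution: hypothetical probabilities that the true
# relative risk takes on each value (the probabilities sum to 1).
rr_distribution = {1.0: 0.40, 1.5: 0.40, 2.0: 0.20}


def expected_excess_cases(rr_dist, baseline_cases):
    """Expected cases attributable to exposure among the exposed.

    baseline_cases is the number of cases expected among the exposed if the
    exposure had no effect (a hypothetical input).
    """
    return sum(p * (rr - 1.0) * baseline_cases for rr, p in rr_dist.items())


def preferred_action(rr_dist, baseline_cases, cost_per_case, cost_of_regulation):
    """Decision-maker's contribution: compare the expected cost of inaction
    with the cost of regulating, assuming regulation removes all excess cases."""
    cost_of_inaction = expected_excess_cases(rr_dist, baseline_cases) * cost_per_case
    return "regulate" if cost_of_inaction > cost_of_regulation else "do not regulate"


# The same evidence supports different actions as the non-scientific inputs change.
print(preferred_action(rr_distribution, 100, cost_per_case=1.0, cost_of_regulation=30.0))
print(preferred_action(rr_distribution, 100, cost_per_case=1.0, cost_of_regulation=60.0))
```

The characterization of the evidence (`rr_distribution`) is identical in both calls; only the hypothetical cost of regulation differs, which is where the disagreement about action properly resides.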
REFERENCES
Chemical Manufacturers Association. Guidelines for good epidemiology practices for oc-
cupational and environmental research. Washington, DC: The Chemical Manufactur-
ers Association, 1991.
Federal Focus Inc. Principles for evaluating epidemiologic data in regulatory risk assess-
ment. Washington, DC: Federal Focus, Inc, 1996.
Hill AB. The environment and disease: association or causation? Proc Royal Soc Med
1965;58:295–300.
Kritchevsky SB, Kritchevsky D. Serum cholesterol and cancer risk: an epidemiologic per-
spective. Annu Rev Nutr 1992;12:391–416.
Maclure M, Schneeweiss S. Causation of bias: the episcope. Epidemiology 2001;
12:114–122.
National Research Council. Possible health effects of exposure to residential electric and
magnetic fields. Committee on the Possible Effects of Electromagnetic Fields on Bi-