Clinical Epidemiology: Principles, Methods, and Applications for Clinical Research
PART 1 OVERVIEW
Chapter 1: Introduction to Clinical Epidemiology
Introduction
Clinical Epidemiology
Research Relevant to Patient Care
Epidemiologic Study Design
Design of Data Collection
Design of Data Analysis
Diagnostic, Etiologic, Prognostic, and Intervention Research
Moving from Research to Practice: Validity, Relevance, and Generalizability
Arno W. Hoes, MD, PhD (1958) studied medicine at the Radboud University in
Nijmegen. He obtained his PhD degree in clinical epidemiology at the Erasmus
Medical Center in Rotterdam. He was further trained in clinical epidemiology at
the London School of Hygiene and Tropical Medicine. In 1991, he was
appointed Assistant Professor of Clinical Epidemiology and General Practice in
the Department of Epidemiology and the Department of General Practice at the
Erasmus Medical Center. In the latter department, he headed the research line,
“cardiovascular disease in primary care.” In 1996, he moved to the Julius Center
for Health Sciences and Primary Care of the University Medical Center in
Utrecht, where he was appointed Professor of Clinical Epidemiology and
Primary Care in 1998. Since 2010, he has been the Chair of the Julius Center.
Most of his current research activities focus on the (early) diagnosis, prognosis,
and therapy of common cardiovascular diseases. His teaching experience
includes courses on clinical epidemiology, diagnostic research, case-control
studies, drug risk assessment, and cardiovascular disease. He is a member of the
Dutch Medicines Evaluation Board and the Health Council of the Netherlands, and
is on the editorial boards of several medical journals.
CONTRIBUTORS
We thank the following colleagues and friends for their invaluable contributions
and critical comments on several of the chapters of this text.
INTRODUCTION
Epidemiology is essentially occurrence research [Miettinen, 1985]. The object of
epidemiologic research is to study the occurrence of illness and its relationship
to determinants. Epidemiologic research deals with a wide variety of topics. A
few examples include the causal role of measles virus infection in the
development of inflammatory bowel disease in children, the added value of a
novel B-type natriuretic peptide serum bedside test in patients presenting with
symptoms suggestive of heart failure, the prognostic implications of the severity
of bacterial meningitis for future school performance, and the effect of antibiotics
in children with acute otitis media on the duration of complaints. What binds all
of these examples is the study and, more precisely, the quantification of the
relationship of the determinants (in these cases, measles infection, the novel
bedside test, the severity of bacterial meningitis, and antibiotic therapy) with the
occurrence of an illness or other clinically relevant outcome (that is,
inflammatory bowel disease, heart failure, school performance, and duration of
otitis media complaints). Central to epidemiologic studies in such diverse fields
is the emphasis on occurrence relations as objects of research.
The origins of epidemiology lie in unraveling the causes of infectious disease
epidemics and the emergence of public health as an empirical discipline. Every
student of epidemiology will enjoy reading the pioneer works of John Snow on
the mode of the transmission of cholera in 19th century London, including the
famous words: “In consequence of what I said, the handle of the pump was
removed on the following day” [Snow, 1855]. Subsequently, the methods of
epidemiology were successfully applied to identifying causes of chronic
diseases, such as cardiovascular disease and cancer, and now encompass
virtually all fields of medicine.
In recent decades, it has increasingly been acknowledged that the principles
and methods of epidemiology may be fruitfully employed in applied clinical
research. In parallel with a growing emphasis in medicine on using quantitative
evidence to guide patient care and to judge its performance, epidemiology has
become one of the fundamental disciplines for patient-oriented research and a
cornerstone for evidence-based medicine. Clinical epidemiology deals with
questions relevant to clinical practice: questions about diagnosis, causes,
prognosis, and treatment of disease. To serve clinical practice best, research
should be relevant (i.e., deal with problems encountered in clinical practice),
valid (i.e., the results are true and, thus, not biased), and precise (i.e., the results
lie within a limited range of uncertainty). (See Box 1–1.) These prerequisites are
crucial for research results eventually to be applied with confidence in daily
practice.
CLINICAL EPIDEMIOLOGY
Clinical epidemiology is epidemiology [Grobbee & Miettinen, 1995]. It is a
descriptive label that denotes the application of epidemiologic methods to
questions relevant to patient care. Then why use a different term? Clinical
epidemiology does not indicate a different discipline or refer to specific aspects
of epidemiologic research, such as research on iatrogenic disease. Traditionally,
practitioners of epidemiology predominantly have been found in public health or
community medicine, which can be well understood from the perspective of its
history. Epidemiologic research results have unique value in shaping preventive
medicine as well as in the search for causes of infectious and chronic disease that
affect large numbers of people in our societies. Yet, with the growing
recognition of the importance of probabilistic inference in matters of diagnosis
and treatment of individual patients, an obvious interest has grown in the
approaches epidemiologic research has to offer in clinical medicine. Use of the
term clinical epidemiology therefore refers to its relevance in “applied” clinical
science; conversely it helps to remind us that the priority in the clinical research
agenda must be set with a keen appreciation of what is relevant for patient care.
Clinical epidemiology provides a highly useful set of principles and methods for
the design and conduct of quantitative clinical research.
Traditionally, epidemiologic research has largely been devoted to etiologic
research. Investigators have built careers and departments’ reputations on
epidemiologic research into the causes of infectious or chronic diseases, while
for patient care the ability to establish an individual’s diagnosis and prognosis is
commonly held to be of greater importance. Still, the work of most master’s and
doctoral fellows in epidemiology, in particular those working outside of a
medical environment, is concentrated on etiology. Perhaps they do not realize
that this focus actually restricts the value of epidemiologic research for medical
care.
valid
Main Entry: val · id
Pronunciation: ˈva-ləd
Function: adjective
Etymology: Middle French or Medieval Latin; Middle French valide, from
Medieval Latin validus, from Latin, strong, potent, from valēre
1 : having legal efficacy or force; especially : executed with the proper legal authority and formalities
<a valid contract>
2a : well-grounded or justifiable : being at once relevant and meaningful <a valid theory> b : logically
correct <a valid argument> <valid inference>
3 : appropriate to the end in view : effective <every craft has its own valid methods>
4 of a taxon : conforming to accepted principles of sound biological classification
precise
Main Entry: precise
Pronunciation: pri-'sīs
Function: adjective
Etymology: Middle English, from Middle French precis, from Latin praecisus, past participle of
praecidere to cut off, from prae- + caedere to cut
1 : exactly or sharply defined or stated
2 : minutely exact
3 : strictly conforming to a pattern, standard, or convention
4 : distinguished from every other <at just that precise moment>
Task in patient care | Typical question | Knowledge required
Interpret the clinical profile: predict the presence of the illness | What illness best explains the symptoms and signs of the patient? | Diagnostic knowledge
Explanation of the illness | Why did this illness occur in this patient? | Etiologic knowledge
Predict the course of disease | 1. What will the future bring for this patient, assuming no intervention takes place? 2. To what extent may the course of disease be affected by treatment? | Prognostic knowledge (including therapeutic knowledge)
Decision about medical action | Which treatment, if any, should be chosen for this particular patient? | Balancing benefits and risks of available options
Execution of medical action | Initiation of treatment | Skills
When designing applied clinical research, the principal objective should be to
provide knowledge that is applicable in the practice of medicine. To achieve this,
the research question should be clearly formulated and an answer should be
given in a way that it is both valid and sufficiently precise. First comes validity,
the extent to which a research result is true and free from bias. Valid research
results must be sufficiently precise to allow adequate predictions for individual
patients or groups of subjects. For example, knowing that the 5-year mortality
rate after a diagnosis of cancer is validly estimated at 50% is one thing, but when
the precision of the estimate ranges between 5% and 80%, the utility of this
knowledge is limited for patient care. The design of studies focused on
diagnosis, etiology, prognosis, and treatment needs to meet these goals. General
and specific design characteristics of clinical epidemiologic research will be
discussed in some detail in the next section.
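To see how precision depends on the amount of data, consider a minimal Python sketch (sample sizes are hypothetical, and the normal-approximation confidence interval is an illustrative assumption, not a method named in the text) showing how the interval around a validly estimated 50% mortality rate narrows as the number of patients grows:

```python
import math

def proportion_ci(p_hat: float, n: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)

# Hypothetical studies, each estimating 5-year mortality at 50%:
for n in (10, 100, 1000):
    lo, hi = proportion_ci(0.50, n)
    print(f"n={n:4d}: 50% (95% CI {lo:.0%} to {hi:.0%})")
# n=10 yields roughly 19% to 81%; n=1000 narrows this to about 47% to 53%.
```

With 10 patients the estimate is too imprecise to guide patient care; with 1,000 it becomes clinically useful.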
Theoretical Design
The theoretical design of a study starts from a research question. Formulating the
research question is of critical importance as it guides the theoretical design and
ensures that, eventually, the study produces an answer that fits the needs of the
investigator. Therefore, a research question should be expressed as a question
and not as a vague ambition. All too often, investigators set out to “examine the
association between X and Y.” This is far from a research question and will not
lead to a clinically useful answer.
Theoretical design: design of the occurrence relation.
Design of data collection: design of the conceptual and operational collection of data to document the empirical occurrence relation in a study population.
Design of data analysis: includes a description of the data and quantitative estimates of associations.
For starters, a research question should end with a question mark. An example
of a useful research question is: “Does 5-day treatment with penicillin in
children with acute otitis media reduce the duration of complaints?” This
research question combines three crucial elements: (1) one or more determinants
(in this case 5-day treatment with penicillin), (2) an outcome (the duration of the
complaints), and (3) the domain. The domain refers to the population (or set of
patients) to whom the results can be applied. The definition of the domain (in
this case, children with acute otitis media) is typically much broader than the
selection criteria for the patient population included in the study (e.g., children
enlisted in 25 primary care practices located in the central region of the
Netherlands during the year 2000 who were diagnosed with acute otitis media).
Similarly, the domain of the famous British study in the 1940s addressing the
causal role of cigarette smoking in lung cancer was man and not restricted in
place or time. The domain for a study is like a pharmaceutical package insert. It
specifies the type of patients to whom the results can be applied. It guides patient
selection for the research, but this selection is usually further restricted for
practical or other reasons. When an appropriate research question is formulated,
the design of the occurrence relation is relatively easy.
The occurrence relation is central to the theoretical design of a clinical
epidemiologic study. The occurrence relation is the object of research and
relates one or multiple determinants to an outcome. In subsequent phases of the
study, the “true” nature and strength of the occurrence relation is documented
and quantitatively estimated using empirical data. Occurrence relations in
diagnostic, etiologic, prognostic, and intervention research each have particular
characteristics, but all have a major impact on the other two components of
epidemiologic study design: design of data collection and design of data
analysis. To facilitate the theoretical design of a study and determine the
(elements of the) occurrence relation, a distinction should be made between
descriptive and causal research.
For example, in a causal study addressing whether depression leads to heart
disease, the occurrence relation can be summarized as:

Incidence of heart disease = f(depression | ED)

where ED could include lifestyle factors such as smoking and alcohol but also
treatments for depression that might lead to heart disease, such as tricyclic
antidepressants.
To allow the collection of empirical data for the study, typically the
conceptual definitions of outcome and determinants need to be operationalized
to measurable variables. In this example, depression could be measured using
the Zung depression scale and heart disease could be operationalized by a record
of admission to a hospital with an acute myocardial infarction. Often, this step
leads to simplification or to measures that do not fully capture the conceptual
definitions. For example, we may wish to measure quality of life but may need
to settle for a crude approximation using a simple 36-item questionnaire. To
appreciate the results of a study, it is important to realize that such compromises
may have been made.
The multiple determinants include the novel BNP test, the findings from
history taking (including known comorbidity), and physical examination, which
are available in daily practice anyway; the outcome is a diagnosis of heart
failure. The domain should not be too narrow and could be defined as patients
presenting to primary care with dyspnea or, alternatively, all patients presenting
to primary care with symptoms suggestive of heart failure in the view of the
physician. The corresponding occurrence relation can be summarized as the
presence of heart failure as a function of multiple determinants, including the
novel BNP test:

P(heart failure) = f(BNP test, findings from history taking, physical examination)
Etiologic Research
Clinicians and epidemiologists alike tend to be most familiar with etiologic
research, despite its limited direct relevance to patient care and its methodologic
complexities. As in all epidemiologic studies, and starting from the research
question, the first step is the design of the occurrence relation. For etiologic
research, this includes consideration of a determinant as well as one or multiple
extraneous determinants.
Consider, for example, the causes of childhood inflammatory bowel disease
(IBD), particularly to what extent a certain factor (e.g., a measles virus infection)
may be responsible for its occurrence. The research question could be
formulated as follows:
Does measles virus infection cause IBD in children?

The corresponding occurrence relation can be summarized as:

Incidence of IBD = f(measles virus infection | ED)

where ED are the extraneous determinants (potential confounders) that need to be taken into account.
Prognostic Research
To be able to set a prognosis is an essential feature of daily clinical practice. The
process of estimating an individual patient’s prognosis is illustrated by the
following question often asked by practicing physicians: “What will happen to
this patient with this illness if I do not intervene?” In essence, prognostication
implies predicting the future, a difficult task at best. As in the diagnostic process,
estimating a patient’s prognosis means taking into account multiple potential
determinants, some of which pertain to the clinical profile (e.g., markers of the
severity of the illness) and some of which refer to the nonclinical profile (e.g.,
age and sex). Ideally, prognostic evidence should help the clinician to adequately
and efficiently predict a clinically relevant prognostic outcome in an individual
patient. More general prognostic information, such as 5-year survival of types of
cancer and 1-year recurrence rates in stroke patients, is typically not sufficiently
informative to guide patient management. Moreover, several prognostic outcome
parameters can be of interest. Apart from survival or specific complications,
quality of life indices can also be extremely relevant.
Imagine a 10-year-old child who experienced a recent episode of bacterial
meningitis. The parents ask the clinical psychologist about the possible longer-
term sequelae of their son’s illness. They are particularly worried about their
child’s future school performance. To predict the child’s school performance, in
this example, in 5 years’ time, the psychologist will consider both nonclinical
(such as age and previous school performance) and clinical parameters, notably
indices of the severity of the meningitis. The clinical psychologist is uncertain
which combination of these latter parameters best predicts future school
performance.
An example of a research question of prognostic research addressing this topic
is:
Which combination of measures of disease severity (e.g., duration of symptoms prior to admission
because of meningitis, leukocyte count in cerebral spinal fluid, dexamethasone use during
admission) best predicts future school performance in children with a recent history of bacterial
meningitis?
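A question of this form is typically answered with a multivariable prediction model. The following Python sketch illustrates the idea on simulated data; all variable names, coefficients, and sample sizes are invented for illustration and do not come from an actual meningitis study:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical illustration: simulate 200 children with a recent history
# of bacterial meningitis and fit a multivariable model predicting a
# future school-performance score from severity markers plus nonclinical
# predictors (all values and variable choices are invented).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.normal(10, 1, n),        # age at admission (years)
    rng.normal(7, 3, n),         # duration of symptoms before admission (days)
    rng.normal(2000, 800, n),    # CSF leukocyte count (per microliter)
    rng.integers(0, 2, n),       # dexamethasone given during admission (0/1)
    rng.normal(70, 10, n),       # previous school-performance score
])
# Simulated outcome: prior performance dominates; severity markers detract.
y = 0.8 * X[:, 4] - 0.5 * X[:, 1] - 0.002 * X[:, 2] + rng.normal(0, 5, n)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_.round(3))
# Predicted score for one new child (same column order as above):
print("prediction:", model.predict([[10, 5, 1500, 1, 75]]).round(1))
```

The fitted coefficients show which combination of predictors carries the prognostic information, which is exactly what the research question asks.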
Intervention Research
An intervention is any action taken in medicine to improve the prognosis of a
patient. This can include treatment or advice as well as preventive actions. The
most common form of intervention research in medicine is research on the
effects of drug treatment. Research into the benefits and risks of interventions
merits particular attention. The design of intervention research generally requires
the design of an occurrence relation that serves both the estimation of the
prognosis of a particular patient when the intervention is initiated and a valid
estimation of the causal role of the intervention in that prognosis. In other words,
intervention research aims to both predict prognosis following the intervention
and understand the effect caused by the intervention.
From the perspective of the patient, the change in prognosis brought about by
treatment is of the greatest interest. However, from the perspective of, for
example, the drug manufacturer or regulator, the question is whether it is the
pharmacologic action of the drug and nothing else that improved the prognosis.
The question is about the causality of the treatment effect. Consequently, the
object, data collection, and analysis should comply with the specific
requirements of both causal and descriptive research. Typically the requirements
of being able to draw causal conclusions and the exclusion of confounding
factors drive the design. Importantly, intervention research, particularly its most
appreciated form, the randomized trial, can serve as a role model for causal
research at large because trials are designed to remove major sources of
confounding [see Chapter 10 and Miettinen, 1989].
One may question whether causal research that does not take prognostic
implications into account has value for clinical medicine. In intervention studies,
principles of both causal and descriptive or, according to Miettinen,
“intervention-prognostic” research apply [Miettinen, 2004]. Because the design
of data collection and data analysis of causal research calls for a strict control of
confounding factors, the causal outlook of intervention research commonly
dominates in intervention studies. However, the challenge for the investigator is
not only to provide an answer on causality but also to produce a meaningful
estimate of the effect on the prognosis of individual patients. Consider an 18-
month-old toddler visiting a primary care physician because of acute otitis
media. According to her mother, this is the second episode of otitis; the first
episode occurred some 9 months ago and lasted 10 days. The mother is afraid of
continued prolonged periods of complaints and asks for an antibiotic
prescription. First, the clinician will estimate the prognosis of the child, taking
into account the child’s prior medical history, current clinical features (e.g.,
fever, uni/bilateral ear infection), and other prognostic markers such as age.
Then the effects of antibiotic therapy on the prognosis will be estimated. To this
end, the causal (i.e., true) effects of antibiotic therapy in young children should
be known. The research question of an intervention study providing this
evidence is: “Does antibiotic therapy reduce the duration of complaints in young
children with acute otitis media?” Here, antibiotic therapy is the determinant and
the number of days until resolution of symptoms is the outcome. The domain is
young children (younger than 2 years) with acute otitis media. Although one
could argue that the domain may be as large as all children with otitis, the
prognosis in young children is considered to be relatively poor and the effects of
antibiotics could be different in this subgroup of children. The occurrence
relation can be summarized as:

Duration of complaints (days until resolution of symptoms) = f(antibiotic therapy | ED)
The essence of knowledge is generalization. That fire can be produced by rubbing wood a certain way
is a knowledge derived from individual experiences; the statement means that rubbing wood in this
way will always produce fire. The art of discovery is therefore the art of generalization. What is
irrelevant, such as the particular shape or size of the piece of wood used is to be excluded from the
generalization: what is relevant, for example, the dryness of the wood, is to be included in it. The
meaning of the term relevant can thus be defined: that is relevant which must be mentioned for the
generalization to be valid. The separation of relevant from irrelevant factors is the beginning of
knowledge.
Reproduced from Reichenbach H in: The rise of scientific philosophy. New York: Harper and Row. 1965.
Diagnostic Research
INTRODUCTION
A 55-year-old man visits his general practitioner (GP) complaining of dyspepsia.
He has had these complaints for more than 3 months, but their frequency and
severity have increased over the last 4 weeks. The patient has a history of angina
but has not required sublingual nitroglycerin for more than 2 years. He is known
to the GP as having been unsuccessful in quitting smoking despite frequent
attempts to do so. The GP asks several additional questions related to the nature
and severity of the dyspepsia to estimate the chance that the patient suffers from
a peptic ulcer. The GP also asks about possible anginal complaints. A short
physical examination reveals nothing except some epigastric discomfort during
palpation of the abdomen. The GP considers a peptic ulcer the most likely
diagnosis. The probability of a coronary origin of the complaints is deemed very
low. The GP decides to test for Helicobacter pylori serology, to further increase
(rule in) or decrease (rule out) the probability of (H. pylori-related) peptic ulcer.
The H. pylori test is negative. The GP prescribes an acid-suppressing agent and
asks the patient to visit again in a week. When the man visits the GP again, his
complaints have virtually disappeared.
Knowing how to live with uncertainty is a central feature of clinical judgment: the skilled physician
has learned when to take risks to increase certainty and when to simply tolerate uncertainty.
—Riegelman, 1990
Reproduced from: Riegelman, R. Studying a study and testing a test. Boston: Little, Brown, 1990.
In addition to clinical data about the patient, nonclinical data such as age,
gender, and working conditions also may be considered. The estimated
probability of the target disease will guide the doctor in choosing the most
appropriate action. The physician may perform additional diagnostic tests,
initiate therapeutic interventions, or, perhaps most importantly, may decide to
refrain from further diagnostic or therapeutic actions for that disease (e.g., when
the probability of that disease is considered low enough) and possibly search for
other underlying diseases [Ferrante di Ruffano et al., 2012]. The diagnostic
workup is a continuing process of updating the probability of the target disease
presence given all available documented information on the patient. The goal of
this workup is to achieve a relatively high or a relatively low probability of a
certain diagnosis, that is, to cross the threshold probability below which
(threshold A) or above which (threshold B) a doctor is confident enough about
the absence or presence of a certain diagnosis
to guide clinical decisions. Threshold probabilities are determined by the
consequences of a false-positive or false-negative diagnosis. These critically
depend on the anticipated course or prognosis of the diagnosis considered and
the potential beneficial and adverse effects of possible additional diagnostic
procedures or treatments. Importantly, these two thresholds, A and B, are
commonly implicitly defined in daily practice and will often vary between
doctors. Often, history taking and physical examination already provide
important and sufficient diagnostic information to rule in or rule out a disease
with enough confidence so that the estimated probability of presence of the
disease is below A or above B (see Figure 2–1).
διάγνωσις
The term diagnosis is a compound of the Greek words διά (dia), which means apart or distinction, and
γνῶσις (gnosis), which means knowledge. Diagnosis in medicine can be defined as “the art of
distinguishing one disease from the other.” (Dorland WAN. The American Illustrated Medical
Dictionary, 20th ed. Philadelphia, London: WB Saunders Company; 1944). In clinical practice a
diagnosis does not necessarily imply a well-defined, pathophysiologically distinct, disease entity, such
as acute myelocytic leukemia; many diagnoses are set on a much more aggregate level, notably in the
beginning of the diagnostic process. For example, a physician on weekend call who speaks to a patient
with dyspnea or their family member will first try to set or rule out the diagnosis, “a condition
requiring immediate action,” before a more precise diagnosis can be made, usually at a later stage. The
precision of the diagnosis also depends on the clinical setting. In primary care there often is no need
for a very specific diagnosis to decide on the next step (for example, an antibiotic prescription for a
patient with the diagnosis of “probable pneumonia” based on signs and symptoms only), whereas in an
intensive care setting in a tertiary care hospital, with more virulent bacteria, more antibiotic resistance,
and more immunocompromised and seriously ill patients, a more specific diagnosis may be required
(“vancomycin-resistant pneumococcal ventilator-associated pneumonia”) involving imaging
techniques, serology, cultures, and resistance patterns.
But when the probability of the disease is estimated to lie in the grey middle
area (between A and B), additional diagnostic tests are commonly ordered to
decrease the remaining uncertainty about the presence or absence of the disease.
Typically, this additional testing first includes simple, easily available tests such
as blood and urine markers or simple imaging techniques like chest x-ray. If
after these tests are conducted doubt remains (i.e., the probability of disease
presence has not yet crossed the thresholds A or B), more invasive and costly
diagnostic procedures are applied such as magnetic resonance imaging (MRI),
computed tomography (CT), or positron emission tomography (PET) scanning,
arthroscopy, and/or biopsy. This process of diagnostic testing ends when the
estimated probability of the target disease becomes sufficiently higher or lower
than the A or B threshold to guide medical action.
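One common way to formalize such thresholds, offered here as an illustrative assumption rather than as the book's own formula, is the classic decision-analytic treatment threshold, in which the threshold probability depends on the anticipated benefit of treating the diseased and the harm of treating the non-diseased:

```python
def treatment_threshold(benefit: float, harm: float) -> float:
    """Decision-analytic treatment threshold: treat when P(disease)
    exceeds harm / (harm + benefit), with benefit = net gain of treating
    the diseased and harm = net loss of treating the non-diseased
    (both on the same, arbitrary utility scale)."""
    return harm / (harm + benefit)

# Hypothetical utilities: a large benefit relative to harm pushes the
# threshold down, so treatment starts at a lower disease probability.
print(treatment_threshold(benefit=10, harm=1))   # ~0.09
print(treatment_threshold(benefit=1, harm=4))    # 0.8
```

On this account, a further test is worth ordering only when its result could move the estimated probability of disease across such a threshold.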
In the example of our patient with complaints of dyspepsia, history taking and
physical examination apparently did not provide the doctor with enough
information to decide about the initiation of therapeutic interventions, for
example, symptomatic treatment with acid-suppressing agents or, alternatively,
triple therapy to treat an underlying H. pylori infection. In view of the patient
burden of invasive H. pylori testing (i.e., gastroscopy with biopsy) in
combination with the relatively mild complaints and potential benefits of H.
pylori–targeted therapy, the physician decided to perform a noninvasive serology
test, although this test is considered less accurate than the gastroscopy.
Apparently, the negative test results indeed convinced the physician that the
probability of H. pylori ulcer disease was lower than the clinically relevant
threshold (e.g., 10 or 20%) because triple therapy targeted at H. pylori was not
initiated. Instead, symptomatic treatment, an acid-suppressing drug, was
prescribed. The probability of coronary heart disease—one of the differential
diagnoses of a patient with these complaints—as the underlying cause of the
complaints also was considered to be very low from the start (far below
threshold A for this disease), such that no additional tests for that diagnosis were
ordered.
This example may seem subjective, not quantitative and not evidence based,
but the diagnostic process in clinical practice is often just like that. In contrast to
many therapeutic interventions, quantitative evidence of the value of diagnostic
tests and certainly of the added value of a test beyond previous, more simple test
results, is often lacking [Linnet et al., 2012; Moons et al., 2012c]. Given the
importance of diagnosis in everyday practice, there is an urgent need for
research providing such quantitative knowledge [Grobbee, 2004; Knottnerus,
2002b].
The diagnostic process thus is a multivariable concern. It typically involves
the documentation and interpretation of multiple test results (or diagnostic
determinants), including nonclinical patient information [Moons et al., 1999]. In
practice, hardly any diagnoses are based on a single diagnostic test. The number
of diagnostic tests applied in everyday practice may differ considerably and
depends, for example, on the targeted disease, patient characteristics, and the
diagnostic certainty required to decide on patient management (see Box 2–3).
Importantly, a natural hierarchy of testing exists. Almost without exception, any
diagnostic workup starts with nonintrusive tests such as history taking and
physical examination. Although one could argue about whether these should be
considered “tests,” we will treat them as such here, as each consecutive finding
will influence the probability of disease, just as a blood test would. This is
followed by simple laboratory or imaging tests, and eventually more
burdensome and expensive tests, such as imaging techniques requiring contrast
fluids or biopsies. Subsequent test results are always interpreted in the context of
previous diagnostic information [Moons et al., 1999; Moons & Grobbee, 2002a].
For example, the test result “presence of chest pain” is obviously interpreted
differently in a healthy 5-year-old girl than in a 60-year-old man with a history
of myocardial infarction. The challenge to the physician lies in predicting the
probability of the absence or presence of a certain target disease based on all
documented test results. This requires knowledge about the contribution of each
test result to the probability estimation. The diagnostic value of the H. pylori test
in the earlier example is negligible if it adds nothing to the findings offered by
the few minutes of history taking and physical examination, information that is
always acquired by physicians anyway. More technically, the H. pylori test is
worthless if the test result does not change (increase or decrease) the probability
of presence of peptic ulcer disease as based on the results from history taking
and physical examination. Importantly, in case the next step in clinical
management is already decided upon (when the disease probability is below A or
above B in Figure 2–1), one may, and perhaps should, refrain from additional
testing.
Primum non nocere (first do no harm) refers to the principle that doctors should always take into
account the possible harm of their actions to patients, and that an intervention with an obvious
potential for harm should not be initiated, notably when the benefits of the intervention are small or
uncertain.
Although this Hippocratic principle is most often applied to discussions on the effects of
therapeutic interventions, it is equally applicable to diagnostic tests, especially for the more invasive
and burdensome ones. When the course of management for a patient has been determined, additional
diagnostic tests obviously have no benefit and can therefore only be harmful, albeit sometimes to the
healthcare budget only. In daily practice many diagnostic tests are being performed that have no
potential helpful consequences for patient management. Especially when additional test ordering is
relatively easy, for example, serum parameters and imaging such as x-rays, the potential consequences
for patient management, as well as possible harm, are not always taken into account. In a patient with
a rib contusion as a result of a fall, an x-ray to rule out a rib fracture is useless, because the test result
will not influence treatment (i.e., rest and painkillers). The challenge to the physician in any diagnostic
process thus not only lies in choosing the optimal diagnostic tests and in what order, but also in
knowing when to stop testing.
The works of the 18th century English pastor and mathematician, Thomas
Bayes, have been instrumental in the development of a more scientific approach
toward the diagnostic process in clinical practice. He established a mathematical
basis for diagnostic inference. Bayes recognized the sequential and probabilistic
nature of the diagnostic process. He emphasized the importance of prior
probabilities, that is, the probability of the presence of a target diagnosis before
any tests are performed. He also recognized that, based on subsequent test
results, doctors will update this prior probability to a posterior probability. The
well-known Bayes’ rule formally quantifies this posterior probability of disease
presence given the test result, based on the prior probability of that disease and
the so-called diagnostic accuracy estimates (such as sensitivity and specificity or
likelihood ratio) of that test (see Box 2–4). Although it has repeatedly been
shown that this mathematical rule often does not hold—because the underlying
assumption of constant sensitivity and specificity or likelihood ratio across
patient subgroups is not realistic in most situations [Detrano et al., 1988;
Diamond, 1992; Hlatky et al., 1984; Moons et al., 1997]—the rule has been
crucial in understanding the probabilistic and stepwise nature of diagnostic
reasoning in clinical practice.
We should emphasize that setting a diagnosis is not itself a therapeutic
intervention. It is a vehicle to inform patients and guide patient management
[Biesheuvel et al., 2006; Bossuyt et al., 2012]. An established diagnosis is a label
that, despite being highly valued by medical professionals, is of no direct
consequence to a patient other than to obtain a first estimate of the expected
course of the complaints and to set the optimal management strategy.
Accordingly, a diagnostic test commonly has no direct therapeutic effects and
therefore does not directly influence a patient’s prognosis. Once a diagnosis, or
rather the probability of the most likely diagnosis, is established and an
assessment of the probable course of disease in the light of different treatment
alternatives (including no treatment) has been made, the optimal treatment
strategy will be chosen to eventually improve the patient’s prognosis. There are
also other pathways through which a diagnostic test may affect a patient’s health
[Ferrante di Ruffano et al., 2012]. Knowledge of specific test results or disease
presence may change the patient’s (and the physician’s) expectations and
perceptions, and test results may shorten the time between symptom onset and
treatment initiation, as well as improve treatment adherence. Finally, a
diagnostic test may have direct therapeutic properties and change patient
outcomes. Such procedures are rare, but salpingography to determine patency of
the uterine tubes is an example.
Finally, the difference between diagnosing and screening for a disease should
be recognized. The former starts with a patient presenting with particular
symptoms and signs suggestive of a particular disease and is inherently
multivariable. Screening for a disease typically starts with asymptomatic
individuals and is commonly univariable. Examples include phenylketonuria
screening in newborns and breast cancer screening in middle-aged women,
where a single diagnostic test is performed in all subjects irrespective of
symptoms or signs. In this chapter, we will deal with diagnosing exclusively.
BOX 2–4 Example of a Two-by-Two Table with Test Results and Bayes’ Rule
Test characteristics of the test (T), N-terminal pro B-type natriuretic peptide (NT-proBNP; cut-off 36
pmol/L), in the detection of heart failure (D) in primary care patients with conditions known to be
associated with a high prevalence of heart failure.

                             NT-proBNP positive (T+)   NT-proBNP negative (T−)   Total
Heart failure present (D+)             9                          0                9
Heart failure absent (D−)             69                         55              124
Total                                 78                         55              133

where positive predictive value = P(D+|T+) = 9/78 = 12%; negative predictive value = P(D−|T−) =
55/55 = 100%; sensitivity = P(T+|D+) = 9/9 = 100%; specificity = P(T−|D−) = 55/124 = 44%;
likelihood ratio of a positive test (LR+) = P(T+|D+)/[1 − P(T−|D−)] = (9/9)/(69/124) = 1.8; likelihood
ratio of a negative test (LR−) = [1 − P(T+|D+)]/P(T−|D−) = (0/9)/(55/124) = 0.
Bayes’ rule:

P(D+|T+) = [P(D+) × P(T+|D+)] / [P(D+) × P(T+|D+) + P(D−) × P(T+|D−)]    (1)

and

P(D−|T+) = [P(D−) × P(T+|D−)] / [P(D+) × P(T+|D+) + P(D−) × P(T+|D−)]    (2)

Alternative (so-called odds) notation of Bayes’ rule, obtained by dividing (1) by (2):

P(D+|T+) / P(D−|T+) = [P(D+) / P(D−)] × [P(T+|D+) / P(T+|D−)]

that is, posterior odds = prior odds × likelihood ratio of the test result.

For sequential diagnostic tests, Bayes’ rule theoretically can be simply extended:

P(D+|T1,T2,T3) / P(D−|T1,T2,T3) = [P(D+) / P(D−)] × LR(T1) × LR(T2) × LR(T3)

Note that this form of Bayes’ rule assumes that the results of test 1 to test 3 are independent of each
other. However, it has been shown that this assumption in practice typically does not hold, as test
results are often mutually related simply because they are reflections of the same underlying disease
(see text).
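The numbers in this box can be reproduced directly from the two-by-two counts. A minimal Python sketch (the variable names are ours) recomputes the accuracy measures and applies the odds form of Bayes’ rule:

```python
# Recomputing the Box 2-4 accuracy measures from the 2x2 counts
# (counts follow from the predictive values given in the box).
tp, fn = 9, 0      # heart failure present: NT-proBNP positive / negative
fp, tn = 69, 55    # heart failure absent:  NT-proBNP positive / negative

sens = tp / (tp + fn)        # 9/9    = 1.00
spec = tn / (fp + tn)        # 55/124 = 0.44
ppv  = tp / (tp + fp)        # 9/78   = 0.12
npv  = tn / (fn + tn)        # 55/55  = 1.00
lr_pos = sens / (1 - spec)   # = 1.8
lr_neg = (1 - sens) / spec   # = 0

# Odds form of Bayes' rule: posterior odds = prior odds x LR.
prior = (tp + fn) / (tp + fn + fp + tn)   # prevalence = 9/133
prior_odds = prior / (1 - prior)
post_odds = prior_odds * lr_pos
print(f"P(D+|T+) = {post_odds / (1 + post_odds):.2f}")  # ~0.12, matches the PPV
```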
P(T) = f(D)
where P(T), that is, the probability (0–100%) of the test result of the single index
test T, is studied as a function of the presence or absence of the target disease D.
In the case of a dichotomous test, this occurrence relation can be rewritten for
the estimation of sensitivity and specificity as:
P(T+) = f(D+),
P(T–) = f(D–),
where T+ and T– indicate a positive and negative index test result, respectively,
and D+ and D– the presence or absence of the target disease.
The occurrence relation of test research that focuses on predictive values of a
single test can be summarized as:
P(D) = f(T)
P(D+) = f(T+),
P(D–) = f(T–).
Are sensitivity and specificity given properties of a diagnostic test, and do predictive values critically
depend on the prevalence of the disease?
The common emphasis on sensitivity and specificity in the presentation of diagnostic studies is at least
partly attributable to the notion that predictive values critically depend on the population studied,
whereas sensitivity and specificity are considered by many to be constant [Moons & Harrell, 2003].
There is no doubt that predictive values of diagnostic tests are influenced by the patient domain. This
may be best illustrated by comparing the performance of a test in primary and secondary care. Because
of the inherent higher prevalence of the relevant disease in suspected patients in secondary care
compared to primary care (because of the selection of patients with a higher probability of disease for
referral), positive predictive values are commonly higher in secondary care (i.e., fewer false-positives)
than in primary care (more false-positives), while negative predictive values are usually higher in
primary care (fewer false-negatives). Sensitivity, specificity, and likelihood ratios indeed are not
directly influenced by the prevalence of the disease because these parameters are conditional upon the
presence or absence of disease. It has been shown extensively, however, that they do vary according to
differences in the severity of disease [Hlatky et al., 1984; Detrano et al., 1988; Diamond, 1992]. In
secondary care, for example, where more severely ill patients will be presented than in primary care,
higher levels of diagnostic markers of a particular disease (and thus more test positives) can be
expected among those with the disease than in primary care. This will result in a higher sensitivity in
secondary care than in primary care and a higher specificity in primary care [Knottnerus, 2002a]. That
sensitivity and specificity are not constant is illustrated in two studies by the same researchers on the
value of near-patient testing for Helicobacter pylori infection in dyspepsia patients. The sensitivity and
specificity in the primary care setting were 67% and 98%, respectively, while these values were 92%
and 90% in secondary care [Duggan et al., 1999; Duggan et al., 1996].
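To illustrate how predictive values shift with the patient domain while sensitivity and specificity are held fixed, the following Python sketch uses the sensitivity and specificity reported in the two Duggan studies together with purely hypothetical prevalences for the two settings:

```python
def predictive_values(sens: float, spec: float, prev: float):
    """Positive and negative predictive value at a given prevalence."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Sensitivity/specificity from the Duggan studies cited above; the
# prevalences (20% primary care, 50% secondary care) are hypothetical,
# chosen only to illustrate the direction of the effect.
for setting, sens, spec, prev in [("primary care",   0.67, 0.98, 0.20),
                                  ("secondary care", 0.92, 0.90, 0.50)]:
    ppv, npv = predictive_values(sens, spec, prev)
    print(f"{setting}: PPV = {ppv:.0%}, NPV = {npv:.0%}")
```

At the higher assumed prevalence, false positives become relatively rarer and false negatives relatively more common, exactly the pattern described in the box.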
The focus on the quantification of the value of a single test to diagnose or rule
out a disease and the common preoccupation of such research with a test’s
sensitivity and specificity are typical of prevailing diagnostic research [Moons et
al., 2004a; Moons et al., 2012c]. This is also illustrated by the following
statements found in classic textbooks in clinical epidemiology or biostatistics:
Identify the sensitivity and specificity of the sign, symptom, or diagnostic test you plan to use. Many
are already published and subspecialists worth their salt ought either to know them from their field
or be able to track them down [Sackett et al., 1985].
and
For every laboratory test or diagnostic procedure there is a set of fundamental questions that should
be asked. Firstly, if the disease is present, what is the probability that the test result will be positive?
This leads to the notion of the sensitivity of the test. Secondly, if the disease is absent, what is the
probability that the test result will be negative? This question refers to the specificity of the test
[Campbell & Machin, 1990].
DIAGNOSTIC RESEARCH
Because the object of diagnosis in practice is to predict the probability of the
presence of disease from multiple diagnostic test results, the design of diagnostic
research is very much determined by the understanding, if not mimicking, of
everyday practice [Moons & Grobbee, 2005]. In the following sections, the three
components of clinical epidemiologic diagnostic study design will be discussed:
theoretical design, design of data collection, and design of data analysis.
Theoretical Design
As mentioned earlier, the occurrence relation of diagnostic research is:
P(D) = f(T1, T2, T3, …, Tn)

where P(D) is the probability of the presence (i.e., prevalence) of the disease of interest and T1 … Tn
represent the diagnostic determinants to be assessed.

Illustration of the difference between (typical) diagnostic research, assessing the contribution of
multiple diagnostic determinants to the estimation (prediction) of the presence of a certain disease,
and diagnostic intervention research, aimed at estimating (in this case explaining) the effect of
diagnostic tests (plus subsequent interventions) on the patient’s prognosis. The latter type of research
becomes intervention research and requires taking extraneous determinants (i.e., confounders) into
account.

Diagnostic Research

P(Diagnosis) = f(T1, T2, T3, …, Tn)

The occurrence relation of diagnostic research covers only the first part of this scheme, up to and
including the diagnosis:

Diagnostic problem → diagnostic strategy → diagnosis → intervention → outcome

Diagnostic Intervention Research

P(Prognostic outcome) = f(T1, T2, T3, …, Tn, I | ED)

where the prognostic outcome could be any clinically relevant patient outcome, such as survival,
incidence of a specific outcome, duration of the complaints, or quality of life; T1 … Tn represent the
diagnostic determinants to be assessed; I is the intervention following diagnosis; and ED are
extraneous determinants (or confounders) that should be taken into account in this causal study.

The occurrence relation of a diagnostic intervention study covers this entire scheme:

Diagnostic problem → diagnostic strategy → diagnosis → intervention → outcome
Time
The object of the diagnostic process is cross-sectional by definition. In
diagnostic research the probability of the presence of a disease (prevalence) is
estimated, not its future occurrence. Accordingly, the data for diagnostic studies
are collected cross-sectionally. The determinant(s) (the diagnostic test results)
and the outcome (the presence or absence of the target disease as determined by
the so-called reference standard) are theoretically determined at the same time.
This is the moment that the patient presents with the symptoms or signs
suggestive of the disease (t = 0). Even when the assessment of all diagnostic
determinants to be studied takes some time and when it takes several days or
weeks before the definitive diagnosis becomes known, this time period is used to
determine whether at t = 0 the disease was present. Also, when a “wait and see”
period of several months (e.g., to see whether an underlying disease, such as
cancer, becomes manifest or whether targeted therapy has a beneficial effect) is
used to set the final diagnosis, these additional findings are used to establish the
diagnosis present at the time the patient presented with the symptoms (i.e., at t = 0)
[Reitsma et al., 2009]. Thus, in our view, diagnostic research is cross-sectional
research (time is zero). It should be noted, however, that others consider time to
be larger than zero when it takes some time to set the final diagnosis and, as a
consequence, they characterize the design of data collection as a follow-up or
cohort study.
Census or Sampling
Generally, diagnostic research takes a census approach in which consecutive
patients suspected of a certain disease and who fulfill the predefined inclusion
criteria are included. The potentially relevant diagnostic determinants as well as
the “true” presence or absence of the target disease are measured in all patients.
Sometimes, however, a sampling approach (i.e., a case-control study; see the
later chapter on case-control studies) can offer a valid and efficient alternative.
In a diagnostic case-control study (which is a cross-sectional case-control study),
all patients suspected of the target disease who are eventually diagnosed with the
disease (“cases”) are studied in detail, together with a sample of those suspected
of the disease who turn out to be free from the target disease (“controls”). This
implies that the outcome (reference standard) has to be assessed in all patients
suspected of the target disease (otherwise the cases cannot be identified and the
controls cannot be sampled), but that the diagnostic determinants only have to be
measured in cases and controls. As in diagnostic research using a census
approach, the goal is to obtain absolute probabilities of disease presence given
the determinants. Consequently, in the data analysis of a diagnostic case-control
study, the sampling fraction of the controls should always be accounted for. A
diagnostic case-control study offers a particularly attractive option when the
measurement or documentation of one or more of the diagnostic tests under
study are time consuming, burdensome to the patient, or expensive, such as
certain imaging tests [Rutjes et al., 2005]. Diagnostic case-control studies are
still relatively rare, despite their efficiency [Biesheuvel et al., 2008a]. In the
example in Box 2–7, a case-control approach was chosen to assess the added
value of cardiac magnetic resonance (CMR) imaging in diagnosing heart failure
in patients known to have chronic obstructive pulmonary disease. Because of the
costs, time, and patient burden involved, CMR measurements were performed in
all patients with heart failure (cases) but in only a sample of the remainder of the
participants (controls) [Rutten et al., 2008].
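A minimal Python sketch shows how the sampling fraction of the controls enters the analysis; the cohort and sample sizes loosely follow the design described above, but the test-result counts are invented for illustration:

```python
# Recovering an absolute probability from a diagnostic case-control
# study by weighting controls with the inverse of their sampling
# fraction. Cohort/sample sizes follow the Box 2-7 design; the
# test-positive counts below are hypothetical.
cases_total, controls_total = 83, 322        # full cohort
cases_sampled, controls_sampled = 37, 41     # subsets actually measured
f_cases = cases_sampled / cases_total        # sampling fraction, cases
f_controls = controls_sampled / controls_total

# Suppose a test is positive in 30 sampled cases and 10 sampled controls:
tp, fp = 30, 10
# Weight each subject by 1/f to reconstruct full-cohort counts; the
# positive predictive value then follows as an absolute probability:
tp_weighted = tp / f_cases
fp_weighted = fp / f_controls
print(f"PPV = {tp_weighted / (tp_weighted + fp_weighted):.2f}")
```

Without this reweighting, the apparent prevalence in the analysis set would be artificially inflated and the estimated probabilities of disease would not apply to the domain.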
Confusingly, diagnostic studies comparing test results in a group of patients
with the disease under study—often those in an advanced stage of disease—with
test results in a group of patients without this disease, often a group of healthy
individuals from the population at large, tend to be referred to as diagnostic case-
control studies [Rutjes et al., 2005]. Many of these studies are not case-control
studies, however, as there is no sampling of controls from the study base
[Biesheuvel et al., 2008a]. In addition, as discussed earlier, such studies will bias
the estimates of diagnostic accuracy of the tests being studied and compromise
the generalizability of the study results. This is because the cases and certainly
the healthy controls do not reflect the relevant patient domain, which is all those
suspected of having the disease for whom the tests are intended.
Experimental or Observational
Diagnostic research is typically observational research. In patients suspected of
the disease in daily practice, the diagnostic determinants of interest (most of
which will be measured in clinical practice anyway), including possible new
tests, will be measured and the presence of disease will be determined using the
reference standard. Such a cross-sectional study will be able to show which
combination of tests best predicts the presence of disease or whether a new test
improves diagnostic accuracy.
METHODS: Participants were recruited from a cohort of 405 patients aged 65 years or older with
mild to moderate and stable COPD. In this population, 83 (20.5%) patients had a new diagnosis of
CHF, all left-sided, established by an expert panel using all available diagnostic information, including
echocardiography. In a nested case-control study design, 37 consecutive COPD patients with newly
detected CHF (cases) and a random sample of 41 of the remaining COPD patients (controls) received
additional CMR measurements. The value of CMR in diagnosing heart failure was quantified using
univariable and multivariable logistic modeling in combination with area under the receiver operating
characteristic curves (ROC area).
RESULTS: The combination of CMR measurements of left-ventricular ejection fraction, indexed left-
and right-atrial volume, and left-ventricular end-systolic dimensions provided high added diagnostic
value beyond clinical items (ROC area = 0.91) for identifying CHF. Left-sided measurements of CMR
and echocardiography correlated well, including ejection fraction. Right-ventricular mass divided by
right-ventricular end-diastolic volume was higher in COPD patients with CHF than in those without
concomitant CHF.
Reproduced with permission of MOSBY, INC, from: Rutten FH, Vonken EJ, Cramer MJ, Moons KG,
Velthuis BB, Prakken NH, Lammers JW, Grobbee DE, Mali WP, Hoes AW. Cardiovascular magnetic
resonance imaging to identify left-sided chronic heart failure in stable patients with chronic obstructive
pulmonary disease. Am Heart J. 2008;156:506–512.
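An “added value” analysis along the lines described in the METHODS can be sketched as follows; the data are simulated and the predictor names are placeholders, so this only illustrates the comparison of ROC areas with and without the additional test:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical sketch: compare the ROC area of a model with clinical
# items only to a model that adds an imaging measurement. All data are
# simulated; coefficients and sample size are arbitrary.
rng = np.random.default_rng(1)
n = 300
clinical = rng.normal(size=(n, 3))   # e.g., history/examination findings
imaging = rng.normal(size=(n, 1))    # e.g., an ejection-fraction measure
logit = clinical @ np.array([0.6, 0.4, 0.3]) + 1.2 * imaging[:, 0]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

basic = LogisticRegression().fit(clinical, y)
full = np.hstack([clinical, imaging])
extended = LogisticRegression().fit(full, y)
print(f"ROC area, clinical only: "
      f"{roc_auc_score(y, basic.predict_proba(clinical)[:, 1]):.2f}")
print(f"ROC area, plus imaging:  "
      f"{roc_auc_score(y, extended.predict_proba(full)[:, 1]):.2f}")
```

The increase in ROC area of the extended over the basic model quantifies the diagnostic value of the new test beyond the information already available in practice.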
As discussed, setting a diagnosis is not an aim in itself, but rather a vehicle to
guide patient management and treatment in particular. The ultimate goal of
diagnostic testing is to improve patient outcomes. Hence, it has widely been
advocated that when establishing the accuracy of a diagnostic test or strategy, its
impact on patient outcomes also must be quantified. Consequently, it has been
proposed that experimental studies (diagnostic intervention studies comparing
two diagnostic strategies) be used to answer diagnostic research questions
[Bossuyt et al., 2012; Lord et al., 2006].
If a cross-sectional diagnostic study has indicated that the diagnostic test or
strategy improves estimation of the presence of the disease, the effect on patient
outcome can usually be validly established without the need for a diagnostic
intervention study [Koffijberg et al., 2013]. After all, earlier studies often
adequately quantified the effects on patient outcome of the available treatment(s)
for that disease. Using simple statistical or decision modeling techniques, one
can combine the results of the cross-sectional diagnostic accuracy study and
those of randomized therapeutic intervention studies. Hence, the effect on patient
outcome can be quantified if (1) diagnostic research has shown that the
diagnostic test or strategy improves diagnostic accuracy and (2) the effects of
available therapeutic interventions in that disease on patient outcome are known,
preferably from randomized trials. An example in which a randomized study was
not necessary to quantify the effect of the new test on patient outcome is a study
assessing whether an immunoassay test for the detection of H. pylori infection
can replace the established but more costly and invasive reference test (a
combination of rapid urease test, urea breath test, and histology) [Weijnen et al.,
2001]. The new test indeed provided similar diagnostic accuracy. As consensus
exists about the therapeutic management of patients infected with H. pylori
(based on randomized controlled trials establishing the efficacy of treatment
[McColl, 2002]), a subsequent diagnostic intervention study to quantify the
effects of using the new immunoassay test on patient outcome was not needed.
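The combination step can be as simple as the following Python sketch, in which a test’s sensitivity and specificity are linked to a treatment effect known from trials; every number is hypothetical:

```python
# Minimal sketch of the linked-evidence reasoning described above:
# combine a test's accuracy with a treatment effect known from
# randomized trials to estimate patient outcome under a test-and-treat
# strategy. All numbers are hypothetical.
prev = 0.30                # prevalence among suspected patients
p_cure_treated = 0.80      # from randomized trials of the treatment
p_cure_untreated = 0.40
harm_overtreatment = 0.05  # outcome loss when treating the non-diseased

def expected_outcome(sens: float, spec: float) -> float:
    """Expected proportion with a good outcome under test-and-treat."""
    diseased = prev * (sens * p_cure_treated + (1 - sens) * p_cure_untreated)
    non_diseased = (1 - prev) * (1 - (1 - spec) * harm_overtreatment)
    return diseased + non_diseased

print(f"new test (sens 0.90): {expected_outcome(0.90, 0.85):.3f}")
print(f"old test (sens 0.75): {expected_outcome(0.75, 0.85):.3f}")
```

If the new test detects more diseased patients at equal specificity, its effect on patient outcome follows directly from the trial evidence, without a new randomized comparison of the two diagnostic strategies.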
There are situations, however, in which diagnostic intervention studies are
needed to properly quantify the consequences of a novel diagnostic test or
strategy on patient outcome [Biesheuvel et al., 2006; Bossuyt et al., 2000; Lord
et al., 2006]. Notably, when a new diagnostic technology under study might be
“better,” to the extent that it provides new information potentially leading to
other treatment choices, than the existing tests or strategy, a randomized trial
may be useful. As described previously, functional imaging with PET in
diagnosing pancreatic cancer, for which CT is the current reference, is an
example. Also, when there is no direct link between the result of the new
diagnostic test under study and an established treatment indication, such as the
finding of uncalcified small nodules (less than 5.0 mm) when screening for lung
cancer with low-dose spiral CT scanning, an experimental approach quantifying
the effect on patient outcome may be required. When an acceptable reference
standard for a disease is lacking, for instance, in a diagnostic study in suspected
migraine or benign prostatic hyperplasia, a diagnostic intervention may also be
the best option. Finally, as mentioned earlier, the index test itself (e.g.,
salpingography in suspected tubal blockage) may have direct therapeutic effects.
When performing a diagnostic intervention study to determine the impact of a
diagnostic test or strategy on patient outcome, an initial diagnostic research
question is transformed into a therapeutic research question (with the goal of
establishing causality) with corresponding consequences for the design of the
study. A disadvantage of a randomized approach to directly quantifying the
contribution of a diagnostic test and treatment to the patient’s outcome is that it
often addresses diagnosis and treatment as a single combined strategy, a
“package deal.” This makes it impossible to determine afterward whether a
positive effect on patient outcome can be attributed solely to the improved
diagnostic accuracy or to the new subsequent treatment strategies.
Study Population
A diagnostic test or strategy should be able to distinguish between those with the
target disease and those without, among subjects representing the relevant
clinical domain. The domain is thus defined by patients suspected of having a
particular disease. Consequently, patients in whom the presence of disease has
already been established or in whom the probability of the disease is considered
high enough to initiate adequate therapeutic actions fall outside the domain,
similar to when the probability of disease is deemed sufficiently low to exclude
the diagnosis (see also Figure 2–1). Furthermore, we recommend that
investigators restrict domain definitions, and thus the study population, to the
setting or level of care (e.g., primary or secondary care), as the diagnostic
accuracy and combinations of these tests usually vary across care settings
[Knottnerus, 2002a; Oudega et al., 2005a]. This is a consequence of differences
in the distribution of severity of the disease across the different settings.
The population of a study could be defined as all consecutive patients
suspected of the disease of interest who present to one of the
participating centers during a defined period and in whom the additional
diagnostic tests under investigation are considered. Exclusion criteria should be
few to ensure wide applicability of the findings. They would typically include
alarm symptoms requiring immediate action or referral (e.g., melena in the
dyspepsia example in the beginning of this chapter) and contraindications for
one of the major diagnostic determinants (tests) involved (e.g., claustrophobia
when MRI assessments are involved). One could argue that “patients suspected
of the disease” as an inclusion criterion is too subjective. In many studies the
definition, therefore, includes symptoms and signs often accompanying the
disease. For example, a study to address the added value of a novel test to
diagnose or exclude myocardial infarction in the primary care setting could
include “patients with symptoms suggestive of myocardial infarction in primary
care.” Alternatively, the study population can be defined as “patients with chest
pain or discomfort in primary care” or a combination of the two: “patients with
chest pain or discomfort or other symptoms and signs compatible with a
myocardial infarction in primary care” [Bruins Slot et al., 2013].
Diagnostic Determinants
As the diagnosis in practice is typically made on the basis of multiple diagnostic
determinants, all test results that are (potentially) used in practice should be
considered and measured. In the earlier example of the H. pylori test to diagnose
peptic ulcer, the main signs and symptoms as well as the H. pylori test have to be
included as potential determinants. There is, however, a limit to the number of
tests that can be included in a study, because of logistics and the larger sample
size required with each additional test that is considered (see the following
discussion). Hence, the choice of the determinants to be included should be
based on both the available literature and a thorough understanding of clinical
practice.
To optimize the applicability of the findings of diagnostic research, the
assessment of the diagnostic determinants should resemble the quality of this
information in daily clinical practice. Consequently, one could argue that all
determinant information should be collected according to usual care, without
efforts to standardize or improve the diagnostic assessment. In a study involving
multiple sites and physicians, this may significantly increase inter-observer
variability in diagnostic testing, which means the potential diagnostic value of
test results could be underestimated, although the study would indicate the
current average diagnostic value of the tests in clinical practice. This effect is
likely to be larger for more subjective tests, such as auscultation of the lungs. An
alternative would be to train the physicians to apply a standardized diagnostic
assessment. One may also ask experts in the field to do the diagnostic tests under
study. This, however, has the disadvantage that it will likely overestimate the
diagnostic accuracy of the tests in daily practice and reduce the applicability of
the study results. For a multicenter, multi-doctor study, we recommend a
pragmatic approach where all diagnostic determinants are assessed as much as
possible according to daily practice and by the practicing physicians involved,
with some efforts to standardize measurements.
Outcome
The outcome in diagnostic research is typically dichotomous: the presence or
absence of the disease of interest (e.g., myocardial infarction or pneumonia). As discussed, in clinical practice more than one disease is commonly considered in a patient presenting with particular symptoms and signs; this is the so-called differential diagnosis [Sackett et al., 1985]. Ideally, then, the outcome should be polytomous rather than dichotomous, although in daily practice sequential dichotomous steps are often taken: the most likely (or most severe) disease in the
differential diagnosis is diagnosed or excluded before the next diagnosis is
considered. Diagnostic research with polytomous or even ordinal outcomes is
relatively rare and the data analysis is more complicated [Harrell, 2001]. Current
methodologic developments in this field no doubt will increase the use of
polytomous outcomes in diagnostic research [Biesheuvel et al., 2008b; Roukema
et al., 2008; Van Calster et al., 2012].
In diagnostic research, as in each epidemiologic study, adequate assessment of
the outcome is crucial. The outcome should be measured as accurately as
possible and with the best available methods. The term most often applied to
indicate the ideal diagnostic outcome is gold standard, referring to the virtually
nonexistent situation where measuring the disease is devoid of false-negatives
and false-positives [Reitsma et al., 2009]. More recently, the more appropriate
term reference standard was introduced to indicate the “non-golden” properties
of almost all diagnostic procedures in today’s practice, including procedures like
biopsy combined with histologic confirmation for cancer diagnoses. Very few
diagnostic procedures do not require human interpretation. Deciding on the
reference standard is a crucial but difficult task in diagnostic research. The
reference standard is the best procedure(s) that exists at the time of study
initiation to determine the presence or absence of the target disease. The word
best in this context means the measurement of disease that best guides
subsequent medical action. Hence, the reference method to be used in a
diagnostic study may very well include one or a combination of expensive and
complicated tests that are not routinely available or applied in everyday clinical
practice. Note that this contrasts with the assessment of the diagnostic
determinants of interest, which should more or less mimic daily practice to
enhance generalizability of study results to daily practice.
Preferably, the final diagnosis should be established independent of the results
of the diagnostic tests under study. Commonly, the observer who assesses the
final diagnosis using the reference method is blinded for all of the test results
under study. If this blinding is not guaranteed, the information provided by the
preceding tests may implicitly or explicitly be used in the assessment of the final
diagnosis. Consequently, the two information sources cannot be distinguished
and the estimates of accuracy of the tests being studied may be biased. Although theoretically this bias can lead to both an under- and overestimation of the accuracy of the evaluated tests, it commonly results in an overestimation; the
final diagnosis may be guided to some extent by the results of the test under
evaluation, artificially decreasing the number of false-positive and false-negative
results. This kind of bias is often referred to as diagnostic review or
incorporation bias [Begg & Metz, 1990; Ransohoff & Feinstein, 1978; Sackett
et al., 1985; Swets, 1988].
The possibility of blinding the outcome assessors for the results of the tests
under study depends on the type of reference standard applied. It is surely
feasible if the reference standard consists of a completely separate test, for
example, imaging techniques or serum levels of a marker. Because this kind of
reference test is not available for many diseases (e.g., psychiatric disorders), or is
infeasible or even unethical to apply in all cases (notably when the test is
invasive and patient burdening), next best solutions are often sought. In
particular, an approach involving a so-called consensus diagnosis determined by
an outcome panel often is applied; this often is combined with a clinical follow-
up period to further promote an adequate assessment of the presence of the
disease [Begg, 1990; Reitsma et al., 2009; Swets, 1988]. Outcome panels usually consist of an uneven number of experts on the clinical problem, so that a majority judgment is always possible. During
consensus meetings, the panel establishes the final diagnosis in each study
patient based on as much patient information as possible. This includes
information from patient history, physical examination, and all additional tests.
Often, any clinically relevant information (e.g., future diagnoses, response to
treatment targeted at the outcome disease) from each patient during a
prespecified follow-up period is also forwarded to the outcome panel in order to
allow for a better judgment on whether the target disease was present at the time
of (initial) presentation [Moons & Grobbee, 2002b]. When using a consensus
diagnosis based on all available information as the reference standard, the test
results studied as potential diagnostic determinants are usually also included
(“incorporated”) in the outcome assessment, leading to a risk of incorporation
bias. To fully prevent incorporation bias, the outcome panel should decide on the
final diagnosis without knowledge of the results of the particular test(s) under
study. This may seem an attractive solution, but limiting the information
forwarded to the panel may increase misclassification in the outcome
assessment. There is no set solution to this dilemma, which is inherent in using a consensus diagnosis as the reference standard. The pros and cons of excluding or
including the results from all or some of the tests under study in the assessment
of the final diagnosis by the outcome panel should be weighed in each particular
study. Consider a study that aims to assess the diagnostic value of NT-proBNP
serum levels or echocardiography in addition to signs and symptoms in patients
suspected of heart failure. As in several earlier studies on suspected heart failure,
an outcome panel can determine the “true” presence or absence of heart failure
[Moons & Grobbee, 2002b; Rutten et al., 2005b]. When studying the accuracy of
a test known to receive much weight in the consensus judgment (in this example
echocardiography and to a lesser extent NT-proBNP levels), it is preferable not
to use these tests in the assessment of the final diagnosis. Doing so requires that
the remaining diagnostic information, including clinical follow-up data, enable
the panel to accurately diagnose patients. Lack of availability of the NT-proBNP
levels will probably not pose a major problem, but withholding the
echocardiographic findings, a key element in the diagnosis of heart failure, from
the outcome panel may seriously endanger the validity of the outcome
assessment. Consequently, we may be able to quantify the added value of NT-
proBNP levels but not the added value of the echocardiogram [Kelder et al.,
2011]. Alternatively, the outcome panel could first judge the presence or absence of heart failure without the echocardiographic findings and subsequently judge it with those results included. Comparing the outcome
classification according to both approaches may provide some insight into the
effect of incorporation bias on the (boundaries of the) accuracy of the test under
study, in this case echocardiography.
As mentioned earlier, in certain situations it is not feasible and may even be
unethical to apply the best available reference method in all study patients at the
time of presentation, in particular when the reference test is invasive and may
lead to complications (such as pulmonary angiography in suspected pulmonary
embolism). Also in studies in suspected malignancies, it is often difficult to
establish or rule out a malignancy at t = 0, even when multiple tests, including
sophisticated imaging techniques, are performed. Under such circumstances, a
clinical follow-up period may offer useful information. It should be emphasized here that the clinical follow-up period serves to assess whether the disease of interest was already present at the time of presentation of the complaints (t = 0): if the disease becomes manifest during follow-up, it is assumed from the natural history of the (untreated) target disease that it was present but unrecognized at t = 0. A clinical
follow-up period to establish a diagnosis has been successfully applied in studies
on the accuracy of diagnostic tests for a variety of diseases, including pulmonary
embolism, bacterial meningitis, and certain types of cancer. For example, Fijten
et al. [1995] studied which signs and symptoms were helpful in ruling out
colorectal cancer in patients presenting with fecal blood loss in primary care. It
was impossible to perform colonoscopies and additional imaging or surgery in
all participants to rule in or out a malignancy at t = 0. Therefore, all patients
were followed for an additional period of at least 12 months after inclusion in the
study, assuming that colorectal cancer detected during the follow-up period
would indicate presence of the cancer at baseline. Obviously, the follow-up
period should be limited in length, especially in diseases with a relatively high
incidence, to prevent new cases from being counted as prevalent ones. The
acceptable clinical follow-up period varies and depends on the natural history
and incidence of the disease studied. A 6- to 12-month period is often
encountered in the literature for cancer studies. For venous thromboembolism
this is usually 3 months, and in a study of bacterial meningitis it was 1 week.
Besides documenting the natural history of a disease during such a clinical
follow-up period, one may also document the response to treatment targeted at
the outcome diagnosis and use this information to determine whether the target
disease was present at t = 0. Response to therapy may be helpful in excluding (in
the case of no response) or confirming (in the case of a beneficial effect on
symptoms) the target disease. In these situations, one should be aware that
response following therapy provides no definite proof of the disease, because the
response could result from other factors. Similarly, lack of response does not
preclude the presence of the disease at t = 0. Examples of using the response to
empirical treatment to confirm a diagnosis are studies in suspected heart failure
[Kelder et al., 2011].
Partial and differential outcome verification. Ideally, the index tests and
reference standard are determined in all study participants and in a standardized
manner. For various reasons, however, the reference standard may not have been
performed in all patients. Such partial outcome verification might be attributable
to ethical concerns or patient or physician preferences (e.g., when the reference
test is considered unnecessary or too burdensome, or because it is simply impossible to perform in all patients; for example, biopsy and histology as the reference standard in diagnosing cancer can only be performed in subjects with detected nodes or hot spots on previous testing [de Groot et al., 2011a; Reitsma et al., 2009]). Partial outcome verification (i.e., partially missing
outcome data) often occurs not completely at random but selectively. The reason
for performing the reference standard is typically related to the test results of
preceding index tests. Such partial verification may lead to biased estimates of
the accuracy of the index tests if only the selective subsample of patients in
whom the reference test was executed are included in the analysis. This is known
as partial verification bias, work-up bias, or referral bias [Rutjes et al., 2007].
Often researchers use a different, second best, reference test to verify the target
disease presence in those subjects for whom the first, preferred reference test
cannot be used [de Groot et al., 2011b]. Such differential verification will lead to
bias when the results of the two reference tests are treated in the analysis as
interchangeable, while both are of different quality in classifying the target
disease or may even define the target disease differently. Hence, simply
combining all disease outcome data in a single analysis as if both reference tests
yield the same disease status does not reflect the “true” pattern of disease
presence. Such an estimation of disease prevalence thus differs from what one
would have obtained if all subjects had undergone the preferred reference
standard. Consequently, all estimated measures of the accuracy of the diagnostic
index tests will be biased; this is called differential verification bias [de Groot et
al., 2011b; Reitsma et al., 2009]. Several solutions to deal with partial and differential outcome verification and the resulting bias have been proposed [de Groot et al., 2011b, 2011c]. One solution is multiple imputation of missing outcomes.
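To make this concrete, below is a minimal sketch of multiple imputation of a partially verified outcome. The data set, variable names, and verification mechanism are hypothetical, and scikit-learn's IterativeImputer is one illustrative choice of imputation engine, not the specific method of the cited papers.

```python
# A minimal sketch: multiple imputation of a partially verified outcome.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "d_dimer": rng.normal(500.0, 150.0, n),
    "calf_difference": rng.integers(0, 2, n).astype(float),
})
# Simulated "true" disease status, driven by the index tests.
logit = -5.0 + 0.007 * df["d_dimer"] + 1.0 * df["calf_difference"]
df["disease"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit))).astype(float)
# Partial verification: the reference standard is performed more often
# after an abnormal index test, so outcomes are selectively missing.
verified = rng.random(n) < np.where(df["d_dimer"] > 500, 0.9, 0.4)
df.loc[~verified, "disease"] = np.nan

# Create several completed data sets; the diagnostic analysis is then
# run on each and the estimates pooled (e.g., with Rubin's rules).
completed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    filled["disease"] = (filled["disease"] > 0.5).astype(int)
    completed_sets.append(filled)
```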
Univariable Analysis
Before proceeding to multivariable analyses, we recommend first performing a
univariable analysis in which each individual potential determinant is related to
the outcome. Biostatisticians often refer to this type of analysis as a bivariate
analysis because the association between two variables (determinant and
outcome) is studied. In diagnostic research, categorical determinants with more
than two categories and continuous determinants are often dichotomized by
introducing a threshold. This commonly leads to loss of information [Royston et al., 2006]. For example, dichotomizing body temperature (> 37.5°C as test-positive, ≤ 37.5°C as test-negative) implies that the diagnostic implications for a person with a temperature of 38.0°C are the same as for a person with a temperature of 41.0°C. Moreover, the resulting association heavily depends on the threshold applied. This may explain why different studies of the
same diagnostic test yield different associations. The aim of univariable analysis
is to obtain insight into the association of each potential determinant and the
presence or absence of the disease. Although it is common to include in the multivariable analysis only those determinants that show statistical significance (P < 0.05) in the univariable analysis, this practice may lead to optimistic estimates of the accuracy of a diagnostic model [Harrell, 2001; Steyerberg et al., 2000; Sun et al., 1996]. This chance of "optimism" increases when the number of potential determinants clearly exceeds the "1 to 10 rule" described earlier. It is therefore recommended to use a more liberal selection criterion, for example, P < 0.20, 0.25, or an even higher threshold [Steyerberg, 2009]. The downside is that more determinants will qualify for multivariable analysis, increasing the need for so-called internal validation and penalization or shrinkage methods that we will
discuss later in this chapter. Alternatively, univariable analyses may guide
combination and clustering of determinants, ideally influenced by prior
knowledge of the most important determinants. Methods have been developed to
incorporate prior knowledge into the selection of predictors [Harrell, 2001;
Steyerberg et al., 2004]. Finally, univariable analysis is useful to determine the
number of missing values for each determinant and for the outcome, and
whether these missing values are missing completely at random (MCAR),
missing at random (MAR), or missing not at random (MNAR).
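A minimal sketch of such a univariable screen follows, using the liberal P < 0.20 criterion and reporting the fraction of missing values per determinant. The DataFrame and column names are hypothetical.

```python
# A minimal sketch: univariable screening with a liberal P-value
# criterion, plus a per-determinant missing-data check.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def univariable_screen(df: pd.DataFrame, outcome: str,
                       p_threshold: float = 0.20) -> list:
    """Relate each candidate determinant to the outcome one at a time."""
    selected = []
    for col in df.columns.drop(outcome):
        data = df[[col, outcome]].dropna()  # complete cases per determinant
        fit = sm.Logit(data[outcome],
                       sm.add_constant(data[[col]])).fit(disp=0)
        odds_ratio = float(np.exp(fit.params[col]))
        p_value = float(fit.pvalues[col])
        print(f"{col}: OR={odds_ratio:.2f}, P={p_value:.3f}, "
              f"missing={df[col].isna().mean():.1%}")
        if p_value < p_threshold:  # liberal criterion, not P < 0.05
            selected.append(col)
    return selected
```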
Multivariable Analysis
Diagnostic practice is probabilistic, multivariable, and sequential. Consequently,
a multivariable approach is the main component of the data analysis in
diagnostic research. In the multivariable analysis, the probability of disease is
related to combinations of multiple diagnostic determinants, in various orders.
Multivariable analysis can accommodate the order in which tests are used in
practice and will show which combination of tests truly contributes to the
diagnostic probability estimation. To address the chronology and sequence of
testing in clinical practice, the accuracy of combinations of easily obtainable
determinants should be estimated first and subsequently the added value of the
more burdensome and costly tests [Moons et al., 1999].
Logistic regression modeling is the generally accepted statistical method for
multivariable diagnostic studies with a dichotomous outcome [Harrell, 2001;
Hosmer & Lemeshow, 1989]. Other statistical methods, such as neural networks
and classification and regression trees (CART), have been advocated, but these
received much criticism as both often result in overly optimistic results [Harrell,
2001; Tu, 1996]. Therefore, we will focus on the use of logistic regression
models for multivariable diagnostic research.
The determinants included in the first multivariable logistic regression model
are usually selected on the basis of both prior knowledge and the results of
univariable analysis. Also, the first model tends to concentrate on determinants
that are easy to obtain in practice. Hence, this model typically includes test
results from history taking and physical examination [Moons et al., 2004a;
Moons et al., 1999]. A logistic regression model estimates the log odds (logit) of the disease probability as a function of one or more predictors:

ln[P / (1 − P)] = b0 + b1T1 + b2T2 + … + bnTn

where P is the probability that the disease is present, T1 … Tn are the results of the tests (determinants) included in the model, b0 is the intercept, and b1 … bn are the estimated regression coefficients.
FIGURE 2–3 Example of an ROC curve of the reduced multivariable logistic regression model, including the same six determinants as in Figure 2–2. The ROC area of the "reduced history + physical model" was 0.70 (95% confidence interval [CI], 0.66–0.74) and of the same model extended with the D-dimer assay 0.84
(95% CI, 0.80–0.88).
The next step is to extend this model by the subsequent test from the workup
in our example study on DVT; this was the D-dimer assay. This allows
estimation of the assay’s diagnostic value in addition to the items from history
taking and physical examination. In this analysis, the same statistical procedures as just described are used. Whether the D-dimer test is a truly independent predictor is again assessed with the likelihood ratio test [Harrell, 2001; Hosmer & Lemeshow, 1989]. Next, the calibration and discrimination of the
extended model (including the “reduced history + physical model” items plus the
D-dimer assay) are examined. The calibration of this extended model was good
(data not shown), and the discriminatory value was high (ROC area = 0.84;
Figure 2–3). Methods have been proposed to formally estimate the precision of
differences between ROC areas, in this case 0.84 − 0.70 = 0.14, by calculating the
95% confidence interval (CI) or P-value of this difference. In this calculation,
one needs to account for the correlation between both models (“tests”) as they
are based on the same subjects [Hanley & McNeil, 1983]. In our example study,
the CIs did not overlap, indicating a significant added value of the D-dimer assay
at the 0.05 level.
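The sketch below illustrates this model-extension step under stated assumptions: a hypothetical DataFrame with the history and physical items and a D-dimer column, a likelihood ratio test for the added determinant, and apparent ROC areas for both models.

```python
# A minimal sketch: quantifying the added value of a subsequent test
# (here a D-dimer assay) beyond history and physical examination.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score

def added_value(df: pd.DataFrame, outcome: str,
                base_items: list, new_test: str) -> None:
    y = df[outcome]
    X_reduced = sm.add_constant(df[base_items])
    X_extended = sm.add_constant(df[base_items + [new_test]])
    fit_reduced = sm.Logit(y, X_reduced).fit(disp=0)
    fit_extended = sm.Logit(y, X_extended).fit(disp=0)
    # Likelihood ratio test: is the extended model significantly better?
    lr_statistic = 2.0 * (fit_extended.llf - fit_reduced.llf)
    p_value = chi2.sf(lr_statistic, df=1)
    # Apparent ROC areas of both models. A formal test of the difference
    # must account for the correlation between the two models, as both
    # are estimated on the same subjects [Hanley & McNeil, 1983].
    auc_reduced = roc_auc_score(y, fit_reduced.predict(X_reduced))
    auc_extended = roc_auc_score(y, fit_extended.predict(X_extended))
    print(f"LR test P = {p_value:.4f}; "
          f"ROC area {auc_reduced:.2f} -> {auc_extended:.2f}")
```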
This process of model extension can be repeated for each subsequent test.
Moreover, all of these analytic techniques can be used to compare the difference
in the added diagnostic value of two tests separately when the aim is to choose
between the two or to compare the diagnostic accuracy of various test orders.
We should emphasize that the ROC area of a multivariable diagnostic model or even of a single diagnostic test has no direct clinical meaning. It summarizes the overall discriminative value of a diagnostic model or strategy and is chiefly useful for comparing models or strategies.
The DVT example exemplifies the need for multivariable diagnostic research.
A comparison between models including fewer or additional tests enables the
investigator to learn not only about the added value of tests but also about the
relevance of moving from simple to more advanced testing in practice. It should
be noted that the data analysis as outlined here only quantifies which subsequent
tests have independent or incremental value in the diagnostic probability
estimation and thus should be included in the final diagnostic model from an
accuracy point of view. It might still be relevant to judge whether the increase in
accuracy of the test outweighs its costs and patient burden. This weighing can be
done formally, including a full cost-effectiveness or cost-minimization analysis
accounting for the consequences and utilities of false-positive and false-negative
diagnoses [Moons et al., 2012b; Vickers & Elkin, 2006]. This enters the realm of
medical decision making and medical technology assessment and is not covered
here.
The multivariable analysis can be used to create a clinical prediction rule that
can be used in clinical practice to estimate the probability that an individual
patient has the target disease given his or her documented test results. There are
various examples of such multivariable diagnostic rules: a rule for diagnosing
the presence or absence of DVT [Oudega et al., 2005b; Wells et al., 1997],
pulmonary embolism [Wells et al., 1997], conjunctivitis [Rietveld et al., 2004],
and bacterial meningitis [Oostenbrink et al., 2001]. How to derive a diagnostic
rule, the ways to present it in a publication, and how to enhance its use in clinical
practice will be described next.
External Validation
As explained earlier, the possible optimism of a diagnostic model may be
addressed by internal validation. However, external validation, using new data,
is generally necessary before a model can be used in practice with confidence
[Altman & Royston, 2000a; Justice et al., 1999; Reilly & Evans, 2006]. External
validation is the application and testing of the model in new patients. The term
external refers to the use of data from subjects who were not included in the
study in which the prediction model was developed. So defined, external
validation can be performed, for example, in patients from the same centers but
from a later period than that during which the derivation study was conducted, or
in patients from other centers or even another country [Justice et al., 1999; Reilly
& Evans, 2006]. External validation studies are clearly warranted when one aims
to apply a model in another setting (e.g., transporting a model from secondary to
primary care) or in patient subgroups that were not included in the development
study (e.g., transporting a model from adults to children) [Knottnerus, 2002a;
Oudega et al., 2005a].
Too often, researchers use their data only to develop their own diagnostic
model, without even mentioning—let alone validating—previous models. This is
unfortunate as prior knowledge is not optimally used. Moreover, recent insights
show that in the case where a prediction (diagnostic or prognostic) model
performs less accurately in a validation population, the model can easily be
adjusted based on the new data to improve its accuracy in that population
[Moons et al., 2012b; Steyerberg et al., 2004]. For example, the original
Framingham coronary risk prediction model and the Gail breast cancer model
were adjusted based on later findings and validation studies [Costantino et al.,
1999; Grundy et al., 1998]. An adjusted model will then be based on both the
development and the validation data set, which will further improve its stability
and applicability to other populations. The adjustments may vary from
parsimonious techniques such as updating the intercept of the model for
differences in outcome frequency, via adjusting the originally estimated
regression coefficients of the determinants in the model, to even adding new
determinants to the model. It has been shown, however, that simple updating
methods are often sufficient and thus preferable to the more extensive model
adjustments [Janssen et al., 2008 & 2009; Steyerberg et al., 2004].
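The simplest of these updating methods, re-estimating only the intercept in the validation sample, can be sketched as follows. The function and its arguments are hypothetical; the original coefficients are held fixed by entering the model's linear predictor as an offset, so that only a new constant is estimated.

```python
# A minimal sketch: updating only the intercept of an existing logistic
# prediction model in a validation sample, keeping the published
# coefficients fixed by entering the linear predictor as an offset.
import numpy as np
import statsmodels.api as sm

def update_intercept(X_val: np.ndarray, y_val: np.ndarray,
                     intercept: float, coefs: np.ndarray) -> float:
    """Return the intercept recalibrated to the validation setting."""
    linear_predictor = intercept + X_val @ coefs  # original model, fixed
    ones = np.ones((len(y_val), 1))               # estimate a constant only
    fit = sm.GLM(y_val, ones, family=sm.families.Binomial(),
                 offset=linear_predictor).fit()
    return intercept + float(fit.params[0])
```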
With these advances, the future may be one in which prediction models—
provided that they are correctly developed—are continuously validated and
updated if needed. This resembles cumulative meta-analyses in therapeutic
research. Obviously, the more diverse the settings in which a model is validated
and updated, the more likely it will generalize to new settings. The question
arises about how many validations and adjustments are needed before it is
justifiable to implement a prediction model in daily practice. Currently there is
no simple answer. “Stopping rules” for validating and updating prediction
models should be developed for this purpose.
WORKED-OUT EXAMPLE
Recognition and ruling out of DVT is difficult based on history taking and
physical examination alone. An adequate diagnosis in patients presenting with
symptoms suggestive of DVT (usually a painful, swollen leg) is crucial because
of the risk of potentially fatal pulmonary embolism when DVT is not adequately
treated with anticoagulants. False-positive diagnoses also should be avoided
because of the bleeding risk associated with anticoagulant therapy. The serum D-
dimer test clearly improves the accuracy of diagnosing and ruling out DVT in
suspected patients. Algorithms, including clinical assessment (i.e., signs and
symptoms) and D-dimer testing are available that are widely applied in clinical
practice and recommended in current guidelines. The most famous of these, the
Wells rule, was developed and validated in secondary care settings [Wells et al.,
1997]. Research demonstrated that the Wells rule cannot adequately rule out
DVT in patients suspected of DVT in primary care as too many (16%) patients
in the low-risk category (Wells score below 1) still had DVT [Oudega et al.,
2005a]. The goal of the study presented here (see Box 2–8), was to develop the
optimal diagnostic strategy, preferably by way of a diagnostic rule, to be applied
in the primary care setting [Oudega et al., 2005b].
BOX 2–8 Ruling Out Deep Venous Thrombosis in Primary Care: A Simple Diagnostic Algorithm
Including D-dimer Testing
In primary care, the physician has to decide which patients have to be referred for further diagnostic
work-up. At present, only in 20% to 30% of the referred patients the diagnosis DVT is confirmed. This
puts a burden on both patients and health care budgets. The question arises whether the diagnostic
work-up and referral of patients suspected of DVT in primary care could be more efficient. A simple
diagnostic decision rule developed in primary care is required to safely exclude the presence of DVT
in patients suspected of DVT, without the need for referral. In a cross-sectional study, we investigated
the data of 1295 consecutive patients consulting their primary care physician with symptoms
suggestive of DVT, to develop and validate a simple diagnostic decision rule to safely exclude the
presence of DVT. Independent diagnostic indicators of the presence of DVT were male gender, oral
contraceptive use, presence of malignancy, recent surgery, absence of leg trauma, vein distension, calf
difference and D-dimer test result. Application of this rule could reduce the number of referrals by at
least 23% while only 0.7% of the patients with a DVT would not be referred. We conclude that by
using eight simple diagnostic indicators from patient history, physical examination and the result of D-
dimer testing, it is possible to safely rule out DVT in a large number of patients in primary care,
reducing unnecessary patient burden and health care costs.
Reproduced from: Oudega R, Moons KGM, Hoes AW. Ruling out deep venous thrombosis in primary care:
A simple diagnostic algorithm including D-dimer testing. Thromb Haemost 2005b;94:200–5.
Theoretical Design
The research question was: “Which combination of diagnostic determinants best
estimates the probability of DVT in patients suspected of having DVT in
primary care?”
Determinants considered included findings from history taking and physical
examination as well as the D-dimer test result. The occurrence relation can be summarized as:

P(DVT) = f(T1 … Tn)

where T1 … Tn refer to all potential diagnostic determinants studied (in total 17).
The domain of the study consisted of patients presenting to primary care with
symptoms suggestive of DVT.
The score ranged from 0 to 13 points, and the ROC area of the simplified rule
was also 0.78. Table 2–1 shows the number of participants and probability of
DVT in different categories of the risk score.
As an example, a woman using oral contraceptives who was without a leg
trauma but had vein distension and a negative D-dimer test would receive a
score of 3 (0 + 1 + 0 + 0 + 1 + 1 + 0 + 0), corresponding with a very low
estimated probability of DVT of 0.7%.
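A points-based rule of this kind is easy to compute, as the sketch below shows. The point values used here are placeholders chosen only so that the example patient scores 3 and the maximum is 13; the actual weights of the rule are given in Oudega et al. [2005b].

```python
# A minimal sketch of a points-based diagnostic rule. The weights below
# are placeholders, not the published point values (see Oudega et al.,
# 2005b, for the actual rule).
POINTS = {
    "male_sex": 1, "oc_use": 1, "malignancy": 1, "recent_surgery": 1,
    "no_leg_trauma": 1, "vein_distension": 1, "calf_difference": 1,
    "d_dimer_abnormal": 6,  # placeholder: the D-dimer carries most weight
}

def dvt_score(findings: dict) -> int:
    """Sum the points of every indicator that is present (True)."""
    return sum(POINTS[item] for item, present in findings.items() if present)

# The woman from the example: OC use, no leg trauma, vein distension.
patient = {item: False for item in POINTS}
patient.update(oc_use=True, no_leg_trauma=True, vein_distension=True)
print(dvt_score(patient))  # -> 3 with these placeholder weights
```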
It was concluded from the study that a simple diagnostic algorithm based on
history taking, physical examination, and D-dimer testing can be helpful in
safely ruling out DVT in primary care and thus would reduce the number of
unnecessary referrals for suspected DVT.
Later, the accuracy of this simplified rule was externally validated in three
regions in the Netherlands [Büller et al., 2009]. This study showed that among
DVT-suspected patients not referred for ultrasonography in daily practice
because of a risk score of ≤ 3, the proportion with a diagnosis of DVT or
pulmonary embolism within 3 months was indeed low (1.4%). The rule has been
included in the current primary care clinical guideline on suspected DVT in the
Netherlands.
Chapter 3
Etiologic Research
INTRODUCTION
A 57-year-old woman had a heart attack. She had no prior symptoms of vascular disease, is not obese, is a nonsmoker, and has normal blood pressure and lipid levels. However, she has several family members who experienced a myocardial infarction at a relatively young age. At the time of her cardiac event, she was
quickly transported to the hospital and had immediate coronary angioplasty with
placement of a drug-eluting stent. The attending cardiologist subsequently put
her on a regimen of aspirin, beta-blockers, and an angiotensin-converting
enzyme (ACE) inhibitor.
She visits you to ask what she can do to prevent a future cardiac event. Is
there an explanation for her disease? Might it be genetic? Is it because of
reaching menopause? Is there anything she should change in her lifestyle? You
promise her that you will look at the literature, and soon you come across an intriguing report by Sullivan [1981] suggesting that monthly periods may actually protect women against heart disease before menopause: in some women, the monthly loss of blood compensates for excessive iron storage, which can make the heart more vulnerable to ischemia or promote atherosclerosis. Another paper by Roest et al. [1999] shows that a relatively common heterozygous form of the gene that causes hemochromatosis may lead to subclinical iron accumulation in cardiac tissue and thereby increase the risk of cardiac events. Apart from a genetic tendency to
accumulate iron, it also has been suggested that excess iron storage may result
from an inappropriately high intake of iron through the diet. This raises the
question of whether a high dietary iron intake may be involved in cardiac risk in
otherwise low-risk individuals.
FIGURE 3–1 Cover of John Snow’s report, On the Mode of Communication of Cholera, published in 1855
by John Churchill, London. Snow’s observations on the method of transfer of this disease virtually ended a
London cholera epidemic and laid the foundation for the new science of clinical epidemiology.
Reproduced from Snow (1855). On the Mode of Communication of Cholera. London: John Churchill, New
Burlington Street, England.
BOX 3–1 Dietary Haem Iron and Coronary Heart Disease in Women
AIMS: A role for iron in the risk of ischaemic heart disease has been supported by in vitro and in vivo
studies. We investigated whether dietary haem iron intake is associated with coronary heart disease
(CHD) risk in a large population-based cohort of middle-aged women.
METHODS AND RESULTS: We used data of 16,136 women aged 49–70 years at recruitment
between 1993 and 1997. Follow-up was complete until 1 January 2000 and 252 newly diagnosed CHD
cases were documented. Cox proportional hazards analysis was used to estimate hazard ratios of CHD
for quartiles of haem iron intake, adjusted for cardiovascular and nutritional risk factors. We stratified
by the presence of additional cardiovascular risk factors, menstrual periods, and antioxidant intake to
investigate the possibility of effect modification. High dietary haem iron intake was associated with a
65% increase in CHD risk [hazard ratio (HR) = 1.65; 95% confidence interval (CI): 1.07–2.53], after
adjustment for cardiovascular and nutritional risk factors. This risk was not modified by additional risk
factors, menstruation, or antioxidant intake.
CONCLUSION: The results indicate that middle-aged women with a relatively high haem iron intake
have an increased risk of CHD.
Reproduced from Van der A DL, Peeters PHM, Grobbee DE, Marx JJM, Van der Schouw Y. Dietary haem
iron and coronary heart disease in women. European Heart Journal 2005;26:257–262.
THEORETICAL DESIGN
Etiologic epidemiologic research explores the causes of a health outcome. Its
aim is to demonstrate or exclude the relationship between a potential cause and
the occurrence of a disease or other health outcome. To achieve this goal,
alternative explanations for an apparent link between determinant and outcome
need to be excluded in the research. These alternative explanations are offered
by relationships due to extraneous determinants (confounders). The form of the
etiologic occurrence relation, the object of research, is therefore outcome as a
function of a determinant, conditional on confounders. The domain, the type of
subjects for whom the relation is relevant, is defined by all those capable of
having the outcome and who are at risk of being exposed to the determinant.
Thus the domain for a study on the role of boxing in causing memory deficits is
all human beings who could possibly engage in boxing, which is essentially
everyone. The domain for the study in Box 3–1 on risks of coronary disease due
to excessive iron intake is all women, and possibly all men too. Whether men are part of the domain rests on the degree to which the investigator believes that a risk associated with high iron exposure is particular to women or is a general feature of Homo sapiens.
Typically, etiologic research focuses on a single determinant at a time. In the
example in Box 3–1, the emphasis was on haem iron intake operationalized by
estimating intake from a food frequency questionnaire. All variables potentially
related to both the risk of coronary disease and the levels of iron intake were
treated as possible confounders; an elaborate discussion of the definition of
confounders is given later in this chapter. In this study on iron intake and heart
disease risk, the confounders were age, total energy intake, body mass index
(BMI), smoking, physical activity, hypertension, diabetes, hypercholesterolemia,
energy-adjusted intakes of saturated fat and carbohydrates, fiber, alcohol, beta-
carotene, vitamin E, and vitamin C intake. All were measured at the time of
inclusion in the cohort. When each was taken into account, however, none materially changed the risk estimate for iron intake, suggesting that none had a major impact on the association.
In another study addressing the importance of lifestyle in the occurrence of
breast cancer, a particular research question might focus on the putative causal
role of a high alcohol intake in the occurrence of breast cancer. The occurrence
relation would then be breast cancer as a function of alcohol use, conditional on
confounders. The domain would be all women. Among the confounders,
smoking would most likely be important. In a second analysis of the same study,
the question could be about the causal role of smoking in breast cancer. Now
smoking would be the single causal determinant of interest and alcohol
presumably among the confounders. (The importance of making clear
distinctions between determinants and confounders in a given analysis for a
given research question is outlined next.) Disregarding confounders or having
incomplete or suboptimal confounder information may lead to results that are
not true and thus invalid. The overriding importance of the need to exclude
confounding makes etiologic epidemiologic research particularly difficult.
Courtroom Perspective
If you are doing etiologic research, pretend that you are in a courtroom. You are
the prosecutor and your task is to show beyond reasonable doubt that the
defendant, and not someone else, is to blame for the criminal act. Etiologic
research is about accusation. As an investigator (author of the study), you must
convince the jury (your peers and readers) that the determinant is causally
involved in the occurrence of the disease. It is common for an initial report on a
causal factor in disease to be superseded by newer research contradicting the
initial finding because of evidence on confounders. One report in 1981
[MacMahon et al.] suggested a strong relationship between coffee use and
pancreatic cancer. Since then, however, most studies could not confirm a
substantial association when more confounding factors were considered, and the
overall evidence suggests that coffee consumption is not related to pancreatic
cancer risk.
CONFOUNDING
Assessment of confounding by detecting the presence of possible extraneous
determinants is critical to obtaining valid results in etiologic studies. A first step
is to clearly decide which determinant is the assumed causal factor of interest.
Commonly, diseases are caused by multiple factors, which can act in concert or
separately. In subsequent studies, multiple possible causative agents may be
addressed consecutively. At each instant, however, there is typically one
determinant of primary etiologic interest, while other determinants of the
outcome are extraneous to that particular occurrence relation. Confounders can
be very specific to a particular determinant–outcome relationship. Potential
confounders may or may not distort the relationship between the determinant of
interest and the outcome in the data, depending on the presence or absence of
associations between these variables.
Frequently, it is proposed that confounding be assessed by simply determining whether possible extraneous determinants are associated with both the outcome and the causal determinant of interest. The prevailing view is that if a factor X is known to be related to both the determinant and the outcome in an occurrence relation, then X is a confounder. Clearly, if a factor is not related to both outcome and determinant, confounding will never result. However, even when a perceived
extraneous determinant is simultaneously associated with the outcome and
determinant, this does not invariably imply confounding. An example is when
the variable is somewhere in the causal pathway and thus not extraneous.
For a third variable to act as a confounder in etiologic research, it should be
(1) related to the occurrence of the outcome and thus be a determinant of the
outcome by itself, (2) associated with the exposure determinant of interest, and
(3) extraneous to the occurrence relation. By extraneous, we mean that this
variable is not an inevitable part of the causal relationship or causal chain
between the determinant of interest and the outcome variable (e.g., because it is
part of the causal pathway; see the discussion that follows). The terms confounder and extraneous determinant can be used interchangeably; although less common, the term extraneous determinant indicates more clearly the nature of this type of determinant.
Assume that you are interested in the causal relationship between body weight
and the occurrence of diabetes mellitus (see Figure 3–2). In a study designed to
shed light on the causal role of obesity in diabetes, age is extraneous to the
occurrence relation. Because age is known to be related to both body weight and
the occurrence of diabetes (note the two arrows in the figure), any estimate of a
causal effect of excessive body weight in the occurrence of diabetes is likely to
be distorted by the effect of age. To validly estimate the true effect of obesity,
differences in distributions of age across groups of patients with different body
weights should be taken into account, either in the design of the data collection
or in the design of data analysis. To return to the courtroom analogy, you should
not blame body weight for the occurrence of diabetes when in fact age is
“guilty.” Extraneous to the occurrence relation also means that the third variable
should not be part of the causal chain. If it is part of the causal chain, the
variable is an intermediate factor rather than an extraneous variable. Such an
intermediate factor may induce changes in other factors, which then serve to
change the outcome.
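The body weight–diabetes example can be made tangible with a small simulation, sketched below under explicit assumptions: body weight is given no causal effect at all, while age drives both weight and diabetes, so the crude association is spurious and disappears after adjustment.

```python
# A minimal sketch: simulated confounding by age. Body weight has no
# causal effect on diabetes here; age drives both, so the crude odds
# ratio per kilogram is spuriously elevated and adjustment removes it.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20_000
age = rng.uniform(30, 80, n)
weight = 60 + 0.4 * age + rng.normal(0, 10, n)     # age -> body weight
p_diabetes = 1 / (1 + np.exp(-(-7 + 0.08 * age)))  # age -> diabetes
diabetes = rng.binomial(1, p_diabetes)

crude = sm.Logit(diabetes, sm.add_constant(weight)).fit(disp=0)
adjusted = sm.Logit(
    diabetes, sm.add_constant(np.column_stack([weight, age]))).fit(disp=0)
print(f"crude OR per kg:    {np.exp(crude.params[1]):.3f}")     # > 1.0
print(f"adjusted OR per kg: {np.exp(adjusted.params[1]):.3f}")  # ~ 1.0
```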
FIGURE 3–2 A simple causal pathway showing the influence of an extraneous determinant on the
determinant and outcome.
FIGURE 3–3 A specific example of a causal pathway showing several extraneous determinants. Data are from 1,265 individuals; pairwise correlations between blood pressure, heart rate, cigarette smoking, age, and body mass index are shown (* P < 0.05).
The ongoing debate about the possible increased risk of myocardial infarction
in subjects with a high coffee intake serves as an example. In the mid-1970s,
reports were published suggesting that coffee users were at a twofold increased
risk of myocardial infarction compared to nonusers. The increased risk remained
after adjustment for possible confounding factors. Hennekens and coworkers
[1976] published a case-control study in which they compared the effects of
adjustment for a limited set of extraneous determinants; these included restricted
adjustment as in other published reports at the time and adjustment for a more
extensive set of possible confounders that included several dietary variables.
Cases were male patients who had a fatal myocardial infarction, and controls
were sampled from neighbors who remained free from coronary heart disease
during the same time period. Information on coffee use and a range of
confounders was obtained by interviewing the wives of the myocardial infarction
victims and their neighbors (controls). First, an analysis was performed that
replicated previous reports with adjustment for a limited set of 10 confounders.
In this analysis, the relative risk of myocardial infarction for coffee users
compared to those who did not drink coffee was 1.8 (95% CI 1.2–2.5). However,
when nine additional confounders were taken into account in the analyses, the
relative risk was reduced to 1.1 (95% CI 0.8–1.6), a nonsignificant 10% increase in risk rather than an 80% increase. Apparently, in previous work the
“adjusted” association was still suffering from “residual” confounding.
Subsequent studies with larger numbers of patients and even more extensive
adjustment for potential confounders have further reduced the likelihood of a
clinically meaningful increased risk of heart disease due to drinking coffee
[Grobbee et al., 1990]. A possible exception is the use of so-called “boiled”
coffee, in the past quite normal in Scandinavia, which has been shown to raise
cholesterol and thus increase the risk of atherosclerosis and cardiovascular
events [Bak & Grobbee, 1989]. In the latter example, cholesterol elevation is an
intermediate variable.
One way to invalidate findings in etiologic research is to fail to consider
relevant extraneous factors, and an alternative way to produce invalid results is
to measure such confounding factors poorly. Adjustment is incomplete when
confounders are not taken into account in the data analyses, but the adjustment
for confounders in the analysis may be similarly inadequate if the measurement
of confounders is not sufficiently comprehensive and precise.
FIGURE 3–4 Do postmenopausal circulating estrogen levels affect bone density? Differences in bone density between high- and low-estrone groups, with and without adjustment for differences in BMI, are shown above. Measurements were made using dual-photon absorptiometry of the spine (DPAspine) and single-photon absorptiometry of the distal and proximal forearm (SPAdist and SPAprox, respectively). Light gray bars = crude differences between groups; dark gray bars ("NS") = differences after adjustment.
When adjustments are made in the analyses of differences between the two
estrone groups in the BMI, the results look materially different compared to the
crude unadjusted analysis (see Figure 3–4).
After adjustment for BMI, none of the initial differences in bone density between low- and high-estrone women remains. However, the question arises
about whether this adjustment is appropriate. Rather, you could argue that
differences in circulating estrone levels between women largely reflect
differences in body fat, which is the prime site for estrogen production through
conversion of androgens in postmenopausal women. While BMI is correlated to
both the determinant and the outcome, it does not qualify as an extraneous
determinant because it is not extraneous to the occurrence relation of interest. Rather, the likely mechanism for increased bone density after menopause is that obesity precedes higher estrogen production; obesity thus lies in the causal chain relating estrogen to bone density. The example illustrates the notion that
classification of a factor related to both outcome and determinant as a
confounder assumes this factor to be extraneous. Rather than being extraneous, a
certain factor may lead to a changed physiology that in turn affects the
determinant under study and subsequently the outcome (see Figure 3–5).
An important message from this and the alcohol → HDL cholesterol → heart
disease example is that judgment of the potential for confounding requires
knowledge of the possible etiologic mechanisms involved. This may well create a "catch-22" situation in which an absence of etiologic insight creates confounding
that in turn invalidates subsequent observations. Frequently in etiologic
epidemiologic research, initial observations subsequently must be corrected
because of expanding knowledge and adjustment for newly recognized
confounders [Taubes, 1995]. While assessment of correlations in the data may be useful to detect possible confounding, no statistical software is sophisticated enough to determine whether a variable truly is a confounder. It remains the
responsibility of the investigator to exclude confounding in the design of data
collection and the design of data analysis of a study. To decide upon the
presence of confounding with confidence, insight into mechanisms involved is
required. If a particular determinant is not the putative causal determinant of
interest but is a precursor or intermediary in a causal chain, there is no
confounding and making an adjustment in the analysis will lead to over-
adjustment. This generally results in an underestimation of the true association
between the determinant and the outcome.
FIGURE 3–5 Determining confounders. Suppose that the objective of your study is to determine the causal
role of variation in circulating estrogen levels in the occurrence of bone fractures. You gather a cohort of
women and establish a baseline BMI, estrogen levels, and bone density for each. They are followed up for
10 years, as you record the occurrence of fractures (outcome) as a function of circulating estrogen levels
(determinant), conditional on confounders. Because of the etiologic nature of your research, confounding
factors need to be excluded. Age is related with risk of fractures as well as with estrogen levels (and is not
in the causal chain) and thus is a confounder. While BMI and bone density both are related to the outcome,
they are in the causal chain (fat tissue is a source of estrogen production and bone density is increased by
higher circulating estrogen levels). BMI is a precursor and bone density is an intermediate of the
association. Consequently, they are not confounding the relationship and their effects should not be
removed from the association by adjustments.
Handling of Confounding
Once confounding is suspected, there are several approaches to removing it from
the observed association. As previously indicated, confounding may occur when
a variable is associated with both the determinant of interest and the outcome
and it is not part of the causal chain. Being associated with implies that the
confounder is related to the outcome and that the distribution of the confounder
varies across levels of the determinant. To remove confounding requires that the
distribution of the confounder is made the same across levels of the determinant.
When distributions of the confounder are made the same across levels of the determinant and the determinant–outcome relationship persists, we conclude that the relationship holds conditional on the confounder, that is, that it is not explained by the confounder. Removal of confounding
may be achieved in the design of data collection, in the design of data analysis,
or the combination of both. For example, suppose that in a particular study age is
thought to be a confounder of the relationship between sex and stroke risk,
implying that age distributions for men and women are different (and age is
associated with stroke risk). In order to remove the confounding effect of age,
age distributions need to be made similar for men and women. This can be done
in a number of ways. First, confounding may be removed in the design of data
collection by means of restriction. If only men and women within a small age
range are included in the study, the distribution of age across gender is the same
and age will not be a confounder. Similarly, men and women may be matched
for age. Matching can be done on an individual basis (individual matching),
where each individual with the determinant (male in this example) is closely
matched with someone without the determinant (female in this example)
according to the confounder (age in this example). Alternatively, the age distributions among those with and without the determinant are made approximately the same using methods such as stratified sampling; this is called frequency matching. In this example, matching ensures that, although the
distributions of age may be wide, they are the same (mean, median, standard
deviation) for men and women. One can also remove confounding in the design
of data analysis. One approach is to perform a stratified analysis. The association
between gender and stroke risk is then analyzed in separate age strata, each of
which cover a small age range. Within age strata, males and females are similar
regarding age, and age will not be a confounder. Next, the estimates for the
strata are pooled using some statistical method that weights the information by
stratum, such as the Mantel-Haenszel procedure. Essentially the same can be
achieved in a multivariable regression analysis where age is added to the
multivariable model next to the determinant (male/female) and possibly other
confounders.
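A minimal sketch of such a stratified analysis follows, pooling stratum-specific 2×2 tables with the Mantel-Haenszel procedure via statsmodels; the counts per age stratum are made up for illustration.

```python
# A minimal sketch: a stratified analysis pooled with the
# Mantel-Haenszel procedure; the 2x2 counts per age stratum are made up.
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# Rows: determinant present / absent; columns: outcome yes / no.
strata = [
    np.array([[10, 90], [8, 92]]),    # e.g., ages 40-49
    np.array([[25, 75], [20, 80]]),   # e.g., ages 50-59
    np.array([[40, 60], [35, 65]]),   # e.g., ages 60-69
]
table = StratifiedTable(strata)
result = table.test_null_odds()       # Cochran-Mantel-Haenszel test
print(f"MH pooled OR: {table.oddsratio_pooled:.2f} "
      f"(chi2={result.statistic:.2f}, P={result.pvalue:.3f})")
```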
More recently, certain new approaches such as the use of propensity scores
and instrumental variables (both can be applied in the design of data analysis and
in the design of data collection) have been introduced into clinical epidemiology
to remove confounding. These methods have primarily been used in assessing
causal treatment effects in observational studies (for a review of classic and new
methods to remove confounding see Klungel, 2004). In the assessment of
treatment effects without the use of randomization, confounding by indication is
a major problem, but the principles of adjustment apply similarly to causal
research where the determinant (exposure) is not a drug given for a particular
indication, but, for example, is related to lifestyle characteristics such as level of
physical activity.
As a summary variable for several confounders, propensity scores may be
used for statistical adjustment (in the design of data analysis), matching, or
restriction (in the design of data collection). Propensity may be defined as an
individual’s probability of being exposed to the determinant of interest, for
example, receiving a specific treatment, given the complete set of all information
about that individual. The propensity score provides a single variable that
summarizes all the information from potential confounding variables such as
disease severity and comorbidity; it estimates the probability of a subject being
exposed to the intervention of interest given his or her clinical and nonclinical
status. In case of a binary treatment, the propensity score may be estimated for
each subject from a logistic regression model in which treatment assignment is
the dependent variable. The prognosis in the absence of treatment is assumed to
be the same (balanced) across groups of subjects with the same propensity score.
When treated and untreated subjects are then matched according to propensity
score or the analysis is restricted to those within a limited range of the propensity
score, treated and untreated subjects will have on average the same prognosis in
the absence of treatment. Alternatively, the propensity score can be included as a
covariate in a multivariable regression model relating the treatment to the
outcome. An example is a study showing that treatment with beta-blockers may
reduce the risk of exacerbations and improve survival in patients with chronic
obstructive pulmonary disease [Rutten et al., 2010]. Physicians typically avoid
using beta-blockers in patients with chronic obstructive pulmonary disease and
concurrent cardiovascular disease because of concerns about adverse pulmonary
effects. Therefore, in this observational study, those with chronic obstructive
pulmonary disease treated with beta-blockers very likely have a different
cardiovascular prognosis then those not treated with them. Adjustments for
confounding were made using conventional logistic regression and propensity
score analyses. Both methods showed a reduced mortality risk for beta-blocker
use, with the propensity score analyses showing larger reductions, suggesting
that propensity score analysis more thoroughly deals with confounding in this
example. Note, however, that confounding may remain even after propensity
score adjustment, if relevant subject characteristics were not measured or were
only measured imprecisely [Nicholas, 2008].
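The basic propensity score workflow described above can be sketched in a few lines of Python. The example below simulates confounding by indication, estimates the propensity score with logistic regression (here via scikit-learn), and compares treated and untreated subjects within propensity score quintiles. All variable names and numbers are hypothetical; this is an illustrative sketch, not a recipe from the studies cited.

```python
# Sketch: propensity score estimation and stratified comparison (simulated data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000
severity = rng.normal(size=n)                  # confounder: disease severity
comorbidity = rng.binomial(1, 0.3, size=n)     # confounder: comorbidity
# Treatment assignment depends on the confounders (confounding by indication):
p_treat = 1 / (1 + np.exp(-(-0.5 + 1.0 * severity + 0.8 * comorbidity)))
treated = rng.binomial(1, p_treat)

# Propensity score: the probability of treatment given measured characteristics.
X = np.column_stack([severity, comorbidity])
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Outcome: confounders raise risk; the treatment itself lowers the odds.
p_out = 1 / (1 + np.exp(-(-2.0 + 0.9 * severity + 0.7 * comorbidity - 0.7 * treated)))
outcome = rng.binomial(1, p_out)

# Compare treated and untreated subjects within propensity score quintiles:
quintile = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))
for q in range(5):
    m = quintile == q
    r1 = outcome[m & (treated == 1)].mean()
    r0 = outcome[m & (treated == 0)].mean()
    print(f"PS quintile {q + 1}: risk treated = {r1:.3f}, untreated = {r0:.3f}")
```

Within quintiles of the propensity score, treated and untreated subjects have (on average) a comparable prognosis in the absence of treatment, so the within-stratum contrasts are largely free of the simulated confounding by indication.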
The use of instrumental variables, originating from econometrics where
randomized comparisons are largely impossible, has been suggested for use in
epidemiologic analyses with the same objective as propensity scores but with the
potential to also adjust for unmeasured confounders [Martens et al., 2006]. The
key assumptions for an instrumental variable (IV) are that (1) the IV is strongly
associated with the exposure (often treatment assignment), (2) the IV is
unrelated to confounders of the occurrence relation, and (3) the IV affects the
outcome only through the exposure. These three
assumptions are shown in Figure 3–6.
FIGURE 3–6 Assumptions of an instrumental variable applied to remove confounding in a study assessing
the causal relationship between an exposure and an outcome. The numbers 1–3 refer to the three
assumptions that are explained in the text.
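A small simulation may help to see why a variable satisfying these three assumptions recovers the causal effect even when the confounder is unmeasured. The sketch below uses the Wald (ratio-of-covariances) form of the two-stage least squares estimator; all data-generating values are invented for illustration.

```python
# Sketch: instrumental variable estimation with an unmeasured confounder.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
u = rng.normal(size=n)                  # unmeasured confounder
z = rng.binomial(1, 0.5, size=n)        # instrument, e.g., randomized encouragement
x = 0.8 * z + 1.0 * u + rng.normal(size=n)   # exposure: driven by IV and confounder
y = 0.5 * x + 1.0 * u + rng.normal(size=n)   # outcome: true causal effect = 0.5

# Naive regression of y on x is biased because u is unmeasured:
naive = np.cov(x, y)[0, 1] / np.var(x)

# Two-stage logic: the IV shifts the exposure but touches the outcome only
# through it, so the Wald estimator cov(z, y) / cov(z, x) isolates the effect.
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"naive estimate: {naive:.2f}, IV estimate: {iv:.2f} (true effect 0.5)")
```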
CAUSALITY
Etiologic research aims to find causal associations. A determinant is believed to
be causally related to an outcome if the association remains when confounding is
excluded. Other requirements are necessary, however, in order to conclude that
the association is indeed causal and to exclude both residual confounding by
some unidentified factors and the mere play of chance.
Many criteria have been proposed to make a causal association more probable.
These include a large number of independent studies with consistent results, a
temporal relationship where the cause precedes the outcome, a strong
association, a dose–response relationship, and biologic plausibility. These
criteria stem from the work of Hill [1965] and others, but each of the criteria has
been challenged and none provides definitive proof. Even a temporal
relationship in which the determinant follows the outcome does not rule out the
possibility that in other circumstances the determinant could lead to the outcome.
Probably the most limiting factor in disclosing causal relationships in
epidemiologic studies is the general focus on single determinant outcome
relationships. Very few diseases are caused by a single factor. For example,
many people are exposed to methicillin-resistant Staphylococcus aureus. Only
some of those exposed will become colonized, and still fewer will suffer from
serious infection. It is likely that the genotype modifies the risk of colonization after
exposure. The interplay between different factors, possibly through different
mechanisms, is the rule rather than the exception in the etiology of the disease.
Yet other factors, such as the quality of the immune response, will modify the
risk of serious infection. The genetic disorder phenylketonuria (PKU)
convincingly shows that the interaction of genes and environment causes a
disease commonly thought to be purely genetic. Dietary exposure to a particular
amino acid gives rise to mental retardation in children with mutations in the
phenylalanine hydroxylase gene on chromosome 12q23.2 encoding the L-
phenylalanine hydroxylase enzyme, resulting in PKU. Because exposure to both
factors is necessary for PKU to occur, infants with the genetic defect are put on a
lifelong restricted diet to prevent the development of the disease. Rothman and
Greenland [2005] have made important contributions to our understanding of
multicausality in epidemiologic research. (A full discussion goes beyond the
scope of this text, however.) The central principle is that a disease can be caused
by more than one causal mechanism, and every causal mechanism involves the
joint action of a multitude of component causes (see Figure 3–7). As a
consequence, particular causal determinants of disease may be neither necessary
nor sufficient to produce disease. Nevertheless, a cause need not be necessary or
sufficient for its removal to be useful in prevention. For example, alcohol use
when driving is neither necessary nor sufficient to lead to car accidents, yet
prevention of drunk driving will decrease a fair number of casualties. That the
cause is not necessary implies that some disease may still occur after the cause is
blocked, but a component cause will nevertheless be a necessary cause for some
of the cases that occur. When the strength of a causal effect of a certain
determinant depends on or is modified by the presence or absence of another
factor, there is causal, or biologic, interaction or modification. Although
modification of a causal association may be very relevant, it may best be viewed
as secondary to the main determinant–outcome relationship. It adds detail to it,
albeit sometimes extremely important detail.
Descriptive Modification
We propose restricting the term descriptive modification to the analysis of the
extent to which the strength of a causal or noncausal determinant–outcome
association varies across another factor without the need to explain the nature of
that modification. The extent to which the effectiveness of vaccination varies
across age groups serves as an example [Hak et al., 2005]. The only intention
here is to determine whether it should be recommended to target the intervention
at particular age groups from the perspective of cost-effectiveness. There is no
need to understand the modification in causal terms. The causal association
addressed here concentrates on the effect of the intervention (i.e., influenza
vaccination) on the outcome (e.g., survival) only. Modification is examined to
learn about differential effects of vaccination across relevant population
subgroups such as those defined by age. The assessment of modification by age
adds detail to the research on the causal association between vaccination and the
outcome parameter with a view toward practical application of the result.
Descriptive modification may easily occur due to differences in the prevalence
of the disease across populations or population subgroups. For example, the
effectiveness of screening for HIV will be modified by the proportions of hetero-
and homosexual individuals in the populations because this will reflect different
prevalence rates of the disease. In other words, while the fraction of cases
detected will be the same (say, 90%), the absolute number of HIV-infected subjects
detected will be modified by the prevalence of homosexual subjects in the
populations studied. The latter example illustrates that modification may occur
both on a relative scale (as in modification by age of the effect of influenza
vaccination on survival) and on an absolute scale (the absolute number of newly
detected HIV-infected individuals), further adding to the complexity of the issue.
Descriptive modification can be equally addressed in causal and descriptive
studies. An example in descriptive studies is when the question is asked about
whether signs and symptoms of heart failure have a different diagnostic value in
patients who suffer from chronic lung disease than in patients without this
concomitant disease [Rutten et al., 2005a; Rutten & Hoes, 2012].
Causal Modification
The interest in causal modification of a determinant–disease association is of an
entirely different nature. Garcia-Closas and colleagues’ 2005 study on the extent
to which the presence of a particular genotype increases the risk of bladder
cancer resulting from cigarette smoking is an example. Here, two causal
questions were addressed. Primarily, the causal association between cigarette
smoking and bladder cancer occurrence was assessed, but the authors also
examined the possible increased sensitivity to cigarette smoke in the presence of
the genotype. Garcia-Closas and coworkers [2005] found that persons who were
current smokers or had smoked cigarettes in the past had a higher risk of
developing bladder cancer. However, the relative risk related to smoking was
2.9-fold increased in those who had the NAT2 slow acetylator genotype and 5.1-
fold increased among those with the intermediate or rapid acetylator genotype.
In researching the benefits and risks of treatment, causal as well as descriptive
modifications are often, albeit sometimes implicitly, addressed when subgroups
show a higher or lower response to the intervention.
BOX 3–2 Astrological Daily Prediction Taking the ISIS Trial Findings on Aspirin into Account
A loan will be easy to obtain tomorrow, but you must have a list of items you own so that you will
have something to show as collateral. This loan could be to improve the home or to purchase a car.
Things are happening, and your career or path depends on your own ambition and drive, as well as
your ability to be patient and bide your time. You are able to use good common sense to guide you,
and you can feel the trends and make the right moves. The time is coming soon to take action and get
ahead. You may contemplate a career move and next week is a most positive one as you make yourself
known. You are advised to use no aspirin.
TABLE 3–2 A Meta-Analysis of 24 Blood Pressure Trials Involving 68,099 Randomized Patients1
“Rate” means rate of cardiovascular disease.
1 Unpublished results.
Measurement of Modification
Measurement of modification is conceptually straightforward. Suppose we study
the risk of gastric bleeding for those using aspirin therapy by comparing
bleeding rates across users and nonusers of aspirin, with adjustments for
extraneous determinants related to both aspirin use and the baseline (before use)
risk of bleeding (such as age, comorbidity, and the severity of the disease for
which the aspirin was prescribed). If an overall increased risk of bleeding caused
by aspirin use is established, a next concern may be to determine which patients
treated with high-dose aspirin are at a particularly high risk. Certain patients on
the same dose of aspirin may be more likely to experience gastric bleeding than
others. For example, concurrent use of corticosteroids might enhance the
bleeding risk. In other words, steroid use modifies the risk of high-dose aspirin
as it makes the risk even higher. In this occurrence relation, corticosteroids are
causal modifiers of the risk of bleeding associated with high-dose aspirin use.
The modifier changes the magnitude of the association between determinant and
outcome; the effect estimate depends on the value of the modifier. In this
example, suppose that the overall relative risk of bleeding for those taking high-
dose aspirin compared to low-dose aspirin was 2; for those taking a
corticosteroid that relative risk became 4. The modification becomes visible
when the association of interest is compared across strata of the modifier.
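In code, the comparison across strata of the modifier amounts to computing the stratum-specific relative risks and inspecting their ratio. The sketch below mirrors the aspirin-corticosteroid example; the counts are hypothetical, chosen only to reproduce relative risks of 2 and 4.

```python
# Sketch: detecting modification by comparing the relative risk across strata
# of the putative modifier (corticosteroid use). Counts are hypothetical.

def risk_ratio(cases_hi, n_hi, cases_lo, n_lo):
    return (cases_hi / n_hi) / (cases_lo / n_lo)

# (bleeds, n) for high-dose vs low-dose aspirin, stratified by steroid use:
rr_no_steroids = risk_ratio(40, 2000, 20, 2000)   # RR = 2
rr_steroids    = risk_ratio(32, 400, 8, 400)      # RR = 4

print(f"RR high- vs low-dose aspirin, no corticosteroids: {rr_no_steroids:.1f}")
print(f"RR high- vs low-dose aspirin, corticosteroids:    {rr_steroids:.1f}")
# A ratio of stratum-specific RRs far from 1 suggests modification
# on the multiplicative scale:
print(f"ratio of RRs: {rr_steroids / rr_no_steroids:.1f}")
```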
In etiologic research, analysis of modifiers may help the investigator to
understand the complexity of multicausality and causally explain why a
particular disease may be more common in certain individuals despite an
apparent similar exposure to a determinant. After the unconfounded
measurement of an overall association between a determinant and an outcome,
putative modification may be estimated by comparing the strength of the
exposure–outcome association across categories of the modifier. Causal
modification also can be studied experimentally. Activated factor VII (FVIIa) is
a very potent coagulant and may be a key determinant in the outcome of a
cardiovascular event. FVIIa increases in response to dietary fat intake. Mennen
and coworkers [1999] studied whether the response of FVIIa to fat intake is
modified (in this case reduced) by the genetic R353Q polymorphism. A fat-rich
test breakfast and a control meal were given to 35 women carrying the Q allele
and 56 women without the Q allele genotype. At 8 AM (after an overnight fast),
the first blood sample was taken, and within 30 minutes the subjects ate their
breakfasts. Additional blood samples were taken at 1 PM and 3 PM. The mean
absolute response of FVIIa was 37.0 U/L in the group with the RR genotype and
16.1 U/L (P < 0.001) in those carrying the Q allele (see Figure 3–8).
FIGURE 3–8 Comparison of activated factor VII (FVIIa) in women carrying the Q allele and those
carrying the RR genotype before and after a meal.
Reproduced from Mennen LI, de Maat MP, Zock P, Grobbee DE, Kok FJ, Kluft C, Schouten EG.
Postprandial response of activated factor VII in elderly women depends on the R353Q polymorphism. Am J
Clin Nutr 1999;70:435–8.
Attributable proportion (AP) of cases owing to the interaction of migration history and family dysfunction.
Reproduced from Patino LR, Selten JP, Van Engeland H, Duyx JH, Kahn RS, Burger H. Migration, family
dysfunction and psychotic symptoms in children and adolescents. Br J Psychiatry 2005;186:442–3.
As long as it is well understood that the choice of the scale on which
modification is measured and the selection of the statistical model have an
impact on the detection and magnitude of modification, and as long as it is clear
why the modification is addressed, there is room for additive as well as
multiplicative models.
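The scale dependence is easy to demonstrate numerically. In the sketch below, with four invented risks, the joint effect of an exposure and a modifier shows no interaction on the multiplicative scale yet clearly exceeds additivity of the separate risk differences.

```python
# Sketch: the same four risks judged on an additive vs a multiplicative scale.
# Hypothetical 10-year risks by exposure A and modifier B:
r00, r10, r01, r11 = 0.01, 0.03, 0.02, 0.06   # neither, A only, B only, both

additive_interaction = (r11 - r00) - ((r10 - r00) + (r01 - r00))
multiplicative_interaction = (r11 / r00) / ((r10 / r00) * (r01 / r00))

print(f"excess risk beyond additivity: {additive_interaction:+.3f}")
print(f"ratio of RRs (1 = no multiplicative interaction): {multiplicative_interaction:.2f}")
# Here the joint risk (0.06) equals the product model exactly, so there is no
# multiplicative interaction, yet the risks exceed additivity by 0.02.
```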
PURPOSE: Although physical activity has been consistently inversely associated with colon cancer
incidence, the association of physical activity with other diet and lifestyle factors that may influence
this association is less well understood. Confounding and effect modification are examined to better
understand the physical activity and colon cancer association.
METHODS: Based on hypothesized biological mechanisms whereby physical activity may alter risk
of colon cancer, we evaluated confounding and effect modification using data collected as part of a
case-control study of colon cancer (N = 1993 cases and 2410 controls). We examined associations
between total energy intake, fiber, calcium, fruit and vegetables, red meat, whole grains as well as
dietary patterns along with cigarette smoking, alcohol consumption, BMI, and use of aspirin and/or
NSAIDs and physical activity.
RESULTS: No confounding was observed for the physical activity and colon cancer association.
However, differences in effects of diet and lifestyle factors were identified depending on level of
physical activity. Most striking were statistically significant interactions between physical activity and
high-risk dietary pattern and vegetable intake, in that the relative importance of diet was dependent on
level of physical activity. The predictive model of colon cancer risk was improved by using an
interaction term for physical activity and other variables, including BMI, cigarette smoking, energy
intake, dietary fiber, dietary calcium, glycemic index, lutein, folate, vegetable intake, and high-risk
diet rather than using models that included these variables as independent predictors with physical
activity. In populations where activity levels are high, the estimate of risk associated with high
vegetable intake was 0.9 (95% CI 0.6–1.3), whereas in more sedentary populations the estimate of risk
associated with high vegetable intake was 0.6 (95% CI 0.5–0.9).
CONCLUSIONS: Physical activity plays an important role in the etiology of colon cancer. Its
significance is seen by its consistent association as an independent predictor of colon cancer as well as
by its impact on the odds ratios associated with other factors. Given these observations, it is most
probable that physical activity operates through multiple biological mechanisms that influence the
carcinogenic process.
Reproduced from Slattery ML, Potter JD. Physical activity and colon cancer: Confounding or interaction?
Med Sci Sports Exerc. 2002 Jun;34(6):913–9.
Time
Typically, etiologic studies are longitudinal because the goal is to relate a
potentially causal determinant to the future occurrence, for example, the
incidence of a disease. This temporal relationship should be incorporated in the
design of data collection to ensure that the determinant indeed precedes the
development of disease, for example, by means of a cohort study. Consequently,
a cross-sectional design, where determinant and outcome are measured at the
same point in time, is generally not the preferred approach in etiologic research.
Several examples illustrate this point. In studies on dietary habits as a possible
cause for cancer, a cross-sectional study design may reveal a positive association
between low fat intake and cancer, while in fact the preclinical cancer itself may
have caused a change in dietary habits. Such a “which comes first, the chicken or
the egg” phenomenon constitutes less of a problem when the etiologic factor
cannot change over time (e.g., gender or a genetic trait).
Census or Sampling
The classic approach to collecting data in etiologic epidemiologic research is a
cohort study, where a group of subjects exposed to the causal factor under study
and a group of unexposed subjects are followed over time to compare the
incidence of the outcome of interest. Such a study takes a census approach in
that in all study participants the determinant, outcome, and potential confounders
(and, if the aim is to study modification, modifiers) are measured. Alternatively,
and often more efficiently, information on the determinant and confounders (and
possibly modifiers) can be collected in patients with the outcome of interest (the
cases) and a sample (controls) from the population in which these cases arise.
The latter approach is called a case-control study.
Experimental or Observational
Etiologic research can be conducted experimentally or observationally (i.e.,
nonexperimentally). Experimental means that an investigator manipulates the
determinant with the goal of learning about its causal effects. Case-control
studies are nonexperimental by definition, but cohort studies can either be
experimental or nonexperimental. The best known type of experimental cohort
study is a randomized trial. Randomized trials are particularly suited to study
effects of interventions.
The study in Box 3–1, which addressed the cardiac risks associated with a
high haem iron intake, was a cohort study where determinant, confounder, and
outcome data were collected on all members of the cohort. From our discussion,
it is obvious that in an etiologic study data need not only be collected for the
determinant and outcome under study, but also for potential confounders and, in
case modification is of interest, the effect modifiers.
There are several ways of collecting this information. Participants can be
interviewed, face to face or by telephone; they can answer questionnaires at
home or under supervision; they can keep diaries; and physical measurements
can take place. The chosen method depends on the reliability, feasibility, and
affordability of the different ways of collecting the data. Determinant,
confounder, and modifier data are usually collected at the start of the study, that
is, at baseline. It also is possible for information to be collected from the past. In
the cohort study in Box 3–1, dietary information was collected for the year prior
to enrollment. Another example is when information about reproductive
characteristics of women is needed from postmenopausal women. Milestones in
their reproductive history such as menarche, menstrual cycles, childbirth, and
lactation all happened in the past.
Measurement error is one of the most important problems in the data
collection of epidemiologic studies and can lead to considerable bias.
Measurement error occurs when the measurement is not valid or when the
measurement is not sufficiently precise. Invalid measurements occur when the
method used does not measure what the investigator intends to measure. An
example is an uncalibrated blood pressure device that systematically measures
the blood pressures 10 mm Hg too high. Such an error will impair inference for
absolute blood pressure levels. If the measurement is sufficiently precise (i.e.,
there is little random variation), however, there is no problem with ranking each
study participant correctly in the population distribution. In the example in Box
3–1, when the haem iron content is unknown for many foods, this will lead to
underestimation of the haem iron intake of essentially all individuals and hence
to misclassification of persons with a truly higher intake in categories of lower
intake. When this occurs to the same extent for persons who develop coronary
heart disease as for persons who do not (this is called nondifferential
misclassification), it will lead to an underestimation of the association. Suppose
that particular foods are missed exclusively for those who subsequently
experience heart disease. In this situation the underestimation of intake becomes
related to the outcome of interest (this is called differential misclassification) and
the observed association will become severely biased.
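A brief simulation illustrates why nondifferential misclassification biases the association toward the null. In the Python sketch below (all parameters invented), 30% of the truly exposed are misclassified as unexposed, irrespective of their disease status, and the observed odds ratio shrinks toward 1.

```python
# Sketch: nondifferential exposure misclassification attenuates the odds ratio.
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
exposed = rng.binomial(1, 0.25, size=n)
p_disease = np.where(exposed == 1, 0.02, 0.01)   # true odds ratio close to 2
disease = rng.binomial(1, p_disease)

def odds_ratio(exp, dis):
    a = np.sum((exp == 1) & (dis == 1)); b = np.sum((exp == 1) & (dis == 0))
    c = np.sum((exp == 0) & (dis == 1)); d = np.sum((exp == 0) & (dis == 0))
    return (a * d) / (b * c)

# Misclassify 30% of the truly exposed as unexposed, regardless of disease
# status (nondifferential misclassification):
flip = rng.binomial(1, 0.30, size=n)
observed = np.where((exposed == 1) & (flip == 1), 0, exposed)

print(f"OR with true exposure:     {odds_ratio(exposed, disease):.2f}")
print(f"OR with misclassification: {odds_ratio(observed, disease):.2f} (biased toward 1)")
```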
Measurement is as important for the determinant as for the confounders.
When there is measurement error for the confounders, the effect of the
extraneous determinant cannot be fully adjusted for and this leads to what is
called residual confounding. Residual confounding leads to biased estimation of
the determinant–outcome relationship.
TABLE 3–4 Incidence Densities of Coronary Heart Disease for Haem Iron Intake Quartiles
Haem iron intake    Range (mg/day)    Cases/Person-Years
Quartile 1          < 1.28            54/17,413
Quartile 2          1.28–1.76         53/17,384
Quartile 3          1.76–2.27         57/17,334
Quartile 4          > 2.27            88/17,469
To summarize the risk involved with increasing amounts of haem iron intake,
we calculated the hazard ratios, which is the risk of higher intakes compared to a
reference level of intake (see Table 3–5). Usually persons with no exposure, or
with the lowest or highest category of exposure, are considered to be the
reference group. The choice of the reference group depends on the study
question. In our example, we considered those with the lowest intake of haem
iron to be hypothetically the best, and therefore we took the lowest quartile as
the reference category. Sometimes, when, for example, numbers in the extreme
category are very low, other strata are taken as a reference category. This does
not change the inference, but it does affect the relative risk estimates across the
strata and should therefore clearly be indicated. Table 3–5 shows the estimates
of relative risk, displayed with various degrees of confounder adjustment.
TABLE 3–5 Hazard Ratios of Coronary Heart Disease for Increasing Haem Iron Intake
a Adjusted for age at intake (continuous), BMI (continuous), smoking (current/past/never), physical activity
(continuous), hypertension (yes/no), diabetes (yes/no), hypercholesterolemia (yes/no).
b Adjusted for age at intake (continuous), total energy intake (continuous), BMI (continuous), smoking
(current/past/never), physical activity (continuous), hypertension (yes/no), diabetes (yes/no),
hypercholesterolemia (yes/no), energy-adjusted saturated fat intake (continuous), energy-adjusted
carbohydrate intake (continuous), energy-adjusted fiber intake (continuous), energy-adjusted alcohol intake
(quintiles), energy-adjusted β-carotene intake (continuous), energy-adjusted vitamin E intake (continuous),
energy-adjusted vitamin C intake (continuous).
Our study showed that women with the highest haem iron intake had a 1.65
times higher risk of coronary heart disease than women with the lowest intake.
This effect is statistically significant, as the 95% confidence interval for the
hazard ratio (1.07–2.53) does not include 1. While the hazard ratio or the relative
risk represents the likelihood of disease in individuals with the determinant
relative to those without, there is also a measure providing information on the
absolute effect of the determinant, or the excess risk of disease in those with
the determinant compared to those without it. This is the risk difference (or the
attributable risk) and is calculated as the difference of cumulative incidences or
incidence densities. In our example, we could calculate the attributable risk as
[(88/17,469) – (54/17,413)] = 1.9 per thousand women. From a practical or
preventive perspective, it may be useful to estimate the proportion of the
incidence of the outcome that is attributable to the determinant (in this case the
highest quartile of intake): the attributable risk proportion. It is calculated as
[(1.9/1,000)/(88/17,469)] × 100 = 37.7%. It also can be interesting to estimate
the excess rate of the outcome in the total study population that might be
attributed to the determinant. This measure is called the population attributable
risk (PAR), and it illustrates the importance of a specific determinant in the
causation of a disease or outcome. The PAR is calculated as the rate of disease in
the population minus the rate of disease in the subpopulation without the
determinant, or alternatively, as the attributable risk multiplied by the proportion
of individuals with the determinant in the population. In our example, the PAR is
0.0019 × 0.25 = 4.8 per 10,000 women.
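For readers who wish to verify the arithmetic, the snippet below recomputes these measures directly from the Table 3–4 counts; small differences from the figures quoted in the text arise only from rounding.

```python
# Absolute measures of effect from the haem iron example, recomputed from
# the Table 3-4 counts (quartile 4 vs quartile 1, the reference category).
rate_q4 = 88 / 17_469          # incidence density, highest intake
rate_q1 = 54 / 17_413          # incidence density, lowest intake

attributable_risk = rate_q4 - rate_q1               # ~1.9 per 1,000
# The text rounds the attributable risk to 1.9/1,000 before the next step,
# which yields 37.7%; with unrounded values the proportion is ~38.4%.
ar_proportion = attributable_risk / rate_q4
population_attributable_risk = attributable_risk * 0.25   # 25% in top quartile

print(f"attributable risk:             {attributable_risk * 1000:.1f} per 1,000")
print(f"attributable risk proportion:  {ar_proportion:.1%}")
print(f"population attributable risk:  {population_attributable_risk * 10_000:.1f} per 10,000")
```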
WORKED-OUT EXAMPLE
The beneficial effects of moderate alcohol intake on coronary heart disease risk
have been clearly established. Whether there is a similar effect of alcohol intake
on risk of type 2 diabetes is not yet clear. For the study in Box 3–4, data on
alcohol intake as well as information on the occurrence of type 2 diabetes were
collected as part of a large cohort study initially designed to study the role of diet
in cancer occurrence.
BOX 3–4 Alcohol Consumption and Risk of Type 2 Diabetes Among Older Women
OBJECTIVE: This study aimed to investigate the relation between alcohol consumption and type 2
diabetes among older women.
RESEARCH DESIGN AND METHODS: Between 1993 and 1997, 16,330 women aged 49–70
years and free from diabetes were enrolled in one of the Dutch Prospect-EPIC (European Prospective
Study Into Cancer and Nutrition) cohorts and followed for 6.2 years (range 0.1–10.1). At enrollment,
women filled in questionnaires and blood samples were collected.
RESULTS: During follow-up, 760 cases of type 2 diabetes were documented. A linear inverse
association (P = 0.007) between alcohol consumption and type 2 diabetes risk was observed, adjusting
for potential confounders. Compared with abstainers, the hazard ratio for type 2 diabetes was 0.86
(95% CI 0.66–1.12) for women consuming 5–30 g alcohol per week, 0.66 (0.48–0.91) for 30–70 g per
week, 0.91 (0.67–1.24) for 70–140 g per week, 0.64 (0.44–0.93) for 140–210 g per week, and 0.69
(0.47–1.02) for > 210 g alcohol per week. Beverage type did not influence this association. Lifetime
alcohol consumption was associated with type 2 diabetes in a U-shaped fashion.
CONCLUSIONS: Our findings support the evidence of a decreased risk of type 2 diabetes with
moderate alcohol consumption and expand this to a population of older women.
© 2003 American Diabetes Association. Alcohol Consumption and Risk of Type 2 Diabetes Among Older
Women. “Diabetes Care,” Vol 28, 2005; 2933–2938. Reprinted with permission from The American
Diabetes Association.
Theoretical Design
The research question was, “Does moderate alcohol consumption protect against
the development of type 2 diabetes?” This translates into the following
occurrence relation:
Incidence of type 2 diabetes = f (alcohol intake | extraneous determinants)
TABLE 3–6 Baseline Characteristics* by Alcohol Consumption Categories in 16,330 Dutch Women
Data are means ± SD.
* All characteristics are age-adjusted except age.
† P value ≤ 0.001 between alcohol intake categories.
TABLE 3–7 Baseline Alcohol Consumption and Risk of Type 2 Diabetes Among 16,330 Dutch Women
Prognostic Research
INTRODUCTION
A 40-year-old woman diagnosed with rheumatoid arthritis contacts her
rheumatologist for a routine follow-up visit. This woman is well informed about
her disorder, and she has recently learned that patients suffering from
rheumatoid arthritis may be at an elevated risk for infections [Doran et al.,
2002]. She asks her rheumatologist if there is any reason to worry about
infection currently. Her doctor responds by stating that this is indeed a relevant
issue, because the patient has been using corticosteroids since her last visit a
couple of months ago and these medications may well increase infection risk.
To become better informed about her patient’s risk of contracting an infection,
the rheumatologist searches for extra-articular manifestations of rheumatoid
arthritis, such as skin abnormalities (cutaneous vasculitis), which are also
associated with a higher infection risk. She observes none. Still, the
rheumatologist feels uncertain about the probability that future infections will
occur in her patient. She decides to draw blood and send it to the lab for a
leukocyte count. No leukopenia is found. Now the rheumatologist feels
confident enough to reassure her patient and does not schedule more frequent
follow-up visits than those initially planned.
Prognosis in clinical practice can be defined as a prediction of the course or outcome of a certain
illness, in a certain patient. It combines the ancient Greek word πρό, meaning beforehand, and γνῶσις,
meaning knowledge. Although prognoses are all around us, such as weather forecasts and corporate
finance projections, the word has a medical connotation. After setting a diagnosis, and perhaps making
a statement on the surmised etiology of the patient’s illness, making a prognosis (“prognostication”) is
the next challenge a physician faces. Accurate prognostic knowledge is of critical importance to both
patients and physicians. Although perhaps obvious, it must be stressed that a person does not require
an established illness or disease to have a prognosis. For instance, life expectancy typically is a
prognosis relevant to all human beings, diseased and nondiseased. Preventive medicine is concerned
with intervening on those who are still free of disease yet have a higher risk of developing a particular
disease, that is, those with a poor prognosis. In the medical context and context of clinical
epidemiology, however, prognosis is commonly defined as the course and outcome of a given illness
in an individual patient.
Reproduced from: Finster, Mieczyslaw M.D. and Wood, Margaret M.D.; The Apgar Score Has Survived
the Test of Time. Anesthesiology. April 2005. Volume 102. Issue 4. pp 855–857. © 2005 American Society
of Anesthesiologists, Inc. Reprinted with permission from Wolters Kluwer Health.
In practice, the three approaches discussed in this section are often used
implicitly and even simultaneously. It is unlikely that a physician estimates a
prognosis based on a prediction model only. The aim of a prediction model in
any medical field is not to take over the job of the physician. The intention rather
is to guide physicians in their decision making based on more objectively
estimated probabilities as a supplement to any other relevant information,
including clinical experience and pathophysiologic knowledge [Christensen,
2004; Concato et al., 1993; Feinstein, 1994; Moons et al., 2009a; Moons et al.,
2012a].
PROGNOSTICATION IS A
MULTIVARIABLE PROCESS
It is common practice in the medical literature as well as during clinical rounds
to refer to the prognosis of a disease rather than to the prognosis of a patient:
“The prognosis of pancreatic cancer is poor”; “Concussion most often leaves no
lasting neurologic problems”; or, more quantitatively, “Five-year survival in
osteosarcoma approximates 40%.” These so-called textbook prognoses are not
individualized prognoses but merely average ones. They are imprecise because
many patients will deviate substantially from the average, and they are clinically
of limited value because the aim of prognostication—individual risk prediction
—cannot be attained. Typically, the prognosis of an individual patient, for
example, for 5-year survival is determined by a variety of patient characteristics,
not just by a single element such as a diagnosis of osteosarcoma. A combination
of prognostic determinants is often referred to as a risk profile. This profile
usually comprises both nonclinical characteristics such as age and gender, and
clinical characteristics such as the diagnosis, symptoms, signs, possible etiology,
blood or urine tests, and other tests such as imaging or pathology. Thus,
prognosis is rarely adequately estimated by a single prognostic predictor.
Physicians—implicitly or explicitly—use multiple predictors to estimate a
patient’s prognosis [Braitman & Davidoff, 1996; Concato, 2001; Moons et al.,
2009a; Moons et al., 2012a]. Adequate prognostication thus requires knowledge
about the occurrence of future outcomes given combinations of prognostic
predictors. This knowledge in turn requires prognostic studies that follow a
multivariable approach in design and analysis to determine which predictors are
associated, and to what extent, with clinically meaningful outcomes. The results
provide outcome probabilities for different predictor combinations and allow
development of tools to estimate these outcome probabilities in daily practice.
These tools, often referred to as clinical prediction models, prediction rules,
prognostic indices, or risk scores, enable physicians to explicitly transform
combinations of values of prognostic determinants documented in an individual
patient to an absolute probability of developing the disease-related event in the
future [Laupacis et al., 1997; Moons et al., 2009a; Randolph et al., 1998;
Royston et al., 2009; Steyerberg, 2009]. Similar tools based on multiple
determinants are also applied in diagnosis.
TABLE 4–2 Prognostic Score for Preoperatively Predicting the Probability of Severe Early Postoperative
Pain
Reproduced from: Pain, 105, Kalkman CJ, Visser K, Moen J, Bonsel GJ, Grobbee DE, Moons KG.
Preoperative prediction of severe postoperative pain. pp. 415–23. Copyright Elsevier 2003. Reprinted with
permission of the International Association for the Study of Pain® (IASP). The figures may NOT be
reproduced for any other purpose without permission.
Prognostic information is not only used to guide individual decisions but also to make proper
adjustments for “case mix” when comparing the performances of different hospitals. The aim of these
comparisons is to make causal inferences about the care given, that is, to assess whether differences in
performance are due to differences in quality of care. This can only be accomplished if the analyses
are adequately adjusted for the confounding effect of initial prognosis. Prognostic models that are
themselves the results of descriptive research can be helpful in achieving this.
A good example comes from a study by the International Neonatal Network. In this study, a scoring
system to predict mortality in preterm neonates with low birth weight admitted to neonatal intensive
care units was developed [International Neonatal Network, 1993]. The scoring system, denoted as the
CRIB score, included birth weight, duration of gestation, congenital malformations, and several
physiologic parameters measured during the first 12 hours of life. It showed excellent predictive
accuracy with an area under the receiver operating characteristic (ROC) curve of 0.9. Apart from
developing this score for the purpose of helping doctors to make mortality predictions in individual
neonates, the authors aimed to compare the performance of the intensive care units of tertiary hospitals
with those of nontertiary hospitals, as reflected by their relative neonatal mortality rates.
Because the initial prognosis of neonates admitted to tertiary hospitals may be different from that of
neonates referred to nontertiary hospitals, these causal analyses were performed adjusting for the
confounding effect of initial mortality risk as indicated by the CRIB score. It appeared that only after
adjustment for CRIB score did tertiary hospitals show convincingly lower mortality than the
nontertiary hospitals. This example illustrates that adjustment for initial prognosis or “case mix” is
essential when performance audits are carried out. Yet the validity of this approach is highly
dependent on the degree to which the prognostic scores used to adjust for confounding adequately
capture prognosis.
BOX 4–3 Study on the Prognostic Value of Gene-Expression Profiles in Predicting Distant Metastasis in
Patients with Lymph-Node-Negative Primary Breast Cancer
Summary
Background: Genome-wide measures of gene expression can identify patterns of gene activity that
subclassify tumors and might provide a better means than is currently available for individual risk
assessment in patients with lymph-node-negative breast cancer.
Methods: We analyzed, with Affymetrix Human U133a GeneChips, the expression of 22,000
transcripts from total RNA of frozen tumor samples from 286 lymph-node-negative patients who had
not received adjuvant systemic treatments.
Findings: In a training set of 115 tumors, we identified a 76-gene signature consisting of 60 genes for
patients positive for estrogen receptors (ER) and 16 genes for ER-negative patients. This signature
showed 93% sensitivity and 48% specificity in a subsequent independent testing set of 171 lymph-
node-negative patients. The gene profile was highly informative in identifying patients who developed
distant metastases within 5 years (hazard ratio 5.67 [95% CI 2.46–12.4]), even when corrected for
traditional prognostic factors in multivariate analysis (5.55 [2.46–12.15]). The 76-gene profile also
represented a strong prognostic factor for the development of metastasis in the subgroups of 84
premenopausal patients (9.60 [2.28–40.5]), 87 postmenopausal patients (4.04 [1.57–10.4]), and 79
patients with tumors of 10–12 mm (14.1 [3.34–59.2]), a group of patients for whom prediction of
prognosis is especially difficult.
Interpretation: The identified signature provides a powerful tool for identification of patients at high
risk of distant recurrence. The ability to identify patients who have a favorable prognosis could, after
independent confirmation, allow clinicians to avoid adjuvant systemic therapy or to choose less
aggressive therapeutic options.
Reproduced from The Lancet, Vol. 365, Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F,
Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA.
Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. 671-9;
© 2005, reprinted with permission from Elsevier.
PROGNOSTIC RESEARCH
Once it is recognized that the aim of prognostication is to stratify patients
according to their absolute risk of a certain future relevant health event, based on
their clinical and nonclinical profile, the three components of epidemiologic
study design (theoretical design, design of data collection, and design of data
analysis) follow logically.
Theoretical Design
The object of medical prognostication is to predict the future occurrence of a
health-related outcome based on the patient’s clinical and nonclinical profile.
Outcomes may include a particular event such as death, disease recurrence, or
complication, and also continuous or quantitative outcomes such as pain or
quality of life. As noted already, the architecture of prognostic research strongly
resembles that of diagnostic research. The major difference is that time or
follow-up is elementary to prognostic research, whereas diagnostic research is
inherently cross-sectional. The occurrence relation of prognostic research is
given by:
Incidence of the outcome = f (prognostic determinants)
Time
The object of the prognostic process is inherently longitudinal (t > 0).
Accordingly, prognostic research follows a longitudinal design in which the
determinants or prognostic predictors are measured before the outcome is
observed. The time period needed to observe the outcome occurrence or
outcome development may vary from as short as several hours (e.g., in the case
of early postoperative complications) to as long as days, weeks, months, or
years.
Census or Sampling
As the outcomes of prognostic studies are generally expressed in absolute terms,
the design most suitable to address prognostic questions is a cohort study in
which all patients with a certain condition are followed for some time to monitor
the development of the outcome; this uses a census approach. Preferably, the
data are collected prospectively rather than retrospectively because this allows
for optimal measurement of predictors and outcome, as well as adequate
(complete) follow-up. Typically, all consecutive patients with a particular
condition who are at risk for developing the outcome of interest (i.e., who are
part of the domain) are included. The potential prognostic determinants and the
outcome are measured in all patients.
As in diagnostic research, sometimes a case-control design (and thus a
sampling rather than a census approach) is used in prognostic research [Ganna et
al., 2012; Iglesias de Sol et al., 2001]. This is done for efficiency reasons, for
example, when measurement of one or more of the prognostic determinants is
burdensome to patients or is expensive, or when the prognostic outcome is rare.
This design does not allow for an estimation of absolute risks of an outcome
when cases and controls are obtained from a source population of unknown size.
When, however, the sampling fraction of the controls (i.e., the proportion of the
population experience of the entire cohort that is sampled in the controls) is
known, the true denominators, and thus absolute risks, can be estimated by
reconstructing the 2 × 2 table [Biesheuvel et al., 2008; Moons et al., 2009a;
Moons et al., 2012a]. The case-cohort design, a specific type of case-control
study performed within a cohort study, is increasingly being used in prognostic
research because of its efficiency and because it yields absolute probabilities
[Ganna et al., 2012].
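The reconstruction of absolute risks is straightforward once the sampling fraction is known. The sketch below assumes a design in which the controls are sampled, with a known fraction, from the cohort members who remained free of the outcome; all counts are hypothetical and serve only to show the mechanics of rebuilding the 2 × 2 table.

```python
# Sketch: recovering absolute risks from a case-control study with a known
# control sampling fraction. Assumes controls were sampled from subjects who
# remained free of the outcome; all counts are hypothetical.
cases_exposed, cases_unexposed = 80, 120
controls_exposed, controls_unexposed = 300, 900
sampling_fraction = 0.10   # controls are a 10% sample of the non-cases

# Reconstruct the full-cohort denominators of the 2 x 2 table:
n_exposed = controls_exposed / sampling_fraction + cases_exposed
n_unexposed = controls_unexposed / sampling_fraction + cases_unexposed

risk_exposed = cases_exposed / n_exposed
risk_unexposed = cases_unexposed / n_unexposed
print(f"absolute risk, exposed:   {risk_exposed:.3f}")
print(f"absolute risk, unexposed: {risk_unexposed:.3f}")
print(f"risk ratio:               {risk_exposed / risk_unexposed:.2f}")
```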
Experimental or Observational
Almost all prognostic studies outside the realm of intervention research are
observational, where a well-defined group of patients with a certain condition
are followed for a period of time to monitor the occurrence of the outcome. The
researcher observes and measures the nonclinical and clinical parameters
anticipated to be of prognostic significance. These potential prognostic
determinants are not influenced (let alone randomly allocated) by the researcher.
However, as in diagnostic research, one could imagine that prognostic studies
involve experimentation, for example, when comparing the impact on a certain
outcome (e.g., mortality) of the use of two prognostic risk scores by randomly
allocating the two rules to individual physicians or patients [Moons et al., 2012b;
Reilly & Evans, 2006].
Alternatively, however, randomized trials can serve as a vehicle for prognostic
research. Then the study population of the trial is taken as a plain cohort where
the prognostic determinants of interest are just observed and not influenced by
the researcher. Consequently, a prognostic study within a trial bears a greater
resemblance to an observational study than to a typical experimental study. The
issue up for debate is whether one should limit the prognostic analysis to the trial
participants in the reference (or control) group, that is, to those who did not
undergo the randomly allocated prognosis-modifying intervention and perhaps
were given a placebo [Moons et al., 2012a]. In the case of an ineffective
intervention, most researchers will include both the intervention and reference
cohort in the prognostic study, whereas when the intervention is beneficial or
harmful, only the reference group is included. It should be emphasized,
however, that even in cases of no observed overall difference in effect of the
randomly allocated intervention, the intervention can modify the association of
the prognostic determinants with the outcome. To study such effect
modification, one could perform separate prognostic analyses in the two
comparison groups of the trial, guided by tests for interaction between the
intervention and the other prognostic predictors. Certainly, both analyses may
provide clinically useful information: The prognostic study within the placebo
group of a trial will help physicians to accurately estimate the prognosis in a
patient with a certain condition if no intervention is initiated (i.e., the natural
history of a disease or condition) and can be instrumental in deciding about
treatment initiation [Dorresteijn et al., 2011]. A prognostic analysis within the
treated patient group will facilitate quantification of the expected course (in
terms of absolute risks) in an individual patient following treatment. An example
of a prognostic study performed within a trial is shown in Box 4–4, which
attempted to help physicians to identify those children with acute otitis media
prone to experience prolonged complaints (and thus possibly requiring closer
monitoring or antibiotic treatment). Rovers et al. [2007] performed a prognostic
analysis in a data set including the placebo groups of all available randomized
trials assessing the effect of antibiotic treatment in children with acute otitis
media. An obvious advantage of such an analysis of a trial is the availability of
high-quality data. On the other hand, however, the findings may have restricted
generalizability due to the strict inclusion and exclusion criteria applied in the
trials [Kiemeney et al., 1994; Marsoni & Valsecchi, 1991; Moons et al., 2012a].
Moreover, the high-quality data on prognostic determinants may be a mixed
blessing, because in real-life application the available clinical information
may be of lower quality and the predictors thus will show reduced prognostic
performance.
Study Population
The study population in prognostic research should be representative of the
domain. Prognostic predictors, models, or strategies are investigated for their
ability to predict a future health outcome as accurately as possible. Accordingly,
and as noted before, the domain of a prognostic study is comprised of
individuals who are at risk for developing that outcome. Patients who have
already developed the outcome or in whom the probability is considered so low
(“zero”) that the physician does not even consider estimating this probability fall
outside the domain, because subsequent patient management (e.g., to initiate or
refrain from therapeutic actions) is evident. Furthermore, as in diagnostic
research, we recommend restricting domain definitions and thus study
populations in prognostic research to the setting of care (notably primary or
secondary care) of interest, due to known differences in predictive accuracy of
determinants across care settings [Knottnerus, 2002a; Oudega et al., 2005a;
Moons et al., 2009b; Toll et al., 2008]. Finally, the selection or recruitment of
any study population is often further restricted by logistical circumstances, such
as the necessity for patients to live near the research center or the availability of
their time to participate in the study. These characteristics are often unlikely to
influence the applicability and generalization of study findings. It may be
challenging to appreciate which characteristics truly affect the generalizability of
results obtained from a particular study population. This appreciation usually
requires knowledge of those characteristics (effect modifiers) that may modify
the nature and strength of the estimated associations between the prognostic
determinants and outcome. Therefore, generalizability from study population to
the relevant domain is not an objective process that can be framed in statistical
terms. Generalizability is a matter of reasoning, requiring external knowledge
and subjective judgment. The question to be answered is whether in other types
of subjects from the domain who were not represented in the study population
the same prognostic predictors would be found with the same predictive values
[Moons & Grobbee, 2005].
BOX 4–4 Predictors of a Prolonged Course in Children with Acute Otitis Media: An Individual Patient
Meta-Analysis
Background: Currently there are no tools to discriminate between children with mild, self-limiting
episodes of acute otitis media (AOM) and those at risk of a prolonged course.
Methods: In an individual patient data meta-analysis with the control groups of 6 randomised
controlled trials (n = 824 children with acute otitis media, aged 6 months to 12 years), we determined
the predictors of poor short term outcome in children with AOM. The primary outcome was a
prolonged course of AOM, which was defined as fever and/or pain at 3–7 days.
Main findings: Of the 824 included children, 303 (37%) had pain and/or fever at 3–7 days.
Independent predictors for a prolonged course were age < 2 years and bilateral AOM. The absolute
risks of pain and/or fever at 3–7 days in children aged less than 2 years with bilateral AOM (20% of
all children) was 55%, and in children aged 2 years or older with unilateral AOM 25% (47% of all
children).
Interpretation: The risk of a prolonged course was two times higher in children aged less than 2 years
with bilateral AOM than in children aged 2 years or older with unilateral AOM. Clinicians can use
these features to advise parents and to follow these children more actively.
Reproduced with permission from Pediatrics, Vol. 119, 579–85, Copyright © 2007 by the AAP. Rovers
MM, Glasziou P, Appelman CL, Burke P, McCormick DP, Damoiseaux RA, Little P, Le Saux N, Hoes
AW.
Outcome
The outcome in prognostic research is typically dichotomous: the occurrence, in
this case the incidence (yes/no) of the event or disease course of interest. In
addition, prognostic outcomes may comprise continuous variables such as tumor
growth, pain, or quality of life, rather than incidence or nonoccurrence of a
particular event. In both instances, we recommend that the researcher studies
outcomes that really matter to patients, such as remission of disease, survival,
complications, pain, or quality of life. One preferably should not study so-called
proxy or intermediate outcomes such as joint space in patients with osteoarthritis
of the knee (instead of pain, the ability to walk, or quality of life), unless a clear
relationship between such an intermediate outcome and outcomes more relevant
for patients has been established. The latter may apply for the use of CD4 count
as a prognostic outcome (rather than the occurrence of AIDS or even death) in
HIV studies.
As in all research, criteria defining the absence or presence of the outcome as
well as the measurement tools used should be described in detail. Importantly,
the outcome occurrence is assessed as accurately as possible, with the best
available methods to prevent misclassification, even if this requires measures
that are never taken in clinical practice.
The time period during which the outcome occurrence is measured requires
special attention. Predicting an outcome occurrence over a 3-month period
typically yields different predictors or different predictor–outcome associations
than prediction of the same outcome after 5 years. As with weather and stock
value forecasting, prediction over a shorter period is commonly less problematic
than prediction over a longer time period.
Finally, as in most research, outcomes should ideally be measured without
knowledge of the value of the predictors under study to prevent self-fulfilling
prophecies, particularly if the outcome measurement requires observer
interpretation. For example, the presence of those determinants believed to be
associated with the prognostic outcome may influence the decision to consider
the outcome to have occurred. This bias can cause under- or overestimation of
the accuracy of predictors, but it more commonly leads to overestimation; it can
be prevented by blinding the assessors of the outcome to the values of the
prognostic determinants [Loy & Irwig, 2004; Moons et al., 2002c; Moons et al.,
2009a]. Blinding is not necessary for mortality or other outcomes that can be
measured without misclassification.
Confounding Bias
In prognostic research, the interest is in the joint predictive accuracy of multiple
predictors. As stated earlier, there is no central determinant for which the
relationship to the outcome should be causally isolated from other outcome
predictors, as in causal research. Confounding thus is not an issue in prognostic
research, as in all types of prediction research.
Other Biases
While confounding does not play a role in prediction research, other biases
certainly do. Bias that may occur when the outcome assessor is aware of the
determinants was discussed in the last paragraph. In addition, loss to follow-up,
and thus nonassessment of the outcome, that is not completely at random
(MCAR) but rather selective likely leads to biased estimates of the prognostic
or predictive value of the predictors under study if the analysis is restricted to
only those individuals in whom the outcome was assessed.
Selectively missing outcomes means that the subsample of the original study
population with the observed outcomes are different from the subsample with
the missing outcomes [de Groot et al., 2011a]. This bias can be addressed or
minimized using several methods [de Groot et al., 2008; de Groot et al., 2011b,
de Groot et al., 2011c], including the use of multiple imputation techniques
[Groenwold et al., 2012]. Bias due to selective loss to follow-up may also occur
[Groenwold et al., 2012; Little et al., 2012].
Analysis Objective
The aims of the data analysis in multivariable prognostic research are similar to
multivariable diagnostic research, except for the dimension of time: to provide
knowledge about which potential predictors independently contribute to the
outcome prediction, and to what extent. Also, one may aim to develop and
validate a multivariable prediction model or rule to predict the outcome given
the values of a combination of predictors. The methods to determine the required
number of subjects and the data analysis steps of prognostic studies are similar
to diagnostic studies. For example, to guide decision making in individual
patients, the analysis and reporting of prognostic studies concentrates on
absolute risk estimates (in prognostic studies on incidence and in diagnostic
studies on prevalence) of an outcome given combinations of predictors and their
values. In view of the large similarities between the analysis of prognostic and
diagnostic studies, we will concentrate on the differences that exist between the
two types of studies.
Different Outcomes
In contrast to diagnostic research where the outcome is largely dichotomous,
prognostic research can distinguish between various types of outcomes. The first
and most frequently encountered type of outcome is the occurrence (yes/no) of
an event within a specific, preferably short, period of time [Moons et al., 2012a;
Steyerberg, 2009]. For example, one might study the occurrence of a certain
complication within 3 months, where ideally each included patient has been
followed for at least this period. The cumulative incidence, expressed as a
probability between 0% and 100%, of the dichotomous outcome at a certain time
point (t) is to be predicted using predictors measured before t. For these
outcomes, the analysis is identical to the analysis in diagnostic research. The
second most common outcome in prognostic research is the occurrence of a
particular outcome event over a (usually) longer period of time, where the
follow-up time may differ substantially between study participants. Here, the
time to occurrence of the event can be predicted using the Kaplan-Meier method
or Cox proportional hazard modeling. It is also possible to predict the absolute
risk of a certain outcome within multiple time frames (e.g., 3 months, 6 months,
1 year, and 3 years), although the maximum time period is determined by the
maximum follow-up period of the included patients (see also the Worked-Out
Example at the end of this chapter). Other, less regular outcomes in prognostic
prediction studies are continuous variables [Harrell, 2001], such as the level of
pain or tumor size at t, and—as in diagnostic research—polytomous (nominal)
outcomes [Biesheuvel et al., 2008] or ordinal outcomes [Harrell et al., 1998]. An
example of the latter is the Glasgow Outcome Scale collapsed into three ordinal
levels: death, survival with major disability, and functional recovery [Cremer et
al., 2006].
Required Number of Subjects
As for diagnostic research, the multivariable character of prognostic research
creates problems for estimating the required number of study subjects; there are
no straightforward commonly accepted methods. Ideally, prognostic studies
include several hundreds of patients that develop the outcome event [Harrell,
2001; Moons et al., 2009a; Simon & Altman, 1994]. As with dichotomous outcomes analyzed with multivariable logistic regression analysis, experience has shown that for the analysis of time to event outcomes using Cox proportional hazard modeling, at least 10 subjects in the smallest of the outcome categories (i.e., either with or without the event during the study period) are needed per candidate predictor for proper statistical modeling [Concato et al., 1995; Peduzzi et al., 1995]. Such
rules are largely lacking for ordinal and polytomous outcomes [Harrell, 2001].
For continuous outcomes, the required number of subjects may be estimated
crudely by performing a sample size calculation for the t-test situation where the
two groups are characterized by the most important dichotomous predictor.
Another approach, more directed at the use of multiple linear regression
modeling, is to define the allowable limit in the number of covariates (or rather,
degrees of freedom) for the model by dividing the total number of study subjects
by 15 [Harrell, 2001]. For more sophisticated approaches, we refer readers to an
article by Dupont and Plummer [1998].
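As a minimal sketch of these rules of thumb (hypothetical helper functions, not a formal sample size method):

# Rough heuristics described above: roughly 10 events per candidate
# predictor for dichotomous or time-to-event outcomes, and total
# subjects divided by 15 as the allowable degrees of freedom for a
# linear (continuous-outcome) model.
def max_candidate_predictors(n_smallest_outcome_category: int) -> int:
    return n_smallest_outcome_category // 10

def max_model_df_continuous(n_subjects: int) -> int:
    return n_subjects // 15

print(max_candidate_predictors(120))   # 12 candidate predictors
print(max_model_df_continuous(450))    # 30 degrees of freedom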
Statistical Analysis
Modeling of the cumulative incidence of a dichotomous outcome at a specific
time t using logistic regression is discussed elsewhere in the text. For time to
event outcomes, also denoted as survival-type outcomes, the univariable analysis
can be performed using the Kaplan-Meier method. Similar to the analysis of
dichotomous outcomes, the observed probabilities depend on the threshold
values of the predictor. Unfortunately, the construction of a receiver operating
characteristic (ROC) curve is not straightforward because the outcomes of the
censored patients are unknown. The so-called concordance-statistic (c-statistic or
c-index), however, can be easily calculated and its value has the same
interpretation as the area under the ROC curve [Harrell, 2001]. For the
multivariable analysis of time to event data using Cox proportional hazard
modeling, we refer to the Worked-Out Example at the end of the chapter.
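A minimal sketch of the c-statistic for censored data, assuming the open-source lifelines package and invented toy numbers:

# Harrell's concordance index for censored time-to-event data; its
# value is interpreted like the area under the ROC curve.
from lifelines.utils import concordance_index

times = [5, 8, 12, 12, 20, 24]            # follow-up times
risk = [2.1, 1.7, 0.9, 1.2, 0.4, 0.3]     # higher score = worse prognosis
events = [1, 1, 0, 1, 0, 0]               # 1 = event, 0 = censored

# concordance_index expects higher predictions to mean longer survival,
# so the risk score is negated.
print(concordance_index(times, [-r for r in risk], events))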
When the outcome is continuous (for example, tumor size), univariable and
multivariable analyses are usually carried out using linear regression modeling.
The discriminatory power of a linear regression model can be assessed from the
squared multiple correlation coefficient (R2), also known as the explained
variance [Harrell et al., 1996; Harrell, 2001]. This measure unfortunately is not
intuitively understood. Detailed information on the analysis of continuous as
well as ordinal and polytomous outcomes is available in the literature
[Biesheuvel et al., 2008; Harrell, 2001; Roukema et al., 2008].
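A minimal sketch (simulated data) of the explained variance for a continuous outcome:

# R^2 (explained variance) of a linear regression model for a
# continuous outcome such as tumor size; data are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # three predictors
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=200)

model = LinearRegression().fit(X, y)
print(f"explained variance R^2: {model.score(X, y):.2f}")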
TABLE 4–3 Reclassification of Cardiovascular Risk with Carotid Artery Intima-Media Thickness Added
to Framingham Risk Score: Findings from the USE-IMT Consortium
A Distribution of 45,828 individuals without and with events in USE-IMT across risk categories
A, Individuals without and with events classified according to their 10-year absolute risk to develop a
myocardial infarction or stroke predicted with the Framingham Risk Score variables or classified according
to their 10-year absolute risk to develop a first-time myocardial infarction or stroke predicted with the
Framingham Risk Score and a common carotid intima-media thickness (CIMT) measurement. B, Observed
Kaplan-Meier absolute risk estimates for all individuals (with and without events). The observed risk in
reclassified individuals is significantly different from the observed risk of the individuals in the gray cells.
Reproduced from den Ruijter H et al. Common carotid intima-media thickness measurements in
cardiovascular risk prediction. A meta-analysis. JAMA. 2012;308(8):796–803.
TABLE 4–4 Summary of the Indices of Added Value in the Total USE-IMT Cohort and in the Intermediate
Risk Categories, by Sex: Findings from the USE-IMT Consortium
CI, confidence interval; IDI, integrated discrimination improvement; NRI, net reclassification improvement;
USE-IMT, USE Intima-Media Thickness collaboration.
Reproduced from den Ruijter H et al. Common carotid intima-media thickness measurements in
cardiovascular risk prediction. A meta-analysis. JAMA. 2012;308(8):796–803.
WORKED-OUT EXAMPLE
This example is based on a study conducted by Spijker and colleagues [2006]. It
illustrates the design of data analysis in the case of time to event data, which
includes how to obtain absolute risks from a Cox proportional hazard model,
how to shrink coefficients, how to assess discriminatory power, and how to
calculate theoretical sensitivity and specificity using the predictive values.
Useful methodologic considerations underlying this example can be found in the
literature [Altman & Andersen, 1989; Harrell, 2001; Moons et al., 2012b;
Steyerberg, 2009; Steyerberg et al., 2000; Steyerberg et al., 2001; Van
Houwelingen & Le Cessie, 1990; Vergouwe et al., 2002].
BOX 4–5 Guide to the Main Design and Analysis Issues for Prognostic Studies
Design
• Objective: To develop a model/tool to enable objective estimation of
outcome probabilities (risks) according to different combinations of
predictor values.
• Study participants: Individuals with the same characteristic, for example,
individuals with a particular symptom or sign suspected of a particular
disease or with a particular diagnosis, at risk of having (diagnostic
prediction model) or developing (prognostic prediction model) a specific
health outcome.
• Sampling design: Cohort, preferably prospective to allow for optimal
documentation of predictors or outcomes, including a cohort of individuals
that participate in a randomized therapeutic trial. Case-control studies are
not suitable, except nested case-control or case-cohort studies.
• Outcomes: Relevant to individuals and preferably measured without
knowledge of the measured predictor values. Methods for outcome
ascertainment, blinding for the studied predictors, and duration of follow-
up (if applicable) should be clearly defined.
• Candidate predictors: Theoretically, all potential and not necessarily
causal correlates of the outcome of interest. Commonly, however,
preselection based on subject matter knowledge is recommended. Similar
to the outcomes, candidate predictors are clearly defined and measured in a
standard and reproducible way.
Analysis
• Missing values: Analysis of individuals with only completely observed
data may lead to biased results. Imputation, preferably multiple imputation,
of missing values often yields less biased results.
• Continuous predictors: Should not be turned into dichotomies and linearity
should not be assumed. Simple predictor transformation can be
implemented to detect and model nonlinearity, increasing the predictive
accuracy of the prediction model.
• Predictor selection in the multivariable modeling: Selection based on
univariable analysis (single predictor–outcome associations) is
discouraged. Preferably, if needed, backward selection or a full model
approach should be used, depending on a priori knowledge.
• Model performance measures: Discrimination (e.g., c-index), calibration
(plots), and (re)classification measures.
• Internal validation: Bootstrapping techniques can quantify the model’s
potential for overfitting, its optimism in estimated model performance
measures, and a shrinkage factor to adjust for this optimism.
• Added value of predictor/test/marker: Should be pursued for subsequent
(or new) predictors, certainly if their measurement is burdensome and
costly. Because overall performance measures (e.g., c-index) are often
insensitive to small improvements, reclassification measures may be used
for this purpose.
Reproduced from Moons KGM, et al. Risk prediction models: I. Development, internal validation, and
assessing the incremental value of a new (bio)marker. Heart. BMJ (2012), with permission from BMJ
Publishing Group Ltd.
Theoretical Design
The study objective was to construct a score that allows prediction of MDE
persistence over 12 months in individuals with MDE, using potential
determinants of persistence identified in previous research. The prognostic
determinants considered were measures of social support, somatic disorders,
depression severity and recurrence, and duration of previous episodes.
The occurrence relation can be represented as the persistence of MDE at 12 months as a function of social support, somatic disorders, depression severity and recurrence, and duration of previous episodes. The domain in this study was confined to those individuals from the general population with MDE.
In the design of data analysis, absolute probabilities of persistence are derived from the Cox model through the survival function

S(t) = S0(t)^exp(LP)

where S0(t) is the baseline survival function at time t and LP is the linear predictor, that is, the weighted sum of the patient's predictor values.
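As a minimal numeric sketch of this formula (all numbers hypothetical, not taken from the study):

# Absolute risk at t = 12 months from a Cox model: with a baseline
# survival S0(12) and a patient's linear predictor LP,
# risk = 1 - S0(12) ** exp(LP). Numbers below are hypothetical.
import math

s0_12 = 0.80                      # baseline survival at 12 months
lp = 0.35                         # patient's linear predictor

risk_12 = 1 - s0_12 ** math.exp(lp)
print(f"predicted 12-month risk of persistence: {risk_12:.0%}")   # about 27%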
Results
Follow-up time ranged from 2 weeks to 24 months and 187 subjects out of the
total population (N = 250) recovered. The final proportional hazards regression
model appeared to be reasonably calibrated as the predicted and observed
probabilities were similar over the entire range (see Figure 4–1).
FIGURE 4–1 Calibration plot of the Cox proportional hazards model for the prediction of depression
persistence at 12 months of follow-up. The dotted line represents the line of identity, that is, a perfectly calibrated model.
Reproduced with permission from Spijker J, de Graaf R, Ormel J, Nolen WA, Grobbee DE, Burger H. The persistence of depression score. Acta Psychiatr Scand 2006;114:411–6.
The shrinkage factor for the coefficients that was obtained from the bootstrap
process was 0.91. The results presented are based on the findings after
shrinkage. Coefficients from the model as well as the hazard ratios as measures
of relative risk are displayed in Table 4–5, together with the risk points per
predictor.
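How such a shrinkage factor can be obtained is sketched below (our illustration on simulated data with the statsmodels package, not the authors' code): each bootstrap model's linear predictor is evaluated in the original sample, and the average calibration slope serves as the shrinkage factor.

# Bootstrap estimation of a shrinkage factor for a logistic prediction
# model, in the spirit of Van Houwelingen & Le Cessie [1990]; data and
# model are simulated, not those of the depression study.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 300, 5
X = sm.add_constant(rng.normal(size=(n, p)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 1] + 0.5 * X[:, 2]))))

slopes = []
for _ in range(200):
    idx = rng.integers(0, n, n)                       # bootstrap sample
    boot = sm.Logit(y[idx], X[idx]).fit(disp=0)       # refit the model
    lp = X @ boot.params                              # LP in original data
    cal = sm.Logit(y, sm.add_constant(lp)).fit(disp=0)
    slopes.append(cal.params[1])                      # calibration slope

print(f"shrinkage factor: {np.mean(slopes):.2f}")     # multiply coefficients by this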
Table 4–6 shows the relationship between categories of the score, the
observed risk, and the predicted risk of MDE persistence after 1 year. The mean
risk was 23% and the predicted risks increased from 7% to 40% with increasing score categories and were generally in agreement with the observed risk. From Table 4–6, it can also be seen that the patient introduced earlier has a 29% risk
of persistence of depression. The overall discriminatory power of the score was
fair, with a c-statistic of 0.68. For specific cut-offs, the sensitivity, specificity,
and predictive values are also shown.
If, for instance, a cut-off ≥ 5 is chosen as the threshold for a high risk of persistence, and thus as the indication for more intense treatment, 69% (the sensitivity) of those who would still suffer depression after 1 year will have received this treatment; however, 12% (1 − NPV) of those who did not undergo the more intense treatment because their test was negative will have persisting MDE.
Reproduced from Spijker J, de Graaf R, Ormel J, Nolen WA, Grobbee DE, Burger H. The persistence of depression score. Acta Psychiatr Scand 2006;114:411–6.
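A minimal sketch (toy data, not the study data) of how sensitivity, specificity, and predictive values follow from such a score cut-off:

# Sensitivity, specificity, and predictive values for a "score >= 5"
# threshold; the score and outcome values below are invented.
import numpy as np

score = np.array([2, 7, 5, 3, 8, 1, 6, 4, 9, 2])      # risk score
persist = np.array([0, 1, 1, 0, 1, 0, 0, 0, 1, 0])    # MDE persists at 1 yr

pos = score >= 5
tp, fp = np.sum(pos & (persist == 1)), np.sum(pos & (persist == 0))
fn, tn = np.sum(~pos & (persist == 1)), np.sum(~pos & (persist == 0))

print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("PPV:", tp / (tp + fp))
print("1 - NPV:", fn / (fn + tn))      # missed persistence among test-negatives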
CONCLUSION
Prognostic research shows great similarity to diagnostic research; in fact,
prognoses can be seen as diagnoses in the future. Most importantly, they are
both variants of prediction research. To ensure applicability of prognostic (and
diagnostic) research in clinical practice, several prerequisites should be met.
INTRODUCTION
Effective treatment is the stronghold of modern medicine. Despite all other types
of care clinical medicine has to offer, patients and physicians alike expect first
and foremost that diseases can be cured and symptoms relieved by appropriate
interventions. Evidence-based treatment—or prevention for that matter—
demands the unequivocal demonstration by empirical research of the efficacy
and safety of the intervention. In general, all interventions are characterized by
intended and unintended effects, where the intended effects (main effects) are
those for which the treatments are given. However, interventions also have
unintended effects. These may range from the relatively trivial discipline required of the patient to adhere to the intervention to potentially life-threatening adverse
effects. Ideally, intended effects should be highly common, predictable, and
large, and unintended effects rare and mild. Drugs and other interventions vary
markedly with regard to the relative frequency and severity of unintended
effects, just as they vary in effectiveness with regard to their intended effects.
Intervention research aims to quantify the full spectrum of relevant effects of
intervention. However, the approaches used for demonstrating the intended or
primary effects generally differ from those for demonstrating safety. This
chapter concentrates on intended effects.
Research on the benefits and risks of interventions is central to current clinical
epidemiologic research. For centuries, the field of medicine was very limited in
terms of what it had to offer for adequate treatment. This has dramatically
changed in recent decades. Pharmacopeias are rapidly expanding and surgical techniques continue to advance, with an increasing emphasis on less invasive approaches. In medicine, intervention is a general term for a deliberate
action intended to change the prognosis in a patient and includes drug treatment,
surgery, physiotherapy, lifestyle interventions such as physical exercise, and
preventive actions such as vaccination. To treat a patient with confidence, the
physician needs to know about the potential benefit of the treatment (i.e., the
intended or main effects of the intervention), which must be weighed against
possible risks (i.e., the unintended or side effects of the intervention). The
deliberate decision not to treat or to postpone treatment can be viewed as an
intervention itself. Increasingly, cost considerations also play a role when
choices are made between different treatment options. Money is not only an
issue from the perspective of the fair and efficient use of available resources; it is
also an important driving force for the development and marketing of new
treatments. Pharmaceutical companies and manufacturers producing medical
devices increasingly emphasize their compassion for patients as a motive for
their search for new compounds, but they typically—and understandably—are
primarily focused on their shareholders and profits. This elevates research on
treatment effects to an arena in which huge interests play a role. As a
consequence, much more than in any other area of medical research, the quality
and reliability of intervention research has been the topic of major interest and
development. The result is a highly sophisticated set of principles and methods
that guides intervention research.
In intervention research, the principles of causal and descriptive research
combine. Intervention research is commonly causal research, because it is the
true effect of the intervention (i.e., caused by the intervention) that needs to be
estimated free from confounding variables. Intervention research commonly is
also prognostic; in order to use an intervention in medical practice, it is
important to know as precisely as possible both the beneficial and untoward
impact the intervention may have on an individual patient’s prognosis. For
example, for a given drug, 1-year mortality may be expected to decrease from
30% to 10% (intended or main effect), while the risk of developing orthostatic
hypotension (unintended or side effect) is 10%.
To serve clinical decisions of treatment best, intervention research in general
and clinical trials in particular should be viewed as the means to measure the
effects of interventions on prognosis. It is generally not sufficient to know
whether a treatment works. What is needed is a valid estimate of the size of the
effects. In clinical epidemiologic intervention research, randomized controlled
trials (RCTs) play an essential role, not only because they are often considered
the only approach to definitively demonstrate the magnitude of benefits of
treatment, but also because RCTs offer a role model for causal research. The
principles of the design of randomized trials are quite straightforward. When
appropriately understood, they also will greatly help to improve causal research
under those circumstances where a randomized trial cannot be conducted. To
understand the nature of randomized trials is to understand unconfounded
observation.
INTERVENTION EFFECTS
The challenges of measuring the effects of an intervention can be illustrated by a
simple example in which a physician is considering using a new drug to treat
high blood pressure in a group of his patients. The drug has been handed to him
by a sales representative, who promised a rapid decline in blood pressure for
most patients, with excellent tolerability. Let us assume that the physician
decides to try out the drug on the next 20 or so patients who visit his office with
a first diagnosis of hypertension. He carefully records each patient’s baseline
blood pressure level and asks them to return a number of times for re-
measurement in the next weeks. His experience with these patients is
summarized in Figure 5–1.
The physician is satisfied. A gradual decline in systolic blood pressure is
shown in his patients. Moreover, most were very pleased with the drug because
the treatment had few side effects; one patient mentioned the development of
mild sleeping disturbances. Would it be wise to conclude that the drug works, is
well tolerated, and can now become part of routine treatment with confidence?
Clearly not. There are a number of reasons why the observed response may not
adequately reflect the effect caused by the drug. In order to use the drug in
similar future patients, it is necessary to ensure that the response in fact resulted
from the pharmacologic agent and does not reflect other mechanisms. Although
a patient may not care why the reduction occurred as long as the hypertension
was treated, from a sensible medical viewpoint it is necessary to know whether
the effect can be attributed to the drug. If it is not, then additional costs are
generated, the patient is medicalized, and side effects may be induced without a
sound scientific justification. Let us examine alternative explanations for the
observation made by the physician.
FIGURE 5–1 Hypothetical patient blood pressure data.
Children and parents had the same mean height of 68.2 inches. The ranges
differed, however, because the mid-parent height was an average of two
observations and thus had a smaller range. Now, consider those parents with a
relatively high mid-height between 70 and 71 inches. The mean height of their
children was 69.5 inches, which was closer to the mean height of all children
than the mean height of their parents was to the mean height of all parents.
Galton called this phenomenon regression toward mediocrity. The term was
coined with this report, but the observation is different from what is currently
considered regression toward the mean because this concerned the full
population without selection. The principle, however, is the same.
Regression toward the mean is not a phenomenon exclusive to epidemiologic research. Consider, for example, students who take a clinical epidemiology
exam. Students who receive an unexpected, extremely low score will probably
get a better score when they repeat the exam, even when they put no further
effort into understanding the topic. It is likely that some bad luck was involved in the exceptionally low score, and such bad luck is unlikely to strike twice in a row, given the student's usually higher scores. It is a common
mistake in everyday life to assign a causal role to something apparently related
to the observed effect that in reality is likely due to regression toward the mean.
Take, for example, the case of the poor badminton champion from Kuala
Lumpur (see Box 5–1). Some of this champion's predecessors very likely performed above their usual level because of a lucky play of chance, and their subsequent downfall was attributed to the “spoiling” by gifts
of appreciation.
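The exam example can be demonstrated with a few lines of simulation (ours, with arbitrary parameters):

# Regression toward the mean: students selected for an extremely low
# first score tend to score closer to the average on an independent
# second attempt, without any extra effort.
import numpy as np

rng = np.random.default_rng(42)
ability = rng.normal(70, 5, size=10_000)              # true ability
exam1 = ability + rng.normal(0, 8, size=10_000)       # score = ability + luck
exam2 = ability + rng.normal(0, 8, size=10_000)

worst = exam1 < np.percentile(exam1, 5)               # extreme low first scores
print("selected students, exam 1:", exam1[worst].mean())   # far below 70
print("same students, exam 2:", exam2[worst].mean())       # much closer to 70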
FIGURE 5–3 Comparison of the heights of children to their parents made by Francis Galton (1822–1911).
Diagonal line shows the average height.
Reproduced from Bland JM, Altman DG. Statistics notes: regression towards the mean. BMJ 1994;308:1499
with permission from BMJ Publishing Group Ltd.
KUALA LUMPUR: Prime Minister Datuk Seri Dr. Mahathir Mohamad congratulated Malaysian
shuttler Mohd Hafiz Hashim for his achievement but warned that he should not be “spoilt” with gifts
like previous champions.
Dr. Mahathir said people should remember what had happened to previous champions when they were
spoilt with gifts of land, money and other items.
“I hope the states will not start giving acres of land and money in the millions, because they all seem
not to be able to play badminton after that,” he said after taking part in the last dry run and dress
rehearsal for the 13th NAM Summit at the PWTC yesterday.
Modified from “Mahathir asks states not to ‘spoil’ Hafiz,” The Star Online, 2/18/2003.
Regression toward the mean is but one component of natural history and it is
an entirely statistical phenomenon. There are many other factors that may
influence natural history that are linked to the outcome by some
pathophysiologic mechanism. When this is known, we may try to adjust our
observation based on this knowledge. Typically, however, determinants of
natural history are unknown and cannot simply be subtracted from the observed
effect.
Extraneous Effects
A second reason why the physician observed a response following drug
treatment (but one that is not a result of drug treatment) may be that other
determinants of blood pressure changed concomitantly. The patients were told
that they had high blood pressure and that this is a risk factor for stroke and
myocardial infarction that should be treated. This information could motivate
patients to try to adjust their lifestyle. They may have improved their diet, started
exercising, or reduced alcohol intake. All of these actions also may have reduced
the blood pressure. These effects are called extraneous because they are outside
of the effect of interest, namely the drug effect. In a study, we may attempt to
measure extraneous effects and take these into account in the observation, but
this requires that the effects be known and measurable.
There is one particularly well-known extraneous effect that is so closely
linked to the intervention that it generally cannot be directly measured or
separated from the drug effect: the placebo effect. Placebo effects can result
simply from contact with physicians when a diagnosis or simple attention from a
respected professional alleviates anxiety. As Hróbjartsson [1996] put it, “Any
therapeutic meeting between a conscious patient and a doctor has the potential of
initiating a placebo effect.” In research, obtaining informed consent has been
shown to induce a placebo effect. There is a wealth of literature on placebo
effects and considerable dispute on the mechanism of action. Clearly,
psychological mechanisms are likely to play a role, and certain personality
characteristics have been particularly related to strong placebo responses
[Swartzman & Burkell, 1998]. In addition, other, seemingly pharmacologic
phenomena are related to placebo responses. For example, placebo-induced analgesia can be reversed by naloxone, an opioid antagonist
[Fields & Price, 1997]. Obviously, the type of outcome that is being studied is
related to the presence and magnitude of a placebo response. Outcomes that are
more subjective, such as anxiety or mood, will be more prone to placebo effects.
Expectation also powerfully influences how subjects respond to either an inert or
active substance. In a study where subjects were given sugar water but were told
that it was an emetic, 80% of patients responded by vomiting [Hahn, 1997].
Placebo effects are, to a greater or lesser extent, an inherent component of
interventions and they will obscure the measurement of the intervention effect of
interest, such as the pharmacologic action of a drug. This may or may not be a
problem in intervention research. Again, from the perspective of the patient, it
does not really matter whether the relief results in part from a placebo effect of
the drug. Cure is cure. Similarly, from the viewpoint of the physician, the
placebo effect may be a welcome additional benefit of an intervention. Even for
an investigator studying the benefits of treatment, the placebo effect can be
accepted as something that is inseparable from the drug effect and therefore
should be included in the overall estimate of the benefit of one treatment
compared to another (e.g., nondrug) treatment strategy. Different treatments may
have different placebo effects and this will also explain differences in benefits
when employed in real life. In other words, the need to exclude placebo effects
in research on benefits and risks of interventions is not a given and depends on
the objectives of the investigator. Although many believe that the best evidence
for treatment effects comes from trials in which a placebo effect has been ruled
out by comparing treatment to placebo treatment, there are good examples of
research where potential placebo effects were included in the measured
treatment effect that provide a more meaningful result than when placebo effects
were removed. The motives and consequences of research that does or does not
separate the pharmacologic from the placebo effects were well outlined in a
classic paper by Schwartz and Lellouch [1967] on pragmatic and explanatory
trials. Their article gives an example from a real case in which a decision needed
to be made between different options to determine the benefits of a drug aimed
at sensitizing cancer patients for required radiotherapy. The assumption was that
when patients were pretreated with the drug, the effect of the radiotherapy was
enhanced. The investigators decided to do a randomized comparison between
usual therapy and the new treatment scheme. For the usual therapy arm of the
study, there were two options (see Figure 5–4, taken from the original report by
Schwartz and Lellouch). One option was to just treat the patients as usual, which
implied the immediate treatment with radiotherapy. The alternative option was
to first give a placebo drug and then start radiotherapy. In the second option,
placebo effects from the drug would be removed from the comparison. However,
radiotherapy would be put at a disadvantage because, compared with the approach in daily practice, its start would be delayed. In contrast, in
the first option the new approach would be compared to the optimal way of
delivering radiotherapy without the sensitizing drug, but placebo effects could
not be ruled out. Given that the new drug was not without side effects, a
distinction between the pharmacologic and placebo benefits seemed important.
There is no single best solution to this problem. Probably, when little is known
about a drug, first a comparison with placebo is necessary to determine the true
pharmacologic action devoid from placebo effects. Next, the researcher can
establish its value in real life as compared with the best standard treatment, in
this case immediate radiotherapy. The result of either comparison also
determines the relevance of the answer.
FIGURE 5–4 Trial arms where placebo effects are removed (explanatory) and where the placebo effect
was considered to be part of the overall treatment (pragmatic).
Reproduced from Schwartz D, Lellouch J. Explanatory and pragmatic attitudes in therapeutic trials. J Chron Dis 1967;20:637–48, with permission from Elsevier.
Suppose that in the blinded comparison radiotherapy (without the new drug) is still shown to be superior. A comparison with immediate radiotherapy is then not needed because, if anything, immediate radiotherapy would be even more beneficial than the delayed radiotherapy of the placebo arm. In their article, the authors propose the
term explanatory for a trial in which placebo effects are removed and the term
pragmatic for a study in which placebo and other extraneous effects are taken as
part of the overall treatment response of interest. There are many circumstances
in which the true effects, without placebo effects, of a drug are well established
and where a pragmatic trial will deliver a result that better reflects the anticipated
effect in real life than an explanatory trial. In some cases, the apparent “main”
intervention is not even the most important part of the strategy. For example, in a
pragmatic randomized trial comparing the effect of minimally invasive coronary
bypass surgery to conventional bypass grafting on postsurgery cognitive decline,
the assumption was that the necessary use of a cardiopulmonary pump during
conventional surgery was the most important component of the intervention with
regard to adverse effects on cognitive function [Van Dijk et al., 2002].
Unfortunately, the term pragmatic sounds somewhat less scientific and
rigorous, and some investigators are hesitant to refrain from rigorous placebo
control in their research. In doing so, they may eventually produce results that do
not adequately address the question that medical practitioners need to have
answered. It is important to understand that removal of placebo and other
extraneous effects is a deliberate decision that an investigator needs to make in
the design of a study; in some cases pragmatic studies may be the preferred
option. There is ample confusion about the nature of pragmatic intervention
research. For example, some authors propose that explanatory studies “recruit as
homogeneous a population as possible and aim primarily to further scientific
knowledge” or that “in a pragmatic trial it is neither necessary nor always
desirable for all subjects to complete the trial in the group to which they were
allocated” [Roland & Torgerson, 1998]. These views are erroneous. The
homogeneity of the study population may affect the generalizability and relates
to the domain of a study irrespective of whether a trial is pragmatic or
explanatory. In both explanatory and pragmatic trials, patients sometimes
complete the study in the group to which they were not randomized; for
example, they may need the treatment originally allocated to the other group and
thus “cross-over” from one treatment arm to the other. This is common and not a
problem as long as the patients are analyzed according to allocated treatment,
that is, by intention to treat. “Pragmatic” and “explanatory” do not refer to the
methodologic rigor or the scientific value of the knowledge that is generated.
The distinction between pragmatic and explanatory trials reflects the nature of
the comparison that is being made. In pragmatic studies, the treatment response
is the total difference between two treatments (i.e., treatment strategies),
including treatment and associated placebo or other extraneous effects, and this
will often better reflect the likely response in practice.
Observation Effects
The third, and last, reason for an observed response to treatment that is not
attributable to the treatment lies in the influence of the observer/researcher or the
observed (participant) on the measurement of the outcome (see Figure 5–5).
Without deliberate intention, the observer may favorably interpret the report
of a patient or adjust (round up or down) measurement results to better values.
The observation effect is that which an observer or the observed participant has
on the particular observations made. Observer bias is a systematic effect that moves the observed effect away from the true effect. Observation effects may well
reflect an interaction between observer and patient. For example, a physician has
just received a sample of a new drug that is reputed to work exceptionally well
in cases of chronic sleeping problems. When Mrs. Jones visits his surgery again
with a long-lasting complaint of sleeping problems so far resistant to any
medication, the doctor proposes this new miracle drug, which may offer a last
resort. At the next visit, Mrs. Jones may be inclined not to disappoint her doctor
again and gives a somewhat positively colored account of her sleeping history in
the last couple of weeks. At the same time, the physician is reluctant to accept
yet another failure of treatment in this patient. Together they create a biased
observation of an otherwise unchanged problem. Just as with placebo effects, the
magnitude of the potential for observation effects will depend on the type of
observation that is being made. The “softer” the outcome, the more room for
observation effects. In a study on the benefits of a drug in patients with ischemic
cardiac disease, measures of quality of life and angina will be more susceptible
to observer bias than vital status or myocardial infarction, although the latter is
also sufficiently subjective to be affected. For example, disagreement in the
determination of electrocardiographic ST-segment elevation by emergency
physicians occurs frequently and is related to the amount of ST-segment
elevation present on the electrocardiogram.
FIGURE 5–5 Observer-observee difference in perceived response to treatment.
TREATMENT EFFECT
Despite all of the reasons why an observed treatment response need not
necessarily show the benefit of the treatment per se, obviously there is the
possibility that the effect being observed is entirely or in part the result of the
treatment. In intervention research, the mission is to extract from the observation
the component in which we are interested. This can only be achieved by
comparing a group of patients who are being treated to a group of patients who
are not treated or who are treated differently. There is no way in which a valid
estimate of the effect of a drug or other treatment can be obtained from
observing treated patients only. Consequently, in the example of the physician
trying out a new antihypertensive drug, there is no way that the true effect of the
new drug can be determined from the overall observation. A comparative study
is needed. The treatment effect and the three alternative explanations for the
observed treatment response (natural history, extraneous effects, and observation
effects), as well as the handling of the latter three in research, can be illustrated
by a simple equation. In a comparative study where a treatment, for example a
drug named “Rx,” is compared to no treatment at all, the responses in the index
(i.e., treated) group can be summarized as follows [Lubsen & de Lang, 1987]:

OEi = Rx + NHi + EFi + OBi

where OEi is the observed effect in the index group, Rx is the treatment effect, NHi is the effect of natural history, EFi is the effect of extraneous factors including placebo effects, and OBi is the observation effect in the index group. The corresponding equation in the reference (r) group not receiving the intervention is:

OEr = NHr + EFr + OBr

The difference between the effects observed in the two comparison groups can be written as:

OEi – OEr = Rx + (NHi – NHr) + (EFi – EFr) + (OBi – OBr)
If the interest is in the treatment effect per se, in this example the single
pharmacologic effect of the drug, Rx, OEi – OEr needs to equal Rx. To achieve
this, the other terms need to cancel out. Consequently, NHi needs to equal NHr,
EFi needs to equal EFr, and OBi needs to equal OBr. The equation for a
comparison between two treatments (an index treatment Rxi and reference
treatment Rxr) is the same except that after cancelling out the other terms, OEi –
OEr now equals Rxi – Rxr, that is, the net benefit of the index treatment over the
other.
The principles of intervention research can be summarized as ways to make
all terms in the equation in the two groups the same, except for the treatment
term. This means that natural history, extraneous effects, and observation effects
are made the same in the groups that are compared. Note that an alternative way
to achieve comparability of natural history, extraneous effects, and observation
effects is by removing them completely from the study. However, this is
generally impossible to achieve. Rather, by accepting these effects and ensuring
that they are cancelled out in the observation, a valid estimate of the treatment
effect is obtained.
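A minimal numeric illustration of this cancellation (all numbers hypothetical):

# If natural history (NH), extraneous effects (EF), and observation
# effects (OB) are equal in both groups, they cancel in the difference
# of the observed effects and only the treatment effect Rx remains.
nh = -6.0    # natural-history fall in systolic pressure (mmHg), both groups
ef = -3.0    # extraneous effects incl. placebo, both groups
ob = -1.0    # observation effect, both groups
rx = -8.0    # true pharmacologic effect

oe_index = rx + nh + ef + ob      # observed effect, treated group (-18)
oe_reference = nh + ef + ob       # observed effect, untreated group (-10)
print(oe_index - oe_reference)    # -8.0, which equals Rx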
COMPARABILITY OF NATURAL
HISTORY
Comparability of natural history is a conditio sine qua non (Latin legal term
meaning “without which it could not be”) in intervention research. Because
natural history may be highly variable between individuals, an intervention
effect estimated from research that includes effects from natural history cannot
be generalized to what can be expected in practice. Consequently, it is of critical
importance that in a comparison between two or more groups to estimate the
effect of an intervention, the effects of natural history are the same in all groups.
There are several ways in which this can be achieved. First, a quasi-
experimental study can be conducted where the participants in the groups are
carefully selected in such a way that each group represents the same distribution
of natural histories. For example, in a comparison of two anticancer drugs for
treatment of leukemia, patients in the two groups can be deliberately selected so
that they have a similar age, proportion of males, severity of the disease, and so
on. One could even go as far as to closely match each individual in the index
group to an individual from the reference group according to prognostic characteristics expected to determine natural history. This would improve the probability that, in the absence of treatment, the
two groups would show the same natural history and, therefore, an observed
difference in response would not reflect a difference in natural history. A related
approach would be to restrict the entire study population to a highly
homogeneous group of patients who, because of their similarity, are expected to
all have a highly similar prognosis (natural history). Alternatively, there could be
no preselection made and patients could receive treatment as deemed by the
physician, but prognostic indicators would be recorded in detail. Clearly,
initiation of a specific intervention in daily practice is anything but random
because physicians tend to treat those patients with a relatively poor prognosis
more often. Therefore, in the statistical analysis of the data from the study,
multivariate adjustments should be made to remove the effect of differences in
natural history from the comparison.
A necessary requirement for either of these approaches to ensure
comparability of natural history is that all relevant prognostic factors that could
be different between the groups are known and can be measured validly. In
addition, the source population of patients should be large enough to make
preselection and matching possible. Similarly, for multivariate analysis, the size
of the study population should be large enough to allow for statistical
adjustments. The overriding problem, however, is that comprehensive
knowledge of all relevant prognostic factors is typically lacking. A variable that
is not known or measured cannot be taken into account in preselecting study
groups, nor can it be controlled for in the analysis. This holds true for any causal
research where the effect of an exposure needs to be separated from other related
but confounding determinants of the outcome. However, the problem in
intervention research is accentuated because of the complexity of the decision to
treat patients. In setting an indication for prescribing a drug to a patient, the
treating physician will take many factors into consideration such as the severity
of the disease, the likelihood of good tolerance and compliance, the experience
in this patient with previous treatments, the patient’s preference, and so forth.
When groups of patients with the same disease but with and without a
prescription for treatment by a physician are compared, they are probably
different in many ways, some of which can be measured while others are very
implicit and neither reflected in the patient file nor measurable through
additional efforts. The indication for treatment (i.e., the composite of all reasons
to initiate it) is a very strong prognostic indicator. If a patient is judged to have
an indication to use a drug, this patient probably has a more severe untreated
prognosis than a patient with the same diagnosis in whom the physician decides
to wait before deciding on drug treatment. The effect on natural history of the
presence or absence of a pertinent indication in patients with the same disease
who are or are not treated is termed confounding by indication [Grobbee &
Hoes, 1997].
FIGURE 5–6 Reasons underlying the decision to initiate treatment are important potential confounders.
Figure 5–6 shows that the reasons underlying the decision to initiate
treatment are important potential confounders. These reasons, often related to
patient characteristics such as severity of disease, by definition are associated
with the probability of receiving the intervention (illustrated by the exclamation
mark). If these reasons are also related to the probability of developing the outcome, which is the case when patients with more severe disease are more prone (or less prone, for that matter) to develop the outcome, then the right arrow also exists. Consequently, confounding will occur.
Although many drugs can affect the course of a disease positively, the
outcome in people with that disease compared to those who do not have it or
who have a less severe form may be worse or, at best, similar. Confounding by
indication can completely obscure an intervention effect when treated and
untreated patients are compared who do or do not receive the intervention in
routine care. To illustrate this effect, Table 5–1 shows the risks for
cardiovascular mortality in women with hypertension who participated in a
population-based cohort study and were either treated or not treated by their
physicians.
The crude rate ratio for mortality was 1, suggesting that the treatment had no
effect because the treated and untreated hypertensive groups had the same
cardiovascular mortality risk. However, when adjustments were made for a
number of factors that were expected to be related to both the indication for
treatment and cardiovascular mortality, and thus possibly were confounding the
comparison, the rate ratio dropped to a value compatible with a benefit of treatment.
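The mechanism can be mimicked in a simple simulation (ours, using the statsmodels package; parameters are arbitrary and not those of Table 5–1): a truly protective treatment appears useless in the crude comparison because sicker patients are treated more often, while adjustment for severity recovers the benefit.

# Confounding by indication: disease severity raises both the
# probability of being treated and the probability of death, masking a
# truly protective treatment in the crude comparison.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 20_000
severity = rng.normal(size=n)
treated = rng.binomial(1, 1 / (1 + np.exp(-1.5 * severity)))   # sicker -> treated
lin = -2 + 1.2 * severity - 0.7 * treated                      # true log-odds
died = rng.binomial(1, 1 / (1 + np.exp(-lin)))

crude = sm.Logit(died, sm.add_constant(treated)).fit(disp=0)
adjusted = sm.Logit(died, sm.add_constant(
    np.column_stack([treated, severity]))).fit(disp=0)
print("crude OR:", np.exp(crude.params[1]))        # close to (or above) 1
print("adjusted OR:", np.exp(adjusted.params[1]))  # near the true 0.5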
Whether the adjusted rate ratio reflects the true treatment effect depends on
whether an adjustment was made for all of the differences in confounding
variables between the treated and untreated groups. This conclusion is very
difficult to draw. Confounding by indication commonly creates insurmountable
problems for nonrandomized research on intended effects of treatment. Valid
inferences can much more likely be drawn under those rare circumstances in
which (1) groups of patients with the same indications but different treatments
can be compared and (2) residual dissimilarities in characteristics in patients
receiving different treatments for the same indications are known, adequately
measured, and can be adjusted for. For example, Psaty et al. [1995] compared
the effects of several antihypertensive drugs on the risk of angina and
myocardial infarction. In a case-control study, they selected patients who all
shared the indication for drug treatment for hypertension. Consequently, both
cases and controls had this indication. In addition, they took ample measures to
exclude residual confounding by indication, notably in the design of data
analysis.
TABLE 5–1 Crude and Adjusted Rate Ratios for Death from Cardiovascular Causes in Untreated and Drug
Treated Women that Were All Hypertensive According to Common Criteria
Rate Ratio (95% Confi dence Interval)
Apart from the reasons to start an intervention (i.e., the indication), reasons to
refrain from initiating the intervention may act as confounding variables. This is
sometimes referred to as confounding by contraindication. Just as with
confounding by indication (see Figure 5–6), these reasons (e.g., patient
characteristics known to increase the risk of developing unintended or side
effects of the intervention) will be associated with the probability of receiving
the intervention, albeit here the association represented by the left arrow will be
inverse. If these reasons not to start the intervention are also associated with the
probability of developing the outcome of interest, (i.e., the right arrow exists),
then confounding is very likely to occur. Such confounding by contraindication
is illustrated in a study on the putative association between the use of the drug
ibopamine and mortality, after its use was restricted in 1995 [Feenstra et al.,
2001]. In a comparison between patients using the drug before and after
September 8, 1995, the relative risk for death associated with the use of
ibopamine was 3.02 (95% confidence interval [CI], 2.12–4.30) for the period
before and 0.71 (CI, 0.53–0.96) for the period after September 8, 1995. The
marked inversion of the relative risk estimate is very likely the result of a
changed practice in the use of (relative) contraindications in these patients.
Apparently, ibopamine was preferentially prescribed to patients with a much
lower mortality risk after 1995 than in the preceding period. Consequently, the
observed mortality risk in users of ibopamine was reduced. We will only use the
term confounding by indication (where indication is then defined as reasons to
initiate or refrain from a certain intervention) to indicate circumstances when the
reasons to start or not to initiate the intervention are also related to the beneficial
or unfavorable outcome of interest, and, thus, confounding may occur.
RANDOMIZATION
The most effective way to resolve the problem of confounding by indication and
other confounding effects of differences in natural history in a comparative study
is by randomization (Figure 5–7). Randomization means that the treatment is
allocated at random to individual participants in a study. Indication for drug use
is thus set randomly. Any resulting difference in prognosis in the absence of
treatment between randomized groups is the sole result of random imbalances.
The risk of remaining prognostic differences is thus inversely related to the size
of the population that is randomized.
Figure 5–7 shows the major strength of random allocation of patients to an
intervention. Because of randomization, the distribution of all known and
unknown reasons to start or not to start an intervention that would apply in daily
practice (and that may be related to the occurrence of the outcome) are made
similar in the two comparison groups. Consequently, there will be no association
between (contra)indications and the probability of receiving the intervention:
The left arrow does not exist and there will be no confounding. Obviously,
patients with an unequivocal indication or clear contraindication cannot be
randomized and would in any event not reflect the domain of a study to
determine the effects of an intervention.
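Continuing the simulation sketched earlier in this chapter (same simulated population and packages), random allocation removes the link between severity and treatment, so even the crude comparison recovers the protective effect:

# Randomization breaks the association between (contra)indications and
# treatment: severity is balanced across arms by design. Reuses the
# `severity`, `rng`, `n`, np, and sm objects from the previous sketch.
treated_rnd = rng.binomial(1, 0.5, size=n)                 # random allocation
lin_rnd = -2 + 1.2 * severity - 0.7 * treated_rnd
died_rnd = rng.binomial(1, 1 / (1 + np.exp(-lin_rnd)))

crude_rnd = sm.Logit(died_rnd, sm.add_constant(treated_rnd)).fit(disp=0)
print("crude OR under randomization:", np.exp(crude_rnd.params[1]))
# well below 1 and close to the true protective effect, without adjustment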
COMPARABILITY OF EXTRANEOUS
EFFECTS
While comparability of natural history is mandatory in a comparative study on
treatment effect, the extent to which extraneous effects should be the same in the
comparison groups is a matter of choice. As discussed, in an explanatory trial,
every effort should be made to exclude extraneous effects, including placebo
effects. In a nonexperimental study, this is difficult to achieve. There, placebo
effects only can be conquered when two or more treatments are compared that
have similar placebo effects. In a randomized trial, placebo treatment and
blinding are the two tools that ensure comparability of extraneous effects.
Treatment can be compared with placebo treatment without disclosure of the
allocation to the patient on the one hand and/or the investigator and healthcare
professionals involved on the other. This makes the study blinded, either single-
(patient) or double- (patient and observer/healthcare professional) blinded,
depending on how many parties remain ignorant about the allocation. In an
explanatory trial blinding is crucial to yield explanatory results, while in
pragmatic studies extraneous effects are accepted as being inherently part of the
intervention strategy and the use of placebo and blinding is not indicated (see
Figure 5–8).
COMPARABILITY OF OBSERVATIONS
There are a number of ways to prevent or limit observation effects. First, hard
outcomes may be studied. When hard outcomes are used that can be measured
objectively, such as mortality, incomparability of observations will be limited.
Often, however, softer and more subjective outcomes may be more relevant for
the research. Alternatively, the measurement can be highly standardized with
strict protocols, which will limit the room for subjective interpretation. This will
help but is not foolproof.
A more rigorous way to prevent observation effects is to separate the
observation from knowledge of the intervention. By blinding the observer for the
assigned treatment, the observation will not be systematically different according
to treatment status even if the measurement is sensitive to subjective
interpretation. To further reduce the impact of the observer, the patient also can
be blinded for the intervention. Another way to separate observation from
intervention knowledge is to have an observer who plays no role in the
treatment. For example, in a study on the effects of different drugs on glucose
control in diabetic patients, the laboratory technician measuring HbA1C need
not be informed about which intervention the patients receive. Similarly, a
radiologist can judge the presence of vertebral fractures in osteoporotic women
participating in a trial on a new anti-osteoporotic treatment without being
informed about the mode of treatment the women receive. Note that even in a
trial that should preferably be pragmatic, one may still decide to conduct a
blinded trial because of the type of outcome involved, with the aim to achieve
comparability of observations.
TRIAL LIMITATIONS
The principles of RCTs can be fully understood by appreciation of the
comparability requirements. Randomization ensures comparability of natural
history (NHi = NHr). Blinding and use of placebo ensure comparability for
extraneous effects (EFi = EFr). Blinding also prevents observer bias due to
differential observations or measurements in either group (OBi = OBr). While
comparability for natural history is always needed for a valid estimation of the
treatment effect, the need for blinding varies according to the objective of the
trial and the nature of the outcome that is measured. In the case of a pragmatic
study, extraneous effects are included in the treatment comparison and placebo
treatment is not needed. Still, blinding may be desirable to ensure unbiased
outcome assessment. With very solid outcome measures, observation effects
may be negligible, making blinding unnecessary.
For a trial that needs to be blinded because of the outcome measure, but has
the goal of providing pragmatic knowledge (which calls for an unblinded study),
one option is to make the trial only partially blinded. For example, it could be
open for the patients but blind for the observers. Because confounding by
differences in natural history, in particular confounding by indication, is a major
problem in nonrandomized comparisons (where allocation of treatment is done
by the doctor in daily practice), the use of nonexperimental studies to assess the
benefits of treatment has major disadvantages. The RCT is generally the
preferred option to quantify intended treatment effects.
However, there are many reasons why randomized trials, although preferred,
cannot always be conducted and an alternative nonexperimental approach needs
to be sought. First, the necessary number of participants needed in a particular
trial may be too large to be feasible. This applies to studies where the outcome,
although important, occurs at a low rate; an example is when preventive
treatments are studied in low-risk populations. Low outcome rates are a
particular problem in research on side effects of treatments. Take, for example,
the relationship between the drug diethylstilbestrol (DES) and vaginal cancer in
daughters of users. Vaginal cancer, even in the exposed group, is extremely rare.
Alternatively, the expected difference in the rate of events between two
interventions that are being compared may be very small, for example, when two
active treatments are compared but one is only slightly better than the other. The
latter situation is increasingly common for research on new treatments for an
indication where an effective intervention already exists. For example, when two
effective antihypertensive drugs are compared in a hypertensive population, it
may take a very big study to demonstrate a small, albeit meaningful, difference
in efficacy. Apart from practical restrictions, a randomized trial simply might be
too expensive or time consuming. Randomized trials need considerable budgets,
particularly when they are large and of long duration, which is quite common for
so-called Phase 3 drug research required as part of the Food and Drug
Administration (FDA) or European Medicines Agency (EMA) approval process
before marketing. Time may be a problem in itself, for example, when an answer
to a question about the effect of a treatment needs to be obtained quickly and
there is not enough time for a long-term trial to be completed. This is more often
the case in research on side effects than on main effects. If, for example, a life-
threatening side effect is suspected, adequate and timely action may be
warranted and nonexperimental studies may be necessary to provide the relevant
scientific evidence. Another problem with the duration of trials is that they are
less suited for outcomes that take many years or even generations to occur.
Randomized trials usually run a couple of years at maximum. Longer trials
become too expensive, and also with time the number of people who drop out of
the study (attrition rate) may become unacceptably high. Recall the DES
example; even if vaginal cancer in the daughters of users of this drug is a
common outcome, it would be difficult to perform a trial because the follow-up
period spans an entire generation.
In circumstances where the sample size, money, or the duration of follow-up
poses no insurmountable problems, random allocation of patients may be
problematic. For example, random allocation of lifestyle exposures, such as heavy alcohol use or smoking, is generally impossible. Moreover, “true”
blinding in a trial may be difficult to achieve. A trial can be nicely blinded on the
surface, but in reality participants or investigators may well be able to recognize
the allocated treatment. In the large, three-armed Women’s Health Initiative
(WHI) trial, examining the effect of long-term postmenopausal hormone therapy
on cardiovascular and other outcomes, over 40% of participants correctly
identified the allocated treatment. Knowledge of randomized treatment may
affect the likelihood of noticing or diagnosing an outcome event and may thus
severely invalidate the comparison (see Table 5–2), as has been worked out by
Garbe and Suissa [2004]. Despite randomization, the reported small increase in
risk in the WHI study could be spurious because of differential unblinding of
hormone replacement therapy users, which could have resulted in higher
detection rates of otherwise clinically unrecognized acute myocardial infarction
in these women. Altering diagnostic patterns because of unblinding could lower
the crude rate ratio of 1.28 to 1.02.
TABLE 5–2 Illustration of Detection Bias for the Ratio of AMI Stratified by Blinding Status of Exposure, Assuming the Unblinded Subjects Were 1.2, 1.5, and 1.8 Times More Likely to Be Diagnosed than the Blinded Study Subjects
a. Rate as cumulative incidence of acute MI per 1,000.
b. The detection rates of 22–44% relate to the proportion of incident MIs that remain clinically unrecognized at the time they occur but can be detected by ECG (Sheifer et al., 2001).
Reproduced from Garbe E, Suissa S. Issues to debate on the Women’s Health Initiative (WHI) study:
Hormone replacement therapy and acute coronary outcomes: methodological issues between randomized
and observational studies. Hum Reprod 2004;19:8–13.
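The arithmetic behind such a correction can be sketched as follows (our illustration; all numbers are hypothetical and chosen only to mimic the order of magnitude of the Garbe and Suissa argument):

# Detection bias: if a fraction of subjects is unblinded and unblinded
# subjects are more likely to have otherwise silent AMIs diagnosed, the
# apparent rate ratio rises even when the true rates are equal.
true_rate = 30 / 1000        # true AMI risk, identical in both arms
recognized = 0.67            # fraction of AMIs recognized under blinding
unblinded = 0.44             # fraction unblinded in the hormone arm
extra = 1.5                  # unblinded subjects 1.5x more likely diagnosed

detected_control = true_rate * recognized
detected_hormone = true_rate * ((1 - unblinded) * recognized
                                + unblinded * min(1.0, recognized * extra))
print("apparent rate ratio:", detected_hormone / detected_control)  # ~1.2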
Another possible limitation of trials is that they tend to include highly selected
patients and not those patients who are most likely to receive the intervention in
daily practice. Typically, randomized trials include younger, healthier patients
who have less comorbidity and take fewer medications, and who are more
compliant than real-life patients. Evidently, this has no bearing on the validity of
the results of the study itself (it can actually be helpful to include a homogeneous
population) but may limit the generalizability of the findings to the relevant
clinical domain. This only occurs, however, when the differences in
characteristics of trial populations and patients in daily practice modify the effect
of the intervention. For example, the earlier trials on drug therapy in heart failure
included mostly relatively young patients with little comorbidity, whereas the
typical heart failure patients are older and have multiple comorbidities.
Generalizability of the findings of the earlier studies to the elderly has long been
debated. Currently, trials are being conducted among the very old to provide
evidence of the efficacy of heart failure therapy in this large group of patients.
Finally, a trial involving randomized allocation and possibly blinding may be
deemed to be unethical. An example is when there are highly suggestive data to
support the marked superiority of a new treatment, particularly in a situation
where no alternative treatments are available for a very serious disease.
Unfortunately, the presence of weak data from flawed research sometimes
prohibits a decent trial, leaving medical practitioners without a sound basis for
treatment decisions. Sir Austin Bradford Hill [1951] succinctly summarized the
problem of publication of questionable but suggestive data on treatment benefits:
If a treatment cannot ethically be withheld then clearly no controlled trial can be instituted. All the
more important is it, therefore, that a trial should be begun at the earliest opportunity, before there
is inconclusive though suggestive evidence of the value of treatment. Not infrequently, however,
clinical workers publish favorable results on three or four cases and conclude their article by
suggesting that this is the mode of choice, or that what now is required is a trial on an adequate
scale. They do not seem to realize that by their very publication they have vastly increased
difficulties of the trial or, indeed, made it impossible.
FIGURE 5–9 Life table profiles for 98 inosiplex-treated SSPE patients and for 333 composite SSPE
controls (Israeli, Lebanese, and U.S. registry patients).
Reproduced from The Lancet, Vol. 319, Jones CE, Dyken PR, Huttenlocher PR, Jabbour JT, Maxwell KW.
Inosiplex therapy in subacute sclerosing panencephalitis. 1035; © 1982, reprinted with permission from
Elsevier.
INTRODUCTION
A 75-year-old woman who has had rheumatoid arthritis for more than 25 years
visits her doctor because of increasing joint pain. She has been taking
nonsteroidal anti-inflammatory drugs (NSAIDs) for many years. In the past, she
stopped several NSAIDs and replaced them with others because she suffered
from dyspepsia attributed to the drugs. Three years ago she developed a peptic
ulcer. Currently, she takes ibuprofen on a daily basis in conjunction with a
proton-pump inhibitor to prevent NSAID-induced gastrointestinal side effects.
Because of the current severity of the complaints, the doctor decides to switch
her to Metoo-coxib, a novel cyclooxygenase (COX)-2 inhibitor with powerful
analgesic properties that is believed to cause fewer gastrointestinal side effects
than classic NSAIDs. COX-2 selective inhibitors were developed as an
alternative to classic (nonselective) NSAIDs because COX-1 inhibition exerted
by the latter drugs decreases the natural protective mucus lining of the stomach.
Indeed, within a month the patient’s pain decreases considerably and no
gastrointestinal side effects are encountered. Consequently, the proton-pump
inhibitor is withdrawn. After 3 months, however, the woman suffers from a
myocardial infarction. This certainly comes as a surprise, because apart from
advanced age, no cardiovascular risk factors were present. Doctor and patient
wonder whether the myocardial infarction was caused by Metoo-coxib.
Interventions (treatments) in clinical practice are meant to improve a patient’s
prognosis. After careful consideration of the expected natural course of a
patient’s complaint or disease (prognostication), a physician has to decide
whether, and to what extent, a particular intervention is likely to improve this
prognosis. To make this decision, it is essential to know the anticipated intended
(main) effects of the intervention.
In the rheumatoid arthritis example, the doctor presumably believed that the
joint pain of the patient would increase or last an unacceptably long time and
thus warranted prescription of a different, novel, and apparently stronger
painkiller. The alleged stronger analgesic properties of the novel drug should be
based on evidence from valid research on the intended effect. Apart from the
primary (intended) effect of an intervention, however, unintended (side) effects
could, and in fact should, factor into the decision to initiate or refrain from this
or any other intervention (see Box 6–1).
• Side effects
• Harm
• Adverse effects
• Risks
• Adverse drug reactions (ADRs) or adverse drug events (in the case of
pharmaceutical interventions)
In our view, the term unintended effects (as opposed to intended effects) best
reflects the essence of these intervention effects [Miettinen, 1983].
Pharmacovigilance is the term increasingly being applied to indicate the
methodology or discipline or, if one wishes, art, of assessing side effects of
pharmacologic interventions. Alternatively, drug risk assessment,
postmarketing surveillance, and pharmacoepidemiology are terms often
applied, although the latter often also encompasses nonexperimental research
on the use of drugs in daily practice (drug utilization) and on intended effects
[Strom, 2005].
Only when the expected benefits are likely to outweigh the anticipated
harmful effects is initiation of an intervention justifiable. In the case of the
elderly woman with arthritis, the impressive history of gastrointestinal effects
that occurred during the use of previous NSAIDs presumably also contributed to
the initiation of Metoo-coxib as an intervention, as it was believed to confer
fewer gastrointestinal side effects. This decision should have been based on solid
evidence that the incidence of these unintended effects is lower with Metoo-
coxib than with classic NSAIDs.
Napoleon Bonaparte presumably was among the first to ban a drug (in this case, herbal) because of
serious side effects. While in Egypt around 1800, the French occupying forces indulged in the use of
cannabis, either by smoking it or by consuming hashish-containing beverages.
He prohibited the use of cannabis in 1800: “It is forbidden in all of Egypt to use certain Moslem
beverages made with hashish or likewise to inhale the smoke from seeds of hashish. Habitual drinkers
and smokers of this plant lose their reason and are victims of violent delirium which is the lot of those
who give themselves full to excesses of all sorts” [Allain, 1973].
Although Napoleon undoubtedly interpreted the observed effects of cannabis as side effects, the
question remains whether the effects were indeed considered “unintended” by the consumers. The fact
that consumption of hashish was reported by some to increase after the official prohibition illustrates
that the effects may, to some extent at least, have been “intended.”
The thalidomide tragedy dramatically changed the way a drug’s primary and
side effects are assessed. In 1954, the small German firm Chemie Grünenthal
patented the sedative thalidomide. The alleged absence of side effects, even at
very high dosages, fueled the impression that the drug was harmless [Silverman,
2002]. The potential hypnotic effect of the drug was revealed after free samples
of the then-unlicensed drug were distributed. The drug was licensed in
Germany in 1957 and sold as a nonprescription drug because of its presumed
safety. Within a few years, the drug was by far the most often used sedative.
Sold in more than 40 countries around the world, thalidomide was quickly
marketed as the antiemetic drug of choice for pregnant women with morning
sickness. About a year after its release, however, a neurologist noticed peripheral
neuritis in patients who received the drug. Even as reports of this side effect
were accumulating rapidly, the company denied any association between
thalidomide and this possible unintended effect. In 1960, marketing
authorization was sought in the United States. Interestingly, at that time only
proof of safety (rather than clinical trials to demonstrate efficacy) of a drug was
required for approval by the Food and Drug Administration (FDA). By the end of
1961, the first reports of increasing numbers of children with birth defects were
published. These defects included phocomelia, a very rare malformation
characterized by severe stunting of the limbs; children had flippers instead of
limbs. In that same year, the pediatrician Lenz presented a series of 161
phocomelia cases linked with thalidomide, and the firm withdrew thalidomide
from the German market. In Box 6–3 an extract of a lecture delivered by Dr.
Lenz in 1992 is presented, illustrating the way this dramatic unintended effect
was discovered. Exact statistics are unknown, but it has been estimated that more
than 10,000 infants developed phocomelia because of their mother’s use of
thalidomide during pregnancy.
Despite its dramatic past, thalidomide received marketing authorization in the
late 1990s, with the caveat that it only could be applied under strict conditions
and its use in pregnant women was absolutely contraindicated. The drug is
currently used for several disorders, including multiple myeloma and erythema
nodosum leprosum, a severe complication of leprosy. The beneficial effects of
thalidomide have been attributed to its tumor necrosis factor-alpha (TNF-α)
lowering properties.
BOX 6–3 Extract from a Lecture Given by Dr. Widukind Lenz at the 1992 UNITH Congress
Though the first child afflicted by thalidomide damage to the ears was born on December 25, 1956, it took
about four and a half years before an Australian gynaecologist, Dr. McBride of Sydney, suspected that
thalidomide was the cause of limb and bowel malformations in three children he had seen at Crown Street
Women’s Hospital. There are only conflicting reports unsubstantiated by documents on the reaction of his
colleagues and the Australian representatives of Distillers Company, producers of the British product
Distaval between June and December 16, 1961, when a short letter of McBride was published in The
Lancet. Distillers Company in Liverpool had received the news from Australia on November 21, 1961,
almost exactly at the same time as similar news from Germany.
I had suspected thalidomide to be the cause of an outbreak of limb and ear malformation in Western
Germany for the first time on November 11, 1961, and by November 16, I felt sufficiently certain from
continuing investigations to warn Chemie Gruenenthal by a phone call. It took ten more days of intensive
discussions with representatives of the producer firm, of health authorities, and of experts before the drug
was withdrawn, largely due to reports in the press.
Reproduced from the lecture “The History of Thalidomide,” delivered at the 1992 United International
Thalidomide Society Congress. Available at: www.thalidomidesociety.co.uk/publications.htm. Accessed
May 9, 2013.
The thalidomide tragedy and other tragedies from pharmaceutical use clearly
show the importance of weighing the risks and benefits of interventions before
bringing drugs to the market (i.e., into widespread use) as well as in the physician’s
decision, after licensing, to initiate the intervention in individual patients in daily
practice. This requires empirical evidence of the expected intended and
unintended effects of the intervention and, thus, valid studies. Naturally,
researchers and those employed by the manufacturers of the interventions are
more likely to direct their research efforts at the intended effects of interventions
than at possible unintended effects. In addition, quantifying unintended effects of
interventions is often more complicated than estimating their benefits, because
the research paradigm to determine effects of intervention—the randomized trial
—is less suited to evaluate unintended effects. In this chapter, the methods
available to assess unintended effects of interventions are presented. Most
examples in this chapter are drawn from studies on the unintended effects of
drug interventions, but the same principles also hold for surgical, lifestyle, and
other healthcare interventions.
FIGURE 6–1 The “confounding triangle” in research on unintended effects. The reasons to initiate or
refrain from a specific intervention are important potential confounders.
Critical evaluation of the two arrows in Figure 6–1, that is, the association of
potential confounders with both the exposure to the intervention and the
unintended effect, is essential. In daily practice, as in the study of intended
effects of interventions, the reasons an intervention is initiated in or withheld
from patients (i.e., relative or absolute indications or contraindications) are by
definition associated with exposure to the intervention [Grobbee & Hoes, 1997].
Consequently, the left arrow in Figure 6–1 exists unless allocation of the
intervention is a random process. This typically only occurs when the researcher
ensures comparability of natural history in those who do or do not receive the
intervention through randomization, that is, by performing a randomized
controlled trial. The presence or absence of a relationship between the reasons to
initiate the intervention (the indication) and the unintended effect (the arrow on
the right) determines the potential for confounding. Because the indication then
acts as a confounder, this is sometimes termed confounding by indication. When
drugs are particularly used by (“indicated for”) patients at a higher or lower risk
of developing the unintended effect of interest than patients not receiving the
intervention, failure to take this confounding into account will bias the study
findings. When, for example, COX-2 inhibitors are for some reason
preferentially prescribed to patients with an unfavorable cardiovascular risk
profile, comparison of the incidence of myocardial infarction of patients
receiving the drug (such as the 75-year-old woman in the earlier example) with
those not using the drug in daily practice may reveal an increased risk of this
side effect. At least part of this increased risk will be attributable to confounding
by indication.
by Mark Kaufman
Washington Post Staff Writer
Saturday, August 20, 2005; Page A01
After less than 11 hours of deliberation, a Texas jury yesterday found Merck & Co. responsible for the
death of a 59-year-old triathlete who was taking the company’s once-popular painkiller, Vioxx.
The jury hearing the first Vioxx case to go to trial awarded the man’s widow $253.4 million in punitive and
compensatory damages—a sharp rebuke to an industry leader that enjoyed an unusually favorable public
image before the Vioxx debacle began to unfold one year ago.
Reproduced from Kaufman, M. The Washington Post, Aug 20, 2005, p. A01. © 2005 Washington Post
Company. All rights reserved. Used by permission and protected by the Copyright Laws of the United
States. The printing, copying, redistribution, or retransmission of this Content without express written
permission is prohibited.
Box 6–4 is an excerpt from a Washington Post article published August 20,
2005. Apparently, the jury considered the causal relationship between the use
of rofecoxib (Vioxx), a COX-2 inhibitor, and the untimely death of the athlete
proven. Rofecoxib was withdrawn from the market by the manufacturer in
September 2004, after a randomized trial showed an increased risk of
cardiovascular disease among rofecoxib users [Bresalier et al., 2005].
The importance of taking confounding into account in research on unintended
effects of interventions and possible bias attributable to initiation of drug
interventions in high-risk patients is clearly exemplified by the following quote
from John Urquhart, emeritus professor of pharmacoepidemiology: “Did the
drug bring the problem to the patient or did the patient bring the problem to the
drug?” [Urquhart, 2001].
As in all types of research aimed at quantifying causal associations,
confounding in the assessment of unintended effects of interventions can be
accounted for either in the design of data collection or in the design of data
analysis. The potential for confounding, however, critically depends on the type
of unintended effect involved: type A or type B [Rawlins & Thompson, 1977].
TYPE A AND TYPE B UNINTENDED EFFECTS
FIGURE 6–2 Potential confounding in the study of type A unintended effects of an intervention with the
example of anticoagulants and bleeding.
Consider, as an example of a type B unintended effect, angioedema in patients
receiving the ACE inhibitor enalapril (see Figure 6–3). This is a rare event
characterized by swelling around the eyes and lips, which in severe cases also
may involve the throat, a side effect that is potentially fatal.
Again, determinants of enalapril prescription (blood pressure level, levels of
other cardiovascular risk factors, and relevant comorbidity such as heart failure
or diabetes) will influence the use of the drug in clinical practice (left arrow in
Figure 6–3). In contrast to type A unintended effects, these patient
characteristics are very unlikely to be associated with the outcome. For example,
blood pressure, cholesterol levels, and diabetes are not related to the risk of
developing angioedema. Consequently, the arrow on the right in Figure 6–3 is
nonexistent and confounding is not a problem in such type B unintended effects
[Miettinen, 1982; Vandenbroucke, 2006].
Measures to prevent confounding are therefore generally not necessary in type
B unintended effects, although one has to be absolutely sure that characteristics
of recipients of the intervention are indeed not related to the unintended event
under study.
BOX 6–5 Example of a Type A Unintended Effect of a Drug Intervention that was First Considered a Type
B Effect
An example is the abstract in Box 6–5. With the first reports of angina
pectoris or myocardial infarction in recipients of sumatriptan, a then novel
antimigraine drug, these rare events were primarily considered type B
unintended effects (see also the wording “unknown” in the abstract)
[Ottervanger et al., 1993]. With accumulating evidence, however, the effect was
shown to be related to the primary action of the drug, that is, its vasoconstrictive
properties, and the predictability of the effect increased accordingly. Currently,
this adverse drug reaction is primarily considered a type A effect, although it
fortunately remains rare.
Myocardial infarction is also a possible consequence of Metoo-coxib, the drug
introduced at the beginning of this chapter; this is more characteristic of a type
A than a type B unintended effect. COX-2 inhibition promotes platelet
aggregation because it suppresses endothelial prostacyclin, whereas COX-1
inhibition reduces aggregation because it blocks platelet thromboxane
synthesis. Thus, selective COX-2 inhibition was expected to increase platelet
aggregation, which may indeed promote thrombus formation and eventually
cause myocardial infarction. The observed dose–response relationship further
illustrates that myocardial infarction may be a type A effect [Andersohn et al.,
2006]. Consequently, confounding by indication may pose an important threat to
the validity of research on this potential side effect of Metoo-coxib or other
COX-2 inhibitors.
THEORETICAL DESIGN
The occurrence relation of research on the unintended effects of an intervention
closely resembles that of research on the intended effects of interventions:
Unintended effect = f (intervention | EF)
Because the primary goal is to assess causality, the occurrence relation should
be estimated conditional on confounders (external factors, or EF).
The domain usually includes patients with an indication for the intervention
(e.g., a specific disease), or defined more broadly, patients in whom a physician
considers initiating the intervention.
In the Metoo-coxib example, the occurrence relation would be
Myocardial infarction = f (Metoo-coxib use | EF)
and the domain is defined as patients with osteoarthritis (or perhaps other
diseases) requiring analgesics.
Time
As for studies assessing intended effects of interventions, the time dimension for
research on unintended effects is larger than zero. The aim is to establish
whether a specific intervention is related to the future occurrence of a certain
effect. In principle, therefore, research on unintended effects is longitudinal.
Census or Sampling
In contrast to diagnostic studies and research on the intended effects of
interventions, studies addressing unintended effects of interventions relatively
often take a sampling instead of a census approach. There are several reasons
why sampling (and, thus, a case-control study) is attractive here. First, sampling
is efficient when the unintended effect is rare, as is typically the case in type B
unintended effects. A census approach would imply following in time very large
numbers of patients receiving or not receiving the treatment. For the example at
the beginning of the chapter, this would entail following a large group of patients
with rheumatoid arthritis receiving Metoo-coxib and a large group receiving no
or other analgesics. Alternatively, one may hypothetically define and follow a
study base, consisting in this example of patients with rheumatoid arthritis, and
only study in detail those developing the unintended effect (i.e., cases) during
the study period and a sample representative of that study base (i.e., controls).
Obviously, the definition of the study base critically depends on the domain of
the study. Case-control studies are efficient also when the measurement of the
determinant and other relevant characteristics, such as potential confounders and
effect modifiers, is expensive, time consuming, or burdensome to the patient.
For example, when detailed information, including dosage, duration of use,
compliance to medications (including Metoo-coxib), and relevant comorbidity is
difficult to obtain, a case-control study should be considered. In addition, when
unintended effects take a long time to develop or when the time from exposure
to the intervention until the occurrence of the effect is unknown, a case-control
approach is attractive.
The classic example of a case-control study establishing the causal association
between the use of the estrogen diethylstilboestrol (DES) in mothers and the
occurrence of clear-cell adenocarcinoma of the vagina in their daughters
illustrates the strengths of case-control studies; a census approach would require
an unrealistic follow-up time lasting one generation and a huge study population
because vaginal carcinoma is extremely rare. The results of the original case-
control study from 1971 on this topic are shown in Table 6–1 [Herbst et al.,
1971].
In that study, eight cases were compared with 32 matched controls. The
mothers of seven of the eight daughters with vaginal carcinoma had received
DES (a drug primarily prescribed for women with habitual abortion to prevent
future fetal loss) during pregnancy, whereas none of the mothers of the 32
control daughters had used DES. Although no quantitative measure of
association was reported (in fact the odds ratio cannot be calculated because its
numerator includes 0 and the odds ratio reaches infinity), it was not difficult to
conclude that DES increases the risk of vaginal carcinoma in daughters. When
assuming that the mother of one control received DES during pregnancy, the
odds ratio would be (7 × 31)/(1 × 1) = 217, still indicating a more than 200-fold
risk.
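The odds-ratio arithmetic is easily verified; a minimal sketch, assuming (as the
text does) that one of the 32 control mothers had used DES:

```python
# 2 x 2 table under the assumption that 1 control mother (instead of 0)
# had used DES during pregnancy.
cases_exposed, cases_unexposed = 7, 1
controls_exposed, controls_unexposed = 1, 31

odds_ratio = (cases_exposed * controls_unexposed) / \
             (cases_unexposed * controls_exposed)
print(odds_ratio)  # 217.0
```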
Experimental or Observational
The main challenge of research on unintended effects of interventions lies in
proving beyond a reasonable doubt that the intervention is causally involved in
the occurrence of the outcome. An experimental approach (i.e., a randomized
controlled trial) best ensures that the outcome is indeed attributable to the
intervention, mainly because randomization will achieve comparability of
natural history of those who do and do not receive the intervention and, thus,
prevent confounding. Moreover, randomized controlled trials, when properly
conducted, will also achieve the other two “comparabilities,” that is,
comparability of extraneous effects and comparability of observations, which are
necessary to prove that the intervention is “guilty,” to return to the courtroom
analogy. However, there are several reasons why this paradigm for assessing
causality in intervention research is less suitable when the aim is to establish
unintended intervention effects.
TABLE 6–1 Results of the Original Case-Control Study (with 8 Cases and 32 Controls) on the Association
between DES use in Mothers and Vaginal Carcinoma in their Daughters
Standard error of difference 1.7 yr (paired t-test); N.S. = not statistically significant.
Reproduced from: Herbst AL, Ulfelder H, Poskanzer DC. Adenocarcinoma of the vagina. Association of
maternal stilbestrol therapy with tumor appearance in young women. N Engl J Med 1971;284:878–81.
Copyright © 1971. Massachusetts Medical Society. All rights reserved.
Typical circumstances under which randomized trials are not suited for the
study of unintended effects are situations where case-control studies are
particularly efficient—when the outcome is rare and when the time between
exposure to the intervention and the development of the outcome is very long.
There is no doubt that a randomized trial to estimate the risk of vaginal
carcinoma in daughters of mothers exposed to DES during pregnancy is not
feasible because it would be an unrealistically large trial with an unachievably
long follow-up period. Also, when the time from exposure to the side effect is
unknown, randomized trials are of limited value. In fact, one of the major
strengths of observational studies on unintended effects is that they can
determine the influence of the duration of the exposure on the occurrence of the
effect [Miettinen, 1989].
Table 6–2 shows that the number of patients required in each of the two arms
of a randomized trial to detect a relative risk of 2 (with a type 1 error of 0.05,
and type 2 error of 0.20) increases dramatically when the incidence of the
outcome effect becomes rare.
Type B unintended effects are especially difficult to detect in a randomized
trial because the frequency of the outcome, such as anaphylactic shock in those
not receiving the drug under study or an alternative intervention, is usually lower
than 0.1% or even 0.01%.
TABLE 6–2 Risk of the Outcome in the Control Group and the Number of Participants Required in Each
Group of a Randomized Trial
Risk of Outcome in Control Group Number Required in Each Group
50% 8
25% 55
10% 198
5% 435
1.0% 2,331
0.1% 23,661
0.01% 236,961
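The figures in Table 6–2 can be approximated with the standard normal-
approximation formula for comparing two proportions. The sketch below assumes
a two-sided test, a relative risk of 2, a type 1 error of 0.05, and a power of 0.80,
as in the text; it reproduces the table to within rounding:

```python
from scipy.stats import norm

def n_per_group(p_control, rr=2.0, alpha=0.05, power=0.80):
    """Approximate number per arm to detect relative risk `rr` with a
    two-sided test of two proportions (normal approximation)."""
    p1, p2 = p_control, min(rr * p_control, 1.0)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ((z_alpha + z_beta) ** 2
            * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

for risk in (0.50, 0.25, 0.10, 0.05, 0.01, 0.001, 0.0001):
    print(f"{risk:8.2%}  {n_per_group(risk):12,.0f}")
```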
* Patients at potential risk were patients with heart disease, pulmonary disease, or metabolic disease.
Thirty-two subjects were excluded because of incomplete data, 10 of whom were at potential risk.
Reproduced from: Govaert TM, Dinant GJ, Aretz K, Masurel N, Sprenger MJ, Knottnerus JA. Adverse
reactions to influenza vaccine in elderly people: Randomised double blind placebo controlled trial. BMJ
1993;307:988–90 with permission from BMJ Publishing Group Ltd.
Although multiple exclusion criteria can be very helpful and may be justified
to optimize the safety of participants in an efficacy trial, they also may lead to
inadequate estimates of the unintended effects occurring in daily practice where
patients will be treated outside the domain of the study.
In a randomized study specifically designed to compare gastrointestinal side
effects in those receiving rofecoxib and the NSAID naproxen, recipients of the
COX-2 inhibitor experienced a 50% lower risk of gastrointestinal side effects
(see Table 6–4) [Bombardier et al., 2000].
This trial among patients with rheumatoid arthritis shows the strength of
randomized trials in estimating the risk of relatively frequent unintended effects
(e.g., four upper gastrointestinal events per 100 patient years in the naproxen
group). It also exemplifies that when trials are large enough, they may be
instrumental in detecting even relatively rare effects. In this study including
8,076 randomized patients, the risk of myocardial infarction was lower in the
naproxen group (0.1%) than in the rofecoxib group (0.4%; relative risk 0.2; 95%
confidence interval [CI], 0.1–0.7). It took several more years, however, before
another trial, this one in patients with colorectal adenoma, confirmed the
increased risk of cardiovascular events among rofecoxib users, urging the firm to
withdraw the drug from the market [Bresalier et al., 2005].
TABLE 6–4 Incidence of Gastrointestinal Events in Patients Using Different Types of COX-2 Inhibitors or
NSAIDs
COMPARABILITY IN OBSERVATIONAL
RESEARCH ON UNINTENDED EFFECTS
Comparability of Observations
Blinding is the generally accepted method for achieving comparability of
observations between those receiving the intervention and the comparison group.
In a randomized trial, tools are available to keep all those involved in measuring
the outcome (the observer, but possibly also the patients and doctors or other
healthcare workers when they can influence the measurements) blinded to
treatment allocation, notably by the use of a placebo. In observational research,
usually only part of the observations can be blinded. In a cohort study examining
the effect of Metoo-coxib on the risk of myocardial infarction, for example, one
could blind the researchers involved in adjudication of the outcome by deleting
all information pertaining to the medication used by the patients from the data
forwarded to them. If, however, the use of COX-2 inhibitors prompts healthcare
workers and patients to be more alert to signs of possible myocardial
infarction, leading more often to ordering tests to establish or rule out the
disease, incomparability of observations may artificially inflate the drug’s risk.
Alternatively, one could choose the technique of measuring the outcome such
that observer bias is minimized. For example, automated biochemical
measurements do not require blinding, although in daily practice routine
ordering of such tests may very well be influenced by the intervention the patient
receives. Finally, a hard outcome, such as death, will increase comparability of
observations.
FIGURE 6–4 Potential confounding in a study on the causal role of DES prescription in the occurrence of
vaginal carcinoma in daughters.
BOX 6–7 Means to Limit Confounding by Indication in Observational Studies on Side Effects of
Interventions
In the design of data collection:
1. Restriction
2. Instrumental variables*
In the design of data analysis:
1. Multivariable analyses
2. Propensity scores*
*Instrumental variables and propensity scores can be applied both in the design of data collection and in
the design of data analysis.
In a nested case-control study with the objective of quantifying the risk of
myocardial infarction or sudden cardiac death of COX-2 inhibitors, the study
population was a cohort comprised of patients who filled at least one
prescription of a COX-2 inhibitor or NSAID [Graham et al., 2005]. Thus, all
participants had (or had in the past) an indication for a painkiller and did not
have a clear contraindication for NSAIDs. Nevertheless, there may be reasons to
choose a specific NSAID within the indicated population, and if these reasons
are related to the risk of myocardial infarction or sudden cardiac death,
confounding will result. Although restriction can be a powerful means to limit
confounding, additional methods are usually required to preclude residual
confounding.
TABLE 6–5 Selected Characteristics of Controls from the Case-Control Study Receiving Different COX-2
Inhibitors or NSAIDs and Ex-users (“Remote Use”) of These Drugs
Reproduced from The Lancet, Vol. 365; Graham DJ et al. Risk of acute myocardial infarction and sudden
cardiac death in patients treated with cyclooxygenase 2 selective and nonselective nonsteroidal anti-
inflammatory drugs: Nested-case-control study. 475–81. © 2005, reprinted with permission from Elsevier.
Instrumental Variables
Another method is believed to limit (or even prevent) not only known but also
unknown confounding in observational causal research: the use of instrumental
variables. An instrumental variable (IV) is strongly related to exposure (here,
the intervention), is not related to the confounders, and is not related to the
outcome (except through its relation to the intervention). Categorizing study
participants according to an instrumental variable implies that, if indeed the
instrumental variable is not associated with the probability of developing the
outcome (other than through its strong association with the intervention), all
potential confounders are equally distributed among the categories of the
instrumental variable [Martens et al., 2006]. Instrumental variables that have
been applied include regional preferences for the intervention (e.g., drug
therapy) or the distance to a clinic. A study on the effects of more intensified
treatment (including cardiac catheterization) on mortality in patients with
myocardial infarction was one of the first to apply this method [McClellan,
McNeil, & Newhouse, 1994]. Distance to the hospital was used as an
instrumental variable as it was considered to be closely related to the chance of
the intervention (i.e., intensified treatment is more likely to be initiated when the
distance to the hospital is shorter), while the IV (distance to the hospital) itself
was judged not to be related to the confounders or to the outcome (mortality).
Theoretically, comparison of patients living close to a hospital with those living
farther away would provide for an unconfounded estimate of the effect of more
intensified treatment of myocardial infarction on mortality.
The IV method is increasingly being applied in research on side effects of
interventions [Huybrechts et al., 2011]. Brookhart et al. [2006] used the
physician’s preference of COX-2 inhibitors or other NSAIDs as an IV to
compare the risk of gastrointestinal side effects of these drugs. The abstract of
this study is presented in Box 6–8; it illustrates both the potential strength and
the uncertainties of the method.
Results: Using conventional multivariable regression adjusting for 17 potential confounders, we found
no protective effect due to COX-2 use within 120 days from the initial exposure (risk difference =
−0.06 per 100 patients; 95% confidence interval = −0.26 to 0.14). However, the proposed instrumental
variable method attributed a protective effect to COX-2 exposure (−1.31 per 100 patients; −2.42 to
−0.20), compatible with randomized trial results (−0.65 per 100 patients; −1.08 to −0.22).
Conclusions: The instrumental variable method that we have proposed appears to have substantially
reduced the bias due to unobserved confounding. However, more work needs to be done to understand
the sensitivity of this approach to possible violations of the instrumental variable assumptions.
Reproduced from: Brookhart MA, Wang PS, Solomon DH, Schneeweiss S. Evaluating short-term drug
effects using a physician-specific prescribing preference as an instrumental variable. Epidemiology 2006,
17;268–75, with permission from Wolters Kluwer Health.
Although IVs appear to offer a rather ideal solution to the danger of known
and even unknown confounding in observational causal research, they may be
hard to find in a particular study. Most notably, it is difficult to prove that the
main assumptions underlying this method hold [Groenwold et al., 2010].
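To make the logic of the instrument concrete, the following simulation applies
the simple Wald estimator to synthetic data. This is an illustration of the
general principle, not the estimator of any of the cited studies; the binary
instrument here plays the role of a physician's prescribing preference:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
u = rng.normal(size=n)            # unmeasured confounder
z = rng.integers(0, 2, n)         # instrument: physician prefers the drug
# receiving the drug depends on both the instrument and the confounder
treated = (1.0 * z + 0.5 * u + rng.normal(size=n)) > 0.5
# outcome depends on treatment (true effect = -1.0) and on the confounder
y = -1.0 * treated + 1.0 * u + rng.normal(size=n)

naive = y[treated].mean() - y[~treated].mean()
wald = ((y[z == 1].mean() - y[z == 0].mean())
        / (treated[z == 1].mean() - treated[z == 0].mean()))
print(f"naive (confounded) estimate: {naive:+.2f}")  # biased away from -1.0
print(f"IV (Wald) estimate:          {wald:+.2f}")   # close to -1.0
```

Because the instrument affects the outcome only through the treatment, scaling
the instrument–outcome difference by the instrument–treatment difference removes
the confounding that distorts the naive comparison.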
Multivariable Analyses
The essence of adjusting for confounders is that potential confounders should be
identified in advance and measured appropriately, and then the observed crude
association of the intervention and the outcome (here, unintended effect) is
adjusted using available statistical techniques. There is no consensus about the
way to select confounders and how to build a multivariable model. The decision
to adjust for a potential confounder can be based on a close examination of its
relationship with both the determinant and the outcome in the database of the
study. A more pragmatic and safe approach to limit confounding is to measure
and include those potential confounders that, based on the available literature,
are known to confound the association between the intervention and the
unintended effect [Groenwold et al., 2011]. Often, all confounders are included
in a multiple regression model at once, or researchers use automated model-
building procedures that include or exclude potential confounders on statistical
grounds. However, we recommend that confounders be included one at a time,
starting with the strongest confounder based on clinical expertise, earlier
studies, and the univariable analysis of the confounder with the outcome. The
effect of each included potential confounder on the risk estimate can then be
evaluated. When this effect is large enough (a change of 5% or 10% in the
measure of association between the intervention and the outcome, for example
the odds ratio, is sometimes arbitrarily chosen as the threshold), confounding
by this included variable is considered present. The
methodical single inclusion of potential confounders may also indicate the
potential for residual confounding. If, for example, the risk estimate remains
stable after inclusion of the first major confounders and even after inclusion of
additional potential confounders, one may argue that any unmeasured or
unknown confounder is unlikely to result in a major change in the risk estimate.
The advantage of subsequent inclusion of individual confounders in a multiple
regression model is illustrated in our case-control study on the risk of sudden
death in hypertensive patients using non-potassium-sparing diuretics compared
to other antihypertensives (see Table 6–6).
TABLE 6–6 Risk of Sudden Cardiac Death Among Patients with Hypertension Receiving Non-Potassium-
Sparing Diuretics (NPSD) Compared to Other Antihypertensive Drugs.
Results of multivariable logistic regression analysis. Subsequent inclusion of the first (strongest)
confounders yielded the expected changes in the risk estimate. Inclusion of additional confounders hardly
changed the odds ratio, indicating that residual confounding may be limited.
Potential Confounders Included in the Model Odds Ratio (95% CI) of Sudden Cardiac Death for NPSD
Versus Other Antihypertensives
Crude 1.7 (0.9–3.1)
+ Prior myocardial infarction 2.0 (1.1–3.8)
+ Heart failure 2.0 (1.0–3.9)
+ Angina 2.1 (1.1–4.1)
+ Stroke 2.1 (1.0–4.1)
+ Arrhythmias 2.1 (1.1–4.1)
+ Claudication 2.1 (1.1–4.2)
+ Diabetes 2.1 (1.0–4.1)
+ Obstructive pulmonary disease 2.2 (1.1–4.6)
+ Cigarette smoking 2.2 (1.1–4.4)
+ Hypercholesterolemia 2.2 (1.1–4.5)
+ Mean blood pressure prior 5 years 2.2 (1.1–4.6)
Data from Hoes AW, Grobbee DE, Lubsen J, Man in ‘t Veld AJ, van der Does E, Hofman A. Diuretics,
beta-blockers, and the risk for sudden cardiac death in hypertensive patients. Ann Intern Med
1995a;123:481–7.
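A minimal sketch of this one-at-a-time strategy on simulated data is shown
below (the variable names are illustrative; these are not the data of Table 6–6).
Each step refits a logistic regression model and prints the exposure odds ratio:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
df = pd.DataFrame({
    "prior_mi": rng.integers(0, 2, n),
    "heart_failure": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
})
# exposure and outcome both depend on the confounders (simulated)
df["npsd"] = (rng.random(n) < 0.2 + 0.2 * df["prior_mi"]).astype(int)
df["sudden_death"] = (rng.random(n) < 0.02 + 0.05 * df["prior_mi"]
                      + 0.03 * df["heart_failure"]).astype(int)

included = []
for step in ["crude", "prior_mi", "heart_failure", "diabetes"]:
    if step != "crude":
        included.append(step)
    X = sm.add_constant(df[["npsd"] + included])
    fit = sm.Logit(df["sudden_death"], X).fit(disp=0)
    print(f"{step:>15}: OR = {np.exp(fit.params['npsd']):.2f}")
```

When the odds ratio stabilizes as further confounders are added, as in Table
6–6, this offers some reassurance that residual confounding by similar,
measured characteristics is limited.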
Propensity Scores
The propensity score represents the probability of receiving the intervention. It
often (for example in the case of a dichotomous intervention variable) results
from a multiple logistic regression analysis including patient and other
characteristics believed to be related to initiation of the intervention as
independent variables and exposure to the intervention as the dependent
variable. Thus, the propensity score focuses on the left arrow of the confounding
triangle and summarizes information from all potential confounders. Among
patients with the same propensity score, the measured characteristics are, on
average, balanced between those who do and do not receive the intervention, so
their prognosis in the absence of the intervention is expected to be comparable.
Rosenbaum and Rubin [1984] were the first to
summarize all characteristics related to the initiation or non-initiation of the
intervention in a propensity score. In the Metoo-coxib example, this would
imply that a score predicting the use of Metoo-coxib instead of the reference
exposure (e.g., other NSAIDS) would first be derived. After a propensity score is
calculated for each participant, one can match those who are receiving and not
receiving the intervention according to their propensity score or include the
score in a multivariable regression analysis [Rubin, 1997]. The popularity of the
propensity score in observational studies on intended and unintended effects of
drugs has increased rapidly in recent years [Rutten et al., 2010; Yasunaga et al.,
2013]. The method, however, has its inherent limitations. These include the
complexity of developing appropriate propensity scores (in fact, many studies
fail to report in detail how the score was derived) and the fact that only known
and measurable patient characteristics can be accounted for [Belitser et al., 2011;
Heinze & Jüni, 2011].
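A minimal sketch of deriving and using a propensity score is given below, on
synthetic data with illustrative covariate names; real applications require
careful covariate selection and checking of balance after matching:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "prior_mi": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
})
# treatment assignment depends on covariates: confounding by indication
logit = -8 + 0.1 * df["age"] + 0.5 * df["prior_mi"] + 0.3 * df["diabetes"]
df["treated"] = rng.random(n) < 1 / (1 + np.exp(-logit))

covariates = ["age", "prior_mi", "diabetes"]
ps_model = LogisticRegression().fit(df[covariates], df["treated"])
df["ps"] = ps_model.predict_proba(df[covariates])[:, 1]

# greedy 1:1 nearest-neighbour matching on the propensity score
controls = df[~df["treated"]].copy()
matched_pairs = []
for i, row in df[df["treated"]].iterrows():
    j = (controls["ps"] - row["ps"]).abs().idxmin()
    matched_pairs.append((i, j))
    controls = controls.drop(j)
print(f"matched {len(matched_pairs)} treated patients to controls")
```

The matched sample can then be analyzed as if treatment had been allocated at
random with respect to the measured covariates; unmeasured characteristics, of
course, remain unaddressed.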
Table 6–7 compares several available methods to limit confounding in
observational studies assessing the effects of interventions. The example is taken
from a study on the intended effect of influenza vaccination on influenza-related
complications, including death [Hak et al., 2002]. The methods compared
include restriction (separate analyses in the elderly and in younger subjects are
presented), individual matching (“quasi-experiment,” which requires a
conditional analysis to account for the matching), one-by-one inclusion of
individual confounders in a multivariable regression analysis, and the propensity
score method. Because influenza vaccination is expected to reduce
complications, the crude odds ratio of 1.14 indicates confounding by indication.
Restriction of the study population to certain age categories and inclusion of a
few confounders in a multiple regression model reduced confounding
dramatically (OR < 1.0). Also, individual matching according to different
confounders (quasi-experiment) or on the propensity score clearly reduced
confounding, while subsequent inclusion of additional potential confounders did
not change the effect estimate.
TABLE 6–7 Methods to Limit Confounding in an Observational Study on the Effect of Influenza
Vaccination
Study Population and Analysis Adjusted For Odds Ratio (95% CI)
Adult patients
(18–102 y, n = 1,696) Crude value 1.14 (0.84 to 1.55)
Conventional control: MLR*
+ Age (in years) 0.87 (0.64 to 1.20)
+ Disease (asthma/COPD) 0.82 (0.59 to 1.13)
+ GP visits (in number) 0.76 (0.54 to 1.05)
+ Remaining factors 0.76 (0.54 to 1.06)
Elderly patients
(65–102 y, n = 630) Crude value 0.57 (0.35 to 0.93)
Conventional control: MLR*
+ Age (in years) 0.56 (0.35 to 0.92)
+ Disease (asthma/COPD) 0.53 (0.32 to 0.87)
+ GP visits (in number) 0.50 (0.30 to 0.83)
+ Remaining factors 0.50 (0.29 to 0.83)
Younger patients
(18–64 y, n = 1,066) Crude value 1.27 (0.84 to 1.94)
Conventional control: MLR*
+ Age (in years) 1.11 (0.73 to 1.70)
+ Disease (asthma/COPD) 1.08 (0.70 to 1.66)
+ GP visits (in number) 0.94 (0.61 to 1.47)
+ Remaining factors 0.94 (0.60 to 1.45)
Quasi-experiment
(18–64 y, n = 676) Matched crude value 0.90 (0.63 to 1.52)
Conventional control: MCLR†
+ Age/disease/GP visits 0.89 (0.52 to 1.54)
Younger patients
(18–64 y, n = 1,066) Matched crude value 0.87 (0.56 to 1.35)
Propensity score: MCLR†
+ Age/disease/GP visits 0.86 (0.55 to 1.35)
*MLR, multivariable logistic regression analysis; †MCLR, multivariable conditional logistic regression
analysis.
Reproduced from Hak E, Verheij TJ, Grobbee DE, Nichol KL, Hoes AW. Confounding by indication in
non-experimental evaluation of vaccine effectiveness: the example of prevention of influenza
complications. J Epidemiol Community Health 2002;56:951–5, with permission from BMJ Publishing
Group Ltd.
INTRODUCTION
The design of data collection is an element of critical importance in the
successful design of clinical epidemiologic studies. The prime consideration in
choosing from different options to collect data is the expected quality of the
results of the data analyses in terms of relevance, validity, and precision. The
relevance is first and foremost determined by the research question, with the
type of subjects from whom data are collected adequately reflecting the domain.
A number of other issues are important as well. Time constraints and budgetary
aspects of a study may impact the choice of study population and type of data
collection. For example, when a widely used drug is suspected of causing a
serious side effect, it is usually impossible to postpone action for a number of
years until a study yields results. Also, lack of money may force an investigator
to limit the number of measurements or the size of the group of patients studied.
Sometimes ethical limitations apply, for example when an investigator wants to
examine whether particularly high doses of radiotherapy induce secondary
tumors in patients treated for a primary cancer. The investigator should
preferably use the data at hand rather than wait until another group of patients is
exposed.
There is no unique optimal way to collect data for any research question.
Despite the sometimes fiercely voiced belief that the most reliable results are
obtained in a randomized trial, there are many examples of bad trials and many
of much better “non-trials,” and there are obvious instances where a trial is not
feasible or otherwise not justified. This chapter discusses some general aspects
of the design of data collection, with the goal of offering a consistent and
comprehensive taxonomy without confusing terminology.
In clinical epidemiology, all studies can be classified according to three
characteristics: time, census or sampling, and experimental or observational.
TIME
Time is an essential aspect of data collection. The time between collection of
determinant and outcome information can be zero or larger than zero. When data
on determinant and outcome are measured simultaneously, the time axis of the
study is zero and the study is called cross-sectional. In all other study types the
time axis is larger than zero. Furthermore, both determinant and outcome data
already may or may not be available at the start of the study. If the data have
been recorded in the past (i.e., have been collected retrospectively), the study is
termed retrospective. When the data are yet to be collected and recorded for both
outcome and determinants when the study is started, the data are collected
prospectively and the study is termed prospective. Combinations of retrospective
and prospective data collection can occur.
There are no inherent implications for the validity of a study when data are not
prospectively collected. Still, frequently authors as well as readers use and
interpret the term retrospective as a negative qualification. Retrospective data
should only be viewed with caution if a similar study with a prospective data
collection would provide results that are more valid, precise, and/or clinically
relevant. For example, in an etiologic study, the available retrospective data may
lack information on certain confounders or have confounder information that is
less precise than necessary for full adjustment. Results from such a study may be
biased or contain residual confounding that would not apply if data had been
collected prospectively.
Alternatively, data on certain outcomes may be lacking. The results would
then necessarily be restricted to inferences made from the outcomes that are in
the data. While restricted, the research may still be valid and relevant. In
descriptive research, the lack of particular data may create even fewer problems
because there is not a need for full confounder information. Consider, for
example, a study on the value of exercise testing in the diagnostic workup of
patients suspected of ischemic coronary disease. An available database may not
include results from troponin measurements, which are being used to assist the
diagnosis in these patients. Consequently, the added value of exercise testing in
the presence of troponin measurements cannot be studied. Still, the results may
be useful to position exercise testing for those settings in which there is no
access to troponin measurements in these patients.
Retrospective data collection may suffer more from missing data than data
that are purposely collected prospectively. Missing data, for example, are a
typical problem for routine clinical data that were stored before they were used
for research. Here, the size of the problem depends on the importance of the
variables that are missing and the proportion of subjects with missing data.
Depending on the size of the overall study and the completeness of other data,
the problem of missing data may be reduced or overcome by estimating the
value of the missing data points using, for example, multiple imputation. The
principle of imputation is based on the view that if sufficient information on a
certain subject, or comparable subjects, is available the value of unobserved
variables may be estimated with confidence. For example, suppose that in an
existing database the data on body weight is missing for some individuals. With
the use of available data on height, age, gender, and ethnicity, a reliable estimate
of an individual’s body weight may be obtained through regression modeling.
Provided that the number of missing data is not too high, say less than 10% for a
few variables, valid analyses may be done on all subjects.
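As an illustration, the sketch below performs a regression-based imputation of
missing body weight on simulated data. This is a single imputation for brevity;
a full multiple-imputation analysis would repeat the step several times, adding
random draws to reflect the uncertainty of the imputed values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({
    "height_cm": rng.normal(172, 9, n),
    "age": rng.integers(30, 80, n),
    "male": rng.integers(0, 2, n),
})
df["weight_kg"] = (-60 + 0.8 * df["height_cm"] + 0.05 * df["age"]
                   + 5 * df["male"] + rng.normal(0, 6, n))
df.loc[rng.random(n) < 0.08, "weight_kg"] = np.nan  # ~8% missing

observed = df["weight_kg"].notna()
predictors = ["height_cm", "age", "male"]
model = LinearRegression().fit(df.loc[observed, predictors],
                               df.loc[observed, "weight_kg"])
df.loc[~observed, "weight_kg"] = model.predict(df.loc[~observed, predictors])
print(f"imputed {int((~observed).sum())} of {n} body weights")
```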
It is important to realize that the time dimension of a study is not necessarily
the same as the time dimension of the object of research. With the exception of
diagnostic research, where diagnostic determinants and the outcome occur at the
same time, all determinant outcome relationships are longitudinal by nature.
Take, for example, a study on the relationship between the BCR-ABL gene and
leukemia that is conducted with a time axis of zero (i.e., cross-sectionally).
Genes are measured in all patients. While in this study determinant and outcome
information were collected at the same point in time (and thus time is zero), the
inference of an increased risk of leukemia in those with the p210 BCR-ABL
gene points at a longitudinal relationship: Those with the gene have an increased
risk of acquiring the disease in the future.
The terms retrospective and prospective thus refer to the timing of data
collection, that is, before or after the study is initiated. Historical cohort study
would be a better name than retrospective cohort study because it more directly
speaks to the operational aspects of the study. However, the term retrospective is
much more commonly used.
CENSUS OR SAMPLING
When the determinant(s) and outcome (and, when relevant, confounders or
effect modifiers) are measured in all members of a population that is studied
(such as in a cohort study) a “census” approach is taken. The cohort study is the
paradigm of epidemiologic research. A cohort is a group of subjects from whom
data are collected over a certain time period. The word cohort is derived from
Roman antiquity, where a cohort was a body of about 300 to 600 soldiers, the
tenth part of a legion. Once part of the cohort, there was no escape; you always
remained a member. Now that you are reading this text, you are part of the
cohort of readers who read the text. You will never get rid of that qualifying
event.
In epidemiologic research, the qualifying event for becoming member of a
cohort is typically that a subject is selected together with a smaller or larger
group of other individuals to become part of a study population that is then
followed over time. Sometimes, subjects can enter and leave a study population,
as for example the population of a town that is followed over time. As the
months and years go by, people will move into the town and become part of the
study population while others will leave. Such a study population is best called a
dynamic population. The membership of a cohort is fixed (in essence, once a
member, always a member until you die) while dynamic populations change
over time. The term dynamic cohort is an oxymoron. For reasons of simplicity
we will use the term cohort studies for all studies taking a census approach and
with time between the measurement of the determinant and outcome being larger
than zero. Thus, both conventional cohort studies and dynamic population
studies will be referred to as cohort studies.
In studies of cohorts and dynamic populations, epidemiologic analyses will
compare the development of disease outcomes across categories of a
determinant. For example, if the risk of heart disease is elevated among those
with high blood homocysteine levels, the rates of disease will be higher in those
with a high baseline homocysteine level compared to those with a low baseline
homocysteine level. This is epidemiologic research in its most basic form.
Clearly, when the causal role of high homocysteine in the occurrence of heart
disease needs to be clarified, a number of confounders must be taken into
account simultaneously.
Sometimes investigators may face the need to follow a large population to be
able to address particular rare outcomes, for example, in the study of the gene–
environment interaction and the occurrence of Hodgkin’s lymphoma. To
determine genetic abnormalities in the whole population would create
insurmountable expenses. An alternative is to wait until cases of lymphoma
occur (“cases”) and perform genetic analyses only in those with the disease and
in a random sample of the remainder of the population (“controls”). The purpose
of such a sampling approach is straightforward. If a valid sample is taken and the
sample is sufficiently large, the distribution of determinants (and, in causal
studies, confounders) in the sample will reliably reflect the distributions in the
population experience from whom the sample was drawn. In other words, the
sample provides the same information as the much larger full population would.
Across categories of the determinant in the combined samples of diseased
subjects and controls, relative rates and risks can now be calculated with
adjustments for confounders where appropriate. In this approach, rather than
examining the entire population (census), an equally informative subgroup of the
population is studied (sampling). Such a study is called a case-control study.
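A small simulation can make this equivalence tangible. The sketch below uses
synthetic data with an assumed true relative risk of 2 and a rare outcome;
sampling four controls per case yields nearly the same odds ratio as analyzing
the entire population:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
exposed = rng.random(n) < 0.30
risk = np.where(exposed, 0.002, 0.001)   # true relative risk = 2
case = rng.random(n) < risk

def odds_ratio(exp, cas):
    a, b = np.sum(exp & cas), np.sum(exp & ~cas)
    c, d = np.sum(~exp & cas), np.sum(~exp & ~cas)
    return (a * d) / (b * c)

# census: analyze the whole population
print(f"full cohort OR:  {odds_ratio(exposed, case):.2f}")

# sampling: all cases plus four controls per case
cases_idx = np.flatnonzero(case)
controls_idx = rng.choice(np.flatnonzero(~case),
                          size=4 * len(cases_idx), replace=False)
idx = np.concatenate([cases_idx, controls_idx])
print(f"case-control OR: {odds_ratio(exposed[idx], case[idx]):.2f}")
```

Because the outcome is rare, the odds ratio here also approximates the relative
risk of 2 that was built into the simulation.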
There is no innate reason why the results of a case-control study should be
different than when the whole population is analyzed, as long as the researcher
adheres to some fundamental principles. The main principle in sampling is that
determinants are sampled without any relationship to outcomes, and that
outcomes are sampled without relationship to the determinant. If not, then the
relationships may be biased. Suppose, for example, that only cases of Hodgkin’s
disease are sampled that are known to have the oncogene BCL11A. It will come
as no surprise that this gene will show an increased risk even though it may not
play a role in reality. In a case-control study, biased inclusion of cases or biased
sampling of controls should be prevented. For example, in some situations, cases
may only become known to the investigator when they have certain
determinants; a physician may be less suspicious of gastrointestinal bleeding
problems in patients using a new nonsteroidal anti-inflammatory drug (NSAID)
that is marketed as much safer than another, older brand. In contrast, when
examining patients using the older drugs the same physician may be more
suspicious and thus discover more cases of minor bleeding. If a case-control
study were to be conducted using the cases noted in this physician’s practice
over a period of time, a spurious relationship favoring the newer drug would
have been introduced, one that does not necessarily reflect reality.
Another issue in case-control studies compared to full cohort analyses is that
the number of controls sampled needs to be sufficiently large to obtain adequate
precision. There is no general rule about how large a control sample needs to be.
Given that all cases that arise in a population are included in the research, this
will depend on the strength of the relationship being studied and the frequency
with which particular determinants of interest occur in the population. Generally,
one to four times the size of the case series is drawn.
Frequently, in a case-control study the actual size of the population from
which cases and controls are drawn is not exactly known. For example, in a
well-known case-control study on the risk of vaginal cancer in daughters
of mothers exposed to diethylstilboestrol (DES), a case series was collected and
a number of controls without any reference to the size of the population from
which the cases and controls originated (see Figure 7–1). If the population size
is not known, a limitation of the study is that no estimates of absolute risk,
such as rates or rate differences, can be obtained. Then, only relative measures of
risk, notably odds ratios, may be obtained. However, in those instances where
cases and controls are sampled from a population of known size, the same
absolute and relative measures of disease risk can be calculated as in a regular
full cohort analysis (i.e., using the census approach).
Case-control studies are best known for their role in etiologic research on
relationships between determinants and rare outcomes. However, case-control
studies also may be fruitfully employed in descriptive research, such as in
diagnostic and prognostic studies.
EXPERIMENTAL OR OBSERVATIONAL
STUDIES
The world is full of data, most of which are waiting to be studied. Indeed, most
published clinical epidemiologic research is based on data that were previously
collected from available sources, such as data in patient records or clinical files,
or on data that were collected in groups of subjects for the purpose of research.
To take the paradigmatic cohort study again, investigators typically start with a
goal of relating a particular determinant to an outcome, as for example in a study
on breast cancer risk among women using long-term estrogen-progestin
treatment. Researchers would start by collecting data on hormone use plus
relevant confounders and then follow the population over time to relate baseline
drug information to future occurrences of breast cancer.
Sometimes a cohort study is started from a particular research aim, but with
time the data may offer many other opportunities to address questions that were
not on the mind of the investigator when the research was initiated. This makes
cohorts highly valuable assets to investigators. The limitations rest only in the
type of population studied and the extent of determinant and outcome (and, if
applicable, confounder or modifier) information collected.
Sometimes the investigator will not rely on the mere recording of determinant
data that occur “naturally,” but rather may wish to manipulate exposure to
certain determinants or allocate patients purposely to a particular exposure, such
as a drug, with the principal goal of learning about the effects of this exposure.
The investigator thus conducts an experiment and such studies are called
experimental studies, in contrast to nonexperimental studies, where the
determinant is studied as it occurs naturally. The difference between a physician
treating patients with a particular drug and an investigator allocating a patient to
a particular drug is in the intention. The intention of the physician is simply to
improve the condition of the patient, while the investigator wants to learn about
the effect of the drug, quantify the extent of improvement, and document any
safety risks. Experiments in clinical epidemiology are called trials.
The best-known and most widely used type of trial is the randomized trial,
where patients are allocated to different treatment modalities by a random
process. A randomized trial obviously differs from the deliberate prescription of
drugs to patients in clinical care. However, when an investigator decides to study
a new series of arthritis patients specifically to determine the functional benefit
of knee replacement surgery, measuring functional status before and
after the operation, he is also engaged in a trial. Studies are either experimental
or nonexperimental. The term nonexperimental, while logical, is not commonly
used. Rather, nonexperimental studies in epidemiology are called observational.
The contrast between experimental and observational is somewhat peculiar
because it seems to imply that in experiments no observations are made.
• A cohort study has a time dimension greater than zero; analyses are based
on a census of all subjects in the study population, and the data collection
can be conducted prospectively or retrospectively. The study can be
observational or experimental, but if it is experimental it usually takes the
form of a randomized trial.
• A dynamic population study has a time dimension greater than zero;
analyses are based on a census of all subjects in the study population for the
time they are members of the population, and the data collection can be
conducted prospectively or retrospectively. Such studies are typically
nonexperimental (i.e., observational). Because the term dynamic population
study is hardly ever applied in the literature, we use the term cohort study to
indicate both studies involving dynamic populations and cohorts.
INTRODUCTION
The classic epidemiologic approach is to collect data on a defined population (a
cohort) and relate determinant distributions at baseline to the occurrence of
disease during follow-up. This research approach has led to our understanding
of such diverse cause and disease relationships as that between cholesterol
levels and the lifetime risk of coronary heart disease at selected ages in the
Framingham Heart Study [Lloyd-Jones et al., 2003], the relationship between
physical activity and the risk of prostate cancer in the Health Professionals
Follow-up Study [Giovannucci et al., 2005], the relationship between smoking
and lung cancer during 50 years of observation in the British Doctor’s Study
[Doll et al., 2004], the relationship between caloric restriction during the Dutch
famine of 1944–1945 and future breast cancer in the DOM (which stands for
“Diagnostisch Onderzoek Mammacarcinoom” or “Diagnostic Study on breast
cancer”) cohort [Elias et al., 2004], the relationship between apolipoprotein E
(Apo-E) and Alzheimer’s disease in the Rotterdam Study [Hofman et al., 1997],
and the relationship between radiation and leukemia in atomic bomb survivors in
Hiroshima [Pierce et al., 1996].
The essential characteristic of a cohort study is that data are collected from a
defined group of people, which forms the cohort. Cohort membership is defined
by being selected for inclusion according to certain characteristics. For example,
in the Rotterdam Study, 7,983 subjects age 55 years and over who agreed to
participate after invitation of all inhabitants in a particular neighborhood of
Rotterdam formed the Rotterdam Study cohort [Hofman et al., 1991].
The typical design of data collection in a cohort study is to start to collect data
at the time of the inception of the cohort. The starting point of a cohort, t = zero,
is called the baseline. Sometimes, as in the Framingham Study, data collection is
subsequently repeated at certain time intervals, but for other cohorts only a
single set of baseline data is collected. After the baseline collection, a cohort is
generally followed over time and disease occurrences among the members are
recorded. The term cohort study was used for the first time in research in the
1930s.
Some of the best-known cohort studies start from a population of presumed
healthy individuals, but cohort studies can equally well be conducted with
groups of patients. For example, one etiologic study followed a cohort of
premature neonates for chronic cerebral damage and related behavioral problems
[Rademaker et al., 2004]. This same cohort was also used to study the prognostic
meaning of neonatal cerebral imaging by ultrasound compared to magnetic
resonance imaging (MRI) scanning [Rademaker et al., 2005]. Prognostic cohort
studies are obviously conducted on cohorts of patients. In diagnostic studies, the
cohort typically consists of subjects suspected of having the disease of interest in
whom the value of diagnostic testing is studied.
BACKGROUND: The prognostic impact of primary tumor resection in patients presenting with
unresectable synchronous metastases from colorectal carcinoma (CRC) is not well established. In the
present study, we analyzed 15 factors to define the value of primary tumor resection with regard to
prognosis.
PATIENTS AND METHODS: We identified 186 consecutive patients with proven stage IV CRC
from the years 1995 to 2001. Variables were tested for their relationship to survival in univariate
analyses with the Kaplan-Meier method and the log rank test. Factors that showed a significant impact
were included in a Cox proportional hazards model. The tests were repeated for 107 patients who had
no symptoms from their primary tumor.
RESULTS: Overall there were six independent variables with a relationship to survival: performance
status, ASA-class, CEA level, metastatic load, extent of primary tumor, and chemotherapy. In the
asymptomatic patients we investigated 13 factors, 3 of which proved to be independent predictors of
survival: performance status, CEA level, and chemotherapy. Resection of primary tumor was only
predictive of survival if in-hospital mortality was excluded.
CONCLUSION: Resection of the tumor, if possible, is doubtless the best option for stage IV CRC
patients with severe symptoms caused by their primary tumor. In asymptomatic patients,
chemotherapy is preferable to surgery.
Reproduced from Stelzner S, Hellmich G, Koch R, Ludwig K. Factors predicting survival in stage IV
colorectal carcinoma patients after palliative treatment: A multivariate analysis. J Surg Oncol 2005;
89:211–217.
CROSS-SECTIONAL STUDIES
Cross-sectional studies are cohort studies with a time interval of zero between
the collection of determinant and outcome data. In other words, the determinant
and outcome information are collected simultaneously. An example is a study on
the relationship between certain determinants and joint bleeds in hemophilia
patients, where a history of bleeding is obtained at the same time as the possible
risk factors for bleeding (e.g., compliance with treatment, dosage of treatment,
and engagement in sports and other activities with trauma risk).
Another example is the analysis of risk of congenital malformations after
exposure to antidepressant drugs during pregnancy, where all the data are
collected from women at the time of delivery of their children, who may or
may not have malformations. It is important to realize that while the data collection
for determinants and outcome is organized at the same time, the association
being studied is longitudinal. The assumption is that drug exposure precedes the
occurrence of malformations. The consequence is that the investigators need to
seek assurance that no bias is introduced by this difference between the timing of
data collection and the temporal sequence of the presumed cause and effect. For
example, suppose that women with malformed children have a better
recollection of their drug use during pregnancy; this may induce an invalid,
biased association between the drug use and the congenital malformation. This
problem is known as recall bias.
When a study is cross-sectional, it is not necessarily conducted at a single
point in time. Even though data collection of determinants and outcome in an
individual takes place simultaneously at a particular moment, different
individuals participating in a study may be examined sequentially over a longer
time period.
ECOLOGIC STUDIES
Ecologic studies are cohort studies. The cohort is assembled from the aggregate
experience of several populations, for example, those living in different
geographic areas. In contrast to the usual approach in cohort studies, data are
collected from summary measures in populations rather than from individual
members of populations. For example, a study on the proportion of alcohol
intake from wine and the occurrence of coronary heart disease used the
distribution of wine intake across countries and the country-specific rates of
coronary heart disease to determine the possible cardioprotective effect of
different levels of wine consumption. The data were from different populations,
but the inference was made for individuals within populations, suggesting that
rather than alcohol per se, it was the cardioprotective effect of wine that was
particularly clear (see Figure 8–1) [Criqui & Ringel, 1994]. The study was
etiologic, and this implies that the effect from wine on heart disease risk should
be adjusted for confounders. In particular, there seem to be several aspects of
lifestyle, including dietary habits, which could confound the observed crude
association.
A major problem in ecologic studies is the very limited extent to which
confounder information is generally available. For example, data on differences
in fat intake in populations of countries with a different wine consumption may
not exist, or when data are available at a population level, the distribution within
a country and its relationship to the distribution of wine intake within that
country may remain unknown. Even when two countries show similar overall
levels of intake of fat and wine, within the countries the relationship between fat
intake and wine consumption on an individual level may be different. Indeed,
with regard to wine and heart disease risk, a more extensive analysis of a number
of cohort studies with ample adjustment for confounders showed that an initial
ecologic observation of a higher cardiovascular protection from wine compared
to other alcoholic beverages could not be confirmed [Rimm et al., 1996]. This
implies that it is the alcohol, rather than its form, that conveys protection. In
clinical epidemiology, an example of an ecologic comparison is that between
different hospital infection rates in relation to local policies regarding infection
prevention. Even though the crude association suggests that infection rates are
higher in those hospitals with a less extensive prevention program, this still may
be confounded by, for example, differences in the type of surgery between
hospitals. Because of inherent difficulties with handling of confounding,
ecologic studies generally do not provide strong evidence in favor of or against
causal associations.
FIGURE 8–1 Example of an ecological study assessing the relationship between wine consumption and the
coronary heart disease (CHD) mortality rate in men aged 44 to 64.
Reproduced from The Lancet, Vol. 344; Criqui MH, Ringel BL. Does diet or alcohol explain the French
paradox? 1719. © 1994, reprinted with permission from Elsevier.
Missing Data
Probably one of the most general and difficult problems with the use of routine
care data is that certain data are missing in the files. Missing data pose a problem
in all types of medical research, no matter how strict the design and protocols.
But this problem is accentuated in research based on routine care data, as there is
commonly no strict case-record-form or data measurement protocol in daily
practice.
In epidemiologic research we distinguish three types of missing data [Rubin,
1976]. If subjects whose data are missing are a random subset of the complete
sample of subjects, the missing data are called missing completely at random
(MCAR). Typical examples of MCAR are an accidentally dropped tube
containing venous blood (thus blood parameters cannot be measured) or a
questionnaire that is accidentally lost. The reason for the missing data is
completely random. In other words, the probability that an observation is
missing is not related to any other patient characteristic.
If the probability that an observation is missing depends on information that is
not observed, like the value of the observation itself, the missing data are called
missing not at random (MNAR). For example, data on smoking habits may be
more likely to be missing when subjects do not smoke.
When missing data occur in relation to observed patient characteristics,
subjects with missing data are a selective rather than a random subset of the total
study population. This pattern of missing data is confusingly called missing at
random (MAR), where missing values are random conditional on other available
patient information [Rubin, 1976]. Data that are missing at random are very
common in routine care databases. For example, in a diagnostic study among
children with neck stiffness, investigators quantified which combination of
predictors from patient history and physical examination could predict the
absence of bacterial meningitis (outcome), and which blood tests (e.g., C-
reactive protein level) have added predictive value. Patients presenting with
severe signs such as convulsions, which commonly occur among those with
bacterial meningitis, often received additional blood testing before full
completion of patient history and physical examination, which in turn were
largely missing in the records. On the other hand, patients with very mild
symptoms, who frequently had no bacterial meningitis, were more likely to have
a completed history and physical but were less likely to have had additional
tests, because the physician had already ruled out a serious disease. Missing data
on particular tests was thus related to other observed test results and—although
indirectly—to the outcome.
This mechanism of missing data is even more likely to occur in longitudinal
studies based on routine care data. When following patients over time in routine
care practice, loss to follow-up is a common problem and often is directly related
to particular patient characteristics. Accordingly, outcomes may be only
available for particular patients, the selection of whom is related to certain
determinants. Consider a study to compare the prognosis of patients with
minimal versus invasive cancer. Suppose that patients who were treated in a
particular hospital during a certain period were followed over time using
data from patient records. Follow-up information for subsequent morbidity may
be more complete for patients with initial invasive cancer, because these patients
visited the clinic more regularly and during a longer time period as part of
routine procedures. One can easily check whether data are MCAR [Van der
Heijden et al., 2006]. If the subset of patients with and without missing values
does not differ on the other observed patient characteristics, the missing values
are likely MCAR (although theoretically they might still be MNAR).
Typically, in epidemiologic research, missing data are neither MCAR nor
MNAR, but rather MAR, although this cannot be tested, only assumed [Donders
et al., 2006; Greenland & Finkle, 1995; Little & Rubin, 1987; Schafer, 1997;
Schafer & Graham, 2002; Vach, 1994].
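The three mechanisms are easily mimicked in a small simulation. The sketch below, in Python with hypothetical variables and missingness probabilities chosen only for illustration, generates missing values in a smoking variable under each mechanism; the sole difference lies in what the probability of being missing depends on.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(60, 10, n)        # fully observed patient characteristic
smoking = rng.binomial(1, 0.3, n)  # variable that will receive missing values

# MCAR: the probability of a missing value is the same for everyone
miss_mcar = rng.random(n) < 0.20

# MAR: missingness depends only on an OBSERVED variable (here, age)
miss_mar = rng.random(n) < np.where(age > 65, 0.35, 0.05)

# MNAR: missingness depends on the UNOBSERVED value itself
# (here, nonsmokers more often have no smoking status recorded)
miss_mnar = rng.random(n) < np.where(smoking == 0, 0.30, 0.05)

for label, miss in (("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)):
    print(label, "proportion missing: %.2f" % miss.mean())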
There are various methods for dealing with missing values in clinical
epidemiologic research. The best method obviously is to conduct a more active
follow-up of the patients for whom crucial information is (partly) missing in
order to obtain as much as possible of this information. For example, in the
previously mentioned cancer study with the selective follow-up, the researchers
could conduct a more active follow-up of all patients regardless of the baseline
disease condition. Similarly, in the Utrecht Health Project, routine care data are
supplemented with predetermined additional data collection [Grobbee et al.,
2005]. The quality of the routine care data in the Utrecht Health Project is
further optimized by a dedicated training program for healthcare personnel, with
ample attention given to ensure complete and adequate coding.
If a more active follow-up does not suffice or is not feasible, however,
researchers usually exclude all subjects with a missing value on any of the
variables from the analysis. The so-called complete or available case analysis is
the most common method currently found in clinical epidemiologic studies,
probably because most statistical packages implicitly exclude the subjects with a
missing value on any of the variables analyzed. Obviously, simply excluding
subjects with missing values reduces precision. But it is commonly not
appreciated that—more seriously—it produces severely biased estimates of the
associations investigated when data are not missing completely at random, as
shown in the examples of the diagnosis of bacterial meningitis and prognosis of
cancer patients presented earlier. It is better to use other methods in the data
analysis than a complete case analysis [Donders et al., 2006; Little & Rubin,
1987; Schafer, 1997; Schafer & Graham, 2002; Rubin, 1987; Vach, 1994; Vach
& Blettner, 1991].
There are a variety of alternative methods to cope with missing values in the
analysis. Some of these are briefly discussed next. Illustrative examples can also
be found in Boxes 8–2 and 8–3.
When missing data are MNAR, valuable information is lost from the data and
there is no universal method of handling the missing data properly [Little &
Rubin, 1987; Rubin, 1987; Schafer, 1997; Schafer & Graham, 2002; Vach,
1994]. When missing data are MCAR, the complete case analysis gives
unbiased, although obviously less precise, results [Greenland & Finkle, 1995;
Little & Rubin, 1987; Moons et al., 2006; Schafer, 1997; Schafer & Graham,
2002; Rubin, 1987; Vach, 1994]. However, like the missing indicator method,
the unconditional mean imputation method still leads to biased results when data
are MCAR [Donders et al., 2006; Greenland & Finkle, 1995]. In the case of
MAR, which is most commonly encountered in research based on routine care
data (as described earlier), a complete case analysis will result in biased
associations between determinants and outcome due to selective missing data.
Also, the indicator method and the unconditional mean imputation method then
give biased results [Donders et al., 2006; Greenland & Finkle, 1995; Little &
Rubin, 1987; Moons et al., 2006; Schafer, 1997; Schafer & Graham, 2002; Vach,
1994]. Only more sophisticated techniques, like conditional single or multiple
imputation and the maximum likelihood estimation method, give less biased, or
rather the most valid, estimates of the study associations. Although single and
multiple conditional imputations both yield unbiased results, the latter is
preferred as it results in correctly estimated standard errors and confidence
intervals, while single imputation yields standard errors that are too small. All
this is illustrated using simple simulation studies in Boxes 8–2 and 8–3.
Empirically, it has been shown that even in the presence of missing values in
about half of the subjects, multiple conditional imputation still yields less biased
results as compared to the commonly used complete case analysis [Moons et al.,
2006]. The question arises how many missing values one may accept and how
many subjects can be imputed before multiple imputations will not suffice.
There are as yet no empirical studies showing an upper limit of missing values that
can be imputed validly.
Consider a diagnostic study with only one continuous diagnostic test and a true disease status
(present/absent).
We simulated 1,000 samples of 500 subjects drawn from a theoretical population consisting of equal
numbers of diseased and nondiseased subjects. The true regression coefficient in a logistic regression
model linking the diagnostic test to the probability of disease was 1.0 (odds ratio = 2.7), with an
intercept of 0. The diagnostic test was normally distributed with mean 0 and standard deviation 2. No
other tests or subject characteristics were considered.
In each sample, 80% of the nondiseased subjects were assigned a missing value on the test. The
diseased subjects had no missing data. Accordingly, missing data were MAR as they were based on
other observed variables, here the true disease status only. Overall about 40% of the data was missing.
Using the procedure mice (for details about the software we refer to the literature [Van Buuren,
1999]), 10 multiply imputed data sets were created for each sample. Then the association between the
test and the disease status, plus its standard error, was estimated in each of the 10 imputed data
sets using a logistic regression model. The 10 regression coefficients and standard errors were then combined
using standard formulas [Rubin, 1987]. One extra data set was imputed and analyzed as a single
imputed data set. Finally, the results were averaged over the 1,000 simulations. For both the single and
multiple imputation procedure, the estimate of the association was indeed unbiased. The single
imputation procedure appears more precise because of the smaller standard error, thus leading to
smaller confidence intervals, but the 90% confidence interval contains the true parameter far less
often than it should (63.6% rather than 90%).
Multiple imputation leads to a larger standard error and wider confidence intervals, but the estimated
standard errors are more correct and the confidence interval has the correct coverage (i.e., 90.3%).
Hence, in contrast to single imputation, multiple imputation gives sound results both with respect to
bias and precision.
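The essence of this simulation can be reproduced in a few lines of code. The sketch below, in Python, is a deliberately simplified version: it analyzes a single sample rather than 1,000, and its imputation step draws from a fitted conditional distribution without propagating the uncertainty in the imputation-model parameters, so it illustrates the principle rather than replacing the mice procedure cited above. It shows the three ingredients: conditional imputation, a logistic regression per imputed data set, and pooling with Rubin's rules.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, m = 500, 10  # subjects per sample, number of imputations

# simulate one sample: test x ~ N(0, 2); logit P(disease) = 1.0 * x
x = rng.normal(0.0, 2.0, n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))

# MAR missingness: 80% of the nondiseased lose their test value
miss = (y == 0) & (rng.random(n) < 0.8)
x_obs = np.where(miss, np.nan, x)

# imputation model: linear regression of the test on disease status,
# fitted on the completely observed rows
obs = ~miss
imp_fit = sm.OLS(x_obs[obs], sm.add_constant(y[obs].astype(float))).fit()
sigma = np.sqrt(imp_fit.scale)  # residual standard deviation

betas, variances = [], []
for _ in range(m):
    # draw each missing value from the conditional distribution given y
    x_imp = x_obs.copy()
    pred = imp_fit.params[0] + imp_fit.params[1] * y[miss]
    x_imp[miss] = pred + rng.normal(0.0, sigma, miss.sum())
    # analysis model: logistic regression of disease on the (imputed) test
    fit = sm.Logit(y, sm.add_constant(x_imp)).fit(disp=False)
    betas.append(fit.params[1])
    variances.append(fit.bse[1] ** 2)

# Rubin's rules: pooled estimate; within- plus between-imputation variance
betas, variances = np.array(betas), np.array(variances)
b_pool = betas.mean()
total_var = variances.mean() + (1 + 1 / m) * betas.var(ddof=1)
print(f"pooled beta = {b_pool:.2f} (true value 1.0), SE = {np.sqrt(total_var):.2f}")

With the single-imputation variant (m = 1 and the within-imputation variance only), the point estimate would be similar but the standard error too small, which is exactly the phenomenon described above.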
BOX 8–3 Illustration of the Problems with the Missing Indicator Method and the Unconditional Mean
Imputation, Even when Values Are Missing Completely at Random
Missing indicator method. We used the same example study as in Box 8–2 but considered a second
continuous test, which is a proxy for the first test. This means that the second test is not directly related
to the disease (OR = 1; regression coefficient = 0) but only to the first test. Fitting a logistic regression
model to predict disease status using the first test only, a positive regression coefficient was found
(case 1). When only the second test was included, we also found a positive association because of the
indirect relationship between disease status and the second test (case 2). Using both tests, only a
positive association for the first test was found, comparable to case 1, and a regression coefficient near
0 for the second test (case 3). Suppose there were missing values on the first test but not on the second
test, and that these are MCAR, that is, occur in equal proportions in diseased and nondiseased subjects. We
defined a missing indicator variable as 1 if the result of the first test was missing and 0 otherwise. One
can see that in a model used to predict the true disease status using both tests plus the missing
indicator, the regression coefficient of the second test would not be 0 as it should be. For the subjects
with no missing data, indeed, case 3 applied. But for the subjects with a missing value on the first test,
case 2—rather than case 3—suddenly applied, as there were no observations for the first test. Hence
the estimate for the regression coefficient of the second test was biased and somewhere between 0, the
true estimate (case 3), and the value of case 2. Moreover, if the regression coefficient of the second
test was biased, so was the regression coefficient of the first test due to the mutual adjustment in
multivariable modeling.
To illustrate this, we performed a second simulation study similar to that of Box 8–2. We again
simulated 1,000 samples of 500 subjects drawn from the same theoretical population, which now also
included a proxy variable for the first test with a correlation of 0.75 with the first diagnostic test. For
the first test, 40% missing values were assigned completely at random, that is, 20% for the diseased
and nondiseased. The table shows that the regression coefficients of both the diagnostic test
(true value 1.0) and the proxy variable (true value 0) were indeed heavily biased. Thus,
although the indicator method has the appealing property that all available information and subjects
are used in the analyses, the fact that it can lead to biased associations for the original variables is
reason enough to discard this method even when missing data are MCAR, let alone when data are
MAR.
Unconditional mean imputation. In the example study in Box 8–2 it may be obvious that the
magnitude and significance of the association (regression coefficient) of the continuous test with the
outcome was completely determined by the degree of overlap of the test result distributions
between the diseased and nondiseased subjects. The less overlap, the higher and more significant the
regression coefficient was. If the two distributions completely overlapped, the regression coefficient
would be 0. Consider the same simulation study as was used for the missing indicator method, with
40% missing values assigned completely at random (20% for the diseased and 20% for the
nondiseased).
Imputing or replacing these missing values by the overall mean of the test result as estimated from the
remaining (observed) subjects—that is, nondiseased and diseased subjects combined—would
obviously increase the amount of overlap in the two test result distributions. Hence, the association
between the test result and the outcome would be diluted and the regression coefficient would be
biased toward 0 and insignificance. This is illustrated in the lower part of this box. The regression
coefficient was not 1, but rather 0.55. Like the indicator method, the overall mean imputation of
missing values should also be discarded, as it leads to biased associations, even when missing data are
MCAR.
Method             Diagnostic Test Regression      Proxy Regression
                   Coefficient (standard error)    Coefficient (standard error)
Indicator method*  0.55 (0.14)                     0.51 (0.08)
Overall mean       0.55 (0.14)                     Not applicable
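Readers who wish to reproduce the gist of these numbers can do so with a few lines of code. The sketch below, in Python under the same assumptions as the simulation described above (true coefficients 1.0 for the test and 0 for the proxy, 40% of test values missing completely at random), fits both flawed approaches in one simulated sample; the estimates come out biased in the directions the table shows, although a single sample will not match the tabulated values exactly.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500

# data mirroring the Box 8-3 setup (values simulated, not the book's)
x1 = rng.normal(0.0, 2.0, n)                        # diagnostic test, true beta = 1
x2 = 0.75 * x1 + rng.normal(0.0, np.sqrt(1.75), n)  # proxy (corr ~0.75), true beta = 0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x1)))      # disease status

miss = rng.random(n) < 0.4                          # 40% MCAR on the first test
mean_obs = x1[~miss].mean()                         # mean of the observed values
x1_filled = np.where(miss, mean_obs, x1)

# missing indicator method: mean-filled test, proxy, and a missing indicator
X_ind = sm.add_constant(np.column_stack([x1_filled, x2, miss.astype(float)]))
fit_ind = sm.Logit(y, X_ind).fit(disp=False)

# unconditional mean imputation: mean-filled test only
fit_mean = sm.Logit(y, sm.add_constant(x1_filled)).fit(disp=False)

print("indicator method: test beta = %.2f, proxy beta = %.2f"
      % (fit_ind.params[1], fit_ind.params[2]))  # biased (true: 1.0 and 0.0)
print("mean imputation:  test beta = %.2f" % fit_mean.params[1])  # shrunk toward 0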
Apart from these problems, routine care data comply with two essential
characteristics of determinant data in descriptive (diagnostic and prognostic)
research. First, routine care data are likely to match the range of variables that
are of interest to the investigator. For example, if an investigator wants to study
the diagnostic value of symptoms, signs, and results from diagnostic tests in
setting a diagnosis of heart failure in general practice and the need for referral to
secondary care, the patient files from primary care practices will likely show
those variables that lead a general practitioner to suspect that a patient has the
disease. General practitioners may use electrocardiography but are unlikely to
routinely have results from chest x-rays. Therefore, although chest x-rays may
add diagnostic information, such data would not be relevant in view of the
research question. Hence, the lack of this variable in the patient records is no
problem. Second, routine data likely reflect a quality of data collection that is
typical of the quality of the data in the application of the research findings in
clinical practice. As an example, when the goal is to determine the diagnostic
value of abdominal palpation for aortic aneurysms in patients suspected of
having this vascular problem, routine records with results from palpation
performed by the average physician are likely to offer a better view of the
diagnostic value of this test in the diagnostic workup of these patients in daily
practice than when all patients were carefully examined by a highly skilled
vascular surgeon.
To conclude, the extent to which patient data from routine care may
effectively and validly be used to answer research questions depends on the type
of research question and the type of research. For causal research, the
availability and quality of confounder data need to be carefully addressed and
may often be shown to be inadequate. In descriptive research, it is important that
the routine care data comprise all clinically relevant diagnostic or prognostic
determinants to yield a relevant research result. For all types of research it is
necessary that the patients can indeed be retrieved from the files based on
uniform and unselective coding, that the outcome is assessed in each subject, and
that missing data are properly dealt with.
Theoretical Design
The research question was, “Does arterial stiffness predict recurrent vascular
events in patients with manifest vascular disease?” This leads to the etiologic
occurrence relation: incidence of vascular events as a function of arterial
stiffness conditional on confounders. The domain is patients who are referred to
the hospital and diagnosed with cardiovascular disease. The operational
definition of recurrent vascular disease (the outcome) was vascular death,
ischemic stroke, coronary ischemic disease, and the composite of these vascular
events. Measurement of arterial stiffness was operationalized by measurement of
distension of the left and right common carotid arteries. Measurement of several
possible confounders and effect modifiers was operationalized using
questionnaires, blood chemistry, and measurement of blood pressure.
BOX 8–4 Cohort Study on the Causal Link Between Carotid Stiffness and New Vascular Events in Patients
with Manifest Cardiovascular Disease
AIMS: To study whether arterial stiffness is related to the risk of new vascular events in patients with
manifest arterial disease and to examine whether this relation varies between patients who differ with
respect to baseline vascular risk, arterial stiffness, or systolic blood pressure (SBP).
METHODS AND RESULTS: The study was performed in the first consecutive 2183 patients with
manifest arterial disease enrolled in the SMART study (Second Manifestations of ARTerial disease), a
cohort study among patients with manifest arterial disease or cardiovascular risk factors. Common
carotid distension (i.e., the change in carotid diameter in systole relative to diastole) was measured at
baseline by ultrasonography. With the distension, several stiffness parameters were determined. In the
entire cohort, none of the carotid artery stiffness parameters was related to the occurrence of vascular
events. However, decreased stiffness was related to decreased vascular risk in subjects with low
baseline SBP. The relation of carotid stiffness with vascular events did not differ between tertiles of
baseline risk and carotid stiffness.
CONCLUSION: Carotid artery stiffness is no independent risk factor for vascular events in patients
with manifest arterial disease. However, in patients with low SBP, decreased carotid stiffness may
indicate a decreased risk of vascular events.
Reproduced from Dijk DJ, Algra A, van der Graaf Y, Grobbee DE, Bots ML on behalf of the SMART
study group. Carotid stiffness and the risk of new vascular events in patients with manifest cardiovascular
disease. The SMART study. Eur Heart J. 2005 Jun;26(12):1213–20.
(Accompanying results table not reproduced.)
Model I: unadjusted
Model II: Model I additionally adjusted for age
Model III: Model II additionally adjusted for mean arterial pressure, sex, age, pack-years smoked, and
use of antihypertensive medication at baseline
aIn all models adjusted for end-diastolic diameter of the carotid arteries and mean arterial pressure.
As published data mainly reported on subjects with risk factors for vascular
disease who generally can be considered to have a lower risk than the patients
with manifest arterial disease in our study, the different reported relationship
between arterial stiffness and vascular disease may be explained by an
association between arterial stiffness and vascular events in low-risk patients
only. However, the observation in studies on patients with end-stage renal
disease who are known to be at high vascular risk that arterial stiffness was
associated with vascular events is not consistent with this explanation. Moreover,
our finding that the association between arterial stiffness and vascular events is
not modified by baseline risk does not support this hypothesis either.
Chapter 9
Case-Control Studies
INTRODUCTION
There is no doubt that of all the available approaches to data collection in
epidemiology, case-control studies continue to attract the most controversy. On
the one hand this is understandable, because many poorly conducted case-
control studies have been reported in the literature and most textbooks in
epidemiology present famous examples of case-control studies that produced
biased results. Indeed, the validity of case-control studies in general is often
questioned, and some epidemiologists go so far as to place case-control studies
at the low end of their hierarchy of study designs, just above the case-report or
case-series designs. This is illustrated by the following statement from the first
edition of a textbook by one of the founders of clinical epidemiology:
If the best you can find is a case-control study, you must recognize that this is a weak design that
often has led to erroneous conclusions [Sackett et al., 1985].
On the other hand, one cannot deny that since their introduction to clinical
research in 1920, case-control studies have proven their potential value, notably
in causal research. Apart from identifying etiologic factors for many diseases
(such as smoking as a causal determinant of lung cancer [Doll & Hill, 1950]),
case-control studies have been important in identifying and quantifying risks of
drugs. Examples of the latter include the association between aspirin use and
Reye syndrome in children [Hurwitz et al., 1987] and between diethylstilboestrol
(DES) use by pregnant women and the occurrence of clear cell vaginal
carcinoma in their daughters [Herbst et al., 1971]. The potential strength of case-
control studies in medicine was emphasized by Kenneth Rothman in the first
edition of his textbook:
The sophisticated use and understanding of case-control studies is the most outstanding
methodological development of modern epidemiology [Rothman, 1986].
Many researchers conduct case-control studies where a group of patients with a certain disease is
identified and compared with another group that does not have the disease. Selection of controls is
often done as if quickly opening a “can” of non-cases, without an appreciation of the primary principle
of case-control studies: Controls should be representative of the population experience from which the
cases emerge. In addition, there is a tendency to match controls to the cases according to a range of
characteristics (notably, potential confounders). This often results in very atypical control subjects
(those with many risk factors for the disease but who manage not to develop the disease), who share
more similarities with “museum exhibits” than with existing individuals. Consequently, and
unfortunately, too many case-control studies could be summarized by the famous Andy Warhol
canvas, “Campbell’s Soup Can.”
One of the problems surrounding case-control studies is the large number of terms applied to indicate
the case-control method or to describe its subtypes. A nonexhaustive list includes these terms:
Alternative terms          Subtypes
Case-referent study        Case-cohort study
TROHOC study               Nested case-control study
Retrospective study        Case-crossover study
                           Case-only study
                           Case-specular study
The left column lists alternative terms for case-control studies that have been suggested over the years.
Although the term case-referent study seems more appropriate, we propose using the term case-
control study instead to ensure that both researchers and readers understand the underlying
methodology. In particular, terms such as TROHOC (the reverse of cohort study) and retrospective
studies should be avoided [Schulz & Grimes, 2002] because they imply a “reverse” nature of the case-
control approach (from disease to determinant instead of the other way around), while the direction of
the occurrence relation is in fact similar to studies using a census approach: outcome as a function of
the determinant. Moreover, case-control studies can be both retrospective and prospective. In the right
column several types of case-control studies are listed. These terms could be used because they do
indicate several methods that can be applied in case-control studies, as long as one realizes that these
studies are in fact case-control studies in that they sample controls from the study base.
FIGURE 9–1 Case-control study. Abbreviations are det, determinant; dis, disease.
In a case-control study, and thus a sampling approach, the same study base as in the
census approach is followed over time to monitor the occurrence of the disease
of interest. In contrast, however, the determinants and relevant covariables are
not measured in all members of the study base, but only in those developing the
disease (the cases) and in a sample of the study base (controls or referents). The
term referents is more appropriate because it clearly indicates that the sample
members are referents from the study base from which the cases emerge, but we
use the term controls because of its widespread use in the literature. By
definition, the members of the control group do not have the disease of interest
when they are selected as controls. It should be emphasized, however, that the
controls are not a sample of the non-cases (shown in Figure 9–1), because these
non-cases only represent those participants who do not develop the disease
during the total follow-up period. In fact, some of the members of the control
group could subsequently develop the disease. Therefore, in the likely event of
changes in the population during the study period (often new people will enter
and others leave the study base with or without having developed the outcome),
it is wiser not to sample controls at one specific time during the study, but
instead at several time points throughout the study experience, to ensure a proper
representation of the study base from which the cases develop. In a later section,
the methods to validly sample controls from the study base will be outlined in
more detail with the introduction of the study base (or “swimming-pool”)
principle.
FIGURE 9–2 The first report of a case-control study published in the medical literature in 1920.
Reproduced from Broders AC. Squamous-cell epithelioma of the lip. A study of 537 cases. JAMA
1920;74:656–64.
The year 1950 heralded an important period in the acceptance of the case-
control method in clinical research. In that year, four case-control studies
assessing the association between tobacco consumption and the risk of lung
cancer were published. Despite methodologic problems in several respects,
including the way the control group was sampled and misclassification of
smoking history, these early studies clearly illustrated the potential of this study
design [Doll & Hill, 1950].
In 1951, Cornfield gave a strong impulse to the further application of the case-
control method by proving that, under the assumption that the outcome of
interest is rare, the odds ratio resulting from a case-control study equals the
incidence ratio that would result from a cohort study [Cornfield, 1951]. Another
influential paper was published in 1959, in which Mantel and Haenszel
described a procedure to derive odds ratios from stratified data, thus enabling
adjustment for potential confounding variables. Later, Miettinen [1976a, 1976b]
made several important contributions to the development of case-control studies,
including landmark publications on how to appropriately sample controls from
the study base so that the resulting odds ratio always (also when the outcome is
not rare) provides a valid estimate of the incidence density ratio that would be
observed in a cohort study.
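Cornfield's rare-disease argument is easily verified numerically. In the sketch below (Python, with a hypothetical full-cohort 2 × 2 table whose counts are made up for illustration), the risk ratio requires knowing the complete cohort, whereas the odds ratio could equally have been computed from the cases plus a valid control sample; because the outcome is rare, the two nearly coincide.

# Hypothetical full-cohort 2x2 table (counts are made up for illustration):
#              cases   non-cases
# exposed        30       9,970
# unexposed      10       9,990
a, b = 30, 9970   # exposed: cases, non-cases
c, d = 10, 9990   # unexposed: cases, non-cases

risk_ratio = (a / (a + b)) / (c / (c + d))  # needs the full cohort (census)
odds_ratio = (a * d) / (b * c)              # estimable from cases plus sampled controls
print(f"risk ratio = {risk_ratio:.3f}")   # 3.000
print(f"odds ratio = {odds_ratio:.3f}")   # 3.006, close because the outcome is rare

With incidence density sampling of controls, as developed by Miettinen, the odds ratio estimates the incidence density ratio without any rarity assumption.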
Over recent decades, the case-control method has been applied throughout the
field of clinical medicine far beyond the research on cancer etiology for which it
was first developed. The method also provides important applications for the
study of intended and unintended effects of interventions. Especially for the
latter, case-control studies have proven their enormous potential. Examples
include studies on the risk of fatal asthma in recipients of beta-agonists, cancer
of the vagina in daughters of mothers receiving DES during their pregnancy,
and, more recently, deep vein thrombosis resulting from the use of third-
generation oral contraceptives. Thus far, the case-control method has not been
widely applied in descriptive (diagnostic and prognostic) research, but its
efficiency in both diagnostic and prognostic research is increasingly being
recognized.
THEORETICAL DESIGN
The research question and associated occurrence relation may take any form,
depending on the objective of the case-control study. Usually, case-control
studies are applied when the goal is to unravel causality, and therefore the
occurrence relation should include conditionality on extraneous determinants
(i.e., confounders). More recently, the case-control method has also been applied
in descriptive research [Biesheuvel et al., 2008].
Identification of Cases
As in any other type of study, the definition of the outcome is crucial. The
challenge to the researcher lies in designing a “net” that is capable of capturing
all members of the study base that fulfill the case definition during the study
period while ignoring those who do not meet the case criteria. In addition, a date
on which the outcome occurred should be designated for each case to facilitate
valid sampling of the control subjects.
Sometimes, existing registries can be applied to identify cases. Examples
include cancer or death registries, hospital discharge diagnoses, or coded
diagnoses in primary care or health maintenance organization databases. It
should be emphasized that the number of false-positive and false-negative
diagnoses in existing registries may be considerable and they clearly depend on
the outcome; for example, death is much easier to diagnose than depression,
benign prostatic hyperplasia, or sinusitis.
When valid registries of the case disease are not available, ad-hoc registries
can be developed. For example, in a case-control study on the risk of sudden
cardiac death associated with diuretics and other classes of blood pressure–
lowering drugs, we developed a method to detect cases of sudden cardiac death
among all treated hypertensive patients in a well-defined geographical area
[Hoes et al., 1995a]. During the 2.5-year study period, all doctors signing a death
certificate received a very short questionnaire, including a question about the
period between the onset of symptoms and the occurrence of death and the
probability of a cardiac origin. Sudden cardiac death was defined as a death
occurring within 1 hour of symptom onset for which a cardiac origin could not
be excluded.
Although in theory rigorous criteria to define the case disease should be
applied, one should weigh the feasibility of these methods against the
consequences of false-positive diagnoses and missing cases (false-negatives).
Misclassification of the outcome will dilute the association between the
determinant and the outcome if such misclassification occurs independent of the
determinants studied. Then, false-positive diagnosis (i.e., non-cases counted as
cases) may lead to a larger dilution than nonrecognition of cases; most of these
false-negatives will not be sampled as controls because in many case-control
studies the outcome is rare. Consequently, incompleteness of a registry does not
necessarily reduce the validity of a study. Misclassification can also be
differential and, thus, depend on the presence of the determinant. For example,
in a case-control study on the risk for deep vein thrombosis among users of
different types of oral contraceptives, such differential misclassification might
occur when thrombosis is more often classified as such in women using
particular oral contraceptives. The bias resulting from such misclassification
may be considerable.
TABLE 9–1 Oral Contraceptive Use and the Risk of Developing Rheumatoid Arthritis
Inclusion of all lung cancer patients diagnosed at several hospitals as the cases
may result in very few (or even zero) cases who never smoked because of the
very high prevalence of smoking among lung cancer patients, while the
proportion of smokers among the controls would be much lower. Adjustment for
confounding by smoking history would then be virtually impossible. One
solution would be to decide to include all lung cancer patients who never
smoked and a random sample (say 30%) of the lung cancer patients with a
positive smoking history as cases. This stratified sampling of the cases would
have important implications for the sampling of controls. In fact, the controls
would need to be sampled analogously. This means that of all controls who were
sampled from the study base, all controls with a negative smoking history, and a
sample (again 30%) of all smoking controls would need to be included as the
control group.
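The reason the control series must be sampled with the same fractions follows directly from the structure of the odds ratio: scaling all cases and controls within a stratum by one and the same fraction leaves the stratum-specific odds ratio untouched. The sketch below (Python, with made-up counts chosen only for illustration) shows this for the smoking-stratified example.

def odds_ratio(a, b, c, d):
    # a, c: exposed/unexposed cases; b, d: exposed/unexposed controls
    return (a * d) / (b * c)

# hypothetical stratum-specific 2x2 tables
print(odds_ratio(90, 60, 60, 90))  # smokers: OR = 2.25
print(odds_ratio(10, 20, 20, 40))  # never-smokers: OR = 1.00

# keep only 30% of the smoking cases AND 30% of the smoking controls
f = 0.3
print(odds_ratio(90 * f, 60 * f, 60 * f, 90 * f))  # still 2.25: fractions cancel

Because the sampling fractions cancel within each stratum, a smoking-stratified (adjusted) analysis of the sampled data remains valid.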
Interestingly, one could also imagine sampling cases in strata according to the
determinant of interest, although this may seem counterintuitive. Stratified
sampling should be considered when the number of cases in a certain category of
the determinant is expected to be very small. Again, this implies a similar
sampling strategy in the control patients.
The strengths of stratified sampling of cases are nicely illustrated in a case-
control study assessing the causal role of the sex of the blood donor in the
development of transfusion-related acute lung injury (TRALI) [Middelburg et
al., 2010]. Most TRALI cases receive blood from multiple donors of either
sex and identification of the sex of the donor causing the TRALI is impossible in
these cases. As a solution, the researchers restricted the analysis to “unisex”
cases, that is, cases that received blood exclusively from either male or female
donors. Consequently, sampling of the controls followed the same selection
process; only “unisex” controls (patients without TRALI that received blood
from only male or only female donors) were included to estimate the sex
distribution of the donors in the study base. Thus, the researchers were able to
show that plasma from female donors increased the risk of TRALI.
It is beyond the scope of this chapter to further elaborate on the specifics of
stratified sampling of cases, because this approach is hardly ever used by
researchers. More information can be found elsewhere [Weinberg & Sandler,
1991; Weinberg & Wacholder, 1990].
A control could develop the case disease later in the study period, although
this is unlikely because the studied outcome in most case-control studies is rare.
Importantly, however, control subjects who later become a case do not violate
the study base principle at all. Such an individual was, at the time of being
sampled as a control, representative of the study base from which the cases
emerge, and only later fulfilled the case definition. Consequently, this subject should be
included both as a case and a control. Similarly, a control subject could again be
randomly sampled as a control later in the study, for example, when subject 4 is
diagnosed. Because both times this control is representative of the study base
from which the cases originate, this control should be included twice. Including
an individual twice during the same study period does not necessarily mean that
all characteristics are the same; exposure (e.g., being prescribed a certain drug)
may have changed.
Sometimes it may be difficult to sample a control each time a case occurs. An
alternative is to assign each case a random date during the study period and
sample controls from the members of the study base on that particular day. In
addition, one could sample controls after a well-defined time period, say after
each week or month.
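A minimal sketch of such risk-set sampling from a dynamic population is given below (Python, with hypothetical entry, exit, and event times made up for illustration). Whenever a case occurs, one control is drawn at random from everyone in the study base at that moment; the sketch also illustrates that a subject sampled as a control may become a case later on.

import random

random.seed(1)

# toy dynamic population: (person_id, time_in, time_out, case_time or None);
# all times are hypothetical, in months
population = [
    (1, 0.0, 12.0, 5.0),
    (2, 0.0, 12.0, None),
    (3, 2.0, 9.0, 8.5),   # eligible as a control at t = 5.0, a case at t = 8.5
    (4, 0.0, 6.0, None),
    (5, 3.0, 12.0, None),
]

def at_risk(t):
    # members of the study base at time t (cases leave once they occur)
    return [p for p in population
            if p[1] <= t <= p[2] and (p[3] is None or p[3] > t)]

# risk-set sampling: each time a case occurs, draw one control at random
# from everyone still in the study base at that moment
for pid, t_in, t_out, t_case in population:
    if t_case is not None:
        candidates = [p for p in at_risk(t_case) if p[0] != pid]
        control = random.choice(candidates)
        print(f"case {pid} at t={t_case}: sampled control = person {control[0]}")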
To assess whether control subjects are indeed part of the swimming pool, the
researcher should answer the following question: “Would the control subject be
identified as a case should he or she develop the outcome under study during the
study period?” The answer should be yes. This rule of thumb can be applied for
essentially all case-control studies.
The study on the risk of sudden cardiac death associated with diuretics and
other antihypertensive drug classes among treated hypertensive patients
introduced earlier may serve as an example of how to sample controls each time
a case develops. The study base consisted of all inhabitants of Rotterdam who
were treated pharmacologically for hypertension, which clearly bears all of the
characteristics of a dynamic population. Each time a case of sudden cardiac
death was identified, a random control was selected as follows: A general
practitioner in Rotterdam was randomly selected using a designated computer
program and this general practitioner was visited at her or his surgery by one of
the researchers. Then, using a computer file of all enlisted adult patients or the
alphabetically ordered paper files, the first patient with the same sex and within
the same 5-year age category was chosen, starting from the first name following
the case’s surname. If, according to the doctor, that patient was using
antihypertensive drugs for hypertension on the day the corresponding case had
died, that patient was included as a control. Age and gender were chosen as
matching variables in this study for reasons that will be explained later in this
chapter. It should be emphasized that the sampling of controls benefited from the
fact that in the Netherlands all inhabitants are enlisted with one general practice
and that virtually all relevant clinical information, including drugs prescribed
and general practitioner and hospital diagnoses, are kept on file there. This
system greatly facilitates control sampling in case-control studies.
For example, if the aim is to quantify the association between certain genetic
polymorphisms and the occurrence of Alzheimer’s disease, a case-control study
within a cohort may be very efficient. Such studies are often termed nested case-
control studies, but other terms are applied, sometimes depending on the method
applied to sample the controls. In this case-control study, three cases of
Alzheimer’s disease are diagnosed among the 15 cohort members during the 12-
month follow-up period. Several methods to sample controls can be applied.
Analogous to sampling controls from a dynamic population, one can
randomly select a control each time a case is diagnosed. At 3 months, the first
control will be sampled from the 13 remaining in the cohort: 15 minus the first
case and minus individual number 10, who was lost to follow-up. Similarly, the
other methods presented earlier for dynamic populations can be applied
[Vandenbroucke & Pierce, 2012]. One can sample a control at a random date
assigned each time a case is diagnosed or one may sample at regular time
intervals, for example every week or month. Again, the control that is sampled is
representative of the study base by definition, and sampling at multiple points in
time during the study period will produce a valid sample. Such an approach may
pose a logistical problem, however, because sampling frames including all
members still in the cohort are needed each time a control is sampled randomly.
In many earlier case-control studies and sometimes even today, the controls
are sampled at the end of the study period from the remainder of the cohort. This
method excludes all cases as well as cohort members who are lost to follow-up.
In our example, the three controls would be sampled from the eight subjects still
in the cohort after the 1-year follow-up period. In contrast to the sampling
methods outlined in previous sections of this chapter, this method clearly
violates the study base principle because the controls are not a representative
sample from the population experience during the entire study period. Especially
when many cohort members are lost to follow-up and many develop the case
disease (i.e., the outcome is not rare), this method will lead to biased estimates of
the determinant–outcome association. For that reason, sampling of controls at
the end of the follow-up period from the remainder of the cohort is discouraged.
A much better alternative is to sample the control group at the beginning of
the follow-up period (t = 0). Although sampling at one specific point in time
seems to carry the danger of violating the study base principle, sampling at t = 0
is an important exception. A quick look at Figure 9–6 clearly shows that a
random selection of the cohort (at t = 0) provides a sample that is representative
of the full cohort (e.g., gives full information on the determinant distribution),
from which all future cases will develop during the study period. This type of
nested case-control study is usually referred to as a case-cohort study. This term
is rather confusing because it does not clearly indicate that, in essence, this study
is a case-control study (because sampling from the study base is involved), not a
cohort study. Because this method is increasingly being applied, a more
elaborate discussion of case-cohort studies and their advantages and limitations
is included in a separate paragraph in a later section.
Population Controls
In theory, population controls should be sampled when the cases included in the case-control study originate from the population at large. This often is the case,
notably when the domain of the occurrence relation is humanity, such as in
etiologic studies examining the links between smoking and lung cancer, and
physical exercise and cardiovascular disease. In case-control studies, because
case identification is commonly restricted in time or region, control sampling
from the population at large ideally should be restricted in a similar manner. The main advantage of sampling population controls in this manner is that they are, by definition, representative of the study base.
In a case-control study addressing the putative causal relationship between
alcohol intake and acute appendicitis (the domain being all humans) in which
cases are drawn from a large general hospital in a defined area during a 1-year
study period, the population at large represents the source of the cases. However,
control sampling, ideally, should be restricted to inhabitants of that defined area
(i.e., the catchment area population of that hospital) during that time period. As
outlined earlier, this may be achieved by sampling from available population
registries at multiple points in time during the study period. Again, posing the
question, “Would the control subject be identified as a case should he or she
develop the outcome under study during the study period?” helps the researcher
and reader to assess the validity of control selection. When sampling population
controls from the catchment population of a hospital, one should realize that the
catchment population varies with the disease studied. For example, acute
appendicitis cases will originate from a much smaller area around the hospital
than childhood leukemia cases in that same hospital. If, however, the distribution
of the relevant characteristics in both catchment areas is similar, this has little
influence on the validity of the study.
Several methods other than sampling from population registries have been
proposed to efficiently draw population controls. Random digit dialing, where
a random telephone number (usually computer generated) is dialed, may be an
attractive option. It also allows for targeting a specific region using the telephone
area codes. Depending on the information required from the controls,
computerization in such an approach could go as far as using the computer to
pose the necessary multiple-choice questions and to store the respondents’
answers. The advantages of this approach are self-evident. The relatively low
response rate is a major disadvantage of this method, however, especially when a
potential participant is being interviewed by a computer. In addition, not all men
and women have a landline telephone, some only have a cellular telephone, and
many calls will remain unanswered. These phenomena are related to
socioeconomic status, employment, and health status. If these factors are studied as (or are related to) the determinant (or a confounder), the resulting differential non-response can lead to bias. Selective non-response may threaten any method applied to sample population controls, because the motivation of members of the population at large to be involved in clinical research is usually lower than that of, for example, hospital controls. Random digit dialing as a means to select population
controls has become less efficient now that many people mainly use mobile
phones, making it difficult to cover specific areas. An example of a case-control
study using population controls is given in Box 9–3. Controls were sampled by
means of random digit dialing [Fryzek et al., 2005]. Both cases and controls
were interviewed to obtain the required information.
BOX 9–3 A Case-Control Study Examining the Association of Body Mass Index with Pancreatic Cancer
Using Population Controls
Increased body mass index has emerged as a potential risk factor for pancreatic cancer. The authors
examined whether the association between body mass index and pancreatic cancer was modified by
gender, smoking, and diabetes in residents of southeastern Michigan, 1996–1999. A total of 231
patients with newly diagnosed adenocarcinoma of the exocrine pancreas were compared with 388
general population controls. In-person interviews were conducted to ascertain information on
demographic and lifestyle factors.
Unconditional logistic regression models estimated the association between body mass index and
pancreatic cancer. Males’ risk for pancreatic cancer significantly increased with increasing body mass
index (ptrend = 0.048), while no relation was found for women (ptrend = 0.37). Among nonsmokers,
those in the highest category of body mass index were 3.3 times (95% confidence interval: 1.2, 9.2)
more likely to have pancreatic cancer compared with those with low body mass index. In contrast, no
relation was found for smokers (ptrend = 0.94). While body mass index was not associated with
pancreatic cancer risk among insulin users (ptrend = 0.11), a significant increase in risk was seen in
non-insulin users (ptrend = 0.039). This well designed, population-based study offered further evidence
that increased body mass index is related to pancreatic cancer risk, especially for men and
nonsmokers. In addition, body mass index may play a role in the etiology of pancreatic cancer even in
the absence of diabetes.
Reproduced from Fryzek JP, Schenk M, Kinnaid M, Greenson JK, Garabrant DH. The association of body
mass index and pancreatic cancer in residents of southeastern Michigan, 1996–1999. Am J Epidemiol
2005;162:222–8, with permission from Elsevier.
The following quotation from this study illustrates the selection process
typical of population controls, although it should be emphasized that the
response rate among controls (76%) was relatively high. Of all eligible cases,
92% participated. “Of the 597 general population controls eligible for the study,
19 could not be reached by phone, one died before being contacted, and 27 were
not contacted because there was an overselection of controls under 45 years of
age early in the study period. The remaining 550 people were invited to
participate, and 420 (76 percent) agreed.”
Hospital Controls
The study presented in the last section also illustrates one of the advantages of using hospital controls in case-control studies: their willingness to participate. In general, the response rate among the diseased, and in particular among those admitted to the hospital, is higher than in the population at large. Moreover, selecting control subjects with an illness other than the case disease from the same hospital is efficient because the researcher is collecting similar data from the cases admitted to that hospital anyway. Since the introduction of the case-control method, hospital controls have been widely applied, and their popularity continues.
Disadvantages of hospital controls are, however, considerable. In particular,
the validity of the case-control study is threatened if the hospital controls are not
a representative sample from the study base that produces the cases. One could
think of many reasons why, in patients with an illness other than the case disease, the distribution of relevant characteristics (notably the determinant of interest and possible confounders or effect modifiers) would differ from that among the members of the study base. For example, smoking and other unhealthy habits,
overweight, comorbidity, and medication use generally will be more common in
those admitted to a hospital than in the “true” study base (i.e., the catchment area
population of that hospital for the case disease). A common (but incorrect)
approach to prevent bias when taking hospital controls is the use of multiple
control diseases. The rationale for such a “cocktail” of diseases is simple, if somewhat naïve: should one control disease lead to bias (e.g., because the
exposure to the determinant of interest in the control disease is higher than in the
true study base), this bias could be offset by other control diseases (of which
some may have a lower exposure than the study base). Alternatively, control
diseases known to be associated with the determinant of interest are often
excluded or patients visiting the emergency room are taken as controls. The
advantage of the latter control group is that the prevalence of comorbidity and
unhealthy habits may be lower than in other hospital controls.
However, these methods all contribute to the complexity of using hospital
controls. It is usually very difficult for the readers and the researchers alike to
judge whether the essential prerequisite of a case-control study—namely, that
the controls are a valid sample from the study base—has been met. Too often,
the researchers only mention the control disease(s) chosen without providing a
rationale and fail to discuss the potential drawbacks of this choice. They then
leave it up to the readers of their work to determine whether indeed the crucial
characteristics of the hospital controls are similar to those of the study base (i.e.,
the catchment area population for the case disease). We do not suggest a
moratorium on hospital controls, but there should be no doubt that the
responsibility of proving the validity of hospital control sampling lies with the
researcher and no one else. In their famous case-control study published more
than half a century ago, Doll and Hill [1950] took up this responsibility and
discussed the validity of their choice of hospital controls (see Box 9–4).
BOX 9–4 A Case-Control Study Using Hospital Controls: Smoking and Carcinoma of the Lung
An example of a case-control study using hospital controls is the famous paper on smoking and lung
cancer by Doll and Hill. The following excerpt from the original paper highlights the way the control
subjects were sampled:
“As well, however, as interviewing the notified patients with cancer of one of the specified sites, the
almoners were required to make similar inquiries of a group of “non-cancer control” patients. These
patients were not notified, but for each lung-carcinoma patient visited at a hospital, the almoners were
instructed to interview a patient of the same sex, within the same five-year age group and in the same
hospital at about the same time.”
The 709 control patients had various medical conditions, including gastrointestinal and cardiovascular
disease and respiratory disease other than cancer.
The authors fully recognized the importance of ensuring that the control patients were not selected
based on their smoking habits, and it is worth studying the additional data provided and reading their
arguments to convince the reader that:
“There is no evidence of any special bias in favour of light smokers in the selection of the control
series of patients. In other words, the group of patients interviewed forms, we believe, a satisfactory
control series for the lung-carcinoma patients from the point of view of comparison of smoking
habits.”
This study, although performed more than half a century ago, still exemplifies the potential advantage
of hospital controls and the way researchers should argue the validity of their control group.
Adapted from Doll R, Hill AB. Smoking and carcinoma of the lung. BMJ 1950;ii:739–48.
Neighborhood Controls
Selecting controls from the same neighborhood as the cases is an alternative to using population controls. Instead of taking a random sample
of the population at large (or when hospital cases are used, from the catchment
population), the researcher samples one or more individuals from the same
neighborhood as the corresponding case. Inclusion of neighborhood controls is
attractive for several reasons, but mostly because they, almost literally, seem to
originate from the same study base as the case and often the researcher is already
in the neighborhood collecting the necessary information from the cases.
Another often mentioned advantage is the homogeneity of the neighborhood
with regard to certain characteristics, including potential confounders such as
socioeconomic status.
The latter, however, also should be viewed as a potential disadvantage. Cases
and controls will be matched according to these characteristics. But matching in
case-control studies (as discussed in more detail later in this chapter) carries
important dangers, including the impossibility of studying these characteristics
as determinants. It would be unwise, for example, to sample neighborhood
controls in a case-control study quantifying the causal relationship of living near
high-voltage power lines with the occurrence of childhood cancer. Other
disadvantages of neighborhood controls are the relatively low response and the
time and costs involved, notably when the researcher needs to travel to the
neighborhood to select a neighboring household.
BOX 9–5 is an excerpt from the methods section of a case-control study
performed to identify lifestyle and other risk factors for thyroid cancer. It
describes the way neighborhood controls can be sampled and further illustrates
the enormous efforts sometimes involved [Mack et al., 2002].
One could argue that the control selection in this study was independent of the
risk factors studied (such as dietary habits) and that these controls may indeed
represent a valid sample from the study base that also produced the cases. It is
unfortunate, however, that the authors did not discuss their choice of control
group.
A single neighborhood control was sought for each interviewed patient. Using a procedure defining a
housing sequence on specified blocks in the neighborhood in which the patient lived at the time of her
thyroid cancer diagnosis, we attempted to interview the first female matching the case on race and
birth year (within five years). For each case, up to 80 housing units were visited and three return visits
made before failure to obtain a matched control was conceded. We obtained matched controls for 296
of the 302 cases. For 263 patients, the first eligible control agreed to participate. Three controls were
later found to be ineligible due to a prior thyroidectomy, and one control was younger than the
matched case was at diagnosis. Questionnaires on 292 case-control pairs were available for analysis.
The average interval between the case and matched control interview was 0.3 years.
Reproduced from Mack WJ, Preston-Martin S, Bernstein L, Qian D. Lifestyle and other risk factors for
thyroid cancer in Los Angeles County females. Ann Epidemiol 2002;12:395–401, reprinted with permission
from Elsevier.
TABLE 9–2 Case-Control Study Linking Smoking and Epithelioma of the Lip
Patients with Lip Epithelioma Patients Without Lip Epithelioma
Pipe smoking 421 (a) 190 (b)
No pipe smoking 116 (c) 310 (d)
Total 537 (a+c) 500 (b+d)
Data from: Broders AC. Squamous-cell epithelioma of the lip. A study of 537 cases. JAMA 1920;74:656–
64.
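Using the cell labels in Table 9–2, the exposure odds ratio equals (a × d)/(b × c). A one-line sketch with these counts:

```python
# Exposure odds ratio from the 2x2 table in Table 9-2: (a*d) / (b*c)
a, b, c, d = 421, 190, 116, 310
print(f"exposure odds ratio: {(a * d) / (b * c):.1f}")  # about 5.9
```

In other words, the odds of pipe smoking among the patients with lip epithelioma were roughly six times those among the patients without.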
The strength of the case-control method is that if indeed the controls are a
valid sample of the study base from which the cases originate, the exposure odds
ratio is by definition a valid estimate of the incidence rate ratio one would obtain
from a cohort study; that is, if one took a census approach. It can be shown that this is true irrespective of the frequency of the outcome of interest, and, thus, any
assumption about the rarity of the outcome is irrelevant.
Imagine a dynamic population, including in total N + N′ participants during
the entire study period. Note that because this is a dynamic population, the time
that a subject is part of the study base theoretically ranges from 1 second to the
full study period. Assuming, for simplicity, that exposure in a subject is constant,
N subjects are exposed to the determinant and N′ are not (see Table 9–3).
To calculate the association between the determinant and the outcome in this
dynamic population followed over time, incidence rates of the disease in those
with and without the determinant can be calculated. Taking an average follow-up
time (t) of the members in the study base, the incidence rate (or incidence density) of the outcome in those with the determinant equals a/(N × t), while the
incidence rate in the unexposed equals c/(N′ × t).
The incidence rate ratio can be calculated as (a/(N × t))/(c/(N′ × t)) or (a × N′ ×
t)/(c × N × t) or (a × N′)/(c × N).
The major findings of a case-control study conducted within this dynamic
population are summarized in Table 9–4.
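Although Table 9–4 gives the actual numbers, the key step can be sketched algebraically. Writing b and d for the numbers of exposed and unexposed controls, and assuming the controls are sampled in proportion to person-time, b/d estimates (N × t)/(N′ × t) = N/N′, so that the exposure odds ratio (a/c)/(b/d) = (a × d)/(b × c) estimates (a × N′)/(c × N), which is exactly the incidence rate ratio derived above. Note that no rare-disease assumption is needed anywhere in this argument.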
BOX 9–6 A Case-Cohort Study on the Causal Link Between Iron and the Risk of Coronary Heart Disease
Background: Epidemiological studies aimed at correlating coronary heart disease (CHD) with serum
ferritin levels have thus far yielded inconsistent results. We hypothesized that a labile iron component
associated with non-transferrin-bound iron (NTBI) that appears in individuals with overt or cryptic
iron overload might be more suitable for establishing correlations with CHD.
Methods and Results: We investigated the relation of NTBI, serum iron, transferrin saturation, and
serum ferritin with risk of CHD and acute myocardial infarction (AMI). The cohort used comprised a
population-based sample of 11,471 postmenopausal women aged 49 to 70 years at enrollment in 1993
to 1997. During a median follow-up of 4.3 years (quartile limits Q1 to Q3: 3.3 to 5.4), 185 CHD
events were identified, including 66 AMI events. We conducted a case-cohort study using all CHD
cases and a random sample from the baseline cohort (n = 1134). A weighted Cox proportional hazards
model was used to estimate hazard ratios for tertiles of iron variables in relation to CHD and AMI.
Adjusted hazard ratios of women in the highest NTBI tertile (range 0.38 to 3.51) compared with the
lowest (range −2.06 to −0.32) were 0.84 (95% confidence interval 0.61 to 1.16) for CHD and 0.47
(95% confidence interval 0.31 to 0.71) for AMI. The results were similar for serum iron, transferrin
saturation, and serum ferritin.
Conclusions: Our results show no excess risk of CHD or AMI within the highest NTBI tertile
compared with the lowest but rather seem to demonstrate a decreased risk. Additional studies are
warranted to confirm our findings.
Reproduced from Van der A DL, Marx JJ, Grobbee DE, Kamphuis MH, Georgiou NA, van Kats-Renaud
JH, Breuer W, Cabantchik ZI, Roest M, Voorbij HA, Van der Schouw YT. Non-transferrin-bound iron and
the risk of coronary heart disease in postmenopausal women. Circulation 2006;113:1942–9.
The following paragraph from the study of Van der A et al. describes the
rationale and methodology of this case-cohort study:
The case-cohort design consists of a subcohort randomly sampled from the full cohort at the
beginning of the study and a case sample that consists of all cases that are ascertained during
follow-up. With this sampling strategy, the subcohort may include incident cases of CHD that will
contribute person-time as controls until the moment they experience the event. We selected a
random sample of [almost equal to] 10% (n = 1134) from the baseline cohort to serve as the
subcohort. The advantage of this design is that it enables the performance of survival analyses
without the need to collect expensive laboratory data for the entire cohort.
The complexity of the data analysis is illustrated in the next few lines from the
same article:
To assess the relationship between the iron variables (i.e., NTBI, serum iron, transferrin saturation,
and serum ferritin) and heart disease, we used a Cox proportional hazards model with an estimation
procedure adapted for case-cohort designs. We used the unweighted method by Prentice, which is
incorporated in the macro ROBPHREG made by Barlow and Ichikawa. This macro is available at
http://lib.stat.cmu.edu/general/robphreg and can be implemented in the SAS statistical software
package version 8.2. It computes weighted estimates together with a robust standard error, from
which we calculated 95% confidence intervals.
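For readers who want to see the mechanics, a minimal sketch of a case-cohort analysis on synthetic data follows. It uses Barlow-type weights (a close relative of the Prentice method quoted above, in which non-case subcohort members are upweighted by the inverse of the sampling fraction), and the Python lifelines package stands in for the SAS macro mentioned in the quotation; all names and numbers are illustrative assumptions:

```python
# Case-cohort analysis with Barlow-type weights on a synthetic cohort.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 5000
iron = rng.normal(size=n)                         # a standardized iron marker
t_event = rng.exponential(scale=8.0 / np.exp(0.3 * iron))
event = (t_event < 5.0).astype(int)               # administrative censoring at 5 years
time = np.minimum(t_event, 5.0)

f = 0.10                                          # subcohort sampling fraction (~10%)
in_subcohort = rng.random(n) < f
keep = in_subcohort | (event == 1)                # random subcohort plus all cases

df = pd.DataFrame({"time": time[keep], "event": event[keep], "iron": iron[keep]})
# Barlow weights: non-case subcohort members stand in for the full cohort
# (weight 1/f); cases contribute with weight 1 at their event
df["w"] = np.where(df["event"] == 1, 1.0, 1.0 / f)

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event", weights_col="w",
        robust=True)                              # robust SEs account for the weighting
print(cph.summary[["coef", "se(coef)"]])
```

The expensive covariate (here, the iron marker) only has to be measured in the subcohort and in the cases, which is precisely the efficiency gain the authors describe.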
CASE-CROSSOVER STUDIES
The case-crossover study was introduced in 1991 by Maclure. A case-crossover
study bears some resemblance to a crossover randomized trial. In the latter, each
participant receives all (usually two) interventions and the order in which he or
she receives them in this experimental study is randomly allocated, with a short time between the two interventions allowing the effect of the first intervention to wear off. Assumptions underlying a crossover trial include the transient effect of
each intervention and that the first intervention does not exert an effect during
the time period the participant receives the second intervention (i.e., there is no
carryover effect).
In a case-crossover study, all participants experience periods of exposure as
well as periods of nonexposure to the determinant of interest. However, a case-
crossover study is nonexperimental and thus the order in which exposure or
nonexposure occurs is anything but random. In fact, exposure or nonexposure
may change multiple times in a participant during the study period. Importantly,
the previously mentioned prerequisites for crossover trials also pertain to case-
crossover studies: the exposure being transient and the lack of a carryover effect.
A case-crossover study is a case-control study because a sampling instead of a
census approach is taken. Instead of comparing cases with a sample from the study base, however, the exposure in the risk period preceding the outcome is compared with the “usual exposure” of the same case. The latter may be measured
by calculating the average exposure over a certain time period or measuring
exposure at a random point in time or specified period, for example, 48 hours
before the event. The types of transient determinants that have been evaluated in
case-crossover designs include coffee drinking, physical exertion, alcohol intake,
sexual activity, and cocaine consumption [Mittleman et al., 1993; Mittleman et
al., 1999]. In addition, a case-crossover design is an attractive option to identify
transient triggers of exacerbations in patients with chronic disease, such as
multiple sclerosis or migraine [Confavreux et al., 2001; Villeneuve et al., 2006].
Let us consider the example of a study aimed at quantifying the occurrence of
myocardial infarction as a function of strenuous physical exertion [Willich et al.,
1993]. In the article, both a typical case-control study and a case-crossover study
are presented. Both designs are shown in Figure 9–7.
FIGURE 9–7 Comparison of a case-crossover and a case-control study examining the causal link between
physical exertion and myocardial infarction.
Reproduced from Willich SN, Lewis M, Lowel H, Arntz HR, Schubert F, Schroder R. Physical exertion as
a trigger of acute myocardial infarction. Triggers and mechanisms of myocardial infarction study group. N
Engl J Med 1993;329:1684–90.
Time zero indicates the occurrence of the outcome in a member of the study
base. The determinant is defined as “being engaged in physical exertion one
hour before a certain point in time,” and for the cases this is the time of onset of
nonfatal myocardial infarction. In their case-control analysis, Willich et al.
compared the prevalence of strenuous physical exertion of cases in the risk
period with the prevalence in age-, sex-, and neighborhood-matched population
controls. The adjusted odds ratio resulting from this analysis was 2.1 (95% CI,
1.1–3.6). In their case-crossover analysis, the authors compared the exposure
during the risk period of the cases with their usual frequency of strenuous
exercise. The data were obtained by interviewing the participants. In the
analyses, the observed odds of strenuous exercise within the hour before the
onset of myocardial infarction and the expected odds (x:y) that the case would
have been engaged in exercise, based on the usual exercise frequency, were
calculated. The risk ratio was calculated as the ratio of the sum of y (i.e., the probability of usually not being engaged in exercise) over the cases who were exercising within 1 hour before the event to the sum of x (i.e., the probability of usually being engaged in exercise) over the cases who did not exercise within 1 hour of symptom onset. The risk ratio resulting from this approach was similar to the
case-control estimate: 2.1 (95% CI, 1.6–3.1).
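A minimal sketch of this calculation on hypothetical data, where x is a case's usual probability of being engaged in exertion during any given hour and y = 1 − x:

```python
# Case-crossover risk ratio: sum of y over cases exposed in the risk period,
# divided by the sum of x over cases not exposed in the risk period.
cases = [(True, 0.10)] * 2 + [(False, 0.03)] * 30  # (exposed_in_risk_period, x)

num = sum(1 - x for exposed, x in cases if exposed)    # sum of y, exposed cases
den = sum(x for exposed, x in cases if not exposed)    # sum of x, unexposed cases
print(f"case-crossover risk ratio: {num / den:.1f}")   # 1.8 / 0.9 = 2.0
```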
The major strength of a case-crossover design is the within-person
comparison, just as in crossover trials. The case and its matched control (who in
fact is the same person) will be matched according to characteristics that are
constant in a certain (usually short) time span (e.g., comorbidity, socioeconomic
status, gender). Because of this matching, these characteristics can never be
studied as a determinant of the outcome event, but a case-crossover study
usually focuses on one transient exposure only. The most important threat to
case-crossover studies is the possibility that the determinant exerts its effect well beyond the defined risk period. This “carryover” effect cannot always be ruled
out.
WORKED-OUT EXAMPLE
Anesthetic care in westernized societies is of high quality and is generally
considered safe. However, accidents with serious health consequences still occur, albeit very rarely. The Netherlands Society for Anaesthesiology decided to
estimate the incidence of serious morbidity and mortality during or following
anesthesia and study possible causal factors related to procedures and
organization with the goal of reducing risks further. Because of the rarity of the
event, large numbers of anesthetic procedures were needed for the study. This, in combination with the detailed information that had to be obtained, led to the decision to conduct a case-control study (see Box 9–7) [Arbous et al., 2005].
BOX 9–7 Impact of Anesthesia Management Characteristics on Severe Morbidity and Mortality
Reproduced from Arbous MS, Meursing AAE, van Kleef JW, de Lange JJ, Spoormans HHAJM, Touw P,
Werner FM, Grobbee DE. Impact of anesthesia management characteristics on severe morbidity and
mortality. Anesthesiology 2005;102:257–68.
Theoretical Design
The research question addressed was: “Which characteristics of anesthesia
management are causally related to 24-hour postoperative severe morbidity and
mortality?” This translates to the following occurrence relation: severe
postoperative morbidity and mortality as a function of factors related to
anesthesia management conditional on confounders. The domain was all patients
given anesthesia for surgery. The operational definition of the outcome was
coma or death during or within 24 hours of anesthesia administration. The
determinant and confounders were operationalized by recording all relevant
characteristics of anesthesia, hospital, and patients by means of a questionnaire
and by scrutinizing anesthesia and recovery forms.
The point made by this author is illustrated in the baseline table from the
original report, a section of which is shown in Table 9–6 [Arbous et al., 2005].
The observation of marked differences in risk between cases and controls is
correct, but the inference is erroneous [Arbous et al., 2006]. Cases and controls
should be inescapably different if cases are the ones who experience problems
and controls are randomly sampled from the remainder of the cohort. In
particular, they should be different in factors that reflect known mortality risks
such as age, ASA physical status, or urgency of the procedure. The question is
whether these prognostic factors are also related to characteristics of anesthetic
management.
Randomized Trials
INTRODUCTION
Trials are cohort studies in which allocation to the determinant is initiated by the
investigator. Moreover, in randomized trials the allocation is made at random by
some algorithm. Because the determinant is allocated with the purpose of
learning about its effect on the outcome, randomized trials are experiments. The
determinant that is allocated is typically a treatment such as a drug or another
intervention, for example, a surgical procedure or lifestyle advice intended to
provide relief, cure, or prevention of disease. In this chapter, the term treatment
will be used for all interventions studied in randomized trials.
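The text above does not prescribe a particular algorithm for random allocation; one widely used scheme is permuted-block randomization, sketched here with an arbitrary block size and arm labels:

```python
# Permuted-block randomization: treatments are allocated in randomly shuffled
# blocks so that the arms stay balanced after every completed block.
import random

def block_randomization(n_participants, block_size=4, arms=("treatment", "control")):
    assert block_size % len(arms) == 0
    allocations = []
    while len(allocations) < n_participants:
        block = list(arms) * (block_size // len(arms))
        random.shuffle(block)
        allocations.extend(block)
    return allocations[:n_participants]

print(block_randomization(10))
```

In practice, such a list is generated centrally and concealed from the recruiting physicians so that upcoming allocations cannot be predicted.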
Randomized trials have an important role in determining the efficacy and
safety of treatments. A trial can be viewed as a measurement of the effect of a
treatment. It should provide a quantitative and precise estimate of the benefits or
risks that can be expected when a treatment is given to patients with an
indication for it.
Randomized trials can be distinguished according to the phase of development
of a treatment. This distinction is most frequently applied in drug trials. Phase I
trials are usually carried out after satisfactory findings have been reported in
animal experiments. They primarily aim to determine the pharmacologic and
metabolic effects of the drug in humans, and to detect the most common side
effects. Study subjects in phase I trials usually are healthy volunteers who
typically undergo dose-escalation studies, first with single doses and later with
multiple ones, to identify the safe dosage range. Also in this phase, the effects of
the drug on physiologic measures may be determined, for example, on the
aggregation of platelets in studies of platelet inhibitors. Usually the number of
participants in a phase I trial is no more than 100.
In phase II trials, the new treatment is studied for the first time in the type of
patients for whom the treatment is intended. Emphasis is again on safety but also
on intermediate outcomes (see later discussion of types of outcomes) that
broaden insight into the pathophysiologic effects and possible benefits of the
treatment. Drug studies often test several doses in order to find the optimal dose
for a large-scale study. For example, a trial group sought to determine whether
and at what dose recombinant activated factor VII can reduce hematoma growth
after intracerebral hemorrhage [Mayer et al., 2005]. The investigators
randomized 399 patients with intracerebral hemorrhage within 3 hours of disease
onset to either a placebo or three different doses of the drug. The primary
outcome was the percent change in volume of the hemorrhage from admission to
24 hours. Clinical status was determined after 3 months as a secondary outcome.
In phase III trials, the treatments are brought to a “real-life” situation with
outcomes that are considered to be clinically relevant in patients who are
diagnosed with the indication for the treatment. Phase III trials are large (often
1,000 or more patients) and hence costly. Many of the practical aspects of clinical trials discussed in this chapter pertain specifically to phase III trials.
Phase IV trials, also termed postmarketing (surveillance) trials, may
concentrate on the study of rare side effects after a treatment has been allowed
access to the market. Phase IV trials can also be conducted to assess possibly
new, beneficial effects of registered drugs. Phase IV trials frequently are also
used for the promotion of a newly registered treatment, which is an
understandable approach from the perspective of the industry but less attractive
from a scientific point of view (these are referred to as seeding trials). There is
currently ample discussion on how to best monitor the total (both beneficial and
untoward) effects of a drug once it has entered the market. Sometimes,
conditional approvals are considered, where the pharmaceutical industry is
required to provide updated information on the effects of a drug during the first
period of real-life use. This could include the continuation of specifically
designed randomized comparisons to quantify side effects. However, there are
several other research approaches to address the study of side effects once a
treatment has come to the market.
When designing the data collection and organizational aspects of a clinical
trial, it is useful for the researcher to have conceptualized the structure of the
written manuscript about the study. A guideline on what to report and how to do
it was issued in 2001. This document, the Consolidated Standards of Reporting
Trials (CONSORT), has been revised and adopted as an obligatory format by
major medical journals and was most recently updated in 2010 [Moher et al.,
2001b; Moher et al., 2010]. The website of the CONSORT organization
(www.consort-statement.org) also provides several extensions of the statement,
including information about noninferiority trials.
However, even before a report on the trial results is written, or even before the
study has started, the International Committee of Medical Journal Editors
(ICMJE) currently requires all trials (including phase III trials) that assess
efficacy to be registered [De Angelis et al., 2005]. Registration must occur
before the first patient is enrolled and the registry must be electronically
searchable and accessible to the public at no charge. If no such registration is
created, the manuscript on the results of the trial will not be accepted for
publication by the journals that adhere to the ICMJE statement, which include all
major general medical journals. The rationale for a trial registry lies in the responsibility of investigators to present the design of the study and to give an account of the results of the trial, irrespective of the nature of the findings. In the
past, too often the design features of a trial were changed during the study or so-
called negative trials were not published, leaving the international scientific
community with mainly the positive trials, thus creating publication bias.
FIGURE 10–2 Confidence intervals and noninferiority (NI) interpretation of the treatment difference
between a test drug and an active comparator drug. The dashed vertical line represents the NI margin, the
solid vertical line is the point-of-no-difference line, and the horizontal lines represent the confidence
intervals. The point-of-no-difference is the point at which the estimated treatment difference between the
new drug and comparator is neutral: zero for a difference in outcome or one for a ratio. Studies A, B, and C show that the new drug is noninferior to its comparator, whereas noninferiority is not shown for studies D, E, and F.
Reproduced from Wangge G, Klungel OH, Roes KC, de Boer A, Hoes AW, Knol MJ. Interpretation and
inference in noninferiority randomized controlled trials in drug research. Clin Pharmacol Ther
2010;88:420–3.
Reproduced from Moher D, Hopewell S, Schulz KF, Montori V, Gotzsche PC, Devereaux PJ, Elbourne D,
Egger M, Altman DG. CONSORT 2010 explanation and elaboration: updated guidelines for reporting
parallel group randomised trials. BMJ. 2010;340:c869.
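The decision rule the figure illustrates can be sketched in a few lines, assuming the treatment difference is expressed as new minus comparator on a scale where lower values favor the comparator; the margin and interval values are hypothetical:

```python
# Noninferiority check: the new drug is declared noninferior when the entire
# confidence interval for the treatment difference lies above the margin.
def noninferior(ci_lower, ni_margin):
    return ci_lower > ni_margin

print(noninferior(ci_lower=-0.01, ni_margin=-0.03))  # True: noninferiority shown
print(noninferior(ci_lower=-0.05, ni_margin=-0.03))  # False: not shown
```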
PARTICIPANTS
Trials are conducted to measure the benefits and risks of treatment in particular
groups of patients. The study population in a trial should reflect these future
patients in relevant aspects. The first step, therefore, is to define clearly to which
future patients the findings of the trial should apply; this is referred to as the
domain. The domain determines the generalizability of the trial findings,
sometimes also called the external validity, of the trial. The more immediately
the results of interventions need to be implemented in clinical practice, the more
closely a trial population needs to resemble the population for whom the
treatment is intended. Consequently, a phase I trial may well be conducted in
healthy volunteers, but a phase III trial, just before registration, should be
performed in patients who are very similar to the patients to whom the drug will
be marketed. First and foremost, the domain of a phase III trial is defined by the
presence of a treatment indication and the absence of known contraindications.
Domain characteristics are operationalized by specifying eligibility criteria.
Typical selection criteria for a study population in a trial may relate to age, sex,
clinical diagnosis, and comorbid conditions; exclusion criteria are often used to
ensure patient safety. Eligibility criteria should be explicitly defined. The
conventional distinction between inclusion and exclusion criteria is unnecessary;
the same criterion can be phrased to include or exclude participants [Moher et
al., 2010]. There are many additional characteristics of the population eventually
included in a trial that may further restrict the domain and thus affect
generalizability. Examples are the setting of the trial (country, healthcare system,
primary vs. tertiary care), run-in periods of trial medication, and stage of the
disease [Rothwell, 2005].
The CONSORT statement recommends using a diagram to delineate the flow
of patients through the trial (see Figure 10–3) [Moher et al., 2010]. Its upper
part describes the enrollment of patients in the trial and their subsequent
allocation to the trial treatments. In fact, this part still could be expanded with
the stages that precede the actual randomization, for example, identification of
affected patients in primary care, referral to secondary care (typically a hospital
that participates in the trial), under care of a physician taking part in the trial,
meeting the eligibility criteria, and giving informed consent [Rothwell, 2005].
Figure 10–4 shows the patient flow in the ASPECT-2 trial [Van Es et al., 2002].
INFORMED CONSENT
An essential part of the randomization process is the step that precedes the actual
randomization: the discussion with the patient or his or her family about
participation in the trial. Ideally, this discussion is led by a physician who is not
the treating physician in order to avoid a conflict of interest. The potential
benefits and harms of the study treatments need to be explained, as well as all
practicalities of the trial, including the fact that the patient will be randomized.
All information also should be given in a patient information document. In trials
with nonacute treatments, the patient should have some time to decide about
participating, and only after written informed consent has been obtained will the
patient be randomized.
BLINDING
The need to blind patients and doctors to the actual treatment given depends on
the type of research question (pragmatic or explanatory) and the trial’s primary
type of outcome event (hard or soft). If the trial has an explanatory nature, there should be full comparability of extraneous effects and, preferably, extraneous effects should be eliminated: a placebo is required, which implies that treatment
needs to be given in a blinded fashion. If, however, a pragmatic design is
preferred, the need for blinding depends on the type of outcome event and, here,
comparability of observer effects is considered. If an objective measure is
chosen, such as death, blinding is not mandatory. If quality of life is the primary
outcome, blinding is definitely needed because of the subjective nature of this
outcome. In an open trial, outcome assessment can still be blinded by using an
independent assessor who does not know which study treatment has been given.
For example, records on potential outcome events may be sent to a central trial
office where all information on treatment allocation is removed. The blinded
outcome data are then classified by members of an adjudication committee
[Algra & van Gijn, 1994].
Placebos should be made such that they cannot be distinguished from the
active treatment. They should be similar in appearance and, in the event of oral
administration, taste the same. Even with capsules that are meant to be
swallowed at once, one should be careful, as “de-blinding” has been reported
when patients first bit the capsule and then tasted its content. Even with the most
careful preparation of placebos, the effects or side effects of the treatment may
give the allocation code away. For example, the effect on the need to urinate of a
diuretic drug may be so obvious that this cannot be concealed from the patient.
When a trial aims to assess patients’ perception of outcomes, blinding may be complicated. To solve this problem in a study of an outreach nursing care program for patients discharged home after stroke, which measured self-reported quality of life and satisfaction, investigators developed a modified consent procedure: patients were asked to consent to the collection of follow-up data and were told that detailed information about the study would be provided at its end [Boter et al., 2003]. In this way, two problems related to incomparability of observations could be avoided. First, patients allocated to usual care (i.e., no
outreach program) might be dissatisfied because they did not receive the active
intervention. Second, patients allocated to the outreach program would not feel
obliged to answer more positively than they really felt because of loyalty to the
staff providing the intervention. An alternative solution might be to use so-called
prerandomization [Zelen, 1979]. Patients fulfilling the eligibility criteria of the
trial are randomized before consent is sought. Subsequently, only those patients
allocated to the intervention group are asked for informed consent. This design
also avoids incomparability of observations; however, it comes at the price of the drop-out of nonconsenters from the intervention group and hence compromises the comparability of the patients who receive the intervention and those who do not. This design was used in a trial on risk factor reduction in patients with
symptomatic vascular disease [Goessens, 2006]. Patients were prerandomized to
receive treatment by a nurse practitioner plus usual care versus usual care alone.
OUTCOME
The choice of a particular outcome, its definition, and measurement completely
depend on the goal of the trial. If, for example, the researcher wants an answer that has immediate relevance for clinical practice, a different outcome may be chosen than when the primary aim is to show that an intervention exerts the anticipated pathophysiologic effect. In phase II trials, the emphasis is on safety
and pathophysiology. In the example of recombinant activated factor VII, the
primary outcome was the percent change in volume of the hemorrhage from
admission to 24 hours, which is important for a “proof of concept” but less
relevant from the perspective of a patient. In phase III clinical trials with a
primary explanatory design, pathophysiology driven or clinical outcomes may be
chosen, whereas in pragmatic trials, investigators tend to concentrate particularly
on those outcomes that are most relevant for patients.
Sometimes investigators disagree on what they deem important for patients.
For example, a recent debate addressed the question of whether in stroke
prevention studies one should take only strokes as outcome [Albers, 2000] or use
all vascular events because of the atherosclerotic nature of cerebrovascular
disease [Algra & van Gijn, 2000]. The latter outcome is a so-called composite
outcome because it consists of several contributing outcomes (in this example,
death due to vascular diseases, nonfatal stroke, and nonfatal myocardial
infarction). The composite outcome is reached as soon as one of the contributing
outcomes has occurred.
Phase II and initial phase III trials often use intermediate (or surrogate)
outcomes; that is, outcomes that, on the basis of pathophysiologic reasoning, precede the occurrence of the clinically relevant outcome event. The validity
of an intermediate outcome as a proxy for the real outcome relies heavily on the
extent to which the intermediate outcome truly reflects the risk of the outcome of
interest. For example, ventricular arrhythmias were chosen as an intermediate
outcome for sudden death in patients with cardiac disease. In the early
assessment of the effects of anti-arrhythmic drugs, the reduction of the number
of ventricular premature complexes on a 24-hour electrocardiogram from
baseline to follow-up was used. With this outcome, several anti-arrhythmic
drugs appeared promising. However, these promising effects were completely
negated in a phase III trial that used the final outcome of sudden death [CAST
Investigators, 1989]. The anti-arrhythmic drugs in fact proved to be dangerous!
Clearly, one should always be careful in accepting findings from trials with an
intermediate outcome as proof of the effect on the outcome of interest.
Still, a major advantage of the use of an intermediate outcome is that it may
produce results sooner because these outcomes occur more frequently or are
continuous rather than dichotomous variables. Moreover, an intermediate
outcome may effectively be used to establish the effect of a treatment by a
presumed pathophysiologic pathway and thus may demonstrate the primary
mode of action. Sometimes the effect of the intermediate outcome on disease is assumed to be so clear that the measure itself suffices as an indicator of treatment effect, as, for example, with blood pressure–lowering drugs. Although the clinically relevant outcome in trials of antihypertensive drugs would be the incidence of cardiovascular events, phase III trials typically use blood pressure level as the intermediate outcome, and blood pressure is accepted as a surrogate for cardiovascular events by regulatory agencies such as the FDA. A well-established example of a proxy measure that is generally accepted
as a continuous measure of atherosclerotic vascular disease is the thickness of
the combined intima and media of the carotid arteries (see Figure 10–5) [Bots et
al., 1997]. When continuous outcome measures are used, such as carotid wall
thickness or blood pressure, it is possible to increase precision by taking the
mean of multiple measurements, thus reducing measurement error.
FIGURE 10–5 Measurement of the thickness of the combined intima and media of the carotid arteries.
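The gain in precision from averaging follows directly from the standard error of a mean: for k independent measurements with standard deviation σ, the standard error of their mean is σ/√k, so taking the mean of, for example, four readings halves the measurement error.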
Because the confidence intervals are wide, the data in this example are not
sufficiently precise to infer that new treatment A is better than old treatment B;
the trial was too small. Thus, before one embarks on a trial, a sample size
calculation needs to be done. With a fairly simple formula one can calculate the
number of participants required. Advanced methods for calculating the power of
a study and the required sample size may seem attractive, but the numbers that
follow from any calculation are highly dependent upon the assumptions that are
being made. By definition, the researcher is uncertain, and to some extent subjective, about the
size of the expected treatment effect. Here, not only the plausible size but also
the clinical relevance of this estimate matters.
A parameter that one needs to estimate or assume is the percentage of
outcome events in the patients who receive standard treatment (denoted as p0),
which is 15% in the given example. This is also called the background rate. The
expected percentage in the treated group (p1) would be 13%. The sample size per
treatment group needed would then be:
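A standard version of this formula, for comparing two proportions, is sketched below with the example numbers; the two-sided 5% significance level and 80% power are illustrative assumptions rather than values stated above:

```python
# Sample size per treatment group for comparing two proportions p0 and p1.
from statistics import NormalDist

def n_per_group(p0, p1, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for a two-sided 5% test
    z_b = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p0 + p1) / 2
    root1 = (2 * p_bar * (1 - p_bar)) ** 0.5
    root2 = (p0 * (1 - p0) + p1 * (1 - p1)) ** 0.5
    return int((z_a * root1 + z_b * root2) ** 2 / (p0 - p1) ** 2) + 1

print(n_per_group(0.15, 0.13))                  # roughly 4,700 per group
```

The small expected difference (15% versus 13%) drives the large required sample size: the number of participants needed grows with the inverse square of the difference p0 − p1.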
Meta-Analyses
INTRODUCTION
The decision to apply findings from research to clinical practice is rarely based
on a single study. Trust in the validity of research findings grows after results are
replicated by similar studies in different settings. Moreover, the results of a
single study are often not sufficiently precise and thus leave room for doubt
about the exact magnitude of the association between the determinant(s) and
outcome(s) of interest, such as, for example, the effects of a certain treatment.
This is particularly important when the magnitude of the expected benefits of an
intervention must be balanced against the possible risks. For this purpose, the
evidence that a treatment works may be valid but too imprecise or too general.
What works in a high-risk patient may be counterproductive in a low-risk patient
because the balance between benefits and risks differs. The contribution that
meta-analysis can make is to summarize the findings from several relevant
studies and improve the precision of the estimate of the treatment effect, thereby
increasing confidence in the true effect of a treatment.
Meta-analysis is a method of locating, appraising, and summarizing similar
studies; assessing similar determinants and comparable outcomes in similar
populations; and synthesizing their results into a single quantitative estimate of
associations or effect. The magnitude of the “average” association between the
determinant and outcome can be used in decisions in clinical practice or in
making healthcare policy. Meta-analysis may reduce or resolve uncertainty when
individual studies provide conflicting results, which often leads to disagreement
in traditional (narrative) reviews.
Traditional reviews typically only offer a qualitative assessment of the kind,
“This treatment seems to work and appears to be safe.” In addition to providing
a quantitative effect estimate across studies, meta-analysis uses a transparent
approach to the retrieval of evidence from all relevant studies, employs explicit
methods aimed at reducing bias, and uses formal statistical methods to
synthesize evidence. Unless individual patient data from the studies included are
available, a meta-analysis treats the summary result of each study (e.g., the
number of events and the number of patients randomized by treatment group) as
a unit of information.
Meta-analysis originated in psychological research and was introduced in
medicine around 1980. With the rapid adoption of evidence-based medicine and
the increasing emphasis on the use of quantitative evidence as a basis for patient
management, meta-analysis has become popular. Today, meta-analysis has an
indispensable role in medicine, in general, and in clinical epidemiologic research
in particular.
This chapter introduces the design and methods of meta-analysis aimed at
summarizing the results from randomized trials comparing an intervention arm
to a control arm. Meta-analysis of etiologic, diagnostic, and prognostic studies is
increasingly common, but it is beyond the scope of this chapter.
RATIONALE
Meta-analysis helps to answer questions such as these: “What is the best
treatment for this patient?” “How large is the expected effect?” “How sure are
we about the magnitude of this effect?” Definite answers are rarely provided by
the results of a single study and are difficult to give when several studies have
produced results that seem conflicting. Traditionally, decisions about the
preferred treatment for a disease or health condition have largely relied on expert
opinion and narrative reviews in medical textbooks. These may be based on a
biased selection of only part of the evidence, frequently only the largest studies,
studies with “positive” results (i.e., those reporting P values less than 0.05), or—
even worse—only studies with results that support the expert’s personal opinion.
Clearly, such studies are not necessarily the most valid. Moreover, due to the
rapid accumulation of evidence from clinical research, expert opinion and
medical textbooks can quickly become outdated.
Access to up-to-date evidence on treatment effects is needed to make
informed decisions about patient management and health policy. For instance,
several authors have shown convincingly that medical textbooks lag behind
medical journals in presenting the evidence for important treatments in
cardiology [Antman et al., 1992; Lau et al., 1992]. Often, investigators perform a
meta-analysis before starting a new study. From studying previous trials, they
learn which questions remain unanswered, what pitfalls exist in the design and
conduct of the anticipated research, and which common errors must be avoided.
Meta-analyses may provide valuable assistance in deciding on the best and most
relevant research questions and in improving the design of new clinical studies.
In addition, the results of meta-analyses are increasingly being incorporated into
clinical guidelines.
An example of the value of meta-analysis is the research on the putative
benefits of minimally invasive coronary artery bypass surgery. Minimally
invasive coronary bypass surgery is a type of surgery on the beating heart that
uses a number of technologies and procedures without the need for a heart–lung
machine. After the introduction of this procedure, the results of the first
randomized trial were published in 1995 [Vural et al., 1995]; four years later the
initial results of a second randomized trial were published [Angelini et al.,
2002]. Subsequently, 12 trials were published up to January 2003, 12 more trials
were published between January 1 and December 31, 2003, and another 10 were
published in the first 4 months of 2004 [Van der Heijden et al., 2004]. Meta-
analysis is extremely helpful in summarizing the evidence provided by studies
conducted in this field. In particular, it may support timely decisions about the
need for more evidence and prevent the conduct of additional trials when precise
effect estimates are available.
PRINCIPLES
The direction and size of the estimate of a treatment effect observed in a trial,
commonly expressed as a ratio of, or a difference between, two measures of
occurrence, indicates the strength of the effect of an index treatment relative to
that of a reference treatment. The validity of the estimate of the treatment effect
depends on the quality of the study. In research on treatment effects, validity
depends in particular on the use of randomization to achieve comparability with
regard to the initial prognostic status, and potentially the use of blinding and
placebo to achieve comparability of extraneous effects and observations. In
addition, the validity of the observed treatment effect depends on the
completeness of follow-up data and whether the data were analyzed correctly.
The precision of an estimate of a treatment effect from a study is reflected in
the confidence interval (CI) of the effect estimate. This denotes the probabilistic
boundaries for the true effect of a treatment. That is, if a study were repeated
again and again, the 95% CI would contain the true effect in 95% of the
repetitions. The width of the confidence interval is determined by the number of
the outcome events of interest during the period of follow-up observation, which
in turn depends on the sample size, the risk or rate of the outcome of interest in
the trial population, and the duration of follow-up. In general, a large study with
many events yields a result with a narrow confidence interval. Inconsistent
results of multiple randomized trials lead to uncertainty regarding the effect of a
treatment. Contradictory results, such as a different magnitude or even a
different direction of the effect, may be reported by different trials. In addition,
some trials may be inconclusive, for example, when the point estimate of effect
clearly deviates from “no effect” even though its confidence interval includes
“no effect.” Uncertainty about the true treatment effect can be overcome by
combining the results of trials through meta-analysis.
It should be emphasized, however, that differences in findings between studies
may be the result of factors other than a lack of precision. Diversity in the way
trials are conducted and in the type of study populations may lead to different
trial results. To maintain validity when different studies are combined in a meta-
analysis, aggregation of data is usually restricted to trials considered combinable
with respect to patients, treatments, endpoints, and measures of effect. To ensure
adequate selection of trials, their designs need to be systematically reviewed and
they must be grouped according to their similarity. Contradictory results may
also reflect problems in the study design or data analysis that may have biased
the findings of some trials. Because the results of meta-analyses cannot be
trusted when flawed trials are included, it is important to make an explicit effort
to limit such bias. Hence, the study design needs to be critically appraised with
regard to the randomization procedure and concealment of treatment allocation,
blinding of outcome assessments, deviation from the allocation scheme,
contamination of the treatment contrast (e.g., unequal provision of care apart
from the allocated treatment), and completeness of follow-up, as well as the
statistical analysis.
Small trials often lack statistical power. In a meta-analysis, statistical power is
enhanced by pooling data abstracted from original trial publications to determine
a single combined effect estimate, using statistical methods that have specifically
been developed for this purpose. Many such methods exist, and their
appropriateness depends on underlying assumptions and practical considerations.
Unfortunately, quite often the possibilities for pooling are restricted by poor data
reporting of individual studies.
Adherence to fundamental design principles of meta-analyses can prevent
misleading results and conclusions. These should be articulated in a protocol to
be used as a reference in conducting the meta-analysis and writing the methods
section of the report. Guidelines and manuals for writing a protocol for meta-
analyses are available [Higgins, 2006; Khan et al., 2003] (see Box 11–1).
BOX 11–1 Internet Resources for Writing a Protocol for Meta-Analysis (accessed May 7, 2013)
The Cochrane Handbook for Systematic Reviews of Interventions, from the Cochrane Collaboration:
http://www.cochrane.org/training/cochrane-handbook
Systematic Reviews: CRD’s guidance for undertaking systematic reviews in health care, from the
Centre for Reviews and Dissemination, University of York, UK:
http://www.york.ac.uk/inst/crd/report4.htm
As for clinical epidemiologic studies in general, the design of a meta-analysis involves:
1. The theoretical design of the research question, including the specification
of the determinant–outcome relation of interest
2. The design of data collection, comprising the retrieval of publications, the
selection and critical appraisal of trials, and the data extraction
3. The design of data analysis and the reporting of the results
THEORETICAL DESIGN
As in any research, a meta-analysis should start with a clear, relevant, and
unambiguous research question. The design of the occurrence relation includes
three components: (1) the determinant contrast (typically, the treatments or
exposures compared), (2) the outcome of interest, and (3) the domain. All need
to be explicitly defined to frame the search and selection strategy for eligible
trial publications. By using unambiguous definitions of these components, the
scope and objective of the meta-analysis are narrowed. This directly impacts the
applicability of the results.
To illustrate, there are similarities between the following questions: “What is
the effect of intermittent lumbar traction on the severity of pain in patients with
low back pain and sciatica?” and “What is the effect of spinal traction on the
recovery of patients with back pain?” [Clarke et al., 2006]. Despite the
similarities, these questions have a completely different scope that would result
in different criteria for selection of trials and subsequently different estimates of
treatment effect and applicability of findings. Due to its more detailed wording,
the first question may provide a more informative summary of the evidence for a
particular type of patient management, while the more general wording of the
domain, determinant, and outcome in the second question may serve public
health policy more generally. Although it is not the primary objective of a meta-
analysis to formulate recommendations for patient management, but rather to
quantitatively summarize the evidence on a particular mode of treatment, meta-
analyses are often used in the development of clinical guidelines.
Just as in the design of any epidemiologic study, it is necessary to carefully
decide on the domain, that is, the type of patients or subjects to whom the results
of the meta-analysis will apply. Definition of the domain determines how the
study populations to be considered will be collected and thus assists in obtaining
relevant summaries of evidence from published trials.
Bibliographic Databases
For a comprehensive search, several medically oriented electronic bibliographic
databases are available. These include:
PubMed (National Library of Medicine and National Institutes of Health) [Dickersin et al., 1985; Gallagher et al., 1990]
EMBASE (Elsevier, Inc.) [Haynes et al., 2005; Wong et al., 2006a]
Web of Science (Thomson Scientific)
PsycINFO (American Psychological Association) [Watson & Richardson, 1999a; Watson & Richardson, 1999b]
CINAHL (Cumulative Index to Nursing and Allied Health Literature, EBSCO Industries) [Wong et al., 2006b]
LILACS (Literatura Latino-Americana e do Caribe em Ciências da Saúde) [Clark, 2002]
Cochrane Database of Randomized Trials (Wiley Interscience)
Search Filters
Search filters are command syntax strings in the database language for retrieving
relevant records. Most electronic bibliographic databases provide indexing
services and search facilities, which make it easy to create and use search filters.
For every research question, a reproducible subject-specific search filter must be
defined. There is no standard for building a subject-specific search filter, and
they need to be customized for each database. The art of building a subject-
specific search filter comes down to reducing the “number-needed-to-read” to
find a single pertinent record for an original trial publication [Bachmann et al.,
2002].
Clinical Queries
PubMed includes clinical queries; these can be found in the blue sidebar on the
PubMed home page. The therapy query, using the Boolean operator “AND,” can
be combined with the constructed subject-specific search filter in order to retain
records about treatment effects and type of study while reducing the search yield
to a more manageable number of records.
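
To make this concrete, the sketch below shows in Python how a subject-specific filter can be combined with a simplified therapy filter using the Boolean operator “AND.” It assumes Biopython’s Entrez module is installed; the subject terms (echoing the lumbar traction example above) and the therapy string are illustrative stand-ins, not PubMed’s official clinical query syntax.

# A minimal sketch: combining a subject-specific search filter with a
# simplified therapy methods filter via the Boolean operator AND.
# Requires Biopython; the terms below are illustrative only.
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI requests a contact address

# Subject-specific filter: synonyms for domain and determinant, combined with OR
subject = "(low back pain[tiab] OR sciatica[tiab]) AND traction[tiab]"

# Simplified stand-in for a broad therapy methods filter
therapy = "(randomized controlled trial[pt] OR randomized[tiab] OR placebo[tiab])"

query = f"({subject}) AND {therapy}"

handle = Entrez.esearch(db="pubmed", term=query, retmax=20)
record = Entrez.read(handle)
handle.close()

print("Records found:", record["Count"])
print("First PMIDs:", record["IdList"])

Scripting the query rather than typing it into the web interface makes the filter reproducible, which is precisely what a meta-analysis protocol requires.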
Several other methods filters that allow searching for a type of study are
available for different bibliographic databases [Watson et al., 1999b; Wong et
al., 2006a; Wong et al., 2006b; Zhang et al., 2006]. Some of these have been
tested intensively [Jadad & McQuay, 1993; Montori et al., 2005; Shojania &
Bero, 2001; Wilczynski et al., 1994; Wilczynski & Haynes, 2002; Wilczynski et
al., 2005], but none are perfect, and often certain relevant articles will be missed.
The added value of methods filters, in terms of accuracy of their yield, may
depend on the medical field or research question of interest [Sampson et al.,
2006a]. For PubMed clinical queries, a broad (i.e., sensitive or inclusive) or a
narrow (i.e., specific or restrictive) prespecified search methodology filter is
available. While a broad methods search filter is more comprehensive, the
number-needed-to-read will always be higher. With a narrow methods filter, the
number of records retrieved will be smaller, but the likelihood of excluding
pertinent records is higher. Therefore, using narrow methods filters in the
context of meta-analyses is not advised.
Complementary Searches
Publications are not always properly included or indexed in electronic
bibliographic databases. Sometimes, relevant studies identified by other means
turn out to be included in electronic bibliographic databases but are
inappropriately indexed because of changes in the thesaurus, for example.
Therefore, searching for lateral references is always necessary to supplement
initial retrieval of relevant publications and to optimize a search filter.
Additional relevant publications can be found by screening the reference lists
of available systematic reviews, meta-analyses, expert reviews, and editorials on
your topic, for publications not retrieved by your search filter. Web of Science,
the bibliographic database of the Institute for Scientific Information, facilitates
such cross-reference searching by providing links to publications cited in the
identified paper and links to publications citing the identified paper. PubMed
facilitates such cross-reference searching by providing a link to related articles.
It is advisable to use cross-reference searching for all pertinent records selected
by the initial search and to use the Boolean operator “OR” to combine them all.
To avoid duplication of work, records already retrieved by the initial search filter
can be excluded by combining an additional filter for the collection of related
articles and the initial search filter using the Boolean operator “NOT.” Then, the
remainder of the related articles is screened for relevant additional records.
When cross-reference searching yields additional relevant publications, these
should be scrutinized for new relevant search terms related to the domain,
determinants, and outcomes in the title and abstract. These should always be
added to update the initial subject-specific search filter. Again, the Boolean
operator “NOT” should be used to exclude the records already retrieved by the
initial search filter (plus the combined related articles). Then the remaining
records are screened for other additional relevant records and new search terms.
Thus, building a subject-specific search filter becomes a systematic iterative
process. Still, the total number of original studies published on a topic of a
particular meta-analysis always remains unknown. Therefore, it may be useful to
write to experts, researchers, and authors, including a list of the retrieved trial
publications, and ask them to add studies not yet on the list.
Most electronic bibliographic databases only include citations for studies
published as full-text articles. To retrieve studies without full publication it is
useful to write to researchers, authors, and experts for preliminary reports, and
search in Web of Science or on the Internet (e.g., websites of conferences and
professional societies) for abstracts of meetings and conference proceedings. The
recently initiated registries for clinical trials [Couser et al., 2005; De Angelis et
al., 2004] promise a better view on all studies started, some of which may never
be published in full (see Box 11–2). Some authors have suggested that journal
hand searching, which is a manual page by page examination of contents of
relevant journals, may reveal additional relevant publications [Hopewell et al.,
2002; Jefferson & Jefferson, 1996; McDonald et al., 2002; Sampson et al.,
2006b]. In addition, Internet search engines, in particular, Google Scholar
(http://scholar.google.com), may prove useful in the retrieval of citations
[Eysenbach et al., 2001] and, in particular, full-text articles that somehow have
not made it to other databases.
BOX 11–2 Internet Resources for Trial Registries (accessed May 17, 2013)
The International Standard Randomized Controlled Trial Number Registry, BioMed Central:
http://www.controlled-trials.com
BOX 11–3 Internet Resources for Bibliographic and Citation Management Software Programs
Avoiding Bias
Retrieval and selection of original studies should be based on a comprehensive
search and explicit selection criteria. Relevant publications can be easily missed
by a not fully comprehensive or even flawed retrieval and selection procedure.
Selection of studies must be based on criteria related to study design, rather than
on results or a study’s purported appropriateness and relevance. Holes in a
methodology filter as well as searching in a limited number of bibliographic
databases may lead to serious omissions. When a search is not comprehensive or
selection is flawed, the results of the meta-analysis may be biased; this type of
bias is known as retrieval and reviewer bias.
To prevent reviewer bias, the selection of material should preferably be based
on the consensus of at least two independent researchers [Edwards et al., 2002;
Jadad et al., 1996; Moher et al., 1999a]. Still, any comprehensive strategy for the
retrieval and selection of relevant original studies can be frustrated by flaws in
the reporting of individual trials [Sutton et al., 2002].
Trials with positive and significant results are more likely to be reported and
are published faster, particularly when they are published in English (i.e.,
publication bias) [Jüni et al., 2002; Sterne et al., 2002]. Furthermore, such
positive trials are cited more often (i.e., citation bias), which makes them easier
to locate, so only a comprehensive search can prevent such retrieval bias
[Ravnskov, 1992]. Multiple reporting of a single trial (for example, separate
reporting of initial and final results, different follow-up times or endpoints in
subsequent publications) and preferential reporting of positive results cause
dissemination bias that may be difficult to detect. There is no complete remedy
against these types of bias in the reporting and dissemination of trial results.
Omission of pertinent studies and inclusion of multiple publications may
change the results of a meta-analysis dramatically [Simes, 1987; Stern & Simes,
1997]. For example, from around 2,000 eligible titles that were retrieved in a
meta-analysis assessing the effect of off-pump coronary surgery, only 66
publications remained after exclusion of publications of nonrandomized trials
and randomized trials with another treatment comparison or endpoint. After
assessing the 66 reports, seven conference abstracts of trials not fully published
were further excluded. There were 17 duplicate publications relating to three
trials, leaving only 42 full trial publications for further analysis [Van der Heijden
et al., 2004].
Before critically appraising the studies selected for inclusion, it is important to
ensure that errata that were published later have been traced, as these may
contain errors in the data initially reported. It is also recommended to ensure that
design papers (available for many large multicenter trials) have been traced and
are available for critical appraisal together with the results report(s). One may
encounter a report that is based on a smaller number of subjects than was
planned according to the design paper. This may suggest publication bias, unless
the reasons for this are explained in the results report.
CRITICAL APPRAISAL
Randomized trials are the cornerstone of evaluation of treatment effects. They
frequently offer the best possibility for valid and precise effect estimations, but
many aspects of their design and conduct require careful handling for their
results to be valid. Hence, critical appraisal of all elements of a study design is
an important part of meta-analysis. Critical appraisal concentrates on aspects of
a study design that impact the validity of the study, notably randomization
techniques and concealment of treatment allocation, blinded endpoint
assessment, adherence to the allocation scheme, contamination of treatment
contrast, postrandomization attrition, and statistical techniques applied. This
requires information regarding inclusion and exclusion criteria, treatment
regimens, and mode of administration, as well as the type of endpoints, their
measurement scale, and the duration of follow-up and the time points of follow-
up assessments. Each aspect of the study design needs to be documented on a
predesigned critical appraisal checklist to decide whether the publication
provides sufficient information and, if so, whether the applied methods were
adequate and bias is considered likely or not. Based on this critical appraisal,
studies can be grouped by the type and number of design flaws, as well as by the
level of omitted information. Accordingly, decisions about which studies are
combinable in a pooled or a stratified analysis can be made.
Although the requirements for reporting the methods of clinical trials are well
defined and have been generally accepted [CONSORT, 2010; Chalmers et al.,
1987a, 1987b; Moher et al., 2005; Plint et al., 2006], information on important
design features cannot be found in the published report of many studies. For
example, only 11 of 42 trials comparing coronary bypass surgery with or without
a cardiopulmonary bypass pump that are reported as a “randomized trial”
provided information on the methods of randomization and concealment of
treatment allocation, while only 14 reported on blinding of outcome assessment
or standardized postsurgical care, and only 30 gave details on deviations from
the protocol that occurred [Van der Heijden et al., 2004]. The unavailability of
this information hampers a complete and critical appraisal of such studies and
raises questions about the validity of their results.
Blinding for the journal source, the authors, their affiliation, and the study
results during critical appraisal by editing copies of the articles requires ample
time and resources. Therefore, this should only be considered when reviewer
bias as an important threat to the validity of the meta-analysis needs to be
excluded [Jadad et al., 1996; Verhagen et al., 1998]. To avoid errors in the
assessment of trials, critical appraisal should be standardized using a checklist
that is preferably completed independently by two reviewers as they read the
selected publications. In the event of disagreement between these two reviewers,
the eventual data analyzed can be based on a consensus meeting or a third
reviewer may provide a decisive assessment.
Studies that are the same with respect to the outcome, including scales used
and follow-up times, can be pooled by conventional statistical procedures.
Similarity can be judged by the information that is derived during data
extraction. Data extraction entails documentation of relevant data for each study
on a standardized data registry form and should include the number of patients
randomized per group and their baseline characteristics, notably relevant
prognostic markers (i.e., potential confounders). The follow-up data to be
recorded for each treatment group should, for each outcome and follow-up time,
include the point effect estimates and their variance, and the number of patients
analyzed, with accrued person-time “at risk.” Using these data, trials can be
grouped by outcome, follow-up time, or even risk status at baseline.
Accordingly, this gives a further quantitative basis for decisions about which
studies are combinable in the pooled or stratified analysis. Unfortunately, details
about follow-up are frequently omitted. Inadequate or incomplete reporting of
outcome parameters precludes statistical pooling in a meta-analysis. For
example, only 4 of 42 trials comparing coronary bypass surgery with or without
a cardiopulmonary bypass pump reported data that allowed calculating a
composite endpoint for death, stroke, and myocardial infarction [Van der
Heijden et al., 2004].
When the numbers of patients still “at risk” for the outcome concerned are
given under a Kaplan-Meier (KM) curve (as is often the case), the total person-time “at risk” can
be calculated for each treatment. This is not just true for death as outcome. For
any other outcome considered in a KM curve, person-times “at risk” can be
calculated, even when the outcome concerned is subject to competition from
other outcomes. When data are also given on how the number of subjects with the outcome concerned (death in this case) evolves over time, which is exceptional and the reason for using SOLVD as an example here, one can also determine how the corresponding rates evolve over time.
The calculations required are illustrated in Table 11–1 and are explained here.
For the time points given in column (1), the numbers of patients still “at risk” for
death (as shown in Figure 11–1) appear in columns (2) and (6) for the enalapril
and the placebo group, respectively. Columns (4) and (8) give the corresponding
number of deaths by interval as derived from Table 3 in the report. For example,
there were 118 deaths in the enalapril arm in the interval 12–24 months.
TABLE 11–1 Mortality Data Extracted from the SOLVD Trial Report
*Derived from the cumulative numbers of deaths in Table 3 of the report.
**Inconsistent with Figure 11–1, which suggests that 1,285−1,195, or 90, enalapril subjects and 1,284 −
1,159, or 125, placebo subjects did not complete the first 6 months of follow-up. This does not affect the
calculations.
Columns (3) and (7) give the interval number of patient-months of follow-up
for those “at risk” for death by treatment group and by interval as calculated
using Excel from the data given in columns (2) and (6). For example, it follows
from the data in column (2) that during the first 6 months, follow-up was
terminated in 1,285–1,195, or 90 enalapril patients either because of death or an
early end to follow-up. Assuming that this occurred on average halfway through the
interval (which is equivalent to approximating the mortality curve for the
interval 0–6 months by a straight line), these 90 patients have contributed 90 × 3,
or 270 patient-months of follow-up to the total. In the same treatment group,
1,195 patients were still alive at 6 months. These have contributed at that time
point another 1,195 × 6, or 7,170 patient-months of follow-up to the total for
enalapril. Adding 7,170 to 270 gives the 7,440 months shown in column (3) for
the interval 0–6 months.
The same calculation can be repeated for each subsequent time interval and
for each treatment. At 48 months, 333 enalapril and 299 placebo patients were
still alive and were followed further. According to the report, the maximum
duration of follow-up was 55 months, that is, another 7 months beyond 48. The
calculations in columns (3) and (7) assume an average of 3 months of follow-up
beyond 48 months. This explains the last entries in columns (3) and (7): 333 × 3
= 999 for enalapril and 299 × 3 = 897 for placebo. Note that the total follow-up
time does not critically depend on the assumption made concerning the
additional follow-up duration beyond 48 months in the 333 and 299 patients
concerned, as their number is small.
One can now calculate the mortality rates in SOLVD. The numerator for
enalapril is 452 deaths. The corresponding denominator is the sum of the interval
durations in column (3) of Table 11–1, and is equal to 44,943 patient-months.
This is equivalent to 44,943/12, or 3,745.25 patient-years. Hence, the mortality
rate for enalapril is (452/3,745.25) × 100, or 12.1 deaths per 100 patient-years
“at risk” for death. Similarly, the mortality rate for placebo is 510/42,624, which corresponds to 14.4 deaths per 100 patient-years “at risk” after conversion of months to years and multiplication by 100.
Based on these rates, one can now also calculate the rate (hazard) ratio
comparing enalapril to placebo, which is 12.1/14.4, or 0.84. As (1 – 0.84) × 100
= 16, note that this corresponds exactly to the “reduction in risk” of death of
16% as stated in the SOLVD report.
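
These calculations are easily scripted. The sketch below, in Python, reproduces them from the SOLVD numbers quoted above; the assumption that subjects leaving an interval contribute half its width is the same straight-line approximation described in the text.

# Person-time and rate calculations for SOLVD, as described in the text.

def interval_person_months(n_start, n_end, width):
    # Subjects leaving during the interval (death or early end of follow-up)
    # contribute width/2 on average; those still at risk at its end
    # contribute the full width.
    return (n_start - n_end) * width / 2 + n_end * width

# Interval 0-6 months, enalapril arm: 1,285 at risk at 0 and 1,195 at 6 months
print(interval_person_months(1285, 1195, 6))  # 7440, as in column (3)

# Overall rates per 100 patient-years from the column totals of Table 11-1
for arm, deaths, months, n in [("enalapril", 452, 44943, 1285),
                               ("placebo", 510, 42624, 1284)]:
    rate = deaths / (months / 12) * 100
    print(arm, round(rate, 1), "per 100 patient-years;",
          "mean follow-up", round(months / n, 1), "months")
# enalapril: 12.1; placebo: 14.4; rate (hazard) ratio 12.1/14.4 = 0.84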
The number of subjects still “at risk” as shown for SOLVD in Figure 11–1 is
today a fairly common feature of trial reports in major journals, but the number
of subjects with event by time interval, as presented in Table 11–1, is rarely
given. That these numbers were reported for SOLVD allows us to determine
how the rates of death and the rate (hazard) ratios evolve over time. In Table 11–
1, interval rates were calculated as the number of deaths for the interval
concerned, divided by the corresponding patient-time of follow-up. For example,
the rate of death for enalapril in the interval 12–24 months in column (5) of
Table 11–1 is {118/[(6588 + 6237)/12]} × 100, or 11.0 deaths per 100 patient-
years. When follow-up is partitioned by time intervals in this manner, one would
expect that the time interval rates vary. What matters here is whether there is a
trend over time. Note that the rates for enalapril given in column (5) are
essentially stable over time. For the placebo, the rate appears high in the first 6
months and is then also essentially stable. Why this is so cannot be answered
from the data given. What matters for meta-analysis is that the overall rates of
12.1 per 100 patient-years for enalapril and 14.4 for the placebo are convenient
and credible summary occurrence measures for SOLVD that can often easily be
calculated even if they are not given and that can be taken as essentially constant
over time for the chronic disease condition concerned. As we shall see later, as
long as the rates can be assumed constant over time, the particular duration of
follow-up in each trial or study included in a meta-analysis no longer matters.
The same cannot be said for the risk or the odds ratio and for the “number-
needed-to-treat” as commonly calculated [Lubsen et al., 2000].
In column (10) of Table 11–1, the rate (hazard) ratios are also given by time interval. As is true for the rates, these fluctuate from interval to interval but show no clear trend over time.
Both the log-rank test and Cox proportional hazard analysis with treatment as the
only covariate assume that the rate (hazard) ratio is constant over time. In the
case of SOLVD, the data do not clearly violate this assumption. We emphasize
that these methods of comparing rates do not assume that the rates themselves
are also constant. Few trial reports address the question of whether the rates are
constant over time, or whether the data support the assumption made in the
analysis that the rate (hazard) ratio is constant.
Because the time “at risk” for death is the same for all causes, cause-specific
death rates can be calculated when a breakdown by cause is reported and the
subject time of follow-up by treatment is given or can be calculated. As we shall
see later, this is essential when a meta-analysis is performed for specific causes
of death, or for nonfatal events that are subject to competitive censoring.
Treatment-specific subject times of follow-up also suggest an alternative
measure of treatment effect that may be easier to understand for patients than the
statement—based on SOLVD— “if you take enalapril for your heart failure,
your mortality rate will go down by 16 percent.” Note that in Table 11–1 the
mean follow-up is calculated as 35.0 months for enalapril, as opposed to 33.2 for
the placebo. Based on this, a physician could say to a patient: “After you have
taken enalapril for 35 months, you will have gained 1.8 months of life.” Of
course, this is only true on average. Nonetheless, it puts the effect of treatment in
a different perspective. An extra 1.8 months may be worthwhile if enalapril does
not cause any side effects and reduces the symptoms of heart failure. On the
other hand, if the treatment has quality of life decreasing side effects, the
perspective may be different. Few if any meta-analyses have thus far considered
effects on duration of life.
When numbers “at risk” are not given under a KM curve one can obtain
treatment-specific follow-up durations if the mean follow-up duration until death
or end of study for both treatments combined and the rate (hazard) ratio for all-
cause death are given in a report. This involves solving the following two
equations with two unknowns:
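
With T_index and T_ref denoting the unknown mean follow-up durations “at risk” for the index and reference treatment groups, n_index and n_ref the numbers of subjects randomized, and d_index and d_ref the numbers of deaths, a reconstruction of these two equations (the originals are not reproduced here), consistent with the relation mean follow-up “at risk” = risk/rate used later in this chapter, reads:

(n_index × T_index + n_ref × T_ref)/(n_index + n_ref) = mean follow-up duration for both groups combined

[(d_index/n_index)/T_index]/[(d_ref/n_ref)/T_ref] = rate (hazard) ratio

Solving these two equations for the two unknowns yields the treatment-specific mean follow-up durations, from which treatment-specific rates follow.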
TABLE 11–2 Risk of Death and Risk-Based Treatment Effects by Duration of Follow-up for Constant
Rates of 14/100 and 20/100 Subject-Years of Follow-up “At Risk” for Death for Treated and Control
Subjects Respectively (Hazard Ratio = 0.7)
TABLE 11–3 Occurrence of Death by Cause in a Simulated Trial with 2000 Treated Subjects and 2000
Controls
Obviously, neither the risk ratio nor the odds ratio for NCV death in Table 11–3 takes cause-of-death competition into account. The reason for this is not that
the numerators used in calculating the risks or the odds of NCV death are any
different from those used in calculating the rates. Rather, the reason is that the
denominators (number of participants allocated to the treatment in the case of
risks, number of participants with no event in the case of odds) do not take the
increased person-time “at risk” for treated subjects in comparison to controls
into account. On the other hand, rates have by definition person-time “at risk” as
the denominator, and are therefore sensitive to effects of treatment on the
person-time “at risk.”
The hazard ratios for all death, CV death, and NCV death shown in Table 11–
3 follow directly from the corresponding rates for treated and controls that were
the basis of the calculations, as explained in the table’s legend. The same hazard
ratios can also be obtained from the familiar log-rank statistic (O/E) for treated
subjects, divided by (O/E) for controls, with O denoting the observed numbers of
deaths for the cause concerned, and E the expected number. The expected
numbers of deaths by cause must be obtained by first calculating the rates for
both groups combined. For example, the CV death rate for both groups
combined is (196 + 451)/(2,000 × 2.445 + 2,000 × 2.256), or 6.9 per 100 person-
years. By applying this rate to the total person-years of follow-up for treated and
controls respectively, the corresponding expected numbers of CV deaths are
336.9 and 310.1, respectively. Hence, the log-rank statistic estimate of the rate
(hazard) ratio for CV death is (196/336.9)/(451/310.1), or 0.4, which
corresponds to the value calculated directly from the data in Table 11–3. This
shows that in calculating the log-rank statistic for CV death, it does not matter
whether follow-up “at risk” is terminated by competing NCV death or by the end
of follow-up. Contrary to the risk and the odds ratio, the rate (hazard) ratio from
the log-rank statistic is also an unbiased estimator of treatment effect when the
event concerned is subject to competition from other event(s).
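
The O/E calculation can be verified with a few lines of Python, using the Table 11–3 numbers quoted above; the expected counts come out marginally different from 336.9 and 310.1 because the mean follow-up durations are rounded here.

# Log-rank (O/E) estimate of the rate (hazard) ratio for CV death.
py_treated = 2000 * 2.445   # person-years of follow-up, treated
py_control = 2000 * 2.256   # person-years of follow-up, controls

# CV death rate for both groups combined (about 6.9 per 100 person-years)
combined_rate = (196 + 451) / (py_treated + py_control)

e_treated = combined_rate * py_treated   # expected CV deaths, treated
e_control = combined_rate * py_control   # expected CV deaths, controls

hazard_ratio = (196 / e_treated) / (451 / e_control)
print(round(hazard_ratio, 2))  # 0.40, matching the direct calculation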
Trials always compare closed cohorts of differently treated subjects. Hence
competition between events will always occur. A subject who, for example, dies
in a car accident is no longer “at risk” for acute myocardial infarction. A
comparison between treatments for the occurrence of myocardial infarction
cannot ignore events that terminated follow-up “at risk” for infarction.
Because of this, the already mentioned odds ratio–based meta-analysis by
Sattar et al. [2010] of statins and new diabetes is difficult to interpret. In the
report, the authors have tabulated for each included trial a proxy of the rates of
new diabetes for statins and controls, respectively, using the mean or median
duration of follow-up until death or the end of study for statin and control
subjects combined in calculating the denominators. These “rates” are useful as
an indicator of the frequency of new diabetes, which ranged from 4.5 to 34.8 per
1,000 person-years. However, these are not true rates according to the definition of a rate, because the denominators were not taken as the person-time “at risk” for new diabetes for statin and control subjects. These data were obviously not available to
the authors. It would have been of interest to know whether the authors
attempted to obtain such data from the investigators concerned, but failed (our
experience in this regard is not good).
In the studies considered by Sattar et al. [2010], there were 2,226 cases of new
diabetes for statin users as opposed to 2,052 for control subjects. This represents
an increase of 8.5%. The overall odds ratio for new diabetes comparing statin to
control subjects was 1.09 (95% CI, 1.02–1.17). The authors conclude that “statin
therapy is associated with a slightly increased risk of development of diabetes,
but the risk is low both in absolute terms and when compared with the reduction
in coronary events.” In the discussion, the authors note that “improved survival
with statin treatment” may be “a residual confounding factor,” and then, quoting
a meta-analysis of statins by the Cholesterol Treatment Trialists’ Collaborators
[2005], state that “overall survival with statins is very similar to survival with
control therapy (about 1.4% absolute difference), suggesting that survival bias
does not explain the variation.”
Sattar et al. [2010] do not define survival bias or explain how this relates to a
1.4% absolute difference. A definition of survival bias that follows from Table
11–3 is the difference in mean follow-up “at risk” for death (which may also be
called mean survival) between statin users and controls. Hence, the question is
whether an estimate of the difference in mean survival can be derived from the
report of the Cholesterol Treatment Trialists’ Collaborators [2005].
The Cholesterol Treatment Trialists’ meta-analysis used a sophisticated
extension of the log-rank statistic to estimate rate (hazard) ratios for all-cause
death and for major CV events. The mean duration of follow-up for survivors for
the trials included in this meta-analysis was given in the report as 4.7 years. This
quantity cannot be used to determine the mean survival for statin users and
controls by the method of solving two equations with two unknowns given
previously. Hence, another method is required to determine how large the
difference in mean survival might be.
The method concerned assumes that the mean duration of follow-up for
survivors is the same for treated and controls, which is reasonable. The absolute
rate (assumed constant) of all-cause death h is equal to –ln[S(t)]/t, where S(t)
denotes the survival probability at time t, and ln the natural logarithm. The
Cholesterol Treatment Trialists’ meta-analysis concerned 45,054 subjects
assigned statins and 45,002 assigned the control treatment. The corresponding
numbers of deaths were 3,832 and 4,354, respectively. It follows that there were
(45,054 – 3,832), or 41,222, survivors among statin users and (45,002 – 4,354),
or 40,648, for controls. Hence, the approximate (approximate because a fixed
follow-up of 4.7 years is assumed) rates of death were –ln(41,222/45,054)/4.7, or
1.891 deaths per 100 person-years for the statin group, and
–ln(40,648/45,002)/4.7, or 2.165 deaths per 100 person-years for controls. The
rate ratio from this is 1.891/2.165, or 0.87, which reassuringly is the same as the
overall rate ratio for all-cause death stated in the Cholesterol Treatment Trialists’
meta-analysis. From an equation for mean duration of follow-up “at risk” for
death used earlier to determine the data given in Table 11–3, it follows that the
mean survival is (3,832/45,054)/(1.891/100), or 4.50 years, for the statin group,
and (4,354/45,002)/(2.165/100), or 4.47 years, for controls. The “survival bias”
is thus 0.03 years, which is equivalent to less than 1% of the mean survival for
controls. This small increase in survival cannot explain the 8.5% increase in new
diabetes reported by Sattar et al. [2010]. The conclusion is that the bias that can
be attributed to increased survival by statin treatment in the effect measure for
new diabetes reported by these authors is indeed irrelevant (that there may be
other biases is an entirely different matter).
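
The arithmetic of the preceding paragraphs can be checked with a short Python sketch; the fixed 4.7-year follow-up for survivors is the approximation stated above.

# Survival-bias check for the Cholesterol Treatment Trialists' data:
# rate h = -ln(S)/t, mean survival = risk/rate.
from math import log

def rate_and_mean_survival(n, deaths, t=4.7):
    rate = -log((n - deaths) / n) / t   # deaths per person-year
    return rate, (deaths / n) / rate    # rate, mean survival in years

statin_rate, statin_surv = rate_and_mean_survival(45054, 3832)
control_rate, control_surv = rate_and_mean_survival(45002, 4354)

print(round(statin_rate * 100, 3), round(control_rate * 100, 3))  # 1.891 2.165
print(round(statin_rate / control_rate, 2))                       # 0.87
print(round(statin_surv, 2), round(control_surv, 2))              # 4.5 4.47
print(round(statin_surv - control_surv, 2))                       # 0.03 years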
In the case of the meta-analysis by Sattar et al. [2010] it did not matter that a
theoretically biased estimator of treatment effect was used. This is not always
the case.
Koller et al. [2008] performed a meta-analysis of nonarrhythmic death for
nine trials comparing an implantable cardiac defibrillator (ICD) to control
treatment. Deaths were classified as either due to arrhythmia or not. The overall
odds ratio for nonarrhythmic death was 1.11 (95% CI 0.84–1.45). Although not
convincingly so because of the wide confidence interval, this suggests that ICD
implantation may have an untoward effect on nonarrhythmic death. Because of
cause-of-death competition, the odds ratio for nonarrhythmic death is potentially
biased. The authors also calculated an overall rate (hazard) ratio for this
outcome, using person-time denominators calculated from the data given in each
report. The overall rate (hazard) ratio for nonarrhythmic death obtained was 1.03
(95% CI 0.80–1.32).
The meta-analysis by Koller et al. [2008] shows that an odds ratio–based
analysis can be seriously biased and potentially result in a spurious conclusion.
As shown in Table 11–3, the same applies to a risk ratio–based analysis. There
are therefore compelling reasons to use rate (hazard) ratios in meta-analysis
unless the studies included all have the same fixed duration of follow-up. In
practice, this can be difficult because the rate ratio data required for the
outcome(s) of interest cannot be abstracted in a consistent manner for all studies
included.
Occurrence Measures and Kaplan-Meier Curves
Competition between events also affects the interpretation of Kaplan-Meier
(KM) curves, which must be taken into account when estimating risks from
published KM curves. A KM curve for all-cause mortality shows the risk of
death and the proportion of subjects still alive over time irrespective of
censoring, but assuming that the censoring was non-informative. It is important
to understand what is meant here by non-informative. When subjects are
followed over time for any death, follow-up is either terminated (censored)
because of death, or because the subject is still alive and also still “at risk” for
death when the study ends. Note that there are only two ways that follow-up for
any death can be censored.
Now, suppose that a KM curve is derived for CV death. In that case, there are
three different reasons for censoring: (1) A CV death has occurred and is
counted as an event, (2) an NCV death has occurred (not counted as an event!),
after which the subject concerned is no longer “at risk” of CV death, and (3)
follow-up is terminated in a subject who is still alive and “at risk” for any death.
A conventional KM analysis for CV death will treat censoring because of the
second and the third reason as equivalent, although the second reason is by no
means non-informative because a subject who died of NCV death is no longer
“at risk” for CV death. It follows that a KM curve for CV death can only be
interpreted as showing the risk of CV death when there are no competing deaths
due to an NCV cause. By the same token, a KM curve for the combined outcome
of any death, myocardial infarction, or stroke can be interpreted as showing
event-free survival as there is no informative censoring due to competing events.
The same cannot be said about a KM curve for the combined outcome CV death,
myocardial infarction, or stroke. This KM curve will be subject to informative
censoring due to competing NCV death. Hence, the curve for this combined
outcome does not show event-free survival unless there are no NCV deaths that
precede CV death, myocardial infarction, or stroke.
The error of interpretation made when taking, for example, a risk of CV death
from a KM curve for CV death is avoided by considering the rate of CV death rather than the risk. The rate can only be obtained, however, when person-time “at risk” for death data are available or can be calculated. KM curves in study
reports are of little use, other than showing how the events considered “spread
out” during follow-up.
A KM curve for an event that has a constant rate will have the well-known
exponential shape. A more useful way to determine whether rates are constant
can be understood based on the exponential relationship between the risk and the
rate of death, R(t) = 1 – exp(–h × t), which is equivalent to S(t) = exp(–h × t). If
one were to plot –ln[S(t)] in lieu of S(t), the plot would be determined by h × t,
or by a straight line when h is constant. In the statistical literature, the quantity
–ln[S(t)] is called the cumulative hazard. Cumulative hazard plots are much
more informative than conventional KM curves and less prone to
misinterpretation in the case of competing events. However, they are rarely
found in trial reports. An example may be found in Connolly et al. [2011].
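
The conversion itself takes only a line or two; a minimal sketch in Python, with survival values simulated under an assumed constant rate purely for illustration:

# Converting survival probabilities S(t) into a cumulative hazard -ln[S(t)].
# Under a constant rate h, -ln[S(t)] = h * t: a straight line through the origin.
from math import exp, log

h = 0.14                                  # assumed constant rate, per year
times = [0.5, 1, 2, 3, 4]
survival = [exp(-h * t) for t in times]   # simulated KM values

for t, s in zip(times, survival):
    cum_hazard = -log(s)
    print(t, round(cum_hazard, 3), round(cum_hazard / t, 3))  # slope recovers h
# Curvature in a real cumulative hazard plot signals a non-constant rate.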
A combined effect estimate can be calculated as a weighted average of the study-specific effect estimates:
EC = (weight1 × E1 + weight2 × E2 + …)/(weight1 + weight2 + …)
where EC denotes the combined effect estimate; E1, E2 … the effect estimate for each study considered; and weight1, weight2 … the corresponding weight given to each study. In other words, the weighted average is equal to the sum of the study-specific weights multiplied by the corresponding value of the effect estimate, divided by the sum of the weights. Note that the arithmetic average is a special case of a weighted average, with weights all equal to 1.
The various Mantel-Haenszel–type methods that use 2 × 2 table cell counts to
calculate combined risk-based effect measures can be thought of as using the
size of each study as weights in calculating weighted combined effect estimates.
The generalized or inverse variance procedure for combining effect estimates
uses 1/(variance of effect estimate) for each study as weights. Weights that
depend on the variance of effect estimates in this manner will be smaller for
small studies (large variance) in comparison to those for larger studies (small
variance). As for Mantel-Haenszel–type methods, this implies that large studies
have more impact on the combined estimate than smaller ones.
The computationally simple calculations required by the inverse variance
method for ratio measures of effect can conveniently be performed by using a
spreadsheet program (Excel or similar) and are explained in detail in Table 11–4
using data on the occurrence of stroke in six trials as an example.
The method for risk ratios is illustrated in Table 11–4. For odds ratios and rate
(hazard) ratios, the denominators entered have to be adapted accordingly. Note
that the calculations are performed for a natural logarithmic transformation of
the relative effect measure concerned. The formula for the corresponding
standard error is slightly different for risk ratios, odds ratios, and rate (hazard)
ratios (see legend for Table 11–4) and must therefore be adapted in the
spreadsheet as appropriate. Note also that calculating the combined effect
estimate directly from the sums gives the same result, as (218/8,875)/(322/8,769)
= 0.67. Starting from the natural logarithm of this and by using the formula for
its variance as given in the legend of Table 11–4, one can readily verify that the
95% CI calculated from the sums has a lower limit of 0.56 and an upper limit of
0.79, which corresponds closely to the values obtained by the inverse variance
pooling procedure.
TABLE 11–4 Example of Combining Relative Effect Measures by the Inverse-Variance Method, Using
Stroke Data from Six Trials
(1) The Heart Outcomes Prevention Evaluation Study Investigators, N Engl J Med. 2000;342:145–153.
(2) MacMahon S et al., J Am Coll Cardiol. 2000;36:438–443.
(3) Pitt B et al., Am J Cardiol. 2001;87:1058–1063.
ai, ci = Number of subjects with event for treated and controls.
bi, di = Denominators of occurrence measures compared. Totals allocated for risk ratio (as in this
example), totals of subjects without event for odds ratio, person-time “at-risk” for rate (hazard)
ratio.
REi = Relative effect = (ai/bi)/(ci/di) = risk ratio in this example. May be entered directly when
combining rate (hazard) ratios.
ln(REi) = Natural logarithm of relative effect REi.
SE of ln(REi) = Standard error of natural logarithm of REi. For risk ratio (as in this example) = square root of
(1/ai – 1/bi + 1/ci – 1/di). For odds ratio (with bi and di equal to subjects without event) = square
root of (1/ai + 1/bi + 1/ci + 1/di). For rate (hazard) ratio (with bi and di equal to person-time “at
risk”) = square root of (1/ai + 1/ci).
95% CI of REi = 95% confidence interval of REi = exponent [ln(REi) – 1.96 × standard error of ln(REi)] for lower
limit (LLi), exponent [ln(REi) + 1.96 × standard error of ln(REi)] for upper limit (ULi).
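
The same arithmetic is readily scripted. The sketch below, in Python, implements the inverse-variance pooling of risk ratios laid out in Table 11–4; because the per-trial cell counts are not reproduced here, the column sums quoted above serve as a single stratum to check the result.

# Inverse-variance pooling of risk ratios (per Table 11-4).
from math import exp, log, sqrt

def pool_risk_ratios(trials):
    # trials: list of (a, b, c, d) with a/b = events/total for treated and
    # c/d = events/total for controls. Returns combined RR with 95% CI.
    num = den = 0.0
    for a, b, c, d in trials:
        ln_rr = log((a / b) / (c / d))
        var = 1/a - 1/b + 1/c - 1/d   # variance of ln(RR), as in the legend
        weight = 1 / var              # inverse-variance weight
        num += weight * ln_rr
        den += weight
    ln_combined, se = num / den, sqrt(1 / den)
    return (exp(ln_combined),
            exp(ln_combined - 1.96 * se), exp(ln_combined + 1.96 * se))

rr, low, high = pool_risk_ratios([(218, 8875, 322, 8769)])
print(round(rr, 2), round(low, 2), round(high, 2))  # 0.67 0.56 0.79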
Theoretically, the inverse variance method requires only that the effect
estimates for each study be included, along with their standard errors. When the
latter are not given, a standard error can be derived from either the P value or the
confidence interval. The cell count data necessary for risk ratios and odds ratios
will generally be available. This is not so for person-time denominators required
to calculate rate (hazard) ratios. When combining the hazard ratios for studies
that have all reported a hazard ratio value for the outcome of interest, the
standard error of its natural logarithm can be obtained from the number of
subjects with an event for treated and controls (see legend for Table 11–4). In
that case, person-time denominator data are not required. The inverse variance
method can also be used to combine odds ratio data from one study and hazard
ratio data from another. It follows from what has been said earlier about
competition between events that this may give results that are difficult to
interpret. Hence, whether it is reasonable to combine odds ratios and hazard
ratios must be considered carefully.
When the number of events is zero for any treatment in a particular study, the study concerned cannot be included as such in calculating an overall measure of effect by the inverse variance method, as is obvious from the formula for the standard error given in Table 11–4 (1/0 = infinity). In such cases, a Mantel-Haenszel–type method may be a better choice. For odds ratios, Sweeting et al. [2004] compared different methods for sparse data and concluded that the inverse variance procedure with a continuity correction for zero cell counts (i.e., replacing a zero with 0.5, for example) gives biased results. The same applies to the risk difference [Tang, 2000].
Heterogeneity
As is evident from Table 11–4, effect estimates may vary in magnitude and
direction across studies. This poses the question of whether this reflects genuine
differences between studies or chance. Assessment of consistency of effects
across studies is an essential part of meta-analysis, as inconsistent results are not
generalizable [Higgins et al., 2003].
A test for heterogeneity examines whether the variability in effect estimates
between studies exceeds the variability that can be attributed to chance alone.
There are essentially two underlying models that differ in the way variability across studies is considered. The fixed-effects model assumes that
variability between studies is entirely random around one true value of the effect
estimate concerned. The random-effects model, on the other hand, assumes that
true effects differ randomly between trials [DerSimonian & Laird, 1986]. The
random-effects model implies the use of a larger study-specific variance of the
effect estimate than the fixed-effects model. Hence, the confidence interval
obtained for the combined effect estimate will in general be wider for the
random-effects than for the fixed-effects model.
Higgins et al. [2003] reviewed various approaches to the assessment of
inconsistencies in meta-analyses that have been proposed. The usual
heterogeneity test statistic known as Cochran’s Q assumes a fixed-effects model,
has a Chi-squared distribution, and is computed as shown in Table 11–4. A Q value that exceeds its number of degrees of freedom (rather than the corresponding P value) is generally interpreted as evidence of heterogeneity.
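
A minimal sketch of the computation in Python, on the natural-log scale with inverse-variance weights; the per-trial estimates and variances are invented for illustration.

# Cochran's Q and I-squared for log risk ratios.
from math import log

# (ln RR, variance of ln RR) per trial -- invented numbers
trials = [(log(0.60), 0.04), (log(0.75), 0.02), (log(0.90), 0.03)]

weights = [1 / var for _, var in trials]
combined = sum(w * e for (e, _), w in zip(trials, weights)) / sum(weights)

Q = sum(w * (e - combined) ** 2 for (e, _), w in zip(trials, weights))
df = len(trials) - 1
I2 = max(0.0, (Q - df) / Q) * 100   # percent of variation beyond chance (Q > 0 here)

print(round(Q, 2), df, round(I2, 1))  # 2.35 2 14.9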
Heterogeneity tests pose subtle problems of interpretation. First, absence of
“significant” heterogeneity is not proof of homogeneity. Cochran’s Q test is
known to be poor at detecting true heterogeneity among studies as significant, in
particular when the number of studies included is limited. An alternative statistic, I², which quantifies the percentage of variation across studies attributable to heterogeneity, does not depend on the number of studies included [Higgins et al., 2003]. Second, when clinically relevant (which is something other than
“statistically significant”) heterogeneity is observed across studies, one may
question whether these studies can be combined by the chosen effect measure in
the first place. Conventionally, a random-effects or other model is assumed
when one of the available tests for a fixed-effects model suggests “significant”
heterogeneity [Berry, 1998]. But this may mask the existence of a credible
explanation for the heterogeneity that was observed. To illustrate, suppose that
the studies considered in Table 11–4 had been ranked according to the mean age
of the subjects in each study, and that the effect estimates showed a relationship
between mean age and effect estimate across studies, indicating that age is an
effect modifier. The value of Cochran’s Q statistic is the same, regardless of
whether age is an effect modifier. Another reason for effect modification that
could explain heterogeneity is difference in study design, such as choice of
treatment and type of treatment comparison (e.g., add-on treatment or
replacement, comparisons between two active treatments, double-blind, or open
comparison, etc.).
Taking lack of significance of a test of heterogeneity as proof of homogeneity
forces the data to fit a preconceived model that assumes that the true effect
estimate of interest is either the same (fixed effect), or varies at random (random
effect) across studies. This may result in conclusions about effects of treatment
that are not generalizable. Therefore, the possible causes of the heterogeneity
must always be explored, whether or not the heterogeneity observed is
statistically significant.
FIGURE 11–2 Three effect models for the relationship between the rate for treated and the rate for controls
respectively. The dotted line is given by y = x, which indicates that treatment has no effect, relative to
controls, across the range of risks for controls. The constant rate ratio line is given by y = HR × x, and the
constant rate difference line by y = x − RD. The ‘mixed’ model is given by y = a × x + b (with constant a
and b).
Figure 11–2 shows three possible relationships between the rates for treated
and control subjects. The constant rate (hazard) ratio model assumes that
treatment reduces the rate on a ratio scale to the same extent for any value of the
control rate. The same applies to the constant rate difference model on the
absolute difference scale. For low rates neither model is credible. The mixed
model, on the other hand, allows for the possibility that a treatment may be
highly effective in high-risk subjects while having no effect at all (or even an
untoward effect) in low-risk subjects, as will often be the case in clinical
practice.
Of course the data points plotted will never fall exactly on any line shown in
Figure 11–2 due to random variability and other factors. Nonetheless, a L’Abbé
plot can be helpful in determining whether any effect model illustrated in Figure
11–2 seems to fit the data, and therefore in determining whether heterogeneity
between studies can perhaps be explained by a mixed-model relationship that
implies by definition that there will be heterogeneity both for ratio and for
difference measures of effect.
An example given by Hoes et al. [1995b] is reproduced here as Figure 11–3.
Based on a weighted least-squares regression analysis that assumes a mixed
model as shown in Figure 11–2, Hoes et al. [1995b] concluded that the rate
(hazard) ratio for all-cause death cannot be assumed constant over the range of
absolute rates across the trial subgroups considered and that there is no evidence
that drug treatment improves survival when the death rate for control is below
6/1,000 person-years. The mixed-model result obtained by Hoes et al. [1995b],
as shown in Figure 11–3, has been criticized by Egger & Smith [1995], who
contended that regression bias is a more likely explanation for the relationship
shown. Arends et al. [2000] have reanalyzed the data used by Hoes et al.
[1995b] using a Bayesian approach and they came to a similar conclusion about
the existence of a cut-off point for efficacy at a death rate of 6/1,000 person-
years.
Effect models as shown in Figure 11–2 can also be explored by plotting the
effect measure concerned on the vertical axis and the absolute occurrence measure for treated and controls combined on the horizontal axis. Further details
may be found in Van Houwelingen et al. [2002].
Meta-Regression
Staessen et al. [2001] plotted treatment effects on clinical events expressed as
odds ratios on a vertical axis against the difference in mean on treatment blood
pressure levels between treatment and control on the horizontal axis (see Figure
11–4). This is an example of meta-regression.
FIGURE 11–3 All-cause mortality rates (deaths/1000 patient-years [py]) in the intervention and control
groups of 12 subgroups from 7 trials in mild-to-moderate hypertension. The dotted ‘no-effect’ line indicates
that the rates are the same for the intervention and control groups. The continuous weighted least-squares
regression line is given by y = 0.53 × x + 0.0029 and describes the estimated intervention death rate as a
function of the control rate. The 95% confidence interval of the regression coefficient is 0.33–0.73. The no-
effect line and the regression line intersect at a control rate of 6/1000 patient-years.
Reproduced with permission from the Journal of Hypertension. Hoes AW, et al. Does drug treatment
improve survival? Reconciling the trials in mild-to-moderate hypertension. J Hypertens 1995;13:805–811.
FIGURE 11–4 Example of a meta-regression analysis [Staessen, 2001]. On the horizontal axis the
difference in systolic blood pressure (mm Hg) is depicted. The vertical axis shows the odds ratio for
cardiovascular mortality (left panel) and cardiovascular events (right panel) of the trials considered.
Reproduced from The Lancet Vol. 358; Staessen JA, Wang JG, Thijs L. Cardiovascular protection and blood
pressure reduction: a meta-analysis. The Lancet 2001;358:1305–1315, with permission from Elsevier.
Flowchart
The strategy for retrieval and selection of the publications and results must be
stated clearly. The search filter syntax per bibliographic source with the
subsequent number of retrieved publications, the number and reasons for
exclusion of publications, the final number of publications included, and the
number of studies concerned should be reported, preferably as a flowchart (see
Figure 11–5).
Funnel Plot
A funnel plot is a scatter plot of the treatment effect estimates from individual studies against a measure of their precision, such as the sample size or the inverse of the variance. Its proponents suggest that a funnel plot can be used to
explore the presence of publication and retrieval bias. Effect estimates from
small studies will scatter more widely around the true value than estimates from
larger studies because the precision of the treatment effect estimate increases
when the sample size increases. Conventionally, a skewed (asymmetrical) funnel
plot is considered to indicate bias in publication, retrieval, and selection of trials
[Sterne et al., 2000]. If smaller trials with a beneficial treatment effect (see the
upper left part of Figure 11–6) are preferentially published and more likely to be
retrieved and selected than smaller studies with no or even a harmful effect of
treatment (see the lower left part of the figure), the plot will be asymmetric.
FIGURE 11–5 Example of a flow diagram representing the search and selection of trials.
Reproduced from Cappuccio FP, Kerry SM, Forbes L, Donald A. Blood pressure control by home
monitoring: meta-analysis of randomized trials. BMJ 2004;329:145.
FIGURE 11–6 Example of a funnel plot based on simulated data. Simulated funnel plot, created by
randomly drawing 100 samples of size varying from 50 to 2000 from an underlying normal distribution
with a mean of 1 unit and standard deviation of 10 units. The curves indicate the region within which 95%
of samples of a given size are expected to fall. Closed circles indicate samples where the mean is
significantly increased (above zero) at P < 0.05, open circles samples where it is not. For the full sample,
the funnel shape is evident, but this would not be so if the open circles (or a proportion of them) were not
included due to publication bias.
Reproduced from Thornton A, Lee P. Publication bias in meta-analysis: its causes and consequences. J Clin
Epidemiol 2000;53:207–216, with permission from Elsevier.
Many other reasons for funnel plot asymmetry have been suggested, but a rigorous simulation study exploring the impact of these alternative explanations on funnel plot asymmetry is lacking. Hence, the relevance of funnel plots
for evaluating completeness of the studies included in a meta-analysis remains
questionable.
Tables
The results of the critical appraisal and data extraction should be reported in
tables. These tables should account for the validity of trials and their
combinability. Notably, this includes the relevant characteristics of the
participating patients, the compared treatments, and reported endpoints. The
occurrence measures per treatment group should be tabulated with the effect
estimate and its confidence interval for each trial. Examples are given in Table
11–5 and Table 11–6.
Forest Plot
The results from the data analysis are preferably displayed as a Forest plot
showing the effect estimates for each included study with its confidence interval
and the pooled effect estimate with its confidence interval (see Figure 11–7).
The treatment effect estimate of each trial is represented by a black square with a
size that is proportional to the weight attached to the trial, while the horizontal
line represents the confidence interval.
The 95% CIs would contain the true underlying effect in 95% of the
repetitions if the study were redone. The solid vertical line corresponds to “no
effect.” If the 95% CI crosses this solid line, the effect measure concerned is not
statistically significant at the conventional level of (two-sided) P ≤ 0.05. The
diamond represents the combined treatment effect. The horizontal width of the
diamond represents its confidence interval.
The dashed line is plotted vertically through the combined treatment effect.
When all confidence intervals cross this plotted line, the trials are rather
homogeneous. Ratio measures of effect (e.g., risk, odds, or rate [hazard] ratios)
are typically plotted on a logarithmic scale because the confidence intervals are
then displayed symmetrically around the point estimate. An example is given in
Figure 11–8.
TABLE 11–5 Example of Table for Reporting Results of Critical Appraisal of the Methodology of
Individual Studies from a Meta-Analysis of Trials Comparing Off-Pump and On-Pump Coronary Bypass
Surgery (ordered by the number of items satisfied)
Meaning of item ratings:
Contamination: • = ≤ 10% crossover; ∞ = > 10% crossover.
All other items: • = bias unlikely (yes, adequate design or method); ∞ = bias likely (no, inadequate design or method); ° = unclear (insufficient information available).
Reproduced from Nathoe HM. Coronary revascularization: stent-implantation, on pump or off pump bypass
surgery? PhD thesis, Utrecht University (2004), ISBN 90-393-3739-X.
TABLE 11–6 Example of Table for Reporting Results of Data Extraction from a Meta-Analysis of the
Effect of Lipid Lowering Treatment
* Percentage of subjects per trial with the established diagnosis.
† Relative reduction of total cholesterol levels in the treatment group.
‡ Includes transient ischemic attacks.
Reproduced from Briel M, Studer M, Glass TR, Bucher HC. Effects of statins on stroke prevention in
patients with and without coronary heart disease: a meta-analysis of randomised controlled trials. Am J Med
FIGURE 11–7 Example of a Forest plot from a meta-analysis examining the relationship between use of
statins and change in blood pressure level. Mean differences and 95% CIs in systolic blood pressure (SBP)
achieved in patients who took statins compared with those who took placebo or other control treatment are
shown. Separate evaluations were made for studies in which the baseline SBP was > 130 or ≤ 130 mm Hg.
Symbols are (box) treatment effect estimate of each trial, with a size proportional to its weight; (—) CI of
the treatment effect estimate of each trial (the treatment effect with 95% CI is also displayed on the right of
the plot); (I) line of no effect; (vertical dashes) combined treatment effect; (diamond) width of the
diamonds represents the CI of the combined treatment effect.
Reproduced from Strazzullo P, Kerry SM, Barbato A, Versiero M, D’Elia L, Cappuccio FP. Do statins
reduce blood pressure? A meta-analysis of randomised, controlled trials. Hypertension 2007;49:792–8.
FIGURE 11–8 Results from a meta-analysis to investigate the effects of fibrates on major cardiovascular
outcomes (Jun et al., 2010).
Reproduced from The Lancet Vol. 375; Jun M, Foote C, Lv J, Neal B, Patel A, Nicholls SJ, Grobbee DE,
Cass A, Chalmers J, Perkovic V. Effects of fibrates on cardiovascular outcomes: a systematic review and
meta-analysis. The Lancet 2010;375:1875–84, with permission from Elsevier.
BOX 11–4 Internet Resources for Computer Software and Programs for Meta-Analysis (accessed May 17,
2013)
STATA Data Analysis and Statistical Software, College Station, Texas: http://www.stata.com
Strong evidence: A solid summary effect of clinically relevant magnitude, without apparent
heterogeneity across a large number of exclusively high-quality trials, that is, the direction and size of
effect are consistent across trials. Clinical recommendation: Treatment should be considered in all
patients; effects in subgroups could still be of interest.
Moderate evidence: A summary effect of clinically relevant magnitude, without apparent
heterogeneity across multiple high- to moderate-quality trials, that is, the direction and size of effect
are consistent across trials. Clinical recommendation: Treatment may be considered for all patients, but
different subgroup effects could be of interest. Clinical consensus may be helpful.
Weak evidence: A statistically significant summary effect of clinically relevant magnitude, that is,
the direction of effect is consistent across trials of moderate to low quality. Exploration for sources of
heterogeneity across trials at the patient level (i.e., subgroups) or study design level appears justified.
Clinical recommendation: Treatment may be considered for most patients, and different subgroup
effects may be of interest. Clinical consensus could be helpful.
Inconsistent evidence: The magnitude and direction of effect varies across moderate- to low-quality
trials. Exploration for sources of heterogeneity across trials at the patient level (i.e., subgroups) or
study design level appears justified. Clinical recommendation: Treatment may be considered for
patients, but clinical consensus will be helpful.
Little or no evidence: Limited number of trials of low quality. Clinical consensus is needed; research is
warranted.
INTRODUCTION
The critically essential stages of designing clinical epidemiologic research are
over when the occurrence relation and the mode of data collection have been
established. Design of data analysis is important because it will determine the
utility of the result and should maintain the relevance and validity achieved so
far. Yet, in general, there are only a few appropriate and feasible ways to analyze
the data of a given study. Ideally, the design of data analysis follows naturally
from the nature of the occurrence relation and the type of data collected. Similar
to the design of the occurrence relation and the design of data collection, the
design of data analysis in diagnostic, etiologic, prognostic, and intervention
research each have their particular characteristics.
This chapter deals with elementary techniques used in data analysis. Often
these techniques are sufficient to answer the research question. For more
extensive information on data analysis, the reader must consult textbooks that
are specifically dedicated to data analysis [Altman, 1991; Kleinbaum & Kupper,
1982] or the referred literature in the chapters on diagnostic, etiologic,
prognostic, and intervention research. A simple statistical calculator,
“WhatStat,” can be found in Apple® iTunes® digital store.
A typical data analysis begins with a description of the population; key
characteristics are provided in the first, so-called baseline table. Its format
depends on the type of research that is performed. In a randomized trial, the
baseline table summarizes the frequencies and levels of important prognostic
variables in the randomized groups. This table is important because the
reviewers and readers of the eventual publication learn about the study
population and can judge the quality of the randomization. In etiologic research,
the frequencies and levels of relevant characteristics (in particular potential
confounders) will be summarized by categories of the determinant, while in
diagnostic and prognostic research, predictors according to the disease or
outcome will be shown. In the first step of data analysis, the data are reduced by
giving summary estimates (e.g., mean, range, standard deviation, frequencies).
Next, measures of association between the determinant(s) and the outcome of
interest are calculated with corresponding 95% confidence intervals. In etiologic
research, the crude association measure will generally be adjusted for one or
more confounding variables.
Before we deal with the data analysis steps that are performed in nearly every
clinical epidemiologic study, we focus attention on how to calculate prevalence
and incidence measures. Next, we cover the concept of variability in research
and the way uncertainty is reflected in the description of the data. Finally,
adjustment for confounding with several techniques such as stratified analysis
[Mantel & Haenszel, 1959], linear, logistic, and Cox regression is explained.
BOX 12–1 Calculating the Prevalence (and Confidence Interval) of Metabolic Syndrome in 1,000 Patients
with Coronary Ischemia
In a study population of 1,000 patients with coronary ischemia the proportion (prevalence) of patients
with the metabolic syndrome is 40% (400 patients).
The 95% CI can be calculated with the traditional method of formula (1):

p ± z × √(p(1 − p)/n) = 0.40 ± 1.96 × √(0.40 × 0.60/1,000) = 0.40 ± 0.03   (1)

In comparable populations, the prevalence of the metabolic syndrome will be found in the range
between 37% and 43%. With the method proposed by Altman, the following calculations need to be
done: p = 400/1,000 = 0.40, q = 600/1,000 = 0.60, and r = 400 [Altman et al., 2000b], with

A = 2r + z², B = z × √(z² + 4rq), and C = 2(n + z²)

where r is the number of participants that has the feature, q is the proportion that does not have it, n is
the total number of participants, and z (usually) is 1.96. The confidence interval for the population
prevalence is then calculated as:

(A − B)/C to (A + B)/C

Data from Altman D, Machin D, Bryant TN, Gardner MJ. Statistics with Confidence. 2nd edition. BMJ
Books, 2000b.
Software such as Confidence Interval Analysis (CIA) [Altman et al., 2000b]
dedicated to the estimation of confidence intervals is available and easy to use.
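As a check on Box 12–1, both confidence interval methods can also be computed in a few lines of Python; a minimal sketch using only the standard library:

# Both 95% CI methods from Box 12-1 for a prevalence of 400/1,000.
from math import sqrt

r, n, z = 400, 1000, 1.96      # patients with the syndrome, total, z-value
p = r / n                      # prevalence = 0.40
q = 1 - p

# Traditional method (formula 1): p +/- z * sqrt(p(1 - p)/n)
se = sqrt(p * q / n)
print(f"traditional 95% CI: {p - z * se:.3f} to {p + z * se:.3f}")   # 0.370 to 0.430

# Method of Altman et al.: (A - B)/C to (A + B)/C
A = 2 * r + z ** 2
B = z * sqrt(z ** 2 + 4 * r * q)
C = 2 * (n + z ** 2)
print(f"Altman 95% CI: {(A - B) / C:.3f} to {(A + B) / C:.3f}")      # 0.370 to 0.431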
To estimate the incidence, two measures are commonly used: cumulative
incidence and incidence rate. The cumulative incidence is the number of
subjects developing the disease during a particular time period divided by the
number of subjects followed for the time period. The incidence rate estimates the
occurrence of disease per unit of time. The incidence rate is also called force of
morbidity. The cumulative incidence is a proportion, binomially distributed, and
the 95% CI can be calculated with Equation 1 or the alternative method as
explained in the earlier calculations. The cumulative incidence is often
interpreted as the “risk.”
The incidence rate is the number of cases occurring per unit of follow-up time
and can be expressed as the number of cases (I) per 1,000 (or 10,000) person-
years (PY). For the prevalence and the cumulative incidence, the number of
cases cannot become larger than the denominator, but in the formula of the
incidence rate (IR = I/PY), the denominator has no fixed relationship with the
numerator. Confidence intervals for this type of distribution can be calculated by
assuming that the incidence rate has a Poisson distribution (see Box 12–2)
[Altman, 1991]. The 95% CI of incidence rates can be easily read from a table
that can be found in most statistics textbooks or on the Internet (Health Data,
2012).
BOX 12–2 Calculating the 95% Confidence Interval of an Incidence Rate
In a population with a mean follow-up of 2.3 years cumulating in 9,300 person-years (PY) of follow-
up, 35 patients experience a myocardial infarction.
In the confidence limits table for variables that have a Poisson distribution, we find that the lower
border of the 95% CI for 35 events is 24.379 and the upper border is 48.677. These are absolute
numbers and have to be expressed per 10,000 PY:

IR = 35/9,300 = 37.6 per 10,000 PY, with 95% CI 24.379/9,300 to 48.677/9,300 = 26.2 to 52.3 per 10,000 PY

Data from Washington State Department of Health (2012). Guidelines for Using Confidence Intervals for
Public Health Assessment. http://www.doh.wa.gov/Portals/1/Documents/5500/ConfIntGuide.pdf. Accessed
June 20, 2013.
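Rather than reading the Poisson limits from a table, they can be obtained exactly from the chi-square distribution. The sketch below (assuming scipy is available) reproduces the limits used in Box 12–2:

# Exact 95% Poisson confidence limits for 35 events in 9,300 person-years.
from scipy.stats import chi2

events, py = 35, 9300
lower = chi2.ppf(0.025, 2 * events) / 2           # about 24.38, as in the table
upper = chi2.ppf(0.975, 2 * (events + 1)) / 2     # about 48.68

rate = events / py * 10_000
print(f"IR = {rate:.1f} per 10,000 PY "
      f"(95% CI {lower / py * 10_000:.1f} to {upper / py * 10_000:.1f})")
# IR = 37.6 per 10,000 PY (95% CI 26.2 to 52.3)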
DATA ANALYSIS STRATEGIES IN CLINICAL
EPIDEMIOLOGIC RESEARCH
Baseline Table
In the methods section of an article, the researchers meticulously describe the
study population so the readers can get acquainted with this population and
judge the domain to which the results of the study pertain. In the first part of the
results section, the authors describe the key characteristics in the baseline table.
In Box 12–3, a baseline table is given from a study in which investigators
examined the relationship between the presence of the metabolic syndrome in
patients with symptomatic vascular disease and the extent of atherosclerosis
[Olijhoek et al., 2004].
The baseline table provides an overview of the most important characteristics
of the study population. In this example, the relationship between metabolic
syndrome and the extent of vascular disease was the subject of research. For
patients with and without metabolic syndrome, the relevant characteristics are
summarized in the first table of the report [Olijhoek et al., 2004].
For each characteristic, either the mean (with standard deviation) or frequency
is given. Variability is a key concept in clinical research. People differ in their
characteristics and their responses to tests and treatment, so there are many
sources of variability. To reduce the amount of available information, the data
need to be summarized. A continuous variable (e.g., age, blood pressure) is
summarized by a central measure (the mean) and a measure of variability (the
standard deviation), or a median with an interquartile range.
The standard deviation (SD) characterizes the distribution of the variable and
can be calculated by taking the square root of the variance. The mean ± 2 SD
includes 95% of the observations when the distribution is approximately
normal; in non-normal and even skewed distributions, at least 75% of the
observations still fall within this range. If variables have a skewed distribution, the
median will likely be a more relevant summary measure than the mean and in
that event, the distribution is characterized by giving the interquartile range,
that is, the range from the 25th (P25) to the 75th (P75) percentile. Interquartile
values are typically more useful than the full range, as the extremes of a
distribution may comprise erroneous or unlikely data (see Figure 12–1).
Categorical variables are summarized by giving their frequencies. For
example, 70% of the population is male. Data of this type with only two
possibilities (i.e., dichotomous variables) have a binomial distribution and are
very common in medical research. If sample sizes are large enough, the binomial
distribution approaches the normal distribution with the same mean and standard
deviation.
BOX 12–3 Baseline Characteristics of the Study Population from a Study in which the Relationship
Between Metabolic Syndrome and the Extent of Vascular Disease is Determined
All data in percentages, or as indicated: 1 mean ± standard deviation or 2 median with interquartile
range.
HDL: high-density lipoprotein.
a Still smoking, recently stopped smoking, or previously smoking.
b History of vascular disease other than qualifying diagnosis.
c Fasting serum glucose ≥ 7.0 mmol/l or self-reported diabetes.
Reproduced from Olijhoek JK, van der Graaf Y, Banga JD, Algra A, Rabelink TJ, Visseren FL. The
SMART study group. The metabolic syndrome is associated with advanced vascular damage in patients
with coronary heart disease, stroke, peripheral arterial disease or abdominal aortic aneurysm. Eur Heart J
2004;25:342–8.
FIGURE 12–1 Plot of weight of the 1045 patients with symptomatic atherosclerosis: mean 80.25 kg; SD
13.0; SEM (standard error of the mean) 0.40; median 80.0 kg; interquartile range 17; P25 is 72 kg and P75 is
89 kg; range 42–143 kg.
Reproduced from Olijhoek JK, van der Graaf Y, Banga JD, Algra A, Rabelink TJ, Visseren FL; the SMART
study group. The metabolic syndrome is associated with advanced vascular damage in patients with
coronary heart disease, stroke, peripheral arterial disease or abdominal aortic aneurysm. Eur Heart J
2004;25:342–8, by permission of Oxford University Press.
Variability is not only present between subjects but also between studies. The
variability of the sample mean is expressed by the standard error (SE), which can
be calculated by dividing the standard deviation by the square root of the
number of observations.
In research, inferences about populations are made from samples. We cannot
include all patients with a myocardial infarction in our study; instead we want to
generalize the findings from our sample to all patients with myocardial
infarction. Thus, we sample and estimate. The way we sample determines to
what extent we may generalize. Generally, the results from a sample are valid
for the study population from whom the sample was drawn and may be
generalized to other patients or populations that are similar to the domain that is
represented by the study population.
Extrapolations of inference from one population to other populations are not
“hard science” but rather a matter of knowledge and reasoning and,
consequently, they are subjective. Variability of the sample mean is expressed
with the 95% CI of that mean that can be calculated from the SE. If the mean
weight in the example given previously is 80.25 kilograms and the SE of the
mean is 0.40, the upper and lower limits of the 95% CI can be calculated as
80.25 − (1.96 × 0.40) and 80.25 + (1.96 × 0.40), respectively. This implies that
the real population mean will be somewhere between 79.5 and 81.0 kilograms.
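The calculation is easily verified; a minimal sketch using the summary figures reported in Figure 12–1 for the 1,045 patients (mean 80.25 kg, SD 13.0):

# 95% CI of the mean weight from the summary figures in Figure 12-1.
from math import sqrt

mean, sd, n = 80.25, 13.0, 1045
se = sd / sqrt(n)                                 # standard error, about 0.40
print(f"95% CI: {mean - 1.96 * se:.1f} to {mean + 1.96 * se:.1f} kg")  # 79.5 to 81.0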
The 95% CI (or the precision of a study result) indicates the reproducibility of
measurements and reflects the range of values that estimates can have when
studies are repeated. If a study was repeated again and again, the 95% CI would
contain the true effect in 95% of the repetitions. A confidence interval for an
estimated mean extends either side of the mean by a multiple of the standard
error. The 95% CI is the range of values from mean − 1.96 SE to mean + 1.96
SE. SEs can also be used to test the statistical significance of a difference
between groups.
A common statistical test for continuous variables is the unpaired t-test, for
example, to estimate the significance of a difference in age between patients
with and without the metabolic syndrome. When a continuous variable is
compared before and after the intervention, a paired t-test is done. Paired and
unpaired t-tests assume that the difference (paired or between groups) represents
a simple shift in mean, with the variation remaining the same (same standard
deviation). Under these assumptions, the t-tests are approximately valid as long
as the sample size is sufficiently large, even for a skewed distribution. However,
in the case of skewed distributions, the difference between groups is typically
reflected in a shift in mean as well as a change in standard deviation: Often the
standard deviation then increases with increasing mean. The most appropriate
solution in most cases is to apply a transformation, for example, to analyze the
data on logarithmically transformed values. Then results can be represented in
relative instead of absolute changes. In the event that normality is very unlikely,
nonparametric variants of the paired and unpaired t-tests can be chosen, namely
the Wilcoxon signed-rank test and the Mann-Whitney U-test, respectively. Note,
however, that for these tests too the assumption is
that the difference represents a simple shift in mean, with the variation
remaining the same. To compare categorical variables, cross-tables with
corresponding chi-square analyses are chosen. In general, however,
epidemiologists prefer an estimation of a particular parameter and description of
its precision with a 95% CI instead of performing tests. We return to this issue
later in this chapter.
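The tests mentioned in this paragraph are available in any statistical package; the following sketch uses scipy on invented data (the group sizes, means, and cross-table counts are illustrative only):

# Unpaired t-test, its nonparametric counterpart, and a chi-square test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
age_with = rng.normal(62, 10, 500)       # e.g., age in patients with the syndrome
age_without = rng.normal(59, 10, 500)    # age in patients without it

t, p_t = stats.ttest_ind(age_with, age_without)        # unpaired t-test
u, p_u = stats.mannwhitneyu(age_with, age_without)     # nonparametric alternative

table = np.array([[350, 150],                          # invented 2x2 cross-table
                  [300, 200]])
chi2_stat, p_chi, dof, expected = stats.chi2_contingency(table)
print(p_t, p_u, p_chi)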
In a randomized clinical trial the baseline table presents the most important
prognostic factors according to the treatment arm. Here, differences should not
be tested for statistical significance and P values should not be calculated, as
differences in distributions between treatments reflect chance by definition
[Knol et al., 2012].
Continuous Outcome
In many studies, the outcome is a continuous variable such as blood pressure or
body weight. In the previously mentioned example in which the relationship
between the presence of metabolic syndrome and the extent of vascular disease
in patients with symptomatic atherosclerosis was investigated, the extent of
vascular damage was measured by ultrasound scanning of the carotid artery
intima media thickness (IMT), the percentage of patients with a decreased ankle-
brachial blood pressure index, and the percentage of patients with albuminuria.
As a first step in the comparison of the IMT of the patients with and without the
metabolic syndrome, the mean IMT and its standard deviation and standard error
are calculated for both groups. The mean IMT in patients with the metabolic
syndrome was 0.98 mm and in patients without the syndrome 0.92 mm (see
Table 12–1).
The standard deviation gives an impression of the underlying distribution in
the two groups. The mean ± 2 SD covers 95% of the observations in that
population. The mean ± 1.96 SE reflects the uncertainty of the estimated mean,
as shown in the SPSS® (SPSS, Inc., Chicago, IL) output in Table 12–2.
TABLE 12–1 Intima Media Thickness (in mm) Data for Metabolic Syndrome
TABLE 12–2 Independent Samples Test: Intima Media Thickness in Patients With and Without Metabolic
Syndrome
The t-test for unpaired samples estimates the likelihood that the means are
really different from each other, rather than the difference being due to chance.
From the SPSS output in Table 12–2, we can read that there are two possible
answers. The first line gives the results if we assume the variance to be equal and
the second line if variances are not assumed to be equal. In this situation the
variances are not equal, but under either assumption the conclusion is that
the IMTs are different and that the mean difference of 0.059 mm is statistically
significantly different from zero. Whether this also reflects a clinically
relevant difference is an entirely different matter.
If we need to adjust our result for confounding variables (e.g., age and sex),
there are several possibilities. We can adjust the mean IMT in both groups for
age and sex with a general linear model procedure (PlanetMath, 2012a). It will
provide us with adjusted mean IMTs in both groups, so that the remaining difference
cannot be explained by differences in age and sex between the patients with and
without the metabolic syndrome (see Table 12–3).
If we want to quantify the differences between the two groups (with and
without metabolic syndrome), we can also perform a linear regression in which
we define IMT as a dependent variable and the metabolic syndrome as a
“yes/no” (1/0) independent variable (PlanetMath, 2012b).
The regression coefficient of metabolic syndrome is 0.059 (see Table 12–4),
which means that in patients with the metabolic syndrome the mean IMT is
0.059 mm thicker. Exactly the same number is obtained when subtracting the
mean IMT in patients with and without the metabolic syndrome. Using the same
approach, we can now adjust for confounders such as age and gender and
directly obtain an adjusted difference (see Table 12–5).
The regression coefficient changed after adjusting for age and gender from
0.059 to 0.061. This means that in patients with the metabolic syndrome, the
mean IMT is 0.061 mm thicker when differences in age and gender are taken
into account. The section on linear regression later in this chapter explains the
principles of this type of analysis. The SPSS output presents the unstandardized
coefficient and the standardized coefficients. The latter refers to how many
standard deviations a dependent variable will change per standard deviation
increase in the predictor. Standardization of the coefficient is usually done to
determine which of the independent variables have a greater effect on the
dependent variable in a multiple regression analysis, when the variables are
measured in different units of measurement. However, the validity of this
interpretation is subject to debate, if only because the coefficients are unitless
[Simon, 2010].
TABLE 12–3 Intima Media Thickness (in mm) According to Metabolic Syndrome, Taking Gender and
Age into Account as Possible Confounders, Using a General Linear Model Procedure
a Covariates appearing in the model are evaluated at the following values: gender = 1.21, age = 59.66.
TABLE 12–4 Relationship Between Metabolic Syndrome and Intima Media Thickness, Using Linear
Regression Analysis
a Dependent variable: mean intima media thickness (mm).
TABLE 12–5 Relationship Between Metabolic Syndrome and Intima Media Thickness, Taking
Confounding by Age and Gender into Account, Using Linear Regression
a Dependent variable: mean intima media thickness (mm).
Discrete Outcome
In medicine, often the outcome of interest is a simple “yes/no” event or
continuous data are categorized in a structure that permits a “yes/no” outcome.
Instead of calculating the difference in blood pressure levels between two
groups, we can compare the percentage of patients above or below a particular
cut-off level. The study design dictates the data analysis “recipe.” In a
longitudinal study, such as a cohort study, absolute risks and relative risks can be
calculated, while in most case-control studies the odds ratios should be
calculated.
Relative risks can be easily calculated with a hand-held calculator. The
simplest layout for data obtained in a cohort study is summarized in Table 12–6,
if we assume there is no differential follow-up time.

TABLE 12–6 Layout of the Data Obtained in a Cohort Study
                      Disease
Determinant        Yes       No
Present             a         b
Not present         c         d

The absolute risk (cumulative incidence) for disease in the people with the
determinant is R+ = a/(a + b), while the absolute risk in people without the
determinant is R− = c/(c + d).
From these absolute risks, the relative risk (RR) can be calculated by dividing
both absolute risks:

RR = R+/R− = [a/(a + b)]/[c/(c + d)]

The formula for the standard error is:

SElnRR = √(1/a − 1/(a + b) + 1/c − 1/(c + d))

and the 95% CI of the relative risk is calculated from Equation 2:

95% CI = e^(lnRR ± 1.96 × SElnRR)   (2)
An example of a cohort study examining the association between previous
myocardial infarction and future vascular events including calculation of the
relative risk with confidence interval is shown in Box 12–4. The sampling in a
typical case-control study conducted in a dynamic population of unknown size
permits no direct calculation of absolute risks. Instead, the odds ratio can be
calculated. The odds ratio is the ratio of exposure to nonexposure in cases and
controls (Table 12–7). The odds ratio obtained in case-control studies is a valid
estimate of the incidence rate ratio one would obtain from a cohort study,
provided that the controls are appropriately sampled. However, in cohort studies
and randomized clinical trials, odds ratios are often also interpreted as risk ratios.
This is problematic because an odds ratio always overestimates the risk ratio,
and this overestimation becomes larger with increasing incidence of the outcome
[Knol et al., 2012].
BOX 12–4 Example of a Cohort Study on Prior Myocardial Infarction and Future Vascular Events
In a cohort study (N = 3288) in which patients with vascular disease are included, 218 patients
experienced a vascular event within 3 years. In the table, the occurrence of the event according to a
history of previous myocardial infarction (MI) is summarized.
The cumulative incidence in 3 years in patients with a previous MI is 95/858 = 11%; the cumulative
incidence in patients without a previous MI is 123/2430 = 5%. The relative risk (RR) is the ratio of both
risks (RpreviousMI = 95/858 divided by RnopreviousMI = 123/2430) = 2.1874. The SElnRR =
√(1/95 − 1/858 + 1/123 − 1/2430) = 0.13, and the 95% CI of the RR = e^(ln2.2 ± 1.96 × 0.13)
= e^(0.78845 ± 0.25607) = 1.7 to 2.8.
The relative risk in the underlying population will be between 1.7 and 2.8. Patients with symptomatic
vascular disease and a previous MI have 2.19 times the risk compared with patients with symptomatic
disease without a previous MI.
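The arithmetic of Box 12–4 can be wrapped in a small function; a sketch in plain Python, using the cell labels a, b, c, d of Table 12–6:

# Relative risk with 95% CI from a 2x2 cohort table (Box 12-4 data).
from math import exp, log, sqrt

def relative_risk(a, b, c, d):
    # a, b = events/non-events with the determinant; c, d = without it
    rr = (a / (a + b)) / (c / (c + d))
    se = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return rr, exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se)

rr, lo, hi = relative_risk(95, 858 - 95, 123, 2430 - 123)
print(f"RR = {rr:.2f} (95% CI {lo:.1f} to {hi:.1f})")   # RR = 2.19 (1.7 to 2.8)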
The odds of being exposed versus nonexposed are a/c in the cases and b/d
in the controls (see Table 12–7). The odds ratio (OR) is the ratio of these two odds:

OR = (a/c)/(b/d) = ad/bc

The formulas for the standard error of the odds ratio and the 95% CI
(Equation 3) are:

SElnOR = √(1/a + 1/b + 1/c + 1/d)

95% CI = e^(lnOR ± 1.96 × SElnOR)   (3)

Note that a logarithmic transformation is needed, just like when calculating the SE of the relative risk.
BOX 12–5 Example of a Case-Control Study on Oral Contraceptive and Peripheral Arterial Disease
The following data are taken from a study that investigated the relationship between oral contraceptive
use and the occurrence of peripheral arterial disease [Van den Bosch et al., 2003]. Of the women with
peripheral arterial disease (n = 39), 18 (46%) used oral contraceptives, while of the 170 women
without peripheral arterial disease only 45 (26%) used oral contraceptives. The layout of the data table
is as follows:
The odds ratio for having peripheral arterial disease is (18 × 125)/(21 × 45) = 2.4. The SElnOR =
√(1/18 + 1/45 + 1/21 + 1/125) = 0.37, and the 95% CI of the OR = e^(ln2.4 ± 1.96 × 0.37) = 1.17 to 4.90. The odds
ratio of 2.4 means that women who use oral contraceptives have 2.4 times the risk of developing
peripheral arterial disease compared to women who do not use oral contraceptives. If we repeat the
study 100 times, the odds ratio will have a value of between 1.17 and 4.90 in 95 out of 100 studies.
Adapted from Van den Bosch MA, Kemmeren JM, Tanis BC, Mali WP, Helmerhorst FM, Rosendaal FR,
Algra A, van der Graaf Y. The RATIO Study: oral contraceptives and the risk of peripheral arterial disease
in young women. J Thromb Haemost 2003;1: 439–444.
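The same can be done for Box 12–5; a sketch of the odds ratio calculation in plain Python:

# Odds ratio with 95% CI from a 2x2 case-control table (Box 12-5 data).
from math import exp, log, sqrt

a, b = 18, 45      # exposed cases, exposed controls
c, d = 21, 125     # nonexposed cases, nonexposed controls

or_ = (a * d) / (b * c)
se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo, hi = exp(log(or_) - 1.96 * se), exp(log(or_) + 1.96 * se)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
# OR = 2.38 (1.16 to 4.87); Box 12-5 rounds the OR to 2.4 first, giving 1.17-4.90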
Stratified Analysis
One way to address confounding is to do a stratified analysis, where the data are
analyzed in strata of the confounding variable. Consequently, in each stratum,
the effect of the confounder is removed and the determinant–outcome
relationship is estimated conditional on the confounder. The effect estimates for
the relationship between determinant and outcome are calculated in each
stratum. Next, the investigator compares the magnitude of the strata-specific
estimates before they are pooled in one summary estimate. Strata-specific
estimates can only be pooled when they are more or less comparable and have
the same direction and magnitude. If not, effect modification is likely to be
present. Then, the relationship has to be expressed for each stratum of the effect
modifier, and calculation of a single overall summary estimate may be of limited
use. Note that in that situation, confounding may still need to be removed from
each stratum.
To estimate the degree of confounding, the crude effect estimate is calculated
and compared with the pooled estimate adjusted for the confounding variable.
The pooled estimate, according to the Mantel–Haenszel method, is calculated with
the following formula:

ORMH = Σ(ai di/ni) / Σ(bi ci/ni)

where ai, bi, ci, and di are the cells of the 2 × 2 table in stratum i and ni is the total number of subjects
in that stratum.
BOX 12–6 Adjustment for the Confounder Age, Using the Mantel-Haenszel Approach in a Case-Control
Study on Oral Contraceptive Use and the Occurrence of Peripheral Arterial Disease
Mantel-Haenszel odds ratio for peripheral arterial disease (PAD) in relation to oral contraceptive use
in women [Van den Bosch et al., 2003].
The odds ratios in the different age strata are 3.0, 2.4, and 3.9, respectively. The age-adjusted
odds ratio (3.2) is quite different from the crude (2.0) odds ratio. This implies that age confounds the
relationship between oral contraceptive use and the occurrence of peripheral arterial disease.
Adapted from Van den Bosch MA, Kemmeren JM, Tanis BC, Mali WP, Helmerhorst FM, Rosendaal FR,
Algra A, van der Graaf Y. The RATIO Study: oral contraceptives and the risk of peripheral arterial disease
in young women. J Thromb Haemost 2003;1: 439–444.
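A sketch of the Mantel–Haenszel formula in Python follows. The three strata below are hypothetical counts, invented purely to illustrate the computation; they are not the RATIO study data:

# Mantel-Haenszel pooled odds ratio across strata of a confounder.
strata = [
    # (a, b, c, d): exposed cases, exposed controls, nonexposed cases, nonexposed controls
    (10, 20, 5, 30),     # hypothetical stratum 1
    (6, 15, 10, 60),     # hypothetical stratum 2
    (2, 10, 6, 80),      # hypothetical stratum 3
]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
print(f"OR_MH = {num / den:.2f}")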
Regression Analysis
In the event of one or two confounding variables and sufficient data, the Mantel-
Haenszel technique is suitable for adjustment for confounding. If there are more
confounders involved, the database quickly will not be of sufficient size to
perform a stratified analysis. To overcome this problem, a regression technique
such as linear, logistic, or Cox regression can be used
(PlanetMath, 2007). The occurrence relation and the type of the outcome
variable largely determine the choice of the technique. If the outcome is
measured on a continuous scale (blood pressure, weight, etc.), linear regression
analysis will be the first choice. If the outcome is dichotomous (yes/no), logistic
regression is usually chosen, while if time-to-event is the outcome (survival), a
Cox model will be used to estimate the effect measure. When logistic regression
is chosen, the effect measure is expressed as an odds ratio. Odds ratios can be
interpreted as risk ratios if the outcome occurs in less than 10% of the
participants. If the incidence of the outcome is higher than 10%, other methods
have to be used to estimate risk ratios such as log-binomial regression or Poisson
regression [Knol et al., 2012]. Both methods are available in SPSS, SAS, R, and
Stata [Lumley et al., 2006, Spiegelman & Hertzmark, 2005].
Linear Regression
Techniques for fitting lines to data and checking how well the line describes the
data are called linear regression methods. With linear regression, we can
examine the relationship between a change in the value of one variable (X) and
the corresponding change in the outcome variable (Y).
The simple linear regression model assumes that the relationship between
outcome and determinant can be summarized as a straight line. The line itself is
represented by two numbers, the intercept (where the line crosses the y-axis) and
the slope. The values of intercept and slope are estimated from the data:

Y = b0 + b1X1   (5)
In the observed relationship between IMT and age (Box 12–7), several
confounders that differ between subjects of different ages and are also related
to IMT can play a role, for example, sex. With regression analysis
we can control for confounding in an easy way by including the confounder in
the regression model. Assuming that we have enough data, we can extend
Equation 5 with several confounders.
BOX 12–7 Age and Intima Media Thickness in Patients with Symptomatic Atherosclerosis
In 1000 patients with symptomatic atherosclerosis the relationship between intima media thickness
(IMT) of the carotid artery and age (in years) is investigated. The intercept is 0.26 and the coefficient
of age is 0.01. The interpretation of the coefficient is that with each year increase in age the mean IMT
increases 0.01 mm. The R2 is a measure for the variation in Y (IMT) that is explained by X (age). The
precision of the coefficient is expressed with the 95% CI. The lower limit of the 95% CI (0.010)
means that if we were to repeat this study, the coefficient for age would generally (in 95 out
of 100 studies) not fall below 0.010.
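The crude and adjusted analyses described above can be reproduced with any regression software. A sketch using Python's statsmodels on simulated data (the variable names and coefficients are invented for illustration and do not reproduce the SMART data):

# Crude and age- and sex-adjusted difference in mean IMT, by linear regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1000
df = pd.DataFrame({
    "metsyn": rng.integers(0, 2, n),     # metabolic syndrome yes/no
    "age": rng.normal(60, 10, n),
    "male": rng.integers(0, 2, n),
})
df["imt"] = (0.26 + 0.01 * df["age"] + 0.06 * df["metsyn"]
             + 0.02 * df["male"] + rng.normal(0, 0.15, n))

crude = smf.ols("imt ~ metsyn", data=df).fit()
adjusted = smf.ols("imt ~ metsyn + age + male", data=df).fit()
print(crude.params["metsyn"], adjusted.params["metsyn"])   # crude vs. adjusted difference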
Logistic Regression
Linear regression is indicated when the outcome parameter of interest is
continuous; the independent variables can be either continuous or dichotomous.
When the outcome is discrete (e.g., diseased/nondiseased), logistic regression
analysis is suitable. Logistic regression is very popular because in medicine the
outcome variable of interest is often the presence or absence of disease or can be
transformed into a “yes” or “no” variable. A regression model in this situation
does not predict the value Y for a subject with a particular set of characteristics
(as in linear regression), but rather predicts the proportion of subjects with the
outcome for any combination of characteristics. The difference between linear
regression and logistic regression is that instead of predicting the exact value of
the dependent variable, a transformation of the dependent variable is predicted.
The transformation used is the logit transformation.
TABLE 12–8 Relationship Between Age and Intima Media Thickness in Patients with Atherosclerosis,
Adjusting for Sex
a Dependent variable: intima media thickness (mm).
The formula of the logistic model is:

ln[Y/(1 − Y)] = b0 + b1X1 + b2X2 + … + bkXk

where Y is the proportion of subjects with the outcome (e.g., the probability of
disease); (1 − Y) is the probability that they do not have the disease; ln[Y/(1 −
Y)] is the logit or log(odds) of disease; b0 is the intercept; and X1 is one of the
independent variables.
From the regression model, we can directly obtain the odds ratio because the
coefficient (b1) in the regression model is the natural logarithm of the odds ratio.
This is a major reason for the popularity of the logistic regression model.
Computer packages not only give the coefficients but also the odds ratios and
corresponding confidence limits. The 95% CI in the output in Box 12–8 shows
that 1 is not included in the interval, meaning that the relationship between
smoking and the presence of cardiovascular disease is significant at the 5%
level. Odds ratios are generally expressed in the literature by giving the value and
corresponding 95% CI, for example, OR is 1.9 (95% CI 1.5–2.3). In the
example in Box 12–8, the independent variable is entered as a discrete variable
(yes/no), but variables with more categories or continuous variables can also be
included in a logistic regression model.
In the example in Table 12–10, smoking has three categories—present
smoker, former smoker, never smoker; the never-smoker is chosen as reference
category. The outcome is coronary disease.
Some computer packages (e.g., the SPSS statistical software program) create
so-called dummies when variables have more categories. In other packages, the
user has to define the dummies before the variables can be included in the
model. If a variable has three categories, two new variables are needed to
translate that variable into “yes/no” variables. If categorical variables are
entered without recoding into dummies, the model will consider the
covariate as if it were a continuous variable. Now the regression coefficient
applies to a unit change, for example, from never smoking (0) to former smoker
(1), or from former smoker (1) to present smoker (2). Such a single coefficient,
however, makes no sense. Creating dummies is more useful and is also simpler.
Two new variables can be defined as smoking1 and smoking2, where smoking1
is 0 except when the subject is a former smoker and smoking2 is 0 except when
the subject is a present smoker. The following possibilities appear:

Category           smoking1   smoking2
Never smoker           0          0
Former smoker          1          0
Present smoker         0          1
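Packages that do not create the dummies automatically leave this step to the user; a short sketch with pandas (the column names are illustrative):

# Creating the two dummy variables for a three-category smoking variable.
import pandas as pd

df = pd.DataFrame({"smoking": ["never", "former", "present", "never"]})
dummies = pd.get_dummies(df["smoking"], prefix="smoking", dtype=int)
# keep the non-reference categories only (never smoker = reference)
df["smoking1"] = dummies["smoking_former"]
df["smoking2"] = dummies["smoking_present"]
print(df)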
BOX 12–8 Smoking Link with Cardiac Disease, Logistic Regression Analysis
In a cross-sectional study of 3,000 subjects, we estimated the relationship between smoking and the
presence of coronary disease with logistic regression analysis.
The regression equation for the model with one variable (smoking) is:
Logit (coronary disease) = −1.150 + 0.641 (smoking)
With this equation we can calculate the odds of coronary disease for smokers and for nonsmokers:
For smokers: logit (coronary disease) = −1.150 + 0.641
For nonsmokers: logit (coronary disease) = −1.150
Logit(smokers) − logit(nonsmokers) = 0.641
Odds ratio(smokers) = e^0.641 = 1.89
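A sketch of such an analysis in Python's statsmodels, on data simulated from the model in Box 12–8 (sample size and seed are arbitrary); the smoking coefficient is the natural log of the odds ratio:

# Logistic regression of coronary disease on smoking; OR = e^coefficient.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 3000
df = pd.DataFrame({"smoking": rng.integers(0, 2, n)})
logit_p = -1.150 + 0.641 * df["smoking"]          # the model from Box 12-8
df["chd"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

fit = smf.logit("chd ~ smoking", data=df).fit(disp=0)
b1 = fit.params["smoking"]
print(f"coefficient = {b1:.3f}, OR = {np.exp(b1):.2f}")   # roughly e^0.641 = 1.89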
TABLE 12–10 Smoking (in Three Categories) and Coronary Artery Disease
a Variable(s) entered on step 1: weight.
The coefficient of weight is 0.008 and the odds ratio (Exp B) is 1.008. This
means that for each kilogram increase in weight, the odds of coronary ischemia
increase by 0.8%. Often age is treated as a continuous variable as well. In
Table 12–12, the odds increase by 3.8% with each year increase in age.
The absolute probability (or risk) of the outcome for each subject can be
directly calculated from the logistic model by substituting the determinants X1,
X2, etcetera by the values measured in these patients, according to the following
formula (see also Box 12–9):

Y = 1/(1 + e^−(b0 + b1X1 + b2X2 + …))
Cox Regression
In many studies, not only the event but also the time to event is of interest. This
event may or may not have occurred during the observation period. If an event
did occur, it will have occurred at different intervals for each subject. For these
types of data, linear and logistic regression techniques are not sufficient because
it is not possible to include the time to event in the model [Steenland et al.,
1986]. Generally, these types of data are referred to as survival data, but apart
from death as an outcome, all kinds of “yes/no” events (e.g., disease progression,
discharge from hospital, the occurrence of an adverse event or a disease) can be
analyzed with survival techniques. If only one independent variable is
investigated, the Kaplan-Meier method of estimating a survival distribution can
be used. If more variables have to be included in the analysis, the Cox
proportional hazards regression model is needed.
a Variable(s) entered on step 1: age.
With logistic regression, the relationship between the risk of cardiac ischemia and age and gender was
examined. With each year increase in age the risk of the outcome increases 1.036 times, while males
have 2.38 times the risk of women (output below). For each patient the absolute risk in the observed
follow-up time for the presence of myocardial infarction can be calculated with the following formula:

Risk = 1/(1 + e^−(β0 + β1 × age + β2 × gender))

The coefficients are given in the printed output: β0 is the intercept, β1 is the age coefficient, and β2 is
the coefficient for gender. For a 60-year-old male, the risk of cardiac ischemia in the observed follow-
up time is:

Risk = 1/(1 + e^−(β0 + β1 × 60 + β2 × 1))
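Because the box does not report the intercept, the sketch below uses a hypothetical value for β0; β1 and β2 are derived from the reported odds ratios (ln 1.036 and ln 2.38):

# Absolute risk for a 60-year-old male from the fitted logistic model.
from math import exp, log

beta0 = -4.0            # hypothetical intercept, for illustration only
beta1 = log(1.036)      # coefficient per year of age
beta2 = log(2.38)       # coefficient for male gender

lp = beta0 + beta1 * 60 + beta2 * 1
print(f"predicted risk = {1 / (1 + exp(-lp)):.2f}")   # about 0.27 with this beta0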
TABLE 12–13 Sex, Age, and the Risk of Cardiovascular Events: Results from Cox Regression
FIGURE 12–2 Survival data from the SMART study of 3200 patients with symptomatic atherosclerosis.
There were 223 events.
With kind permission from Springer Science+Business Media: Simons PC, Algra A, van de Laak MF,
Grobbee DE, van der Graaf Y. Second Manifestations of ARTerial disease (SMART) study: rationale and
design. Eur J Epidemiol 1999;15:773–81.
These hazard functions are proportional to each other, and it is not necessary
to know the underlying hazard h0(t) in order to compare the four groups. From
the Cox model, coefficients and their standard errors can be estimated and
several computer packages generate the hazard ratios, a type of relative risk, as
well.
In the computer output in Table 12–13, the results of a Cox regression are
shown. An event was defined as a combined outcome of a nonfatal myocardial
infarction, a nonfatal stroke, or cardiovascular death. Data were collected in a
cohort of patients with elevated risk for cardiovascular disease. In this analysis,
gender and age are investigated as determinants of the outcome. Females
younger than age 60 are the reference group. A female older than age 60 has a
hazard ratio of 2.8 (b2, coefficient of age). This hazard ratio can be interpreted as
a relative risk, meaning that, compared to women younger than 60 years of age,
women older than 60 years have 2.8 times the risk for a cardiovascular event. A
male younger than age 60 (b1, coefficient of gender) has 1.7 times the risk
compared to a female younger than age 60. A male older than age 60 has the sum
of the coefficients of sex and age (e^(0.572 + 1.043) = e^1.615), leading to a hazard ratio of 5.0.
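A Cox model of this kind can be fitted, for instance, with the lifelines package in Python (assumed installed); the sketch below uses simulated survival data, and all names, coefficients, and the censoring scheme are invented for illustration:

# Cox proportional hazards regression on simulated data with lifelines.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(11)
n = 3000
df = pd.DataFrame({"male": rng.integers(0, 2, n),
                   "age_over_60": rng.integers(0, 2, n)})
hazard = 0.02 * np.exp(0.57 * df["male"] + 1.04 * df["age_over_60"])
df["time"] = rng.exponential(1 / hazard)
df["event"] = (df["time"] < 3.0).astype(int)      # administrative censoring at 3 years
df.loc[df["event"] == 0, "time"] = 3.0

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(np.exp(cph.params_))                        # hazard ratios for male and age_over_60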
FIGURE 12–3 British mathematician, Reverend Thomas Bayes, whose solution to “inverse probability”
was published posthumously.
Courtesy of the MacTutor History of Mathematics Archive, University of St. Andrews, Scotland, United
Kingdom. Available at http://www-history.mcs.st-andrews.ac.uk/history/Biographies/Bayes.html. Accessed
July 11, 2007.
Statistics as a discipline remains sharply divided, even on the fundamental definition of “probability.”
The frequentist’s definition sees probability as the long-run expected frequency of occurrence. P(A) =
n/N, where n is the number of times event A occurs in N opportunities. The Bayesian view of
probability is related to degree of belief. It is a measure of the plausibility of an event given
incomplete knowledge. A frequentist believes that a population mean is real, but unknown, and
unknowable, and is one unique value that needs to be estimated from the data. Knowing the
distribution for the sample mean, he constructs a confidence interval, centered at the sample mean.
Here it gets tricky. Either the true mean is in the interval or it is not. Thus, the frequentist cannot say
there is a 95% probability that the true mean is in this interval, because it is either already in, or it is
not. And that is because to a frequentist the true mean, being a single fixed value, does not have a
distribution. The sample mean does. Thus, the frequentist must use circumlocutions like “95% of
similar intervals would contain the true mean, if each interval were constructed from a different
random sample like this one.”
Bayesians have an altogether different worldview. They say that only the data are real. The population
mean is an abstraction, and as such some values are more believable than others based on the data and
their prior beliefs. (Sometimes the prior belief is very noninformative, however.) The Bayesian
constructs a credible interval, centered near the sample mean, but tempered by “prior” beliefs
concerning the mean. Now the Bayesian can say what the frequentist cannot: “There is a 95%
probability that this interval contains the mean.”
In summary, probability according to a frequentist can be defined as the long-run fraction having a
characteristic, while according to a Bayesian it can be considered a degree of believability.
In caricature, a frequentist is a person whose long-run ambition is to be wrong 5% of the time, while a
Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes
he has seen a mule.
The Bayesian approach makes the subjective and arbitrary elements shared by
all statistical methods explicit through a prior probability distribution. Bayesian
analysis thus requires that these prior beliefs be explicitly specified. This could
be done using empirical evidence available before the next study is conducted,
insights into mechanisms that make the presence of an association likely, or any
other belief or knowledge obtained without the data generated by the new study.
One clear consequence is that Bayesian results can only be interpreted
conditional on the prior belief. Hence, if there is no universal agreement on the
prior belief, the same data will necessarily lead to different estimates, different
credible intervals, and different conclusions among groups that have different
prior beliefs. To put it another way, before you believe the results of a
Bayesian analysis as presented in a paper, you first have to commit yourself to
the prior belief that was used as a basis. This problem is often avoided by using
“vague” or noninformative priors, but then in fact the advantage of using
directed and real prior data or belief is lost.
It has been argued that Bayesian statistical techniques are difficult, but they
are not necessarily more complicated than frequentist techniques. They do
require more intensive computation, even in cases that can be approximated in
the frequentist setting. This is due to the fact that combining the prior
distribution with the data can only be done in a straightforward analytical way
under restrictive assumptions that do not allow the flexibility for prior beliefs
that is needed in practice. Most current statistical computer packages are
implicitly based on the frequentist’s way of thinking about hypothesis testing, P
values, and confidence intervals. However, even with standard frequentist
software, it is possible to approximate Bayesian analyses and incorporate prior
distributions of the data, for example, by inverse variance weighting of the prior
information with the frequentist estimate [Greenland, 2006]. In the statistical
data analysis of clinical epidemiologic data there is room for both frequentist and
Bayesian approaches, with the Bayesian approach being perhaps more natural to
medical reasoning [Brophy & Joseph, 1995]. To promote the use of Bayesian
analyses, however, both the understanding of Bayesian concepts and analyses
and the accessibility of Bayesian statistics in data analysis software packages
need to be improved.
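The inverse variance weighting mentioned above is straightforward to carry out by hand; a sketch for a log odds ratio, with invented numbers for the prior and the new study:

# Approximate Bayesian updating by inverse-variance weighting [Greenland, 2006].
from math import exp, log, sqrt

prior_lnor, prior_se = log(1.5), 0.30     # prior belief: OR about 1.5 (invented)
data_lnor, data_se = log(2.4), 0.37       # frequentist estimate from the new study

w_prior, w_data = 1 / prior_se ** 2, 1 / data_se ** 2
post = (w_prior * prior_lnor + w_data * data_lnor) / (w_prior + w_data)
post_se = sqrt(1 / (w_prior + w_data))
print(f"posterior OR = {exp(post):.2f} "
      f"(95% CI {exp(post - 1.96 * post_se):.2f} to {exp(post + 1.96 * post_se):.2f})")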
References