Modern Methods
for Epidemiology
Editors

Yu-Kang Tu
Division of Biostatistics
Leeds Institute of Genetics, Health and Therapeutics
University of Leeds
Leeds, UK

Darren C. Greenwood
Division of Biostatistics
Leeds Institute of Genetics, Health and Therapeutics
University of Leeds
Leeds, UK
Statistical methods are important tools for scientific research to extract information
from data. Some statistical methods are simple whilst others are more complex, but
without such methods our data are just numbers and useless to our understanding of
the world we are living in. In epidemiology, researchers use more advanced and
complex statistical methods than colleagues who work with experimental data,
which are usually collected under more controlled conditions than can be achieved
with the larger, more “real-life” datasets of observational research. The issues
of observational data are not just about the amount of data but also the quality of
data. Epidemiological data usually contains missing values in some variables for
some patients, and the instruments used for data collection may be less accurate or
precise than those used for experimental data. Therefore, textbooks of epidemiology
often contain much discussion of statistical methods for dealing with those problems
in analysis and interpretation of data, and very often they also contain some
discussion of the philosophy of science. This is because elaborating causes and
their consequences from observational data usually requires certain epistemological
theories about what constitutes “causes” and “effects”.
Routine applications of advanced statistical methods on real data have become
possible in the last 10 years because desktop computers have become much more
powerful and cheaper. However, a proper understanding of the challenging statistical
theory behind those methods remains essential for correct application and interpretation,
yet such understanding is rarely evident in the medical literature. This textbook contains a general
introduction to those modern statistical methods that are becoming more important
in epidemiological research, to provide a starting point for those who are new to
epidemiology, and for those looking for guidance in more modern statistical
approaches. For those who wish to pursue these methods in greater depth, we
provide annotated lists of further reading material, which we hope are useful for
epidemiological researchers who wish to overcome the mathematical barrier of
applying those methods to their research.
The Centre for Epidemiology and Biostatistics at the University of Leeds,
United Kingdom, where we have been working for many years, has a masters
Leeds, UK Yu-Kang Tu
Darren C. Greenwood
1 Confounding and Causal Path Diagrams
The issue of causation is a challenging one for epidemiologists. Politicians and the
public want to know whether something of concern causes a disease or influences
the effectiveness of healthcare services. However, the training provided to
statisticians, and to scientists more generally, tends to stress that non-experimental
research will only ever offer evidence for association and that suitably designed
experimental studies are required to offer robust evidence of causation. In the real
world, where experimental data are rare, difficult or impossible to produce, the
extent to which associations between variables can and should be interpreted as
evidence of causality is less a technical question than a philosophical, moral,
cultural or political one. These issues have been discussed at some length elsewhere
(see for example Susser 1973; and Pearl 1998, 2000), and although these influence
the extent to which associational evidence from non-experimental studies is (and
should be) used in real-world settings, the following Chapter will focus on the more
technical issue of strengthening the causal inferences drawn from non-experimental
data by using causal path diagrams when designing and describing the analysis of
data from non-experimental studies. In this chapter we will introduce causal path
diagrams (specifically Directed Acyclic Graphs; DAGs) and explore the issue of
confounding.
A causal path diagram is a visual summary of the likely (and, where relevant, the
speculative) causal links between variables. Constructing these diagrams is based
on a priori knowledge and, in the case of speculative and hypothesised relationships
being explored in the analysis, on conjecture. Causal path diagrams have been used
informally for many years in causal analysis and in recent years have been formally
developed for use in expert-systems research (Greenland et al. 1999). Although
such diagrams are beginning to be adopted by the epidemiological community
(Hoggart et al. 2003; Hernández-Díaz et al. 2006; Shrier and Platt 2008; Head et al.
2008, 2009; Geneletti et al. 2011; Tu and Gilthorpe 2012), a causal diagram is still a
novel epidemiological tool which can be used in a variety of ways: to think clearly
about how exposure, disease and potential confounder variables, relevant to the
research hypothesis, are related to each other; to communicate these inter-
relationships to academic and professional audiences; to indicate which variables
were important to measure; and to inform the statistical modelling process –
particularly the identification of confounding, confounders and competing
exposures.
In this Chapter we discuss the use of causal path diagrams (Pearl 2000), specifically
Directed Acyclic Graphs (DAGs), to develop models that can inform the analysis of
one variable (the ‘exposure’) as a potential cause of another (the ‘outcome’). Within
epidemiology, such analyses include exploring: the potential role of risk factors
(as ‘exposures’) in the aetiology of disease (where the ‘outcome’ is the prevalence,
incidence or severity of disease); and the role of specific characteristics of healthcare
systems (where these characteristics are the ‘exposures’) in the effective and efficient
delivery of health services (where this constitutes the ‘outcome’).
The nomenclature of DAGs is still evolving, and can be off-putting to the uninitiated,
particularly when accompanied by statistical notation (such as that developed by
Geneletti et al. (2009)). However, the terminology that is developing helps to specify
each of the components of DAGs in a way that facilitates their consistent application
and further utility. And, with this in mind, we have provided a comprehensive
glossary of terms in Table 1.1, and a more detailed explanation of these below.
Nodes in the diagram represent variables, and the arcs joining them represent
known (i.e. from a firm grasp of established functional biological, social or clinical
relationships between variables); likely (i.e. from previous robust empirical studies);
or speculative (i.e. hypothesised) relationships between any two variables, with an
arrow representing causality – the direction of causality following the direction of
the arrow. For example, ‘X causes Y’ would be represented as X → Y, where X and Y
are nodes (or vertices) and the arrow between them is an arc (or edge or line).
DAGs are usually depicted with the nodes arranged in a temporal and thus causal
sequence, with the preceding variables to the left of the diagram and subsequent
variables to the right. This is not mandatory, but can help when deciding which of
two closely related variables precedes the other and acts as its cause. A node
immediately preceding another node to which it is connected (i.e. a node at the
non-arrow end of an arc) is known as a parent of the node at the arrow end of the
arc, which is in turn known as a child. Thus, in the example X → Y, X is the parent
node and Y is the child. Similarly, a node ‘preceding’ another node but connected to
it via at least one other node is known as an ancestor of that node, which in turn is
known as its descendent. Therefore,
in the example X → Y → Z, X is the ancestor of Z, and Z is the descendent of
X; while Y (which is a child of X and a parent of Z) lies on the causal pathway
between X and Z.
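These kinship terms can also be checked in software. The short sketch below uses the R package dagitty (not discussed in this chapter) to encode the example X → Y → Z and query it; the node names are simply those of the example:

  # encode the example DAG X -> Y -> Z and query kinship terms (R, dagitty package)
  library(dagitty)
  g <- dagitty("dag { X -> Y ; Y -> Z }")
  parents(g, "Y")      # X is the parent of Y
  children(g, "Y")     # Z is the child of Y
  ancestors(g, "Z")    # X and Y are ancestors of Z (the node itself may also be listed)
  descendants(g, "X")  # Y and Z are descendents of X (the node itself may also be listed)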
Identification of Arcs
All arcs in a DAG reflect a priori presumptions about cause and effect in a specific
context. Some of these presumptions will be based on known causal relationships
between variables (drawing on established functional biological, social and clinical
processes); others on likely causal relationships (drawing, for example, on the
statistical findings of previous robust empirical studies); as well as speculative
relationships (drawing on unsubstantiated hypotheses – including the specific
hypotheses being tested in the analyses). These arc-related presumptions cannot
(and should not) be inferred empirically from data on which the analyses will be
conducted, but must be drawn from established mechanisms or strong research
evidence, both of which are crucial for developing an accurate DAG as the basis on
which suitable statistical analyses can then be designed (Tu et al. 2004; Weinberg
2005; Tu and Gilthorpe 2012).
Fig. 1.1 An example of Directed Acyclic Graphs (three panels a–c, each containing nodes E, O, C, A and B). Key to variables: E exposure, O outcome, A, B, C additional variables
In most research studies the causal pathways described and summarised within
causal path diagrams are not established (i.e. ‘proven’) causal relationships, but
are in the main based on evidence from whatever previous studies are available.
Proof in this context is essentially more of a philosophical than a scientific
concept, and can be subject to intense debate. The pathways included in the
diagrams are therefore often based on: (i) incomplete or predominantly theoretical
understanding (rather than established knowledge) of the functional relationships
between the variables involved; (ii) the statistical findings of empirical research
which may not themselves be definitive; and (iii) hypotheses based on putative,
tentative or speculative beliefs about the sorts of relationships that exist – not least
the one between the exposure(s) and the outcomes that the study set out to
address. These three very different ingredients involved in the conceptualisation
of causal pathways are important to recognise as they influence both: the extent to
which different causal path diagrams can be drawn for the same variables
(reflecting different views of what is known, likely or speculative) and the extent
to which these different diagrams might be more (or less) useful for generating
robust evidence of causality between two or more specific variables. Despite this,
DAGs are useful because they force researchers to make explicit their
presumptions about the relationships between pairs of variables, whether or not
these presumptions prove to be correct. Other analysts are then able to critique,
(re)interpret and (where necessary) repeat and improve on the analyses
conducted, based on different presumptions or firmer knowledge of the causal
relationships involved.
However, alongside their assumption that no variable can be its own cause
(which, as mentioned earlier, reduces the utility of DAGs for modelling systems
containing feedback loops), a key limitation of DAGs is that they will only ever be
able to include variables (as nodes) that are (as Donald Rumsfeld would have it)
‘knowns’ (i.e. are recognised as conceptual entities within the epistemological
context concerned). Likewise, analyses based on DAGs will only ever be
able to include those variables for which data are available (i.e. that have been
measured – in Donald Rumsfeld’s parlance, ‘known knowns’). This is a fundamen-
tal limitation of all analyses of data from non-randomised non-experimental stud-
ies, not least because unknown or unmeasured confounders cannot be taken
into account when modelling or analysing potential causal relationships. Nonethe-
less, using DAGs to identify the most appropriate statistical analyses for any given
set of measured variables will reduce the likelihood that these are subject to
confounding (from known and measured confounders) and help others to critique,
(re)interpret and (where necessary and possible) repeat and improve on the analyses
conducted. These then are the core strengths of using DAGs to design the analysis
of data from non-experimental studies – strengths we explore in greater detail in
Sect. 1.2.4, below.
Meanwhile, another potential limitation of DAGs is that, despite the potential
for visual complexity (particularly for those DAGs with more than a handful of
nodes), they are essentially an oversimplification of the causal relationships
between variables. For example, a causal diagram does not indicate whether
an effect is harmful or protective or whether effect modification is actually
occurring (Hernan et al. 2004 – although Weinberg 2007 recently suggested how
DAGs might be modified to include this), nor does a causal diagram identify
whether a cause is sufficient or necessary to elicit the outcome(s) involved
(Rothman 1976). Nonetheless, it bears restating that one of the key strengths
of such diagrams is that they enable researchers to think clearly and logically
about the research question at hand, and to make explicit any presumptions that
are being made about the (presumed) relationships between the pairs of variables
involved. This visual summary can then be used as an aid to communicate
these inter-relationships to academic and professional audiences and to explicitly
identify, for example, if important variables or relationships are missing from
or misrepresented in the diagram or, indeed, whether any of the presumed
relationships are contentious.
1.1.3 Notation
One way to represent the statistical relationships between variables (as nodes) in
causal diagrams is to use the notation developed by Geneletti and colleagues
(2009). For example, the notation A ⫫ B | C signifies that A is independent of B given
C, where A, B and C are known variables. The DAG represented in
Fig. 1.1 consists of 5 variables: E the exposure of interest, O the outcome of interest
and 3 other additional variables A, B, and C.
In Fig. 1.1a the exposure, E, causes the outcome, O. This can be represented as
O ⫫̸ E, that is, O is not independent of E.
Traditionally, three conditions must hold for a variable to act as a confounder. First,
the variable concerned must be associated with the outcome (whether or not as a
cause) in unexposed subjects (i.e. it must be a ‘risk factor’). Second, the variable concerned must
be correlated with the exposure variable within the study population concerned.
Finally, the variable concerned must not be situated on any causal pathway between
exposure and outcome (Hennekens and Buring 1987). More recently, the last of these
three conditions has been replaced with an even stricter one: the variable concerned
must not be an effect of the exposure (McNamee 2003).
Confounding can exist at the level of the population, or as a consequence of a
biased sample. This is an important point; the consideration of confounding should
not be solely based on a study sample, indeed it may be the case that apparent
confounding in a study is due to sampling and is not true confounding in the
population as a whole. Many studies are often able to identify more than one
relevant confounder in their analyses, and we will discuss later how one might
establish whether the analyses have accounted for a sufficient set of confounders (or
whether too few/too many have been included in the analyses: see Sect. 1.2.3,
below).
We may have a situation where E ! O and A ! O, but there is no association
between E and A. This happens in a successfully randomised controlled trial (RCT)
where baseline variables (A) are balanced between groups – so A is independent of
E (due to the success of randomisation for treatments). Nonetheless, because A is a
competing exposure for O, the precision with which the relationship between E and
O is characterised improves after adjusting for A.
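This gain in precision is easy to see with simulated data. The sketch below uses arbitrary illustrative values (it is not drawn from any study): E is randomised, A is a competing exposure independent of E, and adjusting for A shrinks the standard error of the estimated effect of E without changing what is being estimated:

  # simulated RCT: E randomised, A a baseline competing exposure independent of E
  set.seed(1)
  n <- 2000
  E <- rbinom(n, 1, 0.5)                        # randomised exposure
  A <- rnorm(n)                                 # competing exposure, independent of E
  O <- 0.3 * E + 0.8 * A + rnorm(n)             # outcome caused by both
  summary(lm(O ~ E))$coefficients["E", ]        # unadjusted: unbiased but larger SE
  summary(lm(O ~ E + A))$coefficients["E", ]    # adjusted for A: same expectation, smaller SE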
In order to explain how DAGs can be used to determine whether there is potential
for confounding in the apparent relationship between an exposure and an outcome
let us first use a simple DAG as an example (see Fig. 1.1a).
As this example shows, there can be more than one minimally sufficient set (S).
However, these sets may also vary in size and may not necessarily overlap
(Greenland et al. 1999). It can therefore be helpful to identify all minimally
sufficient sets so that the best one can be chosen for dealing with confounding
through adjustment. For example, some sets may need to be rejected if they contain
variables that were not measured in the study. Others may be rejected due to
concerns about measurement error, or because they contain many more variables
than other sets and would thereby generate less precise estimates from multivari-
able statistical analyses on the sample sizes available. As such, an important
advantage of using DAGs over traditional approaches to identifying potential
confounding is that the latter are usually unable to identify any of the potential
sufficient subsets of potential confounders, and all potential confounders would
therefore need to be included in the analysis (at cost to the precision of the estimates
produced).
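Software can enumerate minimally sufficient sets directly. The sketch below uses the R package dagitty on a small hypothetical DAG in which the back-door path E ← C1 → C2 → O can be blocked by adjusting for either C1 or C2, so that two minimally sufficient sets exist:

  # hypothetical DAG: E <- C1 -> C2 -> O, and E -> O (R, dagitty package)
  library(dagitty)
  g <- dagitty("dag { C1 -> E ; C1 -> C2 ; C2 -> O ; E -> O }")
  adjustmentSets(g, exposure = "E", outcome = "O", type = "minimal")
  # two minimally sufficient sets, { C1 } and { C2 }:
  # adjusting for either one blocks the back-door path from E to O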
As we have shown in this chapter, DAGs can be used to identify confounding and
confounders in a systematic way, and by helping researchers to identify these
objectively and explicitly, DAGs can help to reduce bias and advance debate.
Moreover, despite the various limitations mentioned earlier in this Chapter
(see Sect. 1.1.2, above), one of the main strengths of using causal path diagrams
in epidemiological analyses of data from non-experimental studies is that it enables
researchers to think clearly and logically about the known, likely and speculative
causal relationships between variables that are relevant to the research hypothesis
and related analytical questions. Causal path diagrams thereby facilitate the com-
munication of any causal presumptions that have been made during data analysis to
academic and professional audiences using a structured approach that is explicit
and easy to critique or re-model.
DAGs also enable the identification of variables that are important to measure in
a prospective research study, and thereby improve the efficiency of both data
collection and statistical analyses by avoiding the unnecessary measurement or
inclusion of variables that are irrelevant to the study and its analysis.
Nonetheless, a somewhat surprising feature of tackling confounding using
DAGs is that incorrect specification of the model can itself create more
problems than it solves. For example, bias may be introduced by including variables
that are consequences of the exposure, while additional confounding may be
created by including variables that are common descendents of other confounders.
Likewise, as we saw earlier, stratification may lead to key changes to some of the
paths within the DAG, and these changes may lead to previously blocked paths
becoming unblocked and causing further confounding. However, both of these
potential flaws can be put to good use in identifying whether adjustment for specific
confounders might create new associations between variables that may generate
further confounding that will also need to be addressed. As such, these features are
arguably an additional strength of using DAGs in analytical design.
One important weakness of DAGs is that with increasing numbers of highly
inter-related variables they can rapidly become visually complex to read.
DAGs also represent an inherent oversimplification of causal relationships between
variables as they do not indicate whether: any relationships are positive or negative
(e.g. harmful or protective); effect modification might occur; each causal relation-
ship is weak or strong; and some of the variables might only be able to cause an
effect in combination with other variables.
Moreover, as with all causal models, DAGs are only as good as the functional
and empirical knowledge and speculative hypotheses on which they are based.
In particular, DAGs may be based on a set of presumptions that are wrong (either as
a result of incorrect knowledge, weak empirical evidence or fallacious hypotheses).
However, because DAGs ensure that these presumptions are explicitly stated, the
key benefit of DAGs is that they facilitate criticism, (re)interpretation and (where
necessary) modification of the model to assess whether different conclusions would
be reached about: which variables are true confounders (see Chap. 11 on structural
equation modelling); and which subset of variables are best to adjust for in order to
address confounding while taking into account the availability and quality of data
on each of the variables involved.
1.3 Conclusions
References
Geneletti, S., Richardson, S., & Best, N. (2009). Adjusting for selection bias in retrospective,
case-control studies. Biostatistics, 10, 17–31.
Geneletti, S., Gallo, V., Porta, M., Khoury, M. J., & Vineis, P. (2011). Assessing causal
relationships in genomics: From Bradford-Hill criteria to complex gene-environment
interactions and directed acyclic graphs. Emerging Themes in Epidemiology, 8, 5.
Greenland, S., & Morgenstern, H. (2001). Confounding in health research. Annual Review of
Public Health, 22, 189–212.
2.1 Introduction
We illustrate the issues raised by missing data using Fig. 2.1, which shows the
frontage of a high-level mandarin’s house in the New Territories, Hong Kong.
First, we notice missing data can either take the form of completely missing
figurines, or damaged— i.e. partially observed—figurines. The former is analogous
to what is usually termed unit non-response, while the latter is analogous to item
non-response. However, the statistical issues raised are the same in both cases.
For simplicity, we therefore assume there are no completely missing figurines.
Next, we see that the effect of missing data on any inference depends crucially
on the question at hand. For instance, if interest lies in the position of the figurines in
the tableau shown in Fig. 2.1, then missing data are not a problem. If, instead,
interest is in the height, or facial characteristics of the figurines, then missing data
raises issues that have to be addressed. Thus, when assessing the impact of missing
data it is not the number, or proportion, of missing observations per se that is key,
but rather the extent of the missing information about the question at hand. Changing
the example, if we are interested in the prevalence of a rare disease, missing the
disease status of two individuals—potentially non-randomly—out of 1,000 means
we have lost a substantial amount of information, since if both were in fact cases
the estimated prevalence could change markedly.
Now suppose we are interested in estimating a facial characteristic—say average
hair length—of the four figurines shown. Two are missing their heads, and we
cannot be sure why. In order to estimate the average hair length we need to make an
assumption about why the two heads are missing, and/or how their mean hair length
relates to those whose heads are present. Our assumptions must take one of the
following three forms:
1. the reason for the missing heads is random, or at any rate unconnected to any
characteristics of the figurines;
2. the reason for the missing heads is not random; but within groups of ‘similar’
figurines (e.g. with similar neckties) heads are missing randomly, or
3. the reason for the missing heads is not random, and—even within groups of
apparently similar figurines—depends on hair length (i.e. depends directly on
what we want to measure).
In case 1, the ‘data’ (hair length) are said to be Missing Completely At Random
(MCAR). What is usually termed the missingness mechanism may depend on the
position of the figurines relative to missing tiles in the roof above, but is indepen-
dent of information relevant to the question at hand. Under this assumption there is
no difference in the distribution of hair length between the figurines with and without
heads, so we can get a valid estimate using the complete cases (i.e. figurines with heads).
In case 2, the data are said to be Missing At Random (MAR). The reason for the missing
data (hair length) may depend on the unseen value (hair length), but we can form groups
based on observed data (e.g. necktie) within which the reason for the missing data
does not depend on the unseen value (the missing hair length). If we assume hair length
is MAR given necktie, we can estimate hair length among figurines with straight
Fig. 2.1 Mandarin’s house, New Territories, Hong Kong (Photo H. Goldstein)
neckties, and among those that end in a bobble. We can then calculate a weighted
average of these—weighting by the number with each kind of necktie—to estimate
mean hair length across the ‘population’ of figurines.
In case 3, the data are said to be Missing Not At Random (MNAR). In this case,
we cannot estimate average hair length across the figurines without knowing either
(i) the relationship between the chance of a headless figurine and hair length or (ii)
the difference in mean hair length between figurines with, and without, heads.
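The consequences of the three assumptions for a complete-case estimate, and for an estimate stratified on an observed variable, can be mimicked with simulated data. The sketch below uses made-up numbers in the spirit of the figurine example; nothing in it comes from Fig. 2.1:

  # simulate 'hair length' that differs between two observed 'necktie' groups
  set.seed(2)
  n <- 10000
  necktie <- rbinom(n, 1, 0.5)
  hair <- 5 + 3 * necktie + rnorm(n)                           # true mean is about 6.5

  miss_mcar <- rbinom(n, 1, 0.3)                               # 1: missing completely at random
  miss_mar  <- rbinom(n, 1, ifelse(necktie == 1, 0.5, 0.1))    # 2: missing at random given necktie
  miss_mnar <- rbinom(n, 1, plogis(hair - 6.5))                # 3: missing depending on hair itself

  cc <- function(m) mean(hair[m == 0])                         # complete-case mean
  strat <- function(m) {                                       # weighted average over necktie groups
    grp <- tapply(hair[m == 0], necktie[m == 0], mean)
    sum(grp * table(necktie) / n)
  }
  round(c(truth = mean(hair),
          cc_mcar = cc(miss_mcar),
          cc_mar = cc(miss_mar),   strat_mar = strat(miss_mar),
          cc_mnar = cc(miss_mnar), strat_mnar = strat(miss_mnar)), 2)
  # complete cases are fine under MCAR, biased under MAR (corrected by stratifying
  # on necktie), and biased under MNAR even after stratifying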
This terminology was first proposed by Rubin (1976), and despite the slightly
counter-intuitive meaning of ‘Missing At Random’ it is now almost universally
used. We now highlight two things, implicit in the above discussion, which are
universal in the analysis of partially observed data.
Given Fig. 2.1, we do not know which of the assumptions 1–3 above is correct;
furthermore each has different implications for how we set about validly estimating
mean hair length. Therefore, the best we can do is state our assumptions clearly,
arrive at valid inference under those assumptions, and finally report how inference
18 J.R. Carpenter et al.
varies with the assumptions. The latter is referred to as sensitivity analysis, and is
fundamental to inference from partially observed data. We hope that our inference
is pretty robust to different assumptions about the missing data, so that we can be
fairly confident about our conclusions. However, as we cannot verify our
assumptions using the data at hand, our readers can reasonably expect to be told
whether this is indeed the case.
Each of the assumptions 1–3 above makes a statement both about the probabilistic
mechanism causing the missing data (which we refer to as the missingness
mechanism) and about the distribution of the missing data given the observed data.
To see this, suppose that Y is hair length, X is the set of
characteristics of the body (observed on all figurines) and R = 1 if the head is
present and 0 if absent.
Under MCAR, the chance of R = 1 given X, Y—for which we use the notation
[R|X, Y]—does not depend on X or Y, that is [R|X, Y] = [R]. This means that the
distribution of Y given X does not depend on R. More formally, using the definition
of conditional probability,

  [Y | X, R] = [R | X, Y] [Y | X] / [R | X] = [Y | X],   (2.1)

since under MCAR [R | X, Y] = [R] = [R | X].
Thus the missingness mechanism tells us about the distribution of the missing
data given the observed, and vice versa.
A similar argument gives (2.1) if data are MAR, for then the chance of R = 1
does not depend on Y once we take X into account, so that [R | Y, X] = [R | X]. Thus,
if data are MCAR or MAR, the distribution of the partially observed variables (hair
length) given the fully observed ones (body characteristics) is the same across
individuals, regardless of whether—for a particular individual—the partially
observed variable (hair length) is seen or not.
However, this relationship does not hold if data are MNAR. In that case
the chance of R = 1 depends on both X and Y, and this means that the distribution
[Y | X] is different depending on whether Y is observed or not (i.e. whether R = 1
or not). This makes MNAR analyses more difficult, as we either have to say (i)
exactly how [R | Y, X] depends on Y and X, or (ii) exactly how [Y | X] differs according to
R—i.e. whether Y is observed or not.
The aim of this Section was to use a pictorial example to sketch out the intuition
behind the standard jargon in the missing data literature. This is a key step in
understanding its relevance to the analysis of any partially observed dataset. It is
also important to bear in mind the question the analysis is addressing, and how the
answers might be affected by plausible missing data mechanisms.
The next Section elaborates this further.
Analyses of a partially observed data set which include only individuals with no
missing data (or at least no missing data on the variables in the current model) are
often called ‘Complete Case’ analyses.
The question in the section heading is often posed, but given the discussion
above is not appropriate as it stands. Instead, the question is whether a complete
case estimator is appropriate given the inferential question at hand and assumptions
about the missing data mechanism. An important secondary question concerns the
efficiency of a complete case estimator, relative to other estimators such as
those obtained using multiple imputation. We now discuss this further, taking a
simple setting as an example.
Suppose we have four variables, W, X, Y, Z. In a more general setting,
these could be groups of covariates. Let our model of interest be the regression of
Y on X and Z. We consider two situations: first that the response Y is partially
observed but the other variables are complete, and second that the covariate X is
partially observed.
Given the results in the previous Section we know that, if Y values are MCAR, then
the complete case analysis is unbiased. Now suppose Y is MAR given X and Z.
In this situation, the complete case analysis is also unbiased. To see this, we note
that the contribution to the likelihood for an individual with missing data is simply
the likelihood with the missing data integrated out. With a missing response it is
thus

  ∫ [Y | X, Z] dY = 1,   (2.2)

so such individuals contribute no information about the regression of Y on X and Z.
This also means that in this setting a complete case analysis is efficient.
Next, we suppose that W is predictive of the response Y being missing, so that
Y is MAR given W and possibly X and Z, but values of W are independent of Y
(this also means W is not a confounder). This situation may occur, for example,
when W describes how the data collection process has changed over time, but this
administrative change is unrelated to the actual values of Y. Once again, the
complete case analysis is unbiased, as W contains no information on the parameters
in the regression of interest.
Now suppose that W is both not independent of Y and predictive of Y being
missing, so that Y is MAR given W and possibly X, Z. Here a complete case
analysis is inconsistent. Consistent estimation requires that we take account of the
information in W. We could include W as an additional covariate; however this
changes the model of interest. This may not be desirable, because it changes the
goal of the analysis. In this setting, we need to use one of the more sophisticated
methods described below.
The last possibility is that the response Y is MNAR. From the discussion in the
previous Section it should be clear that a complete case analysis will be inconsistent
here. We need to make an assumption about the difference between [Y | X, Z, R = 1]
(i.e. the complete case estimate) and [Y | X, Z, R = 0] (i.e. the regression relation-
ship in individuals where Y is missing). Only given such an assumption can we
estimate the regression parameters relating Y to X and Z.
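A small simulation (with arbitrary parameter values) illustrates the case above in which W is associated with Y and predictive of Y being missing: the complete case regression of Y on X is then biased, and adding W removes the selection effect but changes the model being estimated:

  set.seed(3)
  n <- 5000
  X <- rnorm(n)
  Y <- 1 + 0.5 * X + rnorm(n)                   # model of interest: Y on X
  W <- Y + rnorm(n)                             # fully observed, associated with Y given X
  p_miss <- plogis(-1 + 1.5 * W)                # chance Y is missing depends on W only
  Yobs <- ifelse(rbinom(n, 1, p_miss) == 1, NA, Y)
  coef(lm(Y ~ X))                               # full data: roughly (1, 0.5)
  coef(lm(Yobs ~ X))                            # complete cases: biased
  coef(lm(Yobs ~ X + W))                        # including W: no selection effect, but a different model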
We again consider the regression of Y on X, Z but now suppose that Y, Z are fully
observed and X is partially observed. If X is MCAR, then the complete case
analysis will be unbiased, as above. However, in this setting (in contrast to that of
a missing response, equation (2.2)) we can recover information about the regression
coefficients from individuals with Y and Z observed. The key to this is the
introduction of an assumed distribution for X in terms of Y and Z. This distribution
buys information about the missing data, though it cannot be definitively validated
from the observed data. This information can be incorporated using for example
multiple imputation or EM-type algorithms, as described in Sect. 2.4.
Next suppose X is MAR given Z, but given Z the mechanism does not depend on
Y. In this case, analysis using complete cases will again be unbiased, but—as with
X MCAR—more information on the regression coefficients can be obtained from
individuals with Y and Z observed. If the covariate X is MAR and the mechanism
depends on the response Y and covariate Z, then analysis based on complete cases
will be biased, as well as potentially inefficient. In such settings a more sophisti-
cated analysis is needed, taking account of the information in the partially observed
cases. As usual, this relies on an assumed distribution of the missing data given the
observed data, which cannot be definitively validated from the observed data.
Again, this information can be incorporated using for example multiple imputation
or EM-type algorithms.
Similarly, if X is MAR but the mechanism depends on another variable W which
is not in our model of interest, but which is associated with Y, then the complete
case analysis will be biased, and an analysis valid under MAR is required.
However, if—as above—W is associated with the chance of seeing X, but not
with the distribution of X, then there is no gain in including W in the analysis.
Next, if the covariate X is MNAR, depending on X and possibly Z but given
these not on the response Y, then the complete case analysis is unbiased. A more
efficient analysis is possible, but only if we correctly specify the MNAR
missingness mechanism. In practice, we are unlikely to be able to do this. Further,
an analysis under the MAR assumption is not valid here—as the true missingness
mechanism is MNAR. Since for this particular MNAR mechanism the complete
case analysis is unbiased, an analysis under MAR could introduce bias.
Finally, if X is MNAR, depending on X, Y and possibly Z then the complete case
analysis is biased, as is an analysis under the MAR mechanism. Quite often the
MAR analysis will be less biased than the complete case analysis but this is
not guaranteed. Only if we can correctly specify the MNAR mechanism—i.e.
the difference in the conditional distribution of X when X is observed and
unobserved—will our analysis be unbiased in general.
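The contrast between these covariate-missingness cases can likewise be sketched with simulated data (arbitrary values): when the chance that X is missing depends only on Z the complete case estimates are essentially unbiased, but when it depends on the response Y they are not:

  set.seed(4)
  n <- 5000
  Z <- rnorm(n)
  X <- 0.5 * Z + rnorm(n)
  Y <- 1 + 0.7 * X + 0.3 * Z + rnorm(n)          # model of interest: Y on X and Z
  # (a) X missing with probability depending on Z only: complete cases roughly unbiased
  Xa <- ifelse(rbinom(n, 1, plogis(Z)) == 1, NA, X)
  coef(lm(Y ~ Xa + Z))
  # (b) X missing with probability depending on Y: complete cases biased
  Xb <- ifelse(rbinom(n, 1, plogis(Y - 1)) == 1, NA, X)
  coef(lm(Y ~ Xb + Z))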
Taken together, the above underlines the importance of exploring the data
carefully, and understanding plausible missingness mechanisms, before using a
more sophisticated—and potentially time consuming—analysis. This can yield
important insights about whether a more sophisticated analysis is required, how
to formulate it, and how plausible the results are likely to be. Nevertheless, the
uncomfortable fact remains that all analyses with partially observed data rest on
inherently untestable assumptions. Thus sensitivity analyses—where we explore
the robustness of our inference to different assumptions about the missing
data—have a key role to play.
Here we briefly review the advantages and disadvantages of different methods for
analysing partially observed data. All the methods have an extensive literature, to
which we give some pointers.
The first of these approaches, inverse probability weighting, re-weights only those
with no missing data, whereas other approaches (discussed later in this section)
assume a distribution for the missing
data given the observed. Thus we can view inverse probability weighting as trading
efficiency for robustness. If our weight model is correctly specified, our parameter
estimates are consistent. However, if we are prepared to specify a distribution of the
missing data given the observed we can obtain consistent and more efficient
estimates. Note that neither approach avoids the need to make untestable
assumptions. Specifically with regard to inverse probability weighting, the data
we need to check the assumptions made in estimating the weights are missing.
This point has triggered extensive methodological work, and the emergence of
augmented inverse probability weighting and doubly robust estimation. These
approaches both make some additional assumptions in order to buy information
relative to inverse probability weighting. Thus doubly robust methods incorporate a
term which is a function of the mean of the missing data given the observed data.
Assuming that the (as usual inherently untestable) assumption under which this
mean is estimated is correct, and that the model for the weights is correct, the
resulting estimates are consistent and comparably precise to those obtained using
the methods described in the rest of this section. They also have the desirable
property that if either the model for the weights or the model for the mean of the
missing data given the observed data is wrong (but not both), consistent parameter
estimates still result. For a relatively
accessible introduction see Carpenter et al. (2006) or Vansteelandt et al. (2010).
The principal drawbacks of these approaches remain the difficulty of dealing with
non-monotone missing data patterns (which arise naturally in most observa-
tional data and some experimental data) under a general MAR mechanism, concern
about instability in the weights (now an active research area, see Cao et al. (2009)),
and the lack of software. For a lively discussion see Kang and Schafer (2007).
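As a concrete sketch of the basic idea (simulated data, arbitrary values), the weights come from a model for the probability of being a complete case, estimated from fully observed variables, and the complete cases are then re-weighted in the analysis of interest:

  set.seed(5)
  n <- 5000
  Z <- rnorm(n)
  Y <- 1 + 0.5 * Z + rnorm(n)                      # model of interest: Y on Z
  W <- Y + rnorm(n)                                # fully observed auxiliary variable
  obs <- rbinom(n, 1, plogis(1 - W)) == 1          # Y observed with probability depending on W
  # weight model: probability of being a complete case given fully observed variables
  pfit <- glm(obs ~ Z + W, family = binomial)
  wgt <- 1 / fitted(pfit)
  coef(lm(Y ~ Z, subset = obs))                    # unweighted complete cases: biased
  coef(lm(Y ~ Z, weights = wgt, subset = obs))     # inverse probability weighted: roughly (1, 0.5)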
The second widely used approach to parameter estimation with missing data is the
Expectation-Maximisation (EM) algorithm and its derivatives, which was devel-
oped in the early 1970s (Orchard and Woodbury 1972; Dempster et al. 1977).
This is an iterative method for obtaining maximum likelihood estimates with
missing data, based on iteratively calculating the expectation of the likelihood
over the distribution of the missing data given the observed data, and then
maximising this expected likelihood with respect to the parameters of interest.
Although it can work well, convergence is often slow, estimating the standard
errors of parameters can be tricky (Louis 1982) and calculating the expectations
involved can be difficult. This has led to the development of various approaches
which use Monte-Carlo methods to estimate the expectations involved; see for
example Little and Rubin (2002) and Clayton et al. (1998), who also discuss other
algorithms for maximising incomplete data likelihoods. Once Monte-Carlo
methods are used in the estimation, a key attraction of the EM algorithm relative
If missing data are primarily in an outcome measured repeatedly over time, it may
often be possible to embed the simpler model of interest in a more complex
repeated measures model. If there are no missing data, the two give the same
inference for the parameters of interest, but if data are missing, the latter gives
inference under a broader class of MAR mechanisms. This approach requires care,
but is most suitable for clinical trials with continuous repeatedly measured outcome
data subject to patient withdrawal. This approach is reviewed and applied in this
context in Chap. 3 of Carpenter and Kenward (2008), and also compared to multiple
imputation. It is limited by the difficulty of setting up such a model in general,
particularly when covariates are missing.
The fourth option we consider here, Multiple Imputation (MI), was conceived as a
two-stage Bayesian approach for parameter estimation in the presence of missing
data. However, if done properly, inferences have very good frequentist properties.
Thus MI can be viewed as a method of maximising an incomplete data likelihood.
Indeed, provided the underlying incomplete data likelihood is the same, asymptoti-
cally equivalent estimates will be obtained from the EM-type algorithms and
repeated measures modelling. It follows that MI can also be viewed as an approxima-
tion to a full Bayesian analysis in which the analyst’s model of interest and the
imputation model for the missing data given the observed are fitted concurrently.
MI was introduced by Rubin (Rubin 1987, 1996); for a recent review
see Kenward and Carpenter (2007). A joint model is formed for the observed
data, where partially observed variables are the response. This model is fitted,
and then used to create a number of imputed datasets, by drawing the missing
data from its conditional distribution given the observed data, taking care to fully
accommodate the statistical uncertainty in this process. This results in a number of
‘complete’ datasets. Then the model of interest is fitted to each of these in turn, and
the results combined for final inference using Rubin’s rules. These rules are simple
and general, and although naturally derived using a Bayesian argument,
the resulting inferences have good frequentist properties.
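The rules themselves are short. For a scalar parameter, given the M point estimates and their estimated variances from the imputed datasets, a minimal sketch is:

  # Rubin's rules for a scalar parameter: est = the M point estimates,
  # v = their estimated (within-imputation) variances
  rubins_rules <- function(est, v) {
    M <- length(est)
    qbar <- mean(est)                       # pooled point estimate
    W <- mean(v)                            # average within-imputation variance
    B <- var(est)                           # between-imputation variance
    c(estimate = qbar, total.variance = W + (1 + 1 / M) * B)
  }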
The attractions of MI relative to the methods above include (i) it can be
implemented in terms of regression models, so developing general robust software
is more straightforward; (ii) convergence issues do not arise in the same way as with
Table 2.1 Some of the software packages available for multiple imputation

Full conditional specification:
  Stata – ice (a): install within Stata using the ssc command; packaged with Stata 12 onwards
  R – mice (a): install as additional package from http://cran.r-project.org/
  R – mi (a): install as additional package from http://cran.r-project.org/
  SAS – IVEware (a): download from http://www.isr.umich.edu/src/smp/ive/

Joint modelling:
  MLwiN – mi macros (b): download from http://www.missingdata.org.uk
  SAS – PROC MI (with the MCMC option): standard from SAS v9
  Stand alone – norm, PAN (b), mix (a): download from http://sites.stat.psu.edu/~jls/misoftwa.html
  Stata – MI: standard with Stata version 11 and later
  Stand alone – REALCOM (a, b): download from http://www.bristol.ac.uk/cmm/; designed to work with
    MLwiN; interface to Stata and other packages from http://www.missingdata.org.uk

All websites accessed 25 Jan 2012
(a) software which does not treat discrete data as continuous
(b) software allowing for multilevel structure
EM-type algorithms; (iii) information from auxiliary variables, not in the model of
interest, can naturally be included, and (iv) it can readily be used for sensitivity
analysis to the MAR assumption. Thus it is becoming increasingly established as
the leading practical approach to analysing partially observed datasets (Sterne et al.
2009; Klebanoff and Cole 2008). Although there is an increasing range of statistical
software packages available, they vary in their accessibility to data analysts. More
fundamentally, some software uses the full conditional specification approach
(also known as the chained equation approach; for an early example see van Buuren
et al. (1999)), which does not explicitly model the joint distribution but forms
univariate models for each incomplete variable in turn conditional on all the others.
There is no guarantee in general that these correspond to a proper joint model. Other
software is based on an explicit joint model, as described for example in Schafer
(1997). Moreover, some software treats discrete data as continuous in the imputa-
tion model, and most packages do not allow for a multilevel structure (Kenward and
Carpenter 2007). Table 2.1 gives some more details.
In the light of the above, we believe multiple imputation is currently the most
general and accessible method for a wide range of analyses. In the next Section, we
therefore review some key issues that arise in its application, before illustrating its
use in a multilevel setting in Sect. 2.6.
Assuming that the user knows the model of interest they wish to fit in the absence of
missing data, the multiple imputation procedure follows a fixed set of standard steps
once the user has specified the imputation model. This is because the computer fits
the imputation model to the partially observed data, imputes the missing data from
this to create the ‘completed’ data sets, fits the user’s model of interest to each
imputed dataset and then combines the results for final inference using Rubin’s
rules.
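With the mice package in R (one of the full conditional specification packages listed in Table 2.1), these steps might look as follows; dat, y, x and z are hypothetical names for a data frame with partially observed variables and for the variables in the model of interest, not data from this chapter:

  library(mice)
  # stage 1: fit the imputation model and create 50 imputed ('completed') datasets
  imp <- mice(dat, m = 50, seed = 101)      # default imputation method chosen per variable type
  # stage 2: fit the model of interest to each completed dataset
  fits <- with(imp, lm(y ~ x + z))
  # stage 3: combine the 50 sets of results using Rubin's rules
  summary(pool(fits))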
The implication is that care needs to be taken in formulating the imputation model—
this is the make or break step. In particular this needs to be compatible/congenial with
the model of interest—in the sense described below—and valid under a general
missing at random mechanism.
By compatibility we mean that the imputation model should ideally allow the
same richness of structure between the variables as the model of interest. Thus all
the variables in the model of interest, including the response, need to go into the
imputation model. This is important, for otherwise the imputed data will be
independent of these variables. For most outcomes, this is straightforward;
for survival data we need to remember to include the censoring indicator as well
as a suitable measure of survival. Work by White and Royston (2009) suggests the
cumulative hazard is preferable. Survival data with censoring—when we know the
event occurred after censoring—can also be viewed as an example of situations
where we have prior information about the range of values the missing data can
take. Goldstein et al. (2009) discuss how such information can be incorporated.
Problems are more likely to arise when the model of interest contains
non-linearities and/or interactions. If these are functions of fully observed variables,
then they should be included as covariates in the imputation model. However, if
they include partially observed variables then more care needs to be taken, and in
some settings it is challenging to handle this correctly. If this is not done correctly,
estimates of interactions and non-linearities in the model of interest will tend to be
biased towards the null. This is obviously of greatest concern where inference
focuses on precisely these interactions/non-linearities. For a detailed discussion
and worked example, see Carpenter and Plewis (2011).
A further issue of compatibility concerns the correlation structure of the data. If the
data are multilevel, then this structure should be reflected in the imputation model.
Failure to do this will generally result in the imputations being weighted towards
those level two units with the most data, and the variance of the imputed data will be
too small. Multilevel multiple imputation is more problematic with the full
conditional specification approach (Royston 2007; van Buuren et al. 2006).
However, it can be handled naturally within a joint modelling approach, as can
missing values for level two variables. This approach is described in Goldstein et al.
(2009) and can handle discrete and continuous variables at different levels of the
multilevel data hierarchy. Experimental software implementing this is available at
http://www.bristol.ac.uk/cmm/.
In conclusion, we note that multiple imputation does not have the same goal
as prediction. To see this, consider a model with a number of covariates, such as
the model for obtaining educational qualifications by age 23, fitted to the 1958
National Childhood Development Study data by Carpenter and Plewis (2011). For
fitting the model of interest, there were only 10,279 complete cases, i.e. 65% of the
target sample. The advantage of multiple imputation is that, even if our imputations
are very imprecise, we can bring in the information from the observed data for the
35% of individuals with incomplete data. In most cases—especially with the
judicious use of auxiliary variables (Spratt et al. 2010)—improved prediction will
result in more accurate inference for the model of interest. However, multiple
imputation can be useful when prediction is poor. For a full discussion of these
points see Rubin (1996).
In this Section we illustrate some of the key points above, using data from a study of
the effect of class size on children’s achievement in their first 2 years at school.
We explore the importance of multilevel structure in the imputation model. The
data come from a class size study kindly made available to us by Peter Blatchford at
the Institute of Education, London. This study sought to understand the effect of
class size on development of literacy and numeracy skills in the first 2 years of
English children’s full time education. The analysis below is illustrative; for a fuller
analysis and more details of the study see Blatchford et al. (2002).
The version of the dataset we explore below was derived from the original; we
restrict the analysis to a complete subset of 4,873 pupils in 172 schools. School
sizes vary greatly in these data and this is reflected in the number of pupils each
school contributes to the analysis, which ranges from 1 to 88. The dataset is thus
multilevel, with children at level 1 belonging to classes at level 2. Our model of
interest regresses literacy score at the end of the first year on class size, adjusting for
literacy measured when the children started school, eligibility for free school meals
and gender. The pre- and post- reception year (i.e. first school year) literacy scores
were normalised as follows. For each test, the pupils’ results were ranked. Then for
observation in rank order i, where N pupils sat the test, the normalised result was
calculated as the inverse normal of i/(N + 1).
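In R, for example, this transformation is a single line; score is a hypothetical vector of raw test results (ties would need a decision about how ranks are assigned):

  # map the result with rank i among N pupils to the inverse normal of i / (N + 1)
  nscore <- qnorm(rank(score) / (length(score) + 1))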
We will explore the following:
1. fitting the multilevel model of interest to the 4,873 complete cases;
2. fitting the same model ignoring the multilevel structure;
3. making some values of pre-reception literacy score missing at random, and
(a) analysing the remaining complete cases;
(b) using multilevel multiple imputation to handle the missing data, and
(c) using single level multiple imputation to handle the missing data.
Let j denote class and i denote pupil. Our illustrative model of interest is the
two-level random-intercept regression

  nlitpost_ij = b0 + b1 nlitpre_ij + b2 fsmn_ij + b3 gend_ij + u_j + e_ij,
  u_j ~ N(0, σ²u),  e_ij ~ N(0, σ²e),   (2.3)

with the variables described in Table 2.2.
Table 2.2 Description of variables in class size data used in this analysis

  nlitpost – normalised literacy score at the end of the 1st school year
  nlitpre – normalised literacy score at the start of the 1st school year
  fsmn – binary variable, 1 indicates pupil is eligible for free school meals
  gend – binary variable, 1 for boys and 0 for girls
Table 2.3 Parameter estimates (standard errors) from fitting model (2.3) to various datasets; full details in text

  Column (1): original data, multilevel model (n = 4,873)
  Column (2): original data, single-level model (n = 4,873)
  Column (3): reduced data, complete cases (n = 3,313)
  Column (4): multilevel MI on reduced data (n = 3,313)
  Column (5): single-level MI on reduced data (n = 3,313)

  Parameter   (1)             (2)             (3)             (4)             (5)
  b0          0.088 (0.040)   0.065 (0.017)   0.121 (0.041)   0.092 (0.041)   0.107 (0.037)
  b1          0.733 (0.010)   0.662 (0.012)   0.717 (0.012)   0.731 (0.011)   0.647 (0.012)
  b2          0.058 (0.018)   0.086 (0.022)   0.020 (0.022)   0.056 (0.020)   0.070 (0.022)
  b3          0.068 (0.027)   0.095 (0.030)   0.036 (0.037)   0.101 (0.034)   0.103 (0.034)
  σ²u         0.237 (0.028)   –               0.231 (0.028)   0.243 (0.029)   0.182 (0.022)
  σ²e         0.372 (0.008)   0.573 (0.012)   0.360 (0.009)   0.367 (0.009)   0.425 (0.011)
The imputation model, (2.5), takes nlitpre together with the fully observed fsmn as a
bivariate response, with class-level and pupil-level random effects whose 2 × 2
covariance matrices are Ωu and Ωe. fsmn has been included as a response because the
software requires two or more responses in the imputation model in order to perform
multiple imputation. Although this is not strictly compatible with (2.3), as the missing
data are in nlitpre, the error induced by
treating fsmn as continuous is negligible. This is because the properties of the
bivariate normal distribution mean that we would get similar imputations if we
instead had fsmn as a covariate in a univariate imputation model for nlitpre.
The software fits this imputation model using Markov Chain Monte Carlo
(MCMC), taking improper priors for the regression coefficients and Wishart priors
for the covariance matrices. It then imputes the missing data as draws from the
Bayesian posterior, fits the model of interest to each imputed dataset and combines
the results for inference using Rubin’s rules.
We used a ‘burn in’ of 500 MCMC updates and updated the sampler a further
200 times between drawing each of 50 imputed datasets. The results are shown in
the rightmost two columns of Table 2.3. Column 5 shows the result of multilevel
multiple imputation; in column 6 we create the imputations using (2.5), but with Ωu
set to zero. This is equivalent to single level imputation, as available in other
packages. Looking at the results, we see that multilevel multiple imputation gives
point estimates much closer to the original, fully observed, data but with slightly
increased standard errors (reflecting the lost information in the partially observed
data). Comparing with single level multiple imputation, we see the latter results in
an overestimated gender coefficient. This is because the difference between boys
and girls is greater in the larger schools; single level imputation carries this stronger
effect to all schools and results in a greater gender coefficient. Further, the school
level component of variance is substantially reduced after single level multiple
imputation, compared with multilevel multiple imputation. The pupil level variance
is correspondingly increased. This makes sense: single level imputation puts extra
variability at the pupil level. Finally, each multiple imputation analysis took about
90s with a 2.4 GHz chip.
Although our model is quite simple, we can readily handle additional variables
with missing data. These are included as additional responses on the left hand side.
In general, as described in Goldstein et al. (2009) this can include appropriate
models for discrete and unordered categorical variables. This example illustrates
the potential of multiple imputation, and also illustrates the importance of allowing
for the multilevel structure in the imputation, when it is present in the data.
2.7 Conclusions
In this Chapter we have argued that, when data are missing, analysis cannot proceed
without inherently untestable assumptions about the missingness mechanism.
We gave an intuitive illustration of missingness mechanisms and related this to
common terminology in the literature. Armed with this, we described the likely
effect of missing covariate and response data, under different mechanisms.
References
Blatchford, P., Goldstein, H., Martin, C., & Browne, W. (2002). A study of class size effects in
English school reception year classes. British Educational Research Journal, 28, 169–185.
Cao, W., Tsiatis, A. A., & Davidian, M. (2009). Improving efficiency and robustness of the doubly
robust estimator for a population mean with incomplete data. Biometrika, 96, 723–734.
Carpenter, J. R., & Kenward, M. G. (2008). Missing data in clinical trials — A practical guide.
Birmingham: National Health Service Co-ordinating Centre for Research Methodology.
Freely downloadable from www.missingdata.org.uk. Accessed 25 Jan 2012.
Carpenter, J. R., & Plewis, I. (2011). Coming to terms with non-response in longitudinal studies.
In M. Williams & P. Vogt (Eds.), SAGE handbook of methodological innovation. London:
Sage.
Carpenter, J. R., Kenward, M. G., & Vansteelandt, S. (2006). A comparison of multiple imputation
and inverse probability weighting for analyses with missing data. Journal of the Royal
Statistical Society: Series A (Statistics in Society), 169, 571–584.
Carpenter, J. R., Kenward, M. G., & White, I. R. (2007). Sensitivity analysis after multiple
imputation under missing at random — A weighting approach. Statistical Methods in Medical
Research, 16, 259–275.
Carpenter, J. R., Roger, J. H., & Kenward, M. G. (2012). Analysis of longitudinal trials with
missing data:—A framework for relevant, accessible assumptions, and inference via multiple
imputation (Submitted).
Chan, A., & Altman, D. G. (2005). Epidemiology and reporting of randomised trials published in
PubMed journals. The Lancet, 365, 1159–1162.
Clayton, D., Spiegelhalter, D., Dunn, G., & Pickles, A. (1998). Analysis of longitudinal binary data
from multi-phase sampling (with discussion). Journal of the Royal Statistical Society, Series B
(statistical methodology), 60, 71–87.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B
(Statistical Methodology), 39, 1–38.
Goldstein, H., Carpenter, J. R., Kenward, M. G., & Levin, K. (2009). Multilevel models with
multivariate mixed response types. Statistical Modelling, 9, 173–197.
Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of
alternative strategies for estimating a population mean from incomplete data (with discussion).
Statistical Science, 22, 523–539.
Kenward, M. G., & Carpenter, J. R. (2007). Multiple imputation: Current perspectives. Statistical
Methods in Medical Research, 16, 199–218.
Klebanoff, M. A., & Cole, S. R. (2008). Use of multiple imputation in the epidemiologic literature.
American Journal of Epidemiology, 168, 355–357.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Chichester:
Wiley.
Louis, T. (1982). Finding the observed information matrix when using the EM algorithm. Journal
of the Royal Statistical Society, Series B, 44, 226–233.
Orchard, T., & Woodbury, M. (1972). A missing information principle: Theory and applications.
In L. M. Le Cam, J. Neyman, & E. L. Scott (Eds.), Proceedings of the Sixth Berkeley Symposium
on Mathematical Statistics and Probability: Vol. 1 (pp. 697–715). Berkeley: University of
California Press.
Royston, P. (2007). Multiple imputation of missing values: Further update of ice with emphasis on
interval censoring. The Stata Journal, 7, 445–464.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Rubin, D. B. (1996). Multiple imputation after 18 years. Journal of the American Statistical
Association, 91, 473–490.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman and Hall.
Spratt, M., Sterne, J. A. C., Tilling, K., Carpenter, J. R., & Carlin, J. B. (2010). Strategies for
multiple imputation in longitudinal studies. American Journal of Epidemiology, 172, 478–487.
Sterne, J. A. C., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., Wood, A. M.,
& Carpenter, J. R. (2009). Multiple imputation for missing data in epidemiological and clinical
research: Potential and pitfalls. British Medical Journal, 339, 157–160.
van Buuren, S., Boshuizen, H. C., & Knook, D. L. (1999). Multiple imputation of missing blood
pressure covariates in survival analysis. Statistics in Medicine, 18, 681–694.
van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., & Rubin, D. B. (2006).
Fully conditional specification in multivariate imputation. Journal of Statistical Computation
and Simulation, 76, 1049–1064.
Vansteelandt, S., Carpenter, J. R., & Kenward, M. G. (2010). Analysis of incomplete data using
inverse probability weighting and doubly robust estimators. Methodology, 6, 37–48.
White, I. R., & Royston, P. (2009). Imputing missing covariate values for the Cox model. Statistics
in Medicine, 28, 1982–1998.
Wood, A. M., White, I. R., & Thompson, S. G. (2004). Are missing outcome data adequately
handled? A review of published randomized controlled trials in major medical journals.
Clinical Trials, 1, 368–376.
Chapter 3
Measurement Errors in Epidemiology
Darren C. Greenwood
There is a widespread view that measurement error in an exposure can only
dilute any association between exposure and outcome (Bashir and Duffy 1997;
Fuller 1987; Gladen and Rogan 1979; Weinberg et al. 1994; Wong et al. 1999a).
However, the belief that this is always the case is flawed for several reasons (Bjork
and Stromberg 2002; Carroll et al. 1995; DelPizzo and Borghesi 1995; Dosemeci
et al. 1990; Flegal 1999; Phillips and Smith 1991; Richardson and Ciampi 2003;
Sorahan and Gilthorpe 1994; Weinberg et al. 1994; White et al. 2001; Wong et al.
1999b): First, even in simple situations where this is generally the case, this is only
“on average”, so situations will exist where by chance the bias is away from the null;
second, this assumes a simple linear regression scenario with classical additive non-
differential measurement error (see Sect. 3.2.2) (Carroll et al. 1995), and for logistic
regression, for instance, the resulting bias may be in either direction (Stefanski and
Carroll 1985). Even if the exposure is measured without error, biased estimates can
be caused by measurement error in confounding variables (Brenner 1993; Greenland
1980; Wong et al. 1999b). Even with additive measurement error in linear regres-
sion, measurement error in a confounding variable often leads to under-adjustment
for the confounder and this can readily lead to distortion of the estimated effect of the
exposure in either direction, depending on the direction of the confounder’s effect
(Wong et al. 1999b). In addition, bias away from the null is possible under certain
situations with other more complex error structures (Carroll et al. 1995; Fuller
1987). One implication of the effects of measurement error not always biasing
estimates towards the null is that trends are not always preserved (Carroll et al.
1995; Weinberg et al. 1994), even under non-differential measurement error
(Sect. 3.2.2.4). One situation where bias may not be relevant is in the context of
prediction, e.g. a risk score based on measured exposures (Wikipedia contributors
2007). However, in aetiological epidemiology, such bias could substantially influ-
ence conclusions relating to the true level of exposure.
It can be argued that, for the purposes of prediction or public health advice,
measurement error in an exposure is irrelevant, because the prediction is based on
the measured exposure, not the true exposure. The problem with interpreting such
models based on error-prone exposures is that few clinicians have a good apprecia-
tion for the amount of variation in such measures. Therefore, an analysis based on an
error-prone exposure may lead to the erroneous conclusion that a particular exposure
is unimportant or does not strongly (or significantly) predict the outcome. Many
naïve analyses may lead to this variable being dropped from the model entirely, or
to the public health implications being under-estimated. If the measurement error is in
a confounder, then this would lead to under-adjustment for the confounder, again
potentially leading to incorrect interpretation of the model. Finally, if there is a
nonlinear association between the outcome and an error-prone exposure (nonlinear in
the sense of using a link function other than the identity link), then this can lead to a
shift in location as well as a bias in the regression slopes.
This more recent work is of far greater relevance to most branches of epidemiology
because exposures are often being related to dichotomous disease outcomes using
logistic regression or Poisson regression models, or through methods for survival
analysis. The mechanisms underlying measurement error in covariates are common
to both linear and nonlinear models, though their impact on estimates may differ.
One general distinction to be drawn between assumed underlying mechanisms is
based on whether the measurement error model focuses on modelling the observed
measure (W) conditional on the true exposure (X) and other covariates (Z), the
classical model, or whether the focus is on modelling the true measure (X)
conditional on the observed (W) and the other covariates (Z), the Berkson model.
Under the classical model there is random error around the true value, and it is
assumed that any measurement error is not associated with the true exposure. In more
complicated forms, it is assumed that measurement error is only independent of
true exposure conditional upon other covariates.
Classical measurement error can be described using the three submodels
outlined above, and it is helpful to express the conditional independence
assumptions of this model in terms of the conditional distributions of the three
submodels: a disease model for Y given the true exposure X and covariates Z, with
parameters β; a measurement model for the observed measure W given X and Z, with
parameters π; and an exposure model for X given Z, with parameters λ, where β, π
and λ are, in the general case, vector quantities.
Disease status Y is only dependent on the true exposure X, the known covariates
Z and the unknown parameters β. Therefore, conditional on the true exposure, X,
being known, the imperfect observed measures W do not contribute any information
to the outcome Y.
The aim of methods to control for measurement error is to discover the disease
model free of any bias caused by measurement error. To achieve this, one needs to
know the distribution of X|W, ignoring perfectly measured covariates and other
parameters for the moment. Knowledge of the measurement model W|X is not
sufficient in itself to identify the true disease-exposure relationship. One also
requires the exposure model, the distribution of X, as demonstrated by Bayes’
theorem: prob(X|W) ∝ prob(W|X) prob(X).
This provides the Bayes estimate of X and is the basis for Bayesian approaches
to handling classical measurement error, because we now have an estimate of the
true exposure. This provides a justification for conceptually separating the mecha-
nism into three sub-models.
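For example (an illustration using standard normal-theory results rather than a derivation given in the text), if the exposure model is X ~ N(μ_X, σ²_X) and the measurement model is W | X ~ N(X, σ²_U), then Bayes' theorem gives a closed-form distribution for the true exposure given the observed measure:
$$
X \mid W \;\sim\; N\!\left(\lambda W + (1-\lambda)\mu_X,\; \lambda\sigma^2_U\right),
\qquad \lambda = \frac{\sigma^2_X}{\sigma^2_X + \sigma^2_U},
$$
so the Bayes estimate of X shrinks the observed measure towards the population mean, by an amount governed by the reliability ratio λ that reappears in (3.2) below.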
For the Berkson measurement error model, it is assumed that any measurement
error is not associated with the measured exposure. In terms of the three submodels,
with model parameters β and λ, the disease model for Y given X and Z is unchanged,
whilst the error model now specifies the distribution of the true exposure X given the
observed measure W and the other covariates Z.
Both classical and Berkson error models assume random errors. Alternative
mechanisms extend these fundamental approaches to encompass systematic errors
(Thomas et al. 1993) where values that are measured are not randomly distributed
around the truth (classical) or measured values (Berkson). For example, food
frequency questionnaires are notorious for either over-estimating energy and nutri-
ent intakes (Byers 2001; Cade et al. 2002; Calvert et al. 1997) or under-estimating
them (Kipnis et al. 2001, 2003; Schatzkin et al. 2003; Subar et al. 2003). A situation
where dietary intake is under-estimated (in absolute terms) for people with larger
intakes, and over-estimated for people with lower intakes, would be a systematic
error that depended systematically on the true exposure. The underlying
measurement error mechanism potentially may be more complicated if it systemati-
cally depends on unmeasurable person-specific characteristics. This is a particular
problem in the field of nutrition epidemiology.
Both systematic and random measurement error structures can be sub-divided into
differential or non-differential. This distinction depends on whether the errors in the
exposure variable are conditionally independent of the outcome Y, given true
exposure X and perfectly known covariates Z. If the misclassification of the
measured exposure depends on the outcome, then this is a differential measurement
error. If, on the other hand, the measured exposure W given true exposure X and
known covariates Z contains no additional information about Y, then the measure-
ment error is said to be non-differential, i.e. p(Y|X,Z,W) = p(Y|X,Z). When this
occurs, W can be said to be a surrogate for X, because, given X and Z, W is
conditionally independent of the outcome Y. Where measurement error can be assumed to be
non-differential, then relatively straightforward methods can be used to correct
for the measurement error bias.
Measurement error models have also traditionally been considered to fall into
two separate categories: additive or multiplicative. Additive error structures
are those that define the measurement error model in terms of adding a component
of error, e.g. W = X + U in the classical model, or X = W + U in the Berkson
model. Multiplicative error structures are those that define the measurement
error model in terms of multiplying a component of error, e.g. W = XU
or X = WU. However, this distinction is often a false one, when a multiplicative
error structure is additive on the log scale, e.g. W = XU implies ln(W) = ln(XU) =
ln(X) + ln(U).
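These error structures are easily generated in a few lines of R; the snippet below is purely illustrative, with arbitrary variances:

```r
# Sketch: generating classical, Berkson and multiplicative error structures
set.seed(1)
n <- 10000
x <- rnorm(n, mean = 5, sd = 1)              # true exposure X

# Classical additive error: W = X + U, with U independent of X
w_classical <- x + rnorm(n, sd = 0.8)
var(w_classical) > var(x)                    # TRUE: the observed measure is more variable

# Berkson error: X = W + U, with U independent of W (e.g. an assigned dose)
w_berkson <- rnorm(n, mean = 5, sd = 0.6)    # observed / assigned value
x_berkson <- w_berkson + rnorm(n, sd = 0.8)  # true values scatter around W
var(x_berkson) > var(w_berkson)              # TRUE

# Multiplicative error: W = XU, which is additive on the log scale
u <- exp(rnorm(n, sd = 0.2))
w_mult <- x * u
all.equal(log(w_mult), log(x) + log(u))      # TRUE: ln(W) = ln(X) + ln(U)
```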
If the outcome Y is regressed on the error-prone measure W = X + U rather than on the
true exposure X, the least squares slope is
$$
\beta^*_X \;=\; \frac{\mathrm{Cov}(Y,W)}{\mathrm{Var}(W)} \;=\; \frac{\mathrm{Cov}(Y,X)}{\mathrm{Var}(W)} \quad \text{if } U \text{ is independent of } X
$$
$$
=\; \frac{\mathrm{Cov}(Y,X)}{\mathrm{Var}(X)+\mathrm{Var}(U)}
\;=\; \frac{\mathrm{Var}(X)}{\mathrm{Var}(X)+\mathrm{Var}(U)}\,\frac{\mathrm{Cov}(Y,X)}{\mathrm{Var}(X)}
\;=\; \lambda\,\beta_X \qquad (3.1)
$$
where
$$
\lambda \;=\; \frac{\mathrm{Var}(X)}{\mathrm{Var}(X)+\mathrm{Var}(U)} \qquad (3.2)
$$
In this simple example of linear regression with additive error in the classical
model, the intercept is also biased. If we regress Y on W, the standard least squares
estimate of the intercept is:
$$
\beta^*_0 \;=\; \mu_Y - \beta^*_X\,\mu_W
\;=\; \mu_Y - \lambda\beta_X\,\mu_W
\;=\; \beta_0 + \beta_X\mu_X - \lambda\beta_X\mu_X
\;=\; \beta_0 + (1-\lambda)\,\beta_X\,\mu_X
$$
Similarly, the usual estimate of the residual variance is
$$
\hat{\sigma}^2 \;=\; \frac{1}{n-2}\left(s_{YY} - \lambda\beta_X\,s_{WY}\right)
$$
and from the working given above demonstrating the biased estimate of the slope,
$$
s_{WY} \;=\; \sigma^2_W\,\beta^*_X \;=\; (\sigma^2_X + \sigma^2_U)\,\beta^*_X \;=\; (\sigma^2_X + \sigma^2_U)\,\lambda\beta_X \;=\; \sigma^2_X\,\beta_X,
$$
so the residual variance becomes
$$
\hat{\sigma}^2 \;=\; \frac{1}{n-2}\left(s_{YY} - \lambda\beta^2_X\,\sigma^2_X\right)
\;=\; \frac{1}{n-2}\left(s_{YY} - \beta^2_X\,\sigma^2_X + [1-\lambda]\,\beta^2_X\,\sigma^2_X\right),
$$
which is the true residual variance inflated by $(1-\lambda)\,\beta^2_X\,\sigma^2_X$.
The estimate of λ also affects power. In general sample size calculations, for a given
required power, dividing the effect size one wishes to detect by a constant
c increases the number of subjects required by c². Using this argument, if the
estimated regression coefficient changes by multiplying by λ, then the number of
study participants required will change by 1/λ² (Carroll et al. 2006). When λ is <1,
this leads to a substantial increase in the required sample size, or loss of power for
the same sample size.
In this simple example, the constant λ tends to be less than 1 because of shrinkage.
This is because classical measurement error in an exposure will, in the long run,
tend to make the measured exposure appear more extreme than the true exposure.
For this simple example it is relatively straightforward to demonstrate how the
constant λ can be used to quantify, on average, the amount of attenuation of
the association (flattening of the regression slope) caused by measurement error,
bias in the intercept, and in the residual variance.
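These results are straightforward to verify by simulation; the following R sketch uses arbitrary values, chosen so that σ²_X = σ²_U and hence λ = 0.5:

```r
# Sketch: attenuation by the reliability ratio under classical additive error
set.seed(1)
n       <- 100000
beta0   <- 1; betaX <- 2
sigma_x <- 1; sigma_u <- 1                    # equal exposure and error variances
x <- rnorm(n, mean = 2, sd = sigma_x)         # true exposure
w <- x + rnorm(n, sd = sigma_u)               # classical error: W = X + U
y <- beta0 + betaX * x + rnorm(n)             # outcome generated from the true exposure

lambda <- sigma_x^2 / (sigma_x^2 + sigma_u^2) # reliability ratio (3.2) = 0.5

coef(lm(y ~ x))   # approximately (1, 2): the true intercept and slope
coef(lm(y ~ w))   # slope approximately lambda * betaX = 1; intercept biased upwards

1 / lambda^2      # = 4: the implied inflation in sample size for the same power
```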
In many epidemiological settings, however, the simple classical model may not
hold. For radiation exposures, the Berkson model is sometimes more appropriate.
Multiplicative error structures may also hold for some exposures. In nutrition
epidemiology, dietary exposures often have intercept and slope components to
the measurement model, and in particular may well have correlated errors
(Day et al. 2004; Kipnis et al. 2001). Adjustment for confounding, measured with
or without error, adds further complexity. In these more complicated examples the
algebra is more involved than that presented here and the potential for bias could
easily be in either direction. Measurement error in gene-environment interactions
also poses different problems (Greenwood et al. 2006a; Huang et al. 2005; Murad
and Freedman 2007; Wong et al. 2003, 2004).
3.3.1 Design
The effects of measurement error can be reduced by using a more precise measure
of exposure, and more precise measures of important confounders. In practice, this
may mean that more lengthy or costly exposure measures are used, and that smaller
samples are all that can be achieved. However, this is almost always offset by the
reduction in the effects of measurement error outlined earlier. For example,
in nutrition epidemiology, coding of weighed food diaries to derive nutrient intake
is very expensive compared to simpler food frequency questionnaires, but will yield
far more realistic results. Use of objective measures of dietary intake such as
biomarkers or itemised till receipts will reduce bias further (Greenwood et al.
2006b; Kipnis et al. 1999).
The effect of measurement error can also be reduced by increasing the variance
of the true exposure (Freedman et al. 2007; Schatzkin et al. 2001b). This is because
increasing σ²_X has the benefit of making λ closer to 1 and therefore reducing bias on
average. The easiest way to do this is to ensure sampling a wide range of exposures.
Whilst the best approach to reducing the effects of measurement error is to
avoid it in the first place through better exposure measurement, the effects can be
mitigated by statistical techniques using additional information from external or
internal validation samples based on comparison with a measure of the true
exposure (if one exists), or surrogate measures (including replicated measures).
measure of the radioactive emissions released from the facility. However, only a
cheap error-prone measure of the radiation dose to an individual is available using
a tag clipped to an individual’s uniform. The more accurate measure is, however,
the aggregate measure applying to everyone in the plant. Ideally, one would like to
somehow calibrate the inaccurate and imprecise measures from the individuals by
using the accurate and precise aggregate measure. This differs from designs with
individual-level surrogates because imprecision in the exposure information
applied to the individual has also to be accounted for. In nutrition epidemiology
this may occur when diet is measured at the household level using household
inventories or records of purchases, but inferences are required regarding the health
of individuals. This second situation may become more common if databases of
supermarket purchasing behaviour for households are linked to individuals’
subsequent health (Greenwood et al. 2006b; Ransley et al. 2001, 2003).
Sometimes, instead of a sub-sample from the same study, the repeat or validation
sample is from an external source, a second study. However, even if the same
exposure measure is used in the second study, it is unlikely that the distribution of X
would be identical, and therefore unlikely that λ would be the same. Rather than try
to apply the reliability ratio λ to the data, it is more sensible to extract the
measurement error variance itself, σ²_u, from the second study, and use this in
conjunction with the distribution of X found in the study needing correction for
the effects of measurement error (Carroll et al. 2006). The ability to use a parameter
(such as the reliability ratio or measurement error variance) obtained from one
study to inform another is known as “transportability”. For the reasons I have
outlined above, the measurement error variance is considered more transportable
than the reliability ratio (Carroll et al. 2006).
The different combinations of internal and external sources of information in the
context of nutritional epidemiology have been summarised by Spiegelman et al.
(2005), whilst Lyles et al. (2007) explore the possibility of combining both internal
and external validation.
There are many statistical methods suggested for correcting for measurement error
in a variety of situations. Many approaches for linear regression have been
reviewed by Fuller (1987), and for nonlinear models (nonlinear link
functions) by Carroll et al. (2006). I now introduce a few of the common or widely
applicable methods that are available using standard statistical software.
analysis and Cox proportional hazards regression (Clayton 1992; Hughes 1993;
Prentice 1982), where it has been used in two large cohorts: the Nurses’ Health
Study (Spiegelman et al. 1997) and EPIC (Gonzalez et al. 2006a, b). The method
was “popularised” amongst statisticians by Rosner et al. (Carroll et al. 1995; Rosner
et al. 1989, 1990), though these methods are still not widely used, even amongst
statisticians or epidemiologists, and they are certainly not yet well recognised
within clinical circles.
Carroll et al.’s version of regression calibration is available in Stata (StataCorp
2005) for a range of generalised linear models characterised by any sensible
combination of link function and distribution families (Hardin et al. 2003a).
Versions are also available in SAS (Rosner et al. 1992; Weller et al. 2007). It is
also easy to implement the algorithms in R.
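To illustrate the underlying idea only (this is a bare two-step sketch, not the Rosner or Stata implementations), suppose an internal validation subsample provides a gold-standard measure x alongside the error-prone measure w; the data frame and variable names below are hypothetical:

```r
# Sketch: two-step regression calibration with an internal validation sample
# `valid` contains w and a gold-standard x for a subsample;
# `main` contains w and the binary outcome y for the full study.

# Step 1: calibration model for E(X | W), fitted in the validation subsample
cal <- lm(x ~ w, data = valid)

# Step 2: predict the calibrated exposure for everyone in the main study
main$x_hat <- predict(cal, newdata = main)

# Step 3: fit the disease model using the calibrated exposure in place of W
fit <- glm(y ~ x_hat, family = binomial, data = main)
summary(fit)

# Note: the naive standard errors ignore uncertainty in the calibration step;
# bootstrapping both steps together is one simple way of allowing for this.
```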
SIMEX and regression calibration are both functional approaches to adjusting for
the effects of measurement error. Regression calibration uses a modelling-based
solution to this, but in contrast SIMEX is simulation-based, using a particular
resampling algorithm (Carroll et al. 1995; Cook and Stefanski 1994). This simple
method is widely applicable to a broad range of models. Loosely speaking,
the algorithm keeps adding a small known amount of error and re-estimates the
parameters each time. A trend in the effect of the measurement error is then
estimated, and extrapolation made to the case where there is no measurement
error. The algorithm is as follows:
1. The model is fitted “as is” to obtain estimated coefficients β̂ and an estimate of
σ̂²_u based on variance components analysis (Carroll et al. 2006) or deemed
“known” from external validation data.
2. Additional error is added as follows:
(a) Additional random error is generated at ψ times the estimated σ̂²_u and added
to the original values of X, such that the added error e_ψ ~ N(0, ψσ̂²_u).
(b) The model is then refitted to the new X and new coefficients β̂ estimated.
3. This is repeated r times and the mean or median coefficient β̂ of these parameter
estimates is then calculated. Step 2 is then repeated for different values of ψ,
e.g. {0.5, 1, 1.5, 2}. The original model fitted in step 1 is taken as an additional
observation with ψ = 0.
4. For each coefficient in the model, the estimate is plotted against the value of
ψ used. The trend is then extrapolated back to ψ = −1. This is then the estimate
without measurement error. Methods for extrapolation include linear or
quadratic extrapolants. The most stable of these is the quadratic extrapolant
(Carroll et al. 1995, 1996; Cook and Stefanski 1994; Hardin et al. 2003b;
Wang et al. 1998).
The SIMEX method does allow for plots that can be informative in terms of the
effects of measurement error, and the robustness of the results from taking this
approach.
This approach does not have the elegance of a simple closed formula, and it
may be sensitive to the choice of method used to extrapolate the curve over the
values of ψ. However, in practice it appears to work well. SIMEX is available in
Stata for a range of generalised linear models characterised by any sensible
combination of link function and distribution families (Hardin et al. 2003b).
SIMEX is also available through R’s simex package. It is also easy to implement
the algorithm in SAS.
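For readers who want to see the moving parts, the algorithm can also be coded directly in a few lines of R; this is a sketch only, assuming a single error-prone covariate w with known error variance var_u and a binary outcome y, and it is not the simex package implementation:

```r
# Sketch: SIMEX "by hand" with a quadratic extrapolant
simex_by_hand <- function(y, w, var_u, psi = c(0.5, 1, 1.5, 2), r = 100) {
  naive <- coef(glm(y ~ w, family = binomial))[2]   # psi = 0: the naive fit
  slopes <- c(naive, sapply(psi, function(p) {
    # add extra error with variance p * var_u, refit, and average over r replicates
    mean(replicate(r, {
      w_star <- w + rnorm(length(w), sd = sqrt(p * var_u))
      coef(glm(y ~ w_star, family = binomial))[2]
    }))
  }))
  psis <- c(0, psi)
  # extrapolate the trend in the slope back to psi = -1 (no measurement error)
  extrap <- lm(slopes ~ psis + I(psis^2))
  unname(predict(extrap, newdata = data.frame(psis = -1)))
}

# Example with simulated data:
# set.seed(1); x <- rnorm(500); w <- x + rnorm(500, sd = 0.7)
# y <- rbinom(500, 1, plogis(-0.5 + x))
# simex_by_hand(y, w, var_u = 0.49)
```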
Measurement error can profitably be viewed as a latent variable problem, with the
latent true exposure estimated on the basis of replicates, surrogate measures or
instrumental variables. The methods and software described elsewhere in this book
for latent traits and latent classes are therefore directly relevant to correcting for
errors in continuous and categorical exposure variables (see Chaps. 6 and 7).
Two particularly flexible approaches to estimating disease associations with latent
true exposures are Bayesian and quadrature-based methods.
Bayesian methods are ideally suited for modelling the conditional dependency
models described by the disease, exposure and measurement models outlined in
Sect. 3.2.2. The structural modelling approach has the advantage of naturally taking
account of the hierarchical structure of data incorporating latent true exposures,
repeated observed exposures and measurement error. In addition they allow the
incorporation of prior information on the measurement error variance or the
distribution of the true covariate. Through inclusion of prior information they
may also provide a solution to non-identifiability of the measurement error model
in some circumstances (White et al. 2001). This leads to very flexible models in that
they can be applied to a wide range of measurement error problems.
Whilst regression calibration has two main steps (regressing X on W to
estimate X̂, then using X̂ in place of X in the standard analysis), the Bayesian
approach models all the estimates (nodes) simultaneously. This means that cyclical
structural equation models cannot be fitted using some popular software
(Spiegelhalter et al. 1996, 2004). In addition, Gustafson suggests that the need for
an exposure model with the structural Bayesian approach, and the possibility of
misspecifying this, makes it more susceptible to bias than regression calibration
(Gustafson 2004).
Richardson has applied a range of Bayesian models in the context of both
validation samples and replicate samples (Richardson and Gilks 1993a, b).
She has extended these models to the situation where the prior distribution is a
mixture of different distributions, to allow greater flexibility (Richardson et al.
2002), and has outlined how the approach may be used on aggregate level data
(Gilks and Richardson 1992; Richardson 1996; Richardson and Best 2003).
In addition, some discussion and development of these tools for use in epidemiol-
ogy has been made by Bashir and Duffy (1997), by Mishra and Day (Day and
Mishra 2003, 2004), by Gustafson (Gustafson et al. 2002; Gustafson 2004) and
others (Bashir and Duffy 1997; Bennett and Wakefield 2001; Berry et al. 2002;
Dunson 2001; Moala and Baba 2003; Raghunathan and Siscovick 1998; Schmid
and Rosner 1993; Song et al. 2002; Whittaker et al. 2003).
Advantages of using MCMC within a Bayesian framework include the ability to
model measurement error correlation structures, extension of simple classical mea-
surement error models to include correlated person-specific biases and flattening of the
regression of X on W. A range of exposure distributions can be modelled, or a mixture
of distributions could be used if justified, e.g. for zero-inflated data (see Chap. 6).
Disadvantages of Bayesian methods include possible subjectivity in specification of
priors, potential lack of convergence to a stationary distribution, and length of time
taken to achieve convergence, given the heavy computational requirement.
MCMC methods are implemented in the BUGS software family (Classic BUGS,
WinBUGS, and OpenBUGS) (Spiegelhalter et al. 1996, 2007), JAGS (Plummer
2003), and for a limited range of models in MLwiN (Browne 2004; Rasbash
et al. 2004). Bayesian models are often described using directed acyclic graphs
(DAGs), and this graphical approach to modelling is described in detail in Chap. 9.
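To indicate how compactly the three submodels can be expressed in this framework, a classical measurement error model with replicate measures and a logistic disease model might be written for JAGS along the following lines; this is an illustrative sketch only, and the data layout, variable names and priors are assumptions rather than anything taken from the studies discussed here:

```r
# Sketch: classical measurement error model in JAGS, called from R via rjags
library(rjags)

model_string <- "
model {
  for (i in 1:n) {
    # Exposure model: latent true exposure X
    x[i] ~ dnorm(mu.x, tau.x)
    # Measurement model: replicate observed measures scatter around X
    for (j in 1:n.rep) {
      w[i, j] ~ dnorm(x[i], tau.u)
    }
    # Disease model: logistic regression on the latent true exposure
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta0 + betaX * x[i]
  }
  # Vague priors; prior information on tau.u could be substituted here
  mu.x  ~ dnorm(0, 1.0E-4)
  tau.x ~ dgamma(0.01, 0.01)
  tau.u ~ dgamma(0.01, 0.01)
  beta0 ~ dnorm(0, 1.0E-4)
  betaX ~ dnorm(0, 1.0E-4)
}"

# jags_data would supply n, n.rep, the replicate measures w and outcomes y:
# jm <- jags.model(textConnection(model_string), data = jags_data, n.chains = 2)
# samples <- coda.samples(jm, c("betaX", "tau.u"), n.iter = 10000)
```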
The importance of correcting for the effects of measurement error is well
demonstrated by research into the relationship between dietary fat and breast cancer
(Bingham et al. 2003). Researchers in Cambridge investigated this association
using both a food frequency questionnaire (FFQ) and a 7-day food diary.
FFQs are known to lack precision, but are often used in large cohort studies because
they are cheaper to administer, and to derive nutrient intakes from, for large numbers
of participants.
A nested case-control design was used, with four controls for every case,
matched on age and date of entry to the study. Conditional logistic regression
was then used to derive odds ratios (equivalent here to hazard ratios). After
adjustment for potential confounders, including energy intake from non-fat sources,
the FFQ measure of total fat yielded a hazard ratio of 1.06 (95% CI: 0.89–1.25) for
each fifth of total fat intake. The same analysis based on the food diary gave a
hazard ratio of 1.17 (95% CI: 1.00–1.36). For saturated fat, the contrast was just as
clear, with 1.10 (0.94–1.29) using the FFQ and 1.22 (1.06–1.40) using the diary.
With the less precise measure, estimates of the size of the association were small
and confidence intervals spanned the null, whilst using the more precise measure
the association appeared stronger and confidence intervals further from the null.
Yet food diaries themselves have been strongly criticised for containing a large
component of error, with that error correlated between repeat measures and
with other self-report tools such as FFQs (Day et al. 2004; Greenwood et al. 2006b;
Kipnis et al. 1999, 2001). The implication is that even food diaries yield strongly
biased estimates of diet-disease associations, possibly as much as halving the true
association. Only the use of unbiased, objective measures such as recovery
biomarkers, till receipts or household inventories can offer unbiased results.
Measurement error methods have only recently started to be used in major cohorts,
e.g. the Nurses’ Health Study (Spiegelman et al. 1997), EPIC (Ferrari et al. 2009;
Gonzalez et al. 2006a, b), The UK Women’s Cohort (Cade et al. 2007), and the
Centre for Nutritional Epidemiology in Cancer Prevention and Survival
(Dahm et al. 2010), but their use is not yet widespread. As their benefits are more
widely recognised and they become accepted in the clinical community, we will
begin to see less biased estimates of some difficult to measure environmental
exposures. Recent incorporation into standard software packages should also facil-
itate their use. Areas of further development may be specific to particular
exposures, such as development of a wider range of objective biomarkers of various
exposures, including dietary exposures.
References
Aiken, L. S., & West, S. G. (1991). Reliability and statistical power. In Multiple regression:
Testing and interpreting interactions (pp. 139–171). Newbury Park: Sage publications.
Armstrong, B. (1985). Measurement error in generalised linear models. Communications in
Statistics-Simulation and Computation, 14, 529–544.
Armstrong, B. G. (1998). Effect of measurement error on epidemiological studies of environmen-
tal and occupational exposures. Occupational and Environmental Medicine, 55, 651–656.
Bashir, S. A., & Duffy, S. W. (1997). The correction of risk estimates for measurement error.
Annals of Epidemiology, 7, 154–164.
Bennett, J., & Wakefield, J. (2001). Errors-in-variables in joint population pharmacokinetic/
pharmacodynamic modeling. Biometrics, 57, 803–812.
Berry, S. M., Carroll, R. J., & Ruppert, D. (2002). Bayesian smoothing and regression splines for
measurement error problems. Journal of the American Statistical Association, 97, 160–169.
Bingham, S. A., Luben, R., Welch, A., Wareham, N., Khaw, K. T., & Day, N. E. (2003).
Are imprecise methods obscuring a relation between fat and breast cancer? Lancet, 362,
212–214.
Bjork, J., & Stromberg, U. (2002). Effects of systematic exposure assessment errors in partially
ecologic case-control studies. International Journal of Epidemiology, 31, 154–160.
Brenner, H. (1993). Bias due to non-differential misclassification of polytomous confounders.
Journal of Clinical Epidemiology, 46, 57–63.
Elmstahl, S., & Gullberg, B. (1997). Bias in diet assessment methods–consequences of collinearity
and measurement errors on power and observed relative risks. International Journal of
Epidemiology, 26, 1071–1079.
Ferrari, P., Roddam, A., Fahey, M. T., Jenab, M., Bamia, C., Ocke, M., Amiano, P., Hjartaker, A.,
Biessy, C., Rinaldi, S., Huybrechts, I., Tjonneland, A., Dethlefsen, C., Niravong, M., Clavel-
Chapelon, F., Linseisen, J., Boeing, H., Oikonomou, E., Orfanos, P., Palli, D., de Santucci, M.,
Bueno-de-Mesquita, H. B., Peeters, P. H., Parr, C. L., Braaten, T., Dorronsoro, M., Berenguer,
T., Gullberg, B., Johansson, I., Welch, A. A., Riboli, E., Bingham, S., & Slimani, N. (2009).
A bivariate measurement error model for nitrogen and potassium intakes to evaluate the
performance of regression calibration in the European Prospective Investigation into Cancer
and Nutrition study. European Journal of Clinical Nutrition, 63, S179–S187.
Flegal, K. M. (1999). Evaluating epidemiologic evidence of the effects of food and nutrient
exposures. American Journal of Clinical Nutrition, 69, 1339S–1344S.
Freedman, L. S., Schatzkin, A., & Wax, Y. (1990). The impact of dietary measurement error on
planning sample size required in a cohort study. American Journal of Epidemiology, 118,
1185–1195.
Freedman, L. S., Schatzkin, A., Thiebaut, A. C. M., Potischman, N., Subar, A. F., Thompson, F. E.,
& Kipnis, V. (2007). Abandon neither the food frequency questionnaire nor the dietary fat-
breast cancer hypothesis. Cancer Epidemiology, Biomarkers & Prevention, 16, 1321–1322.
Fuller, W. A. (1987). Measurement error models. New York: Wiley.
Gilks, W. R., & Richardson, S. (1992). Analysis of disease risks using ancillary risk factors, with
application to job-exposure matrices. Statistics in Medicine, 11, 1443–1463.
Gladen, B., & Rogan, W. J. (1979). Misclassification and the design of environmental studies.
American Journal of Epidemiology, 109, 607–616.
Gleser, L. J. (1990). Improvements of the naive approach to estimation in nonlinear errors-in-
variables regression models. In P. J. Brown & W. A. Fuller (Eds.), Statistical analysis of
measurement error models and applications (pp. 99–114). Providence: American Mathemati-
cal Society.
Gonzalez, C. A., Jakszyn, P., Pera, G., Agudo, A., Bingham, S., Palli, D., Ferrari, P., Boeing, H.,
Del Giudice, G., Plebani, M., Carneiro, F., Nesi, G., Berrino, F., Sacerdote, C., Tumino, R.,
Panico, S., Berglund, G., Siman, H., Nyren, O., Hallmans, G., Martinez, C., Dorronsoro, M.,
Barricarte, A., Navarro, C., Quiros, J. R., Allen, N., Key, T. J., Day, N. E., Linseisen, J., Nagel,
G., Bergmann, M. M., Overvad, K., Jensen, M. K., Tjonneland, A., Olsen, A.,
Bueno-de-Mesquita, H. B., Ocke, M., Peeters, P. H., Numans, M. E., Clavel-Chapelon, F.,
Boutron-Ruault, M. C., Trichopoulou, A., Psaltopoulou, T., Roukos, D., Lund, E., Hemon, B.,
Kaaks, R., Norat, T., & Riboli, E. (2006a). Meat intake and risk of stomach and esophageal
adenocarcinoma within the European Prospective Investigation into Cancer and Nutrition
(EPIC). Journal of the National Cancer Institute, 98, 345–354.
Gonzalez, C. A., Pera, G., Agudo, A., Bueno-de-Mesquita, H. B., Ceroti, M., Boeing, H., Schulz,
M., Del Giudice, G., Plebani, M., Carneiro, F., Berrino, F., Sacerdote, C., Tumino, R., Panico,
S., Berglund, G., Siman, H., Hallmans, G., Stenling, R., Martinez, C., Dorronsoro, M.,
Barricarte, A., Navarro, C., Quiros, J. R., Allen, N., Key, T. J., Bingham, S., Day, N. E.,
Linseisen, J., Nagel, G., Overvad, K., Jensen, M. K., Olsen, A., Tjonneland, A., Buchner, F. L.,
Peeters, P. H., Numans, M. E., Clavel-Chapelon, F., Boutron-Ruault, M. C., Roukos, D.,
Trichopoulou, A., Psaltopoulou, T., Lund, E., Casagrande, C., Slimani, N., Jenab, M., &
Riboli, E. (2006b). Fruit and vegetable intake and the risk of stomach and oesophagus
adenocarcinoma in the European Prospective Investigation into Cancer and Nutrition
(EPIC-EURGAST). International Journal of Cancer, 118, 2559–2566.
Greenland, S. (1980). The effect of misclassification in the presence of covariates. American
Journal of Epidemiology, 112, 564–569.
Greenwood, D. C., Gilthorpe, M. S., & Cade, J. E. (2006a). The impact of imprecisely measured
covariates on estimating gene-environment interactions. BMC Medical Research Methodol-
ogy, 6, 21.
Greenwood, D. C., Ransley, J. K., Gilthorpe, M. S., & Cade, J. E. (2006b). Use of itemized till
receipts to adjust for correlated dietary measurement error. American Journal of Epidemiology,
164, 1012–1018.
Gustafson, P. (2004). Measurement error and misclassification in statistics and epidemiology:
Impacts and Bayesian adjustments. London: Chapman & Hall.
Gustafson, P., Le, N. D., & Vallee, M. (2002). A Bayesian approach to case-control studies with
errors in covariables. Biostatistics, 3, 229–243.
Hardin, J. W., Schmiediche, H., & Carroll, R. J. (2003a). The regression-calibration method for
fitting generalized linear models with additive measurement error. The Stata Journal, 3,
361–372.
Hardin, J. W., Schmiediche, H., & Carroll, R. J. (2003b). The simulation extrapolation method for
fitting generalized linear models with additive measurement error. The Stata Journal, 3,
373–385.
Huang, L. S., Wang, H. K., & Cox, C. (2005). Assessing interaction effects in linear measurement
error models. Journal of the Royal Statistical Society Series C-Applied Statistics, 54, 21–30.
Hughes, M. D. (1993). Regression dilution in the proportional hazards model. Biometrics, 49,
1056–1066.
Kaaks, R., Riboli, E., & van Staveren, W. A. (1995). Calibration of dietary-intake measurements in
prospective cohort studies. American Journal of Epidemiology, 142, 548–556.
Kipnis, V., Carroll, R. J., Freedman, L. S., & Li, L. (1999). Implications of a new dietary
measurement error model for estimation of relative risk: Application to four calibration studies.
American Journal of Epidemiology, 150, 642–651.
Kipnis, V., Midthune, D., Freedman, L. S., Bingham, S., Schatzkin, A., Subar, A., & Carroll, R. J.
(2001). Empirical evidence of correlated biases in dietary assessment instruments and its
implications. American Journal of Epidemiology, 153, 394–403.
Kipnis, V., Subar, A. F., Midthune, D., Freedman, L. S., Ballard-Barbash, R., Troiano, R. P.,
Bingham, S., Schoeller, D. A., Schatzkin, A., & Carroll, R. J. (2003). Structure of dietary
measurement error: Results of the OPEN Biomarker Study. American Journal of Epidemiol-
ogy, 158, 14–21.
Kuha, J. (1994). Corrections for exposure measurement error in logistic-regression models with an
application to nutritional data. Statistics in Medicine, 13, 1135–1148.
Kuha, J. (1997). Estimation by data augmentation in regression models with continuous and
discrete covariates measured with error. Statistics in Medicine, 16, 189–201.
Longford, N. T. (2001). Multilevel analysis with messy data. Statistical Methods in Medical
Research, 10, 429–444.
Lyles, R. H., Zhang, F., & Drews-Botsch, C. (2007). Combining internal and external validation
data to correct for exposure misclassification: A case study. Epidemiology, 18, 321–328.
Moala, F. A., Baba, M. Y. (2003). Bayesian analysis of the simple linear regression with
measurement errors. Practical Bayesian Statistics 5 Conference, Milton Keynes.
Murad, H., & Freedman, L. S. (2007). Estimating and testing interactions in linear regression
models when explanatory variables are subject to classical measurement error. Statistics in
Medicine, 26(23), 4293–4310.
Phillips, A. N., & Smith, G. D. (1991). How independent are independent effects – Relative risk-
estimation when correlated exposures are measured imprecisely. Journal of Clinical Epidemi-
ology, 44, 1223–1231.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs
sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing
(DSC 2003), March 20–22, Vienna, Austria.
Prentice, R. L. (1982). Covariate measurement errors and parameter estimation in failure time
regression models. Biometrika, 69, 331–342.
Rabe-Hesketh, S., Pickles, A., & Skrondal, A. (2001). GLLAMM manual: Technical report 2001/
01. London: Department of Biostatistics and Computing, Institute of Psychiatry, King’s
College, University of London.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2002). Reliable estimation of generalised linear
mixed models using adaptive quadrature. The Stata Journal, 2, 1–21.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2003). Maximum likelihood estimation of
generalized linear models with covariate measurement error. The Stata Journal, 3, 386–411.
Raghunathan, T. E., & Siscovick, D. S. (1998). Combining exposure information from various
sources in an analysis of a case-control study. Journal of the Royal Statistical Society Series D-
the Statistician, 47, 333–347.
Ransley, J. K., Donnelly, J. K., Khara, T. N., Botham, H., Arnot, H., Greenwood, D. C., & Cade, J. E.
(2001). The use of supermarket till receipts to determine the fat and energy intake in a UK
population. Public Health Nutrition, 4, 1279–1286.
Ransley, J. K., Donnelly, J. K., Botham, H., Khara, T. N., Greenwood, D. C., & Cade, J. E. (2003).
Use of supermarket receipts to estimate energy and fat content of food purchased by lean and
overweight families. Appetite, 41, 141–148.
Rasbash, J., Steele, F., Browne, W., & Prosser, B. (2004). A user’s guide to MLwiN version 2.0.
London: Institute of Education/University of London.
Richardson, D. B., & Ciampi, A. (2003). Effects of exposure measurement error when an exposure
variable is constrained by a lower limit. American Journal of Epidemiology, 157, 355–363.
Richardson, S. (1996). Measurement error. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter
(Eds.), Markov chain Monte Carlo in practice (pp. 401–417). London: Chapman & Hall.
Richardson, S., & Best, N. (2003). Bayesian hierarchical models in ecological studies of health-
environment effects. Environmetrics, 14, 129–147.
Richardson, S., & Gilks, W. R. (1993a). A Bayesian approach to measurement error problems in
epidemiology using conditional independence models. American Journal of Epidemiology,
138, 430–442.
Richardson, S., & Gilks, W. R. (1993b). Conditional independence models for epidemiological
studies with covariate measurement error. Statistics in Medicine, 12, 1703–1722.
Richardson, S., Leblond, L., Jaussent, I., & Green, P. J. (2002). Mixture models in measurement
error problems, with reference to epidemiological studies. Journal of the Royal Statistical
Society: Series A (Statistics in Society), 165, 549–566.
Rippin, G. (2001). Design issues and sample size when exposure measurement is inaccurate.
Methods of Information in Medicine, 40, 137–140.
Rosner, B., Willett, W. C., & Spiegelman, D. (1989). Correction of logistic regression relative risk
estimates and confidence intervals for systematic within-person measurement error. Statistics
in Medicine, 8, 1051–1069.
Rosner, B., Spiegelman, D., & Willett, W. C. (1990). Correction of logistic-regression relative risk
estimates and confidence-intervals for measurement error – The case of multiple covariates
measured with error. American Journal of Epidemiology, 132, 734–745.
Rosner, B., Spiegelman, D., & Willett, W.C. (1992). Correction of logistic regression relative risk
estimates and confidence intervals for random within person measurement error. American
Journal of Epidemiology, 136, 1400–1413.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schatzkin, A., Midthune, D., Subar, A., Thompson, F., & Kipnis, V. (2001a). The national
institutes of health-American association of retired persons (NIH-AARP) diet and health
study: Power to detect diet-cancer associations after adjusting for measurement error.
American Journal of Epidemiology, 153, 966.
Schatzkin, A., Subar, A. F., Thompson, F. E., Harlan, L. C., Tangrea, J., Hollenbeck, A. R.,
Hurwitz, P. E., Coyle, L., Schussler, N., Michaud, D. S., Freedman, L. S., Brown, C. C.,
Midthune, D., & Kipnis, V. (2001b). Design and serendipity in establishing a large cohort with
wide dietary intake distributions – The National Institutes of Health-American Association of
Retired Persons Diet and Health Study. American Journal of Epidemiology, 154, 1119–1125.
Schatzkin, A., Kipnis, V., Carroll, R. J., Midthune, D., Subar, A. F., Bingham, S., Schoeller, D. A.,
Troiano, R. P., & Freedman, L. S. (2003). A comparison of a food frequency questionnaire with
a 24-hour recall for use in an epidemiological cohort study: Results from the biomarker-based
Observing Protein and Energy Nutrition (OPEN) study. International Journal of Epidemiol-
ogy, 32, 1054–1062.
Schmid, C. H., & Rosner, B. (1993). A Bayesian approach to logistic regression models having
measurement error following a mixture distribution. Statistics in Medicine, 12, 1141–1153.
Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modeling. London: Chap-
man & Hall.
Song, X., Davidian, M., & Tsiatis, A. A. (2002). An estimator for the proportional hazards model
with multiple longitudinal covariates measured with error. Biostatistics, 3, 511–528.
Sorahan, T., & Gilthorpe, M. S. (1994). Non-differential misclassification of exposure always
leads to an underestimate of risk: An incorrect conclusion. Occupational and Environmental
Medicine, 51, 839–840.
Spiegelhalter, D. J., Thomas, A., Best, N. G., & Gilks, W. (1996). BUGS 0.5: Bayesian inference
using Gibbs sampling manual. Cambridge: MRC Biostatistics Unit.
Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2004). WinBUGS user manual: Version
1.4.2. Cambridge: MRC Biostatistics Unit.
Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2007). WinBUGS user manual: Version
1.4.3. Cambridge: MRC Biostatistics Unit.
Spiegelman, D., McDermott, A., & Rosner, B. (1997). Regression calibration method for
correcting measurement-error bias in nutritional epidemiology. American Journal of Clinical
Nutrition, 65, S1179–S1186.
Spiegelman, D., Zhao, B., & Kim, J. (2005). Correlated errors in biased surrogates: Study designs
and methods for measurement error correction. Statistics in Medicine, 24, 1657–1682.
StataCorp. (2005). Stata statistical software: Release 9.2. College Station: Stata Corporation.
Stefanski, L. A., & Carroll, R. J. (1985). Covariate measurement error in logistic regression.
The Annals of Statistics, 13, 1335–1351.
Subar, A. F., Kipnis, V., Troiano, R. P., Midthune, D., Schoeller, D. A., Bingham, S., Sharbaugh,
C. O., Trabulsi, J., Runswick, S., Ballard-Barbash, R., Sunshine, J., & Schatzkin, A. (2003).
Using intake biomarkers to evaluate the extent of dietary misreporting in a large sample of
adults: The OPEN study. American Journal of Epidemiology, 158, 1–13.
Thomas, D., Stram, D. O., & Dwyer, J. H. (1993). Exposure measurement error: Influence on
exposure-disease relationships and methods of correction. Annual Review of Public Health, 14,
69–93.
Wang, N., Carroll, R. J., & Liang, K. Y. (1996). Quasilikelihood estimation in measurement error
models with correlated replicates. Biometrics, 52, 401–411.
Wang, N. Y., Lin, X. H., Gutierrez, R. G., & Carroll, R. J. (1998). Bias analysis and SIMEX
approach in generalized linear mixed measurement error models. Journal of the American
Statistical Association, 93, 249–261.
Weinberg, C. R., Umbach, D. M., & Greenland, S. (1994). When will nondifferential misclassi-
fication of an exposure preserve the direction of a trend? American Journal of Epidemiology,
140, 565–571.
Weller, E., Milton, D., Eisen, E., Spiegelman, D. (2007). Method in regression calibration for
logistic regression with multiple surrogates for one exposure. Journal of Statistical Planning
and Inference, 137, 449–461.
White, I. R. (2006). Commentary: Dealing with measurement error: multiple imputation or
regression calibration? International Journal of Epidemiology, 35, 1081–1082.
White, E., Kushi, L. H., & Pepe, M. S. (1994). The effect of exposure variance and exposure
measurement error on study sample-size – Implications for the design of epidemiologic
studies. Journal of Clinical Epidemiology, 47, 873–880.
White, I., Frost, C., & Tokunaga, S. (2001). Correcting for measurement error in binary and
continuous variables using replicates. Statistics in Medicine, 20, 3441–3457.
Whittaker, H., Best, N., & Nieuwenhuijsen, M. (2003). Modelling exposure estimates for an
epidemiological study of disinfection by-products in drinking water and adverse birth
outcomes. Practical Bayesian Statistics 5 Conference, Milton Keynes.
Whittemore, A. S. (1989). Errors in variables regression using Stein estimates. The American
Statistician, 43, 226–228.
Wikipedia contributors. (2007). Regression dilution. Wikipedia, The Free Encyclopedia.
http://en.wikipedia.org/w/index.php?title=Regression_dilution&oldid=159186428. Accessed 2 Nov 2007.
Wong, M. Y., Day, N. E., Bashir, S. A., & Duffy, S. W. (1999a). Measurement error in epidemi-
ology: The design of validation studies – I: Univariate situation. Statistics in Medicine, 18,
2815–2829.
Wong, M. Y., Day, N. E., & Wareham, N. J. (1999b). Measurement error in epidemiology:
The design of validation studies – II: Bivariate situation. Statistics in Medicine, 18, 2831–2845.
Wong, M. Y., Day, N. E., Luan, J. A., Chan, K. P., & Wareham, N. J. (2003). The detection of
gene-environment interaction for continuous traits: Should we deal with measurement error by
bigger studies or better measurement? International Journal of Epidemiology, 32, 51–57.
Wong, M. Y., Day, N. E., Luan, J. A., & Wareham, N. J. (2004). Estimation of magnitude in
gene-environment interactions in the presence of measurement error. Statistics in Medicine, 23,
987–998.
Chapter 4
Selection Bias in Epidemiologic Studies
4.1 Introduction
There are three main types of error recognised in epidemiology: random error or
chance, confounding, and systematic error. Random error is the most widely-
understood form of error, and is introduced through the inevitable use of random
samples from the population. Confidence intervals and statistical analysis have for a
long time been used to address this form of error. The epidemiologist seeks also to
address the other forms of error and associated potential bias. A substantive
potential for bias can arise from what is characterised as confounding,
when associations are inappropriately attributed to an exposure of
interest rather than to some other characteristic of the study participants.
The identification and control of confounding were described in Chap. 1 and are
further explored in Chap. 11.
The third source of error, bias due to systematic error, has traditionally been
characterised by the cause of the bias. Bias due to errors in measuring the observed
exposure or outcome, known as information bias, can have many causes. Chapter 2
discusses bias introduced through missing data, an extreme form of information
bias. Chapter 3 discusses a subtler form of information bias, caused by errors
in measuring the exposure or confounders. A more widely recognised source
of information bias is differential recall, which arises where people with different
outcomes remember information in different ways; this can be controlled and
reduced through the design and implementation of a study.
Fig. 4.1 Directed acyclic graphs representing two scenarios: (i) one where an auxiliary factor (A)
is not a confounder for the exposure, and (ii) one where it is. Panel rows: exposure causes the
outcome (a, b); exposure does not cause the outcome (c, d); case-control study with selection
bias² (e, f); prospective cohort study¹ (g, h). Key to variables: E exposure, A auxiliary, O outcome,
OS sampled outcome, S selection, U unmeasured; ¹balanced case and control sampling;
²differential control sampling
$$O \not\perp\!\!\!\perp E \qquad (4.1)$$
$$E \not\perp\!\!\!\perp A \qquad (4.2)$$
A confounder may be present in any study, regardless of the study design.
To consider a variable to be a confounder, it must be: (i) a cause, or a proxy of a
cause, of the disease (ii) correlated with the exposure under study (iii) unaffected by
the exposure, that is not on the causal pathway from exposure to the outcome (Tu
et al. 2004). When these criteria are satisfied, a confounder is present, indicating
that adjustment should be made within the analysis. DAGs are a particularly useful
tool in identifying a confounder. If a backdoor path exists between the outcome and
the exposure, via an auxiliary variable, then that variable should be considered a
confounder (Greenland et al. 1999; McNamee 2003). Figure 4.1b shows such a
situation, where the auxiliary variable is a cause of the outcome (O), thus leading to
the conclusion that A is a confounder. Again, further to Eqs. 4.1 and 4.2:
$$O \not\perp\!\!\!\perp A \qquad (4.3)$$
In order to consider selection bias, Geneletti and colleagues (2009) imagine that in
the study population there is no causal link between the exposure and the outcome:
$$O \perp\!\!\!\perp E \qquad (4.4)$$
However, conditioning on selection into the study can nevertheless induce an association
between exposure and outcome:
$$O \not\perp\!\!\!\perp E \mid S = 1 \qquad (4.5)$$
This is key to describing the results from many case-control studies where an
association is shown but it can be explained through effects due to selection. In this
situation some epidemiologists incorrectly believe that statistical adjustment for A
will remove the bias due to selection. However this is a simplification of a more
complex situation; this will be developed later.
Figure 4.1f shows a similar situation to Fig. 4.1e, though this time the A variable
can be considered to be a true confounder. This is not due to conditioning on
selection, but A is a separate and distinct cause of the outcome. In this situation
making statistical adjustment for A would be appropriate for interpreting whether E
causes the outcome O.
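These graphical arguments can be checked mechanically in R with the dagitty package; the sketch below is an illustration rather than part of the analysis described in this chapter, with the first DAG mirroring the confounding structure of Fig. 4.1f and the second a generic selection (collider) example:

```r
# Sketch: querying DAGs for adjustment sets and selection effects with dagitty
library(dagitty)

# A is a common cause of E and O (a true confounder), and selection S depends on O
g1 <- dagitty("dag { A -> E ; A -> O ; E -> O ; O -> S }")
adjustmentSets(g1, exposure = "E", outcome = "O")   # returns { A }

# Generic selection example: E and O both affect selection S, with no E -> O effect
g2 <- dagitty("dag { E -> S ; O -> S }")
dseparated(g2, "E", "O", list())   # TRUE: E and O independent in the population
dseparated(g2, "E", "O", "S")      # FALSE: conditioning on selection induces association
```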
The case-control study is a popular design, particularly for rare diseases, to investi-
gate putative risk factors in epidemiology. A “target-population” (Geneletti
et al. 2009) is defined and the whole, or a random sample of the whole, population
with a disease (cases) arising within that population are targeted for study: selection
for a case-control study follows the diagnosis of disease in the cases. For two
reasons selection bias is a particularly important issue for case-control studies:
(i) the direct participation of the study subjects is usually required, both to obtain data
on disease, risk factors and other auxiliary factors, and for ethical reasons; and (ii) cases
are usually selected through a health system, such as clinic lists for the relevant
disease, whilst controls are usually selected from a different sampling frame. When
the researcher is aiming to recruit controls that represent the general population, a
population register will be employed, such as birth registers.
If the case and control selection procedures produce a representative sample of
the target-population, whereby selection bias is not present, the DAGs for the case-
control study are an exact copy of those defined for the population, as shown in
Fig. 4.1a, b. In practice, the probability that any individual, selected for inclusion in
a study, will participate is usually higher for cases than for those without the disease
(controls). It has been recognised that the effect of such differential or non-random
sampling, and the bias it introduces, must be taken into account in the analysis of a
case-control study. This is represented in Fig. 4.1e, f, dependent upon whether A is
a true confounder or not.
In a prospective cohort study, a sample from the target-population is made and the
exposure measured, in advance of any diagnosis. After an appropriate length of
follow-up time, the proportion of cases of the disease can be compared with respect
to exposure. Every person in the target-population has equal probability of
being approached at the outset: selection would not be conditioned on outcome (O).
Figures 4.1g, h show the situation where selection is not associated with the
outcome. This situation would be normal for prospective studies such as a cohort
study. Within the cohort sample, conditioned on S, there would not be an arc
between O and S. For this reason Eq. 4.4 holds.
Fig. 4.2 Proportion of controls participating by deprivation, categorised into control choice groups
sample, with selection bias, cannot be recovered by statistical adjustment for the
auxiliary variable.
When a study is conducted it is normal for the researchers to pick a sample of
controls and approach them for participation. We can refer to these as “first-choice
controls” (Law et al. 2002, 2003); subsequent choices of controls will be made until the
required number of controls is recruited. It has been suggested that only participating
first-choice controls could be used. Figure 4.2 shows the results for participation of
controls in the UKCCS (the United Kingdom Childhood Cancer Study), fitted by a
Lowess smoothed line (Cleveland and Devlin 1988). It is clear that the association
between participation and deprivation does not differ by the choice of controls selected.
This is intuitively correct, as the control selection is an exchangeable process, where
controls selected later in the process possess the same characteristics as those selected earlier.
First-choice controls have been used; for example, Law and colleagues used all
first-choice controls without requiring their participation (Law et al. 2003). This is
because the exposure measures were derived from postal addresses linked to
national census data. These controls should be a representative sample of the
target-population, given that the initial selection procedures were robust.
One approach to dealing with selection bias is to ‘adjust’ for variables thought to
cause, or influence, selection in a statistical model. Some authors have argued that
adjusting for selection bias is identical to the process of adjusting for confounding:
the variables associated with selection are incorporated into a regression model
(e.g. Breslow and Day, 1980).
To investigate this, a simulation study was conducted in R (R Development Core
Team 2004) to examine the effect of adjusting for a variable that is related to selection.
Fig. 4.3 Odds ratios (points) and 95% empirical confidence intervals (vertical bars) for biased and
unbiased simulated case-control studies (1,000 cases fixed, 2,000 controls, sampled 1,000 times)
applied to two populations (10^6 individuals, 1,000 cases, sampled once), contrasted with the 'true'
population relative risk (thick solid line). (i) Auxiliary (A) not a confounder, but associated with
exposure (E). (ii) Auxiliary (A) a confounder
The results support the assertion that attempting to recover the relative risk estimate for the scenario
where A was not a true confounder, through statistical adjustment for A, fails
to obtain an odds ratio close to the original relative risk (Fig. 4.3(i), biased model
E + A), unless the true relative risk is close to unity. We must, therefore,
conclude that we cannot statistically 'adjust' for the auxiliary variable's impact
on selection bias.
When A was a true confounder, the conclusion was quite different. The simulation
study showed that the biased models overestimated the relative risk (Fig. 4.3(ii),
biased model E). In contrast to the first scenario, statistical adjustment by inclusion
of A in the model accurately and consistently recovered the population relative risk
(Fig. 4.3(ii), biased model E + A).
(ii) E ⊥ S | (O, A). This states that, conditional on the outcome within strata of the auxiliary,
exposure is independent of selection for the study. This requires the auxiliary
to be stratified into categories.
(iii) Additional data on the bias-breaking variable must be available to the
researcher, from outside the study, so that the distribution of the bias breaker in
relation to the outcome can be obtained.
Where there is no case selection bias, it is possible to use the distribution of A for
both cases and controls to estimate the overall distribution of A, regardless of
outcome status. Further details on estimating the unbiased odds ratio are given by
Geneletti et al. (2009).
In the survey literature, the issue of differential selection, and the bias that may
arise, is known as informative sampling. The issues arising in survey sampling are
analogous to the selection of cases and controls in a case-control study.
Informative sampling has a long history in the literature of complex social
surveys (see, for example, Rubin 1976). The bias it can cause, if not taken account
of, is widely acknowledged (for example, Pfeffermann 1996). One approach
to identifying this bias is to model the density of the sample data as a function
of the population density and the probability of inclusion in the sample.
Unless P(i ∈ s | Y_i, x_i) = P(i ∈ s | x_i) for all possible values of Y, the sample and
population densities differ and the sampling is said to be informative. In the context
of case-control studies, Y = 1 would correspond to cases and Y = 0 to controls.
The approaches taken by Pfeffermann and Sverchkov (Pfeffermann 1996;
Pfeffermann and Sverchkov 2003) to fit generalised linear models between
response Y and predictors X_1, ..., X_k rest on re-expressing the population density f_U(·)
in terms of the sample density f_S(·), assuming known forms for P(i ∈ s | Y_i, x_i) and P(i ∈ s | x_i).
These methods are yet to be fully exploited in epidemiological studies, though
Samuelsen and colleagues discuss the use of stratification and sampling weights in
case-cohort studies (Samuelsen et al. 2007). The non-cases are divided into strata
according to the values of covariates, and the probability of sampling an individual
from each stratum is included as an inverse probability weight in the parameter
estimation process.
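As a hedged illustration of the weighting idea (not code from the studies cited), inverse probability of sampling weights can be incorporated in R via the survey package; the data frame and variable names below are hypothetical.

```r
# Sketch: down-weight over-sampled strata using the inverse of the (known or
# estimated) probability of being sampled; design-based standard errors come
# from the 'survey' package. 'd', 'prob_sampled', 'case' and 'exposure' are
# illustrative names only.
library(survey)
d$ipw <- 1 / d$prob_sampled
des   <- svydesign(ids = ~1, weights = ~ipw, data = d)
fit   <- svyglm(case ~ exposure, design = des, family = quasibinomial())
summary(fit)
```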
4.7 Conclusions
We have shown how a biased sampling regime within a study may lead to a biased
estimate of the relative risk for an exposure. We agree with Hernán and colleagues
that authors should be encouraged to approach modelling in a structured way,
allowing readers to assess the likelihood of selection biasing the exposure risk
estimates, and the potential success of any statistical adjustments (Hernan et al.
2004). Using the case-control sample as the ‘oracle’ for defining potential causal
pathways for variables at the population level, one might erroneously conclude that
A is a confounder when it is not. This may be due wholly to the influence of
differential participation between cases and controls. The decision to assign an
auxiliary variable as a confounder may be complicated by the evidence from the
case-control study that suffers from selection bias; an association between A and O
may be present. It is worth noting that considering all possible auxiliaries as
confounders, when they may not all have a causal relationship with the outcome,
seems erroneous even though this practice appears to be commonly employed.
Case-control studies reliant upon individual participation suffer from selection
bias (Law et al. 2002). In the example of Sect. 4.1, a variable identified as causing
selection, socioeconomic status, was not a true confounder. As a consequence, it is
not possible to recover the population relative risk simply by statistical adjustment for that variable.
References
Breslow, N. E., & Day, N. E. (1980). Statistical methods in cancer research: Vol. 1. The analysis of
case-control studies. Lyon: IARC.
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression
analysis by local fitting. Journal of the American Statistical Association, 83, 596–610.
Galea, S., & Tracy, M. (2007). Participation rates in epidemiologic studies. Annals of Epidemiol-
ogy, 17(9), 643–653. available from: ISI:000249293100001.
Geneletti, S., Richardson, S., & Best, N. (2009). Adjusting for selection bias in retrospective, case-
control studies. Biostatistics, 10(1), 17–31. available from: PM:18482997.
Greenland, S., & Brumback, B. (2002). An overview of relations among causal modelling methods.
International Journal of Epidemiology, 31(5), 1030–1037. available from: PM:12435780.
Greenland, S., Pearl, J., & Robins, J. M. (1999). Causal diagrams for epidemiologic research.
Epidemiology, 10(1), 37–48. available from: PM:9888278.
Gunby, J. A., Darby, S. C., Miles, J. C. H., Green, B. M. R., & Cox, D. R. (1993). Factors affecting
indoor radon concentrations in the United-Kingdom. Health Physics, 64(1), 2–12. available
from: ISI:A1993KD26400002.
Henderson, M., & Page, L. (2007). Appraising the evidence: What is selection bias? Evidence-
Based Mental Health, 10(3), 67–68. available from: PM:17652553.
Hernan, M. A., Hernandez-Diaz, S., & Robins, J. M. (2004). A structural approach to selection
bias. Epidemiology, 15(5), 615–625. available from: PM:15308962.
Law, G. R., Smith, A. G., & Roman, E. (2002). The importance of full participation: Lessons from
a national case-control study. British Journal of Cancer, 86(3), 350–355. available
from: PM:11875698.
Law, G. R., Parslow, R. C., & Roman, E. (2003). Childhood cancer and population mixing.
American Journal of Epidemiology, 158(4), 328–336. available from: PM:12915498.
McNamee, R. (2003). Confounding and confounders. Occupational and Environmental Medicine,
60(3), 227–234. available from: PM:12598677.
Pearl, J. (2000). Causality: Models, reasoning and inference. New York: Cambridge University
Press.
Pfeffermann, D. (1996). The use of sampling weights for survey data analysis. Statistical Methods
in Medical Research, 5(3), 239–261. available from: PM:8931195.
Pfeffermann, D., & Sverchkov, M. Y. U. (2003). Fitting generalised linear models under informa-
tive sampling. In R. L. Chambers & C. J. Skinner (Eds.), Analysis of survey data (pp. 175–195).
Chichester: Wiley.
R Development Core Team. (2004). R: A language and environment for statistical computing.
Vienna: R Foundation for Statistical Computing.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–590. available
from: ISI:A1976CP66700021.
Samuelsen, S. O., Anestad, H., & Skrondal, A. (2007). Stratified case-cohort analysis of general
cohort sampling designs. Scandinavian Journal of Statistics, 34(1), 103–119. available
from: ISI:000244852300008.
Schlesselman, J. J. (1982). Case-control studies: Design, conduct, analysis. Oxford: Oxford
University Press.
Smith, A. G., Fear, N. T., Law, G. R., & Roman, E. (2004). Representativeness of samples from
general practice lists in epidemiological studies: Case-control study. BMJ, 328(7445), 932.
available from: PM:14990513.
The UK Childhood Cancer Study Investigators. (2000). The United Kingdom Childhood Cancer
Study: Objectives, materials and methods. UK Childhood Cancer Study Investigators.
British Journal of Cancer, 82(5), 1073–1102. available from: PM:10737392.
The UK Childhood Cancer Study Investigators. (2002). The United Kingdom Childhood Cancer
Study of exposure to domestic sources of ionising radiation: 1: Radon gas. British Journal of
Cancer, 86(11), 1721–1726. available from: PM:12087456.
Tu, Y.-K., West, R. W., Ellison, G. D. H., & Gilthorpe, M. S. (2004). Why evidence for the fetal
origins of adult disease can be statistical artifact: The reversal paradox examined for hyperten-
sion. American Journal of Epidemiology, 161(1), 27–32.
Chapter 5
Multilevel Modelling
Andrew Blance
A. Blance (*)
Division of Biostatistics, Centre for Epidemiology and Biostatistics, Leeds Institute
of Genetics, Health & Therapeutics, University of Leeds, Leeds, UK
e-mail: [email protected]

5.1 Introduction
As we age, the gum around our teeth recedes resulting in attachment loss. This gives
the appearance of longer teeth, hence the saying “long in the tooth”. The rate of
attachment loss varies from person to person and is influenced by (amongst other
things) the standard of oral hygiene. The illustrative example used throughout this
chapter consists of observations of (clinical) attachment loss (CAL) measured in
millimeters around each tooth, up to 28 teeth per individual (all teeth excluding
third molars, i.e. wisdom teeth). The data structure, or hierarchy, can be depicted as
in Fig. 5.1.
5.1.2 Independence
If the data were analysed with the number of independent observations correctly
taken to be 1,000, the standard errors would be proportional to 1/√1000 ≈ 1/32.
Therefore, ignoring the inherent hierarchy in this instance yields standard errors that
are considerably smaller than they should be (by a factor of 32/134, roughly a quarter
of their correct size). This also increases the potential for Type I errors: incorrectly
identifying a covariate as influencing the outcome of interest.
Ignoring the lack of independence, what might be termed the 'ostrich' approach of
burying one's head in the sand, is clearly not an acceptable solution. A work-around
would be to sample just one lower-level unit per highest-level unit (one attachment
loss measurement per individual). Alternatively, a subgroup analysis could be performed
(e.g. all teeth analysed for each individual separately), yielding as many separate analyses as
there are individuals. If, and only if, the data are balanced (equal numbers of identical
teeth per individual), the hierarchy can initially be ignored in estimating coefficients,
and correct coefficient standard errors can subsequently be obtained by making the
appropriate adjustment to account for the hierarchy. All of these are work-arounds to
the lack of independence. They render the statistical analyses valid, but they fail to get
the most out of the data because much detailed information is lost. Further, they can
lead to more general problems, such as drawing inferences about lower-level units
from higher-level analyses, thereby running the risk of committing the ecological
fallacy (Bland 2000).
A better solution is to use statistical methods that yield robust
standard errors by dealing with the hierarchy explicitly. These methods can be
viewed as belonging to one of two groups: (i) those that treat the hierarchy as a
nuisance, or (ii) those that treat the hierarchy as a special feature to be exploited.
Examples of the former are Generalised Estimating Equations (Ziegler et al. 1998)
and Sandwich Estimates (Qian and Wang 2001; White 1980). An example of the
latter is what is now termed multilevel modelling (UK) (Leyland and Goldstein
2001) or hierarchical linear modelling (US) (Raudenbush and Bryk 2002).
Framing random structure firmly within a hierarchical context gives rise to major
benefits (Gilthorpe and Cunningham 2000; Quene and van den Bergh 2008).
Not least, it provides a naturally intuitive understanding. For example, patients
are often the unit of concern, yet observations are frequently made at a lower level.
Further, in contrast to techniques that treat clustering as a nuisance to derive robust
estimates of the fixed effects, a key feature of MLM is that it provides insights into
random effects. The full power of MLM in exploiting random structure will only be
realised for research questions posed such that the random effects provide the
research answer.
Partitioning the total variance by its source yields insight as to the relative weights
of each source. Consider the 2-level hierarchy of the periodontal example where
teeth are clustered within individuals. Figure 5.2 illustrates the variation in attach-
ment loss amongst 140 teeth. The 1,400 observations relate to just 5 individuals as
illustrated in Fig. 5.3.
Some of the variation is due to variation between individuals and some within
individuals. Thus the total variance can be partitioned according to that which is
attributable to teeth and individual. This is known as variance components and
represents the simplest MLM (Snijders and Bosker 1999). The variance
components model has no explanatory variables (only the intercept is present)
and is often termed the null model. Variance components models are mathemati-
cally equivalent to ‘random effects’ ANOVA (Snijders and Bosker 1999). They are
useful in establishing the relative proportions of variation across all levels, allowing
the variation at each level to be known prior to the introduction of 'explanatory'
terms. This in turn allows consideration of the reduction in variation associated with
the inclusion of an explanatory term.
Within MLM, the first stage of the analysis is to determine the appropriate
multilevel structure. If the variance at a given level does not contribute substan-
tially to the total variance, this level may be obsolete and might therefore be
ignored. However, the extent of variation at any specified level may appear
insubstantial (or even zero) whilst it remains masked by larger, as yet unmodelled,
fixed and random effects. For this reason, it is unwise to discard a particular level
simply because its variation is small (or even zero) and not significant, when no
other terms have been included in the model. Only where this remains consistently
true throughout model development would it suggest that the level could be
discarded, thus producing a more parsimonious model and perhaps increasing its
interpretability.
In introducing the theory of MLM, it is useful to outline the notation (algebra) used.
Considering a two-level model, the general algebraic formula may be written:

y_ij = Σ_{m=0}^{N} β_mij x_mij    (5.1)

where y_ij is the outcome measure for the i-th level-1 unit, clustered within the j-th
level-2 unit; x_mij (m ≥ 1) is the m-th of N covariates with coefficients β_mij; and x_0ij = 1,
such that β_0ij (the intercept) is the outcome when all explanatory variables are zero.
To aid comprehension of this generic formulation, we start with the simplest of
all multilevel models and build the MLM algebra from first principles.
A variance components model has only the intercept present. The effect of having
no explanatory variables present is that N = 0. Thus the variance components
model is derived from (5.1) by setting N = 0:

y_ij = β_0ij x_0ij = β_0 + u_0j + e_0ij

where y_ij is the attachment loss (CAL) for the i-th tooth, clustered within the j-th
individual; β_0 is the mean value of the attachment loss y_ij, with the total variance
partitioned across each level such that e_0ij ~ N(0, σ²_0e), i.e. tooth-level residuals (e_0ij)
have zero mean and are normally distributed with variance σ²_0e across all units;
similarly u_0j ~ N(0, σ²_0u) for individual-level residuals, where σ²_0u is the variance
across all units. Thus, the total variation is given by Var(CAL_ij) = σ²_0u + σ²_0e.
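For readers working in R, the variance components model above can be fitted with, for example, the lme4 package; the following is a minimal sketch with hypothetical names (cal for attachment loss, subject for the individual identifier, perio for the data frame), not code from the chapter.

```r
# Null (variance components) model: only an intercept, with the variance
# partitioned into between-individual (u_0j) and residual/tooth (e_0ij) parts.
library(lme4)
m0 <- lmer(cal ~ 1 + (1 | subject), data = perio)   # 'perio': one row per tooth
vc <- as.data.frame(VarCorr(m0))
s2_u <- vc$vcov[vc$grp == "subject"]    # sigma^2_0u, individual level
s2_e <- vc$vcov[vc$grp == "Residual"]   # sigma^2_0e, tooth level
s2_u / (s2_u + s2_e)                    # proportion of variance between individuals
```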
Consider our periodontal example, where we have a 2-level model with only one
covariate (age), and examine the relationship between attachment loss (CAL) and age
within a longitudinal dataset of clinical measures obtained from repeated full-
mouth recordings on a sample of individuals. Suppose we ignore the natural
hierarchy and observe a near-linear relationship, as illustrated in Fig. 5.4.
In general, to include covariates that 'operate' at the various levels, such as x_1j for
the individual-level (i.e. level-2) covariate age, and x_2ij for the tooth-level covariate
presence/absence of plaque, a 2-level random intercept model may be written:

y_ij = β_0ij + β_1 x_1j + β_2 x_2ij = β_0 + β_1 x_1j + β_2 x_2ij + e_0ij + u_0j
where the first part (β_0 + β_1 x_1j + β_2 x_2ij) is similar to any single-level model which
contains the two covariate effects plus the intercept, whilst the second part
(e_0ij + u_0j) is the multilevel random structure for the total variation partitioned
across each level of the hierarchy. The subscript j is present for the first covariate
(x_1j) to represent covariate values that vary across individuals (and are constant for
all sites within individuals); whereas the subscripts ij are present for the second
covariate (x_2ij) to represent that these covariate values vary across teeth (within
individuals) as well as across individuals (they are not constant for all sites within
individuals). The multilevel model thus estimates the parameters of the fixed part of
the model (β_0, β_1, β_2) alongside the parameters of the random part of the model
(σ²_0e, σ²_0u). The random terms (e_0ij, u_0j) are not estimated; only their variances are
estimated.
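Continuing the hypothetical lme4 sketch, the random intercept model with an individual-level covariate (age) and a tooth-level covariate (plaque) could be fitted as follows.

```r
# Random intercept model: fixed effects b0, b1 (age, level-2 covariate) and
# b2 (plaque, level-1 covariate); the random terms e_0ij and u_0j are
# summarised by their estimated variances.
library(lme4)
m1 <- lmer(cal ~ age + plaque + (1 | subject), data = perio)
summary(m1)
```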
If we were to try to model the complex variation in Fig. 5.6, where it is assumed
that, for each individual, the within-mouth variation widens with increasing mean
CAL (or increasing age), we would express the 2-level model as:

CAL_ij = β_0ij + β_1ij age_j = β_0 + β_1 age_j + e_0ij + u_0j + e_1ij age_j

where the first part (β_0 + β_1 age_j) is similar to any single-level model that fits the
mean covariate relationship, including an intercept term. The second part
(e_0ij + u_0j + e_1ij age_j) is the multilevel random structure for the total variation
partitioned across each level of the hierarchy, including additional random structure
at level-1 that depends upon age. In this instance, age is acting as a proxy/surrogate
for mean CAL. The subscript j is present for all references to the covariate age, since
the covariate value varies across individuals (and is constant for all teeth within
individuals). The additional random parameter, e_1ij, satisfies the usual assumptions,
and represents the level-1 (random) variation which is effectively 'scaled' by the age
covariate. At level-1, the two random terms (e_0ij, e_1ij) give rise to two variances and
one covariance (σ²_0e, σ²_1e, σ_01e), where the covariance depicts any underlying
relationship (if one exists) between one random term varying 'in tune' with the other.
The multilevel model thus estimates the parameters of the fixed part of the model
(β_0, β_1) alongside the parameters of the random part of the model (σ²_0e, σ²_1e, and σ_01e
for level-1; σ²_0u for level-2). The variance 'function' at level-1 now has a complex
form: level-1 variance = Var(e_0ij + e_1ij age_j) = σ²_0e + 2σ_01e age_j + σ²_1e age_j², since
age is a fixed covariate. In other words, the level-1 variance structure is a quadratic
function of the covariate age (Fig. 5.7). Different level-1 variance functions can be
obtained by appropriate transformation(s) of the selected covariate(s).
If we were to model the simple random slope model, illustrated in Fig. 5.10, the
2-level model would be:

CAL_ij = β_0ij + β_1j age_j = β_0 + β_1 age_j + e_0ij + u_0j + u_1j age_j
where the fixed part (β_0 + β_1 age_j) is akin to any single-level model, and now
the random part (e_0ij + u_0j + u_1j age_j) represents the multilevel random
structure, with the total variation partitioned across each level of the hierarchy,
including additional random structure at level-2 that depends upon age.
The additional random parameter, u_1j, satisfies the usual assumptions
(normally distributed with mean zero), but only varies from individual to individual.
This term represents level-2 (random) variation, which is effectively 'scaled'
by the age covariate. At level-2, the two random terms (u_0j, u_1j) give rise to
two variances and one covariance (σ²_0u, σ²_1u, σ_01u), where the covariance
depicts any underlying relationship (if one exists) between one random term
changing 'in tune' with the other.
The multilevel model thus estimates the parameters of the fixed part of the model
(β_0, β_1) alongside the parameters of the random part of the model (σ²_0e for level-1;
σ²_0u, σ²_1u, and σ_01u for level-2), only now the complex variation occurs at level-2 and not
at level-1.
The variance 'function' at level-2 again has a complex form: level-2 variance
= Var(u_0j + u_1j age_j) = σ²_0u + 2σ_01u age_j + σ²_1u age_j², since age is not a random
variable. Thus, the level-2 variance structure is a quadratic function of the covariate age.
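In the same hypothetical lme4 notation, the random slope model corresponds to letting the coefficient of age vary across individuals.

```r
# Random slope model: level-2 variances for intercept and slope (s2_0u, s2_1u)
# and their covariance (s_01u), plus the level-1 residual variance (s2_0e).
library(lme4)
m2 <- lmer(cal ~ age + (1 + age | subject), data = perio)
VarCorr(m2)   # displays the level-2 variance-covariance structure
```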
Something that is peculiar to multilevel modelling occurs if the covariate,
about which we model random slopes, is binary. Consider the covariate sex in
a similar model to that previously described for age: CAL_ij = β_0ij + β_1j sex_j
= β_0 + β_1 sex_j + e_0ij + u_0j + u_1j sex_j. Then: level-2 variance = σ²_0u + 2σ_01u sex_j
+ σ²_1u sex_j², since sex is not a random variable. Although the level-2 variance structure
appears to be a quadratic function of the covariate sex, it is not, since sex can
only take two values, usually coded 0 and 1, in which case sex² = sex. Thus: level-2
variance = σ²_0u + (2σ_01u + σ²_1u) sex_j, and it therefore becomes impossible to
determine, simultaneously, the complex variance associated with sex (σ²_1u) and the
covariance between sex and the intercept (σ_01u). In this situation, we have to
constrain one of these two terms to be zero and adopt either σ²_0u + 2σ_01u sex_j or
σ²_0u + σ²_1u sex_j as the variance function at level-2.
Now, the problem is, if we adopt σ²_0u + 2σ_01u sex_j as the correct expression,
we have what appears to be an absurd situation. The model determines a variance
for the random intercept and a covariance between the random intercept
and random covariate (e.g. sex), whilst constraining the variance of the random
covariate for sex to be zero. However, this parameterisation (σ²_0u + 2σ_01u sex_j)
is correct and the alternative (σ²_0u + σ²_1u sex_j) is only correct in a limited number
of circumstances. This is because variance terms can only be positive,
whereas the covariance term may also be negative, provided that the total variance
remains positive. Thus, were sex coded such that males were 0 and females
were 1, and outcome variation was greater amongst males than females, the
expression σ²_0u + σ²_1u sex_j could not capture this situation correctly, since σ²_1u can
only be positive when it would need to be negative. The expression
σ²_0u + 2σ_01u sex_j, on the other hand, accommodates all situations adequately, since
σ_01u may be negative.
Complex random slope models are an extension of the random slope model, in which the
variation in a slope is not entirely random, but may vary systematically as a function of
other covariates (Goldstein 2003). For instance, changes in CAL with age may vary
randomly across individuals, though this slope may also differ systematically between
males and females. Thus, revisiting the relationship presented in Fig. 5.6: suppose that,
instead of the heteroscedasticity shown in Fig. 5.7, the within-individual variation in
CAL values was constant across all ages for all individuals, but the mean level of
CAL within each individual progressed at a different rate. In other words, complex
random structure occurs at level-2 and not at level-1 (Fig. 5.11).
The hypothetical scenario in which, on average, males progress faster than females
is visualised in Fig. 5.12.
If we were to model the complex random slope model, illustrated in Fig. 5.12, the
2-level model would be:

CAL_ij = β_0ij + β_1j age_j + β_2 age_j sex_j = β_0 + β_1 age_j + β_2 age_j sex_j + e_0ij + u_0j + u_1j age_j
where the fixed part (β_0 + β_1 age_j + β_2 age_j sex_j) is akin to a single-level model
with an interaction term included (β_2 age_j sex_j), and the random part
(e_0ij + u_0j + u_1j age_j) is as before, with the total variation partitioned across each
level, including additional random structure that depends upon age. What is
perhaps most striking is how the fixed part includes an interaction term without
both covariates present: sex is not included as an independent covariate, only
within the interaction. This is not so strange within MLM, since there need be no
underlying outcome differences by sex for there to be differences in the male
and female slopes. Nevertheless, a genuine underlying gender difference is frequently
observed, hence the more realistic model would be:

CAL_ij = β_0ij + β_1j age_j + β_2 age_j sex_j + β_3 sex_j,

where

CAL_ij = β_0 + β_1 age_j + β_2 age_j sex_j + β_3 sex_j + e_0ij + u_0j + u_1j age_j.

The multilevel model estimates the parameters of the fixed part of the model
(β_0, β_1, β_2, and β_3) along with the parameters of the random part of the model (σ²_0e
for level-1; σ²_0u, σ²_1u, and σ_01u for level-2).
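A sketch of this more realistic model, again using the hypothetical lme4 set-up: the age-by-sex interaction enters the fixed part, while the random slope for age remains at the individual level.

```r
# Fixed part: b0 + b1*age + b2*(age:sex) + b3*sex;
# random part: e_0ij + u_0j + u_1j*age.
library(lme4)
m3 <- lmer(cal ~ age + sex + age:sex + (1 + age | subject), data = perio)
fixef(m3)   # the fixed-effect estimates b0, b1, b3 and b2
```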
Now, it is possible that the degree of random variation is different for males and
females, in which case the model could become:
CAL_ij = β_0ij + β_1j age_j m_j + β_2j age_j f_j,

where

CAL_ij = β_0 + β_1 age_j m_j + β_2 age_j f_j + e_0ij + u_0j + u_1j age_j m_j + u_2j age_j f_j,
such that m_j = 1 for males, and zero otherwise; and similarly f_j = 1 for females, and
zero otherwise. If no underlying gender outcome differences were being sought, this
would be modelled explicitly by constraining the covariate coefficients to be equal
(β_1 = β_2). The random parameters, u_1j and u_2j, depict two random slopes with respect
to age, for males and females, respectively. The model estimates the parameters of the
fixed part (β_0, β_1, and β_2) along with the parameters of the random part (σ²_0e for
level-1; σ²_0u, σ²_1u, σ²_2u, σ_01u, σ_02u and σ_12u for level-2), where some covariance terms
(σ_01u, σ_02u, σ_12u) might be zero, or small and not significantly different from zero, in
which case it might be appropriate to constrain these to be zero where model
convergence is not readily achieved. Each constraint could be relaxed again in turn
if specifically being sought.
In any event, the variance structure at level-2 is now very complex. If all variances
and covariances are to be estimated, the total level-2 variance structure is
Var(u_0j + u_1j age_j m_j + u_2j age_j f_j), which simplifies to (using sex² = sex, so that
m_j² = m_j, f_j² = f_j and m_j f_j = 0):

σ²_0u + 2σ_01u age_j m_j + 2σ_02u age_j f_j + σ²_1u age_j² m_j + σ²_2u age_j² f_j.
Maximum likelihood methods are usually employed to obtain estimates
of model parameters. The probability of the observed data is written as a function of
the unknown parameters (the likelihood function), with the coefficient estimates
taking the values of the unknown parameters that maximise this function. This
poses no problems for single-level logistic regression, since the likelihood function
can be written explicitly. However, the likelihood function for a multilevel logistic
regression cannot be written as an explicit function and thus alternative model-fitting
methods are required. Broadly speaking, there are three strategies for obtaining the
estimates sought. Firstly, the likelihood function could be approximated; marginal
quasi-likelihood (MQL) and penalised quasi-likelihood (PQL) are examples
(Moerbeek et al. 2003). These methods are in general computationally
amenable but produce (to varying degrees) biased estimates. The second strategy
is numerical integration, although integrating numerically over all random parameters
can be rather computationally intensive. Finally, a Bayesian approach, for example
Markov Chain Monte Carlo (MCMC), could be adopted (Gilks et al. 1996).
MCMC is also computationally intensive but will yield unbiased estimates of the
random structure.
In addressing the computational demand of MLM, there is always a balance
between speed and accuracy that has to be sought. A suggested approach that seems
logical is to gain benefit from the increased speed of an approximate maximum
likelihood method (for example PQL) during model development. Speed
can subsequently be sacrificed for accuracy by using MCMC whenever any
doubt arises. In any case, MCMC should at least be used to obtain estimates of
a final model.
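To make the trade-off concrete, here is a hedged R sketch for a multilevel logistic model: the lme4 default uses a fast Laplace approximation of the likelihood, adaptive Gauss-Hermite quadrature (nAGQ greater than 1) performs the numerical integration more accurately for a single random intercept, and a fully Bayesian MCMC fit could be obtained with packages such as MCMCglmm or brms. Data and variable names are illustrative.

```r
# Approximate likelihood (fast) versus fuller numerical integration (slower):
library(lme4)
fast  <- glmer(y ~ x + (1 | cluster), family = binomial, data = d)             # Laplace
finer <- glmer(y ~ x + (1 | cluster), family = binomial, data = d, nAGQ = 10)  # quadrature
```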
The MCMC process yields a set of estimates that are useful for making inferences
regarding the model. The simulated estimates are representative of a hypothetical
population of all possible model estimates, of which the sample data
represent only one example. By examining the simulated estimates, it is possible
to check whether the simulations have converged on the correct values, and whether
the number of simulations is sufficient to provide meaningful summary
information.
The necessary criteria for convergence include ‘reasonable’ starting values and,
in loose terms, a reasonable degree of freedom in how far a new simulation can
‘stray’ from the previous one. MCMC chains should possess good ‘mixing’
characteristics; variation around the parameter estimate should be random. That
is, no pattern(s) should exist amongst the estimates of a parameter at each iteration
of the chain. Good mixing will yield a symmetric (Gaussian) distribution for the
kernel density (empirical distribution) of the parameter. For this reason, the MCMC
diagnostics should always be considered. Finally, attention should be drawn to the
fact that the variance cannot be negative and that the objective of assessment should
be in identifying how inherent outcome variability ought to be specified, namely at
which level(s) any variance structure exists.
Comprehensive coverage is given by Goldstein (2003), while
Browne and colleagues (Browne 2006; Browne and Draper 2000, 2006) provide in-depth
coverage of many aspects of MCMC pertinent to MLM.
The fixed part of the model is checked in the same manner as single-level models.
Specifically, the relationship between the outcome and predictor variables should be
correctly specified, the residuals should be normally distributed for every value of the
predictor variables and the variance of the outcome should be the same at each value of
the predictor variables.
Simulated data from the fitted model should ‘look’ like the observed data.
The model is not valid if the simulated data differ (in a non-random way) from
the observed data.
Fig. 5.13 A cross-classified data structure of patients nested within GPs and within hospitals
Multilevel methodology can be used for data structures that are not strictly
hierarchical, or are not typically thought of as representing a natural hierarchy
(Fielding and Goldstein 2006). Examples can be drawn from multivariate data,
repeated measures, categorical outcomes, meta-analyses, and cross-
classifications. For multivariate data, the lowest level represents the multiple
outcomes under consideration. Repeated measures and categorical outcomes
work in a similar manner, with occasion or category, respectively, represented at
level 1. Meta-analysis can incorporate covariates (meta-regression) and thus help in
addressing the issue of heterogeneity between studies.
An example of a model with no strict hierarchy is General Practitioner (GP)
referral for hospital-based treatment. As shown in Fig. 5.13, hospitals are cross-
classified with referring GPs. Thus, patients that belong to a particular practice
attend different hospitals, and vice versa. This approach can account for differences
between hospitals due to organisational procedures, whilst accounting for GP
variation in their referral behaviour. Furthermore, the model can incorporate
covariates at the patient-level (e.g. age, gender), the hospital-level (e.g. proportion
of cases undertaken as day-cases), and the GP-level (e.g. level of qualification,
experience).
The interested reader may wish to consult Hox (2002) or Snijders and Bosker
(1999) for fuller coverage. Goldstein (2003) provides readers with useful details
of the fit algorithms, while Leyland and Goldstein (2001) provide a text dedicated to
MLM in the context of health. An extension of multilevel modelling is to allow the
random effects to follow a discrete distribution; this is known as latent class modelling.
References
Altman, D. G. (1991). Practical statistics for medical research. London: Chapman and Hall/CRC.
Armitage, P., Berry, G., & Matthews, J. N. S. (2002). Statistical methods in medical research
(4th ed.). Malden: Blackwell Science.
Bland, M. (2000). An introduction to medical statistics (3rd ed.). Oxford: Oxford University Press.
Browne, W. J. (2006). MCMC algorithms for constrained variance matrices. Computational
Statistics & Data Analysis, 50(7), 1655–1677.
Browne, W. J., & Draper, D. (2000). Implementation and performance issues in the Bayesian and
likelihood fitting of multilevel models. Computational Statistics, 15(3), 391–420.
Browne, W. J., & Draper, D. (2006). A comparison of Bayesian and likelihood-based methods for
fitting multilevel models. Bayesian Analysis, 1(3), 473–513.
Fielding, A., & Goldstein, H. (2006). Cross-classified and multiple membership structures in
multilevel models: An introduction and review (Vol. 791). Nottingham: DfES.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice.
London: Chapman & Hall.
Gilthorpe, M. S., & Cunningham, S. J. (2000). The application of multilevel, multivariate
modelling to orthodontic research data. Community Dental Health, 17(4), 236–242.
Goldstein, H. (2003). Multilevel statistical models (3rd ed.). London: Arnold.
Hox, J. J. (2002). Multilevel analysis: Techniques and applications. Mahwah: Lawrence Erlbaum
Associates.
Kirkwood, B. R., & Sterne, J. A. C. (2003). Essential medical statistics (2nd ed.). Malden:
Blackwell Science.
Leyland, A. H., & Goldstein, H. (2001). Multilevel modelling of health statistics. Chichester:
Wiley.
Machin, D., Campbell, M. J., & Walters, S. J. (2007). Medical statistics: A textbook for the health
sciences (4th ed.). Chichester: Wiley.
Moerbeek, M., Van Breukelen, G. J. P., & Berger, M. P. F. (2003). A comparison of estimation
methods for multilevel logistic models. Computational Statistics, 18(1), 19–37.
Qian, L. F., & Wang, S. J. (2001). Bias-corrected heteroscedasticity robust covariance matrix
(sandwich) estimators. Journal of Statistical Computation and Simulation, 70(2), 161–174.
Quene, H., & van den Bergh, H. (2008). Examples of mixed-effects modeling with crossed random
effects and with binomial data. Journal of Memory and Language, 59(4), 413–425.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models applications and data
analysis methods (2nd ed.). Thousand Oaks: Sage Publications.
Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and
advanced multilevel modeling. London: Sage.
White, H. (1980). A heteroskedasticity-consistent covariance-matrix estimator and a direct test for
heteroskedasticity. Econometrica, 48(4), 817–838.
Ziegler, A., Kastner, C., & Blettner, M. (1998). The generalised estimating equations:
An annotated bibliography. Biometrical Journal, 40(2), 115–139.
Chapter 6
Modelling Data That Exhibit an Excess
Number of Zeros: Zero-Inflated Models
and Generic Mixture Models
6.1 Overview
Within biomedical research, count data may appear to possess an 'excess' of zeros
relative to standard statistical distributions. There is a plethora of statistical literature
addressing how best to model such outcomes. The zero-inflated Poisson (ZiP)
and the zero-inflated binomial (ZiB) are two commonly proposed modelling
strategies. More recently, generic mixture models have also been suggested (Skrondal
and Rabe-Hesketh 2004). We discuss these modelling strategies in some depth,
introducing the concepts of mixture modelling in simpler terms in this chapter
before examining them in a wider context in later chapters. Crucial issues surrounding
the modelling of counts with an excessive proportion of zeros are addressed,
specifically outlining common potential pitfalls, and we provide some helpful tips
on model selection and model interpretation.
In order to examine the ideas we explore in this chapter, we choose for illustration
an oral health dataset that is in the public domain and has been analysed extensively
already to study different methods for analysing data with excess zeros compared to
standard count distributions. The example data are from a prospective study that
examined the effect of four interventions to improve the oral health status amongst
children. An established indicator of oral health involves counting the number of
decayed (d/D), missing (m/M), and filled (f/F) deciduous ‘milk teeth’ (t) or perma-
nent teeth (T), yielding the measure of dmft or DMFT (Böhning et al. 1999).
The dmft count ranges between 0 and 20, whereas the DMFT count assumes values
between 0 and 32. Amongst relatively healthy individuals, or during the early
stages of dentition development, there is potential for an excess number of zero
dmft/DMFT counts.
The example dataset derives from a study conducted in the urban area of
Belo Horizonte, Brazil, during the 1990s (Böhning et al. 1999). The effects of four
caries prevention methods were examined amongst 797 school children aged
7 years at the start of the study. Data were recorded for the eight deciduous molars;
different intervention methods prevent caries incidence (new lesions). Interventions
were administered in six settings: (1) oral health education; (2) enrichment of the
school diet with rice bran; (3) mouthwash with 0.2% sodium fluoride (NaF)
solution; (4) oral hygiene; (5) all the interventions combined; and (6) none of the
interventions (control). The study was clustered in design, with children nested
within schools, which in turn were allocated to one of the intervention groups.
The outcome therefore was change in dmft count from baseline and was analysed by
Böhning et al. (1999) to illustrate that ZiP regression models are useful in
evaluating intervention effects on dental caries when data exhibit an excess of
zero counts.
The data may be downloaded from the publishers of the original Böhning et al.
(1999) article (http://www.blackwellpublishers.co.uk/rss), or from the webpage of
the vendor of the software LatentGOLD4.0™ (Vermunt and Magidson 2005a), as
the data were used as part of a tutorial (http://www.statisticalinnovations.com).
The zero-inflated model may be written:

P(y_i | x_i) = π g(y_i; 0) + (1 − π) g(y_i; f⁻¹[x′_i β])

where π is a weight between zero and one; g(y_i; μ_i) is the Poisson or binomial
probability with parameter μ_i; f[μ_i] is the link function, which takes the form of the
natural logarithm for the Poisson probability and the logit for the binomial probability; and
x′_i β is the vector of linear predictors (covariates). The standard probability
distributions are given by:
Poisson: g(y_i; μ_i) = μ_i^{y_i} exp(−μ_i) / y_i!

or binomial: g(y_i; μ_i) = [N! / (y_i! (N − y_i)!)] μ_i^{y_i} (1 − μ_i)^{N − y_i}
where y_i is the outcome count, μ_i is the Poisson mean or the binomial distribution
parameter, and N is the binomial denominator (the upper bound of the
count outcome).
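As a point of reference before considering over-dispersion and zero-inflation, the two standard (non-inflated) models can be fitted in R with glm; the data frame and variable names below are illustrative, not those of the Brazilian study files.

```r
# dmft treated as a Poisson count, or as a binomial count out of N = 8 molars.
pois <- glm(dmft ~ intervention, family = poisson, data = dental)
binm <- glm(cbind(dmft, 8 - dmft) ~ intervention, family = binomial, data = dental)
```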
An extension to these models allows for 'over-dispersion', which
may occur within a dataset for a number of reasons. The most common cause of
over-dispersion is a lack of homogeneity (unmodelled heterogeneity), whereby
outcomes are not truly independent. This often occurs due to clustering, i.e. where
outcomes are grouped, as frequently happens in biomedical research. To accommodate
this explicitly within the Poisson/binomial distribution, we adapt the distribution
parameter to follow
another distribution. The over-dispersed Poisson, also known as the negative
binomial, is derived from the Poisson when its distribution parameter follows a
gamma distribution (with the gamma function denoted by Γ below) with mean μ_i and
variance μ_i²/ν²:

g(y_i; μ_i) = [Γ(y_i + ν²) / (y_i! Γ(ν²))] · [ν²/(ν² + μ_i)]^{ν²} · [μ_i/(ν² + μ_i)]^{y_i}
Similarly, the over-dispersed binomial (the beta-binomial) arises when the binomial
parameter follows a beta distribution:

g(y_i; μ_i) = [B(μ_i ν² + y_i, (1 − μ_i)ν² + (N − y_i)) / B(μ_i ν², (1 − μ_i)ν²)] · [N! / (y_i! (N − y_i)!)]

The expected value of y_i is μ_i N, and the variance is no longer equal to μ_i(1 − μ_i)N but
is a factor 1 + (N − 1)/(1 + ν²) larger; as 1/ν² → 0, the beta-binomial distribution
reduces to the standard binomial distribution. For the rest of this chapter, we refer to
these extensions to the Poisson/binomial as over-dispersed Poisson/binomial.
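An over-dispersed Poisson (negative binomial) fit is available in R through, for example, MASS::glm.nb; again the names below are illustrative.

```r
# Negative binomial regression: the estimated 'theta' plays the role of the
# dispersion parameter (nu^2 in the notation above).
library(MASS)
nb <- glm.nb(dmft ~ intervention, data = dental)
summary(nb)
```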
A further extension to the zero-inflated model is achieved by including
covariates in the mixture part of the model, to determine the proportions conditional
on these covariates. For the ZiP/ZiB models, this involves replacing the weight π
with a function of the covariates:

P(y_i | x_i, z_i) = π{h⁻¹[z′_i γ]} g(y_i; 0) + (1 − π{h⁻¹[z′_i γ]}) g(y_i; f⁻¹[x′_i β])
For the generic mixture model with C latent classes:

d(y_i | x_i, z_i) = Σ_{c=1}^{C} P(c | z_i) d(y_i | c, x_i)
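In R, zero-inflated models with covariates in both parts can be fitted with, for example, the pscl package (one of several implementations, and not the software used for the analyses reported in this chapter); the part of the formula after '|' specifies the covariates for the mixture (zero-inflation) weight.

```r
# Zero-inflated Poisson with 'intervention' in both the count (distribution)
# part and the zero-inflation (mixture) part; variable names are illustrative.
library(pscl)
zip <- zeroinfl(dmft ~ intervention | intervention,
                data = dental, dist = "poisson")
summary(zip)   # count-part and zero-part coefficients are reported separately
```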
The model log-likelihood is a measure of how well the model fits the data.
The use of this statistic directly, without any adjustment for parsimony, is not
generally favoured as a model-fit criterion alone, since one can nearly always
improve upon it (hence improve model fit, ultimately towards the point of a
saturated model) by increasing model complexity. One option is to plot changes
in the likelihood value against increasing model complexity, e.g. for each increment
in the number of latent classes (keeping all other parameter configurations consis-
tent). One then ‘eyeballs’ the point of complexity at which there is an ‘elbow’,
signifying acceleration in the diminishing return in model improvement for increas-
ing model complexity. This approach is similar to the use of scree plots for those
familiar with principal component analysis; there is no hard and fast rule employed.
Alternative strategies to the raw likelihood statistic are penalized versions, such
as the Bayesian Information Criterion (BIC) (Vermunt and Magidson 2005b)
or Akaike’s Information Criterion (AIC) (Vermunt and Magidson 2005b), both of
which incorporate a sense of parsimony by accommodating the varying number of
model parameters. These statistics effectively provide a trade-off between growing
model complexity and how well the model fits the data. There is no consensus,
however, on which penalized form of the likelihood statistic should be adopted.
In general, one should consider a range of likelihood-based statistics for model-fit
criteria. In this chapter, we adopt the BIC and AIC, though we also consider model-
fit criteria that reflect how well count models perform along the outcome range in
terms of predicted counts.
Given the focus of zero-inflated modelling on accommodating an excess number of
zeros compared to standard distributions, model-fit criteria should perhaps examine
how well zero-inflated models do in predicting the total number of zero counts.
A transition or contrast from zero to one typically represents the onset of disease in
longitudinal data, or elevated disease prevalence in cross-sectional data, which has
direct clinical importance. To acknowledge the importance of zero counts, we there-
fore contrast the number of predicted and observed zeros. However, other count
thresholds along the outcome scale may also have clinical importance. For instance,
the tail of the distribution (truncated for binomial or infinite for Poisson) typically
denotes increasing disease severity. Crossing a ‘critical’ threshold may represent a cut-
off that distinguishes between ‘high’ and ‘low’ risk groups (for the purpose of targeted
preventions) or in some instances may signify irreversibility or a critical state, such as
mortality (e.g. a tooth exfoliates or an individual dies). The entire range of the outcome
might have importance for clinical diagnostic or prognostic reasons. Model-fit criteria
should therefore seek to capture this.
For an overall assessment of the distribution, we assess the ‘root mean squared
error’ (RMSE) between predicted and observed counts (for the viable range) as
a proportion of the number of observations. This is achieved by initially
differencing the observed (Obs_i) and predicted (Pred_i) counts for the entire scale,
squaring these differences and summing, where the 'scale' is either determined by
the distribution (i = 0, ..., N, where N is the binomial maximum or an arbitrary
observed maximum count for the Poisson, or set by the user, beyond which all
predicted Poisson counts are grouped). One then divides by the number of
categories set by the choice of scale (N + 1), takes the square-root, multiplies by
the number of categories (N + 1), and divides by the total number of observations,
n, to express as a fraction:
RMSE = [(N + 1)/n] √[ Σ_{i=0}^{N} (Obs_i − Pred_i)² / (N + 1) ]
Although the RMSE statistic may seem somewhat arbitrary, as indeed it is not
directly comparable across different datasets or for different model parameter-
isations of the same dataset (e.g. Poisson vs. binomial), its construction is such
that, in very crude terms, it may be thought of as representing the maximum
proportion of 'misallocated' counts. For instance, for an outcome scale of 0–3, if
the predicted number of 'zeros' were 3 more than observed, the number of 'ones'
3 less, the number of 'twos' 3 more, and the number of 'threes' 3 less, out of a
total of 48 observations, there are, crudely speaking, 25% misallocated counts
(12 misallocated observations out of 48). Calculating RMSE: 3² = 9 occurs
four times (36), averaged over four categories (9), square-rooted (3), multiplied
by four categories (12), expressed as a fraction of the number of observations
(0.25). Were the same predicted counts distributed such that one frequency
miscount was 6 over and another 6 under, RMSE: 6² = 36 occurs twice (72),
averaged over four categories (18), square-rooted (4.24), multiplied by four
categories (16.97), expressed as a fraction of the number of observations
(0.35). Thus, where misallocation is evenly distributed, RMSE represents the
proportion of misallocated observations; if misallocation is not evenly
distributed, RMSE is higher. Hence, the proportion of counts misallocated is
no larger than the RMSE. Examples are given for the illustrative dental dataset
used later in this chapter.
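The RMSE measure defined above translates directly into a few lines of R; the function below is a direct transcription (not code from the chapter), and the call reproduces the first worked example.

```r
# obs and pred: observed and predicted frequencies for counts 0..N;
# n: total number of observations. Returns the RMSE as defined in the text.
rmse_counts <- function(obs, pred, n) {
  K <- length(obs)                       # number of categories, N + 1
  K * sqrt(sum((obs - pred)^2) / K) / n
}
# Worked example: four categories, misallocations of +/-3, 48 observations
rmse_counts(obs = c(10, 10, 10, 18), pred = c(13, 7, 13, 15), n = 48)  # 0.25
```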
Caveat Model-fit criteria per se do not always provide enough insight as to how well
a particular modelling strategy suits the data. This becomes particularly apparent for
data exhibiting an excess number of zeros compared to standard distributions, as
there is the potential for what might be seen as a ‘dual’ or ‘two-stage’ process of data
generation. In such circumstances, it is feasible that likelihood statistics and
predicted counts fail to distinguish between differently parameterised models, as
shown for the example dental data in this chapter. Understanding data generation
might then usefully inform model choice and hence model interpretation.
Bias resulting from the omission of an important covariate from the distribution part
of a model is well known, but less well known is that bias may also occur when
important covariates are omitted from the mixture part. We demonstrate this for
ZiP/ZiB models. This may not be immediately obvious, which is perhaps why many
researchers have overlooked this problem. However, consider for one moment a
hypothetical example of dental data, similar to the Brazilian study, with dmft
recorded for all deciduous teeth and modelled using a zero-inflated Poisson
model with only one covariate (sex) and assume this is included only in the
distribution part of the model. Accordingly, the proportion of children in the
zero-bin must be the same for boys and girls. This is an implicit constraint resulting
from not modelling sex to predict class membership. The impact on such a model is
illustrated by simulation.
We undertook a two-stage simulation process whereby dmft data were
generated, in the statistical software package R (http://www.r-project.org/) using
the function rpois, to represent 50,000 boys and 50,000 girls: 20% of the boys had a
dmft count of zero, the remainder taking values from a Poisson distribution with
mean 2; and 80% of the girls had a dmft of zero, the remainder taking values from a
Poisson distribution with mean 1. Data were then modelled using a standard ZiP
model (ZiP), i.e. with the covariate sex in the Poisson part of the model only.
Including sex in the ZiP model to predict class membership emulates the two-stage
simulation process, so we focus only on how unreliable the standard ZiP model is
for this scenario. The log-likelihood, BIC and AIC were obtained by maximising:
l(π, μ_F, μ_M) = Σ_sex Σ_{k=0}^{∞} p_true(k, sex) log( p_ZiP(k, sex; π, μ_F, μ_M) ),

where p_true(k, sex) is the true probability of observing k for each sex based on the
true model and p_ZiP(k, sex; π, μ_F, μ_M) is the same probability under a zero-inflated
model with parameters π (the proportion in the zero bin), μ_F (the mean of the female
distribution part), and μ_M (the mean of the male distribution part). LatentGOLD4.0™
(http://www.statisticalinnovations.com) was used to generate the standard zero-
inflated Poisson model and results are presented in Table 6.1.
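The following is a minimal sketch of the two-stage data generation described above; it is an illustrative reconstruction in R, not the authors' code, and it uses the pscl package rather than LatentGOLD to contrast the standard and extended ZiP specifications.

```r
# Two-stage generation: 20% of boys and 80% of girls fall in the zero-bin;
# the remainder are drawn from Poisson distributions with means 2 and 1.
set.seed(1)
n <- 50000
boys  <- ifelse(runif(n) < 0.2, 0, rpois(n, 2))
girls <- ifelse(runif(n) < 0.8, 0, rpois(n, 1))
dmft  <- data.frame(count = c(boys, girls),
                    sex   = factor(rep(c("boy", "girl"), each = n)))
library(pscl)
standard <- zeroinfl(count ~ sex | 1,   data = dmft)  # sex in the count part only
extended <- zeroinfl(count ~ sex | sex, data = dmft)  # sex also predicts the zero-bin
```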
It is apparent that the standard ZiP model does not perform well for girls. The
proportion of children in the zero bin, constrained to be identical for both girls and
boys, was estimated to be 22.87%. This was close to the true value of 20% for boys,
but was far from the true value of 80% for girls. The distribution mean for girls was
also far from the truth (0.27 as opposed to 1.00), though it was less biased for boys
(2.03 as opposed to 2.00). The inappropriately specified zero-inflated model yielded
considerable deviation from the truth in terms of the size, shape and central location
of the distribution part for girls, yet the overall predicted counts were indistinguishable
from the simulated data, as seen in Fig. 6.1.
Table 6.1 Model fit criteria for the ZiP model undertaken with the simulated data

                                          Simulated true      ZiP estimated
Log-likelihood                            −100,088 (a)        −111,700.74
BIC                                       200,212 (a)         223,436.02
AIC                                       200,184 (a)         223,407.48
Zero                                      0                   1,573.74
RMSE                                      0%                  13.7%
Girls
  Proportion in the zero-bin              80%                 22.87%
  Distribution mean dmft count (95% CI)   1                   0.27 (0.26, 0.28)
Boys
  Proportion in the zero-bin              20%                 22.87%
  Distribution mean dmft count (95% CI)   2                   2.03 (1.97, 2.07)

ZiP, standard zero-inflated Poisson model with sex as a covariate in the non-zero part only (not as a
class predictor); BIC, Bayesian Information Criterion; AIC, Akaike's Information Criterion; Zero,
the absolute difference between observed and predicted number of zero counts; RMSE, root mean
squared error (see Sect. 3.1) for categories 0–10; CI, confidence interval
(a) True log-likelihood, BIC and AIC are based on the asymptotic likelihood, which was maximised
numerically
Fig. 6.1 Simulated dmft counts for boys and girls: predicted and true distributions for the
simulated dataset (including the zero-bin)
Two different model parameterisations (the standard ZiP with sex in the distribution
part only, and the extended ZiP with sex in both the distribution and mixture
parts) can yield near-identical predicted outcomes, yet each gives rise to very
different model inferences. If focus were given specifically to the distribution part
of the model, the inferred Poisson distribution for girls would look very different
from the true distribution from which the simulated data were sampled, as
illustrated in Fig. 6.2.
Fig. 6.2 Distributions of the dmft counts for girls: the true distribution from which the simulated
sample was drawn and the inferred distribution from the standard ZiP model (both
excluding the zero-bin)
Despite a deliberately large difference in the simulated number of boys and girls
in the zero bin, the standard ZiP model readily accommodated the implied and
unnecessary constraint of equal proportions of boys and girls in the zero bin by
distorting the distribution part of the model. Were the extended ZiP model
evaluated, it would be favoured in this instance due to likelihood-based model fit
criteria. However, in many instances researchers do not consider covariates in the
mixture part of the model. Many zero-inflated models have been evaluated where
covariates are identified as important for the distribution part only, and no consid-
eration is given to these same covariates in the mixture model. The implications of
this will vary, but undoubtedly in some instances the most suitable model may have
been overlooked, the distribution part may have been biased, and model interpreta-
tion may have been misleading. One should thus consider carefully the role of
covariates in determining the mixture in zero-inflated models. Considering this
problem more generally, if one observes covariate differences in the proportion of
total zeros, there may be genuine differences in the mixture proportions. Therefore,
exploring bivariate associations between the binary outcome (zero/non-zero) and
each covariate could be a good indicator of which covariates ought to be included in
the mixture part of the model, at least initially.
Caveat The converse, however, does not necessarily follow, since the absence of
any bivariate association between the binary outcome and a covariate does not
preclude that covariate from being important to the mixture model. One might err
on the side of caution and include all possible covariates in the mixture part, but
there is a price to pay in terms of lack of model parsimony, with potentially
redundant covariates in the mixture model. The extent to which researchers then
seek to trade potential small biases in their models for small improvements in
parsimony is a matter of judgement.
Before moving into the realm of model selection and interpretation (in relation to
data generation), we first examine some of the more fundamental issues about
model generation.
Skrondal and Rabe-Hesketh questioned the use of a Poisson distribution for the
Brazilian dataset, since the study counts represented the number of dmft
(‘successes’) out of a total of eight deciduous molars (‘trials’) (Skrondal and
Rabe-Hesketh 2004). Model fit then becomes relevant for a finite range
of the outcome scale only (counts of 0–8). Using their Generalized Linear Latent
and Mixed Models (GLLAMM) software, the ZiB model was introduced
and compared with the ZiP model. This generally revealed that Poisson
models predicted unrealistically long tails and binomial models performed
much better. Similarly, our focus will be given to binomial models for the example
dataset.
In general, it is important that the familiarity of the Poisson model to most
researchers, compared with, say, the binomial model, does not dictate modelling
strategy, especially when it is clear that the data are bounded above. One should also
consider model interpretation and not be stuck with overly familiar practices.
For instance, it seems almost standard practice within the oral health research
field to interpret model coefficients in relation to an individual's mean dmft (as
with a Poisson model), whereas risk ratios for increments along the dmft index scale
(as with a binomial model) have not gained widespread use. Yet the latter is
probably more appealing in terms of what really happens during data generation,
since the individuals and teeth that are more prone to disease succumb first, and the
more individuals and teeth that do succumb, the more difficult it is for the remaining
individuals and teeth to become affected. It is therefore not surprising that
binomial outcome models were preferable for the Brazilian dataset, as
demonstrated by Skrondal and Rabe-Hesketh (2004), and this may not be
just because the data are bounded above.
6.7.2 Over-Dispersion
Within the Brazilian dataset, the dmft outcome was derived in a clustered setting of
children nested within schools. Since clustering can predispose to outcome over-
dispersion (i.e. heavier tails than expected for either the Poisson or the binomial
outcome), we explore this explicitly. One could deal with clustering directly within
a multilevel model, as described in other chapters, or one could employ a marginal
model, such as GEE (Liang and Zeger 1986), but for the purposes of this chapter we
maintain focus on the aggregated outcome and accommodate over-dispersion
directly. This is because what follows then applies to situations where one typically
models aggregated outcomes, as with disease counts within epidemiological
studies.
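A simple way to gauge such over-dispersion in the aggregated binomial outcome, without modelling the clustering itself, is a quasi-binomial fit; the sketch below uses simulated, hypothetical data and is not the oZiB parameterisation discussed later in the chapter.

```r
## Minimal sketch: estimate the dispersion of the aggregated binomial outcome.
set.seed(1)
dental <- data.frame(sex = factor(sample(c("boy", "girl"), 300, replace = TRUE)))
dental$dmft <- rbinom(300, 8, 0.3) * rbinom(300, 1, 0.6)

binom_fit <- glm(cbind(dmft, 8 - dmft) ~ sex, family = binomial, data = dental)
quasi_fit <- update(binom_fit, family = quasibinomial)
summary(quasi_fit)$dispersion   # values well above 1 suggest over-dispersion
```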
Fig. 6.3 Hypothetical risk models for the onset and progression of dmft. Chart gradients represent
the strength of underlying risks for disease onset and progression; A – period with no underlying
risk of disease; B – period where disease-free individuals are susceptible to disease onset;
C – period where individuals with existing diseases are susceptible to disease progression; LC1
– latent class one: sub-group of individuals with high risk of disease onset and progression; LC2 –
latent class two: sub-group of individuals with low risk of disease onset and high risk of disease
progression; LC3 – latent class three: sub-group of individuals with medium risk of disease onset
and low risk of disease progression
For longitudinal data, one might observe differences in the rate of disease onset and the rate of disease progression; for cross-sectional data
one might observe differences between the proportions of observed and expected
disease-free (i.e. disease prevalence) given the disease extent (number of lesions
per child) amongst those diseased. Thus, whether dealing with longitudinal
data (onset and progression) or cross-sectional data (prevalence and extent), mani-
fest differences in the outcome are potentially consequent on the underlying
risk differences in the dual processes of disease initiation and development.
The ZiP/ZiB models would thus seem most suitable.
Alternatively, if the underlying risks for caries onset and progression were
identical for any one child, but differed across children, generic mixture modelling
would seem suitable; this approach encapsulates the concept of ‘subtypes’ of
children. These concepts are not mutually exclusive and where underlying com-
plexity warrants it (i.e. where caries onset and progression differs both within and
between children), both modelling strategies could be valid simultaneously; it is
possible to have generic mixture models with latent classes subdivided into a zero-
bin and standard distribution. It is thus valuable to have a priori hypotheses of how
data may have been generated in terms of underlying differences in caries onset and
progression, in order to select the most appropriate modelling strategy between the
zero-inflated and generic mixture models.
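For completeness, a generic two-class binomial mixture can be sketched in R with the flexmix package; the chapter itself used other software, and the data frame and variable names below are hypothetical simulated values.

```r
## Minimal sketch: a generic (latent class) mixture of two binomial distributions.
library(flexmix)
set.seed(2)
dental <- data.frame(sex = factor(sample(c("boy", "girl"), 300, replace = TRUE)))
dental$dmft <- rbinom(300, 8, ifelse(runif(300) < 0.4, 0.05, 0.45))   # two latent sub-groups

mix_fit <- flexmix(cbind(dmft, 8 - dmft) ~ sex, k = 2, data = dental,
                   model = FLXMRglm(family = "binomial"))
summary(mix_fit)      # estimated class sizes and fit statistics
parameters(mix_fit)   # class-specific coefficients
```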
Fig. 6.4 Bland-Altman plot of contrast between two over-dispersed Binomial models (oZiB_CP
and o2LCiB_CP)
The Brazilian dataset is too small and insufficiently robust to provide evidence to
support or rebut the hypothesised two-stage data generation process. Nevertheless,
findings from such a relatively simple evaluation might inform a modelling strategy
and steer a preference between zero-inflated or generic mixture models, particularly
where model-fit criteria and predicted outcomes make no such distinction. Regard-
ing the dental caries data, if we assume that disease follows a two-stage process it
would be more suitable to adopt a zero-inflated model; if we had evidence that
caries onset and progression have similar underlying risks within individuals and
differences occur only between individuals, then generic mixture models would be
more suitable. Despite the lack of evidence either way from the Brazilian dataset,
given extensive a priori clinical knowledge of caries onset and progression that
supports a dual process, we opt for the zero-inflated model (oZiB_CP) as our
preferred model.
6.10 Summary
Böhning et al. rightly argued that one needs to consider the problem of excess zeros
in dental data and they advocated the use of the zero-inflated Poisson model for the
Brazilian oral health dataset. However, the Poisson distribution is not always ideal
for bounded data. Within the Brazilian dataset, counts represented the number of
successes (dmft) out of a finite number of trials (eight deciduous molars) and
consequently the binomial distribution was more suitable. Where data are also
inherently clustered, over-dispersion needs to be modelled explicitly and adopting
data-specific model-fit criteria may be useful in evaluating the performance of
models with respect to predicted outcomes where there is clinical importance
for certain thresholds. The estimated proportion of zeros is an obvious marker
of model performance when evaluating data with excessive zeros compared to
standard count distributions. We also propose the use of a root mean squared error
(RMSE) between observed and predicted counts for the entire viable outcome
range. However, data-specific model-fit criteria do not always agree with likelihood-based criteria, which cautions against relying on log-likelihood statistics in isolation.
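A minimal sketch of such a criterion is given below; the frequencies are made up for illustration and the exact form used by the authors is not reproduced here.

```r
## Minimal sketch: RMSE between observed and model-predicted frequencies over the
## viable outcome range 0-8 (the numbers below are hypothetical).
rmse_counts <- function(obs_freq, pred_prob) {
  pred_freq <- sum(obs_freq) * pred_prob
  sqrt(mean((obs_freq - pred_freq)^2))
}

obs  <- c(120, 30, 25, 20, 15, 10, 8, 5, 2)   # observed frequencies for dmft = 0, ..., 8
pred <- dbinom(0:8, size = 8, prob = 0.15)    # predicted probabilities from, say, a binomial fit
rmse_counts(obs, pred)
```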
For cross-sectional analyses of a randomised study, one anticipates that mem-
bership of the zero bin is balanced across all treatment groups at baseline (due to
randomisation), but there is no reason for other covariates to have balanced zero
counts. In non-randomised studies, inadequately randomised studies, or within
longitudinal studies where the analyses are of follow-up data, the assumption of
balanced zeros across intervention groups is no longer viable. For zero-inflated
models, it is thus necessary to consider covariates in the mixture model, especially
if identified as necessary in the distribution part.
As different model parameterisations can yield near-identical predicted
outcomes and model fit statistics, whilst yielding potentially diverse model
inferences, it becomes necessary to consider data generation to inform model
selection and model interpretation. This could be particularly valuable where
different covariates affect onset and progression differently. In epidemiology, for
instance, it is proposed that childhood cancers are triggered by infection and that the
infectious agents are transmitted via population mixing (Kinlen et al. 1990).
A small community may be free from an infectious assault (a period of ‘not at
risk’), but once circumstances change due to factors associated with population
mixing, the community may become exposed to the infectious agents (start of
exposure) and the rate of spread of infection then depends upon other factors
associated with population mixing. The underlying risk of cancer is affected by
exposure to infectious agents and factors specific to the infected individuals.
Different factors are associated with population mixing and would potentially have different impacts upon the onset and progression of cancer in each community. Where the number of cancers across several small areas within a region
is modelled assuming a zero-inflated Poisson distribution, different population
mixing measures could be evaluated. A measure that captures elevated risk to a
community of exposure to incoming infectious agents should be associated with
elevated rates of communities belonging to the distribution part of the zero-inflated
model. Similarly, a measure that captures elevated risk to a community of infec-
tious agents spreading within communities should be associated with elevated rates
of communities having higher prevalence rates of cancers (conditional on the
community belonging to the distribution part). It may thus be possible to evaluate the infectious agent hypothesis more carefully using zero-inflated models, examining where the impact of different measures of population mixing lies within the model parameterisation.
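A hedged sketch of the kind of model described is given below, using the pscl package in R; the area-level variables, the offset and the simulated data are all hypothetical and serve only to show where each population-mixing measure would enter the model.

```r
## Minimal sketch: area-level cancer counts with one population-mixing measure in the
## Poisson (count) part and another in the mixture (zero-bin) part of a ZiP model.
library(pscl)
set.seed(5)
areas <- data.frame(expected = runif(50, 0.2, 5),   # expected counts from reference rates
                    exposure = rnorm(50),           # proxy: risk of incoming infection
                    spread   = rnorm(50))           # proxy: within-community spread
areas$cases <- rpois(50, areas$expected) * rbinom(50, 1, plogis(areas$exposure))

zip_fit <- zeroinfl(cases ~ spread + offset(log(expected)) | exposure,
                    data = areas, dist = "poisson")
summary(zip_fit)
```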
Introducing the flexibility of generic mixture models might at first seem to evade the problem of needing to explore covariates for the mixture part of the model, since the implicit constraint on zero counts imposed by zero-inflated models with no covariates in the mixture part is circumvented (unless one latent class is determined empirically to have a central location of zero). The problem does not, however, thereby disappear.
6.11 Conclusions
When dealing with biomedical count data that exhibit an excess of zeros, model
selection is not straightforward. It is crucial to consider appropriate outcome
distributions and explore context-specific model-fit criteria. For zero-inflated
models, one should consider covariates in the mixture model if identified as
necessary in the distribution part. Difficulties in distinguishing between models
based solely on likelihood statistics and predicted counts need to be informed by a
priori hypotheses of data generation. Zero-inflated models reflect whether or not
there are or have been risk differences in the onset and progression of disease,
whereas generic mixture models identify sub-types of individuals; both model
strategies can be employed simultaneously. Model selection is not about model fit
per se, but also about interpretation and robustness in the model truly reflecting the
context in which the data were generated.
References
Blance, A., Tu, Y. K., Baelum, V., & Gilthorpe, M. S. (2007). Statistical issues on the analysis of
change in follow-up studies in dental research. Community Dentistry and Oral Epidemiology,
35(6), 412–420. available from: PM:18039282.
Bland, J. M., & Altman, D. G. (1999). Measuring agreement in method comparison studies.
Statistical Methods in Medical Research, 8(2), 135–160. available from: PM:10501650.
Böhning, D. (1998). Zero-inflated Poisson models and C.A.MAN: A tutorial collection of evi-
dence. Biometrical Journal, 40(7), 833–843.
Böhning, D., Dietz, E., & Schlattmann, P. (1999). The zero-inflated Poisson model and the
decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical
Society, Series A, 162, 195–209.
Carlos, J. P., & Gittelsohn, A. M. (1965). Longitudinal studies of the natural history of caries. II.
A life-table study of caries incidence in the permanent teeth. Archives of Oral Biology, 10(5),
739–751. available from: PM:5226906.
Groeneveld, A. (1985). Longitudinal study of prevalence of enamel lesions in a fluoridated and
non-fluoridated area. Community Dentistry and Oral Epidemiology, 13(3), 159–163. available
from: PM:3860338.
Hall, D. B. (2000). Zero-inflated Poisson and binomial regression with random effects: A case
study. Biometrics, 56(4), 1030–1039. available from: http://www.blackwell-synergy.com/loi/
biom.
Holst, D. (2006). The relationship between prevalence and incidence of dental caries. Some
observational consequences. Community Dental Health, 23(4), 203–208. available
from: PM:17194066.
Kinlen, L. J., Clarke, K., & Hudson, C. (1990). Evidence from population mixing in British New
Towns 1946–85 of an infective basis for childhood leukaemia. Lancet, 336(8715), 577–582.
available from: PM:1975376.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in
manufacturing. Technometrics, 34(1), 1–14. available from: ISI:A1992GZ77700001.
Leroy, R., Bogaerts, K., Lesaffre, E., & Declerck, D. (2005). Multivariate survival analysis for the
identification of factors associated with cavity formation in permanent first molars. European
Journal of Oral Sciences, 113(2), 145–152. available from: PM:15819821.
Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models.
Biometrika, 73, 13–22.
Lord, F. M. (1967). A paradox in the interpretation of group comparisons. Psychological Bulletin,
68, 304–305.
Lord, F. M. (1969). Statistical adjustments when comparing preexisting groups. Psychological
Bulletin, 72, 337–338.
Macek, M. D., Beltran-Aguilar, E. D., Lockwood, S. A., & Malvitz, D. M. (2003). Updated
comparison of the caries susceptibility of various morphological types of permanent teeth.
Journal of Public Health Dentistry, 63(3), 174–182. available from: PM:12962471.
Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of
Econometrics, 33(3), 341–365. available from: ISI:A1986F205600002.
Parner, E. T., Heidmann, J. M., Vaeth, M., & Poulsen, S. (2007). Surface-specific caries incidence
in permanent molars in Danish children. European Journal of Oral Sciences, 115(6), 491–496.
available from: PM:18028058.
Poulsen, S., & Horowitz, H. S. (1974). An evaluation of a hierarchical method of describing the
pattern of dental caries attack. Community Dentistry and Oral Epidemiology, 2(1), 7–11.
available from: PM:4153274.
Poulsen, S., Heidmann, J., & Vaeth, M. (2001). Lorenz curves and their use in describing the
distribution of ‘the total burden’ of dental caries in a population. Community Dental Health, 18
(2), 68–71. available from: PM:11461061.
Ridout, M., Demétrio, C. G. B., & Hinde, J. (1998) Models for count data with many zeros.
Proceedings article for an International Biometric Conference (pp. 179–192). Cape Town.
http://www.kent.ac.uk/IMS/personal/msr/webfiles/zip/ibc_fin.pdf.
Senn, S. (2006). Change from baseline and analysis of covariance revisited. Statistics in Medicine,
25(24), 4334–4344. available from: PM:16921578.
Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel,
longitudinal and structural equation models. London: Chapman & Hall.
Vermunt, J. K., & Magidson, J. (2005a). Latent GOLD 4.0 User’s Guide. Belmont Massachusetts:
Statistical Innovations Inc.
Vermunt, J. K., & Magidson, J. (2005b). Technical guide for Latent GOLD 4.0: Basic and
advanced. Belmont Massachusetts: Statistical Innovations Inc. http://www.statisticalinnovations.
com/products/LGtechnical.pdf.
Vieira, A. M. C., Hinde, J. P., & Demetrio, C. G. B. (2000). Zero-inflated proportion data models
applied to a biological control assay. Journal of Applied Statistics, 27(3), 373–389. available
from: ISI:000086354400009.
Wong, M. C., Schwarz, E., & Lo, E. C. (1997). Patterns of dental caries severity in Chinese
kindergarten children. Community Dentistry and Oral Epidemiology, 25(5), 343–347.
available from: PM:9355769.
Chapter 7
Multilevel Latent Class Modelling
7.1 Overview
Multilevel latent class models can reveal new insights into clustered data.
For instance, within observational studies, latent class analysis of multilevel data
allows groups or clusters of patients to be identified (e.g. according to casemix or
different pathways through the healthcare system) and allows sub-groups of
organisations to be derived (e.g. according to the treatments available, quality of
care, or differences in patient outcomes). It is also feasible to generate organisation-level latent classes with similar patient casemix; differences between these casemix-adjusted latent classes can then be evaluated, with factors that differ across the organisational classes tested for their association with differences in clinical outcomes. This allows areas of healthcare provision to be
targeted for intervention and evaluation to improve patient care. The same methods
can be adopted in a cluster-randomised setting, where the multilevel latent class
methodology improves on the cluster-randomisation, generating organisational
classes that are balanced in terms of patient casemix – a form of pseudo-
randomisation of observational data.
Patients with colorectal cancer (ICD-10 (World Health Organisation 2005) codes
C18, C19 and C20) diagnosed between 1998 and 2004 and resident in the
Northern and Yorkshire regions were identified from the Northern and Yorkshire
Cancer Registry and Information Service (NYCRIS) database. Patient age, sex,
tumour stage at diagnosis (using the Dukes classification (Dukes 1949)), diag-
nostic centre (Trust), and whether or not the patient received treatment were
extracted. Socioeconomic background (SEB) was defined at the 2001 enumera-
tion district level of residence (super output area) using the Townsend Index
(Townsend et al. 1988) and matched to patients using their postcode of
residence.
We adopted the outcome of mortality (alive/dead) at 3 years after diagnosis, as
this was considered to be clinically meaningful and facilitated ready comparison
with other studies. Whilst interest lies in investigating potential treatment centre
characteristics associated with colorectal cancer survival, this can be complex to
assess, as patients may be treated at different Trusts throughout their care.
In our data, 90% were treated initially within the same Trust as they were
diagnosed, though only 75% remained within this Trust throughout. We neverthe-
less choose to analyse by Trust of diagnosis in order to include all patients, whether
treated or not, and maintain a reasonable proportion of patients whose treatment
was initially received within the same Trust as they were diagnosed. As 78 patients
had diagnostic centres found to be external to Trusts within the Northern and
Yorkshire region, these were excluded, yielding 24,455 patients available for
analysis.
We seek to answer two distinct research questions with the example dataset:
i. “What is the relation between 3-year mortality and socioeconomic background
(SEB) of patients and what factors affect this relationship?”
This research question is an example within epidemiology of seeking to
determine the impact of an exposure or risk factor (SEB in this instance) on
an outcome (3-year survival) where it is impossible to conduct a randomised
controlled trial. When seeking to determine the outcome–exposure relationship,
adjustment for potentially confounding factors is crucial. Achieving this is not
always straightforward, as we will discuss.
ii. “How does Trust performance vary after accommodating patient (casemix)
differences?”
This research question seeks to assess variation in Trust performance (in
terms of mean 3-year survival rates) over and above differences anticipated due
to patient casemix. Some Trusts may perform better or worse than others in
terms of their median survival rates due to their patient casemix (which likely
varies geographically), or due to underlying differences in the effectiveness of
Trust function and healthcare delivery, or both. It is important to identify good
and poor performing Trusts in order to identify good clinical practice.
Before addressing each research question, we discuss the scope of potential
models available within a multilevel latent-class framework, because each research
question requires a slightly different multilevel latent class model to deliver the
appropriate analytical strategy.
The concept of latent classes was introduced in Chap. 6. Within a multilevel dataset it
is possible to have a latent-class structure at each level of the data hierarchy.
Considering latent classes at more than one level permits several complex model
configurations, each relating to different assumptions, with slightly different
interpretations, not all of which have analogues to continuous latent-variable models
or standard multilevel models. As mentioned in Chap. 6, some parameterisations may
not be identifiable, and some identifiable models may not be interpretable. These
issues only become more complex in a multilevel setting.
For the two-level colorectal cancer dataset (patients nested within Trusts), we may consider initially that
patients belong to different latent classes (i.e. sub-groups of individuals that are
homogeneous within groups, though heterogeneous between groups in their 3-year
mortality). Conditional on belonging to a given patient-level class, the Trusts in
which they are treated may then be grouped according to similarities or differences
in terms of patient outcomes.
If grouped according to similarities, a Trust class might contain Trusts that have
roughly the same mean levels of 3-year survival, whilst the proportion of patients
within each patient-level class might differ. Trust classes are then homogeneous
with respect to patient outcomes, whilst heterogeneous in terms of patient-class
profiles. This modelling strategy is appropriate for research question (i).
If grouped according to outcome differences, Trust classes might contain Trusts
that have the same proportions of patients within each patient class, where patient
classes differ in terms of mean 3-year survival. Trust-level classes are then homo-
geneous with respect to patient-class profiles (i.e. casemix), whilst heterogeneous in
terms of patient outcomes (survival). This modelling strategy is appropriate for
research question (ii).
In practice, within the estimation process, there is no sense of ordering in terms
of at which level the latent classes are formed ahead of other levels, because this
happens simultaneously, i.e. patient-level classes are determined simultaneously
with Trust-level classes. Models are an optimum solution for all classes, at all
levels, conditional on covariates considered in the model; estimation procedures
seek to maximise the likelihood function in a single process.
The simplest scenario is where the continuous latent variable at the upper level is
replaced by a categorical latent variable. The usual constraint of normally
distributed upper-level residuals is then no longer applicable. Within a standard
multilevel model of the colorectal cancer data, the mean outcome for each Trust
(the proportion of patients who survive 3 years) is assumed to be normally
distributed and the model estimates an overall Trust mean 3-year survival fraction
and its variance. If a categorical latent variable were adopted instead, Trust classes
are determined such that each Trust is assigned to a class according to probabilities
that sum to one over all Trust classes. The model estimates the mean 3-year survival
for each Trust class and also the size of each Trust class (i.e. summation of
individual Trust probabilities for each Trust class); and no assumptions are made
regarding the distribution of Trust class means or Trust class sizes.
As was seen in Chap. 6, covariates can be entered into a latent-class model in the
usual way, as within a standard regression model, i.e. as ‘predictors’ of the
outcome variation. The same covariates may also enter the model as ‘predictors’ of latent class membership.
When adopting a multilevel latent class structure, the central location of each class is
estimated for all levels of the hierarchy simultaneously, i.e. each latent class at any
level has its own intercept or mean probability. This is very broadly analogous to
random intercepts within a continuous latent variable multilevel model. Discrete latent
variable intercepts at a lower level, however, may be either class dependent or class
independent in relation to class structures at higher levels. Consider, for instance, the
two-level colorectal cancer dataset, where patients are nested within Trusts. With, say, MP classes at the patient level (MP ≥ 2) and MT classes at the Trust level (MT ≥ 2), the MP patient-class intercepts may exhibit relative differences that are either identical or different within each Trust class. Where they are identical, patient-class intercepts differ by the same degree, irrespective of which Trust class their treatment centre is assigned to:
patient-class intercepts are thus Trust class independent. Where patient classes vary
across Trust classes, they are Trust class dependent. Model interpretation differs
between these two latent-class multilevel models and the choice of model is driven
by context.
Let us consider the context of the illustrative dataset. Individuals tend to
vary in many ways, such as in their willingness to seek medical advice if feeling
ill, or in their diet, or their levels of daily exercise. It is therefore likely that some sub-groups of patients are relatively homogeneous in their experience of disease and in other characteristics, some of which may be related either to their risk of developing disease or to their bodies’ ability to cope with illness once disease has developed.
Similarly, some Trusts are more likely to share common practices and procedures,
offering standard treatment pathways for patients of a particular kind, whereas
others may differ slightly due to local factors, such as size and the numbers of
medical specialists (consider for instance differences between teaching hospitals,
city general hospitals, and ‘cottage’ hospitals). The Trust-class independent
configuration enables identical contrasts to be made amongst patient classes
within Trust classes, in a relative sense, i.e. the patient classes with ‘best’ and
‘worst’ mortality differ in relative terms identically for each Trust class. If the
Trust class dependent configuration were adopted, contrasts in survival amongst
patient classes in one Trust class could, relatively speaking, mean different things
according to which Trust class were considered. In both instances, Trust classes
may differ in their overall 3-year mortality. For illustration and ease of inter-
pretation within the colorectal cancer dataset, we adopt the class independent
configuration for model intercepts – this is not essential, though helpful in
this context. In other circumstances (especially for different datasets) the class
dependent configuration might be more appropriate.
Patient-level covariate effects may be constrained to a single value across Trust classes, or allowed to have different estimated values for each Trust class (i.e. to be Trust class
dependent). This latter option is akin to random slopes in the standard multilevel
model, but where the random effects (represented by a continuous latent variable)
are effectively categorised and multiple fixed effects parameter values are estimated
for each Trust latent class.
Not all covariate fixed effects would necessarily be modelled this way, and so
the number of patient-level covariates that are Trust class dependent could be fewer
than the total number of patient-level covariates, i.e. CR ≤ CF, yielding CR × MT
parameters to be estimated. This can nevertheless be much less parsimonious than
the standard multilevel model, since the latter has only one continuous latent
variable variance to be estimated per covariate random slope, as opposed to multiple fixed effects parameter values for each Trust class. This is why it is necessary to
consider carefully the pros and cons of class dependent vs. independent covariate
effects. Furthermore, interpretation again differs between these two types of multi-
level latent-class models and choice is driven by context.
Considering the context of the colorectal cancer dataset, we initially adopt the
class dependent configuration to allow for random effects, though for parsimony we
switch to class independent covariate effects if there is little evidence that a covariate
parameter value varies substantially across classes. A combination of configurations
is possible and may be more parsimonious. For instance, one covariate might have
two distinct parameter values across six latent classes, such that parameter estimates
are constrained to take one value for three classes and another value for the other
three classes. Although technically possible, a complex a priori grasp of how the data
are generated should prevail to warrant such complex model structures.
Class size may also be class dependent or independent. Consider again the 2-level
colorectal cancer dataset with MP patient-level and MT Trust-level latent classes.
There are MP × MT latent classes and each may have a different proportion of the
total number of patients (Trust class dependent). Some patient classes may possess
no patients at all because the number of patient classes per Trust class is fixed in the
model parameterisation, yet in practice some Trust classes might favour fewer
patient classes, so some are ‘empty’. Alternatively, it is possible to constrain
class sizes such that the proportion of each patient class remains the same for
each Trust class (Trust class independent). The total number of patients per Trust
class can still vary. Model interpretation differs according to which strategy is
adopted, which is again driven by context.
Considering the colorectal cancer dataset, model strategy (i) requires that patient
class sizes are Trust class dependent, to reflect that each Trust class may be made up
of differing proportions of patient classes. Model strategy (ii) requires patient classes
to be Trust class independent, to reflect that each Trust class is required to have
exactly the same profile of patient classes (and hence patient casemix characteristics).
The statistical software LatentGold (Vermunt and Magidson 2005a) was used
for all latent variable models. The number of latent classes at the patient and Trust
levels is sequentially increased from one to identify the optimum model according
to a number of model-fit criteria, including the Bayesian Information Criterion
(BIC) (Schwarz 1978), the Akaike Information Criterion (AIC) (Akaike 1974)
and change in log-likelihood (LL). Both the BIC and AIC incorporate a sense
of model parsimony by accommodating the varying number of model parameters
(Vermunt and Magidson 2005b) while the LL does not. Use of the BIC
implies that the true model is among those compared, although this may not be
the case as modelling inherently simplifies the data; whereas use of the AIC may
suggest a more complex model than necessary, as it may over-fit the data.
Although the LL improves with an increasing number of classes at both levels,
improvements grow more slowly to reveal a diminishing return for increased
model complexity.
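For reference, the two information criteria are simple functions of the log-likelihood, the number of estimated parameters and the sample size; the numbers in the sketch below are purely illustrative.

```r
## Minimal sketch: AIC and BIC from a model's log-likelihood (logLik),
## with k estimated parameters and n observations.
info_criteria <- function(logLik, k, n) {
  c(AIC = -2 * logLik + 2 * k,
    BIC = -2 * logLik + k * log(n))
}
info_criteria(logLik = -14250, k = 12, n = 24455)   # e.g. one candidate latent class model
```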
We evaluate a standard multilevel model (continuous upper-level latent vari-
able) as well as the latent-class multilevel model to contrast the standard approach
with the latent-class approach. We select models that minimise all three criteria,
while providing a useful model and informative results.
To address research question (i), we take stock of the generic problems in analysing
outcome-exposure relationships with observational data. Whilst discussed in terms
of our illustrative dataset, issues raised here affect all epidemiological datasets.
[Figure 7.1: DAG with nodes for age at diagnosis, sex, SEB, stage at diagnosis, treated (yes/no), time to first treatment, and survival]
Fig. 7.1 Directed acyclic graph showing the relationship amongst all available variables at the
population level
When considering SEB as the main exposure variable, it has been suggested that
SEB may influence late presentation (Ionescu et al. 1998; Kogevinas et al. 1991),
which may then influence stage at diagnosis. Therefore, SEB causally precedes
stage at diagnosis. Figure 7.1 shows the proposed relationships amongst all avail-
able variables at the population level by use of a DAG. If stage is adjusted for by
inclusion in a regression model examining the relationship between SEB and
mortality, this may introduce bias and lead to inappropriate model interpretation.
This may explain why findings on the impact of SEB on cancer mortality vary,
with some studies finding a significant relationship between worsening SEB and
increased cancer mortality (Coleman et al. 1999; Pollock and Vickers 1997;
Schrijvers et al. 1995), whilst others have found no such association (Lyratzopoulos
et al. 2004; Wrigley et al. 2003). It depends upon whether any statistical adjustment for alleged confounding has indeed removed the impact of genuine confounding (for which the alleged confounder is merely a proxy), has introduced bias due to the reversal paradox, or has done some unknown combination of both.
Standard regression analyses may give rise to biased results when model covariates
(such as stage at diagnosis) are measured with error, or have missing values
(Carroll et al. 2006; Fuller 1987), and this bias is exacerbated within product
interaction terms (Greenwood et al. 2006), e.g. when investigating the role of
SEB across different levels of stage at diagnosis. Stage, widely used as a potential
confounder, often suffers from a large proportion of incomplete data (24% missing).
Deriving multiple patient latent classes divides patients into sub-groups such that the
relationship between survival and SEB might vary within each latent class. The latent
classes then correspond to specific patient features that can be labelled post-hoc
according to outcome (e.g. ‘good’ or ‘poor’ survivors) or covariates (e.g. ‘early-’ or
‘late-’ stage disease at diagnosis). Adopting multiple latent classes for the Trusts
effectively groups diagnostic or treatment centres, though the relationship between
mortality and SEB varies only across patient classes, not Trust classes.
The continuous measures of patient age at diagnosis and Townsend score (SEB)
exhibited non-linear relationships with 3-year survival. Generalised additive models
(GAMs; discussed in more detail in Chap. 15) identified the higher order terms required for each; the statistical software used was R 2.9.0 (Venables and Smith
1990). For both terms, the non-linearity was explored and threshold values identified,
to simplify the number of higher order terms required. Patient age at diagnosis was
centred on the study mean of 71.5 years and Townsend score was centred on the
population mean of zero (the study mean was 0.040). Models were also adjusted for
sex.
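A minimal sketch of this kind of exploration using the mgcv package in R is given below; the data frame, variable names and simulated values are hypothetical and are not the NYCRIS data.

```r
## Minimal sketch: explore non-linearity of age and Townsend score with 3-year mortality.
library(mgcv)
set.seed(3)
colorectal <- data.frame(age      = rnorm(5000, 71.5, 8.6),
                         townsend = rnorm(5000, 0, 3.5),
                         sex      = factor(sample(c("M", "F"), 5000, replace = TRUE)))
colorectal$died3yr <- rbinom(5000, 1, plogis(-0.5 + 0.04 * (colorectal$age - 71.5)))

gam_fit <- gam(died3yr ~ s(age) + s(townsend) + sex, family = binomial, data = colorectal)
plot(gam_fit, pages = 1)   # inspect the smooths to identify thresholds / higher-order terms
```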
Stage was included as a class predictor rather than as a fixed-effect covariate,
meaning the resultant patient classes had a graduated mortality risk analogous to
that observed for different stages of disease. This allowed the relationship between
mortality and various risk factors to be explored across patient classes, introducing
an implicit ‘interaction’ with stage at diagnosis, without the risk of exacerbated bias
due to measurement error. Additional variables were included as inactive
covariates, which allowed them to be interpreted within the classes but did not
allow them to predict class membership.
7.5.5 Results
All multilevel latent class models revealed an improved fit compared with standard
multilevel regression analysis according to all model-fit criteria considered. Although
the model-fit criteria identified different optimum models (the LL and BIC identified
the model with three patient classes and one Trust class while the AIC identified the
model with four patient classes and three Trust classes) the preferred model was that
with three patient classes and two Trust classes, because this sufficiently differentiated
patient characteristics, while four patient classes and either extra or fewer Trust classes
added little insight to patient and hospital variation. For the final model, the patient-level classification error (CE) was 22% and the Trust-level CE was 8%.
Table 7.2 summarises the preferred multilevel latent class model, with patients
apportioned into either a large good-prognosis group, a small reasonable-prognosis
group, or an even smaller poor-prognosis group. Patient class 1 contained 42% of
cases of which 10% died within 3 years, compared with patient class 2 with 31%
of cases of which 69% died within 3 years, and patient class 3 with 27% of cases of
which 99% died within 3 years.
The impact of deprivation differed somewhat across the patient classes.
In classes 1 and 2 (good and reasonable prognosis), living in a more deprived area
was clearly associated with increased odds of death. In class 3 (poor prognosis), the
association was less clear, with the odds ratio indicating only slightly decreased
odds of death and with a wide confidence interval. This indicates that the role of
SEB in 3-year colorectal mortality operates somewhat differently for differently
staged individuals, with SEB having less impact for those with late-stage disease.
The mean Townsend scores also differed across the classes, indicating that
individuals in class 2 (reasonable prognosis) generally lived in more deprived
areas than individuals in either of the other two classes.
The impact of sex differed substantially across the classes. In class 3 (poor prognosis), females had an increased risk of death compared to males, whereas in classes 1 and 2 (good and reasonable prognosis), females had a decreased risk of death. This difference
Table 7.2 Results for the subject classes in the 3-patient-, 2-Trust-class multilevel regression
model: odds ratio of death within 3 years
OR (95% CI) Wald test
Model covariates Class 1 Class 2 Class 3 (p-value)
Female 0.60 (0.46–0.77) 0.84 (0.61–1.15) 1.75 (0.48–6.30) 0.022
Townsend (per SD) 1.21 (1.07–1.37) 1.59 (1.31–1.92) 0.99 (0.55–1.77) 0.048
Age (per 5 years) 2.18 (0.83–5.75) 2.53 (2.00–3.21) 0.58 (0.22–1.53) 0.011
Age squared (per 5 years) 1.00 (0.96–1.03) 1.01 (1.00–1.02) 1.06 (0.96–1.16) 0.340
Model summary statistics
Class size 42% 31% 27%
Overall prevalence 10% 69% 99%
Reference group prevalence 6% 69% 97%
Model class profiles
Stage A 23% 6% 0.5%
Stage B 47% 19% 8%
Stage C 27% 30% 16%
Stage D 0.5% 12% 69%
Missing 3% 32% 7%
Patients treated 98% 76% 69%
ICD-10 C18 (colon) 58% 57% 62%
ICD-10 C19 (rectosigmoid jct.) 11% 10% 11%
ICD-10 C20 (rectum) 31% 34% 27%
Tumour on left side 69% 67% 61%
Tumour on right side 28% 25% 29%
Tumour across both sides 3% 8% 11%
OR Odds Ratio, CI Confidence Interval; there were 12,856 (52%) deaths in the entire study
population; the reference group comprised males, aged 71.5 years, classified as Stage A at diagnosis,
and attributed a Townsend score of zero
in risk profile by sex across classes indicates that the role of sex in 3-year colorectal
mortality operates differently for differently staged individuals, with women faring
better than men with early-staged disease, and the reverse with late-stage disease.
The proportions of females differed across the classes (class 1: 42%; class 2: 33%;
class 3: 25%), indicating that the majority of females had a decreased risk of death
compared with males.
The impact of age in the model differed substantially across classes. In class
2 (reasonable prognosis), older age was clearly associated with increased odds of
death, as too in class 1 (good prognosis), though the association was reduced. In
contrast, in class 3 (poor prognosis), the odds ratio indicates a decreased odds of
death for older age. The mean age also differed across the classes (class 1:
71.6 years, SD 8.6 years; class 2: 76.6 years, SD 8.4 years; class 3: 71.5 years,
SD 8.8 years), indicating that individuals in class 2 (reasonable prognosis) were, on
average, older than the individuals in either of the other two classes.
The stage profile differed across the patient classes. Class 1 (good prognosis)
corresponded to early stage diagnosis with 69% of the stage A/B patients versus 28%
7.5.6 Discussion
The multilevel latent class regression model substantially improved fit for the
illustrative dataset compared to the standard multilevel model. As both patients
and Trusts were categorised into latent classes, this led to an improved inter-
pretation of the data. We are therefore able to investigate how risk factors associate
with mortality within sub-groups, rather than only for all patients or Trusts, as in
the standard multilevel model. New insights were available that were not previously
apparent using the standard multilevel model. For instance, although the standard
analysis found age, sex and SEB differences in survival, multilevel LCA showed
that within latent classes, age and sex differences varied according to patient class
and SEB varied according to patient sub-type; Trusts, whilst heterogeneous, did
not have a huge impact on the association between patient factors and 3-year
mortality.
By not modelling stage as a covariate, we have attempted to avoid the reversal
paradox and minimise bias due to measurement error and incomplete data. As the
patient classes correspond to stage at diagnosis, we have been able to determine
how the covariates associate with mortality within different stage groupings,
without using product interaction terms. Categorising missing values in stage
allows the modelling to take account of incomplete data and assign patients to the
most suitable patient class according to how their outcome corresponds to other
patients. There are some limitations:
First, with stage included as a class predictor, bias is minimised, though it will
not be completely eradicated. Patient classes may, however, be derived without
stage as a class predictor and the same differentiation across patient classes may
then be observed. For patients presenting with early- or mid-stage disease,
their characteristics may help determine their chances of dying from colorectal
cancer, whilst for patients presenting with late-stage disease, their characteristics
are less likely to be associated with mortality. Second, it would be more sophisti-
cated to explore survival as a continuous measure, e.g. using Cox’s proportional
hazards regression. This was not undertaken as this would have introduced even
further complexity. Third, as already suggested, SEB is really measured at
the area-level and so should be considered as a separate level, effectively
cross-classified with the Trust level. This too could be accommodated, but to
achieve these more complex models – cross-classified Cox proportional hazards
regression – one needs more powerful software, e.g. WinBUGS (Lunn et al. 2000)
or eventually MPlus (Muthén and Muthén 2007), once developed for
cross-classified modelling.
We have considered a number of model-fit criteria when assessing our
‘best-fit’ model and we have chosen the model that minimises these criteria while
providing a useful and informative summary of the data. Alternative models could,
however, have been selected. Our chosen model has a low classification error at the
Trust level (8%), meaning that the classes are more ‘real’ and we can differentiate
between the Trusts, categorising them into good or poor performance groups. At the
patient level, the classification error was higher (22%), meaning that these classes are
more ‘virtual’. Some patients will fall into one class entirely, whilst others may
‘belong’ substantially to more than one patient class.
To address research question (ii), we wish to ‘adjust’ for patient casemix in order to
assess the relative (ranked) performance of Trusts. We contrast this approach to that
of adopting Trust standardised mortality ratios (SMRs).
7.5.9 Results
Table 7.3 summarises the ‘ideal’ multilevel latent class (MLLC) model determined by the procedures
described. Patients were assigned to two latent classes of similar size, one with
Table 7.3 Results for the subject classes in the 2-patient-, 2-Trust-class multilevel latent class
regression model: odds ratio of death within 3 years
Model summary statistics Class 1 Class 2
Class size 54.3% 45.7%
Overall prevalence 63.0% 39.3%
Reference group prevalence 23.2% 7.0%
Model covariates OR (95% CI)
Stage = B 2.40 (1.63–3.54) 0.55 (0.21–1.43)
Stage = C 7.72 (4.61–12.94) 1.74 (0.75–4.06)
Stage = D 20.19 (8.88–45.89) Infinite^a
Stage = X 6.30 (1.89–20.97) 33.41 (7.93–140.68)
Female 0.94 (0.78–1.14) 0.58 (0.38–0.88)
Townsend (per SD) 1.32 (1.21–1.43) 1.03 (0.81–1.31)
Age (per 5 years) 1.51 (1.42–1.60) 2.53 (1.31–4.90)
Age squared (per 5 years) 1.005 (0.997–1.012) 0.984 (0.960–1.008)
OR Odds Ratio, CI Confidence Interval. There were 12,856 (52.2%) deaths in the study popula-
tion. The reference group comprised males, aged 71.5 years, classified as Stage A at diagnosis, and
attributed a Townsend score of zero
^a The odds ratio could not be estimated as there were zero patients who survived 3 years in this subcategory
reasonable prognosis (PC1: 54.3% of cases, of which 63.0% died within 3 years),
and one with better prognosis (PC2: 45.7% of cases, of which 39.3% died within
3 years). Trusts were similarly assigned to two latent classes. The largest Trust
class, with 53.1% of patients, had better prognosis (TC1: 51.3% of patients died
within 3 years; TC2: 53.2% of patients died within 3 years).
Table 7.4 summarises the number of deaths within each patient class by stage.
Allocating patients to classes according to their largest class probability
(modal assignment), all patients in PC1 diagnosed either at stage B or C died within
3 years; in PC2, all patients diagnosed at stage A, B or C survived. This difference is
anticipated, as stage at diagnosis is an important predictor of survival. Far more of the early- or mid-stage patients died within 3 years in PC1 than in PC2, and there
was a clear graduation in survival with increasing stage at diagnosis from early- to
late-stage within both classes.
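Modal assignment simply takes, for each patient, the class with the largest posterior class-membership probability; a minimal sketch with a hypothetical probability matrix is given below.

```r
## Minimal sketch: modal class assignment from posterior class-membership probabilities.
set.seed(4)
post <- matrix(runif(10 * 2), ncol = 2)   # 10 patients, 2 patient classes (hypothetical)
post <- post / rowSums(post)              # each row sums to one
modal_class <- max.col(post)              # class with the largest probability per patient
table(modal_class)
```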
Trust ranks and their bootstrapped 95% CIs, according to both methods consid-
ered, are summarised in Table 7.5; a low ranking value indicates a better survival
rate than expected. Differences in the median rank of Trust performance between
the MLLC model approach and the Trust SMRs are well within their estimated 95%
CIs. Figure 7.2 provides a graphical representation of these results, in order of
increasing median probability of belonging to the best survival Trust class for the
MLLC methodology.
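The chapter does not state the bootstrap scheme used; the sketch below shows one simple (parametric) way of obtaining bootstrap intervals for SMR-based ranks, with entirely hypothetical Trust counts.

```r
## Minimal sketch: bootstrap confidence intervals for Trust ranks based on SMRs.
set.seed(6)
trusts <- data.frame(obs = rpois(19, 400),         # observed deaths (hypothetical)
                     exp = runif(19, 350, 450))    # expected deaths (hypothetical)

boot_ranks <- replicate(1000, rank(rpois(nrow(trusts), trusts$obs) / trusts$exp))

data.frame(trust  = 1:19,
           median = apply(boot_ranks, 1, median),
           lower  = apply(boot_ranks, 1, quantile, probs = 0.025),
           upper  = apply(boot_ranks, 1, quantile, probs = 0.975))
```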
For the final model, patient CE was 35% and Trust CE was 17%. The large
patient-level CE indicates that patient classes are more a ‘distilled’ classification
of patient traits than well-defined subgroups or subtypes of individuals. The Trust-level CE indicates that Trust classes also reflect shared traits, though it is feasible that some Trust classes comprise a more distinct subgroup of Trusts.
Table 7.4 Deaths within 3 years, by stage, in each of the 2-patient classes for the 2-patient,
2-Trust multilevel latent class regression model
Modal class 1, died within 3 yrs Modal class 2, died within 3 yrs
Stage at diagnosis No Yes No Yes
A 1,099 550 1,210 0
B 0 1,955 4,829 0
C 0 2,736 3,437 0
D 437 3,202 0 1,962
X 413 2,360 359 91
TOTAL 1,949 10,803 9,835 2,053
Table 7.5 Trust ranks from the multilevel latent class model and the calculation of Trust SMRs
Median rank (95% CI)
Trust Median probability of belonging to best survival Trust class ML LC SMR
1 1.000 1 (1–9.5) 6 (2–11)
2 0.999 3 (1–11) 4 (1–10.5)
3 0.997 4 (1–11) 3 (1–10.5)
4 0.996 4 (1–15) 8 (3–14.5)
5 0.993 5 (1–12.5) 5 (1–13)
6 0.956 8 (2–16) 9 (2–17)
7 0.912 9 (3–17) 5 (1–17)
8 0.908 9 (2–17) 6 (1–18)
9 0.897 9 (3–18) 5 (1–18)
10 0.816 10 (3–17) 8 (1–18)
11 0.575 11 (3.5–18) 11 (3–17)
12 0.476 13 (5.5–18) 12.5 (3–18)
13 0.372 12 (4–18.5) 11.5 (5.5–17)
14 0.359 12 (3–19) 12 (7–17)
15 0.152 14 (5.5–19) 15 (4.5–18)
16 0.070 14 (4–19) 13 (7–18)
17 0.070 15 (7.5–19) 16 (7.5–18)
18 0.003 18 (7–19) 15 (10–18)
19 0.002 18 (13.5–19) 19 (18–19)
7.5.10 Discussion
The simplest multilevel latent class model, where the continuous latent variable at
the upper level is replaced by a categorical latent variable, estimates the mean
outcome for each Trust class and the size of each Trust class (summation of Trust
probabilities for each Trust class) with no assumptions made regarding the distribu-
tion of means or class sizes. The upper-level discrete latent variable allows for
individual Trusts to be assigned probabilistically to the discrete latent trust classes,
[Figure 7.2: plotted series – ML LC analysis; calculation of SMR]
Fig. 7.2 Trust Median Ranks and 95% Confidence Intervals, ordered by the multilevel latent class
(ML LC) analysis
providing less restricted weighting of Trust relative performance than the standard
normal assumption. This likely improves the accuracy of estimated patient diff-
erences across Trust classes, which in turn improves the estimated patient casemix
adjustment for individual Trusts. The multilevel latent-class model is also more
likely to capture contextual effects due to inherent data hierarchy than by merely
estimating Trust ranks according to their SMRs.
Continuous and discrete latent variables, if combined, may prove more parsimo-
nious, with variation within each Trust class captured by the continuous latent
variable, potentially leading to fewer Trust classes to describe efficiently the overall
Trust-level variation. If determination of Trust ranks is important, however,
the estimation of Trust outcomes is more straightforward if the categorical latent
variable only is used at the Trust level, as this avoids having to derive the normally
distributed random effects within each Trust class. Improvements in patient
casemix modelling might be feasible were more patient variables considered,
but this could incorporate incomplete data, which can cause bias. Within a
latent-class framework the uncertainty surrounding unrecorded or unused patient
characteristics is modelled explicitly – with ‘fuzzy’ matching.
In fixing patient-level latent-class composition and modelling patient casemix
differences, the residual Trust-class differences in outcome reflect variations in
Trust performance. This paves the way for the analysis of treatment centre
characteristics (in addition to patient casemix characteristics), whereby differences
in the patient pathway of care are modelled explicitly to evaluate organisational
Multilevel latent class analysis has considerable utility. It improves upon standard
multilevel models by yielding better fit and providing enhanced insight. The
introduction of latent classes allows for investigation into how the risk factors
considered are associated with survival within patient classes and within Trust
classes, rather than for all patients and across all Trusts.
References
Adams, J., White, M., & Forman, D. (2004). Are there socioeconomic gradients in stage and grade
of breast cancer at diagnosis? Cross sectional analysis of UK cancer registry data. BMJ, 329
(7458), 142.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement error in
nonlinear models (2nd ed.). London: Chapman & Hall.
Coleman, M., Babb, P., Damiecki, P., et al. (1999). Cancer survival trends in England and Wales,
deprivation and NHS region (SMPS, Vol. 61, pp. 119–212). London: The Stationery Office.
Davy, M. (2007). Socio-economic inequalities in smoking: An examination of generational trends
in Great Britain. Health Statistics Quarterly, 34, 26–34.
Downing, A., Harrison, W. J., West, R. M., Forman, D., & Gilthorpe, M. S. (2010). Latent class
modelling of the association between socioeconomic background and breast cancer survival
status at 5 years whilst incorporating stage of disease. Journal of the Epidemiology and
Community Health, 64(9), 772–776. Published Online First: 19 August 2009.
Dukes, C. E. (1949). The surgical pathology of rectal cancer. Journal of Clinical Pathology, 2,
95–98.
Duncan, C., Jones, K., & Moon, G. (1999). Smoking and deprivation: Are there neighbourhood
effects? Social Science & Medicine, 48(4), 497–505.
Fuller, W. A. (1987). Measurement error models. New York: Wiley.
Greenwood, D. C., Gilthorpe, M. S., & Cade, J. E. (2006). The impact of imprecisely measured
covariates on estimating gene-environment interactions. BMC Medical Research Methodol-
ogy, 6, 21.
Hernandez-Diaz, S., Schisterman, E. F., & Hernan, M. A. (2006). The birth weight “paradox”
uncovered? American Journal of Epidemiology, 164(11), 1115–1120.
Ionescu, M. V., Carey, F., Tait, I. S., & Steele, R. J. (1998). Socioeconomic status and stage at
presentation of colorectal cancer. Lancet, 352(9138), 1439.
James, W. P., Nelson, M., Ralph, A., & Leather, S. (1997). Socioeconomic determinants of health.
The contribution of nutrition to inequalities in health. BMJ, 314(7093), 1545–1549.
Kirkwood, B., & Sterne, J. (2003). Medical statistics (2nd ed.). Oxford: Blackwell.
Kogevinas, M., Marmot, M. G., Fox, A. J., & Goldblatt, P. O. (1991). Socioeconomic differences
in cancer survival. Journal of Epidemiology and Community Health, 45, 216–219.
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS – A Bayesian modelling
framework: Concepts, structure and extensibility. Statistics and Computing, 10, 325–337.
Lyratzopoulos, G., Sheridan, G. F., Michie, H. R., McElduff, P., & Hobbiss, J. H. (2004). Absence
of socioeconomic variation in survival from colorectal cancer in patients receiving surgical
treatment in one health district: Cohort study. Colorectal Disease, 6(6), 512–517.
Macdonald, L., Cummins, S., & Macintyre, S. (2007). Neighbourhood fast food environment and
area deprivation-substitution or concentration? Appetite, 49(1), 251–254.
Morris, E. J. A., Maughan, N. J., Forman, D., & Quirke, P. (2007a). Identifying stage III colorectal
cancer patients: The influence of the patient, surgeon and pathologist. Journal of Clinical
Oncology, 25(18), 2573–2579.
Morris, E. J. A., Maughan, N. J., Forman, D., & Quirke, P. (2007b). Who to treat with adjuvant
therapy in Dukes B/Stage II colorectal cancer? The need for high quality pathology. Gut, 56,
1419–1425.
Muthén, L. K., & Muthén, B. O. (2007). Mplus user’s guide (5th ed.). Los Angeles: Muthén
& Muthén.
National Institute of Clinical Excellence. (2004). Guidance on cancer services: Improving
outcomes in colorectal cancers – Manual update. London: National Institute of Clinical
Excellence.
Noble, M., Wright, G., Dibben, C., et al. (2004). The English indices of deprivation 2004 (revised)
(p. 2004). Yorkshire: Office of the Deputy Prime Minister.
Pearl, J. (2000). Causality: Models, reasoning and inference. Cambridge: University Press.
Pollock, A. M., & Vickers, N. (1997). Breast, lung and colorectal cancer incidence and survival in
South Thames Region, 1987–92: The effect of social deprivation. Journal of Public Health
Medicine, 19, 288–294.
Quirke, P., & Morris, E. (2006). Reporting colorectal cancer. Histopathology, 50, 103–112.
Robinson, W. S. (1950). Ecological correlations and the behaviour of individuals. American
Sociological Review, 15, 351–357.
Schrijvers, C. T. M., Mackenback, J. P., Lutz, J.-M., Wuinn, M. J., & Coleman, M. P. (1995).
Deprivation, stage at diagnosis and cancer survival. International Journal of Cancer, 63,
324–329.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Stigler, S. M. (1999). Statistics on the table. Cambridge: Harvard University Press.
Townsend, P., Phillimore, P., & Beattie, A. (1988). Health and deprivation: Inequality and the
North. London: Croom Helm.
Tu, Y. K., West, R., Ellison, G. T., & Gilthorpe, M. S. (2005). Why evidence for the fetal origins
of adult disease might be a statistical artifact: The “reversal paradox” for the relation
between birth weight and blood pressure in later life. American Journal of Epidemiology,
161(1), 27–32.
UKACR Quality and Performance Indicators 2008: Final [online]. (2008). [cited 2-2-2009] [last
updated 2008]. Available from http://82.110.76.19/quality/UKACR%20report2008_final.pdf.
Venables, W. N., & Smith, D. M. (1990). An introduction to R [online]. Austria: The R Foundation. Available from http://www.r-project.org/.
Vermunt, J. K., & Magidson, J. (2005a). Latent GOLD 4.0 User’s Guide. Belmont, Massachusetts:
Statistical Innovations Inc.
Vermunt, J. K., & Magidson, J. (2005b). Technical guide for Latent GOLD 4.0: Basic and
advanced. Belmont, Massachusetts: Statistical Innovations Inc.
World Health Organisation. (2005). The International Statistical Classification of Diseases and
Health Related Problems ICD-10: Tenth Revision (2nd ed.). Geneva: World Health
Organisation.
Wrigley, H., Roderick, P., George, S., Smith, J., Mullee, M., & Goddard, J. (2003). Inequalities in
survival from colorectal cancer: A comparison of the impact of deprivation, treatment, and host
factors on observed and cause specific survival. Journal of Epidemiology and Community
Health, 57(4), 301–309.
Chapter 8
Bayesian Bivariate Disease Mapping
8.1 Introduction
In the standard log-linear model for area-level disease counts, the logarithm of the mean of the Poisson distribution is specified as the sum of two parts: the logarithm of the expected number of cases, which is just an offset term, and the logarithm of the underlying risk (McCullagh and Nelder 1989). The expected number of cases is typically
obtained from the age-gender specific incidence rates for the entire region under
study. However, when considering how risk estimates differ across areas, especially
if the disease of interest is rare (such as the case with childhood cancer or diabetes
where incidence rates are of the order 3–20 per 100,000 person years) then it is not
uncommon to observe areas which exhibit zero counts. Similarly, if the population
of a rural area is extremely sparse with an associated expected disease count of just
greater than zero, we may actually observe at least one case. In these scenarios, our
estimate of disease risk – the standardised incidence (or mortality) ratio SIR
(or SMR) – for certain areas will be zero or artificially inflated, thus our inferences
concerning epidemiology will be seriously flawed.
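As a minimal illustration of this instability, the following sketch simulates observed counts for a handful of hypothetical areas whose true relative risk is exactly 1 and computes the crude SIR for each; the expected counts are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical expected cases E_i per area; true relative risk is 1 everywhere.
expected = np.array([0.2, 0.5, 1.0, 5.0, 20.0, 100.0])
observed = rng.poisson(expected)          # observed cases O_i
sir = observed / expected                 # crude SIR_i = O_i / E_i

for e, o, s in zip(expected, observed, sir):
    print(f"E = {e:6.1f}  O = {o:3d}  SIR = {s:5.2f}")
# Areas with tiny expected counts return SIRs of 0 or several-fold excesses
# purely by chance, even though every true relative risk is 1.
```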
The rest of the chapter will cover an introduction to Bayesian smoothing
approaches, where we will consider multivariate spatial models when we wish
to model more than one disease simultaneously. An example investigating
the epidemiological similarities and correlation between childhood leukaemia
and Type 1 diabetes (T1D) over small areas will be used to illustrate these
techniques, where we compare classical and Bayesian approaches to spatial
disease modelling.
The rationale behind using Bayesian techniques to underpin disease mapping is that
for any given area i say, all neighbouring areas j are likely to share similar
environmental exposures and therefore one would expect disease rates and RR
estimates for area i to resemble those of all adjacent areas. The Bayesian approach
uses this principle to borrow strength or information from neighbouring areas to
provide more robust risk estimates for each area within the study region of interest.
This also overcomes the problem of relying on the unlikely assumption that disease
risks are independent across geographical areas, a concept which is difficult to
justify when there may be significant evidence of clustering or extra-Poisson
variation.
Bayesian ‘spatial’ smoothing is traditionally used to refer to RR estimates
which have been derived according to the local distribution of RR in areas
which are close or adjacent to one another. This is in contrast to ‘non-spatial’
smoothing which uses the global distribution of RR for all areas within the study
region (Clayton and Kaldor 1987). A further advantage of spatial smoothing
techniques is the ability to remove or reduce the effect of arbitrary geographical
boundaries, since geo-political areas are unlikely to be related to the disease of
interest. Thus, any artefactual variation exhibited in the data by methods of data
aggregation is reduced.
A conventional way of applying spatial modelling was set out by Besag et al. (1991)
within the context of image analysis. The principles which underpin Besag’s
statistical model allow us to differentiate between the relative contribution of the
spatial and non-spatial effects on disease risk. The non-spatial or heterogeneity
random effects appear in the model as extra-Poisson variation and arise through the
variation among the underlying populations at risk due to omitted covariates. The
spatial random effects control for unmeasured spatial covariates that are similar across close or adjacent geographical areas.
The model is defined such that the observed disease counts O_i in each area i, with associated expected counts E_i, are assumed to follow a Poisson distribution, i.e. O_i ~ Poisson(E_i RR_i) for i = 1, ..., N areas, where RR_i is the relative risk of disease in area i. The maximum likelihood estimate of the relative risk of disease in area i equals O_i / E_i, which is the SIR_i.
We might wish to extend this model by including area-level covariates such as
socio-economic status, whilst accounting for both the spatially structured and
unstructured (heterogeneity) random effects for the relative risks across areas.
Referring to the model of Besag et al. (1991), the logarithm of the RR for each area i is modelled so that:

log(RR_i) = α_0 + β^T x_i + u_i + v_i    (8.1)

where α_0 represents the intercept of the log relative risk; x_i represents the covariate vector for each area, with associated parameter vector β; the u_i represent the independent heterogeneity effects between areas (Clayton and Kaldor 1987) and are synonymous with extra-Poisson variation; and the v_i represent the spatially dependent random effects, which may be defined by a range of different structures describing adjacency or closeness in space. This class of models is referred to as convolution models, and generally we may define a normal prior distribution (Besag et al. 1991) for the non-spatial heterogeneity effects such that u_i ~ Normal(0, σ_u²).
We may assume that the spatially correlated random effects v_i arise through a combination of independent random effect errors e_i that are normally distributed, i.e. e_i ~ Normal(0, σ_e²), as set out by Langford et al. (1999) and Leyland et al. (2000). Here we assume that the components v_i may be written as

v_i = (1/n_i) Σ_{j ∈ Y_i} e_j    (8.2)

where Y_i represents the set of areas sharing a common boundary with area i, and n_i denotes the number of neighbours of area i. Thus, through the averaging of the independent random effects, e_j defines the effect of area j on the disease risk in area i.
In this context Eq. 8.1 effectively becomes a multilevel model, with each area i occurring at the second level and its neighbours in Y_i at the third level, the first level being the observed disease incidence. Multilevel modelling concepts are covered in more detail in Chaps. 5 and 7. Alternatively, instead of the adjacency model defined in the example above, the distance between the centroids of two areas could be used to weight the combined effect of the neighbouring areas.
In the preceding construction, the total variation in disease risk (8.1) for each area i is the sum of the variances of the heterogeneity and spatial effects and depends on the number of neighbours n_i, i.e.

Var(log RR_i) = σ_u² + σ_e²/n_i    (8.3)
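A minimal sketch of this convolution structure is given below, assuming a small hypothetical adjacency list and arbitrary variance and covariate values; it constructs the spatial effect of Eq. 8.2 as the average of the neighbours' independent errors and reports the total variance implied by Eq. 8.3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical adjacency structure for 5 areas (area index: list of neighbours).
neighbours = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1], 4: [2]}

sigma_u, sigma_e = 0.2, 0.3                       # assumed sd of u_i and e_i
alpha0, beta = -0.1, 0.5                          # illustrative fixed effects
x = rng.normal(size=5)                            # illustrative area covariate

u = rng.normal(0.0, sigma_u, size=5)              # unstructured effects u_i
e = rng.normal(0.0, sigma_e, size=5)              # independent errors e_i
# Spatial effect v_i: average of the neighbours' independent errors (Eq. 8.2)
v = np.array([e[neighbours[i]].mean() for i in range(5)])

log_rr = alpha0 + beta * x + u + v                # convolution model (Eq. 8.1)
print(np.round(log_rr, 3))

# Variance implied by Eq. 8.3 depends on the number of neighbours n_i
for i in range(5):
    n_i = len(neighbours[i])
    print(f"area {i}: n_i = {n_i}, "
          f"Var(log RR_i) = {sigma_u**2 + sigma_e**2 / n_i:.4f}")
```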
Empirical Bayes methods for disease mapping were originally developed and
described more than 20 years ago by Clayton and Kaldor (1987). Empirical
Bayes methods rely on the conditional distributions given the overall observed
data and hyperparameters that are estimated marginally from the data using
maximum likelihood (ML) or iterative generalised least squares (IGLS) methods.
Thus, these hyperparameters are simply entered into the estimation of the random
effect means. Compared to a fully hierarchical Bayesian approach using MCMC-based methods, such as Gibbs sampling (Smith 1993), ML-based estimation methods have the advantage of being fast to compute and of making convergence of the final estimates easier to assess. However, since ML methods do not
account for the uncertainty in the hyperparameters, the variability in the random
effects is underestimated, which might lead to erroneous inferences.
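The following sketch illustrates the flavour of empirical Bayes smoothing with a single global (non-spatial) prior, assuming a gamma prior for the relative risks and using a crude method-of-moments fit in place of the marginal ML or IGLS estimation described above; all counts are invented for illustration.

```python
import numpy as np

def eb_smooth_sir(observed, expected):
    """Gamma-Poisson (Clayton-Kaldor style) global smoothing of SIRs.

    Minimal sketch: the gamma prior RR ~ Gamma(a, b) is fitted by a crude
    method of moments rather than marginal maximum likelihood; the posterior
    mean of RR_i is then (O_i + a) / (E_i + b).
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    crude = observed / expected
    m = observed.sum() / expected.sum()                       # prior mean
    s2 = max(np.average((crude - m) ** 2, weights=expected), 1e-8)
    b = m / s2                                                # prior rate
    a = m * b                                                 # prior shape
    return (observed + a) / (expected + b)                    # shrunk SIRs

obs = [0, 2, 1, 30, 95]
exp_ = [0.4, 0.9, 2.0, 25.0, 100.0]
print(np.round(eb_smooth_sir(obs, exp_), 2))
# Areas with small expected counts are pulled strongly towards the global mean.
```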
However, advances in computing power mean that MCMC techniques have now largely superseded the empirical Bayes approach for describing and modelling disease rates across geographical areas. The two main advantages of the fully Bayesian approach are that prior information can be incorporated into the modelling process and that the full posterior distribution can be derived. Further details of MCMC techniques can be found in Gilks et al. (1996) and are also covered in Chap. 9.
We illustrate the methods outlined in this chapter using an example taken from a
paper describing the epidemiological similarities and spatial correlation between
acute lymphoblastic leukaemia (ALL) and T1D (Feltbower et al. 2005).
We extracted data on children aged under 15 and diagnosed with ALL and Type 1
diabetes (T1D) between 1986 and 1998 from two population-based disease registers
covering the former Yorkshire Regional Health Authority in the north of the United
Kingdom (UK) (Feltbower et al. 2003; McKinney et al. 1998). The registers cover a
geographical area of 12,000 km2 and a childhood population of 700,000. We
limited the case series to a period centred at the time of the 1991 national UK
census to ensure the inclusion of relevant socio-demographic denominator data.
ALL and T1D were chosen as there is growing epidemiological evidence
suggesting that both diseases may be linked to infectious exposure (Greaves
1997; Parslow et al. 2001; Feltbower et al. 2004).
Patients’ addresses and postcodes (equivalent to zip codes) at the time of
diagnosis were validated and linked to one of 532 Electoral Wards in existence in
Yorkshire, UK, at the time of the 1991 UK Census. These small geographical areas
have a median childhood population count of 750 (interquartile range 400–2,030).
Population estimates from the 1991 UK Census were used to calculate age-sex
standardised incidence rates.
8.3.3 Covariates
We compared separately the risk for both diseases from three socio-demographic
factors previously linked to disease onset, measured at Ward level. These included:
(i) population mixing, measured using the Shannon index (Stiller and Boyle 1996;
Parslow et al. 2001), describing the diversity of origins of incomers into each
Ward for the childhood population (ages 0–14); (ii) person-based childhood popu-
lation density (Parslow et al. 2001), which is a population weighted average of
population density (persons per hectare), and more appropriate for investigating
infectious aetiology as it reflects the population density at which a typical person
lives; and (iii) deprivation, measured using the Townsend Score (Townsend et al.
1988) which was derived from the following census variables: unemployment,
household overcrowding, car ownership and housing tenure.
Incidence rate ratios (IRR) and 95% confidence (credible) intervals are
presented according to categories used in previous epidemiological publications
for comparative purposes (Parslow et al. 2001; Stiller and Boyle 1996) and are
based on the rankings of the values across all Wards for each covariate. They were
defined as follows:
• Population mixing: <10th percentile, 10th–90th percentiles (reference group)
and >90th percentile. This grouping enables effects to be detected at the
extremes of the range, as there is little variation in the value of the Shannon
Index for the majority of Wards.
• Population density: three equal groups in size of Wards (lowest density taken as
the reference group).
• Deprivation: five equal groups in size of Wards (least deprived taken as the
reference group).
For each disease, a Poisson regression model was fitted to the observed numbers
of cases in each Ward using the log of the number of expected cases as the offset
derived from age-sex specific incidence rates for Yorkshire between 1986 and 1998.
This was implemented within a classical framework. All three socio-demographic
variables (population density, mixing and deprivation) were included separately in
the Poisson model, whilst no other confounding factors were added to this initial
model. The effect on SIRs of including all three covariates was then assessed.
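As an illustration of this classical step, the sketch below fits a Poisson regression with the log expected count as an offset, using simulated ward-level data and the statsmodels library; the covariate, sample size and all numerical values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Hypothetical ward-level data (names and values are illustrative only).
n_wards = 200
expected = rng.uniform(0.5, 5.0, n_wards)           # expected cases per ward
deprivation = rng.normal(size=n_wards)               # standardised covariate
true_rr = np.exp(0.1 + 0.2 * deprivation)
observed = rng.poisson(expected * true_rr)

# Classical Poisson regression of observed counts with log(expected) as the
# offset, so the linear predictor models the log relative risk.
X = sm.add_constant(deprivation)
model = sm.GLM(observed, X, family=sm.families.Poisson(),
               offset=np.log(expected))
res = model.fit()
print(np.exp(res.params))   # exponentiated coefficients: baseline RR and IRR
```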
We modelled the two disease counts jointly, examining the effects of each covariate using Bayesian spatial and non-spatial smoothing. By extending Eq. 8.1 to the case of two diseases, we can denote the disease-specific RR as:

log(RR_ih) = α_0h + β_h^T x_i + u_ih + v_ih,   h = 1, 2

where α_0h is the intercept of the log relative risk for disease h (h = 1, 2) in ward i; x_i is a covariate vector with corresponding parameter vector β_h; u_i1 and u_i2 are the independent unstructured random effects (representing global smoothing); and v_i1 and v_i2 are the spatially structured random effects (representing local smoothing).
We assume the four random effect terms u_i1, u_i2, e_i1, and e_i2 arise from a multivariate normal distribution with zero mean vector and covariance matrix D (Langford et al. 1999; Leyland et al. 2000), though we chose to adopt a hierarchical Bayesian approach (Congdon 2003) rather than use IGLS estimation. All fixed effect parameters were given vague but proper Normal(0, 1,000) prior distributions. For the covariance matrix D describing the four random effects, a sensitivity analysis was performed, consisting of both informative and vague specifications for the scale matrix in the parameterisation of the Wishart distribution for the precision matrix D^-1.
Posterior estimation of all the model parameters was carried out using the Gibbs
sampling algorithm implemented in the software package WinBUGS (Lunn et al.
2000). The variance and covariance between the spatial effects vi1 and vi2 and the
total risk variation (ui1 + vi1) and (ui2 + vi2) were computed empirically at each
iteration of the Gibbs sampler.
For each model considered, three parallel Gibbs sampler chains from indepen-
dent starting positions were run for 50,000 updates. All fixed effects and covariance
parameters were monitored for convergence to stationary distributions. Trace plots
of sample values of each of these parameters showed that they were converging to
the same distributions. A burn-in period of 15,000 updates was used as convergence
of the three chains was shown to have been reached after this period had
elapsed, since Gelman-Rubin reduction factors (Gelman and Rubin 1992) were all
estimated near 1.0. For posterior inference, we used a combined sample of the
remaining 35,000 iterations. Finally, the effect on the degree of spatial correlation
between both diseases was examined with and without adjustment for each socio-
demographic factor previously linked to the spatial distribution of disease incidence.
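The Gelman-Rubin reduction factor used above to judge convergence can be computed from the parallel chains themselves; a minimal sketch for a single scalar parameter, assuming the post-burn-in draws from each chain are stored as rows of an array, is as follows.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    `chains` is an (m, n) array: m parallel chains of n post-burn-in draws.
    Minimal sketch of the Gelman-Rubin (1992) diagnostic; values near 1.0
    suggest the chains are sampling from the same distribution.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
draws = rng.normal(size=(3, 35000))          # three well-mixed chains
print(round(gelman_rubin(draws), 3))         # close to 1.0
```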
Numerous diagnostic tests have been developed such as the Deviance Informa-
tion Criterion (DIC), which is a natural extension of the Akaike Information
Criterion (AIC) derived from the chains produced by the MCMC run. The DIC is
a composite assessment combining overall model fit with model complexity, penalising additional parameters to encourage parsimony.
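The sketch below illustrates how the DIC can be computed from MCMC output for a Poisson disease-mapping model, assuming the sampled log relative risks are stored iteration by iteration; the posterior draws shown are simulated rather than taken from the models in this chapter.

```python
import numpy as np
from scipy import stats

def dic_poisson(obs, expected, log_rr_draws):
    """DIC for a Poisson disease-mapping model, from MCMC output.

    `log_rr_draws` is an (iterations, areas) array of sampled log relative
    risks. Minimal sketch: deviance D = -2 log-likelihood, pD = mean deviance
    minus the deviance at the posterior mean, DIC = mean deviance + pD.
    """
    mu_draws = expected * np.exp(log_rr_draws)              # Poisson means
    dev_draws = -2 * stats.poisson.logpmf(obs, mu_draws).sum(axis=1)
    mu_at_mean = expected * np.exp(log_rr_draws.mean(axis=0))
    dev_at_mean = -2 * stats.poisson.logpmf(obs, mu_at_mean).sum()
    d_bar = dev_draws.mean()
    p_d = d_bar - dev_at_mean
    return d_bar + p_d, p_d

rng = np.random.default_rng(3)
obs = np.array([2, 0, 5, 1])
expd = np.array([1.5, 0.8, 4.0, 1.2])
draws = rng.normal(0.0, 0.1, size=(1000, 4))                # fake posterior draws
dic, p_d = dic_poisson(obs, expd, draws)
print(f"DIC = {dic:.1f}, pD = {p_d:.1f}")
```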
A total of 299 children with ALL and 1,551 with T1D were included in the dataset. Figures 8.1
and 8.2 showing the spatially smoothed SIR illustrate the variation in disease rates
across wards, especially in the South-Eastern part of the region below the Humber
estuary. Lower rates of ALL and T1D were seen in the more urban county of West
Yorkshire, whereas higher disease rates were observed in the more rural county of
North Yorkshire. The median (and interquartile range) for the number of cases
distributed across all 532 Wards was 0 (0–1) and 2 (1–4) for ALL and T1D
respectively.
Table 8.1 shows the unadjusted and adjusted IRRs for each covariate, for ALL and
T1D separately. Generally, we infer higher rates of ALL and T1D in areas of low
population mixing; however, in areas with high mixing, significantly lower rates of
ALL were observed, although no similar association in incidence was seen for
diabetes.
An inverse association was present for population density for each condition
with lower rates associated with higher levels of population density. However, this
association disappeared for T1D and was reversed for ALL once the effects from
population mixing and deprivation were taken into consideration. There was some
evidence of a negative association between deprivation and diabetes, although no
clear relationship was evident between deprivation and ALL.
Although all three variables included in the model were positively correlated, we
saw no evidence of multi-collinearity. Variance inflation factors were all less than
2.5. Population mixing exhibited the least degree of correlation of any of the
Fig. 8.1 Spatially smoothed standardised incidence ratios for childhood type 1 diabetes diagnosed
between 1986 and 1998 across electoral wards in Yorkshire, UK
Fig. 8.2 Spatially smoothed standardised incidence ratios for childhood acute lymphoblastic
leukemia diagnosed between 1986 and 1998 across electoral wards in Yorkshire, UK
Table 8.1 Incidence rate ratios (IRR) and 95% Confidence Intervals (CI) derived using the
classical (frequentist) approach for Type 1 diabetes
Unadjusted estimates    Adjusted^a estimates
Covariate IRR 95% CI IRR 95% CI
Population density
Low 1.00 1.00
Medium 0.92 0.77–1.09 1.01 0.83–1.23
High 0.75 0.63–0.88 0.95 0.76–1.19
Population mixing (Shannon Index)
<10th percentile 1.50 1.11–2.02 1.29 0.94–1.78
10th–90th percentile 1.00 1.00
>90th percentile 1.01 0.88–1.16 0.94 0.82–1.08
Townsend deprivation score
1 Least deprived 1.00 1.00
2 1.05 0.86–1.29 1.07 0.87–1.32
3 1.03 0.84–1.26 1.06 0.87–1.30
4 0.86 0.71–1.04 0.90 0.73–1.12
5 Most deprived 0.76 0.63–0.91 0.81 0.65–1.02
^a Each covariate is adjusted for each of the other two covariates
covariates; for example, areas with high levels of mixing had an equal number of
areas in the medium and highest population density categories. A graphical com-
parison between the observed counts and predicted counts from a simulated model
showed good symmetry for each disease.
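The multicollinearity check referred to above can be carried out with variance inflation factors; the sketch below does this for three correlated, hypothetical ward-level covariates using the statsmodels implementation.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(11)

# Hypothetical ward-level covariates (positively correlated, as in the text).
n = 532
mixing = rng.normal(size=n)
density = 0.4 * mixing + rng.normal(size=n)
deprivation = 0.3 * mixing + 0.3 * density + rng.normal(size=n)

X = sm.add_constant(np.column_stack([mixing, density, deprivation]))
names = ["const", "mixing", "density", "deprivation"]
for i, name in enumerate(names[1:], start=1):
    print(name, round(variance_inflation_factor(X, i), 2))
# VIFs well below the usual rules of thumb (e.g. < 2.5) suggest
# multicollinearity is not a concern.
```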
By modelling the effects of both disease counts together as a bivariate outcome, and
assuming dependent random effects between diseases with no adjustment for
covariates, we found that 50% of the variation occurred through the spatial compo-
nent for diabetes and ALL, with the remainder occurring through heterogeneity
effects. A modest degree of positive spatial correlation of 0.33 (95% CI -0.20 to 0.74) was found between the diseases.
Compared to the classical univariate model (Tables 8.1 and 8.2), the parameter
estimates largely remained the same after allowing for dependent random effects
and the contribution of each covariate on its own (Table 8.3). The spatial correlation
between diseases fell from 0.33 to 0.18 (95% CI -0.62 to 0.82), 0.14 (-0.50 to 0.78) and 0.06 (-0.59 to 0.69), respectively, after separately accounting for popu-
lation mixing, population density, and deprivation. Adding the spatial component
of variation into a model already containing the heterogeneity component signifi-
cantly improved model fit using the DIC.
After adjusting for all three covariates simultaneously (Table 8.4), the spatial
correlation fell to 0.12 (-0.63 to 0.73), whilst the parameter estimates were similar
to the adjusted IRRs presented in Tables 8.1 and 8.2 from the classical approach.
Table 8.2 Incidence rate ratios (IRR) and 95% CI derived using the classical (frequentist)
approach for acute lymphoblastic leukaemia
Unadjusted estimates    Adjusted^a estimates
Covariate IRR 95% CI IRR 95% CI
Population density
Low 1.00 1.00
Medium 0.92 0.62–1.38 1.14 0.72–1.79
High 0.83 0.56–1.23 1.21 0.73–2.03
Population mixing (Shannon Index)
<10th percentile 1.36 0.67–2.74 1.27 0.59–2.75
10th–90th percentile 1.00 1.00
>90th percentile 0.74 0.54–0.99 0.64 0.47–0.89
Townsend deprivation score
1 Least deprived 1.00 1.00
2 1.28 0.78–2.11 1.47 0.89–2.43
3 1.24 0.77–2.00 1.38 0.85–2.24
4 1.11 0.70–1.77 1.10 0.67–1.81
5 Most deprived 0.92 0.59–1.44 0.88 0.52–1.47
^a Each covariate is adjusted for each of the other two covariates
8.5 Discussion
In this chapter, we have shown that mapping the spatial distribution of diseases
using Bayesian smoothing techniques can help to both visualise and assess the level
of spatial and non-spatial variation across geographical areas. For instance, in our
example with ALL and diabetes in Yorkshire, UK, we observed lower rates of
disease in the more populated county of West Yorkshire and elevated rates in the
less populated county of North Yorkshire.
We were also able to test for evidence of a common environmental aetiology
between ALL and T1D by considering a bivariate outcome within a hierarchical
framework. The similarity in risk between diseases could be quantified across small areas.
Table 8.3 Fixed and random effects estimates (median and 95% credible intervals) from a Bayesian bivariate model with dependent errors
Population density^a    Population mixing^a    Deprivation^a
Diabetes Leukemia Diabetes Leukemia Diabetes Leukemia
Random effects
Heterogeneity 0.02 (0.00–0.04) 0.04 (0.01–0.11) 0.02 (0.00–0.04) 0.04 (0.01–0.12) 0.01 (0.00–0.04) 0.04 (0.01–0.08)
Spatial 0.01 (0.00–0.02) 0.06 (0.02–0.17) 0.02 (0.01–0.04) 0.06 (0.01–0.15) 0.01 (0.00–0.04) 0.05 (0.01–0.12)
Proportion of total variation 45% (13–79%) 59% (20–86%) 57% (18–85%) 57% (16–84%) 52% (15–87%) 56% (25–83%)
Spatial correlation 0.14 (-0.50 to 0.78) 0.18 (-0.62 to 0.82) 0.06 (-0.59 to 0.69)
Fixed effects
Population density
Low 1.00 1.00
Medium 0.93 (0.77–1.09) 0.99 (0.63–1.44)
High 0.76 (0.64–0.90) 0.88 (0.59–1.28)
Population mixing
<10th 1.47 (1.05–1.94) 1.34 (0.60–2.48)
10th–90th 1.00 1.00
>90th 1.03 (0.90–1.20) 0.74 (0.53–0.99)
Deprivation
1 (lowest) 1.00 1.00
2 1.06 (0.85–1.32) 1.37 (0.79–2.19)
3 1.04 (0.85–1.26) 1.30 (0.73–2.07)
4 0.86 (0.71–1.04) 1.17 (0.70–1.84)
5 (highest) 0.76 (0.64–0.91) 0.98 (0.59–1.57)
^a Models defined by the inclusion of each covariate separately without adjustment for any other fixed effects
Table 8.4 Fixed and random effects estimates (median and 95% credible intervals) from a
Bayesian bivariate model with dependent errors and all three covariates considered simultaneously
All covariates entered into the model as fixed effects
Diabetes Leukemia
Random effects
Heterogeneity 0.02 (0.00–0.06) 0.04 (0.01–0.10)
Spatial 0.01 (0.06–0.24) 0.05 (0.01–0.11)
Proportion of total variation 48% (9–85%) 52% (20–85%)
Spatial correlation 0.12 (95% CI -0.63 to 0.73)
Fixed effects
Population density
Low 1.00 1.00
Medium 1.02 (0.83–1.25) 1.18 (0.71–1.90)
High 0.97 (0.75–1.22) 1.29 (0.74–2.40)
Population mixing
<10th 1.30 (0.94–1.72) 1.32 (0.58–2.49)
10th–90th 1.00 1.00
>90th 0.95 (0.81–1.08) 0.65 (0.46–0.87)
Deprivation
1 (lowest) 1.00 1.00
2 1.08 (0.87–1.32) 1.53 (0.93–2.45)
3 1.06 (0.87–1.31) 1.43 (0.87–2.15)
4 0.91 (0.73–1.15) 1.17 (0.73–1.90)
5 (highest) 0.82 (0.65–1.01) 0.92 (0.55–1.48)
8.6 Conclusions
A number of books have appeared in recent times that are devoted to the field of
spatial epidemiology. In particular, Lawson et al. (2003) and Lawson (2008) cover
basic disease mapping and a review of multivariate disease mapping. Recently,
flexible multivariate models such as Generalised Hierarchical Multivariate Condi-
tional Autoregressive (GMCAR) and Order-free Coregionalized Lattice Models are
emerging and they offer a unified approach (Jin et al. 2005). They are based on
theoretical work on multivariate Gaussian Markov random fields (Mardia 1988).
In addition, generalized spatial structural equation models, which handle the case of
multivariate latent spatial factors, are being developed. This new class of models is versatile and practical, and can account for associations between different diseases within areal units as well as spatial association between areal units.
References
Besag, J., York, J., & Mollie, A. (1991). Bayesian image restoration, with two applications in
spatial statistics (with discussion). Annals of the Institute of Statistical Mathematics, 43, 1–75.
Clayton, D., & Kaldor, J. (1987). Empirical Bayes estimates of age-standardized relative risks for
use in disease mapping. Biometrics, 43, 671–681.
Congdon, P. (2003). Applied Bayesian models. Chichester: Wiley.
Dabney, A. R., & Wakefield, J. C. (2005). Issues in the mapping of two diseases. Statistical
Methods in Medical Research, 14, 83–112.
Feltbower, R. G., McKinney, P. A., Parslow, R. C., Stephenson, C. R., & Bodansky, H. J. (2003).
Type 1 diabetes in Yorkshire, UK: Time trends in 0–14 and 15–29 year olds, age at onset and
age-period-cohort modelling. Diabetic Medicine, 20, 437–441.
Feltbower, R. G., McKinney, P. A., Greaves, M. F., Parslow, R. C., & Bodansky, H. J. (2004).
International parallels in leukemia and diabetes epidemiology. Archives of Disease in Child-
hood, 89, 54–56.
Feltbower, R. G., Manda, S. O. M., Gilthorpe, M. S., Greaves, M. F., Parslow, R. C., Kinsey, S. E.,
Bodansky, H. J., & McKinney, P. A. (2005). Detecting small-area similarities in the epidemi-
ology of childhood acute lymphoblastic leukemia and diabetes mellitus, type 1: A Bayesian
approach. American Journal of Epidemiology, 161, 1168–1180.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulations using multiple sequences.
Statistical Science, 7, 457–472.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in
practice. London: Chapman and Hall.
Greaves, M. (1997). Etiology of acute leukemia. The Lancet, 349, 344–349.
Held, L., Natario, I., Fenton, S. E., Rue, H., & Becker, N. (2006). Towards joint disease mapping.
Statistical Methods in Medical Research, 14, 61–82.
Jin, X., Carlin, B. P., & Banerjee, S. (2005). Generalised hierarchical multivariate CAR models for
areal data. Biometrics, 6, 539–557.
Knorr-Held, L., & Best, N. G. (2001). A shared component model for detecting joint and selective
clustering of two diseases. Journal of the Royal Statistical Society A, 164, 73–85.
Langford, I. H., Leyland, A. H., Rasbash, J., & Goldstein, H. (1999). Multilevel modelling of the
geographical distributions of diseases. Journal of the Royal Statistical Society C, 48, 253–268.
Lawson, A. B. (2008). Bayesian disease mapping: Hierarchical modeling in spatial epidemiology.
Boca Raton: CRC Press.
Lawson, A. B., Browne, W. J., & Vidal Rodeiro, C. L. (2003). Disease mapping with WinBUGS
and MLwiN. London: Wiley.
Leyland, A. H., Langford, I. H., Rabash, J., & Goldstein, H. (2000). Multivariate spatial models for
event data. Statistics in Medicine, 19, 2469–2478.
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS – A Bayesian modelling
framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
Mardia, K. V. (1988). Multi-dimensional multivariate Gaussian Markov random fields with
application to image processing. Journal of Multivariate Analysis, 24, 265–284.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman
and Hall.
McKinney, P. A., Parslow, R. C., Lane, S. A., Lewis, I. J., Picton, S., Kinsey, S. E., & Bailey, C. C.
(1998). Epidemiology of childhood brain tumors in Yorkshire, UK 1974–1995: Changing
patterns of occurrence. British Journal of Cancer, 78, 974–979.
Parslow, R. C., McKinney, P. A., Law, G. R., & Bodansky, H. J. (2001). Population mixing and
childhood diabetes. International Journal of Epidemiology, 30, 533–538.
Smith, A. F. M. (1993). Bayesian computations via the Gibbs sampler and related Markov Chain
Monte Carlo methods. Journal of the Royal Statistical Society B, 55, 3–23.
Stiller, C. A., & Boyle, P. J. (1996). Effect of population mixing and socioeconomic status in
England and Wales, 1979–85, on lymphoblastic leukemia in children. BMJ, 313, 1297–1300.
Townsend, P., Phillimore, P., & Beattie, A. (1988). Health and deprivation: Inequality and the
North. London: Croom Helm.
Chapter 9
A Multivariate Random Frailty Effects
Model for Multiple Spatially Dependent
Survival Data
9.1 Introduction
for possible spatial dependence in hazard rates among clusters that are spatially
arranged (Banerjee et al. 2003). The latter may arise, for instance, through smaller variation in hazard rates among neighbouring, densely populated urban areas than among sparsely populated rural areas, or through similarities in the underlying cultural and traditional beliefs affecting the timing of events.
Thus, it becomes necessary to include both the effect of area under investigation
and the effect of surrounding areas in modelling spatially observed time-to-event
data. In this chapter, we focus on the ideas developed in Banerjee et al. (2003) by
using spatially correlated survival models for failure-time data which are spatially
arranged. However, rather than model the spatial dependence using a conventional
conditional autoregressive (CAR) normal model (Besag et al. 1991), we instead use
a multiple membership multiple classification (MMMC) model (Browne et al.
2001) to capture both the unstructured heterogeneity and spatially structured
random effects. Further, we consider an extension of the univariate spatially correlated survival model (the two-way spatial frailty effect model) to include multiple frailty effects of order 2 × K, where K is the number of possible failure-type events that can happen to a subject. The resulting 2 × K spatial random effects are modelled using a 2K-dimensional multivariate normal model (Leyland et al. 2000). By
incorporating information from all types of failure events, the resulting fixed and
random parameter estimates have improved efficiency. Furthermore, similarities and differences in the effects of risk factors across event types can be assessed (Manda et al. 2009), in addition to identifying event-specific risk factors, which would otherwise have been masked by well-known common factors (Manda and Leyland 2007). The methods
presented in this chapter are therefore somewhat similar to those in Chap. 8, where
we modelled joint aggregated count data.
The proposed multivariate frailty model for correlated survival data is illustrated
with an analysis of timing of first childbirth and timing of first marriage amongst
women aged between 15 and 49 years across health districts in South Africa.
We investigate differential patterns of early childbearing and marriage rates using
key covariates – age of woman, education, type of residence and race – while
accounting for possible variation in the hazard rates due to the effects of unobserved
and unmeasured covariates, which may induce spatial dependence in hazard rates
among women in the same health district. The excess fitted hazard risks are mapped
in order to highlight parts of the country with persistent excess hazard risks for early childbearing and marriage, thereby prompting in-depth epidemiological investigation of what could be driving the differences between districts.
Let N_ijk(t) denote the number of type k events that have occurred to subject j in cluster i by time t. In addition, a process Y_ijk(t), which indicates whether or not the subject was at risk for the event of type k at time t, is also observed. The intensity process λ_ijk(t) of the event of type k for subject ij is the product of the risk indicator and the event hazard function h_ijk(t), i.e. λ_ijk(t) = Y_ijk(t) h_ijk(t). We also measure a (possibly time-varying) p-dimensional vector of risk factors x_ij(t), where p is the number of risk factors being investigated. Thus, for subject ij, the observed data are D = {N_ijk(t), Y_ijk(t), x_ij(t); t ≥ 0, k = 1, ..., K} and are assumed independent. Furthermore, suppose dN_ijk(t) is the increment of N_ijk(t) in the infinitesimal interval [t, t + dt) and F_t are the available data just before time t, such that the increment dN_ijk(t) is constrained to take only the values 0 and 1. This constraint implies that the mean increase of N_ijk(t) during the infinitesimal interval [t, t + dt) is given by λ_ijk(t)dt = Pr(dN_ijk(t) = 1 | F_t).
The effect of the risk factors on the baseline intensity function of type k for subject ij at time t is given by the Cox proportional hazards model:

λ_ijk(t | λ_0k(t), β_k, x_ij(t), w_ik) = Y_ijk(t) λ_0k(t) exp(β_k^T x_ij(t) + w_ik)    (9.1)

The likelihood of the observed data is then proportional to:

∏_{i=1}^{I} ∏_{j=1}^{n_i} ∏_{k=1}^{K} ∏_{t ≥ 0} λ_ijk(t | λ_0k(t), β_k, x_ij(t), w_ik)^{dN_ijk(t)} exp(−λ_ijk(t | λ_0k(t), β_k, x_ij(t), w_ik) dt)    (9.2)
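A minimal sketch of evaluating the log of likelihood (9.2) is given below, under the simplifying assumptions that each subject experiences each event type at most once and that the baseline hazards are constant over the observation window; all data values and parameter values are invented for illustration.

```python
import numpy as np

def log_likelihood(times, events, x, w, beta, lam0):
    """Log of likelihood (9.2) under a constant baseline hazard.

    Minimal sketch: each subject can experience each event type at most once,
    `times` holds the event or censoring time per subject and event type,
    `events` holds the 0/1 event indicators, and the baseline hazard lam0[k]
    is constant, so the cumulative baseline hazard up to time t is lam0[k]*t.
    """
    times = np.asarray(times, float)       # shape (subjects, K)
    events = np.asarray(events, float)     # shape (subjects, K)
    lin_pred = x @ beta.T + w              # beta_k'x_ij + w_ik, shape (subjects, K)
    log_hazard = np.log(lam0) + lin_pred
    cum_hazard = lam0 * times * np.exp(lin_pred)
    return float((events * log_hazard - cum_hazard).sum())

# Tiny illustrative data set: 3 subjects, K = 2 event types, 2 covariates.
x = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
beta = np.array([[0.5, -0.2], [0.1, 0.3]])           # one row per event type
w = np.zeros((3, 2))                                  # frailty effects set to 0
times = np.array([[2.0, 3.5], [1.0, 1.0], [4.0, 0.5]])
events = np.array([[1, 0], [0, 1], [1, 1]])
print(log_likelihood(times, events, x, w, beta, np.array([0.1, 0.2])))
```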
Here we focus on models that account for possible spatial correlation in hazard among clusters that
are spatially arranged. In particular, we extend the spatial survival models as
described in Banerjee et al. (2003) to situations where individuals in a cluster can
experience multiple failure-time events of different types. The models considered
here could be termed multivariate spatially correlated survival models.
For two event types, the log-intensities are modelled as:

log λ_ij1(t) = log(Y_ij1) + log dΛ_01(t) + β_1^T x_ij(t) + u^(2)_1,district(i) + Σ_{l ∈ Neighbours(i)} w_l1 u^(3)_l1    (9.3a)

log λ_ij2(t) = log(Y_ij2) + log dΛ_02(t) + β_2^T x_ij(t) + u^(2)_2,district(i) + Σ_{l ∈ Neighbours(i)} w_l2 u^(3)_l2    (9.3b)
where the weights w_lk take the value 1/m_i if district l is a neighbour of district i and 0 otherwise, with m_i denoting the number of neighbours of district i. Thus, all districts that border a particular district form the neighbour classification for that district. The direct district effects u^(2)_k,district(i) are modelled as u^(2)_k,district(i) ~ Normal(0, σ²_uk(2)), and the neighbouring district effects u^(3)_lk as u^(3)_lk ~ Normal(0, σ²_uk(3)). Thus the spatially structured effect v_ik is normally distributed with mean 0 and variance σ²_uk(3)/m_i.
The covariance between the two disease-specific unstructured effects within a district is σ_u1u2(22), and that between the two disease-specific neighbour effects within a district is σ_u1u2(33)/m_i, from which the respective correlation coefficients are σ_u1(2)u2(2)/√(σ²_u1(2) σ²_u2(2)) and σ_u1(3)u2(3)/√(σ²_u1(3) σ²_u2(3)), the latter being a conditional correlation since the number of neighbouring districts has been omitted.
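The sketch below illustrates the multiple membership construction under a small hypothetical adjacency structure: each district receives its own direct effect plus the average (weights 1/m_i) of its neighbours' effects, so that the implied spatial variance is σ²_uk(3)/m_i; the adjacency list and standard deviations are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical district adjacency (district index: list of neighbours).
neighbours = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
n_districts = len(neighbours)
K = 2                                    # event types (e.g. childbirth, marriage)

sd_u2 = np.array([0.1, 0.3])             # sd of direct district effects, per type
sd_u3 = np.array([0.2, 0.5])             # sd of neighbouring-district effects

u2 = rng.normal(0.0, sd_u2, size=(n_districts, K))   # direct effects u^(2)
u3 = rng.normal(0.0, sd_u3, size=(n_districts, K))   # neighbour effects u^(3)

# Spatially structured effect: weighted sum of neighbours' effects with
# multiple-membership weights 1/m_i, so Var(v_ik) = sd_u3[k]**2 / m_i.
v = np.zeros((n_districts, K))
for i, nb in neighbours.items():
    v[i] = u3[nb].mean(axis=0)

total = u2 + v                           # frailty added to each log intensity
print(np.round(total, 3))
for i, nb in neighbours.items():
    print(f"district {i}: m_i = {len(nb)}, "
          f"implied Var(v_i1) = {sd_u3[0]**2 / len(nb):.3f}")
```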
Our modelling approach also allows the computation of relative contributions
of spatial and unstructured heterogeneity to the total variation of the random
effects. As the unstructured and structured variances are marginal and conditional, respectively, the relative contributions are, for ease of comparison, obtained empirically
using the sample values of the unstructured and spatial random effects (Feltbower
et al. 2005).
On the other hand, early marriage, in the absence of contraceptive use, increases
the risk of early childbearing. Thus, rising ages at first childbirth and at first
marriage are important domains of public health policy-making, as they both play
roles in fertility levels, maternal and child health, and women’s status in a society
(Kalule-Sabiti et al. 2007; Palamuleni 2011).
Improvements in social and economic conditions among women in South Africa, and in many parts of the world, have contributed to reductions in rates of teenage childbearing and marriage (Department of Health 2002;
Kalule-Sabiti et al. 2007). However, both rates have been shown to vary
according to women’s educational level, employment status, ethnicity, and
period of birth (Kalule-Sabiti et al. 2007). There is evidence of differences in
the rates by provinces, with the most economically advanced provinces
(Gauteng and Western Cape) having the lowest rates of childbearing, whilst the predominantly rural and underdeveloped provinces such as Limpopo, Eastern Cape and Mpumalanga have the highest childbearing rates (Statistics South Africa
2010). On the other hand, the more economically advanced provinces have higher
rates of early marriage than provinces that are less economically advanced
(Palamuleni et al. 2007).
Thus, an understanding of the factors, whether individual or ecological, affecting
the risks of early childbearing and marriage among women, especially in the sub-
Saharan Africa, could contribute to reductions in the maternal mortality ratios and
to increases in universal access to reproductive health services. A study carried out
in some sub-Saharan African countries showed that risk factors for first childbirth
and first marriage are similar (Lloyd and Mensch 2006), but they used separate
univariate proportional hazards models. Presently, there is a scarcity of research
studies in South Africa and the region investigating spatial variation in the rates
below the provincial level. One such study, carried out by Statistics South Africa
(2010), only described the district level observed spatial rates in fertility, which,
with all things being equal, is linked to timing of childbirth and marriage (Manda
and Meyer 2005; Palamuleni 2011). However, the Statistics South Africa study did
not undertake any modelling of dependence of the observed rates between districts.
Thus, the lack of studies investigating childbearing and marriage rates at levels below the province has adversely affected local government health-policy planning regarding maternal and reproductive health.
To the best of our knowledge, this is the first attempt to employ recent methodological advances in spatial modelling that account for multiple outcomes (see, for example, Chap. 8). We use joint spatial models to investigate
dependence structures between and within rates of timing of childbearing and
marriage. In the context of substantive issues within maternal and child health,
we investigate the spatial distributions of hazard of early childbearing and
marriage using spatially dependent models to account for the spatial correlation
of the two maternal health events. We are not aware of any previous work that
uses joint spatial hazards models to estimate geographical distribution of hazard
rates for multiple maternal health outcomes, at least within the sub-Saharan African
region.
We use data from the South African Demographic and Health Survey (SADHS) of
1998, which was a nationally representative probability sample of nearly 12,000
women between the ages of 15 and 49 years (Department of Health 2002). The
women were selected using a two-stage sampling design; using enumeration areas
(EAs) as primary sampling units and households as secondary sampling units. We
extracted the following women-level explanatory variables for use in the models:
urban or rural residence, used to capture effects of urbanisation and modernisation
of timing of childbearing and marriage; birth cohort, used to capture effects of
changing generations and behaviour on early childbearing and marriage; and
education level of woman, to capture the effects of social status and modernisation
of women. In addition, we model the health district spatial random effect to account
for unmeasured and unobserved district-level risk factors, such as differences in
social and material deprivation (this was not available for the study time) and
cultural influences, both of which have been shown to contribute to differences in
the timing of childbearing and marriage. Some of the observed characteristics of the
sample women are shown in Table 9.1.
9.4.2 Results
Fig. 9.1 (a) Proportion of women who are already mothers by population group (Black/African, Coloured, White, Asian/Indian), by age in years of woman. (b) Proportion of women who have been married before by population group, by age in years of woman
Overall, women born in the earlier decades before the 1980s have significantly
higher rates of early childbearing and marriage. For instance, women born in the
1960s had rates of first childbirths and first marriage that are about 7 and 10 times the
rates of the women born in the 1980s. Compared to women with more than 12 years
of schooling years, women with lower years of education have significantly higher
rates of early motherhood and marriage. In addition, women residing in the rural
areas have significantly higher rates of first childbirth and marriage. Racial
differences among the sampled women in the age at first childbirth and marriage
are observed, with Black African and Coloured women being more likely than White
and Indian women to be early mothers. On the other hand, Black African and
Coloured women have rates of marriage about half that of White women. Indian
and White women have similar rates for both timing of childbearing and marriage.
In all the models, the conditional structured variation is larger than the unstructured variation. This is also reflected in the larger contribution of the structured
random effect to the total variation. The hazard rates for timing of first marriage are
more variable than those for timing of first childbirth. As the variances and contributions, especially those of the structured random effects, cannot be regarded as negligible, this indicates that important covariates inducing spatial correlation in the observed data are still missing. The estimated fitted hazards on the log-scale for
various spatial models of the risk of timing of first childbirth and timing of first
marriage are shown in Fig. 9.2a–d. These figures show that both rates are highest in
the north-eastern provinces of Limpopo, Mpumalanga, Kwazulu-Natal, and parts of
Eastern Cape province. Districts in the most urbanised and economically advanced
provinces of Gauteng and Western Cape, and metropolitan districts have lower rates
of early childbearing. Most metropolitan districts, and those districts that are in the
more economically developed provinces, have higher rates of early marriage than
those districts that are more rural (results not shown).
9.5 Discussion
Table 9.2 Parameter estimates from the univariate and multivariate spatial models
Univariate spatial models A multivariate spatial model
Parameter Timing of first childbirth Timing of first marriage Timing of first childbirth Timing of first marriage
Birth cohort
1948–1959 6.502 (5.420, 8.091) 11.290 (8.617, 16.510) 6.324 (5.374, 7.831) 11.690 (8.805, 17.840)
1960–1969 6.949 (5.786, 8.625) 10.080 (7.719, 16.780) 6.743 (5.754, 8.352) 10.500 (7.849, 15.990)
1970–1979 5.357 (4.468, 6.660) 5.416 (4.125, 7.899) 5.200 (4.442, 6.434) 5.622 (4.196, 8.597)
1980–1983 1.000 1.000 1.000 1.000
Ethnicity of woman
Black/African 1.154 (1.046, 1.271) 0.400 (0.357, 0.443) 1.151 (1.038, 1.268) 0.392 (0.353, 0.444)
Coloured 1.106 (0.983, 1.243) 0.497 (0.432, 0.570) 1.116 (0.78, 1.247) 0.503 (0.437, 0.581)
White 1.000 1.000 1.000 1.000
Asian 0.992 (0.839, 1.169) 0.919 (1.118, 1.285) 1.002 (0.851, 1.162) 0.918 (0.770, 1.081)
Place of residence
Urban 1.000 1.000 1.000 1.000
Rural 1.119 (1.061, 1.178) 1.200 (1.118, 1.285) 1.114 (1.060, 1.177) 1.190 (1.108, 1.278)
Education of woman
None 1.677 (1.481, 1.883) 1.900 (1.652, 2.196) 1.672 (1.483, 1.869) 1.909 (1.669, 2.181)
Grade 1–7 1.748 (1.582, 1.927) 1.779 (1.569, 2.019) 1.739 (1.577, 1.922) 1.792 (1.602, 2.019)
Grade 8–11 1.583 (1.434, 1.741) 1.401 (1.244, 1.577) 1.572 (1.424, 1.732) 1.408 (1.266, 1.571)
Grade 12 1.162 (1.049, 1.288) 1.068 (0.945, 1.206) 1.158 (1.048, 1.281) 1.074 (0.968, 1.204)
Higher 1.000 1.000 1.000 1.000
Random effects
Unstructured standard deviation 0.029 (0.014, 0.066) 0.174 (0.107, 0.267) 0.062 (0.015, 0.098) 0.150 (0.074, 0.217)
Structured standard deviation 0.136 (0.077, 0.214) 0.372 (0.150, 0.6001) 0.102 (0.042, 0.155) 0.537 (0.350, 0.719)
Proportion structured variation 0.865 (0.397, 0.977) 0.566 (0.101, 0.830) 0.436 (0.063, 0.933) 0.730 (0.534, 0.907)
Correlation unstructured effects – – 0.532 (0.331, 0.884)
Correlation structured effects – – 0.887 (0.145, 0.991)
Fig. 9.2 (a) Estimated district level first childbirth log-hazards for the univariate spatial survival model of timing of first childbirth. (b) Estimated district level first marriage log-hazards for the univariate spatial survival model of timing of first marriage. (c) Estimated district level first childbirth log-hazards for the multivariate spatial survival model. (d) Estimated district level first marriage log-hazards for the multivariate spatial survival model. (Map legend categories on the log-hazard scale: < -0.05, -0.05 to 0.0, 0.0 to 0.05, >= 0.05)
9.6 Conclusions
The concept of frailty was developed by Vaupel et al. (1979) for univariate survival data, and was subsequently extended to bivariate models based on a gamma frailty effect by Clayton (1978), Oakes (1982), Clayton (1991), and Clayton and Cuzick (1985), the latter also extending the approach to account for covariates. Hougaard (2000) presents an excellent introduction to the concepts of frailty and to the specification and estimation procedures for the most important frailty models. In particular, the shared frailty models
for various forms of bivariate and multivariate survival distributions are presented
and exemplified with typical data sets. Most of these earlier developments have been
on independent and identically distributed frailty effects. Recently, extensions to
modelling spatially structured shared frailty effects have been developed and can be
found in Banerjee and Carlin (2003a); Banerjee et al. (2003b); and Banerjee et al.
(2004) using the CAR model. We proposed a MMMC model which was based on the
concept of a multivariate spatial model outlined by Leyland et al. (2000) and Browne
and colleagues (2001), which accounts for both unstructured heterogeneity and
spatially structured random effects. These closely adhere to applications of previous
spatial modelling carried out by Feltbower et al. (2005) and Manda et al. (2009).
References
Abrahantes, J. C., Legrand, C., Burzykowski, T., Janssen, P., Ducrocq, V., & Duchateau,
L. (2007). Comparison of different estimation procedures for proportional hazards model
with random effects. Computational Statistics and Data Analysis, 51, 3913–3930.
Andersen, P. K., & Gill, R. D. (1982). Cox’s regression models for counting processes. The Annals
of Statistics, 10, 1100–1120.
Banerjee, S., & Carlin, B. P. (2003). Semiparametric spatio-temporal frailty modeling.
Environmetrics, 14, 523–535.
Banerjee, S., Wall, M. M., & Carlin, B. P. (2003). Frailty modeling for spatially correlated survival
data, with application to infant mortality in Minnesota. Biostatistics, 4, 123–142.
Banerjee, S., Carlin, B. P., Alan, E., & Gelfand, A. E. (2004). Hierarchical modeling and analysis
for spatial data. Boca Raton: Chapman & Hall.
Besag, J., York, J., & Mollie, A. (1991). Bayesian image restoration, with two applications in
spatial statistics (with discussion). Annals of the Institute of Statistical Mathematics, 43, 1–59.
Browne, W. J., Goldstein, H., & Rasbash, J. (2001). Multiple membership multiple classification
(MMMC). Statistical Modelling, 1, 103–124.
Cai, B., & Meyer, R. (2011). Bayesian semiparametric modeling of survival data based on
mixtures of B-spline distributions. Computational Statistics and Data Analysis, 55,
1260–1272.
Carlin, B. P., & Hodges, J. S. (1999). Hierarchical proportional hazards regression models for
highly stratified data. Biometrics, 55, 1162–1170.
Clayton, D. G. (1978). A model for association in bivariate life tables and its application in epidemio-
logical studies of familial tendency in chronic disease incidence. Biometrika, 65, 141–152.
Clayton, D. G. (1991). A Monte Carlo method for Bayesian inference in frailty models.
Biometrics, 47, 467–485.
Clayton, D. G., & Cuzick, J. (1985). Multivariate generalizations of the proportional hazards
model (with discussion). Journal of the Royal Statistical Society A, 148, 82–117.
Department of Health, Medical Research Council, Measure DHS+. (2002). South Africa Demo-
graphic and Health Survey 1998. Pretoria: Department of Health.
Feltbower, R. G., Manda, S. O. M., Gilthorpe, M. S., Greaves, M. F., Parslow, R. C., Kinsey, S. E.,
Bodansky, H. J., Patricia, A., & McKinney, P. A. (2005). Detecting small area similarities in
the epidemiology of childhood acute lymphoblastic leukaemia and type 1 diabetes: A Bayesian
approach. American Journal of Epidemiology, 161, 1168–1180.
Gupta, N., & Mahy, M. (2003). Adolescent childbearing in sub-Saharan Africa: Can increased
schooling alone raise ages at first birth? Demographic Research, 8, 93–106.
Hougaard, P. (2000). Analysis of multivariate survival data. New York: Springer.
Human Sciences Research Council. (2009). Teenage pregnancy in South Africa – with specific focus on school-going children (Full report). Pretoria: Human Sciences Research Council.
Jewkes, R., Morrell, R., & Christofides, N. (2009). Empowering teenagers to prevent pregnancy: Lessons from South Africa. Culture, Health and Sexuality, 11, 675–688.
Kalule-Sabiti, I., Palamuleni, M., Makiwane, M., & Amoateng, A. Y. (2007). Family formation
and dissolution patterns. In A. Y. Amoateng & T. B. Heaton (Eds.), Families and households in
post-apartheid South Africa: Socio-demographic perspectives (pp. 89–112). Cape Town:
HSRC Press.
Leyland, A. H., Langford, I. H., Rasbash, J., & Goldstein, H. (2000). Multivariate spatial models
for event data. Statistics in Medicine, 19, 2469–2478.
Lloyd, C. B., & Mensch, B. S. (2006). Marriage and childbirth as factors in school exit:
An analysis of DHS data from sub-Saharan Africa. New York: Population Council.
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS: A Bayesian modelling
framework: concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
Magadi, M. (2004). Poor pregnancy outcomes among adolescents in South Nyanza region of Kenya.
Working paper: A04/04 Statistical Sciences Research Institute. University of Southampton.
Manda, S. O. M. (2011). A nonparametric frailty model for clustered survival data.
Communications in Statistics – Theory and Methods, 40(5), 863–875.
Manda, S. O. M., & Leyland, A. (2007). An empirical comparison of maximum likelihood and
Bayesian estimation methods for multivariate spatial disease model. South African Statistical
Journal, 41, 1–21.
Manda, S. O. M., & Meyer, R. (2005). Age at first marriage in Malawi: A Bayesian multilevel
analysis using a discrete time-to-event model. Journal of the Royal Statistical Society A, 168,
439–455.
Manda, S. O. M., Feltbower, R. G., & Gilthorpe, M. S. (2009). Investigating spatio-temporal
similarities in the epidemiology of childhood leukaemia and diabetes. European Journal of
Epidemiology, 24, 743–752.
Oakes, D. (1982). A concordance test for independence in the presence of censoring. Biometrics,
38, 451–455.
Palamuleni, M. E. (2011). Socioeconomic determinant of age at first marriage in Malawi.
International Journal of Sociology and Anthropology, 3, 224–235.
Palamuleni, M. E., Kalule-Sabiti, I., Makiwane, M. (2007). Fertility and child bearing in South
Africa. In A. Y. Amoateng & T. B. Heaton (Eds.), Families and households in post-apartheid
South Africa: Sociol-demographics perspectives (pp. 113–134). Cape Town: HSRC Press.
Sargent, D. J. (1998). A general framework for random effects survival analysis in the Cox
proportional hazards setting. Biometrics, 54, 1486–1497.
Sastry, N. (1997). A nested frailty model for survival data, with an application to the study of child
survival in northeast Brazil. Journal of the American Statistical Association, 92, 426–435.
Sharma, A. K., Verma, K., Khatri, S., & Kannan, A. T. (2003). Determinants of pregnancy in adolescents
in Nepal. Indian Journal of Paediatrics, 69, 19–22.
South, S. J. (1993). Racial and ethnic differences in the desire to marry. Journal of Marriage and
Family, 55, 357–370.
Statistics South Africa. (2010). Estimation of fertility from the 2007 Community Survey of South
Africa/Statistics South Africa. Pretoria: Statistics South Africa.
Upchurch, D. M., Levy-Storms, L., Sucoff, C. A., & Aneshensel, C. S. (1998). Gender and ethnic
differences in the timing of first sexual intercourse. Family Planning Perspectives, 30, 121–127.
Vaupel, J. W., Manton, K. G., & Stallard, E. (1979). The impact of heterogeneity in individual
frailty on the dynamics of mortality. Demography, 16, 439–454.
Chapter 10
Meta-analysis of Observational Studies
Darren C. Greenwood
10.1 Introduction
For meta-analysis to be possible, all the differently presented results from the
observational studies need to be converted to one metric. Combining studies with
entirely different designs, e.g. cohort and case-control studies, is often unreason-
able, and so analyses should be stratified by design and separate estimates presented
(Sutton et al. 2000). Even so, substantial heterogeneity is common between studies
of the same design, due to different population structures and adjustment for
confounding. It is important to characterise these differences in tables and to
explore any heterogeneity using standard tools such as stratified forest plots and
meta-regression techniques.
Epidemiologists sometimes like to compare extreme categories of exposure,
such as the highest versus lowest categories. Although this allows most studies to
be combined, this is often unwise for the following reasons: categorisation often
differs between studies, definition of quantiles depends on the population exposure
distribution, and categorisation loses information by introducing measurement error
(see Chap. 2). These issues introduce unwanted heterogeneity that often renders the combined estimates useless. The common metric that reduces these problems is to
present results as a linear dose-response trend.
To allow these different studies to be included in the same meta-analysis, all results
need to be converted to a relative risk for a unit increase of exposure, giving a linear
dose-response trend. This is done using the methods attributable to Greenland and
Longnecker (1992) (the “pool last” approach) and Chêne and Thompson (1996).
The method of Greenland and Longnecker is particularly useful in that it (i)
provides dose-response estimates that take account of the correlation between the
estimates for each category induced by using the same reference group (Berlin et al.
1993; Greenland and Longnecker 1992), and (ii) enables derivation of dose-
response relative risk estimates that are adjusted for whatever confounding factors
were considered in the particular study.
Using the notation given in Greenland and Longnecker (1992), these methods
are applied in the following steps:
Step 1. Use an iterative algorithm to estimate the cell counts Ax and Bx, where Ax
is the fitted number of cases at exposure level x and Bx is the number of
non-cases.
Step 2. Let L_x be the adjusted log relative risk for exposure level x when x ≠ 0, compared to the reference level (assumed to be x = 0). For x ≠ z, estimate the correlation r_xz between L_x and L_z by r_xz = (1/A_0 + 1/B_0)/(s_x s_z) for case-control studies, r_xz = (1/A_0 − 1/B_0)/(s_x s_z) for cohort studies without person-time, and r_xz = 1/(A_0 s_x s_z) for cohorts involving person-time, where s_x² is the crude variance estimate. Calculating s_x² depends on the study type too: s_x² = (1/A_x + 1/B_x + 1/A_0 + 1/B_0) for case-control studies, s_x² = (M_1/(A_x A_0) − 1/N_0 − 1/N_x) for cohort studies without person-time, and s_x² = M_1/(A_x A_0) for cohort studies with person-time data, where N_x is the total number of subjects at exposure level x, and M_1 is the total number of cases.
Step 3. Estimate the covariance c_xz of L_x and L_z by c_xz = r_xz √(v_x v_z), where v_x is the estimated variance of L_x.
Step 4. Estimate the dose-response slope b* (and the variance of its estimate, v_b*) by weighted least squares for correlated outcomes as follows: b* = v_b* x′C⁻¹L and v_b* = (x′C⁻¹x)⁻¹, where x is the vector of exposure levels excluding the reference level, and C is the estimated covariance matrix of L, with diagonal elements v_x and off-diagonal elements c_xz. When the mean exposure in the reference category is non-zero, appropriate subtraction from the remaining category means is required. For linear trends this is simply subtracting the mean for the reference category from all category means.
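The following sketch implements Steps 3 and 4 for a single study, assuming the adjusted log relative risks, their variances, and the common covariance induced by the shared reference group are already available (i.e. Steps 1 and 2 have been carried out); the example doses, relative risks and variances are invented purely for illustration.

```python
import numpy as np

def gl_trend(dose, log_rr, var_log_rr, cov_ref):
    """Dose-response slope from correlated category log relative risks.

    Minimal sketch of Steps 3-4: `dose` holds category means minus the
    reference-category mean, `log_rr` and `var_log_rr` the adjusted log
    relative risks and their variances for the non-reference categories, and
    `cov_ref` the common covariance induced by the shared reference group
    (taken as given here rather than derived from fitted cell counts).
    Returns the slope b*, its standard error, and the relative risk per unit.
    """
    x = np.asarray(dose, float)
    L = np.asarray(log_rr, float)
    C = np.full((len(x), len(x)), float(cov_ref))
    np.fill_diagonal(C, var_log_rr)
    C_inv = np.linalg.inv(C)
    v_b = 1.0 / float(x @ C_inv @ x)        # v_b* = (x'C^-1 x)^-1
    b = v_b * float(x @ C_inv @ L)          # b*  = v_b* x'C^-1 L
    return b, np.sqrt(v_b), np.exp(b)

# Illustrative (made-up) study: three exposure categories above the reference.
b, se, rr_per_unit = gl_trend(dose=[10, 25, 50],
                              log_rr=np.log([1.10, 1.25, 1.60]),
                              var_log_rr=[0.010, 0.012, 0.020],
                              cov_ref=0.004)
print(f"slope = {b:.4f} (SE {se:.4f}), RR per unit = {rr_per_unit:.3f}")
```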
The information required to derive a dose-response trend is not presented in the majority of papers, and a number of approaches need to be taken in order to derive it. These need to be applied in the following order of priority:
1. Where the exposure is measured as a continuous variable, and the dose-response
slope given, then this should be used directly. This does not allow extension to
nonlinear trends. Where nonlinear trends need to be modelled, results based on
three or more categories are required.
2. Where the slope (and its standard error or confidence interval) is not given in the
text, these should be estimated using the methods of Greenland and Longnecker
(1992) using the mean exposure in each category given in the paper. No
additional assumptions are required.
3. Greenland and Longnecker’s method requires the total numbers of cases
and controls to be known, and starting estimates for the number of cases in
each category. Where these are not presented, values should be estimated
based on the ratio of cases to controls, the basis for any categorisation using
quantiles (whether based on the whole population or just controls), or on the
information contained in each category estimated from the width of the
confidence intervals.
4. The mean exposure for each category is rarely given, so the methods of Chêne
   and Thompson (1996) can be used to estimate the means for use in the
   Greenland and Longnecker technique. This approach makes the assumption
   of a normally distributed exposure, or a distribution that can be transformed
   to normality. Many environmental or dietary exposures follow a lognormal
   distribution closely enough for these purposes. This is not necessarily the case
   for exposures where a large group of unexposed individuals may reasonably
   be assumed to come from a separate distribution, i.e. zero-inflated, or a
   mixture of two distributions. Episodically consumed foods such as alcohol or
   meat are two examples where this approach may not be appropriate.
The derivation therefore relies on combinations of reported information such as:
• Range of exposure for each category, the total number of cases and controls,
estimated relative risks for each category, a way of quantifying uncertainty
around these estimates, e.g. confidence intervals.
• Mean exposure for cases and controls separately, number of cases and controls,
along with a measure of uncertainty in the mean, e.g. standard deviation or
standard error.
The method of Greenland and Longnecker should be applied using standard
errors that depend on the study type (case-control or cohort) and the form of the
relative risk estimate (relative risk or odds ratio). For cohort studies, relative risks
derived from person-years of exposure should also be taken into account if presented.
For the purpose of meta-analysis, the different estimates of relative risk may often
be treated as interchangeable because, where the outcome is rare, odds ratios may
be considered a good approximation to the relative risk (Greenland et al. 1986;
Greenland and Thomas 1982).
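As an illustration (with invented figures), suppose a cohort reports risks of 2% in the exposed and 1% in the unexposed: the relative risk is 0.02/0.01 = 2.00, whilst the odds ratio is (0.02/0.98)/(0.01/0.99) ≈ 2.02, a negligible difference. If instead the risks were 40% and 20%, the relative risk would still be 2.00 but the odds ratio would be (0.4/0.6)/(0.2/0.8) ≈ 2.67, so the approximation is only reasonable when the outcome is rare.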
Where the information required for meta-analysis is not available, studies cannot be
included in that meta-analysis. As discussed above, whilst allowing wider inclusion
of studies, comparison of extreme categories introduces heterogeneity.
For example, as part of the World Cancer Research Fund and American Institute
for Cancer Research series of systematic literature reviews of “Food, Nutrition,
Physical Activity and the Prevention of Cancer”, a systematic review was
conducted of the association between processed meat intake and gastric cancer
(World Cancer Research Fund/American Institute for Cancer Research 2007).
All original, aetiological, peer-reviewed studies were considered with no exclusions
on the basis of study quality, publication date, or language. Data were extracted
from 29 studies, but of these, only 17 studies (59%) contained sufficient informa-
tion to contribute to the dose-response meta-analysis. Of these, 9 were case-control
studies prone to substantial recall and selection biases, leaving just 8 cohort studies,
28% of the initial number of studies extracted. It is typical for the more recent
studies to provide better quality information. Similarly, cohort studies tend to
contain more information usable for meta-analysis than case-control studies,
although in some fields cohort studies are rare.
Cohort studies sometimes publish more than one paper from the same study,
separated by a number of years’ follow-up. In this situation the paper containing
the larger number of cases should usually be used, which is often the most recent
paper. In the even more common situation where the same exposure is analysed in
several ways, with different levels of adjustment, a decision needs to be made
regarding which model is the one with the “most appropriate” adjustment for
confounding. The most appropriate adjustment is often the maximally adjusted
analysis given in the paper, or the one with the narrower confidence intervals.
However, the best model is not always the maximally adjusted one and sometimes a
model with less adjustment is more appropriate because it avoids collinearity and
over-adjustment (see Chap. 1). Where estimates are presented only by subgroup,
e.g. men and women, then the subgroups can be included in the meta-analysis
separately and give valid overall pooled estimates. However, this leads to
underestimation of the heterogeneity by including the same study as an apparently
independent observation, and incorrect degrees of freedom in the test for heteroge-
neity, so it is better to first pool these subgroup results and include this pooled
estimate as the single result from that study.
Where there is a category representing a zero exposure, i.e. non-consumers, this
presents a situation similar to that discussed in Chap. 6. For the purposes of
estimating the category means required for Greenland and Longnecker's method,
this zero category may be treated separately. For example, for processed meat
intake, the zero category would include
vegetarians. Such “never” categories often lead to a peak in the distribution at
zero, which means that the data follow neither a normal nor a lognormal distribu-
tion. By using a mean of zero for the “never” category and estimating means for the
other categories separately, this allows distributional assumptions to be made for
the remaining exposure categories, and therefore more studies can be included in a
meta-analysis.
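As a rough sketch of this idea (the cut-points, the lognormal parameters and the function name below are all hypothetical; in practice the parameters would be chosen to reproduce the reported category boundaries and proportions, in the spirit of Chêne and Thompson's approach), category means for consumers can be computed from an assumed lognormal distribution, while the "never" category is simply assigned a mean of zero:

import numpy as np
from scipy import integrate, stats

def category_means_lognormal(cuts, mu, sigma):
    # Mean exposure within each category [a, b) for a lognormal exposure
    # with parameters mu and sigma on the log scale
    dist = stats.lognorm(s=sigma, scale=np.exp(mu))
    means = []
    for a, b in zip(cuts[:-1], cuts[1:]):
        mass = dist.cdf(b) - dist.cdf(a)
        numerator, _ = integrate.quad(lambda t: t * dist.pdf(t), a, b)
        means.append(numerator / mass)
    return means

# Hypothetical processed meat categories (g/day) among consumers; the "never"
# category is handled separately and given a mean of zero
cuts = [0.0, 10.0, 25.0, 50.0, np.inf]
print([0.0] + [round(m, 1) for m in category_means_lognormal(cuts, mu=3.0, sigma=0.8)])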
10.8 Presentation
The World Cancer Research Fund and American Institute for Cancer Research
series of systematic literature reviews of “Food, Nutrition, Physical Activity and the
Prevention of Cancer” form a seminal series of meta-analyses of observational
studies (World Cancer Research Fund/American Institute for Cancer Research
2007). All original, aetiological, peer-reviewed studies were considered with no
exclusions on the basis of study quality or publication date, or language. Studies
were identified through a comprehensive literature search (Butrum et al. 2006). One
of the reviews conducted was of the association between processed meat intake and
gastric cancer. Data were extracted from eight cohort studies (Galanis et al. 1998;
Gonzalez et al. 2006; Khan et al. 2004; McCullough et al. 2001; Ngoan et al. 2002;
Nomura et al. 1990; van den Brandt et al. 2003; Zheng et al. 1995).
In order to combine studies presenting results as portions of processed meat with
those presenting results as grams of intake, a standard portion size was used to convert
portions to grams, based on standard food tables (Ministry of Agriculture 1988).
10.9.2 Results
Study characteristics are given in Table 10.1 with category definitions and relative
risk estimates given in Table 10.2. It is sometimes helpful also to show these
results graphically for each study, converted into standard portion sizes. The forest
plot shows results for both the fixed effects with inverse-variance weighting (“I-V
overall”) and DerSimonian and Laird random effects (“D + L overall”) analyses
(DerSimonian and Laird 1986) (Fig. 10.1). The fixed effects estimate of relative
risk was 1.02 (95% CI: 1.00–1.05) per 20 g/day of processed meat. There was
hardly any excess heterogeneity within the cohort studies (I² = 1%). The random
effects estimate of relative risk was almost the same as the fixed effects estimate
because there was very little excess heterogeneity (relative risk = 1.03, 95% CI:
1.00–1.05).
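For reference, the two pooled estimates shown in the forest plot can be reproduced with a few lines of code once a log relative risk per 20 g/day and its standard error have been derived for each study. The sketch below (Python, with invented study values rather than the extracted data) implements inverse-variance fixed effect pooling and the DerSimonian and Laird (1986) method-of-moments estimate of the between-study variance:

import numpy as np

def dersimonian_laird(logrr, se):
    # Inverse-variance fixed effect and DerSimonian-Laird random effects pooling
    logrr, se = np.asarray(logrr), np.asarray(se)
    w = 1 / se**2
    fixed = np.sum(w * logrr) / np.sum(w)
    # Cochran's Q and the method-of-moments between-study variance tau^2
    Q = np.sum(w * (logrr - fixed)**2)
    k = len(logrr)
    tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_star = 1 / (se**2 + tau2)
    random = np.sum(w_star * logrr) / np.sum(w_star)
    i2 = max(0.0, (Q - (k - 1)) / Q) * 100 if Q > 0 else 0.0
    return np.exp(fixed), np.exp(random), tau2, i2

# Hypothetical per-study log relative risks per 20 g/day and their standard errors
logrr = np.log([1.01, 1.05, 0.98, 1.10, 1.02, 1.04, 0.99, 1.03])
se = np.array([0.02, 0.05, 0.04, 0.06, 0.03, 0.05, 0.04, 0.03])
print(dersimonian_laird(logrr, se))   # fixed RR, random RR, tau^2, I^2 (%)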
Table 10.2 Category definitions and relative risks. 95% confidence intervals are given in parentheses
Author (Year); Subjects' gender; Reference category; Category 2; Category 3
Nomura et al. (1990); M; 0–1 times/week, RR = 1.0; 2–4 times/week, RR = 1.0 (0.7, 1.4); 5+ times/week, RR = 1.3 (0.9, 2.0)
Zheng et al. (1995); F; 0–4.4 times/month, RR = 1.0; 4.4–12 times/month, RR = 0.9 (0.3, 2.9); 13+ times/month, RR = 2.2 (0.8, 6.0)
Galanis et al. (1998); M/F; 0 times/week, RR = 1.0; 1–2 times/week, RR = 0.9 (0.6, 1.4); 3+ times/week, RR = (0.6, 1.7)
McCullough et al. (2001); M; 0–0.9 days/week; 1–4.4 days/week; 4.5+ days/week
Fig. 10.1 Forest plot for fixed-effects meta-analysis ignoring measurement error, showing each study's estimated relative risk for a 20 g/day increment (95% CI)
Substantial work has been done by others on correction for the effects of missing
data in the context of meta-analysis, allowing for the uncertainty that it introduces
using Bayesian methods (Sutton et al. 2000; White et al. 2008a, b). In simple
situations it is possible to apply the methods described in Chap. 2 to the estimates
from each study before pooling, to correct for the effects of measurement error on
pooled estimates in meta-analysis. A particular problem for meta-analysis of
published data is that the exposure is often presented in categorized form, thus
suffering from loss of information and associated bias introduced by this form of
measurement error. Bayesian methods can be used to address these issues, with the
benefit of taking into account the uncertainty in the measurement error variance,
and flexibility for use with non-additive or non-classical measurement error
mechanisms.
10.11.1 Methods
So far the methods described have assumed that there is a linear dose-response
curve. It is possible that the curve is nonlinear, and for some exposures, particularly
dietary components such as alcohol intake, this is likely. Considering a non-linear
dose-response curve is not possible using Greenland and Longnecker’s “pool last”
approach outlined in Sect. 10.3, i.e. slopes derived before pooling, but is possible if
means and covariance matrices from individual studies are pooled before
estimating the slopes, known as the “pool first” approach (Greenland and
Longnecker 1992).
One way to fit a nonlinear curve, using the “pool first” approach within Stata, is
to select the best fitting nonlinear dose-response curve from a limited but flexible
family of fractional polynomials (Bagnardi et al. 2004; Royston et al. 1999;
Royston and Altman 1994, 2000). For the example in this chapter, the family of
second-order fractional polynomials was used, with ln(RR|x) = b1x^p + b2x^q and
p and q taking values from −2, −1, −0.5, 0, 0.5, 1, 2, 3. When p or q is zero, x^p or x^q
is taken to be ln(x). When p = q, the model is taken as ln(RR|x) = b1x^p + b2x^p ln(x).
These provide simple models but still with a good range of possible curves,
including a range of commonly observed tick-shaped (J-shaped) and U-shaped
curves. The best model was the one that gave the most improvement (decrease)
in deviance compared to the linear model.
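A sketch of how this model search might be programmed is given below; the function names and the use of the weighted residual sum of squares e′C⁻¹e as the figure of merit are illustrative assumptions rather than the glst implementation. The inputs are the exposure levels x, the corresponding log relative risks L and their covariance matrix C (block-diagonal across studies under the "pool first" approach described in the following paragraphs):

import numpy as np
from itertools import combinations_with_replacement

def fp_term(x, p):
    # Fractional polynomial term x^p, with x^0 defined as ln(x); x must be positive
    return np.log(x) if p == 0 else x ** float(p)

def gls_fit(X, L, Cinv):
    # Generalised least squares fit; returns coefficients and the weighted
    # residual sum of squares e'C^{-1}e used here to compare candidate models
    b = np.linalg.solve(X.T @ Cinv @ X, X.T @ Cinv @ L)
    e = L - X @ b
    return b, float(e @ Cinv @ e)

def best_second_order_fp(x, L, C, powers=(-2, -1, -0.5, 0, 0.5, 1, 2, 3)):
    # x, L: NumPy arrays of exposure levels and log relative risks; C: covariance matrix
    Cinv = np.linalg.inv(C)
    best = None
    for p, q in combinations_with_replacement(powers, 2):
        if p == q:
            X = np.column_stack([fp_term(x, p), fp_term(x, p) * np.log(x)])
        else:
            X = np.column_stack([fp_term(x, p), fp_term(x, q)])
        b, rss = gls_fit(X, L, Cinv)
        if best is None or rss < best[0]:
            best = (rss, (p, q), b)
    return best   # (fit statistic, powers, coefficients) of the best fitting curve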
The “pool-first” approach is similar to the “pool-last” approach outlined in
Sect. 10.3 but more flexible for including covariates.
Fig. 10.2 Scatterplot of ln(RR) vs. processed meat intake (g/day). The area of the circle
representing each estimate is proportional to the precision associated with that estimate.
Reference categories are indicated by points with ln(RR) = 0
Using the same notation as
before, let xk and Lk be the vectors of exposure levels and adjusted log relative
risks for study k, excluding the reference levels, Ck be the estimated covariance
matrix for Lk, and G be a block-diagonal matrix whose kth diagonal block is Ck⁻¹.
The pooled estimate of the coefficients is b̂ = VX′GL, with estimated covariance
matrix V = (X′GX)⁻¹. The model fit can be assessed by comparing e′Ge to a
chi-squared distribution on degrees of freedom equal to the number of elements in
e minus 2, where e is the vector of residuals e = L − Xb̂.
When the mean exposure in the reference category is non-zero, the value of the
fractional polynomial function evaluated at the mean of the reference category
needs to be subtracted first.
10.11.2 Results
A scatterplot of log relative risk of gastric cancer against level of processed meat
intake was plotted with the area of the plotting symbol proportional to the precision
of the relative risk associated with it. This represents the raw data extracted in a way
that summarises the observations clearly on one graph (Fig. 10.2).
The best fitting fractional polynomial (based on the deviance) was ln(RR|x) =
b1x³ + b2x³ln(x), with b1 = 6.17 × 10⁻⁶ (s.e. = 3.35 × 10⁻⁶) and b2 = −1.31 × 10⁻⁶
(s.e. = 7.30 × 10⁻⁷).
The coefficients appear small because of the size of x³ and x³ln(x). This model had
the lowest deviance (chi-squared goodness of fit test = 9.0, on 13 df, p = 0.78) but
was not significantly better than the linear model (decrease in deviance = 2.0, on 1
df, p = 0.16). This curve is plotted in Fig. 10.3 with 95% confidence bands.
Fig. 10.3 Predicted relative risk vs. processed meat intake (g/day) from the fractional
polynomial model. Shaded area represents the 95% confidence interval around the fitted curve
The curve shows very little excess risk for the first 30–40 g/day of processed
meat, rising more steeply to a peak risk at around 80 g/day before dropping
slightly at higher intakes. For higher intakes, the confidence intervals are very
wide, so this curve could also indicate a threshold effect beyond which additional
processed meat intake does not confer any further harm. The risks associated with
intakes of processed meat above 20 g/day were statistically significant, though the
95% confidence intervals include negligible effects for intakes up to 40–50 g/day.
Eilers (2007) has also viewed the problem of nonlinear dose-response curves in
meta-analysis as a latent trait on which the categorised exposure is based. Eilers
considers a non-parametric smooth latent distribution of event probabilities, and
uses an EM algorithm to estimate it. Further work could extend this to allow for
measurement error. Extensions of established multilevel methods for meta-analysis
(Higgins et al. 2001; The Fibrinogen Studies Collaboration 2006) (see also Chap. 5)
and latent variable methods within a Bayesian framework (Higgins et al. 2001;
Salanti et al. 2006; Spiegelhalter et al. 2007; Sutton et al. 2000, 2008) (see also
Chaps. 8 and 9) may also be successfully applied.
10.12 Software
The methods of Greenland and Longnecker have been implemented in the Stata
function “glst.ado”. This implementation includes both fixed and random effects
meta-analysis, and both the “pool last” and “pool first” methods. This allows explora-
tion of heterogeneity as well as modelling linear and nonlinear dose-response
curves. However, substantial work can sometimes be required to extract and derive
the appropriate information from the included studies, and some additional pro-
gramming is required to select an appropriate fractional polynomial and plot the
curves. A package such as WinBUGS or JAGS is required for work within the
Bayesian framework (Spiegelhalter et al. 2007; Plummer 2003).
10.14 Conclusions
References
Bagnardi, V., Zambon, A., Quatto, P., & Corrao, G. (2004). Flexible meta-regression functions for
modeling aggregate dose-response data, with an application to alcohol and mortality. American
Journal of Epidemiology, 159(11), 1077–1086.
Berlin, J. A., Longnecker, M. P., & Greenland, S. (1993). Meta-analysis of epidemiologic
dose-response data. Epidemiology, 4(3), 218–228.
Bingham, S. A., & Day, N. (2006). Commentary: Fat and breast cancer: Time to re-evaluate both
methods and results? International Journal of Epidemiology, 35(4), 1022–1024.
Butrum, R., Cannon, G., Heggie, S., Kroke, A., Miles, L., Norman, H., El Sherbini, N., James, C.,
Stone, E., Thompson, R., & Wiseman, M. (2006). Food, Nutrition, Physical Activity and the
Prevention of Cancer: a Global Perspective. Systematic Literature Review Specification Man-
ual (Second expert report), Washington DC: World Cancer Research Fund/American Institute
for Cancer Research. http://www.wcrf.org/research/research_pdfs/slr_manual_15.doc.
Chene, G., & Thompson, S. G. (1996). Methods for summarizing the risk associations of
quantitative variables in epidemiologic studies in a consistent form. American Journal of
Epidemiology, 144(6), 610–621.
Collaborative Group on Pre-Natal Risk Factors and Subsequent Risk of Breast Cancer. (2008).
Birth size and breast cancer risk: Re-analysis of individual participant data from 32 studies.
PLoS Medicine, 5(9), 1372–1386.
DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7,
177–188.
Eilers, P. H. C. (2007). Data exploration in meta-analysis with smooth latent distributions.
Statistics in Medicine, 26(17), 3358–3368.
Elm, E. V., Altman, D. G., Egger, M., Pocock, S. J., Gotzsche, P. C., Vandenbroucke, J. P., &
STROBE, I. (2007). Strengthening the reporting of observational studies in epidemiology
(STROBE) statement: Guidelines for reporting observational studies. British Medical Journal,
335(7624), 806–808.
Galanis, D. J., Kolonel, L. N., Lee, J., & Nomura, A. (1998). Intakes of selected foods and
beverages and the incidence of gastric cancer among the Japanese residents of Hawaii:
A prospective study. International Journal of Epidemiology, 27(2), 173–180.
Gonzalez, C. A., Jakszyn, P., Pera, G., Agudo, A., Bingham, S., Palli, D., Ferrari, P., Boeing, H.,
Del Giudice, G., Plebani, M., Carneiro, F., Nesi, G., Berrino, F., Sacerdote, C., Tumino, R.,
Panico, S., Berglund, G., Siman, H., Nyren, O., Hallmans, G., Martinez, C., Dorronsoro, M.,
Barricarte, A., Navarro, C., Quiros, J. R., Allen, N., Key, T. J., Day, N. E., Linseisen, J., Nagel,
G., Bergmann, M. M., Overvad, K., Jensen, M. K., Tjonneland, A., Olsen, A., Bueno-de-
Mesquita, H. B., Ocke, M., Peeters, P. H., Numans, M. E., Clavel-Chapelon, F., Boutron-
Ruault, M. C., Trichopoulou, A., Psaltopoulou, T., Roukos, D., Lund, E., Hemon, B., Kaaks,
R., Norat, T., & Riboli, E. (2006). Meat intake and risk of stomach and esophageal
adenocarcinoma within the European Prospective Investigation into Cancer and Nutrition
(EPIC). Journal of the National Cancer Institute, 98(5), 345–354.
Greenland, S., & Longnecker, M. P. (1992). Methods for trend estimation from summarized dose-
response data, with applications to meta-analysis. American Journal of Epidemiology, 135(11),
1301–1309.
Greenland, S., & Thomas, D. C. (1982). On the need for the rare disease assumption in case-
control studies. American Journal of Epidemiology, 116(3), 547–553.
Greenland, S., Thomas, D. C., & Morgenstern, H. (1986). The rare-disease assumption revisited.
American Journal of Epidemiology, 124(6), 869–883.
Higgins, J. P., Whitehead, A., Turner, R. M., Omar, R. Z., & Thompson, S. G. (2001).
Meta-analysis of continuous outcome data from individual patients. Statistics in Medicine,
20(15), 2219–2241.
Khan, M. M., Goto, R., Kobayashi, K., Suzumura, S., Nagata, Y., Sonoda, T., Sakauchi, F.,
Washio, M., & Mori, M. (2004). Dietary habits and cancer mortality among middle aged and
older Japanese living in Hokkaido, Japan by cancer site and sex. Asian Pacific Journal of
Cancer Prevention, 5(1), 58–65.
McCullough, M. L., Robertson, A. S., Jacobs, E. J., Chao, A., Calle, E. E., & Thun, M. J. (2001).
A prospective study of diet and stomach cancer mortality in United States men and women.
Cancer Epidemiology, Biomarkers & Prevention, 10(11), 1201–1205.
Ministry of Agriculture, Fisheries and Food. (1988). Food portion sizes (2nd ed.). London: HMSO.
Ngoan, L. T., Mizoue, T., Fujino, Y., Tokui, N., & Yoshimura, T. (2002). Dietary factors and
stomach cancer mortality. British Journal of Cancer, 87(1), 37–42.
Nomura, A., Grove, J. S., Stemmermann, G. N., & Severson, R. K. (1990). A prospective study of
stomach cancer and its relation to diet, cigarettes, and alcohol consumption [see comment].
Cancer Research, 50(3), 627–631.
Plummer, M. (2003). JAGS: A program for analysis of bayesian graphical models using
gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical
Computing (DSC 2003), March 20–22, Vienna, Austria.
Royston, P., & Altman, D. G. (1994). Regression using fractional polynomials of continuous
covariates: Parsimonious parametric modelling. Applied Statistics, 43, 429–467.
Royston, P., & Altman, D. G. (2000). A strategy for modelling the effect of a continuous covariate
in medicine and epidemiology. Statistics in Medicine, 19, 1831–1847.
Royston, P., Ambler, G., & Sauerbrei, W. (1999). The use of fractional polynomials to model
continuous risk variables in epidemiology. International Journal of Epidemiology, 28,
964–974.
Salanti, G., Higgins, J. P., & White, I. R. (2006). Bayesian synthesis of epidemiological evidence
with different combinations of exposure groups: Application to a gene-gene-environment
interaction. Statistics in Medicine, 25(24), 4147–4163.
Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2007). WinBUGS user manual: Version
1.4.3. Cambridge: MRC Biostatistics Unit.
Stroup, D. F., Berlin, J. A., Morton, S. C., Olkin, I., Williamson, G. D., Rennie, D., Moher, D.,
Becker, B. J., Sipe, T. A., Thacker, S. B., & for the Meta-analysis of Observational Studies in
Epidemiology Group. (2000). Meta-analysis of observational studies in epidemiology: A
proposal for reporting. Journal of the American Medical Association, 283(15), 2008–2012.
Sutton, A. J., Abrams, K. R., Jones, D. R., Sheldon, T. A., & Song, F. (2000). Methods for meta-
analysis in medical research. Chichester: Wiley.
Sutton, A. J., Kendrick, D., & Coupland, C. A. C. (2008). Meta-analysis of individual- and
aggregate-level data. Statistics in Medicine, 27(5), 651–669.
The Fibrinogen Studies Collaboration. (2006). Regression dilution methods for meta-analysis:
Assessing long-term variability in plasma fibrinogen among 27 247 adults in 15 prospective
studies. International Journal of Epidemiology, 35(6), 1570–1578.
van den Brandt, P. A., Botterweck, A. A., & Goldbohm, R. A. (2003). Salt intake, cured meat
consumption, refrigerator use and stomach cancer incidence: A prospective cohort study
(Netherlands). Cancer Causes & Control, 14(5), 427–438.
White, I. R., Higgins, J. P., & Wood, A. M. (2008a). Allowing for uncertainty due to missing data
in meta-analysis – Part 1: Two-stage methods. Statistics in Medicine, 27(5), 711–727.
White, I. R., Welton, N. J., Wood, A. M., Ades, A. E., & Higgins, J. P. (2008b). Allowing for
uncertainty due to missing data in meta-analysis – Part 2: Hierarchical models. Statistics in
Medicine, 27(5), 728–745.
World Cancer Research Fund/American Institute for Cancer Research. (2007). Food, nutrition,
physical activity and the prevention of cancer: A global perspective. Washington DC: World
Cancer Research Fund/American Institute for Cancer Research.
Zheng, W., Sellers, T. A., Doyle, T. J., Kushi, L. H., Potter, J. D., & Folsom, A. R. (1995). Retinol,
antioxidant vitamins, and cancers of the upper digestive tract in a prospective cohort study of
postmenopausal women. American Journal of Epidemiology, 142(9), 955–960.
Chapter 11
Directed Acyclic Graphs and Structural
Equation Modelling
Yu-Kang Tu
11.1 Introduction
One of the major challenges for epidemiologists is to understand and infer causal
relationships between risk factors and health outcomes in the population by
analysing data from observational studies. For many risk factors, it is either unethi-
cal or impractical to conduct randomised controlled trials to test their health effects.
It would therefore be very desirable if there were a methodology for observational
studies to discover causes and effects amongst variables, or at least to confirm or refute
proposed causal relationships. Epidemiologists need a methodology that combines
directed acyclic graphs (DAGs, see Chap. 1) for
conceptual construction of causal models and regression analysis for testing those
models. It is therefore surprising that structural equation modelling (SEM) has not
been so frequently used in epidemiology as in the social sciences, given that both
epidemiologists and social scientists want to delineate causes and effects from
observational data. The difference between DAGs and path diagrams in SEM is
trivial: the path between two variables can only have one direction in DAGs
(Greenland and Brumback 2002; Iacobucci 2008), whilst in SEM paths in both
directions (double-headed arrows) are also allowed. An individual path in SEM is tested in the same way as the
regression coefficient is in regression analysis, and model fit indices provided by
SEM software packages help the analysts to assess the adequacy of the proposed
causal model compared to the observed associations in the sample data (Pearl 2000;
Kline 2011).
Why then is SEM still under-utilised in epidemiology? This is a question posed
by Der (2002) a few years ago. The answers cited included that SEM used
unfamiliar terminology, because mathematical models in SEM are formulated in
matrix algebra, and the first SEM software, LISREL, uses eight matrices in Greek
letters; the distributional assumptions require that the outcome variables be
continuous and follow multivariate normality; and it is quite tedious to set up SEM
models to test interactions amongst variables and non-linear relationships. More
importantly, two different causal models may imply the same
covariance structure and consequently, it is impossible to tell which is better,
known as the equivalent models problem. Recent advances in SEM theory and
software development have nevertheless resolved some of these issues. We now
know that the maximum likelihood estimator is quite robust to violation of
multivariate normality, and new estimation methods do not require the strict
assumption of multivariate normality (Shipley 2000). Software packages can now
estimate models with non-continuous outcome variables (Skrondal and Rabe-Hesketh 2004;
Muthén and Muthén 2006; Hancock and Samuelson 2007). From a statistical
viewpoint, all general and generalised linear models (such as linear regression,
analysis of variance, and logistic regression), and multivariate statistics (such as
path analysis, multivariate analysis of variance, canonical correlation, and factor
analysis) are part of the SEM family. As a result, almost all epidemiologists are ‘doing’
SEM every day, though most of them are not aware of this.
Since Karl Jöreskog first proposed his famous LISREL model in the 1970s, SEM has
become a very important research tool for quantitative social scientists, because it
provides a very powerful and versatile framework for formulating research
hypotheses and testing them. SEM is a vast and rapidly evolving field, and there
are more than a dozen textbooks and monographs dedicated entirely to SEM.
Chapter-length introductions can be found in many statistics textbooks covering
multivariate methods. Therefore, the aim of this Chapter is not to explain the
mathematical theory of SEM or to demonstrate how to use SEM software. Instead,
the aim is to discuss the relation between DAGs and SEM, the rationale
behind SEM, and the limitations of the SEM philosophy. Readers who are interested in
applying this methodology can consult the textbooks and software manuals for
further details. The structure of this Chapter is as follows: we first explain the path
diagrams used by SEM and the similarity between them and DAGs (Sects. 11.2 and
11.3). Then, we explain how SEM may be useful for the identifications of causal
relationships (Sects. 11.4 and 11.5). Finally, we explain the philosophy behind
SEM testing and its limitations (Sect. 11.6).
of possible models. We need to test the causal relationship using active observation,
i.e. we need to intervene in the system, observe the consequences, and compare
them to the predictions made by the models.
For instance, the causal models in Figs. 11.1, 11.2 and 11.4 suggest that if
we change X, Y will not change because X is not a cause of Y. Therefore if change
in Y does occur when we increase X by one unit in a selected sample by
conducting an experiment, the three causal models in Figs. 11.1, 11.2 and 11.4
are rejected because their predictions are refuted by the experiment. On the
other hand, if change in Y does not occur when we increase X by a unit, the
three causal models are tentatively accepted because their predictions pass
the experiment, but further experiments are required to tell which of the three
models is the best. For instance, we can increase Y by a unit and see if X will
change. If X changes, this means that the causal models in Figs. 11.1 and 11.2 are
rejected, because according to their predictions, Y is not a cause of X and should
not change (Tu et al. 2008).
Suppose we do not observe any change in X when Y is increased by one unit;
this refutes the prediction of the model in Fig. 11.4. But what about the
models in Figs. 11.1 and 11.2? How can we know which one is the best? In fact,
there is no genuinely causal relationship in the model in Fig. 11.1, and as a result, no
experiment can be undertaken to test its truthfulness. This is why in DAGs, double
arrows are not allowed. For the causal model in Fig. 11.2, although there is no
double arrow, conducting experiments to test it will not be easy. First, we need to
identify the unobserved variable U and measure it. When U increases, we expect to
see changes in both X and Y. The problem is if we observe no change in X and Y, this
is not sufficient to reject the model, because we might identify an incorrect U.
Therefore, the model in Fig. 11.2 needs to pass three tests: (1) when X changes,
Y does not change; (2) when Y changes, X does not change; (3) U is identified and
when U changes both X and Y change. Then we may tentatively accept this model as
the most plausible one amongst the three causal models.
The discussion so far may look recognisable for readers who are familiar with
the writings of Sir Karl Popper, an influential philosopher of science in the last
century. His famous slogan, conjectures and refutations, was once considered the
demarcation criterion between science and pseudo-science. A good scientific
theory clearly specifies the conditions where it may be rejected, i.e. to make
predictions which have not been but can be observed, and then experiments are
designed to test the predictions. If the scientific theory passes the test, its truth-
content has increased; if it fails the test, it may need to be modified or in extreme
circumstances rejected. Of course, in real research, the process is quite complex, as
we need to take into account the accuracy of our measurements and to decide the
extent of deviations between predictions and observations that suffice to falsify our
theory. Note that according to Popperian philosophy, we cannot ‘prove’ a theory
true by undertaking an experiment or observation, because the theory may fail in
the next test. There is an asymmetry between the acceptance and refutation of a scientific
theory: one experiment or observation may refute a theory forever, but it can only ever
corroborate, never prove, it (Popper 2002).
Popperian philosophy was once very popular amongst natural and social
scientists, and there have been many discussions about its application in epidemi-
ology. In SEM literature, Popper’s philosophy has been used to defend the values of
SEM in finding causal relationships in data from passive observations (Bollen
1989). In the following sections, we will discuss why it is not always easy to
practise (so-called naïve) Popperian conjectures and refutations in SEM.
One limitation of classical SEM analysis is that the manifest (observed) variables
for the outcomes have to be continuous variables such as Y in Fig. 11.3 and X in
Fig. 11.4 (there is no such limitation for X in Fig. 11.3 and Y in Fig. 11.4). Recent
advances in SEM theory and software development have overcome this by
implementing new estimation procedures (Little et al. 2007; Muthén 2001).
These new developments make SEM a useful tool for causal modelling in epidemi-
ology, because many outcome variables in epidemiology are binary or counts.
DAGs, which have been known to epidemiologists for nearly two decades, have
received greater attention in the last few years (see Chaps. 1 and 4 for more details).
They have mainly been used by epidemiologists to identify confounders and
potential biases in the estimation of causal relationships, and DAGs are a particular
type of path diagram.
The question of which variables are confounders and should be adjusted for in
statistical analysis has been a controversial issue within epidemiology (Weinberg
2005; Kirkwood and Sterne 2003; Jewell 2004). Considering the problem in terms of
DAGs helps to resolve this issue. According to DAG theory, confounders
are variables which are causally associated with both the outcome and exposure but
are not on the causal pathway from the exposure to the outcome variable
(Greenland et al. 1999; Pearl 2000; Tu et al. 2005; Glymour 2006; Glymour and
Greenland 2008). For instance, variable Z is a confounder for the relation between
the exposure X and the outcome Y in Fig. 11.5a, because there are arrows from Z to
X and Y (i.e. they are causally associated), and Z is not on the causal path from X to
Y. In contrast, Z is not a confounder for the relation between the exposure X and the
outcome Y in Fig. 11.5b, because although there are arrows from Z to Y and from X
to Z (i.e. they are causally associated), Z is on the causal path from X to Y.
Therefore, if we want to estimate the impact of X on Y, Z is a confounder and
should be adjusted for according to Fig. 11.5a, but Z is not a confounder according
to 11.5b. From the viewpoint of path diagram, the adjustment of Z in Fig. 11.5b is
the partition of direct and indirect effect, and this is very common in SEM literature.
Most SEM software can produce the results of direct and indirect effects upon
request. The path from X to Y in Fig. 11.5b is interpreted as the direct effect of X on
Y, and the path from X to Z and Y is the indirect effect of X on Y. To estimate the
former, we need to adjust for an intermediate variable between X and Y. This
practice is also known as ‘mediation’ analysis in social sciences (MacKinnon
2008). The total effects are just the sum of direct and indirect effects. Therefore,
although the adjustment of Z in a regression model will change the estimate of the
regression coefficient for X, this matters little in SEM because both unadjusted and
adjusted regression coefficients are reported: the former represents the total effect and the
latter the direct effect.
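A small simulation (not taken from the chapter; the coefficients and sample size are invented) makes the distinction concrete: regressing Y on X alone recovers the total effect, whilst additionally adjusting for the intermediate variable Z recovers the direct effect only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
# Data generated according to the structure of Fig. 11.5b: X -> Z -> Y and X -> Y
X = rng.normal(size=n)
Z = 0.5 * X + rng.normal(size=n)
Y = 0.3 * X + 0.4 * Z + rng.normal(size=n)

total = sm.OLS(Y, sm.add_constant(X)).fit()                          # unadjusted
direct = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()   # adjusted for Z
print(total.params[1])    # about 0.50 = 0.3 + 0.5*0.4, the total effect
print(direct.params[1])   # about 0.30, the direct effect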
We may ask why Z is a confounder and should be adjusted for in Fig. 11.5a and
what would happen if Z is not adjusted for. Before we answer this question, we first
look at Fig. 11.2 again. Suppose the arrows from U to X and Y represent positive
associations. When U increases, we will observe that both X and Y increase. If we
do not know that there is U behind the observed increases in X and Y, we may
therefore conclude that either X influences Y or vice versa, but actually if we change
the values of X (or Y), nothing would happen to Y (or X). U is therefore a confounder
for the relation between X and Y, and this can be identified by tracing the path from
X to Y or Y to X known as a backdoor path (see Chap. 1). When there are backdoor
paths from the exposure to the outcome, the estimate of their causal relation is
contaminated (in statistical jargon, biased). To block the backdoor paths, variables
such as U need to be adjusted for, and in epidemiological terminology, these
variables are confounders.
A related issue is to identify colliders in DAG (see Chap. 1 for the definition of a
collider). Consider Fig. 11.6. There are paths from X to Z and Y to Z, i.e. changing
either X or Y will give rise to change in Z, but changes in X will not cause changes in
Y, and vice versa. However, if we adjust for Z when we regress Y on X (or X on Y),
we will find a spurious association between Y and X. The non-mathematical
explanation for this phenomenon is as follows: we know both X and Y can influence
Z, say, positively; if we observe a positive change in Z, we know either X or Y is the
cause but we are uncertain which. However, if we then learn that X had
changed, that Y had also changed becomes less probable than that Y had not, so a
negative relation between X and Y would be observed. Mathematically speaking, X
and Y are independent unconditionally, but they are dependent, conditional on Z.
In this scenario, Z acts as a collider, because two arrows (one from X and the other
from Y) go toward it, so Z blocks the pathway from X to Y (and Y to X). However,
statistical adjustment of Z will open this path, and X and Y will become correlated.
In the path diagram, X and Y are assumed to be independent, i.e. when their
relationships with Z are estimated, the correlation between X and Y will be
constrained to be zero, even though their observed correlation may not be zero.
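The same kind of simulation (again with invented coefficients, not drawn from the chapter) shows the collider problem: X and Y are generated independently, yet adjusting for their common effect Z induces a clearly negative, entirely spurious association between them.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100_000
# Data generated according to the structure of Fig. 11.6: X -> Z <- Y
X = rng.normal(size=n)
Y = rng.normal(size=n)
Z = 0.6 * X + 0.6 * Y + rng.normal(size=n)

crude = sm.OLS(Y, sm.add_constant(X)).fit()
adjusted = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()
print(crude.params[1])      # about 0: X and Y are unrelated
print(adjusted.params[1])   # negative: spurious association opened by adjusting for the collider Z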
Structural equation modelling (SEM) looks at the model in Fig. 11.7 from a slightly
different perspective. To simplify our discussion, we standardise all the variables
used in Fig. 11.7, so that their means are zero and their variances are one. The lower
case letters accompanying each path represent the standardised path coefficients
which can be interpreted as standardised regression coefficients from multiple
regression analysis. From Fig. 11.7 it is possible to work out the estimated correlations
between each pair of variables in the model using Sewall Wright's rules of tracing
(Loehlin 2004). Wright was a geneticist and invented path analysis in the 1920s. His
path analysis was largely ignored by statisticians but adopted by econometricians in
the 1950s. In the 1960s and 1970s, path analysis and factor analysis were incorporated into
a single general statistical framework, SEM, by Karl Jöreskog and others.
Wright’s rules can be summarised as follows:
1. No loops are allowed. In tracing from one variable to another, the same variable
cannot be passed through twice.
2. No going forward and then backward. Having followed a path forward, e.g.
   the path from X to Y (X → Y) in Fig. 11.7, it is not allowed then to trace
   backward across a path, e.g. backward from Y to Z (Y ← Z) in Fig. 11.7.
   However, going backward and then forward is allowed, e.g. following the path
   backward from Y to Z (Y ← Z) and then forward from Z to X (Z → X) in
   Fig. 11.7.
3. Only one double arrow is allowed in tracing from the first variable to the last
variable, e.g. tracing from X to Y is allowed in Fig. 11.1.
Note that we will not need rule No.3 for DAGs for reasons explained previously,
and the rule No.2 is equivalent to how we identify backdoor paths and colliders in
the previous section. It is better to use examples to explain how to apply these rules.
For instance, the correlation between U and V in Fig. 11.7 is zero according to
rule No. 2. The correlation between X and Z is (b + d·e): the first term is the direct
effect from Z to X and the second the confounding effect due to U, i.e. the estimated
correlation between X and Z is the sum of the genuine effect and the spurious
confounding effect. If the model is correct, this will also be the observed correla-
tion between X and Z in the population. The correlations between the other pairs of
variables can be derived in the same way:
rU,V = 0  (11.1)
rX,Z = b + d·e  (11.2)
rX,Y = a + b·c + d·e·c + b·f·g  (11.3)
rX,V = b·f  (11.4)
rX,U = d + b·e  (11.5)
rU,Z = e  (11.6)
rU,Y = e·c + e·b·a + d·a  (11.7)
rV,Y = g + f·c + f·b·a  (11.8)
rV,Z = f  (11.9)
rZ,Y = c + b·a + e·d·a + f·g  (11.10)

rZ,X = b′  (11.11)
rZ,Y = b′ + a′·c′  (11.12)
rX,Y = a′ + b′·c′  (11.13)
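The implied correlations can be checked numerically. The sketch below simulates standardised data from a model with the path structure implied by Eqs. 11.1–11.10 (the path values are invented, and the residual variances are chosen so that every variable has unit variance), and compares a few of the empirical correlations with the tracing-rule formulas:

import numpy as np

# Structure consistent with Eqs. 11.1-11.10: U -> X (d), U -> Z (e), V -> Z (f),
# V -> Y (g), Z -> X (b), Z -> Y (c), X -> Y (a); U and V are independent
a, b, c, d, e, f, g = 0.3, 0.25, 0.2, 0.15, 0.2, 0.25, 0.3
rng = np.random.default_rng(42)
n = 1_000_000
U = rng.normal(size=n)
V = rng.normal(size=n)
Z = e*U + f*V + rng.normal(size=n) * np.sqrt(1 - e**2 - f**2)
X = b*Z + d*U + rng.normal(size=n) * np.sqrt(1 - (b**2 + d**2 + 2*b*d*e))
Y = a*X + c*Z + g*V + rng.normal(size=n) * np.sqrt(
        1 - (a**2 + c**2 + g**2 + 2*a*c*(b + d*e) + 2*a*g*b*f + 2*c*g*f))

print(np.corrcoef(X, Z)[0, 1], b + d*e)                      # Eq. 11.2
print(np.corrcoef(X, Y)[0, 1], a + b*c + d*e*c + b*f*g)      # Eq. 11.3
print(np.corrcoef(Z, Y)[0, 1], c + b*a + e*d*a + f*g)        # Eq. 11.10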
The basic rationale behind model testing in SEM is straightforward: multiple
linear equations are used to specify causal relationships between variables, some of
which are manifest variables (observed and collected by the researchers), while
others are latent variables derived from the observed variables by specifying their
relations using equations, such as those in factor analysis. The multiple equations in
each causal model imply a particular correlation structure between the observed
variables, usually presented as a correlation (or covariance) matrix Σ.
The estimation procedure is to minimise the difference between Σ and the observed
correlation or covariance matrix S, as formulated by the maximum likelihood fit function:

FML = log|Σ| − log|S| + trace(SΣ⁻¹) − (p + q),

where p + q is the number of observed variables.
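For a given pair of matrices the fit function is simple to evaluate; the sketch below (Python, with an invented two-variable example) merely applies the formula, whereas SEM software minimises it over the model parameters that generate Σ:

import numpy as np

def f_ml(Sigma, S):
    # Maximum likelihood fit function comparing a model-implied covariance
    # matrix Sigma with an observed covariance matrix S
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma - logdet_S + np.trace(S @ np.linalg.inv(Sigma)) - p

S = np.array([[1.0, 0.3],
              [0.3, 1.0]])
print(f_ml(S, S))   # 0.0: a model that reproduces S exactly fits perfectly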
One controversy in SEM model testing is whether a large χ² value relative to
the model degrees of freedom necessarily means that the proposed model is wrong
and should be modified or even rejected, because the statistical power of the χ² test
to reject a model increases with the sample size. As a result, a causal model whose
fit is deemed acceptable in a small sample may be considered unacceptable when the
sample size increases. The problem is inherent in hypothesis testing (Gardner and
Altman 1986; Altman and Bland 1995). Therefore, many alternative model fit
indices have been proposed that take into account the sample size and model
degrees of freedom. Many software packages also provide modification indices to
help researchers identify possible ways to modify the models to reduce the χ² value.
Model modification should be guided by more than just the reduction in the χ²
value, because this may be entirely caused by chance, and the modified model may
make little sense from a theoretical point of view.
A more fundamental issue in SEM model testing is why the aim of SEM is to
produce a model that has the same correlation/covariance structure in the sample as
that in the population. The rationale is that if a model faithfully represents the “true”
causal relationships amongst variables, the estimated relationships should corre-
spond to the observed ones in the data. In reality, there may be biases in the data
collection that cause the observed correlation/covariance structure to deviate from
the true one in the population. Furthermore, reality is often complex, and many of
the causes and effects may be subtle and intricate; we may never be able to capture a
full picture. Even if we do, the model may be too complex to be useful, and a
simpler model with a simplified version of theory may be more useful for our
understanding of reality. According to the popular version of Popperian philosophy
(we call it popular version, because Popper himself recognised the process of
conjectures and refutations is far more complex in the practice of science), a
model may survive many attempts to refute it, but if it fails just one test, we should
modify or even reject it. Is it always a good approach that we give up our carefully
formulated model and modify our theory because it does not fit one data set? Or
should we try to identify the reasons for the poor fit, such as random sampling
errors?
ones, i.e. the conditional relationships and covariance structure in the proposed
DAGs and path diagrams may be very close to the ones in the sample data, but this
alone may not be able to prove the proposed causal relationships. Ideally, we would
like to conduct experiments to make changes to the system in the model and
observe whether the consequences follow the predictions made by the model.
This may not always be feasible, but that does not mean no causal inference can be
made without experiments. For example, we could never take the Moon away from our
galaxy and then see what change this might cause to how the Earth moves around
the sun; but we still believe Newton's law is correct. Causal thinking can still be
incorporated in the statistical analysis and causal inference can still be made using
data from careful observations (Arah 2008, Tu 2009).
Two book chapters (Glymour 2006, and Glymour and Greenland 2008) provide a
comprehensive but accessible coverage of DAGs for epidemiologists. Judea Pearl’s
book (Pearl 2000) discusses both DAGs and SEM in much greater depth, but people
without a strong statistical background may find it difficult. Kline (2011) and Loehlin
(2004) are both good introductory textbooks on SEM without relying too much on
linear algebra for explaining the concepts.
References
Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. BMJ, 311,
485.
Arah, O. A. (2008). The role of causal reasoning in understanding Simpson’s paradox, Lord’s
paradox, and the suppression effect: Covariate selection in the analysis of observational
studies. Emerging Themes in Epidemiology, 5, 5.
Bollen, K. (1989). Structural equations with latent variables. Hoboken, NJ: Wiley.
Cartwright, N. (1989). Nature’s capacities and their measurement. Oxford: Oxford University
Press.
Der, G. (2002). Commentary: Structural equation modelling in epidemiology: Some problems and
prospects. International Journal of Epidemiology, 31, 1199–1200.
Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation
rather than hypothesis testing. BMJ, 292, 846–850.
Glymour, M. M. (2006). Using causal diagrams to understand common problems in social
epidemiology. In J. M. Oakes & J. S. Kaufman (Eds.), Methods in social epidemiology
(pp. 393–428). San Francisco: Jossey-Bass.
Glymour, M. M., & Greenland, S. (2008). Causal diagrams. In K. J. Rothman, S. Greenland, &
T. L. Lash (Eds.), Modern epidemiology (3rd ed., pp. 183–209). Philadelphia: Lippincott
Williams & Wilkins.
Greenland, S., & Brumback, B. (2002). An overview of relations among causal modelling
methods. International Journal of Epidemiology, 31, 1030–1037.
Greenland, S., Pearl, J., & Robins, J. M. (1999). Causal Diagrams for epidemiologic research.
Epidemiology, 10, 37–48.
Hancock, G., & Samuelson, K. M. (Eds.). (2007). Advances in latent variable mixture models.
Charlotte: IAP.
Iacobucci, D. (2008). Mediation analysis. Thousand Oaks: Sage.
Jewell, N. P. (2004). Statistics for Epidemiology. Boca Raton, Florida: Chapman & Hall.
Kirkwood, B., & Sterne, J. A. C. (2003). Essential medical statistics (2nd ed.). London: Blackwell.
Kline, R. B. (2011). Principles and practice of structural equation modeling (3rd ed.). New York:
Guilford press.
Little, T. D., Preacher, K. J., Selig, J. P., & Card, N. A. (2007). New developments in latent
variable panel analyses of longitudinal data. International Journal of Behavioral Development,
31, 357–365.
Loehlin, J. C. (2004). Latent variable models (4th ed.). Mahwah: Lawrence Erlbaum Associates.
MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. Mahwah: Lawrence
Erlbaum Associates.
Muthén, B. (2001). Second-generation structural equation modelling with a combination of
categorical and continuous latent variables: New opportunities for latent class/latent growth
modeling. In L. M. Collins & A. Sayer (Eds.), New methods for the analysis of change
(pp. 291–322). Washington, DC: American Psychological Association.
Muthén, L. K., & Muthén, B. (2006). Mplus user’s guide (4th ed.). Los Angeles: Muthén &
Muthén.
Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge: Cambridge University
Press.
Popper, K. R. (2002). Conjectures and refutations. London: Routledge.
Shipley, B. (2000). Cause and correlation in biology. Cambridge: Cambridge University Press.
Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modelling: Multilevel,
longitudinal and structural equation models. Boca Raton: Chapman & Hall/CRC.
Tu, Y. K., West, R., Ellison, G. T. H., et al. (2005). Why evidence for the fetal origins of adult
disease might be a statistical artifact: The “reversal paradox” for the relation between birth
weight and blood pressure in later life. American Journal of Epidemiology, 161, 27–32.
Tu, Y. K., Gunnell, D. J., & Gilthorpe, M. S. (2008). Simpson’s paradox, Lord’s paradox, and
suppression effects are the same phenomenon – The reversal paradox. Emerging Themes in
Epidemiology, 5, 2.
Tu, Y. K. (2009) Commentary: Is structural equation modelling a step forward for
epidemiologists? International Journal of Epidemiology, 38, 249–251.
Weinberg, C. R. (2005). Invited commentary: Barker meets Simpson. American Journal of
Epidemiology, 161, 33–35.
Chapter 12
Latent Growth Curve Models
Y.-K. Tu (*)
Division of Biostatistics, Centre for Epidemiology and Biostatistics, Leeds Institute of Genetics,
Health & Therapeutics, Faculty of Medicine and Health, University of Leeds, Leeds, UK
e-mail: [email protected]
F. D'Auito
Department of Periodontology, Eastman Dental Institute, University College London, London, UK

12.1 Introduction
To illustrate how to draw path diagrams for growth models, we begin with a simple
example of linear regression with one outcome variable (known as the dependent
variable) and one explanatory variable (known as the independent variable or
covariate). Figure 12.1 is the path diagram for a simple linear regression model
given as:
y = b0 + b1x + e,  (12.1)
where y is the outcome variable, x the explanatory variable, e the residual error
term, b0 the intercept, and b1 the regression coefficient for x.
From Eq. 12.1, we see that when x is zero, y is b0, and when x increases by one
unit, y is expected to increase by the amount of b1. The residual error term is the
difference between the observed values of the outcome and the predicted values of
the outcome. In path diagrams, observed variables such as x and y are within
squares, and latent (unobserved) variables such as residual errors (e in Eq. 12.1)
are within circles. An arrow from variable x to variable y in a path diagram means
that x affects y in the specified statistical model, but y does not affect x. In contrast, a
double arrow connecting x and y means that these two variables are correlated
without specific causal direction. When there is no arrowed line (single or double)
between x and y, this means that x and y are assumed to be causally independent, i.e.
the underlying population correlation between them is assumed to be zero in the
specified model. For instance, x and e are assumed to be uncorrelated, and this is
one of the assumptions behind regression analysis: explanatory variables and
residual errors are independent.
The arrow from one variable to another is called a path in the diagram. In
Fig. 12.1, there are two paths that specify the relationships between variables in
the model: one from x to y, and another from e to y. As a result, two parameters
associated with those two paths may need to be estimated. The parameter for the
path from x to y is b1, which is unknown and needs to be estimated, but another
parameter for the path from e to y is fixed to be unity. Only one free parameter for
the relationship between x and y requires estimation, though other free parameters
also require estimation, for example, the variances of x and e.
In a path diagram such as Fig. 12.1, variables like x are known as exogenous
variables because there is no arrow from another variable in the model directed
towards them. In contrast, variables like y are known as endogenous variables,
because there is at least one arrow from other variables (x in this model) directed
toward it. Endogenous variables are accompanied by residual errors, such as e in
our model, because it is unlikely that the variations in y can be completely
explained by x. SEM estimates the means and variances for exogenous variables
whilst estimating the intercepts for the endogenous variables. This is because the
variance of an endogenous variable is derived from exogenous variables as well as
residual errors associated with the endogenous variable. For example, in the linear
regression given in Eq. 12.1, the intercept for y will be estimated by b0. Both the
mean and variance of x will be estimated, although they are not explicitly expressed
in Eq. 12.1. The mean of the residual errors is fixed to zero and the path from it to
the associated endogenous variable is fixed to be unity (reflected by the regression
coefficient for e in Eq. 12.1 being 1). Therefore, the only parameter to be estimated
is its variance. The mean and variance of y can then be derived from Eq. 12.1. Note
that observed and unobserved variables can be exogenous or endogenous variables.
Multiple linear regression tests the relationship between one outcome variable and
more than one explanatory variable. Fig. 12.2 is the path diagram for a multiple
linear regression with three explanatory variables, given as:

y = b0 + b1x1 + b2x2 + b3x3 + e,  (12.2)
12.3.1 Data
For illustration, we use data from a randomized controlled trial (RCT) on the effects
of periodontal treatments on clinical outcomes and laboratory biomarkers for
systemic inflammation (Tonetti et al. 2007). In general, the treatments of gum
(periodontal) diseases aim to control infection and inflammation by eradicating the
periodontal pathogens within the dental plaque on the tooth or root surfaces. The
periodontal pocket is the small space between a tooth and the surrounding gum
(gingivae), and its healthy depth is usually about 1–3 mm. As periodontal disease
progresses, the depth of periodontal pocket increases due to both the swollen gum
(caused by inflammation) and the loss of attachment between the tooth and the
surrounding supporting structure (such as periodontal ligament and bone). Pocket
depth is measured by a periodontal probe with markings, and it is the most
commonly used clinical variable for measuring periodontal diseases and treatment
effects. Many recent studies have shown an association between periodontal infec-
tion and an increased risk of cardiovascular diseases. The aim of the original study
was to test whether changes in clinical outcomes were associated with the changes
in inflammatory biomarkers and vascular function.
The details of the RCT have been reported elsewhere (Tonetti et al. 2007).
To summarise, 120 patients with chronic periodontal diseases were randomized
into two groups: the control group (59 patients) received conventional periodontal
treatment (CPT), i.e. professional cleaning of teeth without removal of dental
plaque and calculus in the periodontal pockets. This is the treatment that patients
would receive from their general dentist. The test group (61 patients) received
intensive periodontal treatment (IPT), i.e. specialist periodontal treatments to
remove dental plaque and calculus within periodontal pockets within a single
appointment. It is called intensive, because traditionally the specialist treatment
was usually given in several appointments over a few weeks. Previous studies have
shown that intensive periodontal treatment may induce short-term sharp rise in the
level of inflammatory biomarkers.
Fig. 12.3 Profile plots of average pocket depths (mm) for the CPT and IPT groups at baseline, 2 and 6 months
Full mouth pocket depths were measured three times over this 6-month
study: at baseline, 2 months and 6 months. Blood tests for inflammatory biomarkers
were undertaken at baseline, 1 day, 7 days, 1 month, 2 months and 6 months. In this
section, we first look at the difference in the changes in pocket depths, and Fig. 12.3
shows the individual trajectories of pocket depths for the two groups.
A two-level multilevel model for the analysis of change in the pocket depths for these patients can be written as:

PDij = p0ij + p1j Monthij, (12.3)

p0ij = b0 + b2 Txj + u0j + e0ij, (12.4)

and

p1j = b1 + b3 Txj + u1j. (12.5)

Expanding by substituting p0ij and p1j, the new multilevel model becomes:

PDij = b0 + b1 Monthij + b2 Txj + b3 Tx*Monthij + u1j Monthij + u0j + e0ij, (12.6)
where PD is the average full mouth pocket depth in millimeters on the ith occasion (level 1, i = 1, 2, 3) for the jth subject (level 2, j = 1, …, 120), Month is time in months since baseline (i.e. 0, 2, 6), Txj is a binary variable (IPT coded 1 and CPT
coded 0), and Tx*Monthij is a product interaction term between Tx and Month
(i.e. Months multiplied by Tx).
The two-level multilevel model given by Eqs. 12.3–12.5 is a linear growth model, i.e. a straight line was fitted to the pocket depths measured on three occasions between baseline and 6 months for each of the 120 patients. The baseline pocket
depths and their changes varied across patients, so there were variations in the
intercepts and slopes of the fitted straight lines. These variations were modelled as
normally distributed random effects in MLM, and as we shall explain later,
modelled as latent variables in LGCM.
The intercept b0 is the average baseline pocket depth for the CPT group, and b2 (the regression coefficient for Txj) is the additional baseline pocket depth for the IPT group, i.e. the average baseline pocket depth for the IPT group is b0 + b2. The regression coefficient for Monthij is b1, which is the estimated average change in pocket depth per month for CPT; and b3 (the regression coefficient for the interaction term Tx*Monthij) is the difference in the slopes between the CPT and IPT groups. The slope, b1, is the predicted amount of change in pocket depth per month for the CPT group, and the total amount of change over 6 months is therefore b1 × 6. The slope for the IPT group is b1 + b3, and the total amount of predicted change in pocket depth is (b1 + b3) × 6. As a result, the difference in the change between the two groups over 6 months is b3 × 6.
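For readers who wish to try this outside SEM software such as Mplus, the sketch below shows one way to fit the multilevel model of Eqs. 12.3–12.6 in Python with statsmodels; the data frame and its column names (id, month, tx, pd) are assumptions for illustration, not part of the trial dataset.

import pandas as pd
import statsmodels.formula.api as smf

# long-format data: one row per patient-visit, with columns
#   id (patient), month (0, 2, 6), tx (0 = CPT, 1 = IPT), pd (mean full-mouth pocket depth, mm)
long = pd.read_csv("perio_long.csv")              # hypothetical file name

# Fixed part b0 + b1*Month + b2*Tx + b3*Tx*Month; random intercept u0j and random slope u1j
model = smf.mixedlm("pd ~ month * tx", data=long, groups=long["id"], re_formula="~month")
fit = model.fit(reml=True)
print(fit.summary())

b = fit.fe_params
print("Predicted 6-month change, CPT:", 6 * b["month"])                     # b1 * 6
print("Predicted 6-month change, IPT:", 6 * (b["month"] + b["month:tx"]))   # (b1 + b3) * 6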
The multilevel model in Sect. 12.3.2 can be specified using LGCM, and the path
diagram in Fig. 12.4 shows the general concept of LGCM. As explained previously,
observed and measured variables are represented by squares. In this model, the
observed variables are the three measurements of pocket depths made at baseline
(PD0M), 2 months (PD2M), and 6 months (PD6M). Note that software for MLM
usually requires the data in long format, for example, the variable PDij in Eqs. 12.3
and 12.6 is created by stacking the three measurements of pocket depth into
one column. In contrast, software for LGCM requires the data in wide format,
i.e. the outcome measured on different occasions is treated as three separate
variables. Another observed variable in the model is Tx, which is a binary variable
(CPT is coded 0 and IPT coded 1). The parameters m3 and v3 are the mean and
variance of Tx.
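The long/wide distinction is easy to illustrate; the following sketch (with made-up pocket-depth values) reshapes a long-format data frame, as used by MLM software, into the wide format required by most SEM packages, and back again.

import pandas as pd

long = pd.DataFrame({                              # long format: one row per patient-visit
    "id":    [1, 1, 1, 2, 2, 2],
    "month": [0, 2, 6, 0, 2, 6],
    "pd":    [4.5, 3.1, 2.9, 5.0, 3.6, 3.2],       # illustrative values only
})

wide = long.pivot(index="id", columns="month", values="pd")   # wide format: one row per patient
wide.columns = ["PD0M", "PD2M", "PD6M"]
print(wide)

back = wide.reset_index().melt(id_vars="id", var_name="occasion", value_name="pd")
print(back.sort_values(["id", "occasion"]))        # stacked back into a single outcome column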
The latent variables are represented by circles: F1 and F2 are two latent
variables which model the growth trajectories (i.e. the change patterns) for the
pocket depth. The parameters m1 and m2 are the intercepts for F1 and F2; the D1
and D2 are residual error terms for F1 and F2. Recall that for endogenous variables
(i.e. F1, F2, and the three measurements of pocket depth in this model) only the
intercepts are estimated, because they are affected and ‘explained’ by exogenous
variables (i.e. Tx in this model) and their associated residual errors (D and E).
Like residual error terms in regression analysis, the means of D1 and D2 are fixed to
be zero, and d1 and d2 are their variances, respectively. E1 to E3 are the error terms
for each observed variable; E1 to E3 are assumed to be uncorrelated and to have a
mean of zero. By contrast, the two latent variables F1 and F2 are assumed to be
correlated (there is a double arrow between them indicating that their covariance is
estimated in Fig. 12.4).
Note that F1 and F2 are unobserved (latent) variables or factors, which means
that unlike PD, they are not directly measured but are estimated by extracting
information from the observed variables. Therefore, the meaning of F1 and F2
depends upon how this information is extracted, i.e. it depends upon how the
relationships between them and PD are defined in the model by specifying the
parameters for the arrows from F1 and F2 to PD. The regression coefficients for the
arrows from F1 to the three measurements of PD are fixed to be unity, and those for
the arrows from the residual errors (D and E) are also fixed to be unity. So a latent
growth curve model or structural equation model can be viewed as an attempt to use
multiple equations to define the relationships amongst observed and unobserved
variables in the model.
For instance, the equation for the relationship between PD0M and the other variables in Fig. 12.4 is given as:

PD0M = 1 × F1 + l1 × F2 + 1 × E1. (12.7)

Similarly, the equations for PD2M and PD6M in Fig. 12.4 are given as:

PD2M = 1 × F1 + l2 × F2 + 1 × E2, (12.8)

PD6M = 1 × F1 + l3 × F2 + 1 × E3. (12.9)
Fig. 12.5 Profile plots of log-transformed IL-6 levels in blood samples at baseline, 1, 2 and 6 months, with separate panels for the CPT and IPT groups (y-axis: log IL-6; x-axis: months since baseline)
F1 = m1 + g1 × Tx + 1 × D1, (12.10)

F2 = m2 + g2 × Tx + 1 × D2, (12.11)
where m1 and m2 are the intercepts in Eqs. 12.10 and 12.11 for F1 and F2, respectively; and l1, l2, l3, g1 and g2 are regression coefficients. Recall the simple linear regression given by Eq. 12.1: y = b0 + b1 x + e, where b0 and b1 are two unknown parameters which need to be estimated. The regression coefficient for e is actually fixed to be unity, just like those for E1 to E3 and D1 to D2 in Eqs. 12.7–12.11. Therefore, we can view LGCM (or SEM in general) as a system of multiple equations for the relationships amongst the observed and latent variables, and these relationships are identified by solving the equations simultaneously. In Eqs. 12.7–12.11, some of the parameters have been given (such as the factor loadings fixed to be unity), and the unknown parameters, such as g and l, need to be estimated. In
Fig. 12.4, all the means of residual errors are fixed to be zero, just like the means of
residual errors in the ordinary regression models. It is noted that the intercepts of
observed outcome variables PD0M, PD2M and PD6M in Eqs. 12.7–12.9 are also
fixed to be zero, because the expected means of these variables will be estimated via
the latent variables F1 and F2.
To specify a linear growth curve model, the factor loadings l1, l2 and l3 (the paths from F2 to PD0M, PD2M and PD6M) are fixed to be 0, 2 and 6, i.e. the months since baseline at which the pocket depths are measured. The latent variables F1 and F2 are then equivalent to
the estimated baseline pocket depths and the estimated changes in pocket
depths over 6 months, respectively. Recall that Eq. 12.7 is written as: PD0M = 1 × F1 + l1 × F2 + 1 × E1. When l1 = 0, PD0M = 1 × F1 + 1 × E1, i.e. PD0M is decomposed into a latent variable F1 and a residual error variable E1. In other words, F1 can be viewed as the unobserved true PD0M, with measurement errors and random variations removed. Therefore, F2, which is estimated from the differences between PD0M, PD2M and PD6M, is the estimated change from F1.
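Putting Eqs. 12.7–12.11 together, the model-implied mean pocket depth at each occasion is (m1 + g1 × Tx) + l × (m2 + g2 × Tx), with the loading l fixed at the measurement occasion (0, 2 or 6 months) in the linear model. The short Python sketch below composes the equations in this way; the parameter values are placeholders, not the estimates reported later.

# Model-implied means obtained by substituting Eqs. 12.10-12.11 into Eqs. 12.7-12.9;
# residual errors (D, E) have mean zero, so they drop out of the expectation.
def implied_mean_pd(tx, loading, m1, m2, g1, g2):
    f1 = m1 + g1 * tx            # expected baseline level (F1)
    f2 = m2 + g2 * tx            # expected change per unit of loading (F2)
    return f1 + loading * f2

# placeholder parameter values, for illustration only
params = dict(m1=4.7, m2=-0.1, g1=0.0, g2=-0.04)
for occasion, lam in [("PD0M", 0), ("PD2M", 2), ("PD6M", 6)]:
    print(occasion,
          "CPT:", round(implied_mean_pd(0, lam, **params), 2),
          "IPT:", round(implied_mean_pd(1, lam, **params), 2))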
The variances of D1 and D2 (d1 and d2), i.e. the variations in the estimated
baseline pocket depths and changes in pocket depths, are equivalent to the random
effects of the intercept (u0j) and slope (u1j) in Eqs. 12.4 and 12.5, respectively.
By constraining the variances of E1, E2, and E3, i.e. e1, e2, and e3, to be equal, they
are equivalent to e0ij in Eq. 12.4. In Eqs. 12.5 and 12.6, b3 is the difference in the
change in average pocket depth per month between CPT and IPT groups over the
6-month observation, and its equivalent in LGCM is g2. The equivalents of b0, b1, and b2 in LGCM are m1, m2 and g1, respectively.
In summary, for longitudinal data analysis, both MLM and LGCM estimate a
linear growth trajectory (i.e. a change pattern) for each patient. Variations in the
intercepts and slopes of these trajectories, regarded as random effects in MLM, are
explicitly specified in LGCM as latent variables, because unlike PD, which is
directly observed and measured, these trajectories and their variations (random
effects) are unknown and need to be estimated.
The results from Mplus using maximum likelihood estimation are shown in
Table 12.1. Mplus (and other SEM software) provides many additional indices
for assessing model fit. The two most commonly used are the Chi-square test and the Root Mean Square Error of Approximation (RMSEA). The Chi-square value for the model is 250.5 with 4 degrees of freedom (P < 0.001), and the RMSEA is 0.717. In
SEM, the null hypothesis for the Chi-square test is that there is no difference in the
covariance structures between the proposed model and the data, and a P-value
greater than 0.05 means that we cannot reject the null hypothesis. This is different
from the usual null hypothesis testing where researchers seek to reject the null
hypothesis. In contrast, a small RMSEA means that the proposed model fits the data
relatively well (0.06 is usually used as the cut-off value) (Kline 2011). Clearly, this linear latent growth curve model is not acceptable and requires modification. Moreover, the estimated variance of D2 is −0.013, which is not acceptable, because a variance is the square of a standard deviation and should always be positive. A negative variance estimate is known as a Heywood case in the SEM literature (Loehlin 2004; Kline 2011), i.e. an offending estimate which indicates serious flaws in the model specification. This should not be surprising, as Fig. 12.3 clearly shows
Table 12.1 Results of univariate latent growth curve models for pocket depths following the SEM outlined in Fig. 12.4
that the change in pocket depths did not appear to be linear. The traditional approach would be to add a quadratic term for Month in order to fit a curvilinear model. However, as there are only three measurements of pocket depths, quadratic curve fitting is not the best approach. Although the results seem to suggest that patients in the IPT group had a lower average pocket depth at baseline (0.392 mm, P < 0.001) and a greater pocket depth reduction (0.218 mm, P < 0.001), the validity of these results is highly questionable.
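The RMSEA quoted above can be reproduced from the Chi-square statistic, its degrees of freedom and the sample size; a minimal Python sketch using one common form of the formula (some software uses N rather than N − 1) is:

from math import sqrt

def rmsea(chi2, df, n):
    """RMSEA = sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# linear growth curve model for pocket depth: chi2 = 250.5, df = 4, n = 120 patients
print(round(rmsea(250.5, 4, 120), 2))   # about 0.72, as reported above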
The observed trajectories shown in Fig. 12.3 suggest a non-linear growth curve, and
there are many simple and advanced approaches to model non-linear curves in the
statistical literature (see Chap. 15). Because there are only three measurements of
pocket depths over the 6-month period, most advanced methods, such as fractional
polynomials and splines (see Chap. 15), are not suitable.
LGCM provides an elegant way to model the non-linear growth curve. Recall
that in the linear growth curve model, the paths from F2 to PD0M, PD2M and
PD6M (l1, l2, and l3) are fixed to be 0, 2 and 6, respectively. To capture the non-
linearity, we can fix the first path (l1) to be 0 and the final path (l3) to be 6 but
allow l2 to be a free parameter for estimation (Bollen and Curran 2006; Duncan
et al. 2006; Tu et al. 2008). If the estimated values for l2 are close to 2, this
indicates that the growth curves are approximately linear. The results from this
approach using Mplus (Table 12.2) show that l2 is 5.45, indicating that most of the
change in pocket depths occurred during the first 2 months. The Chi-square value
for the model is 7.64 with 3 degrees of freedom (P = 0.054) and RMSEA = 0.114.
In contrast to the results from the previous model, patients in the IPT group did not
have a significantly lower average pocket depth at baseline (0.009 mm,
P = 0.948), but they did show a greater pocket depth reduction (0.253 mm,
P < 0.001) at 6 months.
The large difference in the Chi-square values between the linear and nonlinear
models indicates a substantial improvement in the model fit. However, the Chi-
square value and RMSEA are still not ideal. The P-value for the Chi-square test is only just greater than 0.05, and the statistical power of the Chi-square test to reject a structural equation model is related to sample size. As our sample size is moderate, we should not feel complacent about the result of the Chi-square test. Also note that when the degrees of freedom become zero, the model will achieve a perfect fit, i.e. the Chi-square value will necessarily become zero. This is known as a saturated model. The aim of statistical model building is therefore to seek models that approximate the relationships between variables in a parsimonious way. A good model is one with a small Chi-square value relative to the model's degrees of freedom. In this
nonlinear model, it is assumed that changes in pocket depth within both the CPT
and IPT groups followed similar patterns, but the profile plots in Fig. 12.3 showed
that there was a steeper change in pocket depth for the IPT group between baseline
Table 12.2 Results of univariate latent growth curve models for IL-6 following the SEM outlined
in Fig. 12.4
Linear model
Regression coefficients
Estimate SE P
F1 Tx 0.121 0.240 0.614
F2 Tx 0.001 0.044 0.982
tIL6_0 F1 1
tIL6_0 F2 0
tIL6_1 F1 1
tIL6_1 F2 1
tIL6_2 F1 1
tIL6_2 F2 2
tIL6_6 F1 1
tIL6_6 F2 6
Intercepts
Estimate SE P
F1 0.220 0.171 0.199
F2 0.032 0.032 0.316
Covariances
Estimate SE P
D1 ↔ D2 0.046 0.031 0.141
Correlations
Estimate
D1 ↔ D2 0.234
Variances
Estimate SE P
E1 0.632 0.058 <0.001
E2 0.632 0.058 <0.001
E3 0.632 0.058 <0.001
E4 0.632 0.058 <0.001
D1 0.412 0.224 <0.001
D2 0.027 0.008 0.001
Model fit indices
Chi-square (df = 10) 11.49, P = 0.32
RMSEA 0.035
and 2 months. This may be the cause of the misfit, and to accommodate this subtle
difference in the trends, we add an additional path from Tx to PD2M in Fig. 12.4.
This additional path is to capture the additional change in pocket depth between
baseline and 2 months in the IPT group. After adding the additional path, the Chi-
square value becomes 1.63 with 2 degrees of freedom (P = 0.44) and the RMSEA is
zero, indicating a further improvement in model fit. This final model shows both
groups had similar average pocket depths at baseline, but the IPT group achieved
greater pocket depth reduction (0.249 mm, P < 0.001) at 6 months.
12.5 Conclusion
Fig. 12.6 Path diagram for multivariate latent growth curve model. To simplify the presentation,
the residual errors for F1 to F4 (D1 to D4) are omitted. The variances (e1) of residual error terms
(E1) for the three pocket depths are fixed to be equal, and the variances (e2) of four residual errors
(E2) for log transformed IL-6 measured at baseline (tIL6_0M), 1 month (tIL6_1M), 2 months
(tIL6_2M) and 6 months (tIL6_6M), respectively, were also fixed to be equal
Both MLM and LGCM, as applied here, assume that the measurement occasions of outcomes were approximately identical for all subjects. If, for example, PD was
measured at baseline, 2 and 6 months after interventions for some patients but at
baseline, 4 and 7 months for others, this will not pose any problem in the analyses
employing multilevel modeling, but this is currently a problem when undertaking
LGCM for some SEM software packages (though not a problem for Mplus). This
reflects the limitations of current statistical software rather than the method itself.
Therefore, researchers should choose the methods (and software) best suited for
their research questions and study design. We strongly encourage epidemiologists
to consult professional statisticians when they plan to use these methods to analyze
their longitudinal data.
Table 12.3 Results of multivariate latent growth curve models following the SEM outlined in
Fig. 12.6
Regression coefficients:
Estimate S.E. P
F1 (PD) Tx 0.007 0.135 0.959
F2 (PD) Tx 0.248 0.018 <0.001
F3 (tIL-6) Tx 0.123 0.237 0.604
F4 (tIL-6) Tx 0.100 0.090 0.265
F3 F1 0.290 0.172 0.091
F4 F2 0.394 0.314 0.290
F4 F1 0.065 0.037 0.078
PD2M Tx 4.714 0.296 <0.001
Intercepts
Estimate S.E. P
F1 4.681 0.096 <0.001
F2 0.026 0.013 0.039
F3 1.578 0.822 0.055
F4 0.324 0.170 0.057
Covariances
Estimate S.E. P
D1 ↔ D2 0.034 0.008 <0.001
D3 ↔ D4 0.050 0.031 0.103
Correlations
Estimate
D1 ↔ D2 0.555
D3 ↔ D4 0.268
Variances
Estimate S.E. P
D1 0.507 0.071 <0.001
D2 0.007 0.001 <0.001
D3 1.369 0.219 <0.001
D4 0.026 0.008 <0.001
E1 0.038 0.005 <0.001
E2 0.632 0.058 <0.001
Model fit indices
Chi-square (df = 21) 21.43, P = 0.43
RMSEA 0.013
For readers with knowledge of linear algebra, Bauer (2003) and Curran (2003)
explain why MLM and LGCM yield the same results. Bollen and Curran (2006) and
Duncan et al. (2006) are two textbooks dedicated to LGCM. The former is more
mathematical. Many examples for the applications of LGCM can be found in
psychological journals. Tu et al. (2008) provides an example for applying LGCM
to biomedical data with multiple outcomes.
References
Bauer, D. J. (2003). Estimating multilevel linear models as structural equation models. Journal of
Educational and Behavioral Statistics, 28, 135–167.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models. Hoboken: Wiley.
Byrne, B. M., & Crombie, G. (2003). Modeling and testing change: An introduction to the latent
growth curve model. Understanding Statistics, 2, 177–203.
Curran, P. (2003). Have multilevel models been structural equation models all along? Multivariate
Behavioral Research, 38, 529–569.
Duncan, T. E., Duncan, S. C., & Strycker, L. A. (2006). An introduction to latent variable growth
curve modeling (2nd ed.). Mahwah: Laurence Erlbaum Associates.
Gilthorpe, M. S., Zamzuri, A. T., Griffiths, G. S., et al. (2003). Unification of the “burst” and
“linear” theories of periodontal disease progression: A multilevel manifestation of the same
phenomenon. Journal of Dental Research, 82, 200–205.
Goldstein, H. (1995). Multilevel statistical models (2nd ed.). New York: Wiley.
Hox, J. (2002). Multilevel analysis. Mahwah: Laurence Erlbaum Associates.
Kline, R. B. (2011). Principles and practice of structural equation modeling (3rd ed.). New York:
Guilford Press.
Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models.
Biometrika, 73, 13–22.
Loehlin, J. C. (2004). Latent variable models: An introduction to factor, path, and structural
equation analysis (4th ed.). Mahwah: Laurence Erlbaum Associates.
Muthén, L. K., & Muthén, B. (2006). Mplus user’s guide (4th ed.). Los Angeles: Muthén &
Muthén.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear model (2nd ed.). Thousand Oaks:
Sage Publication.
Senn, S., Stevens, L., & Chaturvedi, N. (2000). Repeated measures in clinical trials: Simple
strategies for analysis using summary measures. Statistics in Medicine, 19, 861–877.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis. New York: Oxford
University Press.
Steele, F. (2008). Multilevel models for longitudinal data. Journal of the Royal Statistical Society
Series A, 171, 1–15.
Tonetti, M. S., D'Aiuto, F., Nibali, L., Donald, A., Storry, C., Parkar, M., Suvan, J., Hingorani, A. D., Vallance, P., & Deanfield, J. (2007). Treatment of periodontitis and endothelial function. New England Journal of Medicine, 356(9), 911–920.
Tu, Y. K., Jackson, M., Kellett, M., & Clerehugh, V. (2008). Direct and indirect effects of
interdental hygiene in a clinical trial. Journal of Dental Research, 87, 1037–1042.
Twisk, J. W. R. (2003). Applied longitudinal data analysis for epidemiology. Cambridge,
UK: Cambridge University Press.
Twisk, J. W. R. (2006). Applied multilevel analysis. Cambridge, UK: Cambridge University Press.
Chapter 13
Growth Mixture Modelling for Life
Course Epidemiology
Darren L. Dahly
13.1 Introduction
Life course epidemiology is the study of how physical and social exposures occurring
across the entire life course, or even inter-generationally, can impact chronic disease
risk later in life (Ben-Shlomo and Kuh 2002). The life course approach to chronic
disease epidemiology is not a new one, though it was overshadowed during much of
the twentieth century by research on the importance of adulthood lifestyle risk factors
such as smoking and diet (Kuh and Ben-Shlomo 2004). Recently, however, the life
course approach to epidemiology has been given more attention by researchers,
funding agencies, and policy makers (Ben-Shlomo and Kuh 2002; De Stavola et al.
2006; Kuh and Ben-Shlomo 2004; Kuh et al. 2003; Pickles et al. 2007).
A key life course theme is the Developmental Origins of Health and Disease
(DOHaD) paradigm. In its most basic formulation, it hypothesises that environmen-
tal influences during critical periods of development can impact physiology in a
manner that increases disease risk later in life (Gillman 2005; Gluckman and Hanson
2004; Barker 2004). While early DOHaD research focused on nutritional and other
influences during foetal development (Barker 2001), there is a growing interest in
post-natal influences, particularly in the role that post-natal growth may play in the
aetiology of later obesity, diabetes, and cardiovascular disease (Ong and Loos 2006;
Baird et al. 2005; Monteiro and Victora 2005; Stein et al. 2005; Stettler 2007).
Not surprisingly, reviews of published research paint a complex picture. Some
evidence suggests rapid growth in early infancy is associated with obesity later in
life (Baird et al. 2005; Monteiro and Victora 2005; Ong and Loos 2006), while other
research suggests that poor growth in early life is associated with diabetes and heart
disease (Eriksson et al. 2001; Eriksson et al. 2003). Existing evidence largely
consists of observed associations between the rate of weight change over a specific
period in infancy or childhood (e.g. birth to 6 months) and later disease. Building
upon this research, more sophisticated approaches such as two-stage least squares
(Healy 1974; Gale et al. 2006; Keijzer-Veen et al. 2005; Adair et al. 2009), multi-
level spline models (Ben-Shlomo et al. 2008), and partial least squares regression
(Tu et al. 2010) have been employed to try and identify critical periods where
growth is associated with later disease, independent of final attained size and
growth in other periods. One important limitation of these methods is that they
fail to consider that the overall pattern of growth may be more than the sum of its parts, i.e.
they do not consider possible interactions between varying rates of growth over
different periods of time. Recognising this, some have suggested that later disease is
associated with overall patterns of pre- and postnatal growth (Victora and Barros
2001). To investigate this possibility, we need a way to identify mutually exclusive
groups of people who share a similar growth trajectory, and then relate those groups
to later disease and other covariates. This chapter aims to illustrate the utility of
growth mixture models for just this purpose.
13.3 Data
Data are from 1,620 young adult males enrolled in the Cebu Longitudinal Health
and Nutrition Survey (CLHNS), a community based study of a 1-year birth
cohort living in Metropolitan Cebu, Philippines. Detailed information on the
study design is given by Adair et al. (2010), and data are available to download at
http://www.cpc.unc.edu/projects/cebu.
Surveys were conducted during the third trimester of pregnancy; birth; bimonthly
to 24 months; and at 8.5, 11.5, 16, 19, and 21.5 years (mean ages). The estimated
trajectories are based on 13 measures of BMI from birth to 24 months, calculated as
kg/m2 from measured lengths and weights using standard techniques (Lohman et al.
1988). BMI is a measure of body mass that is relatively independent of height/length
across the life course and is most often used as a proxy for adiposity in population
based studies. While it is correlated with percent body fat, there are limitations to its
use (Hall and Cole 2006), though these concerns go beyond the scope of this chapter.
We focused on systolic blood pressure and waist circumference as the distal
health outcomes, measured in young adulthood. The former was calculated as the
mean of three repeat measures using a mercury sphygmomanometer taken after a
10 min seated rest; the latter was measured in cm at the midpoint between the
bottom of the ribs and the top of the iliac crest. We also included SES scores,
measured at birth and young adulthood, as key covariates. These were derived from
a principal components analysis of interviewer observed household goods and
housing materials (Vyas and Kumaranayake 2006; Victora et al. 2008).
We estimated a variety of statistical models for this analysis, all of which are
described in detail below. We started with a latent growth curve model that only
included the repeated BMI measures from the first 2 years of life. We then extended
this to a growth mixture model through the inclusion of a categorical latent class
variable. We then modeled class membership as a predictor of the two distal health
outcomes, systolic blood pressure and waist circumference, while controlling for
SES at birth and young adulthood. While we are not able to cover every detail for
each estimated model, all models were estimated using Mplus 5.2 (Muthén &
Muthén, Los Angeles, USA), and the code used is included in the appendices.
Fig. 13.1 Graphical representation of a latent growth curve model (13.4.1) for body mass index
(BMI) from zero (birth) to 24 months. The model hypothesizes that the observed value of a
person’s BMI at any time point is a function of random intercept and slope factors and time
specific random error
First we estimated a latent growth curve model which will serve as the basis of our
later mixture modelling. The model is presented graphically in Fig. 13.1. It includes
the following variables: the 13 observed measures of BMI (abbreviated by the
ellipsis . . .), the corresponding time specific latent error terms (E), and the latent
intercept and slope growth factors (I1 and S1). Freely estimated parameters include
the error variances (e1-e13), the intercept and slope variances (var1, var2) and
means (m1, m2), and the covariance of the slope and intercept (cov1). The error
means and the intercepts of the measured BMI variables are set at zero, and the factor
loadings for I1 are set at one. The factor loadings l1 and l2 are set at zero and one,
respectively. The remaining factor loadings are freely estimated (l3-l13), resulting
in a freed-loading model (Bollen and Curran 2006; Meredith and Tisak 1990). While
this complicates the interpretation of the slope factor, freeing these factor loadings
helps us avoid making any a priori assumptions regarding the functional form of the
BMI trajectory. This seems particularly advantageous given our eventual goal of
identifying unobserved heterogeneity in the BMI trajectories.
The estimated parameters, standard errors, and multiple indices of model fit are
given in Table 13.1. Mean BMI at birth, m1, was 12.40 kg/m2 (SE 0.036); mean change
in BMI from birth to month 2, m2, was 3.27 kg/m2 (SE 0.045). Both growth factors
had a non-zero variance (var1 and var2), and their covariance, cov1, was negative.
Table 13.1 Results from the latent growth curve model (13.4.1)

Means                  Estimate   SE
I1 (m1)                12.401     0.036
S1 (m2)                3.274      0.045

Variances              Estimate   SE       R2
I1 (var1)              0.642      0.073    –
S1 (var2)              0.930      0.077    –
e1 (BMI0)              1.425      0.076    0.26
e2 (BMI2)              2.240      0.084    0.14
e3 (BMI4)              1.765      0.069    0.36
e4 (BMI6)              1.205      0.050    0.53
e5 (BMI8)              0.841      0.036    0.64
e6 (BMI10)             0.640      0.028    0.70
e7 (BMI12)             0.495      0.022    0.75
e8 (BMI14)             0.439      0.019    0.76
e9 (BMI16)             0.414      0.018    0.74
e10 (BMI18)            0.362      0.017    0.76
e11 (BMI20)            0.393      0.018    0.73
e12 (BMI22)            0.529      0.023    0.63
e13 (BMI24)            0.661      0.028    0.56

Covariance             Estimate   SE
I1 and S1 (cov1)       −0.217     0.066

Factor loadings        Estimate   SE
l1 (BMI0)              0.000      –
l2 (BMI2)              1.000      –
l3 (BMI4)              1.326      0.019
l4 (BMI6)              1.382      0.019
l5 (BMI8)              1.311      0.017
l6 (BMI10)             1.225      0.016
l7 (BMI12)             1.158      0.015
l8 (BMI14)             1.105      0.014
l9 (BMI16)             1.072      0.014
l10 (BMI18)            1.047      0.014
l11 (BMI20)            1.032      0.013
l12 (BMI22)            1.037      0.014
l13 (BMI24)            1.034      0.014

MODEL FIT: Chi-square 3,998.352, 75 df, p < 0.0001 (where adequate model fit is often indicated when p > 0.05); Comparative Fit Index 0.757; Tucker–Lewis Index 0.747; Log Likelihood 23,771.442; Akaike Information Criterion 51,599.237; Bayesian Information Criterion 51,755.552; Root Mean Square Error of Approximation 0.180 (90% CI 0.175–0.184); Standardized Root Mean Square Residual 0.207
Using the time specific error variances (e1-e13) estimated by the model, and the
observed variances of the BMI measures, we can calculate R2 values. They show
that the model explains much of the variation in BMI from 6 to 24 months (~50–75%),
but considerably less of the variation from birth to 4 months.
The freely estimated factor loadings (l3- l13) can be interpreted as the change in
BMI from birth, relative to the change in BMI from birth to 2 months (accom-
plished by setting their respective l at zero and one, respectively). In other words,
the scale of the slope factor is the change in BMI (kg/m2) from birth to 2 months.
For example, the estimate of l3 indicates that the change in BMI from birth to 4 months was 1.33 times the change from birth to 2 months. According to the model, BMI
increased rapidly from birth to 6 months, and then slowly declined to month 24.
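The model-implied mean curve in Fig. 13.2 can be reconstructed directly from the estimates in Table 13.1, since the expected BMI at each age is simply m1 + loading × m2; a short sketch:

# expected BMI at each occasion implied by the latent growth curve model (Table 13.1)
ages = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]                 # months
loadings = [0.000, 1.000, 1.326, 1.382, 1.311, 1.225, 1.158,
            1.105, 1.072, 1.047, 1.032, 1.037, 1.034]
m1, m2 = 12.401, 3.274
for age, lam in zip(ages, loadings):
    print(f"{age:>2} months: {m1 + lam * m2:.2f} kg/m2")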
Figure 13.2 displays the model estimated mean BMI trajectory, along with
the observed trajectories of 20 randomly selected individuals. While the
mean curve is what we would expect, given what we know about BMI in infancy
(e.g. Cole et al. 2000), there is clearly a great deal of variation around the curve. In a
latent growth curve model, deviations from the mean trajectory are explained by
individual variation in the growth factors and time specific errors. However, a
cursory inspection of Fig. 13.2 suggests that this explanation may be inadequate,
Fig. 13.2 Estimated mean BMI curve in infancy (heavy solid line) from the latent growth curve
model (13.4.1), and observed curves for 20 randomly selected individuals (dashed lines)
particularly as there are apparent differences between individuals in the timing
of peaks and troughs in the curve. This phase variation (Hermanussen and Meigen
2007), while potentially very important, cannot be captured by the latent growth
curve model since the estimated factor loadings (l) that describe the functional form
of the BMI curve are fixed effects (i.e. assumed to be the same for every individual in
the sample). The poor model fit, indicated by the various criteria given in Table 13.1,
further suggests that this parameterization of BMI changes in infancy is not a very
good one. Next we set out to determine whether the variation in BMI trajectories is
better explained by the idea that our overall sample contains subgroups
characterized by BMI curves with different functional forms.
To test this idea, we added a categorical latent variable to the latent growth curve
model we just described. Each parameter that was freely estimated in the latent
growth curve model was allowed to vary across latent classes (including any
estimated factor loadings) with the following exception: the variances of the growth
factors, var1 and var2, were constrained to zero in all classes. These constraints to
the growth mixture model result in a specific form that is referred to as latent class
growth analysis (Jung and Wickrama 2008) or semi-parametric group-based lon-
gitudinal models (Nagin 1999), and was largely popularized by the SAS procedure
TRAJ (Jones et al. 2001).
Fig. 13.3 Latent class growth analysis (13.4.2): BMI trajectories for the 6-class solution
The estimated mean BMI trajectories for each class are given in Fig. 13.3.
Class 1 contains 25.5% of the sample and has a similar functional form to the latent
growth curve model estimated in Sect. 13.4.1. With the exception of classes 2 and 4,
the other class trajectories have a functional form similar to the class 1 trajectory.
Conversely, the class 2 curve is characterized by a low birth BMI, followed by a
slow but steady gain in BMI that doesn’t peak until the 22nd month, while Class 4,
which only contains 2.5% of the sample, is characterized by rapid early increase in
BMI that continues to 8 months (vs. 6 months in the other classes).
While the 6-class solution results in interesting groups with potentially important
differences between them, does it fit the data better than other solutions? Unlike
most structural equation models, Pearson w2 tests (comparing the observed covari-
ance matrix to the model estimated matrix) cannot be used to distinguish mixture
models with different numbers of specified latent classes (Garrett and Zeger 2000).
We must instead rely on likelihood based statistics, such as the log likelihood, the
Akaike Information Criterion, and the Bayesian Information Criterion (Nylund
et al. 2007). There is no definitive answer as to which fit index is best, though
there is some suggestion that BIC out-performs the other two (Nylund et al. 2007).
Table 13.2 includes information on model fit for a series of models specifying
between one and nine latent classes. Typically we are looking for the model that gives
the largest log likelihood and the smallest Akaike Information Criterion and Bayesian
Information Criterion, though failure to arrive at these extrema is common in
practice. The values can be plotted to help the researcher weigh improvements in
model fit versus parsimony (similar to the scree plots often used in factor analysis; see
Jackson 1993). For example, Fig. 13.4 indicates large improvements in the Bayesian
Information Criterion moving from one to three latent classes, and then more
moderate improvements with additional classes. Other tests of model fit possible in
Mplus include the Lo–Mendell–Rubin likelihood ratio test (Lo et al. 2001) and the
parametric bootstrapped likelihood ratio test (McLachlan and Peel 2000), though
their properties under various scenarios are not as well understood, and they require
more time and computing power to estimate.
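The information criteria in Table 13.2 follow from the log likelihood and the number of free parameters via the usual formulas; the sketch below reproduces the one-class column, assuming a sample size of n = 1,620 and that the tabulated log likelihoods are negative (as the AIC and BIC values imply).

from math import log

def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    return -2 * loglik + k * log(n)

def adjusted_bic(loglik, k, n):
    return -2 * loglik + k * log((n + 2) / 24)     # sample-size adjusted BIC

ll, k, n = -31816.1, 26, 1620                      # one-class solution in Table 13.2
print(round(aic(ll, k), 1), round(bic(ll, k, n), 1), round(adjusted_bic(ll, k, n), 1))
# -> approximately 63684.2, 63824.3, 63741.7, matching the first column of Table 13.2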
Table 13.2 Model characteristics, including indicators of model fit, for the latent class growth analysis, comparing solutions with one to nine latent classes
specified (13.4.2)
Number of latent classes 1 2a 3 4 5 6 7 8 9
Number of parameters 26 53 80 107 134 161 188 215 242
Log Likelihood 31,816.1 – 27,162.2 26,487.4 26,035.3 25,754.7 25,486.9 25,288.6 25,109.8
AIC 63,684.19 – 54,484.4 53,188.8 52,338.5 51,831.3 51,349.9 51,007.2 50,703.6
BIC 63,824.33 – 54,915.6 53,765.6 53,060.8 52,699.1 52,363.2 52,166.1 52,008.1
aBIC 63,741.74 – 54,661.5 53,425.6 52,635.1 52,187.7 51,766.0 51,483.1 51,239.3
Fig. 13.4 Bayesian Information Criterion values for the latent class growth analysis (13.4.2), comparing solutions with one to nine latent classes specified. The plot illustrates a distinct improvement in model fit moving from a 1- to a 3-class solution, with diminishing improvement from additional classes
Table 13.3 Mean probability of latent class membership for individuals assigned to their most likely latent class (highest posterior probability). For example, among individuals for whom membership in class 1 is most likely (row 1), the mean probability of being in that class is 81% (column 1), while their mean probability of being in class 2 is 5.3% (column 2), and so on
Latent class
Assigned latent class 1 2 3 4 5 6
1 0.807 0.053 0.057 0.002 0.022 0.058
2 0.026 0.874 0.000 0.000 0.026 0.074
3 0.063 0.020 0.911 0.000 0.002 0.003
4 0.000 0.000 0.000 0.952 0.048 0.000
5 0.010 0.026 0.004 0.012 0.904 0.044
6 0.050 0.055 0.002 0.001 0.044 0.848
Fig. 13.5 Estimated probabilities of latent class membership as a function of SES at birth (13.4.3).
With increasing SES, probability of membership in classes 2, 4, and 5 increases; while probability
of membership in classes 1, 3, and 6 decreases
Fig. 13.6 Schematic for the final model in the latent class growth analysis (13.4.3)
SES at birth was modeled as a determinant of SES in adulthood and the probability
of latent class membership. Both of these were modeled as determinants of waist
circumference and systolic blood pressure in young adulthood. Lastly, waist circum-
ference was modeled as an influence on systolic blood pressure. A schematic of the
final model is given in Fig. 13.6.
The relationship between SES score at birth and class membership is estimated
by a multinomial logistic regression. The results from this part of the model are best
summarized by plotting the estimated probabilities of latent class membership as a
function of SES at birth (Fig. 13.5). Individuals with higher SES at birth were more
likely to be assigned into classes 2, 4 and 5, and less likely to be in classes 1, 3, and 6.
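The curves in Fig. 13.5 are obtained by feeding the multinomial-logit coefficients through the softmax function; the Python sketch below shows the calculation with purely hypothetical coefficients (the actual estimates are not reproduced here).

import numpy as np

def class_probabilities(ses, intercepts, slopes):
    """Class-membership probabilities from a multinomial logit with the last class as reference."""
    logits = np.append(np.array(intercepts) + np.array(slopes) * ses, 0.0)  # reference logit = 0
    expl = np.exp(logits - logits.max())                                    # numerically stable softmax
    return expl / expl.sum()

intercepts = [1.0, -0.5, 0.4, -2.0, 0.2]     # hypothetical: classes 1-5 relative to class 6
slopes     = [-0.3, 0.6, -0.4, 0.8, 0.5]
for ses in (-2, 0, 2):
    print(ses, np.round(class_probabilities(ses, intercepts, slopes), 2))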
The following relationships are estimated with linear regressions: SES score at birth was positively associated with SES score in adulthood (b = 0.68 SES/SES; 95% CI: 0.59 to 0.77), which in turn was a positive predictor of waist circumference (b = 0.70 cm/SES; 95% CI: 0.52 to 0.89) but not of systolic blood pressure (−0.05 mmHg/SES; 95% CI: −0.29 to 0.20); waist circumference was positively associated with systolic blood pressure (0.47 mmHg/cm; 95% CI: 0.37 to 0.56).
Lastly, we can look at the estimated intercepts of waist circumference
and systolic blood pressure within each latent class, which are displayed in
Fig. 13.7. These intercepts reflect the mean values of the outcomes within each
class, independent of the linear influence of other predictor variables (SES for waist
circumference; SES and waist circumference for systolic blood pressure). There are
no apparent differences in the systolic blood pressure intercepts, although class 3,
which was characterized by very poor growth in early life, had the lowest value
(80.27 mmHg, 95% CI: 73.12 to 87.42). Classes 4 and 5 have the largest WC
intercepts (79.98 cm, 95% CI: 71.49 to 88.46; and 74.63, 95% CI: 73.38 to 75.88,
respectively). Both groups are characterized by relatively large BMI values at birth,
rapid early BMI gains, and have the largest BMI values at 24 months.
13.5 Conclusion
Perhaps contrary to expectations, we did not identify an early life BMI trajectory
that was clearly associated with systolic blood pressure in this sample of young
adult Filipino males, independent of waist circumference and SES. While we did
identify a subgroup characterized by larger waist circumferences, the small num-
bers of individuals falling into this group prevented us from drawing any confident
conclusions. Given that this was an abridged analysis, there are a number of
important caveats needing exploration. Some of these, which we include as food
for thought, are:
• What is the impact of non-normality in BMI measures on the model (see Bauer
and Curran 2003)?
• Should raw scores be used, or are z-scores (internally or externally referenced)
more appropriate?
• Should the model focus on growth velocity, or acceleration, versus growth
distance?
• Would a model which includes both height and weight (adjusted for height) be
more appropriate?
• What is the impact when measurements are not evenly spaced? For example, if
we had many early measures and few later measures, would results be dispropor-
tionally driven by the former? Would it then make more sense to look for groups
based on two or more latent categorical variables, each capturing parts of the
total trajectory?
Fig. 13.7 Systolic blood pressure and waist circumference intercepts (and 95% confidence intervals) by latent class (13.4.3)
Appendix 1
bmi14*;
bmi16*;
bmi18*;
bmi20*;
bmi22*;
bmi24*;
Appendix 2
Variable:
Classes = c (2); ! Increase this for more classes
Analysis:
Type = Mixture;
STARTS = 100 20;
STITERATIONS = 20;
Model:
%OVERALL%
! Factor loadings defining the growth curve
i1 s1 | bmi0@0 bmi2@1 bmi4* bmi6* bmi8* bmi10* bmi12* bmi14* bmi16*
bmi18* bmi20* bmi22* bmi24*;
! Freely estimated factor means; variances constrained as zero
i1@0;
s1@0;
[i1*];
[s1*];
! Freely estimated error variances
bmi0*;
bmi2*;
bmi4*;
bmi6*;
bmi8*;
bmi10*;
bmi12*;
bmi14*;
bmi16*;
bmi18*;
bmi20*;
bmi22*;
bmi24*;
%c#1%
[Repeat code from OVERALL model]
%c#2%
[Repeat code from OVERALL model]
! Add more class models as needed
Appendix 3
Variable:
Classes = c (2);
Analysis:
Type = Mixture;
STARTS = 100 20;
STITERATIONS = 20;
Model:
! Factor loadings defining the growth curve
i1 s1 | bmi0@0 bmi2@1 bmi4* bmi6* bmi8* bmi10* bmi12* bmi14* bmi16*
bmi18* bmi20* bmi22* bmi24*;
! Freely estimated factor means; variances constrained as zero
i1@0;
s1@0;
[i1*];
[s1*];
! Freely estimated error variances
bmi0*;
bmi2*;
bmi4*;
bmi6*;
bmi8*;
bmi10*;
bmi12*;
bmi14*;
bmi16*;
bmi18*;
bmi20*;
bmi22*;
bmi24*;
! Covariates
! Multinomial logit of C on SES
c ON ses0;
! Linear regression of SES in young adulthood on SES at birth
ses258 ON ses0;
! Linear regression of systolic blood pressure on SES and waist circumference
sys258 ON waist258 ses258;
! Linear regression of waist circumference on SES
waist258 ON ses258;
References
Adair, L. S., Martorell, R., Stein, A. D., Hallal, P. C., Sachdev, H. S., Prabhakaran, D., Wills, A.
K., Norris, S. A., Dahly, D. L., & Lee, N. R. (2009). Size at birth, weight gain in infancy and
childhood, and adult blood pressure in 5 low-and middle-income-country cohorts: When does
weight gain matter? American Journal of Clinical Nutrition, 89, 1383.
Adair, L. S., Popkin, B. M., Akin, J. S., Guilkey, D. K., Gultiano, S., Borja, J., Perez, L., Kuzawa,
C. W., McDade, T., & Hindin, M. J. (2010). Cohort profile: The Cebu longitudinal health and
nutrition survey. International Journal of Epidemiology. doi:10.1093/ije/dyq085.
Baird, J., Fisher, D., Lucas, P., Kleijnen, J., Roberts, H., & Law, C. (2005). Being big or growing
fast: Systematic review of size and growth in infancy and later obesity. British Medical
Journal, 331, 929.
Barker, D. J. P. (2001). Fetal origins of cardiovascular and lung disease. New York: M. Dekker.
Barker, D. J. P. (2004). The developmental origins of adult disease. Journal of the American
College of Nutrition, 23, 588–595.
Bauer, D. J., & Curran, P. J. (2003). Distributional assumptions of growth mixture models:
Implications for overextraction of latent trajectory classes. Psychological Methods, 8, 338.
Ben-Shlomo, Y., & Kuh, D. (2002). A life course approach to chronic disease epidemiology:
Conceptual models, empirical challenges and interdisciplinary perspectives. London: IEA.
Int. J. Epidemiol. (2002) 31 (2): 285–293. doi: 10.1093/ije/31.2.285.
Ben-Shlomo, Y., McCarthy, A., Hughes, R., Tilling, K., Davies, D., & Davey Smith, G. (2008).
Immediate postnatal growth is associated with blood pressure in young adulthood: The Barry
Caerphilly Growth Study. Hypertension, 52, 638.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation perspective. Hoboken:
Wiley-Interscience. http://onlinelibrary.wiley.com/doi/10.1002/0471746096.fmatter/pdf.
Celeux, G., & Soromenho, G. (1996). An entropy criterion for assessing the number of clusters in a
mixture model. Journal of Classification, 13, 195–212.
Cole, T. J., Bellizzi, M. C., Flegal, K. M., & Dietz, W. H. (2000). Establishing a standard definition
for child overweight and obesity worldwide: International survey. British Medical Journal,
320, 1240.
De Stavola, B. L., Nitsch, D., Silva, I. D., McCormack, V., Hardy, R., Mann, V., Cole, T. J., Morton, S., & Leon, D. A. (2006). Statistical issues in life course epidemiology. American Journal of Epidemiology, 163(1), 84–96. doi:10.1093/aje/kwj003.
Dolan, C. V., Schmittmann, V. D., Lubke, G. H., & Neale, M. C. (2005). Regime switching in the
latent growth curve mixture model. Structural Equation Modeling, 12, 94–119.
Eriksson, J. G., Forsén, T., Tuomilehto, J., Osmond, C., & Barker, D. J. P. (2001). Early growth
and coronary heart disease in later life: Longitudinal study. British Medical Journal, 322, 949.
Eriksson, J. G., Forsen, T. J., Osmond, C., & Barker, D. J. P. (2003). Pathways of infant and
childhood growth that lead to type 2 diabetes. Diabetes Care, 26, 3006.
Gale, C. R., O’Callaghan, F. J., Bredow, M., & Martyn, C. N. (2006). The influence of head growth
in fetal life, infancy, and childhood on intelligence at the ages of 4 and 8 years. Pediatrics, 118,
1486.
Garrett, E. S., & Zeger, S. L. (2000). Latent class model diagnosis. Biometrics, 56, 1055–1067.
Gillman, M. W. (2005). Developmental origins of health and disease. The New England Journal of
Medicine, 353, 1848–1850.
Gluckman, P. D., & Hanson, M. A. (2004). Developmental origins of disease paradigm:
A mechanistic and evolutionary perspective. Pediatric Research, 56, 311–317.
Hall, D. M. B., & Cole, T. J. (2006). What use is the BMI? Archives of Disease in Childhood, 91,
283–286.
Healy, M. J. R. (1974). Notes on the statistics of growth standards. Annals of Human Biology, 1,
41–46.
Hermanussen, M., & Meigen, C. (2007). Phase variation in child and adolescent growth. Interna-
tional Journal of Biostatistics, 3, 9.
Jackson, D. A. (1993). Stopping rules in principal components analysis: A comparison of
heuristical and statistical approaches. Ecology, 74, 2204–2214.
Jones, B. L., Nagin, D. S., & Roeder, K. (2001). A SAS procedure based on mixture models for
estimating developmental trajectories. Sociological Methods & Research, 29, 374.
Jung, T., & Wickrama, K. A. S. (2008). An introduction to latent class growth analysis and growth
mixture modeling. Social and Personality Psychology Compass, 2, 302–317.
Keijzer-Veen, M. G., Euser, A. M., van Montfoort, N., Dekker, F. W., Vandenbroucke, J. P., & van
Houwelingen, H. C. (2005). A regression model with unexplained residuals was preferred in
the analysis of the fetal origins of adult diseases hypothesis. Journal of Clinical Epidemiology,
58, 1320–1324.
Kreuter, F., & Muthén, B. (2008). Analyzing criminal trajectory profiles: Bridging multilevel and
group-based approaches using growth mixture modeling. Journal of Quantitative Criminology,
24, 1–31.
Kuh, D., & Ben-Shlomo, Y. (2004). A life course approach to chronic disease epidemiology.
Oxford: Oxford University Press. http://books.google.co.uk/books?id=o_CFOTYglHsC&
printsec=frontcover&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false.
Kuh, D., Ben-Shlomo, Y., Lynch, J., Hallqvist, J., & Power, C. (2003). Life course epidemiology.
Journal of Epidemiology and Community Health, 57, 778–783.
Li, F., Duncan, T. E., Duncan, S. C., & Acock, A. (2001). Latent growth modeling of longitudinal
data: A finite growth mixture modeling approach. Structural Equation Modeling: A Multidis-
ciplinary Journal, 8, 493–530.
Li, C., Goran, M. I., Kaur, H., Nollen, N., & Ahluwalia, J. S. (2007). Developmental trajectories of
overweight during childhood: Role of early life factors. Obesity, 15, 760–771.
Lo, Y., Mendell, N. R., & Rubin, D. B. (2001). Testing the number of components in a normal
mixture. Biometrika, 88, 767.
Lohman, T., Roche, A., & Martorell, R. (1988). Anthropometric standardization reference man-
ual. Champaign: Human Kinetics Books.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley-Interscience. http://
espace.library.uq.edu.au/view/UQ:145685.
Meredith, W., & Tisak, J. (1990). Latent curve analysis. Psychometrika, 55, 107–122.
Monteiro, P. O. A., & Victora, C. G. (2005). Rapid growth in infancy and childhood and obesity in
later life-a systematic review. Obesity Reviews, 6, 143–154.
Fig. 14.1 (a) Time-varying confounding by SBP of the effect of quitting smoking on mortality.
(b) Non time-varying confounding by SBP of the effect of quitting smoking on mortality
introduce selection bias (Hernan et al. 2004), but again, we have to adjust for it if it
confounds future exposure and outcome. Hence the question is how to adjust for
time-varying confounding without interrupting mediated effects or introducing
selection bias. Standard statistical methods for the analysis of cohort studies (for
example Cox or Poisson regression) often get this wrong and yield biased estimates
(Robins et al. 1992a), while G-estimation provides a valid method.
We illustrate the problem with an example. When analysing the effect of smoking
on mortality we could employ several possible strategies, including: examining the
effect of baseline smoking; examining the effect of time-updated smoking;
controlling for baseline covariates; and controlling for time-updated covariates.
The unadjusted estimate of the effect of baseline smoking will be biased
(favouring smoking, in this case), because those who are both smokers and have
high blood pressure (and therefore have the highest mortality risk) will tend to quit
subsequently, and thus will reduce their mortality risk. Controlling for baseline
covariates such as blood pressure which are measured at the start of the study will
still give biased estimates of the effect of smoking, because it ignores the fact that
individuals who quit after the start of the study will tend to be those whose blood
pressure increased over time.
Controlling for time-updated measurements of covariates such as blood pressure
will still give biased estimates of the effect of smoking, because smoking
acts on mortality at least partly by raising blood pressure. Controlling for a variable
(e.g. blood pressure) which is intermediate on the pathway between the exposure
(e.g. smoking) and the outcome (e.g. mortality) will estimate only the direct effect
of the exposure (ignoring the effect mediated through the covariate) and may
additionally introduce selection bias (Hernan et al. 2004).
Example 1 To illustrate the bias of the usual survival analysis in the situation
described above, we simulated data for 2,000 people with four assessment
occasions (visits) 3 years apart. Each person had a randomly-generated (log-
normally distributed) survival time representing how long they would survive if
never exposed, which was then decreased by high blood pressure or smoking.
Survival time for a smoker was 0.67 of survival time for a non-smoker with the
same covariate history, and survival time decreased by 4% per 1 mmHg increase in
current blood pressure. Blood pressure increased by 2 mmHg for current smokers,
and by 1 mmHg for ex-smokers (i.e. if an individual smoked at the previous visit
but not the current visit, blood pressure was 1 mmHg higher than if they had been a
non-smoker at both visits). The odds of smoking were decreased by 0.3 if the
participant had high blood pressure at the previous visit. All 2,000 participants were
“followed up” until either they died (n ¼ 1,672) or until 3 years after the fourth
visit. We took visit 1 to be a baseline visit, and measured time to event/censoring
from visit 2. Table 14.1 shows the simulated number at each visit, together with
number smoking at that visit and average blood pressure at that visit.
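A simplified Python sketch of this kind of data-generating process is given below. It is not the authors' simulation code: the log-normal parameters, the baseline probability of smoking and the working definition of "high blood pressure" are assumptions, and the 1 mmHg rise for ex-smokers is omitted for brevity.

import numpy as np

rng = np.random.default_rng(2012)
n, gap = 2000, 3.0                                  # 2,000 people; visits 3 years apart (4 intervals = 12 years)
ratio_smoke, ratio_bp = 0.67, 0.96                  # survival-time ratios: smoking, and per mmHg of BP elevation

records = []                                        # (id, follow-up time from visit 1, died)
for i in range(n):
    t0 = rng.lognormal(mean=2.5, sigma=0.5)         # survival time if never exposed (hypothetical parameters)
    bp_elev, high_bp_prev = 0.0, False
    used, time, died = 0.0, 0.0, 0
    for v in range(4):                              # the interval following each of visits 1-4
        p_smoke = 0.3 / 1.3 if high_bp_prev else 0.5    # odds of smoking multiplied by 0.3 after high BP
        smoker = rng.random() < p_smoke
        bp_elev += 2.0 if smoker else 0.0           # BP rises 2 mmHg for current smokers
        rate = (1 / ratio_smoke if smoker else 1.0) * (1 / ratio_bp) ** bp_elev
        if used + gap * rate >= t0:                 # exposure-free survival "budget" exhausted: death
            time += (t0 - used) / rate
            died = 1
            break
        used += gap * rate
        time += gap
        high_bp_prev = bp_elev >= 4.0               # crude working definition of "high blood pressure"
    records.append((i, time, died))

print(sum(r[2] for r in records), "deaths out of", n)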
The data were analysed using a Weibull model with the accelerated failure
time parameterisation, because this is the parameterisation which corresponds to
g-estimation (i.e., calculating the survival ratio rather than the hazard ratio).
The accelerated failure time model assumes for the individual failure times Ti with
covariates xi that:
Ti = exp(θ′xi + εi),

where εi has a standard extreme value distribution with scale parameter 1/γ, and γ is the shape parameter.
Survival models including current smoking, current smoking and blood pressure,
current smoking plus baseline smoking and blood pressure, and smoking and blood
pressure at current and previous visits, were all fitted. The model including current
smoking only estimated the survival time ratio for smokers compared to non-
smokers as 1.14 (95% CI 1.06–1.23), concluding that smoking had little (possibly
even a positive) effect on survival. The model including current smoking
and current blood pressure estimated the ratio as 0.87 (95% CI 0.84–0.89),
that including current smoking and baseline smoking and blood pressure as 0.93
(95% CI 0.90–0.96) and that including all time-updated variables as 0.93 (95% CI
0.91–0.94). Thus all these standard analyses under-estimated the true adverse effect
of smoking on mortality (a mortality ratio of 0.67).
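For readers wanting to reproduce a Weibull accelerated failure time fit in Python, the lifelines package provides one implementation (the chapter does not specify software for this step); the toy data below are generated on the spot and are not the simulated trial data.

import numpy as np
import pandas as pd
from lifelines import WeibullAFTFitter

rng = np.random.default_rng(1)
n = 500
smoker = rng.integers(0, 2, n)                                   # exposure at baseline
t = rng.weibull(1.5, n) * 10 * np.where(smoker == 1, 0.67, 1.0)  # smokers' survival times scaled by 0.67
df = pd.DataFrame({"time": np.minimum(t, 12.0),                  # administrative censoring at 12 years
                   "died": (t < 12.0).astype(int),
                   "smoker": smoker})

aft = WeibullAFTFitter()
aft.fit(df, duration_col="time", event_col="died")
aft.print_summary()     # exp(coef) for 'smoker' is the estimated survival-time ratio (about 0.67 here)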
14.2.1 G-estimation
G-estimation of causal effects was proposed by Robins (see e.g. Robins et al.
1992a; Witteman et al. 1998) as one method to allow for confounders which are
also on the causal pathway, i.e. time-varying confounding. G-estimation has been
used in various applications, to estimate the causal association between: quitting
smoking and time to death or first CHD (Mark and Robins 1993); isolated systolic
hypertension and cardiovascular mortality (Witteman et al. 1998); therapy and
survival for HIV-positive men (Joffe et al. 1997, 1998); graft versus host disease
and relapse after bone marrow transplants in leukaemia (Keiding et al. 1999);
various cardiovascular risk factors and mortality (Tilling et al. 2002); to estimate
the total causal effect of highly active antiretroviral therapy (HAART) on the time
to AIDS or death among those infected with immunodeficiency virus (HIV)
(Hernan et al. 2005); and to correct for non-compliance in clinical trials (Korhonen
et al. 1999). G-estimation has also been implemented as a Stata programme (Sterne
and Tilling 2002).
For a given value of C, the counterfactual (exposure-free) survival time for individual i is estimated as Ui,C = Σv tv exp(C × ei,v), where tv is the time from visit v to either the event or the next visit and ei,v is the exposure of individual i at visit v.
Example 2 The simulated data on smoking and blood pressure used in example 1
were analysed using g-estimation. We had four visits, of which the first was the
baseline. Thus, for a given value of C, the estimated counterfactual survival time for individual i is given by

Ui,C = Σv tv exp(C × ei,v),   summing over visits v = 1, …, 4,

where tv is the time from visit v to either the event or the next visit and ei,v indicates whether individual i smoked at visit v. Data from three simulated individuals
are shown in Table 14.2:
The first individual in this simulated dataset (individual A) was a smoker at visits
1 and 2, and survived for a total of 4.88 years (i.e. 1.88 years from visit 2).
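As a quick check of the arithmetic for individual A (a smoker over both observed intervals), a couple of lines of Python reproduce the counterfactual survival time used later in Example 3, taking exp(−C) = 0.5 (smoking halving life expectancy):

from math import exp, log

def counterfactual_time(intervals, exposures, C):
    """U_{i,C} = sum over visits of t_v * exp(C * e_{i,v})."""
    return sum(t * exp(C * e) for t, e in zip(intervals, exposures))

# individual A: 3 years from visit 1 to visit 2, then 1.88 years to death, smoking throughout
print(counterfactual_time([3.0, 1.88], [1, 1], C=log(2)))   # 9.76 years, as used in Example 3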
logit(ei,v) = m × Ui,C + Σk ak xik + Σj bj cij,v + Σj dj cij,v−1 + Σj lj cij,1,   v = 2, …, n,
for different values of C, where cij;v are the time-varying confounders and xik the
time-invariant confounders. Alternatively, one logistic model could be fitted
including data from all visits, with allowance made for clustering within
individuals (e.g. by using a GEE). The time-varying confounders may include
the values of exposure at previous time-points and at baseline. In fact, the above
model for exposure can be generalised and should be chosen according to what is
judged appropriate based on subject matter knowledge about the exposure pro-
cess. For example when exposure is treatment, we may have specific information
on the rules according to which treatment was administered. Subjects contribute
an observation for each occasion at which their exposure was assessed.
The g-estimate C0 is the value of C for which the Wald statistic for m in this logistic regression is zero (P-value 1, i.e. no association between current exposure and Ui,C0). The upper and lower limits of the 95% confidence interval for C0
are the values of C for which the two-sided P-value for the Wald statistic of m in
this logistic regression is 0.05.
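In practice the g-estimate is found by a simple grid search over trial values of C: for each value, Ui,C is recalculated and the regression above is refitted, and C0 is the value at which the Wald statistic for m is closest to zero. The sketch below outlines this search in Python, fitting the exposure model as a GEE with a logit link to allow for clustering within individuals; the column names (id, exposure, interval, confounder) are hypothetical, the formula would in practice include all baseline and time-varying confounders, and this version ignores the administrative-censoring refinement described in the next section.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def g_estimate(data, c_grid):
    """Grid search for C0, the value of C at which current exposure is unrelated to U_{i,C}.

    'data' is a long-format data frame with one row per person-visit and columns
    id, exposure (0/1), interval (time from this visit to the event or next visit),
    and confounder (plus any other covariates added to the formula below).
    """
    z = []
    for c in c_grid:
        u = (data["interval"] * np.exp(c * data["exposure"])).groupby(data["id"]).transform("sum")
        fit = smf.gee("exposure ~ u + confounder", groups="id",
                      data=data.assign(u=u), family=sm.families.Binomial()).fit()
        z.append(fit.tvalues["u"])                  # Wald statistic for m, the coefficient of U
    z = np.asarray(z)
    return c_grid[int(np.argmin(np.abs(z)))]        # the 95% CI limits are where |z| = 1.96

# usage: c0 = g_estimate(long_data, np.linspace(-1.0, 1.0, 201)); survival ratio = exp(-c0)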
This g-estimate C0 is minus the log of the “causal survival time ratio”. Thus exp(−C0) estimates the ratio of the survival time of a continuously exposed person to that of an otherwise identical person who was never exposed. If exp(−C0) > 1 then exposure is beneficial (i.e. exposure increases time to the outcome event).
The counterfactual survival time Ui,C can only be derived from the observed data
for a subject who experiences the event. If the study has a planned end of follow-up
(at time Ci for individual i) that occurs before all subjects have experienced the
outcome event, not all subjects’ counterfactual failure times will be estimable. If Ci
is independent of the counterfactual survival time, then this problem can be
overcome by replacing Ui,C with an indicator variable (Di,C) for whether the
event would have been observed both if the person had been exposed throughout
follow-up and if they had been unexposed throughout follow-up, as described by
Witteman et al. (1998).
where Ci;C ¼ Ci if C0 and Ci;C ¼ Ci expðCÞ if C<0. Thus Di;C is zero for all
subjects who do not experience an event during follow-up, and may also be zero for
some of those who did experience an event.
Example 3 Continuing with the data from example 1, this study had a planned end of follow-up 12 years after visit 1. Suppose we assume that smoking halves life expectancy, i.e. exp(−ψ) = 0.5, so ψ = 0.69. Then for each individual, the indicator variable D_{i,0.69} is equal to 1 if the counterfactual failure time (given ψ = 0.69) is less than 12 years and 0 otherwise. The first individual in this simulated dataset (A, above) was a smoker at visits 1 and 2, and survived for 4.88 years from visit 1. The counterfactual survival time for this individual (see example 2) is 9.76 years, and thus the indicator variable D_{i,0.69} takes the value 1 for this individual. The counterfactual failure time for individual B, who smoked at visits 1, 2 and 3 then gave up and survived for another 5 years, was 20 years. Thus the indicator variable D_{i,0.69} takes the value 0 for this individual. Another individual (C) smoked at visits 1 and 2, then gave up and survived until the end of follow-up (dying 14 years after visit 1). As this individual did not experience an event during follow-up, their value for the indicator variable D_{i,0.69} is 0. The value of the indicator variable D_{i,ψ} can be calculated for all individuals, whether or not they experienced an event during follow-up.
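The calculation of the indicator for these three individuals can be sketched in R as follows (a toy illustration using the figures quoted above; the object names are invented for the purpose):

  # Counterfactual failure times at psi = 0.69 for individuals A, B and C;
  # C did not fail during follow-up, so U cannot be derived (NA)
  U.psi <- c(A = 9.76, B = 20, C = NA)
  psi   <- 0.69
  Cens  <- 12                                     # planned end of follow-up (years)
  C.psi <- if (psi >= 0) Cens else Cens * exp(psi)
  D.psi <- ifelse(!is.na(U.psi) & U.psi < C.psi, 1, 0)
  D.psi
  #  A  B  C
  #  1  0  0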
Once the value of the indicator variable has been calculated for each individual,
for a given value of ψ, the g-estimation can proceed by performing a logistic
regression of the exposure of each individual at each timepoint on their
counterfactual failure time. The data for individuals A, B and C are shown in
Table 14.3 (assuming that smoking halves life expectancy):
Each individual i then contributes ni observations to a logistic regression
model with exposure as the outcome, where ni is the number of visits at which
that individual has observations. Thus in the example above, individuals A, B and C
contribute 1, 3 and 3 observations respectively. In each case, the logistic regression
relates their exposure to all their baseline covariates, and previous covariates and
exposures, and to the indicator variable for their counterfactual failure time (D_{i,ψ}).
We used g-estimation to estimate the effect of smoking on mortality using the
entire simulated dataset. The g-estimate of ψ was 0.41 (95% CI 0.37–0.44), and
the g-estimated survival ratio was 0.66 (95% CI 0.64–0.69) compared to the
true value of 0.67. This is closer to the true value than all the other (biased)
models (see Example 1), and also has a slightly narrower confidence interval.
In this one hypothetical example, g-estimation performs better than the usual
survival analysis.
Censoring by competing risks can occur when subjects leave the study early or, in
the case of cause-specific mortality models, die from other causes. For example, in
models where systolic or diastolic blood pressure are the exposures, individuals
might be censored when they first reported use of anti-hypertensive medication
(Tilling et al. 2002). Subjects could also withdraw from the study because they felt
too ill to participate in further follow-ups. In each of these cases, censoring is not
independent of the underlying counterfactual survival time. Thus the above method
for dealing with censoring by the planned end of a study cannot be used to deal with
censoring by competing risks.
As outlined by Witteman et al. (1998), censoring by competing risks is dealt with
by modelling the censoring mechanism, and using each individual’s estimated
probability of being censored to adjust the analysis. This is a similar idea to using
weighting for non-response to adjust for missing data (Little and Rubin 2002).
Multinomial (if there are several censoring mechanisms) or logistic regression
(if there is only one censoring mechanism), based on all available data, is used to
relate the probability of being censored at each measurement occasion to the
exposure and covariate history. The probability of being uncensored to the end of
the study for each individual is then estimated. The inverse of this probability is used
to weight the contributions of individuals to the logistic regression models used
in the g-estimation process, to which now only uncensored individuals contribute.
This can be done by using probability weights, or by replacing D_{i,ψ} by D_{i,ψ}/P(not censored). This approach means that observations within the same individual
are no longer independent, so the logistic regression models for the g-estimation
process use robust standard errors allowing for clustering within individuals (using
the Huber-White sandwich estimator (Stata Corporation 2007)). This is equivalent
to the procedure suggested by Witteman et al., to use a robust Wald test from a
generalized estimating equation with an independence working correlation matrix
(Witteman et al. 1998). The confidence intervals obtained using this procedure are
conservative.
For example, suppose we are examining the effect of systolic and diastolic blood
pressure (as exposures) on mortality, and that individuals were censored when
they first reported use of anti-hypertensive medication. The probability of being
censored at each visit will depend on blood pressure at previous visits, and is likely
to be related to other factors also (e.g. smokers may be more likely to have other
health problems and therefore to visit the GP). The censoring process is modelled,
using logistic regression, with whether the individual was censored (i.e. prescribed
anti-hypertensive medication) at each occasion as the outcome. This logistic
regression model is then used to derive, for each individual, the probability that
they remained uncensored to the end of the study. The inverse of this probability is
used to weight all of that individual’s contributions to the g-estimation model (using
probability weights as before). For example, suppose a person with high initial
blood pressure has a chance of 0.25 of being uncensored at the end of the study.
In g-estimation the contribution of such a person to the model is multiplied by 4,
representing the ‘total’ of 4 people with high blood pressure, 3 of whom were
censored before the end of the study.
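A rough sketch of this weighting step in R is given below. The names are illustrative only (dat in long format, cens as the indicator of starting anti-hypertensive medication at a visit, and the covariates in the censoring model), and complete data with rows ordered by visit within individual are assumed.

  # Model the censoring (medication) process at each visit
  cens.fit <- glm(cens ~ sbp.prev + bmi + smoke + diabetes + age + sex,
                  family = binomial, data = dat)
  dat$p.uncens <- 1 - fitted(cens.fit)               # P(uncensored at this visit)
  # probability of remaining uncensored up to and including each visit
  dat$p.stay <- ave(dat$p.uncens, dat$id, FUN = cumprod)
  dat$ipw <- 1 / dat$p.stay                          # weights for the g-estimation models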
The parameter estimated by the g-estimation procedure, the causal survival time ratio,
describes the association between exposure and survival using the accelerated failure
time parameterisation. In epidemiology, the more usual parameterisation for survival
analysis is that of proportional hazards. It would thus be useful to be able to express the
causal survival time ratio in the proportional hazards parameterisation. One obvious
way to do this is via the Weibull distribution, as this can be expressed in either
parameterisation.
The Weibull hazard function at time t is h(t) = φγt^{γ−1}, where φ is referred to as the scale parameter and γ as the shape parameter. If the vector of covariates x_i does
not affect γ, the Weibull regression model can be written as either the usual epidemiological proportional hazards:

h(t; x_i) = h_0(t) \exp(\beta^{T} x_i)

or accelerated failure time, using the expected failure time:

T_i = \exp(\theta^{T} x_i + \varepsilon_i)

where ε_i has a standard extreme value distribution with scale parameter 1/γ. The Weibull shape parameter γ can thus be used to express results from the accelerated failure time parameterisation as proportional hazards: θ = −β/γ.
If the underlying survival times follow a Weibull distribution, the Weibull shape
parameter can be estimated from the survival data and used to express the
g-estimated survival ratio as a hazard ratio for the exposure (Witteman et al. 1998).
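For instance, combining (purely for illustration) the g-estimated causal survival time ratio of 0.66 from the simulated smoking example with a Weibull shape parameter of 1.26 of the kind estimated in Example 4 below, the conversion would be

\mathrm{HR} = \left(e^{-\psi_0}\right)^{-\gamma} = e^{\gamma \psi_0} = 0.66^{-1.26} \approx 1.69,

so a causal survival time ratio below 1 (a harmful exposure) maps to a hazard ratio above 1.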
The 95% confidence intervals for g-estimated effects are generally wider than
those for corresponding Weibull estimates, particularly with rare outcomes and for
estimates close to 1. This is because g-estimation discards information when dealing with censoring, by dichotomising the outcome variable.
Example 4 G-estimation has been used to examine the effects of changes in
cardiovascular risk factors in mid-life on all-cause mortality and incidence of
coronary heart disease (CHD) (Tilling et al. 2002). Cardiovascular risk factors
(systolic and diastolic blood pressure, smoking, diabetes, HDL and LDL choles-
terol) were measured four times, with the first measure being used as the baseline in
the g-estimation model.
To identify the extent of time-varying confounding, the relationships between
each exposure and past and current values of all covariates were examined.
This was done using one regression model for each exposure, to which each
individual could contribute up to three observations. These models showed that
there was substantial time-varying confounding, with inter-relationships among
most of the time-varying exposures. Weibull survival analysis (with the accelerated
failure time parameterisation) was used to relate all the covariates to survival, and
the shape parameter from this model (1.26, 95% CI 1.17–1.36) was later used to
express the g-estimated survival ratios as hazard ratios for each exposure.
Separate g-estimation models were fitted for each exposure. In each g-estimation
model all risk factors (other than the exposure of interest) were included as
time-varying covariates. Baseline variables (e.g. age and sex) were included as
non time-varying covariates. In the models for systolic and diastolic blood pressure,
individuals were censored when they first reported use of anti-hypertensive medica-
tion. The probability of being on anti-hypertensive medication at each visit was
dependent on blood pressure at baseline and previous visits, and was also related to
baseline and time-varying values of BMI, smoking and diabetes, and to age and sex.
This censoring process was modelled, using logistic regression, and the probability
of each individual being censored was taken into account in the g-estimation method.
Table 14.4 (modified from Tilling et al. (2002) with permission of Oxford University Press and the Society for Epidemiologic Research) shows the baseline, time-varying and g-estimated hazard ratios for mortality for selected cardiovascular risk factors.
Table 14.4 Baseline and time-varying Weibull survival analysis and G-estimated relations between time-varying cardiovascular risk factors and survival for Atherosclerosis Risk in Communities participants with data from at least the first two visits (1987–1989 and 1990–1993) (Modified with permission of Oxford University Press and the Society for Epidemiologic Research from Tilling et al. (2002))

Variable          Reference group     Baseline HR (95% CI)   Time-varying HR (95% CI)   G-estimated HR (95% CI)
SBP ≥ 140 mmHg    SBP < 140 mmHg      2.08 (1.43, 3.03)      1.72 (1.23, 2.40)          1.79 (1.38, 2.24)
DBP ≥ 90 mmHg     DBP < 90 mmHg       1.58 (0.79, 3.17)      1.91 (1.02, 3.56)          1.98 (0.97, 28.56)
Diabetes          No diabetes         2.04 (1.67, 2.49)      1.26 (0.98, 1.62)          1.62 (1.06, 1.98)
BMI < 20 kg/m2    BMI 20–30 kg/m2     2.58 (1.89, 3.53)      3.09 (2.03, 4.71)          2.07 (1.39, 3.64)
BMI > 30 kg/m2    BMI 20–30 kg/m2     1.01 (0.85, 1.20)      0.83 (0.64, 1.07)          0.71 (0.51, 1.12)

HR hazard ratio, CI confidence interval, SBP systolic blood pressure, DBP diastolic blood pressure, BMI body mass index, HDL high density lipoprotein, LDL low density lipoprotein
The comparisons of the results for the usual survival analysis (relating
exposure at baseline to mortality) and g-estimation shed some light on the likely
mechanisms for each exposure. Diabetes at baseline was associated with a hazard
ratio of 2.04 (Tilling et al. 2002). The g-estimated hazard rate ratio for time-varying
diabetes (1.62) was weaker than that for baseline diabetes, indicating that the
cumulative effect of diabetes is stronger than the instantaneous effect. The time-
varying effect of diabetes was underestimated by the standard analysis (hazard
ratio = 1.26). The g-estimated hazard ratio for systolic blood pressure was again
weaker than the baseline effect, showing that the effect of blood pressure on
mortality was long-term rather than instantaneous. G-estimation and Weibull anal-
ysis showed a higher risk of death for those with low BMI and no evidence of
increased mortality among subjects with high BMI. The validity of G-estimation
depends on there being no unmeasured confounders. Confounders not included
here, such as comorbid conditions, may influence the relation between BMI and
mortality. Alternatively, BMI may have a cumulative effect, and so short-term
changes in weight (assessed by these time-varying models) have a different relation
to mortality than long-term weight.
For blood pressure and diabetes, the time-varying effects of exposure were
underestimated by the usual survival analysis, whereas the adverse effect of low
BMI appeared to be over-estimated by the usual survival analysis. Thus the time-
varying confounding present in this example led to biases in the estimation of the
effects of time-varying exposures. The confidence intervals for the g-estimated hazard
ratios were wider than those for the Weibull estimates, because g-estimation discards
information when dichotomising the outcome variable to deal with censoring.
G-estimation (as described above) assumes a binary exposure. The effect of trichot-
omous exposures on outcome has been estimated using g-estimation and an itera-
tive procedure (Tilling et al. 2002). For each exposure, the middle category was
chosen as the reference. One of the other two categories was selected, and the effect
of the dichotomous exposure defined by that category and the middle category
estimated using g-estimation. This estimate was then included as a fixed value in
the g-estimation of the effect of the dichotomous exposure defined by the third
category and the middle category. This procedure was iterated to convergence.
The standard errors for the effects of variables with three categories estimated in
this way may be under-estimated, because each iteration assumes that the effect of
the other category on survival is known (rather than estimated). Ideally, both
parameters should be estimated simultaneously and a 95% confidence region for
their joint distribution calculated. However, this has not yet been carried out in
practice. Similarly, there has to date been no extension of g-estimation to continu-
ous exposures.
Marginal structural models (MSMs) are one type of alternative to g-estimation for
analysing longitudinal data (Hernan et al. 2000, 2002; Young et al. 2010). In these
models each observation is weighted by the inverse of the probability of the observed exposure given past covariate and exposure history, and a model is then fitted to the weighted data.
References
Amuzu, A., Carson, C., Watt, H. C., Lawlor, D. A., & Ebrahim, S. (2009). Influence of area and
individual lifecourse deprivation on health behaviours: Findings from the British Women’s
Heart and Health Study. European Journal of Cardiovascular Prevention and Rehabilitation,
16(2), 169–173.
Ben-Shlomo, Y. (2007). Rising to the challenges and opportunities of life course epidemiology.
International Journal of Epidemiology, 36(3), 481–483.
Ben-Shlomo, Y., & Kuh, D. (2002). A life course approach to chronic disease
epidemiology: Conceptual models, empirical challenges and interdisciplinary perspectives.
International Journal of Epidemiology, 31(2), 285–293.
Dawid, A. P., & Didelez, V. (2010). Identifying the consequences of dynamic treatment
strategies: A decision theoretic overview. Statistics Surveys, 4, 184–231.
Glymour, M. M., Avendano, M., Haas, S., & Berkman, L. F. (2008). Lifecourse social conditions
and racial disparities in incidence of first stroke. Annals of Epidemiology, 18(12), 904–912.
Goetgeluk, S., Vansteelandt, S., & Goetghebeur, E. (2008). Estimation of controlled direct effects.
Journal of the Royal Statistical Society Series B-Statistical Methodology, 70, 1049–1066.
Hernan, M. A., Brumback, B., & Robins, J. M. (2000). Marginal structural models to estimate the
causal effect of zidovudine on the survival of HIV-positive men. Epidemiology, 11(5),
561–570.
Hernan, M. A., Brumback, B., & Robins, J. M. (2001). Marginal structural models to estimate the
joint causal effect of nonrandomized treatments. Journal of the American Statistical Associa-
tion, 96(454), 440–448.
Hernan, M. A., Brumback, B. A., & Robins, J. M. (2002). Estimating the causal effect of
zidovudine on CD4 count with a marginal structural model for repeated measures. Statistics
in Medicine, 21(12), 1689–1709.
Hernan, M. A., Hernandez-Diaz, S., & Robins, J. M. (2004). A structural approach to selection
bias. Epidemiology, 15(5), 615–625.
Hernan, M. A., Cole, S. R., Margolick, J., Cohen, M., & Robins, J. M. (2005). Structural
accelerated failure time models for survival analysis in studies with time-varying treatments.
Pharmacoepidemiology and Drug Safety, 14(7), 477–491.
Hernan, M. A., Lanoy, E., Costagliola, D., & Robins, J. M. (2006). Comparison of dynamic
treatment regimes via inverse probability weighting. Basic & Clinical Pharmacology &
Toxicology, 98(3), 237–242.
Joffe, M. M., Hoover, D. R., Jacobson, L. P., Kingsley, L., Chmiel, J. S., & Visscher, B. R. (1997).
Effect of treatment with zidovudine on subsequent incidence of Kaposi’s sarcoma.
Clinical Infectious Diseases, 25(5), 1125–1133.
Joffe, M. K., Hoover, D. R., Jacobson, L. P., Kingsley, L., Chmiel, J. S., Visscher, B. R., & Robins,
J. M. (1998). Estimating the effect of zidovudine on Kaposi’s sarcoma from observational data
using a rank preserving structural failure-time model. Statistics in Medicine, 17, 1073–1102.
Keiding, N., Filiberti, M., Esbjerg, S., Robins, J. M., & Jacobsen, N. (1999). The graft versus
leukemia effect after bone marrow transplantation: A case study using structural nested failure
time models. Biometrics, 55(1), 23–28.
Korhonen, P. A., Laird, N. M., & Palmgren, J. (1999). Correcting for non-compliance in
randomized trials: An application to the ATBC Study. Statistics in Medicine, 18(21),
2879–2897.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.).
Hoboken: Wiley.
Lok, J., Gill, R., van der Vaart, A., & Robins, J. (2004). Estimating the causal effect of a
time-varying treatment on time-to-event using structural nested failure time models. Statistica
Neerlandica, 58(3), 271–295.
Mark, S. D., & Robins, J. M. (1993). Estimating the causal effect of smoking cessation in
the presence of confounding factors using a rank preserving structural failure time model.
Statistics in Medicine, 12(17), 1605–1628.
Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained
exposure period – application to control of the healthy worker survivor effect. Mathematical
Modelling, 7(9–12), 1393–1512.
Robins, J. M. (1992). Estimation of the time-dependent accelerated failure time model in the
presence of confounding factors. Biometrika, 79(2), 321–334.
Robins, J. M. (2004). Optimal structural nested models for optimal sequential decisions. In D. Y.
Lin & P. Heagerty (Eds.), Proceedings of the Second Seattle Symposium on Biostatistics
(pp. 189–326). New York: Springer.
Robins, J. M. (2008). Causal models for estimating the effects of weight gain on mortality.
International Journal of Obesity, 32(Suppl 3), S15–S41.
Robins, J. M., Blevins, D., Ritter, G., & Wulfsohn, M. (1992a). G-estimation of the effect of
prophylaxis therapy for Pneumocystis carinii pneumonia on the survival of AIDS patients.
Epidemiology, 3(4), 319–336.
Robins, J. M., Mark, S. D., & Newey, W. K. (1992b). Estimating exposure effects by modelling the
expectation of exposure conditional on confounders. Biometrics, 48(2), 479–495.
Robins, J. M., Greenland, S., & Hu, F. C. (1999). Estimation of the causal effect of a time-varying
exposure on the marginal mean of a repeated binary outcome. Journal of the American
Statistical Association, 94(447), 687–700.
Robins, J. M., Hernan, M. A., & Brumback, B. (2000). Marginal structural models and causal
inference in epidemiology. Epidemiology, 11(5), 550–560.
Robins, J. M., Hernan, M. A., & Rotnitzky, A. (2007). Effect modification by time-varying
covariates. American Journal of Epidemiology, 166(9), 994–1002.
Snowden, J. M., Rose, S., & Mortimer, K. M. (2011). Implementation of G-computation on a
simulated data set: Demonstration of a causal inference technique. American Journal of
Epidemiology, 173(7), 731–738.
Stata Corporation. (2007). College Station, Texas.
Sterne, J., & Tilling, K. (2002). G-estimation of causal effects, allowing for time-varying
confounding. The Stata Journal, 2(2), 164–182.
Sterne, J. A., Hernan, M. A., Ledergerber, B., Tilling, K., Weber, R., Sendi, P., Rickenbach, M.,
Robins, J. M., & Egger, M. (2005). Long-term effectiveness of potent antiretroviral therapy in
preventing AIDS and death: A prospective cohort study. The Lancet, 366(9483), 378–384.
Tanaka, Y., Matsuyama, Y., & Ohashi, Y. (2008). Estimation of treatment effect adjusting for
treatment changes using the intensity score method: Application to a large primary prevention
study for coronary events (MEGA study). Statistics in Medicine, 27(10), 1718–1733.
Taubman, S. L., Robins, J. M., Mittleman, M. A., & Hernan, M. A. (2009). Intervening on risk
factors for coronary heart disease: An application of the parametric g-formula. International
Journal of Epidemiology, 38(6), 1599–1611.
Tehranifar, P., Liao, Y., Ferris, J. S., & Terry, M. B. (2009). Life course socioeconomic conditions,
passive tobacco exposures and cigarette smoking in a multiethnic birth cohort of U.S. women.
Cancer Causes & Control, 20(6), 867–876.
Tennant, P. W., Gibson, G. J., & Pearce, M. S. (2008). Lifecourse predictors of adult respiratory
function: Results from the Newcastle Thousand Families Study. Thorax, 63(9), 823–830.
Tilling, K., Sterne, J. A., & Szklo, M. (2002). Estimating the effect of cardiovascular risk factors
on all-cause mortality and incidence of coronary heart disease using G-estimation: The
atherosclerosis risk in communities study. American Journal of Epidemiology, 155(8),
710–718.
Toh, S., & Hernan, M. A. (2008). Causal inference from longitudinal studies with baseline
randomization. The International Journal of Biostatistics, 4(1), Article 22.
Vansteelandt, S., Goetgeluk, S., Lutz, S., Waldman, I., Lyon, H., Schadt, E. E., Weiss, S. T.,
& Lange, C. (2009). On the adjustment for covariates in genetic association analysis: A novel,
simple principle to infer direct causal effects. Genetic Epidemiology, 33(5), 394–405.
Witteman, J. C., D’Agostino, R. B., Stijnen, T., Kannel, W. B., Cobb, J. C., de Ridder, M. A.,
Hofman, A., & Robins, J. M. (1998). G-estimation of causal effects: Isolated systolic hyper-
tension and cardiovascular death in the Framingham Heart Study. American Journal
of Epidemiology, 148(4), 390–401.
Yamaguchi, T., & Ohashi, Y. (2004). Adjusting for differential proportions of second-line
treatment in cancer clinical trials. Part I: Structural nested models and marginal structural
models to test and estimate treatment arm effects. Statistics in Medicine, 23(13), 1991–2003.
Young, J. G., Hernan, M. A., Picciotto, S., & Robins, J. M. (2010). Relation between three classes
of structural models for the effect of a time-varying exposure on survival. Lifetime Data
Analysis, 16(1), 71–84.
Chapter 15
Generalised Additive Models
Robert M. West
15.1 Introduction
interactions is even more challenging than for a single main effect. The techniques
available with GAMs provide a suitable means to tackle this ferocious challenge.
The first step however is to focus on the challenges of nonlinearity for a single
main effect. Throughout this chapter data from a study of sympathetic nerve
activity has been selected to illustrate issues and procedures.
Sympathetic nerve activity (sna) is known to increase with age and so is a conve-
nient example for the topic of this chapter. Further, there is a complex relationship
with systolic blood pressure (sbp) as well, so that there are two continuous covariates
to explore in models of sna (Burns et al. 2007). The setting for this example is a study
where 172 volunteers were recruited in order to investigate certain aspects of the
variation of sna between individuals. For simplicity here only the effects of sex, age,
and sbp on sna will be considered, and although the causal relationship might be
debated, in this and the next chapter, sna is taken to depend upon the other variables.
The outcome sna is a measurement on a continuous scale, sex is a dichotomous
covariate (factor), and as mentioned above, age and sbp are continuous covariates.
A basic model will fit just linear terms as covariates. All modelling will be
undertaken in R since this statistical language is widely available (R Development
Core Team 2010) and has good capabilities, once the relevant libraries have been
downloaded. In R, models are specified by notation suggested by Wilkinson and
Rogers (1973), which is straightforward to follow. The basic model is specified by
sna ~ as.factor(sex) + age + sbp and the fitted model yields the results
given in Table 15.1.
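A minimal sketch of this fit, assuming (purely for illustration) that the variables are held in a data frame called snadata and that ordinary least squares via lm() is used:

  # Basic model: linear terms for age and sbp, sex as a factor
  fit.basic <- lm(sna ~ as.factor(sex) + age + sbp, data = snadata)
  summary(fit.basic)                                      # coefficients and adjusted R-squared
  termplot(fit.basic, partial.resid = TRUE, rug = TRUE)   # term plots with rugs, cf. Figs. 15.1 and 15.2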
Note that for this model the adjusted R2 is 0.60: 60% of variation is
explained by the model. The errors were also explored through plots, and it was
seen that the residual plot against the fitted values, the normal QQ plot, the scale
location plot and the leverage plots were all satisfactory. This is also true for all the
subsequent residual plots in this chapter.
For this basic model, the effects of age and sbp are clear: they are simply linear
terms. For completeness, and to permit comparison with later plots, graphical
representations are provided in Figs. 15.1 and 15.2. These include rug plots along
the abscissas to indicate for which ages and SBPs measurements of sna have been
recorded. Note also the ranges of the ordinates.
The use of discretisation of age into age groups has a long history. John Graunt
(1662) is one of the earliest to publish material (life tables) and establish this
methodology that has been exploited to great effect by modern insurance
companies as just one example. As more data is available, the width of the age
groups can be diminished, any errors due to discretisation will be minimal, and age
can be considered to be modelled sufficiently well.
Such fine discretisation is not however undertaken throughout epidemiology,
even when sample sizes are large: widths of age groups of 5 or 10 years can be
found. There is an issue of parsimony in the model. If m age groups are to be used
then (m-1) variables are required to model age. Then a polynomial of degree (m-1)
is just as parsimonious and should be considered, see Sect. 15.4.
Discretisation might be favoured for reasons of interpretation, especially with
logistic regression. For example age might be discretised as: Under 60, 60–69,
70–85, and Over 85 years. Then logistic regression delivers three odds ratios compar-
ing the odds of mortality for persons in the three older groups with those in the
youngest group. Interpretation is very simple in relation to the age effect. Effectively
though, age has been modelled as a step function. An individual of age 69 steps up
their risk on their seventieth birthday. There is certainly a discretisation error.
The main concern though is that inaccuracy in modelling age will result in inaccuracy
and bias for the role of other covariates including the exposure of interest.
Another concern about discretisation is that the number of groups and the group
boundaries need to be chosen. There may be clinical or political reasons for
specifying boundaries, such as achieving adult status at age 18, achieving retire-
ment age at 65, etc. The results achieved for all covariate coefficients will differ
when boundaries are changed. From a modelling perspective, the boundaries may
be chosen for example by minimising the Akaike Information Criterion (AIC),
although this may lead to what seem strange boundary values that once more lead to
interpretation difficulties, albeit a fascinating challenge to obtain an interpretation.
Where there is a choice of the number of groups and their boundaries,
there is ‘temptation’ to choose them to deliver the coefficient values of other
covariates that are most favoured—especially if the main exposure has a coefficient
close to statistical significance, but these issues are always present in complex
modelling situations.
When a continuous variable is discretised, it is easy to define a further category
of ‘missing’ when values are not recorded for some participants. This has great
appeal if such an approach is appropriate for the modelling of missing values—for
example where values are missing at random. In other circumstances however this
could be disastrous. Consider the case where age is withheld by either the very
young or very old for reasons of identifiability of those with a rare disease.
Then such a category is misleading and it might be more appropriate to consider
an imputation technique to handle missingness.
It is possible that some continuous covariates are discretised due to doubt about
their true nature. An example might be a score from a psychometric test, which is
not truly continuous, being the sum of (weighted) responses to a questionnaire.
The basic model provided a fit for all three covariates with highly significant values
for the three coefficients, but nonlinearity does potentially exist and an investiga-
tion is warranted. Here both age and sbp are discretised into five categories forming
the new variables agegp and sbpgp. Cut points for age were taken as 30, 40, 50, and
60 years. Those for sbp were taken as 120, 140, 160, 180 mm Hg. For both variables
the categories are all reasonably evenly populated whilst the cut points are easy to
interpret. Values of sbp above 140 mm Hg suggest hypertension and so 140 mm Hg
has some clinical meaning.
The model is specified by sna ~ as.factor(sex) + as.factor(agegp) + as.factor(sbpgp), and the results are given in Table 15.2.
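One way of constructing the discretised covariates and fitting this model (a sketch, again using the assumed data frame snadata; which side of a boundary a cut-point value falls on is an arbitrary choice here):

  # Five groups (coded 0-4) using the stated cut points
  snadata$agegp <- cut(snadata$age, breaks = c(-Inf, 30, 40, 50, 60, Inf), labels = FALSE) - 1
  snadata$sbpgp <- cut(snadata$sbp, breaks = c(-Inf, 120, 140, 160, 180, Inf), labels = FALSE) - 1
  fit.gp <- lm(sna ~ as.factor(sex) + as.factor(agegp) + as.factor(sbpgp), data = snadata)
  summary(fit.gp)$adj.r.squared   # about 0.63 in this example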
The contributions of the covariates age and sbp are expressed graphically in
Figs. 15.3 and 15.4. Note that one clear effect is that the ranges of the effects are
much reduced from those in the basic model: compare the graphs. For this model
with discretised covariates, the adjusted R2 = 0.63, so that on the basis of the
proportion of variation that is represented, the model with discretised age and sbp is
preferred to the basic model.
From Table 15.2 it is strongly tempting to coalesce some categories, thus
improving the adjusted R2. In particular the exact match of the category boundary
for sbp with the definition of hypertension sbp ¼ 140 mm Hg, is extremely
tempting. Such a data-driven approach however can be regarded as over-fitting to
the dataset. If discretisation is to be employed, it is advisable to fix the boundaries
of all categories before fitting to the data.
Fig. 15.3 Partial effect of discretised age, as.factor(agegp)
Fig. 15.4 Partial effect of discretised sbp, as.factor(sbpgp)
For the illustrative example, orthogonal polynomials were chosen, the formula in the
R code being sna ~ as.factor(sex) + poly(age,3) + poly(sbp,3).
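Written out in full, the call might look like this (snadata again being the assumed data frame):

  # Cubic orthogonal polynomials for age and sbp
  fit.poly <- lm(sna ~ as.factor(sex) + poly(age, 3) + poly(sbp, 3), data = snadata)
  summary(fit.poly)                                      # coefficients as in Table 15.3
  termplot(fit.poly, partial.resid = TRUE, rug = TRUE)   # cf. Figs. 15.5 and 15.6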
From Table 15.3, the impact of the covariates on the outcome sna is not
immediately clear. This is where graphical representations become important.
Figures 15.5 and 15.6 demonstrate the effect of age and sbp effectively. Comparing
the graphical figures for each of the models that have been fitted, it appears that the
effect of age gives the largest range of effect in the model with higher-order terms,
the youngest age resulting in a sizable decrease in sna: see Sect. 15.6 below for
further comment.
Inspecting Fig. 15.6, the final downturn in the effect of SBP can be seen from
the rug plot to be based on just a few measurements where sbp is above 200 mm Hg.
Considering also the marginal statistical significance (p = 0.0744) of the cubic
term for sbp, many might consider refitting with only a quadratic polynomial
for sbp. The cubic representation is chosen here to identify that there is an
issue of how best to identify the degree of polynomial representations of covariates
in general: this issue is dealt with in Sect. 15.8.
Fig. 15.5 Partial effect of age in the model with higher-order (polynomial) terms
Note that the model fitted with higher-order terms has two fewer parameters than the model for which the continuous covariates have been discretised. It is not only more parsimonious, but has an adjusted R2 = 0.65, up from 0.63.
15.6 Splines
The complexity of the relationship between the continuous covariate and the
modelled outcome may be efficiently represented using splines. These are low-
order polynomials that are fitted locally but joined at knots smoothly, meaning that
at the knots the function represented by the spline, and perhaps also some of its
derivatives, are continuous. There are also advantages of numerical stability.
The term spline derives from thin strips of flexible wood that have been used in
construction to represent complex smooth curves. Fitting splines to covariates can
be thought of as taking a nonparametric approach.
In the few situations where a small extrapolation might be considered, splines
can often provide less extreme behaviour immediately beyond the range of the
covariate. Note that this was a concern in the example above, where the model with
higher-order terms predicted very low sna for the youngest subjects of the study.
Similarly the sharp decline of sna with increasing sbp above 200 mm Hg provides a
further reason to reconsider the model that was fitted. Runge’s phenomenon, Runge
(1901), which occurs with higher-order polynomials can become problematic.
A very nice overview of splines together with a discussion is provided by Eilers
and Marx (1996).
There are many ways to specify a basis for a spline fit, Wahba (1990), some
examples are B-splines de Boor (1978), P-splines Eilers and Marx (1996), natural
cubic splines, and O’Sullivan splines O’Sullivan et al. (1986). The order of the
spline approximation must be chosen, as must the number and the location of knots.
Penalised splines can be employed, see Sect. 15.8, and then further parameters are
involved: the smoothness parameter and the derivative to be smoothed. Smoothing
is not considered in this introductory section, but deferred until Sect. 15.8. Knots
are often evenly spaced, or placed at certain percentiles of the covariate.
To illustrate the use of splines, natural cubic splines are selected. A single internal
interpolation point is chosen as the median (50th percentile) for each of the two
covariates. The end points of the range of a covariate are automatically used as
knots, and without internal knots the spline degenerates to a polynomial fit.
The formula for use with R is sna ~ as.factor(sex) + ns(age,knots = median(age)) + ns(sbp,knots = median(sbp)).
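The ns() function is provided by the splines package, so a sketch of the full call (with the assumed data frame snadata) is:

  library(splines)
  # Natural cubic splines with a single internal knot at the median of each covariate
  fit.ns <- lm(sna ~ as.factor(sex) + ns(age, knots = median(age)) +
                 ns(sbp, knots = median(sbp)), data = snadata)
  summary(fit.ns)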
In tabulated form, the results of the fit are provided in Table 15.4. It is noted that
the fit is not so satisfactory, with the adjusted R2 = 0.62. The effects of the
covariates are given graphically in Figs. 15.7 and 15.8.
Table 15.4 Table of coefficients for model with spline fits for covariates

Coefficient                       Estimate   95% CI          p-value
Intercept                         7.6        (−0.6, 15.8)    0.0689
Male                              6.2        (2.8, 9.7)      0.000480
ns(age, knots = median(age)) 1    57.1       (41.5, 72.6)    1.63e-11
ns(age, knots = median(age)) 2    20.7       (11.2, 30.2)    2.80e-5
ns(sbp, knots = median(sbp)) 1    46.0       (29.0, 63.1)    3.33e-7
ns(sbp, knots = median(sbp)) 2    20.7       (10.4, 31.0)    0.000109
Fig. 15.7 Partial effect of age from the natural cubic spline fit
Fig. 15.8 Partial effect of sbp from the natural cubic spline fit
By comparison of Figs. 15.7 and 15.8 with preceding ones, it can be seen that
this particular spline fit gives rather different results for the effects of age and sbp
than the other models considered. The fit is better than that of the basic model, but it
is clear that there are challenges in finding the best spline representation. Those
providing libraries for GAMs have also provided tools to make spline selection
much easier and much more efficient: see Sect. 15.8.
Generalised additive models have continued to receive attention since their intro-
duction by Hastie and Tibshirani, see Hastie and Tibshirani (1986, 1990). Additive
models are ones where the effects of each covariate are added: there are no interac-
tion terms and so the additivity of effects is assumed. This chapter focusses on
nonlinearities whilst Chap. 16 enables the exploration of interactions. Hence here
the initial attention has been to the representation of the effect of each covariate with
a graphical representation of that effect to enable interpretation. GAMs continue this
theme. The generalised term simply refers to the fact that the methodology of
additive models (spline fits to covariates) can be just as easily applied to generalised
models, such as logistic regression, as well as it can be applied to linear regression.
Given the large number of parameters that need to be selected for a spline fit,
tools to provide automated choices save considerable effort and can provide some
objectivity. The principle of parsimony where a model with fewer parameters
is preferred to a more complex model is often to the forefront of automated
procedures. A statistical epidemiologist will be concerned with estimating the
effects of each covariate rather than intricate and subtle choices of parameters in
spline fitting and will want to utilise developed software tools with automated
choices rather than lavish time and resources on a general spline fit. There is
software available to fit GAMs in several statistical packages but here attention is
restricted to three libraries that are available in R and which provide more than
enough material for discussion in a single book chapter.
With the same function in the R::gam library it is possible to fit penalised
splines (smooths). The target number of degrees of freedom needs to be specified.
Rather than expand this aspect here, smooths are considered with the R::mgcv
package discussed in Sect. 15.11.
The model was reformulated to include loess representation of the two continuous covariates through the formula sna ~ as.factor(sex) + lo(age) + lo(sbp). The fit is excellent with the adjusted R2 = 0.66, and the significance of terms is given in Table 15.5, with the nonlinear nonparametric effects of the covariates shown in Figs. 15.9 and 15.10. Note that there is a facility to display the partial deviance residuals, which was exploited, and that upper and lower pointwise twice-standard-error curves were included.
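A sketch of the corresponding call, using the gam library and the assumed data frame snadata:

  library(gam)
  # Loess smooths for the two continuous covariates
  fit.lo <- gam::gam(sna ~ as.factor(sex) + lo(age) + lo(sbp), data = snadata)
  summary(fit.lo)
  plot(fit.lo, residuals = TRUE, se = TRUE)   # term plots with partial residuals and 2*SE curves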
The fits obtained by R::gam::gam provide good material for an epidemiologist
to consider. The main features of the fits should be explained. Smaller details that
lead to a little jaggedness might be ignored in many cases. This approach to
interpretation suggests that a smoother fit might be warranted.
Fig. 15.9 Term plot for model with loess spline fit for age
Fig. 15.10 Term plot for model with loess spline fit for sbp, lo(sbp)
The formula needed to indicate smooths for age and sbp that is used in R::mgcv::gam is sna ~ as.factor(sex) + s(age) + s(sbp), which reports significance of smooths as in Table 15.6. The effects are shown graphically in Figs. 15.11 and 15.12. Note the great similarity to the results with loess smoothing, although of course the representation of each covariate effect is much smoother, and perhaps therefore more credible in some circumstances. Partial residuals are shown, as are ‘twice standard error’ curves.
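A sketch of the corresponding call with mgcv (note that mgcv and gam both provide a function named gam, so the package is specified explicitly here; snadata is again the assumed data frame):

  library(mgcv)
  # Penalised regression spline smooths; smoothness is chosen automatically
  fit.s <- mgcv::gam(sna ~ as.factor(sex) + s(age) + s(sbp), data = snadata)
  summary(fit.s)
  plot(fit.s, residuals = TRUE, se = TRUE, pages = 1)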
Fig. 15.11 Term plot for the model with a smooth for age, s(age, 4.29)
The fit with a two-dimensional spline for age and sbp gives the best fit to date, with adjusted R2 = 0.69 (Table 15.7). Thus there is evidence of an interaction between age and sbp: see Chap. 16, where this interaction effect is considered further.
Figure 15.13 shows that there are no younger participants with hypertension
(high values of sbp) and no older participants with sbp in the normal range.
This might have been a property of the recruiting strategy, or it may be that older
people who volunteer for studies tend to have higher systolic blood pressure.
The study is cross-sectional rather than longitudinal but there are longitudinal
explanations that account for the relationship. Sympathetic nerve activity tends to
increase with age and is higher for hypertensives. For younger participants with
higher sbp, the increase of sna with age is more rapid (contours closer together).
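A sketch of the two-dimensional fit with mgcv, corresponding to the surface labelled s(age, sbp, 12.82) in Fig. 15.13 (data frame name assumed as before):

  # Isotropic two-dimensional (thin-plate) smooth of age and sbp
  fit.2d <- mgcv::gam(sna ~ as.factor(sex) + s(age, sbp), data = snadata)
  summary(fit.2d)$r.sq      # adjusted R-squared, about 0.69 here
  plot(fit.2d, se = FALSE)  # contour plot of the joint smooth, cf. Fig. 15.13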
Fig. 15.13 Contour plot of the two-dimensional smooth s(age, sbp, 12.82): sbp plotted against age
Note that standard error curves were omitted: the plot is already complex and
needs full-colour treatment if further information is to be included. For the 2d plot,
the standard se curves are 1 standard error rather than 2 standard errors as with
the one-dimensional curves.
This chapter provides an introduction only to GAMs motivating their use through
exploration of nonlinearities in covariate effects. Here is a brief mention of further
aspects.
A third library is available in R, namely Vector Generalised and Additive Models (R::VGAM), see Yee and Wild (1996). This library has been made available by Thomas Yee and makes use of B-splines and O’Sullivan splines, which have certain advantages. The VGAM library is huge and there is a focus on multivariate outcomes for generalised linear models and generalised additive models.
Random effects can be included in GAMs through the function R::mgcv::
gamm. Thus GAMs can be used in a multilevel context.
Further description of the case study of sympathetic nerve activity was delayed until
this point as the primary interest was the methodology for exploring nonlinearities
in covariates. Exploring different models however often helps to develop under-
standing of a situation, indeed that is one of the aims of modelling.
From each of the models it is clear that both age and sbp make significant
contributions to sna, explaining well over 50% of the variation in results. Exploring
residuals revealed nothing unusual so that for this application there was no indica-
tion that a linear model was unsuitable as regards the distribution of residuals.
Discretisation of covariates provided little extra information other than indicating
that the effect of sbp was far from linear. It is possible that a different discretisation
would have produced different results: model fitting has challenges. Fitting higher-
order terms was found to be no easier. By contrast the procedures for fitting GAMs
made modelling far simpler.
Figures 15.10 and 15.12 show partial residuals. These again indicate that the
distribution of residuals satisfy distributional assumptions of normality and homo-
geneity of variance. It is revealed also that there may be some digit preference for
some of the participants: those with sbp values of 110, 120 and 130, possibly 100,
whereas for other values there is no evidence of digit preference. Possibly a
different sphygmomanometer was employed for these participants, at a time
when younger normotensive volunteers were recruited to the study.
References
Burns, J., Sivananthan, M. U., Ball, S. G., Mackintosh, A. F., Mary, D. A., & Greenwood, J. P.
(2007). Relationship between central sympathetic drive and magnetic resonance imaging-
determined left ventricular mass in essential hypertension. Circulation, 115, 1999–2005.
de Boor, C. (1978). A practical guide to splines. New York: Springer.
Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical
Science, 11, 89–121.
Graunt, J. (1662). Natural and political observations on the bills of mortality. London.
Hastie, T. J., & Tibshirani, R. J. (1986). Generalised additive models (with discussion). Statistical
Science, 1, 295–318.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalised additive models. Boca Raton: Chapman and
Hall/CRC.
Kennedy, W. J., & Gentle, J. E. (1980). Statistical computing. New York.
Miller, A. J. (2002). Subset selection in regression (2nd ed.). Boca Raton: Chapman and
Hall/CRC.
O’Sullivan, F., Yandell, B., & Raynor, W. (1986). Automatic smoothing of regression functions in
generalised linear models. Journal of the American Statistical Association, 18, 96–103.
R Development Core Team. (2010). R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL
http://www.R-project.org.
Runge, C. (1901). Uber empirische funktionen und die interpolation zwischen aquidistanten
ordinaten. Zeitschrift fur Mathematik und Physik, 46, 224–243.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.).
NewYork: Springer Science and Business Media.
Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.
Wilkinson, G., & Rogers, C. (1973). Symbolic description of factorial models for the analysis of
variance. Applied Statistics, 22, 329–399.
Wood, S. N. (2006). Generalised additive models: An introduction with R. Boca Raton: Chapman
and Hall/CRC.
Yee, T. W., & Wild, C. J. (1996). Vector generalised additive models. Journal of the Royal
Statistical Society, Series B, Methodological, 58, 481–493.
Chapter 16
Regression and Classification Trees
Robert M. West
16.1 Introduction
Interactions in (generalised) linear models can be difficult, mainly due to the fact
that there are so many potential interactions to consider when there are a number of
covariates and factors in the model. For example, if a generalised linear model has
four factors and each factor has four levels, then there are four main effects, six two-
way terms, four three-way terms, and one four-way term to be considered: the four
main effects and 11 interactions. Perhaps seven continuous covariates have been identified and it has been determined that they enter the model only in a linear manner (although, if interactions have not yet been identified, it is doubtful that there is any certainty about the linearity of covariates); then there is 1 seven-way product, 7
six-way products, 21 five-way products and so on. Considering all interactions in a
model with a large number of covariates and factors involves a huge number of
terms and hence extensive modelling time.
One common approach to ‘work around’ this is to decide that such complexity is
beyond reasonable modelling capabilities given the limited amount of data avail-
able. It is rare to encounter studies or trials that have been designed to investigate
full details of all possible interactions. In that case an additive assumption may be
made and additive models used: see Chap. 15 on generalised additive models.
The approach suggested in this chapter is to make use of regression trees and
classification trees according to the nature of the outcome variable. In general
specifying a model for datasets with multiple factors and covariates can be chal-
lenging. The idea here is to use trees to suggest viable models and in particular a
shortlist of appropriate potential interaction terms.
16.2 Trees
Regression trees, and classification trees, are simple models but are not familiar to all,
hence this section is included to give some background to them. The popularity of
classification trees and regression trees was greatly spurred by Breiman, Friedman,
Stone, and Olshen; see Breiman et al. (1984), and this remains an excellent source for
more detail. Trees will continue their popularity since there are numerous software
tools available, which are easy to apply, and the resultant tree models are simple to
implement and interpret, even for non-specialist users of statistics.
First consider regression trees for continuous outcomes. Fitting is done by succes-
sively splitting the data two ways according to a single covariate or factor: branching.
Start with the whole dataset and search through all the covariates and factors and all
the possible cut points. For example a continuous covariate age might be split at any
cut point that lies midway between two adjacent values for age. The split is chosen so
that the difference between the two groups so defined gives maximal difference in
the mean outcome variable. If the outcome is sympathetic nerve activity sna then the
mean values of sna of the two groups is calculated. The residual sums of squares
about those means are calculated and the two sums added to produce a residual sums
of squares for that split (partition). The best split is the one that produces the smallest
residual sum of squares: the means are maximally distinguished. The two groups
defined by the first step are then each split again by the same procedure of searching
through all variables and all cut points. The process continues and can continue until
there are only single values at the end of each branch.
Note that linearity of covariates is not an issue for regression trees: there is no
need to assume linearity, nor monotonicity. As will be seen, interactions can be
revealed and need not pose any difficulties.
Regression trees (and classification trees) are restricted in the manner in which
they are fitted. Splits can occur along only one axis (covariate or factor) at a time.
A better fit might be obtained by considering multiple splits at each step: that is, a
multiple split may reduce the residual sums of squares significantly whereas a split
on only one covariate or factor would achieve only a small RSS reduction.
16.4 R Libraries
Fig. 16.1 Tree model for sympathetic nerve activity. Here the expression as.factor(sex) = a is to be read as sex = female
Fig. 16.2 Plot of residuals from the tree model fitted by R::rpart::rpart
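The tree in Fig. 16.1 can be reproduced in outline by a call of the following kind (a sketch; snadata is the assumed data frame and the default rpart settings are used):

  library(rpart)
  # Regression tree for sympathetic nerve activity
  fit.rp <- rpart(sna ~ as.factor(sex) + age + sbp, data = snadata)
  plot(fit.rp); text(fit.rp)    # draw the tree with its split labels
  plot(residuals(fit.rp))       # residual checks, cf. Fig. 16.2
  plotcp(fit.rp)                # complexity plot, cf. Fig. 16.4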
Discussion of interactions has been delayed until after the example on sympathetic
nerve activity. In Fig. 16.1, the two most important covariates/factors are seen to be
sbp and age. It is the early splits that involve these and the height/depth of
Figure: normal QQ plot of residuals from the tree model fitted by R::rpart::rpart (sample quantiles against theoretical quantiles)
the branches for these splits is greater reflecting their greater contribution to
the reduction of the residual sum of squares. Often when an interaction occurs
between two variables, after the branch due to the first variable only one part of the
tree branches according to the second. Thus the second variable is of greater
importance for a certain range of the first: hence there is an interaction. In this
example both sides of the tree split by age after the split by sbp, but the split by age
is different on separate sides of the tree. Again an interaction is revealed, here
between age and sbp.
Trees with a few branches are especially easy to interpret. As there are more
branches, the complexity of the model is displayed. Thus trees are very useful for
determining:
• which covariates and factors should be considered
• where interactions between covariates might occur
• how complex the model might be.
A way to quantify the complexity of a tree model is available with the R::
rpart library through the function R::rpart::plotcp. Such a plot is provided
in Fig. 16.4 for the sympathetic nerve activity example. It shows the relative gain in
the reduction of the residual sums of squares (more broadly the objective function
selected for the fit) for additional splits.
Fig. 16.4 Complexity plot for the tree model: cross-validated relative error against tree complexity
As with a linear model, if the outcome variable is transformed, then the model is
changed: a different tree will be fitted since the residual sums of squares will
be defined in a different way. Hence if the outcome is the concentration of a
toxin in a urine sample, then the model will change if the logarithm of the
concentration is taken as the outcome. The issues are the same as in least squares
fitting of a linear model: are the residuals better represented on the original scale or
the transformed one.
One of the largest issues is how to determine the best tree model. The tree with
each terminal branch giving a unique value is in almost all circumstances over-
fitting the model.
The above fitting procedure can be regarded as the tree equivalent of linear
modelling, being based on least squares. The procedure is generalised by defining
deviance rather than a residual sum of squares for outcomes based on the binomial,
Poisson, multinomial etc.
Missing values can be handled easily in some circumstances. For factors,
missing values might be assigned another factor value/level. If missing values
occur at random, then this approach is justified. If values are missing at random
for a covariate, then a suitable weighting must be applied in order to compare splits
between different covariates. The challenge as always with modelling is how to
deal with missing values when missingness does not occur at random.
When there is a large dataset perhaps with many covariates and factors, then the
number of splits to investigate rapidly becomes very large. So there are computational
challenges for larger datasets. One option is to consider subsets of covariates and
factors – see Sect. 16.9 on Random Forests.
If the outcome variable is categorical, it has been mentioned that deviance can be
used to consider partitions of the tree. Two alternatives are entropy and the Gini
index. In the R::tree library, the default splitting method for R::tree::tree is
deviance and an average deviance across the terminal nodes is reported. The R::
rpart::rpart function from the R::rpart library uses the Gini index as its
default method.
For illustrative purposes, consider the new indicator variable defined from the
sympathetic nerve activity dataset, namely the variable ht which takes the value 0 if
sbp < 140 and 1 otherwise: that is ht indicates ‘hypertension’. Taking ht as the
outcome, a classification tree is required rather than a regression tree, but the
modelling process is of course similar. With the default settings for R::tree::
tree, the resulting model is shown in Fig. 16.5. Similar tree models are produced
irrespective of the method of splitting, but the default settings provide different
amounts of pruning. The deviance results, being the default for the R::tree::tree
function are displayed in Fig. 16.5.
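A sketch of the corresponding R code (names assumed as before; sbp itself is left out of the predictors because ht is defined from it):

  library(tree)
  # Indicator of 'hypertension' and a classification tree for it
  snadata$ht <- factor(ifelse(snadata$sbp < 140, 0, 1))
  fit.ct <- tree(ht ~ as.factor(sex) + sna + age, data = snadata)
  plot(fit.ct); text(fit.ct)    # default splitting criterion is deviance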
It is interesting to note that:
• the proportion of hypertensives is displayed at the terminal nodes
• sympathetic nerve activity sna and age are useful predictors of hypertension and
interact
• The factor sex is not needed by the model and so has little impact on hyperten-
sive status.
Close inspection of Fig. 16.5 with clinical expertise would suggest that further
pruning would be merited, thus as seen with tree regression the default settings in
this scenario provide a good starting point for the model but there is further work
to be done.
As the number of covariates and factors increases, regression trees and classifica-
tion tree become more difficult to handle. One way to facilitate data fitting would
be to take subsets of the variables and fit with those. There are greater advantages
obtained by taking this idea further and using random forests, see Breiman (2001).
many trees. The clearest outcomes from fitting a random forest are the importance
of covariates and the proportion of variation that might be achieved. See the
example below in Sect. 16.10.
The example of sympathetic nerve activity is not an ideal application for random
forests since there is a very limited number of covariates and factors: just three.
For completeness and to illustrate performance, a random forest was fitted using
the function R::randomForest::randomForest. The sample size in each bootstrap, n1, was taken as 130 (the original sample size is 172), the default number of trees n2 = 500 was used, and the number of variables to consider at each node, n3,
was restricted to 2. The resulting forest explains 62% of variation. Figure 16.6
shows how the error reduces as the number of trees in the forest increases: such a
plot is a useful tool for indicating the number of trees to sample, in this case 200
trees appear sufficient. Figures 16.7 and 16.8 show how predictions vary from observed values (Fig. 16.8, for example, plots observed (o) and fitted (+) values against sbp). The importance of the covariates and factor is given in
Table 16.1. It is clear that the two covariates sna and age are the most important, in agreement with the results from the regression tree and the generalised additive models fitted in Chap. 15.
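A sketch of how such a forest might be fitted, using the settings reported above and again assuming the hypothetical nerve data frame:

library(randomForest)

set.seed(1)                                   # results vary between runs without a seed
rf <- randomForest(sbp ~ sna + age + sex, data = nerve,
                   sampsize = 130,            # size of each bootstrap sample (n1)
                   ntree    = 500,            # number of trees (n2)
                   mtry     = 2)              # variables tried at each split (n3)

print(rf)                                     # includes the percentage of variation explained
plot(rf)                                      # error against number of trees (cf. Fig. 16.6)
importance(rf)                                # variable importance (cf. Table 16.1)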
The subject of regression trees and, in particular, classification trees is of interest to computer scientists as well as statisticians, and is associated with pattern recognition, classification, machine learning, and data mining. Consequently there are a large number of interested researchers developing the techniques and a wide variety of methods that can be exploited. Just a couple of alternative approaches are considered here.
Boosting is a technique that can be used with a range of prediction methods
including GAMs and trees, but is perhaps most commonly applied to trees. The idea
of boosting for trees is to compute a sequence of simple trees with very few nodes,
where each successive tree is built for the residuals of the preceding tree. Random
forests can be thought to work in parallel and tree boosting thought to work in
series. If a tree has difficulty with some cases then these will have larger residuals
and have greater influence in the next tree, hence there is some focus on the more
difficult cases.
Boosting for trees is implemented in R and one library is R::gbm based on the
work of Friedman (2001) and others; see Hastie et al. (2001). Initial experimentation with the dataset for sympathetic nerve activity achieves a fit that is not very good in terms of least squares. Importance is reported, with sna being more important than age but sex not entering the model. There is little interpretable output.
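A minimal sketch of such a boosted tree fit with R::gbm follows; the tuning values are illustrative only, not those used in the experimentation described above:

library(gbm)

boost <- gbm(sbp ~ sna + age + sex, data = nerve,
             distribution = "gaussian",       # squared-error loss for a continuous outcome
             n.trees = 2000,                  # number of small trees fitted in sequence
             interaction.depth = 1,           # stumps: each tree has a single split
             shrinkage = 0.01)                # learning rate

summary(boost)                                # relative influence of each covariate/factor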
Neural networks have often been used as black-box predictors and there is the opportunity to use them in R through the library R::nnet. The black-box nature provides no understanding, but neither is any knowledge of the process necessary. One may wish to use R::scale() in order to standardise the scale of all variables before use. Neural nets can be effective for large datasets and are often good for prediction in those circumstances.
It is possible to plot marginal effects to obtain a little understanding of the effect
of each covariate or factor. It can also be the case that a neural network model
produces a very different model from a statistical model and this prompts effort to
determine why this might be so.
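A sketch of a small single-hidden-layer network via R::nnet, standardising the inputs with scale() as suggested; the settings are illustrative only and the nerve data frame is again assumed:

library(nnet)

# Build a numeric design matrix (sex expanded to a dummy variable) and standardise it
X <- scale(model.matrix(~ sna + age + sex, data = nerve)[, -1])

set.seed(1)
fit_nn <- nnet(X, nerve$sbp, size = 3,        # 3 hidden units
               linout = TRUE,                 # linear output for a continuous outcome
               decay = 0.01, maxit = 500)     # weight decay and iteration limit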
Regression trees and classification trees were suggested as tools for assessing the appropriateness of covariates and factors, together with their interactions, for linear models. The R libraries R::tree and R::rpart were briefly introduced, and there is much software available for fitting tree models. Random forests, boosting and neural networks can also have benefits, but their limited interpretability restricts their use in statistical epidemiology.
17.1 Overview
logit(p) = ln[p / (1 − p)] = γ₀ + γ₁x₁ + γ₂x₂ + ξ   (17.1)
Consider Factor V Leiden genetic mutation again, where individuals with the
mutation have more coagulants in their blood and are thus at greater risk of DVT.
Genetically normal individuals can develop DVT, but the risk is elevated amongst
individuals with the genetic mutation. Amongst women, exposure to the combined
oral contraceptive pill (COCP) brings about changes in hormone levels that can also
raise levels of coagulant proteins and lower levels of anticoagulant proteins, thereby
yielding an elevated risk of DVT (Vandenbroucke et al. 1996). Considering the
joint action of the genetic mutation and the environmental impact of the combined
pill, both exposures are binary (i.e. present or absent), as too is the outcome (DVT
or no DVT), yet the putative causal process is best captured by the underlying risk
of developing DVT, which is a continuum between zero and one. The genetic and
environmental exposures operate jointly to affect the risk of DVT along this
continuum, and the impact of either genetic or environmental exposure on the
risk of DVT probably varies from woman to woman. To use a statistical model to
describe this situation, one ought to acknowledge the continuous underlying latent
risk and random variation amongst women. However, more often than not, a logistic model (with binary outcome of DVT present/absent) with two binary covariates (genetic and environmental exposures present/absent) is adopted, which is an oversimplification of the underlying biology. The question becomes: what is the biological interpretation of the gene-environment statistical interaction in this scenario?
y = β₀ + β₁x₁ + β₂x₂ + ε   (17.2)
17.3.1 Example
We consider models with and without statistical interaction for these two link
functions and illustrate the impact of model scale on perception of presence or
absence of statistical interaction and its associated magnitude. This is a simple
illustration of how we perceive and hence interpret the additive and multiplicative
scales in linear models. For the example, initially we consider the continuous outcome Blood Pressure (BP) measured in millimetres of mercury (mmHg). This is dichotomised across the threshold of 160 mmHg to create the binary outcome Hypertension (Hyp). We consider two covariates: a binary genetic variable (G) indicating whether an individual carries a genetic mutation predisposing to hypertension (coded one if present, zero otherwise); and an environmental variable (E) recorded as binary to indicate high or low salt intake (coded one or zero respectively). We assume that both the genetic mutation and high salt intake elevate blood pressure.
The normal linear model is:
BP = β₀ + β₁G + β₂E + β₃GE + ε   (17.4)

and the corresponding logistic model is:

logit(Hyp) = ln[p / (1 − p)] = γ₀ + γ₁G + γ₂E + γ₃GE + ξ.   (17.5)
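As a hedged sketch (the data frame d below is hypothetical; only the model forms come from Eqs. 17.4 and 17.5), the two models can be fitted as follows, with the logistic coefficients exponentiated to give odds ratios:

# 'd' is an assumed data frame with continuous BP and binary G (mutation) and E (high salt)
d$Hyp <- as.integer(d$BP > 160)                           # dichotomise BP at 160 mmHg

m_add  <- lm(BP ~ G * E, data = d)                        # identity link, Eq. 17.4 (G*E expands to G + E + G:E)
m_mult <- glm(Hyp ~ G * E, family = binomial, data = d)   # logit link, Eq. 17.5

summary(m_add)                                            # additive-scale coefficients, including the G:E term
exp(coef(m_mult))                                         # multiplicative-scale effects (odds ratios)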
Fig. 17.1 Graphical display of a normal model for hypertension without statistical interaction showing model coefficients: (a) combined; and separately for low (b) and high (c) salt intake
Fig. 17.2 Graphical display of a normal model for hypertension with synergistic statistical interaction showing model coefficients: (a) combined; and separately for low (b) and high (c) salt intake
Fig. 17.3 Graphical display of the logistic model for hypertension without statistical interaction showing model coefficients: (a) combined; and separately for low (b) and high (c) salt intake
Fig. 17.4 Graphical display of the logistic model for hypertension with synergistic statistical interaction showing model coefficients: (a) combined; and separately for low (b) and high (c) salt intake
The combined chart can give a misleading impression of the presence or absence of statistical interaction and of the extent of the interaction; only the two-chart format is correct.
Such confusion arises because the scale upon which the odds ratio operates is multiplicative, whereas charts reflect differences additively. The separate charts are more reliable because they contrast within exposure groups, separately for low and high salt intake, accurately revealing any differences as a true indication of statistical interaction. This shows how important scale is to our perception, and hence interpretation, of statistical interaction.
All regression models are scale dependent, but this matters greatly when seeking
to interpret statistical interaction. There is nothing special about the scales adopted
by models that use either the identity or logit link functions, as these links are used
out of statistical convenience. The identity link may seem a natural choice for
continuous outcomes, as it preserves the original scale of the outcome, and
covariates operate additively on the original outcome scale. The logit link function
has specific utility because model coefficients are interpretable as odds ratios (via
exponentiation), and covariates therefore operate multiplicatively on the odds ratio
scale. There are, however, an infinite number of possible scales upon which model
covariates might relate to the outcome, depending upon the choice of link function.
As there are only a handful of regularly used link functions, it is easy to overlook
how arbitrary scale is, and how it always depends upon the choice of link function.
y = β₀ + β₁x + β₂x² + β₃z + β₄xz + ε   (17.6)
where y is a continuous outcome, β₀ is the intercept, x and z are two covariates that without loss of generality we assume to be continuous, βᵢ (i = 1…4) are covariate regression coefficients, and ε is residual error, assumed to be normally distributed with mean zero and variance σ². This is a linear model that has a quadratic term in x
and a product interaction term xz.
If we assume that x and z are correlated, i.e. collinear, then as the x–z correlation increases, collinearity increases between xz and x² (i.e. overlap between the ‘explained’ outcome variance increases). If we assume the relationship between y and x to be curvilinear, i.e. β₂ ≠ 0, but assume there to be no xz interaction, i.e. β₄ = 0, the correct model we need to adopt is:
y = β₀ + β₁x + β₂x² + β₃z + ε.   (17.7)

Were we instead to adopt Eq. 17.6 but without the quadratic term (x²):

y = β₀ + β₁x + β₃z + β₄xz + ε   (17.8)
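A small simulation (entirely illustrative; the parameter values are arbitrary) shows the consequence discussed here: when x and z are correlated and the genuine quadratic term is omitted, the product term absorbs the curvature and an apparent xz interaction emerges even though β₄ = 0 in truth:

set.seed(42)
n <- 5000
x <- rnorm(n)
z <- 0.7 * x + rnorm(n, sd = sqrt(1 - 0.7^2))       # x and z correlated
y <- 1 + 0.5 * x + 0.4 * x^2 + 0.3 * z + rnorm(n)   # truth follows Eq. 17.7: curvature, no xz term

coef(summary(lm(y ~ x + I(x^2) + z)))               # correctly specified: no xz term needed
coef(summary(lm(y ~ x + z + x:z)))                  # as Eq. 17.8: the x:z coefficient is clearly non-zero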
17.4.1 Example
Fig. 17.5 The inter-relationships amongst Barrett’s Oesophagus (BE), adipose tissue (AT) and sex: the underlying hypothesised relationship between the log odds of BE and AT is curvilinear; any linear relationship modelled using logistic regression might reveal different gradients for each sex, which manifests as an AT.sex statistical interaction within a regression model
where superscripts denote sex (M for males, F for females). Given different AT distributions for men and women, in conjunction with the overall curvilinear log odds BE ~ AT relationship, we would obtain different estimates for the log odds BE ~ AT slope for men and women, and we anticipate γ₁^M > γ₁^F. The difference in the two slopes is entirely due to the data for males and females lying on different parts of the curvilinear relationship between the log odds of BE and levels of AT.
The two separate models in Eq. 17.10 identically represent the combined
model in Eq. 17.9, in which the product interaction term is non-zero. There is a
statistical interaction between sex and AT in their jointly modelled association
with BE, but this does not signify a joint biological process; it is consistent with
separate biological processes, where sex affects AT levels and AT levels affect BE
(via reflux, perhaps). The statistical interaction is a consequence of adopting
the linear model approach when the underlying log odds BE ~ AT relationship
is curvilinear (on the statistically convenient log odds scale for the binary
outcome BE).
Where an underlying curvilinear relationship is overlooked within a linear
model, statistical interaction is observed that might be referred to as ‘spurious’,
yet the statistics are sound and the issue is one of interpretation. Statistical interac-
tion is correctly estimated, though its cause (hence model interpretation) may be
misguided. Statistical interaction need not be a reflection of joint biological action, as is typically attributed, but may rather be a consequence of an overlooked curvilinear relationship between outcome and (environmental) exposure. Without a priori knowledge of any underlying curvilinear relationships between the outcome and covariates, it would be hasty to interpret statistical interaction without also evaluating the model assumption of linearity. Conversely, we might
consider statistical interaction an indication of potential curvilinear relationships
between outcome and model covariates, as we discuss later.
It might be argued that most diseases result from gene-environment joint effects and
that the emphasis of research is to elucidate the magnitude of these effects.
To describe the outcome contingent on a range of options, one is not compelled
to assume any statistical model, i.e. to assume any particular model scale. Trying to
compress the complexity of biology into the simplicity of a statistical model might be misguided and only mislead or misinform: the information sought may not lend itself to the form of a statistical test supporting or refuting a hypothesised mechanism; rather, the concern should be with absolute effect size, which has clinical relevance. We therefore ask: for what purpose do we test? More insight might be gained by forgoing the questions of ‘if’ and ‘how’ joint effects occur biologically, and instead asking ‘to what extent’ or ‘by how much’ joint effects are observed.
Relative to not having the Factor V Leiden genetic mutation and not using the COCP, taking the contraceptive increases a woman’s relative risk (RR) for DVT by approximately 3.6-fold (since DVT is rare, OR ≈ RR). If there was no reason for the woman to suspect she had the genetic mutation, which has a prevalence of around 4.4% in Europe (Rees et al. 1995), these increased risks might not trouble her. On the other hand, if she was aware of a family history of DVT, she might suspect an elevated possibility of having the genetic mutation. Considering the relative risk of having both the mutation and using the COCP (RR ≈ 34.7) compared to merely having the mutation (RR ≈ 6.9), she might instead seek an alternative form of contraception or explore being genetically tested.
There is no knowing how an individual woman would choose to use the information in Table 17.1. It is perhaps dubious to suppose that her interest would lie in the p-value of a formal test for a synergistic statistical interaction on the additive scale, or in the indication that a multiplicative odds ratio scale fits the data better. The mental framework in which a woman’s decisions are informed seems more likely to be based on the relative risk effect sizes (along with confidence intervals) than on formal testing of statistical interaction on any scale. This raises the question of why we focus on formal testing of statistical interaction. A related question is why we are so concerned about the statistical power of such tests; this is addressed later.
One might say that the environmental effect modifies the genetic effect, or vice versa. But this would not be deemed joint action, in a
biological mechanistic sense, and the effect of the environmental exposure on the
outcome (if inferred to be causal) is noted only to differ by genotype; there is no
inferred direct mechanistic relationship between environmental exposure and geno-
type. It should be noted that whilst epidemiologists cannot randomise genes,
Mendelian randomisation (Davey and Ebrahim 2003) has been proposed to select
individuals according to a genetic predisposition to succumb to, or evade, an
environmental exposure. This may not always guarantee that the distribution of
the environmental exposure is balanced across levels of the genotype if intermediate
variables are involved jointly with both the genetic and environmental exposures.
logistic model), and whether or not the BP–Salt relationship is linear on either scale (it cannot be linear on more than one scale, and it may not be linear on either conveniently adopted scale).
In the very limited circumstances that environmental exposures are intrinsically
categorical, the interpretation of joint effects is straightforward, as with purely
gene-gene interactions. For g genetic categories and e environmental categories,
g × e joint effects are to be estimated. These could be evaluated as within
Table 17.1, or within a regression model that could then consider, simultaneously,
other potentially confounding factors (though interpretation of confounding needs
careful consideration, as discussed in Chap. 4). The g × e joint effects of the gene-
environment interaction are readily derived and their impact, hence interpretation,
is observed directly via model coefficients along with confidence intervals. There is
little more to interpret than the effect sizes estimated.
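For illustration only, a two-by-four style layout of this kind might be evaluated as follows; the counts below are invented (they are not the study data) and are chosen merely so that the resulting odds ratios are of the same order as the relative risks quoted above:

# Hypothetical counts for a rare outcome, one row per exposure combination
tab <- data.frame(
  G        = c(0, 0, 1, 1),        # genetic mutation absent/present
  E        = c(0, 1, 0, 1),        # environmental exposure absent/present
  cases    = c(36, 84, 10, 25),
  controls = c(100, 65, 4, 2)
)

# Odds ratios relative to the doubly unexposed group (approximate RRs for a rare outcome)
tab$OR <- with(tab, (cases / controls) / (cases[1] / controls[1]))
round(tab$OR, 1)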
Focus thus far has been given to interpreting statistical interaction biologically, as
this is the area of considerable misunderstanding and where most errors are reported
in the literature. There is, however, statistical utility in understanding statistical
interaction, which in some instances has biological utility also. It has been proposed
in the statistical literature that testing the product term in a linear regression model
is suitable for examining both linearity (between outcome and covariates via a
suitable link function) and the multivariate normality of covariates (Cox and
Small 1978; Cox and Wermuth 1994). Consider the statistical linear model with
three continuous measures:
y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε   (17.11)
Suppose that x₂ is log-transformed and the model refitted; this gives:

y = β̂₀ + β̂₁x₁ + β̂₂ ln(x₂) + β̂₃x₁ ln(x₂) + ε̂   (17.12)
where β̂ᵢ (i = 1…3) are the revised regression coefficients, and ε̂ is the revised residual error. It is now highly implausible that the statistical interaction is zero (i.e. β̂₃ ≠ 0), though not as a consequence of anything that has biological meaning,
since nothing has changed biologically between models Eq. 17.11 and Eq. 17.12;
only the log-transformation of x2 is different. Statistical interaction is generated
where there was none before by transformation of the data. This would be true
irrespective of which variable is transformed. If the outcome were transformed, this
is analogous to using a non-identity link function, which is why the absence of
statistical interaction on one scale (the normal scale) can become statistical inter-
action on a transformed scale (the logit scale).
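A short simulation (illustrative values only) makes the point concrete: the outcome below is exactly linear in x₁ and x₂ with no interaction, yet refitting with ln(x₂), as in Eq. 17.12, yields a clearly non-zero product term:

set.seed(1)
n  <- 5000
x1 <- rnorm(n)
u  <- 0.8 * x1 + 0.6 * rnorm(n)            # u = ln(x2), correlated with x1
x2 <- exp(u)                               # positive, log-normally distributed covariate
y  <- 1 + x1 + 2 * x2 + rnorm(n)           # truth: linear in x1 and x2, NO interaction

coef(summary(lm(y ~ x1 * x2)))             # as Eq. 17.11: interaction estimate near zero
coef(summary(lm(y ~ x1 * log(x2))))        # as Eq. 17.12: interaction now clearly non-zero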
Biological measures are often transformed for statistical convenience, e.g.
positively skewed outcomes might be log-transformed to attain a more normal
distribution (though it is model residuals and not variable values that must be
normally distributed for normal models). Variable transformation is required
amongst covariates only if the assumption of linearity is not upheld and this cannot
be modelled via a curvilinear or some other parameterisation. Where model
assumptions are not satisfied, the act of transformation to improve model perfor-
mance will affect the statistical interaction. If biological measures are ‘well
behaved’ and several such measures are multivariate normal (or multivariate log-
normal if log-transformed), there will be few statistical interactions in a regression
model. Alternatively, if biological measures are ‘messy’, or do not capture well the
underlying processes under investigation (such that the data collected do not follow
multivariate normality), then statistical interactions will be present, though these
will have limited biological interpretation. If the modelling process indicates data
transformation may be required to improve model performance, statistical
interactions may come and go, even changing sign, merely due to data manipula-
tion. Since nothing changes biologically, biological interpretation of statistical
interaction in these instances is nonsensical.
The correct interpretation of statistical interaction is thus concerned with the
outcome and covariates inter-relationships with respect to linearity and multivariate
normality, and not joint biological action. Statistical interaction indicates that, as
entered into the model, the outcome and covariates do not exhibit linearity and
covariates do not exhibit multivariate normality. This can have utility if seeking to
determine a scale upon which covariates exhibit multivariate normality, as with
variables that are mechanistically closely linked. Where changes in one measure are
thought to cause changes in others in a linear fashion, underlying multivariate
normality might be anticipated. One could then seek the absence of statistical
interaction to determine the correct joint scale of the variables.
It is well documented that much larger sample sizes are required to test statistical interactions than to test main effects (Greenland 1983), and many studies are criticised for being too small to examine gene-environment interactions.
17.10 Summary
Testing the effect of a risk factor on individuals may have a clear biological
interpretation, but testing for statistical interaction between two factors has a
mathematical interpretation, which is different. If the model has no biological
analogue, hence no biological interpretation, then testing for statistical interaction
might not contribute to biological understanding; indeed, it could confuse. Whilst
the quantification of joint effects remains a legitimate aim, its utility does not lie in
the elucidation of biological processes. There is an overreliance on the linear
regression model, with too few checks and balances to verify model assumptions.
There is also an overzealous interpretation of models invoking potential biological
mechanisms.
Model scale and linearity assumptions are typically overlooked, potentially
leading to confusion surrounding the interpretation of statistical interaction,
particularly within the domain of gene-environment interaction. Such arguments
have been made before, yet the previous literature that warned against these poor
practices continues to receive little acclaim, as pointed out two decades ago by
Thompson when he reflected upon the even earlier debate on statistical interaction
at the start of the 1980s: “A decade ago the concept of interaction among causes of
disease was at the center of a lively debate. Since that time, controversy over the
nature of interaction has largely subsided, although there seems never to have
been an adequate resolution of the conceptual and pragmatic issues that had
been raised” (Thompson 1991). He went on: “Unfortunately, choice among
theories of pathogenesis is enhanced hardly at all by the epidemiological assess-
ment of interaction . . . What few causal systems can be rejected on the basis
of observed results would provide decidedly limited etiological insight”
(Thompson 1991).
Problems persist with the misinterpretation of statistical interaction and over-
zealous attempts to interpret statistical interaction with biological meaning.
Researchers continue to pursue gene-environment interactions with no robust
insight as to what they mean. There is now almost an obsession to include some
form of formal testing for joint genetic and environmental effects wherever a study
records both, without an adequate a priori statement of what it is that is being
confirmed or refuted, either biologically or otherwise. This behaviour fuels atten-
tion to study sample size, pressuring researchers to seek statistical power sufficient
for the elucidation of significant gene-environment statistical interactions. Conse-
quently, there is a perceived and falsely legitimised demand for increasingly large
epidemiological studies. Less attention is given to the estimated size of main effects
and joint effects for clinical interpretation, or the evaluation of plausible underlying
causal paths amongst the factors being considered. There is perhaps an unease about considering more sophisticated methods, such as structural equation modelling (SEM).
This is why one author of this chapter and his colleague felt it necessary in 2001 to
reaffirm the many points said throughout the previous decades and rehearsed here:
“The prospects for epidemiology in the post-genomic era depend on under-
standing how to use genetic associations to test hypotheses about causal pathways,
rather than modelling the joint effects of genotype and environment” (Clayton
and McKeigue 2001).
It thus remains necessary to spell out repeatedly and vehemently the many issues associated with the interpretation of statistical interaction, in the hope of encouraging better epidemiological practice and dispelling the persistent and inappropriate pursuit of gene-environment statistical interaction. It is more appropriate to employ statistical models to understand causal pathways than to pursue statistical interaction.
Whilst statistical modelling (indeed statistical epidemiology) opens a window
on the biological world for investigation of cause and effect, one has to know
how to investigate what we go in search of, how to see what we find, and how
to interpret what we see. Otherwise, we kid ourselves with nothing more than
smoke and mirrors.
In addition to the citations in this chapter, there are those provided in the papers commenting on the recurring debate surrounding statistical interaction within epidemiology by Thompson (1991) and by Clayton and McKeigue (2001). Amongst the citations here, the more recent paper by Clayton (2009) is a good source for the genetics literature and, for a more generic overview, the seminal review by Cox (1984) is a must.
References
Barrett, N. R. (1957). The lower esophagus lined by columnar epithelium. Surgery, 41(6),
881–894. available from: PM:13442856.
Botto, L. D., & Khoury, M. J. (2001). Commentary: Facing the challenge of gene-environment
Interaction: the two-by-four table and beyond. American Journal of Epidemiology, 153(10),
1016–1020. available from: PM:11384958.
Clayton, D. G. (2009). Prediction and interaction in complex disease genetics: Experience in type
1 diabetes. PLoS Genetics, 5(7), e1000540. available from: PM:19584936.
Clayton, D., & McKeigue, P. M. (2001). Epidemiological methods for studying genes and
environmental factors in complex diseases. The Lancet, 358(9290), 1356–1360. available
from: PM:11684236.
Corley, D. A., Kubo, A., Levin, T. R., Block, G., Habel, L., Rumore, G., Quesenberry, C., &
Buffler, P. (2009). Race, ethnicity, sex and temporal differences in Barrett’s oesophagus
diagnosis: A large community-based study, 1994–2006. Gut, 58(2), 182–188. available from:
PM:18978173.
Cox, D. R. (1984). Interaction. International Statistical Review, 52(1), 1–24.
Cox, D. R., & Small, N. J. H. (1978). Testing multivariate normality. Biometrika, 65(2), 263–272. available from: http://biomet.oxfordjournals.org/cgi/content/abstract/65/2/263.
Cox, D. R., & Wermuth, N. (1994). Tests of linearity, multivariate normality and the adequacy of
linear scores. Applied Statistics, 43, 347–355.
Davey, S. G., & Ebrahim, S. (2003). ‘Mendelian randomization’: Can genetic epidemiology
contribute to understanding environmental determinants of disease? International Journal of
Epidemiology, 32(1), 1–22. available from: PM:12689998.
Ganzach, Y. (1997). Misleading interaction and curvilinear terms. Psychological Methods, 2(3),
235–247.
Greenland, S. (1983). Tests for interaction in epidemiologic studies: A review and a study of
power. Statistics in Medicine, 2(2), 243–251. available from: PM:6359318.
Greenwood, D. C., Gilthorpe, M. S., & Cade, J. E. (2006). The impact of imprecisely measured
covariates on estimating gene-environment interactions. BMC Medical Research Methodol-
ogy, 6, 21. available from: PM:16674808.
Koppert, L. B., Wijnhoven, B. P. L., van Dekken, H., Tilanus, H. W., & Dinjens, W. N. (2005).
The molecular biology of esophageal adenocarcinoma. Journal of Surgical Oncology, 92,
169–190. available from: PM:16299787.
Kupper, L. L., & Hogan, M. D. (1978). Interaction in epidemiologic studies. American Journal of
Epidemiology, 108(6), 447–453. available from: PM:736024.
Moayyedi, P. (2008). Barrett’s esophagus and obesity: The missing part of the puzzle. American
Journal of Gastroenterology, 103(2), 301–303. available from: PM:18289199.
Rees, D. C., Cox, M., & Clegg, J. B. (1995). World distribution of factor V Leiden. The Lancet,
346(8983), 1133–1134. available from: PM:7475606.
Ross, R., Shaw, K. D., Rissanen, J., Martel, Y., de Guise, J., & Avruch, L. (1994). Sex differences
in lean and adipose tissue distribution by magnetic resonance imaging: Anthropometric
relationships. American Journal of Clinical Nutrition, 59(6). available from: PM:8198051.
Rothman, K. J., Greenland, S., & Walker, A. M. (1980). Concepts of interaction. American
Journal of Epidemiology, 112(4), 467–470. available from: PM:7424895.
Rowntree, R. K., & Harris, A. (2003). The phenotypic consequences of CFTR mutations. Annals of
Human Genetics, 67(Pt 5), 471–485. available from: PM:12940920.
Saracci, R. (1980). Interaction and synergism. American Journal of Epidemiology, 112(4),
465–466. available from: PM:7424894.
Thompson, W. D. (1991). Effect modification and the limits of biological inference from epide-
miologic data. Journal of Clinical Epidemiology, 44(3), 221–232. available from:
PM:1999681.
Tu, Y. K., Manda, S. O., Ellison, G. T., & Gilthorpe, M. S. (2007). Revisiting the interaction
between birth weight and current body size in the foetal origins of adult disease. European
Journal of Epidemiology, 22(9), 565–575. available from: PM:17641977.
Vandenbroucke, J. P., Koster, T., Briët, E., Reitsma, P. H., Bertina, R. M., & Rosendaal, F. R.
(1994). Increased risk of venous thrombosis in oral-contraceptive users who are carriers of
factor V Leiden mutation. The Lancet, 344, 1453–1457.
Vandenbroucke, J. P., van der Meer, F. J. M., Helmerhorst, F. M., & Rosendaal, F. R. (1996).
Factor V Leiden: Should we screen oral contraceptive users and pregnant women? British
Medical Journal, 313, 1127–1130.
Walter, S. D., & Holford, T. R. (1978). Additive, multiplicative, and other models for disease risks.
American Journal of Epidemiology, 108(5), 341–346. available from: PM:727202.
Index
A
Additive measurement error, 34, 39

B
Backdoor principle, 197
Bayesian analysis, 23, 47, 138, 141
Berkson measurement error, 37
Best subset, 11, 12
Bias, 7, 11, 21, 34–36, 38, 40, 41, 47, 57–70, 100, 108, 125–128, 131, 132, 136, 173, 182, 183, 245, 256, 257, 264
Binomial distribution, 95, 96, 111, 293
Biomarkers, 33, 41, 49, 209, 210, 218

C
Casemix, 117, 119, 120, 123, 131–133, 136, 137
Classical measurement error, 36–37, 40, 42
Collider, 3, 4, 196–198
Confounding, 1–12, 34, 41, 58, 61, 65, 76, 119, 126, 132, 146, 174, 176, 178, 186, 196, 198, 243–250, 253, 255–257, 291, 306

D
Differential measurement error, 34, 38
Directed acyclic graphs, 1, 2, 5, 12, 47, 59, 60, 125, 126, 191–202
Direct effects, 196, 256

E
Errors-in-variables, 33, 46, 207, 213

G
Gene-environment interaction, 41, 291–293, 306–309
Generalised additive models, 128, 261–277, 279
G-estimation, 243–257
Growth mixtures models, 223–239

H
Hierarchical linear models, 75

I
Indirect effects, 195, 196, 256
Instrumental variables, 42, 43, 46, 48
Interaction, 25, 41, 79, 87, 126–128, 131, 192, 211, 224, 257, 261, 262, 271, 275, 279–284, 290–310

L
Latent class analysis, 117, 137
Latent classes, 46, 90, 97, 98, 106–109, 112, 117–138, 224, 225, 228–235, 237–239
Latent growth curve models, 205–220, 225
Latent variable methods, 46, 97, 124, 138, 185, 199–121, 206, 211
Logistic regression, 21, 35, 36, 43, 44, 48, 66, 88, 97, 138, 176, 192, 233, 246, 249, 251–253, 264, 271, 292, 293, 301

M
MAR. See Missing at random (MAR)
Markov chain Monte Carlo methods (MCMC), 24, 29, 47, 48, 73, 88–89, 144–145, 148, 163, 165
MCAR. See Missing completely at random (MCAR)
MCMC. See Markov chain Monte Carlo methods (MCMC)
Measurement error
  bias, 33–49, 126, 127, 131, 182, 183, 305
  loss of power, 33, 35, 40
  mechanisms, 37
Meta-analysis, 90, 173–187
Missing at random (MAR), 16–27, 30, 264, 284
Missing completely at random (MCAR), 16, 18–20
Missing data, 15–23, 25, 27–30, 46, 58, 133, 251
Missing Not At Random (MNAR), 17, 18, 20, 21, 30
MNAR. See Missing Not At Random (MNAR)
Mplus, 132, 206, 214, 216, 218, 219, 225, 230, 236–239
Multilevel modelling, 73–91, 121, 144, 205
Multilevel multiple imputation, 26–30
Multiple diseases, 81, 128, 292
Multiple imputation, 15–20, 23–30, 46
Multiplicative measurement error, 38, 41, 44
Multivariate disease mapping, 154
Multivariate random frailty effects model, 157–170

N
Non-differential measurement error, 34, 44
Nutrition epidemiology, 37, 41, 42

O
Over-dispersion, 96, 105, 108, 111, 147, 153

P
Performance evaluation, 111
Periodontal diseases, 73, 74, 77, 79, 209
Poisson distribution, 94–96, 100, 101, 104, 111, 112, 142, 143
Poisson spatial models, 142, 144, 147
Power, 33, 35, 40, 41, 46, 76, 138, 145, 182, 201, 216, 218, 230, 304, 308–309
Proportional hazards, 43, 45, 162, 252, 253, 257
Pseudo-randomisation, 117

Q
Quadrature, 46, 48

R
Random effects modelling, 77, 90, 123, 136, 143, 147, 150, 157, 159, 161, 163, 165, 170, 205, 211, 276
Random measurement error, 38, 42
Regression calibration, 38, 43–45, 47
Regression tree, 279–281, 284–286, 289, 290
Replicate samples, 42, 47

S
Selection bias, 57–70, 177, 245
SIMEX. See Simulation-extrapolation (SIMEX)
Simulation-extrapolation (SIMEX), 45–46
Spatial models, 114, 142–144, 147, 160, 162, 163, 165, 166, 170
Splines, 216, 224, 261, 269–273, 275, 276
Structural equation models, 47, 138, 154, 191–202, 205–208, 212–217, 219, 220, 224, 230, 309
Surrogate measures of exposure, 41
Survival analysis, 245, 251, 252, 254, 255
Systematic measurement error, 37, 42, 58

T
Time-varying covariate, 244, 246, 253

V
Validation samples, 41, 42, 44, 47

Z
Zero-inflated models, 93–114