Tom A. B. Snijders - Multilevel Analysis - An Introduction To Basic and Advanced Multilevel Modeling (2011) - 1
MULTILEVEL ANALYSIS
2nd Edition
An Introduction to Basic and Advanced Multilevel Modeling
Tom A B SNIJDERS
Roel J BOSKER
© Tom A B Snijders and Roel J Bosker 2012
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permitted under the Copyright, Designs and Patents Act, 1988, this publication may be reproduced,
stored or transmitted in any form, or by any means, only with the prior permission in writing of the
publishers, or in the case of reprographic reproduction, in accordance with the terms of licences
issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms
should be sent to the publishers.
A catalogue record for this book is available from the British Library
ISBN 978-1-84920-200-8
ISBN 978-1-84920-201-5
1 Introduction
1.1 Multilevel analysis
1.1.1 Probability models
1.2 This book
1.2.1 Prerequisites
1.2.2 Notation
8 Heteroscedasticity
8.1 Heteroscedasticity at level one
8.1.1 Linear variance functions
8.1.2 Quadratic variance functions
8.2 Heteroscedasticity at level two
8.3 Glommary
9 Missing Data
9.1 General issues for missing data
9.1.1 Implications for design
9.2 Missing values of the dependent variable
9.3 Full maximum likelihood
9.4 Imputation
9.4.1 The imputation method
9.4.2 Putting together the multiple results
9.5 Multiple imputations by chained equations
9.6 Choice of the imputation model
9.7 Glommary
13 Imperfect Hierarchies
13.1 A two-level model with a crossed random factor
13.2 Crossed random effects in three-level models
13.3 Multiple membership models
13.4 Multiple membership multiple classification models
13.5 Glommary
14 Survey Weights
14.1 Model-based and design-based inference
14.1.1 Descriptive and analytic use of surveys
14.2 Two kinds of weights
14.3 Choosing between model-based and design-based analysis
14.3.1 Inclusion probabilities and two-level weights
14.3.2 Exploring the informativeness of the sampling design
14.4 Example: Metacognitive strategies as measured in the PISA study
14.4.1 Sampling design
14.4.2 Model-based analysis of data divided into parts
14.4.3 Inclusion of weights in the model
14.5 How to assign weights in multilevel models
14.6 Appendix. Matrix expressions for the single-level estimators
14.7 Glommary
15 Longitudinal Data
15.1 Fixed occasions
15.1.1 The compound symmetry model
15.1.2 Random slopes
15.1.3 The fully multivariate model
15.1.4 Multivariate regression analysis
15.1.5 Explained variance
15.2 Variable occasion designs
15.2.1 Populations of curves
15.2.2 Random functions
15.2.3 Explaining the functions
15.2.4 Changing covariates
15.3 Autocorrelated residuals
15.4 Glommary
18 Software
18.1 Special software for multilevel modeling
18.1.1 HLM
18.1.2 MLwiN
18.1.3 The MIXOR suite and SuperMix
18.2 Modules in general-purpose software packages
18.2.1 SAS procedures VARCOMP, MIXED, GLIMMIX, and NLMIXED
18.2.2 R
18.2.3 Stata
18.2.4 SPSS, commands VARCOMP and MIXED
18.3 Other multilevel software
18.3.1 PinT
18.3.2 Optimal Design
18.3.3 MLPowSim
18.3.4 Mplus
18.3.5 Latent Gold
18.3.6 REALCOM
18.3.7 WinBUGS
References
Index
Preface to the Second Edition
Tom Snijders
Roel Bosker
March, 2011
Preface to the First Edition
This book grew out of our teaching and consultation activities in the
domain of multilevel analysis. It is intended for the absolute beginner in this
field as well as for those who have already mastered the fundamentals and
are now entering more complicated areas of application. The reader is
referred to Section 1.2 for an overview of this book and for some reading
guidelines.
We are grateful to various people from whom we got reactions on
earlier parts of this manuscript and also to the students who were exposed to
it and helped us realize what was unclear. We received useful comments
from, and benefited from discussions about parts of the manuscript with,
among others, Joerg Blasius, Marijtje van Duijn, Wolfgang Langer, Ralf
Maslowski, and Ian Plewis. Moreover we would like to thank Hennie
Brandsma, Mieke Brekelmans, Jan van Damme, Hetty Dekkers, Miranda
Lubbers, Lyset Rekers-Mombarg and Jan Maarten Wit, Carolina de Weerth,
Beate Völker, Ger van der Werf, and the Zentral Archiv (Cologne) who
kindly permitted us to use data from their respective research projects as
illustrative material for this book. We would also like to thank Annelies
Verstappen-Remmers for her unfailing secretarial assistance.
Tom Snijders
Roel Bosker
June, 1999
1
Introduction
1.2.1 Prerequisites
In order to read this textbook, a good working knowledge of statistics is
required. It is assumed that you know the concepts of probability, random
variable, probability distribution, population, sample, statistical
independence, expectation (population mean), variance, covariance,
correlation, standard deviation, and standard error. Furthermore, it is
assumed that you know the basics of hypothesis testing and multiple
regression analysis, and that you can understand formulas of the kind that
occur in the explanation of regression analysis.
Matrix notation is used only in a few more advanced sections. These
sections can be skipped without loss of understanding of other parts of the
book.
1.2.2 Notation
The main notational conventions are as follows. Abstract variables and
random variables are denoted by italicized capital letters, such as X or Y.
Outcomes of random variables and other fixed values are denoted by
italicized lower-case letters, such as x or z. Thus we speak about the
variable X, but in formulas where the value of this variable is considered as
a fixed, nonrandom value, it will be denoted by x. There are some
exceptions to this, for example in Chapter 2 and in the use of the letter N for
the number of groups (‘level-two units’) in the data.
The letter ℰ is used to denote the expected value, or population average,
of a random variable. Thus, ℰY and ℰ(Y) denote the expected value of Y. For
example, if Pn is the fraction of tails obtained in n coin flips, and the coin is
fair, then the expected value is ℰPn = 0.5.
Statistical parameters are indicated by Greek letters. Examples are μ, σ²,
and β. The following Greek letters are used.
α alpha
β beta
γ gamma
δ delta
η eta
θ theta
λ lambda
μ mu
π pi
ρ rho
σ sigma
τ tau
φ phi
χ chi
ω omega
∆ capital Delta
Σ capital Sigma
Τ capital Tau
Χ capital Chi
1
We are indebted to Ivo Molenaar for this reference.
2
Multilevel Theories, Multistage
Sampling, and Multilevel Models
Phenomena and data sets in the social sciences often have a multilevel
structure. This may be reflected in the design of data collection: simple
random sampling is often not a very cost-efficient strategy, and multistage
samples may be more efficient instead. This chapter is concerned with the
reasons why it is important to take account of the clustering of the data, also
called their multilevel structure, in the data analysis phase.
Table 2.1: Summary of terms to describe units at either level in the two-
level case.
Table 2.2: Some examples of units at the macro and micro level.
Macro level Micro level
schools teachers
classes pupils
neighbourhoods families
firms employees
jawbones teeth
families children
litters animals
doctors patients
subjects measurements
interviewers respondents
judges suspects
The more the achievement levels of pupils within a school are alike (as
compared to pupils from other schools), the more likely it is that causes of
the achievement have to do with the organizational unit (in this case, the
school). Absence of dependency in this case implies absence of institutional
effects on individual performance.
A special kind of nesting is defined by longitudinal data, represented in
Table 2.2 as ‘measurements within subjects’. The measurement occasions
here are the micro-units and the subjects the macro-units. The dependence
of the different measurements for a given subject is of primary importance
in longitudinal data, but the following section on relations between
variables defined at either level is not directly intended for the nesting
structure defined by longitudinal data. Because of the special nature of this
nesting structure, Chapter 15 is specifically devoted to it.
The models treated in this book are for situations where the dependent
variable is at the lowest level. For models with nested data sets where the
dependent variable is defined at a higher level one may consult Croon and
van Veldhoven (2007), Lüdtke et al. (2008), and van Mierlo et al. (2009).
Multilevel propositions
Multilevel propositions can be represented as in Figure 2.2. In this example
we are interested in the effect of the macro-level variable Z (e.g., teacher
efficacy) on the micro-level variable y (e.g., pupil motivation), controlling
for the micro-level variable x (e.g., pupil aptitude).
Micro-level propositions
Micro-level propositions are of the form indicated in Figure 2.3. In this case
the line indicates that there is a macro level which is not referred to in the
hypothesis that is put to the test, but which is used in the sampling design in
the first stage. In assessing the strength of the relation between occupational
status and income, for instance, respondents may have been selected for
face-to-face interviews by zip-code area. This then may cause dependency
(as a nuisance) in the data.
Macro-level propositions
Macro-level propositions are of the form of Figure 2.4. The line separating
the macro level from the micro level seems superfluous here. When
investigating the relation between the long-range strategic planning policy
of firms and their profits, there is no multilevel situation, and a simple
random sample may have been taken. When either or both variables are not
directly observable, however, and have to be measured at the micro level
(e.g., organizational climate measured as the average satisfaction of
employees), then a two-stage sample is needed nevertheless. This is the
case a fortiori for variables defined as aggregates of micro-level variables
(e.g., the crime rate in a neighborhood).
Macro–micro relations
The most common situation in social research is that macro-level variables
are supposed to have a relation with micro-level variables. There are three
obvious instances of macro-to-micro relations, all of which are typical
examples of the multilevel situation (see Figure 2.5). The first case is the
macro-to-micro proposition. The more explicit the religious norms in social
networks, for example, the more conservative the views that individuals
have on contraception. The second proposition is a special case of this. It
refers to the case where there is a relation between Z and y, given that the
effect of x on y is taken into account. The example given may be modified
to: ‘for individuals of a given educational level’. The last case in the figure
is the macro–micro-interaction, also known as the cross-level interaction:
the relation between x and y is dependent on Z. To put it another way, the
relation between Z and y is dependent on x. The effect of aptitude on
achievement, for instance, may be small in case of ability grouping of
pupils within classrooms but large in ungrouped classrooms.
Next to these three situations there is the so-called emergent, or micro–
macro, proposition (Figure 2.6). In this case, a micro-level variable x affects
a macro-level variable Z (student achievement may affect teachers’
experience of stress).
Figure 2.5: The structure of macro–micro propositions.
2.4 Glommary
Multilevel data structures. Many data sets in the social sciences have a
multilevel structure, that is, they constitute hierarchically nested
systems with multiple levels. Much of our discussion focuses on two-
level structures, but this can be generalized to three or more nested
levels.
Sampling design. Often the multilevel nature of the social world leads to
the practical efficiency of multistage samples. The population then
consists of a nested system of subpopulations, and a nested sample is
drawn accordingly. For example, when employing a random two-stage
sample design, in the first stage a random sample of the primary units
is taken, and in the second stage the secondary units are sampled at
random from the selected primary units.
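As an illustration, a random two-stage sample of this kind can be drawn in a few lines; the sketch below assumes a hypothetical population of 20 schools with 30 pupils each (all names and sizes are invented):

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical population: 20 schools (primary units) of 30 pupils
# (secondary units) each.
population = {school: [f"pupil_{school}_{i}" for i in range(30)]
              for school in range(20)}

# First stage: a simple random sample of 5 primary units.
sampled_schools = random.sample(sorted(population), 5)

# Second stage: 10 secondary units sampled at random from each
# selected primary unit.
two_stage_sample = {school: random.sample(population[school], 10)
                    for school in sampled_schools}

total_sample_size = sum(len(pupils) for pupils in two_stage_sample.values())
print(total_sample_size)  # 5 schools x 10 pupils = 50
```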
Levels. The levels are numbered such that the most detailed level is the
first. For example, in a two-level structure of individuals nested in
groups the individuals are called level-one units and the groups level-
two units. (Note the different terminology compared to the words used
in theories of survey sampling: in the preceding example, the primary
units are the level-two units and the secondary units the level-one
units.)
Units. The elements of a level are called units. Higher-level units are also
called clusters. We talk about level-one units, level-two units, etc.
Dependence as a nuisance. Not taking account of the multilevel data
structure, or of the multistage sampling design, is likely to lead to the
use of statistical procedures in which independence assumptions are
violated so that conclusions may be unfounded.
1
As with any rule, there are exceptions. If the data set is such that for each macro-unit only one
micro-unit is included in the sample, single-level methods still can be used.
3
Statistical Treatment of Clustered
Data
3.2 Disaggregation
Now suppose that we treat our data at the micro level. There are two
situations:
The number of micro-units within the jth macro-unit is denoted by nj. The
number of macro-units is N, and the total sample size is M = Σj nj.
In this situation, the intraclass correlation coefficient ρI can be defined
as

ρI = τ² / (τ² + σ²),

where τ² is the population variance between the macro-units and σ² the
population variance within the macro-units.
This is the proportion of variance that is accounted for by the group level.
This parameter is called a correlation coefficient because it is equal to the
correlation between values of two randomly drawn micro-units in the same,
randomly drawn, macro-unit. Hedges and Hedberg (2007) report on a large
variety of studies of educational performance in American schools, and find
that values often range between 0.10 and 0.25.
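In code, the definition amounts to a one-line function of the two variance components (the function name and inputs are ours):

```python
def intraclass_correlation(tau2: float, sigma2: float) -> float:
    """rho_I = tau^2 / (tau^2 + sigma^2): the proportion of total
    variance accounted for by the group (macro) level."""
    return tau2 / (tau2 + sigma2)

# A value in the range that Hedges and Hedberg (2007) report for
# educational performance data:
print(intraclass_correlation(0.15, 0.85))  # 0.15
```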
It is important to note that the population variance between macro-units
is not directly reflected by the observed variance between the means of the
macro-units (the observed between-macro-unit variance). The reason is that
in a two-stage sample, variation between micro-units will also show up as
extra observed variance between macro-units. It is indicated below how the
observed variance between cluster means must be adjusted to yield a good
estimator for the population variance between macro-units.
Table 3.1: Data grouped into macro-units (random digits from Glass and
Stanley, 1970, p. 511).
This number will vary from group to group. To have one parameter that
expresses the within-group variability for all groups jointly, one uses the
observed within-group variance, or pooled within-group variance. This is a
weighted average of the variances within the various macro-units, defined
as

S²within = Σj Σi (Yij − Ȳ.j)² / (M − N).

For unequal group sizes, the contributions of the various groups need to be
weighted. The following formula uses weights that are useful for estimating
the population between-group variance:

S²between = Σj nj (Ȳ.j − Ȳ..)² / (ñ (N − 1)).

In this formula, ñ is defined by

ñ = (M − Σj nj²/M) / (N − 1) = n̄ − s²n / (N n̄),

where n̄ = M/N is the average group size and s²n = Σj (nj − n̄)²/(N − 1)
is the variance of the sample sizes. If all nj have the same value, then ñ also
has this value. In this case, S²between is just the variance of the group means,
given by (3.5).
It can be shown that the total observed variance is a combination of the
within-group and the between-group variances, expressed as follows:

S²total = ((M − N)/(M − 1)) S²within + (ñ (N − 1)/(M − 1)) S²between     (3.8)

(cf. Hays (1988, Section 13.3) for the case with constant nj and Searle et al.
(1992, Section 3.6) for the general case). Provided that model (3.1) is valid,
the expected value of the observed between-group variance is

ℰ S²between = τ² + σ²/ñ.     (3.9)

The second term in this formula becomes small when ñ
becomes large. Thus for large group sizes, the expected observed between
variance is practically equal to the true between variance. For small group
sizes, however, it tends to be larger than the true between variance due to
the random differences that also exist between the group means.
In practice, we do not know the population values of the between and
within macro-unit variances; these have to be estimated from the data. One
way of estimating these parameters is based on formulas (3.4) and (3.9).
From the former it follows that the population within-group variance, σ²,
can be estimated without bias by the observed within-group variance:

σ̂² = S²within.     (3.10)

From the combination of the last two formulas it follows that the population
between-group variance, τ², can be estimated without bias by taking the
observed between-group variance and subtracting the contribution that the
true within-group variance makes, on average, according to (3.9), to the
observed between-group variance:

τ̂² = S²between − S²within/ñ.     (3.11)
This formula was given by Fisher (1958, Section 39) and by Donner (1986,
equation (6.1)), who also gives the (quite complicated) formula for the
standard error for the case of variable group sizes. Donner and Wells (1986)
compare various ways to construct confidence intervals for the intraclass
correlation coefficient.
The estimators given above are so-called analysis of variance or
ANOVA estimators. They have the advantage that they can be represented
by explicit formulas. Other much used estimators are those produced by the
maximum likelihood (ML) and residual maximum likelihood (REML)
methods (cf. Section 4.7). For equal group sizes, the ANOVA estimators are
the same as the REML estimators (Searle et al., 1992). For unequal group
sizes, the ML and REML estimators are slightly more efficient than the
ANOVA estimators. Multilevel software can be used to calculate the ML
and REML estimates.
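Because the ANOVA estimators have explicit formulas, they are easy to compute directly. The sketch below implements the pooled within-group variance, the between-group variance with the ñ weighting, and the resulting unbiased estimates of σ² and τ²; the function name and the toy data are ours:

```python
def anova_estimates(groups):
    """ANOVA estimators for the variance components of a two-level
    data set.

    groups: list of lists, each inner list holding the micro-level
    observations of one macro-unit.
    Returns (sigma2_hat, tau2_hat, rho_I_hat).
    """
    N = len(groups)                      # number of macro-units
    n = [len(g) for g in groups]         # group sizes n_j
    M = sum(n)                           # total sample size
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / M

    # Pooled (observed) within-group variance, S^2_within
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    s2_within = ss_within / (M - N)

    # n-tilde and the observed between-group variance, S^2_between
    n_tilde = (M - sum(nj ** 2 for nj in n) / M) / (N - 1)
    s2_between = sum(nj * (m - grand) ** 2
                     for nj, m in zip(n, means)) / (n_tilde * (N - 1))

    sigma2_hat = s2_within                        # unbiased for sigma^2
    tau2_hat = s2_between - sigma2_hat / n_tilde  # bias-corrected
    rho_hat = tau2_hat / (tau2_hat + sigma2_hat)
    return sigma2_hat, tau2_hat, rho_hat

# Tiny balanced example: two groups of two observations.
print(anova_estimates([[0, 2], [4, 6]]))  # sigma2 = 2.0, tau2 = 7.0, rho_I = 7/9
```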
Example 3.2 Within- and between-group variability for random data.
For our random digits table of the earlier example the observed between variance is S²between = 105.7.
The observed variance within the macro-units can be computed from formula (3.8). The observed
total variance is known to be 814.0 and the observed between variance is given by 105.7. Solving
(3.8) for the observed within variance yields S²within = (99/90) × (814.0 − (10/11) × 105.7) = 789.7.
Then the estimated true variance within the macro-units is also σ̂² = 789.7. The estimate for the true
between-macro-unit variance is computed from (3.11) as τ̂² = 105.7 − 789.7/10 = 26.7.
Finally, the estimate of the intraclass correlation is ρ̂I = 26.7/(26.7 + 789.7) = 0.03. Its
standard error, computed from (3.13), is 0.06.
For a two-stage sample with equal group sizes n, the design effect is

1 + (n − 1)ρI.

This formula expresses the fact that, from a purely statistical point of view,
a two-stage sample becomes less attractive as ρI increases (clusters become
more homogeneous) and as the group size n increases (the two-stage nature
of the sampling design becomes stronger).
Suppose, for example, we were studying the satisfaction of patients with
the treatments provided by their doctors. Furthermore, let us assume that
some doctors have more satisfied patients than others, leading to a ρI of
0.30. The researchers used a two-stage sample, since that is far cheaper than
selecting patients simply at random. They first randomly selected 100
doctors, from each chosen doctor selected five patients at random, and then
interviewed each of these. In this case the design effect is 1 + (5 − 1) × 0.30
= 2.20. When estimating the standard error of the mean, we no longer can
treat the observations as independent of each other. The effective sample
size, that is, the equivalent total sample size that we should use in
estimating the standard error, is equal to

500/2.20 = 227.

To attain, with a two-stage sample, the same precision as a simple random
sample of size Nsrs, the total sample size must be

Nts = Nsrs × (1 + (n − 1)ρI).

The quantity Nts in this formula refers to the total desired sample size when
using a two-stage sample, whereas Nsrs refers to the desired sample size if
one had used a simple random sample.
In practice, ρI is unknown. However, it often is possible to make an
educated guess about it on the basis of earlier research.
In Figure 3.2, Nts is graphed as a function of n and ρI (0.1, 0.2, 0.4, and
0.6, respectively), and taking Nsrs = 100 as the desired sample size for an
equally informative simple random sample.
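The calculations in the patient example can be reproduced directly (a minimal sketch; the function names are ours):

```python
def design_effect(n: int, rho_i: float) -> float:
    """Design effect 1 + (n - 1) * rho_I of a two-stage sample
    with equal group sizes n."""
    return 1 + (n - 1) * rho_i

def effective_sample_size(total_n: int, n: int, rho_i: float) -> float:
    """Equivalent simple-random-sample size of a two-stage sample."""
    return total_n / design_effect(n, rho_i)

# 100 doctors, 5 patients per doctor, rho_I = 0.30:
print(round(design_effect(5, 0.30), 2))            # 2.2
print(round(effective_sample_size(500, 5, 0.30)))  # 227

# Desired two-stage sample size as informative as N_srs = 100:
print(round(100 * design_effect(5, 0.30)))         # 220
```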
The main point of this section is that within-group relations can be, in
principle, completely different from between-group relations. This is
natural, because the processes at work within groups may be different from
the processes at work between groups (see Section 3.1). Total relations, that
is, relations at the micro level when the clustering into macro-units is
disregarded, are mostly a kind of average of the within-group and between-
group relations. Therefore it is necessary to consider within- and between-
group relations jointly, whenever the clustering of micro-units in macro-
units is meaningful for the phenomenon being studied.
3.6.1 Regressions
The linear regression of a ‘dependent’ variable Y on an ‘explanatory’ or
‘independent’ variable X is the linear function of X that yields the best4
prediction of Y. When the bivariate distribution of (X, Y) is known and the
data structure has only a single level, the expression for this regression
function is

Y = ℰ(Y) + β1 (X − ℰ(X)) + R = β0 + β1 X + R.

The constant term β0 is called the intercept, while β1 is called the regression
coefficient. The term R is the residual or error component, and expresses
the part of the dependent variable Y that cannot be approximated by a linear
function of X. Recall from Section 1.2.2 that ℰ(X) and ℰ(Y) denote the
population mean (expected value) of X and Y, respectively.
Table 3.2: Artificial data on five macro-units, each with two micro-units.
A population model
The interplay of within-group and between-group relations can be better
understood on the basis of a population model such as (3.1). Since this
section is about two variables, X and Y, a bivariate version of the model is
needed. In this model, group (macro-unit) j has specific main effects Uxj
and Uyj for variables X and Y, and associated with individual i in group j
are the within-group deviations Rxij and Ryij. The population means are
denoted by μx and μy, and it is assumed that the Us and the Rs have
population means 0. The Us on the one hand and the Rs on the other are
independent. The formulas for X and Y are then

Xij = μx + Uxj + Rxij,
Yij = μy + Uyj + Ryij.

For the formulas that refer to relations between the group means X̄.j and Ȳ.j, it
is assumed that each group has the same size, denoted by n.
The correlation between the group effects Uxj and Uyj is defined as

ρbetween = ρ(Uxj, Uyj).
One of the two variables X and Y might have a stronger group nature than
the other, so that the intraclass correlation coefficients for X and Y may be
different. These are denoted by ρIx and ρIy, respectively.
The within-group regression coefficient is the regression coefficient
within each group of Y on X, assumed to be the same for each group. This
coefficient is denoted by βwithin and defined by the within-group regression
equation,

Ŷij = μy + Uyj + βwithin (Xij − μx − Uxj),

so that βwithin is the regression coefficient of the within-group deviation
Ryij on Rxij.
The between-group regression coefficient βbetween is defined analogously as
the regression coefficient for the group effects Uyj on Uxj. For the total
regression coefficient, the regression of Y on X ignoring the group structure,
it can be shown that

βtotal = ρIx βbetween + (1 − ρIx) βwithin.

This expression implies that if X is a pure macro-level variable (so that ρIx =
1), the total regression coefficient is equal to the between-group coefficient.
Conversely, if X is a pure micro-level variable we have ρIx = 0, and the total
regression coefficient is just the within-group coefficient. Usually X will
have both a within-group and a between-group component and the total
regression coefficient will be somewhere between the two level-specific
regression coefficients.
For large group sizes the reliability approaches unity, so the correlation ratio
approaches the intraclass correlation.
In the data, the correlation ratio η²x is the same as the proportion of
variance in Xij explained by the group means, and it can be computed as the
ratio of the between-group sum of squares to the total sum of
squares in an analysis of variance, that is,

η²x = Σj nj (X̄.j − X̄..)² / Σj Σi (Xij − X̄..)².
The observed total regression coefficient is the corresponding combination
of the observed within-group and between-group regression coefficients,
with weights given by the observed correlation ratio:

Btotal = η²x Bbetween + (1 − η²x) Bwithin.     (3.27)

Expression (3.27) was first given by Duncan et al. (1961) and can also be
found, for example, in Pedhazur (1982, p. 538). A multivariate version was
given by Maddala (1971). To apply this equation to an unbalanced data set,
the regression coefficient between group means must be calculated in a
weighted regression, group j having weight nj.
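As a numerical check of this decomposition, the sketch below verifies on a small invented balanced data set that the total regression coefficient equals the η²x-weighted combination of the between- and within-group coefficients:

```python
# Invented balanced data: two groups of two (x, y) observations.
x = [[0, 2], [4, 6]]
y = [[1, 3], [2, 4]]

flat_x = [v for g in x for v in g]
flat_y = [v for g in y for v in g]
mean_x = sum(flat_x) / len(flat_x)

def slope(xs, ys):
    """Least-squares regression coefficient of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Within-group coefficient, from the pooled deviation scores.
dev_x = [v - sum(g) / len(g) for g in x for v in g]
dev_y = [v - sum(g) / len(g) for g in y for v in g]
b_within = slope(dev_x, dev_y)

# Between-group coefficient, from the group means.
b_between = slope([sum(g) / len(g) for g in x],
                  [sum(g) / len(g) for g in y])

# Correlation ratio: between-group SS over total SS for x.
eta2_x = (sum(len(g) * (sum(g) / len(g) - mean_x) ** 2 for g in x)
          / sum((v - mean_x) ** 2 for v in flat_x))

b_total = slope(flat_x, flat_y)
print(b_within, b_between, eta2_x, b_total)
# b_total equals eta2_x * b_between + (1 - eta2_x) * b_within
# up to floating-point rounding.
```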
3.6.2 Correlations
The quite extreme nature of the artificial data set of Table 3.2 becomes
apparent when we consider the correlations.
The group means (X̄.j, Ȳ.j) lie on a decreasing straight line, so the
observed between-group correlation, which is defined as the correlation
between the group means, is Rbetween = −1. The within-group correlation is
defined as the correlation within the groups, assuming that this correlation
is the same within each group. This can be calculated as the correlation
coefficient between the within-group deviation scores Xij − X̄.j and
Yij − Ȳ.j. In this data set the deviation scores are (−1, −1) for
i = 1 and (+1, +1) for i = 2, so the within-group correlation here is Rwithin =
+1. Thus, we see that the within-group as well as the between-group
correlations are perfect, but of opposite signs. The disaggregated
correlation, that is, the correlation computed without taking the nesting
structure into account, is Rtotal = −0.33. (This is the same as the value for
the regression coefficient in the total (disaggregated) regression equation,
because X and Y have the same variance.)
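The three correlations of this artificial example are easy to reproduce; the sketch below builds a data set with the same structure (group means on a decreasing line, deviation scores (−1, −1) and (+1, +1); the specific numbers are ours, not those of Table 3.2):

```python
def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

# Five groups of two micro-units; group means (j, 6 - j) lie on a
# decreasing line, deviation scores are (-1, -1) and (+1, +1).
xs, ys, mean_x, mean_y, dev_x, dev_y = [], [], [], [], [], []
for j in range(1, 6):
    xs += [j - 1, j + 1]
    ys += [(6 - j) - 1, (6 - j) + 1]
    mean_x.append(j)
    mean_y.append(6 - j)
    dev_x += [-1, +1]
    dev_y += [-1, +1]

print(corr(dev_x, dev_y))      # within-group correlation: +1.0
print(corr(mean_x, mean_y))    # between-group correlation: -1.0
print(round(corr(xs, ys), 2))  # total (disaggregated) correlation: -0.33
```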
For the between-group correlation coefficient the relation is, as always, a little more
complicated. The correlation coefficient between the group means is equal
to

ρ(X̄.j, Ȳ.j) = √(λxj λyj) ρbetween + √((1 − λxj)(1 − λyj)) ρwithin,     (3.29)
where λxj and λyj are the reliability coefficients of the group means (see
equations (3.20, 3.21)). For large group sizes the reliabilities will be close to
1 (provided the intraclass correlations are larger than 0), so that the
correlation between the group means will then be close to ρbetween.
The total correlation (i.e., the correlation in the disaggregated analysis)
is a combination of the within-group and between-group correlation
coefficients. The combination depends on the intraclass correlations, as
shown by the formula

ρtotal = √(ρIx ρIy) ρbetween + √((1 − ρIx)(1 − ρIy)) ρwithin.     (3.30)
If the intraclass correlations are low, then X and Y primarily have the nature
of level-one variables, and the total correlation will be close to the within-
group correlation; on the other hand, if the intraclass correlations are close
to 1, then X and Y almost have the nature of level-two variables and the
total correlation is close to the between-group correlation. As a third
possibility, if one of the intraclass correlations is close to 0 and the other is
close to 1, then one variable is mainly a level-one variable and the other
mainly a level-two variable. Formula (3.30) then implies that the total
correlation coefficient is close to 0, no matter how large the within-group
and between-group correlations. This is intuitively obvious, since a level-
one variable with hardly any between-group variability cannot be
substantially correlated with a variable having hardly any within-group
variability.
If the intraclass correlations of X and Y are equal and denoted by ρI, then
(3.30) can be formulated more simply as

ρtotal = ρI ρbetween + (1 − ρI) ρwithin.     (3.31)
In this case the weights ρI and (1 − ρI) add up to 1 and the total correlation
coefficient is necessarily between the within-group and the between-group
correlation coefficients. This is not true in general, because the sum of the
weights in (3.30) is smaller than 1 if the intraclass correlations for X and Y
are different.
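A short computation illustrates this weighting (the population formula with weights √(ρIx ρIy) and √((1 − ρIx)(1 − ρIy)); the numbers are invented):

```python
def total_correlation(rho_ix, rho_iy, rho_between, rho_within):
    """Total correlation as a weighted combination of the between-group
    and within-group correlations (population formula)."""
    w_between = (rho_ix * rho_iy) ** 0.5
    w_within = ((1 - rho_ix) * (1 - rho_iy)) ** 0.5
    return w_between * rho_between + w_within * rho_within

# Equal intraclass correlations: the weights add up to 1.
print(round(total_correlation(0.25, 0.25, 0.8, 0.2), 2))  # 0.35

# One variable mainly level-one, the other mainly level-two:
# both weights are small, so the total correlation is near 0.
print(round(total_correlation(0.01, 0.99, 0.9, 0.9), 2))  # 0.18
```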
The reliabilities of the group means then also are equal, λxj = λyj; let us
denote these by λj. The correlation coefficient between the group means
(3.29) then simplifies to

ρ(X̄.j, Ȳ.j) = λj ρbetween + (1 − λj) ρwithin.
The last two formulae can help in understanding how aggregation changes
correlations – under the continued assumption that the intraclass
correlations of X and Y are the same. The total correlation as well as the
correlation between group means are between the within-group and the
between-group correlations. However, since the reliability coefficient λj is
greater7 than the intraclass correlation ρI, the correlation between the group
means is drawn toward the between-group correlation more strongly than
the total correlation is. Therefore, aggregation will increase correlation if
and only if the between-group correlation coefficient is larger than the
within-group correlation coefficient. Therefore, the fact that correlations
between group means are often higher than correlations between individuals
is not the mathematical consequence of aggregation, but the consequence of
the processes at the group level (determining the value of ρbetween) being
different from the processes at the individual level (which determine the
value of ρwithin).
The analogous expression for the observed total correlation,

Rtotal = ηx ηy Rbetween + √((1 − η²x)(1 − η²y)) Rwithin,

was given by Robinson (1950) and can also be found, for
example, in Pedhazur (1982, p. 536). When it is applied to an unbalanced
data set, the correlation between the group means should be calculated with
weights nj.
It may be noted that many texts do not make the explicit distinction
between population and data. If the population and the data are equated,
then the reliabilities are unity, the correlation ratios are the same as the
intraclass correlations, and the population between-group correlation is
equal to the correlation between the group means. The equation for the total
correlation then becomes

Rtotal = ηx ηy Rbetween + √((1 − η²x)(1 − η²y)) Rwithin;

for the artificial data set this yields −0.33, which indeed is the value found
earlier for the total correlation.
The parameter in study j can be represented as

θj = θ + Ej,

where θ is the mean parameter in the population of all potential studies, and
Ej is the deviation from this value in study j, dependent on particulars of the
group under study, the measurement instrument used, etc. The estimate can
be represented as

θ̂j = θj + Rj,

where Rj is a random residual. From each study we know the standard error
sj of this estimate. Combining these two equations leads to a
representation of the parameter estimates as a mean value plus a double
error term,

θ̂j = θ + Ej + Rj.
Assuming that the standard errors were estimated very precisely, this can be
done by a chi-squared test as derived, for example, in Lehmann and
Romano (2005, Section 7.3). The test statistic is

X² = Σj (θ̂j − θ̃)² / s²j,

where θ̃ = Σj (θ̂j/s²j) / Σj (1/s²j) is the precision-weighted mean of the
estimates. Under the null hypothesis, and if the estimates are normally
distributed with variances s²j, this statistic has a chi-squared
distribution with N − 1 degrees of freedom.
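Under the stated assumptions (known standard errors, normally distributed estimates), the test can be sketched as follows; the helper name and example numbers are ours:

```python
def heterogeneity_test(estimates, standard_errors):
    """Chi-squared statistic for the null hypothesis that all studies
    estimate the same parameter, given (assumed known) standard errors.
    Returns (statistic, degrees of freedom)."""
    weights = [1 / s ** 2 for s in standard_errors]
    # Precision-weighted mean of the estimates.
    theta_bar = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
    statistic = sum(w * (t - theta_bar) ** 2
                    for w, t in zip(weights, estimates))
    return statistic, len(estimates) - 1

# Three invented study estimates with equal standard errors.
chi2, df = heterogeneity_test([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(chi2, df)  # 2.0 with 2 degrees of freedom
```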
3.8 Glommary
Aggregation and disaggregation. Multilevel data structures can be
analyzed by aggregating data to the higher level (e.g., by taking the
means) and analyzing these; or by disaggregating to the lower level,
which means that characteristics of higher-level units are used as
characteristics of the lower-level units contained in them, and further
nesting is not used. Both approaches provide only a limited perspective
because they focus on only one level among several.
1
This model is also known in the statistical literature as the one-way random effects ANOVA model
and as Eisenhart’s Type II ANOVA model. In multilevel modeling it is known as the empty model,
and is treated further in Section 4.4.
2
In a data set it is possible for the estimated intraclass correlation coefficient to be negative. This is
always the case, for example, for group-centered variables. In a population satisfying model (3.1),
however, the population intraclass correlation cannot be negative.
3
In the literature the reliability of a measurement X is frequently denoted by the symbol ρXX, so that
the reliability coefficient λj here could also be denoted by ρYY.
4
‘Best prediction’ means here the prediction that has the smallest mean squared error: the so-called
least squares criterion.
5
The remainder of this subsection may be skipped by the cursory reader.
6
The same phenomenon is at the heart of formulas (3.9) and (3.11).
7
We argue under the assumption that the group size is at least 2, and the common intraclass
correlation is positive.
8
The remainder of Section 3.6.2 may also be skipped by the cursory reader.
4
The Random Intercept Model
In the preceding chapters it was argued that the best approach to the
analysis of multilevel data is one that represents within-group as well as
between-group relations within a single analysis, where ‘group’ refers to the
units at the higher levels of the nesting hierarchy. Very often it makes sense
to represent the variability within and between groups by a probability
model, in other words, to conceive of the unexplained variation within
groups and that between groups as random variability. For a study of
students within schools, for example, this means that not only unexplained
variation between students, but also unexplained variation between schools
is regarded as random variability. This can be expressed by statistical
models with so-called random coefficients.
The hierarchical linear model is such a random coefficient model for
multilevel, or hierarchically structured, data and has become the main tool
for multilevel analysis. In this chapter and the next the definition of this
model and the interpretation of the model parameters are discussed. This
chapter discusses the simpler case of the random intercept model; Chapter 5
treats the general hierarchical linear model, which also has random slopes.
Chapter 6 is concerned with testing the various components of the model.
Later chapters treat various elaborations and other aspects of the
hierarchical linear model. The focus of this treatment is on the two-level
case, but Chapters 4 and 5 also contain sections on models with more than
two levels of variability.
The indices can be regarded as case numbers; note that the numbering for
the individuals starts again in every group. For example, individual 1 in
group 1 is different from individual 1 in group 2.
For individual i in group j, we have the dependent variable Yij and the individual-level explanatory variable xij; for group j, we have the group-level explanatory variable zj.
These models pretend, as it were, that all the multilevel structure in the data
is fully explained by the group variable Z and the individual variable X. In
other words, if two individuals are being considered and their X- and Z-
values are given, then for their Y-value it is immaterial whether they belong
to the same or to different groups.
Models of types (4.1) and (4.2), and their extensions to more
explanatory variables at either or both levels, have in the past been widely
used in research on data with a multilevel structure. They are convenient to
handle for anybody who knows multiple regression analysis. Is anything
wrong with them? YES! For data with a meaningful multilevel structure, it
is practically always incorrect to make the a priori assumption that all of
the group structure is represented by the explanatory variables. Given that
there are only N groups, it is unjustified to act as if one has n1 + n2 + . . . +
nN independent replications. There is one exception: when all group sample
sizes nj are equal to 1, the researcher need have no qualms about using these
models because the nesting structure is not present in the relation between
observed variables, even if it may be present in the structure of the
population. Designs with nj = 1 can be used when the explanatory variables
were chosen on the basis of substantive theory, and the focus of the research
is on the regression coefficients rather than on how the variability of Y is
partitioned into within-group and between-group variability.
In designs with group sizes larger than 1, however, the nesting structure
often cannot be represented completely in the regression model by the
explanatory variables. Additional effects of the nesting structure can be
represented by letting the regression coefficients vary from group to group.
Thus, the coefficients β0 and β1 in equation (4.1) must depend on the group,
denoted by j. This is expressed in the formula by an extra index j for these coefficients:

Yij = β0j + β1j xij + β2 zj + Rij. (4.3)

Groups j can have a higher (or lower) value of β0j, indicating that, for any
given value of X, they tend to have higher (or lower) values of the
dependent variable Y. Groups can also have a higher or lower value of β1j,
which indicates that the effect of X on Y is higher or lower. Since Z is a
group-level variable, it would not make much sense conceptually to let the
coefficient of Z depend on the group. Therefore β2 is left unaltered in this
formula.
The multilevel models treated in the following sections and in Chapter 5
contain diverse specifications of the varying coefficients β0j and β1j. The
simplest version of model (4.3) is that where β0j and β1j are constant (do not
depend on j), that is, the nesting structure has no effect, and we are back at
model (4.1). If this is an appropriate model, which we said above is a
doubtful supposition, then the ordinary least squares (OLS) regression
models of type (4.1) and (4.2) offer a good approach to analyzing the data.
If, on the other hand, the coefficients β0j and β1j do depend on j, then these
regression models may give misleading results. Then it is preferable to take
into account how the nesting structure influences the effects of X and Z on
Y. This can be done using the random coefficient model of this and the
following chapters. This chapter examines the case where the intercept β0j
depends on the group; the next chapter treats the case where the regression
coefficient β1j is also group-dependent.
Figure 4.1: Different parallel regression lines. The point y12 is indicated
with its residual R12.
For reasons that will become clear in Chapter 5, the notation for the
regression coefficients is changed here, and the average intercept is called
γ00 while the regression coefficient for X is called γ10. Substitution now
leads to the model

Yij = γ00 + γ10 xij + U0j + Rij. (4.5)
The values U0j are the main effects of the groups: conditional on an
individual having a given X-value and being in group j, the Y-value is
expected to be U0j higher than in the average group. Model (4.5) can be understood in two ways: as a regression model with a group-dependent random intercept β0j = γ00 + U0j, or as a linear model with fixed part γ00 + γ10 xij and random part U0j + Rij. Leaving out the explanatory variable yields the model

Yij = γ00 + U0j + Rij. (4.6)

Groups with a high value of U0j tend to have, on average, high responses
whereas groups with a low value of U0j tend to have, on average, low
responses. The random variables U0j and Rij are assumed to have a mean of
0 (the mean of Yij is already represented by γ00), to be mutually independent, and to have variances var(Rij) = σ² and var(U0j) = τ0². In the context of multilevel modeling, (4.6) is called the empty model, because it contains not a single explanatory variable. It is important because it provides the basic partition of the variability in the data between the two levels. Given model (4.6), the total variance of Y can be decomposed as the sum of the level-two and level-one variances,

var(Yij) = τ0² + σ².
The covariance between two individuals (i and i′, with i ≠ i′) in the same group j is equal to the variance of the contribution U0j that is shared by these individuals,

cov(Yij, Yi′j) = var(U0j) = τ0².
The estimates obtained from multilevel software for the two variance components, τ0² and σ², will usually be slightly different from the estimates
obtained from the formulas in Section 3.3.1. The reason is that different
estimation methods are used: multilevel software uses the more efficient
ML or REML method (cf. Section 4.7 below), which in most cases cannot
be expressed in an explicit formula; the formulas of Section 3.3.1 are
explicit but less efficient.
Essential assumptions are that all residuals, U0j and Rij, are mutually
independent and have zero means given the values xij of the explanatory
variable. It is assumed that the U0j and Rij are drawn from normally
distributed populations. The population variance of the lower-level
residuals Rij is assumed to be constant across the groups, and is again
denoted by σ²; the population variance of the higher-level residuals U0j is denoted by τ0². Thus, model (4.5) has four parameters: the regression coefficients γ00 and γ10 and the variance components σ² and τ0².
The random variables U0j can be regarded as residuals at the group
level, or group effects that are left unexplained by X. Since residuals, or
random errors, contain those parts of the variability of the dependent
variable that are not modeled explicitly as a function of explanatory
variables, this model contains unexplained variability at two nested levels.
This partition of unexplained variability over the various levels is the
essence of hierarchical random effects models.
The fixed intercept γ00 is the intercept for the average group. The
regression coefficient γ10 can be interpreted as an unstandardized regression
coefficient in the usual way: a one-unit increase in the value of X is
associated with an average increase in Y of γ10 units. The residual variance
(i.e., the variance conditional on the value of X) is

var(Yij | xij) = τ0² + σ²,

while the covariance between two different individuals (i and i′, with i ≠ i′) in the same group is

cov(Yij, Yi′j | xij, xi′j) = τ0².
The fraction of residual variability that can be ascribed to level one is given by σ²/(σ² + τ0²), and to level two by τ0²/(σ² + τ0²).
Part of the covariance or correlation between two individuals in the same group may be explained by their X-values, and part is unexplained. This unexplained, or residual, correlation between the Y-values of these individuals is the residual intraclass correlation coefficient,

ρI(Y | X) = τ0² / (τ0² + σ²).
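As a small numeric sketch (the variance components below are illustrative assumptions, not estimates quoted in the text), the partition of residual variability and the residual intraclass correlation can be computed directly:

```python
# Partition of residual variability in a random intercept model.
# tau02 and sigma2 are illustrative values, not estimates from the text.
tau02 = 9.0    # level-two variance, var(U0j)
sigma2 = 41.0  # level-one variance, var(Rij)

total = tau02 + sigma2                   # residual variance of Y given x
frac_level_one = sigma2 / total          # fraction ascribed to level one
frac_level_two = tau02 / total           # fraction ascribed to level two
rho_residual = tau02 / total             # residual intraclass correlation

# The two fractions sum to 1, and the residual intraclass correlation
# coincides with the level-two fraction.
print(abs(frac_level_one + frac_level_two - 1) < 1e-12)  # True
print(rho_residual == frac_level_two)                    # True
```
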
Table 4.2: Estimates for random intercept model with effect for IQ.
The U0j are class-dependent deviations of the intercept and have a mean of 0 and a variance of τ0² = 9.85 (hence, a standard deviation of √9.85 = 3.14). Figure 4.2 depicts 15 such random
regression lines. This figure can be regarded as a random sample from the population of schools
defined by Table 4.2.
The scatter around the regression lines (i.e., the vertical distances Rij between the observations
and the regression line for the class under consideration) has a variance of 40.47 and therefore a standard deviation of √40.47 = 6.36. These distances between observations and regression lines
therefore tend to be much larger than the vertical distances between the regression lines. However,
the distances between the regression lines are not negligible.
A school with a typical low average achievement (bottom 2.5%) will have a value of U0j of about two standard deviations below the expected value of U0j, so that it will have a regression line of approximately (γ̂00 − 2 × 3.14) + 2.507x, whereas a school with a typical high achievement (top 2.5%) will have a regression line of approximately (γ̂00 + 2 × 3.14) + 2.507x.
There appears to be a strong effect of IQ. Each additional measurement unit of IQ leads, on average,
to 2.507 additional measurement units of the language score. To obtain a scale for effect that is
independent of the measurement units, one can calculate standardized coefficients, that is,
coefficients expressed in standard deviations as scale units. These are the coefficients that would be
obtained if all variables were rescaled to unit variances. They are given by

standardized coefficient = (sd(X)/sd(Y)) × γ̂,

in this case estimated by (2.04/9.00) × 2.507 = 0.57. In other words, each additional standard
deviation on IQ leads, on average, to an increase in language score of 0.57 standard deviations.
The residual variance σ² as well as the random intercept variance τ0² are much lower in this
model than in the empty model (cf. Table 4.1). The residual variance is lower because between-pupil
differences are partially explained. The intercept variance is lower because classes differ in average IQ score, so that this pupil-level variable also explains part of the differences between classes.

Figure 4.2: Fifteen randomly chosen regression lines according to the random intercept model of Table 4.2.

The
residual intraclass correlation is estimated by ρ̂I = 9.85/(9.85 + 40.47) = 0.20, slightly smaller than the raw intraclass correlation of 0.22 (see Table 4.1).
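These two Table 4.2 computations, the standardized coefficient and the residual intraclass correlation, can be checked directly; all inputs below are quoted from the text:

```python
# Arithmetic for the Table 4.2 example (all inputs quoted in the text).
gamma10 = 2.507              # raw regression coefficient of IQ
sd_x, sd_y = 2.04, 9.00      # standard deviations of IQ and language score
tau02, sigma2 = 9.85, 40.47  # intercept and level-one residual variances

std_coef = (sd_x / sd_y) * gamma10    # standardized coefficient
rho_resid = tau02 / (tau02 + sigma2)  # residual intraclass correlation

print(round(std_coef, 2))   # 0.57
print(round(rho_resid, 2))  # 0.2
```
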
These results may be compared to those obtained from an OLS regression analysis, in which the
nesting of students in classes is not taken into account. This analysis can be regarded as an analysis
using model (4.5) in which the intercept variance τ0² is constrained to be 0. The results are displayed
in Table 4.3. The parameter estimates for the OLS method seem rather close to those for the random
intercept model. However, the regression coefficient for IQ differs by about three standard errors
between the two models. This implies that, although the numerical values seem similar, they are
nevertheless rather different from a statistical point of view. Further, the standard error of the
intercept is twice as large in the results from the random intercept model as in those from the OLS
analysis. This indicates that the OLS analysis produces an over-optimistic impression of the precision
of this estimate, and illustrates the lack of trustworthiness of OLS estimates for multilevel data.
The first part of the model, γ00 + γ10 xij, is called the fixed part of the model, because the coefficients are fixed (i.e., nonstochastic). The remainder, U0j + Rij, is called the random part of the model. Rewriting (4.5) as

Yij = (γ00 + U0j) + γ10 xij + Rij,

the part in parentheses is the (random) intercept for group j, and the regression coefficient of X within this group is γ10. The systematic (nonrandom) part for group j is the within-group regression line (γ00 + U0j) + γ10 x.
On the other hand, taking the group average on both sides of the equality sign in (4.9) yields the between-group regression model,

Ȳ.j = γ00 + (γ10 + γ01) x̄.j + U0j + R̄.j,

so that the between-group regression coefficient is γ10 + γ01.
Table 4.4: Estimates for random intercept model with different within- and
between-group regressions.
where U0j is a class-dependent deviation with mean 0 and variance 8.68 (standard deviation 2.95).
The within-class deviations about this regression equation, Rij, have a variance of 40.43 (standard
deviation 6.36). Within each class, the effect (regression coefficient) of IQ is 2.454, so the regression
lines are parallel. Classes differ in two ways: they may have different mean IQ values, which affects
the expected results Y through the term 1.312 × (class mean IQ); this is an explained difference between the classes; and they have randomly differing values for U0j, which is an unexplained difference. These two ingredients contribute to the class-dependent intercept, given by 41.11 + 1.312 × (class mean IQ) + U0j.
The within-group and between-group regression coefficients would be equal if, in formula (4.9),
the coefficient of average IQ were 0 (i.e., γ01 = 0). This null hypothesis can be tested (see Section 6.1) by the t-ratio defined as

t = γ̂01 / S.E.(γ̂01),

given here by 1.312/0.262 = 5.01, a highly significant result. In other words, we may conclude that
the within- and between-group regression coefficients are indeed different.
If the individual IQ variable had been replaced by the within-group deviation scores, IQij minus the class mean IQ (that is, if model (4.10) had been used), then the estimates obtained would have been 2.454 for the within-group coefficient and 3.766 for the between-group coefficient; cf. formulas (4.11) and (4.12). Indeed, the regression equation given above can be described equivalently by

Ŷ = 41.11 + 2.454 × (IQij − class mean IQ) + 3.766 × (class mean IQ),

which indicates explicitly that the within-group regression coefficient is 2.454, while the between-group regression coefficient (i.e., the coefficient for the regression of the group means of Y on the group means of IQ) is 3.766.
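The equivalence of the two parameterizations can be verified numerically; the coefficients below are the ones quoted in the text, while the IQ values are arbitrary illustrations:

```python
# Equivalence of the raw-score and deviation-score parameterizations
# (coefficients as quoted in the text: intercept 41.11, within-group
# coefficient 2.454, group-mean coefficient 1.312; IQ values arbitrary).
g00, g10, g01 = 41.11, 2.454, 1.312

between = g10 + g01  # between-group regression coefficient
print(round(between, 3))  # 3.766

iq, iq_bar = 3.0, 1.5  # a student's IQ and the class mean IQ
raw = g00 + g10 * iq + g01 * iq_bar
dev = g00 + g10 * (iq - iq_bar) + between * iq_bar
print(abs(raw - dev) < 1e-9)  # True
```
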
Note that τ0², being a variance, cannot be negative. This implies that, even if τ0² = 0, a positive observed between-group variability is expected. If the observed between-group variability is equal to or smaller than what is expected from (4.13) for τ0² = 0, then the estimate τ̂0² = 0 is reported (cf. the discussion following (3.11)).
If the group sizes nj are variable, the larger groups will, naturally, have a
larger influence on the estimates than the smaller groups. The influence of
group size on the estimates is, however, mediated by the intraclass
correlation coefficient. Consider, for example, the estimation of the mean
intercept, γ00. If the residual intraclass correlation is 0, the groups have an
influence on the estimated value of γ00 that is proportional to their size. In
the extreme case where the residual intraclass correlation is 1, each group
has an equally large influence, independent of its size. In practice, where
the residual intraclass correlation is between 0 and 1, larger groups will
have a larger influence, but less than proportionately.
Since γ00 is already an estimated parameter, an estimate for β0j will be the
same as an estimate for U0j +γ00. Therefore, estimating β0j and estimating
U0j are equivalent problems given that an estimate for γ00 is available.
If we only used group j, β0j would be estimated by the group mean, which is also the OLS estimate,

β̂0j = Ȳ.j. (4.14)

Combining this with the overall estimate γ̂00 in a weighted average yields

β̂0j^EB = λj Ȳ.j + (1 − λj) γ̂00, (4.15)

where EB stands for 'empirical Bayes' and the weight λj is defined as the reliability of the mean of group j (see (3.21)),

λj = τ0² / (τ0² + σ²/nj).
The ratio of the two weights, λj/(1 − λj), is just the ratio of the true variance τ0² to the error variance σ²/nj. In practice we do not know the true values of the parameters σ² and τ0², and we substitute estimated values to calculate (4.15).
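A minimal sketch of the reliability weight; the variance components and group size are illustrative assumptions, not values from the text:

```python
# Reliability weight of a group mean, lambda_j = tau02 / (tau02 + sigma2/nj).
# tau02, sigma2, and nj are illustrative assumptions.
tau02, sigma2, nj = 9.85, 40.47, 20

lam = tau02 / (tau02 + sigma2 / nj)

# The odds lambda_j / (1 - lambda_j) equal the ratio of the true variance
# tau02 to the error variance sigma2 / nj.
odds = lam / (1 - lam)
print(abs(odds - tau02 / (sigma2 / nj)) < 1e-9)  # True
print(0 < lam < 1)                               # True
```
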
Formula (4.15) is called the posterior mean or empirical Bayes estimate,
for β0j. This term comes from Bayesian statistics. It refers to the distinction
between the prior knowledge about the group effects, which is based only
on the population from which they are drawn, and the posterior knowledge
which is based also on the observations made about this group. There is an
important parallel between random coefficient models and Bayesian
statistical models, because the random coefficients used in the hierarchical
linear model are analogous to the random parameters that are essential in
the Bayesian statistical paradigm. See Gelman et al. (2004, Chapter 1 and
Section 5.4).
Formula (4.15) can be regarded as follows. The OLS estimate (4.14) for group j is pushed slightly toward the general mean γ̂00. This is an example of
shrinkage to the mean just as is used (e.g., in psychometrics) for the
estimation of true scores. The corresponding estimator is sometimes called
the Kelley estimator; see, for example, Kelley (1927), Lord and Novick
(1968), or other textbooks on classical psychological test theory. From the
definition of the weight λj it is apparent that the influence of the data of
group j itself becomes larger as group size nj becomes larger. For large
groups, the posterior mean is practically equal to the OLS estimate Ȳ.j, the intercept that would be estimated from data on group j alone.
In principle, the OLS estimate (4.14) and the empirical Bayes estimate
(4.15) are both sensible procedures for estimating the mean of group j. The
former is an unbiased8 estimate, and does not require the assumption that
group j is a random element from the population of groups. The latter is
biased toward the population mean, but for a randomly drawn group it has a
smaller mean squared error. The squared error averaged over all groups will
be smaller for the empirical Bayes estimate, but the price is a conservative
(drawn to the average) appraisal of the groups with truly very high or very
low values of β0j. The estimation variance of the empirical Bayes estimate is

var(β̂0j^EB − β0j) = (1 − λj) τ0². (4.16)

For a model that also contains level-two explanatory variables, the analogous quantity is

β̂0j^EB = γ̂00 + γ̂01 z1j + . . . + γ̂0q zqj + Û0j^EB, (4.17)

where the values γ̂00, γ̂01, . . . , γ̂0q indicate the (ML or REML) estimates of the regression coefficients. The values (4.17) also are sometimes called posterior intercepts.
The posterior means (4.15) can be used, for example, to see which
groups have unexpectedly high or low values on the outcome variable,
given their values on the explanatory variables. They can also be used in a
residual analysis, for checking the assumption of normality for the random
group effects, and for detecting outliers (see Chapter 10). The posterior
intercepts (4.17) indicate the total main effect of group j, controlling for the
level-one variables X1, . . . , Xp, but including the effects of the level-two
variables Z1, . . . , Zq. For example, in a study of students in schools where
the dependent variable is a relevant indicator of scholastic performance,
these posterior intercepts could be valuable information for the parents
indicating the contribution of the various schools to the performance of
their beloved children.
Example 4.5 Posterior means for random data.
We can illustrate the ‘estimation’ procedure by returning to the random digits table (Chapter 3, Table
3.1). Macro-unit 04 in that table has an average of Ȳj = 31.5 over its 10 random digits. The grand
mean of the total 100 random digits is Ȳ = 47.2. The average of macro-unit 04 thus seems to be far
below the grand mean. But the reliability of this mean is only λj = 26.7/{26.7 + (789.7/10)} = 0.25. Applying (4.15), the posterior mean is calculated as

0.25 × 31.5 + 0.75 × 47.2 ≈ 43.3.

In words, the posterior mean for macro-unit 04 is 75% (i.e., 1 − λj) determined by the grand mean of 47.2 and only 25% (i.e., λj) by its OLS mean of 31.5. The shrinkage to the grand mean is evident.
Because of the low estimated intraclass correlation, ρ̂I = 26.7/(26.7 + 789.7) = 0.03, and the low number of observations
per macro-unit, nj = 10, the empirical Bayes estimate of the average of macro-unit 04 is closer to the
grand mean than to the group mean. In this case this is a clear improvement: there is no between-
group variance in the population, and the posterior mean is much closer to the true value of γ00 + U0j
= 49.5 + 0 = 49.5 than the group average.
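The arithmetic of Example 4.5 can be reproduced directly; all inputs below are quoted in the text:

```python
# Example 4.5: empirical Bayes shrinkage for macro-unit 04 of the
# random digits table (all inputs quoted in the text).
tau02_hat, sigma2_hat, nj = 26.7, 789.7, 10
group_mean, grand_mean = 31.5, 47.2

lam = tau02_hat / (tau02_hat + sigma2_hat / nj)
posterior = lam * group_mean + (1 - lam) * grand_mean

print(round(lam, 2))  # 0.25
# With the exact weight the posterior mean is about 43.2; with the
# rounded weights 0.25 and 0.75 it is about 43.3.
print(43.2 < posterior < 43.3)  # True
```
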
The diagnostic standard error expresses the deviation from the value 0, which is the overall mean of the variables U0j, and is the standard deviation of the empirical Bayes estimate Û0j^EB itself; for the empty model the corresponding variance is λj τ0².
The comparative standard error, also called the posterior standard deviation
of U0j, is used to assess how well the unobserved level-two contributions
U0j can be ‘estimated’ from the data, for example, to compare groups (more
generally, level-two units) with each other, as is further discussed later in
this section. The diagnostic standard error is used when the posterior means
must be standardized for model checking. Using division by the diagnostic
standard errors (and if we may ignore the fact that these are themselves
subject to error due to estimation), the standardized empirical Bayes
estimators can be regarded as standardized residuals having a standard
normal distribution if the model is correct, and this can be used for model
checking. This is discussed in Chapter 10.
A theoretically interesting property is that the sum of the diagnostic and comparative variances is equal to the random coefficient variance:

diagnostic variance + comparative variance = τ0²;

for the empty model, λj τ0² + (1 − λj) τ0² = τ0².
This is shown more generally for hierarchical linear models by Snijders and
Berkhof (2008, Section 3.3.3). The interpretation is that, to the extent that
we are better able to estimate the random coefficients (smaller comparative
standard errors) – for example, because of larger group sizes nj – the
variability of the estimated values Û0j^EB will increase (larger diagnostic
standard errors) because they capture more of the true variation between the
coefficients U0j.
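For the empty model this identity is easy to verify numerically; the variance components and group size below are illustrative assumptions:

```python
# Diagnostic variance + comparative variance = tau02 (empty model).
# tau02, sigma2, and nj are illustrative assumptions.
tau02, sigma2, nj = 9.85, 40.47, 20

lam = tau02 / (tau02 + sigma2 / nj)
var_diagnostic = lam * tau02         # variance of the EB estimates themselves
var_comparative = (1 - lam) * tau02  # posterior variance of U0j

print(abs(var_diagnostic + var_comparative - tau02) < 1e-9)  # True

# A larger group gives a larger lam, hence a smaller comparative and a
# larger diagnostic variance, as described in the text.
lam_big = tau02 / (tau02 + sigma2 / (2 * nj))
print(lam_big > lam)  # True
```
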
The comparative standard error of the empirical Bayes estimate is
smaller than the root mean squared error of the OLS estimate based on the
data only for the given macro-unit (the given school, in our example). This
is just the point of using the empirical Bayes estimate. For the empty model
the comparative standard error is the square root of (4.16), which can also be expressed as

S.E.j = √( τ0² σ² / (nj τ0² + σ²) ).

This formula was also given by Longford (1993, Section 1.7). Thus, the
standard error depends on the within-group as well as the between-group
variance and on the number of sampled students for the school. For models
with explanatory variables, the standard error can be obtained from
computer output of multilevel software. Denoting the standard error for
school j shortly by S.E.j, the corresponding 90% confidence intervals can be calculated as the intervals

(Û0j^EB − 1.64 × S.E.j , Û0j^EB + 1.64 × S.E.j).
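A sketch of the comparative standard error and the associated 90% interval; the variance components, group size, and the school's posterior mean below are illustrative assumptions:

```python
import math

# Comparative standard error for the empty model and a 90% interval
# (1.64 is the standard normal 95th percentile). All inputs are
# illustrative assumptions.
tau02, sigma2, nj = 9.85, 40.47, 20
u0j_eb = 2.0  # hypothetical posterior mean for one school

se_j = math.sqrt(tau02 * sigma2 / (nj * tau02 + sigma2))
lower, upper = u0j_eb - 1.64 * se_j, u0j_eb + 1.64 * se_j

# The same quantity written as (1 - lambda_j) * tau02:
lam = tau02 / (tau02 + sigma2 / nj)
print(abs(se_j ** 2 - (1 - lam) * tau02) < 1e-9)  # True
print(lower < u0j_eb < upper)                     # True
```
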
Figure 4.4: The added value scores for 211 schools with comparative
posterior confidence intervals.
Once again we clearly observe the shrinkage since the OLS means (×) are further apart than the
posterior means (•). Furthermore, as we would expect from a random digits example, none of the
pairwise comparisons results in any significant differences between the macro-units since all 10
confidence intervals overlap.
Figure 4.5: OLS means (×) and posterior means with comparative posterior confidence intervals.

For three levels, the level-one model is

Yijk = β0jk + β1 xijk + Rijk, (4.20)

where β0jk is the intercept in level-two unit j within level-three unit k. For the intercept we have the level-two model,

β0jk = δ00k + U0jk, (4.21)

where δ00k is the average intercept in level-three unit k. For this average intercept we have the level-three model,

δ00k = γ000 + V00k. (4.22)

This shows that there are now three residuals, as there is variability on three levels. Their variances are denoted by

var(Rijk) = σ², var(U0jk) = τ², var(V00k) = φ².

The total variance between all level-one units now equals σ² + τ² + φ², and the population variance between the level-two units is τ² + φ². Substituting (4.22) and (4.21) into the level-one model (4.20) and using (in view of the next chapter) the triple indexing notation γ100 for the regression coefficient β1 yields

Yijk = γ000 + γ100 xijk + V00k + U0jk + Rijk.
students in the same schools thus is estimated to be 0.18, while the intraclass correlation expressing
the likeness of students in the same classes and the same schools thus is estimated to be 0.33. In
addition, one can estimate the intraclass correlation that expresses the likeness of classes in the same
schools. This level-two intraclass correlation is estimated to be 2.124/(2.124 + 1.746)= 0.55. This is
more than 0.5: the school level contributes slightly more to variability than the class level. The
interpretation is that if one randomly takes two classes within one school and calculates the average
mathematics achievement level in one of the two, one can predict reasonably accurately the average
achievement level in the other class. Of course we could have estimated a two-level model as well,
ignoring the class level, but that would have led to a redistribution of the class-level variance to the
two other levels, and it would affect the validity of hypothesis tests for added fixed effects.
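The intraclass correlations quoted above can be reproduced; φ² (school) and τ² (class) are the values given in the text, while σ² is an assumed student-level variance chosen to be consistent with the reported correlations:

```python
# Intraclass correlations for the three-level example. phi2 and tau2 are
# quoted in the text; sigma2 is an assumption consistent with the
# reported correlations (0.18, 0.33, 0.55).
phi2, tau2 = 2.124, 1.746  # school- and class-level variances
sigma2 = 7.9               # assumed student-level variance

total = sigma2 + tau2 + phi2
icc_same_school = phi2 / total          # students in the same school
icc_same_class = (tau2 + phi2) / total  # students in the same class
icc_level2 = phi2 / (phi2 + tau2)       # classes in the same school

print(round(icc_same_school, 2))  # 0.18
print(round(icc_same_class, 2))   # 0.33
print(round(icc_level2, 2))       # 0.55
```
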
Model 2 shows that the fixed effect of IQ is very strong, with a t-ratio (see Section 6.1) of
0.121/0.005 = 24.2. (The intercept changes drastically because the IQ score does not have a zero
mean; the conventional IQ scale, with a population mean of 100, was used.) Adding the effect of IQ
leads to a stronger decrease in the class- and school-level variances than in the student-level variance.
This suggests that schools and classes are rather internally homogeneous with respect to IQ and/or
that intelligence may play its role partly at the school and class levels.
Table 4.6: Estimates for three-level model with distinct within-class, within-
school, and between-school regressions.
As in Section 4.6, replacing the variables by the deviation scores leads to an equivalent model
formulation in which, however, the within-class, between-class, and between-school regression
coefficients are given directly by the fixed parameters. In the three-level case, this means that we
must use the following three variables:
the within-class deviation score of the student from the class mean;
the within-school deviation score of the class mean from the school mean; and
the school mean itself.
The results are shown as Model 4.
We see here that the within-class regression coefficient is 0.107, equal to the coefficient of
student-level IQ in Model 3; the between-class/within-school regression coefficient is 0.212, equal
(up to rounding errors) to the sum of the student-level and the class-level coefficients in Model 3;
while the between-school regression coefficient is 0.252, equal to the sum of all three coefficients in
Model 3. From Model 3 we know that the difference between the last two coefficients is not
significant.
4.10 Glommary
Regression model. A statistical model for investigating how a dependent
variable can be predicted, or explained, from one or more explanatory
variables, also called independent variables or predictor variables.
The usual form for this dependence is a linear model.
Intercept. The constant term in a linear function. For the linear function Y
= a + bX, the intercept is a; this is the value of Y corresponding to X =
0.
Fixed effects model. A statistical model with fixed effects only, where the
only random term is the residual term at the lowest level.
In the previous chapter the simpler case of the hierarchical linear model was
treated, in which only intercepts are assumed to be random. In the more
general case, slopes may also be random. In a study of students within
schools, for example, the effect of the pupil’s intelligence or socio-
economic status on scholastic performance could differ between schools.
This chapter presents the general hierarchical linear model, which allows
intercepts as well as slopes to vary randomly. The chapter follows the
approach of the previous one: most attention is paid to the case of a two-
level nesting structure, and the level-one units are called – for convenience
only – ‘individuals’, while the level-two units are called ‘groups’. The
notation is also the same.
The intercepts β0j as well as the regression coefficients, or slopes, β1j are group-dependent. These group-dependent coefficients can be split into an average coefficient and the group-dependent deviation:

β0j = γ00 + U0j,
β1j = γ10 + U1j. (5.2)

Substitution then yields the model

Yij = γ00 + γ10 xij + U0j + U1j xij + Rij. (5.3)
It is assumed here that the level-two residuals U0j and U1j as well as the
level-one residuals Rij have mean 0, given the values of the explanatory
variable X. Thus, γ10 is the average regression coefficient just as γ00 is the
average intercept. The first part of (5.3), γ00+γ10xij, is called the fixed part of
the model. The second part, U0j+U1jxij+Rij, is called the random part.
The term U1j xij can be regarded as a random interaction between group
and X. This model implies that the groups are characterized by two random
effects: their intercept and their slope. These are called latent variables,
meaning that they are not directly observed but play a role ‘behind the
scenes’ in producing the observed variables. We say that X has a random
slope, or a random effect, or a random coefficient. These two group effects
will usually not be independent, but correlated. It is assumed that, for
different groups, the pairs of random effects (U0j, U1j) are independent and
identically distributed, that they are independent of the level-one residuals
Rij, and that all Rij are independent and identically distributed. The variance
of the level-one residuals Rij is again denoted σ²; the variances and covariance of the level-two residuals (U0j, U1j) are denoted as follows:

var(U0j) = τ00, var(U1j) = τ11, cov(U0j, U1j) = τ01. (5.4)
Just as in the preceding chapter, one can say that the unexplained group
effects are assumed to be exchangeable.
5.1.1 Heteroscedasticity
Model (5.3) implies not only that individuals within the same group have
correlated Y-values (recall the residual intraclass correlation coefficient of
Chapter 4), but also that this correlation as well as the variance of Y are
dependent on the value of X. For example, suppose that, in a study of the
effect of socio-economic status (SES) on scholastic performance (Y), we
have schools which do not differ in their effect on high-SES children, but
do differ in the effect of SES on Y (e.g., because of teacher expectancy
effects). Then for children from a high-SES background it does not matter
which school they go to, but for children from a low-SES background it
does. The school then adds a component of variance for the low-SES
children, but not for the high-SES children: as a consequence, the variance
of Y (for a random child at a random school) will be larger for the former
than for the latter children. Further, the intraclass correlation will be nil for
high-SES children, whereas for low-SES children it will be positive.
This example shows that model (5.3) implies that the variance of Y,
given the value x on X, depends on x. This is called heteroscedasticity in the
statistical literature. An expression for the variance of (5.3) is obtained as
the sum of the variances of the random variables involved plus a term
depending on the covariance between U0j and U1j (the other random
variables are uncorrelated). Here we also use the independence between the level-one residual Rij and the level-two residuals (U0j, U1j). From (5.3) and (5.4), we obtain the result

var(Yij | xij) = τ00 + 2 τ01 xij + τ11 xij² + σ². (5.5)
Similarly, for two different individuals (i and i′, with i ≠ i′) in the same group,

cov(Yij, Yi′j | xij, xi′j) = τ00 + τ01 (xij + xi′j) + τ11 xij xi′j. (5.6)
Formula (5.5) implies that the residual variance of Y is minimal for xij = −
τ01/τ11. (This is deduced by differentiation with respect to xij.) When this
value is within the range of possible X-values, the residual variance first
decreases and then increases again; if this value is smaller than all X-values,
then the residual variance is an increasing function of x; if it is larger than
all X-values, then the residual variance is decreasing.
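The behavior of this quadratic variance function can be sketched numerically; the variance components below are illustrative assumptions:

```python
# Residual variance of Y as a function of x under a random slope model,
# formula (5.5): tau00 + 2*tau01*x + tau11*x**2 + sigma2.
# The variance components are illustrative assumptions.
tau00, tau01, tau11, sigma2 = 8.0, -0.8, 0.2, 40.0

def resid_var(x):
    return tau00 + 2 * tau01 * x + tau11 * x * x + sigma2

x_min = -tau01 / tau11  # minimizer, obtained by differentiation

# The variance is no smaller at neighboring points than at x_min:
print(x_min)  # 4.0
print(resid_var(x_min) <= resid_var(x_min - 1))  # True
print(resid_var(x_min) <= resid_var(x_min + 1))  # True
```
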
It is clear that there are slope differences between the three schools.
Looking at the Y(1)-axis, there are almost no intercept differences between
the schools. But if we add a value 10 to each intelligence score x, then the
Y-axis is shifted to the left by 10 units: the Y(2)-axis. Now school 3 is the
best, school 1 the worst: there are strong intercept differences. If we had
subtracted 10 from the x-scores, we would have obtained the Y(3)-axis,
again with intercept differences but now in reverse order. This implies that
the intercept variance τ00, as well as the intercept-by-slope covariance τ01,
depend on the origin (0-value) for the X-variable. From this we can learn
two things:
The results can be read from Table 5.1. Note that the ‘Level-two random
part’ heading refers to the random intercept and random slope which are
random effects associated with the level-two units (the class), but that the
variable that has the random slope, IQ, is itself a level-one variable.
Should the value of 0.195 for the random slope variance be considered
high? The slope standard deviation is √0.195 ≈ 0.44 and the average slope
is γ10 = 2.48. The values of the average slope
Figure 5.2: Fifteen random regression lines according to the model of Table
5.1 (with randomly chosen intercepts and slopes).
± two standard deviations range from 1.60 to 3.36. This implies that the
effect of IQ is clearly positive in all classes, but high effects of IQ are more
than twice as large as low effects. This may indeed be considered an
important difference. (As indicated above, ‘high’ and ‘low’ are respectively
understood here as those values occurring in classes with the top 2.5% and
the bottom 2.5% of the class-dependent effects.)
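The interval just quoted can be reproduced in a few lines, using the slope variance 0.195 and average slope γ10 = 2.48 from Table 5.1 as cited in the text.

```python
import math

# Middle 95% range of the class-dependent IQ slopes:
# gamma10 +/- 2 * sd(U1j), with slope variance 0.195 and gamma10 = 2.48
# (values quoted from Table 5.1).
slope_var = 0.195
gamma10 = 2.48

slope_sd = math.sqrt(slope_var)                  # about 0.44
low = gamma10 - 2 * slope_sd
high = gamma10 + 2 * slope_sd
print(round(low, 2), round(high, 2))             # 1.6 3.36
```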
Recall from Example 4.2 that the standard deviation of the IQ score is about
2, and the mean is 0. Hence students with an intelligence among the bottom
few percent or the top few percent have IQ scores of about ±4. Substituting
these values in the contribution of the random effects gives U0j ± 4U1j. It
follows from equations (5.5) and (5.6) that for students with IQ = ±4, we
have
and therefore
Hence, the language test scores of the most intelligent and the least
intelligent students in the same class are positively correlated over the
population of classes: classes that have relatively good results for the less
able tend also to have relatively good results for the more able students.
This positive correlation corresponds to the result that the value of IQ
for which the variance given by (5.5) is minimal, is outside the range from
−4 to +4. For the estimates in Table 5.1, this variance is (with some
rounding)
and
Substituting (5.7) and (5.8) into this equation leads to the model
which we have rearranged so that the fixed part comes first and the random
part next. Comparing this with model (5.3) shows that this explanation of
the random intercept and slope leads to a different fixed part of the model,
but does not change the formula for the random part, which remains U0j +
U1jxij + Rij. However, it is to be expected that the residual random intercept
and slope variances, τ0² and τ1², will be less than their counterparts in model
(5.3) because part of the variability of intercept and slopes now is explained
by Z. In Chapter 7 we will see, however, that this is not necessarily so for
the estimated values of these parameters.
Equation (5.9) shows that explaining the intercept β0j by a level-two
variable Z leads to a main effect of Z, while explaining the coefficient β1j of
X by the level-two variable Z leads to a product interaction effect of X and
Z. Such an interaction between a level-one and a level-two variable is called
a cross-level interaction.
For the definition of interaction variables such as the product zj xij in
(5.9), it is advisable to use component variables Z and X for which the
values Z = 0 and X = 0, respectively, have some interpretable meaning. For
example, the variables Z and X could be centered around their means, so
that Z = 0 means that Z has its average value, and analogously for X.
Another possibility is that the zero values correspond to some kind of
reference value. The reason is that, in the presence of the interaction term
γ11 zj xij, the main effect coefficient γ10 of X is to be interpreted as the effect
of X for cases with Z = 0, while the main effect coefficient γ01 of Z is to be
interpreted as the effect of Z for cases with X = 0.
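The centering recommended here is easy to carry out before forming the product term; the toy data below are hypothetical and serve only to show the construction of the cross-level interaction variable zj xij.

```python
# Centering component variables before forming a cross-level interaction
# z_j * x_ij, so that X = 0 and Z = 0 correspond to the sample means.
# Toy data: four individuals in two groups (hypothetical values).

def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

x = [1.0, 3.0, 5.0, 7.0]          # level-one variable X (one value per individual)
z_group = [10.0, 20.0]            # level-two variable Z (one value per group)
z = [z_group[0], z_group[0], z_group[1], z_group[1]]  # expanded to individuals

xc = center(x)                    # [-3, -1, 1, 3]
zc = center(z)                    # [-5, -5, 5, 5]
interaction = [a * b for a, b in zip(xc, zc)]   # the product term zc * xc
```

With this coding, the main effect of X is the effect for cases with average Z, and vice versa, as explained above.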
Table 5.2: Estimates for model with random slope and cross-level
interaction.
estimated as
The classroom mean of the centered IQ variable has a rather skewed
distribution, ranging from −4.8 to +2.5. For this classroom mean ranging
between −4.8 and +2.5, the fixed part of this expression ranges (in reverse order) between
about 2.00 and 3.36. This implies sizeable differences in the effect of
intelligence on language score between classes with students having low
average IQ and those with students of high average IQ, classrooms with
lower average IQ being associated with higher effects of intelligence on
language scores. (We must note, however, that further analysis of these data
will show that the final conclusions will be different.)
More variables
The preceding models can be extended by including more variables that
have random effects, and more variables explaining these random effects.
Suppose that there are p level-one explanatory variables, X1, . . . ,Xp, and q
level-two explanatory variables, Z1, . . . ,Zq. Then, if the researcher is not
afraid of a model with too many parameters, he can consider the model
where all X-variables have varying slopes, and where the random intercept
as well as all these slopes are explained by all Z-variables. At the within-group level (i.e., for the individuals) the model is then a regression model
with p variables,
The explanation of the regression coefficients β0j, β1j, . . . , βpj is based
on the between-group model, which is a q-variable regression model for the
group-dependent coefficient βhj,
Substituting (5.11) into (5.10) and rearranging terms then yields the model
This shows that we obtain main effects of each X and Z variable as well as
all cross-level product interactions. Further, we see the reason why, in
formula (4.8), the fixed coefficients were called γh0 for the level-one
variable Xh, and γ0k for the level-two variable Zk.
In practice, however, this complete model is usually simplified in two ways.
1. Not all X-variables are considered to have random slopes. Note that the
explanation of the variable slopes by the Z-variables may have led to
some random residual coefficients that are not significantly different
from 0 (testing the τ parameters is discussed in Chapter 6) or even
estimated to be 0.
2. Given that the coefficients βhj of a certain variable Xh, are variable
across groups, it is not necessary to use all variables Zk in explaining
their variability. The number of cross-level interactions can be
restricted by explaining each βhj by only a well-chosen subset of the
Zk.
Which variables to give random slopes, and which cross-level and other
interaction variables to use, will depend on subject-matter as well as
empirical considerations. The statistical aspects of testing and model fitting
are treated in later chapters. For further help in the interpretation of
interactions, in particular, methods to determine the values of zj where the
regression on xij is different from zero (or vice versa), see Curran and Bauer
(2005) and Preacher et al. (2006).
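The simple-slope computations discussed by these authors amount to evaluating γ10 + γ11 zj and its standard error at chosen values of zj; the sketch below uses the standard delta-rule formula for the standard error, with purely illustrative parameter values.

```python
import math

# Simple slope of X at a given z: gamma10 + gamma11 * z, with standard error
# sqrt(var(g10) + z**2 * var(g11) + 2 * z * cov(g10, g11))
# (the approach of Curran and Bauer, 2005, and Preacher et al., 2006).
# All numerical values below are hypothetical.

def simple_slope(z, g10, g11):
    return g10 + g11 * z

def simple_slope_se(z, var_g10, var_g11, cov_g10_g11):
    return math.sqrt(var_g10 + z ** 2 * var_g11 + 2 * z * cov_g10_g11)

slope = simple_slope(1.5, g10=2.2, g11=-0.18)                     # 1.93
se = simple_slope_se(1.5, var_g10=0.01, var_g11=0.004,
                     cov_g10_g11=-0.001)
t_ratio = slope / se   # compare with a t or normal reference distribution
```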
are the fixed and the random parts of the model, respectively.
In cases where the explanation of the random effects works extremely
well, one may end up with models with no random effects at level two. In
other words, the random intercept U0j and all random slopes Uhj in (5.15)
have zero variance, and may just as well be omitted from the formula. In
this case, the resulting model may be analyzed just as well by OLS
regression analysis, because the residuals are independent and have
constant variance. Of course, this is known only after the multilevel
analysis has been carried out. In such a case, the within-group dependence
between measurements has been fully explained by the available
explanatory variables (and their interactions). This underlines the fact that
whether the hierarchical linear model is a more adequate model for analysis
than OLS regression depends not on the mutual dependence of the
measurements, but on the mutual dependence of the residuals.
5.3 Specification of random slope models
Given that random slope models are available, the researcher has many
options to model his data. Each predictor may be assigned a random slope,
and each random slope may covary with any other random slope.
Parsimonious models, however, should be preferred, if only for the simple
reason that a strong scientific theory is general rather than specific. A good
reason for choosing between a fixed and a random slope for a given
predictor variable should preferably be found in the theory that is being
investigated. If the theory (whether this is a general scientific theory or a
practical policy theory) does not give any clue with respect to a random
slope for a certain predictor variable, then one may be tempted to refrain
from using random slopes. However, this implies a risk of invalid statistical
tests, because if some variable does have a random slope, then omitting this
feature from the model could affect the estimated standard errors of the
other variables. The specification of the hierarchical linear model, including
the random part, is discussed more fully in Section 6.4 and Chapter 10.
In data exploration, one can try various specifications. Often it appears
that the chance of detecting slope variation is high for variables with strong
fixed effects. This, however, is an empirical rather than a theoretical
assertion. Actually, it may well be that when a fixed effect is – almost –
zero, there does exist slope variation. Consider, for instance, the case where
male teachers treat boys advantageously over girls, while female teachers
do the reverse. If half of the sample consists of male and the other half of
female teachers, then, all other things being equal, the main gender effect
on achievement will be absent, since in half of the classes the gender effect
will be positive and in the other half negative. The fixed effect of students’
gender is then zero but varies across classes (depending on the teachers’
gender). In this example, of course, the random effect would disappear if
one specified the cross-level interaction effect of teachers’ gender with
students’ gender.
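A toy numerical version of this example makes the point concrete: with the gender effect equal to +d in half of the classes and −d in the other half (d chosen arbitrarily here), the fixed effect is exactly zero while the slope variance is d².

```python
# Teacher-gender example in numbers: the student-gender effect is +d in
# half of the classes and -d in the other half. The value d = 2.0 and the
# number of classes are hypothetical, for illustration only.
d = 2.0
class_effects = [d if k % 2 == 0 else -d for k in range(100)]

mean_effect = sum(class_effects) / len(class_effects)     # the fixed effect
slope_variance = sum((e - mean_effect) ** 2
                     for e in class_effects) / len(class_effects)
# mean_effect = 0.0 but slope_variance = d**2 = 4.0: a zero fixed effect
# does not imply the absence of slope variation.
```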
and
This shows that the two models differ in the term U1j X̄.j, which is included
in the group-mean-centered random slope model but not in the other model.
Therefore, in general, there is no one-to-one relation between the parameters of
the two models, so the models are not statistically equivalent, except for the
extraordinary case where variable X has no between-group variability.
This implies that in constant slope models one can use either Xij and X̄.j
or (Xij − X̄.j) and X̄.j as predictors, since this results in statistically equivalent
models; but in random slope models one should carefully choose one or the
other specification, depending on substantive considerations and/or model
fit.
On which consideration should this choice be based? Generally one
should be reluctant to use group-mean-centered random slopes models
unless there is a clear theory (or an empirical clue) that not the absolute
score Xij but rather the relative score (Xij − X̄.j) is related to Yij. Now (Xij −
X̄.j) indicates the relative position of an individual in his or her group, and
examples of instances where one may be particularly interested in this
variable are:
(level-one model)
(level-two model for intercept)
(level-two model for slope)
(level-three model for intercept)
(level-three model for slope)
5.6 Glommary
Hierarchical linear model. The main model for multilevel analysis, treated
in this chapter.
Random slope. The random residual at level two in the hierarchical linear
model indicating group-level deviations in the effect of an explanatory
variable X on the dependent variable, symbolized in formula (5.3) by
U1j and in the more general formula (5.15) by Uhj. It is associated with
the slope variance, the variance of Uhj.
To interpret the random slope variance, it is often helpful to consider
the distribution of the group-dependent slopes. This distribution
follows from
1
In the older literature, these equations were applied to the estimated groupwise regression
coefficients rather than the latent coefficients. The statistical estimation then was carried out in two
stages: first ordinary least squares (OLS) estimation within each group, then OLS estimation with the
estimated coefficients as outcomes. This is statistically inefficient unless the group sizes are very
large, and does not distinguish the ‘true score’ variability of the latent coefficients from the sampling
variability of the estimated groupwise regression coefficients. A two-step approach which does make
this distinction is briefly treated in Section 3.7, following equation (3.40).
2
It is mathematically possible that some variables have a random but not a fixed effect. This makes
sense only in special cases.
6
Testing and Model Specification
Table 6.1: Estimates for two models with different between- and within-
group regressions.
The results for Model 2 can be used to test whether the within-group or between-group
regressions are 0. The t-statistic for testing the within-group regression is 2.265/0.065 = 34.9, the
statistic for testing the between-group regression is 2.912/0.262 = 11.1. Both are extremely
significant. In conclusion, there are positive within-group as well as between-group regressions, and
these are different from one another.
Under the null hypothesis, the distribution of this statistic divided by q can
be approximated by an F distribution, with q degrees of freedom in the
numerator, while the degrees of freedom in the denominator are determined
as for the t-test explained above. Here, for a large number of degrees of
freedom (say, more than 40), the F distribution is approximated by a chi-
squared distribution with q degrees of freedom.
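This approximation can be checked numerically by comparing critical values of the two reference distributions; the sketch below assumes the SciPy library is available.

```python
# With many denominator degrees of freedom, q * F(q, df) is approximately
# chi-squared with q df. Compare the upper 5% critical values of F(q, df)
# and of chi-squared(q)/q (SciPy assumed to be installed).
from scipy.stats import chi2, f

q, df_denom = 3, 1000
f_crit = f.ppf(0.95, q, df_denom)            # F critical value
chi2_crit_scaled = chi2.ppf(0.95, q) / q     # chi-squared critical value / q
# The two critical values nearly coincide for large df_denom.
```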
The way of obtaining tests presented in this section is not applicable to
tests of whether parameters (variances or covariances) in the random part of
the model are 0. The reason is that, if a population variance parameter is 0,
its estimate divided by the estimated standard error does not approximately
have a t distribution. Tests for such hypotheses are discussed in the next
section.
The symmetric 95% confidence intervals based on the standard errors are defined as the parameter
estimates plus or minus 1.96 times the standard errors. These confidence intervals are less reliable. In
this example we use for the variances the standard errors in Table 4.4, and for the standard deviations
the standard errors from equation (6.2). The results for the standard deviations are
* From τ̂0² = 8.680, calculate τ̂0 = √8.680 = 2.946.
* From S.E.(τ̂0²) = 1.096 and formula (6.2), calculate S.E.(τ̂0) = 1.096/(2 × 2.946) = 0.1860.
* The confidence interval for τ0 now extends from 2.946 − 1.960 × 0.1860 = 2.581 to 2.946 + 1.960 × 0.1860 = 3.311.
* Hence the confidence interval for τ0² extends from 2.581² = 6.66 to 3.311² = 10.96. This is still not as good as the interval based on the profile likelihood, but considerably better than the symmetric interval directly based on the estimated intercept variance.
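The steps above can be reproduced directly, using the estimates 8.680 and 1.096 quoted in the text.

```python
import math

# Delta-method confidence interval for the intercept variance tau0**2,
# constructed on the standard-deviation scale and squared back
# (numbers taken from the worked example in the text).
tau0_sq = 8.680
se_tau0_sq = 1.096

tau0 = math.sqrt(tau0_sq)                      # 2.946
se_tau0 = se_tau0_sq / (2 * tau0)              # formula (6.2): S.E./(2*tau0)
low_sd = tau0 - 1.960 * se_tau0                # 2.581
high_sd = tau0 + 1.960 * se_tau0               # 3.311
low_var, high_var = low_sd ** 2, high_sd ** 2
print(round(low_var, 2), round(high_var, 2))   # 6.66 10.96
```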
These considerations are nice – but how should one proceed in practice?
To get an insight into the data, it is usually advisable to start with a
descriptive analysis of the variables: an investigation of their means,
standard deviations, correlations, and distributional forms. It is also helpful
to make a preliminary (‘quick and dirty’) analysis with a simpler method
such as OLS regression.
When starting with the multilevel analysis as such, in most situations
(longitudinal data may provide an exception), it is advisable to start with
fitting the empty model (4.6). This gives the raw within-group and between-
group variances, from which the estimated intraclass correlation can be
calculated. These parameters are useful as a general description and a
starting point for further model fitting. The process of further model
specification will include forward steps – select additional effects (fixed or
random), test their significance, and decide whether or not to include them
in the model – and backward steps – exclude effects from the model
because they are not important from a statistical or substantive point of
view. We mention two possible approaches in the following subsections.
Substitution yields
This model has a number of level-one variables with fixed and random
effects, but it will usually not be necessary to include all random effects.
For the precise specification of the level-one model, the following steps
are useful.
With respect to the random slopes, one may be restricted by the fact that
data usually contain less information about random effects than about fixed
effects. Including many random slopes can therefore lead to long iteration
processes of the estimation algorithm. The algorithm may even fail to
converge. For this reason it may be necessary to specify only a small
number of random slopes.
After this process, one has arrived at a model with a number of level-
one variables, some of which have a random effect in addition to their fixed
effect. It is possible that the random intercept is the only remaining random
effect. This model is an interesting intermediate product, as it indicates the
within-group regressions and their variability.
6.5 Glommary
t-test. The easiest test for a fixed coefficient is the so-called Wald test. Its
test statistic is the t-ratio, that is, the ratio of estimate to standard error.
This ratio is tested against a t distribution. For small N the number of
degrees of freedom is important and rules were given for this purpose.
ML and REML estimation. For Wald tests of parameters in the fixed part
when the number of higher-level units is rather small (N ≤ 50), REML
standard errors must be used. For deviance tests of random part
parameters, ML estimates must be used.
1
This is one of the common principles for construction of a t-test. This type of test is called the Wald
test, after the statistician Abraham Wald (1902–1950).
2
In the first edition, we erroneously stated that exactly the same procedure could be followed.
3
These approximations are based on linear approximations to the logarithmic and square root
functions (the so-called delta method), and are invalid if these standard errors are too large.
4
The profile likelihood function for a given parameter is the value of the likelihood function for this
parameter, maximized over all other parameters.
5
At the time of writing, it is implemented in the beta version lme4a, available from R-Forge.
7
How Much Does the Model Explain?
From Table 7.1 we see that in the balanced as well as in the unbalanced
case, the estimated between-group variance τ̂0² increases as a within-group
deviation variable is added as an explanatory variable to the model.
Furthermore, for the balanced case, the estimated level-one variance σ̂² is
not affected by adding a group-level variable to the model. In the
unbalanced case, σ̂² increases slightly when adding the group variable.
When R² is defined as the proportional reduction in the residual variance
parameters, as discussed above, then R² on the group level is negative for
Model C, while for the entire data set R² on the pupil level is negative for
Model B. Estimating σ² and τ0² using the REML method results in slightly
different parameter estimates. The pattern, however, remains the same. It is
argued below that defining R² as the proportional reduction in the residual
variance parameters σ² and τ0², respectively, is not the best way to define a
measure analogous to R² in the linear regression model; and that the
problems mentioned can be solved by using other definitions, leading to the
measure denoted below by R₁².
7.1.2 Definitions of the proportion of explained variance in two-
level models
In multiple linear regression, the customary R2 parameter can be introduced
in several ways: for example, as the maximal squared correlation coefficient
between the dependent variable and some linear combination of the
predictor variables, or as the proportional reduction in the residual variance
parameter due to the joint predictor variables. A very appealing principle to
define measures of modeled (or explained) variation is the principle of
proportional reduction of prediction error. This is one of the definitions of
R2 in multiple linear regression, and can be described as follows. A
population of values is given for the explanatory and the dependent
variables (X1i, …, Xqi, Yi), with a known joint probability distribution; β is
the value for which the expected squared error
as well as for the fitted model (7.2), and compute 1 minus the ratio of these
values. In other words, R₁² is just the proportional reduction in the value of
σ̂² + τ̂0² due to including the X-variables in the model. For a sequence of
nested models, the contributions to the estimated value of (7.3) due to
adding new predictors can be considered to be the contribution of these
predictors to the explained variance at level one.
To illustrate this, we once again use the data from the first (balanced)
example, and estimate the proportional reduction of prediction error for a
model where within-group and between-groups regression coefficients may
be different.
From Table 7.2 we see that σ̂² + τ̂0² for model A amounts to 10.965, and for
model D to 7.964. R₁² is thus estimated to be 1 − (7.964/10.965) = 0.274.
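The computation is trivial but worth making explicit: the proportional reduction in prediction error at level one is one minus the ratio of the two totals quoted in the text.

```python
# Level-one explained variance as proportional reduction in prediction
# error: 1 - (sigma^2 + tau0^2 for the fitted model) /
#            (sigma^2 + tau0^2 for the empty model),
# using the totals 10.965 (model A) and 7.964 (model D) quoted in the text.
total_empty = 10.965     # sigma^2 + tau0^2, model A
total_fitted = 7.964     # sigma^2 + tau0^2, model D

r1_squared = 1 - total_fitted / total_empty
print(round(r1_squared, 3))   # 0.274
```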
This implies that the overall covariance matrix of X is the sum of these,
Further, the covariance matrix of the group average for a group of size n is
It may be noted that this notation deviates slightly from the common
split of Xij into
The split (7.5) is a population-based split, whereas the more usual split (7.6)
is sample based. In the notation used here, the covariance matrix of the
within-group deviation variable is
For the discussion in this section, the present notation is more convenient.
The split (7.5) is not a completely innocuous assumption. The
independence between the level-two and level-one components implies that the
covariance matrix of the group means is at least as large2 as 1/(n − 1) times
the within-group covariance matrix of X.
The vector of explanatory variables Z = (Z1,..., Zq) at level two has value
Zj for group j. The vector of expectations of Z is denoted
This equation holds only if Z and the level-two component of X are uncorrelated. For the special case
where all explanatory variables are uncorrelated, this expression is equal to
(This holds, for example, if there is only one level-one and only one level-
two explanatory variable.) This formula shows that, in this special case, the
contribution of each explanatory variable to the variance of the dependent
variable is given by the product of the regression coefficient and the
variance of the explanatory variable.
The decomposition of X into independent level-one and level-two parts
allows us to indicate precisely which parts of (7.7) correspond to the
unconditional level-one variance of Y, and which parts to the
unconditional level-two variance
This shows how the within-group variation of the level-one variables eats
up some part of the unconditional level-one variance; parts of the level-two
variance are eaten up by the variation of the level-two variables, and also by
the between-group (composition) variation of the level-one variables.
Recall, however, the definition of the level-two component of X, which implies that the between-group
variation of X is taken net of the ‘random’ variation of the group mean,
which may be expected given the within-group variation of Xij.
without bothering about whether some of the xhij are level-one or level-two
variables, or products of a level-one and a level-two variable.
Recall that in this section the explanatory variables X are stochastic. The
vector X = (X1,…, Xq) of all explanatory variables has mean µX and
covariance matrix ΣX. The subvector (X1,…, Xp) of variables that have
random slopes has mean µX(p) and covariance matrix ΣX(p). These
covariance matrices could be split into within-group and between- group
parts, but that is left to the reader.
The covariance matrix of the random slopes (U1j,..., Upj) is denoted by
T11 and the p × 1 vector of the intercept–slope covariances is denoted by
T10.
With these specifications, the variance of the dependent variable can be
shown to be given by
(A similar expression, but without taking the fixed effects into account, is
given by Snijders and Bosker (1993) as formula (21).) A brief discussion of
all terms in this expression is as follows.
1. The first term gives the contribution of the fixed effects and may
be regarded as the ‘explained part’ of the variance. This term could be split
into a level-one and a level-two part as in the preceding subsection.
3. The penultimate term, trace(T11 ΣX(p)), is the contribution of the random
slopes to the variance of Y. In the extreme case where all variables X1,…, Xp
would be uncorrelated and have unit variances, this expression reduces to
the sum of the random slope variances. This term also could be split
into a level-one and a level-two part.
7.3 Glommary
Estimated variance parameters. These may go up when variables are
added to the hierarchical linear model. This seems strange but it is a
known property of the hierarchical linear model, indicating that one
should be careful with the interpretation of the fine details of how the
variance in the dependent variable is partitioned across the levels of
the nesting structure.
1
This is a more advanced section which may be skipped by the reader.
2
The word ‘large’ is meant here in the sense of the ordering of positive definite symmetric matrices.
8
Heteroscedasticity
The hierarchical linear model is quite a flexible model, and it has some
other features in addition to the possibility of representing a nested data
structure. One of these features is the possibility of representing multilevel
as well as single-level regression models where the residual variance is not
constant.
In ordinary least squares regression analysis, a standard assumption is
homoscedasticity: residual variance is constant, that is, it does not depend
on the explanatory variables. This assumption was made in the preceding
chapters, for example, for the residual variance at level one and for the
intercept variance at level two. The techniques used in the hierarchical
linear model allow us to relax this assumption and replace it with the
weaker assumption that variances depend linearly or quadratically on
explanatory variables. This opens up an important special case of
heteroscedastic models, that is, models with heterogeneous variances:
heteroscedasticity where the variance depends on given explanatory
variables. More and more programs implementing the hierarchical linear
model also allow this feature. This chapter treats a two-level model, but the
techniques treated (and the software mentioned) can be used also for
heteroscedastic single-level regression models.
where the value of X1 for a given unit is denoted by x1ij, while the random
part at level one now has two parameters, σ0² and σ01. The reason for
incorporating the factor 2 and calling the parameter σ01 will become clear
later, when quadratic variance functions are also considered. For example,
when X1 is a dummy variable with values 0 and 1, the residual variance is
σ0² for the units with X1 = 0 and σ0² + 2σ01 for the units with X1 = 1. When
the level-one variance depends on more than one variable, their effects can be
added to the variance function (8.1) by adding terms 2σ02 x2ij, etc.
Example 8.1 Residual variance depending on gender.
In the example used in Chapters 4 and 5, the residual variance might depend on the pupil’s gender. To
investigate this in a model that is not overly complicated, we take the model of Table 5.4 and add the
effect of gender (a dummy variable which is 0 for boys and 1 for girls). Table 8.1 presents estimates
for two models: one with constant residual variances, and one with residual variances depending on
gender.
Thus, Model 1 is a homoscedastic model and Model 2 a gender-dependent heteroscedastic model.
The girls do much better on the language test: for the fixed effect of gender, Model 1 has t =
2.407/0.201 = 12.0, p < 0.0001. According to formula (8.1), the residual variance in Model 2 is 37.85
for boys and 37.85 − 2 × 1.89 = 34.07 for girls. The residual variance estimated in the homoscedastic
Model 1 is very close to the average of these two figures. This is natural, since about half of the
pupils are girls and half are boys. The difference between the two variances is significant: the
deviance test yields χ2 = 24,486.8 − 24,482.2 = 4.6, df = 1, p < 0.05. However, the difference is not
large. The conclusion is that, controlling for IQ, SES, the school means of these variables, and their
interactions, girls score on average 2.4 higher than boys, and the results for girls are slightly less
variable than for boys.
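The variance function (8.1) for this example can be evaluated directly, using the estimates σ0² = 37.85 and σ01 = −1.89 implied by the figures quoted from Table 8.1.

```python
# Level-one variance function (8.1): var(R_ij) = sigma0^2 + 2*sigma01*x1,
# with x1 the gender dummy (0 = boys, 1 = girls); estimates as quoted
# from Table 8.1 in the text.
sigma0_sq = 37.85
sigma01 = -1.89

def level1_variance(x1):
    return sigma0_sq + 2 * sigma01 * x1

print(round(level1_variance(0), 2), round(level1_variance(1), 2))  # 37.85 34.07
deviance_diff = 24486.8 - 24482.2    # chi-squared = 4.6, df = 1, p < 0.05
```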
and did not yield clearly better results. Thus, we use the definitions
Adding these two variables to the fixed part gives a quite significant
decrease in the deviance (24,430.2 − 24,369.0 = 61.2 with two degrees of
freedom) and strongly reduces the random slope of IQ. Comparing the
deviance with the model in which the random slope at level two of IQ is left
out shows, however, that this random slope still is significant. The total
fixed effect of IQ is now given by
Denote the variances of R0ij and R1ij by σ0² and σ1², respectively, and their
covariance by σ01. The rules for calculating with variances and covariances
imply that

var(R0ij + R1ij x1ij) = σ0² + 2σ01 x1ij + σ1² x1ij².
The difference in deviance between Models 1 and 2 (χ2 = 97.3, df = 2, p < 0.0001) indicates that the
dependence of residual variance on SES is quite significant. The variance function decreases
curvilinearly from the value 2.91 for SES = 1 to a minimum value of 1.75 for SES = 6. This implies
that when educational attainment is predicted by the variables in this model, the uncertainty in the
prediction is highest for low-SES pupils. It is reassuring that the estimates and standard errors of
other effects are not appreciably different between Models 1 and 2.
The specification of this random part was checked in the following way. First, the models with
only a linear or only a quadratic variance term were estimated separately. This showed that both
variance terms are significant. Further, it might be possible that the dependence of the residual
variance on SES is a random slope in disguise. Therefore a model with a random slope (a random
effect at level two) for SES also was fitted. This showed that the random slope was barely significant
and not very large, and did not take away the heteroscedasticity effect.
The SES-dependent heteroscedasticity led to the consideration of nonlinear effects of SES and
interaction effects involving SES. Since the SES variable assumes values 1 to 6, five dummy
variables were used contrasting the respective SES values to the reference category SES = 3. In order
to limit the number of variables, the interactions of SES were defined as interactions with the
numerical SES variable rather than with the categorical variable represented by these dummies. For
the same reason, for the interaction of SES with the CITO tests only the average of the three CITO
tests (range from 1 to 20, mean 11.7) was considered. Product interactions of SES with gender, with
the average CITO score, and with minority status were considered. As factors in the product, SES
and CITO tests were centered approximately by using (SES − 3) and (CITO average − 12). Gender
and minority status, being 0–1 variables, did not need to be centered.
This implies that although SES is represented by dummies (i.e., as a categorical variable) in the
main effect, it is used as a numerical variable in the interaction effects. (The main effect of SES as
This decreases from 2.87 for SES = 1 to 1.71 for SES = 6. Thus, with the inclusion of interactions
and a nonlinear SES effect, residual variance has hardly decreased.
More generally, the residual variance may depend on more than one
variable; in terms of representation (8.4), several variables may have
random effects at level one. These can be level-two as well as level-one
variables. If the random part at level one is given by

R0ij + R1ij x1ij + . . . + Rpij xpij,

while the variances and covariances of the Rhij are denoted by σh2 and σhk,
then the variance function is

var(R0ij + R1ij x1ij + . . . + Rpij xpij) = σ02 + 2 Σh σ0h xhij + Σh Σk σhk xhij xkij    (h, k ≥ 1).

This complex level-one variance function can be used for any values of
the parameters σh2 and σhk, provided that the residual variance is positive.
The simplest case is to include only σ02 and the ‘covariance’ parameters σ0h,
leading to the linear variance function

σ02 + 2 Σh σ0h xhij.
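As a sketch, the quadratic variance function for a single variable x with a random effect at level one can be evaluated as follows; the parameter names and example numbers are illustrative, not the book's estimates:

```python
import numpy as np

def level1_variance(x, sigma0_sq, sigma01, sigma1_sq=0.0):
    """Level-one variance function sigma0^2 + 2*sigma01*x + sigma1^2*x^2.

    With sigma1_sq = 0 this is the linear variance function.  The
    'covariance' parameter sigma01 need not correspond to a positive
    definite matrix, as long as the variance stays positive over the
    observed range of x.
    """
    x = np.asarray(x, dtype=float)
    return sigma0_sq + 2.0 * sigma01 * x + sigma1_sq * x ** 2

# Example: variance decreasing in x over the observed range 1..6,
# as in the SES example (illustrative numbers only).
x = np.arange(1, 7)
v = level1_variance(x, sigma0_sq=3.0, sigma01=-0.12)
assert np.all(v > 0), "variance function must stay positive on the data range"
```

Checking positivity over the observed range, as in the last line, is the practical counterpart of the proviso that the residual variance must be positive.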
Correlates of diversity
It may be important to investigate the factors that are associated with
outcome variability. For example, Raudenbush and Bryk (1987) (see also
Raudenbush and Bryk, 2002, pp. 131–134) investigated the effects of
school policy and organization on mathematics achievement of pupils. They
did this by considering within-school dispersion as a dependent variable.
The preceding section offers an alternative approach which remains closer
to the hierarchical linear model. In this approach, relevant level-two
variables are considered as potentially being associated with level-one
heteroscedasticity.
Example 8.4 School composition and outcome variability.
Continuing the preceding example on educational attainment predicted from data available at the end
of primary school, it is now investigated whether composition of the school with respect to socio-
economic status is associated with diversity in later educational attainment. It turns out that socio-
economic status has an intraclass correlation of 0.25, which is quite high. Therefore the average
socio-economic status of schools could be an important factor in the within-school processes
associated with average outcomes but also with outcome diversity.
To investigate this, the school average of SES was added to Model 3 of Table 8.4 both as a fixed
effect and as a linear effect on the level-one variance. The nonsignificant SES-by-CITO interaction
was deleted from the model. The school average of SES ranges from 1.4 to 5.5, with mean 3.7 and
standard deviation 0.59. This variable is denoted by SA-SES. Its fixed effect is 0.417 (S.E. 0.109, t =
3.8). We further present only the random part of the resulting model in Table 8.5.
To test the effect of SA-SES on the level-one variance, the model was estimated also without this
effect. This yielded a deviance of 46,029.6, so the test statistic is χ2 = 46,029.6 – 45,893.0 = 136.6
with df = 1, which is very significant. The quadratic effect of SA-SES was estimated both as a main
effect and for level-one heteroscedasticity, but neither was significant.
How important is the effect of SA-SES on the level-one variance? The standard deviation of SA-
SES is 0.59, so four times the standard deviation (the difference between the few percent highest-SA-
SES and the few percent lowest-SA-SES schools) leads to a difference in the residual variance of 4 ×
0.59 × 0.265 = 0.63. For an average residual level-one variance of 2.0 (see Model 1 in Table 8.3), this
is an appreciable difference.
This ‘random effect at level one’ of SA-SES might be explained by interactions between SA-SES
and pupil-level variables. The interactions of SA-SES with gender and with minority status were
considered. Adding these to Model 4 yielded interaction effects of −0.219 (S.E. 0.096, t = −2.28) for
SA-SES by minority status and −0.225 (S.E. 0.050, t = −4.50) for SA-SES by gender. This implies
that, although a high school average for SES leads to higher educational attainment on average (the
main effect of 0.417 reported above), this effect is weaker for minority pupils and for girls. These
interactions did not, however, lead to a noticeably lower effect of SA-SES on the residual level-one
variability.
8.3 Glommary
Heteroscedasticity. Residual variance depending on explanatory variables.
The hierarchical linear model can be extended so that residual
variances depend, linearly or quadratically, on explanatory variables.
This can be a nuisance, but it can also be of substantive interest.
1
Spline functions (introduced more extensively in Section 15.2.2 and treated more fully, for
example, in Fox, 2008, Chapter 17) are a more flexible class of functions than polynomials. They are
polynomials of which the coefficients may be different on several intervals.
9
Missing Data
The first situation, MCAR, is the pleasant situation in which we can almost
forget about the incompleteness of the data, and throwing away the
incomplete cases would entail only a loss of statistical efficiency (and
therefore, loss of power) but no bias. The second, MAR, is the potentially
difficult situation where naive approaches such as complete case analysis
can be biased and hence misleading, but where it is possible to make
judicious use of the observations and avoid bias. For the MCAR as well as
the MAR case it may be necessary to use special statistical methods to
avoid efficiency loss. By the way, the terminology is potentially misleading
because many people might interpret the term ‘missingness at random’ as
meaning what was defined above as ‘missingness completely at random’,
but these are the generally employed terms so we have to use them.
The third situation, MNAR, leads to serious difficulties for the
researcher because the missingness pattern contains information about the
unobserved values, but in practice we do not know which information. We
wish to say something about the world but we have observed too little of it.
To draw conclusions from the data it is necessary in this case to make
additional assumptions about how missingness is related to the values of the
data points that would have been observed if they had not been missing.
While the researcher must attempt to make plausible assumptions of this
kind, they will not be completely testable. In other words, to analyze data in
the MNAR case we need to make assumptions going beyond the available
data and beyond the model that we would use for analyzing complete data.
To still say something in this case about the uncertainty of the conclusions
drawn from the data, it will be helpful to carry out sensitivity analyses
studying the sensitivity of the conclusions to the assumptions made. In the
MNAR situation, any data analysis will leave bigger questions than in the
MAR situation.
9.4 Imputation
Imputation means filling in something (hopefully reasonable) for the
missing data points, which then leads to a completed data set so that a
regular complete-data analysis is possible. The general literature on missing
data (e.g., Little and Rubin, 2002; Schafer and Graham, 2002) provides
arguments and examples explaining why simple methods such as
imputation by an overall mean, and also slightly more complicated methods
such as imputation by a groupwise mean, will yield biased results and
underestimation of standard errors. Rubin (1987) had the insight that
multiple stochastic imputation is an approach which leads to good
(approximately unbiased) results under MAR. This approach now is one of
the major methods employed in the analysis of incomplete data, and an
extensive discussion can be found, for example, in Graham (2009). Multiple
stochastic imputation works in three steps:

1. Construct M completed data sets by drawing values for the missing data points from their distribution given the observed data, independently for the M sets.
2. Analyze each completed data set as if it were a regularly observed complete data set.
3. Combine the M sets of parameter estimates and standard errors into a single set of results.

Note that this procedure does not require any type of knowledge of the
model for the missingness variables beyond the validity of the MAR
assumption. The virtue of this approach is that step 2 is normally
straightforward, consisting, in our case, of the estimation of the hierarchical
linear model or one of its variants. The combination formulas of step 3 are
simple, as we shall see below. The catch is in step 1. This requires
knowledge of the distribution of the missing data given the observed data,
but this will depend on the parameters that we are trying to estimate in the
whole process3 – the dog seems to bite its own tail. The approach is still
feasible because it is possible to let the dog bite its own tail ad infinitum,
and cycle through the steps repeatedly until stability of the results is
achieved. Especially in the Bayesian paradigm this can be formulated in a
very elegant way as a kind of ‘data augmentation’ (Tanner and Wong,
1987), as we shall see below.
The model used for the imputations in step 1 is the assumed model for
the joint distribution of all observations. Let us assume that the model in
which we are interested, often referred to appropriately as the model of
interest, is a model for a dependent variable Y (e.g., a hierarchical linear
model) with an array X of explanatory variables. If the combination of X
and Y is the totality of available data, then the model of step 2 is the
conditional distribution of Y given the rest of the data according to the joint
distribution used in step 1. Then the imputation model and the model of
interest are in correspondence. This is, however, not necessary.
Often the research question is such that some observed variables Xextra
are excluded from the model that is currently being estimated. This may be
because of causality arguments – for example, because the variable Xextra
was measured at a later or earlier moment than the other variables, or
because Xextra occupies the theoretical place of a mediating variable which
is excluded from the currently estimated model. Or a large data set may
have been collected, and the dependence of Y on Xextra is thought to be
weak or theoretically irrelevant, and is therefore not investigated. Also the
considerations of Section 9.1.1 can lead to observing variables that are
relevant for the missingness mechanism but not for the model of interest.
Then missingness might be at random (MAR) in the joint distribution of all
the variables (X, Xextra, Y), whereas missingness would not be at random
(MNAR) if only the observations of (X, Y) were considered. In all these
cases the imputation of missing values should depend not only on the
variables in the model of interest, but also on the variables in Xextra. Note
that this makes the imputation models potentially more complex than the
model of interest. To summarize, step 1 will impute by drawing from the
conditional distribution of the missing variables given the observations in
all of (X, Xextra, Y), while step 2 analyzes the dependence of Y only on X.
Of the three steps, we shall now elaborate the first and third. The second
step consists of the estimation of parameters of the hierarchical linear model
for all randomly imputed data sets, and how to do this should be clear to
anyone who has read the relevant chapters of this book. Note that each of
the imputed data sets itself is a complete data set so that it can be analyzed
by the multilevel methods for complete data.
Several computer packages for multilevel analysis have recently added
options for multiple stochastic imputation, or are in the process of doing so.
For example, MLwiN has a macro for relatively simple imputations, and an
associated program REALCOM-impute for imputation of missing data in
more complex patterns; and the mice package in R can impute missing
values in two-level models. See Chapter 18 for more information about
these packages.
Plausible values
Multiple data sets constructed from multiple stochastic imputation are
sometimes called plausible values. These are used particularly when the
missing values are missing by design. This means that the data collection
was such that some variables were missing on purpose for some parts of the
sample, or for some waves in the case of panel surveys. For example, if
there are four related questions that are too much of a burden on the
respondent, the sample could be divided randomly into six groups, each of
which receives a different pair of these four questions. The missingness for
such variables is MCAR, which facilitates the construction of the
imputation model.
Of course steps 3 and 4 need to be carried out only for the variables that
have missing values.
This procedure defines a stochastic iterative process. The general
principle was proposed by Tanner and Wong (1987) who called it data
augmentation, because the data set is augmented by the imputed missing
data. Cycling through the conditional distributions in this way is called
Gibbs sampling (e.g., Gelman et al., 2004), which again is a particular kind
of Markov chain Monte Carlo (MCMC) procedure (Section 12.1).
Mathematical results ensure that the joint probability distribution of (X1, . . .,
Xp, Y, θ) converges to the joint distribution of the complete data and the
parameters given the observed data; recall that this procedure is defined in a
Bayesian framework in which the parameters are also random variables.
This has the advantage that the uncertainty about the parameter values will
be reflected in corresponding greater dispersion in the imputed values than
if one followed a frequentist procedure and fixed the parameter at an
estimated value.
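The data augmentation idea can be sketched in a toy setting: a single normal sample with known variance 1, a flat prior on the mean, and missingness completely at random. All numbers here are invented for illustration; the I-step and P-step correspond to imputing given the current parameter draw and redrawing the parameter given the completed data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data augmentation (Tanner and Wong, 1987) for a univariate normal
# sample with known variance 1 and a flat prior on the mean mu.
y = np.array([0.8, 1.2, np.nan, 0.5, np.nan, 1.1, 0.9, np.nan])
obs = ~np.isnan(y)
n = y.size

mu = np.nanmean(y)            # starting value from the observed data
mu_draws = []
for _ in range(2000):
    # I-step: impute the missing values given the current parameter draw.
    y_imp = y.copy()
    y_imp[~obs] = rng.normal(mu, 1.0, size=(~obs).sum())
    # P-step: draw the parameter given the completed data
    # (posterior of mu is normal with mean ybar and variance 1/n).
    mu = rng.normal(y_imp.mean(), np.sqrt(1.0 / n))
    mu_draws.append(mu)

# After a burn-in, the draws of (mu, imputed values) come from the joint
# posterior given the observed data; because mu itself is redrawn each
# cycle, parameter uncertainty is propagated into the imputations.
posterior_mean = np.mean(mu_draws[500:])
```

Fixing mu at a point estimate instead of redrawing it would be the frequentist shortcut mentioned in the text, and would understate the dispersion of the imputed values.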
The cycle consisting of steps 2–4 will be repeated a large number of
times. First a ‘burn-in’ of a number of cycles is needed to give confidence
that the process has converged and the distribution of generated values is
more or less stable. After this point, the generated imputations can be used.
Successively generated imputations will be highly correlated. Therefore a
sampling frequency K has to be chosen, and one in every K generated
imputations will be retained for further use. K will have to be large enough
that the imputed values separated by K cycles have a negligible correlation.
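The burn-in and retention rule can be sketched as follows; the burn-in length and sampling frequency K below are arbitrary illustrative choices:

```python
def thin_chain(chain, burn_in, k):
    """Discard the first `burn_in` draws, then keep every k-th draw.

    `chain` is the full sequence of generated imputations (or parameter
    draws); k should be large enough that retained draws have a
    negligible correlation.
    """
    return chain[burn_in::k]

# 1,000 cycles, 200 discarded as burn-in, one in every 40 retained:
draws = list(range(1000))          # stand-in for the actual imputations
retained = thin_chain(draws, burn_in=200, k=40)
assert len(retained) == 20
```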
It may be difficult to assess when convergence has taken place, however
– in step 5 this was hidden behind the word ‘deemed’. The literature on
MCMC procedures contains material about convergence checks, and an
important sign is that the consecutive values generated for the components
of the parameter have achieved some kind of stability, as can be seen from
plotting them. More detailed diagnostics for convergence may be used in
more complicated models. Abayomi et al. (2008) propose to study the
convergence of the imputation procedure by considering the marginal
distributions and bivariate scatterplots of the imputed data sets, looking in
particular at the deviations between the observed and the imputed data
points. Evidently, it is quite acceptable that there are differences; when
missingness depends on observed variables, then differences between
distributions of observed and imputed data are indeed to be expected. But
one should check that the imputed data seem plausible and are in line with
presumed mechanisms of missingness.
The procedure of steps 1–5 was proposed and developed by Yucel
(2008) for normally distributed variables in two-level and three-level
models; and by Goldstein et al. (2009) for quite general two-level models,
including models for binary and categorical variables, where level-one as
well as level-two variables may have missing values. Such procedures are
implemented in REALCOM-impute (Goldstein, 2011) and the R package
mlmmm.
Example 9.1 Simulated example: missingness dependent on a third variable.
As a simple artificial example, consider the model

Yij = γ00 + γ10 X1ij + U0j + Rij,     (9.1)

where X1ij and Rij are independent standard normal variables, and the random intercept U0j has
variance τ02 = 0.16. Let the group sizes be constant at n = 20 and let there be N = 200 groups; the
number of groups is sufficiently large that one simulated data set should give a good indication of
how parameters are being recovered. Suppose that there is a third variable X2ij given by
where the Eij likewise are independent standard normal variables; and let missingness be determined
according to the rule that Yij is observed only for X2ij ≤ 2. We simulated such a data set, and 834 out
of the 4,000 values of Yij turned out to be missing. Suppose the model of interest is
Since the data are generated according to (9.1), the true parameter values are γ00 = 0.5, γ10 = 0.2, γ01
= 0, τ02 = 0.16, σ2 = 1.
The variables for imputation are Y, X1, X2. Thus, in terms of the discussion on p. 136, X = X1
while Xextra = X2. The missingness mechanism for the three variables Y, X1, X2 jointly here is MAR.
Since missingness of Y depends on X2, the model for only Y and X1 is MNAR.
Model 1 of Table 9.1 gives the result of an analysis of the model of interest after dropping all
level-one units with a missing value. The estimated parameter values are far off; for example, γ10 is
estimated as 0.028 with a standard error of 0.018, whereas the true value is 0.2. This illustrates the
pernicious effect of listwise deletion of missing data on parameter estimates.
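A simulation of this kind can be sketched as follows. The generating equation for X2 is an assumption here (chosen only so that missingness of Y depends, through X2, on Y itself), and the OLS fit ignores the multilevel structure; this is enough to display the attenuation of the complete-case slope:

```python
import numpy as np

rng = np.random.default_rng(42)

# Generate data as in the example: 200 groups of size 20,
# Y_ij = 0.5 + 0.2*X1_ij + U0_j + R_ij, with var(U0) = 0.16.
N, n = 200, 20
x1 = rng.standard_normal((N, n))
u0 = rng.normal(0.0, 0.4, size=(N, 1))
y = 0.5 + 0.2 * x1 + u0 + rng.standard_normal((N, n))

# A third variable correlated with Y drives the missingness; this
# generating equation is assumed for illustration.
x2 = y + 0.5 * rng.standard_normal((N, n))
observed = x2 <= 1.5          # Y is observed only for X2 below the threshold

def ols_slope(x, y):
    """Slope of Y on X by ordinary least squares (ignores clustering)."""
    X = np.column_stack([np.ones(x.size), x.ravel()])
    beta, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
    return beta[1]

slope_full = ols_slope(x1, y)                       # all 4,000 cases
slope_cc = ols_slope(x1[observed], y[observed])     # complete cases only
print(f"missing: {np.mean(~observed):.0%}, "
      f"full-data slope: {slope_full:.3f}, complete-case slope: {slope_cc:.3f}")
```

Because the cases are selected on a variable correlated with Y, the complete-case slope is biased toward zero, whereas the full-data slope recovers the true value 0.2 up to sampling error.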
Table 9.1: Estimates for random intercept model with listwise deletion of
missing values (Model 1) and with randomly imputed values according to a
multivariate normal distribution (Model 2).
Model 2, on the other hand, was obtained after imputing the missing
values based on an assumed model that the random vectors (Yij, X1ij, X2ij)
are independent for different values of (i, j), and have a three-variate normal
distribution. This model is correct for the marginal distribution of (Yij, X1ij,
X2ij) but ignores the multilevel structure. It is the model suggested as step 1,
the initialization, of the procedure above. This model recovers the fixed
parameters quite well. The recovery of the variance parameters is less
good, which may be expected since the
dependence structure is not well represented by the imputation model. The
results are not trustworthy, in addition, because the estimation is done as if
the imputed data were observed in a bona fide way, whereas they were
made up. This has the consequence that the standard errors will be too low,
suggesting undue precision. This issue will be dealt with in the next
subsection, where the example is continued.
Model 1 is based on considerably fewer cases than Model 2. The
deviance in Model 2 is based on an inflated data set, and therefore is totally
incomparable with that of Model 1.
Then the combined estimate is the mean parameter estimate (9.3) over all
imputed data sets,

θ̂ = (1/M) Σm θ̂(m),

and its standard error is

S.E.(θ̂) = √(W + (1 + 1/M) B),     (9.6)

where W = (1/M) Σm S.E.(m)2 is the average within-imputation variance and
B = (1/(M − 1)) Σm (θ̂(m) − θ̂)2 is the between-imputation variance.
If the different imputations lead to almost the same estimates θ̂(m), then the
between-imputation variance B will be very small, and the standard error
(9.6) is practically the same as the individual standard errors S.E.(m).
However, often the situation is different, and the overall standard error (9.6)
will be larger.
To test the null hypothesis that the parameter θ is equal to some value
θ0, the usual t ratio t = (θ̂ − θ0) / S.E.(θ̂) can be used. The fraction of
missing information for θ can be estimated by

(1 + 1/M) B / (W + (1 + 1/M) B).     (9.8)

This expresses how much of the information about the value of a given
parameter, potentially available in the complete data set, has been lost
because of the missingness. This fraction will depend on the parameter
under consideration, as we shall see in the example below. The estimated
value may be rather unstable across independent repetitions of the
imputation process, but will indicate the rough order of magnitude.
Wald tests as well as deviance (likelihood-ratio) tests can be combined
by methods discussed by Little and Rubin (2002). Suppose that a null
hypothesis is being tested by a Wald test or deviance test and the multiple
imputations yield, for the completed data sets obtained, test statistics C1,
C2,. . ., CM which in the complete data case have under the null hypothesis
an asymptotic chi-squared distribution with q degrees of freedom. Test
statistic Cm will be either the Wald test, or the difference between the
deviances for the estimated models corresponding to the null and alternative
hypotheses, each for the same data set D(m). Define C̄ as the average test
statistic,

C̄ = (1/M) Σm Cm,

and define V as a measure of the variability of the statistics,

V = (1 + 1/M) × (1/(M − 1)) Σm (√Cm − (1/M) Σm √Cm)2.

Then the combined test statistic is

C̃ = (C̄/q − ((M + 1)/(M − 1)) V) / (1 + V),

which under the null hypothesis is referred to an F distribution with q
degrees of freedom in the numerator. These rather intimidating equations
work out in such a way that when the deviance statistics Cm are practically
the same, so that V is very small, then q × C̃ is very close to each of these
deviances and the result of the combined tests is practically the same as
the result based on the individual deviance
tests. For deviance tests, Little and Rubin (2002) also present another
combined test, but this requires more information than the individual test
statistics.
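A sketch of a combination rule of this kind, following the statistic of Li, Raghunathan, Meng, and Rubin (1991) as described by Schafer (1997); the notation V mirrors the discussion above, though the exact form printed in the book may differ in detail:

```python
import math

def pool_chi2(stats, q):
    """Combine chi-squared test statistics C_1..C_M, each with df = q.

    The variability measure V is computed from the square roots of the
    statistics; the result is referred to an F distribution with q
    numerator degrees of freedom.
    """
    m = len(stats)
    c_bar = sum(stats) / m
    roots = [math.sqrt(c) for c in stats]
    r_bar = sum(roots) / m
    v = (1 + 1 / m) * sum((r - r_bar) ** 2 for r in roots) / (m - 1)
    return (c_bar / q - (m + 1) / (m - 1) * v) / (1 + v)

# When all C_m are equal, V = 0 and q times the combined statistic
# equals the common deviance statistic:
d = pool_chi2([12.0, 12.0, 12.0], q=3)
assert abs(3 * d - 12.0) < 1e-12
```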
The original work in which Rubin (e.g., 1987) developed the multiple
imputation method suggested that a very low number of imputations such as
M = 5 is sufficient. However, this may lead to instability of the estimated
between-imputation variance B. Further, a larger fraction of missing data
will require a larger number of imputations. Graham et al. (2007) developed
recommendations for the number of imputations; for example, their advice
is to use at least 20 imputations if 10–30% of the information is missing,
and at least 40 imputations if half of the information is missing. These
proportions of missing information refer to the quantity estimated by (9.8).
If multiparameter tests are required, for example, for testing the effect
of a categorical variable with three or more categories, then the required
number of imputations is also larger. Carpenter and Kenward (2008)
recommend multiplying the number of imputations by q(q + 1)/2, where q
is the dimension of the multiparameter test.
Example 9.2 Simulated example continued: multiple imputations.
We continue the preceding example. Recall that the model of interest is
(see (9.2)), there are missing values only in Y, and the missingness mechanism is known to be MAR,
given the additional variable X2.
The first set of results in Table 9.2 repeats Model 2 of Table 9.1, based on a single random
imputation under a multivariate normal model for (Y, X1, X2) without a multilevel structure, and
treating this as an observed data set. The second set presents the results
of the multiple imputations, using the imputation model
which is in accordance with the conditional distribution of the missing values given the observed
data, although some of the coefficients are 0. We used M = 50 imputed data sets.
Table 9.2: Estimates for random intercept model with a single imputation
and with M = 50 multiple imputations.
Comparing the two sets of results shows that in this case the parameter
estimates from the single imputation were quite close already, and the more
realistic standard errors for the multiple imputations all are somewhat larger
than those for the single imputation, in accordance with what is expected
from (9.6), but also for the standard errors the differences are not large. For
the level-two variance τ02 a clear improvement is obtained, the true value
being 0.16. The imputations using an imputation model with a multilevel
structure led to good recovery of the intraclass correlation, which was
underestimated by imputations using a multivariate normal model without
multilevel structure.
The fractions of missing information range from 0.05 for one of the
regression coefficients to 0.32 for the level-one variance. The proportion
of cases with a missing value was 834/4,000 = 0.21. The fractions of missing
information for the parameters pertaining to the level-two model are smaller
than this value, which means that the multiple imputation succeeds in
recovering a greater part of the information for these parameters; this is
possible because there the other level-one units in the same level-two unit
also contribute to the information. That the fractions of missing information
for the level-one parameters are larger than 0.21 might be a random
deviation.
1. An analysis using only the complete data and ignoring the existence
of missing data can be hopelessly biased.
2. If missingness is at random (MAR) and a correct imputation model is
used, then multiple stochastic imputation can yield approximately
unbiased results.
3. It is not necessary to know the missingness mechanism; it is sufficient
to know that MAR holds for the given set of variables, and to have an
imputation model that correctly represents the dependencies between
these variables.
4. If there are missing values only in the dependent variable, but other
variables are available that are predictive of the dependent variable
and not included in the model of interest, cases with a missing
dependent variable should not be dropped and multiple imputation
can lead to more credible results.
5. Imputation models that do not represent the dependencies in the data
well may yield results that in a similar way underestimate these
dependencies.
For regression coefficients especially this will be hardly any different from
the Bayesian procedure (step 2(a)). An even simpler method mentioned by
van Buuren (2007) is the plug-in method:
a″. Estimate the parameter from the completed data set, and use this
estimate itself as the parameter value for the imputation step.
If sample sizes are reasonably large, this will usually yield quite similar
results to the full procedure.
The procedure for imputing as in step 2(b) depends on the type of
variable that is being considered. Let us assume that we are dealing with a
two-level data structure where the group sizes are rather homogeneous.
(Strongly heterogeneous group sizes may lead to variance heterogeneity
between the groups, which leads to further complications.) First consider
the case where Yk is a level-two variable for which a linear regression
model is reasonable. The parameter value comprises the regression
coefficients and the residual variance, to be denoted σk2. Then for each
level-two unit j, using the regression coefficients and the explanatory
variables (which are completely known in the imputed data set), calculate
the predicted value Ŷkj. Generate independent normally distributed random
variables Rkj with expected value 0 and variance σk2. Then the imputed value
is Ŷkj + Rkj.
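The imputation step for such a level-two variable can be sketched as follows; the design matrix, coefficients, and residual variance below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def impute_level2(X_miss, beta, sigma_sq):
    """Stochastic regression imputation for a level-two variable Y_k.

    X_miss:   design matrix rows (intercept included) for the level-two
              units whose Y_k is missing; the explanatory variables are
              completely known in the completed data set
    beta:     regression coefficients from the current parameter draw
    sigma_sq: residual variance from the current parameter draw

    Returns predicted values plus fresh normal residuals, so that the
    imputations carry the residual variability rather than lying
    exactly on the regression plane.
    """
    y_hat = X_miss @ beta
    r = rng.normal(0.0, np.sqrt(sigma_sq), size=X_miss.shape[0])
    return y_hat + r

# Hypothetical numbers: two units with Y_k missing, intercept + one covariate.
X_miss = np.array([[1.0, 2.0],
                   [1.0, 4.0]])
imputed = impute_level2(X_miss, beta=np.array([0.5, 0.3]), sigma_sq=0.04)
```

Adding the residual draw Rkj is what distinguishes stochastic imputation from deterministic regression imputation, which would understate the variability of the imputed variable.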
Now consider the case where Yk is a level-one variable for which a
random intercept model is used; denote this by
As auxiliary variables Xextra in the imputation model we used three variables that are closely related
to the variables in the model of interest: an earlier language test, a measure of ‘performal IQ’, and a
dummy variable indicating belonging to an ethnic minority. In these variables we have the following
numbers of missing values:
For the three variables with most missing cases, there is very little overlap in the individuals for
whom there are missing values, which leads to good possibilities for imputation.
The imputation was done by the method of chained equations, where for each variable with
missing values the imputation model was defined as a random intercept model using the other five
variables, as well as their group means, as predictors. From a preliminary estimation it appeared that
several interactions seemed to have significant effects: the product interactions between IQ and SES,
between the group means of IQ and of SES, and between IQ and the classroom proportion (i.e.,
group mean) of minority students. These interactions were added to the imputation models for the
variables that were not involved in these interactions.
Table 9.3 presents two sets of results for the language test, as a function of SES and IQ, in a
random intercept model. The left-hand column gives the parameter estimates of the naive analysis,
where all students with a missing value on at least one of these variables have been dropped. The
right-hand column gives the estimates from the analysis based on imputation by chained equations,
with 50 imputations. The results for the two analyses are very similar, which suggests that in this case
the missingness is not related to the ways in which intelligence, socio-economic status, and schools
influence this language test. This permitted us to use casewise deletion in the examples in Chapters
4-5.
Table 9.3: Estimates for random intercept model with complete cases only,
and using multiple imputations.
9.7 Glommary
Missingness indicators. Binary random variables indicating that a data
point is missing or observed.
Complete data. The hypothetical data structure that is regarded as the data
set that would be observed if there were no missing data points.
Missing values on the dependent variable. If there are missing data points
only for the dependent variable, while there are auxiliary variables that
are predictive of missingness or of the unobserved values, and these
auxiliary variables are not included in the model of interest, then bias
may be reduced by methods of analysis that use these auxiliary
variables and retain the data points with a missing dependent variable.
Combination of results. Each imputed data set is a complete data set for
which parameter estimates can be obtained for the model of interest.
These estimates can be combined across the multiple imputed data sets
to obtain, in the MAR case, and if the imputation model is correct,
approximately unbiased parameter estimates and standard errors.
1
The phrase ‘conditionally given the observed data’ refers to the assumption that the observed values
are known, but without knowing that these variables are all that is observed; thus, the
conditioning is not on the missingness indicators.
2
Formally, for some of what is asserted below about MAR, an extra condition is needed. This is that
the parameters for the missingness process and those for the model explaining the primary dependent
variable are distinct. This condition is called separability. It is necessary to ensure, together with
MAR, that the missingness variables themselves are not informative for the primary research
question. This will be tacitly assumed below when the MAR condition is considered. MAR and
separability together define ignorability of the missingness mechanism.
3
At least it will depend on the parameters of the imputation model; not on parameters of the model
of interest that are not included in the imputation model.
4
Incidentally, it may be noted that Û0jEB and ΔU0j are uncorrelated (cf. (4.18)), so that var(U0j) =
var(Û0jEB) + S.E.j2.
10
Assumptions of the Hierarchical
Linear Model
1. Does the fixed part contain the right variables (now X1, . . ., Xr)?
2. Does the random part contain the right variables (now X1, . . ., Xp)?
3. Are the level-one residuals normally distributed?
4. Do the level-one residuals have constant variance?
5. Are the level-two random coefficients normally distributed?
6. Do the level-two random coefficients have a constant covariance matrix?
These questions are answered in this chapter in various ways. The answers
are necessarily incomplete, because it is impossible to give a completely
convincing argument that a given specification is correct. For complicated
models it may be sensible, if there are enough data, to employ cross-
validation (e.g., Mosteller and Tukey, 1977). Cross-validation means that
the data are split into two independent halves, one half being used for the
search for a satisfactory model specification and the other half for testing of
effects. This has the advantage that testing and model specification are
separated, so that tests do not lose their validity because of capitalization on
chance. For a two-level model, two independent halves are obtained by
randomly distributing the level-two units into two subsets.
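The random split into two independent halves can be sketched as follows; whole level-two units are assigned to one half or the other, so that the two halves share no groups:

```python
import numpy as np

rng = np.random.default_rng(2012)

def split_level2_units(group_ids):
    """Randomly split the level-two units into two independent halves.

    Returns two boolean masks over the level-one rows: one half of the
    groups for the model specification search, the other for testing,
    so that tests do not capitalize on chance.
    """
    groups = np.unique(group_ids)
    shuffled = rng.permutation(groups)
    half_a = set(shuffled[: len(groups) // 2])
    in_a = np.array([g in half_a for g in group_ids])
    return in_a, ~in_a

# Six pupils nested in three schools; whole schools go to one half.
school = np.array([1, 1, 2, 2, 3, 3])
mask_a, mask_b = split_level2_units(school)
assert not (set(school[mask_a]) & set(school[mask_b]))  # no school in both halves
```

Splitting level-one units instead would leave the same groups in both halves, so the halves would not be independent.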
Definition of levels
A first issue is the definition of the ‘levels’ in the model. Formulated more
technically, these are the systems of categories where the categories have
random coefficients: the random intercepts and possibly also random
slopes. Since this is at the heart of multilevel modeling it has already been
given much attention (see Chapter 2 and Section 4.3; Section 6.4 is also
relevant here). Several papers have studied the errors that may be made
when a level is erroneously left out of the model; see Tranmer and Steel
(2001), Moerbeek (2004), Berkhof and Kampen (2004), van den Noortgate
et al. (2005), and Dorman (2008). A general conclusion is that if a level is
left out of the model, that is, a random intercept is omitted for some system
of categories, then the variance associated with that level will be
redistributed mainly to the next lower and (if it exists) next higher levels.
Furthermore, erroneous standard errors may be obtained for coefficients of
variables that are defined on this level (i.e., are functions of this system of
categories), and hence tests of such variables will be unreliable. This will
also hold for variables with strong intraclass correlations for the omitted
level. Similarly, if a random slope is omitted, then the standard errors of the
corresponding cross-level interaction effects may be incorrectly estimated.
This leads to the general rule that if a researcher is interested in a fixed
effect for a variable defined at a certain level (where a level is understood as
a system of categories, such as schools or neighborhoods), then it is
advisable to include this level with a randomly varying intercept in the
multilevel model – unless there is evidence that the associated random
intercept variance is negligible. Similarly, if there is interest in a fixed effect
for a cross-level interaction X × Z where Z is defined at a certain level, then
X should have a random effect at this level – again, unless the random slope
variance is negligible. Furthermore, if there is interest in the amount of
variability (random intercept variance) associated with a given level, and
there exist next higher and next lower levels of nesting in the phenomenon
under study, then these levels should be included in the model with random
intercepts.
In the fixed part use the linear effect X, the quadratic effect (X − x0)², and the
effects of the ‘half squares’ f1(X), …, fK(X). Together these functions can
represent a wide variety of smooth functions of X, as is evident from Figure
8.1. If some of the fk(X) have nonsignificant effects they can be left out of
the model, and by trial and error the choice of the so-called nodes xk may be
improved. Such an explorative procedure was used to obtain the functions
displayed in this figure.
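The half squares are not defined in this extract; the sketch below assumes the usual quadratic-spline form, fk(X) = (X − xk)² for X above the node xk and 0 below it, with hypothetical nodes:

```python
import numpy as np

def half_square(x, node):
    """'Half square' spline term: (x - node)^2 above the node, 0 below
    (the assumed quadratic-spline form of f_k)."""
    d = np.asarray(x, dtype=float) - node
    return np.where(d > 0.0, d ** 2, 0.0)

def spline_design(x, x0, nodes):
    """Columns of the fixed part described in the text: the linear
    effect X, the quadratic effect (X - x0)^2, and the half squares
    f_1(X), ..., f_K(X) at the chosen nodes."""
    x = np.asarray(x, dtype=float)
    cols = [x, (x - x0) ** 2] + [half_square(x, xk) for xk in nodes]
    return np.column_stack(cols)

# Hypothetical predictor values and nodes; in practice the nodes x_k
# are improved by trial and error, as the text describes.
x = np.linspace(-4.0, 4.0, 9)
X_design = spline_design(x, x0=0.0, nodes=[-2.0, 2.0])
```

Dropping a column whose effect is nonsignificant, or moving a node, then simply means rebuilding the design matrix with a different `nodes` list.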
can be used to test the constancy of the level-one residual variances. Its null
distribution is chi-squared with N′ – 1 degrees of freedom, where N′ is the
number of groups included in the summation.
If the within-groups degrees of freedom dfj are less than 10 for many or
all groups, the null distribution of H is not chi-squared. Since the null
distribution depends only on the values of dfj and not on any of the
unknown parameters, it is feasible to obtain this null distribution by
straightforward computer simulation. This can be carried out as follows: generate
independent random variables Vj according to chi-squared distributions with
dfj degrees of freedom, calculate the dispersion estimate sj2 = Vj / dfj, and apply equations (10.3),
(10.4) and (10.5). The resulting value H is one random draw from the
correct null distribution. Repeating this, say, 1,000 times gives a random
sample from the null distribution with which one can compare the observed
value from the real data set.
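This simulation recipe can be sketched as follows. Since equations (10.3)–(10.5) are not reproduced in this extract, the transformation from the simulated dispersions sj2 to the standardized values dj below is an assumed stand-in (a log-variance standardization with the right qualitative behavior: dj roughly standard normal, H the sum of squared dj), not the book's exact formulas:

```python
import numpy as np

rng = np.random.default_rng(2012)

def simulate_H_null(dfs, n_sim=1000, rng=rng):
    """Simulate n_sim draws from the null distribution of H:
    generate independent V_j ~ chi-squared(df_j), form s_j^2 = V_j/df_j,
    standardize to d_j (assumed log-variance form), and sum the squares."""
    dfs = np.asarray(dfs, dtype=float)
    V = rng.chisquare(dfs, size=(n_sim, dfs.size))   # V_j ~ chi2(df_j)
    s2 = V / dfs                                     # s_j^2 = V_j / df_j
    ls = np.log(s2)
    ls_bar = (ls * dfs).sum(axis=1, keepdims=True) / dfs.sum()
    d = np.sqrt(dfs / 2.0) * (ls - ls_bar)           # standardized dispersions
    return (d ** 2).sum(axis=1)                      # one H per simulation

# Hypothetical within-group degrees of freedom for 50 groups, many below 10.
dfs = rng.integers(3, 12, size=50)
H_null = simulate_H_null(dfs)
H_obs = 60.0                                         # hypothetical observed value
p_value = (H_null >= H_obs).mean()
```

The simulated p-value is simply the proportion of simulated H values at least as large as the observed one.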
If this test yields a significant result, one can inspect the individual dj
values to investigate the pattern of heteroscedasticity. For example, it is
possible that the heteroscedasticity is due to a few unusual level-two units
for which dj has a large absolute value. For other approaches for dealing
with heteroscedasticity, see Section 10.4.2.
Example 10.1 Level-one heteroscedasticity.
The example of students’ language performance used in Chapters 4, 5, and 8 is considered again. We
investigate whether there is evidence of level-one heteroscedasticity where the explanatory variables
at level one are IQ, SES, and gender, specifying the model as Model 1 in Table 8.1. Note that the
nature of this heteroscedasticity test is such that the level-two variables included in the model do not
matter. Those groups were used for which the residual degrees of freedom are at least 10. There were
133 such groups. The sum of squared standardized residual dispersions defined in (10.5) is H =
155.5, a chi-squared value with df = 132, yielding p = 0.08. Hence this test does not give evidence of
heteroscedasticity.
Although nonsignificant, the result is close enough to significance that it is still worthwhile to try
and look into this small deviation from homoscedasticity. The values dj can be regarded as standard
normal deviates in the case of homoscedasticity. The two schools with the largest absolute values had
dj = –3.2 and –3.8, expressing low within-school variability; while the other values were all less than
2.5 in absolute value, which is of no concern. The two homogeneous schools were also those with the
highest averages for the dependent variable, the language test. They were not particularly large or
small, and had average compositions with respect to socio-economic status and IQ. Thus, they were
outliers in two associated ways: high averages and low internal residual variability. The fact that the
homoscedasticity test gave p = 0.08, not significant at the conventional level of 0.05, suggests,
however, that these outliers are not serious.
An advantage of this test is that it is based only on the specification of the within-groups
regression model. The level-two variables and the level-two random effects play no role at all, so
what is checked here is purely the level-one specification. However, the null distributions of the dj
and of H mentioned above do depend on the normality of the level-one residuals. A heavier-tailed
distribution for these residuals in itself will also lead to higher values of H, even if the residuals do
have constant variance. Therefore, if H leads to a significant result, one should investigate the
possible pattern of heteroscedasticity by inspecting the values of dj, but one should also inspect the
distribution of the OLS within-group residuals for normality.
for i = K, …, M – K. The value of K will depend on data and sample size;
for example, if the total number of residuals M is at least 1,500, one
could take moving averages of 2K = 100 values.
One may add to the plot horizontal lines at r equal to plus or
minus twice the standard error of the mean of 2K values, that is,
r = ±2s/√(2K), where s² is the variance of the OLS residuals. This indicates
roughly that values outside this band may be
considered to be relatively large.
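The smoothing step might be sketched as follows (a Python sketch with simulated residuals; the window convention, a moving average of 2K neighboring values for i = K, …, M − K, follows the text, and the band is twice the standard error of a mean of 2K values):

```python
import numpy as np

def residual_band_plot_data(x, residuals, K):
    """Moving averages of OLS residuals ordered by x, plus the
    half-width of the reference band 2*s/sqrt(2K), where s^2 is
    the variance of the OLS residuals."""
    order = np.argsort(x)
    r = np.asarray(residuals, dtype=float)[order]
    M = r.size
    centers = np.arange(K, M - K)
    smooth = np.array([r[i - K:i + K].mean() for i in centers])
    band = 2.0 * r.std(ddof=1) / np.sqrt(2 * K)
    return np.asarray(x)[order][centers], smooth, band

# Hypothetical data: M = 1,500 residuals, moving averages of 2K = 100 values.
rng = np.random.default_rng(1)
x = rng.normal(size=1500)
res = rng.normal(size=1500)
xs, smooth, band = residual_band_plot_data(x, res, K=50)
```

Plotting `smooth` against `xs`, with horizontal lines at plus and minus `band`, reproduces the diagnostic plot described here.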
Make a normal probability plot of the standardized OLS residuals to
check the assumption of a normal distribution. This is done by plotting
the pairs (zi, r̂i), where r̂i is the standardized OLS residual and zi the
corresponding normal score (i.e., the expected value from the standard
normal distribution according to the rank of r̂i). Especially when this
plot shows that the residual distribution has longer tails than the normal
distribution, there is a danger of parameter estimates being unduly
influenced by outlying level-one units.
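The plot coordinates can be computed as in the sketch below; the exact plotting position (rank − 3/8)/(n + 1/4) used here for the normal scores is a common approximation, assumed rather than taken from the text:

```python
import numpy as np
from statistics import NormalDist

def normal_plot_coordinates(residuals):
    """Pair each sorted standardized OLS residual with its normal
    score z_i, approximated by the plotting position
    (rank - 3/8) / (n + 1/4)."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    probs = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
    z = np.array([NormalDist().inv_cdf(p) for p in probs])
    return z, r   # plot r against z; a straight line indicates normality

# Hypothetical standardized residuals.
rng = np.random.default_rng(3)
z, r = normal_plot_coordinates(rng.normal(size=200))
```

Points bending away from the diagonal in the tails are the signature of the heavier-tailed residual distributions warned about above.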
The squared standardized residuals can also be plotted against
explanatory variables to assess the assumption of level-one
homoscedasticity. Here also, averaging within categories, or smoothing,
can be very helpful in showing the patterns that may exist.
When a data exploration is carried out along these lines, this may
suggest model improvements which can then be tested. One should realize
that these tests are suggested by the data; if the tests use the same data that
were used to suggest them, which is usual, this will lead to capitalization on
chance and inflated probabilities of type I errors. The resulting
improvements are convincing only if they are ‘very significant’ (e.g., p <
0.01 or p < 0.001). If the data set is large enough, it is preferable to employ
cross-validation (see p. 126).
Example 10.2 Level-one residual inspection.
We continue Example 10.1, in which the data set also used in Chapters 4 and 5 is considered again,
the explanatory variables at level one being IQ, SES, and gender, with an interaction between IQ and
SES. Within-group OLS residuals were calculated for all groups with at least 10 within-group
residual degrees of freedom.
The variables IQ and SES are both based on integer scales from which the mean was subtracted,
so they have a limited number of categories. For each category, the mean residual was calculated and
the standard error of this mean was calculated in the usual way. The mean residuals are plotted in
Figure 10.1 for IQ categories containing 12 or more students; for SES, categories were formed as
pairs of adjacent values, because otherwise they would be too small.
The vertical lines indicate the intervals bounded by the mean residual plus or minus twice the
standard error of the mean. For IQ the mean residuals exhibit a clear pattern, while for SES they do
not. The left-hand figure suggests a nonlinear function of IQ having a local minimum for IQ between
–2 and 0 and a local maximum for IQ somewhere about 2. There are few students with IQ values less
than –4 or greater than +4, so in this IQ range the error bars are very wide and not very informative.
Figure 10.1: Mean level-one OLS residuals (with bars extending to twice
the standard error of the mean) as function of IQ (left) and SES (right).
Figure 10.2: Mean level-one OLS residuals (with bars extending to twice
the standard error of the mean) as function of IQ (left) and SES (right), for
model with nonlinear effect of IQ.
Thus the figures point toward a nonlinear effect for IQ, in which the deviation from linearity has a
local minimum for a negative IQ value and a local maximum for a positive IQ value. Examples of
such functions are third-degree polynomials and quadratic spline functions with two nodes (cf.
Section 15.2.2). The first option was explored by adding IQ2 and IQ3 to the fixed part. The second
option was explored by adding IQ−2 and IQ+2, as defined in (8.2), to the fixed part. The second
option gave a much better model improvement and therefore was selected. When IQ−2 and IQ+2 are
added to Model 1 of Table 8.1, which yields the same fixed part as Model 4 of Table 8.2, the
deviance goes down by 54.9 (df = 2, p < 0.00001). This is strongly significant, so this nonlinear
effect of IQ is convincing even though it was not hypothesized beforehand but suggested by the data.
The mean residuals for the resulting model are graphed as functions of IQ and SES in Figure 10.2.
These plots do not exhibit a remaining nonlinear effect for IQ.
A normal probability plot of the standardized residuals for the model that includes the nonlinear
effect of IQ is given in Figure 10.3. The distribution looks quite normal except for the very low
values, where the residuals are somewhat more strongly negative (i.e., larger in absolute value) than
expected. However, this deviation from normality is rather small.
Lesaffre and Verbeke (1998) propose local influence diagnostics which are
closely related to those proposed here.1
Whether a group has a large influence on the parameter estimates for the
whole data set depends on two things. The first is the leverage of the group,
that is, the potential of this group to influence the parameter estimates
because of its size nj and the values of the explanatory variables in this
group. Groups with a large size and with strongly dispersed values of the
explanatory variables have a high leverage. The second is the extent to
which this group fits the model as defined by (or estimated from) the other
groups, which is closely related to the values of the residuals in this group.
A poorly fitting group that has low leverage, for example, because of its
small size, will not strongly affect the parameter estimates. Likewise, a
group with high leverage will not strongly affect the parameter estimates if
it has deletion residuals very close to 0.
If the model fits well and the explanatory variables are approximately
randomly distributed across the groups, then the expected value of
diagnostics (10.6), (10.7), and (10.8) is roughly proportional to the group
size nj. A plot of these diagnostics as a function of nj will reveal whether
some of the groups influence the fixed parameter estimates more strongly
than should be expected on the basis of group size.
The fit of the level-two units to the model can be measured by the
deletion standardized multivariate residual for each unit. This measure was
also proposed by Snijders and Berkhof (2008) as a variant of the ‘squared
length of the residual’ of Lesaffre and Verbeke (1998), modified by
following the principle of deletion residuals. It is defined as follows. The
predicted value for observation Yij on the basis of the fixed part, minimizing
the influence of the data in unit j on this prediction, is given by
Table 10.2: The 20 largest influence diagnostics for the extended model.
10.9 Glommary
Specification of the hierarchical linear model. This is embodied in the
choice of the dependent variable, the choice of levels, of variables with
fixed effects, and of variables with random effects with possible
heteroscedasticity.
The logic of the hierarchical linear model. Following this can be useful in
specifying the model: specify the levels correctly; consider whether
variables with fixed effects may have different within-group and
between-group effects (a choice with even more possibilities if there
are three or more levels); ask similar questions for interaction effects;
consider whether variables should have random slopes, and whether
there might be heteroscedasticity. A considerable decrease of the
explained variance (Section 7.1) when adding a fixed effect to the
model is also a sign of potential misspecification of the model.
Goodness of fit. This is how well the data, or a part of the data,
corresponds to the model. The residuals are an important expression of
this: residuals close to 0 point to a good fit.
Outliers. Data points that do not correspond well to the model, as will be
indicated by large residuals.
Robust estimators. Estimators that retain good performance for data that
do not satisfy the default model assumptions very well. Different types
of robust estimators have been developed, for example, incorporating
robustness against outliers or against misspecification of the
distribution of the random effects.
1
Lesaffre and Verbeke’s diagnostics are very close to diagnostics (10.6), (10.7), and (10.8),
multiplied by twice the number of parameters, but using different approximations. We have opted for
the diagnostics proposed here because their values are on the same scale as the well-known Cook’s
distance for the linear regression model, and because the one-step estimates can be conveniently
calculated when one uses software based on the Fisher scoring or (R)IGLS algorithms.
11
Designing Multilevel Studies
Up to now it has been assumed that the researcher wishes to test interesting
theories on hierarchically structured systems (or phenomena that can be
thought of as having a hierarchical structure, such as repeated data) on
available data. Or that multilevel data exist, and that one wishes to explore
the structure of the data. In practice, of course, things run the other way
around: normally a theory (or a practical problem that has to be investigated)
will direct the design of the study and the data to be collected. This chapter focuses on one
aspect of this research design, namely, the sample sizes. Sample size
questions in multilevel studies have also been treated by Snijders and
Bosker (1993), Mok (1995), Cohen (1998), Donner and Klar (2000),
Rotondi and Donner (2009), and others. Snijders (2001, 2005) and Cohen
(2005) provide primers for this topic.
Another aspect, the allocation of treatments to subjects within groups, is
discussed in Raudenbush (1997) and by Moerbeek and Teerenstra (2011).
Raudenbush et al. (2007) compare matching and covariance adjustment in
the case of group-randomized experiments. Hedges and Hedberg (2007)
treat the case where sample size is a function of the intraclass correlation
and covariate effects, while Rotondi and Donner (2009) provide an
approach to sample size estimation in which the distribution of the
intraclass correlation varies. Hedeker et al. (1999) as well as Raudenbush
and Liu (2001) present methods for sample size determination for
longitudinal data analyzed by multilevel methods.
This chapter shows how to choose sample sizes that will yield high
power for testing, or (equivalently) small standard errors for estimating,
certain parameters in two-level designs, given financial and practical
constraints. A problem in the practical application of these methods is that
sample sizes that are optimal, for example, for testing some cross-level
interaction effect are not necessarily optimal, for example, for estimating
the intraclass correlation. The fact that optimality depends on one’s
objectives, however, is a general problem of life that cannot be solved by
this textbook. If one wishes to design a good multilevel study it is advisable
to determine first the primary objective of the study, express this objective
in a parameter to be tested or estimated, and then choose sample sizes for
which this parameter can be estimated with a small standard error, given
financial, statistical, and other practical constraints. Sometimes it is possible
to check, in addition, whether these sample sizes also yield acceptably low
standard errors for some other parameters (corresponding to secondary
objectives).
A relevant general remark is that the sample size at the highest level is
usually the most restrictive element in the design. For example, a two-level
design with 10 groups, that is, a macro-level sample size of 10, is at least as
uncomfortable as a single-level design with a sample size of 10.
Requirements on the sample size at the highest level, for a hierarchical
linear model with q explanatory variables at this level, are at least as
stringent as requirements on the sample size in a single-level design with q
explanatory variables.
where z1–α and z1–β are the z-scores (values from the standard normal
distribution) associated with the cumulative probability values indicated. If,
for instance, α is chosen at 0.05 and 1 – β at 0.80 (so that β = 0.20), and an
effect size of 0.50 is what we expect, then we can derive that we are
searching for a minimum sample size that satisfies
where n is the average sample size in the second stage of the sample and ρI
is the intraclass correlation. Now we can use (3.17) to calculate required
sample sizes for two-stage samples.
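The calculation can be sketched as follows. The displayed formulas are not reproduced in this extract; the sketch assumes the usual power relation, effect size divided by standard error at least z1−α + z1−β, and takes (3.17) to be the standard design-effect formula 1 + (n − 1)ρI, both of which are assumptions here:

```python
from statistics import NormalDist

def required_standard_error(effect_size, alpha=0.05, power=0.80):
    """Largest standard error that still gives the desired power,
    from the assumed relation: effect / s.e. >= z_{1-alpha} + z_{1-beta}."""
    z = NormalDist().inv_cdf
    return effect_size / (z(1 - alpha) + z(power))

def design_effect(n_bar, rho_I):
    """Variance inflation of a two-stage sample with average cluster
    size n_bar and intraclass correlation rho_I (assumed form of (3.17))."""
    return 1.0 + (n_bar - 1.0) * rho_I

# alpha = 0.05, power 0.80, effect size 0.50, as in the text:
se_max = required_standard_error(0.50)   # roughly 0.20
```

With z0.95 ≈ 1.64 and z0.80 ≈ 0.84, the standard error of the effect estimate must not exceed about 0.50/2.49 ≈ 0.20; the design effect then translates this into required two-stage sample sizes.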
Example 11.1 Assessing mathematics achievement.
Consider an international assessment study on mathematics achievement in secondary schools. The
mathematics achievement variable has unit variance, and within each country the mean should be
estimated with a standard error of 0.02. If a simple random sample of size n is considered, it can
readily be deduced from the well-known formula for the standard error of a mean, σ/√n, that
n = 1/0.02² = 2,500 students per country would be required.
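The arithmetic of this example can be sketched as follows (the cluster size of 25 and intraclass correlation of 0.20 in the usage line are illustrative values, not from the text):

```python
def srs_sample_size(variance, se_target):
    """Simple random sample size from the well-known formula
    s.e. = sqrt(variance / n), solved for n."""
    return variance / se_target ** 2

def two_stage_sample_size(variance, se_target, n_bar, rho_I):
    """Inflate the simple-random-sample size by the design effect
    1 + (n_bar - 1) * rho_I for a two-stage sample with average
    cluster (school) size n_bar and intraclass correlation rho_I."""
    deff = 1.0 + (n_bar - 1.0) * rho_I
    return srs_sample_size(variance, se_target) * deff

n_srs = srs_sample_size(1.0, 0.02)                              # 2,500 students
n_two_stage = two_stage_sample_size(1.0, 0.02, n_bar=25, rho_I=0.20)
```

With these illustrative values the design effect is 5.8, so the two-stage sample would need 5.8 × 2,500 = 14,500 students per country to attain the same precision.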
11.7 Glommary
Hypothesis testing. For the general theory of statistical hypothesis testing
we refer to the general statistical literature, but in this glommary we
mention a few of the basic concepts in order to refresh the reader’s
memory.
Null hypothesis. A set of parameter values such that the researcher would
like to determine whether these are tenable in the light of the currently
analyzed data, or whether there is empirical evidence against them; the
latter conclusion is expressed as ‘rejecting the null hypothesis’.
Effect size. Numerical expression for the degree to which a parameter value
is believed to deviate from the null hypothesis. This is an ambiguous
concept, because the importance of deviations from a null value may
be expressed in so many ways. Cohen (1992) gives a list of definitions
of effect sizes.
To express differences between groups, the difference between the
means divided by the within-group standard deviation is a popular
effect size, called Cohen’s d. Rosenthal (1991) proposes to express
effect sizes generally as correlation coefficients. For multilevel
analysis this can be exemplified by noting that variances on their
original scales have arbitrary units and therefore are not useful as
effect sizes, but the intraclass correlation coefficient transforms the
within- and between-group variances to the scale of a correlation
coefficient, and thereby becomes comparable across data sets and
useful as an effect size index.
Significance level. Also called type I error rate, and usually denoted α, this
is the probability of making a type I error, given parameter values that
indeed conform to the null hypothesis. In classical statistical
hypothesis testing, the significance level is predetermined by the
researcher (e.g., α = 0.05; the popularity of this value according to the
statistician R.A. Fisher is caused by the fact that humans have 20
fingers and toes). In practice, significance levels are set at lower values
accordingly as sample sizes are higher, in view of the tradeoff
mentioned below.
Type II error. Failing to reject the null hypothesis when in fact the null
hypothesis is false, and the alternative hypothesis is true.
Type II error rate. Often denoted β, this is the probability of making a type
II error, given parameter values that indeed conform to the alternative
hypothesis. There is a tradeoff between the probabilities of making
type I and type II errors: having a higher threshold before rejecting the
null hypothesis leads to lower type I and higher type II error rates.
Cost function. The cost of the study design in terms of, for example, the
sample size. In multilevel situations one weighs the costs of sampling
macro-level units against the costs of sampling micro-level units
within macro-level units. Such a cost function can then be used to
determine optimal sample sizes at either level given a budget
constraint.
Here τ contains the parameters of the level-two model, and θ those of the
level-one model.
Browne and Draper (2000) presented Bayesian MCMC methods for
linear and nonlinear multilevel models; see also further publications by
these authors, presented in Draper (2008). In extensive simulation studies,
Browne and Draper (2006) found confirmation for the good frequentist
properties of Bayesian methods for multilevel modeling. Concise
introductions to Bayesian methods for multilevel modeling are given in
Raudenbush and Bryk (2002, Chapter 13), Gelman et al. (2004, Chapter
15), Gelman and Hill (2007, Chapter 18), and Hamaker and Klugkist
(2011). More extensive treatments are Draper (2008) and the textbook by
Congdon (2010).
For the regular (linear) hierarchical linear model, Bayesian and ML
procedures yield very similar results. Jang et al. (2007) present an extensive
comparative example. An advantage of the Bayesian approach may be the
more detailed insight into the likely values of parameters of the random part
of the model, because of the deviation from normality of their posterior
distributions; in frequentist terms, this is reflected by nonnormal
distributions of the estimators and by log-likelihoods, or profile
log-likelihoods, that are not approximately quadratic. For more complicated
models (e.g., with crossed random effects or with nonnormal distributions),
Bayesian approaches may be feasible in cases where ML estimation is
difficult, and Bayesian approaches may have better properties than the
approximations to some of the likelihood-based methods proposed as
frequentist methods. Examples of more complicated models fitted by a
Bayesian approach are given in Chapter 13 and in Browne et al. (2007).
12.2 Sandwich estimators for standard errors
For many statistical models it is feasible and sensible to use ML estimators
or their relatives (e.g., restricted maximum likelihood (REML) estimators).
Statistical theory has a standard way of assessing the precision of ML
estimators if the model is correct. This uses the so-called Fisher information
matrix, which is a measure for how strongly the probabilities of the
outcomes change when the parameters change. The key result is that the
large-sample covariance matrix of the ML estimator is the inverse of the
information matrix. Recall that standard errors are contained in the
covariance matrix of the estimator, as they are the square roots of its
diagonal elements. For the hierarchical linear model, this is discussed in de
Leeuw and Meijer (2008b, pp. 38–39).
Sometimes, however, the researcher works with a misspecified model
(i.e. one that is not a good approximation of reality), often because it is
easier to do so; sometimes also the estimation was done by a method
different from ML. In such cases, other means must be utilized to obtain
standard errors or the covariance matrix of the estimator. A quite general
method is the one affectionately called the sandwich method because the
mathematical expression for the covariance matrix features a matrix
measuring the variability of the residuals, sandwiched between two
matrices measuring the sensitivity of the parameter estimates to the
observations as far as these are being used for obtaining the estimates. The
method is explained in many places, for example, de Leeuw and Meijer
(2008b, Section 1.6.3) and Raudenbush and Bryk (2002). It was proposed
by White (1980), building on Eicker (1963) and Huber (1967), with the
purpose of obtaining standard errors for ordinary least squares (OLS)
estimates valid in the case that the linear model holds, but the assumption of
independent homoscedastic (constant-variance) residuals is not valid.
Therefore he called them robust standard errors.
One of the applications is to multilevel structures. There the sandwich
estimator yields cluster-robust standard errors (Zeger et al., 1988). The
term clusters here refers to the highest-level units in a multilevel data
structure. Thus, even for a multilevel data structure where a linear model is
postulated and one suspects within-cluster correlations, one could estimate
the parameters of the linear model by ordinary least squares, which is the
ML estimator for the case of independent homoscedastic residuals; and then
obtain the standard errors from an appropriate sandwich estimator to
‘correct’ for the multilevel structure in the data. This can also be done for
three- and higher-level structures, where the clusters are defined as the
highest-level units, because these are the independent parts of the data set.
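A minimal sketch of this recipe, with simulated clustered data: fit OLS, then compute the cluster-robust sandwich covariance (the basic White/Liang-Zeger form, with no small-sample corrections; the data and all quantities below are hypothetical):

```python
import numpy as np

def cluster_robust_ols(X, y, clusters):
    """OLS estimates with cluster-robust (sandwich) standard errors.

    'Bread' is (X'X)^{-1}; the 'meat' sums the outer products of the
    per-cluster score contributions X_g' e_g over the highest-level
    units g, which are assumed independent.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        m = clusters == g
        s = X[m].T @ e[m]            # score contribution of cluster g
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv     # sandwich covariance matrix
    return beta, np.sqrt(np.diag(V))

# Hypothetical data: 30 clusters of 20 observations, within-cluster correlation
# induced by a cluster effect that the OLS working model ignores.
rng = np.random.default_rng(4)
clusters = np.repeat(np.arange(30), 20)
u = rng.normal(size=30)[clusters]
x = rng.normal(size=600)
X = np.column_stack([np.ones(600), x])
y = 1.0 + 0.5 * x + u + rng.normal(size=600)
beta, se = cluster_robust_ols(X, y, clusters)
```

The point estimates are plain OLS; only the standard errors 'correct' for the multilevel structure, which is exactly the division of labor described in the text.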
This has been elaborated in the technique of generalized estimating
equations (GEE), proposed by Liang and Zeger (1986) and Zeger et al.
(1988); see the textbook by Diggle et al. (2002). Formulated briefly, this
method assumes a linear or generalized linear model for the expected values
of the dependent variable in a multilevel data structure, conditional on the
explanatory variables; but no assumptions are made with respect to the
variances and correlations, except that these are independent between the
clusters (the highest-level units). The parameters of the linear model are
estimated under a so-called ‘working model’, which does incorporate
assumptions about the variances and correlations; in the simplest case, the
working model assumes uncorrelated residuals with constant variance. The
standard errors then are estimated by a sandwich estimator. For large
numbers of clusters, this will be approximately correct, provided that the
linear model for the means is correct and the highest-level units are
independent. The comparison between GEE modeling and the hierarchical
linear model is discussed by Gardiner et al. (2009).
Thus, the sandwich method for obtaining standard errors is a
large-sample method for use when the model used for estimation is misspecified,
or when a different estimator than the ML or REML estimator is used, as in
the methods explained in Section 14.5 taking account of design weights in
surveys. The sandwich estimator will be less efficient than the ML or
REML estimator if the model is correctly specified, and a practical question
is for which sample sizes it may be assumed to work in practice. This will
also depend on the type of misspecification, and here we should distinguish
between misspecification of the correlation structure, such as GEE
estimation of a data set satisfying a random slope model with a working
model of independent residuals, and misspecification of the distributional
form of the residuals, for which the default assumption of normality might
be violated.
The quality of cluster-robust standard errors is better when the clusters
have equal sizes and the same distributions of explanatory variables, and
suffers when the clusters are very different. Bell and McCaffrey (2002)
proved that the cluster-robust sandwich estimator for the GEE with the OLS
working model is unbiased if all clusters have the same design matrix, and
otherwise has the tendency to underestimate variances. Corrections to the
sandwich estimator for standard errors were proposed by Mancl and
DeRouen (2001), Bell and McCaffrey (2002), and Pan and Wall (2002).
These corrections lead to clear improvements of the type I error rates of
tests based on cluster-robust standard errors, particularly for clusters with
unequal sizes or different distributions of explanatory variables. These
corrections should be implemented more widely as options, or even
defaults, in software implementations of cluster-robust standard errors. The
minimum correction one should apply in the case of a small number of
clusters is the following very simple correction mentioned by Mancl and
DeRouen (2001): multiply the estimated covariance matrix as defined by
White (1980) by N/(N – q – 1) (this is the HC1 correction of MacKinnon and
and White, 1985), and test single parameters against a t distribution with N
– q – 1 degrees of freedom (rather than a standard normal distribution), and
multidimensional parameters against an F distribution with N – q – 1
degrees of freedom in the denominator (rather than a chi-squared
distribution). Here N is the number of clusters and q the number of
explanatory variables at the highest level. In a simulation study for
two-level binary data, Mancl and DeRouen (2001) found that this gives
satisfactory results for N = 40, and in the case of equal cluster sizes and
identical distributions of explanatory variables per cluster, for considerably
smaller N. But it is preferable to use one of the more profound and more
effective corrections.
Verbeke and Lesaffre (1997) proved that for a hierarchical linear model
in which the random part of the model is correctly specified but has
nonnormal distributions, an appropriate sandwich-type estimator is
consistent, and found support for its performance for moderate and large
sample sizes. Yuan and Bentler (2002) also derived sandwich estimators for
the hierarchical linear model, together with rescaled likelihood ratio tests
for use under nonnormality of residuals. Maas and Hox (2004) reported a
simulation study designed to assess the robustness of the sandwich standard
error estimator for nonnormal distributions of the level-two random
coefficients. They compared the ML estimator and the sandwich estimator,
as implemented in MLwiN, for linear models where the covariance
structure is correctly specified but the distributions of the level-two random
effects are nonnormal. For the fixed effects they found good coverage rates
of the confidence intervals based on either standard error for a small
number (30) of clusters. For confidence intervals for the level-two variance
parameters they found that the sandwich estimator led to better coverage
rates than the ML estimator, but even for 100 clusters the coverage rates
based on the sandwich estimator were still far from satisfactory for some of
the nonnormal distributions. However, Verbeke and Lesaffre (1997)
reported better results for a differently defined sandwich estimator under
similar distributional assumptions. It can be concluded that the robustness
of the sandwich estimator for inference about variance parameters depends
on the number of clusters being sufficiently large, or on the appropriate use
of specific small-sample versions.
Next to the sandwich estimators, other estimators for standard errors
also are being studied in the hope of obtaining robustness against deviations
from the assumed model. Resampling methods (van der Leeden et al.,
2008) may be helpful here. Cameron et al. (2008) proposed a bootstrap
procedure that offers further promise for cluster-robust standard errors.
To decide whether to use an incompletely specified model and the
associated sandwich estimator, the following issues may be considered:
Table 12.1: Estimates for the hierarchical linear model compared with OLS
estimates and sandwich standard errors.
12.4 Glommary
Bayesian statistics. An approach to statistical inference based on the notion
that probability is a measure of how likely one thinks it is that a given
event will occur or that a variable has a value in a given range. In
particular, beliefs and uncertainties about parameters in a statistical
model are expressed by assuming that these parameters have a
probability distribution.
In this figure, 25 pupils are nested within five schools, but also within
four different neighborhoods. The lines indicate to which school and
neighborhood a pupil belongs. The seven pupils of school 1 live in three
different neighborhoods. The six pupils who live in neighborhood 3 attend
four different schools. Another way to present the cross-classification is by
making a cross-table, which readers can easily construct for themselves.
Following up on this example, we now continue with an explanation of
how to incorporate a crossed random factor in a multilevel model. We
consider the case of a two-level study, for example, of pupils (indicated by
i) nested within schools (indicated by j), with a crossed random factor, for
example, the neighborhood in which the pupil lives. The neighborhoods are
indexed by the letter k, running from 1 to K (the total number of
neighborhoods in the data). The neighborhood of pupil i in school j is
indicated by k(i, j). The hierarchical linear model for an outcome variable
Yij without the neighborhood effect is given by

Yij = γ00 + U0j + Rij.
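As a sketch of this crossed structure, the following simulation generates outcomes with independent school and neighborhood random effects; all variance values, sample sizes, and the function name are illustrative assumptions, not taken from the example.

```python
import random

random.seed(42)

def simulate_crossed(n_schools=5, n_neighborhoods=4, n_pupils=25,
                     sd_school=0.5, sd_nbh=0.3, sd_resid=1.0, gamma00=6.0):
    """Simulate Y = gamma00 + U_0j + W_0k(i,j) + R_ij, where school j and
    neighborhood k(i,j) are crossed (assigned independently to pupils)."""
    u = [random.gauss(0, sd_school) for _ in range(n_schools)]       # school effects
    w = [random.gauss(0, sd_nbh) for _ in range(n_neighborhoods)]    # neighborhood effects
    data = []
    for i in range(n_pupils):
        j = random.randrange(n_schools)         # school of pupil i
        k = random.randrange(n_neighborhoods)   # neighborhood of pupil i
        y = gamma00 + u[j] + w[k] + random.gauss(0, sd_resid)
        data.append((i, j, k, y))
    return data

data = simulate_crossed()
```

Because school and neighborhood are drawn independently, pupils from one school are spread over several neighborhoods and vice versa, which is exactly the cross-classified pattern described above.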
In Model 1 in Table 13.1, the average examination score is 6.36 and the total variance 0.466
(difference due to rounding). Of this variance, 14% is associated with the secondary schools. Model 2
presents the results in which the available information on the primary schools previously attended by
the pupils is also taken into account. Of course the average examination grade remains the same, but
now we see some changes in the variance components. The variance between secondary schools
marginally decreases to 0.066, and the within-school variance decreases somewhat as well, since now
the primary schools take up a variance component of 0.006. The decrease in deviance is 7025.7 –
6977.4 = 48.3, highly significant when compared to a chi-squared distribution with df = 1. We
observe here an instance where, for a variance component, the estimate divided by the standard error
is not very large, but yet the variance is significantly different from 0 (cf. Section 6.1).
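The deviance comparison reported here can be checked in a few lines; for df = 1, the upper tail probability of a chi-squared variable at x equals erfc(√(x/2)), which is available in the standard library.

```python
import math

dev_model1 = 7025.7   # deviance of Model 1 (Table 13.1)
dev_model2 = 6977.4   # deviance of Model 2
diff = dev_model1 - dev_model2   # deviance difference, 48.3

# For df = 1: P(chi-squared > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(diff / 2))
```

The resulting p-value is far below any conventional significance level, in line with the text's conclusion that the primary-school variance component is significantly different from 0.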
There are now three different intraclass correlations. With the Model 2
variance components (primary schools 0.006, secondary schools 0.066,
within schools 0.395, total 0.467), the correlation between examination grades
of pupils who attended the same primary school but went to a different secondary school is

0.006 / 0.467 = 0.013;

the correlation between grades of pupils who attended the same secondary school but came from
different primary schools is

0.066 / 0.467 = 0.141;

and the correlation between grades of pupils who attended both the same primary and the same
secondary school is

(0.006 + 0.066) / 0.467 = 0.154.
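These three intraclass correlations all have the same structure: the variance shared by the pupils in question, divided by the total variance. A small helper computes them from the Model 2 variance components.

```python
def cross_classified_iccs(var_primary, var_secondary, var_resid):
    """Intraclass correlations for a cross-classified model: the variance
    components shared by two pupils, divided by the total variance."""
    total = var_primary + var_secondary + var_resid
    return {
        "same primary only": var_primary / total,
        "same secondary only": var_secondary / total,
        "same primary and secondary": (var_primary + var_secondary) / total,
    }

# Model 2 variance components: primary 0.006, secondary 0.066, residual 0.395
iccs = cross_classified_iccs(0.006, 0.066, 0.395)
```

Pupils who share both classifications are necessarily at least as correlated as pupils who share only one, since their shared variance is the sum of the two components.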
We can elaborate the models by including predictor variables. The original data set contained
many potential candidate predictors, such as IQ, gender, and the socio-economic status of the pupil’s
family. For almost all of these variables scores were missing for some pupils, and therefore we
employed the chained equations technique described in Chapter 9 to impute values, using all the
information available.
From the potential predictors we selected an entry test (grand mean centered) composed of
arithmetic, language and information processing subscales; socio-economic status (SES, grand mean
centered); the primary school teacher’s advice about the most suitable level of secondary education
(grand mean centered); and ethnicity. Table 13.2 contains the results with a model including these
four predictor variables (Model 3) as an extension to the previously fitted Model 2 (presented again
in this table). According to the principles of multiple imputation (Section 9.4) we constructed 25 data
sets with imputed values. The results in the table, as well as those in the following tables in this
chapter, are the syntheses of 25 analyses run on these imputed data sets, using equations (9.3)–(9.6).
The four predictor variables all have highly significant effects, indicating that pupils with higher
entry test scores, with higher recommendations from their primary school teachers, and from more
affluent families have higher average examination scores. Moreover, pupils from ethnic minorities
have lower examination results than pupils from the Dutch majority group. Most important, however,
are the estimates of the variance components. Comparing Models 2 and 3, all variance components
have decreased. The between-pupils within-schools variance decreases from 0.395 to 0.330. The
between-secondary-schools variance (0.034, and significant) is almost half its original estimate
(0.066), which also turns out to be the case for the between-primary-schools variance: this decreases
from 0.006 to 0.003. This indicates that secondary schools appear to have a value-added effect on
pupil achievement measured at the final examination, but that primary schools, given the
achievement levels attained by pupils at the end of primary education and given their family
background, have only a marginally lasting effect as measured four or five years later at the
secondary school examinations.
Although for the models in the example the predictor variables only had
fixed effects, it is straightforward to include these variables with random
slopes at the level of one or both of the crossed random factors as shown in
equation (13.1).
Table 13.2: Models with and without cross-classification for examination
grades (continued).
The first six pupils all attended only school 1, but pupil 7 attended both
school 1 and school 2. The most extreme case is pupil 11, who attended
four different schools. One can think of a situation where the pupil is being
bullied and his parents try to resolve the situation by choosing another
school. Later they move house, so the pupil once again has to change
schools, and exactly the same problem of being bullied occurs, and again a
switch to another school is seen as the solution. The main difference with
the cross-classified situation, where the individuals belong to units of two
different populations of interest at level two (schools and neighborhoods, or
primary and secondary schools), is that in the multiple membership
situation there is only one such population of interest at level two, but the
individuals belong to more than one level-two unit in that population. Other
examples are teachers working in more than one school, pupils rating their
teachers of different subjects, families who changed neighborhoods during
the period under investigation, or employees changing firms.
The solution to the problem is to use some kind of weighting, where the
membership weights can be proportional to the time a level-one unit spent
at a level-two unit, with the weights summing to 1 (Hill and Goldstein,
1998). So a pupil who attended the same primary school from grade 1 to
grade 6 has a membership weight of 1 for that school, and 0 for all other
schools. A pupil attending the first three grades in her first school and the
last three grades in another school has a membership weight of ½ for the
first, ½ for the second and 0 for all other schools. The membership weights
are denoted by wih, for pupil i in school h, adding up to 1 for every pupil:

∑h wih = 1.
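A minimal sketch of this weighting scheme; the function name and the toy attendance records are made up for illustration.

```python
def membership_weights(time_in_school):
    """Membership weights proportional to the time a pupil spent in each
    school, normalized to sum to 1 (as in Hill and Goldstein, 1998)."""
    total = sum(time_in_school.values())
    return {school: t / total for school, t in time_in_school.items()}

# A pupil attending grades 1-3 in school 'A' and grades 4-6 in school 'B'
w = membership_weights({"A": 3, "B": 3})
```

A pupil who attended only one school gets weight 1 for that school, and schools never attended simply do not appear (equivalently, have weight 0).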
The multilevel model for this situation, following the notation of Rasbash
and Browne (2001) and ignoring explanatory variables, is

Yi{j} = γ00 + ∑h wih U0h + Ri{j}.    (13.2)

The subscript {j} denotes that a level-one unit does not necessarily belong
to one unique level-two unit. Therefore the level-two residuals U0h are
weighted by wih. For a given pupil i, the schools h that this pupil never
attended have wih = 0, so they do not contribute anything to the outcome of
(13.2). For example, for a pupil who spent a quarter of his time in school 1
and the remainder in school 2, this part of the formula gives the
contribution

¼U01 + ¾U02,
A related quantity is Wi = 1/∑h wih² − 1,
which is 0 if one weight is 1 and all others 0, and positive if two or more of
the weights are nonzero; for example, if for pupil i there are K weights of
equal value 1/K and all other weights are 0, then Wi = K − 1.
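A sketch of how the weighted level-two contribution in (13.2) works; the school effects U0h below are arbitrary illustrative numbers, and the function name is invented for this example.

```python
import random

random.seed(1)

def mm_outcome(weights, u, gamma00=0.0, sd_resid=1.0):
    """Outcome under a multiple membership model of the form (13.2):
    Y_i = gamma00 + sum_h w_ih * U_0h + R_i."""
    contribution = sum(w_ih * u[h] for h, w_ih in weights.items())
    return gamma00 + contribution + random.gauss(0, sd_resid), contribution

u = {1: 0.8, 2: -0.4}   # illustrative school effects U_0h
# Pupil with a quarter of the time in school 1, the rest in school 2:
y, contrib = mm_outcome({1: 0.25, 2: 0.75}, u)
```

For this pupil the school contribution is ¼U01 + ¾U02; a pupil with weight 1 on a single school simply receives that school's effect in full.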
Example 13.2 Multiple membership in secondary schools.
Following up on the previous example and paying attention only to the secondary school period of a
pupil’s school career, thus ignoring the sustained effects of the primary schools, we now try to do
justice to the fact that many pupils attended more than one secondary school. In this sample 3,438 of
the 3,658 pupils did their final examination in the school where they originally enrolled. So 94% of
the pupils never changed schools. Of the remaining 6%, 215 pupils attended two, and 5 pupils
attended three different schools. For the pupils who attended two different schools the membership
weights for their first school vary between 0.20 and 0.80. For the five pupils who attended three
different schools the membership weights for their first school vary between 0.20 and 0.25,
indicating that they rather quickly left this school. Unlike the analyses in the previous example, we
start with a model (Model 4) in which the second level is defined by the first school enrolled, of
which there are 86 in total in this subset of the sample. Then a multiple membership model (Model 5)
is fitted to the data.
From Table 13.3 it can be seen that the results for Model 4 are very similar to those presented in
Model 1 of Table 13.1. The variance for the first school a pupil attended is close to the variance for
the last school attended. In Model 5 the results of the multiple membership modeling of the data can
be seen. There is a marginal increase in the between-school variance, from 0.062 to 0.064, indicating
that the multiple membership model gets somewhat closer to the data than the ‘simple’ multilevel
model. To see whether there is an impact of the more advanced modeling on the estimated school
effects, Figure 13.3 shows a scatterplot of the residuals (‘school effects’) for the school where the
pupils did their final examinations.
On the horizontal axis the school-level residuals derived from the ‘simple’ multilevel (ML)
model (Model 4) are plotted. On the vertical axis the school-level residuals derived from the ‘more
realistic’ multiple membership (MM) model are pictured. As can be seen, there is a high correlation
(0.86 to be precise) between the two types of residuals, but the most striking point is that the
residuals from the multiple membership model are slightly less dispersed. That is, of course, because
part of the between-schools variation is now accounted for by the other schools previously attended
by some of the pupils.
13.5 Glommary
Cross-classification or multiple classification. A situation where levels
are not nested in each other. For example, lower-level units belong to
higher-level units of two or more different but intersecting populations
of interest. An example of such populations is given by pupils in
neighborhoods and in schools, where pupils from the same
neighborhood may attend different schools that also are attended by
pupils from other neighborhoods.
1 The cross-classified models in this chapter are estimated using the Bayesian MCMC algorithm of
Browne (2004) with 10,000 iterations, and parameter expansion at the level of the primary school.
For Bayesian procedures, see Section 12.1.
14
Survey Weights
Some data collection methods use complex surveys, that is, surveys which
cannot be regarded as simple random samples. In a simple random sample,
which may be regarded as the standard statistical data collection design,
each subset of the population having the same size has the same probability
of being drawn as the sample.1 In Chapter 2 we saw that two-stage samples
and cluster samples are quite natural sampling designs for multilevel
analysis. Another sampling method is stratification, where the population is
divided into relatively homogeneous subsets called strata and samples are
drawn independently within each stratum, with sampling proportions which
may differ between strata. Many big surveys use a combination of
stratification and multistage design. This will imply that elements of the
population may have different probabilities of being included in the sample
– these probabilities are known as the sampling probabilities or inclusion
probabilities. This must be taken into account in the data analysis.
The probability-weighted estimate (PWE) of β is the value of β for which the
weighted sum of squares

∑i wi (yi − yi(β))²    (14.1)

is as small as possible.
For the case where only a population mean is being estimated, that is, β
is one-dimensional and yi(β) = β, this reduces to

β̂ = ∑i wi yi / ∑i wi    (14.2)

with estimated standard error

S.E.(β̂) = √( ∑i wi² (yi − β̂)² ) / ∑i wi.

This design-based standard error, that is, estimator for the sampling
standard deviation of β̂ under random sampling according to the design
with inclusion probabilities πi, is a special case of the formula presented, for
example, in Fuller (2009, p. 354).
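A sketch of the weighted mean and a standard error of the common design-based form just described (this form is an assumption here; Fuller, 2009, gives the precise expression).

```python
import math

def pwe_mean(y, w):
    """Probability-weighted estimate of a population mean: a weighted
    mean with weights w_i = 1 / pi_i."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def pwe_se(y, w):
    """A common design-based standard error for the weighted mean
    (a sketch; see Fuller, 2009, p. 354, for the exact formula)."""
    b = pwe_mean(y, w)
    return math.sqrt(sum(wi ** 2 * (yi - b) ** 2 for wi, yi in zip(w, y))) / sum(w)

y = [4.0, 6.0, 5.0, 7.0]
w = [1.0, 1.0, 2.0, 2.0]   # inverse inclusion probabilities (illustrative)
est = pwe_mean(y, w)
```

With constant weights the PWE reduces to the ordinary sample mean, as the weights then cancel from numerator and denominator.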
Precision weights, also called analytical weights, indicate that the
residual standard deviations of some cases are larger than for others, and
estimates will be more precise if cases with higher residual standard
deviation get lower weight. Denote in a one-level design the residual
standard deviation for case i by si. One reason why residual standard
deviations might be nonconstant is that a case in the data set might be an
average of a variable number of more basic units; then the weights may be
called frequency weights. Another possible reason is that the dependent
variable might represent a count or quantity, and higher quantities mostly
are naturally associated with larger variability. The (nonstandardized)
weight associated with standard deviation si is wi = 1/si². Here, larger
weights reflect smaller standard deviations (i.e., less uncertainty); this is
opposite to the case explained above of sampling weights. Here also, the
weighted parameter estimate is obtained by finding the minimum of

∑i wi (yi − yi(β))².    (14.4)
Precision
The reason for using the WLS estimator (14.4) with precision weights is to
obtain high precision.4 In the case of survey modeling, however, the PWE
estimator (14.1) is used to avoid bias, and its precision is not necessarily
optimal. Under the usual assumption of homoscedasticity (constant residual
standard deviations), there might be some cases with large sampling
weights wi which happen to have large residuals (which means that through
random causes their fit with the model is less good than for most other
cases), and the minimization of (14.1) will give such cases a large
influence. The extent of inefficiency of the PWE estimator (14.1) in the
homoscedastic case can be expressed by the effective sample size (Potthoff
et al., 1992), defined as

neff = (∑i wi)² / ∑i wi².
It is called this because, when the mean of some variable that has variance
σ² independently of the sampling weights wi is estimated by the PWE
estimator (14.2), the variance of the estimator is σ²/neff, just as if it were the
regular mean of a sample of size neff. The effective sample size is strictly
smaller than the regular sample size, unless the weights are constant;
therefore in the homoscedastic situation the use of sampling weights leads
to a loss of precision, and this can be serious if the variability of the weights
is large, as will be expressed by a small value for the effective sample size.5
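The effective sample size is easy to compute and makes the loss from variable weights concrete; the weight vectors below are illustrative.

```python
def effective_sample_size(w):
    """Effective sample size (sum w)^2 / (sum w^2): equals n for constant
    weights and is strictly smaller whenever the weights vary."""
    return sum(w) ** 2 / sum(wi ** 2 for wi in w)

n_eff_equal = effective_sample_size([2.0] * 10)           # constant weights
n_eff_varying = effective_sample_size([1.0] * 5 + [9.0] * 5)
```

With ten constant weights the effective sample size is exactly 10; with the strongly varying weights it drops well below 10, illustrating the loss of precision described above.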
If the residual variance is nonconstant (heteroscedasticity), however, the
effect on the estimation variance of using the weighted estimation (14.1)
will be different. The assumptions underlying survey weights and those
underlying precision weights are independent, and therefore they can both
be true – in which case the use of sampling weights yields a gain, rather
than a loss, of efficiency compared to unweighted estimation.
Bias
The reason for using sampling weights is to avoid bias. A sampling design
with heterogeneous inclusion probabilities will yield a sample that is a
biased representation of the population. This does not, however, necessarily
lead to bias in the estimators for the parameters. The hierarchical linear
model is a model for the dependent variable, conditional on the explanatory
variables and the multilevel structure. If the predictor variables are
represented differently in the sample than their distribution in the
population – which is one kind of bias – it does not follow that there must
be bias in the estimators of the parameters of the hierarchical linear model.
The parameters are biased only if the distribution of the residuals is affected
by the sampling design. Then the survey design is said to be informative. If
the residuals in the model are independent of the sampling design and of the
sampling weights, that is, the model is correctly specified given all
variables including the design variables, then the use of weights is
superfluous. In multilevel analysis of survey data, if we can be confident of
working with a well-specified hierarchical linear model and the sample
design is unrelated to the residuals, it is better to take no explicit account of
the survey design when doing the analysis and proceed as usual with
estimating the hierarchical linear model. The difficulty is, of course, that we
never can really be sure that the model is well specified.
As an example of the meaning of this type of bias, consider a multilevel
study of pupils in schools with language achievement as the dependent
variable, and socio-economic status (SES) as one of the explanatory
variables. Assume that the sampling design is such that the inclusion
probability of schools is likely to be related to the ethnic composition of the
schools, but no data about ethnic composition are available. Assume also
that language achievement depends on ethnic background of the pupil, and
SES is the main explanatory variable included that is related to ethnic
background. The hierarchical linear model for language achievement is then
misspecified because ethnic background is not included, and this will
mainly affect the estimate for the effect of SES. In this data set and model,
the ethnic subgroups at any given value of SES are represented in
proportions that differ from the true population proportions, thus leading to
a biased parameter estimate for SES. This argument hinges on the fact that there is an omitted variable,
correlated with an included predictor variable, and correlated with the
inclusion in the sample.
Clearly, the most desirable solution here would be to observe the ethnic
background of pupils and include it in the model. A second solution is to
come up with a variable that reflects the sampling design and that is closely
related to the ethnic composition of the school, and control for this variable.
A third solution is to weight schools according to their inverse inclusion
probability. In the rest of this chapter we shall discuss such solutions in
more detail.
It is possible that πj = 1 for all j: then all clusters in the population are
included in the sample. An example is a survey in countries as level-two
units, where all countries satisfying some criterion (defining the population)
are included in the study, and within each country a sample of the
population is taken. A different possibility is πi|j = 1 for all i and j; then the
sample design is called a cluster sample, and clusters are observed either
completely or not at all. An example is a survey of schools as level-two
units, where within each sampled school all pupils are observed. The
marginal probability of observing level-one unit i in cluster j is given by the
product

πij = πj πi|j.
The weights or design weights are the inverse of the inclusion
probabilities. Thus, the weights are defined by

wj = 1/πj and wi|j = 1/πi|j.

To use weights in two-level models, the separate sets of weights at level one
and level two are needed, corresponding to the separate inclusion
probabilities at level one and level two.
From these weights the effective sample sizes can be calculated: within
cluster j at level one,

njeff = (∑i wi|j)² / ∑i wi|j²,

and analogously at level two,

Neff = (∑j wj)² / ∑j wj².

The effective sample size is defined such that a weighted sample gives the
same amount of information as a simple random sample with sample size
equal to the effective sample size.
The ratios of effective sample sizes to actual sample sizes are called the
design effects at level two and level one,

Neff / N and njeff / nj.    (14.10)
The design effects give a first indication of the potential loss of statistical
efficiency incurred by following a design-based rather than model-based
analysis (but see footnote 5 on p. 221).
In single-level studies the scales of these weights are irrelevant,6 that is,
all weights could be multiplied by the same number without in any way
changing the results of the analysis. In multilevel designs scaling the level-
two weights still is irrelevant, but the scale of the level-one weights is
important, and the literature is still not clear about the best way of scaling.
Pfeffermann et al. (1998) propose two methods of scaling. These are
used in design-based estimation for multilevel designs (see below in
Section 14.5). The first is

wi|j* = wi|j × njeff / ∑i wi|j,    (14.11a)

which yields scaled weights summing to the effective sample size njeff; this
is called ‘method B’ by Asparouhov (2006). The second is

wi|j* = wi|j × nj / ∑i wi|j,    (14.11b)

yielding weights summing to the actual sample size nj, and called ‘method
A’ by Asparouhov.
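Both scaling methods can be sketched as follows; the labels 'A' and 'B' follow Asparouhov's naming as described above, and the weight vector is illustrative.

```python
def scale_weights(w, method):
    """Scale level-one weights within one cluster.
    'B': scaled weights sum to the effective sample size (sum w)^2 / sum w^2;
    'A': scaled weights sum to the actual sample size n."""
    s = sum(w)
    s2 = sum(wi ** 2 for wi in w)
    if method == "B":
        factor = s / s2          # sum becomes s^2 / s2 = n_eff
    elif method == "A":
        factor = len(w) / s      # sum becomes n
    else:
        raise ValueError(method)
    return [wi * factor for wi in w]

w = [1.0, 2.0, 2.0, 5.0]
wa = scale_weights(w, "A")
wb = scale_weights(w, "B")
```

Note that both methods multiply all weights in a cluster by one constant, so the relative sizes of the weights within a cluster are unchanged; only the sum differs.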
If the researcher is certain that the design variables have been included in a
satisfactory way, then a model-based analysis can be pursued further. In
practice, of course, one never is really sure about this, especially since
complicated interactions and nonlinear effects might possibly be involved.
To explore whether the model differs in ways associated to the design
weights, the following procedures can give some insights.
If the model is valid and the design variables are uncorrelated with
the level-one residuals Rij, then the difference β̂OLS − β̂w is just
random noise; on the other hand, if the design variables are correlated
with the level-one residuals, then this difference estimates the bias of
the OLS estimator. A theoretically interesting property is that, if this
model is valid, the covariance matrix of the difference is the
difference of the covariance matrices:7

cov(β̂w − β̂OLS) = cov(β̂w) − cov(β̂OLS).
DuMouchel and Duncan (1983) discussed this and showed that the
test for the difference β̂w − β̂OLS can also be carried out as follows
in a linear OLS model. Compute the product of all level-one
variables xhij with the weights wi|j, and test in an OLS model the
effect of the weight wi|j itself and all r products xhij wi|j, controlling
for X1, . . ., Xr. This yields an F-test with r + 1 and nj − 2(r + 1)
degrees of freedom which is exactly the same as the F-test for the
difference between the design-based and the model-based estimators.
It can be regarded as the test for the main effect of the weight
variable, together with all its interactions with the explanatory
variables. A version of this test for the case of multilevel models for
discrete outcome variables was developed by Nordberg (1989).
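A sketch of this auxiliary-regression test on simulated data, using plain least squares; this is an illustration of the construction, not the authors' implementation, and the simulated design has weights unrelated to the residuals (so no bias is expected).

```python
import numpy as np

rng = np.random.default_rng(0)

def dumouchel_duncan_F(y, X, w):
    """Sketch of the DuMouchel-Duncan (1983) test: augment the OLS model
    with the weight and all weight-by-predictor products, then F-test the
    r + 1 added terms against the restricted model."""
    n, r = X.shape
    X0 = np.column_stack([np.ones(n), X])            # restricted model
    X1 = np.column_stack([X0, w, X * w[:, None]])    # + weight and products

    def rss(M):
        beta = np.linalg.lstsq(M, y, rcond=None)[0]
        return float(np.sum((y - M @ beta) ** 2))

    q = r + 1                   # number of added terms
    df2 = n - 2 * (r + 1)       # residual degrees of freedom of full model
    F = ((rss(X0) - rss(X1)) / q) / (rss(X1) / df2)
    return F, q, df2

n, r = 200, 2
X = rng.normal(size=(n, r))
w = rng.uniform(1, 3, size=n)                     # weights unrelated to residuals
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(size=n)
F, df1, df2 = dumouchel_duncan_F(y, X, w)
```

With r = 2 predictors the test has 3 and n − 6 degrees of freedom, matching the r + 1 and nj − 2(r + 1) stated above.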
where ln(Fj) is the natural logarithm of Fj. In the second step, add
the normalized values
These F-tests are exact (if the level-one residuals are normally
distributed with constant variance) and therefore also are valid for
small sample sizes nj. (Fisher’s combination of p-values also is
exact, but Fisher’s Z transformation is an approximation.)
Table 14.1: Numbers of sampled schools (left) and sampled pupils (right)
per stratum, for PISA data, USA, 2009.
The data set contains two weight variables, W_FSTUWT at the student
level and W_FSCHWT at the school level. The documentation (OECD,
2010, p. 143) mentions that these are to be transformed as follows. The
student-level weights are to be divided by their school average:

w1ij = W_FSTUWT_ij / ( (1/nj) ∑i W_FSTUWT_ij ).    (14.18)

In other words, the sum ∑i w1ij of the student-level weights per school is
equal to the sample size nj of this school. This is scaling method 2 of
(14.11b). The school-level weights are to be normalized so that their sum
over all students in the data set is equal to the total number of sampled
students:

w2j = W_FSCHWT_j × ∑j nj / ∑j nj W_FSCHWT_j.    (14.19)
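A sketch of these two transformations on toy data; the variable names W_FSTUWT and W_FSCHWT follow the PISA data set, while the function name and the toy records are illustrative.

```python
def transform_pisa_weights(students, school_w):
    """Sketch of the two PISA weight transformations: student weights are
    divided by their school average (so they sum to n_j per school), and
    school weights are rescaled so that their sum over all students in the
    data set equals the total number of sampled students.
    students: list of (school_id, W_FSTUWT); school_w: {school_id: W_FSCHWT}."""
    by_school = {}
    for sch, wt in students:
        by_school.setdefault(sch, []).append(wt)
    means = {sch: sum(ws) / len(ws) for sch, ws in by_school.items()}
    w1 = [(sch, wt / means[sch]) for sch, wt in students]   # level-one weights
    total_students = len(students)
    denom = sum(school_w[sch] * len(ws) for sch, ws in by_school.items())
    w2 = {sch: school_w[sch] * total_students / denom for sch in school_w}
    return w1, w2

students = [("a", 2.0), ("a", 4.0), ("b", 1.0), ("b", 1.0), ("b", 4.0)]
w1, w2 = transform_pisa_weights(students, {"a": 10.0, "b": 5.0})
```

After the transformation the student weights of each school sum to that school's sample size, and the school weights, summed over all students, equal the total number of students.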
The technical documentation for the PISA 2009 survey was unavailable
at the time of writing (early 2011), and therefore the technical report of the
PISA 2006 survey (OECD, 2009) is used for further background
information. This report indicates (Chapter 8) that within the strata, schools
were sampled proportional to total enrollment of the schools, and within
schools pupils were sampled randomly (with equal probability). The
weights given in the data set reflect not only the selection probabilities but
also adjustment for nonresponse (so-called post-stratification).
To investigate how the sampling design could or should be included in
the analysis, we shall potentially consider the eight strata separately. This is
because they were sampled independently; the public–private dimension is
substantively meaningful and the geographic dimension is potentially
meaningful; and the number (eight) of separate analyses is still just
manageable.
As a first step we wish to gain some insight into the variability of the
weights. We calculate the weights (14.18) and (14.19) and compute the
design effects (14.10). It turns out that all level-one design effects are
between 0.95 and 1. This means they are hardly variable, and unequal
weights between pupils cannot lead to important adjustments or biases. The
level-two design effects must be considered per stratum. They are listed
only for the public schools in Table 14.2; these numbers would be
meaningless for the private schools, given the low number of such schools.
This means, for example, that, for the purpose of estimating the mean of a
homoscedastic variable, the sample for public schools in the South and
West has the precision of a sample which has less than 20% of the size of a
simple random sample. The precise definition of homoscedasticity in this
case is that the absolute deviation from the mean is unrelated to the
sampling weights.
Table 14.2: Level-two design effects for PISA data (USA 2009).
The estimated proportion in towns, 0.65, does not differ much from the
sample proportion, 0.56. The difference is less than one standard error
(0.13). This is in line with the small association between sampling weights
and being situated in a town. The total sample size here (number of sampled
public schools in the South) is 55 (see Table 14.1). If in a simple random
sample of size 55 a proportion of 0.65 were observed, the standard error of
the estimated proportion would be 0.06, against the standard error of the
weighted estimator, S.E.(p̂) = 0.13. This indicates the loss of efficiency in
using a weighted instead of a simple random sample, when the variable of
interest is not associated with the weights.
For ESCS and immigration status, the school average is also included in the
model, to allow differences between within-school and between-school
regression coefficients. These school means are denoted Sch-imm and
Sch-ESCS. The centering means that the intercept refers to girls of age 16 and
grade 10, with the other variables mentioned taking value 0.
The main design variables are the stratification variables, public/private
and region; and school size, determining inclusion probabilities within
strata. Of these variables, public/private is of primary interest, the other two
variables are of secondary interest only. Therefore, we do not initially
include region and school size in the model and rather follow method 3 (p.
226): divide the data set into parts dependent on the design, and apply the
hierarchical linear model to each of the parts. Since only 11 out of the 165
schools are private and this is a category of interest, the private schools are
considered as one part, and the public schools are divided into four groups
according to the school weights (14.19). The private schools have too few
level-two units for a good multilevel analysis, but this is an intermediate
step only and will give an impression of whether they differ from the public
schools. Table 14.4 presents the results. The classes denoted by ‘weight 1’
to ‘weight 4’ are the schools with the smallest to largest weights, therefore
the relatively largest to smallest schools. Pupils with missing data on one or
more variables were left out; this amounted to about 10% of the data set.
The group of private schools seems to be the most different from the
rest, but the sample size is small. The main differences in this respect are the
larger effects of the school means of ESCS and of immigrant status; the
lower intercept for the private schools may be related to the higher effect of
school mean of ESCS together with the fact that these schools on average
have a higher ESCS than the public schools. The private schools also have a
larger intercept variance, but this is an estimate based on only 11 schools
and therefore not very precise. The estimates for the four weight groups of
public schools do not differ appreciably, given their standard errors. These
differences can be assessed more formally using test (3.42). The variable
that leads to the most strongly different estimates is the school mean of
immigration status, for which this test applied to the estimates and their
standard errors in Table 14.4 leads to the test statistic C = 7.0, a chi-squared
variable with 3 degrees of freedom, with p = 0.07. Being the smallest p-
value out of a set of eight variables, this is not really alarming. However, it
turns out that the average immigrant status in the sample also differs across
the four weight groups of public schools, ranging from 0.08 in
weight group 4 to 0.59 in weight group 1. It is likely that this reflects
urbanization rather than weights, or school size, per se. Indeed, the average
immigrant status in the sample ranges from 0.07 in villages to 0.76 in large
cities. Therefore urbanization, a variable in five categories, is added to the
data and the estimation is repeated. The results are in Table 14.5. It should
be noted that for the private schools this is hardly a meaningful model, as it
includes four school variables for 11 schools.
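Test (3.42) itself is not reproduced in this excerpt; the sketch below assumes the common chi-squared form for comparing independent estimates of the same parameter (squared deviations from the precision-weighted mean, with degrees of freedom one less than the number of groups), which matches the 3 degrees of freedom reported for the four weight groups.

```python
def compare_estimates(est, se):
    """Sketch of a chi-squared comparison of G independent estimates of
    one parameter (in the spirit of test (3.42)): precision-weighted
    squared deviations from the pooled mean, df = G - 1."""
    w = [1.0 / s ** 2 for s in se]                          # precision weights
    pooled = sum(wi * bi for wi, bi in zip(w, est)) / sum(w)
    C = sum(wi * (bi - pooled) ** 2 for wi, bi in zip(w, est))
    return C, len(est) - 1

# Illustrative estimates and standard errors for four weight groups
C, df = compare_estimates([0.30, 0.10, 0.45, 0.20], [0.10, 0.12, 0.15, 0.11])
```

If all group estimates were identical, C would be exactly 0; larger values of C relative to a chi-squared distribution with G − 1 degrees of freedom indicate real differences between the groups.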
Controlling for urbanization hardly affects the effect of within-school
variables and makes the coefficients for the two school averages, especially
of immigration status, more similar across the four weight categories of
public schools. The parameters of the school averages for the private
schools, on the other hand, now deviate even more from those of the public
schools than without the control for urbanization.
Table 14.4: Estimates for model for metacognitive competence for five
parts of the data set.
From this exercise we have learnt that there are differences within the
set of public schools, dependent on the design, relating to the between-
school effect of immigrant status, and that these differences can be
attenuated by controlling for urbanization. Furthermore, we saw that the
private schools differ from the public schools with respect to the
coefficients of various of the explanatory variables. These conclusions will
be utilized in the further analyses.
Conclusions
All these simulation studies are based on limited designs and cannot be
regarded as conclusive. It can be concluded that for the estimation of fixed
effects in linear multilevel models, the methods mentioned all give good
results, perhaps excluding the case of a very low number of clusters. For the
estimation of random effects and nonlinear multilevel models, the method
of Rabe-Hesketh and Skrondal (2006) and Asparouhov (2006) seems better
than the method of Pfeffermann et al. (1998). It is necessary to apply
scaling to the level-one weights, but it is not clear which of the two methods
(14.11a) or (14.11b) is better, and this may depend on the design, the
parameter values, and what is the parameter of main interest. Carle (2009)
suggests using both scaling methods; and, if these produce different results,
to conduct simulation studies to determine, for the given design and
plausible parameter values, which is the preferable scaling method. This
seems to be good advice.
The estimation of standard errors, however, remains a problem. Coverage probabilities of confidence intervals (and therefore also type I error rates of hypothesis tests) can be too low for small sample sizes. Extant simulation studies suggest that a survey of 35 clusters of size 20 is too small for the design-based methods discussed here (Bertolet, 2008, 2010), while 100 clusters of size 100 is sufficient (Asparouhov, 2006). How generalizable this is, and what happens between these two boundaries, is still unknown. Perhaps bootstrap standard errors (see below) give better results, but this is still largely unexplored.
For two-level surveys with few clusters, but large cluster sizes, a good
alternative may be to follow a two-step approach with a first step of
separate estimations for each cluster, where single-level weighting may be
used to account for sampling weights (cf. Skinner, 1989; Lohr, 2010,
Chapter 11) and a second step combining the results for the different
clusters (cf. Section 3.7).
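A minimal sketch of such a two-step procedure follows; for simplicity the within-cluster step is a weighted mean rather than a full weighted regression, and the combination step uses precision weighting of the cluster-wise estimates (in the spirit of Section 3.7). The function names are ours.

```python
def weighted_mean(y, w):
    # step 1: single-level weighted estimate within one cluster,
    # using the level-one sampling weights w
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def pool(estimates, variances):
    # step 2: precision-weighted combination of the cluster-wise
    # estimates; returns the pooled estimate and its variance
    prec = [1.0 / v for v in variances]
    total = sum(prec)
    est = sum(p * e for p, e in zip(prec, estimates)) / total
    return est, 1.0 / total
```

With two equally precise cluster estimates the pooled value is simply their average, and the pooled variance is halved, as one would expect.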
Example 14.1 Design-based and model-based analysis of PISA data.
We continue the example of Section 14.4 of metacognitive competence in the PISA 2009 data for the
USA. In Section 14.4 we learned that level-two weights are important in this data set; schools were
sampled with inclusion probabilities dependent on school size. Level-one weights are hardly variable,
and therefore not important. We also learned that it may be good to control for urbanization.
Furthermore, private schools were perhaps different from public schools; but the number of private
schools sampled, 11, was too low to say much about this question with any confidence.
Based on this information, it was decided first to retain the entire data set, to control for main effects of urbanization as well as private/public, and to compare the design-based estimates, computed by MLwiN according to the method of Pfeffermann et al. (1998) employing scaling method 2, with the
model-based estimates according to the hierarchical linear model. The results are presented in Table
14.6. Deviances are not given, because these are not meaningful for the pseudo-likelihood method.
To compare the two kinds of estimate, Asparouhov’s (2006) informativeness measure I2 is given, as
presented in (14.17).
The results for the student-level effects correspond rather well, but those for the school-level
effects do not. I2 expresses the difference in terms of the model-based standard error. Values larger
than 2 can be considered unacceptable, and these occur for the two school averages and also for some
of the control variables for urbanization.
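In this spirit, the informativeness measure can be computed as the absolute difference between the design-based and model-based estimates, expressed in units of the model-based standard error; this sketch assumes the form of (14.17) as described in the text:

```python
def informativeness(est_design, est_model, se_model):
    # |design-based - model-based| in model-based standard errors;
    # values larger than 2 can be considered unacceptable
    return abs(est_design - est_model) / se_model
```

For example, a design-based estimate of 1.5 against a model-based estimate of 1.0 with standard error 0.2 gives an informativeness of 2.5, which would flag this coefficient for closer inspection.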
To diagnose the differences, first the private schools were left out, because of the possibility of
differences between private and public schools and the lack of enough information on the private
schools, and the two estimates were compared again. The differences remained (results not shown
here). Therefore a residual analysis was done (Chapter 10), with particular attention to the school
sizes, as these were the main determinants of the inclusion probabilities. It turned out that there is one
outlying school, having 6,694 pupils while the other school sizes ranged from 100 to 3,592, and with
results concerning metacognitive competence quite different from the other schools. This school was
also left out of the data set, leaving a total sample of 153 public schools with 4,316 pupils. (An
alternative would be to retain this school in the data set and represent it by a dummy variable.) In
addition, school size was used as an additional control variable; in view of the skewness of school
size, the square root of school size was used, centered at 35 (corresponding to a school size of 1,225,
close to average school size). Table 14.7 presents the results of the design-based and model-based
estimates for this smaller data set.
The results of the two approaches now are more in line with each other. The main difference now
is the coefficient of age, which is just significant for the model-based analysis, and small but still
positive for the design-based analysis, and has an informativeness measure slightly above 2. Further
explorations led to the conclusion that this difference may have to do with an interaction between age
and school size, or urbanization – these variables are too strongly associated to disentangle their
interaction effects. Table 14.8 shows the results of the two types of analysis, now including the
interaction of age with school size.
It turns out that in this model the differences between the two approaches are further reduced,
with all informativeness measures less than 2, and identical conclusions with respect to significance
of the variables in the original model of interest. The remaining differences may be regarded as
random variation. The model-based results have smaller standard errors, confirming the greater
precision of these estimators.
As a final conclusion of this data analysis, it seems reasonable to present the model-based results
of Table 14.8. The main interpretations are the following:
* This represents the public schools only. The private schools seem to present a somewhat different picture, but their sampled number is too small to draw conclusions about these differences.
* There was one outlying, very large public school which also seemed to present a different picture, and which was left out of the data for the final results.
* There may be interactions of age with other variables on metacognitive competence. Note that the effect of age is already controlled for grade, and the age range in these data is small (between 15.2 and 16.4 years). These interactions could be with school size or with urbanization. These variables are associated and their interaction effects cannot be disentangled.
* A further analysis could investigate differences between the four regions used for the stratification. These differences were not considered here.
Other methods
Since this is an area in development, it may be good to mention some other
procedures that have been proposed in the literature.
Korn and Graubard (2003) proposed another weighting method,
requiring knowledge of higher-order inclusion probabilities which often is
not available; therefore it is less widely applied. However, in their
simulations their method does appear to perform well. Bertolet (2008,
Section 2.2; 2010) explains how their method compares to the two methods
mentioned above.
Estimation of standard errors by bootstrap methods was proposed and
studied by Grilli and Pratesi (2004) and by Kovačević et al. (2006), and
seems to perform well. Given the difficulties for standard error estimation,
this seems a promising method.
Pfeffermann et al. (2006) propose a totally different approach. They
construct a simultaneous model for the random sample (the ‘inclusion
events’) and the dependent variable, conditional on inclusion in the sample.
This approach does not use weighting: it is model-based, not design-based.
Little (2004) and Gelman (2007) suggest a Bayesian approach by giving
random effects to subsets of the population having constant (or similar)
inclusion probabilities. This approach sacrifices unbiasedness, but may
obtain smaller posterior variances or smaller mean squared errors in return.
14.7 Glommary
Two-stage sample. A sample design where level-two units are chosen
independently with some probability, and given that a level-two unit j
has been selected, a sample of level-one units within this level-two
unit is chosen.
Effective sample size. A sample, with a complex sample design and having
effective sample size neff, gives the same amount of information as a
simple random sample of size neff; the usual formula holds, however,
only for the estimation of population means of homoscedastic random
variables.
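For the case of estimating a population mean of homoscedastic variables with equal cluster sizes, the usual formula referred to here is the design effect 1 + (n − 1)ρ, where n is the cluster size and ρ the intraclass correlation; a sketch:

```python
def effective_sample_size(n_total, cluster_size, rho):
    # n_eff = total sample size divided by the design effect
    # 1 + (n - 1) * rho; valid for estimating a population mean
    # of homoscedastic variables with equal cluster sizes
    design_effect = 1.0 + (cluster_size - 1) * rho
    return n_total / design_effect
```

For instance, 100 clusters of size 20 with ρ = 0.1 give a design effect of 2.9, so the 2,000 observations carry the information of roughly 690 independent observations.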
1. The distinction between sampling with and without replacement is not discussed here.
2. For generalized linear models (e.g., multilevel logistic regression), it may be more enlightening not to talk about residuals at the lowest level, but about the conditional distribution of the data given the linear predictor.
3. See Section 12.1.
4. In other words, low mean squared error.
5. It should be noted that this measure for the design effect and its interpretation of the loss of precision applies only to the case of estimating a mean of a homoscedastic population; for the estimation of regression coefficients the formulas are more complicated.
6. It is assumed here that finite population corrections are ignored.
7. This relation does not hold in general. It is valid here because of the independence between the OLS estimator and the difference β̂jW − β̂jOLS. The matrix expression for (14.12) is σ²((X′WX)⁻¹(X′W²X)(X′WX)⁻¹ − (X′X)⁻¹); cf. (14.20).
8. There are two different Fisher Z transformations; the better-known of these is for the correlation coefficient, while this one is for the F distribution. See Fisher (1924) and Fisher and Yates (1963, p. 2) ('The quantity z…'). For this reference we are indebted to Sir David Cox.
9. Note the difference in notation: what in the clusterwise method was denoted by γj is the (r + 1)-dimensional parameter vector for the jth cluster, whereas what is here denoted by γh is the single number which is the parameter for variable Xh.
15
Longitudinal Data
The paradigmatic nesting structure used up to now in this book has been the
structure of individuals (level-one units) nested in groups (level-two units).
The hierarchical linear model is also very useful for analyzing repeated
measures, or longitudinal data, where multiple repeated measurements of
the same variable are available for a sample of individual subjects. Since
the appearance of the path-breaking paper by Laird and Ware (1982), this has been the main type of application of the hierarchical linear model in the
biological and medical sciences. This chapter is devoted to the specific two-
level models and modeling considerations that are relevant for data where
the level-one units are measurement occasions and the level-two units
individual subjects.
Of the advantages of the hierarchical linear model approach to repeated
measures, one deserves to be mentioned here. It is the flexibility to deal
with unbalanced data structures, for example, repeated measures data with
fixed measurement occasions where the data for some (or all) individuals
are incomplete, or longitudinal data where some or even all individuals are
measured at different sets of time points. Here it must be assumed that the
fact that a part of the data was not observed does not itself contain some
information about the unobserved values; see Chapter 9 for a treatment of
incomplete data.
The topic of repeated measures analysis is too vast to be treated in one
chapter. For a general treatment of this topic we refer the reader to, for
example, Maxwell and Delaney (2004) and Fitzmaurice et al. (2004). In
economics this type of model is discussed mainly under the heading of
panel data; see, for example, Baltagi (2008), Chow (1984), and Hsiao
(1995). Some textbooks treating repeated measures specifically from the
perspective of multilevel modeling are Verbeke and Molenberghs (2000),
Singer and Willett (2003, Part I), and Hedeker and Gibbons (2006). The
current chapter only explains the basic hierarchical linear model
formulation of models for repeated measures.
This chapter is about the two-level structure of measurements within
individuals. When the individuals, in their turn, are nested in groups, the
data have a three-level structure: longitudinal measurements nested within
individuals nested within groups. Models for such data structures can be
obtained by adding the group level as a third level to the models of this
chapter. Such three-level models are not explicitly treated in this chapter.
However, this three-level extension of the fully multivariate model of
Section 15.1.3 is the same as the multivariate multilevel model of Chapter
16.
The usual assumptions are made: the U0i and Rti are independent normally distributed random variables with expectations 0 and variances τ0² for U0i and σ² for Rti.
To fit this model, note that the fixed part does not contain a constant term, but is based on m dummies for the m measurement occasions. This can be expressed by the following formula. Let dhti be m dummy variables, defined for h = 1, …, m by dhti = 1 if h = t and dhti = 0 otherwise.
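In code, the dummy coding for an observation at occasion t can be sketched as follows:

```python
def occasion_dummies(t, m):
    # d_h = 1 if the observation is at occasion h, 0 otherwise
    # (h = 1, ..., m); exactly one dummy equals 1
    return [1 if h == t else 0 for h in range(1, m + 1)]
```

With m = 6 occasions, an observation at occasion 3 gets the dummy vector [0, 0, 1, 0, 0, 0], so the fixed part reproduces a separate mean for each occasion without a constant term.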
The results are in Table 15.1. However, it is dangerous to trust results for the compound
symmetry model before its assumptions have been tested, because standard errors of fixed effects
may be incorrect if one uses a model with a random part that has an unsatisfactory fit. Later we will
fit more complicated models and show how the assumption of compound symmetry can be tested.
The results suggest that individual (level-two) variation is a little more important than random
differences between measurement occasions (level-one variation). Further, the means change only
slightly from time t = 1 to time t = 6. The deviance test for the difference in mean between the six time points is borderline significant: χ² = 21,791.34 − 21,780.70 = 10.64, df = 5, p = 0.06. This near-significance is not meaningful in this case, in view of the large data set.
The fixed part is an extension of (15.1) or the equivalent form (15.3), but
the random part is still the same. Therefore this is still called a compound
symmetry model. Inclusion in the fixed part of interactions between
individual-level explanatory variables and time variables such as s(t)
suggests, however, that the random part could also contain random slopes
of these time variables. Therefore we defer giving an example of the fixed
part (15.4) until after the treatment of such a random slope.
Classical analyses of variance methods are available to estimate and test
parameters of the compound symmetry model if all data are complete (e.g.,
Maxwell and Delaney, 2004). The hierarchical linear model formulation of
this model, and the algorithms and software available, also permit the
statistical evaluation of this model for incomplete data without additional
complications.
Covariance matrix
In the fixed occasion design one can talk about the complete data vector (Y1i, Y2i, …, Ymi) containing the measurements at all m occasions. Even if there were no subject at all with complete data, the complete data vector would still make sense from a conceptual point of view.
The compound symmetry model (15.1), (15.3) or (15.4) implies that for the complete data vector, all variances are equal and also all covariances are equal. The covariance matrix of the complete data vector, conditional on the explanatory variables, is the m × m matrix with all diagonal elements equal to τ0² + σ² and all off-diagonal elements equal to τ0².
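This matrix can be constructed directly; as illustrative values we use the Model 2 estimates quoted later in the chapter (τ0² = 1.991, σ² = 1.452):

```python
def compound_symmetry_cov(m, tau0_sq, sigma_sq):
    # m x m covariance matrix of the complete data vector:
    # tau0^2 + sigma^2 on the diagonal, tau0^2 off the diagonal
    return [[tau0_sq + (sigma_sq if i == j else 0.0) for j in range(m)]
            for i in range(m)]
```

For example, compound_symmetry_cov(6, 1.991, 1.452) has 3.443 on the diagonal and 1.991 everywhere else, so every pair of occasions has the same intra-subject correlation 1.991/3.443.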
This model means that the rates of increase have a random, individual-
dependent component U1i, in addition to the individual-dependent random
deviations U0i which affect all values Yti in the same way. The random
effect of time can also be described as a random time-by-individual
interaction.
The value t0 is subtracted from t in order to allow the intercept variance
to refer not to the (possibly meaningless) value t = 0 but to the reference
point t = t0; cf. p. 76. The variables (U0i, U1i) are assumed to have a joint
bivariate normal distribution with expectations 0, variances τ02 and τ12, and
covariance τ01.
The variances and covariances of the measurements Yti, conditional on
the explanatory variables, are now given by
where t ≠ s; we saw the same formulas in (5.5) and (5.6). These formulas express the fact that the variances and covariances of the outcome variables are variable over time. This is called heteroscedasticity. Differentiating with respect to t and equating the derivative to 0 shows that the variance is minimal at t = t0 − τ01/τ1² (if t were allowed to assume any value). Furthermore, the correlation between different measurements depends on their spacing (as well as on their position).
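The variance as a function of t is a parabola; the following sketch (symbols as in the random slope model above) computes it and its minimizing value of t:

```python
def var_y(t, t0, tau0_sq, tau01, tau1_sq, sigma_sq):
    # var(Y_t) = tau0^2 + 2*tau01*(t - t0) + tau1^2*(t - t0)^2 + sigma^2
    d = t - t0
    return tau0_sq + 2.0 * tau01 * d + tau1_sq * d * d + sigma_sq

def t_of_minimal_variance(t0, tau01, tau1_sq):
    # setting the derivative 2*tau01 + 2*tau1^2*(t - t0) equal to 0
    return t0 - tau01 / tau1_sq
```

With, say, τ01 = −0.5 and τ1² = 0.25 around t0 = 1, the variance is minimal at t = 3, and evaluating var_y on either side of that point confirms the parabolic shape.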
Extensions to more than one random slope are obvious; for example, a second random slope could be given to the squared value (t − t0)². In this way, one can perform a polynomial trend analysis to improve the fit of the random part. This means that one fits random slopes for a number of powers of (t − t0) to obtain a model that has a good fit to the data and where unexplained differences between individuals are represented as random individual-dependent regressions of Y on (t − t0), (t − t0)², (t − t0)³, etc.
Functions other than polynomials can also be used, for example, splines
(see Section 15.2.2). Polynomial trend analysis is also discussed in Singer
and Willett (2003, Section 6.3) and Maas and Snijders (2003).
Example 15.3 Random slope of time in life satisfaction.
We continue the earlier example of life satisfaction for 55–60-year-olds. Note that the observations
are spaced a year apart, and a natural variable for the occasions is s(t) = t as used above, that is, age
recoded so that t = 1 is 55 years and t = 6 is 60 years. Thus, the time dimension here corresponds to
age. A random slope of age is now added (Model 3). The reference value for the time dimension, denoted by t0 in formula (15.7), is taken as t0 = 1, corresponding to 55 years of age.
We now investigate whether part of the differences between individuals in life satisfaction and in
the rate of change can be explained by birth cohort. Years of birth here range from 1929 to 1951. This
variable is centered at the mean, which is 1940. To avoid very small regression coefficients the
variable is divided by 10, leading to a variable Z ranging from –1.1 to +1.1, used as a between-
subjects variable. A difference of one unit in Z corresponds to a difference of 10 years in birth year.
The effect of cohort on the rate of change, that is, the interaction effect of birth year with age, is
represented by the product of Z with t. The resulting model is an extension of a model of the type of
(15.4) with a random slope:
Parameter α is the main effect for birth year, while γ is the interaction effect between birth year and
age.
The results are given in Table 15.2. To allow deviance tests, all results were calculated using the
maximum likelihood estimation procedure. Comparing the deviances of Models 2 and 3 shows that
the random slope of age is significant: χ2 = 32.7, df = 2, p < 0.0001. This implies that there is a
significant deviation from the compound symmetry model, Model 2 of Table 15.1. However, the
differences between individuals in rate of change are still rather small, with an estimated inter-individual standard deviation of 0.16. One standard deviation difference between individuals with respect to the slope for age will add up, over the age range of 5 years, to a difference of 0.16 × 5 = 0.8, which is not negligible on the scale from 0 to 10, but still less than the within-individual standard deviation.
The effect of birth year is significant, with a t-ratio of −0.394/0.078 = −5.1, p < 0.0001. Here also
the effect size is rather small, however. The difference between lowest and highest value of Z (birth
year divided by 10) is 1.1–(–1.1) = 2.2. Given the presence of an interaction, the main effect α of
birth year corresponds to the value t = t0 where the interaction parameter γ in (15.9) cancels, that is,
55 years of age. Thus the main effect of birth year translates to a difference in expected life
satisfaction, at 55 years of age, equal to 2.2 × 0.394 = 0.87.
The birth year × age interaction is significant, t = 0.049/0.019 = 2.6, p < 0.01. The contribution of
0.049 is of the same order of magnitude as the rather irregular differences between the estimated
means μ1 to μ6. Those born later on average experience slightly higher life satisfaction during the
period when they are 55–60 years old, those born earlier slightly lower or about the same.
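The arithmetic of these effect sizes can be checked directly, using the estimates −0.394 and 0.049 and the standard error 0.019 as quoted from Table 15.2:

```python
alpha = -0.394          # main effect of birth year (per decade), at age 55
gamma = 0.049           # birth year x age interaction
z_range = 1.1 - (-1.1)  # range of Z, birth year in decades

# difference in expected life satisfaction at age 55 between the
# earliest and latest cohorts
effect_range_at_55 = z_range * abs(alpha)

# t-ratio of the interaction effect
t_interaction = gamma / 0.019
```

This reproduces the 0.87 difference across cohorts at age 55 and the interaction t-ratio of about 2.6 reported in the text.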
The fitted covariance matrix of the complete data vector under Model 4 has elements given by (15.8). Inserting the estimated parameters τ0², τ1², and τ01 from Table 15.2 into (15.8) yields the covariance matrix.
These matrices show that the variance does not change a lot, and correlations attenuate slightly as age differences increase, but they are close to the intra-subject correlation estimated above under the compound symmetry model as 1.991/(1.991 + 1.452) = 0.58.
However, these values are conditional on the validity of the model with one random slope. We
return to these data below, and will investigate the adequacy of this model by testing it against the
fully multivariate model.
where the dummy variable d1 equals 1 or 0, respectively, depending on whether or not t = 1. The null
hypothesis that the means are identical for the two time points can be represented by ‘γ1 = 0’.
Table 15.3: Estimates for incomplete paired data.
The results are in Table 15.3. The estimated covariance matrix of the complete data vector is
The test of the equality of the two means, which is the test of γ1, is not significant (t = −0.053/0.141 = −0.38).
Table 15.5: The multivariate model with the effects of cohort and
employment situation.
The estimates of the fixed effects are presented in Table 15.5. The effect of birth year and its interaction with age is similar to Model 4 of Table 15.2. As to employment situation, compared to full-time working, not working at age 55 has a strong negative effect on life satisfaction, and working part-time at age 55 a weak negative effect. Currently working part-time (compared to full-time) also has a weak negative effect.
still with model specification (15.12). Since the first ‘variable’ is constant,
this effectively means that one uses a random intercept and m – 1 random
slopes.
Each model for the random part with a restricted covariance matrix is a
submodel of the fully multivariate model. Therefore the fit of such a
restricted model can be tested by comparing it to the fully multivariate
model by means of a likelihood ratio (deviance) test (see Chapter 6).
For complete data, an alternative to the hierarchical linear model exists
in the form of multivariate analysis of variance (MANOVA) and
multivariate regression analysis. This is documented in many textbooks
(e.g., Maxwell and Delaney, 2004; Stevens, 2009) and implemented in
standard software such as SPSS and SAS. The advantage of these methods
is the fact that under the assumption of multivariate normal distributions the
tests are exact, whereas the tests in the hierarchical linear model
formulation are approximate. For incomplete multivariate data, however,
exact methods are not available. Maas and Snijders (2003) elaborate the
correspondence between the MANOVA approach and the hierarchical linear
model approach.
This shows that the occasion dummies are fundamental, not only because
they have random slopes, but also because in the fixed part all variables are
multiplied by these dummies. If some of the Zk variables are not to be used
for all dependent variables, then the corresponding cross-product terms
dhtizhi can be dropped from (15.14).
that is, the fixed part depends on the measurement occasion but not on other
explanatory variables, and the random part is chosen so as to provide a
good fit to the data.
The proportion of explained variance at level one, R1², is then the proportional reduction in the average residual variance,
when going from the baseline model to the model containing the
explanatory variables.
If the random part of the compound symmetry model is adequate, the baseline model is (15.1). In this case the definition of R1² is just as in Section 7.1.
If the compound symmetry model is not adequate one could use, at the
other end of the spectrum, the multivariate random part. This yields the
baseline model
which is the fully multivariate model without covariates; cf. (15.11) and
(15.12).
In all models for fixed occasion designs, the calculation of the proportions of explained variance can be related to the fitted complete data covariance matrix. The value of R1² is the proportional reduction in the sum of diagonal values of this matrix when going from the baseline model to the model including the explanatory variables.
Example 15.6 Explained variance for life satisfaction.
We continue the example of life satisfaction of people aged 55–60, now computing the proportion of
variance explained by cohort (birth year), employment situation, and the interaction of the cohort
with age as included in the model of Table 15.5; to which are added the total income after taxes,
logarithmically transformed, and satisfaction with health, also measured on a scale from 0 to 10. The
last two covariates were centered to have overall means 0. Their standard deviations are 0.67 and 2.3,
respectively.
First suppose that the compound symmetry model is used. The baseline model then is Model 2 in
Table 15.1. The variance per measurement is 1.991 + 1.452 = 3.443. The results of the compound
symmetry model with the effect of cohort and employment situation are given in Table 15.6. In
comparison with Model 2 of Table 15.1, inclusion of these fixed effects has led especially to a
smaller variance at the individual level; these variables, although some of them are changing
covariates, contribute little to explaining year-to-year fluctuations in life satisfaction. The residual
variance per measurement is 1.038 + 1.351 = 2.389. Thus, the proportion of variance explained at level one is R1² = 1 − (2.389/3.443) = 0.306.
Table 15.6: The compound symmetry model with the effects of cohort,
employment situation, income, and health satisfaction.
Next suppose that the fully multivariate model is used. The estimated fixed effects do not change
much. The estimated covariance matrix in the fully multivariate model is given in (15.13), while the
residual covariance matrix of the model with these fixed effects, as estimated by employing the fully
multivariate model, is
The sum of diagonal values is 20.32 for the first covariance matrix and 14.13 for the second. Hence, calculated on the basis of the fully multivariate model, R1² = 1 − (14.13/20.32) = 0.305. Comparing this to the value obtained above shows that for the calculation of R1² it does not make much difference which random part is used. The calculations using the fully multivariate model are more reliable, but in most cases the simpler calculations using the compound symmetry ('random intercept') model will lead to almost the same values for R1².
Table 15.7: Linear growth model for 5–10-year-old children with retarded
growth.
Polynomial functions
To obtain a good fit, however, one can try to fit more complicated random parts. One possibility is to use a polynomial random part. This means that one or more powers of (t − t0) are given a random slope. This corresponds to polynomials for the function Fi,
which has the same structure as (5.15). For h = p + 1, …, r, parameter γh0 is the value of the coefficient βhi, constant over all individuals i. For h = 1, …, p, the coefficients βhi are individual-dependent, with population average γh0 and individual deviations Uhi = βhi − γh0. All variances and covariances of the random effects Uhi are estimated from the data. The mean curve for the population is given by
Numerical difficulties appear less often in the estimation of models of this kind when t0 has a value in the middle of the range of t-values in the data set than when t0 is outside this range or at one of its extremes. Therefore, when convergence problems occur, it is advisable to try to work with a value of t0 close to the average or median value of t. Changing the value of t0 amounts only to a reparametrization: a different t0 leads to different parameters γh0 for which, however, formula (15.20) constitutes the same function, and the deviance of the model (given that the software will calculate it) is also the same.
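That changing t0 is merely a reparametrization can be illustrated numerically: re-expanding a polynomial around a new reference point changes the coefficients but leaves the function itself unchanged. A sketch with generic coefficients (not the estimates of this chapter):

```python
from math import comb

def eval_poly(coefs, t, t0):
    # F(t) = sum_h coefs[h] * (t - t0)**h
    return sum(c * (t - t0) ** h for h, c in enumerate(coefs))

def recenter(coefs, t0_old, t0_new):
    # binomial re-expansion around the new reference point:
    # b_h = sum_{k >= h} c_k * C(k, h) * d**(k - h), d = t0_new - t0_old
    d = t0_new - t0_old
    n = len(coefs)
    return [sum(coefs[k] * comb(k, h) * d ** (k - h) for k in range(h, n))
            for h in range(n)]
```

For a cubic centered at t0 = 7.5, recenter(coefs, 7.5, 5.0) produces different γ-type coefficients, but evaluating both parametrizations at any t yields the same curve.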
The number of random slopes, p, is not greater than r and may be considerably smaller. To give a rough indication, r + 1 may not be larger than the total number of distinct time points, that is, the number of different values of t in the entire data set for which observations exist; also it should not be larger than a small proportion, say 10%, of the total number of observations Σi mi. On the other hand, p will rarely be much larger than the maximum number of observations per individual, maxi mi.²
Example 15.8 Polynomial growth model for children with retarded growth.
We continue the preceding example of children with retarded growth for which height measurements
are considered in the period between the ages of 5 and 10 years. In fitting polynomial models,
convergence problems occurred when the reference age t0 was chosen as 5 years, but not when it was
chosen as the midpoint of the range, 7.5 years.
A cubic model, that is, a model with polynomials of the third degree, turned out to yield a much
better statistical fit than the linear model of Table 15.7. Parameter estimates are shown in Table 15.8
for the cubic model with t0 = 7.5 years. So the intercept parameters refer to children of this age.
For the level-two random slopes (U0i, U1i, U2i, U3i) the estimated correlation matrix is
The fit is much better than that of the linear model (deviance difference 496.12 for 9 degrees of
freedom). The random effect of the cubic term is significant (the model with fixed effects up to the
power r = 3 and with p = 2 random slopes, not shown in the table, has deviance 6,824.63, so the
deviance difference for the random slope of (t – t0)3 is 221.12 with 4 degrees of freedom).
Table 15.8: Cubic growth model for 5–10-year-old children with retarded
growth.
The mean curve for the population (cf. (15.20)) is given here by
This deviates hardly at all, and not significantly, from a straight line.
However, the individual growth curves do differ from straight lines, but the
pattern of variation implied by the level-two covariance matrix is quite
complex. We return to this data set in the next example.
Other functions
There is nothing sacred about polynomial functions. They are convenient and reasonably flexible, and any reasonably smooth function can be approximated by a polynomial, provided one is prepared to use a polynomial of sufficiently high degree. One argument for using other classes of functions is that some function shapes are approximated more parsimoniously by functions other than polynomials. Another argument is that polynomials are wobbly: when the value of a polynomial function F(t) is changed a bit at one value of t, this may require coefficient changes that make the function change a lot at other values of t. In other words, the fitted value at any given value of t can depend strongly on observations at quite distant values of t. This kind of sensitivity is often undesirable.
One can use other functions instead of polynomials. If the functions
used are called f1(t),. . ., fp(t), then instead of (15.19) the random function is
modeled as
Often one of the constant values a and b is chosen to be 0. The nodes are t1
and t2. Boundary cases are the functions (choosing a = t1 and b = 0 and
letting the lower node t1 tend to minus infinity)
and (choosing a = 0 and b = t2 and letting the upper node t2 tend to plus infinity)
and also linear functions such as f (t) = t (where both nodes are infinite).
Each piecewise linear function can be obtained as a linear combination of
these basic functions. The choice of nodes sometimes will be suggested by
the problem at hand, and in other situations has to be determined by trial
and error.
Example 15.9 Piecewise linear models for retarded growth.
Let us try to improve on the polynomial models for the retarded growth data of the previous example
by using piecewise linear functions. Recall that the height measurements were considered for ages
from 5.0 to 10.0 years. For ease of interpretation, the nodes are chosen at the children’s birthdays, at
6.0, 7.0, 8.0, and 9.0 years. This means that growth is assumed to proceed linearly during each year,
but that growth rates may be different between the years. For comparability with the polynomial model, the intercept again corresponds to height at the age of 7.5 years. This is achieved by using piecewise linear functions that are all equal to 0 for t = 7.5. Accordingly, the model is based on the following five basic piecewise linear functions:
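One construction satisfying these requirements (piecewise linear, slope changes at the four birthdays, all functions equal to 0 at t = 7.5) uses shifted hinge functions; this particular parametrization is our illustration and need not coincide term by term with the basis displayed in the book:

```python
def hinge(t, node, ref=7.5):
    # (t - node)_+ shifted by a constant so the function equals 0 at t = ref
    return max(t - node, 0.0) - max(ref - node, 0.0)

def basis(t, nodes=(6.0, 7.0, 8.0, 9.0), ref=7.5):
    # five basic piecewise linear functions: one overall linear term
    # plus one hinge per node (birthday)
    return [t - ref] + [hinge(t, nd, ref) for nd in nodes]
```

Every linear combination of these five functions is piecewise linear with possible kinks at ages 6, 7, 8, and 9, and equals the intercept at age 7.5, which is exactly what the model specification asks for.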
The results for this model are presented in Table 15.9.
The level-two random effects (U0i, U1i,. . ., U5i) have estimated correlation matrix
The average growth rate is between 5 and 6 cm/year over the whole age range from 5 to 10 years.
There is large variability in individual growth rates. All slope variances are between 3 and 4, so the
between-children standard deviations in yearly growth rate are almost 2. The correlations between
individual growth rates in different years are not very high, ranging from –0.23 (between ages 7–8
and 9–10) to 0.48 (between ages 6–7 and 9–10). This indicates that the growth rate fluctuates rather
erratically from year to year around the average of about 5.5 cm/year.
The deviance for this piecewise linear model is 121.88 less than the deviance of the polynomial
model of Table 15.8, while it has 13 parameters more. This is a large deviance difference for only 13
degrees of freedom. The residual level-one variance also is smaller. Although the polynomial model
is not a submodel of the piecewise linear model, so that we cannot test the former against the latter by
the usual deviance test, we may nevertheless conclude that the piecewise model not only is more clearly
interpretable but also fits better than the polynomial model.
Table 15.9: Piecewise linear growth model for 5–10-year-old children with
retarded growth.
Spline functions
Another flexible class of functions which is suitable for modeling
longitudinal measurements within the framework of the hierarchical linear
model is the class of spline functions. Spline functions are smooth
piecewise polynomials. A number of points called nodes are defined on the
interval where the spline function is defined; between each pair of adjacent
nodes the spline function is a polynomial of degree p, while these
polynomials are glued together so smoothly that the function itself and its
derivatives, of order up to p – 1, are also continuous at the nodes. For p = 1
this leads again to piecewise linear functions, but for p > 1 it yields
functions which are smooth and do not have the kinked appearance of
piecewise linear functions. An example of a quadratic spline (i.e., a spline
with p = 2) was given in Chapter 8 by equation (8.3) and Figure 8.1. This
equation and figure represent a function of IQ which is a quadratic for IQ <
0 and also for IQ > 0, but which has different coefficients for these two
domains. The point 0 here is the node. The coefficients are such that the
function and its derivative are also continuous at the node, as can be seen
from the graph. Therefore it is a spline. Cubic splines (p = 3) are also often
used. We present here only a sketchy introduction to the use of splines. For
a more elaborate introduction to the use of spline functions in (single-level)
regression models, see Seber and Wild (1989, Section 9.5). The use of
spline functions in longitudinal multilevel models was discussed by Pan and
Goldstein (1998).
Suppose that one is investigating the development of some
characteristic over the age of 12–17 years. Within each of the intervals 12–
15 and 15–17 years, the development curves might be approximately
quadratic (this could be checked by a polynomial trend analysis for the data
for these intervals separately), while they are smooth but not quadratic over
the entire range of 12–17. In such a case it would be worthwhile to try a
quadratic spline (p = 2) with one node, at 15 years. Defining t1 = 15, the
basic functions can be taken as
The functions f2 and f3 are quadratic functions to the left of t1 and right
of t1 respectively, and they are continuous and have continuous derivatives.
That the functions are continuous and have continuous derivatives even at
the node t = t1 can be verified by elementary calculus or by drawing a
graph.
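One concrete choice of basis functions consistent with this description (our illustration; the displayed equation is not reproduced here) uses a linear term together with the squared negative and positive parts of t − t1. The snippet checks numerically that the quadratic pieces join smoothly at the node:

```python
t1 = 15.0  # the node

def f1(t): return t - t1                   # linear term
def f2(t): return min(t - t1, 0.0) ** 2    # quadratic to the left of the node
def f3(t): return max(t - t1, 0.0) ** 2    # quadratic to the right of the node

def deriv(f, t, h=1e-6):
    """Central-difference approximation to the first derivative."""
    return (f(t + h) - f(t - h)) / (2.0 * h)
```

Both f2 and f3 equal 0 at the node and have derivative 0 there, so the function value and the first derivative are continuous, as the text asserts.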
The individual development functions are modeled as
If β2i = β3i, the curve for individual i is exactly quadratic. The freedom to
have these two coefficients differ from each other allows us to represent
functions that look very different from quadratic functions; for example, if
these coefficients have opposite signs then the function will be concave on
one side of t1 and convex on the other side. Equation (8.3) and Figure 8.1
provide an example of exactly such a function, where t is replaced by IQ
and the node is the point IQ = 0.
The treatment of this model within the hierarchical linear model
approach is completely analogous to the treatment of polynomial models.
The functions f1, f2, and f3 constitute the fixed part of the model as well as
the random part of the model at level two. If there is no evidence for
individual differences with respect to the coefficients β2i and/or β3i, then
these could be deleted from the random part.
Formula (15.25) shows that a quadratic spline with one node has one
parameter more than a quadratic function (4 instead of 3). Each further node
increases the number of parameters of the function by one. There
is considerable freedom of choice in defining the basic functions, subject to
the restriction that they are quadratic on each interval between adjacent
nodes, and are continuous with continuous derivatives at the nodes. For two
nodes, t1 and t2, a possible choice is the following. This representation
employs a reference value t0 that is an arbitrary (convenient or meaningful)
value. It is advisable to use a t0 within the range of observation times in the
data. The basic functions are
The coefficient β2i is the quadratic coefficient in the interval between t1 and t2,
while β3i and β4i are the changes in the quadratic coefficient that occur
when time t passes the nodes t1 or t2, respectively. The quadratic coefficient
for t < t1 is β2i + β3i, and for t > t2 it is β2i + β4i.
The simplest cubic spline (p = 3) has one node. If the reference point t0
is equal to the node, then the basic functions are
For more than two nodes, and an arbitrary order p of the polynomials,
the basic spline functions may be chosen as follows, for nodes denoted by
t1,. . ., tM:
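A common concrete realization is the 'truncated power' basis. The sketch below is an assumption on our part, since the displayed equation is not reproduced here, but it has the properties the text describes: polynomial terms of degree 1 to p, plus one truncated term per node, each p − 1 times continuously differentiable at every node.

```python
def truncated_power_basis(p, nodes, t0=0.0):
    """'Truncated power' spline basis of order p: polynomial terms
    (t - t0)^h for h = 1..p, plus one term ((t - t_m)_+)^p per node t_m.
    Each truncated term is p-1 times continuously differentiable."""
    funcs = [(lambda h: lambda t: (t - t0) ** h)(h) for h in range(1, p + 1)]
    funcs += [(lambda tm: lambda t: max(t - tm, 0.0) ** p)(tm)
              for tm in nodes]
    return funcs

# A cubic basis (p = 3) with two nodes: 3 + 2 = 5 basis functions.
basis = truncated_power_basis(3, nodes=[1.0, 2.0])
```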
The correlation matrix of the level-two random effects (U0i, U1i,. . ., U4i) is estimated as
Testing the random effects (not reported here) shows that the functions f1,. . ., f4 all have
significant random effects. The level-one residual standard deviation is only
which demonstrates that this family of functions fits rather closely to the
height measurements. Notation is made more transparent by writing (t − 15)₋ and
(t − 15)₊ for the negative and positive parts of t − 15. Thus, we denote
f3(t) = (t − 15)₋³ and f4(t) = (t − 15)₊³. The mean height curve can be obtained by filling in
the estimated fixed coefficients, which yields
This function can be differentiated to yield the mean growth rate. Note that df3(t)/dt = −3(t − 15)₋² and
df4(t)/dt = 3(t − 15)₊². This implies that the mean growth rate is estimated as
For example, for the minimum of the age range considered, t = 12 years, this is 6.43 − (0.25 × 3) +
(0.114 × 9) + 0 = 6.71 cm/year. This is slightly larger than the mean growth rate found in the
preceding examples for ages from 5 to 10 years. For the maximum age in this range, t = 17 years, on
the other hand, the average growth rate is 6.43 + (0.25 × 2) + 0 – (1.587 × 4) = 0.58 cm/year,
indicating that growth has almost stopped at this age.
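The growth-rate arithmetic can be replicated in a few lines. The function below folds the estimated coefficients into the derivative terms, reconstructed from the two worked values quoted in the text (the coefficient names themselves are in the table, which is not reproduced here):

```python
def mean_growth_rate(t):
    """Estimated mean growth rate (cm/year) implied by the worked example.
    The four terms correspond to the derivatives of f1..f4 with the
    estimated fixed coefficients folded in (reconstructed from the
    numbers quoted in the text, so treat them as illustrative)."""
    neg = max(15.0 - t, 0.0)   # (t - 15)_-, the negative part
    pos = max(t - 15.0, 0.0)   # (t - 15)_+, the positive part
    return 6.43 + 0.25 * (t - 15.0) + 0.114 * neg ** 2 - 1.587 * pos ** 2
```

At the two endpoints of the age range this reproduces the values in the text: about 6.71 cm/year at t = 12 and about 0.58 cm/year at t = 17.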
The results of this model are illustrated more clearly by a graph. Figure 15.1 presents the average
growth curve and a sample of 15 random curves from the population defined by Table 15.10 and the
given correlation matrix. The average growth curve does not deviate noticeably from a linear curve
for ages below 16 years. It levels off after 16 years. Some of the randomly drawn growth curves are
decreasing in the upper part of the range. This is an obvious impossibility for the real growth curves,
and indicates that the model is not completely satisfactory for the upper part of this age range. This
may be related to the fact that the number of measurements is rather low at ages over 16.5 years.
Figure 15.1: Average growth curve (*) and 15 random growth curves for
12–17-year-olds for cubic spline model.
The same can now be done as in Chapter 5 (see p. 83): the individual-
dependent coefficients βhi can be explained with individual-level (i.e., level-
two) variables. Suppose there are q individual-level variables, and denote
these variables by Z1, . . ., Zq. The inter-individual model to explain the
coefficients β0i, β1i,. . ., βri is then
Substitution of (15.29) into (15.21) yields
The fixed effect of gender (γ01) is not significant, which means that at 15 years there is not a
significant difference in height of boys and girls in this population with retarded growth. However,
the interaction effects of gender with age, γ11 and γ21, show that girls and boys do grow in different
patterns during adolescence. The coding of gender implies that the average height difference between
girls and boys is given by
The girl–boy difference in average growth rate, which is the derivative of this function, is equal to
−2.532 − 1.448(t − 15). This shows that from about the age of 13 years (more precisely, for t > 13.25),
the growth of girls is on average slower than that of boys, and faster before this age.
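The crossover age quoted above follows from setting this derivative difference to zero, which can be checked directly:

```python
# Girl-boy difference in average growth rate, as given in the text:
# d(t) = -2.532 - 1.448 * (t - 15).
# Girls grow more slowly than boys where d(t) < 0, i.e. for t above the root.
def rate_difference(t):
    return -2.532 - 1.448 * (t - 15.0)

crossover_age = 15.0 - 2.532 / 1.448   # root of d(t) = 0
```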
Parents’ height has a strong main effect: for each centimeter of extra height of the parents, the
children are on average 0.263 cm taller at 15 years of age. Moreover, for every centimeter extra of
the parents, on average the children grow faster by 0.03 cm/year.
The intercept variance and the slope variances of f1 and f2 have decreased, compared to Table
15.10. The residual variance at level one has remained the same, which is natural since the effects
included explain differences between curves and do not yield better-fitting curves. The deviance went
down by 113.88 points (df = 5, p < 0.0001).
15.2.4 Changing covariates
Individual-level variables such as the Zk of the preceding section are
referred to as constant covariates because they are constant over time. It is
also possible that changing covariates are available – changing social or
economic circumstances, performance on tests, mood variables, etc. Fixed
effects of such variables can be added to the model without problems, but
the neat forward model selection approach is disturbed because the
changing covariate normally will not be a linear combination of the
functions fh in (15.21). Depending on, for example, the primacy of the
changing covariates in the research question, one can employ the changing
covariates in the fixed part right from the start (i.e., add this fixed effect to
(15.19) or (15.22)), or incorporate them in the model at the point where the
constant covariates are also considered (add the fixed effect to (15.30)).
Example 15.12 Cortisol levels in infants.
This example reanalyzes data collected by de Weerth (1998, Chapter 3) in a study on stress in infants;
also see de Weerth and van Geert (2002). In this example the focus is not on discovering the shape of
the development curves, but on testing a hypothesized effect, while taking into account the
longitudinal design of the study. Because of the complexity of the data analysis, we shall combine
techniques described in several of the preceding chapters.
The purpose of the study was to investigate experimentally the effect of a stressful event on
experienced stress as measured by the cortisol hormone, and whether this effect is stronger in certain
hypothesized ‘regression weeks’. The experimental subjects were 18 normal infants with gestational
ages ranging from 17 to 37 weeks. Gestational age is age counted from the due date, that is, the day on
which the baby would have been born after a full-term pregnancy. Each infant was seen repeatedly, the number
of visits per infant ranging from 8 to 13, with a total of 222 visits. The infants were divided randomly
into an experimental group (14 subjects) and a control group (4 subjects). Cortisol was measured
from saliva samples. The researcher collected a saliva sample at the start of the session; then a play
session between the mother and the infant followed. For the experimental group, during this session
the mother suddenly put the infant down and left the room. In the control group, the mother stayed
with the child. After the session, a saliva sample was taken again. This provided a pretest and a post-
test measure of the infant’s cortisol level. In this example we consider only the effect of the
experiment, not the effect of the ‘regression weeks’. Further information about the study and about
the underlying theories can be found in de Weerth (1998, Chapter 3).
The post-test cortisol level was the dependent variable. Explanatory variables were the pretest
cortisol level, the group (coded Z = 1 for infants in the experimental group and Z = 0 for the control
group), and gestational age. Gestational age is the time variable t. In order to avoid very small
coefficients, the unit is chosen as 10 weeks, so that t varies between 1.7 and 3.7. A preliminary
inspection of the joint distribution of pretest and post-test showed that cortisol levels had a rather
skewed distribution (as is usual for cortisol measurements). This skewness was reduced satisfactorily
by using the square root of the cortisol level, both for the pretest and for the post-test. These variables
are denoted by X and Y, respectively. The pretest variable, X, is a changing covariate.
The law of initial value proposed by Wilder in 1956 implies that an inverse relation is expected
between basal cortisol level and the subsequent response to a stressor (cf. de Weerth, 1998, p. 41). An
infant with a low basal level of cortisol will accordingly react to a stressor with a cortisol increase,
whereas an infant with a high basal cortisol level will react to a stressor with a cortisol decrease. The
basal cortisol level is measured here by the pretest. The average of X (the square root of the pretest
cortisol value) was 2.80. This implies that infants with a basal value x less than about 2.80 are
expected to react to stress with a relatively high value for Y whereas infants with x more than about
2.80 are expected to react to stress with a relatively low value of Y. The play session itself may also
lead to a change in cortisol value, so there may be a systematic difference between X and Y.
Therefore, the stress reaction was investigated by testing whether the experimental group has a more
strongly negative regression coefficient of Y on X – 2.80 than the control group; in other words, the
research hypothesis is that there is a negative interaction between Z and X - 2.80 in their effect on Y.
It appeared that a reasonable first model for the square root of the post-test cortisol level, if the
difference between the experimental and control group is not yet taken into account, is the model
where the changing covariate, defined as X – 2.80, has a fixed effect, while time (i.e., gestational age)
has a linear fixed as well as random effect:
The results are given as Model 1 in Table 15.12. This model can be
regarded as the null hypothesis model, against which we shall test the
hypothesis of the stressor effect.
The random effect of gestational age is significant. The model without this effect, not shown in
the table, has deviance 289.46, so that the deviance comparison yields χ2 = 8.89, df = 2. Using the
mixture of chi-squared distributions with 1 and 2 df, according to Section 6.2.1 and Table 6.2, we
obtain p < 0.05. The pretest has a strongly significant effect (t = 8.0, p < 0.0001): children with a
higher basal cortisol value tend also to have a higher post-test cortisol value. The fixed effect of
gestational age is not significant (t = 1.67, two-sided p > 0.05). The intercept variance is rather large
because the time variable is not centered and t = 0 refers to the due date of birth, quite an
extrapolation from the sample data.
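The mixture p-value quoted here can be verified numerically. The sketch below uses only the closed-form chi-squared survival functions for 1 and 2 degrees of freedom to evaluate the deviance difference of 8.89 under the 50:50 mixture described in Section 6.2.1:

```python
import math

def chi2_sf(x, df):
    """Survival function of the chi-squared distribution for df = 1 or 2
    (sufficient for the mixture test used here)."""
    if df == 1:
        # P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x/2))
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        # For 2 df the survival function is simply exp(-x/2).
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 implemented")

# Deviance difference 8.89, tested against the 50:50 mixture of
# chi-squared(1) and chi-squared(2) distributions.
p_mixture = 0.5 * chi2_sf(8.89, 1) + 0.5 * chi2_sf(8.89, 2)
```

The resulting p-value is well below 0.05, consistent with the conclusion in the text.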
To test the effect of the stressor, the fixed effect of the experimental group Z and the interaction
effect of Z × (X – 2.80) were added to the model. The theory, that is, Wilder’s law of initial value,
predicts the effect of the product variable Z × (X – 2.80) and therefore the test is focused on this
effect. The main effect of Z is included only because including an interaction effect without the
corresponding main effects can lead to errors of interpretation.
The result is presented as Model 2 in Table 15.12. The stressor effect (parameter γ03 in the table)
is significant (t = −2.07 with many degrees of freedom because this is the effect of a level-one
variable, one-sided p < 0.025). This confirms the hypothesized effect of the stressor in the
experimental group.
The model fit is not good, however. The standardized level-two residuals defined by (10.10) were
calculated for the 18 infants. For the second infant (i = 2) the value was Si2 = 34.76 (df = ni = 13, p =
0.0009). With the Bonferroni correction which takes into account that this is the most significant out
of 18 values, the significance value still is 18 × 0.0009 = 0.016. Inspection of the data showed that
this was a child who had been asleep shortly before many of the play sessions. Being just awake is
known to have a potential effect on cortisol values. Therefore, for each session it had been recorded
whether the child had been asleep in the half hour immediately preceding the session. Subsequent
data analysis showed that having been asleep (represented by a dummy variable equal to 1 if the
infant had been asleep in the half hour preceding the session and equal to 0 otherwise) had an
important fixed effect, not a random effect, at level two, and also was associated with
heteroscedasticity at level one (see Chapter 8). Including these two effects led to the estimates
presented in Table 15.13.
Table 15.13: Estimates for a model controlling for having been asleep.
The two effects of sleeping are jointly strongly significant (the comparison between the deviances
of Models 2 and 3 yields χ2 = 25.11, df = 2, p < 0.0001). The estimated level-one variance is 0.130
for children who had not slept and (using formula (8.1)) 0.362 for children who had slept. But the
stressor (Z × (X – 2.80) interaction) effect has now lost its significance (t = –1.42, n.s.). The most
significant standardized level-two residual defined by (10.10) for this model is obtained for the fourth
child (j = 4), with the value Sj2 = 29.86, df = nj = 13, p = 0.0049. Although this is a rather small p-
value, the Bonferroni correction now leads to a significance probability of 18 × 0.0049 = 0.09, which
is not alarmingly low. The fit of this model therefore seems satisfactory, and the estimates do not
support the hypothesized stressor effect. However, these results cannot be interpreted as evidence
against the stressor effect, because the parameter estimate does have the predicted negative sign and
the number of experimental subjects is not very large, so that the power may have been low.
It can be concluded that it is important to control for the infant having slept shortly before the
play session, and in this case this control makes the difference between a significant and a
nonsignificant effect. Having slept not only leads to a higher post-test cortisol value, controlling for
the pretest value, but also triples the residual variance at the occasion level.
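The Bonferroni calculations used in this example are simple enough to script. The sketch below reproduces the two corrected significance values (18 × 0.0009 for Model 2 and 18 × 0.0049 for Model 3) quoted in the text:

```python
def bonferroni(p_min, n_tests):
    """Bonferroni-corrected p-value for the most significant of n_tests
    tests (capped at 1)."""
    return min(1.0, n_tests * p_min)

# Model 2: most extreme of the 18 standardized level-two residuals.
p_model2 = bonferroni(0.0009, 18)   # still significant
# Model 3 (controlling for sleep): the fourth child is the most extreme.
p_model3 = bonferroni(0.0049, 18)   # no longer alarmingly low
```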
Other covariance and correlation patterns are also possible. For variable
occasion designs, level-one residuals with the correlation structure (15.32)
are also called autocorrelated residuals, although they cannot be constructed
by the relations (15.31). These models are discussed in various textbooks,
such as Diggle et al. (2002), Goldstein (2011, Section 5.4), Verbeke and
Molenberghs (2000, Chapter 10), Singer and Willett (2003, Section 7.3), and
Hedeker and Gibbons (2006).
The extent to which time dependence can be modeled by random
slopes, or rather by autocorrelated residuals, or a combination of both,
depends on the phenomenon being modeled. This issue usually will have to
be decided empirically; most computer programs implementing the
hierarchical linear model allow the inclusion of autocorrelated residuals in
the model.
15.4 Glommary
Longitudinal data. Repeated measurements of the same characteristic for a
sample of individual subjects. Another name is panel data. The level-
one units here are the measurements (usually labeled as time points)
and the level-two units are individual subjects or respondents.
Fixed occasion designs. Data structures with a fixed set of measurement
occasions for all subjects, which may be conveniently labeled t = 1,. . .,
m. This does not preclude that data are incomplete for some or all
of the subjects.
t-test for paired samples. This provides the simplest case of a longitudinal
design with fixed measurement occasions.
Random intercept model for repeated measures. This is also called the
compound symmetry model.
Random slope models. These models, with random slopes for time and
transformed time variables, can be used to represent time-dependent
variances as well as nonconstant correlations between different
measurements of the same subject.
Variable occasion designs. In these designs the data are ordered according
to some underlying dimension such as time, and for each individual
data are recorded at some set of time points which is not necessarily
related to the time points at which the other individuals are observed.
Repeated measures according to variable occasion designs can be
regarded as observations on a population of curves.
Multilevel approach to populations of curves. This approach can
represent the average curve by fixed effects of individual-level
variables and of their cross-level interactions with time variables.
Individual deviations are represented by random effects of time
variables. In addition, changing covariates can be included to represent
further individual deviations.
1. The notation with the parameters γ has nothing to do with the γ parameters used earlier in this
chapter in (15.4), but is consistent with the notation in Chapter 5.
2. In the fixed occasion design, the number of random effects, including the random intercept, cannot
be greater than the number of measurement occasions. In the variable occasion design this strict
upper bound does not apply, because the variability of time points of observations can lead to richer
information about intra-individual variations. But if one obtains a model with clearly more random
slopes than the maximum of all mi, this may mean that one has made an unfortunate choice of the
functions of time (polynomial or other) that constitute the random part at level two, and it may be
advisable to try to find another, smaller, set of functions of t for the random part with an equally
good fit to the data.
16 Multivariate Multilevel Models
In words: for the hth dependent variable, the intercept is γ0h, the regression
coefficient on X1 is γ1h, the coefficient on X2 is γ2h, . . ., the random part of
the intercept in group j is Uhj, and the residual is Rhij. This is just a random
intercept model like (4.5) and (4.9). Since the variables Y1, . . . , Ym are
measured on the same individuals, however, their dependence can be taken
into account. In other words, the Us and Rs are regarded as components of
vectors
Instead of residual variances at levels one and two, there are now residual
covariance matrices,
With these dummies, the random intercept models (16.1) for the m
dependent variables can be integrated into one three-level hierarchical
linear model by the expression
All variables (including the constant) are multiplied by the dummy
variables. Note that the definition of the dummy variables implies that in
the sums over s = 1, . . . ,m only the term for s = h gives a contribution and
all other terms disappear. So this formula is just a complicated way of
rewriting formula (16.1).
The purpose of this formula is that it can be used to obtain a
multivariate hierarchical linear model. The variable-dependent random
residuals Rhij in this formula are random slopes at level two of the dummy
variables, Rsij being the random slope of ds, and the random intercepts Uhj
become the random slopes at level three of the dummy variables. There is
no random part at level one.
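The rewriting into one three-level model amounts to a data-management step: each pupil record is expanded into m records, one per dependent variable, with dummies and dummy-by-predictor products. The sketch below illustrates this in plain Python; all field names are our own and purely illustrative.

```python
def to_long_format(rows, n_outcomes=2):
    """Rewrite multivariate data (one record per pupil, with a list of m
    outcomes) into the 'long' format of formula (16.4): one record per
    pupil-by-variable combination, with dummies d_s and every explanatory
    variable multiplied by every dummy.  Field names are illustrative."""
    long_rows = []
    for r in rows:
        for h in range(n_outcomes):
            d = [1.0 if s == h else 0.0 for s in range(n_outcomes)]
            long_rows.append({
                "group": r["group"], "pupil": r["pupil"],
                "variable": h + 1, "y": r["y"][h],
                # the dummies play the role of the m variable-specific intercepts
                **{f"d{s+1}": d[s] for s in range(n_outcomes)},
                # each predictor times each dummy
                **{f"d{s+1}_x_{k}": d[s] * r["x"][k]
                   for s in range(n_outcomes) for k in r["x"]},
            })
    return long_rows

# One pupil with two outcomes (e.g. language and arithmetic) and one predictor.
pupils = [{"group": 1, "pupil": 1, "y": [41.0, 22.0], "x": {"iq": 2.1}}]
long_data = to_long_format(pupils)
```

In the long data, only the record for variable h has d_h = 1, so within each record only the terms for s = h contribute, exactly as the text notes for formula (16.4).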
This model can be further specified, for example, by omitting some of
the variables Xk from the explanation of some of the dependent variables Yh.
This amounts to dropping some of the terms γks dshij xkij from (16.4).
Another possibility is to include variable-specific covariates, in analogy to
the changing covariates of Section 15.2.4. An example of this is a study of
school pupils’ performance on several subjects (these are the multiple
dependent variables), using the pupils’ motivation for each of the subjects
separately as explanatory variables.
This empty model can be used to decompose the raw variances and
covariances into parts at the two levels. When referring to the multivariate
empty model, the covariance matrix
This shows that especially the random school effects for language and arithmetic are very
strongly correlated.
For the correlations between observed variables, these estimates yield a correlation between
individuals of (cf. (16.2))
Table 16.2: Parameter estimates for multivariate model for language and
arithmetic tests.
and, for groups of a hypothetical size n = 30, a correlation between group means (cf. (16.6)) of
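The two correlations just mentioned can be computed from the estimated level-one and level-two covariance matrices. Since those estimates are in the table referenced in the text and not reproduced here, the sketch below uses clearly hypothetical matrices to show the mechanics of (16.2) and (16.6):

```python
import math

# HYPOTHETICAL between-school (tau) and within-school (sigma) covariance
# matrices for two outcomes; the actual estimates are in Table 16.2's model.
tau = [[3.2, 2.9], [2.9, 3.6]]       # level-two covariance matrix (assumed)
sigma = [[8.0, 4.5], [4.5, 10.0]]    # level-one covariance matrix (assumed)

def corr_individuals(tau, sigma):
    """Correlation between the two observed variables for individuals,
    cf. (16.2): based on the total covariance tau + sigma."""
    c = [[tau[a][b] + sigma[a][b] for b in range(2)] for a in range(2)]
    return c[0][1] / math.sqrt(c[0][0] * c[1][1])

def corr_group_means(tau, sigma, n):
    """Correlation between observed group means for groups of size n,
    cf. (16.6): based on tau + sigma / n."""
    c = [[tau[a][b] + sigma[a][b] / n for b in range(2)] for a in range(2)]
    return c[0][1] / math.sqrt(c[0][0] * c[1][1])
```

Because the within-group contribution is divided by n, group means are more strongly correlated than individual observations whenever the level-two correlation exceeds the level-one correlation.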
Explanatory variables included are IQ, SES, the group mean of IQ, and the group mean of SES.
As in the examples in Chapters 4 and 5, the IQ measurement is the verbal IQ from the ISI test. The
correspondence with formulas (16.1) and (16.4) is that X1 is IQ, X2 is SES, X3 is the group mean of
IQ, X4 is the group mean of SES, X5 = X1 × X2 represents the interaction between IQ and SES, and X6
= X3 × X4 represents the interaction between group mean IQ and group mean SES. The results are
given in Table 16.2.
Calculating t-statistics for the fixed effects shows that for language, all effects are significant at
the 0.05 level. For arithmetic the fixed effect of mean SES is not significant, nor are the two
interaction effects, but the other three fixed effects are significant. The residual correlations are
ρ(U1j,U2j) = 0.83 at the school level and ρ(R1ij, R2ij) = 0.45 at the pupil level. This shows that taking
the explanatory variables into account has led to somewhat smaller, but still substantial residual
correlations. Especially the school-level residual correlation is large. This suggests that, also when
controlling for IQ, SES, mean IQ, and mean SES, the factors at school level that determine language
and arithmetic proficiency are the same. Such factors could be associated with school policy but also
with aggregated pupil characteristics not taken into account here, such as average performance IQ.
When the interaction effect of mean IQ with mean SES is to be tested for both dependent
variables simultaneously, this can be done by fitting the model from which these interaction effects
are excluded. In formula (16.1) this corresponds to the effects γ61 and γ62 of X6 on Y1 and Y2; in
formula (16.4) this corresponds to the effects γ61 and γ62 of d1X6 and d2X6. The model from which
these effects are excluded has a deviance of 46,570.5, which is 6.2 less than the model of Table 16.2.
In a chi-squared distribution with df = 2, this is a significant result (p < 0.05).
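Since the chi-squared survival function with two degrees of freedom has the closed form exp(−x/2), the reported significance can be verified directly (a quick check, not part of the original analysis):

```python
import math

# Simultaneous deviance test of the two interaction effects (df = 2):
# for 2 degrees of freedom the chi-squared survival function is exp(-x/2).
deviance_diff = 6.2
p_value = math.exp(-deviance_diff / 2.0)
```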
The random slopes U1hj are uncorrelated between groups j but correlated
between variables h; this correlation between random slopes of different
dependent variables is a parameter of this model that is not included in the
hierarchical linear model of Chapter 5.
Just like the multilevel random intercept model, this model is
implemented by a three-level formulation, defined by
This means that again, technically, there is no random part at level one,
there are m random slopes at level two (of variables d1, . . . , dm) and 2m
random slopes at level three (of variables d1,. . . , dm and of the product
variables d1X1, . . . , dmX1). With this kind of model, an obvious further step
is to try and model the random intercepts and slopes by group-dependent
variables as in Section 5.2.
16.4 Glommary
Multivariate multilevel model. This is a three-level model, in which level
one consists of variables or measurements; level two consists of
individuals; and level three of groups.
Up to now, it has been assumed in this book that the dependent variable has
a continuous distribution and that the residuals at all levels (U0j, Rij, etc.)
have normal distributions. This provides a satisfactory approximation for
many data sets. However, there also are many situations where the
dependent variable is discrete and cannot be well approximated by a
continuous distribution. This chapter treats the hierarchical generalized
linear model, which is a multilevel model for discrete dependent variables.
Thus, the variance is not a free parameter but is determined by the mean;
and the variance is not constant: there is heteroscedasticity. In terms of
multilevel modeling, this could lead to a relation between the parameters in
the fixed part and the parameters in the random part.
This has led to the development of regression-like models that are more
complicated than the usual multiple linear regression model and that take
account of the nonnormal distribution of the dependent variable, its
restricted range, and the relation between mean and variance. The best-
known method of this kind is logistic regression, a regression-like model
for dichotomous data. Poisson regression is a similar model for count data.
In the statistical literature, such models are known as generalized linear
models; see McCullagh and Nelder (1989) or Long (1997).
The present chapter gives an introduction to multilevel versions of some
generalized linear models; these multilevel versions are aptly called
hierarchical generalized linear models or generalized linear mixed models.
The choice of link function has to be guided by the empirical fit of the
model, ease of interpretation, and convenience (e.g., availability of
computer software). In this chapter the link function for a probability will
be denoted by f (p), and we shall concentrate on the logit link function.
For the deviations U0j it is assumed that they are independent random
variables with a normal distribution with mean 0 and variance
This model does not include a separate parameter for the level-one
variance. This is because the level-one residual variance of the dichotomous
outcome variable follows directly from the success probability, as indicated
by equation (17.3).
Denote by π0 the probability corresponding to the average value γ0. For the
logit function, this means that π0 is the so-called logistic transform of γ0,
defined by
π0 = exp(γ0) / (1 + exp(γ0)).
Here exp(γ0) = eγ0 denotes the exponential function, where e is the base of
the natural logarithm. The logistic and logit functions are mutual inverses,
just like the exponential and the logarithmic functions. Figure 17.4 shows
the shape of the logistic function. This π0 is close (but not quite equal) to
the average value of the probabilities Pj in the population of groups.
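The logit and logistic functions, and the fact that they are mutual inverses, can be written out in a few lines:

```python
import math

def logit(p):
    """Log-odds (the logit link): f(p) = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def logistic(x):
    """Inverse of the logit: exp(x) / (1 + exp(x))."""
    return math.exp(x) / (1.0 + math.exp(x))
```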
Because of the nonlinear nature of the link function, there is no simple
relation between the variance of these probabilities and the variance of the
deviations U0j. There is an approximate formula, however, valid when the
variances are small. The approximate relation (valid for small τ0²) between
the population variances is
var(Pj) ≈ π0² (1 − π0)² τ0².       (17.13)
When τ02 is not so small, the variance of the probabilities will be less than
the right-hand side of (17.13). (Note that these are population variances and
not variances of the observed proportions in the groups; see Section 3.3 for
this distinction.)
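The quality of this approximation is easy to probe numerically. The sketch below compares the variance of the probabilities Pj = logistic(γ0 + U0j), computed by brute-force numerical integration, with the delta-method approximation π0²(1 − π0)²τ0²; the values γ0 = 0.5 and τ0 = 0.1 are our own illustrative choices.

```python
import math

def logistic(x):
    return math.exp(x) / (1.0 + math.exp(x))

def var_of_probabilities(gamma0, tau, n_grid=20001, width=8.0):
    """Population variance of P_j = logistic(gamma0 + U_j) with
    U_j ~ N(0, tau^2), computed by numerical integration on a grid
    spanning +/- width standard deviations."""
    step = 2.0 * width * tau / (n_grid - 1)
    pts = [-width * tau + i * step for i in range(n_grid)]
    w = [math.exp(-0.5 * (u / tau) ** 2) for u in pts]   # normal weights
    total = sum(w)
    mean = sum(wi * logistic(gamma0 + u) for wi, u in zip(w, pts)) / total
    return sum(wi * (logistic(gamma0 + u) - mean) ** 2
               for wi, u in zip(w, pts)) / total

gamma0, tau = 0.5, 0.1                            # illustrative values (assumed)
pi0 = logistic(gamma0)
approx = pi0 ** 2 * (1.0 - pi0) ** 2 * tau ** 2   # delta-method approximation
exact = var_of_probabilities(gamma0, tau)         # variance by integration
```

For this small τ0 the two agree closely, and the integrated variance is slightly below the approximation, in line with the text's remark that the approximation overstates the variance of the probabilities.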
The logistic random intercept model expresses the log-odds, that is, the
logit of Pij, as a sum of a linear function of the explanatory variables2 and a
random group-dependent deviation U0j:
* Educational level, measured as the age at which people left school (14–21
years), minus 14. This variable was centered within countries. The
within-country deviation variable has mean 0, standard deviation 2.48.
* Income, standardized within country; mean −0.03, standard deviation 0.99.
* Employment status, 1 for unemployed, 0 for employed; mean 0.19,
standard deviation 0.39.
* Sex, 1 for female, 0 for male; mean 0.52, standard deviation 0.50.
* Marital status, 1 for single/divorced/widowed, 0 for married/cohabiting;
mean 0.23, standard deviation 0.42.
* Divorce status, 1 for divorced, 0 for other; mean 0.06, standard deviation
0.25.
* Widowed, 1 for widowed, 0 for other; mean 0.08, standard deviation 0.27.
* Urbanization, the logarithm of the number of inhabitants in the
community or town of residence, truncated between 1,000 and
1,000,000, minus 10; mean 0.09, standard deviation 2.18.
At the country level, the averages of some individual-level variables
representing social and cultural differences between countries were used,
with one other country-level variable. Using the definitions above, this
leads to the following list of variables:
17.2.5 Estimation
Parameter estimation in hierarchical generalized linear models is more
complicated than in hierarchical linear models. Inevitably some kind of
approximation is involved, and various kinds of approximation have been
proposed. Good reviews are given in Demidenko (2004, Chapter 8),
Skrondal and Rabe-Hesketh (2004, Chapter 6), Tuerlinckx et al. (2006), and
Rodríguez (2008). We mention some references and some of the terms used
without explaining them. The reader who wishes to study these algorithms
is referred to the literature cited. More information about the computer
programs mentioned in this section is given in Chapter 18.
Currently the best and most often used methods are two frequentist ones, the Laplace approximation and adaptive numerical quadrature (both algorithms for approximating the maximum likelihood estimates), and Bayesian methods.
The Laplace approximation for hierarchical generalized linear models
was proposed by Raudenbush et al. (2000). Because of its good quality and
computational efficiency this is now one of the most frequently used
estimation algorithms for generalized linear mixed models. It is
implemented in the software package HLM and in various packages of the
statistical system R, such as lme4, glmmADMB and glmmML.
Numerical integration is an approach for dealing with the random
coefficients which is straightforward in principle, but poses various
difficulties in implementation. Its use for hierarchical generalized linear
models was initially developed by Stiratelli et al. (1984), Anderson and
Aitkin (1985) and Gibbons and Bock (1987); later publications include
Longford (1994), Hedeker and Gibbons (1994), Gibbons and Hedeker
(1997), and Hedeker (2003). Pinheiro and Bates (1995) and Rabe-Hesketh
et al. (2005) developed so-called adaptive quadrature methods of numerical
integration; see also Skrondal and Rabe-Hesketh (2004). Adaptive
quadrature has considerable advantages over nonadaptive numerical
integration. It is implemented in SAS proc NLMIXED, in Stata, and in the gllamm software, which runs within Stata. Nonadaptive
numerical integration is implemented in the program MIXOR/SuperMix
and its relatives.
Much work has been done on Bayesian methods (cf. Section 12.1),
mainly used in Markov chain Monte Carlo (MCMC) methods, which are
computationally intensive, because they are based on repeated simulation of
draws from the posterior distribution. Zeger and Karim (1991) and Browne
and Draper (2000) are important milestones for Bayesian inference for
multilevel models, and this approach is now being increasingly used. An
overview is given by Draper (2008) and an extensive treatment in Congdon
(2010). For hierarchical generalized linear models, Bayesian methods
perform very well, and they also have good frequentist properties (i.e.,
small mean squared errors of estimators and good coverage probabilities of
confidence intervals); see Browne and Draper (2006). Bayesian MCMC
methods are implemented in MLwiN, WinBUGS, and BayesX. A recent
development is the use of the Laplace approximation for Bayesian
inference, which circumvents the computationally intensive MCMC
methods; see Rue et al. (2009) and Fong et al. (2010). This is implemented
in the R package INLA.
Before these methods had been developed, the main methods available
were those based on first- or second-order Taylor expansions of the link
function. When the approximation is around the estimated fixed part, this is
called marginal quasi-likelihood (MQL), when it is around an estimate for
the fixed plus the random part it is called penalized or predictive quasi-
likelihood (PQL) (Breslow and Clayton, 1993; Goldstein, 1991; Goldstein
and Rasbash, 1996). For estimating and testing fixed effects these methods
are quite adequate, especially if the cluster sizes nj are not too small, but
they are not satisfactory for inference about random effects. The first-order
MQL and PQL estimates of the variance parameters of the random part
have an appreciable downward bias (Rodríguez and Goldman, 1995;
Browne and Draper, 2006; Rodríguez, 2008; Austin, 2010). The second-
order MQL and PQL methods produce parameter estimates with less bias
but, it seems, a higher mean squared error. The biases of MQL and PQL in
the estimation of parameters of the random effects can be diminished by
bootstrapping (Kuk, 1995; van der Leeden et al., 2008), but this leads to
quite computer-intensive procedures. These methods are implemented in
MLwiN, HLM, R packages MASS and nlme, and SAS proc GLIMMIX.
Several other methods have been proposed for parameter estimation in
hierarchical generalized linear models. McCulloch (1997), Ng et al. (2006),
and Jank (2006) gave various algorithms for estimation by simulated
maximum likelihood. Another computer-intensive method is the method of
simulated moments. This method is applied to these models by Gouriéroux and Monfort (1996, Section 3.1.4 and Chapter 5), and an overview of some
recent work is given by Baltagi (2008, Section 11.2). A method based on
the principle of indirect inference was proposed by Mealli and Rampichini
(1999).
Various simulation studies have been done comparing different
estimation procedures for hierarchical generalized linear models, mostly
focusing on multilevel logistic regression: Rodríguez and Goldman (1995,
2001), Browne and Draper (2006), Rodríguez (2008), and Austin (2010).
These are the references on which the statements in this section about the
qualities of the various estimation procedures are based.
17.2.6 Aggregation
If the explanatory variables assume only a few values, then it is advisable to aggregate the individual 0–1 data, within the level-two units, to success counts depending on the values of the explanatory variables. This will improve the
speed and stability of the algorithm and reduce memory use. This is carried
out as follows.
For a random intercept model with a small number of discrete
explanatory variables X1, . . . , Xr, let L be the total number of combinations
of values (x1, . . . , xr). All individuals with the same combination of values
(x1, . . . , xr) are treated as one subgroup in the data. They all have a
common success probability, given by (17.15). Thus, each level-two unit
includes L subgroups, or fewer if some of the combinations do not occur in
this level-two unit. Aggregation is advantageous if L is considerably less
than the average group size nj.
Denote by nj+(x1, . . ., xr) the number of individuals in level-two unit j belonging to the subgroup with values (x1, . . ., xr), and by Yj+(x1, . . ., xr) the number of individuals among these who yielded a success, that is, for
whom Yij = 1. Then Yj+(x1,. . ., xr) has the binomial distribution with
binomial denominator (‘number of trials’) nj+(x1,. . ., xr) and success
probability given by (17.15), which is the same for all individuals i in this
subgroup. The multilevel analysis is now applied with these subgroups as
the level-one units. Subgroups with nj+(x1,. . ., xr) = 0 can be omitted from
the data set.
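As an illustration of this aggregation step, the following Python sketch collapses toy 0–1 data (hypothetical values, one discrete explanatory variable) into the subgroup counts nj+ and Yj+ described above.

```python
from collections import defaultdict

# Toy individual-level data: tuples (group j, x1, y) with y a 0-1 outcome.
rows = [
    (1, 0, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0),
    (2, 0, 0), (2, 1, 1), (2, 1, 1), (2, 1, 0),
]

# Collapse rows sharing the same (j, x1) into [number of trials, successes].
counts = defaultdict(lambda: [0, 0])  # (j, x1) -> [n_j+, Y_j+]
for j, x1, y in rows:
    counts[(j, x1)][0] += 1
    counts[(j, x1)][1] += y

aggregated = sorted((j, x1, n, y) for (j, x1), (n, y) in counts.items())
for record in aggregated:
    print(record)
# Each record (j, x1, n, y) is one binomial observation: n trials, y successes.
```

The multilevel analysis then uses these binomial records as level-one units, which shortens the data set whenever the number of covariate combinations is smaller than the group sizes.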
17.3 Further topics on multilevel logistic
regression
There are now two random group effects, the random intercept U0j and the
random slope U1j. It is assumed that both have a zero mean. Their variances
are denoted, respectively, by τ02 and τ12 and their covariance is denoted by
τ01.
Example 17.4 Random slope model for religious attendance.
Continuing the example of religious attendance in 59 countries, Table 17.3 presents a model which
has the same fixed part as the model of Table 17.2, and random slopes for income and education.
Recall that income is standardized within countries and education is a within-country deviation
variable. Estimating a model with a random slope for education was successful for this relative
measure of education, but not when the grand-mean-centered education variable was used; this may
be a signal of misspecification of the latter model.
The deviance goes down by 115,969.9 – 115,645.7 = 324.2, a huge improvement for five extra
parameters. Also when each of the random slope variances is considered separately, the gain in
deviance is considerable (results not shown here). With such large sample sizes per country, the
regression coefficients within each country are estimated very precisely and it is not surprising to
find evidence for slope heterogeneity.
For education as well as income, the slope standard deviation is larger in absolute magnitude than
the estimated fixed effect, which is negative for both variables. This implies that for most countries
the regression coefficient per country is negative, but for a nonnegligible number of countries it is
positive. For example, the regression coefficient of income is −0.074 + U1j, which has an estimated normal distribution with mean −0.074 and standard deviation 0.063; this leads to a probability of 0.12 that a randomly chosen country has a positive coefficient.
The standard errors of the fixed effects of education and income are considerably larger here than
in the random intercept model of Table 17.2. The random slope model must be considered to be more
realistic, and therefore the larger standard errors are a better representation of the uncertainty about
these fixed parameters, which are the estimated average effects in the population of all countries.
This suggests that for other level-one variables a random slope could, or should, also be considered,
and their estimated coefficients might then also get higher standard errors. For a data set with very
large level-two units such as this one, the two-step approach based on country-by-country equations
of the type (3.38)–(3.40) – but of course with a larger number of predictor variables – may be more
suitable than the approach by a hierarchical generalized linear model, because less stringent
assumptions are being made about the similarity of many parameters across the various different
countries (cf. Achen, 2005).
The standard logistic distribution has mean 0 and variance π2/3 = 3.29. When it is assumed that Rij has
this distribution, the logistic random intercept model (17.15) is equivalent
to the threshold model defined by (17.17) and (17.18).
To represent the random slope model (17.16) as a threshold model, we
define
where Rij has a logistic distribution. It then follows that
Since the logit and the logistic functions are mutual inverses, the last
equation is equivalent to (17.16).
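The equivalence between the threshold representation and the logistic model can be checked numerically. The Python sketch below (with an arbitrary, hypothetical value for the linear predictor) draws logistic residuals by inverse-transform sampling and confirms that the probability of the underlying variable exceeding the threshold 0 matches the logistic transform of the linear predictor.

```python
import math
import random

random.seed(1)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_residual():
    """Draw R from the standard logistic distribution (mean 0, variance pi^2/3)."""
    u = random.random()
    return math.log(u / (1.0 - u))

lin = 0.8        # hypothetical value of the linear predictor
n = 200_000
# Threshold model: success if the underlying variable lin + R exceeds 0.
hits = sum(1 for _ in range(n) if lin + logistic_residual() > 0)
print(round(hits / n, 3), round(logistic(lin), 3))  # the two should be close
```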
If the residual Rij has a standard normal distribution with unit variance,
then the probit link function is obtained. Thus, the threshold model which
specifies that the underlying variable has a distribution according to the
hierarchical linear model of Chapters 4 and 5, with a normally distributed
level-one residual, corresponds exactly to the multilevel probit regression
model. Since the standard deviation of Rij is π/√3 ≈ 1.81 for the logistic model and 1 for the probit model, the fixed estimates for the logistic model will tend to be about 1.81 times as large as for the probit model, and the variance parameters of the random part about π2/3 = 3.29 times as large (but see Long (1997, p. 48), who notes that in practice the proportionality constant for the fixed estimates is closer to 1.7).
These two definitions are different and will lead to somewhat different
outcomes. For example, for the empty model for the religious attendance
data presented in Table 17.1, the first definition yields
0.0404/(0.0404+0.1414) = 0.22, whereas the second definition leads to the
value 1.896/(1.896+3.290) = 0.37. In this case, the difference is large,
underscoring the rather arbitrary nature of the definition of these
coefficients.
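The two computations in this example can be reproduced directly from the variance components reported in the text; the sketch below simply redoes the arithmetic for the empty model of Table 17.1.

```python
# First definition: computed from the variance components reported for the
# probabilities in the text (0.0404 between groups, 0.1414 within).
var_between_probs = 0.0404
var_within_probs = 0.1414
icc_prob = var_between_probs / (var_between_probs + var_within_probs)

# Second definition: on the latent (logistic) scale, with the level-one
# residual variance fixed at pi^2/3 = 3.29.
tau0_sq = 1.896
sigma_r_sq = 3.29
icc_latent = tau0_sq / (tau0_sq + sigma_r_sq)

print(round(icc_prob, 2), round(icc_latent, 2))  # 0.22 and 0.37
```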
An advantage of the second definition is that it can be directly extended
to define the residual intraclass correlation coefficient, that is, the intraclass
correlation which controls for the effect of explanatory variables. The
example can be continued by moving to the model in Table 17.2. The
residual intraclass correlation controlling for the variables in this model is
1.177/(1.177+3.290) = 0.26, lower than the raw intraclass correlation
coefficient.
For the multilevel probit model, the second definition of the intraclass correlation (and its residual version) leads to

ρI = τ02/(τ02 + 1),

since this model fixes the level-one residual variance of the unobservable variable to 1.
This variable is also called the linear predictor for Y. Its variance is denoted by σF2. The intercept variance is τ02, and the level-one residual variance is denoted by σR2. Recall that σR2 = π2/3 = 3.29 for the logistic model and σR2 = 1 for the probit model.
For a randomly drawn level-one unit i in a randomly drawn level-two
unit j, the X-values are randomly drawn from the corresponding population
and hence the total variance of the underlying variable is equal to σF2 + τ02 + σR2. The explained part of this variance is σF2 and the unexplained part is τ02 + σR2. Of this unexplained variation, τ02 resides at level two and σR2 at level one. Hence the proportion of explained variation can be defined by

R2dicho = σF2 / (σF2 + τ02 + σR2).    (17.22)
The corresponding definition of the residual intraclass correlation is ρI = τ02/(τ02 + σR2).
The fixed part of the model can be written using the symbols X1, . . ., X12 for the predictor variables. This constructed variable has variance 1.049 (a value obtained by calculating this new variable and computing its variance). Equation (17.22) yields the proportion of explained variance R2dicho = 1.049/(1.049 + 1.082 + 3.29) = 0.19.
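The explained-variance computation of equation (17.22) is plain arithmetic on the three variance components quoted above, as the following sketch shows.

```python
# Variance components for the dichotomous-outcome model, from the text.
sigma_f_sq = 1.049   # variance of the linear predictor (fixed part)
tau0_sq = 1.082      # intercept variance
sigma_r_sq = 3.29    # pi^2/3, the fixed logistic level-one residual variance

total = sigma_f_sq + tau0_sq + sigma_r_sq
r2_dicho = sigma_f_sq / total   # equation (17.22)
print(round(r2_dicho, 2))       # 0.19
```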
Example 17.6 Taking a science subject in high school.
This example continues the analysis of the data set of Example 8.3 about the cohort of pupils entering
secondary school in 1989, studied by Dekkers et al. (2000). The focus now is on whether the pupils
chose at least one science subject for their final examination. The sample is restricted to pupils in
general education (excluding junior vocational education), and to only those who progressed to their
final examination (excluding drop-outs and pupils who repeated grades once or twice). This left
3,432 pupils distributed over 240 secondary schools. There were 736 pupils who took no science
subjects, 2,696 who took one or more.
A multilevel logistic regression model was estimated by R package lme4 and the MIXOR
program. Both gave almost the same results; the MIXOR results are presented. Explanatory variables
are gender (0 for boys, 1 for girls) and minority status (0 for children of parents born in industrialized
countries, 1 for other countries). The results are shown in Table 17.4.
and the variance of this variable in the sample is . Therefore the explained proportion of
variation is
In other words, gender and minority status explain about 13% of the variation in whether the pupil
takes at least one science subject for the high school exam.
The unexplained proportion of variation, 1−0.13 = 0.87, can be written as
which represents the fact that 11% of the variation is unexplained variation at the school level and
76% is unexplained variation at the pupil level. The residual intraclass correlation is ρI =
0.481/(0.481 + 3.29) = 0.13.
17.3.5 Consequences of adding effects to the model
When a random intercept is added to a given logistic or probit regression
model, and also when variables with fixed effects are added to such a
model, the effects of earlier included variables may change. The nature of
this change, however, may be different from such changes in OLS or
multilevel linear regression models for continuous variables.
This phenomenon can be illustrated by continuing the example of the
preceding section. Table 17.5 presents three other models for the same data,
in all of which some elements were omitted from Model 1 as presented in
Table 17.4.
Models 2 and 3 include only the fixed effect of gender; Model 2 does
not contain a random intercept and therefore is a single-level logistic
regression model, while Model 3 does include the random intercept. The
deviance difference (χ2 = 3,345.15 – 3,251.86 = 93.29, df = 1) indicates that
the random intercept variance is significantly positive. But the sizes of the
fixed effects increase in absolute value when adding the random intercept to
the model, both by about 8%. Gender is evenly distributed across the 240
schools, and one may wonder why the absolute size of the effect of gender
increases when the random school effect is added to the model.
Model 4 differs from Model 1 in that the effect of gender is excluded.
The fixed effect of minority status in Model 4 is −0.644, whereas in Model
1 it is −0.727. The intercept variance in Model 4 is 0.293 and in Model 1 it
is 0.481. Again, gender is evenly distributed across schools and across the
majority and minority pupils, and the question is how to interpret the fact
that the intercept variance, that is, the unexplained between-school variation, rises, and that the effect of minority status also becomes larger in absolute value, when the effect of gender is added to the model.
The explanation can be given on the basis of the threshold
representation. If all fixed effects γh and also the random intercept U0j and
the level-one residual Rij were multiplied by the same positive constant c,
then the unobserved variable would also be multiplied by c. This corresponds to multiplying the variances τ02 and σR2 by c2. However, it follows from (17.17) that the observed outcome Yij would not be affected, because multiplying the unobserved variable by a positive constant c does not change its sign. This shows that the regression
parameters and the random part parameters of the multilevel logistic and
probit models are meaningful only because the level-one residual variance
σR2 has been fixed to some value; but this value is more or less arbitrary
because it is chosen merely by the convention of σR2 = π2/3 for the logistic
and σR2 = 1 for the probit model.
The meaningful parameters in these models are the ratios between the
regression parameters γh, the random effect standard deviations τ0 (and
possibly τ1, etc.), and the level-one residual standard deviation σR. Armed
with this knowledge, we can understand the consequences of adding a
random intercept or a fixed effect to a logistic or probit regression model.
When a single-level logistic or probit regression model has been estimated, the random variation of the unobserved variable in the threshold model is σR2. When subsequently a random intercept is added, this random variation becomes σR2 + τ02. For explanatory variables that are
evenly distributed between the level-two units, the ratio of the regression
coefficients to the standard deviation of the (unexplained) random variation
will remain approximately constant. This means that the regression coefficients will be multiplied by about the factor

√((σR2 + τ02)/σR2).

In the comparison between Models 2 and 3 above, this factor is approximately 1.08. This is indeed close to the number by which
the regression coefficients were multiplied when going from Model 2 to
Model 3.
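The rescaling factor can be computed directly. In the sketch below the intercept variance of 0.55 is a hypothetical value chosen only to reproduce a factor of about 1.08 (the actual estimate for Model 3 is not quoted in this passage).

```python
import math

def rescale_factor(sigma_r_sq, tau0_sq):
    """Approximate factor by which regression coefficients grow when a random
    intercept with variance tau0^2 is added (for evenly distributed X)."""
    return math.sqrt((sigma_r_sq + tau0_sq) / sigma_r_sq)

# For the logistic model sigma_R^2 is fixed at pi^2/3 = 3.29. A hypothetical
# intercept variance of about 0.55 gives a factor of roughly 1.08, matching
# the change observed between Models 2 and 3.
print(round(rescale_factor(3.29, 0.55), 2))
```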
It can be concluded that, compared to single-level logistic or probit
regression analysis, including random intercepts tends to increase (in
absolute value) the regression coefficients. For single-level models for
binary variables, this was discussed by Winship and Mare (1984). In the
biostatistical literature, this is known as the phenomenon where population-
averaged effects (i.e., effects in models without random effects) are closer
to zero than cluster-specific effects (which are the effects in models with
random effects). Further discussions can be found in Neuhaus et al. (1991),
Neuhaus (1992), Diggle et al. (2002, Section 7.4), and Skrondal and Rabe-
Hesketh (2004, Section 4.8). Bauer (2009) proposes a way to put the
estimates from different models on a common scale.
Now suppose that a multilevel logistic or probit regression model has
been estimated, and the fixed effect of some level-one variable Xr+1 is
added to the model. One might think that this would lead to a decrease in
the level-one residual variance σR2. However, this is impossible, as this residual variance is fixed; instead, the estimates of the other regression coefficients will tend to become larger in absolute value, and the intercept variance (and slope variances, if any) will also tend to become larger. If the level-one variable Xr+1 is uncorrelated with the other included
fixed effects and also is evenly distributed across the level-two units (i.e.,
the intraclass correlation of Xr+1 is about nil), then the regression coefficients and the standard deviations of the random effects will all increase by about the same factor. Correlations between Xr+1 and other variables, or positive intraclass correlation of Xr+1, may distort this pattern to a greater or lesser extent.
This explains why the effect of minority status and the intercept
variance increase when going from Model 4 to Model 1. The standard
deviation of the random intercept increases by a larger factor than the
regression coefficient of minority status, however. This might be related to
an interaction between the effects of gender and minority status and to a
very even distribution of the sexes across schools (cf. Section 7.1).
17.4 Ordered categorical variables
Variables that have as outcomes a small number of ordered categories are
quite common in the social and biomedical sciences. Examples of such
variables are outcomes of questionnaire items (with outcomes such as
‘completely disagree’, ‘disagree’, ‘agree’, ‘completely agree’), a test scored
by a teacher as ‘fail’, ‘satisfactory’, or ‘good’, etc. This section is about
multilevel models where the dependent variable is such an ordinal variable.
When the number of categories is two, the dependent variable is
dichotomous and Section 17.2 applies. When the number of categories is
rather large (5 or more), it may be possible to approximate the distribution
of the residuals by a normal distribution and apply the hierarchical linear
model for continuous outcomes. The main issue in such a case is the
homoscedasticity assumption: is it reasonable to assume that the variances
of the random terms in the hierarchical linear model are constant? (The
random terms in a random intercept model are the level-one residuals and
the random intercept, Rij and U0j in (4.8).) To check this, it is useful to
investigate the skewness of the distribution. If in some groups, or for some
values of the explanatory variables, the dependent variable assumes
outcomes that are very skewed toward the lower or upper end of the scale,
then the homoscedasticity assumption is likely to be violated.
If the number of categories is small (3 or 4), or if it is between 5 and,
say, 10, and the distribution cannot well be approximated by a normal
distribution, then statistical methods for ordered categorical outcomes can
be useful. For single-level data such methods are treated, for example, in
McCullagh and Nelder (1989) and Long (1997).
It is usual to assign numerical values to the ordered categories, taking
into account that the values are arbitrary. To have a notation that is
compatible with the dichotomous case of Section 17.2, the values for the
ordered categories are defined as 0, 1 , . . . , c – 1, where c is the number of
categories. Thus, in the four-point scale mentioned above, ‘completely
disagree’ would get the value 0, ‘disagree’ would be represented by 1,
‘agree’ by 2, and ‘completely agree’ by the value 3. The dependent variable
for level-one unit i in level-two unit j is again denoted Yij, so that Yij now
assumes values in the set {0, 1 , . . . , c – 1}.
A very useful model for this type of data is the multilevel ordered
logistic regression model, also called the multilevel ordered logit model or
the multilevel proportional odds model; and the closely related multilevel
ordered probit model. These models are discussed, for example, by Agresti
and Natarajan (2001), Gibbons and Hedeker (1994), Hedeker (2008),
Hedeker and Gibbons (2006, Chapter 10), Rabe-Hesketh and Skrondal
(2008, Chapter 7), and Goldstein (2011). A three-level model was discussed
by Gibbons and Hedeker (1997).
These models can be formulated as threshold models as in Section
17.3.2, now with c –1 thresholds rather than one. The real line is divided by
the thresholds into c intervals (of which the first and the last have infinite
length), corresponding to the c ordered categories. The first threshold is θ0 = 0, and the higher thresholds are denoted θ1, . . ., θc−2. Threshold θk defines the boundary between the intervals corresponding to observed outcomes k and k + 1 (for k = 0, 1, . . ., c − 2). The assumed unobserved underlying continuous variable is related to the observed categorical variable Y by the 'measurement model' defined as
where σF2 is the variance of the fixed part (or the linear predictor) while σR2
is π2/3 = 3.29 for the logistic model and 1 for the probit model.
The threshold parameters are usually of secondary importance and reflect the marginal probabilities of the outcome categories: if category k has a low probability, then θk−1 will be not much less than θk. For more
discussion about the interpretation of the fixed parameters we refer to the
literature on the single-level version of this model, such as Long (1997).
The model can be extended with a random slope in a straightforward
manner. However, estimation algorithms for these models are less stable
than for the standard hierarchical linear model, and it is not uncommon that
it is impossible to obtain converging parameter estimates for models with
even only one random slope.
What was said in Section 17.3.5 about the effect of adding level-one
variables to a multilevel logistic regression model is valid also for the
multilevel ordered logit and probit models. When some model has been
fitted and an important level-one variable is added to this model, this will
tend to increase the level-two variance parameters (especially if the newly
added variable explains mainly within-group variation), the threshold
parameters, and the absolute sizes of the regression coefficients (especially
for variables that are uncorrelated with the newly added variable). This is
also discussed in Fielding (2004b) and Bauer (2009).
These models can be estimated by the methods discussed above in
Section 17.2.5. Various of these procedures are implemented in the
programs MLwiN, HLM, MIXOR, Stata, and SAS (see Chapter 18). Some computer programs do not set the first threshold equal to 0, but rather the intercept. Thus, γ0 = 0 and this parameter is not estimated, while instead θ0 is not equal to 0 and is estimated. This is a reparametrization of the model and
yields parameters which can simply be translated into one another, since it
follows from (17.24) and (17.25) that subtracting the same number from all thresholds as well as from γ0 yields the same distribution of the observed
variables Yij.
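This shift invariance can be demonstrated numerically. The sketch below uses a hypothetical 4-category ordered logit model; shifting all thresholds and the intercept by the same constant leaves every category probability unchanged.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def category_probs(lin, thresholds):
    """P(Y = k) in an ordered logit model: the underlying variable lin + R
    (R standard logistic) falls between successive thresholds."""
    cdf = [logistic(t - lin) for t in thresholds]
    cdf = [0.0] + cdf + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]

# Hypothetical 4-category model with the first threshold fixed at 0.
thresholds = [0.0, 1.2, 2.5]
lin = 0.7 + 0.4          # intercept plus some covariate contribution

# Shifting all thresholds and the linear predictor by the same constant c
# leaves the category probabilities unchanged (the reparametrization above).
c = 0.7
probs_a = category_probs(lin, thresholds)
probs_b = category_probs(lin - c, [t - c for t in thresholds])
assert all(abs(a - b) < 1e-12 for a, b in zip(probs_a, probs_b))
print([round(p, 3) for p in probs_a])
```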
The variance of the random intercept is again denoted by τ02. This model is
treated in Diggle et al. (2002, Section 9.4), Hedeker and Gibbons (2006,
Chapter 12), Goldstein (2011, Section 4.5), and Skrondal and Rabe-Hesketh
(2004, Chapter 11).
The multilevel Poisson regression model can be estimated by various
multilevel software packages (see Section 17.2.5 and Chapter 18), including
gllamm, HLM, MIXPREG/SuperMix, MLwiN, various R packages
including lme4 and glmmADMB, SAS, Stata, and Latent Gold. The
estimation methods, as with those for multilevel logistic regression, include
numerical integration (in the version of Gaussian quadrature, with or
without adaptive node placement; see Rabe-Hesketh et al., 2005), Laplace
approximation (Raudenbush et al., 2000), and various Taylor series
approximations (Goldstein, 2011). The numerical integration method and
the Laplace approximation provide not only parameter estimates but also a
deviance statistic which can be used for hypothesis testing.
To transform the linear model back to the expected counts, the inverse
transformation of the natural logarithm must be used, which is the
exponential function exp(x) = ex. This function has the property that it transforms sums into products:

exp(a + b) = exp(a) exp(b).
Therefore the explanatory variables and the level-two random effects in the
(additive) multilevel Poisson regression model have multiplicative effects
on the expected counts. For example, if there is only r = 1 explanatory
variable, equation (17.28) is equivalent to
Therefore, each additional unit of X1 will have the effect of multiplying the
expected count by eγ1. Similarly, in a group with a high intercept, for example two standard deviations above average so that U0j = 2τ0, the expected count will be e2τ0 times as high as in a group with an average value, U0j = 0, of the intercept.
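The multiplicative interpretation can be made concrete with a small sketch; the coefficient and random-intercept values below are purely hypothetical.

```python
import math

# Poisson log-linear model: E(Y) = exp(gamma0 + gamma1 * x1 + U0j),
# so each additive term on the log scale acts multiplicatively on the counts.
gamma0, gamma1 = 0.4, 0.25   # hypothetical coefficients
u0j = 0.3                    # hypothetical random intercept value

mu_x = math.exp(gamma0 + gamma1 * 1 + u0j)
mu_x_plus_1 = math.exp(gamma0 + gamma1 * 2 + u0j)

# One extra unit of X1 multiplies the expected count by exp(gamma1):
assert abs(mu_x_plus_1 / mu_x - math.exp(gamma1)) < 1e-12
print(round(math.exp(gamma1), 3))
```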
In models for counts it is quite usual that there is a variable D that is
known to be proportional to the expected counts. For example, if the count
Yij is the number of events in some time interval of nonconstant length dij, it
often is natural to assume that the expected count is proportional to this
length of the time period. If in the example of counts of medical problems
there are several doctors each with his or her own population of patients,
then D could be the size of the patient population of the doctor. In view of
equation (17.29), in order to let the expected count be proportional to D,
there should be a term ln(dij) in the linear model for ln(Lij), with a
regression coefficient fixed to 1. Such a term is called an offset in the linear
model (see McCullagh and Nelder, 1989). Goldstein (2011, Section 4.5)
suggests that such offset variables be centered to improve the numerical
properties of the estimation algorithm.
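The effect of an offset can be sketched as follows; the coefficients are hypothetical, and the point is only that fixing the coefficient of ln(dij) at 1 makes the expected count proportional to the exposure dij.

```python
import math

# Model with offset: ln E(Y_ij) = ln(d_ij) + gamma0 + gamma1 * x_ij,
# where the coefficient of ln(d_ij) is fixed at 1 (not estimated).
gamma0, gamma1 = -2.0, 0.5   # hypothetical coefficients

def expected_count(d, x):
    """Expected count for exposure d and covariate value x."""
    return math.exp(math.log(d) + gamma0 + gamma1 * x)

# Doubling the exposure doubles the expected count, other things equal:
assert abs(expected_count(2.0, 1.0) / expected_count(1.0, 1.0) - 2.0) < 1e-9
print(round(expected_count(1.0, 1.0), 4))
```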
Example 17.8 Memberships in voluntary organizations.
Memberships in voluntary organizations may be regarded as a measurement of social activity and a
dimension of social capital. In this example a subset is used of the data also analyzed in an
international comparative study by Peter and Drobnič (2010). Here only the data for the Netherlands
are studied. They were collected in the Dutch part of the European Social Survey (ESS), in the
‘Citizenship, Involvement and Democracy’ module, during 2002–2003.
The dependent variable is the number of types of association of which the respondent is a
member. Twelve types of organizations are mentioned, ranging from sports clubs through political
parties to consumer organizations. Thus, the count can range from 0 to 12. This data set has a
regional identifier which ranges over 40 regions for the Netherlands. Respondents aged between 25
and 65 years were selected, and a very small number with missing values were discarded. This left a
total of 1,738 respondents. The nesting structure is respondents nested in regions.
The number of memberships ranges from 0 to 10, with a mean of 2.34 and a standard deviation of
1.78. The variance is 3.16. For the Poisson distribution according to (17.27) in the population the
mean and variance are equal. The fact that here dispersion is greater than would be expected for the
Poisson distribution points to heterogeneity, which may have to do with individual differences and/or
differences between regions. The Poisson distribution can assume all nonnegative values while the
outcome variable here is limited by the questionnaire to a maximum of 12, but in view of the low
mean and standard deviation this in itself is not a problem for applying the Poisson distribution.
Deviating from our practice in earlier chapters, here we report for the random part the standard
deviations and correlation parameters, rather than the variances and covariance. There are two
reasons for this. One is the fact that the numbers reported are smaller. The other is the use of the R
packages lme4 and glmmADMB for the computations, which use these parameters for reporting.
The empty model is reported as Model 1 in Table 17.7. Explorations showed that religion (coded
as Protestant versus other), gender, and age had effects, and the effect of age was well described by a
quadratic function. Furthermore, there appeared to be a random slope for gender and an interaction
between age and gender. This is reported as Model 2. Age is centered at 40 years. To obtain
parameters that are not too small, age is measured in decades: for example, 30 years is coded as –1,
40 years as 0, 50 years as +1.
A random intercept model with the same fixed effects as Model 2 has deviance 6531.9. The deviance difference is 6531.9 – 6522.6 = 9.3. Against a chi-squared distribution with 2 degrees of freedom this has p < 0.01, even without the more powerful test (cf. Section 6.2.1). This shows there is clear evidence for a random slope for gender: the effect of gender on the number of memberships differs between regions.
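The p-value for this deviance difference can be verified directly: for a chi-squared distribution with 2 degrees of freedom the upper tail probability has the closed form exp(−x/2), so no statistical library is needed. A sketch:

```python
import math

# Deviance difference between the random intercept model (6531.9)
# and the random slope model (6522.6), tested against chi-squared
# with 2 df (slope variance plus intercept-slope covariance).
deviance_diff = 6531.9 - 6522.6  # = 9.3

# For 2 degrees of freedom the chi-squared survival function is exp(-x/2).
p_value = math.exp(-deviance_diff / 2)
print(f"chi-squared = {deviance_diff:.1f}, p = {p_value:.4f}")
```

This gives p of about 0.0096, just below 0.01; the more powerful mixture test of Section 6.2.1 would yield an even smaller p-value.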
We see that Protestants have more memberships on average. To assess the contribution of gender, we must take into account that there is an interaction between gender and age, as well as a random slope for gender. Since age is centered at 40 years, the main effect of gender as well as the intercept variance refer to age 40. It can be concluded that, on average, females have fewer memberships at 40 years. The total contribution of being female is

(main effect of female) + (female × age interaction) × age + U1j

(with age measured in decades and centered at 40), which is 0 for age 28 if U1j = 0. It can be concluded that in an average region, females have on average the same number of memberships as males at age 28, and the average increase by age is smaller for females than for males. But the male–female difference depends considerably on region, as the between-region standard deviation of the gender effect is larger than the main effect of gender. Therefore there are many regions where at age 40 females have more memberships than males.
17.7 Glommary
Dichotomous variables. Also called binary variables, these are variables
with two possible values, such as ‘yes’ and ‘no’; often the values are
formally called ‘success’ and ‘failure’. For such dependent variables, a
hierarchical linear model with homoscedastic normally distributed
residuals is not appropriate, in the first place because the residual
variance of a binary dependent variable will depend on the predicted
value. The same holds for dependent variables with a small number of
ordered numerical categories, because the residual variance must
become small when the predicted value approaches either of the
extremes of the range of values; and for nonnegative integer (count)
data, because the residual variance must become small when the
predicted value approaches 0.
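The first point can be made concrete: a binary outcome with success probability p has residual (Bernoulli) variance p(1 − p), which is maximal at p = 0.5 and necessarily shrinks toward 0 as the predicted value approaches either extreme, so homoscedasticity cannot hold. A minimal illustration:

```python
# Residual variance of a binary outcome with success probability p.
# It peaks at p = 0.5 and vanishes near 0 and 1, so a model assuming
# constant residual variance is not appropriate for binary data.
def bernoulli_variance(p: float) -> float:
    return p * (1.0 - p)

for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"p = {p:.2f}  variance = {bernoulli_variance(p):.4f}")
```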
Testing that the intraclass correlation for binary variables is 0. This can
be done by applying the well-known chi-squared test. Another test was
also mentioned, which can always be applied when there are many
groups (even if they are small, which may lead to difficulties for the
chi-squared test) as long as there is not a very small number of groups
making up almost all of the data.
Proportion of explained variance for dichotomous dependent variables.
This can be defined as the proportion of explained variance for the
underlying continuous variable in the threshold representation.
Event history analysis. The study of durations until some event occurs.
One approach is to transform data to the person–period format, which
is a two-level format with time periods nested within individuals, and
with a binary outcome which is 0 if the event has not yet occurred and 1 if it has; each individual contributes periods only up to and including the first 1 response (if any). A multilevel nesting structure, with persons nested in
groups (higher-level units), is then represented by a three-level data
structure (periods within persons within groups) which can be modeled
by multilevel logistic regression, with random effects at level three.
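The person–period transformation described here can be sketched in a few lines. The input format assumed below (one record per person, with a duration and an event/censoring indicator, and these particular field names) is an illustration, not a fixed convention:

```python
# Transform duration data (one record per person) into person-period
# format: one row per person per time period, with a binary outcome that
# is 1 only in the period in which the event occurred. Censored persons
# contribute only 0 responses. Field names are illustrative assumptions.
def to_person_period(persons):
    rows = []
    for p in persons:
        for t in range(1, p["duration"] + 1):
            event = 1 if (t == p["duration"] and p["event"]) else 0
            rows.append({"id": p["id"], "period": t, "y": event})
            if event:
                break  # include only the first 1 response per person
    return rows

persons = [
    {"id": 1, "duration": 3, "event": True},   # event in period 3
    {"id": 2, "duration": 2, "event": False},  # censored after period 2
]
rows = to_person_period(persons)
```

A group identifier column could be carried along in the same way, giving the three-level (periods within persons within groups) structure mentioned above.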
1. This is because in further modeling of this data set it will be preferable to present standard deviations rather than variances of random slopes.
2. Rather than the double-subscript notation γhk used earlier in Chapters 4 and 5, we now use a single-subscript notation γh, analogous to (5.15), to obtain a relatively simple notation.
18 Software
Almost all procedures treated in this book can be carried out by standard
software for multilevel statistical models. This of course is intentional,
since this book covers those parts of the theory of the multilevel model that
can be readily applied in everyday research practice. However, things
change rapidly. Some of the software discussed in the previous edition of
this book is no longer available. On the other hand, new software packages
are shooting up like mushrooms. The reader is therefore advised to keep
track of the changes that can be found at the various websites we mention in
this chapter – although we have to add that some of these websites and even
internet addresses tend to change. But then again, search engines will
readily guide the curious multilevel researcher to the most recent sites.
Currently most details on the specialized multilevel software packages
can be found via the links provided on the homepage of the Centre for
Multilevel Modelling at the University of Bristol; see
http://www.bristol.ac.uk/cmm/. At this site one can also find reviews of the
multilevel software packages. Chapter 18 of Goldstein (2011) also provides
a list of computer programs for multilevel analysis.
18.1.1 HLM
HLM was originally written by Bryk et al. (1996), and the theoretical
background behind most applications can be found in Raudenbush and
Bryk (2002). The main features of HLM are its interactive operation
(although one can also run the program in batch mode) and the fact that it is
rather easy to learn. Therefore it is well suited for undergraduate courses
and for postgraduate courses for beginners. The many options available also
make it a good tool for professional researchers. Information is obtainable
from the website, http://www.ssicentral.com/hlm/, which also features a
free student version. West et al. (2007) give an introduction to hierarchical
linear modeling with much attention to implementation in HLM.
Input consists of separate files for each level in the design, linked by
common identifiers. In a simple two-level case, for example, with data
about students in schools, one file contains all the school data with a school
identification code, while another file contains all the student data with the
school identification code for each student. The input can come from
system files of SPSS, SAS, SYSTAT or Stata, or may be given in the form
of ASCII text files.
Once data have been read and stored into a sufficient statistics file (a
kind of system file), there are three ways to work with the program. One
way is to run the program interactively (answering questions posed by the
program). Another is to run the program in batch mode. Batch and
interactive modes can also be combined. Finally, one can make full use of
the graphical interface. In each case the two-step logic of Section 6.4.1 is
followed. HLM does not allow for data manipulation, but both the input and
output can come from, and can be fed into, SPSS, SAS, SYSTAT, or Stata.
HLM does not go beyond four levels. It can be used for practically all the
analyses presented in this book, with the exception of multiple membership
models. Almost all examples in this book can be reproduced using HLM
version 7. Some interesting features of the program are the ability to test
model assumptions directly, for example, by test (10.5) for level-one
heteroscedasticity, and the help provided to construct contrast tests.
Furthermore, the program routinely asks for centering of predictor
variables, but the flipside of the coin is – if one opts for group mean
centering – that group means themselves must have been calculated outside
HLM, if one wishes to use these as level-two predictor variables.
A special feature of the program is that it allows for statistical meta-
analysis (see Section 3.7) of research studies that are summarized by only
an effect size estimate and its associated standard error (called in
Raudenbush and Bryk, 2002, the ‘V-known problem’). Other special
features are the analysis of data where explanatory variables are measured
with error (explained in Raudenbush and Sampson, 1999a), and the analysis
of multiply imputed data as discussed in Chapter 9. Note, however, that the
imputations have to be done with other specialized software packages. In addition, HLM offers facilities for the analysis of a special case of multiply imputed data, namely the multilevel analysis of plausible values. These are imputed data on the dependent variable for incomplete designs, such as designs with booklet rotations as used in the PISA studies.
As a special feature it allows for the modeling of dependent random effects,
a topic not treated in this book.
18.1.2 MLwiN
MLwiN is the most extensive multilevel package, written by researchers
currently working at the Centre for Multilevel Modelling at the University
of Bristol (Rasbash and Woodhouse, 1995; Goldstein et al., 1998; Rasbash
et al., 2009). Current information, including a wealth of documentation, can
be obtained from the website http://www.bristol.ac.uk/cmm/. Almost all the
examples in this book can be reproduced using MLwiN, which allows for
standard modeling of up to five levels. For heteroscedastic models (Chapter
8), the term used in the MLwiN documentation is ‘complex variation’. For
example, level-one heteroscedasticity is complex level-one variation. Next
to the standard IGLS (ML) and RIGLS (REML) estimation methods,
MLwiN also provides bootstrap methods (based on random drawing from
the data set or on random draws from an estimated population distribution),
and extensive implementation of Markov chain Monte Carlo methods
(Browne, 2009) for Bayesian estimation (see Section 12.1). With MLwiN
one may use the accompanying package REALCOM-Impute to impute
missing data in a multilevel model, and analyze these subsequently in
MLwiN. The program MLPowSim (Section 18.3.3) may be used to create
MLwiN macros for simulation-based power analysis.
A nice feature of MLwiN is that the program was built on NANOSTAT,
a statistical environment which allows for data manipulation, graphing,
simple statistical computations, file manipulation (e.g., sorting), etc. Data
manipulation procedures include several handy procedures relating to the
multilevel data structure. Input for MLwiN may be either an ASCII text file
or a system file from Minitab, SPSS, or Stata that contains all the data,
including the level identifiers. MLwiN can even be called directly from
Stata using the runmlwin command (Leckie and Charlton, 2011). Since the
data are available in one file, this implies that all the level-two and higher-
level data are included in disaggregated form at level one. The data are read
into a worksheet, a kind of system file. This worksheet, which can also
include model specifications, variable labels, and results, can be saved and
used in later sessions. The data can also be exported to Minitab, SPSS, or
Stata.
The most obvious way to work with the program is interactively using
the graphical interface. One can, however, also use the ‘command menu’ or
give a series of commands in a previously constructed macro.
MLwiN is the most flexible multilevel software package, but it may
take some time to get acquainted with its features. It is an excellent tool for
professional researchers and statisticians. The macro facilities in particular
provide experienced researchers ample opportunities for applications such
as meta-analysis, multilevel factor analysis, multilevel item response theory
modeling, to name just some examples.
For the second and further random slopes, the formulas for the standard
errors are even more complicated. These formulas can be derived with the multivariate delta method explained, for example, in Bishop et al. (1975, Section 14.6.3).
18.2.2 R
R is an open source language and environment for statistical computing and
graphics similar to the S language. It is highly flexible and freely
downloadable at http://cran.r-project.org/. Due to the open source character
of R one can use procedures that have been developed by others (Ihaka and
Gentleman, 1996). Procedures in R are organized into so-called packages. R
is operated by a command language, and the commands are collected in
scripts. In the initial phase it may require some effort to learn the command
language, but once mastered this has the advantage of flexibility and
reproducibility of results.
There are several packages in R implementing multilevel models. The
main ones are nlme (Pinheiro and Bates, 2000) and lme4 (Bates, 2010; see
also Doran et al., 2007). The nlme package has extensive possibilities for
linear and nonlinear models with normally distributed residuals. The lme4
package is currently still under vigorous development and, in spite of its
title (‘Linear mixed-effects models using S4 classes’), also estimates
hierarchical generalized linear models, using Laplace and other methods.
With both of these packages one can perform most of the analyses treated in
this book. Some texts introducing multilevel analysis using R are, next to
the two books just mentioned, Maindonald and Braun (2007, Chapter 10),
Bliese (2009), Wright and London (2009), Berridge and Crouchley (2011),
and, specifically for multilevel models for discrete data, Thompson (2009,
Chapter 12).
There are several packages that can be used for more limited, but very
useful, purposes. The R packages mlmmm and mice allow multiple
imputation of missing data (see Chapter 9) under a two-level model. The
pamm package contains procedures for simulation-based power analysis.
The program MLPowSim (Section 18.3.3) creates R scripts for simulation-
based power analysis. Various new methods for hierarchical generalized
linear models (Section 17.1) have been made available in R; some of these
packages are HGLMMM, glmmML, glmmADMB, and INLA. Package
multilevel contains some procedures that are especially useful for those
working with multi-item scales. Using the packages R2WinBUGS or glmmBUGS gives access to the WinBUGS program (Section 18.3.7).
18.2.3 Stata
Stata (StataCorp, 2009) contains some modules that permit the estimation
of certain multilevel models (see Rabe-Hesketh and Skrondal, 2008).
Module loneway (‘long oneway’) gives estimates for the empty model. The
xt series of modules are designed for the analysis of longitudinal data (cf.
Chapter 15), but can be used to analyze any two-level random intercept
model. Command xtreg estimates the random intercept model, while xtpred
calculates posterior means. Commands xtpois and xtprobit, respectively,
provide estimates of the multilevel Poisson regression and multilevel probit
regression models (Chapter 17). These estimates are based on the so-called
generalized estimating equations method. A special feature of Stata is the
so-called sandwich variance estimator, also called the robust or Huber
estimator (Section 12.2). This estimator can be applied in many Stata
modules that are not specifically intended for multilevel analysis. For
statistics calculated in a single-level framework (e.g., estimated OLS
regression coefficients), the sandwich estimator, when using the keyword
‘cluster’, computes standard errors that are asymptotically correct under
two-stage sampling. In terms of our Chapter 2, this solves many instances
of ‘dependence as a nuisance’, although it does not help to get a grip on
‘interesting dependence’.
Within Stata one can use the procedure GLLAMM – an acronym for
General Linear Latent And Mixed Models – which was developed by Rabe-
Hesketh et al. (2004, 2005) and which analyzes very general models indeed,
including those of Chapter 17 but also much more general models with
latent variables, such as multilevel structural equation modeling and
multilevel item response theory models. The algorithms use adaptive
quadrature (Section 17.2.5). The package is available at
http://www.gllamm.org, where one can also find the necessary
documentation.
As mentioned above, from Stata one may call MLwiN through the
command runmlwin.
18.3.1 PinT
PinT is a specialized program for calculations of Power in two-level
designs, implementing the methods of Snijders and Bosker (1993). This
program can be used for a priori estimation of standard errors of fixed
coefficients. This is useful in the design phase of a multilevel study, as
discussed in Chapter 11. Being shareware, it can be downloaded with the
manual (Bosker et al., 2003) from
http://www.stats.ox.ac.uk/~snijders/multilevel.htm#progPINT.
18.3.4 Mplus
Mplus is a program with very general facilities for covariance structure
analysis (Muthén and Muthén, 2010). Information about this program is
available at http://www.StatModel.com. This program allows the analysis of
univariate and multivariate two-level data not only with the hierarchical
linear model but also with path analysis, factor analysis, and other structural
equation models. Introductions to this type of model are given by Muthén
(1994) and Kaplan and Elliott (1997).
18.3.6 REALCOM
At the time of writing the people at the Centre for Multilevel Modelling are
developing new software for, as they call it, ‘realistic multilevel modeling’.
Such modeling includes measurement errors in variables, simultaneous
outcomes at various levels in the hierarchy, structural equation modeling,
modeling with imputed data for missing values, and modeling with
misclassifications. The interested reader is referred to
http://www.bristol.ac.uk/cmm/software/realcom for more details and recent
developments.
18.3.7 WinBUGS
A special program which uses the Gibbs sampler is WinBUGS (Lunn et al.,
2000), building on the previous BUGS program (Gilks et al., 1996). Gibbs
sampling is a simulation-based procedure for calculating Bayesian
estimates (Section 12.1). This program can be used to estimate a large
variety of models, including hierarchical linear models, possibly in
combination with models for structural equations and measurement error. It
is extremely flexible, and used, for example, as a research tool by
statisticians; but it can also be used for regular data analysis. Gelman and
Hill (2007) and Congdon (2010) give extensive attention to the use of
WinBUGS for multilevel analysis. The WinBUGS example manuals also
contain many examples of hierarchical generalized linear models.
The program is available with manuals from http://www.mrc-
bsu.cam.ac.uk/bugs/. From R (Section 18.2.2) it is possible to access WinBUGS by using the R package R2WinBUGS or glmmBUGS.
References
generalizability theory, 25
generalized estimating equations, 198, 204
generalized linear models, 290
Gibbs sampling, 138, 173, 331
gllamm, 202, 300, 315
goodness of fit, 174
Greek letters, 5
group size, 56, 61, 188, 189
groups, 17, 42, 74
Kelley estimator, 63
R, 37, 138, 153, 191, 207, 300, 301, 315, 328, 330
R2, see explained variance
random coefficient model, 1
random coefficient models, 75
random coefficients, 2, 41, 46, 47, 71, 75
random effects, 1, 2, 17, 18, 47, 49
comparison of intercepts, 66
discrete, 201–203
random effects ANOVA, 49
random intercept, 46, 49, 54, 72, 74
test, 97–98, 108
random intercept model, 41–73, 114, 249, 280
comparison with OLS model, 54
for dichotomous data, 297–299, 320
multivariate, 283–288
three-level, 67, 284
random part, 55, 75, 87, 153, 158–161
test, 97–101, 108
random slope, 75, 77, 82, 92, 114, 156, 174
explanation, 80–85, 92
test, 97–101, 106, 108, 156
random slope model, 74–93, 116
for dichotomous outcome variable, 302, 320
for longitudinal data, 253, 280
multivariate, 288
random slope variance, 75, 78
interpretation, 77
REALCOM, 138, 330
reliability, 25–26, 39, 62, 180
REML, see residual maximum likelihood
repeated measures, 180, 247, 280
residual intraclass correlation, 52, 61, 75, 187, 252
residual iterated generalized least squares, 61, 89
residual maximum likelihood, 22, 35, 51, 60, 72, 89, 97, 108, 257
residual variance, 51, 76, 80, 153, 160
for dichotomous data, 291
nonconstant, 120, 123, 127
residuals, 27, 43, 46, 48, 51, 71, 75, 80, 83
level-one, 153, 161–165, 170, 174, 175
level-two, 165–171, 174
multivariate, see multivariate residual
non-normal distributions, 199
nonnormal distributions, 172
restricted maximum likelihood, see residual maximum likelihood
RIGLS, see residual iterated generalized least squares
robust estimators, 175
robust standard errors, 197–200, 204
sample
cluster, 216, 231, 245
multistage, 6, 7, 13, 177
simple random, 6
stratified, 216, 231, 245
two-stage, 7, 17, 23, 39, 179, 180, 183, 192, 216, 231, 244
sample size, 23, 176–193
for estimating fixed effects, 180–187
for estimating group mean, 180
for estimating intraclass correlation, 188–190
for estimating population mean, 179–180
for estimating variance parameter, 190–191
sample surveys, 216–246
sampling designs, 7
sampling probabilities, 216, 245–246
sampling weights, 216–246
scaling of, 225, 246
sandwich estimator, 173, 175, 197–200, 204, 220, 238, 246
SAS, 260, 300, 301, 312, 315, 327
Satterthwaite approximation, 95
shift of meaning, 15, 39, 59
shrinkage, 63–64, 66, 67, 73, 89
significance level, 177, 192
simple random sample, 216, 219, 224, 232, 245
slope variance, 92
slopes as outcomes, 1, 80, 92
software, 153, 300, 315, 323–331
spline function, 121, 129, 158, 174, 253, 270, 281
cubic, 270, 272
quadratic, 271
SPSS, 260, 329
standard error, 23, 37, 141, 178–179, 181
comparative, 65, 73
diagnostic, 65, 73, 165
of empirical Bayes estimate, 63, 65
of fixed coefficients, 155, 181
of intercept variance, 191
of intraclass correlation, 21, 188
of level-one variance, 191
of population mean, 24, 179
of posterior mean, 63, 65
of random part parameters, 90, 100, 108, 327
of standard deviation, 90, 100, 327
of variance, 90, 100, 327
standardized coefficients, 53
standardized multivariate residual, see multivariate residual
standardized OLS residuals, 174
Stata, 238, 300, 311, 315, 328–329
success probability, 290, 293, 295, 298, 314
SuperMix, 300, 315, 326
superpopulation model, 3, 236, 246
survey, 216
survey weights, see sampling weights
t-ratio, 59, 94, 108
t-test, 59, 94, 108
for random slope, 156
paired samples, 257, 280
test, 94–101, 163, 177
for random intercept, see random intercept
for random slope, see random slope
test of heterogeneity of proportions, 292, 321
textbooks, 4
three-level model, 67–71, 73, 90–93, 113, 282, 284, 321
for multivariate multilevel model, 282–288
threshold model, 304, 308, 310, 321
total variance, 18
total-group correlation, 32–35
total-group regression, 27–31
transformation
for count data, 315, 322
of dependent variable, 161
of explanatory variables, 157, 173
true score, 25, 63, 80, 180
two-stage sample, see sample, two-stage
type I error, 178, 192, 196, 198
type II error, 178, 192
unexplained variation, 41, 46, 48, 51, 71, 75, 80, 111, 306