
3

Estimation and Hypothesis Testing in Multilevel Regression

Summary

The usual method to estimate the values of the regression coefficients and the intercept
and slope variances is the maximum likelihood estimation method. This chapter gives a
non-technical explanation of maximum likelihood estimation, to enable analysts to make
informed decisions on the estimation options offered by current software. Some alternatives to maximum likelihood estimation, such as generalized least squares and generalized estimating equations, are briefly discussed. Other estimation methods, such as Bayesian estimation methods and bootstrapping, are also briefly introduced in this chapter.
Finally, this chapter describes some procedures that can be used to compare nested and non-
nested models, which are especially useful when variance terms are tested.

3.1 Which Estimation Method?

Estimation of parameters (regression coefficients and variance components) in multilevel modeling is mostly done by the maximum likelihood method. The maximum likelihood
(ML) method is a general estimation procedure, which produces estimates for the population
parameters that maximize the probability (produce the ‘maximum likelihood’) of observing
the data that are actually observed, given the model (cf. Eliason, 1993). Other estimation
methods that have been used in multilevel modeling are generalized least squares (GLS),
generalized estimating equations (GEE), bootstrapping methods and Bayesian methods
such as Markov chain Monte Carlo (MCMC). In this section, we will discuss these methods
briefly.

3.1.1 Maximum Likelihood (ML): Full and Restricted ML Estimation

Maximum likelihood (ML) is the most commonly used estimation method in multilevel
modeling. The results presented in Chapter 2 are all obtained using full ML estimation.
An advantage of the maximum likelihood estimation method is that it is generally robust,
and produces estimates that are asymptotically (i.e., when the sample size approaches
infinity) efficient and consistent. With large samples, ML estimates are usually robust
against mild violations of the assumptions, such as having non-normal errors. Maximum
likelihood estimation proceeds by maximizing a function called the likelihood function.


Two different likelihood functions are used in multilevel regression modeling. One is
full maximum likelihood (FML); in this method, both the regression coefficients and the
variance components are included in the likelihood function. The other estimation method
is restricted maximum likelihood (RML); here only the variance components are included in
the likelihood function, and the regression coefficients are estimated in a second estimation
step. Both methods produce parameter estimates with associated standard errors and an
overall model deviance, which is a function of the likelihood. FML treats the regression
coefficients as fixed but unknown quantities when the variance components are estimated,
but does not take into account the degrees of freedom lost by estimating the fixed effects.
RML estimates the variance components after removing the fixed effects from the model (cf.
Searle et al., 1992, Chapter 6). As a result, FML estimates of the variance components are
biased; they are generally too small. RML estimates have less bias (Longford, 1993). RML
also has the property that if the groups are balanced (have equal group sizes), the RML
estimates are equivalent to ANOVA estimates, which are optimal (Searle et al., 1992, p. 254).
Since RML is more realistic, it should, in theory, lead to better estimates, especially when
the number of groups is small (Bryk & Raudenbush, 1992; Longford, 1993). In practice, the
differences between the two methods are usually small (cf. Hox, 1998; Kreft & de Leeuw,
1998). For example, if we compare the FML estimates for the intercept-only model for the
popularity data in Table 2.1 with the corresponding RML estimates, the only difference
within two decimal places is the intercept variance at level two. FML estimates this as 0.69,
and RML as 0.70. The size of this difference is absolutely trivial. If nontrivial differences
are found, the RML method is preferred (Browne, 1998). FML still continues to be used,
because it has two advantages over RML. Firstly, the computations are generally easier, and
secondly, since the regression coefficients are included in the likelihood function, an overall
chi-square test based on the likelihood can be used to compare two models that differ in the
fixed part (the regression coefficients). With RML, only differences in the random part (the
variance components) can be compared with this test. Most tables in this book have been
produced using FML estimation; if RML is used this is explicitly stated in the text.
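As a concrete illustration, the sketch below (not taken from this book) fits a simple two-level model with both estimation methods in Python using statsmodels; the data file and the column names popular, extrav, and class are hypothetical stand-ins for the popularity example.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical two-level data set: pupils nested within classes.
data = pd.read_csv("popularity.csv")

model = smf.mixedlm("popular ~ extrav", data, groups=data["class"])

fml = model.fit(reml=False)  # full maximum likelihood (FML)
rml = model.fit(reml=True)   # restricted maximum likelihood (RML)

# FML variance components tend to be slightly smaller than the RML ones;
# deviance comparisons of models that differ in the fixed part require reml=False.
print(fml.cov_re, rml.cov_re)
print(-2 * fml.llf)  # deviance of the FML solution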
Computing the maximum likelihood estimates requires an iterative procedure. At the
start, the computer program generates reasonable starting values for the various parameters
(for example based on single-level regression estimates). In the next step, an ingenious
computation procedure tries to improve upon the starting values, to produce better estimates.
This second step is repeated (iterated) many times. After each iteration, the program inspects
how much the estimates actually changed compared to the previous step. If the changes are
very small, the program concludes that the estimation procedure has converged and that it is
finished. Using multilevel software, we generally take the computational details for granted.
However, computational problems do sometimes occur. A problem common to programs
using an iterative maximum likelihood procedure is that the iterative process is not always
guaranteed to stop. There are models and data sets for which the program may go through an
endless sequence of iterations, which can only be ended by stopping the program. Because of


this, most programs set a built-in limit to the maximum number of iterations. If convergence
is not reached within this limit, the computations can be repeated with a higher limit. If the
computations do not converge after an extremely large number of iterations, we suspect that
they may never converge.1 The problem is how one should interpret a model that does not
converge. The usual interpretation is that a model for which convergence cannot be reached is
a bad model, using the simple argument that if estimates cannot be found, this disqualifies the
model. However, the problem may also lie with the data. Especially with small samples, the
estimation procedure may fail even if the model is valid. In addition, it is even possible that,
if only we had a better computer algorithm, or better starting values, we could find acceptable
estimates. Still, experience shows that if a program does not converge with a data set of
reasonable size, the problem often is a badly misspecified model. In multilevel analysis, non-
convergence often occurs when we try to estimate too many random (variance) components
that are actually close or equal to zero. The solution is to simplify the model by leaving out
some random components; often the estimated values from the non-converged solution
provide an indication of which random components can be omitted. The strategy you apply to
solve convergence issues should be reported in your logbook and/or paper.
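The following schematic sketch (not the actual algorithm of any particular program) illustrates the iteration logic just described: repeat an update step until the estimates change less than a tolerance, or stop at a built-in iteration limit.

import numpy as np

def iterate_until_converged(update_step, start, tol=1e-6, max_iter=500):
    # update_step: a function that maps the current estimates to improved estimates.
    est = np.asarray(start, dtype=float)
    for iteration in range(1, max_iter + 1):
        new_est = update_step(est)
        change = np.max(np.abs(new_est - est))
        est = new_est
        if change < tol:              # changes are very small: declare convergence
            return est, iteration, True
    return est, max_iter, False       # iteration limit reached without converging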

3.1.2 Generalized Least Squares

Generalized least squares (GLS) is an extension of the standard ordinary least squares (OLS) estimation method that allows for heterogeneity, that is, observations that differ in sampling
variance. GLS estimation approximates ML estimates, and they are asymptotically
equivalent. Asymptotic equivalence means that in very large samples they are in practice
indistinguishable. ‘Expected GLS’ estimates can be obtained from a maximum likelihood
procedure by restricting the number of iterations to one. Since GLS estimates are obviously
faster to compute than full ML estimates, they can be used as a stand-in for ML estimates
in computationally intensive situations, such as the analysis of extremely large data sets. They can also be
used when ML procedures fail to converge; inspecting the GLS results may help to diagnose
the problem. Simulation research has shown that GLS estimates are less efficient, and that
the GLS-derived standard errors are inaccurate (cf. Hox, 1998; van der Leeden et al., 2008;
Kreft, 1996). Therefore, in general, ML estimation should be preferred.

3.1.3 Generalized Estimating Equations

The generalized estimating equations method (GEE, cf. Liang & Zeger, 1986) estimates
the variances and covariances in the random part of the multilevel model directly from
the residuals, which makes them faster to compute than full ML estimates. Typically, the
dependences in the multilevel data are accounted for by a very simple model, represented by
a working correlation matrix. For individuals within groups, the simplest assumption is that
the respondents within the same group all have the same correlation. For repeated measures,


a simple autocorrelation structure is usually assumed. After the estimates for the variance
components are obtained, GLS is used to estimate the fixed regression coefficients. Robust
standard errors are generally used to counteract the approximate estimation of the random
structure. For non-normal data this results in a population average model, where the emphasis
is on estimating average population effects and not on modeling individual differences.
According to Goldstein (2011) and Raudenbush & Bryk (2002), GEE estimates are less
efficient than full ML estimates, but they make weaker assumptions about the structure
of the random part of the multilevel model. If the model for the random part is correctly
specified, ML estimators are more efficient, and the model-based (ML) standard errors are
generally smaller than the GEE-based robust standard errors. If the model for the random
part is incorrect, the GEE-based estimates and robust standard errors are still consistent.
So, provided the sample size is reasonably large, GEE estimators are robust against
misspecification of the random part of the model, including violations of the normality
assumption. A drawback of the GEE approach is that it only approximates the random
effects structure, and therefore the random effects cannot be analyzed in detail. Most
software will simply estimate a full unstructured covariance matrix for the random part,
which makes it impossible to estimate random effects for the intercept or slopes. Given
the general robustness of ML methods, it is preferable to use ML methods when these are
available, and use robust estimators or bootstrap corrections when there is serious doubt
about the assumptions of the ML method. Robust estimators, which are used with GEE
estimators (Burton et al., 1998), are treated in more detail in Chapter 13 of this book.
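As an illustration of the GEE approach with an exchangeable working correlation (all respondents within the same group assumed to share the same correlation), a hedged sketch in Python using statsmodels follows; the data file and column names are hypothetical.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pd.read_csv("popularity.csv")  # hypothetical two-level data set

gee_model = smf.gee("popular ~ extrav",
                    groups="class",
                    data=data,
                    cov_struct=sm.cov_struct.Exchangeable(),  # working correlation
                    family=sm.families.Gaussian())
result = gee_model.fit()

# Fixed-effect estimates with robust (sandwich) standard errors;
# the random structure itself is only a working approximation.
print(result.summary())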

3.2 Bayesian Methods

In many different fields, including the field of multilevel analysis, Bayesian statistics is gaining
popularity (van de Schoot et al., 2017), mainly because it can deal with all kinds of technical
issues, for example multicollinearity (Can et al., 2014) or non-normality (see Chapter 13), or
because it can deal with smaller sample sizes on the highest level (e.g., Baldwin & Fellingham,
2013). The scope of this section is not to provide a full introduction to Bayesian multilevel
modeling; for this we refer to Hamaker and Klugkist (2011). For a very gentle introduction to
Bayesian modeling, we refer the novice reader to, among many others, Kaplan (2014), or
van de Schoot et al. (2014). More detailed information about Bayesian multilevel modeling
can be found in Gelman and Hill (2007). For a discussion in the context of MLwiN see
Browne (2005). In the current chapter (see also Section 13.5) we want to highlight some
important characteristics of Bayesian estimation.
There are three essential ingredients underlying Bayesian statistics. The first ingredient is
the background knowledge of the parameters in the model being tested. This first ingredient
refers to all knowledge available before seeing the data and is captured in the so-called prior
distribution. The prior is a probability distribution reflecting the researchers’ beliefs about the
value of the parameter in the population, and the amount of uncertainty the researcher has


regarding this belief. Researchers may have a great degree of certainty in their belief, and
therefore specify an “informative prior”—that is, a prior with a low variance. In contrast, they
may have very little certainty in this belief, and consequently specify a non-informative prior—
that is, a prior with a large variance, also known as a diffuse or flat prior. The informativeness
of a prior is governed by hyperparameters. For example, the hyperparameters for a normal
distribution are the mean and variance terms that dictate the location and spread of the normal
distribution. A normally distributed prior would be written as N(μ, σ²), where N denotes that the prior follows a normal distribution (other distributions can also be specified in a model), the mean of the prior is given by μ, and σ² is the prior variance. Consequently, μ can be based on background information about the model parameter value, and σ² can be used to specify how
certain we are about the value of μ. The more informative a prior, the larger the impact it will
have on final model results, especially if the prior is combined with small sample sizes. If a non-
informative prior is desired, this is accomplished by specifying a very large variance for the
prior. Many simulation studies have shown that the more information is captured via the prior distribution, the smaller the sample size can become while maintaining power and precision.
The second ingredient in Bayesian estimation is the information in the data itself. It is
the observed evidence expressed in terms of the likelihood function of the data given the
parameters. In other words, the likelihood function asks: “given a set of parameters, such as the
mean and/or the variance, what is the likelihood or probability of the data at hand?”
The third ingredient is based on combining the first two ingredients, which is called
posterior inference. Both (1) and (2) are combined via Bayes' theorem and are summarized
by the so-called posterior distribution, which is a combination of the prior knowledge
and the observed evidence. The posterior distribution reflects one’s updated knowledge,
balancing prior knowledge with observed data. Given that the posterior is a combination of
information from the prior and the data, a more informative prior has a larger impact on the
posterior (or final result).
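A small numerical sketch (a textbook conjugate normal-normal case, not taken from this book) shows how the posterior combines a normal prior N(μ, σ²) with the data, and how a more informative prior (smaller σ²) pulls the posterior more strongly toward μ:

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=1.0, size=20)   # hypothetical observed data
data_var = 1.0                                 # known data variance, for simplicity

def posterior_normal(prior_mean, prior_var, y, data_var):
    # Conjugate update: precision-weighted combination of prior and data.
    n = len(y)
    post_var = 1.0 / (1.0 / prior_var + n / data_var)
    post_mean = post_var * (prior_mean / prior_var + y.sum() / data_var)
    return post_mean, post_var

print(posterior_normal(0.0, 100.0, y, data_var))  # diffuse prior: dominated by the data
print(posterior_normal(0.0, 0.01, y, data_var))   # informative prior: pulled toward 0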
The use of prior knowledge is one of the main elements that separate Bayesian and
frequentist methods. However, the process of estimating a Bayesian model can also be quite
different. Typically, Markov chain Monte Carlo (MCMC) methods are used, where estimation
is conducted through the use of a Markov chain—or a chain that captures the nature of the
posterior. Given that the posterior is a distribution (rather than a single, fixed number), we
need to sample from it in order to obtain a “best guess” of what the posterior looks like.
These samples from the posterior distribution form what we refer to as a chain. Every model
parameter has a chain associated with it, and once that chain has converged (i.e., the mean—or
horizontal middle of the chain— and the variance—or height of the chain—have stabilized),
we use the information in the chain to derive the final model estimates. Often, the beginning
portion of the chain is discarded because it represents an unstable part before convergence is
reached; this portion of the chain is called the burn-in phase. The last portion of the chain, the
post burn-in phase of the chain, is then used as the estimated posterior distribution where final
model estimates are obtained.
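A small illustrative sketch (not tied to any particular MCMC sampler) of what happens after sampling: the burn-in portion of a chain is discarded, and the post burn-in draws are summarized to obtain the final estimates.

import numpy as np

rng = np.random.default_rng(1)
chain = rng.normal(loc=0.5, scale=0.1, size=5000)  # stand-in for one parameter's chain
burn_in = 1000

posterior_draws = chain[burn_in:]                       # post burn-in phase
point_estimate = posterior_draws.mean()                 # e.g., posterior mean
interval = np.percentile(posterior_draws, [2.5, 97.5])  # 95 percent credibility interval
print(point_estimate, interval)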


The prior has the potential to have a rather large impact on final model results (even if
it is non-informative). As a result, it is important to report all details surrounding the prior
(see Depaoli & van de Schoot, 2017), which include: the distribution shape selected, the
hyperparameters (i.e., the level of informativeness), and the source of the prior information.
Equally important is to report a sensitivity analysis of priors to illustrate how robust final
model results are when priors are slightly (or even greatly) modified; this provides a better
understanding of the role of the prior in the analysis. Finally, it is also important to report all
information surrounding the assessment of chain convergence. Final model estimates are only
trustworthy if the Markov chain has successfully converged for every model parameter, and
reporting how this was assessed is a key component of a Bayesian analysis.
Bayesian multilevel estimation methods are discussed in more detail in Chapter 13 where
robust estimation methods are discussed to deal with non-normality, and in Chapter 12 where
sample size issues are discussed.

3.3 Bootstrapping

Bootstrapping is not, by itself, a different estimation method. In its simplest form, the
bootstrap (Efron, 1982; Efron & Tibshirani, 1993) is a method to estimate the parameters
of a model and their standard errors strictly from the sample, without reference to a
theoretical sampling distribution.2 The bootstrap directly follows the logic of statistical
inference. Statistical inference assumes that in repeated sampling, the statistics calculated
in the sample will vary across samples. This sampling variation is modeled by a theoretical
sampling distribution, for instance a normal distribution, and estimates of the expected
value and the variability are taken from this distribution. In bootstrapping, we draw b samples (with replacement) from the observed sample at hand. In each sample, we estimate
the statistic(s) of interest, and the observed distribution of the b statistics is used for the
sampling distribution. Estimates of the expected value and the variability of the statistics are
taken from this empirical sampling distribution (Stine, 1989; Mooney & Duval, 1993; Yung
& Chan, 1999). Thus, in multilevel bootstrapping, in each bootstrap sample the parameters
of the model must be estimated, which is usually done with ML.


Since bootstrapping takes the observed data as the sole information about the population,
it needs a reasonable original sample size. Good (1999, p. 107) suggests a minimum sample
size of 50 when the underlying distribution is not symmetric. Yung and Chan (1999) review
the evidence on the use of bootstrapping with small samples. They conclude that it is not
possible to give a simple recommendation for the minimal sample size for the bootstrap
method. However, in general the bootstrap appears to compare favorably over asymptotic
methods. A large simulation study involving complex structural equation models (Nevitt
& Hancock, 2001) suggests that, for accurate results despite large violations of normality
assumptions, the bootstrap needs an observed sample of more than 150. Given such results,
the bootstrap is not the best approach when the major problem is a small sample size.


When the problem is violations of assumptions, or establishing bias-corrected estimates and valid confidence intervals for variance components, the bootstrap appears to be a viable
alternative to asymptotic estimation methods.
The number of bootstrap iterations b is typically large, with b between 1000 and 2000
(Booth & Sarkar, 1998; Carpenter & Bithell, 2000). If the interest is in establishing very
accurate confidence intervals, we need an accurate estimate of percentiles close to 0 or 100,
which requires an even larger number of iterations, such as b > 5000.
The bootstrap is not without its own assumptions. A key assumption of the bootstrap is
that the resampling properties of the statistic resemble its sampling properties (Stine, 1989).
As a result bootstrapping does not work well for statistics that depend on a very “narrow
feature of the original sampling process” (Stine, 1989, p. 286), such as the maximum value.
Another key assumption is that the resampling scheme used in the bootstrap must reflect
the actual sampling mechanism used to collect the data (Carpenter & Bithell, 2000). This
assumption is very important in multilevel modeling, because in multilevel data we have a
hierarchical sampling mechanism, which must be mimicked in the bootstrapping procedure.
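One simple way to mimic a two-level sampling mechanism is a nonparametric cases bootstrap that resamples whole groups with replacement and refits the model in each bootstrap sample. The sketch below is an illustration under assumed column names (popular, extrav, class); it is not the only possible multilevel bootstrap scheme.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("popularity.csv")        # hypothetical two-level data set
groups = data["class"].unique()
rng = np.random.default_rng(42)
b = 1000                                     # number of bootstrap samples

slope_estimates = []
for _ in range(b):
    sampled_groups = rng.choice(groups, size=len(groups), replace=True)
    # Give each resampled group a new id, so duplicated groups stay distinct.
    boot = pd.concat(
        [data[data["class"] == g].assign(boot_id=i) for i, g in enumerate(sampled_groups)]
    )
    fit = smf.mixedlm("popular ~ extrav", boot, groups=boot["boot_id"]).fit(reml=False)
    slope_estimates.append(fit.params["extrav"])

# Empirical sampling distribution: bootstrap standard error and percentile interval.
print(np.std(slope_estimates, ddof=1))
print(np.percentile(slope_estimates, [2.5, 97.5]))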
If we carry out a bootstrap estimation for our example data introduced in Chapter 2, the
results are almost identical to the asymptotic FML results reported in Table 2.2. The estimates
differ by 0.01 at most, which is a completely trivial difference. Of course, the example data
in Chapter 2 are simulated, and all assumptions are fully met. Bootstrap estimates are most
attractive when we have reasons to suspect the asymptotic results, because we have non-
normal data. Bootstrapping is described in more detail in Chapter 13 where robust estimation
methods are discussed to deal with non-normality.

3.4 Significance Testing and Model Comparison

This section discusses procedures for testing significance and model comparison for the
regression coefficients and variance components.
3.4.1 Testing Regression Coefficients and Variance Components

Maximum likelihood estimation produces parameter estimates and corresponding standard errors. These can be used to carry out a significance test of the form Z = (estimate) / (standard error of estimate), where Z is referred to the standard normal distribution. This test is
known as the Wald test (Wald, 1943). The standard errors are asymptotic, which means that
they are valid for large samples. As usual, it is not precisely known when a sample is large
enough to be confident about the precision of the estimates. Simulation research suggests
that for accurate standard errors for level-2 variances, a relatively large level-2 sample size
is needed. For instance, simulations by van der Leeden, Busing and Meijer (1997) suggest
that with fewer than 100 groups, ML estimates of variances and their standard errors are
not very accurate. In ordinary regression analysis, a rule of thumb is to require 104 + p


observations if the interest is in estimating and interpreting regression coefficients, where p is the number of explanatory variables (Green, 1991). If the interest is in interpreting
(explained) variance, the rule of thumb is to require 50 + 8p observations. In multilevel
regression, the relevant sample size for higher-level coefficients and variance components
is the number of groups, which is often not very large. Green’s rule of thumb and van
der Leeden et al.’s simulation results agree on a preferred group-level sample size of at
least 100. Additional simulation research (Maas & Hox, 2005) suggests that if the interest
lies primarily in the fixed part of the model, far fewer groups are sufficient, especially for
the lowest-level regression coefficients. The issue of the sample sizes needed to produce
accurate estimates and standard errors is taken up in more detail in Chapter 12.
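As a small worked example of the Wald test (with made-up numbers), Z is simply the estimate divided by its standard error, referred to the standard normal distribution:

from scipy.stats import norm

estimate = 0.45       # e.g., a regression coefficient
std_error = 0.02      # its asymptotic standard error

z = estimate / std_error
p_two_sided = 2 * norm.sf(abs(z))   # two-sided p-value for a fixed effect
print(z, p_two_sided)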
It should be noted that the p-values and confidence intervals produced by different software
may not be exactly the same. Most multilevel analysis programs produce as part of their output
parameter estimates and asymptotic standard errors for these estimates, all obtained from the
maximum likelihood estimation procedure. The usual significance test is the Wald test, with
Z evaluated against the standard normal distribution. Bryk and Raudenbush (1992, p. 50),
referring to a simulation study by Fotiu (1989), argue that for the fixed effects it is better
to refer this ratio to a t-distribution on J – p – 1 degrees of freedom, where J is the number
of second-level units, and p is the total number of explanatory variables in the model. The
p-values produced by the program HLM (Raudenbush et al., 2011) are based on these tests
rather than the more common Wald tests. When the number of groups J is large, the difference
between the asymptotic Wald test and the alternative Student’s t-test is very small. However,
when the number of groups is small, the differences may become important. Since referring the
result of the Z-test on the regression coefficients to a Student’s t-distribution is conservative,
this procedure should provide a better protection against type I errors. A better choice for
the degrees of freedom in multilevel models is provided by the Satterthwaite approximation
(Satterthwaite, 1946) or the Kenward–Roger approximation (Kenward & Roger, 1997). Both
approximations estimate the number of degrees of freedom using the values of the residual
variances and their distribution across the available levels. Simulation research (Manor &
Zucker, 2004) shows that these approximations work better than the Wald test when the sample
size is small (e.g., smaller than 30). The Satterthwaite approximation is used in SAS, SPSS and Stata; the Kenward–Roger approximation is available in SAS.
Several authors (e.g. Raudenbush & Bryk, 2002; Berkhof & Snijders, 2001) argue that the
Z-test is not appropriate for the variances, because it assumes a normal distribution, whereas
the sampling distribution of variances is skewed, especially when the variance is small.
Especially if we have both a small sample of groups and a variance component close to zero,
the distribution of the Wald statistic is clearly non-normal. Raudenbush and Bryk propose to
test variance components using a chi-square test on the residuals. This chi-square is computed
by

χ² = Σⱼ (β̂ⱼ − β)² / V̂ⱼ ,    (3.1)


where β̂ⱼ is the OLS estimate of a regression coefficient computed separately in group j, β its overall estimate, and V̂ⱼ its estimated sampling variance in group j. The number of
degrees of freedom is given by df = J – p – 1, where J is the number of second-level units,
and p is the total number of explanatory variables in the model. Groups that have a small
number of cases are passed over in this test, because their OLS estimates are not sufficiently
accurate.
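A hedged sketch of the residual chi-square test in Equation 3.1, with made-up per-group OLS estimates and sampling variances:

import numpy as np
from scipy.stats import chi2

beta_j = np.array([0.41, 0.52, 0.38, 0.47, 0.55])    # per-group OLS estimates (made up)
V_j = np.array([0.004, 0.003, 0.005, 0.004, 0.003])  # their sampling variances (made up)
beta_overall = 0.45                                   # overall estimate (made up)

chi_sq = np.sum((beta_j - beta_overall) ** 2 / V_j)
J, p = len(beta_j), 1                                 # J groups, p explanatory variables
df = J - p - 1
print(chi_sq, chi2.sf(chi_sq, df))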
Simulation studies on the Wald test for a variance component (van der Leeden et al., 1997)
and the alternative chi-square test (Harwell, 1997; Sánchez-Meca & Marín-Martínez, 1997)
suggest that with small numbers of groups both tests suffer from a very low power. The test
that compares a model with and without the parameters under consideration, using the chi-
square model test described in the next section, is generally better (Goldstein, 2011; Berkhof
& Snijders, 2001). Only if the likelihood is determined with a low precision, which is the case
for some approaches to modeling non-normal data, the Wald test is preferred. Note that if the
Wald test is used to test a variance component, a one-sided test is the appropriate one.

3.4.2 Testing Regression Coefficients and Variance Components Using Bayesian Estimation

In Bayesian estimation, instead of using p-values one can use credibility intervals or Bayes
Factors (BF) for testing regression coefficients or variance components.
First, the Bayesian counterpart of the frequentist confidence interval is the Posterior
Probability Interval, also referred to as the credibility interval or the highest posterior density interval.
This interval is interpreted as the 95 percent probability that in the population the parameter
lies between the upper and lower value of the interval. Note, however, that the Bayesian
interval and the classical confidence interval may numerically be similar and might serve
related inferential goals, but they are not mathematically equivalent and conceptually quite
different. Many argue that the Bayesian 95 percent interval is easier to communicate because
it is actually the probability that a certain parameter lies between two numbers, which is
not the definition of a classical confidence interval. The Bayesian interval can be used to
determine whether a specific value, for example zero, lies within or outside the 95 percent
interval. Related to this, the region of practical equivalence has also been slowly gaining
popularity in the literature to avoid testing so-called nil-hypotheses (Kruschke, 2011).
A second way of testing regression coefficients or variance components is to use Bayes
Factors (Kass & Raftery, 1995; see also Morey & Rouder, 2011). Bayes Factors represent the
amount of evidence favoring one hypothesis over another. When BF = 1, this result implies
that both hypotheses are equally supported by the data, but when BF = 10, for example, the
support for one hypothesis is ten times larger than the support for the alternative hypothesis.
If BF < 1, the alternative hypothesis is supported by the data. Many researchers argue that
BFs are to be preferred over p-values; for example, Sellke, Bayarri and Berger (2001) showed
that the BF is preferable over a p-value when testing hypotheses because p-values tend to


overestimate the evidence against the null hypothesis. However, as stated by Konijn, van de
Schoot, Winter and Ferguson (2015), potential pitfalls of a Bayesian approach include BF-
hacking (cf., ‘Surely, God loves a Bayes Factor of 3.01 nearly as much as a BF of 2.99’). This
can especially occur when BF values are small. The first way in which BFs can be applied
is to use them to test if variances are greater than zero (Verhagen & Fox, 2012), which is
implemented in Mplus (TECH 16, Muthén & Muthén, 1998–2015). A second way in which
BFs can be used is to test whether regression coefficients are smaller/larger than zero or to
test for order constraints between regression coefficients (see van de Schoot et al., 2013; for
an application see Johnson et al., 2015).

3.4.3 Comparing Nested Models

From the likelihood function we can calculate a statistic called the deviance that indicates
how well the model fits the data. The deviance is defined as –2 × ln (likelihood), where
likelihood is the value of the likelihood function at convergence, and ln is the natural
logarithm. In general, models with a lower deviance fit better than models with a higher
deviance. If two models are nested, which means that a specific model can be derived from a
more general model by removing parameters from the general model, we can compare them
statistically using their deviances. The difference of the deviances for two nested models
has a chi-square distribution, with degrees of freedom equal to the difference in the number
of parameters estimated in the two models. This can be used to perform a formal chi-square
test to test whether the more general model fits significantly better than the simpler model.
The deviance difference test is also referred to as the likelihood ratio test, since the ratio of
two likelihoods is compared by looking at the difference of their logarithms.
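A small sketch of the deviance difference (likelihood ratio) test; the two deviances below match the intercept-only example discussed later in this section (removing the second-level variance from model M0):

from scipy.stats import chi2

deviance_restricted = 6970.4   # model without the second-level variance
deviance_general = 6327.5      # model M0, which includes it
extra_parameters = 1           # difference in number of estimated parameters

difference = deviance_restricted - deviance_general   # 642.9
p_value = chi2.sf(difference, df=extra_parameters)
print(difference, p_value)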
The chi-square test of the deviances can be used to good effect to explore the importance of
random effects, by comparing a model that contains these effects with a model that excludes
them.
Table 3.1 presents two models for the pupil popularity data used as an example in Chapter
2. The first model contains only an intercept. The second model adds two pupil-level variables
and a teacher-level variable, with the pupil-level variable extraversion having random slopes
at the second (class) level. To test the second-level variance component σ²_u0 using the deviance
difference test, we remove it from model M0. The resulting model (not presented in Table 3.1)
produces a deviance of 6970.4, and the deviance difference is 642.9. Since the modified model
estimates one parameter less, this is referred to the chi-square distribution with one degree of
freedom. The result is obviously significant.
The variance of the regression coefficient for pupil gender is estimated as zero, and therefore
it is removed from the model. A formal test is not necessary. In model M1 in Table 3.1 this
variable is treated as fixed; no variance component is estimated. To test the significance of the
variance of the extraversion slopes, we must remove the variance parameter from the model.
This presents us with a problem, since there is also a covariance parameter σ_u02 associated with


Table 3.1 Intercept-only model and model with explanatory variables

Model                    M0: intercept only        M1: with predictors
Fixed part               Coefficient (s.e.)        Coefficient (s.e.)
Intercept                5.08 (.09)                0.74 (.20)
Pupil gender                                       1.25 (.04)
Pupil extraversion                                 0.45 (.02)
Teacher experience                                 0.09 (.01)
Random part
σ²_e                     1.22 (.04)                0.55 (.02)
σ²_u0                    0.69 (.11)                1.28 (.28)
σ_u02                                              –0.18 (.05)
σ²_u2                                              0.03 (.008)
Deviance                 6327.5                    4812.8

the extraversion slopes. If we remove both the variance and the covariance parameter from
the model, we are testing a combined hypothesis on two degrees of freedom. It is better to
separate these tests. Some software (e.g. MLwiN) actually allows removing the variance of the slopes from the model while retaining the covariance parameter. This is a strange model, but for
testing purposes it allows us to carry out a separate test on the variance parameter only. Other
software (e.g. MLwiN, SPSS, SAS) allows the removal of the covariance parameter, while
keeping the variance in the model. If we modify model M1 this way, the deviance increases to
4851.9. The difference is 39.1, which is a chi-square variate with one degree of freedom, and
highly significant. If we modify the model further, by removing the slope variance as well, the
deviance increases again to 4862.3. The difference with the previous model is 10.4, again with
one degree of freedom, and it is highly significant.
Asymptotically, the Wald test and the test using the chi-square difference are equivalent.
In practice, the Wald test and the chi-square difference test do not always lead to the same
conclusion. If a variance component is tested, the chi-square difference test is clearly better,
except when models are estimated where the likelihood function is only an approximation, as
in the logistic models discussed in Chapter 6.
When the chi-square difference test is used to test a variance component, it should be noted
that the standard application leads to a p-value that is too high. The reason is that the null-
hypothesis of zero variance is on the boundary of the parameter space (all possible parameter
values) since variances cannot be negative. If the null-hypothesis is true, there is a 50 percent
chance of finding a positive variance, and a 50 percent chance of finding a negative variance.
Negative variances are inadmissible, and the usual procedure is to change the negative estimate


to zero. Thus, under the null-hypothesis the chi-square statistic has a mixture distribution of
50 percent zero and 50 percent chi-square with one degree of freedom. Therefore, the p-value
from the chi-square difference test must be divided by two if a variance component is tested
(Berkhof & Snijders, 2001). If we test a slope variance, and remove both the slope variance
and the covariance from the model, the mixture is more complicated, because we have a
mixture of 50 percent chi-square with one degree of freedom for the unconstrained intercept-
slope covariance and 50 percent chi-square with two degrees of freedom for the covariance
and the variance that is constrained to be non-negative (Verbeke & Molenberghs, 2000). The
p-value for this mixture is calculated using p = 0.5 P(χ²₁ > C²) + 0.5 P(χ²₂ > C²), where C² is
the difference in the deviances of the model with and without the slope variance and intercept-
slope covariance. Stoel, Galindo, Dolan and van den Wittenboer (2006) discuss how to carry
out such tests in general. If it is possible to remove the intercept-slope covariance from the
model, it is possible to test the significance of the slope variance with a one degree of freedom
test, and we can simply halve the p-value again. For the regression coefficients, the chi-square
test (only in combination with FML estimation) is in general also superior. The reason is that
the Wald test is to some degree sensitive to the parameterization of the model and the specific
restrictions to be tested (Davidson & MacKinnon, 1993, Sections 13.5–13.6). The chi-square
test is invariant across different parametrizations of the model. Since the Wald test is much
more convenient, it is in practice used the most, especially for the fixed effects. Even so, if
there is a discrepancy between the result of a chi-square difference test and the equivalent Wald
test, the chi-square difference test is generally the preferred one.
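As a numerical illustration of the mixture test: removing both the extraversion slope variance and the intercept-slope covariance from model M1 raises the deviance from 4812.8 to 4862.3, a difference of C² = 49.5. A sketch of the corresponding p-value calculations:

from scipy.stats import chi2

C2 = 4862.3 - 4812.8   # deviance difference for slope variance plus covariance (49.5)
p_mixture = 0.5 * chi2.sf(C2, df=1) + 0.5 * chi2.sf(C2, df=2)
print(p_mixture)

# For a single variance component tested on one degree of freedom (e.g., the 10.4
# difference reported above), the usual chi-square p-value is simply halved.
p_single = 0.5 * chi2.sf(10.4, df=1)
print(p_single)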
LaHuis and Ferguson (2009) compare, among other tests, the chi-square deviance test and
the chi-square residuals test described above. In their simulation, all tests controlled the type
I error well, and the deviance difference test (dividing p by two for variances, as described
above) generally performed best in terms of power.

3.4.4 Comparing Non-Nested Models

If the models to be compared are not nested models, the principle that models should
be as simple as possible (theories and models should be parsimonious) indicates that we should generally keep the simpler model. A general fit index to compare the fit of statistical
models is Akaike’s Information Criterion—AIC (Akaike, 1987), which was developed to
compare non-nested models, adjusting for the number of parameters estimated. The AIC
for multilevel regression models is conveniently calculated from the deviance d, and the
number of estimated parameters q:

AIC = d + 2q. (3.2)

The AIC is a very general fit-index that assumes that we are comparing models that are fit
to the same data set, using the same estimation method.


A fit index similar to the AIC is Schwarz’s Bayesian Information Criterion–BIC (Schwarz,
1978), which is given by

BIC = d + q ln(N). (3.3)

In multilevel modeling, the general Equation 3.3 for the BIC is ambiguous, because it is
unclear whether N refers to the first-level or the second-level sample size. What N means in
Equation 3.3 is chosen differently by different software. Most software uses the number of
units at the highest level for the N. This makes much sense when multilevel models are used
for longitudinal data, where the highest level is often the subject level. Given the strong
interest in contextual effects in multilevel modeling, choosing the highest-level sample size
appears a sensible rule.
When the deviance goes down, indicating a better fit, both the AIC and the BIC also tend to
go down. However, the AIC and the BIC both include a penalty function based on the number
of estimated parameters q. As a result, when the number of estimated parameters goes up, the
AIC and BIC tend to go up too. For most sample sizes, the BIC places a larger penalty on
complex models, which leads to a preference for smaller models. Since multilevel data have
a different sample size at different levels, the AIC is more straightforward than the BIC, and
therefore the recommended choice. The AIC and BIC are typically used to compare a range of
competing models, and the model(s) with the lowest AIC or BIC value are considered the most
attractive. Both the AIC and the BIC have been shown to perform well, with a small advantage
for the BIC (Haughton et al., 1997; Kieseppä, 2003). It should be noted that the AIC and BIC
are based on the likelihood function. With FML estimation, the AIC and BIC can be used to
compare models that differ either in the fixed part or in the random part. If RML estimation
is used, it can only be used to compare models that differ in the random part. Since RML
effectively partials out the fixed part before the random part is estimated, the RML likelihood
may still change if the fixed part is changed. Therefore, if likelihood based procedures are used
to compare models using RML estimation, the fixed part of the model must be kept constant.
Not all software reports the AIC or BIC, but they can be calculated using the formulas given
earlier. For an introductory discussion of the background of the AIC and the BIC see McCoach
and Black (2008).
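A short sketch computing the AIC and BIC of Equations 3.2 and 3.3 from a model's deviance; the deviance and parameter count are taken from model M1 in Table 3.1, and N = 100 classes is an assumption about the example data.

import math

deviance = 4812.8   # deviance of model M1 (FML)
q = 8               # estimated parameters in M1: 4 fixed effects + 4 (co)variances
N = 100             # highest-level sample size (number of classes), assumed here

aic = deviance + 2 * q
bic = deviance + q * math.log(N)
print(aic, bic)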
Within the Bayesian framework the Deviance Information Criterion—DIC (Spiegelhalter
et al., 2002) can be used similarly to the AIC and BIC. The posterior DIC is proposed in
Spiegelhalter and colleagues (2002) as a Bayesian criterion for minimizing the posterior
predictive loss. It can be seen as the error that is expected when a statistical model based on
the observed data set y is applied to a future data set x. Let f(·) denote the likelihood; then the expected loss is given by

E_f(x|θ*) [ −2 log f(x | θ*) ],


where −2 log f(·) is the loss function of a future data set x given the expected a-posteriori estimates θ* of the model parameters based on the observed data set. If we knew the true parameter values, the expected loss could be computed. However, since these are unknown,
the posterior DIC takes the posterior expectation leading to the DIC:

DIC = d + p_D ,

where the first term is (approximately) equal to d for the AIC and BIC. The second term is
often interpreted as the “effective number of parameters”, but is formally interpreted as the
posterior mean of the deviance minus the deviance of the posterior means. Just like with the AIC and BIC, models with a lower DIC value should be preferred; the lowest DIC indicates the model that would best predict a replicate data set with the same structure as the one currently observed. For a detailed comparison of the three model selection tools, AIC, BIC, and DIC,
we refer to Hamaker et al. (2011).

3.5 Software

Most multilevel regression software uses maximum likelihood estimation and offers a
choice between full maximum likelihood and restricted maximum likelihood estimation.
Bayesian estimation is gaining in popularity, but a user-friendly software implementation is
currently available only in MLwiN and Mplus. In most software, when maximum likelihood
estimation is used, both regression coefficients and variance components are tested using
the Wald test. The software HLM uses a chi-square test based on the residuals. The deviance
difference test can be used only by calculating the difference of the two deviances manually.
Bayesian estimation methods generally investigate the precision of the estimated regression
coefficients and variances by calculating their 95 percent credibility interval. This is similar
to the ML-based 95 percent confidence interval, but it has a simpler interpretation and is not
necessarily symmetric, which is important for variance components.
Notes
1 Some programs allow the analyst to monitor the iterations, to observe whether the computations are
going somewhere, or are just moving back and forth without improving the likelihood function.
2 For a discussion of multilevel bootstrapping in the context of robust estimation see Hox and van de
Schoot (2013), which explains bootstrapping in more detail.

