
Psychological Methods, 2013, Vol. 18, No. 2, 151-164
© 2012 American Psychological Association. 1082-989X/13/$12.00 DOI: 10.1037/a0030642

Bayesian Methods for the Analysis of Small Sample Multilevel Data With a Complex Variance Structure
Scott A. Baldwin and Gilbert W. Fellingham
Brigham Young University

Inferences from multilevel models can be complicated in small samples or complex data structures. When
using (restricted) maximum likelihood methods to estimate multilevel models, standard errors and
degrees of freedom often need to be adjusted to ensure that inferences for fixed effects are correct. These
adjustments do not address problems in estimating variance/covariance components. An alternative to the
adjusted likelihood method is to use Bayesian methods, which can produce accurate inferences about
fixed effects and variance/covariance parameters. In this article, the authors contrast the benefits and
limitations of likelihood and Bayesian methods in the estimation of multilevel models. The issues are
discussed in the context of a partially clustered intervention study, a common intervention design that has
been shown to require an adjusted likelihood analysis. The authors report a Monte Carlo study that
compares the performance of an adjusted restricted maximum likelihood (REML) analysis to a Bayesian
analysis. The results suggest that for fixed effects, the models perform equally well with respect to bias,
efficiency, and coverage of interval estimates. Bayesian models with a carefully selected gamma prior for
the variance components were more biased but also more efficient with respect to estimation of the
variance components than the REML model. However, the results also show that the inferences about the
variance components in partially clustered studies are sensitive to the prior distribution when sample
sizes are small. Finally, the authors compare the results of a Bayesian and adjusted likelihood model
using data from a partially clustered intervention trial.

Keywords: Bayesian methods, MCMC, multilevel data, partially clustered data

Supplemental materials: http://dx.doi.org/10.1037/a0030642.supp

This article was published Online First November 12, 2012. Scott A. Baldwin, Department of Psychology, Brigham Young University; Gilbert W. Fellingham, Department of Statistics, Brigham Young University. We would like to thank Eric Stice for allowing us to use the Body Project data and Matthew Heiner for his thoughtful reading of the article. Correspondence concerning this article should be addressed to Scott A. Baldwin, 268 TLRB, Brigham Young University, Provo, UT 84602. E-mail: [email protected]

Clustered data are common in the social sciences. Examples of clustered data include educational data where students are clustered within schools, psychotherapy data where patients are clustered within therapists, family data where individuals are clustered within families, and longitudinal data where observations over time are clustered within persons. When analyzing clustered data it is often of interest to model variance due to clusters because cluster variance is substantively interesting (e.g., how much do schools vary with respect to educational outcomes) and because modeling cluster variance accounts for the correlation among observations within groups and thus produces correct confidence intervals and p-values for fixed effects (e.g., Baldwin, Murray, & Shadish, 2005; Baldwin, Murray, et al., 2011; Roberts & Roberts, 2005). Multilevel (or mixed) models are a flexible and commonly used tool for analyzing clustered data because they allow researchers to accommodate multiple sources of variability. Additionally, multilevel models are flexible with respect to dealing with missing data, including predictors at the individual and cluster levels, considering complex covariance relationships among observations, and using non-normal and multivariate outcomes (Littell, Milliken, Stroup, Wolfinger, & Schabenberger, 2006). Consequently, many research questions in the social sciences can be addressed with a multilevel model.

When sample sizes are small and the variance structure of the data is complex, inferences about the fixed effects and variance components can be complicated because of uncertainty about the true value of the variance components. Adjustments to the standard errors and degrees of freedom when using maximum likelihood (ML) or restricted maximum likelihood (REML) estimation methods produce correct confidence intervals and p-values for the fixed effects (Kenward & Roger, 1997, 2009). However, these adjustments only impact inferences for the fixed effects; uncertainty about the estimates of the variance components themselves is not dealt with directly, which could lead to poor inferences about variance components.

Bayesian estimation is a potential alternative to likelihood estimation. Bayesian estimation accounts for uncertainty in the estimation of all parameters, including variance components, through the use of prior distributions and thus does not require adjustments.
A number of articles and books have detailed the benefits and challenges of Bayesian methods in the social sciences (e.g., Alegría et al., 2008; Gelman & Hill, 2007; Howard, Maxwell, & Fleming, 2000; Jackman, 2009; Kruschke, 2011b; Lynch, 2007). This methodological work has covered topics such as (a) how Bayesian models formally incorporate existing information into an analysis (e.g., Howard et al., 2000), (b) the benefits of Bayesian hypothesis tests and interval estimates (Kruschke, 2011a), (c) how multilevel models can be formulated and extended in a Bayesian framework (Gelman & Hill, 2007; Jackman, 2009; Lynch, 2007), and (d) the flexibility of Bayesian models in constructing interval estimates for parameters with non-normal sampling distributions (e.g., indirect effects in multilevel mediation models; Yuan & MacKinnon, 2009). Furthermore, studies have compared the relative performance of likelihood and Bayesian estimation methods applied to multilevel models (Browne, 2008). With respect to bias, efficiency, and coverage, Bayesian methods have performed as well as or better than likelihood methods.

We are unaware of any study that has directly compared the performance of likelihood and Bayesian methods in situations that require adjustments to the likelihood model. One such situation where adjustments are necessary is a partially clustered design (Baldwin, Bauer, Stice, & Rohde, 2011; Bauer, Sterba, & Hallfors, 2008; Roberts & Roberts, 2005). Partially clustered designs, where some participants are clustered and others are not, are a relatively common design in intervention research in social science and medicine (Bauer et al., 2008). Additionally, likelihood methods require adjustment when applied to partially clustered data. For example, Baldwin, Bauer, et al. (2011) showed that Type I error rates for fixed effects in partially clustered designs exceed the nominal rate unless Satterthwaite degrees of freedom (Satterthwaite, 1946) are used.

The primary purposes of this article are to (a) introduce Bayesian estimation for multilevel models where adjustment to likelihood estimation is needed and (b) compare the performance of likelihood and Bayesian estimation in the analysis of complex multilevel data—specifically partially clustered data. The outline of the article is as follows. First, we introduce our motivating example, a partially clustered intervention study known as the Body Project, and briefly review the multilevel model for partially clustered data. Second, we discuss inferential challenges with multilevel models applied to partially clustered data and discuss how the Kenward-Roger adjustment to likelihood estimates helps with inferences about fixed effects but not variance components. Third, we introduce Bayesian methods and how they may be used to estimate the parameters in the multilevel model for partially clustered data. We include a discussion of the issues and challenges researchers face in selecting prior distributions. Fourth, we report a simulation study comparing the performance, with respect to both fixed effects and variance components, of Bayesian and likelihood estimation. Fifth, we analyze the Body Project data and include an online technical appendix (see the online supplemental materials) with annotated SAS and JAGS/BUGS syntax for running the Bayesian model. We also include appendix material illustrating how to write a Bayesian sampler.

Motivating Example

The Body Project evaluated a dissonance-based eating disorder prevention intervention (complete details of the study can be found in Stice, Shaw, Burton, & Wade, 2006). Female adolescents (N = 480) were randomized to one of four conditions: the dissonance intervention (DI; n = 114), a healthy-weight management program (HW; n = 117), an expressive writing control (EW; n = 123), and an assessment-only control (AO; n = 126). To simplify our discussion of the models and estimation, we focus just on the DI and AO conditions. The DI condition was clustered and was delivered in 17 groups (average group size = 6.7). The AO condition was unclustered.

The Body Project is known as a partially clustered design because clustering occurs in some conditions but not others (see Baldwin, Bauer, et al., 2011, for a review and evaluation of approaches to modeling partially clustered data). Statistical models need to reflect heteroscedasticity across conditions in order to maintain nominal Type I error rates and confidence interval coverage. Specifically, models that ignore clustering in partially clustered data have inflated Type I error rates and confidence intervals that are too narrow. Modeling data as if they were fully clustered can also lead to inflated Type I error rates as well as biased estimates of the variance components (Baldwin, Bauer, et al., 2011).

Several articles have described and applied a multilevel model that is consistent with the structure of the partially clustered data (Baldwin, Bauer, et al., 2011; Bauer et al., 2008; Roberts & Roberts, 2005). A multilevel model with a clustered (C) and unclustered (U) condition is

Y_ij = b0 + b1·X_ij + u_j·Z_j + e_ij,  (1)

where Y_ij is the outcome of the ith participant in the jth cluster, X_ij is a dummy variable coded as 1 for the clustered condition and 0 for the unclustered condition, b0 is the mean of the unclustered condition, b1 is the difference between the means of the clustered and unclustered conditions (i.e., the intervention effect), Z_j is an indicator variable coded as 1 if a participant is in the jth cluster and 0 otherwise, u_j·Z_j is the effect for cluster j in the clustered condition, and e_ij is a person-specific residual. The cluster effects are normally distributed:

u_j ~ N(0, σ²_u).  (2)

The residuals are also normally distributed and independent of the cluster effects. Additionally, it is often beneficial to estimate unique residual variances across intervention conditions (Bauer et al., 2008):

e_Ui ~ N(0, σ²_eU),  (3)
e_Cij ~ N(0, σ²_eC).  (4)

The proportion of variance in the clustered condition associated with clusters is indexed with an intraclass correlation:

ρ = σ²_u / (σ²_u + σ²_eC).  (5)
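To make the model concrete, the sketch below (ours, not the authors'; all parameter values are hypothetical placeholders chosen only for illustration) simulates one partially clustered data set from Equations 1-4 and computes the intraclass correlation of Equation 5 from the generating values.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical generating values (placeholders, not estimates from the Body Project)
    b0, b1 = 3.0, -0.2                         # unclustered mean and intervention effect
    var_u, var_eC, var_eU = 0.05, 0.40, 0.27   # cluster and condition-specific residual variances
    c, m, n_u = 17, 7, 126                     # clusters, cluster size, unclustered sample size

    # Clustered condition: Y_ij = b0 + b1 + u_j + e_Cij (Equations 1, 2, and 4)
    u = rng.normal(0.0, np.sqrt(var_u), size=c)
    y_clustered = b0 + b1 + np.repeat(u, m) + rng.normal(0.0, np.sqrt(var_eC), size=c * m)

    # Unclustered condition: Y_i = b0 + e_Ui (Equations 1 and 3, with X_ij = Z_j = 0)
    y_unclustered = b0 + rng.normal(0.0, np.sqrt(var_eU), size=n_u)

    # Intraclass correlation implied by the generating values (Equation 5)
    rho = var_u / (var_u + var_eC)
    print(round(rho, 3))   # 0.111

In a real analysis the variances would be estimated rather than fixed; the point is only to show that the clustered and unclustered arms share b0 but differ in their variance structure.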
Estimation and Inferential Challenges

Using the multilevel model for partially clustered data, inferences can be drawn about the fixed effects, variance components, intraclass correlation, or all three. Bauer et al. (2008) noted how inferences about fixed effects in partially clustered data are complicated by the nonparallel variance structure of the data. They noted that two primary problems exist when likelihood methods are used: (a) standard errors tend to be underestimated when sample sizes are small and data are unbalanced and when there are multiple covariance parameters (Kenward & Roger, 1997, 2009); (b) the form of the reference distribution for computing p-values and confidence intervals is unknown. The first problem occurs because likelihood estimation proceeds in two steps. First, the variance and covariance parameters are estimated. Second, the variance and covariance estimates are assumed to be the true values and are used to estimate the fixed effects (see Littell et al., 2006, Appendix A). When sample sizes are small and unbalanced, the variance components will not be well estimated and can lead to standard errors for the fixed effects that are too small and confidence intervals that are too narrow. Kenward and Roger (1997, 2009) developed an adjustment that inflates the standard errors of the fixed effects to account for the uncertainty in the variance component estimates. The bias in the variance/covariance matrix of the fixed effects is estimated through a Taylor series expansion, and the bias is added back into the variance/covariance matrix of the fixed effects so that the standard errors are sufficiently large. Kenward and Roger's simulations suggest that this inflation factor produces appropriately large standard errors.

Even with the appropriate adjustment, the form of the reference distribution for conducting hypothesis tests and constructing confidence intervals is unknown. Often the sampling distribution of fixed effects in multilevel models is approximated by a t-distribution. The degrees of freedom of the t-distribution are unknown except when data are balanced and involve simple covariance structures. The degrees of freedom can be approximated using the Satterthwaite (1946) or Satterthwaite-like approximations (Kenward & Roger, 1997). Baldwin, Bauer, et al. (2011) showed that Satterthwaite degrees of freedom maintained the nominal Type I error rate for fixed effects in partially clustered designs with normal outcomes, but other approaches to degrees of freedom did not.

Although the Kenward-Roger adjustment aids fixed effects inferences, it does not solve challenges with estimating variance components. When the number of clusters is small, which is extremely common in the social sciences (especially intervention research), the sampling variability for variances is large. Consequently, likelihood estimates of the variance components tend to vary considerably, leading to drastically over- and underestimated values of the population variance (cf. Baldwin, Murray, et al., 2011; Crits-Christoph et al., 1991). This can lead to problems estimating parameters at the boundary of the parameter space—0 for variances and -1/1 for correlations. Boundary problems are common for cluster variances. Even when the population cluster variance is greater than 0, likelihood methods will often estimate the cluster variance as 0 when the number of clusters is small because there is limited information about between-cluster variance in the data due to sampling error. Thus, researchers will wrongly assume that the between-cluster variance is zero. Furthermore, the adjustments to standard errors and degrees of freedom for fixed effects do not apply when variance and correlation parameters are estimated at the boundary (Kenward & Roger, 2009), so the adjustment for uncertainty in estimation of the variance components is not taken into account.

Computing confidence intervals for variance components and intraclass correlations is also challenging. The difficulty is that the sampling distributions for variance components and for intraclass correlations are not normal because of the boundaries on the parameter space. For variance components, the boundary is 0. For intraclass correlations defined as in Equation 5, the boundary is 0 on the low end and 1 on the high end. Because these sampling distributions are not normal, special procedures and adjustments need to be made to construct confidence intervals. Further, these adjustments vary depending on whether the data are balanced and on the structure of the data (e.g., factorial design versus repeated measures design). Entire books have been devoted to detailing the necessary adjustments (Burdick & Graybill, 1992).

In sum, when sample sizes are small and variance structures are complex, inferences about fixed effects, variance components, and intraclass correlations are complicated. Although adjustments to likelihood estimation methods have been proposed to aid inferences about fixed effects, these adjustments do not help with variance components and intraclass correlations. Bayesian estimation represents an alternative to likelihood estimation that can accommodate inferential challenges regarding the fixed effects, variance components, and intraclass correlations.

Bayesian Multilevel Models

At a conceptual level Bayesian inference is simple. We state our knowledge about a theory, including our uncertainty, and use data to update our knowledge. Bayes' theorem provides the mechanism for combining prior knowledge with data and produces the probability of a theory given the data, p(theory|data). Bayes' theorem can be written as

p(theory|data) = p(data|theory) p(theory) / p(data),  (6)

where p(data|theory) is the probability of the data given theory (known as the likelihood), p(theory) is the probability of the theory prior to the data (known as the prior), and p(data) is the probability of the data. In Bayesian computation, p(data) is a normalizing constant, so Bayes' theorem is often written as p(theory|data) ∝ p(data|theory) p(theory). The left-hand side of Equation 6 is known as the posterior distribution because it is the combination of prior information and data. For a more thorough introduction to Bayes' theorem and Bayesian computation see introductory Bayesian texts (e.g., Carlin & Louis, 2009; Gelman, Carlin, Stern, & Rubin, 2003; Kruschke, 2011b; Lynch, 2007).

The posterior distribution makes Bayesian inference unique and separates it from inference based on null hypothesis tests. The posterior distribution is a probability distribution and as such allows us to make probability statements. For example, if the parameter of interest is a mean difference, μ_d, the posterior distribution allows us to make statements such as "the probability that μ_d exceeds 0 is 95%." Contrast this with a one-tailed null hypothesis test where we say "if the null hypothesis is true, the probability of observing results as large or larger than the observed results is 5%." With the posterior distribution we can create 95% intervals (often called credible intervals) that allow us to say that "there is a 95% probability that μ_d falls in this interval." Again, contrast this with 95% confidence intervals where we can say "if we repeated this study many times, 95% of the confidence intervals computed using this method would contain μ_d."
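As a concrete illustration of such probability statements (a sketch of ours, in which simulated draws stand in for the output of a fitted model), both the probability that μ_d exceeds 0 and a 95% credible interval are simple summaries of the posterior draws:

    import numpy as np

    rng = np.random.default_rng(2)
    mu_d_draws = rng.normal(0.25, 0.10, size=40_000)   # placeholder posterior draws of mu_d

    p_positive = np.mean(mu_d_draws > 0)                   # P(mu_d > 0 | data)
    lower, upper = np.percentile(mu_d_draws, [2.5, 97.5])  # 95% credible interval

    print(f"P(mu_d > 0) = {p_positive:.3f}")
    print(f"95% credible interval: [{lower:.3f}, {upper:.3f}]")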
Bayesian methods can also address some of the inferential challenges discussed previously. Bayesian methods inherently deal with the issue of uncertainty in covariance parameters through the prior distribution for the covariance parameters. For example, the posterior distribution for the cluster variance (σ²_u) is a combination of information from the data (the likelihood) and information from the prior distribution, which describes uncertainty about the cluster variance. Consequently, Bayesian models do not require that covariance parameters be considered known when computing other parameters in the model. Thus, point and interval estimates for parameters such as intervention effects automatically take into account uncertainty of the covariance parameters, as well as all other parameters, and do not require additional adjustment.

Determining the appropriate reference distribution is not needed in Bayesian modeling. If a Bayesian model is constructed correctly, then the posterior distribution is a true probability distribution and can be used to make probability statements about the parameter. Consequently, there is no need to use a reference probability distribution or worry about the degrees of freedom of the distribution.

The use of prior information in Bayesian analysis helps deal with the efficiency of estimation and boundary problems faced in likelihood estimation of covariance parameters. Priors help reduce excess sampling variability of covariance parameters by taking advantage of the information contained in the prior through a process called shrinkage or partial pooling (Gelman & Hill, 2007). For example, extreme values of the covariance parameters, which are common when sample sizes are small, will be shifted or "shrunk" toward the prior mean. The prior and data borrow strength from each other to produce the parameter estimate. Because priors can have a significant impact when sample sizes are small, researchers need to be careful and thoughtful about priors to ensure that estimates are not shrunk to implausible values. Shrinkage occurs for all parameters in a fully Bayesian model because priors are assigned to all parameters. Shrinkage also occurs when predicting cluster means (u_j; see Equation 1) in likelihood estimation because the cluster means are given a prior distribution described in Equation 2. The cluster means are often referred to as empirical Bayes estimates because the cluster variance in Equation 2 is estimated from the data.

When sampling error limits the information about covariance parameters in the data, Bayesian estimation will shrink the estimate toward the prior rather than estimating the parameter at a boundary, as often happens when using likelihood estimation methods. Shrinking estimates away from boundaries is important when dealing with parameter uncertainty because Kenward and Roger's adjustment does not work under those situations. As Kenward and Roger (2009) noted:

We have not addressed the issue of covariance parameter estimators lying on a boundary of the parameter space, usually, but not necessarily, zero. This is most common with variance components that are assumed to be positive. The series expansions on which the results above are based do not apply in such settings. The main problem is usually the lack of information in the data at hand on the relevant variance components. Methods using formal sensitivity analyses, or incorporating external information directly through, for example, Bayesian methods, would then seem more appropriate than classical analyses such as those considered here. (p. 2595)

We have not yet discussed interval estimates for covariance parameters and intraclass correlations. The benefits of Bayesian methods in this regard are best explained once Bayesian estimation procedures have been described. Consequently, we now briefly introduce Bayesian estimation procedures.

Bayesian Estimation

Equation 6 highlights one of the primary differences between the likelihood and Bayesian models. Likelihood methods produce point and interval estimates for parameters, whereas Bayesian methods provide the entire probability distribution of parameters given the data (Gelman et al., 2003). Point estimates can be obtained by computing expected values and variances of the posterior distribution, whereas interval estimates can be estimated by identifying the appropriate percentiles of the posterior distribution. For simple problems (e.g., small numbers of parameters), the posterior distribution often follows a known probability distribution—that is, the posterior distribution has a closed form. In those instances, generating point and interval estimates involves computing expected values, variances, and percentiles of the known probability distribution.

Complex models with multiple parameters do not necessarily yield posterior distributions with a closed form. The most common method for dealing with this problem is to use Markov chain Monte Carlo (MCMC) methods to simulate draws from the posterior distribution (see Kruschke, 2011b, for a readable introduction to MCMC estimation for social scientists). As Jackman (2009) stated, "anything we want to know about random variable θ, we can learn by sampling repeatedly from its density, p(θ)" (p. 201).

There are multiple MCMC algorithms, Gibbs and Metropolis-Hastings (MH) being two common algorithms. An explanation of Gibbs sampling requires an understanding of a conditional posterior density. The conditional posterior density for a parameter is the probability distribution for that parameter assuming all other parameters are known. If the draws are made sequentially through all the parameters, assuming the other parameters are known at their current drawn value, then the posterior distributions so produced are known to converge to the full joint posterior of all the parameters.

If the conditional density follows a known probability distribution, then we can simulate values directly from that probability distribution, which is a highly efficient solution. If the conditional density is not one from which a random draw can be obtained, we then need the ability to draw a sample from an arbitrary functional form of a distribution with unknown scale. The MH algorithm has this ability and so it can be used in many analysis situations. However, this flexibility can lead to some inefficiency in estimation and thus require longer MCMC simulations to obtain sufficiently precise estimates. For example, SAS PROC MCMC, which uses an MH algorithm, is somewhat inefficient in estimating the cluster variance for the Body Project data (see the Empirical Example section). Alternative, flexible sampling algorithms, such as slice sampling (see Gelman et al., 2003), can be more efficient than MH. Whether one can use alternative sampling schemes depends upon whether the software used implements alternative samplers or whether the researcher can write the sampler by hand. Appendix B in the online supplemental materials provides an example of an MH algorithm for partially clustered data. Details about various sampling algorithms can be found in Gelman et al. (2003).
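Appendix B gives the authors' full MH sampler for the partially clustered model; the sketch below is only a generic illustration of the random-walk MH idea, not a reproduction of that appendix. The target log_post is a placeholder, here the unnormalized log density of a Gamma(.7, .098) distribution, standing in for the conditional posterior of a positive parameter such as a variance.

    import numpy as np

    def metropolis_hastings(log_post, start, n_iter=50_000, proposal_sd=0.05, seed=0):
        """Random-walk Metropolis-Hastings for a single parameter."""
        rng = np.random.default_rng(seed)
        draws = np.empty(n_iter)
        current, current_lp = start, log_post(start)
        for t in range(n_iter):
            proposal = current + rng.normal(0.0, proposal_sd)  # symmetric proposal
            proposal_lp = log_post(proposal)
            # Accept with probability min(1, posterior ratio), computed on the log scale
            if np.log(rng.uniform()) < proposal_lp - current_lp:
                current, current_lp = proposal, proposal_lp
            draws[t] = current
        return draws

    def log_post(v):
        # Placeholder target: unnormalized Gamma(shape=.7, scale=.098) log density;
        # values at or below 0 have zero posterior mass and are always rejected.
        return -np.inf if v <= 0 else (0.7 - 1.0) * np.log(v) - v / 0.098

    draws = metropolis_hastings(log_post, start=0.05)
    print(draws[10_000:].mean())   # close to the Gamma mean, .7 x .098 = .0686

A Gibbs sampler replaces the proposal-and-accept step with a direct draw from each conditional density whenever that density has a known form.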
A key question when using MCMC is whether the random samples actually constitute a random sample from the posterior distribution for a given parameter—that is, did the MCMC chain converge? Common approaches to assessing convergence include examining traceplots to ensure that the samples span the appropriate parameter space as well as using various convergence diagnostic statistics. Examples of diagnostic statistics include the Geweke diagnostic, the Heidelberger-Welch test of stationarity, and the Gelman-Rubin diagnostic. Interested readers are referred to Jackman (2009) for an introduction to convergence diagnostics for MCMC chains. It is important to note that these diagnostic tests do not establish convergence but, instead, help researchers identify MCMC chains that have not converged.

MCMC chains can be used to compute point and interval estimates for the parameter. A nice feature of MCMC chains is that they can be combined to create chains of new quantities. In a multilevel model, one can compute an intraclass correlation by taking the ratio of the cluster variance to the total variance for each draw from the posterior. This produces a new chain for the intraclass correlation that can be used to construct point and interval estimates. Therefore, in contrast to a multilevel model estimated with REML, constructing interval estimates for all parameters in a Bayesian model is straightforward and intuitive.
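A small sketch of the point about derived quantities (ours; the chains below are simulated placeholders rather than real sampler output): applying Equation 5 to the draws of σ²_u and σ²_eC one draw at a time yields a chain for ρ that can be summarized like any other parameter.

    import numpy as np

    rng = np.random.default_rng(3)
    n_draws = 20_000

    # Placeholder posterior chains for the two variance components
    var_u_chain = rng.lognormal(mean=np.log(0.04), sigma=0.6, size=n_draws)
    var_eC_chain = rng.lognormal(mean=np.log(0.40), sigma=0.2, size=n_draws)

    rho_chain = var_u_chain / (var_u_chain + var_eC_chain)   # Equation 5, applied per draw

    print("posterior mean of rho:", round(float(rho_chain.mean()), 3))
    print("posterior median of rho:", round(float(np.median(rho_chain)), 3))
    print("95% interval:", np.percentile(rho_chain, [2.5, 97.5]).round(3))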
Bayesian Model for Partially Clustered Designs

Likelihoods, priors, and hyperpriors. The first step in any Bayesian analysis is to identify the likelihood for the data, and thus the parameters of greatest interest. The likelihood is dictated by the outcome variable(s). If the outcome is continuous and can be reasonably approximated by a normal distribution, then one can use a normal distribution (or a t-distribution if there are outliers). Likewise, if the outcomes are binary or count, one could use a binomial or Poisson distribution, respectively. In the Body Project data, thin-ideal internalization can be reasonably approximated with a normal distribution. Once the likelihood has been determined, one or more of the parameters in the likelihood may be related in some functional way to other variables. In our example, the mean of the normal likelihood is assumed to be a linear function of the treatment condition and cluster effects.

All parameters in our model need a prior distribution. Prior distributions are also probability distributions that are associated with a parameter rather than the data. The first aspect of the prior distribution to consider is the permissible values for a parameter. Regression coefficients and cluster effects can be positive or negative, whereas variance components can only be positive. Consequently, prior distributions for regression coefficients and cluster effects typically have support for positive and negative values (e.g., a normal distribution), whereas prior distributions for variance components have support only for positive values (e.g., a gamma, inverse gamma, or truncated normal distribution).

Often the parameters of the prior distributions are set to specific values by the researcher. For example, a researcher might say that the prior for a regression coefficient can be approximated with a normal distribution and set the mean of the prior distribution to 0 and the variance to 10. However, sometimes the parameters of the prior distribution are unknown or the researcher wishes to estimate them. For example, a researcher might say that the cluster effects can be approximated with a normal distribution with a mean equal to 0 but an unknown cluster variance, σ²_u. Because the cluster variance is unknown, it also requires a prior distribution; these prior distributions are known as hyperpriors. Hyperpriors also are probability distributions, and the parameters of hyperpriors are referred to as hyperparameters. Bayesian models involving hyperpriors are known as hierarchical models because there is a hierarchy of priors (Gelman et al., 2003). In principle, hyperparameters can be unknown and also have prior distributions. However, with multilevel models it is common to only have one level of hyperpriors and hyperparameters.

An important decision about the prior is the spread of the distribution because the spread of the distribution should reflect the uncertainty regarding the value of a given parameter. When a research area is new and little is known about a topic, researchers often select a prior that is relatively flat. As a research area grows, knowledge about parameter values should increase and priors can contain more information. For example, in the case of the Body Project, published data about the effects of the dissonance intervention compared to control groups can be used to select the shape of the prior for the intervention effect (Stice, Chase, Stormer, & Appel, 2001; Stice, Rohde, Gau, & Shaw, 2009; Stice, Trost, & Chase, 2003). Furthermore, several articles have published estimates of intraclass correlations for group-administered interventions that provide some information regarding how large cluster and residual variances are likely to be (Baldwin, Stice, & Rohde, 2008; Herzog et al., 2002; Imel, Baldwin, Bonus, & Maccoon, 2008; Roberts & Roberts, 2005). External data can be formally included in analyses and serve as prior distributions. For example, Turner, Thompson, and Spiegelhalter (2005) showed how researchers can use external estimates of intraclass correlations as prior distributions in analyses of cluster randomized trials by using the posterior distribution of intraclass correlations from a meta-analysis as a prior distribution for a primary data analysis. Other examples of using data-based priors can be found in Spiegelhalter, Abrams, and Myles (2004).

Researchers may also have practical experience that can be used to determine a prior or have access to experts who can provide information. Methods for prior elicitation from experts include informal discussion, structured interviewing, structured questionnaires, and computer-based elicitation (Spiegelhalter et al., 2004, see pp. 141-148). Even when published data or experts in a specific area are not available, researchers often know more about the parameters than it might seem. For example, the permissible range of an outcome variable provides much information about variability. The upshot is that identifying priors is an important process, and researchers have numerous tools at their disposal for selecting priors.

Likelihood and Priors for the Body Project Data

We used a normal likelihood for the clustered and unclustered conditions:

Y_ij | Clustered ~ N(b0 + b1 + u_j, σ²_eC);  Y_i | Unclustered ~ N(b0, σ²_eU).

We used normal priors for the regression coefficients: b0 ~ N(3, var = 2.25) and b1 ~ N(0, var = 1). The primary outcome for the Body Project, thin-ideal internalization (TII), typically ranges between 1 and 5; thus, we set the mean of the prior distribution for b0 (the mean of the unclustered condition) to the center of the range and set the variability to be fairly wide. We set the mean of the prior for b1, the intervention effect, to 0 and again set the variability to be wide. We also used a normal distribution for the prior for the u's: u_j ~ N(0, var = σ²_u).

The hyperprior for σ²_u should be carefully selected when the number of clusters is small because the posterior distribution will be influenced, sometimes heavily, by the prior distribution. Even flat prior distributions can be influential and lead to misleading posterior distributions (Gelman, 2006). Consequently, selecting plausible prior distributions is essential. We now describe the process we went through in selecting a prior distribution for the variance components.

To help guide our decisions about the prior we used information from another study that used (a) the same intervention conditions, (b) the same measure of TII, and (c) the same age group of participants (Stice et al., 2009). Stice et al. (2009) reported that the post-intervention TII in the clustered condition had a mean of 2.93 and a variance (σ²_Total) of .46. They did not report an intraclass correlation (ρ) for TII, but previous research indicates that most ρ for group-administered interventions are between 0 and .3, with a low probability that ρ will be above .3 (cf. Baldwin et al., 2008; Herzog et al., 2002; Imel et al., 2008; Roberts & Roberts, 2005). With the estimate of σ²_Total from Stice et al. (2009) and a probable range for ρ, we can compute a probable range for the cluster variance (σ²_u):

ρ = σ²_u / σ²_Total.  (7)

We can rearrange Equation 7 to isolate σ²_u: σ²_u = ρ × σ²_Total. The probable range for σ²_u is Lower σ²_u = 0 × .46 = 0, Upper σ²_u = .3 × .46 = .14. Given our uncertainty in our estimate of ρ, we want the prior distributions for σ²_u to reflect that uncertainty.

One method for reflecting uncertainty is to choose a prior that covers the plausible range for the parameters of interest, but does not favor any value over another within that range. For example, one could use a uniform prior distribution as a prior for the variance. The limits for σ²_u could be set at 0 and .14, whereas the limits for σ²_eC could be set at 0 and .46 (the anticipated value of σ²_y). A Uniform(0, .14) prior for σ²_u would provide a strict upper limit on σ²_u. To allow for some estimates of σ²_u to exceed .14, one can extend the upper limit to .23, half of the anticipated value of σ²_y. We increased the upper limit on σ²_eC to .69 for the same reason. To examine the plausibility of these priors, we randomly generated 10,000 values of σ²_u and σ²_eC and constructed ρ using Equation 5 for each set of values. The right panel of Figure 1 depicts the expected distribution of ρ based on the uniform prior distribution. This prior distribution for ρ seems to overestimate ρ, given what has been reported in the literature. This prior distribution for ρ had a mean of 0.3 and 25th, 50th, and 75th percentiles of 0.15, 0.25, and 0.40. Furthermore, there are a nontrivial number of estimates above 0.5.

A second method for reflecting uncertainty is to choose a prior that covers the plausible range for the parameters of interest but provides the most density over the probable range (0-.14 for σ²_u) while still allowing for values outside the plausible range. The gamma distribution has support from 0 to ∞. The gamma distribution has a shape (α) and scale (β) parameter, and the mean of the gamma distribution is αβ and the variance is αβ². Given our prior expectations regarding σ²_u, we selected α and β parameters that provide the highest density between 0 and .14. We set the expected value for the prior distribution to be in the center of the 0-.14 range. A gamma distribution with α = .7 and β = .098 provides the appropriate coverage—85% of the distribution is below .14. We arrived at these particular values by trying different values of α and β to find a set that met the criteria described previously. We went through the same process for selecting prior distributions for σ²_eC and σ²_eU. For σ²_eC we selected Gamma(13, .03) and for σ²_eU we selected Gamma(9, .03). The difference in prior distributions for the residual terms reflects the fact that published data indicate that residual variance tends to be larger in intervention conditions than control conditions (Bergin, 1966).

We randomly generated 10,000 values of σ²_u and σ²_eC using the gamma priors and then constructed ρ. The left panel of Figure 1 shows the distribution of ρ based on the gamma priors. The prior distribution for ρ is closer to estimates reported in the literature. This prior distribution for ρ had a mean of 0.14 and 25th, 50th, and 75th percentiles of 0.03, 0.09, and 0.21. Furthermore, the gamma priors allow for larger values of ρ, but the large values are improbable. Given that the gamma priors conform more closely to what we expect given previous data, we chose the gamma priors for the Body Project data and as the primary priors for the simulation study.

Figure 1. Prior distributions for the intraclass correlations (ρ) based on 10,000 draws from either a gamma or uniform prior.
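The prior-predictive check just described (the basis for the left panel of Figure 1) is straightforward to reproduce. The sketch below is an illustrative version of ours, not the authors' code, using the Gamma(.7, .098) prior for σ²_u and the Gamma(13, .03) prior for σ²_eC given above.

    import numpy as np

    rng = np.random.default_rng(4)
    n_draws = 10_000

    var_u = rng.gamma(shape=0.7, scale=0.098, size=n_draws)    # prior draws of sigma^2_u
    var_eC = rng.gamma(shape=13.0, scale=0.03, size=n_draws)   # prior draws of sigma^2_eC
    rho = var_u / (var_u + var_eC)                             # implied prior on rho (Equation 5)

    print("mean of rho:", round(float(rho.mean()), 2))
    print("25th/50th/75th percentiles:", np.percentile(rho, [25, 50, 75]).round(2))
    print("proportion of sigma^2_u draws below .14:", round(float(np.mean(var_u < 0.14)), 2))

The same few lines with uniform draws in place of the gamma draws reproduce the check behind the right panel of Figure 1.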
Before discussing the results of the Bayesian analysis of the Body Project, we present a Monte Carlo simulation study that directly compares the performance of likelihood methods using the Kenward-Roger adjustment to Bayesian methods for partially clustered data with respect to bias, efficiency, and coverage. We focus on small sample sizes because small samples are common and because small samples are where estimation will be most difficult.

Simulation Study

We generated data to be similar to the portion of the Body Project data we evaluate. Our simulation included one clustered (X_ij = 1) and one unclustered (X_ij = 0) condition. Based on estimates from another partially clustered eating disorder prevention study (Stice et al., 2009), we set the total variance in the clustered condition to σ²_1 = .46 and the total variance in the unclustered condition to σ²_0 = .27. We set the intervention effect to zero. Data in the clustered condition were generated according to the following model:

Y_ij | (X_ij = 1) = u_j + e_ij,  (8)

where u_j = √ρ · z_j, z_j ~ N(0, .46), and e_ij = √(1 − ρ) · z_ij, z_ij ~ N(0, .46). The data in the unclustered condition were generated according to the following model:

Y_ij | (X_ij = 0) = e_ij,  (9)

where e_ij = z_ij, z_ij ~ N(0, .27).

We varied the number of clusters (c), cluster size (m), and intraclass correlation (ρ). We selected values for these variables that reflect partially clustered studies reported in the literature. We set c to 8 or 16, and we set m to be 5 or 15. We chose small values for c and m because psychosocial intervention studies typically have small numbers of clusters and small cluster sizes (Baldwin, Bauer, et al., 2011). We set ρ to be .05 or .15. When ρ = .05 then σ²_u = 0.023, and when ρ = .15 then σ²_u = 0.069. Few estimates of ρ have been reported for outcome variables; however, of the estimates reported, nearly all are less than .30 and most are less than .15 (Baldwin et al., 2008; Herzog et al., 2002; Imel et al., 2008; Roberts & Roberts, 2005).

For each combination of c, m, and ρ, we generated 5,000 data sets. For each data set we fit the model described in Equations 1-4 using either restricted maximum likelihood (REML) or Bayesian methods. We used PROC MIXED in SAS (Littell et al., 2006) to estimate the REML models. For the REML models, we computed confidence intervals for the intervention effect using the Kenward-Roger correction, and we computed the confidence intervals for the cluster variance using the upper and lower limits provided by PROC MIXED, as this is what most applied researchers would use to construct confidence intervals for variance components. The confidence intervals for the variance components assume infinite degrees of freedom.

We used PROC MCMC in SAS, which uses a Metropolis-Hastings (MH) sampler (see Appendix B), to perform the Bayesian estimation. Based on pilot simulation data, we used 50,000 MCMC iterations, with 10,000 burn-in iterations. We thinned the chain by taking every 10th observation to reduce the amount of data we needed to store at each iteration of the Monte Carlo simulation. We used the normal prior distributions described previously for the regression coefficients and cluster effects and the gamma prior distributions for the variances. We constructed 95% interval estimates using the 2.5% and 97.5% percentiles of the posterior distribution. We report both the mean and median of the posterior distribution as point estimates of the posterior distribution.
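For reference, the data-generating model in Equations 8 and 9 can be written compactly as below. This is an illustrative sketch of ours; the fitting step (PROC MIXED for REML and PROC MCMC for the Bayesian models in the study) is left abstract, and the size of the unclustered arm, which the text does not state, is set equal to the clustered arm purely for illustration.

    import numpy as np

    def generate_dataset(c, m, rho, rng):
        """One simulated partially clustered data set with a zero intervention effect."""
        # Clustered condition (X_ij = 1): Y_ij = u_j + e_ij, Equation 8
        u = np.sqrt(rho) * rng.normal(0.0, np.sqrt(0.46), size=c)            # cluster effects
        e_c = np.sqrt(1.0 - rho) * rng.normal(0.0, np.sqrt(0.46), size=c * m)
        y_clustered = np.repeat(u, m) + e_c
        # Unclustered condition (X_ij = 0): Y_ij = e_ij, Equation 9
        # (unclustered n set to c * m here as a placeholder)
        y_unclustered = rng.normal(0.0, np.sqrt(0.27), size=c * m)
        return y_clustered, y_unclustered

    rng = np.random.default_rng(5)
    # Design cells crossed in the study: c in {8, 16}, m in {5, 15}, rho in {.05, .15},
    # with 5,000 replications per cell; one cell is generated here as a demonstration.
    y1, y0 = generate_dataset(c=8, m=5, rho=0.05, rng=rng)
    print(y1.shape, y0.shape, round(y1.var(ddof=1), 3), round(y0.var(ddof=1), 3))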
We focus on the mean and median as point estimates as opposed to other point estimates (e.g., the mode) for two reasons. First, the mean and median make the most sense from the perspective of statistical decision theory (Carlin & Louis, 2009; Ferguson, 1967). In decision theory point estimates are assigned an expected loss, which reflects how far the point estimate is from a quantity of interest—in our case the population value. Loss will be lower the closer the estimate is to the parameter of interest. The loss function determines the expected loss. Carlin and Louis (2009) provided a brief introduction to various loss functions and the situations in which they may be applied. For example, if squared error loss is the loss function, then the mean is the optimal estimator, and if absolute error is the loss function, then the median is the optimal estimator. The mode is the optimal estimator under the 0-1 loss function. If the estimate equals the parameter then loss is 0; if the estimate does not equal the parameter then loss is 1. The 0-1 loss function is most appropriate for discrete parameter spaces, as it is possible for an estimate to actually equal the population value. All of our parameters are continuous parameters, and thus the squared error and absolute error loss functions are consistent with the model. The second reason for focusing on the mean and median is largely practical—the most commonly used Bayesian software (e.g., JAGS/BUGS, PROC MCMC) provides the mean and the median as output. Given that many researchers will focus on the output of the program when reporting results of their analysis, we focus on the mean and median.

We examined the bias, efficiency, and coverage rates for the intervention effect and the cluster variance. Results for the other parameters are available from the first author. Bias was defined as the difference between the average estimate for a given parameter across the replications and the population value. Efficiency was defined as the square root of the average squared deviation between an estimate of a given parameter and the population value (root-mean-square error). Finally, coverage was defined as the proportion of interval estimates that included the population value.

Table 1 presents coverage rates for the intervention effect and the cluster variance for the REML and MCMC models. Coverage rates for the intervention effect were close to 95% for both REML and MCMC. The coverage rates for the cluster variance tell a different story. First, PROC MIXED had difficulty producing confidence intervals for the cluster variance, especially when sample sizes were small or when the intraclass correlation was small. Thus, the calculated coverage rates for REML were not based on all 5,000 replications (see Table 1). Second, even when PROC MIXED provided confidence intervals, they were well below 95% when cluster size was small and the intraclass correlation was small. Furthermore, coverage rates for REML tended to be worse than for MCMC. As noted above, these differences occur because the likelihood method for constructing confidence intervals assumes that the sampling distribution of the cluster variance is normal, when it is not. Bayesian methods do not make this assumption.

Table 1
Coverage Rate for the Treatment Effect and Cluster Variance

                        c    m     ρ    MCMC interval   REML interval
Intervention effect     8    5   .05        .96             .96
                        8    5   .15        .95             .95
                        8   15   .05        .97             .96
                        8   15   .15        .94             .95
                       16    5   .05        .95             .95
                       16    5   .15        .94             .95
                       16   15   .05        .95             .95
                       16   15   .15        .95             .95
Cluster variance        8    5   .05        .99             .76 (.42)
                        8    5   .15        .99             .91 (.21)
                        8   15   .05        .98             .91 (.22)
                        8   15   .15        .98             .96 (.03)
                       16    5   .05        .99             .80 (.35)
                       16    5   .15        .99             .92 (.09)
                       16   15   .05        .98             .93 (.11)
                       16   15   .15        .96             .96 (.003)

Note. c = number of clusters in the clustered condition; m = cluster size; ρ = intraclass correlation; REML = restricted maximum likelihood; MCMC = Markov chain Monte Carlo. The MCMC models used gamma priors for the variances. Numbers in parentheses are the proportion of the Monte Carlo replications for which PROC MIXED could not compute a confidence interval for the cluster variance. Thus, the REML coverage rates for the cluster variance are based only on the simulation replications where PROC MIXED could compute a confidence interval.

Table 2 presents bias (×10,000) and root-mean-square error (RMSE; ×100) results for the intervention effect and cluster variance. For the intervention effect, both bias and RMSE were minimal for the likelihood and MCMC models and were identical to one another out to four decimal places. In contrast, bias and RMSE for the cluster variance were substantial and differed by estimation method. The equivalence of the likelihood and MCMC models for the intervention effect is not surprising because it is easier and requires less information to estimate a mean well than it does a variance.

Table 2
Bias (×10,000) and Root-Mean-Square Error (RMSE; ×100) for the Treatment Effect and Cluster Variance

                            REML           MCMC mean       MCMC median
  c    m     ρ          Bias   RMSE      Bias   RMSE      Bias   RMSE
Intervention effect
  8    5   .05            -1   1.44        -1   1.44        -1   1.44
  8    5   .15            56   1.59        56   1.59        56   1.59
  8   15   .05            12   0.93        12   0.93        12   0.93
  8   15   .15           -17   1.19       -17   1.19       -17   1.19
 16    5   .05            18   1.02        18   1.02        18   1.02
 16    5   .15             6   1.13         6   1.13         6   1.13
 16   15   .05             0   0.66        -1   0.66        -1   0.66
 16   15   .15             1   0.83         1   0.83         1   0.83
Cluster variance
  8    5   .05           130   5.01       271   3.70       123   2.64
  8    5   .15            43   7.36       -10   3.70      -164   3.96
  8   15   .05            22   2.60       155   2.67        57   2.09
  8   15   .15            -6   4.92        41   3.80       -77   3.66
 16    5   .05            68   3.59       178   2.93        82   2.40
 16    5   .15            17   5.38       -17   3.75      -118   3.88
 16   15   .05             0   1.81        70   1.85        19   1.64
 16   15   .15            -2   3.46        38   3.18       -32   3.03

Note. c = number of clusters in the clustered condition; m = cluster size; ρ = intraclass correlation; REML = restricted maximum likelihood; MCMC = Markov chain Monte Carlo. The MCMC models used the gamma priors for the variances.

The difficulty in estimating variances is reflected in the increased RMSE values for the cluster variance compared to the intervention effect regardless of estimation method. The likelihood and MCMC models performed differently when estimating the cluster variance, and the differences reflect a bias-efficiency tradeoff (Greenland, 2000), as we discuss in the Conclusions section. Specifically, likelihood estimates are less biased in the long run than the MCMC estimates but are more variable than MCMC estimates in any given data set. In other words, if we repeat a study many times, REML estimates, when averaged, will recover the population cluster variance, whereas MCMC estimates will over- or underestimate the population value. However, in any given study, REML estimates will be farther from the population value, and thus have more error, than the MCMC estimates. Given that researchers do not repeat the same study, in the same way (including sample size, sampling strategy, measures, population, etc.), many times, Bayesian methods have a distinct advantage over likelihood methods when estimating cluster variances.

This advantage occurs because the cluster variances are shrunk toward the prior for the cluster variance, which keeps estimates more plausible. Figures 2 and 3 illustrate the effect of the prior. Figure 2 is a scatterplot of the REML estimate (x-axis) and MCMC mean estimate (y-axis). If all points fell on the dashed line, then the REML and MCMC models would be providing identical estimates of the cluster variance. When points fall below the line, the REML model produced a larger estimate than the MCMC model, and when points fall above the line, the MCMC model produced larger estimates than the REML model. The solid line represents the population cluster variance when the intraclass correlation is either .05 or .15. In the simulations, the MCMC model produced larger estimates than the REML model when estimates were below the population value. When estimates were slightly above the population value, MCMC models produced larger estimates than the REML model, which is consistent with the positive bias observed in Table 2. When the estimates were well above the population value, the MCMC model produced smaller estimates than the REML model. These distinctions become less important as sample sizes get larger.
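For completeness, the bias, RMSE, and coverage values reported in Tables 1 and 2 follow directly from the definitions given earlier. The sketch below (ours, with placeholder replication-level results rather than the study's actual estimates) shows how they would be computed within one cell of the design.

    import numpy as np

    rng = np.random.default_rng(6)
    true_value = 0.023   # population cluster variance when rho = .05

    # Placeholder replication-level output (one entry per simulated data set)
    estimates = rng.gamma(shape=2.0, scale=true_value / 2.0, size=5_000)
    lower, upper = estimates * 0.3, estimates * 3.0             # placeholder interval limits

    bias = estimates.mean() - true_value                              # average estimate minus truth
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))            # root-mean-square error
    coverage = np.mean((lower <= true_value) & (true_value <= upper)) # proportion covering truth

    print(f"bias x 10,000 = {bias * 1e4:.0f}; RMSE x 100 = {rmse * 1e2:.2f}; coverage = {coverage:.2f}")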
Figure 3 shows the distribution of the MCMC mean estimate of the cluster variance when the REML model fixed the cluster variance to 0. When a model fixes the cluster variance to 0 it is a Type M or magnitude error (Gelman, Hill, & Yajima, 2012) because it fixes a positive quantity to 0. The REML model led to Type M errors most commonly when both the number of clusters and cluster size were small and/or when the population intraclass correlation was small. As seen in Figure 3, the Bayesian estimates of the cluster variance were shrunk toward the prior mean (the dashed line) when the REML model estimated the variance as 0. Figure 3 also demonstrates that the shrunken Bayesian estimates are closer to the population cluster variance (the solid line) than the REML estimate of 0.

Although we prefer the gamma priors for the variance components because they are "thoughtful" priors that consider what has been reported about variance components in clustered intervention studies, some researchers prefer uniform priors for variance components, such as those described previously, because the uniform prior is completely flat across the range of plausible values. To investigate the performance of the uniform priors for the variance components, we added an MCMC model with a Uniform(0, .23) prior for σ²_u and a Uniform(0, .69) prior for both σ²_eC and σ²_eU to the c = 8, m = 5, and ρ = .05 cell. We limited the simulation to just this cell because the choice of prior becomes less important the more data one has. Thus, any differences in results across methods observed in this cell can be considered the biggest differences likely to be observed.

Both bias and RMSE for the intervention effect were negligible when using a uniform prior and were identical to what was observed with the other estimation methods. The coverage rate for the intervention effect when using a uniform prior was 98%, which is higher than either the REML model or the MCMC model using the gamma priors. The higher coverage rate occurred because the estimates of the cluster variance using the uniform prior were two to three times more upwardly biased than the estimates for the other models. The uniform prior also led to an increase in the RMSE. For example, the RMSE (×100) for the Bayesian mean estimate of the cluster variance was 3.7 using gamma priors and 6.8 using the uniform priors. Despite these differences, the coverage rate for the cluster variance was 95%.

In sum, the simulation results indicate that the likelihood and Bayesian methods perform equally well in estimating intervention effects with respect to bias, efficiency, and coverage. Further, Bayesian methods have nearly identical coverage for the intervention effect as the Kenward-Roger adjusted REML estimates. The performance of the Bayesian and likelihood methods diverged with respect to the cluster variance. Bayesian estimates, when using a carefully chosen prior for the cluster variance, were more biased but more efficient than likelihood estimates, although this difference became less apparent as sample size increased. The shrinkage of estimates in the Bayesian model kept the cluster variance estimates for any given data set closer to the population cluster variance than did the likelihood estimates. However, the uniform prior performed more poorly with respect to the cluster variance than the other models. These results underscore the importance of carefully selecting priors when sample sizes are small.
Figure 2. The relationship between the estimated cluster variance using Markov chain Monte Carlo (MCMC) means and restricted maximum likelihood (REML) point estimates. Results are stratified by intraclass correlation (ρ), number of clusters (c), and cluster size (m). If all points fell on the dashed line, then MCMC and REML would be providing identical estimates. Note that the axes have wider limits in the lower panels to accommodate the larger estimates of the cluster variance.

Empirical Example: The Body Project

We evaluated intervention effects in The Body Project using the model described in Equations 1–4 (see Table 3). The primary outcome variable was TII, with lower values indicating a good outcome. Intervention condition was coded as 1 for the DI condition and 0 for the AO condition. We estimated the model using both REML and Bayesian methods. We used PROC MIXED to fit the REML model, and we used the Kenward-Roger adjusted standard errors and degrees of freedom. We used PROC MCMC, which uses a Metropolis-Hastings (MH) algorithm, to fit the Bayesian model. To evaluate how robust the results were to the prior distribution used for the variance components, we estimated two models, one model with the gamma priors and one with the uniform priors for the variance components. In our experience, the algorithm PROC MCMC uses can produce slow mixing for variances, especially when the variance is small. Thus, longer chains are sometimes needed to obtain precise estimates. Consequently, we initially fit a model with 10,000 burn-in iterations and 100,000 iterations of the MH algorithm.
Figure 3. Bayesian estimate of the cluster variance (i.e., mean of the posterior distribution) when the restricted maximum likelihood (REML) estimate was 0, for different values of the intraclass correlation (ρ). N is the number of replications where the REML model fixed the cluster variance to zero. The dashed line is equal to the mean of the prior distribution, and the solid line is equal to the population variance. c = number of clusters; m = cluster size.

For both MCMC models presented in Table 3, convergence diagnostics were consistent with convergence of the MCMC chains for all parameters. Traceplots suggested adequate exploration of the parameter space, all parameters passed the Heidelberger-Welch test of stationarity, and the Geweke diagnostic was not significant for any parameter. We also ran two more independent chains and computed the Gelman-Rubin diagnostic—all parameters passed. However, the traceplot and Raftery-Lewis diagnostic indicated that there was substantial autocorrelation in the cluster variance chain, indicating that more iterations may be needed to obtain sufficient precision of interval estimates. This occurred because the posterior distribution of the cluster variance is right-skewed; the MH algorithm spends many iterations exploring small values of the cluster variance, which leads to high autocorrelation.

As a sensitivity analysis, we used four strategies for assessing the impact of the autocorrelation on the results. First, we used a reparameterized model using hierarchical centering (sketched below). Second, we ran the sampler for 1,000,000 iterations. Third, we estimated the model in JAGS (see Appendix for annotated code), which uses a slice sampler, rather than MH, to sample the cluster variance. In our experience, the slice sampler can be more efficient and thus not produce as much autocorrelation. Fourth, we ran the sampler for 500,000 iterations and thinned the posterior distribution by taking every 100th iteration.¹ The reparameterized model did not improve the mixing of the cluster variance chain. Both the model with 1,000,000 iterations and the JAGS analysis, which did indeed have less autocorrelation, produced estimates that were similar to the initial analysis with 100,000 iterations. Likewise, the thinned analysis had less autocorrelation and produced similar results. Consequently, we report the results for the initial analysis in PROC MCMC in Table 3.
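The hierarchical-centering reparameterization mentioned in the first strategy can be sketched as follows, again using the placeholder variable names and priors from the earlier model block; this is an illustration rather than the code used for the reported analysis. Instead of adding a zero-mean cluster deviation to the fixed effects, the clustered arm is modeled through cluster means centered on the treatment-arm mean b0 + b1, a parameterization that in some models improves mixing of the cluster variance chain.

model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu[i], prec[i])
    # Treated participants (tx = 1) are centered on their cluster mean eta[cluster[i]];
    # controls (tx = 0) are centered directly on b0. As before, cluster[i] needs a
    # valid placeholder index for controls.
    mu[i] <- tx[i] * eta[cluster[i]] + (1 - tx[i]) * b0
    prec[i] <- tx[i] * tau.eC + (1 - tx[i]) * tau.eU
  }
  # Cluster means absorb the fixed effects: they are drawn around the
  # treatment-arm mean rather than around zero.
  mu.trt <- b0 + b1
  for (j in 1:J) {
    eta[j] ~ dnorm(mu.trt, tau.u)
  }

  b0 ~ dnorm(0, 1.0E-6)
  b1 ~ dnorm(0, 1.0E-6)

  # Same placeholder priors as in the earlier sketch.
  sigma2.u  ~ dgamma(1.5, 30)
  sigma2.eC ~ dgamma(2, 5)
  sigma2.eU ~ dgamma(2, 5)

  tau.u  <- 1 / sigma2.u
  tau.eC <- 1 / sigma2.eC
  tau.eU <- 1 / sigma2.eU
}

The two parameterizations imply the same model for the data; only the behavior of the sampler differs, which is why the reparameterized fit serves as a sensitivity check rather than a substantive change.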
In general, parameter estimates were equivalent across models, with the exception of the interval estimate for the cluster variance and the interval estimate for the intraclass correlation. Thus, the models provide roughly equivalent results about the mean structure of the data and slightly different results about the variance structure of the data. Specifically, all models provided similar point and interval estimates of the intercept (b0) and intervention effect (b1). The results indicated that participants in the DI condition had significantly less thin-ideal internalization than the AO condition. The point estimates of the cluster variance differed slightly across models, with both of the Bayesian models providing slightly larger estimates. The 95% interval estimate for the cluster variance was considerably narrower in both the Bayesian models (Gamma prior: 0.001, 0.11; Uniform prior: 0.005, 0.17) than the REML model (0.01, 9.90). In fact, the REML interval was so wide as to be virtually useless. Point and interval estimates for the residual variances were roughly equivalent across models. Finally, the intraclass correlation estimate was larger in both Bayesian models (Gamma prior: 0.07; Uniform prior: 0.12) than the REML model (0.05), because the cluster variance estimates were higher in the Bayesian analyses. Further, we were able to easily compute a 95% interval estimate for the intraclass correlation in the Bayesian models (Gamma prior: 0.003, 0.23; Uniform prior: 0.01, 0.32)—an easily computed interval estimate was not available in the REML model.

Differences between the Bayesian model with the gamma prior and the model with the uniform prior are consistent with the simulation results. Namely, point and interval estimates for the fixed effects were similar, whereas the uniform prior led to a larger point estimate of the cluster variance and a larger upper bound for the interval estimate. Similarly, there were slight differences between the point and interval estimates for the residual variances. The differences in the estimates were not major. Thus, the results of the Body Project analysis were somewhat sensitive to the specific prior distribution but not in a way that would change the interpretation of the model in a meaningful way.

¹ Thinning of MCMC chains is used for two primary reasons. First, a researcher can thin a chain to reduce the amount of information that must be stored (as we did in our simulations). Second, thinning can be used to help reduce autocorrelation in an MCMC chain. Thinning can reduce autocorrelation, but it can also reduce the precision of the estimates (Link & Eaton, 2012). Consequently, we present our unthinned analysis. We thank John Kruschke for his comments on this issue.
Table 3
Results for the Bayesian Analysis of the Body Project Data

Parameter   REML                    Gamma prior             Uniform prior
b0          3.52 [3.41, 3.64]       3.52 [3.41, 3.63]       3.52 [3.41, 3.64]
b1          −0.39 [−0.58, −0.21]    −0.39 [−0.58, −0.21]    −0.39 [−0.59, −0.19]
σ²u         0.02 [0.01, 9.90]       0.03 [0.001, 0.11]      0.06 [0.005, 0.17]
σ²eC        0.38 [0.29, 0.51]       0.39 [0.30, 0.49]       0.39 [0.29, 0.51]
σ²eU        0.43 [0.34, 0.56]       0.41 [0.33, 0.51]       0.45 [0.35, 0.57]
ρ           0.05                    0.07 [0.003, 0.23]      0.12 [0.01, 0.32]

Note. REML = restricted maximum likelihood; ρ = intraclass correlation. Values inside the brackets are 95% confidence intervals (REML) and 95% credible intervals (gamma and uniform prior). Point estimates for the Markov chain Monte Carlo analysis are means of the posterior distributions.
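As a reading aid for Table 3, and assuming the intraclass correlation is defined with the clustered-arm residual variance, ρ = σ²u / (σ²u + σ²eC), the REML value can be recovered directly from the tabled variance components: 0.02 / (0.02 + 0.38) = 0.05. For the Bayesian columns, the tabled ρ is the mean of the posterior distribution of the ratio itself (per the table note), so it need not exactly equal the ratio of the posterior means of the variance components; the ratio of means is 0.03 / (0.03 + 0.39) ≈ 0.07 for the gamma-prior model but 0.06 / (0.06 + 0.39) ≈ 0.13 versus the tabled 0.12 for the uniform-prior model.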
Conclusions

Given the prevalence of clustered data in the social and behavioral sciences, it is not surprising to see multilevel models used frequently. Furthermore, most major statistical packages have easy-to-use functions for estimating multilevel models using likelihood estimation methods. This has been a boon to researchers, as it has allowed more researchers to take advantage of the benefits of multilevel models (e.g., accounting for multiple sources of variance in a single model, modeling both cluster-level and individual-level effects, etc.). However, likelihood methods often need to be adjusted to produce correct coverage rates for fixed effects. These adjustments perform well but do not address inferential problems for variance components. We have detailed how a Bayesian approach to multilevel models is a viable alternative estimation approach, providing some of the same benefits as the adjusted likelihood approach, such as adequate coverage for fixed effects, as well as additional advantages, such as avoiding boundary problems for variance components and providing interval estimates for all parameters.

The simulation showed that Bayesian point estimates for the cluster variance, when a carefully chosen prior was used, were more biased than the likelihood estimates but also more efficient. We noted that this was an instance of the bias-efficiency tradeoff; namely, estimators that improve upon the efficiency of an unbiased estimator typically produce some bias (Carlin & Louis, 2009). Greenland (2000) used a metaphor of shooting at a target with three rifles to illustrate the bias versus efficiency idea. The bulls-eye of the target represents the population parameter, shots inside the ring represent an estimate near the population parameter, and the rifles represent three estimators of the parameter. Shots from Rifle 1 are evenly scattered around the bulls-eye, but only 20% of the shots are inside the ring. Rifle 1 thus represents an estimator that is unbiased but highly variable—the shots are evenly and widely scattered around the bulls-eye. Most shots from Rifle 2 are to the left of the bulls-eye, but 75% are within the ring. Rifle 2 represents an estimator that is biased but efficient—the shots are consistently left but close together and close to the population value. Shots from Rifle 3 are consistently to the right of the bulls-eye, close together, and outside of the ring. Thus, Rifle 3 represents a biased and inconsistent estimator—the shots are consistently right and far away from the bulls-eye. The simulations indicated that the likelihood estimates of the cluster variance were most similar to Rifle 1, especially when sample sizes were small. In contrast, Bayesian estimates with the gamma prior were most similar to Rifle 2, with a small amount of bias and less variability in the estimates than the likelihood estimates. Bayesian estimates with the uniform prior were most like Rifle 3, with both relatively high bias and variability. Like Greenland (2000), we favor estimators that balance bias and efficiency. Thus, in the context of complex multilevel data, such as partially clustered data, we favor the Bayesian estimator with a carefully chosen prior, as it does the best job of balancing bias and efficiency. We note that these distinctions between methods—likelihood versus Bayesian and amongst priors in Bayesian methods—are most prominent when sample sizes are small. As sample sizes increase, these differences will fade.

Despite the fact that Bayesian estimators do a reasonable job balancing bias and efficiency, some researchers may resist using Bayesian models because they must specify a prior distribution for parameters and would prefer to "let the data speak." This line of reasoning is problematic for four reasons. First, data do not speak until a model, likelihood or Bayesian, has been applied (Greenland, 2006). Second, multilevel models estimated using likelihood methods make assumptions about the likelihood that are not objective (Gelman, 2008). Third, researchers bring information into the study by the way they design a study. For example, treatment researchers make systematic, coherent decisions about what treatments to include in a study that are based on past research and clinical experience. Priors need not be different. Fourth, researchers typically know something about the domain; objectivity does not demand that information be left out.

Even if researchers are comfortable with a prior distribution, they may insist on using flat or diffuse priors. As the simulation study demonstrated, when sample sizes are small, flat priors can influence results and lead to additional bias and decreased efficiency. Indeed, diffuse priors can be unnecessarily inefficient if the priors imply that mathematically possible but extreme outcomes are as probable as nonextreme values or if the priors ignore existing evidence (see Figure 1). Consequently, we recommend that researchers devote the necessary time and effort to construct plausible, thoughtful priors that incorporate existing information about a research area. Like any aspect of the research process, the details of the prior distributions, as well as the methods used to select them, should be well described and justified in a research report.
Researchers have a number of software options for implementing Bayesian methods. Most modern computers are sufficiently fast to fit Bayesian models in a reasonable amount of time. Social scientists wishing to use Bayesian methods to analyze complex multilevel data have three software options. First, they can use general purpose Bayesian modeling software, such as SAS PROC MCMC (SAS Institute, 2009), WinBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2003), JAGS (Plummer, 2003), and PyMC (Patil, Huard, & Fonnesbeck, 2010). Appendix A of the online supplemental material provides annotated code for SAS PROC MCMC and JAGS/BUGS. These general purpose programs are capable of fitting many different models, but because they are general purpose Bayesian programs, their routines are not optimized for multilevel models in particular. However, we have found that they are sufficiently fast for most of our needs. Second, researchers can use multilevel software that has Bayesian estimation options, such as MLwiN (Browne, 2008) or Mplus (Muthén & Muthén, 2010). These packages are nice because they are optimized for multilevel models but lack the flexibility of the general purpose programs. For example, the multilevel-specific software is not as flexible with respect to the specific prior distributions one can specify. Third, researchers can write their own estimation routines in a general purpose software language or statistical package. Appendix B in the online supplemental material provides an example in Python, although any language that has a linear algebra package and good random number generators can be used. Using a general purpose software language provides the most flexibility, but it can be time consuming and error-prone. Further, it makes it difficult to make major changes to the model, as that often requires rewriting huge portions of code.

To be sure, Bayesian approaches to multilevel models, specifically, and data analysis, generally, are not a panacea. Bayesian methods can be challenging to implement because they require sophisticated decisions about likelihood and prior distributions and an understanding of how they should combine to create posterior distributions. A misunderstanding of these can lead to problems in a Bayesian analysis. Indeed, the WinBUGS website comes with a "health warning," noting that "there is no in-built protection against misuse" (http://www.mrc-bsu.cam.ac.uk/bugs/). Additionally, advancements will not come from simply applying Bayesian models. Whether estimating models from a likelihood or Bayesian perspective, researchers must carefully use theory and past research to develop reasonable models. In our own research, we like the fact that Bayesian models force us to consider what likelihood is most appropriate for the data at hand and what we know about the parameters of interest before we see the data. Such questions force us to think more carefully about our data and about how our statistical models match up to theory. We suspect that other researchers will experience a similar benefit.

References

Alegría, M., Canino, G., Shrout, P. E., Woo, M., Duan, N., Vila, D., . . . Meng, X.-L. (2008). Prevalence of mental illness in immigrant and non-immigrant U.S. Latino groups. The American Journal of Psychiatry, 165, 359–369. doi:10.1176/appi.ajp.2007.07040704
Baldwin, S. A., Bauer, D. J., Stice, E., & Rohde, P. (2011). Evaluating models for partially clustered designs. Psychological Methods, 16, 149–165. doi:10.1037/a0023464
Baldwin, S. A., Murray, D. M., & Shadish, W. R. (2005). Empirically supported treatments or Type I errors? Problems with the analysis of data from group-administered treatments. Journal of Consulting and Clinical Psychology, 73, 924–935. doi:10.1037/0022-006X.73.5.924
Baldwin, S. A., Murray, D. M., Shadish, W. R., Pals, S. L., Holland, J. M., Abramowitz, J. S., . . . Watson, J. (2011). Intraclass correlation associated with therapists: Estimates and applications in planning psychotherapy research. Cognitive Behaviour Therapy, 40, 15–33. doi:10.1080/16506073.2010.520731
Baldwin, S. A., Stice, E., & Rohde, P. (2008). Statistical analysis of group-administered intervention data: Reanalysis of two randomized trials. Psychotherapy Research, 18, 365–376. doi:10.1080/10503300701796992
Bauer, D. J., Sterba, S. K., & Hallfors, D. D. (2008). Evaluating group-based interventions when control participants are ungrouped. Multivariate Behavioral Research, 43, 210–236. doi:10.1080/00273170802034810
Bergin, A. E. (1966). Some implications of psychotherapy research for therapeutic practice. Journal of Abnormal Psychology, 71, 235–246. doi:10.1037/h0023577
Browne, W. (2008). MCMC estimation in MLwiN. Bristol, England: Centre for Multilevel Modelling.
Burdick, R. K., & Graybill, F. A. (1992). Confidence intervals on variance components. New York, NY: Marcel Dekker.
Carlin, B. P., & Louis, T. A. (2009). Bayesian methods for data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall/CRC.
Crits-Christoph, P., Baranackie, K., Kurcias, J. S., Beck, A. T., Carroll, K., Perry, K., . . . Zitrin, C. (1991). Meta-analysis of therapist effects in psychotherapy outcome studies. Psychotherapy Research, 1, 81–91. doi:10.1080/10503309112331335511
Ferguson, T. S. (1967). Mathematical statistics: A decision theory approach. New York, NY: Academic Press.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1, 515–534. doi:10.1214/06-BA117A
Gelman, A. (2008). Rejoinder. Bayesian Analysis, 3, 467–478. doi:10.1214/08-BA318REJ
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York, NY: Cambridge University Press.
Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189–211. doi:10.1080/19345747.2011.618213
Greenland, S. (2000). Principles of multilevel modelling. International Journal of Epidemiology, 29, 158–167. doi:10.1093/ije/29.1.158
Greenland, S. (2006). Bayesian perspectives for epidemiological research: I. Foundations and basic methods. International Journal of Epidemiology, 35, 765–775. doi:10.1093/ije/dyi312
Herzog, T. A., Lazev, A. B., Irvin, J. E., Juliano, L. M., Greenbaum, P. E., & Brandon, T. H. (2002). Testing for group membership effects during and after treatment: The example of group therapy for smoking cessation. Behavior Therapy, 33, 29–43. doi:10.1016/S0005-7894(02)80004-1
Howard, G. S., Maxwell, S. E., & Fleming, K. J. (2000). The proof of the pudding: An illustration of the relative strengths of null hypothesis, meta-analysis, and Bayesian analysis. Psychological Methods, 5, 315–332. doi:10.1037/1082-989X.5.3.315
Imel, Z., Baldwin, S., Bonus, K., & Maccoon, D. (2008). Beyond the individual: Group effects in mindfulness-based stress reduction. Psychotherapy Research, 18, 735–742. doi:10.1080/10503300802326038
Jackman, S. (2009). Bayesian analysis for the social sciences. New York, NY: Wiley. doi:10.1002/9780470686621
Kenward, M. G., & Roger, J. H. (1997). Small sample inference for fixed effects from restricted maximum likelihood. Biometrics, 53, 983–997. doi:10.2307/2533558
Kenward, M. G., & Roger, J. H. (2009). An improved approximation to the precision of fixed effects from restricted maximum likelihood. Computational Statistics and Data Analysis, 53, 2583–2595. doi:10.1016/j.csda.2008.12.013
Kruschke, J. K. (2011a). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6, 299–312. doi:10.1177/1745691611406925
Kruschke, J. K. (2011b). Doing Bayesian data analysis: A tutorial with R and BUGS. Burlington, MA: Academic Press.
Link, W. A., & Eaton, M. J. (2012). On thinning of chains in MCMC. Methods in Ecology and Evolution, 3, 112–115. doi:10.1111/j.2041-210X.2011.00131.x
Littell, R. C., Milliken, G. A., Stroup, W. W., Wolfinger, R. D., & Schabenberger, O. (2006). SAS for mixed models. Cary, NC: SAS Institute.
Lynch, S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scientists. New York, NY: Springer. doi:10.1007/978-0-387-71265-9
Muthén, L. K., & Muthén, B. O. (2010). Mplus user's guide (6th ed.). Los Angeles, CA: Muthén & Muthén.
Patil, A., Huard, D., & Fonnesbeck, C. J. (2010). PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software, 35, 1–81.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd international workshop in distributed statistical computing. Retrieved from http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/
Roberts, C., & Roberts, S. A. (2005). Design and analysis of clinical trials with clustering effects due to treatment. Clinical Trials, 2, 152–162. doi:10.1191/1740774505cn076oa
SAS Institute. (2009). SAS/STAT 9.2 user's guide. Cary, NC: Author.
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2, 110–114. doi:10.2307/3002019
Spiegelhalter, D. J., Abrams, K. R., & Myles, J. P. (2004). Bayesian approaches to clinical trials and health-care evaluation. Hoboken, NJ: Wiley.
Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2003). WinBUGS Version 1.4 user manual. Cambridge, England: MRC Biostatistics Unit.
Stice, E., Chase, A., Stormer, S., & Appel, A. (2001). A randomized trial of a dissonance-based eating disorder prevention program. International Journal of Eating Disorders, 29, 247–262. doi:10.1002/eat.1016
Stice, E., Rohde, P., Gau, J., & Shaw, H. (2009). An effectiveness trial of a dissonance-based eating disorder prevention program for high-risk adolescent girls. Journal of Consulting and Clinical Psychology, 77, 825–834. doi:10.1037/a0016132
Stice, E., Shaw, H., Burton, E., & Wade, E. (2006). Dissonance and healthy weight eating disorder prevention programs: A randomized efficacy trial. Journal of Consulting and Clinical Psychology, 74, 263–275. doi:10.1037/0022-006X.74.2.263
Stice, E., Trost, A., & Chase, A. (2003). Healthy weight control and dissonance-based eating disorder prevention programs: Results from a controlled trial. International Journal of Eating Disorders, 33, 10–21. doi:10.1002/eat.10109
Turner, R. M., Thompson, S. G., & Spiegelhalter, D. J. (2005). Prior distributions for the intracluster correlation coefficient, based on multiple previous estimates, and their application in cluster randomized trials. Clinical Trials, 2, 108–118. doi:10.1191/1740774505cn072oa
Yuan, Y., & MacKinnon, D. P. (2009). Bayesian mediation analysis. Psychological Methods, 14, 301–322. doi:10.1037/a0016972

Received November 21, 2011
Revision received August 29, 2012
Accepted September 6, 2012
