Funding: This work was supported by a fellowship (awarded to S.D.) from the Natural Sciences
and Engineering Research Council of Canada.
Author contributions: S.D. conceived of the initial research question, reviewed literature, and
wrote the initial draft of the tutorial. J.O.U. provided expert commentary throughout the writing
and coding process. A.R.O. provided critical feedback throughout the writing process. J.K.F.
helped write and edit the manuscript and supervised the project.
Data and materials availability: All code and data used for this tutorial can be found at
https://github.com/seandamiandevine/logisticicc or https://osf.io/pnykg/. Both sources contain
the same information and are reproduced for ease of accessibility.
This preprint is a draft; it has been accepted and is pending final publication at Collabra: Psychology.
Abstract
Multilevel modeling techniques have gained traction among experimental psychologists for their
ability to account for dependencies in nested data structures. Increasingly, these techniques are
extended to the analysis of binary data (e.g., correct or incorrect responses). Despite their
popularity, the information in logistic multilevel models is often underutilized when researchers
focus solely on fixed effects and ignore important heterogeneity that exists between participants.
In this tutorial, we review four techniques for estimating and quantifying the relative degree of
between-person variability in logistic multilevel models in an accessible manner using real data.
First, we introduce logistic multilevel modeling, including the interpretation of fixed and random
effects. Second, we review the challenges associated with the estimation and interpretation of
the intraclass correlation coefficient (ICC), which is usually a first, simple step in a linear MLM.
Third, we demonstrate four existing methods of quantifying the ICC in logistic multilevel models
and discuss their relative advantages and disadvantages. Fourth, we present bootstrapping
methods to make statistical inferences about these ICC estimates. To facilitate reuse, we developed R code to implement the discussed techniques, which is provided throughout the text.
1. Introduction
From 2007 to 2017, there was a threefold increase in published psychology articles that use multilevel modeling techniques (Huang, 2018), which is in part due to the nested structure of psychological data, in which observations are grouped within units such as participants. Across multiple areas of psychology, behavioural economics, and other
behavioural sciences, nested data often arise from experiments in which participants have
completed multiple trials of a task. Analyzing these data with traditional statistical tests (e.g.,
fixed-effects ANOVA, linear regression) fails to account for this nested structure and violates the
assumption of independence of observations. When responses come from the same participant,
they are more similar than if they come from independently sampled participants. These response
redundancies deflate the effective sample size, cause incorrect standard errors, and often inflate Type I error rates.
In addition to being nested, the outcome variable of interest in many psychological datasets is often binary (e.g., accuracy or binary choices), rather than continuous.
Such data can be modeled using logistic multilevel models (i.e., logistic mixed-effects
regressions), which model the probability of success/correct choice while handling the
dependencies that arise in nested datasets. However, these techniques are underexploited in
practice. That is, while it is becoming common for researchers to fit multilevel models to nested
experimental data, interpretation of the results is primarily, if not exclusively, focused on the
‘fixed effects’ of the model. As a result, estimates based on the variance of the random effects—e.g., the intraclass correlation (ICC), which captures the relative importance of variability between participants’ responses (or between other types of clusters)—are seldom used to support substantive conclusions.
Why should researchers care about the ICC in their multilevel logistic models? First, it quantifies theoretical phenomena, such as how variable participants’ responses are during a task or how stable an effect is across people. Relatedly, this provides a starting point for explanatory power: if most of the variability in the data comes from differences between persons, then model building with person-level predictors is a useful exercise, whereas if most of the variability in the data comes from within-person variations, then the researcher will want to explore the variability within persons instead. Despite this, it is uncommon for researchers to report or even investigate the random effects variance or the ICC when outcomes are binary, which is inconsistent with best practices for the transparent reporting of statistical results.
To support researchers in expanding their understanding of their data and encourage best
practice, this tutorial explains how to estimate and interpret the ICC in logistic multilevel
models. A tutorial focusing on this topic is needed for at least three reasons: 1) the unique
challenges that computing and interpreting the ICC in logistic multilevel models poses are
seldom addressed in an accessible manner, 2) while other works have provided methods to
compute the ICC in a logistic multilevel context (e.g., Goldstein, 2002; Merlo et al., 2006), there
does not exist a didactic survey of methods aimed at increasing uptake by psychologists (see, for
instance, examples in social epidemiology; Austin & Merlo, 2017; Merlo et al., 2006), and 3)
texts that do address this issue tend to focus on only a single method (e.g., Snijders & Bosker, 2011).

Accordingly, we have four goals. First, for the uninitiated, we provide a brief
introduction to linear multilevel regression (in the Supplemental Materials, sections S6b,
explained below) and logistic multilevel regression, which includes the interpretation of fixed
and random effects. Second, we review the challenges associated with the interpretation and estimation of variance components in logistic multilevel models—namely, in the computation of the intraclass correlation (also known as the
variance partition coefficient, VPC). Third, we explain four existing methods of quantifying the
ICC in logistic multilevel models, highlighting the differences between the methods. Finally, we
describe bootstrapping methods which permit statistical inferences about these variance
estimates and comment on the application of the reviewed techniques for models with additional
predictors and random slopes. Throughout the tutorial we include Code Boxes which illustrate
the concepts discussed in the main text using the R programming language, in a step-by-step
fashion.¹

Our running example comes from experimental psychology (see Section 2), where responses are nested within participants. However, the topics
discussed extend to cases where binary data may be nested within other types of clusters (e.g.,
groups, schools, countries, etc.). Namely, while the computations presented below do not change
based on the nature of the grouping variable in the data, the interpretation of the ICC does. In
contexts where the data are grouped by organizations or “clusters”, such as data collected from
different schools or hospitals, the ICC measures the proportion of the total variance that is attributable to differences between clusters—hence the name “intraclass correlation.” This quantifies the impact of the clustering variable (e.g., schools) as well as how similar subjects within the same cluster are. Another interpretation of the ICC in this context, then, is the average correlation in the outcome between measurements from the same cluster.
¹ For ease of use, functions that compute each measure of interest on a fitted logistic multilevel regression model are available for download from the following GitHub repository:
https://github.com/seandamiandevine/logisticicc. A brief demonstration of how to call these functions is provided in
the Supplemental Materials (S6d).
When participants provide repeated measures (e.g., in a psychological experiment), the ICC
measures the proportion of the total variance that is due to the differences between individuals.
Here, the ICC can provide insights about how individual differences among subjects exert an
influence on the outcome. It is this latter interpretation of the ICC, in the context of a binary outcome, that is the focus of this tutorial. The tutorial comes along with supplementary material, which we refer to throughout (e.g., S1, referring to “supplement section 1”) and which includes all R code, with explanatory narrative, that can be used to reproduce the results highlighted in the main text. The main text of this tutorial is aimed at
someone who has experience executing multilevel models in R and would like to expand their
knowledge to logistic models. For those with less background knowledge, we recommend first reviewing the introductory material in the Supplemental Materials (S6a and S6b).
2. Example Dataset: The Demand Selection Task

Data for this tutorial are taken from a study examining how acute stress engenders avoidance of
cognitively effortful tasks (Bogdanov et al., 2021; data available at https://osf.io/26w4u/). Thirty-
eight young adults (20 female; PID in the dataframe) completed 300 trials of the well-
characterized demand selection task (DST) in blocks of 150 trials, in which participants
repeatedly chose between two nondescript cues that represented the demand levels of a cognitively effortful task (see Figure 1).
Figure 1. Schematic Representation of the DST. In each trial of the task, participants chose
between one of two pattern cues. After selecting a cue, a random number (range = 1–4 and 6–9)
was presented in either yellow or blue at the center of the chosen cue. Participants then judged
either the number’s parity if the number was yellow (whether the number was odd or even) or its
magnitude if the number was blue (whether the number was larger or smaller than 5). The choice
of cue dictated the probability with which these tasks switched from trial-to-trial (the switch
rate). Choosing the high-demand cue (upper row) resulted in a task-switching probability of .9,
whereas choosing the low-demand cue (lower row) resulted in a task-switching probability of .1.
In Bogdanov et al. (2021), participants completed the DST before and after being exposed to an acute social stress manipulation. The central research question was whether willingness to exert cognitive effort (i.e., to choose the high effort
cue) would change after being exposed to acute social stress. For the purposes of this tutorial,
we will focus on data from sessions in which participants were not exposed to social stress to
predict effort aversion (S1 demonstrates how to load this subset of the data).
The traditional finding using this task is that participants will demonstrate a marked
preference for the low-demand option over the high-demand option, reflecting a general bias
against performing cognitively demanding activities, and in favour of less cognitively demanding
activities (Kool et al., 2010). This recurring finding of demand-avoidant preferences has been
interpreted as evidence that humans have a default tendency to avoid cognitively effortful tasks.
In this tutorial, the outcome variable of interest is participants’ effort choices in the DST,
which are binary (low demand or high demand) and nested (within participants; each participant
makes many choices). We begin with how to model these data using logistic multilevel models,
predicting the effect of effort level on cue preference in the DST, with the lme4 package in R
(Bates et al., 2015). For those less familiar with linear multilevel regression, review S6a and S6b first.
3. Logistic Multilevel Regression

Linear multilevel regression models (see S6b) can be generalized to model binary outcomes. In
this case, the goal is to model the probability that a given participant chose the high-effort option
(effort_choice) on a given trial. If our binary data are coded as 0 (low effort option chosen)
and 1 (high effort option chosen), this probability will be equal to the mean of the outcome: $p(Y_{ij}=1) = \bar{Y}_j$, where $i$ refers to the $i$th trial and $j$ refers to the $j$th participant. However, predicting this probability directly with a linear model is problematic: probabilities are bounded between 0 and 1 inclusive, whereas predicted values from a linear model (see S6a for more information) can, in theory, take on any value between negative and positive infinity. To
circumvent this issue in logistic regression, a linear combination of the model’s parameters and its predictors is computed on the logit scale—and then, afterwards, the regression predictions are transformed back to probabilities via a mean function. The regression on the logit scale is a linear model for the log of the odds of an event occurring (in this case, the odds of choosing the high effort option), where the odds represent the probability of the event occurring over its complement, $p(Y_{ij}=1)/\big(1-p(Y_{ij}=1)\big)$:

$$\log\!\left(\frac{p(Y_{ij}=1)}{1-p(Y_{ij}=1)}\right) = b_0 + b_1 X \qquad \text{Eq. 1}$$

Here, the right-hand side of the equation represents the linear combination of $b_0$, $b_1$, and $X$ on the logit scale (log-odds, LO), where $X$ is some predictor variable of interest (presented here for completeness, though for most of this tutorial we will focus on intercept-only models). A feature
of the logit scale is that it is simple to transform it into odds, and from there, into probabilities:

$$\text{Odds}_{ij} = \exp(b_0 + b_1 X) \qquad \text{Eq. 2}$$

$$p(Y_{ij}=1) = \frac{\text{Odds}_{ij}}{1+\text{Odds}_{ij}} \qquad \text{Eq. 3}$$
This gives us three ways to express the same linear combination of predictors in Eq. 1: log odds (i.e., the logit scale), odds, and probabilities. In the case of effort choices in the DST, if $b_0 = -0.25$ and $b_1 = -0.09$, assuming $X = 1$, then the log odds would be -0.34 (-0.25 + -0.09; Eq. 1), which corresponds to an odds of 0.71 (exp(-0.34); Eq. 2) and a probability of 0.42 (0.71 / (1 + 0.71); Eq. 3). These all express the same observed pattern: a participant is 0.71 times as likely to choose the high-demand option relative to the low-demand option or, equivalently, will select the high-demand option 42% of the time. This demonstrates an overall trend toward demand avoidance.
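These conversions are easy to verify directly in R (a minimal sketch using the hypothetical coefficients from the example above):

b0 <- -0.25; b1 <- -0.09; X <- 1
log_odds <- b0 + b1 * X         # Eq. 1: -0.34
odds <- exp(log_odds)           # Eq. 2: ~0.71
p <- odds / (1 + odds)          # Eq. 3: ~0.42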
These techniques can be extended to a multilevel framework to account for the nested structure of the data:

Level 1: $\log\!\left(\dfrac{p(Y_{ij}=1)}{1-p(Y_{ij}=1)}\right) = b_{0j} + b_{1j} X_{ij}$

Level 2: $b_{0j} = \gamma_{00} + U_{0j}$; $b_{1j} = \gamma_{10} + U_{1j}$ (Eq. 4)

Here, $b_{0j}$ is the expected log-odds for person $j$ when $X_{ij}$ is equal to 0, and $b_{1j}$ is the person-specific effect of $X_{ij}$ on $Y$ (in log-odds) for person $j$.
At the first level, participant $j$’s effort choice on trial $i$ is predicted from a linear combination of a person-specific intercept term ($b_{0j}$) and slope ($b_{1j}$). The left-hand side of the level 1 equation (Eq. 4) corresponds to the logit transformation introduced in Eq. 1. Conceptually, one could imagine applying a standard (not multilevel) logistic regression to the data from each participant. Doing so would yield two coefficients on the log-odds scale unique to each participant—$b_{0j}$ and $b_{1j}$. Averaging these coefficients across participants would yield the fixed effects at level 2 ($\gamma_{00}$, $\gamma_{10}$) in Eq. 4. In other words, $\gamma_{00}$ represents the average log-odds of $Y=1$ assuming $X_{ij}$ is zero for the average or typical person, and $\gamma_{10}$ represents the average influence of $X$ on the log odds of $Y=1$ for the average person. These fixed effects are the average of participant-specific effects in log-odds. This is different from marginal, population-averaged approaches to modeling correlated or nested data (e.g., Generalized Estimating Equations; Hardin & Hilbe, 2014).
Of course, each participant will have coefficients that deviate from these average estimates. These deviations are captured by $U_{0j}$ and $U_{1j}$ in Eq. 4 and represent how much higher or lower a given participant’s intercept or slope is relative to $\gamma_{00}$ and $\gamma_{10}$, respectively. These values play a central role in a multilevel framework, as they capture variation between participants. They are assumed to be normally distributed with a mean of zero and variances $\tau_0^2$ and $\tau_1^2$, respectively. Note also that the coefficients at Level-1 of equation 4 (the $b$s) are outcomes in Level-2. For example, the expected value of a given $b_{0j}$ will be the sum of the average estimate, $\gamma_{00}$, plus the participant-level deviation, $U_{0j}$.
Such a model can be fit in R using the glmer() function from the lme4 package (Bates
et al., 2015; S2). First, a random intercept only model is fit, which captures participants’ general
preponderance for selecting the high effort tasks. In lme4 syntax, the model formula begins with
effort_choice ~ 1, which asks R to fit a model where effort choice is predicted by its
mean, the intercept, represented by a “1”. Additionally, we add (1|PID) to this equation,
which reflects the random intercept (“1”) for each (“|”) participant (“PID”) in our dataset. We
then specify the data frame that contains our variables, data.Ctl, and finally we indicate that
the family = ‘binomial’, which instructs R to fit a logistic regression model when it estimates the parameters.
Code Box 1. Input and Output from an Intercept-Only Logistic Multilevel Model of Effort-
Option Choices
library(lme4)
logistic_MLM0 <- glmer(effort_choice ~ 1 + (1|PID), data=data.Ctl, family='binomial')
summary(logistic_MLM0)
Scaled residuals:
Min 1Q Median 3Q Max
-4.7586 -1.0096 0.3631 0.9050 1.6890
Random effects:
Groups Name Variance Std.Dev.
PID (Intercept) 0.7554 0.8691
Number of obs: 11204, groups: PID, 38
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3338 0.1428 -2.338 0.0194 *
Looking at the “Fixed effects” section of the output in Code Box 1, the intercept
((Intercept) in the output) is -0.33, which represents the average log-odds of choosing the
high demand option for the typical person. This value can be converted to a probability using Eq.
2 and 3, which yields a 42% chance of choosing the high-demand option, which, consistent with
past work, shows that, on average, people avoid high-demand options more often than chance.
Additionally, under the “Random effects” section of the output, the variance of participants’ choices around this grand mean ($\tau_0^2$; Variance in the output) is 0.76. This value is on the logit scale and is difficult to interpret directly. To build intuition for this value, we can convert it from a variance to a standard deviation by taking the square root: $\tau_0 = \sqrt{\tau_0^2} = 0.87$. Since we assume normality, roughly 95% of participants will have log-odds of choosing the high effort option within $\gamma_{00} \pm 2\tau_0 = -0.33 \pm 2 \times 0.87 = [-2.07, 1.41]$. Converting these values into probabilities using Eq. 2 and 3, 95% of participants will choose the high effort option between 11% and 80% of the time. Thus, while the typical person avoids effort, some prefer it.
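This interval can be reproduced directly from the fitted model (a quick sketch, assuming logistic_MLM0 from Code Box 1; plogis() is R’s built-in inverse-logit function):

gamma00 <- fixef(logistic_MLM0)[1]            # -0.33
tau0 <- sqrt(VarCorr(logistic_MLM0)$PID[1])   # 0.87
range_LO <- gamma00 + c(-2, 2) * tau0         # [-2.07, 1.41]
plogis(range_LO)                              # approximately [0.11, 0.80]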
Additionally, a researcher could visualize this variability by plotting the distribution of participant-specific intercepts. To do so, we can extract the intercept for each participant from the model. These values are known as
“empirical Bayes (EB) estimates” and are a weighted combination of person-specific information and the average across all persons ($\gamma_{00}$). The EB estimates for persons with more informative data, e.g., from completing more trials, would be closer to their person-specific estimates, whereas EB estimates for persons with less informative data would be closer to the average across all persons ($\gamma_{00}$). The more trials a person completes, the more their EB estimates can be differentiated from
the average across all persons. Alternatively stated, EB estimates reflect skepticism about
differences between persons (or clusters more generally). Extracting these values gives us a per-participant measure of effort avoidance ($b_{0j}$)—in other words, the fixed effect, -0.33 ($\gamma_{00}$), plus each participant’s deviation from this fixed effect ($U_{0j}$)—which yields each participant’s estimated log-odds of choosing the high effort option. In R, this computation can be automated using the coef() function.

Code Box 2. Extracting Participant-Specific Intercepts (Empirical Bayes Estimates)
head(coef(logistic_MLM0)$PID)
(Intercept)
1 -1.35826976
2 -0.28081092
3 -0.03887375
4 -0.03193523
5 0.77954561
6 -0.42367159
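Under the hood, these values are simply the fixed intercept plus each participant’s deviation, which can be verified directly (a quick check, assuming the model from Code Box 1):

# fixed intercept + participant deviations = participant-specific intercepts
head(fixef(logistic_MLM0)[1] + ranef(logistic_MLM0)$PID[, 1])
# matches head(coef(logistic_MLM0)$PID[, 1])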
The output of Code Box 2 shows the first 6 participants’ estimated log-odds of choosing the
high-effort option. To emphasize how these values correspond to participant behaviour, we have
visualized the correspondence between each participant’s estimated log-odds of choosing the
high effort option and their actual proportion of high effort choices (Figure 2A). As can be seen,
participants with higher estimated log-odds of choosing the high effort option also exhibited
higher actual effort seeking behaviour. Notably, this relationship follows an “S-shaped” curve,
which is the result of applying the logistic transformation discussed in Section 3 to the data (Eq.
1-3), allowing the log-odds to take both positive and negative values, whereas actual proportions
are constrained to be between 0 and 1. It is important to note here that while above we have
considered the fixed effects to be “averages” of person-level random effects, this is only true in
the transformed log-odds scale. When converting log-odds back to probabilities, the non-linearity of the transformation means that large differences in log-odds can correspond to only small differences in probability. For example, if we compare two participants with estimated log-odds of choosing the high-effort option of 3 and 4 in terms of their raw probabilities, these values would correspond to probabilities of 0.95 and 0.98—a difference of 3%. Conversely, if we compare two participants with the same difference in log-odds but lower absolute values, e.g., a participant with log-odds of 1 and another with log-odds of 2, then the difference in probabilities is larger: 0.73 versus 0.88—a difference of 15%.
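These comparisons can be verified with R’s inverse-logit function, plogis():

plogis(c(3, 4))   # 0.953 and 0.982: a difference of about 3%
plogis(c(1, 2))   # 0.731 and 0.881: a difference of about 15%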
With this caveat in mind, we can also visualize between-participant variation in log-odds.
As can be seen in Figure 2B, there is substantial variation in individual participants’ effort-related choices. While most participants are demand-avoidant, some participants demonstrate demand-seeking behaviour (red area in Figure 2B), preferring the high-demand option to the low-demand option on average (across trials). These individual differences
expand theoretical understanding, because past work has often taken the fixed effects estimates
of effort avoidance ( γ00 ) as evidence for a general and ubiquitous cognitive mechanism that aims
to minimize effort exertion and maximize reward (e.g., Kool et al., 2010; Kool & Botvinick,
2018).
While visualizing these variations yields insight into the generality of the estimated fixed
effect, it is useful for researchers to quantify the degree to which variability in the data stems
from individual differences versus within-participant response noise (e.g., Volpert-Esmond et al., 2018). That is, researchers can quantify the relative amount of variability that stems from level-2 variance (i.e., differences between participants) and level-1
variance (variability within participants’ response). The former provides insight into the
generality of our conclusions—e.g., are all participants effort avoidant and, if not, which ones
are and which ones are not?—while the latter speaks to the degree to which our experimental
manipulation yields reliable responses from participants—e.g., how stable is one’s aversion to
effort over the course of the task? Generally speaking, high within-person variability would motivate exploring within-person (level-1) predictors, whereas substantial between-person variability would require a greater number of between-person predictors. With answers to these questions, a researcher can make informed decisions about model building. In a linear MLM, this variance decomposition is readily obtained from standard software (see S6b), but this is not the case with logistic models.
4. The Intraclass Correlation Coefficient (ICC)

In a linear random intercept model, the total variance in the outcome is the sum of between and
within variance components: $\text{var}(Y_{ij}) = \tau_0^2 + \sigma^2$, where $\sigma^2$ is the residual, within-person, variance (see S6b). As such, one can compute the proportion of total variance accounted for by variation between individuals, $\tau_0^2$—a value called the intraclass correlation coefficient (ICC; a.k.a.
variance partition coefficient), or simply $\rho$, as the ratio of variation between individuals to the total variance (see S6b for more information):

$$\rho = \frac{\tau_0^2}{\text{var}(Y_{ij})} = \frac{\tau_0^2}{\tau_0^2 + \sigma^2}$$

Conceptually, the ICC corresponds to the
degree of nesting in a dataset, or the importance of the cluster variable (the variable in which
observations are grouped; here, participants) in explaining variance in the outcome (Goldstein et
al., 2002). This is why computing the ICC is usually the best first step in a multilevel analysis
(McCoach & Adelson, 2010). In cases where the data are nested within participants, the ICC
summarizes the degree to which the outcome variable is sensitive to individual differences.
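For comparison, in a linear MLM fit with lmer(), both variance components are available directly from the fitted model, so the ICC is one line of arithmetic (a sketch with a hypothetical continuous outcome y in a data frame dat):

library(lme4)
lin_mod <- lmer(y ~ 1 + (1 | PID), data = dat)   # hypothetical linear MLM
tau20 <- VarCorr(lin_mod)$PID[1]                 # between-person variance
sigma2 <- sigma(lin_mod)^2                       # residual (within-person) variance
rho <- tau20 / (tau20 + sigma2)                  # the ICC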
Although many introductory texts on multilevel modeling discuss logistic MLMs as “simple” extensions of linear models, the ICC cannot be computed in the same way—and, in fact, not at all with the default output provided by lme4 or other commonplace multilevel modeling software.
A key difference between linear and logistic multilevel models is that logistic models do
not estimate a residual variance term, σ 2—that is, a term that represents within-participant
variability in the outcome. In a linear model, the residual variance term is the variance of
deviations, negative and positive, from a Gaussian variable’s mean or prediction. In logistic
models which use a Bernoulli distribution, the mean represents the probability of an event
occurring. In turn, the variance of deviations about that mean/probability is entirely determined
by the mean according to the following equation²: $\text{var} = P(Y=1)\big(1-P(Y=1)\big)$; for example, $.42(1-.42) = .24$.
As a result, a subtle, but important, difference between linear and logistic multilevel models is
that logistic models, which are based on the Bernoulli distribution, lack a residual variance
parameter, $\sigma^2$. The level 2 variance, $\tau_0^2$, is on the logit scale. The variance of the Bernoulli outcome as implied by the mean, however, is on the probability scale. As such, we cannot combine the level-1 and level-2 variances to form the total variance denominator as is done in the linear case. Two related complications thus arise in a logistic multilevel model. First, there is no straightforward value for the residual variance ($\sigma^2$)
that captures within-participant variation in the outcome—and accordingly, lme4 does not
provide one (the residual variance parameter when available would be printed under the Random
effects section in Code Box 1). Thus, unlike a continuous outcome model, there is not a singular,
“simple”, way to quantify the relative contribution of individual differences to variability in the
data, and thus the degree of unexplained variance (Goldstein et al., 2002). Consequently,
variance in the outcome cannot be decomposed into within and between components in a
straightforward manner, and the standard ICC formula cannot be used to create a ratio of between-person variance to total variance.
² To demonstrate why this equation holds, let $X$ be some binary variable coded as 0 and 1, let $\theta$ be the mean of the distribution of this variable ($p(X=1)$), and let $\overline{X^2} - (\bar{X})^2$ be the equation for the variance. We begin by solving for the mean of $X^2$ as follows: $E(X^2) = p(X=1)\cdot 1^2 + p(X=0)\cdot 0^2 = p(X=1) = \theta$. Following the standard equation for computing variance, $\sigma^2 = \frac{\sum X^2}{N} - \mu^2$, keeping in mind that $N$ is already captured by $\theta$ by virtue of it being a probability ($\frac{\sum X}{N}$) and that $\mu = p(X=1) = \theta$, we can compute the variance around $\theta$ as: $\text{var}(X) = \theta - \theta^2 = \theta(1-\theta)$. In this final form, we can see that the variance is just a restatement of $\theta$.
5. Four Methods for Quantifying the ICC in Logistic Multilevel Models

Reference texts usually discuss one method of quantifying the ICC in a logistic
multilevel framework: the latent threshold method (e.g., Snijders & Bosker, 2011). However,
other tested methods exist in the literature, and we review three additional approaches: 1) the
simulation approach, 2) linearization, and 3) the median odds ratio. While all four approaches
attempt to quantify the degree of variability between people, they do not estimate the same exact
statistical quantity and will often return different values. As we will see, the first three
approaches estimate the ICC in the (binary) outcome variable but they do so differently. The
median odds ratio, however, estimates a different value, the heterogeneity between people in
terms of the odds of occurrence of a binary outcome variable—i.e., the median odds ratio that
would be observed if multiple random samples were drawn from the population. After
illustrating how to implement and interpret each method in R step-by-step, using the random
intercept-only model fit as an example above (logistic_MLM0, see Code Box 1), we summarize their relative advantages and disadvantages (see Table 1).

Latent Threshold Approach
The simplest and most popular method of calculating the ICC in a logistic MLM is to
assume that the within-cluster variation is equal to the variance of a logistic distribution with a
scale parameter equal to one, yielding the value $\pi^2/3 \approx 3.29$ (Goldstein et al., 2002; Snijders & Bosker, 2011, p. 440). Using this approach, then, the ICC quantifies the proportion of variance in an assumed underlying continuous variable that owes to differences between participants.
This method arises from a common formulation, or data generation mechanism, for binary outcomes: the binary variable is assumed to be generated from an unobserved continuous variable that is itself a weighted sum of predictors and an error term.³ However, the continuous outcome is unavailable,
and instead we have a dichotomized version of the variable. Under this formulation, binary
regression models attempt to retrieve the parameters or regression coefficients of the underlying
continuous variable regression. Since the continuous variable is unavailable, we make certain
“identification” assumptions to permit model estimation. For logistic regression, the relevant
assumption is that the error term in the continuous variable regression follows a standard logistic
distribution, i.e., it has mean 0 and variance $\pi^2/3$ (Goldstein et al., 2002).
Applying this formulation to the data at hand, we may assume a continuous distribution
of values across participants, and participants with higher values have greater preference for high
effort tasks. The exact preference value of a participant is a weighted combination of their
predictors and the logistic error term. When this preference value exceeds some threshold that varies by participant, the participant chooses the high effort task. A random-intercept multilevel logistic regression returns the variation across persons, $\tau_0^2$, and we assume the variation within persons to be fixed at $\pi^2/3$. One can then calculate the ICC as the ratio of the variance component of interest to the total variance, $\tau_0^2 + \pi^2/3$ (Goldstein et al., 2002):

$$\text{ICC} = \frac{\tau_0^2}{\tau_0^2 + \pi^2/3}$$
Code Box 3. Computation of the ICC using the Latent Threshold Method
tau20 <- VarCorr(logistic_MLM0)$PID[1]
sigma2 <- pi^2/3
threshold_ICC <- tau20 / (tau20 + sigma2)
threshold_ICC
[1] 0.1867374

³ This formulation is mathematically equivalent to the GLM formulation for binary regression models based on the Bernoulli distribution, with a linear model for probabilities on an unbounded scale.
In the Code Box above, the VarCorr() command extracts the random variance components from a fitted model (here logistic_MLM0; i.e., the value .7554 from Code Box 1). Alone, this yields a variance component per grouping variable and per random variance parameter ($\tau_0^2, \tau_1^2, \tau_2^2, \ldots$) specified in the model. In our case, we have one grouping variable, PID, and so we specify that we are interested in the random intercept variance for this grouping variable using $PID[1]. We then use this value to compute the latent threshold ICC as described above. Executing the code in Code Box 3 yields an ICC = 0.19—that is, nearly 20% of the variance in the assumed underlying continuous variable owes to differences between participants.
In contexts where the modeler is primarily interested in a continuous outcome, but only has access to a dichotomized variable, this formulation for the ICC is ideal. For example, if a researcher conceives of effort preference as a continuous underlying dimension of effort aversion, then effort-related choices in the DST might be seen as a dichotomous representation of this latent construct. It is worth noting, however, that the assumption of an unobservable continuous variable underlying the binary responses may not be sensible in all contexts. While effort choice could be thought to reflect a continuous underlying distribution of preference, some may think of it as a truly dichotomous outcome—a participant chooses either one option or the other—in the same way that a person is either a bachelor or not. In these cases, it may be desirable to treat responses as truly dichotomous when calculating the ICC.
Simulation Approach
Goldstein et al. (2002) describe a simulation-based approach as a more general means of estimating the ICC on the binary (observed) scale. Differently from the latent threshold approach, which focuses on the logit scale, this is an ICC measure on the probability scale. The key idea is to compute the Bernoulli variance over a large number of simulated datasets as a proxy for the estimate of residual variance. This method computes the ICC from two sets of values: first, the person-level estimated probabilities, which are obtained from the fixed and random effects of a fit model, and second, a measure of within-person variance based on the Bernoulli variance seen above, $P(Y=1)\big(1-P(Y=1)\big)$, which is averaged across people. These two components can then be used to compute the ICC. In practice, this involves the following steps:

1. Simulate a large number, $M$, of random effects, $U_{0jm}$, from a normal distribution with mean 0 and variance $\tau_0^2$.
2. For each random effect, compute predicted probabilities for each level 1 observation, $\hat{p}_{ijm}$, according to the model’s fixed effects ($\gamma_{00}$) and the random effect obtained in step 1, as follows: $\hat{p}_{ijm} = \dfrac{\exp(\gamma_{00} + U_{0jm})}{1 + \exp(\gamma_{00} + U_{0jm})}$.
3. For each of these predicted probabilities, compute the level 1 variance according to the Bernoulli formula: $\hat{V}_{ijm} = \hat{p}_{ijm}(1-\hat{p}_{ijm})$.

4. Compute the ICC as the variance of the predicted probabilities over the total variance, which itself is composed of cluster-level variance and average level-1 variance:

$$\text{ICC} = \frac{\mathrm{Var}_{m=1}^{M}(\hat{p}_{ijm})}{\mathrm{Var}_{m=1}^{M}(\hat{p}_{ijm}) + \frac{1}{M}\sum_{m=1}^{M}\hat{V}_{ijm}}$$
Code Box 4. Computation of the ICC using the Simulation Method
# 0. Extract model estimates and set the number of simulations
tau20 <- VarCorr(logistic_MLM0)$PID[1]
gamma00 <- fixef(logistic_MLM0)[1]
M <- 1e5
# 1. Simulate random effects (using the square root of the random variance;
# i.e., the SD)
U0j <- rnorm(M, 0, sqrt(tau20))
# 2. Predict probabilities
logit_p_hat <- gamma00 + U0j
p_hat <- exp(logit_p_hat) / (1 + exp(logit_p_hat))
# 3. Compute the level-1 (Bernoulli) variance for each predicted probability
var_L1 <- p_hat * (1 - p_hat)
# 4. Compute ICC
sigma2 <- mean(var_L1)
simICC <- var(p_hat) / (var(p_hat) + sigma2)
# print result
simICC
[1] 0.1390484
At the start of Code Box 4, we use the VarCorr function to extract the between cluster
variance, as before (Code Box 3). We then extract the fixed intercept ($\gamma_{00}$ in Eq. 4) using the fixef command and specify the number of simulations, $M$. We then follow the steps described above to compute the ICC.

Executing the code in Code Box 4 yields an ICC of 0.14—that is, 14% of total variance
in effort avoidance owes to differences between individuals. Notably, this estimate is smaller
than that obtained by the latent threshold method. In the latent threshold method, the ICC is
computed entirely on the logit scale, and the residual variance parameter is assumed known. In
the simulation approach, both the between- and within-participant variances are computed on the
probability scale—the calculations include the fixed effects. Hence, the fixed effects shift the
mean of the probability in ways that affect the variance of probabilities—which in turn increase
at a slower rate than the variance on the logit scale. These differences lead to discrepancies in the two sets of ICC estimates.
It is worth bearing in mind that because simulated deviations from the fixed effect are
based on the fixed effects themselves, estimates of the ICC in models with additional covariates
(i.e., not a null model) will depend on the covariate patterns. That being said, differences in the
ICC associated with different covariate values may themselves be of interest, though this process requires computing multiple ICCs (a topic we return to in Section 7).
Linearization
A third option is to derive a linear approximation of the logistic multilevel equation and use the variance from this approximation. Thus, linearization attempts to approximate the ICC otherwise obtained on the probability scale (e.g., via the Simulation approach)—i.e., one in which variance is
calculated on the probability scale. Goldstein et al. (2002) proposed the following approximation
for the null model, where $\hat{p}_{ij}$ refers to the estimated probability that $Y_{ij}=1$:

$$\hat{p}_{ij} = \frac{\exp(\gamma_{00})}{1+\exp(\gamma_{00})} \qquad \text{Eq. 5}$$

This approximation works for the case of the null model specifically (i.e., the model for which we typically compute the ICC), but Goldstein et al.’s (2002) approach may also work in the presence of multiple predictors (a topic we return to later). Considering the null model, the ICC is then computed as:

$$\hat{V}_1 = \hat{p}_{ij}(1-\hat{p}_{ij}), \qquad \hat{V}_2 = \tau_0^2\,\hat{p}_{ij}^2(1-\hat{p}_{ij})^2, \qquad \text{ICC} = \frac{\hat{V}_2}{\hat{V}_1 + \hat{V}_2} \qquad \text{Eq. 6}$$
where $\hat{p}_{ij}$ reflects the model-estimated probability of choosing the high effort option at the mean of the random effects, $\hat{V}_1$ is the corresponding Bernoulli (level-1) variance, and $\hat{V}_2$ approximates the variance of the level-2 outcomes as predicted by the model, including information about each person ($\tau_0^2$; i.e., the variance in the (null) model-estimated probability of choosing the high effort option for each participant). The total estimated variance is the sum of these values. This variance then appears in the denominator of the standard ICC calculation. Note that the equation above is a more general form of the linearization equation, as the present model (logistic_MLM0) is a null model with no predictors.
Code Box 5. Computation of the ICC using the Linearization Method
# 0. Extract the model estimates
tau20 <- VarCorr(logistic_MLM0)$PID[1]
gamma00 <- fixef(logistic_MLM0)[1]
# 1. Compute the model-estimated probability at the mean of the random
# effects (equation 5)
p <- exp(gamma00) / (1 + exp(gamma00))
# 2. Compute the Bernoulli variance of this fixed estimate (last part of
# equation 6)
var1 <- p * (1 - p)
# 3. Compute the level-2 variance on the probability scale (equation 6)
var2 <- tau20 * p^2 * (1 - p)^2
# 4. Compute ICC
linICC <- var2 / (var1 + var2)
# print result
linICC
[1] 0.1551821
As before, at the start of Code Box 5, we extract the random effects and fixed effects using the
VarCorr and fixef functions in R. Following this, we follow the steps described above to
compute the ICC. Executing Code Box 5 yields an ICC of approximately 16%—closer to the simulation-based estimate (14%) than to the latent threshold estimate (19%).
While this linearization method does not require simulation, much like the simulation
approach, its value is conditional on the covariate structure. That is, the ICC will be conditional
on the values of the predictor variables. We discuss ways around this limitation later, but it is at
the discretion of the modeler whether the unique ICCs for different patterns of covariates are of interest over a global ICC estimate.⁴ Another consideration is that when $\tau_0^2$ is large, the
⁴ Nakagawa et al. (2013, 2017) have proposed a similar method for estimating the coefficient of determination ($R^2$) in generalized multilevel models. In the case of binary outcomes, this method estimates the proportion of variance explained in a model. This value is proportional to the ICC when considering null models, as we do here (Snijders & Bosker, 2011), and in fact the latent threshold ICC is exactly equal to the conditional R-squared value, assuming a null model. Accordingly, we do not review it in detail here, but point readers to Nakagawa’s (2013, 2017) work on this method.
linearization transformation will be applied to values along a broad span of the logistic curve,
which is non-linear and poorly approximated by a linear function. As such, the linearization method may perform poorly when between-person variance is large.
Median Odds Ratio (MOR)

The methods reviewed so far have aimed to compute the ICC for logistic multilevel
models using techniques analogous to linear models. An alternative approach is to abandon the
framework of the ICC and instead compute a measure of participant-level variation that more
readily captures the binary nature of the data, without the need for an estimate of within-
participant variance. That is, a method in which the measure of variation is expressed in terms of
odds, which is a more common scale for interpreting binary data models. Accordingly, the
median odds ratio (MOR), proposed by Merlo et al. (2006), represents median participant-level
variation on the odds ratio scale. Conceptually, the MOR represents the median odds of success
across every pairing of participants in the dataset. Imagine that every participant is paired to each
other participant in the sample and their odds of choosing the high demand option are calculated
according to their participant-level deviation from the fixed intercept. The ratio of these pairs of
odds are then computed, with the higher odds always placed in the numerator. Doing this for all
pairs of participants would yield a distribution of odds ratios, the median of which would be the
MOR. If the MOR is equal to 1, it would suggest no differences between participants (all participants are equally effort-avoidant; that is, they all choose the high-effort option with the same log-odds, $\gamma_{00}$). If the MOR is considerably larger than 1, it would suggest sizable individual differences. As such, the MOR is a useful technique for those interested solely in the between-participant heterogeneity in the outcome.
Practically, it is not necessary to carry out these computations in full, because the MOR
can be readily computed using the following equation (see Merlo et al., 2006):

$$\text{MOR} = \exp\!\left(\sqrt{2\tau_0^2}\;\Phi^{-1}(0.75)\right)$$

where $\Phi^{-1}$ is the inverse cumulative distribution function of a standard normal distribution and $\Phi^{-1}(0.75)$ is its 75th percentile (see Larsen & Merlo, 2005 for a derivation of this equation). In R, the MOR can be computed in a single line:
Code Box 6. Computation of the MOR
MOR <- exp(sqrt(2 * tau20) * qnorm(0.75))
# print result
MOR
[1] 2.291137
Using the data at hand, the MOR is 2.29, which means that the median odds of avoiding the high-demand option increase by a factor of 2.29 when randomly comparing two participants. In other words, the odds of effort avoidance in the present sample vary, in median, by a factor of 2.29. Relative to the original estimate of between-participant variability from the output of fitting logistic_MLM0 ($\tau_0^2$), the MOR summarizes this variation in effort aversion on the more familiar odds scale.
While the MOR is unlike the other approaches discussed thus far, it is worth emphasizing
its conceptual similarity to the ICC (and thus its inclusion in this tutorial). Much like the ICC, the
MOR provides information about how individual variability between participants impacts the odds of
the outcome variable being equal to 1 (in this case, choosing the high effort option). Put simply,
both the MOR and ICC quantify how important the participant is in determining the outcome.
Moreover, like other ICC values (and holding total variance constant), larger MOR values imply
reduced within-person variability, and vice-versa. In this respect, the MOR and the ICC convey
similar information. Nevertheless, unlike the ICC, the MOR is not expressed as a proportion. As
such, this method offers no quantification of residual, within-participant, variance, which could itself be of substantive interest.
Table 1. Advantages and Disadvantages of Different Techniques for Computing the ICC in Logistic MLM.

Method            | Estimate for logistic_MLM0 | Pros                                                   | Cons
Latent threshold  | ICC = 0.19                 | Simple to compute; does not depend on covariate values | Assumes a continuous variable underlies the binary outcome; estimate is on the latent (logit) scale
Simulation        | ICC = 0.14                 | ICC on the observed (probability) scale                | Requires simulation; conditional on covariate values
Linearization     | ICC = 0.16                 | Probability-scale ICC without simulation               | Approximation degrades when between-person variance is large; conditional on covariate values
MOR               | MOR = 2.29                 | Expressed in odds, a familiar scale for binary models  | Not a proportion; no quantification of within-participant variance
6. Quantifying Uncertainty in ICC and MOR Estimates

In the previous section, we outlined four methods for quantifying variation in logistic multilevel
modeling, as well as some relative advantages and disadvantages of each method. However,
quantifying uncertainty around any given estimate of the ICC, which in our example varies between 14% and 19% depending on the method, is not routinely discussed in the literature.
To quantify uncertainty in the ICC, Austin and Leckie (2020) proposed using a
parametric bootstrapping process with percentile confidence intervals. In its simplest form, this involves three steps:
1. Simulate many datasets from the original logistic model, randomly sampling random
effects from a normal distribution with the mean and variance estimated from the fitted model.
2. Compute the quantity of interest (the ICC or MOR in this case).
3. Summarize these data (e.g., by taking the mean, computing a 95% interval, etc.) to make inferences.
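Steps 1-3 can be implemented in a few lines with lme4’s simulate() and refit() functions (a minimal sketch for the latent threshold ICC, assuming logistic_MLM0 from Code Box 1):

set.seed(1234)                                      # arbitrary seed for reproducibility
B <- 100
boot_thr <- numeric(B)
for (b in 1:B) {
  y_star <- simulate(logistic_MLM0)[[1]]            # 1. simulate new responses from the model
  fit_b <- refit(logistic_MLM0, newresp = y_star)   # 2. refit the model to the simulated data
  tau20_b <- VarCorr(fit_b)$PID[1]                  # 3. compute the quantity of interest
  boot_thr[b] <- tau20_b / (tau20_b + pi^2/3)
}
quantile(boot_thr, c(.025, .975))                   # percentile 95% confidence interval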
Readers could implement this procedure for any of the reviewed measures using the information provided in the Code Boxes throughout this tutorial (and, for the interested reader, in the Supplemental Materials, S5a and S5b). For ease, however, we provide a function that draws bootstrapped samples of the ICC using the methods specified above. Below, we use this function to obtain 100 bootstrapped samples of the latent threshold ICC.
Code Box 7. Example of bootstrapping procedure to obtain sampling distributions of the ICC,
using the latent threshold method.
bootstrap_thr <- bootstrap_icc(logistic_MLM0, gr='PID', method='icc_thr', B = 100)
quantile(bootstrap_thr, c(.025, .975))
2.5% 97.5%
0.1089664 0.2576925
The bootstrap_icc() function takes as arguments a fitted model (here, logistic_MLM0), the name of the grouping variable, here PID, a string representing one of the methods for computing the ICC described above, here icc_thr, and the number of samples to draw, B. Applying this yields a bootstrapped sampling distribution that characterizes uncertainty around the point estimate. For instance, while the point estimate for the latent threshold ICC in the present data was 0.19, the bootstrapped confidence interval suggests the ICC estimates most compatible with the current data range between 0.11 and 0.26 (see Table 2). In other words, anywhere between 11% and 26% of the variance could plausibly owe to differences between participants.
Table 2. Bootstrapped 95% intervals (2.5% and 97.5% quantiles) for each measure.

Variable                 | 2.5% | 97.5%
ICC (Latent Threshold)   | 0.11 | 0.26
ICC (Simulation)         | 0.11 | 0.23
ICC (Linearization)      | 0.10 | 0.22
MOR                      | 1.83 | 2.77
Notably, unlike confidence intervals calculated using standard errors, the lower bound of
these bootstrapped confidence intervals will never go below 0 in the case of the ICC, or 1 in the
case of an MOR, because these are the lowest possible values these metrics can take.
Accordingly, 1 for the MOR and 0 for the ICC might be very conservative thresholds for
minimal participant-level variation. The advantage of the bootstrap approach is that researchers
can set their own lower bound for what constitutes minimal heterogeneity in the outcome and compare the bootstrapped interval against that bound.

The 95% intervals of the bootstrapped distributions for the MOR and for the ICC computed using all of the methods discussed thus far (based on 100 samples) are summarized in Table 2. Regardless of the
measure, there are substantial individual differences in effort avoidant behaviour, suggesting that
not all people are equally averse to effortful action—a finding echoed by recent work exploring
individual differences in cognitive effort investment, which speaks to the theoretical insights
examining the random effects can provide (Sandra & Otto, 2018; Otto & Daw, 2019).
7. Computing the ICC and MOR in Models with Predictors and Random Slopes
Thus far, the tutorial has focused on a null model with no predictors and random
intercepts only. We recommend this as a first step in model building because it provides a
starting point to understanding the variability in the outcome. Researchers are then likely to
proceed to adding predictors and estimating random slopes. In this section, we review how to use the techniques described above in such models. We begin with a model including one level-1 predictor, trial number, and random intercepts. Code Box 8 shows that the regression equation now includes
trial0, which captures the number of trials that have elapsed since the beginning of a block
in the DST (where a block is a subset of trials). The predictor has been centered at the median
and scaled to be between 0 and 1. In line with past work (Kool et al., 2010), we consider whether
participants’ desire to avoid higher-effort tasks increased as their exposure to the experimental
tasks increased. The fixed intercept, γ 00, reflects the average log-odds of selecting the high-effort
tasks when trial0 is equal to zero for the typical participant, and the fixed slope, γ 10, reflects
the change in log-odds from the beginning to the end of the task for the typical participant (from
when trial0 equals 0 to when it equals 1). As before, we have specified that the intercept will
vary per participant in a normally distributed fashion, with random variance τ 20.
Code Box 8. Multilevel logistic model with one predictor, random intercepts, and fixed slopes.
logistic_MLM1 = glmer(effort_choice ~ trial0 + (1|PID), data=data.Ctl,
family='binomial')
summary(logistic_MLM1)
Scaled residuals:
Min 1Q Median 3Q Max
-1.7442 -0.8961 -0.3605 0.9873 4.8429
Random effects:
Groups Name Variance Std.Dev.
PID (Intercept) 0.778 0.882
Number of obs: 10909, groups: PID, 37
Fixed effects:
The simulation and linearization approaches for computing the ICC described above rely
on an estimate of intercept variance (τ 20), which depends on the value of the intercept in a model,
which in turn depends on the covariate structure. For these approaches, if researchers are
interested in various points, (e.g., when covariate 1 is high and covariate 2 is low, when both
covariates are high, when both are low, and so on), they need to calculate multiple ICCs. In
Figure 3A, we plot the ICC at each value of trial0 using the linearization method, which
demonstrates how the ICC changes as we move along different values of the covariate.
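The pattern in Figure 3A can be reproduced in a few lines (a sketch applying the Eq. 6 quantities at each value of trial0, assuming logistic_MLM1 from Code Box 8):

tau20 <- VarCorr(logistic_MLM1)$PID[1]
gammas <- fixef(logistic_MLM1)
trials <- seq(0, 1, length.out = 100)         # trial0 is scaled between 0 and 1
p <- plogis(gammas[1] + gammas[2] * trials)   # predicted probability at each trial value
var1 <- p * (1 - p)                           # level-1 (Bernoulli) variance
var2 <- tau20 * p^2 * (1 - p)^2               # level-2 variance on the probability scale
plot(trials, var2 / (var1 + var2), type = 'l', xlab = 'trial0', ylab = 'Linearized ICC')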
Conversely, both the ICC computed using the latent threshold approach and the MOR implicitly assume a fixed residual variance and can therefore be applied regardless of the covariate structure.
In the presence of covariates, the latent threshold ICC and MOR return values that are adjusted
for the covariates. For example, the latent threshold ICC functions similarly to the residual ICC
in linear multilevel models, in that it is the latent variance attributable to differences between participants, adjusted for the covariates in the model.⁵
⁵ Since the residual variance is fixed across logistic models, estimated model parameters are automatically scaled across models to reflect changes in residual variance across models (an issue referred to as non-collapsibility or unobserved heterogeneity, depending on the literature; Greenland et al., 1999; Mood, 2010). For example, if one extends a random-intercept model by including within-participant predictors that do not explain any participant-level variance (e.g., person-mean centered predictors), the intercept variance ($\tau_0^2$) would increase relative to the random-intercept-only model, producing a larger latent threshold ICC or MOR, since the residual variance has effectively been reduced relative to the earlier model.
Figure 3. Linearization-based ICC in models where (A) within-person and (B) between-
person predictors are included. In A, the x-axis represents the trial number at which the ICC
was computed. In B, the x-axis represents the experimental condition at which the ICC was
computed. The y-axis shows the resultant estimated ICC. The red dotted line shows the estimated
ICC using the averaging method discussed in the main text. Together, these figures illustrate the
problems that accompany estimating the ICC when predictors are included in a logistic
multilevel model.
A simpler way to summarize the ICC in models with increasingly complex covariate structures is to calculate the ICC at the average prediction of the observed data on the logit scale. To do so, the average predicted value from the model is used in place of the combination of fixed effects. The benefit of this method is that it is simpler, while the cost is that it cannot capture the relative contribution of within- vs. between-participant variance at different levels of the predictor. Thus, faced with a model like our logistic_MLM1, we must
decide if we want to compute the ICC at different levels of the predictor—at the beginning vs.
end of a block, for instance—or compute a global ICC at the average—here, the middle of a
block. We take the latter approach by centering the predictor at the median, but the former
approach could be taken if trial0 was recentered to the beginning or end of the task.
As an example, we can leverage the averaging approach described above for the linearization method, applied to the model with the trial predictor (logistic_MLM1) estimated above. This yields an estimate of the ICC that is close to the estimate for the null model; however, this estimate of the ICC will only be valid for the middle of the block (see Figure 3A). This is implemented in the icc_lin() function provided in the Supplemental Materials (S6d; see the call and output below).
As an aside, in the case of the simulation method, this averaging approach can be extended via numerical integration to estimate the moments of the logit-normal distribution, which the simulation method can be thought of as approximating. This yields numerically very similar estimates of the ICC without the need for simulation. We describe this Numerical Integration Approach in the Supplemental Materials (S6d) and provide a function implementing it in the associated repository.

icc_lin(logistic_MLM1, 'PID', avg_preds=T)
[1] 0.1558737
The same issues apply when a level 2 (in our example, a between-person) predictor is
included. In Code Box 9, we fit a logistic multilevel model where effort choices are predicted by
experimental condition: control (0) or stress (1). The original purpose of Bogdanov and colleagues’ (2021) study was to examine the role that acute stress plays in effort avoidance.
The model below tests this prediction, examining average effort avoidance rates depending on whether participants were in the stress or control condition.

Code Box 9. Multilevel logistic model with one between-person predictor and random intercepts
logistic_MLM2 <- glmer(effort_choice ~ condition + (1|PID), data=data,
family='binomial')
summary(logistic_MLM2)
Scaled residuals:
Min 1Q Median 3Q Max
-1.2636 -0.8786 -0.4627 0.9761 4.5435
Random effects:
Groups Name Variance Std.Dev.
PID (Intercept) 0.5838 0.7641
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.44017 0.12483 -3.526 0.000422 ***
condition1 0.12024 0.01417 8.484 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Just as with the continuous predictor example above, the intercept and the ICC using the
simulation and linearization approaches will depend on the value of Condition. We illustrate
this in Figure 3B. In code box 10, we compute the averaged ICC using the linearization method.
As before, we obtain a value not too far from, though slightly lower than, the linearized ICC estimated for the null model (Code Box 5).
Code Box 10. Linearization method for a model with between-person predictors
icc_lin(logistic_MLM2, 'PID', avg_preds=T)
[1] 0.1221064
As before, these issues do not apply to the latent threshold ICC nor the MOR, which will return the same covariate-adjusted value regardless of the covariate pattern.
categorical variables may have very restricted covariate patterns. As a result, researchers may in
fact be interested in examining how person-level variability in the intercept (indexed by the ICC)
varies per categorical conditions. Moreover, when a model includes categorical predictors, the
average prediction may not reflect any actual case in the data, such that it may be wiser to consider condition-specific ICCs.
Finally, a modeler may be interested in estimating how much the influence of a level 1
predictor varies per participant. For example, we might expect that over the course of a block,
participants become increasingly effort averse, but that some participants remain steadfast in
their initial rate of effort aversion while others rapidly reject all effortful tasks after only a few
trials. This type of conditional between-participant variability can be modeled using random slopes. In Code Box 11, we add a random slope for trial0 to estimate the slope variance across people. As can be seen in Code Box 11, a new random effects
variance is included, which encodes how variable different participants’ effort preferences are in response to the passage of trials.
Code Box 11. Multilevel logistic model with one predictor, random intercepts, and random
slopes.
logistic_MLM3 = glmer(effort_choice ~ trial0 + (trial0|PID), data=data.Ctl,
family='binomial')
summary(logistic_MLM3)
Scaled residuals:
Min 1Q Median 3Q Max
-1.7942 -0.8994 -0.3641 0.9937 4.9218
Random effects:
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.33915 0.14781 -2.295 0.0218 *
trial0 -0.04598 0.05958 -0.772 0.4402
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Though adding random slopes to the model and computing the ICCs in the manner we
have demonstrated is relatively straightforward in lme4, there is debate in the literature about
whether the ICC should be computed for models that contain random slopes. For instance, Kreft
& De Leeuw (1998) state that “The concept of intra-class correlation is based on a model with a
random intercept only. No unique intra-class correlation can be calculated when a random slope
is present in the model.” (p. 63). Conversely, Goldstein et al. (2002) highlight that it is, in theory,
possible to estimate the ICC for a model with random slopes, but that doing so will be
conditional on the pattern of covariates and that doing so changes the interpretation of the ICC—
and its components: between- and within-participant variation—to a point that it no longer serves
the same purpose of assessing the relative contribution of between- versus within-participant
variation to total variance. We will not settle this debate in this tutorial and to our knowledge, no
work exists that tackles this issue in logistic multilevel models. For now, we recommend
implementing the ICC using the techniques described above for models with random intercepts,
and we caution that the interpretations we’ve provided do not hold for models with random
slopes.
8. Conclusion
Calculating the ICC to describe the effect of clustering in the dependent variable is a best
practice for the early stages of multilevel model building. However, for logistic models, this is
not straightforward to do, with popular programs not outputting the necessary estimates. In this
tutorial, we have reviewed four methods for computing estimates of between- and within-
participant variability from logistic multilevel models and demonstrated how to calculate the
ICC. We have highlighted relative advantages and disadvantages of each technique and provided
R code to compute these estimates, both manually (using didactic code from the Code Boxes or
the supplement to reproduce all analyses reported herein) or functionally (using the function
code provided in the repository associated with this paper and described in S6d). Finally, we demonstrated bootstrapping methods for quantifying uncertainty around these estimates and discussed their use in models with predictors and random slopes.
Here, we focused on modeling binary data using a model with a logit link function, but alternative link functions, like the probit, are commonly employed. The techniques described to quantify the ICC (excluding the MOR, because it is expressed in odds ratios) could also be used for a
probit model. That is, in a probit model, the value of the residual variance in the ICC using the threshold approach would be 1, rather than $\pi^2/3$; in the simulation approach, the mean function would change from the inverse-logit function to the normal cumulative distribution function; and for the linearization approach, the formulas presented in equations 5 and 6 would also change.
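As an illustration, the latent threshold ICC under a probit link could be obtained as follows (a sketch, assuming the same data and model structure as in Code Box 1):

probit_MLM <- glmer(effort_choice ~ 1 + (1 | PID), data = data.Ctl,
                    family = binomial(link = 'probit'))
tau20_probit <- VarCorr(probit_MLM)$PID[1]
probit_ICC <- tau20_probit / (tau20_probit + 1)   # residual variance is 1, not pi^2/3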
At this point, readers may ask which method is the best overall, or which method they
should use for a specific scenario. To help arbitrate between the different methods presented
here, we point readers to Table 1 and Figure 4 for practical considerations. Additionally, the
decision between methods might also be made on theoretical grounds. If researchers view the
observed binary outcome as the result of a measurement process that dichotomized a continuous
variable/trait, then the latent threshold approach is appropriate. If they are instead concerned only with variability in the observed binary outcome, the simulation or linearization approaches are more suitable. Finally, if researchers wish to express between-person heterogeneity in terms of odds, the MOR is a good option.
Finally, another way to identify which measure to report is to consider the type of effect
size the researcher plans to report. If the researchers primarily focus on log-odds and odds-
ratios, then the latent approach and MOR are adequate, because they are already expressed in
that scale. If the researchers are instead interested in transforming effects into probabilities or
attempt to communicate effects on the probability scale (i.e., on the observed scale), then the simulation or linearization approaches align better with that goal. We hope this tutorial encourages researchers to investigate variability in their binary data and expand their theoretical considerations beyond fixed effects alone.
Figure 4. Article Flowchart. Readers are encouraged to use this flowchart to direct their selection of an estimate of the ICC (and MOR) for logistic multilevel modeling.
References