JSS Journal of Statistical Software

May 2007, Volume 20, Issue 5. http://www.jstatsoft.org/

Multilevel IRT Modeling in Practice with the Package mlirt

Jean-Paul Fox
University of Twente

Abstract
Variance component models are generally accepted for the analysis of hierarchically
structured data. A shortcoming is that outcome variables are still treated as measured
without error. Unreliable variables produce biases in the estimates of the other model
parameters. The variability of the relationships across groups and the group effects on
individuals’ outcomes differ substantially when the measurement error in the dependent
variable of the model is taken into account. The multilevel model can be extended to
handle measurement error using an item response theory (IRT) model, leading to a
multilevel IRT model. This extended multilevel model is particularly suitable for the analysis
of educational response data where students are nested in schools and schools are nested
within cities/countries.

Keywords: item response data, MCMC, multilevel IRT model, FORTRAN.

1. Introduction
The objective in school effectiveness research is to investigate the relationship between ex-
planatory and outcome factors. This involves choosing an outcome variable, such as ex-
amination achievement, and studying differences among schools after adjusting for relevant
background variables. Multilevel analysis techniques are a generally accepted approach in the
analysis of school effects (Aitkin and Longford 1986). Multilevel models are used to make
inferences about the relationships between explanatory variables and response or outcome
variables within and between schools. This type of model simultaneously handles student
level relationships and takes account of the way students are grouped in schools.
The outcome variable or response variable (examination results, behavior) and the
characteristics of the student intake (socioeconomic status, individual ability on entrance to the
school) have been the subject of much attention and research. Students’ abilities are regarded as a
continuous unidimensional quantity, and can only be observed indirectly. Since each student
can be presented only a limited number of questionnaire items, inference about his or her
ability is subject to considerable uncertainty. This also includes response error due to the
unreliability of the measurement instrument. Further, human response behavior is stochastic in nature.
This problem can be handled by extending an item response theory (IRT) model to a multi-
level item response theory model consisting of a latent variable assumed to be the outcome
in a regression analysis. This model has already become an attractive alternative to the
traditional multilevel models. Verhelst and Eggen (1989) and Zwinderman (1991) defined a
structural model for the one parameter logistic model and the Rasch model with observed
covariates assuming the item parameters are known. Adams, Wilson, and Wu (1997) and
Raudenbush and Sampson (1999) discussed a Rasch model embedded within a hierarchical
structure. Kamata (2001) defined the multilevel formulation of the Rasch model as a
hierarchical generalized linear model. Maier (2001) defined a Rasch model with a hierarchical
model imposed on the person parameters but without additional covariates. Fox and Glas
(2001) extended the two-parameter normal ogive model by imposing a multilevel model, with
covariates on both levels, on the ability parameters. This multilevel IRT model describes
the link between dichotomous response data and a latent dependent variable within a struc-
tural multilevel model. They also showed how to model latent explanatory variables within
a structural multilevel model using dichotomous response data.
Handling response error in the dependent variable in a multilevel model using item response
theory has some advantages. Measurement error can be defined locally as the posterior
variance of the ability parameter given a response pattern, resulting in a more realistic,
heteroscedastic treatment of the measurement error. Besides the fact that in IRT reliability can
be defined conditionally on the value of the latent variable, it offers the possibility of
separating the influence of item difficulty and ability level, which supports the use of incomplete
test administration designs, optimal test assembly, computer adaptive testing and test equating.
Further, it is possible to handle various kinds of item responses to assess the ability of interest
without simplifying assumptions regarding the discrete nature of the responses.
In the present paper, a few analyses concerning the PISA 2003 survey (OECD 2004) are pre-
sented using the multilevel IRT model. The PISA 2003 survey analysed student performance
and associated factors that may support success in education. The measurement error or
degree of uncertainty associated with the estimated student performances was acknowledged.
Samples from an empirically derived distribution of student performance values were obtained
(plausible values). Plausible values were used to obtain consistent estimates of population
characteristics since students were administered too few items to allow precise estimates of
their performance. The multilevel IRT results are compared with the outcomes based on
plausible values of the PISA 2003 study.
The multilevel IRT model is presented for binary and polytomous response data in Section 2.
The prior choices are discussed. In Section 3, an overview is given of the MCMC algorithm
which is implemented in the package mlirt. In Section 4 a brief overview is given of procedures
for testing the fit of the model. The package mlirt is described in Section 5; a description is
given of the common input and output variables. In Section 6, a simulation study is
given to demonstrate the MCMC algorithm. In Section 7, a PISA 2003 data analysis is shown and
the results are compared with the plausible values method implemented in HLM (Raudenbush,
Bryk, Cheong, and Congdon 2004). Finally, other extensions of the model are discussed.
2. A multilevel IRT model

2.1. Level 1: Measurement model


Item response theory models are used to describe the relationship between (latent) person
parameters, say abilities, and responses of examinees to test items. One goal is to assess the
abilities of the examinees. The class of item response theory (IRT) models is based on test
characteristics, and the dependence of the observed responses to binary or polytomously scored
items on the ability is specified by item characteristic functions. Specifically, for binary
response data, the probability of a student $i$ ($i = 1, \ldots, n_j$) in group $j$
($j = 1, \ldots, J$) responding correctly to an item $k$ ($k = 1, \ldots, K$) is given by

$$P(y_{ijk} = 1 \mid \theta_{ij}, a_k, b_k) = \Phi(a_k \theta_{ij} - b_k), \tag{1}$$

where $\Phi(\cdot)$ denotes the standard normal cumulative distribution function, and $a_k$ and
$b_k$ are the discrimination and difficulty parameters of item $k$. Below, the parameters of item
$k$ will also be denoted by $\xi_k$, $\xi_k = (a_k, b_k)^t$. For polytomous ordered response data,
the probability that an individual indexed $ij$, given some underlying latent ability
$\theta_{ij}$, gives a response falling into category $c$ ($c = 1, \ldots, C_k$) on item $k$ is
defined by

$$P(y_{ijk} = c \mid \theta_{ij}, a_k, \kappa_k) = \Phi(a_k \theta_{ij} - \kappa_{k,c-1}) - \Phi(a_k \theta_{ij} - \kappa_{k,c}). \tag{2}$$

The response categories are ordered as follows,

$$-\infty < \kappa_{k1} \le \kappa_{k2} \le \ldots \le \kappa_{kC_k}, \tag{3}$$

where there are $C_k$ categories. For notational convenience, $\kappa_{k0} = -\infty$ and the upper
cutoff parameter $\kappa_{kC_k} = \infty$ for every item $k$. This item response model for
polytomously scored items, called the graded response model or the ordinal probit model
(Samejima 1969), has been used by several researchers (Johnson and Albert 1999; Muraki and
Carlson 1995, among others).
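As a minimal sketch of Equations (1) and (2), the response probabilities can be evaluated with the standard normal distribution function. The function names and the convention that `kappa` holds only the finite thresholds are assumptions made for illustration; they are not part of the mlirt package.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF, Phi in Equations (1) and (2)
INF = float("inf")


def prob_correct(theta, a, b):
    """Equation (1): P(y = 1 | theta, a, b) = Phi(a * theta - b)."""
    return PHI(a * theta - b)


def prob_category(theta, a, kappa, c):
    """Equation (2) for category c = 1, ..., C.

    kappa holds the C - 1 finite thresholds in increasing order; the outer
    cutoffs -inf and +inf are added here.
    """
    cuts = [-INF] + list(kappa) + [INF]
    return PHI(a * theta - cuts[c - 1]) - PHI(a * theta - cuts[c])
```

Summing `prob_category` over all categories of an item telescopes to one, which is a quick check on a set of thresholds.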

2.2. Level 2: Structural multilevel model


The measurement model is sometimes of interest in its own right, but here attention is
focused on relations between the latent variable and other observed variables. The structural
multilevel model defines the population model of the underlying latent variable. A sample of
clusters indexed $j = 1, \ldots, J$ is considered. A total of $N$ individuals, labeled
$i = 1, \ldots, n_j$, $j = 1, \ldots, J$, are nested within clusters. Consider at Level 1 the
latent dependent variable $\theta$ and $Q$ covariates denoted as $x$. At Level 2, $S$ covariates,
denoted as $w$, are considered. This corresponds with the following structural multilevel model

Level 1

$$\theta_{ij} = \beta_{0j} + \beta_{1j} x_{1ij} + \ldots + \beta_{qj} x_{qij} + \ldots + \beta_{Qj} x_{Qij} + e_{ij} \tag{4}$$

Level 2

$$\begin{aligned}
\beta_{0j} &= \gamma_{00} + \gamma_{01} w_{1j} + \ldots + \gamma_{0S} w_{Sj} + u_{0j} && (5)\\
\beta_{1j} &= \gamma_{10} + \gamma_{11} w_{1j} + \ldots + \gamma_{1S} w_{Sj} + u_{1j} && (6)\\
\vdots\; &= \;\vdots && (7)\\
\beta_{Qj} &= \gamma_{Q0} + \gamma_{Q1} w_{1j} + \ldots + \gamma_{QS} w_{Sj} + u_{Qj}, && (8)
\end{aligned}$$

where $e_{ij} \sim N(0, \sigma^2)$ and $u_j \sim N(0, T)$.

Both measurement models, the normal ogive and the graded response model, are not identified.
The models are overparameterized and require some restrictions on the parameters. The most
common way to identify the multilevel IRT model is to fix the scale of the latent ability at
mean zero and variance one. Another possibility is to impose identifying restrictions on the
item parameters.

2.3. Priors and identifying restrictions


A common normal prior distribution is specified for the item parameters ($k = 1, \ldots, K$) of
the normal ogive response model,

$$(\log a_k, b_k) \sim N(\mu_I, \Sigma_I). \tag{9}$$

This assumption allows for the fact that the item parameters within the IRT model usually
correlate. The full covariance matrix is

$$\Sigma_I = \begin{pmatrix} \sigma_a & \sigma_{a,b} \\ \sigma_{b,a} & \sigma_b \end{pmatrix}. \tag{10}$$

As a hyperprior for $(\mu_I, \Sigma_I)$, a normal-inverse-Wishart distribution is chosen. That is,

$$\Sigma_I \sim \text{Inv-Wishart}_{\nu_I}(V_I^{-1}) \tag{11}$$

$$\mu_I \mid \Sigma_I \sim N(\mu_0, \Sigma_I / \kappa), \tag{12}$$

where $\nu_I$ and $V_I$ are the degrees of freedom and scale matrix of the inverse Wishart
distribution, $\mu_0$ is the prior mean and $\kappa$ the number of prior measurements.
The log of the discrimination parameter of the graded response model has a normally distributed
prior. That is,

$$\log a_k \sim N(\mu_I, \sigma_I), \tag{13}$$

and hyperprior parameter $\mu_I$ is set at zero. The variance parameter $\sigma_I$ is assumed to have
the conjugate inverse-gamma prior with degrees of freedom $g_1$ and scale parameter $g_2$. The
threshold parameters in the graded response model have a common uniform prior distribution.
Note that the threshold parameters are also subject to the order restriction in Equation (3).

3. The MCMC algorithm


Developments in simulation techniques facilitate Bayesian analysis of complex generalized
(random effects) models. A Bayesian approach provides a natural way for taking into account
all sources of uncertainty in the estimation of the parameters. Adopting a fully Bayesian
framework results in a straightforward and easily implemented estimation procedure. A
Markov Chain Monte Carlo (MCMC) method (Geman and Geman 1984; Tanner and Wong
1987) can be used to estimate the parameters of interest. Within this Bayesian approach,
all parameters are estimated simultaneously and goodness-of-fit statistics for evaluating the
posited model are obtained.
A Gibbs sampling algorithm is described for the multilevel IRT model. All conditional
posterior distributions are specified. A Gibbs sampler is used to simulate draws from the
conditional distributions for binary response data and a Metropolis-Hastings within Gibbs
algorithm for polytomous response data. Each sampler produces a sequence of random variables
that converges in distribution to the joint posterior distribution.
A data augmentation step is introduced that makes the Gibbs sampling algorithm feasible
(Albert 1992). A continuous latent variable, $z$, is defined that underlies the binary or
polytomous response. Given $z$, it turns out to be easier to sample from the conditional
distributions of the parameters of interest. Let $z$ denote the augmented data regarding the
observed binary or polytomous data, $y$, for measuring the latent ability $\theta$. As a result,
the augmented data are defined as

$$p(z_{ijk} \mid y, \theta, \xi_k) = \begin{cases} N(a_k \theta_{ij} - b_k, 1) & \text{for binary data,} \\ N(a_k \theta_{ij}, 1) & \text{for polytomous data.} \end{cases} \tag{14}$$

For binary data, the response $y_{ijk}$ is the indicator of $z_{ijk}$ being positive. For
polytomous data, $z_{ijk}$ falls between the thresholds $\kappa_{k,c-1}$ and $\kappa_{k,c}$ when the
observed response is classified into category $c$. Note that the item parameters (person
parameters) can be considered to be regression parameters in the regression of $z$ on $\theta$ ($\xi$).
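The augmentation step can be sketched as a draw from a normal distribution with unit variance, truncated to the region implied by the observed response. The helper names below are hypothetical, and the inverse-CDF sampler is only one simple choice; it is adequate for moderate values of the mean, while extreme tails would need a dedicated truncated-normal sampler.

```python
import random
from statistics import NormalDist

STD = NormalDist()


def draw_truncated_normal(mean, lower, upper, rng):
    """Draw from N(mean, 1) restricted to (lower, upper) by inverting the CDF."""
    p_lo = STD.cdf(lower - mean)
    p_hi = STD.cdf(upper - mean)
    u = rng.uniform(p_lo, p_hi)
    # Keep u strictly inside (0, 1), where inv_cdf is defined.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + STD.inv_cdf(u)


def augment_binary(y, theta, a, b, rng):
    """z ~ N(a*theta - b, 1), positive if y == 1 and non-positive otherwise."""
    mean = a * theta - b
    if y == 1:
        return draw_truncated_normal(mean, 0.0, float("inf"), rng)
    return draw_truncated_normal(mean, float("-inf"), 0.0, rng)


def augment_polytomous(y, theta, a, kappa, rng):
    """z ~ N(a*theta, 1) between the thresholds of category y = 1, ..., C.

    kappa holds the C - 1 finite thresholds in increasing order.
    """
    cuts = [float("-inf")] + list(kappa) + [float("inf")]
    return draw_truncated_normal(a * theta, cuts[y - 1], cuts[y], rng)
```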

MCMC algorithm
Initial values for the parameters can be obtained by fitting the IRT model separately using,
for example, BILOG-MG (Zimowski, Muraki, Mislevy, and Bock 2005). Subsequently, initial
values for the multilevel model parameters can be obtained via HLM (Raudenbush et al.,
2004) given the estimated person parameters.

Full conditionals of the IRT model

Step 1. According to Equation (14), sample augmented data given item and ability parameter
values.

Step 2.

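• Binary data. Item parameter values are sampled from $p(\xi_k \mid z_k, \theta, \mu_I, \Sigma_I)$
for $k = 1, \ldots, K$. From Lindley and Smith (1972) it follows that the product of a normally
distributed likelihood and a normal prior leads to a normally distributed posterior
distribution. From Equation (14) and Equation (9) it follows that

$$\begin{aligned}
p(\xi_k \mid \theta, z_k, \mu_I, \Sigma_I) &= p(z_k \mid \xi_k, \theta)\, p(\xi_k \mid \mu_I, \Sigma_I) / p(z_k \mid \theta, \mu_I, \Sigma_I) && (15)\\
&= \phi(\xi_k \mid \hat{\xi}_k, \Omega_I), && (16)
\end{aligned}$$

where

$$\begin{aligned}
\hat{\xi}_k &= \Omega_I \left( H^t z_k + \Sigma_I^{-1} \mu_I \right) && (17)\\
\Omega_I^{-1} &= H^t H + \Sigma_I^{-1} && (18)
\end{aligned}$$

and $H = (\theta, -1)$, so that $H \xi_k$ equals the mean $a_k \theta - b_k$ in Equation (14), and
$\phi(\cdot)$ is the normal density function.
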
The full conditional posterior distribution of the hyperprior parameters $(\mu_I, \Sigma_I)$
has a normal-inverse-Wishart distribution (e.g., Gelman, Carlin, Stern, and Rubin,
2004). The full conditional can be specified as

$$p(\mu_I \mid \xi, \mu_0, \Sigma_I, V_I) = \phi\big( (\kappa \mu_0 + K \bar{\xi}) / (K + \kappa),\; \Sigma_I / (K + \kappa) \big), \tag{19}$$

where $\bar{\xi} = \sum_k \xi_k / K$. The full conditional of $\Sigma_I$ is an inverse Wishart with
$K + \nu_I$ degrees of freedom and scale parameter

$$V_I + \sum_k (\xi_k - \bar{\xi})(\xi_k - \bar{\xi})^t + \frac{\kappa K}{\kappa + K} (\bar{\xi} - \mu_0)(\bar{\xi} - \mu_0)^t.$$
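• Polytomous data. The full conditional of the discrimination parameter values is
constructed from (14) and (13), that is,

$$\begin{aligned}
p(a_k \mid \theta, z_k, \mu_I, \sigma_I) &= p(z_k \mid a_k, \theta)\, p(a_k \mid \mu_I, \sigma_I) / p(z_k \mid \theta, \mu_I, \sigma_I) && (20)\\
&= \phi(a_k \mid \hat{a}_k, \Sigma_a), && (21)
\end{aligned}$$

where

$$\begin{aligned}
\hat{a}_k &= \Sigma_a \left( \theta^t z_k + \sigma_I^{-1} \mu_I \right) && (22)\\
\Sigma_a^{-1} &= \theta^t \theta + \sigma_I^{-1}. && (23)
\end{aligned}$$

Hyperprior parameter $\mu_I$ is set equal to 1, and an inverse-gamma prior is specified
for $\sigma_I$ with parameters $g_1$ and $g_2$. A proper noninformative prior is specified with
$g_1 = g_2 = 1$. The full conditional of $\sigma_I$ equals

$$\begin{aligned}
p(\sigma_I \mid a, \mu_I) &= p(a \mid \sigma_I, \mu_I)\, p(\sigma_I; g_1, g_2) / p(a \mid \mu_I) && (24)\\
&= IG(K/2 + g_1,\; S_a/2 + g_2), && (25)
\end{aligned}$$

where $S_a = \sum_k (a_k - \mu_I)^2$.
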
The conditional distribution of the threshold parameter is difficult to specify.
Therefore, a candidate $\kappa_k^*$, regarding the thresholds of item $k$, is sampled from
a proposal distribution from which it is easy to sample. The candidate is accepted
or rejected based on the Metropolis-Hastings acceptance probability

$$\min\left[ \prod_{i|j} \frac{\Phi(a_k \theta_{ij} - \kappa^*_{k,y_{ijk}-1}) - \Phi(a_k \theta_{ij} - \kappa^*_{k,y_{ijk}})}{\Phi(a_k \theta_{ij} - \kappa_{k,y_{ijk}-1}) - \Phi(a_k \theta_{ij} - \kappa_{k,y_{ijk}})} \times \prod_{c=1}^{C_k - 1} \frac{\Phi\big((\kappa_{k,c+1} - \kappa_{k,c})/\sigma_{MH}\big) - \Phi\big((\kappa^*_{k,c-1} - \kappa_{k,c})/\sigma_{MH}\big)}{\Phi\big((\kappa^*_{k,c+1} - \kappa^*_{k,c})/\sigma_{MH}\big) - \Phi\big((\kappa_{k,c-1} - \kappa^*_{k,c})/\sigma_{MH}\big)},\; 1 \right] \tag{26}$$

where $y_{ijk}$ denotes the response of person $ij$ on item $k$ and $\sigma_{MH}$ denotes the
standard deviation of the proposal distribution. For the other parameters the
sampled values from the last iteration are used. The first part represents the
contribution from the likelihood whereas the second part represents the normalized
proposal distributions.
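For the binary case in Step 2, the draw from Equations (16)-(18) can be sketched with plain 2×2 matrix algebra. This is an illustrative sketch, not the package's FORTRAN code: the function names are hypothetical, the rows of H are taken as (θ_ij, −1) so that the regression mean equals a_kθ_ij − b_k as in Equation (14), and the prior is treated as normal on ξ_k exactly as Equations (16)-(18) do.

```python
import math
import random


def inv2(m):
    """Inverse of a 2x2 matrix given as ((m00, m01), (m10, m11))."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return ((s / det, -q / det), (-r / det, p / det))


def draw_item_parameters(theta, z, mu_I, Sigma_I, rng):
    """Return one draw of xi_k = (a_k, b_k) and the posterior mean xi_hat.

    theta: abilities of all persons; z: their augmented data on item k.
    """
    prior_prec = inv2(Sigma_I)
    n = len(theta)
    s_tt = sum(t * t for t in theta)
    s_t = sum(theta)
    s_tz = sum(t * v for t, v in zip(theta, z))
    s_z = sum(z)
    # Omega^{-1} = H'H + Sigma_I^{-1}, with rows of H equal to (theta, -1)
    prec = ((s_tt + prior_prec[0][0], -s_t + prior_prec[0][1]),
            (-s_t + prior_prec[1][0], n + prior_prec[1][1]))
    omega = inv2(prec)
    # H'z + Sigma_I^{-1} mu_I
    rhs = (s_tz + prior_prec[0][0] * mu_I[0] + prior_prec[0][1] * mu_I[1],
           -s_z + prior_prec[1][0] * mu_I[0] + prior_prec[1][1] * mu_I[1])
    xi_hat = (omega[0][0] * rhs[0] + omega[0][1] * rhs[1],
              omega[1][0] * rhs[0] + omega[1][1] * rhs[1])
    # Cholesky factor of Omega turns two standard normals into the draw
    l00 = math.sqrt(omega[0][0])
    l10 = omega[1][0] / l00
    l11 = math.sqrt(omega[1][1] - l10 * l10)
    e0, e1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    draw = (xi_hat[0] + l00 * e0, xi_hat[1] + l10 * e0 + l11 * e1)
    return draw, xi_hat
```

With many persons and a diffuse prior, `xi_hat` is close to the least-squares regression of z on (θ, −1), which is the sense in which the item parameters act as regression parameters.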

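Step 3. The full conditional of the latent variable $\theta_{ij}$ follows from Equations (14) and (4):

$$\begin{aligned}
p(\theta_{ij} \mid z^*_{ij}, \xi, \beta_j, \sigma^2) &= p(z^*_{ij} \mid \theta_{ij}, \xi)\, p(\theta_{ij} \mid \beta_j, \sigma^2) / p(z^*_{ij} \mid \xi, \beta_j, \sigma^2) && (27)\\
&= \phi(\theta_{ij} \mid \mu_\theta, \Sigma_\theta), && (28)
\end{aligned}$$

where

$$\begin{aligned}
\mu_\theta &= \Sigma_\theta \left( a^t z^*_{ij} + x_{ij} \beta_j / \sigma^2 \right) && (29)\\
\Sigma_\theta^{-1} &= a^t a + \sigma^{-2}, && (30)
\end{aligned}$$

and for binary data $z^*_{ij}$ equals $z_{ij} + b$ and for polytomous data $z^*_{ij}$ equals $z_{ij}$.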
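Because θ_ij is a scalar, Equations (28)-(30) reduce to a precision-weighted average of the item information and the multilevel prior. The sketch below illustrates this; the function name and argument layout are assumptions for illustration.

```python
import math
import random


def draw_theta(z, a, b, x, beta, sigma2, rng, binary=True):
    """Return one draw of theta_ij with its conditional mean and variance.

    z: augmented data of person ij on the K items; a, b: item parameters;
    x: Level 1 covariate row (with a leading 1 for the intercept);
    beta: regression coefficients of group j; sigma2: Level 1 variance.
    """
    # z* = z + b for binary data, z* = z for polytomous data
    z_star = [zk + bk for zk, bk in zip(z, b)] if binary else list(z)
    prior_mean = sum(xq * bq for xq, bq in zip(x, beta))   # x_ij beta_j
    precision = sum(ak * ak for ak in a) + 1.0 / sigma2     # a'a + sigma^-2
    var = 1.0 / precision                                   # Sigma_theta
    mean = var * (sum(ak * zk for ak, zk in zip(a, z_star)) + prior_mean / sigma2)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0), mean, var
```

The conditional variance shrinks as the sum of squared discriminations grows, so longer or more discriminating tests pull the draw away from the regression prediction x_ij β_j and toward the item response information.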

Full conditionals of the multilevel model

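Step 4. The full conditional of the (random) regression coefficients, $\beta_j$, is constructed from
the prior information at Level 2 and the Level 1 information. Let $x$ and $w$ be the
explanatory variables at Level 1 and 2, respectively. From Equations (4) and (5)-(8)
it follows that

$$\begin{aligned}
p(\beta_j \mid \theta, \sigma^2, \gamma, T) &= p(\theta_j \mid \beta_j, \sigma^2)\, p(\beta_j \mid \gamma, T) / p(\theta \mid \sigma^2, \gamma, T) && (31)\\
&= \phi(\beta_j \mid \mu_\beta, \Sigma_\beta), && (32)
\end{aligned}$$

where

$$\begin{aligned}
\mu_\beta &= \Sigma_\beta \left( x_j^t \theta_j / \sigma^2 + T^{-1} w_j \gamma \right) && (33)\\
\Sigma_\beta^{-1} &= x_j^t x_j / \sigma^2 + T^{-1}. && (34)
\end{aligned}$$

Step 5. The full conditional for the fixed effects, $\gamma$, follows from Equations (5)-(8) and a
noninformative prior:

$$\begin{aligned}
p(\gamma \mid \beta, T) &= p(\beta \mid \gamma, T)\, p(\gamma) / p(\beta \mid T) && (35)\\
&= \phi(\gamma \mid \mu_\gamma, \Sigma_\gamma), && (36)
\end{aligned}$$

where

$$\begin{aligned}
\mu_\gamma &= \Big( \sum_j w_j^t T^{-1} w_j \Big)^{-1} \sum_j w_j^t T^{-1} \beta_j && (37)\\
\Sigma_\gamma^{-1} &= \sum_j w_j^t T^{-1} w_j. && (38)
\end{aligned}$$

Step 6. The prior distribution for the Level 1 residual variance can be specified in the form
of an inverse-gamma (IG) distribution with shape and scale parameters $n_0$ and $S_0$. It
follows that

$$\sigma^2 \mid \theta, \beta \sim IG(N/2 + n_0,\; NS/2 + S_0), \tag{39}$$

where $S = \sum_{i|j} (1/n_j)(\theta_{ij} - x_{ij} \beta_j)^2$. A noninformative but proper prior is
specified if $n_0 = .0001$ and $S_0 = 1$ (Congdon 2001).

Step 7. An inverse-Wishart distribution with small degrees of freedom $n_0$, but greater than
the dimension of $\beta_j$, and identity scale matrix $S_0$ can be used as a diffuse proper prior for
$T$. Then,

$$T \mid \beta, \gamma \sim \text{Inv-Wishart}\big( n_0 + J,\; (S + S_0)^{-1} \big), \tag{40}$$

where $S = \sum_j (\beta_j - w_j \gamma)(\beta_j - w_j \gamma)^t$.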
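Steps 4 and 5 are easiest to see in the random-intercept case θ_ij = β_0j + e_ij, β_0j = γ_00 + u_0j, where x_j is a column of ones, w_j = 1 and T reduces to the scalar τ_00. The following sketch covers only this special case; the function names are hypothetical.

```python
import math
import random


def draw_random_intercepts(theta_by_group, sigma2, gamma00, tau00, rng):
    """Step 4 for a random intercept: one draw of beta_0j per group."""
    draws = []
    for theta_j in theta_by_group:
        # Sigma_beta^{-1} = n_j / sigma^2 + 1 / tau_00
        var = 1.0 / (len(theta_j) / sigma2 + 1.0 / tau00)
        # mu_beta = Sigma_beta * (sum_i theta_ij / sigma^2 + gamma_00 / tau_00)
        mean = var * (sum(theta_j) / sigma2 + gamma00 / tau00)
        draws.append(mean + math.sqrt(var) * rng.gauss(0.0, 1.0))
    return draws


def draw_gamma00(beta0, tau00, rng):
    """Step 5 with a flat prior: gamma_00 ~ N(mean of beta_0j, tau_00 / J)."""
    J = len(beta0)
    return sum(beta0) / J + math.sqrt(tau00 / J) * rng.gauss(0.0, 1.0)
```

In this case the conditional mean of β_0j is a shrinkage estimate: the group mean of θ is pulled toward γ_00, more strongly for small groups or small τ_00.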
4. Goodness of fit
The adequacy and the plausibility of the model can be investigated via a residual analysis.
The classical or Bayesian residuals are based on the difference between observed and predictive
data under the model, but they are difficult to define and interpret due to the discrete nature
of the response variable. Another approach to a residual analysis is proposed by Albert and
Chib (1993). The dichotomous or polytomous outcomes on the item-level are supposed to
have an underlying normal regression structure on latent continuous data. This assumption
results in an analysis of Bayesian latent residuals, based on the difference between the latent
continuous and predictive data under the model. The Bayesian latent residuals of multilevel
IRT models have continuous-valued posterior distributions and are easily estimated with the
Gibbs sampler (Fox 2005). Further, Bayesian residuals have different posterior variances but
the Bayesian latent residuals are identically distributed.
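When integrating out the random effects parameters the likelihood of the model can be
presented as

$$p(y \mid \xi, \gamma, \sigma^2, T) = \prod_j \int_{\beta_j} \left[ \prod_{i|j} \int_{\theta_{ij}} \prod_k p(y_{ijk} \mid \theta_{ij}, \xi_k)^{y_{ijk}} \big( 1 - p(y_{ijk} \mid \theta_{ij}, \xi_k) \big)^{(1 - y_{ijk})}\, p(\theta_{ij} \mid \beta_j, \sigma^2)\, d\theta_{ij} \right] p(\beta_j \mid \gamma, T)\, d\beta_j. \tag{41}$$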

The likelihood of the multilevel IRT model consists of two parts. A part following from the
measurement model M1 and a part following from the multilevel model M2 . The marginal
log-likelihood of the data under the multilevel IRT model can be presented as,
  
log p y | M = log p y | M1 + log p y | M2 . (42)

Both parts can be estimated via importance sampling using the joint posterior distribution of
the model parameters as importance sampling function. Each marginal likelihood is estimated
by the harmonic means of the likelihoods using samples from the joint posterior distribution.
The estimated marginal log-likelihood can be used for model comparison via a Bayes factor.
The idea is that model changes in the multilevel part M2 can be tested conditional on the
measurement part M1 such that relatively small changes in the marginal log-likelihood of the
multilevel part can be detected. Via importance sampling the Bayesian Information Criterion
(BIC) can also be computed to compare non-nested models.
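The harmonic mean estimate described above can be computed from the log-likelihood values evaluated at the posterior draws. The sketch below works on the log scale with a log-sum-exp step to avoid underflow; the function names are illustrative and how the log-likelihood values are obtained is left open.

```python
import math


def log_marginal_harmonic(loglik_draws):
    """Log of the harmonic mean of the likelihoods over M posterior draws.

    loglik_draws: log-likelihood values, one per posterior draw.
    """
    neg = [-ll for ll in loglik_draws]
    top = max(neg)
    # log of sum_m exp(-loglik_m), computed stably
    log_sum = top + math.log(sum(math.exp(v - top) for v in neg))
    return math.log(len(neg)) - log_sum


def log_bayes_factor(loglik_model_1, loglik_model_2):
    """Difference of the two estimated marginal log-likelihoods."""
    return log_marginal_harmonic(loglik_model_1) - log_marginal_harmonic(loglik_model_2)
```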
Finally, multilevel IRT models can be compared with respect to the Deviance Information
Criterion (DIC, Spiegelhalter, Best, Carlin, and van der Linde 2002). The DIC is defined as

$$\begin{aligned}
\text{DIC} &= D(\hat{\Theta}) + 2 p_D && (43)\\
&= -2 \log p(y \mid \hat{\Theta}) + 2 p_D, && (44)
\end{aligned}$$

where $\Theta$ represents the multilevel IRT model parameters, $D(\hat{\Theta})$ is the deviance of
the model evaluated at the posterior mean $\hat{\Theta}$, and $p_D$ represents the effective number
of parameters, which equals the posterior mean of the deviance minus the deviance evaluated at
the posterior mean of the model parameters.
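Given the deviance evaluated at each posterior draw and at the posterior mean, the DIC follows by simple arithmetic. A minimal sketch, with a hypothetical function name:

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from sampled deviances and the deviance at the posterior mean."""
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean   # effective number of parameters
    return deviance_at_posterior_mean + 2.0 * p_d, p_d
```

For example, sampled deviances averaging 105 with a deviance of 100 at the posterior mean give p_D = 5 and DIC = 110.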
5. Package mlirt
The program package mlirt contains three user-callable routines: a function for generating
multilevel IRT data, simmlirtdata; a function that handles the parameter estimation
of the model via MCMC, estmlirt; and a summary function, mlirtout, that reports a
summary of the results.
The MCMC algorithm is programmed in Visual Pro FORTRAN (Intel 2004) using the IMSL
FORTRAN statistics library (Visual Numerics 2004) for handling the random number gener-
ation and for sampling from several probability distributions. A dynamic link library appli-
cation was created, mlirt.dll, that can be used as a subprogram in R (R Development Core
Team, 2007).
The function simmlirtdata has arguments N, K, C, nll, and S for the number of respondents,
the number of items, the number of response categories (binary data, C=1, polytomous data,
C > 2), a vector that contains the number of persons per group, and a vector that contains the
specifications of the structural multilevel model, respectively, as well as some optional
arguments. The data matrix of binary responses has values of zero (incorrect response) and one
(correct response), and the data matrix of polytomous responses has values of 1 up to C. The
first element of S specifies a random (S[1]=1) or a fixed intercept (S[1]=0), the second element
gives the number of Level 1 explanatory variables with random regression effects, the third
element gives the number of Level 1 explanatory variables with fixed regression effects, and the
fourth element specifies the number of explanatory variables at Level 2. The other optional
arguments are fully specified in the accompanying R-documentation.
The function estmlirt has arguments Y, S, nll, and XG, for the data matrix of item responses,
specifications of the multilevel model, the grouping structure, and the number of MCMC
iterations, respectively. Optional arguments are specified in the R-documentation. Missing
data are to be coded as 9. The missing data can be assumed to be missing at random
(design=0, default) and an imputation method is used, or they are assumed to be missing
by design (design=1). The model can be identified in three different ways. If the optional
argument scaling1=1, the mean and standard deviation of the latent variable are fixed at
zero and one (default), unless the optional arguments fixm (mean) and fixsd (standard deviation)
are also given. If scaling1=2, restrictions are set on the item parameters, that is, the
product of the discrimination parameters equals one (binary and polytomous data) and the sum
of the difficulty parameters equals zero (binary data) or the first threshold parameter, κ11, is
fixed at zero (polytomous data). If scaling1=3, the discrimination parameter a1 = 1 and difficulty
parameter b1 = 0 (binary data) or threshold parameter κ11 = 0 (polytomous data). The
function’s outcome variable is a list that contains the output of the MCMC algorithm, which
is completely specified in the corresponding R-documentation.
Convergence can be evaluated by comparing the between- and within-chain variance of
multiple Markov chains generated from different starting points. Another method is to generate a
single Markov chain and to evaluate convergence by dividing the chain into sub-chains and
comparing the between- and within-sub-chain variance. A single run is less wasteful in the
number of iterations needed. When the rate of convergence is slow, a single long chain is more
likely to get close to the stationary distribution than several shorter chains. Further, the boa
software, which is available in library format from the Comprehensive R Archive Network
(CRAN, http://CRAN.R-project.org/), can be used to analyze the output from the Gibbs
sampler and the convergence of the Markov chains. This includes posterior estimates, trace
plots, density plots, and several convergence diagnostics. This way a burn-in period can be
specified.
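The between/within comparison can be sketched for a single parameter as follows. The ratio returned here is the potential scale reduction factor in the style of Gelman and Rubin, which is one common way to summarize the comparison and is an assumption of this sketch rather than a description of what boa or mlirt computes internally.

```python
from statistics import mean, variance


def between_within(chains):
    """Return (B, W, R_hat) for equally long chains of one parameter."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    B = n * variance(chain_means)            # between-chain variance
    W = mean(variance(c) for c in chains)    # average within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return B, W, (var_plus / W) ** 0.5


def split_chain(chain, parts):
    """Divide one long chain into equally long sub-chains (remainder dropped)."""
    size = len(chain) // parts
    return [chain[i * size:(i + 1) * size] for i in range(parts)]
```

Calling `between_within(split_chain(draws, 4))` applies the same comparison to sub-chains of a single run; values of the ratio close to one suggest that the sub-chains are sampling the same distribution.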
The function mlirtout provides estimates of the parameters and corresponding posterior
variances and highest posterior density intervals given the burn-in period and the object
from function estmlirt. A log-likelihood estimate is given that can be used for computing
a Bayesian Information Criterion (BIC) for model comparison. The posterior standard
deviations and highest posterior density intervals are estimated from the sampled values.

6. Parameter recovery
To give some empirical idea of the performance of the estimation method, a simulated
data set was analyzed. The following structural multilevel model was considered,

$$\begin{aligned}
\theta_{ij} &= \beta_{0j} + \beta_{1j} x_{ij} + e_{ij} \\
\beta_{0j} &= \gamma_{00} + \gamma_{01} w_j + u_{0j} \\
\beta_{1j} &= \gamma_{10} + u_{1j}
\end{aligned} \tag{45}$$

where $e_{ij} \sim N(0, \sigma^2 = 1)$ and $u_j \sim N(0, T)$, where $T$ is a matrix with diagonal
elements equal to .5 and off-diagonal elements equal to .2. At Level 1, a sample of 2,500
students, divided equally over 50 groups, responding to a test of 20 binary items was considered
to measure the latent dependent variable. Values for the explanatory variables $x$ and $w$ were
generated from a standard normal distribution. The discrimination and difficulty parameters of
the normal ogive model for measuring $\theta$ were sampled as follows:
$a_k \sim \log N(\exp(1), 1/4)$ and $b_k \sim N(0, 1/2)$, $k = 1, \ldots, 20$. The true population
values of the unknown parameters are given in Table 1.
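The simulation design can be sketched as below. This is an illustration and not the simmlirtdata routine: the function name is hypothetical, and the item parameters are passed in by the caller, so any values used with it are illustrative rather than those of the study.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf


def simulate(a, b, J=50, n_j=50, seed=1):
    """Generate abilities and binary responses under model (45)."""
    rng = random.Random(seed)
    g00, g01, g10 = 0.0, -0.5, 1.0     # gamma_00, gamma_01, gamma_10
    t00, t01, t11 = 0.5, 0.2, 0.5      # elements of T
    # Cholesky factor of T, used to draw the correlated pair (u_0j, u_1j)
    l00 = math.sqrt(t00)
    l10 = t01 / l00
    l11 = math.sqrt(t11 - l10 * l10)
    theta, y, group = [], [], []
    for j in range(J):
        w = rng.gauss(0.0, 1.0)
        e0, e1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        beta0 = g00 + g01 * w + l00 * e0
        beta1 = g10 + l10 * e0 + l11 * e1
        for _ in range(n_j):
            x = rng.gauss(0.0, 1.0)
            th = beta0 + beta1 * x + rng.gauss(0.0, 1.0)   # sigma^2 = 1
            theta.append(th)
            group.append(j)
            # Normal ogive response: P(y = 1) = Phi(a_k * theta - b_k)
            y.append([1 if rng.random() < PHI(ak * th - bk) else 0
                      for ak, bk in zip(a, b)])
    return theta, y, group
```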
The model parameters were estimated based on 19,000 draws from the joint posterior
distribution. The burn-in period consisted of the first 1,000 iterations. This burn-in period was
determined using procedures in the boa software. Initial values of the multilevel parameters
were obtained by estimating the multilevel model via HLM using properly scaled observed
sum scores as an estimate for the dependent variable.

                True     mlirt            HLM              LME
Fixed part      Value    Mean     SD      Mean     SD      Mean     SD
γ00             0        .059     .104    .053     .105    .053     .102
γ01             −.5      −.526    .095    −.530    .096    −.530    .099
γ10             1        .944     .096    .943     .094    .943     .095
Random part
σ²              1        .978     .035    1.007            1.007
τ00             .5       .482     .103    .486             .486
τ11             .5       .432     .093    .428             .428
τ01             .2       .170     .073    .142             .142

Table 1: True values and posterior estimates of multilevel model parameters.

Table 1 presents the true parameters, estimated posterior means and standard deviations that
are obtained via a multilevel IRT analysis, a multilevel analysis using HLM (Raudenbush et al.
2004), and a mixed effects analysis using the nlme package (Pinheiro and Bates 2000) for R
(R Development Core Team 2007). The values of the latent dependent variable are known
in the multilevel analysis and in the mixed effects analysis. In this case the parameters
are estimated via restricted maximum likelihood estimation (REML). The results of the two
packages, HLM and nlme, are nearly identical. That is, only small differences were found between
the corresponding estimated posterior standard deviations of the fixed effects. There is a
close agreement between the multilevel IRT parameter estimates and the REML estimates
which means that the 20 items provide enough information for estimating θ besides the other
multilevel parameters. The posterior standard deviations of the estimated fixed effects do
not differ much which means that the measurement error corresponding to the estimate of
θ is minimal. The REML estimation procedure does not provide standard deviations of the
estimated variance parameters.

7. PISA 2003 data


The Programme for International Student Assessment (PISA) launched by the Organisation
for Economic Co-operation and Development (OECD) is conducted to assess student
performance and to collect data on student and institutional factors that can explain differences
in performance. The PISA 2003 results can be found in OECD (2004), and the PISA 2003
data can be found at http://pisaweb.acer.edu.au/oecd_2003/oecd_pisa_data_s1.html
(April, 2007).
In 2003, 41 countries participated and the survey covered mathematics (the main focus in
2003), reading, science, and problem solving. In this section, attention is focused on the
mathematical abilities of the 15-year-old Dutch students. Student performance in mathematics
is measured via 85 items. Students were given credit for each item that they answered with
an acceptable response. In most cases the responses were marked as correct or incorrect but
some item responses were marked with partial credit. All item responses were coded as zero
(incorrect) or one (correct) since the mlirt package cannot handle mixed response formats. In
PISA 2003 each student was given a test booklet with clusters of items. Each mathematics
item appeared in the same number of test booklets. This (linked) incomplete design makes
it possible to use IRT to construct a scale of mathematical performance where each student
has a score on this scale representing his or her estimated ability. Variation in Dutch student
abilities within the Netherlands is investigated using various background variables. A total
of 3992 students across 154 schools were questioned.
The multilevel IRT model makes it possible to simultaneously estimate the item and ability
parameters and the structural multilevel model parameters. Therefore, measurement error in
the estimated abilities is taken into account in estimating the multilevel parameters. First,
the distinction is made between the variance attributable to differences in student abilities
across schools and variance attributable to differences in abilities within schools. This can be
formulated as an empty multilevel IRT model,

$$\begin{aligned}
P(y_{ijk} = 1 \mid \theta_{ij}, \xi_k) &= \Phi(a_k \theta_{ij} - b_k) && (46)\\
\theta_{ij} &= \beta_{0j} + e_{ij} && (47)\\
\beta_{0j} &= \gamma_{00} + u_{0j}, && (48)
\end{aligned}$$

where $e_{ij} \sim N(0, \sigma^2)$ and $u_{0j} \sim N(0, \tau_{00})$.



In PISA 2003, plausible values were computed that represent random draws from the posterior
distribution of the ability parameters given the response patterns. When plausible values are
used, the standard error of the ability estimates can be taken into account when estimating
the other multilevel model parameters. Population estimates can be biased when point
estimates of ability are used, and plausible values facilitate the computation of standard
errors of estimates for complex sample designs, taking into account the uncertainty associated
with the ability estimates. Table 2 presents the parameter estimates of the empty multilevel
model obtained with five plausible values for each ability parameter (HLM), together with the
estimates corresponding to the multilevel IRT model (mlirt). It can be seen that the parameter
estimates and standard deviations of the two analyses are nearly identical. The estimated
posterior variances of the ability estimates in the multilevel IRT analysis were based on a
normal hierarchical population distribution of ability, whereas the plausible values were based
on a normal population distribution of ability. This difference does not seem to affect the
results. The estimated intraclass correlation coefficient, τ00/(τ00 + σ²), is around 64%; this
is the proportion of variation in ability estimates explained by the grouping of students in
schools. This proportion is high and above the OECD average. Note that in this analysis the
scale of the ability parameter has a mean of zero and a standard deviation of one, whereas in
the PISA 2003 analysis (OECD 2004) the Dutch overall performance in mathematics was measured
on a scale with a mean of 542 and a standard deviation of around 92.
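The intraclass correlation quoted above can be checked directly from the variance components reported in Table 2. The short Python sketch below does this arithmetic for both analyses; the function name is an illustrative choice and not part of mlirt or HLM.

```python
def intraclass_correlation(tau00, sigma2):
    """Share of the total ability variance located between schools."""
    return tau00 / (tau00 + sigma2)


# Posterior means of the variance components from Table 2.
icc_mlirt = intraclass_correlation(tau00=0.654, sigma2=0.368)
icc_hlm = intraclass_correlation(tau00=0.648, sigma2=0.384)

print(f"mlirt: {icc_mlirt:.3f}")  # about 0.640
print(f"HLM:   {icc_hlm:.3f}")    # about 0.628
```

Both values are computed from posterior means only, so they ignore the posterior uncertainty of the variance components.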
To investigate differences in performance between schools and the effects of student-level
and school-level factors on students' ability, several background characteristics can be
incorporated in the multilevel model. According to the PISA 2003 study, the following student
characteristics explained variation in performance: gender, place of birth (Netherlands or
foreign), language (Dutch or speaks a foreign language most of the time), and the index of
economic, social and cultural status. The school's mean index of economic, social and cultural
status is used

                      mlirt              HLM
Fixed part         Mean     SD       Mean     SD
γ00               −.030   .066      −.033   .066

Random part
σ²                 .368   .011       .384   .009
τ00                .654   .076       .648   .076

Table 2: PISA 2003: Posterior estimates of the empty multilevel model.


                                                          mlirt              HLM
Fixed part                                             Mean     SD       Mean     SD
γ00                                                   −.062   .041      −.068   .041
Mean index of economic, social and cultural status:
  γ11                                                 1.323   .090      1.323   .094
Student is female:
  γ10                                                 −.158   .022      −.158   .021
Student is foreign born:
  γ20                                                 −.329   .048      −.316   .051
Student speaks foreign language most of the time:
  γ30                                                 −.223   .054      −.203   .045
Index of economic, social and cultural status:
  γ40                                                  .128   .015       .122   .016

Random part
σ²                                                     .350   .011       .359   .009
τ00                                                    .212   .027       .219   .027

Table 3: PISA 2003: Posterior estimates of multilevel model.

as an explanatory variable for the random intercept. Table 3 reports the results of the HLM
analysis using plausible values and of the multilevel IRT analysis.
It can be seen that the estimated standard deviations are nearly identical. The posterior
estimates of the fixed effects from the multilevel IRT analysis are slightly larger in
absolute value. In the multilevel IRT analysis the ability parameters were re-estimated with a
multilevel part that includes covariates. This resulted in slightly larger estimated fixed
effects compared to the HLM analysis, which used the same generated plausible values. It
follows from the multilevel IRT analysis that male students perform slightly better than
female students. Native speakers and native-born students also perform better than non-native
speakers and those with a migrant background, taking account of socio-economic differences
between students and schools. It can be concluded that students from more advantaged
socio-economic backgrounds generally perform better.
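A rough way to summarize the effect of the covariates is to compare the variance components of Table 2 (empty model) with those of Table 3 (model with covariates). The Python sketch below computes the proportional reduction in each variance component and the residual intraclass correlation from the mlirt posterior means. This comparison is an illustration added here, not an analysis reported for PISA 2003, and the function name is hypothetical.

```python
def proportional_reduction(var_empty, var_full):
    """Proportional reduction in a variance component after adding covariates."""
    return (var_empty - var_full) / var_empty


# mlirt posterior means: Table 2 (empty model) versus Table 3 (with covariates).
school_level = proportional_reduction(var_empty=0.654, var_full=0.212)
student_level = proportional_reduction(var_empty=0.368, var_full=0.350)

# Share of the remaining variance that is still located between schools.
residual_icc = 0.212 / (0.212 + 0.350)

print(f"school-level variance explained:  {school_level:.3f}")   # about 0.676
print(f"student-level variance explained: {student_level:.3f}")  # about 0.049
print(f"residual intraclass correlation:  {residual_icc:.3f}")   # about 0.377
```

Under these assumptions most of the explained variation is at the school level, which is in line with the large coefficient of the school mean index of economic, social and cultural status in Table 3.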

8. Discussion
An item response theory model for binary or polytomous data is used to define the relationship
between observable test scores and latent person parameters. A multilevel model describes the
population distribution of the respondents. The combined model is a multilevel IRT model
since a multilevel structure is built on the latent variable in the IRT model. This structural
multilevel model describes the relationship between the latent variable and observed variables
on different levels. A multilevel IRT analysis will differ substantially from a multilevel
analysis using estimated values for the latent variable when only a few item responses are
observed and, as a result, the associated measurement error is relatively large. Differences will
also be observed when respondents vary in their number of responses. Estimates of the latent
variable based only on response data will differ from the multilevel IRT estimates when there
is substantial explanatory information. These differences were studied in Fox and Glas (2001)
and Fox (2004).
The simulation study shows that the Bayesian estimation method works well. The MCMC
algorithm is very flexible and allows the modeling of a latent dependent variable using
dichotomous or polytomous responses. The flexibility of the estimation procedure allows the
use of other measurement error models, and the procedure can handle multilevel models with
three or more levels. The estimation procedure takes the full error structure into account and
allows for errors in the dependent variable. The Bayesian estimation method for estimating all
parameters simultaneously is implemented in the R package mlirt.
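To give an impression of how an MCMC scheme for binary normal ogive responses can be built, the Python sketch below shows one data augmentation draw in the spirit of Albert (1992): a latent continuous response is drawn from a normal distribution with mean a_k θ_ij − b_k and unit variance, truncated to be positive when the observed response is 1 and non-positive when it is 0. This is a minimal sketch under these assumptions and is not the FORTRAN code behind mlirt; the function name and the parameter values are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def draw_augmented_response(y, a_k, b_k, theta, rng):
    """Draw Z ~ N(a_k*theta - b_k, 1), truncated to Z > 0 if y == 1, else Z <= 0.

    Inverse-cdf sampling is used; it is adequate for moderate means but
    would need a more careful tail sampler for extreme values.
    """
    mean = a_k * theta - b_k
    p_nonpositive = STD_NORMAL.cdf(-mean)  # P(Z <= 0)
    u = rng.random()
    if y == 1:
        # Uniform on (P(Z <= 0), 1) maps to the positive part of the density.
        p = p_nonpositive + u * (1.0 - p_nonpositive)
    else:
        # Uniform on (0, P(Z <= 0)) maps to the non-positive part.
        p = u * p_nonpositive
    # Keep p strictly inside (0, 1), as required by inv_cdf.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return mean + STD_NORMAL.inv_cdf(p)


rng = random.Random(2007)
z_correct = draw_augmented_response(y=1, a_k=1.0, b_k=0.2, theta=0.5, rng=rng)
z_incorrect = draw_augmented_response(y=0, a_k=1.0, b_k=0.2, theta=0.5, rng=rng)
print(z_correct > 0, z_incorrect <= 0)
```

Given the augmented responses, the remaining parameters have conditional distributions of a familiar normal-theory form, which is what makes a Gibbs sampler convenient for this class of models.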
In the present paper, the measurement models, within the multilevel IRT model, assume
that the ability parameter is unidimensional. In some situations, a priori information may
show that multiple abilities are involved in producing the observed response patterns. Then,
a multidimensional IRT model serves to link the observed response data to several latent
variables. The multilevel IRT model could be extended to handle these correlated latent
variables within the structural multilevel model. This way, the dependency structure and
other person and group characteristics can be taken into account in analysing the relation
between multidimensional latent abilities.

References

Adams RJ, Wilson M, Wu M (1997). “Multilevel Item Response Models: An Approach to
Errors in Variable Regression.” Journal of Educational and Behavioral Statistics, 22, 47–76.

Aitkin M, Longford N (1986). “Statistical Modelling in School Effectiveness Studies.” Journal
of the Royal Statistical Society A, 149, 1–43.

Albert JH (1992). “Bayesian Estimation of Normal Ogive Item Response Curves Using Gibbs
Sampling.” Journal of Educational Statistics, 17, 251–269.

Albert JH, Chib S (1993). “Bayesian Analysis of Binary and Polychotomous Response Data.”
Journal of the American Statistical Association, 88, 669–679.

Congdon P (2001). Bayesian Statistical Modeling. Wiley, Chichester, England.

Fox JP (2004). “Modelling Response Error in School Effectiveness Research.” Statistica
Neerlandica, 58, 138–160.

Fox JP (2005). “Multilevel IRT Model Assessment.” In LA van der Ark, MA Croon, K Sijtsma
(eds.), “New Developments in Categorical Data Analysis for the Social and Behavioral
Sciences,” pp. 227–252. Lawrence Erlbaum, Mahwah, New Jersey.

Fox JP, Glas CAW (2001). “Bayesian Estimation of a Multilevel IRT Model Using Gibbs
Sampling.” Psychometrika, 66, 269–286.

Gelman A, Carlin JB, Stern HS, Rubin DB (2004). Bayesian Data Analysis. Chapman &
Hall/CRC, New York, 2nd edition.

Geman S, Geman D (1984). “Stochastic Relaxation, Gibbs Distribution, and the Bayesian
Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence,
6, 721–741.

Intel (2004). Visual FORTRAN Compiler for Windows (Version 8.0). Intel Corporation,
Santa Clara, CA. URL http://www.intel.com/.

Johnson V, Albert J (1999). Ordinal Data Modeling. Springer-Verlag, New York.

Kamata A (2001). “Item Analysis by the Hierarchical Generalized Linear Model.” Journal of
Educational Measurement, 38, 79–93.

Lindley DV, Smith AFM (1972). “Bayes Estimates for the Linear Model.” Journal of the
Royal Statistical Society B, 34, 1–41.

Maier KS (2001). “A Rasch Hierarchical Measurement Model.” Journal of Educational and
Behavioral Statistics, 26, 307–330.

Muraki E, Carlson JE (1995). “Full-Information Factor Analysis for Polytomous Item
Responses.” Applied Psychological Measurement, 19, 73–90.

OECD (2004). Learning for Tomorrow’s World. First Results From PISA 2003. Organisation
for Economic Co-operation and Development, Paris. URL http://www.pisa.oecd.org/.

Pinheiro JC, Bates DM (2000). Mixed-Effects Models in S and S-PLUS. Springer-Verlag, New
York.

Raudenbush SW, Bryk AS, Cheong YF, Congdon RT (2004). HLM 5 – Hierarchical Linear
and Nonlinear Modeling. Lincolnwood, IL. URL http://www.ssicentral.com/.

Raudenbush SW, Sampson RJ (1999). “Ecometrics: Toward a Science of Assessing Ecological
Settings, With Application to the Systematic Social Observation of Neighborhoods.”
Sociological Methodology, 29, 1–41.

R Development Core Team (2007). R: A Language and Environment for Statistical Computing.
Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Samejima F (1969). “Estimation of Latent Ability Using a Response Pattern of Graded
Scores.” Psychometrika Monographs, 17.

Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A (2002). “Bayesian Measures of Model
Complexity and Fit.” Journal of the Royal Statistical Society B, 64, 583–639.

Tanner MA, Wong WH (1987). “The Calculation of Posterior Distributions by Data
Augmentation.” Journal of the American Statistical Association, 82, 528–550.

Verhelst ND, Eggen TJHM (1989). Psychometrische en Statistische Aspecten van
Peilingsonderzoek, (PPON rapport 4, In Dutch) [Psychometric and Statistical Aspects of
Measurement Research]. CITO, Arnhem, Netherlands.

Visual Numerics (2004). IMSL Numerical Libraries (Version 5.0). Visual Numerics, Houston,
Texas. URL http://www.vni.com/.

Zimowski M, Muraki E, Mislevy R, Bock D (2005). BILOG-MG 3 – Multiple-Group IRT
Analysis and Test Maintenance for Binary Items. Scientific Software International, Inc.,
Lincolnwood, IL. URL http://www.ssicentral.com/.

Zwinderman AH (1991). “A Generalized Rasch Model for Manifest Predictors.”
Psychometrika, 56, 589–600.

Affiliation:
Jean-Paul Fox
University of Twente
Department of Research Methodology, Measurement and Data Analysis
Enschede, The Netherlands
E-mail: [email protected]
URL: http://users.edte.utwente.nl/Fox/

Journal of Statistical Software, http://www.jstatsoft.org/
published by the American Statistical Association, http://www.amstat.org/
Volume 20, Issue 5, May 2007
Submitted: 2006-10-01
Accepted: 2007-02-22
