Article
Practical Guide to Conducting an Item Response Theory Analysis
Journal of Early Adolescence, 2014, Vol. 34(1) 120–151
© The Author(s) 2013
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0272431613511332
jea.sagepub.com
Michael D. Toland1
Abstract
Item response theory (IRT) is a psychometric technique used in the
development, evaluation, improvement, and scoring of multi-item scales. This
pedagogical article provides the necessary information needed to understand
how to conduct, interpret, and report results from two commonly used
ordered polytomous IRT models (Samejima’s graded response [GR] model
and reduced GR model). Throughout this article, simulated data from a
multi-item scale is used to illustrate IRT analyses. The simulated data and
IRTPRO version 2.1 point-and-click commands needed to reproduce all
analyses in this article are available as supplemental online materials at http://
jea.sagepub.com/maint. The intent of this article is to provide an overview
of essential components of an IRT analysis to enable increased access to this
powerful tool for applied early adolescence researchers.
Keywords
item response theory, pedagogical, IRTPRO
Corresponding Author:
Michael D. Toland, Department of Educational, School, and Counseling Psychology, University
of Kentucky, 243 Dickey Hall, Lexington, KY 40506, USA.
Email: [email protected]
[Figure: probability of response curves for Items 1 to 4 plotted across the general perceived self-efficacy (θ) continuum, from low (−3) to high (3).]
GR Model
The GR model is a natural extension of the 2PL model developed for use with
items possessing two or more ordinal response categories or items consisting
of varying numbers of ordinal response categories (e.g., some items have three
categories, while others consist of five categories). The GR model estimates
a unique slope parameter for each item across the ordinal response categories
along with multiple between-category thresholds (e.g., b1 to b3) for items
having more than two categories. As each item on the GSE scale has four
ordered response categories, there are 4 – 1 = 3 threshold parameters and one
unique slope parameter to be estimated for each item. So, with 10 items, 40
parameters are estimated (i.e., 10 unique slope parameters across items and 3
thresholds per item for a total of 10 + 3 × 10 = 40). Each threshold reflects the
level of general perceived self-efficacy needed to have equal (.50) probability
of choosing to respond above a given threshold. In essence, each item is sepa-
rated into a series of dichotomies and an IRF is created for each threshold
(dichotomy) by means of the 2PL model. For instance, an IRF is created for
b1 to describe the probability of choosing to respond not at all true versus
hardly true, moderately true, and exactly true; then, another IRF is created
for b2 to describe the probability of choosing to respond not at all true and
hardly true versus moderately true and exactly true, and a final IRF is created
for b3 to describe the probability of choosing to respond not at all true, hardly
true, and moderately true versus exactly true (i.e., the IRFs plot for each
Figure 2. ORFs for Item 1 on the 10-item four-category GSE scale fit by the GR
model.
Note. The horizontal axis represents the level of the latent trait (which has a standard normal
distribution by construction) and the vertical axis represents the probability of choosing a
given response category at a specified latent trait level. ORF = option response function; GSE
= general self-efficacy; GR = graded response.
Figure 3. Ideal ORFs for Item 1 on the 10-item four-category GSE scale fit by a
GR model.
Note. The horizontal axis represents the level of the latent trait (which has a standard normal
distribution by construction), and the vertical axis represents the probability of choosing a
given response category at a specified latent trait level. ORF = option response function; GSE
= general self-efficacy; GR = graded response.
As Figure 2 shows, the ORF for the not at all true category never has a distinct peak relative to the other categories and is mostly encompassed by the ORF for the category labeled hardly true. This is an indication that the first response category for this particular item is not attracting respondents as intended. Ideally, if each category within an item were operating as expected by the model, were useful, and offered a unique (nonredundant) contribution as a response category, then the ORFs for each category would be expected to have a unique peak and be spaced out (separated) along the continuum (see Figure 3).
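To make the series-of-dichotomies formulation concrete, the following sketch computes the cumulative IRFs and the resulting category ORFs for a single GR item. The parameter values are the Item 1 estimates reported later in Table 2 (a = 0.96; b1 = −3.84, b2 = −1.69, b3 = 1.81), and the standard logistic form without the 1.7 scaling constant is an assumption about how IRTPRO parameterizes the model.

```python
import numpy as np

def gr_category_probs(theta, a, b):
    """Category probabilities (ORFs) for one graded response item.

    theta : array of latent trait values
    a     : item slope
    b     : array of between-category thresholds (b1 < b2 < ... < bm)
    Returns an array of shape (len(theta), m + 1), one column per category.
    """
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    # Cumulative IRFs: P(X >= k | theta) for each threshold, via the 2PL form.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    # Pad with P(X >= lowest category) = 1 and P(X > highest category) = 0,
    # then difference adjacent cumulative curves to get each category's ORF.
    cumulative = np.hstack([np.ones((theta.size, 1)), p_star,
                            np.zeros((theta.size, 1))])
    return cumulative[:, :-1] - cumulative[:, 1:]

# Item 1 estimates from Table 2 (four-category GSE scale).
theta = np.linspace(-3, 3, 121)
orfs = gr_category_probs(theta, a=0.96, b=[-3.84, -1.69, 1.81])
print(orfs.sum(axis=1))  # each row sums to 1.0 across the four categories
```

Plotting each column of orfs against theta produces curves like those shown in Figure 2.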
Reduced GR Model
The reduced GR model is a constrained version of the GR model. The reduced
GR model estimates one common slope (a) parameter across the ordinal
response categories for all items along with multiple between-category
thresholds (e.g., b1 to b3) for items having more than two categories. So, for
the 10-item four-category GSE scale using the reduced GR model, we would
estimate one common slope parameter across items and three thresholds per item. Examining item parameter estimates along with ORF plots for each item will help to determine if response options are being used as expected. This will be evalu-
ated when an inspection of expected functional form and model-data fit is
conducted.
As it is suspected that response options might not be used as expected, the
10-item four-category GSE scale was reduced into a 10-item three-category
GSE scale. Specifically, the two lowest response categories were combined
(i.e., not at all true and hardly true), with response percentages for this combined category represented in parentheses in Table 1.
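As a minimal illustration of this recoding, the sketch below assumes the item responses are stored as integers 0 through 3 (not at all true to exactly true) and merges the two lowest categories so that each item becomes three-category (0 through 2); the 0-3 scoring itself is an assumption about how the data file is coded.

```python
import numpy as np

def collapse_lowest_two(responses):
    """Recode 0/1/2/3 item responses to 0/1/2 by merging the two lowest categories."""
    responses = np.asarray(responses)
    return np.where(responses <= 1, 0, responses - 1)

original = np.array([0, 1, 2, 3, 1, 3])
print(collapse_lowest_two(original))  # -> [0 0 1 2 0 2]
```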
Appropriate Dimensionality
The assumption of appropriate dimensionality means that the IRT model
being used contains the correct number of continuous latent trait variables per
person for the data (Embretson & Reise, 2000). Before choosing an IRT
model, the dimensionality of the data should be thoroughly inspected (De
Ayala, 2009). However, scale developers or evaluators usually have previous
research, theory, conceptual framework, or a logical argument to build from
to identify how many latent trait variables a scale is intended to measure or
reflect. Most common parametric IRT models assume the latent trait variable
is reflected by a unidimensional continuum. That is, item responses (observed
data) can be reasonably explained by one continuous person variable (i.e., a
single dimension). When this assumption is found tenable, smaller minor fac-
tors do not have consequential influences on estimated latent trait scores (θ;
e.g., general perceived self-efficacy; Embretson & Reise, 2000).
Unidimensionality can be tested using non-IRT methods such as explor-
atory factor analysis (EFA) or confirmatory factor analysis (CFA): A CFA is
appropriate when the scale has known dimensional properties, while an EFA
is more appropriate when the scale is relatively unexplored in terms of dimen-
sion. However, if there appears to be a violation of the assumption of uni-
dimensionality, then the use of an exploratory or confirmatory multi-dimensional
IRT (MIRT) model may be warranted (De Ayala, 2009; Wirth & Edwards,
2007). MIRT models are readily available in IRTPRO, but are more complex
and beyond the scope of this article. If, however, the scale is intended to mea-
sure one latent trait variable, then problematic items can be removed from the
analysis to achieve plausible unidimensionality for the purposes of IRT anal-
ysis (Edwards, 2009).
To evaluate the assumption of unidimensionality, a one-factor CFA of the
10-item four-category GSE scale was fit using the current simulated sample
of 700 adolescents. A CFA was conducted because of theoretical knowledge
and previous empirical research showing a unidimensional construct under-
lies the GSE scale. A CFA was also repeated for the 10-item three-category
GSE scale. If a one-factor CFA fits the data, then this provides empirical
evidence that a single latent trait sufficiently explains the item responses or
common covariation among the items. Given the ordinal nature of the item
response categories, which cannot be assumed continuous, a robust (mean-
and variance-adjusted) weighted least-squares (WLSMV) estimator was
used, as implemented in Mplus 7.11 (Muthén & Muthén, 1998-2013). This estimator functions by factor analyzing a polychoric correlation matrix
among items. Several indices were used to assess the dimensionality or fit of
the one-factor CFA model to the observed sample data. We used the p value
associated with the χ2 index, the comparative fit index (CFI), the root-mean-
square error of approximation (RMSEA), and the weighted root-mean-square
residual (WRMR). Good model fit was based on guidelines suggested by
Hu and Bentler (1999) and Yu (2002): nonsignificant p value (p > .05) associ-
ated with the χ2 index, CFI ≥ .95, RMSEA ≤ .06, and WRMR close to 1. Note,
these guidelines may not be suitable for all situations.
Results from a one-factor CFA model show the model had good fit to the
sample data using the 10-item four-category GSE scale, CFI = .953, RMSEA
= .06, 90% CI [.048, .071], and WRMR = 1.057, but the model lacked fit
according to the χ2 index, χ2(35) = 122.58, p < .001. Similarly good fit was
found using the 10-item three-category GSE scale, CFI = .97, RMSEA = .05,
90% CI [.04, .06], and WRMR = 1.03, but the model lacked fit according to
the χ2 index, χ2(35) = 99.6, p < .001. However, a statistically significant χ2 index is not uncommon in practice; with a large sample, even minimal departures of the model from the data will be rejected statistically. Therefore, researchers in practice tend to
focus more on the other fit indices to judge model fit. All standardized load-
ings (i.e., the correlation between the item and the latent variable)
were positive, statistically significant, and ranged from .43 to .66 for the
four-category GSE scale and .47 to .69 for the three-category GSE scale. In
addition, all residual correlations were near zero or relatively small in value
(i.e., rresidual < |.10|) for the four-category and three-category GSE scales,
except for two item pairs (4 & 5, and 4 & 7), which were at |.12|. Based on
these CFA results, the four-category and three-category versions of the
10-item GSE scales were considered adequately unidimensional for con-
ducting a unidimensional IRT analysis.
IRT Calibrations
Using IRTPRO, the GR model and reduced GR model were fit to each item
on the GSE scale. The top panel of Table 2 summarizes the item calibration
results for the GR model fit to the 10-item four-category GSE scale (see the
left half of Table 2) and 10-item three-category GSE scale (see the right half
of Table 2). Slope parameters for the GR model fit to the 10-item four-
category and three-category GSE scale ranged from .86 (Item 4) to 1.64 (Item
8) and .91 (Item 4) to 1.67 (Item 8), respectively. The variation in slope
parameters suggests that a GR model estimating a unique slope parameter for
each item may be reasonable for these data. Threshold parameters for the GR
model fit to the 10-item four-category GSE scale ranged from −4.64 to −2.77
for b1, −1.69 to −0.41 for b2, and 1.26 to 2.73 for b3. For the GR model fit to
the 10-item three-category GSE scale, thresholds ranged from −1.67 to −0.39
for b1, and 1.30 to 2.62 for b2. Given the similarity in slope and threshold
estimates across the two versions of the GSE, item calibration results suggest
a three-category response scale system for the GSE is potentially a stable
alternative for the GSE scale. However, the decision to eliminate or collapse the first category should not be based on this information alone for the GR model, but should be considered in conjunction with the ORF plots.
The bottom panel of Table 2 summarizes the item calibration results for
the reduced GR model fit to the 10-item four-category (see the left half of
Table 2) and three-category GSE scale (see the right half of Table 2). The
slope parameter for the reduced GR model fit to the 10-item four-category
and three-category GSE was 1.18 and 1.19, respectively.
Threshold parameters for the reduced GR model fit to the 10-item four-
category GSE scale ranged from −4.22 to −3.13 for b1, −1.68 to −0.42 for b2,
and 1.32 to 2.24 for b3, while the reduced GR model fit to the 10-item three-
category GSE scale had thresholds that ranged from −1.68 to −0.41 for b1,
and 1.31 to 2.22 for b2. Similar values for the threshold parameters across the
two versions of the GSE scale (i.e., b2 and b3 thresholds for the four-category
GSE scale matched up with b1 and b2 thresholds for the three-category GSE
scale) suggest a three-category response scale system for the GSE is potentially a stable alternative for the GSE scale. However, examination of the ORF plots will help to determine if this is indeed necessary.

Table 2. GR and Reduced GR Model Item Parameter Estimates (Standard Errors) and Item-Fit Statistics for the 10-Item Four-Category (Left Half of Table) and Three-Category (Right Half of Table) GSE Scale.

GR model
      10-item four-category GSE scale                                    10-item three-category GSE scale
Item  a (SE)      b1 (SE)      b2 (SE)      b3 (SE)     S-χ2    p        a (SE)      b1 (SE)      b2 (SE)     S-χ2   p
1     0.96 (.11)  −3.84 (.41)  −1.69 (.17)  1.81 (.19)  98.87   .0001    0.99 (.11)  −1.64 (.18)  1.78 (.18)  73.27  .0001
2     1.25 (.12)  −3.48 (.32)  −1.61 (.14)  1.26 (.12)  45.52   .1326    1.20 (.12)  −1.67 (.15)  1.30 (.13)  23.92  .5251
3     1.22 (.12)  −3.41 (.31)  −1.08 (.11)  1.64 (.15)  69.94   .0017    1.22 (.12)  −1.07 (.11)  1.65 (.15)  32.36  .1810
4     0.86 (.10)  −4.64 (.55)  −0.82 (.12)  2.73 (.31)  53.29   .0777    0.91 (.11)  −0.77 (.12)  2.62 (.28)  36.09  .1131
5     0.99 (.11)  −4.35 (.47)  −1.56 (.16)  1.86 (.19)  58.95   .0201    1.03 (.11)  −1.51 (.16)  1.81 (.18)  46.78  .0105
6     1.24 (.12)  −3.02 (.26)  −0.41 (.08)  2.16 (.19)  52.85   .0551    1.26 (.13)  −0.39 (.08)  2.14 (.19)  44.35  .0099
7     1.26 (.12)  −3.99 (.39)  −1.43 (.13)  1.32 (.12)  43.82   .0794    1.24 (.12)  −1.45 (.14)  1.33 (.12)  38.89  .0377
8     1.64 (.15)  −2.77 (.21)  −0.84 (.08)  1.46 (.11)  41.89   .1653    1.67 (.16)  −0.83 (.08)  1.45 (.11)  28.31  .2036
9     1.25 (.12)  −4.01 (.39)  −1.18 (.11)  1.88 (.17)  33.74   .3849    1.26 (.13)  −1.17 (.12)  1.87 (.16)  30.19  .1781
10    1.23 (.12)  −3.52 (.32)  −0.81 (.09)  1.70 (.15)  47.96   .1069    1.27 (.13)  −0.79 (.10)  1.66 (.14)  21.23  .7307

Reduced GR model
      10-item four-category GSE scale                                    10-item three-category GSE scale
Item  a (SE)      b1 (SE)      b2 (SE)      b3 (SE)     S-χ2    p        a (SE)      b1 (SE)      b2 (SE)     S-χ2   p
1     1.18 (.05)  −3.30 (.22)  −1.46 (.11)  1.57 (.11)  112.63  .0001    1.19 (.05)  −1.44 (.10)  1.56 (.11)  77.16  .0001
2     1.18 (.05)  −3.64 (.26)  −1.68 (.12)  1.32 (.10)  46.26   .1173    1.19 (.05)  −1.68 (.11)  1.31 (.10)  24.73  .5353
3     1.18 (.05)  −3.50 (.24)  −1.11 (.09)  1.69 (.11)  71.61   .0016    1.19 (.05)  −1.09 (.09)  1.67 (.11)  31.88  .1964
4     1.18 (.05)  −3.61 (.25)  −0.65 (.08)  2.15 (.14)  55.77   .0143    1.19 (.05)  −0.63 (.08)  2.14 (.14)  40.65  .0249
5     1.18 (.05)  −3.80 (.27)  −1.38 (.10)  1.65 (.11)  66.88   .0013    1.19 (.05)  −1.36 (.10)  1.63 (.11)  53.63  .0011
Note. GR = graded response; GSE = general self-efficacy; a = item slope (discrimination) parameter; b = item threshold (difficulty, location) parameter; S-χ2 = item-fit statistic; p = p value associated with item-fit statistic. Values in parentheses are item parameter standard error estimates.
LI
A second assumption of unidimensional IRT models is that of LI or condi-
tional independence, which is closely related to the assumption of unidimen-
sionality. LI is the assumption that the only influence on an individual’s item
response is that of the latent trait variable being measured and that no other
variable (e.g., other items on the GSE scale, reading ability, or another latent
trait variable) is influencing individual item responses. That is, for a given
adolescent with a known general perceived self-efficacy score, a response to
an item is independent from a response to any other item. Although LI is not
necessarily a concern in CTT nor detectable from a classical item analysis
revolving around Cronbach’s alpha, violating the LI assumption is a serious
issue for an IRT analysis because it can distort estimated item parameters
(e.g., slopes can become inflated and thresholds across items can become
more homogenous), item standard errors (e.g., standard errors can appear to
look smaller giving the impression of better item parameter estimates), IRT
scores and associated standard errors (e.g., standard errors around scores may
be smaller, item and/or scale information functions may be inflated, which
may lead to a false impression of score precision), and model-fit statistics (De
Ayala, 2009; Edelen & Reeve, 2007). In essence, local dependency (LD) can result in a score that reflects something other than the construct being measured. LD can occur for numerous reasons, such as when the wording of two or more item stems is so similar, or synonyms are used across items, that adolescents cannot differentiate between the items and thus select the same response category across items (see De Ayala, 2009; Reeve et al., 2007).
To assess the tenability of LI, the (approximately) standardized LD χ2 sta-
tistic (Chen & Thissen, 1997) for each item pair was examined. LD statistics greater than |10| were considered large, reflecting likely LD issues or leftover residual variance that is not accounted for by the unidimensional IRT model; LD statistics between |5| and |10| were considered moderate and questionable LD; and LD statistics less than |5| were considered small and inconsequential (see footnote in Cai, du Toit, & Thissen, 2011b, p. 77). However,
sparseness in the observed table for an item pair can lead to a possible LD
issue (Cai et al., 2011b, p. 77). Thus, item content and a cross tabulation of
item pairs displaying potential LD should be inspected. If an item identified
as having LD is indeed a threat to the assumption of LI, it is expected that
parameter estimates (i.e., slopes and/or thresholds) and item-fit statistics
Item  1            2           3           4           5           6           7           8          9          10
1     —            11.3 (4.4)  7.9 (10.9)  16.7 (14.4) 14.4 (15.5) 14.3 (9.8)  1.6 (0.9)   6.2 (5.9)  2.7 (3.3)  4.6 (3.4)
2     11.8 (4.5)   —           2.4 (3.2)   3.8 (6.0)   5.8 (6.5)   2.2 (0.3)   1.3 (2.1)   1.1 (0.4)  0.4 (2.3)  0.3 (0.5)
3     6.6 (9.2)    2.2 (3.1)   —           0.2 (0.5)   0.5 (0.2)   0.7 (2.0)   0.3 (0.4)   5.7 (3.6)  0.8 (0.3)  2.1 (2.3)
4     9.9 (9.9)    3.1 (5.0)   0.1 (0.1)   —           7.2 (5.3)   1.3 (0.5)   10.0 (2.8)  0.1 (0.1)  3.2 (4.1)  1.3 (1.7)
5     10.0 (11.2)  4.9 (5.3)   0.1 (0.6)   8.7 (7.7)   —           3.0 (0.8)   3.0 (2.8)   2.4 (5.6)  1.3 (2.7)  2.5 (1.5)
6     12.5 (9.5)   2.3 (0.2)   0.6 (1.8)   1.0 (0.3)   2.5 (0.4)   —           5.4 (1.3)   2.7 (2.7)  4.2 (4.6)  4.1 (5.6)
7     1.6 (1.5)    1.5 (2.2)   0.4 (0.4)   6.4 (3.1)   2.7 (2.6)   5.1 (1.1)   —           6.7 (4.3)  3.8 (5.3)  2.1 (4.1)
8     6.2 (5.6)    0.2 (−0.2)  3.2 (4.2)   5.9 (0.0)   3.0 (6.6)   3.1 (4.0)   5.6 (0.8)   —          2.6 (4.1)  5.2 (7.5)
9     2.7 (3.1)    0.8 (2.6)   0.6 (0.0)   3.8 (5.1)   1.7 (3.0)   4.4 (4.7)   4.3 (5.8)   1.3 (2.7)  —          1.0 (0.9)
10    3.5 (2.6)    0.2 (0.8)   2.2 (2.3)   0.6 (0.6)   2.2 (1.4)   4.2 (5.7)   2.6 (0.2)   2.6 (4.0)  0.8 (0.7)  —
Note. LD = local dependency; GR = graded response; GSE = general self-efficacy. The lower left triangle contains standardized LD χ2 statistics for the GR model; the upper right triangle contains standardized LD χ2 statistics for the reduced GR model. Values not in parentheses are standardized LD χ2 statistics for the 10-item four-category GSE scale; values in parentheses are standardized LD χ2 statistics for the 10-item three-category GSE scale. Absolute values are reported. Bolded values represent large LD statistics (i.e., |LD| > 10).
Residual correlations (i.e., another non-IRT method that can be used for detecting LD; see the Appropriate Dimensionality results) from the unidimensional CFAs also showed little excess dependency or covariation remaining among items. That is, residual correlations were ≤ |.12|, which is below the |.20| cutoff suggested by Morizot, Ainsworth, and Reise (2007). Based on these results, the assumption of LI was deemed tenable, but Item 1 should be considered for removal.
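The screening rule described above is simple to script. The sketch below classifies the entries of a square matrix of standardized LD χ2 statistics using the |10| and |5| cutoffs; the small example matrix is made up for illustration and is not taken from the GSE analysis.

```python
import numpy as np

def flag_ld_pairs(ld_matrix, large=10.0, moderate=5.0):
    """Classify each item pair's standardized LD chi-square statistic.

    ld_matrix : square array of LD statistics (only the off-diagonal is used)
    Returns lists of 1-based (item_i, item_j) pairs flagged as large or moderate LD.
    """
    ld = np.abs(np.asarray(ld_matrix, dtype=float))
    n = ld.shape[0]
    large_pairs, moderate_pairs = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if ld[i, j] > large:
                large_pairs.append((i + 1, j + 1))
            elif ld[i, j] > moderate:
                moderate_pairs.append((i + 1, j + 1))
    return large_pairs, moderate_pairs

# Toy example: the pair of items 1 and 2 shows a large LD statistic.
example = np.array([[0.0, 11.8, 2.1],
                    [11.8, 0.0, 4.9],
                    [2.1, 4.9, 0.0]])
print(flag_ld_pairs(example))  # -> ([(1, 2)], [])
```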
Figure 4. ORF for Item 1 on the 10-item three-category GSE scale fit by the GR
model.
Note. The horizontal axis represents the level of the latent trait (which has a standard normal
distribution by construction), and the vertical axis represents the probability of choosing a
given response category at a specified latent trait level. ORF = option response function; GSE
= general self-efficacy; GR = graded response.
As can be seen in Figure 2, the predicted ORFs plot for Item 1 within the 10-item four-category GSE scale shows that the item is behaving primarily as a three-category item,
with a category score of 0 (not at all true) being less likely to be selected than
any other category for almost the entire general perceived self-efficacy con-
tinuum (i.e., between −3 and 3). Based on this observation, the low response
frequency for the not at all true category (see Table 1), and similar values for
the threshold parameters across the two versions of the GSE scale (see the left
half of Table 2), the four-category GSE scale was collapsed into a three-
category GSE scale. Accordingly, plots of the ORFs were reexamined for the
GR and reduced GR models. Figure 4 provides the GR model ORFs plot for
Item 1 based on the 10-item three-category GSE scale, which is similar to the ORF plots observed for each item on this scale and for the reduced GR model fit to the 10-item three-category GSE scale. Based on these results, the optimal
number of response categories for items on the GSE scale was viewed as 3.
Therefore, all remaining IRT analyses are based on the 10-item three-
category GSE scale.
Model level fit (Comparison). After model-data fit at the item level has been
found to be reasonable, complementary model-data fit statistics designed to
assess relative fit at the model level can now be used. To compare the relative
fit of the models to the sample data, multiple methods were used as described
in De Ayala (2009): The change in the −2 log likelihood (−2LL or Deviance)
from two hierarchically nested models (also known as a likelihood ratio test;
LRT) and its complement the relative change statistic (R²Δ; Haberman, 1978),
the Bayesian information criterion (BIC), the Akaike information criterion
(AIC), and the M2 limited information goodness-of-fit statistic and its associated RMSEA.
A nonsignificant χ2 statistic for the change in −2LL would suggest the additional complexity of the Full model (i.e., the additional estimation of a unique slope parameter for each item) is not necessary to improve model-data fit, while a statistically significant χ2 statistic would suggest it is necessary (De Ayala, 2009). R²Δ measures the relative change (i.e., the % improvement) between two hierarchically nested models and is calculated as R²Δ = (−2LLReduced model − (−2LLFull model)) / −2LLReduced model. BIC and AIC are relative information criteria
statistics, where smaller values indicate a better fitting model. The M2 statis-
tic measures how well a model fits the sample data, which is based on one-
and two-way marginal tables (Cai et al., 2006; Maydeu-Olivares & Joe, 2005,
2006). Similar to other goodness-of-fit statistics, M2 assumes perfect model-
data fit in the population. Although a nonsignificant p value is desired with
the M2 statistic, this test can be overly sensitive to small model-data misfit,
which can lead to artificially small p values. Therefore, the RMSEA is
reported along with the M2 statistic. RMSEA ranges from 0 to 1 with values
close to zero indicating adequate model-data fit (e.g., RMSEA ≅ .05), which
is similar to how it is defined in structural equation modeling (see Maydeu-
Olivares, Cai, & Hernández, 2011). In general, smaller M2 values indicate
better model-data fit. The M2 statistic is a relatively new statistic that is not
incorporated into most IRT programs (De Ayala, 2009); however, IRTPRO
provides this statistic along with the RMSEA statistic for some IRT models
on request.
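For readers who want to reproduce the nested-model comparison by hand, the sketch below computes the LRT, its p value, R²Δ, AIC, and BIC from the deviance (−2LL) values and parameter counts reported in the next paragraph (19 parameters for the reduced GR model and 27 for the GR model on the 9-item three-category scale, with N = 700); small discrepancies from the reported BIC values reflect rounding of the deviances.

```python
import math
from scipy.stats import chi2

def compare_nested_irt_models(dev_reduced, k_reduced, dev_full, k_full, n):
    """Likelihood ratio test, relative change, AIC, and BIC for two nested models."""
    lrt = dev_reduced - dev_full                      # change in -2LL (deviance)
    df = k_full - k_reduced                           # difference in parameter counts
    p_value = chi2.sf(lrt, df)
    r2_delta = lrt / dev_reduced                      # relative (%) improvement
    aic = {"reduced": dev_reduced + 2 * k_reduced, "full": dev_full + 2 * k_full}
    bic = {"reduced": dev_reduced + k_reduced * math.log(n),
           "full": dev_full + k_full * math.log(n)}
    return lrt, df, p_value, r2_delta, aic, bic

# Deviances and parameter counts reported below for the 9-item three-category GSE scale.
print(compare_nested_irt_models(dev_reduced=11142.98, k_reduced=19,
                                dev_full=11127.20, k_full=27, n=700))
# LRT ~ 15.78 on 8 df, p ~ .046, R2-delta ~ .0014;
# both AIC and BIC come out smaller for the reduced GR model.
```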
Results from the LRT suggest the additional complexity of the GR model
(i.e., allowing slopes to vary across items) is necessary to improve model-data
fit over and above that obtained with the reduced GR model (i.e., estimating a
common slope across items), χ²Δ(27 − 19 = 8) = 11,142.98 − 11,127.2 = 15.78, p = .046. The relative change between these models was R²Δ = 15.78 / 11,142.98
= .0014, which means that the GR model improves our explanation of the item
responses over that of the reduced GR model by only 0.14%. Although the LRT
suggests statistically significant improvement in model-data fit in favor of the
GR model, the R²Δ value suggests that this is not a meaningful improvement
over the reduced GR model. A comparison of the BIC and AIC statistics demon-
strates the lack of superiority of the GR model (BIC = 11,304.07, AIC = 11,181.20)
to the reduced GR model (BIC = 11,267.14, AIC = 11,180.98) based on the
smaller AIC and BIC statistics for the reduced GR model. Moreover, both the
GR model and reduced GR model demonstrated similar and adequate model-
data fit, M2(135) = 331.98, p < .001, RMSEA = .05, and M2(143) = 345.63,
Figure 5. Graded response model item information functions for nine items from
the GSE scale.
Note. Each function represents the amount of information (precision) each item provides over
the θ range. GSE = general self-efficacy.
Item 8 stands apart from all other items because it provides the most information (pre-
cision) around θ = −0.85 and θ = 1.48, which are the item’s respective thresh-
olds (b1 and b2). The item providing the least amount of information across
the continuum is Item 4 as its slope value was the lowest relative to all other
items on the scale. This item could be removed if it was deemed that its con-
tent was already redundant with another item or if a shorter form was desired.
Two items that appear to provide nearly identical information across the con-
tinuum are Items 2 and 7 because their respective IIFs are nearly identical,
which suggests that only one of these items may be necessary.
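Each IIF in Figure 5 can be computed directly from the GR model category probabilities. The sketch below assumes the standard logistic GR parameterization (no 1.7 scaling constant) and uses the Item 8 estimates from Table 2 for the three-category scale (a = 1.67; b1 = −0.83, b2 = 1.45); IRTPRO's internal computations may differ slightly.

```python
import numpy as np

def gr_item_information(theta, a, b):
    """Item information function for a graded response item (logistic form).

    Uses I(theta) = sum_k (dP_k/dtheta)^2 / P_k, where P_k is the probability of
    category k and dP_k/dtheta = a * [P*_k (1 - P*_k) - P*_(k+1) (1 - P*_(k+1))],
    with P*_k the cumulative (2PL) probability of responding at or above threshold k.
    """
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    cumulative = np.hstack([np.ones((theta.size, 1)), p_star,
                            np.zeros((theta.size, 1))])
    category_probs = cumulative[:, :-1] - cumulative[:, 1:]
    derivative = a * (cumulative[:, :-1] * (1 - cumulative[:, :-1])
                      - cumulative[:, 1:] * (1 - cumulative[:, 1:]))
    return np.sum(derivative ** 2 / category_probs, axis=1)

# Item 8 estimates from Table 2 (three-category GSE scale).
theta = np.linspace(-3, 3, 121)
info = gr_item_information(theta, a=1.67, b=[-0.83, 1.45])
print(theta[np.argmax(info)], info.max())  # the IIF is highest near the item's thresholds
```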
To understand how the GSE scale works as a whole, the IIFs can be summed together at each level of θ to create a total information function (TIF).
Thus, the quality of items (i.e., the amount of information each item pro-
vides) and the number of items determine the TIF. This means that each item
contributes independently unique information to the TIF and is not dependent
on other items. This is also another reason the assumption of LI is important.
The TIF provides useful details about how scale information varies as a function of location on the trait continuum; furthermore, the TIF can be used to
identify gaps in the continuum. Although the metric of information is not
directly interpretable on its own (Edwards, 2009), a useful metric often used
to capture the amount of error around an IRT score is the expected standard
error of estimate (SEE; SEE ≅ 1/√information). The expected SEE measures
the amount of uncertainty about a person’s IRT score (De Ayala, 2009, p. 27).
The SEE can also be plotted as a function to gauge the expected amount of
error along the continuum. So, if the goal of our GSE scale was to measure a
broad range of the latent trait continuum, say between −3 and 3, then an ideal
TIF and corresponding SEE function would be uniform across this range. For
instance, if information is constant at 16 across a broad range of the contin-
uum, then the expected SEE for this range is 1/√16 ≅ .25. However, if the
goal of our GSE scale was to measure a specific range or point on the con-
tinuum, such as a cutpoint used to determine whether an individual possesses
an adequate (or inadequate) amount of a given latent trait (e.g., θ = 0.5), then
items can be selected that best match this location on the continuum. That is,
the TIF would be more peaked (and corresponding SEE would be smaller) at
the cutpoint. In essence, the TIF and SEE function can be used as the blue-
print for designing a scale based on a pre-specified amount of information or
maximum amount of expected error needed around a score or range of scores.
If, however, it was necessary to report a single numeric value that sum-
marizes the precision for the entire range or region using IRT, then marginal
reliability (Green, Bock, Humphreys, Linn, & Reckase, 1984) can be estimated (marginal reliability ≅ 1 − SEE² or 1 − 1/information). Marginal reli-
ability is similar to traditional reliability and in IRTPRO is an estimate based
on the total test information function. Using our example from earlier, if
information is constant at 16 and SEE is .25 across a broad range of the continuum, then the estimated marginal reliability for this range is 1 − .25² ≅ .9375. However, the marginal reliability value provided by IRTPRO is only
useful if the TIF or SEE function is uniform across the entire latent trait
continuum; otherwise, it can over- or under-estimate precision along the
continuum.
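Because these conversions come up repeatedly, the short sketch below encodes the approximations used in this section (SEE ≅ 1/√information and marginal reliability ≅ 1 − 1/information) and reproduces the worked numbers from the examples in this section.

```python
import math

def see_from_information(information):
    """Expected standard error of estimate: SEE ~ 1 / sqrt(information)."""
    return 1.0 / math.sqrt(information)

def marginal_reliability_from_information(information):
    """Marginal reliability ~ 1 - SEE^2 = 1 - 1 / information."""
    return 1.0 - 1.0 / information

def information_from_reliability(reliability):
    """Invert the reliability approximation: information = 1 / (1 - reliability)."""
    return 1.0 / (1.0 - reliability)

print(see_from_information(16), marginal_reliability_from_information(16))  # 0.25, 0.9375
print(see_from_information(15), marginal_reliability_from_information(15))  # ~0.258, ~0.933
print(information_from_reliability(0.90), see_from_information(10))         # 10.0, ~0.316
```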
The TIF, SEE function, and marginal reliability estimate are readily avail-
able in IRTPRO once a set of items have been calibrated. In IRTPRO, the TIF
is the sum of all the IIFs + 1. The main reference for + 1 is Thissen and
Wainer (2001). The + 1 comes from the fact that a prior (assumed) distribu-
tion (e.g., standard normal distribution) is used for estimating latent trait
scores (θ), which provides information (L. Stam, personal communication,
May 14, 2013). If the + 1 is not used in the creation of the TIF, then the esti-
mated scores and corresponding standard errors do not accurately match up
with the scores and SEEs found within IRTPRO.
Figure 6 displays the TIF (solid line) for the 9-item GSE scale. The TIF
shows the GSE scale provides relatively uniform information (e.g., informa-
tion ≅ 4) for the range of −1.5 to 2.2, which has an associated marginal reli-
ability of about .75 (marginal reliability ≅ 1−1/4) and expected standard
error of estimate (dashed line in Figure 6) of about 0.5 (SEE ≅ 1/√4) around
scores in this range. The marginal reliability for response pattern scores pro-
vided by IRTPRO is .76, but this value is an estimate for the entire range of
the continuum. However, outside this range of −1.5 to 2.2 marginal reliabil-
ity decreases and SEE increases. Thus, if a more precise GSE scale was
desired within this range or across more of the continuum, then more items
need to be added to the scale to meet the desired information or level of
expected SEE. For instance, if we desired information to be 15, then the cor-
responding SEE ≅ .2582 and marginal reliability ≅ .93, but if we desired the
marginal reliability to be .90, then the corresponding SEE ≅ .3162 and infor-
mation ≅ 10.
To summarize, the 9-item GSE scale provides precise estimates of scores
(information ≅ 4, marginal reliability ≅ .75, expected SEE ≅ 0.5) for a broad
range of the continuum, −1.5 to 2.2. The maximum amount of information
(precision) was approximately 4.5 around latent trait estimates of −0.8 and
1.5. However, precision worsens and expected SEEs around score estimates grow outside of this range, to levels less than would be desired. To improve score estimates beyond this range, additional items need to be written that have thresholds below −1.5 and above 2.
Figure 6. Total information function (solid line) and expected SEE function (dashed line) for the GSE scale.
Note. The horizontal axis represents the latent variable, general perceived self-efficacy. The left vertical axis represents the amount of information (precision) provided by the GSE scale for a given score. The right axis represents the expected amount of standard error around a score. More information (information ≅ 1/SEE²) produces a more reliable score (marginal reliability ≅ 1 − 1/information) and a smaller expected SEE around a score (SEE ≅ 1/√information). SEE = standard error of estimate; GSE = general self-efficacy.

Once item parameters have been estimated, respondents' estimated scores on the latent trait continuum can be found. Conceptually, IRT score
estimates are created by taking the observed response pattern for each
respondent and weighting them by the item parameters (Edwards, 2009).
By default, IRT score estimates are placed on a standard normal metric. The
IRT scores for the 700 respondents range from −2.3 to 2.7 (M = 0, SD =
0.87), which are on the same metric as the item thresholds. Given that some
respondents (about 6% of our 700) were observed to have IRT scores out-
side the range where the GSE scale provides the most precise estimates
(−1.5 to 2), uncertainty in estimates increases for these respondents. So, if estimates are needed outside this range, then more items with thresholds less than −1.5 or above 2 are needed to measure the more extreme levels of the general perceived self-efficacy continuum; in addition, revisions to existing items or new items would be needed to improve the overall precision of the GSE scale.
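One common way such scores are computed is expected a posteriori (EAP) estimation, in which the GR likelihood of a respondent's response pattern is combined with the standard normal prior over a grid of θ values. The sketch below is illustrative only, not IRTPRO's exact algorithm, and the three items' parameters are hypothetical values similar in magnitude to those in Table 2.

```python
import numpy as np

def gr_likelihood(theta_grid, responses, slopes, thresholds):
    """Likelihood of a response pattern at each grid point under the GR model."""
    like = np.ones_like(theta_grid)
    for resp, a, b in zip(responses, slopes, thresholds):
        b = np.asarray(b, dtype=float)
        p_star = 1.0 / (1.0 + np.exp(-a * (theta_grid[:, None] - b[None, :])))
        cumulative = np.hstack([np.ones((theta_grid.size, 1)), p_star,
                                np.zeros((theta_grid.size, 1))])
        category_probs = cumulative[:, :-1] - cumulative[:, 1:]
        like *= category_probs[:, resp]
    return like

def eap_score(responses, slopes, thresholds, n_points=61):
    """EAP latent trait estimate and its posterior standard deviation (an SEE analog)."""
    theta_grid = np.linspace(-4, 4, n_points)
    prior = np.exp(-0.5 * theta_grid ** 2)  # standard normal prior (unnormalized)
    posterior = prior * gr_likelihood(theta_grid, responses, slopes, thresholds)
    posterior /= posterior.sum()
    mean = np.sum(theta_grid * posterior)
    sd = np.sqrt(np.sum((theta_grid - mean) ** 2 * posterior))
    return mean, sd

# Hypothetical three-category items with GR parameters similar in size to Table 2.
slopes = [1.2, 1.2, 1.7]
thresholds = [[-1.5, 1.3], [-1.1, 1.7], [-0.8, 1.5]]
print(eap_score(responses=[2, 1, 1], slopes=slopes, thresholds=thresholds))
```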
Conclusion
This article differs from other articles demonstrating how to conduct IRT
analyses in several ways. Similar to Edelen and Reeve (2007) and Edwards
(2009), this article provides details for replication by applied learners through
detailed description of the necessary steps for conducting IRT analyses.
Furthermore, this article offers a realistic depiction of the mental processes
involved with determining the best decision at pivotal points in the analysis
process, including notes on how these decisions may be modified dependent
on the particular data set and research purpose. This article also demonstrates
the utility of the (approximately) standardized LD χ2 statistic and the M2 sta-
tistic as provided in IRTPRO, which are not readily available in most IRT programs
and not commonly discussed in pedagogical papers for IRT. Finally, this
article builds on the pedagogical papers written by Edelen and Reeve and
Edwards by providing and interpreting the IRT results as well as offering
access to the data and IRTPRO files used throughout the article. It is hoped
that this article facilitates the work of applied researchers wanting to conduct,
interpret, and report IRT analyses on a multi-item scale. Those who want to
deepen their understanding of IRT after reading this article may consider De
Ayala (2009), Embretson and Reise (2000), and Hambleton et al. (1991).
Funding
The author(s) received no financial support for the research, authorship, and/or publi-
cation of this article.
References
Baker, F. B. (2001). The basics of item response theory (2nd ed., ERIC Document
Reproduction Service No. ED 458 219). College Park, MD: Eric Clearing House
on Assessment and Evaluation.
Cai, L., du Toit, S. H. C., & Thissen, D. (2011a). IRTPRO: Flexible professional item
response theory modeling for patient reported outcomes (Version 2.1) [Computer
software]. Chicago, IL: Scientific Software International.
Cai, L., du Toit, S. H. C., & Thissen, D. (2011b). IRTPRO: User guide. Lincolnwood,
IL: Scientific Software International.
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-
information goodness-of-fit testing of item response theory models for sparse 2P
Morizot, J., Ainsworth, A. T., & Reise, S. P. (2007). Toward modern psychometrics:
Application of item response theory models in personality research. In R. W.
Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in
personality (pp. 407-423). New York, NY: Guilford Press.
Muthén, L. K., & Muthén, B. O. (1998-2013). Mplus user’s guide (7th ed.). Los
Angeles, CA: Author.
Nering, M. L, & Ostini, R. (Eds.). (2010). Handbook of polytomous item response
theory models. New York, NY: Routledge.
Orlando, M., & Thissen, D. (2000). Likelihood-based item fit indices for dichotomous
item response theory models. Applied Psychological Measurement, 24, 50-64.
doi:10.1177/01466216000241003
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-χ2:
An item fit index for use with dichotomous item response theory models. Applied
Psychological Measurement, 27, 289-298. doi:10.1177/0146621603027004004
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY:
Springer.
Reeve, B. B., & Fayers, P. (2005). Applying item response theory modeling for evalu-
ating questionnaire item and scale properties. In P. Fayers & R. D. Hays (Eds.),
Assessing quality of life in clinical trials (2nd ed., pp. 55-73). New York, NY:
Oxford University Press.
Reeve, B. B., Hayes, R. D., Bjorner, J. B., Cook, K. F., Crane, P. K., Teresi, J. A., . . .
Cella, D. (2007). Psychometric evaluation and calibration of health-related qual-
ity of life item banks. Medical Care, 45, S22-S31.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded
scores (Psychometric Monograph No. 17, Part 2). Richmond, VA: Psychometric
Society.
Schwarzer, R., & Jerusalem, M. (1995). Generalized self-efficacy scale. In J.
Weinman, S. Wright & M. Johnston (Eds.), Measures in health psychology: A
user’s portfolio. Causal and control beliefs (pp. 35-37). Windsor, UK: NFER-
NELSON.
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response
theory. Thousand Oaks, CA: Sage.
Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit of item response the-
ory models: A comparison of traditional and alternative procedures. Journal
of Educational Measurement, 40, 331-352. doi:10.1111/j.1745-3984.2003.
tb01150.x
Thissen, D. & Wainer, H. (Eds.). (2001). Test scoring. Mahwah, NJ: Lawrence
Erlbaum.
Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and
future directions. Psychological Methods, 12, 58-79.
Yu, C.-Y. (2002). Evaluating cutoff criteria of model fit indices for latent vari-
able models with binary and continuous outcomes (Doctoral dissertation). Los
Angeles, CA. Retrieved from http://statmodel2.com/download/Yudissertation.
pdf
Author Biography
Michael D. Toland received his PhD in August of 2008 from the Quantitative,
Qualitative, and Psychometric Methods program at the University of Nebraska at
Lincoln, where he was an advisee of Dr. Ralph De Ayala. Since August of 2008, he
has been an assistant professor in the Educational Psychology program in the
Department of Educational, School, and Counseling Psychology at the University of
Kentucky. His research interests include psychometrics, item response theory, factor
analysis, scale development, multilevel modeling, and the realization of modern mea-
surement and statistical methods in educational research.