
The Journal of Early Adolescence
http://jea.sagepub.com/

Practical Guide to Conducting an Item Response Theory Analysis

Michael D. Toland
The Journal of Early Adolescence 2014 34: 120, originally published online 19 November 2013
DOI: 10.1177/0272431613511332

The online version of this article can be found at:
http://jea.sagepub.com/content/34/1/120


Article

Practical Guide to Conducting an Item Response Theory Analysis

Michael D. Toland1

Journal of Early Adolescence
2014, Vol. 34(1) 120-151
© The Author(s) 2013
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0272431613511332
jea.sagepub.com

Abstract
Item response theory (IRT) is a psychometric technique used in the
development, evaluation, improvement, and scoring of multi-item scales. This
pedagogical article provides the necessary information needed to understand
how to conduct, interpret, and report results from two commonly used
ordered polytomous IRT models (Samejima’s graded response [GR] model
and reduced GR model). Throughout this article, simulated data from a
multi-item scale is used to illustrate IRT analyses. The simulated data and
IRTPRO version 2.1 point-and-click commands needed to reproduce all
analyses in this article are available as supplemental online materials at
http://jea.sagepub.com/maint. The intent of this article is to provide an overview
of essential components of an IRT analysis to enable increased access to this
powerful tool for applied early adolescence researchers.

Keywords
item response theory, pedagogical, IRTPRO

1University of Kentucky, Lexington, USA

Corresponding Author:
Michael D. Toland, Department of Educational, School, and Counseling Psychology, University of Kentucky, 243 Dickey Hall, Lexington, KY 40506, USA.
Email: [email protected]

Constructing a psychometrically appropriate multi-item scale to measure a
latent trait variable (such as general perceived self-efficacy, subjective well-
being, or emotional intelligence) entails the collaboration of content experts,
item writing experts, and psychometricians or measurement experts (Reeve
& Fayers, 2005). A modern approach for measuring the relationship between
the items on a scale (i.e., the observed manifestations) and the latent trait
variable is item response theory (IRT; Lord, 1952, 1980). IRT comprises a
collection of statistical models that define the relationship between an indi-
vidual’s unobserved continuous trait variable (e.g., general perceived self-
efficacy) and item characteristics (e.g., ease of endorsing an item that reflects
general perceived self-efficacy) to predict the probability of endorsing an
item on a scale (De Ayala, 2009). The importance of IRT is that the analysis
focus is at the item level and not just the overall scale level. An item level
focus allows for the potential of devising, revising, and optimizing scales for
specific uses (Baker, 2001; De Ayala, 2009; Embretson & Reise, 2000;
Hambleton, Swaminathan, & Rogers, 1991).
The primary purpose of this article is to show the fundamental steps
involved in conducting, interpreting, and presenting the results from an IRT
analysis in a format accessible for applied researchers working in the field of
early adolescence. Although not exhaustive, the general steps involved in an
IRT analysis include (1) clarifying the purpose of a study, (2) considering
relevant models, (3) conducting a preliminary data inspection, (4) evaluating
model assumptions and testing competing models, and (5) evaluating and
interpreting results. These steps are all introduced around a recurring sample
and scale.
To follow these steps, it is assumed that the reader has some familiarity
with regression, classical test theory (CTT), and factor analysis. All IRT-
based analyses presented herein are based on using IRTPRO (Item
Response Theory for Patient-Reported Outcomes) version 2.1 (Cai, du
Toit, & Thissen, 2011a), while Mplus 7.11 (Muthén & Muthén, 1998-
2013) was used to assess appropriate dimensionality. The simulated data
and IRTPRO version 2.1 point-and-click commands needed to reproduce
all analyses in this article are available as supplemental online materials at
http://jea.sagepub.com/maint. This allows the reader to compare their
analyses with those provided in this article. The reader is encouraged to
conduct the IRT analyses while reading along. IRTPRO is a recent devel-
opment in IRT software that addresses barriers to its widespread use by
offering a more user-friendly interface. Because several IRT models exist,
this article is limited to two common parametric unidimensional models
applicable to Likert-type or ordinal multi-categorical (polytomous) item
responses. To produce this article, the author borrowed heavily from De
Ayala (2009), Edelen and Reeve (2007), Edwards (2009), and Embretson
and Reise (2000).


An Illustrated Example Using Simulated Perceived Self-Efficacy Data
For our analysis example, assume data were collected from 700 U.S. early
adolescents aged 13 to 14 in Grade 8 on the English version of Schwarzer and
Jerusalem’s (1995) general self-efficacy (GSE) scale. (Note, sample size
guidelines are not entirely clear and are beyond the scope of this article.
Those interested in sample size guidelines for various IRT models and sug-
gestions for further readings on this topic can refer to De Ayala, 2009).
Briefly, perceived self-efficacy refers to the belief that an individual has
regarding their capability to perform tasks or to cope with adversity
(Schwarzer & Jerusalem, 1995). The GSE scale was designed to measure a
general sense of perceived self-efficacy for the general adult population and
not intended for use with people below the age of 12. This scale is supposed
to be used to measure how a person adapts after life changes and is also use-
ful as an indicator of quality of life. The GSE scale asks respondents to self-
report how true statements are of themselves using a 4-point unipolar
Likert-type response scale: 0 (not at all true), 1 (hardly true), 2 (moderately
true), and 3 (exactly true). Higher scores on the GSE scale are intended to
imply higher levels of general perceived self-efficacy. More information on
the GSE scale can be retrieved from http://userpage.fu-berlin.de/~health/engscal.htm

Basic IRT Concepts and a Common Dichotomous Model in IRT
Suppose we wanted to measure early adolescents’ general perceived self-
efficacy, which we believe to underlie or give rise to responses on the GSE
scale. Further suppose we asked each early adolescent to self-report how true
each statement was about them on a dichotomous scale: 0 (not at all true) and
1 (exactly true). For these data, we could use a common IRT model appropri-
ate for dichotomous item responses—the two-parameter logistic (2PL)
response model. Note, the 2PL model equation is included in the Appendix of
the supplemental online materials at http://jea.sagepub.com/maint. It is called
the 2PL model because it estimates a unique slope (a) and threshold param-
eter (b) for each item. The item slope (or discrimination) parameter reflects
the strength of the relationship each item on a multi-item scale (e.g., GSE
scale) has with the latent trait variable being measured (e.g., general per-
ceived self-efficacy), which is frequently symbolized with the Greek letter
theta (θ). Higher (steeper) slopes reflect a stronger relationship with the latent
trait variable. The slope parameter is similar to the item-total correlation in a
CTT item analysis or factor pattern loading from a factor analysis, but the
slope parameter is not bounded between −1 and 1. The item threshold (also
referred to as a location, difficulty, or severity) parameter reflects the point at
which a respondent with a given latent trait has an equal probability (50:50)
of endorsing an item (e.g., not at all true vs. exactly true), which is like the
item difficulty index in a CTT item analysis. However, the item threshold
interpretation is the inverse of the interpretation of the CTT item difficulty
index in that higher threshold values in IRT mean more difficult to endorse
versus higher values in CTT which denote that an item is easier to endorse.
Importantly, IRT places item threshold parameters and person latent trait
scores on the same continuum, which can be conceptualized as a z score type
metric (e.g., standard deviation units from the mean). Higher scores reflect
more of the latent trait variable. Latent trait scores and threshold parameters
commonly range from −2 to 2 (Baker, 2001), but it is not uncommon for
thresholds to range between −3 and 3. However, threshold values outside this
range are uncommon and potentially an indication of problematic items or
less useful response categories. Moreover, latent trait scores can be converted
to any preferred scale score metric (e.g., M of 500 and SD of 100). As for the
slope parameter, it can theoretically range from −∞ to ∞, but a reasonably
“good” range is from 0.5 to 3.0 (Baker, 2001; De Ayala, 2009; Hambleton
et al., 1991). However, the slope parameter for ordered polytomous IRT mod-
els can have a much broader range.
A key feature of IRT is the item response function (IRF), which graphi-
cally represents the nonlinear relationship between the probability of endors-
ing an item (e.g., not at all true vs. exactly true) on a scale given an individual’s
score on a latent trait variable (e.g., general perceived self-efficacy) and char-
acteristics of that specific item (e.g., slope and threshold[s]). This relation-
ship is often modeled using a logistic function that is monotonically
increasing, but can also be modeled using other functions (e.g., loglinear).
This means that the probability of endorsing an item increases as a person
moves upward along the trait continuum. To better understand this concept,
Figure 1 displays the IRFs plot for four hypothetical items from the 2PL
model. Items 1 and 2 have equal (parallel) slopes, but are located at different
points along the general perceived self-efficacy continuum. This is because
Items 1 and 2 both have a slope parameter of 1 and threshold parameters of
−1 and 1, respectively. Clearly, changes in the threshold parameters shift the
IRFs left and right on the general perceived self-efficacy continuum. A simi-
lar interpretation can be made about Items 3 and 4. When comparing Item 1
with Item 3 we can see the IRFs cross at −1, but have different slopes. Items
1 and 3 can be described as the “easiest” to endorse items because



Figure 1. Two-parameter logistic model item response functions for four hypothetical items dichotomously scored.
Note. a1 = 1, b1 = −1; a2 = 1, b2 = 1; a3 = 2, b3 = −1; a4 = 2, b4 = 1; subscripts refer to item number; a = slope; b = threshold. The horizontal axis represents the level of the latent trait (which has a standard normal distribution by construction), and the vertical axis represents the probability of choosing or endorsing the category exactly true at a specified latent trait level.

the probability of an Exactly true response requires a lower level of general
perceived self-efficacy than is required by Items 2 and 4. Similarly, Items 2 and
4 can be described as the “hardest” items to endorse because the probability
of an Exactly true response tends to require a higher level of general per-
ceived self-efficacy than Items 1 and 3. Collectively, all four items conform
to the 2PL model because the slope parameter is estimated freely across all
items. All of the IRFs in Figure 1 show the effect different slopes and thresh-
olds have on the probability of endorsing an item.
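To make the 2PL relationship concrete, the short Python sketch below (an illustration added here, not part of the article, its supplemental materials, or IRTPRO; the function name irf_2pl is invented) evaluates the logistic item response function for the four hypothetical items in Figure 1 and confirms that the endorsement probability is exactly .50 when θ equals an item's threshold.

import math

def irf_2pl(theta, a, b):
    # 2PL item response function: probability of endorsing a dichotomous item
    # given latent trait level theta, slope a, and threshold b.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Slopes (a) and thresholds (b) for the four hypothetical items in Figure 1.
items = {1: (1.0, -1.0), 2: (1.0, 1.0), 3: (2.0, -1.0), 4: (2.0, 1.0)}

for item, (a, b) in items.items():
    # At theta == b the probability is exactly .50 (the defining property of b);
    # theta == 0 shows how slope and threshold jointly shift the probability
    # for an "average" respondent.
    print(item, round(irf_2pl(b, a, b), 2), round(irf_2pl(0.0, a, b), 3))

Comparing the printed values across items shows the same pattern described above: equal slopes with different thresholds shift the curve left or right, while larger slopes make the curve rise more steeply around the threshold.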

Step 1: Clarifying the Purpose of a Study


Before beginning any study, it is important to clarify the purpose of the study
along with research questions and then proceed accordingly to choose the
analytic technique that offers the best insight into the characteristics of the
data. In general, researchers using IRT are interested in evaluating the
psychometric properties of items and multi-item scales prior to using them.
Given the diversity of uses that exist for any one scale, an IRT analysis of an
existing or recently developed scale could be conducted prior to its intended
use to ensure the scale is as precise as possible across a desired score range or
appropriately matched to latent trait levels of the intended population (i.e.,
test appropriateness). Importantly, attempts to use a scale with poor psycho-
metric characteristics (e.g., poor precision across or at particular locations
along the intended latent trait continuum) can jeopardize potential score
interpretation inferences and, if used as an outcome variable in a statistical
analysis, can lead to reduced statistical conclusion validity (e.g., statistical
decision errors; see Embretson, 1996; Kang & Waller, 2005).
Suppose this study’s purpose was to investigate the psychometric proper-
ties of a 10-item GSE scale administered to U.S. early adolescents aged 13 to
14 in Grade 8 using IRT. The intention was to use the GSE scale to describe
adolescents with low to high general perceived self-efficacy; this scale score
could then be used as a predictor or outcome variable in a larger study (e.g.,
a mediation analysis). To this end, the following research questions were pos-
ited: Is the ordinal nature of the 4-point response set stable across the response
category system? What is the level of measurement precision across the gen-
eral perceived self-efficacy continuum? Are there redundant items? Are there
any gaps on the measured continuum? Note that this article uses simulated
data that were designed to mimic data collected as part of a larger study based
on self-report measures. Hence, no theoretical questions were evaluated and
no empirical inferences should be drawn from the analyses described herein,
but the analyses demonstrated herein attempt to make the conceptual more
accessible.

Step 2: Considering Relevant Models


Several ordered polytomous IRT models have been developed for items that
use a response scale with more than two options (see Nering & Ostini, 2010,
for a thorough overview of the development of polytomous IRT models), but
for the purposes of this article two ordered polytomous IRT models easily
available in IRTPRO will be emphasized. The first model is Samejima’s
(1969) graded response (GR) model, sometimes referred to as the homoge-
neous GR model. The second model is a reduced (constrained, modified, or
parsimonious) version of this model. These two models were selected because
they are commonly used and most appropriate for Likert-type item responses.
Specifically, the GR model is important to understand because, as demonstrated
in this article, it offers necessary details through which readers can
learn key concepts and uses of IRT that extend to other IRT models. For
instance, the models discussed herein are generalizations of dichotomous
models and when all items are reduced to two response categories these mod-
els can represent two commonly used dichotomous models. There are also
extensions or variations of the GR model that allow for varying slopes across
thresholds within each item, which may be referred to as the heterogeneous
GR model (Samejima, 1969, pp. 19-20), but are beyond the scope of this article.
Therefore, an understanding of the GR model, and one of its variations, the
reduced GR model, will prepare readers wanting to understand and use other
IRT models; especially given its unique relationship among polytomous and
dichotomous IRT models. The primary distinction among polytomous IRT
models is the varying number of parameters used to describe each item.
Ultimately, the choice of which IRT model to use is up to the user. This choice
is usually driven by types of items (e.g., dichotomous, polytomous, both),
tests of the assumptions of the potential models being considered, compari-
sons of competing models, and ultimately how well a model explains the
observed item responses (e.g., model-data fit assessments; De Ayala, 2009;
Hambleton et al., 1991).

GR Model
The GR model is a natural extension of the 2PL model developed for use with
items possessing two or more ordinal response categories or items consisting
of varying number of ordinal response categories (e.g., some items have three
categories, while others consist of five categories). The GR model estimates
a unique slope parameter for each item across the ordinal response categories
along with multiple between-category thresholds (e.g., b1 to b3) for items
having more than two categories. As each item on the GSE scale has four
ordered response categories, there are 4 – 1 = 3 threshold parameters and one
unique slope parameter to be estimated for each item. So, with 10 items, 40
parameters are estimated (i.e., 10 unique slope parameters across items and 3
thresholds per item for a total of 10 + 3 × 10 = 40). Each threshold reflects the
level of general perceived self-efficacy needed to have equal (.50) probability
of choosing to respond above a given threshold. In essence, each item is sepa-
rated into a series of dichotomies and an IRF is created for each threshold
(dichotomy) by means of the 2PL model. For instance, an IRF is created for
b1 to describe the probability of choosing to respond not at all true versus
hardly true, moderately true, and exactly true; then, another IRF is created
for b2 to describe the probability of choosing to respond not at all true and
hardly true versus moderately true and exactly true, and a final IRF is created
for b3 to describe the probability of choosing to respond not at all true, hardly
true, and moderately true versus exactly true (i.e., the IRFs plot for each



Figure 2. ORFs for Item 1 on the 10-item four-category GSE scale fit by the GR
model.
Note. The horizontal axis represents the level of the latent trait (which has a standard normal distribution by construction), and the vertical axis represents the probability of choosing a given response category at a specified latent trait level. ORF = option response function; GSE = general self-efficacy; GR = graded response.

ordered threshold would be presented similarly to those in Figure 1). Then, to
calculate the probabilities that a respondent would respond in a given cate-
gory, the difference in probabilities between adjacent IRFs for varying levels
of the latent trait variable is found. These differences in response probabili-
ties are then used to create option response functions (ORFs), which reflect
the probability of responding in a particular response category given a level
of the latent trait. Figure 2 displays a sample ORFs plot for Item 1 on the
10-item four-category GSE scale fit by the GR model. Notice the sum of the
probabilities across ORFs equals 1, or 100%, for any given level of the latent
trait, which will always be true. For instance, given an adolescent with aver-
age (θ = 0) general perceived self-efficacy, the most likely category to be
endorsed is moderately true with a probability of .69, while the probability of
responding in the other categories is .02 (not at all true), .15 (hardly true),
and .20 (exactly true). Furthermore, notice the ORF for the category not at all
true is substantially less than that of the ORFs for the other response



Figure 3. Ideal ORFs for Item 1 on the 10-item four-category GSE scale fit by a
GR model.
Note. The horizontal axis represents the level of the latent trait (which has a standard normal distribution by construction), and the vertical axis represents the probability of choosing a given response category at a specified latent trait level. ORF = option response function; GSE = general self-efficacy; GR = graded response.

categories and is mostly encompassed by the ORF for the category labeled
hardly true. This is an indication that the first response category for this par-
ticular item is not attracting respondents as intended. Ideally, if each category
within an item were operating as expected by the model, were useful, and offered
a unique (nonredundant) contribution as a response category, then the ORFs
for each category would be expected to have a unique peak and be spaced out
(separated) along the continuum (see Figure 3).
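The arithmetic behind the ORFs can be sketched in a few lines of Python (an illustration only, not the article's IRTPRO output; the function names are invented). Using the Item 1 slope and threshold estimates reported later in Table 2 for the four-category scale, the cumulative 2PL-type curves are computed first and then differenced to obtain the four category probabilities, which necessarily sum to 1 at any θ; the exact values may differ slightly from those read off Figure 2.

import math

def cumulative(theta, a, b):
    # Probability of responding at or above the boundary defined by threshold b.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def grm_category_probs(theta, a, thresholds):
    # Differences between adjacent cumulative curves give the ORF values.
    cums = [1.0] + [cumulative(theta, a, b) for b in thresholds] + [0.0]
    return [cums[k] - cums[k + 1] for k in range(len(thresholds) + 1)]

a1, b1 = 0.96, [-3.84, -1.69, 1.81]       # Item 1 estimates, four-category GSE scale (Table 2)
probs = grm_category_probs(0.0, a1, b1)   # an "average" adolescent, theta = 0
print([round(p, 2) for p in probs], "sum =", round(sum(probs), 2))  # probabilities sum to 1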

Reduced GR Model
The reduced GR model is a constrained version of the GR model. The reduced
GR model estimates one common slope (a) parameter across the ordinal
response categories for all items along with multiple between-category
thresholds (e.g., b1 to b3) for items having more than two categories. So, for
the 10-item four-category GSE scale using the reduced GR model, we would
estimate one common slope parameter across items and three thresholds per
item for a total of 1 + 3 × 10 = 31 parameters. Recall, the GR model would
estimate 40 parameters. An advantage of the reduced GR model is that it is
more parsimonious than the GR model because it simplifies the number of
parameters needing estimation, but by simplifying the model in this manner,
the relationship between the latent trait and each item is assumed to be the
same.
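The difference in complexity between the two models is simply a matter of counting parameters; the brief snippet below (illustrative only, not from the article) reproduces the 40 versus 31 counts given above.

# Freely estimated item parameters for a scale with n_items items that each
# use n_cat ordered response categories.
n_items, n_cat = 10, 4
gr_params = n_items + n_items * (n_cat - 1)       # GR: a unique slope per item plus thresholds
reduced_gr_params = 1 + n_items * (n_cat - 1)     # reduced GR: one common slope plus thresholds
print(gr_params, reduced_gr_params)               # 40 31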

Step 3: Conducting a Preliminary Data Inspection


Before proceeding, a brief description of the data file is warranted. In an IRT
analysis, a separate or unique data record is needed for each case. A unique
identifier in the data file serves two purposes. One, it can be used to identify
any cases with aberrant or unexpected response patterns. Two, once the item
calibrations (i.e., the estimation of item parameters) have been completed and
IRT scores have been created, a researcher could use the IRT scores matched
with a unique identifier in another desired data analysis (e.g., ANOVA,
regression): This can be termed a two-step approach with Step 1 including
both item calibration and IRT score generation, while Step 2 involves imple-
mentation of a statistical analysis using IRT scores as either a predictor, medi-
ator, moderator, or outcome variable.
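To illustrate what Step 1 of this two-step approach could look like in practice, the Python sketch below computes an expected a posteriori (EAP) IRT score for a single case from already estimated GR-model item parameters and then rescales it to an M = 500, SD = 100 metric. The item parameters, response pattern, and function names are hypothetical and are not taken from the article or from IRTPRO, which produces such scores directly.

import math

def grm_probs(theta, a, thresholds):
    # Category probabilities for one item under the graded response model.
    cums = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cums[k] - cums[k + 1] for k in range(len(thresholds) + 1)]

def eap_score(responses, item_params, n_points=81):
    # EAP estimate of theta under a standard normal prior, using a simple
    # rectangular quadrature over the range -4 to 4.
    grid = [-4 + 8 * i / (n_points - 1) for i in range(n_points)]
    num = den = 0.0
    for theta in grid:
        prior = math.exp(-0.5 * theta ** 2)
        likelihood = 1.0
        for x, (a, bs) in zip(responses, item_params):
            likelihood *= grm_probs(theta, a, bs)[x]
        num += theta * prior * likelihood
        den += prior * likelihood
    return num / den

# Hypothetical parameters for two three-category items and one response pattern.
params = [(1.2, [-1.6, 1.3]), (1.2, [-1.1, 1.6])]
theta_hat = eap_score([2, 1], params)
print(round(theta_hat, 2), round(500 + 100 * theta_hat))  # raw theta and rescaled score

The resulting score, matched to the case's unique identifier, is what would then be carried into the Step 2 statistical analysis.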
Prior to conducting an IRT analysis, it is important to check that adequate
numbers fall into each response category per item. Although no hard-and-fast
guideline exists for what is considered adequate, more responses per category
help to increase item parameter estimation accuracy and to assess the utility
of response categories. If adequate numbers are not being used in response
categories, then it may be necessary to collapse across response categories to
produce a reduced response category system. Doing so will improve the
accuracy and precision (stability) in item parameter estimates. If this latter
step is done, it would be prudent to collect data from a new sample (De Ayala,
2009) with newly applied labels to reassess the validity of the model and item
parameters; otherwise, the item calibrations could be sample specific.
Table 1 displays item numbers, statements, and response percentages for
the 10-item four-category GSE scale. Item level response percentages show
few adolescents chose to use the lowest response category (not at all true)
across most items with as few as 1.4% (n = 10) to as many as 4.4% (n = 31)
choosing the lowest response category. Although the response percentages
(frequencies) seem low for the not at all true response category, the fact that
there are responses in each response category for each item is enough to esti-
mate item parameters for an IRT model. However, observed responses in
each response category across items do not mean the response category sys-
tem is being used as expected or is stable. An inspection of the
item parameter estimates along with ORFs plots for each item will help to
determine if response options are being used as expected. This will be evalu-
ated when an inspection of expected functional form and model-data fit is
conducted.
As it is suspected that response options might not be used as expected, the
10-item four-category GSE scale was reduced into a 10-item three-category
GSE scale. Specifically, the two lowest response categories were combined
(i.e., not at all true and hardly true), with response categories for this com-
bined category represented in parentheses in Table 1.
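A preliminary inspection of this kind is straightforward to script. The sketch below (illustrative only; the responses array is a random stand-in, not the simulated GSE data provided in the supplemental materials) tabulates the percentage of respondents in each category per item and then collapses the two lowest categories into one, mirroring the recoding summarized in Table 1.

import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(0, 4, size=(700, 10))   # stand-in for 700 cases x 10 items coded 0-3

# Percentage of respondents choosing each response category, item by item.
for item in range(responses.shape[1]):
    counts = np.bincount(responses[:, item], minlength=4)
    print(f"Item {item + 1}:", np.round(100 * counts / counts.sum(), 1))

# Collapse categories 0 (not at all true) and 1 (hardly true) into a single
# category, yielding a three-category (0, 1, 2) coding for reanalysis.
collapsed = np.where(responses <= 1, 0, responses - 1)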

Step 4: Evaluating Model Assumptions and Testing Competing Models
Common IRT Assumptions
For a typical parametric IRT analysis, there are four assumptions that underlie
the model: appropriate dimensionality (unidimensionality), local independence (LI),
functional form, and a normally distributed latent variable in the population.
There are IRT models available that allow for relaxation of these
assumptions (e.g., nonparametric IRT, see Sijtsma & Molenaar, 2002; multidimensional
IRT, see Reckase, 2009). Each of these assumptions is described
and tested below with the exception of the assumption of a normally distributed
latent variable, which is assumed by the estimation process in IRTPRO.

Appropriate Dimensionality
The assumption of appropriate dimensionality means that the IRT model
being used contains the correct number of continuous latent trait variables per
person for the data (Embretson & Reise, 2000). Before choosing an IRT
model, the dimensionality of the data should be thoroughly inspected (De
Ayala, 2009). However, scale developers or evaluators usually have previous
research, theory, conceptual framework, or a logical argument to build from
to identify how many latent trait variables a scale is intended to measure or
reflect. Most common parametric IRT models assume the latent trait variable
is reflected by a unidimensional continuum. That is, item responses (observed
data) can be reasonably explained by one continuous person variable (i.e., a
single dimension). When this assumption is found tenable, smaller minor fac-
tors do not have consequential influences on estimated latent trait scores (θ;
e.g., general perceived self-efficacy; Embretson & Reise, 2000).
Unidimensionality can be tested using non-IRT methods such as explor-
atory factor analysis (EFA) or confirmatory factor analysis (CFA): A CFA is

Table 1. Response Percentages for the 10-Item Four-Category and Three-Category General Self-Efficacy Scale (N = 700).

Item | Statement | Not at all true | Hardly true | Moderately true | Exactly true
1 | I can always manage to solve difficult problems if I try hard enough. | 3.7 | 16.1 (19.9) | 61.6 | 18.6
2 | If someone opposes me, I can find the means and ways to get what I want. | 2.6 | 14.3 (16.9) | 60.6 | 22.6
3 | It is easy for me to stick to my aims and accomplish my goals. | 3.0 | 23.6 (26.6) | 56.7 | 16.7
4 | I am confident that I could deal efficiently with unexpected events. | 2.6 | 32.9 (35.4) | 53.6 | 11.0
5 | Thanks to my resourcefulness, I know how to handle unforeseen situations. | 2.1 | 19.3 (21.4) | 61.3 | 17.3
6 | I can solve most problems if I invest the necessary effort. | 4.4 | 36.1 (40.6) | 49.1 | 10.3
7 | I can remain calm when facing difficulties because I can rely on my coping abilities. | 1.4 | 18.0 (19.4) | 58.9 | 21.7
8 | When I am confronted with a problem, I can usually find several solutions. | 3.3 | 24.4 (27.7) | 56.6 | 15.7
9 | If I am in trouble, I can usually think of a solution. | 1.4 | 22.7 (24.1) | 62.6 | 13.3
10 | I can usually handle whatever comes my way. | 2.6 | 29.0 (31.6) | 52.4 | 16.0

Note. Values in parentheses represent response percentages after collapsing the first two response categories, not at all true or hardly true, into one category.


appropriate when the scale has known dimensional properties, while an EFA
is more appropriate when the scale is relatively unexplored in terms of dimensionality.
However, if there appears to be a violation of the assumption of uni-
dimensionality, then the use of an exploratory or confirmatory multi-dimensional
IRT (MIRT) model may be warranted (De Ayala, 2009; Wirth & Edwards,
2007). MIRT models are readily available in IRTPRO, but are more complex
and beyond the scope of this article. If, however, the scale is intended to mea-
sure one latent trait variable, then problematic items can be removed from the
analysis to achieve plausible unidimensionality for the purposes of IRT anal-
ysis (Edwards, 2009).
To evaluate the assumption of unidimensionality, a one-factor CFA of the
10-item four-category GSE scale was fit using the current simulated sample
of 700 adolescents. A CFA was conducted because of theoretical knowledge
and previous empirical research showing a unidimensional construct under-
lies the GSE scale. A CFA was also repeated for the 10-item three-category
GSE scale. If a one-factor CFA fits the data, then this provides empirical
evidence that a single latent trait sufficiently explains the item responses or
common covariation among the items. Given the ordinal nature of the item
response categories, which cannot be assumed continuous, a robust (mean-
and variance-adjusted) weighted least-squares (WLSMV) estimator was
used, as implemented in Mplus 7.11 (Muthén & Muthén, 1998-2013): This
estimator functions by factor analyzing a polychoric correlation matrix
among items. Several indices were used to assess the dimensionality or fit of
the one-factor CFA model to the observed sample data. We used the p value
associated with the χ2 index, the comparative fit index (CFI), the root-mean-
square error of approximation (RMSEA), and the weighted root-mean-square
residual (WRMR). Good model fit was based on guidelines suggested by
Hu and Bentler (1999) and Yu (2002): nonsignificant p value (p > .05) associ-
ated with the χ2 index, CFI ≥ .95, RMSEA ≤ .06, and WRMR close to 1. Note,
these guidelines may not be suitable for all situations.
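The dimensionality evidence in this article comes from a one-factor CFA with the WLSMV estimator in Mplus. As a much rougher, purely descriptive screen that can be run without Mplus (and which is not the article's method), the sketch below inspects the eigenvalues of the inter-item Pearson, rather than polychoric, correlation matrix; a first eigenvalue that clearly dominates the second is consistent with, though not proof of, a single factor. The responses array is the same hypothetical stand-in used above.

import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(0, 4, size=(700, 10))   # placeholder item responses

corr = np.corrcoef(responses, rowvar=False)      # Pearson correlations (not polychoric)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(np.round(eigvals, 2))
print("first-to-second eigenvalue ratio:", round(eigvals[0] / eigvals[1], 2))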
Results from a one-factor CFA model show the model had good fit to the
sample data using the 10-item four-category GSE scale, CFI = .953, RMSEA
= .06, 90% CI [.048, .071], and WRMR = 1.057, but the model lacked fit
according to the χ2 index, χ2(35) = 122.58, p < .001. Similarly good fit was
found using the 10-item three-category GSE scale, CFI = .97, RMSEA = .05,
90% CI [.04, .06], and WRMR = 1.03, but the model lacked fit according to
the χ2 index, χ2(35) = 99.6, p < .001. However, a statistically significant χ2
index is not uncommon in practice, because with large samples even minimal
departures from the data are rejected statistically. Therefore, researchers in practice tend to
focus more on the other fit indices to judge model fit. All standardized load-
ings (i.e., the correlation between the item and the latent variable)
were positive, statistically significant, and ranged from .43 to .66 for the
four-category GSE scale and .47 to .69 for the three-category GSE scale. In
addition, all residual correlations were near zero or relatively small in value
(i.e., residual r < |.10|) for the four-category and three-category GSE scales,
except for two item pairs (4 & 5, and 4 & 7), which were at |.12|. Based on
these CFA results, the four-category and three-category versions of the
10-item GSE scales were considered adequately unidimensional for con-
ducting a unidimensional IRT analysis.

IRT Calibrations
Using IRTPRO, the GR model and reduced GR model were fit to each item
on the GSE scale. The top panel of Table 2 summarizes the item calibration
results for the GR model fit to the 10-item four-category GSE scale (see the
left half of Table 2) and 10-item three-category GSE scale (see the right half
of Table 2). Slope parameters for the GR model fit to the 10-item four-
category and three-category GSE scale ranged from .86 (Item 4) to 1.64 (Item
8) and .91 (Item 4) to 1.67 (Item 8), respectively. The variation in slope
parameters suggests that a GR model estimating a unique slope parameter for
each item may be reasonable for these data. Threshold parameters for the GR
model fit to the 10-item four-category GSE scale ranged from −4.64 to −2.77
for b1, −1.69 to −0.41 for b2, and 1.26 to 2.73 for b3. For the GR model fit to
the 10-item three-category GSE scale, thresholds ranged from −1.67 to −0.39
for b1, and 1.30 to 2.62 for b2. Given the similarity in slope and threshold
estimates across the two versions of the GSE, item calibration results suggest
a three-category response scale system for the GSE is potentially a stable
alternative for the GSE scale. However, the elimination or collapsing of the
first category should not be based on this information alone for the GR model,
but in conjunction with the ORFs plots.
The bottom panel of Table 2 summarizes the item calibration results for
the reduced GR model fit to the 10-item four-category (see the left half of
Table 2) and three-category GSE scale (see the right half of Table 2). The
slope parameter for the reduced GR model fit to the 10-item four-category
and three-category GSE was 1.18 and 1.19, respectively.
Threshold parameters for the reduced GR model fit to the 10-item four-
category GSE scale ranged from −4.22 to −3.13 for b1, −1.68 to −0.42 for b2,
and 1.32 to 2.24 for b3, while the reduced GR model fit to the 10-item three-
category GSE scale had thresholds that ranged from −1.68 to −0.41 for b1,
and 1.31 to 2.22 for b2. Similar values for the threshold parameters across the
two versions of the GSE scale (i.e., b2 and b3 thresholds for the four-category
GSE scale matched-up with b1 and b2 thresholds for the three-category GSE


Table 2. GR Models (Upper Panel) and Reduced GR Models (Lower Panel) Item Parameter Estimates and Item-Fit Statistics for 10-Item Four-Category (Left Half of Table) and Three-Category (Right Half of Table) GSE Scale.

GR model 10-item four-category GSE scale          GR model 10-item three-category GSE scale

Item   a   b1   b2   b3   S-χ2   p          a   b1   b2   S-χ2   p

1 0.96 (.11) −3.84 (.41) −1.69 (.17) 1.81 (.19) 98.87 .0001 0.99 (.11) −1.64 (.18) 1.78 (.18) 73.27 .0001
2 1.25 (.12) −3.48 (.32) −1.61 (.14) 1.26 (.12) 45.52 .1326 1.20 (.12) −1.67 (.15) 1.30 (.13) 23.92 .5251
3 1.22 (.12) −3.41 (.31) −1.08 (.11) 1.64 (.15) 69.94 .0017 1.22 (.12) −1.07 (.11) 1.65 (.15) 32.36 .1810
4 0.86 (.10) −4.64 (.55) −0.82 (.12) 2.73 (.31) 53.29 .0777 0.91 (.11) −0.77 (.12) 2.62 (.28) 36.09 .1131
5 0.99 (.11) −4.35 (.47) −1.56 (.16) 1.86 (.19) 58.95 .0201 1.03 (.11) −1.51 (.16) 1.81 (.18) 46.78 .0105
6 1.24 (.12) −3.02 (.26) −0.41 (.08) 2.16 (.19) 52.85 .0551 1.26 (.13) −0.39 (.08) 2.14 (.19) 44.35 .0099
7 1.26 (.12) −3.99 (.39) −1.43 (.13) 1.32 (.12) 43.82 .0794 1.24 (.12) −1.45 (.14) 1.33 (.12) 38.89 .0377
8 1.64 (.15) −2.77 (.21) −0.84 (.08) 1.46 (.11) 41.89 .1653 1.67 (.16) −0.83 (.08) 1.45 (.11) 28.31 .2036
9 1.25 (.12) −4.01 (.39) −1.18 (.11) 1.88 (.17) 33.74 .3849 1.26 (.13) −1.17 (.12) 1.87 (.16) 30.19 .1781
10 1.23 (.12) −3.52 (.32) −0.81 (.09) 1.70 (.15) 47.96 .1069 1.27 (.13) −0.79 (.10) 1.66 (.14) 21.23 .7307

Reduced GR model 10-item four-category GSE scale Reduced GR model 10-item three-category GSE scale

Item a b1 b2 b3 S-χ2 p a b1 b2 S-χ2 p

1 1.18 (.05) −3.30 (.22) −1.46 (.11) 1.57 (.11) 112.63 .0001 1.19 (.05) −1.44 (.10) 1.56 (.11) 77.16 .0001
2 1.18 (.05) −3.64 (.26) −1.68 (.12) 1.32 (.10) 46.26 .1173 1.19 (.05) −1.68 (.11) 1.31 (.10) 24.73 .5353
3 1.18 (.05) −3.50 (.24) −1.11 (.09) 1.69 (.11) 71.61 .0016 1.19 (.05) −1.09 (.09) 1.67 (.11) 31.88 .1964
4 1.18 (.05) −3.61 (.25) −0.65 (.08) 2.15 (.14) 55.77 .0143 1.19 (.05) −0.63 (.08) 2.14 (.14) 40.65 .0249
5 1.18 (.05) −3.80 (.27) −1.38 (.10) 1.65 (.11) 66.88 .0013 1.19 (.05) −1.36 (.10) 1.63 (.11) 53.63 .0011

6 1.18 (.05) −3.13 (.21) −0.42 (.08) 2.24 (.14) 52.24 .0617 1.19 (.05) −0.41 (.08) 2.22 (.14) 43.11 .0136
7 1.18 (.05) −4.19 (.33) −1.50 (.11) 1.38 (.10) 42.92 .1153 1.19 (.05) −1.49 (.10) 1.36 (.10) 37.94 .0468
8 1.18 (.05) −3.44 (.24) −1.03 (.09) 1.80 (.11) 47.77 .1102 1.19 (.05) −1.02 (.09) 1.78 (.12) 35.70 .0970
9 1.18 (.05) −4.22 (.33) −1.23 (.10) 1.96 (.12) 33.57 .3926 1.19 (.05) −1.22 (.09) 1.94 (.13) 31.73 .2016
10 1.18 (.05) −3.63 (.26) −0.84 (.09) 1.75 (.11) 46.53 .1351 1.19 (.05) −0.83 (.08) 1.74 (.12) 20.47 .7695

Note. GR = graded response; GSE = general self-efficacy; a = item slope (discrimination) parameter; b = item threshold (difficulty, location) parameter; S-χ2 = item-fit statistic; p = p value associated with item-fit statistic. Values in parentheses are item parameter standard error estimates.

scale) suggest a three-category response scale system for the GSE is poten-
tially a stable alternative for the GSE scale. However, examination of the
ORFs plots will help to determine if this is indeed necessary.

LI
A second assumption of unidimensional IRT models is that of LI or condi-
tional independence, which is closely related to the assumption of unidimen-
sionality. LI is the assumption that the only influence on an individual’s item
response is that of the latent trait variable being measured and that no other
variable (e.g., other items on the GSE scale, reading ability, or another latent
trait variable) is influencing individual item responses. That is, for a given
adolescent with a known general perceived self-efficacy score, a response to
an item is independent from a response to any other item. Although LI is not
necessarily a concern in CTT nor detectable from a classical item analysis
revolving around Cronbach’s alpha, violating the LI assumption is a serious
issue for an IRT analysis because it can distort estimated item parameters
(e.g., slopes can become inflated and thresholds across items can become
more homogenous), item standard errors (e.g., standard errors can appear to
look smaller giving the impression of better item parameter estimates), IRT
scores and associated standard errors (e.g., standard errors around scores may
be smaller, item and/or scale information functions may be inflated, which
may lead to a false impression of score precision), and model-fit statistics (De
Ayala, 2009; Edelen & Reeve, 2007). In essence, local dependency (LD) can
result in a score different from the construct being measured. LD can occur
for numerous reasons, such as when the wording of two or more item stems is so
similar, or when synonyms are used across items, that adolescents cannot
differentiate between the items and thus select the same response category
across items (see De Ayala, 2009; Reeve et al., 2007).
To assess the tenability of LI, the (approximately) standardized LD χ2 sta-
tistic (Chen & Thissen, 1997) for each item pair was examined. LD statistics
greater than |10| were considered large and reflecting likely LD issues or
leftover residual variance that is not accounted for by the unidimensional IRT
model, LD statistics between |5| and |10| were considered moderate and ques-
tionable LD, and LD statistics less than |5| were considered small and incon-
sequential (see footnote in Cai, du Toit, & Thissen, 2011b, p. 77). However,
sparseness in the observed table for an item pair can lead to a possible LD
issue (Cai et al., 2011b, p. 77). Thus, item content and a cross tabulation of
item pairs displaying potential LD should be inspected. If an item identified
as having LD is indeed a threat to the assumption of LI, it is expected that
parameter estimates (i.e., slopes and/or thresholds) and item-fit statistics
without a particular item or item pairs would have meaningful differences
from an IRT analysis including these suspect items (Edelen & Reeve, 2007).
Thus, an inspection of the item parameter estimates, LD statistics, and related
IRT model-data fit statistics with and without suspect items should be con-
ducted. Doing so will help to determine the degree to which LD is an issue
and if removal of such items is necessary or if small amounts of LD can be
resolved by combining two LD items to form one composite item and
re-estimating the items. IRTPRO provides LD statistics by default during model
estimation.
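For readers who want a quick, scriptable check of local dependence outside IRTPRO, a different but related index is Yen's Q3, the correlation among item residuals after conditioning on latent trait estimates. The sketch below is illustrative only; it is not the standardized LD χ2 statistic described above, and the item parameters, θ estimates, and data are hypothetical.

import math
import numpy as np

def grm_probs(theta, a, thresholds):
    cums = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cums[k] - cums[k + 1] for k in range(len(thresholds) + 1)]

def expected_score(theta, a, thresholds):
    # Model-implied item score: sum over categories of k * P(category k | theta).
    return sum(k * p for k, p in enumerate(grm_probs(theta, a, thresholds)))

def q3_matrix(responses, thetas, item_params):
    # Correlations among item residuals; values far from zero flag possible LD.
    expected = np.array([[expected_score(t, a, bs) for (a, bs) in item_params] for t in thetas])
    residuals = responses - expected
    return np.corrcoef(residuals, rowvar=False)

rng = np.random.default_rng(0)
responses = rng.integers(0, 3, size=(700, 10)).astype(float)  # hypothetical 3-category data
thetas = rng.standard_normal(700)                             # stand-in for EAP theta estimates
item_params = [(1.2, [-1.0, 1.5])] * 10                       # stand-in GR parameter estimates
print(np.round(q3_matrix(responses, thetas, item_params), 2))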
LD statistics for the GR and reduced GR models fit to the 10-item four-
category and three-category GSE scale are summarized in Table 3. Overall,
LD statistics for the GR model fit to the 10-item four-category GSE scale
show most LD statistics were relatively small, but the largest LD statistics
tended to be paired with Item 1. For instance, the large LD statistics that stand
out the most for the 10-item four-category GSE scale are the ones between
item pairs (1 & 2), (1 & 5), and (1 & 6). A similar trend was observed for LD
statistics for the reduced GR model fit to the 10-item four-category GSE
scale, but item pairs (1 & 4) and (4 & 7) were also identified as large.
However, for the GR and reduced GR models fit to the 10-item three-
category GSE scales, fewer large LD pairs were identified, but the larger LD
pairs or rankings of LD pairs matched up with those observed with the GR
model. A closer inspection of the moderate and large LD statistics from mod-
els fit to the 10-item four-category GSE scale did show sparseness in the
observed tables for most item pairs; however, sparseness was not apparent for
moderate and large LD statistics from models fit to the 10-item three-
category GSE scale. In general, the LD statistics suggest the models fit less
than perfectly and that LD may be present, especially for some items paired
with Item 1.
To determine if the LI assumption was violated or problematic, an item
calibration was conducted without a suspect item (i.e., Item 1). The sensitiv-
ity calibration showed the slopes and threshold estimates for items when Item
1 was dropped from the analysis were relatively similar to the slope and
threshold estimates when including all 10 items. The decision that Item 1 was
the main source of problematic LI violations was buttressed by leaving Item
1 in the model while simultaneously dropping another possible problematic
item, Item 2, and performing a sensitivity calibration. This sensitivity calibra-
tion resulted such that Item 1 tended to have the larger LD statistic with other
items. However, calibrations without Item 1 showed LD statistics to be typi-
cal of small and moderate LD statistics. At this juncture, Item 1 could be
removed if it is believed to be an issue for LI and the remaining items recali-
brated. However, it is important to recall that all residual correlations


Table 3. Standardized LD χ2 Statistics for GR and Reduced GR Models Fit to 10-Item Four-Category and Three-Category GSE
Scale.

Item 1 2 3 4 5 6 7 8 9 10
1 11.3 (4.4) 7.9 (10.9) 16.7 (14.4) 14.4 (15.5) 14.3 (9.8) 1.6 (0.9) 6.2 (5.9) 2.7 (3.3) 4.6 (3.4)
2 11.8 (4.5) 2.4 (3.2) 3.8 (6.0) 5.8 (6.5) 2.2 (0.3) 1.3 (2.1) 1.1 (0.4) 0.4 (2.3) 0.3 (0.5)
3 6.6 (9.2) 2.2 (3.1) 0.2 (0.5) 0.5 (0.2) 0.7 (2.0) 0.3 (0.4) 5.7 (3.6) 0.8 (0.3) 2.1 (2.3)
4 9.9 (9.9) 3.1 (5.0) 0.1 (0.1) 7.2 (5.3) 1.3 (0.5) 10.0 (2.8) 0.1 (0.1) 3.2 (4.1) 1.3 (1.7)
5 10.0 (11.2) 4.9 (5.3) 0.1 (0.6) 8.7 (7.7) 3.0 (0.8) 3.0 (2.8) 2.4 (5.6) 1.3 (2.7) 2.5 (1.5)
6 12.5 (9.5) 2.3 (0.2) 0.6 (1.8) 1.0 (0.3) 2.5 (0.4) 5.4 (1.3) 2.7 (2.7) 4.2 (4.6) 4.1 (5.6)
7 1.6 (1.5) 1.5 (2.2) 0.4 (0.4) 6.4 (3.1) 2.7 (2.6) 5.1 (1.1) 6.7 (4.3) 3.8 (5.3) 2.1 (4.1)
8 6.2 (5.6) 0.2 (-0.2) 3.2 (4.2) 5.9 (0.0) 3.0 (6.6) 3.1 (4.0) 5.6 (0.8) 2.6 (4.1) 5.2 (7.5)
9 2.7 (3.1) 0.8 (2.6) 0.6 (0.0) 3.8 (5.1) 1.7 (3.0) 4.4 (4.7) 4.3 (5.8) 1.3 (2.7) 1.0 (0.9)
10 3.5 (2.6) 0.2 (0.8) 2.2 (2.3) 0.6 (0.6) 2.2 (1.4) 4.2 (5.7) 2.6 (0.2) 2.6 (4.0) 0.8 (0.7)

Note. LD = local dependency; GR = graded response; GSE = general self-efficacy. Lower left diagonal represents standardized LD χ2 statistics for the GR model. Upper right diagonal represents standardized LD χ2 statistics for the reduced GR model. Values not in parentheses are standardized LD χ2 statistics for the 10-item four-category GSE scale. Values in parentheses are standardized LD χ2 statistics for the 10-item three-category GSE scale. Absolute values are reported. Bolded values represent large LD statistics (i.e., |LD| > 10).


(i.e., another non-IRT method that can be used for detecting LD; see
Appropriate Dimensionality results) from the unidimensional CFAs also
showed little excess dependency or covariation remaining among items. That
is, residual correlations were ≤ |.12|, which is below the |.20| cutoff suggested
by Morizot, Ainsworth, and Reise (2007), further evidencing that LI is tena-
ble. Based on these results, the assumption of LI was deemed tenable, but
Item 1 should be considered for removal.

Functional Form and Model-Data Fit


Unidimensional IRT models have a third assumption, known as functional
form, which states that the observed or empirical data follows the function
specified by the IRT model (De Ayala, 2009). In the context of the GR
model, functional form implies that all threshold parameters are ordered and
that there is a common slope within each item, although not necessarily
across items. This assumption is rarely perfect in practice, but we attempt to
model the empirical data as closely as possible by assessing model-data fit
(De Ayala, 2009). Essentially, a comparison is made between the observed
or empirical data and that predicted by the IRT model. Numerous statistical
(or absolute) tests and graphical, relative, or heuristic methods have been
developed for examining model-data fit at the item level. Once model-data
fit at the item level has been found to be reasonable, complimentary model-
data statistics can be used to compare the relative fit of IRT models and to
select the optimal IRT model for this data (De Ayala, 2009). Importantly,
more than one method for goodness of fit should be used to judge the fit of
a model to data.
In addition to assessing model-data fit, it is important to check if the item
response category system is operating as expected for each item. This means
that each increasing category is more likely to be selected than previous
response categories as one moves along the latent trait continuum. To assess
whether category usage is occurring as expected (or not) by the GR and
reduced GR models fit to the 10-item four-category GSE scale, each item’s
ORF plots were inspected. It is important to point out that when reviewing
item-level results from a GR model the thresholds alone cannot be used as an
indicator of which categories, if any, are not being used as expected by this
model: thresholds between adjacent response categories are assumed sequen-
tial (in order), which means they will always appear in order (De Ayala,
2009), such as those in Table 2.
IRTPRO was used to generate the ORF plots, an easily accessible feature
of this software once a set of items has been calibrated. Figure 2 provides the
GR model ORFs plot for Item 1, which is typical of the other ORFs plots



Figure 4. ORF for Item 1 on the 10-item three-category GSE scale fit by the GR
model.
Note. The horizontal axis represents the level of the latent trait (which has a standard normal distribution by construction), and the vertical axis represents the probability of choosing a given response category at a specified latent trait level. ORF = option response function; GSE = general self-efficacy; GR = graded response.

within the 10-item four-category GSE scale. As can be seen, the predicted
ORFs plot shows that Item 1 is behaving primarily as a three-category item,
with a category score of 0 (not at all true) being less likely to be selected than
any other category for almost the entire general perceived self-efficacy con-
tinuum (i.e., between −3 and 3). Based on this observation, the low response
frequency for the not at all true category (see Table 1), and similar values for
the threshold parameters across the two versions of the GSE scale (see the left
half of Table 2), the four-category GSE scale was collapsed into a three-
category GSE scale. Accordingly, plots of the ORFs were reexamined for the
GR and reduced GR models. Figure 4 provides the GR model ORFs plot for
Item 1 based on the 10-item three-category GSE scale, which is similar to the
ORFs plots observed for each item on this scale and for the reduced GR model fit
to the 10-item three-category GSE scale. Based on these results, the optimal
number of response categories for items on the GSE scale was viewed as 3.
Therefore, all remaining IRT analyses are based on the 10-item three-
category GSE scale.


Assessing IRT Model-Data Fit


Item level fit. To assess the absolute fit of the model to each item, a general-
ization of Orlando and Thissen’s (2000, 2003) S-χ2 item-fit statistic for poly-
tomous data was examined: This item-fit statistic is provided by default in
IRTPRO. For each item, S-χ2 assesses the degree of similarity between
model-predicted and empirical (observed) response frequencies by item
response category. A statistically significant value indicates the model does
not fit a given item. Given that several statistical tests were being conducted,
that larger samples lead to a greater likelihood of statistically significant results,
and that the scale is short, item-fit statistics were evaluated at the 1%
significance level (Stone & Zhang, 2003). The S-χ2 item fit statistics (see the
right half and last two columns of Table 2) from the calibration results indi-
cate a satisfactory fit in that only 2 of the 10 items are not well represented by
the estimated item parameters for both the GR model and reduced GR model
(i.e., p < .01 for Items 1 and 5 for the GR model, while Items 1 and 6 have p
< .01 for the reduced GR model). Both models tended to have the poorest fit
to Item 1 based on it having the largest S-χ2 item fit statistic and its associated
p value was < .001. If the fit of the model to an item is not deemed acceptable
by a researcher, then the offending item(s) should be removed, the IRT item
calibration performed again, and tests of item level fit reassessed. If we chose
to use the absolute definition of model fit (i.e., all items must have adequate
fit), Items 1 and 5 would be removed from the GSE scale as estimated by the
GR model, and Items 1 and 6 would be removed from the GSE scale as esti-
mated by the reduced GR model. Interestingly, if only Item 1 is removed
from the item calibrations we find satisfactory fit (i.e., all p values > .01) for
all nine items based on the GR model, but for the reduced GR model satisfac-
tory fit is only observed for seven of the nine remaining items (see columns
labeled S-χ2 in Table 4). Based on just the item level fit test results, Item 1
was removed, and the 9-item version of the GSE scale was concluded to be
modeled better by the GR model than the reduced GR model.

Model level fit (Comparison). After model-data fit at the item level has been
found to be reasonable, complementary model-data fit statistics designed to
assess relative fit at the model level can now be used. To compare the relative
fit of the models to the sample data, multiple methods were used as described
in De Ayala (2009): The change in the −2 log likelihood (−2LL or Deviance)
from two hierarchically nested models (also known as a likelihood ratio test;
LRT) and its complement, the relative change statistic (R∆2; Haberman, 1978),
the Bayesian information criterion (BIC), the Akaike information criterion
(AIC), and the M2 limited information goodness-of-fit statistic and its


Table 4. Final GR Model and Reduced GR Model Item Parameter Estimates and Item-Fit Statistics for 9-Item Three-Category GSE Scale.

GR model 9-item three-category GSE scale Reduced GR model 9-item three-category GSE scale

Item a b1 b2 S-χ2 p a b1 b2 S-χ2 p


2 1.18 (.12) −1.69 (.16) 1.32 (.13) 22.47 .4934 1.22 (.06) −1.65 (.11) 1.29 (.10) 22.74 .4774
3 1.26 (.13) −1.05 (.11) 1.62 (.15) 28.12 .2105 1.22 (.06) −1.07 (.09) 1.65 (.12) 27.77 .2239
4 0.94 (.11) −0.75 (.12) 2.56 (.27) 37.51 .0515 1.22 (.06) −0.62 (.08) 2.11 (.14) 48.10 .0016
5 1.07 (.12) −1.47 (.15) 1.76 (.17) 40.46 .0190 1.22 (.06) −1.34 (.10) 1.61 (.11) 46.65 .0037
6 1.24 (.13) −0.40 (.08) 2.16 (.19) 38.68 .0294 1.22 (.06) −0.40 (.08) 2.19 (.14) 38.09 .0339
7 1.18 (.12) −1.49 (.15) 1.37 (.13) 35.28 .0641 1.22 (.06) −1.46 (.10) 1.34 (.10) 36.29 .0513
8 1.61 (.16) −0.85 (.09) 1.48 (.11) 24.27 .2793 1.22 (.06) −1.01 (.09) 1.75 (.12) 31.06 .1210
9 1.26 (.13) −1.17 (.12) 1.87 (.16) 29.54 .1298 1.22 (.06) −1.20 (.09) 1.92 (.13) 28.72 .1526
10 1.31 (.13) −0.78 (.09) 1.64 (.14) 26.08 .2474 1.22 (.06) −0.82 (.08) 1.71 (.12) 27.69 .2270

Note. GR = graded response; GSE = general self-efficacy; a = item slope (discrimination) parameter; b = item threshold (difficulty, location) param-
eter; S-χ2 = item-fit statistic; p = p value associated with item-fit statistic. Values in parentheses are item parameter standard error estimates.

The LRT is χ2 distributed with df equal to the difference in the number of
estimated item parameters between the Reduced model and the Full model. A
nonsignificant χ2 change statistic (χ∆²) would suggest the additional com-
plexity of the Full model (i.e., the additional estimation of a unique slope
parameter for each item) is not necessary to improve model-data fit, while a
statistically significant χ2 statistic would suggest it is necessary (De Ayala,
2009). The relative change statistic R∆² measures the relative change (i.e., the
percent improvement) between two hierarchically nested models and is
calculated as R∆² = (−2LLReduced model − (−2LLFull model)) / −2LLReduced model.
BIC and AIC are relative information criteria
statistics, where smaller values indicate a better fitting model. The M2 statis-
tic measures how well a model fits the sample data, which is based on one-
and two-way marginal tables (Cai et al., 2006; Maydeu-Olivares & Joe, 2005,
2006). Similar to other goodness-of-fit statistics, M2 assumes perfect model-
data fit in the population. Although a nonsignificant p value is desired with
the M2 statistic, this test can be overly sensitive to small model-data misfit,
which can lead to artificially small p values. Therefore, the RMSEA is
reported along with the M2 statistic. RMSEA ranges from 0 to 1 with values
close to zero indicating adequate model-data fit (e.g., RMSEA ≅ .05), which
is similar to how it is defined in structural equation modeling (see Maydeu-
Olivares, Cai, & Hernández, 2011). In general, smaller M2 values indicate
better model-data fit. The M2 statistic is a relatively new statistic that is not
incorporated into most IRT programs (De Ayala, 2009); however, IRTPRO
provides this statistic along with the RMSEA statistic for some IRT models
on request.
Results from the LRT suggest the additional complexity of the GR model
(i.e., allowing slopes to vary across items) is necessary to improve model-data
fit over and above that obtained with the reduced GR model (i.e., estimating a
common slope across items), χ∆²(27 − 19 = 8) = 11,142.98 − 11,127.2 = 15.78,
p = .046. The relative change between these models was R∆² = 15.78 / 11,142.98
= .0014, which means that the GR model improves our explanation of the item
responses over that of the reduced GR model by only 0.14%. Although the LRT
suggests statistically significant improvement in model-data fit in favor of the
GR model, the R∆² value suggests that this is not a meaningful improvement
over the reduced GR model. A comparison of the BIC and AIC statistics demon-
strates the lack of superiority of the GR model (BIC = 11,304.07, AIC = 11,181.20)
relative to the reduced GR model (BIC = 11,267.14, AIC = 11,180.98), based on the
smaller AIC and BIC statistics for the reduced GR model. Moreover, both the
GR model and reduced GR model demonstrated similar and adequate model-
data fit, M2(135) = 331.98, p < .001, RMSEA = .05, and M2(143) = 345.63,
p < .001, RMSEA = .04, respectively, as evidenced by the similar M2 and
RMSEA statistics. Besides the LRT, model-level fit comparisons showed
evidence for similar fit between the two models or slightly in favor of the
reduced GR model, which would suggest the reduced GR model is the model
of choice. However, item-level fit results indicated that the nine items are better
represented by the GR model because it had adequate fit to each item (i.e., all
nine items p > .01), but only seven items were well represented by the reduced
GR model (see item-fit statistics in Table 4). Given these results, an argument
could be made for either model to represent the item responses. However,
given the flexibility of the GR model and for pedagogical reasons, the GR
model is selected and emphasized from here on.
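
To make the arithmetic behind these comparisons explicit, the following sketch reproduces the LRT, R∆², AIC, and BIC calculations from the deviances (−2LL) and parameter counts reported above; it is only a worked check of the formulas, not output from IRTPRO, and small discrepancies from the reported AIC and BIC values can arise because the deviances are rounded here.

```python
import math
from scipy.stats import chi2

# Deviances (-2LL) and estimated parameter counts for the two nested models.
dev_reduced, k_reduced = 11_142.98, 19   # reduced GR model (common slope)
dev_full, k_full = 11_127.2, 27          # GR model (unique slopes)
n = 700                                  # number of respondents

# Likelihood ratio test: change in deviance with df = change in parameters.
chi_sq_delta = dev_reduced - dev_full            # 15.78
df_delta = k_full - k_reduced                    # 8
p_value = chi2.sf(chi_sq_delta, df_delta)        # about .046

# Relative change statistic (Haberman, 1978): percent improvement of the GR model.
r_sq_delta = chi_sq_delta / dev_reduced          # about .0014 (0.14%)

# Information criteria: smaller values indicate the better-fitting model.
aic_full = dev_full + 2 * k_full                     # about 11,181.2
aic_reduced = dev_reduced + 2 * k_reduced            # about 11,181.0
bic_full = dev_full + k_full * math.log(n)           # about 11,304.1
bic_reduced = dev_reduced + k_reduced * math.log(n)  # about 11,267.4

print(f"LRT: chi-square({df_delta}) = {chi_sq_delta:.2f}, p = {p_value:.3f}")
print(f"Relative change = {r_sq_delta:.4f}")
```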

Step 5: Evaluating and Interpreting Results


Item Properties, Information Functions, and IRT Score Estimates
Given that the model assumptions are tenable, a description of the item prop-
erties, including the amount of psychometric information (precision) avail-
able, can be made for each item, subset of items, or the entire scale (Edelen
& Reeve, 2007). The GR model item parameter estimates for the 9-item ver-
sion of the GSE scale are provided in Table 4 (see the left side of Table 4).
Slope estimates range from 0.94 (Item 4) to 1.61 (Item 8) and threshold
parameters range from −1.69 (Item 2, b1) to 2.56 (Item 4, b2). In general, item
slope values suggest most items have a similar relationship with general per-
ceived self-efficacy, but the large slope for Item 8 indicates that it has the
strongest relationship with the latent trait and measures general perceived
self-efficacy more precisely than other items. Moreover, the majority of the
first and second thresholds for items are around an underlying general per-
ceived self-efficacy level of −1 and 1.5, respectively. This information
implies that the GSE scale is most useful in distinguishing between adoles-
cents around these latent trait levels, but less useful for those with more
extreme levels (i.e., beyond 2) of general perceived self-efficacy.
The item parameters (slopes and thresholds) in Table 4 provide a first
glance at which areas of the latent trait continuum can be measured with the
most precision; however, IRT also offers a variety of tools for examining the
items and the scale as a whole that provide a deeper understanding of the scale
and the latent trait being assessed. Because the amount of precision is not assumed to be constant
across the continuum, no single number in IRT is used to summarize the
precision of the entire set of scores from a scale (Thissen & Wainer, 2001,
p. 117). Instead, the amount of precision can be gathered for a particular location or across a broad range on the latent trait continuum, which is usually reflected by the variable information.



Figure 5. Graded response model item information functions for nine items from
the GSE scale.
Note. Each function represents the amount of information (precision) each item provides over
the θ range. GSE = general self-efficacy.

Information is our knowledge (certainty) about a particular location on the continuum. Importantly, each item
has its own item information function (IIF) that is shaped by its slope and
thresholds. IIFs are used to identify how much empirical information each
item is adding to the entire scale and where that information is occurring
along the continuum. Importantly, IIFs can be used to identify redundant
items that occur by observing IIFs that are (nearly) identical. IIFs can also be
used to identify items providing less useful information to the total scale by
observing IIFs that are well below or lower than all other IIFs. Thus, a deci-
sion to remove an item can be made on information alone or usually in con-
junction with item content. Or, IIFs can be used to develop brief scales to
reduce respondent burden, parallel scales for use in longitudinal or pre−post
designs, or a tailored scale. IIFs are readily available in IRTPRO once a set of
items have been calibrated.
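
To make concrete how an IIF follows from an item's slope and thresholds, the sketch below computes Samejima's item information for a single GR model item on the logistic metric. It is illustrative only; the Item 8 values plugged in at the end are taken from Table 4, and the sketch assumes the slopes are reported on the logistic metric.

```python
import numpy as np

def grm_item_information(theta, a, thresholds):
    """Item information function for Samejima's graded response model.

    theta: array of latent trait values; a: slope; thresholds: ordered b values.
    Uses the logistic form P*_k(theta) = 1 / (1 + exp(-a * (theta - b_k))).
    """
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(thresholds, dtype=float)
    # Cumulative ("operating characteristic") curves, padded with 1 and 0.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    p_star = np.column_stack([np.ones_like(theta), p_star, np.zeros_like(theta)])
    d_star = a * p_star * (1.0 - p_star)     # derivatives of the cumulative curves
    prob = p_star[:, :-1] - p_star[:, 1:]    # category response probabilities
    dprob = d_star[:, :-1] - d_star[:, 1:]   # their derivatives
    return np.sum(dprob ** 2 / prob, axis=1) # sum over categories of (P')^2 / P

# Example: Item 8 from Table 4 (a = 1.61, b1 = -0.85, b2 = 1.48).
theta_grid = np.linspace(-3, 3, 121)
iif_item8 = grm_item_information(theta_grid, 1.61, [-0.85, 1.48])
```

Evaluating this function for every item on a common θ grid and overlaying the resulting curves yields a plot in the spirit of Figure 5.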
Figure 5 shows IIFs for nine items from the GSE scale. The maximum
height of an IIF is located at an item's threshold(s) and depends on the item's
slope parameter. For instance, the IIF for Item 8 stands out the most
from all other items because it provides the most information (pre-
cision) around θ = −0.85 and θ = 1.48, which are the item’s respective thresh-
olds (b1 and b2). The item providing the least amount of information across
the continuum is Item 4 as its slope value was the lowest relative to all other
items on the scale. This item could be removed if it was deemed that its con-
tent was already redundant with another item or if a shorter form was desired.
Two items that appear to provide nearly identical information across the con-
tinuum are Items 2 and 7 because their respective IIFs are nearly identical,
which suggests that only one of these items may be necessary.
To understand how the GSE scale works as a whole, the IIFs can be
summed at each point along the continuum to create a total information function (TIF).
Thus, the quality of items (i.e., the amount of information each item pro-
vides) and the number of items determine the TIF. This means that each item
contributes unique information to the TIF, independent of the other items.
This is another reason the assumption of LI is important.
The TIF provides useful details about variable scale information as a function
of location on the trait continuum; furthermore, the TIF can be used to
identify gaps in the continuum. Although the metric of information is not
directly interpretable on its own (Edwards, 2009), a useful metric often used
to capture the amount of error around an IRT score is the expected standard
error of estimate (SEE; SEE ≅ 1/√information). The expected SEE measures
the amount of uncertainty about a person’s IRT score (De Ayala, 2009, p. 27).
The SEE can also be plotted as a function to gauge the expected amount of
error along the continuum. So, if the goal of our GSE scale was to measure a
broad range of the latent trait continuum, say between −3 and 3, then an ideal
TIF and corresponding SEE function would be uniform across this range. For
instance, if information is a constant 16 across a broad range of the contin-
uum, then the expected SEE for this range is 1/√16 = .25. However, if the
goal of our GSE scale was to measure a specific range or point on the con-
tinuum, such as a cutpoint used to determine whether an individual possesses
an adequate amount (or not) of a given latent trait (e.g., θ = 0.5), then
items can be selected that best match this location on the continuum. That is,
the TIF would be more peaked (and corresponding SEE would be smaller) at
the cutpoint. In essence, the TIF and SEE function can be used as the blue-
print for designing a scale based on a pre-specified amount of information or
maximum amount of expected error needed around a score or range of scores.
If, however, it was necessary to report a single numeric value that sum-
marizes the precision for the entire range or region using IRT, then marginal
reliability (Green, Bock, Humphreys, Linn, & Reckase, 1984) can be esti-
mated (marginal reliability ≅ 1 − SEE² or 1 − 1/information). Marginal reli-
ability is similar to traditional reliability and in IRTPRO is an estimate based
on the total test information function. Using our example from earlier, if
information is a constant 16 and SEE is .25 across a broad range of the
continuum, then the estimated marginal reliability for this range is 1 − .25² =
.9375. However, the marginal reliability value provided by IRTPRO is only
useful if the TIF or SEE function is uniform across the entire latent trait
continuum; otherwise, it can over- or under-estimate precision along the
continuum.
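
The conversions among information, SEE, and marginal reliability described above are simple enough to verify directly; the following sketch does so using the worked numbers from the text.

```python
import math

def see_from_information(info):
    return 1.0 / math.sqrt(info)          # SEE ~= 1 / sqrt(information)

def marginal_reliability_from_information(info):
    return 1.0 - 1.0 / info               # ~= 1 - SEE^2

def information_from_reliability(rel):
    return 1.0 / (1.0 - rel)              # invert the relation above

# Worked numbers from the text: information of 16 across a broad range.
print(see_from_information(16))                   # 0.25
print(marginal_reliability_from_information(16))  # 0.9375

# Designing toward a target: a marginal reliability of .90 requires
print(information_from_reliability(0.90))         # information of 10
print(see_from_information(10))                   # SEE of about 0.316
```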
The TIF, SEE function, and marginal reliability estimate are readily avail-
able in IRTPRO once a set of items have been calibrated. In IRTPRO, the TIF
is the sum of all the IIFs + 1. The main reference for + 1 is Thissen and
Wainer (2001). The + 1 comes from the fact that a prior (assumed) distribu-
tion (e.g., standard normal distribution) is used for estimating latent trait
scores (θ), which provides information (L. Stam, personal communication,
May 14, 2013). If the + 1 is not used in the creation of the TIF, then the esti-
mated scores and corresponding standard errors do not accurately match up
with the scores and SEEs found within IRTPRO.
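
A minimal sketch of how a TIF and expected SEE function can be assembled from a set of IIFs is given below. The + 1 for the scoring prior follows the IRTPRO convention just described; the placeholder IIF values are invented purely to make the example runnable and are not the estimates for the GSE items.

```python
import numpy as np

def total_information(iifs, prior_information=1.0):
    """Sum the item information functions and add the information contributed
    by the scoring prior (the + 1 described above for IRTPRO)."""
    return np.sum(np.asarray(iifs), axis=0) + prior_information

def expected_see(tif):
    """Expected standard error of estimate: SEE(theta) ~= 1 / sqrt(information)."""
    return 1.0 / np.sqrt(tif)

# Placeholder example: nine items whose IIFs sum to about 3 on a theta grid,
# so the TIF is about 4 and the SEE is about 0.5, as in Figure 6.
theta_grid = np.linspace(-3, 3, 121)
iifs = [np.full_like(theta_grid, 3.0 / 9)] * 9
tif = total_information(iifs)            # about 4 everywhere on this grid
see = expected_see(tif)                  # about 0.5 everywhere on this grid
marginal_reliability = 1.0 - 1.0 / tif   # about .75
```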
Figure 6 displays the TIF (solid line) for the 9-item GSE scale. The TIF
shows the GSE scale provides relatively uniform information (e.g., informa-
tion ≅ 4) for the range of −1.5 to 2.2, which has an associated marginal reli-
ability of about .75 (marginal reliability ≅ 1−1/4) and expected standard
error of estimate (dashed line in Figure 6) of about 0.5 (SEE ≅ 1/√4) around
scores in this range. The marginal reliability for response pattern scores pro-
vided by IRTPRO is .76, but this value is an estimate for the entire range of
the continuum. However, outside this range of −1.5 to 2.2 marginal reliabil-
ity decreases and SEE increases. Thus, if a more precise GSE scale was
desired within this range or across more of the continuum, then more items
need to be added to the scale to meet the desired information or level of
expected SEE. For instance, if we desired information to be 15, then the cor-
responding SEE ≅ .2582 and marginal reliability ≅ .93, but if we desired the
marginal reliability to be .90, then the corresponding SEE ≅ .3162 and infor-
mation ≅ 10.
To summarize, the 9-item GSE scale provides precise estimates of scores
(information ≅ 4, marginal reliability ≅ .75, expected SEE ≅ 0.5) for a broad
range of the continuum, −1.5 to 2.2. The maximum amount of information
(precision) was approximately 4.5 around latent trait estimates of −0.8 and
1.5. However, precision worsens and expected SEEs around score estimates
increase outside of this range, falling short of what would be desired. To improve score
estimates beyond this range, additional items need to be written that have
thresholds below −1.5 and above 2.
Once item parameters have been estimated, respondents' estimated scores on the latent trait continuum can be found. Conceptually, IRT score estimates are created by taking the observed response pattern for each respondent and weighting them by the item parameters (Edwards, 2009).



Figure 6. Total information function (solid line) and expected SEE function
(dashed line) for the GSE scale.
Note. The horizontal axis represents the latent variable, general perceived self-efficacy. The
left vertical axis represents the amount of information (precision) provided by the GSE scale
for a given score. The right axis represents the expected amount of standard error around a
score. More information (Information ≅ 1/SEE2) produces a more reliable score (marginal reli-
ability ≅ 1 − 1/information) and smaller expected SEE around a score (SEE ≅ 1/√information).
SEE = standard error of estimation; GSE = general self-efficacy.

By default, IRT score estimates are placed on a standard normal metric. The
IRT scores for the 700 respondents range from −2.3 to 2.7 (M = 0, SD =
0.87), which are on the same metric as the item thresholds. Given that some
respondents (about 6% of our 700) were observed to have IRT scores out-
side the range where the GSE scale provides the most precise estimates
(−1.5 to 2), uncertainty in estimates increases for these respondents. So, if
estimates are needed outside this range, then more items with thresholds
less than −1.5 or above 2 are needed to measure the more extreme levels of
the general perceived self-efficacy continuum, while revisions to existing
items need to be made or new items need to be added to improve the overall
precision of the GSE scale.
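
The idea of weighting a response pattern by the item parameters can be illustrated with expected a posteriori (EAP) scoring over a standard normal prior. The sketch below is one common way of producing such pattern-based scores and is not necessarily the exact routine IRTPRO uses; the item parameters are the GR model estimates from Table 4.

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Category response probabilities for one GR model item at each theta."""
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(thresholds, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    p_star = np.column_stack([np.ones_like(theta), p_star, np.zeros_like(theta)])
    return p_star[:, :-1] - p_star[:, 1:]

def eap_score(response_pattern, item_params, grid=np.linspace(-4, 4, 81)):
    """EAP estimate of theta (and its SD) for one respondent's response pattern.

    item_params: list of (a, thresholds) tuples; response_pattern: list of
    observed categories (0, 1, 2) in the same item order.
    """
    prior = np.exp(-0.5 * grid ** 2)             # standard normal prior (unnormalized)
    likelihood = np.ones_like(grid)
    for (a, b), u in zip(item_params, response_pattern):
        likelihood *= grm_category_probs(grid, a, b)[:, u]
    posterior = prior * likelihood
    posterior /= posterior.sum()
    eap = np.sum(grid * posterior)               # posterior mean = EAP score
    sd = np.sqrt(np.sum((grid - eap) ** 2 * posterior))
    return eap, sd

# Example: a respondent choosing the middle category (1) on all nine items,
# scored with the GR model estimates from Table 4 (Items 2 through 10).
params = [(1.18, [-1.69, 1.32]), (1.26, [-1.05, 1.62]), (0.94, [-0.75, 2.56]),
          (1.07, [-1.47, 1.76]), (1.24, [-0.40, 2.16]), (1.18, [-1.49, 1.37]),
          (1.61, [-0.85, 1.48]), (1.26, [-1.17, 1.87]), (1.31, [-0.78, 1.64])]
theta_hat, theta_se = eap_score([1] * 9, params)
```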


Conclusion
This article differs from other articles demonstrating how to conduct IRT
analyses in several ways. Similar to Edelen and Reeve (2007) and Edwards
(2009), this article provides details for replication by applied learners through
detailed description of the necessary steps for conducting IRT analyses.
Furthermore, this article offers a realistic depiction of the mental processes
involved with determining the best decision at pivotal points in the analysis
process, including notes on how these decisions may be modified dependent
on the particular data set and research purpose. This article also demonstrates
the utility of the (approximately) standardized LD χ2 statistic and the M2 sta-
tistic as provided in IRTPRO, but not readily available in most IRT programs
and not commonly discussed in pedagogical papers for IRT. Finally, this
article builds on the pedagogical papers written by Edelen and Reeve and
Edwards by providing and interpreting the IRT results as well as offering
access to the data and IRTPRO files used throughout the article. It is hoped
that this article facilitates the work of applied researchers wanting to conduct,
interpret, and report IRT analyses on a multi-item scale. Those who want to
deepen their understanding of IRT after reading this article may consider De
Ayala (2009), Embretson and Reise (2000), and Hambleton et al. (1991).

Declaration of Conflicting Interests


The author(s) declared no potential conflicts of interest with respect to the research,
authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publi-
cation of this article.

References
Baker, F. B. (2001). The basics of item response theory (2nd ed., ERIC Document
Reproduction Service No. ED 458 219). College Park, MD: Eric Clearing House
on Assessment and Evaluation.
Cai, L., du Toit, S. H. C., & Thissen, D. (2011a). IRTPRO: Flexible professional item
response theory modeling for patient reported outcomes (Version 2.1) [Computer
software]. Chicago, IL: Scientific Software International.
Cai, L., du Toit, S. H. C., & Thissen, D. (2011b). IRTPRO: User guide. Lincolnwood,
IL: Scientific Software International.
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-
information goodness-of-fit testing of item response theory models for sparse 2ᵖ
tables. British Journal of Mathematical and Statistical Psychology, 59, 173-194.
doi:10.1348/000711005X66419
Chen, W-H., & Thissen, D. (1997). Local dependence indices for item pairs using
item response theory. Journal of Educational and Behavioral Statistics, 22, 265-
289. doi:10.3102/10769986022003265
De Ayala, R. J. (2009). The theory and practice of item response theory. New York,
NY: Guilford.
Edelen, M. O., & Reeve, B. B. (2007). Applying item response theory (IRT) modeling to ques-
tionnaire development, evaluation, and refinement. Quality of Life Research, 16,
5-18. doi:10.1007/s11136-007-9198-0
Edwards, M. C. (2009). An introduction to item response theory using the need
for cognition scale. Social and Personality Compass, 3, 507-529. doi:10.1111/
j.1751-9004.2009.00194.x
Embretson, S. E. (1996). Item response theory models and spurious interaction effects
in factorial ANOVA designs. Applied Psychological Measurement, 20, 201-212.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists.
Mahwah, NJ: Lawrence Erlbaum.
Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984).
Technical guidelines for assessing computerized adaptive tests. Journal of
Educational Measurement, 32, 347-360.
Haberman, S. J. (1978). Analysis of qualitative data: Vol. 1: Introductory topics. New
York, NY: Academic Press.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item
response theory. Newbury Park, CA: Sage.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance struc-
ture analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling, 6, 1-55. doi:10.1080/10705519909540118
Kang, S-M., & Waller, N. G. (2005). Moderated multiple regression, spurious
interaction effects, and IRT. Applied Psychological Measurement, 29, 87-105.
doi:10.1177/0146621604272737
Lord, F. M. (1952). A theory of test scores (Psychometric Monograph, No. 7).
Richmond, VA: Psychometric Corporation.
Lord, F. M. (1980). Applications of item response theory to practical testing prob-
lems. Hillsdale, NJ: Lawrence Erlbaum.
Maydeu-Olivares, A., Cai, L., & Hernández, A. (2011). Comparing the fit of item
response theory and factor analysis models. Structural Equation Modeling, 18,
333-356.
Maydeu-Olivares, A., & Joe, H. (2005). Limited and full information estimation and
testing in 2ⁿ contingency tables: A unified framework. Journal of the American
Statistical Association, 100, 1009-1020.
Maydeu-Olivares, A., & Joe, H. (2006). Limited information goodness-of-fit testing in
multidimensional contingency tables. Psychometrika, 71, 713-732. doi:10.1007/
s11336-005-1295-9


Morizot, J., Ainsworth, A. T., & Reise, S. P. (2007). Toward modern psychometrics:
Application of item response theory models in personality research. In R. W.
Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in
personality (pp. 407-423). New York, NY: Guilford Press.
Muthén, L. K., & Muthén, B. O. (1998-2013). Mplus user’s guide (7th ed.). Los
Angeles, CA: Author.
Nering, M. L., & Ostini, R. (Eds.). (2010). Handbook of polytomous item response
theory models. New York, NY: Routledge.
Orlando, M., & Thissen, D. (2000). Likelihood-based item fit indices for dichotomous
item response theory models. Applied Psychological Measurement, 24, 50-64.
doi:10.1177/01466216000241003
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-χ2:
An item fit index for use with dichotomous item response theory models. Applied
Psychological Measurement, 27, 289-298. doi:10.1177/0146621603027004004
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY:
Springer.
Reeve, B. B., & Fayers, P. (2005). Applying item response theory modeling for evalu-
ating questionnaire item and scale properties. In P. Fayers & R. D. Hays (Eds.),
Assessing quality of life in clinical trials (2nd ed., pp. 55-73). New York, NY:
Oxford University Press.
Reeve, B. B., Hays, R. D., Bjorner, J. B., Cook, K. F., Crane, P. K., Teresi, J. A., . . .
Cella, D. (2007). Psychometric evaluation and calibration of health-related qual-
ity of life item banks. Medical Care, 45, S22-S31.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded
scores (Psychometric Monograph No. 17, Part 2). Richmond, VA: Psychometric
Society.
Schwarzer, R., & Jerusalem, M. (1995). Generalized self-efficacy scale. In J.
Weinman, S. Wright & M. Johnston (Eds.), Measures in health psychology: A
user’s portfolio. Causal and control beliefs (pp. 35-37). Windsor, UK: NFER-
NELSON.
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response
theory. Thousand Oaks, CA: Sage.
Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit of item response the-
ory models: A comparison of traditional and alternative procedures. Journal
of Educational Measurement, 40, 331-352. doi:10.1111/j.1745-3984.2003.
tb01150.x
Thissen, D. & Wainer, H. (Eds.). (2001). Test scoring. Mahwah, NJ: Lawrence
Erlbaum.
Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and
future directions. Psychological Methods, 12, 58-79.
Yu, C.-Y. (2002). Evaluating cutoff criteria of model fit indices for latent vari-
able models with binary and continuous outcomes (Doctoral dissertation). Los
Angeles, CA. Retrieved from http://statmodel2.com/download/Yudissertation.
pdf


Author Biography
Michael D. Toland received his PhD in August of 2008 from the Quantitative,
Qualitative, and Psychometric Methods program at the University of Nebraska at
Lincoln, where he was an advisee of Dr. Ralph De Ayala. Since August of 2008, he
has been an assistant professor in the Educational Psychology program in the
Department of Educational, School, and Counseling Psychology at the University of
Kentucky. His research interests include psychometrics, item response theory, factor
analysis, scale development, multilevel modeling, and the realization of modern mea-
surement and statistical methods in educational research.
