Intro Measurement PDF
Tony Albano
September 4, 2018
Contents
1. Introduction
   1.1. Proola
   1.2. R
   1.3. Intro stats
   1.4. Summary
3. Testing Applications
   3.1. Tests and decision making
   3.2. Test types and features
   3.3. Finding test information
   3.4. Summary
4. Test Development
   4.1. Validity and test purpose
   4.2. Learning objectives
   4.3. Features of cognitive items
   4.4. Cognitive item writing
   4.5. Personality
   4.6. Validity and test purpose
   4.7. Noncognitive test construction
   4.8. Features of personality items
   4.9. Personality item writing
   4.10. Summary
5. Reliability
   5.1. Consistency of measurement
   5.2. Classical test theory
   5.3. Reliability and unreliability
   5.4. Interrater reliability
   5.5. Generalizability theory
   5.6. Summary
9. Validity
   9.1. Overview of validity
   9.2. Content validity
   9.3. Criterion validity
   9.4. Construct validity
   9.5. Unified validity and threats
   9.6. Summary
Appendix
B. R Code
Preface
This book provides an introduction to the theory and application of mea-
surement in education and psychology. Topics include test development,
item writing, item analysis, reliability, dimensionality, and item response
theory. These topics come together in overviews of validity and, finally, test
evaluation.
Validity and test evaluation are based on both qualitative and quantitative
analysis of the properties of a measure. This book addresses the qualitative
side using a simple argument-based approach. The quantitative side is
addressed using descriptive and inferential statistical analyses, all of which
are presented and visualized within the statistical environment R (R Core
Team, 2017).
The intended audience for this book includes advanced undergraduate and
graduate students, practitioners, researchers, and educators. Knowledge of
R is not a prerequisite to using this book. However, familiarity with data
analysis and introductory statistics concepts, especially ones used in the
social sciences, is recommended.
Perspectives on testing
Testing has spiraled out of control, and the related costs are
unacceptably high and are taking their educational toll on
students, teachers, principals and schools.
abandoned altogether, one school district in this study could add from 20
to 40 minutes of instruction to each school day for most grades.”
The second problem is a reliance on tests which are not relevant enough
to what is being taught and learned in the classroom. There is a lack of
cohesion, a mismatch in content, where teachers, students, and tests are
not on the same page. As a result, the tests become frustrating, unpleasant,
and less meaningful to the people who matter most, those who take them
and those who administer or oversee them at the classroom level.
Both of these problems identified in our testing agenda have to do with what
is typically regarded as the pinnacle of quality testing, the all-encompassing,
all-powerful validity. Commenting over 50 years ago on its status for the
test developer, Ebel (1961) concluded with some religious imagery that
“validity has long been one of the major deities in the pantheon of the
psychometrician. It is universally praised, but the good works done in its
name are remarkably few” [p. 640]. Arguably, nothing is more important in
the testing process than validity, the extent to which test scores accurately
represent what they’re intended to measure. However, even today, it may
be that “relatively few testing programs give validation the attention it
deserves” (Brennan, 2013, p. 81).
Establishing validity evidence for a test has traditionally been the respon-
sibility of the test developer, who may only have limited interaction with
test users and secondhand familiarity with the issues they face. Yet, Kane
(2013) notes that from early on in the testing movement in the US, the ap-
propriateness of testing applications at the student level was a driving force
in bringing validity to prominence. He cites Kelley (1927), who observed:
It appears that Kelley (1927) recognized in the early 1900s the same issue
that we’re dealing with today. Tests are recommended or required for a
variety of applications, and we (even the most complacent and autocratic)
often can only wonder about their fitness. Consumers and other stakeholders
statistical concepts is helpful but not essential. This book also provides
extensive reproducible examples in R with a single, dedicated R package
encapsulating most of the needed functionality.
The book is divided into ten chapters. Most chapters involve analysis of
data in R, and item writing activities to be completed with the Proola web
app available at proola.org. An introduction to R and Proola is given in
Chapter 1, and examples are provided throughout the book.
Learning objectives
Each chapter in this book is structured around a set of learning objectives
appearing at the beginning of the corresponding chapter. These learning
objectives capture, on a superficial and summary level, all the material you’d
be expected to master in a course accompanying this book. Any assignments,
quizzes, or class discussions should be built around these learning objectives,
or a subset of them. The full list of objectives is in Appendix A.
Exercises
Each chapter ends with self assessments and other activities for stretching
and building understanding. These exercises include discussion questions
and analyses in R. As this is a book on assessment, you’ll also write your
own questions, with the opportunity to submit them to an online learning
community for feedback.
License
This book is published under a Creative Commons Attribution 4.0 Inter-
national License, which allows for adaptation and redistribution as long as
appropriate credit is given. The source code for the book is available at
https://github.com/talbano/intro-measurement. Comments and contribu-
tions are welcome.
1. Introduction
This chapter introduces two resources we’ll be using throughout the book,
the assessment development tools at Proola.org, and the statistical software
R. The chapter ends with an introductory statistics review. The learning
objectives focus on understanding statistical terms, and calculating and
interpreting some introductory statistics.
Learning objectives
1.1. Proola
• The site itself is still under development, with new features on the
way. Report bugs or send suggestions to [email protected].
• Everything you share is public. Don’t post copyrighted items, images,
or other information, and don’t share items you need to keep secure.
• You can learn a lot from the successes and failures of others. Search
the bank for items related to your content area, and then see where
people struggle and what they do well.
The Proola item writing process is broken down into four general steps (see
proola.org/learn_more).
First, create a new item by clicking “Create” in the top navbar. This takes
you to a series of text boxes and drop-downs where you’ll provide basic
information about your item. See Figure 1.1 for an example. Give the item
a short but descriptive title. Next, link the item to a learning objective.
Learning objectives can be imported from other users, or you can organize
your own objectives in Proola. Then, choose from a variety of intended
subject areas and grade levels. The item itself then consists of a stem,
where the question statement or prompt resides, and a response format,
whether selected-response or constructed-response. Any other background
information you’d like to share can go in the comments that become available
once your question is saved.
Second, get feedback from peers and assessment specialists. Once an item
is saved as a draft, anyone can view and comment on it. Wait patiently for
comments, or recruit peers in your grade level, subject area, department,
school, or district, and get them to sign up and leave feedback. Assessment
specialists include faculty and graduate students with training in assessment
1.2. R
with R. RStudio isn’t essential, but it gives you nice features for saving
your R code, organizing the output of your analyses, and managing
your add-on packages.
• As noted above, R is accessed via code, primarily in the form of
commands that you’ll type or paste in the R console. The R console
is simply the window where your R code goes, and where output will
appear. Note that RStudio will present you with multiple windows,
one of which will be the R console. That said, when instructions here
say to run code in R, this applies to R via RStudio as well.
• When you type or paste code directly into the R console, any previous
code you’ve already entered gets pushed up on the screen. In the
console, you can scroll through your old code by hitting the up arrow
once your cursor is in front of the R prompt >. In RStudio, you can
also view and scroll through a list of previous commands by holding
one of the control/command buttons on your keyboard while hitting
up.
• Only type directly in the console for simple and quick calculations
that you don’t care about forgetting. Otherwise, type all your code in
a text file that is separate from the console itself. In R and RStudio,
these text files are called R scripts. They let you save your code in a
separate document, so you always have a structured record of what
you’ve done. Remember that R scripts are only used to save code
and any comments annotating the code, not data or results.
1.2.1. Code
We’ll start our tour of R with a summary of how R code is used to interact
with R via the console. In this book, blocks of example R code are offset
from the main text as shown below. Comments within code blocks start
with a single hash #, the code itself has nothing consistent preceding it, and
output from my R console is preceded by a double hash ##. You can copy
and paste example code directly into your R console. Anything after the #
will be ignored.
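For example, a block in this style might look like the following, using the set of scores referred to later in this section (a sketch of the kind of block being described, not a verbatim listing):

# create a short vector of scores and assign it to x
x <- c(4, 8, 15, 16, 23, 42)
# calculate the mean and standard deviation of x
mean(x)
## [1] 18
sd(x)
## [1] 13.49074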
In the code above, we’re creating a short vector of scores in x and calculating
its mean and standard deviation. You should paste this code in the console
and verify that you get the same results. Note that code you enter at the
console is preceded by the R prompt >, whereas output printed in your
console is not.
The first thing to notice about the example code above is that the functions
that get things done in R have names, and to use them, we simply call them
by name with parentheses enclosing any required information or instructions
for how the functions should work. Whenever functions are discussed in
this book, you’ll recognize them by the parentheses. For example, we used
the function c() to combine a set of numbers into a “vector” of scores. The
information supplied to c() consisted of the scores themselves, separated
by commas. mean() and sd() are functions for obtaining the mean and
standard deviation of vectors of scores, like the ones in x.
The second thing to notice in the example above is that data and results
are saved to “objects” in R using the assignment operator <-. We used the
concatenate function to stick our numbers together in a set, c(4, 8, 15,
16, 23, 42), and then we assigned the result to have the name x. Objects
created in this way can be accessed later on by their assigned name, for
example, to find a mean or standard deviation. If we wanted to access it
later, we could also save the mean of x to a new object.
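For example, as a quick sketch, where mx is just an arbitrary name for the new object:

# save the mean of x to a new object
mx <- mean(x)
# access the saved result later by name
mx
## [1] 18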
Note that for larger objects, like data.frames with lots of rows and columns,
viewing all the data at once isn’t very helpful. In this book we’ll analyze
data from the Programme for International Student Assessment (PISA),
with dozens of variables measured for thousands of students. Printing all
the PISA data would flood the console with information. This brings us to
the third thing to notice about the example code above, that the console
isn’t the best place to view results. The console is functional and efficient,
but it isn’t pretty or well organized. Fortunately, R offers other mediums
besides fixed-width text for visualizing output, discussed below.
1.2.2. Packages
When you install R on your computer, you get a variety of functions and
example data sets by default as part of the base packages that come with
R. For example, mean() and print() come from the base R packages.
Commonly used procedures like simple linear regression, via lm(), and
t-testing, via t.test(), are also included in the base packages. Additional
functionality comes from add-on packages written and shared online by a
community of R enthusiasts.
The examples in this book rely on a few different R packages. The book
itself is compiled using the bookdown and knitr packages (Xie, 2015, 2016),
and some knitr code is shown for formatting output tables. The ggplot2
package (Wickham, 2009) is used for plotting. In this chapter we also
need the devtools package (Wickham and Chang, 2016), which allows us
to install R packages directly from the code sharing website github.com.
Finally, throughout the book we'll be using a package called epmr,
which contains functions and data used in many of the examples.
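A sketch of this setup is shown below. The GitHub repository path for epmr is an assumption here, and the installation lines are commented out because they only need to be run once.

# install devtools from CRAN, then epmr from GitHub (repository path assumed)
# install.packages("devtools")
# devtools::install_github("talbano/epmr")
# load the packages used in this chapter
library("epmr")
library("ggplot2")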
After a package is installed, and you’ve run library(), you have access to
the functionality of the package. Here, we’ve loaded epmr and ggplot2 so
we can type functions directly, without referencing the package name each time.
1.2.4. Data
Data can be entered directly into the console by using any of a variety
of functions for collecting information together into an R object. These
functions typically give the object certain properties, such as length, rows,
columns, variable names, and factor levels. In the code above, we created
x as a vector of quantitative scores using c(). You could think of x as
containing test scores for a sample of length(x) = 6 test takers, with no other
variables attached to those test takers.
We can create a factor by supplying a vector of categorical values, as quoted
text, to the factor() function.
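For example, something like the following would work, where the specific classroom labels and their number are made up for illustration:

# assign a vector of classroom labels, then convert it to a factor
classroom <- c("A", "B", "C", "A", "B", "C")
classroom <- factor(classroom)
# index the first and fourth values
classroom[c(1, 4)]
## [1] A A
## Levels: A B C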
In this code, the classroom object is first assigned a vector of letters, which
might represent labels for three different classrooms. The classroom object
is then converted to a factor, and assigned to an object of the same name,
which essentially overwrites the first assignment. Reusing object names like
this is usually not recommended. This is just to show that a name in R
can’t be assigned two separate objects at once.
The code above also demonstrates a simple form of indexing. Square brackets
are used after some R objects to select subsets of their elements, for example,
the first and fourth values in classroom. The vector c(1, 4) is used as an
indexing object. Take a minute to practice indexing with x and classroom.
Can you print the last three classrooms? Can you print them in reverse
order? Can you print the first score in x three times?
The data.frame() function combines multiple vectors into a set of variables
that can be viewed and accessed as a matrix with rows and columns.
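As a sketch, we can combine the scores in x with the classroom factor, using arbitrary column names:

# combine two vectors of the same length into a data.frame
mydata <- data.frame(scores = x, classroom = classroom)
mydata
##   scores classroom
## 1      4         A
## 2      8         B
## 3     15         C
## 4     16         A
## 5     23         B
## 6     42         C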
We can index both the columns and rows of a matrix. The indexing objects
we use must be separated by a comma. For example, mydata[1:3, 2] will
print the first three rows in the second column. mydata[6, 1:2] prints
both columns for the sixth row. mydata[, 2], with the rows index empty,
prints all rows for column two. Note that the comma is still needed, even if
the row or column index object is omitted. Also note that a colon : was
used here as a shortcut function to obtain sequences of numbers, where, for
example, 1:3 is equivalent to typing c(1, 2, 3).
Typically, we’ll import or load data into R, rather than enter it manually.
Importing is most commonly done using read.table() and read.csv().
Each one takes a “file path” as its first argument. See the help documentation
for instructions on their use and the types of files they require. Data and
any other objects in the console can also be saved directly to your computer
using save(). This creates an “rda” file, which can then be loaded back in
to R using load(). Finally, some data are already available in R and can
be loaded into our current session with data(). The PISA data, referenced
above, are loaded with data(PISA09). Make sure the epmr package has
been loaded with library() before you try to load the data.
The PISA09 object is a data.frame containing demographic, noncognitive,
and cognitive variables for a subset of questions and a subset of students
from the PISA 2009 study (nces.ed.gov/surveys/pisa). It is stored within
the epmr R package that accompanies this book. After loading the data, we
can print a few rows for a selection of variables, accessed via their sequential
column number, and then the first 10 values for the age variable.
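A block along these lines would do it, where the column numbers are chosen arbitrarily for illustration:

# load the epmr package, then the PISA data
library("epmr")
data(PISA09)
# print the first rows for a few variables, selected by column number
head(PISA09[, c(1, 6, 7)])
# print the first 10 values of the age variable
PISA09$age[1:10]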
With the basics of R under your belt, you’re now ready for a review of the
introductory statistics that are prerequisite for the analyses that come later
in this book.
1.3. Intro stats
Many people are skeptical of statistics, and for good reasons. We often
encounter statistics that contradict one another or that are appended to
fantastic claims about the effectiveness of a product or service. At their
heart, statistics are pretty innocent and shouldn't be blamed for all the
confusion and misleading claims. Statistics are just numbers designed to summarize
and capture the essential features of larger amounts of data or information.
We’ll begin this review with some basic statistical terms. First, a variable is
a set of values that can differ for different people. For example, we often
measure variables such as age and gender. These are italicized here to
denote them as statistical variables, as opposed to words. The term variable
is synonymous with quality, attribute, trait, or property. Constructs are
also variables. Really, a variable is anything assigned to people that can
potentially take on more than just a single constant value. As noted above,
variables in R can be contained within simple vectors, for example, x, or
they can be grouped together in a data.frame.
Generic variables will be labeled in this book using capital letters, usually
X and Y . Here, X might represent a generic test score, for example, the
total score across all the items in a test. It might also represent scores on
a single item. Both are considered variables. The definition of a generic
variable like X depends on the context in which it is defined.
Indices can also be used to denote generic variables that are part of some
sequence of variables. Most often this will be scores on items within a test,
where, for example, X1 is the first item, X2 is the second, and XJ is the last,
with J being the number of items in the test and Xj representing any given
item. Subscripts can also be used to index individual people on a single
variable. For example, test scores for a group of people could be denoted as
X1, X2, ..., XN, where N is the number of people and Xi represents the
score for a generic person. Combining people and items, Xij would be the
score for person i on item j.
The number of people is denoted by n or sometimes N. Typically, the
lowercase n represents sample size and the uppercase N represents the
population size; however, the two are often used interchangeably. Greek and
Latin letters are used for population and sample statistics, respectively. The
sample mean is denoted by m and the population mean by µ, the standard
deviation is s or σ, variance is s² or σ², and correlation is r or ρ. Note
that the mean and standard deviation are sometimes abbreviated as M
and SD. Note also that distinctions between sample and population values
often aren’t necessary, in which case the population terms are used. If a
distinction is necessary, it will be identified.
Finally, you may see named subscripts added to variable names and other
terms, for example, Mcontrol might denote the mean of a control group.
These subscripts depend on the situation and must be interpreted in context.
Descriptive and inferential are terms that refer to two general uses of
statistics. These uses differ based on whether or not an inference is made
from properties of a sample of data to parameters for an unknown population.
Descriptive statistics, or descriptives, are used simply to explore and describe
certain features of distributions. For example, the mean and variance are
statistics identifying the center of and variability in a distribution. These
and other statistics are used inferentially when an inference is made to a
population.
The dstudy() function in the epmr package returns some commonly used
univariate descriptive statistics, including the mean, median, standard
deviation (sd), skewness (skew), kurtosis (kurt), minimum (min), and
maximum (max). Each is discussed further below. The dstudy() function
also returns the number of people (n) and the number of people with missing
data (na).
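For example, descriptives for the age variable could be obtained with a call like this, shown here without its output:

# univariate descriptive statistics for age, from the epmr package
dstudy(PISA09$age)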
1.3.3. Distributions
Let's look at some frequencies within the PISA data. The first variable in
PISA09, named cnt, is classified as a factor in R, and it contains the country
in which each student was tested. We can check the class and then use
table() to get a frequency distribution of students by country.
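As a sketch, with plots in the ggplot2 style used for Figure 1.2, output omitted except for the class check:

# check the class of the country variable
class(PISA09$cnt)
## [1] "factor"
# frequency table of students by country
table(PISA09$cnt)
# bar plot of frequencies by country, and histogram of memor scores
ggplot(PISA09, aes(cnt)) + geom_bar(stat = "count")
ggplot(PISA09, aes(memor)) + geom_histogram()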
Figure 1.2.: A bar plot of student frequencies by country in the first plot,
and a histogram of memor scores in the second.
A box plot, also known as a box and whisker plot, displays a distribution
of scores using five summary statistics: the minimum, first quartile (below
which 25% of people score), second quartile (the median or 50th percentile),
third quartile (below which 75% of people score), and the maximum. In R,
these statistics can be obtained using summary(). The box in a box plot
captures the interquartile range, and the whiskers extend roughly to the
minimum and maximum values, with outliers sometimes noted using points.
Box plots are used here to compare the distributions of memor scores by
country.
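A sketch of both steps might look like the following; the book's own plotting code may differ in its details.

# five summary statistics, plus the mean, for the memor scale
summary(PISA09$memor)
# box plots of memor scores by country
ggplot(PISA09, aes(cnt, memor)) + geom_boxplot()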
Note that these plotting functions come from the ggplot2 package. As
alternatives, the base package contains barplot(), hist(), and boxplot(),
which will also get the job done. You don’t need to understand all the
details of the plotting syntax right now, but you should be able to modify
the example code given here to examine the distributions of other variables.
1.3.4. Central tendency
Central tendency provides statistics that describe the middle, most common,
or most normal value in a distribution. The mean, which is technically
only appropriate for interval or ratio scaled variables, is the score that is
closest to all other scores. The mean also represents the balancing point in
a distribution, so that the further a score is from the center, the more pull
it will have on the mean in a given direction. The mean for a variable X is
simply the sum of all the X scores divided by the sample size:
\mu = \frac{\sum_{i=1}^{n} X_i}{n}. \qquad (1.1)
The median is the middle score in a distribution, the score that divides a
distribution into two halves with the same number of people on either side.
The mode is simply the score or scores with the largest frequencies.
The mean is by far the most popular measure of central tendency, in part
because it forms the basis of many other statistics, including standard
deviation, variance, and correlation. As a result, the mean is also the basis
for regression and ANOVA.
In R, we can easily find the mean() and median() for a vector of scores.
There is no base function for the mode. Instead, we can examine a frequency
table to find the most common value(s).
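For example, with the age variable, where the na.rm argument is explained next:

# mean and median age, removing any missing data
mean(PISA09$age, na.rm = TRUE)
median(PISA09$age, na.rm = TRUE)
# no base function for the mode; check a frequency table instead
table(PISA09$age)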
When using certain functions in R, such as mean() and sd(), you’ll have
to tell R how to handle missing data. This can usually be done with the
argument na.rm. For example, with na.rm = TRUE, R will remove people
with missing data prior to running a calculation.
1.3.5. Variability
Variability describes how much scores are spread out or differ from one
another in a distribution. Some simple measures of variability are the
minimum and maximum, which together capture the range of scores for a
variable.
max(PISA09$age)
## [1] 16.33
range(PISA09$age)
## [1] 15.17 16.33
Variance and standard deviation are much more useful measures of variability
as they tell us how much scores vary. Both are defined based on variability
around the mean. As a result, they, like the mean, are technically only
appropriate with variables measured on interval and ratio scales, which are
discussed in Chapter 2.
The variance is the mean squared distance for each score from the mean,
or the sum of squared distances from the mean divided by the sample size
minus 1:
\sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n - 1}. \qquad (1.2)
The standard deviation is simply the square root of the variance:
\sigma = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n - 1}}. \qquad (1.3)
The standard deviation is interpreted as the average distance from the mean,
and it is expressed in the raw score metric, making it easier to interpret.
The standard deviation of age, sd(PISA09$age) = 0.292078, tells us that
students in PISA09 vary from the mean by about 0.3 years, or 3.5 months, on average.
1.3.6. Correlation
The covariance indexes how two variables X and Y vary together, based on
the average product of their deviations from their respective means:
\sigma_{XY} = \frac{\sum_{i=1}^{n} (X_i - \mu_X)(Y_i - \mu_Y)}{n - 1}. \qquad (1.4)
Note that we now have two different variables, X and Y , and our means
are labeled accordingly. Covariance is often denoted by σXY .
Like the variance, the covariance isn’t very useful in and of itself because it
is expressed in terms of products of scores, rather than in the more familiar
raw score metric. However, square rooting the covariance doesn’t help us
because there are two raw score metrics involved in the calculation. The
correlation solves this problem by removing, or dividing by, these metrics
entirely:
\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}. \qquad (1.5)
use = "complete")
## elab cstrat memor
## elab 1.0000000 0.5126801 0.3346299
## cstrat 0.5126801 1.0000000 0.4947891
## memor 0.3346299 0.4947891 1.0000000
Figure 1.3.: Scatter plot for the elab and cstrat PISA learning strategy scales
for the US.
1.3.7. Rescaling
A variable Y can be standardized, or converted to z scores, by subtracting
its mean from each score and then dividing by its standard deviation:
Y_z = \frac{Y - \mu_Y}{\sigma_Y}. \qquad (1.6)
Having subtracted the mean from each score, the mean of our new variable
Yz is 0, and having divided each score by the SD, the SD of our new variable
is 1. We can now multiply Yz by any constant s, and then add or subtract
another constant value m to obtain a linearly transformed variable with
mean m and SD equal to s. The new rescaled variable is labeled here as Yr :
Y_r = Y_z s + m. \qquad (1.7)
The linear transformation of any variable Y from its original metric, with
mean and SD of µY and σY , to a scale defined by a new mean and standard
deviation, is obtained via the combination of these two equations, as:
Y_r = \frac{s}{\sigma_Y} (Y - \mu_Y) + m. \qquad (1.8)
Scale transformations are often employed in testing for one of two reasons.
First, transformations can be used to express a variable in terms of a familiar
mean and SD. For example, IQ scores are traditionally expressed on a scale
with mean of 100 and SD of 15. In this case, Equation (1.8) is used with
m = 100 and s = 15. Another popular score scale is referred to as the
t-scale, with m = 50 and s = 10. Second, transformations can be used to
express a variable in terms of a new and unique metric. When the GRE
(www.ets.org/gre) was revised in 2011, a new score scale was created, in
part to discourage direct comparisons with the previous version of the exam.
The former quantitative and verbal reasoning GRE scales ranged from 200
to 800, and the revised versions now range from 130 to 170.
Learning check: Summarize, with examples, the two main reasons for
linearly transforming a score scale.
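As a sketch of the rescaling used for Figure 1.4, applying Equation (1.8) with m = 500 and s = 150, where zage and newage are simply labels for the standardized and rescaled versions of age:

# standardize age to a z score
zage <- (PISA09$age - mean(PISA09$age, na.rm = TRUE)) /
  sd(PISA09$age, na.rm = TRUE)
# rescale to a mean of 500 and SD of 150
newage <- zage * 150 + 500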
Figure 1.4.: Histograms of age in the first plot, and age rescaled to have a
mean of 500 and SD of 150 in the second.
1.4. Summary
This chapter introduced two resources that will be used throughout this
book, Proola and R. Some introductory statistics were also reviewed. These
will support our discussions of reliability, item analysis, item response theory,
dimensionality, and validity.
Before moving forward, you should get an account at Proola and browse
around the item bank. You should also install R and optionally RStudio
on your own computer, and make sure you are comfortable explaining fre-
quency distributions, central tendency, variability, correlation, and rescaling,
including what they are and how they’re used.
1.4.1. Exercises
1. Find descriptive statistics for the age variable using dstudy(). What
is the mean age? What is the largest age recorded in the PISA data
set? Confirm these results using the base functions mean() and max().
2. Report the frequency distribution as a table for PISA09$grade. What
proportion of students are in 10th grade?
3. Create a bar plot of age using the ggplot() code above for Figure
1.2, replacing cnt with age. Don’t forget the + geom_bar(stat =
"count"), which specifies that bars should represent frequencies.
About how many students are reported to be exactly 16 years old?
You can check this result using table(). Note that, although age
is a continuous variable, it was measured in PISA using a discrete
number of values.
4. Find the mean, median, and mode for the PISA attitude toward
school item st33q01 for Germany and Hong Kong. Interpret these
values for each country. Again, on this item, students rated their level
of agreement (1 = strongly disagree to 4 = strongly agree) with the
statement, “School has done little to prepare me for adult life when I
leave school.”
5. How do the countries JPN and RUS compare in terms of their mean
memor scores? Note that the international average across all countries
is 0.
6. Describe the variability of ratings on the PISA attitude toward school
item st33q01. Are students consistently reporting similar levels of
agreement, or do they vary in their ratings?
7. The variable PISA09$bookid identifies the test booklet or test form
a student was given in this subsample of the full PISA data set. How
many booklets were used, and how many students were given each
booklet?
8. Which of the learning strategy scales has the most normal score
distribution? What information supports your choice? Report your
results.
9. PISA09$r414q02s and PISA09$r414q11s contain scores on two of
the PISA reading items, with 1 coded as a correct response and 0
as incorrect. Find and interpret the correlation between these two
variables.
10. Create a vector of scores ranging from 0 to 10. Find the mean and SD
of these scores. Then, convert them to the IQ metric, with a mean of
100 and SD 15.
2. Measurement, Scales, and Scoring
Learning objectives
In this chapter, we’ll analyze and create plots with PISA09 data using the
epmr and ggplot2 packages. We’ll also analyze some data on US states,
contained within the datasets package automatically included with R.
2.1. What is measurement?
Measurement is happening all the time, all around us. Daily, we measure
what we eat, where we go, and what we do. For example, drink sizes are
measured using categories like tall, grande, and venti. A jog or a commute is
measured in miles or kilometers. We measure the temperature of our homes,
the air pressure in our tires, and the carbon dioxide in our atmosphere.
The wearable technology you might have strapped to your wrist could be
monitoring your lack of movement and decreasing heart rate as you doze off
reading this sentence. After you wake up, you might check your watch and
measure the length of your nap in minutes or hours.
Let’s look at some examples of measurement from the state data sets in R.
The object state.x77 contains data on eight variables, with the fifty US
states as the objects of measurement. For details on where the data come
from, see the help file ?state.
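For example, we can preview the data and index a column by name:

# first rows of the state data matrix
head(state.x77)
# names of the eight variables
colnames(state.x77)
# index a column by name, here for the first five states
state.x77[1:5, "Illiteracy"]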
practice what you learned in Chapter 1, try to convert the illiteracy rates
in state.x77[, "Illiteracy"] from proportions to counts for each state.
Note that state.x77 is a matrix, which means we can’t index columns
using $. That only works with a list or data.frame. However, the rows
and columns of a matrix can have names, accessed with rownames() and
colnames(), and we can use these names to index the matrix, as shown
above.
With most physical measurements, the property that we’re trying to rep-
resent or capture with our values can be clearly defined and consistently
measured. For example, amounts of food are commonly measured in grams.
A cup of cola has about 44 grams of sugar in it. When you see that number
printed on your can of soda pop or fizzy water, the meaning is pretty clear,
and there’s really no need to question if its accurate. Cola has a shocking
amount of sugar in it.
But, just as often, we take a number like the amount of sugar in our food
and use it to represent something abstract or intangible like how healthy
or nutritious the food is. A food’s healthiness isn’t as easy to define as its
mass or volume. A measurement of healthiness or nutritional value might
account for the other ingredients in the food and how many calories they
boil down to. Furthermore, different foods can be more or less nutritional
for different people, depending on a variety of factors. Healthiness, unlike
physical properties, is intangible and difficult to measure.
Most of the variables in state.x77 are relatively easy to measure,
as they involve observable quantities, such as numbers of dollars for
state.x77[, "Income"], years for state.x77[, "Life Exp"], and days
for state.x77[, "Frost"]. On the other hand, illiteracy rates are not as
easily measured. A variable such as illiteracy is not countable or directly
observable, which makes it subject to measurement error.
The social sciences of education and psychology typically focus on the
measurement of constructs, intangible and unobservable qualities, attributes,
or traits that we assume are causing certain observable behavior or responses.
In this book, our objects of measurement are typically people, and our goal
is to give these people numbers or labels that tell us something meaningful
about constructs such as their intelligence, reading ability, or social anxiety.
Constructs like these are difficult to measure. That’s why we need an entire
book to discuss how to best measure them.
A good question to ask at this point is, how can we measure and provide
values for something that’s unobservable? How do we score a person’s math
ability if we can’t observe it directly? The approach commonly taken is to
identify an operationalization of our construct, an observable behavior or
response that increases or decreases, we assume, as a person moves up or down on the underlying construct.
accuracy refer to the reliability and validity of test scores, that is, the extent
to which the same scores would be obtained across repeated administrations
of a test, and the extent to which scores fully represent the construct they
are intended to measure.
These two terms, reliability and validity, will come up many times throughout
the book. The second one, validity, will help us clarify our definition of
measurement in terms of its purpose. Of all the considerations that make
for effective measurement, the first to address is purpose.
The purpose of a test specifies its intended application and use. It addresses
how scores from the test are designed to be interpreted. A test without a
clear purpose can’t be effective.
Table 2.1.: Intended Uses for Some Common Types of Standardized Tests
Learning check: Take a minute to think about some of the tests you’ve
used or taken in the past. How would you express the purposes of these
tests? When answering this question, be careful to avoid simply saying
that the purpose of the test is to measure something.
A statement of test purpose should clarify what can be done with the
resulting scores. For example, scores from placement tests are used to
determine what courses a student should take or identify students in need
of certain instructional resources. Scores on admissions tests inform the
selection of applicants for entrance to a college or university. Scores on
certification and licensure exams are used to verify that examinees have the
knowledge, skills, and abilities required for practice in a given profession.
Table 2.1 includes these and a few more examples. In each case, scores are
intended to be used in a specific way.
Here’s one more example. Some of my research is based on a type of
standardized placement testing that is used to measure student growth over
a short period of time. In addition to measuring growth, scores are also
used to evaluate the effectiveness of intervention programs, where effective
interventions lead to positive results for students. My latest project in
this area of assessment involved measures of early literacy called IGDIs
(Bradfield et al., 2014). A brochure for the measures from www.myigdis.com
states,
This summary contains specific claims regarding score use, including mon-
itoring growth, and sensitivity to small changes in achievement. Validity
evidence is needed to demonstrate that scores can effectively be used in
these ways.
The point of these examples is simply to clarify what goes into a statement
of purpose, and why a well articulated purpose is an essential first step to
measurement. We’ll come back to validation of test purpose in Chapter 10.4.
For now, you just need to be familiar with how a test purpose is phrased
and why it’s important.
2.1.5. Summary
Now that we’ve established what measurement is, and some key features
that make the measurement process good, we can get into the details of how
measurement is carried out. As defined by Stevens (1946), measurement
involves the assignment of values to objects according to certain rules. The
rules that guide the measurement process determine the type of measurement
scale that is produced and the statistics that can be used with that scale.
Measurement scales are grouped into four different types. These differ in the
meaning that is given to the values that are assigned, and the relationship
between these values for a given variable.
2.2. Measurement scales
2.2.1. Nominal
The most basic measurement scale is really the absence of a scale, because
the values used are simple categories or names, rather than quantities of
a variable. For this reason it is referred to as a nominal scale, where our
objects of measurement are grouped qualitatively, for example by gender
or political party. The nominal scale can also represent variables such as
zip code or eye color, where multiple categories are present. So, identifying
(ID) variables such as student last name or school ID are also considered
nominal.
Only frequencies, proportions, and percentages (and related nonparametric
statistics) are permitted with nominal variables. Means and standard
deviations (and related parametric statistics) do not work. It would be
meaningless to calculate something like an average gender or eye color,
because nominal variables lack any inherent ordering or quantity in their
values.
What variables from PISA09 would be considered nominal? Nominal vari-
ables often have a specific class in R, which you can check with class(), as
we did in Chapter 1. The class of an R object doesn’t map directly to its
measurement scale, but it can provide helpful information about features
of the scale. mode() can also be informative regarding how a variable is
measured.
Nominal variables are often coded in R using text strings. Our classroom
variable from Chapter 1 is an example. After converting this character
object to the factor class, R identifies each unique value within the object as
a level, or unit of measurement. Other functions in R will then be able to
recognize the object as a factor, and will interpret its contents differently. To
see this interpretation in action, try to calculate the mean of our classroom
variable, or any of the nominal variables in PISA09, and you’ll receive a
warning from R.
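For example, with the classroom factor from Chapter 1:

# taking the mean of a factor triggers a warning and returns NA
mean(classroom)
## Warning in mean.default(classroom): argument is not numeric or logical:
## returning NA
## [1] NA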
2.2.2. Ordinal
The dominant feature of the ordinal scale is order, where values do have an
inherent ordering that cannot be removed without losing meaning. Common
examples of ordinal scales include ranks (e.g., first, second, third, etc.), the
multi-point rating scales seen in surveys (e.g., strongly disagree, disagree,
etc.), and level of educational attainment.
The distance between the ordered categories in ordinal scale variables (i.e.,
the interval) is never established. So, the difference between first and second
place does not necessarily mean the same thing as the difference between
second and third. In a swimming race, first and second might differ by a
matter of milliseconds, whereas second and third differ by minutes. We
know that first is faster than second, and second is faster than third, but
we don’t know how much faster. Note that the construct we’re measuring
here is probably swimming ability, which is actually operationalized on a
ratio scale, in terms of speed, but it is simplified to an ordinal scale when
giving out awards.
What variables in PISA09 are measured on ordinal scales? To identify a
variable as being measured on an ordinal scale, we really need to know
how data were collected for the variable, and what the scale values are
intended to represent. Whereas the choice of nominal is relatively simple,
the choice of ordinal over interval or ratio often requires a judgment call,
as discussed below.
In theory, statistics which rely on interval level information, such as the
mean, standard deviation, and all mean-based statistical tests, are still not
allowed with an ordinal scale. Statistics permitted with ordinal variables
include the median and any other statistics based on percentiles.
2.2.3. Interval
2.2.4. Ratio
The first of these four, a ratio scale, is the most versatile and can be
converted into any of the scales below it. However, once age is defined based
on a classification, such as “same as Mike,” no improvement can be made.
For this reason a variable’s measurement scale should be considered in the
planning stages of test design, ideally when we identify the purpose of our
test.
In the social sciences, measurement with the ratio scale is difficult to achieve
because our operationalizations of constructs typically don’t have meaningful
zeros. So, interval scales are considered optimal, though they too are not
easily obtained. Consider the sociability measure described above. What
type of scale is captured by this measure? Does a zero score indicate a
total absence of sociability? This is required for ratio. Does an incremental
increase at one end of the scale mean the same thing as an incremental
increase at the other end of the scale? This is required for interval.
Upon close examination, it is difficult to measure sociability, and most
other constructs in the social sciences, with anything more than an ordinal
scale. Unfortunately, an interval or ratio scale is required for the majority of
statistics that we’d like to use. Along these lines, Stevens (1946) concluded:
item responses in PISA09 have only two possible values, 0 and 1, representing
incorrect and correct responses. These scored item responses could all be
considered at least on ordinal scales, as 1 represents more of the measured
construct than 0. The total score across all reading items could also be
considered at least ordinal.
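As a rough sketch of how such a total could be computed, using only the two scored reading items named in the Chapter 1 exercises rather than the full set:

# names of two scored reading items, as a small example set
ritems <- c("r414q02s", "r414q11s")
# sum the scored item responses for each student
rtotal <- rowSums(PISA09[, ritems], na.rm = TRUE)
table(rtotal)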
When we talk about measurement scales, and next scoring, keep in mind
that we’re focusing on the x-axis of a plot like the one shown in Figure 2.1.
The distribution itself doesn’t necessarily help us interpret the measurement
scale or scoring process. Instead, we’re examining how individual items and
combinations of them capture differences in the underlying construct.
Figure 2.1.: A bar plot of total scores on PISA09 scored reading items.
2.3. Scoring
This book focuses on cognitive and noncognitive, that is, affective, test scores
as operationalizations of constructs in education and psychology. As noted
above, these test scores often produce ordinal scales with some amount of
meaning in their intervals. The particular rules for assigning values within
these scales depend on the type of scoring mechanisms used. Here, we’ll
cover the two most common scoring mechanisms or rules, dichotomous and
polytomous scoring, and we’ll discuss how these are used to create rating
scales and composite scores.
If “alone” were keyed positively, and “in a group” were keyed zero, how
would you describe the construct that it measures? What if “in a group”
were scored positively, and “alone” negatively?
example is the use of rating scales to score written responses such as essays.
In this case, score values may still describe the correctness of a response, but
with differing levels of correctness, for example, incorrect, partially correct,
and fully correct.
Polytomous scoring with cognitive tests can be less straightforward and less
objective than dichotomous scoring, primarily because it usually requires the
use of human raters with whom it is difficult to maintain consistent meaning
of assigned categories such as partially correct. The issue of interrater
reliability will be discussed in Chapter 5.
Polytomous scoring with affective or non-cognitive measures most often
occurs with the use of rating scales. For example, individuals may use
a rating scale to describe how much they identify with a statement, or
how well a statement represents them, rather than simply saying “yes” or
“no.” Such rating scales measure multiple levels of agreement (e.g., from
disagree to agree) or preference (e.g., from dislike to like). In this case,
because individuals provide their own responses, subjectivity in scoring is
not an issue as it is with polytomous scoring in cognitive tests. Instead,
the challenge with rating scales is in ensuring that individuals interpret the
rating categories in the same way. For example, strongly disagree could
mean different things to different people, which will impact how the resulting
scores can be compared across individuals.
The student survey variables in PISA09$st27q01 through PISA09$st42q05
are polytomously scored rating scale items that contribute to four different
composite scales. One of these scales measures students' attitude toward
school, shown in Figure 2.2.
Figure 2.2.: PISA 2009 student survey items measuring attitude toward
school.
Notice that the first two items in the scale are phrased negatively, and
the second two are phrased positively. Items PISA09$st33q01 and
PISA09$st33q02, labeled a and b in Figure 2.2, need to be reverse coded if
they are to be combined with the positively phrased items into a single
composite score.
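A minimal sketch of this recoding, assuming the 1 to 4 rating scale noted in the Chapter 1 exercises, with arbitrary names for the reversed variables:

# reverse code the two negatively worded items on a 1 to 4 scale
PISA09$st33q01r <- 5 - PISA09$st33q01
PISA09$st33q02r <- 5 - PISA09$st33q02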
Having recoded the negatively worded attitude items, we can consider ways
of combining information across all four items to get an overall measure for
each student. Except in essay scoring and with some affective measures, an
individual question is rarely used alone to measure a construct. Instead,
scores from multiple items are combined to create composite scores or rating
scale scores.
na.rm = TRUE)
dstudy(PISA09$atotal)
##
## Descriptive Study
##
## mean median sd skew kurt min max n na
## x 12.3 13 2.71 -1.66 8.12 0 16 44878 0
In Chapter 4 we will address rating scales in more detail. We’ll cover issues
in the construction and administration of rating categories. Here, we are
more concerned with the benefits of using composite scale scores.
By themselves, any one of these items may not reflect the full construct
that we are trying to measure. A person may strongly support animal
rights, except in the case of medical research. Or a person may define the
phrase “that I liked,” from the third example question, in different ways so
that this individual question would produce different results for people who
might actually be similar in their regard for animals. A composite score
helps to average out the idiosyncrasies of individual items. (Side note from
this study: a regression model showed that 25% of the variance in attitude
toward animals was accounted for by gender and a personality measure of
sensitivity.)
2.4. Measurement models
Whereas a simple sum or average over a set of items lets each item contribute
the same amount to the overall score, more complex measurement models
can be used to estimate the different contributions of individual items
to the underlying construct. These contributions can be examined in a
variety of ways, as discussed in Chapters 10.3, 7, and 8. Together, they can
provide useful information about the quality of a measure, as they help us
understand the relationship between our operationalization of the construct,
in terms of individual items, and the construct itself.
Jones, 1993). For now, we’ll just look at the basics of what a measurement
model does.
Figure 2.3 contains a visual representation of a simple measurement model
where the underlying construct of sociability, shown in an oval, causes, in
part, the observed responses in a set of three questions, shown in rectangles
as Item 1, Item 2, and Item 3. Unobservable quantities in a measurement
model are typically represented by ovals, and observable quantities by
rectangles. Causation is then represented by the arrows which point from
the construct to the item responses. The numbers over each arrow from
the construct are the scaled factor loadings reported in Nelson et al. (2010),
which represent the strength of the relationship between the items and the
construct which they together define. As with a correlation coefficient, the
larger the factor loading, the stronger the relationship. Thus, item 1 has
the strongest relationship with the sociability factor, and item 3 has the
weakest.
The other unobserved quantities in Figure 2.3 are the error terms, in the
circles, which also impact responses on the three items. Without arrows
linking the error terms from one to another, the model assumes that errors
are independent and unrelated across items. In this case, any influence on
a response that does not come from the common factor of sociability is
attributed to measurement error.
Figure 2.3.: A simple measurement model for sociability with three items,
based on results from D. A. Nelson et al. (2010). Numbers are
factor loadings and E represents unique item error.
Note that the model in Figure 2.3 breaks down our observed variables, the
scored item responses, into two unrelated sources of variability, one based
on the commonly measured construct, and the other based on individual
item errors. In theory, these errors are still present in our total scores, but
they are extracted from the construct scores produced by a measurement
model. Thus, measurement models provide a more reliable measure of the
construct.
Models such as the one in Figure 2.3 are referred to as confirmatory factor
analysis models, because we propose a given structure for the relationships
between constructs, error, and observations, and seek to confirm it by placing
certain constraints on the relationships we estimate. In Chapter 8, we’ll
discuss these along with exploratory models where we aren’t certain how
many underlying constructs are causing responses.
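For reference, a one-factor model like the one in Figure 2.3 could be specified in R using the lavaan package, which is not one of the packages used in this book; the data set and item names here are hypothetical.

# a confirmatory factor analysis with one factor and three items
library("lavaan")
sociability_model <- "sociability =~ item1 + item2 + item3"
# fit the model to a hypothetical data set containing item1 to item3
fit <- cfa(sociability_model, data = sociability_data)
summary(fit, standardized = TRUE)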
2.5. Score referencing
Figure 2.4.: Bar plots of total scores on PISA09 scored reading items for
Germany by grade.
testing.
Standardized state test results, which were presented above as an example
of norm referencing, are also given meaning using some form of criterion
referencing. The criteria in state tests are established, in part, by a panel
of teachers and administrators who participate in what is referred to as a
standard setting. State test standards are chosen to reflect different levels of
mastery of the test content. In Nebraska, for example, two cut-off scores
are chosen per test to categorize students as below the standards, meets the
standards, and exceeds the standards. These categories are referred to as
performance levels. Student performance can then be evaluated based on the
description of typical performance for their level. Here is the performance
level description for grade 5 science, meets the standard, as of 2014:
ment,
4. Develops a reasonable explanation based on collected data,
5. Describes the physical properties of matter and its changes.
Although norm and criterion referencing are presented here as two distinct
methods of giving meaning to test scores, they can sometimes be interrelated
and thus difficult to distinguish from one another. The myIGDI testing
program described above is one example of score referencing that combines
both norms and criteria. These assessments were developed for measuring
growth in early literacy skills in preschool and kindergarten classrooms.
Students with scores falling below a cut-off value are identified as potentially
being at risk for future developmental delays in reading. The cut-off score
is determined in part based on a certain percentage of the test content
(criterion information) and in part using mean performance of students
evaluated by their teachers as being at-risk (normative information).
2.6. Summary
This chapter provides an overview of what measurement is, how measurement
is carried out in terms of scaling and scoring, and how measurement is
given additional meaning through the use of score referencing and scale
transformation. Before moving on to the next chapter, make sure you
can respond to the learning objectives for this chapter, and complete the
exercises below.
2.6.1. Exercises
1. Teachers often use brief measures of oral reading fluency to see how
many words students can read correctly from a passage of text in one
minute. Describe how this variable could be modified to fit the four
different scales of measurement.
2. Examine frequency distributions for each attitude toward school
item, as was done with the reading items. Try converting counts to
percentages.
3. Plot a histogram and describe the shape of the distribution of attitude
toward school scores.
4. What country has the most positive attitude toward school?
5. Describe how both norm and criterion referencing could be helpful in
an exam used to screen applicants for a job.
6. Describe how norm and criterion referencing could be used in eval-
uating variables outside the social sciences, for example, with the
physical measurement applications presented at the beginning of the
chapter.
7. Provide details about a measurement application that interests you.
a. How would you label your construct? What terms can be used
to define it?
b. With whom would you measure this construct? Who is your
object of measurement?
c. What are the units of measurement? What values are used
when assigning scores to people? What type of measurement
scale will these values produce?
d. What is the purpose in measuring your construct? How will
scores be used?
e. How is your construct commonly measured? Are there existing
measures that would suit your needs? If you’re struggling to find
a measurement application that interests you, you can start with
the construct addressed in this book. As a measurement student,
you possess an underlying construct that will hopefully increase
as you read, study, practice, and contribute to discussions and
assignments. This construct could be labeled assessment literacy
(Stiggins, 1991).
3. Testing Applications
This chapter gives an overview of the various types of tests, including standardized and unstandardized ones, that are used to support decision making in education and psychology. Chapter 2 referred to a test's purpose
as the starting point for determining its quality or effectiveness. In this
chapter we’ll compare types of tests in terms of their purposes. We’ll
examine how these purposes are associated with certain features in a test,
and we’ll look again at how the quality or validity of a test score can impact
the effectiveness of score interpretations. We’ll end with an introduction to
so-called next generation assessments, which are distinguished from more
traditional item-based instruments by their leveraging of dynamic digital
technologies for test administration and statistical learning for scoring and
decision-making.
Learning objectives
3.1. Tests and decision making
Let’s look at one of the oldest and best known standardized tests as an
example of high-stakes decision making. Development of the first college
admissions test began in the late 1800s when a group of universities in the US
came together to form the College Entrance Examination Board (now called
the College Board). In 1901, this group administered the original version of
what would later be called the SAT. This original version consisted only of
essay questions in select subject areas such as Latin, history, and physics.
Some of these questions resemble ones we might find in standardized tests
today. For example, from the physics section:
The original test was intended only for limited use within the College
Board. However, in 1926, the SAT was redesigned to appeal to institutions
across the US. The 1926 version included nine content areas: analogies,
antonyms, arithmetic, artificial classification, language, logical inference,
number series, reading, and word definitions. It was based almost entirely on
multiple-choice questions. For additional details, see sat.collegeboard.org.
The College Board notes that the SAT was initially intended to be a universal
measure of preparation for college. It was the first test to be utilized across
multiple institutions, and it provided the only common metric with which
to evaluate applicants. In this way, the authors assert it helped to level
the playing field for applicants of diverse socio-economic backgrounds, to
“democratize access to higher education for all students” (College Board,
2012, p. 3). For example, those who may have otherwise received preferential
treatment because of connections with alumni could be compared directly
to applicants without legacy connections.
Since it was formally standardized in 1926, the SAT has become the most
widely used college admissions exam, with over 1.5 million administrations
annually (as of 2015). The test itself has changed substantially over the
years; however, its stated purpose remains the same (College Board, 2012):
As we’ll see in Chapter 10.4, test developers such as the College Board
are responsible for backing up claims such as these with validity evidence.
However, in the end, colleges must evaluate whether or not these claims
are met, and, if not, whether admission decisions can be made without a
standardized test. Colleges are responsible for choosing how much weight
test scores have in admissions decisions, and whether or not minimum cutoff
scores are used.
A 2009 survey of 246 colleges in the US found that 73% used the SAT in
admissions decisions (Briggs, 2009). Of those colleges using the SAT, 78%
reported using scores holistically, that is, as supporting information within
a portfolio of work contained in an application. On the other hand, 31%
reported using SAT scores for quantitative rankings of applicants, and 21%
further reported having defined cut-off scores below which applicants would be disqualified from admission.
Another controversial high-stakes use of educational testing involves account-
ability decisions within the US education system. For reviews of these issues
in the context of the No Child Left Behind Act of 2001 (NCLB; Public Law
107-110), see Hursh (2005) and Linn et al. (2002). Abedi (2004) discusses
validity implications of NCLB for English language learners.
3.2. Test types and features
Over the past hundred years numerous terms have been introduced to
distinguish different tests from one another based on certain features of the
tests themselves and what they are intended to measure. Educational tests
have been described as measuring constructs that are the focus of instruction
and learning, whereas psychological tests measure constructs that are not
the focus of teaching and learning. Thus, educational and psychological tests
differ in the constructs they measure. Related distinctions are made between
cognitive and affective tests, and achievement and aptitude tests. Other
distinctions include summative versus formative, mastery versus growth,
and knowledge versus performance.
Achievement and aptitude describe two related forms of cognitive tests. Both
types of tests measure similar cognitive abilities and processes, but typically
for slightly different purposes. Achievement tests are intended to describe
learning and growth, for example, in order to identify how much content
students have mastered in a unit of study. Accountability tests required by
NCLB are achievement tests built on the educational curricula for states in
the US. State curricula are divided into what are called learning standards
or curriculum standards. These standards operationalize the curriculum in
terms of what proficient students should know or be able to do.
In contrast to achievement tests, aptitude tests are typically intended to
measure cognitive abilities that are predictive of future performance. This
future performance could be measured in terms of the same or a similar
cognitive ability, or in terms of performance on other constructs and in
other situations. For example, intelligence or IQ tests are used to identify
individuals with developmental and learning disabilities and to predict job
performance (e.g., Carter, 2002). The Stanford-Binet Intelligence Scales,
originally developed in the early 1900s, were the first standardized aptitude
tests. Others include the Wechsler Scales and the Woodcock-Johnson
Psycho-Educational Battery.
Intelligence tests and related measures of cognitive functioning have tradi-
tionally been used in the US to identify students in need of special education
services. However, an over-reliance on test scores in these screening and
placement decisions has led to criticism of the practice. A federal report
(US Department of Education, 2002, p. 25) concluded that,
should have high face validity, that is, it should be clear what content the
question is intended to measure. Furthermore, a correct response to such a
question should depend directly on an individual’s learning of that content.
On the other hand, aptitude tests, which are not intended to assess learning
or mastery of a specific content domain, need not be restricted to specific
content. They are still designed using a test outline. However, this outline
captures the abilities and skills that are most related to or predictive of
the construct, rather than content areas and learning objectives. Aptitude
tests typically measure abilities and skills that generalize to other constructs
and outcomes. As a result, the content of an aptitude question is not as
important as the cognitive reasoning or processes used to respond correctly.
Aptitude questions may not have high face validity, that is, it may not be
clear what they are intended to measure and how the resulting scores will
be used.
In the 1970s and 1980s, researchers in the areas of education and school
psychology began to highlight a need for educational assessments tied directly
to the curriculum and instructional objectives of the classroom in which
they would be used. The term performance assessment was introduced
to describe a more authentic form of assessment requiring students to
demonstrate skills and competencies relevant to key outcomes of instruction
(Mehrens, 1992; Stiggins, 1987). Various types of performance assessment
were developed specifically as alternatives to the summative tests that were
then often created outside the classroom. For example, curriculum-based
measurement (CBM), developed in the early 1980s (Deno, 1985), involved
brief, performance-based measures used to monitor student progress in core
subject areas such as reading, writing, and math. Reading CBM assessed
students’ oral reading fluency in the basal texts for the course; the content
of the reading assessments came directly from the readings that students
would encounter during the academic year. These assessments produced
scores that could be used to model growth, and predict performance on
end-of-year tests (Deno et al., 2001; Fuchs and Fuchs, 1999).
Although CBM and other forms of performance assessment remain popular
today, the term formative assessment is now commonly used as a general
label for the classroom-focused alternative to the traditional summative or
end-of-year test. The main distinction between formative and summative
is in the purpose or intended use of the resulting test score. Formative
assessments are described as measuring incrementally, where the purpose is
to directly encourage student growth (Black and Wiliam, 1998). They can
be spread across multiple administrations, or used in conjunction with other
versions of an assessment, so as to monitor and promote progress. Thus,
formative assessments are designed to inform teaching and form learning.
They seek to answer the question, “how are we doing?” Wiliam and Black
(1996) further assert that in order to be formative, an assessment must
3.2.4. Summary
of failure.” In addition to a title and abstract, the data base includes basic
information on publication date and authorship, sometimes with links to
publisher websites.
The Buros Center for Testing buros.org also publishes a comprehensive data
base of educational and psychological measures. In addition to descriptive
information, they include peer evaluations of the psychometric properties
of tests in what is called the Mental Measurements Yearbook. Buros peer
reviews are available through university library subscriptions, or can be
accessed online for a fee.
3.4. Summary
This chapter provides an overview of how different types of tests are designed
to inform a variety of decisions in education and psychology. For the most
part, tests are designed merely to inform decision making processes, and
test authors are often careful to clarify that no decision should be made
based solely on test scores. Online databases provide access to descriptive
summaries and peer reviews of tests.
3.4.1. Exercises
4. Test Development
Good items are the building blocks of good tests, and the validity of test
scores can hinge on the quality of individual test items. Unfortunately, test
makers, both in low-stakes and high-stakes settings, often presume that
good items are easy to come by. As noted above, item writing is often not
given the attention it deserves. Research shows that effective item writing
is a challenging process, and even the highest-stakes tests include poorly
written items (Haladyna and Rodriguez, 2013).
This chapter summarizes the main stages of both cognitive and noncognitive
test construction, from conception to development, and the main features
of commonly used item formats. Cognitive test development is covered first.
The cognitive item writing guidelines presented in Haladyna et al. (2002)
are summarized, along with the main concepts from the style guides used
by testing companies. Next, noncognitive personality test development is
discussed. Noncognitive item writing guidelines are reviewed, along with
strategies for reducing the impact of response sets.
Learning objectives
Cognitive
Noncognitive
As is the case throughout this book, we will begin this chapter on test
development with a review of validity and test purpose. Recall from Chapters
1 through 3 that validity refers to the degree to which evidence and theory
support the interpretations of test scores entailed by the proposed uses of a
test. In other words, validity indexes the extent to which test scores can
be used for their intended purpose. These are generic definitions of validity
that apply to any type of educational or psychological measure.
In this chapter we focus first on cognitive tests, where the purpose of the test
is to produce scores that can inform decision making in terms of aptitude
and achievement, presumably of students. So, we need to define validity
in terms of these more specific test uses. Let’s use a midterm exam from
an introductory measurement course as an example. We could say that
validity refers to the degree to which the content coverage of the exam (as
specified in the outline, based on the learning objectives) supports the use
of scores as a measure of student learning for topics covered in the first part
of the course. Based on this definition of validity, what would you say is the
purpose of the exam? Note how test purpose and validity are closely linked.
4.2. Learning objectives
Construction of a valid test begins with a test purpose. You need to be able
to identify the three components of a test purpose, both when presented
with a well defined purpose, and when presented with a general description
of a test. Later in the course you’ll be reviewing information from test
reviews and technical documentation which may or may not include clear
definitions of test purpose. You’ll have to take the available information and
identify, to the best of your ability, what the test purpose is. Here are some
verbs to look for: assess, test, and measure (obviously), but also describe,
select, identify, examine, and gauge, to name a few.
Do your best to distill the lengthy description below into a one-sentence
test purpose. This should be pretty straightforward. The information is
all there. This description comes from the technical manual for the 2011
California Standards Test, which is part of what was formerly known as
the Standardized Testing and Reporting (STAR) program for the state of
California (see www.cde.ca.gov). These are more recent forms of the state
tests that I took in school in the 1980s.
You should have come up with something like this for the CST test purpose:
the CST measures ELA, mathematics, history/social science, and science for
students in grades two through eleven to show how well they are performing
with respect to California’s content standards, and to help determine AYP.
To keep the description of the CST brief, I omitted details about the content
standards. California, like all other states, has detailed standards or learning
objectives defining what content/skills/knowledge/information/etc. must be
covered by schools in the core subject areas. The standards specify what a
Note that these standards reflect specific things students should be able to
do, and some conditions for how students can do these things well. Such
specific wording greatly simplifies the item writing process because it clarifies
precisely the knowledge, skills, and abilities that should be measured.
Notice also that the simplest way to assess the first science objective listed above would be to ask students to design and conduct an investigation
that leads to the use of logic and evidence in the formulation of scientific
explanations and models. The standard itself is almost worded as a test
question. This is often the case with well-written standards. Unfortunately,
the standardized testing process includes constraints, like time limits, that
make it difficult or impossible to assess standards so directly. Designing
and conducting an experiment requires time and resources. Instead, in a
test we might refer students to an example of an experiment and ask them
to identify correct or incorrect procedures, or we might ask students to use
logic when making conclusions from given experimental results. In this way,
we use individual test questions to indirectly assess different components of
a given standard.
4.3. Features of cognitive items
DOK descriptions such as this are used to categorize items in the item
writing process, and thereby ensure that the items together support the
overall DOK required in the purpose of the test. Typically, higher DOK is
preferable. However, lower levels of DOK are sometimes required to assess
certain objectives, for example, ones that require students to recall or repro-
duce definitions, steps, procedures, or other key information. Furthermore,
constraints on time and resources within the standardized testing process
often make it impossible to assess the highest level of DOK, which requires
extended thinking and complex cognitive demands.
Items can often be taken beyond DOK1 with the introduction of novel
material and context. The following examples assess the same learning
objective at different levels of DOK by relying on a simple hypothetical
scenario to varying degrees. The learning objective is Define criterion
referencing and identify contexts in which it is appropriate.
Ineffective:
This first example is ineffective because the question itself is detached from
the context. The context is simple and straightforward, and sets us up nicely
to get beyond recalling the definition of criterion referencing. However, it
boils down to recall in the end because we provide too much scaffolding in the
question. Option A is the correct choice, regardless of what’s happening with
the aggression test. Also, option C is absurd. Something like “Percentile
referencing” would be better because it at least resembles an actual term.
Effective:
This version is better than the first because the question itself is dependent
on the context. Students must refer to the scenario to identify the most
appropriate type of score referencing. The next version requires an even
more focused evaluation of the context.
More focused:
Cognitive test items come in a variety of types that differ in how material
is presented to the test taker, and how responses are then collected. Most
cognitive test questions begin with a stem or question statement. The stem
should present all of the information needed to answer an item correctly.
Optionally, an item, or a set of items, can refer to a figure, table, text, or
other presentation of information that students must read, interpret, and
sometimes interact with before responding. These are referred to as prompts,
and they are typically presented separately from the item stem.
Selected-response (SR) items collect responses from test takers using two or
more response options. The classic multiple-choice question is an SR item
with a stem ending in a question or a direction indicating that the test taker must choose one or more of the options.
4. matching, where test takers select for each option in one list the
correct match from a second list;
5. complex multiple-choice, where different combinations of response
options can be selected as correct, resembling a constrained form of
multiple correct (e.g., options A and B, A and C, or all of the above);
and
6. evidence-based, which can be any form of SR item where a follow-
up question requires test takers to select an option justifying their
response to the original item.
A. 1
B. 2
C. 3
D. 4
A constructed-response (CR) item does not present options to the test taker.
As the name implies, a response must be constructed. Constructed-response
items include short-answer, fill-in-the-blank, graphing, manipulation of
information, and essays. Standardized performance assessments, for example,
reading fluency measures, can also be considered CR tasks.
The science question itself within Part I of the evidence-based DOK item
above is an example of a simple essay question. Note that this science
question could easily be converted to an SR question with multiple correct
answers, where various components of an experiment, some correct and
some incorrect, could be presented to the student. Parts I and II from the
evidence-based DOK question could also easily be converted to a single
CR question, where test takers identify the correct DOK for the science
question, and then provide their own supporting evidence.
There are some key advantages and disadvantages to multiple-choice or SR
items and CR items. In terms of advantages, SR items are typically easy to
administer and score, and are more objective and reliable than CR items.
They are also more efficient, and can be used to cover more test content in
a shorter period of time. Finally, SR items can provide useful diagnostic
information about specific misconceptions that test takers might have.
Although they are more efficient and economical, SR items are more difficult to write well; they tend to focus on lower-order thinking and skills, such as recall and reproduction, and they are more susceptible to test-wiseness and guessing. Constructed-response items address each of these issues. They
are easier to write, especially for higher-level thinking, and they eliminate
the potential for simple guessing.
The main benefit of CR questions is that they can be used to test more practical, authentic, and realistic forms of performance and tasks, including creative skills and abilities. The downside is that these types of performances and tasks require time to demonstrate and are complex and costly to score.
Take a minute to think about the construct that this question is intended
to measure. “Analytical writing” is a good start, but you should try to be
more specific. What knowledge, skills, and abilities are required to do well
on the question? What are the benefits of performance assessments such as
these?
As with CR items, the main benefit of performance assessment is that it
is considered more authentic than traditional mastery assessment, because
it allows us to assess directly what we’re trying to measure. For example,
the GRE essay question above measures the construct of analytical writing
by asking examinees to write analytically. This improves the validity of
the resulting score as an indicator of the construct itself. Performance
assessments, because they require individuals to generate their own response,
rather than select a response from a list of options, are also able to assess
higher order thinking and skills like synthesis and evaluation. These skills
are not easily assessed with simple selected-response questions.
Two of the drawbacks of performance assessments result from the fact
that humans are involved in the scoring process. First, as noted above,
performance assessments are less practical, because they require substantially
more time and resources to develop and score. Second, the scoring process
becomes subjective, to some extent. A third drawback to performance assessment is that content, though it may be assessed deeply, for example, at greater depth of knowledge, is not assessed broadly.
4.3.4. Rubrics
ent judges can consistently apply the correct scores to the corresponding
performance levels.
Here is an example from the GRE analytical writing rubric, available online.
GRE essays are scored on a scale from 1 to 6, and the description below is
for one of the six possible score categories. The first sentence describes the
overall quality of the essay as demonstrating “some competence” but also
being “obviously flawed.” Then, a list of common features for this category
of essay is provided.
After reading through this portion of the rubric, try to guess which of the
six score categories it applies to.
Figure 4.1 contains a prompt used in the PISA 2009 reading test, with items
r414q02, r414q11, r414q06, and r414q09. The full text is also included in
Appendix C. This prompt presented students with information on whether
cell phones are dangerous, along with recommendations for safe cell phone
use. Some key points were also summarized in the left margin of the prompt.
Question PISA09$r414q06 required that students interpret information from
a specific part of the prompt. It read, “Look at Point 3 in the No column of
the table. In this context, what might one of these ‘other factors’ be? Give
a reason for your answer.”
Figure 4.1.: PISA 2009 prompt for reading items r414q02, r414q11, r414q06,
and r414q09.
Correct
Answers which identify a factor in modern lifestyles that could
be related to fatigue, headaches, or loss of concentration. The
explanation may be self-evident, or explicitly stated.
Incorrect
Answers which give an insufficient or vague response.
Fatigue. [Repeats information in the text.]
Tiredness. [Repeats information in the text.]
Answers which show inaccurate comprehension of the material
or are implausible or irrelevant.
In its simplest form, a test outline is a table that summarizes how the
items in a test are distributed in terms of key features such as content areas
or subscales (e.g., quantitative reasoning, verbal reasoning), standards or
objectives, item types, and depth of knowledge. Table 4.1 contains a simple
example for a cognitive test with three content areas.
A test outline is used to ensure that a test measures the content areas
captured by the tested construct, and that these content areas are measured
in the appropriate ways. Notice that in Table 4.1 we’re only assessing
reading using the first two levels of DOK. Perhaps scores from this test will
be used to identify struggling readers. The test purpose would likely need
to include some mention of reading comprehension, which would then be
assessed at a deeper level of knowledge.
The learning objectives in Table 4.1 are intentionally left vague. How can
they be improved to make these content areas more testable? Consider how
qualifying information could be included in these objectives to clarify what
would constitute high-quality performance or responses.
4.4. Cognitive item writing
The item writing guidelines presented in Haladyna et al. (2002) are paraphrased here for reference. The guidelines are grouped into ones addressing
content concerns, formatting concerns, style concerns, issues in writing the
stem, and issues in writing the response options.
Content concerns
1. Every item should reflect specific content and a single specific men-
tal behavior, as called for in test specifications (two-way grid, test
outline).
2. Base each item on important content to learn; avoid trivial content.
3. Use novel material to test higher-level learning. Paraphrase textbook language or language used during instruction when it is used in a test item, to avoid testing simple recall.
4. Keep the content of each item independent from content of other
items on the test.
5. Avoid overly specific and overly general content when writing multiple-choice (MC) items.
6. Avoid opinion-based items.
7. Avoid trick items.
8. Keep vocabulary simple for the group of students being tested.
Formatting concerns
9. Use the question, completion, and best answer versions of the con-
ventional MC, the alternate choice, true-false, multiple true-false,
matching, and the context-dependent item and item set formats, but
AVOID the complex MC (Type K) format.
10. Format the item vertically instead of horizontally.
Style concerns
Writing the stem
14. Ensure that the directions in the stem are very clear.
15. Include the central idea in the stem instead of the choices.
16. Avoid window dressing (excessive verbiage).
17. Word the stem positively; avoid negatives such as NOT or EXCEPT. If a negative word is used, use it cautiously and always ensure that it appears capitalized and in boldface.
Writing the response options
18. Develop as many effective choices as you can, but research suggests three is adequate.
19. Make sure that only one of these choices is the right answer.
20. Vary the location of the right answer according to the number of
choices.
21. Place choices in logical or numerical order.
22. Keep choices independent; choices should not be overlapping.
23. Keep choices homogeneous in content and grammatical structure.
24. Keep the length of choices about equal.
25. None-of-the-above should be used carefully.
26. Avoid All-of-the-above.
27. Phrase choices positively; avoid negatives such as NOT.
28. Avoid giving clues to the right answer, such as
a. Specific determiners including always, never, completely, and
absolutely.
b. Clang associations, choices identical to or resembling words in
the stem.
c. Grammatical inconsistencies that cue the test-taker to the cor-
rect choice.
d. Conspicuous correct choice.
e. Pairs or triplets of options that clue the test-taker to the correct
choice.
f. Blatantly absurd, ridiculous options.
29. Make all distractors plausible.
30. Use typical errors of students to write your distractors.
31. Use humor if it is compatible with the teacher and the learning
environment.
Rather than review each item writing guideline, we’ll just summarize the
main theme that they all address. This theme has to do with the intended
construct that a test is measuring. Each guideline targets a different source
of what is referred to as construct irrelevant variance that is introduced in
the testing process.
For example, consider guideline 8, which recommends that we “keep vo-
cabulary simple for the group of students being tested.” When vocabulary
becomes unnecessarily complex, we end up testing vocabulary knowledge
and related constructs in addition to our target construct. The complex-
ity of the vocabulary should be appropriate for the audience and should
4.5. Personality
Validity and test purpose are once again at the start of the test construction
process. As with cognitive test construction, the valid use of affective test
scores requires a clearly articulated test purpose. This purpose tells us
about the construct we intend to measure, for whom we measure it, and for
what reasons. Item writing then directly supports this purpose.
Affective tests are used in a variety of contexts. For example, test results can
support evaluations of the effectiveness of clinical or counseling interventions.
They can also inform clinical diagnosis. Affective measures can also be
used for research purposes, for example, to examine relationships between
patterns of thought or behavior. See Chapter 3 for example applications in
the areas of mental health and job placement.
The Myers-Briggs Type Indicator (MBTI), first mentioned in Chapter 2, is
a popular but somewhat controversial personality test based on the work of
Carl Jung, who is famous, in part, for his research on psychological archetypes.
Jung’s original archetypes were defined by the extraversion-introversion and
perception-judgment dichotomies. For each of these dichotomies, Jung
claimed that people tend to find themselves at one end, for example, ex-
traversion, more than the other, introversion.
The MBTI seeks to measure combinations of these original archetypes with
the additional dichotomies of sensing-intuition and thinking-feeling. It does
so using simple questions such as the following (from Myers et al., 1998):
Change for me is
* Difficult
* Easy
I prefer to work
* Alone
* In a team
I consider myself to be
* Social
* Private
The main criticism of the MBTI is that there is insufficient evidence sup-
porting its reliability and validity. The test is used widely in counseling
settings and employment settings for personnel selection and professional
development. However, these uses may not be validated. For example,
Gardner and Martinko (1996) found that relationships between MBTI types
and variables such as managerial effectiveness, which would provide validity
evidence for the use of scores in this setting, were weak or not well described.
Pittenger (2005) concluded that overuse of the MBTI, where support for
it is lacking, may be due to the simplicity of the measure and the MBTI
publisher’s marketing strategy.
At this point you may be wondering, what does the MBTI actually claim to
do? What is its intended purpose? Consider the following broad disclaimer
All types are equal. The purpose of taking the MBTI is to rec-
ognize your strengths and weaknesses as well as those of others.
The MBTI was created in order to facilitate an understanding
and appreciation of differences among human beings. No type
is better than another.
The Myers-Briggs Type Indicator does not measure ability,
traits, or character. Unlike other personality assessments, the
MBTI does not do any of the above. Carl Jung and Isabel
Briggs-Myers believed that preferences are inborn while traits
are not. Someone can improve upon a trait (e.g. working on
their public speaking) but they cannot change their preference
(e.g. preferring to work alone than with a group in general).
Your type does not dictate who you are as a person. Ethical
use of the MBTI is being able to discern and understand your
results. However, your type does not truly represent who you
are. You are your own person. Myers believed that all individ-
uals are unique in their own way. Being assigned a type does
not mean you are every little detail outlined in the description.
You should make your own reasonable judgment and verify
your own preferences.
Contrast this with the variety of potential uses described on the publisher’s
website www.myersbriggs.org. For example,
Note that there are now multiple versions of the MBTI for different applica-
tions. One version, called Step III, is described as being
In the end, it seems that the purpose of the MBTI is simply to inform
individuals about their profile of types. The publisher then claims that type
scores can be used in employment or counseling settings, or by individuals,
to inform “life choices” and help them “become more effective in the natural
use of their type.” Note that the goal is not to change types, or work on
deficiencies, but to capitalize on strengths.
4.7. Noncognitive test construction
Personality tests like the MBTI, BDI (introduced in Chapter 2), and MMPI
(introduced in Chapter 3) are developed using one of two approaches or
strategies. These strategies serve the same purpose as the test outline in
cognitive test construction, that is, they identify the underlying structure
or framework for the test. The two strategies used with affective and
personality tests are called deductive and empirical.
4.7.1. Deductive
measure for the presence of anorexic behaviors. Note that the simplicity
of this approach stems from the simplicity of the construct itself. When a
construct is easily operationalized in terms of observable behaviors, logic
may be sufficient for determining test content.
4.7.2. Empirical
Example personality items, each rated on a scale from 1 to 6: Am indifferent to the feelings of others; Inquire about others' well-being; Know how to comfort others; Love children; Make people feel at ease.
the item is unimportant, as long as the data come from representative and
cross-validated samples of people with schizophrenia.
analysis, discussed in Chapter 6, might reveal that patients are only able to
consistently use a subset of the ten pain points. Perhaps we could reduce
the ten-point scale to a four-point scale that includes categories such as
none, some, lots, and most. Additional scale points can potentially lead to
increases in score variability. However, in many measurement applications
this variability reflects inconsistent use of the scale, that is, measurement
error, rather than meaningful differences between individuals.
Although pain is subjective, and the anchors themselves seem arbitrary, the
pain scale is useful in two contexts. Numerical anchors and visual cues, via
the faces, are helpful when verbal communication is challenging. And as
long as an individual interprets the anchors consistently over time, the scale can be used to identify changes in pain within the individual.
A second issue in rating scale construction is whether or not to include a
central, neutral option. Although examinees may prefer to have it, especially
when responding to controversial topics where commitment one way or the
other is difficult, the neutral option rarely provides useful information
regarding the construct being measured, and should be avoided whenever
possible (Kline, 1986).
The item writing guidelines presented in Spector (1992) and Kline (1986)
are paraphrased here for reference. Regarding these guidelines, Kline (1986)
notes,
Much of what I shall say is obvious and little more than com-
mon sense. Nevertheless, examination of many published tests
and tests used for internal selection by large organizations has
convinced this author that these things need to be said. Too
often, test constructors, blinded by the brilliance of the tech-
nology of item analysis, forget the fact that a test can be no
better (but it can be worse) than its items.
As with the cognitive item writing guidelines, the affective guidelines mainly
4.9.1. Guidelines
1. Reduce insight: in general, the less examinees know about the con-
struct being measured, the more authentic or genuine their responses
are likely to be.
2. Encourage immediate response: related to insight, the more an exam-
inee reflects on an item, in general, the less likely they are to respond
genuinely.
3. Write clearly, specifically, and unambiguously: reliable and valid mea-
surement requires that examinees consistently interpret and under-
stand what is asked of them. Conflicting, confusing, uninterpretable,
or excessive information introduces measurement error.
4. Reference behaviors: feelings, like pleasure and pain, mean differ-
ent things to different people. Refer to easily identified symptoms,
experiences, or behaviors, rather than feelings or interpretations of
them.
Response sets describe patterns of response that introduce bias into the
process of measuring noncognitive constructs via self-reporting. Bias refers
to systematic error that has a consistent and predictable impact on responses.
The main response sets include social desirability, acquiescence, extremity,
and neutrality.
Social desirability refers to a tendency for examinees to respond in what
appears to be a socially desirable or favorable way. Examinees tend to under-
report or de-emphasize constructs that carry negative connotations, and
over-report or overemphasize constructs that carry positive connotations.
For example, examinees tend to be less likely to endorse or identify with items
that are perceived to measure stigmatized constructs such as depression,
anxiety, and aggression, or behaviors such as procrastination and drug use.
On the other hand, examinees are more likely to endorse or identify with
items measuring desirable traits such as kindness, resilience, and generosity.
4.10. Summary
4.10.1. Exercises
5. Reliability
Too much consistency is as bad for the mind as it is for the
body. Consistency is contrary to nature, contrary to life. The
only completely consistent people are the dead.
— Aldous Huxley
the four main study designs and corresponding methods for estimating
reliability are reviewed. Finally, reliability is discussed for situations where
scores come from raters. This is called interrater reliability, and it is best
conceptualized using G theory.
Learning objectives
CTT reliability
Interrater reliability
5.1. Consistency of measurement
In this chapter, we’ll conduct reliability analyses on PISA09 data using epmr,
and plot results using ggplot2. We’ll also simulate some data and examine
interrater reliability using epmr.
whose arrows are scattered around the target, with one hitting close to the
bulls-eye and the rest spread widely around it. This represents measurement
that is inconsistent and inaccurate, though more accurate perhaps than
with the first archer. Reliability and validity are both present only when
the arrows are all close to the center of the target. In that case, we’re
consistently measuring what we intend to measure.
A key assumption in this analogy is that our archers are actually skilled, and
any errors in their shots are due to the measurement process itself. Instead,
consistently hitting a nearby tree may be evidence of a reliable test given
to someone who is simply missing the mark because they don’t know how
to aim. In reality, if someone scores systematically off target or above or
below their true underlying ability, we have a hard time attributing this to
bias in the testing process versus a true difference in ability.
A key point here is that evidence supporting the reliability of a test can
be based on results from the test itself. However, evidence supporting the
validity of the test must come, in part, from external sources. The only
ways to determine that consistently hitting a tree represents low ability
are to a) confirm that our test is unbiased and b) conduct a separate test.
These are validity issues, which will be covered in Chapter 10.4.
Second, the testing process itself could differ across measurement occasions.
Perhaps there is a strong cross-breeze one day but not the next. Or maybe
the people officiating the competition allow for different amounts of time for
warm-up. Or maybe the audience is rowdy or disagreeable at some points
and supportive at others. These factors are tied to the testing process itself,
and they may all lead to changes in scores.
Finally, our test may simply be limited in scope. Despite our best efforts,
it may be that arrows differ from one another to some degree in balance
or construction. Or it may be that archers’ fingers occasionally slip for no
fault of their own. By using a limited number of shots, scores may change
or differ from one another simply because of the limited nature of the test.
Extending the analogy to other sports, football and basketball each involve
many opportunities to score points that could be used to represent the ability
of a player or team. On the other hand, because of the scarcity of goals in
soccer, a single match may not accurately represent ability, especially when
the referees have been bribed!
5.2. Classical test theory
X = T + E. (5.1)
So, at one administration of the test, some form of error may cause your
score to decrease by two points. Maybe you weren’t feeling well that day.
In this case, knowing that T = 20, what is E in Equation (5.1), and what
is X? At another administration, you might guess correctly on a few of the
test questions, resulting in an increase of 3 based solely on error. What is
E now? And what is X?
Solving for E in Equation (5.1) clarifies that random error is simply the
difference between the true score and the observation, where a negative error
always indicates that X is too low and a positive error always indicates that
X is too high:
E = X − T. (5.2)
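To make the arithmetic concrete, here is a minimal R sketch of Equations (5.1) and (5.2), using the hypothetical true score of 20 from the example above; the object names are arbitrary.

# Equations (5.1) and (5.2) with a hypothetical true score of 20
t_score <- 20
e_1 <- -2               # error at the first administration (not feeling well)
x_1 <- t_score + e_1    # observed score: X = T + E = 18
e_2 <- 3                # error at the second administration (lucky guessing)
x_2 <- t_score + e_2    # observed score: X = T + E = 23
x_1 - t_score           # recovers E = -2, as in Equation (5.2)
x_2 - t_score           # recovers E = 3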
Figure 5.1 presents the three pieces of the CTT model using PISA09 total
reading scores for students from Belgium. We’re transitioning here from
lots of X and E scores for an individual with a constant T , to an X, T , and
E score per individual within a sample. Total reading scores on the x-axis
represent X, and simulated T scores are on the y-axis. These true scores are calculated as T = X − E, where E is simulated random error. The solid line represents what we'd expect if there were no error, in which case X = T. As a result, the horizontal scatter in the plot represents E. Note that the T scores are simulated to be continuous and to range from 0 to 11. The X scores are discrete, but they've been "jittered" slightly left to right to reveal the densities of points in the plot.
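The book's own code for Figure 5.1 is not shown in this excerpt, and it derives simulated T scores from the observed PISA09 totals. A rough, self-contained sketch of the same idea, with arbitrary sample size and error variance and with both T and E simulated, might look like the following.

# A rough CTT simulation in the spirit of Figure 5.1 (not the book's actual code)
library(ggplot2)

set.seed(1)
n <- 1000
t_true <- runif(n, min = 0, max = 11)   # continuous true scores from 0 to 11
e <- rnorm(n, mean = 0, sd = 1)         # random error with mean zero, arbitrary SD
x <- t_true + e                         # observed scores, X = T + E

# Plot observed versus true scores; the identity line is where X = T
ggplot(data.frame(x, t_true), aes(x = x, y = t_true)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1)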
Let’s think about some specific examples now of the classical test theory
model. Consider a construct that interests you, how this construct is
Figure 5.1.: PISA total reading scores with simulated error and true scores
based on CTT.
operationalized, and the kind of measurement scale that results from it.
Consider the possible score range, and try to articulate X and T in your
own example.
Next, let’s think about E. What might cause an individual’s observed
score X to differ from their true score T in this situation? Think about
the conditions in which the test would be administered. Think about the
population of students, patients, individuals that you are working with.
Would they tend to bring some form of error or unreliability into the
measurement process?
Here’s a simple example involving preschoolers. As I mentioned in earlier
chapters, some of my research involves measures of early literacy. In this
research, we test children’s phonological awareness by presenting them with
a target image, for example, an image of a star, and asking them to identify, from among three response options, the image that rhymes with the target.
So, we’d present the images and say, “Which one rhymes with star?” Then,
for example, children might point to the image of a car.
Measurement error is problematic in a test like this for a number of reasons.
First of all, preschoolers are easily distracted. Even with standardized one-
on-one test administration apart from the rest of the class, children can be
distracted by a variety of seemingly innocuous features of the administration
or environment, from the chair they’re sitting in, to the zipper on their
jacket. In the absence of things in their environment, they’ll tell you about
things from home, what they had for breakfast, what they did over the
weekend, or, as a last resort, things from their imagination. Second of all,
because of their short attention span, the test itself has to be brief and
simple to administer. Shorter tests, as mentioned above in terms of archery
and other sports, are less reliable tests; fewer items make it more difficult
to identify the reliable portion of the measurement process. In shorter tests,
problems with individual items have a larger impact on the test as a whole.
A systematic error is one that influences a person’s score in the same way at
every repeated administration. A random error is one that could be positive
or negative for a person, or one that changes randomly by administration.
In the preschooler literacy example, as students focus less on the test itself
and more on their surroundings, their scores might involve more guessing,
which introduces random error, assuming the guessing is truly random.
Interestingly, we noticed in pilot studies of early literacy measures that
some students tended to choose the first option when they didn’t know the
correct response. This resulted in a systematic change in their scores based
on how often the correct response happened to be first.
What type of error does the standard deviation not capture? Systematic
error doesn’t vary from one measurement to the next. If the scale itself is
not calibrated correctly, for example, it may overestimate or underestimate
weight consistently from one measure to the next. The important point
to remember here is that only one type of error is captured by E in CTT:
the random error. Any systematic error that occurs consistently across
administrations will become part of T , and will not reduce our estimate of
reliability.
5.3.1. Reliability
Figure 5.2 contains a plot similar to the one in Figure 5.1 where we identified X, T, and E. This time, we have scores on two reading test forms, with the first form now called X1 and the second X2, and we're going to focus on the overall distances of the points from the line that goes diagonally
across the plot. Once again, this line represents truth. A person with a
true score of 11 on X1 will score 11 on X2 , based on the assumptions of the
CTT model.
Although the solid line represents what we’d expect to see for true scores,
we don’t actually know anyone’s true score, even for those students who
happen to get the same score on both test forms. The points in Figure 5.2
are all observed scores. The students who score the same on both test forms
do indicate more consistent measurement. However, it could be that their
true score still differs from observed. There’s no way to know. To calculate
truth, we would have to administer the test an infinite number of times,
and then take the average, or simply simulate it, as in Figure 5.1.
Figure 5.2.: PISA total reading scores and scores on a simulated second
form of the reading test.
on the line. If you score, for example, 10 on one form, you also score 10 on
the other. The correlation in this case would be 1. Would we expect scores
to remain consistent from one test to the next?
We’re now ready for a statistical definition of reliability. In CTT, reliability
is defined as the proportion of variability in X that is due to variability in
true scores T :
r = \frac{\sigma_T^2}{\sigma_X^2}   (5.3)
Note that true scores are assumed to be constant in CTT for a given
individual, but not across individuals. Thus, reliability is defined in terms of
variability in scores for a population of test takers. Why do some individuals
get higher scores than others? In part because they actually have higher
abilities or true scores than others, but also, in part, because of measurement
error. The reliability coefficient in Equation (5.3) tells us how much of our
observed variability in X is due to true score differences.
Unfortunately, we can’t ever know the CTT true scores for test takers. So we
have to estimate reliability indirectly. One indirect estimate made possible
by CTT is the correlation between scores on two forms of the same test, as
represented in Figure 5.2:
r = \rho_{X_1 X_2} = \frac{\sigma_{X_1 X_2}}{\sigma_{X_1} \sigma_{X_2}}   (5.4)
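As a quick check on these two definitions, the sketch below simulates a common set of true scores with independent errors on two forms; with true score variance 4 and error variance 1, both quantities should land near 0.80. All of the values here are arbitrary.

# Reliability by definition (5.3) versus the two-form estimate (5.4), simulated
set.seed(2)
n <- 10000
t_true <- rnorm(n, mean = 6, sd = 2)    # true scores, variance 4
x1 <- t_true + rnorm(n, sd = 1)         # form 1: truth plus random error
x2 <- t_true + rnorm(n, sd = 1)         # form 2: truth plus independent error

var(t_true) / var(x1)   # Equation (5.3): roughly 4/5 = 0.80
cor(x1, x2)             # Equation (5.4): also roughly 0.80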
The Spearman-Brown formula was originally used to correct for the reduction
in reliability that occurred when correlating two test forms that were only
half the length of the original test. In theory, reliability will increase as
we add items to a test. Thus, Spearman-Brown is used to estimate, or
predict, what the reliability would be if the half-length tests were made into
full-length tests.
labeled here as the old reliability, r_old, and the factor by which the length of X is predicted to change, k:

r_{new} = \frac{k \, r_{old}}{(k - 1) r_{old} + 1}   (5.5)
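A small helper function makes the prediction in Equation (5.5) easy to apply; the function name and the example values below are arbitrary.

# Spearman-Brown prediction from Equation (5.5)
sb_predict <- function(r_old, k) {
  (k * r_old) / ((k - 1) * r_old + 1)
}

sb_predict(r_old = 0.70, k = 2)     # doubling test length: about 0.82
sb_predict(r_old = 0.70, k = 0.5)   # halving test length: about 0.54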
r = \alpha = \frac{J}{J - 1} \left( \frac{\sigma_X^2 - \sum \sigma_{X_j}^2}{\sigma_X^2} \right)   (5.6)

where J is the number of items on the test, σ_X^2 is the variance of observed total scores on X, and Σ σ_Xj^2 is the sum of the variances of each item j on X. To see how it relates to the CTT definition of reliability in Equation (5.3), consider the top of the second fraction in Equation (5.6). The total test variance σ_X^2 captures all the variability available in the total scores for the test. We're subtracting from it the variances that are unique to the individual items themselves. What's left over? Only the shared variability among the items in the test. We then divide this shared variability by the total available variability. Within the formula for alpha you should see the general formula for reliability, true variance over observed.
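Alpha can also be computed directly from Equation (5.6) in a few lines of base R. The sketch below uses simulated item scores rather than the PISA09 items, since the exact item columns aren't shown in this excerpt; the epmr package used in this chapter provides reliability estimates as well, but applying the formula by hand makes each piece visible.

# Coefficient alpha from Equation (5.6), using simulated item scores
set.seed(3)
n <- 500                 # test takers
J <- 10                  # items
theta <- rnorm(n)        # common factor shared by all items
items <- sapply(1:J, function(j) theta + rnorm(n))   # item scores with unique error

item_vars <- apply(items, 2, var)   # the item variances in Equation (5.6)
total_var <- var(rowSums(items))    # the total score variance
(J / (J - 1)) * ((total_var - sum(item_vars)) / total_var)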
Keep in mind, alpha is an estimate of reliability, just like the correlation is.
So, any equation requiring an estimate of reliability, like SEM below, can
be computed using either a correlation coefficient or an alpha coefficient.
Students often struggle with this point: correlation is one estimate of
reliability, alpha is another. They’re both estimating the same thing, but in
different ways based on different reliability study designs.
5.3.3. Unreliability
1 - r = \frac{\sigma_E^2}{\sigma_X^2}   (5.7)

SEM = \sigma_X \sqrt{1 - r}   (5.8)
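For example, here is a quick computation of the SEM in Equation (5.8) and the kind of 95 percent interval shown in Figure 5.3; the standard deviation and reliability values are hypothetical.

# SEM from Equation (5.8) and an approximate 95 percent confidence interval
sd_x <- 2.5      # observed score standard deviation (hypothetical)
rel <- 0.85      # reliability estimate, e.g., a correlation or alpha (hypothetical)
sem <- sd_x * sqrt(1 - rel)
sem                                         # about 0.97
score <- 10
c(score - 1.96 * sem, score + 1.96 * sem)   # roughly 8.1 to 11.9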
Figure 5.3.: The PISA09 reading scale shown with 68 and 95 percent confi-
dence intervals around each point.
Table 5.2.: Reliability study designs by number of test forms and number of testing occasions.

              1 Form                  2 Forms
1 Occasion    Internal Consistency    Equivalence
2 Occasions   Stability               Equivalence & Stability
Now that we’ve established the more common estimates of reliability and
unreliability, we can discuss the four main study designs that allow us to
collect data for our estimates. These designs are referred to as internal
consistency, equivalence, stability, and equivalence/stability designs. Each
design produces a corresponding type of reliability that is expected to be
impacted by different sources of measurement error.
The four standard study designs vary in the number of test forms and the
number of testing occasions involved in the study. Until now, we’ve been
talking about using two test forms on two separate administrations. This
study design is found in the lower right corner of Table 5.2, and it provides
us with an estimate of equivalence (for two different forms of a test) and
stability (across two different administrations of the test). This study design
has the potential to capture the most sources of measurement error, and
it can thus produce the lowest estimate of reliability, because of the two
factors involved. The more time that passes between administrations, and
as two test forms differ more in their content and other features, the more
error we would expect to be introduced. On the other hand, as our two
test forms are administered closer in time, we move from the lower right
corner to the upper right corner of Table 5.2, and our estimate of reliability
captures less of the measurement error introduced by the passage of time.
We’re left with an estimate of the equivalence between the two forms.
As our test forms become more and more equivalent, we eventually end up
with the same test form, and we move to the first column in Table 5.2, where
one of two types of reliability is estimated. First, if we administer the same
test twice with time passing between administrations, we have an estimate
of the stability of our measurement over time. Given that the same test
is given twice, any measurement error will be due to the passage of time,
rather than differences between the test forms. Second, if we administer one
test only once, we no longer have an estimate of stability, and we also no
longer have an estimate of reliability that is based on correlation. Instead,
we have an estimate of what is referred to as the internal consistency of
the measurement. This is based on the relationships among the test items
themselves, which we treat as miniature alternate forms of the test. The
resulting reliability estimate is impacted by error that comes from the items
themselves being unstable estimates of the construct of interest.
Internal consistency reliability is estimated using either coefficient alpha or
split-half reliability. All the remaining cells in Table 5.2 involve estimates of
5.4. Interrater reliability
Let’s find the proportion agreement for the simulated coin flip data. The
question we’re answering is, how often did the coin flips have the same value,
whether 0 or 1, for both raters across the 30 tosses? The crosstab shows this
agreement in the first row and first column, with raters both flipping tails 5
times, and in the second row and second column, with raters both flipping
heads 10 times. We can add these up to get 15, and divide by n = 30 to get the proportion agreement, 0.50.
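The simulation code for the coin flip example isn't shown in this excerpt, but it can be re-created along the following lines; because the flips are random, the counts will differ from the ones reported above.

# Chance agreement between two random "raters" (coin flips)
set.seed(4)
flips1 <- rbinom(30, size = 1, prob = 0.5)   # rater 1: 0 = tails, 1 = heads
flips2 <- rbinom(30, size = 1, prob = 0.5)   # rater 2

tab <- table(flips1, flips2)
tab
sum(diag(tab)) / sum(tab)   # proportion agreement due purely to chance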
Data for the next few examples were simulated to represent scores given
by two raters with a certain correlation, that is, a certain reliability. Thus,
agreement here isn’t simply by chance. In the population, scores from
these raters correlated at 0.90. The score scale ranged from 0 to 6 points,
with means set to 4 and 3 points for raters 1 and 2, respectively, and an SD of 1.5 for
both. We’ll refer to these as essay scores, much like the essay scores on
the analytical writing section of the GRE. Scores were also dichotomized
around a hypothetical cut score of 3, resulting in either a “Fail” or “Pass.”
Table 5.3.: Crosstab of Scores From Rater 1 in Rows and Rater 2 in Columns
       0   1   2   3   4   5   6
  0    1   1   0   0   0   0   0
  1    1   2   0   0   0   0   0
  2    5   8   2   0   0   0   0
  3    0   6  11   2   1   0   0
  4    0   0   9   9  10   0   0
  5    0   0   1   4   6   3   0
  6    0   0   0   2   3   3  10
##      Fail Pass
## Fail   20    0
## Pass   27   53
The upper left cell in the table() output above shows that for 20 individuals,
the two raters both gave “Fail.” In the lower right cell, the two raters both
gave “Pass” 53 times. Together, these two totals represent the agreement in
ratings, 73 out of 100. The other cells in the table contain disagreements, where one
rater said “Pass” but the other said “Fail.” Disagreements happened a total
of 27 times. Based on these totals, what is the proportion agreement in the
pass/fail ratings?
Table 5.3 shows the full crosstab of raw scores from each rater, with scores
from rater 1 (essays$r1) in rows and rater 2 (essays$r2) in columns. The
bunching of scores around the diagonal from upper left to lower right shows
the tendency for agreement in scores.
Proportion agreement for the full rating scale, as shown in Table 5.3, can
be calculated by summing the agreement frequencies within the diagonal
elements of the table, and dividing by the total number of people.
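As a minimal sketch, assuming the essays data frame introduced above contains
the two raters' scores in r1 and r2, and that both raters used every score point
so the crosstab is square:
# Proportion agreement across the full rating scale
sum(diag(table(essays$r1, essays$r2))) / nrow(essays)
Summing the diagonal frequencies in Table 5.3 gives 30, so the proportion
agreement for the full scale is 0.30, much lower than for the dichotomous
pass/fail ratings.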
Finally, let’s consider the impact of chance agreement between one of the
hypothetical human raters and a monkey who randomly applies ratings,
regardless of the performance that is demonstrated, as with a coin toss.
## monkey
## Fail Pass
## Fail 7 13
## Pass 32 48
The results show that the hypothetical rater agrees with the monkey 55
percent of the time. Because we know that the monkey’s ratings were
completely random, we know that this proportion agreement is due entirely
to chance.
Proportion agreement is useful, but because it does not account for chance
agreement, it should not be used as the only measure of interrater consistency.
Kappa agreement is simply an adjusted form of proportion agreement that
takes chance agreement into account.
Equation (5.9) contains the formula for calculating kappa for two raters.
κ = (P_o − P_c) / (1 − P_c), (5.9)
where P_o is the observed proportion agreement and P_c is the proportion
agreement expected by chance.
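As a sketch of the calculation, using the pass/fail counts shown earlier (the
object name tab is hypothetical):
# Kappa by hand for the pass/fail crosstab of the two raters
tab <- matrix(c(20, 27, 0, 53), nrow = 2,
  dimnames = list(rater1 = c("Fail", "Pass"),
    rater2 = c("Fail", "Pass")))
n <- sum(tab)
po <- sum(diag(tab)) / n # observed proportion agreement
pc <- sum(rowSums(tab) * colSums(tab)) / n^2 # chance agreement
(po - pc) / (1 - pc) # kappa
Here P_o is 0.73 and P_c works out to about 0.52, giving a kappa of roughly
0.44, noticeably lower than the uncorrected proportion agreement.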
cor(essays$r1, essays$r2)
## [1] 0.8549923
dstudy(essays[, 1:2])
##
## Descriptive Study
##
## mean median sd skew kurt min max n na
## r1 3.83 4 1.5 -0.23 2.44 0 6 100 0
## r2 2.84 3 1.7 0.26 2.20 0 6 100 0
cor(essays$r1, essays$r2 + 1)
## [1] 0.8549923
Figure 5.4.: Scatter plots of simulated essay scores with a systematic difference around 0.5 points.
5.5. Generalizability theory
Recall from Equation (5.1) that in CTT the observed total score X is
separated into a simple sum of the true score T and error E. Given the
assumptions of the model, a similar separation works with the total score
variance:
σ²_X = σ²_T + σ²_E. (5.11)
Because observed variance is simply the sum of true and error variances, we
can rewrite the reliability coefficient in Equation (5.3) entirely in terms of
true scores and error scores:
r = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E). (5.12)
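As a quick numerical check of Equation (5.12), with hypothetical variance
components:
# Hypothetical true score and error variances
var_t <- 8
var_e <- 2
var_t / (var_t + var_e) # reliability of 0.80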
X = P + R + E. (5.13)
σ²_X = σ²_P + σ²_R + σ²_E. (5.14)
g = σ²_P / (σ²_P + σ²_R + σ²_E). (5.15)
If all of these equations are making your eyes cross, you should return to
the question we asked earlier: why would a person score differently on a test
from one administration to the next? The answer: because of differences in
the person, because of differences in the rating process, and because of error.
The goal of G theory is to compare the effects of these different sources on
the variability in scores.
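A minimal sketch of Equation (5.15), with made-up variance components for a
person by rater design:
# Hypothetical variance components: person, rater, and residual error
var_p <- 6
var_r <- 1
var_e <- 3
var_p / (var_p + var_r + var_e) # g coefficient of 0.60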
The definition of g in Equation (5.15) is a simple example of how we could
estimate reliability in a person by rater study design. As it breaks down
score variance into multiple reliable components, g can also be used in
more complex designs, where we would expect some other facet of the data
collection to lead to reliable variability in the scores. For example, if we
administer two essays on two occasions, and at each occasion the same two
raters provide a score for each subject, we have a fully crossed design with
three facets: raters, tasks (i.e., essays), and occasions (e.g., Goodwin, 2001).
When developing or evaluating a test, we should consider the study design
used to estimate reliability, and whether or not the appropriate facets are
used.
In addition to choosing the facets in our reliability study, there are three
other key considerations when estimating reliability with g. The first consid-
eration is whether we need to make absolute or relative score interpretations.
Absolute score interpretations account for systematic differences in scores,
whereas relative score interpretations do not. As mentioned above, the
correlation coefficient considers only the relative consistency of scores, and
relative ones.
5.6. Summary
5.6.1. Exercises
1. Explain the CTT model and its assumptions using the archery example
presented at the beginning of the chapter.
2. Use the R object scores to find the average variability in x1 for a
given value on t. How does this compare to the SEM?
3. Use PISA09 to calculate split-half reliabilities with different com-
binations of reading items in each half. Then, use these in the
Spearman-Brown formula to estimate reliabilities for the full-length
test. Why do the results differ?
4. Suppose you want to reduce the SEM for a final exam in a course you
are teaching. Identify three sources of measurement error that could
contribute to the SEM, and three that could not. Then, consider
strategies for reducing error from these sources.
5. Estimate the internal consistency reliability for the attitude toward
school scale. Remember to reverse code items as needed.
6. Dr. Phil is developing a measure of relationship quality to be used
in counseling settings with couples. He intends to administer the
measure to couples multiple times over a series of counseling sessions.
Describe an appropriate study design for examining the reliability of
this measure.
7. More and more TV shows lately seem to involve people performing
some talent on stage and then being critiqued by a panel of judges,
one of whom is British. Describe the “true score” for a performer in
this scenario, and identify sources of measurement error that could
result from the judging process, including both systematic and random
sources of error.
8. With proportion agreement, consider the following questions. When
would we expect to see 0% or nearly 0% agreement, if ever? What
would the counts in the table look like if there were 0% agreement?
When would we expect to see 100% or nearly 100% agreement, if
ever? What would the counts in the table look like if there were 100%
agreement?
9. What is the maximum possible value for kappa? And what would we
expect the minimum possible value to be?
10. Given the strengths and limitations of correlation as a measure of in-
terrater reliability, with what type of score referencing is this measure
of reliability most appropriate?
11. Compare the interrater agreement indices with interrater reliability
based on Pearson correlation. What makes the correlation coefficient
useful with interval data? What does it tell us, or what does it do,
that an agreement index does not?
12. Describe testing examples where different combinations of the three
G theory considerations are appropriate. For example, when would
we want a G coefficient that captures reliability for an average across
2 raters, where raters are a random effect, ignoring systematic differ-
ences?
6. Item Analysis
Chapter 5 covered topics that rely on statistical analyses of data from
educational and psychological measurements. These analyses are used
to examine the relationships among scores on one or more test forms,
in reliability, and scores based on ratings from two or more judges, in
interrater reliability. Aside from coefficient alpha, all of the statistical
analyses introduced so far focus on composite scores. Item analysis focuses
instead on statistical analysis of the items themselves that make up these
composites.
This chapter extends concepts from Chapters 2 and 5 to analysis of item
performance within a CTT framework. The chapter begins with an overview
of item analysis, including some general guidelines for preparing for an
item analysis, entering data, and assigning score values to individual items.
Some commonly used item statistics are then introduced and demonstrated.
Finally, two additional item-level analyses are discussed, differential item
functioning analysis and option analysis.
Learning objectives
In this chapter, we’ll run item and option analyses on PISA09 data using
epmr, with results plotted, as usual, using ggplot2.
As noted above, item analysis lets us examine the quality of individual test
items. Information about individual item quality can help us determine
whether or not an item is measuring the content and construct that it was
written to measure, and whether or not it is doing so at the appropriate
ability level. Because we are discussing item analysis here in the context
of CTT, we’ll assume that there is a single construct of interest, perhaps
being assessed across multiple related content areas, and that individual
items can contribute or detract from our measurement of that construct by
limiting or introducing construct irrelevant variance in the form of bias and
random measurement error.
Bias represents a systematic error with an influence on item performance
that can be attributed to an interaction between examinees and some feature
of the test. Bias in a test item leads examinees who share a known background
characteristic, aside from their ability, to perform better or worse on an
item simply because of that characteristic. For example, bias
sometimes results from the use of scenarios or examples in an item that are
more familiar to certain gender or ethnic groups. Differential familiarity
with item content can make an item more relevant, engaging, and more
easily understood, and can then lead to differential performance, even for
examinees of the same ability level. We identify such item bias primarily by
using measures of item difficulty and differential item functioning (DIF),
discussed below and again in Chapter 7.
Bias in a test item indicates that the item is measuring some other construct
besides the construct of interest, where systematic differences on the other
construct are interpreted as meaningful differences on the construct of
interest. The result is a negative impact on the validity of test scores
and corresponding inferences and interpretations. Random measurement
error on the other hand is not attributed to a specific identifiable source,
such as a second construct. Instead, measurement error is inconsistency of
measurement at the item level. An item that introduces measurement error
detracts from the overall internal consistency of the measure, and this is
detected in CTT, in part, using item analysis statistics.
6.1.2. Piloting
directly examined or not. Note also that bias and measurement error arise
in addition to this standard error or sampling error, and we cannot identify
bias in our test questions without representative data from our intended
population. Thus, adequate sampling in the pilot study phase is critical.
The item analysis statistics discussed here are based on the CTT model
of test performance. In Chapter 7 we’ll discuss the more complex item
response theory (IRT) and its applications in item analysis.
After piloting a set of items, raw item responses are organized into a data
frame with test takers in rows and items in columns. The str() function is
used here to summarize the structure of the unscored items on the PISA09
reading test. Each unscored item is coded in R as a factor with four to
eight factor levels. Each factor level represents different information about
a student’s response.
In addition to checking the structure of the data, it’s good practice to run
frequency tables on each variable. An example is shown below for a subset
of PISA09 reading items. The frequency distribution for each variable will
reveal any data entry errors that resulted in incorrect codes. Frequency
distributions should also match what we expect to see for correct and
incorrect response patterns and missing data.
PISA09 items that include a code or factor level of “0” are constructed-
response items, scored by raters. The remaining factor levels for these CR
items are coded “1” for full credit, “7” for not administered, “9” for missing,
and “r” for not reached, where the student ran out of time before responding
to the item. Selected-response items do not include a factor level of “0.”
Instead, they contain levels “1” through “5,” which correspond to
multiple-choice options one through five, and then codes of “7” for not
administered, “8” for an ambiguous selected response, “9” for missing, and
“r” again for not reached.
6.1.4. Scoring
6.2. Traditional item statistics
Three statistics are commonly used to evaluate the items within a scale or
test. These are item difficulty, discrimination, and alpha-if-item-deleted.
Each is presented below with examples based on PISA09.
Once we have established scoring schemes for each item in our test, and we
have applied them to item-response data from a sample of individuals, we can
utilize some basic descriptive statistics to examine item-level performance.
The first statistic is item difficulty, or, how easy or difficult each item is
for our sample. In cognitive testing, we talk about easiness and difficulty,
where test takers can get an item correct to different degrees, depending on
their ability or achievement. In noncognitive testing, we talk instead about
endorsement or likelihood of choosing the keyed response on the item, where
test takers are more or less likely to endorse an item, depending on their
level on the trait. In the discussions that follow, ability and trait can be
used interchangeably, as can correct/incorrect and keyed/unkeyed response,
and difficulty and endorsement. See Table 6.1 for a summary of these terms.
In CTT, the item difficulty is simply the mean score for an item. For
dichotomous 0/1 items, this mean is referred to as a p-value, since it
represents the proportion of examinees getting the item correct or choosing
the keyed response. With polytomous items, the mean is simply the average
score. When testing noncognitive traits, the term p-value may still be used.
However, instead of item difficulty we refer to endorsement of the item, with
proportion correct instead becoming proportion endorsed.
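As a sketch, p-values for dichotomously scored items can be obtained directly
as column means. The code below assumes the scored reading items are stored as
numeric 0/1 variables in PISA09, with the column names used elsewhere in this
chapter:
# p-values as item means, ignoring missing responses
round(colMeans(PISA09[, c("r414q06s", "r452q03s")], na.rm = TRUE), 2)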
Looking ahead to IRT, item difficulty will be estimated as the predicted
mean ability required to have a 50% chance of getting the item correct or
endorsing the item.
Here, we calculate p-values for the scored reading items, by item type. Item
PISA09$r452q03, a CR item, stands out from the rest as having a very low
p-value of 0.13. This tells us that only 13% of students who took this item
got it right. The next lowest p-value was 0.37.
For item r452q03, students read a short description of a scene from The
Play’s the Thing, shown in Appendix C. The question then is, “What were
the characters in the play doing just before the curtain went up?” This
question is difficult, in part, because the word “curtain” is not used in the
scene. So, the test taker must infer that the phrase “curtain went up” refers
to the start of a play. The question is also difficult because the actors in
this play are themselves pretending to be in a play. For additional details
on the item and the rubric used in scoring, see Appendix C.
Although difficult questions may be frustrating for students, sometimes
they’re necessary. Difficult or easy items, or items that are difficult or easy
to endorse, may be required given the purpose of the test. Recall that
the purpose of a test describes: the construct, what we’re measuring; the
population, with whom the construct is being measured; and the application
or intended use of scores. Some test purposes can only be met by including
some very difficult or very easy items. PISA, for example, is intended to
measure students along a continuum of reading ability. Without difficult
questions, more able students would not be measured as accurately. On the
other hand, a test may be intended to measure lower level reading skills,
which many students have already mastered. In this case, items with high
p-values would be expected. Without them, low ability students, who are
integral to the test purpose, would not be able to answer any items correctly.
Given their high means and p-values, we might conclude that these items are
not adequately measuring students with negative attitudes toward school,
assuming such students exist. Perhaps if an item were worded differently or
were written to ask about another aspect of schooling, such as the value of
homework, more negative attitudes would emerge. On the other hand, it
could be that students participating in PISA really do have overall positive
attitudes toward school, and regardless of the question they will tend to
have high scores.
This brings us to one of the major limitations of CTT item analysis: the
item statistics we compute are dependent on our sample of test takers. For
example, we assume a low p-value indicates the item was difficult, but it may
have simply been difficult for individuals in our sample. What if our sample
happens to be lower on the construct than the broader population? Then,
the items would tend to have lower means and p-values. If administered to
a sample higher on the construct, item means would be expected to increase.
Thus, the difficulty of an item is dependent on the ability of the individuals
taking it.
Because we estimate item difficulty and other item analysis statistics without
accounting for the ability or trait levels of individuals in our sample, we can
never be sure of how sample-dependent our results really are. This sample
dependence in CTT will be addressed in IRT.
Note that when you correlate something with itself, the result should be a
correlation of 1. When you correlate a component score, like an item, with a
composite that includes that component, the correlation will be inflated simply
because the component appears on both sides of the relationship. Correlations
between item responses and total scores can be “corrected” for this spurious
increase simply by excluding a given item when calculating the total. The
result is referred to as a corrected item-total correlation (CITC). ITC and
CITC are typically positively related with one another, and give relatively
similar results. However, CITC is preferred, as it is considered a more
conservative and more accurate estimate of discrimination.
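Here is a minimal sketch of the ITC and CITC calculations, assuming ritems is a
character vector holding the names of the scored reading items in PISA09 (the
vector itself is not shown here):
# Total scores across the scored reading items
rtotal <- rowSums(PISA09[, ritems], na.rm = TRUE)
# Item-total correlation for one item
cor(PISA09$r414q06s, rtotal, use = "complete.obs")
# Corrected item-total correlation, excluding the item from the total
cor(PISA09$r414q06s, rtotal - PISA09$r414q06s, use = "complete.obs")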
Figure 6.1 contains scatter plots for two CR reading items from PISA09, items
r414q06s and r452q03s. On the x-axis in each plot are total scores across
all reading items, and on the y-axis are the scored item responses for each
item. These plots help us visualize both item difficulty and discrimination.
Difficulty is reflected in the amount of data aligned with 0 versus 1 on the
y-axis. More data points at 0 indicate more students getting the item wrong.
Discrimination is then reflected in the bunching of data points at the low end
of the x-axis for item scores of 0, and at the high end of the x-axis for item
scores of 1.
Figure 6.1.: Scatter plots showing the relationship between total scores on the x-axis and dichotomous item responses for two PISA items (r414q06s and r452q03s) on the y-axis.
Suppose you had to guess a student’s reading ability based only on their
score from a single item. Which of the two items shown in Figure 6.1 would
best support your guess? Here’s a hint: it’s not item r452q03. Notice
how students who got item r452q03 wrong have rtotal scores that span
almost the entire score scale? People with nearly perfect scores still got this
item wrong. On the other hand, item r414q06 shows a faster tapering off
of students getting the item wrong as total scores increase, with a larger
bunching of students of high ability scoring correct on the item. So, item
r414q06 has a higher discrimination, and gives us more information about
the construct than item r452q03.
Next, we calculate the ITC and CITC “by hand” for the first attitude
toward school item, which was reverse coded as st33q01r. There is a
sizable difference between the ITC and the CITC for this item, likely
because the scale is so short to begin with. By removing the item from the
total score, we reduce our scale length by 25%, and, presumably, our total
score becomes that much less representative of the construct. Discrimination
for the remaining attitude items will be examined later.
The last item analysis statistic we’ll consider here indexes how individual
items impact the overall internal consistency reliability of the scale. Internal
consistency is estimated via coefficient alpha, introduced in Chapter 5.
Alpha tells us how well our items work together as a set. “Working together”
refers to how consistently the item responses change, overall, in similar ways.
A high coefficient alpha tells us that people tend to respond in similar ways
from one item to the next. If coefficient alpha were perfectly 1.00, we would
know that each person responded in exactly the same rank-ordered way
across all items. An item’s contribution to internal consistency is measured
by estimating alpha with that item removed from the set. The result is a
statistic called alpha-if-item-deleted (AID).
AID answers the question, what happens to the internal consistency of a
set of items when a given item is removed from the set? Because it involves
the removal of an item, higher AID indicates a potential increase in internal
consistency when an item is removed. Thus, when it is retained, the item is
actually detracting from the internal consistency of the scale. Items that
detract from the internal consistency should be considered for removal.
To clarify, it is bad news for an item if the AID is higher than the overall
alpha for the full scale. It is good news for an item if AID is lower than
alpha for the scale.
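In practice, the istudy() function described next reports AID directly. The
following is only a bare-bones sketch of the computation, assuming x is a data
frame of numeric scored item responses with any needed reverse coding already
applied:
# Coefficient alpha and alpha-if-item-deleted from scratch
alpha <- function(x) {
  x <- na.omit(x)
  k <- ncol(x)
  (k / (k - 1)) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
}
aid <- function(x)
  sapply(seq_len(ncol(x)), function(i) alpha(x[, -i, drop = FALSE]))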
The istudy() function in the epmr package estimates AID, along with the
other item statistics presented so far. AID is shown in the last column of
the output. For the PISA09 reading items, the overall alpha is 0.7596, which
is an acceptable level of internal consistency for a low-stakes measure like
this (see Table 5.1). The AID results then tell us that alpha never increases
beyond its original level after removing individual items, so, good news.
Instead, alpha decreases to different degrees when items are removed. The
lowest AID is 0.7251 for item r414q06. Removal of this item results in the
largest decrease in internal consistency.
Note that discrimination and AID are typically positively related with one
another. Discriminating items tend also to contribute to internal consistency.
However, these two item statistics technically measure different things, and
they need not correspond to one another. Thus, both should be considered
when evaluating items in practice.
Now that we’ve covered the three major item analysis statistics, difficulty,
discrimination, and contribution to internal consistency, we need to examine
how they’re used together to build a set of items. All of the scales in PISA09
have already gone through extensive piloting and item analysis, so we’ll
work with a hypothetical scenario to make things more interesting.
Let’s remove these two lower-quality items and check the new results. The
means and SD should stay the same for this new item analysis. However,
the remaining statistics, ITC, CITC, and AID, all depend on the full set, so
we would expect them to change. The results indicate that all of our AID
are below alpha for the full set of nine items. The CITC are acceptable for a
low-stakes test. However, item r414q09 has the weakest discrimination,
making it the best candidate for removal, all else equal.
Note that the reading items included in PISA09 were not developed to
function as a reading scale. Instead, these are merely a sample of items from
the study, the items with content that was made publicly available. Also, in
practice, an item analysis will typically involve other considerations besides
the statistics we are covering here, most importantly, content coverage.
Before removing an item from a set, we should consider how the test
outline will be impacted, and whether or not the balance of content is still
appropriate.
6.3. Additional analyses
Whereas item analysis is useful for evaluating the quality of items and their
contribution to a test or scale, other analyses are available for digging deeper
into the strengths and weaknesses of items. Two commonly used analyses
are option analysis, also called distractor analysis, and differential item
functioning analysis.
So far, our item analyses have focused on scored item responses. However,
data from unscored SR items can also provide useful information about
test takers. An option analysis involves the examination of unscored item
responses by ability or trait groupings for each option in a selected-response
item. Relationships between the construct and response patterns over keyed
and unkeyed options can give us insights into whether or not response
options are functioning as intended.
The ostudy() function from epmr takes a matrix of unscored item responses,
along with some information about the construct (a grouping variable, a
vector of ability or trait scores, or a vector containing the keys for each
item) and returns crosstabs for each item that break down response choices
by construct groupings. By default, the construct is categorized into three
equal-interval groups, with labels “low”, “mid”, and “high.”
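A rough base R version of the same idea is sketched below, assuming rtotal
holds total scores on the scored reading items (as in the figures above) and
PISA09$r458q04 holds the unscored responses for one SR item:
# Break the construct into three equal-interval groups
groups <- cut(rtotal, breaks = 3, labels = c("low", "mid", "high"))
# Proportion choosing each response option within each group
round(prop.table(table(PISA09$r458q04, groups), margin = 2), 2)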
Consider the distribution over response choices for high ability students, in
the last column of the crosstab for item r458q04. The large majority of
high ability students, 89%, chose the second option on this item, which is
also the correct or keyed option. On the other hand, low ability students, in
the left column of the crosstab, are spread out across the different options,
with only 31% choosing the correct option. This is what we hope to see for
an item that discriminates well.
6.4. Summary
and that item analysis results are based on a representative sample from
our population of test takers. Chapter 7 builds on the concepts introduced
here by extending them to the more complex but also more popular IRT
model of test performance.
6.4.1. Exercises
7. Item Response Theory
One could make a case that item response theory is the most
important statistical method about which most of us know
little or nothing.
— David Kenny
Item response theory (IRT) is arguably one of the most influential devel-
opments in the field of educational and psychological measurement. IRT
provides a foundation for statistical methods that are utilized in contexts
such as test development, item analysis, equating, item banking, and com-
puterized adaptive testing. Its applications also extend to the measurement
of a variety of latent constructs in a variety of disciplines.
This chapter begins with a comparison of IRT with classical test theory
(CTT), including a discussion of strengths and weaknesses and some typical
uses of each. Next, the traditional dichotomous IRT models are introduced
with definitions of key terms and a comparison based on assumptions,
benefits, limitations, and uses. Finally, details are provided on applications
of IRT in item analysis, test development, item banking, and computer
adaptive testing.
Learning objectives
4. Define the three item parameters and one ability parameter in the
traditional IRT models, and describe the role of each in modeling
performance with the IIF.
5. Distinguish between the 1PL, 2PL, and 3PL IRT models in terms
of assumptions made, benefits and limitations, and applications of
each.
6. Describe how IRT is utilized in item analysis, test development,
item banking, and computer adaptive testing.
In this chapter, we’ll use epmr to run IRT analyses with PISA09 data, and
we’ll use ggplot2 for plotting results.
Since its development in the 1950s and 1960s (Lord, 1952; Rasch, 1960),
IRT has become the preferred statistical methodology for item analysis
and test development. The success of IRT over its predecessor CTT comes
primarily from the focus in IRT on the individual components that make
up a test, that is, the items themselves. By modeling outcomes at the item
level, rather than at the test level as in CTT, IRT is more complex but
also more comprehensive in terms of the information it provides about test
performance.
As presented in Chapter 5, CTT gives us a model for the observed total
score X. This model decomposes X into two parts, truth T and error E:
X = T + E. (7.1)
The true score T is the construct we’re intending to measure, and we assume
it plays some systematic role in causing people to obtain observed scores on
X. The error E is everything randomly unrelated to the construct we’re
intending to measure.
7.1. IRT versus CTT
For example, suppose an instructor gives the same final exam to a new
classroom of students each semester. At the first administration, the CITC
discrimination for one item is 0.08. That’s low, and it suggests that there’s
a problem with the item. However, in the second administration of the
same exam to another group of students, the same item is found to have
a CITC of 0.52. Which of these discriminations is correct? According to
CTT, they’re both correct, for the sample with which they are calculated.
In CTT there is technically no absolute item difficulty or discrimination
that generalizes across samples or populations of examinees. The same
goes for ability estimates. If two students take different final exams for
the same course, each with different items but the same number of items,
ability estimates will depend on the difficulty and quality of the respective
exams. There is no absolute ability estimate that generalizes across samples
of items. This is the main limitation of CTT: parameters that come from
the model are sample and test dependent.
A second major limitation of CTT results from the fact that the model is
specified using total scores. Because we rely on total scores in CTT, a given
test only produces one estimate of reliability and, thus, one estimate of SEM,
and these are assumed to be unchanging for all people taking the test. The
measurement error we expect to see in scores would be the same regardless
of level on the construct. This limitation is especially problematic when
test items do not match the ability level of a particular group of people.
For example, consider a comprehensive vocabulary test covering all of the
words taught in a fourth grade classroom. The test is given to a group of
students just starting fourth grade, and another group who just completed
fourth grade and is starting fifth. Students who have had the opportunity
to learn the test content should respond more reliably than students who
have not. Yet, the test itself has a single reliability and SEM that would
be used to estimate measurement error for any score. Thus, the second
major limitation of CTT is that reliability and SEM are constant and do
not depend on the construct.
IRT addresses the limitations of CTT, the limitations of sample and test
dependence and a single constant SEM. As in CTT, IRT also provides a
model of test performance. However, the model is defined at the item level,
meaning there is, essentially, a separate model equation for each item in
the test. So, IRT involves multiple item score models, as opposed to a
single total score model. When the assumptions of the model are met, IRT
parameters are, in theory, sample and item independent. This means that a
person should have the same ability estimate no matter which set of items
she or he takes, assuming the items pertain to the same test. And in IRT,
a given item should have the same difficulty and discrimination no matter
who is taking the test.
IRT also takes into account the difficulty of the items that a person responds
to when estimating the person’s ability or trait level. Although the construct
estimate itself, in theory, does not depend on the items, the precision with
which we estimate it does depend on the items taken. Estimates of the
ability or trait are more precise when they’re based on items that are close
to a person’s construct level. Precision decreases when there are mismatches
between person construct and item difficulty. Thus, SEM in IRT can vary
by the ability of the person and the characteristics of the items given.
The main limitation of IRT is that it is a complex model requiring much
larger samples of people than would be needed to utilize CTT. Whereas in
CTT the recommended minimum is 100 examinees for conducting an item
analysis (see Chapter 6), in IRT, as many as 500 or 1000 examinees may be
needed to obtain stable results, depending on the complexity of the chosen
model.
Another key difference between IRT and CTT has to do with the shape of
the relationship that we estimate between item score and construct score.
The CTT discrimination models a simple linear relationship between the
two, whereas IRT models a curvilinear relationship between them. Recall
from Chapter 6 that the discrimination for an item can be visualized within
a scatter plot, with the construct on the x-axis and item scores on the y-axis.
A strong positive item discrimination would be shown by points for incorrect
scores bunching up at the bottom of the scale, and points for correct scores
bunching up at the top. A line passing through these points would then
have a positive slope. Because they’re based on correlations, ITC and CITC
discriminations are always represented by a straight line. See Figure 7.1 for
an example based on PISA09 reading item r452q06s.
Figure 7.1.: Scatter plots showing the relationship between total scores on the x-axis and scores from PISA item r452q06s on the y-axis. Lines represent the relationship between the construct and item scores for CTT (straight) and IRT (curved).
the item correct steadily increases. At a certain total score, around 5.5, we
see roughly half the people get the item correct. Then, as we continue up
the theta scale, more and more people get the item correct. IRT is used to
capture the trend shown by the conditional p-values in Figure 7.1.
7.2.1. Terminology
Understanding IRT requires that we master some new terms. First, in IRT
the underlying construct is referred to as theta or θ. Theta refers to the same
construct we discussed in CTT, reliability, and our earlier measurement
models. The underlying construct is the unobserved variable that we assume
causes the behavior we observe in item responses. Different test takers are
at different levels on the construct, and this results in them having different
response patterns. In IRT we model these response patterns as a function
of the construct theta. The theta scale is another name for the ability or
trait scale.
Second, the dependent variable in IRT differs from CTT. The dependent
variable is found on the left of the model, as in a regression or other statistical
model. The dependent variable in the CTT model is the total observed score
X. The IRT model instead has an item score as the dependent variable.
The model then predicts the probability of a correct response for a given
item, denoted Pr(X = 1).
Finally, in CTT we focus only on the person construct T within the model
itself. The dependent variable X is modeled as a function of T , and whatever
is left over via E. In IRT, we include the construct, now θ, along with
parameters for the item that characterize its difficulty, discrimination, and
lower-asymptote.
In the discussion that follows, we will frequently use the term function. A
function is simply an equation that produces an output for each input we
supply to it. The CTT model could be considered a function, as each T has
a single corresponding X that is influenced, in part, by E. The IRT model
also produces an output, but this output is now at the item level, and it is
a probability of correct response for a given item. We can plug in θ, and
get a prediction for the performance we’d expect on the item for that level
on the construct. In IRT, this prediction of item performance depends on
the item as well as the construct.
The IRT model for a given item has a special name in IRT. It’s called the
item response function (IRF), because it can be used to predict an item
response. Each item has its own IRF. We can add up all the IRF in a test
to obtain a test response function (TRF) that predicts not item scores but
total scores on the test.
The last IRT function we’ll discuss here gives us the SEM for our test. This
is called the test error function (TEF). As with all the other IRT functions,
there is an equation that is used to estimate the function output, in this
case, the standard error of measurement, for each point on the theta scale.
This overview of IRT terminology should help clarify the benefits of IRT
over CTT. Recall that the main limitations of CTT are: 1) sample and
test dependence, where our estimates of construct levels depend on the
items in our test, and our estimates of item parameters depend on the
construct levels for our sample of test takers; and 2) reliability and SEM
that do not change depending on the construct. IRT addresses both of these
limitations. The IRT model estimates the dependent variable of the model
while accounting for both the construct and the properties of the item. As
a result, estimates of ability or trait levels and item analysis statistics will
be sample and test independent. This will be discussed further below. The
IRT model also produces, via the TEF, measurement error estimates that
vary by theta. Thus, the accuracy of a test depends on where the items are
located on the construct.
Here’s a recap of the key terms we’ll be using throughout this chapter:
• The logistic curve is the name for the shape we use to model
performance via the IRF. It is a curve with certain properties, such
as horizontal lower and upper asymptotes.
• Functions are simply equations that produce a single output value for
each value on the theta scale. IRT functions include the IRF, TRF,
IIF, TIF, and TEF.
• Information refers to a summary of the discriminating power provided
by an item or test.
We’ll now use the terminology above to compare three traditional IRT
models. Equation (7.2) contains what is referred to as the three-parameter
IRT model, because it includes all three available item parameters. The
model is usually labeled 3PL, for 3 parameter logistic. As noted above, in
IRT we model the probability of correct response on a given item (Pr(X = 1))
as a function of person ability (θ) and certain properties of the item itself,
namely: a, how well the item discriminates between low and high ability
examinees; b, how difficult the item is, or the construct level at which we’d
expect people to have a probability Pr = 0.50 of getting the keyed item
response; and c, the lowest Pr that we’d expect to see on the item by chance
alone.
Pr(X = 1) = c + (1 − c) e^(a(θ − b)) / (1 + e^(a(θ − b))). (7.2)
The a and b parameters should be familiar. They are the IRT versions
of the item analysis statistics from CTT, presented in Chapter 6. The
a parameter corresponds to ITC, where a larger a indicates stronger
discrimination. The b parameter corresponds to the opposite of the p-value,
where a low b indicates an easy item and a high b indicates a difficult item;
a higher b requires a higher θ for a correct response. The c parameter should then
be pretty intuitive if you think of its application to multiple-choice questions.
When people low on the construct guess randomly on a multiple-choice item,
the c parameter attempts to capture the chance of getting the item correct.
In IRT, we acknowledge with the c parameter that the probability of correct
response may never be zero. The smallest probability of correct response
produced by Equation (7.2) will be c.
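Equation (7.2) is easy to translate into R. The function below is a hand-rolled
sketch, not the epmr implementation; rirf(), used later in the chapter, plays a
similar role for estimated item parameters.
# IRF for the 3PL model in Equation (7.2)
irf3pl <- function(theta, a = 1, b = 0, c = 0) {
  c + (1 - c) * exp(a * (theta - b)) / (1 + exp(a * (theta - b)))
}
# Predicted probabilities for a hypothetical item with a = 1.5,
# b = 0.5, and c = 0.2, at three levels of theta
irf3pl(c(-2, 0, 2), a = 1.5, b = 0.5, c = 0.2)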
To better understand the IRF in Equation (7.2), focus first on the difference
we take between theta and item difficulty, in (θ − b). Suppose we’re using
a cognitive test that measures some ability. If someone is high ability and
taking a very easy item, with low b, we’re going to get a large difference
between theta and b. This large difference filters through the rest of the
equation to give us a higher prediction of how well the person will do on
the item. This difference is multiplied by the discrimination parameter,
so that, if the item is highly discriminating, the difference between ability
and difficulty is magnified. If the discrimination is low, for example, 0.50,
the difference between ability and difficulty is cut in half before we use it
to determine probability of correct response. The fractional part and the
exponential term represented by e serve to make the straight line of ITC
into a nice curve with a lower and upper asymptote at c and 1. Everything
on the right side of Equation (7.2) is used to estimate the left side, that is,
how well a person with a given ability would do on a given item.
Figure 7.2 contains IRF for five items with different item parameters. Let’s
examine the item with the IRF shown by the black line. This item would
be considered the most difficult of this set, as it is located the furthest
to the right. We only begin to predict that a person will get the item
correct once we move past theta 0. The actual item difficulty, captured
by the b parameter, is 3. This is found as the theta required to have a
probability of 0.50 of getting the keyed response. This item also has the
highest discrimination, as it is steeper than any other item. It is most useful
for distinguishing between probabilities of correct response around theta 3,
its b parameter; below and above this value, the item does not discriminate
as well, as the IRF becomes more horizontal. Finally, this item appears to
have a lower-asymptote of 0, suggesting test takers are likely not guessing
on the item.
Figure 7.2.: IRF for five items with different item parameters.
Examine the remaining IRF in Figure 7.2. You should be able to compare
the items in terms of easiness and difficulty, low and high discrimination,
and low and high predicted probability of guessing correctly. Below are
some specific questions and answers for comparing the items.
• Which item has the highest discrimination? Black, with the steepest
slope.
• Which item are you most likely to guess correct? Green, cyan, and
red appear to have the highest lower asymptotes.
Fixing the lower asymptote c to 0 removes the guessing parameter and gives the
two-parameter (2PL) model:
Pr(X = 1) = e^(a(θ − b)) / (1 + e^(a(θ − b))). (7.3)
Fixing the discrimination a to 1 as well gives the one-parameter (1PL) or Rasch
model:
Pr(X = 1) = e^(θ − b) / (1 + e^(θ − b)). (7.4)
Zero guessing and constant discrimination may seem like unrealistic assump-
tions, but the Rasch model is commonly used operationally. The PISA
studies, for example, utilize a form of Rasch modeling. The popularity of
the model is due to its simplicity. It requires smaller sample sizes (100 to
200 people per item may suffice) than the 2PL and 3PL (requiring 500 or
more people). The theta scale produced by the Rasch model can also have
certain desirable properties, such as consistent score intervals (see de Ayala,
2009).
7.2.3. Assumptions
The three traditional IRT models discussed above all involve two main
assumptions, both of them having to do with the overall requirement that
the model we chose is “correct” for a given situation. This correctness is
defined based on 1) the dimensionality of the construct, that is, how many
constructs are causing people to respond in a certain way to the items, and
2) the shape of the IRF, that is, which of the three item parameters are
necessary for modeling item performance.
The first assumption is that a single construct, or dimension, underlies
responses to the items. The second assumption in IRT is that we’ve chosen the correct shape for our
IRF. This implies that we have a choice regarding which item parameters
to include, whether only b in the 1PL or Rasch model, b and a in the 2PL,
or b, a, and c in the 3PL. So, in terms of shape, we assume that there is a
nonlinear relationship between ability and probability of correct response,
and this nonlinear relationship is captured completely by up to three item
parameters.
Note that anytime we assume a given item parameter, for example, the
c parameter, is unnecessary in a model, it is fixed to a certain value for
all items. For example, in the Rasch and two-parameter IRT models, the
c parameter is typically fixed to 0, which means we are assuming that
guessing is not an issue. In the Rasch model we also assume that all items
discriminate in the same way, and a is fixed to 1; then, the only item
parameter we estimate is item difficulty.
7.3. Applications
7.3.1. In practice
As is true when comparing other statistical models, the choice of Rasch, 1PL,
2PL, or 3PL should be based on considerations of theory, model assumptions,
and sample size.
Because of its simplicity and lower sample size requirements, the Rasch
model is commonly used in small-scale achievement and aptitude testing,
for example, with assessments developed and used at the district level, or
instruments designed for use in research or lower-stakes decision-making.
The IGDI measures discussed in Chapter 2 are developed using the Rasch
model. The MAP tests, published by Northwest Evaluation Association,
are also based on a Rasch model. Some consider the Rasch model most
appropriate for theoretical reasons. In this case, it is argued that we should
seek to develop tests that have items that discriminate equally well; items
that differ in discrimination should be replaced with ones that do not.
Others utilize the Rasch model as a simplified IRT model, where the sample
size needed to accurately estimate different item discriminations and lower
asymptotes cannot be obtained. Either way, when using the Rasch model,
we should be confident in our assumption that differences between items in
discrimination and lower asymptote are negligible.
The 2PL and 3PL models are often used in larger-scale testing situations, for
example, on high-stakes tests such as the GRE and ACT. The large samples
available with these tests support the additional estimation required by
these models. And proponents of the two-parameter and three-parameter
models often argue that it is unreasonable to assume zero lower asymptote,
or equal discriminations across items.
In terms of the properties of the model itself, as mentioned above, IRT
overcomes the CTT limitation of sample and item dependence. As a result,
ability estimates from an IRT model should not depend on the sample of
items used to estimate ability, and item parameter estimates should not
depend on the sample of people used to estimate them. An explanation of
how this is possible is beyond the scope of this book. Instead, it is important
to remember that, in theory, when IRT is correctly applied, the resulting
parameters are sample and item independent. As a result, they can be
generalized across samples for a given population of people and test items.
IRT is useful first in item analysis, where we pilot test a set of items and
then examine item difficulty and discrimination, as discussed with CTT
in Chapter 6. The benefit of IRT over CTT is that we can accumulate
difficulty and discrimination statistics for items over multiple samples of
people, and they are, in theory, always expressed on the same scale. So,
our item analysis results are sample independent. This is especially useful
for tests that are maintained across more than one administration. Many
admissions tests, for example, have been in use for decades. State tests, as
another example, must also maintain comparable item statistics from year
to year, since new groups of students take the tests each year.
Item banking refers to the process of storing items for use in future, po-
tentially undeveloped, forms of a test. Because IRT allows us to estimate
sample independent item parameters, we can estimate parameters for certain
items using pilot data, that is, before the items are used operationally. This
is what happens in a computer adaptive test. For example, the difficulty
of a bank of items is known, typically from pilot administrations. When
you sit down to take the test, an item of known average difficulty can then
be administered. If you get the item correct, you are given a more difficult
item. The process continues, with the difficulty of the items adapting based
on your performance, until the computer is confident it has identified your
ability level. In this way, computer adaptive testing relies heavily on IRT.
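As a toy sketch of the item selection step, under a Rasch-type approach where
the next item is the unused one whose difficulty is closest to the current
theta estimate (the item bank below is entirely hypothetical):
# Hypothetical item bank with known difficulties
bank <- data.frame(item = paste0("i", 1:5),
  b = c(-1.5, -0.5, 0, 0.8, 1.6))
next_item <- function(theta, bank, used = character(0)) {
  avail <- bank[!bank$item %in% used, ]
  avail$item[which.min(abs(avail$b - theta))]
}
next_item(theta = 0.3, bank) # returns the item with b nearest 0.3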
7.3.2. Examples
The epmr package contains functions for estimating and manipulating results
from the Rasch model. Other R packages are available for estimating the
2PL and 3PL (e.g., Rizopoulos, 2006). Commercial software packages are
also available, and are often used when IRT is applied operationally.
Here, we estimate the Rasch model for PISA09 students in Great Britain.
The data set was created at the beginning of the chapter. The irtstudy()
function in epmr runs the Rasch model and prints some summary results,
including model fit indices and variability estimates for the random effects. Note
that the model is being fit as a generalized linear model with random person
and random item effects, using the lme4 package (Bates et al., 2015). The
estimation process won’t be discussed here. For details, see Doran et al.
(2007) and De Boeck et al. (2011).
head(irtgbr$ip)
## a b c
## r414q02s 1 -0.02535278 0
## r414q11s 1 0.61180610 0
## r414q06s 1 -0.11701587 0
## r414q09s 1 -0.96943042 0
## r452q03s 1 2.80576876 0
## r452q04s 1 -0.48851334 0
Figure 7.3 shows the IRF for a subset of the PISA09 reading items based on
data from Great Britain. These items pertain to the prompt “The play’s the
thing” in Appendix C.2. Item parameters are taken from irtgbr$ip and
supplied to the rirf() function from epmr. This function is simply Equation
(7.2) translated into R code. Thus, when provided with item parameters
and a vector of thetas, rirf() returns the corresponding Pr(X = 1).
# Get IRF for the set of GBR reading item parameters and a
# vector of thetas
# Note the default thetas of seq(-4, 4, length = 100)
# could also be used
irfgbr <- rirf(irtgbr$ip, seq(-6, 6, length = 100))
# Plot IRF for items r452q03s, r452q04s, r452q06s, and
# r452q07s
ggplot(irfgbr, aes(theta)) + scale_y_continuous("Pr(X)") +
geom_line(aes(y = irfgbr$r452q03s, col = "r452q03")) +
geom_line(aes(y = irfgbr$r452q04s, col = "r452q04")) +
geom_line(aes(y = irfgbr$r452q06s, col = "r452q06")) +
geom_line(aes(y = irfgbr$r452q07s, col = "r452q07")) +
scale_colour_discrete(name = "item")
Figure 7.3.: IRF for PISA09 reading items from “The play’s the thing” based on students in Great Britain.
The rtef() function is used above with the default vector of theta seq(-4,
4, length = 100). However, SEM can be obtained for any theta based on
the item parameters provided. Suppose we want to estimate the SEM for a
high ability student who only takes low difficulty items. This constitutes
a mismatch in item and construct, which produces a higher SEM. These
SEM are interpreted like SEM from CTT, as the average variability we’d
expect to see in a score due to measurement error. They can be used to
build confidence intervals around theta for an individual.
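For example, an approximate 95% confidence interval for a student at theta = 1
could be built as follows, assuming the rtef() output is a data frame with
theta and se columns, as the printed output below suggests:
# Approximate 95% confidence interval around theta = 1
tef1 <- rtef(irtgbr$ip, theta = 1)
tef1$theta + c(-1.96, 1.96) * tef1$se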
# SEM for theta 3 based on the four easiest and the four
# most difficult items
rtef(irtgbr$ip[c(4, 6, 9, 11), ], theta = 3)
## theta se
## 1 3 2.900782
rtef(irtgbr$ip[c(2, 7, 8, 10), ], theta = 3)
## theta se
## 1 3 1.996224
Figure 7.4.: SEM for two subsets of PISA09 reading items based on students in Great Britain.
The reciprocal of the SEM function, via the TEF, is the test information
function. This simply shows us where on the theta scale our test items
are accumulating the most discrimination power, and, as a result, where
measurement will be the most accurate. Higher information corresponds to
higher accuracy and lower SEM. Figure 7.5 shows the test information for
all the reading items, again based on students in Great Britain. Information
is highest where SEM is lowest, at the center of the theta scale.
Figure 7.5.: Test information for PISA09 reading items based on students in Great Britain.
Finally, just as the IRF can be used to predict probability of correct response
on an item, given theta and the item parameters, the TRF can be used
to predict total score given theta and parameters for each item in the test.
The TRF lets us convert person theta back into the raw score metric for
the test. Similar to the IRF, the TRF is asymptotic at 0 and the number of
dichotomous items in the test, in this case, 11. Thus, no matter how high
or how low theta, our predicted total score can’t exceed the bounds of the
raw score scale. Figure 7.6 shows the test response function for the Great
Britain results.
Figure 7.6.: Test response function for PISA09 reading items based on students in Great Britain.
7.4. Summary
This chapter provided an introduction to IRT, with a comparison to CTT,
and details regarding the three traditional, dichotomous, unidimensional
IRT models. Assumptions of the models and some testing applications were
presented. The Rasch model was demonstrated using data from PISA09,
with examples of the IRF, TEF, TIF, and TRF.
7.4.1. Exercises
1. Sketch out a plot of IRF for the following two items: a difficult item
1 with a high discrimination and negligible lower asymptote, and an
easier item 2 with low discrimination and high lower asymptote. Be
sure to label the axes of your plot.
2. Sketch another plot of IRF for two items having the same difficulties,
but different discriminations and lower asymptotes.
3. Examine the IRF for the remaining PISA09 reading items for Great
Britain. Check the content of the items in Appendix C to see what
features of an item or prompt seemed to make it relatively easier or
more difficult.
4. Using the PISA09 reading test results for Great Britain, find the
predicted total scores associated with thetas of -1, 0, and 1.
5. Estimate the Rasch model for the PISA09 memorization strategies
scale. First, dichotomize responses by recoding 1 to 0, and the
remaining valid responses to 1. After fitting the model, plot the IRF
for each item.
6. Based on the distribution of item difficulties for the memorization
scale, where should SEM be lowest? Check the SEM by plotting the
TEF for the full scale.
8. Factor Analysis
Learning objectives
1. Compare and contrast the factor analytic model with other mea-
surement models, including CTT and IRT, in terms of their appli-
cations in instrument development.
2. Describe the differing purposes of exploratory and confirmatory
factor analysis.
3. Explain how an EFA is implemented, including the type of data
required and steps in setting up and fitting the model.
4. Interpret EFA results, including factor loadings and eigenvalues.
5. Use a scree plot to visually compare factors in an EFA.
In this chapter we will run EFA and plot results using epmr functions. We’ll
also install a new package called lavaan for running CFA.
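Assuming both packages are installed, they can be loaded in the usual way;
lavaan is on CRAN and can be installed with install.packages() if needed.
# Load packages for EFA (epmr) and CFA (lavaan)
# install.packages("lavaan")
library("epmr")
library("lavaan")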
So, we now have three terms, construct, latent trait, and factor, that all
refer essentially to the same thing, the underlying, unobserved variable of
interest that we expect is causing people to differ in their test scores.
Note the emphasis here on the number of factors being smaller than the
number of observed variables. Factor analysis is a method for summarizing
correlated data, for reducing a set of related observed variables to their
essential parts, the unobserved parts they share in common. We cannot
analyze more factors than we have observed variables. Instead, we’ll usually
aim for the simplest factor structure possible, that is, the fewest factors
needed to capture the gist of our data.
In testing applications, factor analysis will often precede the use of a more
specific measurement model such as CTT or IRT. In addition to helping us
summarize correlated data in terms of their underlying essential components,
factor analysis also tells us how our items relate to or load on the factors we
analyze. When piloting a test, we might find that some items relate more
strongly to one factor than another. Other items may not relate well to
any of the identified factors, and these might be investigated further and
perhaps removed from the test. After the factor analysis is complete, CTT
or IRT could then be used to evaluate the test further, and to estimate test
takers’ scores on the construct. In this way, factor analysis serves as a tool
for examining the different factors underlying our data, and refining our test
to include items that best assess the factors we’re interested in measuring.
Recall that classical item analysis, discussed in Chapter 6, can also be used
to refine a test based on features such as item difficulty and discrimination.
Discrimination in CTT, which resembles the slope parameter in IRT, is
similar to a factor loading in factor analysis. A factor loading is simply an
estimate of the relationship between a given item and a given factor. Again,
the main difference in factor analysis is that we’ll often examine multiple
factors, and the loadings of items on each one, rather than a single factor
as in CTT.
Figure 8.1.: A simple exploratory factor analysis model for six items loading
on two correlated factors.
Some questions we might ask about the BDI-II include, do the items all
measure the same thing? Or do they measure different things? Is depression
a unidimensional construct? Or is it multidimensional, involving different
underlying components? And do these components capture meaningful
amounts of variability in observed depression scores? Or do scores func-
tion independently with little shared variability between them? These are
questions we can begin to answer with EFA.
Suppose that a single factor is best, with all of the items loading strongly
on it and only loading to a small degree on other factors. In this case,
we could assume that the single factor represents an overall measure of
depression, and all of the items contribute to the measure. Alternatively,
suppose that the items don’t consistently load on any one factor, but instead
are influenced by numerous less important ones. In this case, each item
may measure a distinct and unique component of depression that does not
relate to the other measured components. Finally, what if the items tend
to function well together in groups? Perhaps some items tend to correlate
strongly with one another because they all measure a similar component of
depression.
The presence of two factors on the BDI-II suggests that subscores based on
each factor may provide more detailed information about where an individ-
ual’s depressive symptoms lie. For example, consider two patients each with
the same total score of 18 across all 21 items. Using the published cutoffs,
each patient would be categorized as having mild depression. However, one
patient may have scored highly only on the somatic items, whereas the other
may have scored highly only on the cognitive ones. These differences are
not evident in a single total score. Factor analysis suggests that subscore
interpretations in cases like this may be justified.
We’ll come back to the BDI-II later on, when we use a confirmatory factor
analysis to examine the correlations in BDI.
The BDI-II example above brings up some of the important steps in con-
ducting an EFA. These steps involve choosing an initial number of factors
to explore, preparing our data and fitting the EFA model, examining pat-
terns in factor loadings and error terms, and evaluating factor quality and
choosing a final factor structure.
1. Choose the number of factors
We start an EFA by choosing the number of factors that we want to ex-
plore, potentially up to the number of items in our data set, but preferably
much fewer. This choice may simply be based on apparent similarities in
the content of our test items. For example, in the field of industrial and
organizational psychology, a test developer may be tasked with defining the
key features of a particular job, so as to create a measure that identifies
employees who are most likely to succeed in that job. Test content could
be based initially on conversations with employees about their job responsi-
bilities and the skills and traits they feel are most important to doing well.
The test developer can then look for trends or themes in the test content.
Perhaps some items have to do with loyalty, others with organizational skills,
and the rest with being outgoing. Three factors seem appropriate here.
The initial choice of a number of factors may similarly be based on a test
outline showing the intended structure of an instrument, with items already
written specifically to fit into distinct scales. For example, educational tests
are often designed from the beginning to assess one or more content domains.
These domains usually map onto scales or subscales within a test. In turn,
they can inform our choice of the number of factors to explore in an EFA.
When fitting an EFA, we will allow the model to identify more factors than
we expect or hope to end up with. Some EFA software will automatically
fit the maximum possible number of factors. R requires that we specify a
number. By allowing for more factors than we initially choose, we get an
idea of how our expectation compares with less parsimonious solutions. The
maximum possible number of factors in an EFA depends on our sample
size and the number of items in our test. A general guideline is to plan for
at least 3 items loading primarily on each factor. So, a test with 13 items
should have no more than 4 factors.
2. Prepare the data and fit the model
The factor analysis models discussed in this chapter require that our items be
measured on continuous scales, and that our factors of interest be expressed
in terms of continuous scores. With dichotomously scored items, we need
to adjust our data, and this process will not be covered here. Instead, we
will examine polytomous items with responses coming from rating scales.
These data may not be strictly continuous, but we will assume that they
are continuous enough to support the corresponding EFA.
Factor analysis requires either a scored data set, with people in rows and
observations (scored item responses) in columns, or a correlation matrix
based on such a data set. In the first demonstration below we’ll use a data
set, and in the second we’ll use a correlation matrix. When using a data set,
the data are prepared simply by ensuring that all of the variables contain scored item responses.
Factor analysis also requires sufficient sample size given the number of items
in our test and the number of parameters estimated in the model. One
somewhat generous rule of thumb is to get five times as many people as
observed variables. So, when piloting a test with 100 items, we would hope
for complete data on at least 500 respondents. With small sample sizes and
too many parameters, our EFA will not run or will fail to converge on a
solution, in which case we may need to collect more data, reduce test length,
or revise our proposed number of factors.
The EFA model estimates the relationships between each item on our test
and each factor that the model extracts. These relationships are summarized
within a factor loading matrix, with items in rows and loadings on each factor
in columns. The loadings are usually standardized so as to be interpreted as
correlations. Loadings closer to 0 indicate small or negligible relationships,
whereas values closer to 1 indicate strong relationships. Most of the loadings
we will see fall between 0 and 0.80.
In an EFA, each item has its own unique error term, sometimes referred to
as its uniqueness. This error consists of the leftover unexplained variance
for an item, after removing the shared variance explained by the factors.
The error terms in the EFA that we will examine are simply 1 minus the
sum of the squared factor loadings for each item across factors. Because
they measure the converse of the factor loadings, larger errors reveal items
that do not fit well in a given model.
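Written as a formula, with λ denoting a loading (notation introduced here for convenience, not taken from the chapter), the uniqueness for item j across F factors is

$$u_j = 1 - \sum_{f=1}^{F} \lambda_{jf}^2$$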
The EFA model also estimates the amount of variability in observed item
scores that is explained by each factor. The concept of explained variability is
related to the coefficient of determination from regression, and the reliability
coefficient as presented in Chapter 10.3. In EFA, variability across all of
our items can be standardized so that the variance per item is 1, and the
sum of all the variances is simply the number of items. On a test with 13
items, the total variability would then be 13.
Variability explained for a given factor can be indexed using its standardized
eigenvalue. An eigenvalue of 1 tells us that a factor only explains as much
variability, on average, as a single item. On a test with 13 items, an
eigenvalue of 13 (not likely) would tell us that a factor explains all of the
available variability. On this same test, an eigenvalue of 3 would indicate
that a factor explains as much variability as 3 items, and dividing 3 by 13
gives us 0.23, the proportion of total variability explained by this factor.
Larger standardized eigenvalues are better, as they indicate stronger fac-
tors that better represent the correlations in scores. Smaller eigenvalues,
especially ones below 1, indicate factors that are not useful in explaining
variability. Eigenvalues, along with factor loadings, can help us identify
an appropriate factor structure for our test. We can reduce the number of
factors and remove problematic items and then rerun our EFA to explore
how the results change. This would bring us back to step 1.
Confirming our factor structure
In summary, we’ve outlined here four main steps in conducting an EFA.
2. Preparing the data and fitting the model. Our data must be
quantitative and measured on a relatively continuous scale. Our
factors will be measured as continuous variables. We’ll fit the model
in R using the default options, as demonstrated below.
Having completed these steps, the EFA should point us in the right direction
in terms of finding a suitable number of factors for our test, and determining
how our items load on these factors. However, because EFA does not involve
any formal hypothesis testing, the results are merely descriptive. The next
step involves a confirmatory analysis of the factor structure that we think
is most appropriate for our test. Following a demonstration of EFA with
PISA data, we’ll learn more about the role of CFA in test development.
The Approaches to Learning scale from the PISA 2009 student questionnaire
contains 13 items measuring the strategies students use when learning. These
items were separated by the PISA developers into the three subscales of
memorization, elaboration, and control strategies. See the full text for the
items in Appendix D.
Choose factors, prep data, fit model
Note that the PISA09 data set includes IRT theta scores on each subscale,
as estimated by PISA (PISA09$memor, PISA09$elab, and PISA09$cstrat).
We’ll look at these later. For now, let’s explore an EFA model for all 13
items, with three factors specified. Given the small number of items, four
factors may also work, but any more is not recommended. We will use
the fastudy() function from the epmr package, which runs an EFA using
factanal() from base R.
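The exact fastudy() call is not reproduced here, but a minimal sketch of the same analysis, assuming pisagbr holds the Great Britain subset of PISA09 used throughout these chapters, could go directly through factanal(), which fastudy() wraps. The printed formatting may differ slightly from the epmr output shown below.

# Approaches to Learning item names, grouped by subscale
alitems <- c("st27q01", "st27q03", "st27q05", "st27q07",
  "st27q04", "st27q08", "st27q10", "st27q12",
  "st27q02", "st27q06", "st27q09", "st27q11", "st27q13")
# Fit an EFA with three factors, dropping incomplete cases
alefa <- factanal(na.omit(pisagbr[, alitems]), factors = 3)
print(alefa$loadings, digits = 2)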
##
## Loadings:
## Factor1 Factor2 Factor3
## st27q01 0.13 0.66
## st27q03 0.23 0.66
## st27q05 0.10 0.20 0.49
## st27q07 0.39 0.40
## st27q04 0.44 0.22 0.25
## st27q08 0.63 0.14
## st27q10 0.67 0.18
## st27q12 0.74 0.12
## st27q02 0.11 0.38 0.43
## st27q06 0.15 0.64 0.24
## st27q09 0.33 0.59 0.13
## st27q11 0.17 0.55 0.33
## st27q13 0.25 0.52 0.15
##
## Factor1 Factor2 Factor3
## SS loadings 1.85 1.84 1.73
## Proportion Var 0.14 0.14 0.13
## Cumulative Var 0.14 0.28 0.42
The first four items in the table pertain to the memorization scale. Notice
that they load strongest on factor 3, with loadings of 0.66, 0.66, 0.49, and
0.40. Two of the memorization items also load above 0.1 on factor 1, and
three load above 0.1 on factor 2.
The next four items pertain to the elaboration scale. These load strongest
on factor 1, with loadings of 0.44, 0.63, 0.67, and 0.74. They all also load
somewhat on factor 2, and one loads on factor 3.
Finally, the last five items pertain to the control strategies scale. Loadings
tended to be strongest for factor 2, with 0.38, 0.64, 0.59, 0.55, and 0.52.
All of the control strategies items also had small loadings on the other two
factors.
By increasing the cutoff when printing the loadings matrix, we can highlight
visually where the stronger factor loadings are located. The trends described
above, with scale items loading together on their own factors, become more
apparent.
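Assuming the fitted object stores its loadings as a factanal() fit does, raising the print cutoff to 0.3 might look like this.

# Blank out loadings below 0.3 when printing the loading matrix
print(alefa$loadings, digits = 2, cutoff = 0.3)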
## st27q03 0.66
## st27q05 0.49
## st27q07 0.39 0.40
## st27q04 0.44
## st27q08 0.63
## st27q10 0.67
## st27q12 0.74
## st27q02 0.38 0.43
## st27q06 0.64
## st27q09 0.33 0.59
## st27q11 0.55 0.33
## st27q13 0.52
##
## Factor1 Factor2 Factor3
## SS loadings 1.85 1.84 1.73
## Proportion Var 0.14 0.14 0.13
## Cumulative Var 0.14 0.28 0.42
Note that item st27q02 had a larger loading on factor 3, the memorization
factor, than on factor 2. This suggests that this item is measuring, in part,
a learning strategy that involves a skill related to memorization. The item
may fit better in the memorization scale than the control scale.
The item error terms summarize the unexplained variability for each item.
Items with larger factor loadings will have lower errors, and vice versa. Here,
we confirm that the errors, contained in alefa$uniquenesses, are 1 minus
the sum of squared factor loadings.
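A quick check, again assuming the object mirrors factanal() output:

# Uniquenesses should match 1 minus the sum of squared loadings per item
round(alefa$uniquenesses, 2)
round(1 - rowSums(alefa$loadings^2), 2)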
Evaluate factors
The factor loadings tend to support the alignment of the approaches to
learning items into their corresponding scales. However, the results also
show that many of the items are related to more than just the scales they
were written for. This could be due to the fact that the three factors measure
related components of a broader learning strategies construct. Correlations
between the IRT theta scores for each scale are all moderately positive,
suggesting overlap in what the scales are measuring.
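A rough check on those correlations, using the theta scores noted earlier and ignoring missing data pairwise:

# Correlations between the PISA09 theta scores for the three subscales
cor(PISA09[, c("memor", "elab", "cstrat")],
  use = "pairwise.complete.obs")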
Next, we can look at the eigenvalues for the factors, to determine the
amount of total score variability they each capture. The R output labels
the eigenvalues as SS loadings, since they are calculated as the sum of
the squared loadings for each factor. The EFA results show eigenvalues
of 1.85, 1.84, and 1.73, which each represent about 14% of the total score
variability, for a cumulative variance explained of 42%. These results aren’t
encouraging, as they indicate that the majority of variability in approaches
to learning scores is still unexplained.
The scree plot for the Approaches to Learning EFA with three factors
resembles more of a plain than a cliff edge. The eigenvalues are all above 1,
which is sometimes used as a cutoff for acceptability. They’re also all nearly
equal.
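The scree plot can be approximated from the loadings themselves; here is a minimal base R sketch, assuming a factanal()-style loadings matrix.

# Eigenvalues (SS loadings) per factor, plotted as a simple scree
evals <- colSums(alefa$loadings^2)
plot(evals, type = "b", xlab = "Factor", ylab = "Eigenvalue")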
With other types of tests where one or two strong underlying constructs are
present, the scree effect will be more apparent. Here is an example based
on the BDI$R correlation matrix presented above. The plot shows that the
first two factors have eigenvalues near or above 2, whereas the rest are near
or below 1. These first two factors correspond to the cognitive and somatic
factors described above. Together they account for 26% of the variance.
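A rough version of this plot can be built from the raw eigenvalues of the correlation matrix; these will differ somewhat from the SS loadings plotted by epmr, but the overall shape, with two factors standing out, should be similar.

# Scree based on the eigenvalues of the BDI correlation matrix
bdi_evals <- eigen(BDI$R)$values
plot(bdi_evals, type = "b", xlab = "Factor", ylab = "Eigenvalue")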
The EFA of the PISA 2009 Approaches to Learning scale leads us to conclude
first that the 13 items provide only a limited picture of the three subscales
being measured. Four to five items per scale, and per factor, does not
appear to be sufficient. Although loadings were moderate, and aligned as
expected onto the corresponding subscales, more items would likely improve
the variance explained by each factor. Given these limitations, computing
theta values based on unidimensional IRT models for each subscale may not
be appropriate.
the number of factors. Furthermore, EFA does not formally support the
testing of model fit or statistical significance in our results. CFA extends
EFA by providing a framework for proposing a specific measurement model,
fitting the model, and then testing statistically for the appropriateness or
accuracy of the model given our instrument and data.
Note that CFA falls within a much broader and more general class of
structural equation modeling (SEM) methods. Other applications of SEM
include path analysis, multilevel regression modeling, and latent growth
modeling. CFA and SEM models can be fit and analyzed in R, but, given
their complexity, commercial software is often preferable, depending on the
model and type of data used. For demonstration purposes, we will examine
CFA in R using the lavaan package.
The BDI-II
Let’s consider again the BDI-II, discussed above in terms of EFA. Exploring
the factor structure of the instrument gives us insights into the number of
factors needed to adequately capture certain percentages of the variability in
scores. Again, results indicate that two factors, labeled in previous studies
as cognitive and somatic, account for about a quarter of the total score
variance. In the plot above, these two factors clearly stand out above the
scree. But the question remains, is a two factor model correct? And, if so,
what is the best configuration of item loadings on those two factors?
In the end, there is no way to determine with certainty that a model is
correct for a given data set and instrument. Instead, in CFA a model can be
considered adequate or appropriate based on two general criteria. The first
has to do with the explanatory power of the model. We might consider a
model adequate when it exceeds some threshold for percentage of variability
explained, and thereby minimizes in some way error variance or variability
unexplained. The second criterion involves the relative appropriateness
of a given model compared to other competing models. Comparisons are
typically made using statistical estimates of model fit, as discussed below.
Numerous studies have employed CFA to test for the appropriateness of
a two factor structure in the BDI-II. Whisman et al. (2000) proposed an
initial factor structure that included five BDI-II items loading on the somatic
factor, and 14 items loading on the cognitive factor (the authors label this
factor cognitive-affective). The remaining two items, Pessimism and Loss of
Interest in Sex, had been found in exploratory analyses not to load strongly
on either factor. Thus, loadings for these were not estimated in the initial
CFA model.
Results for this first CFA showed poor fit. Whisman et al. (2000) reported a
few different fit indices, including a Comparative Fit Index (CFI) of 0.80 and
a Root Mean Square Error of Approximation (RMSEA) of 0.08. Commonly
used thresholds for determining “good fit” are CFI at or above 0.90 and
RMSEA at or below 0.05.
An important feature of factor analysis that we have not yet discussed has
to do with the relationships among item errors. In EFA, we assume that all
item errors are uncorrelated. This means that the unexplained variability
for a given item does not relate to the unexplained variability for any other.
Similar to the CTT model, variability in EFA can only come from the
estimated factors or random, uncorrelated noise. In CFA, we can relax this
assumption by allowing certain error terms to correlate.
Whisman et al. (2000) allowed error terms to correlate for the following
pairs of items: sadness and crying, self-dislike and self-criticalness, and loss
of pleasure and loss of interest. This choice to have correlated errors seems
justified, given the apparent similarities in content for these items. The two
factors in the initial CFA didn’t account for the fact that respondents would
answer similarly within these pairs of items. Thus, the model was missing
an important feature of the data.
Having allowed for correlated error terms, the two unassigned items Pes-
simism and Loss of Interest in Sex were also specified to load on the cognitive
factor. The result was a final model with more acceptable fit statistics,
including CFI of 0.90 and RMSEA of 0.06.
Here, we will fit the final model from Whisman et al. (2000), while demonstrating the basic steps in conducting a CFA.
Figure 8.2.: CFA model for the BDI-II, with factor loadings and error vari-
ances.
Considering the complexity of our model, the sample size of 576 should be
sufficient for estimation purposes. Since different CFA models can involve
different numbers of estimated parameters for the same number of items,
sample size guidelines are usually stated in terms of the number of parameters
rather than the number of items. We’ll consider the recommended minimum
of five times as many people as parameters. The total number of parameters
estimated by our model will be 43, with 21 factor loadings, 21 error variances,
and the correlation between the cognitive and somatic factors.
We fit our CFA using the covariance matrix in BDI$S. The cfa() function
in the lavaan package requires that we name each factor and then list out
the items that load on it after the symbol =~. Here, we label our cognitive
factor cog and somatic factor som. The items listed after each factor must
match the row or column names in the covariance matrix, and must be
separated by +.
The lavaanify() function will automatically add error variances for each
item, and the correlation between factors, so we don’t have to write those
out. We do need to add the three correlated error terms, for example, with
sadness ~~ crying, where the ~~ indicates that the variable on the left
covaries with the one on the right.
The model is fit using the cfa() function. We supply the model specification
object that we created above, along with the covariance matrix and sample
size.
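Putting these pieces together, a sketch of the specification and fit might look like the following. The item-to-factor assignment and the three correlated error terms are reconstructed from the description above and from the variable names shown in the output below, so treat the exact item lists as an assumption rather than a definitive reproduction of the model.

# Two-factor BDI-II model, with three pairs of correlated errors
bdimod <- lavaanify(model = "
  cog =~ sadness + crying + failure + guilt + punish + dislike +
    critical + pessimism + nopleasure + nointerest + noworth +
    suicide + indecisive + irritable + agitated + nosex
  som =~ tired + noenergy + noconcentrate + appetite + sleep
  sadness ~~ crying
  dislike ~~ critical
  nopleasure ~~ nointerest",
  auto.var = TRUE, auto.cov.lv.x = TRUE, std.lv = TRUE)
# Fit using the BDI covariance matrix and sample size
bdicfa <- cfa(bdimod, sample.cov = BDI$S, sample.nobs = 576)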
To evaluate the model that’s stored in the object bdicfa, we send it to the
lavaan summary() function, while requesting fit indices with the argument
fit = TRUE. A lot of output is printed to the console, including a summary
of the model estimation, some fit statistics, factor loadings (under Latent
Variables), covariances, and variance terms (including item errors).
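The call mirrors the one used later in the chapter for the PISA items.

# Print results with fit indices and standardized estimates
summary(bdicfa, fit = TRUE, standardized = TRUE)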
The CFI and RMSEA are printed in the top half of the output. Neither
statistic reaches the threshold we’d hope for. CFI is 0.84 and RMSEA is
0.07, indicating less than optimal fit.
Standardized factor loadings are shown in the last column of the Latent
Variables output. Some of the loadings match up well with those reported
in the original study. Others are different. The discrepancies may be due
to the fact that our CFA was run on correlations rounded to two decimal
places that were converted using standard deviations to covariances, whereas
raw data were used in the original study. Rounding can result in a loss of
meaningful information.
## 0.847 0.847
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .sadness 0.203 0.013 15.727 0.000
## .crying 0.203 0.013 15.948 0.000
## .failure 0.249 0.015 16.193 0.000
## .guilt 0.195 0.013 15.424 0.000
## .punish 0.263 0.016 16.165 0.000
## .dislike 0.196 0.012 16.006 0.000
## .critical 0.239 0.016 14.991 0.000
## .pessimism 0.265 0.017 15.715 0.000
## .nopleasure 0.109 0.007 16.043 0.000
## .nointerest 0.549 0.033 16.451 0.000
## .noworth 0.279 0.017 16.492 0.000
## .suicide 0.230 0.014 16.359 0.000
## .indecisive 0.294 0.018 16.117 0.000
## .irritable 0.195 0.013 15.435 0.000
## .agitated 0.253 0.016 16.142 0.000
## .nosex 0.481 0.029 16.764 0.000
## .tired 0.226 0.015 15.157 0.000
## .noenergy 0.436 0.028 15.458 0.000
## .noconcentrate 0.337 0.023 14.387 0.000
## .appetite 0.231 0.017 13.737 0.000
## .sleep 0.192 0.012 16.521 0.000
## cog 1.000
## som 1.000
## Std.lv Std.all
## 0.203 0.625
## 0.203 0.672
## 0.249 0.716
## 0.195 0.560
## 0.263 0.709
## 0.196 0.700
## 0.239 0.517
## 0.265 0.610
## 0.109 0.683
## 0.549 0.798
## 0.279 0.804
## 0.230 0.762
## 0.294 0.697
## 0.195 0.562
## 0.253 0.703
## 0.481 0.905
## 0.226 0.723
## 0.436 0.756
## 0.337 0.651
## 0.231 0.603
## 0.192 0.911
## 1.000 1.000
## 1.000 1.000
4. Revise as needed
The discouraging CFA results may inspire us to modify our factor structure
in hopes of improving model fit. Potential changes include the removal of
items with low factor loadings, the correlating of more or fewer error terms,
and the evaluation of different numbers of factors. Having fit multiple CFA
models, we can then compare fit indices and look for relative improvements
in fit for one model over another.
Let’s quickly examine a CFA where all items load on a single factor. We
no longer have correlated error terms, and item errors are again added
automatically. We only specify the loading of all items on our single factor,
labeled depression.
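A sketch of this unidimensional model, using hypothetical object names bdimod_uni and bdicfa_uni:

# All 21 BDI-II items load on a single factor labeled depression
bdimod_uni <- lavaanify(model = "
  depression =~ sadness + crying + failure + guilt + punish +
    dislike + critical + pessimism + nopleasure + nointerest +
    noworth + suicide + indecisive + irritable + agitated + nosex +
    tired + noenergy + noconcentrate + appetite + sleep",
  auto.var = TRUE, std.lv = TRUE)
bdicfa_uni <- cfa(bdimod_uni, sample.cov = BDI$S, sample.nobs = 576)
summary(bdicfa_uni, fit = TRUE, standardized = TRUE)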
We can compare fit indices for our two models using the anova() function.
This comparison requires that the same data be used to fit all the models
of interest. So, it wouldn’t be appropriate to compare bdimod with another
model fit to only 20 of the 21 BDI-II items.
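With the hypothetical object names from the sketches above, the comparison is a single call.

# Chi-square difference test plus AIC and BIC for the two models
anova(bdicfa, bdicfa_uni)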
Of these three fit comparison statistics, we will focus on AIC and BIC. The
AIC and BIC both increase from the model with two factors to the model
with only one. This indicates poorer model fit for the depression model.
Overall, results for the unidimensional model are no better than for the
previous one. The AIC and BIC are both larger, and the CFI and RMSEA
are relatively unchanged. Most loadings are moderately positive, but some
leave a substantial amount of item variability unexplained. For example,
the item error terms (in the Std.all column of the Variances table) for sleep
and noenergy are both larger than 0.80. These results indicate that more
than 80% of the variability is unexplained for these items. Errors for the
remaining items are all above 0.50.
Taken together, the results of these CFAs suggest that the two-factor and
unidimensional models may not be appropriate for the BDI-II, at least in
the undergraduate population from which the data were collected.
# Specify the three-factor model for the Approaches to Learning items
almod <- lavaanify(model = "
  memor =~ st27q01 + st27q03 + st27q05 + st27q07
  elab =~ st27q04 + st27q08 + st27q10 + st27q12
  cstrat =~ st27q02 + st27q06 + st27q09 + st27q11 +
    st27q13",
  auto.var = TRUE, auto.cov.lv.x = TRUE, std.lv = TRUE)
# Fit the model
alcfa <- cfa(almod, sample.cov = cov(pisagbr[, alitems]),
  sample.nobs = 3514)
# Print output
summary(alcfa, fit = TRUE, standardized = TRUE)
## lavaan 0.6-2 ended normally after 20 iterations
##
## Optimization method NLMINB
## Number of free parameters 29
##
## Number of observations 3514
##
## Estimator ML
## Model Fit Test Statistic 1474.664
## Degrees of freedom 62
## P-value (Chi-square) 0.000
##
## Model test baseline model:
##
## Minimum Function Test Statistic 12390.734
## Degrees of freedom 78
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.885
## Tucker-Lewis Index (TLI) 0.856
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -52923.964
## Loglikelihood unrestricted model (H1) -52186.632
##
## Number of free parameters 29
## Akaike (AIC) 105905.928
## Bayesian (BIC) 106084.699
## Sample-size adjusted Bayesian (BIC) 105992.552
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.081
## 90 Percent Confidence Interval 0.077 0.084
## P-value RMSEA <= 0.05 0.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.057
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## memor =~
## st27q01 0.476 0.015 32.615 0.000
## st27q03 0.509 0.014 36.402 0.000
## st27q05 0.541 0.016 33.380 0.000
## st27q07 0.556 0.017 33.656 0.000
## elab =~
## st27q04 0.470 0.016 30.098 0.000
## st27q08 0.577 0.015 37.460 0.000
## st27q10 0.650 0.016 41.722 0.000
## st27q12 0.660 0.015 43.070 0.000
## cstrat =~
## st27q02 0.460 0.014 31.931 0.000
## st27q06 0.562 0.014 39.938 0.000
## st27q09 0.547 0.014 39.525 0.000
## st27q11 0.544 0.013 40.705 0.000
## st27q13 0.551 0.016 34.522 0.000
## Std.lv Std.all
##
## 0.476 0.585
## 0.509 0.644
## 0.541 0.597
## 0.556 0.601
##
## 0.470 0.534
## 0.577 0.644
## 0.650 0.706
## 0.660 0.725
##
## 0.460 0.550
## 0.562 0.662
## 0.547 0.657
## 0.544 0.672
## 0.551 0.588
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## memor ~~
## elab 0.368 0.021 17.565 0.000
## cstrat 0.714 0.015 46.863 0.000
## elab ~~
## cstrat 0.576 0.017 34.248 0.000
## Std.lv Std.all
##
## 0.368 0.368
## 0.714 0.714
##
## 0.576 0.576
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .st27q01 0.435 0.013 34.471 0.000
## .st27q03 0.365 0.012 31.715 0.000
## .st27q05 0.527 0.016 33.981 0.000
## .st27q07 0.546 0.016 33.797 0.000
## .st27q04 0.555 0.015 37.116 0.000
## .st27q08 0.471 0.014 33.242 0.000
## .st27q10 0.427 0.014 29.620 0.000
## .st27q12 0.393 0.014 28.195 0.000
## .st27q02 0.487 0.013 37.590 0.000
## .st27q06 0.405 0.012 34.078 0.000
## .st27q09 0.395 0.012 34.310 0.000
## .st27q11 0.359 0.011 33.630 0.000
## .st27q13 0.576 0.016 36.648 0.000
## memor 1.000
## elab 1.000
## cstrat 1.000
## Std.lv Std.all
## 0.435 0.658
## 0.365 0.585
## 0.527 0.643
## 0.546 0.638
## 0.555 0.715
## 0.471 0.586
## 0.427 0.502
## 0.393 0.474
## 0.487 0.697
## 0.405 0.562
## 0.395 0.569
## 0.359 0.548
## 0.576 0.655
## 1.000 1.000
## 1.000 1.000
## 1.000 1.000
Model fit is not as impressive as we would hope. The CFI is 0.89 and the
RMSEA is 0.08. As with the BDI-II, loadings are moderately positive, but
they leave the majority of the item variance unexplained. These results
suggest that the Approaches to Learning constructs may not be strong
enough or distinct enough to warrant the calculation of scale scores across
the memorization, elaboration, and control items.
8.4. Summary
This chapter presented two popular methods for conducting factor analysis,
one that provides descriptive and relatively unstructured information about
a test, and the other providing more structured results oriented around
formal hypotheses. The main steps involved in fitting EFA and CFA and
evaluating results were discussed and demonstrated using real data.
EFA and CFA can improve the test development process by allowing us to
examine and confirm the presence of unobserved constructs that explain
variability in our tests. In practice, tests without a confirmed factor structure
may not be suitable for testing applications that require a total score
calculated across items, in the case of unidimensional factor models, or
scores across subscales, in the case of multidimensional models.
8.4.1. Exercises
9. Validity
Validity has long been one of the major deities in the pantheon
of the psychometrician. It is universally praised, but the good
works done in its name are remarkably few.
— Robert Ebel
Learning objectives
7. Describe the unified view of validity and how it differs from and
improves upon the traditional view of validity.
8. Identify threats to validity, including features of a test, testing
process, or score interpretation or use, that impact validity. Con-
sider, for example, the issues of content underrepresentation and
misrepresentation, and construct irrelevant variance.
9.1.1. Definitions
Recent literature on validity theory has clarified that tests and even test
scores themselves are not valid or invalid. Instead, only score inferences
and interpretations are valid or invalid (e.g., Kane, 2013). Tests are then
described as being valid only for a particular use. This is a simple distinction
in the definition of validity, but some authors continue to highlight it.
Referring to a test or test score as valid implies that it is valid for any use,
even though this is likely not the case. Shorthand is sometimes used to
refer to tests themselves as valid, because it is simpler than distinguishing
between tests, uses, and interpretations. However, the assumption is always
that validity only applies to a specific test use and not broadly to the test
itself.
To evaluate the proposed score interpretations and uses for a test, and the
extent to which they are valid, we should first examine the purpose of the test
itself. As discussed in Chapters 2 and 3, a good test purpose articulates key
information about the test, including what it measures (the construct), for
whom (the intended population), and why (for what purpose). The question
then becomes, given the quality of its contents, how they were constructed,
and how they are implemented, is the test valid for this purpose?
We will review each source of validity evidence in detail, and go over some
practical examples of when one is more relevant than another. In this
discussion, consider your own example, and examples of other tests you’ve
encountered, and what type of validity evidence could be used to support
their use.
on these three. Note that the content domain for a construct should be
established both by research and practice.
Next, we map the portions of our test that will address each area of the
content domain. The test outline can include information about the type of
items used, the cognitive skills required, and the difficulty levels that are
targeted, among other things. Review Chapter 4 for additional details on
test outlines or blueprints.
Table 9.1 contains an example of a test outline for the IGDI measures. The
three content areas listed above are shown in the first column. These are
then broken down further into cognitive processes or skills. Theory and
practical constraints determine reasonable numbers and types of test items
or tasks devoted to each cognitive process in the test itself. The final column
shows the percentage of the total test that is devoted to each area.
3. Evaluate
Licensure testing
Psychological measures
Here are two main sources of content invalidity. First, if items reflecting
domain elements that are important to the construct are omitted from our
test outline, the construct will be underrepresented in the test. In our panic
attack example, if the test does not include items addressing “nausea or
abdominal distress,” other criteria, such as “fear of dying,” may have too
much sway in determining an individual’s score. Second, if unnecessary items
measuring irrelevant or tangential material are included, the construct will
be misrepresented in the test. For example, if items measuring depression
are included in the scoring process, the score itself is less valid as a measure
of the target construct.
9.3.1. Definition
Criterion validity is the degree to which test scores correlate with, predict,
or inform decisions regarding another measure or outcome. If you think of
content validity as the extent to which a test correlates with or corresponds
to the content domain, criterion validity is similar in that it is the extent to
which a test correlates with or corresponds to another test. So, in content
validity we compare our test to the content domain, and hope for a strong
relationship, and in criterion validity we compare our test to a criterion
variable, and again hope for a strong relationship.
Validity by association
The equation for a validity coefficient is the same as the equations for
correlation that we encountered in previous chapters. Here we denote our
test as X and the criterion variable as Y . The validity coefficient is the
correlation between the two, which can be obtained as the covariance divided
by the product of the individual standard deviations.
$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \tag{9.1}$$
Criterion validity is limited because it does not actually require that our test
be a reasonable measure of the construct, only that it relate strongly with
another measure of the construct. Nunnally and Bernstein (1994) clarify
this point with a hypothetical example:
The scenario is silly, but it highlights the fact that, on its own, criterion
validity is insufficient. The take-home message is that you should never use
or trust a criterion relationship as your sole source of validity evidence.
There are two other challenges associated with criterion validity. First,
finding a suitable criterion can be difficult, especially if your test targets
a new or not well defined construct. Second, a correlation coefficient is
attenuated, or reduced in strength, by any unreliability present in the
two measures being correlated. So, if your test and the criterion test are
unreliable, a low validity coefficient (the correlation between the two tests)
may not necessarily represent a lack of relationship between the two tests. It
may instead represent a lack of reliable information with which to estimate
the criterion validity coefficient.
Attenuation
$$\rho_{T_X T_Y} = \frac{\rho_{XY}}{\sqrt{\rho_X \rho_Y}} \tag{9.2}$$
In summary, the steps for establishing criterion validity evidence are rela-
tively simple. After defining the purpose of the test, a suitable criterion is
identified. The two tests are administered to the same sample of individuals
from the target population, and a correlation is obtained. If reliability
estimates are available, we can then estimate the disattenuated coefficient,
as shown above.
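As a quick numerical illustration of Equation 9.2, with hypothetical values for the observed correlation and the two reliability estimates:

# Hypothetical observed validity coefficient and reliabilities
rxy <- 0.50  # correlation between test X and criterion Y
rxx <- 0.80  # reliability of X
ryy <- 0.70  # reliability of Y
rxy / sqrt(rxx * ryy)  # disattenuated coefficient, about 0.67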
Note that a variety of other statistics are available for establishing the
predictive power of a test X for a criterion variable Y . Two popular examples
are regression models, which provide more detailed information about the
bivariate relationship between our test and criterion, and contingency tables,
which describe predictions in terms of categorical or ordinal outcomes. In
each case, criterion validity can be maximized by writing items for our test
that are predictive of or correlate with the criterion.
9.4. Construct validity
9.4.1. Definition
Chapter 8 presents EFA and CFA as tools for exploring and confirming the
factor structure of a test. Results from these analyses, particularly CFA, are
key to establishing construct validity evidence.
9.4.2. Examples
The entire set of relationships between our construct and other available
constructs is sometimes referred to as a nomological network. This network
outlines what the construct is, based on what it relates positively with,
and conversely what it is not, based on what it relates negatively with.
For example, what variables would you expect to relate positively with
depression? As a person gets more depressed, what else tends to increase?
What variables would you not expect to correlate with depression? Finally,
what variables would you expect to relate negatively with depression?
Table 9.2 contains an example of a correlation matrix that describes a
nomological network for a hypothetical new depression scale. The BDI
would be considered a well known criterion measure of depression. The
remaining labels in this table refer to other related or unrelated variables.
“Fake bad” is a measure of a person’s tendency to pretend to be “bad” or
associate themselves with negative behaviors or characteristics. Positive
correlations in this table represent what is referred to as convergence. Our
hypothetical new scale converges with the BDI, a measure of anxiety, and
a measure of faking bad. Negative correlations represent divergence. Our
new scale diverges with measures of happiness and health. Both types of
correlation should be predicted by a theory of depression.
9.6. Summary
9.6.1. Exercises
1. Consider your own testing application and how you would define a
content domain. What is this definition of the content domain based
on? In education, for example, end-of-year testing, it’s typically based
on a curriculum. In psychology, it’s typically based on research and
practice. How would you confirm that this content domain is adequate
or representative of the construct? And how could content validity
be compromised for your test?
2. Consider your own testing application and a potential criterion mea-
sure for it. How do you go about choosing the criterion? How would
you confirm that a relationship exists between your test and the
criterion? How could criterion validity be compromised in this case?
3. Construct underrepresentation and misrepresentation are reviewed
briefly for the hypothetical test of panic attacks. Explain how un-
derrepresentation and misrepresentation could each impact content
validity for the early literacy measures.
4. Suppose a medical licensure test correlates at 0.50 with a criterion
measure based on supervisor evaluations of practicing physicians who
have passed the test. Interpret this correlation as a validity coefficient,
discussing the issues presented in this chapter.
5. Consider what threats might impact the validity of your own testing
application. How could these threats be identified in practice? How
could they be avoided through sound test development?
10. Test Evaluation
Test evaluation summarizes many of the topics that precede it in this book,
including test purpose, study design and results for reliability and validity,
scoring and reporting guidelines, and recommendations for test use. The new
material for this chapter includes the process of evaluating this information
within a test review or technical manual when considering one or more tests
for a particular use. Our perspective will be that of a test consumer, for
example, a researcher or practitioner in the market for a test to inform
some application, such as a research question or a decision-making process.
Learning objectives
We’ll begin, as usual, with a review of test purpose. Suppose you have a
research question that requires measurement of some kind, and you don’t
have the time or resources to develop your own test. So, you start looking
for an existing measure that will meet your needs. How do you go about
finding such a measure?
It is possible that the gold standard measures for your field or area of work
will be well known to you. But, if they aren’t, you will need to compare tests
based on their stated purposes. Some initial questions are, what construct(s)
are you hoping to measure, for what population, and for what reason? And
what tests have purposes that match your particular application?
As noted above, in this chapter we’ll consider a scenario involving the
construct of creativity. In the late 1990s, a national committee in Britain
reported on the critical importance of creative and cultural education, both
of which, it was argued, were missing in many education systems (Robinson,
1999). Creative education was defined as “forms of education that develop
young people’s capacities for original ideas and action.” Cultural education
referred to “forms of education that enable them [students] to engage
positively with the growing complexity and diversity of social values and
ways of life.” The report sought to dispel a common myth that creativity is
an inborn and static ability. Instead, it was argued that creativity can and
must be taught, both to teachers and students, if our education systems are
to succeed.
This idea that creativity is critical to effective education provides a back-
drop for the hypothetical testing application we’ll focus on in this chapter.
Suppose we are studying the efficacy of different classroom curricula for
improving teachers’ and students’ creative thinking. In addition to qual-
itative measures of effectiveness, for example, interviews and reports on
participants’ experiences, we also need a standardized, quantitative measure
of creativity that we can administer to participants at the start and end of
our program.
To research and evaluate potential measures of creativity, we turn to the
Buros test database (buros.org). The Buros Center for Testing includes a
test review library. The library is physically located at the University of
Nebraska-Lincoln. Electronic access to library resources is available online.
Each year, Buros solicits and publishes reviews of newly published tests in
what is called the Mental Measurements Yearbook (MMY). These reviews
are written by testing professionals and they summarize and evaluate the
information that test users need to know about a test before using it. This
review information is essentially what we will cover in this chapter.
If you have access to the electronic version of MMY, for example, through
a university library, you should go there and search for a test of creative
problem solving. If you use the search terms “creative problem solving,”
you probably won’t get many results. I only had ten at the time of writing
this. Judging by their titles, some of the results sounded like they might be
appropriate, but I decided to try again with just the search term “creativity.”
This returned 182 results.

Table 10.1.: Information for two tests of creativity, the CAPSAT and the CAP.

                   CAPSAT             CAP
Publication year   2011               1980
Format             Self-administered  Parent and teacher ratings
Number of scales   4                  8
Number of items    36                 48 SR, 4 CR
Score scale range  0 to 100           Three-point frequency scale
Referencing        Criterion          Norm
Population         Adults 17 to 40    Children 6 to 18
As you browse through these tests, think about the construct of creative
problem solving, and creativity in general. How could we test a construct
like this? What types of tasks would we expect children to respond to, and
how would we score their responses? It turns out, creativity is not easily
measured, primarily because a standardized test with a structured scoring
guide does not leave room for responses that come from “outside the box,”
or that represent divergent thinking, or that we would consider novel or. . .
“creative.” In a way, the term standardized suggests the opposite of creativity.
Still, quite a few commercial measures of creativity exist.
Table 10.1 contains information for two tests of creativity, the Creativity and
Problem-Solving Aptitude Test (CAPSAT) and the Creativity Assessment
Packet (CAP). Only one test (the CAP) is available in the electronic
version of MMY. When you perform your search on MMY with search
term “creativity” you should see the CAP near the top of your results.
Neither test purpose is stated clearly in the technical documentation, so
we’ll have to assume that they’re both intended to describe creativity and
problem-solving for the stated populations. The intended uses of these tests
are also not clear.
Given the limited information in Table 10.1, which test would be more
appropriate for our program? The CAP seems right, given that it’s designed
for children 6 to 18. Unfortunately, as we dig deeper into the technical
information for the test, we discover that there is essentially no reliability or
validity evidence for it. One MMY review states that test-retest reliability
and criterion validity were established for the CAP in the 1960s, but no
coefficients are actually reported in the CAP documentation.
No matter how well a test purpose matches our own, a lack of reliability
and validity evidence makes a test unusable. The only solution for the CAP
would be to conduct our own reliability and validity studies. Technical
documentation for the CAPSAT indicates that internal consistency (alpha)
for the entire test is 0.90, with subscales having coefficient alphas between
0.72 and 0.87. Those are all acceptable. However, as noted in the MMY
reviews, validity information for the CAPSAT is missing.
For the uninitiated, the MMY provides a brief statement of purpose for the
TTCT, the Torrance Tests of Creative Thinking: “to identify and evaluate creative potential through words (verbal
forms) and pictures (figural forms).” And here’s a short summary of the
TTCT from one reviewer:
Given this background information, the TTCT sounds like a viable solution.
Next, we need to look into the reliability and validity information for the
test.
Because the CAPSAT and the CAP both lacked reliability and validity
evidence, there really wasn’t anything to evaluate. Most tests will include
some form of evidence supporting reliability and validity, and we will need
to evaluate this evidence both in terms of its strength and its relevance to
the test purpose.
10.3. Reliability
Here are some of the questions that you need to ask when evaluating
the relevance of reliability evidence for your test purpose. What types of
reliability are estimated? Are these types appropriate given the test purpose?
Are the study designs appropriate given the chosen types? Are the resulting
reliability estimates strong or supportive? These, along with the questions
presented below, simply review what we have covered in previous chapters.
The point here is to highlight how these features of a test (reliability, validity,
scoring, and test use) are important when evaluating a test for a particular
use.
So, what type of reliability is reported for the TTCT? The test reviews
note that a “collection of reliability and validity data” are available. Tests
of creativity, like the TTCT, require that individuals create things, that is,
perform tasks. Given that the TTCT involves performance assessment, we
need to know about interrater consistency. One review highlights interrater
reliabilities ranging from 0.66 to 0.99 for trained scorers and classroom
teachers. The 0.66 is low, but acceptable, given the low-stakes nature of
our test use. The 0.99 is optimal, and is something we would expect when
correlating scores given by two well trained test administrators. Are we con-
cerned that correlations were reported, and not generalizability coefficients?
Given the ordinal/interval nature of the scale (discussed further below),
percentage agreement and kappa are not as appropriate. Generalizability
coefficients, taking into account systematic scoring differences, would be
more informative than correlations.
According to the test reviews, we also have test-retest reliabilities from
the 0.50s in one study to 0.93 in another. A reviewer notes that these
different results could be due to differences in the variability in the samples
of students used in each study; a wider age range was used to obtain the
test-retest of 0.93. Other studies reported test-retest reliabilities in the 0.60s
and 0.70s, which the reviewer states are acceptable, given the number of
weeks that pass between administrations.
10.4. Validity
Here are some questions that you need to ask when evaluating the relevance
of validity evidence for your test purpose. These are essentially the same as
the questions for reliability. What types of validity evidence are examined?
Are these types appropriate given the test purpose? Are the study designs
appropriate given the chosen types of validity evidence? Does the resulting
evidence support the intended uses and inferences of the test?
Remember, factors impacting construct validity fall into two categories.
Construct underrepresentation is failure to represent what the construct
contains or consists of. Construct misrepresentation happens when we
measure other constructs or factors, including measurement error.
One reviewer of the TTCT notes that “validity data are provided relating
TTCT scores to various measures of personality and intelligence without
structuring a theory of creativity to describe what the relationship of these
variables to creativity should be.” However, the same reviewer then makes
this comment, which is especially relevant to our purpose:
10.5. Scoring
Methods for creating and reporting scores for a test can vary widely from
one test to the next. However, there are a few key questions to ask when
evaluating the scoring that is implemented with a test. What types of scores
are produced? That is, what type of measurement scale is used? What is
the score scale range? How is meaning given to scores? And what type of
score referencing is used, and does this seem reasonable? Finally, what kinds
of score reporting guidelines are provided, and do they seem appropriate?
As with reliability and validity, these questions should be considered in
reference to the purpose of the test.
As with many educational and psychological tests, the score scale for the
TTCT is based on a sum of individual item/task scores. In this case,
scores come from a trained rater. We assume this is an interval scale of
measurement. The score scale range isn’t important for our test purpose, as
long as it can capture growth, which we assume it can, given the information
presented above. However, note that a small scale range, for example, 1 to
5 points, might be problematic in a pre/post test administration, depending
on how much growth is expected to take place.
Regarding scoring, one MMY reviewer notes:
The scoring system for the tests also has some problems. For
example, the assignment of points to responses on the Origi-
nality scale was based on the frequency of appearance of these
responses among 500 unspecified test takers. For the Asking
subtest, for example, responses that appeared in less than 2% of
the protocols receive a weight of 2. What is the basis for setting
this criterion? How would it alter scores if 2s were assigned to
10%, for example? Or 7%? How does one decide? No empirical
basis for this scoring decision is given. Further, not all tests
use the same frequency criteria. This further complicates the
rationale. Also, 2s may be awarded for unlisted items on the
basis of their “creative strength.” A feel for this “variable” is
hard to get.
10.7. Summary
This chapter provides a brief overview of the test evaluation process, using
tests of creative problem solving as a context. Some key questions are
discussed, including questions about test purpose, study design, reliability,
validity, scoring, and test use.
10.7.1. Exercises
4. Search the MMY reviews for a test on a topic that interests you. Use
the information in the reviews for a given test to evaluate the test for
its intended purpose.
A. Learning Objectives
Chapter 1: Introduction
1. Identify and use statistical notation for variables, sample size, mean,
standard deviation, variance, and correlation.
6. Apply and explain the process of rescaling variables using means and
standard deviations for linear transformations.
2. Define the term construct and describe how constructs are used in
measurement, with examples.
11. Describe how standards and performance levels are used in criterion
referencing with standardized state tests.
12. Compare and contrast norm and criterion score referencing, and
identify their uses in context.
3. Identify the distinctive features of aptitude tests and the main ben-
efits and limitations in using aptitude tests to inform decision-making.
6. Compare and contrast different types of tests and test uses and
identify examples of each, including summative, formative, mastery,
and performance.
Cognitive
objectives in the item writing process.
2. Describe how a test outline or test plan is used in cognitive test devel-
opment to align the test to the content domain and learning objectives.
8. Write and critique cognitive test items that match given learning
objectives and depths of knowledge and that follow the item writing
guidelines.
Noncognitive
11. Compare and contrast item types and response types used in affective
measurement, describing the benefits and limitations of each type,
and demonstrating their use.
12. Define the main affective response sets, and demonstrate strategies
for reducing their effects.
13. Write and critique effective affective items using empirical guidelines.
Chapter 5: Reliability
CTT reliability
10. Describe the formula for coefficient alpha, the assumptions it is based
on, and what factors impact it as an estimate of reliability.
12. Describe factors related to the test, the test administration, and the
examinees that affect reliability.
Interrater reliability
16. Identify appropriate uses of each interrater index, including the
benefits and drawbacks of each.
1. Explain how item bias and measurement error negatively impact the
quality of an item, and how item analysis, in general, can be used to
address these issues.
2. Describe general guidelines for collecting pilot data for item analysis,
including how following these guidelines can improve item analysis
results.
10. Utilize item analysis to distinguish between items that function well
in a set and items that do not.
11. Remove items from an item set to achieve a target level of reliability.
1. Compare and contrast IRT and CTT in terms of their strengths and
weaknesses.
2. Identify the two main assumptions that are made when using a
traditional IRT model, regarding dimensionality and functional form
or the number of model parameters.
4. Define the three item parameters and one ability parameter in the
traditional IRT models, and describe the role of each in modeling
performance with the IRF.
5. Distinguish between the 1PL, 2PL, and 3PL IRT models in terms of
assumptions made, benefits and limitations, and applications of each.
Chapter 9: Validity
5. Describe how unreliability can attenuate a correlation, and how to
correct for attenuation in a validity coefficient.
7. Describe the unified view of validity and how it differs from and
improves upon the traditional view of validity.
B. R Code
Chapter 1: Introduction
# Histograms of age and the rescaled version of age
ggplot(PISA09, aes(factor(round(zage, 2)))) + geom_bar()
ggplot(PISA09, aes(factor(round(newage, 2)))) + geom_bar()
Chapter 5: Reliability
# R setup for this chapter
# Required packages are assumed to be installed,
# see chapter 1
library("epmr")
library("ggplot2")
# Functions we'll use in this chapter
# set.seed(), rnorm(), sample(), and runif() to simulate
# data
# rowSums() for getting totals by row
# rsim() and setrange() from epmr to simulate and modify
# scores
# rstudy(), alpha(), and sem() from epmr to get
# reliability and SEM
# geom_point(), geom_abline(), and geom_errorbar() for
# plotting
# diag() for getting the diagonal elements from a matrix
# astudy() from epmr to get interrater agreement
# gstudy() from epmr to run a generalizability study
# Simulate a constant true score, and randomly varying
# error scores from
# a normal population with mean 0 and SD 1
# set.seed() gives R a starting point for generating
# random numbers
# so we can get the same results on different computers
# You should check the mean and SD of E and X
# Creating a histogram of X might be interesting too
set.seed(160416)
myt <- 20
mye <- rnorm(1000, mean = 0, sd = 1)
myx <- myt + mye
# Calculate total reading scores, as in Chapter 2
ritems <- c("r414q02", "r414q11", "r414q06", "r414q09",
"r452q03", "r452q04", "r452q06", "r452q07", "r458q01",
"r458q07", "r458q04")
rsitems <- paste0(ritems, "s")
xscores <- rowSums(PISA09[PISA09$cnt == "BEL", rsitems],
na.rm = TRUE)
# Simulate error scores based on known SEM of 1.4, which
# we'll calculate later, then create true scores
# True scores are truncated to fall between 0 and 11 using
# setrange()
escores <- rnorm(length(xscores), 0, 1.4)
tscores <- setrange(xscores - escores, y = xscores)
# Combine in a data frame and create a scatterplot
scores <- data.frame(x1 = xscores, t = tscores,
e = escores)
ggplot(scores, aes(x1, t)) + geom_point()
ggplot(beldat, aes(factor(x), x)) +
geom_errorbar(aes(ymin = x - sem * 2,
ymax = x + sem * 2), col = "violet") +
geom_errorbar(aes(ymin = x - sem, ymax = x + sem),
col = "yellow") +
geom_point()
# Simulate random coin flips for two raters
# runif() generates random numbers from a uniform
# distribution
flip1 <- round(runif(30))
flip2 <- round(runif(30))
table(flip1, flip2)
# Simulate essay scores from two raters with a population
# correlation of 0.90, and slightly different mean scores,
# with score range 0 to 6
# Note the capital T is an abbreviation for TRUE
essays <- rsim(100, rho = .9, meanx = 4, meany = 3,
sdx = 1.5, sdy = 1.5, to.data.frame = T)
colnames(essays) <- c("r1", "r2")
essays <- round(setrange(essays, to = c(0, 6)))
# Use a cut off of greater than or equal to 3 to determine
# pass versus fail scores
# ifelse() takes a vector of TRUEs and FALSEs as its first
# argument, and returns here "Pass" for TRUE and "Fail"
# for FALSE
essays$f1 <- factor(ifelse(essays$r1 >= 3, "Pass",
"Fail"))
essays$f2 <- factor(ifelse(essays$r2 >= 3, "Pass",
"Fail"))
table(essays$f1, essays$f2)
# Pull the diagonal elements out of the crosstab with
# diag(), sum them, and divide by the number of people
sum(diag(table(essays$r1, essays$r2))) / nrow(essays)
# Randomly sample from the vector c("Pass", "Fail"),
# nrow(essays) times, with replacement
# Without replacement, we'd only have 2 values to sample
# from
monkey <- sample(c("Pass", "Fail"), nrow(essays),
replace = TRUE)
table(essays$f1, monkey)
# Use the astudy() function from epmr to measure agreement
astudy(essays[, 1:2])
# Certain changes in descriptive statistics, like adding
# constants, won't impact correlations
cor(essays$r1, essays$r2)
dstudy(essays[, 1:2])
cor(essays$r1, essays$r2 + 1)
# exclude = NULL shows us NAs as well
# raw and scored are not arguments to table, but are used
# simply to give labels to the printed output
table(raw = PISA09[, critems[1]],
scored = PISA09[, crsitems[1]],
exclude = NULL)
# Create the same type of table for the first SR item
# Check the structure of raw attitude items
str(PISA09[, c("st33q01", "st33q02", "st33q03",
"st33q04")])
# Rescore two items
PISA09$st33q01r <- recode(PISA09$st33q01)
PISA09$st33q02r <- recode(PISA09$st33q02)
# Get p-values for reading items by type
round(colMeans(PISA09[, crsitems], na.rm = T), 2)
round(colMeans(PISA09[, srsitems], na.rm = T), 2)
# Index for attitude toward school items, with the first
# two items recoded
atsitems <- c("st33q01r", "st33q02r", "st33q03",
"st33q04")
# Check mean scores
round(colMeans(PISA09[, atsitems], na.rm = T), 2)
# Convert polytomous to dichotomous, with any disagreement
# coded as 0 and any agreement coded as 1
ats <- apply(PISA09[, atsitems], 2, recode,
list("0" = 1:2, "1" = 3:4))
round(colMeans(ats, na.rm = T), 2)
# Get total reading scores and check descriptives
PISA09$rtotal <- rowSums(PISA09[, rsitems])
dstudy(PISA09$rtotal)
# Compare CR item p-values for students below vs above the
# median total score
round(colMeans(PISA09[PISA09$rtotal <= 6, crsitems],
na.rm = T), 2)
round(colMeans(PISA09[PISA09$rtotal > 6, crsitems],
na.rm = T), 2)
# Create subset of data for German students, then reduce
# to complete data
pisadeu <- PISA09[PISA09$cnt == "DEU", c(crsitems,
"rtotal")]
pisadeu <- pisadeu[complete.cases(pisadeu), ]
round(cor(pisadeu), 2)
# Scatter plots for visualizing item discrimination
ggplot(pisadeu, aes(rtotal, factor(r414q06s))) +
geom_point(position = position_jitter(w = 0.2, h = 0.2))
ggplot(pisadeu, aes(rtotal, factor(r452q03s))) +
geom_point(position = position_jitter(w = 0.2, h = 0.2))
PISA09$rtotal <- rowSums(PISA09[, rsitems])
pisagbr <- PISA09[PISA09$cnt == "GBR",
c(rsitems, "rtotal")]
pisagbr <- pisagbr[complete.cases(pisagbr), ]
# Get p-values conditional on rtotal
# tapply() applies a function to the first argument over
# subsets of data defined by the second argument
pvalues <- data.frame(rtotal = 0:11,
p = tapply(pisagbr$r414q06s, pisagbr$rtotal, mean))
# Plot CTT discrimination over scatter plot of scored item
# responses
ggplot(pisagbr, aes(rtotal, r414q06s)) +
geom_point(position = position_jitter(w = 0.1,
h = 0.1)) +
geom_smooth(method = "lm", fill = NA) +
geom_point(aes(rtotal, p), data = pvalues,
col = "green", size = 3)
# Make up a, b, and c parameters for five items
# Get IRF using the rirf() function from epmr and plot
# rirf() will be demonstrated again later
ipar <- data.frame(a = c(2, 1, .5, 1, 1.5),
b = c(3, 2, -.5, 0, -1),
c = c(0, .2, .25, .1, .28),
row.names = paste0("item", 1:5))
ggplot(rirf(ipar), aes(theta)) + scale_y_continuous("Pr(X)") +
geom_line(aes(y = item1), col = 1) +
geom_line(aes(y = item2), col = 2) +
geom_line(aes(y = item3), col = 3) +
geom_line(aes(y = item4), col = 4) +
geom_line(aes(y = item5), col = 5)
# The irtstudy() function estimates theta and b parameters
# for a set of scored item responses
irtgbr <- irtstudy(pisagbr[, rsitems])
irtgbr
head(irtgbr$ip)
# Get IRF for the set of GBR reading item parameters and a
# vector of thetas
# Note the default thetas of seq(-4, 4, length = 100)
# could also be used
irfgbr <- rirf(irtgbr$ip, seq(-6, 6, length = 100))
# Plot IRF for items r452q03s, r452q04s, r452q06s, and
# r452q07s
ggplot(irfgbr, aes(theta)) + scale_y_continuous("Pr(X)") +
geom_line(aes(y = irfgbr$r452q03s, col = "r452q03")) +
geom_line(aes(y = irfgbr$r452q04s, col = "r452q04")) +
geom_line(aes(y = irfgbr$r452q06s, col = "r452q06")) +
geom_line(aes(y = irfgbr$r452q07s, col = "r452q07")) +
scale_colour_discrete(name = "item")
# Plot SEM curve conditional on theta for full items
# Then add SEM for the subset of items 1:8 and 1:4
ggplot(rtef(irtgbr$ip), aes(theta, se)) + geom_line() +
geom_line(aes(theta, se), data = rtef(irtgbr$ip[1:8, ]),
col = "blue") +
geom_line(aes(theta, se), data = rtef(irtgbr$ip[1:4, ]),
col = "green")
# SEM for theta 3 based on the four easiest and the four
# most difficult items
rtef(irtgbr$ip[c(4, 6, 9, 11), ], theta = 3)
rtef(irtgbr$ip[c(2, 7, 8, 10), ], theta = 3)
# Plot the test information function over theta
# Information is highest at the center of the theta scale
ggplot(rtif(irtgbr$ip), aes(theta, i)) + geom_line()
# Plot the test response function over theta
ggplot(rtrf(irtgbr$ip), aes(theta, p)) + geom_line()
# Print results again, rounding and filtering loadings
print(alefa, digits = 2, cutoff = 0.3)
# Print uniquenesses, and check sum of squared loadings
round(alefa$uniquenesses, 2)
round(rowSums(alefa$loadings^2) + alefa$uniquenesses, 2)
# Correlations between PISA approaches to learning scores
# for Great Britain
round(cor(PISA09[PISA09$cnt == "GBR", c("memor", "elab",
"cstrat")], use = "c"), 2)
# Plot of approaches to learning eigenvalues
plot(alefa, ylim = c(0, 3))
# Plot of eigenvalues for BDI
bdiefa <- fastudy(covmat = BDI$R, factors = 12,
n.obs = 576)
plot(bdiefa, ylim = c(0, 3))
# CFA of BDI using the lavaan package
library("lavaan")
# Specify the factor structure
# Comments within the model statement are ignored
bdimod <- lavaanify(model = "
# Latent variable definitions
cog =~ sadness + crying + failure + guilt + punish +
dislike + critical + pessimism + nopleasure +
nointerest + noworth + suicide + indecisive +
irritable + agitated + nosex
som =~ tired + noenergy + noconcentrate + appetite +
sleep
# Covariances
sadness ~~ crying
dislike ~~ critical
nopleasure ~~ nointerest",
auto.var = TRUE, auto.cov.lv.x = TRUE, std.lv = TRUE)
# Fit the model
bdicfa <- cfa(bdimod, sample.cov = BDI$S,
sample.nobs = 576)
# Print fit indices, loadings, and other output
summary(bdicfa, fit = TRUE, standardized = TRUE)
# Specify the factor structure
# Comments within the model statement are ignored as
# comments
bdimod2 <- lavaanify(model = "
depression =~ sadness + crying + failure + guilt +
punish + dislike + critical + pessimism + nopleasure +
nointerest + noworth + suicide + indecisive +
irritable + agitated + nosex + tired + noenergy +
noconcentrate + appetite + sleep",
auto.var = TRUE, std.lv = TRUE)
# Fit the one-factor model
bdicfa2 <- cfa(bdimod2, sample.cov = BDI$S,
  sample.nobs = 576)
Chapter 9: Validity
C. PISA Reading Items
The items and scoring information below are excerpted from the
Reading Literacy Items and Scoring Guides document available at
nces.ed.gov/surveys/pisa/educators.asp (Organization for Economic
Cooperation and Development, 2009). Note that the format and content
of the items and scoring information have been modified slightly for
presentation here. In the PISA 2009 study, these items were administered
in the context of other reading items and some general instructions not
shown here were also given to students.
Three sets of reading items are included here under the headings Cell phone
safety, The play’s the thing, and Telecommuting. Each set contains multiple
items, some of them selected-response and some constructed-response. Scor-
ing rubrics are provided for the constructed-response items. These item sets
correspond to PISA09 items: r414q02, r414q11, r414q06, and r414q09 for
cell phone safety; r452q03, r452q04, r452q06, and r452q07 for the play’s
the thing; and r458q01, r458q07, and r458q04 for telecommuting.
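For readers following along in R, the scored versions of these item sets can
be selected from PISA09 using the same naming convention as the code in
Appendix B, where scored items carry an s suffix. The sketch below assumes
the epmr package and the PISA09 data have been loaded, as in that code.
# Scored item names for the three reading item sets, using the s
# suffix from Appendix B
cellitems <- paste0(c("r414q02", "r414q11", "r414q06", "r414q09"), "s")
playitems <- paste0(c("r452q03", "r452q04", "r452q06", "r452q07"), "s")
teleitems <- paste0(c("r458q01", "r458q07", "r458q04"), "s")
# Proportion correct (p-values) within each item set
round(colMeans(PISA09[, cellitems], na.rm = TRUE), 2)
round(colMeans(PISA09[, playitems], na.rm = TRUE), 2)
round(colMeans(PISA09[, teleitems], na.rm = TRUE), 2)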
C.1. Cell phone safety
“Cell Phone Safety” above is from a website. Use “Cell Phone Safety” to
answer the questions that follow.
Question 1 (R414Q02)
Question 2 (R414Q11)
“It is difficult to prove that one thing has definitely caused another.”
What is the relationship of this piece of information to the Point 4 Yes and
No statements in the table Are cell phones dangerous?
A. It supports the Yes argument but does not prove it.
B. It proves the Yes argument.
C. It supports the No argument but does not prove it.
D. It shows that the No argument is wrong.
Scoring
Correct: Answer C. It supports the No argument but does not prove it.
Incorrect: Other responses.
Question 3 (R414Q06)
Look at Point 3 in the No column of the table. In this context, what might
one of these “other factors” be? Give a reason for your answer.
Scoring
Correct: Answers which identify a factor in modern lifestyles that could be
related to fatigue, headaches, or loss of concentration. The explanation may
be self-evident, or explicitly stated.
Incorrect: Answers which give an insufficient or vague response.
• Fatigue. [Repeats information in the text.]
• Tiredness. [Repeats information in the text.]
Answers which show inaccurate comprehension of the material or are
implausible or irrelevant.
Question 4 (R414Q09)
Look at the table with the heading If you use a cell phone...
C.2. The play’s the thing
TURAI: If you do not master it, you are its slave. There is no middle
ground. Trust me, it’s no joke starting a play well. It is one of the toughest
problems of stage mechanics. Introducing your characters promptly. Let’s
look at this scene here, the three of us. Three gentlemen in tuxedoes. Say
they enter not this room in this lordly castle, but rather a stage, just when
a play begins. They would have to chat about a whole lot of uninteresting
topics until it came out who we are. Wouldn’t it be much easier to start all
this by standing up and introducing ourselves? Stands up. Good evening.
The three of us are guests in this castle. We have just arrived from the
dining room where we had an excellent dinner and drank two bottles of
champagne. My name is Sandor Turai, I’m a playwright, I’ve been writing
plays for thirty years, that’s my profession. Full stop. Your turn.
GAL: Stands up. My name is Gal, I’m also a playwright. I write plays as
well, all of them in the company of this gentleman here. We are a famous
playwright duo. All playbills of good comedies and operettas read: written
by Gal and Turai. Naturally, this is my profession as well.
ADAM: Stands up. This young man is, if you allow me, Albert Adam,
twenty-five years old, composer. I wrote the music for these kind gentlemen
for their latest operetta. This is my first work for the stage. These two
elderly angels have discovered me and now, with their help, I’d like to
become famous. They got me invited to this castle. They got my dress-coat
and tuxedo made. In other words, I am poor and unknown, for now. Other
than that I’m an orphan and my grandmother raised me. My grandmother
has passed away. I am all alone in this world. I have no name, I have no
money.
TURAI: You shouldn’t have added that. Everyone in the audience would
figure that out anyway.
TURAI: Trust me, it’s not that hard. Just think of this whole thing as. . .
GAL: All right, all right, all right, just don’t start talking about the theatre
again. I’m fed up with it. We’ll talk tomorrow, if you wish.
“The Play’s the Thing” is the beginning of a play by the Hungarian dramatist
Ferenc Molnar.
Use “The Play’s the Thing” to answer the questions that follow. (Note that
line numbers are given in the margin of the script to help you find parts that
are referred to in the questions.)
Question 1 (R452Q03)
What were the characters in the play doing just before the curtain went up?
Scoring
Correct: Answers which refer to dinner or drinking champagne. May paraphrase
or quote the text directly.
They have just had dinner and champagne.
“We have just arrived from the dining room where we had an excellent
dinner.” [direct quotation]
“An excellent dinner and drank two bottles of champagne.” [direct quotation]
Incorrect: Answers which give an insufficient or vague response, show
inaccurate comprehension of the material, or are implausible or irrelevant.
Question 2 (R452Q04)
“It’s an eternity, sometimes as much as a quarter of an hour...” (lines 29-30)
According to Turai, why is a quarter of an hour “an eternity”?
A. It is a long time to expect an audience to sit still in a crowded theatre.
B. It seems to take forever for the situation to be clarified at the beginning of a play.
C. It always seems to take a long time for a dramatist to write the beginning of a play.
D. It seems that time moves slowly when a significant event is happening in a play.
Scoring
Correct: Answer B. It seems to take forever for the situation to be clarified
at the beginning of a play.
Incorrect: Other responses.
Question 3 (R452Q06)
A reader said, “Adam is probably the most excited of the three characters
about staying at the castle.”
What could the reader say to support this opinion? Use the text to give a
reason for your answer.
Scoring
Correct: Indicates a contrast between Adam and the other two characters
by referring to one or more of the following: Adam’s status as the poorest
or youngest of the three characters; his inexperience (as a celebrity).
• He is young, and young people just get more excited about things,
it’s a fact!
• He is an artist.
Question 4 (R452Q07)
Overall, what is the dramatist Molnar doing in this extract?
A. He is showing the way that each character will solve his own problems.
B. He is making his characters demonstrate what an eternity in a play is like.
C. He is giving an example of a typical and traditional opening scene for a play.
D. He is using the characters to act out one of his own creative problems.
Scoring
Correct: Answer D. He is using the characters to act out one of his own
creative problems.
C.3. Telecommuting
Question 2 (R458Q07)
What is one kind of work for which it would be difficult to telecommute?
Give a reason for your answer.
Scoring
Correct: Answers which identify a kind of work and give a plausible explana-
tion as to why a person who does that kind of work could not telecommute.
Responses MUST indicate (explicitly or implicitly) that it is necessary to
be physically present for the specific work.
• Building. It’s hard to work with the wood and bricks from just
anywhere.
• Plumber. You can’t fix someone else’s sink from your home!
• Digging ditches.
• Fire fighter.
• Student.
Question 3 (R458Q04)
Which statement would both Molly and Richard agree with?
A. People should be allowed to work for as many hours as they want to.
B. It is not a good idea for people to spend too much time getting to work.
C. Telecommuting would not work for everyone.
D. Forming social relationships is the most important part of work.
Scoring
Correct: Answer B. It is not a good idea for people to spend too much time
getting to work.
Incorrect: Other responses.
D. Subset of PISA Survey Items
This appendix contains a subset of items from the PISA 2009 student
questionnaire.
The 13 items were grouped into three subscales based on the type of learning
strategy involved, whether memorization (labeled memor below), elaboration
(elab), or control (cstrat). Note that the question numbers and subscale
labels were not presented to students, and the rating scale is not shown.
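As a rough sketch of how subscale scores like these can be formed, the code
below averages simulated item responses within each subscale. The item names,
the 1 to 4 rating scale, and the assignment of items to the memor, elab, and
cstrat subscales are placeholders rather than the actual PISA groupings.
# Hypothetical sketch: subscale scores as means over item responses
set.seed(9)
simdat <- as.data.frame(matrix(sample(1:4, 100 * 13, replace = TRUE),
  ncol = 13))
colnames(simdat) <- paste0("q27item", 1:13)
# Placeholder assignment of the 13 items to the three subscales
memoritems <- paste0("q27item", 1:4)
elabitems <- paste0("q27item", 5:8)
cstratitems <- paste0("q27item", 9:13)
scales <- data.frame(memor = rowMeans(simdat[, memoritems]),
  elab = rowMeans(simdat[, elabitems]),
  cstrat = rowMeans(simdat[, cstratitems]))
# Correlations among the three subscale scores
round(cor(scales), 2)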
Question 27
When you are studying, how often do you do the following? (Please tick
only one box in each row)
D.2. Attitude toward school
The attitude toward school scale contained the four questions shown below.
Students responded using an agreement scale with categories for strongly
disagree, disagree, agree, and strongly agree.
Question 37
Thinking about what you have learned in school: To what extent do you
agree or disagree with the following statements?
Variable   Statement
ST33Q01    School has done little to prepare me for adult life when I leave school.
ST33Q02    School has been a waste of time
ST33Q03    School helped give me confidence to make decisions.
ST33Q04    School has taught me things which could be useful in a job
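The first two statements are negatively worded relative to the last two. As
in the Appendix B code, they can be reverse coded with recode() before the
four items are combined. The sketch below assumes the epmr package and the
PISA09 data are loaded, as in Appendix B, and the atstotal name is used only
for this example.
# Reverse code the two negatively worded items, as in Appendix B
PISA09$st33q01r <- recode(PISA09$st33q01)
PISA09$st33q02r <- recode(PISA09$st33q02)
atsitems <- c("st33q01r", "st33q02r", "st33q03", "st33q04")
# Sum the four items into a simple attitude toward school total
PISA09$atstotal <- rowSums(PISA09[, atsitems], na.rm = TRUE)
# Check descriptives with dstudy(), as in Appendix B
dstudy(PISA09$atstotal)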
Bibliography
Abedi, J. (2004). The No Child Left Behind Act and English language
learners: Assessment and accountability issues. Educational Researcher,
33:4–14.
AERA, APA, and NCME (1999). Standards for educational and psychological
testing. Washington DC: American Educational Research Association.
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting linear
mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–
48.
Beck, A., Steer, R., and Brown, G. (1996). Manual for the BDI-II. San
Antonio, TX: Psychological Corporation.
Beck, A. T., Ward, C. H., Mendelson, M., Mock, J., and Erbaugh, J. (1961).
An inventory for measuring depression. Archives of General Psychiatry,
4:53–63.
Black, P. and Wiliam, D. (1998). Inside the black box: Raising standards
through classroom assessment. Phi Delta Kappan, 80:139–148.
College Board (2012). The SAT report on college and career readiness: 2012.
Technical report, New York, NY: College Board.
De Boeck, P., Bakker, M., Zwitser, R., Nivard, M., Hofman, A., Tuerlinckx,
F., and Partchev, I. (2011). The estimation of item response models with
the lmer function from the lme4 package in R. Journal of Statistical
Software, 39(12):1–28.
Deno, S. L., Fuchs, L. S., Marston, D., and Shin, J. (2001). Using curriculum-
based measurement to establish growth standards for students with
learning disabilities. School Psychology Review, 30:507–524.
Doran, H., Bates, D., Bliese, P., and Dowling, M. (2007). Estimating the
multilevel Rasch model: With the lme4 package. Journal of Statistical
Software, 20(2):1–18.
Nelson, D. A., Robinson, C. C., Hart, C. H., Albano, A. D., and Marshall,
S. J. (2010). Italian preschoolers’ peer-status linkages with sociability and
subtypes of aggression and victimization. Social Development, 19:698–720.
Pope, K. S., Butcher, J. N., and Seelen, J. (2006). The MMPI, MMPI-2, &
MMPI-A in court: A practical guide for expert witnesses and attorneys
(3rd ed.). Washington, DC: American Psychological Association.
Raymond, M. (2001). Job analysis and the specification of content for licen-
sure and certification examinations. Applied Measurement in Education,
14:369–415.
Wickham, H. (2009). ggplot2: elegant graphics for data analysis. New York,
NY: Springer.