Psychometric Principles in Student Assessment
September, 2001
An Introductory Example
The assessment design framework provides a way of thinking about psychometrics
that relates what we observe to what we infer. The models of the evidence-centered
design framework are illustrated in Figure 1. The student model, at the far left, concerns
what we want to say about what a student knows or can do—aspects of their knowledge
or skill. Following a tradition in psychometrics, we label this “θ” (theta). This label may
stand for something rather simple, like a single category of knowledge such as vocabulary
usage, or something much more complex, like a set of variables that concern which
strategies a student can bring to bear on mixed-number subtraction problems and under
what conditions she uses which ones. The task model, at the far right, concerns the
situations we can set up in the world, in which we will observe the student say or do
something that gives us clues about the knowledge or skill we’ve built into the student
model. Between the student and task model are the scoring model and the measurement
model, through which we reason from what we observe in performances to what we infer
about a student.
Let’s illustrate these models with a recent example—an assessment system built for a
middle school science curriculum, “Issues, Evidence and You” (IEY; SEPUP, 1995).
Figure 2 describes variables in the student model upon which both the IEY curriculum
and its assessment system, called the BEAR Assessment System (Wilson & Sloane, 2000)
are built. The student model consists of four variables, at least one of which is the target
of every instructional activity and assessment in the curriculum. The four variables are
seen as four dimensions on which students will make progress during the curriculum.
The dimensions are correlated (positively, we expect), because they all relate to
“science”, but are quite distinct educationally. The psychometric tradition would use a
diagram like Figure 3 to illustrate this situation. Each of the variables is represented as a
circle—this is intended to indicate that they are unobservable or “latent” variables. They
are connected by curving lines—this is intended to indicate that they are not necessarily
causally related to one another (at least as far as we are modeling that relationship), but
they are associated (usually we use a correlation coefficient to express that association).
=================
Insert Figures 1- 3 here
=================
The student model represents what we wish to measure in students. These are
constructs—variables that are inherently unobservable, but which we propose as a useful
way to organize our thinking about students. They describe aspects of their skill or
knowledge for the purposes of, say, comparing programs, evaluating progress, or
planning instruction. We use them to accumulate evidence from what we can actually
observe students say and do.
Now look at the right hand side of Figure 1. This is the task model. This is how we
describe the situations we construct in which students will actually perform. Particular
situations are generically called “items” or “tasks”.
In the case of IEY, the items are embedded in the instructional curriculum, so much
so that the students would not necessarily know that they were being assessed unless the
teacher tells them. An example task is shown in Figure 4. It was designed to prompt
student responses that relate to the “Evidence and Tradeoffs” variable defined in Figure 2.
Note that this variable is a somewhat unusual one in a science curriculum—the IEY
developers think of it as representing the sorts of cognitive skills one would need to
evaluate the importance of, say, an environmental impact statement—something that a
citizen might need to do that is directly related to science’s role in the world. An example
of a student response to this task is shown in Figure 5.
How do we extract from this particular response some evidence about the
unobservable student-model variable we have labeled Evidence and Tradeoffs? What we
need is in the second model from the right in Figure 1—the scoring model. This is a
procedure that allows one to focus on aspects of the student response and assign them to
categories, in this case ordered categories that suggest higher levels of proficiency along
the underlying latent variable. A scoring model can take the form of what is called a
“rubric” in the jargon of assessment, and in IEY does take that form (although it is called
a “scoring guide”). The rubric for the Evidence and Tradeoffs variable is shown in Figure
6. It enables a teacher or a student to recognize and evaluate two distinct aspects of
responses to the questions related to the Evidence and Tradeoffs variable. In addition to
the rubric, scorers have exemplars of student work available to them, complete with
adjudicated scores and explanation of the scores. They also have a method (called
“assessment moderation”) for training people to use the rubric. All these elements
together constitute the scoring model. So, what we put in to the scoring model is a
student’s performance; what we get out is one or more scores for each task, and thus a set
of scores for a set of tasks.
=================
Insert Figures 4-6 here
=================
What now remains? We need to connect the student model on the left hand side of
Figure 1 with the scores that have come out of the scoring model—in what way, and with
what value, should these nuggets of evidence affect our beliefs about the student’s
knowledge? For this we have another model, which we will call the measurement model.
This single component is commonly known as a psychometric model. Now this is
somewhat of a paradox, as we have just explained that the framework for psychometrics
actually involves more than just this one model. The measurement model has indeed
traditionally been the focus of psychometrics, but it is not sufficient to understand
psychometric principles. The complete set of elements, the full evidentiary argument,
must be addressed.
Figure 7 shows the relationships in the measurement model for the sample IEY task.
Here the student model (first shown in Figure 3) has been augmented with a set of boxes.
The boxes are intended to indicate observable rather than latent variables; these are in fact the scores from the scoring model for this task. They are connected to the
Evidence and Tradeoffs student-model variable with straight lines, meant to indicate a
causal (though probabilistic) relationship between the variable and the observed scores,
and the causality is posited to run from the student model variables to the scores. Said
another way, what the student knows and can do, as represented by the variables of the
student model, determines how likely it is that the student will give right answers rather
than wrong ones, carry out sound inquiry rather than founder, and so on, in each
particular task they encounter. In this example, both observable variables are posited to
depend on the same aspect of knowledge, namely Evidence and Tradeoffs. A different
task could have more or fewer observables, and each would depend on one or more
student-model variables, all in accordance with what knowledge and skill the task is
designed to evoke.
=================
Insert Figure 7 here
=================
It is important for us to say that the student model in this example (indeed in most
psychometric applications) is not proposed as a realistic explanation of the thinking that
takes place when a student works through a problem. It is a piece of machinery we use to
accumulate information across tasks, in a language and at a level of detail we think suits
the purpose of the assessment (for a more complete perspective on this see Pirolli &
Wilson, 1998). Without question, it is selective and simplified. But it ought to be
consistent with what we know about how students acquire and use knowledge, and it
ought to be consistent with what we see students say and do. This is where psychometric
principles come in.
What do psychometric principles mean in IEY? Validity concerns whether the tasks
actually do give sound evidence about the knowledge and skills the student-model
variables are supposed to measure, namely, the four IEY progress variables. Or are there
plausible alternative explanations for good or poor performance? Reliability concerns
how much we learn about the students, in terms of these variables, from the performances
we observe. Comparability concerns whether what we say about students, based on
estimates of their student model variables, has a consistent meaning even if students have
taken different tasks, or been assessed at different times or under different conditions.
Fairness asks whether we have been responsible in checking important facts about
students and examining characteristics of task model variables that would invalidate the
inferences that test scores would ordinarily suggest.
Psychometric Principles and Evidentiary Arguments
We have seen through a quick example how assessment can be viewed as evidentiary
arguments, and that psychometric principles can be viewed as desirable properties of
those arguments. Let’s go back to the beginning and develop this line of reasoning more
carefully.
Reliability
Reliability concerns the adequacy of the data to support a claim, presuming the
appropriateness of the warrant and the satisfactory elimination of alternative hypotheses.
Even if the reasoning is sound, there may not be enough information in the data to
support the claim. Later we will see how reliability is expressed quantitatively when
probability-based measurement models are employed. We can mention now, though, that
the procedures by which data are gathered can involve multiple steps or features that each
affect the evidentiary value of data. Relying on Jim's rating of Sue's essay rather than
evaluating it ourselves adds a step of reasoning to the chain, introducing the need to
establish an additional warrant, examine alternative explanations, and assess the value of
the resulting data.
How can we gauge the adequacy of evidence? Brennan (2000/in press) writes that the
idea of repeating the measurement process has played a central role in characterizing an
assessment’s reliability since the work of Spearman (1904)—much as it does in physical
sciences. If you weigh a stone ten times and get a slightly different answer each time, the
variation among the measurements is a good index of the uncertainty associated with that
measurement procedure. It is less straightforward to know just what repeating the
measurement procedure means, though, if the procedure has several steps that could each
be done differently (different occasions, different task, different raters), or if some of the
steps can’t be repeated at all (if a person learns something by working through a task, a
second attempt isn’t measuring the same level of knowledge). We will see that the
history of reliability is one of figuring out how to characterize the value of evidence in
increasingly wider ranges of assessment situations.
Comparability
Comparability concerns the common occurrence that the specifics of data collection
differ for different students, or for the same students at different times. Differing
conditions raise alternative hypotheses when we need to compare students with one
another or against common standards, or when we want to track students’ progress over
time. Are there systematic differences in the conclusions we would draw when we
observe responses to Test Form A as opposed to Test Form B, for example? Or from a
computerized adaptive test instead of the paper-and-pencil version? Or if we use a rating
based on two judges, as opposed to the average of two, or the consensus of three? We
must then extend the warrant to deal with these variations, and we must include them as
alternative explanations of differences in students’ scores.
Comparability overlaps with reliability, as both raise questions of how evidence
obtained through one application of a data-gathering procedure might differ from
evidence obtained through another application. The issue is reliability when we consider
the two measures interchangeable—which is used is a matter of indifference to the
examinee and assessor alike. Although we expect the results to differ somewhat, we don’t
know if one is more accurate than the other, whether one is biased toward higher values,
or if they will illuminate different aspects of knowledge. The same evidentiary argument
holds for both measures, and the obtained differences are what constitutes classical
measurement error. The issue is comparability when we expect systematic differences of
any of these types, but wish to compare results obtained from the two distinct processes
nevertheless. A more complex evidentiary argument is required. It must address the way
that observations from the two processes bear different relationships to the construct we
want to measure, and it must indicate how to take these differences into account in our
inferences.
Fairness
Fairness is a term that encompasses more territory than we can address in this
presentation. Many of its senses concern social, political, and educational perspectives on
the uses that assessment results inform (Willingham & Cole, 1997)—legitimate
questions all, which would exist even if the chain of reasoning from observations to
constructs contained no uncertainty whatsoever. Like Wiley (1991), we focus our
attention here on construct meaning rather than use or consequences, and consider aspects
of fairness that bear directly on this portion of the evidentiary argument.
Fairness in this sense concerns alternative explanations of assessment performances in
light of other characteristics of students that we could and should take into account.
Ideally, the same warrant backs inferences about many students, reasoning from their
particular data to a claim about what each individually knows or can do. This is never
quite truly the case in practice, for factors such as language background, instructional
background, and familiarity with representations surely influence performance. When the
same argument is to be applied with many students, considerations of fairness require us
to examine the impact of such factors on performance and identify the ranges of their
values beyond which the common warrant can no longer be justified. Drawing the usual
inference from the usual data for a student who lies outside this range leads to inferential
errors. If they are errors we should have foreseen and avoided, they are unfair. Ways of avoiding such errors include using additional knowledge about students to condition our interpretation of what we observe under the same procedures, and gathering data from different students in different ways, such as providing accommodations or allowing students to choose among ways of providing data (and accepting the responsibility as assessors to establish the comparability of data so obtained!).
Classical Test Theory

Under classical test theory (CTT), a student's observed score Xj on measure j is modeled as the sum of a true score θ and an error term:

Xj = θ + Ej, (1)

where Ej is an "error" term, normally distributed with a mean of zero and a variance of σE².iii Thus Xj|θ ~ N(θ, σE). This statistical structure quantifies the patterns that the substantive arguments express qualitatively, in a way that tells us exactly how to carry out reverse reasoning for particular cases. If p(θ) expresses belief about Sue's θ prior to observing her responses, belief posterior to learning them is denoted as p(θ|x1,x2,x3) and is calculated by Bayes theorem as

p(θ|x1,x2,x3) ∝ p(θ) p(x1|θ) p(x2|θ) p(x3|θ).
Figure 10 gives the numerical details for a hypothetical example, calculated with a
variation of an important early result called Kelley’s formula for estimating true scores
(Kelley, 1927). Suppose that from a large number of students like Sue, we’ve estimated
that the measurement error variance is σE² = 25, and for the population of students, θ
follows a normal distribution with mean 50 and standard deviation 10. We now observe
Sue’s three scores, which take the values 70, 75, and 85. We see that the posterior
distribution for Sue’s θ is a normal distribution with mean 74.6 and standard deviation
2.8.
[[Figure 10—CTT numerical example—about here]]
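For readers who want to verify the arithmetic, the following sketch reproduces the posterior mean and standard deviation using the normal-normal (Kelley) updating described above. It is illustrative only; the prior parameters, error variance, and scores are the hypothetical values from the example.

```python
import math

# Hypothetical values from the example above.
prior_mean, prior_sd = 50.0, 10.0    # population distribution of theta
error_var = 25.0                     # measurement error variance, sigma_E^2
scores = [70.0, 75.0, 85.0]          # Sue's three observed scores

# In the normal-normal model, precisions (reciprocal variances) add.
prior_precision = 1.0 / prior_sd ** 2
data_precision = len(scores) / error_var
posterior_var = 1.0 / (prior_precision + data_precision)

# The posterior mean is a precision-weighted average of the prior mean and
# the mean of the observed scores -- Kelley's regressed true-score estimate.
data_mean = sum(scores) / len(scores)
posterior_mean = posterior_var * (prior_precision * prior_mean
                                  + data_precision * data_mean)

print(round(posterior_mean, 1), round(math.sqrt(posterior_var), 1))  # 74.6 2.8
```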
The additional backing that was used to bring the probability model into the
evidentiary argument was an analysis of data from students like Sue. Spearman’s (1904)
seminal insight was that if their structure is set up in the right wayiv, it is possible to
estimate the quantitative features of relationships like this, among both variables that
could be observed and others which by their nature never can be. The index of
measurement accuracy in CTT is the reliability coefficient ρ, which is the proportion of
variance in observed scores in a population of interest that is attributable to true scores as
opposed to the total variance, which is composed of true score variance and noise. It is
defined as follows:
ρ = σθ² / (σθ² + σE²), (2)

where σθ² is the variance of true scores in the population of examinees and σE² is the measurement error variance.
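A direct reading of Equation 2 is that reliability is the share of observed-score variance that is true-score variance. The numbers below are illustrative, not taken from the text.

```python
def reliability(true_var: float, error_var: float) -> float:
    """Classical reliability coefficient, Equation 2: true variance over total variance."""
    return true_var / (true_var + error_var)

# Illustrative variance components: if true-score variance is 100 and error
# variance is 25, four-fifths of the observed variance reflects true differences.
print(reliability(true_var=100.0, error_var=25.0))  # 0.8
```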
Validity
Some of the historical “flavors” of validity are statistical in nature. Predictive
validity is the degree to which scores in selection tests correlate with future performance.
Convergent validity looks for high correlations of a test’s scores with other sources of
evidence about the targeted knowledge and skills, while divergent validity looks for low
correlations with evidence about irrelevant factors (Campbell & Fiske, 1959). Concurrent
validity examines correlations with other tests presumed to provide evidence about the
same or similar knowledge and skills.
The idea is that the substantive considerations that justify an assessment's conception and construction can be put to empirical tests. In each of the cases mentioned above, relationships that would hold if the substantive argument were correct are posited among observable phenomena, and we see whether in fact they do hold; that is, we explore the nomothetic net.
These are all potential sources of backing for arguments for interpreting and using test
results, and they are at the same time explorations of plausible alternative explanations.
Consider, for example, assessments meant to support decisions about whether a student has attained some criterion of performance (Ercikan & Julian, 2001; Hambleton & Slater, 1997). These decisions are typically reported as proficiency or performance-level scores, which are increasingly seen as useful for communicating assessment results to students, parents, and the public, as well as for evaluating programs; they involve classifying examinee performance into a set of proficiency levels.
Rarely do the tasks on such a test exhaust the full range of performances and situations
users are interested in. Examining the validity of a proficiency test from this nomothetic-
net perspective would involve seeing whether students who do well on that test also
perform well in more extensive assessment, obtain high ratings from teachers or
employers, or succeed in subsequent training or job performance.
Statistical analyses of these kinds have always been important after the fact, as
significance-focused validity studies informed, constrained, and evaluated the use of a
test—but they rarely prompted more than minor modifications to its contents. Rather,
Embretson (1998) notes, substantive considerations have traditionally driven assessment
construction. Neither of two meaning-focused lines of justification that were considered
forms of validity used probability-based reasoning. They were content validity, which
concerned the nature and mix of items in a test, and face validity, which is what a test
appears to be measuring on the surface, especially to non-technical audiences. We will
see in our discussion of item response theory how statistical machinery is increasingly
being used in the exploration of construct representation as well as in after-the-fact validity
studies.
Reliability
Reliability, historically, was used to quantify the amount of variation in test scores
that reflected ‘true’ differences among students, as opposed to noise (Equation 2). The
correlations between parallel test forms used in classical test theory are one way to
estimate reliability in this sense. Internal consistency among test items, as gauged by the
KR-20 formula (Kuder & Richardson, 1937) or Cronbach’s (1951) Alpha coefficient, is
another. A contemporary view sees reliability as the evidentiary value that a given
realized or prospective body of data would provide for a claim—more specifically, the
amount of information for revising belief about an inference involving student-model
variables, be it an estimate for a given student, a comparison among students, or a
determination of whether a student has attained some criterion of performance.
A wide variety of specific indices or parameters can be used to characterize
evidentiary value. Carrying out a measurement procedure two or more times with
supposedly equivalent alternative tasks and raters will not only ground an estimate of its accuracy, as in Spearman's original procedures, but also demonstrate convincingly that there is some uncertainty to deal with in the first place (Brennan, 2000/in press).
20 and Cronbach’s alpha apply the idea of replication to tests that consist of multiple
items, by treating subsets of the items as repeated measures. These CTT indices of
reliability appropriately characterize the amount of evidence for comparing students in a
particular population with one another, but not necessarily for comparing them against a
fixed standard, or for comparisons in other populations, or for purposes of evaluating
schools or instructional programs. In this sense CTT indices of reliability are tied to
particular populations and inferences.
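As a concrete illustration of the internal-consistency idea, the sketch below computes Cronbach's alpha from a small examinee-by-item score matrix, using the standard formula alpha = k/(k−1) × (1 − sum of item variances / variance of total scores). The data are invented for illustration.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) matrix of item scores."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Toy data: five examinees answering four dichotomous items (1 = right, 0 = wrong).
x = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 0]])
print(round(cronbach_alpha(x), 2))  # 0.8 for this toy matrix
```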
Since reasoning about reliability takes place in the realm of the measurement model
(assuming that it is both correct and appropriate), it is possible to approximate the
evidentiary value not only of the data in hand, but also of similar data gathered in
somewhat different ways. Under CTT, the Spearman-Brown formula (Brown, 1910;
Spearman, 1910) can be used to approximate the reliability coefficient that would result
from doubling the length of a test:
ρdouble = 2ρ / (1 + ρ). (3)
That is, if ρ is the reliability of the original test, then ρdouble is the reliability of an
otherwise comparable test with twice as many items. Empirical checks have shown that
these predictions can hold up quite well—but not if the additional items differ as to their
content or difficulty, or if the new test is long enough to fatigue students. In these cases,
the real-world counterparts of the modeled relationships are stretched so far that the
results of reasoning through the model fail.
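Equation 3 is the doubling case of the more general Spearman-Brown prophecy formula, which projects reliability when a test is lengthened by any factor with comparable items. The sketch below uses that general form with an assumed original reliability of .70.

```python
def spearman_brown(rho: float, factor: float = 2.0) -> float:
    """Projected reliability when a test is lengthened by `factor`
    with comparable items; factor = 2 reproduces Equation 3."""
    return factor * rho / (1.0 + (factor - 1.0) * rho)

print(round(spearman_brown(0.70), 2))       # doubling: 2(.70)/(1 + .70) = 0.82
print(round(spearman_brown(0.70, 3.0), 2))  # tripling the test length: 0.88
```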
Extending this thinking to a wider range of inferences, generalizability theory
(Cronbach, Gleser, Nanda, and Rajaratnam, 1972) permits predictions for the accuracy of
similar tests with different numbers and configurations of raters, items, and so on. And
once the parameters of tasks have been estimated under an item response theory (IRT)
model, one can even assemble tests item by item for individual examinees on the fly, to
maximize the accuracy with which each is assessed. (Later we’ll point to some “how-to”
references for g-theory and IRT.)
Typical measures of accuracy used in CTT are not sufficient for examining the accuracy of the criterion-of-performance decisions discussed above. In the CTT framework, classification accuracy is defined as the extent to which classifications of students based on their observed test scores agree with those based on their true scores (Traub & Rowley, 1980). One of the two commonly used measures of classification accuracy is a
simple measure of agreement, p0, defined as

p0 = Σl=1…L pll ,
where pll represents the proportion of examinees who were classified into the same
proficiency level (l=2,..,5) according to their true score and observed score. The second
is Cohen’s κ coefficient (Cohen, 1960). This statistic is similar to the proportion
agreement p0 , except that it is corrected for the agreement which is due to chance. The
coefficient is defined as
κ = (p0 − pc) / (1 − pc),

where

pc = Σl=1…L pl· p·l .
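Both agreement measures can be computed from a cross-classification of examinees by true and observed proficiency level. The table below is hypothetical; p0 is the sum of the diagonal proportions, and kappa corrects it for the chance agreement implied by the margins.

```python
import numpy as np

def classification_agreement(counts: np.ndarray) -> tuple:
    """Return (p0, kappa) from an L x L table whose cell (l, m) counts
    examinees at true level l classified into observed level m."""
    p = counts / counts.sum()                   # convert counts to proportions
    p0 = float(np.trace(p))                     # agreement: diagonal proportions
    pc = float(p.sum(axis=1) @ p.sum(axis=0))   # chance agreement from the margins
    kappa = (p0 - pc) / (1.0 - pc)
    return p0, kappa

# Hypothetical cross-classification of 200 examinees over three levels.
counts = np.array([[40,  8,  2],
                   [10, 60, 10],
                   [ 2,  8, 60]])
p0, kappa = classification_agreement(counts)
print(round(p0, 2), round(kappa, 2))  # 0.8 and roughly 0.70
```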
Generalizability Theory
Generalizability Theory (g-theory) extends classical test theory by allowing us to
examine how different aspects of the observational setting affect the evidentiary value of
test scores. As in CTT, the student is characterized by overall proficiency in some
domain of tasks. However, the measurement model can now include parameters that
correspond to “facets” of the observational situation such as features of tasks (i.e., task-
model variables), numbers and designs of raters, and qualities of performance that will be
evaluated. An observed score of a student in a generalizability study of an assessment
consisting of different item types and judgmental scores is an elaboration of the basic
CTT equation:
Xijk = θi + τj + ςk + Eijk ,

where θi is the proficiency of Student i, τj and ςk are effects associated with facets of the observational situation (here, item types and judgmental ratings), and Eijk is residual error. A generalizability coefficient analogous to the CTT reliability coefficient (Equation 2) takes the form

α = σθ² / (σθ² + στ² + (σς² + σE²)/2).
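To make the decomposition concrete, here is a hypothetical calculation of a generalizability coefficient from estimated variance components, following the form shown above (with the rating and residual components averaged over two replications of that facet). All of the variance components below are invented for illustration.

```python
# Hypothetical variance components for X_ijk = theta_i + tau_j + zeta_k + E_ijk.
var_theta = 80.0   # person (universe-score) variance
var_tau = 10.0     # item-type facet variance
var_zeta = 6.0     # rating facet variance
var_error = 24.0   # residual variance

# Generalizability coefficient in the form given above, with the rating and
# residual components averaged over two replications of that facet.
g_coefficient = var_theta / (var_theta + var_tau + (var_zeta + var_error) / 2)
print(round(g_coefficient, 2))  # 0.76
```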
Item Response Theory

Under item response theory (IRT), the probability of a response xj to Item j is modeled as a function of the student's proficiency θ and one or more parameters βj of the item:

f(xj; θ, βj). (4)

Under the Rasch (1960/1980) model for dichotomous (right/wrong) items, for example, the probability of a correct response takes the following form:

Prob(Xij = 1 | θi, βj) = f(1; θi, βj) = Ψ(θi − βj), (5)
where Xij is the response of Student i to Item j, 1 if right and 0 if wrong; θi is the
proficiency parameter of Student i; βj is the difficulty parameter of Item j; and Ψ(⋅) is the
logistic function, Ψ(x) = exp(x)/[1+exp(x)]. The probability of an incorrect response is
then
Prob(Xij=0|θi,βj) = f(0;θi,βj) = 1-Ψ(θi - βj). (6)
Taken together, Equations 5 and 6 specify a particular form for the item response
function, Equation 4. Figure 11 depicts Rasch item response curves for two items: Item 1, an easy one with β1 = −1, and Item 2, a hard one with β2 = 2. It shows the probability of a
correct response to each of the items for different values of θ. For both items, the
probability of a correct response increases toward one as θ increases. Conditional
independence means that for a given value of θ, the probability of Student i making
responses xi1 and xi2 to the two items is the product of terms like Equations 5 and 6:

Prob(Xi1 = xi1, Xi2 = xi2 | θi) = f(xi1; θi, β1) × f(xi2; θi, β2).
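The sketch below evaluates Equations 5 and 6 for the two example items (β1 = −1, β2 = 2) and multiplies the terms together under conditional independence. The proficiency value and response pattern are arbitrary choices for illustration.

```python
import math

def rasch_prob(theta: float, beta: float, x: int) -> float:
    """Rasch probability of response x (1 = right, 0 = wrong), Equations 5-6."""
    p_correct = 1.0 / (1.0 + math.exp(-(theta - beta)))   # Psi(theta - beta)
    return p_correct if x == 1 else 1.0 - p_correct

betas = [-1.0, 2.0]   # the easy and hard items from the example
theta = 0.5           # an arbitrary proficiency value
responses = [1, 0]    # right on the easy item, wrong on the hard one

# Conditional independence: the joint probability is the product over items.
likelihood = math.prod(rasch_prob(theta, b, x) for b, x in zip(betas, responses))
print(round(likelihood, 3))  # about 0.668
```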
The amount of information about θ available from Item j, Ij(θ), can be calculated as a
function of θ, βj, and the functional form of f (see the references mentioned above for
formulas for particular IRT models). Under IRT, the amount of information for
measuring proficiency at each point along the scale is simply the sum of these item-by-
item information functions. The square root of the reciprocal of this value is the standard
error of estimation, or the standard deviation of estimates of θ around its true value.
Figure 13 is the test information curve that corresponds to the two items in the preceding
example. It is of particular importance in IRT that once item parameters have been
estimated (“calibrating” them), estimating individual students’ θs and calculating the
accuracy of those estimates can be accomplished for any subset of items. Easy items can
be administered to fourth graders and harder ones to fifth graders, for example, but all
scores arrive on the same θ scale. Different test forms can be given as pretests and
posttests, and differences of difficulty and accuracy are taken into account.
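For the Rasch model specifically, the item information function takes the simple form Ij(θ) = Pj(θ)[1 − Pj(θ)], a standard result not spelled out above. The sketch below sums it over the two example items and converts test information into a standard error of estimation at several values of θ.

```python
import math

def rasch_item_information(theta: float, beta: float) -> float:
    """Rasch item information: P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + math.exp(-(theta - beta)))
    return p * (1.0 - p)

def standard_error(theta: float, betas) -> float:
    """Test information is the sum of item informations; the standard error
    of estimation is the reciprocal of its square root."""
    info = sum(rasch_item_information(theta, b) for b in betas)
    return 1.0 / math.sqrt(info)

betas = [-1.0, 2.0]   # the two items from the example
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(standard_error(theta, betas), 2))
```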
Extensions of IRT
We have just seen how IRT extends statistical modeling beyond the constraints of
classical test theory and generalizability theory. The simple elements in the basic
equation of IRT (Equation 4) can be elaborated in several ways, each time expanding the
range of assessment situations to which probability-based reasoning can be applied in the
pursuit of psychometric principles.
Multiple-category responses. Whereas IRT was originally developed with
dichotomous (right/wrong) test items, researchers have extended the machinery to
observations that are coded in multiple categories. This is particularly useful for
performance assessment tasks that are evaluated by raters on, say, 0-5 scales. Samejima
(1969) carried out pioneering work in this regard. Thissen and Steinberg (1986) explain
the mathematics of the extension and provide a useful taxonomy of multiple-category IRT
models, and Wright and Masters (1982) offer a readable introduction to their use.
Rater models. The preceding paragraph mentioned that multiple-category IRT models
are useful in performance assessments with judgmental rating scales. But judges
themselves are sources of uncertainty, as even knowledgeable and well-meaning raters
rarely agree perfectly. Generalizability theory, discussed earlier, incorporates the overall
impact of rater variation on scores. Adding terms for individual raters into the IRT
framework goes further, so that we can adjust for their particular effects, offer training
when it is warranted, and identify questionable ratings with greater sensitivity. Recent
work along these lines is illustrated by Patz and Junker (1999) and Linacre (1989).
Conditional dependence. Standard IRT assumes that responses to different items are
independent once we know the item parameters and examinee’s θ. This is not strictly
true when several items concern the same stimulus, as in paragraph comprehension tests.
Knowledge of the content tends to improve performance on all items in the set, while
misunderstandings tend to depress all, in ways that don’t affect items from other sets.
Ignoring these dependencies leads one to overestimate the information in the responses.
The problem is more pronounced in complex tasks when responses to one subtask depend
on results from an earlier subtask, or when multiple ratings of different aspects of the
same performance are obtained. Wainer and his colleagues (e.g., Wainer & Keily, 1987;
Bradlow, Wainer, & Wang, 1999) have studied conditional dependence in the context of
IRT. This line of work is particularly important for tasks in which several aspects of the
same complex performance must be evaluated (Yen, 1993).
Multiple attribute models. Standard IRT posits a single proficiency to “explain”
performance on all the items in a domain. One can extend the model to situations in
which multiple aspects of knowledge and skill are required in different mixes in different
items. One stream of research on multivariate IRT follows the tradition of factor analysis,
using analogous models and focusing on estimating structures from tests more or less as
they come to the analyst from the test developers (e.g., Reckase, 1985). Another stream
starts from multivariate conceptions of knowledge, and constructs tasks that contain
evidence of that knowledge in theory-driven ways (e.g., Adams, Wilson, & Wang, 1997).
As such, this extension fits in neatly with the task-construction extensions discussed in
the following paragraph. Either way, having a richer syntax to describe examinees within
the probability-based argument supports more nuanced discussions of knowledge and the
ways it is revealed in task performances.
Incorporating item features into the model. Embretson (1983) not only argued for
paying greater attention to construct representation in test design, she argued for how to
do it: Incorporate task model variables into the statistical model, and make explicit the
ways that features of tasks impact examinees’ performance. A signal article in this regard
was Fischer’s (1973) linear logistic test model, or LLTM. The LLTM is a simple
extension of the Rasch model shown above in Equation 5, with the further requirement
that each item difficulty parameter β is the sum of effects that depend on the features of
that particular item:
βj = Σk=1…m qjk ηk ,

where ηk is the contribution to item difficulty from Feature k, and qjk is the extent to
which Feature k is represented in Item j. Some of the substantive considerations that
drive task design can thus be embedded in the statistical model, and the tools of
probability-based reasoning are available to examine how well they hold up in practice
(validity), how they affect measurement precision (reliability), how they can be varied
while maintaining a focus on targeted knowledge (comparability), and whether some
items prove hard or easy for unintended reasons (fairness). Embretson (1998) walks
through a detailed example of test design, psychometric modeling, and construct
validation from this point of view. Additional contributions along these lines can be
found in the work of Tatsuoka (1990), Falmagne and Doignon (1988), Pirolli and Wilson
(1998), and DiBello, Stout, and Roussos (1995).
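A minimal sketch of the LLTM decomposition just described: item difficulties are generated as weighted sums of feature effects, βj = Σk qjk ηk. The feature matrix Q and the effects η below are hypothetical.

```python
import numpy as np

# Hypothetical incidence matrix Q: q_jk = 1 if Item j involves Feature k.
Q = np.array([[1, 0, 0],    # item 1 involves only feature 1
              [1, 1, 0],    # item 2 involves features 1 and 2
              [1, 1, 1]])   # item 3 involves all three features
eta = np.array([-0.5, 0.8, 1.1])   # difficulty contribution of each feature

beta = Q @ eta          # LLTM-implied item difficulties, beta_j = sum_k q_jk * eta_k
print(beta)             # [-0.5  0.3  1.4]
```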
Conclusion
These are days of rapid change in assessment.vi Advances in cognitive psychology
deepen our understanding of how students gain and use knowledge (National Research
Council, 1999). Advances in technology make it possible to capture more complex
performances in assessment settings, by including, for example, simulation, interactivity,
collaboration, and constructed response (Bennett, 2001). Yet as forms of assessment
evolve, two themes endure: The importance of psychometric principles as guarantors of
social values, and their realization through sound evidentiary arguments.
We have seen that the quality of assessment depends on the quality of the evidentiary
argument, and how substance, statistics, and purpose must be woven together throughout
the argument. A conceptual framework such as the assessment design models of Figure 1
helps experts from different fields integrate their diverse work to achieve this end
(Mislevy, Steinberg, Almond, Haertel, and Penuel, in press). Questions will persist, as to
‘How do we synthesize evidence from disparate sources?’, ‘How much evidence do we
have?’, ‘Does it tell us what we think it does?’, and ‘Are the inferences appropriate for
each student?’ The perspectives and the methodologies that underlie psychometric
principles—validity, reliability, comparability, and fairness—provide formal tools to
address these questions, in whatever specific forms they arise.
Notes
i
We are indebted to Prof. David Schum for our understanding of evidentiary reasoning,
such as it is. The first part of this section draws on Schum (1987, 1994) and Kadane &
Schum (1996).
ii
p(Xj|θ) is the probability density function for the random variable Xj, given that θ is
fixed at a specified value.
iii
Strictly speaking, CTT does not address the full distributions of true and observed
scores, only means, variances, and covariances. But we want to illustrate probability-
based reasoning and review CTT at the same time. Assuming normality for θ and E is the
easiest way to do this, since the first two moments are sufficient for normal distributions.
iv
In statistical terms, if the parameters are identified. Conditional independence is key,
because CI relationships enable us to make multiple observations that are assumed to
depend on the same unobserved variables in ways we can model. This generalizes the
concept of replication that grounds reliability analysis.
v
See Greeno, Collins, & Resnick (1996) for an overview of these three perspectives on
learning and knowing, and discussion of their implications for instruction and assessment.
vi
Knowing what students know (National Research Council, 2001), a report by the
Committee on the Foundations of Assessment, surveys these developments.
References
Adams, R., Wilson, M.R., & Wang, W.-C. (1997). The multidimensional random
coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-23.
Almond, R. G., & Mislevy, R. J. (1999). Graphical models and computerized adaptive
testing. Applied Psychological Measurement.
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1999). Standards for educational
and psychological testing. Washington, D.C.: American Educational Research
Association.
Anderson, J.R., Boyle, C.F., & Corbett, A.T. (1990). Cognitive modelling and intelligent
tutoring. Artificial Intelligence, 42, 7-49.
Bennett, R. E. (2001). How the internet will help large-scale assessment reinvent itself.
Education Policy Analysis, 9(5).
Bradlow, E.T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for
testlets. Psychometrika, 64, 153-168.
Brennan, R. L. (1983). The elements of generalizability theory. Iowa City, IA:
American College Testing Program.
Brennan, R. L. (2000/in press). An essay on the history and future of reliability from the
perspective of replications. Paper presented at the Annual Meeting of the National
Council on Measurement in Education, New Orleans, April 2000. To appear in the
Journal of Educational Measurement.
Brown, W. (1910). Some experimental results in the correlation of mental abilities.
British Journal of Psychology, 3, 296-322.
Bryk, A. S., & Raudenbush, S. (1992). Hierarchical linear models: Applications and
data analysis methods. Newbury Park: Sage Publications.
Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cohen, J. A. (1960). A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20, 37-46.
Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests.
Psychometrika, 17, 297-334.
Cronbach, L.J. (1989). Construct validation after thirty years. In R.L. Linn (Ed.),
Intelligence: Measurement, theory, and public policy (pp.147-171). Urbana, IL:
University of Illinois Press.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability
of behavioral measurements: Theory of generalizability for scores and profiles. New
York: Wiley.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests.
Psychological Bulletin, 52, 281-302.
Dayton, C. M. (1999). Latent class scaling analysis. Thousand Oaks, CA: Sage
Publications.
Dibello, L.V., Stout, W.F., & Roussos, L.A. (1995). Unified cognitive/psychometric
diagnostic assessment likelihood based classification techniques. In P. Nichols, S.
Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361-389).
Hillsdale, NJ: Erlbaum.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic
span. Psychological Bulletin, 93, 179-197.
Embretson, S. E. (1998). A cognitive design systems approach to generating valid tests:
Application to abstract reasoning. Psychological Methods, 3, 380-396.
Ercikan, K. (1998). Translation effects in international assessments. International
Journal of Educational Research, 29, 543-553.
Ercikan, K., & Julian, M. (2001, in press). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied
Measurement in Education.
Falmagne, J.-C., & Doignon, J.-P. (1988). A class of stochastic procedures for the
assessment of knowledge. British Journal of Mathematical and Statistical
Psychology, 41, 1-23.
Fischer, G.H. (1973). The linear logistic test model as an instrument in educational
research. Acta Psychologica, 37, 359-374.
Gelman, A., Carlin, J., Stern, H., and Rubin, D. B. (1995). Bayesian data analysis.
London: Chapman and Hall.
Greeno, J. G., Collins, A. M., & Resnick, L. B. (1996). Cognition and learning. In D. C.
Berliner and R. C. Calfee (Eds.), Handbook of educational psychology (pp. 15-146).
New York: MacMillan.
Gulliksen, H. (1950/1987). Theory of mental tests. New York: John Wiley/Hillsdale,
NJ; Lawrence Erlbaum.
Haertel, E.H. (1989). Using restricted latent class models to map the skill structure of
achievement test items. Journal of Educational Measurement, 26, 301-321.
Haertel, E.H., & Wiley, D.E. (1993). Representations of ability structures: Implications
for testing. In N. Frederiksen, R.J. Mislevy, and I.I. Bejar (Eds.), Test theory for a
new generation of tests. Hillsdale, NJ: Lawrence Erlbaum.
Hambleton, R. J. (1993). Principles and selected applications of item response theory.
In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). Phoenix, AZ:
American Council on Education/Oryx Press.
Hambleton, R. K. & Slater, S. C. (1997). Reliability of credentialing examinations and
the impact of scoring models and standard-setting policies. Applied Measurement in
Education, 10, 19-39.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-
Haenzsel procedures. In H. Wainer and H. I. Braun (Eds.), Test validity (pp. 129-
145). Hillsdale, NJ: Lawrence Erlbaum.
Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ:
Lawrence Erlbaum.
Jöreskog, K. G., and Sörbom, D. (1979). Advances in factor analysis and structural
equation models. Cambridge, MA: Abt Books.
Kadane, J.B., & Schum, D.A. (1996). A probabilistic analysis of the Sacco and Vanzetti
evidence. New York: Wiley.
Kane, M.T. (1992). An argument-based approach to validity. Psychological Bulletin,
112, 527-535.
Kelley, T.L. (1927). Interpretation of Educational Measurements. New York: World
Book.
Kuder, G.F., & Richardson, M.W. (1937). The theory of estimation of test reliability.
Psychometrika, 2, 151-160.
Lane, W., Wang, N., & Magone, M. (1996). Gender-related differential item functioning
on a middle-school mathematics performance assessment. Educational measurement:
Issues and practice, 15(4), 21-27; 31.
Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure
analysis. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. R. Lazarsfeld, S. A. Star,
and J. A Clausen (Eds.), Measurement and prediction (pp.362-412). Princeton, NJ:
Princeton University Press.
Levine, M., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and
validating studies. British Journal of Mathematical and Statistical Psychology, 35,
42-56.
Linacre, J. M. (1989). Many faceted Rasch measurement. Doctoral Dissertation.
University of Chicago.
Lord, F. M. (1980). Applications of item response theory to practical testing problems.
Hillsdale, NJ: Lawrence Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading,
MA: Addison-Wesley.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp.
13-103). New York: American Council on Education/Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of
performance assessments. Educational Researcher, 23(2), 13-23.
Messick, S., Beaton, A.E., & Lord, F.M. (1983). National Assessment of Educational
Progress reconsidered: A new design for a new era. NAEP Report 83-1. Princeton,
NJ: National Assessment for Educational Progress.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (in press). On the roles of task model
variables in assessment design. To appear in S. Irvine & P. Kyllonen (Eds.),
Generating items for cognitive tests: Theory and practice. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., Steinberg, L.S., Almond, R.G., Haertel, G., & Penuel, W. (in press).
Leverage points for improving educational assessment. In B. Means & G. Haertel
(Eds.), Evaluating the effects of technology in education. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (1999). A
cognitive task analysis, with implications for designing a simulation-based assessment
system. Computers and Human Behavior, 15, 335-374.
Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (in press).
Making sense of data from complex assessment. Applied Measurement in Education.
Myford, C. M., & Mislevy, R. J. (1995). Monitoring and improving a portfolio
assessment system (Center for Performance Assessment Research Report). Princeton,
NJ: Educational Testing Service.
National Research Council (1999). How people learn: Brain, mind, experience, and
school. Committee on Developments in the Science of Learning. Bransford, J. D.,
Brown, A. L., and Cocking, R. R. (Eds.). Washington, DC: National Academy Press.
National Research Council (2001). Knowing what students know: The science and design
of educational assessment. Committee on the Foundations of Assessment.
Pellegrino, J., Chudowsky, N., and Glaser, R., (Eds.). Washington, DC: National
Academy Press.
O’Neil, K. A., & McPeek, W. M., (1993). In P. W. Holland, & H. Wainer (Eds.),
Differential item functioning. (pp. 255-276). Hillsdale, NJ: Erlbaum.
Patz, R. J., & Junker, B. W. (1999). Applications and extensions of MCMC in IRT:
Multiple item types, missing data, and rated responses. Journal of Educational and
Behavioral Statistics, 24(4), 342-366.
Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In
R.L. Linn (Ed.), Educational measurement (3rd Ed.) (pp. 221-262). New York:
American Council on Education/Macmillan.
Pirolli, P., & Wilson, M. (1998). A theory of the measurement of knowledge content,
access, and learning. Psychological Review 105(1), 58-82.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests.
Copenhagen: Danish Institute for Educational Research/Chicago: University of
Chicago Press (reprint).
Reckase, M. (1985). The difficulty of test items that measure more than one ability.
Applied Psychological Measurement, 9, 401-412.
Rogosa, D.R., & Ghandour, G.A. (1991). Statistical models for behavioral
observations(with discussion). Journal of Educational Statistics, 16, 157-252.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded
scores. Psychometrika Monograph No. 17, 34, (No. 4, Part 2).
Samejima, F. (1973). Homogeneous case of the continuous response level.
Psychometrika, 38, 203-219.
Schum, D.A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD:
University Press of America.
Schum, D.A. (1994). The evidential foundations of probabilistic reasoning. New York:
Wiley.
SEPUP (1995). Issues, evidence, and you: Teacher’s guide. Berkeley: Lawrence Hall
of Science.
Shavelson, R. J., & Webb, N. W. (1991). Generalizability theory: A primer. Newbury
Park, CA: Sage Publications.
Spearman, C. (1904). The proof and measurement of association between two things.
American Journal of Psychology, 15, 72-101.
Spearman, C. (1910). Correlation calculated with faulty data. British Journal of
Psychology, 3, 271-295.
Spiegelhalter, D.J., Thomas, A., Best, N.G., & Gilks, W.R. (1995). BUGS: Bayesian
inference using Gibbs sampling, Version 0.50. Cambridge: MRC Biostatistics Unit.
Tatsuoka, K.K. (1990). Toward an integration of item response theory and cognitive
error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M.G. Shafto, (Eds.),
Diagnostic monitoring of skill and knowledge acquisition (pp. 453-488). Hillsdale,
NJ: Erlbaum.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models.
Psychometrika, 51, 567-577.
Toulmin, S. (1958). The uses of argument. Cambridge, England: University Press.
Traub, R. E. & Rowley, G. L. (1980). Reliability of test scores and decisions. Applied
Psychological Measurement, 4, 517-545.
van der Linden, W. J. (1998). Optimal test assembly. Applied Psychological
Measurement, 22, 195-202.
van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response
theory. New York: Springer.
Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., &
Thissen, D. (2000). Computerized adaptive testing: A primer (second edition).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Wainer, H., & Keily, G. L. (1987). Item clusters and computerized adaptive testing: A
case for testlets. Journal of Educational Measurement, 24, 195-201.
Wiley, D.E. (1991). Test validity and invalidity reconsidered. In R.E. Snow & D.E.
Wiley (Eds.), Improving inquiry in social science (pp. 75-107). Hillsdale, NJ:
Erlbaum.
Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ:
Lawrence Erlbaum.
Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment
system. Applied Measurement in Education, 13(2), 181-208.
Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well:
Investigating new forms of student assessment. In G. Grant (Ed.), Review of
Research in Education, Vol. 17 (pp. 31-74). Washington, DC: American Educational
Research Association.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Yen, W.M. (1993). Scaling performance assessments: Strategies for managing local item
dependence. Journal of Educational Measurement, 30, 187-213.
Figure 1
General Form of the Assessment Design Models
[Diagram: a Student Model of latent variables (θ's), Evidence Model(s) comprising a Measurement Model and a Scoring Model, and Task Model(s).]
Understanding Concepts (U)--Understanding scientific concepts (such as
properties and interactions of materials, energy, or thresholds) in order to apply
the relevant scientific concepts to the solution of problems. This variable is the
IEY version of the traditional “science content”, although this content is not just
“factoids”.
Designing and Conducting Investigations (I)--Designing a scientific
experiment, carrying through a complete scientific investigation, performing
laboratory procedures to collect data, recording and organizing data, and analyzing
and interpreting results of an experiment. This variable is the IEY version of the
traditional “science process”.
Evidence and Tradeoffs (E)--Identifying objective scientific evidence as well as
evaluating the advantages and disadvantages of different possible solutions to a
problem based on the available evidence.
Communicating Scientific Information (C)--Organizing and presenting results
in a way that is free of technical errors and effectively communicates with the
chosen audience.
Figure 2
The Variables in the Student Model for the BEAR “Issues, Evidence, and You” Example
Figure 3
[Diagram: the student-model variables θE, θU, θI, and θC, drawn as correlated latent variables.]
Figure 4
[The example assessment task designed to elicit evidence about the Evidence and Tradeoffs variable.]

Figure 5
[An example student response to the task in Figure 4.]
[Excerpt from the scoring guide: criteria for the two observables at score levels 2, 1, and 0.]

Score 2:
- Response provides some objective reasons AND some supporting evidence, BUT at least one reason is missing and/or part of the evidence is incomplete.
- Response states at least one perspective of the issue AND provides some objective reasons using some relevant evidence, BUT reasons are incomplete and/or part of the evidence is missing; OR only one complete and accurate perspective has been provided.

Score 1:
- Response provides only subjective reasons (opinions) for choice and/or uses inaccurate or irrelevant evidence from the activity.
- Response states at least one perspective of the issue BUT only provides subjective reasons and/or uses inaccurate or irrelevant evidence.

Score 0:
- No response; illegible response; response offers no reasons AND no evidence to support choice made.
- No response; illegible response; response lacks reasons AND offers no evidence to support decision made.
Figure 6
The Scoring Model for Evaluating Two Observable Variables from Task Responses
in the BEAR Assessment
Figure 7
Graphical Representation of the Measurement Model for the BEAR Sample Task Linked to the Evidence and Tradeoffs Variable
[Diagram: the observables "Using Evidence" and "Using Evidence to Make Tradeoffs" connected to θE among the student-model variables θE, θU, θI, and θC.]
Figure 8
A Toulmin Diagram for a Simple Assessment Situation
Figure 9
[Diagram: the latent variable θ with prior p(θ) and conditional distributions p(X1|θ), p(X2|θ), and p(X3|θ) linking it to the observables X1, X2, and X3.]
Figure 10
CTT Numerical Example
Figure 11
[Rasch item response curves for the two example items, plotted against θ from -5 to 5.]
Figure 12
[Item information curves for Item 1 and Item 2, plotted against θ from -5 to 5.]
Figure 13
[Test information curve corresponding to the two items in the example.]