
Validity of Psychological Assessment

Validation of Inferences From Persons' Responses and Performances as Scientific Inquiry Into Score Meaning
Samuel Messick
Educational Testing Service

American Psychologist, September 1995, Vol. 50, No. 9, 741-749. Copyright 1995 by the American Psychological Association, Inc. 0003-066X/95/$2.00.

The traditional conception of validity divides it into three separate and substitutable types—namely, content, criterion, and construct validities. This view is fragmented and incomplete, especially because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of score use. The new unified concept of validity interrelates these issues as fundamental aspects of a more comprehensive theory of construct validity that addresses both score meaning and social values in test interpretation and test use. That is, unified validity integrates considerations of content, criteria, and consequences into a construct framework for the empirical testing of rational hypotheses about score meaning and theoretically relevant relationships, including those of an applied and a scientific nature. Six distinguishable aspects of construct validity are highlighted as a means of addressing central issues implicit in the notion of validity as a unified concept. These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity. In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement, including performance assessments, which are discussed in some detail because of their increasing emphasis in educational and employment settings.

Editor's note. Samuel M. Turner served as action editor for this article.

This article was presented as a keynote address at the Conference on Contemporary Psychological Assessment, Arbetspsykologiska Utvecklingsinstitutet, June 7-8, 1994, Stockholm, Sweden.

Author's note. Acknowledgments are gratefully extended to Isaac Bejar, Randy Bennett, Drew Gitomer, Ann Jungeblut, and Michael Zieky for their reviews of various versions of the manuscript. Correspondence concerning this article should be addressed to Samuel Messick, Educational Testing Service, Princeton, NJ 08541.

Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment (Messick, 1989b). Validity is not a property of the test or assessment as such, but rather of the meaning of the test scores. These scores are a function not only of the items or stimulus conditions, but also of the persons responding as well as the context of the assessment. In particular, what needs to be valid is the meaning or interpretation of the score, as well as any implications for action that this meaning entails (Cronbach, 1971). The extent to which score meaning and action implications hold across persons or population groups and across settings or contexts is a persistent and perennial empirical question. This is the main reason that validity is an evolving property and validation a continuing process.

The Value of Validity

The principles of validity apply not just to interpretive and action inferences derived from test scores as ordinarily conceived, but also to inferences based on any means of observing or documenting consistent behaviors or attributes. Thus, the term score is used generically in its broadest sense to mean any coding or summarization of observed consistencies or performance regularities on a test, questionnaire, observation procedure, or other assessment devices such as work samples, portfolios, and realistic problem simulations.

This general usage subsumes qualitative as well as quantitative summaries. It applies, for example, to behavior protocols, to clinical appraisals, to computerized verbal score reports, and to behavioral or performance judgments or ratings. Scores in this sense are not limited to behavioral consistencies and attributes of persons (e.g., persistence and verbal ability). Scores may also refer to functional consistencies and attributes of groups, situations or environments, and objects or institutions, as in measures of group solidarity, situational stress, quality of artistic products, and such social indicators as school dropout rate.

Hence, the principles of validity apply to all assessments, including performance assessments. For example, student portfolios are often the source of inferences—not just about the quality of the included products but also about the knowledge, skills, or other attributes of the student—and such inferences about quality and constructs need to meet standards of validity. This is important because performance assessments, although long a staple of industrial and military applications, are now touted as purported instruments of standards-based education reform because they promise positive consequences for teaching and learning.
Indeed, it is precisely because of such politically salient potential consequences that the validity of performance assessment needs to be systematically addressed, as do other basic measurement issues such as reliability, comparability, and fairness. The latter reference to fairness broaches a broader set of equity issues in testing that includes fairness of test use, freedom from bias in scoring and interpretation, and the appropriateness of the test-based constructs or rules underlying decision making or resource allocation, that is, distributive justice (Messick, 1989b).

These issues are critical for performance assessment—as they are for all educational and psychological assessment—because validity, reliability, comparability, and fairness are not just measurement principles, they are social values that have meaning and force outside of measurement whenever evaluative judgments and decisions are made. As a salient social value, validity assumes both a scientific and a political role that can by no means be fulfilled by a simple correlation coefficient between test scores and a purported criterion (i.e., classical criterion-related validity) or by expert judgments that test content is relevant to the proposed test use (i.e., traditional content validity).

Indeed, validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual—as well as potential—consequences of score interpretation and use (i.e., construct validity conceived comprehensively). This comprehensive view of validity integrates considerations of content, criteria, and consequences into a construct framework for empirically testing rational hypotheses about score meaning and utility. Therefore, it is fundamental that score validation is an empirical evaluation of the meaning and consequences of measurement. As such, validation combines scientific inquiry with rational argument to justify (or nullify) score interpretation and use.

[Photo: Samuel Messick. Photo by William Monachan, Educational Testing Service, Princeton, NJ.]

Comprehensiveness of Construct Validity

In principle as well as in practice, construct validity is based on an integration of any evidence that bears on the interpretation or meaning of the test scores—including content- and criterion-related evidence—which are thus subsumed as part of construct validity. In construct validation the test score is not equated with the construct it attempts to tap, nor is it considered to define the construct, as in strict operationism (Cronbach & Meehl, 1955). Rather, the measure is viewed as just one of an extensible set of indicators of the construct. Convergent empirical relationships reflecting communality among such indicators are taken to imply the operation of the construct to the degree that discriminant evidence discounts the intrusion of alternative constructs as plausible rival hypotheses.

A fundamental feature of construct validity is construct representation, whereby one attempts to identify through cognitive-process analysis or research on personality and motivation the theoretical mechanisms underlying task performance, primarily by decomposing the task into requisite component processes and assembling them into a functional model or process theory (Embretson, 1983). Relying heavily on the cognitive psychology of information processing, construct representation refers to the relative dependence of task responses on the processes, strategies, and knowledge (including metacognitive or self-knowledge) that are implicated in task performance.

Sources of Invalidity

There are two major threats to construct validity: In the one known as construct underrepresentation, the assessment is too narrow and fails to include important dimensions or facets of the construct. In the threat to validity known as construct-irrelevant variance, the assessment is too broad, containing excess reliable variance associated with other distinct constructs as well as method variance such as response sets or guessing propensities that affects responses in a manner irrelevant to the interpreted construct. Both threats are operative in all assessments. Hence a primary validation concern is the extent to which the same assessment might underrepresent the focal construct while simultaneously contaminating the scores with construct-irrelevant variance.

There are two basic kinds of construct-irrelevant variance. In the language of ability and achievement testing, these might be called construct-irrelevant difficulty and construct-irrelevant easiness. In the former, aspects of the task that are extraneous to the focal construct make the task irrelevantly difficult for some individuals or groups. An example is the intrusion of undue reading comprehension requirements in a test of subject matter knowledge. In general, construct-irrelevant difficulty leads to construct scores that are invalidly low for those individuals adversely affected (e.g., knowledge scores of poor readers or examinees with limited English proficiency).
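The contaminating mechanism just described can be made concrete with a toy simulation (a sketch under assumed parameter values; the variable names and coefficients below are illustrative, not from the article). An observed "knowledge" score that also demands heavy reading carries reliable variance from reading ability, so examinees low in reading receive invalidly low scores even when their knowledge is held constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Latent, independent attributes (illustrative assumption).
knowledge = rng.normal(size=n)   # focal construct
reading = rng.normal(size=n)     # irrelevant to the construct being reported

# A subject matter test with undue reading demands: the observed score
# contains reliable variance from both attributes, plus random error.
observed = 0.7 * knowledge + 0.5 * reading + rng.normal(scale=0.5, size=n)

# Holding knowledge constant, scores still track reading ability:
# this residual correlation is the construct-irrelevant component.
residual = observed - 0.7 * knowledge
r_irrelevant = float(np.corrcoef(residual, reading)[0, 1])

# Poor readers' mean score is invalidly low relative to strong readers',
# even though both groups were generated with identical knowledge.
poor_readers = float(observed[reading < -1].mean())
strong_readers = float(observed[reading > 1].mean())
```

Under these assumed coefficients the residual correlation comes out around .7, and the poor readers' mean falls well below the strong readers' mean despite equal underlying knowledge distributions.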


Of course, if concern is solely with criterion prediction and the criterion performance requires reading skill as well as subject matter knowledge, then both sources of variance would be considered criterion-relevant and valid. However, for score interpretations in terms of subject matter knowledge and for any score uses based thereon, undue reading requirements would constitute construct-irrelevant difficulty.

Indeed, construct-irrelevant difficulty for individuals and groups is a major source of bias in test scoring and interpretation and of unfairness in test use. Differences in construct-irrelevant difficulty for groups, as distinct from construct-relevant group differences, are the major culprit sought in analyses of differential item functioning (Holland & Wainer, 1993).

In contrast, construct-irrelevant easiness occurs when extraneous clues in item or task formats permit some individuals to respond correctly or appropriately in ways irrelevant to the construct being assessed. Another instance occurs when the specific test material, either deliberately or inadvertently, is highly familiar to some respondents, as when the text of a reading comprehension passage is well-known to some readers or the musical score for a sight-reading exercise invokes a well-drilled rendition for some performers. Construct-irrelevant easiness leads to scores that are invalidly high for the affected individuals as reflections of the construct under scrutiny.

The concept of construct-irrelevant variance is important in all educational and psychological measurement, including performance assessments. This is especially true of richly contextualized assessments and so-called "authentic" simulations of real-world tasks. This is the case because "paradoxically, the complexity of context is made manageable by contextual clues" (Wiggins, 1993, p. 208). And it matters whether the contextual clues that people respond to are construct-relevant or represent construct-irrelevant difficulty or easiness.

However, what constitutes construct-irrelevant variance is a tricky and contentious issue (Messick, 1994). This is especially true of performance assessments, which typically invoke constructs that are higher order and complex in the sense of subsuming or organizing multiple processes. For example, skill in communicating mathematical ideas might well be considered irrelevant variance in the assessment of mathematical knowledge (although not necessarily vice versa). But both communication skill and mathematical knowledge are considered relevant parts of the higher-order construct of mathematical power, according to the content standards delineated by the National Council of Teachers of Mathematics (1989). It all depends on how compelling the evidence and arguments are that the particular source of variance is a relevant part of the focal construct, as opposed to affording a plausible rival hypothesis to account for the observed performance regularities and relationships with other variables.

A further complication arises when construct-irrelevant variance is deliberately capitalized upon to produce desired social consequences, as in score adjustments for minority groups, within-group norming, or sliding band procedures (Cascio, Outtz, Zedeck, & Goldstein, 1991; Hartigan & Wigdor, 1989; Schmidt, 1991). However, recognizing that these adjustments distort the meaning of the construct as originally assessed, psychologists should distinguish such controversial procedures in applied testing practice (Gottfredson, 1994; Sackett & Wilk, 1994) from the valid assessment of focal constructs and from any score uses based on that construct meaning. Construct-irrelevant variance is always a source of invalidity in the assessment of construct meaning and its action implications. These issues portend the substantive and consequential aspects of construct validity, which are discussed in more detail later.

Sources of Evidence in Construct Validity

In essence, construct validity comprises the evidence and rationales supporting the trustworthiness of score interpretation in terms of explanatory concepts that account for both test performance and score relationships with other variables. In its simplest terms, construct validity is the evidential basis for score interpretation. As an integration of evidence for score meaning, it applies to any score interpretation—not just those involving so-called "theoretical constructs." Almost any kind of information about a test can contribute to an understanding of score meaning, but the contribution becomes stronger if the degree of fit of the information with the theoretical rationale underlying score interpretation is explicitly evaluated (Cronbach, 1988; Kane, 1992; Messick, 1989b). Historically, primary emphasis in construct validation has been placed on internal and external test structures—that is, on the appraisal of theoretically expected patterns of relationships among item scores or between test scores and other measures.

Probably even more illuminating in regard to score meaning are studies of expected performance differences over time, across groups and settings, and in response to experimental treatments and manipulations. For example, over time one might demonstrate the increased scores from childhood to young adulthood expected for measures of impulse control. Across groups and settings, one might contrast the solution strategies of novices versus experts for measures of domain problem-solving or, for measures of creativity, contrast the creative productions of individuals in self-determined as opposed to directive work environments. With respect to experimental treatments and manipulations, one might seek increased knowledge scores as a function of domain instruction or increased achievement motivation scores as a function of greater benefits and risks. Possibly most illuminating of all, however, are direct probes and modeling of the processes underlying test responses, which are becoming both more accessible and more powerful with continuing developments in cognitive psychology (Frederiksen, Mislevy, & Bejar, 1993; Snow & Lohman, 1989). At the simplest level, this might involve querying respondents about their solution processes or asking them to think aloud while responding to exercises during field trials.
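The differential item functioning analyses cited in this section (Holland & Wainer, 1993) are commonly operationalized with the Mantel-Haenszel procedure, which compares the correct-response odds of two groups matched on total score. The sketch below uses synthetic counts for three illustrative score strata, not data from any real test; the widely used ETS convention treats an item with a delta statistic smaller than 1 in absolute value as showing negligible DIF.

```python
import math

# Per score stratum: (ref_correct, ref_wrong, focal_correct, focal_wrong).
# Synthetic counts for one item; examinees are matched on total score.
strata = [
    (55, 45, 50, 50),
    (70, 30, 66, 34),
    (85, 15, 82, 18),
]

# Mantel-Haenszel common odds ratio: a value near 1.0 means the matched
# groups have equal odds of success, i.e., no differential functioning.
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den

# ETS delta metric: negative values favor the reference group;
# |delta| < 1 is conventionally classified as negligible DIF.
mh_delta = -2.35 * math.log(alpha_mh)
```

With these counts the common odds ratio comes out near 1.2 (delta about -0.5), a small reference-group advantage that the conventional cutoffs would classify as negligible.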


In addition to reliance on these forms of evidence, construct validity, as previously indicated, also subsumes content relevance and representativeness as well as criterion-relatedness. This is the case because such information about the range and limits of content coverage and about specific criterion behaviors predicted by the test scores clearly contributes to score interpretation. In the latter instance, correlations between test scores and criterion measures—viewed within the broader context of other evidence supportive of score meaning—contribute to the joint construct validity of both predictor and criterion. In other words, empirical relationships between predictor scores and criterion measures should make theoretical sense in terms of what the predictor test is interpreted to measure and what the criterion is presumed to embody (Gulliksen, 1950).

An important form of validity evidence still remaining bears on the social consequences of test interpretation and use. It is ironic that validity theory has paid so little attention over the years to the consequential basis of test validity, because validation practice has long invoked such notions as the functional worth of the testing—that is, a concern over how well the test does the job for which it is used (Cureton, 1951; Rulon, 1946). And to appraise how well a test does its job, one must inquire whether the potential and actual social consequences of test interpretation and use are not only supportive of the intended testing purposes, but also at the same time consistent with other social values.

With some trepidation due to the difficulties inherent in forecasting, both potential and actual consequences are included in this formulation for two main reasons: First, anticipation of likely outcomes may guide one where to look for side effects and toward what kinds of evidence are needed to monitor consequences; second, such anticipation may alert one to take timely steps to capitalize on positive effects and to ameliorate or forestall negative effects.

However, this form of evidence should not be viewed in isolation as a separate type of validity, say, of "consequential validity." Rather, because the values served in the intended and unintended outcomes of test interpretation and use both derive from and contribute to the meaning of the test scores, appraisal of the social consequences of the testing is also seen to be subsumed as an aspect of construct validity (Messick, 1964, 1975, 1980). In the language of the Cronbach and Meehl (1955) seminal manifesto on construct validity, the intended consequences of the testing are strands in the construct's nomological network representing presumed action implications of score meaning. The central point is that unintended consequences, when they occur, are also strands in the construct's nomological network that need to be taken into account in construct theory, score interpretation, and test use. At issue is evidence for not only negative but also positive consequences of testing, such as the promised benefits of educational performance assessment for teaching and learning.

A major concern in practice is to distinguish adverse consequences that stem from valid descriptions of individual and group differences from adverse consequences that derive from sources of test invalidity such as construct underrepresentation and construct-irrelevant variance. The latter adverse consequences of test invalidity present measurement problems that need to be investigated in the validation process, whereas the former consequences of valid assessment represent problems of social policy. But more about this later.

Thus, the process of construct validation evolves from these multiple sources of evidence a mosaic of convergent and discriminant findings supportive of score meaning. However, in anticipated applied test use, this mosaic of general evidence may or may not include pertinent specific evidence of (a) the relevance of the test to the particular applied purpose and (b) the utility of the test in the applied setting. Hence, the general construct validity evidence may need to be buttressed in applied instances by specific evidence of relevance and utility.

In summary, the construct validity of score interpretation comes to undergird all score-based inferences—not just those related to interpretive meaningfulness but also the content- and criterion-related inferences specific to applied decisions and actions based on test scores. From the discussion thus far, it should also be clear that test validity cannot rely on any one of the supplementary forms of evidence just discussed. However, neither does validity require any one form, granted that there is defensible convergent and discriminant evidence supporting score meaning. To the extent that some form of evidence cannot be developed—as when criterion-related studies must be forgone because of small sample sizes, unreliable or contaminated criteria, and highly restricted score ranges—heightened emphasis can be placed on other evidence, especially on the construct validity of the predictor tests and on the relevance of the construct to the criterion domain (Guion, 1976; Messick, 1989b). What is required is a compelling argument that the available evidence justifies the test interpretation and use, even though some pertinent evidence had to be forgone. Hence, validity becomes a unified concept, and the unifying force is the meaningfulness or trustworthy interpretability of the test scores and their action implications, namely, construct validity.

Aspects of Construct Validity

However, to speak of validity as a unified concept does not imply that validity cannot be usefully differentiated into distinct aspects to underscore issues and nuances that might otherwise be downplayed or overlooked, such as the social consequences of performance assessments or the role of score meaning in applied use. The intent of these distinctions is to provide a means of addressing functional aspects of validity that help disentangle some of the complexities inherent in appraising the appropriateness, meaningfulness, and usefulness of score inferences.

In particular, six distinguishable aspects of construct validity are highlighted as a means of addressing central issues implicit in the notion of validity as a unified concept. These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity. In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement (Messick, 1989b).


Following a capsule description of these six aspects, some of the validity issues and sources of evidence bearing on each are highlighted:

• The content aspect of construct validity includes evidence of content relevance, representativeness, and technical quality (Lennon, 1956; Messick, 1989b);
• The substantive aspect refers to theoretical rationales for the observed consistencies in test responses, including process models of task performance (Embretson, 1983), along with empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks;
• The structural aspect appraises the fidelity of the scoring structure to the structure of the construct domain at issue (Loevinger, 1957; Messick, 1989b);
• The generalizability aspect examines the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks (Cook & Campbell, 1979; Shulman, 1970), including validity generalization of test-criterion relationships (Hunter, Schmidt, & Jackson, 1982);
• The external aspect includes convergent and discriminant evidence from multitrait-multimethod comparisons (Campbell & Fiske, 1959), as well as evidence of criterion relevance and applied utility (Cronbach & Gleser, 1965);
• The consequential aspect appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to sources of invalidity related to issues of bias, fairness, and distributive justice (Messick, 1980, 1989b).

Content Relevance and Representativeness

A key issue for the content aspect of construct validity is the specification of the boundaries of the construct domain to be assessed—that is, determining the knowledge, skills, attitudes, motives, and other attributes to be revealed by the assessment tasks. The boundaries and structure of the construct domain can be addressed by means of job analysis, task analysis, curriculum analysis, and especially domain theory, in other words, scientific inquiry into the nature of the domain processes and the ways in which they combine to produce effects or outcomes. A major goal of domain theory is to understand the construct-relevant sources of task difficulty, which then serves as a guide to the rational development and scoring of performance tasks and other assessment formats. At whatever stage of its development, then, domain theory is a primary basis for specifying the boundaries and structure of the construct to be assessed.

However, it is not sufficient merely to select tasks that are relevant to the construct domain. In addition, the assessment should assemble tasks that are representative of the domain in some sense. The intent is to insure that all important parts of the construct domain are covered, which is usually described as selecting tasks that sample domain processes in terms of their functional importance, or what Brunswik (1956) called ecological sampling. Functional importance can be considered in terms of what people actually do in the performance domain, as in job analyses, but also in terms of what characterizes and differentiates expertise in the domain, which would usually emphasize different tasks and processes. Both the content relevance and representativeness of assessment tasks are traditionally appraised by expert professional judgment, documentation of which serves to address the content aspect of construct validity.

Substantive Theories, Process Models, and Process Engagement

The substantive aspect of construct validity emphasizes the role of substantive theories and process modeling in identifying the domain processes to be revealed in assessment tasks (Embretson, 1983; Messick, 1989b). Two important points are involved: One is the need for tasks providing appropriate sampling of domain processes in addition to traditional coverage of domain content; the other is the need to move beyond traditional professional judgment of content to accrue empirical evidence that the ostensibly sampled processes are actually engaged by respondents in task performance.

Thus, the substantive aspect adds to the content aspect of construct validity the need for empirical evidence of response consistencies or performance regularities reflective of domain processes (Loevinger, 1957). Such evidence may derive from a variety of sources, for example, from "think aloud" protocols or eye movement records during task performance; from correlation patterns among part scores; from consistencies in response times for task segments; or from mathematical or computer modeling of task processes (Messick, 1989b, pp. 53-55; Snow & Lohman, 1989). In summary, the issue of domain coverage refers not just to the content representativeness of the construct measure but also to the process representation of the construct and the degree to which these processes are reflected in construct measurement.

The core concept bridging the content and substantive aspects of construct validity is representativeness. This becomes clear once one recognizes that the term representative has two distinct meanings, both of which are applicable to performance assessment. One is in the cognitive psychologist's sense of representation or modeling (Suppes, Pavel, & Falmagne, 1994); the other is in the Brunswikian sense of ecological sampling (Brunswik, 1956; Snow, 1974). The choice of tasks or contexts in assessment is a representative sampling issue. The comprehensiveness and fidelity of simulating the construct's realistic engagement in performance is a representation issue. Both issues are important in educational and psychological measurement and especially in performance assessment.
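The convergent and discriminant evidence invoked by the external aspect, in the multitrait-multimethod tradition (Campbell & Fiske, 1959), can be sketched with a small simulation; the trait and method labels, coefficients, and sample size below are illustrative only. Convergent evidence asks that measures of the same trait by different methods correlate highly; discriminant evidence asks that those correlations clearly exceed the correlations between different traits.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Two modestly correlated latent traits (illustrative values).
trait1 = rng.normal(size=n)
trait2 = 0.3 * trait1 + rng.normal(size=n)

# Each trait measured by two methods; method error is independent.
measures = {
    "trait1_questionnaire": trait1 + rng.normal(scale=0.6, size=n),
    "trait1_rating": trait1 + rng.normal(scale=0.6, size=n),
    "trait2_questionnaire": trait2 + rng.normal(scale=0.6, size=n),
    "trait2_rating": trait2 + rng.normal(scale=0.6, size=n),
}

def corr(a, b):
    return float(np.corrcoef(measures[a], measures[b])[0, 1])

# Same trait, different methods: should be high (convergent evidence).
convergent = corr("trait1_questionnaire", "trait1_rating")

# Different traits: should be clearly lower (discriminant evidence).
discriminant = corr("trait1_questionnaire", "trait2_rating")
```

Under these assumptions the convergent correlation lands around .7 and the discriminant correlation near .2; the interpretation of the scores as measures of the first trait is supported only because the former clearly exceeds the latter.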


known about the structural relations inherent in behavioral manifestations of the construct in question (Loevinger, 1957; Peak, 1953). That is, the theory of the construct domain should guide not only the selection or construction of relevant assessment tasks but also the rational development of construct-based scoring criteria and rubrics.

Ideally, the manner in which behavioral instances are combined to produce a score should rest on knowledge of how the processes underlying those behaviors combine dynamically to produce effects. Thus, the internal structure of the assessment (i.e., interrelations among the scored aspects of task and subtask performance) should be consistent with what is known about the internal structure of the construct domain (Messick, 1989b). This property of construct-based rational scoring models is called structural fidelity (Loevinger, 1957).

Generalizability and the Boundaries of Score Meaning

The concern that a performance assessment should provide representative coverage of the content and processes of the construct domain is meant to ensure that the score interpretation not be limited to the sample of assessed tasks but be broadly generalizable to the construct domain. Evidence of such generalizability depends on the degree of correlation of the assessed tasks with other tasks representing the construct or aspects of the construct. This issue of generalizability of score inferences across tasks and contexts goes to the very heart of score meaning. Indeed, setting the boundaries of score meaning is precisely what generalizability evidence is meant to address.

However, because of the extensive time required for the typical performance task, there is a conflict in performance assessment between time-intensive depth of examination and the breadth of domain coverage needed for generalizability of construct interpretation. This conflict between depth and breadth of coverage is often viewed as entailing a trade-off between validity and reliability (or generalizability). It might better be depicted as a trade-off between the valid description of the specifics of a complex task and the power of construct interpretation. In any event, such a conflict signals a design problem that needs to be carefully negotiated in performance assessment (Wiggins, 1993).

In addition to generalizability across tasks, the limits of score meaning are also affected by the degree of generalizability across time or occasions and across observers or raters of the task performance. Such sources of measurement error associated with the sampling of tasks, occasions, and scorers underlie traditional reliability concerns (Feldt & Brennan, 1989).

Convergent and Discriminant Correlations With External Variables

The external aspect of construct validity refers to the extent to which the assessment scores' relationships with other measures and nonassessment behaviors reflect the expected high, low, and interactive relations implicit in the theory of the construct being assessed. Thus, the meaning of the scores is substantiated externally by appraising the degree to which empirical relationships with other measures—or the lack thereof—are consistent with that meaning. That is, the constructs represented in the assessment should rationally account for the external pattern of correlations. Both convergent and discriminant correlation patterns are important, the convergent pattern indicating a correspondence between measures of the same construct and the discriminant pattern indicating a distinctness from measures of other constructs (Campbell & Fiske, 1959). Discriminant evidence is particularly critical for discounting plausible rival alternatives to the focal construct interpretation. Both convergent and discriminant evidence are basic to construct validation.

Of special importance among these external relationships are those between the assessment scores and criterion measures pertinent to selection, placement, licensure, program evaluation, or other accountability purposes in applied settings. Once again, the construct theory points to the relevance of potential relationships between the assessment scores and criterion measures, and empirical evidence of such links attests to the utility of the scores for the applied purpose.

Consequences as Validity Evidence

The consequential aspect of construct validity includes evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use, in both the short and long term. Social consequences of testing may be either positive, such as improved educational policies based on international comparisons of student performance, or negative, especially when associated with bias in scoring and interpretation or with unfairness in test use. For example, because performance assessments in education promise potential benefits for teaching and learning, it is important to accrue evidence of such positive consequences as well as evidence that adverse consequences are minimal.

The primary measurement concern with respect to adverse consequences is that any negative impact on individuals or groups should not derive from any source of test invalidity, such as construct underrepresentation or construct-irrelevant variance (Messick, 1989b). In other words, low scores should not occur because the assessment is missing something relevant to the focal construct that, if present, would have permitted the affected persons to display their competence. Moreover, low scores should not occur because the measurement contains something irrelevant that interferes with the affected persons' demonstration of competence.

Validity as Integrative Summary

These six aspects of construct validity apply to all educational and psychological measurement, including performance assessments. Taken together, they provide a way of addressing the multiple and interrelated validity questions that need to be answered to justify score interpretation and use. In previous writings, I maintained that it is "the relation between the evidence and the inferences
drawn that should determine the validation focus" (Messick, 1989b, p. 16). This relation is embodied in theoretical rationales or persuasive arguments that the obtained evidence both supports the preferred inferences and undercuts plausible rival inferences. From this perspective, as Cronbach (1988) concluded, validation is evaluation argument. That is, as stipulated earlier, validation is empirical evaluation of the meaning and consequences of measurement. The term empirical evaluation is meant to convey that the validation process is scientific as well as rhetorical and requires both evidence and argument.

By focusing on the argument or rationale used to support the assumptions and inferences invoked in the score-based interpretations and actions of a particular test use, one can prioritize the forms of validity evidence needed according to the points in the argument requiring justification or support (Kane, 1992; Shepard, 1993). Helpful as this may be, there still remain problems in setting priorities for needed evidence because the argument may be incomplete or off target, not all the assumptions may be addressed, and the need to discount alternative arguments evokes multiple priorities. This is one reason that Cronbach (1989) stressed cross-argument criteria for assigning priority to a line of inquiry, such as the degree of prior uncertainty, information yield, cost, and leverage in achieving consensus.

Kane (1992) illustrated the argument-based approach by prioritizing the evidence needed to validate a placement test for assigning students to a course in either remedial algebra or calculus. He addressed seven assumptions that, from the present perspective, bear on the content, substantive, generalizability, external, and consequential aspects of construct validity. Yet the structural aspect is not explicitly addressed. Hence, the compensatory property of the usual cumulative total score, which permits good performance on some algebra skills to compensate for poor performance on others, remains unevaluated, in contrast, for example, to scoring models with multiple cut scores or with minimal requirements across the profile of prerequisite skills. The question is whether such profile scoring models might yield not only useful information for diagnosis and remediation but also better student placement.

The structural aspect of construct validity also received little attention in Shepard's (1993) argument-based analysis of the validity of special education placement decisions. This was despite the fact that the assessment referral system under consideration involved a profile of cognitive, biomedical, behavioral, and academic skills that required some kind of structural model linking test results to placement decisions. However, in her analysis of selection uses of the General Aptitude Test Battery (GATB), Shepard (1993) did underscore the structural aspect because the GATB within-group scoring model is both salient and controversial.

The six aspects of construct validity afford a means of checking that the theoretical rationale or persuasive argument linking the evidence to the inferences drawn touches the important bases; if the bases are not covered, an argument that such omissions are defensible must be provided. These six aspects are highlighted because most score-based interpretations and action inferences, as well as the elaborated rationales or arguments that attempt to legitimize them (Kane, 1992), either invoke these properties or assume them, explicitly or tacitly.

In other words, most score interpretations refer to relevant content and operative processes, presumed to be reflected in scores that concatenate responses in domain-appropriate ways and are generalizable across a range of tasks, settings, and occasions. Furthermore, score-based interpretations and actions are typically extrapolated beyond the test context on the basis of presumed relationships with nontest behaviors and anticipated outcomes or consequences. The challenge in test validation is to link these inferences to convergent evidence supporting them and to discriminant evidence discounting plausible rival inferences. Evidence pertinent to all of these aspects needs to be integrated into an overall validity judgment to sustain score inferences and their action implications, or else provide compelling reasons why there is not a link, which is what is meant by validity as a unified concept.

Meaning and Values in Test Validation

The essence of unified validity is that the appropriateness, meaningfulness, and usefulness of score-based inferences are inseparable and that the integrating power derives from empirically grounded score interpretation. As seen in this article, both meaning and values are integral to the concept of validity, and psychologists need a way of addressing both concerns in validation practice. In particular, what is needed is a way of configuring validity evidence that forestalls undue reliance on selected forms of evidence as opposed to a pattern of supplementary evidence, and that highlights the important yet subsidiary role of specific content- and criterion-related evidence in support of construct validity in testing applications. Such a means should also formally bring consideration of value implications and social consequences into the validity framework.

A unified validity framework meeting these requirements distinguishes two interconnected facets of validity as a unitary concept (Messick, 1989a, 1989b). One facet is the source of justification of the testing, based on appraisal of either evidence supportive of score meaning or consequences contributing to score valuation. The other facet is the function or outcome of the testing—either interpretation or applied use. If the facet for justification (i.e., either an evidential basis for meaning implications or a consequential basis for value implications of scores) is crossed with the facet for function or outcome (i.e., either test interpretation or test use), a four-fold classification is obtained, highlighting both meaning and values in both test interpretation and test use, as represented by the row and column headings of Figure 1.

These distinctions may seem fuzzy because they are not only interlinked but overlapping. For example, social consequences of testing are a form of evidence, and other forms of evidence have consequences. Furthermore, to
interpret a test is to use it, and all other test uses involve interpretation either explicitly or tacitly. Moreover, utility is both validity evidence and a value consequence. This conceptual messiness derives from cutting through what indeed is a unitary concept to provide a means of discussing its functional aspects.

Figure 1
Facets of Validity as a Progressive Matrix

                       Test Interpretation             Test Use
Evidential Basis       Construct Validity (CV)         CV + Relevance/Utility (R/U)
Consequential Basis    CV + Value Implications (VI)    CV + R/U + VI + Social Consequences

Each of the cells in this four-fold crosscutting of unified validity is briefly considered in turn, beginning with the evidential basis of test interpretation. Because the evidence and rationales supporting the trustworthiness of score meaning are what is meant by construct validity, the evidential basis of test interpretation is clearly construct validity. The evidential basis of test use is also construct validity, but with the important proviso that the general evidence supportive of score meaning either already includes or becomes enhanced by specific evidence for the relevance of the scores to the applied purpose and for the utility of the scores in the applied setting, where utility is broadly conceived to reflect the benefits of testing relative to its costs (Cronbach & Gleser, 1965).

The consequential basis of test interpretation is the appraisal of value implications of score meaning, including the often tacit value implications of the construct label itself, of the broader theory conceptualizing construct properties and relationships that undergirds construct meaning, and of the still broader ideologies that give theories their perspective and purpose—for example, ideologies about the functions of science or about the nature of the human being as a learner or as an adaptive or fully functioning person. The value implications of score interpretation are not only part of score meaning, but a socially relevant part that often triggers score-based actions and serves to link the construct measured to questions of applied practice and social policy. One way to protect against the tyranny of unexposed and unexamined values in score interpretation is to explicitly adopt multiple value perspectives to formulate and empirically appraise plausible rival hypotheses (Churchman, 1971; Messick, 1989b).

Many constructs, such as competence, creativity, intelligence, or extraversion, have manifold and arguable value implications that may or may not be sustainable in terms of properties of their associated measures. A central issue is whether the theoretical or trait implications and the value implications of the test interpretation are commensurate, because value implications are not ancillary but, rather, integral to score meaning. Therefore, to make clear that score interpretation is needed to appraise value implications and vice versa, this cell for the consequential basis of test interpretation needs to comprehend both the construct validity as well as the value ramifications of score meaning.

Finally, the consequential basis of test use is the appraisal of both potential and actual social consequences of the applied testing. One approach to appraising potential side effects is to pit the benefits and risks of the proposed test use against the pros and cons of alternatives or counterproposals. By taking multiple perspectives on proposed test use, the various (and sometimes conflicting) value commitments of each proposal are often exposed to open examination and debate (Churchman, 1971; Messick, 1989b). Counterproposals to a proposed test use might involve quite different assessment techniques, such as observations or portfolios when educational performance standards are at issue. Counterproposals might attempt to serve the intended purpose in a different way, such as through training rather than selection when productivity levels are at issue (granted that testing may also be used to reduce training costs, and that failure in training yields a form of selection).

What matters is not only whether the social consequences of test interpretation and use are positive or negative, but how the consequences came about and what determined them. In particular, it is not that adverse social consequences of test use render the use invalid but, rather, that adverse social consequences should not be attributable to any source of test invalidity, such as construct underrepresentation or construct-irrelevant variance. And once again, in recognition of the fact that the weighing of social consequences both presumes and contributes to evidence of score meaning, of relevance, of utility, and of values, this cell needs to include construct validity, relevance, and utility, as well as social and value consequences.

Some measurement specialists argue that adding value implications and social consequences to the validity framework unduly burdens the concept. However, it is simply not the case that values are being added to validity in this unified view. Rather, values are intrinsic to the meaning and outcomes of the testing and have always been. As opposed to adding values to validity as an adjunct or supplement, the unified view instead exposes the inherent value aspects of score meaning and outcome to open examination and debate as an integral part of the validation process (Messick, 1989a). This makes explicit what has been latent all along, namely, that validity judgments are value judgments.

A salient feature of Figure 1 is that construct validity appears in every cell, which is fitting because the construct validity of score meaning is the integrating force that unifies validity issues into a unitary concept. At the same time, by distinguishing facets reflecting the justification and function of the testing, it becomes clear that distinct features of construct validity need to be emphasized, in addition to the general mosaic of evidence, as one moves
from the focal issue of one cell to that of the others. In particular, the forms of evidence change and compound as one moves from appraisal of evidence for the construct interpretation per se, to appraisal of evidence supportive of a rational basis for test use, to appraisal of the value consequences of score interpretation as a basis for action, and finally, to appraisal of the social consequences—or, more generally, of the functional worth—of test use.

As different foci of emphasis are highlighted in addressing the basic construct validity appearing in each cell, this movement makes what at first glance was a simple four-fold classification appear more like a progressive matrix, as portrayed in the cells of Figure 1. From one perspective, each cell represents construct validity, with different features highlighted on the basis of the justification and function of the testing. From another perspective, the entire progressive matrix represents construct validity, which is another way of saying that validity is a unified concept. One implication of this progressive-matrix formulation is that both meaning and values, as well as both test interpretation and test use, are intertwined in the validation process. Thus, validity and values are one imperative, not two, and test validation implicates both the science and the ethics of assessment, which is why validity has force as a social value.

REFERENCES

Brunswik, E. (1956). Perception and the representative design of psychological experiments (2nd ed.). Berkeley: University of California Press.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implications of six methods of test score use in personnel selection. Human Performance, 4, 233-264.
Churchman, C. W. (1971). The design of inquiring systems: Basic concepts of systems and organization. New York: Basic Books.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
Cronbach, L. J. (1988). Five perspectives on validation argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 34-35). Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147-171). Chicago: University of Illinois Press.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana: University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 621-694). Washington, DC: American Council on Education.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: Macmillan.
Frederiksen, N., Mislevy, R. J., & Bejar, I. (Eds.). (1993). Test theory for a new generation of tests. Hillsdale, NJ: Erlbaum.
Gottfredson, L. S. (1994). The science and politics of race-norming. American Psychologist, 49, 955-963.
Guion, R. M. (1976). Recruiting, selection, and job placement. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 777-828). Chicago: Rand McNally.
Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5, 511-517.
Hartigan, J. A., & Wigdor, A. K. (Eds.). (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
Hunter, J. E., Schmidt, F. L., & Jackson, C. B. (1982). Advanced meta-analysis: Quantitative methods of cumulating research findings across studies. San Francisco: Sage.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
Lennon, R. T. (1956). Assumptions underlying the use of content validity. Educational and Psychological Measurement, 16, 294-304.
Loevinger, J. (1957). Objective tests as instruments of psychological theory [Monograph]. Psychological Reports, 3, 635-694 (Pt. 9).
Messick, S. (1964). Personality measurement and college performance. In Proceedings of the 1963 Invitational Conference on Testing Problems (pp. 110-129). Princeton, NJ: Educational Testing Service.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989a). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
Messick, S. (1989b). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.
Peak, H. (1953). Problems of observation. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences (pp. 243-299). Hinsdale, IL: Dryden Press.
Rulon, P. J. (1946). On the validity of educational tests. Harvard Educational Review, 16, 290-296.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.
Schmidt, F. L. (1991). Why all banding procedures are logically flawed. Human Performance, 4, 265-278.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405-450.
Shulman, L. S. (1970). Reconstruction of educational research. Review of Educational Research, 40, 371-396.
Snow, R. E. (1974). Representative and quasi-representative designs for research on teaching. Review of Educational Research, 44, 265-291.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 263-331). New York: Macmillan.
Suppes, P., Pavel, M., & Falmagne, J.-C. (1994). Representations and models in psychology. Annual Review of Psychology, 45, 517-544.
Wiggins, G. (1993). Assessment: Authenticity, context, and validity. Phi Delta Kappan, 75, 200-214.
