Part 1 Validity

1. VALIDITY

BACKGROUND

Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself. When test scores are interpreted in more than one way (e.g., both to describe a test taker’s current level of the attribute being measured and to make a prediction about a future outcome), each intended interpretation must be validated. Statements about validity should refer to particular interpretations for specified uses. It is incorrect to use the unqualified phrase “the validity of the test.”

Evidence of the validity of a given interpretation of test scores for a specified use is a necessary condition for the justifiable use of the test. Where sufficient evidence of validity exists, the decision as to whether to actually administer a particular test generally takes additional considerations into account. These include cost-benefit considerations, framed in different subdisciplines as utility analysis or as consideration of negative consequences of test use, and a weighing of any negative consequences against the positive consequences of test use.

Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation includes specifying the construct the test is intended to measure. The term construct is used in the Standards to refer to the concept or characteristic that a test is designed to measure. Rarely, if ever, is there a single possible meaning that can be attached to a test score or a pattern of test responses. Thus, it is always incumbent on test developers and users to specify the construct interpretation that will be made on the basis of the score or response pattern. Examples of constructs currently used in assessment include mathematics achievement, general cognitive ability, and racial identity attitudes.

To support test development, the proposed construct interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, traits, interests, processes, competencies, or characteristics to be assessed. Ideally, the framework indicates how the construct as represented is to be distinguished from other constructs and how it should relate to other variables.

The conceptual framework is partially shaped by the ways in which test scores will be used. For instance, a test of mathematics achievement might be used to place a student in an appropriate program of instruction, to endorse a high school diploma, or to inform a college admissions decision. Each of these uses implies a somewhat different interpretation of the mathematics achievement test scores: that a student will benefit from a particular instructional intervention, that a student has mastered a specified curriculum, or that a student is likely to be successful with college-level work. Similarly, a test of conscientiousness might be used for psychological counseling, to inform a decision about employment, or for the basic scientific purpose of elaborating the construct of conscientiousness. Each of these potential uses shapes the specified framework and the proposed interpretation of the test’s scores and also can have implications for test development and evaluation. Validation can be viewed as a process of constructing and evaluating arguments for and against the intended interpretation of test scores and their relevance to the proposed use.

The conceptual framework points to the kinds of evidence that might be collected to evaluate the proposed interpretation in light of the purposes of testing. As validation proceeds, and new evidence regarding the interpretations that can and cannot be drawn from test scores becomes available, revisions may be needed in the test, in the conceptual framework that shapes it, and even in the construct underlying the test.

The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful. Decisions about what types of evidence are important for the validation argument in each instance can be clarified by developing a set of propositions or claims that support the proposed interpretation for the particular purpose of testing. For instance, when a mathematics achievement test is used to assess readiness for an advanced course, evidence for the following propositions might be relevant: (a) that certain skills are prerequisite for the advanced course; (b) that the content domain of the test is
consistent with these prerequisite skills; (c) that test scores can be generalized across relevant sets of items; (d) that test scores are not unduly influenced by ancillary variables, such as writing ability; (e) that success in the advanced course can be validly assessed; and (f) that test takers with high scores on the test will be more successful in the advanced course than test takers with low scores on the test. Examples of propositions in other testing contexts might include, for instance, the proposition that test takers with high general anxiety scores experience significant anxiety in a range of settings, the proposition that a child’s score on an intelligence scale is strongly related to the child’s academic performance, or the proposition that a certain pattern of scores on a neuropsychological battery indicates impairment that is characteristic of brain injury. The validation process evolves as these propositions are articulated and evidence is gathered to evaluate their soundness.

Identifying the propositions implied by a proposed test interpretation can be facilitated by considering rival hypotheses that may challenge the proposed interpretation. It is also useful to consider the perspectives of different interested parties, existing experience with similar tests and contexts, and the expected consequences of the proposed test use. A finding of unintended consequences of test use may also prompt a consideration of rival hypotheses. Plausible rival hypotheses can often be generated by considering whether a test measures less or more than its proposed construct. Such considerations are referred to as construct underrepresentation (or construct deficiency) and construct-irrelevant variance (or construct contamination), respectively.

Construct underrepresentation refers to the degree to which a test fails to capture important aspects of the construct. It implies a narrowed meaning of test scores because the test does not adequately sample some types of content, engage some psychological processes, or elicit some ways of responding that are encompassed by the intended construct. Take, for example, a test intended as a comprehensive measure of anxiety. A particular test might underrepresent the intended construct because it measures only physiological reactions and not emotional, cognitive, or situational components. As another example, a test of reading comprehension intended to measure children’s ability to read and interpret stories with understanding might not contain a sufficient variety of reading passages or might ignore a common type of reading material.

Construct-irrelevance refers to the degree to which test scores are affected by processes that are extraneous to the test’s intended purpose. The test scores may be systematically influenced to some extent by processes that are not part of the construct. In the case of a reading comprehension test, these might include material too far above or below the level intended to be tested, an emotional reaction to the test content, familiarity with the subject matter of the reading passages on the test, or the writing skill needed to compose a response. Depending on the detailed definition of the construct, vocabulary knowledge or reading speed might also be irrelevant components. On a test designed to measure anxiety, a response bias to underreport one’s anxiety might be considered a source of construct-irrelevant variance. In the case of a mathematics test, it might include overreliance on reading comprehension skills that English language learners may be lacking. On a test designed to measure science knowledge, test-taker internalizing of gender-based stereotypes about women in the sciences might be a source of construct-irrelevant variance.

Nearly all tests leave out elements that some potential users believe should be measured and include some elements that some potential users consider inappropriate. Validation involves careful attention to possible distortions in meaning arising from inadequate representation of the construct and also to aspects of measurement, such as test format, administration conditions, or language level, that may materially limit or qualify the interpretation of test scores for various groups of test takers. That is, the process of validation may lead to revisions in the test, in the conceptual framework of the test, or both. Interpretations drawn from the revised test would again need validation.

When propositions have been identified that would support the proposed interpretation of test scores, one can proceed with validation by obtaining empirical evidence, examining relevant literature, and/or conducting logical analyses to evaluate each of the propositions. Empirical evidence may include both local evidence, produced within the contexts where the test will be used, and evidence from similar testing applications in other settings. Use of existing evidence from similar tests and contexts can enhance the quality of the validity argument, especially when data for the test and context in question are limited.

Because an interpretation for a given use typically depends on more than one proposition, strong evidence in support of one part of the interpretation in no way diminishes the need for evidence to support other parts of the interpretation. For example, when an employment test is being considered for selection, a strong predictor-criterion relationship in an employment setting is ordinarily not sufficient to justify use of the test. One should also consider the appropriateness and meaningfulness of the
criterion measure, the appropriateness of the testing materials and procedures for the full range of applicants, and the consistency of the support for the proposed interpretation across groups. Professional judgment guides decisions regarding the specific forms of evidence that can best support the intended interpretation for a specified use. As in all scientific endeavors, the quality of the evidence is paramount. A few pieces of solid evidence regarding a particular proposition are better than numerous pieces of evidence of questionable quality. The determination that a given test interpretation for a specific purpose is warranted is based on professional judgment that the preponderance of the available evidence supports that interpretation. The quality and quantity of evidence sufficient to reach this judgment may differ for test uses depending on the stakes involved in the testing. A given interpretation may not be warranted either as a result of insufficient evidence in support of it or as a result of credible evidence against it.

Validation is the joint responsibility of the test developer and the test user. The test developer is responsible for furnishing relevant evidence and a rationale in support of any test score interpretations for specified uses intended by the developer. The test user is ultimately responsible for evaluating the evidence in the particular setting in which the test is to be used. When a test user proposes an interpretation or use of test scores that differs from those supported by the test developer, responsibility for providing validity evidence in support of that interpretation for the specified use rests with the user. Important contributions to the validity evidence may also be made as other researchers report findings of investigations that are related to the meaning of scores on the test.

As the discussion in the prior section emphasizes, each type of evidence presented below is not required in all settings. Rather, support is needed for each proposition that underlies a proposed test interpretation for a specified use. A proposition that a test is predictive of a given criterion can be supported without evidence that the test samples a particular content domain. In contrast, a proposition that a test covers a representative sample of a particular curriculum may be supported without evidence that the test predicts a given criterion. However, a more complex set of propositions, e.g., that a test samples a specified domain and thus is predictive of a criterion reflecting a related domain, will require evidence supporting both parts of this set of propositions. Test developers are also expected to make the case that the scores are not unduly influenced by construct-irrelevant variance (see chap. 3 for detailed treatment of issues related to construct-irrelevant variance). In general, adequate support for proposed interpretations for specific uses will require multiple sources of evidence.

The position developed above also underscores the fact that if a given test is interpreted in multiple ways for multiple uses, the propositions underlying these interpretations for different uses also are likely to differ. Support is needed for the propositions underlying each interpretation for a specific use. Evidence supporting the interpretation of scores on a mathematics achievement test for placing students in subsequent courses (i.e., evidence that the test interpretation is valid for its intended purpose) does not permit inferring validity for other purposes (e.g., promotion or teacher evaluation).
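
The idea that each intended use carries its own propositions, and that evidence gathered for one use does not transfer to another, can be illustrated with a small bookkeeping sketch. This is only an illustration under stated assumptions: the class names (Proposition, ValidityArgument), the method lacking_evidence, the example claims, and the evidence labels are all hypothetical and are not taken from the Standards.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One claim that must hold for a score interpretation to be justified."""
    claim: str
    # Free-text pointers to studies or analyses (hypothetical placeholders).
    evidence: list[str] = field(default_factory=list)


@dataclass
class ValidityArgument:
    """Propositions underlying ONE interpretation for ONE specified use."""
    intended_use: str
    propositions: list[Proposition]

    def lacking_evidence(self) -> list[str]:
        """Return the claims for which no evidence has been recorded."""
        return [p.claim for p in self.propositions if not p.evidence]


# Hypothetical argument for using a mathematics test for course placement.
placement = ValidityArgument(
    intended_use="placement in an advanced course",
    propositions=[
        Proposition("test content matches prerequisite skills",
                    ["placeholder: content review"]),
        Proposition("high scorers succeed more often in the course",
                    ["placeholder: predictor-criterion study"]),
    ],
)

# A different use of the same test needs its own propositions and evidence;
# nothing recorded for placement carries over.
teacher_evaluation = ValidityArgument(
    intended_use="teacher evaluation",
    propositions=[
        Proposition("score gains reflect quality of instruction"),
    ],
)

for argument in (placement, teacher_evaluation):
    gaps = argument.lacking_evidence()
    status = "no recorded gaps" if not gaps else f"needs evidence for: {gaps}"
    print(f"{argument.intended_use}: {status}")
```

The sketch only tracks whether any evidence has been recorded for a proposition. Whether that evidence is of sufficient quality and quantity remains a matter of professional judgment, as described above, and is deliberately not modeled here.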