Part 1: Validity

1. Validity refers to the degree to which evidence supports the proposed interpretations and uses of test scores. Validity is evaluated based on the specific interpretations and uses, not the test itself.
2. Validation involves accumulating evidence to support the proposed interpretation of test scores for a specified use. It also involves considering rival hypotheses and unintended consequences that could challenge the proposed interpretation.
3. The conceptual framework describes the construct being measured and guides test development and evaluation. It may need revision as new validity evidence becomes available.


1. VALIDITY

BACKGROUND

Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests. The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself. When test scores are interpreted in more than one way (e.g., both to describe a test taker’s current level of the attribute being measured and to make a prediction about a future outcome), each intended interpretation must be validated. Statements about validity should refer to particular interpretations for specified uses. It is incorrect to use the unqualified phrase “the validity of the test.”

Evidence of the validity of a given interpretation of test scores for a specified use is a necessary condition for the justifiable use of the test. Where sufficient evidence of validity exists, the decision as to whether to actually administer a particular test generally takes additional considerations into account. These include cost-benefit considerations, framed in different subdisciplines as utility analysis or as consideration of negative consequences of test use, and a weighing of any negative consequences against the positive consequences of test use.

Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. The proposed interpretation includes specifying the construct the test is intended to measure. The term construct is used in the Standards to refer to the concept or characteristic that a test is designed to measure. Rarely, if ever, is there a single possible meaning that can be attached to a test score or a pattern of test responses. Thus, it is always incumbent on test developers and users to specify the construct interpretation that will be made on the basis of the score or response pattern. Examples of constructs currently used in assessment include mathematics achievement, general cognitive ability, racial identity attitudes, depression, and self-esteem.

To support test development, the proposed construct interpretation is elaborated by describing its scope and extent and by delineating the aspects of the construct that are to be represented. The detailed description provides a conceptual framework for the test, delineating the knowledge, skills, abilities, traits, interests, processes, competencies, or characteristics to be assessed. Ideally, the framework indicates how the construct as represented is to be distinguished from other constructs and how it should relate to other variables.

The conceptual framework is partially shaped by the ways in which test scores will be used. For instance, a test of mathematics achievement might be used to place a student in an appropriate program of instruction, to endorse a high school diploma, or to inform a college admissions decision. Each of these uses implies a somewhat different interpretation of the mathematics achievement test scores: that a student will benefit from a particular instructional intervention, that a student has mastered a specified curriculum, or that a student is likely to be successful with college-level work. Similarly, a test of conscientiousness might be used for psychological counseling, to inform a decision about employment, or for the basic scientific purpose of elaborating the construct of conscientiousness. Each of these potential uses shapes the specified framework and the proposed interpretation of the test’s scores and also can have implications for test development and evaluation. Validation can be viewed as a process of constructing and evaluating arguments for and against the intended interpretation of test scores and their relevance to the proposed use.

The conceptual framework points to the kinds of evidence that might be collected to evaluate the proposed interpretation in light of the purposes of testing. As validation proceeds, and new evidence regarding the interpretations that can and cannot be drawn from test scores becomes available, revisions may be needed in the test, in the conceptual framework that shapes it, and even in the construct underlying the test.

The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful. Decisions about what types of evidence are important for the validation argument in each instance can be clarified by developing a set of propositions or claims that support the proposed interpretation for the particular purpose of testing. For instance, when a mathematics achievement test is used to assess readiness for an advanced course, evidence for the following propositions might be relevant: (a) that certain skills are prerequisite for the advanced course; (b) that the content domain of the test is consistent with these prerequisite skills; (c) that test scores can be generalized across relevant sets of items; (d) that test scores are not unduly influenced by ancillary variables, such as writing ability; (e) that success in the advanced course can be validly assessed; and (f) that test takers with high scores on the test will be more successful in the advanced course than test takers with low scores on the test. Examples of propositions in other testing contexts might include, for instance, the proposition that test takers with high general anxiety scores experience significant anxiety in a range of settings, the proposition that a child’s score on an intelligence scale is strongly related to the child’s academic performance, or the proposition that a certain pattern of scores on a neuropsychological battery indicates impairment that is characteristic of brain injury. The validation process evolves as these propositions are articulated and evidence is gathered to evaluate their soundness.

Identifying the propositions implied by a proposed test interpretation can be facilitated by considering rival hypotheses that may challenge the proposed interpretation. It is also useful to consider the perspectives of different interested parties, existing experience with similar tests and contexts, and the expected consequences of the proposed test use. A finding of unintended consequences of test use may also prompt a consideration of rival hypotheses. Plausible rival hypotheses can often be generated by considering whether a test measures less or more than its proposed construct. Such considerations are referred to as construct underrepresentation (or construct deficiency) and construct-irrelevant variance (or construct contamination), respectively.

Construct underrepresentation refers to the degree to which a test fails to capture important aspects of the construct. It implies a narrowed meaning of test scores because the test does not adequately sample some types of content, engage some psychological processes, or elicit some ways of responding that are encompassed by the intended construct. Take, for example, a test intended as a comprehensive measure of anxiety. A particular test might underrepresent the intended construct because it measures only physiological reactions and not emotional, cognitive, or situational components. As another example, a test of reading comprehension intended to measure children’s ability to read and interpret stories with understanding might not contain a sufficient variety of reading passages or might ignore a common type of reading material.

Construct-irrelevant variance refers to the degree to which test scores are affected by processes that are extraneous to the test’s intended purpose. The test scores may be systematically influenced to some extent by processes that are not part of the construct. In the case of a reading comprehension test, these might include material too far above or below the level intended to be tested, an emotional reaction to the test content, familiarity with the subject matter of the reading passages on the test, or the writing skill needed to compose a response. Depending on the detailed definition of the construct, vocabulary knowledge or reading speed might also be irrelevant components. On a test designed to measure anxiety, a response bias to underreport one’s anxiety might be considered a source of construct-irrelevant variance. In the case of a mathematics test, it might include overreliance on reading comprehension skills that English language learners may be lacking. On a test designed to measure science knowledge, test-taker internalizing of gender-based stereotypes about women in the sciences might be a source of construct-irrelevant variance.

Nearly all tests leave out elements that some potential users believe should be measured and include some elements that some potential users consider inappropriate. Validation involves careful attention to possible distortions in meaning arising from inadequate representation of the construct and also to aspects of measurement, such as test format, administration conditions, or language level, that may materially limit or qualify the interpretation of test scores for various groups of test takers. That is, the process of validation may lead to revisions in the test, in the conceptual framework of the test, or both. Interpretations drawn from the revised test would again need validation.

When propositions have been identified that would support the proposed interpretation of test scores, one can proceed with validation by obtaining empirical evidence, examining relevant literature, and/or conducting logical analyses to evaluate each of the propositions. Empirical evidence may include both local evidence, produced within the contexts where the test will be used, and evidence from similar testing applications in other settings. Use of existing evidence from similar tests and contexts can enhance the quality of the validity argument, especially when data for the test and context in question are limited.

Because an interpretation for a given use typically depends on more than one proposition, strong evidence in support of one part of the interpretation in no way diminishes the need for evidence to support other parts of the interpretation. For example, when an employment test is being considered for selection, a strong predictor-criterion relationship in an employment setting is ordinarily not sufficient to justify use of the test. One should also consider the appropriateness and meaningfulness of the criterion measure, the appropriateness of the testing materials and procedures for the full range of applicants, and the consistency of the support for the proposed interpretation across groups. Professional judgment guides decisions regarding the specific forms of evidence that can best support the intended interpretation for a specified use. As in all scientific endeavors, the quality of the evidence is paramount. A few pieces of solid evidence regarding a particular proposition are better than numerous pieces of evidence of questionable quality. The determination that a given test interpretation for a specific purpose is warranted is based on professional judgment that the preponderance of the available evidence supports that interpretation. The quality and quantity of evidence sufficient to reach this judgment may differ for test uses depending on the stakes involved in the testing. A given interpretation may not be warranted either as a result of insufficient evidence in support of it or as a result of credible evidence against it.

Validation is the joint responsibility of the test developer and the test user. The test developer is responsible for furnishing relevant evidence and a rationale in support of any test score interpretations for specified uses intended by the developer. The test user is ultimately responsible for evaluating the evidence in the particular setting in which the test is to be used. When a test user proposes an interpretation or use of test scores that differs from those supported by the test developer, the responsibility for providing validity evidence in support of that interpretation for the specified use is the responsibility of the user. It should be noted that important contributions to the validity evidence may be made as other researchers report findings of investigations that are related to the meaning of scores on the test.

Sources of Validity Evidence

The following sections outline various sources of evidence that might be used in evaluating the validity of a proposed interpretation of test scores for a particular use. These sources of evidence may illuminate different aspects of validity, but they do not represent distinct types of validity. Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed use. Like the 1999 Standards, this edition refers to types of validity evidence, rather than distinct types of validity. To emphasize this distinction, the treatment that follows does not follow historical nomenclature (i.e., the use of the terms content validity or predictive validity).

As the discussion in the prior section emphasizes, each type of evidence presented below is not required in all settings. Rather, support is needed for each proposition that underlies a proposed test interpretation for a specified use. A proposition that a test is predictive of a given criterion can be supported without evidence that the test samples a particular content domain. In contrast, a proposition that a test covers a representative sample of a particular curriculum may be supported without evidence that the test predicts a given criterion. However, a more complex set of propositions, e.g., that a test samples a specified domain and thus is predictive of a criterion reflecting a related domain, will require evidence supporting both parts of this set of propositions. Test developers are also expected to make the case that the scores are not unduly influenced by construct-irrelevant variance (see chap. 3 for detailed treatment of issues related to construct-irrelevant variance). In general, adequate support for proposed interpretations for specific uses will require multiple sources of evidence.

The position developed above also underscores the fact that if a given test is interpreted in multiple ways for multiple uses, the propositions underlying these interpretations for different uses also are likely to differ. Support is needed for the propositions underlying each interpretation for a specific use. Evidence supporting the interpretation of scores on a mathematics achievement test for placing students in subsequent courses (i.e., evidence that the test interpretation is valid for its intended purpose) does not permit inferring validity for other purposes (e.g., promotion or teacher evaluation).

Evidence Based on Test Content

Important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure. Test content refers to the themes, wording, and format of the items, tasks, or questions on a test. Administration and scoring may also be relevant to content-based evidence. Test developers often work from a specification of the content domain. The content specification carefully describes the content in detail, often with a classification of areas of content and types of items. Evidence based on test content can include logical or empirical analyses of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores. Evidence based on content can also come from expert judgments of the relationship between parts of the test and the construct. For example, in developing a licensure test, the major facets that are relevant to the purpose for which the occupation is regulated can be specified, and experts in that occupation can be asked to assign test items to the categories defined by those facets. These or other experts can then judge the representativeness of the chosen set of items.

Some tests are based on systematic observations of behavior. For example, a list of the tasks constituting a job domain may be developed from observations of behavior in a job, together with judgments of subject matter experts. Expert judgments can be used to assess the relative importance, criticality, and/or frequency of the various tasks. A job sample test can then be constructed from a random or stratified sampling of tasks rated highly on these characteristics. The test can then be administered under standardized conditions in an off-the-job setting.

The appropriateness of a given content domain is related to the specific inferences to be made from test scores. Thus, when considering an available test for a purpose other than that for which it was first developed, it is especially important to evaluate the appropriateness of the original content domain for the proposed new purpose. For example, a test given for research purposes to compare student achievement across states in a given domain may properly also cover material that receives little or no attention in the curriculum. Policy makers can then evaluate student achievement with respect to both content neglected and content addressed. On the other hand, when student mastery of a delivered curriculum is tested for purposes of informing decisions about individual students, such as promotion or graduation, the framework elaborating a content domain is appropriately limited to what students have had an opportunity to learn from the curriculum as delivered.

Evidence about content can be used, in part, to address questions about differences in the meaning or interpretation of test scores across relevant subgroups of test takers. Of particular concern is the extent to which construct underrepresentation or construct-irrelevant variance may give an unfair advantage or disadvantage to one or more subgroups of test takers. For example, in an employment test, the use of vocabulary more complex than needed on the job may be a source of construct-irrelevant variance for English language learners or others. Careful review of the construct and test content domain by a diverse panel of experts may point to potential sources of irrelevant difficulty (or easiness) that require further investigation.

Content-oriented evidence of validation is at the heart of the process in the educational arena known as alignment, which involves evaluating the correspondence between student learning standards and test content. Content-sampling issues in the alignment process include evaluating whether test content appropriately samples the domain set forward in curriculum standards, whether the cognitive demands of test items correspond to the level reflected in the student learning standards (e.g., content standards), and whether the test avoids the inclusion of features irrelevant to the standard that is the intended target of each test item.

Evidence Based on Response Processes

Some construct interpretations involve more or less explicit assumptions about the cognitive processes engaged in by test takers. Theoretical and empirical analyses of the response processes of test takers can provide evidence concerning the fit between the construct and the detailed nature of the performance or response actually engaged in by test takers. For instance, if a test is intended to assess mathematical reasoning, it becomes important to determine whether test takers are, in fact, reasoning about the material given instead of following a standard algorithm applicable only to the specific items on the test.

Evidence based on response processes generally comes from analyses of individual responses. Questioning test takers from various groups making up the intended test-taking population about their performance strategies or responses to particular items can yield evidence that enriches the definition of a construct. Maintaining records that monitor the development of a response to a writing task, through successive written drafts or electronically monitored revisions, for instance, also provides evidence of process. Documentation of other aspects of performance, like eye movements or response times, may also be relevant to some constructs. Inferences about processes involved in performance can also be developed by analyzing the relationship among parts of the test and between the test and other variables. Wide individual differences in process can be revealing and may lead to reconsideration of certain test formats.

Evidence of response processes can contribute to answering questions about differences in meaning or interpretation of test scores across relevant subgroups of test takers. Process studies involving test takers from different subgroups can assist in determining the extent to which capabilities irrelevant or ancillary to the construct may be differentially influencing test takers’ test performance.

Studies of response processes are not limited to the test taker. Assessments often rely on observers or judges to record and/or evaluate test takers’ performances or products. In such cases, relevant validity evidence includes the extent to which the processes of observers or judges are consistent with the intended interpretation of scores. For instance, if judges are expected to apply particular criteria in scoring test takers’ performances, it is important to ascertain whether they are, in fact, applying the appropriate criteria and not being influenced by factors that are irrelevant to the intended interpretation (e.g., quality of handwriting is irrelevant to judging the content of a written essay). Thus, validation may include empirical studies of how observers or judges record and evaluate data along with analyses of the appropriateness of these processes to the intended interpretation or construct definition.

While evidence about response processes may be central in settings where explicit claims about response processes are made by test developers or where inferences about responses are made by test users, there are many other cases where claims about response processes are not part of the validity argument. In some cases, multiple response processes are available for solving the problems of interest, and the construct of interest is only concerned with whether the problem was solved correctly. As a simple example, there may be multiple possible routes to obtaining the correct solution to a mathematical problem.

Evidence Based on Internal Structure

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity.

The specific types of analyses and their interpretation depend on how the test will be used. For example, if a particular application posited a series of increasingly difficult test components, empirical evidence of the extent to which response patterns conformed to this expectation would be provided. A theory that posited unidimensionality would call for evidence of item homogeneity. In this case, the number of items and item interrelationships form the basis for an estimate of score reliability, but such an index would be inappropriate for tests with a more complex internal structure.
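
The link the passage draws between item homogeneity, the number of items, and an estimate of score reliability can be made concrete with coefficient alpha, one common internal-consistency index. The sketch below is illustrative only and is not part of the Standards: the response matrix is hypothetical, and alpha is just one of several indices that could serve this purpose.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an examinees-by-items score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical right/wrong (1/0) responses: 6 examinees x 4 items.
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
print(round(cronbach_alpha(scores), 3))  # about 0.667 for these data
```

As the text cautions, an index of this kind presumes the posited unidimensionality; for a test built around several distinct components, a factor-analytic study of the separate components would be the more appropriate internal-structure evidence.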

Some studies of the internal structure of tests are designed to show whether particular items may function differently for identifiable subgroups of test takers (e.g., racial/ethnic or gender subgroups). Differential item functioning occurs when different groups of test takers with similar overall ability, or similar status on an appropriate criterion, have, on average, systematically different responses to a particular item. This issue is discussed in chapter 3. However, differential item functioning is not always a flaw or weakness. Subsets of items that have a specific characteristic in common (e.g., specific content, task representation) may function differently for different groups of similarly scoring test takers. This indicates a kind of multidimensionality that may be unexpected or may conform to the test framework.
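
One common way to operationalize "similar overall ability" in a DIF analysis is the Mantel-Haenszel procedure, which compares item performance for two groups after matching test takers on total score. The sketch below is a hedged illustration, not a method prescribed by the Standards; all counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across matched score strata.

    Each stratum is (a, b, c, d):
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical counts for one item in three total-score strata.
strata = [
    (30, 20, 25, 25),  # low-scoring stratum
    (40, 10, 30, 20),  # middle stratum
    (45,  5, 40, 10),  # high-scoring stratum
]
# A ratio of 1.0 means matched groups have equal odds of success;
# this hypothetical item favors the reference group (ratio = 2.0).
print(mantel_haenszel_odds_ratio(strata))
```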

Evidence Based on Relations to Other Variables

In many cases, the intended interpretation for a given use implies that the construct should be related to some other variables, and, as a result, analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence. External variables may include measures of some criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure the same constructs, and tests measuring related or different constructs. Measures other than test scores, such as performance criteria, are often used in employment settings. Categorical variables, including group membership variables, become relevant when the theory underlying a proposed test use suggests that group differences should be present or absent if a proposed test score interpretation is to be supported. Evidence based on relationships with other variables provides evidence about the degree to which these relationships are consistent with the construct underlying the proposed test score interpretations.

Convergent and discriminant evidence. Relationships between test scores and other measures intended to assess the same or similar constructs provide convergent evidence, whereas relationships between test scores and measures purportedly of different constructs provide discriminant evidence. For instance, within some theoretical frameworks, scores on a multiple-choice test of reading comprehension might be expected to relate closely (convergent evidence) to other measures of reading comprehension based on other methods, such as essay responses. Conversely, test scores might be expected to relate less closely (discriminant evidence) to measures of other skills, such as logical reasoning. Relationships among different methods of measuring the construct can be especially helpful in sharpening and elaborating score meaning and interpretation.

Evidence of relations with other variables can involve experimental as well as correlational evidence. Studies might be designed, for instance, to investigate whether scores on a measure of anxiety improve as a result of some psychological treatment or whether scores on a test of academic achievement differentiate between instructed and noninstructed groups. If performance increases due to short-term coaching are viewed as a threat to validity, it would be useful to investigate whether coached and uncoached groups perform differently.
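
The convergent/discriminant pattern described for the reading-comprehension example can be summarized with a simple correlation analysis. The simulation below is purely illustrative; the variable names, effect sizes, and sample size are invented to show the expected pattern, not drawn from any real study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Latent abilities for the hypothetical examinees.
reading_ability = rng.normal(size=n)
reasoning_ability = rng.normal(size=n)

# Observed scores = ability + method-specific noise.
mc_reading = reading_ability + 0.5 * rng.normal(size=n)     # multiple choice
essay_reading = reading_ability + 0.7 * rng.normal(size=n)  # essay method
logic_test = reasoning_ability + 0.5 * rng.normal(size=n)   # other construct

# Convergent evidence: same construct, different method -> high r.
print(np.corrcoef(mc_reading, essay_reading)[0, 1])
# Discriminant evidence: different constructs -> near-zero r here.
print(np.corrcoef(mc_reading, logic_test)[0, 1])
```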

Test-criterion relationships. Evidence of the relation of test scores to a relevant criterion may be expressed in various ways, but the fundamental question is always, how accurately do test scores predict criterion performance? The degree of accuracy and the score range within which accuracy is needed depends on the purpose for which the test is used.

The criterion variable is a measure of some attribute or outcome that is operationally distinct from the test. Thus, the test is not a measure of a criterion, but rather is a measure hypothesized as a potential predictor of that targeted criterion. Whether a test predicts a given criterion in a given context is a testable hypothesis. The criteria that are of interest are determined by test users, for example administrators in a school system or managers of a firm. The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance. The credibility of a test-criterion study depends on the relevance, reliability, and validity of the interpretation based on the criterion measure for a given testing application.

Historically, two designs, often called predictive and concurrent, have been distinguished for evaluating test-criterion relationships. A predictive study indicates the strength of the relationship between test scores and criterion scores that are obtained at a later time. A concurrent study obtains test scores and criterion information at about the same time. When prediction is actually contemplated, as in academic admission or employment settings, or in planning rehabilitation regimens, predictive studies can retain the temporal differences and other characteristics of the practical situation. Concurrent evidence, which avoids temporal changes, is particularly useful for psychodiagnostic tests or in investigating alternative measures of some specified construct for which an accepted measurement procedure already exists. The choice of a predictive or concurrent research strategy in a given domain is also usefully informed by prior research evidence regarding the extent to which predictive and concurrent studies in that domain yield the same or different results.

Test scores are sometimes used in allocating individuals to different treatments in a way that is advantageous for the institution and/or for the individuals. Examples would include assigning individuals to different jobs within an organization, or determining whether to place a given student in a remedial class or a regular class. In that context, evidence is needed to judge the suitability of using a test when classifying or assigning a person to one job versus another or to one treatment versus another. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. It is possible for tests to be highly predictive of performance for different education programs or jobs without providing the information necessary to make a comparative judgment of the efficacy of assignments or treatments. In general, decision rules for selection or placement are also influenced by the number of persons to be accepted or the numbers that can be accommodated in alternative placement categories (see chap. 11).

Evidence about relations to other variables is also used to investigate questions of differential prediction for subgroups. For instance, a finding that the relation of test scores to a relevant criterion variable differs from one subgroup to another may imply that the meaning of the scores is not the same for members of the different groups, perhaps due to construct underrepresentation or construct-irrelevant sources of variance. However, the difference may also imply that the criterion has different meaning for different groups. The differences in test-criterion relationships can also arise from measurement error, especially when group means differ, so such differences do not necessarily indicate differences in score meaning. See the discussion of fairness in chapter 3 for more extended consideration of possible courses of action when scores have different meanings for different groups.
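
At its simplest, a predictive study of the kind just described reduces to correlating test scores with criterion scores obtained later, and to the regression line that converts a test score into a predicted criterion score. The data below are invented for illustration; real studies would involve far larger samples and the attention to criterion quality the text calls for.

```python
import numpy as np

# Hypothetical predictive design: admissions test scores at entry,
# first-year GPA (the criterion) collected a year later.
test_scores = np.array([410, 450, 480, 500, 530, 560, 590, 620, 650, 700])
first_year_gpa = np.array([2.1, 2.4, 2.3, 2.8, 2.9, 3.0, 3.2, 3.1, 3.5, 3.8])

r = np.corrcoef(test_scores, first_year_gpa)[0, 1]
print(f"observed test-criterion correlation: r = {r:.2f}")

# Least-squares line: predicted criterion score for any test score.
slope, intercept = np.polyfit(test_scores, first_year_gpa, 1)
print(f"predicted GPA at a test score of 550: {slope * 550 + intercept:.2f}")
```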

Validity generalization. An important issue in educational and employment settings is the degree to which validity evidence based on test-criterion relations can be generalized to a new situation without further study of validity in that new situation. When a test is used to predict the same or similar criteria (e.g., performance of a given job) at different times or in different places, it is typically found that observed test-criterion correlations vary substantially. In the past, this has been taken to imply that local validation studies are always required. More recently, a variety of approaches to generalizing evidence from other settings has been developed, with meta-analysis the most widely used in the published literature. In particular, meta-analyses have shown that in some domains, much of this variability may be due to statistical artifacts such as sampling fluctuations and variations across validation studies in the ranges of test scores and in the reliability of criterion measures. When these and other influences are taken into account, it may be found that the remaining variability in validity coefficients is relatively small. Thus, statistical summaries of past validation studies in similar situations may be useful in estimating test-criterion relationships in a new situation. This practice is referred to as the study of validity generalization.

In some circumstances, there is a strong basis for using validity generalization. This would be the case where the meta-analytic database is large, where the meta-analytic data adequately represent the type of situation to which one wishes to generalize, and where correction for statistical artifacts produces a clear and consistent pattern of validity evidence. In such circumstances, the informational value of a local validity study may be relatively limited if not actually misleading, especially if its sample size is small. In other circumstances, the inferential leap required for generalization may be much larger. The meta-analytic database may be small, the findings may be less consistent, or the new situation may involve features markedly different from those represented in the meta-analytic database. In such circumstances, situation-specific validity evidence will be relatively more informative. Although research on validity generalization shows that results of a single local validation study may be quite imprecise, there are situations where a single study, carefully done, with adequate sample size, provides sufficient evidence to support or reject test use in a new situation. This highlights the importance of examining carefully the comparative informational value of local versus meta-analytic studies.

In conducting studies of the generalizability of validity evidence, the prior studies that are included may vary according to several situational facets. Some of the major facets are (a) differences in the way the predictor construct is measured, (b) the type of job or curriculum involved, (c) the type of criterion measure used, (d) the type of test takers, and (e) the time period in which the study was conducted. In any particular study of validity generalization, any number of these facets might vary, and a major objective of the study is to determine empirically the extent to which variation in these facets affects the test-criterion correlations obtained.

The extent to which predictive or concurrent validity evidence can be generalized to new situations is in large measure a function of accumulated research. Although evidence of generalization can often help to support a claim of validity in a new situation, the extent of available data limits the degree to which the claim can be sustained.

The above discussion focuses on the use of cumulative databases to estimate predictor-criterion relationships. Meta-analytic techniques can also be used to summarize other forms of data relevant to other inferences one may wish to draw from test scores in a particular application, such as effects of coaching and effects of certain alterations in testing conditions for test takers with specified disabilities. Gathering evidence about how well validity findings can be generalized across groups of test takers is an important part of the validation process. When the evidence suggests that inferences from test scores can be drawn for some subgroups but not for others, pursuing options such as those discussed in chapter 3 can reduce the risk of unfair test use.
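
The statistical summary described above can be illustrated with a bare-bones meta-analysis in the Hunter-Schmidt tradition: a sample-size-weighted mean validity coefficient, with the observed variance of coefficients compared against the variance expected from sampling error alone. Everything below is hypothetical, and a full analysis would also correct for the other artifacts the text mentions, such as range restriction and criterion unreliability.

```python
import numpy as np

# Hypothetical validity coefficients from k = 5 local studies of the
# same test-criterion relationship, with their sample sizes.
r = np.array([0.18, 0.31, 0.25, 0.12, 0.29])
n = np.array([60, 150, 90, 45, 200])

# Sample-size-weighted mean validity.
r_bar = np.average(r, weights=n)

# Observed (weighted) variance of the coefficients, and the variance
# expected from sampling fluctuation alone at these sample sizes.
var_obs = np.average((r - r_bar) ** 2, weights=n)
var_sampling = (1 - r_bar**2) ** 2 / (n.mean() - 1)
var_residual = max(var_obs - var_sampling, 0.0)

print(f"weighted mean r = {r_bar:.3f}")
print(f"observed variance = {var_obs:.5f}, from sampling error = {var_sampling:.5f}")
print(f"residual variance = {var_residual:.5f}")
```

If most of the observed variability is accounted for by sampling error (residual variance near zero), the case for generalizing the mean validity to a comparable new situation is strengthened; a large residual suggests real situational moderators.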

Evidence for Validity and Consequences of Testing

Some consequences of test use follow directly from the interpretation of test scores for uses intended by the test developer. The validation process involves gathering evidence to evaluate the soundness of these proposed interpretations for their intended uses.

Other consequences may also be part of a claim that extends beyond the interpretation or use of scores intended by the test developer. For example, a test of student achievement might provide data for a system intended to identify and improve lower-performing schools. The claim that testing results, used this way, will result in improved student learning may rest on propositions about the system or intervention itself, beyond propositions based on the meaning of the test itself. Consequences may point to the need for evidence about components of the system that will go beyond the interpretation of test scores as a valid measure of student achievement.

Still other consequences are unintended, and are often negative. For example, school district or statewide educational testing on selected subjects may lead teachers to focus on those subjects at the expense of others. As another example, a test developed to measure knowledge needed for a given job may result in lower passing rates for one group than for another. Unintended consequences merit close examination. While not all consequences can be anticipated, in some cases factors such as prior experiences in other settings offer a basis for anticipating and proactively addressing unintended consequences. See chapter 12 for additional examples from educational settings. In some cases, actions to address one consequence bring about other consequences. One example involves the notion of “missed opportunities,” as in the case of moving to computerized scoring of student essays to increase grading consistency, thus forgoing the educational benefits of addressing the same problem by training teachers to grade more consistently.

These types of consideration of consequences of testing are discussed further below.

Interpretation and uses of test scores intended by test developers. Tests are commonly administered in the expectation that some benefit will be realized from the interpretation and use of the scores intended by the test developers. A few of the many possible benefits that might be claimed are selection of efficacious therapies, placement of workers in suitable jobs, prevention of unqualified individuals from entering a profession, or improvement of classroom instructional practices. A fundamental purpose of validation is to indicate whether these specific benefits are likely to be realized. Thus, in the case of a test used in placement decisions, the validation would be informed by evidence that alternative placements, in fact, are differentially beneficial to the persons and the institution. In the case of employment testing, if a test publisher asserts that use of the test will result in reduced employee training costs, improved workforce efficiency, or some other benefit, then the validation would be informed by evidence in support of that proposition.

It is important to note that the validity of test score interpretations depends not only on the uses of the test scores but specifically on the claims that underlie the theory of action for these uses. For example, consider a school district that wants to determine children’s readiness for kindergarten, and so administers a test battery and screens out students with low scores. If higher scores do, in fact, predict higher performance on key kindergarten tasks, the claim that use of the test scores for screening results in higher performance on these key tasks is supported and the interpretation of the test scores as a predictor of kindergarten readiness would be valid. If, however, the claim were made that use of the test scores for screening would result in the greatest benefit to students, the interpretation of test scores as indicators of readiness for kindergarten might not be valid because students with low scores might actually benefit more from access to kindergarten. In this case, different evidence is needed to support different claims that might be made about the same use of the screening test (for example, evidence that students below a certain cut score benefit more from another assignment than from assignment to kindergarten). The test developer is responsible for the validation of the interpretation that the test scores assess the indicated readiness skills. The school district is responsible for the validation of the proper interpretation of the readiness test scores and for evaluation of the policy of using the readiness test for placement/admissions decisions.

Claims made about test use that are not directly based on test score interpretations. Claims are sometimes made for benefits of testing that go beyond the direct interpretations or uses of the test scores themselves that are specified by the test developers. Educational tests, for example, may be advocated on the grounds that their use will improve student motivation to learn or encourage changes in classroom instructional practices by holding educators accountable for valued learning outcomes. Where such claims are central to the rationale advanced for testing, the direct examination of testing consequences necessarily assumes even greater importance. Those making the claims are responsible for evaluation of the claims. In some cases, such information can be drawn from existing data collected for purposes other than test validation; in other cases new information will be needed to address the impact of the testing program.

Consequences that are unintended. Test score interpretation for a given use may result in unintended consequences. A key distinction is between consequences that result from a source of error in the intended test score interpretation for a given use and consequences that do not result from error in test score interpretation. Examples of each are given below.

As discussed at some length in chapter 3, one domain in which unintended negative consequences of test use are at times observed involves test score differences for groups defined in terms of race/ethnicity, gender, age, and other characteristics. In such cases, however, it is important to distinguish between evidence that is directly relevant to validity and evidence that may inform decisions about social policy but falls outside the realm of validity. For example, concerns have been raised about the effect of group differences in test scores on employment selection and promotion, the placement of children in special education classes, and the narrowing of a school’s curriculum to exclude learning objectives that are not assessed. Although information about the consequences of testing may influence decisions about test use, such consequences do not, in and of themselves, detract from the validity of intended interpretations of the test scores. Rather, judgments of validity or invalidity in the light of testing consequences depend on a more searching inquiry into the sources of those consequences.

Take, as an example, a finding of different hiring rates for members of different groups as a consequence of using an employment test. If the difference is due solely to an unequal distribution of the skills the test purports to measure, and if those skills are, in fact, important contributors to job performance, then the finding of group differences per se does not imply any lack of validity for the intended interpretation. If, however, the test measured skill differences unrelated to job performance (e.g., a sophisticated reading test for a job that required only minimal functional literacy), or if the differences were due to the test’s sensitivity to some test-taker characteristic not intended to be part of the test construct, then the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid, even if test scores correlated positively with some measure of job performance. If a test covers most of the relevant content domain but omits some areas, the content coverage might be judged adequate for some purposes. However, if it is found that excluding some components that could readily be assessed has a noticeable impact on selection rates for groups of interest (e.g., subgroup differences are found to be smaller on excluded components than on included components), the intended interpretation of test scores as predicting job performance in a comparable manner for all groups of applicants would be rendered invalid. Thus, evidence about consequences is relevant to validity when it can be traced to a source of invalidity such as construct underrepresentation or construct-irrelevant components. Evidence about consequences that cannot be so traced is not relevant to the validity of the intended interpretations of the test scores.

As another example, consider the case where research supports an employer’s use of a particular test in the personality domain (i.e., the test proves to be predictive of an aspect of subsequent job performance), but it is found that some applicants form a negative opinion of the organization due to the perception that the test invades personal privacy. Thus, there is an unintended negative consequence of test use, but one that is not due to a flaw in the intended interpretation of test scores as predicting subsequent performance. Some employers faced with this situation may conclude that this negative consequence is grounds for discontinuing test use; others may conclude that the benefits gained by screening applicants outweigh this negative consequence. As this example illustrates, a consideration of consequences can influence a decision about test use, even though the consequence is independent of the validity of the intended test score interpretation. The example also illustrates that different decision makers may make different value judgments about the impact of consequences on test use.

The fact that the validity evidence supports the intended interpretation of test scores for use in applicant screening does not mean that test use is thus required: Issues other than validity, including legal constraints, can play an important and, in some cases, a determinative role in decisions about test use. Legal constraints may also limit an employer’s discretion to discard test scores from tests that have already been administered, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders.

Note that unintended consequences can also be positive. Reversing the above example of test takers who form a negative impression of an organization based on the use of a particular test, a different test may be viewed favorably by applicants, leading to a positive impression of the organization. A given test use may result in multiple consequences, some positive and some negative.

In short, decisions about test use are appropriately informed by validity evidence about intended test score interpretations for a given use, by evidence evaluating additional claims about consequences of test use that do not follow directly from test score interpretations, and by value judgments about unintended positive and negative consequences of test use.

Integrating the Validity Evidence

A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses. It encompasses evidence gathered from new studies and evidence available from earlier reported research. The validity argument may indicate the need for refining the definition of the construct, may suggest revisions in the test or other aspects of the testing process, and may indicate areas needing further study.

It is commonly observed that the validation process never ends, as there is always additional information that can be gathered to more fully understand a test and the inferences that can be drawn from it. In this way an inference of validity is similar to any scientific inference. However, a test interpretation for a given use rests on evidence for a set of propositions making up the validity argument, and at some point validation evidence allows for a summary judgment of the intended interpretation that is well supported and defensible. At some point the effort to provide sufficient validity evidence to support a given test interpretation for a specific use does end (at least provisionally, pending the emergence of a strong basis for questioning that judgment). Legal requirements may necessitate that the validation study be updated in light of such factors as changes in the test population or newly developed alternative testing methods.

The amount and character of evidence required to support a provisional judgment of validity often vary between areas and also within an area as research on a topic advances. For example, prevailing standards of evidence may vary with the stakes involved in the use or interpretation of the test scores. Higher stakes may entail higher standards of evidence. As another example, in areas where data collection comes at a greater cost, one may find it necessary to base interpretations on fewer data than in areas where data collection comes with less cost.

Ultimately, the validity of an intended interpretation of test scores relies on all the available evidence relevant to the technical quality of a testing system. Different components of validity evidence are described in subsequent chapters of the Standards, and include evidence of careful test construction; adequate score reliability; appropriate test administration and scoring; accurate score scaling, equating, and standard setting; and careful attention to fairness for all test takers, as appropriate to the test interpretation in question.
STANDARD which test scores are to be employed, and the
processes by which the test is to be administered
S FOR and scored.
VALIDITY
Standard 1.2
The standards in this chapter begin with an over A rationale should be presented for each
arching standard (numbered 1.0), which is intended interpretation of test scores for a
designed to convey the central intent or primary given use, together with a summary of the
focus of the chapter. The overarching standard evidence and theory bearing on the intended
may also be viewed as the guiding principle of the interpretation.
chapter, and is applicable to all tests and test
users. All subsequent standards have been Comment: The rationale should indicate what
separated into three thematic clusters labeled as propositions are necessary to investigate the
follows: intended interpretation. The summary should
combine logical analysis with empirical evidence
to provide support for the test rationale. Evidence
1. Establishing Intended Uses and Interpreta tions
may come from studies conducted locally, in the
2. Issues Regarding Samples and Settings Used in
setting where the test is to be used; from specific
Validation
prior studies; or from comprehensive statistical
3. Specific Forms of Validity Evidence syntheses of available studies meeting clearly
spec ified study quality criteria. No type of
Standard 1.0 evidence is inherently preferable to others; rather,
the quality and relevance of the evidence to the
Clear articulation of each intended test score intended test score interpretation for a given use
in terpretation for a specified use should be set determine the value of a particular kind of
forth, and appropriate validity evidence in evidence. A presentation of empirical evidence on
support of each intended interpretation should any point should give due weight to all relevant
be provided. findings in the scientific literature, including
those inconsistent with the intended interpretation
Cluster 1. Establishing or use. Test developers have the responsibility to
provide support for their own recommendations,
Intended Uses and but test users bear ultimate responsibility for
Interpretations evaluating the quality of the validity evidence
provided and its relevance to the local situation.
Standard 1.1
The test developer should set forth clearly how Standard 1.3
test scores are intended to be interpreted and
consequently used. The population(s) for If validity for some common or likely
which a test is intended should be delimited interpretation for a given use has not been
clearly, and the construct or constructs that evaluated, or if such an interpretation is
the test is intended to assess should be inconsistent with available evidence, that fact
described clearly. should be made clear and po tential users
should be strongly cautioned about making
Comment: Statements about validity should refer unsupported interpretations.
to particular interpretations and consequent uses.
It is incorrect to use the unqualified phrase “the Comment: If past experience suggests that a test
validity of the test.” No test permits is likely to be used inappropriately for
interpretations that are valid for all purposes or in certain kinds of decisions or certain kinds of test
all situations. Each recommended interpretation takers, specific warnings against such uses should
for a given use requires validation. The test be given. Professional judgment is required to
developer should specify in clear language the evaluate the extent to which existing validity
population for which the test is intended, the evidence sup ports a given test use.
construct it is intended to measure, the contexts in
Standard 1.4

If a test score is interpreted for a given use in a way that has not been validated, it is incumbent on the user to justify the new interpretation for that use, providing a rationale and collecting new evidence, if necessary.

Comment: Professional judgment is required to evaluate the extent to which existing validity evidence applies in the new situation or to the new group of test takers and to determine what new evidence may be needed. The amount and kinds of new evidence required may be influenced by experience with similar prior test uses or interpretations and by the amount, quality, and relevance of existing data.

A test that has been altered or administered in ways that change the construct underlying the test for use with subgroups of the population requires evidence of the validity of the interpretation made on the basis of the modified test (see chap. 3). For example, if a test is adapted for use with individuals with a particular disability in a way that changes the underlying construct, the modified test should have its own evidence of validity for the intended interpretation.

Standard 1.5

When it is clearly stated or implied that a recommended test score interpretation for a given use will result in a specific outcome, the basis for expecting that outcome should be presented, together with relevant evidence.

Comment: If it is asserted, for example, that interpreting and using scores on a given test for employee selection will result in reduced employee errors or training costs, evidence in support of that assertion should be provided. A given claim may be supported by logical or theoretical argument as well as empirical data. Appropriate weight should be given to findings in the scientific literature that may be inconsistent with the stated expectation.

Standard 1.6

When a test use is recommended on the grounds that testing or the testing program itself will result in some indirect benefit, in addition to the utility of information from interpretation of the test scores themselves, the recommender should make explicit the rationale for anticipating the indirect benefit. Logical or theoretical arguments and empirical evidence for the indirect benefit should be provided. Appropriate weight should be given to any contradictory findings in the scientific literature, including findings suggesting important indirect outcomes other than those predicted.

Comment: For example, certain educational testing programs have been advocated on the grounds that they would have a salutary influence on classroom instructional practices or would clarify students' understanding of the kind or level of achievement they were expected to attain. To the extent that such claims enter into the justification for a testing program, they become part of the argument for test use. Evidence for such claims should be examined, in conjunction with evidence about the validity of intended test score interpretation and evidence about unintended negative consequences of test use, in making an overall decision about test use. Due weight should be given to evidence against such predictions, for example, evidence that under some conditions educational testing may have a negative effect on classroom instruction.

Standard 1.7

If test performance, or a decision made therefrom, is claimed to be essentially unaffected by practice and coaching, then the propensity for test performance to change with these forms of instruction should be documented.

Comment: Materials to aid in score interpretation should summarize evidence indicating the degree to which improvement with practice or coaching can be expected. Also, materials written for test takers should provide practical guidance about the value of test preparation activities, including coaching.

Cluster 2. Issues Regarding Samples and Settings Used in Validation

Standard 1.8

The composition of any sample of test takers from which validity evidence is obtained should be described in as much detail as is practical and permissible, including major relevant sociodemographic and developmental characteristics.

Comment: Statistical findings can be influenced by factors affecting the sample on which the results are based. When the sample is intended to represent a population, that population should be described, and attention should be drawn to any systematic factors that may limit the representativeness of the sample. Factors that might reasonably be expected to affect the results include self-selection, attrition, linguistic ability, disability status, and exclusion criteria, among others. If the participants in a validity study are patients, for example, then the diagnoses of the patients are important, as well as other characteristics, such as the severity of the diagnosed conditions. For tests used in employment settings, the employment status (e.g., applicants versus current job holders), the general level of experience and educational background, and the gender and ethnic composition of the sample may be relevant information. For tests used in credentialing, the status of those providing information (e.g., candidates for a credential versus already-credentialed individuals) is important for interpreting the resulting data. For tests used in educational settings, relevant information may include educational background, developmental level, community characteristics, or school admissions policies, as well as the gender and ethnic composition of the sample. Sometimes legal restrictions about privacy preclude obtaining or disclosing such population information or limit the level of particularity at which such data may be disclosed. The specific privacy laws, if any, governing the type of data should be considered, in order to ensure that any description of a population does not have the potential to identify an individual in a manner inconsistent with such standards. The extent of missing data, if any, and the methods for handling missing data (e.g., use of imputation procedures) should be described.

Standard 1.9

When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. The description of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth.

Comment: Systematic collection of judgments or opinions may occur at many points in test construction (e.g., eliciting expert judgments of content appropriateness or adequate content representation), in the formulation of rules or standards for score interpretation (e.g., in setting cut scores), or in test scoring (e.g., rating of essay responses). Whenever such procedures are employed, the quality of the resulting judgments is important to the validation. Level of agreement should be specified clearly (e.g., whether percent agreement refers to agreement prior to or after a consensus discussion, and whether the criterion for agreement is exact agreement of ratings or agreement within a certain number of scale points). The basis for specifying certain types of individuals (e.g., experienced teachers, experienced job incumbents, supervisors) as appropriate experts for the judgment or rating task should be articulated. It may be entirely appropriate to have experts work together to reach consensus, but it would not then be appropriate to treat their respective judgments as statistically independent. Different judges may be used for different purposes (e.g., one set may rate items for cultural sensitivity while another may rate for reading level) or for different portions of a test.
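The agreement indices discussed in the comment above are simple to compute and report. The sketch below is illustrative only and is not part of the Standards; the two raters, the 1-6 essay rubric, and the one-point tolerance are all hypothetical assumptions, written here in Python.

    # Illustrative sketch: percent agreement between two raters on a
    # hypothetical 1-6 essay rubric, prior to any consensus discussion.
    rater_a = [4, 3, 5, 2, 4, 6, 3, 5, 4, 2]
    rater_b = [4, 2, 5, 2, 5, 6, 3, 4, 4, 3]

    n = len(rater_a)
    exact = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    within_one = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / n

    # Standard 1.9 asks that the criterion for "agreement" be stated,
    # since the two figures below can differ substantially.
    print(f"Exact agreement: {exact:.0%}")                 # 60%
    print(f"Agreement within one point: {within_one:.0%}")  # 100%

Whichever criterion is adopted, a report should state it explicitly, along with whether the figures were obtained before or after any consensus discussion.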
Standard 1.10

When validity evidence includes statistical analyses of test results, either alone or together with data on other variables, the conditions under which the data were collected should be described in enough detail that users can judge the relevance of the statistical findings to local conditions. Attention should be drawn to any features of a validation data collection that are likely to differ from typical operational testing conditions and that could plausibly influence test performance.

Comment: Such conditions might include (but would not be limited to) the following: test-taker motivation or prior preparation, the range of test scores over test takers, the time allowed for test takers to respond or other administrative conditions, the mode of test administration (e.g., unproctored online testing versus proctored on-site testing), examiner training or other examiner characteristics, the time intervals separating collection of data on different measures, or conditions that may have changed since the validity evidence was obtained.

Cluster 3. Specific Forms of Validity Evidence

(a) Content-Oriented Evidence

Standard 1.11

When the rationale for test score interpretation for a given use rests in part on the appropriateness of test content, the procedures followed in specifying and generating test content should be described and justified with reference to the intended population to be tested and the construct the test is intended to measure or the domain it is intended to represent. If the definition of the content sampled incorporates criteria such as importance, frequency, or criticality, these criteria should also be clearly explained and justified.

Comment: For example, test developers might provide a logical structure that maps the items on the test to the content domain, illustrating the relevance of each item and the adequacy with which the set of items represents the content domain. Areas of the content domain that are not included among the test items could be indicated as well. The match of test content to the targeted domain in terms of cognitive complexity and the accessibility of the test content to all members of the intended population are also important considerations.

(b) Evidence Regarding Cognitive Processes

Standard 1.12

If the rationale for score interpretation for a given use depends on premises about the psychological processes or cognitive operations of test takers, then theoretical or empirical evidence in support of those premises should be provided. When statements about the processes employed by observers or scorers are part of the argument for validity, similar information should be provided.

Comment: If the test specification delineates the processes to be assessed, then evidence is needed that the test items do, in fact, tap the intended processes.

(c) Evidence Regarding Internal Structure

Standard 1.13

If the rationale for a test score interpretation for a given use depends on premises about the relationships among test items or among parts of the test, evidence concerning the internal structure of the test should be provided.

Comment: It might be claimed, for example, that a test is essentially unidimensional. Such a claim could be supported by a multivariate statistical analysis, such as a factor analysis, showing that the score variability attributable to one major dimension was much greater than the score variability attributable to any other identified dimension, or showing that a single factor adequately accounts for the covariation among test items. When a test provides more than one score, the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed.
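As a rough illustration of the analysis sketched in the comment above, the following hypothetical example (not part of the Standards) simulates a five-item test driven by one latent trait and examines whether the first eigenvalue of the item correlation matrix dominates; an operational validation would use a proper factor-analytic model rather than this shortcut.

    # Illustrative sketch: checking whether one dimension dominates the
    # covariation among items. The response matrix is hypothetical, and a
    # simple eigendecomposition of the item correlation matrix stands in
    # for a full factor-analytic model.
    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.normal(size=500)                     # one latent trait
    loadings = np.array([0.8, 0.7, 0.6, 0.75, 0.65])
    noise = rng.normal(size=(500, 5))
    items = np.outer(theta, loadings) + noise        # 500 examinees x 5 items

    corr = np.corrcoef(items, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]     # largest first

    # A large ratio of the first eigenvalue to the second is one rough
    # indicator that a single factor accounts for most shared variance.
    print(eigenvalues)
    print("first/second ratio:", eigenvalues[0] / eigenvalues[1])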
Standard 1.14

When interpretation of subscores, score differences, or profiles is suggested, the rationale and relevant evidence in support of such interpretation should be provided. Where composite scores are developed, the basis and rationale for arriving at the composites should be given.

Comment: When a test provides more than one score, the distinctiveness and reliability of the separate scores should be demonstrated, and the interrelationships of those scores should be shown to be consistent with the construct(s) being assessed. Moreover, evidence for the validity of interpretations of two or more separate scores would not necessarily justify a statistical or substantive interpretation of the difference between them. Rather, the rationale and supporting evidence must pertain directly to the specific score, score combination, or score pattern to be interpreted for a given use. When subscores from one test or scores from different tests are combined into a composite, the basis for combining scores and for how scores are combined (e.g., differential weighting versus simple summation) should be specified.

Standard 1.15

When interpretation of performance on specific items, or small subsets of items, is suggested, the rationale and relevant evidence in support of such interpretation should be provided. When interpretation of individual item responses is likely but is not recommended by the developer, the user should be warned against making such interpretations.

Comment: Users should be given sufficient guidance to enable them to judge the degree of confidence warranted for any interpretation for a use recommended by the test developer. Test manuals and score reports should discourage overinterpretation of information that may be subject to considerable error. This is especially important if interpretation of performance on isolated items, small subsets of items, or subtest scores is suggested.
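Standard 1.14's caution about difference scores can be made concrete with a classical result: assuming equal score variances, the reliability of a difference between two subscores is r_diff = (r_xx + r_yy - 2*r_xy) / (2 - 2*r_xy). The sketch below uses hypothetical values and is illustrative only.

    # Illustrative sketch: two individually reliable but highly correlated
    # subscores yield an unreliable difference score. Values hypothetical.
    r_xx, r_yy = 0.85, 0.80   # assumed subscore reliabilities
    r_xy = 0.75               # assumed correlation between the subscores

    r_diff = (r_xx + r_yy - 2 * r_xy) / (2 - 2 * r_xy)
    print(f"Reliability of the difference score: {r_diff:.2f}")  # 0.30

This is one reason that validity evidence for two separate scores does not, by itself, justify interpreting their difference.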
(d) Evidence Regarding Relationships With Conceptually Related Constructs

Standard 1.16

When validity evidence includes empirical analyses of responses to test items together with data on other variables, the rationale for selecting the additional variables should be provided. Where appropriate and feasible, evidence concerning the constructs represented by other variables, as well as their technical properties, should be presented or cited. Attention should be drawn to any likely sources of dependence (or lack of independence) among variables other than dependencies among the construct(s) they represent.

Comment: The patterns of association between and among scores on the test under study and other variables should be consistent with theoretical expectations. The additional variables might be demographic characteristics, indicators of treatment conditions, or scores on other measures. They might include intended measures of the same construct or of different constructs. The reliability of scores from such other measures and the validity of intended interpretations of scores from these measures are an important part of the validity evidence for the test under study. If such variables include composite scores, the manner in which the composites were constructed should be explained (e.g., transformation or standardization of the variables, and weighting of the variables). In addition to considering the properties of each variable in isolation, it is important to guard against faulty interpretations arising from spurious sources of dependency among measures, including correlated errors or shared variance due to common methods of measurement or common elements.
(e) Evidence Regarding Relationships With Criteria

Standard 1.17

When validation relies on evidence that test scores are related to one or more criterion variables, information about the suitability and technical quality of the criteria should be reported.

Comment: The description of each criterion variable should include evidence concerning its reliability, the extent to which it represents the intended construct (e.g., task performance on the job), and the extent to which it is likely to be influenced by extraneous sources of variance. Special attention should be given to sources that previous research suggests may introduce extraneous variance that might bias the criterion for or against identifiable groups.

Standard 1.18

When it is asserted that a certain level of test performance predicts adequate or inadequate criterion performance, information about the levels of criterion performance associated with given levels of test scores should be provided.

Comment: For purposes of linking specific test scores with specific levels of criterion performance, regression equations are more useful than correlation coefficients, which are generally insufficient to fully describe patterns of association between tests and other variables. Means, standard deviations, and other statistical summaries are needed, as well as information about the distribution of criterion performances conditional upon a given test score. In the case of categorical rather than continuous variables, techniques appropriate to such data should be used (e.g., the use of logistic regression in the case of a dichotomous criterion). Evidence about the overall association between variables should be supplemented by information about the form of that association and about the variability of that association in different ranges of test scores. Note that data collections employing test takers selected for their extreme scores on one or more measures (extreme groups) typically cannot provide adequate information about the association.
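As one concrete reading of the comment above, the sketch below fits a logistic regression of a dichotomous criterion on test scores and reports the estimated probability of adequate criterion performance at selected score levels. The data and score scale are hypothetical, and scikit-learn is used purely for convenience.

    # Illustrative sketch: linking test scores to a dichotomous criterion
    # (e.g., pass/fail on a performance benchmark). Data are hypothetical.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    test_scores = rng.normal(50, 10, size=400)
    # Assumed (hypothetical) relationship: higher scores, higher pass rates.
    pass_prob = 1 / (1 + np.exp(-(test_scores - 50) / 5))
    passed = rng.random(400) < pass_prob

    model = LogisticRegression().fit(test_scores.reshape(-1, 1), passed)

    # Estimated probability of adequate criterion performance at selected
    # score levels: the kind of information Standard 1.18 asks for.
    for x in (35, 45, 55, 65):
        p = model.predict_proba([[x]])[0, 1]
        print(f"score {x}: estimated P(pass) = {p:.2f}")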
Standard 1.19

If test scores are used in conjunction with other variables to predict some outcome or criterion, analyses based on statistical models of the predictor-criterion relationship should include those additional relevant variables along with the test scores.

Comment: In general, if several predictors of some criterion are available, the optimum combination of predictors cannot be determined solely from separate, pairwise examinations of the criterion variable with each separate predictor in turn, due to intercorrelation among predictors. It is often informative to estimate the increment in predictive accuracy that may be expected when each variable, including the test score, is introduced in addition to all other available variables. As empirically derived weights for combining predictors can capitalize on chance factors in a given sample, analyses involving multiple predictors should be verified by cross-validation or equivalent analysis whenever feasible, and the precision of estimated regression coefficients or other indices should be reported. Cross-validation procedures include formula estimates of validity in subsequent samples and empirical approaches such as deriving weights in one portion of a sample and applying them to an independent subsample.
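The comment above combines two ideas, incremental validity and cross-validation, that the following hypothetical sketch illustrates together: regression weights are derived on one half of a simulated sample, and the gain in criterion variance explained by adding the test score is then evaluated on the independent half.

    # Illustrative sketch: cross-validated incremental validity of a test
    # score over a correlated prior predictor. All data are hypothetical.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 600
    prior_gpa = rng.normal(size=n)
    test_score = 0.5 * prior_gpa + rng.normal(size=n)  # correlated predictors
    criterion = 0.6 * prior_gpa + 0.4 * test_score + rng.normal(size=n)

    def fit_and_r2(x_train, y_train, x_test, y_test):
        # Ordinary least squares with an intercept column.
        design = np.column_stack([np.ones(len(x_train)), x_train])
        beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)
        pred = np.column_stack([np.ones(len(x_test)), x_test]) @ beta
        ss_res = np.sum((y_test - pred) ** 2)
        ss_tot = np.sum((y_test - y_test.mean()) ** 2)
        return 1 - ss_res / ss_tot

    half = n // 2  # derive weights on one half, evaluate on the other
    base = fit_and_r2(prior_gpa[:half, None], criterion[:half],
                      prior_gpa[half:, None], criterion[half:])
    both_predictors = np.column_stack([prior_gpa, test_score])
    full = fit_and_r2(both_predictors[:half], criterion[:half],
                      both_predictors[half:], criterion[half:])
    print(f"cross-validated R2 without test: {base:.2f}, with test: {full:.2f}")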
Standard 1.20

When effect size measures (e.g., correlations between test scores and criterion measures, standardized mean test score differences between subgroups) are used to draw inferences that go beyond describing the sample or samples on which data have been collected, indices of the degree of uncertainty associated with these measures (e.g., standard errors, confidence intervals, or significance tests) should be reported.

Comment: Effect size measures are usefully paired with indices reflecting their sampling error to make meaningful evaluation possible. There are various possible measures of effect size, each applicable to different settings. In the presentation of indices of uncertainty, standard errors or confidence intervals provide more information and thus are preferred in place of, or as supplements to, significance testing.
Standard 1.21

When statistical adjustments, such as those for restriction of range or attenuation, are made, both adjusted and unadjusted coefficients, as well as the specific procedure used, and all statistics used in the adjustment, should be reported. Estimates of the construct-criterion relationship that remove the effects of measurement error on the test should be clearly reported as adjusted estimates.

Comment: The correlation between two variables, such as test scores and criterion measures, depends on the range of values on each variable. For example, the test scores and the criterion values of a selected subset of test takers (e.g., job applicants who have been selected for hire) will typically have a smaller range than the scores of all test takers (e.g., the entire applicant pool). Statistical methods are available for adjusting the correlation to reflect the population of interest rather than the sample available. Such adjustments are often appropriate, as when results are compared across various situations. The correlation between two variables is also affected by measurement error, and methods are available for adjusting the correlation to estimate the strength of the correlation net of the effects of measurement error in either or both variables. Reporting of an adjusted correlation should be accompanied by a statement of the method and the statistics used in making the adjustment.

Standard 1.22

When a meta-analysis is used as evidence of the strength of a test-criterion relationship, the test and the criterion variables in the local situation should be comparable with those in the studies summarized. If relevant research includes credible evidence that any other specific features of the testing application may influence the strength of the test-criterion relationship, the correspondence between those features in the local situation and in the meta-analysis should be reported. Any significant disparities that might limit the applicability of the meta-analytic findings to the local situation should be noted explicitly.

Comment: The meta-analysis should incorporate all available studies meeting explicitly stated inclusion criteria. Meta-analytic evidence used in test validation typically is based on a number of tests measuring the same or very similar constructs and criterion measures that likewise measure the same or similar constructs. A meta-analytic study may also be limited to multiple studies of a single test and a single criterion. For each study included in the analysis, the test-criterion relationship is expressed in some common metric, often as an effect size. The strength of the test-criterion relationship may be moderated by features of the situation in which the test and criterion measures were obtained (e.g., types of jobs, characteristics of test takers, time interval separating collection of test and criterion measures, year or decade in which the data were collected). If test-criterion relationships vary according to such moderator variables, then the meta-analysis should report separate estimated effect-size distributions conditional upon levels of these moderator variables when the number of studies available for analysis permits doing so. This might be accomplished, for example, by reporting separate distributions for subsets of studies or by estimating the magnitudes of the influences of situational features on effect sizes. This standard addresses the responsibilities of the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use. In some instances, that individual may also be the one conducting the meta-analysis; in other instances, existing meta-analyses are relied on. In the latter instance, the individual drawing on meta-analytic evidence does not have control over how the meta-analysis was conducted or reported, and must evaluate the soundness of the meta-analysis for the setting in question.
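The "common metric" idea in the comment above can be illustrated with a deliberately simplified pooling computation. The study correlations and sample sizes below are hypothetical, and a fixed-effect, inverse-variance average of Fisher-z values stands in for the fuller random-effects and artifact-correction models that actual meta-analyses employ.

    # Illustrative sketch: pooling several hypothetical test-criterion
    # correlations into one estimate on a common (Fisher-z) metric.
    import numpy as np

    rs = np.array([0.30, 0.25, 0.40, 0.35])   # study correlations
    ns = np.array([150, 90, 210, 120])        # study sample sizes

    zs = np.arctanh(rs)
    weights = ns - 3                          # inverse of var(z) = 1/(n - 3)
    z_bar = np.sum(weights * zs) / np.sum(weights)
    print(f"pooled r = {np.tanh(z_bar):.2f}")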
Standard 1.23

Any meta-analytic evidence used to support an intended test score interpretation for a given use should be clearly described, including methodological choices in identifying and coding studies, correcting for artifacts, and examining potential moderator variables. Assumptions made in correcting for artifacts such as criterion unreliability and range restriction should be presented, and the consequences of these assumptions made clear.

Comment: The description should include documented information about each study used as input to the meta-analysis, thus permitting evaluation by an independent party. Note also that meta-analysis inevitably involves judgments regarding a number of methodological choices. The bases for these judgments should be articulated. In the case of choices involving some degree of uncertainty, such as artifact corrections based on assumed values, the uncertainty should be acknowledged and the degree to which conclusions about validity hinge on these assumptions should be examined and reported. As in the case of Standard 1.22, the individual who is drawing on meta-analytic evidence to support a test score interpretation for a given use may or may not also be the one conducting the meta-analysis. As Standard 1.22 addresses the reporting of meta-analytic evidence, the individual drawing on existing meta-analytic evidence must evaluate the soundness of the meta-analysis for the setting in question.
Standard 1.24

If a test is recommended for use in assigning persons to alternative treatments, and if outcomes from those treatments can reasonably be compared on a common criterion, then, whenever feasible, supporting evidence of differential outcomes should be provided.

Comment: If a test is used for classification into alternative occupational, therapeutic, or educational programs, it is not sufficient just to show that the test predicts treatment outcomes. Support for the validity of the classification procedure is provided by showing that the test is useful in determining which persons are likely to profit differentially from one treatment or another. Treatment categories may have to be combined to assemble sufficient cases for statistical analysis. It is recognized, however, that such research may not be feasible, because ethical and legal constraints on differential assignments may forbid control groups.

(f) Evidence Based on Consequences of Tests

Standard 1.25

When unintended consequences result from test use, an attempt should be made to investigate whether such consequences arise from the test's sensitivity to characteristics other than those it is intended to assess or from the test's failure to fully represent the intended construct.

Comment: The validity of test score interpretations may be limited by construct-irrelevant components or construct underrepresentation. When unintended consequences appear to stem, at least in part, from the use of one or more tests, it is especially important to check that these consequences do not arise from construct-irrelevant components or construct underrepresentation. For example, although group differences, in and of themselves, do not call into question the validity of a proposed interpretation, they may increase the salience of plausible rival hypotheses that should be evaluated as part of the validation effort. A finding of unintended consequences may also lead to reconsideration of the appropriateness of the construct in question. Ensuring that unintended consequences are evaluated is the responsibility of those making the decision whether to use a particular test, although legal constraints may limit the test user's discretion to discard the results of a previously administered test, when that decision is based on differences in scores for subgroups of different races, ethnicities, or genders. These issues are discussed further in chapter 3.