Measurement and Assessment in Education
Cecil R. Reynolds
Texas A&M University
Ronald B. Livingston
University of Texas at Tyler
Victor Willson
Texas A&M University
PEARSON
This book was set in Times Roman by Omegatype Typography, Inc. It was printed and bound by
R. R. Donnelley/Harrisonburg. The cover was printed by Phoenix Color Corporation/Hagerstown.
Copyright © 2009, 2006 by Pearson Education, Inc., Upper Saddle River, New Jersey 07458.
Pearson. All rights reserved. Printed in the United States of America. This publication is protected
by Copyright and permission should be obtained from the publisher prior to any prohibited
reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic,
mechanical, photocopying, recording, or likewise. For information regarding permission(s), write
to: Rights and Permissions Department, 501 Boylston Street, Suite 900, Boston, MA 02116, or fax
your request to 617-671-2290.
Merrill is an imprint of Pearson.
www.pearsonhighered.com
ISBN-13: 978-0-205-57934-1
ISBN-10: 0-205-57934-5
Cecil: To Julia for the many sacrifices she makes for me and my work.
Ron: To my buddy Kyle—you bring great joy to my life!
Vic: To 34 years’ grad students in measurement and statistics.
The Development and Use of Constructed-Response Items 222
Standardized Achievement Tests in the Era of High-Stakes Assessment 299
Assessment Accommodations 395
16 The Problem of Bias in Educational Assessment 421
References 503
Index 511
CONTENTS
Preface xix
Common Applications of Educational Assessments 19
Student Evaluations 19
Instructional Decisions 20
Selection, Placement, and Classification Decisions 20
Policy Decisions 21
Counseling and Guidance Decisions 21
Summary 29
Scales of Measurement 34
What Is Measurement? 34
Nominal Scales 35
Ordinal Scales 35
Interval Scales 36
Ratio Scales 36
Correlation Coefficients 51
Scatterplots 52
Correlation and Prediction 54
Summary 56
Summary 86
Summary 119
Summary 191
Summary 219
Portfolios 268
Guidelines for Developing Portfolio Assessments 269
Strengths of Portfolio Assessments 271
Weaknesses of Portfolio Assessments 272
Summary 273
Summary 297
Summary 327
Summary 392
The Controversy over Bias in Testing: Its Origin, What It Is, and What
It Is Not 425
Cultural Bias and the Nature of Psychological Testing 431
Summary 447
References 503
Index 511
PREFACE
When we meet someone for the first time, we engage inescapably in some form of evalu-
ation. Funny, personable, intelligent, witty, arrogant, and rude are just some of the descrip-
tors we might apply to people we meet. This happens in classrooms as well. As university
professors, just as other classroom teachers do, we meet new students each year and form
impressions about them from our interactions. These impressions are forms of evaluation or
assessment of characteristics we observe or determine from our interactions with these new
students. We all do this, and we do it informally, and at times we realize, once we have had
more experience with someone, that our early evaluations were in error. There are times,
however, when our evaluations must be far more formal and hopefully more precise. This is a
book about those times and how to make our appraisals more accurate and meaningful.
We must, for example, assign grades and determine a student’s suitability for ad-
vancement. Psychologists need to determine accurately proper diagnoses of various forms
of psychopathology such as mental retardation, learning disabilities, schizophrenia, de-
pression, anxiety disorders, and the like. These types of evaluations are best accomplished
through more rigorous means than casual interaction and more often than not are accom-
plished best via the use of some formal measurement procedures. Just as a carpenter can
estimate the length of a board needed for some construction project, we can estimate student
characteristics—but neither is satisfactory when it is time for the final construction or deci-
sion. We both must measure.
Educational and psychological tests are the measuring devices we use to address such
questions as the degree of mastery of a subject matter area, the achievement of educational
objectives, the degree of anxiety a student displays over taking a test, or even the ability
of a student to pay attention in a classroom environment. Some tests are more formal than
others, and the degree of formality of our measuring techniques varies on a continuum from
the typical teacher-made test on a specific assignment to commercially prepared, carefully
standardized procedures with large, nationally representative reference samples for standard
setting.
The purpose of this book is to educate the reader about the different ways in which we can measure constructs of interest to us in schools and the ways to ensure that we do the best job possible in designing our own classroom assessments. We also provide detailed information on a variety of assessments used by other professionals in schools, such as school psychologists, so the reader can interact with these other professionals more intelligently and use the results of the many assessments that occur in schools to do a better job with the students.
Not only is the classroom assessment process covered in detail, but the use of various standardized tests also is covered. The regular or general education classroom is emphasized, but special applications of the evaluation and measurement processes to students with disabilities are also noted and explained. Whenever possible, we have tried to illustrate the principles taught through application to everyday problems in the schools. Through an integrated approach to presentation and explanation of principles of tests and measurement
with an emphasis on applications to classroom issues, we hope we will have prepared the
reader for the changing face of assessment and evaluation in the schools. The fundamental
principles taught may change little, but actual practice in the schools is sure to change.
This book is targeted primarily at individuals who are in teacher preparation programs
or preparing for related educational positions such as school administrators. Others who
may pursue work in educational settings will also find the content informative and at all
times, we hope, practical. In preparing this text, we repeatedly asked ourselves two ques-
tions. First, what do teachers really need to know to perform their jobs? We recognize that
most teachers do not aspire to become assessment experts, so we have tried to focus on the
essential knowledge and skills and avoid esoteric information. Second, what does the em-
pirical research tell us about educational assessment and measurement? At times it might be
easier to go with educational fads and popular trends and disregard what years of research
have shown us. While this may be enticing, it is not acceptable! We owe you, our readers,
the most accurate information available that is based on the current state of scientific knowl-
edge. We also owe this to the many students you will be evaluating during your careers.
The authors have developed two indispensable supplements to augment the textbook.
Particularly useful for student review and mastery of the material presented are the audio-
enhanced PowerPoint™ lectures featuring Dr. Victor Willson. A Test Bank is also available
to instructors.
We appreciate the opportunity to prepare a second edition of this text! While this edition
maintains the organization of the first edition, there have been a number of substantive
changes. A primary focus of this revision was updating and increasing the coverage of fed-
eral laws and how they have an impact on educational assessment. In doing this we tried to
emphasize how these laws affect teachers in our classrooms on a daily basis. Our guiding
principle was to follow our instructors’ and readers’ lead—retaining what they liked and
adding what they requested.
Acknowledgments
We would like to express our appreciation to our editor, Arnis Burvikovs, for his leadership in
helping us bring this edition to closure. His work in obtaining numerous high-quality reviews
and then guiding us on how best to implement the suggestions from them was of tremendous
benefit to us in our writing assignments. Our thanks to the reviewers: Nick Elksnin, The
Citadel; Kathy Peca, Eastern New Mexico University; and Dan Smith, University at Buffalo,
Canisius College.
To our respective families, we owe a continual debt of gratitude for the warm recep-
tion they give our work, for their encouragement, and for their allowances for our work
schedules. Hopefully, this volume will be of use to those who toil in the classrooms of our
nation and will assist them in conducting better evaluations of students, enabling even better
teaching to occur. This will be our thanks.
Measurement and Assessment
in Education
CHAPTER 1
Introduction to Educational Assessment
CHAPTER HIGHLIGHTS
LEARNING OBJECTIVES
Students in teacher preparation programs want to teach, but our combined experience of more than 60 years in colleges of education suggests that they are generally not very interested in testing and assessment. Yes, they know that teachers do test, but testing is not what led them to select a career in teaching. Teachers love children and love teaching, but they often have a negative or at best neutral view of testing. This predisposition is not limited to education students. Undergraduate psychology students are typically drawn to psychology because they want to work with and help people. Most aspire to be counselors or therapists, and relatively few want to specialize in assessment. When we teach undergraduate education or psychology test and measurement courses, we recognize that it is important to spend some time explaining to our students why they need to learn about testing and assessment. This is one of the major goals of this chapter. We want to explain to you why you need to learn about testing and assessment, and hopefully convince you that this is a worthwhile endeavor.
Assessment is an integral component of the teaching process. Assessment can and should provide information that both enhances instruction and promotes learning.
Teaching is often conceptualized as a straightforward process whereby teachers
provide instruction and students learn. With this perspective, teaching is seen as a simple
instruction—learning process. In actual practice, it is more realistic to view assessment as an
integral component of the teaching process (e.g., Stiggins & Conklin, 1992). Assessment can and should provide relevant information that both enhances instruc-
tion and promotes learning. In other words, there should be a close reciprocal relationship
between instruction, learning, and assessment. With this expanded conceptualization of teach-
ing, instruction and assessment are integrally related, with assessment providing objective
feedback about what the students have learned, how well they have learned it, how effective
the instruction has been, and what information, concepts, and objectives require more atten-
tion. Instead of teaching being limited to an instruction—learning process, it is conceptualized
more accurately as an instruction—learning—assessment process. In this model, the goal of as-
sessment, like that of instruction, is to facilitate student achievement (e.g., Gronlund, 1998).
In the real world of education, it is difficult to imagine effective teaching that does not involve
some form of assessment. The better job teachers do in assessing student learning, the better
their teaching will be.
The following quote from Stiggins and Conklin (1992) illustrates the important role
teachers play in the overall process of educational assessment.
In summary, if you want to be an effective teacher, you need to be knowledgeable about testing
and assessment. Instruction and assessment are both instrumental parts of the teaching process,
and assessment is a major component of a teacher’s day-to-day job. We hope that by the time
you finish this chapter you will have a better understanding of the role of assessment in education
and recognize that although you may never want to specialize in testing and assessment, you will
appreciate the important role assessment plays in the overall educational process.
In our brief introduction we have already used a number of relatively common but somewhat
technical terms. Before we go any further it would be beneficial to define them for you.
Now that we have defined these common terms, with some reluctance we acknowledge that in actual practice many educational professionals use testing, measurement, and assessment interchangeably. Recognizing this, Popham (2000) noted that in contemporary educational circles, assessment has become the preferred term. Measurement sounds rather rigid and sterile when applied to students and tends to be avoided. Testing has its own negative connotations. For example, hardly a week goes by when newspapers don't contain articles about "teaching to the test" or "high-stakes testing," typically with negative connotations. Additionally, when people hear the word test they usually think of paper-and-pencil tests. In recent years, as a result of growing dissatisfaction with traditional paper-and-pencil tests, alternative testing procedures have been developed (e.g., performance assessments and portfolios). As a result, testing is not seen as particularly descriptive of modern educational practices. That leaves us with assessment as the current buzzword among educators.
Before proceeding, we should define some other terms. A psychometrician is a professional who has specialized in the area of testing, measurement, and assessment; psychometrics is the science of psychological measurement. You will likely hear people refer to the psychometric properties of a test, and by this they mean the measurement or statistical characteristics of a test. These measurement characteristics include reliability and validity. Reliability refers to the stability or consistency of test scores. On a more theoretical level, reliability refers to the degree to which test scores are free from measurement error (AERA et al., 1999). Scores that are relatively free from measurement error will be stable or consistent (i.e., reliable). Validity refers to the accuracy of the interpretations of test scores. If test scores are interpreted as reflecting intelligence, do they actually reflect intellectual ability? If test scores are used to predict success on a job, can they accurately predict who will be successful on the job?
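Reliability is often formalized through classical test theory, which the authors treat in detail in Chapter 4. As a brief preview, and using standard notation that is assumed here rather than drawn from this chapter, each observed score X is modeled as a true score T plus random measurement error E, and reliability is the proportion of observed-score variance attributable to true scores:

    X = T + E
    \sigma_X^2 = \sigma_T^2 + \sigma_E^2
    \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}

Under this model, reliability reaches 1 only when the error variance is zero, which is the formal sense in which scores that are relatively free from measurement error are stable or consistent.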
Types of Tests
We defined a test as a device or procedure in which a sample of an individual's behavior
is obtained, evaluated, and scored using standardized procedures (AERA et al., 1999). You
have probably taken a large number of tests in your life, and it is likely that you have noticed
that all tests are not alike. For example, people take tests in schools that help determine their
grades, tests to obtain drivers’ licenses, interest inventories to help make educational and
vocational decisions, admissions tests when applying for college, exams to obtain profes-
sional certificates and licenses, and personality tests to gain personal understanding. This
brief list is clearly not exhaustive!
Cronbach (1990) notes that tests can generally be classified as measures of either maximum performance or typical response. Maximum performance tests are also often referred to as ability tests, but achievement tests are included here as well. On maximum performance tests items may be scored as either "correct" or "incorrect," and examinees are encouraged to demonstrate their very best performances. Maximum performance tests are designed to assess the upper limits of the examinee's knowledge and abilities. For example, maximum performance tests can be designed to assess how well a student performs selected tasks or has mastered a specified content domain. Intelligence tests and classroom achievement tests are common examples of maximum performance tests. In contrast, typical response tests attempt to measure the typical behavior and characteristics of examinees. Often, typical response tests are referred to as personality tests, and in this context personality is used broadly to reflect a whole host of noncognitive characteristics such as attitudes, behaviors, emotions, and interests (Anastasi & Urbina, 1997). Some individuals reserve the term test for maximum performance measures, while using terms such as scale and inventory when referring to typical performance instruments (AERA et al., 1999). In this textbook we will use the term test in its broader sense, applying to both maximum performance and typical response procedures.
Achievement and Aptitude. Maximum performance tests are often classified as either achievement tests or aptitude tests. Achievement tests measure knowledge and skills in an area in which the examinee has received instruction, whereas aptitude tests measure cognitive abilities and skills that are accumulated as the result of overall life experiences (AERA et al., 1999).
Speed and Power Tests. Maximum performance tests often are categorized as either speed or power tests. On speed tests, performance reflects differences in the speed of performance. A speed test generally contains items that are relatively easy and has a strict time limit that prevents any examinees from successfully completing all the items. On power tests, performance reflects the difficulty of the items the examinee is able to answer correctly. On a pure power test, the examinee is given plenty of time to attempt all the items, but the items are ordered according to difficulty, and the test contains
some items that are so difficult that no examinee is expected to answer them all. As a result,
performance on a power test primarily reflects the difficulty of the items the examinee is
able to answer correctly. Well-developed speed and power tests are designed so no one will
obtain a perfect score. They are designed this way because perfect scores are “indetermi-
nate.” That is, if someone obtains a perfect score on a test, the test failed to assess the very
upper limits of that person’s ability. To access adequately the upper limits of ability, tests
need to have what test experts refer to as an “adequate ceiling”; that is, the tests are difficult
enough that no examinee will be able to obtain a perfect score. As you might expect, this
distinction between speed and power tests is also one of degree rather than being absolute.
Most often a test is not a pure speed test or a pure power test, but incorporates some com-
bination of the two approaches. For example, the Scholastic Assessment Test (SAT) and
Graduate Record Examination (GRE) are considered power tests, but both have time limits.
When time limits are set such that 95% or more of examinees will have the opportunity to
respond to all items, the test is still considered to be a power test and not a speed test.
Objective and Subjective Maximum Performance Tests. Objectivity typically implies im-
partiality or the absence of personal bias. Cronbach (1990) notes that the less test scores are
influenced by the subjective judgment of the person grading or scoring the test, the more
objective the test is. In other words, objectivity refers to the extent that trained examiners
who score a test will be in agreement and score responses in the same way. Tests with
selected-response items (e.g., multiple choice, true—false, and matching) that can be scored
using a fixed key and that minimize subjectivity in scoring are often referred to as “objective”
tests. In contrast, subjective tests are those that rely on the personal judgment of the individual
grading the test. For example, essay tests are considered subjective because test graders rely to
some extent on their own subjective judgment when scoring the essays. Most students are well
aware that different teachers might assign different grades to the same essay item. It is com-
mon, and desirable, for those developing subjective tests to provide explicit scoring rubrics in
an effort to reduce the impact of the subjective judgment of the person scoring the test.
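To make the idea of a fixed scoring key concrete, the following minimal Python sketch (the items, key, and function name are hypothetical, invented for illustration rather than taken from the text) shows why scoring selected-response items is objective: the key fully determines every item's score, so any scorer, human or machine, arrives at the same total.

    # Illustrative only: a fixed answer key for four selected-response items.
    ANSWER_KEY = {1: "B", 2: "D", 3: "A", 4: "C"}

    def score_with_key(responses):
        """Count the responses that match the fixed key.

        No human judgment enters the scoring, so trained examiners
        (or a computer) will always agree on the resulting score.
        """
        return sum(1 for item, answer in responses.items()
                   if ANSWER_KEY.get(item) == answer)

    student_responses = {1: "B", 2: "D", 3: "C", 4: "C"}
    print(score_with_key(student_responses))  # prints 3

An essay item has no such key; two graders applying their own judgment can reach different scores, which is exactly why explicit rubrics are recommended to narrow that subjectivity.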
In the typical response domain, tests that use selected-response items (e.g., true-false items) are considered objective. The test takers simply respond true if the statement describes them and false if it does not. By using a scoring key, there should be no disagreement among scorers regarding how to score the items. In contrast, projective personality tests involve the presentation of ambiguous material that elicits an almost infinite range of responses, and most projective tests involve subjectivity in scoring. For example, the clinician may show the examinee an inkblot and ask: "What
might this be?” Instructions to the examinee are minimal, there are essentially no restric-
tions on the examinee’s response, and there is considerable subjectivity when scoring the
response. Elaborating on the distinction between objective and projective tests, Reynolds
(1998b) noted:
It is primarily the agreement on scoring that differentiates objective from subjective tests.
If trained examiners agree on how a particular answer is scored, tests are considered objec-
tive; if not, they are considered subjective. Projective is not synonymous with subjective in
this context but most projective tests are closer to the subjective than objective end of the
continuum of agreement on scoring. (p. 49)
Tests can also be classified by whether they are administered individually or to groups, a distinction that reflects the method of administration of the test rather than the type of the test. For example, individual aptitude tests and group
aptitude tests are both aptitude tests; they simply differ in how they are administered. This is
true in the personality domain as well wherein some tests require one-on-one administration
but others can be given to groups.
Norm-referenced score interpretations compare an examinee's performance to the performance of other people. For example, if you say that a student scored better than 95% of his or her peers, this is a norm-referenced interpretation. The standardization sample serves as the reference group against which performance is judged. Criterion-referenced score interpretations compare an examinee's performance to a specified level of performance. With criterion-referenced interpretations, the emphasis is on what the examinees know or what they can do, not their standing relative to other people. One of the most common examples of criterion-referenced scoring is the percentage of correct responses on a classroom examination. For example, if you report that a student correctly answered 95% of the items on a classroom test, this is a criterion-referenced interpretation.
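The two interpretive frames can be illustrated with a small computational sketch; the scores and reference group below are hypothetical values invented for illustration. The criterion-referenced figure depends only on the examinee's own performance against the item domain, while the norm-referenced figure depends on how other people performed.

    # Hypothetical example: a 40-item classroom test.
    items_correct, items_total = 38, 40
    student_raw = items_correct

    # Raw scores for a hypothetical reference (standardization) group of 20 peers.
    reference_group = [21, 24, 26, 28, 29, 30, 31, 32, 33, 33,
                       34, 35, 35, 36, 36, 36, 37, 37, 37, 38]

    # Criterion-referenced interpretation: percentage of the content domain answered correctly.
    percent_correct = 100 * items_correct / items_total

    # Norm-referenced interpretation: standing relative to other people, expressed
    # here as the percentage of the reference group scoring below this student.
    percentile_rank = 100 * sum(score < student_raw for score in reference_group) / len(reference_group)

    print(f"Criterion-referenced: {percent_correct:.0f}% of items answered correctly")
    print(f"Norm-referenced: scored higher than {percentile_rank:.0f}% of the reference group")

With these made-up numbers the same raw score supports both statements used in the text: the student answered 95% of the items correctly (criterion-referenced) and scored higher than 95% of the reference group (norm-referenced), yet the two claims rest on entirely different comparisons.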
Now that we have introduced you to some of the basic concepts of educational assessment,
this is an opportune time to discuss some basic assumptions that underlie educational as-
sessment. These assumptions were adopted in part from Cohen and Swerdlik (2002), who
note, appropriately, that these assumptions actually represent a simplification of some very
complex issues. As you progress through this text, you will develop a better understanding
of these complex and interrelated issues.
in areas in which they have received instruction (AERA et al., 1999). In schools we are often
interested in measuring a number of constructs, such as a student’s intelligence, achievement
in a specific content area, or attitude toward learning. This assumption simply acknowledges
that constructs such as intelligence, achievement, or attitudes exist.
information obtained from them is a key issue in ethical assessment practice (e.g., Cohen
& Swerdlik, 2002).
Well-made tests that are appropriately administered and interpreted according to guidelines are fair, minimize bias, and are among the most equitable methods of evaluating people. Nevertheless, tests can be used inappropriately, and when they are it discredits or stigmatizes assessment procedures in general. However, in such circumstances the culprit is the person using the test, not the test itself. At times, people criticize assessments because they do not like the results obtained. In many instances, this is akin to "killing the messenger."
tests, you will hear critics of educational and psychological testing call for a ban on, or at
least a significant reduction in, the use of tests. Although educational tests are not perfect (and
never will be), testing experts spend considerable time and effort studying the measurement
characteristics of tests. This process allows us to determine how accurate and reliable tests
are, can provide guidelines for their appropriate interpretation and use, and can result in the
development of more accurate assessment procedures (e.g., Friedenberg, 1995).
Assumption 9 in Table 1.3 suggests that tests can be used in a fair manner. Many people
criticize tests, claiming that they are biased, unfair, and discriminatory against certain groups
of people. Although it is probably accurate to say that no test is perfectly fair to all examinees,
neither is any other approach to selecting, classifying, or evaluating people. The majority of
professionally developed tests are carefully constructed and scrutinized to minimize bias, and
when used properly actually promote fairness and equality. In fact, it is probably safe to say
that well-made tests that are appropriately administered and interpreted are among the most
equitable methods of evaluating people. Nevertheless, the improper use of tests can result in
considerable harm to individual test takers, institutions, and society (AERA et al., 1999).
daunting figure does not include the vast number of tests developed by classroom teachers to
assess the achievement or progress of their students. There are minimal standards that all of
these tests should meet, whether they are developed by an assessment professional, a gradu-
ate student completing a thesis, or a teacher assessing the math skills of 3rd graders. To pro-
vide standards for the development and use of psychological and educational tests and other
assessment procedures, numerous professional organizations have developed guidelines. The most influential and comprehensive set of guidelines is the Standards for Educational and Psychological Testing, published by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (1999). We have referenced this document numerous times earlier in this chapter and will continue to do so throughout this text.
you are to enter the teaching profession. While the other participants in the assessment process
have professional and ethical responsibilities, test takers have a number of rights. The Joint
Committee on Testing Practices (JCTP, 1998) notes that the most fundamental right test takers
have is to be tested with tests that meet high professional standards and that are valid for the
intended purposes. Other rights of test takers include the following:
■ Test takers should be given information about the purposes of the testing, how the results will be used, who will receive the results, the availability of information regarding accommodations available for individuals with disabilities or language differences, and any costs associated with the testing.
■ Test takers have the right to be treated with courtesy, respect, and impartiality.
■ Test takers have the right to have tests administered and interpreted by adequately trained individuals who follow professional ethics codes.
■ Test takers have the right to receive information about their test results.
■ Test takers have the right to have their test results kept confidential.
Appendix D contains the Joint Committee on Testing Practices’ Rights and Respon-
sibilities of Test Takers: Guidelines and Expectations.
■ Increased state accountability. NCLB requires that each state develop rigorous aca-
demic standards and implement annual assessments to monitor the performance of districts
and schools. It requires that these assessments meet professional standards for reliability
and validity and requires that states achieve academic proficiency for all students within
12 years. To ensure that no group of children is neglected, the act requires that states and
districts assess all students in their programs, including those with disabilities and limited
English proficiency. However, the act does allow 3% of all students to be given alternative
assessments. Alternative assessments are defined as instruments specifically designed for
students with disabilities that preclude standard assessment.
■ More parental choice. The act allows parents with children in schools that do not
demonstrate adequate annual progress toward academic goals to move their children to
other, better performing schools.
■ Greater flexibility for states. A goal of NCLB is to give states increased flexibility in
the use of federal funds in exchange for increased accountability for academic results.
■ Reading First initiative. A goal of the NCLB Act is to ensure that every student can
read by the end of grade 3. To this end, the Reading First initiative significantly increased
federal funding of empirically based reading instruction programs in the early grades.
While NCLB received broad support when initiated, it has been the target of increasing
criticism in recent years. The act’s focus on increased accountability and statewide assess-
ment programs typically receives the greatest criticism. For example, it is common to hear
teachers and school administrators complain about “teaching to the test” when discussing
the impact of statewide assessment programs. Special Interest Topic 1.1 describes some
current views being voiced by proponents and opponents of NCLB.
While the No Child Left Behind Act (NCLB) passed with broad bipartisan and public support, more
and more criticism has been directed at it in recent years by lawmakers, professional groups, teach-
ers, and others. For example, the National Education Association (NEA), the nation’s largest teacher
union, has criticized the NCLB Act, maintaining that it forces teachers to devote too much time
preparing students for standardized tests at the expense of other, more desirable instructional activi-
ties. Many critics are also calling for more flexibility in the way states implement the NCLB Act’s
accountability requirements, particularly the requirement that students with disabilities be included
in state assessment programs. These critics say that decisions about how students with disabilities
are tested should be left to local professionals working directly with those students. It is safe to say
the honeymoon period is over for NCLB.
But the NCLB Act does have its supporters. For example, advocacy groups for individuals with disabilities maintain that the NCLB Act has focused much needed attention on the achievement of students with disabilities. As currently implemented, the NCLB Act requires that most students with disabilities be tested and their achievement monitored along with their peers without disabilities. These
advocates fear that if the law is changed, the high achievement standards for students with disabilities
will be lost. They note that approximately 30% of students with disabilities are currently exempted
from state assessment programs and they fear that if states are given greater control over who is tested,
even more students with disabilities will be excluded and largely ignored (Samuels, 2007).
bance) and provides funds to states and school districts that meet the requirements of the
law. IDEA provides guidelines for conducting evaluations of students suspected of having a
disability. Students who qualify under IDEA have an individualized educational program
(IEP) developed specifically for them that designates the special services and modifica-
tions to instruction and assessment that they must receive. Possibly most important for
regular education teachers is the mandate for students with disabilities to receive instruction in the "least restrictive environment," a movement referred to as mainstreaming. In application, this means that most students with disabilities receive educational services in the regular education classroom. As a result, more regular education teachers are involved in the education of students with disabilities and are required to implement the educational modifications specified in their students' IEPs, including modifications in both instructional strategies and assessment practices. More information on IDEA 2004 can be found online at http://idea.ed.gov.
Jacob and Hartshorne (2007) note that while Section 504 of the Rehabilitation Act was passed in
1973, the law was not widely applied in the school setting until the late 1980s. Since that time, how-
ever, it has had a substantial impact on public education. In looking to the future, the authors predict
that Section 504 will be used less frequently to obtain accommodations for students with learning
and behavioral problems. Their prediction is based on the following considerations:
■ In the past, Section 504 was often used to ensure educational accommodations for students
with attention deficit hyperactivity disorder (ADHD). In 1997, IDEA specifically identi-
fied ADHD as a health condition qualifying for special education services. As a result,
more students with ADHD will likely receive services through IDEA and fewer under
Section 504.
■ IDEA 2004 permits school districts to spend up to 15% of special education funds on early
intervention programs. These programs are intended to help students that need specialized
educational services but who have not been identified as having a disability specified in
IDEA. Again, this will likely decrease the need for Section 504 accommodations.
■ IDEA 2004 has new regulations for identifying children with specific learning disabilities
that no longer require the presence of a severe discrepancy between aptitude and achieve-
ment. As a result, students that in the past did not qualify for services under IDEA may now
qualify. Again, this will likely reduce the need to qualify students under Section 504.
■ Legal experts and school administrators have raised concerns about widespread over-
identification of disabled students under Section 504. Some instances of abuse result
from well-meaning educators trying to help children with academic problems, but no true
disabilities, obtain special accommodations. Other instances are more self-serving. For
example, some schools have falsely identified children under Section 504 so they can be
given assessment accommodations that might allow them to perform better on high-stakes
accountability assessments.
At this point the future of Section 504 remains unclear. Jacob and Hartshorne (2007) make
a compelling case that coming years might see a reduction in the number of students under Section
504. However, time will tell.
reasonable accommodations to ensure that students with disabilities have an equal opportunity to benefit from those activities or programs (Jacob & Hartshorne, 2007). Section 504 differs from IDEA in several important ways. First, it defines a handicap or disability very broadly, much more broadly than IDEA. Therefore, a child may not qualify for services under IDEA but qualify under Section 504. Second, Section 504 is an antidiscrimination act, not a grant program like IDEA. In terms of the assessment of disabilities, Section 504 provides less specific guidance than IDEA. Similar to IDEA, students qualified under Section 504 may receive modifications to the instruction and assessments implemented in the classrooms. In recent years there has been a rapid expansion in the number of students receiving accommodations under Section 504. However, Special Interest Topic 1.2 describes some recent events that might reduce the number of students served under Section 504.
Student Evaluations
The appropriate use of tests and other assessment procedures allows educators to monitor the progress of their students. In this context, probably the most common use of educational assessments involves assigning grades to students to reflect their academic progress or achievement. This type of evaluation is typically referred to as summative evaluation. Summative evaluation involves the determination of the value or quality of an outcome. In the classroom, summative evaluation typically involves the formal evaluation of student performance, commonly taking the form of a numerical or letter grade (e.g., A, B, C, D, or F). Summative evaluation is often designed to communicate information about student progress, strengths, and weaknesses to parents and other involved adults. Another prominent application of student assessments is to provide specific feedback to students in order to facilitate or guide their learning. Optimally, students need to know both what they have and have not mastered. This type of feedback serves to facilitate and guide learning activities and can help motivate students. It is often very frustrating to students to receive a score on an assignment without also receiving feedback about what they can do to improve their performance in the future. This type of evaluation is referred to as formative evaluation (i.e., providing feedback to students).
Instructional Decisions
Educational assessments also can provide important information that helps teachers adjust and enhance their teaching practices. For example, assessment information can help teachers determine what to teach, how to teach it, and how effective their instruction has been. Gronlund (2003) delineated a number of ways in which assessment can be used to enhance instructional decisions. For example, in terms of providing information about what to teach, teachers should routinely assess the skills and knowledge that students bring to their classroom in order to establish appropriate learning objectives (sometimes referred to as "sizing up" students). Teachers do not want to spend an excessive amount of time covering material that the students have already mastered, nor do they want to introduce material for which the students are ill prepared. In addition to decisions about the content of instruction, student assessments can help teachers tailor learning activities to match the individual strengths and weaknesses of their students. Understanding the cognitive strengths and weaknesses of students facilitates this process, and certain diagnostic tests provide precisely this type of information. This type of assessment is frequently referred to as diagnostic assessment. Finally, educational assessment can (and should) provide feedback to teachers about how effective their instructional practices are. Teachers can use assessment information to determine whether the learning objectives were reasonable, which instructional activities were effective, and which activities need to be modified or abandoned.
situation, some applicants are rejected and are no longer a concern of the university. In contrast, with placement, all students are placed and there are no actual rejections. For example, if all the students in a secondary school are assigned to one of three instructional programs (e.g., remedial, regular, and honors), this is a placement decision.
Policy Decisions
We use the category of “policy decisions” to represent a wide range of administrative deci-
sions made at the school, district, state, or national level. These decisions involve issues such
as evaluating the curriculum and instructional materials employed,
Instruction and assessment are determining which programs to fund, and even deciding which em-
two important and integrated ployees receive merit raises and/or promotions. We are currently in
aspects of the teaching process. an era of increased accountability in which parents and politicians
are setting higher standards for students and schools, and there isa
national trend to base many administrative policies and decisions on information garnered
from state or national assessment programs.
Counseling and guidance decisions promote self-understanding and help students plan for the future.
emphasize our recognition that most teachers will not make psychometrics their focus of
study. However, because assessment plays such a prominent role in schools and teachers
devote so much of their time to assessment-related activities, there are some basic competen-
cies that all teachers should master. In fact in 1990 the American Federation of Teachers, the
National Council on Measurement in Education, and the National Education Association col-
laborated to develop a document titled Standards for Teacher Competence in Educational As-
sessment of Students. In the following section we will briefly review these competencies (this
document is reproduced in its entirety in Appendix E). Where appropriate, we will identify
which chapters in this text are most closely linked to specific competencies.
Teachers should understand and be able to describe the implications and limitations of
assessment results and use them to enhance the education of their students and society in
general (addressed primarily in Chapters 1, 4, 5, and 11).
The field of educational assessment is dynamic and continuously evolving. There are some as-
pects of the profession that have been stable for many years. For example, classical test theory
(discussed in some detail in Chapter 4) has been around for almost a century and is still very
Introduction to Educational Assessment 25
influential today. However, many aspects of educational assessment are almost constantly
evolving as the result of a number of external and internal factors. Some of these changes are
the result of theoretical or technical advances, some reflect philosophical changes within the
profession, and some are the result of external societal or political influences. It is important
for assessment professionals to stay informed regarding new developments in the field and
to consider them with an open mind. To illustrate some of the developments the profession
is dealing with today, we will briefly highlight a few contemporary trends that are likely to
continue to impact assessment practices as you enter the teaching profession.
previously difficult if not impossible to assess accurately. Another innovative use of technol-
ogy is the commercially available instrumental music assessment systems that allow students
to perform musical pieces and have their performances analyzed and graded in terms of pitch
and rhythm. Online versions of these programs allow students to practice at home and have
their performance results forwarded to their instructors at school. Although it is difficult to
anticipate the many ways technology will change assessment practices in the twenty-first
century, it is safe to say that they will be dramatic and sweeping. Special Interest Topic 1.3
provides information on the growing use of technology to enhance assessment in contempo-
rary schools.
SPECIAL INTEREST TOPIC 1.3
According to a report in Education Week (May 8, 2003), computer- and Web-based assessments are
starting to find strong support in the schools. For example, the No Child Left Behind Act of 2001,
which requires states to test all students in the 3rd through 8th grades in reading and mathematics
every year, has caused states to start looking for more efficient and economical forms of assessment.
Assessment companies believe they have the answer: switch to computer or online assessments.
Although the cost of developing a computerized test is comparable to that of a traditional paper-and-
pencil test, once the test is developed the computer test is far less expensive. Some experts estimate
that computerized tests can be administered for as little as 25% of the cost of a paper-and-pencil test.
Another positive feature of computer-based assessment is that the results can often be available in a
few days as opposed to the months educators and students are used to waiting.
Another area in which technology is having a positive impact on educational assessment prac-
tices involves preparing students for tests. More and more states and school districts are developing
online test-preparation programs to help students improve their performance on high-stakes assess-
ments. The initial results are promising. For example, a pilot program in Houston, Texas, found that
75% of the high school students who had initially failed the mandatory state assessment improved
their reading scores by 29% after using a computer-based test-preparation program. In addition
to being effective, these computer-based programs are considerably less expensive for the school districts than face-to-face test-preparation courses.
While it is too early to draw any firm conclusions about the impact of technology on school
assessment practices, the early results are very promising. It is likely that by the year 2010 school-based assessments will be very different than they are today. This is an exciting time to work in the field of educational assessment!
Traditional test formats, such as multiple-choice and other selected-response formats (e.g., true-false, matching), have always had their crit-
ics, but their opposition has become more vocal in recent years. Opponents of traditional test
formats complain that they emphasize rote memorization and other low-level cognitive skills
and largely neglect higher-order conceptual and problem-solving skills. To address these and
related shortcomings, many educational assessment experts have promoted the use of more
“authentic” or complex-performance assessments, typically in the form of performance as-
sessments and portfolios. Performance assessments require test takers to complete a process
or produce a product in a context that closely resembles real-life situations. For example, a
medical student might be required to interview a mock patient, select tests and other assess-
ment procedures, arrive at a diagnosis, and develop a treatment plan (AERA et al., 1999).
Portfolios involve the systematic collection of an individual's work products over time (AERA
et al., 1999). Artists, architects, writers, and others have long used portfolios to represent their
work, and in the last decade portfolios have become increasingly popular in the assessment of
students. Although performance assessments have their own set of strengths and weaknesses,
they do represent a significant addition to the assessment options available to teachers.
It has been suggested that one finds more support for high-stakes assessments the further one gets
from the classroom. The implication is that while parents, politicians, and education administrators
might support high-stakes assessment programs, classroom teachers generally don’t. The National
Board on Educational Testing and Public Policy sponsored a study (Pedulla, Abrams, Madaus, Russell, Ramos, & Miao, 2003) to learn what classroom teachers actually think about high-stakes assess-
ment programs. The study found that state assessment programs affect both what teachers teach and
how they teach it. More specifically, the study found the following:
■ In those states placing more emphasis on high-stakes assessments, teachers reported feeling more pressure to modify their instruction to align with the test, engage in more test-preparation activities, and push their students to do well.
■ Teachers report that assessment programs have caused them to modify their teaching in ways that are not consistent with instructional best practices.
■ Teachers report that they spend more time on subjects that will be included on the state assessments and less time on subjects that are not assessed (e.g., fine arts, foreign language).
■ Teachers in elementary schools reported spending more time on test preparation than those in high schools.
■ A majority of teachers believed that the benefits of assessment programs did not warrant the time and money spent on them.
■ Most teachers did report favorable evaluations of their state's curriculum standards.
■ The majority of teachers did not feel the tests had unintended negative consequences such as causing students to be retained or drop out of school.
This is an intriguing paper for teachers in training that gives them a glimpse at the way high-
stakes testing programs may influence their day-to-day activities in the classroom. The full text of
this report is available at www.bc.edu/nbetpp.
than less, standardized testing in public schools. For example, the Elementary and Secondary
Education Act of 2001 (No Child Left Behind Act) requires that states test students annually
in grades 3 through 8. Because many states typically administer standardized achievement
tests in only a few of these grades, this new law will require even more high-stakes testing
than is currently in use (Kober, 2002). Special Interest Topic 1.4 provides a brief description
of a study that examined what teachers think about high-stakes state assessments.
students with disabilities was largely the responsibility of special education teachers, but now
regular education teachers play a prominent role. Regular education teachers will have more
students receiving special education services in their classroom and, as a result, will be inte-
grally involved in their instruction and assessment. Regular education teachers are increasingly
being required to help develop and implement individualized educational programs (IEPs) for
these students and assess their progress toward goals and objectives specified in the IEPs.
Summary
This chapter is a broad introduction to the field of educational assessment. We started by
emphasizing that assessment should be seen as an integral part of the teaching process.
When appropriately used, assessment can and should provide information that both en-
hances instruction and promotes learning. We then defined some common terms used in the
educational assessment literature:
Our discussion then turned to a description of different types of tests. Most tests can be
classified as either maximum performance or typical response. Maximum performance tests
are designed to assess the upper limits of the examinee’s knowledge and abilities whereas
typical response tests are designed to measure the typical behavior and characteristics of ex-
aminees. Maximum performance tests are often classified as achievement tests or aptitude
tests. Achievement tests measure knowledge and skills in an area in which the examinee has
received instruction. In contrast, aptitude tests measure cognitive abilities and skills that are
accumulated as the result of overall life experiences (AERA et al., 1999). Maximum perfor-
mance tests can also be classified as either speed tests or power tests. On pure speed tests,
performance reflects only differences in the speed of performance whereas on pure power
tests, performance reflects only the difficulty of the items the examinee is able to answer cor-
rectly. In most situations a test is not a measure of pure speed or pure power, but reflects some
combination of both approaches. Finally, maximum performance tests are often classified as
objective or subjective. When the scoring of a test does not rely on the subjective judgment of
the person scoring the test, it is said to be objective. For example, multiple-choice tests can be
scored using a fixed scoring key and are considered objective (multiple-choice tests are often
scored by a computer). If the scoring of a test does rely on the subjective judgment of the per-
son scoring the test, it is said to be subjective. Essay exams are examples of subjective tests.
Typical response tests measure constructs such as personality, behavior, attitudes, or
interests, and are often classified as being either objective or projective. Objective tests use
selected-response items (e.g., true—false, multiple-choice) that are not influenced by the
subjective judgment of the person scoring the test. Projective tests involve the presentation
of ambiguous material that can elicit an almost infinite range of responses. Most projective
tests involve some subjectivity in scoring, but what is exclusive to projective techniques is
the belief that these techniques elicit unconscious material that has not been censored by
the conscious mind.
Most tests produce scores that reflect the test takers’ performance. Norm-referenced
score interpretations compare an examinee’s performance to the performance of other people.
Criterion-referenced score interpretations compare an examinee’s performance to a speci-
fied level of performance. Typically tests are designed to produce either norm-referenced or
criterion-referenced scores, but it is possible for a test to produce both norm- and criterion-
referenced scores.
Next we discussed the basic assumptions that underlie educational assessment:
We described the major participants in the assessment process, including those who
develop tests, use tests, and take tests. We next turned to a discussion of major laws that
govern the use of tests and other assessments in schools, including the following:
■ No Child Left Behind Act (NCLB). This act requires states to develop demanding academic standards and put into place annual assessments to monitor progress.
■ Individuals with Disabilities Education Act (IDEA). This law mandates that children with disabilities receive a free, appropriate public education. To this end, students with disabilities may receive accommodations in their instruction and assessment.
■ Section 504. Students who qualify under Section 504 may also receive modifications to their instruction and assessment.
■ Protection of Pupil Rights Act (PPRA). Places requirements on surveys and assessments that elicit sensitive information from students.
■ Family Educational Rights and Privacy Act (FERPA). Protects the privacy of students and regulates access to educational records.
We noted that the use of assessments in schools is predicated on the belief that they
can provide valuable information that promotes student learning and helps educators make
better decisions. Prominent uses include the following: ;
Introduction to Educational Assessment 31
Next we elaborated on what teachers need to know about educational testing and as-
sessment. These competencies include proficiency in the following:
RECOMMENDED READINGS
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. In practically every content area this resource is indispensable!
Jacob, S., & Hartshorne, T. S. (2007). Ethics and law for school psychologists (5th ed.). Hoboken, NJ: Wiley. This book provides good coverage of legal and ethical issues relevant to work in the schools.
Joint Committee on Standards for Educational Evaluation (2003). The student evaluation standards. Thousand Oaks, CA: Corwin Press. This text presents the JCSEE guidelines as well as illustrative vignettes intended to help educational professionals implement the standards. The classroom vignettes cover elementary, secondary, and higher education settings.
Shepard, L. A. (2000). The role of classroom assessment in teaching and learning (CSE Technical Report 517). Los Angeles, CA: Center for the Study of Evaluation. This outstanding report conceptualizes classroom assessment as an integral part of teaching and learning. It is advanced reading at times, but well worth it.
Weiss, D. J. (1995). Improving individual difference measurement with item response theory and computerized adaptive testing. In D. Lubinski & R. Dawes (Eds.), Assessing individual differences in human behavior: New concepts, methods, and findings (pp. 49-79). Palo Alto, CA: Davies-Black. This chapter provides a good introduction to IRT and CAT.
Zenisky, A., & Sireci, S. (2002). Technological innovations in large-scale assessment. Applied Measurement in Education, 15, 337-362. This article details some of the ways computers have affected and likely will impact assessment practices.
INTERNET SITES OF INTEREST

www.aft.org
This is the Web site for the American Federation of Teachers, an outstanding resource for all interested in education.

http://edweek.org
Education Week is a weekly newsletter that is available online. This very valuable resource allows teachers to stay informed about professional events across the nation. You can sign up for a weekly alert and summary of articles. This is really worth checking out!

www.ncme.org
This Web site for the National Council on Measurement in Education is an excellent resource for those interested in finding scholarly information on assessment in education.
CHAPTER 2

The Basic Mathematics of Measurement
One does not need to be a statistical wizard to grasp the basic mathematical
concepts needed to understand major measurement issues.
CHAPTER HIGHLIGHTS
LEARNING OBJECTIVES
Every semester, whenever one of us teaches a course in tests and measurement for undergradu-
ate students in psychology and education, we inevitably hear a common moan. Students are
quick to say they fear this course because they hear it involves "a lot of statistics" and they are
not good at math, much less statistics. As stated in the opening quotation, you do not have to
be a statistical wizard to comprehend the mathematical concepts needed to understand major
measurement issues. In fact Kubiszyn and Borich (2003) estimate that less than 1% of the stu-
dents in their testing and assessment courses performed poorly entirely because of insufficient
math skills. Nevertheless, all measurements in education and psychology have mathematical
properties, and those who use tests and other assessments, whether teacher-made or standard-
ized commercial procedures, need to have an understanding of the basic mathematical and sta-
tistical concepts on which these assessments are predicated. In this chapter we will introduce
these mathematical concepts. Generally we will emphasize the development of a conceptual
understanding of these issues rather than focusing on mathematical computations. In a few in-
stances we will present mathematical formulas and demonstrate their application, but we will
keep the computational aspect to a minimum. To guard against becoming overly technical in
this chapter, we asked undergraduate students in nonmath majors to review it. Their consensus
was that it was readable and “user friendly.” We hope you will agree!
In developing this textbook our guiding principle has been to address only those con-
cepts that teachers really need to know to develop, administer, and interpret assessments in
educational settings. We recognize that most teachers do not desire to become test develop-
ment experts, but because teachers routinely develop, use, and interpret assessments, they
need to be competent in their use. In this chapter, we will first discuss scales of measurement
and show you how different scales have different properties or characteristics. Next we will
introduce the concept of a collection or distribution of scores and review the different statis-
tics available to describe distributions. Finally we will introduce the concept of correlation,
how it is measured, and what it means.
Scales of Measurement
What Is Measurement?
Measurement can be defined as a set of rules for assigning numbers to represent objects, traits, or other characteristics. Measurement can involve four different scales: nominal, ordinal, interval, and ratio.
Nominal Scales
Nominal scales are the simplest of the four scales; they classify people or objects into categories, classes, or sets. In most situations, these categories are mutually exclusive. For example, gender is an example of a nominal scale that assigns individuals to mutually exclusive categories. Another example is assigning people to categories based on their college academic majors (e.g., education, psychology, chemistry). You may have noticed that in these examples we did not
assign numbers to the categories. In some situations we do assign
numbers in nominal scales simply to identify or label the categories; however, the categories
are not ordered in a meaningful manner. For example, we might use the number one to rep-
resent a category of students who list their academic major as education, the number two for
the academic major of psychology, the number three for the academic major of chemistry,
and so forth. Notice that no attempt is made to order the categories. Three is not greater than
two, and two is not greater than one. The assignment of numbers is completely arbitrary. We
could just as easily call them red, blue, green, and so on. Another individual might assign
a new set of numbers, which would be just as useful as ours. Because of the arbitrary use
of numbers in nominal scales, nominal scales do not actually quantify the variables under
examination. Numbers assigned to nominal scales should not be added, subtracted, ranked,
or otherwise manipulated. As a result, many common statistical procedures cannot be used
with these scales so their usefulness is limited.
Ordinal Scales
Ordinal scale measurement allows you to rank people or objects according to the amount
or quantity of a characteristic they display or possess. For example, ranking people by height tells us who is taller than whom, but it tells us nothing about how much taller. As a result, these scales are somewhat limited in both the
measurement information they provide and the statistical procedures that can be applied.
Nevertheless, the use of these scales is fairly common in educational settings. Percentile
rank, age equivalents, and grade equivalents are all examples of ordinal scales.
Interval Scales
Interval scales rank people or objects like an ordinal scale, but on a scale with equal units.

Interval scales provide more information than either nominal or ordinal scales. Interval scales rank people or objects like an ordinal scale, but on a scale with equal units. By equal scale units, we mean the difference between adjacent units on the scale is equivalent. The difference between scores of 70 and 71 is the same as the difference between scores of 50 and 51 (or 92 and 93; 37 and 38; etc.). Many educational
and psychological tests are designed to produce interval level scores.
Let’s look at an example of scores for three people on an aptitude
test. Assume individual A receives a score of 100, individual B a score of 110, and individual
C a score of 120. First, we know that person C scored the highest followed by B then A. Sec-
ond, given that the scores are on an interval scale, we also know that the difference between
individuals A and B (i.e., 10 points) is equivalent to the difference between B and C (i.e.,
10 points). Finally, we know the difference between individuals A and C (i.e., 20 points) is
twice as large as the difference between individuals A and B (i.e., 10 points). Interval level
data can be manipulated using common mathematical operations (e.g., addition, subtrac-
tion, multiplication, and division) whereas lesser scales (i.e., nominal and ordinal) cannot.
A final advantage is that most statistical procedures can be used with interval scale data.
As you can see, interval scales represent a substantial improvement over ordinal
scales and provide considerable information. Their one limitation is that interval scales
do not have a true zero point. That is, on interval scales a score of zero does not reflect
the total absence of the attribute. For example, if an individual were unable to answer
any questions correctly on an intelligence test and scored a zero, it would not indicate
the complete lack of intelligence, but only that he or she was unable to respond correctly
to any questions on this test. (Actually intelligence tests are designed so no one actually
receives a score of zero. We just use this example to illustrate the concept of an arbitrary
zero point.) Likewise, even though an IQ of 100 is twice as large as an IQ of 50, it does
not mean that the person with an IQ of 100 is twice as intelligent as the person with an IQ
of 50. In educational settings, interval scale scores are most commonly seen in the form
of standard scores (there are a number of standard scores used in education, which will be
discussed in the next chapter).
Ratio Scales
Ratio scales have all the properties of interval scales with the addition of a true or absolute zero point. With the exception of a few applications, such as certain physical performance tests and the measurement of behavioral responses (e.g., reaction time), there are relatively
few ratio scales in educational and psychological measurement. Fortunately, we are able to
address most of the measurement issues in education adequately using interval scales.
Table 2.1 gives examples of common nominal, ordinal, interval, and ratio scales found
in educational and psychological measurement. As we noted, there is a hierarchy among the
scales with nominal scales being the least sophisticated and providing the least information
and ratio scales being the most sophisticated and providing the most information. Nominal
scales allow you to assign a number to a person that associates that person with a set or
category, but other useful quantitative properties are missing. Ordinal scales have all the
positive properties of nominal scales with the addition of the ability to rank people ac-
cording to the amount of a characteristic they possess. Interval scales have all the positive
properties of ordinal scales and also incorporate equal scale units. The inclusion of equal
scale units allows one to make relative statements regarding scores (e.g., the difference
between a score of 82 and a score of 84 is the same as the difference between a score of
92 and 94). Finally, ratio scales have all of the positive properties of an interval scale with
the addition of an absolute zero point. The inclusion of an absolute zero point allows us to
form meaningful ratios between scores (e.g., a score of 50 reflects twice the amount of the
characteristic as a score of 25). Although these scales do form a hierarchy, this does not
mean the lower scales are of little or no use. If you want to categorize students according
to their academic major, a nominal scale is clearly appropriate. Similarly, if you simply
want to rank people according to height, an ordinal scale would be adequate and appropri-
ate. However, in most measurement situations you want to use the scale that provides the
most information. Special Interest Topic 2.1 elaborates on technical distinctions among the
four scales of measurement.
An individual’s test score in isolation provides very little information, even if we know its
scale of measurement. For example, if you know that an individual’s score on a test of reading
achievement is 79, you know very little about that student’s reading ability. Even if you know
the scale of measurement represented by the test (e.g., an interval scale), you still know very
little about the individual’s reading ability. To meaningfully interpret or describe test scores
you need to have a frame of reference. Often the frame of reference is how other people
performed on the test. For example, if in a class of 25 children, a score of 79 was the highest
score achieved, it would reflect above average (or possibly superior) performance. In contrast,
if 79 was the lowest score, it would reflect below average performance. The following sec-
tions provide information about score distributions and the statistics used to describe them. In
the next chapter we will use many of these concepts and procedures to help you learn how to
describe and interpret test scores.
Distributions
SPECIAL INTEREST TOPIC 2.1

In this chapter we discuss a number of important distinctions among the four scales of measure-
ment. Having touched on the fact that the different scales of measurement differ in terms of the basic
mathematical and statistical operations that can be applied, in this section we elaborate on these
distinctions. With nominal level data the only mathematical operations that are applicable are "equal to" (=) and "not equal to" (≠). With ordinal level data you can also include "greater than" (>) and "less
than” (<) as applicable operations. It is not until you have interval level data that one can use basic
operations like addition, subtraction, multiplication, and division. However, because interval level
scores do not have an absolute or true zero, you cannot make statements about relative magnitude or
create ratios. For example, it is not accurate to say that someone with an IQ (an interval level score)
of 140 is twice as intelligent as someone with an IQ of 70. With ratio level data, however, you can
make accurate statements about relative magnitude and create ratios. For example, someone 6 feet
tall is twice as tall as someone 3 feet tall, and someone weighing 140 pounds does weigh twice as
much as someone weighing 70 pounds.
As you might expect, the scale of measurement also affects the type of statistics that are ap-
plicable. We will start by addressing descriptive statistics. In terms of measures of central tendency,
discussed later in this chapter, only the mode is applicable to nominal level data. With ordinal level
data, both the mode and median can be calculated. With interval and ratio level data, the mode, me-
dian, and mean can all be calculated. In terms of measures of variability, also discussed later in this
chapter, no common descriptive statistic is applicable for nominal level data. One can describe the
categories and the count in each category, but there is no commonly used statistic available. With
ordinal level data, it is reasonable to report the range of scores. With interval and ratio level data the
range is applicable as well as the variance and standard deviation.
One finds a similar pattern in considering inferential statistics, which are procedures that
allow a researcher to make inferences regarding a population based on sample data. In brief, nominal
and ordinal level data are limited to the use of nonparametric statistical procedures. Interval and ratio
level data can be analyzed using parametric statistical procedures. Parametric statistical procedures
are preferred from a research perspective since they have more statistical power, which in simplest
terms means they are more sensitive in detecting true differences between groups.
From this discussion it should be clear that more mathematical and statistical procedures can
be used with interval and ratio level data than with nominal and ordinal level data. This is one of the
reasons why many researchers and statisticians prefer to work with interval and ratio level data. In
summary, it is important to accurately recognize the scale of measurement you are using so appropri-
ate mathematical and statistical procedures can be applied.
Note: While these technical guidelines are widely accepted, there is not universal agreement and some disregard
them (e.g., calculating the mean on ordinal level data).
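To make these guidelines concrete, the following brief Python sketch (ours, offered only as an illustration; the dictionary and function names are not from any standard library) encodes which of the descriptive statistics discussed above are conventionally considered appropriate for each scale of measurement.

# Illustrative sketch: descriptive statistics conventionally considered
# appropriate for data on each scale of measurement.
APPROPRIATE_STATISTICS = {
    "nominal":  {"mode"},
    "ordinal":  {"mode", "median", "range"},
    "interval": {"mode", "median", "mean", "range", "variance", "standard deviation"},
    "ratio":    {"mode", "median", "mean", "range", "variance", "standard deviation"},
}

def is_appropriate(statistic, scale):
    """Return True if the statistic is conventionally applied to data on this scale."""
    return statistic in APPROPRIATE_STATISTICS[scale]

print(is_appropriate("mean", "ordinal"))    # False: the mean assumes equal scale units
print(is_appropriate("median", "ordinal"))  # True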
A distribution is simply a set of scores. When there are relatively few possible scores, each score and its frequency can be listed in an ungrouped frequency distribution (see Table 2.3). In some situations there are so many possible scores that it is not practical to list each potential
score individually. In these situations it is common to use a grouped frequency distribu-
tion. In grouped frequency distributions the possible scores are “combined” or “grouped”
into class intervals that encompass a range of possible scores. Table 2.4 presents a grouped
frequency distribution of 250 hypothetical scores grouped into class intervals spanning five score values.

TABLE 2.2 Distribution of Scores for 20 Students
(The 20 students' homework scores, listed by name; Mean = 7.3, Median = 7.5, Mode = 8.)

TABLE 2.3 Ungrouped Frequency Distribution

Score    Frequency
10       1
9        4
8        5
7        4
6        3
5        2
4        1

Note: This reflects the same distribution of scores depicted in Table 2.2.

TABLE 2.4 Grouped Frequency Distribution

Class Interval    Frequency
125-129           6
120-124           14
115-119           17
110-114           23
105-109           27
100-104           42
95-99             39
90-94             25
85-89             …
80-84             17
75-79             13
70-74             …

Note: This presents a grouped frequency distribution of 250 hypothetical scores that are grouped into class intervals that incorporate five score values.
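To illustrate how tables like these are built, here is a short Python sketch (ours, not part of the text) that constructs the ungrouped frequency distribution in Table 2.3 from the 20 homework scores and then groups scores into class intervals that span five score values, as in Table 2.4.

from collections import Counter

# The 20 homework scores summarized in Table 2.3
# (one 10, four 9s, five 8s, four 7s, three 6s, two 5s, one 4).
scores = [10, 9, 9, 9, 9, 8, 8, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 5, 5, 4]

# Ungrouped frequency distribution: each possible score and its frequency.
ungrouped = Counter(scores)
for score in sorted(ungrouped, reverse=True):
    print(score, ungrouped[score])

# Grouped frequency distribution: scores combined into class intervals
# that incorporate five score values.
def class_interval(score, width=5):
    low = (score // width) * width
    return f"{low}-{low + width - 1}"

grouped = Counter(class_interval(s) for s in scores)
print(grouped)   # e.g., 18 of the 20 scores fall in the 5-9 interval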
Frequency graphs are also popular and provide a visual representation of a distribution.
When reading a frequency graph, scores are traditionally listed on the horizontal axis and
the frequency of scores is listed on the vertical axis. Figure 2.1 presents a graph of the set
of homework scores listed in Tables 2.2 and 2.3. In examining this figure you see that there
was only one score of 10 (reflecting perfect performance) and there was only one score of 4 (reflecting correct responses to only four questions). Most of the other students received
scores between 7 and 9. Figure 2.2 presents a graph of a distribution that might reflect a large
standardization sample. Examining this figure reveals that the scores tend to accumulate
around the middle in frequency, diminishing as we move further away from the middle.
Another characteristic of the distribution depicted in Figure 2.2 is that it is symmetri-
cal, which means that if you divide the distribution into two halves, they will mirror each
other. Not all distributions are symmetrical. A nonsymmetrical distribution is referred to as
skewed. Skewed distributions can be either negatively or positively skewed. In a negatively skewed distribution the scores pile up at the high end and the tail of the distribution extends toward the low end, as illustrated in Figure 2.3. When a test produces scores that are negatively skewed, it is probable that the test is too easy because there are many high scores and relatively few low scores. A positively skewed distribution shows the opposite pattern, with the scores piling up at the low end and the tail extending toward the high end, as illustrated in Figure 2.4. If a test
produces scores that are positively skewed, it is likely that the test is too difficult because
there are many low scores and few high scores. In the next chapter we will introduce a spe-
cial type of distribution referred to as the normal distribution and describe how it is used to
help interpret test scores. First, however, we will describe two important characteristics of
distributions and the methods we have for describing them. The first characteristic is central
tendency and the second is variability.
Measures of Central Tendency

The major measures of central tendency are the mean, the median, and the mode, and most people are already somewhat familiar with them. It is likely that you have heard of all of these statistics, but we will briefly discuss them to ensure that you are familiar with the special characteristics of each.
The mean is the arithmetic average of a distribution.

Mean. Most people are familiar with the mean as the simple arithmetic average. Practically every day you hear discussions involving
the concept of the average amount of some entity. Meteorologists give
information about the average temperature and amount of rain, politicians and economists
discuss the average hourly wage, educators talk about the grade point average, health profes-
sionals talk about average weight and average life expectancy, and the list goes on. Formally,
the mean of a set of scores is defined by the following equation:

Mean = Sum of Scores / Number of Scores (i.e., X̄ = ΣX / N)
The mean of the homework scores listed in Table 2.2 is calculated by summing the
20 scores in the distribution and dividing by 20. This results in a mean of 7.3. Note that the
mean is near the middle of the distribution (see Figure 2.1). Although no student obtained
a score of 7.3, the mean is useful in providing a sense of the central tendency of the group
of scores. Several important mathematical characteristics of the mean make it useful as a
measure of central tendency. First, the mean can be calculated with interval and ratio level
data, but not with ordinal and nominal level data. Second, the mean of a sample is a good
estimate of the mean for the population from which the sample was drawn. This is use-
ful when developing standardized tests in which standardization samples are tested and
the resulting distribution is believed to reflect characteristics of the entire population of
people with whom the test is expected to be used (see Special Interest Topic 2.2 for more
information on this topic). Another positive characteristic of the mean is that it is essential
to the definition and calculation of other descriptive statistics that are useful in the context
of measurement.
An undesirable characteristic of the mean is that it is sensitive to unbalanced extreme
scores. By this we mean a score that is either extremely high or extremely low relative to
the rest of the scores in the distribution. An extreme score, either very large or very small,
tends to “pull” the mean in its direction. This might not be readily apparent so let’s look at an
example. In the set of scores 1, 2, 3, 4, 5, and 38, the mean is 8.8. Notice that 8.8 is not near
any score that actually occurs in the distribution. The extreme score of 38 pulls the mean
in its direction. The tendency for the mean to be affected by extreme scores is particularly
problematic when there is a small number of scores. The influence of an extreme score de-
creases as the total number of scores in the distribution increases. For example, when the same extreme score of 38 is included in a considerably larger set of small scores, the resulting mean of 4.6 falls much closer to the bulk of the scores. In this example the influence of the extreme score is reduced by the presence of a larger number of scores.
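The "pull" of an extreme score is easy to verify. The following Python lines (an illustrative sketch, not from the text) compute the mean of the small distribution above with and without the extreme score of 38, and also show the median, which is far less affected.

from statistics import mean, median

scores = [1, 2, 3, 4, 5, 38]

print(mean(scores))       # 8.83...: the mean is pulled toward the extreme score of 38
print(mean(scores[:-1]))  # 3.0: the mean of the same scores without the extreme score
print(median(scores))     # 3.5: the median is much less affected by the extreme score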
SPECIAL INTEREST TOPIC 2.2

Although we try to minimize the use of statistical jargon whenever possible, at this point it is useful
to highlight the distinction between population parameters and sample statistics. Statisticians dif-
ferentiate between populations and samples. A population is the complete group of people, objects,
or other things of interest. An example of a population is all of the secondary students in the United
States. Because this is a very large number of students, it would be extremely difficult to study such
a group. Such constraints often prevent researchers from studying entire populations. Instead they
study samples. A sample is just a subset of the larger population that is thought to be representative
of the population. By studying samples researchers are able to make generalizations about popula-
tions. For example, although it might not be practical to administer a questionnaire to all secondary
students in the United States, it would be possible to select a random sample of secondary students
and administer the questionnaire to them. If we are careful in selecting this sample and it is of suf-
ficient size, the information garnered from the sample may allow us to draw some conclusions about
the population.
Now we will address the distinction between parameters and statistics. Population values are
referred to as parameters and are typically represented with Greek symbols. For example, statisti-
cians use mu (μ) to indicate a population mean and sigma (σ) to indicate a population standard
deviation. Because it is often not possible to study entire populations, we do not know population
parameters and have to estimate them using statistics. A statistic is a value that is calculated based on
a sample. Statistics are typically represented with Roman letters. For example, statisticians use X̄ to indicate the sample mean (some statisticians use M to indicate the mean) and SD (or S) to indicate the sample standard deviation. Sample statistics can provide information about the corresponding population parameters. For example, the sample mean (X̄) may serve as an estimate of the population mean (μ). Of course the information provided by a sample statistic is only as good as the sample the
statistic is based on. Large representative samples can provide good information whereas small or
biased samples will provide poor information. Without going into detail about sampling and infer-
ential statistics at this point, we do want to make you aware of the distinction between parameters
and statistics. In this and other texts you will see references to both parameters and statistics and
understanding this distinction will help you avoid a misunderstanding. Remember, as a general rule
if the value is designated with a Greek symbol it refers to a population parameter, but if it is desig-
nated with a Roman letter it is a sample statistic.
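The relationship between a population parameter and a sample statistic can also be demonstrated with a small simulation. In this illustrative Python sketch (ours; the population of scores is randomly generated, not real data), the mean of the full set of scores plays the role of the parameter mu, and means computed from random samples of increasing size serve as the statistics that estimate it.

import random
from statistics import mean

random.seed(0)

# A hypothetical "population" of 10,000 scores (mean near 100, SD near 15).
population = [random.gauss(100, 15) for _ in range(10_000)]
mu = mean(population)  # the population parameter (mu)

for n in (10, 100, 1_000):
    sample = random.sample(population, n)  # draw a random sample of size n
    x_bar = mean(sample)                   # the sample statistic (X-bar)
    print(f"n = {n:5d}   sample mean = {x_bar:6.2f}   population mean = {mu:6.2f}")

# Larger, representative samples tend to estimate the parameter more closely.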
Median. The median is the score or potential score that divides a distribution in half. For example, consider the following set of scores: 9, 8, 7, 6, 5. In this example the median is 7 because two scores
fall above it and two fall below it. In actual practice a process referred to as interpolation is
often used to compute the median (because interpolation is illustrated in practically every
basic statistics textbook, we will not go into detail about the process). The median can be
calculated for distributions containing ratio, interval, or ordinal level scores, but it is not ap-
propriate for nominal level scores. The median is a useful and versatile measure of central
tendency.
The mode is the most frequently occurring score in a distribution.

Mode. The mode of a distribution is the most frequently occurring score. Referring back to Table 2.3, which presents the ungrouped
frequency distribution of 20 students on a homework assignment, you will see that the most
frequently occurring score is 8. These scores are graphed in Figure 2.1, and by locating
the highest point in the graph you are also able to identify the mode (i.e., 8). An advantage
of the mode is that it can be used with nominal data (e.g., the most frequent college major
selected by students) as well as ordinal, interval, and ratio data (Hays, 1994). However, the
mode does have significant limitations as a measure of central tendency. First, some distri-
butions have two scores that are equal in frequency and higher than other scores (see Figure
2.5). This is referred to as a “bimodal” distribution and the mode is ineffective as a measure
of central tendency. Second, the mode is not a very stable measure of central tendency,
particularly with small samples. For example, in the distribution depicted in Table 2.3, if
one student who earned a score of 8 had earned a score of either 7 or 9, the mode would have
shifted from 8 to 7 or 9. As a result of these limitations, the mode is often of little utility as
a measure of central tendency.
Choosing between the Mean, Median, and Mode. A natural question is, Which mea-
sure of central tendency is most useful or appropriate? As you might expect, the answer
depends on a number of factors. First, as we noted when discussing the mean, it is essential
when calculating other useful statistics. For this and other rather technical reasons (see
Hays, 1994), the mean has considerable utility as a measure of central tendency. However,
for purely descriptive purposes the median is often the most versatile and useful measure
of central tendency. When a distribution is skewed, the influence of unbalanced extreme
scores on the mean tends to undermine its usefulness. Figure 2.6 illustrates the expected
relationship between the mean and the median in skewed distributions. Note that the mean
is “pulled” in the direction of the skew: that is, lower than the median in negatively skewed
distributions and higher than the median in positively skewed distributions. To illustrate how
the mean can be misleading in skewed distributions, Hopkins (1998) notes that due to the
influence of extremely wealthy individuals, about 60% of the families in the United States
have incomes below the national mean. In this situation, the mean is pulled in the direction
of the extreme high scores and is somewhat misleading as a measure of central tendency.
Finally, it is important to consider the variable’s scale of measurement. If you are dealing
with nominal level data, the mode is the only measure of central tendency that provides
useful information. With ordinal level data one can calculate the median in addition to the
mode. It is only with interval and ratio level data that one can appropriately calculate the
mean in addition to the median and mode.
At this point you should have a good understanding of the various measures of central
tendency and be able to interpret them in many common applications. You might be surprised
how often individuals in the popular media demonstrate a fundamental misunderstanding
of these measures. See Special Interest Topic 2.3 for a rather humorous example of how a
journalist misinterpreted information based on measures of central tendency.
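For the homework scores used throughout this chapter, the three measures of central tendency can be verified with Python's statistics module (an illustrative sketch, not part of the text):

from statistics import mean, median, mode

# The 20 homework scores from Tables 2.2 and 2.3.
scores = [10, 9, 9, 9, 9, 8, 8, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 5, 5, 4]

print(mean(scores))    # 7.3
print(median(scores))  # 7.5
print(mode(scores))    # 8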
Measures of Variability
Two distributions can have the same mean, median, and mode yet differ considerably in
the way the scores are distributed around the measures of central tendency. Therefore, it is
not sufficient to characterize a set of scores solely by measures of central tendency. Figure
2.7 presents graphs of three distributions with identical means but different degrees of
variability. A measure of the dispersion, spread, or variability of a set of scores will help
us describe the distribution more fully. We will examine three measures of variability
commonly used to describe distributions: range, standard deviation, and variance.

The range is the distance between the smallest and largest score in a distribution.

Range. The range is the distance between the smallest and largest score in a distribution. The range is calculated:

Range = Highest Score − Lowest Score
For example, in referring back to the distribution of scores listed in Table 2.3, you see that
the largest score is 10 and the smallest score is 4. By simply subtracting 4 from 10 you
determine the range is 6. (Note: Some authors define the range as the highest score minus
the lowest score, plus one. This is known as the inclusive range.) The range considers only
SPECIAL INTEREST TOPIC 2.3

Half of all professionals charge above the median fee for their services. Now that you understand
the mean, median, and mode, you will recognize how obvious this statement is. However, a few
years back a local newspaper columnist in Texas, apparently unhappy with his physician’s bill for
some services, conducted an investigation of charges for various medical procedures in the county
in which he resided. In a somewhat angry column he revealed to the community that “fully half of
all physicians surveyed charge above the median fee for their services.”
We would like him to know that “fully half” of all plumbers, electricians, painters, lawn ser-
vices, hospitals, and everyone else we can think of also charge above the median for their services.
We wouldn’t have it any other way!
FIGURE 2.7 Three Distributions with Different Degrees of Variability
Source: From Robert J. Gregory, Psychological Testing: History, Principles, and Applications, 3/e. Published by Allyn & Bacon, Boston, MA. Copyright © 2000 by Pearson Education. Reprinted by permission of the publisher.
the two most extreme scores in a distribution and tells us about the limits or extremes of a
distribution. However, it does not provide information about how the remaining scores are
spread out or dispersed within these limits. We need other descriptive statistics, namely the
standard deviation and variance, to provide information about the spread or dispersion of
scores within the limits described by the range.
The standard deviation is a measure of the average distance that scores vary from the mean of the distribution.

Standard Deviation. The mean and standard deviation are the most widely used statistics in educational and psychological testing as well as research in the social and behavioral sciences. The standard deviation is computed with the following steps:

Step 1. Calculate the mean of the distribution.
Step 2. Subtract the mean from each score to obtain a difference (deviation) score. Some of these difference scores will be positive and some will be negative, and they will always sum to zero. To overcome this difficulty, we simply square each difference score because the square of any number is always positive.
Step 3. Sum all the squared difference scores.
Step 4. Divide this sum by the number of scores to derive the average of the squared deviations from the mean. This value is the variance and is designated by σ² (we will return to this value briefly).
Step 5. The standard deviation (σ) is the positive square root of the variance (σ²). It is the square root because we first squared all the scores before adding them. To now get a true look at the standard distance between key points in the distribution, we have to undo our little trick that eliminated all those negative signs.
These steps are illustrated in Table 2.5 using the scores listed in Table 2.2. This example illustrates the calculation of the population standard deviation, designated with the Greek symbol sigma (σ). You will also see the standard deviation designated with SD or S. This is appropriate when you are describing the standard deviation of a sample rather than a population (refer back to Special Interest Topic 2.2 for information on the distinction between population parameters and sample statistics).¹
The standard deviation is a measure of the average distance that scores vary from the
mean of the distribution. The larger the standard deviation, the more scores differ from the
mean and the more variability there is in the distribution. If scores are widely dispersed or
spread around the mean, the standard deviation will be large. If there is relatively little dis-
persion or spread of scores around the mean, the standard deviation will be small.
The variance is a measure of variability that has special meaning as a theoretical concept in measurement theory and statistics.

Variance. In calculating the standard deviation we actually first calculate the variance (σ²). As illustrated in Table 2.5, the standard deviation is actually the positive square root of the variance. Therefore, the variance is also a measure of the variability of scores. The reason the standard deviation is more frequently used when interpreting individual scores is that the variance is in squared units of measurement, which complicates interpretation. For example, we can easily interpret weight in pounds, but it is more difficult to interpret and use weight reported in squared pounds.
While the variance is in squared units, the standard deviation (i.e.,
the square root of the variance) is in the same units as the scores and so is more easily under-
stood. Although the variance is difficult to apply when describing individual scores, it does
have special meaning as a theoretical concept in measurement theory and statistics. For now,
simply remember that the variance is a measure of the degree of variability in scores.
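The five steps for computing the standard deviation can be followed literally in code. The Python sketch below (ours) applies them to the 20 homework scores and reproduces the values in Table 2.5 (variance = 2.41, standard deviation of about 1.55). It uses the descriptive formula with N in the denominator, as discussed in this chapter.

import math

scores = [10, 9, 9, 9, 9, 8, 8, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 5, 5, 4]
N = len(scores)

mean = sum(scores) / N                       # Step 1: calculate the mean (7.3)
deviations = [x - mean for x in scores]      # Step 2: subtract the mean from each score...
squared = [d ** 2 for d in deviations]       # ...and square each difference score
sum_of_squares = sum(squared)                # Step 3: sum the squared difference scores
variance = sum_of_squares / N                # Step 4: divide by N; this is the variance (2.41)
sd = math.sqrt(variance)                     # Step 5: take the positive square root (about 1.55)

print(variance, sd)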
Choosing between the Range, Standard Deviation, and Variance. As we noted, the
range conveys information about the limits of a distribution, but does not tell us how the
scores are dispersed within these limits.

¹The discussion and formulas provided in this chapter are those used in descriptive statistics. In inferential statistics, when the population variance is estimated from a sample, the N in the denominator is replaced with N − 1.

TABLE 2.5 Calculating the Variance and Standard Deviation of the Scores in Table 2.2
Variance (σ²) = 2.41
Standard Deviation (σ) = 1.55

The standard deviation indicates the average distance that scores vary from the mean of the distribution. The larger the standard deviation,
the more variability there is in the distribution. The standard deviation is very useful in
describing distributions and will be of particular importance when we turn our attention to
the interpretation of scores in the next chapter. The variance is another important and use-
ful measure of variability. Because the variance is expressed in terms of squared measure-
ment units, it is not as useful in interpreting individual scores as is the standard deviation.
However, the variance is important as a theoretical concept, and we will return to it when
discussing reliability and validity in later chapters.
Correlation Coefficients
Most students are somewhat familiar with the concept of correlation. When people speak of
a correlation, they are referring to the relationship between two variables. The variables can
be physical such as weight and height or psychological such as intelligence and academic
achievement. For example, it is reasonable to expect height to demonstrate a relationship
with weight. Taller individuals tend to weigh more than shorter individuals. This relation-
ship is not perfect because there are some short individuals who weigh more than taller
individuals, but the tendency is for taller people to outweigh shorter people. You might also
expect more intelligent people to score higher on tests of academic achievement than less
intelligent people, and this is what research indicates. Again, the relationship is not per-
fect, but as a general rule more intelligent individuals perform better on tests of academic
achievement than their less intelligent peers.

A correlation coefficient is a quantitative measure of the relationship between two variables. Correlation coefficients range from −1.0 to +1.0; the sign indicates the direction of the relationship and the absolute value indicates its strength. Interpreting the size of a correlation coefficient in this descriptive manner is satisfactory in many situations, but in other contexts it may be more important to determine
whether a correlation is “statistically significant.” Statistical significance is determined by
both the size of the correlation coefficient and the size of the sample. A discussion of statistical
significance would lead us into the realm of inferential statistics and is beyond the scope of
this text. However, most introductory statistics texts address this concept in considerable detail
and contain tables that allow you to determine whether a correlation coefficient is significant
given the size of the sample.
The coefficient of determination is interpreted as the amount of variance shared by two variables.

Another way of describing a correlation coefficient is to square it to derive the coefficient of determination (i.e., r²). The coefficient of determination is interpreted as the amount of variance shared by the two variables. In other words, the coefficient of determination reflects the amount of variance in one variable that is predictable from the other variable, and vice versa. This might not be clear so let's look at an example. Assume a correlation between an intelligence test and an achievement test of 0.60 (i.e., r = 0.60). By squaring this value we derive the coefficient of determination, which is 0.36 (i.e., r² = 0.36). This indicates that 36% of the variance in one variable is predictable from the other variable.
Scatterplots
As noted, a correlation coefficient is a quantitative measure of the relationship between
two variables. Examining scatterplots may enhance our understanding of the relationship
between variables. A scatterplot is simply a graph that visually displays the relationship
between two variables.

A scatterplot is a graph that visually displays the relationship between two variables.

To create a scatterplot you need to have two scores for each individual. For example, you could graph each individual's weight and height. In the context of educational testing, you could have scores for the students in a class on two different homework assignments. In a scatterplot the X-axis represents one variable
and the Y-axis the other variable. Each mark in the scatterplot actually represents two scores,
an individual’s scores on the X variable and the Y variable.
Figure 2.8 shows scatterplots for various correlation values. First, look at Figure 2.8a,
which shows a hypothetical perfect positive correlation (+1.0). Notice that with a perfect
correlation all of the marks will fall on a straight line. Because this is a positive correlation
an increase on one variable is associated with a corresponding increase on the other variable.
Because it is a perfect correlation, if you know an individual’s score on one variable you
can predict the score on the other variable with perfect precision. Next examine Figure 2.8b,
which illustrates a perfect negative correlation (—1.0). Being a perfect correlation all the
marks fall on a straight line, but because it is a negative correlation an increase on one vari-
able is associated with a corresponding decrease on the other variable. Given a score on one
variable, you can still predict the individual’s performance on the other variable with perfect
precision. Now examine Figure 2.8c, which illustrates a correlation of 0.0. Here there is not
a relationship between the variables. In this situation, knowledge about performance on one
variable does not provide any information about the individual’s performance on the other
variable or enhance prediction.
FIGURE 2.8 Scatterplots Illustrating Various Correlation Coefficients: (a) +1.0, (b) −1.0, (c) 0.0, (d) +0.90, (e) 0.60, (f) 0.30. In each panel, Variable X is plotted on the horizontal axis and Variable Y on the vertical axis.
So far we have examined only the scatterplots of perfect and zero correlation coef-
ficients. Examine Figure 2.8d, which depicts a correlation of +0.90. Notice that the marks
clearly cluster along a straight line. However, they no longer all fall on the line, but rather
around the line. As you might expect, in this situation knowledge of performance on one
variable helps us predict performance on the other variable, but our ability to predict per-
formance is not perfect as it was with a perfect correlation. Finally, examine Figures 2.8e
and 2.8f, which illustrate coefficients of 0.60 and 0.30, respectively. As you can see a cor-
relation of 0.60 is characterized by marks that still cluster along a straight line, but there is
more variability around this line than there was with a correlation of 0.90. Accordingly, with
a correlation of 0.30 there is still more variability of marks around a straight line. In these
situations knowledge of performance on one variable will help us predict performance on
the other variable, but as the correlation coefficients decrease so does our ability to predict
performance.
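A scatterplot such as those in Figure 2.8 can be produced with a plotting library. The sketch below (ours; it assumes the third-party matplotlib package is installed) plots the two sets of 20 scores that are used in the Pearson correlation example that follows.

import matplotlib.pyplot as plt  # third-party plotting library (assumed installed)

# Scores for 20 students on two variables
# (the X and Y scores used in the correlation calculation below).
x = [7, 8, 9, 6, 7, 6, 10, 8, 5, 9, 9, 9, 8, 4, 5, 6, 7, 8, 8, 7]
y = [8, 7, 10, 5, 7, 6, 9, 8, 5, 9, 8, 7, 7, 4, 6, 7, 7, 9, 8, 6]

plt.scatter(x, y)              # each mark represents one student's pair of scores
plt.xlabel("Variable X")
plt.ylabel("Variable Y")
plt.title("Scatterplot of Variable X and Variable Y")
plt.show()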
There are different formulas for calculating a Pearson correlation coefficient and we will
illustrate one of the simpler ones. For this illustration we will use the homework assignment
scores we have used before as the X variable and another set of 20 hypothetical scores as the
Y variable. The formula is:

r = [N(ΣXY) − (ΣX)(ΣY)] / √{[N(ΣX²) − (ΣX)²][N(ΣY²) − (ΣY)²]}

X      X²      Y      Y²      XY
7      49      8      64      56
8      64      7      49      56
9      81     10     100      90
6      36      5      25      30
7      49      7      49      49
6      36      6      36      36
10    100      9      81      90
8      64      8      64      64
5      25      5      25      25
9      81      9      81      81
9      81      8      64      72
9      81      7      49      63
8      64      7      49      56
4      16      4      16      16
5      25      6      36      30
6      36      7      49      42
7      49      7      49      49
8      64      9      81      72
8      64      8      64      64
7      49      6      36      42
ΣX = 146    ΣX² = 1,114    ΣY = 143    ΣY² = 1,067    ΣXY = 1,083

r = [20(1,083) − (146)(143)] / √{[20(1,114) − (146)²][20(1,067) − (143)²]}
  = (21,660 − 20,878) / √[(22,280 − 21,316)(21,340 − 20,449)]
  = 782 / √[(964)(891)]
  = 782 / [(31.048)(29.849)]
  ≈ 0.84
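The hand calculation above can be checked with a few lines of Python (our own sketch, not from the text). It applies the same raw-score formula to the 20 pairs of scores and also reports the resulting coefficient of determination (r²).

import math

x = [7, 8, 9, 6, 7, 6, 10, 8, 5, 9, 9, 9, 8, 4, 5, 6, 7, 8, 8, 7]
y = [8, 7, 10, 5, 7, 6, 9, 8, 5, 9, 8, 7, 7, 4, 6, 7, 7, 9, 8, 6]
n = len(x)

sum_x, sum_y = sum(x), sum(y)               # 146 and 143
sum_x2 = sum(v ** 2 for v in x)             # 1,114
sum_y2 = sum(v ** 2 for v in y)             # 1,067
sum_xy = sum(a * b for a, b in zip(x, y))   # 1,083

numerator = n * sum_xy - sum_x * sum_y      # 782
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 2))       # 0.84
print(round(r ** 2, 2))  # 0.71, the coefficient of determination for these scores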
Summary
In this chapter we surveyed the basic mathematical concepts and procedures essential to un-
derstanding measurement. We defined measurement as a set of rules for assigning numbers
to represent objects, traits, or other characteristics. Measurement can involve four different
scales—nominal, ordinal, interval, and ratio—that have distinct properties.
Nominal scale: a qualitative system for categorizing people or objects into catego-
ries. In nominal scales the categories are not ordered in a meaningful manner and do
not convey quantitative information.
Ordinal scale: a quantitative system that allows you to rank people or objects accord-
ing to the amount of a characteristic possessed. Ordinal scales provide quantitative in-
formation, but they do not ensure that the intervals between the ranks are consistent.
Reynolds (1999) related this historical example of how interpreting a relationship between variables
as indicating causality can lead to an erroneous conclusion. He noted that in the 1800s a physician
realized that a large number of women were dying of “childbed fever” (i.e., puerperal fever) in the
prestigious Vienna General Hospital. Curiously more women died when they gave birth in the hos-
pital than when the birth was at home. Childbed fever was even less common among women who
gave birth in unsanitary conditions on the streets of Vienna. A commission studied this situation and
after careful observation concluded that priests who came to the hospital to administer last rites were
the cause of the increase in childbed fever in the hospital. The priests were present in the hospital,
but were not present if the birth were outside of the hospital. According to the reasoning of the com-
mission, when priests appeared in this ritualistic fashion the women in the hospital were frightened,
and this stress made them more susceptible to childbed fever.
Eventually, experimental research debunked this explanation and identified what was actually
causing the high mortality rate. At that time the doctors who delivered the babies were the same doc-
tors who dissected corpses. The doctors would move from dissecting diseased corpses to delivering
babies without washing their hands or taking other sanitary procedures. When hand washing and other
antiseptic procedures were implemented, the incidence of childbed fever dropped dramatically.
In summary, it was the transmission of disease from corpses to new mothers that caused
childbed fever, not the presence of priests. Although the conclusion of the commission might sound
foolish to us now, if you listen carefully to the popular media you are likely to hear contemporary
“experts” establishing causality based on observed relationships between variables. However, now
you know to be cautious when evaluating this information.
Interval scale: a system that allows you to rank people or objects like an ordinal scale
but with the added advantage of equal scale units. Equal scale units indicate that the
intervals between the units or ranks are the same size.
Ratio scale: a system with all the properties of an interval scale with the added ad-
vantage of a true zero point.
These scales form a hierarchy, and we are able to perform more sophisticated measurements
as we move from nominal to ratio scales.
We next turned our attention to distributions. A distribution is simply a set of scores,
and distributions can be represented in a number of ways, including tables and graphs.
Descriptive statistics have been developed that help us summarize and describe major char-
acteristics of distributions. For example, measures of central tendency are frequently used
to summarize distributions. The major measures of central tendency are
Mean: the simple arithmetic average of a distribution. Formally, the mean is defined
by this equation: Mean = Sum of Scores / Number of Scores.
Median: the score or potential score that divides a distribution in half.
Mode: the most frequently occurring score in a distribution.
The major measures of variability are

Range: the distance between the smallest and largest score in a distribution.
Standard deviation: a popular index of the average distance that scores vary from
the mean.
Variance: another measure of the variability of scores, expressed in squared score units. Less useful when interpreting individual scores, but important as a theoretical concept.

Finally, we introduced the correlation coefficient, a quantitative measure of the relationship between two variables, and we examined scatterplots as a way of visually displaying that relationship.
RECOMMENDED READINGS
Hays, W. (1994). Statistics (5th ed.). New York: Harcourt Brace. This is an excellent advanced statistics text. It covers the information presented in this chapter in greater detail and provides comprehensive coverage of statistics in general.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill. An excellent advanced psychometric text. Chapters 2 and 4 are particularly relevant to students wanting a more detailed discussion of issues introduced in this chapter.

Reynolds, C. R. (1999). Inferring causality from relational data and design: Historical and contemporary lessons for research and clinical practice. The Clinical Neuropsychologist, 13, 386-395. An entertaining and enlightening discussion of the need for caution when inferring causality from relational data.
INTERNET SITES OF INTEREST
PRACTICE ITEMS
1. Calculate the mean, variance, and standard deviation for the following score distributions.
For these exercises, use the formulas listed in Table 2.5 for calculating variance and standard
deviation.
2. Calculate the Pearson correlation coefficient for the following pairs of scores.
CHAPTER 3

The Meaning of Test Scores

CHAPTER HIGHLIGHTS

LEARNING OBJECTIVES
Test scores reflect the performance or ratings of the individuals completing a test. Because
test scores are the keys to interpreting and understanding the examinees’ performance, their
meaning and interpretation are extremely important topics and deserve careful attention. As
you will see, there is a wide assortment of scores available for our use and each format has
its own unique characteristics. The most basic type of score is a raw score.

A raw score is simply the number of items scored or coded in a specific manner such as correct/incorrect, true/false, and so on.

For example, the raw score on a classroom math test might be the number of items the student answered correctly. The calculation of raw scores is usually fairly straightforward, but raw scores are often of limited use to those interpreting the test results; they tend to offer very little useful information. Let's say a student's score on a
classroom math test is 50. Does a raw score of 50 represent poor, average, or superior per-
formance? The answer to this question depends on a number of factors such as how many
items are on the test, how difficult the items are, and the like. For example, if the test con-
tained only 50 items and if the student’s raw score were 50, the student demonstrated perfect
performance. If the test contained 100 items and if the student’s raw score were 50, he or
she answered only half of the items correctly. However, we still do not know what that really
means. If the test contained 100 extremely difficult items and if a raw score of 50 were the
highest score in the class, this would likely reflect very good performance. Because raw
scores in most situations have little interpretative meaning, we need to transform or convert
them into another format to facilitate their interpretation and give them meaning. These
transformed scores, typically referred to as derived scores, standard scores, or scaled scores,
are pivotal in helping us interpret test results. There are a number of different derived scores,
but they all can be classified as either norm-referenced or criterion-referenced. We will
begin our discussion of scores and their interpretation by introducing you to these two dif-
ferent approaches to deriving and interpreting test scores.
Norm-referenced score interpretations compare an examinee's performance to the performance of other people (a reference group). For example, scores on tests of intelligence are norm-referenced. If you
report that an examinee has an IQ of 100, this indicates he or she
scored higher than 50% of the people in the standardization sample.
With norm-referenced score interpretations, the examinee's performance is compared to the performance of other people.

This is a norm-referenced interpretation. The examinee's performance is being compared with that of other test takers. Personality tests are also typically reported as norm-referenced scores. For example, it might be reported that an examinee scored higher than 98% of the
standardization sample on some trait such as extroversion or sensation seeking. With all norm-referenced interpretations, the examinee's performance is compared to that of others. Criterion-referenced interpretations, in contrast, are not relative but absolute (i.e., the examinee's performance is compared to an absolute standard rather than to the performance of other people).

Norm-referenced interpretations are relative whereas criterion-referenced interpretations are absolute.

Norm-referenced score interpretations have many applications, and the majority of published standardized tests produce norm-referenced scores. Nevertheless, criterion-referenced tests also have important
applications, particularly in educational settings. Although people
frequently refer to norm-referenced and criterion-referenced tests, this is not technically
accurate. The terms norm-referenced and criterion-referenced actually refer to the interpre-
tation of test scores. Although it is most common for tests to produce either norm-referenced
or criterion-referenced scores, it is actually possible for a test to produce both norm- and
criterion-referenced scores. We will come back to this topic later. First, we will discuss
norm-referenced and criterion-referenced score interpretations and the types of derived
scores associated with each approach.
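The distinction between the two interpretive approaches can be sketched in code. In the illustrative Python example below (ours; the reference group scores and the 70 percent mastery cutoff are invented for illustration), the same raw score on a 100-item test receives a criterion-referenced interpretation (percentage of items correct, compared to a specified level of performance) and a norm-referenced interpretation (the percentage of a reference group scoring below it).

def criterion_referenced(raw_score, items_on_test, mastery_cutoff=0.70):
    """Compare performance to a specified level of performance (percent correct)."""
    percent_correct = raw_score / items_on_test
    return percent_correct, percent_correct >= mastery_cutoff

def norm_referenced(raw_score, reference_scores):
    """Compare performance to the performance of other people (a reference group)."""
    below = sum(1 for s in reference_scores if s < raw_score)
    return 100 * below / len(reference_scores)   # percent of the group scoring below

# A hypothetical reference group of raw scores on a 100-item test.
reference_group = [38, 45, 48, 52, 55, 60, 63, 70, 75, 80]

print(criterion_referenced(50, 100))         # (0.5, False): 50% correct, below the cutoff
print(norm_referenced(50, reference_group))  # 30.0: higher than 30% of the reference group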
Norm-Referenced Interpretations
Norms and Reference Groups. To understand performance on a psychological or educa-
tional test, it is often useful to compare an examinee’s performance to the performance of some
preselected group of individuals. Raw scores on a test, such as the number correct, take on
special meaning when they are evaluated against the performance of a normative or reference
group. To accomplish this, when using a norm-referenced approach to interpreting test scores,
raw scores on the test are typically converted to derived scores based on information about the
performance of a specific normative or reference group. Probably the most important consid-
eration when making norm-referenced interpretations involves the relevance of the group of
individuals to whom the examinee’s performance is compared. The reference group from which
the norms are derived should be representative of the type of individuals expected to take the
test and should be defined prior to the standardization of the test. When you interpret a student’s
performance on a test or other assessment, you should ask yourself, “Are these norms appropri-
ate for this student?” For example, it would be reasonable to compare a student’s performance
on a test of academic achievement to other students of the same age, grade, and educational
background. However, it would probably not be particularly useful to compare a student’s
performance to younger students who had not been exposed to the same curriculum, or to older students who have received additional instruction, training, or experience. For norm-referenced interpretations to be meaningful, you need to compare the examinee's performance to that of a relevant reference group or sample. Therefore, the first step in developing good normative data is to define clearly the population for whom the test is designed.
Once the appropriate reference population has been defined clearly, a random sample
is selected and tested. The normative group most often used to derive scores is called the standardization sample.

Standardization samples should be representative of the types of individuals expected to take the tests.

Most test publishers and developers select a standardization sample using a procedure known as population proportionate stratified random sampling. This means that samples of people are selected in such a way as to ensure
that the national population as a whole is proportionately repre-
sented on important variables. In the United States, for example, tests are typically stan-
dardized using a sampling plan that stratifies the sample by gender, age, education,
ethnicity, socioeconomic background, region of residence, and community size based on
population statistics provided by the U.S. Census Bureau. If data from the Census Bureau
indicate that 1% of the U.S. population consists of African American males in the middle
range of socioeconomic status residing in urban centers of the southern region, then 1% of
the standardization sample of the test is drawn to meet this same set of characteristics.
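Population proportionate stratified random sampling can be sketched programmatically. In the following illustrative Python example (ours; the strata and percentages are hypothetical and are not actual census figures), a standardization sample of 2,000 participants is allocated so that each stratum appears in the sample in the same proportion as in the population; participants would then be drawn at random within each stratum.

# Hypothetical population proportions for one stratification variable
# (region of residence); real sampling plans stratify on several variables at once.
population_proportions = {
    "Northeast": 0.17,
    "Midwest":   0.21,
    "South":     0.38,
    "West":      0.24,
}

def allocate_sample(total_n, proportions):
    """Participants to draw from each stratum, proportional to the population."""
    return {stratum: round(total_n * p) for stratum, p in proportions.items()}

print(allocate_sample(2_000, population_proportions))
# {'Northeast': 340, 'Midwest': 420, 'South': 760, 'West': 480}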
Once the standardization sample has been selected and tested, tables of derived scores are
developed. These tables are based on the performance of the standardization sample and
are typically referred to as normative tables or “norms.” Because the relevance of the stan-
dardization sample is so important when using norm-referenced tests, it is the responsibil-
ity of test publishers to provide adequate information about the standardization sample.
Additionally, it is the responsibility of every test user to evaluate the adequacy of the
sample and the appropriateness of comparing the examinee’s score to this particular group.
In making this determination, you should consider the following factors:
Research has shown that there were significant increases in IQ during the twentieth century. This
phenomenon has come to be referred to as the “Flynn Effect” after the primary researcher credited
with its discovery, James Flynn. In discussing his research, Flynn (1998) notes:
Massive IQ gains began in the 19th century, possibly as early as the industrial revolution, and have
affected 20 nations, all for whom data exist. No doubt, different nations enjoyed different rates of
gains, but the best data do not provide an estimate of the differences. Different kinds of IQ tests show
different rates of gains: Culture-reduced tests of fluid intelligence show gains of as much as 20 points
per generation (30 years); performance tests show 10—20 points; and verbal tests sometimes show 10
points or below. Tests closest to the content of school-taught subjects, such as arithmetic reasoning,
general information, and vocabulary, show modest or nil gains. More often than not, gains are simi-
lar at all IQ levels. Gains may be age specific, but this has not yet been established and they certainly
persist into adulthood. The fact that gains are fully present in young children means that causal fac-
tors are present in early childhood but not necessarily that they are more potent in young children
than older children or adults. (p. 61)
So what do you think is causing these gains in IQ? When we ask our students, some initially suggest
that these increases in IQ reflect the effects of evolution or changes in the gene pool. However, this
is not really a plausible explanation because it is happening much too fast. Summarizing the current
thinking on this topic, Kamphaus (2001) notes that while there is not total agreement, most inves-
tigators believe it is the result of environmental factors such as better prenatal care and nutrition,
enhanced education, increased test wiseness, urbanization, and higher standards of living.
Consider the importance of this effect in relation to our discussion of the development of test
norms. When we told you that it is important to consider the date of the normative data when
evaluating its adequacy, we were concerned with factors such as the Flynn Effect. Due to the
gradual but consistent increase in IQ, normative data become more demanding as time passes. In
other words, an examinee must obtain a higher raw score (i.e., correctly answer more items) each
time a test is renormed in order for his or her score to remain the same. Kamphaus suggests that as
a rule of thumb, IQ norms increase in difficulty by about 3 points every 10 years (based on a mean
of 100 and a standard deviation of 15). For example, the same performance on IQ tests normed 10
years apart would result in IQs about 3 points apart, with the newer test producing the lower scores.
As a result, he recommends that if the normative data for a test are more than 10 years old one
should be concerned about the accuracy of the norms. This is a reasonable suggestion, and test
publishers are becoming better at providing timely revisions. For example, the Wechsler Intelli-
gence Scale for Children—Revised (WISC-R) was published in 1974, but the next revision, the
WISC-III, was not released until 1991, a 17-year interval. The most current revision, the WISC-IV,
was released in 2003, only 12 years after its predecessor.
has 3,600 participants in the standardization, with a minimum of 150 at each grade
level (i.e., pre-kindergarten through grade 12).
the same conditions and with the same administrative procedures that will be used in actual
practice. Accordingly, when the test is administered in clinical or educational settings, it is
important that the test user follow the administrative procedures precisely. For example, if
you are administering standardized tests you need to make sure that you are reading the
directions verbatim and closely adhering to time limits. It obviously would not be reason-
able to compare your students’ performance on a timed mathematics test to the performance
of a standardization sample that was given either more or less time to complete the items.
(The need to follow standard administration and scoring procedures actually applies to all
standardized tests, both norm-referenced and criterion-referenced.)
Many types of derived scores or units of measurement may be reported in “norms
tables,” and the selection of which derived score to employ can influence the interpretation
of scores. Before starting our discussion of common norm-referenced derived scores, we
need to introduce the concept of a normal distribution.
The Normal Curve. The normal distribution is a special type of distribution that is very
useful when interpreting test scores. Figure 3.1 depicts a normal distribution. The normal
distribution, which is also referred to as the Gaussian or bell-shaped curve, is a distribution
that characterizes many variables that occur in nature (see Special Interest Topic 3.2 for
information on Carl Friedrich Gauss, who is credited with discovering the bell curve). Gray
(1999) indicates that the height of individuals of a given age and gender is an example of a
variable that is distributed normally. He notes that numerous genetic and nutritional factors
influence an individual’s height, and in most cases these various factors average out so that
people of a given age and gender tend to be of approximately the same height. This accounts
for the peak frequency in the normal distribution. In referring to Figure 3.1 you will see that
a large number of scores tend to “pile up” around the middle of the distribution. However,
FIGURE 3.1 A Normal Distribution

SPECIAL INTEREST TOPIC 3.2  Carl Friedrich Gauss and the Normal Curve
Carl Friedrich Gauss (1777-1855) was a noted German mathematician who is generally credited
with being one of the founders of modern mathematics. Born in Brunswick, he turned his scholarly
pursuits toward the field of astronomy around the turn of the nineteenth century. In the course of
tracking star movements and taking other forms of physical survey measurements (at times with
instruments of his own invention), Gauss found to his annoyance that students and colleagues who
were plotting the location of an object at the same time noted it to be in somewhat different places!
He began to plot the frequency of the observed locations systematically and found the observations
to take the shape of a curve. He determined that the best estimate of the true location of the object
was the mean of the observations and that each independent observation contained some degree of
error. These errors formed a curve that was in the shape of a bell. This curve or distribution of error
terms has since been demonstrated to occur with a variety of natural phenomena and indeed has
become so commonplace that it is most often known as the “normal curve” or the normal distribu-
tion. Of course, you may know it as the bell curve as well due to its shape, and mathematicians and
others in the sciences sometimes refer to it as the Gaussian curve after its discoverer and the man
who described many of its characteristics. Interestingly, Gauss was a very prolific scholar and the
Gaussian curve is not the only discovery to bear his name. He did groundbreaking research on
magnetism and the unit of magnetic intensity is called a gauss.
for a relatively small number of individuals a unique combination of factors results in them
being either much shorter or much taller than the average. This accounts for the distribution
trailing off at both the low and high ends.
Although the previous discussion addressed only observable characteristics of the
normal distribution, certain mathematical properties make it particularly useful when in-
terpreting scores. For example, the normal distribution is a symmetrical, unimodal distribution
in which the mean, median, and mode are all equal. Because it is symmetrical, if you divide
the distribution into two halves, they will mirror each other. Probably the most useful
characteristic of the normal distribution is that predictable proportions of scores occur at
specific points in the distribution. Referring to Figure 3.2, you will find a normal distribution
with the mean and standard deviations (σ) marked. Figure 3.2 also indicates percentile
rank (PR), which will be discussed later in this chapter. Because we know that the mean
equals the median in a normal distribution, we know that an individual who scores at the
mean scored better than 50% of the sample of examinees (remember, earlier we defined
the median as the score that divides the distribution in half). Because approximately 34%
of the scores fall between the mean and 1 standard deviation above the mean, an individual
whose score falls 1 standard deviation above the mean performs at a level exceeding ap-
proximately 84% (i.e., 50% + 34%) of the population. A score 2 standard deviations above
the mean will be above 98% of the population. Because the distribution is symmetrical, the
relationship is the same in the inverse below the mean. A score 1 standard deviation below
FIGURE 3.2. Normal Distribution with Mean, Standard Deviations, and Percentages
Source: From L. H. Janda, Psychological Testing: Theory and Applications. Published by Allyn & Bacon,
Boston, MA. Copyright © 1998 by Pearson Education. Reprinted by permission of the publisher.
the mean indicates that the individual exceeds only about 16% (i.e., 50% — 34%) of the
population on the attribute in question. Approximately two-thirds (i.e., 68%) of the popula-
tion will score within 1 standard deviation above and below the mean on a normally dis-
tributed variable.
We have reproduced in Appendix F a table that allows you to determine what propor-
tion of scores are below any given point in a distribution by specifying standard deviation
units. For example, you can use these tables to determine that a score 1.96 SD above the
mean exceeds 97.5% of the scores in the distribution whereas a score 1.96 SD below the
mean exceeds only 2.5% of the scores. Although we do not feel it is necessary for you to
become an expert in using these statistical tables, we do encourage you to examine Figure
3.2 carefully to ensure you have a good grasp of the basic properties of the normal distribu-
tion before proceeding.
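If you would rather compute these proportions directly than look them up in Appendix F, the brief Python sketch below (our own illustration; it assumes the freely available scipy library) reproduces the same values from the cumulative normal distribution.

# Illustrative sketch (not from Appendix F): proportion of scores falling below
# a point expressed in standard deviation (z) units of a normal distribution.
from scipy.stats import norm

for z in (-1.96, -1.0, 0.0, 1.0, 1.96):
    proportion_below = norm.cdf(z)   # area under the normal curve to the left of z
    print(f"A score {z:+.2f} SD from the mean exceeds about {proportion_below:.1%} of scores")

# For example, the loop reports that a score 1.96 SD above the mean exceeds
# about 97.5% of scores, and a score 1.96 SD below the mean only about 2.5%.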
Although many variables of importance in educational settings, such as achievement and
intelligence, are very close to conforming to the normal distribution, not all educational,
psychological, or behavioral variables are normally distributed. For example, aggressive
behavior and psychotic behavior are two variables of interest to psychologists and educators
that are distinctly different from the normal curve in their distributions. Most children are
not aggressive toward
their peers, so on measures of aggression, most children pile up at the left side of the distri-
bution whereas children who are only slightly aggressive may score relatively far to the
right. Likewise, few people ever experience psychotic symptoms such as hearing voices of
people who are not there or seeing things no one else can see. Such variables will each have
their own unique distribution, and even though one can, via statistical manipulation, force
these score distributions into the shape of a normal curve, it is not always desirable to do so.
We will return to this issue later, but at this point it is important to refute the common myth
that all human behaviors or attributes conform to the normal curve; clearly they do not!
In this chapter we provided the following formula for transforming raw scores to z-scores:

z-score = (X_i − X̄) / SD_x

Consider the situation in which the mean of the raw scores (X̄) is 75, the standard deviation of
the raw scores (SD_x) is 10, and the individual's raw score is 90.

z-score = (90 − 75) / 10 = 15 / 10 = 1.5

If you wanted to convert the individual's score to a T-score, you would use the generic
formula:

T-score = 50 + (10 × z-score)
        = 50 + 10 × [(90 − 75) / 10]
        = 50 + (10 × 1.5)
        = 50 + 15
        = 65
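If you like to check such hand calculations, the short Python sketch below (our own illustration, not part of the text's formulas) reproduces the raw-score-to-z-score and z-score-to-T-score conversions shown above.

def z_score(raw_score, mean, sd):
    # Distance of the raw score from the mean, expressed in standard deviation units.
    return (raw_score - mean) / sd

def t_score(raw_score, mean, sd):
    # T-scores rescale z-scores to a mean of 50 and a standard deviation of 10.
    return 50 + 10 * z_score(raw_score, mean, sd)

print(z_score(90, mean=75, sd=10))   # 1.5
print(t_score(90, mean=75, sd=10))   # 65.0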
z-scores have a mean of 0 and a standard deviation of 1. As a result all scores above the mean
will be positive and all scores below the mean will be negative. For example, a z-score of
1.6 is 1.6 standard deviations above the mean (i.e., exceeding 95% of the scores in the dis-
tribution) and a score of —1.6 is 1.6 standard deviations below the mean (i.e., exceeding only
5% of the scores in the distribution). As you see, in addition to negative scores, z-scores
involve decimals. This results in scores that many find difficult to use and interpret. As a
result, few test publishers routinely report z-scores for their tests. However, researchers
commonly use z-scores because scores with a mean of 0 and a standard deviation of 1 make
statistical formulas easier to calculate.
■ T-scores. T-scores have a mean of 50 and a standard deviation of 10. Relative to
z-scores, they have the advantage of avoiding decimals and negative values. For
example, a score of 66 is 1.6 standard deviations above the mean (i.e., exceeding 95% of the
scores in the distribution) and a score of 34 is 1.6 standard deviations below the mean (i.e.,
exceeding only 5% of the scores in the distribution).
■ Wechsler IQs (and many others). The Wechsler intelligence scales use a standard
score format with a mean of 100 and a standard deviation of 15. Like T-scores, the Wechsler
IQ format avoids decimals and negative values. For example, a score of 124 is 1.6 standard
deviations above the mean (i.e., exceeding 95% of the scores in the distribution) and a score
of 76 is 1.6 standard deviations below the mean (i.e., exceeding only 5% of the scores in the
distribution). This format has become very popular, and most aptitude and individually
administered achievement tests report standard scores with a mean of 100 and a standard de-
viation of 15.
■ Stanford-Binet IQs. The Stanford-Binet intelligence scales until recently used a
standard score format with a mean of 100 and a standard deviation of 16. This is similar to
the format adopted by the Wechsler scales, but instead of a standard deviation of 15 there is
a standard deviation of 16 (see Special Interest Topic 3.3 for an explanation). This may ap-
pear to be a negligible difference, but it was enough to preclude direct comparisons between
the scales. With the Stanford-Binet scales, a score of 126 is 1.6 standard deviations above
the mean (i.e., exceeding 95% of the scores in the distribution) and a score of 74 is 1.6
standard deviations below the mean (i.e., exceeding only 5% of the scores in the distribu-
tion). The most recent edition of the Stanford-Binet (the fifth edition) adopted a mean of
100 and a standard deviation of 15 to be consistent with the Wechsler and other popular
standardized tests.
■ CEEB Scores (SAT/GRE). This format was developed by the College Entrance Ex-
amination Board and used with tests including the Scholastic Assessment Test (SAT) and
the Graduate Record Examination (GRE). CEEB scores have a mean of 500 and a standard
deviation of 100. With this format, a score of 660 is 1.6 standard deviations above the mean
(i.e., exceeding 95% of the scores in the distribution) and a score of 340 is 1.6 standard
deviations below the mean (i.e., exceeding only 5% of the scores in the distribution).
As we noted, standard scores can be set to any desired mean and standard deviation, with
the fancy of the test author frequently being the sole determining factor. Fortunately, the few
standard score formats we just summarized will account for the majority of standardized
tests in education and psychology. Figure 3.3 and Table 3.2 illustrate the relationship
When Alfred Binet and Theodore Simon developed the first popular IQ test in the early 1900s, items
were scored according to the age at which half the children got the answer correct. This resulted in
the concept of a “mental age” for each examinee. This concept of a mental age (MA) gradually
progressed to the development of the IQ, which at first was calculated as the ratio of the child’s MA
to actual or chronological age multiplied by 100 to remove all decimals. The original form for this
score, known as the Ratio IQ, was:
Ratio IQ = (MA / CA) × 100
where MA = mental age and CA = chronological age. For example, a 10-year-old with a
mental age of 12 would earn a Ratio IQ of (12 / 10) × 100 = 120.
This score distribution has a mean fixed at 100 at every age. However, due to the different restric-
tions on the range of mental age possible at each chronological age (e.g., a 2-year-old can range in
MA only 2 years below CA but a 10-year-old can range 10 years below the CA), the standard de-
viation of the distribution of the Ratio IQ changes at every CA! At younger ages it tends to be small
and it is typically larger at upper ages. The differences are quite large, often with the standard de-
viation from large samples varying from 10 to 30! Thus, at one age a Ratio IQ of 110 is 1 standard
deviation above the mean, whereas at another age the same Ratio IQ of 110 is only 0.33 standard
deviation above the mean. Across age, the average standard deviation of the now archaic Ratio IQ
is about 16. This value was then adopted as the standard deviation for the Stanford-Binet IQ tests
and continued until David Wechsler scaled his first IQ measure in the 1930s to have a standard
deviation of 15, which he felt would be easier to work with. Additionally, he selected a standard
deviation of 15 to help distinguish his test from the then dominant Stanford-Binet test. The Stan-
ford-Binet tests have long abandoned the Ratio IQ in favor of a true standard score, but remained
tethered to the standard deviation of 16 until Stanford-Binet’s fifth edition was published in 2003.
With the fifth edition the Stanford-Binet's new primary author, Gale Roid, converted to the far more
popular scale with a mean of 100 and a standard deviation of 15.
between various standard score formats. If reference groups are comparable, Table 3.2 can
also be used to help you equate scores across tests to aid in the comparison of a student’s
performance on tests of different attributes using different standard scores. Table 3.3 illus-
trates a simple formula that allows you to convert standard scores from one format to an-
other (e.g., z-scores to T-scores).
It is important to recognize that not all authors, educators, or clinicians are specific
when it comes to reporting or describing scores. That is, they may report "standard scores,"
but not specify exactly what standard score format they are using. Obviously the format is
extremely important. Consider a standard score of 70. If this is a T-score it represents a
score 2 standard deviations above the mean (exceeding approximately 98% of the scores
in the distribution). If it is a Wechsler IQ (or comparable score) it is 2 standard deviations
(Figure 3.3 shows z-scores, T-scores, Deviation IQs (SD = 15, ranging from 55 to 145), and percentile ranks plotted beneath the normal curve, with approximately 34.13%, 13.59%, and 0.13% of cases falling in successive standard deviation bands.)
FIGURE 3.3 Normal Distribution Illustrating the Relationship among Standard Scores
Source: From L. H. Janda, Psychological Testing: Theory and Applications. Published by Allyn & Bacon. Copy-
right © 1998 by Pearson Education. Reprinted by permission of the publisher.
below the mean (exceeding only approximately 2% of the scores in the distribution). In
other words, be sure to know what standard score format is being used so you will be able
to interpret the scores accurately.
Normalized Standard Scores. Discussion about standard scores thus far applies primarily
to scores from distributions that are normal (or that at least approximate normality) and were
computed using a linear transformation. As noted earlier, although it is commonly held that
psychological and educational variables are normally distributed, this is not always the case.
Many variables such as intelligence, memory skills, and academic achievement will closely
approximate the normal distribution when well measured. However, many variables of inter-
est in psychology and education, especially behavioral ones (e.g., aggression, attention, and
hyperactivity), may deviate substantially from the normal distribution. As a result it is not
unusual for test developers to end up with distributions that deviate from normality enough
to cause concern. In these situations test developers may elect to develop normalized standard
You can easily convert standard scores from one format to another using the following formula:

New standard score = New mean + New SD × [(Old score − Old mean) / Old SD]

For example, consider the situation in which you want to convert a z-score of 1.0 to a
T-score. The calculations are:

T-score = 50 + 10 × [(1.0 − 0) / 1]
        = 50 + (10 × 1)
        = 50 + 10
        = 60

If you want to convert a T-score of 60 to a CEEB score, the calculations are:

CEEB score = 500 + 100 × [(60 − 50) / 10]
           = 500 + (100 × 1)
           = 500 + 100
           = 600
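The same conversions are easy to script. The Python sketch below (our own illustration, not part of Table 3.3) implements the generic conversion formula and reproduces both examples.

def convert_standard_score(score, old_mean, old_sd, new_mean, new_sd):
    # Re-express a standard score on a scale with a new mean and standard deviation.
    z = (score - old_mean) / old_sd
    return new_mean + new_sd * z

print(convert_standard_score(1.0, 0, 1, 50, 10))     # z-score of 1.0 -> T-score of 60.0
print(convert_standard_score(60, 50, 10, 500, 100))  # T-score of 60  -> CEEB score of 600.0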
the usefulness and interpretability of the scores. Nevertheless, it is desirable to know what
type of scores you are working with and how they were calculated.
In most situations, normalized standard scores are interpreted in a manner similar to
other standard scores. In fact, they often look strikingly similar to standard scores. For ex-
ample, they may be reported as normalized z-scores or normalized T-scores and often re-
ported without the prefix normalized at all. In this context, they will have the same mean
and standard deviation as their counterparts derived with linear transformations. However,
several types of scores that have traditionally been based on nonlinear transformations are
normalized standard scores. These include:
■ Stanine scores. Stanine (i.e., standard nine) scores divide the distribution into nine
bands (1 through 9). Stanine scores have a mean of 5 and a standard deviation of 2.
Because stanine scores use only nine values to represent the full range of scores, they
are not a particularly precise score format. As a result, some professionals avoid their
use. However, certain professionals prefer them because of their imprecision. These
professionals, concerned with the imprecision inherent in all psychological and edu-
cational measurement, choose stanine scores because they do not misrepresent the
precision of measurement (e.g., Popham, 2000). Special Interest Topic 3.4 briefly
describes the history of stanine scores.
■ Wechsler scaled scores. The subtests of the Wechsler Intelligence Scale for Children—
Fourth Edition (WISC-IV; Wechsler, 2003) and predecessors are reported as normal-
ized standard scores referred to as scaled scores. The Wechsler scaled scores have a
mean of 10 and a standard deviation of 3. This transformation was performed so the
subtest scores would be comparable, even though their underlying distributions may
have deviated from the normal curve and each other.
■ Normal Curve Equivalent (NCE). The normal curve equivalent (NCE) is a normal-
ized standard score with a mean of 50 and a standard deviation of 21.06. NCEs are
not usually used for evaluating individuals, but are primarily used to assess the prog-
SPECIAL INTEREST TOPIC 3.4  The History of Stanine Scores
Stanines have a mean of 5 and a standard deviation of 2. Stanines have a range of 1 to 9 and are a
form of standard score. Because they are standardized and have nine possible values, the contrived,
contracted name of stanines was given to these scores (standard nine). A stanine is a conversion of
the percentile rank that represents a wide range of percentile ranks at each score point. The U.S. Air
Force developed this system during World War II because a simple score system was needed that
could represent scores as a single digit. On older computers, which used cards with holes punched
in them for entering data, the use of stanine scores not only saved time by having only one digit to
punch but also increased the speed of the computations made by computers and conserved com-
puter memory. Stanines are now used only occasionally and usually only in statistical reporting of
aggregated scores (from Reynolds, 2002).
ress of groups (e.g., The Psychological Corporation, 2002). Because school districts
must report NCE scores to meet criteria as part of certain federal education programs,
many test publishers report these scores for tests used in education.
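To make the relationships among these normalized score formats concrete, the Python sketch below (our own illustration; it assumes the scipy library and uses one common approximation for stanines) converts a percentile rank to a z-score on the normal curve and then re-expresses it in each metric.

from scipy.stats import norm

def normalized_scores(percentile_rank):
    # Convert a percentile rank (strictly between 0 and 100) to a z-score on the
    # normal curve, then re-express it in several normalized score formats.
    z = norm.ppf(percentile_rank / 100)
    scaled = 10 + 3 * z                          # Wechsler scaled score (mean 10, SD 3)
    nce = 50 + 21.06 * z                         # Normal Curve Equivalent (mean 50, SD 21.06)
    stanine = min(9, max(1, round(5 + 2 * z)))   # stanine (mean 5, SD 2), limited to 1 through 9
    return round(z, 2), round(scaled, 1), round(nce, 1), stanine

print(normalized_scores(84))   # approximately (0.99, 13.0, 70.9, 7)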
Percentile Rank. One of the most popular and easily understood ways to interpret and
report a test score is the percentile rank. Like all norm-referenced scores, the percentile
rank simply reflects an examinee's performance relative to a specific group. Although there
are some subtle differences in the ways percentile ranks are calculated and interpreted, the
typical way of interpreting them is as reflecting the percentage of individuals scoring below
a given point in a distribution. For example, a percentile rank of 80 indicates that 80% of the indi-
viduals in the standardization sample scored below this score. A per-
centile rank of 20 indicates that only 20% of the individuals in the standardization sample
scored below this score. Percentile ranks range from 1 to 99, and a rank of 50 indicates the
median performance (in a perfectly normal distribution it is also the mean score). As you can
see, percentile ranks can be easily explained to and understood by individuals without formal
training in psychometrics. Whereas standard scores might seem somewhat confusing, a per-
centile rank might be more understandable. For example, a parent might believe an IQ of 75
is in the average range, generalizing from experiences with classroom tests whereby 70 to 80
is often interpreted as representing average or perhaps “C-level” performance. However, ex-
plaining that the child’s score exceeded only approximately 5% of the standardization sample
or scores of other children at the same age level might clarify the issue. One common misun-
derstanding may arise when using percentile ranks: It is important to ensure that results in
terms of percentile rank are not misinterpreted as “percent correct” (Kamphaus, 1993). That
is, a percentile rank of 60 means that the examinee scored better than 60% of the standardiza-
tion sample, not that the examinee correctly answered 60% of the items.
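Conceptually, the calculation behind a percentile rank is simple. The sketch below (our own illustration, using a small hypothetical reference group; note that some definitions also credit half of any tied scores) shows the basic idea.

def percentile_rank(score, reference_scores):
    # Percentage of scores in the reference group that fall below the given score.
    number_below = sum(1 for s in reference_scores if s < score)
    return 100 * number_below / len(reference_scores)

reference_group = [55, 60, 62, 65, 70, 71, 75, 78, 82, 90]   # hypothetical norm group
print(percentile_rank(75, reference_group))   # 60.0, i.e., the score exceeds 60% of the group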
Although percentile ranks can be easily interpreted, they do not represent interval
level measurement. That is, percentile ranks are not equal across all parts of a distribution.
Percentile ranks are compressed near the middle of the distribution, where there are large
numbers of scores, and spread out near the tails, where there are relatively few scores (you
can see this in Figure 3.3 by examining the line that depicts percentiles). This implies that
small differences in percentile ranks near the middle of the distribution might be of little
importance, whereas the same difference at the extremes might be substantial. However,
because the pattern of inequality is predictable, this can be taken into consideration when
interpreting scores and it is not particularly problematic.
There are two formats based on percentile ranks that you might come across in edu-
cational settings. Some publishers report quartile scores that divide the distribution of per-
centile ranks into four equal units. The lower 25% receives a quartile score of 1, 26% to 50%
a quartile score of 2, 51% to 75% a quartile score of 3, and the upper 25% a quartile score
of 4. Similarly, some publishers report decile-based scores, which divide the distribution of
percentile ranks into ten equal parts. The lowest decile-based score is 1 and corresponds to
scores with percentile ranks between 0% and 10%. The highest decile-based score is 10 and
corresponds to scores with percentile ranks between 90% and 100% (e.g., The Psychologi-
cal Corporation, 2002).
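The mapping from percentile ranks to quartile and decile-based scores is a simple banding of the percentile scale, as the sketch below illustrates (our own illustration; exact boundary handling varies somewhat across publishers).

import math

def quartile_score(percentile_rank):
    # Percentile ranks 1-25 -> 1, 26-50 -> 2, 51-75 -> 3, 76-99 -> 4.
    if percentile_rank <= 25:
        return 1
    if percentile_rank <= 50:
        return 2
    if percentile_rank <= 75:
        return 3
    return 4

def decile_score(percentile_rank):
    # Ten bands of the percentile scale; the lowest band is 1 and the highest is 10.
    return min(10, max(1, math.ceil(percentile_rank / 10)))

print(quartile_score(62))   # 3
print(decile_score(62))     # 7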
■ The use of interpolation to calculate intermediate grade equivalents assumes that aca-
demic skills are achieved at a constant rate and that there is no gain or loss during the sum-
mer vacation. This tenuous assumption is probably not accurate in many situations.
■ Grade equivalents are not comparable across tests or even subtests of the same battery
of tests. For example, grade equivalents of 6.0 on a test of reading comprehension and a test
of math calculation do not indicate that the examinee has the same level of proficiency in
the two academic areas. Additionally, there can be substantial differences between the ex-
aminee’s percentile ranks on the two tests.
■ Grade equivalents reflect an ordinal level scale of measurement, not an interval scale.
As discussed in the previous chapter, ordinal level scales do not have equal scale units across
the scale. For example, the difference between grade equivalents of 3.0 and 4.0 is not neces-
sarily the same as the difference between grade equivalents of 5.0 and 6.0. Statistically, one
should not add, subtract, multiply, or divide such scores because their underlying metrics
are different. It is like multiplying feet by meters—you can multiply 3 feet by 3 meters and
get 9, but what does it mean?
■ There is not a predictable relationship between grade equivalents and percentile ranks.
For example, examinees may have a higher grade equivalent on a test of reading comprehen-
sion than of math calculations, but their percentile rank and thus their skill relative to age
peers on the math test may actually be higher.
■ A common misperception is that students should be receiving instruction at the level
suggested by their grade equivalents. Parents may ask, "Johnny is only in the 4th grade but has
a grade equivalent of 6.5 in math. Doesn’t that mean he is ready for 6th-grade math instruc-
tion?” The answer is clearly “No!” Although Johnny correctly answered the same number
of items as an average 6th grader, this does not indicate that he has mastered the necessary
prerequisites to succeed at the 6th-grade level.
■ Unfortunately, grade equivalents tend to become standards of performance. For ex-
ample, lawmakers might decide that all students entering the 6th grade should achieve grade
equivalents of 6.0 or better on a standardized reading test. If you will recall how grade
equivalents are calculated, you will see how ridiculous this is. Because the mean raw score
at each grade level is designated the grade equivalent, 50% of the standardization sample
scored below the grade equivalent. As a result, it would be expected that a large number of
students with average reading skills would enter the 6th grade with grade equivalents below
6.0. It is a law of mathematics that not everyone can score above the average!
the cut score requires correctly answering 85% of the items, all examinees with scores of
84% or below fail and all with 85% and above pass. There is no practical distinction in such
a decision between an examinee answering 85% of the items correctly and one who an-
swered 100% correctly. They both pass! For many educators, mastery testing is viewed as
the preferred way of assessing mastery or proficiency of basic educational skills. For ex-
ample, a teacher can develop a test to assess students’ mastery of multiplication of fractions
or addition with decimals. Likewise, a teacher can develop a test to assess students’ mastery
of spelling words on a 3rd-grade reading list. In both of these situations, the teacher may set
the cut score for designating mastery at 85%, and all students achieving a score of 85% or
higher will be considered to have mastered the relevant knowledge or skills domain. Special
Interest Topic 3.5 provides a brief introduction to the processes many states use to establish
performance standards on their statewide assessments.
Another common criterion-referenced interpretative approach is referred to as “stan-
dards-based interpretations.” Whereas mastery testing typically results in an all-or-none
interpretation (i.e., the student either passes or fails), standards-based interpretations usually
involve three to five performance categories. For example, the results of an achievement test
might be reported as not proficient, partially proficient, proficient, or advanced performance
(e.g., Linn & Gronlund, 2000). An old variant of this approach is the assignment of letter
grades to reflect performance on classroom achievement tests. For example, many teachers
assign letter grades based on the percentage of items correct on a test, which is another type
In developing their statewide assessment programs, most states establish performance standards
that specify acceptable performance. Crocker and Algina (1986) outlined three major approaches
to setting performance standards: (a) holistic, (b) content-based, and (c) performance-based. All of
these methods typically invoke the judgment of experts in the content area of interest. The selection
of experts is a sampling problem—what is the sampling adequacy of the experts selected in relation
to the population of such experts?
In holistic standard setting a panel of experts is convened to examine a test and estimate the
percentage of items that should be answered correctly by a person with minimally proficient knowl-
edge of the content domain of interest. After each judge provides a passing standard estimate the
results are averaged to obtain the final cut score. State assessments, however, use content-based and
performance-based strategies.
Content-based standard setting evaluates tests at the item level. The most popular content-
based approaches are the Angoff and modified Angoff procedures, which involve assembling a
panel of about 15 to 20 judges who review the test items. They work independently and decide how
many of 100 "minimally acceptable," "borderline," or "barely proficient" students would answer
each item correctly. The average for all judges is computed, which then becomes the estimated cut
score for the number correct on the test.
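As a simple illustration of the Angoff logic (using hypothetical ratings, not data from any actual standard-setting study), the sketch below averages judges' item-level probability estimates to arrive at a recommended cut score.

# Each row holds one judge's estimates of the probability that a "minimally
# proficient" examinee would answer each item correctly (hypothetical values).
ratings = [
    [0.6, 0.8, 0.5, 0.9],   # judge 1
    [0.7, 0.7, 0.4, 0.8],   # judge 2
    [0.5, 0.9, 0.6, 0.9],   # judge 3
]

# A judge's passing-score estimate is the sum of his or her item probabilities;
# the recommended cut score is the average of those estimates across judges.
judge_estimates = [sum(judge) for judge in ratings]
cut_score = sum(judge_estimates) / len(judge_estimates)
print(round(cut_score, 2))                        # about 2.77 items correct out of 4
print(round(100 * cut_score / len(ratings[0])))   # roughly 69 percent correct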
The original Angoff method was criticized as being too cognitively complex for judges
(Shepard, Glaser, Linn, & Bohrnstedt, 1993) and was modified by using a "yes—no" procedure, for
which judges are simply asked to indicate whether a borderline proficient student would be able to
answer the item correctly. The number of items recommended by each judge is then averaged
across judges to provide a cut score indicating minimal proficiency. Impara and Plake (1997) con-
cluded that both traditional and modified methods resulted in nearly identical cut scores, and rec-
ommended the yes—no approach based on its simplicity.
The Angoff and modified Angoff procedures offer a more comprehensive approach to standard
setting. Although using a yes—no modification may be less taxing on judges, the few empirical stud-
ies that address this method leave us unconvinced that important information is not lost in the process.
Until more information becomes available regarding the technical merit of this modification, we
recommend the traditional Angoff instructions. Modifying this to include several rounds, in which
judges receive the previous results, has both theoretical and some empirical support, particularly if
the process is moderated by an independent panel referee (Ricker, 2004).
Much of the research and most of the conclusions about content-based procedures have been
predicated on getting groups of judges together physically. Over the last decade, new interactive
procedures based on Internet-oriented real-time activities (e.g., focus groups, Delphi techniques)
have emerged in marketing and other fields that might be employed. These techniques open up new
alternatives for standard setting that have not yet been explored.
The performance-based approach uses the judgment of individuals (e.g., teachers) who are
intimately familiar with a specific group of candidates who are representative of the target popula-
tion (e.g., their own students). The judges are provided with the behavioral criteria, the test, and a
description of “proficient” and “nonproficient” candidates. Judges then identify individuals that,
based on previous classroom assessments, clearly fall into one of the two categories (borderline
candidates are excluded). These individuals are then tested, and separate frequency distributions
are generated for each group. The intersection between the two distributions is then used as the cut
score that differentiates nonproficient from proficient examinees.
(continued)
A variation used in at least 31 states in the last few years is the bookmark procedure, some-
times referred to as an item mapping strategy, which involves first testing examinees. Using Item
Response Theory (IRT), test items (which may be selected-response or open-ended) are mapped
onto a common scale. Item difficulties are then computed, and a booklet is developed with items
arranged from easiest to hardest. Typically, 15 to 20 judges are provided with item content specifi-
cations, content standards, scoring rubrics for open-ended items, and other relevant materials, such
as initial descriptors for performance levels. The judges are then separated into smaller groups and
examine the test and the ordered item booklet. Starting with the easiest items, each judge examines
each item and determines whether a student meeting a particular performance level would be ex-
pected to answer the item correctly. This procedure continues until the judge reaches the first item
that is perceived as unlikely to be answered correctly by a student at the prescribed level. A “book-
mark” is then placed next to this item. After the first round, participants in the small group are al-
lowed to discuss their bookmark ratings for each group and may modify initial ratings based on
group discussion. In a final round, the small group findings are discussed by all judges, and addi-
tional information, such as the impact of the various cut scores on the percent of students that would
be classified in each category, is considered. A final determination is reached either through con-
sensus or by taking the median ratings between groups. The last step involves revising the initial
performance descriptors based on information provided by panelists (and the impact data) during
the iterative rounds. The disadvantages of this method, however, such as difficulties in assembling
qualified judges and the lack of an inter-judge agreement process, raise questions about its
validity.
There are serious concerns with current procedures. First, numerous studies have provided
evidence that different standard setting methods often yield markedly different results. Moreover,
using the same method, different panels of judges arrive at very different conclusions regarding the
appropriate cut score (Crocker & Algina, 1986; Jaeger, 1991). Ideally, multiple methods would be
used to determine the stability of scores generated by different methods and different judges. Un-
fortunately, this is not typically an economic or logistic reality. Rudner (2001) and Glass (1978)
have noted the “arbitrary” nature of the standard setting process. Glass, in particular, has con-
demned the notion of cut scores as a “common expression of wishful thinking” (p. 237). A second
concern related to standard setting involves the classification of minimally proficient, competent,
acceptable, or master examinees. Glass (1978) has noted that valid external criteria for assessing
the legitimacy of such distinctions in the context of subject-matter areas are virtually nonexistent.
They may hold only in rare instances, such as when a person who types zero words per minute with
zero accuracy (a complete absence of skill) may be deemed incapable of working as a typist. But,
would one dare suggest that a minimal amount of reading skill is necessary to be a competent
parent—and if so, what would this level be? Glass concludes, “The attempt to base criterion scores
on a concept of minimal competence fails for two reasons: (1) it has virtually no foundation in
psychology; (2) when its arbitrariness is granted but judges attempt nonetheless to specify minimal
competence, they disagree wildly” (p. 251).
other students scored. If all of the students in the class correctly answered 90% or more of
the items, they would all receive As on the test.
As noted previously, with norm-referenced interpretations the most important consid-
eration is the relevance of the group that the examinee’s performance is compared to. How-
ever, with criterion-referenced interpretations, there is no comparison group, and the most
important consideration is how clearly the knowledge or skill domain
being assessed is specified or defined (e.g., Popham, 2000). For criterion-referenced
interpretations to provide useful information about what students know or what skills they
possess, it is important that the knowledge or skill domain assessed by the test be clearly
defined. To facilitate this, it is common for tests specifically designed to
produce criterion-referenced interpretations to assess more limited or
narrowly focused content domains than those designed to produce
norm-referenced interpretations. For example, a test designed to produce norm-referenced
interpretations might be developed to assess broad achievement in mathematics (e.g., ranging
from simple number recognition to advanced algebraic computations). In contrast, a math test
designed to produce criterion-referenced interpretations might be developed to assess the
students’ ability to add fractions. In this situation, the criterion-referenced domain is much
more focused, which allows for more meaningful criterion-based interpretations. For example,
if a student successfully completed 95% of the fractional addition problems, you would have
a good idea of his or her math skills in this limited, but clearly defined area. In contrast, if a
student scored at the 50th percentile on the norm-referenced broad mathematics achievement
test, you would know that the performance was average for that age. However, you would not
be able to make definitive statements about the specific types of math problems the student is
able to perform. Although criterion-referenced interpretations are most applicable to narrowly
defined domains, they are often applied to broader, less clearly defined domains. For example,
most tests used for licensing professionals such as physicians, lawyers, teachers, or psycholo-
gists involve criterion-referenced interpretations.
Early in this chapter we noted that it is not technically accurate to refer to norm-referenced
tests or criterion-referenced tests. It is the interpretation of performance on a test that is
either norm-referenced or criterion-referenced. As a result, it is possible for a test to
produce both norm-referenced and criterion-referenced interpretations. That being said, for
several reasons it is usually optimal for tests to be designed to produce either
norm-referenced or criterion-referenced scores. Norm-referenced interpretations can be
applied to a larger variety of tests than criterion-referenced interpretations. We have made
the distinction between
maximum performance tests (e.g., aptitude and achievement) and
typical response tests (e.g., interest, attitudes, and behavior). Norm-referenced interpreta-
tions can be applied to both categories, but criterion-referenced interpretations are typically
applied only to maximum performance tests. That is, because criterion-referenced scores
reflect an examinee’s knowledge or skills in a specific domain, it is not logical to apply them
to measures of personality. Even in the broad category of maximum performance tests,
norm-referenced interpretations tend to have broader applications. Consistent with their
focus on well-defined knowledge and skills domains, criterion-referenced interpretations
are most often applied to educational achievement tests or other tests designed to assess
mastery of a clearly defined set of skills and abilities. Constructs such as aptitude and intel-
ligence are typically broader and lend themselves best to norm-referenced interpretations.
Even in the context of achievement testing we have alluded to the fact that tests designed
for norm-referenced interpretations often cover broader knowledge and skill domains than
those designed for criterion-referenced interpretations.
In addition to the breadth or focus of the knowledge or skills domain being assessed,
test developers consider other factors when developing tests intended primarily for either
norm-referenced or criterion-referenced interpretations. For example, because tests de-
signed for criterion-referenced interpretations typically have a narrow focus, they are able
to devote a large number of items to measuring each objective or skill. In contrast, because
tests designed for norm-referenced interpretations typically have a broader focus they may
devote only a few items to measuring each objective or skill. When developing tests in-
tended for norm-referenced interpretations, test developers will typically select items of
average difficulty and eliminate extremely difficult or easy items. When developing tests
intended for criterion-referenced interpretations, test developers
match the difficulty of the items to the difficulty of the knowledge or skills domain being
assessed.
Although our discussion to this point has emphasized differences between
norm-referenced and criterion-referenced interpreta-
tions, they are not mutually exclusive. Tests can be developed that provide both
norm-referenced and criterion-referenced interpretations. Both interpretative approaches
have positive characteristics and provide useful information (see Table 3.4). Whereas
norm-referenced interpretations provide important information about how an examinee
performed relative to a specified reference group, criterion-referenced interpretations pro-
vide important information about how well an examinee has mastered a specified knowl-
edge or skills domain. It is possible, and sometimes desirable, for a test to produce both
norm-referenced and criterion-referenced scores. For example, it would be possible to
interpret a student’s test performance as “by correctly answering 75% of the multiplication
problems, the student scored better than 60% of the students in the class.” Although the
development of a test to provide both norm-referenced and criterion-referenced scores may
require some compromises, the increased interpretative versatility may justify these com-
promises (e.g., Linn & Gronlund, 2000). As a result, some test publishers are beginning to
produce more tests that provide both interpretative formats. Nevertheless, most tests are
designed for either norm-referenced or criterion-referenced interpretations. Although the
majority of published standardized tests are designed to produce norm-referenced interpre-
tations, tests producing criterion-referenced interpretations play an extremely important
role in educational and other settings.
IQ Classification
A similar approach is often used with typical response assessments. For example, the
Behavior Assessment System for Children (BASC; Reynolds & Kamphaus, 1998) provides
the following descriptions of the clinical scales such as the depression or anxiety scales:
Summary
This chapter provided an overview of different types of test scores and their meanings. We
started by noting that raw scores, while easy to calculate, usually provide little useful infor-
mation about an examinee’s performance on a test. As a result, we usually transform raw
scores into derived scores. The many different types of derived scores can be classified as
either norm-referenced or criterion-referenced. Norm-referenced score interpretations com-
pare an examinee’s performance on a test to the performance of other people, typically the
standardization sample. When making norm-referenced interpretations, it is important to
evaluate the adequacy of the standardization sample. This involves determining if the stan-
dardization is representative of the examinees the test will be used with, if the sample is
current, and if the sample is of adequate size to produce stable statistics.
When making norm-referenced interpretations it is useful to have a basic understand-
ing of the normal distribution (also referred to as the bell-shaped curve). The normal distri-
bution is a distribution that characterizes many naturally occurring variables and has several
characteristics that psychometricians find very useful. The most useful of these character-
istics is that predictable proportions of scores occur at specific points in the distribution. For
example, if you know that an individual’s score is one standard deviation above the mean
on a normally distributed variable, you know that the individual’s score exceeds approxi-
mately 84% of the scores in the standardization sample. This predictable distribution of
scores facilitates the interpretation and reporting of test scores.
Standard scores are norm-referenced derived scores that have a predetermined mean
and standard deviation. A variety of standard scores is commonly used today, including
deviation above the mean. You know that approximately 84% of the scores in a normal
distribution are below 1 standard deviation above the mean. Therefore, the examinee’s score
exceeded approximately 84% of the scores in the reference group.
When scores are not normally distributed (i.e., do not take the form of a normal dis-
tribution), test publishers often use normalized standard scores. These normalized scores
often look just like regular standard scores, but they are computed in a different manner.
Nevertheless, they are interpreted in a similar manner. For example, if a test publisher re-
ports normalized T-scores, they will have a mean of 50 and standard deviation of 10, just
like regular T-scores. There are some unique normalized standard scores, including:
Another common type of norm-referenced score is percentile rank. This popular for-
mat is one of the most easily understood norm-referenced derived scores. Like all norm-
referenced scores, the percentile rank reflects an examinee’s performance relative to a
specific reference group. However, instead of using a scale with a specific mean and stan-
dard deviation, the percentile rank simply specifies the percentage of individuals scoring
below a given point in a distribution. For example, a percentile rank of 80 indicates that 80%
of the individuals in the reference group scored below this score. Percentile ranks have the
advantage of being easily explained to and understood by individuals without formal train-
ing in psychometrics.
The final norm-referenced derived scores we discussed were grade and age equiva-
lents. For numerous reasons, we recommend that you avoid using these scores. If you are
required to report them, also report standard scores and percentile ranks and emphasize
these when interpreting the results.
In contrast to norm-referenced scores, criterion-referenced scores compare an exam-
inee’s performance to a specified level of performance referred to as a criterion. Probably
the most common criterion-referenced score is the percent correct score routinely reported
on classroom achievement tests. For example, if you report that a student correctly an-
swered 80% of the items on a spelling test, this is a criterion-referenced interpretation.
Another type of criterion-referenced interpretation is mastery testing. On a mastery test
you determine whether examinees have achieved a specified level of mastery on the knowl-
edge or skill domain. Here, performance is typically reported as either pass or fail. If ex-
aminees score above the cut score they pass; if they score below the cut score they fail.
Another criterion-referenced interpretation is referred to as standards-based interpreta-
tions. Instead of reporting performance as simply pass/fail, standards-based interpretations
typically involve three to five performance categories.
With criterion-referenced interpretations, a prominent consideration is how clearly the
knowledge or domain is defined. For useful criterion-referenced interpretations, the knowledge
or skill domain being assessed must be clearly defined. To facilitate this, criterion-referenced
interpretations are typically applied to tests that measure focused or narrow domains. For
example, a math test designed to produce criterion-referenced scores might be limited to the
addition of fractions. This way, if a student correctly answers 95% of the fraction problems,
you will have useful information regarding the student’s proficiency with this specific type
of math problem. You are not able to make inferences about a student’s proficiency in other
areas of math, but you will know if this specific type of math problem was mastered. If the
math test contained a wide variety of math problems (as is common with norm-referenced
tests), it would be more difficult to specify exactly in which areas a student is proficient.
We closed the chapter by noting that the terms norm-referenced and criterion-referenced
refer to the interpretation of test performance, not the test itself. Although it is often optimal
to develop a test to produce either norm-referenced or criterion-referenced scores, it is possible
and sometimes desirable for a test to produce both norm-referenced and criterion-referenced
scores. This may require some compromises when developing the test, but the increased flex-
ibility may justify these compromises. Nevertheless, most tests are designed for either norm-
referenced or criterion-referenced interpretations, and most published standardized tests
produce norm-referenced interpretations. That being said, tests that produce criterion-referenced
interpretations have many important applications, particularly in educational settings.
RECOMMENDED READINGS
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: AERA. For the technically minded, Chapter 4, Scales, Norms, and Score Comparability, is must reading!

Lyman, H. B. (1998). Test scores and what they mean. Boston: Allyn & Bacon. This text provides a comprehensive and very readable discussion of test scores. An excellent resource!
www.teachersandfamilies.com/open/parent/scores1.cfm
Understanding Test Scores: A Primer for Parents is a user-friendly discussion of tests that is accurate and readable. Another good resource for parents.

http://childparenting.miningco.com/cs/learningproblems/a/wisciii.htm
This Parents' Guide to Understanding the IQ Test Scores contains a good discussion of the use of intelligence tests in schools and how they help in assessing learning disabilities. A good resource for parents.
PRACTICE ITEMS
1. Transform the following raw scores to the specified standard score formats. The raw score
   distribution has a mean of 70 and a standard deviation of 10.
   a. Raw score = 85    z-score = ______    T-score = ______
   b. Raw score = 60    z-score = ______    T-score = ______
   c. Raw score = 55    z-score = ______    T-score = ______
   d. Raw score = 95    z-score = ______    T-score = ______
   e. Raw score = 75    z-score = ______    T-score = ______
It is the user who must take responsibility for determining whether or not scores
are sufficiently trustworthy to justify anticipated uses and interpretations.
—AERA et al., 1999, p. 31
CHAPTER HIGHLIGHTS
LEARNING OBJECTIVES
Errors of Measurement
Some degree of error is inherent in all measurement. Although measurement error has
largely been studied in the context of psychological and educational tests, measurement
error clearly is not unique to
this context. In fact, as Nunnally and Bernstein (1994) point out, mea-
surement in other scientific disciplines has as much, if not more, error
than that in psychology and education. They give the example of physiological blood pressure
measurement, which is considerably less reliable than many educational tests. Even in situa-
tions in which we generally believe measurement is exact, some error is present. If we asked
a dozen people to time a 440-yard race using the same brand of stopwatch, it is extremely
unlikely that they would all report precisely the same time. If we had a dozen people and a
measuring tape graduated in millimeters and required each person to measure independently
the length of a 100-foot strip of land, it is unlikely all of them would report the same answer
to the nearest millimeter. In the physical sciences the introduction of more technologically
sophisticated measurement devices has reduced, but not eliminated, measurement error.
Different theories or models have been developed to address measurement issues, but
possibly the most influential is classical test theory (also called true score theory). Accord-
ing to this theory, every score on a test is composed of two components: the true score (i.e.,
the score that would be obtained if there were no errors, if the score were perfectly reliable)
and the error score: Obtained Score = True Score + Error. This can be represented in a very
simple equation:
Xi = T + E
Here we use Xi to represent the observed or obtained score of an individual; that is, Xi is the
score the test taker received on the test. The symbol T is used to represent an individual’s
true score and reflects the test taker’s true skills, abilities, knowledge, attitudes, or whatever
the test measures, assuming an absence of measurement error. Finally, E represents mea-
surement error.
Measurement error reduces the usefulness of measurement. It limits the extent to
which test results can be generalized and reduces the confidence we have in test results
(AERA et al., 1999). Practically speaking, when we administer a
test we are interested in knowing the test taker's true score. Due to the presence of measurement error we can never know with absolute confidence what the true score is. However, if we have information about the reliability of measurement, we can establish intervals around an obtained score and calculate the probability that the true
score will fall within the interval specified. We will come back to
this with a more detailed explanation when we discuss the standard error of measurement
later in this chapter. First, we will elaborate on the major sources of measurement error. It
should be noted that we will limit our discussion to random measurement error. Some writ-
ers distinguish between random and systematic errors. Systematic error is much harder to
detect and requires special statistical methods that are generally beyond the scope of this
text; however, some special cases of systematic error are discussed in Chapter 16. (Special
Interest Topic 4.1 provides a brief introduction to Generalizability Theory, an extension of
classical reliability theory.)
SPECIAL INTEREST TOPIC 4.1
Generalizability Theory
Lee Cronbach and colleagues developed an extension of classical reliability theory known as “gen-
eralizability theory” in the 1960s and 1970s. Cronbach was instrumental in the development of the
general theory of reliability discussed in this chapter during and after World War II. The basic focus
of generalizability theory is to examine various conditions that might affect the reliability of a test
score. In classical reliability theory there are only two sources for variation in an observed test score:
true score and random error. Suppose, however, that for different groups of people the scores reflect
different things. For example, boys and girls might respond differently to career interest items. When
the items for a particular career area are then grouped into a scale, the reliability of the scale might be
quite different for boys and girls as a result. This gender effect becomes a limitation on the generaliz-
results are not. A number of factors may introduce error into test scores and even though all
cannot be assigned to distinct categories, it may be helpful to group these sources in some
manner and to discuss their relative contributions. The types of errors that are our greatest
concern are errors due to content sampling and time sampling.
Content Sampling Error. Tests rarely, if ever, include every possible question or evalu-
ate every possible relevant behavior. Let’s revisit the example we introduced at the begin-
ning of this chapter. A teacher administers a math test designed to assess skill in multiplying
two-digit numbers. We noted that there are literally thousands of two-digit multiplication
problems. Obviously it would be impossible for the teacher to develop and administer a
test that includes all possible items. Instead, a universe or domain of test items is defined
based on the content of the material to be covered. From this domain a sample of test ques-
tions is taken. In this example, the teacher decided to select 25 items to measure students’
ability. These 25 items are simply a sample and, as with any sampling procedure, may not
be representative of the domain from which they are drawn.
Time Sampling Error. Measurement error also can be introduced by one's choice of a particular time to administer the test. Measurement error due to time sampling reflects random fluctuations in performance from one situation to another and limits our ability to generalize test scores across different situations. If Eddie did not have breakfast and the math test was just before lunch, he might be distracted or hurried and not perform as well as if he took the test after lunch. But Michael, who ate too much at lunch and was up a little late last night, was a little sleepy in the afternoon and might not perform as well on an afternoon test as he would have on
the morning test. If during the morning testing session a neighboring
class was making enough noise to be disruptive, the class might have
performed better in the afternoon when the neighboring class was relatively quiet. These
are all examples of situations in which random changes over time in the test taker (e.g.,
fatigue, illness, anxiety) or the testing environment (e.g., distractions, temperature) affect
test scores. Some assessment
experts refer to this type of error as temporal instability. As you might expect, testing experts
have developed methods of estimating error due to time sampling.
Other Sources of Error. Although errors due to content sampling and time sampling ac-
count for the major proportion of random error in testing, administrative and scoring errors
that do not affect all test takers equally will also contribute to the random error observed
in scores. Clerical errors committed while adding up a student’s score or an administrative
error on an individually administered test are common examples. When the scoring of a test
relies heavily on the subjective judgment of the person grading the test or involves subtle
discriminations, it is important to consider differences in graders, usually referred to as
inter-scorer or inter-rater differences. That is, would the test taker receive the same score
if different individuals graded the test? For example, on an essay test would two different
graders assign the same scores? These are just a few examples of sources of error that do
not fit neatly into the broad categories of content or time sampling errors.
Xi = T + E
As you remember, Xi represents an individual's obtained score, T represents the true score,
and E represents random measurement error. This equation can be extended to incorporate the
concept of variance. This extension indicates that the variance of test scores is the sum of the
true score variance plus the error variance, and is represented in the following equation:
σ²X = σ²T + σ²E

Here, σ²X represents the variance of the total test, σ²T represents true score variance, and σ²E represents the variance due to measurement error.
The general symbol for the reliability of assessment results associated with content or domain sampling is rxx and is referred to as the reliability coefficient. Mathematically, reliability is written

rxx = σ²T / σ²X
This equation defines the reliability of test scores as the proportion of test score variance
due to true score differences. The reliability coefficient is considered to be the summary
mathematical representation of this ratio or proportion.
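A small simulation can make this ratio concrete. The sketch below is our own illustration (not from the text): it generates true scores, adds random error to form two parallel administrations, and shows that the proportion of observed-score variance due to true scores matches the correlation between the parallel forms.

import random

random.seed(1)

# Simulate classical test theory: X = T + E
true_scores = [random.gauss(100, 15) for _ in range(10000)]   # true score variance about 225
form_a = [t + random.gauss(0, 7.5) for t in true_scores]      # error variance about 56.25
form_b = [t + random.gauss(0, 7.5) for t in true_scores]      # parallel form with new errors

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def correlation(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (variance(xs) ** 0.5 * variance(ys) ** 0.5)

# Reliability defined as true-score variance over observed-score variance
print(variance(true_scores) / variance(form_a))   # close to 225 / 281.25 = 0.80
# The correlation between parallel forms estimates the same quantity
print(correlation(form_a, form_b))                # also close to 0.80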
Reliability coefficients can be classified into three broad categories (AERA et al.,
1999). These include (1) coefficients derived from the administration of the same test on
different occasions (i.e., test-retest reliability), (2) coefficients based on the administration
of parallel forms of a test (i.e., alternate-form reliability), and (3) coefficients derived from
a single administration of a test (internal consistency coefficients). A
fourth type, inter-rater reliability, is indicated when scoring involves a significant degree of subjective judgment. The major methods of estimating reliability are summarized in Table 4.1. Each of these approaches produces a reliability coefficient (rxx) that can be inter-
preted in terms of the proportion or percentage of test score variance
attributable to true variance. For example, a reliability coefficient of 0.90 indicates that 90%
of the variance in test scores is attributable to true variance. The remaining 10% reflects
error variance. We will now consider each of these methods of estimating reliability.
TABLE 4.1
Method                          Test Forms    Testing Sessions    Procedure
Alternate forms
  Simultaneous administration   Two forms     One session         Administer two forms of the test to the same group in the same session.
  Delayed administration        Two forms     Two sessions        Administer two forms of the test to the same group at two different sessions.
Coefficient alpha or KR-20      One form      One session         Administer the test to a group one time. Apply appropriate procedures.
Inter-rater                     One form      One session         Administer the test to a group one time. Two or more raters score the test independently.
Test-Retest Reliability
Probably the most obvious way to estimate the reliability of a test is to administer the same
test to the same group of individuals on two different occasions. With this approach the reli-
ability coefficient is obtained by simply calculating the correlation between the scores on
the two administrations. For example, we could administer our 25-item math test one week
after the initial administration and then correlate the scores obtained on the two administra-
tions. This estimate of reliability is referred to as test-retest reliability and is sensitive to
measurement error due to time sampling. It is an index of the stability
of test scores over time. Because many tests are intended to measure fairly stable characteristics, we expect tests of these constructs to produce stable scores. Test-retest reliability reflects the degree to which test scores can be generalized across different situations or
over time.
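In practice the test-retest coefficient is nothing more than the Pearson correlation between the two sets of scores. The sketch below is our own illustration with made-up scores for the 25-item math test; any correlation routine or spreadsheet function would do the same job.

# Hypothetical scores for ten students on the 25-item math test,
# administered one week apart (week1 = first administration, week2 = retest).
week1 = [22, 18, 25, 15, 20, 17, 23, 19, 21, 16]
week2 = [21, 17, 24, 16, 19, 18, 23, 20, 22, 15]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

# The test-retest reliability coefficient is simply this correlation.
print(round(pearson(week1, week2), 2))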
One important consideration when calculating and evaluating test-retest reliability
is the length of the interval between the two test administrations. If the test-retest interval
is very short (e.g., hours or days), the reliability estimate may be artificially inflated by
memory and practice effects from the first administration. If the test interval is longer, the
estimate of reliability may be lowered not only by the instability of the scores but also by
actual changes in the test takers during the extended period. In practice, there is no single
“best” time interval, but the optimal interval is determined by the
way the test results are to be used. For example, intelligence is a construct or characteristic that is thought to be fairly stable, so it would be reasonable to expect stability in intelligence scores over weeks or months. In contrast, an individual's mood (e.g., depressed, elated, nervous) is more subject to transient fluctuations, and stability across weeks or months would not be expected.
In addition to the construct being measured, the way the test
is to be used is an important consideration in determining what is an appropriate test-retest
interval. Because the SAT is used to predict performance in college, it is sensible to expect
stability over relatively long periods of time. In other situations, long-term stability is much
less of an issue. For example, the long-term stability of a classroom achievement test (such
as our math test) is not a major concern because it is expected that the students will be
enhancing existing skills and acquiring new ones due to class instruction and studying. In
summary, when evaluating the stability of test scores, one should consider the length of the
test-retest interval in the context of the characteristics being measured and how the scores
are to be used.
The test-retest approach does have significant limitations, the most prominent being
carryover effects from the first to second testing. Practice and memory effects result in
different amounts of improvement in retest scores for different test
takers. These carryover effects prevent the two administrations from being independent and as a result the reliability coefficients may be artificially inflated. In other instances, repetition of the test may change either the nature of the test or the test taker in some subtle or even obvious way (Ghiselli, Campbell, & Zedeck, 1981). As a result,
only tests that are not appreciably influenced by these carryover effects are suitable for this
method of estimating reliability.
Alternate-Form Reliability
Another approach to estimating reliability involves the development of two equivalent or
parallel forms of the test. The development of these alternate forms requires a detailed test
plan and considerable effort because the tests must truly be parallel in terms of content, dif-
ficulty, and other relevant characteristics. The two forms of the test are then administered
to the same group of individuals and the correlation is calculated between the scores on
the two assessments. In our example of the 25-item math test, the teacher could develop
a parallel test containing 25 new problems involving the multiplication of double digits.
To be parallel the items would need to be presented in the same format and be of the same
level of difficulty. Two fairly common procedures are used to es-
tablish alternate-form reliability. One is alternate-form reliability based on simultaneous administrations and is obtained when the two forms of the test are administered on the same occasion (i.e., back to back). The other, alternate form with delayed administration, is obtained when the two forms of the test are administered on two different occasions. Alternate-form reliability based on simultane-
ous administration is primarily sensitive to measurement error related to content sampling.
Alternate-form reliability with delayed administration is sensitive to measurement error due
to both content sampling and time sampling, although it cannot differentiate the two types of error.
Alternate-form reliability has the advantage of reducing the carryover effects that are
a prominent concern with test-retest reliability. However, although practice and memory
effects may be reduced using the alternate-form approach, they are
often not fully eliminated. Simply exposing test takers to the common format required for parallel tests often results in some carryover effects even if the content of the two tests is different. For example, a test taker given a test measuring nonverbal reasoning abilities may develop strategies during the administration of the first form that alter her approach to the second form, even if the specific content of the items is different. Another limitation of the alternate-form approach
to estimating reliability is that relatively few tests, standardized or
teacher made, have alternate forms. As we suggested, the development of alternate forms
that are actually equivalent is a time-consuming process, and many test developers do not
pursue this option. Nevertheless, at times it is desirable to have more than one form of a test,
and when multiple forms exist, alternate-form reliability is an important consideration.
Internal-Consistency Reliability
Internal-consistency reliability estimates primarily reflect errors related to content sam-
pling. These estimates are based on the relationship between items within a test and are derived from a single administration of the test.
Reliability of Full Test = (2 × 0.74) / (1 + 0.74) = 1.48 / 1.74 = 0.85
The reliability coefficient of 0.85 estimates the reliability of the full test when the odd—even
halves correlated at 0.74. This demonstrates that the uncorrected split-half reliability coef-
ficient presents an underestimate of the reliability of the full test. Table 4.2 provides examples
of half-test coefficients and the corresponding full-test coefficients that were corrected with
the Spearman-Brown formula. By looking at the first row in this table, you will see that a
half-test correlation of 0.50 corresponds to a corrected full-test coefficient of 0.67.
Although the odd—even approach is the most common way to divide a test and will
generally produce equivalent halves, certain situations deserve special attention. For exam-
ple, if you have a test with a relatively small number of items (e.g., <8), it may be desirable
to divide the test into equivalent halves based on a careful review of item characteristics
such as content, format, and difficulty. Another situation that deserves special attention in-
volves groups of items that deal with an integrated problem (this is referred to as a testlet).
For example, if multiple questions refer to a specific diagram or reading passage, that whole
set of questions should be included in the same half of the test. Splitting integrated problems
can artificially inflate the reliability estimate (e.g., Sireci, Thissen, & Wainer, 1991).
An advantage of the split-half approach to reliability is that it can be calculated from a
single administration of a test. Also, because only one testing session is involved, this approach
reflects errors due only to content sampling and is not sensitive to time sampling errors.
TABLE 4.2
Half-Test Correlation    Corrected Full-Test Coefficient
0.50                     0.67
0.55                     0.71
0.60                     0.75
0.65                     0.79
0.70                     0.82
0.75                     0.86
0.80                     0.89
0.85                     0.92
0.90                     0.95
0.95                     0.97
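The corrected values in Table 4.2 follow directly from the Spearman-Brown correction for a test split in half, full-test r = 2r(half) / (1 + r(half)). A minimal sketch (our own code; the function name is arbitrary):

# Spearman-Brown correction: estimate full-test reliability from the
# correlation between two half-tests.
def spearman_brown_full(r_half):
    return (2 * r_half) / (1 + r_half)

# Reproduce Table 4.2
for r_half in [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]:
    print(f"half-test r = {r_half:.2f}  ->  corrected full-test r = {spearman_brown_full(r_half):.2f}")

# The worked example in the text: odd-even halves correlating 0.74
print(round(spearman_brown_full(0.74), 2))   # 0.85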
to measure both multiplication and division. An even more heterogeneous test would be
one that involves multiplication and reading comprehension, two fairly dissimilar content
domains. As discussed later, sensitivity to content heterogeneity can influence a particular
reliability formula’s use on different domains.
While Kuder and Richardson’s formulas and coefficient alpha both reflect item het-
erogeneity and errors due to content sampling, there is an important difference in terms
of application. In their original article Kuder and Richardson (1937) presented numerous
formulas for estimating reliability. The most commonly used formula is known as the
Kuder-Richardson formula 20 (KR-20). KR-20 is applicable when test items are scored
dichotomously, that is, simply right or wrong, as 0 or 1. Coefficient alpha (Cronbach, 1951)
is a more general form of KR-20 that also deals with test items that produce scores with
multiple values (e.g., 0, 1, or 2). Because coefficient alpha is more broadly applicable, it has
become the preferred statistic for estimating internal consistency (Keith & Reynolds, 1990).
Tables 4.3 and 4.4 illustrate the calculation of KR-20 and coefficient alpha, respectively.
Inter-Rater Reliability
If the scoring of a test relies on subjective judgment, it is important to evaluate the degree
of agreement when different individuals score the test. This is referred to as inter-scorer or
inter-rater reliability. Estimating inter-rater reliability is a fairly straightforward process.
The test is administered one time and two individuals independently score each test. A cor-
relation is then calculated between the scores obtained by the two scorers. This estimate
of reliability is not sensitive to error due to content or time sampling, but only reflects dif-
ferences due to the individuals scoring the test. In addition to the correlational approach,
inter-rater agreement can also be evaluated by calculating the per-
centage of times that two individuals assign the same scores to the
performances of students. This approach is illustrated in Special Interest Topic 4.2.
On some tests, inter-rater reliability is of little concern. For example, on a test with multiple-choice or true–false items, grading is fairly straightforward and a conscientious grader should produce reliable and accurate scores. In the case of our 25-item math test, a careful grader should be able to determine whether the students' answers
are accurate and assign a score consistent with that of another careful
grader. However, for some tests inter-rater reliability is a major concern. Classroom essay
tests are a classic example. It is common for students to feel that a different teacher might
have assigned a different score to their essays. It can be argued that the teacher’s personal
biases, preferences, or mood influenced the score, not only the content and quality of the
student’s essay. Even on our 25-item math test, if the teacher required that the students “show
their work” and this influenced the students’ grades, subjective judgment might be involved
and inter-rater reliability could be a concern.
KR-20 is sensitive to measurement error due to content sampling and is also a measure of item
heterogeneity. KR-20 is applicable when test items are scored dichotomously, that is, simply right
or wrong, as 0 or 1. The following formula is used for calculating KR-20:
KR-20 = [k / (k − 1)] × [1 − (Σ pi qi) / SD²]

Consider these data for a five-item test administered to six students. Each item could receive a score of either 1 or 0.

            Item 1    Item 2    Item 3    Item 4    Item 5    Total
Student 1     1         0         1         1         1         4
Student 2     1         1         1         1         1         5
Student 3     1         0         1         0         0         2
Student 4     0         0         0         1         0         1
Student 5     1         1         1         1         1         5
Student 6     1         1         0         1         1         4
pi          0.8333     0.5      0.6667    0.8333    0.6667    SD² = 2.25
qi          0.1667     0.5      0.3333    0.1667    0.3333
pi × qi     0.1389     0.25     0.2222    0.1389    0.2222
Note: When calculating SD, n was used in the denominator.
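The arithmetic above can be verified with a few lines of code. The sketch below is our own (the function name is arbitrary); it computes each item's p and q, the total-score variance with n in the denominator, and then KR-20 for the six students shown.

# Item scores for the six students above (1 = correct, 0 = incorrect).
scores = [
    [1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

def kr20(item_scores):
    n_students = len(item_scores)
    k = len(item_scores[0])                       # number of items
    totals = [sum(row) for row in item_scores]    # each student's total score
    mean_total = sum(totals) / n_students
    sd2 = sum((t - mean_total) ** 2 for t in totals) / n_students   # n in the denominator
    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in item_scores) / n_students      # proportion correct
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / sd2)

print(round(kr20(scores), 2))   # roughly 0.71 for the data above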
Coefficient alpha is sensitive to measurement error due to content sampling and is also a measure of item heterogeneity. It can be applied to tests with items that are scored dichotomously or that have multiple values. The formula for calculating coefficient alpha is:

Coefficient alpha = [k / (k − 1)] × [1 − (Σ SDi²) / SD²]

where k = number of items, SDi² = variance of item i, and SD² = variance of total test scores.

Consider these data for a five-item test that was administered to six students. Each item could receive a score ranging from 1 to 5.
Note: When calculating SDi² and SD², n was used in the denominator.
ing period or semester. Many standardized psychological instruments contain several measures that are combined to form an overall composite score. For example, the Wechsler Adult Intelligence Scale—Third Edition (Wechsler, 1997) is composed of 11 subtests used in the calculation of the Full Scale Intelligence Quotient (FSIQ). Both of these situations involve composite scores obtained by combining
the scores on several different tests or subtests. The advantage of composite scores is that the
reliability of composites is generally greater than that of the individual scores that contribute
to the composite. More precisely, the reliability of a composite is the result of the number of
scores in the composite, the reliability of the individual scores, and the correlation between
those scores. The more scores in the composite, the higher the correlation between those
SPECIAL INTEREST TOPIC 4.2
Performance assessments require test takers to complete a process or produce a product in a context
that closely resembles real-life situations. For example, a student might engage in a debate, compose
a poem, or perform a piece of music. The evaluation of these types of performances is typically
based on scoring rubrics that specify what aspects of the student’s performance should be considered
when providing a score or grade. The scoring of these types of assessments obviously involves the
subjective judgment of the individual scoring the performance, and as a result inter-rater reliability
is a concern. As noted in the text one approach to estimating inter-rater reliability is to calculate the
correlation between the scores that are assigned by two judges. Another approach is to calculate the
percentage of agreement between the judges’ scores.
Consider an example wherein two judges rated poems composed by 25 students. The poems
were scored from 1 to 5 based on criteria specified in a rubric, with 1 being the lowest performance
and 5 being the highest. The results are illustrated in the following table:
                          Ratings of Rater 1
Ratings of Rater 2     1      2      3      4      5
5                      0      0      1      2      4
4                      0      0      2      3      2
3                      0      2      3      1      0
2                      1      1      1      0      0
1                      1      1      0      0      0
Once the data are recorded you can calculate inter-rater agreement with the following formula:

Percent Agreement = (number of ratings in exact agreement / total number of ratings) × 100 = (12 / 25) × 100 = 48%
This degree of inter-rater agreement might appear low to you, but this would actually be re-
spectable for a classroom test. In fact the Pearson correlation between these judges’ ratings is 0.80
(better than many, if not most, performance assessments).
Instead of requiring the judges to assign the exact same score for agreement, some authors
suggest the less rigorous criterion of scores being within one point of each other (e.g., Linn & Gron-
lund, 2000). If this criterion were applied to these data, the modified agreement percent would be
96% because only one of the judges' scores was not within one point of the other (Rater 1 assigned
a 3 and Rater 2 a 5).
We caution you not to expect this high a rate of agreement should you examine the inter-rater
agreement of your own performance assessments. In fact you will learn later that difficulty scoring
performance assessments in a reliable manner is one of the major limitations of these procedures.
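The percentages discussed above can be computed directly from the table of counts. The sketch below is our own code (the variable names are ours); the matrix is keyed to the table above, with rows for Rater 2's ratings from 5 down to 1 and columns for Rater 1's ratings from 1 up to 5.

# Rows: Rater 2's rating from 5 down to 1; columns: Rater 1's rating from 1 to 5.
counts = [
    [0, 0, 1, 2, 4],   # Rater 2 gave a 5
    [0, 0, 2, 3, 2],   # Rater 2 gave a 4
    [0, 2, 3, 1, 0],   # Rater 2 gave a 3
    [1, 1, 1, 0, 0],   # Rater 2 gave a 2
    [1, 1, 0, 0, 0],   # Rater 2 gave a 1
]

total = exact = within_one = 0
for row_index, row in enumerate(counts):
    rater2 = 5 - row_index                 # convert row position to the actual rating
    for col_index, n in enumerate(row):
        rater1 = col_index + 1
        total += n
        if rater1 == rater2:
            exact += n
        if abs(rater1 - rater2) <= 1:
            within_one += n

print(100 * exact / total)       # percent exact agreement (48% for these 25 poems)
print(100 * within_one / total)  # percent agreement within one point (96%)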
scores, and the higher the individual reliabilities, the higher the composite reliability. As
we noted, tests are simply samples of the test domain, and combining multiple measures is
analogous to increasing the number of observations or the sample size.
Alternate-form reliability
  Simultaneous administration: content sampling
  Delayed administration: time sampling and content sampling
and KR-20 would probably underestimate reliability. In the situation of a test with hetero-
geneous content (the heterogeneity is intended and not a mistake), the split-half method is
preferred. Because the goal of the split-half approach is to compare two equivalent halves,
it would be necessary to ensure that each half has equal numbers of both multiplication and
division problems.
We have been focusing on tests of achievement when providing examples, but the same
principles apply to other types of tests. For example, a test that measures depressed mood
may assess a fairly homogeneous domain, making the use of coefficient alpha or KR-20 ap-
propriate. However, if the test measures depression, anxiety, anger, and impulsiveness, the
content becomes more heterogeneous and the split-half estimate would be indicated. In this
situation, the split-half approach would allow the construction of two equivalent halves with
equal numbers of items reflecting the different traits or characteristics under investigation.
Naturally, if different forms of a test are available, it would be important to estimate
alternate-form reliability. If a test involves subjective judgment by the person scoring the
test, inter-rater reliability is important. Many contemporary test manuals report multiple
estimates of reliability. Given enough information about reliability, one can partition the
error variance into its components, as demonstrated in Figure 4.1.
FIGURE 4.1 Partitioning error variance (e.g., content sampling, time sampling, and inter-rater differences)
Construct. Some constructs are more difficult to measure than others simply because the
item domain is more difficult to sample adequately. As a general rule, personality variables
are more difficult to measure than academic knowledge. As a result, what might be an ac-
ceptable level of reliability for a measure of “dependency” might be regarded as unaccept-
able for a measure of reading comprehension. In evaluating the acceptability of a reliability
coefficient one should consider the nature of the variable under investigation and how dif-
ficult it is to measure. By carefully reviewing and comparing the reliability estimates of
different instruments available for measuring a construct, one can determine which is the
most reliable measure of the construct.
Time Available for Testing. If the amount of time available for testing is limited, only
a limited number of items can be administered and the sampling of the test domain is open
to greater error. This could occur in a research project in which the school principal allows
you to conduct a study in his or her school but allows only 20 minutes to measure all the
variables in your study. As another example, consider a districtwide screening for reading
problems wherein the budget allows only 15 minutes of testing per student. In contrast, a
psychologist may have two hours to administer a standardized intelligence test individually.
It would be unreasonable to expect the same level of reliability from these significantly dif-
ferent measurement processes. However, comparing the reliability coefficients associated
with instruments that can be administered within the parameters of the testing situation can
help one select the best instrument for the situation.
Test Score Use. The way the test scores will be used is another major consideration
when evaluating the adequacy of reliability coefficients. Diagnostic tests that form the
basis for major decisions about individuals should be held to a higher standard than tests
used with group research or for screening large numbers of individuals. For example,
an individually administered test of intelligence that is used in the diagnosis of mental
retardation would be expected to produce scores with a very high level of reliability. In
this context, performance on the intelligence test provides critical information used to
determine whether the individual meets the diagnostic criteria. In contrast, a brief test
used to screen all students in a school district for reading problems would be held to less
rigorous standards. In this situation, the instrument is used simply for screening purposes
and no decisions are being made that cannot easily be reversed. It helps to remember that
although high reliability is desirable with all assessments, standards of acceptability vary
according to the way test scores will be used. High-stakes decisions demand highly reli-
able information!
If a test is being used to make important decisions that are likely to significantly impact individuals and are not easily reversed, it is reasonable to expect reliability coefficients of 0.90 or even 0.95. This level of reliability is regularly obtained with individually administered tests of intelligence. For example, the reliability of the Wechsler Adult Intelligence Scale—Third Edition (Wechsler, 1997), an individually administered intelligence test, is 0.98.
Reliability estimates of 0.80 or more are considered acceptable
in many testing situations and are commonly reported for group and
individually administered achievement and personality tests. For example, the California
Achievement Test/5 (CAT/5)(CTB/Macmillan/McGraw-Hill, 1993), a set of group-admin-
istered achievement tests frequently used in public schools, has reliability coefficients that
exceed 0.80 for most of its subtests.
For teacher-made classroom tests and tests used for screening, reliability estimates of at least 0.70 are expected. Classroom tests are frequently combined to form linear composites that determine a final grade, and the reliability of these composites is expected to be greater than the reliabilities of the individual tests. Marginal coefficients in the 0.70s might also be acceptable when more thorough assessment procedures are available to address concerns about individual cases.
Some writers suggest that reliability coefficients as low as 0.60 are acceptable for group
research, performance assessments, and projective measures, but we are reluctant to endorse
the use of any assessment that produces scores with reliability estimates below 0.70. As you
recall, a reliability coefficient of 0.60 indicates that 40% of the observed variance can be
attributed to random error. How much confidence can you place in assessment results when
you know that 40% of the variance is attributed to random error?
The preceding guidelines on reliability coefficients and qualitative judgments of their
magnitude must also be considered in context. Some constructs are just a great deal more
difficult to measure reliably than others. From a developmental perspective, we know that
emerging skills or behavioral attributes in children are more difficult to measure than mature
or developed skills. When a construct is very difficult to measure, any reliability coefficient
greater than 0.50 may well be acceptable just because there is still more true score variance
present in such values relative to error variance. However, before choosing measures with
reliability coefficients below 0.70, be sure no better measuring instruments are available
that are also practical and whose interpretations have validity evidence associated with the
intended purposes of the test.
rnew = (n × rxx) / [1 + (n − 1) × rxx]

where n is the factor by which the length of the test is increased and rxx is the reliability of the original test.
For instance, consider the example of our 25-item math test. If the reliability of the
test were 0.80 and we wanted to estimate the increase in reliability we would achieve by
increasing the test to 30 items (a factor of 1.2), the formula would be:
r = (1.2 × 0.80) / [1 + (1.2 − 1) × 0.80]
r = 0.96 / 1.16
r = 0.83
Table 4.6 provides other examples illustrating the effects of increasing the length of
our hypothetical test on reliability. By looking in the first row of this table you see that in-
creasing the number of items on a test with a reliability of 0.50 by a factor of 1.25 results in a
predicted reliability of 0.56. Increasing the number of items by a factor of 2.0 (i.e., doubling
the length of the test) increases the reliability to 0.67.
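The same projection can be scripted for any combination of starting reliability and length factor. The sketch below (ours; the function name is arbitrary) reproduces the 0.83 result above and the first row of Table 4.6.

# Spearman-Brown prophecy formula: predicted reliability when a test is
# lengthened (or shortened) by a factor n.
def spearman_brown(r_original, n):
    return (n * r_original) / (1 + (n - 1) * r_original)

print(round(spearman_brown(0.80, 30 / 25), 2))   # 25 -> 30 items: 0.83

# First row of Table 4.6: a test with reliability 0.50
for factor in [1.25, 2.0]:
    print(factor, round(spearman_brown(0.50, factor), 2))   # 0.56 and 0.67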
In some situations various factors will limit the number of items we can include in
a test. For example, teachers generally develop tests that can be administered in a specific
time interval, usually the time allocated for a class period. In these situations, one can
enhance reliability by using multiple measurements that are combined for an average or
composite score. As noted earlier, combining multiple tests in a linear composite will
increase the reliability of measurement over that of the component tests. In summary,
anything we do to get a more adequate sampling of the content domain will increase the
reliability of our measurement.
In Chapter 6 we will discuss a set of procedures collectively referred to as “item anal-
yses.” These procedures help us select, develop, and retain test items with good measure-
ment characteristics. While it is premature to discuss these procedures in detail, it should
be noted that selecting or developing good items is an important step in developing a good
test. Selecting and developing good items will enhance the measurement characteristics of
the assessments you use.
Another way to reduce the effects of measurement error is what Ghiselli, Campbell,
and Zedeck (1981) refer to as “good housekeeping procedures.” By this they mean test
developers should provide precise and clearly stated procedures regarding the administra-
tion and scoring of tests. Examples include providing explicit instructions for standardized
administration, developing high-quality rubrics to facilitate reliable scoring, and requiring
extensive training before individuals can administer, grade, or interpret a test.
Range Restriction. The values we obtain when calculating reliability coefficients are
dependent on characteristics of the sample or group of individuals on which the analyses
are based. One characteristic of the sample that significantly impacts the coefficients is the
degree of variability in performance (i.e., variance). More precisely, reliability coefficients
based on samples with large variances (referred to as heterogeneous samples) will generally
produce higher estimates of reliability than those based on samples with small variances
(referred to as homogeneous samples). When reliability coefficients are based on a sample
with a restricted range of variability, the coefficients may actually underestimate the reli-
ability of measurement. For example, if you base a reliability analysis on students in a gifted
and talented class in which practically all of the scores reflect exemplary performance (e.g.,
>90% correct), you will receive lower estimates of reliability than if the analyses are based
on a class with a broader and more nearly normal distribution of scores.
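The effect of restricted variability can be seen in a quick simulation (our own illustration, not the authors'): computing a correlation between two parallel forms in the full group and again in a subgroup restricted to the highest scorers yields a noticeably smaller coefficient in the restricted group.

import random

random.seed(2)

# Two parallel test scores that correlate highly in the full population.
true = [random.gauss(100, 15) for _ in range(20000)]
form_a = [t + random.gauss(0, 7) for t in true]
form_b = [t + random.gauss(0, 7) for t in true]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov / (vx ** 0.5 * vy ** 0.5)

# Reliability estimate in the full (heterogeneous) group
print(round(pearson(form_a, form_b), 2))

# Keep only examinees scoring near the top of the distribution on Form A
restricted = [(a, b) for a, b in zip(form_a, form_b) if a > 115]
ra = [a for a, b in restricted]
rb = [b for a, b in restricted]
print(round(pearson(ra, rb), 2))   # smaller, because the range is restricted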
Mastery Testing. Criterion-referenced tests are used to make interpretations relative to a specific level of performance. Mastery testing is an example of a criterion-referenced test by which a test taker's performance is evaluated in terms of achieving a cut score instead of the degree of achievement. The emphasis in this testing situation is on classification. Either test takers score at or above the cut score and are classified as having mastered the skill or domain, or they score below the cut score and are classified as having not mastered the skill or domain. Mastery testing often results in limited variability among test takers, and, as we just described, limited variability in
performance results in small reliability coefficients. As a result, the
reliability estimates discussed in this chapter are typically inadequate for assessing the reli-
ability of mastery test scores. Given the emphasis on classification, a recommended approach
is to use an index that reflects the consistency of classification (AERA et al., 1999). Special
Interest Topic 4.3 illustrates a useful procedure for evaluating the consistency of classifica-
tion when using mastery tests.
SPECIAL INTEREST TOPIC 4.3
As noted in the text, the size of reliability coefficients is substantially affected by the variance of
the test scores. Limited test score variance results in lower reliability coefficients. Because mastery
tests often do not produce test scores with much variability, the methods of estimating reliability
described in this chapter will often underestimate the reliability of these tests. To address this, reli-
ability analyses of mastery tests typically focus on the consistency of classification. That is, because
the objective of mastery tests is to determine if a student has mastered the skill or knowledge domain,
the question of reliability can be framed as one of how consistent mastery—nonmastery classifica-
tions are. For example, if two parallel or equivalent mastery tests covering the same skill or content
domain consistently produce the same classifications for the same test takers (i.e., mastery versus
nonmastery), we would have evidence of consistency of classification. If two parallel mastery tests
produced divergent classifications, we would have cause for concern. In this case the test results are
not consistent or reliable.
The procedure for examining the consistency of classification on parallel mastery tests is
fairly straightforward. Simply administer both tests to a group of students and complete a table like
the one that follows. For example, consider two mathematics mastery tests designed to assess stu-
dents’ ability to multiply fractions. The cut score is set at 80%, so all students scoring 80% or higher
are classified as having mastered the skill while those scoring less than 80% are classified as not
having mastered the skill. In the following example, data are provided for 50 students:
                              Form B: Nonmastery      Form B: Mastery
                              (score <80%)            (score of 80% or better)
Form A: Mastery
(score of 80% or better)             4                        32
Form A: Nonmastery
(score <80%)                        11                         3
Students classified as achieving mastery on both tests are denoted in the upper right-hand
cell while students classified as not having mastered the skill are denoted in the lower left-hand cell.
There were four students who were classified as having mastered the skills on Form A but not on
Form B (denoted in the upper left-hand cell). There were three students who were classified as hav-
ing mastered the skills on Form B but not on Form A (denoted in the lower right-hand cell). The next
step is to calculate the percentage of consistency by using the following formula:
Percent Consistency = [(32 + 11) / 50] × 100 = 86%
This approach is limited to situations in which you have parallel mastery tests. Another limitation is
that there are no clear standards regarding what constitutes “acceptable” consistency of classifica-
tion. As with the evaluation of all reliability information, the evaluation of classification consistency
should take into consideration the consequences of any decisions that are based on the test results
(e.g., Gronlund, 2003). If the test results are used to make high-stakes decisions (e.g., awarding a
diploma), a very high level of consistency is required. If the test is used only for low-stake decisions
(e.g., failure results in further instruction and retesting), a lower level of consistency may be accept-
able. Subkoviak (1984) provides a good discussion of several techniques for estimating the clas-
sification consistency of mastery tests, including some rather sophisticated approaches that require
only a single administration of the test.
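The percent-consistency calculation is easy to script. The sketch below (our own code) uses the 2 × 2 counts from the example above.

# Counts from the example: 50 students classified on two parallel mastery tests.
both_mastery = 32        # mastery on Form A and Form B
both_nonmastery = 11     # nonmastery on both forms
a_only = 4               # mastery on Form A only
b_only = 3               # mastery on Form B only

total = both_mastery + both_nonmastery + a_only + b_only
percent_consistency = 100 * (both_mastery + both_nonmastery) / total
print(percent_consistency)   # 86.0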
present in test scores, and the SD reflects the variability of the scores in the distribution. The SEM is estimated using the following formula:

SEM = SD √(1 − rxx)

where SD = the standard deviation of the obtained scores and rxx = the reliability of the test
Let's work through two quick examples. First, let's assume a test with a standard deviation of 10 and reliability of 0.90.

SEM = 10 √(1 − 0.90) = 10 √0.10 = 3.2

Now let's assume a test with a standard deviation of 10 and reliability of 0.80. The SD is the same as in the previous example, but the reliability is lower.

SEM = 10 √(1 − 0.80) = 10 √0.20 = 4.5
Notice that as the reliability of the test scores decreases, the SEM increases.
Because the
reliability coefficient reflects the proportion of observed score variance
due to true score
variance and the SEM is an estimate of the amount of error in test
scores, this inverse
relationship is what one would expect. The greater the reliability of test scores, the smaller
the SEM and the more confidence we have in the precision of test scores. The lower the
reliability of a test, the larger the SEM and the less confidence we have in the precision of
test scores. Table 4.7 shows the SEM as a function of SD and reliability.

TABLE 4.7
                      Reliability Coefficients
Standard Deviation    0.95    0.90    0.85    0.80    0.75    0.70

Examining the
first row in the table shows that on a test with a standard deviation of 30 and a reliability
coefficient of 0.95 the SEM is 6.7. In comparison, if the reliability of the test score is 0.90
the SEM is 9.5; if the reliability of the test is 0.85 the SEM is 11.6; and so forth. The SEM
is used in calculating intervals or bands around observed scores in which the true score is
expected to fall. We will now turn to this application of the SEM.
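The SEM values described for the first row of Table 4.7 follow directly from the formula; a short sketch (ours, with an arbitrary function name):

# Standard error of measurement: SEM = SD * sqrt(1 - reliability)
def sem(sd, reliability):
    return sd * (1 - reliability) ** 0.5

# First row of Table 4.7: SD = 30 at several levels of reliability
for r in [0.95, 0.90, 0.85, 0.80, 0.75, 0.70]:
    print(f"reliability = {r:.2f}   SEM = {sem(30, r):.1f}")   # 6.7, 9.5, 11.6, ...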
then we would expect him or her to obtain scores between 67 and 73 two-thirds of the time.
To obtain a 95% confidence interval we simply determine the number of standard devia-
tions encompassing 95% of the scores in a distribution. By referring to a table representing
areas under the normal curve (see Appendix F), you can determine that 95% of the scores
in a normal distribution fall within ±1.96 standard deviations of the mean. Given a true score of 70 and SEM
of 3, the 95% confidence interval would be 70 ± 3(1.96) or 70 ± 5.88. Therefore, in this
situation an individual’s observed score would be expected to be between 64.12 and 75.88
95% of the time.
You might have noticed a potential problem with this approach to calculating confi-
dence intervals. So far we have described how the SEM allows us to form confidence in-
tervals around the test taker’s true score. The problem is that we don’t know a test taker’s
true score, only the observed score. Although it is possible for us to estimate true scores
(see Nunnally & Bernstein, 1994), it is common practice to use the SEM to establish con-
fidence intervals around obtained scores (see Gulliksen, 1950). These confidence inter-
vals are calculated in the same manner as just described, but the interpretation is slightly
different. In this context the confidence interval is used to define the range of scores that
will contain the individual’s true score. For example, if an individual obtains a score of 70
on a test with a SEM of 3.0, we would expect his or her true score to be between 67 and 73
(obtained score ±1 SEM) 68% of the time. Accordingly, we would expect his or her true
score to be between 64.12 and 75.88 95% of the time (obtained score ±1.96 SEM).
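The 68% and 95% bands described above take only a couple of lines to compute. The sketch below (ours) uses the obtained score of 70 and SEM of 3 from the example.

# Confidence intervals around an obtained score using the SEM.
obtained, sem = 70, 3.0

low68, high68 = obtained - sem, obtained + sem              # roughly 68% of the time
low95, high95 = obtained - 1.96 * sem, obtained + 1.96 * sem

print(f"68% interval: {low68:.0f} to {high68:.0f}")         # 67 to 73
print(f"95% interval: {low95:.2f} to {high95:.2f}")         # 64.12 to 75.88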
It may help to make note of the relationship between the reliability of the test score,
the SEM, and confidence intervals. Remember that we noted that as the reliability of scores
increases the SEM decreases. The same relationship exists between test reliability and confi-
dence intervals. As the reliability of test scores increases (denoting less measurement error),
the confidence intervals become smaller (denoting more precision in measurement).
A major advantage of the SEM and the use of confidence intervals is that they serve to
remind us that measurement error is present in all scores and that we should interpret scores
cautiously. A single numerical score is often interpreted as if it is precise and involves no
error. For example, if you report that Susie has a Full Scale IQ of 113, her parents might
interpret this as implying that Susie’s IQ is exactly 113. If you are
using a high-quality IQ test such as the Wechsler Intelligence Scale for Children—4th Edition or the Reynolds Intellectual Assessment Scales, the obtained IQ is very likely a good estimate of her true IQ. However, even with the best assessment instruments the obtained scores contain some degree of error and the SEM and confidence intervals help us illustrate this. This information can be reported in different ways in written reports. For example, Kaufman and Lichtenberger (1999) recommend the following format:
Susie obtained a Full Scale IQ of 113 (between 108 and 118 with 95% confidence).
Susie obtained a Full Scale IQ in the High Average range, with a 95% probability that her true IQ falls between 108 and 118.
Regardless of the exact format used, the inclusion of confidence intervals highlights
the fact that test scores contain some degree of measurement error and should be interpreted
with caution. Most professional test publishers either report scores as bands within which
the test taker’s true score is likely to fall or provide information on calculating these confi-
dence intervals.
KR-21 = 1 − [X(n − X)] / (n σ²)

where X = mean of the test scores, σ² = variance of the test scores, and n = number of items
Consider the following set of 20 scores: 50, 48, 47, 46, 42, 42, 41, 40, 40, 38, 37, 36, 36, 35, 34, 32, 32, 31, 30, and 28. Here X = 38.25, σ² = 39.8, and n = 50. Therefore,

KR-21 = 1 − [38.25(50 − 38.25)] / [50(39.8)]
      = 1 − 449.4375 / 1990
      = 1 − 0.23 = 0.77
As you see, this is a fairly simple procedure. If you have access to a computer with a spread-
sheet program or a calculator with mean and variance functions, you can estimate the reli-
ability of a classroom test easily in a matter of minutes with this formula.
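A sketch of this computation (our own code; the function name is arbitrary) using the mean, variance, and number of items reported above:

# KR-21 shortcut, using the version of the formula given in the text:
# KR-21 = 1 - [mean * (n_items - mean)] / (n_items * variance)
def kr21(mean, variance, n_items):
    return 1 - (mean * (n_items - mean)) / (n_items * variance)

print(round(kr21(38.25, 39.8, 50), 2))   # 0.77, matching the worked example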
Special Interest Topic 4.4 presents a shortcut approach for calculating the Kuder-
Richardson formula 21 (KR-21). If you want to avoid even these limited computations, we
prepared Table 4.8, which allows you to estimate the KR-21 reliability for dichotomously
Saupe (1961) provided a quick method for teachers to calculate reliability for a classroom exam in
the era prior to easy access to calculators or computers. It is appropriate for a test in which each item
is given equal weight and each item is scored either right or wrong. First, the standard deviation of
the exam must be estimated from a simple approximation:
SD ≈ (sum of top 1/6th of scores − sum of bottom 1/6th of scores) / [(total number of scores − 1) / 2]
Thus, for example, in a class with 24 student test scores, the top one-sixth of the scores are 98, 92,
87, and 86, while the bottom sixth of the scores are 48, 72, 74, and 75. With 25 test items, the cal-
culations are:
A reliability coefficient of 0.93 for a classroom test is excellent! Don’t be dismayed if your class-
room tests do not achieve this high a level of reliability.
Source: Saupe, J. L. (1961). Some useful estimates of the Kuder-Richardson formula number 20 reliability coef-
ficient. Educational and Psychological Measurement, 2, 63-72.
TABLE 4.8 Estimated KR-21 Reliability Coefficients (for tests with a mean of approximately 80% correct)

Number of Items    SD = 0.10(n)    SD = 0.15(n)    SD = 0.20(n)
10                     —               0.29            0.60
20                    0.20             0.64            0.80
30                    0.47             0.76            0.87
40                    0.60             0.82            0.90
50                    0.68             0.86            0.92
75                    0.79             0.91            0.95
100                   0.84             0.93            0.96
scored classroom tests if you know the standard deviation and number of items (this table
was modeled after tables originally presented by Deiderich, 1973). This table is appropriate
for tests with a mean of approximately 80% correct (we are using a mean of 80% correct
because it is fairly representative of many classroom tests). To illustrate its application, con-
sider the following example. If your test has 50 items and an SD of 8, select the “Number of
Items” row for 50 items and the “Standard Deviation” column for 0.15(n), because 0.15(50)
= 7.5, which is close to your actual SD of 8. The number at the intersection is 0.86, which
is a very respectable reliability for a classroom test (or a professionally developed test for
that matter).
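The entries in Table 4.8 appear to follow from the KR-21 formula with the mean set at 80% of the number of items; the sketch below is our reconstruction under that assumption, not the authors' own code.

# Estimated KR-21 reliability assuming the mean equals 80% of the items,
# i.e., KR-21 = 1 - [0.8n * (n - 0.8n)] / (n * SD^2) = 1 - 0.16n / SD^2
def kr21_80(n_items, sd):
    value = 1 - 0.16 * n_items / sd ** 2
    return value if value > 0 else None   # negative estimates are shown as a dash in the table

for n in [10, 20, 30, 40, 50, 75, 100]:
    row = []
    for factor in [0.10, 0.15, 0.20]:     # SD expressed as a fraction of the number of items
        r = kr21_80(n, factor * n)
        row.append("-" if r is None else f"{r:.2f}")
    print(n, row)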
If you examine Table 4.8, you will likely detect a few fairly obvious trends. First, the
more items on the test the higher the estimated reliability coefficients. We alluded to the
beneficial impact of increasing test length previously in this chapter and the increase in reli-
ability is due to enhanced sampling of the content domain. Second, tests with larger standard
deviations (i.e., variance) produce more reliable results. For example, a 30-item test with
an SD of 3—i.e., 0.10(n)—results in an estimated reliability of 0.47, while one with an SD
of 4.5—i.e., 0.15(n)—results in an estimated reliability of 0.76. This reflects the tendency
we described earlier that restricted score variance results in smaller reliability coefficients.
We should note that while we include a column for standard deviations of 0.20(n), standard
deviations this large are rare with classroom tests (Deiderich, 1973). In fact, from our expe-
rience it is more common for classroom tests to have standard deviations closer to 0.10(n).
Before leaving our discussion of KR-21 and its application to classroom tests, we do want
to caution you that KR-21 is only an approximation of KR-20 or coefficient alpha. KR-21
assumes the test items are of equal difficulty and it is usually slightly lower than KR-20 or
coefficient alpha (Hopkins, 1998). Nevertheless, if the assumptions are not grossly violated,
it is probably a reasonably good estimate of reliability for many classroom applications.
Our discussion of shortcut reliability estimates to this point has been limited to tests
that are dichotomously scored. Obviously, many of the assessments teachers use are not
dichotomously scored and this makes the situation a little more complicated. If your items
are not scored dichotomously, you can calculate coefficient alpha with relative ease using
a commonly available spreadsheet such as Microsoft Excel. With a little effort you should
be able to use a spreadsheet to perform the computations illustrated previously in Tables
4.3 and 4.4.
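If a spreadsheet is not handy, the same computation takes only a few lines of Python. The sketch below is our own, with made-up item scores on a 1-to-5 scale; it follows the coefficient alpha formula, alpha = [k/(k − 1)][1 − (sum of item variances)/(variance of total scores)], with n in the denominators.

# Made-up ratings: six students, five items, each scored 1 to 5.
scores = [
    [5, 4, 5, 4, 5],
    [3, 2, 3, 3, 2],
    [2, 1, 2, 2, 1],
    [4, 5, 4, 5, 4],
    [3, 3, 2, 3, 3],
    [1, 2, 1, 2, 2],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)     # n in the denominator

def coefficient_alpha(item_scores):
    k = len(item_scores[0])                            # number of items
    item_variances = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

print(round(coefficient_alpha(scores), 2))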
Summary
Reliability refers to consistency in test scores. If a test or other assessment procedure pro-
duces consistent measurements, its scores are reliable. Why is reliability so important? As
we have emphasized, assessments are useful because they provide information that helps educators make better decisions. However, the reliability (and validity) of that information is of paramount importance. For us to make good decisions, we need reliable information.
By estimating the reliability of our assessment results, we get an indication of how much
confidence we can place in them. If we have highly reliable and valid information, it is prob-
able that we can use that information to make better decisions. If the results are unreliable,
they are of little value to us.
Test-retest reliability involves the administration of the same test to a group of indi-
viduals on two different occasions. The correlation between the two sets of scores is the
test-retest reliability coefficient and reflects errors due to time sampling.
We also discussed a number of issues important for understanding and interpreting re-
liability estimates. We provided some guidelines for selecting the type of reliability estimate
most appropriate for specific assessment procedures, some guidelines for evaluating reli-
ability coefficients, and some suggestions on improving the reliability of measurement.
Although reliability coefficients are useful when comparing the reliability of dif-
ferent tests, the standard error of measurement (SEM) is more useful when interpreting
scores. The SEM is an index of the amount of error in test scores and is used in calculating
confidence intervals within which we expect the true score to fall. An advantage of the
SEM and the use of confidence intervals is that they serve to remind us that measurement
error is present in all scores and that we should use caution when interpreting scores. We
closed the chapter by illustrating some shortcut procedures that teachers can use to esti-
mate the reliability of their classroom tests.
RECOMMENDED READINGS
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA. Chapter 5, Reliability and Errors of Measurement, is a great resource!
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). Upper Saddle River, NJ: Merrill/Prentice Hall. A little technical at times, but a great resource for students wanting to learn more about reliability.
Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: W. H. Freeman. Chapters 8 and 9 provide outstanding discussions of reliability. A classic!
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill. Chapter 6, The Theory of Measurement Error, and Chapter 7, The Assessment of Reliability, are outstanding chapters. Another classic!
Subkoviak, M. J. (1984). Estimating the reliability of mastery–nonmastery classifications. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 267–291). Baltimore: Johns Hopkins University Press. An excellent discussion of techniques for estimating the consistency of classification with mastery tests.
PRACTICE ITEMS
1. Consider these data for a five-item test that was administered to six students. Each item could
receive a score of either 1 or 0. Calculate KR-20 using the following formula:
KR-20 = [k / (k − 1)] × [1 − (Σ pi qi) / SD²]

where k = number of items
      SD² = variance of total test scores
      pi = proportion of correct responses on item i
      qi = proportion of incorrect responses on item i
            Item 1    Item 2    Item 3    Item 4    Item 5
Student 1     1
Student 2     1
Student 3     0
Student 4     0
Student 5     1
Student 6
pi                                                            SD² =
qi
pi × qi
2. Consider these data for a five-item test that was administered to six students. Each item could receive a score ranging from 1 to 5. Calculate coefficient alpha using the following formula:

Coefficient alpha = [k / (k − 1)] × [1 − (Σ SDi²) / SD²]
            Item 1   Item 2   Item 3   Item 4   Item 5
Student 1      4        ?        4        5        5
Student 2      3        3        2        3        2
Student 3      2        3        1        2        1
Student 4      4        4        5        5        4
Student 5      2        3        2        2        ?
Student 6      ?        2        ?        ?        3
SD_i²         ___      ___      ___      ___      ___

SD² of the total scores = ___

[Cells marked ? are not legible in this copy.]

Note: When calculating SD_i² and SD², use n in the denominator.
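Coefficient alpha follows the same pattern, with item variances in place of the p_i q_i terms. A minimal Python sketch (again using an invented, complete score matrix rather than the practice-item data):

# Coefficient alpha for items scored on a 1-to-5 scale (hypothetical data).
def variance(values):
    """Population variance (n in the denominator), as the practice item specifies."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

scores = [
    [5, 4, 5, 4, 5],
    [3, 2, 3, 3, 2],
    [2, 1, 2, 2, 1],
    [4, 5, 4, 5, 5],
    [3, 3, 2, 2, 3],
    [1, 2, 1, 2, 2],
]

k = len(scores[0])
item_variances = [variance([row[i] for row in scores]) for i in range(k)]
total_variance = variance([sum(row) for row in scores])

alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(f"Coefficient alpha = {alpha:.2f}")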
CHAPTER 5

Validity for Teachers
Validity refers to the degree to which evidence and theory support the
interpretations of test scores entailed by proposed uses of the test. Validity is,
therefore, the most fundamental consideration in developing and evaluating
tests.
—AERA et al., 1999, p. 9
CHAPTER HIGHLIGHTS
LEARNING OBJECTIVES
In the previous chapter we introduced you to the concept of the reliability of measurement.
In this context, reliability refers to accuracy and consistency in test scores. Now we turn our
attention to validity, another fundamental psychometric property. Messick (1989) defined
validity as "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (p. 13). Similarly, the Standards for Educational and Psychological Testing (AERA et al., 1999) defined validity as "the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of the tests" (p. 9). Do not let the technical tone of these definitions throw you. In simpler terms, both of these influential sources indicate that validity refers to the appropriateness or accuracy of the interpretations of test scores. If test scores are interpreted as reflecting intelligence, do they actually reflect intellectual ability? If test scores are used (i.e., interpreted) to predict success in college, can
they accurately predict who will succeed in college? Naturally the validity of the interpreta-
tions of test scores is directly tied to the usefulness of the interpretations. Valid interpreta-
tions help us to make better decisions; invalid interpretations do not!
Although it is often done as a matter of convenience, it is not technically correct to refer
to the validity of a test. Validity is a characteristic of the interpretations given to test scores. It
is not technically correct to ask the question “Is the Wechsler Intelligence Scale for Children—
Fourth Edition (WISC-IV) a valid test?” It is preferable to ask the question “Is the interpreta-
tion of performance on the WISC-IV as reflecting intelligence valid?” Validity must always
have a context and that context is interpretation. What does performance on this test mean?
The answer to this question is the interpretation given to performance
and it is this interpretation that possesses the construct of validity, not the test itself. Additionally, when test scores are interpreted in multiple ways, each interpretation needs to be validated. For example, an achievement test can be used to evaluate a student's performance in academic classes, to assign the student to an appropriate instructional program, to diagnose a learning disability, or to predict success in college. Each of these uses involves different interpretations and the validity of each interpreta-
tion needs to be evaluated (AERA et al., 1999). To establish or determine validity is a major
responsibility of the test authors, test publisher, researchers, and even test user.
Threats to Validity
Messick (1994) and others have identified the two major threats to validity as construct
underrepresentation and construct-irrelevant variance. To translate this into everyday
language, validity is threatened when a test measures either less (construct underrepresenta-
tion) or more (construct-irrelevant variance) than the construct it is
supposed to measure (AERA et al., 1999). Construct underrepresentation occurs when a test fails to measure important aspects of the construct it is designed to measure. Consider a test designed to be a comprehensive measure of the mathematics skills covered in a 3rd-grade curriculum
and to convey information regarding mastery of each skill. If the test
contained only division problems, it would not be an adequate representation of the broad
array of math skills typically covered in a 3rd-grade curriculum (although a score on such
Reliability is a necessary but insufficient condition for validity. No matter how reliable measurement is, it is not a guarantee of validity. From our discussion of reliability you will remember that obtained score variance is composed of two components: true score variance and error variance. Only true score variance is reliable, and
only true score variance can be systematically related to any construct the test is designed
to measure. If reliability is equal to zero, then the true score variance component must also
be equal to zero, leaving our obtained score to be composed only of error, that is, random
variations in responses. Thus, without reliability there can be no validity.
Although low reliability limits validity, high reliability does not ensure validity. It is
entirely possible that a test can produce reliable scores but inferences based on the test
scores can be completely invalid. Consider the following rather silly example involving
head circumference. If we use some care we can measure the circumferences of our stu-
dents’ heads in a reliable and consistent manner. In other words, the measurement is reliable.
However, if we considered head circumference to be an index of intelligence, our inferences
would not be valid. The measurement of head circumference is still reliable, but when in-
terpreted as a measure of intelligence it would result in invalid inferences.
A more relevant example can be seen in the various Wechsler intelligence scales.
These scales have been shown to produce highly reliable scores on a Verbal Scale and a
Performance Scale. There is also a rather substantial body of research demonstrating these
scores are interpreted appropriately as reflecting types of intelligence. However, some psy-
chologists have drawn the inference that score differences between the Verbal Scale and the
Performance Scale indicate some fundamental information about personality and even
forms of psychopathology. For example, one author argued that a person who, on the
Wechsler scales, scores higher on the Verbal Scale relative to the Performance Scale is
highly likely to have an obsessive-compulsive personality disorder! There is no evidence or
research to support such an interpretation, and, in fact, a large percentage of the population
of the United States score higher on the Verbal Scale relative to the Performance Scale on
each of the various Wechsler scales. Thus, while the scores are themselves highly reliable
and some interpretations are highly valid (the Wechsler scales measure intelligence), other
interpretations wholly lack validity despite the presence of high reliability.
the identified construct. In other words, is the content of the test relevant and representative of
the content domain? We speak of it being representative because every possible question
that could be asked cannot as a practical matter be asked, so questions are chosen to sample
or represent the full domain of questions. Content validity is typically based on professional
judgments about the appropriateness of the test content.
Studies of criterion-related validity empirically examine the relationships between test scores and criterion
scores using correlation or regression analyses.
The 1999 document is conceptually similar to the 1985 document (i.e., “types of valid-
ity evidence” versus “types of validity”), but the terminology has expanded and changed
somewhat. The change in terminology is not simply cosmetic, but is substantive and intended
to promote a new way of conceptualizing validity, a view that has been growing in the profes-
sion for over two decades (Reynolds, 2002). The 1999 Standards identifies the following five
categories of evidence that are related to the validity of test score interpretations:
■ Evidence based on test content includes evidence derived from an analysis of the test content, which includes the type of questions or tasks included in the test and administration and scoring guidelines.

■ Evidence based on response processes includes evidence derived from an analysis of the processes examinees (and those who administer and score the test) engage in when responding to the test.

■ Evidence based on internal structure includes evidence derived from an analysis of the relationships among test items and components.

■ Evidence based on relations to other variables includes evidence derived from an analysis of the relationships between test scores and variables external to the test, such as criterion measures.

■ Evidence based on consequences of testing includes evidence derived from an examination of the intended and unintended consequences of test use.
These sources of evidence will differ in their importance or relevance according to factors such as the construct being measured, the intended use of the test scores, and the population being assessed. Those using tests should carefully weigh the evidence of validity and make judgments about how appropriate a test is for each application and setting. Table 5.1 provides a brief summary of the different classification schemes that have been promulgated over the past four decades in the Standards.
At this point you might be asking, “Why are the authors wasting my time with a dis-
cussion of the history of technical jargon?” There are at least two important reasons. First,
it is likely that in your readings and studies you will come across references to various
“types of validity.” Many older test and measurement textbooks refer to content, criterion,
and construct validity, and some newer texts still use that or a similar nomenclature. We
hope that when you come across different terminology you will not be confused, but instead
will understand its meaning and origin. Second, the Standards are widely accepted and
serve as professional guidelines for the development and evaluation of tests. For legal and
ethical reasons test developers and publishers generally want to adhere to these guidelines.
As a result, we expect test publishers will adopt the new nomenclature in the next few years.
Currently test manuals and other test-related documents are adopting this new nomenclature
(e.g., Reynolds, 2002). However, older tests typically have supporting literature that uses
the older terminology, and you need to understand its origin and meaning. When reviewing
test manuals and assessing the psychometric properties of a test, you need to be aware of
the older as well as the newer terminology.
TABLE 5.1 Column headings: 1974 Standards (Validity as Three Types); 1985 Standards (Validity as Three Interrelated Types); 1999 Standards (Validity as a Unitary Construct)
(p. 11). Other writers provide similar descriptions. For example, Reynolds (1998b) notes that validity evidence based on test content focuses on how well the test items sample the behaviors or subject matter the test is designed to measure. In
a similar vein, Anastasi and Urbina (1997) note that validity evidence based on test content
involves the examination of the content of the test to determine whether it provides a repre-
sentative sample of the domain being measured. Popham (2000) succinctly frames it as
“Does the test cover the content it’s supposed to cover?” (p. 96). In the past, this type of
validity evidence was primarily subsumed under the label “content validity.”
Test developers routinely begin considering the appropriateness of the content of the
test at the earliest stages of development. Identifying what we want to measure is the first order
of business, because we cannot measure anything very well that we have not first clearly de-
fined. Therefore, the process of developing a test should begin with a clear delineation of the
construct or content domain to be measured. Once the construct or
A table of specifications is content domain has been clearly defined, the next step is to develop a
essentially a blueprint that table of specifications. This table of specifications is essentially a
guides the development of blueprint that guides the development of the test. It delineates the top-
the test. ics and objectives to be covered and the relative importance of each
topic and objective. Finally, working from this table of specifications
the test developers write the actual test items. These steps in test development are covered in
detail later in this text. Whereas teachers usually develop classroom tests with little outside
assistance, professional test developers often bring in external consultants who are considered
experts in the content area(s) covered by the test. For example, if the goal is to develop an
achievement test covering American history, the test developers will likely recruit experienced
teachers of American history for assistance developing a table of specifications and writing
test items. If care is taken with these procedures, the foundation is established for a correspon-
dence between the content of the test and the construct it is designed to measure. Test develop-
ers may include adetailed description of their procedures for writing items as validity evidence,
including the number, qualifications, and credentials of their expert consultants.
Item relevance and content coverage are two important factors to be considered when evaluating the correspondence between the test content and its construct.

After the test is written, it is common for test developers to continue collecting validity evidence based on content. This typically involves having expert judges systematically review the test and evaluate the correspondence between the test content and its construct or domain. These experts can be the same ones who helped during the early phase of test construction or a new, independent group of experts. During this phase, the experts typically address two major issues, item relevance and content coverage.
To under-
stand the difference between these two issues, consider these examples. For a classroom test
of early American history, a question about the American Revolution would clearly be
deemed a relevant item whereas a question about algebraic equations would be judged to be
irrelevant. This distinction deals with the relevance of the items to the content domain. In
contrast, if you examined the total test and determined that all of the questions dealt with
the American Revolution and no other aspects of American history were covered, you would
conclude that the test had poor content coverage. That is, because early American history
has many important events and topics in addition to the American Revolution that are not
covered in the test, the test does not reflect a comprehensive and representative sample of
the specified domain. The concepts of item relevance and content coverage are illustrated
in Figures 5.1 and 5.2.
As you can see, the collection of content-based validity evidence is typically qualita-
tive in nature. However, although test publishers might rely on traditional qualitative ap-
proaches (e.g., the judgment of expert judges to help develop the tests and subsequently to
evaluate the completed test), they can take steps to report their results in a more quantitative
manner. For example, they can report the number and qualifications of the experts, the
number of chances the experts had to review and comment on the assessment, and their
degree of agreement on content-related issues. Taking these efforts a step further, Lawshe
(1975) developed a quantitative index that reflects the degree of agreement among the ex-
perts making content-related judgments. Newer approaches are being developed that use a
fairly sophisticated technique known as multidimensional scaling analysis (Sireci, 1998).
As we suggested previously, different types of validity evidence are most relevant,
appropriate, or important for different types of tests. For example, content-based validity
evidence is often seen as the preferred approach for establishing the validity of academic achievement tests. This applies to both teacher-made classroom tests and professionally developed achievement tests. Another situation in which content-based evidence is of primary importance is with tests used in the selection and classification of employees. For example, employment tests may be designed to sample the knowledge and skills necessary to succeed at a job. In this
context, content-based evidence can be used to demonstrate consistency between the con-
tent of the test and the requirements of the job. The key factor that makes content-based
validity evidence of paramount importance with both achievement tests and employment
tests is that they are designed to provide a representative sample of the knowledge, behavior,
or skill domain being measured. In contrast, content-based evidence of validity is usually
less relevant for personality and aptitude tests (Anastasi & Urbina, 1997).
Face Validity. Does the test appear to measure what it is designed to measure to the individuals who take, administer, or examine the test? Face validity really has nothing to
do with what a test actually measures, just what it appears to measure. For example, does
a test of achievement look like the general public expects an achievement test to look?
Does a test of intelligence look like the general public expects an intelligence test to look?
Naturally, the face validity of a test is closely tied to the content of a test. In terms of face
validity, when untrained individuals inspect a test they are typically looking to see whether
the items on the test are what they expect. For example, are the items on an achievement
test of the type they expect to find on an achievement test? Are the items on an intelligence
test of the type they expect to find on an intelligence test? Whereas content-based evidence
of validity is acquired through a systematic and technical analysis of the test content, face
validity involves only the superficial appearance of a test. A test can appear “face valid” to
the general public, but not hold up under the systematic scrutiny involved in a technical
analysis of the test content.
This is not to suggest that face validity is an undesirable or even irrelevant character-
istic. A test that has good face validity is likely to be better received by the general public.
If a test appears to measure what it is designed to measure, examinees are more likely to be
cooperative and invested in the testing process, and the public is more likely to view the
results as meaningful (Anastasi & Urbina, 1997). Research suggests that good face validity
can increase student motivation, which in turn can increase test performance (Chan, Schmitt,
DeShon, Clause, & Delbridge, 1997). If a test has poor face validity those using the test may
have a flippant or negative attitude toward the test and as a result put little effort into com-
pleting it. If this happens, the actual validity of the test can suffer. The general public is not
likely to view a test with poor face validity as meaningful, even if there is technical support
for the validity of the test.
There are times, however, when face validity is undesirable. These occur primarily in
forensic settings in which detection of malingering may be emphasized. Malingering is a
situation in which an examinee intentionally feigns symptoms of a mental or physical dis-
order in order to gain some external incentive (e.g., receiving a financial reward, avoiding
punishment). In these situations face validity is not desirable because it may help the exam-
inee fake pathological responses.
(p. 14). The criterion can be academic performance as reflected by the grade point aver-
age (GPA), job performance as measured by a supervisor’s ratings, or anything else that
is of importance to the user of the test. Historically, this type of validity evidence has been
referred to as “predictive validity,” “criterion validity,” or “criterion-related validity.”
There are two different types of validity studies typically used to collect test-criterion evidence: predictive studies and concurrent studies. In a predictive study the test is administered, there is an intervening time interval, and then the criterion is measured. In a concurrent study the test is administered and the criterion is measured at about the same time.
To illustrate these two approaches we will consider the Scho-
lastic Achievement Test (SAT). The SAT is designed to predict how well high school stu-
dents will perform in college. To complete a predictive study, one might administer the
SAT to high school students, wait until the students have completed their freshman year of
college, and then examine the relationship between the predictor (i.e., SAT scores) and the
criterion (i.e., freshman GPA). Researchers often use a correlation coefficient to examine
the relationship between a predictor and a criterion, and in this context the correlation
coefficient is referred to as a validity coefficient. To complete a concurrent study of the
relationship between the SAT and college performance, the researcher might administer
the SAT to a group of students completing their freshman year and then simply correlate
their SAT scores with their GPAs. In predictive studies there is a time interval between the
predictor test and the criterion; in a concurrent study there is no time interval. Figure 5.3
illustrates the temporal relationship between administering the test and measuring the cri-
terion in predictive and concurrent studies.
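To make the idea of a validity coefficient concrete, here is a minimal Python sketch that simply correlates predictor scores gathered at Time I with criterion scores gathered later; the SAT scores and GPAs below are invented solely for illustration.

import math

# Hypothetical predictive-study data
sat_scores   = [1050, 1220, 980, 1340, 1110, 1290, 900, 1180]   # Time I
freshman_gpa = [2.8, 3.2, 2.5, 3.7, 3.0, 3.4, 2.3, 3.1]         # Time II

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

validity_coefficient = pearson_r(sat_scores, freshman_gpa)
print(f"Validity coefficient = {validity_coefficient:.2f}")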
A natural question is “Which type of study, predictive or concurrent, is best?” As you
might expect (or fear), there is not a simple answer to that question. Very often in education
and other settings we are interested in making predictions about future performance. Con-
sider our example of the SAT; the question is which students will do well in college and
which will not. Inherent in this question is the passage of time. You want to administer a test
before students graduate from high school that will help predict the likelihood of their success
[Figure 5.3. Predictive design: administration of the Scholastic Achievement Test (SAT) at Time I (Fall 2003), college GPA measured at Time II (Spring 2005). Concurrent design: administration of the SAT and measurement of college GPA both at Time I (Fall 2003).]
in college. In situations such as this, predictive studies maintain the temporal relationship and
other potentially important characteristics of the real-life situation (AERA et al., 1999).
Because a concurrent study does not retain the temporal relationship or other character-
istics of the real-life situation, a predictive study is preferable when prediction is the ultimate
goal of assessment. However, predictive studies take considerable time to complete and can be
extremely expensive. As a result, although predictive studies might be preferable from a techni-
cal perspective, for practical reasons test developers and researchers might adopt a concurrent
strategy to save time and/or money. In some situations this is less than optimal and you should
be cautious when evaluating the results. However, in certain situations concurrent studies are
the preferred approach. Concurrent studies clearly are appropriate when the goal of the test is
to determine current status of the examinee as opposed to predicting future outcome (Anastasi
& Urbina, 1997). For example, a concurrent approach to validation would be indicated for a
test designed to diagnose the presence of psychopathology in elementary school students. Here
we are most concerned that the test give us an accurate assessment of the child’s conditions at
the time of testing, not at some time in the future. The question here is not “Who will develop
the disorder?” but “Who has the disorder?” In these situations, the test being validated is often
a replacement for a more time-consuming or expensive procedure. For example, a relatively
brief screening test might be evaluated to determine whether it can serve as an adequate re-
placement for a more extensive psychological assessment process. However, if we were inter-
ested in selecting students at high risk of developing a disorder in the future, say, for
participation in a prevention program, a prediction study would be in order. We would need to
address how well or accurately our test predicts who will develop the disorder in question.
Criterion Contamination. It is important that the predictor and criterion scores be indepen-
dently obtained. That is, the predictor scores should not influence the criterion scores in any way.
If predictor scores do influence criterion scores, the criterion is said to be contami-
nated. Consider a situation in which students are selected for a college program based on
performance on an aptitude test. If the college instructors are aware of the students’ perfor-
mance on the aptitude test this might influence their evaluation of the students’ performance
in their class. Students with high aptitude test scores might be given
preferential treatment or graded in a more lenient manner. In this situation knowledge of performance on the predictor is influencing performance on the criterion. Criterion contamination has occurred and any resulting validity coefficients will be artificially inflated. That is, the validity coefficients between the predictor test and the criterion
will be larger than they would be had the criterion not been contaminated. The coefficients will
suggest the validity is greater than it actually is. To avoid this undesirable situation, test devel-
opers must ensure that no individual who evaluates criterion performance has knowledge of
the examinees’ predictor scores.
Interpreting Validity Coefficients. Predictive and concurrent validity studies examine the
relationship between a test and a criterion and the results are often reported in terms of a
validity coefficient. At this point it is reasonable to ask, “How large should validity coeffi-
cients be?” For example, should we expect validity coefficients greater than 0.80? Although
there is no simple answer to this question, validity coefficients should be large enough to
indicate that information from the test will help predict how individuals will perform on the
criterion measure (e.g., Cronbach & Gleser, 1965). Returning to our example of the SAT,
the question is whether the relationship between the SAT and the
freshman GPA is sufficiently strong so that information about SAT performance helps predict who will succeed in college. If a test provides information that helps predict criterion performance better than any other existing predictor, the test may be useful even if its validity coefficients are relatively small. As a result, testing experts avoid specifying a minimum coefficient size that is acceptable for
validity coefficients.
Although we cannot set a minimum size for acceptable validity coefficients, certain
techniques are available that help us evaluate the usefulness of test scores for prediction pur-
poses. In Chapter 2 we introduced linear regression, a mathematical procedure that allows
you to predict values on one variable given information on another variable. In the context of
validity analysis, linear regression allows you to predict criterion performance based on pre-
dictor test scores. When using linear regression, a statistic called the
standard error of estimate is used to describe the amount of prediction error due to the imperfect validity of the test. The standard error of estimate is the standard deviation of prediction errors around the predicted score. The formula for the standard error of estimate is quite
similar to that for the SEM introduced in the last chapter. We will not
go into great detail about the use of linear regression and the standard error of estimate, but
Special Interest Topic 5.1 provides a very user-friendly discussion of linear regression.
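Although the text does not reproduce the formula here, one common form of the standard error of estimate is SD_Y × sqrt(1 − r²), where SD_Y is the standard deviation of the criterion and r is the validity coefficient. A minimal Python sketch with hypothetical values:

import math

# Hypothetical values for illustration only
sd_criterion = 0.60   # standard deviation of the criterion (e.g., freshman GPA)
validity_r = 0.45     # correlation between predictor and criterion

# Standard error of estimate: the standard deviation of prediction errors
see = sd_criterion * math.sqrt(1 - validity_r ** 2)
print(f"Standard error of estimate = {see:.2f}")

# A predicted criterion score of 3.00 would then carry a roughly 68% band
# from 3.00 - see to 3.00 + see.
print(f"68% band around a predicted 3.00: {3.00 - see:.2f} to {3.00 + see:.2f}")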
When tests are used for making decisions such as in student or personnel selection,
factors other than the correlation between the test and criterion are important to consider.
For example, factors such as the proportion of applicants needed to fill positions (i.e., se-
lection ratio) and the proportion of applicants who can be successful on the criterion (i.e.,
base rate) can impact the usefulness of test scores. As an example of how the selection
ratio can influence selection decisions, consider an extreme situation in which you have
more positions to fill than you have applicants. Here you do not have the luxury of being
selective and have to accept all the applicants. In this unfortunate situation no test is useful,
no matter how strong a relationship there is between it and the criterion. However, if you
have only a few positions to fill and many applicants, even a test with a moderate correla-
tion with the criterion may be useful. As an example of how the base rate can impact
selection decisions, consider a situation in which practically every applicant can be suc-
cessful (i.e., a very easy task). Because almost any applicant selected will be successful,
SPECIAL INTEREST TOPIC 5.1

One of the major purposes of various aptitude measures such as IQ tests is to make predictions about
performance on some other variable such as reading achievement test scores, success in a job train-
ing program, or even college grade point average. In order to make predictions from a score on one
test to a score on some other measure, the mathematical relationship between the two must be deter-
mined. Most often we assume the relationship to be linear and direct with test scores such as intel-
ligence and achievement. When this is the case, a simple equation is derived that represents the
relationship between two test scores that we will call X and Y.
If our research shows that X and Y are indeed related—that is, for any change in the value of
X there is a systematic (not random) change in Y—our equation will allow us to estimate this change
or to predict the value of Y if we know the value of X. Retaining X and Y as we have used them so
far, the general form of our equation would be:
Y = aX + b
This equation goes by several names. Statisticians are most likely to refer to it as a regression equa-
tion. Practitioners of psychology who use the equation to make predictions may refer to it as a predic-
tion equation. However, somewhere around 8th or 9th grade, in your first algebra class, you were
introduced to this expression and told it was the equation of a straight line. What algebra teachers
typically do not explain at this level is that they are actually teaching you regression!
Let’s look at an example of how our equation works. For this example, we will let X represent
some individual’s score on an intelligence test and Y the person’s score on an achievement test a year
in the future. To determine our actual equation, we would have had to test a large number of students
on the IQ test, waited a year, and then tested the same students on an achievement test. We then
calculate the relationship between the two sets of scores. One reasonable outcome would yield an
equation such as this one:
Y = 0.5X+ 10
In determining the relationship between X and Y, we calculated the value of a to be 0.5 and
the value of b to be 10. In your early algebra class, a was referred to as the slope of your line and b
as the Y-intercept (the starting point of your line on the Y-axis when X = 0). We have graphed this
equation for you in Figure 5.4. When X = 0, Y is equal to 10 (Y = 0.5(0) + 10) so our line starts on
the Y-axis at a value of 10. Because our slope is 0.5, for each increase in X, the increase in Y will be
half, or 0.5 times, as much. We can use our equation or our prediction line to estimate or predict the
value of Y for any value of X, just as you did in that early algebra class. Nothing has really changed
except the names.
Instead of slope, we typically refer to a as a regression coefficient or a beta weight. Instead of
the Y-intercept, we typically refer to b from our equation as a constant, because it is always being
added to aX in the same amount on every occasion.
If we look at Figure 5.4, we can see that for a score of 10 on our intelligence test, a score of
15 is predicted on the achievement test. A score of 30 on our intelligence test, a 20-point increase,
predicts an achievement test score of 25, an increase in Y equal to half the increase in X. These values
are the same whether we use our prediction line or our equation—they are simply differing ways of
showing the relationship between X and Y.
FIGURE 5.4 Example of a Graph of the Equation of a Straight Line, Also Known as a Regression Line or Prediction Line

Note: Y = aX + b when a = 0.5 and b = 10. For example, if X is 30, then Y = (0.5)30 + 10 = 25.
We are predicting Y from X, and our prediction is never perfect when we are using test scores.
For any one person, we will typically be off somewhat in predicting future test scores. Our prediction
actually is telling us the mean or average score on Y of all the students in the research study at each
score on X. For example, the mean achievement score of all of our students who had a score of 40
on the intelligence test was 30. We know that not all of the students who earned a 40 on the intelli-
gence test will score 30 on the achievement test. We use the mean score on Y of all our students who
scored 40 on the intelligence measure as our predicted value for all students who score 40 on X
nevertheless. The mean is used because it results in the smallest amount of error in all our predic-
tions. In actual practice, we would also be highly interested in just how much error existed in our
predictions, and this degree of error would be calculated and reported. Once we determine the aver-
age amount of error in our predictions, we make statements about how confident we are in predicting
Y based on X. For example, if the average amount of error (called the standard error of estimate) in
our prediction were 2 points, we might say that based on his score of 40 on the IQ measure, we are
68% confident John's achievement test score a year later will be between 28 and 32 and 95% confident that it will fall within the range of scores from 26 to 34.
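For readers who want to check the arithmetic, here is a minimal Python sketch using the equation and standard error of estimate from this example (Y = 0.5X + 10, standard error of estimate of 2 points); it is an illustration only.

# Prediction and confidence bands for the worked example above.
a, b = 0.5, 10            # regression coefficient (slope) and constant (intercept)
see = 2                   # standard error of estimate, in score points

x = 40                    # John's score on the intelligence measure
predicted_y = a * x + b   # predicted achievement score

print(f"Predicted Y = {predicted_y:.0f}")
print(f"68% band: {predicted_y - see:.0f} to {predicted_y + see:.0f}")
print(f"95% band: {predicted_y - 2 * see:.0f} to {predicted_y + 2 * see:.0f}")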
no test is likely to be useful regardless of how strong a relationship there is between it and
the criterion. However, if you have a difficult task and few applicants can be successful,
even a test with a moderate correlation with the criterion may be useful. To take these factors into consideration, decision-theory models have been developed. In brief, decision-theory models help the test user determine how much information a predictor test can contribute when making classification decisions. We will not go into detail about decision theory, but interested students are referred to Anastasi and Urbina (1997) for a readable discussion of decision-theory models.

Validity Generalization. An important consideration in the interpretation of predictive and concurrent studies is the degree to which they can be generalized to new situations, that is, to circumstances similar to but not the
same as those under which the validity studies were conducted. When a test is used for
prediction in new settings, research has shown that validity coefficients can vary consider-
ably. For example, a validation study may be conducted using a national sample, but differ-
ent results may be obtained when the study is repeated using a restricted sample such as a
local school district. Originally these results were interpreted as suggesting that test users
were not able to rely on existing validation studies and needed to conduct their own local
validation studies. However, subsequent research using a new statistical procedure known
as meta-analysis indicated that much of the variability previously observed in validity coef-
ficients was actually due to statistical artifacts (e.g., sampling error). When these statistical
artifacts were taken into consideration the remaining variability was often negligible, sug-
gesting that validity coefficients can be generalized more than previously thought (AERA
et al., 1999). Currently, in many situations local validation studies are not seen as necessary.
For example, if there is abundant meta-analytic research that produces consistent results,
local validity studies will likely not add much useful information. However, if there is little
existing research or the results are inconsistent, then local validity studies may be particu-
larly useful (AERA et al., 1999).
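As a rough illustration of the meta-analytic logic behind validity generalization, the sketch below (Python, hypothetical studies) simply averages validity coefficients weighted by sample size; real meta-analyses go further and correct for statistical artifacts such as sampling error, range restriction, and criterion unreliability.

# Sample-size-weighted mean of validity coefficients from hypothetical studies.
studies = [
    {"n": 120, "r": 0.38},
    {"n": 450, "r": 0.31},
    {"n": 80,  "r": 0.45},
    {"n": 300, "r": 0.35},
]

total_n = sum(s["n"] for s in studies)
weighted_mean_r = sum(s["n"] * s["r"] for s in studies) / total_n
print(f"Sample-size-weighted mean validity coefficient = {weighted_mean_r:.2f}")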
measurement methods (e.g., self-report and teacher rating). The researcher then examines
the resulting correlation matrix, comparing the actual relationships with a priori (i.e., pre-
existing) predictions about the relationships. In addition to revealing information about
convergent and discriminant relationships, this technique provides information about the
influence of common method variance. When two measures show an unexpected correla-
tion due to similarity in their method of measurement, we refer to this as method variance.
Thus, the multitrait-multimethod matrix allows one to determine what the test correlates
with, what it does not correlate with, and how the method of measurement influences these
relationships. This approach has considerable technical and theoretical appeal, yet difficulty
with implementation and interpretation has limited its application to date.
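A minimal sketch of how the logic of a multitrait-multimethod comparison can be organized, with wholly invented correlations: each entry is labeled as convergent (same trait, different method), same-method (a signal of method variance if high), or discriminant (different traits and methods) so the expected pattern can be inspected.

# Labeling entries of a hypothetical multitrait-multimethod matrix.
# Measures are named trait_method; the correlations are invented.
correlations = {
    ("anxiety_selfreport", "anxiety_teacherrating"): 0.62,
    ("depression_selfreport", "depression_teacherrating"): 0.58,
    ("anxiety_selfreport", "depression_selfreport"): 0.41,
    ("anxiety_teacherrating", "depression_teacherrating"): 0.37,
    ("anxiety_selfreport", "depression_teacherrating"): 0.18,
}

for (a, b), r in correlations.items():
    trait_a, method_a = a.split("_")
    trait_b, method_b = b.split("_")
    if trait_a == trait_b and method_a != method_b:
        kind = "convergent (expected to be high)"
    elif trait_a != trait_b and method_a == method_b:
        kind = "same method, different traits (method variance if high)"
    else:
        kind = "discriminant (expected to be low)"
    print(f"{a} vs {b}: r = {r:.2f} -> {kind}")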
In a contrasted group study, validity evidence is garnered by examining groups that are expected to differ on the construct the test is designed to measure. For example, if you are attempting to validate a new measure of intelligence, you might form two groups, individuals with mental retardation and normal control participants. In this type of study, the diagnoses or group assignment would have been made using assessment procedures that do not involve the test under consideration. Each group would then be administered the new test, and its validity as a measure of intelligence would be supported if the
predefined groups differed in performance in the predicted manner. Although the preceding
example is rather simplistic, it illustrates a general approach that has numerous applications.
For example, many constructs in psychology and education have a developmental component.
That is, you expect younger participants to perform differently than older participants. Tests
designed to measure these constructs can be examined to determine whether they demonstrate
the expected developmental changes by looking at the performance of groups reflecting dif-
ferent ages and/or education. In the past, this type of validity evidence has typically been
classified as construct validity.
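In its simplest form, a contrasted group analysis just compares the test performance of groups defined without reference to the test. The text does not prescribe a particular statistic; the sketch below uses an independent-samples t statistic on hypothetical scores as one common choice.

import math

# Hypothetical test scores for two predefined groups
group_a = [112, 98, 105, 120, 101, 95, 108]   # e.g., control participants
group_b = [68, 74, 71, 65, 70, 63, 72]        # e.g., a clinical comparison group

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

na, nb = len(group_a), len(group_b)
pooled_var = ((na - 1) * sample_var(group_a) + (nb - 1) * sample_var(group_b)) / (na + nb - 2)
t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

print(f"Group means: {mean(group_a):.1f} versus {mean(group_b):.1f}")
print(f"Independent-samples t = {t:.2f}")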
By examining the internal structure of a test (or battery of tests) one can determine whether
the relationships between test items (or, in the case of test batteries, component tests) are
consistent with the construct the test is designed to measure (AERA et al., 1999). For ex-
ample, one test might be designed to measure a construct that is hypothesized to involve a
single dimension, whereas another test might measure a construct thought to involve mul-
tiple dimensions. By examining the internal structure of the test we can determine whether
its actual structure is consistent with the hypothesized structure of the construct it measures.
Factor analysis is a sophisticated statistical procedure used to determine the number of
conceptually distinct factors or dimensions underlying a test or battery of tests. Because factor analysis is a fairly complicated technique, we will not go into detail about its calculation. However, factor analysis plays a prominent role in test validation and you need to be aware of its use. In summary, test publishers and researchers use factor analysis either to confirm or to refute the proposition that the internal structure of the tests is consistent with that of the construct.
Factor analysis is not the only approach researchers use to examine the internal struc-
ture of a test. Any technique that allows researchers to examine the relationships between
test components can be used in this context. For example, if the items on a test are assumed
to reflect a continuum from very easy to very difficult, empirical evidence of a pattern of
increasing difficulty can be used as validity evidence. If a test is thought to measure a one-
dimensional construct, a measure of item homogeneity might be useful (AERA et al., 1999).
The essential feature of this type of validity evidence is that researchers empirically examine
the internal structure of the test and compare it to the structure of the construct of interest.
This type of validity evidence has traditionally been incorporated under the category of
construct validity and is most relevant with tests measuring theoretical constructs such as
intelligence or personality.
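Factor analysis itself is beyond the scope of this chapter, but a very rough first look at internal structure can be taken from the eigenvalues of the item correlation matrix. The sketch below assumes NumPy is available and uses simulated item scores; counting eigenvalues greater than 1.0 is only a crude screening heuristic, not the confirmatory analyses test publishers actually report.

import numpy as np

# Simulate six items that all tap a single underlying dimension (hypothetical data).
rng = np.random.default_rng(0)
ability = rng.normal(size=200)
items = np.column_stack(
    [ability + rng.normal(scale=0.8, size=200) for _ in range(6)]
)

corr = np.corrcoef(items, rowvar=False)        # 6 x 6 item correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # largest eigenvalue first

print("Eigenvalues:", np.round(eigenvalues, 2))
print("Eigenvalues greater than 1.0:", int((eigenvalues > 1.0).sum()))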
Evidence based on response processes examines the fit between the processes examinees actually engage in when responding to a test and the construct being assessed. Although this type of validity evidence has not received as much at-
tention as the approaches previously discussed, it has considerable potential and in terms of
the traditional nomenclature it would likely be classified under construct validity. For ex-
ample, consider a test designed to measure mathematical reasoning ability. In this situation
it would be important to investigate the examinees’ response processes to verify that they
are actually engaging in analysis and reasoning as opposed to applying rote mathematical
algorithms (AERA et al., 1999). There are numerous ways of collecting this type of validity
evidence, including interviewing examinees about their response processes and strategies,
recording behavioral indicators such as response times and eye movements, or even analyz-
ing the types of errors committed (AERA et al., 1999; Messick, 1989).
The Standards (AERA et al., 1999) note that studies of response processes are not re-
stricted to individuals taking the test, but may also examine the assessment professionals who
administer or grade the tests. When testing personnel record or evaluate the performance of
examinees, it is important to make sure that their processes or actions are in line with the
construct being measured. For example, many tests provide specific criteria or rubrics that are
intended to guide the scoring process. The Wechsler Individual Achievement Test—Second
Edition (WIAT-II; The Psychological Corporation, 2002) has a section to assess written ex-
pression that requires the examinee to write an essay. To facilitate grading, the authors include
an analytic scoring rubric that has four evaluative categories: mechanics (e.g., spelling, punc-
tuation), organization (e.g., structure, sequencing, use of introductory/concluding sentences,
etc.), theme development (use of supporting statements, evidence), and vocabulary (e.g.,
specific and varied words, unusual expressions). In validating this assessment it would be
helpful to evaluate the behaviors of individuals scoring the test to verify that the criteria are
being carefully applied and that irrelevant factors are not influencing the scoring process.
Researchers have started examining the consequences of test use, both intended and unintended, as an aspect of validity.

For example, if a test is used to identify qualified applicants for employment, it is assumed that the use of the test will result in better hiring decisions (e.g., lower training costs, lower turnover). If a test is used to help select students for admission to a college program, it is assumed that the use of the test will result in better admissions decisions (e.g., greater student success and higher retention). This line of validity evidence simply asks the question, "Are these benefits being achieved?" This type of validity evidence, often referred to as consequential validity evi-
dence, is most applicable to tests designed for selection and promotion.
Some authors have advocated a broader conception of validity, one that incorporates
social issues and values. For example, Messick (1989) in his influential chapter suggested
that the conception of validity should be expanded so that it “formally brings consideration
of value implications and social consequences into the validity framework” (p. 20). Other
testing experts have criticized this position. For example, Popham (2000) suggests that in-
corporating social consequences into the definition of validity would detract from the clar-
ity of the concept. Popham argues that validity is clearly defined as the “accuracy of
score-based inferences” (p. 111) and that the inclusion of social and value issues unneces-
sarily complicates the concept. The Standards (AERA et al., 1999) appear to avoid this
broader conceptualization of validity. The Standards distinguish between consequential
evidence that is directly tied to the concept of validity and evidence that is related to social
policy. This is an important but potentially difficult distinction to make. Consider a situation
in which research suggests that the use of a test results in different job selection rates for
different groups. If the test measures only the skills and abilities related to job performance,
evidence of differential selection rates does not detract from the validity of the test. This
information might be useful in guiding social and policy decisions, but it is not technically
an aspect of validity. If, however, the test measures factors unrelated to job performance, the
evidence is relevant to validity. In this case, it may suggest a problem with the validity of
the test such as the inclusion of construct-irrelevant factors.
Another component to this process is to consider the consequences of not using tests.
Even though the consequences of testing may produce some adverse effects, these must be
contrasted with the positive and negative effects of alternatives to using psychological tests.
If more subjective approaches to decision making are employed, for example, the likelihood
of cultural, ethnic, and gender biases in the decision-making process will likely increase.
The development of this validity argument typically involves the integration of numerous lines of evidence into a coherent commentary. The development of a validity argument is an ongoing process; it takes into consideration existing research and incorporates new scientific findings. As we have noted, different types of validity evidence are most applicable to different types of tests. Here is a brief review of some of the prominent
applications of different types of validity evidence.
■ Evidence based on test content is most often reported with academic achievement tests and tests used in the selection of employees.

■ Evidence based on response processes can be useful with practically any test that requires examinees to engage in any cognitive or behavioral activity.
Examination of Test Content. Evaluating the validity of the results of classroom assess-
ments often begins with an analysis of test content. As discussed earlier in this chapter, this
typically involves examining item relevance and content coverage. Analysis of item rele-
vance involves examining the individual test items and determining whether they reflect
essential elements of the content domain. Content coverage involves examining the overall
test and determining the degree to which the items cover the specified domain (refer back
to Figure 5.1). The question here is “Does validity evidence based on the content of the test
support the intended interpretations of test results?" In other words, is this test covering the
content it is supposed to cover?
Remember that you cannot have valid score interpretations if those scores are not reliable. As a result, efforts to increase
the reliability of assessment results can enhance validity.
Examination of the Overall Assessment Strategy. Nitko (2001) notes that even when
you follow all of the previous guidelines, perfect validity will always elude you. To counter
this, he recommends that teachers employ a multiple-assessment strategy that incorporates
the results of numerous assessments to measure student achievement.
We feel Nitko’s (2001) guidelines provide a good basis for evaluating and improving
the validity of classroom assessments. Although teachers typically do not have the time or
resources to conduct large-scale validity studies, these guidelines provide some practical
and sound advice for evaluating the validity of the results of classroom assessments.
Summary
In this chapter we introduced the concept of validity. In the context of educational and psy-
chological tests and measurement, validity refers to the degree to which theoretical and
empirical evidence supports the meaning and interpretation of test scores. In essence the
validity question is “Are the intended interpretations of test scores appropriate and accu-
rate?” Numerous factors can limit the validity of interpretations. The two major internal
threats to validity are construct underrepresentation (i.e., the test is not a comprehensive
measure of the construct it is supposed to measure) and construct-irrelevant variance (i.e.,
the test measures content or skills unrelated to the construct). Other factors that may reduce
validity include variations in instructional procedures, test administration/scoring proce-
dures, and student characteristics. There is also a close relationship between validity and
reliability. For a test to be valid it must be reliable, but at the same time reliability does not
ensure validity. Put another way, reliability is a necessary but insufficient condition for
validity.
As a psychometric concept, validity has evolved and changed over the last half cen-
tury. Until the 1970s validity was generally divided into three distinct types: content valid-
ity, criterion-related validity, and construct validity. This terminology was widely accepted
and is still often referred to as the traditional nomenclature. However, in the 1970s and
1980s measurement professionals started conceptualizing validity as a unitary construct.
That is, although there are different ways of collecting validity evidence, there are not dis-
tinct types of validity. To get away from the perception of distinct types of validity, today
we refer to different types of validity evidence. The most current typology includes the fol-
lowing five categories:
■ Evidence based on test content. Evidence derived from a detailed analysis of the test
content includes the type of questions or tasks included in the test and guidelines for admin-
istration and scoring. Collecting content-based validity evidence is often based on the eval-
uation of expert judges about the correspondence between the test’s content and its construct.
The key issues addressed by these expert judges are whether the test items assess relevant
content (i.e., item relevance) and the degree to which the construct is assessed in a compre-
hensive manner (i.e., content coverage).

■ Evidence based on response processes. Evidence examining the fit between the processes that examinees, and those who administer and score the test, actually engage in and the construct the test is designed to measure.

■ Evidence based on internal structure. Evidence examining the relationships among test
items and components, or the internal structure of the test, can help determine whether the
structure of the test is consistent with the hypothesized structure of the construct it measures.

■ Evidence based on relations to other variables. Evidence examining the relationships between test scores and variables external to the test, such as criterion measures or scores on tests of similar and different constructs.

■ Evidence based on consequences of testing. Evidence examining the intended and un-
intended consequences of testing is based on the common belief that some benefit will result
from the use of tests. Therefore, it is reasonable to confirm that these benefits are being
achieved. This type of validity evidence has gained considerable attention in recent years and
there is continuing debate regarding the scope of this evidence. Some authors feel that social
consequences and values should be incorporated into the conceptualization of validity, whereas
others feel such a broadening would detract from the clarity of the concept.
Different lines of validity evidence are integrated into a cohesive validity argument that
supports the use of the test for different applications. The development of this validity argu-
ment is a dynamic process that integrates existing research and incorporates new scientific
findings. Validation is the shared responsibility of the test authors, test publishers, research-
ers, and even test users. Test authors and publishers are expected to provide preliminary evi-
dence of the validity of proposed interpretations of test scores whereas researchers often
pursue independent validity studies. Ultimately, those using tests are expected to weigh the
validity evidence and make their own judgments about the appropriateness of the test in their
own situations and settings, placing the practitioners or consumers of psychological tests in
the final, most responsible role in this process.
RECOMMENDED READINGS
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. Chapter 1 is a must read for those wanting to gain a thorough understanding of validity.

Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Champaign: University of Illinois Press. A classic, particularly with regard to validity evidence based on relations to external variables!

Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum. A classic for those really interested in understanding factor analysis.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Upper Saddle River, NJ: Merrill/Prentice Hall. A little technical at times, but very influential.

Sireci, S. G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5, 299-321. This article provides a good review of approaches to collecting validity evidence based on test content, including some of the newer quantitative approaches.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins. A great chapter on factor analysis that is less technical than Gorsuch (1983).
CHAPTER 6

Item Analysis for Teachers
CHAPTER HIGHLIGHTS
Item Difficulty Index (or Item Difficulty Level)
Item Discrimination
Distracter Analysis
Item Analysis: Practical Strategies for Teachers
Item Analysis of Performance Assessments
Qualitative Item Analysis
Using Item Analysis to Improve Classroom Instruction
LEARNING OBJECTIVES
A number of quantitative procedures are useful in assessing the quality and measurement
characteristics of the individual items that make up tests. Collectively these procedures are
referred to as item analysis statistics or procedures. Unlike reliability and validity analyses
that evaluate the measurement characteristics of a test as a whole, item analysis procedures
examine individual items separately, not the overall test. Item analysis statistics are useful
in helping test developers, including both professional psychometricians and classroom
teachers, decide which items to keep on a test, which to modify, and which to eliminate. In
addition to helping test developers improve tests by improving the individual items, they can
also provide valuable information regarding the effectiveness of instruction or training.
The reliability of test scores and the validity of the interpretation of test scores are
dependent on the quality of the items on the test. If you can improve the quality of the indi-
vidual items, you will improve the overall quality of your test. When
discussing reliability we noted that one of the easiest ways to increase
the reliability of test scores is to increase the number of items that go
into making up the test score. This statement is generally true and is
based on the assumption that when you lengthen a test you add items
of the same quality as the existing items. If you use item analysis to
delete poor items and improve other items, it is possible to end up
with a test that is shorter than the original test and that also produces
scores that are more reliable and result in more valid interpretations.
Although quantitative procedures for evaluating the quality of test items will be the
focus of the chapter, some qualitative procedures may prove useful when evaluating the
quality of test items. These qualitative procedures typically involve an evaluation of validity
evidence based on the content of the test and an examination of individual items to ensure
they are technically accurate and clearly stated. Although qualitative procedures have not
received as much attention as their quantitative counterparts, it is often beneficial to use a
combination of quantitative and qualitative procedures.
Before describing the major quantitative item analysis procedures, we should first
note that different types of items and different types of tests require different types of item
analysis procedures. Items scored dichotomously (i.e., either right or wrong) are handled
differently than items scored on a continuum (e.g., an essay that can receive scores ranging
from 0 to 10). Tests designed to maximize the variability of scores (e.g., norm-referenced)
are handled differently than mastery tests (i.e., scored pass or fail). As we discuss various
item analysis procedures, we will specify which types of procedures are appropriate for
which types of items and tests.
Item Difficulty Index (or Item Difficulty Level)

When evaluating items on ability tests, an important consideration is the difficulty level of
the items. Item difficulty is defined as the percentage or proportion of test takers who
correctly answer the item. The item difficulty level or index is abbreviated as p and is
calculated with the following formula:

p = Number of Examinees Correctly Answering the Item / Number of Examinees

For example, in a class of 30 students, if 20 students get the answer correct and ten are
incorrect, the item difficulty index is 0.67. The calculation is illustrated here:

p = 20 / 30 = 0.67
In the same class, if ten students get the answer correct and 20 are incorrect, the item
difficulty index is 0.33. The item difficulty index can range from 0.0 to 1.0 with easier
items having larger decimal values and difficult items at lower values. An item answered
correctly by all students receives an item difficulty of 1.0 whereas an item answered in-
correctly by all students receives an item difficulty of 0.0. Items with p values of either
1.0 or 0.0 provide no information about individual differences and are of no value from a
measurement perspective. Some test developers will include one or two items with p val-
ues of 1.0 at the beginning of a test to instill a sense of confidence in test takers. This is a
defensible practice from a motivational perspective, but from a technical perspective these
items do not contribute to the measurement characteristics of the test. Another factor that
should be considered about the inclusion of very easy or very difficult items is the issue of
time efficiency. The time students spend answering ineffective items is largely wasted and
could be better spent on items that enhance the measurement characteristics of the test.
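Because p is simply a proportion, it is easy to compute directly from a class score sheet. The following sketch (Python; the function name and data are our own illustrations, not part of any scoring package) reproduces the calculation for the 30-student example above.

def item_difficulty(item_scores):
    """Item difficulty index p: the proportion of examinees answering the item correctly.
    item_scores is a list of 0/1 values (1 = correct, 0 = incorrect)."""
    return sum(item_scores) / len(item_scores)

# Hypothetical class of 30 students: 20 answered the item correctly, 10 did not.
responses = [1] * 20 + [0] * 10
print(round(item_difficulty(responses), 2))   # 0.67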
For maximizing variability and reliability, the optimal item difficulty level is 0.50,
indicating that 50% of test takers answered the item correctly and 50% answered incorrectly.
Based on this statement, you might conclude that it is desirable for all test items to
have a difficulty level of 0.50, but this is not necessarily true for several reasons. One reason
is that items on a test are often correlated with each other, which means the measurement
process may be confounded if all the items have p values of 0.50. As a result, it is often
desirable to select some items with p values below 0.50 and some with values greater than
0.50, but with a mean of 0.50. Aiken (2000) recommends that there should be approximately
a 0.20 range of these p values around the optimal value. For example, a test developer might
select items with difficulty levels ranging from 0.40 to 0.60, with a mean of 0.50.

Another reason why 0.50 is not the optimal difficulty level for every testing situation
involves the influence of guessing. On constructed-response items (e.g., essay and
short-answer items) for which guessing is not a major concern, 0.50 is typically considered
the optimal difficulty level. However, with selected-response items (e.g., multiple-choice and
true-false items) for which test takers might answer the item correctly simply by guessing,
the optimal difficulty level varies. To take into consideration the effects of guessing,
the optimal item difficulty level is set higher than for constructed-response items. For
example, the optimal mean p value for multiple-choice items with four options is
approximately 0.74 (Lord, 1952). That is, the test developer might select items with
difficulty levels ranging from 0.64 to 0.84 with a mean of approximately 0.74. Table 6.1
provides information on the optimal mean p value for selected-response items with varying
numbers of alternatives or choices.

TABLE 6.1  Optimal p Values for Items with Varying Numbers of Choices
As a rule of thumb and for psychometric reasons explained in this chapter, we have noted that item
difficulty indexes of 0.50 are desirable in many circumstances on standardized tests. However, it is
also common to include some very easy items so all or most examinees get some questions correct,
as well as some very hard items, so the test has enough ceiling. With a power test, such as an IQ
test, that covers a wide age range and whose underlying construct is developmental, item selection
becomes much more complex. Items that work very well at some ages may be far too easy, too hard,
or just developmentally inappropriate at other ages. If a test covers the age range of say 3 years up to
20 years, and the items all have a difficulty level of 0.50, you could be left with a situation in which
the 3-, 4-, 5-, and even 6-year-olds typically pass no items and perhaps the oldest individuals nearly
always get every item correct. This would lead to very low reliability estimates at the upper and lower
ages and just poor measurement of the constructs generally, except near the middle of the intended
age range. For such power tests covering a wide age span, item statistics such as the difficulty index
and the discrimination index are examined at each age level and plotted across all age levels. In this
way, items can be chosen that are effective in measuring the relevant construct at different ages.
When the item difficulty indexes for such a test are examined across the entire age range, some will
approach 0.0 and some will approach 1.0. However, within the age levels, for example, for 6-year-
olds, many items will be close to 0.5. This affords better discrimination and gives each examinee a
range of items on which they can express their ability on the underlying trait.
Discrimination Index
Probably the most popular method of calculating an index of item discrimination is based
on the difference in performance between two groups. Although there are different ways
of selecting the two groups, they are typically defined in terms of total test performance.
One common approach is to select the top and bottom 27% of test takers in terms of their
overall performance on the test and exclude the middle 46% (Kelley, 1939). Some assessment
experts have suggested using the top and bottom 25%, some the top and bottom 33%, and some
the top and bottom halves. In practice, all of these are probably acceptable (later in this
chapter we will show you a more practical approach that saves both time and effort). The
difficulty of the item is computed for each group separately, and these are labeled pT and pB
(T for top, B for bottom). The difference between pT and pB is the discrimination index,
designated as D, and is calculated with the following formula (e.g., Johnson, 1951):

D = pT - pB
To illustrate the logic behind this index, consider a classroom test designed to measure aca-
demic achievement in some specified area. If the item is discriminating between students
who know the material and those who do not, then students who are more knowledgeable
(i.e., students in the top group) should get the item correct more often than students who are
less knowledgeable (i.e., students in the bottom group). For example, if pT = 0.80 (indicating
80% of the students in the top group answered the item correctly) and pB = 0.30 (indicating
30% of the students in the bottom group answered the item correctly), then D = 0.80 - 0.30 = 0.50,
indicating that the item discriminates well between the top and bottom groups.
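The calculation is easy to script once the two groups have been formed. A minimal sketch, assuming dichotomously scored items and groups that have already been selected (the data and function names are hypothetical):

def proportion_correct(scores):
    """Proportion of a group answering the item correctly (scores are 0/1)."""
    return sum(scores) / len(scores)

def discrimination_index(top_scores, bottom_scores):
    """Item discrimination index D = pT - pB."""
    return proportion_correct(top_scores) - proportion_correct(bottom_scores)

# Hypothetical item responses for 10 students in the top group and 10 in the bottom group.
top = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]       # pT = 0.80
bottom = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]    # pB = 0.30
print(round(discrimination_index(top, bottom), 2))   # 0.5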
Hopkins (1998) provided guidelines for evaluating items in terms of their D values
(see Table 6.2). According to these guidelines, D values of 0.40 and above are considered
excellent, between 0.30 and 0.39 are good, between 0.11 and 0.29 are fair, and between
0.00 and 0.10 are poor. Items with negative D values are likely miskeyed or there are other
serious problems. Other assessment experts have provided different guidelines, some more
rigorous and some more lenient.

As a general rule, we suggest that items with D values over 0.30 are acceptable (the larger
the better), and items with D values below 0.30 should be carefully reviewed and possibly
revised or deleted. However, this is only a general rule and there are exceptions. For
example, most indexes of item discrimination, including the item discrimination index (D),
are biased in favor of items with intermediate difficulty levels. That is, the maximum D
value of an item is related to its p value (see Table 6.3).

TABLE 6.3  Relationship between Item Difficulty (p) and Maximum D Value

p Value        Maximum D Value
1.00           0.00
0.90           0.20
0.80           0.40
0.70           0.60
0.60           0.80
0.50           1.00
0.40           0.80
0.30           0.60
0.20           0.40
0.10           0.20
0.00           0.00

Items that all test takers either pass or fail (i.e., p values
of either 0.0 or 1.0) cannot provide any information about individual differences and their
D values will always be zero. If half of the test takers correctly answered an item and half
failed (i.e., p value of 0.50), then it is possible for the item’s D value to be 1.0. This does not
mean that all items with p values of 0.50 will have D values of 1.0, but just that the item can
conceivably have a D value of 1.0. As a result of this relationship between p and D, items that
have excellent discrimination power (i.e., D values of 0.40 and above) will necessarily have p
values between 0.20 and 0.80. In testing situations in which it is desirable to have either very
easy or very difficult items, D values can be expected to be lower than those normally desired.
Additionally, items that measure abilities or objectives that are not emphasized throughout
the test may have poor discrimination due to their unique focus. In this situation, if the item
measures an important ability or learning objective and is free of technical defects, it should
be retained (e.g., Linn & Gronlund, 2000).
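The ceiling on D illustrated in Table 6.3 can be reproduced with a short calculation. Under the simplifying assumption that the two groups are the top and bottom halves of the score distribution, the most discriminating arrangement places every correct response in the top group, which caps D at 2 x min(p, 1 - p). This is our own derivation offered as a sketch, not a formula taken from the chapter's sources.

def max_discrimination(p):
    """Upper bound on D for an item with difficulty p, assuming a top-half/bottom-half
    split and the most favorable arrangement of correct answers."""
    return 2 * min(p, 1 - p)

for p in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0):
    print(f"p = {p:.2f}   maximum D = {max_discrimination(p):.2f}")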
In summary, although low D values often indicate problems, the guidelines provided
in Table 6.2 should be applied in a flexible, considered manner. Our discussion of the cal-
culation of item difficulty and discrimination indexes has used examples with items that are
dichotomously scored (i.e., correct/incorrect, 1 or 0). Special Interest Topic 6.2 provides a
discussion of the application of these statistics with constructed-response items that are not
scored in a dichotomous manner.
Our discussion and examples of the calculation of the item difficulty index and discrimination index
used examples that were dichotomously scored (i.e., scored right or wrong: 0 or 1). Although this
procedure works fine with selected-response items (e.g., true—false, multiple-choice), you need a
slightly different approach with constructed-response items that are scored in a more continuous
manner (e.g., an essay item that can receive scores between 1 and 5 depending on quality). To
calculate the item difficulty index for a continuously scored constructed-response item, use
the following formula (Nitko, 2001):

p = Average Score on the Item / Range of Possible Scores

The range of possible scores is calculated as the maximum possible score on the item minus the
minimum possible score on the item. For example, if an item has an average score of 2.7 and is
scored on a 1 to 5 scale, the calculation would be:

p = 2.7 / (5 - 1) = 2.7 / 4 = 0.675

Therefore, this item has an item difficulty index of 0.675. This value can be interpreted the same as
the dichotomously scored items we discussed.
To calculate the item discrimination index for a continuously scored constructed-response
item, you use the following formula (Nitko, 2001):
D = (Average Score for the Top Group - Average Score for the Bottom Group) / Range of Possible Scores
For example, if the average score for the top group is 4.3, the average score for the bottom group is
1.7, and the item is scored on a 1 to 5 scale, the calculation would be:
D = (4.3 - 1.7) / (5 - 1) = 2.6 / 4 = 0.65
Therefore, this item has an item discrimination index of 0.65. Again, this value can be interpreted
the same as the dichotomously scored items we discussed.
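Both formulas translate directly into code. A short sketch following the conventions above, using hypothetical rubric scores on a 1-to-5 item (the function names are our own):

def continuous_item_difficulty(average_score, min_possible, max_possible):
    """Difficulty index for a continuously scored item: average score divided by
    the range of possible scores, as described above (Nitko, 2001)."""
    return average_score / (max_possible - min_possible)

def continuous_item_discrimination(avg_top, avg_bottom, min_possible, max_possible):
    """Discrimination index: difference between the group means divided by the range."""
    return (avg_top - avg_bottom) / (max_possible - min_possible)

print(round(continuous_item_difficulty(2.7, 1, 5), 3))             # 0.675
print(round(continuous_item_discrimination(4.3, 1.7, 1, 5), 2))    # 0.65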
Another approach to evaluating item discrimination is to correlate scores on the individual
item with scores on the total test. The total test score can be either the total number of
items answered correctly (unadjusted) or the total number of items answered correctly omitting
the item being examined (adjusted). Either way, the item-total correlation is usually calculated using the
point-biserial correlation. As you remember from our discussion of basic statistics, the
point-biserial is used when one variable is a dichotomous nominal score and the other vari-
able is measured on an interval or ratio scale. Here the dichotomous variable is the score
on a single item (e.g., right or wrong) and the variable measured on an interval scale is the
total test score. A large item-total correlation is taken as evidence that an item is measur-
ing the same construct as the overall test measures and that the item discriminates between
individuals high on that construct and those low on that construct. An item-total correlation
calculated on the adjusted total will be lower than that computed on the unadjusted total and is
preferred because the item being examined does not “‘contaminate” or inflate the correlation.
The results of an item-total correlation will be similar to those of an item discrimination index
and can be interpreted in a similar manner (Hopkins, 1998). As teachers gain more access to
computer test scoring programs, the item-total correlation will become increasingly easy to
compute and will likely become the dominant approach for examining item discrimination.
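Most statistical packages will produce item-total correlations directly. As one illustration, the sketch below uses SciPy's point-biserial function on a small hypothetical response matrix and computes the adjusted (corrected) version by removing the item from the total before correlating.

import numpy as np
from scipy.stats import pointbiserialr

# Hypothetical response matrix: rows are 8 students, columns are 5 dichotomously scored items.
responses = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

item = 0
item_scores = responses[:, item]
adjusted_total = responses.sum(axis=1) - item_scores   # omit the item being examined
r, p_value = pointbiserialr(item_scores, adjusted_total)
print(round(r, 2))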
With mastery tests, one approach to examining item discrimination is to compare the
performance of examinees who have received instruction on the material with the performance
of examinees who have not, using the following formula:

D = p(instruction) - p(no instruction)

where p(instruction) = proportion of instructed students getting the answer correct and
p(no instruction) = proportion of students without instruction getting the answer correct.
This approach is technically adequate, with the primary limitation being potential dif-
ficulty obtaining access to an adequate group that has not received instruction or training
on the relevant material. If one does have access to an adequate sample, this is a promis-
ing approach.
Another popular approach involves administering the test to the same sample twice,
once before instruction and once after instruction. The formula is:

D = p(posttest) - p(pretest)

where p(posttest) and p(pretest) are the proportions of examinees answering the item correctly
after and before instruction, respectively.
Some drawbacks are associated with this approach. First, it requires that the test developers
write the test, administer it as a pretest, wait while instruction is provided, administer it as
a posttest, and then calculate the discrimination index. This can take an extended period
of time in some situations, and test developers often want feedback in a timely manner. A
second limitation is the possibility of carryover effects from the pre- to the posttest. For
example, examinees might remember items or concepts emphasized on the pretest, and
this carryover effect can influence how they respond to instruction, study, and subsequently
prepare for the posttest.
Aiken (2000) proposed another approach for calculating discrimination for mastery
tests. Instead of using the top and bottom 27% of students (or the top and bottom 50%), he
recommends using item difficulty values based on the test takers who reached the mastery
cut score (i.e., mastery group) and those who did not reach mastery (i.e., nonmastery group),
using the following formula:

D = p(mastery group) - p(nonmastery group)
The advantage of this approach is that it can be calculated based on the data from one test
administration with one sample. A potential problem is that because it is common for the
majority of examinees to reach mastery, the p value of the nonmastery group might be based
on a small number of examinees. As a result the statistics might be unstable and lead to er-
roneous conclusions.
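A sketch of Aiken's mastery/nonmastery variant, with a hypothetical cut score and data; the only difference from the earlier D calculation is how the two groups are formed.

def mastery_discrimination(item_scores, total_scores, cut_score):
    """D for a mastery test item: p(mastery group) - p(nonmastery group).
    item_scores are 0/1 values for one item, total_scores are overall test scores,
    and cut_score is the mastery criterion on the total score."""
    mastery = [item for item, total in zip(item_scores, total_scores) if total >= cut_score]
    nonmastery = [item for item, total in zip(item_scores, total_scores) if total < cut_score]
    if not mastery or not nonmastery:
        raise ValueError("Both groups must contain at least one examinee.")
    return sum(mastery) / len(mastery) - sum(nonmastery) / len(nonmastery)

# Hypothetical data: 10 students, mastery cut score of 80 on the total test.
item = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]
total = [95, 90, 88, 85, 82, 80, 79, 75, 70, 60]
print(round(mastery_discrimination(item, total, cut_score=80), 2))   # 0.75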
On speed tests, many examinees never reach the items located near the end of the test, so the difficulty indexes of those items suggest they are more difficult than they actually are. Similar complications arise when interpreting indexes of discrimination
with speed tests. Because the individuals completing the later items also tend to be the most
capable test takers, indexes of discrimination may exaggerate the discriminating ability of
these items. Although different procedures have been developed to take into consideration
these and related factors, they all have limitations and none have received widespread ac-
ceptance (e.g., Aiken, 2000; Anastasi & Urbina, 1997). Our recommendation is that you
should be aware of these issues and take them into consideration when interpreting the item
analyses of speed tests.
Distracter Analysis
The final quantitative item analysis procedure we will discuss in this chapter involves the
analysis of individual distracters. On multiple-choice items, the incorrect alternatives are re-
ferred to as distracters because they serve to “distract” examinees who do not actually know
the correct response. Some test developers routinely examine the performance of distracters
for all multiple-choice items, whereas others reserve distracter analysis for items with p or
D values that suggest problems. If you are a professional test developer you can probably
justify the time required to examine each distracter for each item, but for busy teachers it
is reasonable to reserve distracter analysis procedures for items that need further scrutiny
based on their p or D values.
Distracter analysis allows you to examine how many examinees in the top and bottom groups
selected each option on a multiple-choice item. The key is to examine each distracter and ask
two questions. First, did the distracter distract some examinees? If no examinees selected the
distracter, it is not doing its job. An effective distracter must be selected by some
examinees. If a distracter is so obviously incorrect that no examinees select it, it is
ineffective and needs to be revised or replaced. The second question involves discrimination.
Did the distracter attract more examinees in the bottom group than in the top group? Effective
distracters should. When looking at the correct response, we expect more examinees in the
top group to select it than examinees in the bottom group (i.e., it
demonstrates positive discrimination). With distracters we expect the opposite. We expect
more examinees in the bottom group to select a distracter than examinees in the top group.
That is, distracters should demonstrate negative discrimination!
Consider the following example:
Options
Item 1        A*    B    C    D
*Correct answer
For this item, p = 0.52 (moderate difficulty) and D = 0.43 (excellent discrimination). Based
on these values, this item would probably not require further examination. However, this
can serve as an example of what might be expected with a “good” item. As reflected in
the D value, more examinees in the top group than the bottom group selected the correct
answer (i.e., option A). By examining the distracters (i.e., options B, C, and D), you see
that they were all selected by some examinees, which means they are serving their purpose
(i.e., distracting examinees who do not know the correct response). Additionally, all three
distracters were selected more by members of the bottom group than the top group. This is
the desired outcome! While we want more high-scoring examinees to select the correct an-
swer than low-scoring examinees (i.e., positive discrimination), we want more low-scoring
examinees to select distracters than high-scoring examinees (i.e., negative discrimination).
In summary, this is a good item and all of the distracters are performing well.
Now we will look at an example that illustrates some problems.
Options
Item 1        A*    B    C    D
*Correct answer
For this item, p = 0.50 (moderate difficulty) and D = 0.14 (fair discrimination but further
scrutiny suggested). Based on these values, this item needs closer examination and possible
revision. Examining option B, you will notice that more examinees in the top group than in
the bottom group selected this distracter. This is not a desirable situation; option B needs to
be examined to determine why it is attracting top examinees. It is possible that the wording is
ambiguous or that the option is similar in some way to the correct answer. Examining option
C, you note that no one selected this distracter. It attracted no examinees, was obviously not
the correct answer, and needs to be replaced. To be effective, a distracter must distract some
examinees. Finally, option D performed well. More poor-performing examinees selected this
option than top-performing examinees (i.e., 11 versus 4). It is likely that if the test developer
revises options B and C this will be a more effective item.
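Distracter counts are straightforward to tabulate once responses have been grouped. The sketch below uses hypothetical option choices (not the counts from the examples above) and flags the two problems just discussed: distracters no one selects and distracters that attract the top group.

from collections import Counter

def distracter_analysis(top_choices, bottom_choices, options, correct):
    """Count how many examinees in each group chose each option and flag weak distracters."""
    top_counts = Counter(top_choices)
    bottom_counts = Counter(bottom_choices)
    for option in options:
        t, b = top_counts.get(option, 0), bottom_counts.get(option, 0)
        if option == correct:
            note = "keyed correct"
        elif t + b == 0:
            note = "never selected -- revise or replace"
        elif t > b:
            note = "attracts the top group -- review wording"
        else:
            note = "performing as expected"
        print(f"Option {option}: top = {t:2d}, bottom = {b:2d}   {note}")

# Hypothetical item with option A keyed correct (11 examinees per group).
top = list("AAAAAAAABBD")
bottom = list("AAABBBCDDDD")
distracter_analysis(top, bottom, options="ABCD", correct="A")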
Consider, for example, an item that asks in what year Albert Einstein first published his full
general theory of relativity and offers only plausible dates close to the correct answer as
alternatives. Unless you are very familiar with Einstein's work, this is probably a fairly
difficult question. Now consider this revision:
1. In what year did Albert Einstein first publish his full general theory of relativity?
   a. 1655
   b. 1762
   c. 1832
   d. 1916
   e. 2001
This is the same question but with different distracters. This revised item would likely be a
much easier item in a typical high school science class. The point is that the selection of
distracters can significantly impact the difficulty of the item and consequently the ability
of the item to discriminate.
Item Analysis: Practical Strategies for Teachers

Teachers typically have a number of practical options for calculating item analysis statistics
for their classroom tests. Many teachers will have access to computer scoring programs that
calculate the various item analysis statistics we have described. Numerous commercial
companies sell scanners and scoring software that can scan answer sheets and produce item
analysis statistics and related printouts (see
Table 6.4 for two examples). If you do not have access to computer
scoring at your school, Website Reactions has an excellent Internet site that allows you to com-
pute common item analysis statistics online (www.surveyreaction.com/itemanalysis.asp).
TABLE 6.4 Two Examples of Test Scoring and Item Analysis Programs
Principia Products
One of its products, Remark Office OMR, will grade tests and produce
statistics and graphs reflecting common item analysis and test statistics.
Its Internet site is www.principiaproducts.com/office/index.html.
If you prefer to perform the calculations by hand, several authors have suggested some
abbreviated procedures that make the calculation of common item analysis statistics fairly
easy (e.g., Educational Testing Service, 1973; Linn & Gronlund, 2000). Although there are
some subtle differences between these procedures, they generally involve the following
steps:
1. Once the tests are graded, arrange them according to score (i.e., lowest to highest
score).
2. Take the ten papers with the highest scores and the ten with the lowest scores. Set
these into two piles. Set aside the remaining papers; they will not be used in these
analyses.
3. For each item, determine how many of the students in the top group correctly an-
swered it and how many in the bottom group correctly answered it. With this infor-
mation you can calculate the overall item difficulty index (i.e., p) and separate item
difficulty indexes for the top group (pT) and bottom group (pB). For example, if
eight students in the top group answered the item correctly and three in the bottom
group answered the item correctly, add these together (8 + 3 = 11) and divide by 20
to compute the item difficulty index: p = 11/20 = 0.55. Although this item difficulty
index is based on only the highest and lowest scores, it is usually adequate for use
with classroom tests. You can then calculate pT and pB. In this case: pT = 8/10 = 0.80
and pB = 3/10 = 0.30.
4. You now have the data needed to calculate the discrimination index for the items.
Using the data for our example: D = pT - pB = 0.80 - 0.30 = 0.50.
Using these simple procedures you see that for this item p = 0.55 (moderate difficulty) and
D=0.50 (excellent discrimination). If your items are multiple choice you can also use these
same groups to perform distracter analysis.
Continuing with our example, consider the following results:
Options
A    B*    C    D
*Correct answer
As reflected in the item D value (i.e., 0.50), more students in the top group than the bottom
group selected the correct answer (i.e., option B). By examining the distracters (i.e., op-
tions A, C, and D), you see that they each were selected by some students (i.e., they are all
distracting as hoped for) and they were all selected by more students in the bottom group
than the top group (i.e., demonstrating negative discrimination). In summary, this item is
functioning well.
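For teachers who keep item scores in a spreadsheet, the entire hand procedure above can be scripted. A sketch, assuming a simple 0/1 response matrix with one row per student and the top-10/bottom-10 grouping just described (the function and variable names are our own):

def analyze_classroom_test(responses, group_size=10):
    """responses: list of per-student lists of 0/1 item scores (at least 2 * group_size rows).
    Returns (p, pT, pB, D) for each item, using only the top and bottom group_size papers."""
    ranked = sorted(responses, key=sum, reverse=True)        # highest total scores first
    top, bottom = ranked[:group_size], ranked[-group_size:]
    results = []
    for i in range(len(responses[0])):
        p_top = sum(row[i] for row in top) / group_size
        p_bottom = sum(row[i] for row in bottom) / group_size
        p = (p_top + p_bottom) / 2
        results.append((round(p, 2), p_top, p_bottom, round(p_top - p_bottom, 2)))
    return results

# Hypothetical use with a list called student_rows holding one row of item scores per student:
# for i, (p, p_top, p_bottom, d) in enumerate(analyze_classroom_test(student_rows), start=1):
#     print(f"Item {i}: p = {p}, pT = {p_top}, pB = {p_bottom}, D = {d}")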
Example 1
Options
p = 0.63
D = 0.40        A    B*    C    D
*Correct answer
To illustrate our step-by-step procedure for evaluating the quality of test items, con-
sider this series of questions and how it applies to the first example.
1. Is the item difficulty level appropriate for the testing application? A p of 0.63 is
   appropriate for a multiple-choice item on a norm-referenced test. Remember, the
   optimal mean p value for a multiple-choice item with four choices is 0.74.

2. Does the item discriminate adequately? With a D of 0.40 this item does an excellent
   job of discriminating between examinees who performed well on the test and those
   doing poorly.

3. Are the distracters performing adequately? Because the answers to the previous
   questions were positive, we might actually skip this question. However, because we
   have the data available we can easily examine the result. All three distracters
   (i.e., A, C, and D) attracted some examinees and all three were selected more
   frequently by members of the bottom group than the top group. This is the desired
   outcome.

4. Overall evaluation? In summary, this is a good item and no revision is necessary.
Example 2
Options
p = 0.20
D = -0.13        A    B    C*    D
*Correct answer
1. Is the item difficulty level appropriate for the testing application? A p of 0.20 sug-
gests that this item is too difficult for most applications. Unless there is some reason
for including items that are this difficult, this is cause for concern.
2. Does the item discriminate adequately? A D of —0.13 suggests major problems with
this item. It may be miskeyed or some other major flaw is present.
3. Are the distracters performing adequately? Option A, a distracter, attracted most
of the examinees in the top group and a large number of examinees in the bottom
group. The other three options, including the one keyed as correct, were negative
discriminators (i.e., selected more by examinees in the bottom group than the top
group).
4. Overall evaluation? There is a major problem with this item! Because five times as
many examinees in the top group selected option A as option C, which is keyed as
correct, we need to verify that option C actually is the correct response. If the item
is miskeyed and option A is the correct response, this would likely be an acceptable
item (p = 0.52, D = 0.30) and could be retained. If the item was not miskeyed, there
is some other major flaw and the item should be deleted.
Example 3
Options
p = 0.43
D = 0.20        A    B    C    D*
*Correct answer
1. Is the item difficulty level appropriate for the testing application? A p of 0.43 sug-
gests that this item is moderately difficult.
2. Does the item discriminate adequately? A D of 0.20 indicates this item is only a fair
discriminator.
3. Are the distracters performing adequately? Options B and C performed admirably
with more examinees in the bottom group selecting them than examinees in the top
group. Option A is another story! Over twice as many examinees in the top group
selected it as examinees in the bottom group. In other words, this distracter is at-
tracting a fairly large number of the top-performing examinees. It is likely that this
distracter either is not clearly stated or resembles the correct answer in some manner.
Either way, it is not effective and should be revised.
4. Overall evaluation? In its current state, this item is marginal and can stand revision. It
can probably be improved considerably by carefully examining option A and revising
this distracter. If the test author is able to replace option A with a distracter as effective
as B or C, this would likely be a fairly good item.
Example 4
Options
p = 0.23
D = 0.27        A    B    C*    D
*Correct answer
1. Is the item difficulty level appropriate for the testing application? A p of 0.23
   suggests that this item is more difficult than usually desired.

2. Does the item discriminate adequately? A D of 0.27 indicates this item is only a fair
   discriminator.

3. Are the distracters performing adequately? All of the distracters (i.e., options A, B,
   and D) were selected by some examinees, which means that they are serving their
   purpose. Additionally, all of the distracters were selected more by the bottom group
   than the top group (i.e., negative discrimination), the desired outcome.

4. Overall evaluation? This item is more difficult than typically desired and demonstrates
   only marginal discrimination. However, its distracters are all performing properly. If
   this item is measuring an important concept or learning objective, it might be
   desirable to leave it in the test. It might be improved by manipulating the distracters
   to make the item less difficult.
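The four review questions can also be written down as a simple screening rule for flagging items that deserve a closer look. The sketch below encodes rough thresholds consistent with the discussion above; the helper, its cutoffs, and the counts in the example call are our own illustrations rather than standards taken from the chapter's sources.

def review_item(p, d, option_counts, correct):
    """Return review notes for an item. option_counts maps each option
    to (top group count, bottom group count)."""
    notes = []
    if p < 0.25 or p > 0.90:
        notes.append(f"difficulty p = {p} is outside the range usually desired")
    if d < 0:
        notes.append(f"D = {d} is negative -- check for miskeying")
    elif d < 0.30:
        notes.append(f"D = {d} is below 0.30 -- review the item")
    for option, (top, bottom) in option_counts.items():
        if option == correct:
            continue
        if top + bottom == 0:
            notes.append(f"option {option} was never selected")
        elif top > bottom:
            notes.append(f"option {option} attracts more top-group examinees")
    return notes or ["no revision suggested"]

# Hypothetical counts (15 examinees per group) patterned after Example 3 above.
counts = {"A": (5, 2), "B": (1, 4), "C": (1, 4), "D": (8, 5)}
print(review_item(0.43, 0.20, counts, correct="D"))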
Item Analysis of Performance Assessments

In Chapter 1 we introduced you to performance assessments, noting that they have become very
popular in educational settings in recent years. Performance assessments require test takers
to complete a process or produce a product in a setting that closely resembles real-life
situations (AERA et al., 1999). Traditional item analysis statistics have not been applied to
performance assessments as routinely as they have to more traditional paper-and-pencil tests.
One factor limiting the application of item analysis statistics is that performance
assessments often involve a fairly small number of tasks (and sometimes only one task).
However, Linn and Gronlund (2000) suggest that if the assessment involves several tasks, item
analysis procedures can be adopted for performance assessments. For example, if a performance
assessment involves five individual tasks that
receive scores from 0 (no response) to 5 (exemplary response), the
total scores would theoretically range from a low of 0 to a high of 25. Using the practical
strategy of comparing performance between the top 10 high-scoring students with that of
the low-scoring students, one can examine each task to determine whether the task discrimi-
nates between the two groups.
Scores
On this task the mean score of the top-performing students was 4.4 while the mean score
of the low-performing students was 1.6. This relatively large difference between the mean
scores suggests that the item is discriminating between the two groups.
Now examine the following example:
Scores
On this task the mean score of the top-performing students was 2.6 while the mean score
of the low-performing students was 2.3. A difference this small suggests that the item is
not discriminating between the two groups. Linn and Gronlund (2000) suggest that two
possible reasons for these results should be considered. First, it is possible that this item
is not discriminating because the performance measured by this task is ambiguous. If this
is the case, the task should be revised or discarded. Second, it is possible that this item is
measuring skills and abilities that differ significantly from those measured by the other four
tasks in the assessment. If this is the case, it is not necessarily a poor item that needs to be
revised or discarded.
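The same top-group/bottom-group comparison can be run for every task in a performance assessment. A sketch with hypothetical rubric scores that mirror the two patterns just described:

def task_discrimination(top_scores, bottom_scores):
    """Mean score of the top group minus mean score of the bottom group for one task."""
    return sum(top_scores) / len(top_scores) - sum(bottom_scores) / len(bottom_scores)

# Hypothetical rubric scores (0-5) for the 10 highest- and 10 lowest-scoring students.
task_a_top = [5, 5, 5, 4, 4, 4, 4, 5, 4, 4]       # mean 4.4
task_a_bottom = [2, 1, 2, 1, 2, 2, 1, 2, 2, 1]    # mean 1.6
task_b_top = [3, 2, 3, 3, 2, 3, 2, 3, 2, 3]       # mean 2.6
task_b_bottom = [2, 3, 2, 2, 3, 2, 3, 2, 2, 2]    # mean 2.3

print(round(task_discrimination(task_a_top, task_a_bottom), 1))   # 2.8 -- discriminates well
print(round(task_discrimination(task_b_top, task_b_bottom), 1))   # 0.3 -- little discrimination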
Qualitative Item Analysis

Popham (2000) suggests that test developers carefully proof their tests, but only after
setting them aside for a few days. Although it is tempting to proof a test immediately after
writing it, a review a few days later will often reveal a number of errors. This delayed
review often catches both clerical errors (e.g., spelling or grammar) and less obvious errors
that might make an item unclear or inaccurate. After a "cooling-off" period we are often
amazed that an "obvious" error evaded detection earlier. Somehow the introduction of a period of time pro-
vides distance that seems to make errors more easily detected. The time you spend proofing a
test is well spent and can help you avoid problems once the test is administered and scored.
Popham (2000) also recommends that you have a colleague review the test. Ideally
this should be a colleague familiar with the content of the test. For example, a history
teacher might have another history teacher review the test. In addition to checking for cleri-
cal errors, clarity, and accuracy, the reviewer should determine whether the test is covering
the material that it is designed to cover. This is akin to collecting validity evidence based on
the content of the test. For example, on a classroom achievement test you are trying to deter-
mine whether the items cover the material that the test is supposed to cover. Finally, Popham
recommends that you have the examinees provide feedback on the test. For example, after
completing the test you might have the examinees complete a brief questionnaire asking
whether the directions were clear and if any of the questions were confusing.
Ideally a test developer should use both quantitative and qualitative approaches to
improve tests. We regularly provide a delayed review of our own tests and use colleagues as
reviewers whenever possible. After administering a test and obtain-
ing the quantitative item analyses, we typically question students about problematic items,
particularly items for which the basis of the problem is not obvious. Often a combination of
quantitative and qualitative procedures will result in the optimal enhancement of your tests.
Popham (2000) notes that historically quantitative item analysis procedures have been
applied primarily to tests using norm-referenced score interpretations and qualitative pro-
cedures have been used primarily with tests using criterion-referenced interpretations. This
tendency can be attributed partly to some of the technical problems we described earlier
about using item analysis statistics with mastery tests. Nevertheless, we recommend the
use of both quantitative and qualitative approaches with both types of score interpretations.
When improving tests, we believe the more information the better.
Having spent the time to develop and analyze test items, you might find it useful to
develop a test bank to catalog your items. Special Interest Topic 6.3 provides information
on this process.
Many teachers at all grade levels find it helpful to develop a test bank to catalog and archive their
test items. This allows them to easily write new tests using test items that they have used previously
and have some basic measurement information on. Several sources have provided guidelines for
developing item banks (e.g., Linn & Gronlund, 2000; Ward & Murray-Ward, 1994). Consider the
following example.
Learning Objective: Describe the measures of variability and their appropriate use.

If the standard deviation of a set of test scores is equal to 9, the variance is equal to:
   a. 3
   b. 18
   c. 30
   d. 81*
*Correct answer
Administration Date: February 7, 2002

p = 0.58, D = 0.43
                           A     B     C     D*
Number in top group        4     2     0     24
Number in bottom group     9     7     3     11
*Correct answer

Second administration:

p = 0.68, D = 0.37
                           A     B     C     D*
Number in top group        1     2     1     26
Number in bottom group     8     5     2     15
*Correct answer
This indicates that this item has been administered on two different occasions. By including
information from multiple administrations, you will have a better idea of how the item is
likely to perform on a future test. If you are familiar with computer databases (e.g.,
Microsoft Access), you can set up a database that will allow you to access items with specific
characteristics quickly and efficiently. Professionally developed item bank programs are also
available. For example, the Assessment Systems Corporation's FastTEST product will help you
create and maintain item banks, as well as construct tests (see www.assess.com).
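A spreadsheet works fine for this purpose, but a small data structure makes the record explicit. The sketch below is one possible layout for a banked item, using the first administration shown above; it is our own illustration, not a description of any commercial item-banking product.

from dataclasses import dataclass, field

@dataclass
class Administration:
    date: str
    p: float
    d: float
    option_counts: dict          # option letter -> (top group count, bottom group count)

@dataclass
class BankedItem:
    objective: str
    stem: str
    options: dict                # option letter -> option text
    correct: str
    history: list = field(default_factory=list)   # one Administration record per use

item = BankedItem(
    objective="Describe the measures of variability and their appropriate use.",
    stem="If the standard deviation of a set of test scores is equal to 9, the variance is equal to:",
    options={"A": "3", "B": "18", "C": "30", "D": "81"},
    correct="D",
    history=[Administration("2002-02-07", p=0.58, d=0.43,
                            option_counts={"A": (4, 9), "B": (2, 7), "C": (0, 3), "D": (24, 11)})],
)
print(len(item.history), item.history[0].p)   # 1 0.58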
Using Item Analysis to Improve Classroom Instruction

In addition to helping teachers improve their tests, item analysis procedures can provide
valuable information about which learning objectives have been achieved and which need further elaboration and
review. Sometimes as teachers we believe that our students have grasped a concept only
to discover on a test that items measuring understanding of that concept were missed by a
large number of them. When this happens, it is important to go back and carefully review
the material, possibly trying a different instructional approach to convey the information.
At another level of analysis, information about which distracters are being selected by
students can help teachers pinpoint common misconceptions and thereby correct them. In these
ways, item analysis can result not only in better tests but also in better teaching.
Summary
In this chapter we described several procedures that can be used to assess the quality of the
individual items making up a test.
■ Item difficulty level. The item difficulty level or index is defined as the percentage or
proportion of examinees correctly answering the item. The item difficulty index (i.e.,
p) ranges from 0.0 to 1.0 with easier items having larger decimal values and difficult
items having smaller values. For maximizing variability among examinees, the op-
timal item difficulty level is 0.50, indicating that half of the examinees answered the
item correctly and half answered it incorrectly. Although 0.50 is optimal for maximiz-
ing variability, in many situations other values are preferred.
■ Item discrimination. Item discrimination refers to the extent to which an item accu-
rately discriminates between examinees who vary on the test’s construct. For exam-
ple, on an achievement test the question is whether the item can distinguish between
examinees who are high achievers and those who are poor achievers. Although a
number of different approaches have been developed for assessing item discrimina-
tion, we focused our discussion on the popular item discrimination index (i.e., D). We
provided guidelines for evaluating item discrimination indexes, and as a general rule
items with D values over 0.30 are acceptable, and items with D values below 0.30
should be reviewed. However, this is only a general rule, and we discussed a number
of situations in which smaller D values might be acceptable.
■ Distracter analysis. The final quantitative item analysis procedure we described was
distracter analysis. In essence distracter analysis allows the test developer to evaluate
the distracters on multiple-choice items (i.e., incorrect alternatives) and determine
whether they are functioning properly. This involves two primary questions. First:
Did the distracter distract some examinees? If a distracter is so obviously wrong that
no examinees selected it, it is useless and deserves attention. The second question
involves discrimination. Did the distracter attract more examinees in the bottom group
than in the top group?
After introducing these different item analysis statistics, we described some practical
strategies teachers can use to examine the measurement characteristics of the items on their
classroom assessments. We also introduced a series of steps that teachers can engage in to
use the information provided by item analysis procedures to improve the quality of the items
they use in their assessments.
In addition to quantitative item analysis procedures, test developers can also use
qualitative approaches to improve their tests. Popham (2000) suggested that the test devel-
oper carefully proof the test after setting it aside for a few days. This break often allows
the test author to gain some distance from the test and provide a more thorough review of
it. He also recommends getting a trusted colleague to review the test. Finally, he recom-
mends that the test developer solicit feedback from the examinees regarding the clarity of
the directions and the identification of ambiguous items. Test developers are probably best
served by using a combination of quantitative and qualitative item analysis procedures.
In addition to helping improve tests, in the classroom the information obtained with item
analysis procedures can help the teacher identify common misconceptions and material
that needs further instruction.
RECOMMENDED READINGS
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall. Chapter 7, Item Analysis, presents a readable but comprehensive discussion of item analysis that is slightly more technical than that provided in this text.

Johnson, A. P. (1951). Notes on a suggested index of item validity: The U-L index. Journal of Educational Psychology, 42, 499-504. This is a seminal article in the history of item analysis.

Kelley, T. L. (1939). The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 30, 17-24. A real classic!

Nitko, A. J., & Hsu, T. C. (1984). A comprehensive microcomputer system for classroom testing. Journal of Educational Measurement, 21, 377-390. Describes a set of computer programs that archives student data, performs common item analyses, and banks the test questions to facilitate test development.
Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™
presentation and to listen to an audio lecture about this chapter.
CHAPTER 7

The Initial Steps in Developing a Classroom Test

CHAPTER HIGHLIGHTS

LEARNING OBJECTIVES
As noted in Chapter 1, classroom testing has important implications and its effects are felt
immediately by students and teachers alike. It has been estimated that assessment activities
consume as much as 30% of the available instructional time (Stiggins & Conklin, 1992).
Because testing activities are such important parts of the educational process, all teachers
who develop and use tests should work diligently to ensure their assessment procedures are
adequate and efficient. In this chapter we will start discussing the development of class-
room achievement tests. The initial steps in developing a classroom achievement test are to
specify the educational objectives, develop a table of specifications, and select the type of
items you will include in your assessment. These activities provide the foundation for all
classroom tests and many professionally designed tests of educational achievement.
The identification and statement of educational objectives is an important first step
in developing tests. Educational objectives are simply educational goals, that is, what you
hope the students will learn or accomplish. Educational objectives
are also referred to as instructional or learning objectives. The teaching of any lesson,
unit, or course has one or more educational objectives. These objectives are sometimes clearly
stated and sometimes (all too often) implicit. Even when the objectives are implicit they
can usually be inferred by carefully examining the materials used in
instruction, the topics covered, and the instructional processes employed. A good classroom
test can be written from clearly stated objectives much more easily than can one from vague
or poorly developed objectives. Clearly stated objectives help you make sure that the test
measures what has been taught in class and greatly facilitate the test development process.
Establishing explicit, clearly stated educational objectives also has the added benefit of
enhancing the quality of teaching. If you know what your educational goals are, you are
much more likely to reach them. The educational reform movement of the 1990s focused
considerable attention on the development and statement of content standards. It is likely
that your state or school district has developed fairly explicit curriculum guidelines that
dictate to some degree the educational objectives you have for your students.
Educational objectives differ in several important ways, including their scope and the domain
of ability or characteristic they address. We will start by discussing how objectives can
differ in terms of scope.
Scope
Scope refers to how broad or narrow an objective is. An example of a broad objective is:

The student will be able to analyze and discuss the effects of the Civil War on
twentieth-century American politics.

An example of an objective with a narrow scope is:

The student will be able to list the states that seceded from the Union during the Civil
War.
Clearly different kinds of student responses would be expected for test questions developed
from such different objectives. Objectives with a broad scope are often broken down into
objectives with a more narrow scope. The broad objective above might have been reformu-
lated to the following objectives:
1. The student will be able to analyze and discuss the effects of the Civil War on
twentieth-century American politics.
la. The student will be able to discuss the political effects of post—Civil War occupa-
tion by federal troops on Southern state politics.
1b. The student will be able to trace the rise and fall of African Americans’ political
power during and after Reconstruction.
1c. The student will be able to discuss the long-term effects of the economic depres-
sion in the South after the Civil War.
1d. The student will be able to list three major effects of the Civil War on twentieth-
century U.S. politics.
Although these four specific objectives all might help the student attain the broad objective,
they do not exhaust all of the potential objectives that could support the broad objective. In
fact a whole course might be needed to completely master the broad objective.
If you use only very specific educational objectives, you may end up with a large num-
ber of disjointed items that emphasize rote memory and other low-level cognitive abilities.
On the other hand, if you use only broad educational objectives, you
may not have the specific information needed to help you develop tests with good measurement
characteristics. Although you can find test development experts who promote the use of narrow
objectives and other experts who promote broad objectives, in practice it is probably best to
strike a balance between the two extremes. This can best be accomplished using two approaches.
First, you can write objectives that are
at an intermediate level of specificity. Here the goal is to write objectives that provide the
specificity necessary to guide test development but are not so narrow as to limit assessment
to low-level abilities. The second approach is to use a combination of broad and specific
objectives as demonstrated earlier. That is, write broad objectives that are broken down into
more specific objectives. Either of these approaches can help you develop well-organized
tests with good measurement characteristics.
In addition to differing in scope, educational objectives also differ in the domain or the
type of ability or characteristic being measured. The domains typically addressed by educational
objectives involve cognitive, affective, or psychomotor abilities or characteristics. These
three domains are usually presented as hierarchies involving different levels that reflect
varying degrees of complexity. We will start by discussing the cognitive domain.
Cognitive Domain

The objectives presented in the previous section are examples of cognitive objectives.
Consider again these two objectives:
1. The student will be able to analyze and discuss the effects of the Civil War on twentieth-
century American politics.
2. The student will be able to list the states that seceded from the Union during the Civil
War.
When we first discussed these two objectives, we emphasized how they differed in scope. The
first objective is broad and could be the basis for a whole course of study. The second one is
narrow and specific. In addition to scope they also differ considerably in the complexity of the
cognitive processes involved. The first one requires “analysis and discussion” whereas the sec-
ond requires only “listing.” If a student can memorize the states that seceded from the Union,
he or she can be successful on the second objective, but memorization of facts would not be
sufficient for the first objective. Analysis and discussion require more complex cognitive
processes than rote memorization. A taxonomy of cognitive objectives developed by Bloom,
Englehart, Furst, Hill, and Krathwohl (1956) is commonly referred to as Bloom’s taxonomy.
This taxonomy provides a useful way of describing the complexity of an objective by
classifying it into one of six hierarchical categories ranging from the most simple to the
most complex. Table 7.1 provides a summary of Bloom’s taxonomy. The categories include the
following:
Knowledge. Objectives at the knowledge level require only the recall or recognition of
specific facts, terms, and principles.

Comprehension. Objectives at the comprehension level require students to grasp the meaning
of material and are often stated with verbs such as describe, explain, and summarize.
Educational objectives at the comprehension level include the following examples:

■ The student will be able to describe the use of each symbol on a U.S. Geographical
Survey map.
■ The student will be able to explain how interest rates affect unemployment.

Application. Objectives at the application level require students to use what they have
learned in new, concrete situations. Educational objectives at the application level include
the following examples:

■ The student will be able to write directions for traveling by numbered roads from any
city on a map to any other city.
■ The student will be able to apply multiplication and division of double digits in
applied math problems.

Analysis. Objectives at the analysis level require students to break material down into its
component parts and to understand the organization of parts to a whole.

Synthesis. Objectives at the synthesis level require students to combine parts or elements
into a new whole. Objectives at the synthesis level include the following examples:

■ The student will construct a map of a hypothetical country with given characteristics.
■ The student will propose a viable plan for establishing the validity of an assessment
instrument following the guidelines presented in the Standards for Educational and
Psychological Testing (1999).

Evaluation. Objectives at the evaluation level require students to judge the value of
materials or methods for a given purpose. Objectives at the evaluation level include the
following examples:

■ The student will evaluate the usefulness of a map to enable him or her to travel from
one place to another.
■ The student will judge the quality of validity evidence for a specified assessment
instrument.
Although it is somewhat dated, we agree with others (e.g., Hopkins, 1998) who feel
that Bloom’s taxonomy is helpful because it presents a framework that helps remind teach-
ers to include items reflecting more complex educational objectives in their tests. Popham
(1999) suggests that teachers tend to focus almost exclusively on objectives at the knowl-
edge level. He goes as far as to suggest that in practice one can actually simplify the
taxonomy by having just two levels: knowledge and anything higher than knowledge. We will not
go quite that far, but we do agree that instruction and assessment are often limited to rote
memorization, and higher-level educational objectives should be emphasized.
This is not to imply that lower-level objectives are trivial and
should be ignored. For each objective in your curriculum you must decide at what level you
expect students to perform. In a brief introduction to a topic it may be sufficient to expect only
knowledge and comprehension of major concepts. In a more detailed study of a topic, higher,
more complex levels of mastery will typically be required. However, it is often not possible to
master higher-level objectives without first having mastered lower-level objectives. Although
we strongly encourage the development of higher-level objectives, it is not realistic to require
high-level mastery of everything. Education is a pragmatic process of choosing what is most
important to emphasize in a limited amount of instructional time. Our culture helps us make
some of these choices, as do legislative bodies, school boards, administrators, and even oc-
casionally parents and students. In some school districts the cognitive objectives are provided
in great detail; in others they are practically nonexistent. As noted earlier, the current trend is
for federal and state lawmakers to exert more and more control over curriculum content.
Affective Domain
Most people think of cognitive objectives when they think of a student’s educational ex-
periences. However, two other domains of objectives appear in the school curriculum: af-
fective and psychomotor objectives. The affective domain involves characteristics such as
values, attitudes, interests, and behavioral actions. As a result, affective objectives can be
more difficult to specify and measure than cognitive objectives. An example of an affective
objective is:
The student will demonstrate interest in earth science by conducting a science fair
project in some area of earth science.
As a general rule, affective objectives are emphasized more in elementary school cur-
ricula than secondary curricula. A taxonomy of affective objectives developed by Krath-
wohl, Bloom, and Masia (1964) is presented in Table 7.2. This taxonomy involves levels of
increasing sophistication, with each level building on preceding levels. It depicts a process
whereby new ideas, values, and beliefs are gradually accepted and internalized as one’s own.
Krathwohl’s taxonomy of affective objectives has never approached the popularity of
Bloom’s taxonomy of cognitive objectives, probably because the affective domain has been
more difficult to define and is also a more controversial area of education. In schools, af-
fective objectives are almost always adjuncts to cognitive objectives. For example, we want
our students to learn about science and as a result to appreciate or enjoy it. Classroom tests
predominantly focus on cognitive objectives, but affective objectives are found in school
curricula, either explicitly or implicitly. Because affective objectives appear in the school
curriculum, their specification enhances the chance of them being achieved.
Psychomotor Domain
The third class of objectives deals with physical activity and is referred to as psychomo-
tor objectives. Psychomotor objectives are most common in courses such as physical education,
certain science classes (e.g., biology or computer science), or career-technical classes such as
woodworking, electronics, automotive, or metalwork. For example, in physical education
there are countless psychomotor activities such as rolling a bowling ball a certain way or
hitting tennis balls with a certain motion. Biology classes also have many psychomotor ac-
tivities, including focusing a microscope, staining cells, and dissection. Computer science
courses require skill in using a computer keyboard and assembling computer hardware.
Taxonomies of psychomotor objectives have been developed, and Harrow’s (1972) model
is illustrated in Table 7.3. Psychomotor objectives are typically tied to cognitive objectives
because almost every physical activity involves cognitive processes. As a result, like af-
fective objectives, psychomotor objectives typically are adjuncts to cognitive objectives.
Nevertheless, they do appear in the school curriculum and their specification may enhance
instruction and assessment.
Educational objectives may be stated in either behavioral or nonbehavioral terms, as the
following examples illustrate:
Behavioral: The student will be able to list the reasons cited in the curriculum guide
for the United States’ entry into World War I with 80% accuracy.
Nonbehavioral: The student will be able to analyze the reasons for the United States’
entry into World War I.
These two objectives differ in the activity to be performed. The behavioral objective re-
quires that the student list the reasons; the nonbehavioral objective requires that the student
analyze the reasons. Behavioral objectives specify observable activities, such as listing, that
can be directly measured by the teacher. Nonbehavioral activities must be inferred from other
observable behaviors, and nonbehavioral objectives typically use verbs such as appreciates,
comprehends, knows, and understands. Though it is possible to write
either behavioral or nonbehavioral objectives at all levels of the cognitive taxonomy,
teachers often find it easier to write behavioral objectives for the lower levels (e.g., knowl-
edge, comprehension, and application) and to write nonbehavioral objectives for the higher
levels (e.g., analysis, synthesis, and evaluation).
It is also common for behavioral objectives to specify an outcome criterion. For ex-
ample, in the previous example, the criterion is listing “the reasons cited in the curriculum
guide... with 80% accuracy.” As illustrated, behavioral objectives often state outcome
criteria as a percentage correct that represents mastery, as in the following example:
The student will be able to diagram correctly 80% of sentences presented from a
standard list.
Although an 80% criterion may be acceptable for many classroom objectives, in some situations
it is not. If student drivers mastered only 80% of the skills needed to drive safely, the resulting
accident rate would be completely intolerable. In this situation the criterion for mastery may need to be raised to
100% to achieve an acceptable accident rate (e.g., <1%). When training pilots to fly fighter
jets, the Air Force may likewise require 100% mastery of all ground flight objectives be-
cause a single mistake may result in death and the loss of expensive equipment.
The use of behavioral objectives received widespread acceptance in the 1960s and
1970s because they helped teachers clearly specify their objectives. However, a disadvan-
tage of behavioral objectives is that if carried to an extreme they can be too specific and too
numerous, and as a result no longer facilitate instruction or assessment. The ideal situation
is to have objectives that are broad enough to help you organize your instruction and assess-
ment procedures, but that also specify clearly measurable activities.
So far we have defined educational objectives and described some of their major character-
istics. Consistent with our goal of limiting our discussion to the information teachers really
need to know to develop, use, and interpret tests effectively, we have kept our discussion
relatively brief. Because the specification of educational objectives plays an important role
in the development of classroom tests, it is helpful to work from a concrete set of objectives.
As an example, the learning objectives for Chapter 2 of this text include the following:
- Define measurement.
- Describe the different scales of measurement and give examples.
- Describe the measures of central tendency and their appropriate use.
- Describe the measures of variability and their appropriate use.
- Explain the meaning of correlation coefficients and how they are used.
- Explain how scatterplots are used to describe the relationships between two variables.
- Describe how linear regression is used to predict performance.
- Describe major types of correlation coefficients.
- Distinguish between correlation and causation.
You might be asking why we have spent so much time discussing educational objectives.
The reason is that the development of a classroom test should be closely tied to the class
curriculum and educational objectives. As we noted earlier, classroom tests should mea-
sure what was taught. Classroom tests should emphasize what was emphasized in class. The
method of ensuring congruence between classroom instruction and test content is the
development and application of a table of specifications, also referred to as a test blueprint.
An example is given in Table 7.5 for Chapter 2 of this text.
TABLE 7.5 Table of Specifications for Test on Chapter 2 Based on Content Areas (Number of Items)

Content Area                     Items by Level of Objective                   Total
Scales of measurement            2 knowledge, 2 comprehension, 2 analysis        6
Measures of central tendency     3 knowledge, 3 comprehension                    6
Measures of variability          3 items at each of three levels                 9
Correlation and regression       2, 3, 2, and 2 items across four levels         9
The column on the left, labeled Content Area, lists the major content areas to be covered in
the test. These content areas are derived by carefully reviewing the educational objectives
and selecting major content areas to be included in the test. Across the top of the table we
list the levels of Bloom’s cognitive taxonomy. The inclusion of this section encourages us
to consider the complexity of the cognitive processes we want to measure. As noted earlier,
there is a tendency for teachers to rely heavily on lower-level processes (e.g., rote memory)
and to underemphasize higher-level cognitive processes. By incorporating these categories
in our table of specifications, we are reminded to incorporate a wider range of cognitive
processes into our tests.
The numbers in the body of the table reflect the number of items to be devoted to
assessing each content area at each cognitive taxonomic level. Table 7.5 depicts specifica-
tions for a 30-item test. If you examine the first content area in Table 7.5 (i.e., scales of
measurement) you see two knowledge-level items, two comprehension-level items, and
two analysis-level items devoted to assessing this content area. The next content area (i.e.,
measures of central tendency) will be assessed by three knowledge-level items and three
comprehension-level items. The number of items dedicated to assessing each objective
should reflect the importance of the objective in the curriculum and how much instruc-
tional time was devoted to it. In our table of specifications we determined the number
of items dedicated to each content area/objective by examining how much material was
devoted to each topic in the text and how much time we typically spend on each topic in
class lectures.
Some testing experts recommend using percentages instead of the number of items
when developing a table of specifications. This approach is illustrated in Table 7.6. For
example, you might determine that approximately 20% of your instruction involved the
different scales of measurement. You would like to reflect this weighting in your test so you
devote 20% of the test to this content area. If you are developing a 30-item test this means
you will write six items to assess objectives related to scales of measurement (0.20 x 30
= 6). If you are developing a 40-item test, this means you will write eight items to assess
objectives related to scales of measurement (0.20 x 40 = 8).
TABLE 7.6 Table of Specifications for Test on Chapter 2 Based on Content Areas (Percentages)

Content Area                     Percentage by Level of Objective                        Total
Scales of measurement            6.7% knowledge, 6.7% comprehension, 6.7% analysis        20%
Measures of central tendency     10% knowledge, 10% comprehension                         20%
Measures of variability          10% at each of three levels                              30%
Correlation and regression       6.7%, 10%, 6.7%, and 6.7% across four levels             30%
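The arithmetic behind a percentage-based blueprint is simple enough to automate. The sketch below is our own illustration, not a procedure from the text; it converts the instructional weights shown in Table 7.6 into whole item counts for a test of any length.

```python
# A minimal sketch of the percentage-to-item-count arithmetic described above.
# The content areas and weights mirror Table 7.6; the rounding rule is an
# illustrative assumption, not a prescribed procedure.

def allocate_items(weights, total_items):
    """Convert percentage weights (summing to about 1.0) into whole item counts."""
    counts = {area: round(weight * total_items) for area, weight in weights.items()}
    # Rounding can leave the total slightly off; adjust the largest area to compensate.
    diff = total_items - sum(counts.values())
    if diff != 0:
        largest = max(counts, key=counts.get)
        counts[largest] += diff
    return counts

weights = {
    "Scales of measurement": 0.20,
    "Measures of central tendency": 0.20,
    "Measures of variability": 0.30,
    "Correlation and regression": 0.30,
}

print(allocate_items(weights, 30))  # 6, 6, 9, 9 items
print(allocate_items(weights, 40))  # 0.20 x 40 = 8 items for scales of measurement
```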
Giving students experience with a testing process similar to that of the statewide test has been
shown to improve student performance. Thus, teachers can develop test questions similar in
format to those on the statewide tests. Of course, there is no reason to limit teacher-made tests
to these formats.
With selected-response items the alternatives provide the answer; the student simply selects the appropriate one.
Although there are considerable differences among these selected-response item formats,
we can make some general statements about their strengths and limitations (see Table 7.7).
Strengths include the following:
Naturally, there are limitations associated with the use of selected-response items,
including the following:
Multiple-choice items can be written to assess higher-level objectives, but this often takes a
little more effort and creativity.
- Constructed-response items take more time for students to complete. You cannot include as
many constructed-response items or tasks on a test as you can selected-response items. As a
result, you are not able to sample the content domain as thoroughly.
- Constructed-response items are difficult to score. In addition to scoring being more difficult
and time consuming compared to selected-response items, scoring is more subjective and less
reliable.
In the following chapters we will provide detailed guidelines for developing the
different item formats, but their introduction here will hopefully help you begin to consider
some of the main issues. Some of these suggestions might seem obvious, but sometimes the
obvious is overlooked! Table 7.9 summarizes these suggestions.
Provide Clear Directions. It is common for teachers to take for granted that students
understand how to respond to different item formats. This may not be the case! When creat-
ing a test always include thorough directions that clearly specify how the student should
respond to each item format. Just to be safe, assume that the students have never seen a test
like it before and provide directions in sufficient detail to ensure they know what is expected
of them.
Develop Items and Tasks That Can Be Scored in a Decisive Manner. Ask yourself
whether the items have clear answers that virtually every expert would agree with. In terms
of essays and performance assessments, the question is whether experts would agree about
the quality of performance on the task. The grading process can be challenging even when
your items have clearly “correct” answers. When there is ambiguity regarding what repre-
sents a definitive answer or response, scoring can become much more difficult.
Avoid Inadvertent Cues to the Correct Answers. It is easy for unintended cues to the
correct response to become embedded in a test. These cues have the negative effect of
allowing students who have not mastered the material to correctly answer the item. This
confounds intelligence (i.e., figuring out the correct answer based on detected cues) with
achievement (i.e., having learned the material). To paraphrase Gronlund (1998), only the
students who have mastered an objective should get the item right, and those who have not
mastered it, no matter how intelligent they are, should not get it correct.
Include Test Items and Tasks That Will Result in an Assessment That Produces Reli-
able and Valid Test Results. In the first section of this text we discussed the important
properties of reliability and validity. No matter which format you select for your test, you
should not lose sight of the importance of developing tests that produce reliable and valid
results. To make better educational decisions, you need high-quality information.
How Many Items Should You Include? As is often the case, there is no simple answer
to this question. The optimal number of items to include in an assessment is determined
by factors such as the age of the students, the types of items, the breadth of the material or
topics being assessed (i.e., scope of the test), and the type of test. Let’s consider several of
these factors separately:
Age of Students. For students in elementary school it is probably best to limit regular
classroom exams to approximately 30 minutes in order to maximize effort, concentration,
and motivation. With older students you can increase this period considerably, but it is still
wise to keep the length of the test within the limits of students' attention and stamina.
Types of Items. Obviously, students can complete more true—false items than they can
essay items in a given period of time. Gronlund (2003) estimates that high school students
should be able to complete approximately one multiple-choice item, three true—false items,
or three fill-in-the-blank items in one minute if the items are assessing objectives at the
knowledge level. Naturally, with younger students or more complex objectives, more time
will be needed. When you move to restricted-response essays or performance assessments,
significantly more time will be needed, and when you include extended-response tasks the
time demands increase even more. As we have already alluded to, the inclusion of more
“time-efficient” items will enhance the sampling of the content domain.
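As a rough planning aid, the per-minute rates cited above can be turned into a time estimate. The following sketch is only an illustration; the five-minute allowance for directions and handing out papers is our assumption, not a figure from the text.

```python
# Rough estimate of testing time for high school students answering
# knowledge-level items, using the per-minute rates cited above.
# The 5-minute overhead for directions and distribution is an assumption.

MINUTES_PER_ITEM = {
    "multiple_choice": 1.0,        # about 1 item per minute
    "true_false": 1.0 / 3,         # about 3 items per minute
    "fill_in_the_blank": 1.0 / 3,  # about 3 items per minute
}

def estimated_minutes(item_counts, overhead_minutes=5):
    item_time = sum(MINUTES_PER_ITEM[fmt] * n for fmt, n in item_counts.items())
    return item_time + overhead_minutes

# Example: 20 multiple-choice and 15 true-false knowledge-level items
print(estimated_minutes({"multiple_choice": 20, "true_false": 15}))  # 30.0 minutes
```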
Type and Purpose of the Test. Maximum performance tests can typically be categorized
as either speed or power tests. Pure speed tests generally contain items that are relatively
easy but have strict time limits that prevent examinees from successfully completing all the
items. On pure power tests, the speed of performance is not an issue. Everyone is given
enough time to attempt all the items, but the items are ordered according to difficulty, with
some items being so difficult that no examinee is expected to answer them all. The distinc-
tion between speed and power tests is one of degree rather than being absolute. Most often
a test is not a pure speed test or a pure power test, but incorporates some combination of the
two approaches. The decision to use a speed test, a power test, or some combination of the
two will influence the number and type of items you include on your test.
Scope of the Test. In addition to the speed versus power test distinction, the scope of the test
will influence how many items you include in an assessment. For a weekly exam designed to
assess progress in a relatively narrow range of skills and knowledge, a brief test will likely be
sufficient. However, for a six-week or semester assessment covering a broader range of skills
and knowledge, a more comprehensive (i.e., longer) assessment is typically indicated.
When estimating the time needed to complete the test you should also take into con-
sideration test-related activities such as handing out the test, giving directions, and collect-
ing the tests. Most professional test developers design power tests that approximately 95%
of their samples will complete in the allotted time. This is probably a good rule of thumb for
classroom tests. This can be calculated in the classroom by dividing the number of students
completing the entire test by the total number of subjects.
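Checking this rule of thumb after an administration is a single division; the counts in the sketch below are hypothetical.

```python
# Completion-rate check described above; the counts are hypothetical.

def completion_rate(num_completed, num_students):
    return num_completed / num_students

rate = completion_rate(num_completed=27, num_students=30)
print(f"{rate:.0%} of students finished in the allotted time")  # 90%
print("Meets the 95% guideline" if rate >= 0.95 else "Consider allowing more time")
```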
SPECIAL INTEREST TOPIC 7.1
Suggestions for Reducing Test Anxiety
Research suggests that there is a curvilinear relationship between anxiety and performance. That is,
at relatively low levels anxiety may have a motivating effect. It can motivate students to study in a
conscientious manner and put forth their best effort. However, when anxiety exceeds a certain point it
becomes detrimental to performance. It will enhance the validity of your interpretations if you can re-
duce the influence of debilitating test anxiety. Remember, in most classroom situations you are striv-
ing to measure student achievement, not the impact of excessive anxiety. In this situation test anxiety
is a source of construct-irrelevant variance. By reducing test anxiety, you reduce construct-irrelevant
variance and increase the validity of your interpretations. Researchers have provided suggestions for
helping students control test anxiety (e.g., Hembree, 1988; Linn & Gronlund, 2000; Mealey & Host,
1992; Nitko, 2001; Tippets & Benson, 1989). These suggestions include the following:
- Students with test anxiety may benefit from relaxation training. In many schools students
with debilitating test anxiety may be referred to a school counselor or school psychologist
who can teach them some fairly simple relaxation techniques.
- Although it is good practice to minimize environmental distractions for all students, this is
even more important for highly anxious students. Highly anxious students tend to be more
easily distracted by auditory and visual stimuli than their less anxious peers.
- Do not make the test a do-or-die situation. Although it is reasonable to emphasize the impor-
tance of an assessment, it is not beneficial to tell your students that this will be the most difficult
test they have ever taken or that their future is dependent on their performance on the test.
- Provide a review of the material to be covered on the test before the testing date. This is a
good instructional strategy that can facilitate the integration of material, students will ap-
preciate the review, and anxiety will be reduced.
- Arrange the items on your test from easy to difficult. Have you ever taken a test in which
the first item was extremely difficult or covered some obscure topic you had never heard of?
If so, you probably experienced a sudden drop in confidence, even if you initially felt well
prepared to take the test. To avoid this, many instructors will intentionally start the test with a
particularly easy item. It might not do much from a technical perspective (e.g., item difficulty
or discrimination), but it can have a positive influence on student motivation and morale.
- It is beneficial to have multiple assessments over the course of a grading period rather than
basing everything on one or two assessments. When there are only a limited number of as-
sessments, the stakes may seem so high that student anxiety is increased unnecessarily.
- Prepare all of your students for the test by teaching appropriate test-taking strategies. A novel
or unfamiliar test format provokes anxiety in many students, and this tendency is magnified
in students prone to test anxiety.
- When the students are seated and ready to begin the test, avoid unnecessary discussion before
letting them begin. The students are typically a little “on edge” and anxious to get started. If
the teacher starts rambling about irrelevant topics, this tends to increase student anxiety.
It would be regrettable to develop an exemplary assessment and then have its results compromised by poor
preparation or inappropriate administration procedures. Your goal should be to promote
conditions that allow students to perform their best. Before administering an assessment,
you should take appropriate steps to prepare the students. This can include announcing in
advance when the test will be administered, describing what content and skills will be covered,
the basic parameters of the test (e.g., one-hour test including short-answer and
restricted-response essay items), how it will be scored, and how the results will be used (e.g.,
Linn & Gronlund, 2000). It is also beneficial to give the students
examples of the types of items that will be included on the test and
provide general instruction in basic test-taking skills. You also want to do your best to
minimize excessive test anxiety because it can be a source of construct-irrelevant variance
that undermines the validity of your interpretations. Although stressing the importance of
an upcoming assessment can help motivate students, there is a point at which it is no longer
motivating and becomes counterproductive. Special Interest Topic 7.1 provides some sug-
gestions for helping students manage their anxiety.
The scheduling of an assessment is also a decision that deserves careful consideration.
You should try to schedule the test at a time when the students will not be distracted by other
events. For example, scheduling a test the last day before a big holiday is probably not opti-
mal. In this situation the students are likely to be more focused on the upcoming holiday than
on the test. The same goes for major events at the school. Scheduling tests the day of the big
homecoming game or the senior prom is probably not desirable. Teachers should make every
effort to ensure that the physical environment is conducive to optimal student performance.
You should take steps to ensure that
the room is comfortable (e.g., temperature, proper ventilation), that there is proper lighting,
and that extraneous noise is minimized. Additionally, you should make efforts to avoid any
unexpected interruptions (e.g., ask whether a fire drill is scheduled, place a “Test in Prog-
ress” sign on the door).
Once the students have started the test, be careful about providing help to students.
Students can be fairly crafty when it comes to coaxing information from teachers during a
test. They may come asking for clarification while actually “fishing” for hints or clues to the
answer. As a teacher you do not want to discourage students from clarifying the meaning
of ambiguous items, but you also do not want to inadvertently provide hints to the answer
of clearly stated items. Our suggestion is to carefully consider the student’s question and
determine whether the item is actually ambiguous. If it is, make a brief clarifying comment
to the whole class. If the item is clear and the student is simply fishing for a clue to the an-
swer, simply instruct the student to return to his or her seat and carefully read and consider
the meaning of the item. Finally, take reasonable steps to discourage cheating. Cheating is
another source of construct-irrelevant variance that can undermine the validity of your score
interpretations. Special Interest Topic 7.2 provides some strategies for preventing cheating
on classroom tests.
Summary
In this chapter we addressed the initial steps a teacher should follow in developing class-
room achievement tests. We noted that the first step is to specify the educational objectives
SPECIAL INTEREST TOPIC 7.2
Strategies for Preventing Cheating on Classroom Tests
Cheating on tests is as old as assessment. In ancient China, examinees were searched before taking
civil service exams, and the actual exams were administered in individual cubicles to prevent cheat-
ing. The punishment for cheating was death (Hopkins, 1998). We do not punish cheaters as severely
today, but cheating continues to be a problem in schools. Like test anxiety, cheating is another source
of construct-irrelevant variance that undermines the validity of test interpretations. If you can reduce
cheating you will enhance the validity of your interpretations. Many authors have provided sugges-
tions for preventing cheating (e.g., Hopkins, 1998; Linn & Gronlund, 2000; Popham, 2000). These
include the following:
- Keep the assessment materials secure. Tests and other assessments have a way of getting into
the hands of students. To avoid this, do not leave the assessments in open view in unlocked
offices, make sure that the person copying the tests knows to keep them secure, and number
the tests so you will know if one is missing. Verify the number of tests when distributing them
to students and when picking them up from students.
- Possibly the most commonsense recommendation is to provide appropriate supervision of stu-
dents during examinations. This is not to suggest that you hover over students (this can cause
unnecessary anxiety), but simply that you provide an appropriate level of supervision. This can
involve either observing from a position that provides an unobstructed view of the entire room
or occasionally strolling around the room. Possibly the most important factor is to be attentive
and visible; this will probably go a long way toward reducing the likelihood of cheating.
- Have the students clear their desks before distributing the tests.
- When distributing the tests, it is advisable to individually hand each student a test. This will
help you avoid accidentally distributing more tests than there are students (an accident that
can result in a test falling into the wrong hands).
- If students are allowed to use scratch paper, you should require that they turn this in with
the test.
- When possible, use alternative seating with an empty row of seats between students.
- Create two forms of the test. This can be accomplished by simply changing the order of test
items slightly so that the items are not in exactly the same order. Give students sitting next
to each other alternate forms.
or goals you have for your students. It is important to do this because these objectives will
serve as the basis for your test. In writing educational objectives, we noted that there are
several factors to consider, including the following:
- We described Bloom's taxonomy of cognitive objectives, which presents six hierarchical categories including knowledge, com-
prehension, application, analysis, synthesis, and evaluation.
- Format. Educational objectives are often classified as behavioral or nonbehavioral.
Although behavioral objectives have advantages, if the behavioral format is taken to
the extreme it also has limitations. We noted that it is optimal to have objectives that
are broad enough to help you organize your instruction and testing procedures, but
that also state measurable activities.
RECOMMENDED READINGS
Gronlund, N. E. (2000). How to write and use instructional objectives (6th ed.). Upper Saddle
River, NJ: Merrill/Prentice Hall. This is an excellent example of a text that focuses on the
development of educational objectives.
Lorin, W. (2003). Benjamin S. Bloom: His life, his works, and his legacy. In B. Zimmerman
(Ed.), Educational psychology: A century of contributions (pp. 367-389). Mahwah, NJ:
Erlbaum. This chapter provides a biographical sketch of Dr. Bloom and reviews his influence
in educational psychology.
CHAPTER 8
The Development and Use of Selected-Response Items

CHAPTER HIGHLIGHTS

LEARNING OBJECTIVES
In the last chapter we addressed the development of educational objectives and provided
some general suggestions for developing, assembling, and administering your assessments.
In the next three chapters we will discuss the development of specific types of test items. In
this chapter we will focus on the development of selected-response items. As we noted in
the last chapter, if an item requires a student to select a response from available alternatives,
it is classified as a selected-response item. Multiple-choice, true—false, and matching items
are all selected-response items. If an item requires a student to create or construct a re-
sponse, it is classified as a constructed-response item. Essay and short-answer items are
constructed-response items, but this category also includes other complex activities such as
making a class presentation, composing a poem, or painting a picture.
In this chapter we will address selected-response items in detail. We will discuss their
strengths and weaknesses and provide suggestions for developing effective items. In the next
chapter we will address essay and short-answer items. In Chapter 10 we will address
performance assessments and portfolios, types of constructed-response assessments that have
gained increased popularity in recent years. In these chapters we will focus on items used to
assess student achievement (as opposed to interests, personality characteristics, etc.).
Multiple-Choice Items
Multiple-choice items are by far the most popular of the selected-response items. They
have gained this degree of popularity because they can be used in a variety of content areas
and can assess both simple and complex learning outcomes. Multiple-choice items take the
general form of a question or an incomplete statement with a set of possible answers, one of
which is correct. The part of the item that is either a question or an incomplete statement is
referred to as the stem. The possible answers are referred to as alternatives. The correct
alternative is simply called the answer and the incorrect alternatives are referred to as
distracters (i.e., they serve to “distract” students who do not actually know the correct
response).
Multiple-choice items can be written so the stem is in the form of a direct question or an
incomplete sentence. Most writers prefer the direct-question format because they feel it
presents the problem in the clearest manner. The advantage of the incomplete-sentence format
is that it may present the problem in a more concise manner. If the question is formatted as an
incomplete statement, it is suggested that the omission occur near the end of the stem. Our
recommendation is to use the direct-question format unless the problem can be stated more
concisely using the incomplete-sentence format without any loss of clarity. Examine these
examples of the two formats.
Example 1 Direct-Question Format
1. Which river is the largest in the United States of America?
A. Mississippi <
B. Missouri
C. Ohio
D. Rio Grande
This item uses the correct-answer format: the Mississippi is the largest river in the United
States of America and the other alternatives are incorrect. However, multiple-choice items can
also be written for situations having more than one correct answer. In this best-answer format,
the objective is to identify the “best answer.”
In Example 3 all the variables listed are important to consider when buying a house, but as
almost any realtor will tell you, location is the most important. Most test developers prefer
the best-answer format for two reasons. First, in some situations it is difficult to write an
answer that everyone will agree is correct. The best-answer format allows you to frame it as
an answer that most experts will agree with. Second, the best-answer format often requires
the student to make more subtle distinctions among the alternatives, which results in more
demanding items that measure more complex educational objectives.
- Provide brief but clear directions. Directions should include how the selected alternative
should be marked.
- The item stem should be numbered for easy identification, while the alternatives are indented
and identified with letters.
- Either capital or lowercase letters followed by a period or parenthesis can be used for the
alternatives. If a scoring sheet is used, make the alternative letters on the scoring sheet and
the test as similar as possible.
- There is no need to capitalize the beginning of alternatives unless they begin with a proper
name.
- When the item stem is a complete sentence, there should not be a period at the end of the
alternatives (see Example 4).
- When the stem is in the form of an incomplete statement with the missing phrase at the end of
the sentence, alternatives should end with a period (see Example 5).
- Keep the alternatives in a vertical list instead of placing them side by side because it is
easier for students to scan a vertical list quickly.
- Use correct grammar and formal language structure in writing items.
- All items should be written so that the entire question appears on one page.
The following examples use formats that promote clarity, illustrating many of these suggestions.
Example 4
Directions: Read each question carefully and select the best answer. Circle the
letter of the answer you have selected.
1. Which type of validity study involves a substantial time interval between when the
test is administered and when the criterion is measured?
A. delayed study
B. content study
C. factorial study
D. predictive study <
Example 5
2. The type of validity study that involves a substantial time interval between when the
test is administered and when the criterion is measured is a
A. delayed study.
B. content study.
C. factorial study.
D. predictive study. <
Have the Item Stem Contain All the Information Necessary to Understand the Prob-
lem or Question. When writing multiple-choice items, the problem or question should
be fully developed in the item stem. Poorly developed multiple-choice items often contain
an inadequate stem that leaves the test taker unclear about the central problem or question.
Compare the stems in the following two examples.
Example 6 Poor Item—Inadequate Stem
1. Absolute zero point.
A. interval scale
B. nominal scale
C. ordinal scale
D. ratio scale <
Your students are not mind readers, and item stems that are not fully developed can
result in misinterpretations by students. One way to determine whether the stem is adequate
is to read the stem without examining the alternatives. If the stem is adequate, a knowledge-
able individual should be able to answer the question with relative ease. In Examples 6 and
7, the first item fails this test whereas the second item passes. This test is equally applicable
if the stem is framed as a direct question or as an incomplete statement.
While we encourage you to develop the problem fully in the item stem, it is usually
not beneficial to include irrelevant material in the stem. Consider this example.
In Example 8 the addition of the sentence “There are several different scales of measurement
used in educational settings” does not serve to add clarity. It simply takes more time to read.
Provide between Three and Five Alternatives. Although there is no “correct” number
of alternatives, it is recommended that you use between three and five. Four are most com-
monly used, but some test developers suggest using five to reduce the chance of correctly
guessing the answer. For example, the chance of correctly guessing the answer with three
alternatives is 1 in 3 (i.e., 33%); with four alternatives, 1 in 4 (i.e., 25%); and with five, 1
in 5 (i.e., 20%). The use of five alternatives is probably the upper limit. Many computer
scoring programs accommodate only five alternatives, and it can be difficult to develop
plausible distracters (the addition of distracters that are clearly wrong and not selected by
any students does not reduce the chance of correctly guessing the answer). In some situa-
tions three alternatives may be sufficient. It takes students less time to read and answer
items with three alternatives instead of four (or five), and it is easier to write two good
distracters than three (or four). Certain research even suggests that items with three alterna-
tives can be as effective as items with four or five alternatives (e.g., Costin, 1970; Grier,
1975; Sidick, Barrett, & Doverspike, 1994).
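To see how the number of alternatives affects blind guessing, the short sketch below (our own illustration, using a hypothetical 30-item test) computes the expected number of items answered correctly by guessing alone.

```python
# Expected number of items answered correctly by blind guessing alone,
# for a hypothetical 30-item test with 3, 4, or 5 alternatives per item.

TEST_LENGTH = 30

for num_alternatives in (3, 4, 5):
    p_guess = 1 / num_alternatives
    expected_correct = p_guess * TEST_LENGTH
    print(f"{num_alternatives} alternatives: {p_guess:.0%} per item, "
          f"about {expected_correct:.1f} items by guessing")
```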
Keep the Alternatives Brief and Arrange Them in an Order That Promotes Efficient
Scanning. As we noted, the item stem should contain as much of the content as possible
and should not contain irrelevant material. A correlate of this is that the alternatives should
be as brief as possible. This brevity makes it easier for the students to scan the alternatives
looking for the correct answer. Consider Examples 9 and 10. While they both measure the
same content, the first one contains an inadequate stem and lengthy alternatives whereas the
second one has an adequate stem and brief alternatives.
Double negatives should always be avoided. Logicians know that a double negative
indicates a positive, but students should not have to decipher this logic problem.
Make Sure Only One Alternative Is Correct or Represents the Best Answer. Care-
fully review your alternatives to ensure there is only one correct or best answer. Commonly
teachers are confronted by students who feel they can defend one of the distracters as a correct
answer. It is not possible to avoid this situation completely, but you can minimize it by care-
fully evaluating the distracters. We recommend setting the test aside for a period of time and
returning to it later for proofing. Fatigue and tight deadlines can allow undetected errors.
Occasionally it might be appropriate to include more than one correct alternative in a
multiple-choice item and require the students to identify all of the correct alternatives. It is
usually best to format these questions as a series of true—false items, an arrangement re-
ferred to as a cluster-type or multiple true—false item. See Examples 15 and 16.
Avoid Cues That Inadvertently Identify the Correct Answer. Item stems should not
contain information that gives away the answer. A cue is something in the stem that provides
a clue to the answer that is not based on knowledge. It often involves an association between
the words in the stem and the correct alternative. See Examples 17 and 18.
In Example 17, the use of predict in the stem and predictive in the correct alternative pro-
vides a cue to the correct answer. This is corrected in Example 18. Additionally, in the
second example there is an intentional verbal association between the stem and the first
distracter (i.e., interval). This association makes the first distracter more attractive, particu-
larly to students relying on cues, who do not know the correct answer.
In addition to the stem containing cues to the correct answer, the alternatives can
themselves contain cues. One way to avoid this is to ensure that all alternatives are ap-
proximately equal in length and complexity. In an attempt to be precise, teachers may make
the correct answer longer or more complex than the distracters. This can serve as another
type of cue for students. Although in some cases it might be possible to both maintain
precision and shorten the correct alternative, it is usually easier to lengthen the distracters
(though this does make scanning the alternatives more difficult for students). Compare
Examples 19 and 20.
When dealing with numerical alternatives, the visual characteristics of the choices can
also serve as a cue. Examine the following examples.
In Example 21, the third option (i.e., C) is the only alternative that, like the number in the
stem, has two decimal places. The visual characteristics of this alternative may attract the
student to it independent of the knowledge required to answer it. In Example 22, each alter-
native has an equal number of decimal places and is equally visually attractive.
Make Sure All Alternatives Are Grammatically Correct Relative to the Stem. Gram-
matical cues that may help the uninformed student select the correct answer are usually the
result of inadequate proofreading. Examine the following examples.
In Example 23 the phrase “individuals are” in the stem indicates a plural answer. However,
only the fourth alternative (i.e., D) meets this requirement. This is corrected in Example 24
by ensuring that each alternative reflects a plural answer.
Another common error is inattention to the articles a and an. See the following.
In Example 25, the use of the article a indicates an answer beginning with a consonant in-
stead of a vowel. An observant student relying on cues will select the fourth alternative (i.e.,
D) because it is the only one that is grammatically correct. This is corrected in Example 26
by ensuring that all alternatives begin with consonants and in Example 27 by using a(n) to
accommodate alternatives beginning with either consonants or vowels.
Make Sure No Item Reveals the Answer to Another Item. One item should not contain
information that will help a student answer another item. Also, a correct answer on one item
should not be necessary for answering another item. This would give double weight to the
first item.
Have All Distracters Appear Plausible. Distracters should be designed to distract un-
knowledgeable students from the correct answer. Therefore, all distracters should appear
plausible and should be based on common student errors. For example, what concepts,
terms, events, techniques, or individuals are commonly confused? After you have adminis-
tered the test once, analyze the distracters to determine which are effective and which are
not. Replace or revise the ineffective distracters. There is little point in including a distracter
that can be easily eliminated by uninformed students. This simply wastes time and space.
Use Alternative Positions in a Random Manner for the Correct Answer. The correct
answer should appear in each of the alternative positions approximately the same number of
times. When there are four alternatives (e.g., A, B, C, and D), teachers tend to overuse the
middle alternatives (i.e., B and C). Alert students are likely to detect this pattern and use it to
answer questions of which they are unsure. Students have indicated that when faced with a
question they cannot answer based on knowledge they simply select B or C. Additionally, you
should ensure there is no detectable pattern in the placement of correct answers (e.g., A, C, B,
D, A, etc.). If there is no logical ordering for the alternatives (see the earlier recommendation),
they should be randomly arranged. Attempt random assignment when possible and then once
the test is complete, count the number of times the correct answer appears in each position. If
any positions are over- or underrepresented, make adjustments to correct the imbalance.
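Counting the positions in an answer key is easy to do by hand, but a few lines of code make the check painless. The answer key in the sketch below is hypothetical.

```python
# Count how often each position holds the correct answer, to spot
# over- or underrepresented positions as suggested above. Hypothetical key.
from collections import Counter

answer_key = list("BCADBCCBDABCADCBDACB")  # 20-item key, alternatives A-D

counts = Counter(answer_key)
expected = len(answer_key) / 4
for position in "ABCD":
    print(f"{position}: {counts[position]} times (expected about {expected:.0f})")
```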
Minimize the Use of “None of the Above” and Avoid Using “All of the Above.” There
is some disagreement among test development experts regarding the use of “none of the
above” and “all of the above” as alternatives. The alternative “none of the above” is criti-
cized because it automatically forces the item into a correct-answer format. As noted
earlier,
the correct-answer form is often limited to lower-level educational objectives and easier
items. Although there are times when “none of the above” is appropriate as an alternative,
it should be used sparingly. Testing experts are more unified in their criticism of “all
of the
above” as an alternative. There are two primary concerns. First, students may read alterna-
tive A, see that it is correct, and mark it without ever reading alternatives B, C, and D. In this
situation the response is incorrect because the students did not read all of the alternatives,
not necessarily because they have not mastered the educational objective. Second, students
may know that two of the alternatives are correct and therefore conclude that “all of the
above” is correct. In this situation the response is correct but is based on incomplete knowl-
edge. Our recommendation is to use “none of the above” sparingly and avoid using “all of
the above.”
Avoid Artificially Inflating the Reading Level. Unless it is necessary to state the prob-
lem clearly and precisely, avoid obscure words and an overly difficult reading level. This
does not mean to avoid scientific or technical terms necessary to state the problem, but
simply to avoid the unnecessary use of complex incidental words.
Limit the Use of Always and Never in the Alternatives. The use of always and never
should generally be avoided because it is only in mathematics that their use is typically
justified. Savvy students know this and will use this information to rule out distracters.
Avoid Using the Exact Phrasing from the Text. Most measurement specialists suggest
that you avoid using the exact wording used in a text. Exact phrasing may be appropriate if
rote memorization is what you desire, but it is of limited value in terms of concept formation
and the ability to generalize. Exact phrasing should be used sparingly.
Organize the Test in a Logical Manner. The topics in a test should be organized in a
logical manner rather than scattered randomly. However, the test does not have to exactly
mirror the text or lectures. Strive for an organization that facilitates student performance.
Give Careful Consideration to the Number of Items on Your Test. Determining the
number of items to include on a test is a matter worthy of careful consideration. On one
hand you want to include enough items to ensure adequate reliability and validity. Recall
that one way to enhance the reliability of a score is to increase the number of items that go
into making up the score. On the other hand there is usually a limited amount of class time
allotted to testing. Occasionally teachers will include so many items on a test that students
do not have enough time to make reasoned responses. A test with too many items essen-
tially becomes a “speed test” and unfairly rewards students who respond quickly even if
they know no more than students who were slower in responding.
Companies who publish tests estimate a completion time for each item. For example,
an item may be considered a 30-second item, a 45-second item, or a 60-second item. Mak-
ing similar estimates can be useful, but unless you are a professional test developer you will
probably find it difficult to accurately estimate the time necessary to complete every item.
As a general rule you should allot at least one minute for secondary school students to com-
plete a multiple-choice item that measures a lower-level objective (e.g., Gronlund, 2003).
Younger students or items assessing higher-level objectives typically require more time.
Occasionally you may need to violate one of the guidelines to write the most efficient and effective item. This is clearly
appropriate, but if you find yourself doing it routinely, you are most likely being lazy or
careless in your test preparation. Table 8.1 provides a summary of these guidelines.
As we noted earlier, the multiple-choice format is the most popular selected-response
format. Major strengths of multiple-choice include the following.
With some additional effort, multiple-choice items can be written that measure more complex objectives. Consider
the following example suggested by Green (1981) of an item designed to assess a complex
learning objective.
To answer this item correctly, students must understand that the strength of a correlation is
affected by the variability in the sample. More homogeneous samples (i.e., samples with
less variability) generally result in lower correlations. The students then have to reason that
because Harvard is an extremely selective university, the group of applicants admitted there
would have more homogeneous SAT scores than the national standardization sample. That
is, there will be less variance in SAT scores among Harvard students relative to the national
sample. Because there is less variance in the Harvard group, the correlation will be less than
the national sample (if this is unclear, review the section on correlation one more time). This
illustrates that multiple-choice items can measure fairly complex learning objectives. Spe-
cial Interest Topic 8.1 describes research that found that, contrary to claims by critics,
multiple-choice items do not penalize creative or “deep thinking” students.
SPECIAL INTEREST TOPIC 8.1
Critics of multiple-choice and other selected-response items have long asserted that these items
measure only superficial knowledge and conventional thinking and actually penalize students who
are creative, deep thinkers. In a recent study, Powers and Kaufman (2002) examined the relation-
ship between performance on the Graduate Record Examination (GRE) General Test and selected
personality traits, including creativity, quickness, and depth. In summary, their analyses revealed
that there was no evidence that deeper-thinking students were penalized by the multiple-choice
format. The correlation between GRE scores and Depth were as follows: Analytical = 0.06, Quan-
titative = 0.08, and Verbal = 0.15. The results in terms of creativity were more positive, with the
correlation between GRE scores and Creativity as follows: Analytical = 0.24, Quantitative = 0.26,
and Verbal = 0.29 (all p < 0.001). Similar results were obtained with regard to Quickness, with the
correlation between GRE scores and Quickness as follows: Analytical = 0.21, Quantitative = 0.15,
and Verbal = 0.26 (all p < 0.001). In summary, there is no evidence that individuals who are creative,
deep thinking, and mentally quick are penalized by multiple-choice items. In fact, the research re-
veals modest positive correlations between the GRE scores and these personality traits. To be fair,
there was one rather surprising finding, a slightly negative correlation between GRE scores and
Conscientious (e.g., careful, avoids mistakes, completes work on time). The only hypothesis the
authors proposed was that being “conscientious” does not serve students particularly well on timed
tests, such as the GRE, that place a premium on quick performance.
this tendency does not appear to significantly affect performance on multiple-choice tests
(e.g., Hopkins, 1998).
Multiple-Choice Items Are an Efficient Way of Sampling the Content Domain. Multiple-choice
tests allow teachers to broadly sample the test’s content domain in an efficient manner. That
is, because students can respond to multiple-choice items in a fairly rapid manner, a
sufficient number of items can be included to allow the teacher to adequately sample the
content domain. Again, this enhances the reliability of the test.
Multiple-Choice Items Are Easy to Improve Using the Results of Item Analysis. The
careful use of difficulty and discrimination indexes and distracter analysis can help refine
and enhance the quality of the items.
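The difficulty and discrimination statistics mentioned above are easy to compute. The sketch below is our own illustration: item difficulty is the proportion of students answering correctly, and the discrimination index shown is a simple upper-group minus lower-group comparison; the student data are hypothetical.

```python
# A sketch of the item-analysis statistics mentioned above:
# item difficulty (proportion correct) and a simple upper-lower
# discrimination index. The response data are hypothetical.

def item_difficulty(responses):
    """Proportion of students who answered the item correctly (1 = correct, 0 = incorrect)."""
    return sum(responses) / len(responses)

def discrimination_index(item_responses, total_scores, group_fraction=0.27):
    """Difference in item difficulty between the top and bottom scoring groups."""
    n_group = max(1, int(len(total_scores) * group_fraction))
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low = [item_responses[i] for i in ranked[:n_group]]
    high = [item_responses[i] for i in ranked[-n_group:]]
    return item_difficulty(high) - item_difficulty(low)

# Hypothetical data: 10 students' results on one item and their total test scores
item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
totals = [25, 12, 22, 27, 10, 20, 14, 28, 24, 11]
print(item_difficulty(item))               # 0.6
print(discrimination_index(item, totals))  # positive values indicate better discrimination
```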
Multiple-Choice Items Provide Information about the Type of Errors That Students
Are Making. Teachers can gain diagnostic information about common student errors and
misconceptions by examining the distracters that students commonly endorse. This infor-
mation can be used to improve instruction in the future, and current students’ knowledge
base can be corrected in class review sessions.
Some testing experts support the use of a “correction for guessing” formula with true—false and multiple-
choice items. Proponents of this practice use it because it discourages students from attempting to
raise their scores through blind guessing. The most common formula for correcting for guessing is:

Corrected Score = R - W / (n - 1)

where R is the number of items answered correctly, W is the number of items answered
incorrectly, and n is the number of alternatives per item.

Consider this example: Susan correctly answered 80 multiple-choice items on a 100-item test
(each item having 4 alternatives). She incorrectly answered 12 and omitted 8. Her uncorrected scores
would be 80 (or 80% correct). Applying the formula to these data:

Corrected Score = 80 - 12 / (4 - 1) = 80 - 4 = 76

Susan’s corrected score is 76 (or 76%). Note that the omitted items are not counted in the corrected
score; only the items answered correctly and incorrectly are counted. What the correction formula
does is remove the number of items assumed to be the result of blind guessing.
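Applying the correction to a whole class is straightforward; the following sketch simply implements the formula above (the function name is ours).

```python
# A minimal implementation of the correction-for-guessing formula shown above.
# Omitted items are excluded, exactly as in Susan's example.

def corrected_score(num_right, num_wrong, num_alternatives):
    """Return the guessing-corrected raw score: R - W / (n - 1)."""
    return num_right - num_wrong / (num_alternatives - 1)

# Susan: 80 right, 12 wrong, 8 omitted on a 100-item, 4-alternative test
print(corrected_score(num_right=80, num_wrong=12, num_alternatives=4))  # 76.0
```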
Should you use a correction for guessing? This issue has been hotly debated among assess-
ment professionals. The debate typically centers on the assumptions underlying the correction for-
mula. For example, the formula is based on the questionable assumption that all guesses are random,
and none are based on partial knowledge or understanding of the item content. Probably anyone who
has ever taken a test knows that all guesses are not random and that sometimes students are able to
rule out some alternatives using partial knowledge of the item content. As a result, many assessment
experts don’t recommend using a correction for guessing with teacher-made classroom tests. Some
authors suggest that their use is defensible in situations in which students have insufficient time to
answer all the items or in which guessing is contraindicated due to the nature of the test content (e.g.,
Linn & Gronlund, 2000). Nevertheless, on most classroom assessments a correction for guessing is
not necessary. In fact, in most situations the relative ranking of students using corrected and uncor-
rected scores will be about the same (Nitko, 2001).
Two related issues need to be mentioned. First, if you are not using a correction for guessing,
your students should be encouraged to attempt every item. If you are using a correction for guessing,
your students should be informed of this, something along the lines of “Your score will be corrected
for guessing, so it is not in your best interest to guess on items.”
The second issue involves professionally developed standardized tests. When you are using a
standardized test it is imperative to strictly follow the administration and scoring instructions. If the
test manual instructs you to use a correction for guessing, you must apply it for the test’s normative
data to be usable. If the test manual instructs you simply to use the “number correct” when calculat-
ing scores, these instructions should be followed. In subsequent chapters we will describe the admin-
istration and use of professionally developed standardized tests.
Although multiple-choice items have many strengths to recommend their use, they do
have limitations. These include the following.
Multiple-Choice Items Are Not Easy to Write. Although ease and objectivity of scoring are
advantages of multiple-choice items, it does take time and effort to write effective items with
plausible distracters. In addition, multiple-choice items are not effective for measuring all
educational objectives.

In summary, multiple-choice items are by far the most popular selected-response format. They
have many advantages and few weaknesses. As a result, they are often the preferred format for
professionally developed tests. When skillfully developed, they can contribute to the
construction of psychometrically sound classroom tests. Table 8.2 summarizes the strengths and
weaknesses of multiple-choice items.
True—False Items
The next selected-response format we will discuss is the true—false format. True—false items
are very popular, second only to the multiple-choice format. We will actually use the term
true—false items to refer to a broader class of items. Sometimes this category is referred to as
binary-choice items, two-option items, or alternate-choice items. The common factor is that
all these items involve a statement or question that the student marks as true or false, agree or
disagree, correct or incorrect, yes or no, fact or opinion, and so on. Because the most common
form is true-false, we will use this term generically to refer to all two-option items.
The following are examples of true-false items. Example 29 takes the form of the traditional
true-false format. Example 30 takes the form of the correct-incorrect format. We also provide
examples of the type of directions needed with these questions.
Two variations of the true—false format are fairly common and deserve mention. The
first is the multiple true—false format we briefly mentioned when discussing multiple-choice
items. On traditional multiple-choice items the student must select one correct answer from
the alternatives, whereas on multiple true—false items the student indicates whether each one
of the alternatives is true or false. Frisbie (1992) provides an excellent discussion of the
multiple true—false format. Example 31 is a multiple true—false item.
In the second variation of the traditional true—false format, the student is required to
correct false statements. This is typically referred to as true—false with correction format.
With this format it is important to indicate clearly which part of the statement may be
changed by underlining it (e.g., Linn & Gronlund, 2000). Consider Example 32.
Although this variation makes the true—false items more demanding and less suscepti-
ble to guessing, it also introduces some subjectivity in scoring, which may reduce reliability.
Example 33 contains two ideas, one that is correct and one that is false. Therefore it is par-
tially true and partially false. This can cause confusion as to how students should respond.
Examples 34 and 35 each address only one idea and are less likely to be misleading.
Avoid Specific Determiners and Qualifiers That Might Serve as Cues to the Answer.
Specific determiners such as never, always, none, and all occur more frequently in false
statements and serve as cues to uninformed students that the statement is too broad to be
true. Accordingly, moderately worded statements including usually, sometimes, and fre-
quently are more likely to be true and these qualifiers also serve as cues to uninformed
students. Although it would be difficult to avoid using qualifiers in true—false items, they
can be used equally in true and false statements so their value as cues is diminished. Exam-
ine the following examples.
In Example 36 always may alert a student that the statement is too broad to be true. Ex-
ample 37 contains the qualifier usually, but the statement is false so a student relying on cues
would not benefit from it.
Avoid Negative Statements. Avoid using statements that contain no, none, and not. The
use of negative statements can make the statement more ambiguous, which is not desirable.
The goal of a test item should be to determine whether the student has mastered a learning
objective, not to see whether the student can decipher an ambiguous question.
Avoid Long and/or Complex Statements. All statements should be presented as clearly
and concisely as possible. As noted in the previous guideline, the goal is to make all state-
ments clear and precise.
Avoid Including the Exact Wording from the Textbook. As on multiple-choice items
you should avoid the exact wording used in a text. Students will recognize this over time,
and it tends to reward rote memorization rather than the development of a more thorough
understanding of the content (Hopkins, 1998). Table 8.3 provides a summary of the guide-
lines for developing true—false items.
Testing experts provide mixed evaluations of true—false items. Some experts are advo-
cates of true—false items whereas others are much more critical of this format. We tend to
fall toward the more critical end of the continuum.
True—false items are effective at sampling the content domain and can be scored in a reliable manner.

True—False Items Are Efficient. Students can respond quickly to true—false items, even quicker than they can to multiple-choice items. This allows the inclusion of more items on a test designed to be administered in a limited period of time.
True—False Items Are Subject to Response Sets. True—false items are considerably
more susceptible to the influence of response sets than are other selected-response items.
True—False Items Provide Little Diagnostic Information. Teachers can often gain
diagnostic information about common student errors and misconceptions by examin-
ing incorrect responses to other test items, but true—false items provide little diagnostic
information.
True—False Items May Produce a Negative Suggestion Effect. Some testing experts
have expressed concern that exposing students to the false statements inherent in true—false
items might promote learning false information (e.g., Hopkins, 1998), called a negative
suggestion effect.
Effective True—False Items Appear Easy to Write to the Casual Observer. This Is
Not the Case! Most individuals believe that true—false items are easy to write. Writing
effective true—false items, like all effective test items, requires considerable thought and
effort. Simply because they are brief does not mean they are easy to write.
Matching Items
The final selected-response format we will discuss is matching items. Matching items
usually contain two columns of words or phrases. One column contains words or phrases
for which the student seeks a match. This column is traditionally placed on the left and
the phrases are referred to as premises. The second column contains words that are
available for selection. The items in this column are referred to as responses. The prem-
ises are numbered and the responses are identified with letters. Directions are provided
that indicate the basis for matching the items in the two lists. Here is an example of a
matching item.
Column A                                                      Column B

_____ 1. Helps initiate and control rapid movement           a. basal ganglia
         of the arms and legs.                                b. cerebellum
_____ 2. Serves as a relay station connecting different      c. corpus callosum
         parts of the brain.                                  d. hypothalamus
_____ 3. Is involved in the regulation of basic drives       e. limbic system
         and emotions.                                        f. medulla
_____ 4. Helps control slow, deliberate movements             g. thalamus
         of the arms and legs.
_____ 5. Connects the two hemispheres.
_____ 6. Controls the release of certain hormones
         important in controlling the internal
         environment of the body.
This item demonstrates an imperfect match because there are more responses than premises.
Additionally, the instructions indicate that each response may be used once, more than
once, or not at all. These procedures help prevent students from matching items simply by
elimination.
Column A                                                      Column B

_____ 1. Most populous U.S. city.                             a. Amazon
_____ 2. Largest country in South America.                    b. Brazil
_____ 3. Largest river in the Western Hemisphere.             c. Lake Superior
_____ 4. Canada's leading financial and manufacturing         d. Mississippi
         center.                                              e. New York City
_____ 5. Largest freshwater lake in the world.                f. Nicaragua
_____ 6. Largest country in Central America.                  g. Toronto
Although this is an extreme example, it does illustrate how heterogeneous lists can undermine
the usefulness of matching items. For example, premise 1 asks for the most populous U.S. city
and the list of responses includes only two cities, only one of which is in the United States.
Premise 2 asks for the largest country in South America and the list of responses includes only
two countries, only one of which is in South America. In these questions students do not have
to possess much information about U.S. cities or South America to answer them correctly. It
would have been better to develop one matching list to focus on U.S. cities, one to focus on
countries in the Western Hemisphere, one to focus on major bodies of water, and so forth.
Indicate the Basis for Matching Premises and Responses in the Directions. Clearly
state in the directions the basis for matching responses to premises. You may have noticed
that in our example of a poor heterogeneous item (Example 39), the directions do not clearly
specify the basis for matching. This was not the case with our earlier example involving
brain functions and brain structures (Example 38). If you have difficulty specifying the basis
for matching all the items in your lists, it is likely that your lists are too heterogeneous.
Review Items Carefully for Unintentional Cues. Matching items are particularly sus-
ceptible to unintentional cues to the correct response. In Example 39, the use of lake in
premise 5 and response c may serve as a cue to the correct answer. Carefully review match-
ing lists to minimize such cues.
Include More Responses than Premises. By including more responses than premises,
you reduce the chance that an uninformed student can narrow down options and success-
fully match items by guessing.
Indicate That Responses May Be Used Once, More than Once, or Not at All. By
adding this statement to your directions and writing responses that are occasionally used
more than once or not at all, you also reduce the impact of guessing.
Limit the Number of Items. For several reasons it is desirable to keep the list of items
fairly brief. It is easier for the person writing the test to ensure that the lists are homogeneous
when the lists are brief. For the student taking the test, it is easier to read and respond to a
shorter list of items. Although there is not universal agreement regarding the number of
items to include in a matching list, a maximum of ten appears reasonable with lists between
five and eight items generally recommended.
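For teachers who assemble tests electronically, several of these structural guidelines can be checked automatically. The following is a minimal illustrative sketch in Python; the MatchingItem structure and the specific checks are our own hypothetical example, not a prescribed tool.

```python
# Illustrative sketch: a hypothetical MatchingItem structure and a few automated
# checks based on the guidelines above (more responses than premises, a brief list,
# directions that allow responses to be reused).

from dataclasses import dataclass
from typing import List

@dataclass
class MatchingItem:
    directions: str
    premises: List[str]      # numbered column (left)
    responses: List[str]     # lettered column (right)

def check_matching_item(item: MatchingItem) -> List[str]:
    """Return warnings for guidelines the item appears to violate."""
    warnings = []
    if len(item.responses) <= len(item.premises):
        warnings.append("Include more responses than premises to reduce guessing.")
    if not (5 <= len(item.premises) <= 10):
        warnings.append("Aim for roughly five to eight premises, ten at most.")
    if "more than once" not in item.directions.lower():
        warnings.append("State that responses may be used once, more than once, or not at all.")
    return warnings

# Example use with a deliberately short premise list
item = MatchingItem(
    directions="Match each brain function in Column A with the structure in Column B. "
               "Each response may be used once, more than once, or not at all.",
    premises=["Connects the two hemispheres.", "Controls slow, deliberate movements."],
    responses=["basal ganglia", "cerebellum", "corpus callosum"],
)
print(check_matching_item(item))   # warns that the premise list is shorter than recommended
```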
Ensure That the Responses Are Brief and Arrange Them in a Logical Order. Stu-
dents should be able to read the longer premises and then scan the briefer responses in an
efficient manner. To facilitate this process, keep the responses as brief as possible and ar-
range them in a logical order when appropriate (e.g., alphabetically, numerically).
Place All Items on the Same Page. Finally, keep the directions and all items on one
page. It greatly reduces efficiency in responding if the students must turn the page looking
for responses. Students also are more likely to transpose a letter or number if they have to
look back and forth across two pages, leading to errors in measuring what the student has
learned. Table 8.5 summarizes the guidelines for developing matching items.
Testing experts generally provide favorable evaluations of the matching format. Al-
though this format does not have as many advantages as multiple-choice items, it has fewer
limitations than the true—false format.
Matching items can be scored in a reliable manner, are efficient, and are relatively simple to write.

Matching Items Are Efficient. They take up little space and students can answer many items in a relatively brief period.

Matching Items Are Relatively Simple to Write. Matching items are relatively easy to write, but they still take time, planning, and effort. The secret to writing good matching items is developing two homogeneous sets of items to be matched and avoiding cues to the correct answer. If they are not developed well, efficiency and usefulness are lost.
TABLE 8.5  Guidelines for Developing Matching Items (excerpt)

7. Are the responses brief and arranged in a logical order?
8. Are all the items on the same page?
Matching Items Are Limited in the Types of Learning Outcomes They Can Measure. Matching items are best suited to factual knowledge and simple associations; much of what we teach students involves greater understanding and higher-level skills.
Matching items are fairly limited in scope and may promote rote memorization.

Matching Items May Promote Rote Memorization. Due to their focus on factual knowledge and simple associations, the use of matching items may encourage rote memorization.
Matching Items Are Vulnerable to Cues That Increase the
Chance of Guessing. Unless written with care, matching items are particularly suscep-
tible to cues that accidentally suggest the correct answer.
It Is Often Difficult to Develop Homogeneous Lists of Relevant Material. When
developing matching items, it is often difficult to generate homogeneous lists for matching.
As a result, there are two common but unattractive outcomes: the lists may become heterogeneous, or information that is homogeneous but trivial may be included. Neither outcome is desirable because both undermine the usefulness of the items.
In summary, matching items are a prevalent selected-response format. They can be
scored in an objective manner, are relatively easy to write, and are efficient. They do have
weaknesses, including being limited in the types of learning outcomes they can measure and
potentially encouraging students to simply memorize facts and simple associations. You
also need to be careful when writing matching items to avoid cues that inadvertently provide
hints to the correct answer. Nevertheless, when dealing with information that has a common
theme and that lends itself to this item format, they may be particularly useful. Table 8.6
provides a summary of the strengths and weaknesses of matching items.
Summary
All test items can be classified as either selected-response items or constructed-response
items. Selected-response items include multiple-choice, true—false, and matching items
whereas constructed-response items include essay items, short-answer items, and perfor-
mance assessments. We discussed each specific selected-response format, describing how
to write effective items and their individual strengths and weaknesses.
Have you ever heard that it is usually not in your best interest to change your answer on a multiple-
choice test? Many students and educators believe that you are best served by sticking with your first
impression. That is, don't change your answer. Surprisingly, this is not consistent with the research! Pike (1979) reviewed the literature and concluded that, contrary to popular belief, answer changes are more likely to raise scores than to lower them.
This does not mean that you should encourage your students to change their answers on a whim.
However, if students feel a change is indicated based on careful thought and consideration, they
should feel comfortable doing so. Research suggests that they are probably doing the right thing to
enhance their score.
Multiple-choice items are the most popular selected-response format. They have nu-
merous strengths including versatility, objective and reliable scoring, and efficient
sampling of the content domain. The only weaknesses are that multiple-choice items
are not effective for measuring all learning objectives (e.g., organization and presenta-
tion of material, writing ability, performance tasks), and they are not easy to develop.
Testing experts generally support the use of multiple-choice items as they can contrib-
ute to the development of reliable and valid assessments.
True—False items are another popular selected-response format. Although true—false
items can be scored in an objective and reliable manner and students can answer many
items in a short period of time, they have numerous weaknesses. For example, they
are limited to the assessment of fairly simple learning objectives and are very vulner-
able to guessing. Although true—false items have a place in educational assessment,
before using them we recommend that you weigh their strengths and weaknesses and
ensure that they are the most appropriate item format for assessing the specific learn-
ing objectives.
Matching items were the last selected-response format we discussed. These items can
be scored in an objective and reliable manner, can be completed in a fairly efficient
manner, and are relatively easy to develop. Their major limitations include a rather
limited scope and the possibility of promoting rote memorization of material by your
students. Nevertheless, carefully developed matching items can effectively assess
lower-level educational objectives.
In the next two chapters we will address constructed-response items, including essays,
short-answer items, performance assessments, and portfolios. We stated earlier in the textbook
that typically the deciding factor when selecting an assessment or item format involves iden-
tifying the format that most directly measures the behaviors specified by the educational ob-
jectives. The very nature of some objectives mandates the use of constructed-response items
(e.g., writing a letter), but some objectives can be measured equally well using either selected-
response or constructed-response items. If after thoughtful consideration you determine that
both formats are equally well suited, we typically recommend the use of selected-response
items because they allow broader sampling of the content domain and can be scored in a more
reliable manner. However, we do not want you to think that we have a bias against construct-
ed-response items. We believe that educational assessments should contain a variety of assess-
ment procedures that are individually tailored to assess the educational objectives of interest.
RECOMMENDED READING
Aiken, L. R. (1982). Writing multiple-choice items to measure higher-order educational objectives. Educational & Psychological Measurement, 42, 803-806. A respected author presents suggestions for writing multiple-choice items that assess higher-order learning objectives.

Beck, M. D. (1978). The effect of item response changes on scores on an elementary reading achievement test. Journal of Educational Research, 71, 153-156. This article is an example of the research that has examined the issue of students changing their responses on achievement tests. A good example!

Dewey, R. A. (2000, December 12). Writing multiple choice items which require comprehension. Retrieved November 29, 2004, from www.psywww.com/selfquiz/aboutq.htm. At this site the author provides some good suggestions for making multiple-choice distracters more attractive.

Ebel, R. L. (1970). The case for true—false items. School Review, 78, 373-389. Although many assessment experts are opposed to the use of true—false items for the reasons cited in the text, Ebel comes to their defense in this article.

Sidick, J. T., Barrett, G. V., and Doverspike, D. (1994). Three-alternative multiple-choice tests: An attractive option. Personnel Psychology, 47, 829-835. In this study the authors compare tests with three-choice multiple-choice items with ones with five-choice items. The results suggest that both have similar measurement characteristics and that a case can be made supporting the use of three-choice items.
CHAPTER HIGHLIGHTS
LEARNING OBJECTIVES
We have noted that most test items used in the classroom can be classified as either
selected-response or constructed-response items. If an item requires a student to select a response from a set of provided alternatives, it is classified as a selected-response item; if it requires the student to construct a response, it is classified as a constructed-response item.
on their memory. In oral examinations a premium is often placed on the student’s facility
with oral responding. For example, students are rarely given extended time to formulate
a response, and hesitation is often taken as lack of knowledge. With this arrangement the
achievement being measured may be achievement in the articulation of subject matter rather
than knowledge of subject matter. If that is the expressed educational objective, as in rhetoric
or debate, then the oral test is clearly appropriate. Otherwise it may not be a valid measure of
the specified educational objectives. A final limitation of oral testing is one first recognized
during the nineteenth-century industrialization process: inefficiency. The total testing time
for a class is equal to the number of students times the number of minutes allotted to each
student times the number of examiners. As you can see, the testing time adds up quickly.
Teachers are very busy professionals and time is at a premium. All of these shortcomings of
oral testing provide sufficient reason for its use to be quite restricted.
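A quick calculation makes the inefficiency concrete. The figures below (class size, minutes per student, number of examiners) are hypothetical; the sketch simply multiplies them as described above.

```python
# Total oral-testing time = students x minutes per student x examiners.
# The numbers below are hypothetical, chosen only to show how quickly the time adds up.

students = 25        # students in the class
minutes_each = 10    # minutes allotted to each student
examiners = 2        # examiners present for each examination

total_minutes = students * minutes_each * examiners
print(f"Total examiner time: {total_minutes} minutes (~{total_minutes / 60:.1f} hours)")
# Total examiner time: 500 minutes (~8.3 hours)
```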
Essay Items
An essay item poses a question or problem for the student to respond to in a written format.

An essay item is a test item that poses a question or problem for the student to respond to in a written format. Because it is a constructed-response item, the student must construct a response rather than select among alternatives. Although essay items vary in the degree of structure they impose on the student's response, they generally provide considerable freedom to the student in composing a response. Good essay items challenge the
student to organize, analyze, integrate, and synthesize information. At their best, essay items
elicit novel and creative cognitive processes from students. At their worst they present an
ambiguous task to students that is difficult, if not impossible, to score in a reliable manner.
Written essay items were used in Chinese civil service examinations as long as two
thousand years ago, but they did not become popular in Western civilization until much later.
In the nineteenth century, technical developments (e.g., increased availability of paper, devel-
opment of lead pencils, now principally using graphite) made written examinations cheaper
and more practical in both America and Europe. About the same time, Horace Mann, an in-
fluential nineteenth-century educator, argued against the evils of oral testing and for the superiority
of the written essay. This set the stage for the emergence of essay (and other constructed-re-
sponse) tests. Although essay items have their own limitations, they have addressed some of
the problems associated with oral testing. They afforded more uniformity in test content (i.e.,
students get the same questions presented in the same order), there was a written record of the
student’s response, and they were more efficient (i.e., they take less testing time).
Essay items can be classified according to their educational purpose or focus (i.e.,
evaluating content, style, or grammar), the complexity of the task presented (e.g., knowl-
edge, comprehension, application, analysis, synthesis, and evaluation), and how much
structure they provide (restricted or extended response). We will begin by discussing how
essay items can vary according to their educational purpose.
Table 9.1 illustrates different purposes for which essay items are typically used when assessing student achievement. The major purposes are for assessing content, style, and grammar (which we take to include writing mechanics such as spelling). In Table 9.1 the three purposes have been crossed to form a nine-element matrix. These elements represent different assessment goals. The elements in the diagonal each address a single purpose: content only, style only, and grammar only.

Essay items can be scored in terms of content, style, and grammar.

TABLE 9.1  Purposes of Essay Items

           Content                          Style                         Grammar

Content    Assess cognitive objectives      Assess content and            Assess content and
           or knowledge of content only     writing style                 grammar

Style                                       Assess writing ability        Assess writing style
                                            and style only                and grammar

Grammar                                                                   Assess grammar only
All the other elements (i.e., off-diagonal elements) combine two purposes. For ex-
ample, the purpose of an essay item could involve the combination of content-style, content—
grammar, or style-grammar. Although not represented in this matrix, an item may have a
three-element purpose in which the student’s essay is evaluated in terms of content, style, and
grammar. This latter purpose is often encountered in take-home assignments such as reports,
term papers, or final examinations. All of these different combinations may be considered
essay examinations, and the topics of this chapter will apply in varying degrees to the ele-
ments in Table 9.1.
While theoretically essay items can be scored independently based on these three pur-
poses, this is actually much more difficult than it appears. For example, research has shown
that factors such as grammar, penmanship, and even the length of the answer influence the
scores assigned, even when teachers are instructed to grade only on content and disregard
style and grammar. We will come back to this and other scoring issues later in this chapter.
Knowledge. At the knowledge level, essay items are likely to include verbs such as define,
describe, identify, list, and state. A knowledge level item follows:
Comprehension. Comprehension level essay items often include verbs such as explain,
paraphrase, summarize, and translate. An example of a comprehension level question
follows:
Knowledge and comprehension level objectives can also be assessed with selected-
response items (e.g., multiple-choice), but there is a distinction. Selected-response items re-
quire only recognition of the correct answer, whereas essay items require recall. That is, with
essay items students must remember the correct answer without having the benefit of having
it in front of them. There are instances when the recall/recognition distinction is important and
instances when it is not. When recall is important, essay items should be considered.
Application. Application level essay items typically include verbs such as apply, com-
pute, develop, produce, solve, and use. An example of an application level item follows:
Analysis. Analysis level items are frequently encountered in essay tests. Verbs used at the
analysis level include analyze, break down, differentiate, illustrate, outline, and summarize.
Consider this example:
In Example 4 the student is asked to analyze the material by identifying and describing
the effects of the Industrial Revolution on educational testing. Many teachers simply use
the verb discuss in this context (i.e., Discuss the effects . . . ), and this may be acceptable
if the students understand what is expected of them. Otherwise, it introduces ambiguity
into the assessment process, something to be avoided.
Synthesis. Essay items written at the synthesis level require students to create something
new and original. Verbs often encountered at the synthesis level include compose, create,
develop, and design. Here is an example of an essay item at the synthesis level:
Evaluation. The evaluation level of the cognitive taxonomy requires judgments concern-
ing value and worth. Essay items written at this level often involve “choice” as in the next
example.
Here students select the best item type, a subjective choice, and defend their selection.
Words most often used in essay items at the evaluation level include appraise, choose, criti-
cize, debate, evaluate, judge, and others involving a determination of worth.
In place of a personal subjective choice, the criteria may be defined in terms of scien-
tific, legal, social, or other external standards. Consider this example:
In Example 7 the students must choose a position and build a case for it based on their un-
derstanding of current mores and standards as well as psychometric expertise.
Extended-response items provide more latitude and flexibility in how the students can
respond to the item. There is little or no limit on the form and scope of the response. When limi-
tations are provided, they are usually held to a minimum (e.g., page and time limits). Extended-
response items often require students to compose, summarize, formulate, compare/contrast, interpret, and so forth. Examples of extended-response items include the following:
Extended-response items provide less structure and this promotes greater creativity,
integration, and organization of material.
As you might expect, restricted-response and extended-response essay items have
their own strengths and limitations. Restricted-response essay items are particularly good
for assessing objectives at the knowledge, comprehension, and application levels. They
can be answered in a timely fashion by students, which allows you to include more items,
and they are easier to score in a reliable manner than extended-response items. In contrast,
extended-response items are particularly well suited for assessing higher-level cognitive ob-
jectives. However, they are difficult to score in a reliable manner and because they take con-
siderable time for students to complete, you typically have to limit your test to relatively few
items, which results in limited sampling of the content domain. Although restricted-response
items have the advantage of more reliable and efficient scoring, along with better sampling of
the content domain, certain learning objectives simply require the use of extended-response
essay items. In these situations, it is important to write and score extended-response items as
carefully as possible and take into consideration the limitations. To that end, we will start by
giving you some general guidelines for writing good essay items.
Consider Carefully the Amount of Time Students Will Need to Respond to the Essay
Items. This is a practical recommendation that you pay attention to the amount of time
the students will need to complete each essay item. For example, you might estimate that
students need approximately 15 minutes to complete one item and 30 minutes for another.
As a general rule, teachers tend to underestimate the time students need to respond to essay
items. As teachers we may estimate only the time necessary to write the response whereas
students actually need time to collect and organize their thoughts before even starting the
writing process. As a rule of thumb, we recommend you construct a test you think is ap-
propriate to the available time and reduce it in length by about 25%.
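The rule of thumb can be sketched as a simple calculation. The per-item time estimates below are hypothetical, and the 25% reduction is applied by planning for roughly three-quarters of the available class period.

```python
# Rule of thumb from the guideline above: estimate the time for the test you would
# like to give, then cut its length so students are not rushed. Item estimates are hypothetical.

estimated_minutes_per_item = [15, 15, 30, 20]   # planned essay items
class_period = 60                               # minutes available

planned_time = sum(estimated_minutes_per_item)
target_time = class_period * 0.75               # leave roughly 25% slack

print(f"Planned: {planned_time} min, target: {target_time:.0f} min")
if planned_time > target_time:
    print("Drop or shorten items until the estimate fits the target.")
```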
Do Not Allow Students to Select the Items to Which They Will Respond. Some teach-
ers provide a number of items and allow the students to select a specified number of items
to respond to. For example, a test might include eight items and the students are required
to select five items to respond to. As a general rule this practice is to be avoided. When stu-
dents respond to different items, they are essentially taking different tests. When they take
different tests, they cannot be evaluated on a comparative basis. In addition, when students
respond only to the items they are best prepared for or knowledgeable about, you get a less
representative sample of their knowledge (e.g., Gronlund, 1998). As you know, anything
that results in less effective content sampling compromises the measurement properties of
the test.
Limit the Use of Essay Items to Educational Objectives That Cannot Be Measured
Using Selected-Response Items. While essays are extremely popular among many
teachers and have their strengths, they do have limitations that we have alluded to and will
outline in the next section. For now, we just want to recommend that you restrict the use
of essay items to the measurement of objectives that cannot be measured adequately using
selected-response items. For example, if you want to assess the student’s ability to organize
and present material in a written format, an essay item would be a natural choice. These
guidelines are summarized in Table 9.2.
So far we have alluded to some of the strengths and weaknesses of essay items, and
this is probably an opportune time to discuss them more directly.
Essay Items Require Recall, Not Just Recognition. Essay items require the student to recall the material, often denoting stronger mastery of the material than recognition, as required with selected-response items.
The Use of Essay Items Largely Eliminates Blind Guessing. Because essay items
require the student to produce a response as opposed to simply selecting one, students are
not able to guess successfully the desired answer.
When Studying for Essay Tests, Students May Spend Less Time on Rote Memoriza-
tion and More Time Analyzing and Synthesizing Information. Many teachers believe
that students study differently for essay tests than they do for selected-response tests, and
some research supports this claim (e.g., Coffman, 1972; Hakstian, 1971). It is possible that
students preparing for essay tests spend more time analyzing and synthesizing information
rather than memorizing facts. Hopkins (1998) suggests that teachers may combine a few
essay items with selected-response items to achieve this potential instructional benefit.
Expectancy Effects. Expectancy effects occur when the teacher scoring the test allows
irrelevant characteristics of the student to affect scoring. This is also referred to as the “halo
effect.” For example, if a teacher has a favorable overall impression of a student with a history
of academic excellence, the teacher might be inclined to assign a higher score to the student’s
responses (e.g., Chase, 1979). In contrast, a teacher might tend to be more critical of a re-
sponse by a student with a poor academic record who is viewed as difficult or apathetic. These
effects are not typically intentional or even conscious, but they are often present nevertheless.
Similar effects can also carry over from one item to the next within a test. That is, if you see
that a student performed well on an earlier item, it might influence scoring on later items.
Handwriting, Grammar, and Spelling Effects. Research dating from the 1920s has shown
that teachers are not able to score essay items solely on content even when they are in-
structed to disregard style and handwriting, grammar, and spelling effects (e.g., James,
1927; Sheppard, 1929). For example, good handwriting raises scores and poor handwriting,
misspellings, incorrect punctuation, and poor grammar reduce scores even when content is
the only criterion for evaluation. Even the length of the response impacts the score. Teachers
tend to give higher scores to lengthy responses, even when the content is not superior to that
of a shorter response (Hopkins, 1998), something students have long suspected!
Order Effects. Order effects are changes in scoring that emerge during the grading pro-
cess. As a general rule, essays scored early in the grading process receive better grades than
essays scored later (Coffman & Kurfman, 1968; Godshalk, Swineford, Coffman, & ETS,
1966). Research has also shown that the quality of preceding responses impacts the scores
assigned. That is, essays tend to receive higher scores when they are preceded by poor-
quality responses as opposed to when they are preceded by high-quality responses (Hales
& Tokar, 1975; Hughes, Keeling, & Tuck, 1980).
Fatigue Effects. The teacher’s physical and cognitive abilities are likely to degrade if essay
scoring continues for too long a period. The maximum period of time will probably vary
according to the complexity of the responses, but reading essays for more than two hours
without sufficient breaks will likely produce fatigue effects.
As you can see a number of factors can undermine reliability when scoring essay items.
In earlier chapters we emphasized the importance of reliability, so this weakness should be
given careful consideration when developing and scoring essay items. It should also be noted
that reduced reliability undermines the validity of the interpretation of test performance.
Restricted Sampling of the Content Domain. Because essay items typically require a
considerable amount of time to evaluate and to construct a response to, students are able
to respond to only a few items in a testing period. This results in limited sampling of the
content domain and potentially reduced reliability. This is particularly true of extended-re-
sponse essay items but may also apply to restricted-response items.
Scoring Essay Items Is Time Consuming. In addition to it being difficult to score essay
items in a reliable manner, it is a tedious, time-consuming process. Although selected-response
items tend to take longer to develop, they can usually be scored easily, quickly, and reliably.
Bluffing. Although the use of essay items eliminates random guessing, bluffing is intro-
duced. Bluffing occurs when a student does not possess the knowledge or skills to respond
to the item, but tries to “bluff” or feign a response. Due to the subjective nature of essay
scoring, student bluffing may result in them receiving partial or even full credit. Experience
has shown that some students are extremely proficient at bluffing. For example, a student
may be aware that teachers tend to give lengthy responses more credit and so simply reframe
the initial question as a statement and then repeat the statement in slightly different ways.
Table 9.3 provides a summary of the strengths and weaknesses of essay items.
For restricted-response items, a model or sample answer may be all that is required. For extended-response items, due to the freedom given to the student, it may not be
possible to write a sample answer that takes into consideration all possible “good” responses.
For items at the synthesis and evaluation levels, new or novel responses are expected. As a
result, the exact form and content of the response cannot be anticipated, and a simple model
response cannot be delineated.
Scoring rubrics are often classified as either analytic or holistic. Analytic scoring
rubrics identify different aspects or dimensions of the response and the teacher scores each
dimension separately. For example, an analytic scoring rubric might distinguish between
content, writing style, and grammar/mechanics. With this scoring rubric the teacher will
score each response in terms of these three categories. With analytic rubrics it is usually nec-
essary to specify the value assigned to each characteristic. For example, for a 15-point essay
item in a social science class wherein the content of the response is of primary concern,
the teacher may designate 10 points for content, 3 points for writing style, and 2 points for
grammar/mechanics. If content were of equal importance with writing style and grammar/
mechanics, the teacher could assign 5 points for each category. In many situations two or
three categories are sufficient whereas in other cases more elaborate schemes are necessary.
An advantage of analytic scoring rubrics is that they provide specific feedback to students
regarding the adequacy of their responses in different areas. This helps students know which
aspects of their responses were adequate and which aspects need improvement. The major
drawback of analytic rubrics is that their use can be fairly time consuming, particularly
when the rubric specifies many dimensions to be graded individually.
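To make the point allocation concrete, the following minimal sketch scores a single response against the 10/3/2 analytic rubric described above; the dimension ratings are hypothetical values a teacher would supply.

```python
# Sketch of an analytic scoring rubric: each dimension has a maximum point value
# (here the 10/3/2 split used in the example above) and is scored separately.

ANALYTIC_RUBRIC = {"content": 10, "style": 3, "grammar/mechanics": 2}

def score_response(ratings: dict) -> int:
    """Sum the points awarded per dimension, capped at each dimension's maximum."""
    total = 0
    for dimension, maximum in ANALYTIC_RUBRIC.items():
        total += min(ratings.get(dimension, 0), maximum)
    return total

# Hypothetical ratings for one student's essay
ratings = {"content": 8, "style": 2, "grammar/mechanics": 2}
print(score_response(ratings), "out of", sum(ANALYTIC_RUBRIC.values()))   # 12 out of 15
```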
With a holistic rubric, the teacher assigns a single score based on the overall quality
of the student’s response. Holistic rubrics are often less detailed than analytic rubrics. They
are easier to develop and scoring usually proceeds faster. Their primary disadvantage is
that they do not provide specific feedback to students about the strengths and weaknesses
of their responses.
Some testing experts suggest that, instead of using holistic rubrics to assign a numeri-
cal or point score, you use an ordinal or ranking approach. With this approach, instead of
assigning a point value to each response, you read and evaluate the responses and sort them
into categories reflecting different qualitative levels. Many teachers use five categories to
correspond to letter grades (i.e., A, B, C, D, and F). When using this approach, Gronlund
(1998) recommends that teachers read each item twice. You initially read through the essay
items and sort them into the designated categories. Subsequently you read the items in each
category as a group checking for consistency. If any items appear to be either superior or
inferior to the other items in that category, you make the necessary adjustment.
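The two-reading procedure can be outlined as a simple sorting routine. The sketch below is only illustrative; the judge_category and reassign_if_needed functions stand in for the teacher's qualitative judgment and are hypothetical.

```python
# Sketch of the two-reading holistic procedure described above: first sort responses
# into letter-grade categories, then reread each category as a group and move any
# response that seems out of place.

from collections import defaultdict

CATEGORIES = ["A", "B", "C", "D", "F"]

def first_reading(responses, judge_category):
    piles = defaultdict(list)
    for response in responses:
        piles[judge_category(response)].append(response)   # teacher assigns a category
    return piles

def second_reading(piles, reassign_if_needed):
    for category in CATEGORIES:
        for response in list(piles[category]):
            better_fit = reassign_if_needed(response, category, piles)
            if better_fit and better_fit != category:
                piles[category].remove(response)
                piles[better_fit].append(response)
    return piles

# Example with trivial stand-in judgments
essays = ["essay 1", "essay 2", "essay 3"]
piles = first_reading(essays, judge_category=lambda e: "B")
piles = second_reading(piles, reassign_if_needed=lambda e, cat, piles: None)
print(dict(piles))
```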
To illustrate the differences between holistic and analytic scoring rubrics, consider this essay question:

Essay Item: Compare and contrast Thurstone's model of intelligence with that presented by Gardner. Give examples of the ways they are similar and the ways they differ.

Table 9.4 presents a holistic scoring rubric that might be used when scoring this item. Table 9.5 presents an analytic scoring rubric that might be used when scoring this item.
Our final comment regarding scoring rubrics is that to be effective they should be used in
a consistent manner. Keep the rubric in front of you while you are scoring and apply it in a
fair, evenhanded manner.
Avoid Expectancy Effects. As you remember, expectancy effects occur when the teacher allows irrelevant characteristics of the student to affect scoring (also referred to as the "halo effect"). The obvious approach to minimizing expectancy effects is to score essay items in a way that the test taker's identity is not known. If you use test booklets, fold back the cover so
that the student’s name is hidden. If you use standard paper, we suggest that students write
their names on the back of essay sheets and that only one side of the sheet be used. The goal
is simply to keep you from being aware of the identity of the student whose paper you are
currently scoring. To prevent the student’s performance on one item from influencing scores
on subsequent items, we recommend that you start each essay item on a separate page. This
way, exceptionally good or poor performance on a previous item will not inadvertently
influence your scoring of an item.
Consider Writing Effects (e.g., Handwriting, Grammar, and Spelling). If one could
adhere strictly to the guidelines established in the scoring rubrics, writing effects would
not influence scoring unless they were considered essential. However, as we noted, even
when writing abilities are not considered essential, they tend to impact the scoring of an
item. These effects are difficult to avoid other than to warn students early in their academic
careers that these effects exist and suggest that they develop good writing abilities. For those
with poor cursive writing, a block letter printing style might be preferred. Because personal
computers are readily available in schools today, you might allow students to complete
essays using word processors and then print their tests. You should encourage students to
apply grammatical construction rules and to phrase sentences in a straightforward manner
that avoids awkward phrasing. To minimize spelling errors you might elect to provide dic-
tionaries to all students because this will mirror more closely the writing situation in real life
and answer critics who say essay tests should not be spelling tests. The use of word proces-
sors with spelling and grammar checkers might also help reduce these effects.
Minimize Order Effects. To minimize order effects, it is best to score the same question
for all students before proceeding to the next item. The tests should then be reordered in a
random manner before moving on to scoring the next item. For example, score item 1 for all
students; reorder the tests in a random fashion; then score essay item 2 and so forth.
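This routine can be summarized in a short sketch; the grade_item function is a hypothetical stand-in for the teacher actually reading and scoring each response.

```python
# Sketch of the order-effect procedure above: score one item for every student,
# then shuffle the stack of papers before scoring the next item.

import random

def score_tests(papers, num_items, grade_item):
    """papers: list of (student, responses); grade_item(responses, item_no) -> points."""
    scores = {student: [] for student, _ in papers}
    for item_no in range(1, num_items + 1):
        for student, responses in papers:
            scores[student].append(grade_item(responses, item_no))
        random.shuffle(papers)        # reshuffle the stack before the next item
    return scores

# Example with a hypothetical grading function
papers = [("Ann", ["...", "..."]), ("Ben", ["...", "..."])]
all_scores = score_tests(papers, num_items=2, grade_item=lambda responses, n: 5)
print(all_scores)   # {'Ann': [5, 5], 'Ben': [5, 5]}
```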
Avoid Fatigue. The difficult task of grading essays is best approached as a series of one-
or two-hour sessions with adequate breaks between them. Although school schedules often
require that papers be scored in short periods of time, you should take into consideration the
effects of fatigue on scoring and try to arrange a schedule for grading that permits frequent
rest periods.
Score Essays More than Once. Whenever possible it is desirable to score essay items at least two times. This can be accomplished either by scoring the items twice yourself or by having a colleague score them after you have scored them. When the two scores or ratings are consistent, you can be fairly confident in the score. If the two ratings are significantly different, you should average the two scores. Table 9.6 summarizes our suggestions for scoring essay items. Special Interest Topic 9.1 presents a brief discussion of automated essay scoring systems that are being used in several settings.

TABLE 9.6  Suggestions for Scoring Essay Items

1. Develop a scoring rubric for each item that clearly specifies the scoring criteria.
2. Take steps to avoid knowing whose paper you are scoring.
3. Avoid allowing writing effects to influence scoring if they are not considered essential.
4. Score the same question for all students before proceeding to the next item.
5. Score the essays in one- or two-hour periods with adequate rest breaks.
6. Score each essay more than one time (or have a colleague score them once after you have scored them).
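As a small illustration of suggestion 6, the helper below averages two independent ratings and flags large disagreements for rereading; the flagging threshold is an arbitrary illustrative choice, not a value specified in the text.

```python
# Sketch of double scoring: combine two independent ratings of the same essay by
# averaging them, and flag pairs that disagree badly so the essay can be reread.
# The flagging threshold is an arbitrary illustrative choice.

def combine_ratings(first: float, second: float, flag_gap: float = 3.0):
    average = (first + second) / 2
    needs_review = abs(first - second) >= flag_gap
    return average, needs_review

print(combine_ratings(12, 13))   # (12.5, False) -> ratings are consistent
print(combine_ratings(8, 14))    # (11.0, True)  -> large disagreement; reread the essay
```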
Short-Answer Items
Relative to essay items, short-answer items place stricter limits on the nature and
length of the response. Practically speaking, short-answer items can be viewed as a type of
restricted-response essay item. As we noted, restricted-response essay items provide more
structure and limit the form and scope of a student’s response relative to an extended-
response essay item. Short-answer items take this a step further, providing even more struc-
ture and limits on the student’s response.
Myford and Cline (2002) note that even though essays are respected and desirable assessment
techniques, their application in large-scale standardized assessment programs has been limited be-
cause scoring them with human raters is usually expensive and time consuming. For example, they
note that when using human raters, students may take a standardized essay test at the end of an
academic year and not receive the score reports until the following year. The advent of automated
essay-scoring systems holds promise for helping resolve these problems. By using an automated
scoring system, testing companies can greatly reduce expense and the turnaround time. As a re-
sult, educators and students can receive feedback in a fraction of the time. Although such systems
have been around since at least the 1960s, they have become more readily available in recent
years. Myford and Cline (2002) note that these automated systems generally evaluate essays on
the basis of either content (i.e., subject matter) or style (i.e., linguistic style). Contemporary essay
scoring systems also provide constructive feedback to students in addition to an overall score.
For example, a report might indicate that the student is relying too much on simple sentences or
a limited vocabulary (Manzo, 2003).
In addition to being more cost- and time-efficient, these automated scoring systems have the
potential for increasing the reliability of essay scores and the validity of their interpretation. For
example, the correlation between grades assigned by a human and an automated scoring program is
essentially the same as that between two human graders. However, in contrast to humans, computers
never have a bad day, are never tired or distracted, and assign the same grade to the same essay every
time (Viadero & Drummond, 1998).
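One common way to express this kind of agreement is the correlation between the scores two graders assign to the same essays. The sketch below computes a Pearson correlation on made-up scores and is purely illustrative.

```python
# Illustrative only: Pearson correlation between two sets of scores assigned to the
# same essays (e.g., a human rater and an automated system). The scores are made up.

from statistics import mean, stdev

human     = [4, 5, 3, 2, 5, 4, 3, 1]
automated = [4, 4, 3, 2, 5, 5, 3, 2]

def pearson_r(x, y):
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    n = len(x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

print(f"r = {pearson_r(human, automated):.2f}")   # a high r indicates close agreement
```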
In addition to expediting scoring of large-scale assessment, these automated essay scoring
programs have recently found application in the classroom. Manzo (2003) gave the example of a
middle school language arts teacher who regularly assigns essay assignments. She has more than 180
students and in the past would spend up to 60 hours grading a single assignment. She is currently
using an automated online scoring system that facilitates her grading. She has the program initially
score the essays, then she reviews the program’s evaluation and adds her own comments. In other
words, she is not relying exclusively on the automated scoring system, but using it to supplement
and enhance her personal grading. She indicates that the students can receive almost instantaneous
feedback on their essays and typically allows the students to revise their daily assignments as many
times as they desire, which has an added instructional benefit.
These programs are receiving more and more acceptance in both classrooms and large-scale
standardized assessment programs. Here are examples of some popular programs and Web sites at
which you can access information about them:
■ e-rater, www.ets.org/research/erater.htm+1
■ Intelligent Essay Assessor, www.knowledgetechnologies.com
■ IntelliMetric, www.intellimetric.com
■ Bayesian Essay Scoring System, http://ericae.net/betsy
Make Sure There Is Only One Correct Response. In addition to brevity, it is important
that there only be one correct response. This is more difficult than you might imagine. When
writing a short-answer item, ask yourself if the student can interpret it in more than one way.
Consider this example:
Have Only One Blank Space when Using the Incomplete-Sentence Format, Prefer-
ably Near the End of the Sentence. As we noted, unless incomplete-sentence items are
carefully written, they may be confusing or unclear to students. Generally the more blank
spaces an item contains, the less clear the task becomes. Therefore, we recommend that
you usually limit each incomplete sentence to one blank space. We also recommend that
the blank space be located near the end of the sentence. This arrangement tends to provide
more clarity than if the blank appears early in the sentence.
Avoid Unintentional Cues to the Answer. As with selected-response items, you should
avoid including any inadvertent clues that might alert an uninformed student to the correct re-
sponse. For example, provide blanks of the same length for all short-answer items (both direct
questions and incomplete sentences). This way you avoid giving cues about the relative length
of different answers. Also be careful about grammatical cues. The use of the article a indicates
an answer beginning with a consonant instead of a vowel. An observant student relying on cues
will detect this and it may help him or her narrow down potential responses. This can be cor-
rected by using a(n) to accommodate answers that begin with either consonants or vowels.
Make Sure the Blanks Provide Adequate Space for the Student’s Response. A previ-
ous guideline noted that all blanks should be the same length to avoid unintentional cues
to the correct answer. You should also make sure that each blank provides adequate space
for the student to write the response. As a result, you should determine how much space is
necessary for providing the longest response in a series of short-answer items, and use that
length for all other items.
Avoid Lifting Sentences Directly Out of the Textbook and Converting Them into
Short-Answer Items. Sentences taken directly from textbooks often produce ambiguous
short-answer items. Sentences typically need to be understood in the context of surrounding
material, and when separated from that context their meaning often becomes unclear. Ad-
ditionally, if you copy sentences directly from the text, some students may rely on simple
word associations to answer the items. This may promote rote memorization rather than
developing a thorough understanding of the material (Hopkins, 1998).
Create a Scoring Rubric and Consistently Apply It. As with essay items, it is impor-
tant to create and consistently use a scoring rubric when scoring short-answer items. When
creating this rubric, take into consideration any answers besides the preferred or “best”
response that will receive full or partial credit. For example, remember this item?
How would you score it if the student responded only “Braintree” or only “Massachusetts”?
This should be specified in the scoring rubric.
These guidelines are summarized in Table 9.7.
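As an illustration of building partial credit into a short-answer rubric, the sketch below scores responses to the item just mentioned; the point values and the accepted answers are hypothetical choices a teacher would specify in advance.

```python
# Sketch of a short-answer scoring rubric that specifies, in advance, which answers
# earn full or partial credit. The item, answers, and point values are hypothetical.

RUBRIC = {
    "full":    ({"braintree, massachusetts"}, 2),    # preferred complete answer
    "partial": ({"braintree", "massachusetts"}, 1),  # acceptable partial answers
}

def score_short_answer(response: str) -> int:
    answer = response.strip().lower()
    for answers, points in RUBRIC.values():
        if answer in answers:
            return points
    return 0

print(score_short_answer("Braintree, Massachusetts"))  # 2
print(score_short_answer("Massachusetts"))             # 1
print(score_short_answer("Boston"))                    # 0
```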
When recall is important, when dealing with quantitative problems, and when interpreting graphic material, short-answer items can be extremely effective.

Short-Answer Items Require Recall, Not Just Recognition. Whereas selected-response items require only recognition of the correct answer, short-answer items require recall. That is, with these items students must remember the correct answer without having it provided for them. There are instances when the recall/recognition distinction is important, and when recall is important, short-answer items can be useful. Also, because short-answer items require recall, blind guessing is reduced.
Short-Answer Items Are Particularly Well Suited for Quantitative Problems and Prob-
lems Requiring the Interpretation of Graphic Material. When the problem involves
mathematical computations or the interpretation of graphic material such as charts, diagrams,
or illustrations, the short-answer format can be particularly useful (e.g., Hopkins, 1998).
Because Students Can Answer More Short-Answer Items than Essay Items, They
May Allow Better Content Sampling. Because students can usually answer short-answer
items fairly quickly, you can include more short-answer items than essay items on a test. This
can result in more representative sampling of the content domain and enhanced reliability.
Short-Answer Items Are Relatively Easy to Write. We are always cautious when we say
an item type is easy to write, so we say it is relatively easy to write. Compared to multiple-
choice items, short-answer items are easier to write. Even though they are relatively easy to
write, they still need to be developed with care following the guidelines provided.
Short-Answer Items Are Largely Limited to Lower-Level Objectives. Short-answer items are best suited to measuring factual knowledge and relatively simple skills. You can minimize this limitation by not relying too heavily on short-answer items and incorporating other item types that demand higher-level cognitive processes.
Throughout much of the twentieth century, critics of essay items emphasized their weak-
nesses (primarily unreliable scoring and reduced content sampling) and promoted the use
of selected-response items. In recent years there has been increased criticism of selected-
response items and a call for relying more on essays and other constructed-response items.
Proponents of constructed-response tests, particularly essay items (and performance as-
sessments, discussed in the next chapter), generally claim they provide a more “authentic”
assessment of student abilities, one that more closely resembles the way abilities and knowl-
edge are demonstrated or applied in the real world. Arguments on both sides are pervasive
and often passionate. We take the position that both formats have an important role to play
in educational assessment. As we have repeated numerous times, to adequately assess the
complex array of knowledge and skills emphasized in today’s schools, teachers need to
take advantage of the full range of assessment procedures available. Due to the tendency
for selected-response items to provide reliable and valid measurement, we promote their
use when they can adequately assess the educational objectives. However, it is important
to recognize that there are educational objectives that cannot be adequately assessed using selected-response items. In these situations you should use constructed-response items. By being aware of the weaknesses of constructed-response items and using the guidelines for developing and scoring them outlined in this chapter you will be able to write items that produce results you can have considerable confidence in. Remember that the best practice is to select items that provide the most valid and reliable information about your students' knowledge and skills.

To adequately assess the complex array of knowledge and skills emphasized in today's schools, teachers need to take advantage of the full range of assessment procedures available.
Summary
In this chapter we focused on the development and use of constructed-response items. Essay
items have a long history, dating back to China over two thousand years ago. An essay item
poses a question or problem that the student responds to in a written format. Although essay
items vary in terms of the limits they place on student responses, most essay items give
students considerable freedom in developing their responses. Essay tests gained popularity
in the United States in the nineteenth century largely due to problems associated with oral
testing. Even though written essay tests addressed some of the problems associated with
oral testing, essays have their own associated problems. The most prominent weaknesses
of essay items involve difficulty scoring in a reliable manner and limited content sampling.
Both of these issues can result in reduced reliability and validity. On the positive side, essay
items are well suited for measuring many complex educational objectives and are relatively
easy to write. We provided numerous suggestions for writing and scoring essay items, but
encouraged teachers to limit the use of essay items to the measurement of educational objec-
tives that are not easily assessed using selected-response items.
The second type of constructed-response item addressed in this chapter was short-an-
swer items. Like essay items, students respond to short-answer items by providing a written
response. However, instead of having a large degree of freedom in drafting their response,
on short-answer items the student is usually required to limit the response to a single word, a
brief phrase, or a symbol/number. Similar to essay items, short-answer items are somewhat
difficult to score in a reliable manner. On the positive side, short-answer items are well suited
for measuring certain educational objectives (e.g., math computations) and are relatively easy
to write. We provided several suggestions for writing short-answer items, but nevertheless
encouraged teachers to limit their use to those situations for which they are uniquely quali-
fied. As with essay items, short-answer items have distinct strengths, but should be used in a
judicious manner.
We ended this chapter by highlighting the classic debate between proponents of
selected-response and constructed-response formats. We believe both have a role to play in
educational assessment and that by knowing the strengths and limitations of both formats
one will be better prepared to develop and use tests in educational settings. In the next chap-
ter we will turn your attention to performance assessments and portfolios. These are special
types of constructed-response items (or tasks) that have been around for many years, but
have gained increasing popularity in schools in recent years.
RECOMMENDED READINGS
Fleming, K., Ross, M., Tollefson, N., & Green, S. (1998). Teacher's choices of test-item formats for classes with diverse achievement levels. Journal of Educational Research, 91, 222-228. This interesting article reports that teachers tend to prefer using essay items with high-achieving classes and more recognition items with mixed-ability or low-achieving classes.

Gellman, E., & Berkowitz, M. (1993). Test-item type: What students prefer and why. College Student Journal, 27, 17-26. This article reports that the most popular item types among students are essays and multiple-choice items. Females overwhelmingly prefer essay items whereas males show a slight preference for multiple-choice items.

Gulliksen, H. (1986). Perspective on educational measurement. Applied Psychological Measurement, 10, 109-132. This paper presents recommendations regarding the development of educational tests, including the development and grading of essay items.
CHAPTER HIGHLIGHTS
LEARNING OBJECTIVES
In Chapter 1, we noted that one of the current trends in educational assessment is the rising
popularity of performance assessments and portfolios. Performance assessments and port-
folios are not new creations. In fact as far back as written records have been found, there is
evidence that students were evaluated with what are currently referred to as performance
assessments. However, interest in and the use of performance assessments and portfolios in schools has increased considerably in the last decade. Although traditional paper-and-pencil assessments, particularly multiple-choice and other selected-response formats (e.g., true–false, matching), have always had their critics, opposition has become much more vocal in recent years. Opponents of traditional paper-and-pencil assessments complain that they emphasize rote memorization and other low-level cognitive skills and largely neglect higher-order conceptual and problem-solving skills. To make the situation worse, critics claim that reliance on paper-and-pencil assessments may have negative effects on what teachers teach and what students learn. They note that in the era of high-stakes assessment teachers often feel compelled to teach to the test. As a result, if high-stakes tests measure only low-level skills, teachers may teach only low-level skills.
To address these shortcomings, many educational assessment experts have promoted the use of performance assessments and portfolios. The Standards (AERA et al., 1999) define performance assessments as those that require students to complete a process or produce a product in a context that closely resembles real-life situations. For example, a medical student might be required to interview a mock patient, select medical tests and other assessment procedures, arrive at a diagnosis, and develop a treatment plan. Portfolios, a specific form of performance assessment, involve the systematic collection of a student's work products over a specified period of time according to a specific set of guidelines. Artists, architects, writers, and others have used portfolios to represent their work for many years, and in the last decade portfolios have become increasingly popular in the assessment of students.
We just gave you the definition of performance assessments provided by the Standards
(AERA et al., 1999). The Joint Committee on Standards for Educational Evaluation (2003)
provides a slightly different definition. It defines a performance assessment as a
Notice that whereas the Standards’ definition of performance assessments requires test tak-
ers to complete a task in a context or setting that closely resembles real-life situations, the
definition provided by the Joint Committee on Standards for Educational Evaluation does
not. This may alert you to the fact that not everyone agrees as to what qualifies as a per-
formance assessment. Commenting on this, Popham (1999) observed that the distinction
between performance assessments and more traditional assessments is not always clear.
For example, some educators consider practically any constructed-response assessment a
specific food requirements. This type of assessment assumes that the student is familiar with
the real-life setting and the elements of the problem. Clearly, the assumption is made that a
student who can solve the artificial problem can also solve an actual performance problem of
the same sort. That assumption is rarely tested and is problematic in many instances.
Performance assessments may be the primary approach to assessment in classes such as art, music, physical education, theater, and shop. This is only a partial list of the many applications for performance assessments in schools. Consider shop classes, theater classes, home economics classes, typing/keyboarding classes, and computer classes. Even in classes in which traditional paper-and-pencil assessments are commonly used, performance assessments can be useful adjuncts to the more traditional assessments. For example, in college tests and measurement classes it is beneficial to have students select a test construct, develop a test to measure that construct, administer the test to a sample of subjects, and complete preliminary analyses on the resulting data. Like many performance assessments, this activity demands considerable time and effort. However, it measures skills that are not typically assessed when relying on traditional paper-and-pencil assessments.
We just noted that performance assessments can be very time consuming, and this ap-
plies to both the teacher and the students. Performance assessments take considerable time
for teachers to construct, for students to complete, and for teachers to score. However, not all
performance assessments make the same demands on students and teachers. It is common to
distinguish between extended-response performance assessments and restricted-response
performance assessments. Extended-response performance tasks typically are broad in
scope, measure multiple learning objectives, and are designed to closely mirror real-life situa-
tions. In contrast, restricted-response performance tasks typically measure a specific learning
objective and relative to extended-response assessments are easier to administer and score.
However, restricted-response tasks are less likely to mirror real-life situations.
SPECIAL INTEREST TOPIC 10.1
Example of a Performance Assessment in Mathematics
The issues involved in assessing mathematics problem solving are similar to those in all performance
assessments, so we will use this topic to highlight them. Almost all states have incorporated so-called
higher-order thinking skills in their curricula and assessments. In mathematics this is commonly
focused on problem solving. Common arithmetic and mathematics performance assessments in stan-
dardized tests that are developed by mathematicians and mathematics educators focus on common
problem situations and types. For example, algebra problems may be developed around landscape
architectural requirements for bedding perimeters or areas, as well as driving times and distance
problems. Students are asked in a series of questions to represent the problem, develop solutions,
select a solution, solve the problem, and write a verbal description of the solution.
Each part of a mathematics problem such as that just mentioned will be evaluated separately.
In some assessments each part is awarded points to be cumulated for the problem. How these points
are set is usually a judgment call, and there is little research on this process. Correlational analysis
with other indicators of mental processing, factor analysis, and item response theory can all provide
help in deciding how to weight parts of a test, but these are advanced statistical procedures. As
with essay testing, however, each part of a mathematics problem of this kind will be addressed in a
scoring rubric. Typically the rubric provides a set of examples illustrating performances at different
score levels. For example, a 0 to 4 system for a solution generation element would have examples of
each level from 1 to 4. Typically a 0 is reserved for no response, a 1-point response reflects a poorly
developed response that is incorrect, a 2-point response reflects a single correct but simple solution,
whereas 3 and 4 are reserved for multiple correct and increasingly well-developed solutions. An
example of a performance assessment in mathematics follows. The assessment is similar to those
used in various state and national assessments. Note that some parts require responses that are simply
multiple-choice, whereas others require construction of a response along with the procedures the
student used to produce the answer. The process or procedures employed are evaluated as well as the
answer. One of the reasons for examining the student’s construction process is that students some-
times can get the correct answer without knowing the procedure (they may conduct various arithme-
tic operations and produce a response that corresponds to an option on a multiple-choice item).
1. Why did the number of border bricks increase as gray paving blocks were added?
2. Complete the table below. Decide how many paving stones and border bricks the
fifth patio design would have.
Patio Design     Paving Stones     Border Bricks
     1                 1                 8
     2                (4)              (12)
     3                (9)              (16)
     4               (16)              (20)
     5               (25)              (24)
CONSTRUCTED-RESPONSE MATHEMATICS ITEM, GRADES 6-8 ALGEBRA CONCEPTS
Directions: All of the questions are about the same problem shown below. Read the
problem and then answer each question in the boxes given with each question.
Sue is a landscaper who builds patios with gray paving stones and white bricks that make
the border. The number of paving stones and bricks depends on the size of the patio as
shown below:
[Figure: diagrams of the first few patio designs, each showing the gray paving stones surrounded by the white border bricks]
3. From the pattern in the table accompanying Problem 2, write a statement about how many
more border bricks will be needed as the patio design goes from 5 to 6.
4. The number of the patio design is the same as the number of rows of paving stones in the de-
sign. As a new row is added, how many border bricks are added? ANSWER
5. Notice that if you multiply the value 1 for Patio Design 1 by 4 and add 4, you get the number
of border bricks, 8. Does the same thing work for Patio Design 2? ANSWER
Now write a math statement about the number of the patio design and the number of border
bricks: You can use P for patio design and N for the number of border bricks. Thus, your state-
ment should start, N=...
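For readers who want to check the pattern behind the item, the short Python sketch below is our own illustration (it is not part of the original item). It simply verifies that, for the five designs in the table, the number of paving stones equals the square of the design number and the number of border bricks follows N = 4P + 4.

# Check the patterns suggested by the patio item: for design number P,
# paving stones = P**2 and border bricks N = 4*P + 4.
table = [(1, 1, 8), (2, 4, 12), (3, 9, 16), (4, 16, 20), (5, 25, 24)]

for design, stones, bricks in table:
    assert stones == design ** 2, f"design {design}: unexpected paving stone count"
    assert bricks == 4 * design + 4, f"design {design}: unexpected border brick count"

print("All five designs match stones = P**2 and N = 4P + 4.")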
Due to the great diversity in the types of objectives measured by performance assessments,
it is somewhat difficult to develop specific guidelines for developing effective performance
assessments. However, most experts tend to agree on some general guidelines for this process
(e.g., Gronlund, 1998; Linn & Gronlund, 2000; Nitko, 2001; Popham, 1999, 2000; Stiggins,
2001). These can be classified as suggestions for selecting appropriate performance tasks,
developing clear instructions for students, developing procedures for evaluating students’
performance, and implementing procedures to minimize rating errors. In summarizing these
guidelines, the logical place to start is with the selection of a performance task.
Select Performance Tasks That Provide the Most Direct Assessment of the Educa-
tional Objectives You Want to Measure. One principle we have touched on several times
is that you should select assessment techniques that provide the most direct measurement
of the educational objective of interest. This applies when selecting the type of assessment
to use (e.g., selected-response, constructed-response, or performance assessment) and also
when selecting the specific task that you will employ. To this end, carefully examine the
educational objectives you are targeting and select performance tasks that capture the es-
sential features of those objectives.
Select Performance Tasks That Maximize Your Ability to Generalize the Results of
the Assessment. One of the most important considerations when selecting a perfor-
mance task is to choose one that will allow you to generalize the results to comparable
tasks. In other words, if a student can perform well on the selected task, there should be a
high probability that he or she can perform well on other tasks that involve similar skills
and knowledge.
Select Performance Tasks That Reflect Essential Skills. As a general rule, perfor-
mance assessments should be used only for assessing the most important or essential skills.
Because performance assessments require considerable time and energy to complete (for
both teachers and students), to promote efficiency use them only for assessing the really
important skills that you want to ensure your students have mastered.
Select Performance Tasks That Encompass More than One Learning Objective. Be-
cause performance assessments often require such extensive time and energy commitments,
it is highly desirable to select tasks that allow the assessment of multiple important edu-
cational objectives. Although this may not always be possible, when it is it enhances the
efficiency of the assessment process.
Select Performance Tasks That Focus Your Evaluation on the Processes and/or Prod-
ucts You Are Most Interested In. Before selecting a performance task you should deter-
mine whether you are primarily interested in assessing the process the students engage in,
the product they produce, or some combination of the two. Sometimes the answer to this
question is obvious; sometimes it is less clear. Some performance tasks do not result in a
product and in this situation it is obvious that you will focus on the process. For example,
assessment of musical performances, speeches, debates, and dance routines require evalua-
tion of the process in real time. In contrast, when evaluating a student-developed diorama,
poem, or sculpture, the process is often less important than the end product. Assessment
experts (e.g., Nitko, 2001) recommend that you focus on the process when
■ No product is produced.
■ A specific sequence of steps or procedures is taught.
■ The specific steps or procedures are essential to success.
■ The process is clearly observable.
■ Analysis of the process can provide constructive feedback.
■ You have the time to devote to observing the students perform the task.
As we noted, it is possible and often desirable to evaluate both process and prod-
uct. Additionally, the emphasis on process or product may change at different stages of
instruction. Gronlund (1998) suggests that process is often more important early in the
learning process, but after the procedural steps have been mastered, the product assumes
primary importance. For example, in painting the teacher’s focus may be on procedure
and technique in the early stages of instruction and then shift to the quality of the finished
painting in later stages of instruction. When the process has been adequately mastered, it
may be preferable to focus your evaluation on the product because it can usually be evalu-
ated in a more objective manner, at a time convenient to the teacher, and if necessary the
scoring can be verified.
Select Performance Tasks That Provide the Desired Degree of Realism. This in-
volves considering how closely your task needs to mirror real-life applications. This is
along the lines of the distinction between actual, analogue, and artificial performance as-
sessments. This distinction can be conceptualized as a continuum, with actual performance
tasks being the most realistic and artificial performance tasks the least realistic. Although
it may not be possible to conduct actual or even analogue performance assessments in the
classroom, considerable variability in the degree of realism can be found in artificial perfor-
mance assessments. Gronlund (1998) identifies four factors to consider when determining
how realistic your performance assessment should be:
■ The nature of the educational objective being measured. Does the objective require
a high, medium, or low level of realism?
■ The sequential nature of instruction. Often in instruction the mastery of skills that
do not require a high level of realism to assess can and should precede the mastery of
skills that demand a high level of realism in assessment. For example, in teaching the
use of power tools in a shop class it would be responsible to teach fundamental safety
rules (which may be measured using paper-and-pencil assessments) before proceed-
ing to hands-on tasks involving the actual use of power tools.
■ Practical constraints. Consider factors such as time requirements, expenses, manda-
tory equipment, and so forth. As a general rule, the more realistic the task, the greater
the demands in terms of time and equipment.
■ The nature of the task. Some tasks by their very nature preclude actual performance
assessment. Remember our example regarding the recertification of nuclear power
plant operators. In this context, mistakes in the real system could be disastrous, so a
simulator that re-creates the control room is used for assessment purposes.
Select Performance Tasks That Measure Skills That Are “Teachable.” That is, make
sure your performance assessment is measuring a skill that is acquired through direct in-
struction and not one that reflects innate ability. Ask yourself, “Can the students become
more proficient on this task as a result of instruction?” Popham (1999) notes that when
evaluation criteria focus on “teachable skills” it strengthens the relationship between in-
struction and assessment, making both more meaningful.
Select Performance Tasks That Are Fair to All Students. Choose tasks that are fair to
all students regardless of gender, ethnicity, or socioeconomic status.
Select Performance Tasks That Can Be Assessed Given the Time and Resources Avail-
able. Consider the practicality of a performance task. For example, can the assessment
realistically be completed when considering the expense, time, space, and equipment re-
quired? Consider factors such as class size; what might be practical in a small class of ten
students might not be practical in a class of 30 students. From our experience it is common
for teachers to underestimate the time students require to complete a project or activity. This
is because the teacher is an expert on the task and can see the direct, easy means to comple-
tion. In contrast students can be expected to flounder to some degree. Not allowing sufficient
time to complete the tasks can result in student failure and a sense that the assessment was
not fair. To some extent experience is needed to determine reasonable times and deadlines
for completion. New teachers may find it useful to consult with more experienced colleagues
for guidance in this area.
Select Performance Tasks That Can Be Scored in a Reliable Manner. Choose perfor-
mance tasks that will elicit student responses that can be measured in an objective, accurate,
and reliable manner.
1. Select performance tasks that provide the most direct assessment of the educational
objectives you want to measure.
2. Select performance tasks that maximize your ability to generalize the results of the
assessment.
3. Select performance tasks that reflect essential skills.
4. Select performance tasks that encompass more than one learning objective.
5. Select performance tasks that focus your evaluation on the processes and/or products you are
most interested in.
6. Select performance tasks that provide the desired degree of realism.
7. Select performance tasks that measure skills that are “teachable.”
8. Select performance tasks that are fair to all students.
9. Select performance tasks that can be assessed given the time and resources available.
10. Select performance tasks that can be scored in a reliable manner.
11. Select performance tasks that reflect educational objectives that cannot be measured using
more traditional measures.
Developing Instructions
The second major task in developing performance assessments is to develop instructions that clearly specify what students are expected to do. Because performance tasks often require fairly complex student responses, it is important that your instructions precisely specify the types of responses you are expecting. Because originality and creativity are seen as desirable educational outcomes, performance tasks often give students considerable freedom in how they approach the task. However, this does not mean it is appropriate for teachers to provide vague or ambiguous instructions. Few things in the classroom will create more negativity among students than confusing instructions that they feel result in a poor evaluation. It is the teacher's responsibility to write instructions clearly and precisely so that students do not need to "read the teacher's mind" (this applies to all assessments, not only performance assessments). Possibly the
best way to avoid problems in this area is to have someone else (e.g., an experienced col-
league) read and interpret the instructions before you administer the assessment to your stu-
dents. Accordingly, it may be beneficial to try out the performance activity with one or two
students before administering it to your whole class to ensure that the instructions are thor-
ough and understandable. Your instructions should clearly specify the types of responses
you are expecting and the criteria you will use when evaluating students’ performance. Here
is a list of questions that assessment professionals recommend you consider when evaluat-
ing the quality of your instructions (e.g., Nitko, 2001):
Table 10.2 provides a summary of these guidelines for developing instructions for
your performance assessments.
1. Make sure that your instructions clearly specify the types of responses you are expecting.
2. Make sure that your instructions specify any important parameters of the performance task
(e.g., time limits, the use of equipment or materials).
3. Make sure that your instructions clearly specify the criteria you will use when evaluating the
students’ responses.
4. Have a colleague read and interpret the instructions before you administer the assessment to
your students.
5. Try out the performance activity with one or a limited number of students before administering
it to your whole class to ensure that the instructions are thorough and understandable.
6. Write instructions that students from diverse cultural and ethnic backgrounds will interpret in
an accurate manner.
Developing Procedures for Evaluating Students' Performance
the preceding chapter. A rubric is simply a written guide that helps you score constructed-
response assessments. In discussing the development of scoring rubrics for performance
assessments, Popham (1999) identified three essential tasks that need to be completed, dis-
cussed in the following paragraphs.
Select Important Criteria That Will Be Considered When Evaluating Student Re-
sponses. Start by selecting the criteria or response characteristics that you will employ
when judging the quality of a student’s response. We recommend that you give careful
consideration to the selection of these characteristics because this is probably the most
important step in developing good scoring procedures. Limit it to three or four of the most
important response characteristics to keep the evaluation process from becoming unman-
ageable. The criteria you are considering when judging the quality of a student’s response
should be described in a precise manner so there is no confusion about what the rating re-
fers to. It is also highly desirable to select criteria that can be directly observed and judged.
Characteristics such as interest, attitude, and effort are not directly observable and do not
make good bases for evaluation.
Specify Explicit Standards That Describe Different Levels of Performance. For each
criterion you want to evaluate, you should develop clearly stated standards that distinguish
among levels of performance. In other words, your standards should spell out what a stu-
dent’s response must encompass or look like to be regarded as excellent, average, or infe-
rior. It is often helpful to provide behavioral descriptions and/or specimens or examples to
illustrate the different levels of performance.
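To make these two steps concrete, the following Python sketch is our own illustration (the criteria, level descriptions, and function name are hypothetical). It lays out a small analytic rubric as a set of observable criteria, each with explicit standards describing different levels of performance, and sums the ratings into a composite score.

# A hypothetical analytic rubric for an oral presentation: a few observable
# criteria, each with explicit standards describing levels of performance.
speech_rubric = {
    "organization": {
        4: "Clear introduction, logically sequenced points, strong conclusion.",
        3: "Recognizable structure with minor lapses in sequencing.",
        2: "Some structure, but points are difficult to follow.",
        1: "No discernible organization.",
    },
    "delivery": {
        4: "Consistent eye contact, appropriate pacing and volume.",
        3: "Generally effective delivery with occasional lapses.",
        2: "Frequent problems with pacing, volume, or eye contact.",
        1: "Delivery seriously interferes with communication.",
    },
}

def total_score(ratings):
    """Sum the analytic ratings across criteria to form a composite score."""
    return sum(ratings.values())

print(total_score({"organization": 3, "delivery": 4}))  # prints 7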
students regarding the strengths and weaknesses of their response. This informs students
which aspects of their responses were adequate and which need improvement. The major
limitation of analytic rubrics is that they can take considerable time to complete. Holistic
rubrics are often less detailed than analytic rubrics and as a result are easier to develop and
complete. Their major disadvantage is that they do not provide specific feedback to students
about the strengths and weaknesses of their responses. Tables 9.4 and 9.5 in Chapter 9 pro-
vide examples of holistic and analytic scoring rubrics.
Linn and Gronlund (2000) identify rating scales and checklists as popular alternatives
to the traditional scoring rubrics. Noting that the distinction between rating scales and tra-
ditional rubrics is often subtle, they find that rating scales typically
use quality judgments (e.g., outstanding, good, average, marginal, poor) to indicate performance on each criterion as opposed to the more elaborate descriptive standards common on scoring rubrics. In place of quality judgments, some rating scales indicate frequency judgments (e.g., always, often, sometimes, seldom, never). Table 10.3 provides an example
of a rating scale using verbal descriptions.
A number of different types of rating scales are commonly used in scoring perfor-
mance assessments. On some rating scales the verbal descriptions are replaced with numbers
to facilitate scoring. Table 10.4 provides an example of a numerical rating scale. Another
variation, referred to as a graphic rating scale, uses a horizontal line with ratings positioned
along the lines. Table 10.5 provides an example of a graphic rating scale. A final popular
type of rating scale combines the graphic format with brief descriptive phrases as anchor
points. This is typically referred to as a descriptive graphic scale. Linn and Gronlund (2000)
suggest that this type of rating scale has a number of advantages that support its use with
performance assessments. It communicates more information to the students regarding their
performance and it helps teachers rate their students’ performance with greater objectivity
and accuracy. Table 10.6 provides an example of a descriptive graphic rating scale. When
developing rating scales it is usually desirable to have between three and seven rating points.
For example, at a minimum you would want your rating scale to include ratings of poor,
average, and excellent. Most experts suggest that including more than seven positions is not
useful because raters usually cannot make finer discriminations than this.
TABLE 10.3 Example of a Rating Scale Using Verbal Descriptions

Directions: Indicate the student's ability to successfully perform the specified activity by
circling the appropriate descriptor.
2. Rate the student’s ability to strike the tennis ball using the forehand stroke.
Poor Marginal Average Good Excellent
3. Rate the student’s ability to strike the tennis ball using the backhand stroke.
Poor Marginal Average Good Excellent
TABLE 10.4 Example of a Numerical Rating Scale
Directions: Indicate the student’s ability to successfully perform the specified activity by
circling the appropriate number. On this scale, the numbers represent the following evaluations:
1 = Poor, 2 = Marginal, 3 = Average, 4 = Good, and 5 = Excellent.
3. Rate the student’s ability to strike the tennis ball using the backhand stroke.
Form is poor and accuracy is poor          Form and accuracy usually within the average range          Form and accuracy are consistently superior
TABLE 10.7 Example of a Checklist Used with Preschool Children
Directions: Circle Yes or No to indicate whether each skill has been demonstrated.
Self-Help Skills
Language Development
Yes No Circle
Yes No Square
Yes No Triangle
Yes No Star
Can identify the following colors:
Yes No Red
Yes No Blue
Yes No Green
Social Development
Yes No Plays independently
Yes No Plays parallel to other students
Yes No Plays cooperatively with other students
Yes No Participates in group activities
Checklists are another popular procedure used to score performance assessments. Checklists are similar to rating scales, but whereas rating scales note the quality of performance or the frequency of a behavior, checklists require a simple yes/no judgment. Table 10.7 provides
an example of a checklist that might be used with preschool children. Linn and Gronlund
(2000) suggest that checklists are most useful in primary education because assessment is
mostly based on observation rather than formal testing. Checklists are also particularly use-
ful for skills that can be divided into a series of behaviors.
Although there is overlap between traditional scoring rubrics such as those used for
scoring essays, rating scales, and checklists, there are differences that may make one pref-
erable for your performance assessment. Consider which format is most likely to produce
the most reliable and valid results and which will provide the most useful feedback to the
students.
In other words, if students impressed a teacher with their punctuality and good manners, the teacher
might tend to rate them more favorably when scoring performance assessments. Obviously
this is to be avoided because it undermines the validity of the results.
■ Leniency, severity, and central tendency errors. Leniency errors occur because some teachers tend to give all students good ratings, severity errors occur because some teachers tend to give all students poor ratings, and central tendency errors occur because some teachers tend to give all students scores in the middle range (e.g., indicating average performance). Leniency, severity, and central tendency errors all reduce the range of scores and make scores less reliable.
■ Personal biases. Personal biases may corrupt ratings if teachers have a tendency to
let stereotypes influence their ratings of students’ performance.
■ Logical errors. Logical errors occur when a teacher assumes that two characteristics
are related and tends to give similar ratings based on this assumption (Nitko, 2001). An
example of a logical error would be teachers assuming that all students with high aptitude
scores should do well in all academic areas, and letting this belief influence their ratings.
■ Order effects. Order effects are changes in scoring that emerge during the grading process. These effects are often referred to as rater drift or reliability decay. Nitko (2001) notes that when teachers start using a scoring rubric they often adhere to it closely and apply it consistently, but over time there is a tendency for them to adhere to it less closely, and as a result the reliability of their ratings decreases or decays.
Obviously these sources of errors can undermine the reliability of scores and the validity
of their interpretations. Therefore, it is important for teachers to take steps to minimize the
influence of factors that threaten the accuracy of ratings. Here are some suggestions for
improving the reliability and accuracy of teacher ratings that are based on our own experi-
ences and the recommendations of other authors (e.g., Linn & Gronlund, 2000; Nitko, 2001;
Popham, 1999, 2000).
Before Administering the Assessment, Have One or More Trusted Colleagues Eval-
uate Your Scoring Rubric. If you have other teachers who are familiar with the per-
formance area review and critique your scoring rubric, they may be able to identify any
limitations before you start the assessment.
When Possible, Rate Performances without Knowing the Student’s Identity. This
corresponds with the recommendation we made with regard to grading essay items. Anony-
mous scoring reduces the chance that ratings will be influenced by halo effects, personal
biases, or logical errors.
Rate the Performance of Every Student on One Task before Proceeding to the Next
Task. It is easier to apply the scoring criteria uniformly when you score one task for every
student before proceeding to the next task. That is, score task number one for every student
before proceeding to the second task. Whenever possible, you should also randomly reorder
the students or their projects before moving on to the next task. This will help minimize
order effects.
Have More than One Teacher Rate Each Student’s Performance. The combined
ratings of several teachers will typically yield more reliable scores than the ratings of only
one teacher. This is particularly important when the results of an assessment will have sig-
nificant consequences for the students.
Reliable scoring of performance assessments is hard to achieve. When estimating the reliability of
performance assessments used in standardized assessment programs, multiple readers (also termed
raters) are typically given the same student response to read and score. Because it is cost prohibitive
for readers to read and score all responses in high-stakes testing, the responses are typically assigned
randomly to pairs of readers. For each response, the scores given by the two readers are compared. If
they are identical, the essay is given the common score. If they differ meaningfully, such as when one score represents a passing performance and the other a failing one, a procedure to decide on the score given is
invoked. Sometimes this involves a third reader or, if the scores are close, an average score is given.
For tasks with multiple parts the same two readers evaluate the entire problem. It is important to note
that reliability of the parts will be much lower than reliability of the entire problem. Summing the
scores of the various parts of the problem will produce a more reliable score than that for any one
part, so it is important to decide at what level score reliability is to be assessed. It will be much more
costly to require high reliability for scoring each part than for the entire problem.
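As a minimal sketch of the adjudication logic just described (our own illustration; the one-point threshold for averaging and the exact rules for calling in a third reader are assumptions that particular programs set for themselves), the procedure might look like this:

from typing import Optional

# Hypothetical adjudication of two readers' scores: identical scores stand,
# close scores are averaged, and meaningful discrepancies (e.g., a pass/fail
# split) are resolved by a third reader.
def resolve_score(reader_1: int, reader_2: int, third_reader: Optional[int] = None) -> float:
    if reader_1 == reader_2:
        return float(reader_1)              # exact agreement: use the common score
    if abs(reader_1 - reader_2) == 1:
        return (reader_1 + reader_2) / 2    # scores are close: average them
    if third_reader is None:
        raise ValueError("Meaningful discrepancy: a third reader is required.")
    return float(third_reader)              # adjudicated by the third reader

print(resolve_score(3, 3))      # 3.0
print(resolve_score(3, 2))      # 2.5
print(resolve_score(4, 1, 2))   # 2.0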
Prior to reading the responses the readers must be given training. This training usually includes
review of the rubrics, practice with samples of responses, and repeated scoring until the readers reach
some criterion performance themselves. Readers may be required to agree with a predetermined score
value. For example, readers may be expected to reach a criterion agreement of 70% or 80% with the
predetermined score value over a set of responses. This agreement means that a reader achieves 70%
agreement with the assigned score (the assigned scores were established by experts). Statistically, this
agreement itself is dependent on the number of responses a reader is given. That is, if a reader scores
ten responses and obtains an exact match in scoring seven of them, the reader may have achieved the
required reliability for scoring. However, from a purely statistical perspective they may have a percent
agreement as low as 0.41. This is based on statistical probability for a percentage. It is calculated from
the equation for the standard deviation of a percent:
S_proportion = √[p(1 − p)/n] = √[(0.70 × 0.30)/10] = 0.145
From the distribution of normal scores, plus or minus 1.96 standard deviations will capture 95% of
the values of the percent a reader would obtain over repeated scoring of 10 responses. This is about
two standard deviations, so that subtracting 2 x 0.145 from 0.70 leaves a percent as low as 0.41.
Clearly, ten responses is a poor sample to decide whether a reader has a true percent agreement
score of 0.70. If the number of essays is increased to 100, the standard error becomes 0.046, and
the lower bound of the interval around 0.70 becomes 0.70 minus 2 x 0.046, or about 0.60. Reading
100 responses is very time consuming and costly, increasing the cost of the entire process greatly.
If we want a true minimal agreement value of 0.70, using this process, we would require the ac-
tual agreement for readers to be well above 0.80 for 100 responses, and above 0.95 for only ten.
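The sampling argument above is easy to reproduce. The following Python sketch is our own illustration; it uses the same normal approximation, with the factor of 2 the passage substitutes for 1.96, to show how low a reader's true agreement could plausibly be after scoring ten versus one hundred responses.

import math

def lower_bound(observed_agreement: float, n_responses: int, z: float = 2.0) -> float:
    """Approximate lower limit of the true agreement rate, given the observed rate."""
    standard_error = math.sqrt(observed_agreement * (1 - observed_agreement) / n_responses)
    return observed_agreement - z * standard_error

print(round(lower_bound(0.70, 10), 2))    # 0.41 with only ten responses
print(round(lower_bound(0.70, 100), 2))   # 0.61 with one hundred responses (the passage rounds to about 0.60)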
Unfortunately, it is not generally reported for high-stakes testing how many responses are scored
to achieve the state-mandated level of reliability. Most of these assessments prescribe the minimal
level of agreement, but in some cases the agreement is between readers, not necessarily with an
expert. This produces yet another source of unreliability, the degree of inter-rater agreement. The
more common approach to inter-rater agreement is to construct the percent agreement across a set
of responses and then correct it for the chance agreement that could occur even if there were no real
agreement. That is, suppose that for practical purposes in a 0-4 system, the score of 0 represents
a blank sheet—that is, the student did not respond. These are easily separated from the rest of the
responses but have no real contribution to meaningful agreement so they are not considered. This
leaves a 1-4 range for scoring. It is easy to show that if scores were distributed equally across the
range, a random pairing of scores would produce 25% agreement. If the score distribution for the
population of students is actually normally distributed, centered at 2.5, for example, the chance
agreement is quite a bit greater, as high as almost 80%, depending on the assumptions about the
performance underlying the testing and scoring. In any case, the apparent agreement among rat-
ers must be accounted for. One solution is to calculate Cohen’s kappa, an agreement measure that
subtracts the chance agreement from the observed agreement.
The calculations illustrated next are for hypothetical data for two raters. The numbers in the
diagonal of this table reflect agreement between the two raters. For example, if you look in the top
left corner you see that there were two cases in which both raters assigned a score of one. Moving
to the next column where ratings of two coincide, you see that there were three cases in which both
raters assigned a score of two. However, there was one case in which Rater 1 assigned a one and
Rater 2 a two. Likewise, there was one case in which Rater 1 assigned a three and Rater 2 a two.
Note that we would need a much higher observed agreement to ensure a minimal 70% agreement
beyond chance. Because the classical estimate of reliability is always based on excluding error such
as chance, Cohen’s kappa is theoretically closer to the commonly agreed-on concept of reliability.
Unfortunately, there is little evidence that high-stakes assessments employ this method. Again, the
cost of ensuring this level of true agreement becomes quite high.
                              Rater 2 Scores
                      1       2       3       4       Percent
Rater 1      1        2       1       0       0        30%
Scores       2        0       3       0       0        30%
             3        0       1       1       0        20%
             4        0       1       0       1        20%
Percent              20%     60%     10%     10%
Observed agreement = 70%
Chance agreement = (30% × 20%) + (30% × 60%) + (20% × 10%) + (20% × 10%)
                 = 6% + 18% + 2% + 2%
                 = 28%
Standard error = 0.3945
A number of Web sites produce this computation. Simply search with the term Cohen’s kappa using
an Internet search engine to find one of these sites. For example, a site at Vassar University produced
the computations just used (Lowry, 2003).
For 20 essays, doubling the number of cases in the table, the estimate remains the same,
but the standard error is reduced to 0.2789. For 80 responses, multiplying the numbers in the
table by 8, the standard error is 0.1395. Thus, even with 80 responses, given a 70% observed agree-
ment among two raters, the actual Cohen’s kappa chance corrected agreement is as low as 0.42
(approximately 0.70 — 1.96 x 0.1395).
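For readers who want to reproduce the chance-corrected agreement itself, the following Python sketch is our own illustration. The cell counts are a table consistent with the marginals and the 70 percent observed agreement described above, and kappa is computed as (observed − chance) / (1 − chance); this is the point estimate, before the allowance for sampling error discussed in the passage.

# Cohen's kappa for the hypothetical two-rater table above (10 responses).
counts = [          # rows: Rater 1 scores 1-4; columns: Rater 2 scores 1-4
    [2, 1, 0, 0],
    [0, 3, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]

n = sum(sum(row) for row in counts)
observed = sum(counts[i][i] for i in range(4)) / n                               # diagonal proportion
rater_1_marginals = [sum(row) / n for row in counts]                             # row percentages
rater_2_marginals = [sum(counts[i][j] for i in range(4)) / n for j in range(4)]  # column percentages
chance = sum(r1 * r2 for r1, r2 in zip(rater_1_marginals, rater_2_marginals))    # expected by chance
kappa = (observed - chance) / (1 - chance)

print(f"observed = {observed:.2f}, chance = {chance:.2f}, kappa = {kappa:.2f}")
# observed = 0.70, chance = 0.28, kappa = 0.58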
1. Ensure that the criteria you are evaluating are clearly specified and directly observable.
2. Ensure that the standards clearly distinguish among levels of performance.
3. Select the type of scoring procedure that is most appropriate.
4. Have one or more trusted colleagues evaluate your scoring rubric.
5. When possible, rate performances without knowing the student's identity.
6. Rate the performance of every student on one task before proceeding to the next task.
7. Be sensitive to the presence of leniency, severity, or central tendency errors.
8. Conduct a preliminary reliability analysis to determine whether your ratings have acceptable reliability.
9. Have more than one teacher rate each student's performance.
Table 10.8 provides a summary of these guidelines for developing and implementing
procedures for scoring your performance assessments. Special Interest Topic 10.3 presents
a discussion of the problems some states have experienced incorporating performance as-
sessments into their high-stakes assessment programs.
Much has been written about the strengths and weaknesses of performance assess-
ments, and this is a good time to examine them in more detail.
SPECIAL INTEREST TOPIC 10.3

In our earlier edition we documented the experiences of states such as Kentucky and Ver-
mont in implementing versions of performance assessments as part of their statewide assess-
ment programs. Most of the evaluations of these experiences have been negative in terms
of limited reliability, validity, and benefit relative to the cost. Since the implementation of
The No Child Left Behind Act (NCLB) in 2001, there has been almost no statewide movement to-
ward employing performance assessments. Even those states that began experimenting with versions
of performance assessment have rejected them as too expensive for too little additional validity over
paper-and-pencil or computer-based testing.
This does not mean that performance assessment does not exist in education. It has been
adopted in postsecondary education in such areas as medical, veterinary, and management educa-
tion. In these cases there appear to be sufficient resources to conduct these more expensive perfor-
mance assessments. For example, in medical schools there are relatively few students per course
of study, and with a much greater expenditure per student, performance assessments are relatively
cheaper to conduct.
Performance Assessments May Make Learning More Meaningful and Help Motivate Students. Performance assessments are inherently attractive to teachers and students. To many students, testing under conditions that are similar to those they will encounter in the real world is more meaningful than paper-and-pencil tests. As a result, students might be more motivated to be actively engaged in the assessment process.
A single measure or approach is unlikely to adequately measure the knowledge, skills, and
complex procedures covered by rigorous content standards. Multiple measures and ap-
proaches can be used to capitalize on the strengths of each measurement technique, enhanc-
ing the utility of the assessment system and strengthening the validity of decisions based on
assessment results. (p. 9)
Although performance assessments have a number of strengths that support their use,
there are some disadvantages. We will now summarize these disadvantages.
Because students typically complete a limited number of tasks, your ability to generalize
with much confidence is limited. The solution to this limitation is to have students complete
multiple performance tasks in order to provide adequate domain sampling. Regrettably, due to
their time-consuming nature, this is not always possible.
Performance assessment is complex. It requires users to prepare and conduct their assess-
ments in a thoughtful and rigorous manner. Those unwilling to invest the necessary time and
energy will place their students directly in harm’s way. (p. 186)
There Are Practical Limitations That May Restrict the Use of Performance Assessments.
In addition to high time demands, other practical limitations might restrict the use of perfor-
mance assessments. These can include factors such as space requirements and special and
potentially expensive equipment and materials necessary to simulate a real-life setting.
Portfolios
Portfolios are a specific type of performance assessment that involves the systematic collection of a student's work products over a specified period of time according to a specific set of guidelines (AERA et al., 1999). As we noted earlier, artists, photographers, writers, and others have long used portfolios to represent their work, and in the last decade portfolios have become increasingly popular in the classroom. As typically applied in schools today, portfolios may best be
Decide on the Purpose of the Portfolio. The first step in developing a portfolio is to
determine the purpose or use of the portfolio. This is of foremost importance because it will
largely determine the content of your students’ portfolios. For example, you will need to
decide whether the portfolio will be used purely to enhance learning, as the basis for grades
(i.e., a scorable portfolio), or some combination of the two. If the purpose is only to enhance
learning, there is little need to ensure comparability among the entries in the portfolios.
Students can be given considerable freedom to include entries at their discretion. However,
if the portfolio is going to be used for summative evaluation and the assignment of grades,
then it is important to have standardized content across portfolios. This is necessary to pro-
mote a degree of comparability when evaluating the portfolios.
Decide on What Type of Items Will Be Placed in the Portfolio. It is also important
to determine whether the portfolios will showcase the students’ “best work,” represen-
tative products, or indicators of progress or growth. Best work portfolios contain what
the students select as their exemplary work, representative portfolios contain a broad
representative sample of the students’ work (including both exemplary and below-average
examples), and growth or learning-progress portfolios include selections that illustrate
the students’ progress over the academic period. A fourth type of portfolio referred to as
evaluation portfolios is designed to help teachers determine whether the students have met
established standards of performance. As such, they should contain products that demon-
strate the achievement of specified standards.
Decide Who Will Select the Items to Include in the Portfolio. The teacher must decide
who will be responsible for selecting the items to include in the portfolio: the teacher, the
student, or both. When selecting items the guiding principle should be to choose items that
will allow the teacher or other raters to make valid inferences about the students’ skills and
knowledge. To promote student involvement in the process, most professionals recommend
that teachers and students collaborate when selecting items to be included in the portfolio.
However, at certain times it may be necessary for the teacher to exert considerable control
over the selection of work products. For example, when it is important for scoring purposes
to ensure standardization of content, the teacher needs to closely supervise the selection of
work products.
Establish Procedures for Evaluating or Scoring the Portfolio. Student portfolios are
typically scored using scoring rubrics similar to those discussed in the context of scoring
essays and performance assessments. As described earlier, scoring rubrics should
■ Specify the evaluation criteria to be considered when evaluating the students' work products
■ Provide explicit standards that describe different levels of performance on each criterion
■ Indicate whether the criteria will be evaluated in a holistic or analytical manner
Promote Student Involvement in the Process. Actively involving students in the assess-
ment process is a goal of all performance assessments, and portfolio assessments provide
particularly good opportunities to solicit student involvement. As we suggested, students
should be involved to the greatest extent possible in selecting what items are included in their
portfolios. Accordingly, they should be involved in maintaining the portfolio and evaluating
the quality of the products it contains. Along these lines, it is highly desirable for teachers to
schedule regular student-teacher meetings to review the portfolio content and compare their
evaluations with those of the students. This enhances the students’ self-assessment skills,
helps them identify individual strengths and weaknesses, and increases their personal in-
volvement in the learning process (e.g., Gronlund, 1998).
Table 10.10 provides a summary of the guidelines for developing portfolios.
Like all assessment techniques, portfolios have their own set of strengths and weak-
nesses (Gronlund, 1998; Kubiszyn & Borich, 2003; Linn & Gronlund, 2000; Nitko, 2001;
Popham 1999, 2000).
Portfolios May Help Motivate Students and Get Them More Involved in the Learn-
ing Process. Because students typically help select items for and maintain the portfolio,
evaluating their progress as they do so, they may be more motivated to become actively
involved in the learning and assessment process.
Portfolios May Enhance Students’ Ability to Evaluate Their Own Performances and
Products. Because students are typically asked to evaluate their own progress, it is ex-
pected that they will demonstrate enhanced self-assessment skills.
When Used Correctly Portfolios Can Strengthen the Relationship between Instruction and Assessment. Because portfolios often incorporate products closely linked to classroom instruction, they can help strengthen the relationship between instruction and assessment.
Portfolios Can Enhance Teachers’ Communication with Both Students and Parents.
Providing regular student-teacher and parent-teacher conferences to review the contents of
portfolios is an excellent way to enhance communication.
The use of portfolios has great potential for enriching education and student assessment but
should not be viewed as an alternative to traditional tests and examinations. Students still
need to demonstrate proficiency on uniform tasks designed to be a representative sample of
the objectives of a course of study. One may have wonderful tasks in a science portfolio (col-
lections of rocks, leaves, insects, experiments) but have great gaps in understanding about
major laws of physics, genetics, and so on. (p. 311)
We largely concur with Dr. Hopkins. Portfolios (and other performance assessments) hold
great potential for enriching educational assessment practices. They have considerable
strengths and when used in a judicious manner will enhance the assessment of students. At
the same time we encourage teachers to be aware of the specific strengths and weaknesses of
all assessment techniques and to factor these in when developing their own procedures for
assessing student achievement. No approach to assessment—whether it is selected-re-
sponse items, constructed-response items, performance assessments, or portfolios—should
be viewed as the only way to assess student achievement. As we have repeatedly stated, no
single assessment approach can adequately assess all of the complex skills and knowledge
taught in today’s schools. By using multiple approaches to assessment, one can capitalize
on the strengths of the different approaches in order to elicit the most useful, reliable, and
accurate information possible. Table 10.11 provides a summary of the strengths and weak-
nesses of portfolio assessments.
Summary
In this chapter we focused on performance assessments and portfolios. These special types
of constructed-response tasks have been around for many years, but have gained increasing
popularity in schools in recent years. To many educators performance assessments are seen
as a positive alternative to traditional paper-and-pencil tests. Critics of traditional paper-
and-pencil tests complain that they emphasize rote memory and other low-level learning
objectives. In contrast they praise performance assessments, which they see as measuring
higher-level outcomes that mirror real-life situations.
Performance assessments require students to complete a process or produce a product
in a setting that resembles real-life situations (AERA et al., 1999). Performance assess-
ments can be used to measure a broad range of educational objectives, ranging from those
emphasizing communication skills (e.g., giving a speech, writing a term paper), to art (e.g.,
painting, sculpture), to physical education (e.g., tennis, diving, golf). Due to this diversity,
it is difficult to develop specific guidelines, but some general suggestions can facilitate the
development of performance assessments. These can be categorized as guidelines for select-
ing performance tasks, developing clear instructions, developing procedures for evaluating
students’ performance, and implementing procedures for minimizing rating errors. These
are listed next.
Portfolios involve the systematic collection of a student's work products over a specified period of time according to a specific set of guidelines (AERA et al., 1999). Guidelines for developing and using portfolios include the following:
Specify the purpose of the portfolio (e.g., enhance learning, grading, both?).
Decide on the type of items to be placed in the portfolio.
Specify who will select the items to include in the portfolio.
Establish procedures for evaluating the portfolios.
Promote student involvement in the process.
Portfolios are good at reflecting student achievement and growth over time.
Portfolios may help motivate students and get them more involved in the learning
process.
Portfolios may enhance students’ ability to evaluate their own performances and
products.
When used correctly portfolios can strengthen the relationship between instruction
and assessment.
Portfolios can enhance teachers’ communication with both students and parents.
RECOMMENDED READINGS
Feldt, L. (1997). Can validity rise when reliability declines? Applied Measurement in Education, 10, 377-387. This extremely interesting paper argues that at least in theory performance tests of achievement can be more valid than constructed-response tests even though the performance assessments have lower reliability. He notes that now the challenge is to find empirical examples of this theoretical possibility.
Rosenquist, A., Shavelson, R., & Ruiz-Primo, M. (2000). On the "exchangeability" of hands-on and computer-simulated science performance assessments (CSE Technical Report 531). Stanford University, CA: CRESST. Previous research has shown inconsistencies between scores on hands-on and computer-simulated performance assessments. This paper examines the sources of these inconsistencies.
CHAPTER 11

Assigning Grades on the Basis of Classroom Assessments
CHAPTER HIGHLIGHTS
LEARNING OBJECTIVES
The process of teaching is a dialogue between the teacher and the student. One important
aspect of this dialogue, both oral and written, is the evaluation of student performance. This
evaluative process includes the testing procedures described in the last four chapters. Ad-
ditionally, in most schools it is mandatory for student evaluations to
include the assignment of grades or marks. In this context, marks are typically defined as cumulative grades that reflect students' academic progress during a specific period of instruction. In this chapter we will be using the term score to reflect performance on a single assessment procedure (e.g., test or homework assignment), and grades
and marks interchangeably to denote a cumulative evaluation of student performance (e.g.,
cumulative semester grade). In actual practice, people will often use score, grade, and mark
synonymously.
Our discussion will address a variety of issues associated with the process of assign-
ing grades or marks. First we will discuss some of the ways tests and other assessment
procedures are used in schools. This includes providing feedback to students and making
evaluative judgments regarding their progress and achievement. In this context we also
discuss the advantages and disadvantages of assigning grades. Next we discuss some of the
more practical aspects of assigning grades. For example, “What factors should be consid-
ered when assigning grades?” and “What frame of reference should be used when assigning
grades?” We then turn to a slightly technical, but hopefully practical, discussion of how to
combine scores into a composite or cumulative grade. Finally, we present some suggestions
on presenting information about grades to students and parents.
We have noted that tests and other assessment procedures are used in many ways in school
settings. For example, tests can be used to provide feedback to students about their prog-
ress, to evaluate their achievement, and to assign grades. Testing applications can be classified as either formative or summative. Formative evaluation involves evaluative activities that are aimed at providing feedback to students. In this context, feedback implies
the communication of information concerning a student’s performance or achievement that
is intended to have a corrective effect. Formative evaluation is typically communicated
directly to the students and is designed to direct, guide, and modify their behavior. To be
useful it should indicate to the students what is being done well and what needs improve-
ment. By providing feedback in a timely manner, formative evaluation can help divide
the learning experience into manageable components and provide structure to the instruc-
tional process. Because tests can provide explicit feedback about
Formative evaluation involves
what one has learned and what yet needs to be mastered, they make
evaluative activities aimed at
a significant contribution to the learning process. The students are
providing feedback to students. made aware of any gaps in their knowledge and learn which study
Summative evaluation involves Strategies are effective and which are not. In addition to guiding
the determination of the worth, learning, tests may enhance or promote student motivation. Receiv-
value, or quality of an outcome. ing a “good” score on a weekly test may provide positive reinforce-
¢
Assigning Grades on the Basis of Classroom Assessments 279
ment to students, and avoiding “poor” scores can motivate students to increase their study
activiti
(aN GUEOMEN the classroom summative evaluation typically involves the formal evaluation
of performance or progress in a course, often in the form of a numerical or letter grade or
mark. Summative evaluation is often directed to others beside the student, such as parents
and administrators. Grades are generally regarded in our society as formal recognition of a
specific level of mastery. Significant benefits of grades include the following:
■ Although there is variation in the way grades are reported (e.g., letter grades, numerical grades), most people are reasonably familiar with their interpretation. Grades provide a practical system for communicating information about student performance.
■ Ideally, summative evaluations provide a fair, unbiased system of comparing students
that minimizes irrelevant criteria such as socioeconomic status, gender, race, and so on.
This goal is worthy even if attainment is less than perfect. If you were to compare the stu-
dents who entered European or American universities a century ago with those of today,
you would find that privilege, wealth, gender, and race count much less today than at that
time. This is due in large part to the establishment of testing and grading systems that have
attempted to minimize these variables. This is not to suggest that the use of tests has com-
pletely eliminated bias, but that tests and grades reflect a more objective approach to evaluation that
may help reduce the influence of irrelevant factors.
As you can see from this brief discussion, grades have both significant benefits and limita-
tions. Nevertheless, they are an engrained aspect of most educational systems and are more
than likely going to be with us for the foreseeable future. As a result, it behooves us to un-
derstand how to assign grades in a responsible manner that capitalizes on their strengths and
minimizes their limitations. Special Interest Topic 11.1 provides a brief history of grading
policies in universities and public schools.
SPECIAL INTEREST TOPIC 11.1
Brookhart (2004) provides a discussion of the history of grading in the United States. Here are a few
of the key developments she notes in this time line.
■ Pre-1800: Grading procedures were first developed in universities. Brookhart's research suggests that the first categorical grading scale was used at Yale in 1785 and classified students as Optimi (i.e., best), Second Optimi (i.e., second best), Inferiores (i.e., lesser), and Pejores (i.e., worse). In 1813 Yale adopted a numerical scale by which students were assigned grades between 1 and 4, with decimals used to reflect intermediary levels. Some universities developed scales with more categories (e.g., 20) whereas others tried simple pass-fail grading.
■ 1800s: The common school movement of the 1800s saw the development of public schools
designed to provide instruction to the nation’s children. Initially these early schools adopted
grading scales similar to those in use at universities. About 1840 schools started the practice
of distributing report cards. Teachers at the time complained that assessment and grading
were too burdensome and parents complained the information was difficult to interpret.
These complaints are still with us today!
■ 1900s: Percentage grading was common in secondary schools and universities at the begin-
ning of the twentieth century. By 1910, however, educators began to question the reliability
and accuracy of using a scale with 100 different categories or scale points. By the 1920s the
use of letter grades (A, B, C, D, and F) was becoming the most common practice. During
the remainder of the 1900s, a number of grading issues came to the forefront. For example,
educators became increasingly aware that nonachievement factors (e.g., student attitudes
and behaviors and teacher biases) were influencing the assignment of grades and recognized
that this was not a desirable situation. Additionally there was a debate regarding the merits
of norm-referenced versus criterion-referenced grading systems. Finally, efforts were made
to expand the purpose of grades so they not only served to document the students’ level of
academic achievement but also served to enhance the learning of students. As you might
expect, these are all issues that educators continue to struggle with to this day.
In some respects we have come a long way in refining the ways we evaluate the performance of our students. At the same time, we are still struggling with many of the same issues we struggled with a hundred years ago.
is sequential. Sequential content is material that must be learned in a particular order, such
as mathematics. For example, students learn addition and subtraction of single digits before
progressing to double digits. Topical content, on the other hand, can often be learned equally
well in various orders. Literature typically can be taught in many different ways because
the various topics can be ordered to fit many different teaching strategies. For example, in
literature a topic can be taught chronologically, such as with the typical survey of English
literature, or topically, such as with a course organized around the type of writing: essay,
poem, short story, and novel. In this last example, content within each category might itself
be organized in a topical or sequential manner.
The cumulative grade or mark in a course is typically considered a judgment of the
mastery of content or overall achievement. If certain objectives are necessary to master
later ones, it makes little sense to grade the early objectives as part of the final grade
because they must have been attained in order to progress to the later objectives. For ex-
ample, in studying algebra one must master the solution of single-variable equations before
progressing to the solution of two-variable equations. Suppose a student receives a score
of 70 on a test involving single-variable equations and a 100 on a later test involving two-
variable equations. Should the 70 and 100 be averaged? If so, the resulting grade reflects
the average mastery of objectives at selected times in the school year, not final mastery of
objectives. Should only the latest score be used? If so, what of the student who got a 100
on the first test and a 70 on the second test? These grades indicate a high degree of mastery
of the earlier objectives and less mastery of later objectives. What should be done? Our
answer is that for sequential content, summative evaluation should be based primarily on
performance at the conclusion of instruction. At that point the student will demonstrate
which objectives have been mastered and which have not. Earlier evaluations indicate only
the level of performance at the time of measurement and penalize students who take longer
to master the objectives but who eventually master them within the time allowed for mas-
tery. This is not to minimize the importance or utility of employing formative evaluation
with sequential material, only that formative evaluation may be difficult to meaningfully
incorporate into summative evaluation. With sequential material, use formative evaluations
to provide feedback, but when assigning grades emphasize the students’ achievement at
the conclusion of instruction.
Topical content, material that is related but with objectives that need not be mastered
in any particular order, can be evaluated more easily in different sections. Here formative
evaluations can easily serve as part of the summative evaluation. There is little need or rea-
son to repetitively test the objectives in each subsequent evaluation. Mixed content, in which
some objectives are sequential and others topical, should be evaluated by a combination of
the two approaches. Later in this chapter we will show you how you can “weight” different
assessments and assignments to meet your specific needs.
Letter Grades. Letter grades are the most common marking system used in schools and universities today. Although there might be some variation in the meaning attached to them, letter grades are typically interpreted as reflecting a continuum of performance from excellent (A) through average (C) to failing (F).
Students and parents generally understand letter grades, and the evaluative judgment
represented by the grade is probably more widely accepted than any other system available.
However, as we alluded to in the previous section, letter grades do have limitations. One
significant limitation of letter grades is that they are only a summary statement and convey
relatively little useful information beyond the general or aggregate level of achievement.
Although teachers typically have considerable qualitative information about their students’
specific strengths and weaknesses, much of this information is lost when it is distilled down
to a letter grade. In some schools teachers are allowed to use pluses and minuses with letter
grades (e.g., A— or C+). This approach provides more categories for classification, but still
falls short of conveying rich qualitative information about student achievement. Special
Interest Topic 11.2 provides information on how some schools are experimenting with de-
leting “Ds” from their grading scheme.
Naturally, other grading systems are available, including the following.
Numerical Grades. Numerical grades are similar to letter grades in that they attempt to
succinctly represent student performance, here with a number instead of a letter. Numerical
grades may provide more precision than letter grades. For example, the excellent perfor-
mance represented by a grade of A may be further divided into numerical grades ranging
from the elusive and coveted 100 to a grade of 90. Nevertheless, numerical grades still only
summarize student performance and fail to capture much rich detail.
Verbal Descriptors. Another approach is to replace letter grades with verbal descrip-
tors such as excellent, above average, satisfactory, or needs improvement. Although the
number of categories varies, this approach simply replaces traditional letter grades with
verbal descriptors in an attempt to avoid any ambiguity regarding the meaning of the
mark.
Pass-Fail. Pass-fail grades and other two-category grading systems have been used for
many years. For example, some high schools and universities offer credit/no-credit grading
for selected courses (usually electives). A variant is mastery grading, which is well suited
for situations that emphasize a mastery learning approach in which all or most students are
expected to master the learning objectives and given the time necessary to do so. In situations
in which the learning objectives are clearly specified, a two-category grading system may
be appropriate, but otherwise it may convey even less information than traditional letter/
numerical grades or verbal descriptors.
SPECIAL INTEREST TOPIC 11.2
Hoff (2003) reports that some California high schools are considering deleting the letter D from their
grading systems. He notes that at one high school the English department has experimented with
deleting Ds with some success. The rationale behind the decision is that students who are making
Ds are not mastering the material at the level expected by the schools. This became apparent when
schools noticed that the students making Ds in English were, with very few exceptions, failing the
state-mandated exit examination. This caused them to question whether it is appropriate to give
students a passing grade if it is almost assured that they will not pass the standardized assessment
required for them to progress to the next grade or graduate. Schools also hoped that this policy would
motivate some students to try a little harder and elevate their grades to Cs. There is some evidence
that this is happening. For example, after one English department did away with Ds, approximately
one-third of the students who had made Ds the preceding quarter raised their averages to a C level,
while about two-thirds received Fs. The policy has generally been well received by educators and is
likely to be adopted by other departments and schools.
Supplemental Systems. Many teachers and/or schools have adopted various approaches
to replace or supplement the more traditional marking systems. For example, some teachers
use a checklist of specific learning objectives to provide additional information about their
students’ strengths and weaknesses. Other teachers use letters, phone conversations, or in-
dividual conferences with parents to convey more specific information about their students’
individual academic strengths and weaknesses. Naturally, all of these approaches can be
used to supplement the more traditional grading/marking systems.
Another essential question in assigning grades involves a decision regarding the basis for
grades. By this we mean “Are grades assigned purely on the basis of academic achievement,
or are other student characteristics taken into consideration?” For example, when assign-
ing grades should one take into consideration factors such as a student’s attitudes, behav-
ior, class participation, punctuality, work/study habits, and so forth? As a general rule these
nonachievement factors receive more consideration in elementary school, whereas in the
secondary grades the focus narrows to achievement (Hopkins, 1998). While recognizing
the importance of these nonachievement factors, most assessment experts recommend that
actual academic achievement be the sole basis for assigning achievement grades. If desired,
teachers should assign separate ratings for these nonachievement factors (e.g., excellent,
satisfactory, and unsatisfactory). The key is that these factors should
be rated separately and independently from achievement grades. This keeps academic grades as relatively pure marks of achievement that are not contaminated by nonachievement factors. When educators mix achievement and nonachievement factors, the meaning of the grades is blurred. Table 11.1 provides an example of an elementary school report intended to separate achievement from other factors; in the table, academic subjects (e.g., math, social studies, science) receive achievement grades while overall classroom behavior is rated on a separate scale (E, S, N, U). Special Interest Topic 11.3 addresses the issue of lowering grades as a means of classroom discipline.
Frame of Reference
Once you have decided what to base your grades on (hopefully academic achievement), you
need to decide on the frame of reference you will use. In the following sections we will
discuss the most common frames of reference.
The first frame of reference we will discuss is norm-referenced or relative grading, which involves comparing each student's performance to that of a specific reference group (comparable to the norm-referenced approach to score interpretation discussed in Chapter 3). This approach to assigning grades is also referred to as "grading on the curve."

SPECIAL INTEREST TOPIC 11.3
Nitko (2001) distinguishes between “failing work” and “failure to try.” Failing work is work of such
poor quality that it should receive a failing grade (i.e., F) based on its merits. Failure to try is when
the student, for some reason, simply does not do the work. Should students who fail to try receive
a failing grade? What about students who habitually turn in their assignments late? Should they be
punished by lowering their grades? What about students who are caught cheating? Should they be
given a zero? These are difficult questions that don’t have simple answers.
Nitko (2001) contends that it is invalid to assign a grade of F for both failing work and failure
to try because they do not represent the same construct. The F for failing work represents unaccept-
able achievement or performance. In contrast, failing to try could be due to a host of factors such as
forgetting the assignment, misunderstanding the assignment, or simply defiant behavior. The key
factor is that failing to try does not necessarily reflect unacceptable achievement or performance.
Likewise, Nitko contends that it is invalid to lower grades as punishment for turning in assignments
late. This confounds achievement with discipline.
Along the same lines, Stiggins (2001) recommends that students caught cheating should not
be given a zero because this does not represent their true level of achievement. In essence, these
authors are arguing that from a measurement perspective you should separate the grade from the
punishment or penalty. Nitko (2001) notes that these are difficult issues, but they are classroom
management or discipline issues rather than measurement issues. As an example of a suggested
solution, Stiggins (2001) recommends that instead of assigning a zero to students caught cheating,
it is preferable to administer another test to the student and use this grade. This way, punishment
is addressed as a separate issue that the teacher can handle in a number of ways (e.g., detention
or in-school suspension). Teachers face these issues on a regular basis, and you should consider
them carefully before you are faced with them in the classroom.
Although the reference group varies, it is often the
students in a single classroom. For example, a teacher might specify the following criteria
for assigning marks: the top 10% of students receive As, the next 20% Bs, the middle 40% Cs, the next 20% Ds, and the lowest 10% Fs.
With this arrangement, in a class of 20 students, the two students receiving the highest grades
will receive As, the next four students will receive Bs, the next eight students will receive
Cs, the next four students will receive Ds, and the two students with the lowest scores will
receive Fs. An advantage of this type of grading system is that it is straightforward and clearly
specifies what grades students will receive. A second advantage is that it helps prevent grade
inflation, which occurs when teachers are too lenient in their grading and a large proportion
of their students receive unwarranted high marks.
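To make the mechanics of relative grading concrete, the short sketch below (ours, not the authors') assigns letter grades by class rank using the 10%-20%-40%-20%-10% distribution from the example above; the function name and structure are illustrative assumptions.

    def grade_on_curve(scores, distribution=(("A", 0.10), ("B", 0.20), ("C", 0.40), ("D", 0.20), ("F", 0.10))):
        """Assign letter grades by rank using fixed percentages (relative grading)."""
        ranked = sorted(scores, key=scores.get, reverse=True)   # student names, highest score first
        grades, start = {}, 0
        for letter, pct in distribution:
            count = round(pct * len(ranked))                    # how many students get this letter
            for name in ranked[start:start + count]:
                grades[name] = letter
            start += count
        for name in ranked[start:]:                             # any rounding leftovers get the lowest grade
            grades[name] = distribution[-1][0]
        return grades

    # In a class of 20, the 2 highest scores earn As, the next 4 Bs, the next 8 Cs,
    # the next 4 Ds, and the 2 lowest scores Fs.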
This approach to assigning grades does have limitations. Possibly the most prominent
limitation is that there can be considerable variability among reference groups. If the refer-
ence group is a single classroom, some classes will be relatively high achieving and some
relatively low achieving. If a student is fortunate enough to be in a low-achieving class, he
or she will stand a much better chance of receiving a high grade than if he or she is in a high-
achieving class. But consider the unlucky “average” student who is assigned to a very high-
achieving class. Although this student’s performance might have been sufficient to earn a
respectable mark in an average classroom, relative to the high-achieving students the stu-
dent might receive a poor grade. Additionally, if the teacher strictly follows the guidelines,
a certain percentage of students will receive poor grades by default. To overcome this, some
teachers maintain records over several years in order to establish more substantive reference
data. In this manner a teacher can reduce the influence of variability in class achievement.
The use of large reference groups containing data from many classes accumulated over time
is one of the best approaches to help minimize this limitation.
Gronlund (1998) provides another approach to reducing the effect of variability in
class achievement. He recommends using ranges of percentages instead of precise percent-
ages for each grade rather than a single fixed percentage for each.
This approach gives the teacher some flexibility in assigning grades. For example, in a
gifted and talented class one would expect more As and Bs, and few Ds or Fs. The use of
percent ranges provides some needed flexibility.
Another limitation of the norm-referenced approach is that the percentage of students
being assigned specific grades is often arbitrarily assigned. In our example we used 20% for
Bs and 40% for Cs; however, it would be just as defensible to use 15% for Bs and 50% for
Cs. Often these percentages are set by the district or school administrators, and no particular set of percentages is intrinsically better than another. A final limitation is that with relative grading, grades
are not specifically linked to an absolute level of achievement. At least in theory it would be
possible for students in a very low-achieving group to receive relatively high marks without
actually mastering the learning objectives. Conversely, in a high-achieving class some
students may fail even when they have mastered much of the material. Obviously neither of
these outcomes is desirable!
The next frame of reference is criterion-referenced or absolute grading, which involves comparing a student's performance to a specified level of performance rather than to other students. Although this approach is sometimes framed as a simple dichotomy (i.e., mastery vs. nonmastery), it can also reflect a continuum of achievement. One of the most common criterion-referenced grading systems is the traditional percentage-based system. In it grades are based on percentages, usually interval bands based on a combination of grades from tests and other assignments; for example, a school might specify that an average of 90 to 100 earns an A, 80 to 89 a B, 70 to 79 a C, 60 to 69 a D, and below 60 an F.
Many schools list such bands as formal criteria even though they are often modified in ac-
tual practice. An advantage of this grading approach is that the marks directly describe the
performance of students without reference to other students. As a result, there is no limit
on the number of students that receive any specific grade. For example, in a high-achieving
class all students could conceivably receive As. Although such an extreme outcome is not
likely in most schools, the issue is that there is no predetermined percentage of students that
must receive each grade. Another advantage is that this system, like the norm-referenced
approach, is fairly straightforward and easy to apply.
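A minimal sketch (ours) of absolute grading follows; the 90/80/70/60 cutoffs mirror the common convention mentioned above and are assumptions for illustration, not a prescribed standard.

    def absolute_grade(percent, bands=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
        """Map a percent-correct average to a letter grade using fixed cutoffs (absolute grading)."""
        for cutoff, letter in bands:
            if percent >= cutoff:
                return letter
        return "F"

    # Unlike grading on the curve, there is no limit on how many students can earn each grade.
    print([absolute_grade(p) for p in (95, 84, 72, 58)])   # ['A', 'B', 'C', 'F']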
The major limitation of criterion-referenced or absolute grading is that there is con-
siderable variability in the level of difficulty among tests and other academic assignments
assigned by teachers. Some teachers create tests that are extremely difficult and others write
relatively easy tests. Some teachers are consistently more rigorous in their grading than oth-
ers. As a result, a rigorous teacher might have a class average of only 60% or 70%, whereas
a more lenient teacher might have a class average of 80% or 90%. This inherent variability
in difficulty level across courses makes it difficult to interpret or compare the meaning of
scores based on an absolute standard in a consistent manner.
There are other problems associated with basing grades on effort or improvement. For
example, the measurement of improvement or change is plagued with numerous technical
problems (Cronbach & Furby, 1970). Additionally, you have the mixing of achievement
with another factor, in this instance effort or improvement. As we suggested before, if you
want to recognize effort or improvement, it is more defensible to
assign separate scores to these factors. Achievement grades should reflect achievement and not be contaminated by other factors. Finally, although this approach is typically intended to motivate poor students, it can have a negative effect on better students. Based on these problems, our recommendation is not to reward effort/improvement over achievement except in special situations. One situation in which the reward of
effort may be justified is in the evaluation of students with severe disabilities for whom
grades may appropriately be used to reinforce effort.
Recommendation
Although we recommend against using achievement relative to aptitude, effort, or improve-
ment as a frame of reference for assigning grades, we believe both absolute and relative
grading systems can be used successfully. They both have advantages and limitations, but when used conscientiously either approach can be effective. It is even possible for teachers to use a combination of absolute and relative grading systems in secondary schools and universities. For example, Hopkins (1998) recommends that high schools and
used successfully.
colleges report a conventional absolute grade and also the students’
relative standing in their graduating class (e.g., percentile rank). This
would serve to reduce the differences in grading across schools. For example, a student might
have a grade point average (GPA) of 3.0 but a percentile rank of 20. Although the GPA is
adequate, the percentile rank of 20 (i.e., indicating the student scored better than only 20% of
the other students) suggests that this school gives a high percentage of high grades.
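A small sketch (ours) of the relative standing Hopkins describes: percentile rank is computed here simply as the percentage of classmates with a lower GPA, so the same 3.0 GPA can carry very different meanings in different schools. The class GPAs are hypothetical.

    def percentile_rank(gpa, class_gpas):
        """Percentage of students in the class whose GPA is below the given value."""
        below = sum(1 for g in class_gpas if g < gpa)
        return 100 * below / len(class_gpas)

    # In a class where most GPAs are high, a 3.0 ranks near the bottom.
    class_gpas = [3.9, 3.8, 3.7, 3.6, 3.5, 3.4, 3.3, 3.0, 2.8, 2.5]
    print(percentile_rank(3.0, class_gpas))   # 20.0, an adequate GPA but a low class standing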
When it is time to assign grades at the end of a six-week period, semester, or some other
grading period, teachers typically combine results from a variety of assessments. This can
include tests, homework assignments, performance assessments, and
the like. The decision of how to weight these assessments is usually left to the teacher and reflects his or her determination of what should be emphasized and to what degree (see Special Interest Topic 11.4). For example, if a teacher believes the primary determiner of a course grade should be performance on tests, he or she would weight test performance heavily and place less emphasis on homework and term papers. Another teacher, with a different grading philosophy, might decide to emphasize homework and papers and place less emphasis on test performance. Whatever your grading philosophy, you need a
system for effectively and efficiently combining scores from a variety of assessment proce-
dures into a composite score. On initial examination this might appear to be a fairly simple
procedure, but it is often more complicated than it appears.
Consider the following data illustrating a simple situation. Here we have two assess-
ment procedures, a test and a homework assignment. For our initial example we will assume
the teacher wants to weight them equally. The test has a maximum score of 100 whereas the
homework assignment has a maximum value of 50. Suppose Johnny scores 100 on the test and 20 on the homework (sum = 120), whereas Sally scores 40 on the test and 50 on the homework (sum = 90).
Johnny had a perfect score on the test but the lowest grade on the homework assignment.
Sally had the opposite results, a perfect score on the homework assignment and the low-
est score on the test.
SPECIAL INTEREST TOPIC 11.4
The decision regarding how to weight different assessment procedures is a personal decision
based on a number of factors. Some teachers emphasize homework, some tests, some term pa-
pers, and some performance assessments. No specific practice is clearly right or wrong, but an
understanding of psychometric principles may offer some guidance. First, as we have stated
numerous times, it is desirable to provide multiple assessment opportunities. Instead of relying
on only a midterm and a final, it is best to provide numerous assessment opportunities spread
over the grading period. Second, when possible it is desirable to incorporate different types of
assessment procedures. Instead of relying exclusively on any one type of assessment, when fea-
sible try to incorporate a variety. Finally, when determining weights, consider the psychometric
properties of the different assessment procedures. For example, we often weight the assessments
that produce the most reliable scores and valid interpretations more heavily than those with
less sound psychometric properties. Table 11.2 provides one of our weighting procedures for
an introductory course in tests and measurement. All of the tests are composed of
multiple-choice and short-answer items and make up 90% of the final grade. We include a term
paper, but we are aware that its scores are less reliable and valid so we count it for only 10% of
the total grade.
Naturally this approach emphasizing more objective tests is not appropriate for all classes.
For example, it is difficult to assess performance in a graduate level course on professional ethics
and issues using assessments that emphasize multiple-choice and short-answer. In this course we
use a combination of tests composed of multiple-choice items, short-answer items, and restricted-
response essays along with class presentations and position papers. The midterm and final ex-
amination receive a weighting of approximately 60% of the final grade, with the remaining 40%
accounted for by more subjective procedures. In contrast, in an introductory psychology course,
which typically has about 100 students in every section, we would likely use all multiple-choice
tests. We would not be opposed to requiring a term paper or including some short-answer or
restricted-response items on the tests, but due to the extensive time required to grade these pro-
cedures we rely on the objectively scored tests (which are scored in the computer center). We do
provide multiple assessment opportunities in this course (four semester tests and a comprehensive
final).
In addition to reliability/validity issues and time considerations, you may want to consider
other factors. For example, because students may receive assistance from parents (or others) on
homework assignments, their performance might not be based solely on their own abilities. This
is also complicated by variability in the amount of assistance students receive. Some students may
get considerable support/assistance from parents while others receive little or none. The same
principle applies to time commitment. An adolescent with few extracurricular activities will have
more time to complete homework assignments than one who is required to maintain a part-time
job (or is involved in athletics, band, theater, etc.). If you provide no weight to homework assign-
ments it removes any incentive to complete the work, whereas basing a large proportion of the
grade on homework will penalize students who receive little assistance from parents or who are
involved in many outside activities. Our best advice is to take these factors into consideration and
adopt a balanced approach. Good luck!
If the summed scores were actually reflecting equal weighting as the teacher expects, the composite scores would be equal. Obviously they are not. Johnny's
score (i.e., 120) is considerably higher than Sally’s score (i.e., 90). The problem is that
achievement test scores have more variability, and as a result they have an inordinate influ-
ence on the composite score.
To correct this problem you need to equate the scores by taking into consideration the
differences in variability. Although different methods have been proposed for equating the
scores, this can be accomplished in a fairly accurate and efficient manner by simply correct-
ing for differences in the range of scores (it is technically preferable to use a more precise
measure of variability than the range, but for classroom applications the range is usually
sufficient). In our example, the test scores had a range of 60 while the homework scores had
a range of 30. By multiplying the homework scores by 2 (i.e., our equating factor) we can
equate the ranges of the scores and give them equal weight in the composite score. Consider
the following illustration: multiplying the homework scores by 2 gives Johnny a composite of 100 + (20 × 2) = 140 and Sally a composite of 40 + (50 × 2) = 140.
Note that this correction resulted in the assignments being weighted equally; both students
received the same composite score. If in this situation the teacher wanted to calculate a
percentage-based composite score, he or she would simply divide the obtained composite
score by the maximum composite score (in this case 200) and multiply by 100. This would
result in percentage-based scores of 70% for both Johnny and Sally.
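The range-equating arithmetic above can be summarized in a short sketch (ours); it multiplies the homework scores by the equating factor of 2 so their range matches the tests' range and then forms the equally weighted composite for the two students.

    def equating_factor(scores_a, scores_b):
        """Factor that multiplies the second set of scores so its range matches the first set's."""
        return (max(scores_a) - min(scores_a)) / (max(scores_b) - min(scores_b))

    test = {"Johnny": 100, "Sally": 40}        # range = 60, maximum possible score = 100
    homework = {"Johnny": 20, "Sally": 50}     # range = 30, maximum possible score = 50

    factor = equating_factor(test.values(), homework.values())       # 60 / 30 = 2
    max_composite = 100 + 50 * factor                                 # 200
    for student in test:
        composite = test[student] + homework[student] * factor        # equal weighting after equating
        print(student, composite, 100 * composite / max_composite)    # both students: 140 and 70.0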
In the previous example we assumed the teacher wanted the test and homework scores
to be equally weighted. Now we will assume that the teacher wants the test score to count
three times as much as the homework score. In this situation we would add another multi-
plier to weight the test score as desired: multiplying each test score by 3 (in addition to the homework equating factor of 2) gives Johnny a composite of (100 × 3) + (20 × 2) = 340 and Sally a composite of (40 × 3) + (50 × 2) = 220.
With this weighting, Johnny’s composite score (i.e., 340) is considerably higher than Sally’s
composite score (i.e., 220) because the teacher has chosen to place more emphasis on the test
score relative to the homework assignments. If the teacher wanted to calculate a percentage-
based composite score, he or she would divide the obtained composite score by the maximum
composite score (in this case 400) and multiply by 100. This would result in percentage-
based scores of 85% for Johnny and 55% for Sally.
A Short Cut. Although the preceding approach is preferred from a technical perspective,
it can be a little unwieldy and time consuming. Being sensitive to the many demands on a
teacher’s time, we will now describe a simpler and technically adequate approach that may
be employed (e.g., Kubiszyn & Borich, 2000). With this approach, each component grade
is converted to a percentage score by dividing the number of points awarded by the total
potential number of points. Using data from the previous examples with an achievement
test with a maximum score of 100 and a homework assignment with a maximum score of
50, we have the following results: Johnny's test converts to 100% and his homework to 40% (i.e., 20/50), whereas Sally's test converts to 40% and her homework to 100%.
This procedure equated the scores by converting them both to a 100-point scale (based on
the assumption that the converted scores are comparable in variance). If one were then to
combine these equated scores with equal weighting, you would get the following results: an average of 70% for both Johnny and Sally, just as with the longer procedure.
If one wanted to use different weights for the assessments, you would simply mul-
tiply each equated score by a percentage that represents the desired weighting of each
assessment. For example, if you wanted the test score to count three times as much as the
homework assignment, you would multiply the equated test score by 0.75 (i.e., 75%) and
the equated homework score by 0.25 (i.e., 25%). Note that the weights (i.e., 75% and 25%)
equal 100%. You would get the following results: (100 × 0.75) + (40 × 0.25) = 85% for Johnny and (40 × 0.75) + (100 × 0.25) = 55% for Sally, the same values obtained with the longer procedure.
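A brief sketch (ours) of the shortcut: each score is first converted to percent correct and then combined with weights that sum to 1.0; the 75%/25% weights and raw scores are those of the running example.

    def weighted_composite(percent_scores, weights):
        """Combine percent-correct scores using weights that sum to 1.0."""
        return sum(percent_scores[name] * weight for name, weight in weights.items())

    # Percent correct = points earned / points possible * 100
    johnny = {"test": 100 / 100 * 100, "homework": 20 / 50 * 100}   # 100% and 40%
    sally = {"test": 40 / 100 * 100, "homework": 50 / 50 * 100}     # 40% and 100%

    weights = {"test": 0.75, "homework": 0.25}     # the test counts three times as much
    print(weighted_composite(johnny, weights))     # 85.0
    print(weighted_composite(sally, weights))      # 55.0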
For example, consider a course with the following grading scheme:

Test 1: 20%
Test 2: 20%
Test 3: 20%
Term paper: 10%
Final examination: 30%
Total: 100%

The scores reported for each assessment procedure will be percent correct. As noted, a relatively easy way to equate your scores on different procedures is to report them as percent correct (computed as the number of points obtained by the student divided by the maximum number of points). We use the scores of three students in this illustration.
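As a hedged illustration of how such a scheme combines scores, the sketch below (ours) applies these weights to a hypothetical student's percent-correct scores; the names and values are not those of the authors' three-student illustration.

    COURSE_WEIGHTS = {"Test 1": 0.20, "Test 2": 0.20, "Test 3": 0.20,
                      "Term paper": 0.10, "Final examination": 0.30}

    def course_composite(percent_scores):
        """Weighted average of percent-correct scores using the course weights above."""
        return sum(percent_scores[name] * weight for name, weight in COURSE_WEIGHTS.items())

    # Hypothetical student, for illustration only:
    scores = {"Test 1": 88, "Test 2": 92, "Test 3": 75, "Term paper": 90, "Final examination": 84}
    print(round(course_composite(scores), 1))   # 85.2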
As we indicated, there are a number of commercially available grade book programs. There are
programs for both Mac and PC users and many of these programs reside completely on your
computer. A new trend is Web-based applications that in addition to recording scores and calcu-
lating grades allow students and parents to check on their progress. Clearly with technological
advances these programs will become more sophisticated and more widely available. Here are
just a few of the many grade book programs available and their Web addresses:
Students clearly have a right to know the procedures that will be used to determine their
grades. This information should be given to the students early in a course and well before
any assessment procedures are administered that are included in the grading process. A
common question is “How old should students be before they can benefit from this infor-
mation?” We believe any students who are old enough to be administered a test or given an
assignment are also old enough to know how their grades will be determined. Parents should
also be informed in a note or in person during conferences or visits what is expected of
their children and how they will be graded. For students in upper elementary grades and beyond, an easy way to inform them of grading requirements is a handout such as shown in Table 11.5. This system is similar to those used by one of the authors in his classes.
Students and parents should be informed of the grades obtained as well. Feedback and reporting of grades should be done individually in a protected manner. Grades or test scores should not be posted
or otherwise displayed in any way that reveals a student’s individual
performance. A federal law, the Family Educational Rights and Privacy Act (FERPA; also
known as Public Law 93-380 or the Buckley Amendment) governs the maintenance and
release of educational records, including grades, test scores, and related evaluative mate-
rial. Special Interest Topic 11.5 provides information on this law from the Department of
Education’s FERPA compliance home page.
Parent Conferences
Most school systems try to promote parent-teacher communication, often in the form of par-
ent conferences. Typically conferences serve to inform parents of all aspects of their child’s
progress, so preparation for a parent conference should result in a file folder containing a
record of the child’s performance in all areas. This may include information on social and
behavioral development as well as academic progress.
TABLE 11.5  Example of a handout informing students of grading requirements

Test 1: 20%
Test 2: 20%
Test 3: 20%
Term paper: 10%
Final examination: 30%
Total: 100%

SPECIAL INTEREST TOPIC 11.5
The Family Educational Rights and Privacy Act (FERPA) (20 U.S.C. § 1232g; 34 CFR Part 99) is a
Federal law that protects the privacy of student education records. The law applies to all schools that
receive funds under an applicable program of the U.S. Department of Education.
FERPA gives parents certain rights with respect to their children’s education records. These
rights transfer to the student when he or she reaches the age of 18 or attends a school beyond the high
school level. Students to whom the rights have transferred are “eligible students.”
Parents or eligible students have the right to inspect and review the student’s education records
maintained by the school. Schools are not required to provide copies of records unless, for reasons
such as great distance, it is impossible for parents or eligible students to review the records. Schools
may charge a fee for copies.
Parents or eligible students have the right to request that a school correct records which they
believe to be inaccurate or misleading. If the school decides not to amend the record, the parent or
eligible student then has the right to a formal hearing. After the hearing, if the school still decides not
to amend the record, the parent or eligible student has the right to place a statement with the record
setting forth his or her view about the contested information.
Generally, schools must have written permission from the parent or eligible student in order
to release any information from a student’s education record. However, FERPA allows schools to
disclose those records, without consent, to the following parties or under the following conditions
(34 CFR § 99.31):
Schools may disclose, without consent, “directory” information such as a student’s name, ad-
dress, telephone number, date and place of birth, honors and awards, and dates of attendance. How-
ever, schools must tell parents and eligible students about directory information and allow parents
and eligible students a reasonable amount of time to request that the school not disclose directory
information about them. Schools must notify parents and eligible students annually of their rights
under FERPA. The actual means of notification (special letter, inclusion in a PTA bulletin, student
handbook, or newspaper article) is left to the discretion of each school.
For additional information or technical assistance, you may call (202) 260-3887 (voice). Indi-
viduals who use TDD may call the Federal Information Relay Service at 1-800-877-8339.
Or you may contact us at the following address:
Family Policy Compliance Office
U.S. Department of Education
400 Maryland Avenue, SW
Washington, D.C. 20202-5920
Problems related to educational privacy include grading done by parent or student volunteers, post-
ing grades or test scores, or releasing other evaluative information such as disciplinary records in a
manner accessible by nonauthorized persons. (Note that this applies only to materials that identify individual students; nothing in FERPA prohibits schools from reporting aggregated data.) In general,
all of these activities and many others that once were common practice are prohibited by FERPA.
Some authoritative sites we have reviewed that give guidance to educators on these issues include
the following:
www.nacada.ksu.edu/Resources/FERPA-Overview.htm
www.ed.gov/policy/gen/guid/fpco/index.html
www.aacrao.org/ferpa_guide/enhanced/main_frameset.html
www.sis.umd.edu/ferpa/ferpa_what_is.htm
Conferences should be conducted as confidential, professional sessions. The teacher should focus on the individual student and
avoid discussions of other students, teachers, or administrators. The teacher should present
samples of students’ work and other evidence of their performance as the central aspect of
the conference, explaining how each item fits into the grading system. If standardized test
results are relevant to the proceedings, the teacher should carefully review the tests and their
scoring procedures beforehand in order to present a summary of the results in language
clearly understandable to the parents. In subsequent chapters we will address the use and
interpretation of standardized tests in school settings.
Summary
In this chapter we focused on the issue of assigning grades based on the performance
of students on tests and other assessment procedures. We started by discussing some of
the different ways assessment procedures are used in the schools. Formative evaluation
involves providing feedback to students whereas summative evaluation involves making
evaluative judgments regarding their progress and achievement. We also discussed the ad-
vantages and disadvantages of assigning cumulative grades or marks. On the positive side,
grades generally represent a fair system for comparing students that minimizes irrelevant
characteristics such as gender or race. Additionally, because most people are familiar with
grades and their meaning, grades provide an effective and efficient means of providing in-
formation about student achievement. On the down side, a grade is only a brief summary of
a student’s performance and does not convey detailed information about specific strengths
and weaknesses. Additionally, although most people understand the general meaning of
grades, there is variability in what grades actually mean in different classes and schools.
Finally, student competition for grades may become more important than actual achieve-
ment, and students may have difficulty separating their personal worth from their grades,
both undesirable situations.
Next we discussed some of the more practical aspects of assigning grades. For ex-
ample, we recommended that grades be assigned solely on the basis of academic achieve-
ment. Other factors such as class behavior and attitude are certainly important, but when
combined with achievement in assigning grades they blur the meaning of grades. Another
important consideration is what frame of reference to use when assigning grades. Although
different frames of reference have been used and promoted, we recommend using either a
relative (i.e., norm-referenced) or an absolute (i.e., criterion-referenced) grading approach,
or some combination of the two.
We also provided a discussion with illustrations of how to combine grades into a
composite or cumulative grade. When assigning grades, teachers typically wish to take a
number of assessment procedures into consideration. Although this process may appear
fairly simple, it is often more complicated than first assumed. We demonstrated that when
forming composites it is necessary to equate scores by correcting for differences in the
variance or range of the scores. In addition to equating scores for differences in variability,
we also demonstrated how teachers can apply different weights to different assessment
procedures. For example, a teacher may want a test to count two or three times as much as a
homework assignment. We also provided examples of how these procedures can be applied
in the classroom with relative ease. In closing, we presented some suggestions on presenting
information about grades to students and parents.
CHAPTER 12

Standardized Achievement Tests in the Era of High-Stakes Assessment
In this and subsequent chapters we will be discussing a variety of standardized tests com-
monly used in the public schools. In this chapter we will focus on standardized achievement
tests. A standardized test is a test that is administered, scored, and interpreted in a standard
manner. Most standardized tests are developed by testing professionals or test publishing
companies. The goal of standardization is to ensure that testing conditions are as nearly the same as possible for all individuals taking the test. If this is accomplished, no examinee will have an advantage over another due to variance in administration procedures, and assessment results will be comparable. Standardized achievement tests typically share a number of desirable characteristics, including the following:
■ Standardized achievement tests typically contain high-quality items that were selected on the basis of both quantitative and qualitative item analysis procedures.
■ They have precisely stated directions for administration and scoring so that consistent procedures can be followed in different settings.
■ Many contemporary standardized achievement tests provide both norm-referenced and criterion-referenced interpretations. Norm-referenced interpretation allows comparison to the performance of other students, whereas criterion-referenced interpretation allows comparison to an established criterion.
■ The normative data are based on large, representative samples.
■ Equivalent or parallel forms of the test are often available.
■ They have professionally developed manuals and support materials that provide extensive information about the test; how to administer, score, and interpret it; and its measurement characteristics.
There are many different types of standardized achievement tests. Some achievement
tests are designed for group administration whereas others are for individual administration.
Individually administered achievement tests must be given to only one student at a time and
require specially trained examiners. Some achievement tests focus on a single subject area
(e.g., reading) whereas others cover a broad range of academic skills and content areas (e.g.,
reading, language, and mathematics). Some use selection-type items exclusively whereas others also include constructed-response items and performance tasks. Standardized achievement tests are used for a variety of purposes in the schools, including the following:
■ One of the most common uses is to track student achievement over time or to compare group achievement across classes, schools, or districts.
■ Achievement tests can help identify strengths and weaknesses of individual students.
■ Achievement tests can be used to evaluate the effectiveness of instructional programs or curricula and help teachers identify areas of concern.
■ A final major use of standardized achievement tests is the identification of students with special educational requirements. For example, achievement tests might be used in assessing children to determine whether they qualify for special education services.
The current trend is toward more, rather than less, standardized testing in public schools.
This trend is largely attributed to the increasing emphasis on educational accountability and
high-stakes testing programs. Popham (2000) notes that while there have always been critics
of public schools, calls for increased accountability became more strident and widespread
in the 1970s. During this period news reports began to surface publicizing incidences of
high school graduates being unable to demonstrate even the most
basic academic skills such as reading and writing. In 1983 the National Commission on Excellence in Education published A Nation at Risk: The Imperative for Educational Reform. This important report sounded an alarm that the United States was falling behind
other nations in terms of educating our children. Parents, who as
taxpayers were footing the bill for their children’s education, increasingly began to ques-
tion the quality of education being provided and to demand evidence that schools were
actually educating children. In efforts to assuage taxpayers, legislators started implement-
ing statewide minimum-competency testing programs intended to guarantee that graduates
of public schools were able to meet minimum academic standards. While many students
passed these exams, a substantial number of students failed, and the public schools and
teachers were largely blamed for the failures. In this era of increasing accountability, many
schools developed more sophisticated assessment programs that used both state-developed
tests and commercially produced nationally standardized achievement tests. As the trend
continued, it became common for local newspapers to rank schools according to their stu-
dents’ performance on these tests, with the implication that a school’s ranking reflected the
effectiveness or quality of teaching. Special Interest Topic 12.1 provides a brief description
of the National Assessment of Educational Progress (NAEP) that has been used for several
decades to monitor academic progress across the nation, as well as a sample of recent results
in 4th-grade mathematics.
Subsequent legislation and reports continued to focus attention on the quality of our
educational system, promoting increased levels of accountability, which translated into
more testing. In recent years, the No Child Left Behind Act of 2001 required that each state
develop high academic standards and implement annual assessments to monitor the perfor-
mance of states, districts, and schools. It requires that state assessments meet professional
standards for reliability and validity and that states achieve academic proficiency for all students within 12 years; under the act, states must test students annually in grades 3 through 8. As this text goes to print Congress is beginning to debate the reauthorization of NCLB. It is likely that there will be significant changes to the act in the coming years, but in our opinion it is likely that standardized achievement tests will continue to see extensive use in our public schools.
In the remainder of this chapter we will introduce a number
of standardized achievement tests. First we will provide brief descriptions of some major
group achievement tests and discuss their applications in schools. We will then briefly
describe a number of individual achievement tests that are commonly used in schools. The
goal of this chapter is to familiarize you with some of the prominent characteristics of these
tests and how they are used in schools.
Achievement tests can be classified as either individual or group tests. Individual tests are
administered in a one-to-one testing situation. One testing professional (i.e., the examiner)
administers the test to one individual (i.e., the examinee) at a time. In contrast, group-
administered tests are those that can be administered to more than one examinee at a time.
The main attraction of group administration is that it is an efficient way to collect information about students or other examinees. By efficient, we mean a large number of students can be assessed with a minimal time commitment from educational professionals. As you might expect, group-administered tests are very popular in school settings. For example, most teacher-constructed classroom tests are designed to be administered to the whole class at one time. Accordingly, if a school district wants to test
all the students in grades 3 through 8, it would probably be impos-
sible to administer a lengthy test to each student on a one-to-one basis. There is simply not
enough time or enough teachers (or other educational professionals) to accomplish such a
task without significantly detracting from the time devoted to instruction. However, when
you can have one professional administer a test to 20 to 30 students at a time, the task can be
accomplished in a reasonably efficient manner.
Although efficiency is the most prominent advantage of group-administered tests,
at least three other positive attributes of group testing warrant mentioning. First, because
the role of the individual administering the test is limited, group tests will typically involve
more uniform testing conditions than individual tests.
SPECIAL INTEREST TOPIC 12.1
The National Assessment of Educational Progress (NAEP), also referred to as the “Nation’s Report
Card,” is the only ongoing nationally administered assessment of academic achievement in the United
States. NAEP provides a comprehensive assessment of our students’ achievement at critical periods in
their academic experience (i.e., grades 4, 8, and 12). NAEP assesses performance in mathematics, sci-
ence, reading, writing, world geography, U.S. history, civics, and the arts. New assessments in world
history, economics, and foreign language are currently being developed. NAEP has been adminis-
tered regularly since 1969. It does not provide information on the performance of individual students
or schools, but presents aggregated data reflecting achievement in specific academic areas, instruc-
tional practices, and academic environments for broad samples of students and specific subgroups.
The NAEP has an excellent Web site that can be accessed at http://nces.ed.gov/nationsreportcard. Of
particular interest to teachers is the NAEP Questions Tool. This tool provides access to NAEP ques-
tions, student responses, and scoring guides that have been released to the public. This tool can be
accessed at http://nces.ed.gov/nationsreportcard/itmrls. The table below contains 4th-grade reading
scores for the states, Department of Defense Education Agency, and the District of Columbia.
Second, group tests frequently contain items that can be scored objectively, often even by a computer (e.g., selected-response
items). This reduces or eliminates the measurement error introduced by the qualitative scor-
ing procedures more common in individual tests. Finally, group tests often have very large
standardization or normative samples. Normative samples for professionally developed
group tests are often in the range of 100,000 to 200,000, whereas professionally developed
individually administered tests will usually have normative samples ranging from 1,000
to 8,000 participants (Anastasi & Urbina, 1997).
Naturally, group tests have some limitations. For example, in a group-testing situation
the individual administering the test has relatively little personal interaction with the indi-
vidual examinees. As a result, there is little opportunity for the examiner to develop rapport
with the examinees and closely monitor and observe their progress. Accordingly they have
limited opportunities to make qualitative behavioral observations about the performance of
their students and how they approach and respond to the assessment tasks. Another concern
involves the types of items typically included on group achievement tests. Whereas some
testing experts applaud group tests for often using objectively scored items, others criticize
them because these items restrict the type of responses examinees can provide. This parallels the arguments for and against selected-response items that we discussed in earlier chapters. Another limitation of group tests involves their lack of flexibility. For example,
when administering individual tests the examiner is usually able to select and administer
only those test items that match the examinee’s ability level. With group tests, however, all
examinees are typically administered all the items. As a result, examinees might find some
items too easy and others too difficult, resulting in boredom or frustration and lengthening
the actual testing time beyond what is necessary to assess the student’s knowledge accu-
rately (Anastasi & Urbina, 1997). It should be noted that publishers of major group achieve-
ment tests are taking steps to address these criticisms. For example, to allay concerns about
the extensive use of selected-response items, more standardized achievement tests are being
developed that incorporate a larger number of constructed-response items and performance
tasks. To address concerns about limited flexibility in administration, online and computer-
based assessments are becoming increasingly available.
In this section we will be discussing a number of standardized group achievement
tests. Many of these tests are developed by large test publishing companies and are com-
mercially available to all qualified buyers (e.g., legitimate educational institutions). In ad-
dition to these commercially available tests, many states have started developing their own
achievement tests that are specifically tailored to assess the state curriculum. These are often
standards-based assessments used in high-stakes testing programs. We will start by briefly
introducing some of the major commercially available achievement tests.
‘
Standardized Achievement Tests in the Era of High-Stakes Assessment 305
many school districts use standardized achievement tests to track student achievement over
time or to compare performance across classes, schools, or districts. These batteries typically
contain multiple subtests that assess achievement in specific curricular areas (e.g., reading,
language, mathematics, and science). These subtests are organized in
a series of test levels that span different grades. For example, a subtest might have four levels, with one level covering kindergarten through the 2nd grade, the second level covering grades 3 and 4, the third level covering grades 5 and 6, and the fourth level covering grades 7 and 8 (Nitko, 2001). The most widely used standardized group achievement tests are produced and distributed by three publishers: CTB McGraw-Hill, Harcourt Assessment, and Riverside Publishing.
California Achievement Tests, Fifth Edition (CAT/5). The CAT/5, designed for use with
students from kindergarten through grade 12, is described as a traditional achievement
battery. The CAT/5 assesses content in Reading, Spelling, Language, Mathematics, Study
Skills, Science, and Social Studies. It is available in different formats for different applica-
tions (e.g., Complete Battery, Basic Battery, Short Form). The CAT/5 can be paired with
the Tests of Cognitive Skills, Second Edition (TCS/2), a measure of academic aptitude, to
allow comparison of achievement—aptitude abilities (we will discuss the potential benefits
of making achievement—aptitude comparisons in the next chapter).
TerraNova CTBS. This is a revision of Comprehensive Tests of Basic Skills, Fourth Edi-
tion. The TerraNova CTBS, designed for use with students from kindergarten through
grade 12, was published in 1997. It combines selected-response and constructed-response
items that allow students to respond in a variety of formats. The TerraNova CTBS assesses
content in Reading/Language Arts, Mathematics, Science, and Social Studies. An expanded
version adds Word Analysis, Vocabulary, Language Mechanics, Spelling, and Mathematics
Computation. The TerraNova CTBS is available in different formats for different applica-
tions (e.g., Complete Battery, Complete Battery Plus, Basic Battery). The TerraNova CTBS
can be paired with the Tests of Cognitive Skills, Second Edition (TCS/2), a measure of
academic aptitude, to compare achievement—aptitude abilities.
TerraNova The Second Edition (CAT/6). TerraNova The Second Edition, or CAT/6, is de-
scribed as a comprehensive modular achievement battery designed for use with students from
kindergarten through grade 12 and contains year 2000 normative data. The CAT/6 assesses
content in Reading/Language Arts, Mathematics, Science, and Social Studies. An expanded
version adds Word Analysis, Vocabulary, Language Mechanics, Spelling, and Mathematics
Computation. It is available in different formats for different applications (e.g., CAT Multiple
Assessments, CAT Basic Multiple Assessment, CAT Plus). The CAT/6 can be paired with
InView, a measure of cognitive abilities, to compare achievement—aptitude abilities.
Stanford Achievement Test Series, Tenth Edition (Stanford 10). The Stanford 10 can be
used with students from kindergarten through grade 12 and has year 2002 normative data.
It assesses content in Reading, Mathematics, Language, Spelling, Listening, Science, and
Social Science. The Stanford 10 is available in a variety of forms, including abbreviated
and complete batteries. The Stanford 10 can be administered with the Otis-Lennon School
Ability Test, Eighth Edition (OLSAT-8). Also available from Harcourt Assessment, Inc. are
the Stanford Diagnostic Mathematics Test, Fourth Edition (SDMT 4) and the Stanford Di-
agnostic Reading Test, Fourth Edition (SDRT 4), which provide detailed information about
the specific strengths and weaknesses of students in mathematics and reading.
Riverside Publishing. Riverside Publishing produces three major achievement tests: the
Iowa Tests of Basic Skills (ITBS), Iowa Tests of Educational Development (ITED), and
Tests of Achievement and Proficiency (TAP).
Iowa Tests of Basic Skills (ITBS). The ITBS is designed for use with students from kin-
dergarten through grade 8 and, as the name suggests, is designed to provide a thorough as-
sessment of basic academic skills. The most current ITBS form was published in 2001. The
ITBS assesses content in Reading, Language Arts, Mathematics, Science, Social Studies,
and Sources of Information. The ITBS is available in different formats for different applica-
tions (e.g., Complete Battery, Core Battery, Survey Battery). The ITBS can be paired with
the Cognitive Abilities Test (CogAT), Form 6, a measure of general and specific cognitive
skills, to allow comparison of achievement—aptitude abilities. Figures 12.1, 12.2, and 12.3
provide sample score reports for the ITBS and other tests Riverside Publishing publishes.
Iowa Tests of Educational Development (ITED). The ITED, designed for use with stu-
dents from grades 9 through 12, was published in 2001 to measure the long-term goals
of secondary education. The ITED assesses content in Vocabulary, Reading Compre-
hension, Language: Revising Written Materials, Spelling, Mathematics: Concepts and
Problem Solving, Computation, Analysis of Science Materials, Analysis of Social Stud-
ies Materials, and Sources of Information. The ITED is available as both a complete
battery and a core battery. The ITED can be paired with the Cognitive Abilities Test
(CogAT), Form 6, a measure of general and specific cognitive skills, to allow comparison
of achievement—aptitude abilities.
Tests of Achievement and Proficiency (TAP). The TAP, designed for use with stu-
dents from grades 9 through 12, was published in 1996 to measure skills necessary for
growth in secondary school. The TAP assesses content and skills in Vocabulary, Reading
Comprehension, Written Expression, Mathematics Concepts and Problem Solving, Math
Computation, Science, Social Studies, and Information Processing. It also contains an
[Figure 12.1, a sample ITBS Performance Profile score report, appears here. Its on-report annotations explain that the display of NPRs to the right of the scores allows a quick overview of the student's performance on each test relative to the other tests; the shaded area is the margin of error for the composite score, and the horizontal bands are margins of error for the individual tests and totals. Bands completely outside the shaded area indicate scores that are truly different from the composite, and non-overlapping bands indicate scores that are truly different from each other. The lower part of the report lists, for each skill, the number of items, the number attempted, the percent correct for the student, and the percent correct for the nation, again with margin-of-error bands. Legend: SS = Standard Score, NS = National Stanine, NPR = National Percentile Rank, PC = Percent Correct, GE = Grade Equivalent, NCE = Normal Curve Equivalent, No. Att. = Number Attempted.]
FIGURE 12.1 Performance Profile for Iowa Tests of Basic Skills (ITBS). This figure
illustrates a Performance Profile for the Iowa Tests of Basic Skills (ITBS). It is one of the
score report formats that Riverside Publishing provides for the ITBS. The display in the upper-
left portion of the report provides numerical scores for the individual tests, totals, and overall
composite. The National Percentile Rank is also displayed using confidence bands immediately
to the right of the numerical scores. The display in the lower portion of the report provides
detailed information about the specific skills measured in each test. Riverside Publishing
provides an Interpretive Guide for Teachers and Counselors that provides detailed guidance for
interpreting the different score reports.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001 The River-
side Publishing Company. All rights reserved.
optional Student Questionnaire that solicits information about attitudes toward school, ex-
tracurricular interests, and long-term educational and career plans. The TAP is available
as both a complete battery and a survey battery. The TAP can be paired with the Cognitive
Abilities Test (CogAT), Form 5, a measure of general and specific cognitive skills, to allow
comparison of achievement—aptitude abilities.
[Figure 12.2, a sample Profile Narrative report for the ITBS, appears here. The sample narrative, written for a hypothetical fifth-grade student tested in spring 2001, explains that the composite score best describes her overall achievement, that her composite national percentile rank (NPR) of 54 means she scored higher than 54 percent of fifth-grade students nationally, and that her Vocabulary and Reading Comprehension scores are about average for fifth grade. It also notes that a student's scores can be compared with each other to determine relative strengths and weaknesses.]
FIGURE 12.2 Profile Narrative for Iowa Tests of Basic Skills (ITBS). This figure illustrates
the Profile Narrative report format available for the Iowa Tests of Basic Skills (ITBS). Although
this format does not provide detailed information about the skills assessed in each test, as in the
Performance Profile shown in Figure 12.1, it does provide an easy-to-understand discussion of
the student’s performance. This format describes the student’s performance on the composite
score (reflecting the student’s overall level of achievement) and the two reading tests (i.e.,
Vocabulary and Reading Comprehension). This report also identifies the student’s relative
strengths and areas that might need attention. This report illustrates the reporting of both state
and national percentile ranks.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001 The River-
side Publishing Company. All rights reserved.
FIGURE 12.3 Score Labels for the Iowa Tests and CogAT. This figure presents student score
labels for the Iowa Tests of Basic Skills (ITBS), Iowa Tests of Educational Development (ITED),
and Cognitive Abilities Test (CogAT). The CogAT is an ability test discussed in the next chapter.
These labels are intended for use in the students’ cumulative records and allow educators to track
student growth over time.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001 The River-
side Publishing Company. All rights reserved.
Riverside Publishing offers the Performance Assessments for ITBS, ITED, and TAP. These
are norm-referenced open-ended assessments in Integrated Language Arts, Mathematics,
Science, and Social Studies. These free-response assessments give students the opportunity
to demonstrate content-specific knowledge and higher-order cognitive skills in a more life-
like context. Other publishers have similar products available that supplement their survey
batteries.
Diagnostic Achievement Tests. The most widely used achievement tests have been
the broad survey batteries designed to assess the student’s level of achievement in broad
academic areas. Although these batteries do a good job in this context, they typically have
too few items that measure specific skills and learning objectives to be useful to teachers
when making instructional decisions. For example, the test results might suggest that a
particular student’s performance is low in mathematics, but the results will not pinpoint the
student’s specific strengths and weaknesses. To address this limitation, many test publishers
have developed diagnostic achievement tests. These diagnostic batteries contain a larger
number of items linked to each specific learning objective. In this way they can provide
more precise information about which academic skills have been achieved and which have
not. Examples of group-administered diagnostic achievement tests include the Stanford Di-
agnostic Reading Test, Fourth Edition (SDRT 4) and the Stanford Diagnostic Mathematics
Test, Fourth Edition (SDMT 4), both published by Harcourt Assessment, Inc. Most other
publishers have similar diagnostic tests to complement their survey batteries.
Obviously these have been very brief descriptions of these major test batteries. These
summaries were based on information current at the time of this writing. However, these tests
are continuously being revised to reflect curricular changes and to update normative data.
For the most current information, interested readers should access the Internet sites for the
publishing companies (see Table 12.1) or refer to the current edition of the Mental Measure-
ments Yearbook or other reference resources. See Special Interest Topic 12.2 for information
on these resources.
When you want to locate information on a standardized test, it is reasonable to begin by exam-
ining information provided by the test publishers. This can include their Internet sites, catalogs,
test manuals, specimen test sets, score reports, and other supporting documentation. However, you
should also seek out resources that provide independent evaluations and reviews of the tests you
are researching. The Testing Office of the American Psychological Association Science Directorate
(American Psychological Association, 2008) provides the following description of the four most
popular resources:
■ Mental Measurements Yearbook (MMY). MMY, published by the Buros Institute of Mental
Measurements, lists tests alphabetically by title. Each listing provides basic descriptive in-
formation about the test (e.g., author, date of publication) plus information about the avail-
ability of technical information and scoring and reporting services. Most listings also include
one or more critical reviews by qualified assessment experts.
■ Tests in Print (TIP). TIP, also published by the Buros Institute of Mental Measurements, is a
bibliographic encyclopedia of information on practically every published test in psychology
and education. Each listing provides basic descriptive information on tests, but does not contain
critical reviews or psychometric information. After locating a test that meets your criteria, you
can turn to the Mental Measurements Yearbook for more detailed information on the test.
■ Test Critiques. Test Critiques, published by Pro-Ed, Inc., contains a three-part listing for
each test that includes Introduction, Practical Applications/Uses, and Technical Aspects,
followed by a critical review of the test.
■ Tests. Tests, also published by Pro-Ed, Inc., is a bibliographic encyclopedia covering thou-
sands of assessments in psychology and education. It provides basic descriptive information
on tests, but does not contain critical reviews or information on reliability, validity, or other
technical aspects of the tests. It serves as a companion to Test Critiques.
These resources can be located in the reference section of most college and larger public libraries.
In addition to these traditional references, Test Reviews Online is a Web-based service of the Buros
Institute of Mental Measurements (www.unl.edu/buros). This service makes test reviews available
online to individuals precisely as they appear in the Mental Measurements Yearbook. For a relatively
small fee (currently $15), users can download information on any of over 2,000 tests.
allows one to compare a student’s performance to that of students across the nation, not
only students from one’s state or school district. For example, one could find that Johnny’s
reading performance was at the 70th percentile relative to a national normative group. Using
these commercial tests it is also possible to compare state or local groups (e.g., a district,
school, or class) to a national sample. For example, one might find that a school district’s
mean 4th-grade reading score was at the 55th percentile based on national normative data.
These comparisons can provide useful information to school administrators, parents, and
other stakeholders.
All states have developed educational standards that specify the academic knowl-
edge and skills their students are expected to achieve (see Special Interest Topic 12.3 for a discussion of standards-based assessments).
Standards-Based Assessments
AERA et al. (1999) defines standards-based assessments as tests that are designed to measure clearly
defined content and performance standards. In this context, content standards are statements that
specify what students are expected to achieve in a given subject matter at a specific grade (e.g.,
Mathematics, Grade 5). In other words, content standards specify the skills and knowledge we want
our students to master. Performance standards specify a level of performance, typically in the form
of a cut score or a range of scores that indicates achievement levels. That is, performance standards
specify what constitutes acceptable performance. National and state educational standards have been
developed and can be easily accessed via the Internet. Below are a few examples of state educational
Internet sites that specify the state standards.
Education World provides a Web site that allows you to easily access state and national standards.
The site for state standards is www.education-world.com/standards/state/index.shtml.
The site for national standards is www.education-world.com/standards/national.
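To make the distinction between content and performance standards described above concrete, here is a minimal sketch in Python. The cut scores and level names are invented for illustration, not any state's actual performance standards.

# Hypothetical cut scores (invented for illustration). Content standards define
# WHAT is tested; performance standards define what score counts as which
# achievement level.
CUT_SCORES = [            # (minimum scale score, achievement level)
    (2400, "Commended Performance"),
    (2100, "Met Standard"),
    (0, "Did Not Meet Standard"),
]

def achievement_level(scale_score):
    """Return the achievement level implied by the hypothetical cut scores."""
    for minimum, level in CUT_SCORES:
        if scale_score >= minimum:
            return level

print(achievement_level(2250))   # Met Standard
print(achievement_level(1980))   # Did Not Meet Standard

The cut scores themselves are the performance standard; changing them changes who "passes" without changing the content standard being measured.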
(grades 5, 10, and 11), and social studies (grades 8, 10, and 11). There is a Spanish TAKS
that is administered in grades 3 through 6. The decision to promote a student to the next
grade may be based on passing the reading and math sections, and successful completion
of the TAKS at grade 11 is required for students to receive a high school diploma. The
statewide assessment program contains two additional tests. There is a Reading Proficiency
Test in English (RPTE) that is administered to limited English proficient students to assess
annual growth in reading proficiency. Finally there is the State-Developed Alternative As-
sessment (SDAA) that can be used with special education students when it is determined
that the standard TAKS is inappropriate. All of these tests are designed to measure the
educational objectives specified in the state curriculum, the Texas Essential Knowledge and
Skills curriculum (TEKS) (see www.tea.state.tx.us).
Some states have developed hybrid assessment strategies to assess student performance
and meet accountability requirements. For example, some states use a combination of state-
developed tests and commercial off-the-shelf tests, using different tests at different grade
levels. Another approach, commonly referred to as augmented testing, involves the use of a
commercial test that is administered along with test sections that address any misalignment
State  State-Developed Test (Criterion-Referenced)  Augmented/Hybrid Test  Off-the-Shelf Test (Norm-Referenced)
Massachusetts Yes No No
Michigan Yes No Yes
Minnesota Yes No No
Mississippi Yes No No
Missouri No Yes No
Montana Yes No Yes
Nebraska Yes No No
Nevada Yes No Yes
New Hampshire Yes No No
New Jersey Yes No No
New Mexico Yes No Yes
New York Yes No No
North Carolina Yes No No
North Dakota Yes No No
Ohio Yes No No
Oklahoma Yes No No
Oregon Yes No No
Pennsylvania Yes No No
Rhode Island Yes Yes No
South Carolina Yes No No
South Dakota No Yes Yes
Tennessee Yes No No
Texas Yes No No
Utah Yes No Yes
Vermont Yes No No
Virginia Yes No No
Washington Yes No No
West Virginia Yes No Yes
Wisconsin No Yes No
Wyoming Yes No No
Totals 45 10 18
Note: Data provided by Education Week, accessed August 9, 2007, at www.edcounts.org/createtable/step1.php?clear=1. State Developed Test (Criterion-Referenced): defined as tests that are custom-made to correspond to
state content standards. Augmented/Hybrid Test: defined as tests that incorporate aspects of both commercially
developed norm-referenced and state-developed criterion-referenced tests (includes commercial tests augmented
or customized to match state standards). Off-the-Shelf Test (Norm-Referenced): defined as commercially devel-
oped norm-referenced tests that have not been modified to specifically reflect state standards.
between state standards and the content of the commercial test. Table 12.2 provides informa-
tion on the assessment strategies used as of 2007 in state assessment programs (Education
Week, 2007). A review of this table reveals that the majority of states (i.e., 45) have state-
developed tests that are specifically designed to align with their standards. Only one state (i.e.,
Iowa) reported exclusively using an off-the-shelf test. It should be noted that any report of state
assessment practices is only a snapshot of an ever-changing picture. The best way to get infor-
mation on your state’s current assessment practices is to go to the Web
site of the state's board of education and verify the current status.
There is considerable controversy concerning statewide testing programs. Proponents of high-stakes testing programs see them as a way of increasing academic expectations and ensuring that all students are judged according to the same standards. They say these testing programs guarantee that students graduating from public schools have the skills necessary to be successful in life after high school. Critics of these testing programs argue that these tests emphasize rote learning and often neglect critical thinking, problem solving, and communication skills. To exacerbate the problem, critics feel that too much instructional time is spent preparing students for the tests instead of teaching the really important skills teachers would like to focus on. Additionally, they argue that these tests are culturally biased and are not fair to minority students (Doherty, 2002). For additional information on high-stakes testing programs, see Special
Interest Topics 12.4 and 12.5. This debate is likely to continue for
the foreseeable future, but in the meantime these tests will continue to play an important
role in public schools.
The American Educational Research Association (AERA) is a leading organization that studies
educational issues. The AERA (2000) presented a position statement regarding high-stakes testing
programs employed in many states and school districts. Its position is summarized in the following
points:
1. Important decisions should not be based on a single test score. Ideally, information from multiple sources should be taken into consideration when making high-stakes decisions. When tests are the basis of important decisions, students should be given multiple opportunities to take the test.
2. When students and teachers are going to be held responsible for new content or standards, they should be given adequate time and resources to prepare themselves before being tested.
3. Each test should be validated for each intended use. For example, if a test is going to be used for determining which students are promoted and for ranking schools based on educational effectiveness, both interpretations must be validated.
4. If there is the potential for adverse effects associated with a testing program, efforts should be made to make all involved parties aware of them.
5. There should be alignment between the assessments and the state content standards.
6. When specific cut scores are used to denote achievement levels, the purpose, meaning, and validity of these passing scores should be established.
7. Students who fail a high-stakes test should be given adequate opportunities to overcome any deficiencies.
8. Adequate consideration and accommodations should be given to students with language differences.
9. Adequate consideration and accommodations should be given to students with disabilities.
10. When districts, schools, or classes are to be compared, it is important to specify clearly which students are to be tested and which students are exempt, and to ensure that these guidelines are followed.
11. Test scores must be reliable.
12. There should be an ongoing evaluation of both the intended and unintended effects of any high-stakes testing program.
These guidelines may be useful when trying to evaluate the testing programs your state or
school employs. For more information, the full text of this position statement can be accessed at
www.aera.net/about/policy/stakes.htm.
at the end of the year. For this example, let’s assume that statewide testing begins in grade 3.
The results of the state test, student by student, are used to build a model of performance for
each student, for Ms. Jones, for Washington School, and for the East Bunslip school district.
One year’s data are inadequate to do more than simply mark the levels of performance of
each focus for achievement: student, teacher, school, and district.
W. James Popham (2000) provided three reasons why he feels standardized achievement tests should
not be used to evaluate educational effectiveness or quality:
1. There may be poor alignment between what is taught in schools and what is measured by
the tests. Obviously, if the test is not measuring what is being taught in schools, this will
undermine the validity of any interpretations regarding the quality of an education.
2. In an effort to maximize score variance, test publishers often delete items that are relatively
easy. Although this is a standard practice intended to enhance the measurement characteris-
tics of a test, it may have the unintended effect of deleting items that measure learning objec-
tives that teachers feel are most important and emphasize in their instruction. He reasons that
the items might be easy because the teachers focused on the objectives until practically all of
the students mastered them.
3. Standardized achievement tests may reflect more than what is taught in schools. Popham
notes that performance on standardized achievement tests reflects the students’ intellectual
ability, what is taught in school, and what they learn outside of school. As a result, to interpret
them as reflecting only what is taught in school is illogical, and it is inappropriate to use them
as a measure of educational quality.
The next year Ms. Jones’s previous students have been dispersed to several 4th-grade
classrooms. A few of her students move to different school districts, but most stay in East
Bunslip and most stay at Washington School. All of this information will be included in the
modeling of performance. Ms. Jones now has a new class of students who enter the value-
added assessment system. At the end of this year there is now data on each student who com-
pleted 4th grade, although some students may have been lost through attrition (e.g., missed
the testing, left the state). The Tennessee model includes a procedure that accounts for all of
these “errors.” The performance of the 4th-grade students can now be evaluated in terms of
their 3rd-grade performance and the effect of their previous teacher, Ms. Jones, and the effect
of their current teacher (assuming that teacher also taught last year and there was assessment
data for the class taught). In addition a school-level effect can be estimated. Thus, the value-
added system attempts to explain achievement performance for each level in the school system
by using information from each level. This is clearly a very complex undertaking for an entire
state’s data. As of 1997, Sanders et al. noted that over 4 million data points in the Tennessee
system were used to estimate effects for each student, teacher, school, and district.
The actual value-added component is not estimated as a gain, but as the difference in
performance from the expected performance based on the student’s previous performance,
current grade in school effect, sum of current and previous teacher effectiveness, and school
effectiveness. When three or more years’ data become available, longitudinal trend models can
be developed to predict the performance in each year for the various sources discussed.
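In rough schematic form (our own notation and a deliberate simplification, not the actual mixed-model specification used in the Tennessee system), the value-added estimate for student $i$ in year $t$ is the residual between observed and expected performance:

\[
\text{VA}_{it} = y_{it} - \hat{y}_{it}, \qquad
\hat{y}_{it} = f\!\left(y_{i,t-1}\right) + \gamma_{g(i,t)} + \theta_{T(i,t)} + \theta_{T(i,t-1)} + \delta_{S(i,t)},
\]

where $y_{it}$ is the observed score, $f(y_{i,t-1})$ is the expectation based on the student's previous performance, $\gamma_{g(i,t)}$ is the current grade-level effect, $\theta_{T(i,t)}$ and $\theta_{T(i,t-1)}$ are the estimated effects of the current and previous teachers, and $\delta_{S(i,t)}$ is the school effect.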
Preparing Students for the Test. Much has been written in recent years about the proper
procedures or practices for preparing students to take standardized achievement tests. As
we noted earlier, high-stakes testing programs are in place in every state, and these tests are
used to make important decisions such as which students graduate or get promoted, which
teachers receive raises, and which administrators retain their jobs. As you might imagine,
the pressure to ensure that students perform well on these tests has also increased. Legisla-
tors exert pressure on state education officials to increase student performance, who in turn
put pressure on local administrators, who in turn put pressure on teachers. An important
question is “What test preparation practices are legitimate and acceptable, and what prac-
tices are unethical or educationally contraindicated?” This is a more complicated question
than one might first imagine.
A popular phrase currently being used in both the popular media and professional ed-
ucational literature is teaching to the test. This phrase generally implies efforts by teachers
to prepare students to perform better on standardized achievement
tests. Many writers use “teaching to the test” in a derogatory man-
ner, referencing unethical or inappropriate preparation practices. Other writers use the phrase more broadly to reference any instruction designed to enhance performance on a test. As you will see, a wide range of test preparation practices can be applied. Some of
these practices are clearly appropriate whereas others are clearly inappropriate. As an
extreme example, consider a teacher who shares the exact items from a standardized test
that is to be administered to students. This practice is clearly a breach of test security and is
tantamount to cheating. It is unethical and educationally indefensible and most responsible
educators would not even consider such a practice. In fact, such a breach of test security
could be grounds for the dismissal of the teacher, revocation of license, and possible legal
charges (Kober, 2002).
Thankfully such flagrantly abusive practices are relatively rare, but they do occur.
However, the appropriateness of some of the more common methods of preparing stu-
dents for tests is less clear. With one notable exception, which we will describe next,
itis generally acceptedt De
é ETE
(@propriated You may recognize that this involves the issue of test validity. Standardized
achievement tests are meant to assess the academic achievement of students in specific
areas. If test preparation practices increase test scores without increasing the level of
achievement, the validity of the test is compromised. Consider the following examples
of various test preparation procedures.
Preparation Using Practice Forms of the Test. Many states and commercial test publishers
release earlier versions of their exams as practice tests. Because these are released as practice
tests, their use is not typically considered unethical. However, if these tests become the focus
of instruction at the expense of other teaching activities, this practice can be harmful. Re-
search suggests that direct instruction using practice tests may produce short-term increases
in test scores without commensurate increases in performance on other measures of the test
domain (Kober, 2002). Like instruction in generic test-taking skills, the limited use of practice
tests may help familiarize students with the format of the test. However, practice tests should
be used in a judicious manner to ensure that they do not become the focus of instruction.
homework assignments that resemble actual items on the test (Kober, 2002). If the writing
section of a test requires single-paragraph responses, teachers will restrict their writing
assignments to a single paragraph. If a test uses only multiple-choice items, the teachers
will limit their classroom tests to multiple-choice items. The key feature is that students
are given instruction exposing them only to the material as presented and measured on
the test. With this approach students will be limited in their ability to generalize acquired
skills and knowledge to novel situations (Popham, 1999). Test scores may increase, but
the students’ mastery of the underlying domain is limited. As a result, this practice should
be avoided.
Preparation Emphasizing Test Content. This practice is somewhat similar to the previous
one, but instead of providing extensive exposure to items resembling those on the test, the
goal is to emphasize the skills and content most likely to be included on the standardized
tests. Kober (2002) notes that this practice often has a “narrowing effect” on instruction.
Because many standardized achievement tests emphasize basic skills and knowledge that
can be easily measured with selected-response items, this practice may result in teachers
neglecting more complex learning objectives such as the analysis and synthesis of informa-
tion or development of complex problem-solving skills. While test scores may increase, the
students’ mastery of the underlying domain is restricted. This practice should be avoided.
Preparation Using Multiple Instructional Techniques. With this approach students are
given instruction that exposes them to the material as conceptualized and measured on the
test, but also presents the material in a variety of different formats. Instruction covers all
salient knowledge and skills in the curriculum and addresses both basic and higher-order
learning objectives (Kober, 2002). With this approach, increases in test scores are associ-
ated with increases in mastery of the underlying domain of skills and knowledge (Popham,
1999). As a result, this test preparation practice is recommended.
Although this list of test preparation practices is not exhaustive, we have tried to address the most common forms. In summary, only preparation that introduces generic test-taking skills and uses multiple instructional techniques can be recommended enthusiastically. Teaching generic test-taking skills makes students more fa-
miliar and comfortable with the assessment process, and as a result
enhances the validity of the assessment. The use of multiple instruc-
tional techniques results in enhanced test performance that reflects an increased mastery
of the content domain. As a result, neither of these practices compromises the validity of
the score interpretation as reflecting domain-specific knowledge. Other test preparation
practices generally fall short of this goal. For example, practice tests may be useful when
used cautiously, but they are often overused and become the focus of instruction with det-
rimental results. Any procedures that emphasize test-specific content or test-specific item
formats should be avoided because they may increase test scores without actually enhanc-
ing mastery of the underlying test domain.
Administering Standardized Tests. When introducing this chapter we noted that standard-
ized tests are professionally developed and must be administered and scored in a standard
manner. For standardized scores to be meaningful and useful, it is imperative to follow these
standard procedures precisely. These procedures are explicitly speci-
fied so that the tests can be administered in a uniform manner in differ-
ent settings. For example, it is obviously important for all students to receive the same instructions and the same time limits at each testing site in order for the results to be comparable. Teachers are often responsible for administering group achievement tests to their students and as a result should understand the basics of standardized test administration. The following guidelines, based on our own experience and a review of the literature (e.g., Kubiszyn & Borich, 2003; Linn & Gronlund, 2000; Popham, 1999, 2000), can help teachers administer standardized tests to their students.
Review the Test Administration Manual before the Day of the Test. Administering stan-
dardized tests is not an overly difficult process, but it is helpful to review the administration
instructions carefully before the day of the test. This way you will be familiar with the
procedures and there should be no surprises. This review will alert you to any devices (e.g.,
stopwatch) or supporting material (e.g., scratch paper) you may need during the adminis-
tration. It is also beneficial to do a mock administration by reading the instructions for the
test in private before administering it to the students. The more familiar you are with the
administration instructions, the better prepared you will be to administer the test. Addition-
ally, you will find the actual testing session to be less stressful.
Encourage the Students to Do Their Best. Standardized achievement tests (and most other
standardized tests used in schools) are maximum performance tests and ideally students will
put forth their best efforts. This is best achieved by explaining to the students how the test
results will be used to their benefit. For example, with achievement tests you might tell the
students that the results can help them and their parents track their academic progress and
identify any areas that need special attention. Although it is important to motivate students
to do their best, it is equally crucial to avoid unnecessarily raising their level of anxiety. For
example, you would probably not want to focus on the negative consequences of poor perfor-
mance immediately before administering the test. This presents a type of balancing act; you
want to encourage the students to do their best without making them excessively anxious.
Closely Follow Instructions. As we noted, the reliability and validity of the test results
are dependent on the individual administering the test closely following the administration
instructions. First, the instructions to students must be read word for word. Do not alter the
instructions in any way, paraphrase them, or try to improvise. It is likely that some students
will have questions, but you are limited in how you can respond. Most manuals indicate
that you can clarify procedural questions (e.g., where do I sign my name?), but you cannot
define words or in any other way provide hints to the answers.
Strictly Adhere to Time Limits. Bring a stopwatch and practice using it before the day of
the test.
Be Alert to Cheating. Although you do not want to hover over the students to the extent
that it makes them unnecessarily nervous, active surveillance is indicated and can help deter
cheating. Stay alert and monitor the room from a position that provides a clear view of the
entire room. Walk quietly around the room occasionally. If you note anything out of the
ordinary, increase your surveillance of those students. Document any unusual events that
might deserve further consideration or follow-up.
By following these suggestions you should have a productive and uneventful testing
session. Nevertheless, be prepared for unanticipated events to occur. Keep the instruction
manual close so you can refer to it if needed. It is also helpful to remember you can rely on
your professional educational training to guide you in case of unexpected events.
Interpreting Standardized Tests. Teachers are also often called on to interpret the re-
sults of standardized tests. This often involves interpreting test results for use in their own
classroom. This can include monitoring student gains in achievement, identifying individ-
ual strengths and weaknesses, evaluating class progress, and planning instruction. At other
times, teachers are called on to interpret the results to parents or even students. Although
report cards document each student’s performance in the class, the results of standardized
tests provide normative information regarding student progress in a
broader context (e.g., Linn & Gronlund, 2000).
The key factor in accurately interpreting the results of standardized tests is being familiar with the type of scores reported. In Chapter 3 we presented a review of the major types of test scores test publishers use. As we suggested in that chapter, when report-
ing test results to parents, it is usually best to use percentile ranks. As with all norm-
referenced scores, the percentile rank simply reflects an examinee’s performance relative
to the specific norm group. Percentile ranks are interpreted as indicating the percentage of
individuals scoring below a given point in a distribution. For example, a percentile rank of
75 indicates that 75% of the individuals in the standardization sample scored below this
score. A percentile rank of 30 indicates that only 30% of the individuals in the standardiza-
tion sample scored below this score. Percentile ranks range from 1 to 99, and a rank of 50
indicates median performance. When discussing results in terms of percentile rank, it is
helpful to ensure that they are not misinterpreted as “percent correct” (Kamphaus, 1993).
That is, a percentile rank of 80 means that the examinee scored better than 80% of the
standardization sample, not that he or she correctly answered 80% of the items. Although
most test publishers report grade equivalents, we recommend that you avoid interpreting
them to parents. In Chapter 3 we discussed many of the problems associated with the use
of these scores and why they should be avoided.
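As a simple numeric illustration (a made-up norm group, not data from any published test), the following Python sketch shows why a percentile rank is not the same thing as percent correct:

# Made-up norm group of 20 raw scores on a hypothetical 50-item test.
norm_group = [18, 22, 25, 27, 28, 30, 31, 33, 34, 35,
              36, 37, 38, 40, 41, 42, 44, 45, 47, 49]

def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm group scoring below the given raw score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return round(100 * below / len(norm_scores))

raw_score = 40                                   # 40 of 50 items correct, i.e., 80% correct
print(percentile_rank(raw_score, norm_group))    # prints 65: better than 65% of the norm group

Here a student who answers 80% of the items correctly has a percentile rank of only 65, because the rank depends entirely on how the rest of the norm group scored.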
Before leaving our discussion of the use of standardized achievement tests in
schools, it is appropriate to discuss some factors other than academic achievement that
may influence test performance. As we have emphasized in this textbook, it is extremely
important to select and use tests that produce reliable and valid scores. It is also important
to understand that even with the most psychometrically sound tests, factors other than
those we are attempting to measure may influence test performance. Achievement tests
are an attempt to measure students’ academic achievement in specific content areas. An
example of an extraneous factor that might influence performance on a standardized test
is the emotional state or mood of the student. If a student is emotionally upset the day of
a test, his or her performance will likely be impacted (see Special Interest Topic 12.6 for
A number of years ago when one of our colleagues was working with a private agency, a mother
and her young son (approximately 9 or 10 years of age) came in for their appointment. Although
he does not remember the specifics of the referral, the primary issue was that the child was
having difficulty at school and there was concern that he might have a learning disability. To
determine the basis of his school problems, he was scheduled to receive a battery of individual
standardized tests. On greeting them it was obvious that the child was upset. He sat quietly crying
in the waiting room with his head down. Our colleague asked the mother what was wrong and
she indicated his pet cat had died that morning. She was clearly sensitive to her son’s grief, but
was concerned that it would take months to get another appointment (this agency was typically
booked for months in advance). Our colleague explained to her that he was much too upset to
complete the assessment on this day and that any results would be invalid. To ensure that her son
received the help he needed in a timely manner, they were able to schedule an appointment in a
few weeks. Although teachers may not have this much discretion when scheduling or adminis-
tering standardized tests, they should be observant and sensitive to the effects of emotional state
on test performance.
a personal example). If you can see that a student is upset while taking a test, make a note
of this as it might be useful later in understanding his or her performance. Similarly,
a student’s level of motivation will also influence performance. Students who do not see
the test as important may demonstrate a lackadaisical approach to it. If you notice that a
student is not completing the test or is completing it in a haphazard manner, this should
also be documented.
As we noted, standardized achievement tests are also used in the identification, diagnosis,
and classification of students with special learning needs. Although some group-adminis-
tered achievement tests might be used in identifying children with
Although some group special needs, in many situations individually administered achieve-
achievement tests are used in ment tests are employed. For example, if a student is having learning
difficulties and parents or teachers are concerned about the possibil-
identifying students with special
ity of a learning disability, the student would likely be given a battery
needs, in many situations
of tests, one being an individual achievement test. A testing profes-
individually administered sional, with extensive training in psychometrics and test administra-
achievement tests are used. tion, administers these tests to one student at a time. Because the
tests are administered individually, they can contain a wider variety
of item formats. For example, the questions are often presented in different modalities, with
some questions presented orally and some in written format. Certain questions may require
oral responses whereas some require written responses. In assessing writing abilities, some
of these tests elicit short passages whereas others require fairly lengthy essays. Relative to
the group tests, individual achievement tests typically provide a more thorough assessment
of the student’s skills. Because they are administered in a one-to-one context, the examiner
can observe the student closely and hopefully gain insight into the source of learning prob-
lems. Additionally, because these tests are scored individually, they are more likely to incor-
porate open-ended item formats (e.g., essay items) requiring qualitative scoring procedures.
Although regular education teachers typically are not responsible for administering and
interpreting these tests, teachers often do attend special education or placement committee
meetings at which the results of these tests are discussed and used to make eligibility and
placement decisions. As a result, it is beneficial to have some familiarity with these tests. In
this section we will briefly introduce you to some of the most popular individual achieve-
ment tests used in the schools.
■ Reading Composite: composed of the Word Reading subtest (letter knowledge, phonological awareness, and decoding skills), Reading Comprehension subtest (comprehension of short passages, reading rate, and oral reading prosody), and Pseudoword Decoding (phonetic decoding skills).
■ Written Language Composite: composed of the Spelling subtest (ability to write dictated letters and words) and Written Language subtest (transcription, handwriting, written word fluency, generating and combining sentences, and an extended writing sample).
The WIAT-II produces a variety of derived scores, including standard scores and per-
centile ranks. The WIAT-II has excellent psychometric properties and documentation. Addi-
tionally, the WIAT-II has the distinct advantage of being statistically linked to the Wechsler
intelligence scales. Linkage with these popular intelligence tests facilitates the aptitude—
achievement discrepancy analyses often used to diagnose learning disabilities (this will
be discussed more in the next chapter on aptitude tests).
Woodcock-Johnson III Tests of Achievement (WJ III ACH; Woodcock, McGrew, &
Mather, 2001a). The WJ III ACH is a comprehensive individually administered norm-
referenced achievement test distributed by Riverside Publishing. The standard battery con-
tains the following cluster scores and subtests:
■ Oral Language: composed of the Story Recall subtest (ability to recall details of stories presented on an audiotape) and Understanding Directions subtest (ability to follow directions presented on an audiotape).
■ Math Calculation Skills: a math aggregate cluster composed of the Calculation and Math Fluency subtests.
subtest (ability to formulate and write simple sentences quickly), and Writing Samples sub-
test (ability to write passages varying in length, vocabulary, grammatical complexity, and
abstractness).
Other special-purpose clusters can be calculated using the 12 subtests in the stan-
dard battery. In addition, ten more subtests in an extended battery allow the calculation of
supplemental clusters. The WJ III ACH provides a variety of derived scores and has excel-
lent psychometric properties and documentation. A desirable feature of the WJ III ACH is
its availability in two parallel forms, which is an advantage when testing a student on more
than one occasion because the use of different forms can help reduce carryover effects. Ad-
ditionally, the WJ III ACH and the Woodcock-Johnson III Tests of Cognitive Abilities (WJ
III COG; Woodcock, McGrew, & Mather, 2001b) compose a comprehensive diagnostic
system, the Woodcock-Johnson III (WJ III; Woodcock, McGrew, & Mather, 2001c). When
administered together they facilitate the aptitude—achievement discrepancy analyses
often used to diagnose learning disabilities.
Wide Range Achievement Test 3 (WRAT3). The WRAT3 is a brief achievement test that
measures basic reading, spelling, and arithmetic skills. It contains the following subtests:
■ Reading: assesses ability to recognize and name letters and pronounce printed words
■ Spelling: assesses ability to write letters, names, and words that are presented orally
■ Arithmetic: assesses ability to recognize numbers, count, and perform written computations
The WRAT3 can be administered in 15 to 30 minutes and comes in two parallel forms.
Relative to the WIAT-II and WJ III ACH, the WRAT3 measures a limited number of skills.
However, when only a quick estimate of achievement in word recognition, spelling, and
math computation is needed, the WRAT3 can be a useful instrument.
The individual achievement batteries described to this point measure skills in multiple
academic areas. As with the group achievement tests, there are also individual tests that focus on specific skill domains. The following two tests are examples of individual achievement
tests that focus on specific skill areas.
Summary
In this chapter we focused on standardized achievement tests and their applications in the
schools. These tests are designed to be administered, scored, and interpreted in a standard
manner. The goal of standardization is to ensure that testing conditions are the same for all
individuals taking the test. If this is accomplished, no examinee will have an advantage over
another, and test results will be comparable. These tests have different applications in the schools, including tracking student achievement over time, comparing performance across classes, schools, and districts, identifying students with special learning needs, and making high-stakes decisions in statewide accountability programs.
Of these uses, high-stakes testing programs are probably the most controversial. These
programs use standardized achievement tests to make important decisions, such as which students will be promoted, and to evaluate educational professionals and schools. Proponents
of high-stakes testing programs see them as a way of improving public education and en-
suring that students are all judged according to the same standards. Critics of high-stakes
testing programs argue that they encourage teachers to focus on low-level academic skills
at the expense of higher-level skills such as problem solving and critical thinking.
We next described several of the most popular commercial group achievement tests.
The chapter included a discussion of the current trend toward increased high-stakes assess-
ments in the public schools and how this is being implemented by states using a combina-
tion of commercial and state-developed assessments. We introduced a potentially useful
approach for assessing and monitoring student achievement that is referred to as value-
added assessment.
We also provided some guidelines to help teachers prepare their students for these
tests. We noted that any test preparation procedure that raises test scores without also increas-
ing the mastery of the underlying knowledge and skills is inappropriate. After evaluating
different test preparation practices, we concluded that preparation that introduces generic
test-taking skills and uses multiple instructional techniques can be recommended. These
practices should result in improved performance on standardized tests that reflects increased
mastery of the underlying content domains. Preparation practices that emphasize the use
of practice tests or focus on test-specific content or test-specific item formats should be
avoided because they may increase test scores, but may not increase mastery of the under-
lying test domain. We also provided some suggestions for teachers to help administer and
interpret test results:
Review the test administration manual before the day of the test.
Encourage students to do their best on the test.
Closely follow administration instructions.
Strictly adhere to time limits.
Avoid interruptions.
Be alert to cheating.
Be familiar with the types of derived scores produced by the test.
KEY TERMS

Achievement test, p. 300
Appropriate preparation practices, p. 319
Diagnostic achievement tests, p. 310
Group-administered tests, p. 302
Inappropriate preparation practices, p. 318
Individually administered tests, p. 304
Standardized scores, p. 321
Standardized test, p. 300
Standardized test administration, p. 321
Statewide testing programs, p. 310
Teaching to the test, p. 318
Test preparation practices, p. 318
Value-added assessment, p. 315
RECOMMENDED READINGS

The following articles provide interesting commentaries on issues related to the use of standardized achievement tests in the schools:

Boston, C. (2001). The debate over national testing. ERIC Digest, ERIC-RIEO. (20010401).
Doherty, K. M. (2002). Education issues: Assessment. Education Week on the Web. Retrieved May 14, 2003, from www.edweek.org/context/topics/issuespage.cfm?id=41
Kober, N. (2002). Teaching to the test: The good, the bad, and who's responsible. Test Talk for Leaders (Issue 1). Washington, DC: Center on Education Policy.
CHAPTER 13
The Use of Aptitude Tests in the Schools

Conventional intelligence tests and even the entire concept of intelligence testing are perennially the focus of considerable controversy and strong emotion.
—Reynolds & Kaufman, 1990
LEARNING OBJECTIVES
After reading and studying this chapter, students should be able to:
1. Compare and contrast the constructs of achievement and aptitude.
2. Explain how achievement and aptitude can be conceptualized as different aspects of a continuum. Provide examples to illustrate this continuum.
3. Discuss the major milestones in the history of intelligence assessment.
4. Describe the major uses of aptitude and intelligence tests in schools.
5. Explain the rationale for the analysis of aptitude-achievement discrepancies.
6. Explain the response to intervention (RTI) process and its current status.
7. Describe and evaluate the major group aptitude/intelligence tests.
8. Describe and evaluate the major individual achievement tests.
9. Evaluate and select aptitude/intelligence tests that are appropriate for different applications.
10. Understand a report of the intellectual assessment of a school-aged child.
11. Identify the major college admission tests and describe their use.
In Chapter 1, when describing maximum performance tests, we noted that they are often classified as either achievement tests or aptitude tests. (In some professional sources the term aptitude is being replaced with ability. For historical purposes we will use aptitude to designate this type of test in this chapter, but we do want to alert readers to this variability in terminology.) We defined achievement tests as those designed to assess students' knowledge or skills in a content domain in which they have received instruction (AERA et al.). Aptitude tests, in contrast, are designed to measure the cognitive skills, abilities, and knowledge that individuals have accumulated as the result of their overall life experiences.

These introductory comments might lead you to believe there is a clear and universally accepted distinction between achievement and aptitude tests. However, in actual practice this is not the case and the distinction is actually a matter of degree. Both achievement and aptitude tests measure developed abilities and can be arranged along a continuum according to how dependent the abilities are on direct school experiences. Many, if not most, testing experts conceptualize both achievement and aptitude tests as tests of developed cognitive abilities that can be ordered along a continuum in terms of how closely linked the assessed abilities are to specific learning experiences. This continuum is illustrated in Figure 13.1. At one
instruction provided in a specific classroom or course. For example, a classroom mathematics
test should assess specifically the learning objectives covered in the class during a specific
instructional period. This is an example of a test that is linked clearly and directly to specific
academic experiences (i.e., the result of curriculum and instruction). Next along the continuum
are the survey achievement batteries that measure a fairly broad range of knowledge, skills,
and abilities. Although there should be alignment between the learning objectives measured
by these tests and the academic curriculum, the scope of a survey battery is considerably
broader and more comprehensive than that of a teacher-constructed classroom test. The group-
administered survey batteries described in the previous chapter are dependent on direct school
experiences, but there is variability in how direct the linkage is. For example, the achievement
tests developed by states to specifically assess the state’s core curriculum are more directly
linked to instruction through the state’s specified curriculum than the commercially developed
achievement tests that assess a more generic curriculum.
Next are intelligence and other aptitude tests that emphasize verbal, quantitative, and
visual-spatial abilities. Many traditional intelligence tests can be placed in this category,
and even though they are not linked to a specific academic curriculum, they do assess many
skills that are commonly associated with academic success. The Otis-Lennon School Abil-
ity Test (OLSAT); Stanford-Binet Intelligence Scales—Fifth Edition; Tests of Cognitive
Skills, Second Edition (TCS/2); Wechsler Intelligence Scale for Children—Fourth Edition
(WISC-IV); and Reynolds Intellectual Assessment Scales (RIAS) are all examples of tests
that fit in this category (some of these will be discussed later in this chapter). In developing
these tests, the authors attempt to measure abilities that are acquired through common, everyday experiences, not only those acquired through formal educational experiences. For
example, a quantitative section of one of these tests will typically emphasize mental com-
putations and quantitative reasoning as opposed to the developed mathematics skills tradi-
tionally emphasized on achievement tests. Novel problem-solving skills are emphasized on
many portions of these tests as well. Modern intelligence tests are not just measures of
knowledge or how much you know, but also how well you think.
Finally, at the most “general” end of the continuum are the nonverbal and cross-
cultural intelligence or aptitude tests. These instruments attempt to minimize the influence
of language, culture, and educational experiences. They typically emphasize the use of non-
verbal performance items and often completely avoid language-based content (e.g., reading,
writing, etc.). The Naglieri Nonverbal Ability Test—Multilevel Form (NNAT—Multilevel
Form) is an example of a test that belongs in this category. The NNAT—Multilevel Form is
a group-administered test of nonverbal reasoning and problem solving that is thought to be
relatively independent of educational experiences, language, and cultural background (how-
ever, no test is truly culture-free). The NNAT—Multilevel Form (like many nonverbal IQ
tests) employs “progressive matrices”—items in which the test taker must find the missing
pattern in a series of designs or figures. The matrices in the NNAT—Multilevel Form are
arranged in order of difficulty and contain designs and shapes that are not linked to any spe-
cific culture. Promoters of the test suggest that this test may be particularly useful for students
with limited English proficiency, minorities, or those with hearing impairments.
Some aptitude tests are designed to measure aptitudes in specific areas such as mechanical or clerical ability. Subsequently, test developers created multiple-aptitude batteries to measure a number of distinct abilities.
General intelligence tests historically have been the most popular and widely used aptitude
tests in school settings. While practically everyone is familiar with the concept of intelli-
gence and uses the term in everyday conversations, it is not easy to develop a definition of
intelligence on which everyone agrees. In fact, the concept of intelligence probably has
generated more controversy than any other topic in the area of tests and measurement (see
Special Interest Topics 13.1 and 13.2). Although practically all educators, psychologists, and psychometricians have their own personal definition of intelligence, most of these definitions will incorporate abilities such as problem solving, abstract reasoning, and the ability to acquire knowledge (e.g., Gray, 1999). Developing a consensus beyond this point is more difficult. For our present purpose, instead of pursuing a philosophical discussion of the meaning of intelligence, we will focus only on intelligence as measured by contemporary intelligence tests. These tests typically produce an overall score referred to as an intelligence quotient or IQ.

Intelligence tests had their beginning in the schools. In the early
1900s, France initiated a compulsory education program. Recognizing
that not all children had the cognitive abilities necessary to benefit
from regular education classes, the minister of education wanted to develop special educa-
tional programs to meet the particular needs of these children. To accomplish this, he needed
a way of identifying children who needed special services. Alfred Binet and his colleague
Theodore Simon had been attempting to develop a measure of intelligence for some years,
and the French government commissioned them to develop a test that could predict academic
performance accurately. The result of their efforts was the first Binet-Simon Scale, released
in 1905. This test contained problems arranged in the order of their difficulty and assessing a
wide range of abilities. The test contained some sensory-perceptual tests, but the emphasis was
on verbal items assessing comprehension, reasoning, and judgment.
Subsequent revisions of the Binet-Simon Scale were released in 1908 and 1911. These scales gained wide acceptance in France and were soon translated and standardized in the United States, most successfully by Louis Terman at Stanford University. This resulted in the Stanford-Binet Intelligence Test, which has been revised numerous times (the fifth revision remains in use today). Ironically, Terman's version of the Binet-Simon Scale became even more popular in France and other parts of Europe than the Binet-Simon Scale!
The development and success of the Binet-Simon Scale, and subsequently the Stanford-
Binet Intelligence Test, ushered in the era of widespread intelligence testing in the United
States. Following Terman’s lead, other assessment experts developed and released their own
intelligence tests. Some of the tests were designed for individual administration (like the
Stanford-Binet Intelligence Test) whereas others were designed for group administration.
SPECIAL INTEREST TOPIC 13.1
A task force established by the American Psychological Association produced a report titled “Intel-
ligence: Knowns and Unknowns” (Neisser et al., 1996). Its authors summarize the state of knowl-
edge about intelligence and conclude by identifying seven critical questions about intelligence that
have yet to be answered. These issues are summarized here and remain unconquered.
In a field where so many issues are unresolved and so many questions unanswered, the confident tone
that has characterized most of the debate on these topics is clearly out of place. The study of intel-
ligence does not need politicized assertions and recriminations; it needs self-restraint, reflection, and
a great deal more research. The questions that remain are socially as well as scientifically important.
There is no reason to think them unanswerable, but finding the answers will require a shared and
sustained effort as well as the commitment of substantial scientific resources. Just such a commit-
ment is what we strongly recommend. (p. 97)
Some of these tests placed more emphasis on verbal and quantitative abilities whereas others focused more on visual-spatial and abstract problem-solving abilities. As a general rule, research has shown with considerable consistency that contemporary intelligence tests are good predictors of academic success. This is to be expected considering this was the precise purpose for which they were initially developed over 100 years ago. In addition to being good predictors of school performance, research has shown that IQs are fairly stable over time. Nevertheless, these tests have become controversial themselves as a result of the often emotional debate over the meaning of intelligence.
SPECIAL INTEREST TOPIC 13.2
Although IQ tests had their origin in the schools, they have been the source of considerable contro-
versy essentially since their introduction. Opponents of IQ tests often argue IQ tests should be
banned from schools altogether whereas proponents can hardly envision the schools without them.
Many enduring issues contribute to this controversy, and we will mention only the most prominent
ones. These include the following.
Can IQ Be Increased?
Given the importance society places on intelligence and a desire to help children excel, it is reason-
able to ask how much IQ can be improved. Hereditarians, those who see genetics as playing the
primary role in influencing IQ, hold that efforts to improve it are doomed to failure. In contrast,
environmentalists, who see environmental influences as primary, see IQ as being highly malleable.
So who is right? In summary, the research suggests that IQ can be improved to some degree, but the
improvement is rather limited. For example, adoption studies indicate that lasting gains of approxi-
mately 10 to 12 IQ points are the most that can be accomplished through even the most pervasive
environmental interventions. The results of preschool intervention programs such as Head Start are
much less impressive. These programs may result in modest increases in IQ, but even these gains are
typically lost in a few years (Kranzler, 1997). These programs do have other benefits to children,
however, and should not be judged only on their impact on IQ.
A contemporary debate involves the use of IQ tests in the identification of students with learning
disabilities. Historically the diagnosis of learning disabilities has been based on a discrepancy
model in which students’ level of achievement is compared to their overall level of intelligence. If
students’ achievement in reading, mathematics, or some other specific achievement area is signifi-
cantly below that expected based on their IQ, they may be diagnosed as having a learning disability
(actually the diagnosis of learning disabilities is more complicated than this, but this explanation is
sufficient in this context). Currently some researchers are presenting arguments that IQs need not
play a role in the diagnosis of learning disabilities and are calling for dropping the use of a discrep-
ancy model, and the 2004 federal law governing special education eligibility (the Individuals with
Disabilities Education Act of 2004) no longer requires such a discrepancy, but does allow its use in
diagnosing disabilities.
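To make the logic of the discrepancy model concrete, here is a minimal sketch in Python (not part of the original text) of a simple-difference comparison between two standard scores. The 22.5-point (1.5 standard deviation) cutoff, the function name, and the example scores are illustrative assumptions only; actual eligibility criteria vary across states and rely on test-specific procedures and professional judgment.

# Minimal sketch of a "simple difference" aptitude-achievement comparison.
# Assumes both scores are standard scores with a mean of 100 and an SD of 15.
# The 1.5 SD (22.5-point) cutoff is an assumed example, not a prescribed criterion.

def simple_difference_discrepancy(iq, achievement, cutoff=22.5):
    """Return True if achievement falls below IQ by at least `cutoff` points."""
    return (iq - achievement) >= cutoff

if __name__ == "__main__":
    # An IQ of 110 with a reading standard score of 82 yields a 28-point gap.
    print(simple_difference_discrepancy(110, 82))  # True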
So what does the future hold for IQ testing in the schools? We believe that when used ap-
propriately IQ tests can make a significant contribution to the education of students. Braden (1997)
noted that
eliminating IQ is different from eliminating intelligence. We can slay the messenger, but the message
that children differ in their learning rate, efficiency, and ability to generalize knowledge to new situ-
ations (despite similar instruction) remains. (p. 244)
At the same time we recognize that on occasion IQ tests (and other tests) have been used in inap-
propriate ways that are harmful to students. The key is to be an informed user of assessment results.
To this end a professional educator should have a good understanding of the topics covered in this
text, including basic psychometric principles and the ethical use of test results.
To try and avoid this association
and possible misinterpretations, many test publishers have adopted more neutral names such
as academic potential, scholastic ability, school ability, mental ability, and simply ability to
designate essentially the same construct.
As you can see from the previous discussion, aptitude and intelligence tests have a long
history of use in the schools. Their widespread use continues to this day, with major applica-
tions including
■ Providing alternative measures of cognitive abilities that reflect information not captured by standard achievement tests or school grades
■ Helping teachers tailor instruction to meet a student's unique pattern of cognitive strengths and weaknesses
■ Assessing how well students are prepared to profit from school experiences
■ Identifying students who are underachieving and may need further assessment to rule out learning disabilities or other cognitive disorders, including mental retardation
■ Identifying students for gifted and talented programs
■ Helping guide students and parents with educational and vocational planning
Although we have identified the most common uses of aptitude/intelligence tests in the
schools, the list clearly is not exhaustive. Classroom teachers are involved to varying degrees
with these applications. For example, teachers are frequently called on to administer and in-
terpret many of the group aptitude tests for their own students. School psychologists or other
professionals with specific training in administering and interpreting clinical and diagnostic
tests typically administer and interpret the individual intelligence and aptitude tests. Even
though they are not directly involved in administering individual intelligence tests, it is impor-
tant for teachers to be familiar with these individual tests. Teachers frequently need to read and
understand psychological reports describing student performances on these tests. Addition-
ally, teachers are often on committees that plan and develop educational programs for students
with disabilities based on information derived from these tests. In a later section we present
an example of a report of the intellectual assessment of a high school student.
Aptitude-Achievement Discrepancies

One common assessment practice employed in schools and in clinical settings is referred to as aptitude-achievement discrepancy analysis.
Additionally, Fuchs et al. (2003) outline the perceived benefits of RTI relative to the
aptitude—achievement discrepancy approach. For example, RTI is purported to provide help
to struggling students sooner. That is, RTI will help identify students with learning dis-
abilities in a timely manner, not waiting for them to fail before providing assistance. Addi-
tionally, proponents hold that RTI effectively distinguishes between students with actual
disabilities and students that simply have not received adequate instruction. With RTI dif-
ferent instructional strategies of increasing intensity are implemented as part of the process.
It is also believed by some that RTI will result in a reduced number of students receiving
special education services and an accompanying reduction in costs.
While RTI appears to hold promise in the identification of students with learning dis-
abilities (LD), a number of concerns remain. The RTI process has been defined and applied
in different ways, by different professionals, in different school settings (e.g., Christ, Burns,
& Ysseldyke, 2005; Fuchs et al., 2003). For example, some professionals envision RTI as
part of a behavioral problem-solving model while others feel it should involve the consistent
application of empirically validated protocols for students with specific learning problems.
Even when there is agreement on the basic strategy (e.g., a problem-solving model), differ-
ences exist in the number of levels (or tiers) involved, who provides the interventions, and
if RTI is a precursor to a formal assessment or if the RTI process replaces a formal assess-
ment in identifying students with learning disabilities (Fuchs et al., 2003). These inconsis-
tencies present a substantial problem since they make it difficult to empirically establish the
utility of the RTI process. Currently, the RTI model has been evaluated primarily in the
context of reading disabilities with children in the early grades, and this research is gener-
ally promising. However, much less research is available supporting the application of RTI
with other learning disorders and with older children (Feifer & Toffalo, 2007; Reynolds,
2005). In summary, there is much to be learned!
At this point, we view RTI as a useful process that can help identify struggling stu-
dents and ensure that they receive early attention and more intensive instructional interven-
tions. We also feel that students that do not respond to more intensive instruction should
receive a formal psychological assessment that includes, among other techniques, standard-
ized cognitive tests (i.e., aptitude, achievement, and possibly neuropsychological tests). We
do not agree with those that support RTI as a “stand-alone” process for identifying students
with LD. This position excludes the use of standardized tests and essentially ignores 100
years of empirical research supporting the use of psychometric procedures in identifying
and treating psychological and learning problems. A more moderate and measured approach
that incorporates the best of RTI and psychometric assessment practices seems most reason-
able at this time. If future research demonstrates that RTI can be used independently to
identify and develop interventions for students with learning disabilities, we will re-evaluate
our position.
Primary Test of Cognitive Skills (PTCS). The Primary Test of Cognitive Skills, published by CTB McGraw-Hill, is designed for use with students in kindergarten through 1st grade (ages 5.1 to 7.6 years). It has four subtests (Verbal, Spatial, Memory, and Concepts) that require no reading or number knowledge. The PTCS produces an overall Cognitive Skills Index (CSI), and when administered with TerraNova The Second Edition, anticipated achievement scores can be calculated.
InView. InView, published by CTB McGraw-Hill, is designed for use with students in
grades 2 through 12. It is actually the newest version of the Tests of Cognitive Skills and
assesses cognitive abilities in verbal reasoning, nonverbal reasoning, and quantitative rea-
soning. InView contains five subtests: Verbal Reasoning—Words (deductive reasoning,
analyzing categories, and recognizing patterns and relationships), Verbal Reasoning—Con-
text (ability to identify important concepts and draw logical conclusions), Sequences (abil-
ity to comprehend rules implied in a series of numbers, figures, or letters), Analogies
(ability to recognize literal and symbolic relationships), and Quantitative Reasoning (ability
to reason with numbers). When administered with TerraNova The Second Edition, antici-
pated achievement scores can be calculated.
Otis-Lennon School Ability Test, 8th Edition (OLSAT-8). The Otis-Lennon School
Ability Test, 8th Edition, published by Harcourt Assessment, Inc., is designed for use with
students from kindergarten through grade 12. The OLSAT-8 is designed to measure verbal
processes and nonverbal processes that are related to success in school. This includes tasks
such as detecting similarities and differences, defining words, following directions, recall-
ing words/numbers, classifying, sequencing, completing analogies, and solving mathemat-
ics problems. The OLSAT-8 produces Total, Verbal, and Nonverbal School Ability Indexes
(SAIs). The publishers note that although the total score is the best predictor of success in
school, academic success is dependent on both verbal and nonverbal abilities, and the Verbal
and Nonverbal SAIs can provide potentially important information. When administered
with the Stanford Achievement Test Series, Tenth Edition (Stanford 10), one can obtain
aptitude—achievement comparisons (Achievement/Ability Comparisons, or AACs).
Cognitive Abilities Test (CogAT), Form 6. The Cognitive Abilities Test (CogAT), dis-
tributed by Riverside Publishing, is designed for use with students from kindergarten
through grade 12. It provides information about the development of verbal, quantitative, and
nonverbal reasoning abilities that are related to school success. Students in kindergarten
through grade 2 are given the following subtests: Oral Vocabulary, Verbal Reasoning, Rela-
tional Concepts, Quantitative Concepts, Figure Classification, and Matrices. Students in
grades 3 through 12 undergo the following subtests: Verbal Classification, Sentence Com-
pletion, Verbal Analogies, Quantitative Relations, Number Series, Equation Building, Fig-
ure Classification, Figure Analogies, and Figure Analysis. Verbal, quantitative, and
nonverbal battery scores are provided along with an overall composite score. The publishers
encourage educators to focus on an analysis of the profile of the three battery scores rather
than the overall composite score. They feel this approach provides the most useful informa-
tion to teachers regarding how they can tailor instruction to meet the specific needs of stu-
dents (see Special Interest Topic 13.3 for examples). When given with the Iowa Tests of
Basic Skills or Iowa Tests of Educational Development, the CogAT provides predicted
achievement scores to help identify students whose level of achievement is significantly
higher or lower than expected. Figures 13.2, 13.3, and 13.4 provide examples of CogAT
score reports. Table 13.1 illustrates the organization of the major group aptitude/intelligence
tests.
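The predicted (or anticipated) achievement scores offered by these linked batteries rest on a regression principle: because aptitude and achievement are imperfectly correlated, a student's expected achievement score is pulled toward the mean, and achievement is flagged only when it falls well outside the range predicted from the aptitude score. The following Python sketch illustrates that general logic only; the scale values (mean 100, SD 15) and the correlation of .70 are assumed for illustration, and the publishers' actual predicted scores are derived from their co-normed standardization samples.

# Generic illustration of regression-based predicted achievement.
# MEAN, SD, and the correlation r are assumed example values, not publisher statistics.
import math

MEAN, SD = 100.0, 15.0

def predicted_achievement(aptitude, r=0.70):
    """Return the predicted achievement score and the standard error of estimate."""
    predicted = MEAN + r * (aptitude - MEAN)
    see = SD * math.sqrt(1.0 - r ** 2)
    return predicted, see

def is_discrepant(aptitude, achievement, r=0.70, z_crit=1.96):
    """Flag achievement that falls outside the band expected from the aptitude score."""
    predicted, see = predicted_achievement(aptitude, r)
    return abs(achievement - predicted) > z_crit * see

if __name__ == "__main__":
    print(predicted_achievement(120))   # approximately (114.0, 10.7)
    print(is_discrepant(120, 90))       # well below expectation -> True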
SPECIAL INTEREST TOPIC 13.3
The Cognitive Abilities Test (CogAT) is an aptitude test that measures the level and pattern of a
student’s cognitive abilities. When interpreting the CogAT, Riverside Publishing (2002) encourages
teachers to focus on the student’s performance profile on the three CogAT batteries: Verbal Reason-
ing, Quantitative Reasoning, and Nonverbal Reasoning. To facilitate interpretation of scores, the
profiles are classified as A, B, C, or E profiles, described next.
■ A profiles. Students with A profiles perform at approximately the sAme level on verbal, quantitative, and nonverbal reasoning tasks. That is, they do not have any relative strengths or weaknesses. Approximately one-third of students receive this profile designation.

■ B profiles. Students with B profiles have one battery score that is significantly aBove or Below the other two scores. That is, they have either a relative strength or a relative weakness on one battery. B profiles are designated with symbols to specify the student's relative strength or weakness. For example, B (Q+) indicates that a student has a relative strength on the Quantitative Reasoning battery, whereas B (V-) indicates that a student has a relative weakness on the Verbal Reasoning battery. Approximately 40% of students have this type of profile.

■ C profiles. Students with C profiles have both a relative strength and a relative weakness. Here the C stands for Contrast. For example, C (V+N-) indicates that a student has a relative strength in Verbal Reasoning and a relative weakness in Nonverbal Reasoning. Approximately 14% of the students demonstrate this profile type.

■ E profiles. Some students with B or C profiles demonstrate strengths and/or weaknesses that are so extreme they deserve special attention. With the CogAT, score differences of 24 points or more (on a scale with a mean of 100 and SD of 16) are designated as E profiles (E stands for Extreme). For example, E (Q-) indicates that a student has an extreme or severe weakness in Quantitative Reasoning. Approximately 14% of students have this type of profile.
■ Level of performance. In addition to the pattern of performance, it is also important to consider the level of performance. To reflect the level of performance, the letter code is preceded by a number indicating the student's middle stanine score. For example, if a student received stanines of 4, 5, and 6 on the Verbal, Quantitative, and Nonverbal Reasoning batteries, the middle stanine is 5. In classifying stanine scores, Stanine 1 is Very Low, Stanines 2 and 3 are Below Average, Stanines 4-6 are Average, Stanines 7 and 8 are Above Average, and Stanine 9 is Very High. As an example of a complete profile, the profile 8A would indicate students with relatively evenly developed Verbal, Quantitative, and Nonverbal Reasoning abilities, with their general level of performance in the Above Average range. A simple sketch of this coding logic appears below.
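The following sketch (in Python, not from the publisher's guide) mirrors the coding rules just described. The 24-point rule for E profiles and the middle-stanine prefix come from the description above; the 10-point threshold used here to call a battery score a relative strength or weakness is only an assumed placeholder, because the publisher's actual significance rules are based on confidence intervals around each score.

# Illustrative CogAT-style profile coding. The "sig" threshold of 10 points is an
# assumed placeholder; Riverside's actual rules use confidence-interval criteria.
# The 24-point rule for Extreme (E) profiles follows the description above.

def profile_code(scores, stanines, sig=10.0, extreme=24.0):
    """Code a profile from Verbal (V), Quantitative (Q), and Nonverbal (N) battery scores."""
    mid = sorted(stanines.values())[1]                 # middle stanine prefix
    median = sorted(scores.values())[1]
    strengths = [b for b, s in scores.items() if s - median >= sig]
    weaknesses = [b for b, s in scores.items() if median - s >= sig]
    spread = max(scores.values()) - min(scores.values())

    if not strengths and not weaknesses:
        letter, tag = "A", ""                          # roughly even profile
    elif strengths and weaknesses:
        letter = "C"                                   # Contrast: strength and weakness
        tag = "(" + "".join(b + "+" for b in strengths) + "".join(b + "-" for b in weaknesses) + ")"
    else:
        letter = "B"                                   # one score aBove or Below the others
        tag = "(" + (strengths[0] + "+)" if strengths else weaknesses[0] + "-)")
    if spread >= extreme:
        letter = "E"                                   # Extreme difference
    return (str(mid) + letter + " " + tag).strip()

if __name__ == "__main__":
    # Relative strength in Quantitative Reasoning at an average level: prints "5B (Q+)"
    print(profile_code({"V": 98, "Q": 112, "N": 100}, {"V": 5, "Q": 6, "N": 5}))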
Riverside Publishing (2002) delineates a number of general principles for tailoring instruc-
tion to meet the needs of students (e.g., build on strengths) as well as more specific suggestions for
working with students with different patterns and levels of performance. CogAT, Form 6: A Short
Guide for Teachers (Riverside Publishing, 2002), an easy to read and very useful resource, is avail-
able online at www.riverpub.com/products/group.cogat6/home.html.
FIGURE 13.2 Profile Narrative for the Cognitive Abilities Test (CogAT) This figure
illustrates one of the report formats available from Riverside Publishing for the CogAT.
This format provides numerical scores and graphs in the left column and a narrative description
of the student’s performance in the right column. Note that the profile depicted in this figure
is identified as 5B (Q+). Please refer to Special Interest Topic 13.3 for information on how
CogAT score profiles are coded and how teachers can use this information to customize
instruction to meet the needs of individual students.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001
The Riverside Publishing Company. All rights reserved.
FIGURE 13.3 Combined Profile Narrative for the Iowa Tests of Basic Skills (ITBS) and the
Cognitive Abilities Test (CogAT) This figure illustrates a Profile Narrative depicting a student’s
performance on both the Iowa Tests of Basic Skills (ITBS) and the Cognitive Abilities Test
(CogAT). This format provides numerical scores and graphs in the left column and a narrative
description of the student’s performance in the right column.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001
The Riverside Publishing Company. All rights reserved.
Classroom teachers are being asked more and more frequently to work with special education students, and as a result teachers need to be familiar with these tests because they are used in identifying special needs students and planning their educational programs (Nitko, 2001).
FIGURE 13.4 Achievement—Ability Graphic Comparison of the Iowa Tests of Basic Skills
(ITBS) when Combined with the Cognitive Abilities Test (CogAT) This figure presents a visual
depiction of the National Percentile Rank for each ITBS test relative to the Predicted National
Percentile Rank based on performance on the CogAT. This report illustrates the reporting of group
data, in this case the performance of all the 3rd-grade students in one school building.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001
The Riverside Publishing Company. All rights reserved.
Wechsler Intelligence Scale for Children—Fourth Edition (WISC-IV). The Wechsler Intelligence Scale for Children—Fourth Edition (WISC-IV) is the fourth edition of the most popular individual test of intel-
lectual ability for children. Empirical surveys of school psychologists and other assessment
personnel have consistently shown that the Wechsler scales are the most popular individual
intelligence test used in clinical and school settings with children (e.g., Livingston, Eglsaer,
Dickson, & Harvey-Livingston, 2003). The WISC-IV, which takes approximately 2 to 3
hours to administer and score, must be administered by professionals with extensive train-
ing in psychological assessment. Here are brief descriptions of the subtests (Wechsler,
2003):
■ Coding. The student matches and copies symbols that are associated with either ob-
jects (i.e., Coding A) or numbers (Coding B). This subtest is a measure of processing speed,
short-term visual memory, mental flexibility, attention, and motivation.
■ Digit Span. The student is presented sequences of numbers orally to repeat verbatim
(i.e., Digits Forward) or in reverse order (i.e., Digits Backwards). This subtest involves
short-term auditory memory, attention, and on Digits Backwards, mental manipulation.
■ Information. The student responds to questions that are presented orally involving a
broad range of knowledge (e.g., science, history, geography). This subtest measures the
student’s general fund of knowledge.
■ Letter-Number Sequencing. The student is read a list of letters and numbers and then
recalls the letters in alphabetical order and the numbers in numerical order. This subtest
involves short-term memory, sequencing, mental manipulation, and attention.
■ Matrix Reasoning. The student examines an incomplete matrix and then selects the
item that correctly completes the matrix. This subtest is a measure of fluid intelligence and
is considered a largely language-free and culture-fair measure of intelligence.
■ Picture Completion. The student is presented a set of pictures and must identify what
important part is missing. This subtest measures visual scanning and organization as well
as attention to essential details.
■ Picture Concepts. The student examines rows of objects and then selects objects that
go together based on an underlying concept. This subtest involves nonverbal abstract rea-
soning and categorization.
■ Similarities. Two words are presented orally to the student, who must identify how
they are similar. This subtest measures verbal comprehension, reasoning, and concept for-
mation.
■ Symbol Search. The student scans groups of symbols and indicates whether a target
symbol is present. This subtest is a measure of processing speed, visual scanning, and
concentration.
■ Vocabulary. The student is presented a series of words orally to define. This subtest is
primarily a measure of word knowledge and verbal conceptualization.
■ Word Reasoning. The student must identify the underlying or common concept im-
plied by a series of clues. This subtest involves verbal comprehension, abstraction, and
reasoning.
■ Perceptual Reasoning Index (PRI). The PRI is a composite of Block Design, Picture
Concepts, and Matrix Reasoning. Picture Completion is a supplemental PRI subtest. The
PRI reflects perceptual and nonverbal reasoning, spatial processing abilities, and visual—
spatial—motor integration.
■ Working Memory Index (WMI). The WMI is a composite of Digit Span and Letter-
Number Sequencing. Arithmetic is a supplemental WMI subtest. The WMI reflects the
student’s working memory capacity that includes attention, concentration, and mental
control.
■ Processing Speed Index (PSI). The PSI is a composite of Coding and Symbol Search.
Cancellation is a supplemental PSI subtest. The PSI reflects the student’s ability to quickly
process nonverbal material as well as attention and visual—motor coordination.
This four-index framework is based on factor analytic and clinical research (Wechsler, 2003).
Similar index scores have a rich history of clinical use and have been found to provide reli-
able information about the student’s abilities in specific areas (Kaufman, 1994; Kaufman &
Lichtenberger, 1999; Wechsler, 2003). Whereas previous Wechsler scales have produced a
Verbal IQ, Performance IQ, and Full Scale IQ, the WISC-IV reports only a Full Scale IQ
(FSIQ), which reflects the student’s general level of intelligence. The organization of the
WISC-IV is depicted in Table 13.2. To facilitate the calculation of aptitude—achievement
discrepancies, the WISC-IV is statistically linked to the Wechsler Individual Achievement
Test—Second Edition (WIAT-II), which was described in the previous chapter on standard-
ized achievement tests.
The WISC-IV and its predecessors are designed for use with children between the ages
of 6 and 16. For early childhood assessment the Wechsler Preschool and Primary Scale of
Intelligence—Third Edition (WPPSI-III) is available and is appropriate for children between
2 years 6 months and 7 years 3 months. The Wechsler Adult Intelligence Scale—Third Edition (WAIS-III) is appropriate for individuals between the ages of 16 and 89 years.

TABLE 13.2 Organization of the WISC-IV
Verbal Comprehension: Information, Vocabulary, Similarities, Comprehension, Word Reasoning
Perceptual Reasoning: Block Design, Picture Concepts, Matrix Reasoning, Picture Completion (supplemental)
Working Memory: Digit Span, Letter-Number Sequencing, Arithmetic (supplemental)
Processing Speed: Coding, Symbol Search, Cancellation (supplemental)
All of the index scores combine to form the Full Scale IQ.
TABLE 13.3 Organization of the Stanford-Binet Intelligence Scales, 5th Edition (SB5)
The Woodcock-Johnson III (WJ III) Tests of Cognitive Abilities is based on the Cattell-Horn-Carroll (CHC) theory of cognitive abilities, which incorporates Cattell's and Horn's Gf-Gc theory
and Carroll’s three-stratum theory. The CHC theory provides a comprehensive model for
assessing a broad range of cognitive abilities, and many clinicians like this battery because it
measures such a broad range of abilities. The organization of the WJ III Tests of Cognitive
Abilities is depicted in Table 13.4 (Riverside, 2003). The WJ III Tests of Cognitive Abilities
is co-normed with the WJ III Tests of Achievement described in the chapter on standardized
achievement tests.
TABLE 13.4 Organization of the Woodcock-Johnson III (WJ III) Tests of Cognitive Abilities
An important consideration when selecting an aptitude or intelligence test involves the decision to use a group or individual test. As is the case with standardized
achievement tests, group aptitude tests are used almost exclusively for mass testing applica-
tions because of their efficiency. Even a relatively brief individual intelligence test typically
requires approximately 30 minutes per student to administer. Additionally, assessment pro-
fessionals with special training in test administration are needed to administer these indi-
vidual tests. A limited amount of time to devote to testing and a limited number of assessment
personnel combine to make it impractical to administer individual tests to a large number
of students. However, some situations demand the use of an individual intelligence test. This
is often the case when making classification decisions such as identifying students who have
learning disabilities or who qualify for gifted and talented programs.
When selecting an intelligence or aptitude test, it is also important to consider how
the information will be used. Are you primarily interested in obtaining a global measure of
intellectual ability, or do you need a test that provides multiple scores reflecting different
sets of cognitive abilities? As we noted, as a general rule intelligence tests have been shown to be good at predicting academic success. Therefore, if you are simply interested in predicting school success, practically any of these tests will meet your needs. If you want to identify the cognitive strengths and weaknesses of your students, you should look at the type of scores provided by the different test batteries and select one that meets your needs from either a theoretical or practical perspective. For example, a teacher or clinician who has
embraced the Cattell-Horn-Carroll (CHC) theory of cognitive abili-
ties would be well served using the Woodcock-Johnson III Tests of Cognitive Abilities be-
cause it is based on that specific model of cognitive abilities. The key is to select a test that
provides the specific type of information you need for your application. Look at the type of
factor and intelligence scores the test produces, and select a test that provides meaningful
and practical information for your application.
If you are interested in making aptitude—achievement comparisons, ideally you should
select an aptitude test that is co-normed with an achievement test that also meets your spe-
cific needs. All of the major group aptitude tests we discussed are co-normed or linked to a
major group achievement test. When selecting a combination aptitude—achievement battery,
you should examine both the achievement test and the aptitude test to determine which set
best meets your specific assessment needs. In reference to the individual intelligence tests
we discussed, only the WISC-IV and WJ III Tests of Cognitive Abilities have been co-
normed with or linked to an individual achievement test battery. While it is optimal to use
co-normed instruments when aptitude—achievement comparisons are important, in actual
practice many clinicians rely on aptitude and achievement tests that are not co-normed or
linked. In this situation, it is important that the norms for both tests be based on samples that
are as nearly identical as possible. For example, both tests should be normed on samples
with similar characteristics (e.g., age, race, geographic region) and obtained at approxi-
mately the same time (Reynolds, 1990).
Another important question involves the population you will use the test with. For ex-
ample, if you will be working with children with speech, language, or hearing impairments or
diverse cultural/language backgrounds, you may want to select a test that emphasizes nonver-
bal abilities and minimizes cultural influences. Finally, as when selecting any test, you want
to examine the psychometric properties of the test. You should select a test that produces reli-
able scores and has been validated for your specific purposes. All of the aptitude/intelligence
tests we have discussed have good psychometric properties, but it is the test user’s responsibil-
ity to ensure that the selected test has been validated for the intended purposes.
SPECIAL INTEREST TOPIC 13.4

[The sample report reproduced here opens with identifying information (name, age, date of birth, grade, ethnicity, examiner, and referral source), a table of the examinee's RIAS index scores, and graphs of the RIAS score profiles.]
Background Information
Becky J. Gibson is a 17-year-old female. She was referred by her guidance counselor for an initial
learning disability evaluation. Becky is currently in the 11th grade. The name of Becky’s school
was reported as “Lincoln High School.” Becky's parental educational attainment was reported as: “College.” The primary language spoken in Becky's home is English.
Becky identified the following vision, hearing, language, and/or motor problems: “Requires prescription glasses for reading.” Becky further identified the following learning problems: “None.” Finally, Becky identified the following medical/neurological problems: “None.”
Behavioral Observations
Becky arrived more than 15 minutes early. She was accompanied to the session by her legal guard-
ian. During testing the following behavioral observations were made: “Client appeared easily
distracted and was very fidgety.”
The RIAS yields a Verbal Intelligence Index (VIX), a Nonverbal Intelligence Index (NIX), and a Composite Intelligence Index (CIX). The CIX measures the two most important aspects of general intelligence according to recent theories and research findings: reasoning or fluid abilities and verbal or crystallized abilities. Each of these indexes is ex-
pressed as an age-corrected standard score that is scaled to a mean of 100 and a standard
deviation of 15. These scores are normally distributed and can be converted to a variety of other
metrics if desired.
The RIAS also contains subtests designed to assess verbal memory and nonverbal memory.
Depending on the age of the individual being evaluated, the verbal memory subtest consists of a
series of sentences, age-appropriate stories, or both, read aloud to the examinee. The examinee
is then asked to recall these sentences or stories as precisely as possible. The nonverbal memory
subtest consists of the presentation of pictures of various objects or abstract designs for a period
of 5 seconds. The examinee is then shown a page containing six similar objects or figures and
must discern which object or figure was previously shown. The scores from the verbal memory and
nonverbal memory subtests are combined to form a Composite Memory Index (CMX), which pro-
vides a strong, reliable assessment of working memory and also may provide indications as to
whether or not a more detailed assessment of memory functions may be required. In addition, the
high reliability of the verbal and nonverbal memory subtests allows them to be compared directly
to each other.
For reasons described in the RIAS/RIST Professional Manual (Reynolds & Kamphaus, 2003),
it is recommended that the RIAS subtests be assigned to the indices described above (e.g., VIX,
NIX, CIX, and CMX). For those who do not wish to consider the memory scales as a separate entity
and prefer to divide the subtests strictly according to verbal and nonverbal domains, the RIAS
subtests can be combined to form a Total Verbal Battery (TVB) score and a Total Nonverbal Battery
(TNB) score. The subtests that compose the Total Verbal Battery score assess verbal reasoning
ability, verbal memory, and the ability to access and apply prior learning in solving language-
related tasks. Although labeled the Total Verbal Battery score, the TVB also is a reasonable ap-
proximation of measures of crystallized intelligence. The TNB comprises subtests that assess
nonverbal reasoning, spatial ability, and nonverbal memory. Although labeled the Total Nonverbal
Battery score, the TNB also provides a reasonable approximation of fluid intelligence. These two
indexes of intellectual functioning are then combined to form an overall Total Test Battery (TTB)
score. By combining the TVB and the TNB to form the TTB, a stronger, more reliable assessment
of general intelligence (g) is obtained. The TTB measures the two most important aspects of gen-
eral intelligence according to recent theories and research findings: reasoning, or fluid, abilities
and verbal, or crystallized, abilities. Each of these scores is expressed as an age-corrected standard
score that is scaled to a mean of 100 and a standard deviation of 15. These scores are normally
distributed and can be converted to a variety of other metrics if desired.
Becky earned a Verbal Intelligence Index (VIX) of 56, which falls within the significantly
below average range of verbal intelligence skills and exceeds the performance of less than one
percent of individuals Becky’s age. The chances are 90 out of 100 that Becky’s true VIX falls within
the range of scores from 53 to 64.
Becky earned a Nonverbal Intelligence Index (NIX) of 92, which falls within the average range
of nonverbal intelligence skills and exceeds the performance of 30% of individuals Becky's age.
The chances are 90 out of 100 that Becky's true NIX falls within the range of scores from 87 to
98.
Becky earned a Composite Memory Index (CMX) of 114, which falls within the above average
range of working memory skills. This exceeds the performance of 82% of individuals Becky’s age.
The chances are 90 out of 100 that Becky’s true CMX falls within the range of scores from 108 to
119.
On testing with the RIAS, Becky earned a Total Test Battery or TTB score of 83. This level of
performance on the RIAS falls within the range of scores designated as below average and exceeds
the performance of 13% of individuals at Becky’s age. The chances are 90 out of 100 that Becky’s
true TTB falls within the range of scores from 79 to 88.
Becky's Total Verbal Battery (TVB) score of 68 falls within the range of scores designated as
significantly below average and exceeds the performance of 2% of individuals her age. The chances
are 90 out of 100 that Becky’s true TVB falls within the range of scores from 65 to 74.
Becky’s Total Nonverbal Battery (TNB) score of 100 falls within the range of scores designated
as average and exceeds the performance of 50% of individuals her age. The chances are 90 out of
100 that Becky's true TNB falls within the range of scores from 95 to 105.
VIX is the Verbal Intelligence Index, NIX is the Nonverbal Intelligence Index, CIX is the Composite
Intelligence Index, CMX is the Composite Memory Index, VRM is the Verbal Memory Subtest, NVM is
the Nonverbal Memory Subtest, TVB is the Total Verbal Battery Index, and TNB is the Total Nonverbal
Battery Index.
nonverbal intelligence or spatial abilities. The magnitude of the difference observed between these
two scores is potentially important and should be considered when drawing conclusions about
Becky's current status. A difference of this size is relatively uncommon, occurring in only 5% of
cases in the general population. In such cases, interpretation of the TTB or general intelligence score
may be of less value than viewing Becky's verbal and nonverbal abilities separately.
If interested in comparing the TTB and CIX scores or the TTB and CMX scores, it is better to
compare the CIX and CMX directly. As noted in the RIAS/RIST Professional Manual (Reynolds &
Kamphaus, 2003), the TTB is simply a reflection of the sum of the T scores of the subtests that
compose the CIX and CMX. Thus, it is more appropriate to make a direct comparison of the CMX and
CIX because any apparent discrepancy between the TTB and the CIX or the TTB and the CMX will in
fact be a reflection of discrepancies between the CIX and the CMX, so this value is best examined
directly. To compare the CMX or CIX to the TTB may exaggerate some differences inappropriately.
Teachers should prepare an individualized curriculum designed for stu-
dents who learn at a slower rate than others of the same age and grade level. Alternative methods
of instruction should be considered that involve the use of repeated practice, spaced practice,
concrete examples, guided practice, and demonstrative techniques. Individuals with general intel-
ligence scores in this range often benefit from repeated practice approaches to training because
of problems with acquisition and long-term retrieval, as well as an individualized instructional
method that differs significantly from that of their age-mates. It also will be important to assist
Becky in developing strategies for learning and studying. Although it is important for all students
to know how to learn and not just what to learn, low scores on general intelligence indices make
the development of learning and study strategies through direct instruction even more important.
If confirmed through further testing, co-occurring deficits in adaptive behavior and behavioral
problems should be added to the school intervention program.
Becky’s VIX score of 56 and TVB score of 68 indicate severe deficits in the development of
verbal intellect relative to others at Becky’s age. Individuals at this score level on the TVB nearly
always have accompanying verbal memory difficulties that can easily be moderate to severe in
nature. Special attention to Becky's VRM score is necessary, as well as considerations for any
extant verbal memory problems and their accompanying level of severity in making specific
recommendations.
Verbal ability is important for virtually every aspect of activity because language is key to
nearly all areas of human endeavor. A multitude of research investigations have documented the
importance of verbal ability for predicting important life outcomes. Verbal ability should be con-
sidered equivalent to the term “crystallized intelligence” (Kamphaus, in press). As assessed by
the RIAS, verbal ability (like crystallized intelligence) is highly related to general intelligence,
and as such it shows a similarly strong relationship to important life outcomes. Verbal ability also is
the foundation for linguistic knowledge, which is necessary for many types of learning.
With the exception of the earliest grades, kindergarten, and pre-K settings, school
is principally a language-oriented enterprise. Given Becky's relative verbal deficits, special teaching
methods might be considered, including special class placement in the case of severe deficits in
verbal intellectual development. The examiner should also consider either conducting, or making
a referral for, an evaluation for the presence of a language disorder. Alternative methods of in-
struction that emphasize "show me" rather than "tell me" techniques, or that at a minimum pair
these two general approaches, are preferred.
Although linguistic stimulation likely cannot counteract the effects of verbal ability deficits
that began in infancy or preschool years, verbal stimulation is still warranted to either improve
adaptation or at least prevent an individual from falling further behind peers. Verbal concept and
knowledge acquisition should continue to be emphasized. A simple word-for-the-day program
may be beneficial for some students. Verbal knowledge builders of all varieties may be helpful
including defining words, writing book reports, a book reading program, and social studies and
science courses that include writing and oral expression components. Alternatively, assistive
technology (e.g., personal digital assistant devices, tape recorders, MP3 players, or iPods) may
be used to enhance functioning in the face of the extensive verbal demands required for making
adequate academic progress.
In addition, teachers should rely more heavily on placing learning into the student's experi-
ential context, giving it meaning and enabling Becky to visualize incorporating each newly
learned task or skill into her life experience.
The use of visual aids should be encouraged and made
available to Becky whenever possible. Academic difficulties are most likely to occur in language-
related areas (e.g., the acquisition of reading), especially early phonics training. The acquisition
of comprehension skills, when verbal ability falls at this level, also is aided by the use of
language experience approaches to reading in particular. Frequent formal and informal assess-
ment of Becky’s reading skills, as well as learning and study strategies (the latter with an instru-
ment, e.g., the School Motivation and Learning Strategies Inventory; SMALSI; Stroud & Reynolds,
2006) is recommended. This should be followed by careful direct instruction in areas of specific
skill weaknesses and the use of high interest, relevant materials. It also will be important to assist
Becky in developing strategies for learning and studying. Although it is important for all students
to know how to learn and not just what to learn, low scores within the verbal intelligence domains
make the development of learning and study strategies through direct instruction even more
important.
emphasis on learning by example and by demonstration is likely to be most effective with stu-
dents with this intellectual pattern. Also common are problems with sequencing including se-
quential memory and, in the early grades, mastery of phonics when synthesizing word sounds
into correct words. Emphases on holistic methods of learning are likely to be more successful in
addition to experiential approaches. The practical side of learning and the application of knowl-
edge can be emphasized to enhance motivation in these students.
Often, these students do not have good study, learning, and test-taking strategies. It is often
useful to assess the presence of strategies with a scale such as the School Motivation and Learn-
ing Strategies Inventory and then to target deficient areas of learning strategies for direct instruc-
tion (Stroud & Reynolds, 2006).
The magnitude of discrepancy between Becky's CMX score of 114 and CIX score of 71 is rela-
tively unusual within the normative population, suggesting that memory skills are relatively more
intact than general intellectual skills. Individuals with this profile may require more intensive
and broad-based intervention because general intelligence is a better predictor of occupational
and educational outcomes than are memory skills (Kamphaus, in press).
Students with this profile may experience problems with inferential reasoning, logic, the
comprehension of new concepts, and the acquisition of new knowledge. As such, participation in
school or intervention programs is often more successful if lessons are of longer duration, infor-
mation is provided in multiple modalities, opportunities to practice newly acquired skills are
provided frequently, and repetition and review is emphasized.
RIAS score summary for Becky (index-level scores). The subtest raw scores, T scores, z scores,
and scaled scores that appeared in the original table could not be recovered from its layout.

                                      VIX    NIX    CIX     CMX    TVB     TNB    TTB
Sum of subtest T scores                42     89    131     114     92     153    245
Index score (Mean = 100, SD = 15)      56     92     71     114     68     100     83
95% confidence interval             52-65  86-99  67-78 107-120  64-75  94-106  79-88
90% confidence interval             53-64  87-98  67-77 108-119  65-74  95-105  79-88
Stanine (Mean = 5, SD = 2)              1      4      1       7      1       5      3
general intelligence through the use of comprehensive measures of memory functions including
the WRAML-2 (Sheslow & Adams, 2003) and the TOMAL-2 (Reynolds & Voress, 2007). Subtests of
the Neuropsychological Assessment Battery (NAB; Stern & White, 2003), and other related tests of verbal
skills with which you are familiar and skilled may well be useful adjuncts to the assessment pro-
cess in Becky’s case. Students with this pattern often exhibit inadequate levels of study skills
development and learning strategies and, thus, may become discouraged in school or vocational-
training programs. Assessment and targeted remediation of such deficits can be undertaken for
ages 8 through 18 years with assessments such as the School Motivation and Learning Strategies
Inventory (Stroud & Reynolds, 2006).
In cases where the CMX score is clinically significantly higher than the CIX score, follow-up
evaluation may be warranted, particularly if the CIX is in the below average range or lower. Lower
intelligence test scores are associated with increased rates of a variety of forms of psychopathology, par-
ticularly if scores are in or near the mental retardation range (Kamphaus, in press). Because
general intelligence impacts knowledge and skill acquisition in a variety of areas, a thorough
evaluation of academic achievement is necessary to gauge the impact of any impairment and make
plans to remediate educational weaknesses.
References
Hammill, D., & Bryant, B. (2005). Detroit Tests of Learning Aptitude-Primary (DTLA-P-3) (3rd ed.).
Austin, TX: PRO-ED.
Hammill, D., Pearson, N. A., & Voress, J. K. (1993). Developmental Test of Visual Perception-2 (DTVP-2).
Austin, TX: PRO-ED.
Kamphaus, R.W. (in press). Clinical assessment of children’s intelligence (3rd ed.). New York:
Springer.
McCarthy, D. (1972). McCarthy Scales of Children’s Abilities. San Antonio, TX: Harcourt Assessment.
Reitan, R. M., & Wolfson, D. (1993). The Halstead-Reitan Neuropsychological Test Battery: Theory and
clinical interpretation (2nd ed.). Tucson, AZ: Neuropsychology Press.
Reynolds, C. R. (2006). Koppitz Developmental Scoring System for the Bender Gestalt Test (Koppitz-2)
(2nd ed.). Austin, TX: PRO-ED.
Reynolds, C. R., & Kamphaus, R. W. (2003). Reynolds Intellectual Assessment Scales (RIAS) and the
Reynolds Intellectual Screening Test (RIST) professional manual. Lutz, FL: Psychological Assess-
ment Resources.
Reynolds, C. R., Pearson, N. A., & Voress, J. K. (2002). Developmental Test of Visual Perception-
Adolescent and Adult (DTVP-A). Austin, TX: PRO-ED.
Reynolds, C. R., & Voress, J. K. (2007). Test of Memory and Learning (TOMAL-2) (2nd ed.). Austin, TX:
PRO-ED.
Reynolds, C. R., Voress, J. K., & Pearson, N. A. (2007). Developmental Test of Auditory Perception (DTAP).
Austin, TX: PRO-ED.
Semel, E. M., Wiig, E.H., & Secord, W. A. (2004). Clinical Evaluation of Language Fundamentals 4—
Screening Test (CELF-4). San Antonio, TX: Harcourt Assessment.
Sheslow, D., & Adams, W. (2003). Wide Range Assessment of Memory and Learning 2 (WRAML-2).
Wilmington, DE: Wide Range.
Stern, R. A., & White, T. (2003). Neuropsychological Assessment Battery (NAB). Lutz, FL: Psychological
Assessment Resources.
Stroud, K., & Reynolds, C. R. (2006). School Motivation and Learning Strategies Inventory (SMALSI).
Los Angeles: Western Psychological Services.
Wallace, G., & Hammill, D. D. (2002). Comprehensive Receptive and Expressive Vocabulary Test
(CREVT-2) (2nd ed.). Los Angeles: Western Psychological Services.
Reproduced by special permission of the publisher, Psychological Assessment Resources, Inc., 16204 North Flor-
ida Avenue, Lutz, Florida 33549, from the Reynolds Intellectual Assessment Scales Interpretive Report by Cecil
R. Reynolds, PhD and Randy W. Kamphaus, PhD. Copyright 1998, 1999, 2002, 2003, 2007 by Psychological As-
sessment Resources, Inc. Further reproduction is prohibited without permission of PAR, Inc.
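The score summary table in the sample report expresses Becky's performance on several interchangeable metrics (z scores, T scores, index scores with a mean of 100 and standard deviation of 15, percentile ranks, and stanines) and brackets each index with a confidence interval. The Python sketch below illustrates how such conversions and a 90% interval are commonly computed. The SEM value is hypothetical and chosen only for illustration; published intervals such as those in the RIAS report are often centered on the estimated true score rather than the obtained score, so they will not always match a simple symmetric calculation.

```python
from math import erf, sqrt

def z_from_index(index, mean=100.0, sd=15.0):
    """Convert an index score (mean 100, SD 15) to a z score."""
    return (index - mean) / sd

def t_from_z(z):
    """Convert a z score to a T score (mean 50, SD 10)."""
    return 50 + 10 * z

def percentile_from_z(z):
    """Percentile rank from the cumulative normal distribution."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

def stanine_from_z(z):
    """Stanine (mean 5, SD 2), limited to the 1-9 range."""
    return max(1, min(9, round(5 + 2 * z)))

def confidence_interval(score, sem, z_crit=1.645):
    """Symmetric interval around an obtained score; z of 1.645 gives ~90%."""
    return round(score - z_crit * sem), round(score + z_crit * sem)

# Becky's Total Nonverbal Battery index of 100 as a worked example
tnb = 100
z = z_from_index(tnb)                     # 0.0
print(t_from_z(z))                        # 50.0 -> T score of 50
print(round(percentile_from_z(z)))        # 50   -> exceeds 50% of same-age peers
print(stanine_from_z(z))                  # 5
print(confidence_interval(tnb, sem=3.0))  # hypothetical SEM of 3 -> (95, 105)
```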
for understanding Becky’s particular pattern of intellectual development and ways that it
may be relevant to altering instructional methods or making other changes in how material
is presented to her in an educational setting. The next major section of the report deals pre-
cisely with school feedback and recommendations. Here the reader is provided with a gen-
eral understanding of the implications of these findings for Becky’s academic development
and alternative methods of instruction are recommended. These are based on various studies
of the implications of intelligence test results for student learning over many decades. In
particular, the actuarial analyses of discrepancies in Becky’s various areas of intellectual
development have led to recommendations for some additional assessment as well as
changes in teaching methods.
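The actuarial discrepancy analyses mentioned above rest on a standard psychometric question: is the difference between two index scores larger than would be expected from measurement error alone? A common approach uses the standard error of the difference, built from each index's reliability. The sketch below uses hypothetical reliability values rather than values from any test manual.

```python
from math import sqrt

def critical_difference(r_xx, r_yy, sd=15.0, z_crit=1.96):
    """Smallest difference between two index scores (both with SD 15) that
    exceeds chance at roughly the .05 level, using the standard error of
    the difference: SD * sqrt(2 - r_xx - r_yy)."""
    return z_crit * sd * sqrt(2 - r_xx - r_yy)

# Hypothetical reliabilities; actual values come from a test's manual.
needed = critical_difference(r_xx=0.94, r_yy=0.92)
observed = abs(114 - 71)   # Becky's CMX minus CIX from the sample report
print(f"critical difference = {needed:.1f}, observed difference = {observed}")
print("reliable difference" if observed >= needed else "within chance variation")
```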
The purpose of all of the commentary in this report is ultimately to achieve an under-
standing of Becky’s intellectual development and how it may be related to furthering her
academic development in the best way possible.
The sample report is restricted to recommendations for school or other formal instruc-
tional settings. Other specialized reports can be generated separately for specialized clinical
settings that make quite different recommendations and even provide provisional diagnoses
that should be considered by the professional psychologist administering and interpreting
the intellectual assessment. The reader should be aware that it is rare for a report to be based
only on an intellectual assessment, and we doubt you will ever see such a report based on a
singular instrument. Typically, reports of the assessment of a student conducted by a diag-
nostic professional will include not only a thorough assessment of intellectual functions, such as
the one reported in Special Interest Topic 13.4, but also evaluations of academic skills and status;
of personality and behavior that may affect academic performance; and of specialized areas of
development such as auditory perceptual skills, visual perceptual skills, visual-motor integration,
attention, concentration, and memory. The specific components included are dictated by the nature
of the referral and by information gathered during the ongoing assessment process.
A final type of aptitude test that is often used in schools includes those used to make ad-
mission decisions at colleges and universities. College admission tests were specifically
designed to predict academic performance in college, and although they are less clearly
linked to a specific educational curriculum than most standard achievement tests, they do
focus on abilities and skills that are highly academic in nature. Higher education admis-
sion decisions are typically based on a number of factors including high school GPA,
letters of recommendation, personal interviews, written statements, and extracurricular activities,
but in many situations scores on standardized admission tests are a prominent factor. The two most
widely used admission assessment tests are the Scholastic Assessment Test (SAT) and the American
College Test (ACT).
Scholastic Assessment Test. The College Entrance Examination Board (CEEB), com-
monly referred to as the College Board, was originally formed to provide colleges and
universities with a valid measure of students’ academic abilities. Its efforts resulted in the
development of the first Scholastic Aptitude Test in 1926. The test has undergone numerous
revisions and in 1994 the title was changed to Scholastic Assessment Test (SAT). The new-
est version of the SAT was administered for the first time in fall 2005 and includes the fol-
lowing three sections: Critical Reading, Mathematics, and Writing. Although the Critical
Reading and Mathematics sections assess new content relative to previous exams, the most
prominent change is the introduction of the Writing section. This section contains both
multiple-choice questions concerning grammar and a written essay. The SAT is typically
taken in a student’s senior year. The College Board also produces the Preliminary SAT
(PSAT), which is designed to provide practice for the SAT. The PSAT helps students iden-
tify their academic strengths and weaknesses so they can better prepare for the SAT. The
PSAT is typically taken during a student's junior year. More information about the SAT can
be accessed at the College Board's Web site: www.collegeboard.com.
American College Test. The American College Testing Program (ACT) was initiated in
1959 and is the major competitor of the SAT. The American College Test (ACT) is de-
signed to assess the academic development of high school students and predict their ability
to complete college work. The test covers four skill areas—English, Mathematics, Read-
ing, and Science Reasoning—and includes 215 multiple-choice questions. When describ-
ing the ACT, the producers emphasize that it is not an aptitude or IQ test, but an achievement
test that reflects the typical high school curriculum in English, mathematics, and science.
In addition to the four subtests, the ACT also incorporates an interest inventory that pro-
vides information that may be useful for educational and career planning. Beginning in the
2004-2005 academic year, the ACT included an optional 30-minute writing test that as-
sesses an actual sample of students' writing. More information about the ACT can be ac-
cessed at the ACT's Web site: www.act.org.
Summary
In this chapter we discussed the use of standardized intelligence and aptitude tests in the
schools. We started by noting that aptitude/intelligence tests are designed to assess the cog-
nitive skills, abilities, and knowledge that are acquired as the result of broad, cumulative life
experiences. We compared aptitude/intelligence tests with achievement tests that are de-
signed to assess skills and knowledge in areas in which specific instruction has been pro-
vided. We noted that this distinction is not absolute, but rather one of degree. Both aptitude
and achievement tests measure developed cognitive abilities. The distinction lies with the
degree to which the cognitive abilities are dependent on or linked to formal learning experi-
ences. Achievement tests should measure abilities that are developed as the direct result of
formal instruction and training whereas aptitude tests should measure abilities acquired
from all life experiences, not only formal schooling. In addition to this distinction, achieve-
ment tests are usually used to measure what has been learned or achieved at a fixed point in
time, whereas aptitude tests are often used to predict future performance. Although the
distinction between aptitude and achievement tests is not as clear as one might expect, the
two types of tests do differ in their focus and are used for different purposes.
The most popular type of aptitude test used in schools today is the general intelligence
test. Intelligence tests actually had their origin in the public schools approximately 100 years
ago when Alfred Binet and Theodore Simon developed the Binet-Simon Scale to identify
children who needed special educational services to be successful in French schools. The test
was well received in France and was subsequently translated and standardized in the United
States to produce the Stanford-Binet Intelligence Test. Subsequently other test developers
developed their own intelligence tests and the age of intelligence testing had arrived. Some
of these tests were designed for group administration and others for individual administra-
tion. Some of these tests focused primarily on verbal and quantitative abilities whereas others
placed more emphasis on visual—spatial and abstract problem-solving skills. Some of these
tests even avoided verbal content altogether. Research suggests that, true to their initial pur-
pose, intelligence tests are fairly good predictors of academic success. Nevertheless, the
concept of intelligence has taken on different meanings for different people, and the use of
general intelligence tests has been the focus of controversy and emotional debate for many
years. This debate is likely to continue for the foreseeable future. In an attempt to avoid
negative connotations and misinterpretations, many test publishers have switched to more
neutral titles such as school ability or simply ability to designate the same basic construct.
Contemporary intelligence tests have numerous applications in today’s schools. These
include providing a broader measure of cognitive abilities than traditional achievement
tests, helping teachers tailor instruction to meet students’ unique patterns of cognitive
strengths and weaknesses, determining whether students are prepared for educational expe-
riences, identifying students who are underachieving and may have learning or other cogni-
tive disabilities, identifying students for gifted and talented programs, and helping students
and parents make educational and career decisions. Classroom teachers are involved to
varying degrees with practically all of these applications. Teachers often help with the ad-
ministration and interpretation of group aptitude tests, and although they typically do not
administer and interpret individual aptitude tests, they do need to be familiar with the tests
and the type of information they provide.
One common practice when interpreting intelligence tests is referred to as aptitude—
achievement discrepancy analysis. This simply involves comparing a student’s performance
on an aptitude test with performance on an achievement test. The expectation is that achieve-
ment will be commensurate with aptitude. Students with achievement scores significantly
greater than ability scores may be considered academic overachievers whereas those with
achievement scores significantly below ability scores may be considered underachievers.
There are a number of possible causes for academic underachievement ranging from poor
student motivation to specific learning disabilities. We noted that there are different methods
for determining whether a significant discrepancy between ability and achievement scores
exists and that standards have been developed for performing these analyses. To meet these
standards, many of the popular aptitude and achievement tests have been co-normed or statis-
tically linked to permit comparisons. We cautioned that while ability-achievement discrep-
ancy analysis is a common practice, not all assessment experts support the practice. As we
have emphasized throughout this text, test results should be interpreted in conjunction with other
sources of information when making important decisions. This suggestion applies when mak-
ing ability—achievement comparisons.
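To make the two common approaches concrete, the sketch below contrasts a simple-difference rule with a regression-based comparison that predicts expected achievement from ability using the correlation between the two tests. The threshold and correlation shown are hypothetical values for illustration, not figures endorsed by any particular standard or test manual.

```python
def simple_difference_flag(ability, achievement, threshold=15):
    """Flag underachievement when achievement falls a fixed number of
    standard-score points below ability (hypothetical 1-SD threshold)."""
    return (ability - achievement) >= threshold

def expected_achievement(ability, r_xy, mean=100.0):
    """Achievement predicted from ability, allowing for regression toward
    the mean, assuming both tests use a mean of 100 and SD of 15."""
    return mean + r_xy * (ability - mean)

ability, achievement = 120, 95                        # hypothetical co-normed scores
print(simple_difference_flag(ability, achievement))   # True
predicted = expected_achievement(ability, r_xy=0.65)  # 113.0
print(predicted - achievement)                        # regression-based discrepancy of 18 points
```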
The chapter concluded with an examination of a number of the popular group and
individual aptitude tests. Finally, a number of factors were discussed that should be consid-
ered when selecting an aptitude test. These included deciding between a group and indi-
vidual test, determining what type of information is needed (e.g., overall IQ versus multiple
factor scores), determining which students the test will be used with, and evaluating the
psychometric properties (e.g., reliability and validity) of the test.
RECOMMENDED READINGS
Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist,
36, 1-14. An interesting and readable chronicle of the controversy surrounding mental testing
during much of the twentieth century.
Fletcher-Janzen, E., & Reynolds, C. R. (Eds.). (in press). Neuroscientific and clinical perspectives on
the RTI initiative in learning disabilities diagnosis and intervention. New York: John Wiley and
Sons. This text provides a review of the use of RTI in the identification of learning disabilities.
Kamphaus, R. W. (2001). Clinical assessment of child and adolescent intelligence. Boston: Allyn &
Bacon. This text provides an excellent discussion of the assessment of intelligence and related issues.
14

Assessment of Behavior and Personality
In Chapter 1, when describing the different types of tests, we noted that tests typically can
be classified as measures of either maximum performance or typical response. Maximum
performance tests are often referred to as ability tests. On these tests items are usually scored
as either correct or incorrect, and examinees are encouraged to demonstrate the best perfor-
mance possible. Achievement and aptitude tests are common examples of maximum perfor-
mance tests. Typical response tests typically assess constructs such as personality, behavior,
attitudes, or interests (Cronbach, 1990). Although maximum performance tests are the most
prominent type of test used in schools today, typical response tests are used frequently also.
Public Law 94-142 (IDEA) and its most current reauthorization, the Individuals with
Disabilities Education Improvement Act of 2004 (IDEA 2004), mandate that schools pro-
vide special education and related services to students with emotional disorders. These laws
compel schools to identify students with emotional disorders and, as a result, expand
school assessment practices, previously focused primarily on cognitive abilities, to include
the evaluation of personality, behavior, and related constructs. The primary goal of this
chapter is to help teachers become familiar with the major instruments used in assessing
emotional and behavioral features of children and adolescents and to assist them in under-
standing the process of evaluating such students. Teachers are often called on to provide
relevant information on students’ behavior. Teachers are involved to varying degrees with
the assessment of student behavior and personality. Classroom teachers are often asked to
help with the assessment of students in their classrooms, for example, by completing be-
havior rating scales on students in their class. This practice provides invaluable data to
school psychologists and other clinicians because teachers have a unique opportunity to
observe children in their classrooms. Teachers can provide information on how the child
behaves in different contexts, both academic and social. As a result, the knowledge derived
from behavior rating scales completed by teachers plays an essential role in the assessment
of student behavior and personality. Teachers may also be involved with the development
and implementation of educational programs for children with emotional or behavioral
disorders. As part of this role, teachers may need to read psychological reports and incor-
porate these findings into instructional strategies. In summary, although teachers do not
need to become experts in the field of psychological assessment, it is vital for them to
become familiar with the types of instruments used in assessing children’s behavior and
personality.
Personality can be defined as an individual's characteristic way of thinking, feeling, and behaving.

Before proceeding, it is beneficial to clarify how assessment experts conceptualize personality.
Gray (1999) defines personality as "the relatively consistent patterns of thought, feeling, and
behavior that characterize each person as a unique individual" (p. G12). This definition probably
captures most people's concept of personal-
ity. In conventional assessment terminology, personality is defined in
a similar manner, incorporating a host of emotional, behavioral, motivational, interpersonal,
and attitudinal characteristics (Anastasi & Urbina, 1997). In the context of child and ado-
lescent assessment, the term personality should be used with some care. Measures of
personality and behavior in children demonstrate less stability than comparable measures
in adults. This is not particularly surprising given the rapid developmental changes charac-
teristic of children and adolescents. As a result, when using the term personality in the
context of child and adolescent assessment, it is best to interpret it cautiously and under-
stand that it does not necessarily reflect a fixed construct, but one that is subject to develop-
ment and change.
Even though we might not consciously be aware of it, we all engage in the assessment of
personality and behavior on a regular basis. When you note that “Johnny has a good person-
ality,” “Tommy is a difficult child,” or “Tamiqua is extroverted,” you are making a judgment
about personality. We use these informal evaluations to determine whom we want to associ-
ate with and whom we want to avoid, among many other ways.
The development of the first formal instrument for assessing personality is typically
traced to the efforts of Robert Woodworth. In 1918, he developed the Woodworth Personal
Data Sheet, which was designed to help collect personal information about military recruits.
Much as the development of the Binet scales ushered in the era of intelligence testing, the
introduction of the Woodworth Personal Data Sheet ushered in the era of personality assess-
ment. Subsequent instruments for assessing personality and behavior took on a variety of
forms, but they all had the same basic purpose of helping us to understand the behavior and
personal characteristics of ourselves and others. Special Interest Topic 14.1 provides a brief
description of an early test of personality.
Response Sets
A response set is present when test takers respond in a manner that misrepresents their true characteristics.

Response biases or response sets are test responses that misrepresent a person's true characteristics.
For example, an individual completing an employment-screening test might attempt to present an
overly positive image by answering all of the questions in the most socially appropriate manner
possible, even if these responses do not accurately represent the person. On the other hand, a teacher who is
hoping to have a disruptive student transferred from his or her class might be inclined to
exaggerate the student’s misbehavior in order to hasten that student’s removal. In both of
these situations the individual completing the test or scale responded in a manner that
systematically distorted reality. Response sets can be present when completing maximum
performance tests. For example, an individual with a pending court case claiming neuro-
logical damage resulting from an accident might “fake bad” on an intelligence test in an
effort to substantiate the presence of brain damage and enhance his or her legal case. How-
ever, response sets are an even bigger problem on typical performance tests. Because many
of the constructs measured by typical performance tests (e.g., personality, behavior, attitudes,
beliefs) have dimensions that may be seen as either socially "desirable" or "undesirable," the
tendency to employ a response set is heightened.

Response sets are a ubiquitous problem in personality assessment.
SPECIAL INTEREST TOPIC 14.1
Sir Francis Galton (1884) related a tale attributed to Benjamin Franklin about a crude personality
test. Franklin describes two types of people, those who are optimistic and focus on the positive and
those who are pessimistic and focus on the negative. Franklin reported that one of his philosophical
friends desired a test to help him identify and avoid people who were pessimistic, offensive, and
prone to acrimony.
In order to discover a pessimist at first sight, he cast about for an instrument. He of course possessed
a thermometer to test heat, and a barometer to tell the air-pressure, but he had no instrument to test
the characteristic of which we are speaking. After much pondering he hit upon a happy idea. He
chanced to have one remarkably handsome leg, and one that by some accident was crooked and de-
formed, and these he used for the purpose. If a stranger regarded his ugly leg more than his handsome
one he doubted him. If he spoke of it and took no notice of the handsome leg, the philosopher deter-
mined to avoid his further acquaintance. Franklin sums up by saying, that every one has not this
two-legged instrument, but every one with a little attention may observe the signs of a carping and
fault-finding disposition. (pp. 9-10)
Source: This tale was originally reported by Sir Francis Galton (1884). Galton’s paper was reproduced in Good-
stein & Lanyon (1971).
When response sets are present, the validity of the test results may be compromised because they introduce construct-
irrelevant error to test scores (e.g., AERA et al., 1999). That is, the test results do not
accurately reflect the construct the test was designed to measure. To
combat this, many typical performance tests incorporate some type of validity scale designed to
detect the presence of response sets. Validity scales take different forms, but the general principle
is that they are designed to detect individuals who are not responding in an accurate manner.

Validity scales are designed to detect the presence of response bias.

Special Interest Topic 14.2 provides an example of a "fake good" re-
sponse set. In the last several decades, personality scale authors have devised many types
of so-called validity scales to detect a dozen or more response sets.
SPECIAL INTEREST TOPIC 14.2

Self-report inventories, despite the efforts of test developers, always remain susceptible to response
sets. The following case is an authentic example. In this case the Behavior Assessment System for
Children, Self-Report of Personality (BASC-SRP) was utilized.
Maury was admitted to the inpatient psychiatric unit of a general hospital with the diagnoses
of impulse control disorder and major depression. She is repeating the seventh grade this school year
because she failed to attend school regularly last year. When skipping school, she spent time roaming
the local shopping mall or engaging in other relatively unstructured activities. She was suspended
from school for lying, cheating, and arguing with teachers. She failed all of her classes in both se-
mesters of the past school year.
Maury’s responses to the diagnostic interview suggested that she was trying to portray herself
in a favorable light and not convey the severity of her problems. When asked about hobbies, for
example, she said that she liked to read. When questioned further, however, she could not name a
book that she had read.
Maury’s father reported that he has been arrested many times. Similarly, Maury and her sisters
have been arrested for shoplifting. Maury’s father expressed concern about her education. He said that
Maury was recently placed in an alternative education program designed for youth offenders.
Maury’s SRP results show evidence of a social desirability or fake good response set. All of
her clinical scale scores were lower than the normative T-score mean of 50 and all of her adaptive
scale scores were above the normative mean of 50. In other words, the SRP results suggest that
Maury is optimally adjusted, which is in stark contrast to the background information obtained.
Maury's response set, however, was identified by the Lie scale of the SRP, where she obtained
a score of 9, which is on the border of the caution and extreme caution ranges. The following table
shows her full complement of SRP scores.
Source: Clinical Assessment of Child and Adolescent Personality and Behavior (2nd ed.) (Box 6.1, p. 99), by
R. W. Kamphaus and P. J. Frick, 2002, Boston: Allyn & Bacon. Copyright 2002 by Pearson Education. Reprinted
with permission.
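Maury's profile illustrates the telltale "fake good" pattern: every clinical scale T score below the normative mean of 50 and every adaptive scale score above it. The sketch below shows a simple screening check for that pattern; it is a heuristic illustration only, not the SRP's actual Lie scale or validity-index algorithm, and the scores shown are hypothetical.

```python
def fake_good_pattern(clinical_t, adaptive_t, mean=50):
    """Flag a possible 'fake good' response set when every clinical scale
    T score falls below the normative mean and every adaptive scale score
    falls above it. A heuristic illustration, not the SRP's Lie scale."""
    return all(t < mean for t in clinical_t) and all(t > mean for t in adaptive_t)

# Hypothetical T scores resembling the pattern described for Maury
clinical = {"Anxiety": 41, "Depression": 39, "Social Stress": 43}
adaptive = {"Self-Esteem": 58, "Interpersonal Relations": 61}
print(fake_good_pattern(clinical.values(), adaptive.values()))  # True
```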
TABLE 14.1 Ten Most Popular Tests of Child Personality and Behavior
Note: BASC = Behavior Assessment System for Children. The Conners Rating Scales—
Revised and Sentence Completion Tests actually were tied. Based on a national sample of
school psychologists (Livingston et al., 2003).
a student’s interest in different career options. However, we will be limiting our discussion
primarily to tests used in assessing children and adolescents with emotional and behavioral
disorders. To this end, we will briefly describe behavior rating scales, self-report measures,
and projective techniques in the following sections.
The scale will then present a series of item stems for which the informant rates the child.
For example:
feelings and behaviors due to a number of factors such as limited insight or verbal abilities
or, in the context of self-report tests, limited reading ability. However, when using behavior
rating scales, information is solicited from the important adults in a child’s life. Ideally these
adult informants will have had adequate opportunities to observe the child in a variety of
settings over an extended period of time. Behavior rating scales also represent a cost-effec-
tive and time-efficient method of collecting assessment information. For example, a clini-
cian may be able to collect information from both parents and one or more teachers with a
minimal investment of time. Most popular behavior rating scales have separate inventories
for parents and teachers. This allows the clinician to collect information from multiple in-
formants who observe the child from different perspectives and in various settings. Behav-
ior rating scales can also help clinicians assess the presence of rare behaviors. Although any
responsible clinician will interview the child, parents, and hopefully teachers, it is still pos-
sible to miss important indicators of behavioral problems. The use of well-designed behav-
ior rating scales may help detect the presence of rare behaviors, such as fire setting and
animal cruelty, that might be missed in a clinical interview.
There are some limitations associated with the use of behavior rating scales. Even
though the use of adult informants to rate children provides some degree of objectivity, these
scales are still subject to response sets that may distort the true characteristics of the child.
For example, as a “cry for help” a teacher may exaggerate the degree of a student’s prob-
lematic behavior in hopes of hastening a referral for special education services. Accord-
ingly, parents might not be willing or able to acknowledge their child has significant
emotional or behavioral problems and tend to underrate the degree and nature of problem
behaviors. Although behavior rating scales are particularly useful in diagnosing “external-
izing” problems such as aggression and hyperactivity, which are easily observed by adults,
they are less helpful when assessing “internalizing” problems such as depression and anxi-
ety, which are not as apparent to observers.
Over the past two decades, behavior rating scales have gained popularity and become increasingly
important in the psychological assessment of children and adolescents (Livingston et al., 2003).
It is common for a clinician to have both parents and teachers complete behavior rating scales for
one child. This is desirable because parents and teachers have the opportunity to observe the child
in different settings and can contribute unique yet complementary information to
the assessment process. As a result, school psychologists will frequently ask classroom
teachers to help with student evaluations by completing behavior rating scales on one of
their students. Next we will briefly review some of the most popular scales.
used behavior rating scales in the public schools today (Livingston et al., 2003). Information
obtained from the publisher estimates the BASC was used with more than 1 million children
in the United States alone in 2003. By 2006, this estimate had grown to 2 million children
per year. The TRS and PRS are appropriate for children from 2 to 21 years. Both the TRS
and PRS provide item stems to which the informant responds Never, Sometimes, Often, or
Almost Always. The TRS is designed to provide a thorough examination of school-related
behavior whereas the PRS is aimed at the home and community environment (Ramsay,
Reynolds, & Kamphaus, 2002). In 2004, Reynolds and Kamphaus released the second edi-
tion of the BASC, known as the BASC-2, with updated scales and normative samples. Table
14.2 depicts the 5 composite scales, 16 primary scales, and 7 content scales for all the pre-
school, child, and adolescent versions of both instruments. Reynolds and Kamphaus (2004)
describe the individual primary subscales of the TRS and PRS as follows:
New to the BASC-2 are the content scales, so called because their interpretation is
driven more by item content than actuarial or predictive methods. These scales are intended
for use by advanced-level clinicians to help clarify the meaning of the primary scales and
as an additional aid to diagnosis.
In addition to these individual scales, the TRS and PRS provide several different
composite scores. The authors recommend that interpretation follow a “top-down” ap-
proach, by which the clinician starts at the most global level and progresses to more specific
levels (e.g., Reynolds & Kamphaus, 2004). The most global measure is the Behavioral
Symptoms Index (BSI), which is a composite of the Aggression, Attention Problems, Anx-
iety, Atypicality, Depression, and Somatization scales. The BSI reflects the overall level of
TABLE 14.2 Composites, Primary Scales, and Content Scales in the TRS and PRS
(Each scale appears on the TRS, the PRS, or both, at one or more of the age levels 2-5, 6-11,
and 12-21; the form- and age-level markers in the original table could not be reproduced.)

Composite: Adaptive Skills, Behavioral Symptoms Index, Externalizing Problems, Internalizing
Problems, School Problems (TRS only)

Primary Scale: Activities of Daily Living, Adaptability, Aggression, Anxiety, Attention Problems,
Atypicality, Conduct Problems, Depression, Functional Communication, Hyperactivity, Leadership,
Learning Problems, Social Skills, Somatization, Study Skills, Withdrawal

Content Scale: Anger Control, Bullying, Developmental Social Disorders, Emotional Self-Control,
Executive Functioning, Negative Emotionality, Resiliency
behavioral problems and provides the clinician with a reliable but nonspecific index of pa-
thology. For more specific information about the nature of the problem behavior, the clini-
cian proceeds to the four lower-order composite scores:
■ School Problems. This composite consists of the Attention Problems and Learning
Problems scales. High scores on this scale suggest academic motivation, attention, and
learning difficulties that are likely to hamper academic progress. This composite is available
only for the BASC-TRS.
The third level of analysis involves examining the 16 clinical (e.g., Hyperactivity,
Depression) and adaptive scales (e.g., Leadership, Social Skills). Finally, clinicians will
often examine the individual items. Although individual items are often unreliable, when
interpreted cautiously they may provide clinically important information. This is particu-
larly true of what is often referred to as “critical items.” Critical items, when coded in a
certain way, suggest possible danger to self or others. For example, if a parent or teacher
reports that a child often “threatens to harm self or others,” the clinician would want to de-
termine whether these statements indicate imminent danger to the child or others.
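One way to make the top-down sequence concrete is to represent the composite-to-scale structure as a small data hierarchy that a scoring or report-writing program could walk. The sketch below encodes only the two composites whose scale membership is given in the text, and the example T scores are hypothetical; actual interpretation relies on the normed T score of each composite itself, so flagging elevated member scales here is purely a structural illustration.

```python
# A structural sketch of the BASC-2 TRS/PRS "top-down" interpretation
# sequence. Only the two composites whose membership is given in the text
# are encoded; the example scores are hypothetical.
basc_hierarchy = {
    "Behavioral Symptoms Index": ["Aggression", "Attention Problems", "Anxiety",
                                  "Atypicality", "Depression", "Somatization"],
    "School Problems (TRS only)": ["Attention Problems", "Learning Problems"],
}

def walk_top_down(t_scores, hierarchy, at_risk=60, clinical=70):
    """Report elevated member scales for each composite, most global first."""
    for composite, scales in hierarchy.items():
        elevated = {s: t_scores[s] for s in scales if t_scores.get(s, 50) >= at_risk}
        label = ("Clinically Significant" if any(t >= clinical for t in elevated.values())
                 else "At-Risk" if elevated else "Average")
        print(f"{composite}: {label} {elevated}")

scores = {"Aggression": 72, "Attention Problems": 64, "Anxiety": 55,
          "Atypicality": 48, "Depression": 61, "Somatization": 44,
          "Learning Problems": 58}
walk_top_down(scores, basc_hierarchy)
```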
When interpreting the Clinical Composites and Scale scores, high scores reflect ab-
normality or pathology. The authors provide the following classifications: T-score > 70 is
Clinically Significant; 60-69 is At-Risk; 41-59 is Average; 31-40 is Low; and <30 is Very
Low. Scores on the Adaptive Composite and Scales are interpreted differently, with high
scores reflecting adaptive or positive behaviors. The authors provide the following classifi-
cations: T-score > 70 is Very High; 60-69 is High; 41-59 is Average; 31-40 is At-Risk; and
<30 is Clinically Significant. Computer software is available to facilitate scoring and inter-
pretation, and the use of this software is recommended because hand scoring can be chal-
lenging for new users. An example of a completed TRS profile is depicted in Figure 14.1.
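The classification rules just listed are easy to capture in a small function. The sketch below follows the cutoffs given in the text; treating scores of exactly 70 or 30 as falling in the more extreme category is an assumption about boundary handling rather than something stated in the text.

```python
def classify_t_score(t, scale_type="clinical"):
    """Map a BASC-2 T score to the classification labels given in the text.
    On clinical scales high scores reflect problems; on adaptive scales
    high scores reflect strengths. Scores of exactly 70 or 30 are placed
    in the more extreme category here (a boundary-handling assumption)."""
    if scale_type == "clinical":
        bands = [(70, "Clinically Significant"), (60, "At-Risk"), (41, "Average"), (31, "Low")]
        floor_label = "Very Low"
    else:
        bands = [(70, "Very High"), (60, "High"), (41, "Average"), (31, "At-Risk")]
        floor_label = "Clinically Significant"
    for cutoff, label in bands:
        if t >= cutoff:
            return label
    return floor_label

print(classify_t_score(72))                         # Clinically Significant
print(classify_t_score(65))                         # At-Risk
print(classify_t_score(35, scale_type="adaptive"))  # At-Risk
```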
The TRS and PRS have several unique features that promote their use. First, they
contain a validity scale that helps the clinician detect the presence of response sets. As noted
previously, validity scales are specially developed and incorporated in the test for the pur-
pose of detecting response sets. Both the parent and teacher scales contain a “fake bad” (F)
[FIGURE 14.1 Example of a completed BASC-2 TRS Clinical Profile, plotting T scores for the clinical scales and composites (graphic not reproducible).]
index that is elevated when an informant excessively rates maladaptive items as Almost al-
ways and adaptive items as Never. If this index is elevated, the clinician should consider the
possibility that a negative response set has skewed the results. Another unique feature of
these scales is that they assess both negative and adaptive behaviors. Before the advent of
the BASC, behavior rating scales were often criticized for focusing only on negative behav-
iors and pathology. Both the TRS and PRS address this criticism by assessing a broad
spectrum of behaviors, both positive and negative. The identification of positive character-
istics can facilitate treatment by helping identify strengths to build on. Still another unique
feature is that the TRS and PRS provide three norm-referenced comparisons that can be
selected depending on the clinical focus. The child’s ratings can be compared to a general
national sample, a gender-specific national sample, or a national clinical sample composed
of children who have a clinical diagnosis and are receiving treatment. In summary, the
BASC-2 PRS and BASC-2 TRS are psychometrically sound instruments that have gained
considerable support in recent years.
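The F index described above is essentially an infrequency count. The sketch below illustrates the general idea, tallying maladaptive items rated Almost always and adaptive items rated Never; the item content and the cutoff are made up for illustration and do not reproduce the BASC-2's actual F-index items or thresholds.

```python
def f_index(ratings, maladaptive_items, adaptive_items):
    """Tally improbably negative responses: maladaptive items rated
    'Almost always' plus adaptive items rated 'Never'."""
    count = sum(1 for item in maladaptive_items if ratings.get(item) == "Almost always")
    count += sum(1 for item in adaptive_items if ratings.get(item) == "Never")
    return count

# Hypothetical item stems and cutoff, for illustration only.
maladaptive = ["hits other children", "threatens others", "destroys property"]
adaptive = ["follows directions", "shares with peers"]
ratings = {"hits other children": "Almost always", "threatens others": "Almost always",
           "destroys property": "Often", "follows directions": "Never",
           "shares with peers": "Sometimes"}
score = f_index(ratings, maladaptive, adaptive)
print(score, "-> consider a negative response set" if score >= 3 else "-> acceptable")
```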
The Conners Rating Scales-Revised (CRS-R) produces two index scores, the ADHD Index and the Conners Global
Index (CGI). The ADHD Index is a combination of items that has been found to be useful
in identifying children who have attention deficit/hyperactivity disorder (ADHD). The CGI,
a more general index, is sensitive to a variety of behavioral and emotional problems. The
CGI (formerly the Hyperactivity Index) has been shown to be a sensitive measure of medi-
cation (e.g., psychostimulants such as Ritalin) treatment effects with children with ADHD.
Computer-scoring software is available for the CRS-R to facilitate scoring and interpreta-
tion. Specific strengths of the CRS-R include its rich clinical history and the availability of
short forms that may be used for screening purposes or situations in which repeated admin-
istrations are necessary (e.g., measuring treatment effects; Kamphaus & Frick, 2002).
about the child’s activities and competencies in areas such as recreation (e.g., hobbies and
sports), social functioning (e.g., clubs and organizations), and schooling (e.g., grades). The
second section assesses problem behaviors and contains item stems describing problem be-
haviors. On these items the informant records a response of Not true, Somewhat true/Some-
times true, or Very true/Often true. The clinical subscales of the CBCL and TRF are
Computer-scoring software is available for the CBCL and TRF and is recommended
because hand scoring is a fairly laborious and time-consuming process. The CBCL and
TRF have numerous strengths that continue to make them popular among school psy-
chologists and other clinicians. They are relatively easy to use, are time efficient (when
using the computer-scoring program), and have a rich history of clinical and research ap-
plications (Kamphaus & Frick, 2002).
The BASC-2 TRS and PRS, the CBCL and TRF, and the CRS-R are typically referred
to as omnibus rating scales. This indicates that they measure a wide range of symptoms and
behaviors that are associated with different emotional and behavioral disorders. Ideally an
omnibus rating scale should be sensitive to symptoms of both internalizing (e.g., anxiety,
depression) and externalizing (e.g., ADHD, conduct) disorders to ensure that the clinician
is not missing important indicators of psychopathology. This is particularly important when
assessing children and adolescents because there is a high degree of comorbidity with this
population. Comorbidity refers to the presence of two or more disorders occurring simulta-
neously in the same individual. For example, a child might meet the criteria for both an
externalizing disorder (e.g., conduct disorder) and an internalizing disorder (e.g., depressive
disorder). However, if a clinician did not adequately screen for internalizing symptoms, the
more obvious externalizing symptoms might mask the internalizing symptoms and result in
Self-Report Measures
A self-report measure is an instrument completed by individuals that allows them to describe
their own subjective experiences, including emotional, motivational, interpersonal, and attitudinal
characteristics (e.g., Anastasi & Urbina, 1997). Although the use of self-report measures has a
long and rich history with adults, their use with children is a relatively new development because
it was long believed that children did not have the personal insights necessary to understand and
accurately report their subjective experiences. To further complicate the situation, skeptics noted
that young children typically do not have the reading skills necessary to complete written
self-report tests (e.g., Kamphaus & Frick, 2002). However, numer-
ous self-report measures have been developed and used successfully with children and ado-
lescents. Although insufficient reading skills do make these instruments impractical with
very young children, these new self-report measures are being used with older children
(e.g., older than 7 years) and adolescents with considerable success. Self-report measures have proven
to be particularly useful in the assessment of internalizing disorders such as depression and
anxiety that have symptoms that are not always readily apparent to observers. The develop-
ment and use of self-report measures with children are still at a relatively early stage, but
several instruments are gaining widespread acceptance. We will now briefly describe some
of the most popular child and adolescent self-report measures.
(BASC-2) we introduced earlier, and recent research suggests it is the most popular self-report
measure among school psychologists. There are three forms of the SRP, one for children 8 to
11 years and one for adolescents 12 to 18 years. A third version, the SRP-I (for interview) is
standardized as an interview version for ages 6 and 7 years. The SRP has an estimated 3rd-
grade reading level, and if there is concern about the student’s ability to read and comprehend
the material, the instructions and items can be presented using audio. The SRP contains brief
descriptive statements that children or adolescents mark as True or False for some questions, or as
Never, Sometimes, Often, or Almost Always for others, according to how each statement applies to them. Table 14.3
depicts the 5 composites along with the 10 primary and content scales available for children
and adolescents. Reynolds and Kamphaus (1992) describe the subscales as follows:
The SRP produces five composite scores. The most global composite is the Emotional
Symptoms Index (ESI) composed of the Anxiety, Depression, Interpersonal Relations, Self-
Esteem, Sense of Inadequacy, and Social Stress scales. The ESI is an index of global psy-
chopathology, and high scores usually indicate serious emotional problems. The four
lower-order composite scales are
■ Inattention/Hyperactivity. This scale combines the Attention Problems and the Hy-
peractivity scales to form a composite reflecting difficulties with the self-regulation of be-
havior and ability to attend and concentrate in many different settings.
■ Internalizing Problems. This is a combination of the Anxiety, Atypicality, Locus of
Control, Social Stress, and Somatization scales. This scale reflects the magnitude of inter-
nalizing problems, and clinically significant scores (i.e., T-scores > 70) suggest significant
problems.
TABLE 14.3 Composites, Primary Scales, and Content Scales in the SRP

(The SRP has a child form for ages 8-11 and an adolescent form for ages 12-21. Scales marked
"A only" appear only on the adolescent form; the remaining form markers in the original table
could not be fully reproduced.)

Composite: Emotional Symptoms Index, Inattention/Hyperactivity, Internalizing Problems,
Personal Adjustment, School Problems

Primary Scale: Anxiety, Attention Problems, Attitude to School, Attitude to Teachers, Atypicality,
Depression, Hyperactivity, Interpersonal Relations, Locus of Control, Self-Esteem, Self-Reliance,
Sensation Seeking (A only), Sense of Inadequacy, Social Stress, Somatization (A only)

Content Scale: Anger Control, Ego Strength, Mania, Test Anxiety
appropriately to make them sound as though they are simply part of an interview. The child’s
responses are then scored according to objective criteria. Another positive aspect of the SRP
is that it covers several dimensions or areas that are important to children and adolescents,
but have been neglected in other child self-report measures (e.g., attitude toward teachers and
school). Finally, the SRP assesses both clinical and adaptive dimensions. This allows the
clinician to identify not only problem areas but also areas of strength to build on.
Problems). This close correspondence with the CBCL and TRF is one of the strengths of the
YSR. Additionally, the YSR has an extensive research base that facilitates clinical interpre-
tations and computer-scoring software that eases scoring. The YSR has a strong and loyal
following and continues to be a popular instrument used in school settings.
As with behavior rating scales, self-report measures come in omnibus and single-
domain formats. Both the SRP and YSR are omnibus self-report measures. An example of
a single-domain self-report measure is the Children’s Depression Inventory (CDI; Kovacs,
1991). The CDI is a brief, 27-item self-report inventory designed for use with children be-
tween 7 and 17 years. It presents a total score as well as five factor scores: Negative Mood,
Interpersonal Problems, Ineffectiveness, Anhedonia (loss of pleasure from activities that
previously brought pleasure), and Negative Self-Esteem. The CDI is easily administered
and scored, is time efficient and inexpensive, and has an extensive research database. As
with the other single-domain measures, the CDI does not provide coverage of a broad range
of psychological disorders or personality characteristics, but it does give a fairly in-depth
assessment of depressive symptoms.
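Single-domain inventories such as the CDI are typically scored by summing item responses into factor scores and a total raw score, which is then converted to a normed score. The sketch below shows that general scoring pattern with a made-up item-to-factor map and response coding; it does not reproduce the CDI's actual items, factor assignments, or norms.

```python
# Generic scoring sketch for a single-domain self-report inventory. The
# item-to-factor assignments and the 0-2 response coding are hypothetical;
# they do not reproduce the CDI's actual items, factors, or norms.
factor_map = {
    "Negative Mood":          [1, 2, 3],
    "Interpersonal Problems": [4, 5],
    "Ineffectiveness":        [6, 7],
    "Anhedonia":              [8, 9, 10],
    "Negative Self-Esteem":   [11, 12],
}

def score_inventory(responses, factor_map):
    """Sum 0-2 item responses into factor scores and a total raw score."""
    factor_scores = {factor: sum(responses[i] for i in items)
                     for factor, items in factor_map.items()}
    return factor_scores, sum(responses.values())

responses = {i: (i % 3) for i in range(1, 13)}   # dummy responses coded 0-2
factors, total = score_inventory(responses, factor_map)
print(factors)
print("Total raw score:", total)
```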
Projective Techniques
Although some experts have expressed reservations about the use of projective techniques, they
continue to play a prominent role in the assessment of children and adolescents.

The debate over the use of projective techniques has been going on for decades. Although there
is evidence of diminished use of projective techniques in the assessment of children and
adolescents, these techniques are still popular and used in schools. For example, a national survey
of psychological assessment procedures used by school psychologists indicates that four of the ten
most popular procedures for assessing personality are projective techniques (Livingston et al.,
2003). This debate is apt to continue, but it is highly likely that projectives will continue to play a
prominent role in the assessment of children and adolescents for the foreseeable future. Next we
will briefly describe a few of the major projective techniques used with children and adolescents.
Projective Drawings
Some of the most popular projective techniques used with children and adolescents involve the
interpretation of projective drawings. This popularity is usually attributed to two factors. First,
young children with limited verbal abilities are hampered in their ability to respond to clinical
interviews, objective self-report measures, and even most other projective techniques. However,
these young children can produce drawings because this activity is largely nonverbal. Second,
because children are usually familiar with and enjoy drawing, this technique provides a
nonthreatening "child-friendly" approach to assessment (Finch & Belter, 1993; Kamphaus &
Frick, 2002). There are several different projective drawing techniques in use today.
Draw-A-Person Test (DAP). The Draw-A-Person Test (DAP) is the most widely used
projective drawing technique. The child is given a blank sheet of paper and a pencil and asked
to draw a whole person. Although different scoring systems have been developed for the
DAP, no system has received universal approval. The figure in the drawing is often inter-
preted as a representation of the “self.” That is, the figure reflects how children feel about
themselves and how they feel as they interact with their environment (Handler, 1985).
Kinetic Family Drawing (KFD). With the Kinetic Family Drawing (KFD), children
are given paper and pencil and asked to draw a picture of everyone in their family, including
themselves, doing something (hence the term kinetic). After completing the drawing the
children are asked to identify each figure and describe what each one is doing. The KFD is
thought to provide information regarding the children’s view of their family and their inter-
actions (Finch & Belter, 1993).
Pro
■ Less structured format allows clinician greater flexibility in administration and interpretation
and places fewer demand characteristics that would prompt socially desirable responses from
informant.
■ Allows for the assessment of drives, motivations, desires, and conflicts that can affect a person's
perceptual experiences but are often unconscious.
■ Provides a deeper understanding of a person than would be obtained by simply describing
behavioral patterns.

Con
■ The reliability of many techniques is questionable. As a result, the interpretations are more
related to characteristics of the clinician than to characteristics of the person being tested.
■ Even some techniques that have good reliability have questionable validity, especially in
making diagnoses and predicting overt behavior.
■ Although we can at times predict things we cannot understand, it is rarely the case that
understanding does not enhance prediction (Gittelman-Klein, 1986).
Source: Clinical Assessmentdts Child and Adolescent Personality and Behavior (2nd ed.) (Table 11.1, p. 231)
by R. W. Kamphaus and P. J. Frick, 2002, Boston: Allyn & Bacon. Copyright 2002 by Pearson Education.
Adapted with permission.
Despite their popularity and appeal to clinicians, little empirical data support the use of
projective drawings as a means of predicting behavior or classifying children by diagnostic
type (e.g., depressed, anxious, conduct disordered, etc.). These techniques may provide a
nonthreatening way to initiate the assessment process and an opportunity to develop rapport,
but otherwise they should be used with considerable caution and an understanding of their
technical limitations (Finch & Belter, 1993; Kamphaus & Frick, 2002).
Sentence Completion Tests

With sentence completion tests the child is presented with a series of incomplete sentence stems to finish, and there are different ways of interpreting the results. Because incomplete-sentence stems pro-
vide more structure than most projective tasks (e.g., drawings or inkblots), some have ar-
gued that they are not actually “projective” in nature, but are more or less a type of structured
interview. As a result, some prefer the term semiprojective to characterize these tests. Re-
gardless of the classification, relatively little empirical evidence documents the psycho-
metric properties of these tests (Kamphaus & Frick, 2002). Nevertheless, they remain
popular, are nonthreatening to children, and in the hands of skilled clinicians may provide
an opportunity to enhance their understanding of their clients.
Apperception Tests
Another type of projective technique used with children is apperception tests. With this
technique the child is given a picture and asked to make up a story about it. Figure 14.3
depicts a picture similar to those in some apperception tests used with older children and
adolescents. These techniques are also sometimes referred to as thematic or storytelling
techniques. Like other projective techniques, children generally find apperception tests in-
viting and enjoyable. Two early apperception tests, the Thematic Apperception Test (TAT)
and the Children’s Apperception Test (CAT), have received fairly widespread use with chil-
dren and adolescents. Like other projective techniques, limited empirical evidence supports
the use of the TAT or CAT. A more recently developed apperception test is the Roberts Ap-
perception Test for Children (RATC; McArthur & Roberts, 1982), which uniquely features
the inclusion of a standardized scoring system and normative data. The standardized scoring
approach results in increased reliability relative to previous apperception tests. However, the
normative data are inadequate and there is little validity evidence available (Kamphaus &
Frick, 2002). Nevertheless, the RATC is a step in the right direction in terms of enhancing
the technical qualities of projective techniques.
Inkblot Techniques
The final projective approach we will discuss is the inkblot technique. With this technique
the child is presented an ambiguous inkblot and asked to interpret it in some manner, typically
by asking: “What might this be?” Figure 14.4 presents an example of an inkblot similar to
those used on inkblot tests. Of all the inkblot techniques, the Rorschach is the most widely
used. Different interpretative approaches have been developed for the Rorschach, but the
Exner Comprehensive System (Exner, 1974, 1978) has received the most attention by clini-
cians and researchers in recent years. The Exner Comprehensive System provides an elaborate
standardized scoring system that produces approximately 90 possible scores. Relative to other
Rorschach interpretive systems, the Exner system produces more reliable measurement and
has reasonably adequate normative data. However, evidence of validity is limited, and many of the scores and indexes that were developed with adults have not proven effective with children (Kamphaus & Frick, 2002).

Because there is little empirical data supporting the use of projective techniques as a means of understanding personality or predicting behavior, they should be used with caution.

In summary, in spite of relatively little empirical evidence of their utility, projective techniques continue to be popular among psychologists and other clinicians. Our recommendation is to use these instruments cautiously. They should not be used for making important educational, clinical, and diagnostic decisions, but they may have merit in introducing the child to the assessment process, establishing rapport, and developing hypotheses that can be pursued with more technically adequate assessment techniques.

FIGURE 14.3 A Picture Similar to Those Used on Apperception Tests. Source: From Robert J. Gregory, Psychological Testing: History, Principles, and Applications, 3/e. Published by Allyn & Bacon, Boston, MA. Copyright © 2004 by Pearson Education. Reprinted by permission of the publisher.

FIGURE 14.4 An Inkblot Similar to Those Used on Inkblot Tests. Source: From Robert J. Gregory, Psychological Testing: History, Principles, and Applications, 3/e. Published by Allyn & Bacon, Boston, MA. Copyright © 2004 by Pearson Education. Reprinted by permission of the publisher.
Summary
In this chapter we focused on tests of behavior and personality and their applications in the
schools. We noted that Public Law 94-142 and subsequent legislation require that public
schools provide special education and related services to students with emotional disorders.
Before these services can be provided, the schools must be able to identify children with
these disorders. The process of identifying these children often involves a psychological
evaluation completed by a school psychologist or other clinician. Teachers often play an
important role in this assessment process. For example, teachers often complete rating
scales that describe the behavior of students in their class. Teachers are also often involved
in the development and implementation of educational programs for these special needs
students. As a result, it is beneficial for teachers to be familiar with the types of instruments
used to identify students with emotional and behavioral problems.
We noted the three major types of instruments used in assessing personality and behavior in children and adolescents: behavior rating scales, self-report measures, and projective techniques. Although projective techniques have received limited support in the empirical literature, they continue to be among the most popular approaches to assessing the personal-
ity of children and adolescents. Our position is that although projective techniques should
not be used as the basis for making important educational, clinical, or diagnostic decisions,
they may have merit in developing rapport with clients and in generating hypotheses that
can be pursued using technically superior assessment techniques.
KEY TERMS

Apperception tests, p. 391
Behavior Assessment System for Children—Parent Rating Scale (PRS), p. 376
Behavior Assessment System for Children—Self-Report of Personality (SRP), p. 383
Behavior Assessment System for Children—Teacher Rating Scale (TRS), p. 376
Behavior rating scale, p. 375
Child Behavior Checklist (CBCL), p. 381
Conners Rating Scales—Revised (CRS-R), p. 381
Draw-A-Person Test (DAP), p. 389
House-Tree-Person (H-T-P), p. 389
Inkblot technique, p. 391
Kinetic Family Drawing (KFD), p. 389
Personality, p. 371
Projective drawings, p. 389
Projective techniques, p. 388
Public Law 94-142 / IDEA, p. 371
Response sets, p. 372
Self-report measure, p. 383
Sentence completion tests, p. 390
Teacher Report Form (TRF), p. 381
Typical response tests, p. 371
Validity scale, p. 373
Youth Self-Report (YSR), p. 387
RECOMMENDED READINGS
Kamphaus, R. W., & Frick, P. J. (2002). Clinical assessment Personality, behavior, and context. New York: Guilford
of child and adolescent personality and behavior. Bos- Press. This is another excellent source providing thor-
ton: Allyn & Bacon. This text provides comprehensive ough coverage of the major behavioral and personality
coverage of the major personality and behavioral assess- assessment techniques used with children. Particularly
ment techniques used with children and adolescents. It good for those interested in a more advanced discussion
also provides a good discussion of the history and cur- of these instruments and techniques.
rent use of projective techniques.
Reynolds, C. R., & Kamphaus, R. W. (2003). Handbook of
psychological and educational assessment of children:
Assessment Accommodations
CHAPTER HIGHLIGHTS
LEARNING OBJECTIVES
10. Identify and give examples of modifications of the response format that might be appropriate
for students with disabilities.
11. Identify and give examples of modifications of timing that might be appropriate for students
with disabilities.
12. Identify and give examples of modifications of the setting that might be appropriate for
students with disabilities.
13. Identify and give examples of adaptive devices and supports that might be appropriate for
students with disabilities.
14. Describe and give examples illustrating the use of limited portions of an assessment or an
alternate assessment for a student with a disability.
15. Identify and explain the reasoning behind the major principles for determining which
assessment accommodations to provide.
16. Briefly describe the current status of research on the selection of assessment
accommodations.
17. Describe the use of assessment accommodations for English Language Learners.
18. Discuss the controversy regarding the reporting of results of modified assessments.
So far in this text we have emphasized the importance of strictly adhering to standard
assessment procedures when administering tests and other assessments. This is neces-
sary to maintain the reliability and validity of score interpretations.
Standard assessment procedures may not be appropriate for a student with a disability if the assessment requires the student to use some ability that is affected by the disability, but is irrelevant to the construct being measured.

However, at times it is appropriate to deviate from these standard procedures. Standard assessment procedures may not be appropriate for students with a disability if the assessment requires the students to use some ability (e.g., sensory, motor, language, etc.) that is affected by their disability, but is irrelevant to the construct being measured. To address this, teachers and others involved in assessment may need to modify standard assessment procedures to accommodate the special needs of students with disabilities. In this context, the Standards (AERA et al., 1999) note that assessment
accommodations are changes in the standard assessment proce-
dures that are implemented in order to minimize the impact of student characteristics that
are irrelevant to the construct being measured by the assessment. The Standards go on to
state that the goal of accommodations is to provide the most valid and accurate measure-
ment of the construct of interest. As framed by the U.S. Department of Education (1997),
“Assessment accommodations help students show what they know without being placed
at a disadvantage by their disability” (p. 8). For example, consider a test designed to as-
sess a student’s knowledge of world history. A blind student would not be able to read the
material in its standard printed format, but if the student could read Braille, an appropri-
ate accommodation would be to convert the test to the Braille format. In this example, it
is important to recognize that reading standard print is incidental to the construct being
measured. That is, the test was designed to measure the student’s knowledge of world
history, not the ability to read standard print. An important consideration when selecting
accommodations is that we only want to implement accommodations that preserve the
reliability of test scores and the inferences about the meaning of performance on the test
(U.S. Department of Education, 1997).
More and more often, teachers are being called on to modify their assessments in order to accommodate the special needs of students with disabilities.

More and more often, lawmakers are writing laws that mandate assessment accommodations for students with disabilities. As a result, more and more often, teachers are being called on to modify their assessments in order to accommodate the special needs of students with disabilities. Major laws that address assessment accommodations include Section 504 of the Rehabilitation Act of 1973 (Section
504), The Americans with Disabilities Act (ADA), The No Child
Left Behind Act of 2001 (NCLB), and the Individuals with Disabilities Education Improve-
ment Act (IDEA 2004). In the next section we will focus on IDEA and Section 504 and their
impact on the assessment of students with disabilities.
In 1975, Congress passed Public Law 94-142, the Education of All Handicapped Children
Act (EAHCA). This law required that public schools provide students with disabilities a
free, appropriate public education (FAPE). Prior to the passage of this law, it was estimated
that as many as one million children with disabilities were being denied a FAPE (e.g., Turn-
bull, Turnbull, Shank, Smith, & Leal, 2002). In 1986, Public Law 99-457, the Infants and
Toddlers with Disabilities Act, was passed to ensure that preschool children with disabilities
also received appropriate services. In 1990, the EAHCA was reauthorized and the name was
changed to the Individuals with Disabilities Education Act (IDEA).
These laws had a significant impact on the way students with disabilities received
educational services. The number of children with developmental disabilities in state mental
health institutions declined by almost 90%, the rate of unemployment for individuals in their
twenties with disabilities was reduced, and the number of young adults with disabilities
enrolled in postsecondary education increased. Although this was clearly a step in the right
direction, problems remained. Students with disabilities were still dropping out of school at
almost twice the rate of students without disabilities, there was concern that minority chil-
dren were being inappropriately placed in special education, and educational professionals
and parents had concerns about the implementation of the law (Kubiszyn & Borich, 2003).
To address these and other concerns, the law was updated and reauthorized in 1997 as the
Individuals with Disabilities Education Act of 1997 (IDEA 97) and again in 2004 as the
Individuals with Disabilities Education Improvement Act of 2004 (IDEA 2004).
Entire books have been written on IDEA and its impact on the public schools, and it is
not our intention to cover this law and its impact in great detail. Because this is a textbook on
educational assessment for teachers, we will be limiting our discussion to the effect of IDEA
on the assessment practices of teachers. In this context, probably the greatest effect of IDEA has been its requirement that schools provide services to students with disabilities in the general education classroom whenever appropriate. Earlier versions of the act had required that students with disabilities receive instruction in the least restrictive environment. In actual practice students with disabilities were often segregated into resource or self-contained classrooms largely based on the belief that they would not be able to profit from instruction in regular education classrooms.

Research has shown that students with disabilities demonstrate superior educational and social gains when they receive instruction in regular education classrooms.

Educational research, however, has shown that students with disabilities demonstrate superior educational and social gains when they receive instruction in regular education classrooms (see McGregor & Vogelsberg, 1998; Stainback & Stainback,
1992). Revisions of IDEA, reflecting this research and prevailing legal and political trends,
mandated that public schools educate students with disabilities alongside students who do
not have disabilities to the maximum extent possible, an approach often referred to as inclu-
sion or mainstreaming (Turnbull et al., 2002). This extends not only to students with mild dis-
abilities but also to those with moderate and severe disabilities. The impact of this on regular
education teachers is that they have more students with disabilities in their classrooms. As
a result, regular education teachers are increasingly responsible for planning and providing
instruction to children with disabilities and for evaluating their progress. This includes help-
ing identify students with disabilities, planning their instruction and assessment, and working
with them daily in the classroom.
Central to the provision of services to students with disabilities is the individualized
educational program (IEP). The IEP is a written document developed by a committee
or team composed of the student’s parents, regular education teachers, special education
teachers, and other school personnel (e.g., school psychologists, counselors). This commit-
tee is typically referred to as the IEP committee. When appropriate, the student may be invited to participate, as may professionals representing external agencies. At a minimum
the IEP should specify the student’s present level of academic performance, identify mea-
surable annual goals and short-term objectives, specify their instructional arrangement,
and identify the special education and related services the student will receive. In terms
of assessment accommodations, the IEP should specify any modifications in classroom
tests and other assessments that are deemed necessary, and each of the student’s teachers
should have a copy of the IEP. Additionally, the IEP should identify any accommodations
that are seen as appropriate for state and districtwide assessments, including those required
by the No Child Left Behind Act (NCLB). If the IEP committee decides that the student is
unable to take the state’s regular assessment even with accommodations, the IEP can specify
that the student take an alternate assessment. These alternate assessments are designed for
students that the IEP committee determines should not be assessed based on their grade-
level curriculum.
As we noted, regular education teachers are becoming increasingly involved in teach-
ing and testing students with disabilities. Mastergeorge and Miyoshi (1999) note that as
members of the IEP committee, regular education teachers are involved in planning the instruction and assessment of students with disabilities. The involvement of regular education teachers does not stop simply with planning. They
are also primarily responsible for implementing the IEP in the classroom. It should be noted
that the services and accommodations stipulated in the IEP are not merely suggestions but
legally commit the school to provide the stated modifications, accommodations, services,
and so forth.
Learning Disabilities. A learning disability is typically identified when a student's academic achievement falls substantially below the level expected on the basis of his or her intellectual ability, a difference referred to as a severe discrepancy. The individual intelligence and achievement tests used in assessing learning disabilities are generally administered by assessment specialists with advanced graduate training in administering and interpreting these tests. Although reliance on ability-achievement discrepancies to diagnose learning disabilities is the most widely accepted methodology, it has become the focus of considerable debate in recent years, and some experts recommend dropping this approach (e.g., Fletcher et al., 2002). As this text goes to print, it appears the next revision of IDEA may drop the requirement of a discrepancy model (but continue to allow its use) in favor of an alternative approach that is still being refined.
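To make the discrepancy logic concrete, the following worked example is only a sketch: the standard-score metric (mean 100, standard deviation 15) is typical of the ability and achievement tests involved, but the specific scores and the 1.5 standard deviation cutoff are hypothetical, since the criterion for a "severe" discrepancy has varied across states and agencies.

```latex
% Illustrative ability-achievement discrepancy check.
% Assumes both tests report standard scores with mean 100 and SD 15;
% the scores (112, 84) and the 1.5 SD cutoff are hypothetical.
\[
\text{Discrepancy} = \text{Ability} - \text{Achievement} = 112 - 84 = 28 \text{ points}
\]
\[
\text{Illustrative cutoff} = 1.5 \times 15 = 22.5 \text{ points}
\]
\[
28 > 22.5 \;\Rightarrow\; \text{the difference would be considered severe under this hypothetical criterion}
\]
```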
Mental Retardation. Mental retardation typically is identified when the student scores
more than 2 standard deviations below the mean on an individualized intelligence test and
presents significant deficits in two or more areas of adaptive functioning (e.g., communica-
tion, self-care, leisure). Additionally, these deficits must be manifested before the age of
18 years (APA, 1994). Students with mental retardation comprise approximately 11% of
the special education population (Turnbull et al., 2002). The assessment of students with
mental retardation involves the administration of individual intelligence and achievement
tests by assessment professionals and also adaptive behavior scales that parents or teachers
typically complete.
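As a rough illustration of the "more than 2 standard deviations below the mean" criterion, the short calculation below assumes the common intelligence-test metric with a mean of 100 and a standard deviation of 15; a few instruments use a slightly different standard deviation, so the exact cutoff depends on the test.

```latex
% Cutoff implied by "more than 2 standard deviations below the mean"
% on an IQ metric with mean 100 and SD 15.
\[
\text{Cutoff} = \mu - 2\sigma = 100 - 2(15) = 70
\]
% A score below roughly 70, combined with deficits in two or more areas of
% adaptive functioning manifested before age 18, meets the criteria
% described in the text.
```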
Emotional Disturbance. IDEA defines an emotional disturbance as follows:

(i) The term means a condition exhibiting one or more of the following characteristics
over a long period of time and to a marked degree that adversely affects a student’s
educational performance:
(a) An inability to learn that cannot be explained by intellectual, sensory, or other
health factors.
(b) An inability to build or maintain satisfactory interpersonal relationships with peers
and teachers.
(c) Inappropriate types of behavior or feelings under normal circumstances.
(d) A general pervasive mood of unhappiness or depression.
(e) A tendency to develop physical symptoms or fears associated with personal or
school problems.
(ii) The term includes schizophrenia.
The term does not apply to children who are socially maladjusted, unless it is determined
that they have an emotional disturbance (34 C.F.R. Sec. 300.7(c)(4)).
School psychologists typically take a lead role in the identification and assessment of students
with an emotional disturbance and use many of the standardized measures of behavior and
personality discussed in Chapter 14. When there is concern that students have an emotional
disturbance, their teachers will often be interviewed by the school psychologist and asked to complete behavior rating scales to better understand the nature and degree of any problems.
Other Health Impaired. IDEA covers a diverse assortment of health conditions under
the category of Other Health Impaired (OHI). The unifying factor is that all of these
conditions involve limitations in strength, vitality, or alertness. Approximately 3.5% of the
students receiving special education services have this classification (Turnbull et al., 2002).
The health conditions included in this broad category include, but are not limited to, asthma,
epilepsy, sickle cell anemia, and cancer. Attention deficit/hyperactivity disorder (ADHD)
is also typically classified in this category (but may be served under Section 504 as well).
ADHD is characterized by problems maintaining attention, impulsivity, and hyperactivity
(APA, 1994). As with the diagnosis of emotional disturbance, when there is concern that
students have ADHD, their teachers will often be asked to complete behavior rating scales
to acquire a better picture of their functioning in the school setting.
Hearing Impairments. IDEA defines hearing impairments as hearing loss that is se-
vere enough to negatively impact a student’s academic performance. Students with hear-
ing impairments account for approximately 1% of the students receiving special education
services (Turnbull et al., 2002). Assessment of hearing impairments will involve an audi-
ologist, but school personnel will typically be involved to help determine the educational
ramifications of the impairment.
Autism. IDEA defines autism as a developmental disability that is evident before the
age of 3 and impacts verbal and nonverbal communication and social interaction. Students
with autism account for approximately 1% of the students receiving special education services (Turnbull et al., 2002). The assessment of autism typically involves a combination of intelligence, achievement, and speech and language tests to assess cognitive abilities as well as behavior rating scales to assess behavioral characteristics.

IDEA defines autism as a developmental disability that is evident before the age of 3 and impacts verbal and nonverbal communication and social interaction.
Visual Impairments Including Blindness. IDEA defines vi-
sual impairment as impaired vision that even after correction (e.g.,
glasses) negatively impacts a student’s academic performance. Students with visual impair-
ments constitute less than 1% of the students receiving special education services (Turn-
bull et al., 2002). Assessment of visual impairments will involve an ophthalmologist or an
optometrist, but school personnel will typically be involved to determine the educational
implications of the impairment.
Deaf-Blindness. This IDEA category is for students with coexisting visual and hearing
impairments that result in significant communication, developmental, and educational
needs. Assessment of students in this category typically relies on student observations,
information from parent and teacher behavior rating scales, and interviews of adults
familiar with the child (Salvia & Ysseldyke, 2007).
Traumatic Brain Injury. IDEA defines traumatic brain injury as an acquired brain
injury that is the result of external force and results in functional and psychosocial impair-
ments that negatively impact the student’s academic performance. Students with traumatic
brain injuries constitute less than 1% of the students receiving special education services
(Turnbull et al., 2002). The assessment of traumatic brain injuries typically involves a
combination of medical assessments (e.g., computerized axial tomography), neuropsycho-
logical tests (e.g., to assess a wide range of cognitive abilities such as memory, attention,
visual-spatial processing), and traditional psychological and educational tests (e.g., intel-
ligence and achievement). These assessments are often complemented with assessments of
behavior and personality.
Developmental Delay. Kubiszyn and Borich (2003) note that early versions of IDEA
required fairly rigid adherence to categorical eligibility procedures that identified and la-
beled students before special education services could be provided. While the intention
of these requirements was to provide appropriate oversight, it had the unintentional effect
of hampering efforts at prevention and early intervention. Before students qualified for
special education services, their problems had to be fairly severe and chronic. To address
this problem, IDEA 97 continued to recognize the traditional categories of disabilities (i.e.,
those listed above) and expanded eligibility to children with developmental delays. This
provision allows states to provide special education services to students between the ages
of 3 and 9 with delays in physical, cognitive, communication, social/emotional, and adap-
tive development. Additionally, IDEA gave the states considerable freedom in how they
define developmental delays, requiring only that the delays be identified using appropriate
assessment instruments and procedures. The goal of this more flexible approach to eligi-
bility is to encourage early identification and intervention. No longer do educators have to
wait until student problems escalate to crisis proportions; they can now provide services
early when the problems are more manageable and hopefully have a better prognosis.
Section 504
Section 504 of the Rehabilitation Act of 1973 is another law that had a significant impact
on the instruction and assessment of students with disabilities. Section 504 (often referred
to simply as 504) prohibits any discrimination against an individual with a disability in any
agency or program that receives federal funds. Because state and local education agencies
receive federal funds, Section 504 applies. Although IDEA requires that a student meet
specific eligibility requirements in order to receive special education services, Section 504
established a much broader standard of eligibility. Under Section 504, an individual with a
disability is defined as anyone with a physical or mental disability that substantially limits
one or more life activities. As a result, it is possible that a student may not qualify for special
education services under IDEA, but still qualify for assistance under Section 504 (this is
often referred to as “504 only”). Section 504 requires that public schools offer students with
disabilities reasonable accommodations to meet their specific educational needs. To meet
this mandate, schools develop “504 Plans” that specify the instructional and assessment
accommodations the student should receive. Parents, teachers, and other school personnel
typically develop these 504 Plans. Regular education teachers are involved in the develop-
ment of these plans and are responsible for ensuring that the modifications and accommoda-
tions are implemented in the classroom.
As we noted earlier, standard assessment procedures may not be appropriate for students
with a disability if the assessment requires the students to use some ability that is affected
by their disability but is irrelevant to the construct being measured. Assessment accommo-
dations are modifications to standard assessment procedures that are granted in an effort
to minimize the impact of student characteristics that are irrelevant to the construct being
measured. If this is accomplished the assessment will provide a more valid and accurate
measurement of the student’s true standing on the construct (AERA et al., 1999). The goal
is not simply to allow the student to obtain a higher score; the goal is to obtain more valid
score interpretations. Assessment accommodations should increase the validity of the score
interpretations so they more accurately reflect the student’s true standing on the construct
being measured.
Although some physical, cognitive, sensory, or motor deficits may be readily appar-
ent to teachers (e.g., vision impairment, hearing impairment, physical impairment), other
deficits that might undermine student performance are not as obvious. For example, stu-
dents with learning disabilities might not outwardly show any deficits that would impair
performance on a test, but might in fact have significant cognitive processing deficits that
limit their ability to complete standard assessments. In some situations the student may have
readily observable deficits, but have associated characteristics that also need to be taken into
consideration. For example, a student with a physical disability (e.g., partial paralysis) may
be easily fatigued when engaging in standard school activities. Because some tests require
fairly lengthy testing sessions, the student’s susceptibility to fatigue,
not only the more obvious physical limitations, needs to be taken into consideration when planning assessment accommodations (AERA et al., 1999).

Fairness to all parties is a central issue when considering assessment accommodations.
Fairness to all parties is a central issue when considering assess-
ment accommodations. For students with disabilities, fairness requires that they not be penal-
ized as the result of disability-related characteristics that are irrelevant to the construct being
measured by the assessment. For students without disabilities, fairness requires that those
receiving accommodations not be given an unjust advantage over those being tested under
standard conditions. As you can see, these serious issues deserve careful consideration.
The Standards (AERA et al., 1999) specify the following three situations in which accom-
modations should not be provided or are not necessary.
Accommodations Are Not Appropriate for an Assessment if the Purpose of the Test Is
to Assess the Presence and Degree of the Disability. For example, it would not be ap-
propriate to give a student with attention deficit/hyperactivity disorder (ADHD) extra time
on a test designed to diagnose the presence of attention problems. As we indicated earlier, it
would not be appropriate to modify a test of visual acuity for a student with impaired vision.

Assessment accommodations should be individualized to meet the specific needs of each student with a disability.

Accommodations Are Not Necessary for All Students with Disabilities. Not all students with disabilities need accommodations. Even when students with a disability require accommodations on one
test, this does not necessarily mean that they will need accommoda-
tions on all tests. As we will discuss in more detail later, assessment accommodations should
be individualized to meet the specific needs of each student with a disability. There is no spe-
cific accommodation that is appropriate, necessary, or adequate for all students with a given disability.
Modifications of the presentation format that may be appropriate for students with disabilities include the following:

Braille format
Large-print editions
Large-print figure supplements
Braille figure supplement
CCTV to magnify text and materials
For computer-administered tests, devices such as ZoomText Magnifier and ScreenReader to
magnify material on the screen or read text on the screen
Reader services (read directions and questions, describe visual material)
Sign language
Audiotaped administration
Videotaped administration
Alternative background and foreground colors
Increasing the spacing between items
Reducing the number of items per page
Using raised line drawings
Using language-simplified questions
Converting written exams to oral exams; oral exams to written format
Defining words
Providing additional examples
Clarifying and helping students understand directions, questions, and tasks
Highlighting key words or phrases
Providing cues (e.g., bullets, stop signs) on the test booklet
Rephrasing or restating directions and questions
Simplifying or clarifying language
Using templates to limit the amount of print visible at one time
Modifications of Response Format

Some students with disabilities have difficulty responding in the standard written format. For these students it may be appropriate to allow them to take the exam orally or provide access to a scribe to write down their responses.
A student whose preferred method of communication is sign language could respond in
sign language and responses could subsequently be translated for grading. Other common
modifications to the response format include allowing the student to point to the correct
response; having an aide mark the answers; using a tape recorder to record responses;
using a computer or Braillewriter to record responses; using voice-activated computer
software; providing increased spacing between lines on the answer sheet; using graph
paper for math problems; and allowing the student to mark responses in the test booklet
rather than on a computer answer sheet. Table 15.2 provides a summary listing of these
and related accommodations.
Oral examinations
Scribe services (student dictates response to scribe, who creates written response)
Allowing a student to respond in sign language
Allowing a student to point to the correct response
Having an aide mark the answers
Using a tape recorder to record responses
Using a computer with read-back capability to record responses
Using a Braillewriter to record responses
Using voice-activated computer software
Providing increased spacing between lines on the answer sheet
Using graph paper for math problems
Allowing students to mark responses in the test booklet rather than on a computer answer sheet (e.g., Scantron forms)
Using a ruler for visual tracking

Modifications of Timing

Extended time is probably the most frequent accommodation provided. Extended time is appropriate for any student who may be slowed down due to reduced processing speed, reading speed, or writing speed. Modifications of timing are also appropriate for students
who use other accommodations such as the use of a scribe or some form of adaptive equip-
ment, because these often require more time. Determining how much time to allow is a
complex consideration. Research suggests that 50% additional time is adequate for most
students with disabilities (Northeast Technical Assistance Center, 1999). Although this is
probably a good rule of thumb, be sensitive to special conditions that might demand extra
time. Nevertheless, most assessment professionals do not recommend “unlimited time” as
an accommodation. It is not necessary, can complicate the scheduling of assessments, and
can be seen as unreasonable and undermine the credibility of the accommodation process
in the eyes of some educators. Other time-related modifications include providing more
frequent breaks or administering the test in sections, possibly spread over several days.
For some students it may be beneficial to change the time of day the test is administered to
accommodate their medication schedule or fluctuations in their energy levels. Table 15.3
provides a summary listing of these and related accommodations.
Extended time
More frequent breaks
Administering the test in sections
Spreading the testing over several days
Changing the time of day the test is administered

Modifications of Setting

Modifications of setting allow students to be tested in a setting that will enable them to perform their best.

Modifications of setting allow students to be tested in a setting that will enable them to perform at their best. For example, for students who are highly distractible this may include administering the test individually or in a small group setting. For other students preferential seating in the regular classroom may be sufficient. Some students will have special needs based on space or accessibility requirements (e.g., a room that is wheelchair accessible). Some students may need special accommodations such as a room free from extraneous noise/distractions, special lighting, special acoustics, or the use of a study carrel to minimize distractions. Table 15.4 provides
a summary listing of these and related accommodations.
If a student routinely receives an accommodation in classroom instruction, that accommodation usually should be applied to both classroom assessments and state and district assessment programs. For example, if
a student with a visual handicap receives large-print instructional materials in class (e.g.,
large-print textbook, handouts, and other class materials), it would be logical to provide
large-print versions of classroom assessments as well as large-print standardized assess-
ments. A reasonable set of questions to ask is (1) What types of instructional accommoda-
tions are being provided in the classroom? (2) Are these same accommodations appropriate
and necessary to allow the students to demonstrate their knowledge and skills on assess-
ments? (3) Are any additional assessment accommodations indicated? (Mastergeorge &
Miyoshi, 1999).
Periodically Reevaluate the Needs of the Student. Over time, the needs of a student
are likely to change. In some cases, students will mature and develop new skills and abili-
ties. In other situations there may be a loss of some abilities due to a progressive disorder.
As a result, it is necessary to periodically reexamine the needs of the student and determine
whether the existing accommodations are still necessary and if any new modifications need
to be added.
Tailor the modifications to meet the specific needs of the individual student (i.e., no one-size-fits-all accommodations).
If a student routinely receives an accommodation in classroom instruction, that accommodation is usually appropriate for assessments.
When possible, select accommodations that will promote independent functioning.
Periodically reevaluate the needs of the students (e.g., Do they still need the accommodation? Do they need additional accommodations?).
Major test publishers typically provide accommodation guidelines for the assessments they
publish. The easiest way to access up-to-date information on these accommodation policies is
by accessing the publishers’ Web sites. These accommodation policies include both the types of
accommodations allowed and the process examinees must go through in order to qualify for and
request accommodations. Below are some Web sites where you can find accommodation policies
for some major test publishers.
Assessment of English Language Learners

The Standards (AERA et al., 1999) note that “any test that employs language is, in part, a
measure of language skills. This is of particular concern for test takers whose first language
is not the language of the test” (p. 91). Accordingly, both IDEA and NCLB require that when
assessing students with limited English proficiency, educators must ensure that they are
actually assessing the students’ knowledge and skills and not their proficiency in English.
For example, if a bilingual student with limited English proficiency is unable to correctly
answer a mathematics word problem presented in English, one must question whether the
student’s failure reflects inadequate mathematical reasoning and computation skills or in-
sufficient proficiency in English. If the goal is to assess the student’s English proficiency, it
is appropriate to test an ELL student in English. However, if the goal is to assess achieve-
ment in an area other than English, you need to carefully consider the type of assessment
or set of accommodations needed to ensure a valid assessment. This often requires testing
students in their primary language.
There are a number of factors that need to be considered when assessing ELL stu-
dents. First, when working with students with diverse linguistic backgrounds it is important
for educators to carefully assess the student’s level of acculturation, language dominance,
and language proficiency before initiating the formal assessment (Jacob & Hartshorne,
The Texas Student Assessment Program includes a number of assessments, the most widely admin-
istered being the Texas Assessment of Knowledge and Skills (TAKS). The manual (Texas Education
Agency, 2003) notes that accommodations that do not compromise the validity of the test results may
be provided. Decisions about what accommodations to provide should be based on the individual
needs of the student and take into consideration whether the student regularly receives the accommo-
dation in the classroom. For students receiving special education services, the requested accommoda-
tions must be noted on their IEP. The manual identifies the following as allowable accommodations:
Naturally, the testing program allows individuals to request accommodations that are not included
in this list, and these will be evaluated on a one-by-one basis. However, the manual identifies the
following as nonallowable accommodations:
In addition to the TAKS, the State Developed Alternative Assessment (SDAA) is designed
for students who are receiving instruction in the state-specified curriculum but for whom the IEP
committee has decided the TAKS is inappropriate. Whereas the TAKS is administered based on the
student’s assigned grade level, the SDAA is based on the student’s instructional level as specified
by the IEP committee. The goal of the SDAA is to provide accurate information about the student’s
annual growth in the areas of reading, writing, and math. In terms of allowable accommodations,
the manual (Texas Education Agency, 2003) simply specifies the following:
With the exception of the nonallowable accommodations listed below, accommodations documented
in the individual education plan (IEP) that are necessary to address the student’s instructional needs
based on his or her disability may be used for this assessment. Any accommodation made MUST be
documented in the student’s IEP and must not invalidate the tests. (p. 111)
No direct or indirect assistance that identifies or helps identify the correct answer
No clarification or rephrasing of test questions, passages, prompts, or answer choices
No reduction in the number of answer choices for an item
No allowance for reading and writing tests to be read aloud to the student, with the exception
of specific prompts
2007). For example, you need to determine the student’s dominant language (i.e., the pre-
ferred language) and proficiency in both dominant and nondominant languages. It is also
important to distinguish between conversational and cognitive/academic language skills.
For example, conversational skills may develop in about two years, but cognitive/academic
language skills may take five or more years to emerge (e.g., Cummings, 1984). The impli-
cation is that teachers should not rely on their subjective impression of an ELL student’s
English proficiency based on subjective observations of daily conversations, but should em-
ploy objective measures of written and spoken English proficiency. The Standards (AERA
et al., 1999) provide excellent guidance in language proficiency assessment.
A number of strategies exist for assessing students with limited English proficiency
when using standardized assessments, these include the following:
Locate tests with directions and materials in the student’s native language. There are
a number of commercial tests available in languages other than English. However, these
tests vary considerably in quality depending on how they were developed. For example, a
simple translation of a test from one language to another does not ensure test equivalence. In
this context, equivalence means it is possible to make comparable inferences based on test
performance (AERA et al., 1999). The question is: Does the translated test produce results
that are comparable to the original test in terms of validity and reliability?
It may be possible to use a nonverbal test. There are a number of nonverbal tests that
were designed to reduce the influence of cultural and language factors. However, one should
keep in mind that while these assessments reduce the influence of language and culture, they
do not eliminate them.
If it is not possible to locate a suitable translated test or a nonverbal test, a qualified
bilingual examiner may conduct the assessment, administering the tests in the student’s native
language. When a qualified bilingual examiner is not available, an interpreter may be used.
While this is a common practice, there are a number of inherent problems that may compro-
mise the validity of the test results (AERA et al., 1999). It is recommended that educators
considering this option consult the Standards (AERA et al., 1999) for additional information
on the use of interpreters in assessing individuals with limited English proficiency.
Salvia and Ysseldyke (2007) provide suggestions for assessing students with limited
English proficiency in terms of classroom achievement. First, they encourage teachers to
ensure that they assess what is actually taught in class, not related content that relies on
incidental learning. Students with different cultural and language backgrounds might not
have had the same opportunities for incidental learning as native English speakers. Second,
give ELL students extra time to process their responses. They note that for a variety of
reasons, students with limited English proficiency may require additional time to process
information and formulate a response. Finally, they suggest that teachers provide ELL stu-
dents with opportunities to demonstrate achievement in ways that do not rely exclusively
on language.
[A]n otherwise qualified student who is unable to disclose the degree of learning he actually pos-
sesses because of the test format or environment would be the object of discrimination solely on the
basis of his handicap.
(Chief Justice Cummings, U.S. 7th Circuit Court of Appeals¹)
Section 504 imposes no requirement upon an educational institution to lower or to effect substantial
modifications of standards to accommodate a handicapped person.
(Justice Powell, U.S. Supreme Court²)
These quotes were selected by Phillips (1993) to illustrate the diversity in legal opinions
that have been rendered regarding the provision of assessment accommodations for students with
disabilities. Dr. Phillips’s extensive writings in this area (e.g., 1993, 1994, & 1996) provide some
guidelines regarding the assessment of students with disabilities. Some of these guidelines are most
directly applicable to high-stakes assessment programs, but they also have implications for other
educational assessments.
Notice
Students should be given adequate notice when they will be required to engage in a high-stakes testing
program (e.g., assessments required for graduation). Although this requirement applies to all students,
it is particularly important for students with disabilities to have adequate notice of any testing require-
ments because it may take them longer to prepare for the assessment. What constitutes adequate
notice? With regard to a test required for graduation from high school, one court found 1.5 years to
be inadequate (Brookhart v. Illinois State Board of Education). Another court agreed, finding that ap-
proximately 1 year was inadequate, but suggested that 3 years was adequate (Northport v. Ambach).
¹Brookhart v. Illinois State Board of Education, 697 F. 2d 179 (7th Cir. 1983).
²Southeastern Community College v. Davis, 442 U.S. 397 (1979).
³Debra P. v. Turlington, 474 F. Supp. 244 (M.D. FL 1979).
Invalid Accommodations
Courts have ruled that test administrators are not required to grant assessment accommodations that
“substantially modify” a test or that “pervert” the purpose of the test (Brookhart). In psychometric
terms, the accommodations should not invalidate the interpretation of test scores. Phillips (1994)
suggests the following questions should be asked when considering a given accommodation:
1. Will format changes or accommodations in testing conditions change the skills being
measured?
2. Will the scores of examinees tested under standard conditions have a different meaning than
scores for examinees tested with the requested accommodation?
3. Would nondisabled examinees benefit if allowed the same accommodation?
4. Does the disabled examinee have any capability for adapting to standard test administration
conditions?
5. Is the disability evidence or testing accommodation policy based on procedures with doubt-
ful validity and reliability? (p. 104)
If the answer to any of these questions is “yes,” Phillips suggests the accommodations are likely
not appropriate.
Flagging
“Flagging” refers to administrators adding notations on score reports, transcripts, or diplomas indicating
that assessment accommodations were provided (and in some cases what the accommodations were).
Proponents of flagging hold that it protects the users of assessment information from making inaccurate
interpretations of the results. Opponents of flagging hold that it unfairly labels and stigmatizes students
with disabilities, breaches their confidentiality, and potentially puts them at a disadvantage. If there is
substantial evidence that the accommodation does not detract from the validity of the interpretation of
scores, flagging is not necessary. However, flagging may be indicated when there is incomplete evi-
dence regarding the comparability of test scores. Phillips (1994) describes a process labeled “self-selec-
tion with informed disclosure.” Here administrators grant essentially any reasonable accommodation
that is requested, even if it might invalidate the assessment results. Then, to protect users of assessment
results, they add notations specifying what accommodations were provided. An essential element is
that the examinee requesting the accommodations must be adequately informed that the assessment
reports will contain information regarding any accommodations provided and the potential advantages
and disadvantages of taking the test with accommodations. However, even when administrators get
informed consent, disclosure of assessment accommodations may result in legal action.
Phillips (1993) notes that at times the goal of promoting valid and comparable test results
and the legal and political goal of protecting the individual rights of students with disabilities may
be at odds. She recommends that educators develop detailed policies and procedures regarding the
provision of assessment accommodations, decide each case on an individual basis, and provide
expeditious appeals when requested accommodations are denied. She notes:
To protect the rights of both the public and individuals in a testing program, it will be necessary to
balance the policy goal of maximum participation by the disabled against the need to provide valid
and interpretable student test scores. (p. 32)
Summary
In this chapter we focused on the use of assessment accommodations for students with
disabilities. We noted that standard assessment procedures might not be appropriate for
a student with a disability if the assessment requires the student to use an ability that is
affected by their disability but is irrelevant to the construct being measured. In these situ-
ations it may be necessary for teachers to modify the standard assessment procedures. We
gave the example of students with visual handicaps taking a written test of world history.
Although the students could not read the material in its standard format, if they could read
Braille an appropriate accommodation would be to convert the test to the Braille format.
Because the test is designed to measure knowledge of world history, not the ability to read
standard print, this would be an appropriate accommodation. The goal of assessment ac-
commodations is not simply to allow the student to obtain a better grade, but to provide
the most reliable and valid assessment of the construct of interest. To this end, assessment
accommodations should always increase the validity of the score interpretations so they
more accurately reflect the student’s true standing on the construct being measured.
We noted three situations in which assessment accommodations are not appropriate or
necessary (AERA et al., 1999). These are when (1) the affected ability is directly relevant to
the construct being measured, (2) the purpose of the assessment is to assess the presence and
degree of the disability, and (3) the student does not actually need the accommodation.
A number of federal laws mandate assessment accommodations for students with
disabilities. The Individuals with Disabilities Education Act (IDEA) and Section 504 of the
Rehabilitation Act of 1973 are the laws most often applied in the schools and we spent some
time discussing these. IDEA requires that public schools provide students with disabilities
a free appropriate public education (FAPE) and identifies a number of disability categories.
These include learning disabilities, communication disorders, mental retardation, emotional
disturbance, other health impaired, multiple disabilities, hearing impairments, orthopedic
impairments, autism, visual impairments, traumatic brain injury, and developmental delay.
A key factor in the provision of services to students with disabilities is the individualized
educational program (IEP). The IEP is a written document developed by a committee that
specifies a number of factors, including the students’ instructional arrangement, the special
services they will receive, and any assessment accommodations they will receive.
Section 504 of the Rehabilitation Act of 1973, often referred to as Section 504 or
simply as 504, prohibits discrimination against individuals with disabilities in any agency
or school that receives federal funds. In the public schools, Section 504 requires that schools
provide students with disabilities reasonable accommodations to meet their educational
needs. Section 504 provides a broad standard of eligibility, simply stating that an individual
with a disability is anyone with a physical or mental disability that limits one or more life
activities. Because Section 504 is broader than IDEA, it is possible for a student to qualify
under Section 504 and not qualify under IDEA. This is sometimes referred to as 504 only.
A variety of assessment accommodations have been developed to meet the needs of students with disabilities, including modifications of the presentation format, modifications of the response format, modifications of timing, modifications of the setting, adaptive devices and supports, and the use of limited portions of an assessment or an alternate assessment.
KEY TERMS

Modifications of timing, p. 406
Multiple disabilities, p. 401
Nonstandard administration flags, p. 415
Orthopedic impairments, p. 401
Other Health Impaired (OHI), p. 401
Speech disorders, p. 400
Traumatic brain injury, p. 402
Visual impairment, p. 402
RECOMMENDED READINGS
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA. The Standards provide an excellent discussion of assessment accommodations.

Mastergeorge, A. M., & Miyoshi, J. N. (1999). Accommodations for students with disabilities: A teacher’s guide (CSE Technical Report 508). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. This guide provides some useful information on assessment accommodations specifically aimed toward teachers.

Phillips, S. E. (1994). High-stakes testing accommodations: Validity versus disabled rights. Applied Measurement in Education, 7(2), 93-120. An excellent discussion of legal cases involving assessment accommodations for students with disabilities.

Thurlow, M., Hurley, C., Spicuzza, R., & El Sawaf, H. (1996). A review of the literature on testing accommodations for students with disabilities (Minnesota Report No. 9). Minneapolis: University of Minnesota, National Center on Educational Outcomes. Retrieved April 19, 2004, from http://education.umn.edu/NCEO/OnlinePubs/MnReport9.html.

Turnbull, R., Turnbull, A., Shank, M., Smith, S., & Leal, D. (2002). Exceptional lives: Special education in today’s schools. Upper Saddle River, NJ: Merrill Prentice Hall. This excellent text provides valuable information regarding the education of students with disabilities.
CHAPTER 16

The Problem of Bias in Educational Assessment

LEARNING OBJECTIVES
9. Describe the results of research on bias in prediction and in relation to variables that are
external to the test.
10. Explain what is implied by homogeneity of regression and describe the conditions that may
result when it is not present.
Groups of people who can be defined on a qualitative basis such as gender or ethnicity
(and are thus formed using a nominal scale of measurement as was discussed in Chapter 2),
do not always show the same mean level of performance on various educational and psy-
chological tests. For example, on tests of spatial skill, requiring visualization and imagery,
men and boys tend to score higher than do women and girls. On tests that involve written
language and tests of simple psychomotor speed (such as the rapid copying of symbols
or digits), women and girls tend to score higher than men and boys (see Special Interest
Topic 16.1 for additional information). Ethnic group differences in test performance also occur and are among the most controversial and polemical.
There is perhaps no more controversial finding in the field of psychology than the persistent 1 standard deviation (about 15 points) difference between the intelligence test performance of black and white students taken as a group. Much effort has been expended to determine why group differences occur (and there are many, many such group differences on various measures of specialized ability and achievement), but we do not know for certain why they exist. One major, carefully studied explanation is that the tests are biased in some way against certain groups. This is referred to as the cultural test bias hypothesis.
The cultural test bias hypothesis represents the contention that any gender, ethnic, racial, or other nominally determined group differences on mental tests are due to inherent, artifactual biases produced within the tests through flawed psychometric methodology. Group differences are believed then to stem from characteristics of the tests and to be totally unrelated to any actual differences in the psychological trait, skill, or ability in question. The resolution or evaluation of the validity of the cultural test bias hypothesis is one of the most crucial scientific questions facing psychology today.
Bias in mental tests has many implications for individuals including the misplacement
of students in educational programs; errors in assigning grades; unfair denial of admission
to college, graduate, and professional degree programs; and the inappropriate denial of em-
ployment. The scientific implications are even more substantive. There would be dramatic
implications for educational and psychological research and theory if the cultural test bias
hypothesis were correct: The principal research of the past 100 years in the psychology of
human differences would have to be dismissed as confounded and largely artifactual be-
cause much of the work is based on standard psychometric theory and testing technology.
This would in turn create major upheavals in applied psychology, because the foundations
of clinical, counseling, educational, industrial, and school psychology are all strongly tied
to the basic academic field of individual differences.
Teachers, be they in the elementary and secondary schools or colleges and universi-
ties, assign grades on the basis of tests or other more subjective evaluations of learning, and
SPECIAL INTEREST TOPIC 16.1

Research has shown that although there are no significant sex differences in overall intelligence
scores, substantial differences exist with regard to specific cognitive abilities. Females typically score
higher on a number of verbal abilities whereas males perform better on visual-spatial and (starting
in middle childhood) mathematical skills. It is believed that sex hormone levels and social factors
both influence the development of these differences. As is typical of group differences in intellectual
abilities, the variability in performance within groups (i.e., males and females) is much larger than the
mean difference between groups (Neisser et al., 1996). Diane Halpern (1997) has written extensively
on gender differences in cognitive abilities. For example, among the abilities on which females tend to obtain higher average scores is rapid access to and use of verbal and other information in long-term memory (e.g., verbal fluency, synonym generation, associative memory, spelling, anagrams) (adapted from Halpern, 1997, Appendix, p. 1102).
if tests constructed to the most stringent of psychometric and statistical standards turn out to be culturally biased when used with native-born American ethnic minorities, what about the teacher-made test in the classroom and more subjective evaluation of work samples (e.g., performance assessments)? If well-constructed and properly standardized tests are biased, then classroom measures are almost certain to be at least as biased and probably more so. As the reliability of a test or evaluation procedure goes down, the likelihood of bias goes up, the two being inversely related. A large reliability coefficient does not eliminate the possibility of bias, but as reliability is lowered, the probability that bias will be present increases.
The purpose of this chapter is to address the issues and findings surrounding the cul-
tural test bias hypothesis in a rational manner and evaluate the validity of the hypothesis,
as far as possible, on the basis of existing empirical research. This will not be an easy task
because of the controversial nature of the topic and strong emotional overtones. Prior to
turning to the reasons that test bias generates highly charged emotions and reviewing some
of the history of these issues, it is proper to engage in a discussion of just what we mean by
the term bias.
Issues of bias in testing are too often the subject of intense polemic, emotional debate without a mechanism for rational resolution. It is imperative that the evaluation of bias in tests be undertaken from the standpoint of scholarly inquiry and debate. Emotional appeals, legal-adversarial approaches, and political remedies of scientific issues appear to us to be inherently unacceptable.
Concern about cultural bias in mental testing has been a recurring issue since the beginning
of the use of assessment in education. From Alfred Binet in the 1800s to Arthur Jensen over
the last 50 years, many scientists have addressed this controversial problem, with varying,
inconsistent outcomes. In the last few decades, the issue of cultural bias has come forth
as a major contemporary problem far exceeding the bounds of purely academic debate
and professional rhetoric. The debate over the cultural test bias hypothesis has become
entangled and sometimes confused within the larger issues of individual liberties, civil
rights, and social justice, becoming a focal point for psychologists, sociologists, educators,
politicians, minority activists, and the lay public. The issues increasingly have become legal
and political. Numerous court cases have been brought, and New York State even passed "truth-in-testing" legislation of a kind that is being considered in other states and at the federal level. Such attempts to resolve the issue through legislation or litigation are difficult if not impossible. Take, for example, the legal response to the question "Are intelligence tests used to diagnose mental retardation biased against cultural and ethnic minorities?" In California in 1979 (Larry P. v. Riles) the answer was "yes," but in Illinois in 1980 (PASE v. Hannon) the response was "no." Thus two federal district courts of equivalent standing have heard nearly identical cases, with many of the same witnesses espousing much the same testimony, and reached precisely opposite conclusions. See Special Interest Topic 16.2 for more information on legal issues surrounding assessment bias.
Though current opinion on the cultural test bias hypothesis is quite divergent, ranging
from those who consider it to be for the most part unresearchable (e.g., Schoenfeld, 1974) to
those who considered the issue settled decades ago (e.g., Jensen, 1980), it seems clear that
empirical analysis of the hypothesis should continue to be undertaken. However difficult
full objectivity may be in science, we must make every attempt to view all socially, politi-
cally, and emotionally charged issues from the perspective of rational scientific inquiry. We
must also be prepared to accept scientifically valid findings as real, whether we like them
or not.
Systematic group differences on standardized intelligence and aptitude tests may occur as
a function of socioeconomic level, race or ethnic background, and other demographic vari-
ables. Black—white differences on IQ measures have received extensive investigation over
the past 50 or 60 years. Although results occasionally differ slightly depending on the age
groups under consideration, random samples of blacks and whites show a mean difference
SPECIAL INTEREST TOPIC 16.2
Largely due to overall mean differences in the performance of various ethnic groups on IQ tests,
the use of intelligence tests in the public schools has been the subject of courtroom battles around
the United States. Typically such lawsuits argue that the use of intelligence tests as part of the
determination of eligibility for special education programs leads to overidentification of certain
minorities (traditionally African American and Hispanic children). A necessary corollary to this
argument is that the resultant overidentification is inappropriate because the intelligence tests in
use are biased, underestimating the intelligence of minority students, and that there is in fact no
greater need for special education placement among these ethnic minorities than for other ethnic
groups in the population.
Attempts to resolve the controversy over IQ testing in the public schools via the courtroom
have not been particularly successful. Unfortunately, but not uncharacteristically, the answer to
the legal question “Are IQ tests biased in a manner that results in unlawful discrimination against
minorities when used as part of the process of determining eligibility for special education place-
ments?” depends on where you live!
There are four key court cases to consider when reviewing this question, two from California
and one each from Illinois and Georgia.
The first case is Diana v. State Board of Education (C-70-37 RFP, N.D. Cal., 1970), heard
by the same federal judge who would later hear the Larry P. case (see later). Diana was filed on
behalf of Hispanic (referred to as Chicano at that time and in court documents) children classified
as EMR, or educable mentally retarded (a now archaic term), based on IQ tests administered in
English. However, the children involved in the suit were not native English speakers and when
retested in their native language, all but one (of nine) scored above the range designated as EMR.
Diana was resolved through multiple consent decrees (agreements by the adverse parties ordered
into effect by the federal judge). Although quite detailed, the central component of interest here is
that the various decrees ensured that children would be tested in their native language, that more
than one measure would be used, and that adaptive behavior in nonschool settings would be as-
sessed prior to a diagnosis of EMR.
It seems obvious to us now that whenever persons are assessed in other than their native
language, the validity of the results as traditionally interpreted would not hold up, at least in the
case of ability testing. This had been obvious to the measurement community for quite some time
prior to Diana, but had not found its way into practice. Occasionally one still encounters cases of a
clinician evaluating children in other than their native language and making inferences about intel-
lectual development—clearly this is inappropriate.
Three cases involving intelligence testing of black children related to special education
placement went to trial: Larry P. v. Riles (343 F. Supp. 306, 1972; 495 F. Supp. 976, 1979); PASE v. Hannon (506 F. Supp. 831, 1980); and Marshall v. Georgia (CV 482-233, S.D. of Georgia, 1984).
Each of these cases involved allegations of bias in IQ tests that caused the underestimation of the
intelligence of black children and subsequently led to disproportionate placement of black children
in special education programs. All three cases presented testimony by experts in education, testing,
measurement, and related fields, some professing the tests to be biased and others professing they were not. That a disproportionate number of black children were in special education was conceded in all cases; what was litigated was the reason.
In California in Larry P. v. Riles (Wilson Riles being the California State Superintendent of Public Instruction), Judge Peckham ruled that IQ tests were in fact biased against black children and barred their use in placing black children in classes for the educable mentally retarded in California.
of about 1 standard deviation, with the mean score of the white groups consistently ex-
ceeding that of the black groups. When a number of demographic variables are taken into
account (most notably socioeconomic status, or SES), the size of the difference reduces
to 0.5 to 0.7 standard deviation but remains robust in its appearance. The differences have
persisted at relatively constant levels for quite some time and under a variety of methods of
investigation. Some recent research suggests that the gap may be narrowing, but this has not
been firmly established (Neisser et al., 1996).
Mean differences between ethnic groups are not limited to black-white comparisons.
Although not nearly as thoroughly researched as black—white differences, Hispanic—white
differences have also been documented, with Hispanic mean performance approximately
0.5 standard deviation below the mean of the white group. On the average, Native Ameri-
cans tend to perform lower on tests of verbal intelligence than whites. Both Hispanics and
Native Americans tend to perform better on visual-spatial tasks relative to verbal tasks. Not all studies of race/ethnic group differences on ability tests show higher levels of performance by whites. Asian American groups have been shown consistently to perform as
well as or better than white groups. Depending on the specific aspect of intelligence under
investigation, other race/ethnic groups show performance at or above the performance level
of white groups (for a readable review of this research, see Neisser et al., 1996).
It should always be kept in mind that the overlap among the distributions of intelli-
gence test scores for different ethnic groups is much greater than the size of the differences
between the various groups. Put another way, there is always more within-group variability
in performance on mental tests than between-group variability. Neisser et al. (1996) frame
it this way:
Group means have no direct implications for individuals. What matters for the next person
you meet (to the extent that test scores matter at all) is that person’s own particular score, not
the mean of some reference group to which he or she happens to belong. The commitment
to evaluate people on their own individual merit is central to a democratic society. It also
makes quantitative sense. The distributions of different groups inevitably overlap, with the
range of scores within any one group always wider than the mean differences between any
two groups. In the case of intelligence test scores, the variance attributable to individual dif-
ferences far exceeds the variance related to group membership. (p. 90)
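To see how much overlap a 1 standard deviation mean difference actually implies, consider the brief sketch below. It is a hypothetical illustration only (the group sizes, means, and standard deviation are assumed, not drawn from any study cited here): it simulates two groups whose means differ by 15 points on an IQ-type scale and shows both the substantial overlap of the distributions and the small share of total score variance that group membership explains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical groups whose means differ by 1 SD on an IQ-type scale
group_a = rng.normal(100, 15, 10_000)
group_b = rng.normal(85, 15, 10_000)

# Overlap: proportion of group B scoring above the mean of group A (about 16%)
pct_overlap = (group_b > group_a.mean()).mean() * 100

# Variance decomposition: share of total score variance explained by group
scores = np.concatenate([group_a, group_b])
grand_mean = scores.mean()
ss_total = ((scores - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in (group_a, group_b))
eta_squared = ss_between / ss_total  # roughly 0.20 for a 1 SD difference

print(f"{pct_overlap:.1f}% of group B scores exceed group A's mean")
print(f"Group membership explains {eta_squared:.1%} of total score variance")
```

With a 1 standard deviation difference between equal-sized groups, group membership accounts for roughly one fifth of the total variance, leaving the large majority attributable to individual differences within groups.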
Explaining Mean Group Differences. Once mean group differences are identified, it is
natural to attempt to explain them. Reynolds (2000) notes that the most common explana-
tions for these differences have typically fallen into four categories: (a) the differences primarily reflect genetic influences; (b) the differences primarily reflect environmental influences such as differences in socioeconomic circumstances and educational opportunity; (c) the differences reflect the interactive effect of genes and environment; or (d) the tests themselves are defective, that is, culturally biased.
The final explanation (i.e., category d) is embodied in the cultural test bias hypothesis
introduced earlier in this chapter. Restated, the cultural test bias hypothesis represents the
contention that any gender, ethnic, racial, or other nominally determined group differences on
mental tests are due to inherent, artifactual biases produced within the tests through flawed
psychometric methodology. Group differences are believed then to stem from characteristics
of the tests and to be totally unrelated to any actual differences in the psychological trait, skill,
or ability in question. Because mental tests are based largely on middle-class values and knowl-
edge, their results are more valid for those groups and will be biased
against other groups to the extent that they deviate from those values and knowledge bases. Thus, ethnic and other group differences result from flawed psychometric methodology and not from actual differences in aptitude. As will be discussed, this hypothesis reduces to one of differential validity, the hypothesis of differential validity being that tests measure intelligence and other constructs more accurately, and make more valid predictions, for individuals from the groups on which the tests are mainly based than for those from other groups. The practical implications of such bias have been pointed out previously and are
the issues over which most of the court cases have been fought.
If the cultural test bias hypothesis is incorrect, then group differences are not attributable to the tests and must be due to one of the other factors mentioned above, such as the model emphasizing the interactive effect of genes and environment (category c).
Test Bias and Etiology. The controversy over test bias is distinct from the question of etiology and should not be confused with the controversy over the etiology of any observed group differences. Reynolds and Ramsay (2003) note that the need to research etiology is only relevant once it has been determined that mean score differences are real, not simply artifacts of the assessment process. Unfortunately, measured differences themselves have often been inferred to indicate genetic differences and therefore the genetically based intellectual inferiority of some groups. This inference is not defensible from a scientific perspective.
Test Bias and Fairness. Bias and fairness are related but separate concepts. As noted by
Brown, Reynolds, and Whitaker (1999), fairness is a moral, philosophical, or legal issue on
which reasonable people can disagree. On the other hand, bias is a statistical property of a
test. Therefore, bias is a property empirically estimated from test data whereas fairness is a
principle established through debate and opinion. Nevertheless, it is common to incorporate
information about bias when considering the fairness of an assessment process. For ex-
ample, a biased test would likely be considered unfair by essentially everyone. However, it
is clearly possible that an unbiased test might be considered unfair by at least some. Special
Interest Topic 16.3 summarizes the discussion of fairness in testing and test use from the
Standards (AERA et al., 1999).
Test Bias and Offensiveness. There is also a distinction between test bias and item offen-
siveness. Test developers often use a minority review panel to examine each item for content
that may be offensive or demeaning to one or more groups (e.g., see Reynolds & Kamphaus,
2003, for a practical example). This is a good procedure for identifying and eliminating of-
fensive items, but it does not ensure that the items are not biased. Research has consistently
found little evidence that one can identify, by personal inspection, which items are biased and
which are not (for reviews, see Camilli & Shepard, 1994; Reynolds, Lowe, & Saenz, 1999).
Test Bias and Inappropriate Test Administration and Use. The controversy over test
bias is also not about blatantly inappropriate administration and usage of mental tests. Ad-
ministration of a test in English to an individual for whom English is a poor second language
is inexcusable both ethically and legally, regardless of any bias in the tests themselves (un-
less of course, the purpose of the test is to assess English language skills). It is of obvious
importance that tests be administered by skilled and sensitive professionals who are aware
of the factors that may artificially lower an individual’s test scores. That should go without
saying, but some court cases involve just such abuses. Considering the use of tests to as-
sign pupils to special education classes or other programs, the question needs to be asked,
SPECIAL INTEREST TOPIC 16.3
The Standards (AERA et al., 1999) present four different ways that fairness is typically used in the
context of assessment.
1. Fairness as absence of bias: There is general consensus that for a test to be fair, it should not be biased. Bias is used here in the statistical sense: systematic error in the estimation of a value.
2. Fairness as equitable treatment: There is also consensus that all test takers should be treated in an equitable manner throughout the assessment process. This includes being given equal opportunities to demonstrate their abilities by being afforded equivalent opportunities to prepare for the test and standardized testing conditions. The reporting of test results should be accurate, informative, and treated in a confidential manner.
3. Fairness as opportunity to learn: This definition holds that test takers should all have an equal opportunity to learn the material when taking educational achievement tests.
4. Fairness as equal outcomes: Some hold that for a test to be fair it should produce equal performance across groups defined by race, ethnicity, gender, and so on (i.e., equal mean performance).
Many assessment professionals believe that (1) if a test is free from bias and (2) test takers re-
ceived equitable treatment in the assessment process, the conditions for fairness have been achieved.
The other two definitions receive less support. In reference to definition (3) requiring equal opportu-
nity to learn, there is general agreement that adequate opportunity to learn is appropriate in some cases
but irrelevant in others. However, disagreement exists in terms of the relevance of opportunity to learn
in specific situations. A number of problems arise with this definition of fairness that will likely pre-
vent it from receiving universal acceptance in the foreseeable future. The final definition (4) requiring
equal outcomes has little support among assessment professionals. The Standards note:
The position that fairness requires equality in overall passing rates for different groups has been
almost entirely repudiated in the professional testing literature . . . unequal outcomes at the group
level have no direct bearing on questions of test bias. (pp. 74-76)
It is unlikely that consensus in society at large or within the measurement community is imminent on
all matters of fairness in the use of tests. As noted earlier, fairness is defined in a variety of ways and
is not exclusively addressed in technical terms; it is subject to different definitions and interpretations
in different social and political circumstances. According to one view, the conscientious application
of an unbiased test in any given situation is fair, regardless of the consequences for individuals or
groups. Others would argue that fairness requires more than satisfying certain technical require-
ments. (p. 80)
“What would you use instead?” Teacher recommendations alone are less reliable and valid
than standardized test scores and are subject to many external influences. Whether special
education programs are of adequate quality to meet the needs of children is an important educational question, but it is distinct from the question of test bias, and the two are sometimes confused.
Bias and Extraneous Factors. The controversy over the use of mental tests is complicated
further by the fact that resolution of the cultural test bias question in either direction will not
resolve the problem of the role of nonintellective factors that may influence the test scores
of individuals from any group, minority or majority. Regardless of any group differences, it
is individuals who are tested and whose scores may or may not be accurate. Similarly, it is
individuals who are assigned to classes and accepted or rejected for employment or college
admission. Most assessment professionals acknowledge that a number of emotional and
motivational factors may impact performance on intelligence tests. The extent to which these
factors influence individuals as opposed to group performance is difficult to determine.
In 1969, the Association of Black Psychologists (ABP) adopted the following official policy
on educational and psychological testing (Williams, Dotson, Dow, & Williams, 1980):
The Association of Black Psychologists fully supports those parents who have chosen to
defend their rights by refusing to allow their children and themselves to be subjected to
achievement, intelligence, aptitude and performance tests which have been and are being
used to (a) label black people as uneducable; (b) place black children in “special” classes
and schools; (c) perpetuate inferior education in blacks; (d) assign black children to lower
educational tracks than whites; (e) deny black students higher educational opportunities; and
(f) destroy positive intellectual growth and development of black people.
Since 1968 the ABP has sought a moratorium on the use of all psychological and
educational tests with students from disadvantaged backgrounds. The ABP carried its call
for a moratorium to other professional organizations in psychology and education. In direct
response to the ABP call, the American Psychological Association’s (APA) Board of Direc-
tors requested its Board of Scientific Affairs to appoint a group to study the use of psycho-
logical and educational tests with disadvantaged students. The committee report (Cleary,
Humphreys, Kendrick, & Wesman, 1975) was subsequently published in the official journal
of the APA, American Psychologist.
Subsequent to the ABP's policy statement, other groups adopted similar policy statements on testing. These groups included the National Association for the Ad-
vancement of Colored People (NAACP), the National Education Association (NEA), the
National Association of Elementary School Principals (NAESP), the American Personnel
and Guidance Association (APGA), and others. The APGA called for the Association of
Measurement and Evaluation in Guidance (AMEG), a sister organization, to “develop and
disseminate a position paper stating the limitations of group intelligence tests particularly
and generally of standardized psychological, educational, and employment testing for low
socioeconomic and underprivileged and non-white individuals in educational, business,
and industrial environments.” It should be noted that the statements by these organizations
assumed that psychological and educational tests are biased, and that what is needed is that
the assumed bias be removed.
Many potentially legitimate objections to the use of educational and psychological
tests with minorities have been raised by black and other minority psychologists. Unfor-
tunately, these objections are frequently stated as facts on rational rather than empirical
grounds. The most frequently stated problems fall into one of the following categories
(Reynolds, 2000; Reynolds, Lowe, & Saenz, 1999; Reynolds & Ramsay, 2003).
Inappropriate Content
Black and other minority children have not been exposed to the material involved in the test
questions or other stimulus materials. The tests are geared primarily toward white middle-
class homes, vocabulary, knowledge, and values. As a result of inappropriate content, the
tests are unsuitable for use with minority children.
The early actions of the ABP were most instrumental in bringing these objections into greater public and professional awareness and subsequently prompted a considerable amount of research.
Some good readings on stereotype threat and its possible effects on test performance include the following works:
Nomura, J. M., Stinnett, T., Castro, F., Atkins, M., Beason, S., Linden, S., Hogan, K., Newry,
B., & Wiechmann, K. (March, 2007). Effects of Stereotype Threat on Cognitive Per-
formance of African Americans. Paper presented to the annual meeting of the National
Association of School Psychologists, New York.
Sackett, P. R., Hardison, C. M., & Cullen, M. J. (2004). On interpreting stereotype threat as
accounting for African-American differences on cognitive tests. American Psycholo-
gist, 59(1), 7-13.
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance
of African Americans. Journal of Personality and Social Psychology, 69, 797-811.
Steele, C. M., Spencer, S. J., & Aronson, J. (2002). Contending with group image: The psy-
chology of stereotype and social identity threat. In M. Zanna (Ed.), Advances in ex-
perimental social psychology (Vol. 23, pp. 379-440). New York: Academic Press.
When the objections were first raised, very little data existed to answer these charges. Contrary to the situation decades ago when the current controversy began, research now exists that examines many of these concerns. There is still relatively little research regarding labeling and the long-term social consequences of testing, and these areas should be investigated using diverse samples and numerous statistical techniques (Reynolds, Lowe, & Saenz, 1999).
In the Standards (AERA et al., 1999), bias is said to arise when deficiencies in a test itself or the manner in which it is used result in different meanings for scores earned by members of different identifiable subgroups (p. 74).
As we discussed in previous chapters, evidence for the validity of test score interpre-
tations can come from sources both internal and external to the test. Bias in a test may be
found to exist in any or all of these categories of validity evidence. Prior to examining the
evidence on the cultural test bias hypothesis, the concept of culture-free testing and the def-
inition of mean differences in test scores as test bias merit attention.
Cultural loading and cultural bias are not synonymous terms, though the concepts are frequently confused even in the professional literature. A test or test item can be culturally loaded without being culturally biased. Cultural loading refers to the degree of cultural specificity present in the test or individual items of the test. Certainly, the greater the cultural specificity of a test item, the greater the likelihood of the item being biased when used with individuals from other cultures. Virtually all tests in current use are bound in some way by their cultural specificity. Cultural loading must be viewed on a continuum from general (defining the culture in a broad, liberal sense) to specific (defining the culture in narrow, highly distinctive terms).
A number of attempts have been made to develop a culture-free (sometimes referred
to as culture fair) intelligence test. However, culture-free tests are generally inadequate
from a statistical or psychometric perspective (e.g., Anastasi & Urbina, 1997). It may be that
because intelligence is often defined in large part on the basis of behavior judged to be of
value to the survival and improvement of the culture and the individuals within that culture,
a truly culture-free test would be a poor predictor of intelligent behavior within the cultural
setting. Once a test has been developed within a culture (a culture loaded test) its generaliz-
ability to other cultures or subcultures within the dominant societal framework becomes a
matter for empirical investigation.
Differences in mean levels of performance on cognitive tasks between two groups have historically (and mistakenly) been believed to constitute test bias by a number of writers (e.g., Alley & Foster, 1978; Chinn, 1979; Hilliard, 1979). Those who support mean differences as an
indication of test bias state correctly that there is no valid a priori scientific reason to be-
lieve that intellectual or other cognitive performance levels should differ across race. It is
the inference that tests demonstrating such differences are inherently biased that is faulty.
Just as there is no a priori basis for deciding that differences exist, there is no a priori basis
for deciding that differences do not exist. From the standpoint of the objective methods of
science, a priori or premature acceptance of either hypothesis (differences exist versus dif-
ferences do not exist) is untenable. As stated in the Standards (AERA et al., 1999):
Most testing professionals would probably agree that while group differences in testing
outcomes should in many cases trigger heightened scrutiny for possible sources of test bias,
outcome differences across groups do not in themselves indicate that a testing application is
biased or unfair. (p. 75)
Some adherents to the "mean differences as bias" position also require that the distribution of test scores in each population or subgroup be identical prior to assuming that the test is nonbiased, regardless of its validity. Portraying a test as biased regardless of its purpose or the validity of its interpretations conveys an inadequate understanding of the psychometric construct and issues of bias. The mean difference definition of test bias is the most uniformly rejected of all definitions of test bias by psychometricians involved in investigating the problems of bias in assessment (Camilli & Shepard, 1994; Cleary et al., 1975; Cole & Moss, 1989; Hunter, Schmidt, & Rauschenberger, 1984; Reynolds, 1982, 1995, 2000).
Jensen (1980) discusses the mean differences as bias definition
in terms of the egalitarian fallacy. The egalitarian fallacy contends that all human populations
are in fact identical on all mental traits or abilities. Any differences with regard to any aspect
of the distribution of mental test scores indicate that something is wrong with the test itself. As
Jensen points out, such an assumption is totally scientifically unwarranted. There are simply
too many examples of specific abilities and even sensory capacities that have been shown to
unmistakably differ across human populations. The result of the egalitarian assumption then
is to remove the investigation of population differences in ability from the realm of scientific
inquiry, an unacceptable course of action (Reynolds, 1980).
The belief of many people in the mean differences as bias definition is quite likely
related to the nature—nurture controversy at some level. Certainly data reflecting racial
differences on various aptitude measures have been interpreted to indicate support for a
hypothesis of genetic differences in intelligence and implicating one group as superior to
another. Such interpretations understandably call for a strong emotional response and are
not defensible from a scientific perspective. Although IQ and other aptitude test score dif-
ferences undoubtedly occur, the differences do not indicate deficits or superiority by any
group, especially in relation to the personal worth of any individual member of a given
group or culture.
Criticisms of the content of test items as culturally biased typically take one of the following forms:

■ The items ask for information that minority or disadvantaged children have not had equal opportunity to learn.
■ The items require the child to use information in arriving at an answer that minority or disadvantaged children have not had equal opportunity to learn.
■ The scoring of the items is improper, unfairly penalizing the minority child, because the test author has a Caucasian middle-class orientation that is reflected in the scoring criterion. Thus minority children do not receive credit for answers that may be correct within their own cultures but do not conform to Anglocentric expectations.
■ The wording of the questions is unfamiliar to minority children, and even though they may "know" the correct answer, they are unable to respond because they do not understand the question.
These problems with test items cause the items to be more difficult than they should
actually be when used to assess minority children. This, of course, results in lower test
scores for minority children, a well-documented finding. Are these criticisms of test items
accurate? Do problems such as these account for minority—majority group score differences
on mental tests? These are questions for empirical resolution rather than armchair specula-
tion, which is certainly abundant in the evaluation of test bias. Empirical evaluation first
requires a working definition. We will define a biased test item as follows: an item is biased when examinees from different groups who have the same standing on the unidimensional trait the test measures do not have the same probability of answering the item correctly; that is, the item is differentially more difficult for one group than for another once overall ability level is taken into account.
There are two concepts of special importance in this definition. First, the group of items
must be unidimensional; that is, they must all be measuring the same factor or dimension of
aptitude or personality. Second, the items identified as biased must be differentially more
difficult for one group than another. The definition allows for score differences between
groups of unequal standing on the dimension in question but requires that the difference
be reflected on all items in the test and.in an equivalent fashion. A number of empirical
techniques are available to locate deviant test items under this definition. Many of these
techniques are based on item-response theory (IRT) and designed to detect differential item
functioning, or DIF. The relative merits of each method are the subject of substantial debate,
but in actual practice each method has led to similar general conclusions, though the specific
findings of each method often differ.
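As a concrete illustration of the logic behind DIF detection, the sketch below implements the widely used Mantel-Haenszel procedure, which compares the odds of a correct response for a reference and a focal group after matching examinees on total test score. This is one common (non-IRT) approach, offered as a minimal sketch with invented function and variable names; it is not the specific method of any study discussed in this chapter.

```python
import numpy as np

def mantel_haenszel_odds_ratio(item_correct, group, total_score):
    """Mantel-Haenszel common odds ratio for one item.

    item_correct: 0/1 responses to the studied item
    group:        0 = reference group, 1 = focal group
    total_score:  matching variable (e.g., total test score)
    Values near 1.0 suggest no DIF; values far from 1.0 flag the item.
    """
    item_correct = np.asarray(item_correct)
    group = np.asarray(group)
    total_score = np.asarray(total_score)
    num = den = 0.0
    for s in np.unique(total_score):          # stratify by matched total score
        idx = total_score == s
        n = idx.sum()
        a = np.sum((group[idx] == 0) & (item_correct[idx] == 1))  # reference, correct
        b = np.sum((group[idx] == 0) & (item_correct[idx] == 0))  # reference, incorrect
        c = np.sum((group[idx] == 1) & (item_correct[idx] == 1))  # focal, correct
        d = np.sum((group[idx] == 1) & (item_correct[idx] == 0))  # focal, incorrect
        num += a * d / n
        den += b * c / n
    return num / den if den > 0 else float("nan")
```

In use, an odds ratio well above or below 1.0 (evaluated with its accompanying significance test) flags the item for closer review rather than proving it biased.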
With multiple-choice tests, another level of complexity can easily be added to the
examination of content bias. With a multiple-choice question, typically three or four dis-
tracters are given in addition to the correct response. Distracters may be examined for their
attractiveness (the relative frequency with which they are chosen) across groups. When
distracters are found to be disproportionately attractive for members of any particular group, the item may be defined as biased. Research that includes thousands of subjects and nearly 100 published studies consistently finds very little bias in tests at the level of the individual item. Although some biased items are nearly always found, they seldom account for more than 2% to 5% of the variance in performance, and often, for every item favoring one group, there is an item favoring the other group.
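The distracter analysis described above can be carried out with a simple chi-square test of whether the frequency of choosing each response option differs by group. The counts below are invented solely to illustrate the computation; this is a minimal sketch, not data from any study cited here.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of examinees choosing each option (A-D) on one item,
# cross-tabulated by group; option B is the keyed (correct) response.
counts = [
    [120, 300, 40, 40],   # group 1
    [100, 210, 110, 80],  # group 2
]

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
# A significant result indicates the options (including distracters) are
# disproportionately attractive to one group, flagging the item for review.
```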
Earlier in the study of item bias it was hoped that the empirical analysis of tests at the
item level would result in the identification of a category of items having similar content
as biased and that such items could then be avoided in future test development (Flaugher,
1978). Very little similarity among items determined to be biased has been found. No one
has been able to identify those characteristics of an item that cause the item to be biased. In
summarizing the research on item bias or differential item functioning (DIF), the Standards
(AERA et al., 1999) note:
Although DIF procedures may hold some promise for improving test quality, there has been
little progress in identifying the cause or substantive themes that characterizes items exhibiting
DIF. That is, once items on a test have been statistically identified as functioning differently
from one examinee group to another, it has been difficult to specify the reasons for the differ-
ential performance or to identify a common deficiency among the identified items. (p. 78)
It does seem that poorly written, sloppy, and ambiguous items tend to be identified as bi-
ased with greater frequency than those items typically encountered in a well-constructed,
standardized instrument.
A common practice of test developers seeking to eliminate “bias” from their newly
developed educational and psychological tests has been to arrange for a panel of expert mi-
nority group members to review all proposed test items. Any item identified as “culturally
biased” by the panel of experts is then expurgated from the instrument. Because, as previ-
ously noted, no detectable pattern or common characteristic of individual items statistically
shown to be biased has been observed (given reasonable care at the item writing stage), it
seems reasonable to question the armchair or expert minority panel approach to determin-
ing biased items. Several researchers, using a variety of psychological and educational
tests, have identified items as being disproportionately more difficult for minority group
members than for members of the majority culture and subsequently compared their results
with a panel of expert judges. Studies by Jensen (1976) and Sandoval and Mille (1979) are
representative of the methodology and results of this line of inquiry.
After identifying the 8 most racially discriminating and 8 least racially discriminating
items on the Wonderlic Personnel Test, Jensen (1976) asked panels of 5 black psychologists
and 5 Caucasian psychologists to sort out the 8 most and 8 least discriminating items when
only these 16 items were presented to them. The judges sorted the items at a no better than
chance level. Sandoval and Mille (1979) conducted a somewhat more extensive analysis
using items from the WISC-R. These two researchers had 38 black, 22 Hispanic, and 40
white university students from Spanish, history, and education classes identify items from
the WISC-R that are more difficult for a minority child than a white child and items that are
equally difficult for each group. A total of 45 WISC-R items was presented to each judge;
these items included the 15 most difficult items for blacks as compared to whites, the 15
most difficult items for Hispanics as compared to whites, and the 15 items showing the most
nearly identical difficulty indexes for minority and white children. The judges were asked to
read each question and determine whether they thought the item was (1) easier for minority
than white children, (2) easier for white than minority children, or (3) of equal difficulty for
white and minority children. Sandoval and Mille’s (1979) results indicated that the judges
were not able to differentiate between items that were more difficult for minorities and items
that were of equal difficulty across groups. The effects of the judges’ ethnic backgrounds on
the accuracy of their item bias judgments were also considered. Minority and nonminority
judges did not differ in their ability to identify accurately biased items nor did they differ
with regard to the type of incorrect identification they tended to make. Sandoval and Mille’s
(1979) two major conclusions were that “(1) judges are not able to detect items which are
more difficult for a minority child than an Anglo child, and (2) the ethnic background of the
judge makes no difference in accuracy of item selection for minority children” (p. 6). Even
without empirical support for its validity, the use of expert panels of minorities continues
but for a different purpose. Members of various ethnic, religious, or other groups that have a
cultural system in some way unique may well be able to identify items that contain material
that is offensive, and the elimination of such items is proper.
From a large number of studies employing a wide range of methodology a relatively
clear picture emerges. Content bias in well-prepared standardized tests is irregular in its
occurrence, and no common characteristics of items that are found to be biased can be
ascertained by expert judges (minority or nonminority). The variance in group score dif-
ferences on mental tests associated with ethnic group membership when content bias has
been found is relatively small (typically ranging from 2% to 5%). Although the search for
common biased item characteristics will continue, cultural bias in aptitude tests has found
no consistent empirical support in a large number of actuarial studies contrasting the perfor-
mance of a variety of ethnic and gender groups on items of the most widely employed intel-
ligence scales in the United States. Most major test publishing companies do an adequate
job of reviewing their assessments for the presence of content bias. Nevertheless, certain
standardized tests have not been examined for the presence of content bias, and research
with these tests should continue regarding potential content bias with different ethnic groups
(Reynolds & Ramsay, 2003).
There is no single method for the accurate determination of the degree to which educational
and psychological tests measure a distinct construct. The defining of bias in construct measure-
ment then requires a general statement that can be researched from a variety of viewpoints with
a broad range of methodology. The following rather parsimonious definition is proffered: bias exists in construct measurement when a test is shown to measure different hypothetical traits (psychological constructs) for one group than for another, or to measure the same trait but with differing degrees of accuracy.
One of the most common means of investigating construct bias is comparative factor analysis, which identifies clusters of subtests or items that measure the same underlying dimension. For example, if several subtests of an intelligence scale load highly on (are members of) the same factor, then if a group of indi-
viduals score high on one of these subtests, they would be expected to score at a high level
on other subtests that load highly on that factor. Psychometricians attempt to determine
through a review of the test content and correlates of performance on the factor in question
what psychological trait underlies performance; or, in a more hypothesis testing approach,
they will make predictions concerning the pattern of factor loadings. Hilliard (1979), one
of the more vocal critics of IQ tests on the basis of cultural bias, has pointed out that one
of the potential ways of studying bias involves the comparison of factor analytic results of
test studies across race.
If the IQ test is a valid and reliable test of “innate” ability or abilities, then the factors which
emerge on a given test should be the same from one population to another, since “intel-
ligence” is asserted to be a set of mental processes. Therefore, while the configuration of
scores of a particular group on the factor profile would be expected to differ, logic would
dictate that the factors themselves would remain the same. (p. 53)
Although not agreeing that identical factor analyses of an instrument speak to the
“innateness” of the abilities being measured, consistent factor analytic results across popu-
lations do provide strong evidence that whatever is being measured by the instrument is
being measured in the same manner and is in fact the same construct within each group.
The information derived from comparative factor analysis across populations is directly
relevant to the use of educational and psychological tests in diagnosis and other decision-
making functions. Psychologists, in order to make consistent interpretations of test score
data, must be certain that the test(s) measures the same variable across populations.
In contrast to Hilliard’s (1979) strong statement that factorial similarity across eth-
nicity has not been reported “in the technical literature,” a number of such studies have ap-
peared over the past three decades, dealing with a number of different tasks. These studies
have for the most part focused on aptitude or intelligence tests, the most controversial of all
techniques of measurement. Numerous studies of the similarity of factor analysis outcomes
for children of different ethnic groups, across gender, and even diagnostic groupings have
been reported over the past 30 years. Results reported are highly consistent in revealing that
the internal structure of most standardized tests varies quite little across groups. Compari-
sons of the factor structure of the Wechsler Intelligence Scales (e.g., WISC-III, WAIS-III)
and the Reynolds Intellectual Assessment Scales (Reynolds, 2002) in particular and other
intelligence tests find the tests to be highly factorially similar across gender and ethnic-
ity for blacks, whites, and Hispanics. The structure of ability tests for other groups has
been researched less extensively, but evidence thus far with Chinese, Japanese, and Native
Americans does not show substantially different factor structures for these groups.
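Factorial similarity of the kind described here is commonly summarized with a coefficient of congruence (Tucker's phi) computed between the factor loadings obtained in separate groups, with values near 1.0 indicating essentially the same factor. The sketch below uses made-up loadings purely for illustration; it is not a reanalysis of any of the tests named above.

```python
import numpy as np

def congruence(loadings_a, loadings_b):
    """Tucker's coefficient of congruence between two factor-loading vectors."""
    a = np.asarray(loadings_a, float)
    b = np.asarray(loadings_b, float)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Hypothetical loadings of five subtests on the same factor in two groups
group_1 = [0.78, 0.71, 0.69, 0.74, 0.66]
group_2 = [0.75, 0.73, 0.65, 0.70, 0.68]

print(f"Congruence coefficient: {congruence(group_1, group_2):.3f}")  # near 1.0
```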
As is appropriate for studies of construct measurement, comparative factor analy-
sis has not been the only method of determining whether bias exists. Another method of
investigation involves the comparison of internal-consistency reliability estimates across
groups. As described in Chapter 4, internal-consistency reliability is determined by the
degree to which the items are all measuring a similar construct. The internal-consistency
reliability coefficient reflects the accuracy of measurement of the construct. To be unbiased
with regard to construct validity, internal-consistency estimates should be approximately
equal across race. This characteristic of tests has been investigated for a number of popular
aptitude tests for blacks, whites, and Hispanics with results similar to those already noted.
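A minimal sketch of this internal-consistency comparison follows: coefficient alpha is computed separately for each group on the same set of items, and approximately equal values are taken as evidence that the construct is measured with comparable accuracy across groups. The simulated item responses and group labels below are hypothetical, used only to show the computation.

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (examinees x items) matrix of item scores."""
    items = np.asarray(items, float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical 0/1 responses to the same 20 items for two groups of 500 examinees
rng = np.random.default_rng(1)
difficulty = rng.normal(0, 1, 20)
ability_a = rng.normal(0.0, 1, 500)
ability_b = rng.normal(-0.5, 1, 500)
p_a = 1 / (1 + np.exp(-(ability_a[:, None] - difficulty)))
p_b = 1 / (1 + np.exp(-(ability_b[:, None] - difficulty)))
items_a = (rng.random((500, 20)) < p_a).astype(int)
items_b = (rng.random((500, 20)) < p_b).astype(int)

print(f"alpha, group A: {cronbach_alpha(items_a):.2f}")
print(f"alpha, group B: {cronbach_alpha(items_b):.2f}")  # similar values suggest comparable accuracy
```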
Many other methods of comparing construct measurement across groups have been
used to investigate bias in tests. These methods include the correlation of raw scores with
age, comparison of item-total correlations across groups, comparisons of alternate form and
test-retest correlations, evaluation of kinship correlation and differences, and others (see
Reynolds, 2002, for a discussion of these methods). The general results of research with
these methods have been supportive of the consistency of construct measurement of tests
across ethnicity and gender.
Construct measurement of a large number of popular psychometric assessment instruments has been investigated across ethnicity and gender with a divergent set of methodologies. No consistent evidence of bias in construct measurement has been found in the many prominent standardized tests investigated. This leads to the conclusion that these psychological tests function in essentially the same manner across ethnicity and gender, the test materials are perceived and reacted to in a similar manner, and the tests are measuring the same construct with equivalent accuracy for blacks, whites, Hispanics, and other American minorities for both sexes. Differential validity or single-group validity has not been found and likely is not an existing phenomenon with regard to well-constructed standardized psychological and educational tests. These tests appear to be reasonably unbiased for the groups investigated, and mean score differences do not appear to be an artifact of test bias (Reynolds & Ramsay, 2003).
Internal analyses of bias (such as with item content and construct measurement) are less
confounded than analyses of bias in prediction due to the potential problems of bias in the
criterion measure. Prediction is also strongly influenced by the reliability of criterion mea-
sures, which frequently is poor. (The degree of relation between a predictor and a criterion
is restricted as a function of the square root of the product of the reliabilities of the two vari-
ables.) Arriving at a consensual definition of bias in prediction is also a difficult task. Yet,
from the standpoint of the traditional practical applications of aptitude and intelligence tests
in forecasting probabilities of future performance levels, prediction is the most crucial use
of test scores to examine. Looking directly at bias as a characteristic of a test and not a selection model, Cleary et al.'s (1975) definition of test fairness, as restated here in modern terms, is a clear, direct statement of test bias with regard to prediction:

A test is considered biased with respect to prediction when the inference drawn from the test score is not made with the smallest feasible random error or if there is constant error in an inference or prediction as a function of membership in a particular group. (After Reynolds, 1982, p. 201)
The evaluation of bias in prediction under the Cleary et al. (1975) definition (known as the regression definition) is quite straightforward. With simple regressions, predictions take the form Y′ = aX + b, where a is the regression coefficient and b is a constant. When this equation is graphed (forming a regression line), a represents the slope of the regression line and b the Y-intercept. Given this definition of bias in prediction, nonbias requires errors in prediction to be independent of group membership, and the regression line formed for any pair of variables must be the same for each group for whom predictions are to be made. Whenever the slope or the intercept differs significantly across groups, there is bias in prediction if one attempts to use a regression equation based on the combined groups. When the regression equations for two (or more) groups are equivalent, prediction is the same for those groups. This condition is referred to variously as homogeneity of regression across groups, simultaneous regression, or fairness in prediction. Homogeneity of regression is illustrated in Figure 16.1, in which the regression line shown is equally appropriate for making predictions for all groups. Whenever homogeneity of regression across groups does not occur, separate regression equations should be used for each group concerned.
FIGURE 16.1 Equal Slopes and Intercepts
Note: Equal slopes and intercepts result in homogeneity of regression in which the
regression lines for different groups are the same.
In actual clinical practice, regression equations are seldom generated for the prediction
of future performance. Rather, some arbitrary, or perhaps statistically derived, cutoff score is
determined, below which failure is predicted. For school performance, a score of 2 or more
standard deviations below the test mean is used to infer a high probability of failure in the
regular classroom if special assistance is not provided for the student in question. Essentially
then, clinicians are establishing prediction equations about mental aptitude that are assumed
to be equivalent across race, sex, and so on. Although these mental equations cannot be
readily tested across groups, the actual form of criterion prediction can be compared across
groups in several ways. Errors in prediction must be independent of group membership. If
regression equations are equal, this condition is met. To test the hypothesis of simultaneous
regression, regression slopes and regression intercepts must both be compared.
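The comparison of slopes and intercepts is, in practice, a regression model comparison. The following Python sketch is a minimal illustration under assumed, simulated data (the function names, variable names, and numbers are hypothetical and not part of the chapter): it fits a pooled regression, then a model that adds a group dummy and a group-by-predictor interaction, and forms the F test of whether those two terms improve prediction.

import numpy as np

def fit_ols(X, y):
    # Ordinary least squares via numpy; returns coefficients and residual sum of squares.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return beta, rss

def homogeneity_of_regression(x, y, group):
    # Chow-style F test of whether slope and intercept differ across two groups.
    # x, y: predictor and criterion scores; group: 0/1 group codes.
    n = len(y)
    ones = np.ones(n)
    X_pooled = np.column_stack([ones, x])                  # common intercept and slope
    _, rss_pooled = fit_ols(X_pooled, y)
    X_full = np.column_stack([ones, x, group, x * group])  # adds intercept and slope differences
    _, rss_full = fit_ols(X_full, y)
    q = 2                                                  # restrictions tested: dummy + interaction
    df_resid = n - X_full.shape[1]
    F = ((rss_pooled - rss_full) / q) / (rss_full / df_resid)
    return F, q, df_resid

# Simulated example: both groups are generated from the same equation,
# so the expected outcome is a small F (homogeneity of regression).
rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, size=n)
x = rng.normal(100, 15, size=n)
y = 0.5 * x + 10 + rng.normal(0, 5, size=n)
F, df1, df2 = homogeneity_of_regression(x, y, group)
print(f"F({df1}, {df2}) = {F:.2f}")

A small F relative to the F distribution with (2, n - 4) degrees of freedom is consistent with a single regression line for both groups; a large F signals slope or intercept differences and, under the regression definition, bias in prediction if the pooled equation were used.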
When homogeneity of regression does not occur, three basic conditions can result:
(a) Intercept constants differ, (b) regression coefficients (slopes) differ, or (c) slopes and inter-
cepts differ. These conditions are illustrated in Figures 16.2, 16.3, and 16.4, respectively.
When intercept constants differ, the resulting bias in prediction is constant across the
range of scores. That is, regardless of the level of performance on the independent vari-
able, the direction and degree of error in the estimation of the criterion (systematic over- or
underprediction) will remain the same. (This case is depicted in Figure 16.2: equal slopes with differing intercepts result in parallel regression lines that produce a constant bias in prediction.) When regression coefficients differ and intercepts
are equivalent, the direction of the bias in prediction will remain constant, but the amount
of error in prediction will vary directly as a function of the distance of the score on the in-
dependent variable from the origin. With regression coefficient differences, then, the higher
the score on the predictor variable, the greater the error of prediction for the criterion. When
both slopes and intercepts differ, the situation becomes even more complex: Both the de-
gree of error in prediction and the direction of the “bias” will vary as a function of level of
performance on the independent variable.
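A brief numerical illustration may help; the regression equations below are invented for illustration only and do not come from any study discussed in this chapter.

% Equal slopes, different intercepts (constant bias in prediction):
\[ \hat{Y}_1 = 0.5X + 10, \qquad \hat{Y}_2 = 0.5X + 5 \]
% A single pooled equation misestimates Group 2's criterion by the same
% amount at every level of X.
% Different slopes, equal intercepts (bias that changes with the score level):
\[ \hat{Y}_1 = 0.6X + 10, \qquad \hat{Y}_2 = 0.4X + 10 \]
% The two predictions now differ by 0.2X, so the error made by a common
% equation grows as the predictor score moves away from the origin.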
A considerable body of literature has developed over the last 30 years regarding dif-
ferential prediction of tests across ethnicity for employment selection, college admissions,
and school or academic performance generally. In an impressive review of 866 black—white
prediction comparisons from 39 studies of test bias in personnel selection, Hunter, Schmidt,
and Hunter (1979) concluded that there was no evidence to substantiate hypotheses of dif-
ferential or single-group validity with regard to the prediction of job performance across
race for blacks and whites. A similar conclusion has been reached by other independent
researchers (e.g., Reynolds, 1995). A number of studies have also focused on differential
validity of the Scholastic Aptitude Test (SAT) in the prediction of college performance
(typically measured by grade point average). In general these studies have found either no
difference in the prediction of criterion performance for blacks and whites or a bias (under-
prediction of the criterion) against whites. When bias against whites has been found, the
differences between actual and predicted criterion scores, while statistically significant,
have generally been quite small.
A number of studies have investigated bias in the prediction of school performance
for children. Studies of the prediction of future performance based on IQ tests for children
have covered a variety of populations, including normal as well as referred children; high-poverty, inner-city children; rural black children; and Native American groups. Studies of preschool
as well as school-age children have been carried out. Almost without exception, those stud-
ies have produced results that can be adequately depicted by Figure 16.1, that is, equivalent
prediction for all groups. When this has not been found, intercepts have generally differed, resulting in a constant bias in prediction. Yet, the resulting bias has not been in the popu-
larly conceived direction. The bias identified has tended to overpredict how well minority
children will perform in academic areas and to underpredict how well white children will
perform. Reynolds (1995) provides a thorough review of studies investigating the prediction
of school performance in children.
With regard to bias in prediction, the empirical evidence suggests conclusions simi-
lar to those regarding bias in test content and other internal characteristics. There is no
strong evidence to support contentions of differential or single-group validity. Bias occurs
infrequently and with no apparently observable pattern, except with regard to instruments
of poor reliability and high specificity of test content. When bias oc-
curs, it usually takes the form of small overpredictions for low SES, disadvantaged ethnic minority children, or other low-scoring groups. These overpredictions are unlikely to account for adverse placement or diagnosis in these groups (Reynolds & Ramsay, 2003).

Summary
A considerable body of literature currently exists failing to substantiate cultural bias against
native-born American ethnic minorities with regard to the use of well-constructed, ade-
quately standardized intelligence and aptitude tests. With respect
to personality scales, the evidence is promising yet far more preliminary and thus considerably less conclusive. Despite the existing evidence, we do not expect the furor over the cultural test bias hypothesis to be resolved soon. Bias in psychological testing will remain a torrid issue for some time. Psychologists and educators will need to keep abreast of new findings in the area. As new techniques and better methodology are developed and more specific populations examined, the findings of bias now seen as random and infrequent may become better understood and seen to indeed display a correctable pattern.
In the meantime, however, one cannot ethically fall prey to
the sociopoliticolegal Zeitgeist of the times and infer bias where none exists. Psychologists
and educators cannot justifiably ignore the fact that low IQ, ethnic, disadvantaged children
are just as likely to fail academically as are their white, middle-class counterparts. Black
adolescent delinquents with deviant personality scale scores and exhibiting aggressive be-
havior need treatment environments as much as their white peers. The potential outcome for
score interpretation (e.g., therapy versus prison, special education versus regular education)
cannot dictate the psychological meaning of test performance. We must practice intelligent
testing (Kaufman, 1994). We must remember that it is the purpose of the assessment process
to beat the prediction made by the test, to provide insight into hypotheses for environmental
interventions that prevent the predicted failure or subvert the occurrence of future maladap-
tive behavior.
Test developers are also going to have to be sensitive to the issues of bias, perform-
ing appropriate checks for bias prior to test publication. Progress is being made in all of
these areas. However, we must hold to the data even if we do not like them. At present, only
scattered and inconsistent evidence for bias exists. The few findings of bias do suggest two
guidelines to follow in order to ensure nonbiased assessment: (1) Assessment should be con-
ducted with the most reliable instrumentation available, and (2) multiple abilities should be
assessed. In other words, educators and psychologists need to view multiple sources of ac-
curately derived data prior to making decisions concerning individuals. One hopes that this
is what has actually been occurring in the practice of assessment, although one continues to
hear isolated stories of grossly incompetent placement decisions being made. This is not to
say educators or psychologists should be blind to an individual’s cultural or environmental
background. Information concerning the home, community, and school environment must
all be evaluated in individual decisions. As we noted, it is the purpose of the assessment
process to beat the prediction and to provide insight into hypotheses for environmental
interventions that prevent the predicted failure.
Without question, scholars have not conducted all the research that needs to be done
to test the cultural test bias hypothesis and its alternatives. A number and variety of criteria
need to be explored further before the question of bias is empirically resolved. Many dif-
ferent achievement tests and teacher-made, classroom-specific tests need to be employed
in future studies of predictive bias. The entire area of differential validity of tests in the af-
fective domain is in need of greater exploration. A variety of views toward bias have been
expressed in many sources; many with differing opinions offer scholarly, nonpolemical
attempts directed toward a resolution of the issue. Obviously, the fact that such different
views are still held indicates resolution lies in the future. As far as the present situation is
concerned, clearly all the evidence is not in. With regard to a resolution of bias, we believe
that were a scholarly trial to be held, with a charge of cultural bias brought against mental
tests, the jury would likely return the verdict other than guilty or not guilty that is allowed in
British law—“not proven.” Until such time as a true resolution of the issues can take place,
we believe the evidence and positions taken in this chapter accurately reflect the state of our
empirical knowledge concerning bias in mental tests.
Comparative factor analysis, p. 441
Content bias, p. 438
Cultural bias, p. 424
Cultural loading, p. 436
Cultural test bias hypothesis, p. 422
Culture-free tests, p. 436
Differential predictive validity, p. 433
Examiner and language bias, p. 433
Homogeneity of regression, p. 443
Inappropriate standardization, p. 433
Inequitable social consequences, p. 433
Mean difference definition of test bias, p. 437
Prediction bias, p. 442
Regression intercepts, p. 444
Regression slopes, p. 444
Test bias, p. 424
RECOMMENDED READINGS
Cleary, T. A., Humphreys, L. G., Kendrick, S. A., & Wesman, A. (1975). Educational uses of tests with disadvantaged students. American Psychologist, 30, 15-41. This is the report of a group appointed by the APA's Board of Scientific Affairs to study the use of psychological and educational tests with disadvantaged students—an early and influential article.

Halpern, D. F. (1997). Sex differences in intelligence: Implications for education. American Psychologist, 52, 1091-1102. A good article that summarizes the literature on sex differences with an emphasis on educational implications.

Neisser, U., Boodoo, G., Bouchard, T., Boykin, A., Brody, N., Ceci, S., Halpern, D., Loehlin, J., Perloff, R., Sternberg, R., & Urbina, S. (1996). Intelligence: Knowns and unknowns. American Psychologist, 51, 77-101. This report of an APA task force provides an excellent review of the research literature on intelligence.

Reynolds, C. R. (1995). Test bias in the assessment of intelligence and personality. In D. Saklofske & M. Zeidner (Eds.), International handbook of personality and intelligence (pp. 545-573). New York: Plenum Press. This chapter provides a thorough review of the literature.

Reynolds, C. R. (2000). Why is psychometric research on bias in mental testing so often ignored? Psychology, Public Policy, and Law, 6, 144-150. This article provides a particularly good discussion of test bias in terms of public policy issues.

Reynolds, C. R., & Ramsay, M. C. (2003). Bias in psychological assessment: An empirical review and recommendations. In J. R. Graham & J. A. Naglieri (Eds.), Handbook of psychology: Assessment psychology (pp. 67-93). New York: Wiley. This chapter also provides an excellent review of the literature.

Suzuki, L. A., & Valencia, R. R. (1997). Race-ethnicity and measured intelligence: Educational implications. American Psychologist, 52, 1103-1114. A good discussion of the topic with special emphasis on educational implications and alternative assessment methods.
Best Practices in
Educational Assessment
While teachers might not always be aware of it, their positions endow them with consid-
erable power. Teachers make decisions on a day-to-day basis that significantly impact their
students, and many of these decisions involve information garnered from educational assessments. As a result, it is the teacher's responsibility to make sure that the assessments they use, whether they are professionally developed tests or teacher-constructed tests, are developed, administered, scored, and interpreted in a technically, ethically, and legally sound manner. This chapter provides some guidelines that will help you ensure that your assessment practices are sound.
Much of the information discussed in this chapter has been introduced in previous
chapters. We will also incorporate guidelines that are presented in existing professional
codes of ethics and standards of professional practice. One of the principal sources is the
Code of Professional Responsibilities in Educational Measurement that was prepared by
the National Council on Measurement in Education (NCME, 1995). This code is presented
in its entirety in Appendix B. The Code of Professional Responsibilities in Educational
Measurement specifies a set of general responsibilities for NCME members who are involved in educational assessment.
Although these expectations are explicitly directed toward NCME members, all educational
professionals who are involved in assessment activities are well served by following these
general guidelines.
The Code of Professional Responsibilities in Educational Measurement (NCME, 1995)
delineates eight major areas of assessment activity, five of which are most applicable to teach-
ers. These are (1) Developing Assessments; (2) Selecting Assessments; (3) Administering
Assessments; (4) Scoring Assessments; and (5) Interpreting, Using, and Communicating As-
sessment Results. We will use these categories to organize our discussion of best practices in
educational assessment, and we will add an additional section, Responsibilities of Test Takers.
In addition to the Code of Professional Responsibilities in Educational Measurement (NCME,
1995), the following guidelines reflect a compilation of principles presented in the Standards
for Educational and Psychological Testing (AERA et al., 1999), The Student Evaluation
Standards (Joint Committee on Standards for Educational Evaluation, 2003), Code of Fair
Testing Practices in Education (Joint Committee on Testing Practices, 1988), and the Rights
and Responsibilities of Test Takers: Guidelines and Expectations (JCTP, 1998).
Develop Assessment Procedures That Are Appropriate for Measuring the Specified
Educational Outcomes. Once the table of specifications is developed, it should be
used to guide the development of items and scoring procedures. Guidelines for developing
items of different types were presented in Chapters 8, 9, and 10. Selected-response items,
constructed-response items, and performance assessments and portfolios all have their own
specific strengths and weaknesses, and are appropriate for assessing some objectives and in-
appropriate for assessing others. It is the test developer’s responsibility to determine which
procedures are most appropriate for assessing specific learning objectives. In the past it was
fairly common for teachers to use a limited number of assessment procedures (e.g., multiple-choice, true-false, or essay items). However, it has become more widely recognized that no single assessment format can effectively measure the diverse range of educational outcomes emphasized in today's schools. As a result, it is important for teachers to use a diverse array of procedures that are carefully selected to meet the specific purposes of the assessment and to facilitate teaching and achievement (e.g., Linn & Gronlund, 2000).
Develop Explicit Scoring Criteria. Practically all types of assessments require clearly
stated criteria for scoring the items. This can range from fairly straightforward scoring keys
for selected-response items and short-answer items to detailed scoring rubrics for evaluat-
ing performance on extended-response essays and performance assessments. Whatever the
format, developing the items and the scoring criteria should be an integrated process guided
by the table of specifications. Scoring procedures should be consistent with the purpose of
the test and facilitate valid score interpretations (AERA et al., 1999).
Develop Clear Guidelines for Test Administration. All aspects of test administration
should be clearly specified. This includes instructions to students taking the test, time limits,
testing conditions (e.g., classroom or laboratory), and any equipment that will be utilized.
Teachers should develop administration instructions in sufficient detail so that other educa-
tors are able to replicate the conditions if necessary.
Plan Accommodations for Test Takers with Disabilities and Other Special Needs. As discussed in Chapter 15, it is becoming more common for regular education teachers to have students with disabilities in their classroom. When developing assessments, some thought should be given to what types of accommodations may be necessary for these students or other students with special needs.
Evaluate the Technical Properties of Assessments. After administering the test, teach-
ers should use quantitative and qualitative item analysis procedures to evaluate and refine
their assessments (discussed in Chapter 6). Teachers should also
perform preliminary analyses that will allow them to assess the reliability and validity of their measurements. Reliability and validity were discussed in Chapters 4 and 5. Although it might be difficult for teachers to perform some of the more complex reliability and validity analyses, at a minimum they should use some of the simplified procedures outlined in the appropriate chapters. Table 17.1 presents a summary of these guidelines for developing assessments.
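As a rough sketch of the kind of quantitative item analysis and preliminary reliability check referred to here, the Python fragment below computes item difficulty, a simple upper-lower discrimination index, and KR-20 for dichotomously scored items. The data, the function name, and the 27% grouping rule are assumptions made for illustration; this is one of several defensible ways to carry out such an analysis, not a prescribed procedure.

import numpy as np

def item_analysis(scores):
    # scores: 2-D array of 0/1 item scores, one row per student, one column per item.
    # Returns item difficulty (proportion correct), an upper-lower discrimination
    # index (top 27% minus bottom 27% proportion correct), and KR-20 reliability.
    scores = np.asarray(scores, dtype=float)
    n_students, n_items = scores.shape
    total = scores.sum(axis=1)
    difficulty = scores.mean(axis=0)                       # p value for each item
    k = max(1, int(round(0.27 * n_students)))              # size of upper and lower groups
    order = np.argsort(total)
    lower, upper = scores[order[:k]], scores[order[-k:]]
    discrimination = upper.mean(axis=0) - lower.mean(axis=0)
    p, q = difficulty, 1 - difficulty
    var_total = total.var(ddof=1)
    kr20 = (n_items / (n_items - 1)) * (1 - np.sum(p * q) / var_total)
    return difficulty, discrimination, kr20

# Tiny made-up data set: 6 students by 4 items (1 = correct, 0 = incorrect).
data = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0]]
p, d, rel = item_analysis(data)
print("difficulty:", np.round(p, 2))
print("discrimination:", np.round(d, 2))
print("KR-20:", round(rel, 2))

Items with very low difficulty values or near-zero (or negative) discrimination values are natural candidates for revision, and a low KR-20 value signals that the scores should be interpreted cautiously.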
Teachers usually have little say in the selection of the standardized assessments whose use by their state or district is mandated. However, teachers are often involved in the selection and administration of
other standardized assessment instruments. As a result, they incur numerous responsibilities
associated with this role. The guiding principle, as when developing assessments, is to ensure
that the assessments meet high professional standards and are valid for the intended purposes.
Here are a few guidelines for selecting assessments that meet professional standards.
Select Assessments That Have Been Validated for the Intended Purpose. As we have emphasized throughout this text, validity is a fundamental consideration when developing or selecting a test (see Chapter 5). Professionally developed assessments should clearly specify the recommended interpretations of test scores and provide
a summary of the validity evidence supporting each interpretation.
However, in the end it is the person selecting the test who is respon-
sible for determining whether the assessment is appropriate for use in the particular setting
(AERA et al., 1999). As an example, in selecting achievement tests it is important that the
content of the assessment correspond with the content of the curriculum. The essential
questions are “How will the assessment information be used?” and “Has the proposed as-
sessment been validated for those uses?”
Select Assessments with Normative Data That Are Representative of the Target Population. The validity of norm-referenced interpretations is dependent on how representative the normative or standardization group is of the target population (see Chapter 3).
The fundamental question is “Does the normative sample adequately
represent the type of test takers the test will be used with?” It is also
important to consider how current the norms are because their usefulness diminishes over
time (AERA et al., 1999).

Select Assessments That Produce Reliable Results. In Chapter 4 we discussed the levels of reliability recommended for different applications or uses. For example, when making high-stakes decisions it is important to utilize assessment results that are highly reliable (e.g., rxx > .95).
Select Tests That Are Fair. Although no assessment procedure is absolutely free from
bias, efforts should be made to select assessments that have been shown to be relatively
free from bias due to race, gender, or ethnic backgrounds. Bias in educational assessment is
discussed in detail in Chapter 16.
Select Assessments Based on a Thorough Review of the Available Literature. The selection of assessment procedures can have significant consequences for a large number of individuals. As a re-
sult, the decision should be based on a careful and thorough review
of the available information. It is appropriate to begin this review by examining information
and material the test publishers provide. This can include catalogs, test manuals, specimen
test sets, score reports, and other supporting documentation. However, the search should not
stop here, and you should seek out independent evaluations and reviews of the tests you are
considering. A natural question is “Where can I access information about assessments?” Four
of the most useful references are the Mental Measurements Yearbook, Tests in Print, Tests,
and Test Critiques. These resources can be located in the reference section of most college
and larger public libraries. The Testing Office of the American Psychological Association
Science Directorate (APA, 2008) provides the following description of these resources:
Mental Measurements Yearbook (MMY). Published by the Buros Institute for Mental Mea-
surements, the Mental Measurements Yearbook (MMY) lists tests alphabetically by title
and is an invaluable resource for researching published assessments. Each listing provides
descriptive information about the test, including test author, publication dates, intended
population, forms, prices, and publisher. It contains additional information regarding the
availability of reliability, validity, and normative data, as well as scoring and reporting ser-
vices. Most listings include one or more critical reviews by qualified assessment experts.
Tests in Print (TIP). Also published by the Buros Institute for Mental Measurements,
Tests in Print (TIP) is a bibliographic encyclopedia of information on practically every
published test in psychology and education. Each listing includes the test title, intended
population, publication date, author, publisher, and references. TIP does not contain critical
reviews or psychometric information, but it does serve as a master index to the Buros Insti-
tute reference series on tests. In the TIP the tests are listed alphabetically, within subjects
(e.g., achievement tests, intelligence tests). There are also indexes that can help you locate
specific tests. After locating a test that meets your criteria, you can turn to the Mental Mea-
surements Yearbook for more detailed information on the test.
Test Critiques. Also published by Pro-Ed, Inc., Test Critiques is designed to be a companion
to Tests. Test Critiques contains a tripart listing for each test that includes Introduction (e.g., in-
formation on the author, publisher, and purposes), Practical Applications/Uses (e.g., intended
population, administration, scoring, and interpretation guidelines), and Technical Aspects
(e.g., information on reliability, validity), followed by a critical review of the test. Its user-
friendly style makes it appropriate for individuals with limited training in psychometrics.
In addition to these traditional references, Test Reviews Online is a new Web-based
service of the Buros Institute of Mental Measurements (www.unl.edu/buros). This service
makes test reviews available online to individuals precisely as they appear in the Mental
Measurements Yearbook. For a relatively small fee (currently $15), users can download
information on any of over 2,000 tests that includes specifics on test purpose, population,
publication date, administration time, and descriptive test critiques. For more detailed in-
formation on these and other resources, the Testing Office of the American Psychological
Association Science Directorate has prepared an information sheet on “Finding Informa-
tion on Psychological Tests." This can be requested by visiting its Web site (www.apa.org/science/faq-findtests.html).
Select and Use Only Assessments That You Are Qualified to Administer, Score, and Interpret. Because the administration, scoring, and interpretation of many psychological and educational tests require advanced training, it is important to select and use only those
tests that you are qualified to use as a result of your education and training. For example, the
administration of an individual intelligence test such as the Wechsler Intelligence Scale for
Children—Fourth Edition (WISC-IV) requires extensive training and supervision that is typi-
cally acquired in graduate psychology and education programs. Most test publication firms
have established procedures that allow individuals and organizations to qualify to purchase
tests based on specific criteria. For example, Psychological Assessment Resources (2003) has
a three-tier system that classifies assessment products according to qualification requirements.
In this system, level A products require no special qualifications whereas level C products
require an advanced professional degree or license based on advanced training and experience
in psychological and educational assessment practices. Before purchasing restricted tests,
potential buyers must provide documentation that they meet the necessary requirements.
Maintain Test Security. For assessments to be valid, it is important that test security
be maintained. Individuals selecting, purchasing, and using standardized assessment have
a professional and legal responsibility to maintain the security of assessment instruments. For example, The Psychological Corporation (2003) includes the following principles in its security agreement: (a) Test takers should not have access to testing material or answers before taking the test; (b) assessment materials cannot be reproduced or paraphrased; (c) assessment materials and results can be released only to quali-
fied individuals; (d) if test takers or their parents/guardians ask to
examine test responses or results, this review must be monitored by
a qualified representative of the organization conducting the assessment; and (e) any re-
quest to copy materials must be approved in writing. Examples of breaches in the security
of standardized tests include allowing students to examine the test before taking it, using
actual items from a test for preparation purposes, making and distributing copies of a test,
and allowing test takers to take the test outside of a controlled environment (e.g., allowing
them to take the test home to complete it). Table 17.2 provides a summary of these guide-
lines for selecting published assessments. Special Interest Topic 17.1 provides information
about educators who have engaged in unethical and sometimes criminal practices when
using standardized assessments.
So far we have discussed your professional responsibilities related to developing and se-
lecting tests. Clearly, your professional responsibilities do not stop there. Every step of the
assessment process has its own important responsibilities, and now we turn to those associ-
ated with the administration of assessments. Subsequently we will address responsibilities
related to scoring, interpreting, using, and communicating assessment results. The fol-
lowing guidelines involve your responsibilities when administering assessments.
Over 50 New York City educators may lose their jobs after an independent auditor produced evidence
that they helped students cheat on state tests.
(Hoff, 1999)
State officials charge that 71 Michigan schools might have cheated on state tests.
(Keller, 2001)
Georgia education officials suspend state tests after 270 actual test questions were posted on an
Internet site that was accessible to students, teachers, and parents.
(Olson, 2003)
Cizek (1998) notes that the abuse of standardized assessments by educators has become a national
scandal. With the advent of high-stakes assessment, it should not be surprising that some educators
would be inclined to cheat. With one’s salary and possibly one’s future employment riding on how stu-
dents perform on state-mandated achievement tests, the pressure to ensure that those students perform
well may override ethical and legal concerns for some people. Cannell (1988, 1989) was among the
first to bring abusive test practices to the attention of the public. Cannell revealed that by using outdated
versions of norm-referenced assessments, being lax with test security, and engaging in inappropriate
test preparation practices, all 50 states were able to report that their students were above the national
average (this came to be referred to as the Lake Wobegon phenomenon). Other common “tricks” that
educators have employed to inflate scores include using the same form of a test for a long period of
time so that teachers could become familiar with the content, encouraging low-achieving students to
skip school on the day of the test, selectively removing answer sheets of low-performing students, and excluding limited-English and special education students from assessments (Cizek, 1998).
Do not be fooled into thinking that these unethical practices are limited to top administrators try-
ing to make their schools look good; they also involve classroom teachers. Cizek (1998) reports a num-
ber of recent cases wherein principals or other administrators have encouraged teachers to cheat by
having students practice on the actual test items, and in some cases even erasing and correcting wrong
responses on answer sheets. Other unethical assessment practices engaged in by teachers included
providing hints to the correct answer, reading questions that the students are supposed to read, answer-
ing questions about test content, rephrasing test questions, and sometimes simply giving the students
the answers to items. Gay (1990) reported that 35% of the teachers responding to a survey had either
witnessed or engaged in unethical assessment practices. The unethical behaviors included changing
incorrect answers, revealing the correct answer, providing extra time, allowing the use of inappropriate
aids (e.g., dictionaries), and using the actual test items when preparing students for the test.
Just because other professionals are engaging in unethical behavior does not make it right.
Cheating by administrators, teachers, or students undermines the validity of the assessment results.
If you need any additional incentive to avoid unethical test practices, be warned that the test pub-
lishers are watching! The states and other publishers of standardized tests have a vested interest in
maintaining the validity of their assessments. As a result, they are continually scanning the results for
evidence of cheating. For example, Cizek (1998) reports that unethical educators have been identi-
fied as the result of clues ranging from the fairly obvious, such as ordering an excessive number of blank answer sheets or a disproportionate number of erasures, to the more subtle, such as unusual patterns of increased
scores. The fact is, educators who cheat are being caught and punished, and the punishment may
include the loss of one’s job and license to teach!
Provide Information to Students and Parents about Their Rights and Give Them
an Opportunity to Express Their Concerns. Students and parents should have an op-
portunity to voice concerns about the testing process and receive information about oppor-
tunities to retake an examination, have one rescored, or cancel scores. When appropriate
they should be given information on how they can obtain copies of assessments or other
related information. When an assessment is optional, students and parents should be given
this information so they can decide whether they want to take the
assessment. If alternative assessments are available, they should also
be informed of this. An excellent resource for all test takers is Rights and Responsibilities of Test Takers: Guidelines and Expectations, developed by the Joint Committee on Testing Practices (1998). This is reproduced in Appendix D.
Administer Only Those Assessments for Which You Are Qualified by Education and
Training. As noted previously, it is important to only select and use tests that you are
qualified to use as a result of your education and training. Some assessments require exten-
sive training and supervision before being able to administer them independently.
Make Sure the Scoring Is Fair. An aspect of the previous guideline that deserves special
attention involves fairness or the absence of bias in scoring. Whenever scoring involves
subjective judgment, it is also important to take steps to ensure that the scoring is based
solely on performance or content and is not contaminated by expectancy effects related to
students. That is, you do not want your personal impressions of the students to influence the scores they receive.
Score the Assessments and Report the Results in a Timely Manner. Students and
their parents are often anxious to receive the results of their assessments and deserve to have
their results reported in a timely manner. Additionally, to promote learning, it is important
that students receive feedback on their performance in a punctual manner. If the results are
to be delayed, it is important to notify the students, explain the situation, and attempt to
minimize any negative effects.
If Scoring Errors Are Detected, Correct the Errors and Provide the Corrected Results
in a Timely Manner. If you, or someone else, detect errors in your scoring, it is your
responsibility to take corrective action. Correct the errors, adjust the impacted scores, and
provide the corrected results in a timely manner.
Implement a Reasonable and Fair Process for Appeal. Students have a right to review their assessments and appeal their scores if they believe there were errors in scoring. Although most institutions have formal appeal procedures, it is usually in everyone's best
interest to have a less formal process available by which students
can approach the teacher and attempt to address any concerns. This option may prevent
relatively minor student concerns from escalating into adversarial confrontations involving
parents, administrators, and possibly the legal system.
Use Assessment Results Only for the Purposes for Which They Have Been Validated. When interpreting assessment results, the issue of validity is an overriding concern. A primary consider-
ation when interpreting and using assessment results is to determine
whether there is sufficient validity evidence to support the proposed interpretations and
uses. When teachers use assessment results, it is their responsibility to promote valid inter-
pretations and guard against invalid interpretations.
Be Aware of the Limitations of the Assessment Results. All assessments contain error,
and some have more error than others do. It is the responsibility of teachers and other users
of assessment results to be aware of the limitations of assessments and to take these limita-
tions into consideration when interpreting and using assessment results.
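One way to keep these limitations concrete is the standard error of measurement from classical test theory. The following is only a reminder sketch with assumed values, not figures from any particular test.

% Standard error of measurement: SD is the standard deviation of the scores and
% r_XX' is the reliability coefficient.
\[
  SEM = SD \sqrt{1 - r_{XX'}}
\]
% With assumed values SD = 10 and r_XX' = .90, SEM = 10\sqrt{.10} \approx 3.2, so an
% observed score of 75 is better read as a band of roughly 75 \pm 6 points
% (about two SEMs) than as an exact point.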
Use Multiple Sources and Types of Assessment Information When Making High-Stakes Educational Decisions. Whenever you hear assessment experts saying "multiple-choice items are worthless because they cannot measure higher-order cognitive skills" or "performance assessments are worthless because they are not reliable,"
recognize that they are expressing their own personal biases and not
being objective. Selected-response items, constructed-response items, and performance as-
sessments all have something to contribute to the overall assessment process. Multiple-choice
items and other selected-response formats can typically be scored in a reliable fashion and this
is a definite strength. Although we believe multiple-choice items can be written that measure
higher-order cognitive abilities, many educational objectives simply cannot be assessed using
selected-response items. If you want to measure a student’s writing skills, essay items are
particularly well suited. If you want to assess a student’s ability to engage in an oral debate,
a performance assessment is clearly indicated. The point is, different assessment procedures
have different strengths and weaknesses, and teachers are encouraged to use the results of a
variety of assessments when making important educational decisions. It is not appropriate to
base these decisions on the result of one assessment, particularly when it is difficult to take
corrective action should mistakes occur.
Take into Consideration Personal Factors or Extraneous Events That Might Have
Influenced Test Performance. This guideline holds that teachers should be sensitive to
factors that might have negatively influenced a student’s performance. For example, was the
student feeling ill or upset on the day of the assessment? Is the student prone to high levels
of test anxiety? This guideline also extends to administrative and environmental events
that might have impacted the student. For example, were there errors in administration that
might have impacted the student’s performance? Did any events occur during the admin-
istration that might have distracted the student or otherwise undermined performance? If
it appears any factors compromised the student’s performance, this should be considered
when interpreting their assessment results.
Report Results in an Easily Understandable Manner. Students and their parents have
the right to receive comprehensive information about their assessment results, presented in an understandable and timely manner. It is the teacher's
responsibility to provide this feedback to students and their parents and to attempt to answer
all of their questions. Providing feedback to students regarding their performance and ex-
plaining the rationale for grading decisions facilitates learning.
Inform Students and Parents How Long the Scores Will Be Retained and Who Will
Have Access to the Scores. Students and their parents have a right to know how long the
assessment results will be retained and who will have access to these records.
Develop Procedures so Test Takers Can File Complaints and Have Their Concerns
Addressed. Teachers and school administrators should develop procedures whereby stu-
dents and their parents can file complaints about assessment practices. As we suggested ear-
lier, it is usually desirable to try to address these concerns in an informal manner as opposed
to allowing the problem to escalate into a legal challenge. Table 17.5 provides a summary of
these guidelines for interpreting, using, and communicating assessment results.
Responsibilities of Test Takers
So far we have emphasized the rights of students and other test takers and the responsibili-
ties of teachers and other assessment professionals. However, the Standards (AERA et al.,
1999) note that students and other test takers also have responsibilities. These responsibili-
ties include the following.
Students Are Responsible for Preparing for the Assessment. Students have the right
to have adequate information about the nature and use of assessments. In turn, students are
responsible for studying and otherwise preparing themselves for the assessment.
Students Are Responsible for Following the Directions of the Individual Administer-
ing the Assessment. Students are expected to follow the instructions provided by the
individual administering the test or assessment. This includes behaviors such as showing
up on time for the assessment, starting and stopping when instructed to do so, and recording responses as requested.

TABLE 17.5 Checklist for Interpreting, Using, and Communicating Assessment Results
1. Are assessment results used only for purposes for which they have been validated?
2. Did you take into consideration the limitations of the assessment results?
3. Were multiple sources and types of assessment information used when making high-stakes educational decisions?
4. Have you considered personal factors or extraneous events that might have influenced test performance?
5. Are there any differences between the normative group and actual test takers that need to be considered?
6. Are results communicated in an easily understandable and timely manner?
7. Have you explained to students and parents how they are likely to be impacted by assessment results?
8. Have you informed students and parents how long the scores will be retained and who will have access to the scores?
9. Have you developed procedures so test takers can file complaints and have their concerns addressed?
Students Are Responsible for Behaving in an Academically Honest Manner. That is, students should not cheat! Any form of cheating reduces the validity of the test and is unfair to other students. Cheating can include copying from another student, using prohibited resources (e.g., notes or other unsanctioned aids), securing stolen copies of tests, or having
someone else take the test for them. Most schools have clearly stated policies on academic
honesty and students caught cheating may be sanctioned.
Students Are Responsible for Not Interfering with the Performance of Other Students.
Students should refrain from any activity that might be distracting to other students.
Students Are Responsible for Informing the Teacher or Another Professional if They
Believe the Assessment Results Do Not Adequately Represent Their True Abilities.
If, for any reason students feel that the assessment results do not adequately represent their
actual abilities, they should inform the teacher. This should be done as soon as possible so
the teacher can take appropriate actions.
Students Should Respect the Copyright Rights of Test Publishers. Students should
not make copies or in any other way reproduce assessment materials.
Linn and Gronlund (2000) provide the following suggestions to help prevent cheating in your
classroom.
1. Take steps to keep the test secure before the testing date.
2. Prior to taking the test, have students clear off the top of their desks.
3. If students are allowed to use scratch paper, have them turn it in with their tests.
4. Carefully monitor the students during the test administration.
5. When possible provide an empty row of seats between students.
6. Use two forms of the test and alternate forms when distributing (you can use the same test items, just arranged in a different order).
7. Design your tests to have good face validity, that is, so it appears relevant and fair.
8. Foster a positive attitude toward tests by emphasizing how assessments benefit students (e.g., students learn what they have and have not mastered; a fair way of assigning grades).
Students Should Not Disclose Information about the Contents of a Test. In addition
to not making copies of an assessment, students should refrain from divulging in any other
manner information about the contents of a test. For example, they should not give other stu-
dents information about what to expect on a test. This is tantamount to cheating. Table 17.6
provides a summary of the responsibilities of test takers.
We are often asked, "Now that you have told us what we should do, what things should we avoid?" To that end,
here is our list of 12 assessment-related behaviors that should be avoided.
1. Don't teach the test itself. It's tantamount to cheating (e.g., giving students answers to tests; changing incorrect answers to correct answers).
2. Don't create an environment where it is easy for students to cheat (e.g., failing to monitor students in a responsible manner).
3. Don't base high-stakes decisions on the results of a single assessment.
4. Don't use poor quality assessments (e.g., unreliable, lacking relevant validity data, inadequate normative data).
5. Don't keep students and parents "in the dark" about how they will be assessed.
6. Don't breach confidentiality regarding the performance of students on assessments.
7. Don't let your personal preferences and biases impact the scoring of assessments and assignment of grades.
8. Don't use technical jargon without a clear, commonsense explanation when reporting the results of assessments.
9. Don't use assessments that you are not qualified to administer, score, and interpret.
10. Don't make decisions using information you do not understand.
11. Don't ignore the special assessment needs of students with disabilities or from diverse linguistic/cultural backgrounds.
12. Don't accede to bad decisions for students based on the faulty interpretation of test results by others.
In closing, we hope you enjoy a successful and rewarding career as an educational profes-
sional. Remember, “Our children are our future!”
Academic honesty, p. 464
Code of Fair Testing Practices in Education (JCTP, 1988), p. 451
Code of Professional Responsibilities in Educational Measurement (NCME, 1995), p. 451
Education Week, p. 456
Mental Measurements Yearbook (MMY), p. 455
Rights and Responsibilities of Test Takers: Guidelines and Expectations (JCTP, 1998), p. 451
Standards for Educational and Psychological Testing (AERA et al., 1999), p. 451
The Student Evaluation Standards (JCSEE, 2003), p. 451
Test Critiques, p. 456
Test Reviews Online, p. 456
Test security, p. 456
Tests, p. 455
Tests in Print (TIP), p. 455
RECOMMENDED READINGS
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA. This is "the source" for technical information on the development and use of tests in educational and psychological settings.

In addition to the Standards (AERA et al., 1999), the codes and guidelines reproduced in the appendixes of this textbook are outstanding resources. These are the following:

Appendix A: Summary Statements of The Student Evaluation Standards (JCSEE, 2003)
Appendix B: Code of Professional Responsibilities in Educational Measurement (NCME, 1995)
Appendix C: Code of Fair Testing Practices in Education (JCTP, 1988)
Appendix D: Rights and Responsibilities of Test Takers: Guidelines and Expectations (JCTP, 1998)
Appendix E: Standards for Teacher Competence in Educational Assessment of Students (AFT, NCME, & NEA, 1990)
APPENDIX A
Summary Statements of The Student Evaluation Standards
Propriety Standards
The propriety standards help ensure that student evaluations will be conducted legally, ethically, and
with due regard for the well-being of the students being evaluated and other people affected by the
evaluation results. These standards are as follows:
P1. Service to Students: Evaluations of students should promote sound education principles,
fulfillment of institutional missions, and effective student work, so that the educational needs of
students are served.
P2. Appropriate Policies and Procedures: Written policies and procedures should be developed,
implemented, and made available, so that evaluations are consistent, equitable, and fair.
P4. Treatment of Students: Students should be treated with respect in all aspects of the evaluation
process, so that their dignity and opportunities for educational development are enhanced.
P5. Rights of Students: Evaluations of students should be consistent with applicable laws and basic
principles of fairness and human rights, so that students’ rights and welfare are protected.
P6. Balanced Evaluation: Evaluations of students should provide information that identifies both
strengths and weaknesses, so that strengths can be built upon and problem areas addressed.
P7. Conflict of Interest: Conflicts of interest should be avoided, but if present should be dealt with
openly and honestly, so that they do not compromise evaluation processes and results.
Utility Standards
The utility standards help ensure that student evaluations are useful. Useful student evaluations are in-
formative, timely, and influential. Standards that support usefulness are as follows:
U1. Constructive Orientation: Student evaluations should be constructive, so that they result in
educational decisions that are in the best interest of the student.
U2. Defined Users and Uses: The users and uses of a student evaluation should be specified, so that
the evaluation appropriately contributes to student learning and development.
U3. Information Scope: The information collected for student evaluations should be carefully
focused and sufficiently comprehensive, so that the evaluation questions can be fully answered and
the needs of students addressed.
U4. Evaluator Qualifications: Teachers and others who evaluate students should have the necessary
knowledge and skills, so that the evaluations are carried out competently and the results can be used
with confidence.
U5. Explicit Values: In planning and conducting student evaluations, teachers and others who
evaluate students should identify and justify the values used to judge student performance, so that the
bases for the evaluations are clear and defensible.
U6. Effective Reporting: Student evaluation reports should be clear, timely, accurate, and relevant,
so that they are useful to students, their parents/guardians, and other legitimate users.
U7. Follow-Up: Student evaluations should include procedures for follow-up, so that students,
parents/guardians, and other legitimate users can understand the information and take appropriate
follow-up actions.
Feasibility Standards
The feasibility standards help ensure that student evaluations can be implemented as planned. Feasible
evaluations are practical, diplomatic, and adequately supported. These standards are as follows:
F1. Practical Orientation: Student evaluation procedures should be practical, so that they produce
the needed information in efficient, nondisruptive ways.
F2. Political Viability: Student evaluations should be planned and conducted with the anticipation
of questions from students, their parents/guardians, and other legitimate users, so that their questions
can be answered effectively and their cooperation obtained.
F3. Evaluation Support: Adequate time and resources should be provided for student evaluations,
so that evaluations can be effectively planned and implemented, their results fully communicated, and
appropriate follow-up activities identified.
Accuracy Standards
The accuracy standards help ensure that a student evaluation will produce sound information about
a student’s learning and performance. Sound information leads to valid interpretations, justifiable
conclusions, and appropriate follow-up. These standards are as follows:
Al. Validity Orientation: Student evaluations should be developed and implemented, so that the
interpretations made about the performance of a student are valid and not open to misinterpretation.
A2. Defined Expectations for Students: The performance expectations for students should be
clearly defined, so that evaluation results are defensible and meaningful.
A3. Context Analysis: Student and contextual variables that may influence performance should be
identified and considered, so that a student’s performance can be validly interpreted.
A4. Documented Procedures: The procedures for evaluating students, both planned and actual,
should be described, so that the procedures can be explained and justified.
A5. Defensible Information: The adequacy of information gathered should be ensured, so that
good decisions are possible and can be defended and justified.
A6. Reliable Information: Evaluation procedures should be chosen or developed and implemented,
so that they provide reliable information for decisions about the performance of a student.
A7. Bias Identification and Management: Student evaluations should be free from bias, so that
conclusions can be fair.
A8. Handling Information and Quality Control: The information collected, processed, and reported
about students should be systematically reviewed, corrected as appropriate, and kept secure, so that
accurate judgments can be made.
A9. Analysis of Information: Information collected for student evaluations should be systematically
and accurately analyzed, so that the purposes of the evaluation are effectively achieved.
A10. Justified Conclusions: The evaluative conclusions about student performance should be
explicitly justified, so that students, their parents/guardians, and others can have confidence in them.
A11. Metaevaluation: Student evaluation procedures should be examined periodically using these
and other pertinent standards, so that mistakes are prevented, or detected and promptly corrected, and
sound student evaluation practices are developed over time.
Source: Joint Committee on Standards for Educational Evaluation (2003). The student evaluation
standards. Thousand Oaks, CA: Corwin Press.
APPENDIX B
Code of Professional Responsibilities in Educational Measurement

1. Develop Assessments
2. Market and Sell Assessments
3. Select Assessments
4. Administer Assessments
5. Score Assessments
6. Interpret, Use, and Communicate Assessment Results
7. Educate about Assessment
8. Evaluate Programs and Conduct Research on Assessments
Although the organization of the Code is based on the differentiation of these activities, they are viewed
as highly interrelated, and those who use this Code are urged to consider the Code in its entirety. The
index following this Code provides a listing of some of the critical interest topics within educational
measurement that focus on one or more of the assessment activities.
General Responsibilities
The professional responsibilities promulgated in this Code in eight major areas of assessment activity
are based on expectations that NCME members involved in educational assessment will:
Responsible professional practice includes being informed about and acting in accordance with
the Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 1988), the
Standards for Educational and Psychological Testing (American Educational Research Association,
American Psychological Association, National Council on Measurement in Education, 1985), or sub-
sequent revisions, as well as all applicable state and federal laws that may govern the development,
administration, and use of assessments. Both the Standards for Educational and Psychological Testing
and the Code of Fair Testing Practices in Education are intended to establish criteria for judging the
technical adequacy of tests and the appropriate uses of tests and test results. The purpose of this Code
is to describe the professional responsibilities of those individuals who are engaged in assessment
activities. As would be expected, there is a strong relationship between professionally responsible
practice and sound educational assessments, and this Code is intended to be consistent with the rel-
evant parts of both of these documents.
It is not the intention of NCME to enforce the professional responsibilities stated in the Code
or to investigate allegations of violations to the Code.
Since the Code provides a frame of reference for the evaluation of the appropriateness of behavior,
NCME recognizes that the Code may be used in legal or other similar proceedings.
1.1 Ensure that assessment products and services are developed to meet applicable professional,
technical, and legal standards.
1.2 Develop assessment products and services that are as free as possible from bias due to charac-
teristics irrelevant to the construct being measured, such as gender, ethnicity, race, socioeco-
nomic status, disability, religion, age, or national origin.
1.3 Plan accommodations for groups of test takers with disabilities and other special needs when
developing assessments.
1.4 Disclose to appropriate parties any actual or potential conflicts of interest that might influence
the developers’ judgment or performance.
1.5 Use copyrighted materials in assessment products and services in accordance with state and
federal law.
1.6 Make information available to appropriate persons about the steps taken to develop and score
the assessment, including up-to-date information used to support the reliability, validity, scor-
ing and reporting processes, and other relevant characteristics of the assessment.
1.7 Protect the rights to privacy of those who are assessed as part of the assessment development
process.
1.8 Caution users, in clear and prominent language, against the most likely misinterpretations and
misuses of data that arise out of the assessment development process.
1.9 Avoid false or unsubstantiated claims in test preparation and program support materials and
services about an assessment or its use and interpretation.
1.10 Correct any substantive inaccuracies in assessments or their support materials as soon as
feasible.
1.11 Develop score reports and support materials that promote the understanding of assessment results.
The marketing of assessment products and services, such as tests and other instruments, scoring services,
test preparation services, consulting, and test interpretive services, should be based on information that is
accurate, complete, and relevant to those considering their use. Persons who market and sell assessment
products and services have a professional responsibility to:
2.1 Provide accurate information to potential purchasers about assessment products and services
and their recommended uses and limitations.
2.2 Not knowingly withhold relevant information about assessment products and services that
might affect an appropriate selection decision.
2.3 Base all claims about assessment products and services on valid interpretations of publicly
available information.
2.4 Allow qualified users equal opportunity to purchase assessment products and services.
2.5 Establish reasonable fees for assessment products and services.
2.6 Communicate to potential users, in advance of any purchase or use, all applicable fees associ-
ated with assessment products and services.
2.7 Strive to ensure that no individuals are denied access to opportunities because of their inability
to pay the fees for assessment products and services.
2.8 Establish criteria for the sale of assessment products and services, such as limiting the sale of
assessment products and services to those individuals who are qualified for recommended uses
and from whom proper uses and interpretations are anticipated.
2.9 Inform potential users of known inappropriate uses of assessment products and services and
provide recommendations about how to avoid such misuses.
2.10 Maintain a current understanding about assessment products and services and their appropriate
uses in education.
2.11 Release information implying endorsement by users of assessment products and services only
with the users’ permission.
2.12 Avoid making claims that assessment products and services have been endorsed by another
organization unless an official endorsement has been obtained.
2.13 Avoid marketing test preparation products and services that may cause individuals to receive
scores that misrepresent their actual levels of attainment.
Those who select assessment products and services for use in educational settings, or help others do
so, have important professional responsibilities to make sure that the assessments are appropriate for
their intended use. Persons who select assessment products and services have a professional respon-
sibility to:
3.1 Conduct a thorough review and evaluation of available assessment strategies and instruments
that might be valid for the intended uses.
3.2 Recommend and/or select assessments based on publicly available documented evidence of
their technical quality and utility rather than on unsubstantiated claims or statements.
3.3 Disclose any associations or affiliations that they have with the authors, test publishers, or oth-
ers involved with the assessments under consideration for purchase and refrain from participa-
tion if such associations might affect the objectivity of the selection process.
3.4 Inform decision makers and prospective users of the appropriateness of the assessment for
the intended uses, likely consequences of use, protection of examinee rights, relative costs,
materials and services needed to conduct or use the assessment, and known limitations of the
assessment, including potential misuses and misinterpretations of assessment information.
3.5 Recommend against the use of any prospective assessment that is likely to be administered,
scored, and used in an invalid manner for members of various groups in our society for reasons
of race, ethnicity, gender, age, disability, language background, socioeconomic status, religion,
or national origin.
3.6 Comply with all security precautions that may accompany assessments being reviewed.
3.7 Immediately disclose any attempts by others to exert undue influence on the assessment selec-
tion process.
3.8 Avoid recommending, purchasing, or using test preparation products and services that may cause
individuals to receive scores that misrepresent their actual levels of attainment.
Those who prepare individuals to take assessments and those who are directly or indirectly involved
in the administration of assessments as part of the educational process, including teachers, admin-
istrators, and assessment personnel, have an important role in making sure that the assessments are
administered in a fair and accurate manner. Persons who prepare others for, and those who administer,
assessments have a professional responsibility to:
4.1 Inform the examinees about the assessment prior to its administration, including its purposes,
uses, and consequences; how the assessment information will be judged or scored; how the results
will be kept on file; who will have access to the results; how the results will be distributed; and
examinees’ rights before, during, and after the assessment.
4.2 Administer only those assessments for which they are qualified by education, training, licen-
sure, or certification.
4.3 Take appropriate security precautions before, during, and after the administration of the
assessment.
4.4 Understand the procedures needed to administer the assessment prior to administration.
4.5 Administer standardized assessments according to prescribed procedures and conditions and
notify appropriate persons if any nonstandard or delimiting conditions occur.
4.6 Not exclude any eligible student from the assessment.
4.7 Avoid any conditions in the conduct of the assessment that might invalidate the results.
4.8 Provide for and document all reasonable and allowable accommodations for the administration
of the assessment to persons with disabilities or special needs.
4.9 Provide reasonable opportunities for individuals to ask questions about the assessment procedures
or directions prior to and at prescribed times during the administration of the assessment.
4.10 Protect the rights to privacy and due process of those who are assessed.
4.11 Avoid actions or conditions that would permit or encourage individuals or groups to receive
scores that misrepresent their actual levels of attainment.
Section 5: Responsibilities of
Those Who Score Assessments
The scoring of educational assessments should be conducted properly and efficiently so that the results
are reported accurately and in a timely manner. Persons who score and prepare reports of assessments
have a professional responsibility to:
5.1 Provide complete and accurate information to users about how the assessment is scored, such
as the reporting schedule, scoring process to be used, rationale for the scoring approach, techni-
cal characteristics, quality control procedures, reporting formats, and the fees, if any, for these
services.
5.2 Ensure the accuracy of the assessment results by conducting reasonable quality control proce-
dures before, during, and after scoring.
5.3 Minimize the effect on scoring of factors irrelevant to the purposes of the assessment.
5.4 Inform users promptly of any deviation in the planned scoring and reporting service or schedule
and negotiate a solution with users.
5.5 Provide corrected score results to the examinee or the client as quickly as practicable should
errors be found that may affect the inferences made on the basis of the scores.
5.6 Protect the confidentiality of information that identifies individuals as prescribed by state and
federal laws.
5.7 Release summary results of the assessment only to those persons entitled to such information
by state or federal law or those who are designated by the party contracting for the scoring
services.
5.8 Establish, where feasible, a fair and reasonable process for appeal and rescoring the assessment.
The interpretation, use, and communication of assessment results should promote valid inferences
and minimize invalid ones. Persons who interpret, use, and communicate assessment results have a
professional responsibility to:
6.1 Conduct these activities in an informed, objective, and fair manner within the context of the
assessment’s limitations and with an understanding of the potential consequences of use.
6.2 Provide to those who receive assessment results information about the assessment, its pur-
poses, its limitations, and its uses necessary for the proper interpretation of the results.
6.3 Provide to those who receive score reports an understandable written description of all reported
scores, including proper interpretations and likely misinterpretations.
6.4 Communicate to appropriate audiences the results of the assessment in an understandable and
timely manner, including proper interpretations and likely misinterpretations.
6.5 Evaluate and communicate the adequacy and appropriateness of any norms or standards used
in the interpretation of assessment results.
6.6 Inform parties involved in the assessment process how assessment results may affect them.
6.7 Use multiple sources and types of relevant information about persons or programs whenever
possible in making educational decisions.
6.8 Avoid making, and actively discourage others from making, inaccurate reports, unsubstanti-
ated claims, inappropriate interpretations, or otherwise false and misleading statements about
assessment results.
6.9 Disclose to examinees and others whether and how long the results of the assessment will be
kept on file, procedures for appeal and rescoring, rights examinees and others have to the as-
sessment information, and how those rights may be exercised.
6.10 Report any apparent misuses of assessment information to those responsible for the assessment
process.
6.11 Protect the rights to privacy of individuals and institutions involved in the assessment process.
The process of educating others about educational assessment, whether as part of higher education,
professional development, public policy discussions, or job training, should prepare individuals to
understand and engage in sound measurement practice and to become discerning users of tests and test
results. Persons who educate or inform others about assessment have a professional responsibility to:
7.1 Remain competent and current in the areas in which they teach and reflect that in their instruction.
7.2 Provide fair and balanced perspectives when teaching about assessment.
7.3 Differentiate clearly between expressions of opinion and substantiated knowledge when edu-
cating others about any specific assessment method, product, or service.
7.4 Disclose any financial interests that might be perceived to influence the evaluation of a particu-
lar assessment product or service that is the subject of instruction.
7.5 Avoid administering any assessment that is not part of the evaluation of student performance in a
course if the administration of that assessment is likely to harm any student.
7.6 Avoid using or reporting the results of any assessment that is not part of the evaluation of stu-
dent performance in a course if the use or reporting of results is likely to harm any student.
7.7 Protect all secure assessments and materials used in the instructional process.
7.8 Model responsible assessment practice and help those receiving instruction to learn about their
professional responsibilities in educational measurement.
7.9 Provide fair and balanced perspectives on assessment issues being discussed by policymakers,
parents, and other citizens.
8.1 Conduct evaluation and research activities in an informed, objective, and fair manner.
8.2 Disclose any associations that they have with authors, test publishers, or others involved with
the assessment and refrain from participation if such associations might affect the objectivity
of the research or evaluation.
8.3 Preserve the security of all assessments throughout the research process as appropriate.
8.4 Take appropriate steps to minimize potential sources of invalidity in the research and disclose
known factors that may bias the results of the study.
8.5 Present the results of research, both intended and unintended, in a fair, complete, and objective
manner.
8.6 Attribute completely and appropriately the work and ideas of others.
8.7 Qualify the conclusions of the research within the limitations of the study.
8.8 Use multiple sources of relevant information in conducting evaluation and research activities
whenever possible.
8.9 Comply with applicable standards for protecting the rights of participants in an evaluation or
research study, including the rights to privacy and informed consent.
Afterword
As stated at the outset, the purpose of the Code of Professional Responsibilities in Educational Mea-
surement is to serve as a guide to the conduct of NCME members who are engaged in any type of
assessment activity in education. Given the broad scope of the field of educational assessment as well
as the variety of activities in which professionals may engage, it is unlikely that any code will cover
the professional responsibilities involved in every situation or activity in which assessment is used in
education. Ultimately, it is hoped that this Code will serve as the basis for ongoing discussions about
what constitutes professionally responsible practice. Moreover, these discussions will undoubtedly
identify areas of practice that need further analysis and clarification in subsequent editions of the
Code. To the extent that these discussions occur, the Code will have served its purpose.
This index provides a list of major topics and issues addressed by the responsibilities in each of the
eight sections of the Code. Although this list is not intended to be exhaustive, it is intended to serve as
a reference source for those who use this Code.
Source: Code of professional responsibilities in educational measurement. Prepared by the NCME Ad Hoc Com-
mittee on the Development of a Code of Ethics: Cynthia B. Schmeiser, ACT—Chair; Kurt F. Geisinger, State
University of New York; Sharon Johnson-Lewis, Detroit Public Schools; Edward D. Roeber, Council of Chief State
School Officers; William D. Schafer, University of Maryland. Copyright 1995 National Council on Measurement
in Education. Any portion of this Code may be reproduced and disseminated for educational purposes.
APPENDIX C
Code of Fair Testing
Practices in Education
The Code of Fair Testing Practices in Education states the major obligations to test takers of profes-
sionals who develop or use educational tests. The Code is meant to apply broadly to the use of tests in
education (admissions, educational assessment, educational diagnosis, and student placement). The
Code is not designed to cover employment testing, licensure or certification testing, or other types of
testing. Although the Code has relevance to many types of educational tests, it is directed primarily at
professionally developed tests such as those sold by commercial test publishers or used in formally
administered testing programs. The Code is not intended to cover tests made by individual teachers for
use in their own classrooms.
The Code addresses the roles of test developers and test users separately. Test users are people
who select tests, commission test development services, or make decisions on the basis of test scores.
Test developers are people who actually construct tests as well as those who set policies for particular
testing programs. The roles may, of course, overlap as when a state education agency commissions
test development services, sets policies that control the test development process, and makes decisions
on the basis of the test scores.
The Code has been developed by the Joint Committee on Testing Practices, a cooperative ef-
fort of several professional organizations, that has as its aim the advancement, in the public interest,
of the quality of testing practices. The Joint Committee was initiated by the American Educational
Research Association, the American Psychological Association, and the National Council on Mea-
surement in Education. In addition to these three groups the American Association for Counseling and
Development/Association for Measurement and Evaluation in Counseling and Development, and the
American Speech-Language-Hearing Association are now also sponsors of the Joint Committee.
The Code presents standards for educational test developers and users in four areas:
A. Developing/Selecting Appropriate Tests
B. Interpreting Scores
C. Striving for Fairness
D. Informing Test Takers
Organizations, institutions, and individual professionals who endorse the Code commit themselves
to safeguarding the rights of test takers by following the principles listed. The Code is intended to be
consistent with the relevant parts of the Standards for Educational and Psychological Testing (AERA,
APA, NCME, 1985). However, the Code differs from the Standards in both audience and purpose.
The Code is meant to be understood by the general public; it is limited to educational tests; and the
primary focus is on those issues that affect the proper use of tests. The Code is not meant to add new
principles over and above those in the Standards or to change the meaning of the Standards. The goal
is rather to represent the spirit of a selected portion of the Standards in a way that is meaningful to test
takers and/or their parents or guardians. It is the hope of the Joint Committee that the Code will also
be judged to be consistent with existing codes of conduct and standards of other professional groups
who use educational tests.
B. Interpreting Scores
Test developers should help users interpret scores correctly.
Test users should interpret scores correctly.
*Many of the statements in the Code refer to the selection of existing tests. However, in customized testing pro-
grams test developers are engaged to construct new tests. In those situations, the test development process should
be designed to help ensure that the completed tests will be in compliance with the Code.
10. Test developers should: Describe the population(s) represented by any norms or comparison group(s), the dates the data were gathered, and the process used to select the samples of test takers.
Test users should: Interpret scores taking into account any major differences between the norms or comparison groups and the actual test takers. Also take into account any differences in test administration practices or familiarity with the specific questions in the test.
11. Test developers should: Warn users to avoid specific, reasonably anticipated misuses of test scores.
Test users should: Avoid using tests for purposes not specifically recommended by the test developer unless evidence is obtained to support the intended use.
12. Test developers should: Provide information that will help users follow reasonable procedures for setting passing scores when it is appropriate to use such scores with the test.
Test users should: Explain how any passing scores were set and gather evidence to support the appropriateness of the scores.
13. Test developers should: Provide information that will help users gather evidence to show that the test is meeting its intended purpose(s).
Test users should: Obtain evidence to help show that the test is meeting its intended purpose(s).
14. Test developers should: Review and revise test questions and related materials to avoid potentially insensitive content or language.
Test users should: Evaluate the procedures used by test developers to avoid potentially insensitive content or language.
15. Test developers should: Investigate the performance of test takers of different races, gender, and ethnic backgrounds when samples of sufficient size are available. Enact procedures that help to ensure that differences in performance are related primarily to the skills under assessment rather than to irrelevant factors.
Test users should: Review the performance of test takers of different races, gender, and ethnic backgrounds when samples of sufficient size are available. Evaluate the extent to which performance differences may have been caused by the test.
16. Test developers should: When feasible, make appropriately modified forms of tests or administration procedures available for test takers with handicapping conditions. Warn test users of potential problems in using standard norms with modified tests or administration procedures that result in noncomparable scores.
Test users should: When necessary and feasible, use appropriately modified forms or administration procedures for test takers with handicapping conditions. Interpret standard norms with care in the light of the modifications that were made.
18. Provide test takers with the information they need to be familiar with the coverage of the test,
the types of question formats, the directions, and appropriate test-taking strategies. Strive to make
such information equally available to all test takers.
Under some circumstances, test developers have direct control of tests and test scores. Under
other circumstances, test users have such control.
Whichever group has direct control of tests and test scores should take the steps described
below.
20. Tell test takers or their parents/guardians how long scores will be kept on file and indicate to
whom and under what circumstances test scores will or will not be released.
21. Describe the procedures that test takers or their parents/guardians may use to register com-
plaints and have problems resolved.
Source: Code of fair testing practices in education. (1988). Washington, DC: Joint Committee on Testing Prac-
tices. (Mailing address: Joint Committee on Testing Practices, American Psychological Association, 1200 17th
Street NW, Washington, DC 20036.)
APPENDIX D
Rights and Responsibilities of Test Takers: Guidelines and Expectations
Preamble
The intent of this statement is to enumerate and clarify the expectations that test takers may reason-
ably have about the testing process, and the expectations that those who develop, administer, and
use tests may have of test takers. Tests are defined broadly here as psychological and educational
instruments developed and used by testing professionals in organizations such as schools, industries,
clinical practice, counseling settings and human service and other agencies, including those assess-
ment procedures and devices that are used for making inferences about people in the above-named
settings. The purpose of the statement is to inform and to help educate not only test takers, but also
others involved in the testing enterprise so that measurements may be most validly and appropriately
used. This document is intended as an effort to inspire improvements in the testing process and does
not have the force of law. Its orientation is to encourage positive and high quality interactions between
testing professionals and test takers.
The rights and responsibilities listed in this document are neither legally based nor inalienable
rights and responsibilities such as those listed in the United States of America’s Bill of Rights. Rather,
they represent the best judgments of testing professionals about the reasonable expectations that those
involved in the testing enterprise (test producers, test users, and test takers) should have of each other.
Testing professionals include developers of assessment products and services, those who market and
sell them, persons who select them, test administrators and scorers, those who interpret test results,
and trained users of the information. Persons who engage in each of these activities have significant
responsibilities that are described elsewhere, in documents such as those that follow (American As-
sociation for Counseling and Development, 1988; American Speech-Language-Hearing Association,
1994; Joint Committee on Testing Practices, 1988; National Association of School Psychologists,
1992; National Council on Measurement in Education, 1995).
In some circumstances, the test developer and the test user may not be the same person, group
of persons, or organization. In such situations, the professionals involved in the testing should clarify,
for the test taker as well as for themselves, who is responsible for each aspect of the testing process.
For example, when an individual chooses to take a college admissions test, at least three parties are
involved in addition to the test taker: the test developer and publisher, the individuals who administer
the test to the test taker, and the institutions of higher education who will eventually use the informa-
tion. In such cases a test taker may need to request clarifications about their rights and responsibilities.
When test takers are young children (e.g., those taking standardized tests in the schools) or are persons
who spend some or all their time in institutions or are incapacitated, parents or guardians may be
granted some of the rights and responsibilities, rather than, or in addition to, the individual.
Perhaps the most fundamental right test takers have is to be able to take tests that meet high
professional standards, such as those described in Standards for Educational and Psychological
Testing (American Educational Research Association, American Psychological Association, & Na-
tional Council on Measurement in Education, 1999) as well as those of other appropriate professional
associations. This statement should be used as an adjunct, or supplement, to those standards. State and
federal laws, of course, supersede any rights and responsibilities that are stated here.
References
American Association for Counseling and Development (now American Counseling Association) & As-
sociation for Measurement and Evaluation in Counseling and Development (now Association for
Assessment in Counseling). (1989). Responsibilities of users of standardized tests: RUST statement
revised. Alexandria, VA: Author.
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washing-
ton, DC: American Educational Research Association.
American Speech-Language-Hearing Association. (1994). Protection of rights of people receiving audiol-
ogy or speech-language pathology services. ASHA (36), 60-63.
Joint Committee on Testing Practices. (1988). Code of fair testing practices in education. Washington, DC:
American Psychological Association.
National Association of School Psychologists. (1992). Standards for the provision of school psychological
services. Silver Spring, MD: Author.
National Council on Measurement in Education. (1995). Code of professional responsibilities in educa-
tional measurement. Washington, DC: Author.
1. Because test takers have the right to be informed of their rights and responsibilities as test tak-
ers, it is normally the responsibility of the individual who administers a test (or the organization
that prepared the test) to inform test takers of these rights and responsibilities.
2. Because test takers have the right to be treated with courtesy, respect, and impartiality, regard-
less of their age, disability, ethnicity, gender, national origin, race, religion, sexual orientation,
or other personal characteristics, testing professionals should:
a. Make test takers aware of any materials that are available to assist them in test preparation.
These materials should be clearly described in test registration and/or test familiarization
materials.
b. See that test takers are provided with reasonable access to testing services.
3. Because test takers have the right to be tested with measures that meet professional standards
that are appropriate for the test use and the test taker, given the manner in which the results will
be used, testing professionals should:
a. Take steps to utilize measures that meet professional standards and are reliable, relevant,
useful given the intended purpose and are fair for test takers from varying societal groups.
b. Advise test takers that they are entitled to request reasonable accommodations in test ad-
ministration that are likely to increase the validity of their test scores if they have a disability
recognized under the Americans with Disabilities Act or other relevant legislation.
4. Because test takers have the right to be informed, prior to testing, about the test’s purposes, the
nature of the test, whether test results will be reported to the test takers, and the planned use of
the results (when not in conflict with the testing purposes), testing professionals should:
a. Give or provide test takers with access to a brief description about the test purpose (e.g.,
diagnosis, placement, selection, etc.) and the kind(s) of tests and formats that will be used
(e.g., individual/group, multiple-choice/free response/performance, timed/untimed, etc.),
unless such information might be detrimental to the objectives of the test.
b. Tell test takers, prior to testing, about the planned use(s) of the test results. Upon request,
the test taker should be given information about how long such test scores are typically kept
on file and remain available.
c. Provide test takers, if requested, with information about any preventative measures that have
been instituted to safeguard the accuracy of test scores. Such information would include
any quality control procedures that are employed and some of the steps taken to prevent
dishonesty in test performance.
d. Inform test takers, in advance of the testing, about required materials that must be brought
to the test site (e.g., pencil, paper) and about any rules that allow or prohibit use of other
materials (e.g., calculators).
e. Provide test takers, upon request, with general information about the appropriateness of the
test for its intended purpose, to the extent that such information does not involve the release
of proprietary information. (For example, the test taker might be told, “Scores on this test are
useful in predicting how successful people will be in this kind of work” or “Scores on this
test, along with other information, help us to determine if students are likely to benefit from
this program.”)
f. Provide test takers, upon request, with information about re-testing, including if it is possible
to re-take the test or another version of it, and if so, how often, how soon, and under what
conditions.
g. Provide test takers, upon request, with information about how the test will be scored and in
what detail. On multiple-choice tests, this information might include suggestions for test
taking and about the use of a correction for guessing. On tests scored using professional
judgment (e.g., essay tests or projective techniques), a general description of the scoring
procedures might be provided except when such information is proprietary or would tend
to influence test performance inappropriately.
h. Inform test takers about the type of feedback and interpretation that is routinely provided, as
well as what is available for a fee. Test takers have the right to request and receive informa-
tion regarding whether or not they can obtain copies of their test answer sheets or their test
materials, if they can have their scores verified, and if they may cancel their test results.
i. Provide test takers, prior to testing, either in the written instructions, in other written docu-
ments or orally, with answers to questions that test takers may have about basic test admin-
istration procedures.
j. Inform test takers, prior to testing, if questions from test takers will not be permitted during
the testing process.
k. Provide test takers with information about the use of computers, calculators, or other equip-
ment, if any, used in the testing and give them an opportunity to practice using such equip-
ment, unless its unpracticed use is part of the test purpose, or practice would compromise
the validity of the results, and to provide a testing accommodation for the use of such equip-
ment, if needed.
l. Inform test takers that, if they have a disability, they have the right to request and receive
accommodations or modifications in accordance with the provisions of the Americans with
Disabilities Act and other relevant legislation.
m. Provide test takers with information that will be of use in making decisions if test takers
have options regarding which tests, test forms, or test formats to take.
5. Because test takers have a right to be informed in advance when the test will be administered,
if and when test results will be available, and if there is a fee for testing services that the test
takers are expected to pay, test professionals should:
a. Notify test takers of the alteration in a timely manner if a previously announced testing
schedule changes, provide a reasonable explanation for the change, and inform test takers
of the new schedule. If there is a change, reasonable alternatives to the original schedule
should be provided.
b. Inform test takers prior to testing about any anticipated fee for the testing process, as well
as the fees associated with each component of the process, if the components can be sepa-
rated.
6. Because test takers have the right to have their tests administered and interpreted by appropri-
ately trained individuals, testing professionals should:
a. Know how to select the appropriate test for the intended purposes.
b. When testing persons with documented disabilities and other special characteristics that
require special testing conditions and/or interpretation of results, have the skills and knowl-
edge for such testing and interpretation.
c. Provide reasonable information regarding their qualifications, upon request.
d. Insure that test conditions, especially if unusual, do not unduly interfere with test perfor-
mance. Test conditions will normally be similar to those used to standardize the test.
e. Provide candidates with a reasonable amount of time to complete the test, unless a test has
a time limit.
f. Take reasonable actions to safeguard against fraudulent actions (e.g., cheating) that could
place honest test takers at a disadvantage.
7. Because test takers have the right to be informed about why they are being asked to take particu-
lar tests, if a test is optional, and what the consequences are should they choose not to complete
the test, testing professionals should:
a. Normally only engage in testing activities with test takers after the test takers have pro-
vided their informed consent to take a test, except when testing without consent has been
mandated by law or governmental regulation, or when consent is implied by an action the
test takers have already taken (e.g., such as when applying for employment and a personnel
examination is mandated).
b. Explain to test takers why they should consider taking voluntary tests.
c. Explain, if a test taker refuses to take or complete a voluntary test, either orally or in writing,
what the negative consequences may be to them for their decision to do so.
d. Promptly inform the test taker if a testing professional decides that there is a need to devi-
ate from the testing services to which the test taker initially agreed (e.g., should the testing
professional believe it would be wise to administer an additional test or an alternative test),
and provide an explanation for the change.
8. Because test takers have a right to receive a written or oral explanation of their test results
within a reasonable amount of time after testing and in commonly understood terms, testing
professionals should:
a. Interpret test results in light of one or more additional considerations (e.g., disability, lan-
guage proficiency), if those considerations are relevant to the purposes of the test and per-
formance on the test and are in accordance with current laws.
b. Provide, upon request, information to test takers about the sources used in interpreting their
test results, including technical manuals, technical reports, norms, and a description of the
comparison group, or additional information about the test taker(s).
c. Provide, upon request, recommendations to test takers about how they could improve their
performance on the test, should they choose or be required to take the test again.
d. Provide, upon request, information to test takers about their options for obtaining a second
interpretation of their results. Test takers may select an appropriately trained professional
to provide this second opinion.
e. Provide test takers with the criteria used to determine a passing score, when individual test
scores are reported and related to a pass—fail standard.
f. Inform test takers, upon request, how much their scores might change, should they elect
to take the test again. Such information would include variation in test performance due to
measurement error (e.g., the appropriate standard errors of measurement) and changes in
performance over time with or without intervention (e.g., additional training or treatment).
g. Communicate test results to test takers in an appropriate and sensitive manner, without use
of negative labels or comments likely to inflame or stigmatize the test taker.
h. Provide corrected test scores to test takers as rapidly as possible, should an error occur in
the processing or reporting of scores. The length of time is often dictated by individuals
responsible for processing or reporting the scores, rather than the individuals responsible
for testing, should the two parties indeed differ.
i. Correct any errors as rapidly as possible if there are errors in the process of developing
scores.
9. Because test takers have the right to have the results of tests kept confidential to the extent al-
lowed by law, testing professionals should:
a. Insure that records of test results (in paper or electronic form) are safeguarded and main-
tained so that only individuals who have a legitimate right to access them will be able to do
so.
b. Provide test takers, upon request, with information regarding who has a legitimate right to
access their test results (when individually identified) and in what form. Testing profession-
als should respond appropriately to questions regarding the reasons why such individuals
may have access to test results and how they may use the results.
c. Advise test takers that they are entitled to limit access to their results (when individually
identified) to those persons or institutions, and for those purposes, revealed to them prior to
testing. Exceptions may occur when test takers, or their guardians, consent to release the test
results to others or when testing professionals are authorized by law to release test results.
d. Keep confidential any requests for testing accommodations and the documentation support-
ing the request.
10. Because test takers have the right to present concerns about the testing process and to receive
information about procedures that will be used to address such concerns, testing professionals
should:
a. Inform test takers how they can question the results of the testing if they do not believe that
the test was administered properly or scored correctly, or other such concerns.
b. Inform test takers of the procedures for appealing decisions that they believe are based in
whole or in part on erroneous test results.
c. Inform test takers if their test results are under investigation and may be canceled, in-
validated, or not released for normal use. In such an event, that investigation should be
performed in a timely manner. The investigation should use all available information that
addresses the reason(s) for the investigation, and the test taker should also be informed of
the information that he/she may need to provide to assist with the investigation.
d. Inform the test taker, if that test taker’s test results are canceled or not released for normal
use, why that action was taken. The test taker is entitled to request and receive information
on the types of evidence and procedures that have been used to make that determination.
1. Testing professionals need to inform test takers that they should listen to and/or read their rights
and responsibilities as a test taker and ask questions about issues they do not understand.
2. Testing professionals should take steps, as appropriate, to ensure that test takers know that
they:
a. Are responsible for their behavior throughout the entire testing process.
b. Should not interfere with the rights of others involved in the testing process.
c. Should not compromise the integrity of the test and its interpretation in any manner.
3. Testing professionals should remind test takers that it is their responsibility to ask questions
prior to testing if they are uncertain about why the test is being given, how it will be given, what
they will be asked to do, and what will be done with the results. Testing professionals should:
a. Advise test takers that it is their responsibility to review materials supplied by test publish-
ers and others as part of the testing process and to ask questions about areas that they feel
they should understand better prior to the start of testing.
b. Inform test takers that it is their responsibility to request more information if they are not satis-
fied with what they know about how their test results will be used and what will be done with
them.
4. Testing professionals should inform test takers that it is their responsibility to read descriptive
material they receive in advance of a test and to listen carefully to test instructions. Testing
professionals should inform test takers that it is their responsibility to inform an examiner in
advance of testing if they wish to receive a testing accommodation or if they have a physical
condition or illness that may interfere with their performance. Testing professionals should
inform test takers that it is their responsibility to inform an examiner if they have difficulty
comprehending the language in which the test is given. Testing professionals should:
a. Inform test takers that, if they need special testing arrangements, it is their responsibility to
request appropriate accommodations and to provide any requested documentation as far in
advance of the testing date as possible. Testing professionals should inform test takers about
the documentation needed to receive a requested testing accommodation.
b. Inform test takers that, if they request but do not receive a testing accommodation, they
could request information about why their request was denied.
5. Testing professionals should inform test takers when and where the test will be given, and
whether payment for the testing is required. Having been so informed, it is the responsibility
of the test taker to appear on time with any required materials, pay for testing services, and be
ready to be tested. Testing professionals should:
a. Inform test takers that they are responsible for familiarizing themselves with the appropri-
ate materials needed for testing and for requesting information about these materials, if
needed.
b. Inform the test taker, if the testing situation requires that test takers bring materials (e.g.,
personal identification, pencils, calculators, etc.) to the testing site, of this responsibility to
do so.
6. Testing professionals should advise test takers, prior to testing, that it is their responsibility to:
a. Listen to and/or read the directions given to them.
b. Follow instructions given by testing professionals.
c. Complete the test as directed.
d. Perform to the best of their ability if they want their score to be a reflection of their best
effort.
e. Behave honestly (e.g., not cheating or assisting others who cheat).
7. Testing professionals should inform test takers about the consequences of not taking a test,
should they choose not to take the test. Once so informed, it is the responsibility of the test taker
to accept such consequences, and the testing professional should so inform the test takers. If test
takers have questions regarding these consequences, it is their responsibility to ask questions of
the testing professional, and the testing professional should so inform the test takers.
8. Testing professionals should inform test takers that it is their responsibility to notify appropriate
persons, as specified by the testing organization, if they do not understand their results, or if
they believe that testing conditions affected the results. Testing professionals should:
a. Provide information to test takers, upon request, about appropriate procedures for question-
ing or canceling their test scores or results, if relevant to the purposes of testing.
b. Provide to test takers, upon request, the procedures for reviewing, re-testing, or canceling
their scores or test results, if they believe that testing conditions affected their results and if
relevant to the purposes of testing.
c. Provide documentation to the test taker about known testing conditions that might have affected
the results of the testing, if relevant to the purposes of testing.
9. Testing professionals should advise test takers that it is their responsibility to ask questions
about the confidentiality of their test results, if this aspect concerns them.
10. Testing professionals should advise test takers that it is their responsibility to present con-
cerns about the testing process in a timely, respectful manner.
Source: Test Taker Rights and Responsibilities Working Group of the Joint Committee on Testing Practices. (1998, August). The rights and responsibilities of test takers: Guidelines and expectations. Washington, DC: American Psychological Association.
APPENDIX E
Standards for Teacher Competence in Educational Assessment of Students
The professional education associations began working in 1987 to develop standards for teacher
competence in student assessment out of concern that the potential educational benefits of student
assessments be fully realized. The Committee appointed to this project completed its work in 1990,
following reviews of earlier drafts by members of the measurement, teaching, and teacher prepara-
tion and certification communities. Parallel committees of affected associations are encouraged to
develop similar statements of qualifications for school administrators, counselors, testing directors,
supervisors, and other educators in the near future. These statements are intended to guide the preser-
vice and inservice preparation of educators, the accreditation of preparation programs, and the future
certification of all educators.
A standard is defined here as a principle generally accepted by the professional associations
responsible for this document. Assessment is defined as the process of obtaining information that is
used to make educational decisions about students; to give feedback to students about their progress,
strengths, and weaknesses; to judge instructional effectiveness and curricular adequacy; and to inform
policy. The various assessment techniques include, but are not limited to, formal and informal obser-
vation, qualitative analysis of pupil performance and products, paper-and-pencil tests, oral question-
ing, and analysis of student records. The assessment competencies included here are the knowledge
and skills critical to a teacher’s role as educator. It is understood that there are many competencies
beyond assessment competencies that teachers must possess.
By establishing standards for teacher competence in student assessment, the associations sub-
scribe to the view that student assessment is an essential part of teaching and that good teaching cannot
exist without good student assessment. Training to develop the competencies covered in the standards
should be an integral part of preservice preparation. Further, such assessment training should be
widely available to practicing teachers through staff development programs at the district and building
levels. The standards are intended for use as:
■ a guide for teacher educators as they design and approve programs for teacher preparation
■ a self-assessment guide for teachers in identifying their needs for professional development in student assessment
■ a guide for workshop instructors as they design professional development experiences for inservice teachers
■ an impetus for educational measurement specialists and teacher trainers to conceptualize student assessment and teacher training in student assessment more broadly than has been the case in the past.
The standards should be incorporated into future teacher training and certification programs.
Teachers who have not had the preparation these standards imply should have the opportunity and sup-
port to develop these competencies before the standards enter into the evaluation of these teachers.
There are seven standards in this document. In recognizing the critical need to revitalize classroom
assessment, some standards focus on classroom-based competencies. Because of teachers’ growing
roles in education and policy decisions beyond the classroom, other standards address assessment
competencies underlying teacher participation in decisions related to assessment at the school, dis-
trict, state, and national levels.
The scope of a teacher’s professional role and responsibilities for student assessment may be
described in terms of the following activities. These activities imply that teachers need competence in
student assessment and sufficient time and resources to complete them in a professional manner.
(c) Recording and reporting assessment results for school-level analysis, evaluation, and decision
making;
(d) Analyzing assessment information gathered before and during instruction to understand each
student’s progress to date and to inform future instructional planning;
(e) Evaluating the effectiveness of instruction; and
(f) Evaluating the effectiveness of the curriculum and materials in use.
Each standard that follows is an expectation for assessment knowledge or skill that a teacher
should possess in order to perform well in the five areas just described. As a set, the standards call on
teachers to demonstrate skill at selecting, developing, applying, using, communicating, and evaluat-
ing student assessment information and student assessment practices. A brief rationale and illustrative
behaviors follow each standard.
The standards represent a conceptual framework or scaffolding from which specific skills can
be derived. Work to make these standards operational will be needed even after they have been pub-
lished. It is also expected that experience in the application of these standards should lead to their
improvement and further development.
options available to them, considering among other things, the cultural, social, economic, and lan-
guage backgrounds of students. They will be aware that different assessment approaches can be in-
compatible with certain instructional goals and may impact quite differently on their teaching.
Teachers will know, for each assessment approach they use, its appropriateness for making
decisions about their pupils. Moreover, teachers will know where to find information about and/
or reviews of various assessment methods. Assessment options are diverse and include text- and
curriculum-embedded questions and tests, standardized criterion-referenced and norm-referenced
tests, oral questioning, spontaneous and structured performance assessments, portfolios, exhibitions,
demonstrations, rating scales, writing samples, paper-and-pencil tests, seatwork and homework, peer-
and self-assessments, student records, observations, questionnaires, interviews, projects, products,
and others’ opinions.
3. The teacher should be skilled in administering, scoring, and interpreting the results of
both externally produced and teacher-produced assessment methods. It is not enough that teach-
ers are able to select and develop good assessment methods; they must also be able to apply them
properly. Teachers should be skilled in administering, scoring, and interpreting results from diverse
assessment methods. Teachers who meet this standard will have the conceptual and application skills
that follow. They will be skilled in interpreting informal and formal teacher-produced assessment
results, including pupils’ performances in class and on homework assignments. Teachers will be able
to use guides for scoring essay questions and projects, stencils for scoring response-choice questions,
and scales for rating performance assessments. They will be able to use these in ways that produce
consistent results. Teachers will be able to administer standardized achievement tests and be able to
interpret the commonly reported scores: percentile ranks, percentile band scores, standard scores, and
grade equivalents. They will have a conceptual understanding of the summary indexes commonly
reported with assessment results: measures of central tendency, dispersion, relationships, reliability,
and errors of measurement.
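For illustration only (the score values and reliability used here are hypothetical), two of the quantities mentioned above, a standard score and the standard error of measurement, can be written as
\[
z = \frac{X - \bar{X}}{SD}, \qquad SEM = SD\sqrt{1 - r_{xx}},
\]
where $X$ is a raw score, $\bar{X}$ and $SD$ are the mean and standard deviation of the norm group, and $r_{xx}$ is the test's reliability. For a raw score of 62 on a test with mean 50, standard deviation 8, and reliability .91,
\[
z = \frac{62 - 50}{8} = 1.50, \qquad SEM = 8\sqrt{1 - .91} \approx 2.4,
\]
so the score lies about one and a half standard deviations above the mean (roughly the 93rd percentile of a normal distribution), and a 68% confidence band for the obtained score is approximately 62 ± 2.4.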
Teachers will be able to apply these concepts of score and summary indices in ways that en-
hance their use of the assessments that they develop. They will be able to analyze assessment results to
identify pupils’ strengths and errors. If they get inconsistent results, they will seek other explanations
for the discrepancy or other data to attempt to resolve the uncertainty before arriving at a decision.
They will be able to use assessment methods in ways that encourage students’ educational develop-
ment and that do not inappropriately increase students’ anxiety levels.
4. Teachers should be skilled in using assessment results when making decisions about in-
dividual students, planning teaching, developing curriculum, and school improvement. Assess-
ment results are used to make educational decisions at several levels: in the classroom about students,
in the community about a school and a school district, and in society, generally, about the purposes
and outcomes of the educational enterprise. Teachers play a vital role when participating in decision
making at each of these levels and must be able to use assessment results effectively.
Teachers who meet this standard will have the conceptual and application skills that follow.
They will be able to use accumulated assessment information to organize a sound instructional plan
for facilitating students’ educational development. When using assessment results to plan and/or eval-
uate instruction and curriculum, teachers will interpret the results correctly and avoid common misin-
terpretations, such as basing decisions on scores that lack curriculum validity. They will be informed
about the results of local, regional, state, and national assessments and about their appropriate use for
pupil, classroom, school, district, state, and national educational improvement.
5. Teachers should be skilled in developing valid pupil grading procedures that use pupil
assessments. Grading students is an important part of professional practice for teachers. Grading is de-
fined as indicating both a student’s level of performance and a teacher’s valuing of that performance. The
principles for using assessments to obtain valid grades are known and teachers should employ them.
Teachers who meet this standard will have the conceptual and application skills that follow.
They will be able to devise, implement, and explain a procedure for developing grades composed of
marks from various assignments, projects, in-class activities, quizzes, tests, and/or other assessments
that they may use. Teachers will understand and be able to articulate why the grades they assign are
rational, justified, and fair, acknowledging that such grades reflect their preferences and judgments.
Teachers will be able to recognize and to avoid faulty grading procedures such as using grades as
punishment. They will be able to evaluate and to modify their grading procedures in order to improve
the validity of the interpretations made from them about students’ attainments.
7. Teachers should be skilled in recognizing unethical, illegal, and otherwise inappropriate assessment methods and uses of assessment information. Teachers must be well-versed in their own ethical and legal responsibilities in assessment. In addition,
they should also attempt to have the inappropriate assessment practices of others discontinued when-
ever they are encountered. Teachers should also participate with the wider educational community in
defining the limits of appropriate professional behavior in assessment.
Teachers who meet this standard will have the conceptual and application skills that follow.
They will know those laws and case decisions that affect their classroom, school district, and state
assessment practices. Teachers will be aware that various assessment procedures can be misused or
overused resulting in harmful consequences such as embarrassing students, violating a student’s right
to confidentiality, and inappropriately using students’ standardized achievement test scores to measure
teaching effectiveness.
Source: Standards for Teacher Competence in Educational Assessment of Students (1990). Developed by the American
Federation of Teachers, National Council on Measurement in Education, and National Education Association. This
is not copyrighted material; reproduction and dissemination are encouraged.
Proportions of Area under the Normal Curve
[Statistical reference table: for each value of z (column a), the proportion of area under the normal curve between the mean and z (column b) and the proportion of area beyond z (column c).]
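Although the tabled values themselves are not reproduced here, the two quantities the table reports are easy to compute. The short Python sketch below (the function name is ours, not the text's) uses only the standard library's error function to return, for any z, the proportion of area between the mean and z and the proportion beyond z; the printed values can be checked against any published normal-curve table (for example, z = 1.00 gives .3413 and .1587).

```python
import math

def normal_curve_areas(z):
    """Return (area between the mean and z, area beyond z) for the standard normal curve."""
    between = 0.5 * math.erf(abs(z) / math.sqrt(2.0))  # proportion from the mean out to z
    beyond = 0.5 - between                             # proportion remaining in the tail
    return between, beyond

for z in (0.50, 1.00, 1.96, 2.58):
    between, beyond = normal_curve_areas(z)
    print(f"z = {z:4.2f}   mean to z = {between:.4f}   beyond z = {beyond:.4f}")
```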
APPENDIX G
Chapter 2
1. Calculate the mean, variance, and standard deviation for the following score distributions.
Distribution 1 Distribution 2 Distribution 3
Mean = 7.267 Mean = 5.467 Mean = 5.20
Variance = 3.3956 Variance = 5.182 Variance = 4.427
SD = 1.8427 SD = 2.276 SD = 2.104
2. Calculate the Pearson Correlation Coefficient for the following pairs of scores.
Sample 1: r = 0.631
Sample 2: r = 0.886
Sample 3: r = 0.26
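The raw score distributions and score pairs for these problems are given in Chapter 2 rather than reprinted here, so the sketch below uses small made-up data sets purely to illustrate how such answers are computed; the function names and sample values are ours, not the text's. Variance is taken over N, the convention followed in the worked KR-20 example later in this appendix.

```python
import math

def descriptive_stats(scores):
    """Mean, variance, and standard deviation of a score distribution (variance over N)."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / n
    return mean, variance, math.sqrt(variance)

def pearson_r(x, y):
    """Pearson correlation coefficient for paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration only -- not the Chapter 2 data sets.
print(descriptive_stats([7, 9, 6, 8, 5, 9, 7]))
print(pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```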
Chapter 3
1. Transform the following raw scores to the specified standard score formats. The raw score
distribution has a mean of 70 and a standard deviation of 10.
a. Raw score = 85 z-score = 1.5 T-score = 65
b. Raw score = 60 z-score = —1.0 T-score = 40
c. Raw score = 55 z-score = —1.5 T-score = 35
d. Raw score = 95 z-score = 2.5 T-score = 75
e. Raw score = 75 z-score = 0.5 T-score = 55
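These answers follow directly from z = (X − mean)/SD and T = 50 + 10z. A minimal sketch (the helper name is ours) that reproduces the values above for a distribution with a mean of 70 and a standard deviation of 10:

```python
def to_standard_scores(raw, mean=70.0, sd=10.0):
    """Convert a raw score to a z-score and a T-score (T has mean 50 and SD 10)."""
    z = (raw - mean) / sd
    t = 50 + 10 * z
    return z, t

for raw in (85, 60, 55, 95, 75):
    z, t = to_standard_scores(raw)
    print(f"Raw = {raw}   z = {z:+.1f}   T = {t:.0f}")
# Prints z = +1.5, -1.0, -1.5, +2.5, +0.5 and T = 65, 40, 35, 75, 55, matching a-e above.
```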
Chapter 4
1. Calculating KR-20:

              Item 1   Item 2   Item 3   Item 4   Item 5   Total
Student 1        0        1        1        0        1        3
Student 2        1        1        1        1        1        5
Student 3        1        0        1        0        0        2
Student 4        0        0        0        1        0        1
Student 5        1        1        1        1        1        5
Student 6        1        1        0        1        0        3
p_i           0.6667   0.6667   0.6667   0.6667   0.5      SD² = 2.1389
q_i           0.3333   0.3333   0.3333   0.3333   0.5
p_i × q_i     0.2222   0.2222   0.2222   0.2222   0.25     Σ p_i × q_i = 1.1388

KR-20 = 5/4 × (1 − 1.1388/2.1389)
      = 1.25 × (1 − 0.5324)
      = 1.25 × (0.4676)
      ≈ 0.58
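The KR-20 value can be verified directly from the 0/1 item responses above. A minimal Python sketch (the function name is ours) that computes the item difficulties p_i, the sum of p_i × q_i, and the variance of the total scores over N, exactly as laid out in the table:

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item responses (rows = students)."""
    n_students = len(responses)
    k = len(responses[0])                                  # number of items
    p = [sum(row[i] for row in responses) / n_students for i in range(k)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)             # sum of p_i * q_i (about 1.139 here)
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students   # SD^2 = 2.1389 here
    return (k / (k - 1)) * (1 - sum_pq / var_total)

responses = [
    [0, 1, 1, 0, 1],   # Student 1, total 3
    [1, 1, 1, 1, 1],   # Student 2, total 5
    [1, 0, 1, 0, 0],   # Student 3, total 2
    [0, 0, 0, 1, 0],   # Student 4, total 1
    [1, 1, 1, 1, 1],   # Student 5, total 5
    [1, 1, 0, 1, 0],   # Student 6, total 3
]
print(round(kr20(responses), 2))   # 0.58
```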
2. Calculating coefficient alpha:

              Item 1   Item 2   Item 3   Item 4   Item 5   Total
Student 1        4        5        4        5        5       23
Student 2        3        3        2        3        2       13
Student 3        2        3        1        2        1        9
Student 4        4        4        5        5        4       22
Student 5        2        3        2        2        3       12
Student 6        1        2        2        1        3        9
SD_i²         1.2222   0.8889   1.8889   2.3333   1.6667   SD² = 32.89

Coefficient alpha = 5/4 × (1 − (1.2222 + 0.8889 + 1.8889 + 2.3333 + 1.6667)/32.89)
                  = 1.25 × (1 − 8/32.89)
                  = 1.25 × (1 − 0.2432)
                  = 1.25 × (0.7568)
                  = 0.946
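The calculation above simply applies coefficient alpha, α = [k/(k − 1)] × (1 − ΣSD_i²/SD²), to the five item variances and the total-score variance listed in the table. A minimal sketch (the function name is ours) that reproduces the 0.946 result:

```python
def coefficient_alpha(item_variances, total_variance):
    """Coefficient (Cronbach's) alpha from item variances and total-score variance."""
    k = len(item_variances)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

item_variances = [1.2222, 0.8889, 1.8889, 2.3333, 1.6667]   # SD_i^2 values from the table
total_variance = 32.89                                      # SD^2 of the total scores
print(round(coefficient_alpha(item_variances, total_variance), 3))   # 0.946
```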
REFERENCES
Achenbach, T. M. (1991a). Manual for the Child Behavior American Psychological Association (2008). FAQ/Finding in-
Checklists/4—18 and 1991 profile. Burlington: Univer- formation about tests. Retrieved March 17, 2008, from
sity of Vermont, Department of Psychiatry. www.apa.org/science/faq-findtests.html#printeddirec
Achenbach, T. M. (1991b). Manual for the Teacher’s Report American Psychological Association (1993, January). Call for
Form and 1991 profile. Burlington: University of Ver- book proposals for test instruments. APA Monitor, 24, 12.
mont, Department of Psychiatry. American Psychological Association, American Educational
Achenbach, T. M. (1991c). Manual for the Youth Self-Report Research Association, & National Council on Measure-
and 1991 profile. Burlington: University of Vermont, ment in Education (1974). Standards for educational
Department of Psychiatry. and psychological testing. Washington, DC: Author.
Aiken, L. R. (1982). Writing multiple-choice items to mea- American Psychological Association, American Educational
sure higher-order educational objectives. Educational & Research Association, & National Council on Measure-
Psychological Measurement, 42, 803-806. ment in Education (1985). Standards for educational
Aiken, L. R. (2000). Psychological testing and assessment. and psychological testing. Washington, DC: Author.
Boston: Allyn & Bacon. Amrein, A. L., & Berliner, D. C. (2002). High-stakes test-
Alley, G., & Foster, C. (1978). Nondiscriminatory testing of ing, uncertainty, and student learning. Education Policy
minority and exceptional children. Focus on Exceptional Analysis Archives, 10(18). Retrieved May 11, 2003,
Children, 9, 1-14. from http://epaa.asu.edu/epaa/v10n18.
American Educational Research Association (2000). AERA Anastasi, A., & Urbina, S. (1997). Psychological testing (7th
position statement concerning high-stakes testing in ed.). Upper Saddle River, NJ: Prentice Hall.
preK-12 education. Retrieved September 13, 2003, Beck, M. D. (1978). The effect of item response changes on
from www.aera.net/about/policy/stakes.htm scores on an elementary reading achievement test. Jour-
American Educational Research Association, American Psy- nal of Educational Research, 71, 153-156.
chological Association, & National Council on Measure- Bloom, B., Englehart, M., Furst, E., Hill, W., & Krathwohl, D.
ment in Education (1999). Standards for educational (1956). Taxonomy of educational objectives: The clas-
and psychological testing. Washington, DC: American sification of educational goals. Handbook I: Cognitive
Educational Research Association. domain. White Plains, NY: Longman.
American Federation of Teachers, National Council on Boser, U. (1999). Study finds mismatch between California
Measurement in Education, & National Education As- standards and assessments. Education Week, 18, 10.
sociation (1990). Standards for teacher competence in Boston, C. (2001). The debate over national testing. (Report
educational assessment of students. Washington, DC: No. EDO-TM-01-02). College Park, MD: ERIC Clear-
American Federation of Teachers. inghouse on Assessment and Evaluation. (ERIC No. ED
American Psychiatric Association (1994). The diagnostic and 458214).
statistical manual of mental disorders (4th ed.). Wash- Braden, J. P. (1997). The practical impact of intellectual assess-
ington, DC: Author ment issues. School Psychology Review, 26, 242-248.
American Psychological Association (1954). Technical rec- Brookhart, S. M. (2004). Grading. Upper Saddle River, NJ:
ommendations for psychological tests and diagnostic Pearson Merrill Prentice Hall.
techniques. Psychological Bulletin, 51(2, pt. 2). Brown, R. T., Reynolds, C. R., & Whitaker, J. S. (1999). Bias
American Psychological Association (1966). Standards for in mental testing since “Bias in Mental Testing.” School
educational and psychological tests and manuals. Wash- Psychology Quarterly, 14, 208-238.
ington, DC: Author. Camilli, G., & Shepard, L. A. (1994). Methods for identifying
American Psychological Association (2004). Testing and as- biased test items. Thousand Oaks, CA: Sage.
sessment: FAQ/Finding information about psychologi- Campell, D. T., & Fiske, D. W. (1959). Convergent and dis-
cal tests. Retrieved December 1, 2004, from www.apa criminant validation by the multitrait-multimethod ma-
.org/science/faq-findtests.html#findinfo trix. Psychological Bulletin, 56, 546-553.
Cannell, J. J. (1988). Nationally normed elementary achieve- Costin, F. (1970). The optimal number of alternatives in mul-
ment testing in America’s public schools: How all 50 tiple-choice achievement tests: Some empirical evidence
states are above average. Educational Measurement: Is- for a mathematical proof. Educational & Psychological
sues and Practice, 7, 5-9. Measurement, 30, 353-358.
Cannell, J. J. (1989). The “Lake Wobegon” report: How pub- Crocker, L., & Algina, J. (1986). Introduction to classical and
lic educators cheat on standardized achievement tests. modern test theory. New York: Harcourt Brace.
Albuquerque, NM: Friends for Education. Cronbach, L. J. (1950). Further evidence on response sets and
Canter, A. S. (1997). The future of intelligence testing in the test design. Educational & Psychological Measurement,
schools. School Psychology Review, 26, 255-261. 10, 3-31.
Ceperley, P. E., & Reel, K. (1997). The impetus for the Ten- Cronbach, L. J. (1951). Coefficient alpha and the internal
nessee value-added accountability system. In J. Millman structure of tests. Psychometrika, 16, 297-334.
(Ed.), Grading teachers, grading schools, (pp. 133-136). Cronbach, L. J. (1990). Essentials of psychological testing
Thousand Oaks, CA: Corwin Press. (Sth ed.). New York: HarperCollins.
Chan, D., Schmitt, N., DeShon, R. P., Clause, C. S., & Del- Cronbach, L. J., & Furby, L. (1970). How we should mea-
bridge, K. (1997). Reaction to cognitive ability tests: sure change—Or should we? Psychological Bulletin, 52,
The relationship between race, test performance, face 281-302.
validity, and test-taking motivation. Journal of Applied Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests
Psychology, 82, 300-310. and personnel decisions (2nd ed.). Champaign: Univer-
Chandler, L.A. (1990). The projective hypothesis and the sity of Illinois Press.
development of projective techniques for children. In CTB/Macmillan/McGraw-Hill. (1993). California Achieve-
C.R. Reynolds & R. Kamphaus (Eds.), Handbook of ment Tests/5. Monterey, CA: Author.
psychological and educational assessment of children: Cummins, J. (1984). Bilingual special education: Issues in as-
Personality, behavior, and context (pp. 55-69). New sessment and pedagogy. San Diego, CA: College Hill.
York: Guilford Press. Diederich, P. B. (1973). Short-cut statistics for teacher-made
Chase, C. (1979). The impact of achievement expectations tests. Princeton, NJ: Educational Testing Service.
and handwriting quality on scoring essay tests. Journal Doherty, K. M. (2002). Education issues: Assessment. Educa-
of Educational Measurement, 16, 39-42. tion Week on the Web. Retrieved May 14, 2003, from www
Chinn, P. C. (1979). The exceptional minority child: Is- .edweek.org/context/topics/issuespage.cfm?id=41
sues and some answers. Exceptional Children, 46, Ebel, R. L. (1970). The case for true—false items. School Re-
532-536. view, 78, 373-389.
Christ, T., Burns, M., & Ysseldyke, J. (2005). Conceptual con- Ebel, R. L. (1971). How to write true—false items. Educational
fusion within response-to-intervention vernacular: Clar- & Psychological Measurement, 31, 417-426.
ifying meaningful differences. Communiqué, 33(3). Educational Testing Services (1973). Making the classroom
Cizek, G. J. (1998). Filling in the blanks: Putting standardized test: A guide for teachers. Princeton, NJ: Author.
tests to the test. Fordham Report, 2(11). Engelhart, M. D. (1965). A comparison of several item dis-
Cleary, T. A., Humphreys, L. G., Kendrick, S. A., & Wesman, crimination indices. Journal of Educational Measure-
A. (1975). Educational uses of tests with disadvantaged ment, 2, 69-76.
students. American Psychologist, 30, 15-41. Exner, J. E. (1974). The Rorschach: A comprehensive system,
Coffman, W. (1972). On the reliability of ratings of essay ex- I. New York: Wiley.
aminations. NCME Measurement in Education, 3, 1-7. Exner, J. E. (1978). The Rorschach: A comprehensive system,
Coffman, W., & Kurfman, D. (1968). A comparison of two IT, New York: Wiley.
methods of reading essay examinations. American Edu- Feifer, S. G., & Toffalo, D. A. (2007). Integrating RTI with
cational Research Journal, 5, 99-107. cognitive neuropsychology: A scientific approach to
Cohen, J. (1988). Statistical power analysis for the behavioral reading. Middleton, MD: School Neuropsych Press.
sciences (2nd ed.). Hillsdale, NJ: Erlbaum. Feldt, L. (1997). Can validity rise when reliability declines?
Cohen, R. C., & Swerdlik, M. E. (2002). Psychological test- Applied Measurement in Education, 10, 377-387.
ing and assessment: An introduction to tests and mea- Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn
surement. New York: McGraw-Hill. (Ed.), Educational measurement (3rd ed., pp. 105-146).
Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Upper Saddle River, NJ:.Merrill Prentice Hall.
Linn (Ed.), Educational measurement (3rd ed., pp. 201- Finch, A. J., & Belter, R. W. (1993). Projective techniques.
219). Upper Saddle River, NJ: Merrill Prentice Hall. In T. H. Ollendick and M. Hersen (Eds.), Handbook of
Conners, C. K. (1997). Conners’ Rating Scales—Revised. child and adolescent assessment (pp. 224-238). Boston:
North Tonawanda, NY: Multi-Health Systems. Allyn & Bacon.
Flaugher, R. L. (1978). The many definitions of test bias. ment decisions. Recent empirical findings and future
American Psychologist, 33, 671-679. directions. School Psychology Quarterly, 12, 146-154.
Fletcher, J. M., Foorman, B. R., Boudousquie, A., Barnes, Gronlund, N. E. (1998). Assessment of student achievement
M. A., Schatschneider, C., & Francis, D. J. (2002). As- (6th ed.). Boston: Allyn & Bacon.
sessment of reading and learning disabilities: A research Gronlund, N. E. (2000). How to write and use instructional
based intervention-oriented approach. Journal of School objectives (6th ed.). Upper Saddle River, NJ: Merrill/
Psychology, 40, 27-63. Prentice Hall.
Flynn, J. R. (1998). IQ gains over time: Toward finding the Gronlund, N. E. (2003). Assessment of student achievement
causes. In U. Neisser (Ed.), The rising curve: Long-term (7th ed.). Boston: Allyn & Bacon.
gains in IQ and related measures, pp. 25-66. Washing- Gulliksen, H. (1950). Theory of mental tests. New York:
ton, DC: American Psychological Association. Wiley.
Friedenberg, L. (1995). Psychological testing: Design, analy- Haak, R. A. (1990). Using the sentence completion to assess
sis, and use. Boston: Allyn & Bacon. emotional disturbance. In C. R. Reynolds & R. W. Kam-
Frisbie, D. A. (1992). The multiple true—false format: A status phaus (Eds.), Handbook of psychological and educa-
review. Educational Measurement: Issues and Practice, tional assessment of children: Personality, behavior, and
11, 21-26. context (pp. 147-167). New York: Guilford Press.
Fuchs, L. S. (2002). Best practices in providing accommoda- Haak, R. A. (2003). The sentence completion as a tool for
tions for assessment. In A. Thomas & J. Grimes (Eds.), to assessing emotional disturbance. In C. R. Reynolds
Best practices in school psychology (Vol. IV, pp. 899- & R. W. Kamphaus (Eds.), Handbook of psychological
909). Bethesda, MD: National Association of School and educational assessment of children: Personality, be-
Psychologists. havior, and context (2nd ed., pp. 159-181). New York:
Fuchs, D., Mock, D., Morgan, P., & Young, C. (2003). Re- Guilford Press.
sponsiveness to intervention: Definitions, evidence, Hakstian, A. (1971). The effects of study methods and test
and implications for the learning disabilities construct. performance on objective and essay examinations. Jour-
Learning Disabilities Research and Practice, 18(3), nal of Educational Research, 64, 319-324.
157-171. Hales, L., & Tokar, E. (1975). The effect of quality of pre-
Galton, F. (1884). Measurement of character. Fortnightly Re- ceding responses on the grades assigned to subsequent
view, 42, 179-185. (Reprinted in Readings in personality responses to an essay question. Journal of Educational
assessment, by L. D. Goodstein & R. I. Lanyon, Eds., Measurement, 12, 115-117.
1971, New York: Wiley). Halpern, D. F. (1997). Sex differences in intelligence: Im-
Gay, G. H. (1990). Standardized tests: Irregularities in admin- plications for education. American Psychologist, 52,
istering the test effects test results. Journal of Instruc- 1091-1102.
tional Psychology, 17, 93-103. Hammer, E. (1985). The House-Tree-Person Test. In C. New-
Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Mea- mark (Ed.), Major psychological assessment instruments
surement theory for the behavioral sciences. San Fran- (pp. 135-164). Boston: Allyn & Bacon.
cisco: W. H. Freeman. Handler, L. (1985). The clinical use of the Draw-A-Person
Glass, G. V. (1978). Standards and criteria. Journal of Educa- Test (DAP). In C. Newmark (Ed.), Major psychological
tional Measurement, 15, 237-261. assessment instruments (pp. 165-216). Boston: Allyn &
Godshalk, F., Swineford, F., Coffman, W., & Educational Test- Bacon.
ing Service (1966). The measurement of writing ability. Harrow, A. J. (1972). A taxonomy of the psychomotor domain.
New York: College Entrance Examination Board. New York: David McKay.
Goodstein, L. D., & Lanyon, R. I. (1971). Readings in person- Hays, W. (1994). Statistics (Sth ed.). New York: Harcourt
ality assessment. New York: Wiley. Brace.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, Helms, J. E. (1992). Why is there no study of cultural equiva-
NJ: Erlbaum. lence in standardized cognitive ability testing? American
Gray, P. (1999). Psychology. New York: Worth. Psychologist, 47, 1083-1101.
Green, B. F. (1981). A primer of testing. American Psycholo- Hembree, R. (1988). Correlates, causes, effects, and treat-
gist, 36, 1001-1011. ment of test anxiety. Review of Educational Research,
Grier, J. B. (1975). The number of alternatives for optimum 58, 47-77.
test reliability. Journal of Educational Measurement, 12, Hilliard, A. G. (1979). Standardization and cultural bias as
109-113. impediments to the scientific study and validation of
Gresham, F. M., & Witt, J. C. (1997). Utility of intelligence “intelligence.” Journal of Research and Development in
tests for treatment planning, classification, and place- Education, 12, 47-58.
Hilliard A. G. (1989). Back to Binet: The case against the use Kamphaus, R. W., & Frick, P. J. (2002). Clinical assessment
of IQ tests in the schools. Diagnostique, 14, 125-135. of child and adolescent personality and behavior. Bos-
Hoff, D. J. (1999). N.Y.C. probe levels test-cheating charges. ton: Allyn & Bacon.
Education Week, 19, 3. Kamphaus, R. W., & Reynolds, C. R. (1998). Behavior As-
Hoff, D. J. (2003). California schools experiment with dele- sessment System for Children (BASC) ADHD Monitor.
tion of D’s. Education Week, 32, 5. Circle Pines, MN: American Guidance Service.
Hopkins, K. D. (1998). Educational and psychological measure- Kaufman, A. S. (1994). Intelligent testing with the WISC-III.
ment and evaluation (8th ed.). Boston: Allyn & Bacon. New York: Wiley.
Hughes, D., Keeling, B., & Tuck, B. (1980). The influence of Kaufman, A. S., & Lichtenberger, E. O. (1999). Essentials of
context position and scoring method on essay scoring. WAIS-III assessment. New York: Wiley.
Journal of Educational Measurement, 17, 131-135. Keith, T. Z., & Reynolds, C. R. (1990). Measurement and
Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differen- design issues in child assessment research. In C. R.
tial validity of employment tests by race: A comprehen- Reynolds & R.W. Kamphaus (Eds.), Handbook of
sive review and analysis. Psychological Bulletin, 86, psychological and educational assessment of children:
721-735. Intelligence and achievement (pp. 29-62). New York:
Hunter, J. E., Schmidt, F. L., & Rauschenberger, J. (1984). Guilford Press.
Methodological, statistical, and ethical issues in the Keller, B. (2001). Dozens of Michigan schools under suspi-
study of bias in mental testing. In C.R. Reynolds & cion of cheating. Education Week, 20, 18, 30.
R. T. Brown (Eds.), Perspectives on bias in mental test- Kelley, T. L. (1939). The selection of upper and lower groups
ing (pp. 41-101). New York: Plenum Press. for the validation of test items. Journal of Educational
Impara, J. C., & Plake, B. S. (1997). Standard setting: An Psychology, 30, 17-24.
alternative approach. Journal of Educational Measure- Kerlinger, F. N. (1973). Foundations of behavioral research.
ment, 34, 353-366. New York: Holt, Rinehart and Winston.
Jacob, S., & Hartshorne, T. S. (2007). Ethics and law for King, W. L., Baker, J., & Jarrow, J. E. (1995). Testing accom-
school psychologists (5th ed.). Hoboken, NJ: Wiley. modations for students with disabilities. Columbus, OH:
Jaeger, R. M. (1991). Selection of judges for standard-setting. Association on Higher Education and Disability.
Educational Measurement: Issues and Practice, 10(2), Kober, N. (2002). Teaching to the test: The good, the bad,
3-10. and who’s responsible. Test Talk for Leaders (Issue 1).
James, A. (1927). The effect of handwriting on grading. Eng- Washington, DC: Center on Education Policy. Re-
lish Journal, 16, 180-205. trieved May 13, 2003, from www.cep-dc.org/testing/
Jensen, A. R. (1976). Test bias and construct validity. Phi testtalkjune2002.htm
Delta Kappan, 58, 340-346. Kovacs, M. (1991). The Children’s Depression Inventory
Jensen, A. R. (1980). Bias in mental testing. New York: Free (CDI). North Tonawanda, NY: Multi-Health Systems.
Press. Kranzler, J. H. (1997). Educational and policy issues related
Johnson, A. P. (1951). Notes on a suggested index of item va- to the use and interpretation of intelligence tests in the
lidity: The U-L index. Journal of Educational Measure- schools. School Psychology Review, 26, 50-63.
ment, 42, 499-504. Krathwohl, D., Bloom, B., & Masia, B. (1964). Taxonomy of
Joint Committee on Standards for Educational Evaluation educational objectives: Book 2: Affective domain. White
(2003). The student evaluation standards. Arlen Gul- Plains, NY: Longman.
lickson, Chair. Thousand Oaks, CA: Corwin Press. Kubiszyn, T., & Borich, G. (2000). Educational testing and
Joint Committee on Testing Practices (1988). Code of fair measurement: Classroom application and practice (6th
testing practices in education. Washington, DC: Ameri- ed.). New York: Wiley.
can Psychological Association. Kubiszyn, T., & Borich, G. (2003). Educational testing and
Joint Committee on Testing Practices (1998). Rights and measurement: Classroom application and practice (7th
responsibilities of test takers: Guidelines and expec- ed.). New York: Wiley.
tations. Washington, DC: American Psychological Kuder, G. F., & Richardson, M. W. (1937). The theory of the
Association. estimation of reliability. Psychometrika, 2, 151-160.
Kamphaus, R. W. (1993). Clinical assessment of children’s Lawshe, C. H. (1975). A quantitative approach to content va-
intelligence: A handbook for professional practice. Bos- lidity. Personnel Psychology, 28, 563-575.
ton: Allyn & Bacon. Linn, R., & Baker, E. (1992, fall). Portfolios and accountability.
Kamphaus, R. W. (2001). Clinical assessment of child and The CRESST Line: Newsletter of the National Center for
adolescent intelligence. Boston: Allyn & Bacon. Research on Evaluation Standards and Student Testing, 1,
Kamphaus, R. W. (in press). Clinical assessment of children’s 8-10. Retrieved December 6, 2004, from www.cse.ucla
intelligence (3rd ed.). New York: Springer. .edu/products/newsletters/clfall92.pdf
Linn, R. L., & Gronlund, N. E. (2000). Measurement and as- Neisser, U., BooDoo, G., Bouchard, T., Boykin, A., Brody, N.,
sessment in teaching (8th ed.). Upper Saddle River, NJ: Ceci, S., Halpern, D., Loehlin, J., Perloff, R., Sternberg,
Prentice Hall. R., & Urbina, S. (1996). Intelligence: Knowns and un-
Livingston, R. B., Eglsaer, R., Dickson, T., & Harvey-Liv- knowns. American Psychologist, 51, 77-101.
ingston, K. (2003). Psychological assessment practices Nitko, A. J. (2001). Educational assessment of students.
with children and adolescents. Paper presented at the Upper Saddle River, NJ: Merrill Prentice Hall.
23rd Annual National Academy of Neuropsychology Nitko, A. J., & Lane, S. (1990). Standardized multilevel sur-
Conference, Dallas, TX. vey achievement batteries. In C. R. Reynolds & R. W.
Lord, F. M. (1952). The relation of the reliability of multiple- Kamphaus (Eds.), Handbook of psychological and
choice tests to the distribution of item difficulties. Psy- educational assessment of children: Intelligence and
chometrika, 17, 181-194. achievement (pp. 405-434). New York: Guilford Press.
Lowry, R. (2003). Vassar stats: Cohen’s kappa. Retrieved Nomura, J. M., Stinnett, T., Castro, F., Atkins, M., Beason,
August 10, 2003, from http://faculty.vassar.edu/lowry/ S., Linden, S., Hogan, K., Newry, B., & Weichmann, K.
kappa.html (March, 2007). Effects of stereotype threat on cognitive
Lyman, H. B. (1998). Test scores and what they mean. Boston: performance of African Americans. Paper presented to
Allyn & Bacon. the annual meeting of the National Association of School
Manzo, K. K. (2003). Essay scoring goes digital. Education Psychologists, New York.
Week, 22, 39-40, 42. Northeast Technical Assistance Center (1999). Providing test
Mastergeorge, A. M., & Miyoshi, J. N. (1999). Accommoda- accommodations. NETAC Teacher Tipsheet. Rochester,
tions for students with disabilities: A teacher’s guide NY: Author.
(CSE Technical Report 508). Los Angeles: National Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric the-
Center for Research on Evaluation, Standards, and Stu- ory (3rd ed.). New York: McGraw-Hill.
dent Testing. Olson, L. (2003). Georgia suspends testing plans in key
McArthur, D., & Roberts, G. (1982). Roberts Apperception grades. Education Week, 22, 1, 15.
Test for Children: Manual. Los Angeles: Western Psy- Oosterhof, A. C. (1976). Similarity of various item discrimi-
chological Services. nation indices. Journal of Educational Measurement, 13,
McGregor, G., & Vogelsberg, R. (1998). Inclusive schooling 145-150.
practices: Pedagogical and research foundations. A syn- Pedulla, J., Abrams, L., Madaus, G., Russell, M., Ramos, M.,
thesis of the literature that informs best practices about & Miao, J. (2003). Perceived effects of state-mandated
inclusive schooling. Pittsburgh, PA: Allegheny Univer- testing programs on Teaching and Learning: Findings
sity of the Health Sciences. from a national survey of teachers. National Board on
Mealey, D. L., & Host, T. R. (1992). Coping with test anxiety. Educational Testing and Public Policy. Retrieved March
College Teaching, 40, 147-150. 17, 2008, from www.bc.edu/research/reports.html
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Phillips, S. E. (1993). Testing accommodations for disabled
measurement (3rd ed., pp. 13-103). Upper Saddle River, students. Education Law Reporter, 80, 9-32.
NJ: Merrill Prentice Hall. Phillips, S. E. (1994). High-stakes testing accommodations:
Messick, S. (1994). The interplay of evidence and conse- Validity versus disabled rights. Applied Measurement in
quences in the validation of performance assessments. Education, 7(2), 93-120.
Educational Researcher, 23, 13-23. Phillips, S. E. (1996). Legal defensibility of standards: Issues
Murphy, K. R., & Davidshofer, C. O. (2001). Psychologi- Piacentini, J. (1993). Checklists and rating scales. In T. H.
cal testing: Principles and applications (Sth ed.). Upper sues and Practice, 15(2), 5-19.
Saddle River, NJ: Prentice Hall. Piacentini, J. (1993). Checklists and rating scales. In T. H.
Myford, C. M., & Cline, F. (2002). Looking for patterns in Ollendick & M. Hersen (Eds.), Handbook of child and
disagreement: A Facets analysis of human raters’ and adolescent assessment (pp. 82-97). Boston: Allyn &
e-rater’s scores on essays written for the Graduate Man- Bacon.
agement Admission Test (GMAT). Paper presented at the Pike, L. W. (1979). Short-term instruction, testwiseness, and
annual meeting of the American Educational Research the Scholastic Aptitude Test: A literature review with
Association, New Orleans, LA. research recommendations. Princeton, NJ: Educational
National Council on Measurement in Education (1995). Code Testing Service.
of professional responsibilities in educational measure- Popham, W. J. (1999). Classroom assessment: What teachers
ment. Washington, DC: Author. need to know. Boston: Allyn & Bacon.
National Commission of Excellence in Education (1983). A Popham, W. J. (2000). Modern educational measurement:
nation at risk: The imperative for educational reform. Practical guidelines for educational leaders. Boston:
Washington, DC: U.S. Government Printing Office. Allyn & Bacon.
Powers, D.E., & Kaufman, J.C. (2002). Do standardized Reynolds, C. R. (2002). Comprehensive Trail-Making Test:
multiple-choice tests penalize deep-thinking or creative Examiner’s manual. Austin, TX: PRO-ED.
students? (RR-02-15). Princeton, NJ: Educational Test- Reynolds, C.R. (2005, August). Considerations in RTI as a
ing Service method of diagnosis of learning disabilities. Paper pre-
The Psychological Assessment Resources (2003). Catalog of sented to the Annual Institute for Psychology in the
professional testing resources, 26. Lutz, FL: Author. Schools of the American Psychological Association.
The Psychological Corporation (2002). Examiner’s manual Washington, DC.
for the Wechsler Individual Achievement Test—Second Reynolds, C. R., & Kamphaus, R. (Eds.). (1990a). Hand-
Edition. San Antonio: Author. book of psychological and educational assessment of
The Psychological Corporation (2003). The catalog for psycho- children: Personality, behavior, and context. New York:
logical assessment products. San Antonio, TX: Author. Guilford Press.
Ramsay, M., Reynolds, C., & Kamphaus, R. (2002). Essen- Reynolds, C. R., & Kamphaus, R. (Eds.). (1990b). Handbook
tials of behavioral assessment. New York: Wiley. of psychological and educational assessment of children:
Reitan, R. M., & Wolfson, D. (1993). The Halstead-Reitan Reynolds, C. R., & Kamphaus, R. (Eds.). (1990a). Hand-
Neuropsychological Test Battery: Theory and clinical Reynolds, C. R., & Kamphaus, R. W. (1992). Behavior As-
interpretation (2nd ed.). Tucson, AZ: Neuropsychology sessment System for Children: Manual. Circle Pines,
Press. MN: American Guidance Service.
Reynolds, C. R. (1980). In support of “Bias in Mental Test- Reynolds, C. R., & Kamphaus, R. W. (1998). Behavior As-
ing” and scientific inquiry. The Behavioral and Brain sessment System for Children: Manual. Circle Pines,
Sciences, 3, 352. MN: American Guidance Services.
Reynolds, C. R. (1982). The problem of bias in psychologi- Reynolds, C. R., & Kamphaus, R. W. (2003). Reynolds In-
cal assessment. In C. R. Reynolds & T. B. Gutkin (Eds.), tellectual Assessment Scales (RIAS) and the Reynolds
The handbook of school psychology (pp. 178-208). New Intellectual Screening Test (RIST) professional manual.
York: Wiley. Lutz, FL: Psychological Assessment Resources.
Reynolds, C. R. (1983). Test bias: In God we trust; all oth- Reynolds, C. R., & Kaufman, A. S. (1990). Assessment of
ers must have data. Journal of Special Education, 17, children’s intelligence with the Wechsler Intelligence
241-260. Scale for Children—Revised (WISC-R). In C.R.
Reynolds, C. R. (1985). Critical measurement issues in learning Reynolds & R. W. Kamphaus (Eds.), Handbook of psy-
disabilities. Journal of Special Education, 18, 451-476. chological and educational assessment of children: In-
Reynolds, C. R. (1990). Conceptual and technical problems telligence and achievement (pp. 127-165). New York:
in learning disability diagnosis. In C. R. Reynolds & Guilford Press.
R. W. Kamphaus (Eds.), Handbook ofpsychological and Reynolds, C. R., Lowe, P. A., & Saenz, A. (1999). The prob-
educational assessment of children: Intelligence and lem of bias in psychological assessment. In T. B. Gutkin
achievement (pp. 571-592). New York: Guilford Press. & C.R. Reynolds (Eds.), The handbook of school psy-
Reynolds, C. R. (1995). Test bias in the assessment of intel- chology (3rd ed., pp. 549-595). New York: Wiley.
ligence and personality. In D. Saklofsky & M. Zeidner Reynolds, C. R., & Ramsay, M. C. (2003). Bias in psycholog-
(Eds.), International handbook of personality and intel- ical assessment: An empirical review and recommenda-
ligence (pp. 545-576). New York: Plenum. tions. In J. R. Graham & J. A. Naglieri (Eds.), Handbook
Reynolds, C. R. (1998a). Common sense, clinicians, and ac- of psychology: Assessment psychology (pp. 67-93). New
tuarialism in the detection of malingering during head York: Wiley.
injury litigation. In C. R. Reynolds (Ed.), Detection of Reynolds, C. R., & Voress, J. (2007). Test of Memory and
malingering during head injury litigation. Critical issues Learning (TOMAL-2) (2nd ed.). Austin, TX: PRO-ED.
in neuropsychology (pp. 261-286). New York: Plenum. Reynolds, C. R., Voress, J., & Pierson, N. (2007). Develop-
Reynolds, C. R. (1998b). Fundamentals of measurement and mental Test of Auditory Perception (DTAP). Austin, TX:
assessment in psychology. In A. Bellack & M. Hersen PRO-ED.
(Eds.), Comprehensive clinical psychology (pp. 33-55). Reynolds, W. M. (1993). Self-report methodology. In T. H.
New York: Elsevier. Ollendick and M. Hersen (Eds.), Handbook of child and
Reynolds, C. R. (1999). Inferring causality from relational adolescent assessment (pp. 98-123). Boston: Allyn &
data and design: Historical and contemporary lessons Bacon.
for research and clinical practice. The Clinical Neurop- Ricker, K. L. (2004). Setting cutscores: Critical review of
sychologist, 13, 386-395. Angoff and modified-Angoff methods. Alberta Journal
Reynolds, C. R. (2000). Why is psychometric research on bias of Educational Measurement.
in mental testing so often ignored? Psychology, Public Riverside Publishing (2002). CogAT, Form 6: A short guide
Policy, and Law, 6, 144-150. for teachers. Itasca, IL: Author.
Riverside Publishing (2003). Clinical and special needs as- Sireci, S. G. (1998). Gathering and analyzing content validity
sessment catalog. Itasca, IL: Author. data. Educational Assessment, 5, 299-321.
Roid, G. H. (2003). Stanford-Binet Intelligence Scales, Fifth Sireci, S. G., Thissen, D., & Wainer, H. (1991). On the re-
Edition. Itasca, IL: Author. liability of testlet-based tests. Journal of Educational
Rudner, L. M. (2001). Computing the expected proportions of Measurement, 28, 237-247.
misclassified examinees. Practical Assessment, Research Stainback, S., & Stainback, W. (1992). Curriculum consider-
& Evaluation, 7(14). Retrieved December 3, 2005, from ations in inclusive classrooms: Facilitating learning for
http://PAREonline.net/getvn.asp?v=7&n=14 all students. Baltimore: Brooks.
Sackett, P. R., Hardison, C. M., & Cullen, M. J. (2004). On Steele, C. M., & Aronson, J. (1995). Stereotype threat and
interpreting stereotype threat as accounting for African- the intellectual test performance of African Americans.
American differences on cognitive tests. American Psy- Journal of Personality and Social Psychology, 69, 797-
chologist, 59(1), 7-13. 811.
Salvia, J., & Ysseldyke, J. (2007). Assessment in special and Steele, C. M., Spencer, S. J., & Aronson, J. (2002). Contend-
inclusive education. Boston: Houghton Mifflin. ing with group image: The psychology of stereotype and
Samuels, C. A. (2007, July 6). Advocates for students with dis- social identity threat. In M. Zanna (Ed.), Advances in
abilities balk at proposed NCLB change. Education Week. experimental social psychology (Vol. 23, pp. 379-440).
Retrieved July 9, 2007, www.edweek.org/ew/articles. New York: Academic Press.
Sanders, W. L., Saxton, A. M., & Horn, S. P. (1997). The Ten- Stern, R. A. & White, T. (2003). Neuropsychological Assess-
nessee value-added assessment system: A quantitative, ment Battery (NAB). Lutz, FL: Psychological Assess-
outcomes-based approach to educational assessment. ment Resources.
In J. Millman (Ed.), Grading teachers, grading schools Stiggins, R. J. (2001). Student-involved classroom assessment
(pp. 137-162). Thousand Oaks, CA: Corwin Press. (3rd ed.). Upper Saddle River, NJ: Merrill Prentice Hall.
Sandoval, J., & Mille, M. P. W. (1979, September). Accuracy Stiggins, R. J., & Conklin, N. F. (1992). In teacher’s hands:
Judgments of WISC-R item difficulty for minority groups. Investigating the practices of classroom assessment. Al\-
Paper presented at the annual meeting of the American bany, NY: State University of New York Press.
Psychological Association, New York. Stroud, K., & Reynolds, C. R. (2006). School Motivation and
Sarnacki, R. E. (1979, spring). An examination of test-wise- Learning Strategies Inventory (SMALSI). Los Angeles:
ness in the cognitive domain. Review of Educational Re- Western Psychological Services.
search, 49, 252-279. Subkoviak, M. J. (1984). Estimating the reliability of mastery—
Sattler, J. M. (1992). Assessment of children (revised and up- nonmastery classifications. In R. A. Berk (Ed.), A guide
dated 3rd ed.). San Diego, CA: Author. to criterion-referenced test construction (pp. 267-291).
Saupe, J. L. (1961). Some useful estimates of the Kuder- Baltimore: Johns Hopkins University Press.
Richardson formula number 20 reliability coefficient. Suzuki, L. A., & Valencia, R. R. (1997). Race-ethnicity and
Educational and Psychological Measurement, 2, 63-72. measured intelligence: Educational implications. Ameri-
Schoenfeld, W. N. (1974). Notes on a bit of psychological can Psychologist, 52, 1103-1114.
nonsense: “Race differences in intelligence.” Psycho- Tabachnick, B. G., & Fidel, L. S. (1996). Using multivariate
logical Record, 24, 17-32. statistics (3rd ed.). New York: HarperCollins.
Semel, E. M., Wiig, E. H., & Secord, W. A. (2004). Clinical Texas Education Agency (2003). District and campus coordi-
Evaluation of Language Fundamentals 4—Screening nator manual: Texas student assessment program.
Test (CELF-4). San Antonio, TX: Harcourt Assessment. Thurlow, M., Hurley, C., Spicuzza, R., & El Sawaf, H.
Shavelson, R., Baxter, G., & Gao, X. (1993). Sampling vari- (1996). A review of the literature on testing accommo-
ability of performance assessments. Journal of Educa- dations for students with disabilities (Minnesota Report
tional Measurement, 30, 215-232. No. 9). Minneapolis: University of Minnesota, National
Shepard, L., Glaser, R., Linn, R., & Bohrnstedt, G. (1993). Center on Educational Outcomes. Retrieved April 19,
Setting performance standards for student achievement 2004, from http://education.umn.edu/NCEO/Online
tests. Stanford, CA: National Academy of Education. Pubs/MnReport9. html
Sheppard, E. (1929). The effect of quality of penmanship on Tippets E., & Benson, J. (1989). The effect of item arrange-
grades. Journal of Educational Research, 19, 102-105. ment on test anxiety. Applied Measurement in Educa-
Sheslow, D., & Adams, W. (2003). Wide Range Assessment tion, 2, 289-296.
of Memory and Learning 2 (WRAML-2). Wilmington, Turnbull, R., Turnbull, A., Shank, M., Smith, S., & Leal, D.
DE: Wide Range. (2002). Exceptional lives: Special education in today’s
Sidick, J. T., Barrett, G. V., & Doverspike, D. (1994). Three- schools. Upper Saddle River, NJ: Merrill Prentice Hall.
alternative multiple-choice tests: An attractive option. U.S. Department of Education (1997). Guidance on standards,
Personnel Psychology, 47, 829-835. assessments, and accountability—II. Assessments.
Retrieved November 30, 2004, from www.ed.gov/policy/ adaptive testing. In D. Lubinski & R. Dawis (Eds.), As-
elsec/guid/standardsassessment/guidance_pg4.html sessing individual differences in human behavior: New
Viadero, D., & Drummond, S. (1998, April 22). Software said concepts, methods, and findings (pp. 49-79). Palo Alto,
to grade essays for content. Education Week. Retrieved CA: Davies-Black.
January 30, 2004, from www.edweek.org/ew/ew_print Wigdor, A. K., & Garner, W. K. (1982). Ability testing: Uses,
story.cfm?slug=32soft.h17 consequences, and controversy. Washington DC: Na-
Wallace, G., & Hammill, D. D. (2002). Comprehensive Re- tional Academy Press.
ceptive and Expressive Vocabular Test, (CREVT-2) (2nd Williams, R. L. (1970). Danger: Testing and dehumanizing
ed.). Los Angeles: Western Psychological Services. Black children. Clinical Child Psychology Newsletter,
Ward, A. W., & Murray-Ward, M. (1994). Guidelines for the 9, 5-6.
development of item banks: An NCME instructional Williams, R.L., Dotson, W., Dow, P., & Williams, W. S.
module. Educational Measurement: Issues and Prac- (1980). The war against testing: A current status report.
tice, 13, 34-39. Journal of Negro Education, 49, 263-273.
Webster, W. J., & Mendro, R. L. (1997). The Dallas value- Witt, J., Heffer, R., & Pfeiffer, J. (1990). Structured rating
added accountability system. In J. Millman (Ed.), Grad- scales: A review of self-report and informant rating
ing teachers, grading schools (pp. 81-99). Thousand processes, procedures, and issues. In C. R. Reynolds
Oaks, CA: Corwin Press. & R. W. Kamphaus (Eds.), Handbook of psychological
Wechsler, D. W. (1991). Wechsler Intelligence Scale for and educational assessment of children: Personality, be-
Children—Third Edition: Manual. San Antonio, TX: havior, and context (pp. 364-394). New York: Guilford
Psychological Corporation. Press.
Wechsler, D. W. (1997). WAIS-III administration and scoring Woodcock, R. W., McGrew, K. S., & Mather, N. (2001a).
manual. San Antonio, TX: Psychological Corporation. Woodcock-Johnson III (WJ III) Tests of Achievement.
Wechsler, D. W. (2003). Wechsler Intelligence Scale for Itasca, IL: Riverside Publishing.
Children—Fourth Edition: Technical and interpretive Woodcock, R. W., McGrew, K. S., & Mather, N. (2001b).
manual. San Antonio, TX: Psychological Corporation. Woodcock-Johnson III (WJ III) Tests of Cognitive Abili-
Weiss, D. J. (1982). Improving measurement quality and ef- ties. Itasca, IL: Riverside Publishing.
ficiency with adaptive theory. Applied Psychological Woodcock, R. W., McGrew, K.S., & Mather, N. (2001c).
Measurement, 6, 473-492. Woodcock-Johnson III (WJ III) Complete Battery. Itasca,
Weiss, D. J. (1985). Adaptive testing by computer. Journal of IL: Riverside Publishing.
Consulting and Clinical Psychology, 53, 774-789.
Weiss, D. J. (1995). Improving individual difference mea-
surement with item response theory and computerized
Measurement and Assessment in Education, Second Edition, employs a
pragmatic approach to the study of educational tests and measurement
so that teachers will understand essential psychometric concepts and be
able to apply them in the classroom.
The principles that guide this text are:
— What essential knowledge and skills do classroom teachers need to conduct student
  assessments in a professional manner?
— What constitutes a technically accurate presentation of the material?
While providing a slightly more technical presentation of measurement and assessment than more
basic texts, this text is both approachable and comprehensive. The text includes an introduction to
the basic mathematics of measurement, and expands traditional coverage to include a thorough
discussion of performance and portfolio assessments, a complete presentation of assessment
accommodations for students with disabilities, and a practical discussion of professional best
practices in educational measurement.
Merrill is an imprint of Pearson.
Cover Photograph: ©Gabe Palmer/Corbis