Measurement and Assessment
in Education
SECOND EDITION

Cecil R. Reynolds
Texas A&M University

Ronald B. Livingston
University of Texas at Tyler

Victor Willson
Texas A&M University

PEARSON

Upper Saddle River, New Jersey


Columbus, Ohio
Library of Congress Cataloging-in-Publication Data
Reynolds, Cecil R.
Measurement and assessment in education / Cecil R.
Reynolds, Ronald B. Livingston, Victor Willson.—2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN-13: 978-0-205-57934-1 (pbk.)
ISBN-10: 0-205-57934-5 (pbk.)
1. Educational tests and measurements—Handbooks,
manuals, etc. I. Livingston, Ronald B. II. Willson, Victor L. III. Title.
LB3051.R45 2009
371.26—dc22
2008009328

Publisher: Kevin M. Davis


Series Editorial Assistant: Lauren Reinkober
Director of Marketing: Quinn Perkson
Marketing Manager: Erica DeLuca
Editorial Production Service: Omegatype Typography, Inc.
Composition Buyer: Linda Cox
Manufacturing Manager: Megan Cochran
Cover Administrator: Linda Knowles

This book was set in Times Roman by Omegatype Typography, Inc. It was printed and bound by
R. R. Donnelley/Harrisonburg. The cover was printed by Phoenix Color Corporation/Hagerstown.
Copyright © 2009, 2006 by Pearson Education, Inc., Upper Saddle River, New Jersey 07458.
Pearson. All rights reserved. Printed in the United States of America. This publication is protected
by Copyright and permission should be obtained from the publisher prior to any prohibited
reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic,
mechanical, photocopying, recording, or likewise. For information regarding permission(s), write
to: Rights and Permissions Department, 501 Boylston Street, Suite 900, Boston, MA 02116, or fax
your request to 617-671-2290.

Pearson® is a registered trademark of Pearson plc


Merrill® is a registered trademark of Pearson Education, Inc.

Pearson Education Ltd.
Pearson Education Singapore Pte. Ltd.
Pearson Education Canada, Ltd.
Pearson Education—Japan
Pearson Education Australia Pty. Limited
Pearson Education North Asia Ltd.
Pearson Educación de México, S.A. de C.V.
Pearson Education Malaysia Pte. Ltd.

Merrill is an imprint of Pearson.

www.pearsonhighered.com

ISBN-13: 978-0-205-57934-1
ISBN-10: 0-205-57934-5
Cecil: To Julia for the many sacrifices she makes for me and my work.
Ron: To my buddy Kyle—you bring great joy to my life!
Vic: To 34 years’ grad students in measurement and statistics.
BRIEF CONTENTS


1 Introduction to Educational Assessment 1
2 The Basic Mathematics of Measurement 33
3 The Meaning of Test Scores 61
4 Reliability for Teachers 90
5 Validity for Teachers 123
6 Item Analysis for Teachers 147
7 The Initial Steps in Developing a Classroom Test 169
8 The Development and Use of Selected-Response Items 195
9 The Development and Use of Constructed-Response Items 222
10 Performance Assessments and Portfolios 245
11 Assigning Grades on the Basis of Classroom Assessments 277
12 Standardized Achievement Tests in the Era of High-Stakes Assessment 299
13 The Use of Aptitude Tests in the Schools 330
14 Assessment of Behavior and Personality 370
15 Assessment Accommodations 395
16 The Problem of Bias in Educational Assessment 421
17 Best Practices in Educational Assessment 450

APPENDIX A Summary Statements of The Student Evaluation Standards 468
APPENDIX B Code of Professional Responsibilities in Educational Measurement 471
APPENDIX C Code of Fair Testing Practices in Education 479
APPENDIX D Rights and Responsibilities of Test Takers: Guidelines and Expectations 483
APPENDIX E Standards for Teacher Competence in Educational Assessment of Students 491
APPENDIX F Proportions of Area under the Normal Curve 497
APPENDIX G Answers to Practice Problems 501

References 503
Index 511

CONTENTS

Preface xix

1 Introduction to Educational Assessment 1

The Language of Assessment 2


Tests, Measurement, and Assessment 3
Types of Tests 4
Types of Score Interpretations 8

Assumptions of Educational Assessment 9


Psychological and Educational Constructs Exist 9
Psychological and Educational Constructs Can Be Measured 10
Although We Can Measure Constructs, Our Measurement Is Not Perfect 10
There Are Different Ways to Measure Any Given Construct 10
All Assessment Procedures Have Strengths and Limitations 10
Multiple Sources of Information Should Be Part of the Assessment Process 11
Performance on Tests Can Be Generalized to Nontest Behaviors 11
Assessment Can Provide Information That Helps Educators Make Better
Educational Decisions 11
Assessments Can Be Conducted in a Fair Manner 11
Testing and Assessment Can Benefit Our Educational Institutions and Society as a Whole 12

Participants in the Assessment Process 13


People Who Develop Tests 13
People Who Use Tests 14
People Who Take Tests 14
Other People Involved in the Assessment Process 15

Educational Assessment and the Law 15


No Child Left Behind Act of 2001 (NCLB) 15
Individuals with Disabilities Education Improvement Act of 2004 (IDEA 2004) 16
Section 504 of the Rehabilitation Act of 1973 (Section 504) 17
Protection of Pupil Rights Act (PPRA) 19
Family Educational Rights and Privacy Act (FERPA) 19

Common Applications of Educational Assessments 19

Student Evaluations 19
Instructional Decisions 20
Selection, Placement, and Classification Decisions 20

Policy Decisions 21
Counseling and Guidance Decisions 21

What Teachers Need to Know about Assessment 21


Teachers Should Be Proficient in Selecting Professionally Developed Assessment
Procedures Appropriate for Making Instructional Decisions 22
Teachers Should Be Proficient in Developing Assessment Procedures Appropriate for
Making Instructional Decisions 23
Teachers Should Be Proficient in Administering, Scoring, and Interpreting
Professionally Developed and Teacher-Made Assessment Procedures 23
Teachers Should Be Proficient in Using Assessment Results When Making
Educational Decisions 23
Teachers Should Be Proficient in Developing Valid Grading Procedures That
Incorporate Assessment Information 24
Teachers Should Be Proficient in Communicating Assessment Results 24
Teachers Should Be Proficient in Recognizing Unethical, Illegal, and Other
Inappropriate Uses of Assessment Procedures or Information 24

Educational Assessment in the Twenty-First Century 24


Computerized Adaptive Testing (CAT) and Other Technological Advances 25
“Authentic” or Complex-Performance Assessments 26
Educational Accountability and High-Stakes Assessment 27
Trends in the Assessment of Students with Disabilities 28

Summary 29

2 The Basic Mathematics of Measurement 33

The Role of Mathematics in Assessment 33

Scales of Measurement 34
What Is Measurement? 34
Nominal Scales 35
Ordinal Scales 35
Interval Scales 36
Ratio Scales 36

The Description of Test Scores 38


Distributions 38
Measures of Central Tendency 42
Measures of Variability 47

Correlation Coefficients 51
Scatterplots 52
Correlation and Prediction 54

Types of Correlation Coefficients 54


Correlation versus Causality 56

Summary 56

3 The Meaning of Test Scores 61

Norm-Referenced and Criterion-Referenced Score Interpretations


Norm-Referenced Interpretations 63
Criterion-Referenced Interpretations 79

Norm-Referenced, Criterion-Referenced, or Both? 83

Qualitative Description of Scores 85

Summary 86

4 Reliability for Teachers 90


Errors of Measurement 91
Sources of Measurement Error 92

Methods of Estimating Reliability 95


Test-Retest Reliability 97
Alternate-Form Reliability 98
Internal-Consistency Reliability 98
Inter-Rater Reliability 101
Reliability of Composite Scores 102
Selecting a Reliability Coefficient 105
Evaluating Reliability Coefficients 107
How to Improve Reliability 109
Special Problems in Estimating Reliability 111

The Standard Error of Measurement 112


Evaluating the Standard Error of Measurement 112

Reliability: Practical Strategies for Teachers 117

Summary 119

5 Validity for Teachers 123

Threats to Validity 124

Reliability and Validity 125


“Types of Validity” versus “Types of Validity Evidence” 126

Types of Validity Evidence 129


Evidence Based on Test Content 129
Evidence Based on Relations to Other Variables 132
Evidence Based on Internal Structure 139
Evidence Based on Response Processes 140
Evidence Based on Consequences of Testing 140
Integrating Evidence of Validity 141

Validity: Practical Strategies for Teachers 143


Summary 144

6 Item Analysis for Teachers 147

Item Difficulty Index (or Item Difficulty Level) 148


Special Assessment Situations and Item Difficulty 150

Item Discrimination 150


Discrimination Index 151
Item-Total Correlation Coefficients 153
Item Discrimination on Mastery Tests 155
Item Analysis of Speed Tests 156

Distracter Analysis 157


How Distracters Influence Item Difficulty and Discrimination 158
Item Analysis: Practical Strategies for Teachers 159
Using Item Analysis to Improve Items 161
Item Analysis of Performance Assessments 163
Qualitative Item Analysis 164
Using Item Analysis to Improve Classroom Instruction 165
Summary 167

7 The Initial Steps in Developing a Classroom Test 169


Characteristics of Educational Objectives 171
Scope 171

Taxonomy of Educational Objectives 172


Cognitive Domain 172
Affective Domain 175
Psychomotor Domain 176

Behavioral versus Nonbehavioral Educational Objectives 177



Writing Educational Objectives 178

Developing a Table of Specifications (or Test Blueprint) 179

Implementing the Table of Specifications and Developing an Assessment 181


Norm-Referenced versus Criterion-Referenced Score Interpretations 182

Developing Classroom Tests in a Statewide Testing Environment 182


Selecting Which Types of Items to Use 183
Putting the Assessment Together 186

Preparing Your Students and Administering the Assessment 189

Summary 191

8 The Development and Use of Selected-Response Items 195

Multiple-Choice Items 196


Guidelines for Developing Multiple-Choice Items 197
Strengths of Multiple-Choice Items 206
Weaknesses of Multiple-Choice Items 210

True—False Items 211


Guidelines for Developing True—False Items 212
Strengths of True—False Items 213
Weaknesses of True—False Items 214

Matching Items 215


Guidelines for Developing Matching Items 216
Strengths of Matching Items 218
Weaknesses of Matching Items 218

Summary 219

9 The Development and Use of Constructed-Response Items 222

Oral Testing: The Oral Essay as a Precursor of Constructed-Response Items 223

Essay Items 224
Purposes of Essay Items 224
Essay Items at Different Levels of Complexity 226
Restricted-Response versus Extended-Response Essays
Guidelines for Developing Essay Items 228
Strengths of Essay Items 229
Weaknesses of Essay Items 231
Guidelines for Scoring Essay Items 233

Short-Answer Items 237


Guidelines for Developing Short-Answer Items 239
Strengths of Short-Answer Items 241
Weaknesses of Short-Answer Items 241

A Final Note: Constructed-Response versus Selected-Response Items 242


Summary 243

10 Performance Assessments and Portfolios 245

What Are Performance Assessments? 246

Guidelines for Developing Effective Performance Assessments 252


Selecting Appropriate Performance Tasks 252
Developing Instructions 256
Developing Procedures for Evaluating Responses 256
Implementing Procedures to Minimize Errors in Rating 261
Strengths of Performance Assessments 266
Weaknesses of Performance Assessments 267

Portfolios 268
Guidelines for Developing Portfolio Assessments 269
Strengths of Portfolio Assessments 271
Weaknesses of Portfolio Assessments 272

Summary 273

11 Assigning Grades on the Basis of Classroom Assessments 277


Feedback and Evaluation 278
Formal and Informal Evaluation 281
The Use of Formative Evaluation in Summative Evaluation 281

Reporting Student Progress: Which Symbols to Use? 282


The Basis for Assigning Grades 284
Frame of Reference 285
Norm-Referenced Grading (Relative Grading) 285
Criterion-Referenced Grading (Absolute Grading) 287
Achievement in Relation to Improvement or Effort 288
Achievement Relative to Ability 289
Recommendation 289

Combining Grades into a Composite 290



Informing Students of the Grading System and Grades Received 295

Parent Conferences 295

Summary 297

12 Standardized Achievement Tests in the Era of High-Stakes Assessment 299

The Era of High-Stakes Assessment 301

Group-Administered Achievement Tests 302


Commercially Developed Group Achievement Tests 304
State-Developed Achievement Tests 310
Value-Added Assessment: A New Approach to Educational Accountability 315
Best Practices in Using Standardized Achievement Tests in Schools 318

Individual Achievement Tests 324


Selecting an Achievement Battery 327

Summary 327

13 The Use of Aptitude Tests in the Schools 330

A Brief History of Intelligence Tests 333

The Use of Aptitude and Intelligence Tests in Schools 336


Aptitude—Achievement Discrepancies 337

A New Assessment Strategy for Specific Learning Disabilities: Response to Intervention (RTI) 339
Major Aptitude/Intelligence Tests 340
Group Aptitude/Intelligence Tests 340
Individual Aptitude/Intelligence Tests 343
Selecting Aptitude/Intelligence Tests 350
Understanding the Report of an Intellectual Assessment 353

College Admission Tests 366


Summary 367

14 Assessment of Behavior and Personality 370

Assessing Behavior and Personality 372


Response Sets 372
Assessment of Behavior and Personality in the Schools 373

Behavior Rating Scales 375


Behavior Assessment System for Children, Second Edition—Teacher Rating Scale
and Parent Rating Scale (TRS and PRS) 376
Conners’ Rating Scales—Revised (CRS-R) 381
Child Behavior Checklist and Teacher Report Form (CBCL and TRF) 381

Self-Report Measures 383


Behavior Assessment System for Children, Second Edition—Self-Report of
Personality (SRP) 383
Youth Self-Report (YSR) 387

Projective Techniques 388


Projective Drawings 389
Sentence Completion Tests 390
Apperception Tests 391
Inkblot Techniques 391

Summary 392

15 Assessment Accommodations 395

Major Legislation That Affects the Assessment of Students with Disabilities 397
Individuals with Disabilities Education Act (IDEA) 397
IDEA Categories of Disabilities 399

Section 504 403

The Rationale for Assessment Accommodations 403


When Are Accommodations Not Appropriate or Necessary? 404
Strategies for Accommodations 405
Modifications of Presentation Format 405
Modifications of Response Format 405
Modifications of Timing 406
Modifications of Setting 407
Adaptive Devices and Supports 408
Using Only a Portion of a Test 409
Using Alternate Assessments 409

Determining What Accommodations to Provide 410


Assessment of English Language Learners (ELLs) 412
Reporting Results of Modified Assessments 415
Summary 418

16 The Problem of Bias in Educational Assessment 421

What Do We Mean by Bias? 424


Past and Present Concerns: A Brief Look 425

The Controversy over Bias in Testing: Its Origin, What It Is, and What
It Is Not 425
Cultural Bias and the Nature of Psychological Testing 431

Objections to the Use of Educational and Psychological Tests with Minority Students 432
Inappropriate Content 432
Inappropriate Standardization Samples 433
Examiner and Language Bias 433
Inequitable Social Consequences 433
Measurement of Different Constructs 433
Differential Predictive Validity 433
Qualitatively Distinct Aptitude and Personality 433

The Problem of Definition in Test Bias Research: Differential Validity 435

Cultural Loading, Cultural Bias, and Culture-Free Tests 436

Inappropriate Indicators of Bias: Mean Differences and Equivalent Distributions 436

Bias in Test Content 437

Bias in Other Internal Features of Tests 440

Bias in Prediction and in Relation to Variables External to the Test 442

Summary 447

17 Best Practices in Educational Assessment 450

Guidelines for Developing Assessments 452

Guidelines for Selecting Published Assessments 453

Guidelines for Administering Assessments 457

Guidelines for Scoring Assessments 460

Guidelines for Interpreting, Using, and Communicating Assessment Results 462

Responsibilities of Test Takers 463

Summary and Top 12 Assessment-Related Behaviors to Avoid 465

APPENDIX A Summary Statements of The Student Evaluation Standards 468

APPENDIX B Code of Professional Responsibilities in Educational Measurement 471

APPENDIX C Code of Fair Testing Practices in Education 479

APPENDIX D Rights and Responsibilities of Test Takers: Guidelines and Expectations 483

APPENDIX E Standards for Teacher Competence in Educational Assessment of Students 491

APPENDIX F Proportions of Area under the Normal Curve 497

APPENDIX G Answers to Practice Problems 501

References 503

Index 511

PREFACE

When we meet someone for the first time, we engage inescapably in some form of evalu-
ation. Funny, personable, intelligent, witty, arrogant, and rude are just some of the descrip-
tors we might apply to people we meet. This happens in classrooms as well. As university
professors, just as other classroom teachers do, we meet new students each year and form
impressions about them from our interactions. These impressions are forms of evaluation or
assessment of characteristics we observe or determine from our interactions with these new
students. We all do this, and we do it informally, and at times we realize, once we have had
more experience with someone, that our early evaluations were in error. There are times,
however, when our evaluations must be far more formal and hopefully more precise. This is a
book about those times and how to make our appraisals more accurate and meaningful.
We must, for example, assign grades and determine a student’s suitability for ad-
vancement. Psychologists need to make accurate diagnoses of various forms
of psychopathology such as mental retardation, learning disabilities, schizophrenia, de-
pression, anxiety disorders, and the like. These types of evaluations are best accomplished
through more rigorous means than casual interaction and more often than not are accom-
plished best via the use of some formal measurement procedures. Just as a carpenter can
estimate the length of a board needed for some construction project, we can estimate student
characteristics—but neither is satisfactory when it is time for the final construction or deci-
sion. We both must measure.
Educational and psychological tests are the measuring devices we use to address such
questions as the degree of mastery of a subject matter area, the achievement of educational
objectives, the degree of anxiety a student displays over taking a test, or even the ability
of a student to pay attention in a classroom environment. Some tests are more formal than
others, and the degree of formality of our measuring techniques varies on a continuum from
the typical teacher-made test on a specific assignment to commercially prepared, carefully
standardized procedures with large, nationally representative reference samples for standard
setting.
The purpose of this book is to educate the reader about the different ways in which we can measure constructs of interest to us in schools and the ways to ensure that we do the best job possible in designing our own classroom assessments. We also provide detailed information on a variety of assessments used by other professionals in schools, such as school psychologists, so the reader can interact with these other professionals more intelligently and use the results of the many assessments that occur in schools to do a better job with the students.
Not only is the classroom assessment process covered in detail, but the use of various standardized tests also is covered. The regular or general education classroom is emphasized, but special applications of the evaluation and measurement processes to students with disabilities are also noted and explained. Whenever possible, we have tried to illustrate the principles taught through application to everyday problems in the schools. Through an integrated approach to presentation and explanation of principles of tests and measurement
with an emphasis on applications to classroom issues, we hope we will have prepared the
reader for the changing face of assessment and evaluation in the schools. The fundamental
principles taught may change little, but actual practice in the schools is sure to change.
This book is targeted primarily at individuals who are in teacher preparation programs
or preparing for related educational positions such as school administrators. Others who
may pursue work in educational settings will also find the content informative and at all
times, we hope, practical. In preparing this text, we repeatedly asked ourselves two ques-
tions. First, what do teachers really need to know to perform their jobs? We recognize that
most teachers do not aspire to become assessment experts, so we have tried to focus on the
essential knowledge and skills and avoid esoteric information. Second, what does the em-
pirical research tell us about educational assessment and measurement? At times it might be
easier to go with educational fads and popular trends and disregard what years of research
have shown us. While this may be enticing, it is not acceptable! We owe you, our readers,
the most accurate information available that is based on the current state of scientific knowl-
edge. We also owe this to the many students you will be evaluating during your careers.
The authors have developed two indispensable supplements to augment the textbook.
Particularly useful for student review and mastery of the material presented are the audio-
enhanced PowerPoint™ lectures featuring Dr. Victor Willson. A Test Bank is also available
to instructors.

The Second Edition

We appreciate the opportunity to prepare a second edition of this text! While this edition
maintains the organization of the first edition, there have been a number of substantive
changes. A primary focus of this revision was updating and increasing the coverage of fed-
eral laws and how they have an impact on educational assessment. In doing this we tried to
emphasize how these laws affect teachers in our classrooms on a daily basis. Our guiding
principle was to follow our instructors’ and readers’ lead—retaining what they liked and
adding what they requested.

Acknowledgments

We would like to express our appreciation to our editor, Arnis Burvikovs, for his leadership in
helping us bring this edition to closure. His work in obtaining numerous high-quality reviews
and then guiding us on how best to implement the suggestions from them was of tremendous
benefit to us in our writing assignments. Our thanks to the reviewers: Nick Elksnin, The
Citadel; Kathy Peca, Eastern New Mexico University; and Dan Smith, University at Buffalo,
Canisius College.
To our respective families, we owe a continual debt of gratitude for the warm recep-
tion they give our work, for their encouragement, and for their allowances for our work
schedules. Hopefully, this volume will be of use to those who toil in the classrooms of our
nation and will assist them in conducting better evaluations of students, enabling even better
teaching to occur. This will be our thanks.
CHAPTER 1

Introduction to Educational Assessment

Why do I need to learn about testing and assessment?

CHAPTER HIGHLIGHTS

The Language of Assessment
Assumptions of Educational Assessment
Participants in the Assessment Process
Educational Assessment and the Law
Common Applications of Educational Assessments
What Teachers Need to Know about Assessment
Educational Assessment in the Twenty-First Century

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Define test, measurement, and assessment.
2. Describe and give examples of different types of tests.
3. Describe and give examples of different types of score interpretations.
4. Describe and explain the assumptions underlying educational assessment.
5. Describe the major participants in the assessment process.
6. Describe and explain the major applications of assessment in schools.
7. Describe the major federal education laws influencing the use of assessment.
8. Describe and explain the competencies teachers should demonstrate in educational assessment.
9. Describe some major trends in assessment.

Students in teacher preparation programs want to teach, but our combined experience of more than 60 years in colleges of education suggests that they are generally not very interested in testing and assessment. Yes, they know that teachers do test, but testing is not what led them to select a career in teaching. Teachers love children and love teaching, but they often have a negative or at best neutral view of testing. This predisposition is not limited to education students. Undergraduate psychology students are typically drawn to psychology because they want to work with and help people. Most aspire to be counselors or therapists, and relatively few want to specialize in assessment. When we teach undergraduate education or psychology test and measurement courses, we recognize that it is important to spend some time explaining to our students why they need to learn about testing and assessment. This is one of the major goals of this chapter. We want to explain to you why you need to learn about testing and assessment, and hopefully convince you that this is a worthwhile endeavor.
Teaching is often conceptualized as a straightforward process whereby teachers
provide instruction and students learn. With this perspective, teaching is seen as a simple
instruction—learning process. In actual practice, it is more realistic to view assessment as an
integral component of the teaching process. Assessment can and should provide relevant information that both enhances instruc-
tion and promotes learning. In other words, there should be a close reciprocal relationship
between instruction, learning, and assessment. With this expanded conceptualization of teach-
ing, instruction and assessment are integrally related, with assessment providing objective
feedback about what the students have learned, how well they have learned it, how effective
the instruction has been, and what information, concepts, and objectives require more atten-
tion. Instead of teaching being limited to an instruction—learning process, it is conceptualized
more accurately as an instruction—learning—assessment process. In this model, the goal of as-
sessment, like that of instruction, is to facilitate student achievement (e.g., Gronlund, 1998).
In the real world of education, it is difficult to imagine effective teaching that does not involve
some form of assessment. The better job teachers do in assessing student learning, the better
their teaching will be.
The following quote from Stiggins and Conklin (1992) illustrates the important role
teachers play in the overall process of educational assessment.

As a nation, we spend billions of dollars on educational assessment, including hundreds of millions for international and national assessments, and additional hundreds of millions for state-
wide testing programs. On top of these, the standardized tests that form the basis of district-wide
testing programs represent a billion dollar industry. If we total all of these expensive, highly
visible, politically important assessments, we still account for less than 1 percent of all the as-
sessments conducted in America’s schools. The other 99 percent are conducted by teachers in
their classrooms on a moment-to-moment, day-to-day, and week-to-week basis. (back cover)

In summary, if you want to be an effective teacher, you need to be knowledgeable about testing
and assessment. Instruction and assessment are both instrumental parts of the teaching process,
and assessment is a major component of a teacher’s day-to-day job. We hope that by the time
you finish this chapter you will have a better understanding of the role of assessment in education
and recognize that although you may never want to specialize in testing and assessment, you will
appreciate the important role assessment plays in the overall educational process.

The Language of Assessment

In our brief introduction we have already used a number of relatively common but somewhat
technical terms. Before we go any further it would be beneficial to define them for you.

Tests, Measurement, and Assessment


Test. A test is a device or procedure in which a sample of an individual’s behavior is obtained, evaluated, and scored using standardized procedures (AERA, APA, & NCME, 1999). This is a rather broad or general definition, but at this point in our discussion we will be best served with this generic definition. Rest assured that we will provide more specific information on different types of tests in due time. Before proceeding, however, it should be noted that a specific aspect of our definition of a test deserves mentioning. Because a test is only a sample of behavior, it is imperative that tests reflect a representative sample of the behavior you are interested in learning about. The importance of the concept of a representative sample will become more apparent as we proceed with our study of testing and assessment, and we will touch on it in more detail in later chapters when we address the technical properties of tests.

Measurement. Measurement can be defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors. An educational test is a measuring device and therefore involves rules (e.g., administration guidelines and scoring criteria) for assigning numbers that represent an individual’s performance. In turn, these numbers are interpreted as reflecting characteristics of the test taker. For example, the number of words spelled correctly on a spelling test might be interpreted as reflecting a student’s spelling skills.
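To make “a set of rules for assigning numbers” concrete, here is a minimal sketch in Python. The scoring rule (one point per word spelled exactly as keyed) and the word lists are hypothetical illustrations, not material drawn from any published test.

```python
# Minimal illustration of measurement as a scoring rule (hypothetical data).
def score_spelling_test(answer_key, responses):
    """Assign a number by rule: one point for each word spelled exactly as keyed."""
    return sum(
        1 for keyed, given in zip(answer_key, responses)
        if given.strip().lower() == keyed.lower()
    )

answer_key = ["receive", "separate", "rhythm", "necessary"]
responses = ["recieve", "separate", "rhythm", "necessary"]

raw_score = score_spelling_test(answer_key, responses)
print(f"Raw score: {raw_score} of {len(answer_key)}")  # Raw score: 3 of 4
```

Change the rule (for example, award partial credit for near misses) and the numbers assigned change as well; the rules themselves are part of the measurement.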

Assessment. Assessment is any systematic procedure for collecting information that can be used to make inferences about the characteristics of people or objects (AERA et al., 1999). Assessment should lead to an increased understanding of these characteristics. Tests are obviously one systematic method of collecting information and are therefore one set of tools for assessment. Reviews of historical records, interviews, and observations are also legitimate assessment techniques and all are maximally useful when they are integrated. Therefore, assessment is a broader, more comprehensive process than testing.

Now that we have defined these common terms, with some reluctance we acknowledge that in actual practice many educational professionals use testing, measurement, and assessment interchangeably. Recognizing this, Popham (2000) noted that in contemporary educational circles, assessment has become the preferred term. Measurement sounds rather rigid and sterile when applied to students and tends to be avoided. Testing has its own negative connotations. For example, hardly a week goes by when newspapers don’t contain articles about “teaching to the test” or “high-stakes testing,” typically with negative connotations. Additionally, when people hear the word test they usually think of paper-and-pencil tests. In recent years, as a result of growing dissatisfaction with traditional paper-and-pencil tests, alternative
testing procedures have been developed (e.g., performance assessments and portfolios). As
a result, testing is not seen as particularly descriptive of modern educational practices. That
leaves us with assessment as the current buzzword among educators.
Before proceeding, we should define some other terms. A psychometrician is a professional who has specialized in the area of testing, measurement, and assessment. You will likely hear people refer to the psychometric properties of a test, and by this they mean the measurement or statistical characteristics of a test; psychometrics is the science of psychological measurement. These measurement characteristics include reliability and validity. Reliability refers to the stability or consistency of test scores. On a more theoretical level, reliability refers to the degree to which test scores are free from measurement error (AERA et al., 1999). Scores that are relatively free from measurement errors will be stable or consistent (i.e., reliable). Validity refers to the accuracy of the interpretations of test scores. If test scores are interpreted as reflecting intelligence, do they actually reflect intellectual ability? If test scores are used to predict success on a job, can they accurately predict who will be successful on the job?

Types of Tests
We defined a fest as a device or procedure in which a sample of an individual’s behavior
is obtained, evaluated, and scored using standardized procedures (AERA et al., 1999). You
have probably taken a large number of tests in your life, and it is likely that you have noticed
that all tests are not alike. For example, people take tests in schools that help determine their
grades, tests to obtain drivers’ licenses, interest inventories to help make educational and
vocational decisions, admissions tests when applying for college, exams to obtain profes-
sional certificates and licenses, and personality tests to gain personal understanding. This
brief list is clearly not exhaustive!
Cronbach (1990) notes that tests can generally be classified as measures of either maximum performance or typical response. Maximum performance tests are also often referred to as ability tests, but achievement tests are included here as well. On maximum performance tests items may be scored as either “correct” or “incorrect,” and examinees are encouraged to demonstrate their very best performances. Maximum performance tests are designed to assess the upper limits of the examinee’s knowledge and abilities. For example, maximum performance tests can be designed to assess how well a student performs selected tasks or has mastered a specified content domain. Intelligence tests and classroom achievement tests are common examples of maximum performance tests. In contrast, typical response tests attempt to measure the typical behavior and characteristics of examinees. Often, typical response tests are referred to as personality tests, and in this context personality is used broadly to reflect a whole host of noncognitive characteristics such as attitudes, behaviors, emotions, and interests (Anastasi & Urbina, 1997). Some individuals reserve the term test for maximum performance measures, while using terms such as scale and inventory when referring to typical performance instruments (AERA et al., 1999). In this textbook we will use the term test in its broader sense, applying to both maximum performance and typical response procedures.

Maximum Performance Tests. As we noted, maximum performance tests are designed


to assess the upper limits of the examinee’s knowledge and abilities. Within the broad cat-
egory of maximum performance tests, a number of subcategories are often employed. First,
maximum performance tests are often classified as either achievement tests or aptitude tests.
Second, maximum performance tests are often described as either speed or power tests. Fi-
nally, maximum performance tests can be classified as either objective or subjective. These
distinctions, while not absolute in nature, have a long historical basis and provide some
useful descriptive information.

Achievement and Aptitude. Maximum performance tests are often classified as either achievement tests or aptitude tests. Achievement tests measure knowledge and skills in an area in which instruction has been provided (AERA et al., 1999). In contrast, aptitude tests measure cognitive abilities and skills that are accumulated as the result of overall life experiences (AERA et al., 1999). In other words, achievement tests are linked or tied to a specific program of instructional objectives, whereas aptitude tests reflect the cumulative impact of life experiences as a whole. This distinction, however, is not absolute and is actually a matter of degree or emphasis. Most testing experts today conceptualize both achievement and aptitude tests as measures of developed cognitive abilities that can be ordered along a continuum in terms of how closely linked the assessed abilities are to specific learning experiences.
Another distinction between achievement and aptitude tests involves the ways their results are used: achievement tests typically measure what students have already learned, whereas aptitude tests are typically used to predict future performance. However, this distinction is not absolute either. As an example, a test given at the end of high school to assess achievement might also be used to predict success in college. Although it is important to recognize that the distinction between achievement and aptitude tests is not absolute, the achievement/aptitude distinction is also useful when discussing different types of student abilities.

Speed and Power Tests. Maximum performance tests often are categorized as either speed or power tests. On speed tests, performance reflects differences in the speed of performance. A speed test generally contains items that are relatively easy and has a strict time limit that prevents any examinees from successfully completing all the items. On a pure power test, the examinee is given plenty of time to attempt all the items, but the items are ordered according to difficulty, and the test contains
some items that are so difficult that no examinee is expected to answer them all. As a result,
performance on a power test primarily reflects the difficulty of the items the examinee is
able to answer correctly. Well-developed speed and power tests are designed so no one will
obtain a perfect score. They are designed this way because perfect scores are “indetermi-
nate.” That is, if someone obtains a perfect score on a test, the test failed to assess the very
upper limits of that person’s ability. To assess adequately the upper limits of ability, tests
need to have what test experts refer to as an “adequate ceiling”; that is, the tests are difficult
enough that no examinee will be able to obtain a perfect score. As you might expect, this
distinction between speed and power tests is also one of degree rather than being absolute.
Most often a test is not a pure speed test or a pure power test, but incorporates some com-
bination of the two approaches. For example, the Scholastic Assessment Test (SAT) and
Graduate Record Examination (GRE) are considered power tests, but both have time limits.
When time limits are set such that 95% or more of examinees will have the opportunity to
respond to all items, the test is still considered to be a power test and not a speed test.

Objective and Subjective Maximum Performance Tests. Objectivity typically implies im-
partiality or the absence of personal bias. Cronbach (1990) notes that the less test scores are
influenced by the subjective judgment of the person grading or scoring the test, the more
objective the test is. In other words, objectivity refers to the extent that trained examiners
who score a test will be in agreement and score responses in the same way. Tests with
selected-response items (e.g., multiple choice, true—false, and matching) that can be scored
using a fixed key and that minimize subjectivity in scoring are often referred to as “objective”
tests. In contrast, subjective tests are those that rely on the personal judgment of the individual
grading the test. For example, essay tests are considered subjective because test graders rely to
some extent on their own subjective judgment when scoring the essays. Most students are well
aware that different teachers might assign different grades to the same essay item. It is com-
mon, and desirable, for those developing subjective tests to provide explicit scoring rubrics in
an effort to reduce the impact of the subjective judgment of the person scoring the test.
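As a rough sketch of why selected-response formats are described as objective, the hypothetical example below scores a five-item quiz against a fixed answer key; the items and key are invented for illustration.

```python
# Illustrative only: scoring selected-response items against a fixed key.
# Because the rule leaves nothing to judgment, any scorer applying this key
# to the same answer sheet arrives at the same total.

ANSWER_KEY = {1: "B", 2: "D", 3: "A", 4: "C", 5: "B"}  # hypothetical 5-item quiz

def score_with_key(responses):
    """Count items answered exactly as keyed."""
    return sum(1 for item, keyed in ANSWER_KEY.items() if responses.get(item) == keyed)

student_responses = {1: "B", 2: "D", 3: "C", 4: "C", 5: "B"}
print(score_with_key(student_responses))  # 4
```

An essay item, by contrast, has no such key; a well-written scoring rubric narrows, but does not eliminate, the judgment involved.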

Typical Response Tests. As we indicated, typical response tests are designed to measure the typical behavior and characteristics of examinees. Typical response tests measure constructs such as personality, behavior, attitudes, or interests. In conventional assessment terminology, the general term personality broadly encompasses a wide range of emotional, interpersonal, motivational, attitudinal, and other personal characteristics (Anastasi & Urbina, 1997). In terms of personality testing, most testing experts distinguish between objective and projective techniques. Although there are some differences, this distinction largely parallels the separation of maximum performance tests into objective or subjective tests. These two approaches are described next.
Objective Personality Tests. As with maximum performance tests, in the context of typical response assessment objectivity also implies impartiality or the absence of personal bias. Objective personality tests use items that are not influenced by the subjective judgment of the person scoring the test and are scored in an objective manner. For example, a personality test that includes true—false items such as “I enjoy parties” is considered objective. The test takers simply respond true if the statement describes them and false if it does not. By using a scoring key, there should be no disagreement among scorers regarding how to score the items.

Projective Personality Tests. Projective personality tests involve the presentation of ambiguous material that elicits an almost infinite range of responses, and most projective tests involve subjectivity in scoring. For example, the clinician may show the examinee an inkblot and ask: “What might this be?” Instructions to the examinee are minimal, there are essentially no restrictions on the examinee’s response, and there is considerable subjectivity when scoring the response. Elaborating on the distinction between objective and projective tests, Reynolds (1998b) noted:

It is primarily the agreement on scoring that differentiates objective from subjective tests.
If trained examiners agree on how a particular answer is scored, tests are considered objec-
tive; if not, they are considered subjective. Projective is not synonymous with subjective in
this context but most projective tests are closer to the subjective than objective end of the
continuum of agreement on scoring. (p. 49)

Exclusive to projective tests is what is referred to as the “projective hypothesis.”


In summary, the projective hypothesis holds that when examinees respond to ambiguous
stimuli, they respond in a manner that reflects their genuine unconscious desires, motives,
and drives without interference from the ego or conscious mind (Reynolds, 1998b). Projec-
tive techniques are extremely popular, but they are the focus of considerable controversy.
This controversy focuses on the subjective nature of this approach and the lack of empirical
evidence supporting the technical qualities of the instruments. In other words, although the
tests are popular there is little evidence they provide reliable and valid information.
Table 1.1 depicts the major categories of tests we have discussed. Although we have
introduced you to the major types of tests, this brief introduction clearly is not exhaustive.
Even though essentially all tests can be classified according to this scheme, other distinc-
tions are possible. For example, a common distinction is made between standardized tests and nonstandardized tests. The goal of standardization is to make sure that testing conditions are the same for all the individuals taking the test (AERA et al., 1999). Part of the process of standardizing most tests involves administering them to large samples that represent the types of individuals who will take the test. This group, typically referred to as the standardization sample, serves as a reference group for interpreting the scores of those who take the test later (Anastasi & Urbina, 1997). Examples of standardized tests include the Scholastic Assessment Test (SAT) and the American College Test (ACT), popular admission tests used by colleges to help select students. Nonstandardized tests are developed in a less formal manner. The most common type of nonstandardized tests is the classroom achievement tests with which we are all familiar. Practically every day of the academic year teachers are developing and administering classroom tests.

Finally, it is common to distinguish between individual tests (i.e., tests designed to be administered to one examinee at a time) and group tests (i.e., tests administered to more than one examinee at a time). This is an important distinction that applies to the administration of the test rather than the type of the test. For example, individual aptitude tests and group aptitude tests are both aptitude tests; they simply differ in how they are administered. This is true in the personality domain as well wherein some tests require one-on-one administration but others can be given to groups.

TABLE 1.1 Major Categories of Tests

I. Maximum Performance Tests


a. Achievement tests: assess knowledge and skills in an area in which the student has received
instruction.
1. Speed tests: e.g., a timed typing test.
2. Power tests: e.g., a spelling test containing words of increasing difficulty.
b. Aptitude tests: assess knowledge and skills accumulated as the result of overall life
experiences.
1. Speed tests: e.g., a timed test whereby the test taker quickly scans groups of symbols
and marks symbols that meet predetermined criteria.
2. Power tests: e.g., a test of nonverbal reasoning and problem solving that requires the
test taker to solve problems of increasing difficulty.
c. Maximum performance tests are often classified as either objective or subjective. When the
scoring of a test does not rely on the subjective judgment of the individual scoring it, it is
said to be objective. If the scoring of a test does rely on subjective judgment, it is said to be
subjective.

II. Typical Response Tests


a. Objective personality tests: e.g., a test whereby the test taker answers true—false items
referring to personal beliefs and preferences.
b. Projective personality tests: e.g., a test whereby the test taker looks at an inkblot and
describes what he or she sees.


Types of Score Interpretations


Practically all tests produce scores that reflect or represent the performance of the individuals taking the tests. There are two fundamental approaches to understanding scores: the norm-referenced approach and the criterion-referenced approach. With norm-referenced score interpretations, an examinee’s performance is compared to the performance of other people. For example, if you say that a student scored better than 95% of his or her peers, this is a norm-referenced interpretation. The standardization sample serves as the reference group against which performance is judged. With criterion-referenced score interpretations, an examinee’s performance is compared to a specified level of performance; the emphasis is on what the examinees know or what they can do, not their standing relative to other people. One of the most common examples of criterion-referenced scoring is the percentage of correct responses on a classroom examination. For example, if you report that a student correctly answered 95% of the items on a classroom test, this is a criterion-referenced interpretation.
In addition to percent correct, another type of criterion-referenced interpretation is referred to as mastery testing. Mastery testing involves determining whether an examinee has reached a specified level of performance, typically defined by a cut score (AERA et al., 1999). For example, on a licensing exam for teachers the cut score might be 70%, and all examinees earning a score of 70% or greater will receive a designation of “pass.”

In other words, norm-referenced interpretations are relative (the standard is other people’s performance), whereas criterion-referenced interpretations are absolute (the standard is a specified level of performance). People often refer to norm-referenced and criterion-referenced tests, but this is not technically accurate. Actually, the terms norm-referenced and criterion-referenced refer to the interpretation of test scores. Although it is more common for tests to produce either norm-referenced or criterion-referenced scores, it is possible for a test to produce both norm- and criterion-referenced scores. Table 1.2 depicts salient information about norm- and criterion-referenced scores.

TABLE 1.2 Norm- and Criterion-Referenced Scores

Norm-referenced scores. Description: An examinee’s performance is compared to that of other people; interpretation is relative to other people. Example: An examinee earns a percentile rank score of 50, meaning that the examinee scored better than 50% of the individuals in the standardization sample.

Criterion-referenced scores. Description: An examinee’s performance is compared to a specified level of performance; interpretation is absolute (not relative). Examples: A student correctly answers 50% of the items on a test. On a licensing exam, an examinee obtains a score greater than the cut score and receives a passing score.
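The contrast between the two interpretations can also be seen in how the numbers are computed. The sketch below uses made-up scores and a simplified percentile-rank rule to derive a criterion-referenced percent correct, a cut-score (mastery) decision, and a norm-referenced percentile rank from the same raw score.

```python
# Hypothetical data contrasting criterion- and norm-referenced interpretations
# of the same raw score (percentile rank simplified: percent of the reference
# group scoring strictly below; published tests handle ties more carefully).

def percent_correct(raw_score, n_items):
    """Criterion-referenced: performance relative to the test content itself."""
    return 100.0 * raw_score / n_items

def passes_cut_score(raw_score, n_items, cut=70.0):
    """Criterion-referenced mastery decision against an absolute cut score."""
    return percent_correct(raw_score, n_items) >= cut

def percentile_rank(raw_score, norm_group_scores):
    """Norm-referenced: standing relative to a reference group."""
    below = sum(1 for s in norm_group_scores if s < raw_score)
    return 100.0 * below / len(norm_group_scores)

norm_group = [12, 14, 15, 15, 16, 17, 18, 18, 19, 20]  # made-up reference scores
raw = 18

print(percent_correct(raw, 20))             # 90.0 -> "answered 90% of the items"
print(passes_cut_score(raw, 20, cut=70.0))  # True -> "pass" on a 70% cut score
print(percentile_rank(raw, norm_group))     # 60.0 -> "scored better than 60% of the group"
```

Note that the single raw score of 18 out of 20 supports both statements: the student “answered 90% of the items” (criterion-referenced) and “scored better than 60% of the reference group” (norm-referenced).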

Assumptions of Educational Assessment

Now that we have introduced you to some of the basic concepts of educational assessment,
this is an opportune time to discuss some basic assumptions that underlie educational as-
sessment. These assumptions were adopted in part from Cohen and Swerdlik (2002), who
note, appropriately, that these assumptions actually represent a simplification of some very
complex issues. As you progress through this text, you will develop a better understanding
of these complex and interrelated issues.

Psychological and Educational Constructs Exist


In assessment terminology, a construct is simply the trait or characteristic that a test is designed to measure (AERA et al., 1999). For example, achievement is a construct that reflects an individual’s knowledge or accomplishments in areas in which they have received instruction (AERA et al., 1999). In schools we are often interested in measuring a number of constructs, such as a student’s intelligence, achievement in a specific content area, or attitude toward learning. This assumption simply acknowledges that constructs such as intelligence, achievement, or attitudes exist.

Psychological and Educational Constructs Can Be Measured


Cronbach (1990) notes that an old, often-quoted adage among measurement profession-
als goes “If a thing exists, it exists in some amount. If it exists in some amount, it can be
measured” (p. 34). If we accept the assumption that psychological constructs exist, the next
natural question is “Can these constructs be measured?” As you might predict, assessment
experts believe psychological and educational constructs can be measured.

Although We Can Measure Constructs, Our Measurement Is Not Perfect
Although assessment experts believe they can measure psychological constructs, they also
acknowledge the measurement process is not perfect. This is usually framed in terms of mea-
surement error and its effects on the reliability of scores. Some degree of error is inherent
in all measurement, and measurement error reduces the usefulness of measurement. As you will learn, assessment experts make considerable efforts to estimate and minimize the effects of measurement error.

There Are Different Ways to Measure Any Given Construct


As you will learn in this text, there are multiple approaches to measuring any given con-
struct. Consider the example of academic achievement. A student’s achievement in a spe-
cific area can be measured using a number of different approaches.
For example, a teacher might base a student’s grade in a course on a variety of components including traditional paper-and-pencil tests (e.g., multiple-choice, short-answer, and essay items), homework assignments, class projects, performance assessments, and portfolios. Although all of these different approaches typically are aimed at measuring the knowledge, skills, and abilities of students, each has its own unique characteristics.

All Assessment Procedures Have Strengths and Limitations


While acknowledging that there are a number of different approaches to measuring any con-
struct, assessment experts also acknowledge that all assessment procedures have their own
specific set of strengths and limitations. One assessment approach might produce highly
reliable scores, but not measure some aspects of a construct as well as another approach,
which produces less reliable scores. As a result, it is important that test users understand
the specific strengths and weaknesses of the procedures they use. The relatively simple idea
that professionals should be aware of the limitations of their assessment procedures and the
information obtained from them is a key issue in ethical assessment practice (e.g., Cohen
& Swerdlik, 2002).

Multiple Sources of Information Should Be Part of the Assessment Process

Given that there are different approaches to measuring any given construct and that each approach has its own strengths and weaknesses, it only follows that assessment should incorporate information from different approaches. Important decisions should not be based on the result of a single test or other assessment procedure. For example, when deciding which applicants should be admitted to a college or university, information such as performance on an admissions test (e.g., SAT or ACT), high school grade point average (GPA), letters of recommendation, evidence of extracurricular activities, and a written statement of purpose should be considered. It would be inappropriate to base this decision on any one source of information.

Performance on Tests Can Be Generalized to Nontest Behaviors
Typically when we give a test we are not simply interested in the individual’s performance
on the test, but in the ability to generalize from test performance to nontest behaviors. For
example, it is not an individual’s score on the SAT that is in itself important to a college
admissions officer, but the fact that the score can be used to help predict performance in
college. The same applies to a 20-item test measuring multiplication skills. It is not really
the student’s ability to answer those specific multiplication problems correctly that is of
primary importance, but that the performance on those 20 items reflects the ability to perform multiplication problems in general. This assumption holds that test performance is important, not in and of itself, but because of what it tells us about the test taker’s standing
on the measured construct or ability to perform certain tasks or jobs.

Assessment Can Provide Information That Helps Educators Make Better Educational Decisions

The widespread use of assessment procedures in educational settings is based on the premise
that the information obtained from assessment procedures can help educators make better
decisions. These decisions range from the specific grade a student should receive in a course
to the effectiveness of the curriculum used in a state or school district.

Assessments Can Be Conducted in a Fair Manner

Although many critics of testing might argue against this assumption, contemporary assessment
experts spend considerable time and energy developing instruments that, when administered and
interpreted according to guidelines, are fair and minimize bias. Nevertheless, tests can be
used inappropriately, and when they are it discredits or stigmatizes assessment procedures in
general. However, in such circumstances the culprit is the person using the test, not the test
itself. At times, people criticize assessments because they do not like the results obtained.
In many instances, this is akin to “killing the messenger.”

Testing and Assessment Can Benefit Our Educational Institutions and Society as a Whole

Although many people might initially argue that the elimination of all tests would be a posi-
tive event, on closer examination most will agree that tests and other assessment procedures
make significant contributions to education and society as a whole. Consider a world with-
out tests. People would be able to present themselves as surgeons without ever having their
ability to perform surgery competently assessed. People would be given drivers’ licenses
without having their ability to drive assessed. Airline pilots would be flying commercial jets
without having to demonstrate their competence as pilots. All of these examples should give
you reasons to consider the value of tests. Although it is typically not a matter of life and
death, the use of tests in schools also has important implications. How comfortable would
you be if your instructors simply assigned your grades based exclusively on their subjective
impressions of you? In this situation it is likely that each instructor’s personal biases and
preferences would play important roles in determining one’s grades. If the instructor felt
you were a “good student,” you would likely receive a good grade. However, if the instructor
had a negative impression of you for any reason, you might not be so lucky. Most people
would prefer to be evaluated based on their demonstrated skills and abilities rather than on
subjective judgment. The same principle applies to admissions decisions made by universi-
ties. Without tests admission officers might make arbitrary decisions based solely on their
personal likes and dislikes. In fact, the SAT was developed to increase the objectivity of
college admissions, which in the first quarter of the twentieth century depended primarily
on family status. When used appropriately tests can provide objective information that is
free from personal biases and other subjective influences.
These assumptions are listed in Table 1.3. As we noted, these seemingly simple as-
sumptions represent some complex and controversial issues, and there is considerable debate
regarding the pros and cons of testing and assessment. Many of the controversies surround-
ing the use of tests are the results of misunderstandings and misuses of tests. As noted in
assumption 3 in Table 1.3, tests and all other assessment procedures contain some degree
of
measurement error. Tests are not perfect and they should not be interpreted as if they were
perfect. However, this limitation is not limited to psychological and educational measure-
ment; all measurement is subject to error. Chemistry, physics, and engineering all struggle
with imperfect, error-laden measurement that is always, to some extent, limiting advance-
ment in the disciplines. An example most of us can relate to involves the medical profession.
There is error in medical assessment procedures such as blood pressure tests or tests of blood
cholesterol level, but they still provide useful information. The same is true of educational
assessment procedures. They are not perfect, but they still provide useful information.

TABLE 1.3 Assumptions of Educational Assessment

1. Psychological and educational constructs exist.
2. Psychological and educational constructs can be measured.
3. Although we can measure constructs, our measurement is not perfect.
4. There are different ways to measure any given construct.
5. All assessment procedures have strengths and limitations.
6. Multiple sources of information should be part of the assessment process.
7. Performance on tests can be generalized to nontest behaviors.
8. Assessment can provide information that helps educators make better educational decisions.
9. Assessments can be conducted in a fair manner.
10. Testing and assessment can benefit our educational institutions and society as a whole.

While you probably will not hear anyone proclaim that there should be a ban on the use of
medical tests, you will hear critics of educational and psychological testing call for a ban on, or at
least a significant reduction in, the use of tests. Although educational tests are not perfect (and
never will be), testing experts spend considerable time and effort studying the measurement
characteristics of tests. This process allows us to determine how accurate and reliable tests
are, can provide guidelines for their appropriate interpretation and use, and can result in the
development of more accurate assessment procedures (e.g., Friedenberg, 1995).
Assumption 9 in Table 1.3 suggests that tests can be used in a fair manner. Many people
criticize tests, claiming that they are biased, unfair, and discriminatory against certain groups
of people. Although it is probably accurate to say that no test is perfectly fair to all examinees,
neither is any other approach to selecting, classifying, or evaluating people. The majority of
professionally developed tests are carefully constructed and scrutinized to minimize bias, and
when used properly actually promote fairness and equality. In fact, it is probably safe to say
that well-made tests that are appropriately administered and interpreted are among the most
equitable methods of evaluating people. Nevertheless, the improper use of tests can result in
considerable harm to individual test takers, institutions, and society (AERA et al., 1999).

Participants in the Assessment Process


A large number of individuals are involved in different aspects of the assessment process.
Brief descriptions follow of some of the major participants in the assessment process (e.g.,
AERA et al., 1999).

People Who Develop Tests


Can you guess how many new tests are developed in a given year? Although the exact number is
unknown, it is probably much larger than you might imagine. The American Psychological
Association (1993) estimated that up to 20,000 new psychological, behavioral, and cognitive
tests are developed every year. This number includes tests published by commercial test
publishers, tests developed by professionals hoping to have their instruments published, and
tests developed by researchers to address specific research questions. However, even this rather
daunting figure does not include the vast number of tests developed by classroom teachers to
assess the achievement or progress of their students. There are minimal standards that all of
these tests should meet, whether they are developed by an assessment professional, a gradu-
ate student completing a thesis, or a teacher assessing the math skills of 3rd graders. To pro-
vide standards for the development and use of psychological and educational tests and other
assessment procedures, numerous professional organizations have developed guidelines. The most
influential and comprehensive set of guidelines is the Standards for Educational and
Psychological Testing, published by the American Educational Research Association, the American
Psychological Association, and the National Council on Measurement in Education (1999). We have
referenced this document numerous times earlier in this chapter and will continue to do so
throughout this text.

People Who Use Tests


The list of people who use tests includes those who select, administer, score, interpret, and
use the results of tests and other assessment procedures. Tests are utilized in a wide span of
settings by a wide range of individuals. For example, teachers use tests in schools to assess
their students’ academic progress. Psychologists and counselors use tests to understand
their clients better and to help refine their diagnostic impressions. Employers use tests to
help select and hire skilled employees. States use tests to determine who will be given driv-
ers’ licenses. Professional licensing boards use tests to determine who has the knowledge
and skills necessary to enter professions ranging from medicine to real estate. This is only
a small sampling of the many settings in which tests are used. As with the development of
tests, some of the people using these tests are assessment experts whose primary responsi-
bility is administering, scoring, and interpreting tests. However, many of the people using
tests are trained in other professional areas, and assessment is not their primary area of
training. As with test development, the administration, scoring, and interpretation of tests
involve professional and ethical standards and responsibilities. In addition to the Standards
for Educational and Psychological Testing (AERA et al., 1999) already mentioned, The
Student Evaluation Standards (JCSEE, 2003), Code of Professional Responsibilities in Ed-
ucational Measurement (NCME, 1995), and Code of Fair Testing Practices in Education
(JCTP, 1988) provide guidelines for the ethical and responsible use of tests. These last three
documents are included in Appendixes A, B, and C, respectively.

People Who Take Tests


We have all been in this category at many times in our life. In public school we take an untold
number of tests to help our teachers evaluate our academic progress, knowledge, and skills. You
probably took the SAT or ACT test to gain admission to college. When you graduate from college
and are ready to obtain a teacher's license or certificate, you will probably be given another
test to evaluate how well prepared
you are to enter the teaching profession. While the other participants in the assessment process
have professional and ethical responsibilities, test takers have a number of rights. The Joint
Committee on Testing Practices (JCTP, 1998) notes that the most fundamental right test takers
have is to be tested with tests that meet high professional standards and that are valid for the
intended purposes. Other rights of test takers include the following:

■ Test takers should be given information about the purposes of the testing, how the results
  will be used, who will receive the results, the availability of information regarding
  accommodations available for individuals with disabilities or language differences, and any
  costs associated with the testing.
■ Test takers have the right to be treated with courtesy, respect, and impartiality.
■ Test takers have the right to have tests administered and interpreted by adequately trained
  individuals who follow professional ethics codes.
■ Test takers have the right to receive information about their test results.
■ Test takers have the right to have their test results kept confidential.

Appendix D contains the Joint Committee on Testing Practices’ Rights and Respon-
sibilities of Test Takers: Guidelines and Expectations.

Other People Involved in the Assessment Process


Although the preceding three categories probably encompass most participants in the assess-
ment process, they are not exhaustive. For example, there are individuals who market and sell
assessment products and services, those who teach others about assessment practices, and those
who conduct research on assessment procedures and evaluate assessment programs
(NCME, 1995).

Educational Assessment and the Law


Educational assessment, like most aspects of public education, is governed by a number of laws,
whose impact should not be underestimated. Their impact on the classroom teacher is felt every
day—on what teachers teach, how they teach it, and how teachers assess the results of their
instruction. Below are brief descriptions of a few of the major federal laws that influence the
way educational assessments and their results are used in the schools. It is important for
teachers and other school personnel to be familiar with the laws that govern public education.
It appears that the trend is toward more, rather than less, governmental oversight of public
education.

No Child Left Behind Act of 2001 (NCLB)


The Elementary and Secondary Education Act of 1965 (ESEA) was one of the first federal laws to
focus on education. While the federal government recognizes that education is primarily the
responsibility of the individual states, it holds that the federal government has a
responsibility to ensure that an adequate level of educational services is being provided by
all states (Jacob & Hartshorne, 2007). The No Child Left Behind Act of 2001 is the most
recent reauthorization of the ESEA. Major themes of this act include the following:

■ Increased state accountability. NCLB requires that each state develop rigorous aca-
demic standards and implement annual assessments to monitor the performance of districts
and schools. It requires that these assessments meet professional standards for reliability
and validity and requires that states achieve academic proficiency for all students within
12 years. To ensure that no group of children is neglected, the act requires that states and
districts assess all students in their programs, including those with disabilities and limited
English proficiency. However, the act does allow 3% of all students to be given alternative
assessments. Alternative assessments are defined as instruments specifically designed for
students with disabilities that preclude standard assessment.
■ More parental choice. The act allows parents with children in schools that do not
demonstrate adequate annual progress toward academic goals to move their children to
other, better performing schools.
■ Greater flexibility for states. A goal of NCLB is to give states increased flexibility in
the use of federal funds in exchange for increased accountability for academic results.

■ Reading First initiative. A goal of the NCLB Act is to ensure that every student can
read by the end of grade 3. To this end, the Reading First initiative significantly increased
federal funding of empirically based reading instruction programs in the early grades.

While NCLB received broad support when initiated, it has been the target of increasing
criticism in recent years. The act’s focus on increased accountability and statewide assess-
ment programs typically receives the greatest criticism. For example, it is common to hear
teachers and school administrators complain about “teaching to the test” when discussing
the impact of statewide assessment programs. Special Interest Topic 1.1 describes some
current views being voiced by proponents and opponents of NCLB.

Individuals with Disabilities Education Improvement Act of 2004 (IDEA 2004)

The Education of All Handicapped Children Act of 1975 (EAHCA) was the original law requiring
that all children with disabilities be given a free appropriate public education (FAPE). Of the
estimated more than eight million children with disabilities at that time, over half were not
receiving an appropriate public education and as many as one million were not receiving a
public education at all (Jacob & Hartshorne, 2007).

SPECIAL INTEREST TOPIC 1.1

NCLB—The Good and the Bad!


While the No Child Left Behind Act (NCLB) passed with broad bipartisan and public support, more
and more criticism has been directed at it in recent years by lawmakers, professional groups, teach-
ers, and others. For example, the National Education Association (NEA), the nation’s largest teacher
union, has criticized the NCLB Act, maintaining that it forces teachers to devote too much time
preparing students for standardized tests at the expense of other, more desirable instructional activi-
ties. Many critics are also calling for more flexibility in the way states implement the NCLB Act’s
accountability requirements, particularly the requirement that students with disabilities be included
in state assessment programs. These critics say that decisions about how students with disabilities
are tested should be left to local professionals working directly with those students. It is safe to say
the honeymoon period is over for NCLB.
But the NCLB Act does have its supporters. For example, advocacy groups for individuals with
disabilities maintain that the NCLB Act has focused much needed attention on the achievement of
students with disabilities. As currently implemented, the NCLB Act requires that most students
with dis-
abilities be tested and their achievement monitored along with their peers without disabilities. These
advocates fear that if the law is changed, the high achievement standards for students with disabilities
will be lost. They note that approximately 30% of students with disabilities are currently exempted
from state assessment programs and they fear that if states are given greater control over who is tested,
even more students with disabilities will be excluded and largely ignored (Samuels, 2007).

The Individuals with Disabilities Education Improvement Act of 2004 (commonly abbreviated as
IDEA 2004 or simply IDEA), as the most current reauthorization of the EAHCA, designates 13
disability categories (e.g., mental retardation, visual or hearing impairment, specific
learning disabilities, emotional disturbance) and provides funds to states and school districts
that meet the requirements of the
law. IDEA provides guidelines for conducting evaluations of students suspected of having a
disability. Students who qualify under IDEA have an individualized educational program
(IEP) developed specifically for them that designates the special services and modifica-
tions to instruction and assessment that they must receive. Possibly most important for
regular education teachers is the mandate for students with disabilities to receive instruction
in the “least restrictive environment,” a movement referred to as mainstreaming. In
application, this means that most students with disabilities receive educational services in
the regular education classroom. As a result, more regular education teachers are involved in
the education of students with disabilities and are required to implement the educational
modifications specified in their students’ IEPs, including modifications in both instructional
strategies and assessment practices. More information on IDEA 2004 can be found online at
http://idea.ed.gov.

Section 504 of the Rehabilitation Act of 1973 (Section 504)


Section 504 requires that public schools offer students with disabilities reasonable
accommodations to meet their specific educational needs.

SPECIAL INTEREST TOPIC 1.2

Decline in the Number of 504-Only Students?

Jacob and Hartshorne (2007) note that while Section 504 of the Rehabilitation Act was passed in
1973, the law was not widely applied in the school setting until the late 1980s. Since that time, how-
ever, it has had a substantial impact on public education. In looking to the future, the authors predict
that Section 504 will be used less frequently to obtain accommodations for students with learning
and behavioral problems. Their prediction is based on the following considerations:

■ In the past, Section 504 was often used to ensure educational accommodations for students
with attention deficit hyperactivity disorder (ADHD). In 1997, IDEA specifically identi-
fied ADHD as a health condition qualifying for special education services. As a result,
more students with ADHD will likely receive services through IDEA and fewer under
Section 504.
■ IDEA 2004 permits school districts to spend up to 15% of special education funds on early
intervention programs. These programs are intended to help students that need specialized
educational services but who have not been identified as having a disability specified in
IDEA. Again, this will likely decrease the need for Section 504 accommodations.
■ IDEA 2004 has new regulations for identifying children with specific learning disabilities
that no longer require the presence of a severe discrepancy between aptitude and achieve-
ment. As a result, students that in the past did not qualify for services under IDEA may now
qualify. Again, this will likely reduce the need to qualify students under Section 504.
■ Legal experts and school administrators have raised concerns about widespread over-
identification of disabled students under Section 504. Some instances of abuse result
from well-meaning educators trying to help children with academic problems, but no true
disabilities, obtain special accommodations. Other instances are more self-serving. For
example, some schools have falsely identified children under Section 504 so they can be
given assessment accommodations that might allow them to perform better on high-stakes
accountability assessments.

At this point the future of Section 504 remains unclear. Jacob and Hartshorne (2007) make
a compelling case that coming years might see a reduction in the number of students under Section
504. However, time will tell.

More specifically, schools cannot exclude students with disabilities from any activities or
programs based on their disability, and schools must make reasonable accommodations to ensure
that students with disabilities have an equal opportunity to benefit from those activities or
programs (Jacob & Hartshorne, 2007). Section 504 differs from IDEA in several important ways.
First, it defines a handicap or disability very broadly, much more broadly than IDEA.
Therefore, a child may not qualify for services under IDEA but qualify under Section 504.
Second, Section 504 is an antidiscrimination act, not a grant program like IDEA. In terms of
the assessment of disabilities, Section 504 provides less specific guidance than IDEA. Similar
to IDEA, students qualified under Section 504 may receive modifications to the instruction and
assessments implemented in the class-
rooms. In recent years there has been a rapid expansion in the number of students receiving
accommodations under Section 504. However, Special Interest Topic 1.2 describes some
recent events that might reduce the number of students served under Section 504.

Protection of Pupil Rights Act (PPRA)


The Protection of Pupil Rights Act holds that students may not be required, without prior
consent, to complete surveys or other assessments funded by the Department of Education
(DOE) that elicit sensitive information (e.g., information about political affiliation, mental
problems, sexual behavior). It also requires that local education agencies (LEAs) notify
parents when the school is going to administer a survey eliciting sensitive information and
give the parents an opportunity to examine the survey. The parents then have the opportunity
to exclude their child from the survey or assessment.

Family Educational Rights and Privacy Act (FERPA)


Also known as the Buckley Amendment, the Family Educational Rights and Privacy
Act (FERPA) protects the privacy of students and regulates access to educational records.
Educational records are defined very broadly, essentially any record maintained by a school
having to do with a student. It allows parents largely unrestricted access to their child’s
school records, but requires that parents give written consent for other, non-school person-
nel to view the records. As noted, FERPA applies to all student records, including those
containing assessment results.

Common Applications of Educational Assessments


Now that we have introduced you to some of the basic terminology, assumptions, and types
of individuals involved in educational testing and assessment, as well as a brief legislative
history, we will explain further why testing and assessment play such prominent roles in
educational settings. Tests and assessments have many uses in educational settings, but un-
derlying practically all of these uses is the belief that tests can provide valuable information
that facilitates student learning and helps educators make better decisions. It would be
difficult, if not impossible, to provide a comprehensive listing of all the educational
applications of tests and other assessment procedures, so what follows is a listing of the
prominent uses commonly identified in the literature (e.g., AFT, NCME, & NEA, 1990; Gronlund,
1998, 2003; Nitko, 2001; Popham, 2000).

Student Evaluations
The appropriate use of tests and other assessment procedures allows educators to monitor the
progress of their students. In this context, probably the most common use of educational
assessments involves assigning grades to students to reflect their academic progress or
achievement. This type of evaluation is typically referred to as summative evaluation.
Summative evaluation involves the determination of the value or quality of an outcome. In the
class-
room, summative evaluation typically involves the formal evaluation of student performance,
commonly taking the form of a numerical or letter grade (e.g., A, B, C, D, or F). Summative
evaluation is often designed to communicate information about student progress, strengths,
and weaknesses to parents and other involved adults. Another prominent application of stu-
dent assessments is to provide specific feedback to students in order to facilitate or guide
their learning. Optimally, students need to know both what they have and have not mastered.
This type of feedback serves to facilitate and guide learning activities and can help motivate
students. It is often very frustrating to students to receive a score on
an assignment without also receiving feedback about what they can do to improve their
performance in the future. This type of evaluation is referred to as formative evaluation,
which involves activities designed to provide feedback to students.

Instructional Decisions
Educational assessments also can provide important information that helps teachers ad-
just and enhance their teaching practices. For example, assessment information can help
teachers determine what to teach, how to teach it, and how effective
their instruction has been. Gronlund (2003) delineated a number of ways in which assessment can
be used to enhance instructional decisions. For example, in terms of providing information
about what to teach, teachers should routinely assess the skills and knowledge that students
bring to their classroom in order to establish appropriate learning objectives (sometimes
referred to as “sizing up” students). Teachers do not want to spend an excessive amount of time
covering material that the students have already mastered, nor do they want to introduce
material for which the students are ill prepared. In addition to decisions about the content of
instruction, student assessments can help teachers tailor learning activities to match the
individual strengths and weaknesses of their students. Understanding the cognitive strengths
and weaknesses of students facilitates this process, and certain diagnostic tests provide
precisely this type of information. This type of assessment is frequently referred to as
diagnostic assessment. Finally, educational assessment can (and should) provide feedback to
teachers about how effective their instructional practices are. Teachers can use assessment
information to determine whether the learning objectives were reasonable, which instructional
activities were effective, and which activities need to be modified or abandoned.

Selection, Placement, and Classification Decisions


The terms selection, placement, and classification are often used interchangeably, but
tech-
nically they have different meanings. Nitko (2001) notes that selection refers to decisions by
a school, college, or other institution to accept or reject a student. With selection, some
individuals are accepted while others are rejected. A common example of selection involves
universities making admissions decisions. In this situation, some applicants are rejected and
are no longer a concern of the university. In contrast, placement refers to decisions about
where to place students within an institution or program. With placement, all students are
placed and there are no actual rejections. For example, if all the students in a secondary
school are assigned to one of three instructional programs (e.g., remedial, regular, and
honors), this is a placement decision. Classification involves assigning individuals to groups
or categories. For example, special education students may be classified as learning disabled,
emotionally disturbed, speech handicapped, or some other category of handi-
capping conditions, but these categories are not ordered in any particular manner; they are
simply descriptive. Psychological and educational tests often provide important diagnostic
information that is used when making classification decisions. In summary, although selec-
tion, placement, and classification decisions are technically different, educational tests and
assessments provide valuable information that can help educators make better decisions.

Policy Decisions
We use the category of “policy decisions” to represent a wide range of administrative deci-
sions made at the school, district, state, or national level. These decisions involve issues such
as evaluating the curriculum and instructional materials employed,
determining which programs to fund, and even deciding which employees receive merit raises
and/or promotions. We are currently in an era of increased accountability in which parents and
politicians are setting higher standards for students and schools, and there is a
national trend to base many administrative policies and decisions on information garnered
from state or national assessment programs.

Counseling and Guidance Decisions


Educational assessments can also provide information that promotes self-understanding
and helps students plan for the future. For example, parents and students can use assess-
ment information to make educational plans and select careers that best match a student’s
abilities and interests.
Although this listing of common applications of testing and assessment is clearly not
exhaustive, it should give you an idea of some of the most important applications of assessment
procedures. Again, we want to emphasize that instruction and assessment are two important and
integrated aspects of the teaching process. Table 1.4 outlines these major applications of
assessment in education.

What Teachers Need to Know about Assessment


So far in this chapter we have discussed the central concepts related to educational assessment
and some of the many applications of assessment in today’s schools. We will now elaborate on
what teachers need to know about educational testing and assessment.

TABLE 1.4 Common Applications of Educational Assessments

Type of Application                        Examples
Student evaluations                        Summative evaluation (e.g., assigning grades)
                                           Formative evaluation (e.g., providing feedback)
Instructional decisions                    Placement assessment (e.g., sizing up)
                                           Diagnostic assessment (detecting cognitive strengths and weaknesses)
                                           Feedback on effectiveness of instruction
Selection, placement, and                  College admission decisions
  classification decisions                 Assigning students to remedial, regular, or honors programs
                                           Determining eligibility for special education services
Policy decisions                           Evaluating curriculum and instructional practices
Counseling and guidance decisions          Promote self-understanding and help students plan for the future

First we want to emphasize our recognition that most teachers will not make psychometrics their focus of
study. However, because assessment plays such a prominent role in schools and teachers
devote so much of their time to assessment-related activities, there are some basic competen-
cies that all teachers should master. In fact in 1990 the American Federation of Teachers, the
National Council on Measurement in Education, and the National Education Association col-
laborated to develop a document titled Standards for Teacher Competence in Educational As-
sessment of Students. In the following section we will briefly review these competencies (this
document is reproduced in its entirety in Appendix E). Where appropriate, we will identify
which chapters in this text are most closely linked to specific competencies.

Teachers Should Be Proficient in Selecting Professionally Developed Assessment Procedures Appropriate for Making Instructional Decisions

This requires that teachers be familiar with the wide range of assessment procedures available
for use in schools and the type of informa-
tion the different procedures provide (addressed primarily in Chapters 1, 12, 13, and 14).
To evaluate the technical merits of tests, teachers need to be familiar with the concepts of
reliability and validity and be able to make evaluative decisions about the quality and suit-
ability of different assessment procedures (addressed primarily in Chapters 4, 5, 6, and 17).
In order to make informed decisions about the quality of assessment procedures, teachers
need to be able to locate, interpret, and use technical information and critical reviews of
professionally developed tests (addressed primarily in Chapter 17).

Teachers Should Be Proficient in Developing Assessment Procedures Appropriate for Making Instructional Decisions

In addition to being able to select among the professionally developed assessment procedures
that are available, teachers must also be able to develop their own assessment procedures. In
fact, the vast majority of the assessment information teachers collect and use on a daily basis
comes from teacher-made tests. As a result, teachers need to be proficient in planning,
developing, and using classroom tests. To accomplish this, teachers must be familiar with the
principles and standards for developing a wide range of assessment techniques including
selected-response items, constructed-response items, performance assessments, and port-
folios (addressed primarily in Chapters 7, 8, 9, and 10). Teachers must also be able to
evaluate the technical quality of the instruments they develop (addressed primarily in
Chapters 4, 5, and 6).

Teachers Should Be Proficient in Administering, Scoring, and Interpreting Professionally Developed and Teacher-Made Assessment Procedures

In addition to being able to select and develop good assessment procedures, teachers must
be able to use them appropriately. Teachers need to understand the principles of standardiza-
tion and be prepared to administer tests in a standardized manner (addressed primarily in
Chapters 3 and 12). They should be able to reliably and accurately
score a wide range of assessment procedures including selected-response items,
constructed-response items, performance assessments, and portfolios (addressed primarily in
Chapters 8, 9, and 10). Teachers need to be able to interpret the scores reported on
standardized assessment procedures such as percentile ranks and standard scores
(addressed primarily in Chapter 3). The proper interpretation of scores also requires that
teachers have a practical knowledge of basic statistical (e.g., measures of central tendency,
dispersion, correlation) and psychometric concepts (e.g., reliability, errors of measurement,
validity) (addressed primarily in Chapters 2, 4, and 5).

Teachers Should Be Proficient in Using Assessment Results When Making Educational Decisions

As we have noted, assessment results are used to make a wide range of consequential educational
decisions (e.g., student evaluations, instructional planning, curriculum development, and
educational policies). Because teachers play such a pivotal role in using
velopment, and educational policies). Because teachers play such a pivotal role in using
assessment information in the schools, they must be able to interpret assessment results
accurately and use them appropriately. They need to understand the concepts of reliability
and validity and be prepared to interpret test results in an appropriately cautious manner.
Teachers should understand and be able to describe the implications and limitations of
assessment results and use them to enhance the education of their students and society in
general (addressed primarily in Chapters 1, 4, 5, and 11).

Teachers Should Be Proficient in Developing Valid Grading Procedures That Incorporate Assessment Information

Assigning grades to students is an important aspect of teaching. Teachers must be able to de-
velop and apply fair and valid procedures for assigning grades based on the performance of
students on tests, homework assignments, class projects, and other assessment procedures
(addressed primarily in Chapters 11 and 15).

Teachers Should Be Proficient in Communicating Assessment Results

Teachers are routinely called on to interpret and report assessment results to students, par-
ents, and other invested individuals. As a result, teachers must be able to use assessment
terminology correctly, understand different score formats, and explain the meaning and
implications of assessment results. Teachers must be able to explain, and often defend, their
own assessment and grading practices (addressed primarily in Chapters 1, 11, and 15). They
should be able to describe the strengths and limitations of different assessment methods (ad-
dressed primarily in Chapters 8, 9, and 10). In addition to explaining the results of their own
classroom assessments, they must be able to explain the results of professionally developed
standardized tests (addressed primarily in Chapters 12, 13, and 14).

Teachers Should Be Proficient in Recognizing Unethical, Illegal, and Other Inappropriate Uses of Assessment Procedures or Information

It is essential that teachers be familiar with the ethical codes and laws that apply to educa-
tional assessment practices. Teachers must ensure that their assessment practices are con-
sistent with these professional ethical and legal standards, and if they
become aware of inappropriate assessment practices by other professionals they should take
steps to correct the situation (addressed primarily in Chapters 15, 16, and 17). Table 1.5
outlines these standards for teacher competence in educational assessments.

TABLE 1.5 Teacher Competencies in Educational Assessment

Teachers should be proficient in the following:

1. Selecting professionally developed assessment procedures appropriate for making instructional decisions.
2. Developing assessment procedures that are appropriate for making instructional decisions.
3. Administering, scoring, and interpreting professionally developed and teacher-made assessment procedures.
4. Using assessment results when making educational decisions.
5. Developing valid grading procedures that incorporate assessment information.
6. Communicating assessment results.
7. Recognizing unethical, illegal, and other inappropriate uses of assessment procedures or information.

Educational Assessment in the Twenty-First Century

The field of educational assessment is dynamic and continuously evolving. There are some as-
pects of the profession that have been stable for many years. For example, classical test theory
(discussed in some detail in Chapter 4) has been around for almost a century and is still very
influential today. However, many aspects of educational assessment are almost constantly
evolving as the result of a number of external and internal factors. Some of these changes are
the result of theoretical or technical advances, some reflect philosophical changes within the
profession, and some are the result of external societal or political influences. It is important
for assessment professionals to stay informed regarding new developments in the field and
to consider them with an open mind. To illustrate some of the developments the profession
is dealing with today, we will briefly highlight a few contemporary trends that are likely to
continue to impact assessment practices as you enter the teaching profession.

Computerized Adaptive Testing (CAT) and Other Technological Advances

The widespread availability of fairly sophisticated and powerful personal computers has
had a significant impact on many aspects of our society, and the field of assessment is no
exception. One of the most dramatic and innovative uses of computer technology has been
the emergence of computerized adaptive testing (CAT). In CAT the test taker is initially
given an item that is of medium difficulty. If the test taker correctly responds to that item,
the computer selects and administers a slightly more difficult item. If the examinee misses
the initial item, the computer selects a somewhat easier item. As the testing proceeds the
computer continues to select items on the basis of the test taker’s performance on previous
items. CAT continues until a specified level of precision is reached. Research suggests that
CAT can produce the same levels of reliability and validity as conventional paper-and-pencil
tests, but because it requires the administration of fewer test items, assessment efficiency
can be enhanced (e.g., Weiss, 1982, 1985, 1995).
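To make the item-selection logic concrete, here is a minimal sketch of a CAT loop. It is not
drawn from any particular testing program; it assumes a simple Rasch (one-parameter logistic)
item model, selects whichever unused item is closest in difficulty to the current ability
estimate, re-estimates ability by maximum likelihood after each response, and stops when the
standard error falls below a target. The names (adaptive_test, item_bank, get_response) and the
numerical settings are illustrative only.

import math
import random

def probability_correct(ability, difficulty):
    """Rasch (one-parameter logistic) model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def adaptive_test(item_bank, get_response, max_items=20, target_se=0.35):
    """Minimal CAT loop: administer the most informative remaining item,
    update the ability estimate, and stop once the estimate is precise enough."""
    ability = 0.0                      # start near an item of medium difficulty
    remaining = list(item_bank)        # item difficulties still available
    administered, responses = [], []

    while remaining and len(administered) < max_items:
        # The closest-difficulty item is the most informative under the Rasch model.
        item = min(remaining, key=lambda b: abs(b - ability))
        remaining.remove(item)
        administered.append(item)
        responses.append(bool(get_response(item)))   # True means a correct answer

        # Re-estimate ability with a few Newton-Raphson steps on the likelihood.
        information = 0.0
        for _ in range(5):
            gradient, information = 0.0, 0.0
            for b, r in zip(administered, responses):
                p = probability_correct(ability, b)
                gradient += (1.0 if r else 0.0) - p
                information += p * (1.0 - p)
            ability += gradient / information
            ability = max(-4.0, min(4.0, ability))    # keep the estimate bounded

        # Stop when the standard error of measurement reaches the target precision.
        if 1.0 / math.sqrt(information) <= target_se:
            break

    return ability, administered

# Hypothetical usage: simulate an examinee whose true ability is 1.0.
random.seed(0)
bank = [d / 4.0 for d in range(-12, 13)]              # difficulties from -3.0 to +3.0
answer = lambda difficulty: random.random() < probability_correct(1.0, difficulty)
estimate, items_used = adaptive_test(bank, answer)
print(f"Estimated ability {estimate:.2f} after {len(items_used)} items")

Under these assumptions the loop typically reaches its stopping rule well before the item cap,
which is the kind of efficiency gain the research cited above describes.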
CAT is not the only innovative application of computer technology in the field of as-
sessment. Some of the most promising applications of technology in assessment involve the
use of technology to present problem simulations that cannot be realistically addressed with
paper-and-pencil tests. For example, flight-training programs routinely use sophisticated
flight simulators to assess the skills of pilots. This technology allows programs to assess how
pilots will handle emergency and other low-incidence situations, assessing skills that were
previously difficult if not impossible to assess accurately. Another innovative use of technol-
ogy is the commercially available instrumental music assessment systems that allow students
to perform musical pieces and have their performances analyzed and graded in terms of pitch
and rhythm. Online versions of these programs allow students to practice at home and have
their performance results forwarded to their instructors at school. Although it is difficult to
anticipate the many ways technology will change assessment practices in the twenty-first
century, it is safe to say that they will be dramatic and sweeping. Special Interest Topic 1.3
provides information on the growing use of technology to enhance assessment in contempo-
rary schools.

“Authentic” or Complex-Performance Assessments


Although advances in technology are driving some of the current trends in assessment, others
are the result of philosophical changes among members of the assessment profession. This is
exemplified in the current emphasis on performance assessments and portfolios in education.
Performance assessments and portfolios are not new creations, but have been around for many
years (e.g., performance assessments have been used in industrial and organizational psychol-
ogy for decades). However, the use of performance assessments and portfolios in schools has
increased appreciably in recent years. Traditional testing formats, particularly multiple-choice


SPECIAL INTEREST TOPIC 1.3


Technology and Assessment in the Schools

According to a report in Education Week (May 8, 2003), computer- and Web-based assessments are
starting to find strong support in the schools. For example, the No Child Left Behind Act of 2001,
which requires states to test all students in the 3rd through 8th grades in reading and mathematics
every year, has caused states to start looking for more efficient and economical forms of assessment.
Assessment companies believe they have the answer: switch to computer or online assessments.
Although the cost of developing a computerized test is comparable to that of a traditional paper-and-
pencil test, once the test is developed the computer test is far less expensive. Some experts estimate
that computerized tests can be administered for as little as 25% of the cost of a paper-and-pencil test.
Another positive feature of computer-based assessment is that the results can often be available in a
few days as opposed to the months educators and students are used to waiting.
Another area in which technology is having a positive impact on educational assessment prac-
tices involves preparing students for tests. More and more states and school districts are developing
online test-preparation programs to help students improve their performance on high-stakes assess-
ments. The initial results are promising. For example, a pilot program in Houston, Texas, found that
75% of the high school students who had initially failed the mandatory state assessment improved
their reading scores by 29% after using a computer-based test-preparation program. In addition
to
being effective, these computer-based programs are considerably less expensive for the school
dis-
tricts than face-to-face test-preparation courses.
While it is too early to draw any firm conclusions about the impact of technology on school
assessment practices, the early results are very promising. It is likely that by the year 2010
school-
based assessments will be very different than they are today. This is an exciting time to work
in the
field of educational assessment!

Traditional testing formats, particularly multiple-choice and other selected-response formats
(e.g., true—false, matching), have always had their crit-
ics, but their opposition has become more vocal in recent years. Opponents of traditional test
formats complain that they emphasize rote memorization and other low-level cognitive skills
and largely neglect higher-order conceptual and problem-solving skills. To address these and
related shortcomings, many educational assessment experts have promoted the use of more
“authentic” or complex-performance assessments, typically in the form of performance as-
sessments and portfolios. Performance assessments require test takers to complete a process
or produce a product in a context that closely resembles real-life situations. For example, a
medical student might be required to interview a mock patient, select tests and other assess-
ment procedures, arrive at a diagnosis, and develop a treatment plan (AERA et al., 1999).
Portfolios, another type of performance assessment, involve the systematic collection of an
individual’s work products over time (AERA
et al., 1999). Artists, architects, writers, and others have long used portfolios to represent their
work, and in the last decade portfolios have become increasingly popular in the assessment of
students. Although performance assessments have their own set of strengths and weaknesses,
they do represent a significant addition to the assessment options available to teachers.

Educational Accountability and High-Stakes Assessment


So far we have described how technological and philosophical developments within the pro-
fession have influenced current assessment practices. Other changes are the result of soci-
etal and political influences, such as the increasing emphasis on educational accountability
and high-stakes testing. Although parents and politicians have always closely scrutinized
the public schools, over the last three decades the public demands for increased educational
accountability in the schools have reached an all-time high. To help ensure that teachers
are teaching what they are supposed to be teaching and students are learning what they
are supposed to be learning, all 50 states and the District of Columbia have implemented
statewide testing programs (Doherty, 2002). These testing programs are often referred to as
high-stakes assessments because they produce results that have direct and substantial
consequences for both students and schools (AERA et al., 1999). Students who do not pass the
tests may not be promoted to the next grade or allowed to graduate. However, the high stakes
are not limited to students. Many states publish “report cards” that reflect the performance of
school districts and individual schools. In
some states low-performing schools can be closed, reconstituted, or taken over by the state,
and administrators and teachers can be terminated or replaced (Amrein & Berliner, 2002).
Proponents of these testing programs maintain that they ensure that public school
students are acquiring the knowledge and skills necessary to succeed in society. To support
their position, they refer to data showing that national achievement scores have improved
since these testing programs were implemented. Opponents of high-stakes testing programs
argue that the tests emphasize rote learning and generally neglect critical thinking, problem
solving, and communication skills. Additionally, these critics feel that too much instruc-
tional time is spent “teaching to the test” instead of teaching the vitally important skills
teachers would prefer to focus on (Doherty, 2002). This debate is likely to continue for the
foreseeable future, but in the meantime accountability and the associated testing programs
are likely to play a major role in our public schools.

SPECIAL INTEREST TOPIC 1.4


How High-Stakes Assessments Affect Teachers

It has been suggested that one finds more support for high-stakes assessments the further one gets
from the classroom. The implication is that while parents, politicians, and education administrators
might support high-stakes assessment programs, classroom teachers generally don’t. The National
Board on Educational Testing and Public Policy sponsored a study (Pedulla, Abrams, Madaus,
Russell, Ramos, & Miao, 2003) to learn what classroom teachers actually think about high-stakes assess-
ment programs. The study found that state assessment programs affect both what teachers teach and
how they teach it. More specifically, the study found the following:

■ In those states placing more emphasis on high-stakes assessments, teachers reported feeling
  more pressure to modify their instruction to align with the test, engage in more
  test-preparation activities, and push their students to do well.
■ Teachers report that assessment programs have caused them to modify their teaching in ways
  that are not consistent with instructional best practices.
■ Teachers report that they spend more time on subjects that will be included on the state
  assessments and less time on subjects that are not assessed (e.g., fine arts, foreign language).
■ Teachers in elementary schools reported spending more time on test preparation than those
  in high schools.
■ A majority of teachers believed that the benefits of assessment programs did not warrant the
  time and money spent on them.
■ Most teachers did report favorable evaluations of their state's curriculum standards.
■ The majority of teachers did not feel the tests had unintended negative consequences such as
  causing students to be retained or drop out of school.

This is an intriguing paper for teachers in training that gives them a glimpse at the way high-
stakes testing programs may influence their day-to-day activities in the classroom. The full text of
this report is available at www.bc.edu/nbetpp.

In fact the trend is toward more, rather than less, standardized testing in public schools. For
example, the Elementary and Secondary
Education Act of 2001 (No Child Left Behind Act) requires that states test students annually
in grades 3 through 8. Because many states typically administer standardized achievement
tests in only a few of these grades, this new law will require even more high-stakes testing
than is currently in use (Kober, 2002). Special Interest Topic 1.4 provides a brief description
of a study that examined what teachers think about high-stakes state assessments.

Trends in the Assessment of Students with Disabilities


Recent amendments to the Individuals with Disability Education Act (IDEA) have signifi-
cantly impacted the assessment and instruction of children with disabilities. In summary, cur-
rent laws require that students with disabilities, with few exceptions, be included in regular
education classes and participate in all state and district assessment programs. The effect
of
this for regular education teachers is far reaching. In the past the instruction and assessment
of
students with disabilities was largely the responsibility of special education teachers, but now
regular education teachers play a prominent role. Regular education teachers will have more
students receiving special education services in their classroom and, as a result, will be inte-
grally involved in their instruction and assessment. Regular education teachers are increasingly
being required to help develop and implement individualized educational programs (IEPs) for
these students and assess their progress toward goals and objectives specified in the IEPs.

Summary
This chapter is a broad introduction to the field of educational assessment. We started by
emphasizing that assessment should be seen as an integral part of the teaching process.
When appropriately used, assessment can and should provide information that both en-
hances instruction and promotes learning. We then defined some common terms used in the
educational assessment literature:

■ A test is a procedure in which a sample of an individual’s behavior is obtained, evaluated,
  and scored using standardized procedures (AERA et al., 1999).
■ Measurement is a set of rules for assigning numbers to represent objects, traits, attributes,
  or behaviors.
■ Assessment is any systematic procedure for collecting information that can be used to make
  inferences about the characteristics of people or objects (AERA et al., 1999).
■ Reliability refers to the stability, accuracy, or consistency of test scores.
■ Validity refers to the accuracy of the interpretations of test scores.

Our discussion then turned to a description of different types of tests. Most tests can be
classified as either maximum performance or typical response. Maximum performance tests
are designed to assess the upper limits of the examinee’s knowledge and abilities whereas
typical response tests are designed to measure the typical behavior and characteristics of ex-
aminees. Maximum performance tests are often classified as achievement tests or aptitude
tests. Achievement tests measure knowledge and skills in an area in which the examinee has
received instruction. In contrast, aptitude tests measure cognitive abilities and skills that are
accumulated as the result of overall life experiences (AERA et al., 1999). Maximum perfor-
mance tests can also be classified as either speed tests or power tests. On pure speed tests,
performance reflects only differences in the speed of performance whereas on pure power
tests, performance reflects only the difficulty of the items the examinee is able to answer cor-
rectly. In most situations a test is not a measure of pure speed or pure power, but reflects some
combination of both approaches. Finally, maximum performance tests are often classified as
objective or subjective. When the scoring of a test does not rely on the subjective judgment of
the person scoring the test, it is said to be objective. For example, multiple-choice tests can be
scored using a fixed scoring key and are considered objective (multiple-choice tests are often
scored by a computer). If the scoring of a test does rely on the subjective judgment of the per-
son scoring the test, it is said to be subjective. Essay exams are examples of subjective tests.
Typical response tests measure constructs such as personality, behavior, attitudes, or
interests, and are often classified as being either objective or projective. Objective tests use
selected-response items (e.g., true-false, multiple-choice) that are not influenced by the
subjective judgment of the person scoring the test. Projective tests involve the presentation
of ambiguous material that can elicit an almost infinite range of responses. Most projective
tests involve some subjectivity in scoring, but what is exclusive to projective techniques is
the belief that these techniques elicit unconscious material that has not been censored by
the conscious mind.
Most tests produce scores that reflect the test takers’ performance. Norm-referenced
score interpretations compare an examinee’s performance to the performance of other people.
Criterion-referenced score interpretations compare an examinee’s performance to a speci-
fied level of performance. Typically tests are designed to produce either norm-referenced or
criterion-referenced scores, but it is possible for a test to produce both norm- and criterion-
referenced scores.
Next we discussed the basic assumptions that underlie educational assessment:

■ Psychological and educational constructs exist.
■ Psychological and educational constructs can be measured.
■ Although we can measure constructs, our measurement is not perfect.
■ There are different ways to measure any given construct.
■ All assessment procedures have strengths and limitations.
■ Multiple sources of information should be part of the assessment process.
■ Performance on tests can be generalized to nontest behaviors.
■ Assessment can provide information that helps educators make better educational decisions.
■ Assessments can be conducted in a fair manner.
■ Testing and assessment can benefit our educational institutions and society as a whole.

We described the major participants in the assessment process, including those who
develop tests, use tests, and take tests. We next turned to a discussion of major laws that
govern the use of tests and other assessments in schools, including the following:

■ No Child Left Behind Act (NCLB). This act requires states to develop demanding
  academic standards and put into place annual assessments to monitor progress.
■ Individuals with Disabilities Education Act (IDEA). This law mandates that children
  with disabilities receive a free, appropriate public education. To this end, students
  with disabilities may receive accommodations in their instruction and assessment.
■ Section 504. Students who qualify under Section 504 may also receive modifications
  to their instruction and assessment.
■ Protection of Pupil Rights Act (PPRA). Places requirements on surveys and assessments
  that elicit sensitive information from students.
■ Family Educational Rights and Privacy Act (FERPA). Protects the privacy of students
  and regulates access to educational records.

We noted that the use of assessments in schools is predicated on the belief that they
can provide valuable information that promotes student learning and helps educators make
better decisions. Prominent uses include the following:

■ Student evaluations. Appropriate assessment procedures allow teachers to monitor
  student progress and provide constructive feedback.
■ Instructional decisions. Appropriate assessment procedures can provide information
  that allows teachers to modify and improve their instructional practices.
■ Selection, placement, and classification decisions. Educational tests and assessments
  provide useful information to help educators select, place, and classify students.
■ Policy decisions. We are in an era of increased accountability, and policy makers and
  educational administrators are relying more on information from educational
  assessments to guide policy decisions.
■ Counseling and guidance decisions. Educational assessments also provide information
  that promotes self-understanding and helps students plan for the future.

Next we elaborated on what teachers need to know about educational testing and as-
sessment. These competencies include proficiency in the following:

■ Selecting professionally developed assessment procedures appropriate for making
  instructional decisions.
■ Developing assessment procedures that are appropriate for making instructional decisions.
■ Administering, scoring, and interpreting professionally developed and teacher-made
  assessment procedures.
■ Using assessment results when making educational decisions.
■ Developing valid grading procedures that incorporate assessment information.
■ Communicating assessment results.
■ Recognizing unethical, illegal, and other inappropriate uses of assessment procedures
  or information.

We concluded this chapter by describing some of the trends in educational assessment at
the beginning of the twenty-first century. These included the influence of computerized
adaptive testing (CAT) and other technological advances, the growing emphasis on authen-
tic or complex-performance assessments, the national emphasis on educational account-
ability and high-stakes assessment, and recent developments in the assessment of students
with disabilities.

KEY TERMS AND CONCEPTS

Achievement tests, p. 5
Aptitude tests, p. 5
Assessment, p. 3
Classification decisions, p. 21
Computerized adaptive testing (CAT), p. 25
Construct, p. 9
Criterion-referenced score, p. 8
Diagnostic assessment, p. 20
Educational accountability, p. 27
Error, p. 10
Family Educational Rights and Privacy Act (FERPA), p. 19
Formative evaluation, p. 20
Free appropriate public education (FAPE), p. 16
High-stakes testing, p. 27
Individualized educational program (IEP), p. 17
Individuals with Disabilities Education Act of 2004 (IDEA), p. 16
Mainstreaming, p. 17
Mastery testing, p. 9
Maximum performance tests, p. 4
Measurement, p. 3
No Child Left Behind Act of 2001 (NCLB), p. 16
Nonstandardized tests, p. 7
Norm-referenced score, p. 8
Objective personality tests, p. 6
Placement assessment, p. 20
Placement decisions, p. 21
Power tests, p. 5
Projective personality tests, p. 7
Protection of Pupil Rights Act (PPRA), p. 19
Psychometrics, p. 4
Reliability, p. 4
Section 504, p. 17
Selection decisions, p. 20
Speed tests, p. 5
Standardization sample, p. 7
Standardized tests, p. 7
Summative evaluation, p. 19
Test, p. 3
Typical response tests, p. 6
Validity, p. 4

RECOMMENDED READINGS

American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education (1999). Standards for educational and psychological
testing. Washington, DC: American Educational Research Association. In practically every
content area this resource is indispensable!

Jacob, S., & Hartshorne, T. S. (2007). Ethics and law for school psychologists (5th ed.).
Hoboken, NJ: Wiley. This book provides good coverage of legal and ethical issues relevant
to work in the schools.

Joint Committee on Standards for Educational Evaluation (2003). The student evaluation
standards. Thousand Oaks, CA: Corwin Press. This text presents the JCSEE guidelines as
well as illustrative vignettes intended to help educational professionals implement the
standards. The classroom vignettes cover elementary, secondary, and higher education settings.

Shepard, L. A. (2000). The role of classroom assessment in teaching and learning (CSE Technical
Report 517). Los Angeles, CA: Center for the Study of Evaluation. This outstanding report
conceptualizes classroom assessment as an integral part of teaching and learning. It is
advanced reading at times, but well worth it.

Weiss, D. J. (1995). Improving individual difference measurement with item response theory and
computerized adaptive testing. In D. Lubinski & R. Dawes (Eds.), Assessing individual
differences in human behavior: New concepts, methods, and findings (pp. 49-79). Palo Alto,
CA: Davies-Black. This chapter provides a good introduction to IRT and CAT.

Zenisky, A., & Sireci, S. (2002). Technological innovations in large-scale assessment. Applied
Measurement in Education, 15, 337-362. This article details some of the ways computers
have affected and likely will impact assessment practices.

INTERNET SITES OF INTEREST

www.aft.org
This is the Web site for the American Federation of Teachers, an outstanding resource for all
interested in education.

http://edweek.org
Education Week is a weekly newsletter that is available online. This very valuable resource
allows teachers to stay informed about professional events across the nation. You can sign up
for a weekly alert and summary of articles. This is really worth checking out!

www.ncme.org
This Web site for the National Council on Measurement in Education is an excellent resource
for those interested in finding scholarly information on assessment in education.

CHAPTER 2

The Basic Mathematics of Measurement

One does not need to be a statistical wizard to grasp the basic mathematical
concepts needed to understand major measurement issues.

CHAPTER HIGHLIGHTS

The Role of Mathematics in Assessment
Scales of Measurement
The Description of Test Scores
Correlation Coefficients

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Define measurement.
2. Describe the different scales of measurement and give examples.
3. Describe the measures of central tendency and their appropriate use.
4. Describe the measures of variability and their appropriate use.
5. Explain the meaning of correlation coefficients and how they are used.
6. Explain how scatterplots are used to describe the relationships between two variables.
7. Describe how linear regression is used to predict performance.
8. Describe major types of correlation coefficients.
9. Distinguish between correlation and causation.

The Role of Mathematics in Assessment

Every semester, whenever one of us teaches a course in tests and measurement for undergradu-
ate students in psychology and education, we inevitably hear a common moan. Students are
quick to say they fear this course because they hear it involves "a lot of statistics" and they are
not good at math, much less statistics. As stated in the opening quotation, you do not have to
be a statistical wizard to comprehend the mathematical concepts needed to understand major


measurement issues. In fact Kubiszyn and Borich (2003) estimate that less than 1% of the stu-
dents in their testing and assessment courses performed poorly entirely because of insufficient
math skills. Nevertheless, all measurements in education and psychology have mathematical
properties, and those who use tests and other assessments, whether teacher-made or standard-
ized commercial procedures, need to have an understanding of the basic mathematical and sta-
tistical concepts on which these assessments are predicated. In this chapter we will introduce
these mathematical concepts. Generally we will emphasize the development of a conceptual
understanding of these issues rather than focusing on mathematical computations. In a few in-
stances we will present mathematical formulas and demonstrate their application, but we will
keep the computational aspect to a minimum. To guard against becoming overly technical in
this chapter, we asked undergraduate students in nonmath majors to review it. Their consensus
was that it was readable and “user friendly.” We hope you will agree!
In developing this textbook our guiding principle has been to address only those con-
cepts that teachers really need to know to develop, administer, and interpret assessments in
educational settings. We recognize that most teachers do not desire to become test develop-
ment experts, but because teachers routinely develop, use, and interpret assessments, they
need to be competent in their use. In this chapter, we will first discuss scales of measurement
and show you how different scales have different properties or characteristics. Next we will
introduce the concept of a collection or distribution of scores and review the different statis-
tics available to describe distributions. Finally we will introduce the concept of correlation,
how it is measured, and what it means.

Scales of Measurement

What Is Measurement?

An educational or psychological test is a measuring device, and as such it involves rules
(e.g., specific items, administration, and scoring instructions) for assigning numbers to an
individual's performance that are interpreted as reflecting characteristics of the individual.
In other words, measurement is a set of rules for assigning numbers to represent objects,
traits, attributes, or behaviors. For example, the number of math questions students answer
correctly on a particular math quiz may be interpreted as reflecting their understanding of
two-digit multiplication. Another example is that your responses to questions about how
often you worry about aspects of your life and are distracted by small inconveniences may
be interpreted as revealing your relative level of anxiety. When we measure something, we
do so on a scale of measurement. A scale is a system or scheme for assigning values or
scores to the characteristic being measured (e.g., Sattler, 1992). There are four scales of
measurement, and these different scales have distinct properties and convey unique types
of information. The four scales of measurement are nominal, ordinal, interval, and ratio.
The scales form a hierarchy, with nominal scales providing the least information and ratio
scales the most.

Nominal Scales
Nominal scales are the simplest of the four scales. Nominal scales classify people or objects
into categories, classes, or sets. In most situations, these categories are mutually exclusive.
For example, gender is an example of a nominal scale that assigns individuals to mutually
exclusive categories. Another example is assigning people to categories based on their
college academic majors (e.g., education, psychology,
chemistry). You may have noticed that in these examples we did not
assign numbers to the categories. In some situations we do assign
numbers in nominal scales simply to identify or label the categories; however, the categories
are not ordered in a meaningful manner. For example, we might use the number one to rep-
resent a category of students who list their academic major as education, the number two for
the academic major of psychology, the number three for the academic major of chemistry,
and so forth. Notice that no attempt is made to order the categories. Three is not greater than
two, and two is not greater than one. The assignment of numbers is completely arbitrary. We
could just as easily call them red, blue, green, and so on. Another individual might assign
a new set of numbers, which would be just as useful as ours. Because of the arbitrary use
of numbers in nominal scales, nominal scales do not actually quantify the variables under
examination. Numbers assigned to nominal scales should not be added, subtracted, ranked,
or otherwise manipulated. As a result, many common statistical procedures cannot be used
with these scales so their usefulness is limited.

Ordinal Scales
Ordinal scale measurement allows you to rank people or objects according to the amount
or quantity of a characteristic they display or possess. For example, ranking the children
in a classroom according to height from the tallest to the shortest is an example of ordinal
measurement. Traditionally the ranking is ordered from the "most" to the "least." In our
example the tallest person in the class would receive the rank of 1, the next tallest a rank
of 2, and the like. Although ordinal scale measurement provides quantitative information,
it does not ensure that the intervals between the ranks are consistent. That is, the difference
in height between the children ranked 1 and 2 might be three inches while the difference
between those ranked 3 and 4 might be one inch. The ranks tell us who is taller but tell
us nothing about how much taller. As a result, these scales are somewhat limited in both the
measurement information they provide and the statistical procedures that can be applied.
Nevertheless, the use of these scales is fairly common in educational settings. Percentile
rank, age equivalents, and grade equivalents are all examples of ordinal scales.

Interval Scales
Interval scales provide more information than either nominal or ordinal scales. Interval
scales rank people or objects like an ordinal scale, but on a scale with equal units. By equal
scale units, we mean the difference between adjacent units on the scale is equivalent. The
difference between scores of 70 and 71 is the same as the difference between scores of
50 and 51 (or 92 and 93; 37 and 38; etc.). Many educational
and psychological tests are designed to produce interval level scores.
Let’s look at an example of scores for three people on an aptitude
test. Assume individual A receives a score of 100, individual B a score of 110, and individual
C a score of 120. First, we know that person C scored the highest followed by B then A. Sec-
ond, given that the scores are on an interval scale, we also know that the difference between
individuals A and B (i.e., 10 points) is equivalent to the difference between B and C (i.e.,
10 points). Finally, we know the difference between individuals A and C (i.e., 20 points) is
twice as large as the difference between individuals A and B (i.e., 10 points). Interval level
data can be manipulated using common mathematical operations (e.g., addition, subtrac-
tion, multiplication, and division) whereas lesser scales (i.e., nominal and ordinal) cannot.
A final advantage is that most statistical procedures can be used with interval scale data.
As you can see, interval scales represent a substantial improvement over ordinal
scales and provide considerable information. Their one limitation is that interval scales
do not have a true zero point. That is, on interval scales a score of zero does not reflect
the total absence of the attribute. For example, if an individual were unable to answer
any questions correctly on an intelligence test and scored a zero, it would not indicate
the complete lack of intelligence, but only that he or she was unable to respond correctly
to any questions on this test. (Actually intelligence tests are designed so no one actually
receives a score of zero. We just use this example to illustrate the concept of an arbitrary
zero point.) Likewise, even though an IQ of 100 is twice as large as an IQ of 50, it does
not mean that the person with an IQ of 100 is twice as intelligent as the person with an IQ
of 50. In educational settings, interval scale scores are most commonly seen in the form
of standard scores (there are a number of standard scores used in education, which will be
discussed in the next chapter).

Ratio Scales

Ratio scales have the properties of interval scales plus a true zero point. Miles per hour,
length, and weight are all examples of ratio scales. As the name suggests, with these scales
we can interpret ratios between scores. For example, 60 miles per hour is twice as fast as
30 miles per hour, 20 feet is twice as long as 10 feet, and 60 pounds is three times as much
as 20 pounds. Ratios are not
meaningful or interpretable with interval scales. As we noted, a child
with an intelligence quotient (IQ) of 100 is not twice as intelligent as one with an IQ of 50;
a child with a standardized math achievement test score of 100 does not know twice as much
as one with a score of 50. With the exception of percent correct on classroom achievement

tests and the measurement of behavioral responses (e.g., reaction time), there are relatively
few ratio scales in educational and psychological measurement. Fortunately, we are able to
address most of the measurement issues in education adequately using interval scales.
Table 2.1 gives examples of common nominal, ordinal, interval, and ratio scales found
in educational and psychological measurement. As we noted, there is a hierarchy among the
scales with nominal scales being the least sophisticated and providing the least information

TABLE 2.1 Common Nominal, Ordinal, Interval, and Ratio Scales

Nominal
  Gender of participant: Female = 1; Male = 2
  Ethnicity: African American = 1; Caucasian = 2; Hispanic American = 3; Native American = 4;
    Asian American = 5
  Place of birth: Northeast = 1; Southeast = 2; Midwest = 3; Southwest = 4; Northwest = 5; Pacific = 6

Ordinal
  Preference for activity: 1 = Most preferred; 2 = Intermediate preferred; 3 = Least preferred
  Graduation class rank: 1 = Valedictorian; 2 = Salutatorian; 3 = Third rank; etc.
  Percentile rank: 99th percentile; 98th percentile; 97th percentile; etc.

Interval
  Intelligence scores: Intelligence quotient of 100
  Personality test scores: Depression score of 75
  Graduate Record Exam: Verbal score of 550

Ratio
  Height in inches: 60 inches tall
  Weight in pounds: 100 pounds
  Percent correct on classroom test: 100%

and ratio scales being the most sophisticated and providing the most information. Nominal
scales allow you to assign a number to a person that associates that person with a set or
category, but other useful quantitative properties are missing. Ordinal scales have all the
positive properties of nominal scales with the addition of the ability to rank people ac-
cording to the amount of a characteristic they possess. Interval scales have all the positive
properties of ordinal scales and also incorporate equal scale units. The inclusion of equal
scale units allows one to make relative statements regarding scores (e.g., the difference
between a score of 82 and a score of 84 is the same as the difference between a score of
92 and 94). Finally, ratio scales have all of the positive properties of an interval scale with
the addition of an absolute zero point. The inclusion of an absolute zero point allows us to
form meaningful ratios between scores (e.g., a score of 50 reflects twice the amount of the
characteristic as a score of 25). Although these scales do form a hierarchy, this does not
mean the lower scales are of little or no use. If you want to categorize students according
to their academic major, a nominal scale is clearly appropriate. Accordingly, if you simply
want to rank people according to height, an ordinal scale would be adequate and appropri-
ate. However, in most measurement situations you want to use the scale that provides the
most information. Special Interest Topic 2.1 elaborates on technical distinctions among the
four scales of measurement.
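To make the hierarchy concrete, here is a minimal Python sketch; the data values are hypothetical
and simply illustrate which summary statistics are meaningful at each level of measurement.

    from statistics import mean, median, mode

    # Nominal: the numbers are arbitrary labels for categories
    # (1 = education, 2 = psychology, 3 = chemistry), so only counting
    # categories (the mode) is meaningful.
    majors = [1, 2, 1, 3, 1, 2, 2, 1]
    print(mode(majors))          # 1, the most common category

    # Ordinal: ranks can be ordered, so the median is also meaningful,
    # but the distances between adjacent ranks are not assumed to be equal.
    class_ranks = [1, 2, 3, 4, 5, 6, 7]
    print(median(class_ranks))   # 4

    # Interval/ratio: equal scale units make the mean (and later the
    # variance and standard deviation) meaningful as well.
    quiz_scores = [7, 8, 9, 6, 7, 10, 8]
    print(mean(quiz_scores))     # about 7.86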

The Description of Test Scores

An individual’s test score in isolation provides very little information, even if we know its
scale of measurement. For example, if you know that an individual’s score on a test of reading
achievement is 79, you know very little about that student’s reading ability. Even if you know
the scale of measurement represented by the test (e.g., an interval scale), you still know very
little about the individual’s reading ability. To meaningfully interpret or describe test scores
you need to have a frame of reference. Often the frame of reference is how other people
performed on the test. For example, if in a class of 25 children, a score of 79 was the highest
score achieved, it would reflect above average (or possibly superior) performance. In contrast,
if 79 was the lowest score, it would reflect below average performance. The following sec-
tions provide information about score distributions and the statistics used to describe them. In
the next chapter we will use many of these concepts and procedures to help you learn how to
describe and interpret test scores.

Distributions

A distribution is a set of scores. These can be scores earned on a reading test, scores on an
intelligence test, or scores on a measure of depression. We can also have distributions
reflecting physical characteristics such as weight, height, or strength. Distributions can be
represented in a number of ways, including tables and graphs. Table 2.2 presents scores for
20 students on a homework assignment similar to what might be recorded in a teacher's
grade book. Table 2.3 presents an ungrouped frequency distribution of the same 20 scores.
Notice that in this example there are only seven possible measurement categories or scores
(i.e., 4, 5, 6, 7, 8, 9, and 10). In


SPECIAL INTEREST TOPIC 2.1


Scales of Measurement: Mathematical Operations
and Statistics Procedures

In this chapter we discuss a number of important distinctions among the four scales of measure-
ment. Having touched on the fact that the different scales of measurement differ in terms of the basic
mathematical and statistical operations that can be applied, in this section we elaborate on these
distinctions. With nominal level data the only mathematical operation that is applicable is “equal to”
(=) and "not equal to" (≠). With ordinal level data you can also include "greater than" (>) and "less
than” (<) as applicable operations. It is not until you have interval level data that one can use basic
operations like addition, subtraction, multiplication, and division. However, because interval level
scores do not have an absolute or true zero, you cannot make statements about relative magnitude or
create ratios. For example, it is not accurate to say that someone with an IQ (an interval level score)
of 140 is twice as intelligent as someone with an IQ of 70. With ratio level data, however, you can
make accurate statements about relative magnitude and create ratios. For example, someone 6 feet
tall is twice as tall as someone 3 feet tall, and someone weighing 140 pounds does weigh twice as
much as someone weighing 70 pounds.
As you might expect, the scale of measurement also affects the type of statistics that are ap-
plicable. We will start by addressing descriptive statistics. In terms of measures of central tendency,
discussed later in this chapter, only the mode is applicable to nominal level data. With ordinal level
data, both the mode and median can be calculated. With interval and ratio level data, the mode, me-
dian, and mean can all be calculated. In terms of measures of variability, also discussed later in this
chapter, no common descriptive statistic is applicable for nominal level data. One can describe the
categories and the count in each category, but there is no commonly used statistic available. With
ordinal level data, it is reasonable to report the range of scores. With interval and ratio level data the
range is applicable as well as the variance and standard deviation.
One finds a similar pattern in considering inferential statistics, which are procedures that
allow a researcher to make inferences regarding a population based on sample data. In brief, nominal
and ordinal level data are limited to the use of nonparametric statistical procedures. Interval and ratio
level data can be analyzed using parametric statistical procedures. Parametric statistical procedures
are preferred from a research perspective since they have more statistical power, which in simplest
terms means they are more sensitive in detecting true differences between groups.
From this discussion it should be clear that more mathematical and statistical procedures can
be used with interval and ratio level data than with nominal and ordinal level data. This is one of the
reasons why many researchers and statisticians prefer to work with interval and ratio level data. In
summary, it is important to accurately recognize the scale of measurement you are using so appropri-
ate mathematical and statistical procedures can be applied.

Note: While these technical guidelines are widely accepted, there is not universal agreement and some disregard
them (e.g., calculating the mean on ordinal level data).

some situations there are so many possible scores that it is not practical to list each potential
score individually. In these situations it is common to use a grouped frequency distribu-
tion. In grouped frequency distributions the possible scores are “combined” or “grouped”
into class intervals that encompass a range of possible scores. Table 2.4 presents a grouped

TABLE 2.2 Distribution of Scores for 20 Students

Student     Homework Score
Cindy       7
Tommy       8
Paula       9
Steven      6
Angela      7
Robert      6
Kim         10
Kevin       8
Randy       5
Charles     9
Hada        9
Shawn       9
Koren       8
Paul        4
Teresa      5
Freddie     6
Tammy       7
Shelly      8
Carol       8
Johnny      7

Mean = 7.3
Median = 7.5
Mode = 8

TABLE 2.3 Ungrouped Frequency Distribution

Score    Frequency
10       1
9        4
8        5
7        4
6        3
5        2
4        1

Note: This reflects the same distribution of scores depicted in Table 2.2.

TABLE 2.4 Grouped Frequency Distribution

Class Interval    Frequency
125-129           6
120-124           14
115-119           17
110-114           23
105-109           27
100-104           42
95-99             39
90-94             25
85-89             21
80-84             17
75-79             13
70-74             6

Note: This presents a grouped frequency distribution of 250 hypothetical scores that are
grouped into class intervals that incorporate five score values.

frequency distribution of 250 hypothetical scores grouped into class intervals spanning five
score values.
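If you want to verify a frequency distribution by computer, a brief Python sketch will do it.
The scores below are the 20 homework scores from Table 2.2; the grouping simply illustrates
the mechanics of class intervals with a width of 5, as in Table 2.4 (Table 2.4 itself summarizes
a different, hypothetical set of 250 scores).

    from collections import Counter

    # The 20 homework scores from Table 2.2
    scores = [7, 8, 9, 6, 7, 6, 10, 8, 5, 9, 9, 9, 8, 4, 5, 6, 7, 8, 8, 7]

    # Ungrouped frequency distribution (compare with Table 2.3)
    ungrouped = Counter(scores)
    for score in sorted(ungrouped, reverse=True):
        print(score, ungrouped[score])

    # Grouped frequency distribution: collapse scores into class intervals
    # of width 5 (here the intervals are 0-4, 5-9, and 10-14)
    width = 5
    grouped = Counter((score // width) * width for score in scores)
    for start in sorted(grouped, reverse=True):
        print(f"{start}-{start + width - 1}:", grouped[start])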
Frequency graphs are also popular and provide a visual representation of a distribution.
When reading a frequency graph, scores are traditionally listed on the horizontal axis and
the frequency of scores is listed on the vertical axis. Figure 2.1 presents a graph of the set
of homework scores listed in Tables 2.2 and 2.3. In examining this figure you see that there
was only one score of 10 (reflecting perfect performance) and there was only one score of
4 (reflecting correct responses to only four questions). Most of the other students received
scores between 7 and 9. Figure 2.2 presents a graph of a distribution that might reflect a large
standardization sample. Examining this figure reveals that the scores tend to accumulate
around the middle in frequency, diminishing as we move further away from the middle.

FIGURE 2.1 Graph of the Homework Scores (horizontal axis: homework scores from 4 to 10;
vertical axis: frequency)
FIGURE 2.2 Hypothetical Distribution of Large Standardization Sample

FIGURE 2.3 Negatively Skewed Distribution

FIGURE 2.4 Positively Skewed Distribution

Another characteristic of the distribution depicted in Figure 2.2 is that it is symmetrical,
which means that if you divide the distribution into two halves, they will mirror each
other. Not all distributions are symmetrical. A nonsymmetrical distribution is referred to
as skewed. Skewed distributions can be either negatively or positively skewed. In a
negatively skewed distribution most of the scores fall at the high end of the scale and the
tail extends toward the low end, as illustrated in Figure 2.3. When a test produces scores
that are negatively skewed, it is probable that the test is too easy because there are many
high scores and relatively few low scores. In a positively skewed distribution most of the
scores fall at the low end of the scale and the tail extends toward the high end, as
illustrated in Figure 2.4. If a test
produces scores that are positively skewed, it is likely that the test is too difficult because
there are many low scores and few high scores. In the next chapter we will introduce a spe-
cial type of distribution referred to as the normal distribution and describe how it is used to
help interpret test scores. First, however, we will describe two important characteristics of
distributions and the methods we have for describing them. The first characteristic is central
tendency and the second is variability.

Measures of Central Tendency


The scores in many distributions tend to concentrate around a center (hence the term central
tendency) and there are three common descriptive statistics used to summarize this tendency.
The three measures of central tendency are the mean, median, and mode. These statistics
are frequently referenced in mental and physical measurement and all teachers should be

familiar with them. It is likely that you have heard of all of these statistics, but we will briefly
discuss them to ensure that you are familiar with the special characteristics of each.

Mean. Most people are familiar with the mean: it is the simple arithmetic average of a
distribution. Practically every day you hear discussions involving
the concept of the average amount of some entity. Meteorologists give
information about the average temperature and amount of rain, politicians and economists
discuss the average hourly wage, educators talk about the grade point average, health profes-
sionals talk about average weight and average life expectancy, and the list goes on. Formally,
the mean of a set of scores is defined by the following equation:

Mean = Sum of Scores / Number of Scores

The mean of the homework scores listed in Table 2.2 is calculated by summing the
20 scores in the distribution and dividing by 20. This results in a mean of 7.3. Note that the
mean is near the middle of the distribution (see Figure 2.1). Although no student obtained
a score of 7.3, the mean is useful in providing a sense of the central tendency of the group
of scores. Several important mathematical characteristics of the mean make it useful as a
measure of central tendency. First, the mean can be calculated with interval and ratio level
data, but not with ordinal and nominal level data. Second, the mean of a sample is a good
estimate of the mean for the population from which the sample was drawn. This is use-
ful when developing standardized tests in which standardization samples are tested and
the resulting distribution is believed to reflect characteristics of the entire population of
people with whom the test is expected to be used (see Special Interest Topic 2.2 for more
information on this topic). Another positive characteristic of the mean is that it is essential
to the definition and calculation of other descriptive statistics that are useful in the context
of measurement.
An undesirable characteristic of the mean is that it is sensitive to unbalanced extreme
scores. By this we mean a score that is either extremely high or extremely low relative to
the rest of the scores in the distribution. An extreme score, either very large or very small,
tends to “pull” the mean in its direction. This might not be readily apparent so let’s look at an
example. In the set of scores 1, 2, 3, 4, 5, and 38, the mean is 8.8. Notice that 8.8 is not near
any score that actually occurs in the distribution. The extreme score of 38 pulls the mean
in its direction. The tendency for the mean to be affected by extreme scores is particularly
problematic when there is a small number of scores. The influence of an extreme score de-
creases as the total number of scores in the distribution increases. For example, the mean of
a larger set of scores that contains the same extreme score of 38 along with many more
small scores falls much closer to the bulk of the scores. In this example the influence of
the extreme score is reduced by the presence of a larger number of scores.
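A brief Python sketch shows both points: the mean of the 20 homework scores from Table 2.2,
and the way a single unbalanced extreme score pulls the mean in its direction (the larger
illustrative set at the end is hypothetical).

    # Mean = Sum of Scores / Number of Scores
    homework = [7, 8, 9, 6, 7, 6, 10, 8, 5, 9, 9, 9, 8, 4, 5, 6, 7, 8, 8, 7]
    print(sum(homework) / len(homework))          # 7.3

    # One extreme score (38) pulls the mean well above every other score.
    with_outlier = [1, 2, 3, 4, 5, 38]
    print(sum(with_outlier) / len(with_outlier))  # about 8.8

    # Embedding the same extreme score among many more small scores
    # reduces its influence on the mean.
    larger_set = [1, 2, 3, 4, 5] * 4 + [38]
    print(sum(larger_set) / len(larger_set))      # about 4.7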

Median. The median is the score or potential score that divides a distribution in half. In
the distribution of scores depicted in Table 2.3, half the scores are 8 or above and half the
scores are 7 or below. Therefore, the point that divides the distribution in half is between
8 and 7, or 7.5. When the number of scores in a distribution is an odd number, the median
is simply the score that is in the middle of the distribution. Consider

SPECIAL INTEREST TOPIC 2.2


Population Parameters and Sample Statistics

Although we try to minimize the use of statistical jargon whenever possible, at this point it is useful
to highlight the distinction between population parameters and sample statistics. Statisticians dif-
ferentiate between populations and samples. A population is the complete group of people, objects,
or other things of interest. An example of a population is all of the secondary students in the United
States. Because this is a very large number of students, it would be extremely difficult to study such
a group. Such constraints often prevent researchers from studying entire populations. Instead they
study samples. A sample is just a subset of the larger population that is thought to be representative
of the population. By studying samples researchers are able to make generalizations about popula-
tions. For example, although it might not be practical to administer a questionnaire to all secondary
students in the United States, it would be possible to select a random sample of secondary students
and administer the questionnaire to them. If we are careful in selecting this sample and it is of suf-
ficient size, the information garnered from the sample may allow us to draw some conclusions about
the population.
Now we will address the distinction between parameters and statistics. Population values are
referred to as parameters and are typically represented with Greek symbols. For example, statisti-
cians use mu (μ) to indicate a population mean and sigma (σ) to indicate a population standard
deviation. Because it is often not possible to study entire populations, we do not know population
parameters and have to estimate them using statistics. A statistic is a value that is calculated based on
a sample. Statistics are typically represented with Roman letters. For example, statisticians use X̄ to
indicate the sample mean (some statisticians use M to indicate the mean) and SD (or S) to indicate
the sample standard deviation. Sample statistics can provide information about the corresponding
population parameters. For example, the sample mean (X̄) may serve as an estimate of the population
mean (μ). Of course the information provided by a sample statistic is only as good as the sample the
statistic is based on. Large representative samples can provide good information whereas small or
biased samples will provide poor information. Without going into detail about sampling and infer-
ential statistics at this point, we do want to make you aware of the distinction between parameters
and statistics. In this and other texts you will see references to both parameters and statistics and
understanding this distinction will help you avoid a misunderstanding. Remember, as a general rule
if the value is designated with a Greek symbol it refers to a population parameter, but if it is desig-
nated with a Roman letter it is a sample statistic.

the following set of scores: 9, 8, 7, 6, 5. In this example the median is 7 because two scores
fall above it and two fall below it. In actual practice a process referred to as interpolation is
often used to compute the median (because interpolation is illustrated in practically every
basic statistics textbook, we will not go into detail about the process). The median can be
calculated for distributions containing ratio, interval, or ordinal level scores, but it is not ap-
propriate for nominal level scores. The median is a useful and versatile measure of central
tendency.
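Both of these medians can be checked with Python's statistics module, which returns the middle
score when the number of scores is odd and the average of the two middle scores when it is even.

    from statistics import median

    homework = [7, 8, 9, 6, 7, 6, 10, 8, 5, 9, 9, 9, 8, 4, 5, 6, 7, 8, 8, 7]
    print(median(homework))     # 7.5 (even number of scores: midway between 7 and 8)

    odd_set = [9, 8, 7, 6, 5]
    print(median(odd_set))      # 7 (odd number of scores: the middle score)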

Mode. The mode of a distribution is the most frequently occurring score. Referring back
to Table 2.3, which presents the ungrouped
frequency distribution of 20 students on a homework assignment, you will see that the most
frequently occurring score is 8. These scores are graphed in Figure 2.1, and by locating
the highest point in the graph you are also able to identify the mode (i.e., 8). An advantage
of the mode is that it can be used with nominal data (e.g., the most frequent college major
selected by students) as well as ordinal, interval, and ratio data (Hays, 1994). However, the
mode does have significant limitations as a measure of central tendency. First, some distri-
butions have two scores that are equal in frequency and higher than other scores (see Figure
2.5). This is referred to as a “bimodal” distribution and the mode is ineffective as a measure
of central tendency. Second, the mode is not a very stable measure of central tendency,
particularly with small samples. For example, in the distribution depicted in Table 2.3, if
one student who earned a score of 8 had earned a score of either 7 or 9, the mode would have
shifted from 8 to 7 or 9. As a result of these limitations, the mode is often of little utility as
a measure of central tendency.
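A short Python sketch makes both the mode and the bimodal problem easy to see; the bimodal
data set is hypothetical.

    from collections import Counter

    homework = [7, 8, 9, 6, 7, 6, 10, 8, 5, 9, 9, 9, 8, 4, 5, 6, 7, 8, 8, 7]
    print(Counter(homework).most_common(1))   # [(8, 5)]: the mode is 8

    # A hypothetical bimodal distribution: two scores tie for the highest
    # frequency, so a single mode is not an effective summary.
    bimodal = [2, 3, 3, 3, 4, 5, 6, 7, 7, 7, 8]
    print(Counter(bimodal).most_common(2))    # [(3, 3), (7, 3)]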

Choosing between the Mean, Median, and Mode. A natural question is, Which mea-
sure of central tendency is most useful or appropriate? As you might expect, the answer
depends on a number of factors. First, as we noted when discussing the mean, it is essential
when calculating other useful statistics. For this and other rather technical reasons (see
Hays, 1994), the mean has considerable utility as a measure of central tendency. However,
for purely descriptive purposes the median is often the most versatile and useful measure
of central tendency. When a distribution is skewed, the influence of unbalanced extreme
scores on the mean tends to undermine its usefulness. Figure 2.6 illustrates the expected
relationship between the mean and the median in skewed distributions. Note that the mean
is “pulled” in the direction of the skew: that is, lower than the median in negatively skewed
distributions and higher than the median in positively skewed distributions. To illustrate how
the mean can be misleading in skewed distributions, Hopkins (1998) notes that due to the
influence of extremely wealthy individuals, about 60% of the families in the United States
have incomes below the national mean. In this situation, the mean is pulled in the direction
of the extreme high scores and is somewhat misleading as a measure of central tendency.
Finally, it is important to consider the variable's scale of measurement. If you are dealing
with nominal level data, the mode is the only measure of central tendency that provides
useful information. With ordinal level data one can calculate the median in addition to the
mode. It is only with interval and ratio level data that one can appropriately calculate the
mean in addition to the median and mode.

FIGURE 2.5 Bimodal Distribution

FIGURE 2.6 Relationship between Mean, Median, and Mode in Normal and Skewed Distributions.
Panels: (a) Normal Distribution, where the mean, median, and mode are the same; (b) Negatively
Skewed Distribution; (c) Positively Skewed Distribution.
Source: From L. H. Janda, Psychological Testing: Theory and Applications. Published by Allyn &
Bacon, Boston, MA. Copyright © 1998 by Pearson Education. Reprinted by permission of the
publisher.
At this point you should have a good understanding of the various measures of central
tendency and be able to interpret them in many common applications. You might be surprised
how often individuals in the popular media demonstrate a fundamental misunderstanding
of these measures. See Special Interest Topic 2.3 for a rather humorous example of how a
journalist misinterpreted information based on measures of central tendency.
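You can also see the pattern shown in Figure 2.6 numerically. In the following Python sketch the
income figures are purely hypothetical, but they form a positively skewed distribution, and the
mean is pulled above the median toward the few extreme values.

    from statistics import mean, median

    # Hypothetical family incomes in thousands of dollars: most are modest,
    # a few are very large, so the distribution is positively skewed.
    incomes = [28, 32, 35, 38, 40, 42, 45, 48, 52, 60, 75, 250, 900]
    print(mean(incomes))    # about 126.5, pulled toward the extreme high incomes
    print(median(incomes))  # 45, unaffected by the size of the extreme values

    # As in the national income example, well over half of these families
    # fall below the mean.
    print(sum(x < mean(incomes) for x in incomes))  # 11 of 13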

Measures of Variability
Two distributions can have the same mean, median, and mode yet differ considerably in
the way the scores are distributed around the measures of central tendency. Therefore, it is
not sufficient to characterize a set of scores solely by measures of central tendency. Figure
2.7 presents graphs of three distributions with identical means but different degrees of
variability. A measure of the dispersion, spread, or variability of a set of scores will help
us describe the distribution more fully. We will examine three measures of variability
commonly used to describe distributions: range, standard devia-
tion, and variance.
Range. The range is the distance between the smallest and largest score in a distribution.
The range is calculated:

Range = Highest Score - Lowest Score

For example, in referring back to the distribution of scores listed in Table 2.3, you see that
the largest score is 10 and the smallest score is 4. By simply subtracting 4 from 10 you
determine the range is 6. (Note: Some authors define the range as the highest score minus
the lowest score, plus one. This is known as the inclusive range.) The range considers only

SPECIAL INTEREST TOPIC 2.3


A Public Outrage: Physicians Overcharge Their Patients

Half of all professionals charge above the median fee for their services. Now that you understand
the mean, median, and mode, you will recognize how obvious this statement is. However, a few
years back a local newspaper columnist in Texas, apparently unhappy with his physician’s bill for
some services, conducted an investigation of charges for various medical procedures in the county
in which he resided. In a somewhat angry column he revealed to the community that “fully half of
all physicians surveyed charge above the median fee for their services.”
We would like him to know that “fully half” of all plumbers, electricians, painters, lawn ser-
vices, hospitals, and everyone else we can think of also charge above the median for their services.
We wouldn’t have it any other way!

FIGURE 2.7 Three Distributions with Different Degrees of Variability (panels a, b, and c)
Source: From Robert J. Gregory, Psychological Testing: History, Principles, and Applications, 3/e.
Published by Allyn & Bacon, Boston, MA. Copyright © 2000 by Pearson Education. Reprinted by
permission of the publisher.

the two most extreme scores in a distribution and tells us about the limits or extremes of a
distribution. However, it does not provide information about how the remaining scores are
spread out or dispersed within these limits. We need other descriptive statistics, namely the
standard deviation and variance, to provide information about the spread or dispersion of
scores within the limits described by the range.

Standard Deviation. The standard deviation is a measure of the average distance that
scores vary from the mean of the distribution. The mean and standard deviation are the
most widely used statistics in educational and psychological testing as well as research in
the social and behavioral sciences. The standard deviation is computed with the following steps:

Step 1. Compute the mean of the distribution.


Step 2. Subtract the mean from each score in the distribution. This will yield some
negative numbers, and if you add all of these differences, the sum will be zero. To
overcome this difficulty, we simply square each difference score because the square
of any number is always positive.
Step 3. Sum all the squared difference scores.
Step 4. Divide this sum by the number of scores to derive the average of the squared
deviations from the mean. This value is the variance and is designated by σ² (we will
return to this value briefly).
Step 5. The standard deviation (σ) is the positive square root of the variance (σ²). It
is the square root because we first squared all the scores before adding them. To now
get a true look at the standard distance between key points in the distribution, we have
to undo our little trick that eliminated all those negative signs.

These steps are illustrated in Table 2.5 using the scores listed in Table 2.2. This example
illustrates the calculation of the population standard deviation, designated with the Greek
symbol sigma (σ). You will also see the standard deviation designated with SD or S. This is
appropriate when you are describing the standard deviation of a sample rather than a popu-
lation (refer back to Special Interest Topic 2.2 for information on the distinction between
population parameters and sample statistics).¹
The standard deviation is a measure of the average distance that scores vary from the
mean of the distribution. The larger the standard deviation, the more scores differ from the
mean and the more variability there is in the distribution. If scores are widely dispersed or
spread around the mean, the standard deviation will be large. If there is relatively little dis-
persion or spread of scores around the mean, the standard deviation will be small.
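The five steps are easy to carry out in a short Python sketch, which reproduces the values worked
out in Table 2.5 for the 20 homework scores. Note that these are the descriptive formulas that
divide by the number of scores, N; when a population variance is estimated from a sample, N - 1
is used in the denominator instead.

    import math

    scores = [7, 8, 9, 6, 7, 6, 10, 8, 5, 9, 9, 9, 8, 4, 5, 6, 7, 8, 8, 7]

    # Step 1: compute the mean
    mean = sum(scores) / len(scores)                       # 7.3

    # Steps 2 and 3: square each score's difference from the mean and sum them
    sum_squares = sum((s - mean) ** 2 for s in scores)     # 48.2

    # Step 4: divide by the number of scores to get the variance
    variance = sum_squares / len(scores)                   # 2.41

    # Step 5: the standard deviation is the positive square root of the variance
    sd = math.sqrt(variance)                               # about 1.55

    # The range, by contrast, uses only the two most extreme scores
    score_range = max(scores) - min(scores)                # 10 - 4 = 6

    print(round(mean, 2), round(variance, 2), round(sd, 2), score_range)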

Variance. In calculating the standard deviation we actually first calculate the variance
(σ²). As illustrated in Table 2.5, the standard deviation is actually the positive square root
of the variance. Therefore, the variance is also a measure of the variability of scores. The
reason the standard deviation is more frequently used when interpreting individual scores
is that the variance is in squared units of measurement, which complicates interpretation.
For example, we can easily interpret weight in pounds, but it is more difficult to interpret
and use weight reported in squared pounds. While the variance is in squared units, the
standard deviation (i.e., the square root of the variance) is in the same units as the scores
and so is more easily understood. Although the variance is difficult to apply when
describing individual scores, it does have special meaning as a theoretical concept in
measurement theory and statistics. For now, simply remember that the variance is a
measure of the degree of variability in scores.

Choosing between the Range, Standard Deviation, and Variance. As we noted, the
range conveys information about the limits of a distribution, but does not tell us how the
scores are dispersed within these limits. The standard deviation indicates the average dis-
1. The discussion and formulas provided in this chapter are those used in descriptive statistics. In
inferential statistics, when the population variance is estimated from a sample, the N in the
denominator is replaced with N - 1.

TABLE 2.5 Calculating the Standard Deviation and Variance

Student Score    Difference (Score - Mean)    Difference Squared
7                (7 - 7.3) = -0.3             0.09
8                (8 - 7.3) = 0.7              0.49
9                (9 - 7.3) = 1.7              2.89
6                (6 - 7.3) = -1.3             1.69
7                (7 - 7.3) = -0.3             0.09
6                (6 - 7.3) = -1.3             1.69
10               (10 - 7.3) = 2.7             7.29
8                (8 - 7.3) = 0.7              0.49
5                (5 - 7.3) = -2.3             5.29
9                (9 - 7.3) = 1.7              2.89
9                (9 - 7.3) = 1.7              2.89
9                (9 - 7.3) = 1.7              2.89
8                (8 - 7.3) = 0.7              0.49
4                (4 - 7.3) = -3.3             10.89
5                (5 - 7.3) = -2.3             5.29
6                (6 - 7.3) = -1.3             1.69
7                (7 - 7.3) = -0.3             0.09
8                (8 - 7.3) = 0.7              0.49
8                (8 - 7.3) = 0.7              0.49
7                (7 - 7.3) = -0.3             0.09

Sum of scores = 146                           Sum of squared differences = 48.2
Mean = 146/20 = 7.3
Variance = 48.2/20 = 2.41
Standard Deviation = square root of 2.41 = 1.55

tance that scores vary from the mean of the distribution. The larger the standard deviation,
the more variability there is in the distribution. The standard deviation is very useful in
describing distributions and will be of particular importance when we turn our attention to
the interpretation of scores in the next chapter. The variance is another important and use-
ful measure of variability. Because the variance is expressed in terms of squared measure-
ment units, it is not as useful in interpreting individual scores as is the standard deviation.


However, the variance is important as a theoretical concept, and we will return to it when
discussing reliability and validity in later chapters.

Correlation Coefficients

Most students are somewhat familiar with the concept of correlation. When people speak of
a correlation, they are referring to the relationship between two variables. The variables can
be physical such as weight and height or psychological such as intelligence and academic
achievement. For example, it is reasonable to expect height to demonstrate a relationship
with weight. Taller individuals tend to weigh more than shorter individuals. This relation-
ship is not perfect because there are some short individuals who weigh more than taller
individuals, but the tendency is for taller people to outweigh shorter people. You might also
expect more intelligent people to score higher on tests of academic achievement than less
intelligent people, and this is what research indicates. Again, the relationship is not per-
fect, but as a general rule more intelligent individuals perform better on tests of academic
achievement than their less intelligent peers.

A correlation coefficient is a quantitative measure of the relationship between two
variables. The correlation coefficient was developed by Karl Pearson (1857-1936) and is
designated by the letter r. Correlation coefficients can range from -1.0 to +1.0. When
interpreting correlation coefficients, there are two parameters to consider. The first
parameter is the sign of the coefficient. A positive correlation coefficient indicates that an
increase on one variable is associated with an increase
on the other variable. For example, height and weight demonstrate a positive correlation
with each other. As noted earlier, taller individuals tend to weigh more than shorter individu-
als. A negative correlation coefficient indicates that an increase on one variable is associated
with a decrease on the other variable. For example, because lower scores denote superior
performance in the game of golf, there is a negative correlation between the amount of
tournament prize money won and a professional’s average golf score. Professional golfers
with the lowest average scores tend to win the most tournaments.
The second parameter to consider when interpreting correlation coefficients is the mag-
nitude or absolute size of the coefficient. The magnitude of a coefficient indicates the strength
of the relationship between two variables. A value of 0 indicates the absence of a relationship
between the variables. As coefficients approach a value of 1.0, the strength of the relationship
increases. A coefficient of 1.0 (either positive or negative) indicates a perfect correlation, one in
which change in one variable is accompanied by a corresponding and proportionate change in
the other variable, without exception. Perfect correlation coefficients are rare in psychological
and educational measurement, but they might occur in very small samples simply by chance.
There are numerous qualitative and quantitative ways of describing correlation coef-
ficients. A qualitative approach to describe correlation coefficients is as weak, moderate, or
strong. Although there are no universally accepted standards for describing the strength of
correlations, we offer the following guidelines: <0.30, weak; 0.30-0.70, moderate; and >0.70,
strong (these are just guidelines and should not be applied in a rigid manner). This approach

is satisfactory in many situations, but in other contexts it may be more important to determine
whether a correlation is “statistically significant.” Statistical significance is determined by
both the size of the correlation coefficient and the size of the sample. A discussion of statistical
significance would lead us into the realm of inferential statistics and is beyond the scope of
this text. However, most introductory statistics texts address this concept in considerable detail
and contain tables that allow you to determine whether a correlation coefficient is significant
given the size of the sample.
Another way of describing correlation coefficients is by squaring them to derive the
coefficient of determination (i.e., r²). The coefficient of determination is interpreted as
the amount of variance shared by the two variables. In other words, the coefficient of
determination reflects the amount of variance in one variable that is predictable from the
other variable, and vice versa.

The coefficient of determination is interpreted as the amount of variance shared by two variables.

This might not be clear so let's look at an example. Assume a correlation between
an intelligence test and an achievement test of 0.60 (i.e., r = 0.60). By squaring this value
we derive a coefficient of determination of 0.36 (i.e., r² = 0.36). This indicates that 36% of
the variance in one variable is predictable from the other variable.
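For readers who like to verify such figures with software, here is a minimal Python sketch; the correlation value mirrors the example above and the variable names are ours:

# Squaring a correlation coefficient gives the coefficient of determination
r = 0.60                # hypothetical correlation between an intelligence test and an achievement test
r_squared = r ** 2      # coefficient of determination
print(f"r = {r:.2f}, r squared = {r_squared:.2f}")   # r = 0.60, r squared = 0.36
print(f"Shared variance: {r_squared:.0%}")           # Shared variance: 36%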

Scatterplots
As noted, a correlation coefficient is a quantitative measure of the relationship between
two variables. Examining scatterplots may enhance our understanding of the relationship
between variables. A scatterplot is simply a graph that visually displays the relationship
between two variables.

A scatterplot is a graph that visually displays the relationship between two variables.

To create a scatterplot you need to have two scores for each individual. For example, you
could graph each individual's weight and height. In the context of educational testing, you
could have scores for the students in a class on two different home-
work assignments. In a scatterplot the X-axis represents one variable
and the Y-axis the other variable. Each mark in the scatterplot actually represents two scores,
an individual’s scores on the X variable and the Y variable.
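For readers who want to produce such a graph themselves, the following minimal Python sketch (using the matplotlib library; the homework scores are hypothetical) plots one mark per student:

import matplotlib.pyplot as plt

# Hypothetical scores for 10 students on two homework assignments
homework_1 = [7, 8, 9, 6, 7, 6, 10, 8, 5, 9]    # plotted on the X-axis
homework_2 = [8, 7, 10, 5, 7, 6, 9, 8, 5, 9]    # plotted on the Y-axis

plt.scatter(homework_1, homework_2)   # each mark represents one student's pair of scores
plt.xlabel("Homework 1 (X)")
plt.ylabel("Homework 2 (Y)")
plt.title("Scatterplot of Two Homework Assignments")
plt.show()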
Figure 2.8 shows scatterplots for various correlation values. First, look at Figure 2.8a,
which shows a hypothetical perfect positive correlation (+1.0). Notice that with a perfect
correlation all of the marks will fall on a straight line. Because this is a positive correlation
an increase on one variable is associated with a corresponding increase on the other variable.
Because it is a perfect correlation, if you know an individual’s score on one variable you
can predict the score on the other variable with perfect precision. Next examine Figure 2.8b,
which illustrates a perfect negative correlation (—1.0). Being a perfect correlation all the
marks fall on a straight line, but because it is a negative correlation an increase on one vari-
able is associated with a corresponding decrease on the other variable. Given a score on one
variable, you can still predict the individual’s performance on the other variable with perfect
precision. Now examine Figure 2.8c, which illustrates a correlation of 0.0. Here there is not
a relationship between the variables. In this situation, knowledge about performance on one
variable does not provide any information about the individual’s performance on the other
variable or enhance prediction.
[Figure 2.8 contains six scatterplots, panels (a) through (f), each plotting Variable Y against Variable X for one of the correlation values discussed in the text.]

FIGURE 2.8 Scatterplots of Different Correlation Coefficients


Source: From D. Hopkins, Educational and Psychological Measurement and Evaluation, 8/e. Published by Allyn
& Bacon, Boston, MA. Copyright © 1998 by Pearson Education. Reprinted by permission of the publisher.


So far we have examined only the scatterplots of perfect and zero correlation coef-
ficients. Examine Figure 2.8d, which depicts a correlation of +0.90. Notice that the marks
clearly cluster along a straight line. However, they no longer all fall on the line, but rather
around the line. As you might expect, in this situation knowledge of performance on one
variable helps us predict performance on the other variable, but our ability to predict per-
formance is not perfect as it was with a perfect correlation. Finally, examine Figures 2.8e
and 2.8f, which illustrate coefficients of 0.60 and 0.30, respectively. As you can see a cor-
relation of 0.60 is characterized by marks that still cluster along a straight line, but there is
more variability around this line than there was with a correlation of 0.90. Accordingly, with
a correlation of 0.30 there is still more variability of marks around a straight line. In these
situations knowledge of performance on one variable will help us predict performance on
the other variable, but as the correlation coefficients decrease so does our ability to predict
performance.

Correlation and Prediction


In the previous section we mentioned that when variables are correlated, particularly when
there is a strong correlation, knowledge about performance on one variable provides
information that can help predict performance on the other variable. A special mathematical
procedure referred to as linear regression is designed precisely for this purpose.

Linear regression is a mathematical procedure that allows you to predict values on one variable given information on another variable.

Linear regression allows you to predict values on one variable given information on
another variable. We will not be going into detail about its computation, but linear regression
has numerous applications in developing and evaluating tests, and so we will come back to
linear regression in later chapters.

Types of Correlation Coefficients

Specific correlation coefficients are appropriate for specific situations.

Specific correlation coefficients are appropriate for specific situations. The most common coefficient is the Pearson product-moment
correlation. The Pearson coefficient is appropriate when the variables being correlated are
measured on an interval or ratio scale. Table 2.6 illustrates the calculation of the Pearson cor-
relation coefficient. Although the formula for calculating a Pearson correlation may appear
rather intimidating, it is not actually difficult, and we encourage you to review this section if
you are interested in how these coefficients are calculated (or if your professor wants you to
be familiar with the process). Spearman’s rank correlation coefficient, another popular coef-
ficient, is used when the variables are measured on an ordinal scale. The point-biserial cor-
relation coefficient is also widely used in test development when one variable is dichotomous
(meaning only two scores are possible, e.g., pass or fail, 0 or 1, etc.) and the other variable
is measured on an interval or ratio scale. A common application of the point-biserial cor-
relation is in calculating an item-total test score correlation. Here the dichotomous variable
is the score on a single item (e.g., right or wrong) and the variable measured on an interval
scale is the total test score. A large item-total correlation is taken as evidence that an item is
measuring the same construct as the overall test measures.
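As a concrete illustration of an item-total correlation, here is a minimal Python sketch (using the scipy library); the item scores and total test scores are hypothetical:

from scipy.stats import pointbiserialr

# Hypothetical data for 10 examinees
item_scores = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]             # dichotomous: 1 = correct, 0 = incorrect
total_scores = [48, 45, 30, 42, 28, 40, 46, 33, 25, 44]  # total test scores (interval scale)

r_pb, p_value = pointbiserialr(item_scores, total_scores)
print(f"Item-total (point-biserial) correlation: {r_pb:.2f}")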

TABLE 2.6 Calculating a Pearson Correlation Coefficient

There are different formulas for calculating a Pearson correlation coefficient and we will
illustrate one of the simpler ones. For this illustration we will use the homework assignment
scores we have used before as the X variable and another set of 20 hypothetical scores as the
Y variable. The formula is:

r = [NΣXY − (ΣX)(ΣY)] / [√(NΣX² − (ΣX)²) × √(NΣY² − (ΣY)²)]

where N = number of score pairs
ΣXY = sum of the XY products
ΣX = sum of X scores
ΣY = sum of Y scores
ΣX² = sum of squared X scores
ΣY² = sum of squared Y scores

Homework 1 (X)    X²    Homework 2 (Y)    Y²    (X)(Y)
      7           49          8           64      56
      8           64          7           49      56
      9           81         10          100      90
      6           36          5           25      30
      7           49          7           49      49
      6           36          6           36      36
     10          100          9           81      90
      8           64          8           64      64
      5           25          5           25      25
      9           81          9           81      81
      9           81          8           64      72
      9           81          7           49      63
      8           64          7           49      56
      4           16          4           16      16
      5           25          6           36      30
      6           36          7           49      42
      7           49          7           49      49
      8           64          9           81      72
      8           64          8           64      64
      7           49          6           36      42
  ΣX = 146   ΣX² = 1,114   ΣY = 143   ΣY² = 1,067   ΣXY = 1,083

r = [20(1,083) − (146)(143)] / [√(20(1,114) − (146)²) × √(20(1,067) − (143)²)]
  = (21,660 − 20,878) / (√(22,280 − 21,316) × √(21,340 − 20,449))
  = 782 / (√964 × √891)
  = 782 / [(31.048)(29.849)]
  = 782 / 926.8
  = .84
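If you would like to check the hand calculation above, the following minimal Python sketch applies the same computational formula to the 20 pairs of homework scores and cross-checks the result with numpy's built-in correlation function:

import numpy as np

x = [7, 8, 9, 6, 7, 6, 10, 8, 5, 9, 9, 9, 8, 4, 5, 6, 7, 8, 8, 7]    # Homework 1 (X)
y = [8, 7, 10, 5, 7, 6, 9, 8, 5, 9, 8, 7, 7, 4, 6, 7, 7, 9, 8, 6]    # Homework 2 (Y)
n = len(x)

numerator = n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)
denominator = ((n * sum(xi ** 2 for xi in x) - sum(x) ** 2) ** 0.5 *
               (n * sum(yi ** 2 for yi in y) - sum(y) ** 2) ** 0.5)
print(f"Computational formula: r = {numerator / denominator:.2f}")   # approximately .84
print(f"numpy cross-check:     r = {np.corrcoef(x, y)[0, 1]:.2f}")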

Correlation versus Causality


Our discussion of correlation has indicated that when variables are correlated, information
about an individual’s performance on one variable enhances our ability to predict perfor-
mance on the other variable. We have also seen that by squaring a correlation coefficient to
get the coefficient of determination, we can make statements about the amount of variance
shared by two variables. In later chapters we will show how correla-
tion coefficients are used in developing and evaluating tests.

Correlation analysis does not allow one to establish causality.

It is, however, a common misconception to believe that if two variables are
correlated one is causing the other. It is possible that the variables
are causally related, but it is also possible that a third variable explains the relationship. Let’s
look at an example. Assume we found a correlation between the amount of ice cream con-
sumed in New York and the number of deaths by drowning in Texas. If you were to interpret
this correlation as inferring causation, you would either believe that people eating ice cream
in New York caused people to drown in Texas or that people drowning in Texas caused people
to eat ice cream in New York. Obviously neither would be correct! How would you explain
this relationship? The answer is that the seasonal change in temperature accounts for the
relationship. In late spring and summer people in New York consume more ice cream and
people in Texas engage in more water-related activities (i.e., swimming, skiing, boating) and
consequently drown more frequently. This is a fairly obvious case of a third variable explain-
ing the relationship; however, identifying the third variable is not always so easy. It is fairly
common for individuals or groups in the popular media to attribute causation on the basis of a
correlation. So the next time you hear on television or read in the newspaper that researchers
found a correlation between variable A and variable B, and that this correlation means that A
causes B, you will not be fooled. Although correlation analysis does not allow us to establish
causality, certain statistical procedures are specifically designed to allow us to infer causality.
These procedures are referred to as inferential statistics and are covered in statistical courses.
Special Interest Topic 2.4 presents a historical example showing how interpreting a relation-
ship between variables as indicating causality resulted in an erroneous conclusion.

Summary
In this chapter we surveyed the basic mathematical concepts and procedures essential to un-
derstanding measurement. We defined measurement as a set of rules for assigning numbers
to represent objects, traits, or other characteristics. Measurement can involve four different
scales—nominal, ordinal, interval, and ratio—that have distinct properties.

Nominal scale: a qualitative system for categorizing people or objects into catego-
ries. In nominal scales the categories are not ordered in a meaningful manner and do
not convey quantitative information.
Ordinal scale: a quantitative system that allows you to rank people or objects accord-
ing to the amount of a characteristic possessed. Ordinal scales provide quantitative in-
formation, but they do not ensure that the intervals between the ranks are consistent.


SPECIAL INTEREST TOPIC 2.4


Caution: Drawing Conclusions of Causality

Reynolds (1999) related this historical example of how interpreting a relationship between variables
as indicating causality can lead to an erroneous conclusion. He noted that in the 1800s a physician
realized that a large number of women were dying of “childbed fever” (i.e., puerperal fever) in the
prestigious Vienna General Hospital. Curiously more women died when they gave birth in the hos-
pital than when the birth was at home. Childbed fever was even less common among women who
gave birth in unsanitary conditions on the streets of Vienna. A commission studied this situation and
after careful observation concluded that priests who came to the hospital to administer last rites were
the cause of the increase in childbed fever in the hospital. The priests were present in the hospital,
but were not present if the birth were outside of the hospital. According to the reasoning of the com-
mission, when priests appeared in this ritualistic fashion the women in the hospital were frightened,
and this stress made them more susceptible to childbed fever.
Eventually, experimental research debunked this explanation and identified what was actually
causing the high mortality rate. At that time the doctors who delivered the babies were the same doc-
tors who dissected corpses. The doctors would move from dissecting diseased corpses to delivering
babies without washing their hands or taking other sanitary procedures. When hand washing and other
antiseptic procedures were implemented, the incidence of childbed fever dropped dramatically.
In summary, it was the transmission of disease from corpses to new mothers that caused
childbed fever, not the presence of priests. Although the conclusion of the commission might sound
foolish to us now, if you listen carefully to the popular media you are likely to hear contemporary
“experts” establishing causality based on observed relationships between variables. However, now
you know to be cautious when evaluating this information.

Interval scale: a system that allows you to rank people or objects like an ordinal scale
but with the added advantage of equal scale units. Equal scale units indicate that the
intervals between the units or ranks are the same size.
Ratio scale: a system with all the properties of an interval scale with the added ad-
vantage of a true zero point.

These scales form a hierarchy, and we are able to perform more sophisticated measurements
as we move from nominal to ratio scales.
We next turned our attention to distributions. A distribution is simply a set of scores,
and distributions can be represented in a number of ways, including tables and graphs.
Descriptive statistics have been developed that help us summarize and describe major char-
acteristics of distributions. For example, measures of central tendency are frequently used
to summarize distributions. The major measures of central tendency are

Mean: the simple arithmetic average of a distribution. Formally, the mean is defined
by this equation: Mean = Sum of Scores / Number of Scores.
Median: the score or potential score that divides a distribution in half.
Mode: the most frequently occurring score in a distribution.

Measures of variability (or dispersion) comprise another set of descriptive statistics


used to characterize distributions. These measures provide information about the way scores
are spread out or dispersed. They include:

Range: the distance between the smallest and largest score in a distribution.
Standard deviation: a popular index of the average distance that scores vary from
the mean.
Variance: another measure of the variability of scores, expressed in squared score
units. Less useful when interpreting individual scores, but important as a theoretical
concept.

Finally we discussed correlation coefficients. A correlation coefficient is a quantita-


tive measure of the relationship between two variables. We described how correlation coef-
ficients provide information about both the direction and strength of a relationship. The sign
of the coefficient (i.e., + or —) indicates the direction of the relationship while the magnitude
of the coefficient indicates the strength of the relationship. Correlation coefficients also
have important implications in the context of predicting performance. The stronger the
correlation between two variables, the better we can predict performance on one variable
given information about performance on the other variable. When there is a perfect cor-
relation between two variables (either positive or negative), you can predict performance
with perfect precision. We also described the use of scatterplots to illustrate correlations
and cautioned that although correlations are extremely useful, they do not imply a causal
relationship.

KEY TERMS AND CONCEPTS

Coefficient of determination, p. 52
Correlation coefficient, p. 51
Negative correlation, p. 52
Positive correlation, p. 52
Distribution, p. 38
Positively skewed distributions, p. 41
Negatively skewed distributions, p. 41
Symmetrical distributions, p. 41
Linear regression, p. 54
Measures of central tendency, p. 42
Mean, p. 43
Median, p. 43
Mode, p. 44
Measures of variability, p. 47
Range, p. 47
Standard deviation, p. 47
Variance, p. 49
Scales of measurement, p. 34
Interval scales, p. 36
Nominal scales, p. 35
Ordinal scales, p. 35
Ratio scales, p. 36
Scatterplot, p. 52

RECOMMENDED READINGS

Hays, W. (1994). Statistics (5th ed.). New York: Harcourt Brace. This is an excellent advanced statistics text. It covers the information presented in this chapter in greater detail and provides comprehensive coverage of statistics in general.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill. An excellent advanced psychometric text. Chapters 2 and 4 are particularly relevant to students wanting a more detailed discussion of issues introduced in this chapter.

Reynolds, C. R. (1999). Inferring causality from relational data and design: Historical and contemporary lessons for research and clinical practice. The Clinical Neuropsychologist, 13, 386-395. An entertaining and enlightening discussion of the need for caution when inferring causality from relational data.

INTERNET SITES OF INTEREST

www.fedstats.gov
This site provides easy access to statistics and information provided by over 100 U.S. federal agencies.

www.ncaa.org/stats
This site is great for sports enthusiasts! It provides access to statistics compiled by the National Collegiate Athletic Association for sports ranging from baseball to lacrosse.

http://nces.ed.gov
This is the site for the National Center for Education Statistics, the primary federal agency responsible for collecting and analyzing data related to education. It contains interesting information for public school teachers and administrators.

www.xist.org
This is the Global Statistics Homepage. It contains information on the population and demographics of regions, countries, and cities.

PRACTICE ITEMS

1. Calculate the mean, variance, and standard deviation for the following score distributions.
For these exercises, use the formulas listed in Table 2.5 for calculating variance and standard
deviation.

Distribution 1 Distribution 2 Distribution 3


[The score values for the three distributions are not legible in this reproduction.]

2. Calculate the Pearson correlation coefficient for the following pairs of scores.

Sample 1 Sample 2 Sample 3


Variable X   Variable Y     Variable X   Variable Y     Variable X   Variable Y
     9           10              9           10              9            7
    10            9              9            9              9            7
     9            8              8            8              8            8
     8            4              8            7              8            5
     9            6              7            5              7            4
     5            6              7            5              7            3
     8            6              6            4              6            5
     7            5              6            3              6            5
     5            5              5            4              5            4
     4            5              5            5              5            4
     7            4              4            4              4            7
     3            4              4            3              4            8
     5            3              3            7              3            5
     6            2              2            3              2            5
     5            2              2            2              2            5

Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to listen to an audio lecture about this chapter.

CHAPTER 3

The Meaning of Test Scores

Scores are the keys to understanding a student’s performance on tests and


other assessments. As a result, thoroughly understanding the meaning of test
scores and how they are interpreted is of utmost importance.

CHAPTER HIGHLIGHTS

■ Norm-Referenced and Criterion-Referenced Score Interpretations
■ Norm-Referenced, Criterion-Referenced, or Both?
■ Qualitative Description of Scores

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Describe raw scores and explain their limitations.
2. Define norm-referenced and criterion-referenced score interpretations and explain their
major characteristics.
3. List and explain the important criteria for evaluating standardization data.
4. Describe the normal curve and explain its importance in interpreting test scores.
5. Describe the major types of standard scores.
6. Transform raw scores to standard scores.
7. Convert standard scores from one format to another.
8. Define normalized standard scores and describe the major types of normalized standard scores.
9. Define percentile rank and explain its interpretation.
10. Define grade equivalents and explain their limitations.
11. Describe some common applications of criterion-referenced score interpretations.
12. Explain how tests can be developed that produce both norm-referenced and criterion-
referenced interpretations.
13. Explain and give an example of a qualitative score description.


Test scores reflect the performance or ratings of the individuals completing a test. Because
test scores are the keys to interpreting and understanding the examinees’ performance, their
meaning and interpretation are extremely important topics and deserve careful attention. As
you will see, there is a wide assortment of scores available for our use and each format has
its own unique characteristics. The most basic type of score is a raw score.

A raw score is simply the number of items scored or coded in a specific manner such as correct/incorrect, true/false, and so on.

For example, the raw score on a classroom math test might be
the number of items the student answered correctly. The calculation
of raw scores is usually fairly straightforward, but raw scores are
often of limited use to those interpreting the test results; they tend to
offer very little useful information. Let’s say a student’s score on a
classroom math test is 50. Does a raw score of 50 represent poor, average, or superior per-
formance? The answer to this question depends on a number of factors such as how many
items are on the test, how difficult the items are, and the like. For example, if the test con-
tained only 50 items and if the student’s raw score were 50, the student demonstrated perfect
performance. If the test contained 100 items and if the student’s raw score were 50, he or
she answered only half of the items correctly. However, we still do not know what that really
means. If the test contained 100 extremely difficult items and if a raw score of 50 were the
highest score in the class, this would likely reflect very good performance. Because raw
scores in most situations have little interpretative meaning, we need to transform or convert
them into another format to facilitate their interpretation and give them meaning. These
transformed scores, typically referred to as derived scores, standard scores, or scaled scores,
are pivotal in helping us interpret test results. There are a number of different derived scores,
but they all can be classified as either norm-referenced or criterion-referenced. We will
begin our discussion of scores and their interpretation by introducing you to these two dif-
ferent approaches to deriving and interpreting test scores.

Norm-Referenced and Criterion-Referenced


Score Interpretations
To help us understand and interpret test results we need a frame of reference. That is, we need
to compare the examinee’s performance to “something.” Score interpretations can be classi-
fied as either norm-referenced or criterion-referenced, and this distinction refers to the
"something" to which we compare the examinee's performance.

With norm-referenced score interpretations, the examinee's performance is compared to the performance of other people.

With norm-referenced score interpretations, the examinee's performance is compared to the
performance of other people (a reference group). For example, scores on tests of intelligence
are norm-referenced. If you report that an examinee has an IQ of 100, this indicates he or she
scored higher than 50% of the people in the standardization sample. This is a norm-referenced
interpretation. The examinee's performance is being compared with that of other test takers.
Personality tests are also typically reported as norm-referenced scores. For example, it
might be reported that an examinee scored higher than 98% of the
standardization sample on some trait such as extroversion or sensation seeking. With all
norm-referenced interpretations, the examinee's performance is compared to that of others.

With criterion-referenced score interpretations, the examinee's performance is compared to a specified level of performance.

In contrast, with criterion-referenced score interpretations the examinee's performance is
compared to a specified level of performance (i.e., a criterion).
With criterion-referenced interpretations, the emphasis is on what
the examinees know or what they can do, not their standing relative
to other test takers. Possibly the most common example of a
criterion-referenced score is the percentage of correct responses on
a classroom examination. If you report that a student correctly answered 85% of the items
on a classroom test, this is a criterion-referenced interpretation. Notice that you are not
comparing the student’s performance to that of other examinees; you are comparing it to a
standard, in this case perfect performance on the test.

Norm-referenced interpretations are relative whereas criterion-referenced interpretations are absolute.

Norm-referenced interpretations are relative (i.e., relative to the performance of other people)
whereas criterion-referenced interpretations are absolute (i.e., compared to an absolute
standard). Norm-referenced score interpretations have many applications, and the
majority of published standardized tests produce norm-referenced
scores. Nevertheless, criterion-referenced tests also have important
applications, particularly in educational settings. Although people
frequently refer to norm-referenced and criterion-referenced tests, this is not technically
accurate. The terms norm-referenced and criterion-referenced actually refer to the interpre-
tation of test scores. Although it is most common for tests to produce either norm-referenced
or criterion-referenced scores, it is actually possible for a test to produce both norm- and
criterion-referenced scores. We will come back to this topic later. First, we will discuss
norm-referenced and criterion-referenced score interpretations and the types of derived
scores associated with each approach.

Norm-Referenced Interpretations
Norms and Reference Groups. To understand performance on a psychological or educa-
tional test, it is often useful to compare an examinee’s performance to the performance of some
preselected group of individuals. Raw scores on a test, such as the number correct, take on
special meaning when they are evaluated against the performance of a normative or reference
group. To accomplish this, when using a norm-referenced approach to interpreting test scores,
raw scores on the test are typically converted to derived scores based on information about the
performance of a specific normative or reference group. Probably the most important consid-
eration when making norm-referenced interpretations involves the relevance of the group of
individuals to whom the examinee’s performance is compared. The reference group from which
the norms are derived should be representative of the type of individuals expected to take the
test and should be defined prior to the standardization of the test. When you interpret a student’s
performance on a test or other assessment, you should ask yourself, “Are these norms appropri-
ate for this student?” For example, it would be reasonable to compare a student’s performance
on a test of academic achievement to other students of the same age, grade, and educational
background. However, it would probably not be particularly useful to compare a student’s
performance to younger students who had not been exposed to the same curriculum, or to older
students who have received additional instruction, training, or experience. For norm-referenced
interpretations to be meaningful, you need to compare the examinee's performance to that of
a relevant reference group. Therefore, the first step in developing good normative data
is to define clearly the population for whom the test is designed.
Once the appropriate reference population has been defined clearly, a random sample
is selected and tested. The normative group most often used to derive scores is called the
standardization sample.

Standardization samples should be representative of the types of individuals expected to take the tests.

Most test publishers and developers select a standardization sample using a procedure
known as population proportionate stratified random sampling. This
means that samples of people are selected in such a way as to ensure
that the national population as a whole is proportionately repre-
sented on important variables. In the United States, for example, tests are typically stan-
dardized using a sampling plan that stratifies the sample by gender, age, education,
ethnicity, socioeconomic background, region of residence, and community size based on
population statistics provided by the U.S. Census Bureau. If data from the Census Bureau
indicate that 1% of the U.S. population consists of African American males in the middle
range of socioeconomic status residing in urban centers of the southern region, then 1% of
the standardization sample of the test is drawn to meet this same set of characteristics.
Once the standardization sample has been selected and tested, tables of derived scores are
developed. These tables are based on the performance of the standardization sample and
are typically referred to as normative tables or “norms.” Because the relevance of the stan-
dardization sample is so important when using norm-referenced tests, it is the responsibil-
ity of test publishers to provide adequate information about the standardization sample.
Additionally, it is the responsibility of every test user to evaluate the adequacy of the
sample and the appropriateness of comparing the examinee’s score to this particular group.
In making this determination, you should consider the following factors:

■ Is the sample representative? Are the demographic characteristics of the sample (e.g., age, race, sex,
education, geographical location, etc.) similar to those who will take the test? In lay
terms, are you comparing apples to apples and oranges to oranges?
■ Is the sample current? Participants in samples from 20 years ago may have responded
quite differently from a contemporary sample. Attitudes, beliefs, behaviors, and even
cognitive abilities change over time, and to be relevant the normative data
need to be current (see Special Interest Topic 3.1 for information
on the "Flynn Effect" and how intelligence changes over time).

Normative data need to be current and the samples should be large enough to produce stable statistical information.

■ Is the sample large enough? Although there is no magic number, if a test covers a broad
age range it is common for standardization samples to exceed 1,000
participants. Otherwise, the number of participants at each age or
grade level may be too small to produce stable estimation of means, standard devia-
tions, and the more general distribution of scores. For example, the Wechsler Individual
Achievement Test—Second Edition (WIAT-II; The Psychological Corporation, 2002)

SPECIAL INTEREST TOPIC 3.1

The "Flynn Effect"

Research has shown that there were significant increases in IQ during the twentieth century. This
phenomenon has come to be referred to as the “Flynn Effect” after the primary researcher credited
with its discovery, James Flynn. In discussing his research, Flynn (1998) notes:

Massive IQ gains began in the 19th century, possibly as early as the industrial revolution, and have
affected 20 nations, all for whom data exist. No doubt, different nations enjoyed different rates of
gains, but the best data do not provide an estimate of the differences. Different kinds of IQ tests show
different rates of gains: Culture-reduced tests of fluid intelligence show gains of as much as 20 points
per generation (30 years); performance tests show 10—20 points; and verbal tests sometimes show 10
points or below. Tests closest to the content of school-taught subjects, such as arithmetic reasoning,
general information, and vocabulary, show modest or nil gains. More often than not, gains are simi-
lar at all IQ levels. Gains may be age specific, but this has not yet been established and they certainly
persist into adulthood. The fact that gains are fully present in young children means that causal fac-
tors are present in early childhood but not necessarily that they are more potent in young children
than older children or adults. (p. 61)

So what do you think is causing these gains in IQ? When we ask our students some initially suggest
that these increases in IQ reflect the effects of evolution or changes in the gene pool. However, this
is not really a plausible explanation because it is happening much too fast. Summarizing the current
thinking on this topic, Kamphaus (2001) notes that while there is not total agreement, most inves-
tigators believe it is the result of environmental factors such as better prenatal care and nutrition,
enhanced education, increased test wiseness, urbanization, and higher standards of living.
Consider the importance of this effect in relation to our discussion of the development of test
norms. When we told you that it is important to consider the date of the normative data when
evaluating its adequacy, we were concerned with factors such as the Flynn Effect. Due to the
gradual but consistent increase in IQ, normative data become more demanding as time passes. In
other words, an examinee must obtain a higher raw score (i.e., correctly answer more items) each
time a test is renormed in order for his or her score to remain the same. Kamphaus suggests that as
a rule of thumb, IQ norms increase in difficulty by about 3 points every 10 years (based on a mean
of 100 and a standard deviation of 15). For example, the same performance on IQ tests normed 10
years apart would result in IQs about 3 points apart, with the newer test producing the lower scores.
As a result, he recommends that if the normative data for a test are more than 10 years old one
should be concerned about the accuracy of the norms. This is a reasonable suggestion, and test
publishers are becoming better at providing timely revisions. For example, the Wechsler Intelli-
gence Scale for Children—Revised (WISC-R) was published in 1974, but the next revision, the
WISC-III, was not released until 1991, a 17-year interval. The most current revision, the WISC-IV,
was released in 2003, only 12 years after its predecessor.

has 3,600 participants in the standardization, with a minimum of 150 at each grade
level (i.e., pre-kindergarten through grade 12).

A final consideration regarding norm-referenced interpretations is the importance of


standardized administration. The normative sample should be administered the test under
the same conditions and with the same administrative procedures that will be used in actual
practice. Accordingly, when the test is administered in clinical or educational settings, it is
important that the test user follow the administrative procedures precisely. For example, if
you are administering standardized tests you need to make sure that you are reading the
directions verbatim and closely adhering to time limits. It obviously would not be reason-
able to compare your students’ performance on a timed mathematics test to the performance
of a standardization sample that was given either more or less time to complete the items.
(The need to follow standard administration and scoring procedures actually applies to all
standardized tests, both norm-referenced and criterion-referenced.)
Many types of derived scores or units of measurement may be reported in “norms
tables,” and the selection of which derived score to employ can influence the interpretation
of scores. Before starting our discussion of common norm-referenced derived scores, we
need to introduce the concept of a normal distribution.

The Normal Curve. The normal distribution is a special type of distribution that is very
useful when interpreting test scores. Figure 3.1 depicts a normal distribution. The normal
distribution, which is also referred to as the Gaussian or bell-shaped curve, is a distribution
that characterizes many variables that occur in nature (see Special Interest Topic 3.2 for
information on Carl Frederich Gauss, who is credited with discovering the bell curve). Gray
(1999) indicates that the height of individuals of a given age and gender is an example of a
variable that is distributed normally. He notes that numerous genetic and nutritional factors
influence an individual’s height, and in most cases these various factors average out so that
people of a given age and gender tend to be of approximately the same height. This accounts
for the peak frequency in the normal distribution. In referring to Figure 3.1 you will see that
a large number of scores tend to “pile up” around the middle of the distribution. However,

[Figure 3.1 is a frequency plot of test scores, from low to high, that piles up in the middle and trails off at both ends, illustrating the bell shape of the normal distribution.]

FIGURE 3.1 Illustration of the Normal Distribution



SPECIAL INTEREST TOPIC 3.2


Whence the Normal Curve?

Carl Frederich Gauss (1777-1855) was a noted German mathematician who is generally credited
with being one of the founders of modern mathematics. Born in Brunswick, he turned his scholarly
pursuits toward the field of astronomy around the turn of the nineteenth century. In the course of
tracking star movements and taking other forms of physical survey measurements (at times with
instruments of his own invention), Gauss found to his annoyance that students and colleagues who
were plotting the location of an object at the same time noted it to be in somewhat different places!
He began to plot the frequency of the observed locations systematically and found the observations
to take the shape of a curve. He determined that the best estimate of the true location of the object
was the mean of the observations and that each independent observation contained some degree of
error. These errors formed a curve that was in the shape of a bell. This curve or distribution of error
terms has since been demonstrated to occur with a variety of natural phenomena and indeed has
become so commonplace that it is most often known as the “normal curve” or the normal distribu-
tion. Of course, you may know it as the bell curve as well due to its shape, and mathematicians and
others in the sciences sometimes refer to it as the Gaussian curve after its discoverer and the man
who described many of its characteristics. Interestingly, Gauss was a very prolific scholar and the
Gaussian curve is not the only discovery to bear his name. He did groundbreaking research on
magnetism and the unit of magnetic intensity is called a gauss.

for a relatively small number of individuals a unique combination of factors results in them
being either much shorter or much taller than the average. This accounts for the distribution
trailing off at both the low and high ends.
Although the previous discussion addressed only observable characteristics of the
normal distribution, certain mathematical properties make it particularly useful when in-
terpreting scores.

The normal distribution is a symmetrical, unimodal distribution in which the mean, median, and mode are all equal.

For example, the normal distribution is a symmetrical, unimodal distribution in which the
mean, median, and mode are all equal, and if you divide the distribution into two halves,
they will mirror each other. Probably the most useful characteristic of the normal distribution
is that predictable proportions of scores occur at specific points in the distribution.
Referring to Figure 3.2 you find a normal distribution
with the mean and standard deviations (σ) marked. Figure 3.2 also indicates percentile
rank (PR), which will be discussed later in this chapter. Because we know that the mean
equals the median in a normal distribution, we know that an individual who scores at the
mean scored better than 50% of the sample of examinees (remember, earlier we defined
the median as the score that divides the distribution in half). Because approximately 34%
of the scores fall between the mean and 1 standard deviation above the mean, an individual
whose score falls 1 standard deviation above the mean performs at a level exceeding ap-
proximately 84% (i.e., 50% + 34%) of the population. A score 2 standard deviations above
the mean will be above 98% of the population. Because the distribution is symmetrical, the
relationship is the same in the inverse below the mean. A score 1 standard deviation below
[Figure 3.2 shows a normal distribution marked off in standard deviation units from −4σ to +4σ, with the corresponding percentile ranks of 0.1, 2, 16, 50, 84, 98, and 99.9 at −3σ through +3σ.]

FIGURE 3.2 Normal Distribution with Mean, Standard Deviations, and Percentages
Source: From L. H. Janda, Psychological Testing: Theory and Applications. Published by Allyn & Bacon,
Boston, MA. Copyright © 1998 by Pearson Education. Reprinted by permission of the publisher.

the mean indicates that the individual exceeds only about 16% (i.e., 50% — 34%) of the
population on the attribute in question. Approximately two-thirds (i.e., 68%) of the popula-
tion will score within 1 standard deviation above and below the mean on a normally dis-
tributed variable.
We have reproduced in Appendix F a table that allows you to determine what propor-
tion of scores are below any given point in a distribution by specifying standard deviation
units. For example, you can use these tables to determine that a score 1.96 SD above the
mean exceeds 97.5% of the scores in the distribution whereas a score 1.96 SD below the
mean exceeds only 2.5% of the scores. Although we do not feel it is necessary for you to
become an expert in using these statistical tables, we do encourage you to examine Figure
3.2 carefully to ensure you have a good grasp of the basic properties of the normal distribu-
tion before proceeding.
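If you have access to statistical software, the same proportions can be obtained without a printed table. Here is a minimal Python sketch (using the scipy library; the z values are just examples):

from scipy.stats import norm

# Proportion of scores falling below selected points in a normal distribution
for z in (-1.96, -1.0, 0.0, 1.0, 1.96):
    print(f"Proportion of scores below z = {z:+.2f}: {norm.cdf(z):.3f}")
# Prints roughly 0.025, 0.159, 0.500, 0.841, and 0.975, matching the values discussed above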
Although many variables of importance in educational settings such as achievement and
intelligence are very close to conforming to the normal distribution, not all educational,
psychological, or behavioral variables are normally distributed.

Although many variables closely approximate the normal distribution, not all educational or psychological variables are distributed normally.

For example, aggressive behavior and psychotic behavior are two variables of interest to
psychologists and educators that are distinctly different from the normal
curve in their distributions. Most children are not aggressive toward
their peers, so on measures of aggression, most children pile up at the left side of the distri-
bution whereas children who are only slightly aggressive may score relatively far to the
right. Likewise, few people ever experience psychotic symptoms such as hearing voices of
people who are not there or seeing things no one else can see. Such variables will each have
their own unique distribution, and even though one can, via statistical manipulation, force
these score distributions into the shape of a normal curve, it is not always desirable to do so.
We will return to this issue later, but at this point it is important to refute the common myth
that all human behaviors or attributes conform to the normal curve; clearly they do not!

Derived Scores Used with Norm-Referenced Interpretations


Standard Scores. As we have noted, raw scores such as the number of items correct are dif-
ficult to work with and interpret. Raw scores therefore are typically transformed to another unit
of measurement or derived score. With norm-referenced score inter-
pretations, standard scores (sometimes called scaled scores) are often the preferred type of
derived score.

Standard scores are the transformation of raw scores to a desired scale with a predetermined mean and standard deviation.

Transforming raw scores into standard scores involves creating a set of scores with a
predetermined mean and standard deviation that remains constant across some preselected
variable such as age. Although we are going to describe a number of
different standard score formats, they all share numerous common
characteristics. All standard scores use standard deviation units to indicate where an examin-
ee’s score is located relative to the mean of the distribution. Standard scores are typically linear
transformations of raw scores to a desired scale with a predetermined mean and standard de-
viation. In a linear transformation, the following generic equation is applied to each score:

Standard Score = Xss + SDss × (Xi − X) / SD

where Xi = raw score of any individual taking the test
X = mean of the raw scores
SD = standard deviation of the raw scores
SDss = desired standard deviation of the derived standard scores
Xss = desired mean of the derived or standard scores

Standard scores calculated using linear transformations retain a direct relationship with raw scores and the distribution retains its original shape.

Standard scores calculated using linear transformations retain a direct relationship with
the raw scores, and the distribution retains its original shape (the importance
of this statement will become more evident when we discuss normalized
standard scores). Table 3.1 provides an example of how this formula
is applied to raw scores to transform them into standard scores.
As we noted, there are different standard score formats that have common character-
istics. They differ in means and standard deviations. Here are brief descriptions of some of
the more common standard score formats. This is not an exhaustive list, and it is possible to
create a new format with virtually any mean and standard deviation you desire. However,
test authors and publishers typically use these common standard score formats because
educators and psychologists are most familiar with them.

TABLE 3.1 Transforming Raw Scores to Standard Scores

In this chapter we provided the following formula for transforming raw scores to z-scores.

z-score = (Xi − X) / SD

where Xi = raw score of any individual
X = mean of the raw scores
SD = standard deviation of the raw scores

Consider the situation in which the mean of the raw scores (X) is 75, the standard deviation of
raw scores (SD) is 10, and the individual's raw score is 90.

z-score = (90 − 75) / 10
        = 15 / 10
        = 1.5

If you wanted to convert the individual's score to a T-score, you would use the generic
formula:

Standard Score = Xss + SDss × (Xi − X) / SD

where Xi = raw score of any individual taking the test
X = mean of the raw scores
SD = standard deviation of the raw scores
SDss = desired standard deviation of the derived standard scores
Xss = desired mean of the derived or standard scores

In this case the calculations are:

T-score = 50 + 10 × (90 − 75) / 10
        = 50 + 10 × 1.5
        = 50 + 15
        = 65
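These transformations are also easy to carry out in code. The following minimal Python sketch (the function name is ours) reproduces the worked example above and expresses the same raw score in several standard score formats:

def to_standard_score(raw, raw_mean, raw_sd, new_mean, new_sd):
    """Linear transformation of a raw score to a scale with a chosen mean and SD."""
    z = (raw - raw_mean) / raw_sd          # distance from the mean in standard deviation units
    return new_mean + new_sd * z

raw_score, raw_mean, raw_sd = 90, 75, 10   # values from the example above
print(to_standard_score(raw_score, raw_mean, raw_sd, 0, 1))      # z-score: 1.5
print(to_standard_score(raw_score, raw_mean, raw_sd, 50, 10))    # T-score: 65.0
print(to_standard_score(raw_score, raw_mean, raw_sd, 100, 15))   # IQ-style score: 122.5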

Z-scores are the simplest of the standard scores and indicate how far above or below the mean of the distribution the raw score is in standard deviation units.

■ z-scores. z-scores are simple to calculate and a simplified equation can be used:

z-score = (Xi − X) / SD

where Xi = raw score of any individual
X = mean of the raw scores
SD = standard deviation of the raw scores

z-scores have a mean of 0 and a standard deviation of 1. As a result all scores above the mean
will be positive and all scores below the mean will be negative. For example, a z-score of
1.6 is 1.6 standard deviations above the mean (i.e., exceeding 95% of the scores in the dis-
tribution) and a score of —1.6 is 1.6 standard deviations below the mean (i.e., exceeding only
5% of the scores in the distribution). As you see, in addition to negative scores, z-scores
involve decimals. This results in scores that many find difficult to use and interpret. As a
result, few test publishers routinely report z-scores for their tests. However, researchers
commonly use z-scores because scores with a mean of 0 and a standard deviation of 1 make
statistical formulas easier to calculate.
■ T-scores. T-scores have a mean of 50 and a standard deviation of 10. Relative to z-scores, they avoid decimals and negative values. For
example, a score of 66 is 1.6 standard deviations above the mean (i.e., exceeding 95% of the
scores in the distribution) and a score of 34 is 1.6 standard deviations below the mean (i.e.,
exceeding only 5% of the scores in the distribution).
■ Wechsler IQs (and many others). The Wechsler intelligence scales use a standard
score format with a mean of 100 and a standard deviation of 15. Like T-scores, the Wechsler
IQ format avoids decimals and negative values. For example, a score of 124 is 1.6 standard
deviations above the mean (i.e., exceeding 95% of the scores in the distribution) and a score
of 76 is 1.6 standard deviations below the mean (i.e., exceeding only 5% of the scores in the
distribution). This format has become very popular, and most aptitude and individually
administered achievement tests report standard scores with mean of 100 and standard de-
viation of 15.
■ Stanford-Binet IQs. The Stanford-Binet intelligence scales until recently used a
standard score format with a mean of 100 and a standard deviation of 16. This is similar to
the format adopted by the Wechsler scales, but instead of a standard deviation of 15 there is
a standard deviation of 16 (see Special Interest Topic 3.3 for an explanation). This may ap-
pear to be a negligible difference, but it was enough to preclude direct comparisons between
the scales. With the Stanford-Binet scales, a score of 126 is 1.6 standard deviations above
the mean (i.e., exceeding 95% of the scores in the distribution) and a score of 74 is 1.6
standard deviations below the mean (i.e., exceeding only 5% of the scores in the distribu-
tion). The most recent edition of the Stanford-Binet (the fifth edition) adopted a mean of
100 and a standard deviation of 15 to be consistent with the Wechsler and other popular
standardized tests.

■ CEEB Scores (SAT/GRE). This format was developed by the College Entrance Ex-
amination Board and used with tests including the Scholastic Assessment Test (SAT) and
the Graduate Record Examination (GRE). CEEB scores have a mean of 500 and a standard
deviation of 100. With this format, a score of 660 is 1.6 standard deviations above the mean
(i.e., exceeding 95% of the scores in the distribution) and a score of 340 is 1.6 standard
deviations below the mean (i.e., exceeding only 5% of the scores in the distribution).

As we noted, standard scores can be set to any desired mean and standard deviation, with
the fancy of the test author frequently being the sole determining factor. Fortunately, the few
standard score formats we just summarized will account for the majority of standardized
tests in education and psychology. Figure 3.3 and Table 3.2 illustrate the relationship
SPECIAL INTEREST TOPIC 3.3


Why Do IQ Tests Use a Mean of 100
and a Standard Deviation of 15?

When Alfred Binet and Theodore Simon developed the first popular IQ test in the late 1800s, items
were scored according to the age at which half the children got the answer correct. This resulted in
the concept of a “mental age” for each examinee. This concept of a mental age (MA) gradually
progressed to the development of the IQ, which at first was calculated as the ratio of the child’s MA
to actual or chronological age multiplied by 100 to remove all decimals. The original form for this
score, known as the Ratio IQ, was:

MA/CA x 100
where MA = mental age
CA = chronological age

This score distribution has a mean fixed at 100 at every age. However, due to the different restric-
tions on the range of mental age possible at each chronological age (e.g., a 2-year-old can range in
MA only 2 years below CA but a 10-year-old can range 10 years below the CA), the standard de-
viation of the distribution of the Ratio IQ changes at every CA! At younger ages it tends to be small
and it is typically larger at upper ages. The differences are quite large, often with the standard de-
viation from large samples varying from 10 to 30! Thus, at one age a Ratio IQ of 110 is 1 standard
deviation above the mean, whereas at another age the same Ratio IQ of 110 is only 0.33 standard
deviation above the mean. Across age, the average standard deviation of the now archaic Ratio IQ
is about 16. This value was then adopted as the standard deviation for the Stanford-Binet IQ tests
and continued until David Wechsler scaled his first IQ measure in the 1930s to have a standard
deviation of 15, which he felt would be easier to work with. Additionally, he selected a standard
deviation of 15 to help distinguish his test from the then dominant Stanford-Binet test. The Stan-
ford-Binet tests have long abandoned the Ratio IQ in favor of a true standard score, but remained
tethered to the standard deviation of 16 until Stanford-Binet’s fifth edition was published in 2003.
With the fifth edition Stanford-Binet's new primary author, Gale Roid, converted to the far more
popular scale with a mean of 100 and a standard deviation of 15.

between various standard score formats. If reference groups are comparable, Table 3.2 can
also be used to help you equate scores across tests to aid in the comparison of a student’s
performance on tests of different attributes using different standard scores. Table 3.3 illus-
trates a simple formula that allows you to convert standard scores from one format to an-
other (e.g., z-scores to T-scores).
It is important to recognize that not all authors, educators, or clinicians are specific
when it comes to reporting or describing scores. That is, they may report "standard scores,"
but not specify exactly what standard score format they are using. Obviously the format is
extremely important. Consider a standard score of 70. If this is a T-score it represents a
score 2 standard deviations above the mean (exceeding approximately 98% of the scores
in the distribution). If it is a Wechsler IQ (or comparable score) it is 2 standard deviations
deviations

[Figure 3.3 shows the normal curve, with 34.13%, 13.59%, and 0.13% of cases in successive standard deviation bands, aligned with the corresponding z-scores, T-scores, SAT/CEEB scores (200 to 800), Deviation IQs (SD = 15; 55 to 145), and percentile ranks.]

FIGURE 3.3 Normal Distribution Illustrating the Relationship among Standard Scores
Source: From L. H. Janda, Psychological Testing: Theory and Applications. Published by Allyn & Bacon. Copy-
right © 1998 by Pearson Education. Reprinted by permission of the publisher.

below the mean (exceeding only approximately 2% of the scores in the distribution). In
other words, be sure to know what standard score format is being used so you will be able
to interpret the scores accurately.
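To see how much the score format matters, consider this minimal Python sketch (using the scipy library), which interprets the same reported score of 70 first as a T-score and then as a Wechsler-style IQ:

from scipy.stats import norm

score = 70
as_t_score = norm.cdf((score - 50) / 10)      # T-score scale: mean 50, SD 10
as_iq_score = norm.cdf((score - 100) / 15)    # Wechsler IQ scale: mean 100, SD 15
print(f"Interpreted as a T-score:     exceeds about {as_t_score:.0%} of scores")   # about 98%
print(f"Interpreted as a Wechsler IQ: exceeds about {as_iq_score:.0%} of scores")  # about 2%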

Normalized Standard Scores. Discussion about standard scores thus far applies primarily
to scores from distributions that are normal (or that at least approximate normality) and were
computed using a linear transformation. As noted earlier, although it is commonly held that
psychological and educational variables are normally distributed, this is not always the case.
Many variables such as intelligence, memory skills, and academic achievement will closely
approximate the normal distribution when well measured. However, many variables of inter-
est in psychology and education, especially behavioral ones (e.g., aggression, attention, and
hyperactivity), may deviate substantially from the normal distribution. As a result it is not
unusual for test developers to end up with distributions that deviate from normality enough
to cause concern. In these situations test developers may elect to develop normalized standard

TABLE 3.2 Relationship of Different Standard Score Formats

z-scores        T-scores        IQ              CEEB Scores
(X = 0,         (X = 50,        (X = 100,       (X = 500,       Percentile
SD = 1)         SD = 10)        SD = 15)        SD = 100)       Rank

 2.6              76              139              760            >99
 2.4              74              136              740             99
 2.2              72              133              720             99
 2.0              70              130              700             98
 1.8              68              127              680             96
 1.6              66              124              660             95
 1.4              64              121              640             92
 1.2              62              118              620             88
 1.0              60              115              600             84
 0.8              58              112              580             79
 0.6              56              109              560             73
 0.4              54              106              540             66
 0.2              52              103              520             58
 0.0              50              100              500             50
-0.2              48               97              480             42
-0.4              46               94              460             34
-0.6              44               91              440             27
-0.8              42               88              420             21
-1.0              40               85              400             16
-1.2              38               82              380             12
-1.4              36               79              360              8
-1.6              34               76              340              5
-1.8              32               73              320              4
-2.0              30               70              300              2
-2.2              28               67              280              1
-2.4              26               64              260              1
-2.6              24               61              240              1

Note: X = mean, SD = standard deviation.


Source: Adapted from Reynolds (1998b).

TABLE 3.3. Converting Standard Scores from One Format to Another

You can easily convert standard scores from one format to another using the following formula:

New Standard Score = X̄_new + SD_new × ((X − X̄_orig) / SD_orig)

where X = original standard score
      X̄_orig = mean of the original standard score format
      SD_orig = standard deviation of the original standard score format
      X̄_new = mean of the new standard score format
      SD_new = standard deviation of the new standard score format

For example, consider the situation in which you want to convert a z-score of 1.0 to a T-score. The calculations are:

T-score = 50 + 10 × ((1.0 − 0) / 1)
        = 50 + 10 × (1/1)
        = 50 + 10 × 1
        = 50 + 10
        = 60

If you want to convert a T-score of 60 to a CEEB score, the calculations are:

CEEB score = 500 + 100 × ((60 − 50) / 10)
           = 500 + 100 × (10/10)
           = 500 + 100 × 1
           = 500 + 100
           = 600
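The conversion formula in Table 3.3 is straightforward to express in code. The Python sketch below is an illustration added here rather than part of the original table; the function name convert_standard_score is hypothetical, and the means and standard deviations are those listed in Table 3.2.

def convert_standard_score(x, mean_orig, sd_orig, mean_new, sd_new):
    # Express the score as a distance from the original mean in SD units,
    # then re-express that distance on the new scale.
    z = (x - mean_orig) / sd_orig
    return mean_new + sd_new * z

print(convert_standard_score(1.0, 0, 1, 50, 10))     # z-score of 1.0 -> T-score of 60.0
print(convert_standard_score(60, 50, 10, 500, 100))  # T-score of 60 -> CEEB score of 600.0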

Normalized standard scores are standard scores based on underlying distributions that were not originally normal, but were transformed into normal distributions. The transformations applied in these situations are often nonlinear transformations. Whereas standard scores calculated with linear transformations retain a direct relationship with the original raw scores and the distribution retains its original shape, this is not necessarily so with normalized standard scores based on nonlinear transformations. This does not mean that normalized standard scores are undesirable. In situations in which the obtained distribution is not normal because the variable is not normally distributed, normalization is not generally useful and indeed may be misleading. However, in situations in which the obtained distribution is not normal because of sampling error or choice of subjects, normalization can enhance

the usefulness and interpretability of the scores. Nevertheless, it is desirable to know what
type of scores you are working with and how they were calculated.
In most situations, normalized standard scores are interpreted in a manner similar to
other standard scores. In fact, they often look strikingly similar to standard scores. For ex-
ample, they may be reported as normalized z-scores or normalized T-scores and often re-
ported without the prefix normalized at all. In this context, they will have the same mean
and standard deviation as their counterparts derived with linear transformations. However,
several types of scores that have traditionally been based on nonlinear transformations are
normalized standard scores. These include:

■ Stanine scores. Stanine (i.e., standard nine) scores divide the distribution into nine
bands (1 through 9). Stanine scores have a mean of 5 and a standard deviation of 2.
Because stanine scores use only nine values to represent the full range of scores, they
are not a particularly precise score format. As a result, some professionals avoid their
use. However, certain professionals prefer them because of their imprecision. These
professionals, concerned with the imprecision inherent in all psychological and edu-
cational measurement, choose stanine scores because they do not misrepresent the
precision of measurement (e.g., Popham, 2000). Special Interest Topic 3.4 briefly
describes the history of stanine scores.
■ Wechsler scaled scores. The subtests of the Wechsler Intelligence Scale for Children—
Fourth Edition (WISC-IV; Wechsler, 2003) and predecessors are reported as normal-
ized standard scores referred to as scaled scores. The Wechsler scaled scores have a
mean of 10 and a standard deviation of 3. This transformation was performed so the
subtest scores would be comparable, even though their underlying distributions may
have deviated from the normal curve and each other.
■ Normal Curve Equivalent (NCE). The normal curve equivalent (NCE) is a normal-
ized standard score with a mean of 50 and a standard deviation of 21.06. NCEs are
not usually used for evaluating individuals, but are primarily used to assess the prog-


SPECIAL INTEREST TOPIC 3.4


The History of Stanine Scores

Stanines have a mean of 5 and a standard deviation of 2. Stanines have a range of 1 to 9 and are a
form of standard score. Because they are standardized and have nine possible values, the contrived,
contracted name of stanines was given to these scores (standard nine). A stanine is a conversion of
the percentile rank that represents a wide range of percentile ranks at each score point. The U.S. Air
Force developed this system during World War II because a simple score system was needed that
could represent scores as a single digit. On older computers, which used cards with holes punched
in them for entering data, the use of stanine scores not only saved time by having only one digit to
punch but also increased the speed of the computations made by computers and conserved com-
puter memory. Stanines are now used only occasionally and usually only in statistical reporting of
aggregated scores (from Reynolds, 2002).

ress of groups (e.g., The Psychological Corporation, 2002). Because school districts
must report NCE scores to meet criteria as part of certain federal education programs,
many test publishers report these scores for tests used in education.
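In practice, publishers typically derive these normalized scores from percentile ranks using normalization tables. The Python sketch below is an illustration added here of the final rescaling step only; it assumes the z-score has already been normalized, and the sample value of 1.3 is hypothetical.

def rescale(z, mean, sd):
    # Re-express a normalized z-score on a scale with the given mean and SD.
    return mean + sd * z

z = 1.3                                              # an already-normalized z-score
stanine = min(9, max(1, round(rescale(z, 5, 2))))    # stanines are whole numbers from 1 to 9
scaled = round(rescale(z, 10, 3))                    # Wechsler subtest scaled score
nce = rescale(z, 50, 21.06)                          # Normal Curve Equivalent
print(stanine, scaled, round(nce, 1))                # 8 14 77.4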

Percentile Rank. One of the most popular and easily understood ways to interpret and report a test score is the percentile rank. Like all norm-referenced scores, the percentile rank simply reflects an examinee's performance relative to a specific group. Although there are some subtle differences in the ways percentile ranks are calculated and interpreted, the typical way of interpreting them is as reflecting the percentage of individuals scoring below a given point in a distribution. For example, a percentile rank of 80 indicates that 80% of the individuals in the standardization sample scored below this score. A per-
centile rank of 20 indicates that only 20% of the individuals in the standardization sample
scored below this score. Percentile ranks range from 1 to 99, and a rank of 50 indicates the
median performance (in a perfectly normal distribution it is also the mean score). As you can
see, percentile ranks can be easily explained to and understood by individuals without formal
training in psychometrics. Whereas standard scores might seem somewhat confusing, a per-
centile rank might be more understandable. For example, a parent might believe an IQ of 75
is in the average range, generalizing from experiences with classroom tests whereby 70 to 80
is often interpreted as representing average or perhaps “C-level” performance. However, ex-
plaining that the child’s score exceeded only approximately 5% of the standardization sample
or scores of other children at the same age level might clarify the issue. One common misun-
derstanding may arise when using percentile ranks: It is important to ensure that results in
terms of percentile rank are not misinterpreted as “percent correct” (Kamphaus, 1993). That
is, a percentile rank of 60 means that the examinee scored better than 60% of the standardiza-
tion sample, not that the examinee correctly answered 60% of the items.
Although percentile ranks can be easily interpreted, they do not represent interval
level measurement. That is, percentile ranks are not equal across all parts of a distribution.
Percentile ranks are compressed near the middle of the distribution, where there are large
numbers of scores, and spread out near the tails, where there are relatively few scores (you
can see this in Figure 3.3 by examining the line that depicts percentiles). This implies that
small differences in percentile ranks near the middle of the distribution might be of little
importance, whereas the same difference at the extremes might be substantial. However,
because the pattern of inequality is predictable, this can be taken into consideration when
interpreting scores and it is not particularly problematic.
There are two formats based on percentile ranks that you might come across in edu-
cational settings. Some publishers report quartile scores that divide the distribution of per-
centile ranks into four equal units. The lower 25% receives a quartile score of 1, 26% to 50%
a quartile score of 2, 51% to 75% a quartile score of 3, and the upper 25% a quartile score
of 4. Similarly, some publishers report decile-based scores, which divide the distribution of
percentile ranks into ten equal parts. The lowest decile-based score is 1 and corresponds to
scores with percentile ranks between 0% and 10%. The highest decile-based score is 10 and
corresponds to scores with percentile ranks between 90% and 100% (e.g., The Psychologi-
cal Corporation, 2002).
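The quartile and decile conversions can be sketched in a few lines of Python. This is an illustration added here; the exact handling of boundary percentile ranks varies somewhat across publishers, so the cut points below simply follow the ranges described in the text.

def quartile_score(pr):
    # Percentile ranks 1-25 -> 1, 26-50 -> 2, 51-75 -> 3, 76-99 -> 4.
    return min(4, (pr - 1) // 25 + 1)

def decile_score(pr):
    # Percentile ranks are divided into ten bands of roughly ten points each.
    return min(10, (pr - 1) // 10 + 1)

print(quartile_score(62), decile_score(62))   # 3 7
print(quartile_score(99), decile_score(99))   # 4 10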

Grade Equivalents. Grade equivalents are norm-referenced scores that identify the academic grade level achieved by the examinee. Although grade equivalents are very popular in school settings and appear to be easy to interpret, they actually need to be interpreted with considerable caution. To understand grade equivalents, it is helpful to be familiar with how they are calculated. When a test is administered to a group of children, the mean raw score is calculated at each grade level, and this mean raw score is called the grade equivalent for raw scores of that magnitude. For example, if the mean raw score for beginning 3rd graders on a reading test is 50, then any examinee earning a score of 50 on the test is assigned a grade equivalent of
3.0 regardless of age. If the mean score for 4th graders is 60, then any examinee earning a
score of 60 is assigned a grade equivalent of 4.0. It becomes a little more complicated when
raw scores fall between two median grade scores. In these situations intermediate grade equiv-
alents are typically calculated using a procedure referred to as interpolation. To illustrate this
procedure with a straightforward example, consider a score of 55 on our imaginary reading
test. Here, the difference between a grade equivalent of 3.0 (i.e., raw score of 50) and a grade
equivalent of 4.0 (i.e., raw score of 60) is divided into ten equal units to correspond to ten
months of academic instruction. In this example, because the difference is 10 (60 − 50 = 10),
each raw score unit corresponds to one-tenth (i.e., one month), and a raw score of 55 would
be assigned a grade equivalent of 3.5. In actual practice, interpolation is not always this
straightforward. For example, if the difference between a grade equivalent of 3.0 and 4.0 had
been 6 points (instead of 10), the calculations would have been somewhat more compli-
cated.
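The interpolation procedure can be written out explicitly. The Python sketch below is an illustration added here using the values from the imaginary reading test; actual norm tables typically round the result to tenths of a grade, and the ten-month assumption discussed below still applies.

def grade_equivalent(raw, raw_low, ge_low, raw_high, ge_high):
    # Linearly interpolate between the mean raw scores of two adjacent grades.
    fraction = (raw - raw_low) / (raw_high - raw_low)
    return ge_low + fraction * (ge_high - ge_low)

# Mean raw score of 50 at grade 3.0 and 60 at grade 4.0, as in the example above.
print(grade_equivalent(55, 50, 3.0, 60, 4.0))            # 3.5
# With anchors only 6 raw-score points apart the result no longer falls on a neat tenth.
print(round(grade_equivalent(54, 50, 3.0, 56, 4.0), 2))  # 3.67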
Much has been written about the limitations of grade equivalents, and the following
list highlights some major concerns summarized from several sources (Anastasi & Urbina,
1997; The Psychological Corporation, 2002; Popham, 2000; Reynolds, 1998b).

■ The use of interpolation to calculate intermediate grade equivalents assumes that aca-
demic skills are achieved at a constant rate and that there is no gain or loss during the sum-
mer vacation. This tenuous assumption is probably not accurate in many situations.
■ Grade equivalents are not comparable across tests or even subtests of the same battery
of tests. For example, grade equivalents of 6.0 on a test of reading comprehension and a test
of math calculation do not indicate that the examinee has the same level of proficiency in
the two academic areas. Additionally, there can be substantial differences between the ex-
aminee’s percentile ranks on the two tests.

■ Grade equivalents reflect an ordinal level scale of measurement, not an interval scale.
As discussed in the previous chapter, ordinal level scales do not have equal scale units across
the scale. For example, the difference between grade equivalents of 3.0 and 4.0 is not neces-
sarily the same as the difference between grade equivalents of 5.0 and 6.0. Statistically, one
should not add, subtract, multiply, or divide such scores because their underlying metrics
are different. It is like multiplying feet by meters—you can multiply 3 feet by 3 meters and
get 9, but what does it mean?

■ There is not a predictable relationship between grade equivalents and percentile ranks.
For example, examinees may have a higher grade equivalent on a test of reading comprehen-
sion than of math calculations, but their percentile rank and thus their skill relative to age
peers on the math test may actually be higher.
■ A common misperception is that grade equivalents indicate the grade level at which a student should be receiving instruction. Parents may ask, "Johnny is only in the 4th grade but has
a grade equivalent of 6.5 in math. Doesn’t that mean he is ready for 6th-grade math instruc-
tion?” The answer is clearly “No!” Although Johnny correctly answered the same number
of items as an average 6th grader, this does not indicate that he has mastered the necessary
prerequisites to succeed at the 6th-grade level.
■ Unfortunately, grade equivalents tend to become standards of performance. For ex-
ample, lawmakers might decide that all students entering the 6th grade should achieve grade
equivalents of 6.0 or better on a standardized reading test. If you will recall how grade
equivalents are calculated, you will see how ridiculous this is. Because the mean raw score
at each grade level is designated the grade equivalent, 50% of the standardization sample
scored below the grade equivalent. As a result, it would be expected that a large number of
students with average reading skills would enter the 6th grade with grade equivalents below
6.0. It is a law of mathematics that not everyone can score above the average!

As the result of these and other limitations, we recommend that you avoid using grade equivalents. Age equivalents are another derived score format that indicates the age, typically in years and months, at which a raw score is the mean or median. Age equivalents have the same limitations as grade equivalents and we again recommend that you avoid using them. Many test publishers report grade and age equivalents and occasionally you will find a testing expert that favors them (at least at the lower grade levels). Nevertheless, they are subject to misinterpretation and should be avoided when possible. If you are required to use them, we recommend that you also report standard scores and percentile ranks and emphasize these more precise derived scores when explaining test results.

Criterion-Referenced Interpretations

As noted previously, with criterion-referenced interpretations the examinee's performance is not compared to that of other people, but to a specified level of performance (i.e., a criterion). Criterion-referenced interpretations emphasize what the examinees know or what they can do, not their standing relative to other test takers. Although
some authors appear to view criterion-referenced score interpretations as a relatively new
approach dating back to only the 1960s or 1970s, criterion-referenced interpretations actu-
ally predate norm-referenced interpretations. For example, educators were evaluating their
students’ performance in terms of “percentage correct” or letter grades to reflect mastery
(i.e., A, B, C, D, and F) long before test developers started developing norm-referenced
scores. Nevertheless, since the 1960s there has been renewed interest in and refinement of

criterion-referenced score interpretations. A number of different labels have been applied to


this type of score interpretation in the last 40 years, including content-referenced, domain-
referenced, and objective-referenced (e.g., Anastasi & Urbina, 1997). In this text we will be
using the term criterion-referenced because it is probably the broadest and most common
label.
Probably the most common example of a criterion-referenced score is percent correct.
For example, when a teacher reports that a student correctly answered 85% of the problems
on a classroom test assessing the student’s ability to multiply double digits, this is a criteri-
on-referenced interpretation. Although there are a variety of criterion-referenced scoring
systems, they all involve an absolute evaluation of examinees’ performances as opposed to
a relative evaluation. That is, instead of comparing their performances to the performances
of others (a relative interpretation), a criterion-referenced interpretation attempts to describe
what they know or are capable of doing—the absolute level of performance.
In addition to percent correct, another type of criterion-referenced interpretation is referred to as mastery testing. Mastery testing involves determining whether the examinee has achieved a specific level of mastery of the knowledge and skills domain and is usually reported in an all-or-none score such as a pass/fail designation (AERA et al., 1999). Most of us have had experience with mastery testing in obtaining a driver's license. The written exam required to obtain a driver's license is designed to determine whether the applicant has acquired the basic knowledge necessary to operate a motor vehicle successfully and safely (e.g., state motoring laws and standard driving practices). Performance is reported as either pass or fail based on a cut score. For example, if

the cut score requires correctly answering 85% of the items, all examinees with scores of
84% or below fail and all with 85% and above pass. There is no practical distinction in such
a decision between an examinee answering 85% of the items correctly and one who an-
swered 100% correctly. They both pass! For many educators, mastery testing is viewed as
the preferred way of assessing mastery or proficiency of basic educational skills. For ex-
ample, a teacher can develop a test to assess students’ mastery of multiplication of fractions
or addition with decimals. Likewise, a teacher can develop a test to assess students’ mastery
of spelling words on a 3rd-grade reading list. In both of these situations, the teacher may set
the cut score for designating mastery at 85%, and all students achieving a score of 85% or
higher will be considered to have mastered the relevant knowledge or skills domain. Special
Interest Topic 3.5 provides a brief introduction to the processes many states use to establish
performance standards on their statewide assessments.
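The all-or-none character of mastery decisions is easy to see in code. The short Python sketch below is an illustration added here using the 85% cut score from the example; the function name mastery_decision is hypothetical.

def mastery_decision(percent_correct, cut_score=85):
    # Scores at or above the cut score pass; everything below fails.
    return "pass" if percent_correct >= cut_score else "fail"

print(mastery_decision(84))   # fail
print(mastery_decision(85))   # pass
print(mastery_decision(100))  # pass -- no practical distinction between 85% and 100%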
Another common criterion-referenced interpretative approach is referred to as “stan-
dards-based interpretations.” Whereas mastery testing typically results in an all-or-none
interpretation (i.e., the student either passes or fails), standards-based interpretations usually
involve three to five performance categories. For example, the results of an achievement test
might be reported as not proficient, partially proficient, proficient, or advanced performance
(e.g., Linn & Gronlund, 2000). An old variant of this approach is the assignment of letter
grades to reflect performance on classroom achievement tests. For example, many teachers
assign letter grades based on the percentage of items correct on a test, which is another type

SPECIAL INTEREST TOPIC 3.5


Establishing Performance Standards

In developing their statewide assessment programs, most states establish performance standards
that specify acceptable performance. Crocker and Algina (1986) outlined three major approaches
to setting performance standards: (a) holistic, (b) content-based, and (c) performance-based. All of
these methods typically invoke the judgment of experts in the content area of interest. The selection
of experts is a sampling problem—what is the sampling adequacy of the experts selected in relation
to the population of such experts?
In holistic standard setting a panel of experts is convened to examine a test and estimate the
percentage of items that should be answered correctly by a person with minimally proficient knowl-
edge of the content domain of interest. After each judge provides a passing standard estimate the
results are averaged to obtain the final cut score. State assessments, however, use content-based and
performance-based strategies.
Content-based standard setting evaluates tests at the item level. The most popular content-
based approaches are the Angoff and modified Angoff procedures, which involve assembling a
panel of about 15 to 20 judges who review the test items. They work independently and decide how
many of 100 "minimally acceptable," "borderline," or "barely proficient" students would answer
each item correctly. The average for all judges is computed, which then becomes the estimated cut
score for the number correct on the test.
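The averaging step in the Angoff procedure amounts to simple arithmetic. The Python sketch below is an illustration added here with made-up ratings from three judges on a four-item test; operational standard-setting studies use many more judges and items.

# Each judge estimates, for every item, how many of 100 minimally proficient
# examinees would answer it correctly.
judge_ratings = [
    [70, 60, 80, 50],   # judge 1
    [65, 55, 85, 45],   # judge 2
    [75, 65, 75, 55],   # judge 3
]

# Convert each judge's ratings to an expected number-correct score,
# then average across judges to obtain the estimated cut score.
judge_cuts = [sum(ratings) / 100 for ratings in judge_ratings]
angoff_cut = sum(judge_cuts) / len(judge_cuts)
print(judge_cuts, round(angoff_cut, 2))   # [2.6, 2.5, 2.7] 2.6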
The original Angoff method was criticized as being too cognitively complex for judges
(Shepard, Glaser, Linn, & Bohrnstedt, 1993) and was modified by using a "yes—no" procedure, for
which judges are simply asked to indicate whether a borderline proficient student would be able to
answer the item correctly. The number of items recommended by each judge is then averaged
across judges to provide a cut score indicating minimal proficiency. Impara and Plake (1997) con-
cluded that both traditional and modified methods resulted in nearly identical cut scores, and rec-
ommended the yes—no approach based on its simplicity.
The Angoff and modified Angoff procedures offer a more comprehensive approach to standard
setting. Although using a yes—no modification may be less taxing on judges, the few empirical stud-
ies that address this method leave us unconvinced that important information is not lost in the process.
Until more information becomes available regarding the technical merit of this modification, we
recommend the traditional Angoff instructions. Modifying this to include several rounds, in which
judges receive the previous results, has both theoretical and some empirical support, particularly if
the process is moderated by an independent panel referee (Ricker, 2004).
Much of the research and most of the conclusions about content-based procedures have been
predicated on getting groups of judges together physically. Over the last decade, new interactive
procedures based on Internet-oriented real-time activities (e.g., focus groups, Delphi techniques)
have emerged in marketing and other fields that might be employed. These techniques open up new
alternatives for standard setting that have not yet been explored.
The performance-based approach uses the judgment of individuals (e.g., teachers) who are
intimately familiar with a specific group of candidates who are representative of the target popula-
tion (e.g., their own students). The judges are provided with the behavioral criteria, the test, and a
description of “proficient” and “nonproficient” candidates. Judges then identify individuals that,
based on previous classroom assessments, clearly fall into one of the two categories (borderline
candidates are excluded). These individuals are then tested, and separate frequency distributions
are generated for each group. The intersection between the two distributions is then used as the cut
score that differentiates nonproficient from proficient examinees.
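One simple way to locate the intersection of the two frequency distributions is sketched below in Python. This is an illustration added here: the data are hypothetical, and the rule used (the lowest score at which the proficient group is at least as frequent as the nonproficient group) is only one of several reasonable ways to operationalize the intersection.

from collections import Counter

nonproficient = [4, 5, 5, 6, 6, 7, 7, 8]    # scores of students judged nonproficient
proficient = [6, 7, 8, 8, 9, 9, 10, 10]     # scores of students judged proficient

freq_np, freq_p = Counter(nonproficient), Counter(proficient)

# Lowest score at which the proficient group's frequency first matches or
# exceeds the nonproficient group's frequency.
cut = min(score for score in range(min(nonproficient), max(proficient) + 1)
          if freq_p.get(score, 0) >= freq_np.get(score, 0) and freq_p.get(score, 0) > 0)
print(cut)   # 8 with these made-up data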


A variation used in at least 31 states in the last few years is the bookmark procedure, some-
times referred to as an item mapping strategy, which involves first testing examinees. Using Item
Response Theory (IRT), test items (which may be selected-response or open-ended) are mapped
onto a common scale. Item difficulties are then computed, and a booklet is developed with items
arranged from easiest to hardest. Typically, 15 to 20 judges are provided with item content specifi-
cations, content standards, scoring rubrics for open-ended items, and other relevant materials, such
as initial descriptors for performance levels. The judges are then separated into smaller groups and
examine the test and the ordered item booklet. Starting with the easiest items, each judge examines
each item and determines whether a student meeting a particular performance level would be ex-
pected to answer the item correctly. This procedure continues until the judge reaches the first item
that is perceived as unlikely to be answered correctly by a student at the prescribed level. A “book-
mark” is then placed next to this item. After the first round, participants in the small group are al-
lowed to discuss their bookmark ratings for each group and may modify initial ratings based on
group discussion. In a final round, the small group findings are discussed by all judges, and addi-
tional information, such as the impact of the various cut scores on the percent of students that would
be classified in each category, is considered. A final determination is reached either through con-
sensus or by taking the median ratings between groups. The last step involves revising the initial
performance descriptors based on information provided by panelists (and the impact data) during
the iterative rounds. The disadvantages of this method, however, such as difficulties in assembling
qualified judges and the lack of an inter-judge agreement process, raise questions about its
validity.
There are serious concerns with current procedures. First, numerous studies have provided
evidence that different standard setting methods often yield markedly different results. Moreover,
using the same method, different panels of judges arrive at very different conclusions regarding the
appropriate cut score (Crocker & Algina, 1986; Jaeger, 1991). Ideally, multiple methods would be
used to determine the stability of scores generated by different methods and different judges. Un-
fortunately, this is not typically an economic or logistic reality. Rudner (2001) and Glass (1978)
have noted the “arbitrary” nature of the standard setting process. Glass, in particular, has con-
demned the notion of cut scores as a “common expression of wishful thinking” (p. 237). A second
concern related to standard setting involves the classification of minimally proficient, competent,
acceptable, or master examinees. Glass (1978) has noted that valid external criteria for assessing
the legitimacy of such distinctions in the context of subject-matter areas are virtually nonexistent.
They may hold only in rare instances, such as when a person who types zero words per minute with
zero accuracy (a complete absence of skill) may be deemed incapable of working as a typist. But,
would one dare suggest that a minimal amount of reading skill is necessary to be a competent
parent—and if so, what would this level be? Glass concludes, “The attempt to base criterion scores
on a concept of minimal competence fails for two reasons: (1) it has virtually no foundation in
psychology; (2) when its arbitrariness is granted but judges attempt nonetheless to specify minimal
competence, they disagree wildly” (p. 251).

of criterion-referenced interpretation. For example, As might be assigned for percentage


correct scores between 90% and 100%, Bs for scores between 80% and 89%, Cs for scores
between 70% and 79%, Ds for scores between 60% and 69%, and Fs for scores below
60%.
Note that with this system a student with a score of 95% receives an A regardless of how

other students scored. If all of the students in the class correctly answered 90% or more of the items, they would all receive As on the test.
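Because the grade depends only on the examinee's own percentage correct, the assignment rule is a simple lookup. The Python sketch below is an illustration added here of the grading bands just described.

def letter_grade(percent_correct):
    # Criterion-referenced: the grade depends only on the student's own score.
    if percent_correct >= 90:
        return "A"
    if percent_correct >= 80:
        return "B"
    if percent_correct >= 70:
        return "C"
    if percent_correct >= 60:
        return "D"
    return "F"

print([letter_grade(p) for p in (95, 84, 70, 59)])   # ['A', 'B', 'C', 'F']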
As noted previously, with norm-referenced interpretations the most important consid-
eration is the relevance of the group that the examinee’s performance is compared to. How-
ever, with criterion-referenced interpretations, there is no comparison group, and the most
important consideration is how clearly the knowledge or skill domain
being assessed is specified or defined (e.g., Popham, 2000). For criterion-referenced interpretations to provide useful information about what students know or what skills they possess, it is important that the knowledge or skill domain assessed by the test be clearly defined. To facilitate this, it is common for tests specifically designed to produce criterion-referenced interpretations to assess more limited or
narrowly focused content domains than those designed to produce
norm-referenced interpretations. For example, a test designed to produce norm-referenced
interpretations might be developed to assess broad achievement in mathematics (e.g., ranging
from simple number recognition to advanced algebraic computations). In contrast, a math test
designed to produce criterion-referenced interpretations might be developed to assess the
students’ ability to add fractions. In this situation, the criterion-referenced domain is much
more focused, which allows for more meaningful criterion-based interpretations. For example,
if a student successfully completed 95% of the fractional addition problems, you would have
a good idea of his or her math skills in this limited, but clearly defined area. In contrast, if a
student scored at the 50th percentile on the norm-referenced broad mathematics achievement
test, you would know that the performance was average for that age. However, you would not
be able to make definitive statements about the specific types of math problems the student is
able to perform. Although criterion-referenced interpretations are most applicable to narrowly
defined domains, they are often applied to broader, less clearly defined domains. For example,
most tests used for licensing professionals such as physicians, lawyers, teachers, or psycholo-
gists involve criterion-referenced interpretations.

Norm-Referenced, Criterion-Referenced, or Both?

Early in this chapter we noted that it is not technically accurate to refer to norm-referenced tests or criterion-referenced tests. It is the interpretation of performance on a test that is either norm-referenced or criterion-referenced. As a result, it is possible for a test to produce both norm-referenced and criterion-referenced interpretations. That being said, for several reasons it is usually optimal for tests to be designed to produce either norm-referenced or criterion-referenced scores. Norm-referenced interpretations can be applied to a larger variety of tests than criterion-referenced interpretations. We have made the distinction between
maximum performance tests (e.g., aptitude and achievement) and
typical response tests (e.g., interest, attitudes, and behavior). Norm-referenced interpreta-
tions can be applied to both categories, but criterion-referenced interpretations are typically
applied only to maximum performance tests. That is, because criterion-referenced scores

reflect an examinee’s knowledge or skills in a specific domain, it is not logical to apply them
to measures of personality. Even in the broad category of maximum performance tests,
norm-referenced interpretations tend to have broader applications. Consistent with their
focus on well-defined knowledge and skills domains, criterion-referenced interpretations
are most often applied to educational achievement tests or other tests designed to assess
mastery of a clearly defined set of skills and abilities. Constructs such as aptitude and intel-
ligence are typically broader and lend themselves best to norm-referenced interpretations.
Even in the context of achievement testing we have alluded to the fact that tests designed
for norm-referenced interpretations often cover broader knowledge and skill domains than
those designed for criterion-referenced interpretations.
In addition to the breadth or focus of the knowledge or skills domain being assessed,
test developers consider other factors when developing tests intended primarily for either
norm-referenced or criterion-referenced interpretations. For example, because tests de-
signed for criterion-referenced interpretations typically have a narrow focus, they are able
to devote a large number of items to measuring each objective or skill. In contrast, because
tests designed for norm-referenced interpretations typically have a broader focus they may
devote only a few items to measuring each objective or skill. When developing tests in-
tended for norm-referenced interpretations, test developers will typically select items of
average difficulty and eliminate extremely difficult or easy items. When developing tests
intended for criterion-referenced interpretations, test developers
match the difficulty of the items to the difficulty of the knowledge or skills domain being assessed.
Although our discussion to this point has emphasized differences between norm-referenced and criterion-referenced interpreta-

TABLE 3.4 Characteristics of Norm-Referenced and Criterion-Referenced Scores

Norm-Referenced Interpretations
■ Compare performance to a specific reference group—a relative interpretation.
■ Useful interpretations require a relevant reference group.
■ Usually assess a fairly broad range of knowledge or skills.
■ Typically have only a limited number of items to measure each objective or skill.
■ Items are selected that are of medium difficulty and maximize variance; very difficult and very easy items are usually deleted.
■ Example: Percentile rank—a percentile rank of 80 indicates that the examinee scored better than 80% of the subjects in the reference group.

Criterion-Referenced Interpretations
■ Compare performance to a specific level of performance—an absolute interpretation.
■ Useful interpretations require a carefully defined knowledge or skills domain.
■ Usually assess a limited or narrow domain of knowledge or skills.
■ Typically have several items to measure each test objective or skill.
■ Items are selected that provide good coverage of content domain; the difficulty of the items matches the difficulty of content domain.
■ Example: Percentage correct—a percentage correct score of 80 indicates that the examinee successfully answered 80% of the test items.


tions, they are not mutually exclusive. Tests can be developed that provide both
norm-referenced and criterion-referenced interpretations. Both interpretative approaches
have positive characteristics and provide useful information (see Table 3.4). Whereas
norm-referenced interpretations provide important information about how an examinee
performed relative to a specified reference group, criterion-referenced interpretations pro-
vide important information about how well an examinee has mastered a specified knowl-
edge or skills domain. It is possible, and sometimes desirable, for a test to produce both
norm-referenced and criterion-referenced scores. For example, it would be possible to
interpret a student’s test performance as “by correctly answering 75% of the multiplication
problems, the student scored better than 60% of the students in the class.” Although the
development of a test to provide both norm-referenced and criterion-referenced scores may
require some compromises, the increased interpretative versatility may justify these com-
promises (e.g., Linn & Gronlund, 2000). As a result, some test publishers are beginning to
produce more tests that provide both interpretative formats. Nevertheless, most tests are
designed for either norm-referenced or criterion-referenced interpretations. Although the
majority of published standardized tests are designed to produce norm-referenced interpre-
tations, tests producing criterion-referenced interpretations play an extremely important
role in educational and other settings.

Qualitative Description of Scores


Test developers commonly provide qualitative descriptions of the scores produced by their tests. These qualitative descriptors help professionals communicate results in written reports and other formats. For example, the Stanford-Binet Intelligence Scales, Fifth Edition (SB5; Roid, 2003) provides the following qualitative descriptions:

IQ Classification

145 and above Very Gifted or Highly Advanced


130-144 Gifted or Very Advanced
120-129 Superior
110-119 High Average
90-109 Average
80-89 Low Average
70-79 Borderline Impaired or Delayed
55-69 Mildly Impaired or Delayed
40-54 Moderately Impaired or Delayed

These qualitative descriptors help professionals communicate information about an ex-


aminee’s performance in an accurate and consistent manner. That is, professionals using
the SB5 should consistently use these descriptors when describing test performance.

A similar approach is often used with typical response assessments. For example, the
Behavior Assessment System for Children (BASC; Reynolds & Kamphaus, 1998) provides
the following descriptions of the clinical scales such as the depression or anxiety scales:

T-Score Range Classification

70 and above Clinically Significant


60-69 At-Risk
41-59 Average
31-40 Low
30 and below Very Low

Summary

This chapter provided an overview of different types of test scores and their meanings. We
started by noting that raw scores, while easy to calculate, usually provide little useful infor-
mation about an examinee’s performance on a test. As a result, we usually transform raw
scores into derived scores. The many different types of derived scores can be classified as
either norm-referenced or criterion-referenced. Norm-referenced score interpretations com-
pare an examinee’s performance on a test to the performance of other people, typically the
standardization sample. When making norm-referenced interpretations, it is important to
evaluate the adequacy of the standardization sample. This involves determining if the stan-
dardization is representative of the examinees the test will be used with, if the sample is
current, and if the sample is of adequate size to produce stable statistics.
When making norm-referenced interpretations it is useful to have a basic understand-
ing of the normal distribution (also referred to as the bell-shaped curve). The normal distri-
bution is a distribution that characterizes many naturally occurring variables and has several
characteristics that psychometricians find very useful. The most useful of these character-
istics is that predictable proportions of scores occur at specific points in the distribution. For
example, if you know that an individual’s score is one standard deviation above the mean
on a normally distributed variable, you know that the individual’s score exceeds approxi-
mately 84% of the scores in the standardization sample. This predictable distribution of
scores facilitates the interpretation and reporting of test scores.
Standard scores are norm-referenced derived scores that have a predetermined mean
and standard deviation. A variety of standard scores is commonly used today, including

■ z-scores: mean of 0 and standard deviation of 1
■ T-scores: mean of 50 and standard deviation of 10
■ Wechsler IQs: mean of 100 and standard deviation of 15
■ CEEB scores (SAT/GRE): mean of 500 and standard deviation of 100

By combining an understanding of the normal distribution with the information pro-


vided by standard scores, you can easily interpret an examinee’s performance relative to the
specified reference group. For example, an examinee with a T-score of 60 scored 1 standard

deviation above the mean. You know that approximately 84% of the scores in a normal
distribution are below 1 standard deviation above the mean. Therefore, the examinee’s score
exceeded approximately 84% of the scores in the reference group.
When scores are not normally distributed (i.e., do not take the form of a normal dis-
tribution), test publishers often use normalized standard scores. These normalized scores
often look just like regular standard scores, but they are computed in a different manner.
Nevertheless, they are interpreted in a similar manner. For example, if a test publisher re-
ports normalized T-scores, they will have a mean of 50 and standard deviation of 10, just
like regular T-scores. There are some unique normalized standard scores, including:

■ Stanine scores: mean of 5 and standard deviation of 2
■ Wechsler subtest scaled scores: mean of 10 and standard deviation of 3
■ Normal Curve Equivalent (NCE): mean of 50 and standard deviation of 21.06

Another common type of norm-referenced score is percentile rank. This popular for-
mat is one of the most easily understood norm-referenced derived scores. Like all norm-
referenced scores, the percentile rank reflects an examinee’s performance relative to a
specific reference group. However, instead of using a scale with a specific mean and stan-
dard deviation, the percentile rank simply specifies the percentage of individuals scoring
below a given point in a distribution. For example, a percentile rank of 80 indicates that 80%
of the individuals in the reference group scored below this score. Percentile ranks have the
advantage of being easily explained to and understood by individuals without formal train-
ing in psychometrics.
The final norm-referenced derived scores we discussed were grade and age equiva-
lents. For numerous reasons, we recommend that you avoid using these scores. If you are
required to report them, also report standard scores and percentile ranks and emphasize
these when interpreting the results.
In contrast to norm-referenced scores, criterion-referenced scores compare an exam-
inee’s performance to a specified level of performance referred to as a criterion. Probably
the most common criterion-referenced score is the percent correct score routinely reported
on classroom achievement tests. For example, if you report that a student correctly an-
swered 80% of the items on a spelling test, this is a criterion-referenced interpretation.
Another type of criterion-referenced interpretation is mastery testing. On a mastery test
you determine whether examinees have achieved a specified level of mastery on the knowl-
edge or skill domain. Here, performance is typically reported as either pass or fail. If ex-
aminees score above the cut score they pass; if they score below the cut score they fail.
Another criterion-referenced interpretation is referred to as standards-based interpreta-
tions. Instead of reporting performance as simply pass/fail, standards-based interpretations
typically involve three to five performance categories.
With criterion-referenced interpretations, a prominent consideration is how clearly the
knowledge or domain is defined. For useful criterion-referenced interpretations, the knowledge
or skill domain being assessed must be clearly defined. To facilitate this, criterion-referenced
interpretations are typically applied to tests that measure focused or narrow domains. For
example, a math test designed to produce criterion-referenced scores might be limited to the addition of fractions. This way, if a student correctly answers 95% of the fraction problems,

you will have useful information regarding the student’s proficiency with this specific type
of math problem. You are not able to make inferences about a student’s proficiency in other
areas of math, but you will know if this specific type of math problem was mastered. If the
math test contained a wide variety of math problems (as is common with norm-referenced
tests), it would be more difficult to specify exactly in which areas a student is proficient.
We closed the chapter by noting that the terms norm-referenced and criterion-referenced
refer to the interpretation of test performance, not the test itself. Although it is often optimal
to develop a test to produce either norm-referenced or criterion-referenced scores, it is possible
and sometimes desirable for a test to produce both norm-referenced and criterion-referenced
scores. This may require some compromises when developing the test, but the increased flex-
ibility may justify these compromises. Nevertheless, most tests are designed for either norm-
referenced or criterion-referenced interpretations, and most published standardized tests
produce norm-referenced interpretations. That being said, tests that produce criterion-referenced
interpretations have many important applications, particularly in educational settings.

KEY TERMS AND CONCEPTS

Age equivalents, p. 79
CEEB scores, p. 71
Criterion-referenced, p. 63
Cut score, p. 80
Grade equivalents, p. 78
Interpolation, p. 78
Linear transformation, p. 69
Mastery testing, p. 80
Normal curve equivalent (NCE), p. 76
Normal distribution, p. 67
Normalized standard scores, p. 75
Norm-referenced, p. 62
Percentile rank, p. 77
Qualitative descriptions, p. 85
Raw score, p. 62
Standardization sample, p. 64
Standard scores, p. 69
Stanford-Binet Intelligence Scales, p. 71
Stanine scores, p. 76
T-scores, p. 71
Wechsler IQ, p. 71
Wechsler scaled scores, p. 76
z-scores, p. 70

RECOMMENDED READINGS
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA. For the technically minded, Chapter 4, Scales, Norms, and Score Comparability, is must reading!

Lyman, H. B. (1998). Test scores and what they mean. Boston: Allyn & Bacon. This text provides a comprehensive and very readable discussion of test scores. An excellent resource!

INTERNET SITES OF INTEREST

www.teachersandfamilies.com/open/parent/scores1.cfm
Understanding Test Scores: A Primer for Parents is a user-friendly discussion of tests that is accurate and readable. Another good resource for parents.

childparenting.miningco.com/cs/learningproblems/a/wisciii.htm
This Parents' Guide to Understanding the IQ Test Scores contains a good discussion of the use of intelligence tests in schools and how they help in assessing learning disabilities. A good resource for parents.

PRACTICE ITEMS

1. Transform the following raw scores to the specified standard score formats. The raw score distribution has a mean of 70 and a standard deviation of 10.
   a. Raw score = 85    z-score = ____    T-score = ____
   b. Raw score = 60    z-score = ____    T-score = ____
   c. Raw score = 55    z-score = ____    T-score = ____
   d. Raw score = 95    z-score = ____    T-score = ____
   e. Raw score = 15    z-score = ____    T-score = ____

2. Convert the following z-scores to T-scores and CEEB scores.
   a. z-score = 1.0     T-score = ____    CEEB score = ____
   b. z-score = -1.5    T-score = ____    CEEB score = ____
   c. z-score = 2.5     T-score = ____    CEEB score = ____
   d. z-score = -2.0    T-score = ____    CEEB score = ____
   e. z-score = -1.70   T-score = ____    CEEB score = ____

Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™


presentation and to listen to an audio lecture about this chapter.
CHAPTER 4

Reliability for Teachers

It is the user who must take responsibility for determining whether or not scores
are sufficiently trustworthy to justify anticipated uses and interpretations.
—AERA et al., 1999, p. 31

CHAPTER HIGHLIGHTS

Errors of Measurement
Methods of Estimating Reliability
The Standard Error of Measurement
Reliability: Practical Strategies for Teachers

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Define and explain the importance of reliability in educational assessment.
2. Define and explain the concept of measurement error.
3. Explain classical test theory and its importance to educational assessment.
4. Describe the major sources of measurement error and give examples.
5. Identify the major methods for estimating reliability and describe how these analyses are performed.
6. Identify the sources of measurement error that are reflected in different reliability estimates.
7. Explain how multiple scores can be combined in a composite to enhance reliability.
8. Describe the factors that should be considered when selecting a reliability coefficient for a specific assessment application.
9. Explain the factors that should be considered when evaluating the magnitude of reliability coefficients.
10. Describe steps that can be taken to improve reliability.
11. Discuss special issues in estimating reliability such as estimating the reliability of speed tests and mastery testing.
12. Define the standard error of measurement (SEM) and explain its importance.
13. Explain how SEM is calculated and describe its relation to reliability.
14. Explain how confidence intervals are calculated and used in educational and psychological assessment.
15. Describe and apply shortcut procedures for estimating the reliability of classroom tests.


Most dictionaries define reliability in terms of dependability, trustworthiness, or having a high degree of confidence in something. Reliability in the context of educational and psychological measurement is concerned to some extent with these same factors, but is extended to such concepts as stability and consistency. In simplest terms, in the context of measurement, reliability refers to the consistency or stability of assessment results.
Although it is common for people to refer to the “reliability of a test,” in the Standards for
Educational and Psychological Testing (AERA et al., 1999) reliability is considered to be a
characteristic of scores or assessment results, not tests themselves.
Consider the following example: A teacher administers a 25-item math test in the
morning to assess the students’ skill in multiplying two-digit numbers. If the test had been
administered in the afternoon rather than the morning, would Susie’s score on the test have
been the same? Because there are literally thousands of two-digit multiplication problems, if
the teacher had used a different group of 25 two-digit multiplication problems, would Susie
have received the same score? What about the ambulance that went by, its siren wailing
loudly, causing Johnny to look up and watch for a few seconds? Did this affect his score, and
did it affect Susie’s, who kept working quietly? Jose wasn’t feeling well that morning but
came to school because he felt the test was so important. Would his score have been better if
he had waited to take the test when he was feeling better? Would the students have received
the same scores if another teacher had graded the test? All of these questions involve issues
of reliability. They all ask if the test produces consistent scores.
As you can see from these examples, numerous factors can affect reliability. The time
the test is administered, the specific set of questions included on the test, distractions due to
external (e.g., ambulances) or internal (e.g., illness) events, and the person grading the test
are just a few of these factors. In this chapter you will learn to take many of the sources of
unreliability into account when selecting or developing assessments and evaluating scores.
You will also learn to estimate the degree of reliability in test scores with a method that best
fits your particular situation. First, however, we will introduce the concept of measurement
error as it is essential to developing a thorough understanding of reliability.

Errors of Measurement

Some degree of error is inherent in all measurement. Although measurement error has largely been studied in the context of psychological and educational tests, measurement error clearly is not unique to
this context. In fact, as Nunnally and Bernstein (1994) point out, mea-
surement in other scientific disciplines has as much, if not more, error
than that in psychology and education. They give the example of physiological blood pressure
measurement, which is considerably less reliable than many educational tests. Even in situa-
tions in which we generally believe measurement is exact, some error is present. If we asked
a dozen people to time a 440-yard race using the same brand of stopwatch, it is extremely
unlikely that they would all report precisely the same time. If we had a dozen people and a
measuring tape graduated in millimeters and required each person to measure independently

the length of a 100-foot strip of land, it is unlikely all of them would report the same answer
to the nearest millimeter. In the physical sciences the introduction of more technologically
sophisticated measurement devices has reduced, but not eliminated, measurement error.
Different theories or models have been developed to address measurement issues, but
possibly the most influential is classical test theory (also called true score theory). Accord-
ing to this theory, every score on a test is composed of two components: the true score (i.e.,
the score that would be obtained if there were no errors, if the score were perfectly reliable)
and the error score: Obtained Score = True Score + Error. This can be represented in a very
simple equation:

Xi = T + E

Here we use X; to represent the observed or obtained score of an individual; that is, X; is the
score the test taker received on the test. The symbol T is used to represent an individual’s
true score and reflects the test taker’s true skills, abilities, knowledge, attitudes, or whatever
the test measures, assuming an absence of measurement error. Finally, E represents mea-
surement error.
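A small simulation helps make the equation concrete. The Python sketch below is an illustration added here, not part of classical test theory itself; the true score of 80 and the error standard deviation of 3 are arbitrary choices.

import random

random.seed(1)
true_score = 80                                  # T: the score under error-free measurement
observed = [true_score + random.gauss(0, 3)      # E: random error added on each occasion
            for _ in range(5)]                   # five hypothetical administrations
print([round(x, 1) for x in observed])           # obtained scores scatter around 80
print(round(sum(observed) / len(observed), 1))   # random errors tend to cancel, so the mean lands near T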
Measurement error reduces the usefulness of measurement. It limits the extent to
which test results can be generalized and reduces the confidence we have in test results
(AERA et al., 1999). Practically speaking, when we administer a
test we are interested in knowing the test taker's true score. Due to the presence of measurement error we can never know with absolute confidence what the true score is. However, if we have information about the reliability of measurement, we can establish intervals around an obtained score and calculate the probability that the true
score will fall within the interval specified. We will come back to
this with a more detailed explanation when we discuss the standard error of measurement
later in this chapter. First, we will elaborate on the major sources of measurement error. It
should be noted that we will limit our discussion to random measurement error. Some writ-
ers distinguish between random and systematic errors. Systematic error is much harder to
detect and requires special statistical methods that are generally beyond the scope of this
text; however, some special cases of systematic error are discussed in Chapter 16. (Special
Interest Topic 4.1 provides a brief introduction to Generalizability Theory, an extension of
classical reliability theory.)
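As a preview of the standard error of measurement discussed later in this chapter, the Python sketch below illustrates how such an interval can be computed. It uses the conventional formula SEM = SD × √(1 − reliability); the reliability of .91 and the IQ-style scale are hypothetical, and centering the interval on the obtained score is a common simplification.

import math

def sem(sd, reliability):
    # Standard error of measurement: SD * sqrt(1 - reliability).
    return sd * math.sqrt(1 - reliability)

def confidence_interval(obtained, sd, reliability, z=1.96):
    # Approximate 95% interval around an obtained score.
    error = z * sem(sd, reliability)
    return obtained - error, obtained + error

print(round(sem(15, 0.91), 2))             # 4.5 on an IQ-type scale (mean 100, SD 15)
print(confidence_interval(100, 15, 0.91))  # roughly 91 to 109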

Sources of Measurement Error


Because measurement error is so pervasive, it is beneficial to be knowledgeable about its
characteristics and aware of the methods that are available for estimating its magnitude. As
educational professionals, we should also work to identify sources of measurement error and minimize their impact to the extent possible. Generally, whenever you hear a discussion of reliability or read about the reliability of test scores, it is the relative freedom from measurement error that is being discussed. Reliable assessment results are relatively free from measurement error whereas less reliable

SPECIAL INTEREST TOPIC 4.1

Generalizability Theory

Lee Cronbach and colleagues developed an extension of classical reliability theory known as “gen-
eralizability theory” in the 1960s and 1970s. Cronbach was instrumental in the development of the
general theory of reliability discussed in this chapter during and after World War II. The basic focus
of generalizability theory is to examine various conditions that might affect the reliability of a test
score. In classical reliability theory there are only two sources for variation in an observed test score:
true score and random error. Suppose, however, that for different groups of people the scores reflect

different things. For example, boys and girls might respond differently to career interest items. When
the items for a particular career area are then grouped into a scale, the reliability of the scale might be
quite different for boys and girls as a result. This gender effect becomes a limitation on the generaliz-

ability of the test's functioning with respect to reliability.


Generalizability theory extends the concept of reliability as the ratio of true score variance
to total score variance by adding other possible sources of true score variation to both the numera-
tor and the denominator of the reliability estimate. Because gender is a reliable indicator, if there is
significant gender variation on a test scale due to gender differences, this additional variation will
change the original true score variation. What originally appeared to be high true score variation
might instead be a modest true score variation and large gender variation. In some instances the true
score variation may be high within the boys’ group but near zero within the girls’ group. Thus, in this
study, gender limits the generalizability of the test with respect to reliability.
The sources of variation that might be considered are usually limited to theoretically relevant
characteristics of population, ecology, or time. For example, population characteristics may include
gender, ethnicity, or region of residence. These are usually discussed as fixed sources, because typically
all characteristics will be present in the data analysis (male and female, all ethnic groups of interest,
or all regions to be considered). Sources are considered random when only a sample of the possible
current or future population is involved in the analysis. Common random sources include raters or
observers; classrooms, schools, or districts; clinics or hospitals; or other organized groupings in which
respondents are placed. In a nursing school, for example, students may be evaluated in a series of activi-
ties that they are expected to have mastered (administering an injection, adjusting a drip, and determin-
ing medication levels). Several instructors might rate each activity. Because we are usually interested in
how reliable the ratings are for raters like those in the study, but not just those specific raters, the rater
source is considered a random source. Random sources are always included only in the denominator
of the reliability ratio (discussed in the next major section). That is, the variance associated with raters
will be added, along with the error variance and true score variance, only to the total variance term.
Although calculating generalizability coefficients is beyond the scope of this text, the general
procedure is to use a statistical analysis program such as Statistical Package for the Social Sciences
(SPSS) or Statistical Analysis System (SAS). These statistical programs have analysis options that
will estimate the variance components (i.e., the variances of each source of variation specified by
the analyst). A numerical value for each source is obtained, and the generalizability value is calcu-
lated from specific rules of computation that have been derived over the last several decades. Some
psychometricians advocate simply examining the magnitude of the variances. For example, if true
score variance in the gender study mentioned is 10, but gender variance is 50, while error variance
per item is 2, it is clear that most of the apparent reliability of the test is due to gender differences
rather than individual differences. Boy and girl studies might be conducted separately at this point.
In the nursing study, if rater variance is 3 and individual true score variance is 40, it is clear without
further study that raters will have little effect on the reliability of the nursing assessments.
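The ratio logic described in this box can be illustrated with a small computation. The Python sketch below is ours, not part of the text: it plugs in the hypothetical variance components mentioned above, assumes an error variance of 2 for the nursing example (a value the box does not supply), and uses a deliberately simplified ratio rather than a full generalizability analysis.

```python
# Simplified illustration (not the full G-theory computation): variance from a
# random source such as raters enters only the denominator of the ratio.

def generalizability_ratio(true_var, random_source_vars, error_var):
    """Ratio of true score variance to total variance, with variance from
    random sources (e.g., raters) added to the denominator only."""
    total = true_var + sum(random_source_vars) + error_var
    return true_var / total

# Nursing example: true score variance = 40, rater variance = 3,
# assumed error variance = 2 (not given in the text).
print(round(generalizability_ratio(40, [3], 2), 2))   # raters barely lower the ratio

# Gender example: apparent true score variance of 10 is dwarfed by
# gender variance of 50, with error variance per item of 2.
print(round(10 / (10 + 50 + 2), 2))   # most of the variance reflects gender, not individuals
```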

results are not. A number of factors may introduce error into test scores and even though all
cannot be assigned to distinct categories, it may be helpful to group these sources in some
manner and to discuss their relative contributions. The types of errors that are our greatest
concern are errors due to content sampling and time sampling.

Content Sampling Error. Tests rarely, if ever, include every possible question or evalu-
ate every possible relevant behavior. Let’s revisit the example we introduced at the begin-
ning of this chapter. A teacher administers a math test designed to assess skill in multiplying
two-digit numbers. We noted that there are literally thousands of two-digit multiplication
problems. Obviously it would be impossible for the teacher to develop and administer a
test that includes all possible items. Instead, a universe or domain of test items is defined
based on the content of the material to be covered. From this domain a sample of test ques-
tions is taken. In this example, the teacher decided to select 25 items to measure students’
ability. These 25 items are simply a sample and, as with any sampling procedure, may not
be representative of the domain from which they are drawn.

When reading other sources, you might see this type of error referred to as domain sampling error. Domain sampling error
and content sampling error are the same. Content sampling error typically is considered
the largest source of error in test scores and therefore is the source that concerns us most.
Fortunately, content sampling error is also the easiest and most accurately estimated source
of measurement error.
The amount of measurement error due to content sampling is determined by how well
we sample the total domain of items. If the items on a test are a good sample of the domain,
the amount of measurement error due to content sampling will be
relatively small. If the items on a test are a poor sample of the domain, the amount of measurement error due to content sampling will be relatively large. Measurement error resulting from content sampling is estimated by analyzing the degree of statistical similarity among the items making up the test. In other words, we analyze the test items
to determine how well they correlate with one another and with the
test taker’s standing on the construct being measured. We will explore a variety of methods
for estimating measurement errors due to content sampling later in this chapter.

Time Sampling Error. Measurement error also can be introduced by one's choice of a particular time to administer the test. If Eddie did not have breakfast and the math test was just before lunch, he might be distracted or hurried and not perform as well as if he took the test after lunch. But Michael, who ate too much at lunch and was up a little late last night, was a little sleepy in the afternoon and might not perform as well on an afternoon test as he would have on the morning test. If during the morning testing session a neighboring class was making enough noise to be disruptive, the class might have performed better in the afternoon when the neighboring class was relatively quiet. These are all examples of situations in which random changes over time in the test taker (e.g., fatigue, illness, anxiety) or the testing environment (e.g., distractions, temperature) affect performance on the test. Measurement error due to time sampling reflects random fluctuations in performance from one situation to another and limits our ability to generalize test scores across different situations. Some assessment
experts refer to this type of error as temporal instability. As you might expect, testing experts
have developed methods of estimating error due to time sampling.

Other Sources of Error. Although errors due to content sampling and time sampling ac-
count for the major proportion of random error in testing, administrative and scoring errors
that do not affect all test takers equally will also contribute to the random error observed
in scores. Clerical errors committed while adding up a student’s score or an administrative
error on an individually administered test are common examples. When the scoring of a test
relies heavily on the subjective judgment of the person grading the test or involves subtle
discriminations, it is important to consider differences in graders, usually referred to as
inter-scorer or inter-rater differences. That is, would the test taker receive the same score
if different individuals graded the test? For example, on an essay test would two different
graders assign the same scores? These are just a few examples of sources of error that do
not fit neatly into the broad categories of content or time sampling errors.

Methods of Estimating Reliability


You will note that we are referring to reliability as being estimated. This is because the ab-
solute or precise reliability of assessment results cannot be known. Just as we always have
some error in test scores, we also have some error in our attempts to measure reliability.
Earlier in this chapter we introduced the idea that test scores are composed of two compo-
nents, the true score and the error score. We represented this with the equation:

$$X_i = T + E$$

As you remember, X_i represents an individual's obtained score, T represents the true score,
and E represents random measurement error. This equation can be extended to incorporate the
concept of variance. This extension indicates that the variance of test scores is the sum of the
true score variance plus the error variance, and is represented in the following equation:

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2$$

Here, $\sigma_X^2$ represents the variance of the total test scores, $\sigma_T^2$ represents true score variance, and $\sigma_E^2$ represents the variance due to measurement error.
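To make this decomposition concrete, here is a brief simulation in Python (our illustration, not the authors'): observed scores are generated as true scores plus independent random error, and the variance of the observed scores approximately equals the sum of the true score variance and the error variance.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000                          # simulated examinees
T = rng.normal(50, 10, n)            # true scores
E = rng.normal(0, 5, n)              # random measurement error, independent of T
X = T + E                            # observed scores: X = T + E

print(round(X.var(), 1))             # approximately 125
print(round(T.var() + E.var(), 1))   # approximately 100 + 25 = 125
```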

The general symbol for the reliability of assessment results associated with content
or domain sampling is r_xx and is referred to as the reliability coefficient. Mathematically, reliability is written

$$r_{xx} = \frac{\sigma_T^2}{\sigma_X^2}$$
This equation defines the reliability of test scores as the proportion of test score variance
due to true score differences. The reliability coefficient is considered to be the summary
mathematical representation of this ratio or proportion.
Reliability coefficients can be classified into three broad categories (AERA et al.,
1999). These include (1) coefficients derived from the administration of the same test on
different occasions (i.e., test-retest reliability), (2) coefficients based on the administration
of parallel forms of a test (i.e., alternate-form reliability), and (3) coefficients derived from
a single administration of a test (internal consistency coefficients). A
fourth type, inter-rater reliability, is indicated when scoring involves a significant degree of subjective judgment. The major methods of estimating reliability are summarized in Table 4.1. Each of these approaches produces a reliability coefficient (r_xx) that can be inter-
preted in terms of the proportion or percentage of test score variance
attributable to true variance. For example, a reliability coefficient of 0.90 indicates that 90%
of the variance in test scores is attributable to true variance. The remaining 10% reflects
error variance. We will now consider each of these methods of estimating reliability.

TABLE 4.1 Major Types of Reliability

Type of Reliability Estimate     Common Symbol   Number of Test Forms   Number of Testing Sessions   Summary

Test-Retest                      r12             One form               Two sessions                 Administer the same test to the same group at two different sessions.

Alternate forms
  Simultaneous administration    rab             Two forms              One session                  Administer two forms of the test to the same group in the same session.
  Delayed administration         rab             Two forms              Two sessions                 Administer two forms of the test to the same group at two different sessions.

Split-half                       roe             One form               One session                  Administer the test to a group one time. Split the test into two equivalent halves, typically correlating scores on the odd-numbered items with scores on the even-numbered items.

Coefficient alpha or KR-20       rα              One form               One session                  Administer the test to a group one time. Apply appropriate procedures.

Inter-rater                      r               One form               One session                  Administer the test to a group one time. Two or more raters score the test independently.

Test-Retest Reliability
Probably the most obvious way to estimate the reliability of a test is to administer the same
test to the same group of individuals on two different occasions. With this approach the reli-
ability coefficient is obtained by simply calculating the correlation between the scores on
the two administrations. For example, we could administer our 25-item math test one week
after the initial administration and then correlate the scores obtained on the two administra-
tions. This estimate of reliability is referred to as test-retest reliability and is sensitive to
measurement error due to time sampling. It is an index of the stability
of test scores over time. Because many tests are intended to measure fairly stable characteristics, we expect tests of these constructs to produce stable scores. Test-retest reliability reflects the degree to which test scores can be generalized across different situations or over time.
One important consideration when calculating and evaluating test-retest reliability
is the length of the interval between the two test administrations. If the test-retest interval
is very short (e.g., hours or days), the reliability estimate may be artificially inflated by
memory and practice effects from the first administration. If the test interval is longer, the
estimate of reliability may be lowered not only by the instability of the scores but also by
actual changes in the test takers during the extended period. In practice, there is no single
“best” time interval, but the optimal interval is determined by the
way the test results are to be used. For example, intelligence is a construct or characteristic that is thought to be fairly stable, so it would be reasonable to expect stability in intelligence scores over weeks or months. In contrast, an individual's mood (e.g., depressed, elated, nervous) is more subject to transient fluctuations, and stability across weeks or months would not be expected.
In addition to the construct being measured, the way the test
is to be used is an important consideration in determining what is an appropriate test-retest
interval. Because the SAT is used to predict performance in college, it is sensible to expect
stability over relatively long periods of time. In other situations, long-term stability is much
less of an issue. For example, the long-term stability of a classroom achievement test (such
as our math test) is not a major concern because it is expected that the students will be
enhancing existing skills and acquiring new ones due to class instruction and studying. In
summary, when evaluating the stability of test scores, one should consider the length of the
test-retest interval in the context of the characteristics being measured and how the scores
are to be used.
The test-retest approach does have significant limitations, the most prominent being
carryover effects from the first to second testing. Practice and memory effects result in
different amounts of improvement in retest scores for different test
takers. These carryover effects prevent the two administrations from being independent and as a result the reliability coefficients may be artificially inflated. In other instances, repetition of the test may change either the nature of the test or the test taker in some subtle or even obvious way (Ghiselli, Campbell, & Zedeck, 1981). As a result,

only tests that are not appreciably influenced by these carryover effects are suitable for this
method of estimating reliability.
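As a concrete sketch of this procedure (ours, with made-up scores), the test-retest coefficient is simply the Pearson correlation between the two administrations:

```python
import numpy as np

# Hypothetical scores for eight students who took the same test twice,
# one week apart (invented data for illustration only).
time1 = np.array([22, 18, 25, 15, 20, 23, 17, 24])
time2 = np.array([21, 19, 24, 14, 22, 23, 16, 25])

# Test-retest reliability is the Pearson correlation between the two administrations.
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 2))
```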

Alternate-Form Reliability
Another approach to estimating reliability involves the development of two equivalent or
parallel forms of the test. The development of these alternate forms requires a detailed test
plan and considerable effort because the tests must truly be parallel in terms of content, dif-
ficulty, and other relevant characteristics. The two forms of the test are then administered
to the same group of individuals and the correlation is calculated between the scores on
the two assessments. In our example of the 25-item math test, the teacher could develop
a parallel test containing 25 new problems involving the multiplication of double digits.
To be parallel the items would need to be presented in the same format and be of the same
level of difficulty. Two fairly common procedures are used to es-
tablish alternate-form reliability. One is alternate-form reliability based on simultaneous administrations and is obtained when the two forms of the test are administered on the same occasion (i.e., back to back). The other, alternate form with delayed administration, is obtained when the two forms of the test are administered on two
different occasions. Alternate-form reliability based on simultane-
ous administration is primarily sensitive to measurement error related to content sampling.
Alternate-form reliability with delayed administration is sensitive to measurement error due to both content sampling and time sampling, although it cannot differentiate between these two types of error.
Alternate-form reliability has the advantage of reducing the carryover effects that are
a prominent concern with test-retest reliability. However, although practice and memory
effects may be reduced using the alternate-form approach, they are
often not fully eliminated. Simply exposing test takers to the common format required for parallel tests often results in some carryover effects even if the content of the two tests is different. For example, a test taker given a test measuring nonverbal reasoning abilities may develop strategies during the administration of the first form that alter her approach to the second form, even if the specific content of the items is different. Another limitation of the alternate-form approach
to estimating reliability is that relatively few tests, standardized or
teacher made, have alternate forms. As we suggested, the development of alternate forms
that are actually equivalent is a time-consuming process, and many test developers do not
pursue this option. Nevertheless, at times it is desirable to have more than one form of a test,
and when multiple forms exist, alternate-form reliability is an important consideration.

Internal-Consistency Reliability
Internal-consistency reliability estimates primarily reflect errors related to content sam-
pling. These estimates are based on the relationship between items within a test and
are derived from a single administration of the test.

Split-Half Reliability. Estimating split-half reliability involves administering a test and


then dividing the test into two equivalent halves that are scored independently. The results
on the first half of the test are then correlated with results on the other half of the test by cal-
culating the Pearson product-moment correlation. Obviously, there are many ways a test can
be divided in half. For example, one might correlate scores on the first half of the test with
scores on the second half. This is usually not a good idea because the items on some tests
get more difficult as the test progresses, resulting in halves that are not actually equivalent.
Other factors, such as practice effects, fatigue, or attention that declines as the test progresses, can also make the first and second halves of the test not equivalent. A more acceptable approach would be to assign test items randomly to one half or the other. However, the most common approach is to use an odd-even split. Here all odd-numbered items go into one half and all even-numbered items go into the other half. A correlation is then calculated between scores on the odd-numbered
and even-numbered items.
Before we can use this correlation coefficient as an estimate
of reliability, there is one more task to perform. Because we are actually correlating two
halves of the test, the reliability coefficient does not take into account the reliability of the
test when the two halves are combined. In essence, this initial coefficient reflects the reli-
ability of only a shortened, half-test. As a general rule, longer tests are more reliable than
shorter tests. If we have twice as many test items, then we are able to sample the domain of
test questions more accurately. The better we sample the domain the lower the error due to
content sampling and the higher the reliability of our test. To “put the two halves of the test
back together” with regard to a reliability estimate, we use a correction formula commonly
referred to as the Spearman-Brown formula (or sometimes the Spearman-Brown proph-
ecy formula since it prophesies the reliability coefficient of the full-length test). To estimate
the reliability of the full test, the Spearman-Brown formula is generally applied as:

$$\text{Reliability of Full Test} = \frac{2 \times \text{Reliability of Half Test}}{1 + \text{Reliability of Half Test}}$$
Here is an example. Suppose the correlation between odd and even halves of your
midterm in this course was 0.74, the calculation using the Spearman-Brown formula would
go as follows:

$$\text{Reliability of Full Test} = \frac{2 \times 0.74}{1 + 0.74}$$

$$\text{Reliability of Full Test} = \frac{1.48}{1.74} = 0.85$$

The reliability coefficient of 0.85 estimates the reliability of the full test when the odd—even
halves correlated at 0.74. This demonstrates that the uncorrected split-half reliability coef-
ficient presents an underestimate of the reliability of the full test. Table 4.2 provides examples
of half-test coefficients and the corresponding full-test coefficients that were corrected with

the Spearman-Brown formula. By looking at the first row in this table, you will see that a
half-test correlation of 0.50 corresponds to a corrected full-test coefficient of 0.67.
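For readers who want to see the whole procedure in one place, the Python sketch below (ours) splits a matrix of item scores into odd and even halves, correlates the half scores, and applies the Spearman-Brown correction; the last two lines apply the correction to the half-test correlation of 0.74 from the example above.

```python
import numpy as np

def split_half_reliability(item_scores):
    """Odd-even split-half reliability with the Spearman-Brown correction.
    item_scores: 2-D array with rows = examinees and columns = items."""
    odd = item_scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ... (odd-numbered)
    even = item_scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ... (even-numbered)
    r_half = np.corrcoef(odd, even)[0, 1]
    return (2 * r_half) / (1 + r_half)        # Spearman-Brown correction

# The correction applied to the half-test correlation from the text:
r_half = 0.74
print(round((2 * r_half) / (1 + r_half), 2))  # 0.85, matching the worked example
```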
Although the odd—even approach is the most common way to divide a test and will
generally produce equivalent halves, certain situations deserve special attention. For exam-
ple, if you have a test with a relatively small number of items (e.g., <8), it may be desirable
to divide the test into equivalent halves based on a careful review of item characteristics
such as content, format, and difficulty. Another situation that deserves special attention in-
volves groups of items that deal with an integrated problem (this is referred to as a testlet).
For example, if multiple questions refer to a specific diagram or reading passage, that whole
set of questions should be included in the same half of the test. Splitting integrated problems
can artificially inflate the reliability estimate (e.g., Sireci, Thissen, & Wainer, 1991).
An advantage of the split-half approach to reliability is that it can be calculated from a
single administration of a test. Also, because only one testing session is involved, this approach
reflects errors due only to content sampling and is not sensitive to time sampling errors.

Coefficient Alpha and Kuder-Richardson Reliability. Other approaches to estimating


reliability from a single administration of a test are based on formulas developed by Kuder
and Richardson (1937) and Cronbach (1951). Instead of comparing responses on two halves
of the test as in split-half reliability, this approach examines the consistency of responding
to all the individual items on the test. Reliability estimates produced
with these formulas can be thought of as the average of all possible split-half coefficients. Like split-half reliability, these estimates are sensitive to measurement error introduced by content sampling. Additionally, they are also sensitive to the heterogeneity of the test content. When we refer to content heterogeneity, we are concerned with
the degree to which the test items measure related characteristics.
For example, our 25-item math test involving multiplying two-digit
numbers would probably be more homogeneous than a test designed

TABLE 4.2 Half-Test Coefficients and Corresponding Full-Test


Coefficients Corrected with the Spearman-Brown Formula

Half-Test Correlation Spearman-Brown Reliability

0.50 0.67
0.55 0.71
0.60 0.75
0.65 0.79
0.70 0.82
0.75 0.86
0.80 0.89
0.85 0.92
0.90 0.95
0.95 0.97

to measure both multiplication and division. An even more heterogeneous test would be
one that involves multiplication and reading comprehension, two fairly dissimilar content
domains. As discussed later, sensitivity to content heterogeneity can influence a particular
reliability formula’s use on different domains.
While Kuder and Richardson’s formulas and coefficient alpha both reflect item het-
erogeneity and errors due to content sampling, there is an important difference in terms
of application. In their original article Kuder and Richardson (1937) presented numerous
formulas for estimating reliability. The most commonly used formula is known as the
Kuder-Richardson formula 20 (KR-20). KR-20 is applicable when test items are scored
dichotomously, that is, simply right or wrong, as 0 or 1. Coefficient alpha (Cronbach, 1951)
is a more general form of KR-20 that also deals with test items that produce scores with
multiple values (e.g., 0, 1, or 2). Because coefficient alpha is more broadly applicable, it has
become the preferred statistic for estimating internal consistency (Keith & Reynolds, 1990).
Tables 4.3 and 4.4 illustrate the calculation of KR-20 and coefficient alpha, respectively.

Inter-Rater Reliability
If the scoring of a test relies on subjective judgment, it is important to evaluate the degree
of agreement when different individuals score the test. This is referred to as inter-scorer or
inter-rater reliability. Estimating inter-rater reliability is a fairly straightforward process.
The test is administered one time and two individuals independently score each test. A cor-
relation is then calculated between the scores obtained by the two scorers. This estimate
of reliability is not sensitive to error due to content or time sampling, but only reflects dif-
ferences due to the individuals scoring the test. In addition to the correlational approach,
inter-rater agreement can also be evaluated by calculating the percentage of times that two individuals assign the same scores to the performances of students. This approach is illustrated in Special Interest Topic 4.2.
On some tests, inter-rater reliability is of little concern. For example, on a test with multiple-choice or true-false items, grading is fairly straightforward and a conscientious grader should produce reliable and accurate scores. In the case of our 25-item math test, a careful grader should be able to determine whether the students' answers
are accurate and assign a score consistent with that of another careful
grader. However, for some tests inter-rater reliability is a major concern. Classroom essay
tests are a classic example. It is common for students to feel that a different teacher might
have assigned a different score to their essays. It can be argued that the teacher’s personal
biases, preferences, or mood influenced the score, not only the content and quality of the
student’s essay. Even on our 25-item math test, if the teacher required that the students “show
their work” and this influenced the students’ grades, subjective judgment might be involved
and inter-rater reliability could be a concern.

TABLE 4.3 Calculating KR-20

KR-20 is sensitive to measurement error due to content sampling and is also a measure of item
heterogeneity. KR-20 is applicable when test items are scored dichotomously, that is, simply right
or wrong, as 0 or 1. The following formula is used for calculating KR-20:

$$KR\text{-}20 = \frac{k}{k-1}\left(\frac{SD^2 - \sum p_i q_i}{SD^2}\right)$$

where k = number of items
      SD² = variance of total test scores
      p_i = proportion of correct responses on item i
      q_i = proportion of incorrect responses on item i

Consider these data for a five-item test administered to six students. Each item could receive a score of either 1 or 0.

              Item 1    Item 2    Item 3    Item 4    Item 5    Total Score
Student 1       1         0         1         1         1            4
Student 2       1         1         1         1         1            5
Student 3       1         0         1         0         0            2
Student 4       0         0         0         1         0            1
Student 5       1         1         1         1         1            5
Student 6       1         1         0         1         1            4
p_i           0.8333     0.5      0.6667    0.8333    0.6667     SD² = 2.25
q_i           0.1667     0.5      0.3333    0.1667    0.3333
p_i × q_i     0.1389     0.25     0.2222    0.1389    0.2222

Note: When calculating SD², n was used in the denominator.

Σ p_i × q_i = 0.1389 + 0.25 + 0.2222 + 0.1389 + 0.2222
Σ p_i × q_i = 0.972

KR-20 = (5/4) × ((2.25 − 0.972) / 2.25)
KR-20 = 1.25 × (1.278 / 2.25)
KR-20 = 1.25 × (0.568)
KR-20 = 0.71
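The computation in Table 4.3 can be reproduced with a few lines of Python (our sketch, using the same six-student data and, as in the table, n in the denominator of the variance):

```python
import numpy as np

# Item scores from Table 4.3: rows = students, columns = items (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
])

k = scores.shape[1]              # number of items
p = scores.mean(axis=0)          # proportion of correct responses per item
q = 1 - p                        # proportion of incorrect responses per item
sd2 = scores.sum(axis=1).var()   # total-score variance (n in the denominator)

kr20 = (k / (k - 1)) * ((sd2 - np.sum(p * q)) / sd2)
print(round(kr20, 2))            # 0.71, matching the table
```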

Reliability of Composite Scores


Psychological and educational measurement often yields multiple scores that can be com-
bined to form a composite. For example, the assignment of grades in educational settings is
often based on a composite of several tests and other assessments administered over a grad-

TABLE 4.4 Calculating Coefficient Alpha

Coefficient alpha is sensitive to measurement error due to content sampling and is also a measure of item heterogeneity. It can be applied to tests with items that are scored dichotomously or that have multiple values. The formula for calculating coefficient alpha is:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum SD_i^2}{SD^2}\right)$$

where k = number of items
      SD_i² = variance of individual items
      SD² = variance of total test scores

Consider these data for a five-item test that was administered to six students. Each item could receive a score ranging from 1 to 5.

              Item 1    Item 2    Item 3    Item 4    Item 5    Total Score
Student 1       4         3         4         5         5           21
Student 2       3         3         2         3         3           14
Student 3       2         3         2         2         1           10
Student 4       4         4         5         3         4           20
Student 5       2         3         4         2         3           14
Student 6       2         2         2         1         3           10
SD_i²         0.8056    0.3333    1.4722    1.5556    1.4722     SD² = 18.81

Note: When calculating SD_i² and SD², n was used in the denominator.

Coefficient Alpha = (5/4) × (1 − (0.8056 + 0.3333 + 1.4722 + 1.5556 + 1.4722)/18.81)
Coefficient Alpha = 1.25 × (1 − 5.6389/18.81)
Coefficient Alpha = 1.25 × (1 − 0.29978)
Coefficient Alpha = 1.25 × (0.70)
Coefficient Alpha = 0.875
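Likewise, the coefficient alpha computation in Table 4.4 can be reproduced directly (our sketch, same data and the same n-in-the-denominator convention as the table):

```python
import numpy as np

# Item scores from Table 4.4: rows = students, columns = items (each scored 1 to 5).
scores = np.array([
    [4, 3, 4, 5, 5],
    [3, 3, 2, 3, 3],
    [2, 3, 2, 2, 1],
    [4, 4, 5, 3, 4],
    [2, 3, 4, 2, 3],
    [2, 2, 2, 1, 3],
])

k = scores.shape[1]                    # number of items
item_vars = scores.var(axis=0)         # variances of individual items
total_var = scores.sum(axis=1).var()   # variance of total test scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 3))                 # 0.875, matching the table
```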

ing period or semester. Many standardized psychological instruments contain several measures that are combined to form an overall composite score. For example, the Wechsler Adult Intelligence Scale—Third Edition (Wechsler, 1997) is composed of 11 subtests used in the calculation of the Full Scale Intelligence Quotient (FSIQ). Both of these situations involve composite scores obtained by combining
the scores on several different tests or subtests. The advantage of composite scores is that the
reliability of composites is generally greater than that of the individual scores that contribute
to the composite. More precisely, the reliability of a composite is the result of the number of
scores in the composite, the reliability of the individual scores, and the correlation between
those scores. The more scores in the composite, the higher the correlation between those
SPECIAL INTEREST TOPIC 4.2

Calculating Inter-Rater Agreement

Performance assessments require test takers to complete a process or produce a product in a context
that closely resembles real-life situations. For example, a student might engage in a debate, compose
a poem, or perform a piece of music. The evaluation of these types of performances is typically
based on scoring rubrics that specify what aspects of the student’s performance should be considered
when providing a score or grade. The scoring of these types of assessments obviously involves the
subjective judgment of the individual scoring the performance, and as a result inter-rater reliability
is a concern. As noted in the text one approach to estimating inter-rater reliability is to calculate the
correlation between the scores that are assigned by two judges. Another approach is to calculate the
percentage of agreement between the judges’ scores.
Consider an example wherein two judges rated poems composed by 25 students. The poems
were scored from 1 to 5 based on criteria specified in a rubric, with 1 being the lowest performance
and 5 being the highest. The results are illustrated in the following table:

Ratings of Rater 1

Ratings of Rater 2          1        2        3        4        5

        5                   0        0        1        2        4
        4                   0        0        2        3        2
        3                   0        2        3        1        0
        2                   1        1        1        0        0
        1                   1        1        0        0        0

Once the data are recorded you can calculate inter-rater agreement with the following formula:

$$\text{Inter-Rater Agreement} = \frac{\text{Number of Cases Assigned the Same Scores}}{\text{Total Number of Cases}} \times 100$$

In our example the calculation would be:

Inter-Rater Agreement = 12/25 x 100


Inter-Rater Agreement = 48%

This degree of inter-rater agreement might appear low to you, but this would actually be re-
spectable for a classroom test. In fact the Pearson correlation between these judges’ ratings is 0.80
(better than many, if not most, performance assessments).
Instead of requiring the judges to assign the exact same score for agreement, some authors
suggest the less rigorous criterion of scores being within one point of each other (e.g., Linn & Gron-
lund, 2000). If this criterion were applied to these data, the modified agreement percent would be
96% because only one pair of the judges' scores was not within one point of each other (Rater 1 assigned a 3 and Rater 2 a 5).
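Both percentages can be computed directly from the table of paired ratings. The Python sketch below (ours) tallies exact agreement and within-one-point agreement from the frequencies shown above:

```python
import numpy as np

# Frequencies from the example: rows = Rater 2 ratings (5 down to 1),
# columns = Rater 1 ratings (1 through 5).
counts = np.array([
    [0, 0, 1, 2, 4],   # Rater 2 assigned a 5
    [0, 0, 2, 3, 2],   # Rater 2 assigned a 4
    [0, 2, 3, 1, 0],   # Rater 2 assigned a 3
    [1, 1, 1, 0, 0],   # Rater 2 assigned a 2
    [1, 1, 0, 0, 0],   # Rater 2 assigned a 1
])

rater2 = np.arange(5, 0, -1)   # row labels: 5, 4, 3, 2, 1
rater1 = np.arange(1, 6)       # column labels: 1, 2, 3, 4, 5

total = counts.sum()
exact = sum(counts[i, j] for i in range(5) for j in range(5)
            if rater2[i] == rater1[j])
within_one = sum(counts[i, j] for i in range(5) for j in range(5)
                 if abs(rater2[i] - rater1[j]) <= 1)

print(round(100 * exact / total))        # 48 (% exact agreement)
print(round(100 * within_one / total))   # 96 (% agreement within one point)
```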
We caution you not to expect this high a rate of agreement should you examine the inter-rater
agreement of your own performance assessments. In fact you will learn later that difficulty scoring
performance assessments in a reliable manner is one of the major limitations of these procedures.

scores, and the higher the individual reliabilities, the higher the composite reliability. As
we noted, tests are simply samples of the test domain, and combining multiple measures is
analogous to increasing the number of observations or the sample size.
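This claim is easy to check with a small simulation (ours, not the authors'): several parallel measures of the same true score are averaged, and the squared correlation of each score with the true score, which approximates its reliability, is clearly higher for the composite than for any single measure.

```python
import numpy as np

rng = np.random.default_rng(1)

n_students, n_tests = 50_000, 4
true_score = rng.normal(0, 1, n_students)

# Each test = true score + its own independent error (parallel measures).
tests = true_score[:, None] + rng.normal(0, 1, (n_students, n_tests))
composite = tests.mean(axis=1)

# Squared correlation with the true score approximates reliability.
single_rel = np.corrcoef(tests[:, 0], true_score)[0, 1] ** 2
composite_rel = np.corrcoef(composite, true_score)[0, 1] ** 2
print(round(single_rel, 2), round(composite_rel, 2))
# Roughly 0.50 versus 0.80, consistent with the Spearman-Brown prediction
# 4 * 0.5 / (1 + 3 * 0.5) = 0.80 for a composite of four parallel tests.
```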

Selecting a Reliability Coefficient


Table 4.5 summarizes the sources of measurement error reflected in different reliability co-
efficients. As we have suggested in our discussion of each approach to estimating reliability,
different conditions call for different estimates of reliability. One should consider factors
such as the nature of the construct and how the scores will be used when selecting an esti-
mate of reliability. If a test is designed to be given more than one time
to the same individuals, test-retest and alternate-form reliability with delayed administration are appropriate because they are sensitive to measurement errors resulting from time sampling. Accordingly, if a test is used to predict an individual's performance on a criterion in the future, it is also important to use a reliability estimate that reflects errors due to time sampling.
When a test is designed to be administered only one time, an
estimate of internal consistency is appropriate. As we noted, split-
half reliability estimates error variance resulting from content sampling whereas coefficient
alpha and KR-20 estimate error variance due to content sampling and content heteroge-
neity. Because KR-20 and coefficient alpha are sensitive to content heterogeneity, they
are applicable when the test measures a homogeneous domain of knowledge or a unitary
characteristic. For example, our 25-item test measuring the ability to multiply double digits
reflects a homogeneous domain and coefficient alpha would provide a good estimate of
reliability. However, if we have a 50-item test, 25 measuring multiplication with double
digits and 25 measuring division, the domain is more heterogeneous and coefficient alpha

TABLE 4.5 Sources of Error Variance Associated with the Major Types of Reliability

Type of Reliability                  Error Variance

Test-retest reliability              Time sampling

Alternate-form reliability
  Simultaneous administration        Content sampling
  Delayed administration             Time sampling and content sampling

Split-half reliability               Content sampling

Coefficient alpha and KR-20          Content sampling and item heterogeneity

Inter-rater reliability              Differences due to raters/scorers



and KR-20 would probably underestimate reliability. In the situation of a test with hetero-
geneous content (the heterogeneity is intended and not a mistake), the split-half method is
preferred. Because the goal of the split-half approach is to compare two equivalent halves,
it would be necessary to ensure that each half has equal numbers of both multiplication and
division problems.
We have been focusing on tests of achievement when providing examples, but the same
principles apply to other types of tests. For example, a test that measures depressed mood
may assess a fairly homogeneous domain, making the use of coefficient alpha or KR-20 ap-
propriate. However, if the test measures depression, anxiety, anger, and impulsiveness, the
content becomes more heterogeneous and the split-half estimate would be indicated. In this
situation, the split-half approach would allow the construction of two equivalent halves with
equal numbers of items reflecting the different traits or characteristics under investigation.
Naturally, if different forms of a test are available, it would be important to estimate
alternate-form reliability. If a test involves subjective judgment by the person scoring the
test, inter-rater reliability is important. Many contemporary test manuals report multiple
estimates of reliability. Given enough information about reliability, one can partition the
error variance into its components, as demonstrated in Figure 4.1.

[Figure not reproduced: the original shows variance partitioned into components, with labels including sampling error and inter-rater differences.]

FIGURE 4.1 Partitioning the Variance to Reflect Sources of Variance

Evaluating Reliability Coefficients


Another important question that arises when considering reliability coefficients is “How
large do reliability coefficients need to be?” Remember, we said reliability coefficients can
be interpreted in terms of the proportion of test score variance attributable to true variance.
Ideally we would like our reliability coefficients to equal 1.0 because
this would indicate that 100% of the test score variance is due to true differences between individuals. However, due to measurement error, perfectly reliable measurement does not exist. There is not a single, simple answer to our question about what is an acceptable level of reliability. What constitutes an acceptable reliability coefficient depends on several factors, including the construct being measured, the amount of time available for testing, the way the scores will be used, and the method of estimating reliability. We will now
briefly address each of these factors.

Construct. Some constructs are more difficult to measure than others simply because the
item domain is more difficult to sample adequately. As a general rule, personality variables
are more difficult to measure than academic knowledge. As a result, what might be an ac-
ceptable level of reliability for a measure of “dependency” might be regarded as unaccept-
able for a measure of reading comprehension. In evaluating the acceptability of a reliability
coefficient one should consider the nature of the variable under investigation and how dif-
ficult it is to measure. By carefully reviewing and comparing the reliability estimates of
different instruments available for measuring a construct, one can determine which is the
most reliable measure of the construct.

Time Available for Testing. If the amount of time available for testing is limited, only
a limited number of items can be administered and the sampling of the test domain is open
to greater error. This could occur in a research project in which the school principal allows
you to conduct a study in his or her school but allows only 20 minutes to measure all the
variables in your study. As another example, consider a districtwide screening for reading
problems wherein the budget allows only 15 minutes of testing per student. In contrast, a
psychologist may have two hours to administer a standardized intelligence test individually.
It would be unreasonable to expect the same level of reliability from these significantly dif-
ferent measurement processes. However, comparing the reliability coefficients associated
with instruments that can be administered within the parameters of the testing situation can
help one select the best instrument for the situation.

Test Score Use. The way the test scores will be used is another major consideration
when evaluating the adequacy of reliability coefficients. Diagnostic tests that form the
basis for major decisions about individuals should be held to a higher standard than tests
used with group research or for screening large numbers of individuals. For example,
an individually administered test of intelligence that is used in the diagnosis of mental
retardation would be expected to produce scores with a very high level of reliability. In

this context, performance on the intelligence test provides critical information used to
determine whether the individual meets the diagnostic criteria. In contrast, a brief test
used to screen all students in a school district for reading problems would be held to less
rigorous standards. In this situation, the instrument is used simply for screening purposes
and no decisions are being made that cannot easily be reversed. It helps to remember that
although high reliability is desirable with all assessments, standards of acceptability vary
according to the way test scores will be used. High-stakes decisions demand highly reli-
able information!

Method of Estimating Reliability. The size of reliability coefficients is also related to


the method selected to estimate reliability. Some methods tend to produce higher estimates
than other methods. As a result, it is important to take into consideration the method used to
produce correlation coefficients when evaluating and comparing the reliability of different
tests. For example, KR-20 and coefficient alpha typically produce reliability estimates that
are smaller than ones obtained using the split-half method. As indicated in Table 4.5, alter-
nate-form reliability with delayed administration takes into account more sources of error
than other methods do and generally produces lower reliability coefficients. In summary,
some methods of estimating reliability are more rigorous and tend to produce smaller coef-
ficients, and this variability should be considered when evaluating reliability coefficients.

General Guidelines. Although it is apparent that many factors deserve consideration when evaluating reliability coefficients, we can offer some general guidelines.

■ If a test is being used to make important decisions that are likely to significantly impact individuals and are not easily reversed, it is reasonable to expect reliability coefficients of 0.90 or even 0.95. This level of reliability is regularly obtained with individually administered tests of intelligence. For example, the reliability of the Wechsler Adult Intelligence Scale—Third Edition (Wechsler, 1997), an individually administered intelligence test, is 0.98.

■ Reliability estimates of 0.80 or more are considered acceptable in many testing situations and are commonly reported for group and individually administered achievement and personality tests. For example, the California Achievement Test/5 (CAT/5) (CTB/Macmillan/McGraw-Hill, 1993), a set of group-administered achievement tests frequently used in public schools, has reliability coefficients that exceed 0.80 for most of its subtests.

■ For teacher-made classroom tests and tests used for screening, reliability estimates of at least 0.70 are expected. Classroom tests are frequently combined to form linear composites that determine a final grade, and the reliability of these composites is expected to be greater than the reliabilities of the individual tests. Marginal coefficients in the 0.70s might also be acceptable when more thorough assessment procedures are available to address concerns about individual cases.

■ Some writers suggest that reliability coefficients as low as 0.60 are acceptable for group
research, performance assessments, and projective measures, but we are reluctant to endorse
the use of any assessment that produces scores with reliability estimates below 0.70. As you
recall, a reliability coefficient of 0.60 indicates that 40% of the observed variance can be
attributed to random error. How much confidence can you place in assessment results when
you know that 40% of the variance is attributed to random error?
The preceding guidelines on reliability coefficients and qualitative judgments of their
magnitude must also be considered in context. Some constructs are just a great deal more
difficult to measure reliably than others. From a developmental perspective, we know that
emerging skills or behavioral attributes in children are more difficult to measure than mature
or developed skills. When a construct is very difficult to measure, any reliability coefficient
greater than 0.50 may well be acceptable just because there is still more true score variance
present in such values relative to error variance. However, before choosing measures with
reliability coefficients below 0.70, be sure no better measuring instruments are available
that are also practical and whose interpretations have validity evidence associated with the
intended purposes of the test.

How to Improve Reliability


A natural question at this point is "What can we do to improve the reliability of our assessment results?" In essence we are asking what steps can be taken to maximize true score variance and minimize error variance. Probably the most obvious approach is simply to increase the number of items on a test. In the context of an individual test, if we increase the number of items while maintaining the same quality as the original items, we will increase the reliability of the test. This concept was introduced when we discussed split-half reliability and presented the Spearman-Brown formula. In fact, a variation of the Spearman-Brown formula can be used to predict the effects on reliability achieved by adding items:

$$r = \frac{n \times r_{xx}}{1 + (n - 1)\, r_{xx}}$$

where r = estimated reliability of the test with new items
      n = factor by which the test length is increased
      r_xx = reliability of the original test

For instance, consider the example of our 25-item math test. If the reliability of the
test were 0.80 and we wanted to estimate the increase in reliability we would achieve by
increasing the test to 30 items (a factor of 1.2), the formula would be:

$$r = \frac{1.2 \times 0.80}{1 + (1.2 - 1)(0.80)}$$

$$r = \frac{0.96}{1.16}$$

$$r = 0.83$$

Table 4.6 provides other examples illustrating the effects of increasing the length of
our hypothetical test on reliability. By looking in the first row of this table you see that in-
creasing the number of items on a test with a reliability of 0.50 by a factor of 1.25 results in a
predicted reliability of 0.56. Increasing the number of items by a factor of 2.0 (i.e., doubling
the length of the test) increases the reliability to 0.67.
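The short Python function below (our sketch) applies the formula above and reproduces the first row of Table 4.6:

```python
def predicted_reliability(r_original, n):
    """Spearman-Brown prediction of reliability when the test length
    is changed by a factor of n."""
    return (n * r_original) / (1 + (n - 1) * r_original)

# Reproduce the first row of Table 4.6 (current reliability = 0.50).
for factor in (1.25, 1.50, 2.0, 2.5):
    print(factor, f"{predicted_reliability(0.50, factor):.2f}")
# Prints each factor with the predicted reliability: 0.56, 0.60, 0.67, 0.71.
```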
In some situations various factors will limit the number of items we can include in
a test. For example, teachers generally develop tests that can be administered in a specific
time interval, usually the time allocated for a class period. In these situations, one can
enhance reliability by using multiple measurements that are combined for an average or
composite score. As noted earlier, combining multiple tests in a linear composite will
increase the reliability of measurement over that of the component tests. In summary,
anything we do to get a more adequate sampling of the content domain will increase the
reliability of our measurement.
In Chapter 6 we will discuss a set of procedures collectively referred to as “item anal-
yses.” These procedures help us select, develop, and retain test items with good measure-
ment characteristics. While it is premature to discuss these procedures in detail, it should
be noted that selecting or developing good items is an important step in developing a good
test. Selecting and developing good items will enhance the measurement characteristics of
the assessments you use.
Another way to reduce the effects of measurement error is what Ghiselli, Campbell,
and Zedeck (1981) refer to as “good housekeeping procedures.” By this they mean test
developers should provide precise and clearly stated procedures regarding the administra-

TABLE 4.6 Reliability Expected when Increasing the Number of Items

The Reliability Expected when the Number of Items Is Increased By:


Current Reliability          × 1.25        × 1.50        × 2.0        × 2.5
0.50 0.56 0.60 0.67 0.71
0.55 0.60 0.65 0.71 0.75
0.60 0.65 0.69 0.75 0.79
0.65 0.70 0.74 0.79 0.82
0.70 0.74 0.78 0.82 0.85
0.75 0.79 0.82 0.86 0.88
0.80 0.83 0.86 0.89 0.91
0.85 0.88 0.89 0.92 0.93
0.90 0.92 0.93 0.95 0.96

tion and scoring of tests. Examples include providing explicit instructions for standardized
administration, developing high-quality rubrics to facilitate reliable scoring, and requiring
extensive training before individuals can administer, grade, or interpret a test.

Special Problems in Estimating Reliability


Reliability of Speed Tests. A speed test generally contains items that are relatively easy
but has a time limit that prevents any test takers from correctly answering all questions.
As a result, the test taker’s score on a speed test primarily reflects
the speed of performance. When estimating the reliability of the results of speed tests, estimates derived from a single administration of a test are not appropriate. Therefore, with speed tests, test-retest or alternate-form reliability is appropriate, but split-half, coefficient alpha, and KR-20 should be avoided.

Reliability as a Function of Score Level. Although uniform precision would be desirable, tests do not always measure with the same degree of precision throughout the full range of scores. If a group of
individuals is tested for whom the test is either too easy or too difficult, we are likely to have
additional error introduced into the scores. At the extremes of the distribution, at which scores
reflect either all correct or all wrong responses, little accurate measurement has occurred. It
would be inaccurate to infer that a child who missed every question on an intelligence test
has “no” intelligence. Rather, the test did not adequately assess the low-level skills neces-
sary to measure the child’s intelligence. This is referred to as the test having an insufficient
“floor.” At the other end, it would be inaccurate to report that a child who answers all of the
questions on an intelligence test correctly has an “infinite level of intelligence.” The test is
simply too easy to provide an adequate measurement, a situation referred to as a test having
an insufficient “ceiling.” In both cases we need a more appropriate test. Generally, aptitude
and achievement tests are designed for use with individuals of certain ability levels. When a
test is used with individuals who fall either at the extremes or outside this range, the scores
might not be as accurate as the reliability estimates suggest. In these situations, further study
of the reliability of scores at this level is indicated.

Range Restriction. The values we obtain when calculating reliability coefficients are
dependent on characteristics of the sample or group of individuals on which the analyses
are based. One characteristic of the sample that significantly impacts the coefficients is the
degree of variability in performance (i.e., variance). More precisely, reliability coefficients
based on samples with large variances (referred to as heterogeneous samples) will generally
produce higher estimates of reliability than those based on samples with small variances
(referred to as homogeneous samples). When reliability coefficients are based on a sample
with a restricted range of variability, the coefficients may actually underestimate the reli-
ability of measurement. For example, if you base a reliability analysis on students in a gifted
and talented class in which practically all of the scores reflect exemplary performance (e.g.,
>90% correct), you will receive lower estimates of reliability than if the analyses are based
on a class with a broader and more nearly normal distribution of scores.

Mastery Testing. Criterion-referenced tests are used to make interpretations relative to a specific level of performance. Mastery testing is an example of a criterion-referenced test by which a test taker's performance is evaluated in terms of achieving a cut score instead of the degree of achievement. The emphasis in this testing situation is on classification. Either test takers score at or above the cut score and are classified as having mastered the skill or domain, or they score below the cut score and are classified as having not mastered the skill or domain. Mastery testing often results in limited variability among test takers, and, as we just described, limited variability in
performance results in small reliability coefficients. As a result, the
reliability estimates discussed in this chapter are typically inadequate for assessing the reli-
ability of mastery test scores. Given the emphasis on classification, a recommended approach
is to use an index that reflects the consistency of classification (AERA et al., 1999). Special
Interest Topic 4.3 illustrates a useful procedure for evaluating the consistency of classifica-
tion when using mastery tests.

The Standard Error of Measurement

Reliability coefficients are interpreted in terms of the proportion of observed variance attributable to true variance and are a useful way of comparing the reliability of scores produced by different assessment procedures. Other things being equal, you will want to select the test that produces scores with the best reliability. However, once a test has been selected and the focus is on interpreting scores, the standard error of measurement (SEM) is a more practical statistic. The SEM
is the standard deviation of the distribution of scores that would be
obtained by one person if he or she were tested on an infinite number of parallel forms of a
test comprised of items randomly sampled from the same content domain. In other words, if
we created an infinite number of parallel forms of a test and had the same person take them
with no carryover effects, the presence of measurement error would prevent the person from
earning the same score every time. Although each test might represent the content domain
equally well, the test taker would perform better on some tests and worse on others simply
due to random error. By taking the scores obtained on all of these tests, a distribution of scores
would result. The mean of this distribution is the individual’s true score (T) and the SEM is the
standard deviation of this distribution of error score. Obviously, we are never actually able to
follow these procedures and must estimate the SEM using information available to us.

Evaluating the Standard Error of Measurement

The SEM is a function of the reliability (r_xx) and standard deviation (SD) of a test. When calculating the SEM, the reliability coefficient takes into consideration measurement errors present in test scores, and the SD reflects the variability of the scores in the distribution.



SPECIAL INTEREST TOPIC 4.3


Consistency of Classification with Mastery Tests

As noted in the text, the size of reliability coefficients is substantially affected by the variance of
the test scores. Limited test score variance results in lower reliability coefficients. Because mastery
tests often do not produce test scores with much variability, the methods of estimating reliability
described in this chapter will often underestimate the reliability of these tests. To address this, reli-
ability analyses of mastery tests typically focus on the consistency of classification. That is, because
the objective of mastery tests is to determine if a student has mastered the skill or knowledge domain,
the question of reliability can be framed as one of how consistent mastery—nonmastery classifica-
tions are. For example, if two parallel or equivalent mastery tests covering the same skill or content
domain consistently produce the same classifications for the same test takers (i.e., mastery versus
nonmastery), we would have evidence of consistency of classification. If two parallel mastery tests
produced divergent classifications, we would have cause for concern. In this case the test results are
not consistent or reliable.
The procedure for examining the consistency of classification on parallel mastery tests is
fairly straightforward. Simply administer both tests to a group of students and complete a table like
the one that follows. For example, consider two mathematics mastery tests designed to assess stu-
dents’ ability to multiply fractions. The cut score is set at 80%, so all students scoring 80% or higher
are classified as having mastered the skill while those scoring less than 80% are classified as not
having mastered the skill. In the following example, data are provided for 50 students:

                              Form B: Nonmastery       Form B: Mastery
                              (score <80%)             (score of 80% or better)

Form A: Mastery
(score of 80% or better)              4                        32

Form A: Nonmastery
(score <80%)                         11                         3

Students classified as achieving mastery on both tests are denoted in the upper right-hand
cell while students classified as not having mastered the skill are denoted in the lower left-hand cell.
There were four students who were classified as having mastered the skills on Form A but not on
Form B (denoted in the upper left-hand cell). There were three students who were classified as hav-
ing mastered the skills on Form B but not on Form A (denoted in the lower right-hand cell). The next
step is to calculate the percentage of consistency by using the following formula:

Percent Consistency = [(Mastery on Both Forms + Nonmastery on Both Forms) / Total Number of Students] × 100

Percent Consistency = [(32 + 11) / 50] × 100

Percent Consistency = 0.86 × 100

Percent Consistency = 86%


This approach is limited to situations in which you have parallel mastery tests. Another limitation is
that there are no clear standards regarding what constitutes “acceptable” consistency of classifica-
tion. As with the evaluation of all reliability information, the evaluation of classification consistency
should take into consideration the consequences of any decisions that are based on the test results
(e.g., Gronlund, 2003). If the test results are used to make high-stakes decisions (e.g., awarding a
diploma), a very high level of consistency is required. If the test is used only for low-stakes decisions
(e.g., failure results in further instruction and retesting), a lower level of consistency may be accept-
able. Subkoviak (1984) provides a good discussion of several techniques for estimating the clas-
sification consistency of mastery tests, including some rather sophisticated approaches that require
only a single administration of the test.
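For readers who prefer to script the calculation, here is a minimal Python sketch of the percent-consistency index described above. The function name is ours, and the counts are simply the ones from the example table.

    def percent_consistency(mastery_both, nonmastery_both, total_students):
        """Percentage of students classified the same way (mastery or nonmastery) on both forms."""
        return (mastery_both + nonmastery_both) / total_students * 100

    # Counts from the example: 32 students mastered on both forms, 11 were
    # nonmastery on both forms, and 4 + 3 were classified inconsistently.
    print(percent_consistency(mastery_both=32, nonmastery_both=11, total_students=50))  # 86.0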

The SEM is estimated using the following formula:

SEM = SD √(1 − r_xx)

where SD = the standard deviation of the obtained scores
      r_xx = the reliability of the test

Let’s work through two quick examples. First, let’s assume a test with a standard
deviation of 10 and reliability of 0.90.

Example 1:  SEM = 10 √(1 − 0.90)
            SEM = 10 √0.10
            SEM = 3.2

Now let’s assume a test with a standard deviation of 10 and reliability of 0.80. The SD
is the same as in the previous example, but the reliability is lower.

Example 2:  SEM = 10 √(1 − 0.80)
            SEM = 10 √0.20
            SEM = 4.5
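The two worked examples are easy to verify with a few lines of code. The sketch below (ours) is just the SEM formula expressed as a Python function; the rounding matches the values shown above.

    import math

    def standard_error_of_measurement(sd, reliability):
        """SEM = SD * sqrt(1 - r_xx)."""
        return sd * math.sqrt(1 - reliability)

    print(round(standard_error_of_measurement(10, 0.90), 1))  # 3.2 (Example 1)
    print(round(standard_error_of_measurement(10, 0.80), 1))  # 4.5 (Example 2)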

Notice that as the reliability of the test scores decreases, the SEM increases.
Because the
reliability coefficient reflects the proportion of observed score variance
due to true score
variance and the SEM is an estimate of the amount of error in test
scores, this inverse
relationship is what one would expect. The greater the reliability of test scores, the smaller
the SEM and the more confidence we have in the precision of test scores. The lower the
reliability of a test, the larger the SEM and the less confidence we have in the precision of
test scores. Table 4.7 shows the SEM as a function of SD and reliability.

TABLE 4.7 Standard Errors of Measurement for Values
of Reliability and Standard Deviation

                              Reliability Coefficients
Standard
Deviation      0.95     0.90     0.85     0.80     0.75     0.70

   30           6.7      9.5     11.6     13.4     15.0     16.4
   28           6.3      8.9     10.8     12.5     14.0     15.3
   26           5.8      8.2     10.1     11.6     13.0     14.2
   24           5.4      7.6      9.3     10.7     12.0     13.1
   22           4.9      7.0      8.5      9.8     11.0     12.0
   20           4.5      6.3      7.7      8.9     10.0     11.0
   18           4.0      5.7      7.0      8.0      9.0      9.9
   16           3.6      5.1      6.2      7.2      8.0      8.8
   14           3.1      4.4      5.4      6.3      7.0      7.7
   12           2.7      3.8      4.6      5.4      6.0      6.6
   10           2.2      3.2      3.9      4.5      5.0      5.5
    8           1.8      2.5      3.1      3.6      4.0      4.4
    6           1.3      1.9      2.3      2.7      3.0      3.3
    4           0.9      1.3      1.5      1.8      2.0      2.2
    2           0.4      0.6      0.8      0.9      1.0      1.1

Examining the first row in the table shows that on a test with a standard deviation of 30 and a reliability
coefficient of 0.95 the SEM is 6.7. In comparison, if the reliability of the test score is 0.90
the SEM is 9.5; if the reliability of the test is 0.85 the SEM is 11.6; and so forth. The SEM
is used in calculating intervals or bands around observed scores in which the true score is
expected to fall. We will now turn to this application of the SEM.

Calculating Confidence Intervals. A confidence interval reflects a range of scores that will contain the individual's true score with a prescribed probability (AERA et al., 1999). We use the SEM to calculate confidence intervals. When introducing the SEM, we said it provides information about the distribution of observed scores around true scores. More precisely, we defined the SEM as the standard deviation of the distribution of error scores. Like any standard deviation, the SEM can be interpreted in terms of frequencies represented in a normal distribution. In the previous chapter we showed that approximately 68% of the scores in a normal distribution are located between one SD below the mean and one SD above the mean. As a result, approximately 68% of the time an individual's observed score would be expected to be within ±1 SEM of the true score. For example, if an individual had a true score of 70 on a test with a SEM of 3,
then we would expect him or her to obtain scores between 67 and 73 two-thirds of the time.
To obtain a 95% confidence interval we simply determine the number of standard devia-
tions encompassing 95% of the scores in a distribution. By referring to a table representing
areas under the normal curve (see Appendix F), you can determine that 95% of the scores
in a normal distribution fall within ±1.96 SD of the mean. Given a true score of 70 and SEM
of 3, the 95% confidence interval would be 70 ± 3(1.96) or 70 ± 5.88. Therefore, in this
situation an individual’s observed score would be expected to be between 64.12 and 75.88
95% of the time.
You might have noticed a potential problem with this approach to calculating confi-
dence intervals. So far we have described how the SEM allows us to form confidence in-
tervals around the test taker’s true score. The problem is that we don’t know a test taker’s
true score, only the observed score. Although it is possible for us to estimate true scores
(see Nunnally & Bernstein, 1994), it is common practice to use the SEM to establish con-
fidence intervals around obtained scores (see Gulliksen, 1950). These confidence inter-
vals are calculated in the same manner as just described, but the interpretation is slightly
different. In this context the confidence interval is used to define the range of scores that
will contain the individual’s true score. For example, if an individual obtains a score of 70
on a test with a SEM of 3.0, we would expect his or her true score to be between 67 and 73
(obtained score ±1 SEM) 68% of the time. Accordingly, we would expect his or her true
score to be between 64.12 and 75.88 95% of the time (obtained score ±1.96 SEM).
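A short function makes the arithmetic behind these confidence intervals explicit. This is our own sketch of the calculation just described; the z values (1.00 for roughly 68%, 1.96 for 95%) come from the normal curve table.

    def confidence_interval(obtained_score, sem, z=1.96):
        """Band around an obtained score that is expected to contain the true score."""
        margin = z * sem
        return obtained_score - margin, obtained_score + margin

    low, high = confidence_interval(70, 3.0, z=1.00)
    print(round(low, 2), round(high, 2))   # 67.0 73.0   (about 68% confidence)

    low, high = confidence_interval(70, 3.0, z=1.96)
    print(round(low, 2), round(high, 2))   # 64.12 75.88 (95% confidence)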
It may help to make note of the relationship between the reliability of the test score,
the SEM, and confidence intervals. Remember that we noted that as the reliability of scores
increases the SEM decreases. The same relationship exists between test reliability and confi-
dence intervals. As the reliability of test scores increases (denoting less measurement error),
the confidence intervals become smaller (denoting more precision in measurement).
A major advantage of the SEM and the use of confidence intervals is that they serve to remind us that measurement error is present in all scores and that we should interpret scores cautiously. A single numerical score is often interpreted as if it is precise and involves no error. For example, if you report that Susie has a Full Scale IQ of 113, her parents might interpret this as implying that Susie's IQ is exactly 113. If you are using a high-quality IQ test such as the Wechsler Intelligence Scale for Children—4th Edition or the Reynolds Intellectual Assessment Scales, the obtained IQ is very likely a good estimate of her true IQ. However, even with the best assessment instruments the obtained scores contain some degree of error and the SEM and confidence intervals help us illustrate this. This information can be reported in different ways in written reports. For example, Kaufman and Lichtenberger (1999) recommend the following format:

Susie obtained a Full Scale IQ of 113 (between 108 and 118 with 95% confidence).

Kamphaus (2001) recommends a slightly different format:

Susie obtained a Full Scale IQ in the High Average range, with a 95% probability that her true IQ falls between 108 and 118.

Regardless of the exact format used, the inclusion of confidence intervals highlights
the fact that test scores contain some degree of measurement error and should be interpreted
with caution. Most professional test publishers either report scores as bands within which
the test taker’s true score is likely to fall or provide information on calculating these confi-
dence intervals.

Reliability: Practical Strategies for Teachers


Now that you are aware of the importance of the reliability of measurement, a natural
question is “How can I estimate the reliability of scores on my classroom tests?” Most
teachers have a number of options. First, if you use multiple-choice or other tests that can be scored by a computer scoring program, the score printout will typically report some reliability estimate (e.g., coefficient alpha or KR-20). If you do not have access to computer scoring, but the items on a test are of approximately equal difficulty and scored dichotomously (i.e., correct/incorrect), you can use an internal consistency reliability estimate known as the Kuder-Richardson formula 21 (KR-21). This formula is actually an estimate of the KR-20 discussed earlier and is usually adequate for classroom tests. To calculate KR-21 you need to know only the mean, variance, and number of items on the test:

KR-21 = 1 − [X(n − X)] / (nσ²)

where X = mean of the test scores
      σ² = variance of the test scores
      n = number of items

Consider the following set of 20 scores: 50, 48, 47, 46, 42, 42, 41, 40, 40, 38, 37, 36,
36, 35, 34, 32, 32, 31, 30, and 28. Here X = 38.25, σ² = 39.8, and n = 50 (the number of items). Therefore,

KR-21 = 1 − [38.25(50 − 38.25)] / [50(39.8)]

KR-21 = 1 − 449.4375 / 1990

KR-21 = 1 − 0.23 = 0.77

As you see, this is a fairly simple procedure. If you have access to a computer with a spread-
sheet program or a calculator with mean and variance functions, you can estimate the reli-
ability of a classroom test easily in a matter of minutes with this formula.
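The same KR-21 computation can also be scripted. The sketch below is ours; it uses the 20 scores from the example, computes the variance with n − 1 in the denominator (as the example does), and treats n = 50 as the number of items on the test.

    import statistics

    def kr21(mean, variance, n_items):
        """KR-21 estimate: 1 - [mean * (n_items - mean)] / (n_items * variance)."""
        return 1 - (mean * (n_items - mean)) / (n_items * variance)

    scores = [50, 48, 47, 46, 42, 42, 41, 40, 40, 38,
              37, 36, 36, 35, 34, 32, 32, 31, 30, 28]

    mean = statistics.mean(scores)          # 38.25
    variance = statistics.variance(scores)  # about 39.8

    print(round(kr21(mean, variance, n_items=50), 2))  # 0.77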
Special Interest Topic 4.4 presents a shortcut approach for calculating the Kuder-Richardson formula 21 (KR-21). If you want to avoid even these limited computations, we prepared Table 4.8, which allows you to estimate the KR-21 reliability for dichotomously scored classroom tests if you know the standard deviation and number of items (this table was modeled after tables originally presented by Diederich, 1973).

SPECIAL INTEREST TOPIC 4.4


A Quick Way to Estimate Reliability for Classroom Exams

Saupe (1961) provided a quick method for teachers to calculate reliability for a classroom exam in
the era prior to easy access to calculators or computers. It is appropriate for a test in which each item
is given equal weight and each item is scored either right or wrong. First, the standard deviation of
the exam must be estimated from a simple approximation:

SD = [sum of top 1/6th of scores − sum of bottom 1/6th of scores] / [(total number of scores − 1) / 2]

Then reliability can be estimated from:

Reliability = 1 − [0.19 × number of items] / SD²

Thus, for example, in a class with 24 student test scores, the top one-sixth of the scores are 98, 92,
87, and 86, while the bottom sixth of the scores are 48, 72, 74, and 75. With 25 test items, the cal-
culations are:

SD = [98 + 92 + 87 + 86 − 48 − 72 − 74 − 75] / [23 / 2]
   = [363 − 269] / 11.5
   = 94 / 11.5 = 8.17
So,

Reliability = 1 − [0.19 × 25] / 8.17²
            = 1 − 0.07
            = 0.93

A reliability coefficient of 0.93 for a classroom test is excellent! Don’t be dismayed if your class-
room tests do not achieve this high a level of reliability.

Source: Saupe, J. L. (1961). Some useful estimates of the Kuder-Richardson formula number 20 reliability coefficient. Educational and Psychological Measurement, 21, 63–72.
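Saupe's shortcut translates directly into code. The sketch below is ours; because only the top and bottom sixths of the scores enter the estimate, the 16 middle scores in the example are hypothetical placeholders.

    def saupe_reliability(scores, n_items):
        """Quick KR-20 approximation for a dichotomously scored classroom test (Saupe, 1961)."""
        scores = sorted(scores)
        sixth = len(scores) // 6                   # size of the top and bottom sixths
        sd = (sum(scores[-sixth:]) - sum(scores[:sixth])) / ((len(scores) - 1) / 2)
        return 1 - (0.19 * n_items) / sd ** 2

    # 24 scores in all: the four lowest and four highest match the example above.
    scores = [48, 72, 74, 75] + [80] * 16 + [86, 87, 92, 98]
    print(round(saupe_reliability(scores, n_items=25), 2))  # about 0.93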

TABLE 4.8 KR-21 Reliability Estimates for Tests
with a Mean of 80%

                             Standard Deviation of Test
Number of Items (n)       0.10(n)      0.15(n)      0.20(n)

         10                  —           0.29         0.60
         20                 0.20         0.64         0.80
         30                 0.47         0.76         0.87
         40                 0.60         0.82         0.90
         50                 0.68         0.86         0.92
         75                 0.79         0.91         0.95
        100                 0.84         0.93         0.96

This table is appropriate
for tests with a mean of approximately 80% correct (we are using a mean of 80% correct
because it is fairly representative of many classroom tests). To illustrate its application, con-
sider the following example. If your test has 50 items and an SD of 8, select the “Number of
Items” row for 50 items and the “Standard Deviation” column for 0.15(n), because 0.15(50)
= 7.5, which is close to your actual SD of 8. The number at the intersection is 0.86, which
is a very respectable reliability for a classroom test (or a professionally developed test for
that matter).
If you examine Table 4.8, you will likely detect a few fairly obvious trends. First, the
more items on the test the higher the estimated reliability coefficients. We alluded to the
beneficial impact of increasing test length previously in this chapter and the increase in reli-
ability is due to enhanced sampling of the content domain. Second, tests with larger standard
deviations (i.e., variance) produce more reliable results. For example, a 30-item test with
an SD of 3—i.e., 0.10(n)—results in an estimated reliability of 0.47, while one with an SD
of 4.5—i.e., 0.15(n)—results in an estimated reliability of 0.76. This reflects the tendency
we described earlier that restricted score variance results in smaller reliability coefficients.
We should note that while we include a column for standard deviations of 0.20(n), standard
deviations this large are rare with classroom tests (Diederich, 1973). In fact, from our expe-
rience it is more common for classroom tests to have standard deviations closer to 0.10(n).
Before leaving our discussion of KR-21 and its application to classroom tests, we do want
to caution you that KR-21 is only an approximation of KR-20 or coefficient alpha. KR-21
assumes the test items are of equal difficulty and it is usually slightly lower than KR-20 or
coefficient alpha (Hopkins, 1998). Nevertheless, if the assumptions are not grossly violated,
it is probably a reasonably good estimate of reliability for many classroom applications.
Our discussion of shortcut reliability estimates to this point has been limited to tests
that are dichotomously scored. Obviously, many of the assessments teachers use are not
dichotomously scored and this makes the situation a little more complicated. If your items
are not scored dichotomously, you can calculate coefficient alpha with relative ease using
a commonly available spreadsheet such as Microsoft Excel. With a little effort you should
be able to use a spreadsheet to perform the computations illustrated previously in Tables
4.3 and 4.4.
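If you would rather script the calculation than build a spreadsheet, the sketch below (ours, with made-up item scores) computes coefficient alpha from a matrix of item scores using the formula given in the practice items at the end of this chapter (item and total variances computed with n in the denominator).

    import statistics

    def coefficient_alpha(item_scores):
        """Coefficient alpha from a score matrix: rows = students, columns = items."""
        k = len(item_scores[0])                                 # number of items
        totals = [sum(row) for row in item_scores]              # each student's total score
        item_vars = [statistics.pvariance([row[i] for row in item_scores]) for i in range(k)]
        total_var = statistics.pvariance(totals)
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    # Hypothetical data: 6 students by 5 items, each item scored 1 to 5.
    data = [
        [5, 4, 5, 4, 5],
        [3, 3, 2, 3, 3],
        [2, 2, 1, 2, 2],
        [4, 5, 4, 4, 4],
        [3, 2, 3, 2, 3],
        [1, 2, 1, 1, 2],
    ]
    print(round(coefficient_alpha(data), 2))  # about 0.97 for these made-up scores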

Summary
Reliability refers to consistency in test scores. If a test or other assessment procedure pro-
duces consistent measurements, its scores are reliable. Why is reliability so important? As
we have emphasized, assessments are useful because they provide information that helps
educators make better decisions. However, the reliability (and validity) of that information
is of paramount importance. For us to make good decisions, we need reliable information.
By estimating the reliability of our assessment results, we get an indication of how much
confidence we can place in them. If we have highly reliable and valid information, it is prob-
able that we can use that information to make better decisions. If the results are unreliable,
they are of little value to us.

Errors of measurement undermine the reliability of measurement and therefore re-


duce the utility of the measurement. Although there are multiple sources of measurement
error, the major ones are content sampling and time sampling errors. Content sampling
errors are the result of less than perfect sampling of the content domain. The more repre-
sentative tests are of the content domain, the less content sampling errors threaten the reli-
ability of the test. Time sampling errors are the result of random changes in the test taker
or environment over time. Experts in testing and measurement have developed methods of
estimating errors due to these and other sources, including the following major approaches
to estimating reliability:

■ Test-retest reliability involves the administration of the same test to a group of indi-
viduals on two different occasions. The correlation between the two sets of scores is the
test-retest reliability coefficient and reflects errors due to time sampling.

■ Alternate-form reliability involves the administration of parallel forms of a test to a


group of individuals. The correlation between the scores on the two forms is the reliability
coefficient. If the two forms are administered at the same time, the reliability coefficient
reflects only content sampling error. If the two forms of the test are administered at different
times, the reliability coefficient reflects both content and time sampling errors.
■ Internal-consistency reliability estimates are derived from a single administration
of a test. Split-half reliability involves dividing the test into two equivalent halves and
calculating the correlation between the two halves. Instead of comparing performance on
two halves of the test, coefficient alpha and the Kuder-Richardson approaches examine the
consistency of responding among all of the individual items of the test. Split-half reliability
reflects errors due to content sampling whereas coefficient alpha and the Kuder-Richardson
approaches reflect both item heterogeneity and errors due to content sampling.
■ Inter-rater reliability is estimated by administering the test once but having the re-
sponses scored by different examiners. By comparing the scores assigned by different ex-
aminers, one can determine the influence of different raters or scorers. Inter-rater reliability
is important to examine when scoring involves considerable subjective judgment.

We also discussed a number of issues important for understanding and interpreting re-
liability estimates. We provided some guidelines for selecting the type of reliability estimate
most appropriate for specific assessment procedures, some guidelines for evaluating reli-
ability coefficients, and some suggestions on improving the reliability of measurement.
Although reliability coefficients are useful when comparing the reliability of dif-
ferent tests, the standard error of measurement (SEM) is more useful when interpreting
scores. The SEM is an index of the amount of error in test scores and is used in calculating
confidence intervals within which we expect the true score to fall. An advantage of the
SEM and the use of confidence intervals is that they serve to remind us that measurement
error is present in all scores and that we should use caution when interpreting scores. We
closed the chapter by illustrating some shortcut procedures that teachers can use to esti-
mate the reliability of their classroom tests.

KEY TERMS AND CONCEPTS

Alternate-form reliability, p. 98
Coefficient alpha, p. 101
Composite score, p. 103
Confidence interval, p. 115
Content heterogeneity, p. 100
Content sampling error, p. 94
Error score, p. 92
Error variance, p. 95
Internal-consistency reliability, p. 98
Inter-rater differences, p. 95
Inter-rater reliability, p. 101
Kuder-Richardson formula 20, p. 101
Measurement error, p. 91
Obtained score, p. 92
Reliability, p. 91
Reliability coefficient, p. 95
Spearman-Brown formula, p. 99
Split-half reliability, p. 99
Standard error of measurement (SEM), p. 112
Test-retest reliability, p. 97
Time sampling error, p. 95
True score, p. 92
True score variance, p. 95

RECOMMENDED READINGS

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA. Chapter 5, Reliability and Errors of Measurement, is a great resource!

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). Upper Saddle River, NJ: Merrill/Prentice Hall. A little technical at times, but a great resource for students wanting to learn more about reliability.

Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: W. H. Freeman. Chapters 8 and 9 provide outstanding discussions of reliability. A classic!

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill. Chapter 6, The Theory of Measurement Error, and Chapter 7, The Assessment of Reliability, are outstanding chapters. Another classic!

Subkoviak, M. J. (1984). Estimating the reliability of mastery–nonmastery classifications. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 267–291). Baltimore: Johns Hopkins University Press. An excellent discussion of techniques for estimating the consistency of classification with mastery tests.

PRACTICE ITEMS

1. Consider these data for a five-item test that was administered to six students. Each item could
receive a score of either 1 or 0. Calculate KR-20 using the following formula:
KR-20 = [k / (k − 1)] × [(SD² − Σ p_i q_i) / SD²]

where k = number of items
      SD² = variance of total test scores
      p_i = proportion of correct responses on item i
      q_i = proportion of incorrect responses on item i

            Item 1    Item 2    Item 3    Item 4    Item 5    Total Score

Student 1      1
Student 2      1
Student 3      0
Student 4      0
Student 5      1
Student 6

p_i
q_i
p_i q_i
Σ p_i q_i =                                              SD² =

Note: When calculating SD², use n in the denominator.
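As a check on hand calculations like the one requested here, KR-20 can also be scripted. The sketch below is ours and uses a made-up set of responses (not the table above); as the note indicates, variances are computed with n in the denominator.

    import statistics

    def kr20(item_scores):
        """KR-20 for dichotomously scored items: rows = students, columns = items (0/1)."""
        k = len(item_scores[0])                       # number of items
        n = len(item_scores)                          # number of students
        totals = [sum(row) for row in item_scores]
        sd2 = statistics.pvariance(totals)            # total-score variance, n in the denominator
        sum_pq = 0.0
        for i in range(k):
            p = sum(row[i] for row in item_scores) / n    # proportion answering item i correctly
            sum_pq += p * (1 - p)
        return (k / (k - 1)) * (1 - sum_pq / sd2)

    # Hypothetical responses for 6 students on 5 items.
    data = [
        [1, 1, 1, 1, 1],
        [1, 1, 1, 0, 1],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 0, 1],
        [1, 0, 1, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    print(round(kr20(data), 2))  # about 0.73 for these made-up responses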

2. Consider these data for a five-item test that was administered to six students. Each item could receive a score ranging from 1 to 5. Calculate coefficient alpha using the following formula:

Coefficient alpha = [k / (k − 1)] × [1 − (Σ SD_i² / SD²)]

where k = number of items
      SD_i² = variance of individual items
      SD² = variance of total test scores

            Item 1    Item 2    Item 3    Item 4    Item 5    Total Score

Student 1      4                   4         5         5
Student 2      3         3         2         3         2
Student 3      2         3         1         2         1
Student 4      4         4         5         5         4
Student 5      2         3         2         2
Student 6      1         2                   1         3

SD_i²                                                        SD² =

Note: When calculating SD_i² and SD², use n in the denominator.


Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to listen to an audio lecture about this chapter.


CHAPTER 5

Validity for Teachers

Validity refers to the degree to which evidence and theory support the
interpretations of test scores entailed by proposed uses of the test. Validity is,
therefore, the most fundamental consideration in developing and evaluating
tests.
—AERA et al., 1999, p. 9

CHAPTER HIGHLIGHTS

■ Threats to Validity
■ Reliability and Validity
■ “Types of Validity” versus “Types of Validity Evidence”
■ Types of Validity Evidence
■ Validity: Practical Strategies for Teachers

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to

1. Define validity and explain its importance in the context of educational assessment.
2. Describe the major threats to validity.
3. Explain the relationship between reliability and validity.
4. Trace the development of the contemporary conceptualization of validity.
5. Describe the five categories of validity evidence specified in the 1999 Standards.
6. For each category of validity evidence, give an example to illustrate the type of information provided.
7. Explain how validity coefficients are interpreted.
8. Define the standard error of estimate and explain its interpretation.
9. Explain how validity evidence is integrated to develop a sound validity argument.
10. Apply validity analyses to classroom assessments.

In the previous chapter we introduced you to the concept of the reliability of measurement.
In this context, reliability refers to accuracy and consistency in test scores. Now we turn our
attention to validity, another fundamental psychometric property. Messick (1989) defined

validity as “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (p. 13). Similarly, the Standards for Educational and Psychological Testing (AERA et al., 1999) defined validity as “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of the tests” (p. 9). Do not let the technical tone of these definitions throw you. In simpler terms, both of these influential sources indicate that validity refers to the appropriateness or accuracy of the interpretations of test scores. If test scores are interpreted as reflecting intelligence, do they actually reflect intellectual ability? If test scores are used (i.e., interpreted) to predict success in college, can they accurately predict who will succeed in college? Naturally the validity of the interpretations of test scores is directly tied to the usefulness of the interpretations. Valid interpretations help us to make better decisions; invalid interpretations do not!

Although it is often done as a matter of convenience, it is not technically correct to refer to the validity of a test. Validity is a characteristic of the interpretations given to test scores. It is not technically correct to ask the question “Is the Wechsler Intelligence Scale for Children—Fourth Edition (WISC-IV) a valid test?” It is preferable to ask the question “Is the interpretation of performance on the WISC-IV as reflecting intelligence valid?” Validity must always have a context and that context is interpretation. What does performance on this test mean? The answer to this question is the interpretation given to performance and it is this interpretation that possesses the construct of validity, not the test itself. Additionally, when test scores are interpreted in multiple ways, each interpretation needs to be validated. For example, an achievement test can be used to evaluate a student's performance in academic classes, to assign the student to an appropriate instructional program, to diagnose a learning disability, or to predict success in college. Each of these uses involves different interpretations and the validity of each interpretation needs to be evaluated (AERA et al., 1999). To establish or determine validity is a major responsibility of the test authors, test publisher, researchers, and even test user.

Threats to Validity

Messick (1994) and others have identified the two major threats to validity as construct underrepresentation and construct-irrelevant variance. To translate this into everyday language, validity is threatened when a test measures either less (construct underrepresentation) or more (construct-irrelevant variance) than the construct it is supposed to measure (AERA et al., 1999). Construct underrepresentation occurs when a test fails to measure important aspects of the construct it is designed to measure. Consider a test designed to be a comprehensive measure of the mathematics skills covered in a 3rd-grade curriculum and to convey information regarding mastery of each skill. If the test contained only division problems, it would not be an adequate representation of the broad array of math skills typically covered in a 3rd-grade curriculum (although a score on such
a test may predict performance on a more comprehensive measure). Division is an important


aspect of the math curriculum, but not the only important aspect. To address this problem
the content of the test would need to be expanded to reflect all of the skills typically taught
in a 3rd-grade math curriculum. Construct-irrelevant variance is present when the test measures characteristics or skills that are irrelevant to the construct it is intended to measure. For example,
if our 3rd-grade math test has extensive and complex written instructions, it is possible that
in addition to math skills, reading comprehension skills are being measured. If the test is
intended to measure only math skills, the inclusion of reading comprehension would reflect
construct-irrelevant variance. To address this problem, one might design the test to mini-
mize written instructions and to ensure that the reading level is low. As you might imagine,
most tests leave out some aspects that some users might view as important and include as-
pects that some users view as irrelevant (AERA et al., 1999).
In addition to characteristics of the test itself, factors external to the test can impact
the validity of the interpretation of results. Linn and Gronlund (2000) identify numerous
factors external to the test that can influence validity. They highlight the following factors:

■ With educational tests, in addition to the content of the test influencing validity, the way the material is presented can influence validity. For example, consider a test of critical thinking skills. If the students were coached and given solutions to the particular problems included on a test, validity would be compromised. This is a potential problem when teachers “teach the test.”

■ Deviations from standard administrative and scoring procedures can undermine validity. In terms of administration, failure to provide the appropriate instructions or follow strict time limits can lower validity. In terms of scoring, unreliable or biased scoring can lower validity.

■ Any personal factors that restrict or alter the examinees’ responses in the testing situation can undermine validity. For example, if an examinee experiences high levels of test anxiety or is not motivated to put forth a reasonable effort, the results may be distorted.

Additionally, the validity of norm-referenced interpretations of performance on a test is influenced by the appropriateness of the reference group (AERA et al., 1999). As these examples illustrate, a multitude of factors can influence the validity of assessment-based interpretations. Due to the cumulative influence of these factors, validity is not an all-or-none concept. Rather, it exists on a continuum, and we usually refer to degrees of validity or to the relative validity of the interpretation(s) given to a particular measurement.

Reliability and Validity

Reliability is a necessary but insufficient condition for validity. A test that does not produce reliable scores cannot produce valid interpretations. However,
no matter how reliable measurement is, it is not a guarantee of validity. From our discussion of reliability you will remember that obtained score variance is composed of two components: true score variance and error variance. Only true score variance is reliable, and only true score variance can be systematically related to any construct the test is designed to measure. If reliability is equal to zero, then the true score variance component must also be equal to zero, leaving our obtained score to be composed only of error, that is, random variations in responses. Thus, without reliability there can be no validity.
Although low reliability limits validity, high reliability does not ensure validity. It is
entirely possible that a test can produce reliable scores but inferences based on the test
scores can be completely invalid. Consider the following rather silly example involving
head circumference. If we use some care we can measure the circumferences of our stu-
dents’ heads in a reliable and consistent manner. In other words, the measurement is reliable.
However, if we considered head circumference to be an index of intelligence, our inferences
would not be valid. The measurement of head circumference is still reliable, but when in-
terpreted as a measure of intelligence it would result in invalid inferences.
A more relevant example can be seen in the various Wechsler intelligence scales.
These scales have been shown to produce highly reliable scores on a Verbal Scale and a
Performance Scale. There is also a rather substantial body of research demonstrating these
scores are interpreted appropriately as reflecting types of intelligence. However, some psy-
chologists have drawn the inference that score differences between the Verbal Scale and the
Performance Scale indicate some fundamental information about personality and even
forms of psychopathology. For example, one author argued that a person who, on the
Wechsler scales, scores higher on the Verbal Scale relative to the Performance Scale is
highly likely to have an obsessive-compulsive personality disorder! There is no evidence or
research to support such an interpretation, and, in fact, a large percentage of the population
of the United States score higher on the Verbal Scale relative to the Performance Scale on
each of the various Wechsler scales. Thus, while the scores are themselves highly reliable
and some interpretations are highly valid (the Wechsler scales measure intelligence), other
interpretations wholly lack validity despite the presence of high reliability.

“Types of Validity” versus “Types of Validity Evidence”
We have already introduced you to the influential Standards for Educational and Psycho-
logical Testing (AERA et al., 1999). This is actually the latest in a series of documents
providing guidelines for the development and use of tests. At this point we are going to trace
the evolution of the concept of validity briefly by highlighting how it has been defined and
described in this series of documents. In the early versions (i.e., APA, 1954, 1966; APA et
al., 1974, 1985) validity was divided into three distinct types. As described by Messick
(1989), these are

■ Content validity involves evidence that the content of the test is relevant to and representative of the identified construct. In other words, is the content of the test relevant and representative of
the content domain? We speak of it being representative because every possible question
that could be asked cannot as a practical matter be asked, so questions are chosen to sample
or represent the full domain of questions. Content validity is typically based on professional
judgments about the appropriateness of the test content.

■ Criterion-related validity involves evidence that test scores are systematically related to one or more criterion measures. Studies of criterion-related validity empirically examine the relationships between test scores and criterion scores using correlation or regression analyses.

■ Construct validity involves evidence that the test actually measures the psychological construct it is intended to measure. This evidence can be collected using a wide variety of research strategies and designs.

This classification terminology has been widely accepted by researchers, authors, teachers, and students and is often referred to as the traditional nomenclature (AERA et al., 1999). However, in the 1970s and 1980s measurement professionals began moving toward a conceptualization of validity as a unitary concept. That is, whereas we previously had talked about different types of validity (i.e., content, criterion-related, and construct validity), these “types” really only represent different ways of collecting evidence to support validity. To emphasize the view of validity as a unitary concept and get away from the perception of distinct types of validity, the 1985 Standards for Educational and Psychological Testing (APA et al., 1985) referred to “types of validity evidence” in place of “types of validity.” Instead of content validity, criterion-related validity, and construct validity, the 1985 Standards referred to content-related evidence of validity, criterion-related evidence of validity, and construct-related evidence of validity.

This brings us to the current Standards for Educational and Psychological Testing (AERA et al., 1999). The 1999 Standards continue to treat validity as a unitary concept that is supported by different sources of evidence rather than divided into distinct types.

The 1999 document is conceptually similar to the 1985 document (i.e., “types of valid-
ity evidence” versus “types of validity”), but the terminology has expanded and changed
somewhat. The change in terminology is not simply cosmetic, but is substantive and intended
to promote a new way of conceptualizing validity, a view that has been growing in the profes-
sion for over two decades (Reynolds, 2002). The 1999 Standards identifies the following five
categories of evidence that are related to the validity of test score interpretations:

■ Evidence based on test content includes evidence derived from an analysis of the test content, which includes the type of questions or tasks included in the test and administration and scoring guidelines.

■ Evidence based on relations to other variables includes evidence based on an examination of the relationships between test performance and external variables or criteria.

■ Evidence based on internal structure includes evidence regarding relationships among test items and components.

■ Evidence based on response processes includes evidence derived from an analysis of the processes engaged in by the examinee or examiner.

■ Evidence based on consequences of testing includes evidence based on an examination of the intended and unintended consequences of testing.

These sources of evidence will differ in their importance or relevance according to factors such as the construct being measured, the intended use of the test scores, and the population being assessed. Those using tests should carefully weigh the evidence of validity and make judgments about how appropriate a test is for each application and setting. Table 5.1 provides a brief summary of the different classification schemes that have been promulgated over the past four decades in the Standards.
assessed.
decades in the Standards.
At this point you might be asking, “Why are the authors wasting my time with a dis-
cussion of the history of technical jargon?” There are at least two important reasons. First,
it is likely that in your readings and studies you will come across references to various
“types of validity.” Many older test and measurement textbooks refer to content, criterion,
and construct validity, and some newer texts still use that or a similar nomenclature. We
hope that when you come across different terminology you will not be confused, but instead
will understand its meaning and origin. Second, the Standards are widely accepted and
serve as professional guidelines for the development and evaluation of tests. For legal and
ethical reasons test developers and publishers generally want to adhere to these guidelines.
As a result, we expect test publishers will adopt the new nomenclature in the next few years.
Currently test manuals and other test-related documents are adopting this new nomenclature
(e.g., Reynolds, 2002). However, older tests typically have supporting literature that uses
the older terminology, and you need to understand its origin and meaning. When reviewing
test manuals and assessing the psychometric properties of a test, you need to be aware of
the older as well as the newer terminology.

TABLE 5.1 Tracing Historical Trends in the Concept of Validity

1974 Standards                1985 Standards                  1999 Standards
(Validity as Three Types)     (Validity as Three              (Validity as a Unitary Construct)
                              Interrelated Types)

Content validity              Content-related validity        Validity evidence based on test content
Criterion validity            Criterion-related validity      Validity evidence based on relations to other variables
Construct validity            Construct-related validity      Validity evidence based on internal structure
                                                              Validity evidence based on response processes
                                                              Validity evidence based on consequences of testing

Types of Validity Evidence


At this point we will address each of the categories of validity evidence individually. As we
do this we will attempt to highlight how the current nomenclature relates to the traditional
nomenclature. Along these lines, it will hopefully become clear that construct validity as
originally conceptualized is a comprehensive category that essentially corresponds with the
contemporary conceptualization of validity as a unitary concept. As a result, construct valid-
ity actually encompasses content and criterion-related validity.

Evidence Based on Test Content


The Standards (AERA et al., 1999) note that valuable validity evidence can be gained by examining the relationship between the content of the test and the construct or domain the test is designed to measure. In this context, test content refers to the “themes, wording, and format of the items, tasks, or questions on a test, as well as the guidelines for procedures regarding administration and scoring” (p. 11). Other writers provide similar descriptions. For example, Reynolds (1998b) notes that validity evidence based on test content focuses on how well the test items sample the behaviors or subject matter the test is designed to measure. In
a similar vein, Anastasi and Urbina (1997) note that validity evidence based on test content
involves the examination of the content of the test to determine whether it provides a repre-
sentative sample of the domain being measured. Popham (2000) succinctly frames it as
“Does the test cover the content it’s supposed to cover?” (p. 96). In the past, this type of
validity evidence was primarily subsumed under the label “content validity.”
Test developers routinely begin considering the appropriateness of the content of the
test at the earliest stages of development. Identifying what we want to measure is the first order
of business, because we cannot measure anything very well that we have not first clearly de-
fined. Therefore, the process of developing a test should begin with a clear delineation of the
construct or content domain to be measured. Once the construct or
content domain has been clearly defined, the next step is to develop a table of specifications. This table of specifications is essentially a blueprint that guides the development of the test. It delineates the topics and objectives to be covered and the relative importance of each
topic and objective. Finally, working from this table of specifications
the test developers write the actual test items. These steps in test development are covered in
detail later in this text. Whereas teachers usually develop classroom tests with little outside
assistance, professional test developers often bring in external consultants who are considered
experts in the content area(s) covered by the test. For example, if the goal is to develop an
achievement test covering American history, the test developers will likely recruit experienced
teachers of American history for assistance developing a table of specifications and writing
test items. If care is taken with these procedures, the foundation is established for a correspon-
dence between the content of the test and the construct it is designed to measure. Test develop-
ers may include a detailed description of their procedures for writing items as validity evidence,
including the number, qualifications, and credentials of their expert consultants.

After the test is written, it is common for test developers to continue collecting validity evidence based on content. This typically involves having expert judges systematically review the test and evaluate the correspondence between the test content and its construct or domain. These experts can be the same ones who helped during the early phase of test construction or a new, independent group of experts. During this phase, the experts typically address two major issues, item relevance and content coverage. To assess item relevance, the experts examine each individual test item and judge whether it reflects content within the specified domain. To assess content coverage, the experts consider the test as a whole and judge how well the items sample the full range of the specified domain. To under-
stand the difference between these two issues, consider these examples. For a classroom test
of early American history, a question about the American Revolution would clearly be
deemed a relevant item whereas a question about algebraic equations would be judged to be
irrelevant. This distinction deals with the relevance of the items to the content domain. In
contrast, if you examined the total test and determined that all of the questions dealt with
the American Revolution and no other aspects of American history were covered, you would
conclude that the test had poor content coverage. That is, because early American history
has many important events and topics in addition to the American Revolution that are not
covered in the test, the test does not reflect a comprehensive and representative sample of
the specified domain. The concepts of item relevance and content coverage are illustrated
in Figures 5.1 and 5.2.
As you can see, the collection of content-based validity evidence is typically qualita-
tive in nature. However, although test publishers might rely on traditional qualitative ap-
proaches (e.g., the judgment of expert judges to help develop the tests and subsequently to
evaluate the completed test), they can take steps to report their results in a more quantitative
manner. For example, they can report the number and qualifications of the experts, the

FIGURE 5.1 Illustration of Item Relevance. Relevant items fall within the content domain; irrelevant items fall outside the content domain.



FIGURE 5.2 Illustration of Content Coverage. Panel A shows good content coverage (representative of the entire content domain); Panel B shows poor content coverage (not representative of the entire content domain).

number of chances the experts had to review and comment on the assessment, and their
degree of agreement on content-related issues. Taking these efforts a step further, Lawshe
(1975) developed a quantitative index that reflects the degree of agreement among the ex-
perts making content-related judgments. Newer approaches are being developed that use a
fairly sophisticated technique known as multidimensional scaling analysis (Sireci, 1998).
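Lawshe's index is the content validity ratio (CVR). The sketch below is our own summary of that index rather than a procedure from this chapter: each expert rates an item as "essential" or not, and CVR compares the number of "essential" ratings to half the panel size.

    def content_validity_ratio(n_essential, n_experts):
        """Lawshe's CVR = (n_e - N/2) / (N/2); ranges from -1 to +1.

        A value of 0 means exactly half the panel rated the item essential.
        """
        half = n_experts / 2
        return (n_essential - half) / half

    # Hypothetical panel of 10 experts: 9 rate item 1 essential, 6 rate item 2 essential.
    print(content_validity_ratio(9, 10))   # 0.8
    print(content_validity_ratio(6, 10))   # 0.2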
As we suggested previously, different types of validity evidence are most relevant, appropriate, or important for different types of tests. For example, content-based validity evidence is often seen as the preferred approach for establishing the validity of academic achievement tests. This applies to both teacher-made classroom tests and professionally developed achievement tests. Another situation in which content-based evidence is of primary importance is with tests used in the selection and classification of employees. For example, employment tests may be designed to sample the knowledge and skills necessary to succeed at a job. In this context, content-based evidence can be used to demonstrate consistency between the content of the test and the requirements of the job. The key factor that makes content-based validity evidence of paramount importance with both achievement tests and employment tests is that they are designed to provide a representative sample of the knowledge, behavior, or skill domain being measured. In contrast, content-based evidence of validity is usually less relevant for personality and aptitude tests (Anastasi & Urbina, 1997).

Face Validity. Before leaving our discussion of content-based validity evidence, we need to highlight the distinction between it and face validity. Face validity is technically not a form of validity, but refers to a test “appearing” to measure what it is designed to measure. That is, does the test appear valid to untrained
individuals who take, administer, or examine the test? Face validity really has nothing to
do with what a test actually measures, just what it appears to measure. For example, does
a test of achievement look like the general public expects an achievement test to look?
Does a test of intelligence look like the general public expects an intelligence test to look?
Naturally, the face validity of a test is closely tied to the content of a test. In terms of face
validity, when untrained individuals inspect a test they are typically looking to see whether
the items on the test are what they expect. For example, are the items on an achievement
test of the type they expect to find on an achievement test? Are the items on an intelligence
test of the type they expect to find on an intelligence test? Whereas content-based evidence
of validity is acquired through a systematic and technical analysis of the test content, face
validity involves only the superficial appearance of a test. A test can appear “face valid” to
the general public, but not hold up under the systematic scrutiny involved in a technical
analysis of the test content.
This is not to suggest that face validity is an undesirable or even irrelevant character-
istic. A test that has good face validity is likely to be better received by the general public.
If a test appears to measure what it is designed to measure, examinees are more likely to be
cooperative and invested in the testing process, and the public is more likely to view the
results as meaningful (Anastasi & Urbina, 1997). Research suggests that good face validity
can increase student motivation, which in turn can increase test performance (Chan, Schmitt,
DeShon, Clause, & Delbridge, 1997). If a test has poor face validity those using the test may
have a flippant or negative attitude toward the test and as a result put little effort into com-
pleting it. If this happens, the actual validity of the test can suffer. The general public is not
likely to view a test with poor face validity as meaningful, even if there is technical support
for the validity of the test.
There are times, however, when face validity is undesirable. These occur primarily in
forensic settings in which detection of malingering may be emphasized. Malingering is a
situation in which an examinee intentionally feigns symptoms of a mental or physical dis-
order in order to gain some external incentive (e.g., receiving a financial reward, avoiding
punishment). In these situations face validity is not desirable because it may help the exam-
inee fake pathological responses.

Evidence Based on Relations to Other Variables


Important validity evidence can also be secured by examining the relationships between test
scores and other variables (AERA et al., 1999). In describing this type of validity evidence,
the Standards recognize two related, but fairly distinct applications of this approach. One
involves the examination of test-criterion evidence and the other convergent and discrimi-
nant evidence. For clarity, we will address these two applications separately.

Test-Criterion Evidence. Many tests are designed to predict performance on some variable that is typically referred to as a criterion. The Standards (AERA et al., 1999) describe the criterion as a measure of some attribute or outcome that is of primary interest (p. 14). The criterion can be academic performance as reflected by the grade point average (GPA), job performance as measured by a supervisor's ratings, or anything else that is of importance to the user of the test. Historically, this type of validity evidence has been referred to as “predictive validity,” “criterion validity,” or “criterion-related validity.”

There are two different types of validity studies typically used to collect test-criterion evidence: predictive studies and concurrent studies. In a predictive study the test is administered, there is an intervening time interval, and then the criterion is measured. In a concurrent study the test is administered and the criterion is measured at about the same time.
To illustrate these two approaches we will consider the Scho-
lastic Achievement Test (SAT). The SAT is designed to predict how well high school stu-
dents will perform in college. To complete a predictive study, one might administer the
SAT to high school students, wait until the students have completed their freshman year of
college, and then examine the relationship between the predictor (i.e., SAT scores) and the
criterion (i.e., freshman GPA). Researchers often use a correlation coefficient to examine
the relationship between a predictor and a criterion, and in this context the correlation
coefficient is referred to as a validity coefficient. To complete a concurrent study of the
relationship between the SAT and college performance, the researcher might administer
the SAT to a group of students completing their freshman year and then simply correlate
their SAT scores with their GPAs. In predictive studies there is a time interval between the
predictor test and the criterion; in a concurrent study there is no time interval. Figure 5.3
illustrates the temporal relationship between administering the test and measuring the cri-
terion in predictive and concurrent studies.
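
For readers who want to see how a validity coefficient is actually computed, the brief Python sketch below correlates a set of predictor scores with criterion scores gathered later; the scores and sample size are entirely hypothetical (our illustration, not data from any real validation study).

```python
# A minimal sketch of computing a validity coefficient with hypothetical data:
# admission test scores (predictor, Time I) and later freshman GPAs (criterion, Time II).
import numpy as np

test_scores = np.array([1050, 1190, 980, 1320, 1110, 1270, 900, 1230, 1010, 1150])
freshman_gpa = np.array([2.8, 3.1, 2.5, 3.7, 2.9, 3.4, 2.3, 3.3, 2.7, 3.0])

# The validity coefficient is simply the Pearson correlation between
# the predictor (test scores) and the criterion (GPA).
validity_coefficient = np.corrcoef(test_scores, freshman_gpa)[0, 1]
print(f"Validity coefficient: {validity_coefficient:.2f}")
```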
A natural question is “Which type of study, predictive or concurrent, is best?” As you
might expect (or fear), there is not a simple answer to that question. Very often in education
and other settings we are interested in making predictions about future performance. Con-
sider our example of the SAT; the question is which students will do well in college and
which will not. Inherent in this question is the passage of time. You want to administer a test
before students graduate from high school that will help predict the likelihood of their success

Predictive Design
Time I (Fall 2003): Administration of the Scholastic Assessment Test (SAT)
Time II (Spring 2005): College GPA

Concurrent Design
Time I (Fall 2003): Administration of the SAT and College GPA

FIGURE 5.3 Illustration of Predictive and Concurrent Studies



in college. In situations such as this, predictive studies maintain the temporal relationship and
other potentially important characteristics of the real-life situation (AERA et al., 1999).
Because a concurrent study does not retain the temporal relationship or other character-
istics of the real-life situation, a predictive study is preferable when prediction is the ultimate
goal of assessment. However, predictive studies take considerable time to complete and can be
extremely expensive. As a result, although predictive studies might be preferable from a techni-
cal perspective, for practical reasons test developers and researchers might adopt a concurrent
strategy to save time and/or money. In some situations this is less than optimal and you should
be cautious when evaluating the results. However, in certain situations concurrent studies are
the preferred approach. Concurrent studies clearly are appropriate when the goal of the test is
to determine current status of the examinee as opposed to predicting future outcome (Anastasi
& Urbina, 1997). For example, a concurrent approach to validation would be indicated for a
test designed to diagnose the presence of psychopathology in elementary school students. Here
we are most concerned that the test give us an accurate assessment of the child’s conditions at
the time of testing, not at some time in the future. The question here is not “Who will develop
the disorder?” but “Who has the disorder?” In these situations, the test being validated is often
a replacement for a more time-consuming or expensive procedure. For example, a relatively
brief screening test might be evaluated to determine whether it can serve as an adequate re-
placement for a more extensive psychological assessment process. However, if we were inter-
ested in selecting students at high risk of developing a disorder in the future, say, for
participation in a prevention program, a prediction study would be in order. We would need to
address how well or accurately our test predicts who will develop the disorder in question.

Selecting a Criterion. In both predictive and concurrent studies, it is important that the
criterion itself be reliable and valid. As noted earlier, reliability is a prerequisite for validity.
If a measure is not reliable, whether it is a predictor test or a criterion measure, it cannot be
valid. At the same time, reliability does not ensure validity. Therefore, we need to select cri-
terion measures that are also valid. In our example of using the SAT to predict freshman GPA,
we consider our criterion, GPA, to be a valid measure of success in college. In a concurrent
study examining the ability of a test to diagnose psychopathology, the criterion might be the
diagnosis provided by an extensive clinical assessment involving a combination of clinical
interviews, behavioral observations, and psychometric testing. Optimally the criterion should
be viewed as the “gold standard,” the best existing measure of the construct of interest.

Criterion Contamination. It is important that the predictor and criterion scores be indepen-
dently obtained. That is, the predictor scores should not in any way influence the criterion scores.
If predictor scores do influence criterion scores, the criterion is said to be contami-
nated. Consider a situation in which students are selected for a college program based on
performance on an aptitude test. If the college instructors are aware of the students’ perfor-
mance on the aptitude test this might influence their evaluation of the students’ performance
in their class. Students with high aptitude test scores might be given preferential treatment
or graded in a more lenient manner. In this situation knowledge of performance on the
predictor is influencing performance on the criterion. Criterion contamination has occurred
and any resulting validity coefficients will be artificially inflated. That is, the validity
coefficients between the predictor test and the criterion
will be larger than they would be had the criterion not been contaminated. The coefficients will
suggest the validity is greater than it actually is. To avoid this undesirable situation, test devel-
opers must ensure that no individual who evaluates criterion performance has knowledge of
the examinees’ predictor scores.

Interpreting Validity Coefficients. Predictive and concurrent validity studies examine the
relationship between a test and a criterion and the results are often reported in terms of a
validity coefficient. At this point it is reasonable to ask, “How large should validity coeffi-
cients be?” For example, should we expect validity coefficients greater than 0.80? Although
there is no simple answer to this question, validity coefficients should be large enough to
indicate that information from the test will help predict how individuals will perform on the
criterion measure (e.g., Cronbach & Gleser, 1965). Returning to our example of the SAT,
the question is whether the relationship between the SAT and the
freshman GPA is sufficiently strong so that information about SAT performance helps predict
who will succeed in college. If a test provides information that helps predict criterion
performance better than any other existing predictor, the test may be useful even if its
validity coefficients are relatively small. As a result, testing experts avoid specifying a
minimum coefficient size that is acceptable for
validity coefficients.
Although we cannot set a minimum size for acceptable validity coefficients, certain
techniques are available that help us evaluate the usefulness of test scores for prediction pur-
poses. In Chapter 2 we introduced linear regression, a mathematical procedure that allows
you to predict values on one variable given information on another variable. In the context of
validity analysis, linear regression allows you to predict criterion performance based on pre-
dictor test scores. When using linear regression, a statistic called the
standard error of estimate is used to describe the amount of prediction error due to the
imperfect validity of the test. The standard error of estimate is the standard deviation of
prediction errors around the predicted score. The formula for the standard error of estimate is quite
similar to that for the SEM introduced in the last chapter. We will not
go into great detail about the use of linear regression and the standard error of estimate, but
Special Interest Topic 5.1 provides a very user-friendly discussion of linear regression.
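
As a concrete (and entirely hypothetical) illustration of these ideas, the short Python sketch below fits a prediction equation to made-up IQ and achievement scores, computes the standard error of estimate from the residuals, and uses it to place rough 68% and 95% bands around a predicted score. The data values, the IQ of 112, and the use of n - 2 in the denominator are our own choices, not taken from the text.

```python
# A minimal sketch (hypothetical data) of a prediction equation and the
# standard error of estimate.
import numpy as np

iq = np.array([85, 90, 95, 100, 105, 110, 115, 120, 125, 130], dtype=float)
achievement = np.array([52, 57, 55, 62, 60, 66, 64, 70, 73, 75], dtype=float)

# Fit Y = aX + b by least squares: a is the regression coefficient, b the constant.
a, b = np.polyfit(iq, achievement, 1)

# Residuals and the standard error of estimate (here computed with n - 2 in the
# denominator, one common convention).
predicted = a * iq + b
residuals = achievement - predicted
see = np.sqrt(np.sum(residuals ** 2) / (len(iq) - 2))

# Predict achievement for a student with an IQ of 112 and give rough
# 68% and 95% bands by adding and subtracting one and two standard errors.
y_hat = a * 112 + b
print(f"Y = {a:.2f}X + {b:.2f}, standard error of estimate = {see:.2f}")
print(f"Predicted score: {y_hat:.1f}")
print(f"68% band: {y_hat - see:.1f} to {y_hat + see:.1f}")
print(f"95% band: {y_hat - 2 * see:.1f} to {y_hat + 2 * see:.1f}")
```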
When tests are used for making decisions such as in student or personnel selection,
factors other than the correlation between the test and criterion are important to consider.
For example, factors such as the proportion of applicants needed to fill positions (i.e., se-
lection ratio) and the proportion of applicants who can be successful on the criterion (i.e.,
base rate) can impact the usefulness of test scores. As an example of how the selection
ratio can influence selection decisions, consider an extreme situation in which you have
more positions to fill than you have applicants. Here you do not have the luxury of being
selective and have to accept all the applicants. In this unfortunate situation no test is useful,
no matter how strong a relationship there is between it and the criterion. However, if you
have only a few positions to fill and many applicants, even a test with a moderate correla-
tion with the criterion may be useful. As an example of how the base rate can impact
selection decisions, consider a situation in which practically every applicant can be suc-
cessful (i.e., a very easy task). Because almost any applicant selected will be successful,

SPECIAL INTEREST TOPIC 5.1

Regression, Prediction, and Your First Algebra Class

One of the major purposes of various aptitude measures such as IQ tests is to make predictions about
performance on some other variable such as reading achievement test scores, success in a job train-
ing program, or even college grade point average. In order to make predictions from a score on one
test to a score on some other measure, the mathematical relationship between the two must be deter-
mined. Most often we assume the relationship to be linear and direct with test scores such as intel-
ligence and achievement. When this is the case, a simple equation is derived that represents the
relationship between two test scores that we will call X and Y.
If our research shows that X and Y are indeed related—that is, for any change in the value of
X there is a systematic (not random) change in Y—our equation will allow us to estimate this change
or to predict the value of Y if we know the value of X. Retaining X and Y as we have used them so
far, the general form of our equation would be:

Y = aX + b

This equation goes by several names. Statisticians are most likely to refer to it as a regression equa-
tion. Practitioners of psychology who use the equation to make predictions may refer to it as a predic-
tion equation. However, somewhere around 8th or 9th grade, in your first algebra class, you were
introduced to this expression and told it was the equation of a straight line. What algebra teachers
typically do not explain at this level is that they are actually teaching you regression!
Let’s look at an example of how our equation works. For this example, we will let X represent
some individual’s score on an intelligence test and Y the person’s score on an achievement test a year
in the future. To determine our actual equation, we would have had to test a large number of students
on the IQ test, waited a year, and then tested the same students on an achievement test. We then
calculate the relationship between the two sets of scores. One reasonable outcome would yield an
equation such as this one:

Y = 0.5X + 10

In determining the relationship between X and Y, we calculated the value of a to be 0.5 and
the value of b to be 10. In your early algebra class, a was referred to as the slope of your line and b
as the Y-intercept (the starting point of your line on the Y-axis when X = 0). We have graphed this
equation for you in Figure 5.4. When X = 0, Y is equal to 10 (Y = 0.5(0) + 10) so our line starts on
the Y-axis at a value of 10. Because our slope is 0.5, for each increase in X, the increase in Y will be
half, or 0.5 times, as much. We can use our equation or our prediction line to estimate or predict the
value of Y for any value of X, just as you did in that early algebra class. Nothing has really changed
except the names.
Instead of slope, we typically refer to a as a regression coefficient or a beta weight. Instead of
the Y-intercept, we typically refer to b from our equation as a constant, because it is always being
added to aX in the same amount on every occasion.
If we look at Figure 5.4, we can see that for a score of 10 on our intelligence test, a score of
15 is predicted on the achievement test. A score of 30 on our intelligence test, a 20-point increase,
predicts an achievement test score of 25, an increase in Y equal to half the increase in X. These values
are the same whether we use our prediction line or our equation—they are simply differing ways of
showing the relationship between X and Y.
FIGURE 5.4 Example of a Graph of the Equation of a Straight Line, also Known
as a Regression Line or Prediction Line
Note: Y = aX + b when a = 0.5 and b = 10. For example, if X is 30, then Y = (0.5)(30) + 10 = 25.

We are predicting Y from X, and our prediction is never perfect when we are using test scores.
For any one person, we will typically be off somewhat in predicting future test scores. Our prediction
actually is telling us the mean or average score on Y of all the students in the research study at each
score on X. For example, the mean achievement score of all of our students who had a score of 40
on the intelligence test was 30. We know that not all of the students who earned a 40 on the intelli-
gence test will score 30 on the achievement test. We use the mean score on Y of all our students who
scored 40 on the intelligence measure as our predicted value for all students who score 40 on X
nevertheless. The mean is used because it results in the smallest amount of error in all our predic-
tions. In actual practice, we would also be highly interested in just how much error existed in our
predictions, and this degree of error would be calculated and reported. Once we determine the aver-
age amount of error in our predictions, we make statements about how confident we are in predicting
Y based on X. For example, if the average amount of error (called the standard error of estimate) in
our prediction were 2 points, we might say that based on his score of 40 on the IQ measure, we are
68% confident John's achievement test score a year later will be between 28 and 32 (the predicted
score of 30 plus or minus one standard error) and 95% confident that it will fall within the range
of scores from 26 to 34.

no test is likely to be useful regardless of how strong a relationship there is between it and
the criterion. However, if you have a difficult task and few applicants can be successful,
even a test with a moderate correlation with the criterion may be useful. To take into con-
sideration these factors, decision-theory models have been developed (e.g., Messick, 1989).
In brief, decision-theory models help the test user determine how much information a
predictor test can contribute when making classification decisions. We will not go into
detail about decision theory, but interested students are referred to Anastasi and Urbina
(1997) for a readable discussion of decision-theory models.
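
The following rough simulation (our own illustration, not a formal decision-theory model) shows how the selection ratio can change the practical value of a predictor test. The validity coefficient of 0.40, the base rate of 50%, and the sample size are arbitrary assumptions chosen only for the example.

```python
# A rough simulation of how the selection ratio and base rate affect the
# practical value of a predictor test with modest validity.
import numpy as np

rng = np.random.default_rng(0)
validity = 0.40          # assumed correlation between predictor and criterion
n = 100_000              # simulated applicants

# Draw correlated predictor and criterion scores from a bivariate normal distribution.
cov = [[1.0, validity], [validity, 1.0]]
predictor, criterion = rng.multivariate_normal([0, 0], cov, size=n).T

def success_rate_among_selected(selection_ratio, base_rate):
    """Proportion of selected applicants who succeed on the criterion."""
    success_cut = np.quantile(criterion, 1 - base_rate)    # top base_rate succeed
    select_cut = np.quantile(predictor, 1 - selection_ratio)
    selected = predictor >= select_cut
    return np.mean(criterion[selected] >= success_cut)

# With a lenient selection ratio the test adds little beyond the base rate;
# with a strict selection ratio it adds considerably more.
for sr in (0.90, 0.50, 0.10):
    rate = success_rate_among_selected(selection_ratio=sr, base_rate=0.50)
    print(f"Selection ratio {sr:.2f}: {rate:.2%} of those selected succeed (base rate 50%)")
```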
Validity Generalization. An important consideration in the interpretation of predictive
and concurrent studies is the degree to which
they can be generalized to new situations, that is, to circumstances similar to but not the
same as those under which the validity studies were conducted. When a test is used for
prediction in new settings, research has shown that validity coefficients can vary consider-
ably. For example, a validation study may be conducted using a national sample, but differ-
ent results may be obtained when the study is repeated using a restricted sample such as a
local school district. Originally these results were interpreted as suggesting that test users
were not able to rely on existing validation studies and needed to conduct their own local
validation studies. However, subsequent research using a new statistical procedure known
as meta-analysis indicated that much of the variability previously observed in validity coef-
ficients was actually due to statistical artifacts (e.g., sampling error). When these statistical
artifacts were taken into consideration the remaining variability was often negligible, sug-
gesting that validity coefficients can be generalized more than previously thought (AERA
et al., 1999). Currently, in many situations local validation studies are not seen as necessary.
For example, if there is abundant meta-analytic research that produces consistent results,
local validity studies will likely not add much useful information. However, if there is little
existing research or the results are inconsistent, then local validity studies may be particu-
larly useful (AERA et al., 1999).

Convergent and Discriminant Evidence. Convergent and discriminant evidence of valid-
ity have traditionally been incorporated under the category of construct validity. Convergent
evidence of validity is obtained when you correlate a test with existing tests that measure
similar constructs. For example, if you are developing a new intelligence test you might elect
to correlate scores on your new test with scores on the Wechsler Intelligence Scale for
Children—Fourth Edition (WISC-IV; Wechsler, 2003). Because the WISC-IV is a well-respected
test of intelligence with considerable validity evidence, a strong correlation between the
WISC-IV and your new intelligence test would provide evidence that your test is actually
measuring the construct of intelligence.
Discriminant evidence of validity is obtained when you correlate a test with existing
tests that measure dissimilar constructs. For example, if you were validating a test designed
to measure anxiety, you might correlate your anxiety scores with a measure of sensation
seeking. Because anxious individuals do not typically engage in
sensation-seeking behaviors, you would expect a negative correla-
tion between the measures. If your analyses produce the expected negative correlations,
this
would support your hypothesis.
There is a related, relatively sophisticated validation technique referred to as the
multitrait-multimethod matrix (Campbell & Fiske, 1959). This approach requires that
you examine two or more traits (e.g., anxiety and sensation seeking) using two or more
measurement methods (e.g., self-report and teacher rating). The researcher then examines
the resulting correlation matrix, comparing the actual relationships with a priori (i.e., pre-
existing) predictions about the relationships. In addition to revealing information about
convergent and discriminant relationships, this technique provides information about the
influence of common method variance. When two measures show an unexpected correla-
tion due to similarity in their method of measurement, we refer to this as method variance.
Thus, the multitrait-multimethod matrix allows one to determine what the test correlates
with, what it does not correlate with, and how the method of measurement influences these
relationships. This approach has considerable technical and theoretical appeal, yet difficulty
with implementation and interpretation has limited its application to date.
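
As a purely illustrative sketch of what a multitrait-multimethod matrix looks like, the simulation below generates two traits, each measured by two methods, and prints the resulting correlation matrix. All of the trait, method, and error weights are made up for this example.

```python
# A toy multitrait-multimethod sketch (entirely simulated data): two traits
# (anxiety, sensation seeking) each measured by two methods (self-report, teacher rating).
import numpy as np

rng = np.random.default_rng(1)
n = 500
anxiety = rng.normal(size=n)
sensation = rng.normal(size=n) - 0.3 * anxiety       # mildly negatively related traits
self_method = rng.normal(size=n) * 0.4               # shared self-report method variance
teacher_method = rng.normal(size=n) * 0.4            # shared teacher-rating method variance

measures = np.column_stack([
    anxiety + self_method + rng.normal(size=n) * 0.5,       # anxiety, self-report
    anxiety + teacher_method + rng.normal(size=n) * 0.5,    # anxiety, teacher rating
    sensation + self_method + rng.normal(size=n) * 0.5,     # sensation, self-report
    sensation + teacher_method + rng.normal(size=n) * 0.5,  # sensation, teacher rating
])

labels = ["ANX/self", "ANX/teacher", "SS/self", "SS/teacher"]
r = np.corrcoef(measures, rowvar=False)

# Convergent correlations: same trait, different method (e.g., ANX/self with ANX/teacher).
# Discriminant correlations: different traits (expected lower, here slightly negative).
# Same-method, different-trait cells are pushed upward by the shared method factor.
for i, row_label in enumerate(labels):
    print(row_label, np.round(r[i], 2))
```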

Contrasted Groups Studies. Validity evidence can also be garnered by examining different
groups that are expected to differ on the construct the test is designed to measure. This is
referred to as a contrasted group study. For example, if you are attempting to validate a new
measure of intelligence, you might form two groups, individuals with mental retardation and
normal control participants. In this type of study, the diagnoses or group assignment would
have been made using assessment procedures that do not involve the test under consideration.
Each group would then be administered the new test,
and its validity as a measure of intelligence would be supported if the
predefined groups differed in performance in the predicted manner. Although the preceding
example is rather simplistic, it illustrates a general approach that has numerous applications.
For example, many constructs in psychology and education have a developmental component.
That is, you expect younger participants to perform differently than older participants. Tests
designed to measure these constructs can be examined to determine whether they demonstrate
the expected developmental changes by looking at the performance of groups reflecting dif-
ferent ages and/or education. In the past, this type of validity evidence has typically been
classified as construct validity.

Evidence Based on Internal Structure

By examining the internal structure of a test (or battery of tests) one can determine whether
the relationships between test items (or, in the case of test batteries, component tests) are
consistent with the construct the test is designed to measure (AERA et al., 1999). For ex-
ample, one test might be designed to measure a construct that is hypothesized to involve a
single dimension, whereas another test might measure a construct thought to involve mul-
tiple dimensions. By examining the internal structure of the test we can determine whether
its actual structure is consistent with the hypothesized structure of the construct it measures.
Factor analysis is a sophisticated statistical procedure used to determine the number of
conceptually distinct factors or dimensions underlying a test or battery of tests. Because
factor analysis is a fairly complicated technique, we will not go into detail about its
calculation. However, factor analysis plays a prominent role in test validation and you need
to be aware of its use. In summary, test publishers and researchers use factor analysis either
to confirm or to refute the proposition that the internal structure of the tests is consistent
with that of the construct.

Factor analysis is not the only approach researchers use to examine the internal struc-
ture of a test. Any technique that allows researchers to examine the relationships between
test components can be used in this context. For example, if the items on a test are assumed
to reflect a continuum from very easy to very difficult, empirical evidence of a pattern of
increasing difficulty can be used as validity evidence. If a test is thought to measure a one-
dimensional construct, a measure of item homogeneity might be useful (AERA et al., 1999).
The essential feature of this type of validity evidence is that researchers empirically examine
the internal structure of the test and compare it to the structure of the construct of interest.
This type of validity evidence has traditionally been incorporated under the category of
construct validity and is most relevant with tests measuring theoretical constructs such as
intelligence or personality.
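
As one simple, hypothetical illustration of examining internal structure, the sketch below inspects the eigenvalues of an item intercorrelation matrix for simulated items that all reflect a single ability. It is only a rough stand-in for a full factor analysis, and the sample sizes and loadings are invented.

```python
# A minimal sketch: eigenvalues of the item intercorrelation matrix suggest how
# many dimensions underlie a set of item scores (simulated, one-dimensional data).
import numpy as np

rng = np.random.default_rng(2)
n_students, n_items = 300, 8
ability = rng.normal(size=n_students)

# Simulate eight items that all tap a single underlying ability (plus error).
items = ability[:, None] * 0.7 + rng.normal(size=(n_students, n_items)) * 0.7

r = np.corrcoef(items, rowvar=False)           # item intercorrelation matrix
eigenvalues = np.linalg.eigvalsh(r)[::-1]      # sorted largest first

# One dominant eigenvalue is consistent with a one-dimensional construct;
# several large eigenvalues would point to a multidimensional structure.
print(np.round(eigenvalues, 2))
```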

Evidence Based on Response Processes


Validity evidence can also be based on an analysis of the processes engaged in by examinees
when responding to a test and whether those processes are consistent with the con-
struct being assessed. Although this type of validity evidence has not received as much at-
tention as the approaches previously discussed, it has considerable potential and in terms of
the traditional nomenclature it would likely be classified under construct validity. For ex-
ample, consider a test designed to measure mathematical reasoning ability. In this situation
it would be important to investigate the examinees’ response processes to verify that they
are actually engaging in analysis and reasoning as opposed to applying rote mathematical
algorithms (AERA et al., 1999). There are numerous ways of collecting this type of validity
evidence, including interviewing examinees about their response processes and strategies,
recording behavioral indicators such as response times and eye movements, or even analyz-
ing the types of errors committed (AERA et al., 1999; Messick, 1989).
The Standards (AERA et al., 1999) note that studies of response processes are not re-
stricted to individuals taking the test, but may also examine the assessment professionals who
administer or grade the tests. When testing personnel record or evaluate the performance of
examinees, it is important to make sure that their processes or actions are in line with the
construct being measured. For example, many tests provide specific criteria or rubrics that are
intended to guide the scoring process. The Wechsler Individual Achievement Test—Second
Edition (WIAT-II; The Psychological Corporation, 2002) has a section to assess written ex-
pression that requires the examinee to write an essay. To facilitate grading, the authors include
an analytic scoring rubric that has four evaluative categories: mechanics (e.g., spelling, punc-
tuation), organization (e.g., structure, sequencing, use of introductory/concluding sentences,
etc.), theme development (use of supporting statements, evidence), and vocabulary (e.g.,
specific and varied words, unusual expressions). In validating this assessment it would be
helpful to evaluate the behaviors of individuals scoring the test to verify that the criteria are
being carefully applied and that irrelevant factors are not influencing the scoring process.

Evidence Based on Consequences of Testing


Recently, researchers have started examining the consequences of test use, both intended
and unintended, as an aspect of validity. In many situations the use of tests is largely based
on the assumption that their use will result in some specific benefit (AERA et al., 1999). For
example, if a test is used to identify qualified applicants for employment, it is assumed that
the use of the test will result in better hiring decisions (e.g., lower training costs, lower
turnover). If a test is used to help select students for admission to a college program, it is
assumed that the use of the test will result in better admissions decisions (e.g., greater
student success and higher retention). This line of
validity evidence simply asks the question, “Are these benefits being
achieved?” This type of validity evidence, often referred to as consequential validity evi-
dence, is most applicable to tests designed for selection and promotion.
Some authors have advocated a broader conception of validity, one that incorporates
social issues and values. For example, Messick (1989) in his influential chapter suggested
that the conception of validity should be expanded so that it “formally brings consideration
of value implications and social consequences into the validity framework” (p. 20). Other
testing experts have criticized this position. For example, Popham (2000) suggests that in-
corporating social consequences into the definition of validity would detract from the clar-
ity of the concept. Popham argues that validity is clearly defined as the “accuracy of
score-based inferences” (p. 111) and that the inclusion of social and value issues unneces-
sarily complicates the concept. The Standards (AERA et al., 1999) appear to avoid this
broader conceptualization of validity. The Standards distinguish between consequential
evidence that is directly tied to the concept of validity and evidence that is related to social
policy. This is an important but potentially difficult distinction to make. Consider a situation
in which research suggests that the use of a test results in different job selection rates for
different groups. If the test measures only the skills and abilities related to job performance,
evidence of differential selection rates does not detract from the validity of the test. This
information might be useful in guiding social and policy decisions, but it is not technically
an aspect of validity. If, however, the test measures factors unrelated to job performance, the
evidence is relevant to validity. In this case, it may suggest a problem with the validity of
the test such as the inclusion of construct-irrelevant factors.
Another component to this process is to consider the consequences of not using tests.
Even though the consequences of testing may produce some adverse effects, these must be
contrasted with the positive and negative effects of alternatives to using psychological tests.
If more subjective approaches to decision making are employed, for example, the likelihood
of cultural, ethnic, and gender biases in the decision-making process will likely increase.

Integrating Evidence of Validity


The Standards (AERA et al., 1999) state:

Validation can be viewed as developing a scientifically sound validity argument to support


the intended interpretation of test scores and their relevance to the proposed use. (p. 9)

The development of this validity argument typically involves the integration of numerous
lines of evidence into a coherent commentary. The development of a validity argument is an
ongoing process; it takes into consideration existing research and incorporates new sci-
entific findings. As we have noted, different types of validity evidence
are most applicable to different types of tests. Here is a brief review of some of the prominent
applications of different types of validity evidence.

■ Evidence based on test content is most often reported with academic achievement
tests and tests used in the selection of employees.

■ Evidence based on relations to other variables can be either test-criterion validity
evidence, which is most applicable when tests are used to predict performance on an external
criterion, or convergent and discriminant validity evidence, which can be useful with a wide
variety of tests, including intelligence tests, achievement tests, personality tests, and so on.

■ Evidence based on internal structure can be useful with a wide variety of tests, but
has traditionally been applied with tests measuring theoretical constructs such as personal-
ity or intelligence.

■ Evidence based on response processes can be useful with practically any test that
requires examinees to engage in any cognitive or behavioral activity.

■ Evidence based on consequences of testing is most applicable to tests designed for
selection and promotion, but can be useful with a wide range of tests.
You might have noticed that most types of validity evidence have applications to a broad va-
riety of tests, and this is the way it should be. The integration of multiple lines of research or
types of evidence results in a more compelling validity argument. It is also important to re-
member that every interpretation or intended use of a test must be validated. As we noted
earlier, if a test is used for different applications, each use or application must be validated. In
these situations it is imperative that different types of validity evidence be provided. Table 5.2
provides a summary of the major applications of different types of validity evidence.

TABLE 5.2 Sources of Validity Evidence


Source: Evidence based on test content
Example: Analysis of item relevance and content coverage
Major applications: Achievement tests and tests used in the selection of employees

Source: Evidence based on relations to other variables
Example: Test-criterion evidence; convergent and discriminant evidence; contrasted groups studies
Major applications: Wide variety of tests

Source: Evidence based on internal structure
Example: Factor analysis, analysis of test homogeneity
Major applications: Wide variety of tests, but particularly useful with tests of constructs such as personality or intelligence

Source: Evidence based on response processes
Example: Analysis of the processes engaged in by the examinee or examiner
Major applications: Any test that requires examinees to engage in a cognitive or behavioral activity

Source: Evidence based on consequences of testing
Example: Analysis of the intended and unintended consequences of testing
Major applications: Most applicable to tests designed for selection and promotion, but useful with a wide range of tests

Validity: Practical Strategies for Teachers

Validity refers to the appropriateness or accuracy of the interpretation of assessment results.
The results of classroom assessments are used in many different ways in today's schools, and
teachers need to consider the validity of all of these applications. One of the most prominent
uses of classroom assessment results is the summative evaluation of student knowledge and
skills in a specified content area (e.g., evaluating mastery and assigning grades). In this
context, Nitko (2001) developed a set of guidelines for evaluating and improving the validity
of the results of classroom assessments. These guidelines include the following.

Examination of Test Content. Evaluating the validity of the results of classroom assess-
ments often begins with an analysis of test content. As discussed earlier in this chapter, this
typically involves examining item relevance and content coverage. Analysis of item rele-
vance involves examining the individual test items and determining whether they reflect
essential elements of the content domain. Content coverage involves examining the overall
test and determining the degree to which the items cover the specified domain (refer back
to Figure 5.1). The question here is “Does validity evidence based on the content of the test
support the intended interpretations of test results?” In other words, is this test covering the
content it is supposed to cover?

Examination of Student Response Processes. This guideline considers the validity


evidence that examines the cognitive and behavioral processes engaged in by students. In
other words, do the assessments require the students to engage in the types of cognitive
processes and behavioral activities that are specified in the learning objectives? For exam-
ple, if your learning objectives involve problem solving, does the assessment require the
students to engage in problem solving? In a later chapter we will describe how the use of a
taxonomy of student abilities can help you develop tests that cover a broad range of abilities
and skills.

Examination of Relations to Other Assessments. This guideline examines the relation-


ship between a given assessment and other sources of information about students. That is,
are the results of the assessment in question consistent with other sources of information
(e.g., other tests, class projects, teacher’s observation)? If the results are consistent, the va-
lidity of the assessment is supported. If they are inconsistent, then the validity of the assess-
ment results may be questionable. With a computer and spreadsheet it is easy to calculate
the correlation between scores on multiple assessments and examine them for evidence of
consistency (e.g., moderate to high positive correlations among the different assessments).
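
For example, a teacher might compute something like the following; the scores below are hypothetical, and the same correlation matrix could just as easily be produced in a spreadsheet.

```python
# A minimal sketch (hypothetical scores) of checking consistency across several
# classroom assessments with a correlation matrix.
import numpy as np

unit_test = np.array([78, 85, 92, 64, 88, 73, 95, 81])
class_project = np.array([80, 82, 90, 70, 85, 75, 93, 84])
teacher_rating = np.array([3, 4, 5, 2, 4, 3, 5, 4])

scores = np.vstack([unit_test, class_project, teacher_rating])
r = np.corrcoef(scores)   # rows are assessments, columns are students

print(np.round(r, 2))     # moderate-to-high positive correlations support validity
```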

Examination of Reliability. This guideline considers evidence regarding the reliability


of assessment results. As we noted earlier the reliability of test scores sets an upper limit on
the validity of their interpretation. Although reliability does not assure validity, you cannot
have valid score interpretations if those scores are not reliable. As a result, efforts to increase
the reliability of assessment results can enhance validity.

Examination of Test Fairness. This guideline examines classroom assessments to ensure


fairness to all students. For example, tests and other assessments should be fair to students
from diverse ethnic and cultural backgrounds.

Examination of Practical Features. This guideline examines classroom assessments to


ensure they are practical and efficient. This involves a consideration of the amount of time
required to develop, administer, and score assessments. Although assessment plays an
extremely important role in today’s schools, their development and use should not consume
an inordinate amount of time.

Examination of the Overall Assessment Strategy. Nitko (2001) notes that even when
you follow all of the previous guidelines, perfect validity will always elude you. To counter
this, he recommends that teachers employ a multiple-assessment strategy that incorporates
the results of numerous assessments to measure student achievement.
We feel Nitko’s (2001) guidelines provide a good basis for evaluating and improving
the validity of classroom assessments. Although teachers typically do not have the time or
resources to conduct large-scale validity studies, these guidelines provide some practical
and sound advice for evaluating the validity of the results of classroom assessments.

Summary

In this chapter we introduced the concept of validity. In the context of educational and psy-
chological tests and measurement, validity refers to the degree to which theoretical and
empirical evidence supports the meaning and interpretation of test scores. In essence the
validity question is “Are the intended interpretations of test scores appropriate and accu-
rate?” Numerous factors can limit the validity of interpretations. The two major internal
threats to validity are construct underrepresentation (i.e., the test is not a comprehensive
measure of the construct it is supposed to measure) and construct-irrelevant variance (i.e.,
the test measures content or skills unrelated to the construct). Other factors that may reduce
validity include variations in instructional procedures, test administration/scoring proce-
dures, and student characteristics. There is also a close relationship between validity and
reliability. For a test to be valid it must be reliable, but at the same time reliability does not
ensure validity. Put another way, reliability is a necessary but insufficient condition for
validity.
As a psychometric concept, validity has evolved and changed over the last half cen-
tury. Until the 1970s validity was generally divided into three distinct types: content valid-
ity, criterion-related validity, and construct validity. This terminology was widely accepted
and is still often referred to as the traditional nomenclature. However, in the 1970s and
1980s measurement professionals started conceptualizing validity as a unitary construct.
That is, although there are different ways of collecting validity evidence, there are not dis-
tinct types of validity. To get away from the perception of distinct types of validity, today
we refer to different types of validity evidence. The most current typology includes the fol-
lowing five categories:

■ Evidence based on test content. Evidence derived from a detailed analysis of the test
content includes the type of questions or tasks included in the test and guidelines for admin-
istration and scoring. Collecting content-based validity evidence is often based on the eval-
uation of expert judges about the correspondence between the test’s content and its construct.
The key issues addressed by these expert judges are whether the test items assess relevant
content (i.e., item relevance) and the degree to which the construct is assessed in a compre-
hensive manner (i.e., content coverage).

■ Evidence based on relations to other variables. Evidence based on an examination


of the relationships between test performance and external variables or criteria can actu-
ally be divided into two subcategories of validity evidence: test-criterion evidence and
convergent and discriminant evidence. Test-criterion evidence is typically of interest
when a test is designed to predict performance on a criterion such as job performance or
success in college. Two types of studies are often used to collect test-criterion evidence:
predictive and concurrent studies. They differ in the timing of test administration and
criterion measurement. In a predictive study the test is administered and there is an inter-
val of time before the criterion is measured. In concurrent studies the test is administered
and the criterion is measured at approximately the same time. The collection of conver-
gent and discriminant evidence involves examining the relationship between a test and
other tests that measure similar constructs (convergent evidence) or dissimilar constructs
(discriminant evidence). If the test scores demonstrate the expected relationships with
these existing measures, this can be used as evidence of validity.

■ Evidence based on internal structure. Evidence examining the relationships among test
items and components, or the internal structure of the test, can help determine whether the
structure of the test is consistent with the hypothesized structure of the construct it measures.

■ Evidence based on response processes. Evidence analyzing the processes engaged in


by the examinee or examiner can help determine if test goals are being achieved. For ex-
ample, if the test is designed to measure mathematical reasoning, it is helpful to verify that
the examinees are actually engaging in mathematical reasoning and analysis as opposed to
performing rote calculations.

■ Evidence based on consequences of testing. Evidence examining the intended and un-
intended consequences of testing is based on the common belief that some benefit will result
from the use of tests. Therefore, it is reasonable to confirm that these benefits are being
achieved. This type of validity evidence has gained considerable attention in recent years and
there is continuing debate regarding the scope of this evidence. Some authors feel that social
consequences and values should be incorporated into the conceptualization of validity, whereas
others feel such a broadening would detract from the clarity of the concept.

Different lines of validity evidence are integrated into a cohesive validity argument that
supports the use of the test for different applications. The development of this validity argu-
ment is a dynamic process that integrates existing research and incorporates new scientific
findings. Validation is the shared responsibility of the test authors, test publishers, research-
ers, and even test users. Test authors and publishers are expected to provide preliminary evi-
dence of the validity of proposed interpretations of test scores whereas researchers often
pursue independent validity studies. Ultimately, those using tests are expected to weigh the
validity evidence and make their own judgments about the appropriateness of the test in their
own situations and settings, placing the practitioners or consumers of psychological tests in
the final, most responsible role in this process.

KEY TERMS AND CONCEPTS

Base rate, p. 135
Concurrent studies, p. 133
Construct-irrelevant variance, p. 124
Construct underrepresentation, p. 124
Construct validity, p. 126
Content coverage, p. 130
Content validity, p. 127
Contrasted group study, p. 139
Convergent evidence, p. 138
Criterion, p. 132
Criterion contamination, p. 134
Criterion-related validity, p. 127
Decision-theory models, p. 137
Discriminant evidence, p. 138
Evidence based on consequences of testing, p. 142
Evidence based on internal structure, p. 142
Evidence based on relations to other variables, p. 142
Evidence based on response processes, p. 142
Evidence based on test content, p. 142
Face validity, p. 131
Factor analysis, p. 139
Item relevance, p. 130
Linear regression, p. 135
Method variance, p. 139
Multitrait-multimethod matrix, p. 138
Predictive studies, p. 133
Selection ratio, p. 135
Standard error of estimate, p. 135
Table of specifications, p. 129
Test-criterion evidence, p. 133
Validity, p. 124
Validity argument, p. 141
Validity as a unitary concept, p. 127
Validity coefficient, p. 133

RECOMMENDED READINGS

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. Chapter 1 is a must read for those wanting to gain a thorough understanding of validity.

Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Champaign: University of Illinois Press. A classic, particularly with regard to validity evidence based on relations to external variables.

Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum. A classic for those really interested in understanding factor analysis.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Upper Saddle River, NJ: Merrill/Prentice Hall. A little technical at times, but very influential.

Sireci, S. G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5, 299-321. This article provides a good review of approaches to collecting validity evidence based on test content, including some of the newer quantitative approaches.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins. A great chapter on factor analysis that is less technical than Gorsuch (1983).


Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™


presentation and to listen to an audio lecture about this chapter.
CHAPTER 6

Item Analysis for Teachers

The better the items, the better the test.

CHAPTER HIGHLIGHTS

Item Difficulty Index (or Item Difficulty Level)
Item Discrimination
Distracter Analysis
Item Analysis: Practical Strategies for Teachers
Using Item Analysis to Improve Items
Item Analysis of Performance Assessments
Qualitative Item Analysis
Using Item Analysis to Improve Classroom Instruction

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Discuss the relationship between the reliability and validity of test scores and the quality of
the items on a test.
2. Describe the importance of the item difficulty index and demonstrate its calculation and
interpretation.
3. Describe how special assessment situations may impact the interpretation of the item
difficulty index.
4. Describe the importance of item discrimination and demonstrate its calculation and
interpretation.
5. Describe the relationship between item difficulty and discrimination.
6. Describe how item-total correlations can be used to examine item discrimination.
7. Describe how the calculation of item discrimination can be modified for mastery tests.
8. Describe the importance of distracter analysis and demonstrate its calculation and
interpretation.
9. Describe how the selection of distracters influences item difficulty and discrimination.
10. Apply practical strategies for item analysis to classroom tests.
11. Show how item analysis statistics can be used to improve test items.
12. Describe how item analysis procedures can be applied to performance assessments.


13. Describe qualitative approaches to improving test items.


14. Describe how information from item analyses can be used to improve classroom instruction.

A number of quantitative procedures are useful in assessing the quality and measurement
characteristics of the individual items that make up tests. Collectively these procedures are
referred to as item analysis statistics or procedures. Unlike reliability and validity analyses
that evaluate the measurement characteristics of a test as a whole, item analysis procedures
examine individual items separately, not the overall test. Item analysis statistics are useful
in helping test developers, including both professional psychometricians and classroom
teachers, decide which items to keep on a test, which to modify, and which to eliminate. In
addition to helping test developers improve tests by improving the individual items, they can
also provide valuable information regarding the effectiveness of instruction or training.
The reliability of test scores and the validity of the interpretation of test scores are
dependent on the quality of the items on the test. If you can improve the quality of the indi-
vidual items, you will improve the overall quality of your test. When
discussing reliability we noted that one of the easiest ways to increase the reliability of test
scores is to increase the number of items that go into making up the test score. This statement
is generally true and is based on the assumption that when you lengthen a test you add items
of the same quality as the existing items. If you use item analysis to delete poor items and
improve other items, it is possible to end up with a test that is shorter than the original
test and that also produces
scores that are more reliable and result in more valid interpretations.
Although quantitative procedures for evaluating the quality of test items will be the
focus of the chapter, some qualitative procedures may prove useful when evaluating the
quality of test items. These qualitative procedures typically involve an evaluation of validity
evidence based on the content of the test and an examination of individual items to ensure
they are technically accurate and clearly stated. Although qualitative procedures have not
received as much attention as their quantitative counterparts, it is often beneficial to use a
combination of quantitative and qualitative procedures.
Before describing the major quantitative item analysis procedures, we should first
note that different types of items and different types of tests require different types of item
analysis procedures. Items scored dichotomously (i.e., either right or wrong) are handled
differently than items scored on a continuum (e.g., an essay that can receive scores ranging
from 0 to 10). Tests designed to maximize the variability of scores (e.g., norm-referenced)
are handled differently than mastery tests (i.e., scored pass or fail). As we discuss various
item analysis procedures, we will specify which types of procedures are appropriate for
which types of items and tests.

Item Difficulty Index (or Item Difficulty Level)

When evaluating items on ability tests, an important consideration is the difficulty level of
the individual items. Item difficulty is defined as the percentage or proportion of test takers
who correctly answer the item. The item difficulty level or index is abbreviated as p and
calculated with the following formula:

p = Number of Examinees Correctly Answering the Item / Number of Examinees

For example, in a class of 30 students, if 20 students get the answer correct and ten are
incorrect, the item difficulty index is 0.67. The calculations are illustrated here.

p = 20/30 = 0.67
In the same class, if ten students get the answer correct and 20 are incorrect, the item
difficulty index is 0.33. The item difficulty index can range from 0.0 to 1.0 with easier
items having larger decimal values and difficult items at lower values. An item answered
correctly by all students receives an item difficulty of 1.0 whereas an item answered in-
correctly by all students receives an item difficulty of 0.0. Items with p values of either
1.0 or 0.0 provide no information about individual differences and are of no value from a
measurement perspective. Some test developers will include one or two items with p val-
ues of 1.0 at the beginning of a test to instill a sense of confidence in test takers. This is a
defensible practice from a motivational perspective, but from a technical perspective these
items do not contribute to the measurement characteristics of the test. Another factor that
should be considered about the inclusion of very easy or very difficult items is the issue of
time efficiency. The time students spend answering ineffective items is largely wasted and
could be better spent on items that enhance the measurement characteristics of the test.
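
Before considering what difficulty levels are desirable, the short sketch below shows how p values can be computed for every item at once from a table of scored responses; the 0/1 data are hypothetical.

```python
# A minimal sketch of computing item difficulty (p) values from a class's scored
# responses (1 = correct, 0 = incorrect).
import numpy as np

# Rows are students, columns are items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
])

# p for each item = number answering correctly / number of examinees.
p_values = responses.mean(axis=0)
print(np.round(p_values, 2))   # e.g., item 4 (p = 1.0) tells us nothing about differences
```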
For maximizing variability and reliability, the optimal item difficulty level is 0.50,
indicating that 50% of test takers answered the item correctly and 50% answered incor-
rectly. Based on this statement, you might conclude that it is desirable for all test items to
have a difficulty level of 0.50, but this is not necessarily true for several reasons. One reason
is that items on a test are often correlated with each other, which
means the measurement process may be confounded if all the items
have p values of 0.50. As a result, it is often desirable to select some items with p values
below 0.50 and some with values greater than 0.50, but with a mean of 0.50. Aiken (2000)
recommends that there should be approximately a 0.20 range of these p values around the
optimal value. For example, a test developer might select items with difficulty levels
ranging from 0.40 to 0.60, with a mean of 0.50.
Another reason why 0.50 is not the optimal difficulty level for every testing situation
involves the influence of guessing. On constructed-response items (e.g., essay and
short-answer items) for which guessing is not a major concern, 0.50 is typically considered
the optimal difficulty level. However, with selected-response items (e.g., multiple choice
and true—false items) for which test takers might answer the item correctly simply by
guessing, the optimal difficulty level varies. To take into consideration the effects of guessing,

TABLE 6.1 Optimal p Values for Items with Varying Numbers of Choices

Number of Choices Optimal Mean p Value

2 (e.g., true—false) 0.85


3 0.77
4 0.74
5 0.69
Constructed response (e.g., essay) 0.50

Source: Based on Lord (1952).

the optimal item difficulty level is set higher than for constructed-response items. For
example, for multiple-choice items with four options, the optimal mean p value should be
approximately 0.74 (Lord, 1952). That is, the test developer might select items with difficulty levels rang-
ing from 0.64 to 0.84 with a mean of approximately 0.74. Table 6.1 provides information on
the optimal mean p value for selected-response items with varying numbers of alternatives
or choices.

Special Assessment Situations and Item Difficulty


Our discussion of item difficulty so far is most applicable to norm-referenced tests. For
criterion-referenced tests, particularly mastery tests, item difficulty is evaluated differently.
On mastery tests the test taker typically either passes or fails and there is the expectation
that most test takers will eventually be successful. As a result, on mastery tests it is com-
mon for items to have average p values as high as 0.90. Other tests that are designed for
special assessment purposes may vary in terms of what represents desirable item difficulty
levels. For example, if a test were developed to help employers select the upper 25% of job
applicants, it would be desirable to have items with p values that average around 0.25. If it
is desirable for a test to be able to distinguish between the highest-performing examinees
(e.g., in testing gifted and talented students), it may also be desirable to include at least
some very difficult items. In summary, although a mean p of 0.50 is optimal for maximizing
variability among test takers, different difficulty levels are desirable in many testing ap-
plications (see Special Interest Topic 6.1 for another example). Later in this chapter we will
provide some examples of how test developers use information about item difficulty and
other item analysis statistics to select items to retain, revise, or delete from future admin-
istrations of the test. First, we will discuss another popular item analysis procedure—the
item discrimination index.

Item Discrimination

Item discrimination refers to how well an item can accurately discriminate between test
takers who differ on the construct being measured by the test.

SPECIAL INTEREST TOPIC 6.1


Item Difficulty Indexes and Power Tests

As a rule of thumb and for psychometric reasons explained in this chapter, we have noted that item
difficulty indexes of 0.50 are desirable in many circumstances on standardized tests. However, it is
also common to include some very easy items so all or most examinees get some questions correct,
as well as some very hard items, so the test has enough ceiling. With a power test, such as an IQ
test, that covers a wide age range and whose underlying construct is developmental, item selection
becomes much more complex. Items that work very well at some ages may be far too easy, too hard,
or just developmentally inappropriate at other ages. If a test covers the age range of say 3 years up to
20 years, and the items all have a difficulty level of 0.50, you could be left with a situation in which
the 3-, 4-, 5-, and even 6-year-olds typically pass no items and perhaps the oldest individuals nearly
always get every item correct. This would lead to very low reliability estimates at the upper and lower
ages and just poor measurement of the constructs generally, except near the middle of the intended
age range. For such power tests covering a wide age span, item statistics such as the difficulty index
and the discrimination index are examined at each age level and plotted across all age levels. In this
way, items can be chosen that are effective in measuring the relevant construct at different ages.
When the item difficulty indexes for such a test are examined across the entire age range, some will
approach 0.0 and some will approach 1.0. However, within the age levels, for example, for 6-year-
olds, many items will be close to 0.5. This affords better discrimination and gives each examinee a
range of items on which they can express their ability on the underlying trait.

For example, if a test is designed to measure reading comprehension,


item discrimination reflects an item’s ability to distinguish between individuals with good
reading comprehension skills and those with poor reading skills. Unlike item difficulty level
about which there is agreement on how to calculate the statistic, over 50 different indexes
of item discrimination have been developed over the years (Anastasi & Urbina, 1997). Fortu-
nately, most of these indexes produce similar results (Engelhart, 1965; Oosterhof, 1976).

Discrimination Index
Probably the most popular method of calculating an index of item discrimination is based
on the difference in performance between two groups. Although there are different ways
of selecting the two groups, they are typically defined in terms of total test performance.
One common approach is to select the top and bottom 27% of test takers in terms of their
overall performance on the test and exclude the middle 46% (Kelley, 1939). Some assessment
experts have suggested using the top and bottom 25%, some the top and bottom 33%, and
some the top and bottom halves. In practice, all of these are probably acceptable (later in
this chapter we will show you a more practical approach that saves both time and effort).
The difficulty of the item is computed for each group separately, and these are labeled

p_T and p_B (T for top, B for bottom). The difference between p_T and p_B is the discrimination
index, designated as D, and is calculated with the following formula (e.g., Johnson, 1951):

D = p_T - p_B

where D = discrimination index
      p_T = proportion of examinees in the top group getting the item correct
      p_B = proportion of examinees in the bottom group getting the item correct

To illustrate the logic behind this index, consider a classroom test designed to measure aca-
demic achievement in some specified area. If the item is discriminating between students
who know the material and those who do not, then students who are more knowledgeable
(i.e., students in the top group) should get the item correct more often than students who are
less knowledgeable (i.e., students in the bottom group). For example, if p_T = 0.80 (indicating
80% of the students in the top group answered the item correctly) and p_B = 0.30 (indicating
30% of the students in the bottom group answered the item correctly), then

D = 0.80 — 0.30 = 0.50
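If the 0/1 item scores for the two groups are available electronically, the same arithmetic can be scripted. The sketch below is our own; the two score lists are hypothetical data chosen to reproduce the worked example.

# Our own sketch of the discrimination index D = p_T - p_B.
top_group    = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]   # hypothetical 0/1 item scores, p_T = 0.80
bottom_group = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]   # hypothetical 0/1 item scores, p_B = 0.30

p_top = sum(top_group) / len(top_group)
p_bottom = sum(bottom_group) / len(bottom_group)
D = p_top - p_bottom
print(f"p_T = {p_top:.2f}, p_B = {p_bottom:.2f}, D = {D:.2f}")   # D = 0.50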

Hopkins (1998) provided guidelines for evaluating items in terms of their D values
(see Table 6.2). According to these guidelines, D values of 0.40 and above are considered
excellent, between 0.30 and 0.39 are good, between 0.11 and 0.29 are fair, and between
0.00 and 0.10 are poor. Items with negative D values are likely miskeyed or there are other
serious problems. Other testing and assessment experts have provided different guidelines,
some more rigorous and some more lenient. As a general rule, we suggest that items with
D values over 0.30 are acceptable (the larger the better), and items with D values below
0.30 should be carefully reviewed and possibly revised or deleted. However, this is only a
general rule and there are exceptions. For example, most indexes of item discrimination,
including the item discrimination index (D), are biased in favor of items with intermediate
difficulty levels. That is, the maximum D value of an item is

TABLE 6.2 Guidelines for Evaluating D Values

D Value                      Evaluation

0.40 and larger Excellent


0.30-0.39 Good
0.11-0.29 Fair
0.00-0.10 Poor
Negative values Miskeyed or other major flaw

Source: Based on Hopkins (1998).



TABLE 6.3. Maximum D Values at Different Difficulty Levels

Item Difficulty Index (p) Maximum D Value

1.00 0.00
0.90 0.20
0.80 0.40
0.70 0.60
0.60 0.80
0.50 1.00
0.40 0.80
0.30 0.60
0.20 0.40
0.10 0.20
0.00 0.00

related to its p value (see Table 6.3). Items that all test takers either pass or fail (i.e., p values
of either 0.0 or 1.0) cannot provide any information about individual differences and their
D values will always be zero. If half of the test takers correctly answered an item and half
failed (i.e., p value of 0.50), then it is possible for the item’s D value to be 1.0. This does not
mean that all items with p values of 0.50 will have D values of 1.0, but just that the item can
conceivably have a D value of 1.0. As a result of this relationship between p and D, items that
have excellent discrimination power (i.e., D values of 0.40 and above) will necessarily have p
values between 0.20 and 0.80. In testing situations in which it is desirable to have either very
easy or very difficult items, D values can be expected to be lower than those normally desired.
Additionally, items that measure abilities or objectives that are not emphasized throughout
the test may have poor discrimination due to their unique focus. In this situation, if the item
measures an important ability or learning objective and is free of technical defects, it should
be retained (e.g., Linn & Gronlund, 2000).
In summary, although low D values often indicate problems, the guidelines provided
in Table 6.2 should be applied in a flexible, considered manner. Our discussion of the cal-
culation of item difficulty and discrimination indexes has used examples with items that are
dichotomously scored (i.e., correct/incorrect, 1 or 0). Special Interest Topic 6.2 provides a
discussion of the application of these statistics with constructed-response items that are not
scored in a dichotomous manner.

Item-Total Correlation Coefficients

Another approach to examining item discrimination is to correlate performance on the
item with the total test score. The total test score is usually the total number of items answered correctly

SPECIAL INTEREST TOPIC 6.2


Item Analysis for Constructed-Response Items

Our discussion and examples of the calculation of the item difficulty index and discrimination index
used examples that were dichotomously scored (i.e., scored right or wrong: 0 or 1). Although this
procedure works fine with selected-response items (e.g., true—false, multiple-choice), you need a
slightly different approach with constructed-response items that are scored in a more continuous
manner (e.g., an essay item that can receive scores between 1 and 5 depending on quality). To calcu-
late the item difficulty index for a continuously scored constructed-response item, use the following
formula (Nitko, 2001):

p = Average Score on the Item / Range of Possible Scores

The range of possible scores is calculated as the maximum possible score on the item minus the
minimum possible score on the item. For example, if an item has an average score of 2.7 and is
scored on a 1 to 5 scale, the calculation would be:

p = 2.7 / (5 - 1) = 2.7 / 4 = 0.675

Therefore, this item has an item difficulty index of 0.675. This value can be interpreted the same as
the dichotomously scored items we discussed.
To calculate the item discrimination index for a continuously scored constructed-response
item, you use the following formula (Nitko, 2001):

D = (Average Score for the Top Group - Average Score for the Bottom Group) / Range of Possible Scores

For example, if the average score for the top group is 4.3, the average score for the bottom group is
1.7, and the item is scored on a 1 to 5 scale, the calculation would be:

D = (4.3 - 1.7) / (5 - 1) = 2.6 / 4 = 0.65

Therefore, this item has an item discrimination index of 0.65. Again, this value can be interpreted
the same as the dichotomously scored items we discussed.
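Because both formulas are simple ratios, they are easy to script. The sketch below is our own translation of Nitko's (2001) formulas shown above; the numbers mirror the worked examples in this box.

# Our own sketch of Nitko's (2001) formulas for continuously scored items.
def cr_item_difficulty(average_score, max_score, min_score):
    # p = average item score divided by the range of possible scores
    return average_score / (max_score - min_score)

def cr_item_discrimination(avg_top, avg_bottom, max_score, min_score):
    # D = difference between group means divided by the range of possible scores
    return (avg_top - avg_bottom) / (max_score - min_score)

print(cr_item_difficulty(2.7, max_score=5, min_score=1))            # 0.675
print(cr_item_discrimination(4.3, 1.7, max_score=5, min_score=1))   # 0.65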

(unadjusted) or the total number of items answered correctly omitting the item being ex-
amined (adjusted). Either way, the item-total correlation is usually calculated using the
point-biserial correlation. As you remember from our discussion of basic statistics, the
point-biserial is used when one variable is a dichotomous nominal score and the other vari-
able is measured on an interval or ratio scale. Here the dichotomous variable is the score
on a single item (e.g., right or wrong) and the variable measured on an interval scale is the

total test score. A large item-total correlation is taken as evidence that an item is measur-
ing the same construct as the overall test measures and that the item discriminates between
individuals high on that construct and those low on that construct. An item-total correlation
calculated on the adjusted total will be lower than that computed on the unadjusted total and is
preferred because the item being examined does not "contaminate" or inflate the correlation.
The results of an item-total correlation will be similar to those of an item discrimination index
and can be interpreted in a similar manner (Hopkins, 1998). As teachers gain more access to
computer test scoring programs, the item-total correlation will become increasingly easy to
compute and will likely become the dominant approach for examining item discrimination.
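Because the point-biserial is mathematically a Pearson correlation in which one variable happens to be dichotomous, any statistics package will compute it. The sketch below is our own illustration and assumes the scipy library is available; the score data are hypothetical.

# Our own sketch of an adjusted item-total (point-biserial) correlation.
from scipy.stats import pearsonr

item_scores  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]             # hypothetical 0/1 scores on one item
total_scores = [38, 22, 35, 40, 25, 33, 20, 37, 36, 27]   # hypothetical total test scores

# Adjusted total: subtract the item's own contribution before correlating.
adjusted_totals = [total - item for total, item in zip(total_scores, item_scores)]
r_item_total, _ = pearsonr(item_scores, adjusted_totals)
print(f"adjusted item-total correlation = {r_item_total:.2f}")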

Item Discrimination on Mastery Tests


As we noted previously, the item difficulty indexes on mastery tests tend to be higher (in-
dicating easier items) than on tests designed primarily to produce norm-referenced scores.
This is because with mastery testing it is usually assumed that most examinees will be
successful. As a result, on mastery tests it is common for items to have average p values
as high as 0.90, and the standard approach to interpreting item difficulty levels needs to be
modified to accommodate this tendency.
     The interpretation of indexes of discrimination is also complicated on mastery tests.
Because it is common to obtain high p values
for both high- and low-scoring examinees, it is normal for traditional item discrimination
indexes to underestimate an item’s true measurement characteristics. Several different ap-
proaches have been suggested for determining item discrimination on mastery tests (e.g.,
Aiken, 2000; Popham, 2000). One common approach involves administering the test to two
groups of students: one group that has received instruction and one that has not received
instruction. The formula is:

D = p_instruction - p_no instruction

where p_instruction = proportion of instructed students getting the answer correct
      p_no instruction = proportion of students without instruction getting the answer correct

This approach is technically adequate, with the primary limitation being potential dif-
ficulty obtaining access to an adequate group that has not received instruction or training
on the relevant material. If one does have access to an adequate sample, this is a promis-
ing approach.
Another popular approach involves administering the test to the same sample twice,
once before instruction and once after instruction. The formula is:

D = p_posttest - p_pretest

where p_posttest = proportion of examinees getting the answer correct on the posttest
      p_pretest = proportion of examinees getting the answer correct on the pretest

Some drawbacks are associated with this approach. First, it requires that the test developers
write the test, administer it as a pretest, wait while instruction is provided, administer it as
a posttest, and then calculate the discrimination index. This can take an extended period
of time in some situations, and test developers often want feedback in a timely manner. A
second limitation is the possibility of carryover effects from the pre- to the posttest. For
example, examinees might remember items or concepts emphasized on the pretest, and
this carryover effect can influence how they respond to instruction, study, and subsequently
prepare for the posttest.
Aiken (2000) proposed another approach for calculating discrimination for mastery
tests. Instead of using the top and bottom 27% of students (or the top and bottom 50%), he
recommends using item difficulty values based on the test takers who reached the mastery
cut score (i.e., mastery group) and those who did not reach mastery (i.e., nonmastery group),
using the following formula:

D = p_mastery - p_nonmastery

where p_mastery = proportion of mastery examinees getting the answer correct
      p_nonmastery = proportion of nonmastery examinees getting the answer correct

The advantage of this approach is that it can be calculated based on the data from one test
administration with one sample. A potential problem is that because it is common for the
majority of examinees to reach mastery, the p value of the nonmastery group might be based
on a small number of examinees. As a result the statistics might be unstable and lead to er-
roneous conclusions.
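To make the last formula concrete, here is our own sketch of how the mastery/nonmastery version of D might be computed from a single administration; the 0.80 cut score and the examinee records are hypothetical.

# Our own sketch of the mastery/nonmastery discrimination index.
MASTERY_CUT = 0.80   # hypothetical proportion-correct score required for mastery

# Each record: (proportion correct on the whole test, 0/1 score on the item of interest)
records = [(0.95, 1), (0.90, 1), (0.85, 1), (0.82, 0), (0.88, 1),
           (0.75, 0), (0.60, 0), (0.70, 1), (0.55, 0), (0.78, 0)]

mastery    = [item for total, item in records if total >= MASTERY_CUT]
nonmastery = [item for total, item in records if total < MASTERY_CUT]

D = sum(mastery) / len(mastery) - sum(nonmastery) / len(nonmastery)
print(f"D = {D:.2f}")   # 0.80 - 0.20 = 0.60 for these hypothetical data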

Item Analysis of Speed Tests


Based on our discussion up to this point it should be clear that there are situations in which
the interpretation of indexes of item difficulty and discrimination are complicated. One
such situation involves item analysis of speed tests, whereby performance depends pri-
marily on the speed of performance. Items on speed tests are often fairly easy and could be
completed by most test takers if there were no time limits. However, there are strict time
limits, and these limits are selected so that no test taker will be able to complete all of the
items. The key factor is how many items the test taker is able to complete in the allotted
time. On power tests everyone is given sufficient time to attempt all the items, but the
items vary in difficulty with some being so difficult that no test takers will answer them all
correctly. In many situations tests incorporate a combination of speed and power, so the
speed—power distinction is actually one of degree.
On speed tests, measures of item difficulty and discrimination will largely reflect the
location of the item in the test rather than the item's actual difficulty level or ability to
discriminate. Items that appear late on a speed test will be passed by fewer individuals
than items appearing earlier simply because the strict time limits prevent students from
being able to attempt them. The items appearing later on the test are probably not actually
more difficult than the earlier items, but their item difficulty index will suggest that they

are more difficult. Similar complications arise when interpreting indexes of discrimination
with speed tests. Because the individuals completing the later items also tend to be the most
capable test takers, indexes of discrimination may exaggerate the discriminating ability of
these items. Although different procedures have been developed to take into consideration
these and related factors, they all have limitations and none have received widespread ac-
ceptance (e.g., Aiken, 2000; Anastasi & Urbina, 1997). Our recommendation is that you
should be aware of these issues and take them into consideration when interpreting the item
analyses of speed tests.

Distracter Analysis
The final quantitative item analysis procedure we will discuss in this chapter involves the
analysis of individual distracters. On multiple-choice items, the incorrect alternatives are re-
ferred to as distracters because they serve to “distract” examinees who do not actually know
the correct response. Some test developers routinely examine the performance of distracters
for all multiple-choice items, whereas others reserve distracter analysis for items with p or
D values that suggest problems. If you are a professional test developer you can probably
justify the time required to examine each distracter for each item, but for busy teachers it
is reasonable to reserve distracter analysis procedures for items that need further scrutiny
based on their p or D values.
Distracter analysis allows you to examine how many examinees in the top and bottom
groups selected each option on a multiple-choice item. The key is to examine each distracter
and ask two questions. First, did the distracter distract some examinees? If no examinees
selected the distracter, it is not doing its job. An effective distracter must be selected by
some examinees. If a distracter is so obviously incorrect that no examinees select it, it is
ineffective and needs to be revised or replaced. The second question involves discrimination.
Did the distracter attract more examinees in the bottom group than in the top group? Effective
distracters should. When looking at the correct response, we expect more examinees in the
top group to select it than examinees in the bottom group (i.e., it
demonstrates positive discrimination). With distracters we expect the opposite. We expect
more examinees in the bottom group to select a distracter than examinees in the top group.
That is, distracters should demonstrate negative discrimination!
Consider the following example:

Options

Item 1                          A*    B     C     D

Number in top group             22    3     2     3

Number in bottom group           9    7     8     6

*Correct answer

For this item, p = 0.52 (moderate difficulty) and D = 0.43 (excellent discrimination). Based
on these values, this item would probably not require further examination. However, this
can serve as an example of what might be expected with a “good” item. As reflected in
the D value, more examinees in the top group than the bottom group selected the correct
answer (1.e., option A). By examining the distracters (i.e., options B, C, and D), you see
that they were all selected by some examinees, which means they are serving their purpose
(i.e., distracting examinees who do not know the correct response). Additionally, all three
distracters were selected more by members of the bottom group than the top group. This is
the desired outcome! While we want more high-scoring examinees to select the correct an-
swer than low-scoring examinees (i.e., positive discrimination), we want more low-scoring
examinees to select distracters than high-scoring examinees (i.e., negative discrimination).
In summary, this is a good item and all of the distracters are performing well.
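Tallying the options chosen by the top and bottom groups is a simple counting exercise. The sketch below is our own; the hypothetical response lists reproduce the counts in the table above (option A keyed correct).

# Our own sketch of a distracter analysis tally for the item above.
from collections import Counter

top_choices    = ["A"] * 22 + ["B"] * 3 + ["C"] * 2 + ["D"] * 3   # hypothetical responses
bottom_choices = ["A"] * 9 + ["B"] * 7 + ["C"] * 8 + ["D"] * 6    # hypothetical responses

top_counts, bottom_counts = Counter(top_choices), Counter(bottom_choices)
for option in "ABCD":
    label = "correct" if option == "A" else "distracter"
    print(f"{option} ({label}): top = {top_counts[option]}, bottom = {bottom_counts[option]}")

n_top, n_bottom = len(top_choices), len(bottom_choices)
p = (top_counts["A"] + bottom_counts["A"]) / (n_top + n_bottom)
D = top_counts["A"] / n_top - bottom_counts["A"] / n_bottom
print(f"p = {p:.2f}, D = {D:.2f}")   # p = 0.52, D = 0.43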
Now we will look at an example that illustrates some problems.

Options

Item 1                          A*    B     C     D

Number in top group             17    9     0     4

Number in bottom group          13    6     0    11

*Correct answer

For this item, p = 0.50 (moderate difficulty) and D = 0.14 (fair discrimination but further
scrutiny suggested). Based on these values, this item needs closer examination and possible
revision. Examining option B, you will notice that more examinees in the top group than in
the bottom group selected this distracter. This is not a desirable situation; option B needs to
be examined to determine why it is attracting top examinees. It is possible that the wording is
ambiguous or that the option is similar in some way to the correct answer. Examining option
C, you note that no one selected this distracter. It attracted no examinees, was obviously not
the correct answer, and needs to be replaced. To be effective, a distracter must distract some
examinees. Finally, option D performed well. More poor-performing examinees selected this
option than top-performing examinees (i.e., 11 versus 4). It is likely that if the test developer
revises options B and C this will be a more effective item.

How Distracters Influence Item Difficulty and Discrimination


Before leaving our discussion of distracters, we want to highlight how the selection of dis-
tracters impacts both item difficulty and discrimination. Consider the following item:
1. In what year did Albert Einstein first publish his full general theory of relativity?
a. 1910
b. 1912
c. 1914
d. 1916
e. 1918

Unless you are very familiar with Einstein’s work, this is probably a fairly difficult question.
Now consider this revision:

1. In what year did Albert Einstein first publish his full general theory of relativity?
a. 1655
b. 1762
c. 1832
d. 1916
e. 2001

This is the same question but with different distracters. This revised item would likely be a
much easier item in a typical high school science class. The point is that the selection of
distracters can significantly impact the difficulty of the item and consequently the ability
of the item to discriminate.

Item Analysis: Practical Strategies for Teachers

Teachers typically have a number of practical options for calculating item analysis statistics
for their classroom tests. Many teachers will have access to computer scoring programs
that calculate the various item analysis statistics we have described. Numerous commercial
companies sell scanners and scoring software that can scan answer sheets and produce item
analysis statistics and related printouts (see Table 6.4 for two examples). If you do not have
access to computer scoring at your school, Website Reactions has an excellent Internet site
that allows you to compute common item analysis statistics online
(www.surveyreaction.com/itemanalysis.asp).

TABLE 6.4 Two Examples of Test Scoring and Item Analysis Programs

Assessment Systems Corporation


One of its products, ITEMAN, can score and analyze a number of item
formats, including multiple-choice and true—false items. This product will
compute common item analysis and test statistics (e.g., mean, variance,
standard deviation, KR-20). Its Internet site is www.assess.com/Software/
sItemTest.htm.

Principia Products
One of its products, Remark Office OMR, will grade tests and produce
statistics and graphs reflecting common item analysis and test statistics.
Its Internet site is www.principiaproducts.com/office/index.html.

If you prefer to perform the calculations by hand, several authors have suggested some
abbreviated procedures that make the calculation of common item analysis statistics fairly
easy (e.g., Educational Testing Service, 1973; Linn & Gronlund, 2000). Although there are
some subtle differences between these procedures, they generally involve the following
steps:

1. Once the tests are graded, arrange them according to score (i.e., lowest to highest
score).
2. Take the ten papers with the highest scores and the ten with the lowest scores. Set
these into two piles. Set aside the remaining papers; they will not be used in these
analyses.
3. For each item, determine how many of the students in the top group correctly an-
swered it and how many in the bottom group correctly answered it. With this infor-
mation you can calculate the overall item difficulty index (i.e., p) and separate item
difficulty indexes for the top group (p_T) and bottom group (p_B). For example, if
eight students in the top group answered the item correctly and three in the bottom
group answered the item correctly, add these together (8 + 3 = 11) and divide by 20
to compute the item difficulty index: p = 11/20 = 0.55. Although this item difficulty
index is based on only the highest and lowest scores, it is usually adequate for use
with classroom tests. You can then calculate p_T and p_B. In this case: p_T = 8/10 = 0.80
and p_B = 3/10 = 0.30.
4. You now have the data needed to calculate the discrimination index for the items.
Using the data for our example: D = p_T - p_B = 0.80 - 0.30 = 0.50.

Using these simple procedures you see that for this item p = 0.55 (moderate difficulty) and
D=0.50 (excellent discrimination). If your items are multiple choice you can also use these
same groups to perform distracter analysis.
Continuing with our example, consider the following results:

Options

                                A     B*    C     D

Top group (top 10)              0     8     1     1

Bottom group (bottom 10)        2     3     3     2

*Correct answer

As reflected in the item D value (i.e., 0.50), more students in the top group than the bottom
group selected the correct answer (i.e., option B). By examining the distracters (i.e., op-
tions A, C, and D), you see that they each were selected by some students (i.e., they are all
distracting as hoped for) and they were all selected by more students in the bottom group
than the top group (i.e., demonstrating negative discrimination). In summary, this item is
functioning well.
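If your class records are already in a spreadsheet or gradebook file, the four steps above can be scripted in a few lines. The sketch below is our own illustration of the abbreviated procedure; the score pairs are hypothetical and were chosen to reproduce the worked example (p = 0.55, D = 0.50).

# Our own sketch of the abbreviated procedure: sort by total score, take the ten
# highest and ten lowest papers, and compute p, p_T, p_B, and D for one item.
# Each pair: (total test score, 0/1 score on the item of interest) - hypothetical data.
students = [(95, 1), (92, 1), (90, 1), (88, 1), (87, 1), (85, 1), (84, 1), (83, 0),
            (82, 1), (80, 0), (79, 1), (76, 0), (75, 1), (72, 0), (70, 0), (68, 0),
            (65, 1), (62, 0), (60, 0), (58, 1), (55, 0), (52, 0), (50, 1), (45, 0)]

ranked = sorted(students, key=lambda s: s[0], reverse=True)
top10, bottom10 = ranked[:10], ranked[-10:]   # middle papers are set aside

p_top = sum(item for _, item in top10) / 10
p_bottom = sum(item for _, item in bottom10) / 10
p = (sum(item for _, item in top10) + sum(item for _, item in bottom10)) / 20
D = p_top - p_bottom
print(f"p = {p:.2f}, p_T = {p_top:.2f}, p_B = {p_bottom:.2f}, D = {D:.2f}")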

Using Item Analysis to Improve Items


At this point we will provide a few examples and illustrate a step-by-step procedure for
evaluating the quality of test items. Other authors have provided similar guidelines (e.g.,
Kubiszyn & Borich, 2000) that pose essentially the same questions.
Consider this information:

Example 1

Options
p = 0.63
D = 0.40                        A     B*    C     D

Number in top group             2    23     2     1

Number in bottom group          6    13     6     5

*Correct answer

To illustrate our step-by-step procedure for evaluating the quality of test items, con-
sider this series of questions and how it applies to the first example.

1. Is the item difficulty level appropriate for the testing application? A p of 0.63 is
   appropriate for a multiple-choice item on a norm-referenced test. Remember, the
   optimal mean p value for a multiple-choice item with four choices is 0.74.
2. Does the item discriminate adequately? With a D of 0.40 this item does an excellent
   job of discriminating between examinees who performed well on the test and those
   doing poorly.
3. Are the distracters performing adequately? Because the answers to the previous
   questions were positive, we might actually skip this question. However, be-
   cause we have the data available we can easily examine the result. All three dis-
   tracters (i.e., A, C, and D) attracted some examinees and all three were selected
   more frequently by members of the bottom group than the top group. This is the
   desired outcome.
4. Overall evaluation? In summary, this is a good item and no revision is necessary.

Now we will consider an item that is problematic. Examine these data:

Example 2
Options
p = 0.20
D = -0.13                       A     B     C*    D

Number in top group            20     4     4     2

Number in bottom group         11     6     8     5

*Correct answer

1. Is the item difficulty level appropriate for the testing application? A p of 0.20 sug-
gests that this item is too difficult for most applications. Unless there is some reason
for including items that are this difficult, this is cause for concern.
2. Does the item discriminate adequately? A D of —0.13 suggests major problems with
this item. It may be miskeyed or some other major flaw is present.
3. Are the distracters performing adequately? Option A, a distracter, attracted most
of the examinees in the top group and a large number of examinees in the bottom
group. The other three options, including the one keyed as correct, were negative
discriminators (i.e., selected more by examinees in the bottom group than the top
group).
4. Overall evaluation? There is a major problem with this item! Because five times as
many examinees in the top group selected option A than option C, which is keyed as
correct, we need to verify that option C actually is the correct response. If the item
is miskeyed and option A is the correct response, this would likely be an acceptable
item (p = 0.52, D = 0.30) and could be retained. If the item was not miskeyed, there
is some other major flaw and the item should be deleted.

Now consider this example:

Example 3

Options
p = 0.43
D = 0.20                        A     B     C     D*

Number in top group 9 2 3 16


Number in bottom group 4 i w. 10

*Correct answer

1. Is the item difficulty level appropriate for the testing application? A p of 0.43 sug-
gests that this item is moderately difficult.
2. Does the item discriminate adequately? A D of 0.20 indicates this item is only a fair
discriminator.
3. Are the distracters performing adequately? Options B and C performed admirably
with more examinees in the bottom group selecting them than examinees in the top
group. Option A is another story! Over twice as many examinees in the top group
selected it than examinees in the bottom group. In other words, this distracter is at-
tracting a fairly large number of the top-performing examinees. It is likely that this
distracter either is not clearly stated or resembles the correct answer in some manner.
Either way, it is not effective and should be revised.
4. Overall evaluation? In its current state, this item is marginal and can stand revision. It
can probably be improved considerably by carefully examining option A and revising
this distracter. If the test author is able to replace option A with a distracter as effective
as B or C, this would likely be a fairly good item.

We will look at one more example:

Example 4

Options
p = 0.23
D = 0.27                        A     B     C*    D

Number in top group             6     7    11     6

Number in bottom group          9    10     3     8

*Correct answer

1. Is the item difficulty level appropriate for the testing application? A p of 0.23 sug-
   gests that this item is more difficult than usually desired.
2. Does the item discriminate adequately? A D of 0.27 indicates this item is only a fair
   discriminator.
3. Are the distracters performing adequately? All of the distracters (i.e., options A, B,
   and D) were selected by some examinees, which means that they are serving their
   purpose. Additionally, all of the distracters were selected more by the bottom group
   than the top group (i.e., negative discrimination), the desired outcome.
4. Overall evaluation? This item is more difficult than typically desired and demon-
   strates only marginal discrimination. However, its distracters are all performing prop-
   erly. If this item is measuring an important concept or learning objective, it might be
   desirable to leave it in the test. It might be improved by manipulating the distracters
   to make the item less difficult.
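The series of questions applied in Examples 1 through 4 can also be expressed as a rough screening function. The thresholds below are only rules of thumb drawn from the guidelines discussed earlier, and the function itself is our own sketch rather than a formal procedure.

# Our own sketch of the step-by-step screening questions from Examples 1-4.
# Thresholds are rough rules of thumb, not fixed standards.
def evaluate_item(p, D, distracter_counts_top, distracter_counts_bottom):
    # Return a list of flags; an empty list means no obvious problems were detected.
    flags = []
    if p < 0.25 or p > 0.90:
        flags.append("difficulty outside the range usually desired; check that it is intended")
    if D < 0.00:
        flags.append("negative D: possibly miskeyed or seriously flawed")
    elif D < 0.30:
        flags.append("low discrimination: review the item")
    for option in distracter_counts_top:
        top, bottom = distracter_counts_top[option], distracter_counts_bottom[option]
        if top + bottom == 0:
            flags.append(f"distracter {option} attracted no one: replace it")
        elif top > bottom:
            flags.append(f"distracter {option} attracted more top-group examinees: revise it")
    return flags

# Example 2 from above (distracters A, B, and D; option C keyed correct)
print(evaluate_item(0.20, -0.13,
                    {"A": 20, "B": 4, "D": 2},
                    {"A": 11, "B": 6, "D": 5}))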

Item Analysis of Performance Assessments

In Chapter 1 we introduced you to performance assessments, noting that they have be-
come very popular in educational settings in recent years. Performance assessments require
test takers to complete a process or produce a product in a setting that closely resembles
real-life situations (AERA et al., 1999). Traditional item analysis statistics have not been
applied to performance assessments as routinely as they have to more traditional
paper-and-pencil tests. One factor limiting the application of item analysis statistics is that
performance assessments often involve a fairly small number of tasks (and sometimes only
one task). However, Linn and Gronlund (2000) suggest that if the assessment involves
several tasks, item analysis procedures can be adopted for performance assessments. For
example, if a performance assessment involves five individual tasks that
receive scores from 0 (no response) to 5 (exemplary response), the
total scores would theoretically range from a low of 0 to a high of 25. Using the practical
strategy of comparing performance between the top 10 high-scoring students with that of
the low-scoring students, one can examine each task to determine whether the task discrimi-
nates between the two groups.

Consider this example:

Performance Assessment Task 1

Scores

Group 0 1 2 3 4 5 Mean Score

Top group (top 10) 0 0 0 1 4 5 4.4


Bottom group (bottom 10) 1 3 5 1 0 0 1.6

On this task the mean score of the top-performing students was 4.4 while the mean score
of the low-performing students was 1.6. This relatively large difference between the mean
scores suggests that the item is discriminating between the two groups.
Now examine the following example:

Performance Assessment Task 2

Scores

Group 0 1 2 3 4 5 Mean Score

Top group (top 10) 0 2 3 3 1 1 2.6


Bottom group (bottom 10) 0 2 4 3 1 0 2.3

On this task the mean score of the top-performing students was 2.6 while the mean score
of the low-performing students was 2.3. A difference this small suggests that the item is
not discriminating between the two groups. Linn and Gronlund (2000) suggest that two
possible reasons for these results should be considered. First, it is possible that this item
is not discriminating because the performance measured by this task is ambiguous. If this
is the case, the task should be revised or discarded. Second, it is possible that this item is
measuring skills and abilities that differ significantly from those measured by the other four
tasks in the assessment. If this is the case, it is not necessarily a poor item that needs to be
revised or discarded.
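The comparison of group means in these tables is easy to compute directly. The sketch below is our own; the score lists are hypothetical data that reproduce the two tasks above.

# Our own sketch of the top-group vs. bottom-group comparison for performance tasks.
def mean_difference(top_scores, bottom_scores):
    # Difference between the mean task scores of the two groups
    return sum(top_scores) / len(top_scores) - sum(bottom_scores) / len(bottom_scores)

task1_top    = [3, 4, 4, 4, 4, 5, 5, 5, 5, 5]   # mean 4.4
task1_bottom = [0, 1, 1, 1, 2, 2, 2, 2, 2, 3]   # mean 1.6
task2_top    = [1, 1, 2, 2, 2, 3, 3, 3, 4, 5]   # mean 2.6
task2_bottom = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4]   # mean 2.3

print(f"Task 1 difference: {mean_difference(task1_top, task1_bottom):.1f}")   # discriminates well
print(f"Task 2 difference: {mean_difference(task2_top, task2_bottom):.1f}")   # little discrimination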

Qualitative Item Analysis

In addition to the quantitative item analysis procedures described to this point, test
developers can also use qualitative item analysis procedures to improve their tests. Along
these lines, Popham (2000) provides some useful suggestions. He recommends that after
writing the test the developer set the test aside for a few days to gain some distance from
it. We can tell you from our own experience this is good advice. Even though you carefully
proof a test immediately after writ-

ing it, a review a few days later will often reveal a number of errors. This delayed review often
catches both clerical errors (e.g., spelling or grammar) and less obvious errors that might
make an item unclear or inaccurate. After a "cooling-off" period we are often amazed that an
“obvious” error evaded detection earlier. Somehow the introduction of a period of time pro-
vides distance that seems to make errors more easily detected. The time you spend proofing a
test is well spent and can help you avoid problems once the test is administered and scored.
Popham (2000) also recommends that you have a colleague review the test. Ideally
this should be a colleague familiar with the content of the test. For example, a history
teacher might have another history teacher review the test. In addition to checking for cleri-
cal errors, clarity, and accuracy, the reviewer should determine whether the test is covering
the material that it is designed to cover. This is akin to collecting validity evidence based on
the content of the test. For example, on a classroom achievement test you are trying to deter-
mine whether the items cover the material that the test is supposed to cover. Finally, Popham
recommends that you have the examinees provide feedback on the test. For example, after
completing the test you might have the examinees complete a brief questionnaire asking
whether the directions were clear and if any of the questions were confusing.
Ideally a test developer should use both quantitative and qualitative approaches to
improve tests. We regularly provide a delayed review of our own tests and use colleagues as
reviewers whenever possible. After administering a test and obtaining the quantitative item
analyses, we typically question students about problematic items, particularly items for
which the basis of the problem is not obvious. Often a combination of quantitative and
qualitative procedures will result in the optimal enhancement of your tests.
Popham (2000) notes that historically quantitative item analysis procedures have been
applied primarily to tests using norm-referenced score interpretations and qualitative pro-
cedures have been used primarily with tests using criterion-referenced interpretations. This
tendency can be attributed partly to some of the technical problems we described earlier
about using item analysis statistics with mastery tests. Nevertheless, we recommend the
use of both quantitative and qualitative approaches with both types of score interpretations.
When improving tests, we believe the more information the better.
Having spent the time to develop and analyze test items, you might find it useful to
develop a test bank to catalog your items. Special Interest Topic 6.3 provides information
on this process.

Using Item Analysis to Improve


Classroom Instruction
In describing the benefits of item analysis procedures, we indicated that they can provide
information about the quality of the items and also the effectiveness of classroom instruc-
tion. We have spent considerable time describing how item analysis procedures can be used
to evaluate the quality of test items and will now briefly address how they can help improve
classroom instruction. For example, by examining p values teachers can learn which items
are difficult for a group of students and which are easy. This provides valuable information

SPECIAL INTEREST TOPIC 6.3


Developing Item Banks

Many teachers at all grade levels find it helpful to develop a test bank to catalog and archive their
test items. This allows them to easily write new tests using test items that they have used previously
and have some basic measurement information on. Several sources have provided guidelines for
developing item banks (e.g., Linn & Gronlund, 2000; Ward & Murray-Ward, 1994). Consider the
following example.

Course: Tests and Measurement Chapter: 2—Basic Math of Measurement

Learning Objective: Describe the measures of variability and their appropriate use.
If the standard deviation of a set of test scores is equal to 9, the variance is equal to:
ees)
b. 18
c. 30
d. 81*
Administration Date: February 7, 2002

Options
p = 0.58
D = 0.43                        A     B     C     D*

Number in top group             4     2     0    24

Number in bottom group          9     7     3    11

*Correct answer

Administration Date: September 3, 2003

Options
p = 0.68
D = 0.37                        A     B     C     D*

Number in top group             1     2     1    26

Number in bottom group          8     5     2    15

*Correct answer

This indicates that this item has been administered on two different occasions. By including
information from multiple administrations, you will have a better idea of how the item is
likely to perform on a future test. If you are familiar with computer databases (e.g., Microsoft
Access), you can set up a database that will allow you to access items with specific
characteristics quickly and efficiently. Professionally developed item bank programs are also
available. For example, the Assessment Systems Corporation's FastTEST product will help
you create and maintain item banks, as well as construct tests (see www.assess.com).
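If a commercial item-banking program is not available, even a simple data structure can serve as a starter item bank. The sketch below is our own minimal illustration (the field names are ours and the structure mirrors the card above); it is not a description of FastTEST or any other product.

# Our own minimal sketch of a home-grown item bank entry.
item_bank = [
    {
        "course": "Tests and Measurement",
        "chapter": "2 - Basic Math of Measurement",
        "objective": "Describe the measures of variability and their appropriate use.",
        "stem": "If the standard deviation of a set of test scores is equal to 9, "
                "the variance is equal to:",
        "key": "d",
        "administrations": [
            {"date": "2002-02-07", "p": 0.58, "D": 0.43},
            {"date": "2003-09-03", "p": 0.68, "D": 0.37},
        ],
    },
]

# Pull every banked item whose discrimination has been acceptable on all administrations.
good_items = [item for item in item_bank
              if all(admin["D"] >= 0.30 for admin in item["administrations"])]
print(len(good_items))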


about which learning objectives have been achieved and which need further elaboration and
review. Sometimes as teachers we believe that our students have grasped a concept only
to discover on a test that items measuring understanding of that concept were missed by a
large number of them. When this happens, it is important to go back and carefully review
the material, possibly trying a different instructional approach to convey the information.
At another level of analysis, information about which distracters are being selected by
students can help teachers pinpoint common misconceptions and thereby correct them. In
these ways, item analysis can result not only in better tests but also in better teaching.

Summary
In this chapter we described several procedures that can be used to assess the quality of the
individual items making up a test.

Item difficulty level. The item difficulty level or index is defined as the percentage or
proportion of examinees correctly answering the item. The item difficulty index (i.e.,
p) ranges from 0.0 to 1.0 with easier items having larger decimal values and difficult
items having smaller values. For maximizing variability among examinees, the op-
timal item difficulty level is 0.50, indicating that half of the examinees answered the
item correctly and half answered it incorrectly. Although 0.50 is optimal for maximiz-
ing variability, in many situations other values are preferred.
Item discrimination. Item discrimination refers to the extent to which an item accu-
rately discriminates between examinees who vary on the test’s construct. For exam-
ple, on an achievement test the question is whether the item can distinguish between
examinees who are high achievers and those who are poor achievers. Although a
number of different approaches have been developed for assessing item discrimina-
tion, we focused our discussion on the popular item discrimination index (i.e., D). We
provided guidelines for evaluating item discrimination indexes, and as a general rule
items with D values over 0.30 are acceptable, and items with D values below 0.30
should be reviewed. However, this is only a general rule, and we discussed a number
of situations in which smaller D values might be acceptable.
Distracter analysis. The final quantitative item analysis procedure we described was
distracter analysis. In essence distracter analysis allows the test developer to evaluate
the distracters on multiple-choice items (i.e., incorrect alternatives) and determine
whether they are functioning properly. This involves two primary questions. First:
Did the distracter distract some examinees? If a distracter is so obviously wrong that
no examinees selected it, it is useless and deserves attention. The second question
involves discrimination. Did the distracter attract more examinees in the bottom group
than in the top group?

After introducing these different item analysis statistics, we described some practical
strategies teachers can use to examine the measurement characteristics of the items on their

classroom assessments. We also introduced a series of steps that teachers can engage in to
use the information provided by item analysis procedures to improve the quality of the items
they use in their assessments.
In addition to quantitative item analysis procedures, test developers can also use
qualitative approaches to improve their tests. Popham (2000) suggested that the test devel-
oper carefully proof the test after setting it aside for a few days. This break often allows
the test author to gain some distance from the test and provide a more thorough review of
it. He also recommends getting a trusted colleague to review the test. Finally, he recom-
mends that the test developer solicit feedback from the examinees regarding the clarity of
the directions and the identification of ambiguous items. Test developers are probably best
served by using a combination of quantitative and qualitative item analysis procedures.
In addition to helping improve tests, in the classroom the information obtained with item
analysis procedures can help the teacher identify common misconceptions and material
that needs further instruction.

KEY TERMS AND CONCEPTS

Distracter analysis, p. 157
Item analysis, p. 148
Item analysis of speed tests, p. 156
Item difficulty, p. 148
Item discrimination, p. 150
Item discrimination on mastery tests, p. 155
Item-total correlation, p. 153
Optimal item difficulty level, p. 149
Point-biserial correlation, p. 154
Qualitative item analysis, p. 164

RECOMMENDED READINGS

Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall. Chapter 7, Item Analysis, presents a readable but comprehensive discussion of item analysis that is slightly more technical than that provided in this text.

Johnson, A. P. (1951). Notes on a suggested index of item validity: The U-L index. Journal of Educational Measurement, 42, 499-504. This is a seminal article in the history of item analysis.

Kelley, T. L. (1939). The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology, 30, 17-24. A real classic!

Nitko, A. J., & Hsu, T. C. (1984). A comprehensive microcomputer system for classroom testing. Journal of Educational Measurement, 21, 377-390. Describes a set of computer programs that archives student data, performs common item analyses, and banks the test questions to facilitate test development.

Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to listen to an audio lecture about this chapter.
CHAPTER 7

The Initial Steps in Developing a Classroom Test
Deciding What to Test and How to Test It

The Standards for Educational and Psychological Testing (1999) indicate


that the initial steps in developing a test are to specify the purpose and scope
of the test and develop test specifications. In the development of classroom
achievement tests, this process begins with the specification of educational
objectives and development of a table of specifications.

CHAPTER HIGHLIGHTS

Characteristics of Educational Objectives
Taxonomy of Educational Objectives
Behavioral versus Nonbehavioral Educational Objectives
Writing Educational Objectives
Developing a Table of Specifications (or Test Blueprint)
Implementing the Table of Specifications and Developing an Assessment
Preparing Your Students and Administering the Assessment

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Describe the importance of educational objectives in terms of both instruction and
   assessment.
2. Describe three prominent characteristics of educational objectives.
3. Describe and give examples of how educational objectives can differ in terms of scope.
4. Describe and give examples of the three domains covered by educational objectives.
5. Describe Bloom's taxonomy of cognitive objectives. Explain and give examples of each
   category.
6. Describe and give examples of behavioral and nonbehavioral educational objectives.


7. Illustrate a thorough understanding of the principles for writing effective educational
   objectives by writing objectives for a specified content area.
8. Explain the importance of developing a table of specifications before beginning to write an
   assessment.
9. Illustrate a thorough understanding of the principles for developing a table of specifications
   by developing one for a specified content area.
10. Describe norm-referenced and criterion-referenced score interpretations and their
application in classroom assessments.
11. Compare and contrast the strengths and weaknesses of selected-response and constructed-
response items.
12. Discuss major considerations involved with assembling an assessment.
13. Discuss major considerations involved with preparing your students and administering an
assessment.
14. Be able to apply strategies for reducing test anxiety.
15. Be able to apply strategies for reducing the likelihood of cheating.

As noted in Chapter 1, classroom testing has important implications and its effects are felt
immediately by students and teachers alike. It has been estimated that assessment activities
consume as much as 30% of the available instructional time (Stiggins & Conklin, 1992).
Because testing activities are such important parts of the educational process, all teachers
who develop and use tests should work diligently to ensure their assessment procedures are
adequate and efficient. In this chapter we will start discussing the development of class-
room achievement tests. The initial steps in developing a classroom achievement test are to
specify the educational objectives, develop a table of specifications, and select the type of
items you will include in your assessment. These activities provide the foundation for all
classroom tests and many professionally designed tests of educational achievement.
The identification and statement of educational objectives is an important first step
in developing tests. Educational objectives are simply educational goals, that is, what you
hope the students will learn or accomplish. Educational objectives
are also referred to as instructional or learning objectives. The teaching of any lesson, unit,
or course has one or more educational objectives. These objectives are sometimes clearly
stated and sometimes (all too often) implicit. Even when the objectives are implicit they
can usually be inferred by carefully examining the materials used in
instruction, the topics covered, and the instructional processes employed. A good classroom
test can be written from clearly stated objectives much more easily than can one from vague
or poorly developed objectives. Clearly stated objectives help you make sure that the test
measures what has been taught in class and greatly facilitate the test development process.
Establishing explicit, clearly stated educational objectives also has the added benefit of
enhancing the quality of teaching. If you know what your educational goals are, you are
much more likely to reach them. The educational reform movement of the 1990s focused
considerable attention on the development and statement of content standards. It is likely

that your state or school district has developed fairly explicit curriculum guidelines that
dictate to some degree the educational objectives you have for your students.

Characteristics of Educational Objectives

This textbook is not devoted to curriculum development or the construction of educational


objectives to organize curriculum by content or sequence. However, any reasonable school
assessment procedure should be closely tied to the curriculum and its objectives. Classroom
tests should reflect what was taught in class, and tests should emphasize what was emphasized
in class. As a result, any discussion of the development of classroom tests should touch on
educational objectives. It is probably best to begin by describing some of the characteristics
of educational objectives. Probably the three most prominent characteristics of educational
objectives involve their scope, the domain they address, and whether they are stated in
behavioral or nonbehavioral terms. We will start by discussing how objectives can differ in
terms of scope.

Scope
Scope refers to how broad or narrow an objective is. An example of a broad objective is

The student will be able to analyze and discuss the effects of the Civil War on
twentieth-century American politics.

An example of a narrow or specific objective is

The student will be able to list the states that seceded from the Union during the Civil
War.

Clearly different kinds of student responses would be expected for test questions developed
from such different objectives. Objectives with a broad scope are often broken down into
objectives with a more narrow scope. The broad objective above might have been reformu-
lated to the following objectives:

1. The student will be able to analyze and discuss the effects of the Civil War on
twentieth-century American politics.
la. The student will be able to discuss the political effects of post—Civil War occupa-
tion by federal troops on Southern state politics.
1b. The student will be able to trace the rise and fall of African Americans’ political
power during and after Reconstruction.
1c. The student will be able to discuss the long-term effects of the economic depres-
sion in the South after the Civil War.
1d. The student will be able to list three major effects of the Civil War on twentieth-
century U.S. politics.

Although these four specific objectives all might help the student attain the broad objective,
they do not exhaust all of the potential objectives that could support the broad objective. In
fact a whole course might be needed to completely master the broad objective.
If you use only very specific educational objectives, you may end up with a large num-
ber of disjointed items that emphasize rote memory and other low-level cognitive abilities.
On the other hand, if you use only broad educational objectives, you may not have the
specific information needed to help you develop tests with good measurement characteristics.
Although you can find test development experts who promote the use of narrow objectives
and other experts who promote broad objectives, in practice it is probably best to strike a
balance between the two extremes. This can best be accomplished using two approaches.
First, you can write objectives that are
at an intermediate level of specificity. Here the goal is to write objectives that provide the
specificity necessary to guide test development but are not so narrow as to limit assessment
to low-level abilities. The second approach is to use a combination of broad and specific
objectives as demonstrated earlier. That is, write broad objectives that are broken down into
more specific objectives. Either of these approaches can help you develop well-organized
tests with good measurement characteristics.

Taxonomy of Educational Objectives

In addition to the scope of educational objectives, they also differ in the domain or the type
of ability/characteristic being measured. The domains typically addressed by educational
objectives involve cognitive, affective, or psychomotor abilities or characteristics. These
three domains are usually presented as hierarchies involving different levels that reflect
varying degrees of complexity. We will start by discussing the cognitive domain.

Cognitive Domain
The objectives presented in the previous section are referred to as cognitive objectives.

Remember these two objectives?

1. The student will be able to analyze and discuss the effects of the Civil War on twentieth-
century American politics.
2. The student will be able to list the states that seceded from the Union during the Civil
War.

When we first discussed these two objectives, we emphasized how they differed in scope. The
first objective is broad and could be the basis for a whole course of study. The second one is
narrow and specific. In addition to scope they also differ considerably in the complexity of the
cognitive processes involved. The first one requires “analysis and discussion” whereas the sec-
ond requires only "listing." If a student can memorize the states that seceded from the Union,
he or she can be successful on the second objective, but memorization of facts would not be
sufficient for the first objective. Analysis and discussion require more complex cognitive
processes than rote memorization. A taxonomy of cognitive objectives developed by Bloom,
Englehart, Furst, Hill, and Krathwohl (1956) is commonly referred to as Bloom's taxonomy. This
taxonomy provides a useful way of describing the complexity of an objective by classifying it
into one of six hierarchical categories, ranging from the most simple to the most complex.
Table 7.1 provides a summary of Bloom's taxonomy. The categories include the following:

Knowledge. The simplest level of the taxonomy, the knowledge level involves the rote memorization
and recall of facts and information. Educational objectives in the knowledge category include the
following examples:

■ The student will be able to name each state capital.
■ The student will be able to list U.S. presidents in the order they served.

Comprehension. Objectives at the comprehension level require students to grasp the meaning of
material and are reflected in verbs such as interpret, explain, and summarize. Educational
objectives at the comprehension level include the following examples:

■ The student will be able to describe the use of each symbol on a U.S. Geographical Survey map.
■ The student will be able to explain how interest rates affect unemployment.

TABLE 7.1  Bloom's Taxonomy of Educational Objectives

Level           Description                              Example

Knowledge       Rote memory, learning facts              Name each state capital.

Comprehension   Summarize, interpret, or explain         Summarize the use of every symbol
                material                                 on a geographical survey map.

Application     Use general rules and principles         Write directions for traveling by
                to solve new problems                    numbered road from any city on a map
                                                         to any other city.

Analysis        Reduction of concepts into parts and     Describe maps in terms of function
                showing the relationship of parts to     and form.
                the whole

Synthesis       Creation of new ideas or results         Construct a map of a hypothetical
                from existing concepts                   country with given characteristics.

Evaluation      Judgment of value or worth               The student will evaluate the
                                                         usefulness of a map to enable him or
                                                         her to travel from one place to another.

Source: Based on Bloom et al. (1956).



Application. Objectives at the application level involve the use of general rules, principles,
and procedures to solve new problems. Objectives at the application level include the following
examples:

■ The student will be able to write directions for traveling by numbered roads from any city on
a map to any other city.
■ The student will be able to apply multiplication and division of double digits in applied
math problems.

Analysis. Objectives at the analysis level involve breaking concepts down into their component
parts and showing the relationship of the parts to the whole. Educational objectives at this
level include the following examples:

■ The student will describe maps in terms of function and form.
■ The student will distinguish the different approaches to establishing validity and illustrate
their relationship to each other.

Synthesis. Objectives at the synthesis level involve the creation of new ideas or results from
existing concepts. Objectives at the synthesis level include the following examples:

■ The student will construct a map of a hypothetical country with given characteristics.
■ The student will propose a viable plan for establishing the validity of an assessment
instrument following the guidelines presented in the Standards for Educational and
Psychological Testing (1999).

Evaluation. Objectives at the evaluation level involve judgments of value or worth. Objectives
at the evaluation level include the following examples:

■ The student will evaluate the usefulness of a map to enable him or her to travel from one
place to another.
■ The student will judge the quality of validity evidence for a specified assessment instrument.

Although it is somewhat dated, we agree with others (e.g., Hopkins, 1998) who feel
that Bloom’s taxonomy is helpful because it presents a framework that helps remind teach-
ers to include items reflecting more complex educational objectives in their tests. Popham
(1999) suggests that teachers tend to focus almost exclusively on objectives at the knowl-
edge level. He goes as far as to suggest that in practice one can actually simplify the taxonomy
by having just two levels: knowledge and anything higher than knowledge. We will not go quite
that far, but we do agree that instruction and assessment are often limited to rote memorization,
and higher-level educational objectives should be emphasized.
This is not to imply that lower-level objectives are trivial and
should be ignored. For each objective in your curriculum you must decide at what level you
expect students to perform. In a brief introduction to a topic it may be sufficient to expect only

knowledge and comprehension of major concepts. In a more detailed study of a topic, higher,
more complex levels of mastery will typically be required. However, it is often not possible to
master higher-level objectives without first having mastered lower-level objectives. Although
we strongly encourage the development of higher-level objectives, it is not realistic to require
high-level mastery of everything. Education is a pragmatic process of choosing what is most
important to emphasize in a limited amount of instructional time. Our culture helps us make
some of these choices, as do legislative bodies, school boards, administrators, and even oc-
casionally parents and students. In some school districts the cognitive objectives are provided
in great detail; in others they are practically nonexistent. As noted earlier, the current trend is
for federal and state lawmakers to exert more and more control over curriculum content.

Affective Domain
Most people think of cognitive objectives when they think of a student’s educational ex-
periences. However, two other domains of objectives appear in the school curriculum: af-
fective and psychomotor objectives. The affective domain involves characteristics such as
values, attitudes, interests, and behavioral actions. As a result, affective objectives are
typically stated in terms of behaviors that reflect these underlying characteristics, as in the
following example:

The student will demonstrate interest in earth science by conducting a science fair
project in some area of earth science.

As a general rule, affective objectives are emphasized more in elementary school cur-
ricula than secondary curricula. A taxonomy of affective objectives developed by Krath-
wohl, Bloom, and Masia (1964) is presented in Table 7.2. This taxonomy involves levels of

TABLE 7.2  Krathwohl's Taxonomy of Affective Objectives

Level                     Description                              Sublevels

Receiving (attending)     Being aware of and willing to attend     Awareness, willingness to
                          to something (e.g., instruction)         attend, and selective attention

Responding                Actively participating in an activity    Acquiescence, willingness,
                          or process                               and satisfaction

Valuing                   Assigning value or worth to an           Acceptance, preference, and
                          activity or idea                         commitment

Organization              Ideas and values become internalized     Conceptualization and
                          and organized into one's personal        hierarchy
                          system of values and beliefs

Characterization by a     Individual values are exemplified        Generalized set and examples
value or value complex    in a characteristic set of behaviors     of characterization
                          and actions

Source: Based on Krathwohl et al. (1964).



increasing sophistication, with each level building on preceding levels. It depicts a process
whereby new ideas, values, and beliefs are gradually accepted and internalized as one’s own.
Krathwohl’s taxonomy of affective objectives has never approached the popularity of
Bloom’s taxonomy of cognitive objectives, probably because the affective domain has been
more difficult to define and is also a more controversial area of education. In schools, af-
fective objectives are almost always adjuncts to cognitive objectives. For example, we want
our students to learn about science and as a result to appreciate or enjoy it. Classroom tests
predominantly focus on cognitive objectives, but affective objectives are found in school
curricula, either explicitly or implicitly. Because affective objectives appear in the school
curriculum, their specification enhances the chance of them being achieved.

Psychomotor Domain
The third class of objectives deals with physical activity and is referred to as psychomotor
objectives. Psychomotor objectives are most prominent in physical education, in classes with
laboratory or hands-on components (e.g., biology or computer science), or in career-technical
classes such as woodworking, electronics, automotive, or metalwork.
there are countless psychomotor activities such as rolling a bowling ball a certain way or
hitting tennis balls with a certain motion. Biology classes also have many psychomotor ac-
tivities, including focusing a microscope, staining cells, and dissection. Computer science
courses require skill in using a computer keyboard and assembling computer hardware.
Taxonomies of psychomotor objectives have been developed, and Harrow’s (1972) model
is illustrated in Table 7.3. Psychomotor objectives are typically tied to cognitive objectives

TABLE 7.3  Harrow's Taxonomy of Psychomotor Objectives

Level                  Description                                   Sublevels

Reflex movements       Involuntary actions                           Segmental, intersegmental,
                                                                     and suprasegmental reflexes

Basic fundamental      Inherent movement patterns that are a         Locomotor, nonlocomotor,
movements              combination of reflex movements and serve     and manipulative movements
                       as the basis for more complex movements

Perceptual abilities   Involves interpretation of sensory input      Kinesthetic, visual, auditory,
                       that in turn guides movement                  and tactile discrimination,
                                                                     coordinated abilities

Physical abilities     Functional physical characteristics that      Endurance, strength,
                       serve as the basis for skilled movements      flexibility, and agility

Skilled movements      Complex movements that are the result of      Simple, compound, and
                       learning and based on inherent movement       complex adaptive skills
                       patterns (see level 2)

Nondiscursive          Nonverbal communication ranging from          Expressive and interpretive
communication          facial expressions to expressive dance        movements

Source: Based on Harrow (1972).



because almost every physical activity involves cognitive processes. As a result, like af-
fective objectives, psychomotor objectives typically are adjuncts to cognitive objectives.
Nevertheless, they do appear in the school curriculum and their specification may enhance
instruction and assessment.

Behavioral versus Nonbehavioral Educational Objectives

Educational objectives are often classified as either behavioral or nonbehavioral. To illustrate
this distinction, consider the following examples:

Behavioral: The student will be able to list the reasons cited in the curriculum guide
for the United States’ entry into World War I with 80% accuracy.
Nonbehavioral: The student will be able to analyze the reasons for the United States’
entry into World War I.

These two objectives differ in the activity to be performed. The behavioral objective requires
that the student list the reasons; the nonbehavioral objective requires that the student analyze
the reasons. Behavioral objectives specify activities that can be directly observed and measured
by the teacher. Nonbehavioral objectives involve activities that must be inferred, reflected in
verbs such as analyze, examine, judge, know, and understand. Although it is possible to write
either behavioral or nonbehavioral objectives at all levels of the cognitive taxonomy,
teachers often find it easier to write behavioral objectives for the lower levels (e.g., knowl-
edge, comprehension, and application) and to write nonbehavioral objectives for the higher
levels (e.g., analysis, synthesis, and evaluation).
It is also common for behavioral objectives to specify an outcome criterion. For ex-
ample, in the previous example, the criterion is listing “the reasons cited in the curriculum
guide... with 80% accuracy.” As illustrated, behavioral objectives often state outcome
criteria as a percentage correct that represents mastery, as in the following example:

The student will be able to diagram correctly 80% of sentences presented from a
standard list.

Although behavioral objectives frequently specify an outcome criterion, it is often difficult to
determine what represents mastery. Does 80% accuracy reflect mastery, or
should you require 90% or even 100%? Occasionally the nature of the material and the
results of not reaching 100% mastery will dictate a criterion. For example, most of the
students who demonstrate 80% mastery on a test measuring knowledge of the safe use of
power tools in an industrial arts class may be accident free, yet a 5% accident rate may be

completely intolerable. In this situation the criterion for mastery may need to be raised to
100% to achieve an acceptable accident rate (e.g., <1%). When training pilots to fly fighter
jets, the Air Force may likewise require 100% mastery of all ground flight objectives be-
cause a single mistake may result in death and the loss of expensive equipment.
The use of behavioral objectives received widespread acceptance in the 1960s and
1970s because they helped teachers clearly specify their objectives. However, a disadvan-
tage of behavioral objectives is that if carried to an extreme they can be too specific and too
numerous, and as a result no longer facilitate instruction or assessment. The ideal situation
is to have objectives that are broad enough to help you organize your instruction and assess-
ment procedures, but that also specify clearly measurable activities.

Writing Educational Objectives

So far we have defined educational objectives and described some of their major character-
istics. Consistent with our goal of limiting our discussion to the information teachers really
need to know to develop, use, and interpret tests effectively, we have kept our discussion
relatively brief. Because the specification of educational objectives plays an important role
in the development of classroom tests, we will provide a few general suggestions for writing
useful objectives. These include the following:

1. Write objectives that cover a broad spectrum of abilities. As we have suggested repeatedly,
it is desirable to specify objectives that cover a broad range of abilities. Go beyond rote
memorization and specify higher-level cognitive abilities.
2. When feasible, identify behaviors that are observable and directly measurable. Ide-
ally, objectives should specify behaviors that are observable and measurable. One way to
determine whether this criterion is achieved is to ask whether different people independently
observing the student would agree regarding the achievement of the objective. As we noted,
this is usually best accomplished by using action verbs such as arrange, build, create, define,
develop, identify, list, and recite. Objectives requiring the student to analyze, examine, judge,
know, and understand are not as easily observed and measured and often must be inferred.
Nevertheless, it may be necessary to specify behaviors that are not directly observable in order
to assess some of the more complex cognitive abilities (e.g., analysis, synthesis, and evalu-
ation). As a general rule, when possible specify observable and measurable behaviors, but
when necessary use nonbehavioral objectives to describe higher-level abilities or behaviors.

3. State special conditions. If the target activity is to be demonstrated under specific
conditions, these conditions should be clearly stated. Consider this example:
Given a map of the United States, the student will be able to correctly identify each
state and name its capital.
In this example, a map of the United States is identified as material necessary for achiev-
ing the objective. Kubiszyn and Borich (2000) list specific times, settings, equipment, and
resources as conditions that should be included in an objective when they are relevant.

TABLE 7.4  Learning Objectives for Chapter 2, The Basic Math of Measurement

After reading and studying Chapter 2, the student should be able to

1. Define measurement.
2. Describe the different scales of measurement and give examples.
3. Describe the measures of central tendency and their appropriate use.
4. Describe the measures of variability and their appropriate use.
5. Explain the meaning of correlation coefficients and how they are used.
6. Explain how scatterplots are used to describe the relationships between two variables.
7. Describe how linear regression is used to predict performance.
8. Describe major types of correlation coefficients.
9. Distinguish between correlation and causation.

4. When appropriate, specify an outcome criterion. As we discussed earlier, it is sometimes
beneficial to specify an outcome criterion, that is, the level of performance
viewed as indicating that the student has achieved the objective. This is most applicable
when using behavioral objectives.

The development of educational objectives is not an area in which there is complete agreement
among testing experts. Although every test development expert we know
of believes educational objectives need to be specified, different writers support slightly
different approaches. Some recommend writing content-free objectives whereas others
recommend content-centered objectives. Some writers recommend the use of behavioral
objectives whereas others see this approach as too restrictive. Our guidelines are simply
suggestions that you can use in a flexible manner to meet your specific needs. We believe it
is very important for teachers to develop and specify educational objectives, but the exact
format they adopt is less important. In Table 7.4 we restate the learning objectives provided
in Chapter 2 of this text. We do this because we will be using this chapter and its associated
objectives in the next section to demonstrate the development of a table of specifications
for a test.

Developing a Table of Specifications (or Test Blueprint)

You might be asking why we have spent so much time discussing educational objectives.
The reason is that the development of a classroom test should be closely tied to the class
curriculum and educational objectives. As we noted earlier, classroom tests should mea-
sure what was taught. Classroom tests should emphasize what was
emphasized in class. The method of ensuring congruence between classroom instruction and test
content is the development and application of a table of specifications, also referred to as a
test blueprint. An example is given in Table 7.5 for Chapter 2 of this text.

TABLE 7.5  Table of Specifications for Test on Chapter 2 Based on Content Areas (Number of Items)

                                              Level of Objective

Content Area                Knowledge  Comprehension  Application  Analysis  Synthesis  Evaluation  Total

Scales of measurement           2            2                         2                              6
Measures of central
  tendency                      3            3                                                        6
Measures of variability         3            3              3                                         9
Correlation and regression      2            3              2          2                              9

The column on the left, labeled Content Area, lists the major content areas to be covered in
the test. These content areas are derived by carefully reviewing the educational objectives
and selecting major content areas to be included in the test. Across the top of the table we
list the levels of Bloom’s cognitive taxonomy. The inclusion of this section encourages us
to consider the complexity of the cognitive processes we want to measure. As noted earlier,
there is a tendency for teachers to rely heavily on lower-level processes (e.g., rote memory)
and to underemphasize higher-level cognitive processes. By incorporating these categories
in our table of specifications, we are reminded to incorporate a wider range of cognitive
processes into our tests.
The numbers in the body of the table reflect the number of items to be devoted to
assessing each content area at each cognitive taxonomic level. Table 7.5 depicts specifica-
tions for a 30-item test. If you examine the first content area in Table 7.5 (i.e., scales of
measurement) you see two knowledge-level items, two comprehension-level items, and
two analysis-level items devoted to assessing this content area. The next content area (i.e.,
measures of central tendency) will be assessed by three knowledge-level items and three
comprehension-level items. The number of items dedicated to assessing each objective
should reflect the importance of the objective in the curriculum and how much instruc-
tional time was devoted to it. In our table of specifications we determined the number
of items dedicated to each content area/objective by examining how much material was
devoted to each topic in the text and how much time we typically spend on each topic in
class lectures.
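
A table of specifications is also easy to represent as a simple data structure, which makes it
straightforward to check that the planned items add up to the intended test length. The following
Python sketch is purely illustrative: the cell counts mirror Table 7.5 as reconstructed above,
and the variable and function names are our own.

```python
# Illustrative only: a table of specifications stored as a nested dictionary.
# Each content area maps to the number of items planned at each cognitive
# level (the counts follow Table 7.5).

blueprint = {
    "Scales of measurement":        {"Knowledge": 2, "Comprehension": 2, "Analysis": 2},
    "Measures of central tendency": {"Knowledge": 3, "Comprehension": 3},
    "Measures of variability":      {"Knowledge": 3, "Comprehension": 3, "Application": 3},
    "Correlation and regression":   {"Knowledge": 2, "Comprehension": 3,
                                     "Application": 2, "Analysis": 2},
}

planned_length = 30
total_items = sum(sum(levels.values()) for levels in blueprint.values())
assert total_items == planned_length, "Blueprint does not match the planned test length"

for area, levels in blueprint.items():
    print(f"{area}: {sum(levels.values())} items")
```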
Some testing experts recommend using percentages instead of the number of items
when developing a table of specifications. This approach is illustrated in Table 7.6. For
example, you might determine that approximately 20% of your instruction involved the
different scales of measurement. You would like to reflect this weighting in your test so you
devote 20% of the test to this content area. If you are developing a 30-item test this means
you will write six items to assess objectives related to scales of measurement (0.20 x 30
= 6). If you are developing a 40-item test, this means you will write eight items to assess

TABLE 7.6  Table of Specifications for Test on Chapter 2 Based on Content Areas (Percentages)

                                              Level of Objective

Content Areas               Knowledge  Comprehension  Application  Analysis  Synthesis  Evaluation  Total

Scales of measurement         6.7%          6.7%                     6.7%                             20%
Measures of central
  tendency                    10%           10%                                                       20%
Measures of variability       10%           10%            10%                                        30%
Correlation and regression    6.7%          10%            6.7%      6.7%                             30%

objectives related to scales of measurement (0.20 x 40 = 8). An advantage of using percentages
rather than number of items is that you do not have to determine beforehand how many
items you will have on a test. Nevertheless, the decision to use percentages or numbers of
items is probably best left to the individual teacher because either approach can result in
useful specification tables.
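
Either way, the arithmetic is simple. The short Python sketch below (an illustration, not a
required procedure) converts percentage weights like those in Table 7.6 into item counts for a
test of any length; the rounding rule shown is one reasonable choice among several.

```python
# Convert a percentage-based table of specifications into item counts.
# The content-area weights follow Table 7.6; rounding to the nearest whole
# item is one reasonable convention.

weights = {
    "Scales of measurement": 0.20,
    "Measures of central tendency": 0.20,
    "Measures of variability": 0.30,
    "Correlation and regression": 0.30,
}

def items_per_area(weights, total_items):
    """Return the number of items to write for each content area."""
    return {area: round(w * total_items) for area, w in weights.items()}

for n in (30, 40):
    # For example, 20% of a 30-item test is 6 items; 20% of a 40-item test is 8.
    print(f"{n}-item test:", items_per_area(weights, n))
```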
Each test a teacher constructs should be based on a table of specifications. It may
be relatively informal when constructing a brief quiz, or as fully developed as in Table 7.5
when constructing a major examination. A table of specifications helps teachers review the
curriculum content and minimizes the chance of overlooking important concepts or includ-
ing irrelevant concepts. A table of specifications also encourages the teacher to use items of
varying complexity. For students the table can serve as a basis for study and review. There
will be few student or parental complaints of “unfair” testing if students are aware of the
elements of the table of specifications prior to the test. Although we have concentrated on
achievement tests and cognitive objectives in this section, tables of specifications can be de-
veloped for affective and psychomotor tasks in a similar manner, substituting the taxonomy
being used in those domains for the cognitive taxonomy.

Implementing the Table of Specifications and Developing an Assessment

So far in this chapter we have focused on what we want students to learn and what content
we want our tests to cover. This has involved a discussion of educational objectives and the
development of a table of specifications or test blueprint. Before we actually start writing
the test, however, we still have several more important decisions to make. One important
decision involves how to interpret test scores (norm-referenced and criterion-referenced
approaches to score interpretation were introduced in Chapter 3). Another important deci-
sion involves selecting the types of items to include in the assessment. We will now briefly

review information about norm-referenced and criterion-referenced assessment and frame it in
terms of the development of classroom tests.

Norm-Referenced versus Criterion-Referenced Score Interpretations

Remember our earlier discussion of norm-referenced and criterion-referenced score
interpretations. In review, with norm-referenced assessment a student's performance is
interpreted in relation to the performance of other students. Norm-referenced interpretation is
relative because it involves the performance of the student relative to other students.
Percentile ranks and standard scores are common norm-referenced scores that are used in schools
today. In contrast, criterion-referenced interpretation compares a student's performance to an
absolute standard or criterion, not to the performance of other
students. Criterion-referenced interpretation is absolute because it
reflects the degree to which the student has mastered the content or domain the test repre-
sents. Criterion-referenced scores include percent correct and mastery/nonmastery scores.
If the results indicate that the student scored at the 80th percentile, meaning he or she scored
better than 80% of the students in the norm group, the interpretation is norm-referenced. If
the results indicate that the student correctly answered 80% of the test items, the interpreta-
tion is criterion-referenced.
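
To make the distinction concrete, the following Python sketch (an illustration with made-up
scores) interprets the same raw score both ways: as a percentile rank relative to a hypothetical
norm group, and as a percent-correct score on a 30-item test.

```python
# Illustrative only: one raw score, two interpretations.

def percentile_rank(score, norm_group):
    """Norm-referenced: percentage of the norm group scoring below this score."""
    below = sum(1 for s in norm_group if s < score)
    return 100 * below / len(norm_group)

def percent_correct(raw_score, total_items):
    """Criterion-referenced: percentage of the test's items answered correctly."""
    return 100 * raw_score / total_items

norm_group = [12, 14, 15, 17, 18, 20, 21, 23, 26, 28]   # hypothetical raw scores
raw = 24
print(f"Percentile rank: {percentile_rank(raw, norm_group):.0f}")   # relative standing (80)
print(f"Percent correct: {percent_correct(raw, 30):.0f}%")          # absolute mastery (80%)
```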

Developing Classroom Tests in a Statewide Testing Environment

With the current national focus on testing and content mastery, it is important for teachers
to be aware of and monitor their instructional content and testing in relation to their state’s
declared content and performance standards. All U.S. states have developed these as guides
for schools, and for most states they become the basis for statewide assessments at various
grade levels. While we do not advocate narrow adherence to such lists, both elected officials
and the public expect children to master the content based on standards developed through
the legislative processes of the state.
In many school districts school-level or even district-level committees focus on the
content at a grade level, often developing schedules for the sequence of instructional con-
tent for the school year. This has much to recommend it, particularly when mobility among
students is high. Students who move one or more times across schools during the school
year are at much higher risk for falling behind, and districts that manage their schedules well
across schools provide better support for such students. In many industrialized countries
students can move anywhere in their country and expect to be on the same page of their text
at any school.
When a teacher’s instruction is reasonably related to the state’s content expectations,
the tests the teacher develops can provide an excellent basis for students’ success on state-
wide tests. Previous state tests are typically available for teachers to view, and test-wiseness,
giving students experience with the testing process similar to the statewide test, has been
shown to improve student performance. Thus, teachers can develop test questions similar
to those on the statewide tests in format. Of course, there is no reason to limit teacher tests
to these formats.

Selecting Which Types of Items to Use


Another important decision involves the types of items or tasks to include in your test.
Different authors use different classification systems or schemes when categorizing test
items. Historically a popular approach has been to classify test items as either “objective”
or “subjective.” This distinction usually referred to how the items were scored (i.e., in ei-
ther an objective or a subjective manner). For example, there should be no disagreement
between different individuals grading multiple-choice items. The items should be easily
scored “correct” or “incorrect” according to the scoring criteria. The same goes for true—
false and matching items. They can all be scored in an objective manner and are classified as
objective: Everyone agrees on which answers are keyed as correct and incorrect. In contrast,
essay items are considered subjective because grading them involves subjective judgment
on the part of the individual grading the test. It is not too surprising that two graders might
assign different grades to the same essay item. Another example could be a student’s re-
sponses on an oral examination. Here there also might be considerable subjectivity in scor-
ing and two individuals might score the responses differently. As a result, essay and other
test items involving more subjective scoring are classified as subjective.
Although the objective—subjective distinction is generally useful, there is some am-
biguity. For example, are short-answer items objective or subjective? Many authors refer to
them as objective items, but as you will see in a later chapter, scor-
ing short-answer items often involves considerable subjectivity. A
more direct approach is to classify items as either selected-response or constructed-response
items. With this approach, if an item requires a student to select a response from available
alternatives it is classified as a selected-response item. Multiple-choice, true-false, and
matching items are all selected-response items. If an item requires students to create or
construct a response, it is classified as a constructed-response item. Constructed-response
items include fill-in-the-blank, short-answer, and essay items. In a broader sense,
constructed-response assessments also include performance assessments and portfolios. The
selected-response/constructed-response
classification system is the one we will use in this textbook. In subsequent chapters we will
delve into greater detail in the development of these different types of items. For now, we
will just provide a brief overview of some of the major characteristics of selected-response
and constructed-response items.
As we indicated, on selected-response items students select the appropriate response
from options that are provided. On a true—false item the student simply selects true or false
to answer the item. On multiple-choice items the student selects the best response from a
list of alternatives. On matching items the student matches premises (typically listed on the
left) with the appropriate responses (typically listed on the right). The key factor is that all

TABLE 7.7 Strengths and Weaknesses of Selected-Response Items

Strengths of Selected-Response Items


1. You can typically include a relatively large number of selected-response items in your test.
This facilitates adequate sampling of the content domain.
2. They can be scored in an efficient, objective, and reliable manner.
3. They are particularly good for measuring lower-level objectives.
4. They can reduce the influence of certain construct-irrelevant factors.

Weaknesses of Selected-Response Items


1. They are relatively difficult to write.
2. They are not able to assess all educational objectives (e.g., writing ability).
3. They are subject to random guessing.

selected-response items provide the answer; the student simply selects the appropriate one.
Although there are considerable differences among these selected-response item formats,
we can make some general statements about their strengths and limitations (see Table 7.7).
Strengths include the following:

■ Students can generally respond to a relatively large number of selected-response items in a
limited amount of time. This means you can include more items in your test. Because tests are
essentially samples of the content domain, and large samples are better than small samples, the
inclusion of a large number of items tends to enhance the measurement characteristics of the test.
■ Selected-response items can be scored in an efficient, objective, and reliable manner. A
computer can often score selected-response items. As a result, scoring takes less time and there
are fewer grading errors. This can produce tests with desirable measurement characteristics.
■ Selected-response items are particularly good for measuring lower-level cognitive objectives
(e.g., knowledge, comprehension, and application).
■ Selected-response items decrease the influence of certain construct-irrelevant factors that
can impact test scores (e.g., the influence of writing ability on a test measuring scientific
knowledge).

Naturally, there are limitations associated with the use of selected-response items,
including the following:

■ Selected-response items are challenging to write. Relative to constructed-response items,
they typically take more effort and time to write. This is not to say that writing
constructed-response items is an easy task, just that the development of effective
selected-response items is usually more difficult and time consuming.
■ Although selected-response items are particularly well suited for assessing lower-level
cognitive objectives, they are not as well suited for assessing higher-level objectives (i.e.,
analysis, synthesis, and evaluation). This is especially true for true-false and matching items
that are often limited to the assessment of lower-level educational objectives. Multiple-choice
items can be written to assess higher-level objectives, but this often takes a little more
effort and creativity.
■ Selected-response items are subject to blind guessing.

Constructed-response items include short-answer items, essays, performance assessments, and
portfolios. Most people are familiar with short-answer items and essays.
Short-answer items require the student to supply a word, phrase, or number in response
to a direct question. Short-answer items may also take the form of an incomplete sentence
that the student completes (i.e., fill in the blank). Essay items pose a question or problem
for the student to respond to in a written format. Essay items can typically be classified as
either restricted-response or extended-response. As the name suggests, restricted-response
essays are highly structured and place restrictions on the nature and scope of the students’
responses. In contrast, extended-response essays are less structured and provide more free-
dom to students in how they respond. Although we have mentioned performance assess-
ments a number of times in this text to this point, you may not be very familiar with them.
Previously, we noted that performance assessments require students to complete a process
or produce a product in a context that closely resembles real-life situations. Portfolios, a
form of performance assessment, involve the systematic collection of student work products
over a specified period of time according to a specific set of guidelines (AERA et al., 1999).
Constructed-response assessments have their own associated strengths and weaknesses (see
Table 7.8). Their strengths include the following:

■ Compared to selected-response items, some constructed-response assessments (e.g., short
answer and essays) may be easier to write or develop. Not easy, but easier!
■ Constructed-response items are well suited for assessing higher-order cognitive abilities and
complex task performance, and some tasks simply require a constructed-response format (e.g.,
composing a letter, demonstrating problem-solving skills). As a result they expand the range of
learning objectives that can be assessed.

TABLE 7.8 Strengths and Weaknesses of Constructed-Response Items

Strengths of Constructed-Response Items


1. Compared to selected-response items, they are often easier to write.
2. They are well suited for assessing higher-order cognitive abilities and complex task
performance.
3. They eliminate random guessing.

Weaknesses of Constructed-Response Items


1. Because they typically take more time than selected-response items for the students to
complete, you cannot include as many items in a test. As a result, you are not as able to sample
the content domain as thoroughly.
2. They are more difficult to score in a reliable manner.
3. They are vulnerable to feigning.
4. They are vulnerable to the influence of construct-irrelevant factors.

■ Constructed-response items eliminate blind guessing.

Their weaknesses include the following:

■ Constructed-response items take more time for students to complete. You cannot include as
many constructed-response items or tasks on a test as you can selected-response items. As a
result, you are not able to sample the content domain as thoroughly.
■ Constructed-response items are difficult to score. In addition to scoring being more
difficult and time consuming compared to selected-response items, scoring is more subjective
and less reliable.
■ Although constructed-response items eliminate blind guessing, they are vulnerable to
"bluffing." That is, students who do not actually know the correct response might feign a
response that superficially resembles a correct response.
■ Constructed-response items are vulnerable to the influence of extraneous or
construct-irrelevant factors that can impact test scores (e.g., the influence of writing ability
on a test measuring scientific knowledge).

As you see, selected-response and constructed-response assessments have specific strengths and
weaknesses that deserve careful consideration when selecting an assessment
format. However, typically the key factor in selecting an assessment or item format involves
identifying the format that most directly measures the behaviors specified by the educational
objectives. That is, you want to select the item format or task that will be the most pure, direct
measure of the objective you are trying to measure. For example, if you want to assess stu-
dents’ ability to demonstrate their writing abilities, an essay is the natural choice. If you want
to assess students’ ability to engage in oral debate, a performance assessment would be the
logical choice. Although the nature of some objectives dictates the use
of constructed-response items (e.g., writing skills), some objectives can be measured equally
well using either selected-response or constructed-response items. If after careful
consideration you determine that both formats are appropriate, we generally recommend the use of
selected-response items because they allow broader sampling of the content domain and more
objective and reliable scoring procedures.
Both of these factors enhance the measurement characteristics of your
test. We will be discussing these assessment formats in the next three chapters, and this dis-
cussion will help you determine which format is most appropriate for your tests. We believe
that ideally educational assessments should contain a variety of assessment procedures (e.g.,
multiple-choice, short-answer, and performance assessments) that are specifically tailored to
measure the educational objectives of interest.

Putting the Assessment Together


We will now provide some suggestions for organizing and assembling your classroom as-
sessment. Many of these suggestions will be addressed in more detail in the context of the

TABLE7.9 Practical Suggestions for Assembling an Assessment

1. Adhere to your table of specifications.


2. Provide clear instructions.
3. State items clearly.
4. Develop items that can be scored in a decisive manner.
5. Avoid inadvertent cues to the correct answers.
6. Arrange items in a manner that facilitates student performance and scoring.
7. Include items that contribute to the reliability and validity of your assessment results.
8. When determining how many items to include in an assessment, consider factors such as the
age of the students, the types of items employed, and the type and purpose of the test.

different item formats, but their introduction here will hopefully help you begin to consider
some of the main issues. Some of these suggestions might seem obvious, but sometimes the
obvious is overlooked! Table 7.9 summarizes these suggestions.

Follow Your Table of Specifications. We hopefully conveyed the importance of explicitly stating
your educational objectives and developing a thorough table of specifications for
your test. Once you have invested the time and energy in that process, we encourage you to
follow through and use it as a guide or blueprint for developing your test. Remember that
your table of specifications is a tool that helps ensure congruence between your classroom
instruction and the content of your test.

Provide Clear Directions. It is common for teachers to take for granted that students
understand how to respond to different item formats. This may not be the case! When creat-
ing a test always include thorough directions that clearly specify how the student should
respond to each item format. Just to be safe, assume that the students have never seen a test
like it before and provide directions in sufficient detail to ensure they know what is expected
of them.

State the Question, Problem, or Task in as Clear and Straightforward a Manner as Possible. You
want students who have mastered the learning objective to get the item
correct and students who have not mastered the objective to get it wrong. If students have
mastered the learning objective, you do not want ambiguous wording, complex syntax, or
an overly difficult vocabulary to cause them to miss the question.

Develop Items and Tasks That Can Be Scored in a Decisive Manner. Ask yourself
whether the items have clear answers that virtually every expert would agree with. In terms
of essays and performance assessments, the question may be if experts would agree about
the quality of performance on the task. The grading process can be challenging even when
your items have clearly “correct” answers. When there is ambiguity regarding what repre-
sents a definitive answer or response, scoring can become much more difficult.

Avoid Inadvertent Cues to the Correct Answers. It is easy for unintended cues to the
correct response to become embedded in a test. These cues have the negative effect of
allowing students who have not mastered the material to correctly answer the item. This
confounds intelligence (i.e., figuring out the correct answer based on detected cues) with
achievement (i.e., having learned the material). To paraphrase Gronlund (1998), only the
students who have mastered an objective should get the item right, and those who have not
mastered it, no matter how intelligent they are, should not get it correct.

Arrange the Items in an Assessment in a Systematic Manner. You should arrange the items in your
assessment in a manner that promotes the optimal performance of your students. If your test
contains multiple item formats, the items should be arranged in
sections according to the type of item. That is, place all the multi-
ple-choice items together, all the short-answer items together, and so on. This allows the
students to maintain the same mental set throughout the section. It has the added benefit
of making it easier for you to score the items. After arranging the items by format, you
should arrange the items in each section (e.g., multiple-choice items) according to their
level of difficulty. That is, start with the easy items and move progressively to the more
difficult items. This arrangement tends to reduce anxiety, enhances motivation, and allows
students to progress quickly through the easier items and devote the remaining time to the
more difficult items.
Some assessment experts suggest that you arrange the items in the order that the
material was presented in your class instruction. This is thought to help students retrieve
the information more easily. If you adopt this approach, however, it is recommended that
you encourage students to skip the most difficult items and return to them as time permits.
A logical variation on this approach is to arrange the items from the easiest to the most
difficult within each specific content area (e.g., Nitko, 2001). For example, on a multiple-
choice section covering reliability, validity, and item analysis, you would arrange all of the
items related to reliability in order of difficulty, followed by the items on validity in order
of difficulty, and finally the items on item analysis in order of difficulty.

Include Test Items and Tasks That Will Result in an Assessment That Produces Reli-
able and Valid Test Results. In the first section of this text we discussed the important
properties of reliability and validity. No matter which format you select for your test, you
should not lose sight of the importance of developing tests that produce reliable and valid
results. To make better educational decisions, you need high-quality information.

How Many Items Should You Include? As is often the case, there is no simple answer
to this question. The optimal number of items to include in an assessment is determined
by factors such as the age of the students, the types of items, the breadth of the material or
topics being assessed (i.e., scope of the test), and the type of test. Let’s consider several of
these factors separately:

Age of Students. For students in elementary school it is probably best to limit regular
classroom exams to approximately 30 minutes in order to maximize effort, concentration,
and motivation. With older students you can increase this period considerably, but it is

probably desirable to limit assessments to approximately one hour in order to maximize
performance and accommodate class schedules. Naturally these are just flexible guidelines.
For example, when administering six-week or semester exams, more time may be necessary
to adequately assess the learning objectives. Additionally, you will likely need help with
the administration of standardized assessments that take significantly more time than the
standard classroom exam.

Types of Items. Obviously, students can complete more true—false items than they can
essay items in a given period of time. Gronlund (2003) estimates that high school students
should be able to complete approximately one multiple-choice item, three true—false items,
or three fill-in-the-blank items in one minute if the items are assessing objectives at the
knowledge level. Naturally, with younger students or more complex objectives, more time
will be needed. When you move to restricted-response essays or performance assessments,
significantly more time will be needed, and when you include extended-response tasks the
time demands increase even more. As we have already alluded to, the inclusion of more
“time-efficient” items will enhance the sampling of the content domain.

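As a rough planning aid, these per-minute rates can be turned into a time estimate for a planned
item mix. The Python sketch below assumes the knowledge-level rates cited above and adds an
arbitrary 10-minute allowance for directions and for distributing and collecting the tests;
adjust both assumptions for your own students and objectives.

```python
# Rough time estimate for a planned item mix, using the knowledge-level rates
# cited from Gronlund (2003): about one multiple-choice item, or three
# true-false or fill-in-the-blank items, per minute. The 10-minute overhead
# for directions and handling the tests is an assumption, not a standard.

MINUTES_PER_ITEM = {
    "multiple_choice": 1.0,
    "true_false": 1 / 3,
    "fill_in_the_blank": 1 / 3,
}

def estimated_minutes(item_counts, overhead_minutes=10):
    """Return estimated administration time for a dict of {item_type: count}."""
    working_time = sum(MINUTES_PER_ITEM[kind] * n for kind, n in item_counts.items())
    return working_time + overhead_minutes

plan = {"multiple_choice": 30, "true_false": 15, "fill_in_the_blank": 9}
print(f"Estimated time: {estimated_minutes(plan):.0f} minutes")   # 38 + 10 = 48 minutes
```
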
Type and Purpose of the Test. Maximum performance tests can typically be categorized
as either speed or power tests. Pure speed tests generally contain items that are relatively
easy but have strict time limits that prevent examinees from successfully completing all the
items. On pure power tests, the speed of performance is not an issue. Everyone is given
enough time to attempt all the items, but the items are ordered according to difficulty, with
some items being so difficult that no examinee is expected to answer them all. The distinc-
tion between speed and power tests is one of degree rather than being absolute. Most often
a test is not a pure speed test or a pure power test, but incorporates some combination of the
two approaches. The decision to use a speed test, a power test, or some combination of the
two will influence the number and type of items you include on your test.

Scope of the Test. In addition to the speed versus power test distinction, the scope of the test
will influence how many items you include in an assessment. For a weekly exam designed to
assess progress in a relatively narrow range of skills and knowledge, a brief test will likely be
sufficient. However, for a six-week or semester assessment covering a broader range of skills
and knowledge, a more comprehensive (i.e., longer) assessment is typically indicated.
When estimating the time needed to complete the test you should also take into con-
sideration test-related activities such as handing out the test, giving directions, and collect-
ing the tests. Most professional test developers design power tests that approximately 95%
of their samples will complete in the allotted time. This is probably a good rule of thumb for
classroom tests. This can be calculated in the classroom by dividing the number of students
completing the entire test by the total number of subjects.
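
For example, the completion rate described above is a single ratio; the following illustrative
calculation (with hypothetical counts) compares it to the 95% rule of thumb.

```python
# Completion rate: proportion of students who finished the entire test.
students_finished = 27    # hypothetical counts
students_total = 30
completion_rate = students_finished / students_total
print(f"Completion rate: {completion_rate:.0%}")   # 90%, a bit below the 95% target
```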

Preparing Your Students and Administering the Assessment

In the final section of this chapter we will provide some suggestions on how you can
best prepare your students for and then administer an assessment. Obviously, it would be

SPECIAL INTEREST TOPIC 7.1
Suggestions for Reducing Test Anxiety

Research suggests that there is a curvilinear relationship between anxiety and performance. That is,
at relatively low levels anxiety may have a motivating effect. It can motivate students to study in a
conscientious manner and put forth their best effort. However, when anxiety exceeds a certain point it
becomes detrimental to performance. It will enhance the validity of your interpretations if you can re-
duce the influence of debilitating test anxiety. Remember, in most classroom situations you are striv-
ing to measure student achievement, not the impact of excessive anxiety. In this situation test anxiety
is a source of construct-irrelevant variance. By reducing test anxiety, you reduce construct-irrelevant
variance and increase the validity of your interpretations. Researchers have provided suggestions for
helping students control test anxiety (e.g., Hembree, 1988; Linn & Gronlund, 2000; Mealey & Host,
1992; Nitko, 2001; Tippets & Benson, 1989). These suggestions include the following:

■ Students with test anxiety may benefit from relaxation training. In many schools students
with debilitating test anxiety may be referred to a school counselor or school psychologist who
can teach them some fairly simple relaxation techniques.
■ Although it is good practice to minimize environmental distractions for all students, this is
even more important for highly anxious students. Highly anxious students tend to be more easily
distracted by auditory and visual stimuli than their less anxious peers.
■ Do not make the test a do-or-die situation. Although it is reasonable to emphasize the
importance of an assessment, it is not beneficial to tell your students that this will be the
most difficult test they have ever taken or that their future is dependent on their performance
on the test.
■ Provide a review of the material to be covered on the test before the testing date. This is a
good instructional strategy that can facilitate the integration of material, students will
appreciate the review, and anxiety will be reduced.
■ Arrange the items on your test from easy to difficult. Have you ever taken a test in which
the first item was extremely difficult or covered some obscure topic you had never heard of? If
so, you probably experienced a sudden drop in confidence, even if you initially felt well
prepared to take the test. To avoid this, many instructors will intentionally start the test
with a particularly easy item. It might not do much from a technical perspective (e.g., item
difficulty or discrimination), but it can have a positive influence on student motivation and
morale.
■ It is beneficial to have multiple assessments over the course of a grading period rather than
basing everything on one or two assessments. When there are only a limited number of
assessments, the stakes may seem so high that student anxiety is increased unnecessarily.
■ Prepare all of your students for the test by teaching appropriate test-taking strategies. A
novel or unfamiliar test format provokes anxiety in many students, and this tendency is
magnified in students prone to test anxiety.
■ When the students are seated and ready to begin the test, avoid unnecessary discussion before
letting them begin. The students are typically a little "on edge" and anxious to get started. If
the teacher starts rambling about irrelevant topics, this tends to increase student anxiety.

regrettable to develop an exemplary assessment and then have its results compromised by poor
preparation or inappropriate administration procedures. Your goal should be to promote
conditions that allow students to perform their best. Before administering an assessment,
you should take appropriate steps to prepare the students. This can include announcing in

advance when the test will be administered, describing what content and skills will be covered,
the basic parameters of the test (e.g., a one-hour test including short-answer and
restricted-response essay items), how it will be scored, and how the results will be used (e.g.,
Linn & Gronlund, 2000). It is also beneficial to give the students
examples of the types of items that will be included on the test and
provide general instruction in basic test-taking skills. You also want to do your best to
minimize excessive test anxiety because it can be a source of construct-irrelevant variance
that undermines the validity of your interpretations. Although stressing the importance of
an upcoming assessment can help motivate students, there is a point at which it is no longer
motivating and becomes counterproductive. Special Interest Topic 7.1 provides some sug-
gestions for helping students manage their anxiety.
The scheduling of an assessment is also a decision that deserves careful consideration.
You should try to schedule the test at a time when the students will not be distracted by other
events. For example, scheduling a test the last day before a big holiday is probably not opti-
mal. In this situation the students are likely to be more focused on the
upcoming holiday than on the test. The same goes with major events at the school. Scheduling
tests the day of the big homecoming game or the senior prom is probably not desirable. Teachers
should make every effort to ensure that the physical environment is conducive to optimal student
performance. You should take steps to ensure that
the room is comfortable (e.g., temperature, proper ventilation), that there is proper lighting,
and that extraneous noise is minimized. Additionally, you should make efforts to avoid any
unexpected interruptions (e.g., ask whether a fire drill is scheduled, place a "Test in Prog-
ress” sign on the door).
Once the students have started the test, be careful about providing help to students.
Students can be fairly crafty when it comes to coaxing information from teachers during a
test. They may come asking for clarification while actually “fishing” for hints or clues to the
answer. As a teacher you do not want to discourage students from clarifying the meaning
of ambiguous items, but you also do not want to inadvertently provide hints to the answer
of clearly stated items. Our suggestion is to carefully consider the student’s question and
determine whether the item is actually ambiguous. If it is, make a brief clarifying comment
to the whole class. If the item is clear and the student is simply fishing for a clue to the an-
swer, simply instruct the student to return to his or her seat and carefully read and consider
the meaning of the item. Finally, take reasonable steps to discourage cheating. Cheating is
another source of construct-irrelevant variance that can undermine the validity of your score
interpretations. Special Interest Topic 7.2 provides some strategies for preventing cheating
on classroom tests.

Summary
In this chapter we addressed the initial steps a teacher should follow in developing class-
room achievement tests. We noted that the first step is to specify the educational objectives

SPECIAL INTEREST TOPIC 7.2


Strategies for Preventing Cheating

Cheating on tests is as old as assessment. In ancient China, examinees were searched before taking
civil service exams, and the actual exams were administered in individual cubicles to prevent cheat-
ing. The punishment for cheating was death (Hopkins, 1998). We do not punish cheaters as severely
today, but cheating continues to be a problem in schools. Like test anxiety, cheating is another source
of construct-irrelevant variance that undermines the validity of test interpretations. If you can reduce
cheating you will enhance the validity of your interpretations. Many authors have provided sugges-
tions for preventing cheating (e.g., Hopkins, 1998; Linn & Gronlund, 2000; Popham, 2000). These
include the following:

■ Keep the assessment materials secure. Tests and other assessments have a way of getting into the hands of students. To avoid this, do not leave the assessments in open view in unlocked offices, make sure that the person copying the tests knows to keep them secure, and number the tests so you will know if one is missing. Verify the number of tests when distributing them to students and when picking them up from students.
■ Possibly the most commonsense recommendation is to provide appropriate supervision of students during examinations. This is not to suggest that you hover over students (this can cause unnecessary anxiety), but simply that you provide an appropriate level of supervision. This can involve either observing from a position that provides an unobstructed view of the entire room or occasionally strolling around the room. Possibly the most important factor is to be attentive and visible; this will probably go a long way toward reducing the likelihood of cheating.
■ Have the students clear their desks before distributing the tests.
■ When distributing the tests, it is advisable to individually hand each student a test. This will help you avoid accidentally distributing more tests than there are students (an accident that can result in a test falling into the wrong hands).
■ If students are allowed to use scratch paper, you should require that they turn this in with the test.
■ When possible, use alternative seating with an empty row of seats between students.
■ Create two forms of the test. This can be accomplished by simply changing the order of test items slightly so that the items are not in exactly the same order. Give students sitting next to each other alternate forms.

or goals you have for your students. It is important to do this because these objectives will
serve as the basis for your test. In writing educational objectives, we noted that there are
several factors to consider, including the following:

■ Scope. Educational objectives can be written on a continuum from very specific to very broad. We noted that there are limitations associated with objectives at either end
of this continuum and suggested strategies to help you minimize these limitations.
■ Domain. Educational objectives also differ in the type of ability or characteristic being
measured. Educational objectives typically involve cognitive, affective, or psycho-
motor abilities. While all three of these domains are important in school settings, the
cognitive domain is of primary importance. We presented Bloom’s taxonomy of cogni-
tive objectives, which presents six hierarchical categories including knowledge, com-
prehension, application, analysis, synthesis, and evaluation.
■ Format. Educational objectives are often classified as behavioral or nonbehavioral.
Although behavioral objectives have advantages, if the behavioral format is taken to
the extreme it also has limitations. We noted that it is optimal to have objectives that
are broad enough to help you organize your instruction and testing procedures, but
that also state measurable activities.

In concluding our discussion of educational objectives, we provided some general suggestions to help you write objectives. These suggestions included the following: (1) write
objectives that cover a broad spectrum of abilities; (2) when feasible, identify behaviors that
are observable and directly measurable; (3) state any special conditions; and (4) when ap-
propriate, specify an outcome criterion.
The next step in developing a classroom test is to develop a table of specifications,
which is essentially a blueprint for the test that helps you organize the educational objectives
and make sure that the test content matches the curriculum content. A table of specifications
also helps you include items of varying degrees of complexity. Before actually proceeding
with writing your test, you have some other important decisions to make. One decision is
whether to use a norm-referenced or criterion-referenced interpretation of performance. We
noted that in most situations criterion-referenced assessment is most useful in the classroom.
Another decision deals with the use of selected-response and constructed-response items.
Although we will be devoting the next three chapters to detailed discussions of these items,
we provided an overview of some of the advantages and limitations of both item formats.
We closed this chapter by providing some practical suggestions for assembling your
test, preparing your students for the assessment, and administering it. When assembling
your test, we recommend (1) adhering to your table of specifications, (2) providing clear
instructions, (3) stating questions clearly, (4) developing items that can be scored in a de-
cisive manner, (5) avoiding cues to the correct answers, (6) arranging items in a systematic
manner, and (7) including items that will contribute to the reliability and validity of your
assessment results. We also discussed some factors to consider when determining how many
items to include in an assessment. These factors included the age of your students, the types
of items you are using, and the type of test you are developing.
In terms of preparing students for and administering an assessment, we described a
number of things teachers can do to enhance student performance and increase the reliability
and validity of assessment results. These include (1) preparing your students for the assess-
ment, (2) scheduling the assessment at an appropriate time, (3) ensuring that the testing condi-
tions are adequate (e.g., comfortable, proper lighting, quiet), (4) avoiding answering questions
that might “give away” the answers, and (5) taking reasonable steps to discourage cheating.

KEY TERMS AND CONCEPTS

Affective objectives, p. 175 Bloom’s taxonomy, p. 173 Criterion-referenced scores, p. 182


Analysis, p. 174 Cognitive objectives, p. 172 Domain, p. 171
Application, p. 174 Comprehension, p. 173 Educational objectives, p. 170
Behavioral objectives, p. 177 Constructed-response items, p. 183 Evaluation, p. 174
Format, p. 171 Portfolios, p. 185 Selected-response items, p. 183


Knowledge, p. 173 Power tests, p. 189 Speed tests, p. 189
Nonbehavioral objectives, p. 177 Psychomotor objectives, p. 176 Synthesis, p. 174
Norm-referenced scores, p. 182 Reducing test anxiety, p. 190 Table of specifications, p. 179
Performance assessments, p. 185 Scope, p. 171 Test blueprint, p. 179

RECOMMENDED READINGS

Gronlund, N. E. (2000). How to write and use instructional objectives (6th ed.). Upper Saddle River, NJ: Merrill/Prentice Hall. This is an excellent example of a text that focuses on the development of educational objectives.
Lorin, W. (2003). Benjamin S. Bloom: His life, his works, and his legacy. In B. Zimmerman (Ed.), Educational psychology: A century of contributions (pp. 367-389). Mahwah, NJ: Erlbaum. This chapter provides a biographical sketch of Dr. Bloom and reviews his influence in educational psychology.

Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™


presentation and to listen to an audio lecture about this chapter.
CHAPTER 8

The Development and Use
of Selected-Response Items
Some educators embrace selected-response items because they can contribute
to the development of psychometrically sound tests. Other educators reject
them because they believe they cannot adequately measure the really
important knowledge and skills they want students to acquire.

CHAPTER HIGHLIGHTS

Multiple-Choice Items
True—False Items
Matching Items

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Describe the major types of selected-response items and their characteristics.
2. Describe the components and types of multiple-choice items and give examples.
3. Describe the principles involved with developing effective multiple-choice items.
4. Develop effective multiple-choice items for a given content area.
5. Discuss the strengths and weaknesses of multiple-choice items.
6. Describe the principles involved with developing effective true—false items.
7. Develop effective true—false items for a given content area.
8. Discuss the strengths and weaknesses of true—false items.
9. Describe the principles involved with developing effective matching items.
10. Develop effective matching items for a given content area.
11. Discuss the strengths and weaknesses of matching items.
12. Be able to apply and interpret a correction for guessing.

In the last chapter we addressed the development of educational objectives and provided
some general suggestions for developing, assembling, and administering your assessments.
In the next three chapters we will discuss the development of specific types of test items. In
this chapter we will focus on the development of selected-response items. As we noted in
the last chapter, if an item requires a student to select a response from available alternatives,
it is classified as a selected-response item. Multiple-choice, true—false, and matching items
are all selected-response items. If an item requires a student to create or construct a re-
sponse, it is classified as a constructed-response item. Essay and short-answer items are
constructed-response items, but this category also includes other complex activities such as
making a class presentation, composing a poem, or painting a picture.

In this chapter we will address selected-response items in detail. We will discuss their strengths and weaknesses and provide suggestions for developing effective items. In the next chapter we will address essay and short-answer items. In Chapter 10 we will address performance assessments and portfolios—types of constructed-response assessments that have gained increased popularity in recent years. In these chapters we will focus on items used to assess student achievement (as opposed to interests, personality characteristics, etc.).

Multiple-Choice Items

Multiple-choice items are by far the most popular of the selected-response items. They
have gained this degree of popularity because they can be used in a variety of content areas
and can assess both simple and complex learning outcomes. Multi-
ple-choice items take the general form of a question or an incomplete statement with a set of possible answers, one of which is correct. The part of the item that is either a question or an incomplete statement
is referred to as the stem. The possible answers are referred to as
alternatives. The correct alternative is simply called the answer and the incorrect alterna-
tives are referred to as distracters (i.e., they serve to “distract” students who do not actually
know the correct response).
Multiple-choice items can be written so the stem is in the form
of a direct question or an incomplete sentence. Most writers prefer the direct-question format because they feel it presents the problem in the clearest manner. The advantage of the incomplete-sentence for-
mat is that it may present the problem in a more concise manner. If the
question is formatted as an incomplete statement, it is suggested that the omission occur near
the end of the stem. Our recommendation is to use the direct-question format unless the prob-
lem can be stated more concisely using the incomplete-sentence format without any loss of
clarity. Examine these examples of the two formats.
Example 1 Direct-Question Format
1. Which river is the largest in the United States of America?
A. Mississippi <
B. Missouri
C. Ohio
D. Rio Grande

Example 2 Incomplete-Sentence Format


2. The largest river in the United States of America is the
A. Mississippi. <
B. Missouri.
C. Ohio.
D. Rio Grande.
Another distinction is made between multiple-choice items that have what is known
as the correct-answer versus the best-answer format. Examples 1 and 2 are correct-answer
items. The Mississippi is the largest river in the United States of America and the other
answers are incorrect. However, multiple-choice items can be written for situations having
more than one correct answer. The objective is to identify the “best answer.”

Example 3 Best-Answer Format


1. Which variable is generally thought to be the most important when buying a house?
A. cost
B. builder
C. design
D. location <

In Example 3 all the variables listed are important to consider when buying a house, but as
almost any realtor will tell you, location is the most important. Most test developers prefer
the best-answer format for two reasons. First, in some situations it is difficult to write an
answer that everyone will agree is correct. The best-answer format allows you to frame it as
an answer that most experts will agree with. Second, the best-answer format often requires
the student to make more subtle distinctions among the alternatives, which results in more
demanding items that measure more complex educational objectives.

Guidelines for Developing Multiple-Choice Items


Use a Printed Format That Makes the Item as Clear as Possible. While there is not a
universally accepted format for multiple-choice items, here are a few recommendations
regarding physical layout that can enhance clarity.

■ Provide brief but clear directions. Directions should include how the selected alternative should be marked.
■ The item stem should be numbered for easy identification, while the alternatives are indented and identified with letters.
■ Either capital or lowercase letters followed by a period or parenthesis can be used for the alternatives. If a scoring sheet is used, make the alternative letters on the scoring sheet and the test as similar as possible.
■ There is no need to capitalize the beginning of alternatives unless they begin with a proper name.
■ When the item stem is a complete sentence, there should not be a period at the end of the alternatives (see Example 4).
■ When the stem is in the form of an incomplete statement with the missing phrase at the end of the sentence, alternatives should end with a period (see Example 5).
■ Keep the alternatives in a vertical list instead of placing them side by side because it is easier for students to scan a vertical list quickly.
■ Use correct grammar and formal language structure in writing items.
■ All items should be written so that the entire question appears on one page.

The following examples use formats that promote clarity, illustrating many of these suggestions.

Example 4
Directions: Read each question carefully and select the best answer. Circle the
letter of the answer you have selected.
1. Which type of validity study involves a substantial time interval between when the
test is administered and when the criterion is measured?
A. delayed study
B. content study
C. factorial study
D. predictive study <
Example 5
2. The type of validity study that involves a substantial time interval between when the
test is administered and when the criterion is measured is a
A. delayed study.
B. content study.
C. factorial study.
D. predictive study. <

Have the Item Stem Contain All the Information Necessary to Understand the Prob-
lem or Question. When writing multiple-choice items, the problem or question should
be fully developed in the item stem. Poorly developed multiple-choice items often contain
an inadequate stem that leaves the test taker unclear about the central problem or question.
Compare the stems in the following two examples.
Example 6 Poor Item—Inadequate Stem
1. Absolute zero point.
A. interval scale
B. nominal scale
C. ordinal scale
D. ratio scale <

Example 7 Better Item—Adequate Stem


1. Which scale of measurement incorporates a true or absolute zero point?
A. interval scale
B. nominal scale
C. ordinal scale
D. ratio scale <

Your students are not mind readers, and item stems that are not fully developed can
result in misinterpretations by students. One way to determine whether the stem is adequate
is to read the stem without examining the alternatives. If the stem is adequate, a knowledge-
able individual should be able to answer the question with relative ease. In Examples 6 and
7, the first item fails this test whereas the second item passes. This test is equally applicable
if the question is framed as a question or as an incomplete statement.
While we encourage you to develop the problem fully in the item stem, it is usually
not beneficial to include irrelevant material in the stem. Consider this example.

Example 8 Poor Item—Unnecessary Content


1. There are several different scales of measurement used in educational settings. Which
scale of measurement incorporates a true or absolute zero point?
A. interval scale
B. nominal scale
C. ordinal scale
D. ratio scale <

In Example 8 the addition of the sentence “There are several different scales of measurement
used in educational settings” does not serve to add clarity. It simply takes more time to read.

Provide between Three and Five Alternatives. Although there is no “correct” number
of alternatives, it is recommended that you use between three and five. Four are most com-
monly used, but some test developers suggest using five to reduce the chance of correctly
guessing the answer. For example, the chance of correctly guessing the answer with three
alternatives is 1 in 3 (i.e., 33%); with four alternatives, 1 in 4 (i.e., 25%); and with five, 1
in 5 (i.e., 20%). The use of five alternatives is probably the upper limit. Many computer
scoring programs accommodate only five alternatives, and it can be difficult to develop
plausible distracters (the addition of distracters that are clearly wrong and not selected by
any students does not reduce the chance of correctly guessing the answer). In some situa-
tions three alternatives may be sufficient. It takes students less time to read and answer
items with three alternatives instead of four (or five), and it is easier to write two good
distracters than three (or four). Certain research even suggests that items with three alterna-
tives can be as effective as items with four or five alternatives (e.g., Costin, 1970; Grier,
1975; Sidick, Barrett, & Doverspike, 1994).
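
Stated as a simple formula (the notation here is ours, not the text's), the chance of answering an item correctly by blind guessing is

Chance of a Correct Blind Guess = 1/k

where k is the number of alternatives, which gives 1/3 (about 33%) with three alternatives, 1/4 (25%) with four, and 1/5 (20%) with five.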

Keep the Alternatives Brief and Arrange Them in an Order That Promotes Efficient
Scanning. As we noted, the item stem should contain as much of the content as possible
and should not contain irrelevant material. A correlate of this is that the alternatives should
be as brief as possible. This brevity makes it easier for the students to scan the alternatives
looking for the correct answer. Consider Examples 9 and 10. While they both measure the
same content, the first one contains an inadequate stem and lengthy alternatives whereas the
second one has an adequate stem and brief alternatives.

Example 9 Poor Item—Inadequate Stem and Lengthy Alternatives


1. Andrew Jackson
A. was born in Virginia.
B. did not fight in the American Revolution due to a childhood illness.
C. was the 7th president of the United States. <
D. served three terms as president of the United States.

Example 10 Better Item—Adequate Stem and Brief Alternatives


2. Who was the 7th president of the United States of America?
A. Andrew Jackson <
B. James Monroe
C. John Adams
D. Martin Van Buren

When applicable, alternatives should be arranged in a logical order to promote efficient scanning. For example, numbers should be placed in ascending order, dates ordered in temporal sequence, and nouns and names alphabetized. See Examples 11 and 12.
Example 11 Poor Item—Illogical Arrangement of Alternatives
1. What year did the Spanish-American War occur?
A. 1912
B. 1890
C. 1908
D. 1898 <
E. 1902

Example 12 Better Item—Logical Arrangement of Alternatives
2. What year did the Spanish-American War occur?
A. 1890
B. 1898 <
C. 1902
D. 1908
E. 1912
Avoid Negatively Stated Stems in Most Situations. As a general rule you should avoid
using negatively stated stems. Limit the use of terms such as except, least, never, and not.
Students might overlook these terms and miss the question even if they have mastered the
learning objective being measured. Unless you intend to measure the student’s ability to
attend to the details of the item, this is not a desired outcome and undermines the validity
of the test’s results. In most situations this can be avoided by rephrasing the stem.
Occasionally it may be necessary or desirable to state stems in the negative. For ex-
ample, in some situations it is important for students to know what not to do (e.g., what
should you not do if you smell gas?) or identify an alternative that differs in some way from
the other alternatives. In these situations you should highlight the negative terms by capital-
izing, underlining, or printing them in bold type. Examine Examples 13 and 14.
Example 13 Poor Item—Negatively Stated Stem
1. Which state does not have a coastline on the Gulf of Mexico?
A. Alabama
B. Florida
C. Tennessee <
D. Texas

Example 14 Better Item—Negative Term Highlighted


2. Which state does NOT have a coastline on the Gulf of Mexico?
A. Alabama
B. Florida
C. Tennessee <
D. Texas

Double negatives should always be avoided. Logicians know that a double negative
indicates a positive, but students should not have to decipher this logic problem.

Make Sure Only One Alternative Is Correct or Represents the Best Answer. Care-
fully review your alternatives to ensure there is only one correct or best answer. Commonly
teachers are confronted by students who feel they can defend one of the distracters as a correct
answer. It is not possible to avoid this situation completely, but you can minimize it by care-
fully evaluating the distracters. We recommend setting the test aside for a period of time and
returning to it later for proofing. Fatigue and tight deadlines can allow undetected errors.
Occasionally it might be appropriate to include more than one correct alternative in a
multiple-choice item and require the students to identify all of the correct alternatives. It is
usually best to format these questions as a series of true—false items, an arrangement re-
ferred to as a cluster-type or multiple true—false item. See Examples 15 and 16.

Example 15 Poor Item—Multiple-Choice Item with Multiple Correct Alternatives
1. Which states have a coastline on the Gulf of Mexico?
A. Alabama <
B. Florida <
C. Tennessee
D. Texas <

Example 16 Better Item—Multiple True-False Item


2. Which of the following states have a coastline on the Gulf of Mexico? Underscore the
T if the state has a Gulf coastline or the F if the state does not have a Gulf coastline.
Alabama T F
Florida T F
Tennessee T F
Texas T F

Avoid Cues That Inadvertently Identify the Correct Answer. Item stems should not
contain information that gives away the answer. A cue is something in the stem that provides
a clue to the answer that is not based on knowledge. It often involves an association between
the words in the stem and the correct alternative. See Examples 17 and 18.

Example 17 Poor Item—Stem Contains a Cue to the Correct Answer


1. Which type of validity study examines the ability of test scores to predict a criterion?
A. interval study
B. content study
C. factorial study
D. predictive study <

Example 18 Better Item—Cues Avoided


2. Which type of validity study involves a substantial time interval between when the
test is administered and when the criterion is measured?
A. interval study
B. content study
C. factorial study
D. predictive study <

In Example 17, the use of predict in the stem and predictive in the correct alternative pro-
vides a cue to the correct answer. This is corrected in Example 18. Additionally, in the
second example there is an intentional verbal association between the stem and the first
distracter (i.e., interval). This association makes the first distracter more attractive, particu-
larly to students relying on cues, who do not know the correct answer.
In addition to the stem containing cues to the correct answer, the alternatives can
themselves contain cues. One way to avoid this is to ensure that all alternatives are ap-
proximately equal in length and complexity. In an attempt to be precise, teachers may make
the correct answer longer or more complex than the distracters. This can serve as another
type of cue for students. Although in some cases it might be possible to both maintain
precision and shorten the correct alternative, it is usually easier to lengthen the distracters
(though this does make scanning the alternatives more difficult for students). Compare
Examples 19 and 20.

Example 19 Poor Item—Unequal Length and Complexity of Alternatives


1. Ecology is the study of
A. genetics.
B. organisms and their relationship to the environment. <
C. internal balances.
D. evolution.

Example 20 Better Item—Alternatives Similar in Length and Complexity


2. Ecology is the study of
A. the genetic and molecular basis of organisms.
B. organisms and their relationship to the environment. <
C. how organisms maintain their delicate internal balance.
D. how organisms have slowly evolved over the last million years.

When dealing with numerical alternatives, the visual characteristics of the choices can
also serve as a cue. Examine the following examples.

Example 21 Poor Item—Alternative Contains a Visual Cue to the Correct Answer
1. The correlation between two measures is 0.90. What is the coefficient of determination?
A. 0.1
B. 0.3
C. 0.81 <
D. 0.9

Example 22 Better Item—Cues Avoided


2. The correlation between two measures is 0.90. What is the coefficient of determination?
A. 0.10
B. 0.30
C. 0.81 <
D. 0.99

In Example 21, the third option (i.e., C) is the only alternative that, like the number in the
stem, has two decimal places. The visual characteristics of this alternative may attract the
student to it independent of the knowledge required to answer it. In Example 22, each alter-
native has an equal number of decimal places and is equally visually attractive.
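
As a side note for readers checking the arithmetic in Examples 21 and 22, the coefficient of determination is simply the squared correlation:

Coefficient of Determination = r × r = 0.90 × 0.90 = 0.81

which is why alternative C is keyed as correct in both versions of the item.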

Make Sure All Alternatives Are Grammatically Correct Relative to the Stem. Gram-
matical cues that may help the uninformed student select the correct answer are usually the
result of inadequate proofreading. Examine the following examples.

Example 23 Poor Item—Grammatical Cue Present


1. Which individuals are credited with making the first successful flights in a heavier-
than-air aircraft that was both powered and controlled?
A. Octave Chanute
B. Otto Lilienthal
C. Samuel Langley
D. Wilbur and Orville Wright <

Example 24 _ Better Item—Grammatical Cue Avoided


2. Which individuals are credited with making the first successful flights in a heavier-
than-air aircraft that was both powered and controlled?
A. Octave Chanute and Sir George Cayley
B. Otto Lilienthal and Francis Herbert Wenham
C. Samuel Langley and Alphonse Penaud
D. Wilbur and Orville Wright <

In Example 23 the phrase “individuals are” in the stem indicates a plural answer. However,
only the fourth alternative (i.e., D) meets this requirement. This is corrected in Example 24
by ensuring that each alternative reflects a plural answer.

Another common error is inattention to the articles a and an. See the following.

Example 25 Poor Item—Grammatical Cue Present


1. A coherent and unifying explanation for a class of phenomena is a
A. analysis.
B. experiment.
C. observation.
D. theory. <

Example 26 Better Item—Grammatical Cue Avoided


2. A coherent and unifying explanation for a class of phenomena is a
A. conjecture.
B. hypothesis.
C. prediction.
D. theory. <

Example 27 Better Item—Grammatical Cue Avoided


3. A coherent and unifying explanation for a class of phenomena is a(n)
A. experiment.
B. hypothesis.
C. observation.
D. theory. <

In Example 25, the use of the article a indicates an answer beginning with a consonant in-
stead of a vowel. An observant student relying on cues will select the fourth alternative (i.e.,
D) because it is the only one that is grammatically correct. This is corrected in Example 26
by ensuring that all alternatives begin with consonants and in Example 27 by using a(n) to
accommodate alternatives beginning with either consonants or vowels.

Make Sure No Item Reveals the Answer to Another Item. One item should not contain
information that will help a student answer another item. Also, a correct answer on one item
should not be necessary for answering another item. This would give double weight to the
first item.

Have All Distracters Appear Plausible. Distracters should be designed to distract un-
knowledgeable students from the correct answer. Therefore, all distracters should appear
plausible and should be based on common student errors. For example, what concepts,
terms, events, techniques, or individuals are commonly confused? After you have adminis-
tered the test once, analyze the distracters to determine which are effective and which are
not. Replace or revise the ineffective distracters. There is little point in including a distracter
that can be easily eliminated by uninformed students. This simply wastes time and space.

Use Alternative Positions in a Random Manner for the Correct Answer. The correct
answer should appear in each of the alternative positions approximately the same number of
times. When there are four alternatives (e.g., A, B, C, and D), teachers tend to overuse the
middle alternatives (i.e., B and C). Alert students are likely to detect this pattern and use it to
answer questions of which they are unsure. Students have indicated that when faced with a
question they cannot answer based on knowledge they simply select B or C. Additionally, you
should ensure there is no detectable pattern in the placement of correct answers (e.g., A, C, B,
D, A, etc.). If there is no logical ordering for the alternatives (see the earlier recommendation),
they should be randomly arranged. Attempt random assignment when possible and then once
the test is complete, count the number of times the correct answer appears in each position. If
any positions are over- or underrepresented, make adjustments to correct the imbalance.
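
If you assemble tests on a computer, a few lines of code can both randomize the answer positions and tally how often the key falls in each position. The following short Python sketch is offered only as an illustration; the item bank, the variable names, and the four-alternative format are hypothetical, and it assumes the alternatives have no logical ordering that needs to be preserved.

import random
from collections import Counter

# Hypothetical item bank: (stem, list of alternatives, correct alternative)
items = [
    ("Which scale of measurement has a true zero point?",
     ["interval scale", "nominal scale", "ordinal scale", "ratio scale"], "ratio scale"),
    ("Who was the 7th president of the United States?",
     ["Andrew Jackson", "James Monroe", "John Adams", "Martin Van Buren"], "Andrew Jackson"),
]

key_positions = Counter()
for stem, alternatives, answer in items:
    shuffled = alternatives[:]      # copy so the original list is untouched
    random.shuffle(shuffled)        # randomize where the correct answer lands
    letter = "ABCD"[shuffled.index(answer)]
    key_positions[letter] += 1      # tally the position of the keyed answer

# After building the whole test, check for over- or underrepresented positions
print(key_positions)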

Minimize the Use of “None of the Above” and Avoid Using “All of the Above.” There
is some disagreement among test development experts regarding the use of “none of the
above” and “all of the above” as alternatives. The alternative “none of the above” is criti-
cized because it automatically forces the item into a correct-answer format. As noted earlier, the correct-answer form is often limited to lower-level educational objectives and easier items. Although there are times when “none of the above” is appropriate as an alternative, it should be used sparingly. Testing experts are more unified in their criticism of “all of the above” as an alternative. There are two primary concerns. First, students may read alterna-
tive A, see that it is correct, and mark it without ever reading alternatives B, C, and D. In this
situation the response is incorrect because the students did not read all of the alternatives,
not necessarily because they have not mastered the educational objective. Second, students
may know that two of the alternatives are correct and therefore conclude that “all of the
above” is correct. In this situation the response is correct but is based on incomplete knowl-
edge. Our recommendation is to use “none of the above” sparingly and avoid using “all of
the above.”

Avoid Artificially Inflating the Reading Level. Unless it is necessary to state the prob-
lem clearly and precisely, avoid obscure words and an overly difficult reading level. This
does not mean to avoid scientific or technical terms necessary to state the problem, but
simply to avoid the unnecessary use of complex incidental words.

Limit the Use of Always and Never in the Alternatives. The use of always and never
should generally be avoided because it is only in mathematics that their use is typically
justified. Savvy students know this and will use this information to rule out distracters.

Avoid Using the Exact Phrasing from the Text. Most measurement specialists suggest
that you avoid using the exact wording used in a text. Exact phrasing may be appropriate if
rote memorization is what you desire, but it is of limited value in terms of concept formation
and the ability to generalize. Exact phrasing should be used sparingly.

Organize the Test in a Logical Manner. The topics in a test should be organized in a
logical manner rather than scattered randomly. However, the test does not have to exactly
mirror the text or lectures. Strive for an organization that facilitates student performance.

Give Careful Consideration to the Number of Items on Your Test. Determining the
number of items to include on a test is a matter worthy of careful consideration. On one
hand you want to include enough items to ensure adequate reliability and validity. Recall
that one way to enhance the reliability of a score is to increase the number of items that go
into making up the score. On the other hand there is usually a limited amount of class time
allotted to testing. Occasionally teachers will include so many items on a test that students
do not have enough time to make reasoned responses. A test with too many items essen-
tially becomes a “speed test” and unfairly rewards students who respond quickly even if
they know no more than students who were slower in responding.
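
As a brief reminder of why adding items tends to improve reliability, the Spearman-Brown formula (the notation below is ours) estimates the reliability of a test lengthened by a factor of k:

Estimated Reliability = (k × r) / [1 + (k - 1) × r]

where r is the reliability of the original set of items. For example, doubling a test (k = 2) whose scores have a reliability of .60 gives (2 × .60)/(1 + .60) = .75, assuming the added items are comparable in quality to the originals.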
Companies who publish tests estimate a completion time for each item. For example,
an item may be considered a 30-second item, a 45-second item, or a 60-second item. Mak-
ing similar estimates can be useful, but unless you are a professional test developer you will
probably find it difficult to accurately estimate the time necessary to complete every item.
As a general rule you should allot at least one minute for secondary school students to com-
plete a multiple-choice item that measures a lower-level objective (e.g., Gronlund, 2003).
Younger students or items assessing higher-level objectives typically require more time.
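
As a rough planning aid, the arithmetic can be sketched in a few lines of Python. The per-item times and item counts below are assumptions you would adjust for your own students and item types, not rules.

# Assumed per-item completion times, in minutes
minutes_per_item = {"lower_level_mc": 1.0, "higher_level_mc": 1.5, "short_answer": 2.0}

# A hypothetical test plan: how many items of each type
test_plan = {"lower_level_mc": 20, "higher_level_mc": 10, "short_answer": 5}

total = sum(test_plan[kind] * minutes_per_item[kind] for kind in test_plan)
print(f"Estimated completion time: {total:.0f} minutes")   # 20 + 15 + 10 = 45 minutes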

Be Flexible When Applying These Guidelines. Apply these guidelines in a flexible manner. Although these suggestions apply in most cases, there are exceptions. The goal is
to write items that measure your educational objectives and contribute to psychometrically
sound tests. As you gain more experience writing items, you may occasionally need to

TABLE 8.1 Checklist for the Development of Multiple-Choice Items

1. Are the items clear and easy to read?
2. Does the item stem clearly state the problem or question?
3. Are there between three and five alternatives?
4. Are the alternatives brief and arranged in an order that promotes efficient scanning?
5. Have you avoided negatively stated stems?
6. Is there only one alternative that is correct or represents the best answer?
7. Have you checked for cues that accidentally identify the correct answer?
8. Are all alternatives grammatically correct relative to the stem?
9. Have you checked to make sure no item reveals the answer to another item?
10. Do all distracters appear plausible?
11. Did you use alternative positions in a random manner for the correct answer?
12. Did you minimize the use of “none of the above” and avoid using “all of the above”?
13. Is the reading level appropriate?
14. Did you limit the use of always and never in the alternatives?
15. Did you avoid using the exact phrasing from the text?
16. Is the test organized in a logical manner?
17. Can the test be completed in the allotted time period?

violate one of the guidelines to write the most efficient and effective item. This is clearly
appropriate, but if you find yourself doing it routinely, you are most likely being lazy or
careless in your test preparation. Table 8.1 provides a summary of these guidelines.
As we noted earlier the multiple-choice format is the most popular selected-response
format. Major strengths of multiple-choice include the following.

Strengths of Multiple-Choice Items


Multiple-Choice Items Are Versatile. Multiple-choice items can be used to assess achieve-
ment in a wide range of content areas from history and geography to
statistics and research design. They can be used to assess a variety of educational objectives ranging from the simple to the complex. One of the most frequent (but unfounded) criticisms of multiple-choice items is that they are limited to lower-level objectives. With creativity and
effort multiple-choice items can be written that measure more complex objectives. Consider
the following example suggested by Green (1981) of an item designed to assess a complex
learning objective.

Example 28 Item Assessing Complex Objectives


1. The correlation of SAT verbal and SAT math among all test takers is about 0.5. What
is the correlation between SAT verbal and SAT math among applicants admitted to
Harvard?
A. greater than 0.50
B. about 0.50
C. less than 0.50 <
D. there is no basis for a guess

To answer this item correctly, students must understand that the strength of a correlation is
affected by the variability in the sample. More homogeneous samples (i.e., samples with
less variability) generally result in lower correlations. The students then have to reason that
because Harvard is an extremely selective university, the group of applicants admitted there
would have more homogeneous SAT scores than the national standardization sample. That
is, there will be less variance in SAT scores among Harvard students relative to the national
sample. Because there is less variance in the Harvard group, the correlation will be less than
the national sample (if this is unclear, review the section on correlation one more time). This
illustrates that multiple-choice items can measure fairly complex learning objectives. Spe-
cial Interest Topic 8.1 describes research that found that, contrary to claims by critics,
multiple-choice items do not penalize creative or “deep thinking” students.
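
For readers who would like to see the restriction-of-range reasoning in action, the short simulation below is one way to illustrate it. It is a sketch only: the admission rule (keeping roughly the top 5 percent on the combined score), the sample size, and the use of standardized scores are arbitrary assumptions, not data about any real institution or test.

import numpy as np

rng = np.random.default_rng(0)

# Simulate verbal and math scores that correlate about 0.5 in the full applicant pool
cov = [[1.0, 0.5], [0.5, 1.0]]
scores = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)
full_r = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]

# "Admit" only applicants whose combined score falls in roughly the top 5 percent
combined = scores.sum(axis=1)
admitted = scores[combined > np.quantile(combined, 0.95)]
restricted_r = np.corrcoef(admitted[:, 0], admitted[:, 1])[0, 1]

# The correlation within the selected, more homogeneous group is noticeably lower
print(round(full_r, 2), round(restricted_r, 2))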

Multiple-Choice Items Can Be Scored in an Objective Manner. Multiple-choice items are easy to score. Many schools even have computer-scoring systems that will score
multiple-choice tests for you. By removing subjectivity in scoring, the reliability of your
students’ test scores is increased. Although creating a reliable test score does not ensure that
your test scores are valid, scores cannot be valid without being reliable.

Multiple-Choice Items Are Not Unduly Subject to Guessing. Although multiple-choice items are subject to guessing (like all selected-response items), they are not as sub-
ject to guessing as are true—false items. This aspect also enhances the reliability of these
items. See Special Interest Topic 8.2 for information about a correction for guessing that can
be applied to multiple-choice and true—false items.

Multiple-Choice Items Are Not Significantly Influenced by Response Sets. A response set is a tendency for an individual to respond in a specific manner. For example,
when unsure of the correct response on a true—false item, there is an “acquiescence set”
whereby students are more likely to select true than false (Cronbach, 1950). Although stu-
dents might have a tendency to select the middle alternatives when unsure of the answer,

SPECIAL INTEREST TOPIC 8.1


Do Multiple-Choice Items Penalize Creative Students?

Critics of multiple-choice and other selected-response items have long asserted that these items
measure only superficial knowledge and conventional thinking and actually penalize students who
are creative, deep thinkers. In a recent study, Powers and Kaufman (2002) examined the relation-
ship between performance on the Graduate Record Examination (GRE) General Test and selected
personality traits, including creativity, quickness, and depth. In summary, their analyses revealed
that there was no evidence that deeper-thinking students were penalized by the multiple-choice
format. The correlations between GRE scores and Depth were as follows: Analytical = 0.06, Quan-
titative = 0.08, and Verbal = 0.15. The results in terms of creativity were more positive, with the
correlation between GRE scores and Creativity as follows: Analytical = 0.24, Quantitative = 0.26,
and Verbal = 0.29 (all p < 0.001). Similar results were obtained with regard to Quickness, with the
correlation between GRE scores and Quickness as follows: Analytical = 0.21, Quantitative = 0.15,
and Verbal = 0.26 (all p < 0.001). In summary, there is no evidence that individuals who are creative,
deep thinking, and mentally quick are penalized by multiple-choice items. In fact, the research re-
veals modest positive correlations between the GRE scores and these personality traits. To be fair,
there was one rather surprising finding, a slightly negative correlation between GRE scores and
Conscientious (e.g., careful, avoids mistakes, completes work on time). The only hypothesis the
authors proposed was that being “conscientious” does not benefit students particularly well on timed
tests, such as the GRE, that place a premium on quick performance.

this tendency does not appear to significantly affect performance on multiple-choice tests
(e.g., Hopkins, 1998).

Multiple-Choice Items Are an Efficient Way of Sampling the Content Domain. Multiple-choice tests allow teachers to broadly sample the test’s content domain in an efficient manner. That is, because stu-
dents can respond to multiple-choice items in a fairly rapid manner, a
sufficient number of items can be included to allow the teacher to ade-
quately sample the content domain. Again, this enhances the reliability of the test.

Multiple-Choice Items Are Easy to Improve Using the Results of Item Analysis. The
careful use of difficulty and discrimination indexes and distracter analysis can help refine
and enhance the quality of the items.
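
As a sketch of what a simple item analysis can look like, the following Python fragment uses hypothetical response data for one item. Here the difficulty index is the proportion of students answering correctly, and the discrimination index is the simple upper-half minus lower-half difference; other indexes exist, and the data and variable names are invented for illustration.

from collections import Counter

# Hypothetical data for one four-alternative item; "C" is the keyed answer
key = "C"
responses    = ["C", "C", "B", "C", "A", "C", "D", "C", "B", "C"]
total_scores = [9, 8, 4, 7, 3, 9, 2, 8, 5, 10]   # each student's total test score

# Difficulty index (p): proportion of students answering the item correctly
p = sum(r == key for r in responses) / len(responses)

# Discrimination index (D): correct rate in the upper half minus the lower half
ranked = sorted(zip(total_scores, responses), reverse=True)
half = len(ranked) // 2
upper = [resp for _, resp in ranked[:half]]
lower = [resp for _, resp in ranked[half:]]
D = sum(r == key for r in upper) / half - sum(r == key for r in lower) / half

# Distracter analysis: how often each alternative was actually chosen
print(p, D, Counter(responses))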

Multiple-Choice Items Provide Information about the Type of Errors That Students
Are Making. Teachers can gain diagnostic information about common student errors and
misconceptions by examining the distracters that students commonly endorse. This infor-
mation can be used to improve instruction in the future, and current students’ knowledge
base can be corrected in class review sessions.

SPECIAL INTEREST TOPIC 8.2


Correction for Guessing

Some testing experts support the use of a “correction for guessing” formula with true—false and multiple-
choice items. Proponents of this practice use it because it discourages students from attempting to
raise their scores through blind guessing. The most common formula for correcting for guessing is:

Corrected Score = Right - Wrong/(n - 1)

where right = number of items answered correctly


wrong = number of items answered incorrectly
n = number of alternatives or potential answers

For true—false items it is simply calculated as:

Corrected Score = Right - Wrong

For multiple-choice items with four alternatives it is calculated as:

Corrected Score = Right - Wrong/3

Consider this example: Susan correctly answered 80 multiple-choice items on a 100-item test
(each item having 4 alternatives). She incorrectly answered 12 and omitted 8. Her uncorrected score
would be 80 (or 80% correct). Applying the formula to these data:

Corrected Score = 80 - 12/3


Corrected Score = 76

Susan’s corrected score is 76 (or 76%). Note that the omitted items are not counted in the corrected
score; only the items answered correctly and incorrectly are counted. What the correction formula
does is remove the number of items assumed to be the result of blind guessing.
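
For those who like to see the arithmetic spelled out, here is a minimal Python sketch of the same calculation (the function name and the true-false values are ours):

def corrected_score(right, wrong, n_alternatives):
    # Classical correction for guessing: Right - Wrong/(n - 1)
    return right - wrong / (n_alternatives - 1)

# Susan's example: 80 right, 12 wrong, 8 omitted, four alternatives per item
print(corrected_score(80, 12, 4))   # 76.0
# The true-false case (two alternatives) reduces to Right - Wrong
print(corrected_score(40, 10, 2))   # 30.0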
Should you use a correction for guessing? This issue has been hotly debated among assess-
ment professionals. The debate typically centers on the assumptions underlying the correction for-
mula. For example, the formula is based on the questionable assumption that all guesses are random,
and none are based on partial knowledge or understanding of the item content. Probably anyone who
has ever taken a test knows that all guesses are not random and that sometimes students are able to
rule out some alternatives using partial knowledge of the item content. As a result, many assessment
experts don’t recommend using a correction for guessing with teacher-made classroom tests. Some
authors suggest that their use is defensible in situations in which students have insufficient time to
answer all the items or in which guessing is contraindicated due to the nature of the test content (e.g.,
Linn & Gronlund, 2000). Nevertheless, on most classroom assessments a correction for guessing is
not necessary. In fact, in most situations the relative ranking of students using corrected and uncor-
rected scores will be about the same (Nitko, 2001).
Two related issues need to be mentioned. First, if you are not using a correction for guessing,
your students should be encouraged to attempt every item. If you are using a correction for guessing,
your students should be informed of this, something along the lines of “Your score will be corrected
for guessing, so it is not in your best interest to guess on items.”

The second issue involves professionally developed standardized tests. When you are using a
standardized test it is imperative to strictly follow the administration and scoring instructions. If the
test manual instructs you to use a correction for guessing, you must apply it for the test’s normative
data to be usable. If the test manual instructs you simply to use the “number correct” when calculat-
ing scores, these instructions should be followed. In subsequent chapters we will describe the admin-
istration and use of professionally developed standardized tests.

Although multiple-choice items have many strengths to recommend their use, they do
have limitations. These include the following.

Weaknesses of Multiple-Choice Items


Multiple-Choice Items Are Not Effective for Measuring All Educational Objec-
tives. Although multiple-choice items can be written to measure both simple and com-
plex objectives, they are not optimal for assessing all objectives. For example, some
objectives simply cannot be measured using multiple-choice items (e.g., writing a poem,
engaging in a debate, performing a laboratory experiment).

Multiple-Choice Items Are Not Easy to Write. Although ease and objectivity of scoring are advantages of multiple-choice items, it does take time and effort to write effective items with plausible distracters.
In summary, multiple-choice items are by far the most popular
selected-response format. They have many advantages and few weak-
nesses. As a result, they are often the preferred format for professionally developed tests. When
skillfully developed, they can contribute to the construction of psychometrically sound class-
room tests. Table 8.2 summarizes the strengths and weaknesses of multiple-choice items.

TABLE 8.2 Strengths and Weaknesses of Multiple-Choice Items

Strengths of Multiple-Choice Items


■ Multiple-choice items are versatile.
■ Multiple-choice items can be scored in an objective and reliable manner.
■ Multiple-choice items are not overly subject to guessing.
■ Multiple-choice items are not significantly influenced by response sets.
■ Multiple-choice items are an efficient way of sampling the content domain.
■ Multiple-choice items are easy to refine using the results of item analysis.
■ Multiple-choice items provide diagnostic information.

Weaknesses of Multiple-Choice Items


■ Multiple-choice items are not effective for measuring all educational objectives.
■ Multiple-choice items are not easy to write.


True—False Items

The next selected-response format we will discuss is the true—false format. True—false items
are very popular, second only to the multiple-choice format. We will actually use the term
true—false items to refer to a broader class of items. Sometimes this category is referred to as
binary-choice items, two-option items, or alternate-choice items. The common factor is that
all these items involve a statement or question that the student marks as true or false, agree or
disagree, correct or incorrect, yes or no, fact or opinion, and so on. Because the most common
form is true—false, we will use this term generically to refer to all two-
True—false items involve a option items.
statement or question that the Here follow examples of true—false items. Example 29 takes
student marks as true or false, the form of the traditional true—false format. Example 30 takes the
agree or disagree, yes or no, form of the correct—incorrect format. We also provide examples of
and so on. the type of directions needed with these questions.

Example 29 True—False Item with Directions


Directions: Carefully read each of the following statements. If the statement is
true, underscore the T. If the statement is false, underscore the F.

1. T F In recent years, malaria has been eliminated worldwide.


2. T F The ozone layer protects us from harmful ultraviolet radiation.

Example 30 Correct-Incorrect Item with Directions


Directions: Carefully read each of the following sentences. If the sentence is gram-
matically correct, underscore C. If the sentence contains a grammatical error,
underscore I for incorrect.

1. C I He set the book on the table.


2. C I She set on the couch.

Two variations of the true—false format are fairly common and deserve mention. The
first is the multiple true—false format we briefly mentioned when discussing multiple-choice
items. On traditional multiple-choice items the student must select one correct answer from
the alternatives, whereas on multiple true—false items the student indicates whether each one
of the alternatives is true or false. Frisbie (1992) provides an excellent discussion of the
multiple true—false format. Example 31 is a multiple true—false item.

Example 31 Multiple True—False Item


1. Which of the following Apollo astronauts actually landed on the moon? Underscore the
T if the astronaut landed on the moon, the F if the astronaut did not land on the moon.
Edwin Aldrin          T F
Frank Borman          T F
Neil Armstrong        T F
Pete Conrad           T F
Thomas Patten         T F
Walter Cunningham     T F

In the second variation of the traditional true—false format, the student is required to
correct false statements. This is typically referred to as true—false with correction format.
With this format it is important to indicate clearly which part of the statement may be
changed by underlining it (e.g., Linn & Gronlund, 2000). Consider Example 32.

Example 32 True—False Items with Correction of False Items


Directions: Read each of the statements. If the statement is true, underscore the
T. If the statement is false, underscore the F and change the word or words that
are underlined to make the statement true. To make the correction, write the
correct word or words in the blank space.
1. T F   Apollo 7    Apollo 5 was the first Apollo mission to conduct an orbit flight test of the Command and Service Module (CSM).
2. T F   __________  Apollo 8 was the first Apollo mission to achieve lunar orbit.

Although this variation makes the true—false items more demanding and less suscepti-
ble to guessing, it also introduces some subjectivity in scoring, which may reduce reliability.

Guidelines for Developing True—False Items


Avoid Including More than One Idea in the Statement. True-false items should ad-
dress only one central idea or point. Consider the following examples.

Example 33 Poor Item—Statement Contains More Than One Idea


1. T F The study of biology helps us understand living organisms and predict
the weather.

Example 34 Better Item—Statement Contains Only One Idea


2. T F The study of biology helps us understand living organisms.

Example 35 Better Item—Statement Contains Only One Idea


3. T F The study of biology helps us predict the weather.

Example 33 contains two ideas, one that is correct and one that is false. Therefore it is par-
tially true and partially false. This can cause confusion as to how students should respond.
Examples 34 and 35 each address only one idea and are less likely to be misleading.

Avoid Specific Determiners and Qualifiers That Might Serve as Cues to the Answer.
Specific determiners such as never, always, none, and all occur more frequently in false
statements and serve as cues to uninformed students that the statement is too broad to be
true. Accordingly, moderately worded statements including usually, sometimes, and fre-
quently are more likely to be true and these qualifiers also serve as cues to uninformed
students. Although it would be difficult to avoid using qualifiers in true—false items, they
can be used equally in true and false statements so their value as cues is diminished. Exam-
ine the following examples.

Example 36 Poor Item—Specific Determiners Serve as Cue


1. T F Longer tests always produce more reliable scores than shorter tests.

Example 37 Better Item—Cue Eliminated


2. T F Shorter tests usually produce more reliable scores than longer tests.

In Example 36 always may alert a student that the statement is too broad to be true. Ex-
ample 37 contains the qualifier usually, but the statement is false so a student relying on cues
would not benefit from it.

Ensure That True and False Statements Are of Approximately the Same Length. There
is a tendency to write true statements that are longer than false statements. To prevent state-
ment length from serving as an unintentional cue, visually inspect your statements and en-
sure that there is no conspicuous difference between the length of true and false statements.
It is usually easier to increase the length of false statements by including more qualifiers
than it is to shorten true statements.

Avoid Negative Statements. Avoid using statements that contain no, none, and not. The
use of negative statements can make the statement more ambiguous, which is not desirable.
The goal of a test item should be to determine whether the student has mastered a learning
objective, not to see whether the student can decipher an ambiguous question.

Avoid Long and/or Complex Statements. All statements should be presented as clearly
and concisely as possible. As noted in the previous guideline, the goal is to make all state-
ments clear and precise.

Include an Approximately Equal Number of True and False Statements. As noted


earlier when discussing response sets in the context of multiple-choice items, some students
are more likely to select true when they are unsure of the correct response (i.e., acquiescence
set). There are also students who have adopted a response set whereby they mark false when
unsure of the answer. To prevent students from artificially inflating their scores with either
of these response sets, include an approximately equal number of true and false items.

Avoid Including the Exact Wording from the Textbook. As with multiple-choice items,
you should avoid using the exact wording from the text. Students will recognize this over time,
and it tends to reward rote memorization rather than the development of a more thorough
understanding of the content (Hopkins, 1998). Table 8.3 provides a summary of the guide-
lines for developing true—false items.

Testing experts provide mixed evaluations of true—false items. Some experts are advo-
cates of true—false items whereas others are much more critical of this format. We tend to
fall toward the more critical end of the continuum.

Strengths of True—False Items


True—False Items Can Be Scored in an Objective Manner. Like other selected-re-
sponse items, true—false items can be scored easily, objectively, and reliably.

TABLE 8.3 Checklist for the Development of True—False Items

1. Does each statement include only one idea?
2. Have you avoided using specific determiners and qualifiers that could serve as cues to the answer?
3. Are true and false statements of approximately the same length?
4. Have you avoided negative statements?
5. Have you avoided long and complex statements?
6. Is there an approximately equal number of true and false statements?
7. Have you avoided using the exact wording from the textbook?

True—false items are effective at sampling the content domain and can be scored in a reliable manner.

True—False Items Are Efficient. Students can respond quickly to true—false items, even
quicker than they can to multiple-choice items. This allows the inclusion of more items on
a test designed to be administered in a limited period of time.

Weaknesses of True—False Items


True—False Items Are Not Particularly Useful Except with the Simplest Educa-
tional Objectives. Many testing experts believe that true—false items are useful only for
assessing low-level objectives such as knowledge and comprehension. Much of what we
hope to teach our students cannot be divided into the clear dichotomies represented by
true—false items and, as a result, is not well suited for this format. Additionally, many
experts believe true—false items promote rote memorization (even if you avoid using the
exact wording from the text or lecture).

True—false items are particularly vulnerable to guessing and are usually limited to measuring the simplest educational objectives.

True—False Items Are Very Vulnerable to Guessing. Because there are only two options
on true—false items, students have a 50% chance of getting the answer correct simply by chance. Because un-
intended cues to the correct answer are often present, an observant
but uninformed student can often get considerably more than 50% of these items correct.
As a result, guessing can have a significant influence on test scores. Guessing also reduces
the reliability of the individual items. To compensate, true—false tests often need many items
in order to reduce the influence of guessing and demonstrate adequate reliability.
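
To see how test length tempers the influence of guessing, consider the following sketch (not from the text; a minimal Python illustration that assumes a student guessing blindly on every item). It uses the binomial distribution to estimate the probability that guessing alone produces a score of 70% or better on true—false tests of various lengths.

    import math

    def p_chance_score(n_items, cut_proportion, p_item=0.5):
        # Probability that blind guessing (p_item per item) reaches the cut score
        # on a true-false test with n_items items.
        cut = math.ceil(n_items * cut_proportion)
        return sum(math.comb(n_items, k) * p_item**k * (1 - p_item)**(n_items - k)
                   for k in range(cut, n_items + 1))

    for n in (10, 25, 50, 100):
        print(f"{n:>3} items: P(70% or better by guessing alone) = {p_chance_score(n, 0.70):.4f}")

As the number of items grows, the probability that chance alone yields a respectable score drops sharply, which is one reason longer true—false tests tend to produce more reliable scores.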

True—False Items Are Subject to Response Sets. True—false items are considerably
more susceptible to the influence of response sets than are other selected-response items.

True—False Items Provide Little Diagnostic Information. Teachers can often gain
diagnostic information about common student errors and misconceptions by examin-
ing incorrect responses to other test items, but true—false items provide little diagnostic
information.

True—False Items May Produce a Negative Suggestion Effect. Some testing experts
have expressed concern that exposing students to the false statements inherent in true—false
items might promote learning false information (e.g., Hopkins, 1998), called a negative
suggestion effect.

Effective True—False Items Appear Easy to Write to the Casual Observer. This Is
Not the Case! Most individuals believe that true—false items are easy to write. Writing
effective true—false items, like all effective test items, requires considerable thought and
effort. Simply because they are brief does not mean they are easy to write.

In summary, true—false items are a popular selected-response format. They can be


scored in an objective and reliable manner and students can answer many items in a short
period of time. However, they have numerous weaknesses including being limited to the
assessment of simple learning objectives and being vulnerable to guessing. Before using
true—false items, we suggest that you weigh their strengths and weaknesses and ensure that
they are appropriate for assessing the specific learning objectives. Table 8.4 provides a sum-
mary of the strengths and weaknesses of true—false items.

Matching Items
The final selected-response format we will discuss is matching items. Matching items
usually contain two columns of words or phrases. One column contains words or phrases
for which the student seeks a match. This column is traditionally placed on the left and
the phrases are referred to as premises. The second column contains words that are
available for selection. The items in this column are referred to as responses. The prem-
ises are numbered and the responses are identified with letters. Directions are provided
that indicate the basis for matching the items in the two lists. Here is an example of a
matching item.

TABLE 8.4 Strengths and Weaknesses of True—False Items

Strengths of True—False Items
■ True—false items can be scored in an objective and reliable manner.
■ True—false items are efficient.

Weaknesses of True—False Items
■ True—false items are not particularly useful except with the simplest educational objectives.
■ True—false items are vulnerable to guessing.
■ True—false items are subject to response sets.
■ True—false items provide little diagnostic information.
■ True—false items may produce a negative suggestion effect.
■ Effective true—false items are not easy to write.

Matching items usually contain two columns of words or phrases. One column, typically located on the left, contains words or phrases for which the student seeks a match.

Example 38 Matching Items

Directions: Column A lists major functions of the brain. Column B lists different brain
structures. Indicate which structure primarily serves which function by placing the
appropriate letter in the blank space to the left of the function. Each brain structure
listed in Column B can be used once, more than once, or not at all.

Column A

_b_ 1. Helps initiate and control rapid movement of the arms and legs.
_g_ 2. Serves as a relay station connecting different parts of the brain.
_e_ 3. Is involved in the regulation of basic drives and emotions.
_a_ 4. Helps control slow, deliberate movements of the arms and legs.
_c_ 5. Connects the two hemispheres.
_d_ 6. Controls the release of certain hormones important in controlling the internal
       environment of the body.

Column B

a. basal ganglia
b. cerebellum
c. corpus callosum
d. hypothalamus
e. limbic system
f. medulla
g. thalamus

This item demonstrates an imperfect match because there are more responses than premises.
Additionally, the instructions indicate that each response may be used once, more than
once, or not at all. These procedures help prevent students from matching items simply by
elimination.

Guidelines for Developing Matching Items


Limit Matching Items to Homogeneous Material. Possibly the most important
guideline to remember when writing matching items is to make sure the lists contain homo-
geneous content. By this we mean you should base the lists on a common theme. For
example, in the previous example (Example 38) all of the premises specified functions
served by brain structures, and all of the responses were brain structures. Other examples
of homogeneous lists could be achievements matched with famous individuals, his-
torical events matched with dates, definitions matched with words, and so on. What should
be avoided is including heterogeneous material in your lists. For example, consider Ex-
ample 39.

Example 39 Poor Item—Heterogeneous Content


Directions: Match the items in Column A with the items in Column B. Each item
in Column B can be used once, more than once, or not at all.


Column A

_e_ 1. Most populous U.S. city.
_b_ 2. Largest country in South America.
_a_ 3. Largest river in the Western Hemisphere.
_g_ 4. Canada's leading financial and manufacturing center.
_c_ 5. Largest freshwater lake in the world.
_f_ 6. Largest country in Central America.

Column B

a. Amazon
b. Brazil
c. Lake Superior
d. Mississippi
e. New York City
f. Nicaragua
g. Toronto

Although this is an extreme example, it does illustrate how heterogeneous lists can undermine
the usefulness of matching items. For example, premise 1 asks for the most populous U.S. city
and the list of responses includes only two cities, only one of which is in the United States.
Premise 2 asks for the largest country in South America and the list of responses includes only
two countries, only one of which is in South America. In these questions students do not have
to possess much information about U.S. cities or South America to answer them correctly. It
would have been better to develop one matching list to focus on U.S. cities, one to focus on
countries in the Western Hemisphere, one to focus on major bodies of water, and so forth.

Indicate the Basis for Matching Premises and Responses in the Directions. Clearly
state in the directions the basis for matching responses to premises. You may have noticed
that in our example of a poor heterogeneous item (Example 39), the directions do not clearly
specify the basis for matching. This was not the case with our earlier example involving
brain functions and brain structures (Example 38). If you have difficulty specifying the basis
for matching all the items in your lists, it is likely that your lists are too heterogeneous.

Review Items Carefully for Unintentional Cues. Matching items are particularly sus-
ceptible to unintentional cues to the correct response. In Example 39, the use of lake in
premise 5 and response c may serve as a cue to the correct answer. Carefully review match-
ing lists to minimize such cues.

Include More Responses than Premises. By including more responses than premises,
you reduce the chance that an uninformed student can narrow down options and success-
fully match items by guessing.

Indicate That Responses May Be Used Once, More than Once, or Not at All. By
adding this statement to your directions and writing responses that are occasionally used
more than once or not at all, you also reduce the impact of guessing.
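
A quick simulation illustrates why these last two guidelines blunt guessing. The sketch below (not from the text; a minimal Monte Carlo illustration in Python, with list sizes chosen arbitrarily) estimates how many matches a student who guesses at random gets right, first with equal-length lists in which each response is used exactly once, and then with extra responses that may be reused.

    import random

    def expected_correct_by_guessing(n_premises, n_responses, one_to_one, trials=100_000):
        # Average number of premises a blind guesser matches correctly.
        # The answer key is assumed to pair premise i with response i.
        total = 0
        for _ in range(trials):
            if one_to_one:
                guess = random.sample(range(n_responses), n_premises)
            else:
                guess = [random.randrange(n_responses) for _ in range(n_premises)]
            total += sum(1 for i, g in enumerate(guess) if g == i)
        return total / trials

    # Five premises, five responses, each used exactly once: about 1 correct by chance.
    print(expected_correct_by_guessing(5, 5, one_to_one=True))
    # Five premises, eight reusable responses: roughly 0.6 correct by chance.
    print(expected_correct_by_guessing(5, 8, one_to_one=False))

Extra responses, combined with the “once, more than once, or not at all” instruction, also remove the possibility of earning the final match purely by elimination.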

Limit the Number of Items. For several reasons it is desirable to keep the list of items
fairly brief. It is easier for the person writing the test to ensure that the lists are homogeneous
when the lists are brief. For the student taking the test, it is easier to read and respond to a
shorter list of items. Although there is not universal agreement regarding the number of
items to include in a matching list, a maximum of ten appears reasonable with lists between
five and eight items generally recommended.

Ensure That the Responses Are Brief and Arrange Them in a Logical Order. Stu-
dents should be able to read the longer premises and then scan the briefer responses in an
efficient manner. To facilitate this process, keep the responses as brief as possible and ar-
range them in a logical order when appropriate (e.g., alphabetically, numerically).

Place All Items on the Same Page. Finally, keep the directions and all items on one
page. It greatly reduces efficiency in responding if the students must turn the page looking
for responses. Students also are more likely to transpose a letter or number if they have to
look back and forth across two pages, leading to errors in measuring what the student has
learned. Table 8.5 summarizes the guidelines for developing matching items.
Testing experts generally provide favorable evaluations of the matching format. Al-
though this format does not have as many advantages as multiple-choice items, it has fewer
limitations than the true—false format.

Strengths of Matching Items


Matching Items Can Be Scored in an Objective Manner. Like other selected-response
items, matching items can be scored easily, objectively, and reliably.

Matching items can be scored in a reliable manner, are efficient, and are relatively simple to write.

Matching Items Are Efficient. They take up little space and students can answer many
items in a relatively brief period.

Matching Items Are Relatively Simple to Write. Matching items are relatively easy to
write, but they still take time, planning, and ef-
are relatively easy to write, but they still take time, planning, and ef-
fort. The secret to writing good matching items is developing two homogeneous sets of
items to be matched and avoiding cues to the correct answer. If they are not developed well,
efficiency and usefulness are lost.

Weaknesses of Matching Items


Matching Items Have Limited Scope and Application in Assessing Student Learning.
With matching items, students are asked to match two things based on logical and usually
simple associations. Although matching items measure this type of learning outcome fairly

TABLE 8.5 Checklist for the Development of Matching Items

1. Is the material homogeneous and appropriate for the matching format?
2. Do the directions indicate the basis for matching premises and responses?
3. Have unintentional cues to the correct answer been avoided?
4. Are there more responses than premises?
5. Do the directions indicate that responses may be used once, more than once, or not at all?
6. Are the lists relatively short to facilitate scanning (e.g., < 10)?
7. Are the responses brief and arranged in a logical order?
8. Are all the items on the same page?

well, much of what we teach students involves greater understanding and higher-level
skills.
Matching items are fairly limited in scope and may promote rote memorization.

Matching Items May Promote Rote Memorization. Due to their focus on factual
knowledge and simple associations, the use of matching items may encourage rote
memorization.

Matching Items Are Vulnerable to Cues That Increase the Chance of Guessing. Unless
written with care, matching items are particularly susceptible to cues that accidentally
suggest the correct answer.

It Is Often Difficult to Develop Homogeneous Lists of Relevant Material. When
developing matching items, it is often difficult to generate homogeneous lists for matching.
As a result, there are two common but unattractive outcomes: the lists may become hetero-
geneous, or information that is homogeneous but trivial may be included. Neither one of
these outcomes is desirable because they both undermine the usefulness of the items.
In summary, matching items are a prevalent selected-response format. They can be
scored in an objective manner, are relatively easy to write, and are efficient. They do have
weaknesses, including being limited in the types of learning outcomes they can measure and
potentially encouraging students to simply memorize facts and simple associations. You
also need to be careful when writing matching items to avoid cues that inadvertently provide
hints to the correct answer. Nevertheless, when dealing with information that has a common
theme and that lends itself to this item format, they may be particularly useful. Table 8.6
provides a summary of the strengths and weaknesses of matching items.

Summary
All test items can be classified as either selected-response items or constructed-response
items. Selected-response items include multiple-choice, true—false, and matching items
whereas constructed-response items include essay items, short-answer items, and perfor-
mance assessments. We discussed each specific selected-response format, describing how
to write effective items and their individual strengths and weaknesses.

TABLE 8.6 Strengths and Weaknesses of Matching Items

Strengths of Matching Items
■ Matching items can be scored in an objective and reliable manner.
■ Matching items are efficient.
■ Matching items are relatively simple to write.

Weaknesses of Matching Items
■ Matching items have limited scope and application in assessing student learning.
■ Matching items may promote rote memorization.
■ Matching items are vulnerable to cues that increase the chance of guessing.
■ It is often difficult to develop homogeneous lists of meaningful material.

SPECIAL INTEREST TOPIC 8.3


What Research Says about “Changing Your Answer”

Have you ever heard that it is usually not in your best interest to change your answer on a multiple-
choice test? Many students and educators believe that you are best served by sticking with your first
impression. That is, don’t change your answer. Surprisingly this is not consistent with the research!
Pike (1979) reviewed the literature and came up with these conclusions:

■ Examinees change their answers only on approximately 4% of the questions.
■ When they do change their answer, more often than not it is in their best interest. Typically
  there are approximately two favorable changes (i.e., incorrect to correct) for every unfavor-
  able one (i.e., correct to incorrect).
■ These positive effects tend to decrease on more difficult items.
■ High-scoring students are more likely to profit from changing their answers than are low-
  scoring students.

This does not mean that you should encourage your students to change their answers on a whim.
However, if students feel a change is indicated based on careful thought and consideration, they
should feel comfortable doing so. Research suggests that they are probably doing the right thing to
enhance their score.
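
As a rough back-of-the-envelope check on these figures (a hypothetical 100-item test; the numbers below are illustrative, not data from Pike's review), the net effect of answer changing is real but modest:

    # Hypothetical 100-item test using Pike's (1979) approximate figures.
    items_changed = 0.04 * 100            # about 4 answers changed
    favorable = items_changed * 2 / 3     # roughly two favorable changes ...
    unfavorable = items_changed * 1 / 3   # ... for every unfavorable one
    net_gain = favorable - unfavorable
    print(round(net_gain, 1))             # about 1.3 points gained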

Multiple-choice items are the most popular selected-response format. They have nu-
merous strengths including versatility, objective and reliable scoring, and efficient
sampling of the content domain. The only weaknesses are that multiple-choice items
are not effective for measuring all learning objectives (e.g., organization and presenta-
tion of material, writing ability, performance tasks), and they are not easy to develop.
Testing experts generally support the use of multiple-choice items as they can contrib-
ute to the development of reliable and valid assessments.
True—False items are another popular selected-response format. Although true—false
items can be scored in an objective and reliable manner and students can answer many
items in a short period of time, they have numerous weaknesses. For example, they
are limited to the assessment of fairly simple learning objectives and are very vulner-
able to guessing. Although true—false items have a place in educational assessment,
before using them we recommend that you weigh their strengths and weaknesses and
ensure that they are the most appropriate item format for assessing the specific learn-
ing objectives.
Matching items were the last selected-response format we discussed. These items can
be scored in an objective and reliable manner, can be completed in a fairly efficient
manner, and are relatively easy to develop. Their major limitations include a rather
limited scope and the possibility of promoting rote memorization of material by your
students. Nevertheless, carefully developed matching items can effectively assess
lower-level educational objectives.
In the next two chapters we will address constructed-response items, including essays,
short-answer items, performance assessments, and portfolios. We stated earlier in the textbook

that typically the deciding factor when selecting an assessment or item format involves iden-
tifying the format that most directly measures the behaviors specified by the educational ob-
jectives. The very nature of some objectives mandates the use of constructed-response items
(e.g., writing a letter), but some objectives can be measured equally well using either selected-
response or constructed-response items. If after thoughtful consideration you determine that
both formats are equally well suited, we typically recommend the use of selected-response
items because they allow broader sampling of the content domain and can be scored in a more
reliable manner. However, we do not want you to think that we have a bias against construct-
ed-response items. We believe that educational assessments should contain a variety of assess-
ment procedures that are individually tailored to assess the educational objectives of interest.

KEY TERMS AND CONCEPTS

Alternatives, p. 196
Best-answer format, p. 197
Constructed-response items, p. 195
Correct-answer format, p. 197
Cue, p. 201
Direct-question format, p. 196
Distracters, p. 196
Homogeneous content, p. 216
Incomplete-sentence format, p. 196
Matching items, p. 215
Multiple-choice items, p. 196
Multiple true—false item, p. 201
Negative suggestion effect, p. 215
Response sets, p. 207
Selected-response items, p. 195
Stems, p. 196
True—false items, p. 211
True—false with correction, p. 212

RECOMMENDED READING

Aiken, L. R. (1982). Writing multiple-choice items to measure higher-order educational objectives. Educational & Psychological Measurement, 42, 803-806. A respected author presents suggestions for writing multiple-choice items that assess higher-order learning objectives.

Beck, M. D. (1978). The effect of item response changes on scores on an elementary reading achievement test. Journal of Educational Research, 71, 153-156. This article is an example of the research that has examined the issue of students changing their responses on achievement tests. A good example!

Dewey, R. A. (2000, December 12). Writing multiple choice items which require comprehension. Retrieved November 29, 2004, from www.psywww.com/selfquiz/aboutq.htm. At this site the author provides some good suggestions for making multiple-choice distracters more attractive.

Ebel, R. L. (1970). The case for true—false items. School Review, 78, 373-389. Although many assessment experts are opposed to the use of true—false items for the reasons cited in the text, Ebel comes to their defense in this article.

Sidick, J. T., Barrett, G. V., and Doverspike, D. (1994). Three-alternative multiple-choice tests: An attractive option. Personnel Psychology, 47, 829-835. In this study the authors compare tests with three-choice multiple-choice items with ones with five-choice items. The results suggest that both have similar measurement characteristics and that a case can be made supporting the use of three-choice items.


Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to
listen to an audio lecture about this chapter.

CHAPTER 9

The Development and Use of
Constructed-Response Items

In recent years there has been increased criticism of selected-response items
and a call for relying more on constructed-response items. Proponents of
constructed-response items claim that they provide a more “authentic”
assessment of student abilities, one that more closely resembles the way
these abilities are applied in the real world.

CHAPTER HIGHLIGHTS

Oral Testing: The Oral Essay as a Precursor of Constructed-Response Items
Essay Items
Short-Answer Items
A Final Note: Constructed-Response versus Selected-Response Items

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Trace the history of constructed-response assessment.
2. Explain how essay items can differ in terms of purpose and level of complexity.
3. Compare and contrast restricted-response and extended-response essay items.
4. Describe the principles involved with developing effective essay items.
5. Develop effective essay items for a given content area.
6. Discuss the strengths and weaknesses of essay items.
7. Describe the principles involved with grading essays.
8. Demonstrate the ability to grade essays in a reliable and valid manner.
9. Describe the principles involved with developing effective short-answer items.
10. Develop effective short-answer items for a given content area.
11. Discuss the strengths and weaknesses of short-answer items.
12. Discuss prominent issues to be considered when deciding whether to use selected-response
    or constructed-response items.

We have noted that most test items used in the classroom can be classified as either
selected-response or constructed-response items. If an item requires a student to select a


response from a list of alternatives, it is classified as a selected-response item. Examples of


selected-response items include multiple-choice, true—false, and matching items. In contrast,
if an item requires students to create or construct a response, it is classified as a constructed-
response item. Essay and short-answer items are common examples of constructed-response
items and will be the focus of this chapter. We will discuss their strengths and weaknesses
and provide suggestions for developing effective items. In the next chapter we will address
performance and portfolio assessments, which are types of constructed-response assessments
that have gained increased popularity in recent years. In all of these chapters we will focus
on the development of items to assess classroom achievement.

Oral Testing: The Oral Essay as a Precursor of Constructed-Response Items

Oral testing was a prominent aspect of Greek teaching as far back as the fourth century B.C.

We would like to begin our discussion of constructed-response items by briefly tracing their
history. Written constructed-response items had their beginning in oral testing. Oral testing
as a method of examination was a prominent aspect of Greek teachings as far back as the
fourth century B.C. Subsequently oral testing was adopted by the Romans and continued when
universities were established during the Dark Ages. Oral examinations have persisted in
Western universities to the present, most commonly being found in various forms
at the master’s and doctoral degree level (e.g., a thesis or dissertation defense). The following
procedure typifies an oral examination. Students are typically examined by a group of examin-
ers, each of whom may ask one or more questions. General topics for questioning are agreed
on beforehand, and the student must respond immediately to questions. Often the examiners
qualify or clarify questions when the student experiences difficulty. Although students may fail
portions of the examination, complete failures are relatively rare. Except for its use in gradu-
ate programs, oral testing is not very common in public schools and universities. Even though
students are still called on to answer questions in the classroom, extended responses are rarely
required or desired by the teacher. We have a few colleagues at universities who require oral ex-
aminations for their courses, but they are the exceptions rather than the rule. Some professional
licensing boards require applicants to sit for an oral examination in addition to meeting other
requirements. In such instances, the oral examination is usually the final hurdle. For example,
after obtaining a Ph.D. in a relevant area of psychology, one is required to obtain a license to
practice or offer psychological services to the public. The state of Texas requires each applicant
to pass two written examinations (both are multiple-choice, or selected-response, exams) prior
to sitting for a one-hour oral examination with two examiners.
The problems with oral testing are numerous and well documented. For fair evalua-
tion each person taking a common examination should take the test under uniform condi-
tions. These conditions include both testing procedures (e.g., time available) and the format,
content, and scoring of questions. For example, although it is clearly possible to present the
same questions in the same format to all students, this does not always occur. Unless the
examiners specify the questions beforehand and write them down, it is difficult for oral ex-
aminers to present them in the same manner to all students or even to ask the same questions
of all students. Scoring responses can also be problematic. Examiners usually do not record
the responses verbatim (if at all), and they subjectively review and score the responses based

on their memory. In oral examinations a premium is often placed on the student’s facility
with oral responding. For example, students are rarely given extended time to formulate
a response, and hesitation is often taken as lack of knowledge. With this arrangement the
achievement being measured may be achievement in the articulation of subject matter rather
than knowledge of subject matter. If that is the expressed educational objective, as in rhetoric
or debate, then the oral test is clearly appropriate. Otherwise it may not be a valid measure of
the specified educational objectives. A final limitation of oral testing is one first recognized
during the nineteenth-century industrialization process: inefficiency. The total testing time
for a class is equal to the number of students times the number of minutes allotted to each
student times the number of examiners. As you can see, the testing time adds up quickly.
Teachers are very busy professionals and time is at a premium. All of these shortcomings of
oral testing provide sufficient reason for its use to be quite restricted.
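
To make the inefficiency point concrete, here is a hypothetical illustration (the class size, time allotment, and number of examiners are invented for the example):

    25 students × 20 minutes per student × 2 examiners = 1,000 examiner-minutes ≈ 17 hours

Nearly 17 hours of professional time would be devoted to a single oral examination of one class, before any time is spent deliberating over scores.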

Essay Items
An essay item poses a question or problem for the student to respond to in a written format.

An essay item is a test item that poses a question or problem for the student to respond to
in a written format. Being a constructed-response item, the student must respond by constructing a response,
not by selecting among alternatives. Although essay items vary in
the degree of structure they impose on the student’s response, they generally provide con-
siderable freedom to the student in composing a response. Good essay items challenge the
student to organize, analyze, integrate, and synthesize information. At their best, essay items
elicit novel and creative cognitive processes from students. At their worst they present an
ambiguous task to students that is difficult, if not impossible, to score in a reliable manner.
Written essay items were used in Chinese civil service examinations as long as two
thousand years ago, but they did not become popular in Western civilization until much later.
In the nineteenth century, technical developments (e.g., increased availability of paper, devel-
opment of lead pencils, now principally using graphite) made written examinations cheaper
and more practical in both America and Europe. About the same time, Horace Mann, an in-
fluential nineteenth-century educator, decried the evils of oral testing and argued for the superiority
of the written essay. This set the stage for the emergence of essay (and other constructed-re-
sponse) tests. Although essay items have their own limitations, they have addressed some of
the problems associated with oral testing. They afforded more uniformity in test content (i.e.,
students get the same questions presented in the same order), there was a written record of the
student’s response, and they were more efficient (i.e., they take less testing time).
Essay items can be classified according to their educational purpose or focus (i.e.,
evaluating content, style, or grammar), the complexity of the task presented (e.g., knowl-
edge, comprehension, application, analysis, synthesis, and evaluation), and how much
structure they provide (restricted or extended response). We will begin by discussing how
essay items can vary according to their educational purpose.

Purposes of Essay Testing


Table 9.1 illustrates different purposes for which essay items are typically used when assess-
ing student achievement. The major purposes are for assessing content, style, and grammar
(which we take to include writing mechanics such as spelling).

TABLE 9.1 Purposes of Essay Testing

                Content                          Style                      Grammar

Content         Assess cognitive objectives or   Assess content and         Assess content and
                knowledge of content only        writing style              grammar

Style                                            Assess writing ability     Assess writing style
                                                 and style only             and grammar

Grammar                                                                     Assess grammar only

Content—Style—Grammar: Assess content knowledge, writing style, and grammar

Essay items can be scored in terms of content, style, and grammar.

In Table 9.1 the three purposes have been crossed to form a nine-element matrix. These
elements represent different assessment goals.
The composition of the elements in the diagonal is as follows:

The content element represents testing solely for cognitive achievement. When scoring
for this purpose, you attend only to the content of the response and ignore the
student’s achievement in writing style and in grammar. The purpose is to determine what the
student knows or can produce. Here, all levels of the cognitive taxonomy can be measured.
Essay testing in this context should not penalize a student deficient in skills unrelated to the
content being assessed. For example, poor organization and misspellings are not counted
against the student.
The style element is the purpose often found in writing composition classes. The content
of the essay is largely irrelevant. The student is told to pick a topic and write in a
specified manner. All measurement is based on objectives related to organization, structure,
phrasing, transition, and other components of the writing process. Here grammar is also
unimportant.
The grammar element is one in which the objective is to examine the stu-
dent’s ability to apply grammatical rules. This category typically involves all aspects of
writing mechanics (e.g., spelling, punctuation, etc.). Content and style are unimportant and
are not scored.

All the other elements (i.e., off-diagonal elements) combine two purposes. For ex-
ample, the purpose of an essay item could involve the combination of content-style, content—
grammar, or style-grammar. Although not represented in this matrix, an item may have a
three-element purpose in which the student’s essay is evaluated in terms of content, style, and
grammar. This latter purpose is often encountered in take-home assignments such as reports,
term papers, or final examinations. All of these different combinations may be considered
essay examinations, and the topics of this chapter will apply in varying degrees to the ele-
ments in Table 9.1.

While theoretically essay items can be scored independently based on these three pur-
poses, this is actually much more difficult than it appears. For example, research has shown
that factors such as grammar, penmanship, and even the length of the answer influence the
scores assigned, even when teachers are instructed to grade only on content and disregard
style and grammar. We will come back to this and other scoring issues later in this chapter.

Essay Items at Different Levels of Complexity


Essay items can be written to Essay items can be written to measure objectives at all levels of the
measure objectives at all levels cognitive taxonomy. Some examples of items written at each level of
of the cognitive taxonomy. the cognitive taxonomy follow.

Knowledge. At the knowledge level, essay items are likely to include verbs such as define,
describe, identify, list, and state. A knowledge level item follows:

Example 1 Knowledge Level Item


1. List the four scales of measurement and define each scale.

Comprehension. Comprehension level essay items often include verbs such as explain,
paraphrase, summarize, and translate. An example of a comprehension level question
follows:

Example 2 Comprehension Level Item


1. Explain the use of the Spearman-Brown formula in calculating split-half reliability.

Knowledge and comprehension level objectives can also be assessed with selected-
response items (e.g., multiple-choice), but there is a distinction. Selected-response items re-
quire only recognition of the correct answer, whereas essay items require recall. That is, with
essay items students must remember the correct answer without having the benefit of having
it in front them. There are instances when the recall/recognition distinction is important and
instances when it is not. When recall is important, essay items should be considered.

Application. Application level essay items typically include verbs such as apply, com-
pute, develop, produce, solve, and use. An example of an application level item follows:

Example 3 Application Level Item


1. For the objective listed below, develop a multiple-choice item and a true—false item.
Objective: The student will be able to compute the reliability for a test that is doubled
in length and has an initial reliability of 0.50.

Example 3 demonstrates the application of a general procedure, principle, or rule, in this


case the production of an item from an educational objective. Application level essay items
typically require the student to solve a problem with a specific method or approach.
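
For reference, the computation that this objective calls for uses the Spearman-Brown formula discussed earlier in the text. A brief worked version (ours, not part of the original example) follows, where k is the factor by which the test is lengthened and r is the reliability of the original test:

    predicted reliability = (k × r) / [1 + (k - 1) × r]
                          = (2 × 0.50) / [1 + (2 - 1) × 0.50]
                          = 1.00 / 1.50
                          ≈ 0.67

Doubling a test with an initial reliability of 0.50 is thus expected to raise the reliability to about 0.67.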

Analysis. Analysis level items are frequently encountered in essay tests. Verbs used at the
analysis level include analyze, break down, differentiate, illustrate, outline, and summarize.
Consider this example:

Example 4 Analysis Level Item


1. Summarize in a systematic, coherent manner the effects of the Industrial Revolution
on educational testing.

In Example 4 the student is asked to analyze the material by identifying and describing
the effects of the Industrial Revolution on educational testing. Many teachers simply use
the verb discuss in this context (i.e., Discuss the effects . . . ), and this may be acceptable
if the students understand what is expected of them. Otherwise, it introduces ambiguity
into the assessment process, something to be avoided.

Synthesis. Essay items written at the synthesis level require students to create something
new and original. Verbs often encountered at the synthesis level include compose, create,
develop, and design. Here is an example of an essay item at the synthesis level:

Example 5 Synthesis Level Item


1. Create a new educational objective and corresponding essay item at the analysis level
for the content area of your choice.

Evaluation. The evaluation level of the cognitive taxonomy requires judgments concern-
ing value and worth. Essay items written at this level often involve “choice” as in the next
example.

Example 6 Evaluation Level Item


1. Specify what you view as the best type of item for assessing analysis level objectives
and defend your choice. Be sure to stipulate the basis for your choice, highlighting
the qualities that make this item type optimal and the reasons you selected it over the
other item types.

Here students select the best item type, a subjective choice, and defend their selection.
Words most often used in essay items at the evaluation level include appraise, choose, criti-
cize, debate, evaluate, judge, and others involving a determination of worth.
In place of a personal subjective choice, the criteria may be defined in terms of scien-
tific, legal, social, or other external standards. Consider this example:

Example 7 Evaluation Level Item


1. Do you believe essay tests are biased against minorities? State your position and
defend it using current legal, scientific, and social standards.

In Example 7 the students must choose a position and build a case for it based on their un-
derstanding of current mores and standards as well as psychometric expertise.

Restricted-Response versus Extended-Response Essays


So far we have discussed how essay items can be used for different purposes (i.e., content,
style, and grammar) and how to measure objectives at different levels of the cognitive com-
plexity (e.g., knowledge, comprehension, application, analysis, synthesis, and evaluation).
In addition to these distinctions, essay items are often classified as either restricted response
or extended response.

Essay items can be written to elicit restricted responses or extended responses.

Restricted-response items are highly structured and clearly specify the form and scope
of a student's response. Restricted-response items typically require students to list, define,
describe, or give reasons. These items may specify time or length limits for the response.
Here are examples of restricted-response items.

Example 8 Restricted-Response Items


1. In the space provided, define homeostasis and describe its importance.
2. List the types of muscle tissue and state the function of each.

Extended-response items provide more latitude and flexibility in how the students can
respond to the item. There is little or no limit on the form and scope of the response. When limi-
tations are provided, they are usually held to a minimum (e.g., page and time limits). Extended-
response items often require students to compose, summarize, formulate, compare/contrast,
interpret, and so forth. Examples of extended-response items include the following:

Example 9 Extended-Response Items


1. Summarize and write a critical evaluation of the research on global warming. Include
a detailed analysis of the strengths and weaknesses of the empirical research, and
provide an evaluative statement regarding your conclusions on the topic.
2. Compare and contrast asthma and emphysema in terms of physiological processes,
treatment options, and prognosis.

Extended-response items provide less structure and this promotes greater creativity,
integration, and organization of material.
As you might expect, restricted-response and extended-response essay items have
their own strengths and limitations. Restricted-response essay items are particularly good
for assessing objectives at the knowledge, comprehension, and application levels. They
can be answered in a timely fashion by students, which allows you to include more items,
and they are easier to score in a reliable manner than extended-response items. In contrast,
extended-response items are particularly well suited for assessing higher-level cognitive ob-
jectives. However, they are difficult to score in a reliable manner and because they take con-
siderable time for students to complete, you typically have to limit your test to relatively few
items, which results in limited sampling of the content domain. Although restricted-response
items have the advantage of more reliable and efficient scoring, along with better sampling of
the content domain, certain learning objectives simply require the use of extended-response
essay items. In these situations, it is important to write and score extended-response items as
carefully as possible and take into consideration the limitations. To that end, we will start by
giving you some general guidelines for writing good essay items.

Guidelines for Developing Essay Items


It is important that essay items specify the assessment task in a clear and straightforward manner.

Write in a Clear, Straightforward Manner. The most important criterion for a good
essay item is that it clearly specifies the assessment task. The assessment task is simply what
you want the student to do. We recommend that you provide enough information in your
essay item that there is no doubt about what you expect. If you want the students to list
reasons, specify that you want a list. If you want them to make an evaluative judgment,
clearly state it. If you want a restricted response, specify that. If you want an extended
response, make that clear. When appropriate, indicate the
point value of the item or how much time students should devote to it. On extended-response
items, some experts recommend that you specify your grading criteria so the students will
have a clear picture of what you expect (e.g., Gronlund, 1998). Also, avoid using unneces-
sarily difficult or technical language. The student should not have to guess your intentions
from obtuse wording or technical jargon. We are not suggesting that your essay items be
unnecessarily lengthy. In fact, we recommend that they be as brief as possible while still
clearly specifying the assessment task. Consider the following examples,
one of a poor essay item, the other of a better, more specific essay item.

Example 10 Poor Item—Unclear Assessment Task


1. Why did World War II begin?

Example 11 Better Item—Clear Assessment Task


2. Describe the course of events that led up to Britain and France’s policy of appeasement
toward Germany in 1938. In your response explain why Chamberlain and Daladier
pursued this policy. Explain what event or events later convinced Britain and France
to abandon their policy of appeasement. Answer in the space provided. (counts as
25% of the test grade)

Consider Carefully the Amount of Time Students Will Need to Respond to the Essay
Items. This is a practical recommendation that you pay attention to the amount of time
the students will need to complete each essay item. For example, you might estimate that
students need approximately 15 minutes to complete one item and 30 minutes for another.
As a general rule, teachers tend to underestimate the time students need to respond to essay
items. As teachers we may estimate only the time necessary to write the response whereas
students actually need time to collect and organize their thoughts before even starting the
writing process. As a rule of thumb, we recommend you construct a test you think is ap-
propriate to the available time and reduce it in length by about 25%.

Do Not Allow Students to Select the Items to Which They Will Respond. Some teach-
ers provide a number of items and allow the students to select a specified number of items
to respond to. For example, a test might include eight items and the students are required
to select five items to respond to. As a general rule this practice is to be avoided. When stu-
dents respond to different items, they are essentially taking different tests. When they take
different tests, they cannot be evaluated on a comparative basis. In addition, when students
respond only to the items they are best prepared for or knowledgeable about, you get a less

representative sample of their knowledge (e.g., Gronlund, 1998). As you know, anything
that results in less effective content sampling compromises the measurement properties of
the test.

Use More Restricted-Response Items in Place of a Smaller Number of Extended-


Response Items. Restricted-response items have measurement characteristics that may
make them preferable over extended-response items. First, they are easier to score in a reli-
able manner. Second, because students can respond to a larger number of items in a given
amount of time, they can provide superior sampling of content domain. Although some
educational objectives require the use of extended-response items, when you have a choice
we recommend using multiple restricted-response items.

Limit the Use of Essay Items to Educational Objectives That Cannot Be Measured
Using Selected-Response Items. While essays are extremely popular among many
teachers and have their strengths, they do have limitations that we have alluded to and will
outline in the next section. For now, we just want to recommend that you restrict the use
of essay items to the measurement of objectives that cannot be measured adequately using
selected-response items. For example, if you want to assess the student’s ability to organize
and present material in a written format, an essay item would be a natural choice. These
guidelines are summarized in Table 9.2.
So far we have alluded to some of the strengths and weaknesses of essay items, and
this is probably an opportune time to discuss them more directly.

Strengths of Essay Items


Essay items can be written to assess higher-level cognitive skills and are ideal for measuring some objectives such as writing skills.

Essay Items Can Be Written to Assess Higher-Level Cognitive Skills. In the last chapter
we argued that multiple-choice items can be written to assess higher-level cognitive
objectives. We still stand by that statement. Nevertheless, some educational objectives are
most easily measured with essay items, and these tend to be higher-level objectives. Some
objectives such as writing skills literally require the use of essay items. Essay items also
have the advantage of requiring

TABLE 9.2 Checklist for the Development of Essay Items

1. Are the items written in a clear, straightforward manner?


2. Will the students be able to complete the test in the time available?
3. Will all students respond to the same set of items?
4. When appropriate, did you use more restricted-response items in place of
fewer extended-response items?
5. Did you limit the use of essay items to objectives that cannot be measured
with selected-response items?

recall, often denoting stronger mastery of the material than recognition, as required with
selected-response items.

It Generally Takes Less Time to Write Essay Items than Selected-Response


Items. Writing an essay test typically takes less time than preparing a test with selected-
response items. Because most essay tests contain only a fraction of the number of items
that an objective test might, you will usually have fewer items to write. It is also tempting
to say that essay items are easier to write than objective items. While this is probably true,
we don’t want to mislead you into thinking that writing essay items is effortless. Writing
good essay items requires considerable thought and effort. Essay items are probably easier
to write than objective items, but that does not necessarily make them easy.

The Use of Essay Items Largely Eliminates Blind Guessing. Because essay items
require the student to produce a response as opposed to simply selecting one, students are
not able to guess successfully the desired answer.

When Studying for Essay Tests, Students May Spend Less Time on Rote Memoriza-
tion and More Time Analyzing and Synthesizing Information. Many teachers believe
that students study differently for essay tests than they do for selected-response tests, and
some research supports this claim (e.g., Coffman, 1972; Hakstian, 1971). It is possible that
students preparing for essay tests spend more time analyzing and synthesizing information
rather than memorizing facts. Hopkins (1998) suggests that teachers may combine a few
essay items with selected-response items to achieve this potential instructional benefit.

Weaknesses of Essay Items


It is difficult to score essay items in a reliable manner.

Reliable Scoring of Essay Items Is Difficult. As we noted in our chapter on reliability,
when scoring relies on subjective judg-
ment, it is important to evaluate the degree of agreement when dif-
ferent individuals score the test. This is referred to as inter-scorer or inter-rater reliability.
Studies of the inter-rater reliability of essay items have shown that there is often disagree-
ment between raters. In addition to inconsistency between scores assigned by different
raters (inter-rater inconsistency), there is also often inconsistency in the scores assigned
by the same rater at different times (intra-rater inconsistency). When scoring essay items,
a multitude of factors can contribute to this unreliability. These factors or effects include
the following.

Content Indeterminancy Effects. Content indeterminancy effects are the result of an


inadequate or ambiguous understanding by the teacher of the response required by the essay
item. When scoring an essay item, the teacher should have a very clear idea of what con-
stitutes a “good” response. Obviously if two teachers scoring an essay have different ideas
about what the desired response is, they are not likely to agree on the score to be assigned.
Even an individual teacher who has only a vague idea of what a good response is will likely
have difficulty scoring essays in a consistent manner. This ambiguity or indeterminancy
regarding what constitutes a good response leads to unreliable, inconsistent scoring.

Expectancy Effects. Expectancy effects occur when the teacher scoring the test allows
irrelevant characteristics of the student to affect scoring. This is also referred to as the “halo
effect.” For example, if a teacher has a favorable overall impression of a student with a history
of academic excellence, the teacher might be inclined to assign a higher score to the student’s
responses (e.g., Chase, 1979). In contrast, a teacher might tend to be more critical of a re-
sponse by a student with a poor academic record who is viewed as difficult or apathetic. These
effects are not typically intentional or even conscious, but they are often present nevertheless.
Similar effects can also carry over from one item to the next within a test. That is, if you see
that a student performed well on an earlier item, it might influence scoring on later items.

Handwriting, Grammar, and Spelling Effects. Research dating from the 1920s has shown
that teachers are not able to score essay items solely on content even when they are in-
structed to disregard style and handwriting, grammar, and spelling effects (e.g., James,
1927; Sheppard, 1929). For example, good handwriting raises scores and poor handwriting,
misspellings, incorrect punctuation, and poor grammar reduce scores even when content is
the only criterion for evaluation. Even the length of the response impacts the score. Teachers
tend to give higher scores to lengthy responses, even when the content is not superior to that
of a shorter response (Hopkins, 1998), something students have long suspected!

Order Effects. Order effects are changes in scoring that emerge during the grading pro-
cess. As a general rule, essays scored early in the grading process receive better grades than
essays scored later (Coffman & Kurfman, 1968; Godshalk, Swineford, Coffman, & ETS,
1966). Research has also shown that the quality of preceding responses impacts the scores
assigned. That is, essays tend to receive higher scores when they are preceded by poor-
quality responses as opposed to when they are preceded by high-quality responses (Hales
& Tokar, 1975; Hughes, Keeling, & Tuck, 1980).

Fatigue Effects. The teacher’s physical and cognitive abilities are likely to degrade if essay
scoring continues for too long a period. The maximum period of time will probably vary
according to the complexity of the responses, but reading essays for more than two hours
without sufficient breaks will likely produce fatigue effects.
As you can see a number of factors can undermine reliability when scoring essay items.
In earlier chapters we emphasized the importance of reliability, so this weakness should be
given careful consideration when developing and scoring essay items. It should also be noted
that reduced reliability undermines the validity of the interpretation of test performance.
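
When two teachers do score the same set of essays, the degree of agreement can be checked directly. The sketch below (not from the text; a minimal Python illustration using made-up scores) computes two simple indices of inter-rater consistency: the percentage of essays on which the raters assign identical scores, and the correlation between their two sets of scores.

    # Hypothetical scores two raters assigned to the same ten essays (0-10 scale).
    rater_a = [8, 6, 9, 5, 7, 4, 8, 6, 7, 9]
    rater_b = [7, 6, 9, 4, 7, 5, 8, 6, 6, 9]

    # Percent exact agreement.
    agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

    # Pearson correlation between the two sets of scores.
    n = len(rater_a)
    mean_a, mean_b = sum(rater_a) / n, sum(rater_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b)) / n
    sd_a = (sum((a - mean_a) ** 2 for a in rater_a) / n) ** 0.5
    sd_b = (sum((b - mean_b) ** 2 for b in rater_b) / n) ** 0.5
    r = cov / (sd_a * sd_b)

    print(f"Exact agreement: {agreement:.0%}; inter-rater correlation: {r:.2f}")

Low agreement or a low correlation is a signal that the scoring procedure needs to be tightened before the scores are used.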

Restricted Sampling of the Content Domain. Because essay items typically require a
considerable amount of time to evaluate and to construct a response to, students are able
to respond to only a few items in a testing period. This results in limited sampling of the
content domain and potentially reduced reliability. This is particularly true of extended-re-
sponse essay items but may also apply to restricted-response items.

Scoring Essay Items Is Time Consuming. In addition to being difficult to score in a reliable
manner, essay items are tedious and time consuming to score. Although selected-response
items tend to take longer to develop, they can usually be scored easily, quickly, and reliably.

TABLE 9.3 Strengths and Weaknesses of Essay Items

Strengths of Essay Items
■ Essay items are good for assessing some higher-level cognitive skills.
■ Essay items are easier to write than objective items.
■ Essay items eliminate blind guessing.
■ Essay items may promote a higher level of learning.

Weaknesses of Essay Items
■ Essay items are difficult to score in a reliable manner.
■ The use of essay items may result in a limited sample of the content domain.
■ Scoring essay items is a tedious and time-consuming process.
■ Essay items are subject to bluffing.

Bluffing. Although the use of essay items eliminates random guessing, bluffing is intro-
duced. Bluffing occurs when a student does not possess the knowledge or skills to respond
to the item, but tries to "bluff" or feign a response. Due to the subjective nature of essay
scoring, bluffing may earn a student partial or even full credit. Experience
has shown that some students are extremely proficient at bluffing. For example, a student
may be aware that teachers tend to give lengthy responses more credit and so simply reframe
the initial question as a statement and then repeat the statement in slightly different ways.
Table 9.3 provides a summary of the strengths and weaknesses of essay items.

Guidelines for Scoring Essay Items


To enhance the reliability of essay tests, you need to develop structured, unbiased scoring
procedures.

It should be obvious from the preceding discussion that there are significant concerns
regarding the reliability and validity of essay tests. To enhance the measurement characteristics
of essay tests, you need to concentrate on developing structured, unbiased scoring procedures.
Here are a few suggestions to help you score essay items in a consistent, reliable way.

Use Predetermined Scoring Criteria to Reduce Content Indeterminancy Effects. Content
indeterminancy effects are the result of an imperfect un-
derstanding of the response required by the essay item. When scoring an essay item, the
teacher should have a very clear idea of what constitutes a “good” response. Obviously if two
teachers scoring an essay have different ideas about what the desired response is, they are not
likely to agree on the score to be assigned. You can reduce content indeterminancy effects
by clearly specifying the important elements of the desired response. A written guide that
helps you score constructed-response items is typically referred to as a scoring rubric. For
restricted-response essay items at the lower levels of the cognitive domain (knowledge, com-
prehension, application, and analysis), the criteria for scoring can often be specified by writ-
ing a sample answer or simply listing the major elements. However, for extended-response
items and items at the higher levels of the cognitive domain, more complex rubrics are often

required. For extended-response items, due to the freedom given to the student, it may not be
possible to write a sample answer that takes into consideration all possible “good” responses.
For items at the synthesis and evaluation levels, new or novel responses are expected. As a
result, the exact form and content of the response cannot be anticipated, and a simple model
response cannot be delineated.
Scoring rubrics are often classified as either analytic or holistic. Analytic scoring
rubrics identify different aspects or dimensions of the response and the teacher scores each
dimension separately. For example, an analytic scoring rubric might distinguish between
content, writing style, and grammar/mechanics. With this scoring rubric the teacher will
score each response in terms of these three categories. With analytic rubrics it is usually nec-
essary to specify the value assigned to each characteristic. For example, for a 15-point essay
item in a social science class wherein the content of the response is of primary concern,
the teacher may designate 10 points for content, 3 points for writing style, and 2 points for
grammar/mechanics. If content were of equal importance with writing style and grammar/
mechanics, the teacher could assign 5 points for each category. In many situations two or
three categories are sufficient whereas in other cases more elaborate schemes are necessary.
An advantage of analytic scoring rubrics is that they provide specific feedback to students
regarding the adequacy of their responses in different areas. This helps students know which
aspects of their responses were adequate and which aspects need improvement. The major
drawback of analytic rubrics is that their use can be fairly time consuming, particularly
when the rubric specifies many dimensions to be graded individually.
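To make the point allocation concrete, consider a hypothetical scoring of a single response
under the 10/3/2 scheme just described (the specific ratings here are illustrative only, not
drawn from an actual rubric):

    Content: 8 of 10 points
    Writing style: 2 of 3 points
    Grammar/mechanics: 1 of 2 points
    Total: 8 + 2 + 1 = 11 of 15 points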
With a holistic rubric, the teacher assigns a single score based on the overall quality
of the student’s response. Holistic rubrics are often less detailed than analytic rubrics. They
are easier to develop and scoring usually proceeds faster. Their primary disadvantage is
that they do not provide specific feedback to students about the strengths and weaknesses
of their responses.
Some testing experts suggest that, instead of using holistic rubrics to assign a numeri-
cal or point score, you use an ordinal or ranking approach. With this approach, instead of
assigning a point value to each response, you read and evaluate the responses and sort them
into categories reflecting different qualitative levels. Many teachers use five categories to
correspond to letter grades (i.e., A, B, C, D, and F). When using this approach, Gronlund
(1998) recommends that teachers read each item twice. You initially read through the essay
items and sort them into the designated categories. Subsequently you read the items in each
category as a group checking for consistency. If any items appear to be either superior or
inferior to the other items in that category, you make the necessary adjustment.
To illustrate the differences between holistic and analytic scoring rubrics, consider
this essay question:

Example 12 Sample Essay Item


1. Describe and then compare and contrast Thurstone’s model of intelligence with that
presented by Gardner. Give examples of the ways they are similar and the ways they
differ.

Table 9.4 presents a holistic scoring rubric that might be used when scoring this item.
Table 9.5 presents an analytic scoring rubric that might be used when scoring this item.

TABLE 9.4 Holistic Scoring Rubric (5-Point Scale)

Essay Item: Compare and contrast Thurstone's model of intelligence with that presented
by Gardner. Give examples of the ways they are similar and the ways they differ.

Excellent (Rating: 5). The student demonstrated a thorough understanding of both models
of intelligence and could accurately describe in detail similarities and differences and give
examples. This is an exemplary response.

Good (Rating: 4). The student demonstrated a good understanding of the models and could
describe similarities and differences and give examples.

Average (Rating: 3). The student demonstrated an adequate understanding of the models
and could describe some similarities and differences. Depth of understanding was limited
and there were gaps in knowledge.

Marginal (Rating: 2). The student showed limited understanding of the models and could
provide no more than vague references to similarities and differences. Some information
was clearly inaccurate. Examples were either vague, irrelevant, or not applicable.

Poor (Rating: 1). The student showed very little understanding of the models and was not
able to describe any similarities or differences.

Very poor (Rating: 0). The student showed no understanding of the models.

TABLE 9.5 Analytic Scoring Rubric (15-Point Item)

Essay Item: Compare and contrast Thurstone's model of intelligence with that presented
by Gardner. Give examples of the ways they are similar and the ways they differ.

Each area below is rated Poor (0 points), Average (1 point), Above Average (2 points), or
Excellent (3 points).

■ The student demonstrated an understanding of Thurstone's model.
■ The student demonstrated an understanding of Gardner's model.
■ The student was able to compare and contrast the models.
■ The student was able to present relevant and clear examples highlighting similarities
  and differences.
■ The response was clear, well organized, and showed a thorough understanding of the
  material.

Total Number of Points Awarded _____


Our final comment regarding scoring rubrics is that to be effective they should be used in
a consistent manner. Keep the rubric in front of you while you are scoring and apply it in a
fair, evenhanded manner.

Avoid Expectancy Effects. As you remember, expectancy effects occur when the teacher
allows irrelevant characteristics of the student to affect scoring (also referred to as the "halo
effect"). The obvious approach to minimizing expectancy effects is to score essay items in a
way that the test taker’s identity is not known. If you use test booklets, fold back the cover so
that the student’s name is hidden. If you use standard paper, we suggest that students write
their names on the back of essay sheets and that only one side of the sheet be used. The goal
is simply to keep you from being aware of the identity of the student whose paper you are
currently scoring. To prevent the student’s performance on one item from influencing scores
on subsequent items, we recommend that you start each essay item on a separate page. This
way, exceptionally good or poor performance on a previous item will not inadvertently
influence your scoring of an item.

Consider Writing Effects (e.g., Handwriting, Grammar, and Spelling). If one could
adhere strictly to the guidelines established in the scoring rubrics, writing effects would
not influence scoring unless they were considered essential. However, as we noted, even
when writing abilities are not considered essential, they tend to impact the scoring of an
item. These effects are difficult to avoid; about the best you can do is warn students early in
their academic careers that these effects exist and encourage them to develop good writing
abilities. For those
with poor cursive writing, a block letter printing style might be preferred. Because personal
computers are readily available in schools today, you might allow students to complete
essays using word processors and then print their tests. You should encourage students to
apply grammatical construction rules and to phrase sentences in a straightforward manner
that avoids awkward phrasing. To minimize spelling errors you might elect to provide dic-
tionaries to all students because this will mirror more closely the writing situation in real life
and answer critics who say essay tests should not be spelling tests. The use of word proces-
sors with spelling and grammar checkers might also help reduce these effects.

Minimize Order Effects. To minimize order effects, it is best to score the same question
for all students before proceeding to the next item. The tests should then be reordered in a
random manner before moving on to scoring the next item. For example, score item 1 for all
students; reorder the tests in a random fashion; then score essay item 2 and so forth.

Avoid Fatigue. The difficult task of grading essays is best approached as a series of one-
or two-hour sessions with adequate breaks between them. Although school schedules often
require that papers be scored in short periods of time, you should take into consideration the
effects of fatigue on scoring and try to arrange a schedule for grading that permits frequent
rest periods.

Score Essays More than Once. Whenever possible it is desirable to score essay items
at least two times. This can be accomplished either by scoring the items twice yourself or by
having a colleague score them after you have scored them. When the two scores or ratings are con-

TABLE 9.6 Guidelines for Scoring Essay Items

1. Develop a scoring rubric for each item that clearly specifies the scoring criteria.
2. Take steps to avoid knowing whose paper you are scoring.
3. Avoid allowing writing effects to influence scoring if they are not considered essential.
4. Score the same question for all students before proceeding to the next item.
5. Score the essays in one- or two-hour periods with adequate rest breaks.
6. Score each essay more than one time (or have a colleague score them once after you have
scored them).

sistent, you can be fairly confident in the score. If the two ratings are significantly different,
you should average the two scores. Table 9.6 summarizes our suggestions for scoring essay
items. Special Interest Topic 9.1 presents a brief discussion of automated essay scoring
systems that are being used in several settings.

Short-Answer Items

Short-answer items are items that require the student to supply a word, phrase, number,
or symbol.

Short-answer items are the final type of constructed-response item we will discuss in this
chapter. Short-answer items require the student to supply a word, phrase, number, or symbol
in response to a direct question. Short-answer items can also be written in an incomplete-
sentence format instead of a direct-question format (this format is sometimes referred to as
a completion item). Here are examples of both formats.

Example 13 Direct-Question Format

1. What is the membrane surrounding the nucleus called? __________
2. What is the coefficient of determination if the correlation coefficient is 0.60? __________

Example 14 Incomplete-Sentence Format

1. The membrane surrounding the nucleus is called the __________.
2. For a correlation coefficient of 0.60, the coefficient of determination is __________.
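A worked note on the quantitative item above, for teachers building an answer key: under
the usual definition of the coefficient of determination as the squared correlation coefficient,
the intended answer is

    coefficient of determination = r^2 = (0.60)^2 = 0.36

The same value is the intended answer in both the direct-question and incomplete-sentence
versions.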

Relative to essay items, short-answer items place stricter limits on the nature and
length of the response. Practically speaking, short-answer items can be viewed as a type of
restricted-response essay item. As we noted, restricted-response essay items provide more
structure and limit the form and scope of a student’s response relative to an extended-
response essay item. Short-answer items take this a step further, providing even more struc-
ture and limits on the student’s response.

SPECIAL INTEREST TOPIC 9.1


Automated Essay Scoring Systems

Myford and Cline (2002) note that even though essays are respected and desirable assessment
techniques, their application in large-scale standardized assessment programs has been limited be-
cause scoring them with human raters is usually expensive and time consuming. For example, they
note that when using human raters, students may take a standardized essay test at the end of an
academic year and not receive the score reports until the following year. The advent of automated
essay-scoring systems holds promise for helping resolve these problems. By using an automated
scoring system, testing companies can greatly reduce expense and the turnaround time. As a re-
sult, educators and students can receive feedback in a fraction of the time. Although such systems
have been around since at least the 1960s, they have become more readily available in recent
years. Myford and Cline (2002) note that these automated systems generally evaluate essays on
the basis of either content (i.e., subject matter) or style (i.e., linguistic style). Contemporary essay
scoring systems also provide constructive feedback to students in addition to an overall score.
For example, a report might indicate that the student is relying too much on simple sentences or
a limited vocabulary (Manzo, 2003).
In addition to being more cost- and time-efficient, these automated scoring systems have the
potential for increasing the reliability of essay scores and the validity of their interpretation. For
example, the correlation between grades assigned by a human and an automated scoring program is
essentially the same as that between two human graders. However, in contrast to humans, computers
never have a bad day, are never tired or distracted, and assign the same grade to the same essay every
time (Viadero & Drummond, 1998).
In addition to expediting scoring of large-scale assessment, these automated essay scoring
programs have recently found application in the classroom. Manzo (2003) gave the example of a
middle school language arts teacher who regularly assigns essay assignments. She has more than 180
students and in the past would spend up to 60 hours grading a single assignment. She is currently
using an automated online scoring system that facilitates her grading. She has the program initially
score the essays, then she reviews the program’s evaluation and adds her own comments. In other
words, she is not relying exclusively on the automated scoring system, but using it to supplement
and enhance her personal grading. She indicates that the students can receive almost instantaneous
feedback on their essays and typically allows the students to revise their daily assignments as many
times as they desire, which has an added instructional benefit.
These programs are receiving more and more acceptance in both classrooms and large-scale
standardized assessment programs. Here are examples of some popular programs and Web sites at
which you can access information about them:

■ e-rater, www.ets.org/research/erater.htm
■ Intelligent Essay Assessor, www.knowledgetechnologies.com
■ IntelliMetric, www.intellimetric.com
■ Bayesian Essay Scoring System, http://ericae.net/betsy

Guidelines for Developing Short-Answer Items


Structure the Item So That the Response Is as Short as Possible. As the name implies,
you should write short-answer items so that they require a short answer. This makes scoring
easier, less time consuming, and more reliable.

Make Sure There Is Only One Correct Response. In addition to brevity, it is important
that there only be one correct response. This is more difficult than you might imagine. When
writing a short-answer item, ask yourself if the student can interpret it in more than one way.
Consider this example:

John Adams was born in __________.

The correct response could be "Massachusetts." Or it could be "Braintree" (now Quincy) or
even the "United States of America." It could also be "1735" or even "the eighteenth cen-
tury." All of these would be correct! This highlights the need for specificity when writing
short-answer items. A much better item would be:

John Adams was born in what city and state?

Use the Direct-Question Format in Preference to the Incomplete-Sentence Format.


There is usually less chance of student confusion when the item is presented in the direct-
question format. This is particularly true when writing tests for young students, but even
secondary students may find direct questions more understandable than incomplete sen-
tences. Most experts recommend using only the incomplete-sentence format when it results
in a briefer item without any loss in clarity.

Have Only One Blank Space when Using the Incomplete-Sentence Format, Prefer-
ably Near the End of the Sentence. As we noted, unless incomplete-sentence items are
carefully written, they may be confusing or unclear to students. Generally the more blank
spaces an item contains, the less clear the task becomes. Therefore, we recommend that
you usually limit each incomplete sentence to one blank space. We also recommend that
the blank space be located near the end of the sentence. This arrangement tends to provide
more clarity than if the blank appears early in the sentence.

Avoid Unintentional Cues to the Answer. As with selected-response items, you should
avoid including any inadvertent clues that might alert an uninformed student to the correct re-
sponse. For example, provide blanks of the same length for all short-answer items (both direct
questions and incomplete sentences). This way you avoid giving cues about the relative length
of different answers. Also be careful about grammatical cues. The use of the article a indicates
an answer beginning with a consonant instead of a vowel. An observant student relying on cues
will detect this and it may help him or her narrow down potential responses. This can be cor-
rected by using a(n) to accommodate answers that begin with either consonants or vowels.

Make Sure the Blanks Provide Adequate Space for the Student’s Response. A previ-
ous guideline noted that all blanks should be the same length to avoid unintentional cues

to the correct answer. You should also make sure that each blank provides adequate space
for the student to write the response. As a result, you should determine how much space is
necessary for providing the longest response in a series of short-answer items, and use that
length for all other items.

Indicate the Degree of Precision Expected in Questions Requiring Quantitative Answers.
For example, if you want the answer stated in inches, specify that. If you want
all fractions reduced to their lowest terms or all numerical answers rounded to the second
decimal point, specify these expectations.
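As an illustration (a hypothetical item, not one drawn from the text), a short-answer item
that builds the expected precision into the stem might read:

    Rounded to two decimal places, the value of 2/3 is __________. (Intended answer: 0.67)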

Avoid Lifting Sentences Directly Out of the Textbook and Converting Them into
Short-Answer Items. Sentences taken directly from textbooks often produce ambiguous
short-answer items. Sentences typically need to be understood in the context of surrounding
material, and when separated from that context their meaning often becomes unclear. Ad-
ditionally, if you copy sentences directly from the text, some students may rely on simple
word associations to answer the items. This may promote rote memorization rather than
developing a thorough understanding of the material (Hopkins, 1998).

Create a Scoring Rubric and Consistently Apply It. As with essay items, it is impor-
tant to create and consistently use a scoring rubric when scoring short-answer items. When
creating this rubric, take into consideration any answers besides the preferred or “best”
response that will receive full or partial credit. For example, remember this item?

John Adams was born in what city and state?

How would you score it if the student responded only “Braintree” or only “Massachusetts”?
This should be specified in the scoring rubric.
These guidelines are summarized in Table 9.7.

TABLE 9.7 Checklist for the Development of Short-Answer Items

1. Does the item require a short response?


2. Is there only one correct response?
3. Did you use an incomplete sentence only when there was no loss of
clarity relative to a direct question?
4. Do incomplete sentences contain only one blank?
5. Are blanks in incomplete sentences near the end of the sentence?
6. Have you carefully checked for unintentional cues to the answer?
7. Do the blanks provide adequate space for the answers?
8. Did you indicate the degree of precision required for quantitative answers?
9. Did you avoid lifting sentences directly from the textbook?
10. Have you created a scoring rubric for each item?

Strengths of Short-Answer Items


Like all item types, short-answer items have their own strengths and weaknesses. First, we
will address their strengths.

When recall is important, when dealing with quantitative problems, and when interpreting
graphic material, short-answer items can be extremely effective.

Short-Answer Items Require Recall, Not Just Recognition. Whereas selected-response
items require only recognition of the correct answer, short-answer items require recall. That
is, with these items students must remember the correct answer without having it provided
for them. There are instances when the recall/recognition distinction is important, and when
recall is important, short-answer items can be useful. Also, because short-answer items
require recall, blind guessing is reduced.

Short-Answer Items Are Particularly Well Suited for Quantitative Problems and Prob-
lems Requiring the Interpretation of Graphic Material. When the problem involves
mathematical computations or the interpretation of graphic material such as charts, diagrams,
or illustrations, the short-answer format can be particularly useful (e.g., Hopkins, 1998).

Because Students Can Answer More Short-Answer Items than Essay Items, They
May Allow Better Content Sampling. Because students can usually answer short-answer
items fairly quickly, you can include more short-answer items than essay items on a test. This
can result in more representative sampling of the content domain and enhanced reliability.

Short-Answer Items Are Relatively Easy to Write. We are always cautious when we say
an item type is easy to write, so we say it is relatively easy to write. Compared to multiple-
choice items, short-answer items are easier to write. Even though they are relatively easy to
write, they still need to be developed with care following the guidelines provided.

Weaknesses of Short-Answer Items


Scoring Short-Answer Items in a Reliable Manner Is Difficult. Some authors classify
short-answer items as objective items, which suggests they can be scored in an objective
manner. Actually there is considerable subjective judgment involved in scoring short-answer
items. For example, how will you score misspellings and responses that are only partially
legible? What if the misspelling is so extreme it is difficult to determine what the student is
actually attempting to spell? Also, no matter how carefully you write short-answer items,
inevitably a student will provide a response that is not what was desired or expected, but can
still be construed as at least partially correct. Although scoring well-written short-answer
items in a reliable manner is easier than scoring extended-response essay items, there is still
a significant degree of subjectivity involved and it can be a lengthy and tiresome process.

With the Exception of Quantitative Problems and the Interpretation of Graphic Material,
Short-Answer Items Are Often Limited to Assessing Fairly Simple Educational Objectives.
If you rely extensively on short-answer items, it may encourage
students to emphasize rote memorization when studying rather than developing a more
thorough understanding of the material. This can be countered by not relying exclusively
on short-answer items and incorporating other item types that demand higher-level cognitive
processes.

TABLE 9.8 Strengths and Weaknesses of Short-Answer Items

Strengths of Short-Answer Items
■ Short-answer items require recall, not just recognition.
■ Short-answer items are well suited for assessing quantitative problems and
  problems requiring the interpretation of graphic material.
■ Because students can answer more short-answer items than essay items,
  they may allow better content sampling.
■ Short-answer items are relatively easy to write.

Weaknesses of Short-Answer Items
■ Short-answer items are difficult to score in a reliable manner.
■ With the exception of quantitative problems and the interpretation of
  graphic material, short-answer items are often limited to assessing fairly
  simple educational objectives.

As a result of these limitations, we generally recommend that the use of short-answer
items be limited to those situations for which they are uniquely effective. When recall is im-
portant, when dealing with quantitative problems, and when interpreting graphic material,
short-answer items can be extremely effective. However, if the educational objective can be
assessed equally well with a selected-response item, it is preferable to use the selected-
response format due to potentially enhanced reliability and validity. These strengths and
weaknesses are summarized in Table 9.8.

A Final Note: Constructed-Response versus Selected-Response Items

Throughout much of the twentieth century, critics of essay items emphasized their weak-
nesses (primarily unreliable scoring and reduced content sampling) and promoted the use
of selected-response items. In recent years there has been increased criticism of selected-
response items and a call for relying more on essays and other constructed-response items.
Proponents of constructed-response tests, particularly essay items (and performance as-
sessments, discussed in the next chapter), generally claim they provide a more “authentic”
assessment of student abilities, one that more closely resembles the way abilities and knowl-
edge are demonstrated or applied in the real world. Arguments on both sides are pervasive
and often passionate. We take the position that both formats have an important role to play
in educational assessment. As we have repeated numerous times, to adequately assess the
complex array of knowledge and skills emphasized in today’s schools, teachers need to
take advantage of the full range of assessment procedures available. Due to the tendency
for selected-response items to provide reliable and valid measurement, we promote their
use when they can adequately assess the educational objectives. However, it is important

to recognize that there are educational objectives that cannot be adequately assessed using
selected-response items. In these situations you should use constructed-response items. By
being aware of the weaknesses of constructed-response items and using the guidelines for
developing and scoring them outlined in this chapter, you will be able to write items that
produce results you can have considerable confidence in. Remember that the best practice
is to select items that provide the most valid and reliable information about your students'
knowledge and skills.

Summary

In this chapter we focused on the development and use of constructed-response items. Essay
items have a long history, dating back to China over two thousand years ago. An essay item
poses a question or problem that the student responds to in a written format. Although essay
items vary in terms of the limits they place on student responses, most essay items give
students considerable freedom in developing their responses. Essay tests gained popularity
in the United States in the nineteenth century largely due to problems associated with oral
testing. Even though written essay tests addressed some of the problems associated with
oral testing, essays have their own associated problems. The most prominent weaknesses
of essay items involve difficulty scoring in a reliable manner and limited content sampling.
Both of these issues can result in reduced reliability and validity. On the positive side, essay
items are well suited for measuring many complex educational objectives and are relatively
easy to write. We provided numerous suggestions for writing and scoring essay items, but
encouraged teachers to limit the use of essay items to the measurement of educational objec-
tives that are not easily assessed using selected-response items.
The second type of constructed-response item addressed in this chapter was short-an-
swer items. Like essay items, students respond to short-answer items by providing a written
response. However, instead of having a large degree of freedom in drafting their response,
on short-answer items the student is usually required to limit the response to a single word, a
brief phrase, or a symbol/number. Similar to essay items, short-answer items are somewhat
difficult to score in a reliable manner. On the positive side, short-answer items are well suited
for measuring certain educational objectives (e.g., math computations) and are relatively easy
to write. We provided several suggestions for writing short-answer items, but nevertheless
encouraged teachers to limit their use to those situations for which they are uniquely
suited. As with essay items, short-answer items have distinct strengths, but should be used in a
judicious manner.
We ended this chapter by highlighting the classic debate between proponents of
selected-response and constructed-response formats. We believe both have a role to play in
educational assessment and that by knowing the strengths and limitations of both formats
one will be better prepared to develop and use tests in educational settings. In the next chap-
ter we will turn your attention to performance assessments and portfolios. These are special
types of constructed-response items (or tasks) that have been around for many years, but
have gained increasing popularity in schools in recent years.

KEY TERMS AND CONCEPTS

Content indeterminancy effect, p. 231
Content, style, and grammar, p. 224
Direct-question format, p. 237
Essay item, p. 224
Expectancy effects, p. 232
Extended-response items, p. 228
Fatigue effects, p. 232
Handwriting, grammar, and spelling effects, p. 232
Incomplete-sentence format, p. 237
Inter-rater inconsistency, p. 231
Intra-rater inconsistency, p. 231
Oral testing, p. 223
Order effects, p. 232
Restricted-response items, p. 228
Rubric, p. 233
Short-answer items, p. 237

RECOMMENDED READINGS

Fleming, K., Ross, M., Tollefson, N., & Green, S. (1998). Teacher's choices of test-item
formats for classes with diverse achievement levels. Journal of Educational Research, 91,
222-228. This interesting article reports that teachers tend to prefer using essay items with
high-achieving classes and more recognition items with mixed-ability or low-achieving
classes.

Gellman, E., & Berkowitz, M. (1993). Test-item type: What students prefer and why. College
Student Journal, 27, 17-26. This article reports that the most popular item types among
students are essays and multiple-choice items. Females overwhelmingly prefer essay items
whereas males show a slight preference for multiple-choice items.

Gulliksen, H. (1986). Perspective on educational measurement. Applied Psychological
Measurement, 10, 109-132. This paper presents recommendations regarding the development
of educational tests, including the development and grading of essay items.


Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™


presentation and to listen to an audio lecture about this chapter.
CHAPTER 10

Performance Assessments and Portfolios

Performance assessment is claimed to be useful for evaluating programs,
improving instruction, comparing districts, and evaluating university and job
applicants. Tomorrow's news will probably report it lowers cholesterol.
—Linn & Baker, 1992, p. 1

CHAPTER HIGHLIGHTS

What Are Performance Assessments?
Guidelines for Developing Effective Performance Assessments
Portfolios

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to

1. Define and give examples of performance assessments.
2. Explain why performance assessments have become popular in schools in recent years.
3. Describe differences in the ways educators define performance assessments and identify some
   characteristics common to most definitions.
4. Describe the principles involved with developing effective performance assessments.
5. Develop effective performance assessments for a given content area.
6. Discuss the strengths and weaknesses of performance assessments.
7. Describe the principles involved with developing effective portfolio assessments.
8. Develop effective portfolios for a given content area.
9. Discuss the strengths and weaknesses of portfolios.

In Chapter 1, we noted that one of the current trends in educational assessment is the rising
popularity of performance assessments and portfolios. Performance assessments and port-
folios are not new creations. In fact as far back as written records have been found, there is
evidence that students were evaluated with what are currently referred to as performance


assessments. However, interest in and the use of performance assessments and portfolios
in schools has increased considerably in the last decade. Although traditional paper-and-
pencil assessments, particularly multiple-choice and other selected-response formats (e.g.,
true-false, matching), have always had their critics, opposition has become much more
vocal in recent years. Opponents of traditional paper-and-pencil assessments complain that
they emphasize rote memorization and other low-level cognitive skills and largely
neglect higher-order conceptual and problem-solving skills. To make the situation worse,
critics claim that reliance on paper-and-pencil assessments may have negative effects on
what teachers teach and what students learn. They note that in the era of high-stakes as-
sessment teachers often feel compelled to teach to the test. As a result, if high-stakes tests
measure only low-level skills, teachers may teach only low-level skills.
To address these shortcomings, many educational assessment experts have promoted
the use of performance assessments and portfolios. The Standards (AERA et al., 1999) define
performance assessments as assessments that require students to complete a process or produce
a product in a context that closely resembles real-life situations. For example, a medical student
might be required to interview a mock patient, select medical tests and other assessment
procedures, arrive at a diagnosis, and develop a treatment plan. Portfolios, a specific form of
performance assessment, involve the systematic collection of a student's work products over a
specified period of time according to a specific set of guidelines. Artists, architects, writers, and
others have used portfolios to represent their work for many years, and in the last decade
portfolios have become increasingly popular in the assessment of students.

What Are Performance Assessments?

We just gave you the definition of performance assessments provided by the Standards
(AERA et al., 1999). The Joint Committee on Standards for Educational Evaluation (2003)
provides a slightly different definition.

Notice that whereas the Standards’ definition of performance assessments requires test tak-
ers to complete a task in a context or setting that closely resembles real-life situations, the
definition provided by the Joint Committee on Standards for Educational Evaluation does
not. This may alert you to the fact that not everyone agrees as to what qualifies as a per-
formance assessment. Commenting on this, Popham (1999) observed that the distinction
between performance assessments and more traditional assessments is not always clear.
For example, some educators consider practically any constructed-response assessment a

performance assessment. To them a short-answer or essay test is a type of performance
assessment. Other educators and assessment professionals set more rigorous standards for
what qualifies as a performance assessment. They hold that genuine performance assess-
ments differ from more traditional paper-and-pencil assessments in a number of important
ways, such as the following.

Performance Assessments More Closely Reflect Real-Life Settings and Applications than
Traditional Paper-and-Pencil Assessments. Possibly the most prominent factor distinguishing
between performance and more traditional assessments is the degree to which the assessment
mirrors an important real-life situation. For example, a paper-and-pencil assessment could
contain multiple-choice, short-answer, and essay items about how to diagnose and repair an
automobile engine, but a performance assessment would require that the student actually
repair a defective engine. Naturally, performance assessments do differ in terms of how
closely they mirror real-life activities. To capture these differences, some authors use different
labels to reflect how closely the assessment mirrors the real-life situation. These include the
following:

Actual Performance Assessment. An actual performance assessment takes place in the
actual setting in which the real-life activity occurs or in a simulation that re-creates the ac-
tual setting. An example of an actual performance assessment that most people have experi-
enced is the driving portion of a driver’s licensing examination. Most states have a two-part
assessment process for acquiring a driver’s license. The first part is a written exam that is
designed to determine whether the applicant has acquired the basic knowledge necessary to
successfully and safely operate a motor vehicle (e.g., state motoring laws and standards).
The second part, the actual driving examination, is designed to determine whether the ap-
plicant can actually drive an automobile in a safe and lawful manner. Typically the driving
portion of the examination is completed on the same public roads on which the applicant
will drive once licensed.

Analogue Performance Assessments. In many situations it is not possible to assess people
in real-life conditions because of the potential consequences of failure, and so analogue
performance assessments are performed. For example, nuclear power plant operators must
be recertified every few months, and the recertification process requires an assessment of
their operation of a nuclear power plant control system. Because mistakes in the real system
could be catastrophic, a simulator that is a re-creation of the control room is used to generate
problems for the operator to respond to and correct. Similarly, airline pilots attempting to
qualify on a new model aircraft are assessed in simulators that act like actual aircraft, incor-
porating complex hydraulic systems that move the simulator cabin to simulate flight.

Artificial Performance Assessment. Artificial performance assessment is considerably less
realistic than the previous categories and typically involves merely establishing conditions that
the test taker must consider when performing a task. This type of assessment is common in the
schools. Here a student may be asked to create the testing environment mentally and solve the
problem posed. For example, a student may be asked to step through the process of creating a
menu of meals for the week and purchasing food at a supermarket with a limited budget and

specific food requirements. This type of assessment assumes that the student is familiar with
the real-life setting and the elements of the problem. Clearly, the assumption is made that a
student who can solve the artificial problem can also solve an actual performance problem of
the same sort. That assumption is rarely tested and is problematic in many instances.

Performance Assessments Involve Multiple Assessment Criteria. This distinction requires
that a student's performance be evaluated on multiple criteria (Popham, 1999,
2000). Popham gives the example of a student’s ability to speak in a foreign language being
evaluated in terms of accent, syntax, and vocabulary. Instead of focusing on just one aspect
of the student’s performance, multiple criteria are evaluated.

Performance Assessments Involve Subjective Evaluation of Student Performance.


Whereas many traditional assessments can be scored in an objective manner, genuine
performance assessments involve the subjective evaluation of the student’s performance
(Popham, 1999, 2000).
A natural question is “Which approach to defining performance assessment is cor-
rect?” On one hand you have those with a very broad definition of performance assessments,
which includes essentially any assessment that involves the construction of a response; on
the other hand are those that set more rigorous standards for what qualifies as a performance
assessment. This is one of those situations in which there is really no right or wrong posi-
tion. Just be aware that different people assign different meanings to the term performance
assessment.
Some educators refer to performance assessments as authentic assessments or alternative
assessments.

To complicate the situation even more, not everyone uses the term performance assessment
to describe these procedures. Some educators use the term authentic assessment to refer to
essentially the same procedures we refer to as performance assessment. They generally prefer
the term authentic assessment because it implies that the assessment more closely mirrors
real-life situations. Person-
ally, we find this title a little pompous because it seems to imply that more traditional assess-
ments are “not authentic.” Some educators use the term alternative assessments to signify
that they are an “alternative” to traditional paper-and-pencil assessments. Some authors
argue that there are substantive differences between authentic, alternative, and performance
assessments and that the terms should not be used interchangeably (e.g., Nitko, 2001). How-
ever, from our experience, educators usually do use these terms interchangeably. We have
elected to use the term performance assessment in this text because we feel it is the most
descriptive title and has received the most widespread use and acceptance.
Now that we have provided some background information, it may be useful to illus-
trate some of the many applications of performance assessments in today’s schools. As most
educators recognize, many learning objectives simply cannot be measured using standard
paper-and-pencil tasks, and these are situations in which performance assessments excel.
Consider the following examples:

■ Laboratory classes. Students may be asked to demonstrate problem-solving skills,
  conduct an experiment, use a microscope, dissect an animal, evaluate chemical compositions,
  estimate the velocity of objects, produce a diorama, or write a lab report.
■ Mathematics classes. Students may be required to demonstrate quantitative problem-
  solving skills with problems constructed around real-life problems in areas such as
  engineering, architecture, landscaping, political polling, business finance, economics, or
  family budgeting. See Special Interest Topic 10.1 for an example of a performance
  assessment in mathematics.
■ English, foreign-language, debate classes. In classes that emphasize communication
  skills, performance assessments typically play an important role. For example, students
  may be required to give a speech; speak in a foreign language; engage in an oral debate;
  recite a poem; or write a poem, essay, or position paper.
■ Social studies classes. Students may be required to demonstrate the use of maps and
  globes, debate opposing political positions, make oral presentations, produce dioramas,
  demonstrate problem-solving skills, or write theme papers.
■ Art classes. Students typically engage in a variety of art projects that result in work
  products.
■ Music classes. Students engage in performances ranging from solo recitals to group
  productions.
■ Physical education classes. Students perform a wide variety of psychomotor activities
  such as hitting a tennis or golf ball, demonstrating different swimming strokes, executing
  a dive, playing a position in team sports, and individual training activities.

Performance assessments may be the primary approach to assessment in classes such as
art, music, physical education, theater, and shop.

This is only a partial list of the many applications for performance assessments in schools.
Consider shop classes, theater classes, home economics classes, typing/keyboarding classes,
and computer classes. Even in classes in which traditional paper-and-pencil assessments are
commonly used, performance assessments can be useful adjuncts to the more traditional
assessments. For example, in college tests and measurement classes it is beneficial to have
students select a test construct, develop a test to measure that construct, administer the test
to a sample of subjects, and complete preliminary analyses on the resulting data. Like many
performance assessments, this activity demands considerable time and effort. However, it
measures skills that are not typically assessed when relying on traditional paper-and-pencil
assessments.
We just noted that performance assessments can be very time consuming, and this ap-
plies to both the teacher and the students. Performance assessments take considerable time
for teachers to construct, for students to complete, and for teachers to score. However, not all
performance assessments make the same demands on students and teachers. It is common to
distinguish between extended-response performance assessments and restricted-response
performance assessments. Extended-response performance tasks typically are broad in
scope, measure multiple learning objectives, and are designed to closely mirror real-life situa-
tions. In contrast, restricted-response performance tasks typically measure a specific learning
objective and relative to extended-response assessments are easier to administer and score.
However, restricted-response tasks are less likely to mirror real-life situations.
SPECIAL INTEREST TOPIC 10.1
Example of a Performance Assessment in Mathematics

The issues involved in assessing mathematics problem solving are similar to those in all performance
assessments, so we will use this topic to highlight them. Almost all states have incorporated so-called
higher-order thinking skills in their curricula and assessments. In mathematics this is commonly
focused on problem solving. Common arithmetic and mathematics performance assessments in stan-
dardized tests that are developed by mathematicians and mathematics educators focus on common
problem situations and types. For example, algebra problems may be developed around landscape
architectural requirements for bedding perimeters or areas, as well as driving times and distance
problems. Students are asked in a series of questions to represent the problem, develop solutions,
select a solution, solve the problem, and write a verbal description of the solution.
Each part of a mathematics problem such as that just mentioned will be evaluated separately.
In some assessments each part is awarded points to be cumulated for the problem. How these points
are set is usually a judgment call, and there is little research on this process. Correlational analysis
with other indicators of mental processing, factor analysis, and item response theory can all provide
help in deciding how to weight parts of a test, but these are advanced statistical procedures. As
with essay testing, however, each part of a mathematics problem of this kind will be addressed in a
scoring rubric. Typically the rubric provides a set of examples illustrating performances at different
score levels. For example, a 0 to 4 system for a solution generation element would have examples of
each level from 1 to 4. Typically a 0 is reserved for no response, a 1-point response reflects a poorly
developed response that is incorrect, a 2-point response reflects a single correct but simple solution,
whereas 3 and 4 are reserved for multiple correct and increasingly well-developed solutions. An
example of a performance assessment in mathematics follows. The assessment is similar to those
used in various state and national assessments. Note that some parts require responses that are simply
multiple-choice, whereas others require construction of a response along with the procedures the
student used to produce the answer. The process or procedures employed are evaluated as well as the
answer. One of the reasons for examining the student’s construction process is that students some-
times can get the correct answer without knowing the procedure (they may conduct various arithme-
tic operations and produce a response that corresponds to an option on a multiple-choice item).

1. Why did the number of border bricks increase as gray paving blocks were added?

2. Complete the table below. Decide how many paving stones and border bricks the
fifth patio design would have.

Patio Design Number of Paving Stones Number of Border Bricks

1 8
: 2 (4) (12)

| 3 (9) (16)
| 4 (16) (20)
: 5 (25) (24)
— 3) 3223999990009) 0 95959595999.

250
CONSTRUCTED-RESPONSE MATHEMATICS ITEM, GRADES 6-8 ALGEBRA CONCEPTS
Directions: All of the questions are about the same problem shown below. Read the
problem and then answer each question in the boxes given with each question.
Sue is a landscaper who builds patios with gray paving stones and white bricks that make
the border. The number of paving stones and bricks depends on the size of the patio as
shown below:

Patio Design 1 Patio Design 2

[| Palabel
ia
ee
[|
cal

Patio Design 3 Patio Design 4

qQoooo YOUU
neeen (100
Jesse.
@ @ UO
(18 @@ O
(10 @@ @ Oo
N@@@U
qoood FAeeeer
qooocod
3. From the pattern in the table accompanying Problem 2, write a statement about how many
more border bricks will be needed as the patio design goes from 5 to 6.

4. The number of the patio design is the same as the number of rows of paving stones in the de-
sign. As a new row is added, how many border bricks are added? ANSWER


5. Notice that if you multiply the value 1 for Patio Design 1 by 4 and add 4, you get the number
of border bricks, 8. Does the same thing work for Patio Design 2? ANSWER

Now write a math statement about the number of the patio design and the number of border
bricks: You can use P for patio design and N for the number of border bricks. Thus, your state-
ment should start, N=...
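As a worked check of the pattern this item is pointing toward (using the values from the
completed table above), each new patio design adds 4 border bricks, so the intended statement
is of the form

    N = 4P + 4

For example, Patio Design 2 gives N = 4(2) + 4 = 12 and Patio Design 5 gives
N = 4(5) + 4 = 24, matching the table.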

Guidelines for Developing Effective Performance Assessments

Due to the great diversity in the types of objectives measured by performance assessments,
it is somewhat difficult to develop specific guidelines for developing effective performance
assessments. However, most experts tend to agree on some general guidelines for this process
(e.g., Gronlund, 1998; Linn & Gronlund, 2000; Nitko, 2001; Popham, 1999, 2000; Stiggins,
2001). These can be classified as suggestions for selecting appropriate performance tasks,
developing clear instructions for students, developing procedures for evaluating students’
performance, and implementing procedures to minimize rating errors. In summarizing these
guidelines, the logical place to start is with the selection of a performance task.

Selecting Appropriate Performance Tasks


The first major task in developing a performance assessment is to select an appropriate
performance task.

A performance task is an assessment activity that requires a student to produce a written or
spoken response, to engage in an activity, or to create a product (Nitko, 2001). Here are some
factors that should be considered.

Select Performance Tasks That Provide the Most Direct Assessment of the Educa-
tional Objectives You Want to Measure. One principle we have touched on several times
is that you should select assessment techniques that provide the most direct measurement
of the educational objective of interest. This applies when selecting the type of assessment
to use (e.g., selected-response, constructed-response, or performance assessment) and also
when selecting the specific task that you will employ. To this end, carefully examine the
educational objectives you are targeting and select performance tasks that capture the es-
sential features of those objectives.


Select Performance Tasks That Maximize Your Ability to Generalize the Results of
the Assessment. One of the most important considerations when selecting a perfor-
mance task is to choose one that will allow you to generalize the results to comparable
tasks. In other words, if a student can perform well on the selected task, there should be a
high probability that he or she can perform well on other tasks that involve similar skills
and knowledge.

Select Performance Tasks That Reflect Essential Skills. As a general rule, perfor-
mance assessments should be used only for assessing the most important or essential skills.
Because performance assessments require considerable time and energy to complete (for
both teachers and students), to promote efficiency use them only for assessing the really
important skills that you want to ensure your students have mastered.

Select Performance Tasks That Encompass More than One Learning Objective. Be-
cause performance assessments often require such extensive time and energy commitments,
it is highly desirable to select tasks that allow the assessment of multiple important edu-
cational objectives. Although this may not always be possible, when it is it enhances the
efficiency of the assessment process.

Select Performance Tasks That Focus Your Evaluation on the Processes and/or Prod-
ucts You Are Most Interested In. Before selecting a performance task you should deter-
mine whether you are primarily interested in assessing the process the students engage in,
the product they produce, or some combination of the two. Sometimes the answer to this
question is obvious; sometimes it is less clear. Some performance tasks do not result in a
product and in this situation it is obvious that you will focus on the process. For example,
assessment of musical performances, speeches, debates, and dance routines requires evalua-
tion of the process in real time. In contrast, when evaluating a student-developed diorama,
poem, or sculpture, the process is often less important than the end product. Assessment
experts (e.g., Nitko, 2001) recommend that you focus on the process when

■ No product is produced.
■ A specific sequence of steps or procedures is taught.
■ The specific steps or procedures are essential to success.
■ The process is clearly observable.
■ Analysis of the process can provide constructive feedback.
■ You have the time to devote to observing the students perform the task.

Focus on products is recommended when

■ An equally good product can be produced using different procedures.
■ The process is not directly observable.
■ The quality of the product can be objectively judged.

As we noted, it is possible and often desirable to evaluate both process and prod-
uct. Additionally, the emphasis on process or product may change at different stages of

instruction. Gronlund (1998) suggests that process is often more important early in the
learning process, but after the procedural steps have been mastered, the product assumes
primary importance. For example, in painting the teacher’s focus may be on procedure
and technique in the early stages of instruction and then shift to the quality of the finished
painting in later stages of instruction. When the process has been adequately mastered, it
may be preferable to focus your evaluation on the product because it can usually be evalu-
ated in a more objective manner, at a time convenient to the teacher, and if necessary the
scoring can be verified.

Select Performance Tasks That Provide the Desired Degree of Realism. This in-
volves considering how closely your task needs to mirror real-life applications. This is
along the lines of the distinction between actual, analogue, and artificial performance as-
sessments. This distinction can be conceptualized as a continuum, with actual performance
tasks being the most realistic and artificial performance tasks the least realistic. Although
it may not be possible to conduct actual or even analogue performance assessments in the
classroom, considerable variability in the degree of realism can be found in artificial perfor-
mance assessments. Gronlund (1998) identifies four factors to consider when determining
how realistic your performance assessment should be:

■ The nature of the educational objective being measured. Does the objective require a high, medium, or low level of realism?
■ The sequential nature of instruction. Often in instruction the mastery of skills that do not require a high level of realism to assess can and should precede the mastery of skills that demand a high level of realism in assessment. For example, in teaching the use of power tools in a shop class it would be responsible to teach fundamental safety rules (which may be measured using paper-and-pencil assessments) before proceeding to hands-on tasks involving the actual use of power tools.
■ Practical constraints. Consider factors such as time requirements, expenses, mandatory equipment, and so forth. As a general rule, the more realistic the task, the greater the demands in terms of time and equipment.
■ The nature of the task. Some tasks by their very nature preclude actual performance assessment. Remember our example regarding the recertification of nuclear power plant operators. In this context, mistakes in the real system could be disastrous, so a simulator that re-creates the control room is used for assessment purposes.

Select Performance Tasks That Measure Skills That Are “Teachable.” That is, make
sure your performance assessment is measuring a skill that is acquired through direct in-
struction and not one that reflects innate ability. Ask yourself, “Can the students become
more proficient on this task as a result of instruction?” Popham (1999) notes that when
evaluation criteria focus on “teachable skills” it strengthens the relationship between in-
struction and assessment, making both more meaningful.

Select Performance Tasks That Are Fair to All Students. Choose tasks that are fair to
all students regardless of gender, ethnicity, or socioeconomic status.

Select Performance Tasks That Can Be Assessed Given the Time and Resources Avail-
able. Consider the practicality of a performance task. For example, can the assessment
realistically be completed when considering the expense, time, space, and equipment re-
quired? Consider factors such as class size; what might be practical in a small class of ten
students might not be practical in a class of 30 students. From our experience it is common
for teachers to underestimate the time students require to complete a project or activity. This
is because the teacher is an expert on the task and can see the direct, easy means to comple-
tion. In contrast students can be expected to flounder to some degree. Not allowing sufficient
time to complete the tasks can result in student failure and a sense that the assessment was
not fair. To some extent experience is needed to determine reasonable times and deadlines
for completion. New teachers may find it useful to consult with more experienced colleagues
for guidance in this area.

Select Performance Tasks That Can Be Scored in a Reliable Manner. Choose perfor-
mance tasks that will elicit student responses that can be measured in an objective, accurate,
and reliable manner.

Select Performance Tasks That Reflect Educational Objectives That Cannot Be Measured Using More Traditional Measures. As you will learn when we describe the
strengths and weaknesses of performance assessments, there are some significant limita-
tions associated with the use of these assessments. As a result, most assessment experts
recommend that you reserve their use to measuring educational objectives that simply can-
not be assessed using more traditional paper-and-pencil assessments. However, if you are
a strong supporter of performance assessments do not be dismayed; as we have indicated,
many educational objectives require the use of performance assessments.
Table 10.1 provides a summary of guidelines for selecting performance tasks.

TABLE 10.1 Guidelines for Selecting Performance Tasks

1. Select performance tasks that provide the most direct assessment of the educational
objectives you want to measure.
2. Select performance tasks that maximize your ability to generalize the results of the
assessment.
3. Select performance tasks that reflect essential skills.
4. Select performance tasks that encompass more than one learning objective.
5. Select performance tasks that focus your evaluation on the processes and/or products you are
most interested in.
6. Select performance tasks that provide the desired degree of realism.
7. Select performance tasks that measure skills that are “teachable.”
8. Select performance tasks that are fair to all students.
9. Select performance tasks that can be assessed given the time and resources available.
10. Select performance tasks that can be scored in a reliable manner.
11. Select performance tasks that reflect educational objectives that cannot be measured using
more traditional measures.

Developing Instructions
The second major task in developing performance assessments is to develop instructions that clearly specify what students are expected to do.

Because performance tasks often require fairly complex student responses, it is important that your instructions precisely specify the types of responses you are expecting. Because originality and creativity are seen as desirable educational outcomes, performance tasks often give students considerable freedom in how they approach the task. However, this does not mean it is appropriate for teachers to provide vague or ambiguous instructions. Few things in the classroom will create more negativity among students than confusing instructions that they feel result in a poor evaluation. It is the teacher's responsibility to write instructions clearly and precisely so that students do not need to "read the teacher's mind" (this applies to all assessments, not only performance assessments). Possibly the best way to avoid problems in this area is to have someone else (e.g., an experienced colleague) read and interpret the instructions before you administer the assessment to your students. Accordingly, it may be beneficial to try out the performance activity with one or two students before administering it to your whole class to ensure that the instructions are thorough and understandable. Your instructions should clearly specify the types of responses you are expecting and the criteria you will use when evaluating students' performance. Here is a list of questions that assessment professionals recommend you consider when evaluating the quality of your instructions (e.g., Nitko, 2001):

■ Do your instructions match the educational level of your students?
■ Are your instructions free of unnecessary jargon and overly technical language?
■ Do your instructions clearly specify the purpose or goal of the task?
■ Do your instructions clearly specify the type of response you expect?
■ Do your instructions specify all the important parameters of the performance task (e.g., time limits, the use of equipment or materials)?
■ Do your instructions clearly specify the criteria you will use when evaluating the student responses?
■ Will students from diverse cultural and ethnic backgrounds interpret the instructions in an accurate manner?

Table 10.2 provides a summary of these guidelines for developing instructions for
your performance assessments.

Developing Procedures for Evaluating Responses

The third major step in developing performance assessments is to develop procedures for evaluating the students' responses.

Whether you are evaluating process, product, or a combination of the two, it is imperative that you develop systematic, objective, and reliable procedures for evaluating student responses. Performance assessments are essentially constructed-response assessments, and as such share many of the scoring problems associated with essays we discussed in Chapter 9. The scoring procedures applied to performance assessments are often referred to as scoring rubrics, which we initially introduced when discussing essay items in the preceding chapter.


TABLE 10.2 Guidelines for Developing Instructions for Performance Assessments

1. Make sure that your instructions clearly specify the types of responses you are expecting.
2. Make sure that your instructions specify any important parameters of the performance task
(e.g., time limits, the use of equipment or materials).
3. Make sure that your instructions clearly specify the criteria you will use when evaluating the
students’ responses.
4. Have a colleague read and interpret the instructions before you administer the assessment to
your students.
5. Try out the performance activity with one or a limited number of students before administering
it to your whole class to ensure that the instructions are thorough and understandable.
6. Write instructions that students from diverse cultural and ethnic backgrounds will interpret in
an accurate manner.

A rubric is simply a written guide that helps you score constructed-
response assessments. In discussing the development of scoring rubrics for performance
assessments, Popham (1999) identified three essential tasks that need to be completed, dis-
cussed in the following paragraphs.

Select Important Criteria That Will Be Considered When Evaluating Student Re-
sponses. Start by selecting the criteria or response characteristics that you will employ
when judging the quality of a student’s response. We recommend that you give careful
consideration to the selection of these characteristics because this is probably the most
important step in developing good scoring procedures. Limit it to three or four of the most
important response characteristics to keep the evaluation process from becoming unman-
ageable. The criteria you are considering when judging the quality of a student’s response
should be described in a precise manner so there is no confusion about what the rating re-
fers to. It is also highly desirable to select criteria that can be directly observed and judged.
Characteristics such as interest, attitude, and effort are not directly observable and do not
make good bases for evaluation.

Specify Explicit Standards That Describe Different Levels of Performance. For each
criterion you want to evaluate, you should develop clearly stated standards that distinguish
among levels of performance. In other words, your standards should spell out what a stu-
dent’s response must encompass or look like to be regarded as excellent, average, or infe-
rior. It is often helpful to provide behavioral descriptions and/or specimens or examples to
illustrate the different levels of performance.

Determine What Type of Scoring Procedure You Will Use. Scoring rubrics can be classified as either holistic or analytic. With analytic scoring rubrics the teacher awards credit on a criterion-by-criterion basis, whereas with holistic rubrics the teacher assigns a single score reflecting the overall quality of the student's response. Analytic scoring rubrics have the advantage of providing specific feedback to
students regarding the strengths and weaknesses of their response. This informs students
which aspects of their responses were adequate and which need improvement. The major
limitation of analytic rubrics is that they can take considerable time to complete. Holistic
rubrics are often less detailed than analytic rubrics and as a result are easier to develop and
complete. Their major disadvantage is that they do not provide specific feedback to students
about the strengths and weaknesses of their responses. Tables 9.4 and 9.5 in Chapter 9 pro-
vide examples of holistic and analytic scoring rubrics.
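To make the analytic/holistic distinction concrete, here is a minimal Python sketch of how the two types of rubric scores might be recorded; it is our own illustration, and the criteria, point ranges, and scores are hypothetical rather than drawn from the text:

```python
# Hypothetical analytic rubric: each criterion is rated separately on a 1-4 scale.
analytic_scores = {
    "organization": 3,
    "accuracy of content": 4,
    "use of sources": 2,
    "mechanics": 3,
}

analytic_total = sum(analytic_scores.values())        # credit awarded criterion by criterion
for criterion, points in analytic_scores.items():     # per-criterion feedback for the student
    print(f"{criterion}: {points}/4")
print(f"Analytic total: {analytic_total}/16")

# Hypothetical holistic rubric: one overall judgment of the same response on a 1-4 scale.
holistic_score = 3
print(f"Holistic score: {holistic_score}/4")
```

The analytic version preserves the criterion-by-criterion feedback described above, while the holistic version is faster to assign but tells the student only how the response was judged overall.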
Rating scales specify the quality of performance or frequency of a behavior.

Linn and Gronlund (2000) identify rating scales and checklists as popular alternatives to the traditional scoring rubrics. Noting that the distinction between rating scales and traditional rubrics is often subtle, they find that rating scales typically use quality judgments (e.g., outstanding, good, average, marginal, poor) to indicate performance on each criterion as opposed to the more elaborate descriptive standards common on scoring rubrics. In place of quality judgments, some rating scales indicate frequency judgments (e.g., always, often, sometimes, seldom, never). Table 10.3 provides an example of a rating scale using verbal descriptions.
A number of different types of rating scales are commonly used in scoring perfor-
mance assessments. On some rating scales the verbal descriptions are replaced with numbers
to facilitate scoring. Table 10.4 provides an example of a numerical rating scale. Another
variation, referred to as a graphic rating scale, uses a horizontal line with ratings positioned
along the lines. Table 10.5 provides an example of a graphic rating scale. A final popular
type of rating scale combines the graphic format with brief descriptive phrases as anchor
points. This is typically referred to as a descriptive graphic scale. Linn and Gronlund (2000)
suggest that this type of rating scale has a number of advantages that support its use with
performance assessments. It communicates more information to the students regarding their
performance and it helps teachers rate their students’ performance with greater objectivity
and accuracy. Table 10.6 provides an example of a descriptive graphic rating scale. When
developing rating scales it is usually desirable to have between three and seven rating points.
For example, at a minimum you would want your rating scale to include ratings of poor,
average, and excellent. Most experts suggest that including more than seven positions is not
useful because raters usually cannot make finer discriminations than this.

TABLE 10.3 Example of a Rating Scale Using Verbal Descriptions

Directions: Indicate the student’s ability to successfully perform the specified activity by
circling the appropriate descriptor.

1. Rate the student’s ability to serve the ball.


Poor Marginal Average Good Excellent

2. Rate the student’s ability to strike the tennis ball using the forehand stroke.
Poor Marginal Average Good Excellent
3. Rate the student’s ability to strike the tennis ball using the backhand stroke.
Poor Marginal Average Good Excellent
TABLE 10.4 Example of a Numerical Rating Scale

Directions: Indicate the student's ability to successfully perform the specified activity by circling the appropriate number. On this scale, the numbers represent the following evaluations: 1 = Poor, 2 = Marginal, 3 = Average, 4 = Good, and 5 = Excellent.

1. Rate the student's ability to serve the ball.
   1    2    3    4    5
2. Rate the student's ability to strike the tennis ball using the forehand stroke.
   1    2    3    4    5
3. Rate the student's ability to strike the tennis ball using the backhand stroke.
   1    2    3    4    5

TABLE 10.5 Example of a Graphic Rating Scale

Directions: Indicate the student's ability to successfully perform the specified activity by marking an X anywhere along the horizontal line below each item.

1. Rate the student's ability to serve the ball.
   Poor ---------------- Average ---------------- Excellent
2. Rate the student's ability to strike the tennis ball using the forehand stroke.
   Poor ---------------- Average ---------------- Excellent
3. Rate the student's ability to strike the tennis ball using the backhand stroke.
   Poor ---------------- Average ---------------- Excellent

TABLE 10.6 Example of a Descriptive Graphic Rating Scale

Directions: Indicate the student's ability to successfully perform the specified activity by marking an X anywhere along the horizontal line below each item.

1. Rate the student's ability to serve the ball.
   Form is poor and accuracy is poor --- Form and accuracy usually within the average range --- Form and accuracy are consistently superior
2. Rate the student's ability to strike the tennis ball using the forehand stroke.
   Form is poor and accuracy is poor --- Form and accuracy usually within the average range --- Form and accuracy are consistently superior
3. Rate the student's ability to strike the tennis ball using the backhand stroke.
   Form is poor and accuracy is poor --- Form and accuracy usually within the average range --- Form and accuracy are consistently superior

TABLE 10.7 Example of a Checklist Used with Preschool Children

Directions: Circle Yes or No to indicate whether each skill has been demonstrated.

Self-Help Skills

Yes No Attempts to wash face and hands


Yes No Helps put toys away
Yes No Drinks from a standard cup
Yes No Eats using utensils
Yes No Attempts to use the toilet
Yes No Attempts to dress self

Language Development

Yes No Follows simple directions


Yes No Verbalizes needs and feelings
Yes No Speech can be understood most of the time
Yes No Speaks in sentences of three or more words

Basic Skills Development

Yes No Can count to ten


Yes No Recognizes numbers to ten
Can name the following shapes:

Yes No Circle
Yes No Square
Yes No Triangle
Yes No Star
Can identify the following colors:

Yes No Red
Yes No Blue
Yes No Green

Understands the following concepts:

Yes No Up and down


Yes No Big and little
Yes No Open and closed
Yes No On and off
Yes No In and out

Social Development
Yes No Plays independently
Yes No Plays parallel to other students
Yes No Plays cooperatively with other students
Yes No Participates in group activities


Checklists are another popular procedure used to score performance assessments. Checklists are similar to rating scales, but whereas rating scales note the quality of performance or the frequency of a behavior, checklists require a simple yes/no judgment. Table 10.7 provides
an example of a checklist that might be used with preschool children. Linn and Gronlund
(2000) suggest that checklists are most useful in primary education because assessment is
mostly based on observation rather than formal testing. Checklists are also particularly use-
ful for skills that can be divided into a series of behaviors.
Although there is overlap among traditional scoring rubrics (such as those used for scoring essays), rating scales, and checklists, there are differences that may make one preferable for your performance assessment. Consider which format is most likely to produce the most reliable and valid results and which will provide the most useful feedback to the students.

Implementing Procedures to Minimize Errors in Rating


The final major step in developing performance assessments is to implement procedures to minimize errors in rating.

When discussing the scoring of essay items in the preceding chapter, we noted that a multitude of factors could introduce error into the scoring process. Similar factors need to be considered when scoring performance assessments. Common sources of error when teachers rate the performance of students include the following.
■ Halo effect. We introduced you to the concept of expectancy effects when discussing the scoring of essay items, noting that these effects come into play when the teacher scoring the test allows irrelevant characteristics of the student to influence scoring. In the context of ratings this phenomenon is often referred to as the halo effect. The halo effect is the tendency for a teacher's general impression of a student to influence the ratings assigned to that student's specific performance. In other words, if students impressed a teacher with their punctuality and good manners, the teacher might tend to rate them more favorably when scoring performance assessments. Obviously this is to be avoided because it undermines the validity of the results.

■ Leniency, severity, and central tendency errors. Leniency errors occur because some teachers tend to give all students good ratings, whereas severity errors occur because some teachers tend to give all students poor ratings. Central tendency errors occur because some teachers tend to give all students scores in the middle range (e.g., indicating average performance). Leniency, severity, and central tendency errors all reduce the range of scores and make scores less reliable.

■ Personal biases. Personal biases may corrupt ratings if teachers have a tendency to let stereotypes influence their ratings of students' performance.

■ Logical errors. Logical errors occur when a teacher assumes that two characteristics are related and tends to give similar ratings based on this assumption (Nitko, 2001). An example of a logical error would be teachers assuming that all students with high aptitude scores should do well in all academic areas, and letting this belief influence their ratings.

■ Order effects. Order effects are changes in scoring that emerge during the grading process. These effects are often referred to as rater drift or reliability decay. Nitko (2001) notes that when teachers start using a scoring rubric they often adhere to it closely and apply it consistently, but over time there is a tendency for them to adhere to it less closely, and as a result the reliability of their ratings decreases or decays.

Obviously these sources of errors can undermine the reliability of scores and the validity
of their interpretations. Therefore, it is important for teachers to take steps to minimize the
influence of factors that threaten the accuracy of ratings. Here are some suggestions for
improving the reliability and accuracy of teacher ratings that are based on our own experi-
ences and the recommendations of other authors (e.g., Linn & Gronlund, 2000; Nitko, 2001;
Popham, 1999, 2000).

Before Administering the Assessment, Have One or More Trusted Colleagues Eval-
uate Your Scoring Rubric. If you have other teachers who are familiar with the per-
formance area review and critique your scoring rubric, they may be able to identify any
limitations before you start the assessment.

When Possible, Rate Performances without Knowing the Student’s Identity. This
corresponds with the recommendation we made with regard to grading essay items. Anony-
mous scoring reduces the chance that ratings will be influenced by halo effects, personal
biases, or logical errors.

Rate the Performance of Every Student on One Task before Proceeding to the Next
Task. It is easier to apply the scoring criteria uniformly when you score one task for every
student before proceeding to the next task. That is, score task number one for every student
before proceeding to the second task. Whenever possible, you should also randomly reorder
the students or their projects before moving on to the next task. This will help minimize
order effects.

Be Sensitive to the Presence of Leniency, Severity, or Central Tendency Errors. As you are rating the tasks, keep a tally of how often you use each point on the rating scale. If it becomes apparent that there is little variability in your ratings (all very high, all very low, or all in the middle), you may need to modify your rating practice to more accurately reflect differences in your students' performance.

Conduct a Preliminary Reliability Analysis to Determine Whether Your Ratings Have Acceptable Reliability. For example, rescore a subset of the assessments, or even the entire set, to determine consistency in ratings. Special Interest Topic 10.2 provides a discussion of reliability issues in performance assessments and illustrates some approaches to estimating the reliability of ratings.
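As an illustration of such a preliminary check, the following minimal Python sketch compares an original scoring with a rescoring of the same responses and reports the percent of exact agreement; it is our own example, and the ratings shown are hypothetical:

```python
# Minimal sketch: percent exact agreement between an original scoring and a rescoring.
# The ratings below are hypothetical; substitute your own two passes over the same work.

first_pass  = [4, 3, 2, 4, 1, 3, 3, 2, 4, 2]   # scores from the initial rating
second_pass = [4, 3, 3, 4, 1, 3, 2, 2, 4, 2]   # scores from rescoring the same responses

matches = sum(1 for a, b in zip(first_pass, second_pass) if a == b)
percent_agreement = matches / len(first_pass)

print(f"Exact agreement: {percent_agreement:.0%}")   # 80% for these hypothetical data
```

Keep in mind that, as Special Interest Topic 10.2 explains, simple percent agreement overstates consistency because some agreement occurs by chance.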

Have More than One Teacher Rate Each Student’s Performance. The combined
ratings of several teachers will typically yield more reliable scores than the ratings of only
one teacher. This is particularly important when the results of an assessment will have sig-
nificant consequences for the students.


SPECIAL INTEREST TOPIC 10.2

Reliability Issues in Performance Assessments

Reliable scoring of performance assessments is hard to achieve. When estimating the reliability of
performance assessments used in standardized assessment programs, multiple readers (also termed
raters) are typically given the same student response to read and score. Because it is cost prohibitive
for readers to read and score all responses in high-stakes testing, the responses are typically assigned
randomly to pairs of readers. For each response, the scores given by the two readers are compared. If
they are identical, the essay is given the common score. If they differ meaningfully, such as one score
represents a passing performance and the other failing, a procedure to decide on the score given is
invoked. Sometimes this involves a third reader or, if the scores are close, an average score is given.
For tasks with multiple parts the same two readers evaluate the entire problem. It is important to note
that reliability of the parts will be much lower than reliability of the entire problem. Summing the
scores of the various parts of the problem will produce a more reliable score than that for any one
part, so it is important to decide at what level score reliability is to be assessed. It will be much more
costly to require high reliability for scoring each part than for the entire problem.
Prior to reading the responses the readers must be given training. This training usually includes
review of the rubrics, practice with samples of responses, and repeated scoring until the readers reach
some criterion performance themselves. Readers may be required to agree with a predetermined score
value. For example, readers may be expected to reach a criterion agreement of 70% or 80% with the
predetermined score value over a set of responses. This agreement means that a reader achieves 70%
agreement with the assigned score (the assigned scores were established by experts). Statistically, this
agreement itself is dependent on the number of responses a reader is given. That is, if a reader scores
ten responses and obtains an exact match in scoring seven of them, the reader may have achieved the
required reliability for scoring. However, from a purely statistical perspective they may have a percent
agreement as low as 0.41. This is based on statistical probability for a percentage. It is calculated from
the equation for the standard deviation of a percent:

s proportion = sqrt[(0.70 x 0.30) / 10] = 0.145

From the distribution of normal scores, plus or minus 1.96 standard deviations will capture 95% of
the values of the percent a reader would obtain over repeated scoring of 10 responses. This is about
two standard deviations, so that subtracting 2 x 0.145 from 0.70 leaves a percent as low as 0.41.
Clearly, ten responses is a poor sample to decide whether a reader has a true percent agreement
score of 0.70. If the number of essays is increased to 100, the standard error becomes 0.046, and
the lower bound of the interval around 0.70 becomes 0.70 minus 2 x 0.046, or about 0.60. Reading
100 responses is very time consuming and costly, increasing the cost of the entire process greatly.
If we want a true minimal agreement value of 0.70, using this process, we would require the ac-
tual agreement for readers to be well above 0.80 for 100 responses, and above 0.95 for only ten.
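The interval arithmetic just described can be reproduced in a few lines. This Python sketch (our addition) computes the standard error of an observed percent agreement of 0.70 and its approximate 95% lower bound for samples of 10 and 100 responses:

```python
import math

def lower_bound(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% lower bound for an observed proportion p based on n responses."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se

for n in (10, 100):
    se = math.sqrt(0.70 * 0.30 / n)
    print(n, round(se, 3), round(lower_bound(0.70, n), 2))
# n = 10:  SE ~ 0.145, lower bound ~ 0.42
# n = 100: SE ~ 0.046, lower bound ~ 0.61
```

The small differences from the values in the text (0.41 and about 0.60) reflect the text's rounding of 1.96 standard deviations to 2.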
Unfortunately, it is not generally reported for high-stakes testing how many responses are scored
to achieve the state-mandated level of reliability. Most of these assessments prescribe the minimal
level of agreement, but in some cases the agreement is between readers, not necessarily with an
expert. This produces yet another source of unreliability, the degree of inter-rater agreement. The
more common approach to inter-rater agreement is to construct the percent agreement across a set
of responses and then correct it for the chance agreement that could occur even if there were no real
agreement. That is, suppose that for practical purposes in a 0-4 system, the score of 0 represents
a blank sheet—that is, the student did not respond. These are easily separated from the rest of the
responses but have no real contribution to meaningful agreement so they are not considered. This
leaves a 1-4 range for scoring. It is easy to show that if scores were distributed equally across the
range, a random pairing of scores would produce 25% agreement. If the score distribution for the
population of students is actually normally distributed, centered at 2.5, for example, the chance
agreement is quite a bit greater, as high as almost 80%, depending on the assumptions about the
performance underlying the testing and scoring. In any case, the apparent agreement among rat-
ers must be accounted for. One solution is to calculate Cohen’s kappa, an agreement measure that
subtracts the chance agreement from the observed agreement.

Cohen's kappa = (p agreement − p chance agreement) / (1 − p chance agreement)

The calculations illustrated next are for hypothetical data for two raters. The numbers in the
diagonal of this table reflect agreement between the two raters. For example, if you look in the top
left corner you see that there were two cases in which both raters assigned a score of one. Moving
to the next column where ratings of two coincide, you see that there were three cases in which both
raters assigned a score of two. However, there was one case in which Rater 1 assigned a one and
Rater 2 a two. Likewise, there was one case in which Rater 1 assigned a three and Rater 2 a two.
Note that we would need a much higher observed agreement to ensure a minimal 70% agreement
beyond chance. Because the classical estimate of reliability is always based on excluding error such
as chance, Cohen’s kappa is theoretically closer to the commonly agreed-on concept of reliability.
Unfortunately, there is little evidence that high-stakes assessments employ this method. Again, the
cost of ensuring this level of true agreement becomes quite high.

Computation of Cohen’s Kappa for Two Raters

                        Rater 2 Scores
                   1      2      3      4     Percent

           1       2      1      0      0       30%
Rater 1    2       0      3      0      0       30%
Scores     3       0      1      1      0       20%
           4       0      1      0      1       20%

Percent           20%    60%    10%    10%     100%

Calculation of Cohen’s kappa


Kappa = (p agreement − p chance agreement) / (1 − p chance agreement)

p agreement = 70%

p chance agreement = (30% x 20%) + (30% x 60%) + (20% x 10%) + (20% x 10%)
                   = 6% + 18% + 2% + 2%
                   = 28%

Cohen's kappa = (70% − 28%) / (1 − 28%)
              = 0.42 / 0.72
              = 0.583
Notice that this value does not depend on the number of essays scored. As with all statistical meth-
ods, the standard deviation of this value, however, depends on the number of scores. The standard
error of kappa is equal to

SE(kappa) = 0.3945

A number of Web sites produce this computation. Simply search with the term Cohen’s kappa using
an Internet search engine to find one of these sites. For example, a site at Vassar University produced
the computations just used (Lowry, 2003).
For 20 essays, doubling the number of cases in the table, the estimate remains the same,
but the standard error is reduced to 0.2789. For 80 responses, multiplying the numbers in the
table by 8, the standard error is 0.1395. Thus, even with 80 responses, given a 70% observed agree-
ment among two raters, the actual Cohen's kappa chance-corrected agreement is as low as 0.42
(approximately 0.70 − 1.96 x 0.1395).
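For readers who want to reproduce the kappa computation, here is a minimal Python sketch (our addition). The cell counts follow the hypothetical two-rater table above; the off-diagonal placements not stated explicitly in the narrative are inferred from the marginal percentages:

```python
# Hypothetical 10-response table from Special Interest Topic 10.2:
# rows = Rater 1 scores (1-4), columns = Rater 2 scores (1-4).
table = [
    [2, 1, 0, 0],
    [0, 3, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

p_observed = sum(table[i][i] for i in range(len(table))) / n
p_chance = sum((r / n) * (c / n) for r, c in zip(row_totals, col_totals))
kappa = (p_observed - p_chance) / (1 - p_chance)

print(round(p_observed, 2), round(p_chance, 2), round(kappa, 3))   # 0.7, 0.28, 0.583
```

If you prefer a library routine, scikit-learn's cohen_kappa_score accepts two lists of paired ratings and performs the same chance correction.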

TABLE 10.8 Guidelines for Developing and Implementing Scoring Procedures

1. Ensure that the criteria you are evaluating are clearly specified and directly observable.
2. Ensure that the standards clearly distinguish among levels of performance.
3. Select the type of scoring procedure that is most appropriate.
4. Have one or more trusted colleagues evaluate your scoring rubric.
5. When possible, rate performances without knowing the student's identity.
6. Rate the performance of every student on one task before proceeding to the next task.
7. Be sensitive to the presence of leniency, severity, or central tendency errors.
8. Conduct a preliminary reliability analysis to determine whether your ratings have acceptable reliability.
9. Have more than one teacher rate each student's performance.

Table 10.8 provides a summary of these guidelines for developing and implementing
procedures for scoring your performance assessments. Special Interest Topic 10.3 presents
a discussion of the problems some states have experienced incorporating performance as-
sessments into their high-stakes assessment programs.
Much has been written about the strengths and weaknesses of performance assessments, and this is a good time to examine them in more detail.

SPECIAL INTEREST TOPIC 10.3


Performance Assessments in High-Stakes Testing

In our earlier edition we documented the experiences of states such as Kentucky and Ver-
mont in implementing versions of performance assessments as part of their statewide assess-
ment programs. Most of the evaluations of these experiences have been negative in terms
of limited reliability, validity, and benefit relative to the cost. Since the implementation of
The No Child Left Behind Act (NCLB) in 2001, there has been almost no statewide movement to-
ward employing performance assessments. Even those states that began experimenting with versions
of performance assessment have rejected them as too expensive for too little additional validity over
paper-and-pencil or computer-based testing.
This does not mean that performance assessment does not exist in education. It has been
adopted in postsecondary education in such areas as medical, veterinary, and management educa-
tion. In these cases there appear to be sufficient resources to conduct these more expensive perfor-
mance assessments. For example, in medical schools there are relatively few students per course
of study, and with a much greater expenditure per student, performance assessments are relatively
cheaper to conduct.

Strengths of Performance Assessments


Performance Assessments Can Measure Abilities That Are Not Assessable Using
Other Assessments. Possibly the greatest strength of performance assessments is that
they can measure abilities that simply cannot be measured with other types of assessments. If you want to measure abilities such as a student's ability to engage in an oral debate, paint a picture, tune an engine, or use a microscope, performance assessments fit the bill.

The Use of Performance Assessments Is Consistent with Modern Learning Theory.


Modern learning theory holds that for optimal learning to occur students need to integrate
new information with existing knowledge and be actively engaged in complex tasks that
mirror real-life applications. Many assessment experts agree that performance assessments
are consistent with the principles supported by modern learning theory.

The Use of Performance Assessments May Result in Better Instruction. Because


teachers may be motivated to teach to the test, the use of performance assessments may
help broaden instruction to cover more complex educational objectives that parallel real-
life applications.

Performance Assessments May Make Learning More Meaningful and Help Motivate Students. Performance assessments are inherently attractive to teachers and students. To many students, testing under conditions that are similar to those they will encounter in the real world is more meaningful than paper-and-pencil tests. As a result, students might be more motivated to be actively engaged in the assessment process.

Performance Assessments Allow You to Assess Process as Well as Products. Performance assessments give teachers the opportunity for evaluation of products and processes—that is, evaluating the ways students solve problems and perform tasks as well as the products they produce.

The Use of Performance Assessments Broadens Your Approach to Assessment.


Throughout this text we have emphasized the advantages of using multiple approaches
when assessing student achievement. We concur with these comments from the U.S. De-
partment of Education (1997):

A single measure or approach is unlikely to adequately measure the knowledge, skills, and
complex procedures covered by rigorous content standards. Multiple measures and ap-
proaches can be used to capitalize on the strengths of each measurement technique, enhanc-
ing the utility of the assessment system and strengthening the validity of decisions based on
assessment results. (p. 9)

Although performance assessments have a number of strengths that support their use,
there are some disadvantages. We will now summarize these disadvantages.

Weaknesses of Performance Assessments


Scoring Performance Assessments in a Reliable Manner Is Difficult. Probably the most common criticism of performance assessments is that due to the inherent subjectivity in scoring they
often result in unreliable scores both across raters and across time. That is, the same rater is
likely to assign different scores if he or she scores the same performance at different times,
and two different raters are likely to assign different scores to the same performance. The
best way to minimize this tendency is to follow the guidelines for developing and imple-
menting scoring procedures we described previously (see Table 10.8).

Performance Assessments Typically Provide Limited Sampling of the Content Domain, and It Is Difficult to Make Generalizations about the Skills and Knowledge the Students Possess. Because students typically are able to respond to only a limited number of
performance tasks, there is limited sampling of the content domain. Research has shown that
performance on one task does not allow teachers to predict with much accuracy how students
will perform on other tasks that measure similar skills and abilities (e.g., Shavelson, Baxter,
& Gao, 1993). For example, if students do well on one performance task, you cannot be sure
that they have actually mastered the skills and knowledge encompassed by the task, or were
simply lucky on this one task. Likewise, if students perform poorly on one performance task,
you cannot be sure that their performance reflects inadequate skills and knowledge, because
it is possible that a misunderstanding of the task requirements undermined their performance.

Because students typically complete a limited number of tasks, your ability to generalize
with much confidence is limited. The solution to this limitation is to have students complete
multiple performance tasks in order to provide adequate domain sampling. Regretfully, due to
their time-consuming nature, this is not always possible.

Performance Assessments Are Time Consuming and Difficult to Construct, Administer, and Score. Performance assessments are not quick and easy! It takes considerable
time to develop good performance tasks and scoring procedures, to allow students time to
complete the task, and for you to adequately evaluate their performance. Regretfully, there
are no shortcuts to make them quick and easy. As Stiggins (2001) noted:

Performance assessment is complex. It requires users to prepare and conduct their assess-
ments in a thoughtful and rigorous manner. Those unwilling to invest the necessary time and
energy will place their students directly in harm’s way. (p. 186)

There Are Practical Limitations That May Restrict the Use of Performance Assessments.
In addition to high time demands, other practical limitations might restrict the use of perfor-
mance assessments. These can include factors such as space requirements and special and
potentially expensive equipment and materials necessary to simulate a real-life setting.

In summary, performance assessments have numerous strengths and they represent an important assessment option available to teachers. At the same time they have some significant limitations that should be taken into consideration. Although some educational professionals are so enamored with performance assessments that they appear blind to their limitations, most recognize these limitations and recommend that these assessments be used in an appropriately cautious manner. We recommend that you limit the use of performance assessments to the measurement of educational objectives that cannot be adequately measured using techniques that afford more objective and reliable scoring. When you do use performance assessments, be cognizant of their limitations and follow the guidelines we have provided to enhance the reliability of their scores and the validity of your inferences. Table 10.9 provides a summary of the strengths and weaknesses of performance assessments.

Portfolios

Portfolios are a specific type of performance assessment that involves the systematic collection of a student's work products over a specified period of time according to a specific set of guidelines (AERA et al., 1999). As we noted earlier, artists, photographers, writers, and others have long used portfolios to represent their work, and in the last decade portfolios have become increasingly popular in the classroom. As typically applied in schools today, portfolios may best be conceptualized as a systematic way of collecting, organizing, and evaluating examples of students' work products.

TABLE 10.9 Strengths and Weaknesses of Performance Assessments

Strengths of Performance Assessments

■ Performance assessments can measure abilities that are not assessable using other assessments.
■ The use of performance assessments is consistent with modern learning theory.
■ The use of performance assessments may result in better instruction.
■ Performance assessments may make learning more meaningful and help motivate students.
■ Performance assessments allow you to assess process as well as products.
■ The use of performance assessments broadens your approach to assessment.

Weaknesses of Performance Assessments

■ Performance assessments are notorious for producing unreliable scores.
■ With performance assessments it is difficult to make generalizations about the skills and knowledge the students possess.
■ Performance assessments are time consuming and difficult to construct, administer, and score.
■ There are practical limitations that may restrict the use of performance assessments.

As such, portfolios can conceivably serve as the basis for evaluating students' achievements and providing feedback to the students and their parents.

Guidelines for Developing Portfolio Assessments


As with all performance assessments, there is such diversity in portfolios that it is somewhat difficult to specify guidelines for their development. However, also like performance assessments, there are some general guidelines that most assessment professionals accept (AERA et al., 1999; Gronlund, 1998; Linn & Gronlund, 2000; Nitko, 2001; Popham, 1999, 2000). These are summarized next.

Decide on the Purpose of the Portfolio. The first step in developing a portfolio is to
determine the purpose or use of the portfolio. This is of foremost importance because it will
largely determine the content of your students’ portfolios. For example, you will need to
decide whether the portfolio will be used purely to enhance learning, as the basis for grades
(i.e., a scorable portfolio), or some combination of the two. If the purpose is only to enhance
learning, there is little need to ensure comparability among the entries in the portfolios.
Students can be given considerable freedom to include entries at their discretion. However,
if the portfolio is going to be used for summative evaluation and the assignment of grades,
then it is important to have standardized content across portfolios. This is necessary to pro-
mote a degree of comparability when evaluating the portfolios.

Decide on What Type of Items Will Be Placed in the Portfolio. It is also important
to determine whether the portfolios will showcase the students’ “best work,” represen-
tative products, or indicators of progress or growth. Best work portfolios contain what
the students select as their exemplary work, representative portfolios contain a broad

representative sample of the students’ work (including both exemplary and below-average
examples), and growth or learning-progress portfolios include selections that illustrate
the students’ progress over the academic period. A fourth type of portfolio referred to as
evaluation portfolios is designed to help teachers determine whether the students have met
established standards of performance. As such, they should contain products that demon-
strate the achievement of specified standards.

Decide Who Will Select the Items to Include in the Portfolio. The teacher must decide
who will be responsible for selecting the items to include in the portfolio: the teacher, the
student, or both. When selecting items the guiding principle should be to choose items that
will allow the teacher or other raters to make valid inferences about the students’ skills and
knowledge. To promote student involvement in the process, most professionals recommend
that teachers and students collaborate when selecting items to be included in the portfolio.
However, at certain times it may be necessary for the teacher to exert considerable control
over the selection of work products. For example, when it is important for scoring purposes
to ensure standardization of content, the teacher needs to closely supervise the selection of
work products.

Establish Procedures for Evaluating or Scoring the Portfolio. Student portfolios are
typically scored using scoring rubrics similar to those discussed in the context of scoring
essays and performance assessments. As described earlier, scoring rubrics should

■ Specify the evaluation criteria to be considered when evaluating the students' work products
■ Provide explicit standards that describe different levels of performance on each criterion
■ Indicate whether the criteria will be evaluated in a holistic or analytical manner

Promote Student Involvement in the Process. Actively involving students in the assess-
ment process is a goal of all performance assessments, and portfolio assessments provide
particularly good opportunities to solicit student involvement. As we suggested, students
should be involved to the greatest extent possible in selecting what items are included in their
portfolios. Accordingly, they should be involved in maintaining the portfolio and evaluating
the quality of the products it contains. Along these lines, it is highly desirable for teachers to
schedule regular student-teacher meetings to review the portfolio content and compare their
evaluations with those of the students. This enhances the students’ self-assessment skills,
helps them identify individual strengths and weaknesses, and increases their personal in-
volvement in the learning process (e.g., Gronlund, 1998).
Table 10.10 provides a summary of the guidelines for developing portfolios.
Like all assessment techniques, portfolios have their own set of strengths and weak-
nesses (Gronlund, 1998; Kubiszyn & Borich, 2003; Linn & Gronlund, 2000; Nitko, 2001;
Popham, 1999, 2000).

TABLE 10.10 Guidelines for Developing Portfolio Assessments

1. Decide on the purpose of the portfolio


a. Enhance learning, assign grades, or some combination
b. Best work, representative products, growth or learning progress, or
evaluation
2. Decide who will select the items to include in the portfolio
a. Teacher
b. Student
c. Teacher and student in collaboration
3. Establish procedures for evaluating or scoring the portfolio
a. Specify the evaluation criteria
b. Provide specific standards
c. Decide on a holistic or analytic approach
4. Promote student involvement in the process

Strengths of Portfolio Assessments


Portfolios Are Particularly Good at Reflecting Student Achievement and Growth over Time. Possibly the greatest strength of portfolios is that they are exemplary at illustrating a student's progress over an extended period of time. As a result, they can greatly facilitate communication with students and parents by providing actual examples of the student's work.

Portfolios May Help Motivate Students and Get Them More Involved in the Learn-
ing Process. Because students typically help select items for and maintain the portfolio,
evaluating their progress as they do so, they may be more motivated to become actively
involved in the learning and assessment process.

Portfolios May Enhance Students’ Ability to Evaluate Their Own Performances and
Products. Because students are typically asked to evaluate their own progress, it is ex-
pected that they will demonstrate enhanced self-assessment skills.

When Used Correctly Portfolios Can Strengthen the Relationship between Instruction and Assessment. Because portfolios often incorporate products closely linked to classroom instruction, they can help strengthen the relationship between instruction and assessment.

Portfolios Can Enhance Teachers’ Communication with Both Students and Parents.
Providing regular student-teacher and parent-teacher conferences to review the contents of
portfolios is an excellent way to enhance communication.

Weaknesses of Portfolio Assessments


Scoring Portfolios in a Reliable Manner Is Difficult. Scoring portfolios reliably is a very challenging task. In addition to the error introduced by the subjective judgment of raters and difficulty establishing specific scoring criteria, inadequate standardization of portfolio content often results in
limited comparability across students. As a result, reliability can be dismally low. For example,
Nitko (2001) notes that the reliability of portfolio results is typically in the 0.40 to 0.60 range.
If you think back on what you learned about interpreting reliability coefficients in Chapter 4,
you will recall that this indicates that as much as 60% of the variability in portfolio scores is the
result of measurement error. This should give all educators a reason to be cautious when using
the results of portfolio assessments in assigning grades or making high-stakes decisions.
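The 60 percent figure follows from the classical test theory model discussed in Chapter 4, in which the reliability coefficient estimates the proportion of observed-score variance attributable to true scores. A brief restatement of that relationship (our addition, using standard notation):

```latex
% Classical test theory: observed score variance is true score variance plus error variance.
\[
\sigma^2_X = \sigma^2_T + \sigma^2_E, \qquad
r_{XX'} = \frac{\sigma^2_T}{\sigma^2_X}
\quad\Longrightarrow\quad
\frac{\sigma^2_E}{\sigma^2_X} = 1 - r_{XX'} = 1 - 0.40 = 0.60 .
\]
```

Thus a reliability coefficient of 0.40 implies that up to 60% of the score variance is measurement error, which is the basis for the caution expressed above.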

Conducting Portfolio Assessments Properly Is a Time-Consuming and Demanding Process. Most educators agree that for portfolios to be an effective assessment technique
teachers need to be committed to the process and willing to invest the time and energy nec-
essary to make them work.

In summary, portfolios have significant strengths and weaknesses. On the positive side they provide a broad framework for examining a student's progress, encourage stu-
dent participation in the assessment process, enhance communication, and strengthen the
relationship between instruction and assessment. Clearly, these are laudable features. On
the down side they demand considerable time and energy and have questionable reliability.
Consider the comments of Hopkins (1998):

The use of portfolios has great potential for enriching education and student assessment but
should not be viewed as an alternative to traditional tests and examinations. Students still
need to demonstrate proficiency on uniform tasks designed to be a representative sample of
the objectives of a course of study. One may have wonderful tasks in a science portfolio (col-
lections of rocks, leaves, insects, experiments) but have great gaps in understanding about
major laws of physics, genetics, and so on. (p. 311)

We largely concur with Dr. Hopkins. Portfolios (and other performance assessments) hold
great potential for enriching educational assessment practices. They have considerable
strengths and when used in a judicious manner will enhance the assessment of students. At
the same time we encourage teachers to be aware of the specific strengths and weaknesses of
all assessment techniques and to factor these in when developing their own procedures for
assessing student achievement. No approach to assessment—whether it is selected-re-
sponse items, constructed-response items, performance assessments, or portfolios—should
be viewed as the only way to assess student achievement. As we have repeatedly stated, no
single assessment approach can adequately assess all of the complex skills and knowledge
taught in today’s schools. By using multiple approaches to assessment, one can capitalize
on the strengths of the different approaches in order to elicit the most useful, reliable, and
accurate information possible. Table 10.11 provides a summary of the strengths and weak-
nesses of portfolio assessments.

TABLE 10.11 Strengths and Weaknesses of Portfolio Assessments

Strengths of Portfolio Assessments


1. Portfolios are particularly good at reflecting student achievement and growth over time.
2. Portfolios may help motivate students and get them more involved in the learning process.
3. Portfolios may enhance students’ ability to evaluate their own performances and products.
4. When used correctly portfolios can strengthen the relationship between instruction and
assessment.
5. Portfolios can enhance teachers’ communication with both students and parents.

Weaknesses of Portfolio Assessments


1. Scoring portfolios in a reliable manner is difficult.
2. Conducting portfolio assessments properly is a time-consuming and demanding process.

Summary
In this chapter we focused on performance assessments and portfolios. These special types
of constructed-response tasks have been around for many years, but have gained increasing
popularity in schools in recent years. To many educators performance assessments are seen
as a positive alternative to traditional paper-and-pencil tests. Critics of traditional paper-
and-pencil tests complain that they emphasize rote memory and other low-level learning
objectives. In contrast they praise performance assessments, which they see as measuring
higher-level outcomes that mirror real-life situations.
Performance assessments require students to complete a process or produce a product
in a setting that resembles real-life situations (AERA et al., 1999). Performance assess-
ments can be used to measure a broad range of educational objectives, ranging from those
emphasizing communication skills (e.g., giving a speech, writing a term paper), to art (e.g.,
painting, sculpture), to physical education (e.g., tennis, diving, golf). Due to this diversity,
it is difficult to develop specific guidelines, but some general suggestions can facilitate the
development of performance assessments. These can be categorized as guidelines for select-
ing performance tasks, developing clear instructions, developing procedures for evaluating
students’ performance, and implementing procedures for minimizing rating errors. These
are listed next.

Selecting Appropriate Performance Tasks


Select tasks that provide the most direct measure of the educational objective.
Select tasks that maximize your ability to generalize.
Select tasks that reflect essential skills.
Select tasks that encompass more than one educational objective.
Select tasks that focus evaluation on the processes and/or products you are interested
in.
Select tasks that provide the desired degree of realism.
Select tasks that measure skills that are “teachable.”
Select tasks that are fair.
Select tasks that can be assessed given the time and resources available.

Select tasks that can be scored in a reliable manner.


Select tasks that reflect objectives that cannot be measured using traditional
assessments.

Developing Instructions That Clearly Specify What the Student Is Expected to Do


Instructions should match the educational level of the students.
Instructions should avoid jargon or unnecessary technical language.
Instructions should specify the purpose or goal of the task.
Instructions should specify the type of response you expect.
Instructions should specify all the important parameters of the task (e.g., time limits).
Instructions should specify the criteria used to evaluate the student’s responses.
Instructions should be clear to students from different backgrounds.

Developing Procedures to Evaluate the Students’ Responses


Select important criteria that will be considered when evaluating student responses.
Specify explicit standards that illustrate different levels of performance on each
criterion.
Determine what type of scoring procedure you will use (e.g., rating scale, checklist).

Implementing Procedures to Minimize Errors in Rating


Have trusted colleagues evaluate your scoring rubric.
Rate performances without knowing the student’s identity.
Rate the performance of every student on one task before proceeding to the next
task.
Be sensitive to the presence of leniency, severity, or central tendency errors.
Conduct a preliminary reliability analysis.
Have more than one teacher rate each student’s performance.

As with all types of assessment procedures, performance assessments have strengths
and weaknesses. Strengths of performance assessments include the following:

They can measure abilities not assessable using other assessments.


They are consistent with modern learning theory.
They may result in better instruction.
They may make learning more meaningful and help motivate students.
They allow you to assess process as well as products.
They can broaden your approach to assessment.

Their weaknesses include the following:


They are difficult to score in a reliable manner.
They provide limited sampling of the content domain, and it is often difficult to make
generalizations about the knowledge and skills the students possess.
They are time consuming and require considerable effort.
Portfolios are a specific type of performance assessment that involves the systematic
collection of a student’s work products over a period of time according to a specific set of
guidelines (AERA et al., 1999). Guidelines for developing and using portfolios include the
following:

Specify the purpose of the portfolio (e.g., enhance learning, grading, both?).
Decide on the type of items to be placed in the portfolio.
Specify who will select the items to include in the portfolio.
Establish procedures for evaluating the portfolios.
Promote student involvement in the process.

Strengths of portfolios include the following:

Portfolios are good at reflecting student achievement and growth over time.
Portfolios may help motivate students and get them more involved in the learning
process.
Portfolios may enhance students’ ability to evaluate their own performances and
products.
When used correctly portfolios can strengthen the relationship between instruction
and assessment.
Portfolios can enhance teachers’ communication with both students and parents.

Weaknesses of portfolios include the following:

Scoring portfolios in a reliable manner is difficult.


Conducting portfolio assessments properly is a time-consuming and demanding
process.

We concluded this chapter by noting that performance assessments have considerable
strengths and when used in a prudent manner they can enhance the assessment of students.
As a general rule, we recommend that you limit the use of performance assessments to the
measurement of educational objectives that cannot be adequately measured using tech-
niques that afford more objective and reliable scoring. At the same time we noted that there
are many such objectives. We stressed that no single approach to assessment should be
viewed as the one and only way to assess student achievement. As stated before, no single
assessment approach can adequately assess all of the complex skills and knowledge taught
in today’s schools. By using multiple approaches to assessment, one can take advantage of
the strengths of the different approaches and obtain the most useful, reliable, and accurate
information possible.

KEY TERMS AND CONCEPTS

Actual performance assessment, p. 247
Alternative assessments, p. 248
Analogue performance assessment, p. 247
Analytic scoring rubrics, p. 257
Artificial performance assessment, p. 247
Authentic assessment, p. 248
Best work portfolios, p. 269
Central tendency errors, p. 261
Checklists, p. 261
Evaluation of products and processes, p. 267
Evaluation portfolios, p. 270
Extended-response performance assessment, p. 249
Growth or learning-progress portfolios, p. 270
Halo effect, p. 261
Holistic rubrics, p. 257
Leniency errors, p. 261
Logical errors, p. 261
Order effects, p. 262
Performance assessments, p. 246
Personal biases, p. 261
Portfolios, p. 268
Rating scales, p. 258
Reliability decay, p. 262
Representative portfolios, p. 269
Restricted-response performance assessment, p. 249
Severity error, p. 261

RECOMMENDED READINGS

Feldt, L. (1997). Can validity rise when reliability declines? Applied Measurement in Education, 10, 377-387. This extremely interesting paper argues that at least in theory performance tests of achievement can be more valid than constructed-response tests even though the performance assessments have lower reliability. He notes that now the challenge is to find empirical examples of this theoretical possibility.

Rosenquist, A., Shavelson, R., & Ruiz-Primo, M. (2000). On the “exchangeability” of hands-on and computer-simulated science performance assessments (CSE Technical Report 531). Stanford University, CA: CRESST. Previous research has shown inconsistencies between scores on hands-on and computer-simulated performance assessments. This paper examines the sources of these inconsistencies.

INTERNET SITE OF INTEREST

The Web site for the National Center for Research on Evaluation Standards and Student Testing (CRESST), www.cresst.org, provides a plethora of informational resources on performance assessment and portfolios. For example, you can access previous newsletters and reports.


Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to listen to an audio lecture about this chapter.

CHAPTER 11

Assigning Grades on the Basis of Classroom Assessments

I love teaching, but I find assigning grades at the end of a semester to be a very difficult and unpleasant process!

CHAPTER HIGHLIGHTS

Feedback and Evaluation
Reporting Student Progress: Which Symbols to Use?
The Basis for Assigning Grades
Frame of Reference
Combining Grades into a Composite
Informing Students of the Grading System and Grades Received
Parent Conferences

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Define formative and summative evaluation and explain the roles they play in teaching.
2. Describe the advantages and disadvantages of assigning grades or marks.
3. Explain what is meant by formal and informal evaluation, and the advantage of formal
evaluation.
4. Describe the use of formative evaluation in summative evaluation, including when it is
desirable and when it is not desirable.
5. Describe the common symbols used for reporting student progress.
6. Explain why assessment experts recommend against mixing achievement and
nonachievement factors when assigning grades.
7. Compare and contrast the strengths and weaknesses of norm-referenced and criterion-
referenced grading procedures.
8. Explain why assessment experts recommend against using achievement relative to aptitude,
improvement, or effort as a frame of reference for assigning grades.
9. Explain and demonstrate the appropriate procedures for combining grades into a composite.
10. Explain the importance of providing students and parents information on your grading
system.
11. Explain some considerations when holding conferences with parents.


The process of teaching is a dialogue between the teacher and the student. One important
aspect of this dialogue, both oral and written, is the evaluation of student performance. This
evaluative process includes the testing procedures described in the last four chapters. Ad-
ditionally, in most schools it is mandatory for student evaluations to include the assignment
of grades or marks. In this context, marks are typically defined as cumulative grades that
reflect students’ academic progress during a specific period of instruction. In this chapter
we will be using the term score to reflect performance on a single as-
sessment procedure (e.g., test or homework assignment), and grades
and marks interchangeably to denote a cumulative evaluation of student performance (e.g.,
cumulative semester grade). In actual practice, people will often use score, grade, and mark
synonymously.
Our discussion will address a variety of issues associated with the process of assign-
ing grades or marks. First we will discuss some of the ways tests and other assessment
procedures are used in schools. This includes providing feedback to students and making
evaluative judgments regarding their progress and achievement. In this context we also
discuss the advantages and disadvantages of assigning grades. Next we discuss some of the
more practical aspects of assigning grades. For example, “What factors should be consid-
ered when assigning grades?” and “What frame of reference should be used when assigning
grades?” We then turn to a slightly technical, but hopefully practical, discussion of how to
combine scores into a composite or cumulative grade. Finally, we present some suggestions
on presenting information about grades to students and parents.

Feedback and Evaluation

We have noted that tests and other assessment procedures are used in many ways in school
settings. For example, tests can be used to provide feedback to students about their prog-
ress, to evaluate their achievement, and to assign grades. Testing applications can
be classified as either formative or summative. Formative evaluation involves evaluative
activities that are aimed at providing feedback to students. In this context, feedback implies
the communication of information concerning a student’s performance or achievement that
is intended to have a corrective effect. Formative evaluation is typically communicated
directly to the students and is designed to direct, guide, and modify their behavior. To be
useful it should indicate to the students what is being done well and what needs improve-
ment. By providing feedback in a timely manner, formative evaluation can help divide
the learning experience into manageable components and provide structure to the instruc-
tional process. Because tests can provide explicit feedback about what one has learned and
what yet needs to be mastered, they make a significant contribution to the learning process.
The students are made aware of any gaps in their knowledge and learn which study strategies
are effective and which are not. In addition to guiding learning, tests may enhance or promote
student motivation. Receiving a “good” score on a weekly test may provide positive reinforce-

ment to students, and avoiding “poor” scores can motivate students to increase their study
activities.

Summative evaluation involves the determination of the worth, value, or quality of an
outcome. In the classroom, summative evaluation typically involves the formal evaluation
of performance or progress in a course, often in the form of a numerical or letter grade or
mark. Summative evaluation is often directed to others beside the student, such as parents
and administrators. Grades are generally regarded in our society as formal recognition of a
specific level of mastery. Significant benefits of grades include the following:

Although there is variation in the way grades are reported (e.g., letter grades, numerical
grades), most people are reasonably familiar with their interpretation. Grades provide a
practical system for communicating information about student performance.

Ideally, summative evaluations provide a fair, unbiased system of comparing students
that minimizes irrelevant criteria such as socioeconomic status, gender, race, and so on.
This goal is worthy even if attainment is less than perfect. If you were to compare the stu-
dents who entered European or American universities a century ago with those of today,
you would find that privilege, wealth, gender, and race count much less today than at that
time. This is due in large part to the establishment of testing and grading systems that have
attempted to minimize these variables. This is not to suggest that the use of tests has com-
pletely eliminated bias, but that they reflect a more objective approach to evaluation that
may help reduce the influence of irrelevant factors.

Naturally, summative evaluations also have significant limitations, including the following:

Grades are only brief summary statements that fail to convey a great deal of the information
carefully collected by teachers. For example, a grade of B only hints at a general level of
mastery and tells little or nothing about specific strengths and weaknesses.

Although most people are familiar with grades and their general meaning, there is
considerable variability in the meaning of grades
across teachers, departments, and schools. For example, an A in one
class might reflect a higher degree of achievement or mastery than is
required to receive an A in another class. Some teachers and schools are more rigorous and
have higher standards than others. Most students in college have been urged by friends to
take a course being offered by a specific professor because he or she is a more lenient grader
than other professors. As a result, it is difficult to generalize about the absolute meaning of
a specific grade.
Although grades are only a brief summary of performance, competition for high
grades may become more important than mastery of content. That is, grades become the
goal in and of themselves and actual achievement becomes secondary. As a correlate of
this, students sometimes have difficulty separating their worth as individuals from their
280 CHAPTER 11

achievement in school. Particularly in an academically oriented home in which achieve-


ment is highly valued, students may misinterpret grades as an assessment of self-worth. To
ameliorate this tendency, teachers and parents should be careful to differentiate academic
performance from personal value and worth.

As you can see from this brief discussion, grades have both significant benefits and limita-
tions. Nevertheless, they are an engrained aspect of most educational systems and are more
than likely going to be with us for the foreseeable future. As a result, it behooves us to un-
derstand how to assign grades in a responsible manner that capitalizes on their strengths and
minimizes their limitations. Special Interest Topic 11.1 provides a brief history of grading
policies in universities and public schools.


SPECIAL INTEREST TOPIC 11.1


A Brief History of Grading

Brookhart (2004) provides a discussion of the history of grading in the United States. Here are a few
of the key developments she notes in this time line.

Pre 1800: Grading procedures were first developed in universities. Brookhart’s research sug-
gests that the first categorical grading scale was used at Yale in 1785 and classified students as
Optimi (i.e., best), Second Optimi (i.e., second best), Inferiores (i.e., lesser), and Pejores (i.e.,
worse). In 1813 Yale adopted a numerical scale by which students were assigned grades be-
tween 1 and 4 with decimals used to reflect intermediary levels. Some universities developed
scales with more categories (e.g., 20) whereas others tried simple pass-fail grading.
1800s: The common school movement of the 1800s saw the development of public schools
designed to provide instruction to the nation’s children. Initially these early schools adopted
grading scales similar to those in use at universities. About 1840 schools started the practice
of distributing report cards. Teachers at the time complained that assessment and grading
were too burdensome and parents complained the information was difficult to interpret.
These complaints are still with us today!
1900s: Percentage grading was common in secondary schools and universities at the begin-
ning of the twentieth century. By 1910, however, educators began to question the reliability
and accuracy of using a scale with 100 different categories or scale points. By the 1920s the
use of letter grades (A, B, C, D, and F) was becoming the most common practice. During
the remainder of the 1900s, a number of grading issues came to the forefront. For example,
educators became increasingly aware that nonachievement factors (e.g., student attitudes
and behaviors and teacher biases) were influencing the assignment of grades and recognized
that this was not a desirable situation. Additionally there was a debate regarding the merits
of norm-referenced versus criterion-referenced grading systems. Finally, efforts were made
to expand the purpose of grades so they not only served to document the students’ level of
academic achievement but also served to enhance the learning of students. As you might
expect, these are all issues that educators continue to struggle with to this day.

In some aspects we have come a long way in refining the ways we evaluate the performance
of our students. At the same time, we are still struggling with many of the same issues we struggled
with a hundred years ago.

Formal and Informal Evaluation


Summative evaluation, by its very nature, is typically formal and documented in writing.
Formative evaluation, however, is often informal. Teachers are constantly evaluating stu-
dent performance throughout the school day. Often the feedback takes the form of a comment
such as “that is outstanding,” or “that is not quite right, try again.” Although it is usually
presented in a nonthreatening manner, public evaluation can be embarrassing or humiliating
when thoughtlessly conducted by the teacher. This can
result in students’ developing negative attitudes toward both teachers
and school. These negative attitudes can be very difficult to change
and these regrettable situations should be avoided. Another potential problem with informal
evaluations is that they may not be applied in a consistent manner. For example, some stu-
dents may receive more consistent feedback than others, which may give them an advantage
on upcoming tests and assignments. Additionally, informal evaluations are rarely recorded
or documented. Although some teachers do develop daily summative ratings based on their
general impression of the students’ performance, this is not standard practice and represents
the exception rather than the rule.
Whereas formative evaluation is often informal, a formal approach is superior be-
cause it is more likely to be applied consistently and result in a written record. In fact the
development of processes of formal evaluation was largely in response to the unreliability
and invalidity of informal evaluations. Probably every teacher has perceived a student to be
doing well in class until an examination brings to light gross deficiencies in the student’s
knowledge or skills. In contrast, most of us have known students who appear disengaged
or otherwise doing poorly in class until an assessment allows them to demonstrate mastery
of the relevant knowledge and skills. Most teachers report to students their progress by as-
signing scores to tests and other assignments. These scores are formal evaluations, but they
are of limited utility as feedback unless greater detail is provided regarding specific student
strengths and weaknesses. As a result, a formal approach to formative evaluation should
always include comments reflecting which learning objectives have been mastered and
which have not. It can be very frustrating for students to receive a score of 75 on a writing
homework assignment without any indication of how they can improve their performance.
Accordingly, a midterm examination will provide little useful feedback unless the teacher
provides a thorough review of the test. A score of 75 tells the student practically nothing of
the areas in which he or she is deficient or how to perform better in the future.

The Use of Formative Evaluation in Summative Evaluation

Formative evaluations are often incorporated into a summative evaluation. For example,
teachers will often have a number of small assignments and tests that are primarily designed
to evaluate student progress and provide feedback to students, with the scores on these
assignments contributing to the final course grade. This procedure is reasonable if the
material is topical and questionable if the content
is sequential. Sequential content is material that must be learned in a particular order, such
as mathematics. For example, students learn addition and subtraction of single digits before
progressing to double digits. Topical content, on the other hand, can often be learned equally
well in various orders. Literature typically can be taught in many different ways because
the various topics can be ordered to fit many different teaching strategies. For example, in
literature a topic can be taught chronologically, such as with the typical survey of English
literature, or topically, such as with a course organized around the type of writing: essay,
poem, short story, and novel. In this last example, content within each category might itself
be organized in a topical or sequential manner.
The cumulative grade or mark in a course is typically considered a judgment of the
mastery of content or overall achievement. If certain objectives are necessary to master
later ones, it makes little sense in grading the early objectives as part of the final grade
because they must have been attained in order to progress to the later objectives. For ex-
ample, in studying algebra one must master the solution of single-variable equations before
progressing to the solution of two-variable equations. Suppose a student receives a score
of 70 on a test involving single-variable equations and a 100 on a later test involving two-
variable equations. Should the 70 and 100 be averaged? If so, the resulting grade reflects
the average mastery of objectives at selected times in the school year, not final mastery of
objectives. Should only the latest score be used? If so, what of the student who got a 100
on the first test and a 70 on the second test? These grades indicate a high degree of mastery
of the earlier objectives and less mastery of later objectives. What should be done? Our
answer is that for sequential content, summative evaluation should be based primarily on
performance at the conclusion of instruction. At that point the student will demonstrate
which objectives have been mastered and which have not. Earlier evaluations indicate only
the level of performance at the time of measurement and penalize students who take longer
to master the objectives but who eventually master them within the time allowed for mas-
tery. This is not to minimize the importance or utility of employing formative evaluation
with sequential material, only that formative evaluation may be difficult to meaningfully
incorporate into summative evaluation. With sequential material, use formative evaluations
to provide feedback, but when assigning grades emphasize the students’ achievement at
the conclusion of instruction.
Topical content, material that is related but with objectives that need not be mastered
in any particular order, can be evaluated more easily in different sections. Here formative
evaluations can easily serve as part of the summative evaluation. There is little need or rea-
son to repetitively test the objectives in each subsequent evaluation. Mixed content, in which
some objectives are sequential and others topical, should be evaluated by a combination of
the two approaches. Later in this chapter we will show you how you can “weight” different
assessments and assignments to meet your specific needs.

Reporting Student Progress: Which Symbols to Use?


The decision about how to report student achievement is often decided for teachers by
administrators at the state or district level. Letter grades (i.e., A, B, C, D, F) are the most
popular method of reporting student progress and are used in the majority of schools
and universities today. Although there might be some variation in the meaning attached to
them, letter grades are typically interpreted as

A = excellent or superior achievement
B = above-average achievement
C = average achievement
D = below-average or marginal achievement
F = failing or poor performance

Students and parents generally understand letter grades, and the evaluative judgment
represented by the grade is probably more widely accepted than any other system available.
However, as we alluded to in the previous section, letter grades do have limitations. One
significant limitation of letter grades is that they are only a summary statement and convey
relatively little useful information beyond the general or aggregate level of achievement.
Although teachers typically have considerable qualitative information about their students’
specific strengths and weaknesses, much of this information is lost when it is distilled down
to a letter grade. In some schools teachers are allowed to use pluses and minuses with letter
grades (e.g., A- or C+). This approach provides more categories for classification, but still
falls short of conveying rich qualitative information about student achievement. Special
Interest Topic 11.2 provides information on how some schools are experimenting with de-
leting “Ds” from their grading scheme.
Naturally, other grading systems are available, including the following.

Numerical Grades. Numerical grades are similar to letter grades in that they attempt to
succinctly represent student performance, here with a number instead of a letter. Numerical
grades may provide more precision than letter grades. For example, the excellent perfor-
mance represented by a grade of A may be further divided into numerical grades ranging
from the elusive and coveted 100 to a grade of 90. Nevertheless, numerical grades still only
summarize student performance and fail to capture much rich detail.

Verbal Descriptors. Another approach is to replace letter grades with verbal descrip-
tors such as excellent, above average, satisfactory, or needs improvement. Although the
number of categories varies, this approach simply replaces traditional letter grades with
verbal descriptors in an attempt to avoid any ambiguity regarding the meaning of the
mark.

Pass-Fail. Pass-fail grades and other two-category grading systems have been used for
many years. For example, some high schools and universities offer credit/no-credit grading
for selected courses (usually electives). A variant is mastery grading, which is well suited
for situations that emphasize a mastery learning approach in which all or most students are
expected to master the learning objectives and given the time necessary to do so. In situations
in which the learning objectives are clearly specified, a two-category grading system may
be appropriate, but otherwise it may convey even less information than traditional letter/
numerical grades or verbal descriptors.

SPECIAL INTEREST TOPIC 11.2


Schools No Longer Assigning Ds?

Hoff (2003) reports that some California high schools are considering deleting the letter D from their
grading systems. He notes that at one high school the English department has experimented with
deleting Ds with some success. The rationale behind the decision is that students who are making
Ds are not mastering the material at the level expected by the schools. This became apparent when
schools noticed that the students making Ds in English were, with very few exceptions, failing the
state-mandated exit examination. This caused them to question whether it is appropriate to give
students a passing grade if it is almost assured that they will not pass the standardized assessment
required for them to progress to the next grade or graduate. Schools also hoped that this policy would
motivate some students to try a little harder and elevate their grades to Cs. There is some evidence
that this is happening. For example, after one English department did away with Ds, approximately
one-third of the students who had made Ds the preceding quarter raised their averages to a C level,
while about two-thirds received Fs. The policy has generally been well received by educators and is
likely to be adopted by other departments and schools.

Supplemental Systems. Many teachers and/or schools have adopted various approaches
to replace or supplement the more traditional marking systems. For example, some teachers
use a checklist of specific learning objectives to provide additional information about their
students’ strengths and weaknesses. Other teachers use letters, phone conversations, or in-
dividual conferences with parents to convey more specific information about their students’
individual academic strengths and weaknesses. Naturally, all of these approaches can be
used to supplement the more traditional grading/marking systems.

The Basis for Assigning Grades

Another essential question in assigning grades involves a decision regarding the basis for
grades. By this we mean “Are grades assigned purely on the basis of academic achievement,
or are other student characteristics taken into consideration?” For example, when assign-
ing grades should one take into consideration factors such as a student’s attitudes, behav-
ior, class participation, punctuality, work/study habits, and so forth? As a general rule these
nonachievement factors receive more consideration in elementary school, whereas in the
secondary grades the focus narrows to achievement (Hopkins, 1998). While recognizing
the importance of these nonachievement factors, most assessment experts recommend that
actual academic achievement be the sole basis for assigning achievement grades. If desired,
teachers should assign separate ratings for these nonachievement factors (e.g., excellent,
satisfactory, and unsatisfactory). The key is that these factors should
be rated separately and independently from achievement grades. This keeps academic grades
as relatively pure marks of achievement that are not contaminated by nonachievement factors.
When educators mix achievement and nonachievement factors, the meaning of the

TABLE 11.1 Report Form Reflecting Achievement and Nonachievement Factors

Student Achievement: The following grades reflect the student’s achievement in each academic area.

Grades: A = Excellent, B = Above Average, C = Average, D = Below Average, F = Failing

Subject            Grade
Reading
Writing
English
Math
Social studies
Science

Student Behavior: The following scores reflect the student’s behavior at school.

Rating Scale: E = Excellent, S = Satisfactory, N = Needs Improvement, U = Unsatisfactory

Student’s effort                 E  S  N  U
Follows directions               E  S  N  U
Acts responsibly                 E  S  N  U
Completes and returns work       E  S  N  U
Interacts well with peers        E  S  N  U
Interacts well with adults       E  S  N  U
Overall classroom behavior       E  S  N  U

grades is blurred. Table 11.1 provides an example of an elementary school report intended to
separate achievement from other factors. Special Interest Topic 11.3 addresses the issue of
lowering grades as a means of classroom discipline.

Frame of Reference

Once you have decided what to base your grades on (hopefully academic achievement), you
need to decide on the frame of reference you will use. In the following sections we will
discuss the most common frames of references.

Norm-Referenced Grading (Relative Grading)

The first frame of reference we will discuss is referred to as norm-referenced or relative
grading, which involves comparing each student’s performance to that of a specific reference
group (comparable to the norm-referenced approach to score interpretation discussed in
Chapter 3). This approach to assigning grades is

SPECIAL INTEREST TOPIC 11.3


Grading and Punishment?

Nitko (2001) distinguishes between “failing work” and “failure to try.” Failing work is work of such
poor quality that it should receive a failing grade (i.e., F) based on its merits. Failure to try is when
the student, for some reason, simply does not do the work. Should students who fail to try receive
a failing grade? What about students who habitually turn in their assignments late? Should they be
punished by lowering their grades? What about students who are caught cheating? Should they be
given a zero? These are difficult questions that don’t have simple answers.
Nitko (2001) contends that it is invalid to assign a grade of F for both failing work and failure
to try because they do not represent the same construct. The F for failing work represents unaccept-
able achievement or performance. In contrast, failing to try could be due to a host of factors such as
forgetting the assignment, misunderstanding the assignment, or simply defiant behavior. The key
factor is that failing to try does not necessarily reflect unacceptable achievement or performance.
Likewise, Nitko contends that it is invalid to lower grades as punishment for turning in assignments
late. This confounds achievement with discipline.
Along the same lines, Stiggins (2001) recommends that students caught cheating should not
be given a zero because this does not represent their true level of achievement. In essence, these
authors are arguing that from a measurement perspective you should separate the grade from the
punishment or penalty. Nitko (2001) notes that these are difficult issues, but they are classroom
management or discipline issues rather than measurement issues. As an example of a suggested
solution, Stiggins (2001) recommends that instead of assigning a zero to students caught cheating,
it is preferable to administer another test to the student and use this grade. This way, punishment
is addressed as a separate issue that the teacher can handle in a number of ways (e.g., detention
or in-school suspension). Teachers face these issues on a regular basis, and you should consider
them carefully before you are faced with them in the classroom.

also referred to as “grading on the curve.” Although the reference group varies, it is often the
students in a single classroom. For example, a teacher might specify the following criteria
for assigning marks:

Mark Percentage of Students Receiving Mark


A 10%
B 20%
C 40%
D 20%
F 10%

With this arrangement, in a class of 20 students, the two students receiving the highest grades
will receive As, the next four students will receive Bs, the next eight students will receive
Cs, the next four students will receive Ds, and the two students with the lowest scores will
receive Fs. An advantage of this type of grading system is that it is straightforward and clearly
specifies what grades students will receive. A second advantage is that it helps prevent grade
inflation, which occurs when teachers are too lenient in their grading and a large proportion
of their students receive unwarranted high marks.
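
For teachers who keep an electronic grade book, this quota-based assignment is easy to automate. The Python sketch below is our own illustration (the names and quotas are taken from the example above, not from any particular grade book program); it ranks students by composite score and hands out marks in the fixed 10/20/40/20/10 proportions:

# Norm-referenced ("grading on the curve") assignment: a fixed share of
# students receives each mark, regardless of their absolute scores.
quotas = [("A", 0.10), ("B", 0.20), ("C", 0.40), ("D", 0.20), ("F", 0.10)]

def assign_relative_grades(scores):
    """scores: dict mapping student name to composite score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    grades, start = {}, 0
    for mark, share in quotas:
        count = round(share * len(ranked))
        for student in ranked[start:start + count]:
            grades[student] = mark
        start += count
    for student in ranked[start:]:   # anyone left over by rounding
        grades[student] = "F"
    return grades

In a class of 20 this yields exactly 2 As, 4 Bs, 8 Cs, 4 Ds, and 2 Fs, mirroring the distribution described above.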
This approach to assigning grades does have limitations. Possibly the most prominent
limitation is that there can be considerable variability among reference groups. If the refer-
ence group is a single classroom, some classes will be relatively high achieving and some
relatively low achieving. If a student is fortunate enough to be in a low-achieving class, he
or she will stand a much better chance of receiving a high grade than if he or she is in a high-
achieving class. But consider the unlucky “average” student who is assigned to a very high-
achieving class. Although this student’s performance might have been sufficient to earn a
respectable mark in an average classroom, relative to the high-achieving students the stu-
dent might receive a poor grade. Additionally, if the teacher strictly follows the guidelines,
a certain percentage of students will receive poor grades by default. To overcome this, some
teachers maintain records over several years in order to establish more substantive reference
data. In this manner a teacher can reduce the influence of variability in class achievement.
The use of large reference groups containing data from many classes accumulated over time
is one of the best approaches to help minimize this limitation.
Gronlund (1998) provides another approach to reducing the effect of variability in
class achievement. He recommends using ranges of percentages instead of precise percent-
ages. For example:

Mark Percentage of Students Receiving Mark


A 10-20%
B 20-30%
C 40-50%
D 10-20%
F 0-10%

This approach gives the teacher some flexibility in assigning grades. For example, in a
gifted and talented class one would expect more As and Bs, and few Ds or Fs. The use of
percent ranges provides some needed flexibility.
Another limitation of the norm-referenced approach is that the percentage of students
being assigned specific grades is often arbitrarily assigned. In our example we used 20% for
Bs and 40% for Cs; however, it would be just as defensible to use 15% for Bs and 50% for
Cs. Often these percentages are set by the district or school administrators, and one criterion
is not intrinsically better than another. A final limitation is that with relative grading, grades
are not specifically linked to an absolute level of achievement. At least in theory it would be
possible for students in a very low-achieving group to receive relatively high marks without
actually mastering the learning objectives. Conversely, in a high-achieving class some
students may fail even when they have mastered much of the material. Obviously neither of
these outcomes is desirable!

Criterion-Referenced Grading (Absolute Grading)

Criterion-referenced or absolute grading involves comparing a student’s performance to a
specified level of performance. Although modern usage of the phrase “criterion-referenced”
is often associated with dichotomous conditions (e.g., pass or fail; mastery or nonmastery),
this approach can also reflect a continuum of achievement. One of the most common
criterion-referenced grading systems is the traditional percentage-based system. In it grades
are based on percentages, usually interval bands based on a combination of grades
from tests and other assignments. For example:

Mark Percentage Required for Mark


A 90-100%
B 80-89%
C 70-79%
D 60-69%
F <60%

Many schools list such bands as formal criteria even though they are often modified in ac-
tual practice. An advantage of this grading approach is that the marks directly describe the
performance of students without reference to other students. As a result, there is no limit
on the number of students that receive any specific grade. For example, in a high-achieving
class all students could conceivably receive As. Although such an extreme outcome is not
likely in most schools, the issue is that there is no predetermined percentage of students that
must receive each grade. Another advantage is that this system, like the norm-referenced
approach, is fairly straightforward and easy to apply.
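
The band-based mapping is equally simple to express. The following sketch, again our own illustration rather than a prescribed procedure, converts a percentage composite into a letter grade using the cutoffs listed above:

# Criterion-referenced (absolute) grading: each mark corresponds to a
# fixed percentage band, independent of how classmates performed.
def absolute_grade(percentage):
    if percentage >= 90:
        return "A"
    if percentage >= 80:
        return "B"
    if percentage >= 70:
        return "C"
    if percentage >= 60:
        return "D"
    return "F"

print(absolute_grade(85))   # B
print(absolute_grade(58))   # F

Because the function looks only at the student's own percentage, any number of students can earn a given mark.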
The major limitation of criterion-referenced or absolute grading is that there is con-
siderable variability in the level of difficulty among tests and other academic assignments
assigned by teachers. Some teachers create tests that are extremely difficult and others write
relatively easy tests. Some teachers are consistently more rigorous in their grading than oth-
ers. As a result, a rigorous teacher might have a class average of only 60% or 70%, whereas
a more lenient teacher might have a class average of 80% or 90%. This inherent variability
in difficulty level across courses makes it difficult to interpret or compare the meaning of
scores based on an absolute standard in a consistent manner.

Achievement in Relation to Improvement or Effort


Some teachers feel students should be graded based on the amount of progress they dem-
onstrate during the course of instruction. For example, if students with very poor skills
at the beginning of instruction achieve a moderate level of achievement, they should re-
ceive a better grade than high-achieving students who have demonstrated only small gains.
There are numerous problems with this approach. Examine the data listed in Table 11.2.
If one takes the position that effort as measured by improvement should be the basis for
grades, Joe’s improvement, in number and percentage of words gained, is much greater
than Mary’s improvement. In contrast if the teacher bases grades on achievement, Joe’s
performance is far short of Mary’s and she would receive the higher grade. To reward Joe
more highly than Mary can be a risky procedure. What if Joe, knowing the basis for grad-
ing, deliberately scored low on the initial test in order to ensure a large gain? Should such
actions be rewarded?


TABLE 11.2 Examples of Grading Based on Effort versus Achievement

                                    Joe      Mary
Score on spelling pretest            17        62
Score on spelling posttest           43        80
Gain                                 26        18
Percentage gain                     153%       29%
Grade based on effort                 A         C
Grade based on achievement            C         A

There are other problems associated with basing grades on effort or improvement. For
example, the measurement of improvement or change is plagued with numerous technical
problems (Cronbach & Furby, 1970). Additionally, you have the mixing of achievement
with another factor, in this instance effort or improvement. As we suggested before, if you
want to recognize effort or improvement, it is more defensible to
assign separate scores to these factors. Achievement grades should reflect achievement and
not be contaminated by other factors. Finally, although this approach is typically intended
to motivate poor students, it can have a negative effect on better students. Based on
these problems, our recommendation is not to reward effort/improve-
ment over achievement except in special situations. One situation in which the reward of
effort may be justified is in the evaluation of students with severe disabilities for whom
grades may appropriately be used to reinforce effort.

Achievement Relative to Ability


Although some teachers have tried to base grades on effort/improvement, others have at-
tempted to base grades on achievement relative to ability or aptitude. In this context, ability
or aptitude is usually based on performance on an intelligence test. For example, a student
with average intelligence who scores above average on tests of achievement is considered
an overachiever and receives good grades. Accordingly, an underachiever is a student whose
achievement is considered low in relation to his or her level of intelligence. Like attempts to
base grades on effort/improvement, there are numerous problems associated with this ap-
proach, including (1) technical problems (e.g., unreliable comparisons), (2) inconsistency
in the way IQ is measured, and (3) teachers typically not being given the advanced training
necessary to interpret intelligence or aptitude tests. These problems all argue against basing
grades on a comparison of achievement and aptitude.

Recommendation
Although we recommend against using achievement relative to aptitude, effort, or improve-
ment as a frame of reference for assigning grades, we believe both absolute and relative
grading systems can be used successfully. They both have advantages and limitations, but
when used conscientiously either approach can be effective. It is even possible for teachers
to use a combination of absolute and relative grading systems in secondary schools and
universities. For example, Hopkins (1998) recommends that high schools and colleges
report a conventional absolute grade and also the students’
relative standing in their graduating class (e.g., percentile rank). This
would serve to reduce the differences in grading across schools. For example, a student might
have a grade point average (GPA) of 3.0 but a percentile rank of 20. Although the GPA is
adequate, the percentile rank of 20 (i.e., indicating the student scored better than only 20% of
the other students) suggests that this school gives a high percentage of high grades.
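
As a rough illustration of the relative standing Hopkins has in mind, a percentile rank can be computed as the percentage of classmates whose GPA falls below the student's GPA (definitions vary slightly; this is one common convention, and the data below are hypothetical):

# Percentile rank: percentage of students in the class with a lower GPA.
def percentile_rank(gpa, class_gpas):
    below = sum(1 for g in class_gpas if g < gpa)
    return 100 * below / len(class_gpas)

class_gpas = [3.9, 3.8, 3.7, 3.6, 3.5, 3.4, 3.3, 3.0, 2.8, 2.5]
print(percentile_rank(3.0, class_gpas))   # 20.0

Here a respectable 3.0 GPA carries a percentile rank of only 20, signaling a school where high grades are plentiful.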

Combining Grades into a Composite

When it is time to assign grades at the end of a six-week period, semester, or some other
grading period, teachers typically combine results from a variety of assessments. This can
include tests, homework assignments, performance assessments, and
the like. The decision of how to weight these assessments is usually left to the teacher and
reflects his or her determination of what should be emphasized and to what degree (see
Special Interest Topic 11.4). For example, if a teacher believes the primary determiner of a
course grade should be performance on tests, he or she would weight test performance
heavily and place less emphasis on homework and term papers. Another teacher, with a
different grading philosophy, might decide to emphasize homework and papers and place less emphasis
on test performance. Whatever your grading philosophy, you need a
system for effectively and efficiently combining scores from a variety of assessment proce-
dures into a composite score. On initial examination this might appear to be a fairly simple
procedure, but it is often more complicated than it appears.
Consider the following data illustrating a simple situation. Here we have two assess-
ment procedures, a test and a homework assignment. For our initial example we will assume
the teacher wants to weight them equally. The test has a maximum score of 100 whereas the
homework assignment has a maximum value of 50.

Assessment Range Johnny Sally

Achievement test (40-100) 100 40


Homework (20-50) 20 50
Composite 120 90

Johnny had a perfect score on the test but the lowest grade on the homework assignment.
Sally had the opposite results, a perfect score on the homework assignment and the low-
est score on the test. If the summed scores were actually reflecting equal weighting as the


SPECIAL INTEREST TOPIC 11.4


Some Thoughts on Weighting Assessment Procedures

The decision regarding how to weight different assessment procedures is a personal decision
based on a number of factors. Some teachers emphasize homework, some tests, some term pa-
pers, and some performance assessments. No specific practice is clearly right or wrong, but an
understanding of psychometric principles may offer some guidance. First, as we have stated
numerous times, it is desirable to provide multiple assessment opportunities. Instead of relying
on only a midterm and a final, it is best to provide numerous assessment opportunities spread
over the grading period. Second, when possible it is desirable to incorporate different types of
assessment procedures. Instead of relying exclusively on any one type of assessment, when fea-
sible try to incorporate a variety. Finally, when determining weights, consider the psychometric
properties of the different assessment procedures. For example, we often weight the assessments
that produce the most reliable scores and valid interpretations more heavily than those with
less sound psychometric properties. Table 11.2 provides one of our weighting procedures for
an introductory psychology course in tests and measurement. All of the tests are composed of
multiple-choice and short-answer items and make up 90% of the final grade. We include a term
paper, but we are aware that its scores are less reliable and valid so we count it for only 10% of
the total grade.
Naturally this approach emphasizing more objective tests is not appropriate for all classes.
For example, it is difficult to assess performance in a graduate level course on professional ethics
and issues using assessments that emphasize multiple-choice and short-answer. In this course we
use a combination of tests composed of multiple-choice items, short-answer items, and restricted-
response essays along with class presentations and position papers. The midterm and final ex-
amination receive a weighting of approximately 60% of the final grade, with the remaining 40%
accounted for by more subjective procedures. In contrast, in an introductory psychology course,
which typically has about 100 students in every section, we would likely use all multiple-choice
tests. We would not be opposed to requiring a term paper or including some short-answer or
restricted-response items on the tests, but due to the extensive time required to grade these pro-
cedures we rely on the objectively scored tests (which are scored in the computer center). We do
provide multiple assessment opportunities in this course (four semester tests and a comprehensive
final).
In addition to reliability/validity issues and time considerations, you may want to consider
other factors. For example, because students may receive assistance from parents (or others) on
homework assignments, their performance might not be based solely on their own abilities. This
is also complicated by variability in the amount of assistance students receive. Some students may
get considerable support/assistance from parents while others receive little or none. The same
principle applies to time commitment. An adolescent with few extracurricular activities will have
more time to complete homework assignments than one who is required to maintain a part-time
job (or is involved in athletics, band, theater, etc.). If you provide no weight to homework assign-
ments it removes any incentive to complete the work, whereas basing a large proportion of the
grade on homework will penalize students who receive little assistance from parents or who are
involved in many outside activities. Our best advice is to take these factors into consideration and
adopt a balanced approach. Good luck!

teacher expects, the composite scores would be equal. Obviously they are not. Johnny’s
score (i.e., 120) is considerably higher than Sally’s score (i.e., 90). The problem is that
achievement test scores have more variability, and as a result they have an inordinate influ-
ence on the composite score.
To correct this problem you need to equate the scores by taking into consideration the
differences in variability. Although different methods have been proposed for equating the
scores, this can be accomplished in a fairly accurate and efficient manner by simply correct-
ing for differences in the range of scores (it is technically preferable to use a more precise
measure of variability than the range, but for classroom applications the range is usually
sufficient). In our example, the test scores had a range of 60 while the homework scores had
a range of 30. By multiplying the homework scores by 2 (i.e., our equating factor) we can
equate the ranges of the scores and give them equal weight in the composite score. Consider
the following illustration.

Student    Test Score    Equating Factor    Corrected Score    Homework Score    Equating Factor    Corrected Score    Composite Score
Johnny     100           x 1                100                20                x 2                40                 140
Sally      40            x 1                40                 50                x 2                100                140

Note that this correction resulted in the assignments being weighted equally; both students
received the same composite score. If in this situation the teacher wanted to calculate a
percentage-based composite score, he or she would simply divide the obtained composite
score by the maximum composite score (in this case 200) and multiply by 100. This would
result in percentage-based scores of 70% for both Johnny and Sally.
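If you keep your grade records in a spreadsheet or a simple script, the range-equating procedure just described is easy to automate. The short Python sketch below is our own illustration of the idea (the function and variable names are hypothetical, not part of any particular grade book program): it computes an equating factor from the ratio of the score ranges and then forms an equally weighted, percentage-based composite.

```python
# Illustrative sketch of the range-equating procedure described above.
# All names are hypothetical; adapt them to your own grade records.

def equating_factor(reference_range, assessment_range):
    """Factor that scales an assessment so its range matches the reference range."""
    return reference_range / assessment_range

# Raw scores: tests ranged from 40 to 100 (range = 60), homework from 20 to 50 (range = 30).
test_scores = {"Johnny": 100, "Sally": 40}
homework_scores = {"Johnny": 20, "Sally": 50}

test_range = max(test_scores.values()) - min(test_scores.values())              # 60
homework_range = max(homework_scores.values()) - min(homework_scores.values())  # 30
factor = equating_factor(test_range, homework_range)                            # 60 / 30 = 2

max_composite = 100 + 50 * factor  # 200 possible points after equating

for student in test_scores:
    corrected_homework = homework_scores[student] * factor
    composite = test_scores[student] + corrected_homework
    print(student, composite, composite / max_composite * 100)
# Johnny: 100 + 40 = 140 (70.0%); Sally: 40 + 100 = 140 (70.0%)
```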
In the previous example we assumed the teacher wanted the test and homework scores
to be equally weighted. Now we will assume that the teacher wants the test score to count
three times as much as the homework score. In this situation we would add another multi-
plier to weight the test score as desired.

Student    Test Score    Equating Factor    Desired Weight    Corrected Score    Homework Score    Equating Factor    Corrected Score    Composite Score
Johnny     100           × 1                × 3               300                20                × 2                40                 340
Sally      40             × 1                × 3               120                50                × 2                100                220

With this weighting, Johnny’s composite score (i.e., 340) is considerably higher than Sally’s
composite score (i.e., 220) because the teacher has chosen to place more emphasis on the test
score relative to the homework assignments. If the teacher wanted to calculate a percentage-
based composite score, he or she would divide the obtained composite score by the maximum
composite score (in this case 400) and multiply by 100. This would result in percentage-
based scores of 85% for Johnny and 55% for Sally.
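Under the same assumptions, the desired-weight multiplier can be added as one more step. The sketch below extends the previous one; again, the names are illustrative only.

```python
# Extending the sketch above: weight the test three times as heavily as homework.
# Names are again illustrative, not part of the text's procedure.

desired_test_weight = 3
homework_equating_factor = 2   # from the range comparison above (60 / 30)

test_scores = {"Johnny": 100, "Sally": 40}
homework_scores = {"Johnny": 20, "Sally": 50}

max_composite = 100 * desired_test_weight + 50 * homework_equating_factor   # 300 + 100 = 400

for student in test_scores:
    composite = (test_scores[student] * desired_test_weight
                 + homework_scores[student] * homework_equating_factor)
    print(student, composite, composite / max_composite * 100)
# Johnny: 300 + 40 = 340 (85.0%); Sally: 120 + 100 = 220 (55.0%)
```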

A Short Cut. Although the preceding approach is preferred from a technical perspective,
it can be a little unwieldy and time consuming. Being sensitive to the many demands on a
Assigning Grades on the Basis of Classroom Assessments 293

teacher’s time, we will now describe a simpler and technically adequate approach that may
be employed (e.g., Kubiszyn & Borich, 2000). With this approach, each component grade
is converted to a percentage score by dividing the number of points awarded by the total
potential number of points. Using data from the previous examples with an achievement
test with a maximum score of 100 and a homework assignment with a maximum score of
50, we have the following results:

Assessment          Maximum    Johnny                    Sally
Achievement test    100        100/100 × 100 = 100%      40/100 × 100 = 40%
Homework            50         20/50 × 100 = 40%         50/50 × 100 = 100%

This procedure equated the scores by converting them both to a 100-point scale (based on
the assumption that the converted scores are comparable in variance). If you were then to
combine these equated scores with equal weighting, you would get the following results:

Student    Achievement Test    Homework    Composite
Johnny     100%                40%         140/2 = 70%
Sally      40%                 100%        140/2 = 70%

If you wanted to use different weights for the assessments, you would simply mul-
tiply each equated score by a percentage that represents the desired weighting of each
assessment. For example, if you wanted the test score to count three times as much as the
homework assignment, you would multiply the equated test score by 0.75 (i.e., 75%) and
the equated homework score by 0.25 (i.e., 25%). Note that the weights (i.e., 75% and 25%)
equal 100%. You would get the following results:

Student    Achievement Test        Homework                Composite
Johnny     100 × 0.75 = 75         40 × 0.25 = 10          75 + 10 = 85%
Sally      40 × 0.75 = 30          100 × 0.25 = 25         30 + 25 = 55%
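As a minimal sketch of this shortcut (assuming, as above, that the weights are expressed as decimals that sum to 1.0), the following Python fragment converts each score to percent correct and then forms the weighted composite; the function and variable names are our own, hypothetical choices.

```python
# A sketch of the percentage-based shortcut: convert each score to percent correct,
# then combine with weights that sum to 1.0. Names are illustrative only.

def percent_correct(points_earned, points_possible):
    return points_earned / points_possible * 100

weights = {"test": 0.75, "homework": 0.25}

# (points earned, points possible) for each assessment
students = {
    "Johnny": {"test": (100, 100), "homework": (20, 50)},
    "Sally":  {"test": (40, 100),  "homework": (50, 50)},
}

for name, scores in students.items():
    composite = sum(weights[assessment] * percent_correct(earned, possible)
                    for assessment, (earned, possible) in scores.items())
    print(name, round(composite, 1))
# Johnny: 0.75 * 100 + 0.25 * 40 = 85.0; Sally: 0.75 * 40 + 0.25 * 100 = 55.0
```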

Table 11.3 provides an expanded example of how this procedure can be used to calculate grades.
As computers become increasingly available, more teachers will have access to computerized or
electronic grade books. These grade books can greatly simplify the process of recording scores
and computing grades. Table 11.4 provides some information on a few of the computerized grade
books that are available commercially to help teachers record scores and calculate grades. Many
textbook publishers and other educational suppliers provide similar grade book software to schools
and teachers. These commercial grade books can simplify the process of recording scores and
computing grades greatly.
TABLE 11.3 Example of Weighting Different Assessment Procedures

For this example, we will use the following weights:

Assessment Procedure    Value
Test 1                  20%
Test 2                  20%
Test 3                  20%
Term paper              10%
Final examination       30%
Total                   100%

The scores reported for each assessment procedure will be percent correct. As noted, a relatively
easy way to equate your scores on different procedures is to report them as percent correct
(computed as the number of points obtained by the student divided by the maximum number
of points). We use the scores of three students in this illustration.

Assessment Procedure    Julie                  Tommy                  Stacey
Test 1                  95 × 0.20 = 19         75 × 0.20 = 15         65 × 0.20 = 13
Test 2                  97 × 0.20 = 19.4       80 × 0.20 = 16         56 × 0.20 = 11.2
Test 3                  93 × 0.20 = 18.6       77 × 0.20 = 15.4       67 × 0.20 = 13.4
Term paper              92 × 0.10 = 9.2        85 × 0.10 = 8.5        70 × 0.10 = 7
Final examination       94 × 0.30 = 28.2       80 × 0.30 = 24         70 × 0.30 = 21
Total                   94.4                   78.9                   65.6


As illustrated here, Julie’s composite score is 94.4, Tommy’s is 78.9, and Stacey’s is 65.6.
Using these fairly simple procedures, a teacher can weight any number of different assessment
procedures in any desired manner.
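The same computation scales directly to any number of assessment procedures. As an illustration only (using the percent-correct scores for Julie and Tommy from Table 11.3), a weighted composite can be formed as follows:

```python
# Weighted composite for the percent-correct scores in Table 11.3 (Julie and Tommy shown).

weights = {"Test 1": 0.20, "Test 2": 0.20, "Test 3": 0.20,
           "Term paper": 0.10, "Final examination": 0.30}

scores = {
    "Julie": {"Test 1": 95, "Test 2": 97, "Test 3": 93,
              "Term paper": 92, "Final examination": 94},
    "Tommy": {"Test 1": 75, "Test 2": 80, "Test 3": 77,
              "Term paper": 85, "Final examination": 80},
}

for student, s in scores.items():
    composite = sum(weights[a] * s[a] for a in weights)
    print(student, round(composite, 1))   # Julie: 94.4, Tommy: 78.9
```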

TABLE 11.4 Commercially Available Grade Book Programs

As we indicated, there are a number of commercially available grade book programs. There are
programs for both Mac and PC users and many of these programs reside completely on your
computer. A new trend is Web-based applications that in addition to recording scores and calcu-
lating grades allow students and parents to check on their progress. Clearly with technological
advances these programs will become more sophisticated and more widely available. Here are
just a few of the many grade book programs available and their Web addresses:

■ ClassAction Gradebook    www.classactiongradebook.com
■ Jackson GradeQuick       www.jacksoncorp.com
■ MyGradeBook              www.mygradebook.com
■ ThinkWare Educator       www.thinkware.com


Informing Students of the Grading System


and Grades Received

Students clearly have a right to know the procedures that will be used to determine their
grades. This information should be given to the students early in a course and well before
any assessment procedures are administered that are included in the grading process. A
common question is “How old should students be before they can benefit from this infor-
mation?” We believe any students who are old enough to be administered a test or given an
assignment are also old enough to know how their grades will be determined. Parents should
also be informed in a note or in person during conferences or visits what is expected of
their children and how they will be graded. For students in upper elementary grades and beyond,
an easy way to inform them of grading requirements is a handout such as shown in Table 11.5.
This system is similar to those used by one of the authors in his classes.
Students and parents should be informed of the grades obtained as well. Feedback and
reporting of grades should be done individually in a protected manner. Grades or test scores should not be posted
or otherwise displayed in any way that reveals a student’s individual
performance. A federal law, the Family Educational Rights and Privacy Act (FERPA; also
known as Public Law 93-380 or the Buckley Amendment) governs the maintenance and
release of educational records, including grades, test scores, and related evaluative mate-
rial. Special Interest Topic 11.5 provides information on this law from the Department of
Education’s FERPA compliance home page.

Parent Conferences

Most school systems try to promote parent-teacher communication, often in the form of par-
ent conferences. Typically conferences serve to inform parents of all aspects of their child’s
progress, so preparation for a parent conference should result in a file folder containing a
record of the child’s performance in all areas. This may include information on social and
behavioral development as well as academic progress. Conferences should be conducted as

TABLE 11.5 Example of Grading Requirements Presented to Students at the Beginning of a Test and Measurement Course

Assessment Procedure Value

Test 1 20%
Test 2 20%
Test 3 20%
Term paper 10%
Final examination 30%
Total 100%

SPECIAL INTEREST TOPIC 11.5


The Family Educational Rights and Privacy Act of 1974

The Family Educational Rights and Privacy Act (FERPA) (20 U.S.C. § 1232g; 34 CFR Part 99) is a
Federal law that protects the privacy of student education records. The law applies to all schools that
receive funds under an applicable program of the U.S. Department of Education.
FERPA gives parents certain rights with respect to their children’s education records. These
rights transfer to the student when he or she reaches the age of 18 or attends a school beyond the high
school level. Students to whom the rights have transferred are “eligible students.”
Parents or eligible students have the right to inspect and review the student’s education records
maintained by the school. Schools are not required to provide copies of records unless, for reasons
such as great distance, it is impossible for parents or eligible students to review the records. Schools
may charge a fee for copies.
Parents or eligible students have the right to request that a school correct records which they
believe to be inaccurate or misleading. If the school decides not to amend the record, the parent or
eligible student then has the right to a formal hearing. After the hearing, if the school still decides not
to amend the record, the parent or eligible student has the right to place a statement with the record
setting forth his or her view about the contested information.
Generally, schools must have written permission from the parent or eligible student in order
to release any information from a student’s education record. However, FERPA allows schools to
disclose those records, without consent, to the following parties or under the following conditions
(34 CFR § 99.31):

School officials with legitimate educational interest;


Other schools to which a student is transferring;
Specified officials for audit or evaluation purposes;
Appropriate parties in connection with financial aid to a student;
Organizations conducting certain studies for or on behalf of the school;
Accrediting organizations;
To comply with a judicial order or lawfully issued subpoena;
Appropriate officials in cases of health and safety emergencies; and
State and local authorities, within a juvenile justice system, pursuant to specific State law.

Schools may disclose, without consent, “directory” information such as a student’s name, ad-
dress, telephone number, date and place of birth, honors and awards, and dates of attendance. How-
ever, schools must tell parents and eligible students about directory information and allow parents
and eligible students a reasonable amount of time to request that the school not disclose directory
information about them. Schools must notify parents and eligible students annually of their rights
under FERPA. The actual means of notification (special letter, inclusion in a PTA bulletin, student
handbook, or newspaper article) is left to the discretion of each school.
For additional information or technical assistance, you may call (202) 260-3887 (voice). Indi-
viduals who use TDD may call the Federal Information Relay Service at 1-800-877-8339.
Or you may contact us at the following address:
Family Policy Compliance Office
U.S. Department of Education
400 Maryland Avenue, SW
Washington, D.C. 20202-5920

Problems related to educational privacy include grading done by parent or student volunteers, post-
ing grades or test scores, or releasing other evaluative information such as disciplinary records in a
manner accessible by nonauthorized persons. (Note that this applies only to materials that identify
individual students; nothing in FERPA prohibits schools from reporting aggregated data.) In general,
all of these activities and many others that once were common practice are prohibited by FERPA.
Some authoritative sites we have reviewed that give guidance to educators on these issues include
the following:

www.nacada.ksu.edu/Resources/FERPA-Overview.htm
www.ed.gov/policy/gen/guid/fpco/index.html
www.aacrao.org/ferpa_guide/enhanced/main_frameset.html

A tutorial for learning the requirements of FERPA may be found at

www.sis.umd.edu/ferpa/ferpa_what_is.htm

Source: United States Department of Education, www.ed.gov/policy/gen/guid/fpco/ferpa/index.html.

confidential, professional sessions. The teacher should focus on the individual student and
avoid discussions of other students, teachers, or administrators. The teacher should present
samples of students’ work and other evidence of their performance as the central aspect of
the conference, explaining how each item fits into the grading system. If standardized test
results are relevant to the proceedings, the teacher should carefully review the tests and their
scoring procedures beforehand in order to present a summary of the results in language
clearly understandable to the parents. In subsequent chapters we will address the use and
interpretation of standardized tests in school settings.

Summary
In this chapter we focused on the issue of assigning grades based on the performance
of students on tests and other assessment procedures. We started by discussing some of
the different ways assessment procedures are used in the schools. Formative evaluation
involves providing feedback to students whereas summative evaluation involves making
evaluative judgments regarding their progress and achievement. We also discussed the ad-
vantages and disadvantages of assigning cumulative grades or marks. On the positive side,
grades generally represent a fair system for comparing students that minimizes irrelevant
characteristics such as gender or race. Additionally, because most people are familiar with
grades and their meaning, grades provide an effective and efficient means of providing in-
formation about student achievement. On the down side, a grade is only a brief summary of
a student’s performance and does not convey detailed information about specific strengths
and weaknesses. Additionally, although most people understand the general meaning of
grades, there is variability in what grades actually mean in different classes and schools.
Finally, student competition for grades may become more important than actual achieve-
ment, and students may have difficulty separating their personal worth from their grades,
both undesirable situations.

Next we discussed some of the more practical aspects of assigning grades. For ex-
ample, we recommended that grades be assigned solely on the basis of academic achieve-
ment. Other factors such as class behavior and attitude are certainly important, but when
combined with achievement in assigning grades they blur the meaning of grades. Another
important consideration is what frame of reference to use when assigning grades. Although
different frames of references have been used and promoted, we recommend using either a
relative (i.e., norm-referenced) or an absolute (i.e., criterion-referenced) grading approach,
or some combination of the two.
We also provided a discussion with illustrations of how to combine grades into a
composite or cumulative grade. When assigning grades, teachers typically wish to take a
number of assessment procedures into consideration. Although this process may appear
fairly simple, it is often more complicated than first assumed. We demonstrated that when
forming composites it is necessary to equate scores by correcting for differences in the
variance or range of the scores. In addition to equating scores for differences in variability,
we also demonstrated how teachers can apply different weights to different assessment
procedures. For example, a teacher may want a test to count two or three times as much as a
homework assignment. We also provided examples of how these procedures can be applied
in the classroom with relative ease. In closing, we presented some suggestions on presenting
information about grades to students and parents.

KEY TERMS AND CONCEPTS


Absolute grading, p. 287
Basis for grades, p. 284
Composite scores, p. 290
Criterion-referenced grading, p. 287
Formal evaluation, p. 281
Formative evaluation, p. 278
Frame of reference, p. 285
Grades, p. 278
Informal evaluation, p. 281
Letter grades, p. 282
Marks, p. 278
Norm-referenced grading, p. 285
Numerical grades, p. 283
Pass–fail grades, p. 283
Relative grading, p. 285
Score, p. 278
Summative evaluation, p. 279
Verbal descriptors, p. 283

RECOMMENDED READING

Brookhart, S. M. (2004). Grading. Upper Saddle River, NJ: Pearson Merrill Prentice Hall. This provides a good discussion of issues related to grading practices.


Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to listen to an audio lecture about this chapter.
CHAPTER 12

Standardized Achievement Tests in the Era of High-Stakes Assessment

Depending on your perspective, standardized achievement tests are either a


bane or boon to public schools. Many politicians and ordinary citizens see
them as a way of holding educators accountable and ensuring students are
really learning. On the other hand, many educators feel standardized tests
are often misused and detract from their primary job of educating students.

CHAPTER HIGHLIGHTS

The Era of High-Stakes Assessment
Group-Administered Achievement Tests
Individual Achievement Tests
Selecting an Achievement Battery

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Describe the characteristics of standardized tests and explain why standardization is important.
2. Describe the major characteristics of achievement tests.
3. Describe the major uses of standardized achievement tests in schools.
4. Explain what high-stakes testing means and trace the historical development of this phenomenon.
5. Compare and contrast group-administered and individually administered achievement tests.
6. Describe the strengths and weaknesses of group-administered and individually administered achievement tests.
7. Identify the major publishers of group achievement tests and name their major tests.
8. Discuss the major issues and controversies surrounding state and high-stakes testing programs.
9. Describe and evaluate common procedures used for preparing students for standardized tests.
10. Describe and be able to apply the appropriate procedures for administering standardized
assessments.
11. Describe and be able to apply the appropriate procedures for interpreting the results of
standardized assessments.


12. Describe and evaluate the major individual achievement tests.


13. Describe the major factors that should be considered when selecting standardized
achievement tests.

In this and subsequent chapters we will be discussing a variety of standardized tests com-
monly used in the public schools. In this chapter we will focus on standardized achievement
tests. A standardized test is a test that is administered, scored, and interpreted in a standard
manner. Most standardized tests are developed by testing professionals or test publishing
companies. The goal of standardization is to ensure that testing conditions are as nearly the same
as possible for all individuals taking the test. If this is accomplished, no examinee will have an
advantage over another due to variance in administration procedures, and assessment results will
be comparable. Achievement tests are designed to assess students’ knowledge or skill in a content
domain in which they have received instruction (AERA et al., 1999). Naturally the vast majority of
teacher-constructed classroom tests qualify as achievement tests, but they are not standardized. In
describing standardized achievement tests, Linn and Gronlund (2000) highlighted the following
characteristics:

■ Standardized achievement tests typically contain high-quality items that were selected on the basis of both quantitative and qualitative item analysis procedures.
■ They have precisely stated directions for administration and scoring so that consistent procedures can be followed in different settings.
■ Many contemporary standardized achievement tests provide both norm-referenced and criterion-referenced interpretations. Norm-referenced interpretation allows comparison to the performance of other students, whereas criterion-referenced interpretation allows comparison to an established criterion.
■ The normative data are based on large, representative samples.
■ Equivalent or parallel forms of the test are often available.
■ They have professionally developed manuals and support materials that provide extensive information about the test; how to administer, score, and interpret it; and its measurement characteristics.

There are many different types of standardized achievement tests. Some achievement
tests are designed for group administration whereas others are for individual administration.
Individually administered achievement tests must be given to only one student at a time and
require specially trained examiners. Some achievement tests focus on a single subject area
(e.g., reading) whereas others cover a broad range of academic skills and content areas (e.g.,
reading, language, and mathematics). Some use selection type items exclusively whereas

others contain constructed-response and performance assessments. In addition to coming


in a variety of formats, standardized achievement tests have a number of different uses or
applications in the schools. These include the following:

■ One of the most common uses is to track student achievement over time or to compare group achievement across classes, schools, or districts.
■ Standardized achievement tests are increasingly being used in high-stakes decision making. For example, they may be utilized to determine which students are promoted or allowed to graduate. They may also be used in evaluating and rating teachers, administrators, schools, and school districts.
■ Achievement tests can help identify strengths and weaknesses of individual students.
■ Achievement tests can be used to evaluate the effectiveness of instructional programs or curricula and help teachers identify areas of concern.
■ A final major use of standardized achievement tests is the identification of students with special educational requirements. For example, achievement tests might be used in assessing children to determine whether they qualify for special education services.

The Era of High-Stakes Assessments

The current trend is toward more, rather than less, standardized testing in public schools.
This trend is largely attributed to the increasing emphasis on educational accountability and
high-stakes testing programs. Popham (2000) notes that while there have always been critics
of public schools, calls for increased accountability became more strident and widespread
in the 1970s. During this period news reports began to surface publicizing incidences of
high school graduates being unable to demonstrate even the most
The current trend is toward basic academic skills such as reading and writing. In 1983 the Na-
more, rather than less, tional Commission on Excellence in Education published A Nation
standardized testing in at Risk: The Imperative for Educational Reform. This important
public schools. report sounded an alarm that the United States was falling behind
other nations in terms of educating our children. Parents, who as
taxpayers were footing the bill for their children’s education, increasingly began to ques-
tion the quality of education being provided and to demand evidence that schools were
actually educating children. In efforts to assuage taxpayers, legislators started implement-
ing statewide minimum-competency testing programs intended to guarantee that graduates
of public schools were able to meet minimum academic standards. While many students
passed these exams, a substantial number of students failed, and the public schools and
teachers were largely blamed for the failures. In this era of increasing accountability, many
schools developed more sophisticated assessment programs that used both state-developed
tests and commercially produced nationally standardized achievement tests. As the trend
continued, it became common for local newspapers to rank schools according to their stu-
dents’ performance on these tests, with the implication that a school’s ranking reflected the
effectiveness or quality of teaching. Special Interest Topic 12.1 provides a brief description

of the National Assessment of Educational Progress (NAEP) that has been used for several
decades to monitor academic progress across the nation, as well as a sample of recent results
in 4th-grade reading.
Subsequent legislation and reports continued to focus attention on the quality of our
educational system, promoting increased levels of accountability, which translated into
more testing. In recent years, the No Child Left Behind Act of 2001 required that each state
develop high academic standards and implement annual assessments to monitor the perfor-
mance of states, districts, and schools. It requires that state assessments meet professional
standards for reliability and validity and that states achieve academic proficiency for all
students within 12 years. The No Child Left Behind Act also requires states to test students
annually in grades 3 through 8. As this text goes to print, Congress is beginning to debate the
reauthorization of NCLB. It is likely that there will be significant changes to the act in the next
few years, but in our opinion standardized achievement tests will continue to see extensive use in
our public schools.
In the remainder of this chapter we will introduce a number
of standardized achievement tests. First we will provide brief descriptions of some major
group achievement tests and discuss their applications in schools. We will then briefly
describe a number of individual achievement tests that are commonly used in schools. The
goal of this chapter is to familiarize you with some of the prominent characteristics of these
tests and how they are used in schools.

Group-Administered Achievement Tests

Achievement tests can be classified as either individual or group tests. Individual tests are
administered in a one-to-one testing situation. One testing professional (i.e., the examiner)
administers the test to one individual (i.e., the examinee) at a time. In contrast, group-
administered tests are those that can be administered to more than one examinee at a time.
The main attraction of group administration is that it is an efficient way to collect information
about students or other examinees. By efficient, we mean a large number of students can be
assessed with a minimal time commitment from educational professionals. As you might expect,
group-administered tests are very popular in school settings. For example, most teacher-constructed
classroom tests are designed to be administered to the whole class at one time. Accordingly, if a school district wants to test
all the students in grades 3 through 8, it would probably be impos-
sible to administer a lengthy test to each student on a one-to-one basis. There is simply not
enough time or enough teachers (or other educational professionals) to accomplish such a
task without significantly detracting from the time devoted to instruction. However, when
you can have one professional administer a test to 20 to 30 students at a time, the task can be
accomplished in a reasonably efficient manner.
Although efficiency is the most prominent advantage of group-administered tests,
at least three other positive attributes of group testing warrant mentioning. First, because
the role of the individual administering the test is limited, group tests will typically involve
more uniform testing conditions than individual tests. Second, group tests frequently contain

SPECIAL INTEREST TOPIC 12.1

The “Nation’s Report Card”

The National Assessment of Educational Progress (NAEP), also referred to as the “Nation’s Report
Card,” is the only ongoing nationally administered assessment of academic achievement in the United
States. NAEP provides a comprehensive assessment of our students’ achievement at critical periods in
their academic experience (i.e., grades 4, 8, and 12). NAEP assesses performance in mathematics, sci-
ence, reading, writing, world geography, U.S. history, civics, and the arts. New assessments in world
history, economics, and foreign language are currently being developed. NAEP has been adminis-
tered regularly since 1969. It does not provide information on the performance of individual students
or schools, but presents aggregated data reflecting achievement in specific academic areas, instruc-
tional practices, and academic environments for broad samples of students and specific subgroups.
The NAEP has an excellent Web site that can be accessed at http://nces.ed.gov/nationsreportcard. Of
particular interest to teachers is the NAEP Questions Tool. This tool provides access to NAEP ques-
tions, student responses, and scoring guides that have been released to the public. This tool can be
accessed at http://nces.ed.gov/nationsreportcard/itmrls. The table below contains 4th-grade reading
scores for the states, Department of Defense Education Agency, and the District of Columbia.

NAEP Results: 2005 4th-Grade Average Reading Scores

State Score State Score

1 Massachusetts 231.28 27 Maryland 220.03


2 New Hampshire 227.44 28 Kentucky 29°93
3 Vermont 226.89 29 Florida 219.47
4 Department of Defense 226.13 30 Texas 218.73
Education Agency 31 Michigan 218.26
5 Delaware 225.84 32 Indiana 218.07
6 Virginia 225.81 National Public Average 217.30
7 Connecticut 2251) 33 North Carolina 217.13
8 Minnesota 2258 34 Arkansas 217.07
9 North Dakota 224.81 35 Oregon 216.90
10 Maine 224.58 36 Illinois 216.49
11 Montana 224.55 37 Rhode Island 216.44
12 Colorado 223.66 38 West Virginia 214.77
13. Washington 223.49 39 Georgia 214.43
14 New Jersey 223.30 40 Tennessee 214.22
15 Wyoming 223.26 41 Oklahoma 213.86
16 Pennsylvania 222.11 42 South Carolina 213.20
17 New York 222.70 43 Alaska 211.06
18 Ohio ViB is) 44 Hawaii 209.58
19 South Dakota 222.40 45 Louisiana 209.17
20 Idaho 221.86 46 Alabama 207.75
21 Nebraska 221.38 47 Nevada 207.19
22 Utah PY eH| 48 Arizona 207.14
23. Missouri 221.17 49 New Mexico 206.79
24 Wisconsin 221.16 50 California 206.51
25 Iowa 220.81 51 Mississippi 204.39
26 Kansas 220.47 52 District of Columbia 190.79

Note: Data retrieved July 19, 2007, from http://nces.ed.gov/nationsreportcard.


items that can be scored objectively, often even by a computer (e.g., selected-response
items). This reduces or eliminates the measurement error introduced by the qualitative scor-
ing procedures more common in individual tests. Finally, group tests often have very large
standardization or normative samples. Normative samples for professionally developed
group tests are often in the range of 100,000 to 200,000, whereas professionally developed
individually administered tests will usually have normative samples ranging from 1,000
to 8,000 participants (Anastasi & Urbina, 1997).
Naturally, group tests have some limitations. For example, in a group-testing situation
the individual administering the test has relatively little personal interaction with the indi-
vidual examinees. As a result, there is little opportunity for the examiner to develop rapport
with the examinees and closely monitor and observe their progress. Accordingly they have
limited opportunities to make qualitative behavioral observations about the performance of
their students and how they approach and respond to the assessment tasks. Another concern
involves the types of items typically included on group achievement tests. Whereas some
testing experts applaud group tests for often using objectively scored items, others criticize
them because these items restrict the type of responses examinees can provide. This par-
allels the same argument for and against selected-response items we discussed in earlier
chapters. Another limitation of group tests involves their lack of flexibility. For example,
when administering individual tests the examiner is usually able to select and administer
only those test items that match the examinee’s ability level. With group tests, however, all
examinees are typically administered all the items. As a result, examinees might find some
items too easy and others too difficult, resulting in boredom or frustration and lengthening
the actual testing time beyond what is necessary to assess the student’s knowledge accu-
rately (Anastasi & Urbina, 1997). It should be noted that publishers of major group achieve-
ment tests are taking steps to address these criticisms. For example, to allay concerns about
the extensive use of selected-response items, more standardized achievement tests are being
developed that incorporate a larger number of constructed-response items and performance
tasks. To address concerns about limited flexibility in administration, online and computer-
based assessments are becoming increasingly available.
In this section we will be discussing a number of standardized group achievement
tests. Many of these tests are developed by large test publishing companies and are com-
mercially available to all qualified buyers (e.g., legitimate educational institutions). In ad-
dition to these commercially available tests, many states have started developing their own
achievement tests that are specifically tailored to assess the state curriculum. These are often
standards-based assessments used in high-stakes testing programs. We will start by briefly
introducing some of the major commercially available achievement tests.

Commercially Developed Group Achievement Tests


Commercially developed group achievement tests are test batteries developed for use in
public schools around the nation and available for purchase by qualified professionals or
institutions. The most popular tests are comprehensive batteries designed to assess achieve-
ment in multiple academic areas such as reading, language arts, mathematics, science, and
social studies. These comprehensive tests are often referred to as survey batteries. As noted,



many school districts use standardized achievement tests to track student achievement over
time or to compare performance across classes, schools, or districts. These batteries typically
contain multiple subtests that assess achievement in specific curricular areas (e.g., reading,
language, mathematics, and science). These subtests are organized in
a series of test levels that span different grades. For example, a subtest might have four levels with
one level covering kindergarten through the 2nd grade, the second level covering grades 3 and 4,
the third level covering grades 5 and 6, and the fourth level covering grades 7 and 8 (Nitko, 2001).
The most widely used standardized group achievement tests are produced and distributed by three
publishers: CTB McGraw-Hill, Harcourt Assessment, and Riverside Publishing.

CTB McGraw-Hill. CTB McGraw-Hill publishes three popular standardized group
achievement tests: the California Achievement Tests, Fifth Edition (CAT/5), the TerraNova
CTBS, and TerraNova The Second Edition (CAT/6).

California Achievement Tests, Fifth Edition (CAT/5). The CAT/5, designed for use with
students from kindergarten through grade 12, is described as a traditional achievement
battery. The CAT/5 assesses content in Reading, Spelling, Language, Mathematics, Study
Skills, Science, and Social Studies. It is available in different formats for different applica-
tions (e.g., Complete Battery, Basic Battery, Short Form). The CAT/5 can be paired with
the Tests of Cognitive Skills, Second Edition (TCS/2), a measure of academic aptitude, to
allow comparison of achievement–aptitude abilities (we will discuss the potential benefits
of making achievement–aptitude comparisons in the next chapter).

TerraNova CTBS. This is a revision of Comprehensive Tests of Basic Skills, Fourth Edi-
tion. The TerraNova CTBS, designed for use with students from kindergarten through
grade 12, was published in 1997. It combines selected-response and constructed-response
items that allow students to respond in a variety of formats. The TerraNova CTBS assesses
content in Reading/Language Arts, Mathematics, Science, and Social Studies. An expanded
version adds Word Analysis, Vocabulary, Language Mechanics, Spelling, and Mathematics
Computation. The TerraNova CTBS is available in different formats for different applica-
tions (e.g., Complete Battery, Complete Battery Plus, Basic Battery). The TerraNova CTBS
can be paired with the Tests of Cognitive Skills, Second Edition (TCS/2), a measure of
academic aptitude, to compare achievement–aptitude abilities.

TerraNova The Second Edition (CAT/6). TerraNova The Second Edition, or CAT/6, is de-
scribed as a comprehensive modular achievement battery designed for use with students from
kindergarten through grade 12 and contains year 2000 normative data. The CAT/6 assesses
content in Reading/Language Arts, Mathematics, Science, and Social Studies. An expanded
version adds Word Analysis, Vocabulary, Language Mechanics, Spelling, and Mathematics
Computation. It is available in different formats for different applications (e.g., CAT Multiple
Assessments, CAT Basic Multiple Assessment, CAT Plus). The CAT/6 can be paired with
InView, a measure of cognitive abilities, to compare achievement—aptitude abilities.

Harcourt Assessment, Inc. Harcourt Assessment, Inc., formerly Harcourt Educational


Measurement, publishes the Stanford Achievement Test Series, Tenth Edition (Stanford
10). Originally published in 1923, the Stanford Achievement Test Series has a long and rich
history of use.

Stanford Achievement Test Series, Tenth Edition (Stanford 10). The Stanford 10 can be
used with students from kindergarten through grade 12 and has year 2002 normative data.
It assesses content in Reading, Mathematics, Language, Spelling, Listening, Science, and
Social Science. The Stanford 10 is available in a variety of forms, including abbreviated
and complete batteries. The Stanford 10 can be administered with the Otis-Lennon School
Ability Test, Eighth Edition (OLSAT-8). Also available from Harcourt Assessment, Inc. are
the Stanford Diagnostic Mathematics Test, Fourth Edition (SDMT 4) and the Stanford Di-
agnostic Reading Test, Fourth Edition (SDRT 4), which provide detailed information about
the specific strengths and weaknesses of students in mathematics and reading.

Riverside Publishing. Riverside Publishing produces three major achievement tests: the
Iowa Tests of Basic Skills (ITBS), Iowa Tests of Educational Development (ITED), and
Tests of Achievement and Proficiency (TAP).

Iowa Tests of Basic Skills (ITBS). The ITBS is designed for use with students from kin-
dergarten through grade 8 and, as the name suggests, is designed to provide a thorough as-
sessment of basic academic skills. The most current ITBS form was published in 2001. The
ITBS assesses content in Reading, Language Arts, Mathematics, Science, Social Studies,
and Sources of Information. The ITBS is available in different formats for different applica-
tions (e.g., Complete Battery, Core Battery, Survey Battery). The ITBS can be paired with
the Cognitive Abilities Test (CogAT), Form 6, a measure of general and specific cognitive
skills, to allow comparison of achievement–aptitude abilities. Figures 12.1, 12.2, and 12.3
provide sample score reports for the ITBS and other tests published by Riverside Publishing.

Iowa Tests of Educational Development (ITED). The ITED, designed for use with stu-
dents from grades 9 through 12, was published in 2001 to measure the long-term goals
of secondary education. The ITED assesses content in Vocabulary, Reading Compre-
hension, Language: Revising Written Materials, Spelling, Mathematics: Concepts and
Problem Solving, Computation, Analysis of Science Materials, Analysis of Social Stud-
ies Materials, and Sources of Information. The ITED is available as both a complete
battery and a core battery. The ITED can be paired with the Cognitive Abilities Test
(CogAT), Form 6, a measure of general and specific cognitive skills, to allow comparison
of achievement—aptitude abilities.

Tests of Achievement and Proficiency (TAP). The TAP, designed for use with stu-
dents from grades 9 through 12, was published in 1996 to measure skills necessary for
growth in secondary school. The TAP assesses content and skills in Vocabulary, Reading
Comprehension, Written Expression, Mathematics Concepts and Problem Solving, Math
Computation, Science, Social Studies, and Information Processing. It also contains an


FIGURE 12.1 Performance Profile for Iowa Tests of Basic Skills (ITBS). This figure
illustrates a Performance Profile for the Iowa Tests of Basic Skills (ITBS). It is one of the
score report formats that Riverside Publishing provides for the ITBS. The display in the upper-
left portion of the report provides numerical scores for the individual tests, totals, and overall
composite. The National Percentile Rank is also displayed using confidence bands immediately
to the right of the numerical scores. The display in the lower portion of the report provides
detailed information about the specific skills measured in each test. Riverside Publishing
provides an Interpretive Guide for Teachers and Counselors that provides detailed guidance for
interpreting the different score reports.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001 The River-
side Publishing Company. All rights reserved.

optional Student Questionnaire that solicits information about attitudes toward school, ex-
tracurricular interests, and long-term educational and career plans. The TAP is available
as both a complete battery and a survey battery. The TAP can be paired with the Cognitive
Abilities Test (CogAT), Form 5, a measure of general and specific cognitive skills, to allow
comparison of achievement—aptitude abilities.


FIGURE 12.2 Profile Narrative for Iowa Tests of Basic Skills (ITBS). This figure illustrates
the Profile Narrative report format available for the Iowa Tests of Basic Skills (ITBS). Although
this format does not provide detailed information about the skills assessed in each test, as in the
Performance Profile shown in Figure 12.1, it does provide an easy-to-understand discussion of
the student’s performance. This format describes the student’s performance on the composite
score (reflecting the student’s overall level of achievement) and the two reading tests (i.e.,
Vocabulary and Reading Comprehension). This report also identifies the student’s relative
strengths and areas that might need attention. This report illustrates the reporting of both state
and national percentile ranks.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001 The River-
side Publishing Company. All rights reserved.

Supplemental Constructed-Response and Performance Assessments. As we dis-


cussed in earlier chapters, many educators criticize tests that rely extensively on selected-
response items and advocate the use of constructed-response and performance assessments.
To address this criticism, many of the survey batteries previously discussed provide open-
ended and performance assessments to complement their standard batteries. For example,


FIGURE 12.3. Score Labels for the Iowa Tests and CogAT. This figure presents student score
labels for the Iowa Tests of Basic Skills (ITBS), Iowa Tests of Educational Development (ITED),
and Cognitive Abilities Test (CogAT). The CogAT is an ability test discussed in the next chapter.
These labels are intended for use in the students’ cumulative records and allow educators to track
student growth over time.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001 The River-
side Publishing Company. All rights reserved.

Riverside Publishing offers the Performance Assessments for ITBS, ITED, and TAP. These
are norm-referenced open-ended assessments in Integrated Language Arts, Mathematics,
Science, and Social Studies. These free-response assessments give students the opportunity
to demonstrate content-specific knowledge and higher-order cognitive skills in a more life-
like context. Other publishers have similar products available that supplement their survey
batteries.

Diagnostic Achievement Tests. The most widely used achievement tests have been
the broad survey batteries designed to assess the student’s level of achievement in broad

academic areas. Although these batteries do a good job in this context, they typically have
too few items that measure specific skills and learning objectives to be useful to teachers
when making instructional decisions. For example, the test results might suggest that a
particular student’s performance is low in mathematics, but the results will not pinpoint the
student’s specific strengths and weaknesses. To address this limitation, many test publishers
have developed diagnostic achievement tests. These diagnostic batteries contain a larger
number of items linked to each specific learning objective. In this way they can provide
more precise information about which academic skills have been achieved and which have
not. Examples of group-administered diagnostic achievement tests include the Stanford Di-
agnostic Reading Test, Fourth Edition (SDRT 4) and the Stanford Diagnostic Mathematics
Test, Fourth Edition (SDMT 4), both published by Harcourt Assessment, Inc. Most other
publishers have similar diagnostic tests to complement their survey batteries.
Obviously these have been very brief descriptions of these major test batteries. These
summaries were based on information current at the time of this writing. However, these tests
are continuously being revised to reflect curricular changes and to update normative data.
For the most current information, interested readers should access the Internet sites for the
publishing companies (see Table 12.1) or refer to the current edition of the Mental Measure-
ments Yearbook or other reference resources. See Special Interest Topic 12.2 for information
on these resources.

State-Developed Achievement Tests


As we noted earlier, standardized achievement tests are increasingly being used in mak-
ing high-stakes decisions at the state level (e.g., which students are promoted or graduate;
rating teachers, administrators, schools, and school districts). While all states now have
statewide testing programs, different states have adopted different
approaches. Some states utilize commercially available achievement batteries like those described
in the previous section (often referred to as off-the-shelf tests). An advantage of these commercial
tests is that they provide normative data based on national samples. This

TABLE 12.1 Major Publishers of Standardized Group Achievement Tests

CTB McGraw-Hill Web site: www.ctb.com


California Achievement Tests, Fifth Edition (CAT/5)
TerraNova CTBS
TerraNova The Second Edition (CAT/6)

Harcourt Assessment, Inc. Web site: http://harcourtassessment.com


Stanford Achievement Test Series, Tenth Edition (Stanford 10)

Riverside Publishing Web site: www.riverpub.com


Iowa Tests of Basic Skills (ITBS)
Iowa Tests of Educational Development (ITED)
Tests of Achievement and Proficiency (TAP)


SPECIAL INTEREST TOPIC 12.2


Finding Information on Standardized Tests

When you want to locate information on a standardized test, it is reasonable to begin by exam-
ining information provided by the test publishers. This can include their Internet sites, catalogs,
test manuals, specimen test sets, score reports, and other supporting documentation. However, you
should also seek out resources that provide independent evaluations and reviews of the tests you
are researching. The Testing Office of the American Psychological Association Science Directorate
(American Psychological Association, 2008) provides the following description of the four most
popular resources:

■ Mental Measurements Yearbook (MMY). MMY, published by the Buros Institute for Mental
Measurements, lists tests alphabetically by title. Each listing provides basic descriptive in-
formation about the test (e.g., author, date of publication) plus information about the avail-
ability of technical information and scoring and reporting services. Most listings also include
one or more critical reviews by qualified assessment experts.
■ Tests in Print (TIP). TIP, also published by the Buros Institute for Mental Measurements, is a
bibliographic encyclopedia of information on practically every published test in psychology
and education. Each listing provides basic descriptive information on tests, but does not contain
critical reviews or psychometric information. After locating a test that meets your criteria, you
can turn to the Mental Measurements Yearbook for more detailed information on the test.
■ Test Critiques. Test Critiques, published by Pro-Ed, Inc., contains a three-part listing for
each test that includes Introduction, Practical Applications/Uses, and Technical Aspects,
followed by a critical review of the test.
■ Tests. Tests, also published by Pro-Ed, Inc., is a bibliographic encyclopedia covering thou-
sands of assessments in psychology and education. It provides basic descriptive information
on tests, but does not contain critical reviews or information on reliability, validity, or other
technical aspects of the tests. It serves as a companion to Test Critiques.

These resources can be located in the reference section of most college and larger public libraries.
In addition to these traditional references, Test Reviews Online is a Web-based service of the Buros
Institute of Mental Measurements (www.unl.edu/buros). This service makes test reviews available
online to individuals precisely as they appear in the Mental Measurements Yearbook. For a relatively
small fee (currently $15), users can download information on any of over 2,000 tests.

allows one to compare a student’s performance to that of students across the nation, not
only students from one’s state or school district. For example, one could find that Johnny’s
reading performance was at the 70th percentile relative to a national normative group. Using
these commercial tests it is also possible to compare state or local groups (e.g., a district,
school, or class) to a national sample. For example, one might find that a school district’s
mean 4th-grade reading score was at the 55th percentile based on national normative data.
These comparisons can provide useful information to school administrators, parents, and
other stakeholders.
All states have developed educational standards that specify the academic knowl-
edge and skills their students are expected to achieve (see Special Interest Topic 12.3 for

information on state standards). One significant limitation of using a commercial off-the-shelf national test is that it might not closely match the state's curriculum standards. Boser, for example, described a study commissioned by the education department of California that examined a number of commercially available achievement tests to see how well they aligned with the state's math standards. The study found that the off-the-shelf tests focused primarily on basic math skills and did not adequately
assess whether students had mastered the state’s standards. This comes down to a question
of test validity. If you are interested in assessing what is being taught in classrooms across
the nation, the commercially available group achievement tests probably give you a good
measure. However, if you are more interested in determining whether your students are
mastering your state’s content standards, off-the-shelf achievement tests are less adequate
and state-developed content-based assessments are preferable.
To address this limitation, many states have developed their own achievement batter-
ies that are designed to closely match the state’s curriculum. In contrast to the commercial
tests that typically report normative scores, state-developed tests often emphasize criterion-
referenced score interpretations. In Texas there is a statewide program that includes the
Texas Assessment of Knowledge and Skills (TAKS). The TAKS measures the success of
students in the state’s curriculum in reading (grades 3 through 9), mathematics (grades 3
through 11), writing (grades 4 and 7), English language arts (grades 10 and 11), science

SPECIAL INTEREST TOPIC 12.3

Standards-Based Assessments

AERA et al. (1999) defines standards-based assessments as tests that are designed to measure clearly
defined content and performance standards. In this context, content standards are statements that
specify what students are expected to achieve in a given subject matter at a specific grade (e.g.,
Mathematics, Grade 5). In other words, content standards specify the skills and knowledge we want
our students to master. Performance standards specify a level of performance, typically in the form
of a cut score or a range of scores that indicates achievement levels. That is, performance standards
specify what constitutes acceptable performance. National and state educational standards have been
developed and can be easily accessed via the Internet. Below are a few examples of state educational
Internet sites that specify the state standards.

■ California. Content Standards for California Public Schools: www.cde.ca.gov/standards
■ Florida. Sunshine State Standards: http://sunshinestatestandards.net
■ New York. Learning Standards: www.emsc.nysed.gov/guides
■ Texas. Texas Essential Knowledge and Skills: www.tea.state.tx.us/teks

Education World provides a Web site that allows you to easily access state and national standards.
The site for state standards is www.education-world.com/standards/state/index.shtml.
The site for national standards is www.education-world.com/standards/national.

(grades 5, 10, and 11), and social studies (grades 8, 10, and 11). There is a Spanish TAKS
that is administered in grades 3 through 6. The decision to promote a student to the next
grade may be based on passing the reading and math sections, and successful completion
of the TAKS at grade 11 is required for students to receive a high school diploma. The
statewide assessment program contains two additional tests. There is a Reading Proficiency
Test in English (RPTE) that is administered to limited English proficient students to assess
annual growth in reading proficiency. Finally there is the State-Developed Alternative As-
sessment (SDAA) that can be used with special education students when it is determined
that the standard TAKS is inappropriate. All of these tests are designed to measure the
educational objectives specified in the state curriculum, the Texas Essential Knowledge and
Skills curriculum (TEKS) (see www.tea.state.tx.us).
Some states have developed hybrid assessment strategies to assess student performance
and meet accountability requirements. For example, some states use a combination of state-
developed tests and commercial off-the-shelf tests, using different tests at different grade
levels. Another approach, commonly referred to as augmented testing, involves the use of a
commercial test that is administered along with test sections that address any misalignment

TABLE 12.2 State Assessment Practices—2007

State                    State-Developed Tests (Criterion-Referenced)    Augmented/Hybrid Tests    Off-the-Shelf Tests (Norm-Referenced)

Alabama Yes No Yes


Alaska Yes No Yes
Arizona Yes Yes Yes
Arkansas Yes No Yes
California Yes No Yes
Colorado Yes No Yes
Connecticut Yes No No
Delaware No Yes No
District of Columbia Yes No No
Florida Yes No Yes
Georgia Yes No Yes
Hawaii No Yes No
Idaho Yes No No
Illinois Yes Yes No
Indiana Yes No No
Iowa No No Yes
Kansas Yes No No
Kentucky Yes No Yes
Louisiana Yes Yes No
Maine Yes No Yes
Maryland Yes Yes No
Massachusetts Yes No No
Michigan Yes No Yes
Minnesota Yes No No
Mississippi Yes No No
Missouri No Yes No
Montana Yes No Yes
Nebraska Yes No No
Nevada Yes No Yes
New Hampshire Yes No No
New Jersey Yes No No
New Mexico Yes No Yes
New York Yes No No
North Carolina Yes No No
North Dakota Yes No No
Ohio Yes No No
Oklahoma Yes No No
Oregon Yes No No
Pennsylvania Yes No No
Rhode Island Yes Yes No
South Carolina Yes No No
South Dakota No Yes Yes
Tennessee Yes No No
Texas Yes No No
Utah Yes No Yes
Vermont Yes No No
Virginia Yes No No
Washington Yes No No
West Virginia Yes No Yes
Wisconsin No Yes No
Wyoming Yes No No
Totals 45 10 18
Note: Data provided by Education Week, accessed August 9, 2007, at www.edcounts.org/createtable/step1.php?clear=1. State-Developed Test (Criterion-Referenced): defined as tests that are custom-made to correspond to state content standards. Augmented/Hybrid Test: defined as tests that incorporate aspects of both commercially developed norm-referenced and state-developed criterion-referenced tests (includes commercial tests augmented or customized to match state standards). Off-the-Shelf Test (Norm-Referenced): defined as commercially developed norm-referenced tests that have not been modified to specifically reflect state standards.

between state standards and the content of the commercial test. Table 12.2 provides informa-
tion on the assessment strategies used as of 2007 in state assessment programs (Education
Week, 2007). A review of this table reveals that the majority of states (i.e., 45) have state-
developed tests that are specifically designed to align with their standards. Only one state (i.e.,
Iowa) reported exclusively using an off-the-shelf test. It should be noted that any report of state
assessment practices is only a snapshot of an ever-changing picture. The best way to get infor-
mation on your state’s current assessment practices is to go to the Web
site of the state's board of education and verify the current status.
There is considerable controversy concerning statewide testing programs. Proponents of high-stakes testing programs see them as a way of increasing academic expectations and ensuring that all students are judged according to the same standards. They say these testing programs guarantee that students graduating from public schools have the skills necessary to be successful in life after high school. Critics of these testing programs argue that these tests emphasize rote learning and often neglect critical thinking, problem solving, and communication skills. To exacerbate the problem, critics feel that too much instructional time is spent preparing students for the tests instead of teaching the really important skills teachers would like to focus on. Additionally, they argue that these tests are culturally biased and are not fair to minority students (Doherty, 2002). For additional information on high-stakes testing programs, see Special
Interest Topics 12.4 and 12.5. This debate is likely to continue for
the foreseeable future, but in the meantime these tests will continue to play an important
role in public schools.

Value-Added Assessment: A New Approach to Educational Accountability
The term value-added has been used in business and industry to mean the economic value gain that occurs when material is changed through manufacturing or manipulation. In education, value-added assessment focuses on the change in a student's knowledge as the result of instruction. In many ways, it can be seen as determining the value of instruction in raising knowledge levels (however, the model does not attempt to determine the many benefits of schooling that go beyond knowledge acquisition). One of the most complex models of value-added assessment has been developed in Tennessee (Ceperley & Reel, 1997; Sanders, Saxton, & Horn, 1997). This model also has been implemented in a somewhat different form in Dallas (Webster & Mendro, 1997). This is a rather complex model, and the basic ideas are presented here in a hypothetical situation.
Consider students who attend Washington School in East Bunslip, New Jersey, in Ms. Jones' 3rd-grade class (all names are made up). These students may be typical or representative of 3rd-grade students, or there may be a substantial proportion of excellent or poor students. Ms. Jones teaches in her style, and the students are given the state achievement test

SPECIAL INTEREST TOPIC 12.4


American Educational Research Association (AERA)
Position Statement on High-Stakes Testing

The American Educational Research Association (AERA) is a leading organization that studies
educational issues. The AERA (2000) presented a position statement regarding high-stakes testing
programs employed in many states and school districts. Its position is summarized in the following
points:

1. Important decisions should not be based on a single test score. Ideally, information from multiple sources should be taken into consideration when making high-stakes decisions. When tests are the basis of important decisions, students should be given multiple opportunities to take the test.
2. When students and teachers are going to be held responsible for new content or standards, they should be given adequate time and resources to prepare themselves before being tested.
3. Each test should be validated for each intended use. For example, if a test is going to be used for determining which students are promoted and for ranking schools based on educational effectiveness, both interpretations must be validated.
4. If there is the potential for adverse effects associated with a testing program, efforts should be made to make all involved parties aware of them.
5. There should be alignment between the assessments and the state content standards.
6. When specific cut scores are used to denote achievement levels, the purpose, meaning, and validity of these passing scores should be established.
7. Students who fail a high-stakes test should be given adequate opportunities to overcome any deficiencies.
8. Adequate consideration and accommodations should be given to students with language differences.
9. Adequate consideration and accommodations should be given to students with disabilities.
10. When districts, schools, or classes are to be compared, it is important to specify clearly which students are to be tested and which students are exempt, and to ensure that these guidelines are followed.
11. Test scores must be reliable.
12. There should be an ongoing evaluation of both the intended and unintended effects of any high-stakes testing program.

These guidelines may be useful when trying to evaluate the testing programs your state or
school employs. For more information, the full text of this position statement can be accessed at
www.aera.net/about/policy/stakes.htm.

at the end of the year. For this example, let’s assume that statewide testing begins in grade 3.
The results of the state test, student by student, are used to build a model of performance for
each student, for Ms. Jones, for Washington School, and for the East Bunslip school district.
One year’s data are inadequate to do more than simply mark the levels of performance of
each focus for achievement: student, teacher, school, and district.

SPECIAL INTEREST TOPIC 12.5


Why Standardized Tests Should Not Be
Used to Evaluate Educational Quality

W. James Popham (2000) provided three reasons why he feels standardized achievement tests should
not be used to evaluate educational effectiveness or quality:

1. There may be poor alignment between what is taught in schools and what is measured by
the tests. Obviously, if the test is not measuring what is being taught in schools, this will
undermine the validity of any interpretations regarding the quality of an education.
2. In an effort to maximize score variance, test publishers often delete items that are relatively
easy. Although this is a standard practice intended to enhance the measurement characteris-
tics of a test, it may have the unintended effect of deleting items that measure learning objec-
tives that teachers feel are most important and emphasize in their instruction. He reasons that
the items might be easy because the teachers focused on the objectives until practically all of
the students mastered them.
3. Standardized achievement tests may reflect more than what is taught in schools. Popham
notes that performance on standardized achievement tests reflects the students’ intellectual
ability, what is taught in school, and what they learn outside of school. As a result, to interpret
them as reflecting only what is taught in school is illogical, and it is inappropriate to use them
as a measure of educational quality.

The next year Ms. Jones’s previous students have been dispersed to several 4th-grade
classrooms. A few of her students move to different school districts, but most stay in East
Bunslip and most stay at Washington School. All of this information will be included in the
modeling of performance. Ms. Jones now has a new class of students who enter the value-
added assessment system. At the end of this year there is now data on each student who com-
pleted 4th grade, although some students may have been lost through attrition (e.g., missed
the testing, left the state). The Tennessee model includes a procedure that accounts for all of
these “errors.” The performance of the 4th-grade students can now be evaluated in terms of
their 3rd-grade performance and the effect of their previous teacher, Ms. Jones, and the effect
of their current teacher (assuming that teacher also taught last year and there was assessment
data for the class taught). In addition a school-level effect can be estimated. Thus, the value-
added system attempts to explain achievement performance for each level in the school system
by using information from each level. This is clearly a very complex undertaking for an entire
state’s data. As of 1997, Sanders et al. noted that over 4 million data points in the Tennessee
system were used to estimate effects for each student, teacher, school, and district.
The actual value-added component is not estimated as a gain, but as the difference in
performance from the expected performance based on the student’s previous performance,
current grade in school effect, sum of current and previous teacher effectiveness, and school
effectiveness. When three or more years’ data become available, longitudinal trend models can
be developed to predict the performance in each year for the various sources discussed.
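To make this logic concrete, the core idea can be sketched with a simple expression (the notation is ours and greatly simplified; the actual Tennessee model estimates these effects simultaneously in a far more elaborate longitudinal mixed-model analysis):

\[
\text{ValueAdded}_{i,t} \;=\; Y_{i,t} \;-\; \hat{Y}_{i,t},
\qquad
\hat{Y}_{i,t} \;=\; \mu_{g} \;+\; \beta\,Y_{i,t-1} \;+\; \tau_{\text{current teacher}} \;+\; \tau_{\text{previous teacher}} \;+\; \gamma_{\text{school}}
\]

Here \(Y_{i,t}\) is student i's observed score in year t and \(\hat{Y}_{i,t}\) is the expected score given the student's previous performance (\(Y_{i,t-1}\)), the current grade-level effect (\(\mu_{g}\)), the current and previous teacher effectiveness estimates (the \(\tau\) terms), and the school effect (\(\gamma\)); a positive difference indicates performance above expectation.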

The system is intended to (1) guide instructional change through inspection of the teacher and grade-level estimates of average performance and
(2) evaluate teachers and administrators by examining consistency of performance averages
across years. The second purpose is certainly controversial and has its detractors. In par-
ticular, teacher evaluation based on state assessments has been criticized due to the limited
coverage of the state tests. This, it is argued, has resulted in reduced coverage of content,
focus on low-level conceptual understanding, and overemphasis on “teaching to the test”
at the expense of content instruction. Nevertheless, there is continued interest in the value-
added models and their use will likely increase.

Best Practices in Using Standardized Achievement Tests in Schools
As you can see from our discussion of standardized achievement tests to this point,
these tests have widespread applications in schools today. As a result, modern teachers
are often asked to prepare students for these tests as well as to administer and interpret
them. In this section we will briefly discuss some guidelines for completing these tasks.
We will start by discussing how teachers can prepare students to take standardized tests.
This discussion will focus on group-administered achievement tests, but can actually
be generalized to many of the other standardized tests we will be discussing in this and
later chapters.

Preparing Students for the Test. Much has been written in recent years about the proper
procedures or practices for preparing students to take standardized achievement tests. As
we noted earlier, high-stakes testing programs are in place in every state, and these tests are
used to make important decisions such as which students graduate or get promoted, which
teachers receive raises, and which administrators retain their jobs. As you might imagine,
the pressure to ensure that students perform well on these tests has also increased. Legisla-
tors exert pressure on state education officials to increase student performance, who in turn
put pressure on local administrators, who in turn put pressure on teachers. An important
question is “What test preparation practices are legitimate and acceptable, and what prac-
tices are unethical or educationally contraindicated?” This is a more complicated question
than one might first imagine.
A popular phrase currently being used in both the popular media and professional educational literature is teaching to the test. This phrase generally implies efforts by teachers to prepare students to perform better on standardized achievement tests. Many writers use "teaching to the test" in a derogatory manner, referencing unethical or inappropriate preparation practices. Other writers use the phrase more broadly to reference any instruction designed to enhance performance on a test. As you will see, a wide range of test preparation practices can be applied. Some of

these practices are clearly appropriate whereas others are clearly inappropriate. As an
extreme example, consider a teacher who shared the exact items from a standardized test
that is to be administered to students. This practice is clearly a breach of test security and is
tantamount to cheating. It is unethical and educationally indefensible and most responsible
educators would not even consider such a practice. In fact, such a breach of test security
could be grounds for the dismissal of the teacher, revocation of license, and possible legal
charges (Kober, 2002).
Thankfully such flagrantly abusive practices are relatively rare, but they do occur.
However, the appropriateness of some of the more common methods of preparing students for tests is less clear. With one notable exception, which we will describe next, it is generally accepted that any test preparation practice that raises test scores without also increasing the mastery of the underlying knowledge and skills is inappropriate. You may recognize that this involves the issue of test validity. Standardized
achievement tests are meant to assess the academic achievement of students in specific
areas. If test preparation practices increase test scores without increasing the level of
achievement, the validity of the test is compromised. Consider the following examples
of various test preparation procedures.

Instruction in Generic Test-Taking Skills. This involves instruction in general test-taking skills such as completing answer sheets, establishing an appropriate pace, narrowing
choices on selected-response items, and introductions to novel item formats (e.g., Kober,
2002). This is the “notable exception” to the general rule just noted. Instruction in general
test-taking skills does not increase mastery of the underlying knowledge and skills, but it
does make students more familiar and comfortable with standardized tests. As a result, their
scores are more likely to reflect accurately their true academic abilities and not the influence
of deficient test-taking skills (e.g., Linn & Gronlund, 2000; Popham, 1999). This practice
enhances the validity of the assessment. This type of instruction is also typically fairly brief
and, as a result, not detrimental to other educational activities. Therefore, instruction in
generic test-taking skills is an appropriate preparation practice (see Table 12.3).

Preparation Using Practice Forms of the Test. Many states and commercial test publishers
release earlier versions of their exams as practice tests. Because these are released as practice
tests, their use is not typically considered unethical. However, if these tests become the focus
of instruction at the expense of other teaching activities, this practice can be harmful. Re-
search suggests that direct instruction using practice tests may produce short-term increases
in test scores without commensurate increases in performance on other measures of the test
domain (Kober, 2002). Like instruction in generic test-taking skills, the limited use of practice
tests may help familiarize students with the format of the test. However, practice tests should
be used in a judicious manner to ensure that they do not become the focus of instruction.

Preparation Emphasizing Test-Specific Item Formats. Here teachers provide instruc-


tion and assignments that prepare students to deal exclusively with the specific item for-
mats used on the standardized test. For example, teachers might use classroom tests and

TABLE 12.3 Important Test-Taking Skills to Teach Students

1. Carefully listen to or read the instructions.


2. Carefully listen to or read the test items.
3. Establish an appropriate pace. Do not rush carelessly through the test, but do not proceed so
slowly you will not be able to finish.
4. If you find an item to be extremely difficult, do not spend an inordinate amount of time on it.
Skip it and come back if time allows.
5. On selected-response items, make informed guesses by eliminating alternatives that are clearly
wrong.
6. Unless there is a penalty for guessing, make an effort to complete every item. It is better to try
to guess the correct answer than simply leave it blank.
7. Ensure that you carefully mark the answer sheet. For example, on computer-scored answer
sheets, make sure the entire space is darkened and avoid extraneous marks.
8. During the test periodically verify that the item numbers and answer numbers match.
9. If time permits, go back and check your answers.

Sources: Based on Linn & Gronlund (2000) and Sarnacki (1979).

homework assignments that resemble actual items on the test (Kober, 2002). If the writing
section of a test requires single-paragraph responses, teachers will restrict their writing
assignments to a single paragraph. If a test uses only multiple-choice items, the teachers
will limit their classroom tests to multiple-choice items. The key feature is that students
are given instruction exposing them only to the material as presented and measured on
the test. With this approach students will be limited in their ability to generalize acquired
skills and knowledge to novel situations (Popham, 1999). Test scores may increase, but
the students’ mastery of the underlying domain is limited. As a result, this practice should
be avoided.

Preparation Emphasizing Test Content. This practice is somewhat similar to the previous
one, but instead of providing extensive exposure to items resembling those on the test, the
goal is to emphasize the skills and content most likely to be included on the standardized
tests. Kober (2002) notes that this practice often has a “narrowing effect” on instruction.
Because many standardized achievement tests emphasize basic skills and knowledge that
can be easily measured with selected-response items, this practice may result in teachers
neglecting more complex learning objectives such as the analysis and synthesis of informa-
tion or development of complex problem-solving skills. While test scores may increase, the
students’ mastery of the underlying domain is restricted. This practice should be avoided.

Preparation Using Multiple Instructional Techniques. With this approach students are
given instruction that exposes them to the material as conceptualized and measured on the
test, but also presents the material in a variety of different formats. Instruction covers all
salient knowledge and skills in the curriculum and addresses both basic and higher-order
learning objectives (Kober, 2002). With this approach, increases in test scores are associ-
ated with increases in mastery of the underlying domain of skills and knowledge (Popham,
1999). As a result, this test preparation practice is recommended.

Although this list of test preparation practices is not exhaustive, we have tried to address the most common forms. In summary, only preparation that introduces generic test-taking skills and uses multiple instructional techniques can be recommended enthusiastically. Teaching generic test-taking skills makes students more familiar and comfortable with the assessment process, and as a result enhances the validity of the assessment. The use of multiple instructional techniques results in enhanced test performance that reflects an increased mastery of the content domain. As a result, neither of these practices compromises the validity of
the score interpretation as reflecting domain-specific knowledge. Other test preparation
practices generally fall short of this goal. For example, practice tests may be useful when
used cautiously, but they are often overused and become the focus of instruction with det-
rimental results. Any procedures that emphasize test-specific content or test-specific item
formats should be avoided because they may increase test scores without actually enhanc-
ing mastery of the underlying test domain.

Administering Standardized Tests. When introducing this chapter we noted that standard-
ized tests are professionally developed and must be administered and scored in a standard
manner. For standardized scores to be meaningful and useful, it is imperative to follow these
standard procedures precisely. These procedures are explicitly specified so that the tests can be administered in a uniform manner in different settings. For example, it is obviously important for all students to receive the same instructions and same time limits at each testing site in order for the results to be comparable. Teachers are often responsible for administering group achievement tests to their students and as a result should understand the basics of standardized test administra-
tion. Here are a few guidelines to help teachers in standardized test
administration to their students that are based on our own experience and a review of the
literature (e.g., Kubiszyn & Borich, 2003; Linn & Gronlund, 2000; Popham, 1999, 2000).

Review the Test Administration Manual before the Day of the Test. Administering stan-
dardized tests is not an overly difficult process, but it is helpful to review the administration
instructions carefully before the day of the test. This way you will be familiar with the
procedures and there should be no surprises. This review will alert you to any devices (e.g.,
stopwatch) or supporting material (e.g., scratch paper) you may need during the adminis-
tration. It is also beneficial to do a mock administration by reading the instructions for the
test in private before administering it to the students. The more familiar you are with the
administration instructions, the better prepared you will be to administer the test. Addition-
ally, you will find the actual testing session to be less stressful.

Encourage the Students to Do Their Best. Standardized achievement tests (and most other
standardized tests used in schools) are maximum performance tests and ideally students will
put forth their best efforts. This is best achieved by explaining to the students how the test
results will be used to their benefit. For example, with achievement tests you might tell the
students that the results can help them and their parents track their academic progress and

identify any areas that need special attention. Although it is important to motivate students
to do their best, it is equally crucial to avoid unnecessarily raising their level of anxiety. For
example, you would probably not want to focus on the negative consequences of poor perfor-
mance immediately before administering the test. This presents a type of balancing act; you
want to encourage the students to do their best without making them excessively anxious.

Closely Follow Instructions. As we noted, the reliability and validity of the test results
are dependent on the individual administering the test closely following the administration
instructions. First, the instructions to students must be read word for word. Do not alter the
instructions in any way, paraphrase them, or try to improvise. It is likely that some students
will have questions, but you are limited in how you can respond. Most manuals indicate
that you can clarify procedural questions (e.g., where do I sign my name?), but you cannot
define words or in any other way provide hints to the answers.

Strictly Adhere to Time Limits. Bring a stopwatch and practice using it before the day of
the test.

Avoid Interruptions. Avoid making announcements or any other types of interruptions


during the examination. To help avoid outside interruptions you should post a Testing in
Session—Do Not Disturb sign on the door.

Be Alert to Cheating. Although you do not want to hover over the students to the extent
that it makes them unnecessarily nervous, active surveillance is indicated and can help deter
cheating. Stay alert and monitor the room from a position that provides a clear view of the
entire room. Walk quietly around the room occasionally. If you note anything out of the
ordinary, increase your surveillance of those students. Document any unusual events that
might deserve further consideration or follow-up.

By following these suggestions you should have a productive and uneventful testing
session. Nevertheless, be prepared for unanticipated events to occur. Keep the instruction
manual close so you can refer to it if needed. It is also helpful to remember you can rely on
your professional educational training to guide you in case of unexpected events.

Interpreting Standardized Tests. Teachers are also often called on to interpret the re-
sults of standardized tests. This often involves interpreting test results for use in their own
classroom. This can include monitoring student gains in achievement, identifying individ-
ual strengths and weaknesses, evaluating class progress, and planning instruction. At other
times, teachers are called on to interpret the results to parents or even students. Although
report cards document each student’s performance in the class, the results of standardized
tests provide normative information regarding student progress in a broader context (e.g., Linn & Gronlund, 2000).
The key factor in accurately interpreting the results of standardized tests is being familiar with the type of scores reported. In Chapter 3 we presented a review of the major types of test scores test publishers use. As we suggested in that chapter, when report-

ing test results to parents, it is usually best to use percentile ranks. As with all norm-
referenced scores, the percentile rank simply reflects an examinee’s performance relative
to the specific norm group. Percentile ranks are interpreted as indicating the percentage of
individuals scoring below a given point in a distribution. For example, a percentile rank of
75 indicates that 75% of the individuals in the standardization sample scored below this
score. A percentile rank of 30 indicates that only 30% of the individuals in the standardiza-
tion sample scored below this score. Percentile ranks range from 1 to 99, and a rank of 50
indicates median performance. When discussing results in terms of percentile rank, it is
helpful to ensure that they are not misinterpreted as “percent correct” (Kamphaus, 1993).
That is, a percentile rank of 80 means that the examinee scored better than 80% of the
standardization sample, not that he or she correctly answered 80% of the items. Although
most test publishers report grade equivalents, we recommend that you avoid interpreting
them to parents. In Chapter 3 we discussed many of the problems associated with the use
of these scores and why they should be avoided.
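To make the arithmetic behind a percentile rank concrete, here is a small, purely hypothetical sketch (the norm scores and the simple counting rule are ours for illustration; actual publishers derive percentile ranks from large standardization samples and published norm tables):

# Hypothetical illustration: a percentile rank is the percentage of scores in the
# normative sample that fall below the examinee's score (clamped to the 1-99 range).
def percentile_rank(score, norm_scores):
    below = sum(1 for s in norm_scores if s < score)
    return max(1, min(99, round(100 * below / len(norm_scores))))

# Ten hypothetical norm-group scores; an examinee scoring 93 outscores 6 of the 10,
# so the percentile rank is 60, not "60 percent of items correct."
norm_group = [78, 82, 85, 88, 90, 91, 93, 95, 97, 99]
print(percentile_rank(93, norm_group))  # prints 60

Conventions differ slightly in how tied scores are handled (some definitions credit half of the ties), which is one more reason to rely on the publisher's norm tables rather than hand calculation.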
Before leaving our discussion of the use of standardized achievement tests in
schools, it is appropriate to discuss some factors other than academic achievement that
may influence test performance. As we have emphasized in this textbook, it is extremely
important to select and use tests that produce reliable and valid scores. It is also important
to understand that even with the most psychometrically sound tests, factors other than
those we are attempting to measure may influence test performance. Achievement tests
are an attempt to measure students’ academic achievement in specific content areas. An
example of an extraneous factor that might influence performance on a standardized test
is the emotional state or mood of the student. If a student is emotionally upset the day of
a test, his or her performance will likely be impacted (see Special Interest Topic 12.6 for

SPECIAL INTEREST TOPIC 12.6


Deciding Not to Test an Upset Student

A number of years ago when one of our colleagues was working with a private agency, a mother
and her young son (approximately 9 or 10 years of age) came in for their appointment. Although
he does not remember the specifics of the referral, the primary issue was that the child was
having difficulty at school and there was concern that he might have a learning disability. To
determine the basis of his school problems, he was scheduled to receive a battery of individual
standardized tests. On greeting them it was obvious that the child was upset. He sat quietly crying
in the waiting room with his head down. Our colleague asked the mother what was wrong and
she indicated his pet cat had died that morning. She was clearly sensitive to her son’s grief, but
was concerned that it would take months to get another appointment (this agency was typically
booked for months in advance). Our colleague explained to her that he was much too upset to
complete the assessment on this day and that any results would be invalid. To ensure that her son
received the help he needed in a timely manner, they were able to schedule an appointment in a
few weeks. Although teachers may not have this much discretion when scheduling or adminis-
tering standardized tests, they should be observant and sensitive to the effects of emotional state
on test performance.

a personal example). If you can see that a student is upset while taking a test, make a note
of this as it might be useful later in understanding his or her performance. Similarly,
a student’s level of motivation will also influence performance. Students who do not see
the test as important may demonstrate a lackadaisical approach to it. If you notice that a
student is not completing the test or is completing it in a haphazard manner, this should
also be documented.

Individual Achievement Tests

As we noted, standardized achievement tests are also used in the identification, diagnosis,
and classification of students with special learning needs. Although some group-administered achievement tests might be used in identifying children with special needs, in many situations individually administered achievement tests are employed. For example, if a student is having learning difficulties and parents or teachers are concerned about the possibility of a learning disability, the student would likely be given a battery of tests, one being an individual achievement test. A testing professional, with extensive training in psychometrics and test administration, administers these tests to one student at a time. Because the
tests are administered individually, they can contain a wider variety
of item formats. For example, the questions are often presented in different modalities, with
some questions presented orally and some in written format. Certain questions may require
oral responses whereas some require written responses. In assessing writing abilities, some
of these tests elicit short passages whereas others require fairly lengthy essays. Relative to
the group tests, individual achievement tests typically provide a more thorough assessment
of the student’s skills. Because they are administered in a one-to-one context, the examiner
can observe the student closely and hopefully gain insight into the source of learning prob-
lems. Additionally, because these tests are scored individually, they are more likely to incor-
porate open-ended item formats (e.g., essay items) requiring qualitative scoring procedures.
Although regular education teachers typically are not responsible for administering and
interpreting these tests, teachers often do attend special education or placement committee
meetings at which the results of these tests are discussed and used to make eligibility and
placement decisions. As a result, it is beneficial to have some familiarity with these tests. In
this section we will briefly introduce you to some of the most popular individual achieve-
ment tests used in the schools.

Wechsler Individual Achievement Test—Second Edition (WIAT-II; The Psychological Corporation, 2002). The WIAT-II is a comprehensive individually administered
norm-referenced achievement test published by The Psychological Corporation. By com-
prehensive we mean it covers a broad spectrum of academic skill areas. One desirable
feature is its coverage of all of the areas of learning disability recognized in the Education
of All Handicapped Children Act of 1975 and its successors. It contains the following com-
posites and subtests:

■ Reading Composite: composed of the Word Reading subtest (letter knowledge, phonological awareness, and decoding skills), Reading Comprehension subtest (comprehension of short passages, reading rate, and oral reading prosody), and Pseudoword Decoding (phonetic decoding skills).

■ Mathematics Composite: composed of the Numerical Operations subtest (number knowledge, ability to solve calculation problems and simple equations) and Math Reasoning subtest (ability to reason mathematically including identifying geometric shapes, solving word problems, interpreting graphs, etc.).

■ Written Language Composite: composed of the Spelling subtest (ability to write dictated letters and words) and Written Language subtest (transcription, handwriting, written word fluency, generate and combine sentences, extended writing sample).

■ Oral Language Composite: composed of the Listening Comprehension subtest (ability to listen and comprehend verbal information) and Oral Expression subtest (verbal word fluency, repetition, story generation, and providing directions).

The WIAT-II produces a variety of derived scores, including standard scores and per-
centile ranks. The WIAT-II has excellent psychometric properties and documentation. Addi-
tionally, the WIAT-II has the distinct advantage of being statistically linked to the Wechsler
intelligence scales. Linkage with these popular intelligence tests facilitates the aptitude—
achievement discrimination analyses often used to diagnose learning disabilities (this will
be discussed more in the next chapter on aptitude tests).

Woodcock-Johnson III Tests of Achievement (WJ III ACH; Woodcock, McGrew, & Mather, 2001a). The WJ III ACH is a comprehensive individually administered norm-referenced achievement test distributed by Riverside Publishing. The standard battery con-
tains the following cluster scores and subtests:

■ Broad Reading: composed of the Letter-Word Identification subtest (identify letters and pronounce words correctly), Reading Fluency subtest (ability to read simple sentences quickly and decide whether the statement is true or false), and Passage Comprehension subtest (ability to read passages and demonstrate understanding).

■ Oral Language: composed of the Story Recall subtest (ability to recall details of stories presented on an audiotape) and Understanding Directions subtest (ability to follow directions presented on an audiotape).

■ Broad Math: a comprehensive measure of math skills composed of the Calculation subtest (ability to perform mathematical computations), Math Fluency subtest (ability to solve simple math problems quickly), and Applied Problems subtest (ability to analyze and solve math word problems).

■ Math Calculation Skills: a math aggregate cluster composed of the Calculation and Math Fluency subtests.

■ Broad Written Language: a comprehensive measure of writing abilities composed of the Spelling subtest (ability to correctly spell words presented orally), Writing Fluency subtest (ability to formulate and write simple sentences quickly), and Writing Samples subtest (ability to write passages varying in length, vocabulary, grammatical complexity, and abstractness).

■ Written Expression: a writing aggregate cluster composed of the Writing Fluency and Writing Samples subtests.

Other special-purpose clusters can be calculated using the 12 subtests in the stan-
dard battery. In addition, ten more subtests in an extended battery allow the calculation of
supplemental clusters. The WJ III ACH provides a variety of derived scores and has excel-
lent psychometric properties and documentation. A desirable feature of the WJ III ACH is
its availability in two parallel forms, which is an advantage when testing a student on more
than one occasion because the use of different forms can help reduce carryover effects. Ad-
ditionally, the WJ III ACH and the Woodcock-Johnson III Tests of Cognitive Abilities (WJ
III COG; Woodcock, McGrew, & Mather, 2001b) compose a comprehensive diagnostic
system, the Woodcock-Johnson III (WJ III; Woodcock, McGrew, & Mather, 2001c). When
administered together they facilitate the aptitude—-achievement discrimination analyses
often used to diagnose learning disabilities.

Wide Range Achievement Test 3 (WRAT3). The WRAT3 is a brief achievement test that
measures basic reading, spelling, and arithmetic skills. It contains the following subtests:

■ Reading: assesses ability to recognize and name letters and pronounce printed words
■ Spelling: assesses ability to write letters, names, and words that are presented orally
■ Arithmetic: assesses ability to recognize numbers, count, and perform written computations

The WRAT3 can be administered in 15 to 30 minutes and comes in two parallel forms.
Relative to the WIAT-II and WJ III ACH, the WRAT3 measures a limited number of skills.
However, when only a quick estimate of achievement in word recognition, spelling, and
math computation is needed, the WRAT3 can be a useful instrument.

The individual achievement batteries described to this point measure skills in multiple
academic areas. As with the group achievement tests, there are individual tests that focus
on specific skill domains. The following two tests are examples of individual achievement
tests that focus on specific skill areas.

Gray Oral Reading Test—Fourth Edition (GORT-4). The GORT-4 is a measure of


oral reading skills and is often used in the diagnosis of reading problems. The GORT-4
contains 14 passages of increasing difficulty, which students read aloud. The examiner re-
cords reading rate and reading errors (e.g., skipping or inserting words, mispronunciation).
Additionally, each reading passage contains questions to assess comprehension. There are
two parallel forms available.

KeyMath—Revised/NU: A Diagnostic Inventory of Essential Mathematics—Norma-


tive Update (KeyMath R/NU). The KeyMath R/NU, published by American Guidance
Services, measures mathematics skills in the following areas: Basic Concepts (numeration,
rational numbers, and geometry), Operations (addition, subtraction, multiplication, divi-
sion, and mental computations), and Applications (measurement, time and money, esti-
mation, interpreting data, and problem solving). The KeyMath R/NU is available in two
parallel forms.

Selecting an Achievement Battery


Numerous factors should be considered when selecting a standardized achievement bat-
tery. If you are selecting a test for administration to a large number of students, you will
more than likely need a group achievement test. Nitko and Lane (1990) and Nitko (2001)
provide some suggestions for selecting a group achievement battery. They note that although
most survey batteries assess the common educational objectives covered in most curricula,
there are some potentially important differences in the content covered. In some instruc-
tional areas such as reading and mathematics, there is considerable consistency in the cur-
ricula used in different schools. In other areas such as science and social studies, there is more variability. As a result, potential users should examine any potential battery closely to determine whether its content corresponds with the school, district, or state curriculum. Naturally it is also important to evaluate the technical adequacy of a test. This includes issues such as the adequacy of the standardization sample, the reliability of test scores, and the availability of validity evidence supporting the intended use. This is best accomplished using some of the resources discussed earlier in this chapter (Special
Interest Topic 12.2). Finally, it is also useful to consider practical is-
sues such as cost, testing time required, availability of scoring services, and the quality of
support materials such as administration and interpretative guides.
Many of the same factors should be considered when selecting an individual achieve-
ment test. You should select a test that adequately assesses the specific content areas you
are interested in. For example, although a test such as the WRAT3 might be sufficient for
screening purposes, it is not adequate for in-depth diagnostic purposes. In testing students
to determine whether they have a specific learning disability, it would be important to use a
battery such as the WIAT-II, which covers all recognized areas of learning disability.

Summary
In this chapter we focused on standardized achievement tests and their applications in the
schools. These tests are designed to be administered, scored, and interpreted in a standard
manner. The goal of standardization is to ensure that testing conditions are the same for all
individuals taking the test. If this is accomplished, no examinee will have an advantage over

another, and test results will be comparable. These tests have different applications in the
schools, including

Tracking student achievement over time


Making high-stakes decisions (e.g., promotion decisions, teacher evaluations)
Identifying individual strengths and weaknesses
Evaluating the effectiveness of educational programs
Identifying students with special learning needs

Of these uses, high-stakes testing programs are probably the most controversial. These
programs use standardized achievement tests to make such important decisions as which
students will be promoted and evaluating educational professionals and schools. Proponents
of high-stakes testing programs see them as a way of improving public education and en-
suring that students are all judged according to the same standards. Critics of high-stakes
testing programs argue that they encourage teachers to focus on low-level academic skills
at the expense of higher-level skills such as problem solving and critical thinking.
We next described several of the most popular commercial group achievement tests.
The chapter included a discussion of the current trend toward increased high-stakes assess-
ments in the public schools and how this is being implemented by states using a combina-
tion of commercial and state-developed assessments. We introduced a potentially useful
approach for assessing and monitoring student achievement that is referred to as value-
added assessment.
We also provided some guidelines to help teachers prepare their students for these
tests. We noted that any test preparation procedure that raises test scores without also increas-
ing the mastery of the underlying knowledge and skills is inappropriate. After evaluating
different test preparation practices, we concluded that preparation that introduces generic
test-taking skills and uses multiple instructional techniques can be recommended. These
practices should result in improved performance on standardized tests that reflects increased
mastery of the underlying content domains. Preparation practices that emphasize the use
of practice tests or focus on test-specific content or test-specific item formats should be
avoided because they may increase test scores, but may not increase mastery of the under-
lying test domain. We also provided some suggestions for teachers to help administer and
interpret test results:

Review the test administration manual before the day of the test.
Encourage students to do their best on the test.
Closely follow administration instructions.
Strictly adhere to time limits.
Avoid interruptions.
Be alert to cheating.
Be familiar with the types of derived scores produced by the test.

We concluded the chapter by briefly describing some popular individual achievement


tests used in schools. Although teachers are not called on to routinely administer and inter-
pret these individual tests, they often do attend committee meetings at which the results of
these tests are discussed and used in making eligibility and placement decisions.

KEY TERMS AND CONCEPTS

Achievement test, p. 300
Appropriate preparation practice, p. 319
Diagnostic achievement tests, p. 310
Group-administered tests, p. 302
Inappropriate preparation practices, p. 318
Individually administered tests, p. 304
Standardized scores, p. 321
Standardized test, p. 300
Standardized test administration, p. 321
Statewide testing programs, p. 310
Teaching to the test, p. 318
Test preparation practices, p. 318
Value-added assessment, p. 315

RECOMMENDED READINGS

The following articles provide interesting commentaries on issues related to the use of standardized achievement tests in the schools:

Boston, C. (2001). The debate over national testing. ERIC Digest, ERIC-RIEO. (20010401).
Doherty, K. M. (2002). Education issues: Assessment. Education Week on the Web. Retrieved May 14, 2003, from www.edweek.org/context/topics/issuespage.cfm?id=41.
Kober, N. (2002). Teaching to the test: The good, the bad, and who's responsible. Test Talk for Leaders (Issue 1). Washington, DC: Center on Education Policy.


Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to listen to an audio lecture about this chapter.
CHAPTER 13

The Use of Aptitude Tests in the Schools

Conventional intelligence tests and even the entire concept of intelligence testing
are perennially the focus of considerable controversy and strong emotion.
—Reynolds & Kaufman, 1990

CHAPTER HIGHLIGHTS

A Brief History of Intelligence Tests
The Use of Aptitude and Intelligence Tests in Schools
A New Assessment Strategy for Specific Learning Disabilities: Response to Intervention (RTI)
Major Aptitude/Intelligence Tests
College Admission Tests

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to:
1. Compare and contrast the constructs of achievement and aptitude.
2. Explain how achievement and aptitude can be conceptualized as different aspects of a continuum. Provide examples to illustrate this continuum.
3. Discuss the major milestones in the history of intelligence assessment.
4. Describe the major uses of aptitude and intelligence tests in schools.
5. Explain the rationale for the analysis of aptitude—achievement discrepancies.
6. Explain the response to intervention (RTI) process and its current status.
7. Describe and evaluate the major group aptitude/intelligence tests.
8. Describe and evaluate the major individual achievement tests.
9. Evaluate and select aptitude/intelligence tests that are appropriate for different applications.
10. Understand a report of the intellectual assessment of a school-aged child.
11. Identify the major college admission tests and describe their use.

In Chapter 1, when describing maximum performance tests we noted that they are often
classified as either achievement tests or aptitude tests. (In some professional sources the
term aptitude is being replaced with ability. For historical purposes we will use aptitude to

330
The Use of Aptitude Tests in the Schools 331

Aptitude tests are designed to designate this type of test in this chapter, but we do want to alert
measure the cognitive skills, readers to this variability in terminology.) We defined achievement
abilities, and knowledge that tests as those designed to assess students’ knowledge or skills in a
individuals have accumulated as content domain in which they have received instruction (AERA et al.
the result of their overall life ve : wma
experiences.

‘their overall life experiences. Inother words, whereas achievement


tests are tied to a specifi 1 1 J
of lif

These introductory comments might lead you to believe there is a clear and universally accepted distinction between achievement and aptitude tests. However, in actual practice this is not the case and the distinction is actually a matter of degree. Many, if not most, testing experts conceptualize both achievement and aptitude tests as tests of developed cognitive abilities that can be ordered along a continuum in terms of how closely linked the assessed abilities are to specific learning experiences. This continuum is illustrated in Figure 13.1. At one
end of the continuum you have teacher-constructed classroom tests that are tied directly to the
instruction provided in a specific classroom or course. For example, a classroom mathematics
test should assess specifically the learning objectives covered in the class during a specific
instructional period. This is an example of a test that is linked clearly and directly to specific
academic experiences (i.e., the result of curriculum and instruction). Next along the continuum
are the survey achievement batteries that measure a fairly broad range of knowledge, skills,
and abilities. Although there should be alignment between the learning objectives measured
by these tests and the academic curriculum, the scope of a survey battery is considerably
broader and more comprehensive than that of a teacher-constructed classroom test. The group-
administered survey batteries described in the previous chapter are dependent on direct school
experiences, but there is variability in how direct the linkage is. For example, the achievement
tests developed by states to specifically assess the state’s core curriculum are more directly
linked to instruction through the state’s specified curriculum than the commercially developed
achievement tests that assess a more generic curriculum.
Next are intelligence and other aptitude tests that emphasize verbal, quantitative, and
visual-spatial abilities. Many traditional intelligence tests can be placed in this category,
and even though they are not linked to a specific academic curriculum, they do assess many

FIGURE 13.1 A Continuum of General Abilities
Very Specific to Very General: Teacher-Constructed Classroom Tests → Broad Survey Achievement Batteries → Verbal Intelligence and Aptitude Tests → Cross-Cultural Intelligence Tests
Note: Modeled after Anastasi & Urbina (1997), Cronbach (1990), and others.

skills that are commonly associated with academic success. The Otis-Lennon School Abil-
ity Test (OLSAT); Stanford-Binet Intelligence Scales—Fifth Edition; Tests of Cognitive
Skills, Second Edition (TCS/2); Wechsler Intelligence Scale for Children—Fourth Edition
(WISC-IV); and Reynolds Intellectual Assessment Scales (RIAS) are all examples of tests
that fit in this category (some of these will be discussed later in this chapter). In developing
these tests, the authors attempt to measure abilities that are acquired through common, ev-
eryday experiences; not only those acquired through formal educational experiences. For
example, a quantitative section of one of these tests will typically emphasize mental com-
putations and quantitative reasoning as opposed to the developed mathematics skills tradi-
tionally emphasized on achievement tests. Novel problem-solving skills are emphasized on
many portions of these tests as well. Modern intelligence tests are not just measures of
knowledge or how much you know, but also how well you think.
Finally, at the most “general” end of the continuum are the nonverbal and cross-
cultural intelligence or aptitude tests. These instruments attempt to minimize the influence
of language, culture, and educational experiences. They typically emphasize the use of non-
verbal performance items and often completely avoid language-based content (e.g., reading,
writing, etc.). The Naglieri Nonverbal Ability Test—Multilevel Form (NNAT—Multilevel
Form) is an example of a test that belongs in this category. The NNAT—Multilevel Form is
a group-administered test of nonverbal reasoning and problem solving that is thought to be
relatively independent of educational experiences, language, and cultural background (how-
ever, no test is truly culture-free). The NNAT—Multilevel Form (like many nonverbal IQ
tests) employs "progressive matrices"—items in which the test taker must find the missing
pattern in a series of designs or figures. The matrices in the NNAT—Multilevel Form are
arranged in order of difficulty and contain designs and shapes that are not linked to any spe-
cific culture. Promoters of the test suggest that this test may be particularly useful for students
with limited English proficiency, minorities, or those with hearing impairments.

Both achievement and aptitude tests measure developed abilities and can be arranged along a continuum according to how dependent the abilities are on direct school experience. As we move from the specific to the general end of the continuum, test performance becomes less dependent on specific school-based learning experiences.

Although we feel it is important to recognize that the distinction between achievement and aptitude tests is not absolute, we also feel the achievement-aptitude distinction
is useful. In schools and other settings, achievement and aptitude tests traditionally have
been used for different purposes, and these labels help us identify their intended applica-
tions. For example, achievement tests typically are utilized to measure what has been
learned or “achieved” at a specific point in time. In contrast, aptitude tests usually are
used to predict future performance or to reflect an individual’s potential in terms of aca-
demic or job performance.
Although many sources use the terms aptitude and intelligence interchangeably, gen-
eral intelligence tests are not the only type of aptitude test in use today. In addition to intel-
ligence tests, special aptitude tests and multiple aptitude batteries frequently are used in
many educational and other settings. Special aptitude tests were developed originally in the
context of employment settings to help employers select job applicants based on their apti-

tudes in specific areas such as mechanical or clerical ability. Subsequently, test developers
developed multiple-aptitude batteries to measure a number of distinct abilities.

A Brief History of Intelligence Tests

General intelligence tests historically have been the most popular and widely used aptitude
tests in school settings. While practically everyone is familiar with the concept of intelli-
gence and uses the term in everyday conversations, it is not easy to develop a definition of
intelligence on which everyone agrees. In fact, the concept of intelligence probably has
generated more controversy than any other topic in the area of tests and measurement (see
Special Interest Topics 13.1 and 13.2). Although practically all educators, psychologists, and psychometricians have their own personal definition of intelligence, most of these definitions will incorporate abilities such as problem solving, abstract reasoning, and the ability to acquire knowledge (e.g., Gray, 1999). Developing a consensus beyond this point is more difficult. For our present purpose, instead of pursuing a philosophical discussion of the meaning of intelligence, we will focus only on intelligence as measured by contemporary intelligence tests. These tests typically produce an overall score referred to as an intelligence quotient or IQ.
Intelligence tests had their beginning in the schools. In the early
1900s, France initiated a compulsory education program. Recognizing
that not all children had the cognitive abilities necessary to benefit
from regular education classes, the minister of education wanted to develop special educa-
tional programs to meet the particular needs of these children. To accomplish this, he needed
a way of identifying children who needed special services. Alfred Binet and his colleague
Theodore Simon had been attempting to develop a measure of intelligence for some years,
and the French government commissioned them to develop a test that could predict academic
performance accurately. The result of their efforts was the first Binet-Simon Scale, released
in 1905. This test contained problems arranged in the order of their difficulty and assessing a
wide range of abilities. The test contained some sensory-perceptual tests, but the emphasis was
on verbal items assessing comprehension, reasoning, and judgment.
Subsequent revisions of the Binet-Simon Scale were released in 1908
and 1911. These scales gained wide acceptance in France and were soon translated and standardized in the United States, most successfully by Louis Terman at Stanford University. This resulted in the Stanford-Binet Intelligence Test, which has been revised numerous times (the fifth revision remains in use today). Ironically, Terman's version of the Binet-Simon Scale became even more popular in France and other parts of Europe than the Binet-Simon Scale!
The development and success of the Binet-Simon Scale, and subsequently the Stanford-
Binet Intelligence Test, ushered in the era of widespread intelligence testing in the United
States. Following Terman’s lead, other assessment experts developed and released their own
intelligence tests. Some of the tests were designed for individual administration (like the
Stanford-Binet Intelligence Test) whereas others were designed for group administration.

SPECIAL INTEREST TOPIC 13.1


The Controversial IQ: Knowns and Unknowns

A task force established by the American Psychological Association produced a report titled “Intel-
ligence: Knowns and Unknowns” (Neisser et al., 1996). Its authors summarize the state of knowl-
edge about intelligence and conclude by identifying seven critical questions about intelligence that
have yet to be answered. These issues are summarized here and remain unconquered.

1. It is widely accepted that there is a substantial genetic contribution to the development of intelligence, but the pathway by which genetic differences are expressed is not known.
2. It is also accepted that environmental factors contribute significantly to the development of intelligence, but no one really knows the mechanism by which they express their influence.
3. The role of nutrition in the development of intelligence is unclear. It is clear that profound early malnutrition is detrimental, but the effects of more subtle nutritional differences in populations that are "adequately fed" are not well understood.
4. Research has revealed significant correlations between information-processing speed and intelligence, but these findings have not resulted in clear theoretical models.
5. The "Flynn Effect" is real! That is, mean IQs are increasing worldwide. No one is really sure what factors are driving these gains. (See Chapter 3 for more on this topic.)
6. Mean IQ differences between races cannot be attributed to obvious test bias or simply to differences in socioeconomic status. There is also no support for genetic explanations. Simply put, no one really knows the basis of these differences.
7. It is widely accepted that standardized intelligence tests do not measure all aspects of intelligence such as creativity, common sense, and interpersonal finesse. However, we do not know very much about these abilities, such as how they relate to more traditional aspects of intelligence or how they develop.

In concluding their report, Neisser et al. (1996) note:

In a field where so many issues are unresolved and so many questions unanswered, the confident tone
that has characterized most of the debate on these topics is clearly out of place. The study of intel-
ligence does not need politicized assertions and recriminations; it needs self-restraint, reflection, and
a great deal more research. The questions that remain are socially as well as scientifically important.
There is no reason to think them unanswerable, but finding the answers will require a shared and
sustained effort as well as the commitment of substantial scientific resources. Just such a commit-
ment is what we strongly recommend. (p. 97)

Some of these tests placed more emphasis on verbal and quantitative abilities whereas others focused more on visual-spatial and abstract problem-solving abilities. As a general rule, research has shown with considerable consistency that contemporary intelligence tests are good predictors of academic success. This is to be expected considering this was the precise purpose for which they were initially developed over 100 years ago. In addition to being good predictors of school performance, research has shown that IQs are fairly stable over time. Nevertheless, these tests have become controversial themselves as a result of


SPECIAL INTEREST TOPIC 13.2


The Controversial IQ: Schools and IQ Tests

Although IQ tests had their origin in the schools, they have been the source of considerable contro-
versy essentially since their introduction. Opponents of IQ tests often argue IQ tests should be
banned from schools altogether whereas proponents can hardly envision the schools without them.
Many enduring issues contribute to this controversy, and we will mention only the most prominent
ones. These include the following.

Mean IQ Differences among Ethnic Groups


There is considerable research that documents mean IQ differences among various ethnic groups,
and this has often been the source of considerable controversy. Although the basis for these differ-
ences has not been identified, there is ample evidence the differences cannot be attributed merely to
test bias (something we address in more detail in Chapter 16). Nevertheless, because mean group
differences in IQ may result in differential educational treatment and placement, there continues to
be the appearance of test bias, and this appearance promulgates the controversy regarding the use
of IQ tests in schools (Canter, 1997). For example, because of the perception of test bias the state of
California has prohibited the use of a number of popular IQ tests for making placement decisions
with certain ethnic minorities. This is not based on the psychometric properties of the IQ tests, but
on public perception and legal cases. Other states have examined the same tests and concluded that
the tests are not biased and supported their use with minorities.

Can IQ Be Increased?
Given the importance society places on intelligence and a desire to help children excel, it is reason-
able to ask how much IQ can be improved. Hereditarians, those who see genetics as playing the
primary role in influencing IQ, hold that efforts to improve it are doomed to failure. In contrast,
environmentalists, who see environmental influences as primary, see IQ as being highly malleable.
So who is right? In summary, the research suggests that IQ can be improved to some degree, but the
improvement is rather limited. For example, adoption studies indicate that lasting gains of approxi-
mately 10 to 12 IQ points are the most that can be accomplished through even the most pervasive
environmental interventions. The results of preschool intervention programs such as Head Start are
much less impressive. These programs may result in modest increases in IQ, but even these gains are
typically lost in a few years (Kranzler, 1997). These programs do have other benefits to children,
however, and should not be judged only on their impact on IQ.

Do We Really Need IQ Tests in Schools?


Although public debate over the use of IQ tests in schools typically has focused on ethnic differ-
ences and the malleability of intelligence, professional educators and psychologists also have de-
bated the usefulness of IQ tests in educational settings. Different terms have been applied to this
question over the years. For example, Wigdor and Garner (1982) framed it as the instructional
validity of IQ test results, Hilliard (1989) referred to it as the pedagogical utility question, and
Gresham and Witt (1997) indicated it was essentially an issue of treatment validity. Whatever label
you use, the question is “Does the use of IQ tests result in educational benefits for students?” Pro-
ponents of IQ tests highlight evidence that intelligence plays a key role in success in many areas of
life, including school achievement. As an extension they argue that information garnered from IQ
tests allows educators to tailor instruction so that it meets the specific needs of their students. As a
result more students are able to succeed academically. Opponents of IQ tests argue that there is
little evidence that the use of IQ tests results in any real improvement in the education of students.
(continued)

SPECIAL INTEREST TOPIC 13.2 Continued

A contemporary debate involves the use of IQ tests in the identification of students with learning
disabilities. Historically the diagnosis of learning disabilities has been based on a discrepancy
model in which students’ level of achievement is compared to their overall level of intelligence. If
students’ achievement in reading, mathematics, or some other specific achievement area is signifi-
cantly below that expected based on their IQ, they may be diagnosed as having a learning disability
(actually the diagnosis of learning disabilities is more complicated than this, but this explanation is
sufficient in this context). Currently some researchers are presenting arguments that IQs need not
play a role in the diagnosis of learning disabilities and are calling for dropping the use of a discrep-
ancy model, and the 2004 federal law governing special education eligibility (the Individuals with
Disabilities Education Act of 2004) no longer requires such a discrepancy, but does allow its use in
diagnosing disabilities.
So what does the future hold for IQ testing in the schools? We believe that when used ap-
propriately IQ tests can make a significant contribution to the education of students. Braden (1997)
noted that

eliminating IQ is different from eliminating intelligence. We can slay the messenger, but the message
that children differ in their learning rate, efficiency, and ability to generalize knowledge to new situ-
ations (despite similar instruction) remains. (p. 244)

At the same time we recognize that on occasion IQ tests (and other tests) have been used in inap-
propriate ways that are harmful to students. The key is to be an informed user of assessment results.
To this end a professional educator should have a good understanding of the topics covered in this
text, including basic psychometric principles and the ethical use of test results.

the often emotional debate over the meaning of intelligence. To try and avoid this association
and possible misinterpretations, many test publishers have adopted more neutral names such
as academic potential, scholastic ability, school ability, mental ability, and simply ability to
designate essentially the same construct.

The Use of Aptitude and Intelligence Tests in Schools

As you can see from the previous discussion, aptitude and intelligence tests have a long
history of use in the schools. Their widespread use continues to this day, with major applica-
tions including
■ Providing alternative measures of cognitive abilities that reflect information not captured by standard achievement tests or school grades
■ Helping teachers tailor instruction to meet a student's unique pattern of cognitive strengths and weaknesses
■ Assessing how well students are prepared to profit from school experiences
■ Identifying students who are underachieving and may need further assessment to rule out learning disabilities or other cognitive disorders, including mental retardation
■ Identifying students for gifted and talented programs
■ Helping guide students and parents with educational and vocational planning

Although we have identified the most common uses of aptitude/intelligence tests in the
schools, the list clearly is not exhaustive. Classroom teachers are involved to varying degrees
with these applications. For example, teachers are frequently called on to administer and in-
terpret many of the group aptitude tests for their own students. School psychologists or other
professionals with specific training in administering and interpreting clinical and diagnostic
tests typically administer and interpret the individual intelligence and aptitude tests. Even
though they are not directly involved in administering individual intelligence tests, it is impor-
tant for teachers to be familiar with these individual tests. Teachers frequently need to read and
understand psychological reports describing student performances on these tests. Addition-
ally, teachers are often on committees that plan and develop educational programs for students
with disabilities based on information derived from these tests. In a later section we present
an example of a report of the intellectual assessment of a high school student.

Aptitude—Achievement Discrepancies
One common assessment practice employed in schools and in clinical settings is referred to as aptitude-achievement discrepancy analysis. The basic rationale behind this practice is that normally students' achievement scores should be commensurate with their aptitude scores. In other words, students' performance on an aptitude test serves as a type of baseline to compare their performance on an achievement test to,
with the expectation that they will be comparable. In the majority of cases this is what you
will discover when you compare aptitude—achievement scores. This is not to suggest that
the scores will be identical, but that they will be similar, or that there will not be a statisti-
cally significant discrepancy. If students’ achievement scores are significantly higher than
their aptitude scores, they are considered academic overachievers. This may be attributed to
a number of factors such as strong motivation and/or an enriched learning environment. This
may not necessarily be a reason for concern, but may suggest that while students perform
well with the specific skills that are emphasized in school, they have more difficulty solving
novel problems and generalizing their skills to new situations. These students may benefit
from instructional activities that emphasize transfer of learning, generalization, and creativ-
ity (Riverside Publishing, 2002).
If students’ achievement scores are significantly lower than their aptitude scores,
they may be considered academic underachievers, which may be cause for concern. Aca-
demic underachievement may be the result of a number of factors. The student may not
be motivated to perform well in school or may have had inadequate opportunities to learn.
This could include limited exposure to instruction or an impoverished home environment.
It could also reflect cultural or language differences that impact academic achievement.
Naturally a number of medical factors could also be involved, such as impaired hearing
or vision. Additionally a number of psychological disorders or factors could be impli-
cated. For example, children with an attention deficit/hyperactivity disorder (ADHD)
experience attentional problems that may interfere with achievement. Emotional disor-
ders such as depression or anxiety can also detrimentally affect academic performance.
Finally, learning disabilities are often characterized by significant discrepancies between

aptitude and achievement. In fact, many contemporary definitions of learning disabilities incorporate a significant discrepancy between aptitude and achievement as a diagnostic
criterion for the disorder. Although reliance on aptitude—achievement discrepancies to
diagnose learning disabilities is currently the focus of considerable debate (e.g., Fletcher,
Foorman, Boudousquie, Barnes, Schatschneider, & Francis, 2002), many states continue
to use it as an essential element in the diagnosis of learning disabilities. While IDEA 2004
mandates that an aptitude—achievement discrepancy can no longer be required for diagnosis
of a learning disability in public school settings, other state, local, federal, and private agen-
cies continue to require such a discrepancy.
In practice there are a number of methods for determining whether there is a signifi-
cant discrepancy between aptitude and achievement scores. Reynolds (1985, 1990) devel-
oped criteria for conducting aptitude—achievement discrepancy analyses. These included
the requirement that correlation and regression analyses, which are used in predicting
achievement levels and establishing statistical significance, must be based on representative
samples. To help meet this requirement, many of the popular aptitude/intelligence tests are
co-normed (i.e., their normative data were based on the exact same sample of children) or
linked (i.e., there is some overlap in the standardization sample so that a proportion of the
sample received both tests) with a standardized achievement test. This is a desirable situa-
tion and whenever possible one should use co-normed or linked aptitude—achievement tests
when performing aptitude—achievement analyses. When aptitude—achievement discrepancy
analyses are conducted using “nonlinked” tests, the results should be interpreted with cau-
tion (The Psychological Corporation, 2002; Reynolds, 1985).
Some of the popular individual intelligence tests have been co-normed with a standard-
ized achievement test to facilitate the calculation of aptitude—achievement comparisons.
These comparisons typically involve the identification of a statistically significant discrep-
ancy between ability and achievement. Although approaches differ, the simple-difference
method and predicted-achievement method are most commonly used (The Psychological
Corporation, 2002; Reynolds, 1985). In the brief descriptions of major aptitude/intelligence
tests that follow, we will indicate which instruments have been co-normed or linked to
achievement tests and which tests they have been paired with.
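To make the logic of these two approaches concrete, the brief sketch below works through both with made-up numbers. It is an illustration only, not the scoring procedure of any particular test: it assumes both measures are reported as standard scores with a mean of 100 and a standard deviation of 15, and the reliability and correlation values are hypothetical. Operational discrepancy analyses rely on the publisher's co-normed tables rather than hand calculation.

import math

# Illustrative sketch only -- not the scoring procedure of any specific test.
# Assumes standard scores (mean = 100, SD = 15); the reliabilities and the
# aptitude-achievement correlation used below are hypothetical values.

SD = 15.0
Z_CRIT = 1.96  # two-tailed .05 criterion

def simple_difference(aptitude, achievement, r_apt, r_ach):
    """Simple-difference method: is the achievement minus aptitude difference
    larger than would be expected from measurement error alone?"""
    diff = achievement - aptitude
    se_diff = SD * math.sqrt(2 - r_apt - r_ach)   # SE of the difference score
    return diff, abs(diff) >= Z_CRIT * se_diff

def predicted_achievement(aptitude, achievement, r_xy):
    """Predicted-achievement method: regress achievement on aptitude and ask
    whether the observed score departs significantly from the prediction."""
    predicted = 100 + r_xy * (aptitude - 100)     # regression toward the mean
    residual = achievement - predicted
    se_est = SD * math.sqrt(1 - r_xy ** 2)        # standard error of estimate
    return residual, abs(residual) >= Z_CRIT * se_est

# A hypothetical student: IQ of 110, reading achievement standard score of 88.
print(simple_difference(110, 88, r_apt=0.95, r_ach=0.92))   # about (-22, True)
print(predicted_achievement(110, 88, r_xy=0.65))            # about (-18.5, False)

Notice that the two methods can reach different conclusions for the same pair of scores; the predicted-achievement method builds regression toward the mean into the comparison, which is one reason regression-based approaches are often preferred over simple difference scores.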
Before proceeding, we should note that although it is common for educators and clinicians to make ability-achievement comparisons, many testing experts criticize this practice. Critics of this approach argue that ability-achievement discrepancies can usually be attributed simply to measurement error, differences in the content covered, and variations in student attitude and motivation on the different tests (Anastasi & Urbina, 1997; Linn & Gronlund, 2001). Reynolds
(1985) provides methods to overcome the psychometric problems, but
noncognitive factors are more difficult to control. Also, as we noted, there is considerable
debate about relying on ability—achievement discrepancies to diagnose learning disabilities.
Our position can probably be best described as middle of the road. Analysis of ability—achieve-
ment discrepancies may help identify children who are experiencing some academic prob-
lems, but they should be interpreted cautiously. That is, interpret such discrepancies in the
context of other information you have about the student (e.g., school grades, classroom behav-
ior), and if there is reason for concern, pursue additional assessment or consider a referral to
a school psychologist or other assessment professional.

A New Assessment Strategy for Specific Learning Disabilities: Response to Intervention (RTI)

As we noted, there has been growing criticism of reliance on aptitude-achievement discrepancies for diagnosing learning disabilities. If the profession is going to move away from the
use of aptitude—achievement discrepancies, the natural question is “What is the best way to
identify students with learning disabilities?” The approach that has garnered the most atten-
tion and enthusiasm in recent years is referred to as response to intervention (RTI). RTI
has been defined and applied in a number of different ways, but Fuchs, Mock, Morgan, and
Young (2003) provided a broad but succinct definition, when they described it as follows:

Additionally, Fuchs et al. (2003) outline the perceived benefits of RTI relative to the
aptitude—achievement discrepancy approach. For example, RTI is purported to provide help
to struggling students sooner. That is, RTI will help identify students with learning dis-
abilities in a timely manner, not waiting for them to fail before providing assistance. Addi-
tionally, proponents hold that RTI effectively distinguishes between students with actual
disabilities and students that simply have not received adequate instruction. With RTI dif-
ferent instructional strategies of increasing intensity are implemented as part of the process.
It is also believed by some that RTI will result in a reduced number of students receiving
special education services and an accompanying reduction in costs.
While RTI appears to hold promise in the identification of students with learning dis-
abilities (LD), a number of concerns remain. The RTI process has been defined and applied
in different ways, by different professionals, in different school settings (e.g., Christ, Burns,
& Ysseldyke, 2005; Fuchs et al., 2003). For example, some professionals envision RTI as
part of a behavioral problem-solving model while others feel it should involve the consistent
application of empirically validated protocols for students with specific learning problems.
Even when there is agreement on the basic strategy (e.g., a problem-solving model), differ-
ences exist in the number of levels (or tiers) involved, who provides the interventions, and
if RTI is a precursor to a formal assessment or if the RTI process replaces a formal assess-
ment in identifying students with learning disabilities (Fuchs et al., 2003). These inconsis-
tencies present a substantial problem since they make it difficult to empirically establish the
utility of the RTI process. Currently, the RTI model has been evaluated primarily in the
context of reading disabilities with children in the early grades, and this research is gener-
ally promising. However, much less research is available supporting the application of RTI
with other learning disorders and with older children (Feifer & Toffalo, 2007; Reynolds,
2005). In summary, there is much to be learned!

At this point, we view RTI as a useful process that can help identify struggling stu-
dents and ensure that they receive early attention and more intensive instructional interven-
tions. We also feel that students that do not respond to more intensive instruction should
receive a formal psychological assessment that includes, among other techniques, standard-
ized cognitive tests (i.e., aptitude, achievement, and possibly neuropsychological tests). We
do not agree with those that support RTI as a “stand-alone” process for identifying students
with LD. This position excludes the use of standardized tests and essentially ignores 100
years of empirical research supporting the use of psychometric procedures in identifying
and treating psychological and learning problems. A more moderate and measured approach
that incorporates the best of RTI and psychometric assessment practices seems most reason-
able at this time. If future research demonstrates that RTI can be used independently to
identify and develop interventions for students with learning disabilities, we will re-evaluate
our position.

Major Aptitude/Intelligence Tests

Group Aptitude/Intelligence Tests


As with the standardized achievement tests discussed in the previous chapter, it is common for
schools routinely to administer standardized aptitude/intelligence tests to a large number of
students. Also as with standardized achievement tests, the most commonly used aptitude tests
are also group administered, largely due to the efficiency of these tests. Finally, similar to
group achievement tests, teachers are often called on to help administer and interpret the re-
sults of these tests. The guidelines presented in the previous chapter for administering and
interpreting standardized tests apply equally well to both achievement and aptitude tests. Cur-
rently, the most widely used group aptitude/intelligence tests are produced and distributed by
three publishers: CTB McGraw-Hill, Harcourt Assessment, Inc., and
The most widely used group Riverside Publishing.
aptitude/intelligence tests are
produced by CTB McGraw- Tests of Cognitive Skills, Second Edition (TCS/2). The Tests of
Hill, Harcourt Assessment, Inc., Cognitive Skills, Second Edition (TCS/2), published by CTB
and Riverside Publishing. McGraw-Hill, is designed for use with children in grades 2 through 12.
It measures verbal, nonverbal, and memory abilities that are thought to
be important for academic success. It includes the following subtests:
Sequences (ability to comprehend rules implied in a series of numbers, figures, or letters),
Analogies (ability to recognize literal and symbolic relationships), Verbal Reasoning (deduc-
tive reasoning, analyzing categories, and recognizing patterns and relationships), and Memory
(ability to remember pictures or nonsense words). Although the TCS/2 does not assess quan-
titative abilities like many other aptitude tests, its assessment of memory abilities is unique.
When administered with TerraNova The Second Edition, CAT/5, or CTBS/4, anticipated
achievement scores can be calculated.

Primary Test of Cognitive Skills (PTCS). The Primary Test of Cognitive Skills, pub-
lished by CTB McGraw-Hill, is designed for use with students in kindergarten through 1st
grade (ages 5.1 to 7.6 years). It has four subtests (Verbal, Spatial, Memory, and Concepts)
that require no reading or number knowledge. The PTCS produces an overall Cognitive

Skills Index (CSI), and when administered with TerraNova The Second Edition, anticipated
achievement scores can be calculated.

InView. InView, published by CTB McGraw-Hill, is designed for use with students in
grades 2 through 12. It is actually the newest version of the Tests of Cognitive Skills and
assesses cognitive abilities in verbal reasoning, nonverbal reasoning, and quantitative rea-
soning. InView contains five subtests: Verbal Reasoning—Words (deductive reasoning,
analyzing categories, and recognizing patterns and relationships), Verbal Reasoning—Con-
text (ability to identify important concepts and draw logical conclusions), Sequences (abil-
ity to comprehend rules implied in a series of numbers, figures, or letters), Analogies
(ability to recognize literal and symbolic relationships), and Quantitative Reasoning (ability
to reason with numbers). When administered with TerraNova The Second Edition, antici-
pated achievement scores can be calculated.

Otis-Lennon School Ability Test, 8th Edition (OLSAT-8). The Otis-Lennon School
Ability Test, 8th Edition, published by Harcourt Assessment, Inc., is designed for use with
students from kindergarten through grade 12. The OLSAT-8 is designed to measure verbal
processes and nonverbal processes that are related to success in school. This includes tasks
such as detecting similarities and differences, defining words, following directions, recall-
ing words/numbers, classifying, sequencing, completing analogies, and solving mathemat-
ics problems. The OLSAT-8 produces Total, Verbal, and Nonverbal School Ability Indexes
(SAIs). The publishers note that although the total score is the best predictor of success in
school, academic success is dependent on both verbal and nonverbal abilities, and the Verbal
and Nonverbal SAIs can provide potentially important information. When administered
with the Stanford Achievement Test Series, Tenth Edition (Stanford 10), one can obtain
aptitude—achievement comparisons (Achievement/Ability Comparisons, or AACs).

Cognitive Abilities Test (CogAT), Form 6. The Cognitive Abilities Test (CogAT), dis-
tributed by Riverside Publishing, is designed for use with students from kindergarten
through grade 12. It provides information about the development of verbal, quantitative, and
nonverbal reasoning abilities that are related to school success. Students in kindergarten
through grade 2 are given the following subtests: Oral Vocabulary, Verbal Reasoning, Rela-
tional Concepts, Quantitative Concepts, Figure Classification, and Matrices. Students in
grades 3 through 12 undergo the following subtests: Verbal Classification, Sentence Com-
pletion, Verbal Analogies, Quantitative Relations, Number Series, Equation Building, Fig-
ure Classification, Figure Analogies, and Figure Analysis. Verbal, quantitative, and
nonverbal battery scores are provided along with an overall composite score. The publishers
encourage educators to focus on an analysis of the profile of the three battery scores rather
than the overall composite score. They feel this approach provides the most useful informa-
tion to teachers regarding how they can tailor instruction to meet the specific needs of stu-
dents (see Special Interest Topic 13.3 for examples). When given with the Iowa Tests of
Basic Skills or Iowa Tests of Educational Development, the CogAT provides predicted
achievement scores to help identify students whose level of achievement is significantly
higher or lower than expected. Figures 13.2, 13.3, and 13.4 provide examples of CogAT
score reports. Table 13.1 illustrates the organization of the major group aptitude/intelligence
tests.

SPECIAL INTEREST TOPIC 13.3


Ability Profiles on the CogAT

The Cognitive Abilities Test (CogAT) is an aptitude test that measures the level and pattern of a
student’s cognitive abilities. When interpreting the CogAT, Riverside Publishing (2002) encourages
teachers to focus on the student’s performance profile on the three CogAT batteries: Verbal Reason-
ing, Quantitative Reasoning, and Nonverbal Reasoning. To facilitate interpretation of scores, the
profiles are classified as A, B, C, or E profiles, described next.

■ A profiles. Students with A profiles perform at approximately the sAme level on verbal,
quantitative, and nonverbal reasoning tasks. That is, they do not have any relative strengths or weak-
nesses. Approximately one-third of students receive this profile designation.

■ B profiles. Students with B profiles have one battery score that is significantly aBove or
Below the other two scores. That is, they have either a relative strength or a relative weakness on
one subtest. B profiles are designated with symbols to specify the student’s relative strength or
weakness. For example, B (Q+) indicates that a student has a relative strength on the Quantitative
Reasoning battery, whereas B(V-) indicates that a student has a relative weakness on the Verbal
Reasoning battery. Approximately 40% of students have this type of profile.

■ C profiles. Students with C profiles have both a relative strength and a relative weakness.
Here the C stands for Contrast. For example, C (V+N—) indicates that a student has a relative
strength in Verbal Reasoning and a relative weakness in Nonverbal Reasoning. Approximately 14%
of the students demonstrate this profile type.

■ E profiles. Some students with B or C profiles demonstrate strengths and/or weaknesses that are
so extreme they deserve special attention. With the CogAT, score differences of 24 points or
more (on a scale with a mean of 100 and SD of 16) are designated as E profiles (E stands for
Extreme). For example, E (Q—) indicates that a student has an extreme or severe weakness in
Quantitative Reasoning. Approximately 14% of students have this type of profile.
■ Level of performance. In addition to the pattern of performance, it is also important to
consider the level of performance. To reflect the level of performance, the letter code is preceded
by a number indicating the student’s middle stanine score. For example, if a student received sta-
nines of 4, 5, and 6 on the Verbal, Quantitative, and Nonverbal Reasoning batteries, the middle
stanine is 5. In classifying stanine scores, Stanine 1 is Very Low, Stanines 2 and 3 are Below Aver-
age, Stanines 4-6 are Average, Stanines 7 and 8 are Above Average, and Stanine 9 is Very High.

As an example of a complete profile, the profile 8A would indicate students with relatively evenly
developed Verbal, Quantitative, and Nonverbal Reasoning abilities with their general level of perfor-
mance in the Above Average range.
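To make the coding scheme concrete, here is a rough sketch of the logic in code. It is an illustration only, not the publisher's scoring algorithm: the 24-point cutoff for an Extreme difference comes from the description above (standard scores with a mean of 100 and SD of 16), but the 10-point cutoff used here to flag an ordinary relative strength or weakness is an assumed value, and the battery scores and stanines in the example are hypothetical.

from statistics import median

SIGNIFICANT = 10   # assumed cutoff for flagging a relative strength/weakness
EXTREME = 24       # cutoff for an E (Extreme) profile, per the description above

def cogat_profile(scores, stanines):
    """scores and stanines are dicts keyed 'V', 'Q', 'N' for the Verbal,
    Quantitative, and Nonverbal batteries. Returns a code such as '5B (Q+)'."""
    mid = median(scores.values())
    strengths = [b for b, s in scores.items() if s - mid >= SIGNIFICANT]
    weaknesses = [b for b, s in scores.items() if mid - s >= SIGNIFICANT]
    extreme = any(abs(s - mid) >= EXTREME for s in scores.values())

    if strengths and weaknesses:
        letter = "E" if extreme else "C"      # both a strength and a weakness
    elif strengths or weaknesses:
        letter = "E" if extreme else "B"      # a single strength or weakness
    else:
        letter = "A"                          # evenly developed abilities

    level = sorted(stanines.values())[1]      # middle stanine = level of performance
    tags = [b + "+" for b in strengths] + [b + "-" for b in weaknesses]
    return str(level) + letter + (" (" + "".join(tags) + ")" if tags else "")

# Hypothetical scores: about average overall with a quantitative strength.
print(cogat_profile({"V": 98, "Q": 110, "N": 93}, {"V": 5, "Q": 6, "N": 4}))  # 5B (Q+)

In practice the profile code comes from the score report itself; the value of a sketch like this is simply to make the A/B/C/E logic and the role of the middle stanine explicit.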
Riverside Publishing (2002) delineates a number of general principles for tailoring instruc-
tion to meet the needs of students (e.g., build on strengths) as well as more specific suggestions for
working with students with different patterns and levels of performance. CogAT, Form 6: A Short
Guide for Teachers (Riverside Publishing, 2002), an easy to read and very useful resource, is avail-
able online at www.riverpub.com/products/group.cogat6/home.html.

[Image of the report reproduced as Figure 13.2: a CogAT Profile Narrative for a fifth-grade student, showing ability scores and national percentile ranks for the Verbal, Quantitative, and Nonverbal batteries, a composite score, an ability profile of 5B (Q+), predicted achievement ranges, and a narrative interpretation of the results.]
FIGURE 13.2 Profile Narrative for the Cognitive Abilities Test (CogAT) This figure
illustrates one of the report formats available from Riverside Publishing for the CogAT.
This format provides numerical scores and graphs in the left column and a narrative description
of the student’s performance in the right column. Note that the profile depicted in this figure
is identified as 5B (Q+). Please refer to Special Interest Topic 13.3 for information on how
CogAT score profiles are coded and how teachers can use this information to customize
instruction to meet the needs of individual students.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001
The Riverside Publishing Company. All rights reserved.

Individual Aptitude/Intelligence Tests


As with achievement tests, both group and individual intelligence tests are commonly used
in schools. Although teachers are often asked to help administer and interpret the group
aptitude tests, school psychologists and other professionals with special training in admin-
istering and interpreting clinical and diagnostic tests usually administer and interpret the
individual tests. This is not to suggest, however, that teachers do not need to be familiar with

[Image of the report reproduced as Figure 13.3: a combined narrative report presenting a student's Iowa Tests of Basic Skills national percentile ranks by test area alongside her Cognitive Abilities Test scores, with a narrative interpretation addressed to the parent or guardian.]
FIGURE 13.3 Combined Profile Narrative for the Iowa Tests of Basic Skills (ITBS) and the
Cognitive Abilities Test (CogAT) This figure illustrates a Profile Narrative depicting a student’s
performance on both the Iowa Tests of Basic Skills (ITBS) and the Cognitive Abilities Test
(CogAT). This format provides numerical scores and graphs in the left column and a narrative
description of the student’s performance in the right column.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001
The Riverside Publishing Company. All rights reserved.

these tests. Classroom teachers are being asked more and more frequently to work with special education students, and as a result teachers need to be familiar with these tests because they are used in identifying special needs students and planning their educational programs (Nitko, 2001).

[Image of the report reproduced as Figure 13.4: an Achievement/Ability Graphic Comparison for the Iowa Tests of Basic Skills, plotting percentile ranks by test area for the students in one school building; the report notes that the graph is particularly useful for depicting the building's relative strengths and weaknesses and for showing how the typical student in the building compares with the national norm group.]
FIGURE 13.4 Achievement—Ability Graphic Comparison of the Iowa Tests of Basic Skills
(ITBS) when Combined with the Cognitive Abilities Test (CogAT) This figure presents a visual
depiction of the National Percentile Rank for each ITBS test relative to the Predicted National
Percentile Rank based on performance on the CogAT. This report illustrates the reporting of group
data, in this case the performance of all the 3rd-grade students in one school building.
Source: Reproduced by permission of the publisher, Riverside Publishing Company. Copyright 2001
The Riverside Publishing Company. All rights reserved.

Wechsler Intelligence Scale for Children—Fourth Edition (WISC-IV). The Wechsler Intelligence Scale for Children—Fourth Edition (WISC-IV) is the fourth edition of the most popular individual test of intel-
lectual ability for children. Empirical surveys of school psychologists and other assessment
personnel have consistently shown that the Wechsler scales are the most popular individual
intelligence test used in clinical and school settings with children (e.g., Livingston, Eglsaer,
Dickson, & Harvey-Livingston, 2003). The WISC-IV, which takes approximately 2 to 3
hours to administer and score, must be administered by professionals with extensive train-
ing in psychological assessment. Here are brief descriptions of the subtests (Wechsler,
2003):

TABLE 13.1 Organization of Major Group Aptitude/Intelligence Tests

Tests of Cognitive Skills, Second Edition (TCS/2)
Subtests: Sequences, Analogies, Verbal Reasoning, Memory
Composite scores: Verbal ability, Nonverbal ability, Memory ability

Primary Test of Cognitive Skills (PTCS)
Subtests: Verbal, Spatial, Memory, Concepts
Composite scores: Cognitive Skills Index (CSI)

InView
Subtests: Verbal Reasoning—Words, Verbal Reasoning—Context, Sequences, Analogies, Quantitative Reasoning
Composite scores: Verbal reasoning, Nonverbal reasoning, Quantitative reasoning

Otis-Lennon School Ability Test, 8th Edition (OLSAT-8)
Subtests: Verbal Comprehension, Verbal Reasoning, Pictorial Reasoning, Figural Reasoning, Quantitative Reasoning
Composite scores: Verbal School Ability Index, Nonverbal School Ability Index, Total School Ability Index

Cognitive Abilities Test (CogAT), Form 6 (Levels K, 1, and 2)
Subtests: Oral Vocabulary, Verbal Reasoning, Relational Concepts, Quantitative Concepts, Figure Classification, Matrices
Composite scores: Verbal battery score, Quantitative score, Nonverbal score, Overall composite score

Cognitive Abilities Test (CogAT), Form 6 (Levels A-H: Grades 3-12)
Subtests: Verbal Classification, Sentence Completion, Verbal Analogies, Quantitative Relations, Number Series, Equation Building, Figure Classification, Figure Analogies, Figure Analysis
Composite scores: Verbal battery score, Quantitative score, Nonverbal score, Overall composite score

■ Arithmetic. The student is presented a set of arithmetic problems to solve mentally (i.e., no pencil and paper) and answer orally. This subtest involves numerical reasoning ability, mental manipulation, concentration, and auditory memory.

■ Block Design. The student reproduces a series of geometric patterns using red-and-white blocks. This subtest measures the ability to analyze and synthesize abstract visual stimuli, nonverbal concept formation, and perceptual organization.

■ Cancellation. The student scans sequences of visual stimuli and marks target forms. This subtest involves processing speed, visual attention, and vigilance.

■ Coding. The student matches and copies symbols that are associated with either objects (i.e., Coding A) or numbers (Coding B). This subtest is a measure of processing speed, short-term visual memory, mental flexibility, attention, and motivation.

■ Comprehension. The student responds to questions presented orally involving everyday problems or social situations. This subtest is a measure of verbal comprehension and reasoning as well as the ability to apply practical information.

■ Digit Span. The student is presented sequences of numbers orally to repeat verbatim (i.e., Digits Forward) or in reverse order (i.e., Digits Backwards). This subtest involves short-term auditory memory, attention, and on Digits Backwards, mental manipulation.

■ Information. The student responds to questions that are presented orally involving a broad range of knowledge (e.g., science, history, geography). This subtest measures the student's general fund of knowledge.

■ Letter-Number Sequencing. The student reads a list of letters and numbers and then recalls the letters in alphabetical order and the numbers in numerical order. This subtest involves short-term memory, sequencing, mental manipulation, and attention.

■ Matrix Reasoning. The student examines an incomplete matrix and then selects the item that correctly completes the matrix. This subtest is a measure of fluid intelligence and is considered a largely language-free and culture-fair measure of intelligence.

■ Picture Completion. The student is presented a set of pictures and must identify what important part is missing. This subtest measures visual scanning and organization as well as attention to essential details.

■ Picture Concepts. The student examines rows of objects and then selects objects that go together based on an underlying concept. This subtest involves nonverbal abstract reasoning and categorization.

■ Similarities. Two words are presented orally to the student, who must identify how they are similar. This subtest measures verbal comprehension, reasoning, and concept formation.

■ Symbol Search. The student scans groups of symbols and indicates whether a target symbol is present. This subtest is a measure of processing speed, visual scanning, and concentration.

■ Vocabulary. The student is presented a series of words orally to define. This subtest is primarily a measure of word knowledge and verbal conceptualization.

■ Word Reasoning. The student must identify the underlying or common concept implied by a series of clues. This subtest involves verbal comprehension, abstraction, and reasoning.

Information, Word Reasoning, Picture Completion, Arithmetic, and Cancellation are
supplemental subtests whereas the other subtests are core subtests. The administration of
supplemental subtests is not mandatory, but they may be used to “substitute” for a core
subtest if the core subtest is seen as being inappropriate for a particular student (e.g., due to
physical limitation). A supplemental subtest may also be used if a core subtest is “spoiled”
or invalidated for some reason (e.g., its administration is interrupted).
The WISC-IV produces four Index Scores, brief descriptions of which follow
(Wechsler, 2003):

■ Verbal Comprehension Index (VCI). The VCI is a composite of Similarities, Vocabu-
lary, and Comprehension. Information and Word Reasoning are supplemental VCI subtests.
The VCI reflects verbal reasoning, verbal conceptualization, and knowledge of facts.

■ Perceptual Reasoning Index (PRI). The PRI is a composite of Block Design, Picture
Concepts, and Matrix Reasoning. Picture Completion is a supplemental PRI subtest. The
PRI reflects perceptual and nonverbal reasoning, spatial processing abilities, and visual-
spatial-motor integration.

■ Working Memory Index (WMI). The WMI is a composite of Digit Span and Letter-
Number Sequencing. Arithmetic is a supplemental WMI subtest. The WMI reflects the
student's working memory capacity that includes attention, concentration, and mental
control.

■ Processing Speed Index (PSI). The PSI is a composite of Coding and Symbol Search.
Cancellation is a supplemental PSI subtest. The PSI reflects the student's ability to quickly
process nonverbal material as well as attention and visual-motor coordination.

This four-index framework is based on factor analytic and clinical research (Wechsler, 2003).
Similar index scores have a rich history of clinical use and have been found to provide reli-
able information about the student’s abilities in specific areas (Kaufman, 1994; Kaufman &
Lichtenberger, 1999; Wechsler, 2003). Whereas previous Wechsler scales have produced a
Verbal IQ, Performance IQ, and Full Scale IQ, the WISC-IV reports only a Full Scale IQ
(FSIQ), which reflects the student’s general level of intelligence. The organization of the
WISC-IV is depicted in Table 13.2. To facilitate the calculation of aptitude—achievement
discrepancies, the WISC-IV is statistically linked to the Wechsler Individual Achievement
Test—Second Edition (WIAT-II), which was described in the previous chapter on standard-
ized achievement tests.
The WISC-IV and its predecessors are designed for use with children between the ages
of 6 and 16. For early childhood assessment the Wechsler Preschool and Primary Scale of
Intelligence—Third Edition (WPPSI-III) is available and is appropriate for children between

TABLE 13.2 Organization of the Wechsler Intelligence Scale
for Children-Fourth Edition (WISC-IV)

Subtests                      Index Scores              IQs

Information
Vocabulary
Similarities                  Verbal Comprehension
Comprehension
Word Reasoning

Block Design
Picture Completion
Matrix Reasoning              Perceptual Reasoning      Full Scale IQ
Picture Concepts

Coding
Symbol Search                 Processing Speed
Cancellation

Digit Span
Arithmetic                    Working Memory
Letter-Number Sequencing

2 years 6 months and 7 years 3 months. The Wechsler Adult Intelligence Scale-Third Edition
(WAIS-III) is appropriate for individuals between 16 and 89 years of age.

Stanford-Binet Intelligence Scales, Fifth Edition (SB5). As we noted, the Stanford-
Binet Intelligence Test was the first intelligence test to gain widespread acceptance in the
United States. While the Wechsler scales have become the most popular and widely used
intelligence tests in schools, the Stanford-Binet scales have continued to have a strong fol-
lowing. The most recent edition of these scales is the Stanford-Binet Intelligence Scales,
Fifth Edition (SB5), released in 2003. The SB5 is designed for use with individuals from
2 to 85 years of age. It contains 10 subtests, which are combined to produce five factor
indexes (i.e., Fluid Reasoning, Knowledge, Quantitative Reasoning, Visual-Spatial Processing,
and Working Memory), two domain scores (i.e., Verbal IQ and Nonverbal IQ), and a Full
Scale IQ reflecting overall intellectual ability. The organization of the SB5 is depicted in
Table 13.3 (Riverside, 2003). A potentially appealing aspect of the SB5 is the availability
of an Extended IQ scale that allows the calculation of FSIQs higher than 160. This can be
useful in the assessment of extremely gifted individuals.

Woodcock-Johnson III (WJ III) Tests of Cognitive Abilities. The Woodcock-Johnson
III (WJ III) Tests of Cognitive Abilities has gained a loyal following and has some unique
qualities that warrant mentioning. The battery is designed for use with individuals 2 to 90
years of age. The WJ III Tests of Cognitive Abilities is based on the Cattell-Horn-Carroll

TABLE 13.3 Organization of the Stanford-Binet Intelligence Scales, 5th Edition (SB5)

Subtests                                  Factor Scores                       IQs

Verbal Fluid Reasoning                    Fluid Reasoning (FR)                Verbal IQ
Nonverbal Fluid Reasoning                                                     (composite of
                                                                              5 verbal subtests)
Verbal Knowledge                          Knowledge (KN)
Nonverbal Knowledge
                                                                              Nonverbal IQ
Verbal Quantitative Reasoning             Quantitative Reasoning (QR)         (composite of
Nonverbal Quantitative Reasoning                                              5 nonverbal subtests)

Verbal Visual-Spatial Processing          Visual-Spatial Processing (VS)
Nonverbal Visual-Spatial Processing                                           Full Scale IQ
                                                                              (composite of
Verbal Working Memory                     Working Memory (WM)                 all 10 subtests)
Nonverbal Working Memory

(CHC) theory of cognitive abilities, which incorporates Cattell’s and Horn’s Gf-Gc theory
and Carroll’s three-stratum theory. The CHC theory provides a comprehensive model for
assessing a broad range of cognitive abilities, and many clinicians like this battery because it
measures such a broad range of abilities. The organization of the WJ III Tests of Cognitive
Abilities is depicted in Table 13.4 (Riverside, 2003). The WJ III Tests of Cognitive Abilities
is co-normed with the WJ III Tests of Achievement described in the chapter on standardized
achievement tests.

Reynolds Intellectual Assessment Scales (RIAS). The Reynolds Intellectual As-


sessment Scales (RIAS) is a relative newcomer to the clinician’s collection of intelli-
gence tests rapidly growing in popularity in schools and in clinical settings. It is designed
for use with individuals between 3 and 94 years of age and incor-
One particularly desirable porates a co-normed supplemental memory scale. One particu-
aspect of the Reynolds Intellec- larly desirable aspect of the RIAS is the ability to obtain a reliable,
valid measure of intellectual ability that incorporates both verbal
tual Assessment Scales (RIAS) is
and nonverbal abilities in a relatively brief period (i.e., 20 to 25
the ability to obtain a reliable,
minutes). Most other tests that assess verbal and nonverbal cog-
valid measure of intellectual
nitive abilities require considerably more time. The supplemental
ability that incorporates both memory tests require about ten minutes for administration, so a
verbal and nonverbal abilities in clinician can assess both memory and intelligence in approxi-
a relatively brief period mately 35 minutes. The organization of the RIAS is depicted in
(20 to 25 minutes). Table 13.5.

Selecting Aptitude/Intelligence Tests


A natural question at this point is “Which of these tests should I use?” There are numerous
factors to consider when selecting an aptitude or intelligence test. An initial consideration

TABLE 13.4 Organization of the Woodcock-Johnson III (WJ III) Tests of Cognitive Abilities

Subtests                                  Factor Scores                       IQs

Verbal Comprehension                      Comprehension/Knowledge (Gc)
General Information

Visual-Auditory Learning                  Long-Term Retrieval (Glr)
Retrieval Fluency
Visual-Auditory Learning: Delayed

Spatial Relations                         Visual-Spatial Thinking (Gv)
Picture Recognition
Planning (Gv/Gf)
                                                                              General
Sound Blending                            Auditory Processing (Ga)            Intellectual
Auditory Attention                                                            Ability (GIA)
Incomplete Words

Concept Formation                         Fluid Reasoning (Gf)
Analysis-Synthesis
Planning (Gv/Gf)

Visual Matching                           Processing Speed (Gs)
Decision Speed
Rapid Picture Naming
Pair Cancellation

Numbers Reversed                          Short-Term Memory (Gsm)
Memory for Words
Auditory Working Memory

TABLE 13.5 Organization of the Reynolds Intellectual Assessment Scales (RIAS)

Subtests                  Factor Scores                            IQs

Verbal Reasoning          Verbal Intelligence Index (VIX)
Guess What
                                                                   Composite Intelligence
Odd-Item Out              Nonverbal Intelligence Index (NIX)       Index (CIX)
What's Missing

Verbal Memory             Composite Memory Index (CMX)
Nonverbal Memory

involves the decision to use a group or individual test. As is the case with standardized
achievement tests, group aptitude tests are used almost exclusively for mass testing applica-
tions because of their efficiency. Even a relatively brief individual intelligence test typically
requires approximately 30 minutes per student to administer. Additionally, assessment pro-
fessionals with special training in test administration are needed to administer these indi-
vidual tests. A limited amount of time to devote to testing and a limited number of assessment
personnel combine to make it impractical to administer individual tests to a large number
of students. However, some situations demand the use of an individual intelligence test. This
is often the case when making classification decisions such as identifying students who have
learning disabilities or who qualify for gifted and talented programs.
When selecting an intelligence or aptitude test, it is also important to consider how
the information will be used. Are you primarily interested in obtaining a global measure of
intellectual ability, or do you need a test that provides multiple scores reflecting different
sets of cognitive abilities? As we noted, as a general rule intelligence tests have been shown
to be good at predicting academic success. Therefore, if you are simply interested in predicting
school success, practically any of these tests will meet your needs. If you want to identify the
cognitive strengths and weaknesses of your students, you should look at the type of scores
provided by the different test batteries and select one that meets your needs from either a
theoretical or practical perspective. For example, a teacher or clinician who has
embraced the Cattell-Horn-Carroll (CHC) theory of cognitive abili-
ties would be well served using the Woodcock-Johnson III Tests of Cognitive Abilities be-
cause it is based on that specific model of cognitive abilities. The key is to select a test that
provides the specific type of information you need for your application. Look at the type of
factor and intelligence scores the test produces, and select a test that provides meaningful
and practical information for your application.
If you are interested in making aptitude—achievement comparisons, ideally you should
select an aptitude test that is co-normed with an achievement test that also meets your spe-
cific needs. All of the major group aptitude tests we discussed are co-normed or linked to a
major group achievement test. When selecting a combination aptitude—achievement battery,
you should examine both the achievement test and the aptitude test to determine which set
best meets your specific assessment needs. In reference to the individual intelligence tests
we discussed, only the WISC-IV and WJ III Tests of Cognitive Abilities have been co-
normed with or linked to an individual achievement test battery. While it is optimal to use
co-normed instruments when aptitude—achievement comparisons are important, in actual
practice many clinicians rely on aptitude and achievement tests that are not co-normed or
linked. In this situation, it is important that the norms for both tests be based on samples that
are as nearly identical as possible. For example, both tests should be normed on samples
with similar characteristics (e.g., age, race, geographic region) and obtained at approxi-
mately the same time (Reynolds, 1990).
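
For readers who want to see the arithmetic behind such comparisons, the sketch below illustrates one common approach, the simple-difference method, under the assumption that both tests report standard scores with a mean of 100 and a standard deviation of 15 and that the reliability coefficients are taken from the respective test manuals. The function name and the numbers in the example are hypothetical and are not drawn from any actual test.

    import math

    def aptitude_achievement_difference(aptitude, achievement, r_apt, r_ach,
                                        sd=15.0, z_crit=1.96):
        """Simple-difference check: is the gap between two co-normed standard
        scores larger than chance measurement error alone would produce?

        The standard error of the difference between two standard scores is
        SD * sqrt(2 - r_xx - r_yy), where r_xx and r_yy are the reliability
        coefficients reported in the respective test manuals.
        """
        difference = aptitude - achievement
        se_diff = sd * math.sqrt(2.0 - r_apt - r_ach)
        needed = z_crit * se_diff        # difference required for significance
        return difference, round(needed, 1), abs(difference) >= needed

    # Purely illustrative values -- not taken from any actual test manual.
    print(aptitude_achievement_difference(aptitude=108, achievement=86,
                                          r_apt=0.95, r_ach=0.92))

Regression-based discrepancy models refine this idea by first predicting achievement from aptitude and then evaluating the size of the residual, rather than comparing the two scores directly.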
Another important question involves the population you will use the test with. For ex-
ample, if you will be working with children with speech, language, or hearing impairments or
diverse cultural/language backgrounds, you may want to select a test that emphasizes nonver-
bal abilities and minimizes cultural influences. Finally, as when selecting any test, you want

to examine the psychometric properties of the test. You should select a test that produces reli-
able scores and has been validated for your specific purposes. All of the aptitude/intelligence
tests we have discussed have good psychometric properties, but it is the test user’s responsibil-
ity to ensure that the selected test has been validated for the intended purposes.

Understanding the Report of an Intellectual Assessment


Whether you take a professional position in regular education or special education within
the schools, you will often have the opportunity to read or listen to the report of the results
of an intellectual assessment of a student. Special Interest Topic 13.4 on pages 354-365
presents an unedited computer-generated report of the intellectual assessment of a 17-year-
old female student suspected of having a learning disability. Typically, you will not encoun-
ter an unedited computer-generated report. Such reports are, however, used by school,
clinical, and other psychologists as the foundation for their own individualized reporting on
students. We thought it would be instructive for you to have the opportunity to read such a
report in its raw state.
The report begins with a review of all of the data gathered as a result of the administra-
tion and scoring of the intelligence test. You will see a number of terms employed that you
have already learned throughout this text. You will see, for example, that confidence inter-
vals based on the standard errors of measurement are applied to the various intelligence
indexes and that not only standard scores but percentile ranks are provided to assist in the
interpretation. The report continues by providing brief background information on why
Becky was being evaluated accompanied by several behavioral observations considered
important by the person administering the test.
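
As a reminder of how those intervals are built, the standard error of measurement introduced in the reliability chapter is SEM = SD x sqrt(1 - reliability), and a confidence band is formed by adding and subtracting a multiple of the SEM. The minimal sketch below assumes a deviation-IQ metric (mean 100, SD 15) and a hypothetical reliability of .95; the intervals printed in the sample report appear to be centered on estimated true scores (note that they are not symmetric around the obtained scores), so this simple observed-score band will not reproduce them exactly.

    import math

    def observed_score_interval(observed, reliability, sd=15.0, z=1.645):
        """Standard error of measurement and an observed-score confidence band.

        SEM = SD * sqrt(1 - r_xx).  z = 1.645 gives a 90% band, z = 1.96 a 95% band.
        """
        sem = sd * math.sqrt(1.0 - reliability)
        return sem, (observed - z * sem, observed + z * sem)

    # Hypothetical example: an index score of 71 with an assumed reliability of .95.
    sem, (low, high) = observed_score_interval(71, 0.95)
    print(round(sem, 2), round(low), round(high))   # about 3.35, 65, 77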
The next section of the report provides some caveats regarding proper administration
and use of the results of the intellectual assessment. This section will clue the reader in to
the assumptions that underlie the interpretation of the results that follow later in the report.
A computer-generated report cannot yet take into account the behavior of the examinee
or other extraneous factors that may necessitate altering standard interpretations of test
performance as well as a professional examiner can.
The next section of the report provides a narrative summary of Becky’s scores on this
intellectual assessment and provides norm-referenced interpretations. Norm-referenced in-
terpretations are those that compare Becky’s performance to other individuals of the same
chronological age and who belong to the population sampled for development of the norms
for this particular test. You will also see references within this section to the practical ap-
plication of a confidence interval as well as estimates of true scores, all terms with which
you have become acquainted earlier in this text.
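
Two of those terms can be made concrete. Kelley's estimated true score regresses the obtained score toward the mean of the norm group in proportion to the test's reliability, and the percentile rank of a normally distributed standard score follows directly from the normal curve. A minimal sketch, using a hypothetical reliability of .95 purely for illustration:

    from statistics import NormalDist

    def estimated_true_score(observed, reliability, mean=100.0):
        """Kelley's formula: T' = mean + r_xx * (X - mean)."""
        return mean + reliability * (observed - mean)

    def percentile_rank(standard_score, mean=100.0, sd=15.0):
        """Percentile rank of a standard score under the normal curve."""
        return 100.0 * NormalDist(mu=mean, sigma=sd).cdf(standard_score)

    # Hypothetical reliability of .95 for illustration only.
    print(round(estimated_true_score(71, 0.95), 1))   # 72.5 -- pulled toward the mean
    print(round(percentile_rank(71), 1))              # about 2.7, roughly the 3rd percentile

With these values an obtained index of 71 corresponds to roughly the 3rd percentile, which matches the kind of norm-referenced statement made in the report.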
Once the more global indexes of intellectual function have been reviewed, the report
provides information on more specific intellectual tasks Becky completed. This is followed
by a section where the pattern of Becky’s intellectual development is discussed by the use of
norm-referenced discrepancy interpretations. Essentially, this section presents an actuarial
analysis of the differences among Becky’s scores across the different subdomains of intel-
ligence evaluated during this assessment. Such an analysis logically leads to recommendations

(Text continued on page 366)



SPECIAL INTEREST TOPIC 13.4


Example of a Computer-Generated Report of an Individually
Administered Intelligence Test

RIAS™ Interpretive Report


Cecil R. Reynolds, PhD and Randy W. Kamphaus, PhD
Name: Becky J. Gibson
ID#: SC123
Gender: Female
Date of Birth: 1989                 Age: 17
Grade/Education: 11th grade
Reason for referral: Initial learning disability evaluation
Referral source: Guidance counselor

RIAS Subtest Scores/Index Summary

Subtest                      Raw Score    Age-Adjusted T Score
Guess What (GWH)                22                  9
Odd-Item Out (OIO)              40                 29
Verbal Reasoning (VRZ)          21                 33
What's Missing (WHM)            70                 60
Verbal Memory (VRM)             38                 50
Nonverbal Memory (NVM)          90                 64

RIAS Indexes                 Verbal          Nonverbal       Composite        Composite
                             Intelligence    Intelligence    Intelligence     Memory
                             Index (VIX)     Index (NIX)     Index (CIX)      Index (CMX)
Sum of T Scores                  42              89             131              114
Index Score                      56              92              71              114
Confidence Interval 95%        52-65           86-99           67-78          107-120
Confidence Interval 90%        53-64           87-98           67-77          108-119
Percentile Rank                 0.17              30               3               82

RIAS Total Battery Scores    Total Verbal    Total Nonverbal    Total Test
                             Battery (TVB)   Battery (TNB)      Battery (TTB)
Sum of T Scores                  92              153                245
Battery Score                    68              100                 83
Confidence Interval 95%        64-75           94-106              79-88
Confidence Interval 90%        65-74           95-105              79-88
Percentile Rank                    2               50                 13

RIAS Profiles

[Profile graphs in the original report plot the following scores.]

RIAS Subtest T Scores:        GWH = 9, VRZ = 33, OIO = 29, WHM = 60, VRM = 50, NVM = 64
RIAS Indexes:                 VIX = 56, NIX = 92, CIX = 71, CMX = 114

RIAS Total Battery Profiles

RIAS Subtest T Scores:        GWH = 9, VRZ = 33, OIO = 29, WHM = 60, VRM = 50, NVM = 64
RIAS Total Battery Scores:    TVB = 68, TNB = 100, TTB = 83

Client: Becky J. Gibson     Test Date: 01/09/2007     Client ID: SC123

Background Information
Becky J. Gibson is a 17-year-old female. She was referred by her guidance counselor for an initial
learning disability evaluation. Becky is currently in the 11th grade. The name of Becky’s school
was reported as "Lincoln High School." Becky's parental educational attainment was reported as:
"College." The primary language spoken in Becky's home is English.
Becky identified the following vision, hearing, language, and/or motor problems: "Requires
prescription glasses for reading." Becky further identified the following learning problems:
"None." Finally, Becky identified the following medical/neurological problems: "None."

Behavioral Observations
Becky arrived more than 15 minutes early. She was accompanied to the session by her legal guard-
ian. During testing the following behavioral observations were made: "Client appeared easily
distracted and was very fidgety."

Caveat and Descriptive Text


The test scores, descriptions of performance, and other interpretive information provided in this
computer report are predicated on the following assumptions. First, it is assumed that the various
subtests were administered and scored correctly in adherence with the general and specific ad-
ministration and scoring guidelines provided in Chapter 2 of the RIAS/RIST Professional Manual
(Reynolds & Kamphaus, 2003). Second, it also is assumed that the examinee was determined to
be appropriately eligible for testing by the examiner according to the guidelines for testing eligi-
bility provided in Chapter 2 of the RIAS Professional Manual and that the examiner was appropri-
ately qualified to administer and score the RIAS/RIST.
This report is intended for revelation, transmission to, and use by individuals appropriately
qualified and credentialed to interpret the RIAS/RIST under the laws and regulations of their local
jurisdiction and meeting the guidelines for use of the RIAS/RIST as stated in the RIAS Professional
Manual (Reynolds & Kamphaus, 2003) (see Chapter 2).
Becky was administered the Reynolds Intellectual Assessment Scales (RIAS). The RIAS is an
individually administered measure of intellectual functioning normed for individuals between the
ages of 3 and 94 years. The RIAS contains several individual tests of intellectual problem solving
and reasoning ability that are combined to form a Verbal Intelligence Index (VIX) and a Nonverbal
Intelligence Index (NIX). The subtests that compose the VIX assess verbal reasoning ability along
with the ability to access and apply prior learning in solving language-related tasks. Although
labeled the Verbal Intelligence Index, the VIX also is a reasonable approximation of crystallized
intelligence. The NIX comprises subtests that assess nonverbal reasoning and spatial ability. Al-
though labeled the Nonverbal Intelligence Index, the NIX also provides a reasonable approxima-
tion of fluid intelligence. These two indexes of intellectual functioning are then combined to form
an overall Composite Intelligence Index (CIX). By combining the VIX and the NIX to form the CIX,
a stronger, more reliable assessment of general intelligence (g) is obtained. The CIX measures the
two most important aspects of general intelligence according to recent theories and research find-
ings: reasoning or fluid abilities and verbal or crystallized abilities. Each of these indexes is ex-
pressed as an age-corrected standard score that is scaled to a mean of 100 and a standard
deviation of 15. These scores are normally distributed and can be converted to a variety of other
metrics if desired.
The RIAS also contains subtests designed to assess verbal memory and nonverbal memory.
Depending on the age of the individual being evaluated, the verbal memory subtest consists of a
series of sentences, age-appropriate stories, or both, read aloud to the examinee. The examinee
is then asked to recall these sentences or stories as precisely as possible. The nonverbal memory
subtest consists of the presentation of pictures of various objects or abstract designs for a period
of 5 seconds. The examinee is then shown a page containing six similar objects or figures and
must discern which object or figure was previously shown. The scores from the verbal memory and
nonverbal memory subtests are combined to form a Composite Memory Index (CMX), which pro-
vides a strong, reliable assessment of working memory and also may provide indications as to
whether or not a more detailed assessment of memory functions may be required. In addition, the
high reliability of the verbal and nonverbal memory subtests allows them to be compared directly
to each other.
For reasons described in the RIAS/RIST Professional Manual (Reynolds & Kamphaus, 2003),
it is recommended that the RIAS subtests be assigned to the indices described above (e.g., VIX,
NIX, CIX, and CMX). For those who do not wish to consider the memory scales as a separate entity
and prefer to divide the subtests strictly according to verbal and nonverbal domains, the RIAS
subtests can be combined to form a Total Verbal Battery (TVB) score and a Total Nonverbal Battery
(TNB) score. The subtests that compose the Total Verbal Battery score assess verbal reasoning
ability, verbal memory, and the ability to access and apply prior learning in solving language-
related tasks. Although labeled the Total Verbal Battery score, the TVB also is a reasonable ap-
proximation of measures of crystallized intelligence. The TNB comprises subtests that assess
nonverbal reasoning, spatial ability, and nonverbal memory. Although labeled the Total Nonverbal
Battery score, the TNB also provides a reasonable approximation of fluid intelligence. These two
indexes of intellectual functioning are then combined to form an overall Total Test Battery (TTB)
score. By combining the TVB and the TNB to form the TTB, a stronger, more reliable assessment
of general intelligence (g) is obtained. The TTB measures the two most important aspects of gen-
eral intelligence according to recent theories and research findings: reasoning, or fluid, abilities
and verbal, or crystallized, abilities. Each of these scores is expressed as an age-corrected standard
score that is scaled to a mean of 100 and a standard deviation of 15. These scores are normally
distributed and can be converted to a variety of other metrics if desired.

Composite Norm-Referenced Interpretations


On testing with the RIAS, Becky earned a Composite Intelligence Index or CIX of 71. On the RIAS,
this level of performance falls within the range of scores designated as moderately below average
and exceeds the performance of 3% of individuals at Becky's age. The chances are 90 out of 100
that Becky’s true CIX falls within the range of scores from 67 to 77.

Becky earned a Verbal Intelligence Index (VIX) of 56, which falls within the significantly
below average range of verbal intelligence skills and exceeds the performance of less than one
percent of individuals Becky’s age. The chances are 90 out of 100 that Becky’s true VIX falls within
the range of scores from 53 to 64.
Becky earned a Nonverbal Intelligence Index (NIX) of 92, which falls within the average range
of nonverbal intelligence skills and exceeds the performance of 30% of individuals Becky's age.
The chances are 90 out of 100 that Becky's true NIX falls within the range of scores from 87 to
98.
Becky earned a Composite Memory Index (CMX) of 114, which falls within the above average
range of working memory skills. This exceeds the performance of 82% of individuals Becky’s age.
The chances are 90 out of 100 that Becky’s true CMX falls within the range of scores from 108 to
119.
On testing with the RIAS, Becky earned a Total Test Battery or TTB score of 83. This level of
performance on the RIAS falls within the range of scores designated as below average and exceeds
the performance of 13% of individuals at Becky’s age. The chances are 90 out of 100 that Becky’s
true TTB falls within the range of scores from 79 to 88.
Becky's Total Verbal Battery (TVB) score of 68 falls within the range of scores designated as
significantly below average and exceeds the performance of 2% of individuals her age. The chances
are 90 out of 100 that Becky’s true TVB falls within the range of scores from 65 to 74.
Becky’s Total Nonverbal Battery (TNB) score of 100 falls within the range of scores designated
as average and exceeds the performance of 50% of individuals her age. The chances are 90 out of
100 that Becky's true TNB falls within the range of scores from 95 to 105.

Subtest Norm-Referenced Interpretations


The Guess What subtest measures vocabulary knowledge in combination with reasoning skills that
are predicated on language development and acquired knowledge. On testing with the RIAS, Becky
earned a T score of 9 on Guess What.
Odd-Item Out measures analytical reasoning abilities within the nonverbal domain. On test-
ing with the RIAS, Becky earned a T score of 29 on Odd-Item Out.
Verbal Reasoning measures analytical reasoning abilities within the verbal domain. English
vocabulary knowledge is also required. On testing with the RIAS, Becky earned a T score of 33 on
Verbal Reasoning.
What's Missing measures spatial and visualization abilities. On testing with the RIAS, Becky
earned a T score of 60 on What's Missing.
Verbal Memory measures the ability to encode, briefly store, and recall information in the
verbal domain. English vocabulary knowledge also is required. On testing with the RIAS, Becky
earned a T score of 50 on Verbal Memory.
Nonverbal Memory measures the ability to encode, briefly store, and recall information in the
nonverbal and spatial domains. On testing with the RIAS, Becky earned a T score of 64 on Non-
verbal Memory.

RIAS Discrepancy Score Summary Table

Discrepancy Score    Score Difference    Statistically Significant?    Prevalence in Standardization Sample

VIX < NIX                   36                    yes                              1.20%
CIX < CMX                   43                    yes                              1.30%
VRM < NVM                   14                    yes                             38.80%
TVB < TNB                   32                    yes                              4.70%

VIX is the Verbal Intelligence Index, NIX is the Nonverbal Intelligence Index, CIX is the Composite
Intelligence Index, CMX is the Composite Memory Index, VRM is the Verbal Memory Subtest, NVM is
the Nonverbal Memory Subtest, TVB is the Total Verbal Battery Index, and TNB is the Total Nonverbal
Battery Index.

Discrepancy Norm-Referenced Interpretations


Although the CIX is a good estimate of Becky’s general intelligence, a statistically significant
discrepancy exists between her NIX of 92 and her VIX of 56, demonstrating better developed non-
verbal intelligence or spatial abilities. The magnitude of the difference observed between these two
scores is potentially important and should be considered when drawing conclusions about Becky’s
current status. A difference of this size is relatively uncommon, occurring in only one percent of
cases in the general population. In such cases, interpretation of the CIX or general intelligence score
may be of less value than viewing Becky’s verbal and nonverbal abilities separately.
When compared to Becky's measured level of general intelligence as reflected in Becky’s CIX,
it can be seen that her CIX falls significantly below her CMX. This result indicates that Becky is
able to use immediate recall and working memory functions at a level that significantly exceeds
her ability to engage in intellectual problem solving and general reasoning tasks. The magnitude
of the difference seen in this instance may take on special diagnostic significance due to its
relative infrequency in the general population. A difference between CIX and CMX of this magni-
tude occurs in only one percent of the population.
Within the subtests making up the CMX, Becky’s performance in the nonverbal memory do-
main significantly exceeded her level of performance within the verbal memory domain. This
difference is reliable and indicates that Becky functions at a significantly higher level when asked
to recall or engage in working memory tasks that are easily adapted to visual-spatial cues and
other nonverbal memory features, as opposed to tasks relying on verbal linguistic strategies.
Although most likely representing a real difference in Becky's abilities in these two areas, the
magnitude of this difference is relatively common, occurring in 39% of the population at Becky's
age. Therefore, this difference may or may not be indicative of the presence of a psychopatho-
logical condition, depending on the results of other clinical assessment information.
Although the TTB is a good estimate of Becky’s general intelligence, a significant discrepancy
exists between her TNB score of 100 and her TVB score of 68, demonstrating better developed
nonverbal intelligence or spatial abilities. The magnitude of the difference observed between these
two scores is potentially important and should be considered when drawing conclusions about
Becky's current status. A difference of this size is relatively uncommon, occurring in only 5% of
cases in the general population. In such cases, interpretation of the TTB or general intelligence score
may be of less value than viewing Becky's verbal and nonverbal abilities separately.
If interested in comparing the TTB and CIX scores or the TTB and CMX scores, it is better to
compare the CIX and CMX directly. As noted in the RIAS/RIST Professional Manual (Reynolds &
Kamphaus, 2003), the TTB is simply a reflection of the sum of the T scores of the subtests that
compose the CIX and CMX. Thus, it is more appropriate to make a direct comparison of the CMX and
CIX because any apparent discrepancy between the TTB and the CIX or the TTB and the CMX will in
fact be a reflection of discrepancies between the CIX and the CMX, so this value is best examined
directly. To compare the CMX or CIX to the TTB may exaggerate some differences inappropriately.

General Interpretive Caveats


Examiners should be familiar with Becky's cultural and linguistic background (which may radically
alter the suggestions contained herein) and be certain to consider these factors before arriving at
a final decision regarding any diagnosis, classification, or related decision and before making any
form of recommendations.

School Feedback and Recommendations


Composite Score Feedback and Recommendations
Becky’s CIX score of 71 indicates moderate deficits in overall development of general intelligence
relative to others her same age and her TTB score of 83 indicates mild deficits in overall develop-
ment of general intelligence relative to others at Becky's age. Individuals earning general intel-
ligence scores in this range frequently experience at least some difficulty acquiring information
through traditional educational methods provided in the classroom setting.
The TTB measures the same general construct as the CIX with the exception that six tests are
included rather than four. Evidence in the RIAS/RIST Professional Manual (Reynolds & Kamphaus,
2003) documents the equivalence of these two scores based on evidence that a first factor solution
is defensible at all age levels of the RIAS whether four or six subtests are used. There also is evi-
dence from a variety of intelligence tests to suggest the “indifference of the indicator” (Kam-
phaus, in press). In other words, general intelligence may be assessed using a variety of cognitive
tests providing further evidence that for most individuals the TTB and CIX will be interchangeable.
There will be exceptions to this well-documented scientific finding, in the case of severe brain
injury, for example, where significant memory impairment may be present, but these cases will
be exceptions rather than the rule.
Since most instructional programs presume at least average intellectual ability and involve
lecture, note taking, and other typical instructional approaches, with the exception of demonstra-
tive and repetitive methods commonly used with young children, difficulties in acquiring infor-
mation when these methods are used is anticipated. Given Becky's deficits, special teaching
methods might be considered, including special class placement for severe deficits in general in-
tellectual development. Teachers should prepare an individualized curriculum designed for stu-
dents who learn at a slower rate than others of the same age and grade level. Alternative methods
of instruction should be considered that involve the use of repeated practice, spaced practice,
concrete examples, guided practice, and demonstrative techniques. Individuals with general intel-
ligence scores in this range often benefit from repeated practice approaches to training because
of problems with acquisition and long-term retrieval, as well as an individualized instructional
method that differs significantly from that of their age-mates. It also will be important to assist
Becky in developing strategies for learning and studying. Although it is important for all students
to know how to learn and not just what to learn, low scores on general intelligence indices make
the development of learning and study strategies through direct instruction even more important.
If confirmed through further testing, co-occurring deficits in adaptive behavior and behavioral
problems should be added to the school intervention program.
Becky’s VIX score of 56 and TVB score of 68 indicate severe deficits in the development of
verbal intellect relative to others at Becky’s age. Individuals at this score level on the TVB nearly
always have accompanying verbal memory difficulties that can easily be moderate to severe in
nature. Special attention to Becky's VRM score is necessary, as well as considerations for any
extant verbal memory problems and their accompanying level of severity in making specific
recommendations.
Verbal ability is important for virtually every aspect of activity because language is key to
nearly all areas of human endeavor. A multitude of research investigations have documented the
importance of verbal ability for predicting important life outcomes. Verbal ability should be con-
sidered equivalent to the term “crystallized intelligence” (Kamphaus, in press). As assessed by
the RIAS, verbal ability (like crystallized intelligence) is highly related to general intelligence,
and as such its relationship to important life outcomes is easily correlated. Verbal ability also is
the foundation for linguistic knowledge, which is necessary for many types of learning.
With the exception of the early grades, along with kindergarten and pre-K settings, school
is principally a language-oriented task. Given Becky's relative verbal deficits, special teaching
methods might be considered, including special class placement in the case of severe deficits in
verbal intellectual development. The examiner should also consider either conducting, or making
a referral for, an evaluation for the presence of a language disorder. Alternative methods of in-
struction that emphasize “show me” rather than “tell me” techniques or that as a minimum pair
these two general approaches, are preferred.
Although linguistic stimulation likely cannot counteract the effects of verbal ability deficits
that began in infancy or preschool years, verbal stimulation is still warranted to either improve
adaptation or at least prevent an individual from falling further behind peers. Verbal concept and
knowledge acquisition should continue to be emphasized. A simple word-for-the-day program
may be beneficial for some students. Verbal knowledge builders of all varieties may be helpful
including defining words, writing book reports, a book reading program, and social studies and
science courses that include writing and oral expression components. Alternatively, assistive
technology (e.g., personal digital assistance devices, tape recorders, MP3 players, or IPODs) may
be used to enhance functioning in the face of the extensive verbal demands required for making
adequate academic progress.

In addition, teachers should rely more heavily on placing learning into the student's experi-
ential context, giving it meaning and enabling Becky to visualize incorporating each newly
learned task or skill into her life experience.
The use of visual aids should be encouraged and made
available to Becky whenever possible. Academic difficulties are most likely to occur in language-
related areas (e.g., the acquisition of reading), especially early phonics training. The acquisition
of comprehension skills also is aided when the verbal ability falls into this level by the use of
language experience approaches to reading, in particular. Frequent formal and informal assess-
ment of Becky’s reading skills, as well as learning and study strategies (the latter with an instru-
ment, e.g., the School Motivation and Learning Strategies Inventory; SMALSI; Stroud & Reynolds,
2006) is recommended. This should be followed by careful direct instruction in areas of specific
skill weaknesses and the use of high interest, relevant materials. It also will be important to assist
Becky in developing strategies for learning and studying. Although it is important for all students
to know how to learn and not just what to learn, low scores within the verbal intelligence domains
make the development of learning and study strategies through direct instruction even more
important.

Discrepancy Feedback and Recommendations


The magnitude of discrepancy between Becky's VIX score of 56 and NIX score of 92 as well as the
magnitude of the discrepancy between her TVB score of 68 and TNB score of 100 is relatively
unusual within the normal population. Although this is the most common pattern within referral
populations, the magnitude of the discrepancy occurring for Becky makes the difference notewor-
thy. In general, this pattern represents substantially disparate skills in the general domains of
verbal and nonverbal reasoning, with clear superiority evident in the nonverbal domain. Relative
to their verbal reasoning and general language skills, individuals who display this pattern will
experience greater success in tasks involving spatial reasoning, visualization skills, the use of
mental rotation, reading of nonverbal cues, and related aspects of nonverbal reasoning and com-
munication usually including nonverbal and visual memory skills. Nonverbal ability is less influ-
ential in others' appraisal of general intellectual functioning. Because NIX and TNB are greater
than VIX and TVB, Becky's general intellectual functioning may appear lower than is reflected by
her CIX and TTB scores. Whenever possible, one should take advantage of Becky's relatively higher
levels of performance in the nonverbal domain by always providing visual cues and explanations
of tasks, expectations, or demonstrations of what is expected to be learned. Experiential learning
is typically superior to traditional lecture and related pedagogical methods for individuals with
this score pattern. Synthesis of information as opposed to analysis is often a relative strength,
as well.
Teaching should emphasize the use of visual images, spatial representations of relationships,
experiential learning, and the synthesis of information as opposed to methods of deduction in
learning. Difficulties are likely to occur with traditional pedagogical styles such as lecturing and
the completion of reading and written assignments. An emphasis on the spatial relationships of
numbers and the construction of problems is likely to be the most effective means for teaching
math versus the memorization and the learning of step-by-step rules for calculation. A heavy
emphasis on learning by example and by demonstration is likely to be most effective with stu-
dents with this intellectual pattern. Also common are problems with sequencing including se-
quential memory and, in the early grades, mastery of phonics when synthesizing word sounds
into correct words. Emphases on holistic methods of learning are likely to be more successful in
addition to experiential approaches. The practical side of learning and the application of knowl-
edge can be emphasized to enhance motivation in these students.
Often, these students do not have good study, learning, and test-taking strategies. It is often
useful to assess the presence of strategies with a scale such as the School Motivation and Learn-
ing Strategies Inventory and then to target deficient areas of learning strategies for direct instruc-
tion (Stroud & Reynolds, 2006).
The magnitude of discrepancy between Becky's CMX score of 114 and CIX score of 71 is rela-
tively unusual within the normative population, suggesting that memory skills are relatively more
intact than general intellectual skills. Individuals with this profile may require more intensive
and broad-based intervention because general intelligence is a better predictor of occupational
and educational outcomes than are memory skills (Kamphaus, in press).
Students with this profile may experience problems with inferential reasoning, logic, the
comprehension of new concepts, and the acquisition of new knowledge. As such, participation in
school or intervention programs is often more successful if lessons are of longer duration, infor-
mation is provided in multiple modalities, opportunities to practice newly acquired skills are
provided frequently, and repetition and review is emphasized.

Recommendations for Additional Testing


Becky's NIX score of 92 and her TNB score of 100 are significantly higher than her VIX score of 56
and her TVB score of 68. Although this is the most common pattern in referral populations, ad-
ditional information is almost always helpful in making a diagnosis, in treatment planning, and/
or in making vocational recommendations. Evaluations that consider disturbances in language
and verbal functions in general (including receptive and expressive language) and other left
hemisphere related tasks may prove helpful. Although empirical research at this point is lacking,
clinical experience with the RIAS indicates that when the VIX score is significantly below the NIX
score and the absolute value of the VIX is less than 90, there is a high probability of the presence
of a language disorder that may have an adverse impact on academic attainment or success in any
academically related vocational training program. When this pattern occurs, as in the case of
Becky, screening for a language disorder is recommended at a minimum and a more comprehensive
language assessment should be considered. Evaluation of language skills with measures such as
the Clinical Evaluation of Language Fundamentals 4 (CELF-4; Semel, Wiig, & Secord, 2004), age
appropriate language tasks from the Halstead-Reitan Neuropsychological Test Battery (e.g., the
Speech Sounds Perception Test, Aphasia Screening Test; Reitan & Wolfson, 1993), the Comprehen-
sive Receptive and Expressive Vocabulary Test (CREVT-2; Wallace & Hammill, 2002), and the De-
velopmental Test of Auditory Perception (DTAP; Reynolds, Voress, & Pierson, 2007), may be
particularly useful. Other tests of specific cognitive-processing functions should be considered.
Research suggests that cognitive processing is measured well with little confounding by level of

RIAS Extended Score Summary Table

Subtest scores                    GWH      OIO      VRZ      WHM      VRM      NVM

Raw score                          22       40       21       70       38       90
T score (Mean = 50, SD = 10)        9       29       33       60       50       64
z score (Mean = 0, SD = 1)      -4.10    -2.10    -1.70     1.00     0.00     1.40
Subtest scaled score
  (Mean = 10, SD = 3)              <1        4        5       13       10       14

Composite scores                  VIX      NIX      CIX      CMX      TVB      TNB      TTB

Sum of subtest T scores            42       89      131      114       92      153      245
Index score
  (Mean = 100, SD = 15)            56       92       71      114       68      100       83
T score (Mean = 50, SD = 10)       21       45       31       59       29       50       39
z score (Mean = 0, SD = 1)      -2.93    -0.53    -1.93     0.93    -2.13     0.00    -1.13
Percentile rank                  0.17       30        3       82        2       50       13
95% confidence interval         52-65    86-99    67-78  107-120    64-75   94-106    79-88
90% confidence interval         53-64    87-98    67-77  108-119    65-74   95-105    79-88
NCE (Mean = 50, SD = 21.06)         1       39        9       70        5       50       26
Stanine (Mean = 5, SD = 2)          1        4        1        7        1        5        3

general intelligence through the use of comprehensive measures of memory functions including
the WRAML-2 (Sheslow & Adams, 2003) and the TOMAL-2 (Reynolds & Voress, 2007). Subtests of
the Neuropsychological Assessment Battery (NAB; Stern & White, 2003), and other related tests of verbal
skills with which you are familiar and skilled may well be useful adjuncts to the assessment pro-
cess in Becky’s case. Students with this pattern often exhibit inadequate levels of study skills
development and learning strategies and, thus, may become discouraged in school or vocational-
training programs. Assessment and targeted remediation of such deficits can be undertaken for
ages 8 through 18 years with assessments such as the School Motivation and Learning Strategies
Inventory (Stroud & Reynolds, 2006).
In cases where the CMX score is clinically significantly higher than the CIX score, follow-up
evaluation may be warranted, particularly if the CIX is in the below average range or lower. Lower
intelligence test scores are associated with increased forms of a variety of psychopathology, par-
ticularly if scores are in or near the mental retardation range (Kamphaus, in press). Because
general intelligence impacts knowledge and skill acquisition in a variety of areas, a thorough
evaluation of academic achievement is necessary to gauge the impact of any impairment and make
plans to remediate educational weaknesses.

References
Hammill, D., & Bryant, B. (2005). Detroit Tests of Learning Aptitude-Primary (DTLA-P-3) (3rd ed.).
Austin, TX: PRO-ED.
Hammill, D., Pearson, N. A., & Voress, J. K. (1993). Developmental Test of Visual Perception-2 (DTVP-2).
Austin, TX: PRO-ED.
Kamphaus, R.W. (in press). Clinical assessment of children’s intelligence (3rd ed.). New York:
Springer.
McCarthy, D. (1972). McCarthy Scales of Children’s Abilities. San Antonio, TX: Harcourt Assessment.
Reitan, R. M., & Wolfson, D. (1993). The Halstead-Reitan Neuropsychological Test Battery: Theory and
clinical interpretation (2nd ed.). Tucson, AZ: Neuropsychology Press.
Reynolds, C. R. (2006). Koppitz Developmental Scoring System for the Bender Gestalt Test (Koppitz-2)
(2nd ed.). Austin, TX: PRO-ED.
Reynolds, C. R., & Kamphaus, R. W. (2003). Reynolds Intellectual Assessment Scales (RIAS) and the
Reynolds Intellectual Screening Test (RIST) professional manual. Lutz, FL: Psychological Assess-
ment Resources.
Reynolds, C. R., Pearson, N. A., & Voress, J. K. (2002). Developmental Test of Visual Perception—
Adolescent and Adult (DTVP-A). Austin, TX: PRO-ED.
Reynolds, C. R., & Voress, J. (2007). Test of Memory and Learning (TOMAL-2) (2nd ed.). Austin, TX:
PRO-ED.
Reynolds, C. R., Voress, J., & Pierson, N. (2007). Developmental Test of Auditory Perception (DTAP).
Austin, TX: PRO-ED.
Semel, E. M., Wiig, E. H., & Secord, W. A. (2004). Clinical Evaluation of Language Fundamentals 4-
Screening Test (CELF-4). San Antonio, TX: Harcourt Assessment.
Sheslow, D., & Adams, W. (2003). Wide Range Assessment of Memory and Learning 2 (WRAML-2).
Wilmington, DE: Wide Range.
Stern, R. A., & White, T. (2003). Neuropsychological Assessment Battery (NAB). Lutz, FL: Psychological
Assessment Resources.
Stroud, K., & Reynolds, C. R. (2006). School Motivation and Learning Strategies Inventory (SMALSI).
Los Angeles: Western Psychological Services.
Wallace, G., & Hammill, D. D. (2002). Comprehensive Receptive and Expressive Vocabulary Test
(CREVT-2) (2nd ed.). Los Angeles: Western Psychological Services.

Reproduced by special permission of the publisher, Psychological Assessment Resources, Inc., 16204 North Flor-
ida Avenue, Lutz, Florida 33549, from the Reynolds Intellectual Assessment Scales Interpretive Report by Cecil
R. Reynolds, PhD and Randy W. Kamphaus, PhD. Copyright 1998, 1999, 2002, 2003, 2007 by Psychological As-
sessment Resources, Inc. Further reproduction is prohibited without permission of PAR, Inc.

for understanding Becky’s particular pattern of intellectual development and ways that it
may be relevant to altering instructional methods or making other changes in how material
is presented to her in an educational setting. The next major section of the report deals pre-
cisely with school feedback and recommendations. Here the reader is provided with a gen-
eral understanding of the implications of these findings for Becky’s academic development
and alternative methods of instruction are recommended. These are based on various studies
of the implications of intelligence test results for student learning over many decades. In
particular, the actuarial analyses of discrepancies in Becky’s various areas of intellectual
development have led to recommendations for some additional assessment as well as
changes in teaching methods.
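
The actuarial analysis referred to here asks two separate questions about each difference score: is the difference statistically reliable, and how common is a difference that large in the normative population? Published prevalence values, such as those in the report's discrepancy table, come from the standardization sample. The sketch below shows only a normal-model approximation of that prevalence, assuming both indexes are scaled to a mean of 100 and a standard deviation of 15 and assuming a purely hypothetical correlation of .70 between them.

    from math import sqrt
    from statistics import NormalDist

    def approximate_prevalence(difference, correlation, sd=15.0):
        """Approximate the percentage of the normative population showing an
        absolute difference at least this large between two correlated indexes.

        If both indexes are N(100, sd^2) with correlation r, their difference is
        approximately N(0, sd^2 * 2 * (1 - r)).
        """
        sd_diff = sd * sqrt(2.0 * (1.0 - correlation))
        return 200.0 * NormalDist(mu=0.0, sigma=sd_diff).cdf(-abs(difference))

    # Hypothetical correlation of .70 between two indexes, difference of 36 points.
    print(round(approximate_prevalence(36, 0.70), 2))   # a fraction of one percent

The actual base rates reported by a test manual are tabled directly from the norming data and take precedence over any such approximation.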
The purpose of all of the commentary in this report is ultimately to achieve an under-
standing of Becky’s intellectual development and how it may be related to furthering her
academic development in the best way possible.
The sample report is restricted to recommendations for school or other formal instruc-
tional settings. Other specialized reports can be generated separately for specialized clinical
settings that make quite different recommendations and even provide provisional diagnoses
that should be considered by the professional psychologist administering and interpreting
the intellectual assessment. The reader should be aware that it is rare for a report to be based
only on an intellectual assessment, and we doubt you will ever see such a report based on a
singular instrument. Typically, reports of the assessment of a student conducted by a diagnostic
professional will include not only a thorough assessment of intellectual functions, such as the one
reported in Special Interest Topic 13.4, but also evaluations of academic skills and status; of
personality and behavior that may affect academic performance; and of specialized areas of
development such as auditory perceptual skills, visual perceptual skills, visual-motor integration,
attention, concentration, and memory skills, among other important aspects of the student's
development, as dictated by the nature of the referral and the information gathered during the
ongoing assessment process.

College Admission Tests

A final type of aptitude test that is often used in schools includes those used to make ad-
mission decisions at colleges and universities. College admission tests were specifically
designed to predict academic performance in college, and although they are less clearly
linked to a specific educational curriculum than most standard achievement tests, they do
focus on abilities and skills that are highly academic in nature. Higher education admis-
sion decisions are typically based on a number of factors including high school GPA,
letters of recommendation, personal interviews, written statements, and extracurricular
activities, but in many situations scores on standardized admission tests are a prominent
factor. The two most widely used admission assessment tests are the Scholastic Assessment
Test (SAT) and the American College Test (ACT).

Scholastic Assessment Test. The College Entrance Examination Board (CEEB), com-
monly referred to as the College Board, was originally formed to provide colleges and

universities with a valid measure of students’ academic abilities. Its efforts resulted in the
development of the first Scholastic Aptitude Test in 1926. The test has undergone numerous
revisions and in 1994 the title was changed to Scholastic Assessment Test (SAT). The new-
est version of the SAT was administered for the first time in fall 2005 and includes the fol-
lowing three sections: Critical Reading, Mathematics, and Writing. Although the Critical
Reading and Mathematics sections assess new content relative to previous exams, the most
prominent change is the introduction of the Writing section. This section contains both
multiple-choice questions concerning grammar and a written essay. The SAT is typically
taken in a student’s senior year. The College Board also produces the Preliminary SAT
(PSAT), which is designed to provide practice for the SAT. The PSAT helps students iden-
tify their academic strengths and weaknesses so they can better prepare for the SAT. The
PSAT is typically taken during a student's junior year. More information about the SAT can
be accessed at the College Board's Web site: www.collegeboard.com.

American College Test. The American College Testing Program (ACT) was initiated in
1959 and is the major competitor of the SAT. The American College Test (ACT) is de-
signed to assess the academic development of high school students and predict their ability
to complete college work. The test covers four skill areas—English, Mathematics, Read-
ing, and Science Reasoning—and includes 215 multiple-choice questions. When describ-
ing the ACT, the producers emphasize that it is not an aptitude or IQ test, but an achievement
test that reflects the typical high school curriculum in English, mathematics, and science.
In addition to the four subtests, the ACT also incorporates an interest inventory that pro-
vides information that may be useful for educational and career planning. Beginning in the
2004–2005 academic year, the ACT included an optional 30-minute writing test that assesses an actual sample of students' writing. More information about the ACT can be accessed at the ACT's Web site: www.act.org.

Summary

In this chapter we discussed the use of standardized intelligence and aptitude tests in the
schools. We started by noting that aptitude/intelligence tests are designed to assess the cog-
nitive skills, abilities, and knowledge that are acquired as the result of broad, cumulative life
experiences. We compared aptitude/intelligence tests with achievement tests that are de-
signed to assess skills and knowledge in areas in which specific instruction has been pro-
vided. We noted that this distinction is not absolute, but rather one of degree. Both aptitude
and achievement tests measure developed cognitive abilities. The distinction lies with the
degree to which the cognitive abilities are dependent on or linked to formal learning experi-
ences. Achievement tests should measure abilities that are developed as the direct result of
formal instruction and training whereas aptitude tests should measure abilities acquired
from all life experiences, not only formal schooling. In addition to this distinction, achieve-
ment tests are usually used to measure what has been learned or achieved at a fixed point in
time, whereas aptitude tests are often used to predict future performance. Although the
distinction between aptitude and achievement tests is not as clear as one might expect, the
two types of tests do differ in their focus and are used for different purposes.

The most popular type of aptitude test used in schools today is the general intelligence
test. Intelligence tests actually had their origin in the public schools approximately 100 years
ago when Alfred Binet and Theodore Simon developed the Binet-Simon Scale to identify
children who needed special educational services to be successful in French schools. The test
was well received in France and was subsequently translated and standardized in the United
States to produce the Stanford-Binet Intelligence Test. Subsequently other test developers
developed their own intelligence tests and the age of intelligence testing had arrived. Some
of these tests were designed for group administration and others for individual administra-
tion. Some of these tests focused primarily on verbal and quantitative abilities whereas others
placed more emphasis on visual—spatial and abstract problem-solving skills. Some of these
tests even avoided verbal content altogether. Research suggests that, true to their initial pur-
pose, intelligence tests are fairly good predictors of academic success. Nevertheless, the
concept of intelligence has taken on different meanings for different people, and the use of
general intelligence tests has been the focus of controversy and emotional debate for many
years. This debate is likely to continue for the foreseeable future. In an attempt to avoid
negative connotations and misinterpretations, many test publishers have switched to more
neutral titles such as school ability or simply ability to designate the same basic construct.
Contemporary intelligence tests have numerous applications in today’s schools. These
include providing a broader measure of cognitive abilities than traditional achievement
tests, helping teachers tailor instruction to meet students’ unique patterns of cognitive
strengths and weaknesses, determining whether students are prepared for educational expe-
riences, identifying students who are underachieving and may have learning or other cogni-
tive disabilities, identifying students for gifted and talented programs, and helping students
and parents make educational and career decisions. Classroom teachers are involved to
varying degrees with practically all of these applications. Teachers often help with the ad-
ministration and interpretation of group aptitude tests, and although they typically do not
administer and interpret individual aptitude tests, they do need to be familiar with the tests
and the type of information they provide.
One common practice when interpreting intelligence tests is referred to as aptitude—
achievement discrepancy analysis. This simply involves comparing a student’s performance
on an aptitude test with performance on an achievement test. The expectation is that achieve-
ment will be commensurate with aptitude. Students with achievement scores significantly
greater than ability scores may be considered academic overachievers whereas those with
achievement scores significantly below ability scores may be considered underachievers.
There are a number of possible causes for academic underachievement ranging from poor
student motivation to specific learning disabilities. We noted that there are different methods
for determining whether a significant discrepancy between ability and achievement scores
exists and that standards have been developed for performing these analyses. To meet these
standards, many of the popular aptitude and achievement tests have been co-normed or statis-
tically linked to permit comparisons. We cautioned that while ability–achievement discrepancy analysis is a common practice, not all assessment experts support it. As we have emphasized throughout this text, test results should be interpreted in conjunction with other sources of information when making important decisions. This suggestion applies when making ability–achievement comparisons.
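To make the simple-difference approach concrete, here is a minimal sketch in Python; the function name, the mean-100/SD-15 score scale, and the 1.5 standard deviation cutoff are illustrative assumptions rather than a prescribed standard, and regression-based methods are also used in practice.

def simple_difference_discrepancy(aptitude_ss, achievement_ss, cutoff_sd=1.5, sd=15):
    """Flag an aptitude-achievement discrepancy with a simple-difference rule.

    Both scores are assumed to be standard scores (mean 100, SD 15); the 1.5 SD
    cutoff is illustrative, not a required criterion.
    """
    difference = aptitude_ss - achievement_ss
    return difference, difference >= cutoff_sd * sd

# Example: aptitude standard score 110, reading achievement standard score 85.
diff, flagged = simple_difference_discrepancy(110, 85)
print(f"Difference = {diff} points; flagged as a significant discrepancy: {flagged}")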

The chapter concluded with an examination of a number of the popular group and
individual aptitude tests. Finally, a number of factors were discussed that should be consid-
ered when selecting an aptitude test. These included deciding between a group and indi-
vidual test, determining what type of information is needed (e.g., overall IQ versus multiple
factors scores), determining what students the test will be used with, and evaluating the
psychometric properties (e.g., reliability and validity) of the test.

KEY TERMS AND CONCEPTS

Achievement tests, p. 331
Alfred Binet and Theodore Simon, p. 333
American College Test (ACT), p. 366
Aptitude–achievement discrepancy, p. 337
Aptitude tests, p. 331
Binet-Simon Scale, p. 333
Cognitive Abilities Test (CogAT), p. 341
College admission tests, p. 366
Intelligence, p. 333
Intelligence quotient (IQ), p. 333
InView, p. 341
Otis-Lennon School Ability Test, 8th Edition, p. 341
Primary Test of Cognitive Skills, p. 340
Response to intervention, p. 339
Reynolds Intellectual Assessment Scales (RIAS), p. 350
Scholastic Assessment Test (SAT), p. 366
Stanford-Binet Intelligence Scales, Fifth Edition (SB5), p. 349
Tests of Cognitive Skills, Second Edition, p. 340
Wechsler Intelligence Scale for Children—Fourth Edition (WISC-IV), p. 344
Woodcock-Johnson III (WJ III) Tests of Cognitive Abilities, p. 349

RECOMMENDED READINGS

Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist, 36, 1-14. An interesting and readable chronicle of the controversy surrounding mental testing during much of the twentieth century.

Fletcher-Janzen, E., & Reynolds, C. R. (Eds.). (in press). Neuroscientific and clinical perspectives on the RTI initiative in learning disabilities diagnosis and intervention. New York: John Wiley and Sons. This text provides a review of the use of RTI in the identification of learning disabilities.

Kamphaus, R. W. (2001). Clinical assessment of child and adolescent intelligence. Boston: Allyn & Bacon. This text provides an excellent discussion of the assessment of intelligence and related issues.

Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to listen to an audio lecture about this chapter.
CHAPTER 14

Assessment of Behavior and Personality

Although educators have typically focused primarily on cognitive abilities, federal laws mandate that schools provide special education and related
services to students with emotional disorders. Before these services can be
provided, the schools must be able to identify children with these disorders.
The process of identifying these children often involves a psychological
evaluation completed by a school psychologist or other clinician. Teachers
often play an important role in this assessment process.

CHAPTER HIGHLIGHTS

Assessing Behavior and Personality
Behavior Rating Scales
Self-Report Measures
Projective Techniques

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Compare and contrast maximum performance tests and typical response tests.
2. Explain how classroom teachers are involved in the assessment of children with emotional
disorders.
3. Define personality as used in assessment and explain why this should be applied cautiously
with children and adolescents.
4. Define and give examples of response sets.
5. Explain how test validity scales can be used to guard against response sets and give an example.
6. Describe the strengths and limitations of behavior rating scales.
7. Describe and evaluate the major behavior rating scales.
8. Describe the strengths and limitations of self-report measures.
9. Describe and evaluate the major self-report measures.
10. Explain the central hypothesis of projective techniques.
11. Describe the strengths and limitations of projective techniques.


In Chapter 1, when describing the different types of tests, we noted that tests typically can
be classified as measures of either maximum performance or typical response. Maximum
performance tests are often referred to as ability tests. On these tests items are usually scored
as either correct or incorrect, and examinees are encouraged to demonstrate the best perfor-
mance possible. Achievement and aptitude tests are common examples of maximum performance tests. Typical response tests typically assess constructs such as personality, behavior, attitudes, or interests (Cronbach, 1990). Although maximum performance tests are the most prominent type of test used in schools today, typical response tests are used frequently also.

Typical response tests usually assess constructs such as personality, behavior, attitudes, or interests.
Public Law 94-142 (IDEA) and its most current reauthorization, the Individuals with
Disabilities Education Improvement Act of 2004 (IDEA 2004), mandate that schools pro-
vide special education and related services to students with emotional disorders. These laws
compel schools to identify students with emotional disorders and, as a result, expand
school assessment practices, previously focused primarily on cognitive abilities, to include
the evaluation of personality, behavior, and related constructs. The primary goal of this
chapter is to help teachers become familiar with the major instruments used in assessing
emotional and behavioral features of children and adolescents and to assist them in under-
standing the process of evaluating such students. Teachers are often called on to provide
relevant information on students’ behavior. Teachers are involved to varying degrees with
the assessment of student behavior and personality. Classroom teachers are often asked to
help with the assessment of students in their classrooms, for example, by completing be-
havior rating scales on students in their class. This practice provides invaluable data to
school psychologists and other clinicians because teachers have a unique opportunity to
observe children in their classrooms. Teachers can provide information on how the child
behaves in different contexts, both academic and social. As a result, the knowledge derived
from behavior rating scales completed by teachers plays an essential role in the assessment
of student behavior and personality. Teachers may also be involved with the development
and implementation of educational programs for children with emotional or behavioral
disorders. As part of this role, teachers may need to read psychological reports and incor-
porate these findings into instructional strategies. In summary, although teachers do not
need to become experts in the field of psychological assessment, it is vital for them to
become familiar with the types of instruments used in assessing children’s behavior and
personality.
Personality can be defined as an individual's characteristic way of thinking, feeling, and behaving.

Before proceeding, it is beneficial to clarify how assessment experts conceptualize personality. Gray (1999) defines personality as "the relatively consistent patterns of thought, feeling, and behavior that characterize each person as a unique individual" (p. G12). This definition probably captures most people's concept of personality. In conventional assessment terminology, personality is defined in a similar manner, incorporating a host of emotional, behavioral, motivational, interpersonal, and attitudinal characteristics (Anastasi & Urbina, 1997). In the context of child and adolescent assessment, the term personality should be used with some care. Measures of
personality and behavior in children demonstrate less stability than comparable measures
in adults. This is not particularly surprising given the rapid developmental changes charac-
teristic of children and adolescents. As a result, when using the term personality in the
context of child and adolescent assessment, it is best to interpret it cautiously and under-
stand that it does not necessarily reflect a fixed construct, but one that is subject to develop-
ment and change.

Assessing Behavior and Personality

Even though we might not consciously be aware of it, we all engage in the assessment of
personality and behavior on a regular basis. When you note that “Johnny has a good person-
ality,” “Tommy is a difficult child,” or “Tamiqua is extroverted,” you are making a judgment
about personality. We use these informal evaluations to determine whom we want to associ-
ate with and whom we want to avoid, among many other ways.
The development of the first formal instrument for assessing personality is typically
traced to the efforts of Robert Woodworth. In 1918, he developed the Woodworth Personal
Data Sheet, which was designed to help collect personal information about military recruits.
Much as the development of the Binet scales ushered in the era of intelligence testing, the
introduction of the Woodworth Personal Data Sheet ushered in the era of personality assess-
ment. Subsequent instruments for assessing personality and behavior took on a variety of
forms, but they all had the same basic purpose of helping us to understand the behavior and
personal characteristics of ourselves and others. Special Interest Topic 14.1 provides a brief
description of an early test of personality.

Response Sets

A response set is present when test takers respond in a manner that misrepresents their true characteristics.

Response biases or response sets are test responses that misrepresent a person's true characteristics. For example, an individual completing an employment-screening test might attempt to present an overly positive image by answering all of the questions in the most socially appropriate manner possible, even if these responses do not accurately represent the person. On the other hand, a teacher who is hoping to have a disruptive student transferred from his or her class might be inclined to exaggerate the student's misbehavior in order to hasten that student's removal. In both of these situations the individual completing the test or scale responded in a manner that systematically distorted reality. Response sets can be present when completing maximum performance tests. For example, an individual with a pending court case claiming neurological damage resulting from an accident might "fake bad" on an intelligence test in an effort to substantiate the presence of brain damage and enhance his or her legal case. However, response sets are an even bigger problem on typical performance tests. Because many of the constructs measured by typical performance tests (e.g., personality, behavior, attitudes, beliefs) have dimensions that may be seen as either socially "desirable" or "undesirable," the tendency to employ a response set is heightened.

Response sets are a ubiquitous problem in personality assessment.

SPECIAL INTEREST Topic 14.1


The Handsome and the Deformed Leg

Sir Francis Galton (1884) related a tale attributed to Benjamin Franklin about a crude personality
test. Franklin describes two types of people, those who are optimistic and focus on the positive and
those who are pessimistic and focus on the negative. Franklin reported that one of his philosophical
friends desired a test to help him identify and avoid people who were pessimistic, offensive, and
prone to acrimony.

In order to discover a pessimist at first sight, he cast about for an instrument. He of course possessed
a thermometer to test heat, and a barometer to tell the air-pressure, but he had no instrument to test
the characteristic of which we are speaking. After much pondering he hit upon a happy idea. He
chanced to have one remarkably handsome leg, and one that by some accident was crooked and de-
formed, and these he used for the purpose. If a stranger regarded his ugly leg more than his handsome
one he doubted him. If he spoke of it and took no notice of the handsome leg, the philosopher deter-
mined to avoid his further acquaintance. Franklin sums up by saying, that every one has not this
two-legged instrument, but every one with a little attention may observe the signs of a carping and
fault-finding disposition. (pp. 9-10)

Source: This tale was originally reported by Sir Francis Galton (1884). Galton’s paper was reproduced in Good-
stein & Lanyon (1971).

When response sets are present, the validity of the test results may be compromised because they introduce construct-irrelevant error to test scores (e.g., AERA et al., 1999). That is, the test results do not accurately reflect the construct the test was designed to measure. To combat this, many typical performance tests incorporate some type of validity scale designed to detect the presence of response sets. Validity scales take different forms, but the general principle is that they are designed to detect individuals who are not responding in an accurate manner. Special Interest Topic 14.2 provides an example of a "fake good" response set. In the last several decades, personality scale authors have devised many types of so-called validity scales to detect a dozen or more response sets.

Validity scales are designed to detect the presence of response bias.

Assessment of Behavior and Personality in the Schools


The instruments used to assess behavior and personality in the schools can usually be clas-
sified as behavior rating scales, self-report measures, or projective techniques. The results
of a recent national survey of school psychologists indicated that five of the top ten instru-
ments were behavior rating scales, four were projective techniques, and one was a self-
report measure (Livingston, Eglsaer, Dickson, & Harvey-Livingston, 2003; see Table 14.1
for a listing of these assessment instruments). These are representative of the instruments
school psychologists use to assess children suspected of having an emotional, behavioral,
or other type of disorder. These are not the only types of typical performance tests used in
the schools. For example, school guidance counselors often use interest inventories to assess

SPECIAL INTEREST Topic 14.2


An Example of a “Fake Good” Response Set

Self-report inventories, despite the efforts of test developers, always remain susceptible to response
sets. The following case is an authentic example. In this case the Behavior Assessment System for
Children, Self-Report of Personality (BASC-SRP) was utilized.
Maury was admitted to the inpatient psychiatric unit of a general hospital with the diagnoses
of impulse control disorder and major depression. She is repeating the seventh grade this school year
because she failed to attend school regularly last year. When skipping school, she spent time roaming
the local shopping mall or engaging in other relatively unstructured activities. She was suspended
from school for lying, cheating, and arguing with teachers. She failed all of her classes in both se-
mesters of the past school year.
Maury’s responses to the diagnostic interview suggested that she was trying to portray herself
in a favorable light and not convey the severity of her problems. When asked about hobbies, for
example, she said that she liked to read. When questioned further, however, she could not name a
book that she had read.
Maury’s father reported that he has been arrested many times. Similarly, Maury and her sisters
have been arrested for shoplifting. Maury’s father expressed concern about her education. He said that
Maury was recently placed in an alternative education program designed for youth offenders.
Maury’s SRP results show evidence of a social desirability or fake good response set. All of
her clinical scale scores were lower than the normative T-score mean of 50 and all of her adaptive
scale scores were above the normative mean of 50. In other words, the SRP results suggest that
Maury is optimally adjusted, which is in stark contrast to the background information obtained.
Maury’s response set, however, was identified by the Lie scale of the SRP, where she obtained
a score of 9, which is on the border of the caution and extreme caution ranges. The following table
shows her full complement of SRP scores.

Clinical Scales                          Adaptive Scales

Scale                      T-Score       Scale                      T-Score

Attitude to School         41            Relations with Parents
Attitude to Teachers       39            Interpersonal Relations    57
Sensation Seeking          41            Self-Esteem                54
Atypicality                38            Self-Reliance
Locus of Control           38
Somatization               39
Social Stress              38
Anxiety                    34
Depression                 43
Sense of Inadequacy        41

Source: Clinical Assessment of Child and Adolescent Personality and Behavior (2nd ed.) (Box 6.1, p. 99), by
R. W. Kamphaus and P. J. Frick, 2002, Boston: Allyn & Bacon. Copyright 2002 by Pearson Education. Reprinted
with permission.

TABLE 14.1 Ten Most Popular Tests of Child Personality and Behavior

Name of Test Type of Test

1. BASC Teacher Rating Scale                 Behavior rating scale
2. BASC Parent Rating Scale                  Behavior rating scale
3. BASC Self-Report of Personality           Self-report measure
4. Draw-A-Person                             Projective technique
5. Conners Rating Scales—Revised             Behavior rating scale
6. Sentence Completion Tests                 Projective technique
7. House-Tree-Person                         Projective technique
8. Kinetic Family Drawing                    Projective technique
9. Teacher Report Form (Achenbach)           Behavior rating scale
10. Child Behavior Checklist (Achenbach)     Behavior rating scale

Note: BASC = Behavior Assessment System for Children. The Conners Rating Scales—
Revised and Sentence Completion Tests actually were tied. Based on a national sample of
school psychologists (Livingston et al., 2003).

a student’s interest in different career options. However, we will be limiting our discussion
primarily to tests used in assessing children and adolescents with emotional and behavioral
disorders. To this end, we will briefly describe behavior rating scales, self-report measures,
and projective techniques in the following sections.

Behavior Rating Scales


A behavior rating scale is an inventory that asks an informant to rate a child on a number of dimensions.

A behavior rating scale is essentially an inventory that asks an informant, usually a parent or teacher, to rate a child on a number of dimensions. For example, the instructions might ask an informant to rate a child according to the following guidelines:

0 = rarely or never
1 = occasionally
2 = often or almost always

The scale will then present a series of item stems for which the informant rates the child. For example:

Has difficulty paying attention: 0  1  2
Lies: 0  1  2
Plays well with peers: 0  1  2
Contributes to class discussion: 0  1  2
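To show how item ratings like these are typically aggregated, the following minimal Python sketch sums a few ratings into a raw scale score and converts it to a T-score. The items, the scale assignment, and the norm mean and standard deviation are hypothetical; operational instruments convert raw scores to norm-referenced scores using tables in their manuals.

# A hypothetical example of turning 0-2 ratings into a raw scale score and a T-score.
# The items, the scale assignment, and the norm values are illustrative only.
ratings = {
    "Has difficulty paying attention": 2,
    "Lies": 1,
    "Plays well with peers": 0,
    "Contributes to class discussion": 0,
}

# Suppose the first two items belong to an illustrative "problem behavior" scale.
problem_items = ["Has difficulty paying attention", "Lies"]
raw_score = sum(ratings[item] for item in problem_items)

# Convert the raw score to a T-score (mean 50, SD 10) using made-up norm statistics.
norm_mean, norm_sd = 1.2, 0.9
t_score = 50 + 10 * (raw_score - norm_mean) / norm_sd
print(f"Raw score = {raw_score}, T-score = {t_score:.0f}")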

Behavior rating scales have a number of positive characteristics (e.g., Kamphaus


& Frick, 2002; Piacentini, 1993; Ramsay, Reynolds, & Kamphaus, 2002; Witt, Heffer,
& Pfeiffer, 1990). For example, children may have difficulty accurately reporting their own
feelings and behaviors due to a number of factors such as limited insight or verbal abilities
or, in the context of self-report tests, limited reading ability. However, when using behavior
rating scales, information is solicited from the important adults in a child’s life. Ideally these
adult informants will have had adequate opportunities to observe the child in a variety of
settings over an extended period of time. Behavior rating scales also represent a cost-effec-
tive and time-efficient method of collecting assessment information. For example, a clini-
cian may be able to collect information from both parents and one or more teachers with a
minimal investment of time. Most popular behavior rating scales have separate inventories
for parents and teachers. This allows the clinician to collect information from multiple in-
formants who observe the child from different perspectives and in various settings. Behav-
ior rating scales can also help clinicians assess the presence of rare behaviors. Although any
responsible clinician will interview the child, parents, and hopefully teachers, it is still pos-
sible to miss important indicators of behavioral problems. The use of well-designed behav-
ior rating scales may help detect the presence of rare behaviors, such as fire setting and
animal cruelty, that might be missed in a clinical interview.
There are some limitations associated with the use of behavior rating scales. Even
though the use of adult informants to rate children provides some degree of objectivity, these
scales are still subject to response sets that may distort the true characteristics of the child.
For example, as a “cry for help” a teacher may exaggerate the degree of a student’s prob-
lematic behavior in hopes of hastening a referral for special education services. Accord-
ingly, parents might not be willing or able to acknowledge their child has significant
emotional or behavioral problems and tend to underrate the degree and nature of problem
behaviors. Although behavior rating scales are particularly useful in diagnosing “external-
izing” problems such as aggression and hyperactivity, which are easily observed by adults,
they are less helpful when assessing “internalizing” problems such as depression and anxi-
ety, which are not as apparent to observers.
In recent years behavior rating scales have gained popularity and become increasingly important in the assessment of children and adolescents.

Over the past two decades, behavior rating scales have gained popularity and become increasingly important in the psychological assessment of children and adolescents (Livingston et al., 2003). It is common for a clinician to have both parents and teachers complete behavior rating scales for one child. This is desirable because parents and teachers have the opportunity to observe the child in different
settings and can contribute unique yet complementary information to
the assessment process. As a result, school psychologists will frequently ask classroom
teachers to help with student evaluations by completing behavior rating scales on one of
their students. Next we will briefly review some of the most popular scales.

Behavior Assessment System for Children, Second Edition—


Teacher Rating Scale and Parent Rating Scale (TRS and PRS)
The Behavior Assessment System for Children (BASC) is an integrated set of instruments
that includes a Teacher Rating Scale (TRS), a Parent Rating Scale (PRS), self-report
scales, a classroom observation system, and a structured developmental history (Reynolds
& Kamphaus, 1992). Although the BASC is a relatively new set of instruments, a 2003 na-
tional survey of school psychologists indicates that the TRS and PRS are the most frequently
used behavior rating scales in the public schools today (Livingston et al., 2003). Information
obtained from the publisher estimates the BASC was used with more than 1 million children
in the United States alone in 2003. By 2006, this estimate had grown to 2 million children
per year. The TRS and PRS are appropriate for children from 2 to 21 years. Both the TRS
and PRS provide item stems to which the informant responds Never, Sometimes, Often, or
Almost Always. The TRS is designed to provide a thorough examination of school-related
behavior whereas the PRS is aimed at the home and community environment (Ramsay,
Reynolds, & Kamphaus, 2002). In 2004, Reynolds and Kamphaus released the second edi-
tion of the BASC, known as the BASC-2, with updated scales and normative samples. Table
14.2 depicts the 5 composite scales, 16 primary scales, and 7 content scales for all the pre-
school, child, and adolescent versions of both instruments. Reynolds and Kamphaus (2004)
describe the individual primary subscales of the TRS and PRS as follows:

Adaptability: ability to adapt to changes in one’s environment


Activities of Daily Living: skills associated with performing everyday tasks
Aggression: acting in a verbally or physically hostile manner that threatens others
Anxiety: being nervous or fearful about actual or imagined problems or situations
Attention Problems: inclination to be easily distracted or have difficulty concentrating
Atypicality: reflects behavior that is immature, bizarre, or suggestive of psychotic
processes (e.g., hallucinations)
Conduct Problems: inclination to display antisocial behavior (e.g., cruelty, destructive)
Depression: reflects feelings of sadness and unhappiness
us Functional Communication: expression of ideas and communication in any way oth-
ers can understand
s Hyperactivity: inclination to be overactive and impulsive
u Leadership: reflects ability to achieve academic and social goals, particularly the
ability to work with others
us Learning Problems: reflects the presence of academic difficulties (only on the TRS)
a Social Skills: reflects the ability to interact well with peers and adults in a variety of
settings
= Somatization: reflects the tendency to complain about minor physical problems
m Study Skills: reflects skills that are associated with academic success, for example,
study habits, organization skills (only on the TRS)
a Withdrawal: the inclination to avoid social contact

New to the BASC-2 are the content scales, so called because their interpretation is
driven more by item content than actuarial or predictive methods. These scales are intended
for use by advanced-level clinicians to help clarify the meaning of the primary scales and
as an additional aid to diagnosis.
In addition to these individual scales, the TRS and PRS provide several different
composite scores. The authors recommend that interpretation follow a “top-down” ap-
proach, by which the clinician starts at the most global level and progresses to more specific
levels (e.g., Reynolds & Kamphaus, 2004). The most global measure is the Behavioral
Symptoms Index (BSI), which is a composite of the Aggression, Attention Problems, Anx-
iety, Atypicality, Depression, and Somatization scales. The BSI reflects the overall level of

TABLE 14.2 Composites, Primary Scales, and Content Scales in the TRS and PRS

The table lists the scales available on the Teacher Rating Scales and Parent Rating Scales across the preschool (P, ages 2-5), child (C, ages 6-11), and adolescent (A, ages 12-21) versions.

Composites: Adaptive Skills, Behavioral Symptoms Index, Externalizing Problems, Internalizing Problems, School Problems

Primary Scales: Adaptability, Activities of Daily Living, Aggression, Anxiety, Attention Problems, Atypicality, Conduct Problems, Depression, Functional Communication, Hyperactivity, Leadership, Learning Problems, Social Skills, Somatization, Study Skills, Withdrawal

Content Scales: Anger Control, Bullying, Developmental Social Disorders, Emotional Self-Control, Executive Functioning, Negative Emotionality, Resiliency

Number of Items: TRS-P 100, TRS-C 139, TRS-A 139; PRS-P 134, PRS-C 160, PRS-A 150

Notes: Shaded cells represent new scales added to the BASC-2. P = preschool version; C = child version; A = adolescent version.

Source: Behavior Assessment System for Children (BASC-2): Manual, Table 1.1, p. 3, by Cecil R. Reynolds and Randy W. Kamphaus, 2004. Copyright © 2004. All rights reserved. Published and distributed exclusively by NCS Pearson, Inc. P.O. Box 1416, Minneapolis, MN 55440. Reproduced with permission by NCS Pearson, Inc.

behavioral problems and provides the clinician with a reliable but nonspecific index of pa-
thology. For more specific information about the nature of the problem behavior, the clini-
cian proceeds to the four lower-order composite scores:

w Internalizing Problems. This is a composite of the Anxiety, Depression, and Somatiza-


tion scales. Some authors refer to internalizing problems as “overcontrolled” behavior. Stu-
dents with internalizing problems experience subjective or internal discomfort or distress, but
they do not typically display acting-out or disruptive behaviors (e.g., aggression, impulsive-
ness). As a result, these children may go unnoticed by teachers and school-based clinicians.

m Externalizing Problems. This is a composite of the Aggression, Conduct Prob-


lems, and Hyperactivity scales. Relative to the behaviors and symptoms associated with
internalizing problems, the behaviors associated with externalizing problems are clearly
apparent to observers. Children with high scores on this composite are typically disrup-
tive to both peers and adults, and usually will be noticed by teachers and other adults.

a School Problems. This composite consists of the Attention Problems and Learning
Problems scales. High scores on this scale suggest academic motivation, attention, and
learning difficulties that are likely to hamper academic progress. This composite is available
only for the BASC-TRS.

ms Adaptive Skills. This is a composite of Adaptability, Leadership, Social Skills, and


Study Skills scales. It reflects a combination of social, academic, and other positive skills
(Reynolds & Kamphaus, 2004).
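As an illustration of this top-down structure, the short Python sketch below groups hypothetical scale T-scores under the four composites just described. The scale-to-composite assignments follow the text, but the simple averaging is for illustration only; actual composite scores are obtained from the norm tables in the test manual.

# Illustrative only: group scale T-scores under the TRS composites described in the text.
# Real composite scores come from the manual's norm tables, not from a simple mean.
scale_t_scores = {
    "Anxiety": 62, "Depression": 58, "Somatization": 55,
    "Aggression": 66, "Conduct Problems": 71, "Hyperactivity": 68,
    "Attention Problems": 64, "Learning Problems": 61,
    "Adaptability": 42, "Leadership": 45, "Social Skills": 40, "Study Skills": 38,
}

composites = {
    "Internalizing Problems": ["Anxiety", "Depression", "Somatization"],
    "Externalizing Problems": ["Aggression", "Conduct Problems", "Hyperactivity"],
    "School Problems": ["Attention Problems", "Learning Problems"],
    "Adaptive Skills": ["Adaptability", "Leadership", "Social Skills", "Study Skills"],
}

for name, scales in composites.items():
    approximate = sum(scale_t_scores[s] for s in scales) / len(scales)
    print(f"{name}: approximate composite T-score = {approximate:.0f}")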

The third level of analysis involves examining the 16 clinical (e.g., Hyperactivity,
Depression) and adaptive scales (e.g., Leadership, Social Skills). Finally, clinicians will
often examine the individual items. Although individual items are often unreliable, when
interpreted cautiously they may provide clinically important information. This is particu-
larly true of what is often referred to as “critical items.” Critical items, when coded in a
certain way, suggest possible danger to self or others. For example, if a parent or teacher
reports that a child often “threatens to harm self or others,” the clinician would want to de-
termine whether these statements indicate imminent danger to the child or others.
When interpreting the Clinical Composites and Scale scores, high scores reflect ab-
normality or pathology. The authors provide the following classifications: T-score > 70 is
Clinically Significant; 60-69 is At-Risk; 41-59 is Average; 31-40 is Low; and <30 is Very
Low. Scores on the Adaptive Composite and Scales are interpreted differently, with high
scores reflecting adaptive or positive behaviors. The authors provide the following classifi-
cations: T-score > 70 is Very High; 60-69 is High; 41-59 is Average; 31-40 is At-Risk; and
<30 is Clinically Significant. Computer software is available to facilitate scoring and inter-
pretation, and the use of this software is recommended because hand scoring can be chal-
lenging for new users. An example of a completed TRS profile is depicted in Figure 14.1.
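A minimal Python sketch of these interpretive ranges follows. The cut points are the ones listed above for the clinical composites and scales; how scores of exactly 30 or 70 are handled is an assumption, since the text gives only the ranges.

def classify_clinical_t(t_score):
    """Map a clinical-scale T-score to the interpretive ranges described in the text.

    Boundary handling (scores of exactly 30 or 70) is an assumption; the text lists
    the ranges as >70, 60-69, 41-59, 31-40, and <30.
    """
    if t_score >= 70:
        return "Clinically Significant"
    if t_score >= 60:
        return "At-Risk"
    if t_score >= 41:
        return "Average"
    if t_score >= 31:
        return "Low"
    return "Very Low"

for score in (75, 64, 50, 35, 28):
    print(score, classify_clinical_t(score))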
The TRS and PRS have several unique features that promote their use. First, they
contain a validity scale that helps the clinician detect the presence of response sets. As noted
previously, validity scales are specially developed and incorporated in the test for the pur-
pose of detecting response sets. Both the parent and teacher scales contain a “fake bad” (F)

[Clinical Profile and Adaptive Profile graphs for the TRS. Note: High scores on the adaptive scales indicate high levels of adaptive skills.]
FIGURE 14.1 Completed Clinical and Adaptive Profile Sections of a TRS


Source: Behavior Assessment System for Children (BASC-2): Manual, Figure 3.9, by Cecil R. Reynolds and
Randy W. Kamphaus, 2004. Copyright © 2004. All rights reserved. Published and distributed exclusively by NCS
Pearson, Inc. P.O. Box 1416, Minneapolis, MN 55440. Reproduced with permission by NCS Pearson, Inc.

index that is elevated when an informant excessively rates maladaptive items as Almost al-
ways and adaptive items as Never. If this index is elevated, the clinician should consider the
possibility that a negative response set has skewed the results. Another unique feature of
these scales is that they assess both negative and adaptive behaviors. Before the advent of
the BASC, behavior rating scales were often criticized for focusing only on negative behav-
iors and pathology. Both the TRS and PRS address this criticism by assessing a broad
spectrum of behaviors, both positive and negative. The identification of positive character-
istics can facilitate treatment by helping identify strengths to build on. Still another unique
feature is that the TRS and PRS provide three norm-referenced comparisons that can be
selected depending on the clinical focus. The child’s ratings can be compared to a general
national sample, a gender-specific national sample, or a national clinical sample composed
of children who have a clinical diagnosis and are receiving treatment. In summary, the
BASC-2 PRS and BASC-2 TRS are psychometrically sound instruments that have gained
considerable support in recent years.

Conners’ Rating Scales—Revised (CRS-R)


The Conners’ Rating Scales—Revised (CRS-R) (Conners, 1997) have a rich history of
use in the assessment of children and adolescents, dating back to the late 1960s when the
early version of the scales was developed to measure the effectiveness of medication in the
treatment of hyperactive children (Kamphaus & Frick, 2002). The current revised version
includes teacher and parent inventories and is appropriate for children from 3 through 17
years. There are both long forms (e.g., 59 or 80 items) and short forms (27 or 28 items)
available. Conners (1997) describes the subscales of the long forms as follows:

= Oppositional: a tendency to break rules, be in conflict with authority figures, and be


easily angered
ms Cognitive Problems/Inattention: characterized by problems with attention, concen-
tration, organization, and difficulty completing projects
Hyperactivity: tendency to be overactive, restless, and impulsive
Anxious/Shy: propensity to be anxious, fearful, and overly emotional
Perfectionism: inclination to be obsessive and set high personal standards
Social Problems: characterized by feelings of isolation and low self-esteem
Psychosomatic: tendency to report numerous physical complaints (only on the Parent
Rating Scale)

The CRS-R produces two index scores, the ADHD Index and the Conners Global
Index (CGI). The ADHD Index is a combination of items that has been found to be useful
in identifying children who have attention deficit/hyperactivity disorder (ADHD). The CGI,
a more general index, is sensitive to a variety of behavioral and emotional problems. The
CGI (formerly the Hyperactivity Index) has been shown to be a sensitive measure of medi-
cation (e.g., psychostimulants such as Ritalin) treatment effects with children with ADHD.
Computer-scoring software is available for the CRS-R to facilitate scoring and interpreta-
tion. Specific strengths of the CRS-R include its rich clinical history and the availability of
short forms that may be used for screening purposes or situations in which repeated admin-
istrations are necessary (e.g., measuring treatment effects; Kamphaus & Frick, 2002).

Child Behavior Checklist and Teacher


Report Form (CBCL and TRF)
The Child Behavior Checklist (CBCL) and the Teacher Report Form (TRF) (Achenbach,
1991a, 1991b) are two components of an integrated system that also includes a self-report
scale and a direct observation system. There are two forms of the CBCL, one for children 2 to
3 years and one for children 4 to 18 years. The TRF is appropriate for children from 5 to 18
years. The CBCL and TRF have long played an important role in the assessment of children
and adolescents and continue to be among the most frequently used psychological tests in
schools today. The scales contain two basic sections. The first section collects information
about the child’s activities and competencies in areas such as recreation (e.g., hobbies and
sports), social functioning (e.g., clubs and organizations), and schooling (e.g., grades). The
second section assesses problem behaviors and contains item stems describing problem be-
haviors. On these items the informant records a response of Not true, Somewhat true/Some-
times true, or Very true/Often true. The clinical subscales of the CBCL and TRF are

ws Withdrawn: reflects withdrawn behavior, shyness, and a preference to be alone


= Somatic Complaints: a tendency to report numerous physical complaints (e.g., head-
aches, fatigue)
a Anxious/Depressed: reflects a combination of depressive (e.g., lonely, crying, un-
happy) and anxious (nervous, fearful, worried) symptoms
= Social Problems: reflects peer problems and feelings of rejection
a Thought Problems: evidence of obsessions/compulsions, hallucinations, or other
“strange” behaviors
m Attention Problems: reflects difficulty concentrating, attention problems, and hyper-
activity
= Delinquent Behavior: evidence of behaviors such as stealing, lying, vandalism, and
arson
m Aggressive Behavior: reflects destructive, aggressive, and disruptive behaviors

The CBCL and TRF provide three composite scores:

m Total Problems: overall level of behavioral problems


m Externalizing: a combination of the Delinquent Behavior and Aggressive Behavior
scales
a Internalizing: a combination of the Withdrawn, Somatic Complaints, and Anxious/
Depressed scales

Computer-scoring software is available for the CBCL and TRF and is recommended
because hand scoring is a fairly laborious and time-consuming process. The CBCL and
TRF have numerous strengths that continue to make them popular among school psy-
chologists and other clinicians. They are relatively easy to use, are time efficient (when
using the computer-scoring program), and have a rich history of clinical and research ap-
plications (Kamphaus & Frick, 2002).
The BASC-2 TRS and PRS, the CBCL and TRF, and the CRS-R are typically referred
to as omnibus rating scales. This indicates that they measure a wide range of symptoms and
behaviors that are associated with different emotional and behavioral disorders. Ideally an
omnibus rating scale should be sensitive to symptoms of both internalizing (e.g., anxiety,
depression) and externalizing (e.g., ADHD, conduct) disorders to ensure that the clinician
is not missing important indicators of psychopathology. This is particularly important when
assessing children and adolescents because there is a high degree of comorbidity with this
population. Comorbidity refers to the presence of two or more disorders occurring simulta-
neously in the same individual. For example, a child might meet the criteria for both an
externalizing disorder (e.g., conduct disorder) and an internalizing disorder (e.g., depressive
disorder). However, if a clinician did not adequately screen for internalizing symptoms, the
more obvious externalizing symptoms might mask the internalizing symptoms and result in
an inaccurate or incomplete diagnosis. Inaccurate diagnosis typically leads to inadequate treatment.

Although omnibus rating scales play a central role in the assessment of childhood
psychopathology, there are a number of single-domain or syndrome-specific rating scales.
These single-domain rating scales resemble the omnibus scales in format, but they focus
on
a single disorder (e.g., ADHD) or behavioral dimension (e.g., social skills). Although they
are limited in scope, they often provide a more thorough assessment of the specific domain
they are designed to assess than the omnibus scales. As a result, they can be useful in supple-
menting more comprehensive assessment techniques (e.g., Kamphaus & Frick, 2002). Ex-
amples of single-domain rating scales are the Teacher Monitor Ratings (TMR) and Parent
Monitor Ratings (PMR), which are components of the BASC Monitor for ADHD (Kam-
phaus & Reynolds, 1998). Although these behavior rating scales do contain items related to
internalizing disorders, the focus is clearly on behaviors related to ADHD. The BASC Mon-
itor is designed to help parents, teachers, and physicians determine whether medical, behav-
ioral, and educational treatments for ADHD are working (Kamphaus & Frick, 2002).

Self-Report Measures
A self-report measure is an instrument completed by individuals that allows them to describe their own subjective experiences, including emotional, motivational, interpersonal, and attitudinal characteristics (Anastasi & Urbina, 1997).

A self-report measure is an instrument completed by individuals that allows them to describe their own subjective experiences, including emotional, motivational, interpersonal, and attitudinal characteristics (e.g., Anastasi & Urbina, 1997). Although the use of self-report measures has a long and rich history with adults, their use with children is a relatively new development because it was long believed that children did not have the personal insights necessary to understand and accurately report their subjective experiences. To further complicate the situation, skeptics noted that young children typically do not have the reading skills necessary to complete written
self-report tests (e.g., Kamphaus & Frick, 2002). However, numer-
ous self-report measures have been developed and used successfully with children and ado-
lescents. Although insufficient reading skills do make these instruments impractical with
very young children, these new self-report measures are being used with older children
(e.g., >7 years) and adolescents with considerable success. Self-report measures have proven
to be particularly useful in the assessment of internalizing disorders such as depression and
anxiety that have symptoms that are not always readily apparent to observers. The develop-
ment and use of self-report measures with children are still at a relatively early stage, but
several instruments are gaining widespread acceptance. We will now briefly describe some
of the most popular child and adolescent self-report measures.

Behavior Assessment System for Children, Second


Edition—Self-Report of Personality (SRP)
The Behavior Assessment System for Children—Self-Report of Personality (SRP) (Rey-
nolds & Kamphaus, 2004) is a component of the Behavioral Assessment System for Children
(BASC-2) we introduced earlier, and recent research suggests it is the most popular self-report
measure among school psychologists. There are three forms of the SRP, one for children 8 to
11 years and one for adolescents 12 to 18 years. A third version, the SRP-I (for interview) is
standardized as an interview version for ages 6 and 7 years. The SRP has an estimated 3rd-
grade reading level, and if there is concern about the student’s ability to read and comprehend
the material, the instructions and items can be presented using audio. The SRP contains brief
descriptive statements that children or adolescents mark as true or false to some questions, or
never, sometimes, often, or almost always to other questions, as it applies to them. Table 14.3
depicts the 5 composites along with the 16 primary scales and 4 content scales available for children
and adolescents. Reynolds and Kamphaus (1992) describe the subscales as follows:

uw Anxiety: feelings of anxiety, worry, and fears and a tendency to be overwhelmed by


stress and problems
Attention Problems: being easily distracted and unable to concentrate
Attitude to School: feelings of alienation and dissatisfaction with school
Attitude to Teachers: feelings of resentment and dissatisfaction with teachers
Atypicality: unusual perceptions, behaviors, and thoughts that are often associated
with severe forms of psychopathology
Depression: feelings of rejection, unhappiness, and sadness
Hyperactivity: being overly active, impulsive, and rushing through work
Interpersonal Relations: positive social relationships
Locus of Control: perception that events in one’s life are externally controlled
Relations with Parents: positive attitude toward parents and feeling of being impor-
tant in the family
Self-Esteem: positive self-esteem characterized by self-respect and acceptance
Self-Reliance: self-confidence and ability to solve problems
Sensation Seeking: tendency to take risks and seek excitement
Sense of Inadequacy: feeling unsuccessful in school and unable to achieve goals
Social Stress: stress and tension related to social relationships
Somatization: tendency to experience and complain about physical discomforts and problems

The SRP produces five composite scores. The most global composite is the Emotional
Symptoms Index (ESI) composed of the Anxiety, Depression, Interpersonal Relations, Self-
Esteem, Sense of Inadequacy, and Social Stress scales. The ESI is an index of global psy-
chopathology, and high scores usually indicate serious emotional problems. The four
lower-order composite scales are

= Inattention/Hyperactivity. This scale combines the Attention Problems and the Hy-
peractivity scales to form a composite reflecting difficulties with the self-regulation of be-
havior and ability to attend and concentrate in many different settings.
a Internalizing Problems. This is a combination of the Anxiety, Atypicality, Locus of
Control, Social Stress, and Somatization scales. This scale reflects the magnitude of inter-
nalizing problems, and clinically significant scores (i.e., T-scores > 70) suggest significant
problems.

TABLE 14.3 Composites, Primary Scales, and Content Scales in the SRP

The table lists the scales available on the child (C, ages 8-11) and adolescent (A, ages 12-21) versions.

Composites: Emotional Symptoms Index, Inattention/Hyperactivity, Internalizing Problems, Personal Adjustment, School Problems

Primary Scales: Anxiety, Attention Problems, Attitude to School, Attitude to Teachers, Atypicality, Depression, Hyperactivity, Interpersonal Relations, Locus of Control, Relations with Parents, Self-Esteem, Self-Reliance, Sensation Seeking, Sense of Inadequacy, Social Stress, Somatization

Content Scales: Anger Control, Ego Strength, Mania, Test Anxiety

Number of Items: C 139, A 176

Note: Shaded cells represent new scales added to the BASC-2.

Source: Behavior Assessment System for Children (BASC-2): Manual, Table 1.2, p. 5, by Cecil R. Reynolds and Randy W. Kamphaus, 2004. Copyright © 2004. All rights reserved. Published and distributed exclusively by NCS Pearson, Inc. P.O. Box 1416, Minneapolis, MN 55440. Reproduced with permission by NCS Pearson, Inc.

m School Problems: This is composed of the Attitude to School, Attitude to Teachers,


and Sensation Seeking scales. High scores on this scale suggest a general pattern of dis-
satisfaction with schools and teachers. Clinically significant scores suggest pervasive school
problems, and adolescents with high scores might be at risk for dropping out.
= Personal Adjustment: This is composed of the Interpersonal Relationships, Relations
with Parents, Self-Esteem, and Self-Reliance scales. High scores are associated with posi-
tive adjustment whereas low scores suggest deficits in interpersonal relationships and iden-
tity formation.
As with the BASC-2 TRS and PRS, high scores on the SRP Clinical Composites and
Scales reflect abnormality or pathology. The authors provide the following classifications:
T-score > 70 is Clinically Significant; 60-69 is At-Risk; 41-59 is Average; 31-40 is Low; and
<30 is Very Low. Scores on the Adaptive Composite and Scales are interpreted differently,
with high scores reflecting adaptive or positive behaviors. The authors provide the following
classifications: T-score > 70 is Very High; 60-69 is High; 41-59 is Average; 31-40 is At-Risk;
and <30 is Clinically Significant. Computer software is available to facilitate scoring and in-
terpretation. An example of a completed SRP profile is depicted in Figure 14.2.
The SRP has numerous positive features that recommend its use. Possibly the most salient of these features is the inclusion of three validity scales (i.e., F index, L index, and V index). Because self-report measures have historically been criticized for being particularly susceptible to response sets, the detection of response sets is of primary importance. The F index is composed of items that are "infrequently" endorsed in a specific manner in a normal population. For example, very few children or adolescents indicate that they are "not a good friend" or that they "often cheat on tests." This type of validity scale is often referred to as an infrequency index. If an examinee endorses enough of these items in the keyed direction, his or her F index will be elevated. High scores on the F index can be the result of numerous factors, ranging from reading difficulties to an intentional desire to "fake bad" in order to look more disturbed or pathological. A second SRP validity scale is the L index, which also contains items that are rarely endorsed in a specific manner in a normal population. The distinction is that items on this scale are intended to identify individuals with a "social desirability" response set (i.e., examinees who are trying to "fake good"). For example, few adolescents who are responding honestly will indicate that "their life is perfect" or that "their teachers are always right." High scores on the L index suggest that the SRP clinical scales may underestimate any existing emotional or behavioral problems. The final validity scale is the V index, which is composed of nonsensical items that may be endorsed due to carelessness, reading difficulty, or simply a refusal to cooperate. An example of an item that might be included in the V index is "Batman is my best friend." Special Interest Topic 14.2 provides an example of a fake good response set and how the use of the SRP Lie scale helps identify this response set.

Because self-report measures have long been criticized for being sensitive to response sets, many have incorporated validity scales to detect the presence of response sets.
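The counting logic behind an infrequency-type index can be sketched in a few lines of Python. The items, keyed responses, and cutoff below are hypothetical and serve only to illustrate the idea; actual validity-scale items, keys, and cut scores are defined by a test's authors.

# Hypothetical illustration of an infrequency ("F-type") validity index: count how many
# rarely endorsed items were answered in the keyed direction and compare to a cutoff.
# The items, keys, and cutoff are made up for illustration.
keyed_responses = {
    "I am not a good friend": "true",
    "I often cheat on tests": "true",
    "Nothing ever goes right for me": "true",
}

examinee_answers = {
    "I am not a good friend": "true",
    "I often cheat on tests": "false",
    "Nothing ever goes right for me": "true",
}

f_index = sum(
    1 for item, key in keyed_responses.items() if examinee_answers.get(item) == key
)
CUTOFF = 2  # illustrative threshold for flagging a possibly exaggerated or careless protocol
print(f"F index = {f_index}; flag for review: {f_index >= CUTOFF}")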
Another positive feature of the SRP is its coverage of a relatively broad age range.
While most other omnibus self-report measures developed for children were limited to ex-
aminees 11 years or older, the SRP extended the age range down to 8 years. The interview
version of the SRP developed for 6- and 7-year-olds extends the age range even further. With
the SRP—Interview (SRP-I), the clinician reads the items to the child. Items are phrased

[Clinical Profile and Adaptive Profile graphs for the SRP. Note: High scores on the adaptive scales indicate high levels of adaptive skills.]
FIGURE 14.2 Completed Clinical and Adaptive Profile Sections of an SRP


Source: Behavior Assessment System for Children (BASC-2): Manual, Figure 4.7, by Cecil R. Reynolds and
Randy W. Kamphaus, 2004. Copyright © 2004. All rights reserved. Published and distributed exclusively by NCS
Pearson, Inc. P.O. Box 1416, Minneapolis, MN 55440. Reproduced with permission by NCS Pearson, Inc.

appropriately to make them sound as though they are simply part of an interview. The child’s
responses are then scored according to objective criteria. Another positive aspect of the SRP
is that it covers several dimensions or areas that are important to children and adolescents,
but have been neglected in other child self-report measures (e.g., attitude toward teachers and
school). Finally, the SRP assesses both clinical and adaptive dimensions. This allows the
clinician to identify not only problem areas but also areas of strength to build on.

Youth Self-Report (YSR)


The Youth Self-Report (YSR) (Achenbach, 1991c) is a component of Achenbach’s assess-
ment system that includes the CBCL and TRF described earlier. The YSR can be used with
children from 11 to 18 years and closely parallels the format and content of the CBCL and
TRF. In fact, it produces the same scales (i.e., Withdrawn, Somatic Complaints, Anxious/
Depressed, Social Problems, Thought Problems, Attention Problems, Delinquent Behavior,
and Aggressive Behavior) and composite scores (Externalizing, Internalizing, and Total

Problems). This close correspondence with the CBCL and TRF is one of the strengths of the
YSR. Additionally, the YSR has an extensive research base that facilitates clinical interpre-
tations and computer-scoring software that eases scoring. The YSR has a strong and loyal
following and continues to be a popular instrument used in school settings.
As with behavior rating scales, self-report measures come in omnibus and single-
domain formats. Both the SRP and YSR are omnibus self-report measures. An example of
a single-domain self-report measure is the Children’s Depression Inventory (CDI; Kovacs,
1991). The CDI is a brief, 27-item self-report inventory designed for use with children be-
tween 7 and 17 years. It presents a total score as well as five factor scores: Negative Mood,
Interpersonal Problems, Ineffectiveness, Anhedonia (loss of pleasure from activities that
previously brought pleasure), and Negative Self-Esteem. The CDI is easily administered
and scored, is time efficient and inexpensive, and has an extensive research database. As
with the other single-domain measures, the CDI does not provide coverage of a broad range
of psychological disorders or personality characteristics, but it does give a fairly in-depth
assessment of depressive symptoms.

Projective Techniques

Projective techniques involve the presentation of unstructured or ambiguous stimuli that allow an almost infinite range of responses from the examinee. For example, the clinician shows the examinee an inkblot and asks: "What might this be?" The central hypothesis of projective techniques is that the examinees will interpret the ambiguous material in a manner that reveals important and often unconscious aspects of their psychological functioning or personality. In other words, the ambiguous material serves as a blank screen on which the examinees "project" their
most intimate thoughts, desires, fears, needs, and conflicts (Anastasi & Urbina, 1997; Finch
& Belter, 1993). Although extremely popular in clinical settings, the use of projective tech-
niques in the assessment of personality has a long and controversial history. In fact, Chandler
(1990) noted that projective techniques have been the focus of controversy practically since
they were initially introduced. Proponents claim that they are the richest source of clinical
information available and are necessary in order to gain a thorough understanding of the indi-
vidual. They suggest that behavior rating scales assess only surface behavioral patterns, and
self-report measures reflect only what the examinee wants to reveal. Whereas behavior rating
scales and self-report measures are susceptible to response sets, projective techniques are
thought to be relatively free of response sets because the examinee has little idea of what type
of responses are expected or are socially appropriate.
Critics of the use of projective techniques note that these procedures typically do not
meet even minimum psychometric standards (e.g., having appropriate evidence to support
their reliability and validity), and as a result, their use cannot be justified from an ethical or
technical perspective. Even if projective techniques are used simply to supplement a psycho-
metrically sound battery of objective measures, their questionable reliability and validity
will still detract from the technical soundness of the overall assessment process (Kamphaus
& Frick, 2002). Some of the key points of the debate are depicted in Table 14.4.

The debate over the use of projective techniques has been going on for decades. Although there is evidence of diminished use of projective techniques in the assessment of children and adolescents, these techniques are still popular and used in schools. For example, a national survey of psychological assessment procedures used by school psychologists indicates that four of the ten most popular procedures for assessing personality are projective techniques (Livingston et al., 2003). This debate is apt to continue, but it is highly likely that projectives will continue to play a prominent role in the assessment of children and adolescents for the foreseeable future. Next we will briefly describe a few of the major projective techniques used with children and adolescents.

Projective Drawings
Some of the most popular projective techniques used with children and adolescents
involve
the interpretation of projective drawings. This popularity is usually attributed to
two fac-
tors. First, young children with limited verbal abilities are hampered in their ability to re-
spond to clinical interviews, objective self-report measures, and even most other projective
techniques. However, these young children can produce drawings because this activity
is
largely nonverbal. Second, because children are usually familiar with and enjoy drawing,
this technique provides a nonthreatening “child-friendly” approach to assessment (Finch &
Belter, 1993; Kamphaus & Frick, 2002). There are several different projective drawing
techniques in use today.

Draw-A-Person Test (DAP). The Draw-A-Person Test (DAP) is the most widely used
projective drawing technique. The child is given a blank sheet of paper and a pencil and asked
to draw a whole person. Although different scoring systems have been developed for the
DAP, no system has received universal approval. The figure in the drawing is often inter-
preted as a representation of the “self.” That is, the figure reflects how children feel about
themselves and how they feel as they interact with their environment (Handler, 1985).

House-Tree-Person (H-T-P). With the House-Tree-Person (H-T-P), the child is given


paper and a pencil and asked to draw a house, a tree, and a person of each gender, all on
separate sheets. The clinician then typically asks a standard set of questions for each picture.
After these drawings are completed, the child is then given a set of crayons and the process
is repeated. The House is typically interpreted as reflecting feelings associated with home
life and family relationships. The Tree and Person are thought to reflect aspects of the self,
with the Tree representing deep unconscious feelings about the self and the Person reflect-
ing a closer-to-conscious view of self (Hammer, 1985).

Kinetic Family Drawing (KFD). With the Kinetic Family Drawing (KFD), children
are given paper and pencil and asked to draw a picture of everyone in their family, including
themselves, doing something (hence the term kinetic). After completing the drawing the
children are asked to identify each figure and describe what each one is doing. The KFD is
thought to provide information regarding the children’s view of their family and their inter-
actions (Finch & Belter, 1993).

TABLE 14.4 The Projective Debate

Pro: Less structured format allows clinician greater flexibility in administration and interpretation and places fewer demand characteristics that would prompt socially desirable responses from informant.
Con: The reliability of many techniques is questionable. As a result, the interpretations are more related to characteristics of the clinician than to characteristics of the person being tested.

Pro: Allows for the assessment of drives, motivations, desires, and conflicts that can affect a person's perceptual experiences but are often unconscious.
Con: Even some techniques that have good reliability have questionable validity, especially in making diagnoses and predicting overt behavior.

Pro: Provides a deeper understanding of a person than would be obtained by simply describing behavioral patterns.
Con: Although we can at times predict things we cannot understand, it is rarely the case that understanding does not enhance prediction (Gittelman-Klein, 1986).

Pro: Adds to an overall assessment picture.
Con: Adding an unreliable piece of information to an assessment battery simply decreases the overall reliability of the battery.

Pro: Helps to generate hypotheses regarding a person's functioning.
Con: Leads one to pursue erroneous avenues in testing or to place undue confidence in a finding.

Pro: Nonthreatening and good for rapport building.
Con: Detracts from the time an assessor could better spend collecting more detailed, objective information.

Pro: Many techniques have a long and rich clinical tradition.
Con: Assessment techniques are based on an evolving knowledge base and must continually evolve to reflect this knowledge.

Source: Clinical Assessment of Child and Adolescent Personality and Behavior (2nd ed.) (Table 11.1, p. 231) by R. W. Kamphaus and P. J. Frick, 2002, Boston: Allyn & Bacon. Copyright 2002 by Pearson Education. Adapted with permission.

Despite their popularity and appeal to clinicians, little empirical data support the use of
projective drawings as a means of predicting behavior or classifying children by diagnostic
type (e.g., depressed, anxious, conduct disordered, etc.). These techniques may provide a
nonthreatening way to initiate the assessment process and an opportunity to develop rapport,
but otherwise they should be used with considerable caution and an understanding of their
technical limitations (Finch & Belter, 1993; Kamphaus & Frick, 2002).

Sentence Completion Tests


Sentence completion tests are another popular projective approach used with children and
adolescents. These tests typically present incomplete-sentence stems that are completed by
the child. The sentence completion forms either can be given to the child to complete inde-
pendently or can be read aloud to the child and the responses recorded. Examples of pos-
sible incomplete sentence stems include “I really enjoy . . .” and “My greatest fear is...”
Numerous sentence completion forms are available, and as with the projective drawings,

there are different ways of interpreting the results. Because incomplete-sentence stems pro-
vide more structure than most projective tasks (e.g., drawings or inkblots), some have ar-
gued that they are not actually “projective” in nature, but are more or less a type of structured
interview. As a result, some prefer the term semiprojective to characterize these tests. Re-
gardless of the classification, relatively little empirical evidence documents the psycho-
metric properties of these tests (Kamphaus & Frick, 2002). Nevertheless, they remain
popular, are nonthreatening to children, and in the hands of skilled clinicians may provide
an opportunity to enhance their understanding of their clients.

Apperception Tests
Another type of projective technique used with children is apperception tests. With this
technique the child is given a picture and asked to make up a story about it. Figure 14.3
depicts a picture similar to those in some apperception tests used with older children and
adolescents. These techniques are also sometimes referred to as thematic or storytelling
techniques. Like other projective techniques, children generally find apperception tests in-
viting and enjoyable. Two early apperception tests, the Thematic Apperception Test (TAT)
and the Children’s Apperception Test (CAT), have received fairly widespread use with chil-
dren and adolescents. Like other projective techniques, limited empirical evidence supports
the use of the TAT or CAT. A more recently developed apperception test is the Roberts Ap-
perception Test for Children (RATC; McArthur & Roberts, 1982), which uniquely features
the inclusion of a standardized scoring system and normative data. The standardized scoring
approach results in increased reliability relative to previous apperception tests. However, the
normative data are inadequate and there is little validity evidence available (Kamphaus &
Frick, 2002). Nevertheless, the RATC is a step in the right direction in terms of enhancing
the technical qualities of projective techniques.

Inkblot Techniques
The final projective approach we will discuss is the inkblot technique. With this technique
the child is presented an ambiguous inkblot and asked to interpret it in some manner, typically
by asking: “What might this be?” Figure 14.4 presents an example of an inkblot similar to
those used on inkblot tests. Of all the inkblot techniques, the Rorschach is the most widely
used. Different interpretative approaches have been developed for the Rorschach, but the
Exner Comprehensive System (Exner, 1974, 1978) has received the most attention by clini-
cians and researchers in recent years. The Exner Comprehensive System provides an elaborate
standardized scoring system that produces approximately 90 possible scores. Relative to other
Rorschach interpretive systems, the Exner system produces more reliable measurement and
has reasonably adequate normative data. However, evidence of validity
is limited, and many of the scores and indexes that were developed with adults have not proven effective with children (Kamphaus & Frick, 2002).
In summary, in spite of relatively little empirical evidence of their utility, projective techniques continue to be popular among psychologists and other clinicians. Our recommendation is to use these

FIGURE 14.3 A Picture Similar to Those Used on Apperception Tests
Source: From Robert J. Gregory, Psychological Testing: History, Principles, and Applications, 3/e. Published by Allyn & Bacon, Boston, MA. Copyright © 2004 by Pearson Education. Reprinted by permission of the publisher.

FIGURE 14.4 An Inkblot Similar to Those Used on Inkblot Tests
Source: From Robert J. Gregory, Psychological Testing: History, Principles, and Applications, 3/e. Published by Allyn & Bacon, Boston, MA. Copyright © 2004 by Pearson Education. Reprinted by permission of the publisher.

instruments cautiously. They should not be used for making important educational, clinical,
and diagnostic decisions, but they may have merit in introducing the child to the assessment
process, establishing rapport, and developing hypotheses that can be pursued with more
technically adequate assessment techniques.

Summary

In this chapter we focused on tests of behavior and personality and their applications in the
schools. We noted that Public Law 94-142 and subsequent legislation require that public
schools provide special education and related services to students with emotional disorders.
Before these services can be provided, the schools must be able to identify children with
these disorders. The process of identifying these children often involves a psychological
evaluation completed by a school psychologist or other clinician. Teachers often play an
important role in this assessment process. For example, teachers often complete rating
scales that describe the behavior of students in their class. Teachers are also often involved
in the development and implementation of educational programs for these special needs
students. As a result, it is beneficial for teachers to be familiar with the types of instruments
used to identify students with emotional and behavioral problems.

We noted the three major types of instruments used in assessing personality and be-
havior in children and adolescents, including the following:

■ Behavior rating scales. A behavior rating scale is an inventory completed by an adult


informant such as a teacher, parent, or guardian. An advantage of behavior rating scales is
that they can be used to collect information from adults who have had many opportunities
to observe the child or adolescent in a variety of settings over extended periods of time.
Behavior rating scales are efficient, cost-effective, and particularly useful for assessing
externalizing behaviors such as aggression and hyperactivity. Although the use of adult in-
formants provides some degree of objectivity, behavior rating scales are still susceptible to
response sets (i.e., a situation in which the individuals completing the test respond in a man-
ner that distorts their own or another person’s true characteristics). Although behavior rating
scales are particularly good at diagnosing externalizing problems such as aggression or
overt defiance, they are less effective in assessing internalizing behaviors such as depression
or anxiety because these problems are not always easily observable.
■ Self-report measures. A self-report measure is an instrument completed by individu-
als that allow them to describe their own subjective experiences, including their emotional,
motivational, and attitudinal characteristics. The use of self-report measures with children
is a relatively recent development because it was long believed that children did not have
the personal insights necessary to understand and report their subjective experiences. Al-
though it is true that children must have the reading skills necessary to read and complete
these instruments, self-report measures are proving to be useful instruments for assessing
emotional and behavioral problems in older children and adolescents. They are particularly
useful in the assessment of internalizing disorders such as depression and anxiety, which are
not always conspicuous to adults observing the child or adolescent. One prominent limita-
tion of self-report measures is the potential distorting effects of response sets. That is, there
is the potential that examinees will respond in a manner that does not accurately reflect their
true characteristics. For example, they may answer questions in a way that makes them ap-
pear more socially appropriate, even if their responses are not truthful or accurate.
■ Projective techniques. Projective techniques involve the presentation of an ambigu-
ous task that places little structure or limitation on the examinee’s response. A classic ex-
ample is the presentation of an inkblot followed by the question: “What might this be?” In
addition to inkblot tests, projective techniques include projective drawings, sentence com-
pletion tests, and apperception (or storytelling) tests. The hypothesis behind the use of
projective techniques is that the examinees will respond to the ambiguous stimuli in a man-
ner that reveals basic, often unconscious aspects of their personality. There is considerable
controversy over the use of projective techniques. Proponents of their use claim projective
techniques represent the richest source of information about the subjective experience of the
examinee. Supporters also hold that behavior rating scales and self-report measures are
vulnerable to the distorting effects of response sets, whereas projective techniques are rela-
tively free from these effects because it is not obvious what type of response is expected or
socially appropriate. In contrast, critics claim that most projective techniques do not meet
even minimal psychometric standards and their use cannot be ethically or technically justi-
fied. While the use of these projective techniques is vigorously debated in the professional

literature, they continue to be among the most popular approaches to assessing the personal-
ity of children and adolescents. Our position is that although projective techniques should
not be used as the basis for making important educational, clinical, or diagnostic decisions,
they may have merit in developing rapport with clients and in generating hypotheses that
can be pursued using technically superior assessment techniques.

KEY TERMS AND CONCEPTS

Apperception tests, p. 391
Behavior Assessment System for Children—Parent Rating Scale (PRS), p. 376
Behavior Assessment System for Children—Self-Report of Personality (SRP), p. 383
Behavior Assessment System for Children—Teacher Rating Scale (TRS), p. 376
Behavior rating scale, p. 375
Child Behavior Checklist (CBCL), p. 381
Conners Rating Scales—Revised (CRS-R), p. 381
Draw-A-Person Test (DAP), p. 389
House-Tree-Person (H-T-P), p. 389
Inkblot technique, p. 391
Kinetic Family Drawing (KFD), p. 389
Personality, p. 371
Projective drawings, p. 389
Projective techniques, p. 388
Public Law 94-142 / IDEA, p. 371
Response sets, p. 372
Self-report measure, p. 383
Sentence completion tests, p. 390
Teacher Report Form (TRF), p. 381
Typical response tests, p. 371
Validity scale, p. 373
Youth Self-Report (YSR), p. 387

RECOMMENDED READINGS

Kamphaus, R. W., & Frick, P. J. (2002). Clinical assessment of child and adolescent personality and behavior. Boston: Allyn & Bacon. This text provides comprehensive coverage of the major personality and behavioral assessment techniques used with children and adolescents. It also provides a good discussion of the history and current use of projective techniques.

Reynolds, C. R., & Kamphaus, R. W. (2003). Handbook of psychological and educational assessment of children: Personality, behavior, and context. New York: Guilford Press. This is another excellent source providing thorough coverage of the major behavioral and personality assessment techniques used with children. Particularly good for those interested in a more advanced discussion of these instruments and techniques.

Go to the text's companion website to view a PowerPoint™


presentation and to listen to an audio lecture about this chapter.
CHAPTER 15

Assessment Accommodations

Assessment accommodations help students show what they know without


being placed at a disadvantage by their disability.
—U.S. Department of Education, 2001, p. 8

CHAPTER HIGHLIGHTS

Major Legislation That Affects the Assessment of Students with Disabilities
Individuals with Disabilities Education Act (IDEA)
Section 504
The Rationale for Assessment Accommodations
When Are Accommodations Not Appropriate or Necessary?
Strategies for Accommodations
Determining What Accommodations to Provide
Assessment of English Language Learners (ELLs)
Reporting Results of Modified Assessments

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Explain the rationale for making modifications in assessment procedures for students with
disabilities.
2. Distinguish between appropriate and inappropriate assessment accommodations and give
examples of both.
3. Identify situations in which assessment accommodations are inappropriate or unnecessary.
4. Identify major legislation that has impacted the provision of educational services to students
with disabilities.
5. Trace the history of the Individuals with Disabilities Education Act (IDEA) and describe its
impact on the education of students with disabilities.
6. Describe the role of the regular education teacher in providing instructional and assessment
services to students with disabilities, and explain why this role is increasing.
7. Identify and briefly describe the categories of disabilities recognized under IDEA.
8. Describe the impact of Section 504 of the Rehabilitation Act of 1973 and explain its
relationship to IDEA.
9. Identify and give examples of modifications of the presentation format that might be
appropriate for students with disabilities.


10. Identify and give examples of modifications of the response format that might be appropriate
for students with disabilities.
11. Identify and give examples of modifications of timing that might be appropriate for students
with disabilities.
12. Identify and give examples of modifications of the setting that might be appropriate for
students with disabilities.
13. Identify and give examples of adaptive devices and supports that might be appropriate for
students with disabilities.
14. Describe and give examples illustrating the use of limited portions of an assessment or an
alternate assessment for a student with a disability.
15. Identify and explain the reasoning behind the major principles for determining which
assessment accommodations to provide.
16. Briefly describe the current status of research on the selection of assessment
accommodations.
17. Describe the use of assessment accommodations for English Language Learners.
18. Discuss the controversy regarding the reporting of results of modified assessments.

So far in this text we have emphasized the importance of strictly adhering to standard
assessment procedures when administering tests and other assessments. This is neces-
sary to maintain the reliability and validity of score interpretations.
However, at times it is appropriate to deviate from these standard procedures. Standard assessment procedures may not be appropriate for students with a disability if the assessment requires the students to use some ability (e.g., sensory, motor, language, etc.) that is affected by their disability, but is irrelevant to the construct being measured. To address this, teachers and others involved in assessment may need to modify standard assessment procedures to accommodate the special needs of students with disabilities. In this context, the Standards (AERA et al., 1999) note that assessment
accommodations are changes in the standard assessment proce-
dures that are implemented in order to minimize the impact of student characteristics that
are irrelevant to the construct being measured by the assessment. The Standards go on to
state that the goal of accommodations is to provide the most valid and accurate measure-
ment of the construct of interest. As framed by the U.S. Department of Education (1997),
“Assessment accommodations help students show what they know without being placed
at a disadvantage by their disability” (p. 8). For example, consider a test designed to as-
sess a student’s knowledge of world history. A blind student would not be able to read the
material in its standard printed format, but if the student could read Braille, an appropri-
ate accommodation would be to convert the test to the Braille format. In this example, it
is important to recognize that reading standard print is incidental to the construct being
measured. That is, the test was designed to measure the student’s knowledge of world
history, not the ability to read standard print. An important consideration when selecting
accommodations is that we only want to implement accommodations that preserve the

reliability of test scores and the inferences about the meaning of performance on the test
(U.S. Department of Education, 1997).

Major Legislation That Affects the


Assessment of Students with Disabilities

Lawmakers are increasingly writing laws that mandate assessment accommodations for students with disabilities. As a result, teachers are increasingly being called on to modify their assessments in order to accommodate the special needs of students with disabilities. Major laws that address assessment accommoda-
tions include Section 504 of the Rehabilitation Act of 1973 (Section
504), The Americans with Disabilities Act (ADA), The No Child
Left Behind Act of 2001 (NCLB), and the Individuals with Disabilities Education Improve-
ment Act (IDEA 2004). In the next section we will focus on IDEA and Section 504 and their
impact on the assessment of students with disabilities.

Individuals with Disabilities Education Act (IDEA)

In 1975, Congress passed Public Law 94-142, the Education of All Handicapped Children
Act (EAHCA). This law required that public schools provide students with disabilities a
free, appropriate public education (FAPE). Prior to the passage of this law, it was estimated
that as many as one million children with disabilities were being denied a FAPE (e.g., Turn-
bull, Turnbull, Shank, Smith, & Leal, 2002). In 1986, Public Law 99-457, the Infants and
Toddlers with Disabilities Act, was passed to ensure that preschool children with disabilities
also received appropriate services. In 1990, the EAHCA was reauthorized and the name was
changed to the Individuals with Disabilities Education Act (IDEA).
These laws had a significant impact on the way students with disabilities received
educational services. The number of children with developmental disabilities in state mental
health institutions declined by almost 90%, the rate of unemployment for individuals in their
twenties with disabilities was reduced, and the number of young adults with disabilities
enrolled in postsecondary education increased. Although this was clearly a step in the right
direction, problems remained. Students with disabilities were still dropping out of school at
almost twice the rate of students without disabilities, there was concern that minority chil-
dren were being inappropriately placed in special education, and educational professionals
and parents had concerns about the implementation of the law (Kubiszyn & Borich, 2003).
To address these and other concerns, the law was updated and reauthorized in 1997 as the
Individuals with Disabilities Education Act of 1997 (IDEA 97) and again in 2004 as the
Individuals with Disabilities Education Improvement Act of 2004 (IDEA 2004).
Entire books have been written on IDEA and its impact on the public schools, and it is
not our intention to cover this law and its impact in great detail. Because this is a textbook on
educational assessment for teachers, we will be limiting our discussion to the effect of IDEA

on the assessment practices of teachers. In this context, probably the greatest effect of IDEA has been its requirement that schools provide services to students with disabilities in the general education classroom whenever appropriate. Earlier versions of the act had required that students with disabilities receive instruction in the least restrictive environment. In actual practice students with disabilities were often segregated into resource or self-contained classrooms largely based on the belief that they would not be able to profit from instruction in
regular education classrooms. Educational research, however, has shown that students with
disabilities demonstrate superior educational and social gains when they receive instruction
in regular education classrooms (see McGregor & Vogelsberg, 1998; Stainback & Stainback,
1992). Revisions of IDEA, reflecting this research and prevailing legal and political trends,
mandated that public schools educate students with disabilities alongside students who do
not have disabilities to the maximum extent possible, an approach often referred to as inclu-
sion or mainstreaming (Turnbull et al., 2002). This extends not only to students with mild dis-
abilities but also to those with moderate and severe disabilities. The impact of this on regular
education teachers is that they have more students with disabilities in their classrooms. As
a result, regular education teachers are increasingly responsible for planning and providing
instruction to children with disabilities and for evaluating their progress. This includes help-
ing identify students with disabilities, planning their instruction and assessment, and working
with them daily in the classroom.
Central to the provision of services to students with disabilities is the individualized
educational program (IEP). The IEP is a written document developed by a committee
or team composed of the student’s parents, regular education teachers, special education
teachers, and other school personnel (e.g., school psychologists, counselors). This commit-
tee is typically referred to as the IEP committee. When appropriate the students may be in-
vited to participate as well as professionals representing external agencies. At a minimum
the IEP should specify the student’s present level of academic performance, identify mea-
surable annual goals and short-term objectives, specify their instructional arrangement,
and identify the special education and related services the student will receive. In terms
of assessment accommodations, the IEP should specify any modifications in classroom
tests and other assessments that are deemed necessary, and each of the student’s teachers
should have a copy of the IEP. Additionally, the IEP should identify any accommodations
that are seen as appropriate for state and districtwide assessments, including those required
by the No Child Left Behind Act (NCLB). If the IEP committee decides that the student is
unable to take the state’s regular assessment even with accommodations, the IEP can specify
that the student take an alternate assessment. These alternate assessments are designed for
students that the IEP committee determines should not be assessed based on their grade-
level curriculum.
As we noted, regular education teachers are becoming increasingly involved in teach-
ing and testing students with disabilities. Mastergeorge and Miyoshi (1999) note that as
members of the IEP committee, regular education teachers are involved in

■ Developing, reviewing, and revising the student's IEP
■ Developing positive behavioral interventions and supports

■ Determining supplementary aids, services, and program modifications (including instructional and assessment modifications or accommodations)
■ Determining what type of personnel support is needed to help the child function and progress in the regular education classroom

The involvement of regular education teachers does not stop simply with planning. They
are also primarily responsible for implementing the IEP in the classroom. It should be noted
that the services and accommodations stipulated in the IEP are not merely suggestions but
legally commit the school to provide the stated modifications, accommodations, services,
and so forth.

IDEA Categories of Disabilities


The Individuals with Disabilities Education Act of 2004 (IDEA 2004) designates 13 dis-
ability categories. Teachers play important roles in identifying children with disabilities. In
many cases teachers may be the first to recognize that a student is having difficulties that
warrant referral for evaluation for special education services. They may also be involved in
different aspects of the evaluation process. The assessment of these disorders involves a wide
range of assessment activities including interviews with students, parents, and teachers; stan-
dardized tests; reviews of existing educational records; reviews of classroom work samples;
observations; and reviewing results of medical examinations. Here are brief descriptions of
the IDEA categories of disabilities.

Specific Learning Disabilities. IDEA defines a learning disability as a disorder that compromises the student's ability to understand or use spoken or written language and is manifested in difficulty in listening, thinking, speaking, reading, writing, spelling, or doing mathematical calculations. This category includes conditions such as dyslexia, developmental aphasia, and perceptual disabilities, but does not include learning problems that are primarily the result of visual, hearing, or motor deficits; mental retardation; emotional disturbance; or economic/environmental disadvantage. Students with learning disabilities account for approximately 50% of the students receiving special education services (Turnbull et al., 2002).
As we discussed in Chapters 12 and 13, most states base the diagnosis of learning dis-
abilities on the presence of a substantial discrepancy between ability (i.e., intelligence) and
achievement. For example, consider a student with a Full Scale IQ on the Wechsler Intelli-
gence Scale for Children—Fourth Edition of 100 and a Reading Comprehension score on the
Wechsler Individual Achievement Test, Second Edition of 70. Both of these tests have a mean
of 100 and a standard deviation of 15, so there is a discrepancy of 30 points or 2 standard de-
viations between ability and achievement (i.e., achievement is 2 SDs below ability). Different
states have different criteria as to what constitutes a substantial discrepancy. Some use 1 standard deviation as the criterion, some 1.5 standard deviations, and some 2 standard deviations
(Turnbull et al., 2002). Other states use a regression formula to establish the presence of a

severe discrepancy. The individual intelligence and achievement tests used in assessing learn-
ing disabilities are generally administered by assessment specialists with advanced graduate
training in administering and interpreting these tests. Although reliance on ability—achieve-
ment discrepancies to diagnose learning disabilities is the most widely accepted methodology,
it has become the focus of considerable debate in recent years, and some experts recommend
dropping this approach (e.g., Fletcher et al., 2002). As this text goes to print, it appears the next
revision of IDEA may drop the requirement of a discrepancy model (but continue to allow its use) in favor of an alternative approach that is still being refined.
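The arithmetic behind a simple discrepancy decision is straightforward, as the sketch below shows. The function names and the example criterion are ours for illustration; actual eligibility rules vary by state and may rely on regression-based formulas rather than a simple difference score.

# Illustrative sketch of a simple ability-achievement discrepancy check.
# Assumes both tests use standard scores with mean 100 and SD 15; the
# criterion (in SD units) is set by the state (e.g., 1, 1.5, or 2).
def discrepancy_in_sd_units(ability: float, achievement: float, sd: float = 15.0) -> float:
    """Return how far achievement falls below ability, in standard deviation units."""
    return (ability - achievement) / sd

def meets_discrepancy_criterion(ability: float, achievement: float,
                                criterion_sd: float = 2.0) -> bool:
    """True if the ability-achievement gap meets or exceeds the chosen criterion."""
    return discrepancy_in_sd_units(ability, achievement) >= criterion_sd

# The example from the text: a Full Scale IQ of 100 and a Reading Comprehension
# score of 70 yield a 30-point gap, or 2 SDs.
print(discrepancy_in_sd_units(100, 70))                         # 2.0
print(meets_discrepancy_criterion(100, 70))                     # True with a 2-SD criterion
print(meets_discrepancy_criterion(100, 80, criterion_sd=2.0))   # False (gap is only 1.33 SDs)

As the last line shows, the same 20-point gap that satisfies a 1-SD criterion in one state would fall short of a 2-SD criterion in another, which is one reason the discrepancy approach has been criticized.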

Speech or Language Impairment. Communication disorders are typically classified as


either speech or language disorders. Speech disorders involve problems in the production
of speech whereas language disorders involve problems receiving, understanding, and
formulating ideas and thoughts. Communication disorders constitute approximately 20% of
the students receiving special education services (Turnbull et al., 2002). Speech—language
pathologists using a variety of speech and language tests typically take a lead role in the
identification of students with communication disorders.

Mental Retardation. Mental retardation typically is identified when the student scores
more than 2 standard deviations below the mean on an individualized intelligence test and
presents significant deficits in two or more areas of adaptive functioning (e.g., communica-
tion, self-care, leisure). Additionally, these deficits must be manifested before the age of
18 years (APA, 1994). Students with mental retardation comprise approximately 11% of
the special education population (Turnbull et al., 2002). The assessment of students with
mental retardation involves the administration of individual intelligence and achievement
tests by assessment professionals and also adaptive behavior scales that parents or teachers
typically complete.

Emotional Disturbance. Students with emotional disorders comprise approximately 8%


of the students receiving special education services (Turnbull et al., 2002). For classification
purposes, emotional disturbance is defined as follows:

(i) The term means a condition exhibiting one or more of the following characteristics
over a long period of time and to a marked degree that adversely affects a student’s
educational performance:
(a) An inability to learn that cannot be explained by intellectual, sensory, or other
health factors.
(b) An inability to build or maintain satisfactory interpersonal relationships with peers
and teachers.
(c) Inappropriate types of behavior or feelings under normal circumstances.
(d) A general pervasive mood of unhappiness or depression.
(e) A tendency to develop physical symptoms or fears associated with personal or
school problems.
(ii) The term includes schizophrenia.

The term does not apply to children who are socially maladjusted, unless it is determined
that they have an emotional disturbance (34 C.F.R. Sec. 300.7(c)(4)).

School psychologists typically take a lead role in the identification and assessment of students
with an emotional disturbance and use many of the standardized measures of behavior and
personality discussed in Chapter 14. When there is concern that students have an emotional
disturbance, their teachers will often be interviewed by the school psychologist and asked to
complete behavior rating scales to better understand the nature and degree of any problems.

Other Health Impaired. IDEA covers a diverse assortment of health conditions under
the category of Other Health Impaired (OHI). The unifying factor is that all of these
conditions involve limitations in strength, vitality, or alertness. Approximately 3.5% of the
students receiving special education services have this classification (Turnbull et al., 2002).
The health conditions included in this broad category include, but are not limited to, asthma,
epilepsy, sickle cell anemia, and cancer. Attention deficit/hyperactivity disorder (ADHD)
is also typically classified in this category (but may be served under Section 504 as well).
ADHD is characterized by problems maintaining attention, impulsivity, and hyperactivity
(APA, 1994). As with the diagnosis of emotional disturbance, when there is concern that
students have ADHD, their teachers will often be asked to complete behavior rating scales
to acquire a better picture of their functioning in the school setting.

Multiple Disabilities. IDEA defines multiple disabilities as concurrent disabilities that


cause severe educational impairments. Examples include a student with mental retardation
and blindness or mental retardation and a severe orthopedic impairment. Students with
multiple disabilities comprise approximately 2% of the population of the students receiving
special education services.

Hearing Impairments. IDEA defines hearing impairments as hearing loss that is se-
vere enough to negatively impact a student’s academic performance. Students with hear-
ing impairments account for approximately 1% of the students receiving special education
services (Turnbull et al., 2002). Assessment of hearing impairments will involve an audi-
ologist, but school personnel will typically be involved to help determine the educational
ramifications of the impairment.

Orthopedic Impairments. IDEA defines orthopedic impairments as orthopedic-related


impairments that are the result of congenital anomalies, disease, or other causes. Students
with orthopedic impairments comprise approximately 1% of the students receiving special
education services. Examples of orthopedic impairments include spina bifida and cerebral
palsy. Many educators refer to orthopedic impairments as physical disabilities (Turnbull et
al., 2002). Assessment of orthopedic impairments will typically involve a number of medi-
cal specialists, with school personnel helping to determine the educational implications of
the impairment.

Autism. IDEA defines autism as a developmental disability that is evident before the
age of 3 and impacts verbal and nonverbal communication and social interaction. Students

with autism account for approximately 1% of the students receiving special education services (Turnbull et al., 2002). The assessment of autism typically involves a combination of intelligence, achievement, and speech and language tests to assess cognitive abilities as well as behavior rating scales to assess behavioral characteristics.
Visual Impairments Including Blindness. IDEA defines vi-
sual impairment as impaired vision that even after correction (e.g.,
glasses) negatively impacts a student’s academic performance. Students with visual impair-
ments constitute less than 1% of the students receiving special education services (Turn-
bull et al., 2002). Assessment of visual impairments will involve an ophthalmologist or an
optometrist, but school personnel will typically be involved to determine the educational
implications of the impairment.

Deaf-Blindness. This IDEA category is for students with coexisting visual and hearing
impairments that result in significant communication, developmental, and educational
needs. Assessment of students in this category typically relies on student observations,
information from parent and teacher behavior rating scales, and interviews of adults
familiar with the child (Salvia & Ysseldyke, 2007).

Traumatic Brain Injury. IDEA defines traumatic brain injury as an acquired brain
injury that is the result of external force and results in functional and psychosocial impair-
ments that negatively impact the student’s academic performance. Students with traumatic
brain injuries constitute less than 1% of the students receiving special education services
(Turnbull et al., 2002). The assessment of traumatic brain injuries typically involves a
combination of medical assessments (e.g., computerized axial tomography), neuropsycho-
logical tests (e.g., to assess a wide range of cognitive abilities such as memory, attention,
visual—spatial processing), and traditional psychological and educational tests (e.g., intel-
ligence and achievement). These assessments are often complemented with assessments of
behavior and personality.

Developmental Delay. Kubiszyn and Borich (2003) note that early versions of IDEA
required fairly rigid adherence to categorical eligibility procedures that identified and la-
beled students before special education services could be provided. While the intention
of these requirements was to provide appropriate oversight, it had the unintentional effect
of hampering efforts at prevention and early intervention. Before students qualified for
special education services, their problems had to be fairly severe and chronic. To address
this problem, IDEA 97 continued to recognize the traditional categories of disabilities (i.e.,
those listed above) and expanded eligibility to children with developmental delays. This
provision allows states to provide special education services to students between the ages
of 3 and 9 with delays in physical, cognitive, communication, social/emotional, and adap-
tive development. Additionally, IDEA gave the states considerable freedom in how they
define developmental delays, requiring only that the delays be identified using appropriate
assessment instruments and procedures. The goal of this more flexible approach to eligi-
bility is to encourage early identification and intervention. No longer do educators have to

wait until student problems escalate to crisis proportions; they can now provide services
early when the problems are more manageable and hopefully have a better prognosis.

Section 504

Section 504 of the Rehabilitation Act of 1973 is another law that had a significant impact
on the instruction and assessment of students with disabilities. Section 504 (often referred
to simply as 504) prohibits any discrimination against an individual with a disability in any
agency or program that receives federal funds. Because state and local education agencies
receive federal funds, Section 504 applies. Although IDEA requires that a student meet
specific eligibility requirements in order to receive special education services, Section 504
established a much broader standard of eligibility. Under Section 504, an individual with a
disability is defined as anyone with a physical or mental disability that substantially limits
one or more life activities. As a result, it is possible that a student may not qualify for special
education services under IDEA, but still qualify for assistance under Section 504 (this is
often referred to as “504 only”). Section 504 requires that public schools offer students with
disabilities reasonable accommodations to meet their specific educational needs. To meet
this mandate, schools develop “504 Plans” that specify the instructional and assessment
accommodations the student should receive. Parents, teachers, and other school personnel
typically develop these 504 Plans. Regular education teachers are involved in the develop-
ment of these plans and are responsible for ensuring that the modifications and accommoda-
tions are implemented in the classroom.

The Rationale for Assessment Accommodations

As we noted earlier, standard assessment procedures may not be appropriate for students
with a disability if the assessment requires the students to use some ability that is affected
by their disability but is irrelevant to the construct being measured. Assessment accommo-
dations are modifications to standard assessment procedures that are granted in an effort
to minimize the impact of student characteristics that are irrelevant to the construct being
measured. If this is accomplished the assessment will provide a more valid and accurate
measurement of the student’s true standing on the construct (AERA et al., 1999). The goal
is not simply to allow the student to obtain a higher score; the goal is to obtain more valid
score interpretations. Assessment accommodations should increase the validity of the score
interpretations so they more accurately reflect the student’s true standing on the construct
being measured.
Although some physical, cognitive, sensory, or motor deficits may be readily appar-
ent to teachers (e.g., vision impairment, hearing impairment, physical impairment), other
deficits that might undermine student performance are not as obvious. For example, stu-
dents with learning disabilities might not outwardly show any deficits that would impair
performance on a test, but might in fact have significant cognitive processing deficits that
limit their ability to complete standard assessments. In some situations the student may have
readily observable deficits, but have associated characteristics that also need to be taken into

consideration. For example, a student with a physical disability (e.g., partial paralysis) may
be easily fatigued when engaging in standard school activities. Because some tests require
fairly lengthy testing sessions, the student’s susceptibility to fatigue,
not only the more obvious physical limitations, needs to be taken into consideration when planning assessment accommodations (AERA et al., 1999).
Fairness to all parties is a central issue when considering assess-
ment accommodations. For students with disabilities, fairness requires that they not be penal-
ized as the result of disability-related characteristics that are irrelevant to the construct being
measured by the assessment. For students without disabilities, fairness requires that those
receiving accommodations not be given an unjust advantage over those being tested under
standard conditions. As you can see, these serious issues deserve careful consideration.

When Are Accommodations


Not Appropriate or Necessary?

The Standards (AERA et al., 1999) specify the following three situations in which accom-
modations should not be provided or are not necessary.

Accommodations Are Not Appropriate if the Affected Ability Is Directly Relevant


to the Construct Being Measured. For example, it would not be appropriate to give a
student with a visual impairment a magnification device if the test were designed to mea-
sure visual acuity. Similarly, it would not be appropriate to give a student with a reading
disability the use of a “reader” on a test designed to measure reading ability. Even if the
test is designed as a measure of reading comprehension (as opposed to decoding or reading
fluency), having someone else read the material turns the test into one of listening compre-
hension, not reading comprehension (Fuchs, 2002). In other words, if the test accommoda-
tion changes the construct being measured, the accommodation is inappropriate. Again, the
essential question is “Does the assessment require the use of some ability that is affected by
the disability but is irrelevant to the construct being measured?”

Accommodations Are Not Appropriate for an Assessment if the Purpose of the Test Is
to Assess the Presence and Degree of the Disability. For example, it would not be ap-
propriate to give a student with attention deficit/hyperactivity disorder (ADHD) extra time
on a test designed to diagnose the presence of attention problems. As we indicated earlier, it
would not be appropriate to modify a test of visual acuity for a student
with impaired vision.

Accommodations Are Not Necessary for All Students with Disabilities. Not all students with disabilities need accommodations. Even when students with a disability require accommodations on one test, this does not necessarily mean that they will need accommoda-
tions on all tests. As we will discuss in more detail later, assessment accommodations should
be individualized to meet the specific needs of each student with a disability. There is no spe-
cific accommodation that is appropriate, necessary, or adequate for all students with a given

disability. As an example, consider students with learning disabilities. Learning disabilities


are a heterogeneous group of disabilities that can impact an individual in a multitude of ways.
One student with a learning disability may require extended time whereas this accommodation
may not be necessary for another student with the same diagnosis.

Strategies for Accommodations


A variety of assessment accommodations have been proposed and implemented to meet
the needs of individuals with disabilities. A brief description follows of some of the most
widely used accommodations compiled from a number of sources (AERA et al., 1999;
King, Baker, & Jarrow, 1995; Mastergeorge & Miyoshi, 1999; Northeast Technical Assis-
tance Center, 1999; U.S. Department of Education, 1997). To facilitate our presentation,
we divided these accommodations into major categories. However, these categories are
not mutually exclusive and some accommodations may be accurately classified into more
than one category.

Modifications of Presentation Format


Modifications of presentation format involve modifying or changing the medium or format used to present the directions, items, or tasks to the student. An example would be the use of Braille or large-print editions for students with visual handicaps (which can be supplemented with large-print or Braille figures). Closed circuit television (CCTV) is an adaptive device that enlarges text and other materials and magnifies them onto a screen (see www.visionaid.com/cctvpage/cctvdeal.htm). For computer-administered tests, ZoomText Mag-
nifier and ScreenReader allows students to enlarge the image on a computer screen and
has a screen reader that reads the text on the screen. In some cases the use of oversized
monitors may be appropriate. Reader services, which involve listening to the test being
read aloud, may also be employed. Here the reader can read directions and questions and
describe diagrams, graphs, and other visual material. For students with hearing impair-
ments, verbal material may be presented through the use of sign communication or in
writing. Other common modifications to the presentation format include increasing the
spacing between items; reducing the number of items per page; using raised line draw-
ings; using language-simplified directions and questions; changing from a written to an
oral format (or vice versa); defining words; providing additional examples; and helping
students understand directions, questions, and tasks. Table 15.1 provides a listing of these
and related accommodations.

TABLE 15.1 Accommodations Involving Modifications of Presentation Format

Braille format
Large-print editions
Large-print figure supplements
Braille figure supplement
CCTV to magnify text and materials
For computer-administered tests, devices such as ZoomText Magnifier and ScreenReader to
magnify material on the screen or read text on the screen
Reader services (read directions and questions, describe visual material)
Sign language
Audiotaped administration
Videotaped administration
Alternative background and foreground colors
Increasing the spacing between items
Reducing the number of items per page
Using raised line drawings
Using language-simplified questions
Converting written exams to oral exams; oral exams to written format
Defining words
Providing additional examples
Clarifying and helping students understand directions, questions, and tasks
Highlighting key words or phrases
Providing cues (e.g., bullets, stop signs) on the test booklet
Rephrasing or restating directions and questions
Simplifying or clarifying language
Using templates to limit the amount of print visible at one time

Modifications of Response Format

Modifications of response format allow students to respond with their preferred method of communication. For example, if students are unable to write due to a physical impairment, you can allow them to take the exam orally or provide access to a scribe to write down their responses.
A student whose preferred method of communication is sign language could respond in
sign language and responses could subsequently be translated for grading. Other common
modifications to the response format include allowing the student to point to the correct
response; having an aide mark the answers; using a tape recorder to record responses;
using a computer or Braillewriter to record responses; using voice-activated computer
software; providing increased spacing between lines on the answer sheet; using graph
paper for math problems; and allowing the student to mark responses in the test booklet
rather than on a computer answer sheet. Table 15.2 provides a summary listing of these
and related accommodations.

TABLE 15.2 Accommodations Involving Modifications of Response Format

Oral examinations
Scribe services (student dictates response to scribe, who creates written response)
Allowing a student to respond in sign language
Allowing a student to point to the correct response
Having an aide mark the answers
Using a tape recorder to record responses
Using a computer with read-back capability to record responses
Using a Braillewriter to record responses
Using voice-activated computer software
Providing increased spacing between lines on the answer sheet
Using graph paper for math problems
Allowing students to mark responses in the test booklet rather than on a computer answer
sheet (e.g., Scantron forms)
Using a ruler for visual tracking

Modifications of Timing

Extended time is probably the most frequent accommodation provided. Extended time is appropriate for any student who may be slowed down due to reduced processing speed, reading speed, or writing speed. Modifications of timing are also appropriate for students who use other accommodations such as the use of a scribe or some form of adaptive equipment, because these often require more time. Determining how much time to allow is a
complex consideration. Research suggests that 50% additional time is adequate for most
students with disabilities (Northeast Technical Assistance Center, 1999). Although this is
probably a good rule of thumb, be sensitive to special conditions that might demand extra
time. Nevertheless, most assessment professionals do not recommend “unlimited time” as
an accommodation. It is not necessary, can complicate the scheduling of assessments, and
can be seen as unreasonable and undermine the credibility of the accommodation process
in the eyes of some educators. Other time-related modifications include providing more
frequent breaks or administering the test in sections, possibly spread over several days.
For some students it may be beneficial to change the time of day the test is administered to
accommodate their medication schedule or fluctuations in their energy levels. Table 15.3
provides a summary listing of these and related accommodations.


TABLE 15.3 Accommodations Involving Modifications of Timing

Extended time
More frequent breaks
Administering the test in sections
Spreading the testing over several days
Changing the time of day the test is administered

TABLE 15.4 Accommodations Involving Modifications of Setting

Individual test administration


Administration in a small group setting
Preferential seating
Space or accessibility considerations
Avoidance of extraneous noise/distractions
Special lighting
Special acoustics
Study carrel to minimize distractions
Alternate sitting and standing

Modifications of Setting

Modifications of setting allow students to be tested in a setting that will enable them to perform at their best. For example, for students who are highly distractible this may include administering the test individually or in a small group setting. For other students preferential seating in the regular classroom may be sufficient. Some students will have special needs based on space or accessibility requirements (e.g., a room that is wheelchair accessible). Some students may need special accommodations such as a room free from extraneous noise/distractions, special lighting, special acoustics, or the use of a study carrel to minimize distractions. Table 15.4 provides a summary listing of these and related accommodations.

Adaptive Devices and Supports


There are many adaptive devices and supports that may be useful when testing students with disabilities. These can range from sophisticated high-technology solutions to fairly simple low-technology supports. For individuals with visual impairments, a number of companies produce products ranging from handheld magnification devices to systems that automatically enlarge the size of print viewed on a computer screen (e.g., ZoomText Magnifier and ScreenReader, ClearView, Optelec, and Visualtek). There are voice recognition computer programs that allow students to dictate their responses and print out a document containing their text (e.g., Dragon Dictate). Also available are a number of adaptive keyboards and trackball devices (e.g., Intellikeys keyboard, Kensington Trackball mouse, HeadMaster Plus mouse). Auditory amplification devices as well as audiotape and videotape players and recorders may be appropriate accommodations. On the low-tech side, students may benefit from special chairs and large surface desks, earplugs/earphones, colored templates, markers to maintain place, securing the paper to the desk with tape, and dark, heavy, or raised lines or pencil grips. It may be appropriate to provide an abacus, math tables, or calculators to facilitate math calculations. In some situations it may also be appropriate to provide reference materials such as a dictionary or thesaurus. In many situations the use of aids such as calculators, spell checkers, and reference materials has become so common that they are being made available to students without disabilities. Table 15.5 provides a summary listing of these and related accommodations.

TABLE 15.5 Accommodations Involving Adaptive Devices and Supports

Handheld magnification devices and CCTV
Systems that enlarge print (e.g., ZoomText Magnifier and ScreenReader, ClearView, Optelec, and Visualtek)
Systems that read text on the screen (ZoomText Magnifier and ScreenReader)
Voice recognition computer programs that allow students to dictate their responses and print out a document containing their text (e.g., Dragon Dictate)
Adaptive keyboards and trackball devices (e.g., HeadMaster Plus mouse, Intellikeys keyboard, Kensington Trackball mouse)
Auditory amplification devices
Audiotape and videotape players and recorders
Special chairs and large surface desks
Earplugs/earphones
Colored templates or transparencies
Markers to maintain place, highlighters
Securing paper to desk with tape
Dark, heavy, or raised lines or pencil grips
Abacus, math tables, or calculators (or talking calculators)
Reference materials such as a dictionary or thesaurus, spell checkers
Watches or clocks with reminder alarms

Using Only a Portion of a Test


In some situations it may be appropriate to use only a portion of a test for a student with a
disability. In clinical settings clinicians might delete certain subtests of a test battery that
are deemed inappropriate for an individual with a disability. For example, when testing a
student with a severe visual impairment, a psychologist administering the WISC-IV (see
Chapter 13) might delete subtests that require vision (e.g., Block Design, Matrix Reasoning,
Picture Concepts) and use only subtests presented and responded to orally (e.g., Vocabulary,
Information, Similarities). The same principle can be applied to classroom assessments.
That is, a teacher might decide to delete certain items that are deemed inappropriate for
certain students with disabilities. Along the same lines, in some situations items will be
deleted simply to reduce the length of the test (e.g., to accommodate a student who is eas-
ily fatigued). These may be acceptable accommodations in some situations, but it is also
possible that using only portions of an assessment will significantly alter the nature of the
construct being measured (AERA et al., 1999). As a result, teachers should use this ap-
proach with considerable caution.

Using Alternate Assessments


A final category of accommodations involves replacing the standard test with one that has
been specifically developed for students with a disability (AERA et al., 1999). Using al-
ternate assessments is often appropriate for students with severe disabilities that prevent
them from participating in the standard assessments, even with the use of more common

accommodations (U.S. Department of Education, 1997). The use of alternative assessments
may be an appealing accommodation because, with careful planning and development, they
can produce reliable and valid results. The major limitation with this approach is that it may
be difficult to find satisfactory alternate assessments that measure the same construct as the
standard assessment (AERA et al., 1999).

Determining What Accommodations to Provide

Determining whether a student needs assessment accommodations and which accommodations are appropriate is not an easy decision. In terms of making this decision, the Standards (AERA et al., 1999) state that “the overarching concern is the validity of the inference made from the score on the modified test: fairness to all parties is best served by a decision about test modification that results in the most accurate measure of the construct of interest” (p. 102). The Standards go on to emphasize the importance of professional judgment in making this decision. There is relatively little research on assessment accommodations, and what is available has often produced contradictory findings (AERA et al., 1999; Fuchs, 2002). As a result, there are few universally accepted guidelines about determining what assessment accommodations should be provided.
For example, Fuchs (2002) notes that the accommodations that some states recommend for
their statewide assessments are actually prohibited by other states. Nevertheless, here are a
few principles that experts working with students with disabilities generally accept.

Accommodations Should Be Tailored to Meet the Specific Needs of Individual Students. Do not try to apply a “one-size-fits-all” set of accommodations to students
with disabilities, even when they have the same disability. Not all students with any spe-
cific type of disability need the same set of accommodations. For example, students with
learning disabilities are a heterogeneous group and vary in terms of the nature and severity
of their disability. As a result, it would be inappropriate to provide the same set of assess-
ment accommodations to all students with learning disabilities. The Standards (AERA et
al., 1999) give the example of providing a test in Braille format to all students with visual
impairments. This might be an appropriate accommodation for some, but for others it might
be more appropriate to provide large-print testing materials whereas for others it might be
preferable to provide a reader or an audiotape with the questions. Look at students individu-
ally and determine their specific needs, which should serve as the basis for decisions about
assessment accommodations. Because teachers work with students on a day-to-day basis,
they are often the best qualified to help determine what types of assessment accommoda-
tions are indicated.

Accommodations That Students Routinely Receive in Their Classroom Instruction
Are Generally Appropriate for Assessments. If an accommodation is seen as being ap-
propriate and necessary for promoting learning during classroom instruction, it is likely that
the same accommodation will be appropriate and necessary for assessments. This applies

to both classroom assessments and state and district assessment programs. For example, if
a student with a visual handicap receives large-print instructional materials in class (e.g.,
large-print textbook, handouts, and other class materials), it would be logical to provide
large-print versions of classroom assessments as well as large-print standardized assess-
ments. A reasonable set of questions to ask is (1) What types of instructional accommoda-
tions are being provided in the classroom? (2) Are these same accommodations appropriate
and necessary to allow the students to demonstrate their knowledge and skills on assess-
ments? (3) Are any additional assessment accommodations indicated? (Mastergeorge &
Miyoshi, 1999).

To the Extent Possible, Select Accommodations That Promote Independent Func-
tioning. Although you want to provide assessment accommodations that minimize the
impact of irrelevant student characteristics, it is also good educational practice to promote
the independent functioning of students (King et al., 1995). For example, if a student with
a visual handicap can read large-print text, this accommodation would likely be preferable
to providing a reader. Similarly, you might want to provide tape-recorded directions/items
versus a reader or a word processor with a read-back function versus a scribe. You want to
provide the accommodations needed to produce valid and reliable results, but this can often
be accomplished while also promoting student independence.

Periodically Reevaluate the Needs of the Student. Over time, the needs of a student
are likely to change. In some cases, students will mature and develop new skills and abili-
ties. In other situations there may be a loss of some abilities due to a progressive disorder.
As a result, it is necessary to periodically reexamine the needs of the student and determine
whether the existing accommodations are still necessary and if any new modifications need
to be added.

Typically the determination of assessment accommodations is the responsibility of
the IEP committee, and they are specified in the IEP (or with students who are 504 only,
in the 504 Plan). Teachers, as members of these committees, will have a key role in deter-
mining the accommodations necessary for the student. Again, we emphasize that when
determining which accommodations to provide you always want to ensure that the reliability
and validity of the assessment results are maintained. As noted, we do not always have well-
developed research-based information to help us make these decisions, and often we have
to base them on professional judgment. In other words, you will have to carefully examine
the needs of the student and the intended use of the assessment and make a decision about
which accommodations are needed and appropriate. As an example, reading the test items
to a student would clearly invalidate an assessment of reading comprehension. In contrast,
administering the assessment in a quiet setting would not undermine the validity of the
results (U.S. Department of Education, 1997). Table 15.6 provides a summary of factors
to consider when selecting assessment accommodations for students. Table 15.7 provides
information on how to locate information on the accommodation policies of major test
publishers. Special Interest Topic 15.1 illustrates the assessment accommodations allowed
on one statewide assessment.

TABLE 15.6 Determining Which Accommodations to Provide

Tailor the modifications to meet the specific needs of the individual student (i.e., no one-size-fits-all accommodations).
If a student routinely receives an accommodation in classroom instruction, that accommodation is usually appropriate for assessments.
When possible, select accommodations that will promote independent functioning.
Periodically reevaluate the needs of the students (e.g., Do they still need the accommodation? Do they need additional accommodations?).

TABLE 15.7 Assessment Accommodations and Major Test Publishers

Major test publishers typically provide accommodation guidelines for the assessments they
publish. The easiest way to access up-to-date information on these accommodation policies is
by accessing the publishers’ Web sites. These accommodation policies include both the types of
accommodations allowed and the process examinees must go through in order to qualify for and
request accommodations. Below are some Web sites where you can find accommodation policies
for some major test publishers.

Educational Testing Service (ETS): www.ets.org
College Board: www.collegeboard.com/ssd/student
American College Testing Program (ACT): www.act.org/aap/disab/index.htm
CTB McGraw-Hill: http://ctb.com
Harcourt Assessment: http://harcourtassessment.com
Riverside Publishing: www.riverpub.com

Assessment of English Language Learners (ELLs)

The Standards (AERA et al., 1999) note that “any test that employs language is, in part, a
measure of language skills. This is of particular concern for test takers whose first language
is not the language of the test” (p. 91). Accordingly, both IDEA and NCLB require that when
assessing students with limited English proficiency, educators must ensure that they are
actually assessing the students’ knowledge and skills and not their proficiency in English.
For example, if a bilingual student with limited English proficiency is unable to correctly
answer a mathematics word problem presented in English, one must question whether the
student’s failure reflects inadequate mathematical reasoning and computation skills or in-
sufficient proficiency in English. If the goal is to assess the student’s English proficiency, it
is appropriate to test an ELL student in English. However, if the goal is to assess achieve-
ment in an area other than English, you need to carefully consider the type of assessment
or set of accommodations needed to ensure a valid assessment. This often requires testing
students in their primary language.
There are a number of factors that need to be considered when assessing ELL stu-
dents. First, when working with students with diverse linguistic backgrounds it is important
for educators to carefully assess the student’s level of acculturation, language dominance,
and language proficiency before initiating the formal assessment (Jacob & Hartshorne,

SPECIAL INTEREST TOPIC 15.1


Allowable Accommodations in
a Statewide Assessment Program

The Texas Student Assessment Program includes a number of assessments, the most widely admin-
istered being the Texas Assessment of Knowledge and Skills (TAKS). The manual (Texas Education
Agency, 2003) notes that accommodations that do not compromise the validity of the test results may
be provided. Decisions about what accommodations to provide should be based on the individual
needs of the student and take into consideration whether the student regularly receives the accommo-
dation in the classroom. For students receiving special education services, the requested accommoda-
tions must be noted on their IEP. The manual identifies the following as allowable accommodations:

Signing or translating oral instructions
Signing the prompt on the writing test
Oral administration of selected tests (e.g., math, social studies, and science)
The use of colored transparencies or place markers
Small group or individual administration
Braille or large-print tests
Modified methods of response:
    Respond orally
    Mark responses in test booklet (versus machine-scorable response form)
    Type responses
    Tape-recording of an essay that is then played back to a scribe; while spelling, capitalizing, and punctuating it, the student is allowed to read the essay and indicate any desired corrections
Reference materials (English dictionaries are allowed during certain tests)
Calculators (allowed during certain tests)

Naturally, the testing program allows individuals to request accommodations that are not included
in this list, and these will be evaluated on a one-by-one basis. However, the manual identifies the
following as nonallowable accommodations:

Reading assistance on the writing, reading, and language arts tests


Use of foreign-language reference materials
Use of calculators on certain tests
Translation of test items
No clarification or rephrasing of test questions, passages, prompts, or answer choices
Any other accommodation that would invalidate the results

In addition to the TAKS, the State Developed Alternative Assessment (SDAA) is designed
for students who are receiving instruction in the state-specified curriculum but for whom the IEP
committee has decided the TAKS is inappropriate. Whereas the TAKS is administered based on the
student’s assigned grade level, the SDAA is based on the student’s instructional level as specified
by the IEP committee. The goal of the SDAA is to provide accurate information about the student’s
annual growth in the areas of reading, writing, and math. In terms of allowable accommodations,
the manual (Texas Education Agency, 2003) simply specifies the following:
With the exception of the nonallowable accommodations listed below, accommodations documented
in the individual education plan (IEP) that are necessary to address the student’s instructional needs
based on his or her disability may be used for this assessment. Any accommodation made MUST be
documented in the student’s IEP and must not invalidate the tests. (p. 111)

The nonallowable accommodations include

No direct or indirect assistance that identifies or helps identify the correct answer
No clarification or rephrasing of test questions, passages, prompts, or answer choices
No reduction in the number of answer choices for an item
No allowance for reading and writing tests to be read aloud to the student, with the exception
of specific prompts

2007). For example, you need to determine the student’s dominant language (i.e., the pre-
ferred language) and proficiency in both dominant and nondominant languages. It is also
important to distinguish between conversational and cognitive/academic language skills.
For example, conversational skills may develop in about two years, but cognitive/academic
language skills may take five or more years to emerge (e.g., Cummings, 1984). The impli-
cation is that teachers should not rely on subjective impressions of an ELL student’s English proficiency based on observations of daily conversations, but should employ objective measures of written and spoken English proficiency. The Standards (AERA
et al., 1999) provide excellent guidance in language proficiency assessment.
A number of strategies exist for assessing students with limited English proficiency
when using standardized assessments; these include the following:

Locate tests with directions and materials in the student’s native language. There are
a number of commercial tests available in languages other than English. However, these
tests vary considerably in quality depending on how they were developed. For example, a
simple translation of a test from one language to another does not ensure test equivalence. In
this context, equivalence means it is possible to make comparable inferences based on test
performance (AERA et al., 1999). The question is: Does the translated test produce results
that are comparable to the original test in terms of validity and reliability?

It may be possible to use a nonverbal test. There are a number of nonverbal tests that
were designed to reduce the influence of cultural and language factors. However, one should
keep in mind that while these assessments reduce the influence of language and culture, they
do not eliminate them.
If it is not possible to locate a suitable translated test or a nonverbal test, a qualified
bilingual examiner may conduct the assessment, administering the tests in the student’s native
language. When a qualified bilingual examiner is not available, an interpreter may be used.
While this is a common practice, there are a number of inherent problems that may compro-
mise the validity of the test results (AERA et al., 1999). It is recommended that educators
considering this option consult the Standards (AERA et al., 1999) for additional information
on the use of interpreters in assessing individuals with limited English proficiency.

Salvia and Ysseldyke (2007) provide suggestions for assessing students with limited
English proficiency in terms of classroom achievement. First, they encourage teachers to
ensure that they assess what is actually taught in class, not related content that relies on
incidental learning. Students with different cultural and language backgrounds might not
have had the same opportunities for incidental learning as native English speakers. Second,
give ELL students extra time to process their responses. They note that for a variety of
reasons, students with limited English proficiency may require additional time to process
information and formulate a response. Finally, they suggest that teachers provide ELL stu-
dents with opportunities to demonstrate achievement in ways that do not rely exclusively
on language.

Reporting Results of Modified Assessments

When clinical and school psychologists modify an individually administered standardized
test to accommodate the needs of a student with a disability, they typically document this in
the psychological or educational assessment report. This is standard practice and is not the
focus of serious debate. In contrast, the way scores of students receiving accommodations on
large-scale standardized tests are reported has been the focus of considerable debate. Some
assessment organizations will use an asterisk or some other “flag” to denote a score resulting
from a nonstandard administration. The Standards (AERA et al., 1999) note that this practice
is promoted by some but seen as discriminatory by others. Proponents of the practice argue that without nonstandard administration flags the scores may be misleading to those interpreting assessment results. That is, they will assume no accommodations were made when they actually were. Opponents of the practice hold that it unfairly labels and stigmatizes students with disabilities and potentially puts them at a disadvantage. The Standards suggest that two principles apply: (1) Important information necessary to interpret scores accurately should be provided, and (2) extraneous information that is not necessary to interpret scores accurately should be withheld. Based on this guidance, if adequate evidence demonstrates that scores are comparable both with and without accommodations, flagging is not necessary. When there is insufficient evidence
regarding the comparability of test scores, flagging may be indicated. However, a simple flag
denoting the use of accommodations is rather imprecise, and when permissible by law it is
better to provide specific information about the accommodations that were provided.
Different agencies providing professional assessment services handle this issue differ-
ently. Educational Testing Service (ETS) indicates that when an approved assessment accom-
modation is thought to possibly affect the construct being measured, it includes a statement
indicating that the assessment was taken under nonstandard testing conditions. However, if
only minor accommodations are required, the administration can be considered standard and
the scores are not flagged. Minor accommodations include providing wheelchair access, the
use of a sign language interpreter, or large-print test material. Special Interest Topic 15.2
addresses “flagging” and other legal issues related to the assessment of students with
disabilities.

SPECIAL INTEREST TOPIC 15.2

Assessment of Students with Disabilities—Selected Legal Issues

[A]n otherwise qualified student who is unable to disclose the degree of learning he actually pos-
sesses because of the test format or environment would be the object of discrimination solely on the
basis of his handicap.
(Chief Justice Cummings, U.S. 7th Circuit Court of Appeals¹)

Section 504 imposes no requirement upon an educational institution to lower or to effect substantial
modifications of standards to accommodate a handicapped person.
(Justice Powell, U.S. Supreme Court²)

These quotes were selected by Phillips (1993) to illustrate the diversity in legal opinions
that have been rendered regarding the provision of assessment accommodations for students with
disabilities. Dr. Phillips’s extensive writings in this area (e.g., 1993, 1994, & 1996) provide some
guidelines regarding the assessment of students with disabilities. Some of these guidelines are most
directly applicable to high-stakes assessment programs, but they also have implications for other
educational assessments.

Notice
Students should be given adequate notice when they will be required to engage in a high-stakes testing
program (e.g., assessments required for graduation). Although this requirement applies to all students,
it is particularly important for students with disabilities to have adequate notice of any testing require-
ments because it may take them longer to prepare for the assessment. What constitutes adequate
notice? With regard to a test required for graduation from high school, one court found 1.5 years to
be inadequate (Brookhart v. Illinois State Board of Education). Another court agreed, finding that ap-
proximately 1 year was inadequate, but suggested that 3 years was adequate (Northport v. Ambach).

Curricular Validity of the Test


If a state is going to implement a high-stakes assessment, it must be able to show that students
have had adequate opportunities to acquire the knowledge and skills measured by the assessment
(Debra P. v. Turlington).³ This includes students with disabilities. One way to address this is to
include the learning objectives measured by the assessment in the student’s Individual Education
Program (IEP). One court ruled that parents and educators could decide not to include the skills
and knowledge assessed by a mandatory graduation test on a student’s IEP, but only if there were
adequate time for the parents to evaluate the consequences of their child receiving a certificate of
completion in lieu of a high school diploma (Brookhart).

Accommodations Must Be Individualized


Both IDEA and Section 504 require that educational programs and assessment accommodations
be tailored to meet the unique needs of the individual student. For example, it is not acceptable for
educators to decide that all students with a specific disability (e.g., learning disability) will receive the
same assessment accommodations. Rulings by the federal Office of Civil Rights (OCR) maintain that
decisions to provide specific assessment accommodations must be made on a case-by-case basis.

¹Brookhart v. Illinois State Board of Education, 697 F. 2d 179 (7th Cir. 1983).
²Southeastern Community College v. Davis, 442 U.S. 397 (1979).
³Debra P. v. Turlington, 474 F. Supp. 244 (M.D. FL 1979).

Invalid Accommodations
Courts have ruled that test administrators are not required to grant assessment accommodations that
“substantially modify” a test or that “pervert” the purpose of the test (Brookhart). In psychometric
terms, the accommodations should not invalidate the interpretation of test scores. Phillips (1994)
suggests the following questions should be asked when considering a given accommodation:

1. Will format changes or accommodations in testing conditions change the skills being
measured?
2. Will the scores of examinees tested under standard conditions have a different meaning than
scores for examinees tested with the requested accommodation?
3. Would nondisabled examinees benefit if allowed the same accommodation?
4. Does the disabled examinee have any capability for adapting to standard test administration
conditions?
5. Is the disability evidence or testing accommodation policy based on procedures with doubt-
ful validity and reliability? (p. 104)

If the answer to any of these questions is “yes,” Phillips suggests the accommodations are likely
not appropriate.

Flagging
“Flagging” refers to administrators adding notations on score reports, transcripts, or diplomas indicating
that assessment accommodations were provided (and in some cases what the accommodations were).
Proponents of flagging hold that it protects the users of assessment information from making inaccurate
interpretations of the results. Opponents of flagging hold that it unfairly labels and stigmatizes students
with disabilities, breaches their confidentiality, and potentially puts them at a disadvantage. If there is
substantial evidence that the accommodation does not detract from the validity of the interpretation of
scores, flagging is not necessary. However, flagging may be indicated when there is incomplete evi-
dence regarding the comparability of test scores. Phillips (1994) describes a process labeled “self-selec-
tion with informed disclosure.” Here administrators grant essentially any reasonable accommodation
that is requested, even if it might invalidate the assessment results. Then, to protect users of assessment
results, they add notations specifying what accommodations were provided. An essential element is
that the examinee requesting the accommodations must be adequately informed that the assessment
reports will contain information regarding any accommodations provided and the potential advantages
and disadvantages of taking the test with accommodations. However, even when administrators get
informed consent, disclosure of assessment accommodations may result in legal action.
Phillips (1993) notes that at times the goal of promoting valid and comparable test results
and the legal and political goal of protecting the individual rights of students with disabilities may
be at odds. She recommends that educators develop detailed policies and procedures regarding the
provision of assessment accommodations, decide each case on an individual basis, and provide
expeditious appeals when requested accommodations are denied. She notes:

To protect the rights of both the public and individuals in a testing program, it will be necessary to
balance the policy goal of maximum participation by the disabled against the need to provide valid
and interpretable student test scores. (p. 32)

Summary

In this chapter we focused on the use of assessment accommodations for students with
disabilities. We noted that standard assessment procedures might not be appropriate for
a student with a disability if the assessment requires the students to use an ability that is
affected by their disability but is irrelevant to the construct being measured. In these situ-
ations it may be necessary for teachers to modify the standard assessment procedures. We
gave the example of students with visual handicaps taking a written test of world history.
Although the students could not read the material in its standard format, if they could read
Braille an appropriate accommodation would be to convert the test to the Braille format.
Because the test is designed to measure knowledge of world history, not the ability to read
standard print, this would be an appropriate accommodation. The goal of assessment ac-
commodations is not simply to allow the student to obtain a better grade, but to provide
the most reliable and valid assessment of the construct of interest. To this end, assessment
accommodations should always increase the validity of the score interpretations so they
more accurately reflect the student’s true standing on the construct being measured.
We noted three situations in which assessment accommodations are not appropriate or
necessary (AERA et al., 1999). These are when (1) the affected ability is directly relevant to
the construct being measured, (2) the purpose of the assessment is to assess the presence and
degree of the disability, and (3) the student does not actually need the accommodation.
A number of federal laws mandate assessment accommodations for students with
disabilities. The Individuals with Disabilities Education Act (IDEA) and Section 504 of the
Rehabilitation Act of 1973 are the laws most often applied in the schools and we spent some
time discussing these. IDEA requires that public schools provide students with disabilities
a free appropriate public education (FAPE) and identifies a number of disability categories.
These include learning disabilities, communication disorders, mental retardation, emotional
disturbance, other health impaired, multiple disabilities, hearing impairments, orthopedic
impairments, autism, visual impairments, traumatic brain injury, and developmental delay.
A key factor in the provision of services to students with disabilities is the individualized
educational program (IEP). The IEP is a written document developed by a committee that
specifies a number of factors, including the students’ instructional arrangement, the special
services they will receive, and any assessment accommodations they will receive.
Section 504 of the Rehabilitation Act of 1973, often referred to as Section 504 or
simply as 504, prohibits discrimination against individuals with disabilities in any agency
or school that receives federal funds. In the public schools, Section 504 requires that schools
provide students with disabilities reasonable accommodations to meet their educational
needs. Section 504 provides a broad standard of eligibility, simply stating that an individual
with a disability is anyone with a physical or mental disability that limits one or more life
activities. Because Section 504 is broader than IDEA, it is possible for a student to qualify
under Section 504 and not qualify under IDEA. This is sometimes referred to as 504 only.
The following assessment accommodations have been developed to meet the needs
of students with disabilities:

Modifications of presentation format (e.g., use of Braille or large print to replace standard text)
Modifications of response format (e.g., allow a student to respond using sign language)
Modifications of timing (e.g., extended time)
Modifications of setting (e.g., preferential seating, study carrel to minimize distractions)
Adaptive devices and supports (e.g., magnification and amplification devices)
Using only a portion of a test (e.g., reducing test length)
Using alternate assessments (e.g., tests specifically developed for students with disabilities)

We noted that there is relatively little research on assessment accommodations and
what is available has produced inconsistent results. As a result, only a few principles about
providing assessment accommodations are widely accepted. These include:

Accommodations should be tailored to meet the specific needs of the individual student.
Accommodations that students routinely receive in their classroom instruction are generally appropriate for assessments.
To the extent possible, select accommodations that promote independent functioning.
Periodically reevaluate the needs of the student.

In addition to students with disabilities, it may be appropriate to make assessment
accommodations for English Language Learners (ELLs). Both IDEA and NCLB require
that when assessing students with limited English proficiency, educators must ensure that
they are actually assessing the students’ knowledge and skills and not their proficiency in
English. In terms of standardized assessments, typical accommodations include locating
tests with directions and materials in the student’s native language, substituting a nonverbal
test designed to reduce the influence of cultural and language factors, and using a bilingual
examiner or interpreter.
The final topic we address concerned reporting the results of modified assessments. In
the context of individual psychological and educational assessments, it is common for the
clinician to report any modifications to the standardized assessment procedures. However,
in the context of large-scale standardized assessments, there is considerable debate. Some
experts recommend the use of flags to denote a score resulting from a modified administra-
tion of a test. Proponents of this practice suggest that without the use of flags, individuals
interpreting the assessment results will assume that there was a standard administration and
interpret the scores accordingly. Opponents of the practice feel that it unfairly labels and
stigmatizes students with disabilities and may put them at a disadvantage.

KEY TERMS AND CONCEPTS


Adaptive devices and supports, p. 408
Alternate assessments, p. 409
Assessment accommodations, p. 396
Autism, p. 401
Developmental delays, p. 402
Emotional disturbance, p. 400
Hearing impairments, p. 401
IDEA categories of disabilities, p. 399
IEP committee, p. 398
Individuals with Disabilities Education Act (IDEA), p. 397
Language disorders, p. 400
Learning disability, p. 399
Mental retardation, p. 400
Modifications of presentation format, p. 405
Modifications of response format, p. 405
Modifications of setting, p. 407
Modifications of timing, p. 406
Multiple disabilities, p. 401
Nonstandard administration flags, p. 415
Orthopedic impairments, p. 401
Other Health Impaired (OHI), p. 401
Speech disorders, p. 400
Traumatic brain injury, p. 402
Visual impairment, p. 402

RECOMMENDED READINGS

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA. The Standards provide an excellent discussion of assessment accommodations.

Mastergeorge, A. M., & Miyoshi, J. N. (1999). Accommodations for students with disabilities: A teacher's guide (CSE Technical Report 508). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. This guide provides some useful information on assessment accommodations specifically aimed toward teachers.

Phillips, S. E. (1994). High-stakes testing accommodations: Validity versus disabled rights. Applied Measurement in Education, 7(2), 93–120. An excellent discussion of legal cases involving assessment accommodations for students with disabilities.

Thurlow, M., Hurley, C., Spicuzza, R., & El Sawaf, H. (1996). A review of the literature on testing accommodations for students with disabilities (Minnesota Report No. 9). Minneapolis: University of Minnesota, National Center on Educational Outcomes. Retrieved April 19, 2004, from http://education.umn.edu/NCEO/OnlinePubs/MnReport9.html.

Turnbull, R., Turnbull, A., Shank, M., Smith, S., & Leal, D. (2002). Exceptional lives: Special education in today's schools. Upper Saddle River, NJ: Merrill Prentice Hall. This excellent text provides valuable information regarding the education of students with disabilities.



The Problem of Bias in
Educational Assessment

Test bias: In God we trust; all others must have data.


—Reynolds (1983)

CHAPTER HIGHLIGHTS

What Do We Mean by Bias?
Past and Present Concerns: A Brief Look
The Controversy over Bias in Testing: Its Origin, What It Is, and What It Is Not
Cultural Bias and the Nature of Psychological Testing
Objections to the Use of Educational and Psychological Tests with Minority Students
The Problem of Definition in Test Bias Research: Differential Validity
Cultural Loading, Cultural Bias, and Culture-Free Tests
Inappropriate Indicators of Bias: Mean Differences and Equivalent Distributions
Bias in Test Content
Bias in Other Internal Features of Tests
Bias in Prediction and in Relation to Variables External to the Test

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to


1. Explain the cultural test bias hypothesis.
2. Describe alternative explanations for observed group differences in performance on aptitude and other standardized tests.
3. Describe the relationship between bias and reliability.
4. Describe the major objections regarding the use of standardized tests with minority students.
5. Describe what is meant by cultural loading, cultural bias, and culture-free tests.
6. Describe the mean difference definition of test bias and its current status.
7. Describe the results of research on the presence of bias in the content of educational and psychological tests.
8. Describe the results of research on the presence of bias in other internal features of educational and psychological tests.


9. Describe the results of research on bias in prediction and in relation to variables that are
external to the test.
10. Explain what is implied by homogeneity of regression and describe the conditions that may
result when it is not present.

Groups of people who can be defined on a qualitative basis such as gender or ethnicity
(and are thus formed using a nominal scale of measurement as was discussed in Chapter 2),
do not always show the same mean level of performance on various educational and psy-
chological tests. For example, on tests of spatial skill, requiring visualization and imagery,
men and boys tend to score higher than do women and girls. On tests that involve written
language and tests of simple psychomotor speed (such as the rapid copying of symbols
or digits), women and girls tend to score higher than men and boys (see Special Interest
Topic 16.1 for additional information). Ethnic group differences in test performance also
occur and are most controversial and polemic.
There is perhaps no more controversial finding in the field of psychology than the persistent 1 standard deviation (about 15 points) difference between the intelligence test performance of black and white students taken as a group. Much effort has been expended to determine why group differences occur (and there are many, many such group differences on various measures of specialized ability and achievement), but we do not know for certain why they exist. One major, carefully studied explanation is that the tests are biased in some way against certain groups. This is referred to as the cultural test bias hypothesis.
The cultural test bias hypothesis represents the contention that any gender, ethnic, racial, or other nominally determined group differences on mental tests are due to inherent, artifactual biases produced within the tests through flawed psychometric methodology. Group differences are believed then to stem from characteristics of the tests and to be totally unrelated to any actual differences in the psychological trait, skill, or ability in question. The resolution or evaluation of the validity of the cultural test bias hypothesis is one of the most crucial scientific questions facing psychology today.
Bias in mental tests has many implications for individuals including the misplacement
of students in educational programs; errors in assigning grades; unfair denial of admission
to college, graduate, and professional degree programs; and the inappropriate denial of em-
ployment. The scientific implications are even more substantive. There would be dramatic
implications for educational and psychological research and theory if the cultural test bias
hypothesis were correct: The principal research of the past 100 years in the psychology of
human differences would have to be dismissed as confounded and largely artifactual be-
cause much of the work is based on standard psychometric theory and testing technology.
This would in turn create major upheavals in applied psychology, because the foundations
of clinical, counseling, educational, industrial, and school psychology are all strongly tied
to the basic academic field of individual differences.
Teachers, be they in the elementary and secondary schools or colleges and universi-
ties, assign grades on the basis of tests or other more subjective evaluations of learning, and


SPECIAL INTEREST TOPIC 16.1


Sex Differences in Intelligence

Research has shown that although there are no significant sex differences in overall intelligence
scores, substantial differences exist with regard to specific cognitive abilities. Females typically score
higher on a number of verbal abilities whereas males perform better on visual-spatial and (starting
in middle childhood) mathematical skills. It is believed that sex hormone levels and social factors
both influence the development of these differences. As is typical of group differences in intellectual
abilities, the variability in performance within groups (i.e., males and females) is much larger than the
mean difference between groups (Neisser et al., 1996). Diane Halpern (1997) has written extensively
on gender differences in cognitive abilities. This table briefly summarizes some of her findings.

Selected Abilities on Which Women Obtain Higher Average Scores (Type of Ability: Examples)

Rapid access and use of verbal and other information in long-term memory: Verbal fluency, synonym generation, associative memory, spelling, anagrams
Specific knowledge areas: Literature and foreign languages
Production and comprehension of prose: Writing and reading comprehension
Fine motor tasks: Matching and coding tasks, pegboard, mirror tracing
School performance: Most subjects

Selected Abilities on Which Men Obtain Higher Average Scores (Type of Ability: Examples)

Transformations of visual working memory, moving objects, and aiming: Mental rotations, dynamic spatiotemporal tasks, accuracy in throwing
Specific knowledge areas: General knowledge, mathematics, science, and geography
Fluid reasoning: Proportional, mechanical, and scientific reasoning; SAT Math and GRE Quantitative

Note: This table was adapted from Halpern (1997), Appendix (p. 1102).

make decisions regarding promotion or perhaps even professional certification on much
the same criteria. Bias in this process that produces adverse impact because of someone’s
race, sex, or other unrelated factor is clearly serious and unacceptable. If professionally
designed tests subjected to lengthy developmental research and tryout periods and held up

to the most stringent of psychometric and statistical standards turn out to be culturally biased when used with native-born American ethnic minorities, what about the teacher-made test in the classroom and more subjective evaluation of work samples (e.g., performance assessments)? If well-constructed and properly standardized tests are biased, then classroom measures are almost certain to be at least as biased and probably more so. As the reliability of a test or evaluation procedure goes down, the likelihood of bias goes up, the two being inversely related. A large reliability coefficient does not eliminate the possibility of bias, but as reliability is lowered, the probability that bias will be present increases.
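To make the distinction between random and systematic error concrete, a brief sketch in classical test theory notation may help. The decomposition and symbols below are ours, added only as an illustration (the chapter does not present the point this way): X is an observed score, T the true score, and E the measurement error.

X = T + E, \qquad r_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}

Because the reliability coefficient r_{XX'} is a ratio of variances, it indexes how widely the errors scatter, not whether they average to zero for every group; bias, as defined later in this chapter, is a systematic (group-linked) distortion of scores that a variance ratio cannot detect. That is why a large reliability coefficient does not rule out bias, while a very unreliable measure, with its large error component, leaves construct-irrelevant influences of every kind more room to operate.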
The purpose of this chapter is to address the issues and findings surrounding the cul-
tural test bias hypothesis in a rational manner and evaluate the validity of the hypothesis,
as far as possible, on the basis of existing empirical research. This will not be an easy task
because of the controversial nature of the topic and strong emotional overtones. Prior to
turning to the reasons that test bias generates highly charged emotions and reviewing some
of the history of these issues, it is proper to engage in a discussion of just what we mean by
the term bias.

What Do We Mean by Bias?


Bias carries many different Bias carries many different connotations for the lay public and for
connotations for the lay public professionals in a number of disciplines. To the legal mind, bias
and for professionals in a denotes illegal discriminatory practices while to the lay mind it may
number of disciplines. conjure up notions of prejudicial attitudes. Much of the rancor in
psychology and education regarding proper definitions of test bias
In terms of assessment, bias is due to the divergent uses of this term in general but especially by
denotes systematic error that professionals in the same and related academic fields. Contrary to
occurs in the estimation of some certain other opinions that more common or lay uses of the term
value or score. bias should be employed when using bias in definitions or discus-
sions of bias in educational and psychological tests, bias as used in
the present chapter will be defined in its widely recognized, but distinct statistical sense.
As defined in the Standards (AERA et al., 1999), bias is “a systematic error in a test score”
(p. 172). Therefore, a biased assessment is one that systematically underestimates or over-
estimates the value of the variable it is designed to measure. If the bias is a function of a
nominal cultural variable (e.g., ethnicity or gender), then the test has a cultural bias. As an
example, if an achievement test produces different mean scores for different ethnic groups,
and there actually are true differences between the groups in terms of achievement, the test
is not biased. However, if the observed differences in achievement scores are the result of
the test underestimating the achievement of one group or overestimating the achievement
of another, then the test is culturally biased.
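To make the statistical sense of bias concrete, the brief sketch below (ours, not drawn from any study discussed in this chapter) simulates two groups with identical true achievement and shows how a constant systematic error for one group, unlike ordinary random measurement error, produces a spurious group difference. The group labels and the 5-point offset are hypothetical values chosen only for illustration.

# Illustrative sketch only: simulated data contrasting random measurement error
# (which is not bias) with systematic error for one group (which is bias).
# The group labels and the 5-point offset are invented for the example.
import numpy as np

rng = np.random.default_rng(0)
true_achievement = rng.normal(100, 15, size=2000)      # same true-score distribution for everyone
group = rng.choice(["A", "B"], size=2000)

random_error = rng.normal(0, 5, size=2000)             # affects everyone equally, so no bias
observed_unbiased = true_achievement + random_error

systematic_error = np.where(group == "B", -5.0, 0.0)   # test underestimates group B by 5 points
observed_biased = true_achievement + random_error + systematic_error

for label, scores in [("unbiased", observed_unbiased), ("biased", observed_biased)]:
    gap = scores[group == "A"].mean() - scores[group == "B"].mean()
    print(f"{label} test: mean A - mean B = {gap:.1f}")

# The "biased" test shows a group difference even though true achievement is identical
# across groups; the "unbiased" test shows approximately none.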
Other uses of the term bias in research on the cultural test bias hypothesis or cross-
group validity of tests are unacceptable from a scientific perspective for two reasons: (1) The
imprecise nature of other uses of bias makes empirical investigation and rational inquiry
exceedingly difficult, and (2) other uses of the term invoke specific moral value systems
that are the subject of intense polemic, emotional debate without a mechanism for rational
resolution. It is imperative that.the evaluation of bias in tests be undertaken from the stand-
point of scholarly inquiry and debate. Emotional appeals, legal—adversarial approaches, and
political remedies of scientific issues appear to us to be inherently unacceptable.

Past and Present Concerns: A Brief Look

Concern about cultural bias in mental testing has been a recurring issue since the beginning
of the use of assessment in education. From Alfred Binet in the 1800s to Arthur Jensen over
the last 50 years, many scientists have addressed this controversial problem, with varying,
inconsistent outcomes. In the last few decades, the issue of cultural bias has come forth
as a major contemporary problem far exceeding the bounds of purely academic debate
and professional rhetoric. The debate over the cultural test bias hypothesis has become
entangled and sometimes confused within the larger issues of individual liberties, civil
rights, and social justice, becoming a focal point for psychologists, sociologists, educators,
politicians, minority activists, and the lay public. The issues increasingly have become legal
and political. Numerous court cases have been brought and New York state even passed
“truth-in-testing” legislation that is being considered in other states and in the federal legis-
lature. Such legal and political attempts at resolution are difficult if not impossible. Take for example the legal
response to the question “Are intelligence tests used to diagnose mental retardation biased
against cultural and ethnic minorities?” In California in 1979 (Larry P. v. Riles) the answer
was “yes” but in Illinois in 1980 (PASE v. Hannon) the response was “no.” Thus two federal
district courts of equivalent standing have heard nearly identical cases, with many of the
same witnesses espousing much the same testimony, and reached precisely opposite con-
clusions. See Special Interest Topic 16.2 for more information on legal issues surrounding
assessment bias.
Though current opinion on the cultural test bias hypothesis is quite divergent, ranging
from those who consider it to be for the most part unresearchable (e.g., Schoenfeld, 1974) to
those who considered the issue settled decades ago (e.g., Jensen, 1980), it seems clear that
empirical analysis of the hypothesis should continue to be undertaken. However difficult
full objectivity may be in science, we must make every attempt to view all socially, politi-
cally, and emotionally charged issues from the perspective of rational scientific inquiry. We
must also be prepared to accept scientifically valid findings as real, whether we like them
or not.

The Controversy over Bias in Testing: Its Origin, What It Is, and What It Is Not

Systematic group differences on standardized intelligence and aptitude tests may occur as
a function of socioeconomic level, race or ethnic background, and other demographic vari-
ables. Black—white differences on IQ measures have received extensive investigation over
the past 50 or 60 years. Although results occasionally differ slightly depending on the age
groups under consideration, random samples of blacks and whites show a mean difference

SPECIAL INTEREST TOPIC 16.2


Courtroom Controversy over IQ Testing in the Public Schools

Largely due to overall mean differences in the performance of various ethnic groups on IQ tests,
the use of intelligence tests in the public schools has been the subject of courtroom battles around
the United States. Typically such lawsuits argue that the use of intelligence tests as part of the
determination of eligibility for special education programs leads to overidentification of certain
minorities (traditionally African American and Hispanic children). A necessary corollary to this
argument is that the resultant overidentification is inappropriate because the intelligence tests in
use are biased, underestimating the intelligence of minority students, and that there is in fact no
greater need for special education placement among these ethnic minorities than for other ethnic
groups in the population.
Attempts to resolve the controversy over IQ testing in the public schools via the courtroom
have not been particularly successful. Unfortunately, but not uncharacteristically, the answer to
the legal question “Are IQ tests biased in a manner that results in unlawful discrimination against
minorities when used as part of the process of determining eligibility for special education place-
ments?” depends on where you live!
There are four key court cases to consider when reviewing this question, two from California
and one each from Illinois and Georgia.
The first case is Diana v. State Board of Education (C-70-37 RFP, N.D. Cal., 1970), heard
by the same federal judge who would later hear the Larry P. case (see later). Diana was filed on
behalf of Hispanic (referred to as Chicano at that time and in court documents) children classified
as EMR, or educable mentally retarded (a now archaic term), based on IQ tests administered in
English. However, the children involved in the suit were not native English speakers and when
retested in their native language, all but one (of nine) scored above the range designated as EMR.
Diana was resolved through multiple consent decrees (agreements by the adverse parties ordered
into effect by the federal judge). Although quite detailed, the central component of interest here is
that the various decrees ensured that children would be tested in their native language, that more
than one measure would be used, and that adaptive behavior in nonschool settings would be as-
sessed prior to a diagnosis of EMR.
It seems obvious to us now that whenever persons are assessed in other than their native
language, the validity of the results as traditionally interpreted would not hold up, at least in the
case of ability testing. This had been obvious to the measurement community for quite some time
prior to Diana, but had not found its way into practice. Occasionally one still encounters cases of a
clinician evaluating children in other than their native language and making inferences about intel-
lectual development—clearly this is inappropriate.
Three cases involving intelligence testing of black children related to special education
placement went to trial: Larry P. v. Riles (343 F. Supp. 306, 1972; 495 F. Supp. 976, 1979); PASE v.
Hannon (506 F. Supp. 831, 1980); and Marshall v. Georgia (CV 482-233, S.D. of Georgia, 1984).
Each of these cases involved allegations of bias in IQ tests that caused the underestimation of the
intelligence of black children and subsequently led to disproportionate placement of black children
in special education programs. All three cases presented testimony by experts in education, testing,
measurement, and related fields, some professing the tests to be biased and others professing they
were not. That a disproportionate number of black children were in special education was conceded
in all cases—what was litigated was the reason.

In California in Larry P. v. Riles (Wilson Riles being the California State Superintendent of
Public Instruction), Judge Peckham ruled that IQ tests were in fact biased against black
children and resulted in discriminatory placement in special education. A reading of Peckham’s
decision reveals a clear condemnation of special education, which is critical to Peckham’s logic.
He determined that because special education placement was harmful, not helpful, to children, the
use of a test (i.e., IQ) that resulted in disproportionate placement was therefore discriminatory. He
prohibited (or enjoined) the use of IQ tests with black children in the California public schools.
In PASE v. Hannon (PASE being an abbreviation for Parents in Action on Special Education),
a similar case to Larry P. was brought against the Chicago public schools. Many of the same wit-
nesses testified about many of the same issues. At the conclusion of the case, Judge Grady ruled in
favor of the Chicago public schools, finding that although a few IQ test items might be biased, the
degree of bias in the items was inconsequential.
In Marshall v. Georgia, the NAACP brought suit against rural Georgia school districts alleg-
ing bias in the instructional grouping and special education placement associated with IQ testing.
Although some of the same individuals testified in this case, several new opinions were offered.
However, the judge in Marshall eventually ruled in favor of the schools, finding that IQ tests were not
in fact biased, and that a greater actual need for special education existed in minority populations.
In the courtroom, we are no closer to resolution of these issues today than we were in 1984
when Marshall was decided. However, these cases and other societal factors did foster much re-
search that has brought us closer to a scientific resolution of the issues. They also prompted the
development of new, up-to-date IQ tests and more frequent revisions or updating of older tests.
Many challenges remain, especially that of understanding the continued higher failure rates (rela-
tive to the majority ethnic population of the United States) of some ethnic minorities in the public
schools (while other ethnic minorities have a success rate that exceeds the majority population) and
the disproportionate referral rates by teachers of these children for special education placement.
The IQ test seems to be only one of many messengers in this crucial educational issue, and bias in
the tests does not appear to be the answer.

of about 1 standard deviation, with the mean score of the white groups consistently ex-
ceeding that of the black groups. When a number of demographic variables are taken into
account (most notably socioeconomic status, or SES), the size of the difference reduces
to 0.5 to 0.7 standard deviation but remains robust in its appearance. The differences have
persisted at relatively constant levels for quite some time and under a variety of methods of
investigation. Some recent research suggests that the gap may be narrowing, but this has not
been firmly established (Neisser et al., 1996).
Mean differences between ethnic groups are not limited to black-white comparisons.
Although not nearly as thoroughly researched as black—white differences, Hispanic—white
differences have also been documented, with Hispanic mean performance approximately
0.5 standard deviation below the mean of the white group. On the average, Native Ameri-
cans tend to perform lower on tests of verbal intelligence than whites. Both Hispanics and
Native Americans tend to perform better on visual—spatial tasks relative to verbal tasks. Not all studies of race/ethnic group differences on ability tests show higher levels of performance by whites. Asian American groups have been shown consistently to perform as
well as or better than white groups. Depending on the specific aspect of intelligence under
investigation, other race/ethnic groups show performance at or above the performance level
of white groups (for a readable review of this research, see Neisser et al., 1996).

It should always be kept in mind that the overlap among the distributions of intelli-
gence test scores for different ethnic groups is much greater than the size of the differences
between the various groups. Put another way, there is always more within-group variability
in performance on mental tests than between-group variability. Neisser et al. (1996) frame
it this way:

Group means have no direct implications for individuals. What matters for the next person
you meet (to the extent that test scores matter at all) is that person’s own particular score, not
the mean of some reference group to which he or she happens to belong. The commitment
to evaluate people on their own individual merit is central to a democratic society. It also
makes quantitative sense. The distributions of different groups inevitably overlap, with the
range of scores within any one group always wider than the mean differences between any
two groups. In the case of intelligence test scores, the variance attributable to individual dif-
ferences far exceeds the variance related to group membership. (p. 90)
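The quantitative point made here is easy to verify. The short sketch below is ours rather than part of the sources cited; it assumes two normal score distributions with equal standard deviations and expresses the mean difference d in standard deviation units, roughly matching the values discussed above.

# Illustrative sketch (not from the text): for two normal score distributions with equal
# SDs and a mean difference of d standard deviations, compare within-group and
# between-group variance and the overlap of the two distributions.
from scipy.stats import norm

def group_difference_summary(d):
    within_var = 1.0                      # variance inside each group (SD = 1)
    between_var = (d / 2.0) ** 2          # variance of the two group means around the grand mean
    prop_between = between_var / (within_var + between_var)
    overlap = 2 * norm.cdf(-d / 2.0)      # overlapping coefficient for two equal-variance normals
    return prop_between, overlap

for d in (1.0, 0.6):                      # roughly the values discussed above
    prop, ovl = group_difference_summary(d)
    print(f"d = {d:.1f} SD: {prop:.0%} of variance between groups, "
          f"{1 - prop:.0%} within groups, distributions overlap about {ovl:.0%}")

For a 1 standard deviation mean difference this works out to roughly 20% of the total score variance between groups, 80% within groups, and an overlap of about 62%, which is the quantitative sense in which within-group variability exceeds between-group variability.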

Explaining Mean Group Differences. Once mean group differences are identified, it is
natural to attempt to explain them. Reynolds (2000) notes that the most common explana-
tions for these differences have typically fallen into four categories:

a. The differences primarily have a genetic basis.
b. The differences have an environmental basis (e.g., SES, education, culture).
c. The differences are due to the interactive effect of genes and environment.
d. The tests are defective and systematically underestimate the knowledge and skills of
   minorities.

The final explanation (i.e., category d) is embodied in the cultural test bias hypothesis
introduced earlier in this chapter. Restated, the cultural test bias hypothesis represents the
contention that any gender, ethnic, racial, or other nominally determined group differences on
mental tests are due to inherent, artifactual biases produced within the tests through flawed
psychometric methodology. Group differences are believed then to stem from characteristics
of the tests and to be totally unrelated to any actual differences in the psychological trait, skill,
or ability in question. Because mental tests are based largely on middle-class values and knowl-
edge, their results are more valid for those groups and will be biased
against other groups to the extent that they deviate from those values and knowledge bases. Thus, ethnic and other group differences result from flawed psychometric methodology and not from actual differences in aptitude. As will be discussed, this hypothesis reduces to one of differential validity; the hypothesis of differential validity being that tests measure intelligence and other constructs more accurately and make more valid predictions for individuals from the groups on which the tests are mainly based than for those from other groups. The practical implications of such bias have been pointed out previously and are
the issues over which most of the court cases have been fought.
If the cultural test bias hypothesis is incorrect, then group differences are not at-
tributable to the tests and must be due to one of the other factors mentioned above. The
model emphasizing the interactive effect of genes and environment (category c, commonly
referred to as the environment × genetic interaction model) is dominant among contemporary
professionals who reject the argument that group differences are artifacts of test bias; how-
ever, there is much debate over the relative contributions of genetic and environmental fac-
tors (Reynolds, 2000; Suzuki & Valencia, 1997). In addition to the models noted, Williams
(1970) and Helms (1992) proposed another model with regard to black-white differences
on aptitude tests, raising the possibility of qualitatively different cognitive structures that
require different methods of measurement.

Test Bias and Etiology. The controversy over test bias is distinct from the question of etiology. Reynolds and Ramsay (2003) note that the need to research etiology is only relevant once it has been determined that mean score differences are real, not simply
artifacts of the assessment process. Unfortunately, measured differ-
ences themselves have often been inferred to indicate genetic differences and therefore the
genetically based intellectual inferiority of some groups. This inference is not defensible
from a scientific perspective.

Test Bias and Fairness. Bias and fairness are related but separate concepts. As noted by
Brown, Reynolds, and Whitaker (1999), fairness is a moral, philosophical, or legal issue on
which reasonable people can disagree. On the other hand bias is a statistical property of a
test. Therefore, bias is a property empirically estimated from test data whereas fairness is a
principle established through debate and opinion. Nevertheless, it is common to incorporate
information about bias when considering the fairness of an assessment process. For ex-
ample, a biased test would likely be considered unfair by essentially everyone. However, it
is clearly possible that an unbiased test might be considered unfair by at least some. Special
Interest Topic 16.3 summarizes the discussion of fairness in testing and test use from the
Standards (AERA et al., 1999).

Test Bias and Offensiveness. There is also a distinction between test bias and item offen-
siveness. Test developers often use a minority review panel to examine each item for content
that may be offensive or demeaning to one or more groups (e.g., see Reynolds & Kamphaus,
2003, for a practical example). This is a good procedure for identifying and eliminating of-
fensive items, but it does not ensure that the items are not biased. Research has consistently
found little evidence that one can identify, by personal inspection, which items are biased and
which are not (for reviews, see Camilli & Shepard, 1994; Reynolds, Lowe, & Saenz, 1999).

Test Bias and Inappropriate Test Administration and Use. The controversy over test
bias is also not about blatantly inappropriate administration and usage of mental tests. Ad-
ministration of a test in English to an individual for whom English is a poor second language
is inexcusable both ethically and legally, regardless of any bias in the tests themselves (un-
less of course, the purpose of the test is to assess English language skills). It is of obvious
importance that tests be administered by skilled and sensitive professionals who are aware
of the factors that may artificially lower an individual’s test scores. That should go without
saying, but some court cases involve just such abuses. Considering the use of tests to as-
sign pupils to special education classes or other programs, the question needs to be asked,

SPECIAL INTEREST TOPIC 16.3


Fairness and Bias—A Complex Relationship

The Standards (AERA et al., 1999) present four different ways that fairness is typically used in the
context of assessment.

1. Fairness as absence of bias: There is general consensus that for a test to be fair, it should not be
   biased. Bias is used here in the statistical sense: systematic error in the estimation of a value.
2. Fairness as equitable treatment: There is also consensus that all test takers should be treated
   in an equitable manner throughout the assessment process. This includes being given equal
   opportunities to demonstrate their abilities by being afforded equivalent opportunities to
   prepare for the test and standardized testing conditions. The reporting of test results should
   be accurate, informative, and treated in a confidential manner.
3. Fairness as opportunity to learn: This definition holds that test takers should all have an
   equal opportunity to learn the material when taking educational achievement tests.
4. Fairness as equal outcomes: Some hold that for a test to be fair it should produce equal
   performance across groups defined by race, ethnicity, gender, and so on (i.e., equal mean
   performance).

Many assessment professionals believe that (1) if a test is free from bias and (2) test takers re-
ceived equitable treatment in the assessment process, the conditions for fairness have been achieved.
The other two definitions receive less support. In reference to definition (3) requiring equal opportu-
nity to learn, there is general agreement that adequate opportunity to learn is appropriate in some cases
but irrelevant in others. However, disagreement exists in terms of the relevance of opportunity to learn
in specific situations. A number of problems arise with this definition of fairness that will likely pre-
vent it from receiving universal acceptance in the foreseeable future. The final definition (4) requiring
equal outcomes has little support among assessment professionals. The Standards note:

The position that fairness requires equality in overall passing rates for different groups has been
almost entirely repudiated in the professional testing literature . . . unequal outcomes at the group
level have no direct bearing on questions of test bias. (pp. 74-76)

In concluding the discussion of fairness, the Standards suggest:

It is unlikely that consensus in society at large or within the measurement community is imminent on
all matters of fairness in the use of tests. As noted earlier, fairness is defined in a variety of ways and
is not exclusively addressed in technical terms; it is subject to different definitions and interpretations
in different social and political circumstances. According to one view, the conscientious application
of an unbiased test in any given situation is fair, regardless of the consequences for individuals or
groups. Others would argue that fairness requires more than satisfying certain technical require-
ments. (p. 80)

“What would you use instead?” Teacher recommendations alone are less reliable and valid
than standardized test scores and are subject to many external influences. Whether special
education programs are of adequate quality to meet the needs of children is an important edu-
cational question, but it is distinct from the test bias question, and the two are sometimes confused.

Bias and Extraneous Factors. The controversy over the use of mental tests is complicated
further by the fact that resolution of the cultural test bias question in either direction will not
resolve the problem of the role of nonintellective factors that may influence the test scores
of individuals from any group, minority or majority. Regardless of any group differences, it
is individuals who are tested and whose scores may or may not be accurate. Similarly, it is
individuals who are assigned to classes and accepted or rejected for employment or college
admission. Most assessment professionals acknowledge that a number of emotional and
motivational factors may impact performance on intelligence tests. The extent to which these
factors influence individuals as opposed to group performance is difficult to determine.

Cultural Bias and the Nature of Psychological Testing

The question of cultural bias in testing arises from and is continuously fueled by the very
nature of psychological and educational processes and how we measure those processes.
Psychological processes are by definition internal to the organism and not subject to direct
observation and measurement but must instead be inferred from behavior. It is difficult
to determine one-to-one relationships between observable events in the environment, the
behavior of an organism, and hypothesized underlying mediational processes. Many clas-
sic controversies over theories of learning revolved around constructs such as expectancy,
habit, and inhibition. Disputes among different camps in learning were polemical and of
long duration. Indeed, there are still disputes as to the nature and number of processes
such as emotion and motivation. One of the major areas of disagreement has been over the
measurement of psychological processes. It should be expected that intelligence, as one of
the most complex psychological processes, would involve definitional and measurement
disputes that prove difficult to resolve.
There are few charges of bias relating to physical measures that are on absolute scales,
whether interval or ratio. Group differences in height, as an extreme example, are not attrib-
uted by anyone to any kind of cultural test bias. There is no question concerning the validity
of measures of height or weight of anyone in any culture. Nor is there any question about
one’s ability to make cross-cultural comparisons of these absolute measures.
The issue of cultural bias arises because of the procedures involved in psychologi-
cal testing. Psychological tests measure traits that are not directly observable, subject to
differences in definition, and measurable only on a relative scale. From this perspective,
the question of cultural bias in mental testing is a subset, obviously of major importance,
of the problem of uncertainty and possible bias in psychological testing generally. Bias
might exist not only in mental tests but in other types of psychological tests as well, includ-
ing personality, vocational, and psychopathological. Making the problem of bias in mental
testing even more complex, not all mental tests are of the same quality; some are certainly
psychometrically superior to others. There is a tendency for critics and defenders alike to
overgeneralize across tests, lumping virtually all tests together under the heading mental
tests or intelligence tests. Professional opinions of mental tests vary
considerably, and some of the most widely used tests are not well respected by psychometricians. Thus, unfortunately, the question of bias must eventually be answered on a virtually test-by-test basis.

Objections to the Use of Educational and Psychological Tests with Minority Students

In 1969, the Association of Black Psychologists (ABP) adopted the following official policy
on educational and psychological testing (Williams, Dotson, Dow, & Williams, 1980):

The Association of Black Psychologists fully supports those parents who have chosen to
defend their rights by refusing to allow their children and themselves to be subjected to
achievement, intelligence, aptitude and performance tests which have been and are being
used to (a) label black people as uneducable; (b) place black children in “special” classes
and schools; (c) perpetuate inferior education in blacks; (d) assign black children to lower
educational tracks than whites; (e) deny black students higher educational opportunities; and
(f) destroy positive intellectual growth and development of black people.

Since 1968 the ABP has sought a moratorium on the use of all psychological and
educational tests with students from disadvantaged backgrounds. The ABP carried its call
for a moratorium to other professional organizations in psychology and education. In direct
response to the ABP call, the American Psychological Association’s (APA) Board of Direc-
tors requested its Board of Scientific Affairs to appoint a group to study the use of psycho-
logical and educational tests with disadvantaged students. The committee report (Cleary,
Humphreys, Kendrick, & Wesman, 1975) was subsequently published in the official journal
of the APA, American Psychologist.
Subsequent to the ABP’s policy statement, other groups adopted similarly stated
policy statements on testing. These groups included the National Association for the Ad-
vancement of Colored People (NAACP), the National Education Association (NEA), the
National Association of Elementary School Principals (NAESP), the American Personnel
and Guidance Association (APGA), and others. The APGA called for the Association of
Measurement and Evaluation in Guidance (AMEG), a sister organization, to “develop and
disseminate a position paper stating the limitations of group intelligence tests particularly
and generally of standardized psychological, educational, and employment testing for low
socioeconomic and underprivileged and non-white individuals in educational, business,
and industrial environments.” It should be noted that the statements by these organizations
assumed that psychological and educational tests are biased, and that what is needed is that
the assumed bias be removed.
Many potentially legitimate objections to the use of educational and psychological
tests with minorities have been raised by black and other minority psychologists. Unfor-
tunately, these objections are frequently stated as facts on rational rather than empirical
grounds. The most frequently stated problems fall into one of the following categories
(Reynolds, 2000; Reynolds, Lowe, & Saenz, 1999; Reynolds & Ramsay, 2003).

Inappropriate Content
Black and other minority children have not been exposed to the material involved in the test
questions or other stimulus materials. The tests are geared primarily toward white middle-
class homes, vocabulary, knowledge, and values. As a result of inappropriate content, the
tests are unsuitable for use with minority children.

Inappropriate Standardization Samples


Ethnic minorities are underrepresented in standardization samples used in the collection of
normative reference data. As a result of the inappropriate standardization samples, the
tests are unsuitable for use with minority children.

Examiner and Language Bias


Because most psychologists are white and speak only standard English, they may intimidate
black and other ethnic minorities and so examiner and language bias result. They are also
unable accurately to communicate with minority children—to the point of being insensitive
to ethnic pronunciation of words on the test. Lower test scores for minorities, then, may re-
flect only this intimidation and difficulty in the communication process, not lower ability.

Inequitable Social Consequences


As a result of bias in educational and psychological tests, minority group members, already
at a disadvantage in the educational and vocational markets because of past discrimination,
are thought to be unable to learn and are disproportionately assigned to dead-end educa-
tional tracks. This represents inequitable social consequences. Labeling effects also fall
under this category.

Measurement of Different Constructs


Related to inappropriate test content mentioned earlier, this position asserts that the tests mea-
sure different constructs when used with children from other than the middle-class culture on
which the tests are largely based, and thus do not measure minority intelligence validly.

Differential Predictive Validity


Although tests may accurately predict a variety of outcomes for middle-class children, they
do not predict successfully any relevant behavior for minority group members. In other
words, test usage might result in valid predictions for one group, but invalid predictions in
another. This is referred to as differential predictive validity. Further, there are objections
to the use of the standard criteria against which tests are validated with minority cultural
groups. That is, scholastic or academic attainment levels in white middle-class schools are
themselves considered by a variety of black psychologists to be biased as criteria.
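Differential predictive validity is usually examined statistically by comparing regression lines (slopes, intercepts, and errors of prediction) across groups, an approach often associated in the measurement literature with Cleary's regression definition of test fairness. The sketch below is a hypothetical illustration with simulated data; the variable names ("test", "gpa") and the particular numbers are ours, not taken from any study cited here.

# Hypothetical sketch of how differential prediction is typically examined: fit the
# test-criterion regression separately in each group and compare slopes and intercepts.
# All data are simulated; 'gpa' and 'test' are illustrative variable names only.
import numpy as np

rng = np.random.default_rng(1)

def simulate_group(n, mean_test):
    test = rng.normal(mean_test, 10, n)
    gpa = 1.0 + 0.02 * test + rng.normal(0, 0.3, n)   # same true regression in both groups
    return test, gpa

groups = {"group_1": simulate_group(500, 100), "group_2": simulate_group(500, 92)}

for name, (test, gpa) in groups.items():
    slope, intercept = np.polyfit(test, gpa, 1)
    print(f"{name}: predicted criterion = {intercept:.2f} + {slope:.3f} * test score")

# If slopes and intercepts are essentially equal, a common regression line predicts the
# criterion comparably for both groups (no differential prediction), even though the
# groups differ in mean test score. Markedly different regression lines would signal
# differential predictive validity.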

Qualitatively Distinct Aptitude and Personality


Minority and majority groups possess aptitude and personality characteristics that are quali-
tatively different, and as a result test development should begin with different definitions
for different groups.

The early actions of the ABP were most instrumental in bringing forward these objec-
tions into greater public and professional awareness and subsequently prompted a considerable

SPECIAL INTEREST TOPIC 16.4

Stereotype Threat—An Emerging but Controversial Explanation of Group Differences on Various Tests of Mental Abilities

Steele and Aronson in 1995 posited a unique explanation for group differences on mental test
scores. They argued that such differences were created by a variable they deemed “Stereotype
Threat.” More recently, they defined stereotype threat as follows: “When a negative stereotype
about a group that one is part of becomes relevant, usually as an interpretation of one’s behavior or
an experience one is having, stereotype threat is the resulting sense that one can then be judged or
treated in terms of the stereotype or that one might do something that would inadvertently confirm
it” (Steele, Spencer, & Aronson, 2002, p. 389). While we find this explanation somewhat vague
and lacking specificity for research purposes, in experimental research regarding mental testing
outcomes, stereotype threat is most often operationalized as being given a test that is described as
diagnostic of one’s ability and/or being asked to report one’s race prior to testing. Therefore we see
two components to the threat—being told one’s ability is to be judged on a test of mental ability
and secondly, being asked to report one’s racial identification, or at least believing it to be relevant
in some way to the evaluation of examination results (although some argue either component is
sufficient to achieve the effect). Stereotype threat research then goes on to argue, as one example,
that if one takes a test of mental ability, but the examinee is told it is not for evaluating the test taker,
but to examine the test itself and no racial identifier is requested, then racial group differences in
performance on the test will disappear.
Many studies have now been reported that demonstrate this stereotype effect, but they
incorporate controversial statistical procedures that might confound the results by equating the
two groups (i.e., erasing the group differences) on the basis of variables irrelevant to the effect of
the stereotype threat. Sackett and his colleagues (Sackett et al., 2004) have discussed this meth-
odological problem in detail (noting additional violations of the assumptions that underlie such
analyses), and we find ourselves in essential agreement with their observations. Nomura et al.
(2007) stated it succinctly when they noted from their own findings: “Equalizing the performance
of racial groups in most Stereotype Threat Studies is not an effect of the manipulation of Stereo-
type Threat elicitors (task descriptions), but is a result of a statistical manipulation (covariance)”
(p. 7). Additionally, some research that has taken a thorough look at the issue using multiple
statistical approaches has argued that stereotype threat may have just the opposite effect at times
from what was originally proposed by Steele and Aronson (e.g., see Nomura et al., 2007). That is,
it may enhance the performance of the majority group as opposed to denigrating the performance
of the minority.
We are also bothered by the theoretical vagaries of the actual mechanism by which stereo-
type threat might operate as a practical matter. Steele and Aronson essentially argue that it is a
process of response inhibition; that is, when an individual encounters a circumstance, event, or
activity in which a stereotype of a group to which the person belongs becomes salient, anxiety or
concerns about being judged according to that stereotype arise and inhibit performance. Anxiety
is not named specifically as the culprit by many stereotype threat researchers, but it seems the
most likely moderator of the proclaimed effect. While the well-known inverted U-shaped anxiety—
performance curve seems real enough, can this phenomenon really account for group differences
in mental test scores? So far, we view the findings of racial equalization due to the neutralization
of the so-called stereotype effect as a statistical artifact, but the concept remains interesting, is not
yet fully understood, and we may indeed be proven wrong!

Some good readings on this issue for follow up include the following works:

Nomura, J. M., Stinnett, T., Castro, F., Atkins, M., Beason, S., Linden, S., Hogan, K., Newry,
B., & Wiechmann, K. (March, 2007). Effects of Stereotype Threat on Cognitive Per-
formance of African Americans. Paper presented to the annual meeting of the National
Association of School Psychologists, New York.
Sackett, P. R., Hardison, C. M., & Cullen, M. J. (2004). On interpreting stereotype threat as
accounting for African-American differences on cognitive tests. American Psycholo-
gist, 59(1), 7-13.
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance
of African Americans. Journal of Personality and Social Psychology, 69, 797-811.
Steele, C. M., Spencer, S. J., & Aronson, J. (2002). Contending with group image: The psy-
chology of stereotype and social identity threat. In M. Zanna (Ed.), Advances in ex-
perimental social psychology (Vol. 23, pp. 379-440). New York: Academic Press.

amount of research. When the objections were first raised, very little data existed to answer these charges. Contrary to the situation decades ago when the current controversy began, research now exists that examines many of these concerns. There is still relatively little research regarding labeling and the long-term social consequences of testing, and these areas should be investigated using diverse samples and numerous statistical techniques (Reynolds, Lowe, & Saenz, 1999).

The Problem of Definition in Test Bias Research: Differential Validity

Arriving at a consensual definition of test bias has produced considerable as yet unresolved
debate among many measurement professionals. Although the resulting debate has gen-
erated a number of models from which to examine bias, these models usually focus on
the decision-making system and not on the test itself. The concept of test bias per se then
resolves to a question of the validity of the proposed interpretation of performance on a test
and the estimation of that performance level, that is, the test score. Test bias refers to sys-
tematic error in the estimation of some “true” value for a group of individuals. As we noted
previously, differential validity is present when a test measures a construct differently for
one group than for another. As stated in the Standards (AERA et al., 1999), bias

is said to arise when deficiencies in a test itself or the manner in which it is used result in dif-
ferent meanings for scores earned by members of different identifiable subgroups. (p. 74)

As we discussed in previous chapters, evidence for the validity of test score interpre-
tations can come from sources both internal and external to the test. Bias in a test may be
found to exist in any or all of these categories of validity evidence. Prior to examining the
evidence on the cultural test bias hypothesis, the concept of culture-free testing and the def-
inition of mean differences in test scores as test bias merit attention.

Cultural Loading, Cultural Bias, and Culture-Free Tests

Cultural loading and cultural bias are not synonymous terms, though the concepts are
frequently confused even in the professional literature. A test or test item can be cultur-
ally loaded without being culturally biased. Cultural loading refers
to the degree of cultural specificity present in the test or individual items of the test. Certainly, the greater the cultural specificity of a test item, the greater the likelihood of the item being biased when used with individuals from other cultures. Virtually all tests in current use
are bound in some way by their cultural specificity. Culture loading
must be viewed on a continuum from general (defining the culture in a broad, liberal sense)
to specific (defining the culture in narrow, highly distinctive terms).
A number of attempts have been made to develop a culture-free (sometimes referred
to as culture fair) intelligence test. However, culture-free tests are generally inadequate
from a statistical or psychometric perspective (e.g., Anastasi & Urbina, 1997). It may be that
because intelligence is often defined in large part on the basis of behavior judged to be of
value to the survival and improvement of the culture and the individuals within that culture,
a truly culture-free test would be a poor predictor of intelligent behavior within the cultural
setting. Once a test has been developed within a culture (a culture loaded test) its generaliz-
ability to other cultures or subcultures within the dominant societal framework becomes a
matter for empirical investigation.

Inappropriate Indicators of Bias: Mean Differences and Equivalent Distributions

Differences in mean levels of performance on cognitive tasks between two groups have historically (and mistakenly) been interpreted as evidence of test bias by a number of writers (e.g., Alley & Foster, 1978; Chinn, 1979; Hilliard, 1979). Those who support mean differences as an
indication of test bias state correctly that there is no valid a priori scientific reason to be-
lieve that intellectual or other cognitive performance levels should differ across race. It is
the inference that tests demonstrating such differences are inherently biased that is faulty.
Just as there is no a priori basis for deciding that differences exist, there is no a priori basis
for deciding that differences do not exist. From the standpoint of the objective methods of
science, a priori or premature acceptance of either hypothesis (differences exist versus dif-
ferences do not exist) is untenable. As stated in the Standards (AERA et al., 1999):

Most testing professionals would probably agree that while group differences in testing
outcomes should in many cases trigger heightened scrutiny for possible sources of test bias,
outcome differences across groups do not in themselves indicate that a testing application is
biased or unfair. (p. 75)

Some adherents to the “mean differences as bias” position also require that the distri-
bution of test scores in each population or subgroup be identical prior to assuming that the
test is nonbiased, regardless of its validity. Portraying a test as biased
regardless of its purpose or the validity of its interpretations conveys an inadequate understanding of the psychometric construct and issues of bias. The mean difference definition of test bias is the most uniformly rejected of all definitions of test bias by psychometricians involved in investigating the problems of bias in assessment (Camilli & Shepard, 1994; Cleary et al., 1975; Cole & Moss, 1989; Hunter, Schmidt, & Rauschenberger, 1984; Reynolds, 1982, 1995, 2000).
Jensen (1980) discusses the mean differences as bias definition
in terms of the egalitarian fallacy. The egalitarian fallacy contends that all human populations
are in fact identical on all mental traits or abilities. Any differences with regard to any aspect
of the distribution of mental test scores indicate that something is wrong with the test itself. As
Jensen points out, such an assumption is scientifically unwarranted. There are simply
too many examples of specific abilities and even sensory capacities that have been shown to
unmistakably differ across human populations. The result of the egalitarian assumption then
is to remove the investigation of population differences in ability from the realm of scientific
inquiry, an unacceptable course of action (Reynolds, 1980).
The belief of many people in the mean differences as bias definition is quite likely
related to the nature—nurture controversy at some level. Certainly data reflecting racial
differences on various aptitude measures have been interpreted to indicate support for a
hypothesis of genetic differences in intelligence and implicating one group as superior to
another. Such interpretations understandably call for a strong emotional response and are
not defensible from a scientific perspective. Although IQ and other aptitude test score dif-
ferences undoubtedly occur, the differences do not indicate deficits or superiority by any
group, especially in relation to the personal worth of any individual member of a given
group or culture.

Bias in Test Content

Bias in the content of educational and psychological tests has been a popular topic of critics of testing. These criticisms typically take the form of reviewing the items, comparing them to the critics’ views of minority and majority cultural environments, and then singling out specific items as biased or unfair because

■ The items ask for information that minority or disadvantaged children have not had
  equal opportunity to learn.
■ The items require the child to use information in arriving at an answer that minority
  or disadvantaged children have not had equal opportunity to learn.
■ The scoring of the items is improper, unfairly penalizing the minority child, because
  the test author has a Caucasian middle-class orientation that is reflected in the scoring cri-
  terion. Thus minority children do not receive credit for answers that may be correct within
  their own cultures but do not conform to Anglocentric expectations.
■ The wording of the questions is unfamiliar to minority children and even though they
  may “know” the correct answer are unable to respond because they do not understand the
  question.

These problems with test items cause the items to be more difficult than they should
actually be when used to assess minority children. This, of course, results in lower test
scores for minority children, a well-documented finding. Are these criticisms of test items
accurate? Do problems such as these account for minority—majority group score differences
on mental tests? These are questions for empirical resolution rather than armchair specula-
tion, which is certainly abundant in the evaluation of test bias. Empirical evaluation first
requires a working definition. We will define a biased test item as follows:

An item is considered to be biased when it is demonstrated to be significantly more difficult for one group than another, relative to other items measuring the same ability or construct, when the overall level of performance on the test is held constant.

There are two concepts of special importance in this definition. First, the group of items
must be unidimensional; that is, they must all be measuring the same factor or dimension of
aptitude or personality. Second, the items identified as biased must be differentially more
difficult for one group than another. The definition allows for score differences between
groups of unequal standing on the dimension in question but requires that the difference
be reflected on all items in the test and in an equivalent fashion. A number of empirical
techniques are available to locate deviant test items under this definition. Many of these
techniques are based on item-response theory (IRT) and designed to detect differential item
functioning, or DIF. The relative merits of each method are the subject of substantial debate,
but in actual practice each method has led to similar general conclusions, though the specific
findings of each method often differ.
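As a concrete example of a DIF procedure, the sketch below outlines a Mantel-Haenszel style analysis, one widely used non-IRT method, offered here as an illustration rather than as the particular technique favored in this chapter. Examinees from the two groups are matched on total test score, and the odds of answering the studied item correctly are compared within each score stratum; the function name and the simplified matching on raw total score are our own choices.

# Minimal sketch of a Mantel-Haenszel style DIF check (illustrative only; real analyses
# typically use purified matching scores and significance tests of the statistic).
import numpy as np

def mantel_haenszel_odds_ratio(item_correct, group, total_score):
    """item_correct: boolean array (item answered correctly); group: array of 'ref'/'focal';
    total_score: integer total test score used to match examinees of comparable ability."""
    num, den = 0.0, 0.0
    for s in np.unique(total_score):
        stratum = total_score == s
        ref, focal = stratum & (group == "ref"), stratum & (group == "focal")
        a = np.sum(item_correct & ref)        # reference group, correct
        b = np.sum(~item_correct & ref)       # reference group, incorrect
        c = np.sum(item_correct & focal)      # focal group, correct
        d = np.sum(~item_correct & focal)     # focal group, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den if den > 0 else float("nan")

# A common odds ratio near 1.0 suggests the item functions similarly for both groups once
# overall performance is matched; values well above or below 1.0 flag possible DIF.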
With multiple-choice tests, another level of complexity can easily be added to the
examination of content bias. With a multiple-choice question, typically three or four dis-
tracters are given in addition to the correct response. Distracters may be examined for their
attractiveness (the relative frequency with which they are chosen) across groups. When
distracters are found to be disproportionately attractive for members
of any particular group, the item may be defined as biased.
Research that includes thousands of subjects and nearly 100 published studies consistently finds very little bias in tests at the level of the individual item. Although some biased items are nearly always found, they seldom account for more than 2% to 5% of the variance in performance and often, for every item favoring one group, there is an item favoring the other group.
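The distracter comparison described above can be carried out with a simple contingency-table analysis. The sketch below is purely illustrative; the option labels and choice counts are invented.

# Illustrative distracter analysis: compare how often each response option (the key plus
# three distracters) is chosen by two groups. The counts below are invented for the example.
from scipy.stats import chi2_contingency

#            option A  option B*  option C  option D   (* = keyed correct answer)
choices = [[  40,        300,       35,       25],     # group 1
           [  70,        250,       60,       20]]     # group 2

chi2, p, dof, expected = chi2_contingency(choices)
print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p:.4f}")

# A significant result indicates the options are differentially attractive across groups;
# inspection of the individual distracters would then be warranted before judging the item.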
Earlier in the study of item bias it was hoped that the empirical analysis of tests at the
item level would result in the identification of a category of items having similar content
as biased and that such items could then be avoided in future test development (Flaugher,
1978). Very little similarity among items determined to be biased has been found. No one
has been able to identify those characteristics of an item that cause the item to be biased. In
summarizing the research on item bias or differential item functioning (DIF), the Standards
(AERA et al., 1999) note:

Although DIF procedures may hold some promise for improving test quality, there has been
little progress in identifying the cause or substantive themes that characterizes items exhibiting
DIF. That is, once items on a test have been statistically identified as functioning differently
from one examinee group to another, it has been difficult to specify the reasons for the differ-
ential performance or to identify a common deficiency among the identified items. (p. 78)

It does seem that poorly written, sloppy, and ambiguous items tend to be identified as bi-
ased with greater frequency than those items typically encountered in a well-constructed,
standardized instrument.
A common practice of test developers seeking to eliminate “bias” from their newly
developed educational and psychological tests has been to arrange for a panel of expert mi-
nority group members to review all proposed test items. Any item identified as “culturally
biased” by the panel of experts is then expurgated from the instrument. Because, as previ-
ously noted, no detectable pattern or common characteristic of individual items statistically
shown to be biased has been observed (given reasonable care at the item writing stage), it
seems reasonable to question the armchair or expert minority panel approach to determin-
ing biased items. Several researchers, using a variety of psychological and educational
tests, have identified items as being disproportionately more difficult for minority group
members than for members of the majority culture and subsequently compared their results
with a panel of expert judges. Studies by Jensen (1976) and Sandoval and Mille (1979) are
representative of the methodology and results of this line of inquiry.
After identifying the 8 most racially discriminating and 8 least racially discriminating
items on the Wonderlic Personnel Test, Jensen (1976) asked panels of 5 black psychologists
and 5 Caucasian psychologists to sort out the 8 most and 8 least discriminating items when
only these 16 items were presented to them. The judges sorted the items at a no better than
chance level. Sandoval and Mille (1979) conducted a somewhat more extensive analysis
using items from the WISC-R. These two researchers had 38 black, 22 Hispanic, and 40
white university students from Spanish, history, and education classes identify items from
the WISC-R that are more difficult for a minority child than a white child and items that are
equally difficult for each group. A total of 45 WISC-R items was presented to each judge;
these items included the 15 most difficult items for blacks as compared to whites, the 15
most difficult items for Hispanics as compared to whites, and the 15 items showing the most
nearly identical difficulty indexes for minority and white children. The judges were asked to
read each question and determine whether they thought the item was (1) easier for minority
than white children, (2) easier for white than minority children, or (3) of equal difficulty for
white and minority children. Sandoval and Mille’s (1979) results indicated that the judges
were not able to differentiate between items that were more difficult for minorities and items
that were of equal difficulty across groups. The effects of the judges’ ethnic backgrounds on
the accuracy of their item bias judgments were also considered. Minority and nonminority
judges did not differ in their ability to identify accurately biased items nor did they differ
with regard to the type of incorrect identification they tended to make. Sandoval and Mille’s
(1979) two major conclusions were that “(1) judges are not able to detect items which are
more difficult for a minority child than an Anglo child, and (2) the ethnic background of the
judge makes no difference in accuracy of item selection for minority children” (p. 6). Even
without empirical support for its validity, the use of expert panels of minorities continues
but for a different purpose. Members of various ethnic, religious, or other groups that have a
cultural system in some way unique may well be able to identify items that contain material
that is offensive, and the elimination of such items is proper.
From a large number of studies employing a wide range of methodology a relatively
clear picture emerges. Content bias in well-prepared standardized tests is irregular in its
occurrence, and no common characteristics of items that are found to be biased can be
ascertained by expert judges (minority or nonminority). The variance in group score dif-
ferences on mental tests associated with ethnic group membership when content bias has
been found is relatively small (typically ranging from 2% to 5%). Although the search for
common biased item characteristics will continue, cultural bias in aptitude tests has found
no consistent empirical support in a large number of actuarial studies contrasting the perfor-
mance of a variety of ethnic and gender groups on items of the most widely employed intel-
ligence scales in the United States. Most major test publishing companies do an adequate
job of reviewing their assessments for the presence of content bias. Nevertheless, certain
standardized tests have not been examined for the presence of content bias, and research
with these tests should continue regarding potential content bias with different ethnic groups
(Reynolds & Ramsay, 2003).

Bias in Other Internal Features of Tests

There is no single method for the accurate determination of the degree to which educational
and psychological tests measure a distinct construct. The defining of bias in construct measure-
ment then requires a general statement that can be researched from a variety of viewpoints with
a broad range of methodology. The following rather parsimonious definition is proffered:

Bias exists in regard to construct measurement when a test is shown to measure different hypothetical traits (psychological constructs) for one group than another or to measure the same trait but with differing degrees of accuracy. (After Reynolds, 1982)

As is befitting the concept of construct measurement, many different methods have been employed to examine existing psychological tests and batteries of tests for potential bias. One of the more popular and necessary empirical approaches to investigating construct measurement is factor analysis. Factor analysis, as a procedure, identifies clusters of test items or clusters of subtests of psychological or educational tests that correlate highly with one another, and less so or not at all with other subtests or items. Factor analysis allows one to determine patterns of interrelationships of performance among groups of individuals. For example, if several subtests of an
intelligence scale load highly on (are members of) the same factor, then if a group of indi-
viduals score high on one of these subtests, they would be expected to score at a high level
on other subtests that load highly on that factor. Psychometricians attempt to determine
through a review of the test content and correlates of performance on the factor in question
what psychological trait underlies performance; or, in a more hypothesis testing approach,
they will make predictions concerning the pattern of factor loadings. Hilliard (1979), one
of the more vocal critics of IQ tests on the basis of cultural bias, has pointed out that one
of the potential ways of studying bias involves the comparison of factor analytic results of
test studies across race.

If the IQ test is a valid and reliable test of “innate” ability or abilities, then the factors which
emerge on a given test should be the same from one population to another, since “intel-
ligence” is asserted to be a set of mental processes. Therefore, while the configuration of
scores of a particular group on the factor profile would be expected to differ, logic would
dictate that the factors themselves would remain the same. (p. 53)

Although not agreeing that identical factor analyses of an instrument speak to the
“innateness” of the abilities being measured, consistent factor analytic results across popu-
lations do provide strong evidence that whatever is being measured by the instrument is
being measured in the same manner and is in fact the same construct within each group.
The information derived from comparative factor analysis across populations is directly
relevant to the use of educational and psychological tests in diagnosis and other decision-
making functions. Psychologists, in order to make consistent interpretations of test score
data, must be certain that the test(s) measures the same variable across populations.
In contrast to Hilliard’s (1979) strong statement that factorial similarity across eth-
nicity has not been reported “in the technical literature,” a number of such studies have ap-
peared over the past three decades, dealing with a number of different tasks. These studies
have for the most part focused on aptitude or intelligence tests, the most controversial of all
techniques of measurement. Numerous studies of the similarity of factor analysis outcomes
for children of different ethnic groups, across gender, and even diagnostic groupings have
been reported over the past 30 years. Results reported are highly consistent in revealing that
the internal structure of most standardized tests varies quite little across groups. Compari-
sons of the factor structure of the Wechsler Intelligence Scales (e.g., WISC-III, WAIS-III)
and the Reynolds Intellectual Assessment Scales (Reynolds, 2002) in particular and other
intelligence tests find the tests to be highly factorially similar across gender and ethnic-
ity for blacks, whites, and Hispanics. The structure of ability tests for other groups has
been researched less extensively, but evidence thus far with Chinese, Japanese, and Native
Americans does not show substantially different factor structures for these groups.
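
Although the studies cited here relied on specialized psychometric software, the underlying logic of a cross-group factor comparison can be illustrated with a short sketch. The Python code below is only a minimal illustration using simulated data; the array names, the one-factor model, and the use of Tucker's congruence coefficient as the similarity index are our assumptions for the example, not a procedure drawn from the studies reviewed in this chapter.

import numpy as np
from sklearn.decomposition import FactorAnalysis

def first_factor_loadings(scores):
    # Fit a one-factor exploratory model; the returned vector holds one
    # loading per subtest (column) of the examinees-by-subtests matrix.
    fa = FactorAnalysis(n_components=1, random_state=0)
    fa.fit(scores)
    return fa.components_.T[:, 0]

def tucker_congruence(l1, l2):
    # Tucker's coefficient of congruence between two loading vectors;
    # values near 1.0 indicate highly similar factor loadings.
    return np.sum(l1 * l2) / np.sqrt(np.sum(l1 ** 2) * np.sum(l2 ** 2))

# Simulated (hypothetical) subtest scores for two groups: a single general
# factor plus noise, 400 examinees per group, 10 subtests.
rng = np.random.default_rng(0)
true_loadings = rng.uniform(0.5, 0.9, (1, 10))
scores_a = rng.normal(size=(400, 1)) @ true_loadings + rng.normal(size=(400, 10))
scores_b = rng.normal(size=(400, 1)) @ true_loadings + rng.normal(size=(400, 10))

phi = tucker_congruence(first_factor_loadings(scores_a),
                        first_factor_loadings(scores_b))
print(f"Congruence of first-factor loadings across groups: {abs(phi):.3f}")

In applied work the comparison would, of course, be run on real subtest data for each demographic group and repeated for every factor the test is claimed to measure.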
As is appropriate for studies of construct measurement, comparative factor analy-
sis has not been the only method of determining whether bias exists. Another method of
investigation involves the comparison of internal-consistency reliability estimates across
groups. As described in Chapter 4, internal-consistency reliability is determined by the
degree to which the items are all measuring a similar construct. The internal-consistency
reliability coefficient reflects the accuracy of measurement of the construct. To be unbiased
with regard to construct validity, internal-consistency estimates should be approximately

equal across race. This characteristic of tests has been investigated for a number of popular
aptitude tests for blacks, whites, and Hispanics with results similar to those already noted.
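
As a concrete, if simplified, illustration of this approach, the sketch below computes coefficient alpha separately for two groups from matrices of scored item responses. The data are random placeholders used only so the code runs; a real comparison would use the actual item-level responses of each group and, ideally, a confidence interval or formal test for the difference between the two estimates.

import numpy as np

def coefficient_alpha(item_scores):
    # Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance
    # of total scores), computed on an examinees-by-items matrix.
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
group_a_items = (rng.random((250, 30)) < 0.6).astype(int)  # placeholder 0/1 item scores
group_b_items = (rng.random((250, 30)) < 0.6).astype(int)

print(f"alpha, group A: {coefficient_alpha(group_a_items):.2f}")
print(f"alpha, group B: {coefficient_alpha(group_b_items):.2f}")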
Many other methods of comparing construct measurement across groups have been
used to investigate bias in tests. These methods include the correlation of raw scores with
age, comparison of item-total correlations across groups, comparisons of alternate form and
test-retest correlations, evaluation of kinship correlation and differences, and others (see
Reynolds, 2002, for a discussion of these methods). The general results of research with
these methods have been supportive of the consistency of construct measurement of tests
across ethnicity and gender.
Construct measurement of a large number of popular psychometric assessment in-
struments has been investigated across ethnicity and gender with a divergent set of meth-
odologies. No consistent evidence of bias in construct measurement has been found in the
many prominent standardized tests investigated. This leads to the conclusion that these
psychological tests function in essentially the same manner across ethnicity and gender,
the test materials are perceived and reacted to in a similar manner, and the tests are measur-
ing the same construct with equivalent accuracy for blacks, whites,
Hispanics, and other American minorities for both sexes. Differential validity or single-group validity has not been found and likely is not an existing phenomenon with regard to well-constructed standardized psychological and educational tests. These tests appear to be reasonably unbiased for the groups investigated, and mean score differences do not appear to be an artifact of test bias (Reynolds & Ramsay, 2003).

Bias in Prediction and in Relation to Variables External to the Test

Internal analyses of bias (such as with item content and construct measurement) are less
confounded than analyses of bias in prediction due to the potential problems of bias in the
criterion measure. Prediction is also strongly influenced by the reliability of criterion mea-
sures, which frequently is poor. (The degree of relation between a predictor and a criterion
is restricted as a function of the square root of the product of the reliabilities of the two vari-
ables.) Arriving at a consensual definition of bias in prediction is also a difficult task. Yet,
from the standpoint of the traditional practical applications of aptitude and intelligence tests
in forecasting probabilities of future performance levels, prediction is the most crucial use
of test scores to examine. Looking directly at bias as a characteristic of a test and not a selection model, Cleary et al.'s (1975) definition of test fairness, as restated here in modern times, is a clear direct statement of test bias with regard to prediction bias:

A test is considered biased with respect to prediction when the inference drawn from the test score is not made with the smallest feasible random error or if there is constant error in an inference or prediction as a function of membership in a particular group. (After Reynolds, 1982, p. 201)

The evaluation of bias in prediction under the Cleary et al. (1975) definition (known
as the regression definition) is quite straightforward. With simple regressions, predictions
take the form Ŷ = aX + b, where a is the regression coefficient and b is some constant. When
this equation is graphed (forming a regression line), a represents the slope of the regres-
sion line and b the Y-intercept. Given our definition of bias in prediction validity, nonbias
requires errors in prediction to be independent of group membership, and the regression line
formed for any pair of variables must be the same for each group for whom predictions are
to be made. Whenever the slope or the intercept differs significantly across groups, there
is bias in prediction if one attempts to use a regression equation based on the combined
groups. When the regression equations for two (or more) groups are equivalent, prediction
is the same for those groups. This condition is referred to variously as homogeneity of re-
gression across groups, simultaneous regression, or fairness in prediction. Homogeneity of
regression is illustrated in Figure 16.1, in which the regression line shown is equally appropriate for making predictions for all groups. Whenever homogeneity of regression across groups does not occur, then separate regression equations should be used for each group concerned.

FIGURE 16.1 Equal Slopes and Intercepts
Note: Equal slopes and intercepts result in homogeneity of regression in which the regression lines for different groups are the same.

In actual clinical practice, regression equations are seldom generated for the prediction
of future performance. Rather, some arbitrary, or perhaps statistically derived, cutoff score is
determined, below which failure is predicted. For school performance, a score of 2 or more
standard deviations below the test mean is used to infer a high probability of failure in the
regular classroom if special assistance is not provided for the student in question. Essentially
then, clinicians are establishing prediction equations about mental aptitude that are assumed
to be equivalent across race, sex, and so on. Although these mental equations cannot be
readily tested across groups, the actual form of criterion prediction can be compared across
groups in several ways. Errors in prediction must be independent of group membership. If
regression equations are equal, this condition is met. To test the hypothesis of simultaneous
regression, regression slopes and regression intercepts must both be compared.
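
The chapter describes the logic of this comparison rather than any particular software routine, but one common way to carry it out is to regress the criterion on the predictor, a group indicator, and their interaction: a significant group coefficient signals an intercept difference, and a significant interaction signals a slope difference. The Python sketch below illustrates the idea with simulated data and hypothetical variable names; it is offered only as an example, not as the analysis used in the studies reviewed here.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: the criterion is generated with the SAME slope and intercept
# for both groups, so neither added term should be significant.
rng = np.random.default_rng(2)
n = 300
data = pd.DataFrame({
    "test": rng.normal(100, 15, n),    # predictor (e.g., aptitude score)
    "group": rng.integers(0, 2, n),    # 0/1 group membership
})
data["criterion"] = 0.5 * data["test"] + rng.normal(0, 10, n)

# The group term tests for intercept differences; the test:group interaction
# tests for slope differences.
model = smf.ols("criterion ~ test + group + test:group", data=data).fit()
print(model.summary().tables[1])

If either added term were significant with real data, separate regression equations would be indicated for the groups involved, exactly as described above.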
When homogeneity of regression does not occur, three basic conditions can result:
(a) Intercept constants differ, (b) regression coefficients (slopes) differ, or (c) slopes and inter-
cepts differ. These conditions are illustrated in Figures 16.2, 16.3, and 16.4, respectively.
When intercept constants differ, the resulting bias in prediction is constant across the
range of scores. That is, regardless of the level of performance on the independent vari-
able, the direction and degree of error in the estimation of the criterion (systematic over- or
underprediction) will remain the same. When regression coefficients differ and intercepts

FIGURE 16.2 Equal Slopes with Differing Intercepts
Note: Equal slopes with differing intercepts result in parallel regression lines that produce a constant bias in prediction.

FIGURE 16.3 Equal Intercepts and Differing Slopes
Note: Equal intercepts and differing slopes result in nonparallel regression lines, with the degree of bias depending on the distance of an individual's score from the origin.

are equivalent, the direction of the bias in prediction will remain constant, but the amount
of error in prediction will vary directly as a function of the distance of the score on the in-
dependent variable from the origin. With regression coefficient differences, then, the higher
the score on the predictor variable, the greater the error of prediction for the criterion. When
both slopes and intercepts differ, the situation becomes even more complex: Both the de-
gree of error in prediction and the direction of the “bias” will vary as a function of level of
performance on the independent variable.
A considerable body of literature has developed over the last 30 years regarding dif-
ferential prediction of tests across ethnicity for employment selection, college admissions,
and school or academic performance generally. In an impressive review of 866 black—white
prediction comparisons from 39 studies of test bias in personnel selection, Hunter, Schmidt,
and Hunter (1979) concluded that there was no evidence to substantiate hypotheses of dif-
ferential or single-group validity with regard to the prediction of the job performance across
race for blacks and whites. A similar conclusion has been reached by other independent
researchers (e.g., Reynolds, 1995). A number of studies have also focused on differential
validity of the Scholastic Aptitude Test (SAT) in the prediction of college performance

FIGURE 16.4 Differing Slopes and Intercepts
Note: Differing slopes and intercepts result in a complex situation in which the amount and the direction of the bias are a function of the distance of an individual's score from the origin.

(typically measured by grade point average). In general these studies have found either no
difference in the prediction of criterion performance for blacks and whites or a bias (under-
prediction of the criterion) against whites. When bias against whites has been found, the
differences between actual and predicted criterion scores, while statistically significant,
have generally been quite small.
A number of studies have investigated bias in the prediction of school performance
for children. Studies of the prediction of future performance based on IQ tests for children
have covered a variety of populations including normal as well as referred children: high-
poverty, inner-city children; rural black; and Native American groups. Studies of preschool
as well as school-age children have been carried out. Almost without exception, those stud-
ies have produced results that can be adequately depicted by Figure 16.1, that is, equivalent
prediction for all groups. When this has not been found, intercepts have generally differed
resulting in a constant bias in prediction. Yet, the resulting bias has not been in the popu-
larly conceived direction. The bias identified has tended to overpredict how well minority
children will perform in academic areas and to underpredict how well white children will

perform. Reynolds (1995) provides a thorough review of studies investigating the prediction
of school performance in children.
With regard to bias in prediction, the empirical evidence suggests conclusions simi-
lar to those regarding bias in test content and other internal characteristics. There is no
strong evidence to support contentions of differential or single-group validity. Bias occurs
infrequently and with no apparently observable pattern, except with regard to instruments
of poor reliability and high specificity of test content. When bias occurs, it usually takes the form of small overpredictions for low SES, disadvantaged ethnic minority children, or other low-scoring groups. These overpredictions are unlikely to account for adverse placement or diagnosis in these groups (Reynolds & Ramsay, 2003).

Summary
A considerable body of literature currently exists failing to substantiate cultural bias against
native-born American ethnic minorities with regard to the use of well-constructed, ade-
quately standardized intelligence and aptitude tests. With respect
to personality scales, the evidence is promising yet far more preliminary and thus considerably less conclusive. Despite the existing evidence, we do not expect the furor over the cultural test bias hypothesis to be resolved soon. Bias in psychological testing will remain a torrid issue for some time. Psychologists and educators will need to keep abreast of new findings in the area. As new techniques and better methodology are developed and more specific populations examined, the findings of bias now seen as random and infrequent may become better understood and seen to indeed display a correctable pattern.
In the meantime, however, one cannot ethically fall prey to
the sociopoliticolegal Zeitgeist of the times and infer bias where none exists. Psychologists
and educators cannot justifiably ignore the fact that low IQ, ethnic, disadvantaged children
are just as likely to fail academically as are their white, middle-class counterparts. Black
adolescent delinquents with deviant personality scale scores and exhibiting aggressive be-
havior need treatment environments as much as their white peers. The potential outcome for
score interpretation (e.g., therapy versus prison, special education versus regular education)
cannot dictate the psychological meaning of test performance. We must practice intelligent
testing (Kaufman, 1994). We must remember that it is the purpose of the assessment process
to beat the prediction made by the test, to provide insight into hypotheses for environmental
interventions that prevent the predicted failure or subvert the occurrence of future maladap-
tive behavior.
Test developers are also going to have to be sensitive to the issues of bias, perform-
ing appropriate checks for bias prior to test publication. Progress is being made in all of
these areas. However, we must hold to the data even if we do not like them. At present, only
scattered and inconsistent evidence for bias exists. The few findings of bias do suggest two
448 CHAPTER 16

guidelines to follow in order to ensure nonbiased assessment: (1) Assessment should be con-
ducted with the most reliable instrumentation available, and (2) multiple abilities should be
assessed. In other words, educators and psychologists need to view multiple sources of ac-
curately derived data prior to making decisions concerning individuals. One hopes that this
is what has actually been occurring in the practice of assessment, although one continues to
hear isolated stories of grossly incompetent placement decisions being made. This is not to
say educators or psychologists should be blind to an individual’s cultural or environmental
background. Information concerning the home, community, and school environment must
all be evaluated in individual decisions. As we noted, it is the purpose of the assessment
process to beat the prediction and to provide insight into hypotheses for environmental
interventions that prevent the predicted failure.
Without question, scholars have not conducted all the research that needs to be done
to test the cultural test bias hypothesis and its alternatives. A number and variety of criteria
need to be explored further before the question of bias is empirically resolved. Many dif-
ferent achievement tests and teacher-made, classroom-specific tests need to be employed
in future studies of predictive bias. The entire area of differential validity of tests in the af-
fective domain is in need of greater exploration. A variety of views toward bias have been
expressed in many sources; many with differing opinions offer scholarly, nonpolemical
attempts directed toward a resolution of the issue. Obviously, the fact that such different
views are still held indicates resolution lies in the future. As far as the present situation is
concerned, clearly all the evidence is not in. With regard to a resolution of bias, we believe
that were a scholarly trial to be held, with a charge of cultural bias brought against mental
tests, the jury would likely return the verdict other than guilty or not guilty that is allowed in
British law—“not proven.” Until such time as a true resolution of the issues can take place,
we believe the evidence and positions taken in this chapter accurately reflect the state of our
empirical knowledge concerning bias in mental tests.

KEY TERMS AND CONCEPTS

Comparative factor analysis, p. 441
Content bias, p. 438
Cultural bias, p. 424
Cultural loading, p. 436
Cultural test bias hypothesis, p. 422
Culture-free tests, p. 436
Differential predictive validity, p. 433
Examiner and language bias, p. 433
Homogeneity of regression, p. 443
Inappropriate standardization, p. 433
Inequitable social consequences, p. 433
Mean difference definition of test bias, p. 437
Prediction bias, p. 442
Regression intercepts, p. 444
Regression slopes, p. 444
Test bias, p. 424

RECOMMENDED READINGS

Cleary, T. A., Humphreys, L. G., Kendrick, S. A., & Wesman, A. (1975). Educational uses of tests with disadvantaged students. American Psychologist, 30, 15-41. This is the report of a group appointed by the APA's Board of Scientific Affairs to study the use of psychological and educational tests with disadvantaged students—an early and influential article.

Halpern, D. F. (1997). Sex differences in intelligence: Implications for education. American Psychologist, 52, 1091-1102. A good article that summarizes the literature on sex differences with an emphasis on educational implications.

Neisser, U., Boodoo, G., Bouchard, T., Boykin, A., Brody, N., Ceci, S., Halpern, D., Loehlin, J., Perloff, R., Sternberg, R., & Urbina, S. (1996). Intelligence: Knowns and unknowns. American Psychologist, 51, 77-101. This report of an APA task force provides an excellent review of the research literature on intelligence.

Reynolds, C. R. (1995). Test bias in the assessment of intelligence and personality. In D. Saklofske & M. Zeidner (Eds.), International handbook of personality and intelligence (pp. 545-573). New York: Plenum Press. This chapter provides a thorough review of the literature.

Reynolds, C. R. (2000). Why is psychometric research on bias in mental testing so often ignored? Psychology, Public Policy, and Law, 6, 144-150. This article provides a particularly good discussion of test bias in terms of public policy issues.

Reynolds, C. R., & Ramsay, M. C. (2003). Bias in psychological assessment: An empirical review and recommendations. In J. R. Graham & J. A. Naglieri (Eds.), Handbook of psychology: Assessment psychology (pp. 67-93). New York: Wiley. This chapter also provides an excellent review of the literature.

Suzuki, L. A., & Valencia, R. R. (1997). Race-ethnicity and measured intelligence: Educational implications. American Psychologist, 52, 1103-1114. A good discussion of the topic with special emphasis on educational implications and alternative assessment methods.

Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to listen to an audio lecture about this chapter.
CHAPTER 17

Best Practices in Educational Assessment

With power comes responsibility!

CHAPTER HIGHLIGHTS

Guidelines for Developing Assessments
Guidelines for Selecting Published Assessments
Guidelines for Administering Assessments
Guidelines for Scoring Assessments
Guidelines for Interpreting, Using, and Communicating Assessment Results
Responsibilities of Test Takers

LEARNING OBJECTIVES

After reading and studying this chapter, students should be able to

1. Explain why the assessment practices of teachers are held to high professional standards.
2. Identify major professional organizations that have written guidelines addressing educational assessment issues.
3. Describe and give examples of the principles to consider when developing educational assessments.
4. Describe and give examples of the principles to consider when selecting educational assessments.
5. Identify major resources that provide information about published tests and describe the type of information each one provides.
6. Describe and give examples of the principles to consider when administering educational assessments.
7. Describe and give examples of the principles to consider when interpreting, using, and communicating results.
8. Describe and give examples of the primary responsibilities of test takers.
9. Describe the possible consequences for teachers who engage in unethical behavior.

While teachers might not always be aware of it, their positions endow them with considerable power. Teachers make decisions on a day-to-day basis that significantly impact their students, and many of these decisions involve information garnered from educational assessments. As a result, it is the teacher's responsibility to make sure that the assessments they use, whether they are professionally developed tests or teacher-constructed tests, are developed, administered, scored, and interpreted in a technically, ethically, and legally sound manner. This chapter provides some guidelines that will help you ensure that your assessment practices are sound.
Much of the information discussed in this chapter has been introduced in previous
chapters. We will also incorporate guidelines that are presented in existing professional
codes of ethics and standards of professional practice. One of the principal sources is the
Code of Professional Responsibilities in Educational Measurement that was prepared by
the National Council on Measurement in Education (NCME, 1995). This code is presented
in its entirety in Appendix B. The Code of Professional Responsibilities in Educational
Measurement specifies the following general responsibilities for NCME members who are
involved in educational assessment:

1. Protect the health and safety of all examinees.


2. Be knowledgeable about, and behave in compliance with, state and federal laws rel-
evant to the conduct of professional activities.
3. Maintain and improve their professional competence in educational assessment.
4. Provide assessment services only in areas of their competence and experience, afford-
ing full disclosure of their professional qualifications.
5. Promote the understanding of sound assessment practices in education.
6. Adhere to the highest standards of conduct and promote professionally responsible
conduct within educational institutions and agencies that provide educational services.
7. Perform all professional responsibilities with honesty, integrity, due care, and fair-
ness. (p. 1)

Although these expectations are explicitly directed toward NCME members, all educational
professionals who are involved in assessment activities are well served by following these
general guidelines.
The Code of Professional Responsibilities in Educational Measurement (NCME, 1995)
delineates eight major areas of assessment activity, five of which are most applicable to teach-
ers. These are (1) Developing Assessments; (2) Selecting Assessments; (3) Administering
Assessments; (4) Scoring Assessments; and (5) Interpreting, Using, and Communicating As-
sessment Results. We will use these categories to organize our discussion of best practices in
educational assessment, and we will add an additional section, Responsibilities of Test Takers.
In addition to the Code of Professional Responsibilities in Educational Measurement (NCME,
1995), the following guidelines reflect a compilation of principles presented in the Standards
for Educational and Psychological Testing (AERA et al., 1999), The Student Evaluation
Standards (Joint Committee on Standards for Educational Evaluation, 2003), Code of Fair
Testing Practices in Education (Joint Committee on Testing Practices, 1988), and the Rights
and Responsibilities of Test Takers: Guidelines and Expectations (JCTP, 1998).

Guidelines for Developing Assessments


The Joint Committee on Testing Practices (JCTP, 1998) notes that probably the most fundamental right of test takers is to be evaluated with assessments that meet high professional standards and that are valid for the intended purposes. Accordingly, teachers who are involved in developing their own educational classroom tests and other assessment procedures have a professional responsibility to develop assessments that meet or exceed all applicable technical, ethical, and legal standards (NCME, 1995). The most explicit and comprehensive guidelines for developing and evaluating tests are the Standards
for Educational and Psychological Testing (AERA et al., 1999). Although these standards
apply most directly to professionally developed standardized tests, they may be applied ap-
propriately to less formal assessment procedures such as teacher-constructed tests. Here are
a few guidelines for the development of assessments that meet professional standards.

Clearly Specify Your Educational Objectives and Develop a Table of Specifications.


When developing classroom tests (or really any test), the first step is to specify the purpose
of the test and the construct or domain to be measured. To this end, teachers should begin
by explicitly specifying the educational objectives to be measured and developing a table of
specifications or test blueprint as described in Chapter 7. The table of specifications should
clearly define the content and format of the test and be directly linked to the educational
objectives of the instructional unit being assessed. Although the importance of this process
is probably obvious, in the day-to-day world of teaching in which teachers have many du-
ties and limited time, it may be tempting to skip these steps and simply start writing the
test. However, this is actually one of the most important steps in developing quality tests.
If you have not clearly delineated exactly what you want to measure, you are not likely to
do a very effective job.

Develop Assessment Procedures That Are Appropriate for Measuring the Specified
Educational Outcomes. Once the table of specifications is developed, it should be
used to guide the development of items and scoring procedures. Guidelines for developing
items of different types were presented in Chapters 8, 9, and 10. Selected-response items,
constructed-response items, and performance assessments and portfolios all have their own
specific strengths and weaknesses, and are appropriate for assessing some objectives and in-
appropriate for assessing others. It is the test developer’s responsibility to determine which
procedures are most appropriate for assessing specific learning objectives. In the past it was
fairly common for teachers to use a limited number of assessment procedures (e.g., multiple-choice, true–false, or essay items). However, it has become more widely recognized that no single assessment format can effectively measure the diverse range of educational outcomes emphasized in today's schools. As a result, it is important for teachers to use a diverse array of procedures that are carefully selected to meet the specific purposes of the assessment and to facilitate teaching and achievement (e.g., Linn & Gronlund, 2000).

Develop Explicit Scoring Criteria. Practically all types of assessments require clearly
stated criteria for scoring the items. This can range from fairly straightforward scoring keys
for selected-response items and short-answer items to detailed scoring rubrics for evaluat-
ing performance on extended-response essays and performance assessments. Whatever the
format, developing the items and the scoring criteria should be an integrated process guided
by the table of specifications. Scoring procedures should be consistent with the purpose of
the test and facilitate valid score interpretations (AERA et al., 1999).

Develop Clear Guidelines for Test Administration. All aspects of test administration
should be clearly specified. This includes instructions to students taking the test, time limits,
testing conditions (e.g., classroom or laboratory), and any equipment that will be utilized.
Teachers should develop administration instructions in sufficient detail so that other educa-
tors are able to replicate the conditions if necessary.

Plan Accommodations for Test Takers with Disabilities and Other Special Needs. As discussed in Chapter 15, it is becoming more common for regular education teachers to have students with disabilities in their classroom. When developing assessments, some thought should be given to what types of accommodations may be necessary for these students or other students with special needs.

Carefully Review the Assessment Prior to Administration. Teachers should carefully
review their tests to ensure technical accuracy. To this end it is beneficial to have a trusted
colleague familiar with the content area review their test and scoring criteria prior to admin-
istration and grading. In addition to reviewing for technical accuracy, assessments should
be reviewed for potentially insensitive content or language and evidence of bias due to
race, gender, or ethnic backgrounds. Bias in educational assessment is discussed in detail
in Chapter 16.

Evaluate the Technical Properties of Assessments. After administering the test, teach-
ers should use quantitative and qualitative item analysis procedures to evaluate and refine
their assessments (discussed in Chapter 6). Teachers should also
perform preliminary analyses that will allow them to assess the reliability and validity of their measurements. Reliability and validity were discussed in Chapters 4 and 5. Although it might be difficult for teachers to perform some of the more complex reliability and validity analyses, at a minimum they should use some of the simplified
procedures outlined in the appropriate chapters. Table 17.1 presents a summary of these
guidelines for developing assessments.
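
As one example of the simplified procedures mentioned above, the short sketch below computes two of the item-analysis statistics discussed in Chapter 6, item difficulty (the proportion of students answering correctly) and a corrected item-total discrimination index, from a matrix of scored responses. The data and class size are hypothetical; the point is only that these checks require nothing more elaborate than a spreadsheet or a few lines of code.

import numpy as np

def item_analysis(scores):
    # scores: students-by-items matrix of 0/1 entries.
    difficulty = scores.mean(axis=0)              # proportion correct per item
    totals = scores.sum(axis=1)
    discrimination = np.empty(scores.shape[1])
    for j in range(scores.shape[1]):
        rest = totals - scores[:, j]              # total score excluding item j
        discrimination[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return difficulty, discrimination

rng = np.random.default_rng(3)
scores = (rng.random((30, 10)) < 0.7).astype(int)  # 30 students, 10 placeholder items
diff, disc = item_analysis(scores)
for j, (p, d) in enumerate(zip(diff, disc), start=1):
    print(f"Item {j:2d}: difficulty = {p:.2f}, discrimination = {d:+.2f}")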

Guidelines for Selecting Published Assessments


Although teachers develop the majority of the tests administered in schools, professionally
developed standardized assessments are playing an increasingly important role in today’s
schools. Some of these standardized tests are developed by the state and their administration

TABLE 17.1 Checklist for Developing Assessments

1. Have the educational objectives been clearly specified and a table of specifications developed?
2. Are the assessment procedures appropriate for measuring the learning outcomes?
3. Have explicit scoring criteria been developed?
4. Have clear guidelines for test administration been developed?
5. Have accommodations for test takers with disabilities and other special needs been planned?
6. Has the assessment been reviewed for technical accuracy and potentially insensitive or biased content?
7. Have the technical properties of the assessment been evaluated?

is mandated. However, teachers are often involved in the selection and administration of
other standardized assessment instruments. As a result, they incur numerous responsibilities
associated with this role. The guiding principle, as when developing assessments, is to ensure
that the assessments meet high professional standards and are valid for the intended purposes.
Here are a few guidelines for selecting assessments that meet professional standards.

Select Assessments That Have Been Validated for the Intended Purpose. As we have emphasized throughout this text, validity is a fundamental consideration when developing or selecting a test (see Chapter 5). Professionally developed assessments should clearly specify the recommended interpretations of test scores and provide a summary of the validity evidence supporting each interpretation. However, in the end it is the person selecting the test who is respon-
sible for determining whether the assessment is appropriate for use in the particular setting
(AERA et al., 1999). As an example, in selecting achievement tests it is important that the
content of the assessment correspond with the content of the curriculum. The essential
questions are “How will the assessment information be used?” and “Has the proposed as-
sessment been validated for those uses?”

Select Assessments with Normative Data That Are Representative of the Target Population. The validity of norm-referenced interpretations is dependent on how representative the normative or standardization group is of the target population (see Chapter 3). The fundamental question is "Does the normative sample adequately represent the type of test takers the test will be used with?" It is also important to consider how current the norms are because their usefulness diminishes over time (AERA et al., 1999).

Select Assessments That Produce Reliable Results. It is important to select assessment procedures that produce reliable results. In Chapter 4, we presented guidelines regarding

the levels of reliability recommended for different applications or uses. For example, when
making high-stakes decisions it is important to utilize assessment results that are highly
reliable (e.g., rxx > 0.95).

Select Tests That Are Fair. Although no assessment procedure is absolutely free from
bias, efforts should be made to select assessments that have been shown to be relatively
free from bias due to race, gender, or ethnic backgrounds. Bias in educational assessment is
discussed in detail in Chapter 16.

Select Assessments Based on a Thorough Review of the Available Literature. The selection of assessment procedures can have significant consequences for a large number of individuals. As a result, the decision should be based on a careful and thorough review
of the available information. It is appropriate to begin this review by examining information
and material the test publishers provide. This can include catalogs, test manuals, specimen
test sets, score reports, and other supporting documentation. However, the search should not
stop here, and you should seek out independent evaluations and reviews of the tests you are
considering. A natural question is “Where can I access information about assessments?” Four
of the most useful references are the Mental Measurements Yearbook, Tests in Print, Tests,
and Test Critiques. These resources can be located in the reference section of most college
and larger public libraries. The Testing Office of the American Psychological Association
Science Directorate (APA, 2008) provides the following description of these resources:

Mental Measurements Yearbook (MMY). Published by the Buros Institute for Mental Mea-
surements, the Mental Measurements Yearbook (MMY) lists tests alphabetically by title
and is an invaluable resource for researching published assessments. Each listing provides
descriptive information about the test, including test author, publication dates, intended
population, forms, prices, and publisher. It contains additional information regarding the
availability of reliability, validity, and normative data, as well as scoring and reporting ser-
vices. Most listings include one or more critical reviews by qualified assessment experts.

Tests in Print (TIP). Also published by the Buros Institute for Mental Measurements,
Tests in Print (TIP) is a bibliographic encyclopedia of information on practically every
published test in psychology and education. Each listing includes the test title, intended
population, publication date, author, publisher, and references. TIP does not contain critical
reviews or psychometric information, but it does serve as a master index to the Buros Insti-
tute reference series on tests. In the TIP the tests are listed alphabetically, within subjects
(e.g., achievement tests, intelligence tests). There are also indexes that can help you locate
specific tests. After locating a test that meets your criteria, you can turn to the Mental Mea-
surements Yearbook for more detailed information on the test.

Tests. Published by Pro-Ed, Inc., Tests is a bibliographic encyclopedia covering thousands
of assessments in psychology and education. It provides a brief description of the tests,
including information on the author, purpose, intended population, administration time,
scoring method, cost, and the publisher. Tests does not contain critical reviews or informa-
tion on reliability, validity, or other technical aspects of the tests.

Test Critiques. Also published by Pro-Ed, Inc., Test Critiques is designed to be a companion
to Tests. Test Critiques contains a tripart listing for each test that includes Introduction (e.g., in-
formation on the author, publisher, and purposes), Practical Applications/Uses (e.g., intended
population, administration, scoring, and interpretation guidelines), and Technical Aspects
(e.g., information on reliability, validity), followed by a critical review of the test. Its user-
friendly style makes it appropriate for individuals with limited training in psychometrics.
In addition to these traditional references, Test Reviews Online is a new Web-based
service of the Buros Institute of Mental Measurements (www.unl.edu/buros). This service
makes test reviews available online to individuals precisely as they appear in the Mental
Measurements Yearbook. For a relatively small fee (currently $15), users can download
information on any of over 2,000 tests that includes specifics on test purpose, population,
publication date, administration time, and descriptive test critiques. For more detailed in-
formation on these and other resources, the Testing Office of the American Psychological
Association Science Directorate has prepared an information sheet on “Finding Informa-
tion on Psychological Tests.” This can be requested by visiting its Web site (www.apa.org/science/faq-findtests.html).

Select and Use Only Assessments That You Are Qualified to Administer, Score, and Interpret. Because the administration, scoring, and interpretation of many psychological and educational tests
requires advanced training, it is important to select and use only those
tests that you are qualified to use as a result of your education and training. For example, the
administration of an individual intelligence test such as the Wechsler Intelligence Scale for
Children—Fourth Edition (WISC-IV) requires extensive training and supervision that is typi-
cally acquired in graduate psychology and education programs. Most test publication firms
have established procedures that allow individuals and organizations to qualify to purchase
tests based on specific criteria. For example, Psychological Assessment Resources (2003) has
a three-tier system that classifies assessment products according to qualification requirements.
In this system, level A products require no special qualifications whereas level C products
require an advanced professional degree or license based on advanced training and experience
in psychological and educational assessment practices. Before purchasing restricted tests,
potential buyers must provide documentation that they meet the necessary requirements.

Guard against Potential Misuses and Misinterpretations. When selecting assessments, avoid selecting those that are likely to be used or interpreted in an invalid or biased
manner. This is a difficult responsibility to discharge. Nitko (2001) suggests that to meet
this responsibility you must have a broad knowledge of how assessments are being used in
educational settings and their potential misuses and misinterpretations. To this end, he sug-
gests using references such as Education Week that regularly chronicle the appropriate as
well as inappropriate uses of assessment in our schools. Education Week is available online
at www.edweek.org.

Maintain Test Security. For assessments to be valid, it is important that test security
be maintained. Individuals selecting, purchasing, and using standardized assessments have a professional and legal responsibility to maintain the security of assessment instruments.

For example, The Psychological Corporation (2003) includes the following principles in its security agreement: (a) Test takers should not have access to testing material or answers before taking the test; (b) assessment materials cannot be reproduced or paraphrased; (c) assessment materials and results can be released only to quali-
fied individuals; (d) if test takers or their parents/guardians ask to
examine test responses or results, this review must be monitored by
a qualified representative of the organization conducting the assessment; and (e) any re-
quest to copy materials must be approved in writing. Examples of breaches in the security
of standardized tests include allowing students to examine the test before taking it, using
actual items from a test for preparation purposes, making and distributing copies of a test,
and allowing test takers to take the test outside of a controlled environment (e.g., allowing
them to take the test home to complete it). Table 17.2 provides a summary of these guide-
lines for selecting published assessments. Special Interest Topic 17.1 provides information
about educators who have engaged in unethical and sometimes criminal practices when
using standardized assessments.

Guidelines for Administering Assessments

So far we have discussed your professional responsibilities related to developing and se-
lecting tests. Clearly, your professional responsibilities do not stop there. Every step of the
assessment process has its own important responsibilities, and now we turn to those associ-
ated with the administration of assessments. Subsequently we will address responsibilities
related to the scoring, interpreting, using, and communicating assessment results. The fol-
lowing guidelines involve your responsibilities when administering assessments.

TABLE 17.2 Checklist for Selecting Published Assessments

1. Have the desired interpretations of performance on the selected assessments been validated for the intended purpose?
2. Do the selected assessments have normative data that are
representative of the target population?
3. Do selected assessments produce reliable results?
4. Are interpretations of the selected assessments fair?
5. Was the selection process based on a thorough review of the
available literature?
6. Are you qualified to administer, score, and interpret the
selected assessments?
7. Have you screened assessments for likely misuses
and misinterpretations?
8. Have steps been taken to maintain test security?
SPECIAL INTEREST TOPIC 17.1
Teachers Cheating?

Over 50 New York City educators may lose their jobs after an independent auditor produced evidence
that they helped students cheat on state tests.

(Hoff, 1999)

State officials charge that 71 Michigan schools might have cheated on state tests.
(Keller, 2001)

Georgia education officials suspend state tests after 270 actual test questions were posted on an
Internet site that was accessible to students, teachers, and parents.

(Olson, 2003)

Cizek (1998) notes that the abuse of standardized assessments by educators has become a national
scandal. With the advent of high-stakes assessment, it should not be surprising that some educators
would be inclined to cheat. With one’s salary and possibly one’s future employment riding on how stu-
dents perform on state-mandated achievement tests, the pressure to ensure that those students perform
well may override ethical and legal concerns for some people. Cannell (1988, 1989) was among the
first to bring abusive test practices to the attention of the public. Cannell revealed that by using outdated
versions of norm-referenced assessments, being lax with test security, and engaging in inappropriate
test preparation practices, all 50 states were able to report that their students were above the national
average (this came to be referred to as the Lake Wobegon phenomenon). Other common “tricks” that
educators have employed to inflate scores include using the same form of a test for a long period of
time so that teachers could become familiar with the content, encouraging low-achieving students to
skip school on the day of the test, selectively removing answer sheets of low-performing students, and excluding limited-English and special education students from assessments (Cizek, 1998).
Do not be fooled into thinking that these unethical practices are limited to top administrators try-
ing to make their schools look good; they also involve classroom teachers. Cizek (1998) reports a num-
ber of recent cases wherein principals or other administrators have encouraged teachers to cheat by
having students practice on the actual test items, and in some cases even erasing and correcting wrong
responses on answer sheets. Other unethical assessment practices engaged in by teachers included
providing hints to the correct answer, reading questions that the students are supposed to read, answer-
ing questions about test content, rephrasing test questions, and sometimes simply giving the students
the answers to items. Gay (1990) reported that 35% of the teachers responding to a survey had either
witnessed or engaged in unethical assessment practices. The unethical behaviors included changing
incorrect answers, revealing the correct answer, providing extra time, allowing the use of inappropriate
aids (e.g., dictionaries), and using the actual test items when preparing students for the test.
Just because other professionals are engaging in unethical behavior does not make it right.
Cheating by administrators, teachers, or students undermines the validity of the assessment results.
If you need any additional incentive to avoid unethical test practices, be warned that the test pub-
lishers are watching! The states and other publishers of standardized tests have a vested interest in
maintaining the validity of their assessments. As a result, they are continually scanning the results for
evidence of cheating. For example, Cizek (1998) reports that unethical educators have been identi-
fied as the result of fairly obvious clues such as ordering an excessive number of blank answer sheets
or a disproportionate number of erasures, to more subtle clues such as unusual patterns of increased
scores. The fact is, educators who cheat are being caught and punished, and the punishment may
include the loss of one’s job and license to teach!

Provide Information to Students on the Assessment before Administering It. This
includes information on (1) when the assessment will be administered, (2) the conditions
under which it will be administered, (3) the abilities and content areas that will be assessed,
(4) how it will be scored and interpreted, (5) how the results will used, (6) confidentiality
issues and who will have access to the results, and (7) how the results are likely to impact
the student (AERA et al., 1999; JCTP, 1998; Nitko, 2001). It is also appropriate to provide
information on useful test-taking strategies. For example, if there is a “correction for guess-
ing,” students should be made aware of this because it may affect the way they respond to
the test. Efforts should be made to make this information available to all students and their
parents in an easily understandable format.

Administer the Assessments in a Standardized Manner. Assessments should be administered in a standardized manner to ensure fairness and promote the reliability of scores and validity of their interpretations. This implies that all students will take the assessment under the same conditions. For example, all students will receive the same materials and have access to the same resources (e.g., the use of calculators), receive the same instructions, and have the same time limits. Efforts should be made to ensure that the assessment environment is
Efforts should be made to ensure that the assessment environment is
comfortable, quiet, and relatively free from distractions. Students should be given opportuni-
ties to ask reasonable questions. Some teachers will answer appropriate questions in front of
the entire class so all students will receive the same information. Additional information on
preparing students for and administering standardized tests was provided in Chapter 12.

When Appropriate, Modify Administration to Accommodate the Needs of Students
with Disabilities. As discussed in Chapter 15, when assessing students with disabilities it
is often necessary and appropriate to modify the standard administration procedures to ad-
dress the special needs of these students. Assessment accommodations are granted to mini-
mize the impact of student characteristics that are irrelevant to the construct being measured
by the assessment. A major consideration when selecting accommodations is only to select
accommodations that do not undermine the reliability or validity of the assessment results. If
assessment accommodations are noted in a special education student’s Individual Education
Program (IEP), you have a professional and legal obligation to provide these modifications.

Provide Information to Students and Parents about Their Rights and Give Them
an Opportunity to Express Their Concerns. Students and parents should have an op-
portunity to voice concerns about the testing process and receive information about oppor-
tunities to retake an examination, have one rescored, or cancel scores. When appropriate
they should be given information on how they can obtain copies of assessments or other
related information. When an assessment is optional, students and parents should be given
this information so they can decide whether they want to take the
assessment. If alternative assessments are available, they should also
be informed of this. An excellent resource for all test takers is Rights and Responsibilities of Test Takers: Guidelines and Expectations developed by the Joint Committee on Testing Practices (1998). This is reproduced in Appendix D.

TABLE 17.3 Checklist for Administering Assessments

1. Did you provide information on the assessment before administering it?
2. Was the assessment administered in a standardized
and fair manner?
3. When appropriate, was the assessment modified to accommodate
the needs of test takers with disabilities?
4. Was information provided to students and parents about the rights
of test takers?
5. Are you qualified and prepared to administer the assessments?
6. Are proper test security measures being followed?

Administer Only Those Assessments for Which You Are Qualified by Education and
Training. As noted previously, it is important to only select and use tests that you are
qualified to use as a result of your education and training. Some assessments require exten-
sive training and supervision before being able to administer them independently.

Maintain Test Security. As noted previously, it is important for individuals selecting,
purchasing, and using standardized assessments to maintain the security of the assessments.
Table 17.3 provides guidelines for administering assessments.

Guidelines for Scoring Assessments


Make Sure Assessments Are Scored Properly and the Results Are Reported Accurately. It is a teacher's professional responsibility to develop reasonable quality control procedures to ensure that the scoring is accurate. With selected-response items this may involve carefully developing a scoring key, double-checking it for errors, and adhering to it diligently (or using computer scoring when possible). With constructed-response items and performance assessments, this typically involves the development of explicit scoring rubrics and strictly following them when scoring the assessments. In Chapters 9 and 10, we provided suggestions for minimizing the effects of irrelevant factors when scoring assessments that involve subjective judgments. It is also possible for errors to occur when recording the grades. This can usually be avoided by simply rechecking your grade book (or spreadsheet) after initially recording the grades.
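For teachers who score selected-response tests electronically, the quality-control steps described above (a verified key, consistent application of it, and a recheck of recorded scores) can be partially automated. The short sketch below is a hypothetical illustration only, not a procedure from this textbook; the item numbers, answer options, and the score_responses helper are assumptions made for the example. It scores one student's responses against a key and flags unanswered items so they can be double-checked before grades are recorded.

```python
# A minimal sketch (not from the text) of computer scoring for
# selected-response items, with a simple quality-control check.

def score_responses(key, responses):
    """Return the number of items answered correctly.

    key       -- dict mapping item number to the keyed (correct) option
    responses -- dict mapping item number to the student's chosen option
    """
    # Quality-control check: flag items with no recorded response.
    missing = [item for item in key if item not in responses]
    if missing:
        print(f"Warning: no response recorded for item(s) {missing}")

    # Count matches between the student's responses and the key.
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)


if __name__ == "__main__":
    # Illustrative (assumed) answer key and one student's responses.
    answer_key = {1: "B", 2: "D", 3: "A", 4: "C", 5: "B"}
    student = {1: "B", 2: "D", 3: "C", 4: "C", 5: "B"}

    raw_score = score_responses(answer_key, student)
    percent = 100 * raw_score / len(answer_key)
    print(f"Raw score: {raw_score}/{len(answer_key)} ({percent:.0f}%)")
```

Whatever tool is used, the key itself should still be checked against the test before any scores are recorded.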

Make Sure the Scoring Is Fair. An aspect of the previous guideline that deserves special
attention involves fairness or the absence of bias in scoring. Whenever scoring involves
subjective judgment, it is also important to take steps to ensure that the scoring is based
solely on performance or content and is not contaminated by expectancy effects related to
students. That is, you do not want your personal impressions of the students to influence
your evaluation of their performance, in either a positive or negative manner. Again, we provided suggestions for minimizing expectancy effects and other irrelevant factors that can influence scoring in Chapters 9 and 10.

Score the Assessments and Report the Results in a Timely Manner. Students and
their parents are often anxious to receive the results of their assessments and deserve to have
their results reported in a timely manner. Additionally, to promote learning, it is important
that students receive feedback on their performance in a punctual manner. If the results are
to be delayed, it is important to notify the students, explain the situation, and attempt to
minimize any negative effects.

If Scoring Errors Are Detected, Correct the Errors and Provide the Corrected Results
in a Timely Manner. If you or someone else detects errors in your scoring, it is your responsibility to take corrective action. Correct the errors, adjust the affected scores, and
provide the corrected results in a timely manner.

Implement a Reasonable and Fair Process for Appeal. Students have a right to review their assessments and appeal their scores if they believe there were errors in scoring. Although most institutions have formal appeal procedures, it is usually in everyone's best interest to have a less formal process available by which students can approach the teacher and attempt to address any concerns. This option may prevent relatively minor student concerns from escalating into adversarial confrontations involving parents, administrators, and possibly the legal system.

Keep Assessment Results Confidential. It is the responsibility of teachers and others who score assessments to keep the results confidential. Although different standards of confidentiality and privacy exist in different settings, it is a teacher's professional and ethical responsibility to take reasonable steps to maintain the confidentiality of assessment results. Table 17.4 provides a summary of the guidelines for scoring assessments.

TABLE 17.4 Checklist for Scoring Assessments

1. Are procedures in place to ensure that assessments are scored properly and the results are reported accurately?
2. Are procedures in place to ensure the scoring is fair?
3. Are scores reported in a timely manner?
4. If scoring errors are detected, are the errors corrected and the corrected results provided in a timely manner?
5. Is a reasonable and fair process for appeals in place?
6. Are assessment results kept confidential?

Guidelines for Interpreting, Using, and Communicating Assessment Results

Use Assessment Results Only for the Purposes for Which They Have Been Validated. When interpreting assessment results, the issue of validity is an overriding concern. A primary consideration when interpreting and using assessment results is to determine whether there is sufficient validity evidence to support the proposed interpretations and uses. When teachers use assessment results, it is their responsibility to promote valid interpretations and guard against invalid interpretations.

Be Aware of the Limitations of the Assessment Results. All assessments contain error,
and some have more error than others do. It is the responsibility of teachers and other users
of assessment results to be aware of the limitations of assessments and to take these limita-
tions into consideration when interpreting and using assessment results.

Use Multiple Sources and Types of Assessment Information When Making High-Stakes Educational Decisions. Whenever you hear assessment experts saying "multiple-choice items are worthless because they cannot measure higher-order cognitive skills" or "performance assessments are worthless because they are not reliable," recognize that they are expressing their own personal biases and not being objective. Selected-response items, constructed-response items, and performance assessments all have something to contribute to the overall assessment process. Multiple-choice items and other selected-response formats can typically be scored in a reliable fashion, and this is a definite strength. Although we believe multiple-choice items can be written that measure higher-order cognitive abilities, many educational objectives simply cannot be assessed using selected-response items. If you want to measure a student's writing skills, essay items are particularly well suited. If you want to assess a student's ability to engage in an oral debate, a performance assessment is clearly indicated. The point is, different assessment procedures have different strengths and weaknesses, and teachers are encouraged to use the results of a variety of assessments when making important educational decisions. It is not appropriate to base these decisions on the result of one assessment, particularly when it is difficult to take corrective action should mistakes occur.

Take into Consideration Personal Factors or Extraneous Events That Might Have
Influenced Test Performance. This guideline holds that teachers should be sensitive to
factors that might have negatively influenced a student’s performance. For example, was the
student feeling ill or upset on the day of the assessment? Is the student prone to high levels
of test anxiety? This guideline also extends to administrative and environmental events
that might have impacted the student. For example, were there errors in administration that
might have impacted the student’s performance? Did any events occur during the admin-
istration that might have distracted the student or otherwise undermined performance? If
it appears any factors compromised the student’s performance, this should be considered
when interpreting the student's assessment results.

With Norm-Referenced Assessment, Take into Consideration Any Differences between the Normative Group and Actual Test Takers. If there are meaningful differences between the normative group and the actual test takers, this must be taken into consideration when interpreting and using the assessment results.

Report Results in an Easily Understandable Manner. Students and their parents have the right to receive comprehensive information about assessment results, presented in an understandable and timely manner. It is the teacher's responsibility to provide this feedback to students and their parents and to attempt to answer all of their questions. Providing feedback to students regarding their performance and explaining the rationale for grading decisions facilitates learning.

Explain to Students and Parents How They Are Likely to Be Impacted by Assessment Results. It is the teacher's responsibility to explain to students and their parents both the positive and negative implications of assessment results. Students and their parents have a right to be informed of any likely consequences of the assessments.

Inform Students and Parents How Long the Scores Will Be Retained and Who Will
Have Access to the Scores. Students and their parents have a right to know how long the
assessment results will be retained and who will have access to these records.

Develop Procedures so Test Takers Can File Complaints and Have Their Concerns
Addressed. Teachers and school administrators should develop procedures whereby stu-
dents and their parents can file complaints about assessment practices. As we suggested ear-
lier, it is usually desirable to try to address these concerns in an informal manner as opposed
to allowing the problem to escalate into a legal challenge. Table 17.5 provides a summary of
these guidelines for interpreting, using, and communicating assessment results.

Responsibilities of Test Takers

So far we have emphasized the rights of students and other test takers and the responsibili-
ties of teachers and other assessment professionals. However, the Standards (AERA et al.,
1999) note that students and other test takers also have responsibilities. These responsibili-
ties include the following.

Students Are Responsible for Preparing for the Assessment. Students have the right
to have adequate information about the nature and use of assessments. In turn, students are
responsible for studying and otherwise preparing themselves for the assessment.

Students Are Responsible for Following the Directions of the Individual Administering the Assessment. Students are expected to follow the instructions provided by the individual administering the test or assessment. This includes behaviors such as showing up on time for the assessment, starting and stopping when instructed to do so, and recording responses as requested.

TABLE 17.5 Checklist for Interpreting, Using, and Communicating Assessment Results

1. Are assessment results used only for purposes for which they have been validated?
2. Did you take into consideration the limitations of the assessment results?
3. Were multiple sources and types of assessment information used when making high-stakes educational decisions?
4. Have you considered personal factors or extraneous events that might have influenced test performance?
5. Are there any differences between the normative group and actual test takers that need to be considered?
6. Are results communicated in an easily understandable and timely manner?
7. Have you explained to students and parents how they are likely to be impacted by assessment results?
8. Have you informed students and parents how long the scores will be retained and who will have access to the scores?
9. Have you developed procedures so test takers can file complaints and have their concerns addressed?

Students Are Responsible for Behaving in an Academically Honest Manner. That is, students should not cheat! Any form of cheating reduces the validity of the test and is unfair to other students. Cheating can include copying from another student, using prohibited resources (e.g., notes or other unsanctioned aids), securing stolen copies of tests, or having someone else take the test for them. Most schools have clearly stated policies on academic honesty, and students caught cheating may be sanctioned.

Students Are Responsible for Not Interfering with the Performance of Other Students.
Students should refrain from any activity that might be distracting to other students.

Students Are Responsible for Informing the Teacher or Another Professional if They
Believe the Assessment Results Do Not Adequately Represent Their True Abilities.
If, for any reason, students feel that the assessment results do not adequately represent their
actual abilities, they should inform the teacher. This should be done as soon as possible so
the teacher can take appropriate actions.

Students Should Respect the Copyright Rights of Test Publishers. Students should
not make copies or in any other way reproduce assessment materials.

TABLE 17.6 Responsibilities of Test Takers

1. Students are responsible for preparing for the assessment.
2. Students are responsible for following the directions of the individual administering the assessment.
3. Students are responsible for acting in an academically honest manner.
4. Students are responsible for not interfering with the performance of other students.
5. Students are responsible for informing the teacher or another professional if they believe the assessment results do not adequately represent their true abilities.
6. Students should respect the copyright rights of test publishers.
7. Students should not disclose information about the contents of a test.


SPECIAL INTEREST TOPIC 17.2

Steps to Prevent Student Cheating

Linn and Gronlund (2000) provide the following suggestions to help prevent cheating in your classroom.

1. Take steps to keep the test secure before the testing date.
2. Prior to taking the test, have students clear off the top of their desks.
3. If students are allowed to use scratch paper, have them turn it in with their tests.
4. Carefully monitor the students during the test administration.
5. When possible provide an empty row of seats between students.
6. Use two forms of the test and alternate forms when distributing (you can use the same test items, just arranged in a different order; a simple way to generate a reordered form is sketched after this box).
7. Design your tests to have good face validity, that is, so it appears relevant and fair.
8. Foster a positive attitude toward tests by emphasizing how assessments benefit students (e.g., students learn what they have and have not mastered; a fair way of assigning grades).
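Suggestion 6 can be automated by teachers who assemble their tests electronically. The sketch below is a hypothetical example, not part of Linn and Gronlund's recommendations; the item stems and the make_alternate_form helper are assumptions made for illustration. It reorders an existing item list to create a second form and keeps a mapping back to the original positions so both forms can be scored against the same key.

```python
import random

def make_alternate_form(items, seed=42):
    """Return (form_b_items, mapping) where mapping gives, for each
    1-based position on Form B, the 1-based position of that item on Form A.

    items -- list of item texts in their Form A order
    seed  -- fixed seed so the reordering is reproducible
    """
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # shuffle positions, not the original list
    form_b = [items[i] for i in order]
    mapping = {new + 1: old + 1 for new, old in enumerate(order)}
    return form_b, mapping


if __name__ == "__main__":
    # Illustrative (assumed) item stems for a short quiz.
    form_a = [
        "Define measurement error.",
        "Contrast norm-referenced and criterion-referenced scores.",
        "Give an example of a performance assessment.",
        "Explain what a scoring rubric is.",
        "State one advantage of selected-response items.",
    ]

    form_b, mapping = make_alternate_form(form_a)
    for position, item in enumerate(form_b, start=1):
        print(f"Form B item {position} (Form A item {mapping[position]}): {item}")
```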

Students Should Not Disclose Information about the Contents of a Test. In addition
to not making copies of an assessment, students should refrain from divulging in any other
manner information about the contents of a test. For example, they should not give other stu-
dents information about what to expect on a test. This is tantamount to cheating. Table 17.6
provides a summary of the responsibilities of test takers.

Summary and Top 12 Assessment-Related Behaviors to Avoid
In this chapter we discussed teachers' responsibility to ensure that the assessments they use, whether professionally developed or teacher constructed, are developed, administered, scored, and interpreted in a technically, ethically, and legally sound manner. We are sometimes
asked, "Now that you have told us what we should do, what things should we avoid?" To that end, here is our list of 12 assessment-related behaviors that should be avoided.

1. Don't teach the test itself. It's tantamount to cheating (e.g., giving students answers to tests; changing incorrect answers to correct answers).
2. Don't create an environment where it is easy for students to cheat (e.g., failing to monitor students in a responsible manner).
3. Don't base high-stakes decisions on the results of a single assessment.
4. Don't use poor-quality assessments (e.g., unreliable, lacking relevant validity data, inadequate normative data).
5. Don't keep students and parents "in the dark" about how they will be assessed.
6. Don't breach confidentiality regarding the performance of students on assessments.
7. Don't let your personal preferences and biases impact the scoring of assessments and assignment of grades.
8. Don't use technical jargon without a clear, commonsense explanation when reporting the results of assessments.
9. Don't use assessments that you are not qualified to administer, score, and interpret.
10. Don't make decisions using information you do not understand.
11. Don't ignore the special assessment needs of students with disabilities or from diverse linguistic/cultural backgrounds.
12. Don't accede to bad decisions for students based on the faulty interpretation of test results by others.

In closing, we hope you enjoy a successful and rewarding career as an educational profes-
sional. Remember, “Our children are our future!”

KEY TERMS AND CONCEPTS

Academic honesty, p. 464
Code of Fair Testing Practices in Education (JCTP, 1988), p. 451
Code of Professional Responsibilities in Educational Measurement (NCME, 1995), p. 451
Education Week, p. 456
Mental Measurements Yearbook (MMY), p. 455
Rights and Responsibilities of Test Takers: Guidelines and Expectations (JCTP, 1998), p. 451
Standards for Educational and Psychological Testing (AERA et al., 1999), p. 451
The Student Evaluation Standards (JCSEE, 2003), p. 451
Test Critiques, p. 456
Test Reviews Online, p. 456
Test security, p. 456
Tests, p. 455
Tests in Print (TIP), p. 455

RECOMMENDED READINGS
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA. This is "the source" for technical information on the development and use of tests in educational and psychological settings.

In addition to the Standards (AERA et al., 1999), the codes and guidelines reproduced in the appendixes of this textbook are outstanding resources. These are the following:

Appendix A: Summary Statements of The Student Evaluation Standards (JCSEE, 2003)
Appendix B: Code of Professional Responsibilities in Educational Measurement (NCME, 1995)
Appendix C: Code of Fair Testing Practices in Education (JCTP, 1988)
Appendix D: Rights and Responsibilities of Test Takers: Guidelines and Expectations (JCTP, 1998)
Appendix E: Standards for Teacher Competence in Educational Assessment of Students (AFT, NCME, & NEA, 1990)

Go to www.pearsonhighered.com/reynolds2e to view a PowerPoint™ presentation and to listen to an audio lecture about this chapter.
APPENDIX A

Summary Statements of
The Student Evaluation Standards

Propriety Standards

The propriety standards help ensure that student evaluations will be conducted legally, ethically, and
with due regard for the well-being of the students being evaluated and other people affected by the
evaluation results. These standards are as follows:

P1. Service to Students: Evaluations of students should promote sound education principles,
fulfillment of institutional missions, and effective student work, so that the educational needs of
students are served.

P2. Appropriate Policies and Procedures: Written policies and procedures should be developed,
implemented, and made available, so that evaluations are consistent, equitable, and fair.

P3. Access to Evaluation Information: Access to a student's evaluation information should be provided, but limited to the student and others with established, legitimate permission to view the information, so that confidentiality is maintained and privacy protected.

P4. Treatment of Students: Students should be treated with respect in all aspects of the evaluation
process, so that their dignity and opportunities for educational development are enhanced.

P5. Rights of Students: Evaluations of students should be consistent with applicable laws and basic
principles of fairness and human rights, so that students’ rights and welfare are protected.

P6. Balanced Evaluation: Evaluations of students should provide information that identifies both
strengths and weaknesses, so that strengths can be built upon and problem areas addressed.

P7. Conflict of Interest: Conflicts of interest should be avoided, but if present should be dealt with
openly and honestly, so that they do not compromise evaluation processes and results.

Utility Standards

The utility standards help ensure that student evaluations are useful. Useful student evaluations are in-
formative, timely, and influential. Standards that support usefulness are as follows:

U1. Constructive Orientation: Student evaluations should be constructive, so that they result in
educational decisions that are in the best interest of the student.


U2. Defined Users and Uses: The users and uses of a student evaluation should be specified, so that
the evaluation appropriately contributes to student learning and development.

U3. Information Scope: The information collected for student evaluations should be carefully
focused and sufficiently comprehensive, so that the evaluation questions can be fully answered and
the needs of students addressed.

U4. Evaluator Qualifications: Teachers and others who evaluate students should have the necessary
knowledge and skills, so that the evaluations are carried out competently and the results can be used
with confidence.

U5. Explicit Values: In planning and conducting student evaluations, teachers and others who
evaluate students should identify and justify the values used to judge student performance, so that the
bases for the evaluations are clear and defensible.

U6. Effective Reporting: Student evaluation reports should be clear, timely, accurate, and relevant,
so that they are useful to students, their parents/guardians, and other legitimate users.

U7. Follow-Up: Student evaluations should include procedures for follow-up, so that students,
parents/guardians, and other legitimate users can understand the information and take appropriate
follow-up actions.

Feasibility Standards
The feasibility standards help ensure that student evaluations can be implemented as planned. Feasible
evaluations are practical, diplomatic, and adequately supported. These standards are as follows:

F1. Practical Orientation: Student evaluation procedures should be practical, so that they produce
the needed information in efficient, nondisruptive ways.

F2. Political Viability: Student evaluations should be planned and conducted with the anticipation
of questions from students, their parents/guardians, and other legitimate users, so that their questions
can be answered effectively and their cooperation obtained.

F3. Evaluation Support: Adequate time and resources should be provided for student evaluations,
so that evaluations can be effectively planned and implemented, their results fully communicated, and
appropriate follow-up activities identified.

Accuracy Standards
The accuracy standards help ensure that a student evaluation will produce sound information about
a student’s learning and performance. Sound information leads to valid interpretations, justifiable
conclusions, and appropriate follow-up. These standards are as follows:

Al. Validity Orientation: Student evaluations should be developed and implemented, so that the
interpretations made about the performance of a student are valid and not open to misinterpretation.

A2. Defined Expectations for Students: The performance expectations for students should be
clearly defined, so that evaluation results are defensible and meaningful.

A3. Context Analysis: Student and contextual variables that may influence performance should be
identified and considered, so that a student’s performance can be validly interpreted.

A4. Documented Procedures: The procedures for evaluating students, both planned and actual,
should be described, so that the procedures can be explained and justified.

A5. Defensible Information: The adequacy of information gathered should be ensured, so that
good decisions are possible and can be defended and justified.

A6. Reliable Information: Evaluation procedures should be chosen or developed and implemented,
so that they provide reliable information for decisions about the performance of a student.

A7. Bias Identification and Management: Student evaluations should be free from bias, so that
conclusions can be fair.

A8. Handling Information and Quality Control: The information collected, processed, and reported
about students should be systematically reviewed, corrected as appropriate, and kept secure, so that
accurate judgments can be made.

A9. Analysis of Information: Information collected for student evaluations should be systematically
and accurately analyzed, so that the purposes of the evaluation are effectively achieved.

A10. Justified Conclusions: The evaluative conclusions about student performance should be
explicitly justified, so that students, their parents/guardians, and others can have confidence in them.

A11. Metaevaluation: Student evaluation procedures should be examined periodically using these
and other pertinent standards, so that mistakes are prevented, or detected and promptly corrected, and
sound student evaluation practices are developed over time.

Source: Joint Committee on Standards for Educational Evaluation (2003). The student evaluation
standards. Thousand Oaks, CA: Corwin Press.
APPENDIX B

Code of Professional Responsibilities in Educational Measurement

As an organization dedicated to the improvement of measurement and evaluation practice in education, the National Council on Measurement in Education (NCME) has adopted this Code to promote
professionally responsible practice in educational measurement. Professionally responsible practice
is conduct that arises from either the professional standards of the field, general ethical principles,
or both.
The purpose of the Code of Professional Responsibilities in Educational Measurement, herein-
after referred to as the Code, is to guide the conduct of NCME members who are involved in any type
of assessment activity in education. NCME is also providing this Code as a public service for all indi-
viduals who are engaged in educational assessment activities in the hope that these activities will be
conducted in a professionally responsible manner. Persons who engage in these activities include local
educators such as classroom teachers, principals, and superintendents; professionals such as school
psychologists and counselors; state and national technical, legislative, and policy staff in education;
staff of research, evaluation, and testing organizations; providers of test preparation services; college
and university faculty and administrators; and professionals in business and industry who design and
implement educational and training programs.
This Code applies to any type of assessment that occurs as part of the educational process, in-
cluding formal and informal, traditional and alternative techniques for gathering information used in
making educational decisions at all levels. These techniques include, but are not limited to, large-scale
assessments at the school, district, state, national, and international levels; standardized tests; obser-
vational measures; teacher-conducted assessments; assessment support materials; and other achieve-
ment, aptitude, interest, and personality measures used in and for education.
Although NCME is promulgating this Code for its members, it strongly encourages other or-
ganizations and individuals who engage in educational assessment activities to endorse and abide by
the responsibilities relevant to their professions. Because the Code pertains only to uses of assessment
in education, it is recognized that uses of assessments outside of educational contexts, such as for
employment, certification, or licensure, may involve additional professional responsibilities beyond
those detailed in this Code.
The Code is intended to serve an educational function: to inform and remind those involved in
educational assessment of their obligations to uphold the integrity of the manner in which assessments
are developed, used, evaluated, and marketed. Moreover, it is expected that the Code will stimulate
thoughtful discussion of what constitutes professionally responsible assessment practice at all levels
in education.
The Code enumerates professional responsibilities in eight major areas of assessment activity.
Specifically, the Code presents the professional responsibilities of those who:

1. Develop Assessments
2. Market and Sell Assessments
3. Select Assessments
4. Administer Assessments
5. Score Assessments
6. Interpret, Use, and Communicate Assessment Results
7. Educate about Assessment
8. Evaluate Programs and Conduct Research on Assessments

Although the organization of the Code is based on the differentiation of these activities, they are viewed
as highly interrelated, and those who use this Code are urged to consider the Code in its entirety. The
index following this Code provides a listing of some of the critical interest topics within educational
measurement that focus on one or more of the assessment activities.

General Responsibilities

The professional responsibilities promulgated in this Code in eight major areas of assessment activity
are based on expectations that NCME members involved in educational assessment will:

1. protect the health and safety of all examinees;
2. be knowledgeable about, and behave in compliance with, state and federal laws relevant to the conduct of professional activities;
3. maintain and improve their professional competence in educational assessment;
4. provide assessment services only in areas of their competence and experience, affording full disclosure of their professional qualifications;
5. promote the understanding of sound assessment practices in education;
6. adhere to the highest standards of conduct and promote professionally responsible conduct within educational institutions and agencies that provide educational services; and
7. perform all professional responsibilities with honesty, integrity, due care, and fairness.

Responsible professional practice includes being informed about and acting in accordance with
the Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 1988), the
Standards for Educational and Psychological Testing (American Educational Research Association,
American Psychological Association, National Council on Measurement in Education, 1985), or sub-
sequent revisions, as well as all applicable state and federal laws that may govern the development,
administration, and use of assessments. Both the Standards for Educational and Psychological Testing
and the Code of Fair Testing Practices in Education are intended to establish criteria for judging the
technical adequacy of tests and the appropriate uses of tests and test results. The purpose of this Code
is to describe the professional responsibilities of those individuals who are engaged in assessment
activities. As would be expected, there is a strong relationship between professionally responsible
practice and sound educational assessments, and this Code is intended to be consistent with the rel-
evant parts of both of these documents.
It is not the intention of NCME to enforce the professional responsibilities stated in the Code
or to investigate allegations of violations to the Code.
Since the Code provides a frame of reference for the evaluation of the appropriateness of behavior,
NCME recognizes that the Code may be used in legal or other similar proceedings.


Section 1: Responsibilities of Those Who Develop Assessment Products and Services
Those who develop assessment products and services, such as classroom teachers and other assessment
specialists, have a professional responsibility to strive to produce assessments that are of the highest qual-
ity. Persons who develop assessments have a professional responsibility to:

1.1 Ensure that assessment products and services are developed to meet applicable professional,
technical, and legal standards.
1.2 Develop assessment products and services that are as free as possible from bias due to charac-
teristics irrelevant to the construct being measured, such as gender, ethnicity, race, socioeco-
nomic status, disability, religion, age, or national origin.
1.3. Plan accommodations for groups of test takers with disabilities and other special needs when
developing assessments.
1.4 Disclose to appropriate parties any actual or potential conflicts of interest that might influence
the developers’ judgment or performance.
1.5 Use copyrighted materials in assessment products and services in accordance with state and
federal law.
1.6 Make information available to appropriate persons about the steps taken to develop and score
the assessment, including up-to-date information used to support the reliability, validity, scor-
ing and reporting processes, and other relevant characteristics of the assessment.
1.7. Protect the rights to privacy of those who are assessed as part of the assessment development
process.
1.8 Caution users, in clear and prominent language, against the most likely misinterpretations and
misuses of data that arise out of the assessment development process.
1.9 Avoid false or unsubstantiated claims in test preparation and program support materials and
services about an assessment or its use and interpretation.
1.10 Correct any substantive inaccuracies in assessments or their support materials as soon as
feasible.
1.11 Develop score reports and support materials that promote the understanding of assessment results.

Section 2: Responsibilities of Those Who Market and Sell Assessment Products and Services

The marketing of assessment products and services, such as tests and other instruments, scoring services,
test preparation services, consulting, and test interpretive services, should be based on information that is
accurate, complete, and relevant to those considering their use. Persons who market and sell assessment
products and services have a professional responsibility to:

2.1 Provide accurate information to potential purchasers about assessment products and services
and their recommended uses and limitations.
2.2 Not knowingly withhold relevant information about assessment products and services that
might affect an appropriate selection decision.
2.3 Base all claims about assessment products and services on valid interpretations of publicly
available information.
2.4 Allow qualified users equal opportunity to purchase assessment products and services.
2.5 Establish reasonable fees for assessment products and services.

2.6 Communicate to potential users, in advance of any purchase or use, all applicable fees associ-
ated with assessment products and services.
2.7 Strive to ensure that no individuals are denied access to opportunities because of their inability
to pay the fees for assessment products and services.
2.8 Establish criteria for the sale of assessment products and services, such as limiting the sale of
assessment products and services to those individuals who are qualified for recommended uses
and from whom proper uses and interpretations are anticipated.
2.9 Inform potential users of known inappropriate uses of assessment products and services and
provide recommendations about how to avoid such misuses.
2.10 Maintain a current understanding about assessment products and services and their appropriate
uses in education.
2.11 Release information implying endorsement by users of assessment products and services only
with the users’ permission.
2.12 Avoid making claims that assessment products and services have been endorsed by another
organization unless an official endorsement has been obtained.
2.13 Avoid marketing test preparation products and services that may cause individuals to receive
scores that misrepresent their actual levels of attainment.

Section 3: Responsibilities of Those Who Select Assessment Products and Services

Those who select assessment products and services for use in educational settings, or help others do
so, have important professional responsibilities to make sure that the assessments are appropriate for
their intended use. Persons who select assessment products and services have a professional respon-
sibility to:

3.1 Conduct a thorough review and evaluation of available assessment strategies and instruments
that might be valid for the intended uses.
3.2 Recommend and/or select assessments based on publicly available documented evidence of
their technical quality and utility rather than on unsubstantiated claims or statements.
3.3 Disclose any associations or affiliations that they have with the authors, test publishers, or oth-
ers involved with the assessments under consideration for purchase and refrain from participa-
tion if such associations might affect the objectivity of the selection process.
3.4 Inform decision makers and prospective users of the appropriateness of the assessment for
the intended uses, likely consequences of use, protection of examinee rights, relative costs,
materials and services needed to conduct or use the assessment, and known limitations of the
assessment, including potential misuses and misinterpretations of assessment information.
3.5 Recommend against the use of any prospective assessment that is likely to be administered,
scored, and used in an invalid manner for members of various groups in our society for reasons
of race, ethnicity, gender, age, disability, language background, socioeconomic status, religion,
or national origin.
3.6 Comply with all security precautions that may accompany assessments being reviewed.
3.7 Immediately disclose any attempts by others to exert undue influence on the assessment selec-
tion process.
3.8 Avoid recommending, purchasing, or using test preparation products and services that may cause
individuals to receive scores that misrepresent their actual levels of attainment.

Section 4: Responsibilities of Those Who Administer Assessments

Those who prepare individuals to take assessments and those who are directly or indirectly involved
in the administration of assessments as part of the educational process, including teachers, admin-
istrators, and assessment personnel, have an important role in making sure that the assessments are
administered in a fair and accurate manner. Persons who prepare others for, and those who administer,
assessments have a professional responsibility to:

4.1 Inform the examinees about the assessment prior to its administration, including its purposes,
uses, and consequences; how the assessment information will be judged or scored; how the results
will be kept on file; who will have access to the results; how the results will be distributed; and
examinees’ rights before, during, and after the assessment.
4.2 Administer only those assessments for which they are qualified by education, training, licen-
sure, or certification.
4.3 Take appropriate security precautions before, during, and after the administration of the
assessment.
4.4 Understand the procedures needed to administer the assessment prior to administration.
4.5 Administer standardized assessments according to prescribed procedures and conditions and
notify appropriate persons if any nonstandard or delimiting conditions occur.
4.6 Not exclude any eligible student from the assessment.
4.7 Avoid any conditions in the conduct of the assessment that might invalidate the results.
4.8 Provide for and document all reasonable and allowable accommodations for the administration
of the assessment to persons with disabilities or special needs.
4.9 Provide reasonable opportunities for individuals to ask questions about the assessment procedures
or directions prior to and at prescribed times during the administration of the assessment.
4.10 Protect the rights to privacy and due process of those who are assessed.
4.11 Avoid actions or conditions that would permit or encourage individuals or groups to receive
scores that misrepresent their actual levels of attainment.

Section 5: Responsibilities of Those Who Score Assessments

The scoring of educational assessments should be conducted properly and efficiently so that the results
are reported accurately and in a timely manner. Persons who score and prepare reports of assessments
have a professional responsibility to:

5.1 Provide complete and accurate information to users about how the assessment is scored, such
as the reporting schedule, scoring process to be used, rationale for the scoring approach, techni-
cal characteristics, quality control procedures, reporting formats, and the fees, if any, for these
services.
5.2 Ensure the accuracy of the assessment results by conducting reasonable quality control proce-
dures before, during, and after scoring.
5.3 Minimize the effect on scoring of factors irrelevant to the purposes of the assessment.
5.4 Inform users promptly of any deviation in the planned scoring and reporting service or schedule
and negotiate a solution with users.
5.5 Provide corrected score results to the examinee or the client as quickly as practicable should
errors be found that may affect the inferences made on the basis of the scores.

5.6 Protect the confidentiality of information that identifies individuals as prescribed by state and
federal laws.
5.7 Release summary results of the assessment only to those persons entitled to such information
by state or federal law or those who are designated by the party contracting for the scoring
services.
5.8 Establish, where feasible, a fair and reasonable process for appeal and rescoring the assessment.

Section 6: Responsibilities of Those Who Interpret, Use, and Communicate Assessment Results

The interpretation, use, and communication of assessment results should promote valid inferences
and minimize invalid ones. Persons who interpret, use, and communicate assessment results have a
professional responsibility to:

6.1 Conduct these activities in an informed, objective, and fair manner within the context of the
assessment’s limitations and with an understanding of the potential consequences of use.
6.2 Provide to those who receive assessment results information about the assessment, its pur-
poses, its limitations, and its uses necessary for the proper interpretation of the results.
6.3 Provide to those who receive score reports an understandable written description of all reported
scores, including proper interpretations and likely misinterpretations.
6.4 Communicate to appropriate audiences the results of the assessment in an understandable and
timely manner, including proper interpretations and likely misinterpretations.
6.5 Evaluate and communicate the adequacy and appropriateness of any norms or standards used
in the interpretation of assessment results.
6.6 Inform parties involved in the assessment process how assessment results may affect them.
6.7 Use multiple sources and types of relevant information about persons or programs whenever
possible in making educational decisions.
6.8 Avoid making, and actively discourage others from making, inaccurate reports, unsubstanti-
ated claims, inappropriate interpretations, or otherwise false and misleading statements about
assessment results.
6.9 Disclose to examinees and others whether and how long the results of the assessment will be
kept on file, procedures for appeal and rescoring, rights examinees and others have to the as-
sessment information, and how those rights may be exercised.
6.10 Report any apparent misuses of assessment information to those responsible for the assessment
process.
6.11 Protect the rights to privacy of individuals and institutions involved in the assessment process.

Section 7: Responsibilities of Those Who Educate Others about Assessment

The process of educating others about educational assessment, whether as part of higher education,
professional development, public policy discussions, or job training, should prepare individuals to
understand and engage in sound measurement practice and to become discerning users of tests and test
results. Persons who educate or inform others about assessment have a professional responsibility to:

7.1 Remain competent and current in the areas in which they teach and reflect that in their instruction.
7.2 Provide fair and balanced perspectives when teaching about assessment.

7.3 Differentiate clearly between expressions of opinion and substantiated knowledge when edu-
cating others about any specific assessment method, product, or service.
7.4 Disclose any financial interests that might be perceived to influence the evaluation of a particu-
lar assessment product or service that is the subject of instruction.
7.5 Avoid administering any assessment that is not part of the evaluation of student performance in a
course if the administration of that assessment is likely to harm any student.
7.6 Avoid using or reporting the results of any assessment that is not part of the evaluation of stu-
dent performance in a course if the use or reporting of results is likely to harm any student.
7.7 Protect all secure assessments and materials used in the instructional process.
7.8 Model responsible assessment practice and help those receiving instruction to learn about their
professional responsibilities in educational measurement.
7.9 Provide fair and balanced perspectives on assessment issues being discussed by policymakers,
parents, and other citizens.

Section 8: Responsibilities of Those Who Evaluate Educational Programs and Conduct Research on Assessments
Conducting research on or about assessments or educational programs is a key activity in helping to
improve the understanding and use of assessments and educational programs. Persons who engage
in the evaluation of educational programs or conduct research on assessments have a professional
responsibility to:

8.1 Conduct evaluation and research activities in an informed, objective, and fair manner.
8.2 Disclose any associations that they have with authors, test publishers, or others involved with
the assessment and refrain from participation if such associations might affect the objectivity
of the research or evaluation.
8.3 Preserve the security of all assessments throughout the research process as appropriate.
8.4 Take appropriate steps to minimize potential sources of invalidity in the research and disclose
known factors that may bias the results of the study.
8.5 Present the results of research, both intended and unintended, in a fair, complete, and objective
manner.
8.6 Attribute completely and appropriately the work and ideas of others.
8.7 Qualify the conclusions of the research within the limitations of the study.
8.8 Use multiple sources of relevant information in conducting evaluation and research activities
whenever possible.
8.9 Comply with applicable standards for protecting the rights of participants in an evaluation or
research study, including the rights to privacy and informed consent.

Afterword

As stated at the outset, the purpose of the Code of Professional Responsibilities in Educational Mea-
surement is to serve as a guide to the conduct of NCME members who are engaged in any type of
assessment activity in education. Given the broad scope of the field of educational assessment as well
as the variety of activities in which professionals may engage, it is unlikely that any code will cover
the professional responsibilities involved in every situation or activity in which assessment is used in
education. Ultimately, it is hoped that this Code will serve as the basis for ongoing discussions about

what constitutes professionally responsible practice. Moreover, these discussions will undoubtedly
identify areas of practice that need further analysis and clarification in subsequent editions of the
Code. To the extent that these discussions occur, the Code will have served its purpose.

Index to the Code of Professional Responsibilities in Educational Measurement

This index provides a list of major topics and issues addressed by the responsibilities in each of the
eight sections of the Code. Although this list is not intended to be exhaustive, it is intended to serve as
a reference source for those who use this Code.

Advertising: 1.9, 1.10, 2.3, 2.11, 2.12


Bias: 1.2, 3.5, 4.5, 4.7, 5.3, 8.4
Cheating: 4.5, 4.6, 4.11
Coaching and Test Preparation: 2.13, 3.8, 4.11
Competence: 2.10, 4.2, 4.4, 4.5, 5.2, 5.5, 7.1, 7.8, 7.9, 8.1, 8.7
Conflict of Interest: 1.4, 3.3, 7.4, 8.2
Consequences of Test Use: 3.4, 6.1, 6.6, 7.5, 7.6
Copyrighted Materials, Use of: 1.5, 8.6
Disabled Examinees, Rights of: 1.3, 4.8
Disclosure: 1.6, 2.1, 2.2, 2.6, 3.3, 3.7, 4.1, 5.1, 5.4, 6.2, 6.3, 6.4, 6.6, 6.9, 8.2, 8.4, 8.5
Due Process: 4.10, 5.8, 6.9
Equity: 1.2, 2.4, 2.7, 3.5, 4.6
Fees: 2.5, 2.6, 2.7
Inappropriate Test Use: 1.8, 2.8, 2.9, 3.4, 6.8, 6.10
Objectivity: 3.1, 3.2, 3.3, 6.1, 6.5, 7.2, 7.3, 7.9, 8.1, 8.2, 8.5, 8.9
Rights to Privacy: 1.7, 3.4, 4.10, 5.6, 5.7, 6.11, 8.9
Security: 3.6, 4.3, 7.7, 8.3
Truthfulness: 1.10, 2.1, 2.2, 2.3, 2.11, 2.12, 3.2, 4.6, 7.3
Undue Influence: 3.7
Unsubstantiated Claims: 1.9, 3.2, 6.8

Source: Code of professional responsibilities in educational measurement. Prepared by the NCME Ad Hoc Com-
mittee on the Development of a Code of Ethics: Cynthia B. Schmeiser, ACT—Chair; Kurt F. Geisinger, State
University of New York; Sharon Johnson-Lewis, Detroit Public Schools; Edward D. Roeber, Council of Chief State
School Officers; William D. Schafer, University of Maryland. Copyright 1995 National Council on Measurement
in Education. Any portion of this Code may be reproduced and disseminated for educational purposes.
APPENDIX C
Code of Fair Testing
Practices in Education

The Code of Fair Testing Practices in Education states the major obligations to test takers of profes-
sionals who develop or use educational tests. The Code is meant to apply broadly to the use of tests in
education (admissions, educational assessment, educational diagnosis, and student placement). The
Code is not designed to cover employment testing, licensure or certification testing, or other types of
testing. Although the Code has relevance to many types of educational tests, it is directed primarily at
professionally developed tests such as those sold by commercial test publishers or used in formally
administered testing programs. The Code is not intended to cover tests made by individual teachers for
use in their own classrooms.
The Code addresses the roles of test developers and test users separately. Test users are people
who select tests, commission test development services, or make decisions on the basis of test scores.
Test developers are people who actually construct tests as well as those who set policies for particular
testing programs. The roles may, of course, overlap as when a state education agency commissions
test development services, sets policies that control the test development process, and makes decisions
on the basis of the test scores.
The Code has been developed by the Joint Committee on Testing Practices, a cooperative ef-
fort of several professional organizations, that has as its aim the advancement, in the public interest,
of the quality of testing practices. The Joint Committee was initiated by the American Educational
Research Association, the American Psychological Association, and the National Council on Mea-
surement in Education. In addition to these three groups the American Association for Counseling and
Development/Association for Measurement and Evaluation in Counseling and Development, and the
American Speech-Language-Hearing Association are now also sponsors of the Joint Committee.
The Code presents standards for educational test developers and users in four areas:

A. Developing/Selecting Appropriate Tests
B. Interpreting Scores
C. Striving for Fairness
D. Informing Test Takers

Organizations, institutions, and individual professionals who endorse the Code commit themselves
to safeguarding the rights of test takers by following the principles listed. The Code is intended to be
consistent with the relevant parts of the Standards for Educational and Psychological Testing (AERA,
APA, NCME, 1985). However, the Code differs from the Standards in both audience and purpose.
The Code is meant to be understood by the general public; it is limited to educational tests; and the
primary focus is on those issues that affect the proper use of tests. The Code is not meant to add new
principles over and above those in the Standards or to change the meaning of the Standards. The goal
is rather to represent the spirit of a selected portion of the Standards in a way that is meaningful to test
takers and/or their parents or guardians. It is the hope of the Joint Committee that the Code will also
be judged to be consistent with existing codes of conduct and standards of other professional groups
who use educational tests.


A. Developing/Selecting Appropriate Tests*

Test developers should provide the information that test users need to select appropriate tests. Test users should select tests that meet the purpose for which they are to be used and that are appropriate for the intended test taking populations.

Test Developers Should:

1. Define what each test measures and what the test should be used for. Describe the population(s) for which the test is appropriate.
2. Accurately represent the characteristics, usefulness, and limitations of tests for their intended purposes.
3. Explain relevant measurement concepts as necessary for clarity at the level of detail that is appropriate for the intended audience(s).
4. Describe the process of test development. Explain how the content and skills to be tested were selected.
5. Provide evidence that the test meets its intended purpose(s).
6. Provide either representative samples or complete copies of test questions, directions, answer sheets, manuals, and score reports to qualified users.
7. Indicate the nature of the evidence obtained concerning the appropriateness of each test for groups of different racial, ethnic, or linguistic backgrounds who are likely to be tested.
8. Identify and publish any specialized skills needed to administer each test and to interpret scores correctly.

Test Users Should:

1. First define the purpose for testing and the population to be tested. Then, select a test for that purpose and that population based on a thorough review of the available information.
2. Investigate potentially useful sources of information, in addition to test scores, to corroborate the information provided by tests.
3. Read the materials provided by test developers and avoid using tests for which unclear or incomplete information is provided.
4. Become familiar with how and when the test was developed and tried out.
5. Read independent evaluations of a test and of possible alternative measures. Look for evidence required to support the claims of test developers.
6. Examine specimen sets, disclosed tests or samples of questions, directions, answer sheets, manuals, and score reports before selecting a test.
7. Ascertain whether the test content and norm group(s) or comparison group(s) are appropriate for the intended test takers.
8. Select and use only those tests for which the skills needed to administer the test and interpret scores correctly are available.

*Many of the statements in the Code refer to the selection of existing tests. However, in customized testing programs test developers are engaged to construct new tests. In those situations, the test development process should be designed to help ensure that the completed tests will be in compliance with the Code.

B. Interpreting Scores

Test developers should help users interpret scores correctly. Test users should interpret scores correctly.

Test Developers Should:

9. Provide timely and easily understood score reports that describe test performance clearly and accurately. Also, explain the meaning and limitations of reported scores.
10. Describe the population(s) represented by any norms or comparison group(s), the dates the data were gathered, and the process used to select the samples of test takers.
11. Warn users to avoid specific, reasonably anticipated misuses of test scores.
12. Provide information that will help users follow reasonable procedures for setting passing scores when it is appropriate to use such scores with the test.
13. Provide information that will help users gather evidence to show that the test is meeting its intended purpose(s).

Test Users Should:

9. Obtain information about the scale used for reporting scores, the characteristics of any norms or comparison group(s), and the limitations of the scores.
10. Interpret scores taking into account any major differences between the norms or comparison groups and the actual test takers. Also take into account any differences in test administration practices or familiarity with the specific questions in the test.
11. Avoid using tests for purposes not specifically recommended by the test developer unless evidence is obtained to support the intended use.
12. Explain how any passing scores were set and gather evidence to support the appropriateness of the scores.
13. Obtain evidence to help show that the test is meeting its intended purpose(s).

C. Striving for Fairness

Test developers should strive to make tests that are as fair as possible for test takers of different races, gender, ethnic backgrounds, or different handicapping conditions. Test users should select tests that have been developed in ways that attempt to make them as fair as possible for test takers of different races, gender, ethnic backgrounds, or handicapping conditions.

Test Developers Should:

14. Review and revise test questions and related materials to avoid potentially insensitive content or language.
15. Investigate the performance of test takers of different races, gender, and ethnic backgrounds when samples of sufficient size are available. Enact procedures that help to ensure that differences in performance are related primarily to the skills under assessment rather than to irrelevant factors.
16. When feasible, make appropriately modified forms of tests or administration procedures available for test takers with handicapping conditions. Warn test users of potential problems in using standard norms with modified tests or administration procedures that result in noncomparable scores.

Test Users Should:

14. Evaluate the procedures used by test developers to avoid potentially insensitive content or language.
15. Review the performance of test takers of different races, gender, and ethnic backgrounds when samples of sufficient size are available. Evaluate the extent to which performance differences may have been caused by the test.
16. When necessary and feasible, use appropriately modified forms or administration procedures for test takers with handicapping conditions. Interpret standard norms with care in the light of the modifications that were made.

D. Informing Test Takers


Under some circumstances, test developers have direct communication with test takers. Under other
circumstances, test users communicate directly with test takers. Whichever group communicates di-
rectly with test takers should provide the information described below.

Test Developers or Test Users Should:


17. When a test is optional, provide test takers or their parents/guardians with information to help
them judge whether the test should be taken, or if an available alternative to the test should be used.

18. Provide test takers with the information they need to be familiar with the coverage of the test,
the types of question formats, the directions, and appropriate test-taking strategies. Strive to make
such information equally available to all test takers.

Under some circumstances, test developers have direct control of tests and test scores. Under
other circumstances, test users have such control.
Whichever group has direct control of tests and test scores should take the steps described
below.

Test Developers or Test Users Should:


19. Provide test takers or their parents/guardians with information about rights test takers may
have to obtain copies of tests and completed answer sheets, retake tests, have tests rescored, or cancel
scores.

20. Tell test takers or their parents/guardians how long scores will be kept on file and indicate to
whom and under what circumstances test scores will or will not be released.

21. Describe the procedures that test takers or their parents/guardians may use to register com-
plaints and have problems resolved.

Source: Code of fair testing practices in education. (1988). Washington, DC: Joint Committee on Testing Practices. (Mailing address: Joint Committee on Testing Practices, American Psychological Association, 1200 17th Street NW, Washington, DC 20036.)

APPENDIX D

Rights and Responsibilities of Test Takers: Guidelines and Expectations

Preamble

The intent of this statement is to enumerate and clarify the expectations that test takers may reason-
ably have about the testing process, and the expectations that those who develop, administer, and
use tests may have of test takers. Tests are defined broadly here as psychological and educational
instruments developed and used by testing professionals in organizations such as schools, industries,
clinical practice, counseling settings and human service and other agencies, including those assess-
ment procedures and devices that are used for making inferences about people in the above-named
settings. The purpose of the statement is to inform and to help educate not only test takers, but also
others involved in the testing enterprise so that measurements may be most validly and appropriately
used. This document is intended as an effort to inspire improvements in the testing process and does
not have the force of law. Its orientation is to encourage positive and high quality interactions between
testing professionals and test takers.
The rights and responsibilities listed in this document are neither legally based nor inalienable
rights and responsibilities such as those listed in the United States of America’s Bill of Rights. Rather,
they represent the best judgments of testing professionals about the reasonable expectations that those
involved in the testing enterprise (test producers, test users, and test takers) should have of each other.
Testing professionals include developers of assessment products and services, those who market and
sell them, persons who select them, test administrators and scorers, those who interpret test results,
and trained users of the information. Persons who engage in each of these activities have significant
responsibilities that are described elsewhere, in documents such as those that follow (American As-
sociation for Counseling and Development, 1988; American Speech-Language-Hearing Association,
1994; Joint Committee on Testing Practices, 1988; National Association of School Psychologists,
1992; National Council on Measurement in Education, 1995).
In some circumstances, the test developer and the test user may not be the same person, group
of persons, or organization. In such situations, the professionals involved in the testing should clarify,
for the test taker as well as for themselves, who is responsible for each aspect of the testing process.
For example, when an individual chooses to take a college admissions test, at least three parties are
involved in addition to the test taker: the test developer and publisher, the individuals who administer
the test to the test taker, and the institutions of higher education who will eventually use the informa-
tion. In such cases a test taker may need to request clarifications about their rights and responsibilities.
When test takers are young children (e.g., those taking standardized tests in the schools) or are persons
who spend some or all their time in institutions or are incapacitated, parents or guardians may be
granted some of the rights and responsibilities, rather than, or in addition to, the individual.
Perhaps the most fundamental right test takers have is to be able to take tests that meet high
professional standards, such as those described in Standards for Educational and Psychological


Testing (American Educational Research Association, American Psychological Association, & Na-
tional Council on Measurement in Education, 1999) as well as those of other appropriate professional
associations. This statement should be used as an adjunct, or supplement, to those standards. State and
federal laws, of course, supersede any rights and responsibilities that are stated here.

References
American Association for Counseling and Development (now American Counseling Association) & As-
sociation for Measurement and Evaluation in Counseling and Development (now Association for
Assessment in Counseling). (1989). Responsibilities of users of standardized tests: RUST statement
revised. Alexandria, VA: Author.
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washing-
ton, DC: American Educational Research Association.
American Speech-Language-Hearing Association. (1994). Protection of rights of people receiving audiol-
ogy or speech-language pathology services. ASHA, 36, 60-63.
Joint Committee on Testing Practices. (1988). Code of fair testing practices in education. Washington, DC:
American Psychological Association.
National Association of School Psychologists. (1992). Standards for the provision of school psychological
services. Silver Spring, MD: Author.
National Council on Measurement in Education. (1995). Code of professional responsibilities in educa-
tional measurement. Washington, DC: Author.

Rights and Responsibilities of Test Takers

As a Test Taker, You Have the Right To:


1. Be informed of your rights and responsibilities as a test taker.
2. Be treated with courtesy, respect, and impartiality, regardless of your age, disability, ethnicity,
gender, national origin, religion, sexual orientation or other personal characteristics.
3. Be tested with measures that meet professional standards and that are appropriate, given the
manner in which the test results will be used.
4. Receive a brief oral or written explanation prior to testing about the purpose(s) for testing, the
kind(s) of tests to be used, if the results will be reported to you or to others, and the planned
use(s) of the results. If you have a disability, you have the right to inquire and receive informa-
tion about testing accommodations. If you have difficulty in comprehending the language of
the test, you have a right to know in advance of testing whether any accommodations may be
available to you.
5. Know in advance of testing when the test will be administered, if and when test results will be
available to you, and if there is a fee for testing services that you are expected to pay.
6. Have your test administered and your test results interpreted by appropriately trained individu-
als who follow professional codes of ethics.
7. Know if a test is optional and learn of the consequences of taking or not taking the test, fully
completing the test, or canceling the scores. You may need to ask questions to learn these
consequences.
8. Receive a written or oral explanation of your test results within a reasonable amount of time
after testing and in commonly understood terms.
9. Have your test results kept confidential to the extent allowed by law.
10. Present concerns about the testing process or your results and receive information about proce-
dures that will be used to address such concerns.

As a Test Taker, You Have the Responsibility To:


1. Read and/or listen to your rights and responsibilities as a test taker.
2. Treat others with courtesy and respect during the testing process.
3. Ask questions prior to testing if you are uncertain about why the test is being given, how it will be given, what you will be asked to do, and what will be done with the results.
4. Read or listen to descriptive information in advance of testing and listen carefully to all test instructions. You should inform an examiner in advance of testing if you wish to receive a testing accommodation or if you have a physical condition or illness that may interfere with your performance on the test. If you have difficulty comprehending the language of the test, it is your responsibility to inform an examiner.
5. Know when and where the test will be given, pay for the test if required, appear on time with any required materials, and be ready to be tested.
6. Follow the test instructions you are given and represent yourself honestly during the testing.
7. Be familiar with and accept the consequences of not taking the test, should you choose not to take the test.
8. Inform appropriate person(s), as specified to you by the organization responsible for testing, if you believe that testing conditions affected your results.
9. Ask about the confidentiality of your test results, if this aspect concerns you.
10. Present concerns about the testing process or results in a timely, respectful way, if you have any.

The Rights of Test Takers: Guidelines for Testing Professionals
Test takers have the rights described below. It is the responsibility of the professionals involved in the
testing process to ensure that test takers receive these rights.

1. Because test takers have the right to be informed of their rights and responsibilities as test tak-
ers, it is normally the responsibility of the individual who administers a test (or the organization
that prepared the test) to inform test takers of these rights and responsibilities.
2. Because test takers have the right to be treated with courtesy, respect, and impartiality, regard-
less of their age, disability, ethnicity, gender, national origin, race, religion, sexual orientation,
or other personal characteristics, testing professionals should:

a. Make test takers aware of any materials that are available to assist them in test preparation.
These materials should be clearly described in test registration and/or test familiarization
materials.
b. See that test takers are provided with reasonable access to testing services.

3. Because test takers have the right to be tested with measures that meet professional standards
that are appropriate for the test use and the test taker, given the manner in which the results will
be used, testing professionals should:

a. Take steps to utilize measures that meet professional standards and are reliable, relevant,
useful given the intended purpose and are fair for test takers from varying societal groups.
b. Advise test takers that they are entitled to request reasonable accommodations in test ad-
ministration that are likely to increase the validity of their test scores if they have a disability
recognized under the Americans with Disabilities Act or other relevant legislation.

4. Because test takers have the right to be informed, prior to testing, about the test’s purposes, the
nature of the test, whether test results will be reported to the test takers, and the planned use of
the results (when not in conflict with the testing purposes), testing professionals should:
a. Give or provide test takers with access to a brief description about the test purpose (e.g.,
diagnosis, placement, selection, etc.) and the kind(s) of tests and formats that will be used
(e.g., individual/group, multiple-choice/free response/performance, timed/untimed, etc.),
unless such information might be detrimental to the objectives of the test.
b. Tell test takers, prior to testing, about the planned use(s) of the test results. Upon request,
the test taker should be given information about how long such test scores are typically kept
on file and remain available.
c. Provide test takers, if requested, with information about any preventative measures that have
been instituted to safeguard the accuracy of test scores. Such information would include
any quality control procedures that are employed and some of the steps taken to prevent
dishonesty in test performance.
d. Inform test takers, in advance of the testing, about required materials that must be brought
to the test site (e.g., pencil, paper) and about any rules that allow or prohibit use of other
materials (e.g., calculators).
e. Provide test takers, upon request, with general information about the appropriateness of the
test for its intended purpose, to the extent that such information does not involve the release
of proprietary information. (For example, the test taker might be told, “Scores on this test are
useful in predicting how successful people will be in this kind of work” or “Scores on this
test, along with other information, help us to determine if students are likely to benefit from
this program.”)
f. Provide test takers, upon request, with information about re-testing, including if it is possible
to re-take the test or another version of it, and if so, how often, how soon, and under what
conditions.
g. Provide test takers, upon request, with information about how the test will be scored and in
what detail. On multiple-choice tests, this information might include suggestions for test
taking and about the use of a correction for guessing. On tests scored using professional
judgment (e.g., essay tests or projective techniques), a general description of the scoring
procedures might be provided except when such information is proprietary or would tend
to influence test performance inappropriately.
h. Inform test takers about the type of feedback and interpretation that is routinely provided, as
well as what is available for a fee. Test takers have the right to request and receive informa-
tion regarding whether or not they can obtain copies of their test answer sheets or their test
materials, if they can have their scores verified, and if they may cancel their test results.
i. Provide test takers, prior to testing, either in the written instructions, in other written docu-
ments or orally, with answers to questions that test takers may have about basic test admin-
istration procedures.
j. Inform test takers, prior to testing, if questions from test takers will not be permitted during
the testing process.
k. Provide test takers with information about the use of computers, calculators, or other equip-
ment, if any, used in the testing and give them an opportunity to practice using such equip-
ment, unless its unpracticed use is part of the test purpose, or practice would compromise
the validity of the results, and to provide a testing accommodation for the use of such equip-
ment, if needed.
l. Inform test takers that, if they have a disability, they have the right to request and receive
accommodations or modifications in accordance with the provisions of the Americans with
Disabilities Act and other relevant legislation.

m. Provide test takers with information that will be of use in making decisions if test takers
have options regarding which tests, test forms, or test formats to take.

5. Because test takers have a right to be informed in advance when the test will be administered,
if and when test results will be available, and if there is a fee for testing services that the test
takers are expected to pay, test professionals should:
a. Notify test takers of the alteration in a timely manner if a previously announced testing
schedule changes, provide a reasonable explanation for the change, and inform test takers
of the new schedule. If there is a change, reasonable alternatives to the original schedule
should be provided.
b. Inform test takers prior to testing about any anticipated fee for the testing process, as well
as the fees associated with each component of the process, if the components can be sepa-
rated.

6. Because test takers have the right to have their tests administered and interpreted by appropri-
ately trained individuals, testing professionals should:
a. Know how to select the appropriate test for the intended purposes.
b. When testing persons with documented disabilities and other special characteristics that
require special testing conditions and/or interpretation of results, have the skills and knowl-
edge for such testing and interpretation.
c. Provide reasonable information regarding their qualifications, upon request.
d. Insure that test conditions, especially if unusual, do not unduly interfere with test perfor-
mance. Test conditions will normally be similar to those used to standardize the test.
e. Provide candidates with a reasonable amount of time to complete the test, unless a test has
a time limit.
f. Take reasonable actions to safeguard against fraudulent actions (e.g., cheating) that could
place honest test takers at a disadvantage.

7. Because test takers have the right to be informed about why they are being asked to take particu-
lar tests, if a test is optional, and what the consequences are should they choose not to complete
the test, testing professionals should:

a. Normally only engage in testing activities with test takers after the test takers have pro-
vided their informed consent to take a test, except when testing without consent has been
mandated by law or governmental regulation, or when consent is implied by an action the
test takers have already taken (e.g., such as when applying for employment and a personnel
examination is mandated).
b. Explain to test takers why they should consider taking voluntary tests.
c. Explain, if a test taker refuses to take or complete a voluntary test, either orally or in writing,
what the negative consequences may be to them for their decision to do so.
d. Promptly inform the test taker if a testing professional decides that there is a need to devi-
ate from the testing services to which the test taker initially agreed (e.g., should the testing
professional believe it would be wise to administer an additional test or an alternative test),
and provide an explanation for the change.

8. Because test takers have a right to receive a written or oral explanation of their test results
within a reasonable amount of time after testing and in commonly understood terms, testing
professionals should:

a. Interpret test results in light of one or more additional considerations (e.g., disability, language proficiency), if those considerations are relevant to the purposes of the test and performance on the test and are in accordance with current laws.
b. Provide, upon request, information to test takers about the sources used in interpreting their test results, including technical manuals, technical reports, norms, and a description of the comparison group, or additional information about the test taker(s).
c. Provide, upon request, recommendations to test takers about how they could improve their performance on the test, should they choose or be required to take the test again.
d. Provide, upon request, information to test takers about their options for obtaining a second interpretation of their results. Test takers may select an appropriately trained professional to provide this second opinion.
e. Provide test takers with the criteria used to determine a passing score, when individual test scores are reported and related to a pass-fail standard.
f. Inform test takers, upon request, how much their scores might change, should they elect to take the test again. Such information would include variation in test performance due to measurement error (e.g., the appropriate standard errors of measurement) and changes in performance over time with or without intervention (e.g., additional training or treatment).
g. Communicate test results to test takers in an appropriate and sensitive manner, without use of negative labels or comments likely to inflame or stigmatize the test taker.
h. Provide corrected test scores to test takers as rapidly as possible, should an error occur in the processing or reporting of scores. The length of time is often dictated by individuals responsible for processing or reporting the scores, rather than the individuals responsible for testing, should the two parties indeed differ.
i. Correct any errors as rapidly as possible if there are errors in the process of developing scores.

9. Because test takers have the right to have the results of tests kept confidential to the extent al-
lowed by law, testing professionals should:
a. Insure that records of test results (in paper or electronic form) are safeguarded and main-
tained so that only individuals who have a legitimate right to access them will be able to do
so.
b. Provide test takers, upon request, with information regarding who has a legitimate right to
access their test results (when individually identified) and in what form. Testing profession-
als should respond appropriately to questions regarding the reasons why such individuals
may have access to test results and how they may use the results.
c. Advise test takers that they are entitled to limit access to their results (when individually
identified) to those persons or institutions, and for those purposes, revealed to them prior to
testing. Exceptions may occur when test takers, or their guardians, consent to release the test
results to others or when testing professionals are authorized by law to release test results.
d. Keep confidential any requests for testing accommodations and the documentation support-
ing the request.

10. Because test takers have the right to present concerns about the testing process and to receive
information about procedures that will be used to address such concerns, testing professionals
should:
a. Inform test takers how they can question the results of the testing if they do not believe that
the test was administered properly or scored correctly, or other such concerns.
b. Inform test takers of the procedures for appealing decisions that they believe are based in
whole or in part on erroneous test results.

c. Inform test takers if their test results are under investigation and may be canceled, in-
validated, or not released for normal use. In such an event, that investigation should be
performed in a timely manner. The investigation should use all available information that
addresses the reason(s) for the investigation, and the test taker should also be informed of
the information that he/she may need to provide to assist with the investigation.
d. Inform the test taker, if that test taker’s test results are canceled or not released for normal
use, why that action was taken. The test taker is entitled to request and receive information
on the types of evidence and procedures that have been used to make that determination.

The Responsibilities of Test Takers: Guidelines for Testing Professionals
Testing professionals should take steps to ensure that test takers know that they have specific respon-
sibilities in addition to their rights described above.

1. Testing professionals need to inform test takers that they should listen to and/or read their rights
and responsibilities as a test taker and ask questions about issues they do not understand.
2. Testing professionals should take steps, as appropriate, to ensure that test takers know that
they:
a. Are responsible for their behavior throughout the entire testing process.
b. Should not interfere with the rights of others involved in the testing process.
c. Should not compromise the integrity of the test and its interpretation in any manner.

3. Testing professionals should remind test takers that it is their responsibility to ask questions
prior to testing if they are uncertain about why the test is being given, how it will be given, what
they will be asked to do, and what will be done with the results. Testing professionals should:

a. Advise test takers that it is their responsibility to review materials supplied by test publish-
ers and others as part of the testing process and to ask questions about areas that they feel
they should understand better prior to the start of testing.
b. Inform test takers that it is their responsibility to request more information if they are not satis-
fied with what they know about how their test results will be used and what will be done with
them.

4. Testing professionals should inform test takers that it is their responsibility to read descriptive
material they receive in advance of a test and to listen carefully to test instructions. Testing
professionals should inform test takers that it is their responsibility to inform an examiner in
advance of testing if they wish to receive a testing accommodation or if they have a physical
condition or illness that may interfere with their performance. Testing professionals should
inform test takers that it is their responsibility to inform an examiner if they have difficulty
comprehending the language in which the test is given. Testing professionals should:

a. Inform test takers that, if they need special testing arrangements, it is their responsibility to
request appropriate accommodations and to provide any requested documentation as far in
advance of the testing date as possible. Testing professionals should inform test takers about
the documentation needed to receive a requested testing accommodation.
b. Inform test takers that, if they request but do not receive a testing accommodation, they
could request information about why their request was denied.

5. Testing professionals should inform test takers when and where the test will be given, and
whether payment for the testing is required. Having been so informed, it is the responsibility
of the test taker to appear on time with any required materials, pay for testing services, and be
ready to be tested. Testing professionals should:

a. Inform test takers that they are responsible for familiarizing themselves with the appropri-
ate materials needed for testing and for requesting information about these materials, if
needed.
b. Inform the test taker, if the testing situation requires that test takers bring materials (e.g.,
personal identification, pencils, calculators, etc.) to the testing site, of this responsibility to
do so.

6. Testing professionals should advise test takers, prior to testing, that it is their responsibility to:
a. Listen to and/or read the directions given to them.
b. Follow instructions given by testing professionals.
c. Complete the test as directed.
d. Perform to the best of their ability if they want their score to be a reflection of their best
effort.
e. Behave honestly (e.g., not cheating or assisting others who cheat).

7. Testing professionals should inform test takers about the consequences of not taking a test,
should they choose not to take the test. Once so informed, it is the responsibility of the test taker
to accept such consequences, and the testing professional should so inform the test takers. If test
takers have questions regarding these consequences, it is their responsibility to ask questions of
the testing professional, and the testing professional should so inform the test takers.
8. Testing professionals should inform test takers that it is their responsibility to notify appropriate
persons, as specified by the testing organization, if they do not understand their results, or if
they believe that testing conditions affected the results. Testing professionals should:
a. Provide information to test takers, upon request, about appropriate procedures for question-
ing or canceling their test scores or results, if relevant to the purposes of testing.
b. Provide to test takers, upon request, the procedures for reviewing, re-testing, or canceling
their scores or test results, if they believe that testing conditions affected their results and if
relevant to the purposes of testing.

c. Provide documentation to the test taker about known testing conditions that might have affected the results of the testing, if relevant to the purposes of testing.
9. Testing professionals should advise test takers that it is their responsibility to ask questions about the confidentiality of their test results, if this aspect concerns them.
10. Testing professionals should advise test takers that it is their responsibility to present concerns about the testing process in a timely, respectful manner.

Source: Test Taker Rights and Responsibilities Working Group of the Joint Committee on Testing Practices. (1998, August). The rights and responsibilities of test takers: Guidelines and expectations. Washington, DC: American Psychological Association.
APPENDIX E

Standards for Teacher Competence in Educational Assessment of Students

The professional education associations began working in 1987 to develop standards for teacher
competence in student assessment out of concern that the potential educational benefits of student
assessments be fully realized. The Committee appointed to this project completed its work in 1990,
following reviews of earlier drafts by members of the measurement, teaching, and teacher prepara-
tion and certification communities. Parallel committees of affected associations are encouraged to
develop similar statements of qualifications for school administrators, counselors, testing directors,
supervisors, and other educators in the near future. These statements are intended to guide the preser-
vice and inservice preparation of educators, the accreditation of preparation programs, and the future
certification of all educators.
A standard is defined here as a principle generally accepted by the professional associations
responsible for this document. Assessment is defined as the process of obtaining information that is
used to make educational decisions about students; to give feedback to students about their progress,
strengths, and weaknesses; to judge instructional effectiveness and curricular adequacy; and to inform
policy. The various assessment techniques include, but are not limited to, formal and informal obser-
vation, qualitative analysis of pupil performance and products, paper-and-pencil tests, oral question-
ing, and analysis of student records. The assessment competencies included here are the knowledge
and skills critical to a teacher’s role as educator. It is understood that there are many competencies
beyond assessment competencies that teachers must possess.
By establishing standards for teacher competence in student assessment, the associations sub-
scribe to the view that student assessment is an essential part of teaching and that good teaching cannot
exist without good student assessment. Training to develop the competencies covered in the standards
should be an integral part of preservice preparation. Further, such assessment training should be
widely available to practicing teachers through staff development programs at the district and building
levels. The standards are intended for use as:

■ a guide for teacher educators as they design and approve programs for teacher preparation
■ a self-assessment guide for teachers in identifying their needs for professional development in student assessment
■ a guide for workshop instructors as they design professional development experiences for inservice teachers
■ an impetus for educational measurement specialists and teacher trainers to conceptualize student assessment and teacher training in student assessment more broadly than has been the case in the past.

The standards should be incorporated into future teacher training and certification programs.
Teachers who have not had the preparation these standards imply should have the opportunity and sup-
port to develop these competencies before the standards enter into the evaluation of these teachers.


The Approach Used to Develop the Standards


The members of the associations that supported this work are professional educators involved in teaching,
teacher education, and student assessment. Members of these associations are concerned about the inad-
equacy with which teachers are prepared for assessing the educational progress of their students, and thus
sought to address this concern effectively. A committee named by the associations first met in September
1987 and affirmed its commitment to defining standards for teacher preparation in student assessment.
The committee then undertook a review of the research literature to identify needs in student assessment,
current levels of teacher training in student assessment, areas of teacher activities requiring competence
in using assessments, and current levels of teacher competence in student assessment.
The members of the committee used their collective experience and expertise to formulate and
then revise statements of important assessment competencies. Drafts of these competencies went
through several revisions by the Committee before the standards were released for public review.
Comments by reviewers from each of the associations were then used to prepare a final statement.

The Scope of a Teacher's Professional Role and Responsibilities for Student Assessment

There are seven standards in this document. In recognizing the critical need to revitalize classroom
assessment, some standards focus on classroom-based competencies. Because of teachers’ growing
roles in education and policy decisions beyond the classroom, other standards address assessment
competencies underlying teacher participation in decisions related to assessment at the school, dis-
trict, state, and national levels.
The scope of a teacher’s professional role and responsibilities for student assessment may be
described in terms of the following activities. These activities imply that teachers need competence in
student assessment and sufficient time and resources to complete them in a professional manner.

Activities Occurring Prior to Instruction


(a) Understanding students’ cultural backgrounds, interests, skills, and abilities as they apply
across a range of learning domains and/or subject areas;
(b) Understanding students’ motivations and their interests in specific class content;
(c) Clarifying and articulating the performance outcomes expected of pupils; and
(d) Planning instruction for individuals or groups of students.

Activities Occurring during Instruction


(a) Monitoring pupil progress toward instructional goals;
(b) Identifying gains and difficulties pupils are experiencing in learning and performing;
(c) Adjusting instruction;
(d) Giving contingent, specific, and credible praise and feedback;
(e) Motivating students to learn; and
(f) Judging the extent of pupil attainment of instructional outcomes.

Activities Occurring after the Appropriate Instructional Segment


(e.g., lesson, class, semester, grade)
(a) Describing the extent to which each pupil has attained both short- and long-term instructional
goals;
(b) Communicating strengths and weaknesses based on assessment results to students, and parents
or guardians;

(c) Recording and reporting assessment results for school-level analysis, evaluation, and decision
making;
(d) Analyzing assessment information gathered before and during instruction to understand each
student's progress to date and to inform future instructional planning;
(e) Evaluating the effectiveness of instruction; and
(f) Evaluating the effectiveness of the curriculum and materials in use.

Activities Associated with a Teacher's Involvement in School Building and School District Decision Making
(a) Serving on a school or district committee examining the school’s and district’s strengths and
weaknesses in the development of its students;
(b) Working on the development or selection of assessment methods for school building or school
district use;
(c) Evaluating school district curriculum; and
(d) Other related activities.

Activities Associated with a Teacher’s Involvement in a Wider Community of Educators


(a) Serving on a state committee asked to develop learning goals and associated assessment methods;
(b) Participating in reviews of the appropriateness of district, state, or national student goals and
associated assessment methods; and
(c) Interpreting the results of state and national student assessment programs.

Each standard that follows is an expectation for assessment knowledge or skill that a teacher
should possess in order to perform well in the five areas just described. As a set, the standards call on
teachers to demonstrate skill at selecting, developing, applying, using, communicating, and evaluat-
ing student assessment information and student assessment practices. A brief rationale and illustrative
behaviors follow each standard.
The standards represent a conceptual framework or scaffolding from which specific skills can
be derived. Work to make these standards operational will be needed even after they have been pub-
lished. It is also expected that experience in the application of these standards should lead to their
improvement and further development.

Standards for Teacher Competence in Educational Assessment of Students

1. Teachers should be skilled in choosing assessment methods appropriate for instructional
decisions. Skills in choosing appropriate, useful, administratively convenient, technically adequate, and
fair assessment methods are prerequisite to good use of information to support instructional decisions.
Teachers need to be well-acquainted with the kinds of information provided by a broad range of assess-
ment alternatives and their strengths and weaknesses. In particular, they should be familiar with criteria
for evaluating and selecting assessment methods in light of instructional plans.
Teachers who meet this standard will have the conceptual and application skills that follow.
They will be able to use the concepts of assessment error and validity when developing or selecting
their approaches to classroom assessment of students. They will understand how valid assessment data
can support instructional activities such as providing appropriate feedback to students, diagnosing
group and individual learning needs, planning for individualized educational programs, motivating
students, and evaluating instructional procedures. They will understand how invalid information can
affect instructional decisions about students. They will also be able to use and evaluate assessment

options available to them, considering among other things, the cultural, social, economic, and lan-
guage backgrounds of students. They will be aware that different assessment approaches can be in-
compatible with certain instructional goals and may impact quite differently on their teaching.
Teachers will know, for each assessment approach they use, its appropriateness for making
decisions about their pupils. Moreover, teachers will know where to find information about and/
or reviews of various assessment methods. Assessment options are diverse and include text- and
curriculum-embedded questions and tests, standardized criterion-referenced and norm-referenced
tests, oral questioning, spontaneous and structured performance assessments, portfolios, exhibitions,
demonstrations, rating scales, writing samples, paper-and-pencil tests, seatwork and homework, peer-
and self-assessments, student records, observations, questionnaires, interviews, projects, products,
and others’ opinions.

2. Teachers should be skilled in developing assessment methods appropriate for instructional
decisions. Although teachers often use published or other external assessment tools, the bulk of the as-
sessment information they use for decision making comes from approaches they create and implement.
Indeed, the assessment demands of the classroom go well beyond readily available instruments.
Teachers who meet this standard will have the conceptual and application skills that follow.
Teachers will be skilled in planning the collection of information that facilitates the decisions they will
make. They will know and follow appropriate principles for developing and using assessment meth-
ods in their teaching, avoiding common pitfalls in student assessment. Such techniques may include
several of the options listed at the end of the first standard. The teacher will select the techniques that
are appropriate to the intent of the teacher’s instruction.
Teachers meeting this standard will also be skilled in using student data to analyze the qual-
ity of each assessment technique they use. Because most teachers do not have access to assessment
specialists, they must be prepared to do these analyses themselves.

3. The teacher should be skilled in administering, scoring, and interpreting the results of
both externally produced and teacher-produced assessment methods. It is not enough that teach-
ers are able to select and develop good assessment methods; they must also be able to apply them
properly. Teachers should be skilled in administering, scoring, and interpreting results from diverse
assessment methods. Teachers who meet this standard will have the conceptual and application skills
that follow. They will be skilled in interpreting informal and formal teacher-produced assessment
results, including pupils’ performances in class and on homework assignments. Teachers will be able
to use guides for scoring essay questions and projects, stencils for scoring response-choice questions,
and scales for rating performance assessments. They will be able to use these in ways that produce
consistent results. Teachers will be able to administer standardized achievement tests and be able to
interpret the commonly reported scores: percentile ranks, percentile band scores, standard scores, and
grade equivalents. They will have a conceptual understanding of the summary indexes commonly
reported with assessment results: measures of central tendency, dispersion, relationships, reliability,
and errors of measurement.
Teachers will be able to apply these concepts of score and summary indices in ways that en-
hance their use of the assessments that they develop. They will be able to analyze assessment results to
identify pupils’ strengths and errors. If they get inconsistent results, they will seek other explanations
for the discrepancy or other data to attempt to resolve the uncertainty before arriving at a decision.
They will be able to use assessment methods in ways that encourage students’ educational develop-
ment and that do not inappropriately increase students’ anxiety levels.

4. Teachers should be skilled in using assessment results when making decisions about in-
dividual students, planning teaching, developing curriculum, and school improvement. Assess-
ment results are used to make educational decisions at several levels: in the classroom about students,

in the community about a school and a school district, and in society, generally, about the purposes
and outcomes of the educational enterprise. Teachers play a vital role when participating in decision
making at each of these levels and must be able to use assessment results effectively.
Teachers who meet this standard will have the conceptual and application skills that follow.
They will be able to use accumulated assessment information to organize a sound instructional plan
for facilitating students’ educational development. When using assessment results to plan and/or eval-
uate instruction and curriculum, teachers will interpret the results correctly and avoid common misin-
terpretations, such as basing decisions on scores that lack curriculum validity. They will be informed
about the results of local, regional, state, and national assessments and about their appropriate use for
pupil, classroom, school, district, state, and national educational improvement.

5. Teachers should be skilled in developing valid pupil grading procedures that use pupil
assessments. Grading students is an important part of professional practice for teachers. Grading is de-
fined as indicating both a student’s level of performance and a teacher’s valuing of that performance. The
principles for using assessments to obtain valid grades are known and teachers should employ them.
Teachers who meet this standard will have the conceptual and application skills that follow.
They will be able to devise, implement, and explain a procedure for developing grades composed of
marks from various assignments, projects, in-class activities, quizzes, tests, and/or other assessments
that they may use. Teachers will understand and be able to articulate why the grades they assign are
rational, justified, and fair, acknowledging that such grades reflect their preferences and judgments.
Teachers will be able to recognize and to avoid faulty grading procedures such as using grades as
punishment. They will be able to evaluate and to modify their grading procedures in order to improve
the validity of the interpretations made from them about students’ attainments.

6. Teachers should be skilled in communicating assessment results to students, parents,
other lay audiences, and other educators. Teachers must routinely report assessment results to
students and to parents or guardians. In addition, they are frequently asked to report or to discuss
assessment results with other educators and with diverse lay audiences. If the results are not com-
municated effectively, they may be misused or not used. To communicate effectively with others on
matters of student assessment, teachers must be able to use assessment terminology appropriately and
must be able to articulate the meaning, limitations, and implications of assessment results. Further-
more, teachers will sometimes be in a position that will require them to defend their own assessment
procedures and their interpretations of them. At other times, teachers may need to help the public to
interpret assessment results appropriately.
Teachers who meet this standard will have the conceptual and application skills that follow.
Teachers will understand and be able to give appropriate explanations of how the interpretation of
student assessments must be moderated by the student’s socioeconomic, cultural, language, and other
background factors. Teachers will be able to explain that assessment results do not imply that such
background factors limit a student’s ultimate educational development. They will be able to com-
municate to students and to their parents or guardians how they may assess the student’s educational
progress. Teachers will understand and be able to explain the importance of taking measurement errors
into account when using assessments to make decisions about individual students. Teachers will be
able to explain the limitations of different informal and formal assessment methods. They will be able
to explain printed reports of the results of pupil assessments at the classroom, school district, state,
and national levels.

7. Teachers should be skilled in recognizing unethical, illegal, and otherwise inappropriate
assessment methods and uses of assessment information. Fairness, the rights of all concerned,
and professional ethical behavior must undergird all student assessment activities, from the initial
planning for and gathering of information to the interpretation, use, and communication of the results.

Teachers must be well-versed in their own ethical and legal responsibilities in assessment. In addition,
they should also attempt to have the inappropriate assessment practices of others discontinued when-
ever they are encountered. Teachers should also participate with the wider educational community in
defining the limits of appropriate professional behavior in assessment.
Teachers who meet this standard will have the conceptual and application skills that follow.
They will know those laws and case decisions that affect their classroom, school district, and state
assessment practices. Teachers will be aware that various assessment procedures can be misused or
overused resulting in harmful consequences such as embarrassing students, violating a student’s right
to confidentiality, and inappropriately using students’ standardized achievement test scores to measure
teaching effectiveness.

Source: Standards for teacher competence in educational assessment of students. Developed by the American Federation of Teachers, National Council on Measurement in Education, and National Education Association. This is not copyrighted material. Reproduction and dissemination are encouraged. 1990.

< oa jae ica oy po bd =

;
ae Ga —
Som ©} S Sa
=
S N © SaY 3

p=) SS
i
———)see
Y
Z.
© co=
"S UO | Sa= Y

Z+
; LL8T@
CV8C QLLT
O87 60LT
€VLTe 9LIT
Cv9T 9PST C8
8LST
1197 pS’
680
OCVC
ISve’ ECOG
8ST 9977
9677 9077
9ECTT SPIT
LLIT 0607
611 ££0C
1907
(0) 0 0
é Zz- (panu1juoo)

(@) Z+
SOLG vCCT
0617
LSIT LS@T LSET VSVC LIST
COV:
687° 08ST
6vST 98b7"
CV9OT vElLTe
VOLT
€L9T VOLTv6LT TS8C
€T8T O67
1887 6£67
L967
167VCET 1197
| 0 0

z-

) ane,jo Z+ Z 9c" Lg 8¢° 6S" 09° 19° ct"£9 9° ¢9° 99° L9 89° 69° OF We (Ghel vl cL OL EE 8L 6L 08° 18° (6:3€8

Zz
6S8C°
LO8C° €8Le’
I@8e° SVL 699¢°
LOLE” v6Se”
CEOC™ OCS" Orre
LSSe ESbe”
Clee
60re" v9
OOee"
OfEe COLE
SCCe OSIe
Lobe OSOg
C80c° ClO¢
186c° C167
9V6C
(0) 0 0

z-

JAIN@ z+
COT LIZcSt£6cl"
6LI
IvIl 89CT°
Teer: 90rI O8rl vSSI°
errl 8c9l
16ST” LIST’
pool 9ELT
OOLT CLL 6L8T
vrs”
8081" OS6l”
Sl6l 6107
C861" 8807
vSOC
| 0 Q

[VULION z-

(e) an0 z+ Zz 87 67 0c If" ce ce vo ce oC Le 8 6€ Oa Iv (Oa tv WWcovov Lv Sv 67" 0S” IS (BS€¢ tS” cc


94)
Jopun

vary
z+
000S$°
0967 Oss’
Ot6r OrsT ICL
1OLy
108v" Ivor
Igor C8rVv
cO9r CCS" Cvyy z9Sb"
p9EV
vor Cty98cr
Lvcv solr
LOtYV’ 0607
6clv" cSOV
€lov 9C6c°
vL6c
©) 0 0

Jo z-
suonsodolg

@) zi
0000° 0Z10"
0800"
0700" 0910 6£70
6610° 6L7O 6S£0°
61£0° 8Er0"
LISO’
86£0° 8L70° LSSO° 9¢90°
96S0° FILO
¢L90° £SLO" TE80°
£6L0° 0160°
1L80° 8760"
L860° p90I"
97OI
| 0
TA z-
WIAVL
@) oma Z+ Zi 00°10° TO" €0" vO"co 90° LO’ 80° 60° Ol 1 1A) 28 vl Cl: io) [iis8I 6r 0c 74 (G6;CT vO oT 97 BG

497
498
Ta
(e) (q) (9) (9) (®) (q) (9)

w1dVL
one men

penunuo)
jo
Z+
0 Z+ 0 ze

oR
== ~ = ~ a = ~
Oe 0 O. = O.
vs” S667 SOOT eC OLEVy 0£90° COT 8987 CELLO
cs ECOL LLOV l vs C8EV 8190 (AGG IL87" 6c10°
98° 1SO¢” 6rel col vOEV 9090" VOT CL8V SC1O"
L8° 8LOe CCl 9C'T 9007 v6S0 SEC SL8r CELLO’
88° 90TE vost” LOT slvr C8S0° TGC 1887 6110"
68° teheilite LOST 8ST 6CrV ILSO° CAG v88v OT1O
06 6SIe Irs | 6o lvry 6SS0° 8CC L388 E110
16 981 vist 09'T Coty 87S0 6CTC 0687 OLLO
tO" Clon 88LT 19'T Coby LEsO OE C68" LOLO”
£6 8ETE COLT col vLyy 9TS0" LEG 9687 vOTO
v6 VOCE 9ELT €9'T v8rVy 91S0 COAG 8687 COLO
c6 680° TILT pol Cory” ¢OSO’ CE? 1067 6600°
96° CIeg C89T" COT COST c6v0" VET v06rV 9600°
Lo OvVee 099T° 99'T CISy C870" CEC 9067" 600°
86 C9Ee” Ceol” L9'T COS SLVO" SEC 6067" 1600°
66° 68Eo° IT9T 89'1 ces c9r0 LG 11l6r" 6800°
O0'T Elve L8ST° 69'T crSy CSO" 87S cl6v” L800°
10'T SEE COST I OL VSS 9br0 6e°C 9167 800°
COT 19vc 6ST A) |We vOSy 9EP0 Ove 8l6Vv C7800
€O'T C8re° CIST CLL ELSy L7VO" vc OC6V 0800"
vO'T 80S¢c° Corl” EL C8SV" 8170 GMS CCOV 800°
SOT Tese” 69rl EY 16sv 6070" ev'T CC6V ¢L00°
90°T DSSE orrl SLI 66S7 10v0" wT LC6V €L00°
JO LLSe €cyl” OL 8097" C6EO a6 6c6V 1L00°
80°T 66S¢° 1Ovl Liat 919V P8c0 9V'T cov’ 6900°
60° 1C9¢" 6Lel SL ccor’ SLEO" LVT CLOV 8900°
Ol Ev9e™ LSel 6L'T ecorv” LOCO" 8V'C veo 9900°
Ei c99¢° Ceel 08'T Ivor’ 6S¢0 6r'C 9¢6r 7900"
cIl 989¢° viel [8'T 6v9V TSc0° OSC 8c6r C900"
EE
vS00"

6100°
6S00°

cS00°

600°
8700"

vr00"

Ov00°

900°
600"
TS00°

€£00"
Ty00°
(panuyuo?)

6100°
LS00"

€C00°
cCOO"
1COO"
1c00°
0c00"
0900°

$S00°

LV00"
Sv00"

€v00"

6£00°

L700"
900°

£700"
VC00"
€c00"
8€00°
LE00°
9€00°
SE00°
veo0

CcE00"
TE00"
O€00°

800°
tvor 9r6r
8ror"
6v6r" CSO
eSoV 9S6V" 096r° £96r" 9967" OL6V"
6967" [Lov tLov 9L6V" 8Lor 6L6V"

Over’
Tv6v Spor [Sor SS6V LS6V"
6Sor 196v°
C96t" p96r
S96r" L96r"
8967" cL6V vLoVv
vLov
CLOV" LLOV 186r"
LLOV 6L6v 0867” 1867"

VST 99°C 09°C y9'C 89°C IL@ VLC 8L°C 08°C


IS'c(656eS Sst LSC89°C6S°C 19°CCOT£9 c9'T9977LOC 69°COLT CLCtL CLT9LCELC 6L'C 18°Cc8°C€8°CV3TS8°C98°CL8°C88°C68°C06°C

CTEO™
viedo v6c0" 9ScO"
OScO" 8CCO" COLO
L610 S10"
VLIO’ 9910" 8S10 OSTO"
Or evl0
610
10°
90"
vre0 6c¢0° 10¢0° L8CO"
LOEO" 18c0° 8970"
PLCO COTO" vyCcO
6€C0"
€€CcO" cCCO" cICO
LICO’ LOCO"
COCO’ 8810 6L10° OLIO" C910"
e810 9e10"

9S9r 9891"
SLO 9OLV
6697"
coor" VYLV
OSLV 19LV CLLV
OSLV t6LV cO8r" CI8v
8LLV 88LV OL8r 8esr 98

poor
IL9V" 6ILv
clLy 9CLV 8ELy
CeLy LOLG e8Ly 86LV 8087 LI8v
[C89C8V vesy Cry OS8r
pS8v L98v
LS8V" y98V
VIC

OIG

61'C
OT

CL?

IEG
81°C

Kare
661
v6'l

Lol
06'1

c6'l

col
681

col
ps'T
C81
e831

OCT
96'1

861

ele

STC
60°C

Il¢
SOC
90°C
LOT
80°C
lol

10°
COC
c0'C
vOT
881

00°C
S81
98'1
LSI

COC OCT
Oc O6LT
OLTT OcOT
8cOl e001”
£860" 6980"
€S80° 8LLO0° 6rL0"
Selo 80L0° 1890"

ILcT
Iscl TELL
IStl c601°
CLEL 9SOT
SLOT 8960"
TS60°
ve60" 1060°
8160" S880" €C80°
880° €6L0° VOLO’
8080° TCLO 690° 8990"
¢S90°
€y90"

80LE° O6Lt OC8e"


OLLE SCOe" 9907
60 C807" 9Ecr 6LeV 90EV
62) cOly” CCCV
Lviv 6ley Srer
ba
Le
6r
6CLE 698¢°
O18 68 LO6¢" Prot
888° C96E° L66e°
O86¢° Sl0r
ceOV" 6607" Tey LLIV’ LOC Isc
clit S9Cr C6CV ceeVv LStv’

ROHOGBANMNTNOTMOHOAAG
ATFANOMADDHDOAAMNYTYHN
SAN RAQRN SASHA GRR SASSI SS

499
500
TH
(e) (q) (0) (®) (4) (9) (e) (q) (2)

ATavVL
penunuo)
anea, anyeA anyeA
jo jo jo
ie FA || = | = it a = oat OSS me | x
Zi) 0 Z+ 0 z+ 0 Z+ 0 Z+ 0 Z+

Za - | = - ~ i= ee a| = a | = a | ~
ZS OR = ORES Zi Oa ee (U0 2 OM OREZ=

167 C86v" 8100° 8le £667" L000° SVE L66v" £000"


COT C867" 8100° 6le £667" © L000" ore L66v" £000°
67 £867" L100° Owe €66v" L000° LVE L66v" £000"
v67 v86r" 9100° I@€ £667" L000° Bre Lo6v €000°
S6T v86r" 9100" CCE voor" 9000" 6V'€ 8667" c000°
96°C S867 S100" eCE v66r 9000" Ose 8667" 000°
LOT S867 $100° VCE yo6r 9000° ¢ IG 8667" c000°
86°C 9867" v100" STE y66r" 9000° CGE 8667 000°
66° 9867 v100 ICE b66r" 9000" ese 8667" C000"
00°€ L86v" €100° LOE S667 $000° SE 8667" C000"
10'€ L86v" €100° 8CE S66v" $000" cc 8667 c000°
COE L86v" €100° 6C'€ S66V" $000" OSE 8667" C000"
£0°€ 8867 7100" Oe S66" $000° LSE 8667 C000"
vO'E 8867 C100" Lee S66" $000° 89° 8667" C000"
SOE 6867 T100° CEE c66r" $000° 69° 8667 C000"
90'E 6867 T100° Se 9667" 000" 09" 8667" 7000"
LOE 6867" T100° VEE 9667 O00" 19°€ 8667" C000"
80°€ 0667" 0100° SEAS 9667" 7000" COE 6667" 1000°
60°€ 0667" 0100" OEE 9667" 000" e9e 6667" 1000"
Ore 0667" 0100" Lee 9667" 000" y9'e 6667" 1000°
Tie 1667" 6000° BEE 9667" O00" Sc 6667" 1000°
cle 1667" 6000" 6e¢ L66v" €000° OO 6667" 1000°
ele 1667" 6000" Ove L66v" £000" WSIS 6667" 1000°
ve C66" 8000° Ive L66v" £000" 89'€ 6667" 1000°
Sve C66" 8000° ve L66v" £000" 69° 6667" 1000°
OTE c66r 8000° eve L66v" £000° OL'e 6667" 1000°

we
LIES
£000"
8000°
Lo6v
C66v"
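Readers who prefer to compute these proportions directly rather than read them from the table can do so with a few lines of Python. The sketch below is only an illustration (the function name and the spot-check values are ours); it uses the standard library's math.erf to return the same two areas the table reports for any z.

```python
import math

def normal_curve_areas(z):
    """Return (area between the mean and z, area beyond z) for the
    standard normal curve, matching the proportions tabled above."""
    area_mean_to_z = 0.5 * math.erf(abs(z) / math.sqrt(2.0))
    area_beyond_z = 0.5 - area_mean_to_z
    return area_mean_to_z, area_beyond_z

# Spot checks: z = 1.00 gives about .3413 between the mean and z and
# .1587 beyond; z = 3.70 gives about .4999 and .0001.
for z in (1.00, 1.96, 3.70):
    between, beyond = normal_curve_areas(z)
    print(f"z = {z:4.2f}  between = {between:.4f}  beyond = {beyond:.4f}")
```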
APPENDIX G

Answers to Practice Problems

Chapter 2

1. Calculate the mean, variance, and standard deviation for the following score distributions.
Distribution 1 Distribution 2 Distribution 3
Mean = 7.267 Mean = 5.467 Mean = 5.20
Variance = 3.3956 Variance = 5.182 Variance = 4.427
SD = 1.8427 SD = 2.276 SD = 2.104
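The raw score distributions for this problem appear in Chapter 2 and are not repeated in this appendix, so the sketch below only illustrates the computations the answers assume. Note that, as in the KR 20 and coefficient alpha answers later in this appendix, the variance is the population form (dividing by N). The example scores here are hypothetical, not the book's distributions.

```python
def mean(scores):
    return sum(scores) / len(scores)

def variance(scores):
    # Population variance (divide by N), the convention used in the
    # reliability answers later in this appendix.
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)

def standard_deviation(scores):
    return variance(scores) ** 0.5

# Hypothetical distribution for illustration only.
scores = [4, 6, 7, 7, 8, 9, 10]
print(mean(scores), variance(scores), standard_deviation(scores))
```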

2. Calculate the Pearson Correlation Coefficient for the following pairs of scores.
Sample 1: r = 0.631
Sample 2: r = 0.886
Sample 3: r = 0.26
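Likewise, the paired scores for this problem are given in Chapter 2. The sketch below shows a deviation-score form of the Pearson product-moment correlation that will reproduce answers of this kind; the function name and the example pairs are ours, not the book's samples.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation for paired scores x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pairs for illustration only.
x = [2, 4, 5, 6, 8]
y = [1, 3, 4, 4, 7]
print(round(pearson_r(x, y), 3))
```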

Chapter 3
1. Transform the following raw scores to the specified standard score formats. The raw score
distribution has a mean of 70 and a standard deviation of 10.
a. Raw score = 85 z-score = 1.5 T-score = 65
b. Raw score = 60 z-score = —1.0 T-score = 40
c. Raw score = 55 z-score = —1.5 T-score = 35
d. Raw score = 95 z-score = 2.5 T-score = 75
e. Raw score = 75 z-score = 0.5 T-score = 55

2. Convert the following z-scores to T-scores and CEEB scores.


a. z-score = 1.5 T-score = 65 CEEB score = 650
b. z-score = —1.5 T-score = 35 CEEB score = 350
c. z-score = 2.5 T-score = 75 CEEB score = 750
d. z-score = —2.0 T-score = 30 CEEB score = 300
e. z-score = —1.7 T-score = 33 CEEB score = 330
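All of these conversions follow from the same linear transformations: z = (X - mean)/SD, T = 50 + 10z, and CEEB = 500 + 100z. The short sketch below (function names are ours) verifies each row above.

```python
def z_score(raw, mean, sd):
    return (raw - mean) / sd

def t_score(z):
    return 50 + 10 * z

def ceeb_score(z):
    return 500 + 100 * z

# Problem 1: raw-score distribution with mean = 70, SD = 10
for raw in (85, 60, 55, 95, 75):
    z = z_score(raw, 70, 10)
    print(raw, z, t_score(z))

# Problem 2: converting z-scores directly to T and CEEB scores
for z in (1.5, -1.5, 2.5, -2.0, -1.7):
    print(z, t_score(z), ceeb_score(z))
```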

Chapter 4
1. Calculating KR 20:

            Item 1   Item 2   Item 3   Item 4   Item 5   Total Score

Student 1      0        1        1        0        1          3
Student 2      1        1        1        1        1          5
Student 3      1        0        1        0        0          2
Student 4      0        0        0        1        0          1
Student 5      1        1        1        1        1          5
Student 6      1        1        0        1        0          3
p_i         0.6667   0.6667   0.6667   0.6667   0.5      SD² = 2.1389
q_i         0.3333   0.3333   0.3333   0.3333   0.5
p_i x q_i   0.2222   0.2222   0.2222   0.2222   0.25

Σ(p_i x q_i) = 0.2222 + 0.2222 + 0.2222 + 0.2222 + 0.25 = 1.1388

KR 20 = 5/4 x ((2.1389 - 1.1388)/2.1389)
      = 1.25 x (1.0001/2.1389)
      = 1.25 x (0.4675)
      = 0.58
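The same arithmetic can be checked in a few lines of Python. The sketch below (function and variable names are ours) reproduces the KR 20 value of about .58 from the item responses in the table above.

```python
def kr20(item_responses):
    """item_responses: one list of 0/1 item scores per student."""
    n_students = len(item_responses)
    k = len(item_responses[0])                      # number of items
    totals = [sum(student) for student in item_responses]
    mean_total = sum(totals) / n_students
    # Population variance of total scores (divide by N), as in the answer above.
    sd2 = sum((t - mean_total) ** 2 for t in totals) / n_students
    sum_pq = 0.0
    for i in range(k):
        p = sum(student[i] for student in item_responses) / n_students
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * ((sd2 - sum_pq) / sd2)

responses = [
    [0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0],
]
print(round(kr20(responses), 2))   # about 0.58
```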

2. Calculating Coefficient alpha:

            Item 1   Item 2   Item 3   Item 4   Item 5   Total Score

Student 1      4        5        4        5        5         23
Student 2      3        3        2        3        2         13
Student 3      2        3        1        2        1          9
Student 4      4        4        5        5        4         22
Student 5      2        3        2        2        3         12
Student 6      1        2        2        1        3          9
SD_i²       1.2222   0.8889   1.8889   2.3333   1.6667   SD² = 32.89

Coefficient alpha = 5/4 x (1 - (1.2222 + 0.8889 + 1.8889 + 2.3333 + 1.6667)/32.89)
                  = 1.25 x (1 - 8/32.89)
                  = 1.25 x (1 - 0.2432)
                  = 1.25 x (0.7568)
                  = 0.946
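Coefficient alpha applies the same logic to items that are not scored 0/1. The sketch below, again with illustrative names of our own, reproduces the value of .946 from the rating data above.

```python
def coefficient_alpha(item_scores):
    """item_scores: one list of item scores per student (any score scale)."""
    n_students = len(item_scores)
    k = len(item_scores[0])

    def pop_variance(values):
        # Population variance (divide by N), matching the answer above.
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [pop_variance([student[i] for student in item_scores])
                 for i in range(k)]
    total_var = pop_variance([sum(student) for student in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

ratings = [
    [4, 5, 4, 5, 5],
    [3, 3, 2, 3, 2],
    [2, 3, 1, 2, 1],
    [4, 4, 5, 5, 4],
    [2, 3, 2, 2, 3],
    [1, 2, 2, 1, 3],
]
print(round(coefficient_alpha(ratings), 3))   # about 0.946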
INDEX

Academic honesty, 464 high-stakes, 301


Achievement hybrid, 313
in relation to improvement or effort, the language of, 2-9
288-289 mathematics in, 33-34
relative to ability, 289 modified, reporting results of, 415
Adaptive devices and supports, 408 performance assessments, 185, 246-268
Age equivalents, 79 actual, 247
Alternatives, 196 analogue, 247
American College Test (ACT), 7, 367 artificial, 247-248
American Federation of Teachers, 22 definition, 246-249
American Educational Research Association extended-response, 249
(AERA), 316 guidelines for developing effective,
Analysis, 174 252-266
Anghoff method, 81 in high-stakes testing, 266
Application, 174 how they differ from more traditional
Appropriate preparation practices, 318-321 assessments, 247-249
Aptitude—achievement discrepancies, 337 reliability issues, 263
Assessment(s), 2—4 restricted-response, 249
accommodations for, 396 strengths and weaknesses, 266-268
determining what to provide, 410-412 placement assessment, 20
rationale for, 403-404 preparing students for, 189-191
strategies for, 405-410 process of, 13
when not appropriate, 404—405 participants in, 13-15
administering of, 189-191, 457-460 published, 453-457
alternative, 248, 409-410 restricted-response performance
augmented, 313 assessment, 249
authentic, 26, 248 results, 461
in the schools, 373-375 confidentiality of, 461
behaviors to avoid, 466 scoring of, 460-461
complex-performance, 26 standards-based assessments, 312
development of, 181-189, 452-453 students with disabilities, 28-29, 416
diagnostic, 20 and technology in schools, 26
educational, 9 in the twenty-first century, 24-29
assumptions of, 9-13 use of alternate, 409
common applications of, 19-21 value-added, 315-318
laws related to, 15-19 weighing assessment procedures, 291
English Language Learners, 412 what teachers need to know about, 21-24
guidelines Assessment Systems Corporation, 159
for developing, 452-453 Autism, 401-402
for interpreting, using, and Automated essay scoring systems, 238
communicating, 462-463
for selecting, 453-457 Base rate, 135
for scoring, 460-461 Bayesian Essay Scoring System, 238


Behavior Assessment System for Children Cohen’s kappa, 264


(BASC), 86, 376-381 College admission tests, 366
Monitor for ADHD, 383 Communication disorders, 400
Parent Rating Scale (PRS), 376 Comprehension, 173
Self-Report of Personality (SRP), 383-387 Co-normed assessments, 338, 352
Teacher Rating Scale (TRS), 376 Concurrent studies, 133
Behavior rating scales, 375-376 Confidence intervals, 115—117
Bell-shaped curve, 66 calculating, 115-116
Best-answer format, 196 reporting, 116
Beta weight, 136 Conners Rating Scales—Revised (CRS-R),
Bias(es) 381
content, 438 Construct, 9, 107
controversy of, 425-431 Construct-irrelevant variance, 124-125
cultural, 424, 436 Construct underrepresentation, 124
examiner and language, 433 Content
in relation to variables inappropriate, 432
external to test, 442-447 style and grammar, 224-225
internal to test, 440-442 Content-based standards, 81
mean difference definition of test, 437 Content coverage, 130
past and present concerns about, 425 Content heterogeneity, 100-101
personal, 261 Contrasted group study, 139
prediction, 442 Convergent evidence, 138
test, 424 Correct-answer format, 196
in test content, 437-440 Correction for guessing, 209
Binet, Alfred, 72, 333 Correlation
Binet-Simon Scale, 333 versus causality, 56
Bloom’s taxonomy, 173-175 item-total, 153
Bookmark procedure, 81 point-biserial, 154
Correlation coefficients, 51-56
California Achievement Test/5 (CAT/5), 108, item-total, 153-155
305 negative correlation, 52
Cattell-Horn-Carroll (CHC) theory, 352 Pearson formula for, 55
Ceiling level, 111 point-biserial, 54, 154
Checklists, 261 positive correlation, 52
Child Behavior Checklist (CBCL), 381-382 relation to prediction, 54
Children’s Apperception Test (CAT), 391 Spearman rank, 54
Chronological age, 72 types of, 54
Classic test theory, 92 Counseling decisions, 21
Classification decisions, 21 Criterion, 132
Code of Fair Testing Practices in Education selecting, 134
(JCTP, 1988), 14, 451 Criterion contamination, 134-135
Code of Professional Responsibilities in Criterion-referenced interpretation, 79-80
Educational Measurement (NCME, Cronbach’s alpha, 100-101
1995), 14, 451 CTB McGraw-Hill, 305, 310
Coefficient Cue, 201
alpha, 101, 103 Cultural loading, 436
of determination, 52 Cultural test bias hypothesis, 422
half-test, 99 Culture-free tests, 436
Cognitive Abilities Test (CogAT), 341-343 Cut score, 80 ‘

D, 152-156 handwriting, grammar, and spelling, 232


Deaf-blindness, 402 order, 232, 262
Decision-theory models, 137 Egalitarian fallacy, 437
Developmental delay, 402-403 Elementary and Secondary Education Act of
Diana y. State Board of Education, 426 1965 (ESEA), 15
Differential item functioning (DIF), 438 Emotional disturbance, 400-401
Direct-question format, 196, 237 Environment x genetic interaction model, 429
Discouraging cheating, 192 Error(s), 10
Discriminant evidence, 138 administrative, 95
based on consequences of testing, 140-141 central tendency, 261
based on internal structure, 139-140 clerical, 95
based on relations to other variables, content sampling, 94
132-139 leniency, 261
based on response processes, 140 logical, 261
based on test content, 129-132 of measurement, 91—95
Distracter analysis, 157—158 score, 92
Distracters, 196 scoring, 95
Distributions, 38-42, 45 severity, 261
bimodal, 45 time sampling, 94-95
frequency, 38-41 variance, 95
graphs, 40 Evaluation, 174
grouped, 39 formal, 281
ungrouped, 38 formative, 20, 278, 281-282
negatively skewed, 41 informal, 281
normal, 42, 66-69, 115-116 portfolios, 270
positively skewed, 41 of products and processes, 267
symmetrical, 41 summative, 19, 279, 281-282
Domain(s), 171-177 Exner Comprehensive System, 391
affective, 175-176 Extended-response essays, 228
cognitive, 172-175
psychomotor, 176-177 Factor analysis, 139-140
Draw-A-Person Test (DAP), 389 comparative, 441
Fairness, 429-430
E-rater, 238 Family Educational Rights and Privacy Act
Education of All Handicapped Children Act of (FERPA), 19, 295-296
1975 (EAHCA), 16, 397 Feedback and evaluation, 278-282
Educational accountability, 16, 27 “Flags,” 415, 417
Educational objectives, 170-179 Floor level, 111
behavioral versus nonbehavioral, 177-178 “Flynn Effect,” 64—65
characteristics of, 171-172 Format, 171, 177-178
taxonomy of, 172-177 Frame of reference, 285
writing, 178-179 Free appropriate public education
Educational Testing Service (ETS), 415 (FAPE), 16
Education Week, 456
Effect(s) Galton, Sir Francis, 373
content indeterminancy, 231 Gauss, Carl Frederich, 66-67
expectancy, 232, 460 Gaussian curve, 66
fatigue, 232 Generalizability theory, 93
halo, 261 Grade book programs, 294

Grade equivalents, 78 controversies over, 334-336


limitations of, 78-79 in the courtroom, 426
Grades, 278 Intelligent Essay Assessor, 238
basis for, 284-285 IntelliMetric, 238
benefits of, 279 Interpolation, 78
combining, 290 Interpretation, 83-85
deleting Ds, 284 criterion-referenced vs. norm-referenced,
history of, 280 83-85
informing students of, 295 Inter-rater agreement, 104, 263
letter, 282-283 Inter-rater differences, 95
limitations of, 279 Inter-rater inconsistency, 231
numerical, 283 Intra-rater inconsistency, 231
pass—fail, 283 InView, 341
verbal descriptors, 283 Iowa Tests of Basic Skills (ITBS), 306
Grading, 284-290 Iowa Tests of Educational Development
absolute, 287—290 (ITED), 306
criterion-referenced, 287—290 Item analysis, 148
norm-referenced, 285-287 of performance assessments, 163-164
and punishment, 286 practical strategies for teachers, 159-160
relative, 285—287 qualitative, 164-165
Graduate Record Examination (GRE), 6 of speed tests, 156-157
Gray Oral Reading Test—Fourth Edition using to improve classroom instruction,
(GORT-4), 326 165, 167
Guidance decisions, 21 using to improve items, 161-163
Item banks, 166
Harcourt Assessments, Inc., 306, 310 Item(s)
Hearing impairments, 401 constructed-response, 154, 183, 185-186,
Holistic standards, 81 195
Homogeneity of regression, 443 difficulty, 148-150
Homogeneous content, 216 index, 149
House-Tree-Person (H-T-P), 389 influence of distracters, 158-159
optimal level, 149
Inclusion or mainstreaming, 398 on power tests, 151
Incomplete-sentence format, 196, 237 special assessment situations, 150
Individualized educational program (IEP), 17 with selected-response items, 149
398, 459 discrimination, 150-156
committee, 398 index, 151-153
Individuals with Disabilities Education Act influence of distracters on, 158-159
(IDEA), 16, 397 on mastery tests, 155-156
categories of disabilities, 399-403 essay, 224-238
IDEA 1997, 397 automated scoring systems, 238
IDEA 2004, 16, 371, 397 guidelines for developing, 229-230
Inequitable social consequences, 433 guidelines for scoring, 233-237
Infants and Toddlers with Disabilities levels of complexity of, 226-227
Act, 397 purposes of, 224-226
Inkblot techniques, 391-392 restricted-response versus extended-
Intelligence, 333 response, 228
sex differences, 423 strengths and weaknesses of, 230-233
Intelligence quotient (IQ), 333 item-total correlation coefficients, 153

mapping strategy, 81 Linear regression, 54, 135


matching, 215-219 Linear transformation, 69
guidelines for developing, 216-218
premises, 215 Mainstreaming, 17
responses, 215 Marks, 278
strengths and weaknesses of, 218-219 Marshall v. Georgia, 426-427
multiple true—false, 201 Mastery teaching, 80
multiple-choice, 196-210 Mastery testing, 112-113
and creative students, 208 Measurement, 3, 34
changing your answer, 220 Measurement error, 91—95
guidelines for developing, 197-206 sources of, 92—95
strengths and weaknesses of, 206-210 Measures of central tendency, 42-47
objective, 183 choosing between, 45—47
relevance, 130 mean, 43
response theory (IRT), 81, 438 median, 43-44
restricted-response, 221 mode, 44-45
selected-response, 183-185, 195 Measures of variability, 47-51
selecting type of, 183-186 choosing between, 49-51
short-answer, 237-242 range, 47-48
formats, 237 standard deviation, 48-49
guidelines for developing, 239-240 variance, 49
strengths and weaknesses of, 241—242 Mental age, 72
subjective, 183 Mental Measurements Yearbook (MMY),
total correlation coefficients, 153 310-311, 455
true—false, 211-215 Mental retardation, 400
guidelines for developing, 212-213 Meta-analysis, 138
multiple true—false, 211 Method variance, 139
strengths and weaknesses of, 213-215 Modifications
true—false with correction, 212 of presentation format, 405
of response format, 405-406
Joint Committee on Standards for Educational of setting, 407-408
Evaluation, 246 of timing, 406-407
Joint Committee on Testing Practices, 452, 459 Multiple disabilities, 401
Multitrait-multimethod matrix, 138-139
KeyMath—Revised/NU: Diagnostic Inventory
of Essential Mathematics—Normative Naglieri Nonverbal Ability Test—
Update (KeyMath R/NU), 327 Multilevel Form (NNAT—
Kinetic Family Drawing (KFD), 389 Multilevel Form), 332
Knowledge, 173 Nation Under Risk: The Imperative for
Krathwohl’s Taxonomy, 175-176 Educational Reform, 301
Kuder-Richardson formula, 101-102, National Assessment of Educational Progress
117-118 (NAEP), 302-303
quick way to calculate, 118 National Commission on Excellence in
Kuder-Richardson reliability, 100 Education, 301
National Council on Measurement in
Language disorders, 400 Education, 22, 451
Larry P. v Riles, 425-427 National Education Association, 22
Learning disabilities, 399-400 National testing programs, 304
Leniency error, 261 “Nation’s Report Card,’ 303

Nature—nurture controversy, 437 learning-progress, 270


Negative discrimination values, 152 representative, 269
Negative suggestion effect, 215 Prediction, 54
No Child Left Behind Act, 15-17, 302 Predictive studies, 133
support and criticism, 17 Preparation practices, inappropriate, 318
Nonstandard administration flags, 415, 417 Primary Test of Cognitive Skills (PTCS), 340
Nontest behaviors, 11, 13 Principia Products, 159
Normal curve, 116 Projective drawings, 389
Normal curve equivalent (NCE), 76 Projective techniques, 388-389
Normative data, 62-63 apperception tests, 391 (see also tests)
Norms, 63-64 debate over use, 390
“Norms tables,” 64 drawings, 389
inkblot techniques, 391-392
Objectives proponents and critics, 388
affective, 175 sentence completion tests, 390
behavioral, 177 (see also tests)
cognitive, 172 Protection of Pupil Rights Act (PPRA), 19
nonbehavioral, 177 Psychometrics, 4
psychomotor, 176 Public Law 94-142 / IDEA, 371, 397
Odd-even approach, 99 Public Law 99-457, 397
Optimal item difficulty level, 149
Oral testing, 223 Qualitative approach, 164
Orthopedic impairments, 401 Qualitative descriptions, 85-86
Other Health Impaired (OHD), 401 Quantitative approach, 164
Otis-Lennon School Ability Test, 8th Edition
(OLSAT 8), 332, 341 re52
Range restriction, 111
p, 149 Rating scales, 258
PASE vy. Hannon, 425-427 behavior, 375
Parent conferences, 295 omnibus, 382
Parent Monitor Ratings, 383 single-domain, 383
Pearson, Karl, 51 syndrome-specific, 383
Percentile rank, 77 strengths and limitations, 375-376
Performance Assessments for ITBS, ITED, Ratio IQ, 72
and TAP, 309 Reading First Initiative, 16
Performance-based standards, 81 Reading Proficiency Test in English (RPTE)
Performance tasks, 252 313
Personality, 4, 6, 371-372 Reducing test anxiety, 190
assessment in schools, 373 Referenced groups, 63
Placement decisions, 21 Regression
Policy decisions, 21 equation, 136
Population parameters, 44 intercepts, 444
Portfolio(s), 185, 268-273 linear, 54
assessments, 268-273 slopes, 444
guidelines for developing, 269-271 Reliability, 4, 91
strengths and weaknesses of, 271-272 alternate form, 96, 98
best work, 269 coefficient, 95
evaluation, 270 evaluating, 107-109
growth, 270 selecting, 105-106

  decay, 262
  improving, 109-111
  internal consistency, 96, 98-101
  inter-rater, 101
  issues in performance assessments, 263
  methods of estimating, 95-109
  relationship to validity, 125-126
  special problems in estimating, 111-112
  speed tests, 111
  split-half, 99
  test-retest, 96-98
Response processes, 140, 143
Response sets, 207, 372, 374
Response to Intervention (RTI), 339
Reynolds Intellectual Assessment Scales (RIAS), 332, 350
Rights and Responsibilities of Test Takers: Guidelines and Expectations (JCTP, 1998), 15, 451, 459
Riverside Publishing, 306, 310
Roberts Apperception Test for Children (RATC), 391
Rorschach, 391
Roid, Gale, 72
Rubrics, 233-236
  analytic, 234, 257
  holistic, 234, 257
  scoring, 256-257

Sample statistics, 44
Scale, 4
Scales of measurement, 34-38
  interval scales, 36
  mathematical operations, 39
  nominal scales, 35
  ordinal scales, 35
  ratio scales, 36-38
  statistics, 39
Scatterplots, 52-54
Scholastic Assessment Test (SAT), 6-7, 133, 366-367, 445
Scope, 171
Score(s), 7-9, 278
  College Entrance Examination Board (CEEB), 71
  composite, 103, 290
  criterion-referenced, 8, 63, 182
  cut, 9
  error, 92
  interpretations, types of, 8-9
  normalized standard, 73-76
  norm-referenced, 8, 62, 182
  obtained, 92
  raw, 62
  scaled, 69
  standard, 69
  standardized, 321
  stanine, 76
  T-, 71
  true, 92
  variance, 95
  Wechsler IQ, 71
  Wechsler scaled, 76
  z-, 70
Section 504 of the Rehabilitation Act (504) of 1973, 17-19, 397, 403
  decline in number of students, 18
Selection decisions, 20
Selection ratio, 135
Self-report measures, 383-388
Severity error, 261
Sigma, 49
Simon, Theodore, 72, 333
Spearman-Brown formula, 99
Speech disorders, 400
Standard deviation of a percent, 263
Standard error of estimate, 135
Standard error of kappa, 265
Standard error of measurement, 112-117
  evaluating the, 112-117
Standardization, inappropriate, 433
Standardization sample, 7, 64
Standardized administration, 65-66
Standardized test administration, 321
Standard score formula, 69-70
Standards-based interpretation, 80-83
Standards for Education and Psychological Testing (AERA et al., 1999), 14, 91, 124, 126-127, 140-141, 246, 396, 412, 414-415, 424, 435-437, 451-452, 463
Standards for Teacher Competence in Educational Assessment of Students, 22
Stanford Achievement Test Series, Tenth Edition (Stanford 10), 306
Stanford-Binet intelligence scales, 71, 333
Stanford-Binet Intelligence Scales, Fifth Edition (SB5), 85, 332, 349
Stanford Diagnostic Mathematics Test, Fourth Edition (SDMT 4), 310
Stanford Diagnostic Reading Test, Fourth Edition (SDRT 4), 310
State Developed Alternative Assessment (SDAA), 313, 413
Statewide testing programs, 310-315
Statistical significance, 52
Stems, 196
Storytelling techniques, 391
Student cheating, 192, 465
Student Evaluation Standards (JCSEE, 2003), 451
Students with disabilities, 397-403
Synthesis, 174

Table of specifications, 129
  development of, 179-181
  implementation of, 181-189
Teacher cheating, 458
Teacher Monitor Ratings (TMR), 383
Teacher Report Form (TRF), 381-382
Teaching to the test, 16-17, 27, 318
Technology and assessment, 25-26
Terman, Louis, 333
TerraNova Comprehensive Tests of Basic Skills (CTBS), 305
TerraNova The Second Edition (CAT/6), 305
Testing Office of the American Psychological Association Science Directorate, 455-456
Tests of Achievement and Proficiency (TAP), 306
Test-criterion evidence, 133
Test Reviews Online, 456
Test Critiques, 311, 456
Tests, 311, 455
Test(s), 3-8
  achievement, 5, 300, 331
    group, 304-324
    group-administered, 302
    selecting a battery, 327
  apperception, 391
  aptitude, 5, 331
    group, 340-343
    individual, 343-350
    major types of, 340-350
    selecting, 350-353
    the use of in schools, 336-338
  blueprint, 179
  college admissions, 366-367
  computer adaptive, 25-26
  consequences of, 140-141
  culture-free, 436
  developing in a statewide testing environment, 182-183
  diagnostic achievement, 310
  fairness, 144
  group, 7-8
  high-stakes, 27
    effect on teachers, 28
  individual, 7-8, 324-327
  individually administered, 304
  intelligence, 333-353
    group, 340-343
    individual, 343-350
    major types of, 340-350
    sample computer-generated report, 354-365
    selecting, 350-353
    the use of in schools, 336-338
    understanding the report of, 353-366
  maximum performance, 4-6
    objective and subjective, 6
  nonstandardized, 7
  objections to use with minority students, 432-435
  objective, 6
  objective personality, 6
  power, 5, 151, 189
  preparation practices for, 318-321
  projective, 7
  projective personality, 7
  scores, 62
    criterion-referenced interpretations, 63, 79-83
    norm-referenced interpretations, 63-79
    qualitative description of, 85-86
  security, 456
  sentence completion, 390-391
  speed, 5, 189
  standardized, 7, 300
    best practices in using in schools, 318-323
    using to evaluate educational quality, 317
  state-developed achievement, 310-315
  teacher-made, 129
  types of, 4-8
  typical response, 4, 6-8, 371
  using only a portion of, 409
Testing
  computerized adaptive, 25
  high stakes, 27
  mastery, 9
  oral, 223-224
  upset students, 323
Tests in Print (TIP), 311, 455
Tests of Achievement and Proficiency (TAP), 306
Tests of Cognitive Skills, Second Edition (TCS/2), 332, 340
Test taker(s), 15
  rights of, 15
  responsibilities of, 463-465
Texas Assessment of Knowledge and Skills (TAKS), 312, 413
Texas Essential Knowledge and Skills (TEKS), 313
Texas Student Assessment Program, 413
Thematic Apperception Test (TAT), 391
Thematic techniques, 391
Traumatic brain injury, 402
True score theory, 92
True-false with correction format, 212

Underrepresentation, 124

Validity, 4, 124
  argument, 141-142
  categories of evidence, 127-128
  coefficients, 133
  construct, 127
  content, 126-127
  criterion-related, 127, 132
  differential, 435-436
  differential predictive, 433
  evidence, types of, 129-143
  external influences of, 125
  face, 131-132
  generalization, 138
  practical strategies for improving, 143-144
  relationship to reliability, 125-126
  scale, 373
  threats to, 124-125
  types of, 126-127
  as a unitary concept, 127
Verbal descriptors, 283
Visual impairments, 402

Wechsler, David, 72
Wechsler Adult Intelligence Scale-Third Edition, 103, 108
Wechsler Individual Achievement Test-Second Edition (WIAT-II), 64-65, 140, 324-325
Wechsler Intelligence Scale for Children-Fourth Edition (WISC-IV), 116, 138, 332, 344-349, 352, 455
Wechsler IQ, 71
Wechsler scaled scores, 76
Wide Range Achievement Test 3 (WRAT3), 326
Wonderlic Personnel Test, 439
Woodcock-Johnson III (WJ III) Test of Cognitive Abilities, 349-350, 352
Woodcock-Johnson III (WJ III) Tests of Achievement, 325-326
Woodworth, Robert, 372
Woodworth Personal Data Sheet, 372

Youth Self-Report (YSR), 387-388

Measurement and Assessment in Education, Second Edition, employs a
pragmatic approach to the study of educational tests and measurement
so that teachers will understand essential psychometric concepts and be
able to apply them in the classroom.
The principles that guide this text are:
• What essential knowledge and skills do classroom teachers need to conduct student assessments in a professional manner?
• maa YAV/a¥elime (OLo tpl rectors] cel amelamsvelUlerelulelarlmetssetscantclalan (eli MUlsr4 i
• Malis coretelcmar-tacct0l ico Maire MU aye [0(<1Narelo)elcer-(oarele)(em-Tare Mm ccrenlalcerlihy accurate presentation of the material.
While providing a slightly more technical presentation of measurement and assessment than more
basic texts, this text is both approachable and comprehensive. The text includes an introduction to
the basic mathematics of measurement, and expands traditional coverage to include a thorough
discussion of performance and portfolio assessments, a complete presentation of assessment
accommodations for students with disabilities, and a practical discussion of professional best
practices in educational measurement.

HIGHLIGHTS OF THIS TEXT


• This text is very user friendly, helping students to master the more technical aspects of educational assessment and gain a good understanding of the mathematical concepts needed to master measurement and assessment (Chapters 2-6).
• Ethical principles, legal issues, and professional standards relevant to classroom assessment are covered thoroughly so that students are prepared to conduct classroom assessments in a professional and ethical manner (throughout the text, but specifically in Chapter 17).
• An entire chapter (Chapter 15) is devoted to the use of assessments for students with disabilities, preparing students to assess the knowledge and skills of all students, including those with disabilities.
• Contemporary issues regarding the assessment of students are covered in detail so that students are aware of important issues related to educational assessment.
• Numerous pedagogical devices such as exercises, cases, and end-of-chapter problems are included throughout the text so that students can explore topics further.
• Audio-enhanced PowerPoint™ lectures featuring Dr. Victor Willson are particularly useful for student review and mastery of the material presented.
• A Test Bank is also available to instructors.

Merrill is an imprint of Pearson.

Cover Photograph: © Gabe Palmer/Corbis

For related titles and support materials, visit our online catalog at www.pearsonhighered.com
