Testing and Measurement
Acquisitions Editor: Lisa Cuevas Shaw
Editorial Assistant: Karen Wong
Production Editor: Laureen Shea
Copy Editor: Truman Sands, Print Matters, Inc.
Typesetter: C&M Digitals (P) Ltd.
Proofreader: Libby Larson
Indexer: Nara Wood
Cover Designer: Janet Foulger
Contents
List of Figures xi
List of Tables xii
A Note to Students xiii
Acknowledgments xv
10. Validity: What You See Is Not Always What You Get 141
Let's Check Your Understanding 142
Our Model Answers 143
Helping You Get What You See 143
Validation Groups 144
Criteria 144
Construct Underrepresentation 145
Construct-Irrelevant Variance 145
Let's Check Your Understanding 146
Our Model Answers 146
Sources of Validity Evidence 147
Evidence Based on Test Content 147
Evidence of Criterion-Related Validity 148
Evidence of Construct Validity 150
Let's Check Your Understanding 151
Our Model Answers 152
The Marriage of Reliability and Validity: Wedded Bliss 153
Interpreting the Validity of Tests: Intended and Unintended Consequences 154
Some Final Thoughts About Validity 154
Key Terms 155
Models and Self-instructional Exercises 155
Our Model 155
Our Model Answers 157
Now It's Your Turn 158
Our Model Answers 160
Words of Encouragement 162
11. The Perils and Pitfalls of Testing: Being Ethical 163
Your Own Competence 163
Rights of Those Being Tested 164
Potential Dangers 165
Ryan's Rights 166
Appendix 167
References 171
Index 173
About the Authors 183
A Note to Students
Believe it or not, you are about to embark on a
wonderful adventure. The world of testing and
measurement offers you insight into those numbers
and scores and tests that have been part of your
life since you were a small child. In this user-friendly guide, we're
going to introduce you to the basic concepts of
measurement and testing in a way we hope will sometimes
make you smile or even laugh. We even made ourselves
chuckle as we tried to create examples that you can
relate to.
Our goal is to help you understand tests and their
scores. A test is a sample of behavior or characteristic
at a given point in time. We know your behavior has been
sampled many times in many classes and that numbers or
scores have been used to describe your knowledge,
aptitudes, behaviors, attitudes, and even your
personality characteristics. This book will help you make
sense of these numbers and scores and what is needed to
have a strong, good test.
This workbook style textbook is not the end-all in what
there is to know about measurement and testing. Instead, we
are trying to give you founda- tional information that you
and your professors can build on. If you read the material
carefully, take the time to complete the Let's
Check Your Understanding quizzes, and work through
Our Model Answers with us, you should be able to master
the content covered.
We have tried to present the material in this text in
the most user-friendly way we know. Some of our examples
may seem corny, but please indulge us. We wanted to make
your learning as enjoyable as possible. The yellow brick
road awaits you. Have courage, be smart, and open your
heart to learning. We are the good witches and we'll be
helping you on your journey.
Acknowledgments
First, we would like to thank the multiple reviewers
who gave us invaluable suggestions that helped us
clarify our ideas and strengthen our final product.
Next, we would like to express our gratitude to
Dr. Joanna Gorin for reading many of the chapters and
providing us with insightful feedback that we used to
rethink how we were presenting some major concepts.
Her ideas and expertise in measurement made a
significant positive contribution to this final work.
Joanna, we truly appreciate the time and energy you
took to assist us in this project.
We would like to thank Lisa Cuevas Shaw for seeing and
appreciating our approach to teaching testing and
measurement concepts. She constantly supported us to be
creative and to play with ways of making this informa-
tion fun and interesting for students. Lisa, thank you
for believing in us.
Finally, we want to express our gratitude to Jason Love
for lending us his wonderful artistic talent by creating
the cartoons for this book and to Karen Wong, Laureen
Shea, and the staff at Sage for their wonderful
assistance in getting this book published. Thanks to all
of you.
Lest we forget, we want to thank the many students
who over the years appreciated and learned from our
humorous approach to teaching testing and measurement.
They reinforced our belief that this material didn't have
to be dull or boring.
SRK
MES
The contributions of the following reviewers are gratefully acknowledged.
CHAPTER 1
What Is a Number? Is a Rose Always a Rose?
Remember when you first learned nursery rhymes
such as "Three Blind Mice" or watched Count
Dracula on Sesame Street? As a child, you probably
helped the person reading to you count each of
the
three mice or followed the Count as he numbered
everything in sight. You were learning numbers. Do you
remember holding up your little fingers when someone
asked you how old you were? At a very early age, you
were expected to begin to understand numbers and what
they meant. Numbers are a part of everyone's life. We
all use them without thinking about their meaning. In
this chapter, we're going to talk about types of
numbers (or scales) and how they can be used in
measurement. Measurement is a way to give meaning to
numbers.
[Examples of response formats: a five-point continuous scale anchored Strongly Disagree (1), Disagree (2), Neutral (3), Agree (4), Strongly Agree (5); the same anchors numbered 0 through 4; and a four-point version without the Neutral option, numbered 1 through 4. The scale labels Ordinal, Interval, Ratio, and Dichotomous Responses also appear here.]
Key Terms
To help you review the information presented in this
chapter, you need to understand and be able to explain
the following concepts. If you are not sure, look back
and reread.
Test
Measurement scales
Nominal scale
Ordinal scale
Interval scale
Ratio scale
Response format
Dichotomous response format (natural or forced)
Continuous response format
Anchors
Total score
Our Model
In the first chapter, you've learned about some basic measurement principles:
measurement scales and response formats. It's time to step up your learning
another level. Don't worry; it's only a small step. Besides that, we're right
here to catch you and bolster your learning.
The new topic is frequencies and frequency distributions. The chapter title,
"One Potato, Two Potato, Three Potato, Four," reflects frequency data.
The variable being examined is potatoes, and we have
four of them (the frequency count). Now, we know you're
not interested in potatoes, but what if you were
counting money? Imagine that you just discovered a
chest full of currency in your grandmother's
basement. Wouldn't you like to know how much is
there? An easy way to find out is to divide the bills
by denomination (i.e., $1, $5, $10, $20, $50, and
$100) and count how many of each you have. Surprise,
you have just done the rudiments of creating a
frequency distribution, a simple method of organizing
data.
A frequency distribution presents scores (X) and how
many times (f ) each score was obtained. Relative to
tests and measurement, frequency distributions present
scores and how many individuals received each score.
There are three types of frequency distributions: ungrouped, grouped, and cumulative.
Ungrouped Frequency Distributions
An ungrouped frequency distribution, most often referred
to as just plain old frequency distribution, lists every
possible score individually. The scores, which are
typically designated by a capital X, are listed in
numerical order. In a column paralleling the X scores is
a column that indicates how many people got the
corresponding X score. In this column, which is designated by a lowercase f, you
record how many people obtained each score.
Score (X)   f        Score (X)   f
100         1        77
99                   76
98          1        75          1
97                   74
96                   73          1
95          1        72
94                   71          1
93          2        70
92                   69
91                   68
90          2        67
89          1        66
88                   65
87          2        64
86          1        63
85          3        62
84                   61
83          1        60
82                   59
81          2        58
80          2        57
79          1        56          1
78          1
                     N = 25
Midterm Grades

             Frequency   Percent   Valid Percent   Cumulative Percent
Valid  100   1           4.0       4.0             4.0
       98    1           4.0       4.0             8.0
       95    1           4.0       4.0             12.0
       93    2           8.0       8.0             20.0
       90    2           8.0       8.0             28.0
       89    1           4.0       4.0             32.0
       87    2           8.0       8.0             40.0
       86    1           4.0       4.0             44.0
       85    3           12.0      12.0            56.0
       83    1           4.0       4.0             60.0
       81    2           8.0       8.0             68.0
       80    2           8.0       8.0             76.0
       79    1           4.0       4.0             80.0
       78    1           4.0       4.0             84.0
       75    1           4.0       4.0             88.0
       73    1           4.0       4.0             92.0
       71    1           4.0       4.0             96.0
       56    1           4.0       4.0             100.0
Class Interval   f
98–100           2
95–97            1
92–94            2
89–91            3
86–88            3
83–85            4
80–82            4
77–79            2
74–76            1
71–73            2
68–70
65–67
62–64
59–61
56–58            1
N = 25    i = 3
b. i
c. UL
d. LL
4. As a rule of thumb, how many class intervals are
typically created for a set of data?
Class Interval   f    cf
98–100           2    25
95–97            1    23
92–94            2    22
89–91            3    20
86–88            3    17
83–85            4    14
80–82            4    10
77–79            2    6
74–76            1    4
71–73            2    3
68–70                 1
65–67                 1
62–64                 1
59–61                 1
56–58            1    1
N = 25    i = 3
Hopefully you can see that the cumulative frequency is
arrived at simply by adding the numbers in the
corresponding frequency column (f) to the cumulative
frequency (cf ) below. For example, if you wanted to know
the cumulative frequency of the class interval 89–91, you would add the three
people in 89–91 to the 17 who
scored below that interval. Your answer is 20. But what
does this 20 mean? It means that 20 people scored at or
below 91 (which is the largest number in that class
interval) on their midterm exam. The cf for the 74–76 class interval is arrived at
by summing the 1 person (f) in this interval with the 3 in the cf column below this
interval. The resulting cf is 4. Four people scored at or below 76 (the highest
number in the 74–76 class
interval).
Now, you tell us, how did we arrive at a cumulative
frequency of 14 for the 83–85 class interval? . . . THAT'S RIGHT! First, we looked
at the frequency for the class interval 83–85 and found that four people scored
between 83 and 85. Then we looked at the cf for the class interval just below the
83–85 class interval. The cf was 10. To determine the cf in the 83–85 class
interval, we added the f for the 83–85 class interval (f = 4) to the cf of 10 in
the class interval below it. And we arrive at an answer of 14.
OK, let's do another example using the data in Table 2.4. How did we arrive at the
cf of 1 for the 59–61 class interval? . . . If you said to yourself, "No one scored
in this interval," you're right! If we had placed an f value next to this class
interval, it would have been a 0. But the accepted convention is not to put 0
frequencies into the table. This leaves you with the task of remembering that an
empty f means 0 and then you have to add this 0 to the cf below. So, mentally, we
added 0 to 1 to arrive at the cf of 1 for the 59–61 class interval. As a matter of
fact, we kept adding 0 to the cf of 1 until we reached the class interval of 71–73
that actually had two people represented in its f column.
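To see the "add this interval's f to the cf below it" rule all at once, here is a short Python sketch (our own illustration) that rebuilds the cf column of Table 2.4 from the frequencies alone, with the empty f cells entered as 0.

```python
# Class intervals (low, high, f) from Table 2.4, listed from lowest to highest.
intervals = [(56, 58, 1), (59, 61, 0), (62, 64, 0), (65, 67, 0), (68, 70, 0),
             (71, 73, 2), (74, 76, 1), (77, 79, 2), (80, 82, 4), (83, 85, 4),
             (86, 88, 3), (89, 91, 3), (92, 94, 2), (95, 97, 1), (98, 100, 2)]

running_total = 0
for low, high, f in intervals:
    running_total += f          # cf = this interval's f plus every cf below it
    print(f"{low}-{high}: f = {f}, cf = {running_total}")
# The last line should show cf = 25, the N for the whole class.
```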
Key Terms
To help you review the information presented in this
chapter, you need to understand and be able to explain
the following concepts. If you are not sure, look back
and reread.
Frequency
Score
Frequency distributions
Ungrouped
Grouped
Cumulative
Class interval
Upper limit of a class interval (UL)
Lower limit of a class interval (LL)
Width
Mutually exclusive
Our Model
36 86 104 83 56 62 69 77
92 39 110 80 58 84 74 80
52 54 93 60 88 67 46 72
53 68 73 73 81 62 99 69
65 80 71 79 49 78 64 85
110 88 81 78 72 67 60 52
104 86 80 77 71 65 58 49
99 85 80 74 69 64 56 46
93 84 80 73 69 62 54 39
92 83 79 73 68 62 53 36
N = _____    i = _____

i = (110 − 36 + 1) / 15 = 75 / 15 = 5
Class Interval   f    cf
106–110          1    40
101–105          1    39
96–100           1    38
91–95            2    37
86–90            2    35
81–85            4    33
76–80            6    29
71–75            5    23
66–70            4    18
61–65            4    14
56–60            3    10
51–55            3    7
46–50            2    4
41–45                 2
36–40            2    2
N = 40    i = 5
24 15 24 15 18 24 11 23
16 25 16 29 17 25 20 26
29 23 13 20 21 22 21 20
27 27 18 21 25 20 18 28
Class Interval   f    cf
N = _____    i = _____
3. Determine the following:
a. How many freshmen perceived that they have a poorer social support system than Ryan?
30 27 25 23 21 20 18 16
29 27 24 23 21 20 18 15
29 26 24 22 21 20 18 15
28 25 24 22 21 20 17 13
27 25 24 22 20 19 16 11
Class Interval   f    cf
29–30            3    40
27–28            4    37
25–26            4    32
23–24            6    28
21–22            7    23
19–20            6    16
17–18            4    10
15–16            4    6
13–14            1    2
11–12            1    1
N = 40    i = 2
The Distribution of Test Scores: The Perfect Body?
As we noted in Chapter 2, frequency distributions give you a bird's-eye view of
scores and allow you to make simple comparisons among the people who took your
test. The problem with frequency distributions, however, is that they don't allow
you to make really precise comparisons. To help you become slightly more
sophisticated in your ability to compare test scores across people, we're going to
teach you about frequency curves.
The first step in getting more meaning out of a
frequency distribution is to graph it. This graph is
called a frequency curve. A graph is created by two
axes. Along the horizontal axis (x-axis), the score
values are presented in ascending order. Along the
vertical axis (y-axis), the frequency of people who
could have gotten each score appears. Zero is the
point where the y-axis intersects with the x-axis. For
group frequency data, the possible intervals are
listed on the x-axis. Although the midpoint score is
typically used to represent the entire interval, in
order not to confuse you, in our examples we will list
the exact scores obtained within each interval.
Frequency curves can take on an unlimited number of
shapes, but in measurement we most often assume that scores are distributed in a
bell-shaped normal curve: the perfect body! This body is symmetrical. When you
draw it, you can fold it in half and one side will be a
reflection of the other.
Kurtosis
The normal curve also possesses a quality called
kurtosis (no, that's not bad breath). Kurtosis is one aspect of how scores are
distributed: how flat or how peaked. There are three forms of a normal curve
depending on the distribution, or kurtosis, of scores. These forms are referred to
as mesokurtic, leptokurtic, and platykurtic.
Skewness
When one side of a curve is longer than the other, we
have a skewed distribution (see Figure 3.2). What this really means is that a few
people's scores are much higher or much lower than the rest of the scores.

[Figure 3.2: Negatively and positively skewed distributions]
Score (X)   f        Score (X)   f
100         1        77
99                   76
98          1        75          1
97                   74
96                   73          1
95          1        72
94                   71          1
93          2        70
92                   69
91                   68
90          2        67
89          1        66
88                   65
87          2        64
86          1        63
85          3        62
84                   61
83          1        60
82                   59
81          2        58
80          2        57
79          1        56          1
78          1
                     N = 25
Do you see any extreme scores that are much higher or lower than the
rest of the scores? The correct answer is . . . Yes!
One person scored really low with a score of only 56.
The tail of this curve is pulled to the left and is
pointing to this low score. This means that the curve
is skewed in the negative direction.
Just by looking at this frequency distribution, you can
conclude that the students were very spread out in their
midterm grades and one person scored very differently
(much worse) than the rest of the class. You should
arrive at these same conclusions if you examine the
grouped frequency distribution of the same test scores in Table 2.3 on page 23.
All of the scores but one are somewhat evenly distributed across the intervals
from 71–73 to 98–100. That one score in the 56–58 interval causes this platykurtic
distribution to be negatively
skewed.
Although eyeballing is a quick and dirty way of looking
at your scores, a more accurate way is to graph the
actual scores. As we told you at the beginning of this chapter, on the horizontal
axis, typically called the x-axis, you list the possible scores in ascending order. If
you are working with grouped frequency data, you list the
intervals (see Figure 3.3). When you create a graph,
however, let the midpoint of the interval represent the
entire interval. On the vertical axis, typically called
the y-axis, you list the frequencies in ascending order
that represent the number of people who might have earned
a score in each interval.
[Figure 3.3: Histogram of the midterm grades, with class intervals from 56–58 through 98–100 on the x-axis and frequencies from 0 to 5 on the y-axis]
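If you want a quick, no-frills version of such a graph, a few lines of Python can print a sideways histogram of the grouped midterm grades, one asterisk per person. This is only a sketch of the idea, not a substitute for a proper figure.

```python
# A text-only "histogram" of the grouped midterm grades, mirroring Figure 3.3.
grouped = [("56-58", 1), ("59-61", 0), ("62-64", 0), ("65-67", 0), ("68-70", 0),
           ("71-73", 2), ("74-76", 1), ("77-79", 2), ("80-82", 4), ("83-85", 4),
           ("86-88", 3), ("89-91", 3), ("92-94", 2), ("95-97", 1), ("98-100", 2)]

for label, f in grouped:
    print(f"{label:>7} | {'*' * f}")   # one asterisk per person in the interval
```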
Key Terms
Let's see how well you understand the concepts we've presented in this chapter. Test your
understanding by explaining the following concepts. If
you are not sure, look back and reread.
Frequency curve
Kurtosis
Mesokurtic
Leptokurtic
Platykurtic
Skewness
Positively skewed
Negatively skewed
Graphs
x-axis
y-axis
[Figure: Frequency curve for the stress scores, with each obtained score listed inside its class interval (36–40 through 106–110) and frequencies from 0 to 8 on the y-axis]
Our Model Answers
1. What does the y-axis represent?
The frequency of people who could have gotten scores within each of the intervals
appears next to the vertical axis. For our example, we used frequencies ranging
from 0 to 8.
2. What does the x-axis represent?
On the horizontal axis, we listed all of the possible class intervals.
3. What is the kurtosis of this frequency curve?
Although not perfectly symmetrical, this distribution could be considered a
normal, bell-shaped, mesokurtic curve.
4. Is this frequency curve skewed?
It is not skewed because no scores are extremely larger or extremely smaller
than all of the other scores.
Class Interval   f    cf
29–30            3    40
27–28            4    37
25–26            4    32
23–24            6    28
21–22            7    23
19–20            6    16
17–18            4    10
15–16            4    6
13–14            1    2
11–12            1    1
N = 40    i = 2
[Figure: Histogram of the social support scores, with class intervals from 11–12 through 29–30 on the x-axis]
By now you might be thinking that we are pretty superficial people. In Chapter 3,
we defined the perfect body simply by the shape of the frequency curve, its
external appearance. If the curve was a person and we said it was beautiful
because of its shape, you'd
be right in calling us superficial. Since we definitely
are not superficial, we want to extol some of the internal
qualities that make the normal curve beautiful.
In measurement, internal qualities are reflected by the
data that created the curve, by the statistics resulting
from mathematically manipulating the data, and by the
inferences that we can draw from these statistics. When we
discussed scores, we were discussing data. Now it's time
to move on to introduce you to some simple statistics
related to the normal curve and measurement. On center
stage we have that world-famous trio, The Central
Tendencies.
The Mode
The first of these, the mode, is a rather simple,
straightforward character. Of all the central tendencies,
the mode is the most instantly recognizable. When you
look at a frequency curve, the score (or scores) obtained
by the largest number of people is the mode. Lets
consider the following set of scores (data) as a way to
introduce mode:
1, 5, 9, 9, 9, 15, 20, 31, 32, 32,
32, 32
110 88 81 78 72 67 60 52
104 86 80 77 71 65 58 49
99 85 80 74 69 64 56 46
93 84 80 73 69 62 54 39
92 83 79 73 68 62 53 36
Look at the grouped frequency data for these stress scores on the
next page.
1. What class interval has the highest frequency of obtained scores?
Class Interval   f    cf
106–110          1    40
101–105          1    39
96–100           1    38
91–95            2    37
86–90            2    35
81–85            4    33
76–80            6    29
71–75            5    23
66–70            4    18
61–65            4    14
56–60            3    10
51–55            3    7
46–50            2    4
41–45                 2
36–40            2    2
N = 40    i = 5
The Median
Now let's meet the second member of the trio: the median. At times this character
is in plain view and at times it hides and you have to look
between numbers to find it. The median is the score or
potential score in a distribution of scores that
divides the distribution of scores exactly in half. It
is like the median on a highwayhalf of the highway
is on one side and half on the other. As a matter of
fact this is a good analogy for remembering what a
median is. The median is the exact middle point, where
50% of the scores are higher than it and 50% are
lower.
In order to find the median, first you have to order
the scores by their numerical value, either in
ascending or descending order. We suggest using
ascending, since that is how most data are presented
in measurement. To find this score, divide the total
number of scores by 2. For example, if you have 40
scores, you would divide 40 by 2 and find that 20
scores are above and 20 scores are below the median. In
this case, the median would be a theoretical number between the 20th and 21st
scores. In fact, it would be the numerical average of the values of the 20th and
21st scores. If you have 45 scores, the median would be
the 23rd score. When 45 is divided by 2, the median
should be score 22.5 (rounded up to represent the
actual 23rd score). We know that 22 scores will be
below and 22 scores will be above the 23rd score. When
identifying the median, as well as the mode, the
actual values of the scores aren't really important.
You are just trying to find the middle
score after ordering scores by their numerical value.
Let's play with a set of scores.
2, 6, 6, 7, 8, 9, 9, 9, 10
4, 6, 6, 7, 7, 9, 9, 10
The Mean
It's time to introduce the leader of the pack, the one with the real clout when
you have interval or ratio data: the mean. Unlike the mode and the median, the
numerical values of the actual scores in a data set are essential when dealing
with the mean. The mean, symbolized by M, is a mathematically determined number.
It is the mathematical average
of all the scores in the data set.
Just in case you've forgotten how to calculate an average, here's a reminder. Add up all the scores
(remember that X stands for score) and then divide
this sum by the number of scores you added (N ). For
example, for the spelling test scores of 2, 6, 6, 7, 9,
9, and 10, the sum (which is represented by the symbol Σ)
of these is 49. When 49 is divided by 7 (the number of
scores), the result is 7. The mean score for this
spelling test is 7.
This 7 can be used to describe the average score that
this class of students obtained on their spelling test.
If we know nothing else about any student in this class,
we would guess that if he or she took this spelling test,
he or she would score a 7, the class mean. Do you see why
the mean is so important?! For those of you who like to
see formulas, this is the formula for a mean:
M = ΣX / N
10, 8, 6, 0, 8, 3, 2, 2, 8, 0
7. What is the ΣX?
The sum is 47.
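For readers who like to check their arithmetic in code, here is a minimal Python sketch for these ten scores; it is our illustration of the trio of central tendencies, nothing more.

```python
import statistics

scores = [10, 8, 6, 0, 8, 3, 2, 2, 8, 0]      # the ten scores from the exercise

print("sum  =", sum(scores))                   # 47, matching the model answer
print("mean =", statistics.mean(scores))       # 47 / 10 = 4.7
print("median =", statistics.median(scores))   # middle of the ordered scores: 4.5
print("mode =", statistics.mode(scores))       # the most frequent score: 8
```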
[Figure: Relative positions of the mode, median, and mean on a positively skewed curve]
Deviation Scores
Deviation scores also reflect the variation in a set of
scores. A deviation score (X − M) is the difference between any given score (X)
and the group mean (M). When you sum all of the deviation scores, the answer
should always be 0. This is symbolized by

Σ(X − M) = 0
Variance
Calculating deviation scores is the first step in
calculating variance. Since the sum of deviation scores
equals 0, in order to use them to describe our curves,
we square each deviation score before summing. We do
this because it makes all of the deviation values
positive and the sum becomes a number larger than 0.
This process is symbolized by: (X M)2, which is
called the sum of squares. If the sum of squares (squared
deviation scores) is divided by N 1, the resulting
measure of variability is the variance (denoted by s 2).
Variance is an abstract construct that reflects a
global variability in a set of scores. It becomes
useful when we want to compare dispersion across
different sets of scores. The formula for variance is
s² = Σ(X − M)² / (N − 1)

The standard deviation is the square root of the variance:

SD = √[Σ(X − M)² / (N − 1)]
[Figure: Normal curve showing that 34.13% of scores fall between the mean and one SD above it, and 34.13% between the mean and one SD below it]
3. Variance, symbolized by _____, is _____.
4. The square root of variance is the _____.
5. On a normal curve, _____% of the scores falls either above or below the mean.
Key Terms
Central tendency
Mean
Median
Mode
Dispersion
Range
Deviation scores
Variance
Standard deviation
Models and Self-instructional Exercises
Our Model
X     X − M     (X − M)²
3     −1        1
5     +1        1
5     +1        1
2     −2        4
6     +2        4
3     −1        1
ΣX = 24     Σ(X − M) = 0     Σ(X − M)² = 12
To get the deviation scores, we subtracted the mean from each individual score.
We inserted the deviation scores into the second column. When we added these
deviation scores, Σ(X − M), we got 0. To complete the table, we squared each
deviation score and inserted these values into the third column. The sum of the
squared deviation scores, Σ(X − M)², equals 12.
5. What is the variance?
To calculate the variance, you will use the formula

s² = Σ(X − M)² / (N − 1)

s² = 12 / 5 = 2.4

SD = √2.4
SD = 1.55
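You can verify this model answer with a few lines of Python; the standard library's statistics.variance() and statistics.stdev() use the same N − 1 formula shown above. This is simply an illustrative check.

```python
import statistics

scores = [3, 5, 5, 2, 6, 3]                        # the six scores from the model

mean = statistics.mean(scores)                      # 24 / 6 = 4
deviations = [x - mean for x in scores]             # these sum to 0
sum_of_squares = sum(d ** 2 for d in deviations)    # 12

variance = sum_of_squares / (len(scores) - 1)       # 12 / 5 = 2.4
sd = variance ** 0.5                                # square root of 2.4, about 1.55

print(sum(deviations), sum_of_squares, variance, round(sd, 2))
print(statistics.variance(scores), round(statistics.stdev(scores), 2))  # same results
```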
30 27 25 23 21 20 18 16
29 27 24 23 21 20 18 15
29 26 24 22 21 20 18 15
28 25 24 22 21 20 17 13
27 25 24 22 20 19 16 11
2. What is the N ?
4. What is the mean? (To make this easier for you, the
sum of scores equals 866.)
Standardized Scores: Do You Measure Up?
So far, we've been working with what is called raw data.
From this raw data or raw scores, we have been able
to create frequency distributions and frequency
curves and to examine measures of central
tendency
and of dispersion. These concepts are foundational to
testing and measurement. It's time, however, to start
learning more about scores themselves.
Often we want to compare people whom we assess with
people who are more representative of others in general.
Raw scores won't allow us to do this. With raw scores we
can only compare the people within the same group who
took the same test. We can't compare their scores to
people who were not in the group that we tested. This is
a dilemma! So what do we do? Guess what?! You transform
your raw scores to standard scores. When we standardize
scores, we can compare scores for different groups of
people and we can compare scores on different tests.
This chapter will reveal the secrets of four different
standard scores: Percentiles, Z scores, T scores, and IQ scores.
Aren't you glad? (If you aren't glad now, you will be when
you
take your next exam.)
The percentile rank (PR) of a score is the cumulative frequency (cf) of that score
(in this case, the one for which you are trying to find the percentile rank)
divided by the number (N) of people in the frequency distribution and then
multiplied by 100. The formula looks like this:

PR = (cf / N) × 100
X f cf
10 1 10
8 1 9
7 2 8
5 1 6
2 3 5
1 2 2
PR = (cf / N) × 100
PR = (5 / 10) × 100
PR = .5 × 100
PR = 50
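Here is the same formula as a one-line Python function, checked against the worked example just shown (cf = 5, N = 10); it is a sketch for your own checking, nothing fancier.

```python
def percentile_rank(cf, n):
    """PR = (cf / N) x 100: the percent of people scoring at or below a score."""
    return cf / n * 100

# From the small frequency table above, the score of 2 has cf = 5 and N = 10.
print(percentile_rank(5, 10))   # 50.0
```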
Interval   f    cf
25–29      5    30
20–24      5    25
15–19      10   20
10–14      2    10
5–9        5    8
0–4        3    3
PR = _____
PR = _____
PR = _____
PR = _____
PR = _____
PR = _____
PR = {[cf below + f((X − LL) / i)] / N} × 100

PR = {[8 + 2((14 − 9.5) / 5)] / 30} × 100
PR = {[8 + 2(4.5 / 5)] / 30} × 100
PR = {[8 + 2(.9)] / 30} × 100
PR = [(8 + 1.8) / 30] × 100
PR = (9.8 / 30) × 100
PR = .3267 × 100
PR = 32.67
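The same arithmetic, written as a small Python function and run on the worked values above (X = 14, LL = 9.5, i = 5, f = 2, cf below = 8, N = 30); the function name is just our own label for the sketch.

```python
def grouped_percentile_rank(x, lower_limit, width, f, cf_below, n):
    """Percentile rank of score x inside a class interval of a grouped distribution."""
    into_interval = f * ((x - lower_limit) / width)   # share of the interval's people at or below x
    return (cf_below + into_interval) / n * 100

print(round(grouped_percentile_rank(14, 9.5, 5, 2, 8, 30), 2))   # 32.67
```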
Z Scores
Z = (X − M) / SD
[Figure: The normal curve and derived scores, showing standard deviations from −4 to +4, cumulative percentages (0.1%, 2.3%, 15.9%, 50.0%, 84.1%, 97.7%, 99.9%), rounded percentages, percentile equivalents, stanines (with 4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, and 4% of scores in stanines 1 through 9), Wechsler subtest scores (1 to 19), and deviation IQs (55 to 145)]
Math                              English
Class M = 40, SD = 8              Class M = 62, SD = 4

Z = (X − M) / SD                  Z = (X − M) / SD
Z = +1.00                         Z = −1.00
Even though Simons raw score in English is higher
than his raw score in math, you can tell that Simon did
much better on his math test than on his English test
just by eyeballing the standardized Z scores. Simon
scored at the 84.13th percentile in math compared to
his classmates. His percentile rank equals the percent
of those who scored below the mean (50%) plus the
percent of those who scored from the mean to 1Z above
the mean (PR1Z
50% 34.13% 84.13%) or 84.13th percentile. However,
he scored only at the 15.87th percentile in English
(PR1Z 50% 34.13% 57.87%). Simon needs help in
English. On this test he did better than only 15.87%
of his classmates. You suggest to his parents that they
should talk to his seventh-grade English teacher and
maybe hire a tutor for English. Being a good principal,
you also reassure Simon's parents that he seems to be
doing well in math.
Let's do one more example. Remember our good friend
Ryan who had complained about being stressed and not
having any friends. He had taken both the College Stress
Scale (CSS) and the Social Support Inventory (SSI). On
the CSS, Ryan had scored 92. The mean for college
freshmen was 71.78 with a SD of 16.76. On the SSI, Ryan
had scored 15. The mean for college freshmen was 21.65
with a SD of 4.55. As Ryan's counselor, you are trying to
decide whether to first address his loneliness or his
stress. To help you make this decision, you want to
compare Ryan's level of stress with his level of
loneliness. To do this you have to transform his raw
scores to Z scores.
CSS                               SSI
M = 71.78, SD = 16.76             M = 21.65, SD = 4.55

Z = (X − M) / SD                  Z = (X − M) / SD
Z = (92 − 71.78) / 16.76          Z = (15 − 21.65) / 4.55
Z = +1.21                         Z = −1.46
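If you prefer to let the computer do the dividing, a minimal Python sketch of the Z-score formula, run on Ryan's two scores, reproduces the same values.

```python
def z_score(x, mean, sd):
    """Z = (X - M) / SD."""
    return (x - mean) / sd

# Ryan's scores on the two instruments.
print(round(z_score(92, 71.78, 16.76), 2))   # CSS: 1.21
print(round(z_score(15, 21.65, 4.55), 2))    # SSI: -1.46
```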
One of Ryan's two friends, Paula, also took the CSS and
the SSI. What are the Z scores for each of Paula's test
scores and how would you interpret them?
CSS                               SSI
M = 71.78, SD = 16.76             M = 21.65, SD = 4.55

Z = (X − M) / SD                  Z = (X − M) / SD
Z = _____                         Z = _____
1. What does Paula's stress Z score tell us about her level of stress?
CSS                               SSI
M = 71.78, SD = 16.76             M = 21.65, SD = 4.55

Z = (75 − 71.78) / 16.76          Z = (26 − 21.65) / 4.55
Z = +0.19                         Z = +0.96
2. What is the percentile rank for each of Paula's Z scores?
CSS Z of 0.19
SSI Z of 0.96
(Hint: Don't forget to add the 50% who scored below the mean.)
CSS Z of 0.19
The percentile rank for a Z score of 0.19 equals 57.53. We arrived at this value
by looking at the Z value of 0.19 in Appendix A and found that this value
represents 7.53% of the area between the mean and the Z score. Because the Z value
is positive, we have to add the 50% that represents the scores below the mean to
the 7.53% and arrive at 57.53% of scores at or below a Z score of 0.19. Therefore,
the percentile rank is 57.53.
SSI Z of 0.96
The percentile rank for a Z score of 0.96 equals 83.15. We arrived at this value
by looking at the Z value of 0.96 in Appendix A and found that this value
represents 33.15% of the area between the mean and the Z score. Because the Z
value is positive, we again have to add the 50% that represents the scores below
the mean to the 33.15% and arrive at 83.15% of scores at or below a Z score of
0.96. Therefore, the percentile rank is 83.15.
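If you do not have Appendix A handy, the area below a Z score can be read from the standard normal curve in software. Here is a minimal Python sketch that uses the standard library's NormalDist in place of the table; it reproduces Paula's two percentile ranks.

```python
from statistics import NormalDist

def percentile_from_z(z):
    """Percent of the normal curve falling at or below a given Z score."""
    return NormalDist().cdf(z) * 100

print(round(percentile_from_z(0.19), 2))   # 57.53, Paula's CSS percentile rank
print(round(percentile_from_z(0.96), 2))   # 83.15, Paula's SSI percentile rank
```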
Other Standard Scores
Two other types of standard scores frequently used in
measurement are T scores and IQ scores. Some of you
may also be familiar with ACT, SAT, and
GRE scores, which are also standardized scores. For
example, the GRE has a set mean of 500 and a SD of 100.
Don't groan, your GRE Verbal of 490 wasn't that bad. You
are rubbing elbows (literally and figuratively) with the
average person who took the GRE.
T Scores

T = 10(Z) + 50
T = 10(0.19) + 50
T = 1.9 + 50
T = 51.9
IQ Scores

IQ = SD(Z) + M
IQ = 15(Z) + 100
IQ = 16(Z) + 100
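Because T scores and deviation IQs are just linear transformations of Z, they are easy to compute. A minimal Python sketch follows; whether you plug in SD = 15 or SD = 16 depends on which IQ scale you are mimicking.

```python
def t_score(z):
    """T = 10(Z) + 50."""
    return 10 * z + 50

def iq_score(z, sd=15, mean=100):
    """Deviation IQ: IQ = SD(Z) + M; some scales are built with SD = 16 instead of 15."""
    return sd * z + mean

print(round(t_score(0.19), 1))     # 51.9, the T score worked out above
print(iq_score(1.0))               # 115.0: one SD above the mean when SD = 15
print(iq_score(1.0, sd=16))        # 116.0 on a scale built with SD = 16
```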
Key Terms
Check your understanding of the material by explaining
the following concepts. If you aren't sure, look back and
reread.
Percentile rank
Z score
T score
IQ score
Models and Self-instructional Exercises
Our Model
You may not know this, but when Gandalf (from Lord
of the Rings) was selecting which Hobbits to help him
fight the Dark Lord Sauron, he gave the Hobbits tests
to measure courage, endurance, and power-hungriness.
The Hobbits had a mean of 22 (SD = 4) on the courage
scale, a mean of 53 (SD = 7) on the endurance scale,
and a mean of 13 (SD = 2) on the power-hungriness
scale. Mr. Frodo scored 30 on courage, 60 on
endurance, and 7 on power-hungriness. Samwise Gamgee
scored 26 on courage, 62 on endurance, and 9 on power-
hungriness. Whom should Gandalf pick?
To answer this question, we first want to calculate
the Z scores for both Mr. Frodo and Samwise on each
scale.
                      Mr. Frodo                         Samwise
Courage               Z = (X − M) / SD                  Z = (X − M) / SD
                      Z = (30 − 22) / 4 = 8 / 4         Z = (26 − 22) / 4 = 4 / 4
                      Z = +2.00                         Z = +1.00

Endurance             Z = (X − M) / SD                  Z = (X − M) / SD
                      Z = (60 − 53) / 7 = 7 / 7         Z = (62 − 53) / 7 = 9 / 7
                      Z = +1.00                         Z = +1.29

Power-hungriness      Z = (X − M) / SD                  Z = (X − M) / SD
                      Z = (7 − 13) / 2 = −6 / 2         Z = (9 − 13) / 2 = −4 / 2
                      Z = −3.00                         Z = −2.00
                      Bill                              Ted
History               Z = (X − M) / SD                  Z = (X − M) / SD
                      Z = _____                         Z = _____
                      PR = _____                        PR = _____

Political Science     Z = (X − M) / SD                  Z = (X − M) / SD
                      Z = _____                         Z = _____
                      PR = _____                        PR = _____
History               PR = 2.27                         PR = 1.07
Political Science     Z = (X − M) / SD                  Z = (X − M) / SD
                      Z = (65 − 82) / 8 = −17 / 8       Z = (68 − 82) / 8 = −14 / 8
                      Z = −2.13                         Z = −1.75
                      PR = 1.66                         PR = 4.01
T = 10(Z) + 50
Here we go again: something new for you to learn. This is going to be a short and
to the point chapter. Aren't you glad? We are going to introduce you
formally to criterion-referenced tests and norm-
referenced tests. Just to remind you, a test is a
sample of behavior or characteristic at a
given point in time.
Criterion-Referenced Tests: Do You Know as Much as You Should?
To keep this simple, first we need to define what a
criterion is. In the context of measurement, a criterion
is defined as some measurable behavior, knowledge,
attitude, or proficiency. A criterion-referenced test is a
mastery test that assesses your proficiency on a
criterion of importance. For example, perhaps your
professor at the beginning of the semester told you
that you had to learn 70% of the information related to
measurement in order to pass this class. You
demonstrate that you've learned this information by
averaging at least 70% across all the class tests and
other assignments. In this instance, the criterion is the
course content, and the cutoff score for passing is 70%.
A cutoff score is the lowest score you can receive and
still be in the passing range.
As we see increasing demands for accountability in
education, both students and teachers are having to
pass criterion-referenced tests on specific domains of
knowledge. For example, in many states students must
1. What is a criterion?
Key Terms
Criterion-referenced tests
Criterion
Cutoff score
Norm-referenced tests
Norm groups
Norm-reference group
Fixed-reference group
Specific group norms
Models and Self-instructional Exercises
You have just taken your midterm in this class (again),
and you find out you knew 60% of the information on the
test. (Careful, you're slipping!)
Wouldn't it be wonderful if a test score was an absolutely perfect, accurate
measure of whatever behavior or variable is being measured? Sadly, test scores
are not all they're cracked up to
be. They are not perfect measures of knowledge,
behaviors, traits, or any specific characteristic. That's
right, even the tests in this class are not perfect
measures of what you know. Basic test theory can help us
explain why test scores are fallible.
Test Theory
A basic concept in test theory is that any score obtained
by an individual on a test is made up of two components:
the true score and the error score. No person's obtained test
score (Xo) is a perfect reflection of his or her
abilities, or behaviors, or characteristics, or whatever
it is that is being measured. The basic equation that
reflects the relationship between true, error, and
obtained or observed scores is
Xo = Xt + Xe
Test-Theory Assumptions
There are three underlying assumptions in test theory.
They are relatively simple and straightforward. First, it
is assumed that true scores are stable and consistent
measures of a variable, characteristic, behavior, or
whatever you're measuring. Therefore, when you are doing
measurement, you try to control for as much error as
possible so that your observed score approaches your true
score. We want our observed score to be as close as
possible to our true score (which we will never really
know).
The second assumption is that error scores are random.
They just happen. As measurement specialists, we try to
stamp out as much error as we can, but error can't be
totally controlled. Every obtained score consists of
error. Error doesn't care where it strikes. It attaches
itself to every score, just like a leech, regardless of
who is taking the test. Because error attaches itself to
the true score, it can raise or lower an obtained score.
Error scores occur purely by chance. Because of this,
there is no (zip, nada, zilch) relationship between error
scores and obtained scores (or true scores for that
matter). What this means is that a student with a high
score on a test should be just as likely to have a large
error associated with his [or her] score, either positive or negative, as the
person who received the lowest
score on the test (Dick & Hagerty, 1971, p. 12).
The final assumption is that the observed score is
the sum of the true score and the error score (Xo = Xt + Xe). In order to
understand this assumption, we
need to understand Thing One and Thing Two
(if Dr. Seuss were writing this book) and we need an
example. This theoretical example can be found in Table
7.1. We say theoretical because we can never know what
someone's error score is.
Table 7.1  True, Error, and Observed Scores and Their Sums, Means, and Variances

        True Score (Xt)   Error Score (Xe)   Observed Score (Xo)
Σ       877               0                  877
M       87.7              0.0                87.7
s²      24.81             6.0                30.81
SD      4.98              2.4                5.55
To see this relationship, add the true variance (s²t) to the error variance (s²e)
in Table 7.1. As you can see, when the s²t of 24.81 is added to the s²e of 6.0,
this equals the s²o of 30.81.
Now that we've taught you this, we have to fess up. We can't know the actual
values of error scores or true scores, so we can't calculate their variances.
They are hypothetical concepts. You just have to trust us that they exist. In
fact, the error scores, because they are random, are totally independent of
(that is, not correlated with) the true scores. So now
you have some idea about the theoretical underpinnings
of testing.
Our Model

        True Score (Xt)   Error Score (Xe)   Observed Score (Xo)
Σ       _____             _____              _____
M       _____             _____              _____
s²      _____             _____              _____
SD      8.01              3.38               8.7
Our Model Answers
        True Score (Xt)   Error Score (Xe)   Observed Score (Xo)
Σ       347               0                  347
M       34.7              0.0                34.7
s²      64.21             11.4               75.61
SD      8.01              3.38               8.7
Now It's Your Turn

        True Score (Xt)   Error Score (Xe)   Observed Score (Xo)
Σ       _____             _____              _____
M       _____             _____              _____
s²      _____             _____              _____
SD      4.8               3.3                5.9
We have completed the table with the correct answers. How did you
do?
Person   True Score (Xt)   Error Score (Xe)   Observed Score (Xo)
1        18                +2                 20
2        23                −5                 18
3        20                −1                 19
4        11                +3                 14
5        18                −2                 16
6        20                +1                 21
7        25                −4                 21
8        23                +5                 28
9        11                −3                 8
10       25                +4                 29
Σ        194               0                  194
M        19.4              0.0                19.4
s²       23.44             11                 34.44
SD       4.8               3.3                5.9
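You can confirm all three test-theory assumptions on this small table with a few lines of Python. The variances in the table divide by N, so this sketch does the same; it is only an illustrative check.

```python
true =     [18, 23, 20, 11, 18, 20, 25, 23, 11, 25]
error =    [ 2, -5, -1,  3, -2,  1, -4,  5, -3,  4]
observed = [t + e for t, e in zip(true, error)]      # Xo = Xt + Xe for every person

def var_n(scores):
    """Variance as used in the table: sum of squared deviations divided by N."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)

print(observed)                              # matches the observed-score column above
print(sum(error))                            # 0: the error scores cancel out
print(sum(true) / 10, sum(observed) / 10)    # both means are 19.4
print(round(var_n(true), 2), round(var_n(error), 2), round(var_n(observed), 2))
# 23.44, 11.0, and 34.44 -- and 23.44 + 11.0 = 34.44, the observed variance.
```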
Building a Strong Test: One the Big Bad Wolf Can't Blow Down
Building a test is like building a house: it is
only as strong as the raw materials used to
construct it. The raw materials used to build a
test are the items. When constructing a test, your
first job is to develop or
write a pool of items. This is called item generation.
To generate items, you have to be very familiar with the
literature on the concept or focus of your test. If there
is any theoretical foundation for the concept youre
examining, you need to be familiar with this theory.
Sometimes interviews with experts on the test topic or
individuals experiencing the ramifications of the topic
being examined will yield valuable information to help in
item generation. When you generate items, you want a
large pool of items so you can select the strongest items
to build your test.
Here's an example of how one test was constructed. An
issue important for retaining minority students on
college campuses is their sense of fitting in
culturally. However, no instrument or assessment
device was found in 1995 that measured this concept.
To build a strong instrument to measure cultural
congruity, the senior author of this text and another
colleague thoroughly researched the literature, read
extensively on theories related to college student
retention, and interviewed racial or ethnic minority
undergraduates. A pool of 20 items was generated and
tested for their contribution to the Cultural
Congruity Scale (CCS) (Gloria & Robinson Kurpius,
1996). The final scale consisted of only 13 items.
Seven items from the original item pool were not good
items and would have weakened the test. They were
thrown out.
Item Discrimination
Throughout our discussion of item difficulty, we
referred to the ability of a test to discriminate
among people. A test cannot discriminate among people
unless the items themselves differentiate between those
who know the information being tested, have the attitude being assessed, or
exhibit the behavior being measured and those who don't. Item discrimination is
"the degree to which an item differentiates correctly among test takers in the
behavior that the test is designed to measure" (Anastasi & Urbina, 1997, p. 179).
What we want is for test items to discriminate
between groups of people (often called criterion
groups). Some potential criterion groups consist of
those who succeed or fail in an academic course, in a
training program, or in a job. As a budding
measurement specialist, you need to be able to pick
tests that have strong item discrimination (items that
cannot be blown apart because they're not wishy-washy)
or that help build a test with strong item
discrimination ability. We want you to understand item
discrimination conceptually, not necessarily calculate
its statistical indices. Someday you might be asked to
create a test, and we don't want you to forget the
importance of item discrimination.
When creating or building your test, regardless of what
you are trying to measure, you want each of your items to
be able to discriminate between those who are high and
low on whatever it is you are measuring. Even though an
item has an acceptable item difficulty level, if it
doesn't have the ability to discriminate, it is a weak
item. It should not be included on the test. Lose it!!!
Discard it!!! Get rid of it!!! Hide it in the
refrigerator if you have to!!! Just in case we're
confusing you, maybe the following example will clear
things up. You have a good job at a large utility company
(lucky you) and one of your job responsibilities is
recommending employees for promotion to middle-level
managerial positions. In your company, a democratic
leadership style is highly valued and has been found to
be very effective with employees. Therefore, you always
give those individuals being considered for promotion a
leadership style instrument that assesses democratic and
autocratic leadership styles. Every item on this
instrument should be able to discriminate between those
who would be democratic and those who would
be autocratic managers.
When this leadership style instrument was originally
validated (which we will explain in Chapter 10), it
was administered to middle-level managers whose
leadership style had been rated by their supervisors
as democratic or autocratic. To determine whether an
item had strong item discrimination, managers in the
two leadership style groups needed to score
differently on the item. For example, managers who had
a democratic style would respond very positively to the
item, "I involve employees in decision making." An
autocratic leader would respond negatively to this
item. Based on the different response patterns of
democratic and autocratic managers, this item has
strong item discrimination.
Ideally, the final version of the leadership style
instrument does not include any items that do not
discriminate between the two groups. Based on your
validation study, you discarded those items earlier.
If all items included on this test have good
discriminating ability, you should be able to classify
all your potential managers who respond consistently
into two groups: democratic or autocratic.
We regret that we have to point out the vagaries of
human nature. Some people will not respond consistently
no matter how good the item discrimination is for every
item on your test. Those people who answer
inconsistently, sometimes agreeing with the democratic
items and some- times agreeing with the autocratic items,
will have scores that reflect a mixed leadership style.
The test items are not to blame for this confusion.
Human inconsistency is to blame. We can't control
everything. (Perhaps, a mixed leadership style is a good
thing, too.)
Before we leave the topic of item discrimination, we
feel compelled (as King Richard might say, "Nay, obligated!") to provide you with
a wee bit of information
about the index of discrimination. In our example of
democratic and autocratic leadership styles, we used what
is known as the extreme groups approach to item
discrimination. To calculate the index of discrimination
for each item for these two groups, you first calculate
the percentage of managers who scored democratic on the
item and the percentage of those who scored autocratic
on the item. For this item, the difference (D) between
these two percentages becomes its index of discrimination.
An index of discrimination can range from −1.00 to +1.00. The closer the D value
comes to 0, the less discriminating ability the item has. In fact, D = 0 means no
item discrimination. The closer the D value is to +1.00 or −1.00, the stronger the
item discrimination. The exact formula for the index of discrimination is

D = U − L
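As a quick illustration of D = U − L, here is a tiny Python function. The 90% and 20% endorsement rates below are made-up values for the democratic and autocratic groups, not numbers from the text.

```python
def index_of_discrimination(p_upper, p_lower):
    """D = U - L: proportion of the upper (criterion) group answering in the keyed
    direction minus the proportion of the lower group doing so."""
    return p_upper - p_lower

# Illustrative proportions only: say .90 of the democratic-style managers endorsed
# "I involve employees in decision making" and .20 of the autocratic managers did.
print(round(index_of_discrimination(0.90, 0.20), 2))   # 0.7, a strongly discriminating item
```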
Key Terms
Item pool
Item analysis
Item difficulty
Item discrimination
Index of discrimination
p value
D value
Extreme groups
3. If this is a strong test, the average item difficulty should be approximately _____.
Reliability: The Same Yesterday, Today, and Tomorrow
When selecting tests for use either in research or
in clinical decision making, you want to make
sure that the tests you select are reliable.
Reliability can be defined as the trustworthiness
or the
accuracy of a measurement. Those of us concerned with
measurement issues also use the terms consistency and
stability when discussing reliability. Consistency is the
degree to which all parts of a test or different forms
of a test measure the same thing. Stability is the
degree to which a test measures the same thing at
different times or in different situations. A
reliability coefficient does not refer to the test as a whole, but it refers to
scores obtained on a test. In measurement, we are interested in the consistency
and stability of a person's scores.
As measurement specialists, we need to ask ourselves,
Is the score just obtained by Person X (Person X seems so impersonal; let's call
her George) the same score she would get if she took the test tomorrow, or the
next day, or the next week? We want George's score to be a stable
measure of her performance on any given test. The
reliability coefficient is a measure of consistency. We
also ask ourselves, Is the score George received a true
indication of her knowledge, ability, behavior, and so
on? Remember obtained scores, true scores, and error
scores? The more reliable a test, the more George's
obtained score is a reflection of her true score.
A reliability coefficient is a numerical value that can
range from 0 to 1.00. A reliability coefficient of zero
indicates the test scores are absolutely unreliable. In
contrast, the higher the reliability coefficient, the
more reliable or accurate the test scores. We want tests
to have reliabilities above 0.70 for most testing purposes.
2 93 90
3 85 90
4 108 100
5 116 100
6 100 90
7 75 95
8 90 95
9 88 88
10 80 85
Person   Form 1   Form 2
2        35       45
3        42       40
4        68       74
5        73       80
6        52       50
7        79       80
8        60       61
9        59       67
10       48       55
The alternate forms reliability coefficient for these
scores across this 4-month period is 0.95. If we depicted
this symbolically, it would look like this:
rForm1,Form2 = .95
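In case you want to see where a number like 0.95 comes from, here is a Pearson product-moment correlation computed from scratch in Python on the nine Form 1/Form 2 score pairs listed above; it is a sketch of the calculation, not the output of any particular package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Nine of the Form 1 / Form 2 score pairs from the table above.
form1 = [35, 42, 68, 73, 52, 79, 60, 59, 48]
form2 = [45, 40, 74, 80, 50, 80, 61, 67, 55]
print(round(pearson_r(form1, form2), 2))   # about 0.95 for these pairs
```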
The measurement specialists who designed these
alternate forms of the MAT did a great job. (Let's hope
they were given a bonus for a job well done!) This 0.95
is a very strong reliability coefficient and indicates
that sources of error due to both content variability
and time were controlled to a very great extent.
Reliability Statistics

Cronbach's Alpha    N of Items
.869                5
Interrater Reliability
Teacher Responses   Rater 1   Rater 2
2                   X         0
3                   0         0
4                   X         X
5                   0         0
6                   X         X
7                   X         X
8                   X         X
9                   0         X
10                  0         0
Interrater reliability = Number of agreements / Number of possible agreements

Interrater reliability = 8 / 10 = 0.80
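The same agreement count is easy to automate. The Python sketch below scores the nine teacher responses shown in the table; with the full set of ten responses, the calculation in the text gives 8 / 10 = 0.80.

```python
def interrater_agreement(rater1, rater2):
    """Number of agreements divided by number of possible agreements."""
    agreements = sum(1 for a, b in zip(rater1, rater2) if a == b)
    return agreements / len(rater1)

# Ratings of teacher responses 2-10 as they appear in the table above.
rater1 = ['X', '0', 'X', '0', 'X', 'X', 'X', '0', '0']
rater2 = ['0', '0', 'X', '0', 'X', 'X', 'X', 'X', '0']
print(interrater_agreement(rater1, rater2))   # 7 of these 9 responses agree (about 0.78)
```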
4. For which type do you need to administer the test only one
time?
Key Terms
Reliability
Test–retest
Alternate forms
Internal consistency
Interrater
Standard error of measurement
Pearson product-moment correlations
K-R 20
Cronbach's alpha
Spearman-Brown correction
Between _____ and _____.
3. Based on what you know so far about Tom and the other
applicants, would you recommend him as an employee?
Yes or no?
3. Based on what you know so far about Tom and the other
applicants, would you recommend him as an employee?
We would not recommend Tom for employment.
4. Why did you make this recommendation?
At the 95% confidence level, Tom's highest potential true score is still barely
equal to the mean score of 45 for this applicant pool. More than likely, his true
score is significantly below the group mean. We would look for applicants whose
true score would at a minimum include the mean and scores above it at the 95%
confidence interval.
r tt        Interpersonal Relationships Subscale        Fear of Math Subscale
1. What is the SEm for each subscale for this group of applicants?
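The standard error of measurement is usually computed as SEm = SD × √(1 − r tt), and an approximate 95% band around an observed score is X ± 1.96 SEm. The subscale SDs and reliabilities for this exercise are not reproduced above, so the numbers in the Python sketch below are purely illustrative assumptions.

```python
from math import sqrt

def sem(sd, reliability):
    """Standard error of measurement: SEm = SD * sqrt(1 - r_tt)."""
    return sd * sqrt(1 - reliability)

def confidence_band(observed, sd, reliability, z=1.96):
    """Approximate 95% band for the true score around an observed score."""
    e = z * sem(sd, reliability)
    return observed - e, observed + e

# Purely illustrative numbers (assumed SD = 10, r_tt = .91, observed score = 40).
print(round(sem(10, 0.91), 2))         # 3.0
print(confidence_band(40, 10, 0.91))   # roughly 34.1 to 45.9
```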
Suppose you've created a test that has perfect reliability (r = 1.00). Anyone who
takes this test
gets the same score time after time after time.
The obtained score is their true score. There is no
error. Well, doesn't this sound wonderful!? Don't be
gullible. If you believe there is such a thing as a
perfectly reliable test, could we interest you in some
ocean-front property in the desert? Remember that what
you see is not always what you get.
We're sorry to tell you, but having a perfectly
reliable test is not enough. Indeed, a perfectly reliable
test (or even a nonreliable test) may not have any value
at all. We offer the case of Professor Notsobright to
prove our point. Professor Notsobright wants to know how
smart or intelligent everyone in his class is. He knows
that intelligence is related to the brain and decides,
therefore, that brain size must surely reflect
intelligence. Since he can't actually measure brain
size, he measures the circumference of each student's
head. Sure enough, he gets the same values each time he
takes out his handy-dandy tape measure and encircles each
student's head. He has found a reliable measurement. What
he has NOT found is a valid measure of the construct
intelligence.
In measurement, our objective is to use tests that are
valid as well as reliable. This chapter introduces you to
the most fundamental concept in measurement: validity.
Validity is defined as how well a test measures what it
is designed to measure. In addition, validity tells us
what can be inferred from test scores. According to the
Standards for Educational and Psychological Testing (1999),
the process of validation involves "accumulating evidence to provide a sound
scientific basis for the proposed score interpretations" (p. 9). Evidence of
validity is
related to the accuracy of the proposed interpretation of
test scores, not to the test itself.
Validation Groups
Criteria
Construct Underrepresentation
2. A criterion is _____.
3. A test should be validated against one and only one criterion. True or false?
4. The criteria used as a source of validity evidence are external to the test itself. True or false?
5. Construct underrepresentation occurs when _____.
6. A source of error in a validity coefficient that is not related to the test's intended construct is called _____.
7. Examples of this source of error include _____ and _____.
10. Construct validity is also referred to as _____.
12. When multiple theoretical constructs are being measured by the same test, the test should be _____ and have _____ for each construct or factor being assessed.
Key Terms
Validity
Validation group
Criterion
Construct underrepresentation
Construct-irrelevant variance
Sources of validity evidence
Content
Criterion related
Predictive
Concurrent
Convergent
Discriminant
Internal structure
Construct
Therefore, the maximum potential validity value of the IP for bank tellers
is 0.94.
Let's pretend that you are passing this class with flying colors (red, white, and
blue, of course) and are thinking about actually using your
newfound knowledge of measurement and testing. This
means that
you need to pay particularly close attention to this last
chapter: The Perils and Pitfalls of Testing.
There are two large ethical domains that you need to be concerned about. These are your own
competence and the rights of the people you test or
assess.
1. You must ensure that the test you have chosen has
strong psychometric qualities: reliability and
validity.
2. If you are using a norm-referenced test, you need
to make sure that the norm group is appropriate
for the person you are going to test.
3. You need to protect the confidentiality of the person
you are testing.
4. You need to protect the test itself, answer
sheets, and responses from being shared with
someone who is not competent in testing.
5. You need to make sure that you do not overinterpret
test scores.
6. You need to be careful not to use test scores in a
way that is harmful.
Potential Dangers
Three major dangers of testing have been frequently
written about in the literature. The first is
invasion of privacy. When you ask someone to
answer any test or measurement questions, you are asking
about personal thoughts, ideas, knowledge, or attitudes.
By answering the questions, they are revealing personal
information about themselves. We have an ethical
obligation to respect this information and to keep it
confidential.
The second major danger of testing is self-
incrimination. Sometimes our test or measurement
questions ask about information that most people would not
talk about. For example, if someone takes a personality
inventory, questions about certain behaviors are asked.
Remember our Honesty Inventory? If our job applicants
answered the items honestly, they may well have been
giving us involuntary confessions of behaviors that may
be illegal. Based on their answers, not only did they get
or not get hired, they also got a label: honest or not
honest. Before a person ever answers test questions, they
need to know who will see their test scores and what will
happen to them as a result of their responses.
The last major danger of testing is unfair
discrimination. We want tests and all measurement
devices to be reliable and valid and to help us
discriminate among individuals, not against people. For
example, the profiles Gandalf obtained for Mr. Frodo
and Samwise Gamgee helped him make a decision about who
would be the ring bearer. Test results can help us decide
who will be the best for certain tasks or jobs, who has
mastered specific bodies of knowledge, who needs further
assistance or help in an area, or who has a special
aptitude. What we don't want a test to do is to discriminate against a person or
group of people. When a
test is not culturally
sensitive or when it is given in a language that someone
has not mastered, it will discriminate against that
person. Remember, it is unethical to use test results to
discriminate against a group of people. So, if your test
promotes Euro-American children and puts other children
(Native Americans, Latinos, or African Americans) into
special education classes, it is unfairly discriminating
against these children of color and is culturally biased.
America has a sad history of using tests to categorize
and separate people, a history that we are trying to overcome. Be one of
the good guys; use your knowledge of measurement to help
others, not to hurt them.
Ryan's Rights
Embedded in the dangers are the rights of clients or
students or potential employees. We need to protect their
right to privacy, to know when their answers might be
self-incriminating, and to trust that test scores will
not be used to discriminate against them in any way.
Specifically, Ryan has the right to know his test scores
will not be shared with his professors, classmates, or
anyone else without his consent. He has the right to know
that some test items might ask about incriminating
behaviors. This implies that he also has the right to
refuse to answer these questions. Finally, he has the
right to know that his test scores will not be used
against him. That is, he has the right to know how the
test scores will be used and to have veto power over any
projected use.
Whenever you are involved in testing people, they have
the right to give informed consent before they are
tested. This consent should (1) tell them about the
test, what use will be made of the results, who will see
the results, and perhaps even how long the test results
will be kept; (2) be given freely or voluntarily by the
person who will be tested; and (3) be given only by a
person who is competent to understand what he or she is
consenting to. If the person being tested is a minor,
consent must be obtained from a parent or guardian. In
addition, the minor should be asked to give informed
assent; that is, they agree to be tested and know what is
being asked of them.
Often schools administer tests to every student. When
this happens, they are acting in loco parentis. If
testing is mandated by the state, the school does not
need to obtain specific consent from the parents.
However, if a school system agrees to let you test
students for your personal research, you must get
affirmative parental consent. We use the term affirmative to
indi- cate that you must have signed consent. Failure to
say no is not consent!
As you evolve in your knowledge of measurement and
testing, we just want to remind you to be ethical and to
respect the power of tests. We believe in you!! In fact, we
believe in you so much, there is no Let's Check Your
Understanding for this chapter.
Appendix
Index

Behavioral sciences, 5
Bell-shaped curve. See Normal curve
Bimodal curve, 48, 49
Binet IV, 85
Biserial correlation, 117
California Personality Inventory, 84
CCS (Cultural Congruity Scale), 111
Central tendencies: cartoon of, 48; choice of measurement, 57; definition of, 47; key terms, 63; mean, 53–54; median, 50–52; medians, modes, for grouped frequency data, 53; mode, 48–49
cf. See Cumulative frequency
Class interval: in cumulative frequency distribution, 26–27; in grouped frequency distribution, 22–24; mode and, 49; of grouped frequency data, 53
Class interval width, 22, 73–74
Cohen's Kappa, 130
College Stress Scale (CSS): concurrent validity for, 149; in frequency curve model, 42–43; in frequency distribution model, 28–31; mode of scores, 49–50; response formats and, 7–8, 11; Z scores and, 80, 81–83
Competence, tester, 163–164
Concurrent validity, 149, 150
Confidence interval, 132–133
Confidentiality, testee, 164
Consent, informed, 166
Consistency, 121
Construct underrepresentation, 145
Construct validity, 150–151
Construct-irrelevant variance: description of, 145; test content and, 147–148
Content stability, 127–129
Content validity, 147–148
17 TESTING AND MEASUREMENT
4
Continuous response format: Cumulative
description of, 78 frequency
of Personal Valuing of distribution:
Education description of,
2627
test, 127128
functi
options of, 9
10 test scores on of,
and, 11 28
Convergent validity, 150 Curve. See
Correlation coefficient: Frequency
alternate form
reliability, 126 as curve
measure of Cutoff
reliability, 134 test- score:
retest reliability, definition of,
124125 validity 93
coefficient and, 142
Criterion: for criterion-
definition referenced
of, 93 tests, 94
for validity, 144145
Criterion groups, D (difference),
115 Criterion-
referenced tests: 116117
description of, 9394 Dangers, testing,
key terms, 98 165166
models/exercises for, 99 Deviation,
100 standard. See
questions/answers about, Standard
9495 deviat
Criterion-related validity, ion
148150
Cronbachs alpha, 128
Crystal, Billy, 4
CSS. See College Stress
Scale Cultural bias, 125,
165166
Cultural Congruity Scale
(CCS), 111 Cumulative
frequency (cf):
calculation of,
2627 class
interval and,
53 definition
of, 26
percentile rank
calculation, 7173
percentile rank for
grouped
data, 7374
Deviation scores:
    description of, 58–59
    Z score formed from, 77
Dichotomous response format:
    description of, 7
    of Personal Valuing of Education test, 128–129
    options of, 9–10
    test scores and, 11
Dick, W., 103
Difference (D), 116–117
Discriminant validity, 150
Discrimination:
    item, 115–117
    unfair, 165–166
Discrimination, index of:
    computation of, 116–117
    models/exercises for, 118–120
Dispersion:
    description of, 57
    deviation scores, 58–59
    key terms, 63
    models/exercises for, 64–70
    questions/answers about, 62
    range, 57–58
    standard deviation, 59–61
    variance, 59
Distribution. See Frequency distributions; Test scores, distribution of
Empirical evidence, 147
Error scores:
    caution about, 110
    fallibility of test scores, 101
    key terms, 106
    models/exercises for, 107–110
    reliability and variance, 123
    test theory, 101–102
    test theory, questions/answers, 102–103
    test-theory assumptions, 103–105
    test-theory assumptions, questions/answers, 105–106
Error variance:
    reliability and, 123
    test-retest reliability and, 124–125
Ethics, of testing:
    competence of tester, 163–164
    dangers, potential, 165–166
    rights of testees, 164–165, 166
Expert ratings, 147
Extreme groups approach, 116
f. See Frequencies
Fear of math (FOM) test, 137–139, 158–162
Fixed-reference group, 96–97
Flawless IQ Test, 124–125
FOM (fear of math) test, 137–139, 158–162
Frequencies (f):
    cumulative frequency distribution, 26–27
    description of, 19
    grouped frequency distribution, 22–24
    key terms, 28
    models/exercises for, 28–34
    questions/answers about, 24–25
    test scores and, 27–28
    ungrouped frequency distributions, 19–21
Frequency curve:
    central tendencies, 47–57
    description of, 35
    dispersion, 57–61
    key terms, 41–42
    kurtosis, 35–37
    mean, median, mode, on skewed curves, 57, 58
    models/exercises for, 42–45
    skewness, 37–41
    Z scores and normal curve, 77–78
Frequency distributions:
    cumulative frequency distribution, 26–27
    frequency curve, 39–40
    grouped frequency distribution, 22–24
    key terms, 28
    limitations of, 35
    models/exercises for, 28–34
    questions/answers about, 24–25
    test scores and, 27–28
    ungrouped frequency distributions, 19–21
Gender, 97
Gloria, A. M., 111
Goody-Two-Shoes (G2S) personality test, 144–145
Grade point averages (GPA), 144
Graduate Record Examination (GRE):
    construct underrepresentation, 145
    criterion for score validation, 144
    predictive validity of, 148–149
    standard scores of, 84
Graph, 35
    See also Frequency curve
Grouped data, percentile rank for, 73–74
Grouped frequency data, 53
Grouped frequency distributions:
    description of, 22–24
    function of, 28
    of midterm scores, 40
Hagerty, N., 103
High-stakes testing, 94, 154
Honesty Inventory (HI), 135–137, 155–158
Horizontal axis (x-axis), 35
i, 22, 73–74
Identification (ID) number, 2
In loco parentis, 166
Index of discrimination:
    computation of, 116–117
    models/exercises for, 118–120
Inference, 96
Informed consent, 166
Intelligence, 141, 142
Intelligence quotient (IQ) scores, 83, 84–85
Internal consistency reliability, 127–129
Internal structural validity, 150–151
Interpersonal relationships (IP) test, 137–139, 158–162
Interrater reliability, 129–130
Interval scale:
    description of, 3–4
    use of, 5, 12
Interval width, 22, 73–74
Invasion of privacy, 165
IP (interpersonal relationships) test, 137–139, 158–162
IQ (intelligence quotient) scores, 83, 84–85
Item analysis, 112
Item difficulty:
    description of, 113–114
    models/exercises for, 118–120
    questions/answers about, 114–115
Item discrimination:
    description of, 115–117
    models/exercises for, 118–120
    questions/answers about, 117–118
Item generation, 111, 112
Item pool, 111
Kranzler, G., 47
Kuder-Richardson 20 (K-R 20), 129
Kurtosis:
    description of, 35–37
    dispersion and, 57
    platykurtic curve, 39–40
    questions/answers about, 37
Lord of the Rings (Tolkien), 86–87
Lower limit (LL), 24, 73–74
N. See Number of scores/people
Negatively skewed curve:
    description of, 38–39
    frequency distribution of midterm grades, 40
    mean, median, mode and, 57
Nominal scales:
    description of, 1–2
    use of, 4–5, 12
Norm groups, 95–96
Normal curve:
    central tendencies and, 47
    kurtosis, 35–37
    norm-referenced tests, 95–98
    standard deviation, 60–61
    Z scores and, 77–78
Normative sample, 95–96
Norm-reference group, 95–96
Norm-referenced tests:
    definition of, 95
    fixed-reference group, 96–97
    key terms, 98
    models/exercises for, 99–100
    norm-reference group, 95–96
    questions/answers about, 98
    specific group norms, 97
Numbers:
    key terms, 12–13
    meaning of, 1
    measurement scales/response formats and, 12
    models/exercises for, 13–18
    response formats, 6–10
    scales of measurement, 1–6
    test scores and, 11
Observed score. See Obtained score
Obtained score:
    assumptions about, 103–104
    caution about, 110
    models/exercises for, 107–110
    reliability and variance, 123
    reliability of test and, 121
    test theory equation, 101–102
Open-ended response format, 6
Order, 2–3
Ordinal scale:
    description of, 2–3
    use of, 5, 12
p (percentage), 73, 113–114
Parallel forms reliability, 126–127
Parental consent, 166
Pearson correlation coefficient, 124, 125
Percentage (p), 73, 113–114
Percentile rank (PR):
    calculation of, 71–73
    comparisons with, 75–76
    for grouped data, 73–74
    for Z score, calculation of, 77–80, 83
    questions/answers about, 75–76, 87–89
Percentile score, 96
Perfectly Honest Scale (PHS), 155
Personal Valuing of Education test, 127–129
Personality inventories, 84
Personality Research Form (PRF), 84, 85
Phi coefficient, 117
PHS (Perfectly Honest Scale), 155
Platykurtic curve:
    description of, 37
    dispersion of, 57
    frequency distribution of midterm grades, 39–40
    illustration of, 36
Point biserial correlation, 117
Portfolio assessment, 149–150
Positively skewed curve, 38–39, 57
PR. See Percentile rank
Predictive validity, 148–149
PRF (Personality Research Form), 84, 85
The Princess Bride (movie), 4
Privacy:
    invasion of, 165
    right to, 166
Range, 57–58
Rank, 2–3
Rank-ordered data, 134
Ratio, 123
Ratio scale:
    description of, 4
    requirement of absolute zero, 5
    use of, 12
Raw scores:
    conversion to standard scores, 71, 90–91
    transformation into Z score, 77–80
Reliability:
    correlation coefficients as measures of, 134
    definition of, 121
    key terms, 135
    models/exercises for, 135–139
    principles of, 135
    questions/answers about, 122, 123–124
    reliability coefficient, 121–122
    score variance, 123
    SPSS to calculate, 140
    standard error of measurement, 132–134
    types of reliability estimates, 124–132
    validity and, 141, 153–154
Reliability coefficient:
    alternate forms reliability, 126–127
    description of, 121–122
    error variance and, 123
    internal consistency reliability, 128–129
    questions/answers about, 122
    test-retest reliability, 125
Response formats:
    continuous responses, 7–8
    description of, 6–7
    dichotomous responses, 7
    example of, 9–10
    internal consistency reliability and, 127–129
    item difficulty and, 113
    key terms, 13
    models/exercises for, 13–18
    questions/answers about, 10
    test scores and, 11
Rights, of testees, 164–165, 166
Robinson, S. E., 14
Robinson Kurpius, S. E., 111
Σ (Sum), 53, 54
Salkind, N. J., 4
SAT. See Scholastic Aptitude Test
Scales of measurement:
    example of, 5
    explanation of, 1
    interval, 3–4
    key terms, 12
    models/exercises for, 13–18
    nominal, 1–2
    ordinal, 2–3
    questions/answers about, 5–6
    ratio, 4
    use of, 4–5, 12
Scholastic Aptitude Test (SAT):
    construct underrepresentation, 145
    criterion for score validation, 144
    predictive validity of, 148–149
    standard scores of, 83–84
Scores (X):
    central tendencies, 47–57
    cumulative frequency distribution, 26–27
    dispersion, 57–61
    frequency distribution for, 27–28
    grouped frequency distributions, 22–24
    means, standard deviations, 62–63
    percentile rank calculation, 71–73
    percentile rank for grouped data, 73–74
    response formats example, 7–8, 11
    ungrouped frequency distributions and, 19–21
    Z scores, 80–82
    See also Standard scores
SD. See Standard deviation
Self-incrimination, 165
SEm (Standard error of measurement), 132–134
SII (Strong Interest Inventory), 150–151
Skewed curve, 57, 58
Skewed distribution, 37–39, 57
Skewness:
    description of/types of, 37–39
    frequency distribution and, 39–40
Social sciences, 4, 5
Social security number, 2
Social Support Inventory (SSI):
    calculation of central tendency, dispersion, 67–70
    frequency curve exercise, 43–45
    frequency distribution exercise, 31–34
Spearman rho, 134
Spearman-Brown correction procedure, 129
Specific group norms, 97
Speeded test, 129
Split-half procedure, 127–129
SPSS. See Statistical Packages for the Social Sciences (SPSS) program
SSI. See Social Support Inventory
Stability:
    definition of, 121
    internal consistency reliability, 127–129
Standard deviation (SD):
    calculation of, 59–61
    calculation of test scores, 62–63
    in standard error of measurement equation, 132–133
    in Z score formula, 77
    IQ score calculation and, 84–85
    percentile rank for Z score and, 77–80
Standard error of measurement (SEm), 132–134
Standard scores:
    IQ scores, 84–85
    key terms, 85
    models/exercises for, 86–90
    percentile ranks, comparisons with, 76–77
    percentile ranks for grouped data, 73–74
    percentile ranks for Z scores exercise, 83
    percentile ranks, questions/answers about, 75–76
    percentiles, 71–73
    reason for use of, 71
    SPSS for conversion of raw scores, 90–91
    T scores, 84
    Z score, questions/answers about, 81–83
    Z table, 168–170
Standards for Educational and Psychological Testing, 141, 147, 154
    standard error of measurement, 132–133
    validity, interpretation of, 154
State Departments of Education, 97
Statistical Packages for the Social Sciences (SPSS) program:
    calculation of central tendency, dispersion, 69–70
    conversion of raw scores to standard scores, 90–91
    for internal consistency reliability, 128, 129
    for reliability coefficients, 140
    frequency distribution of midterm grades, 21
Statistics, 47
Stevens, S. S., 1
Strong Interest Inventory (SII), 150–151
Sum (Σ), 53, 54
Sum of squares, 59
Symmetrical curves, 35–37
Symmetrical distribution, 57
T scores:
    description of, 84
    questions/answers about, 88, 90
    use in measurement, 83
Tails, 38
Teacher, 93, 94
Test:
    criterion-referenced tests, 93–94
    criterion-referenced tests, questions/answers, 94–95
    definition of, 11
    norm- and criterion-referenced tests, models/exercises for, 99–100
    norm-referenced tests, 95–97
    norm-referenced tests, questions/answers, 98
    reliability estimates, types of, 124–130
    reliability of, 121–122
    validity evidence, sources of, 147–151
    validity of, 141–142, 144–145
Test, building:
    item analysis, 112
    item difficulty, 113–114
    item difficulty, questions/answers about, 114–115
    item discrimination, 115–117
    item discrimination, questions/answers about, 117–118
    item generation, 111
    key terms, 118
    models/exercises for, 118–120
    questions/answers about, 112
Test content, 147–148
Test scores:
    caution about, 110
    fallibility of, 101
    frequency distribution and, 19, 27–28
    key terms, 106
    mean/standard deviation for, 62–63
    measurement scales and, 12
    models/exercises for, 107–110
    relationship to numbers, 11
    reliability and variance, 123
    reliability estimates, types of, 124–130
    reliability of, 121–122
    standard error of measurement, 132–133
    test theory, 101–102
    test theory, questions/answers, 102–103
    test-theory assumptions, 103–105
    test-theory assumptions, questions/answers, 105–106
    validity, evidence of, 144–145
    validity, interpretation of, 154
    validity evidence, sources of, 147–151
    validity of, 141–142
    validity-reliability relationship, 153–154
    Z scores, 77–80
Test scores, distribution of:
    frequency curve for, 35
    key terms, 41–42
    kurtosis, 35–37
    models/exercises for, 42–45
    skewness, 37–41
Test theory:
    assumptions, 103–105
    equation, 101–102
    questions/answers about, 102–103, 105–106
    sources of error, 102
Testee:
    potential test dangers, 165–166
    rights of, 164–165, 166
Tester:
    competence of, 163–164
    potential test dangers, 165–166
    rights of testees and, 164–165
Testing, ethics of:
    competence of tester, 163–164
    dangers, potential, 165–166
    rights of testees, 164–165, 166
Test-retest reliability, 124–125
Tolkien, J. R. R., 86–87
Total test scores, 11
True score:
    assumptions about, 103–104
    means for test theory equation, 101–102
    models/exercises for, 107–110
    reliability and variance, 123
    reliability of test and, 121
    variance, item difficulty and, 114
True zero. See Absolute zero
Unfair discrimination, 165–166
Ungrouped frequency distributions, 19–21
Upper limit (UL), 23–24
Urbina, S., 115
Validation groups, 144
Validity:
    construct underrepresentation, 145
    construct-irrelevant variance, 145
    criteria, 144–145
    description of, 141–142
    important points about, 154–155
    interpretation of test validity, 154
    key terms, 155
    models/exercises for, 155–162
    questions/answers about, 142–143, 146
    reliability and, 153–154
    sources of validity evidence, 147–151
    sources of validity evidence, questions/answers about, 151–153
    validation groups, 144
Validity coefficient:
    construct-irrelevant variance and, 145
    criterion for validity evidence, 144–145
    description of, 142
    reliability coefficient and, 154
Variance:
    calculation of, 59
    of error score, 104–105
    of scores, 63
    of Z score, 77
    reliability and, 123
    square root of, 59–60
    true score variance, item difficulty and, 114
Vertical axis (y-axis), 35
Wechsler Intelligence Scale for Children, Fourth Edition (WISC-IV), 85
Width, class interval, 22–23
X. See Scores
X-axis (horizontal axis), 35
Y-axis (vertical axis), 35
Zero, 3–5
    See also Absolute zero
Z score:
    description of, 77
    IQ score from, 84–85
    questions/answers about, 81–83, 86–90
    raw score conversion to, 90–91
    T score from, 84
    table, 168–170
    transformation of raw score into, 77–80
    Z table, 82–83, 168–170
About the Authors