Basic Principles of Measurement
A good measure is essentially one that is reliable and valid. Reliability refers
to the consistency of an instrument in terms of the items measuring the same
entity and the total instrument measuring the same way every time. Validity
pertains to whether the measure accurately assesses what it was designed to
assess. Unfortunately, no instrument available for clinical practice is com-
pletely reliable or valid. Lack of reliability and lack of validity are referred to
as random and systematic error, respectively.
Reliability
Reliability can be estimated in three common ways: by examining whether the items of an instrument are consistent with one another; whether scores are stable over time; and whether different forms of the same instrument are equivalent to each other. These approaches to estimating reliability are known as internal consistency, test-retest reliability, and parallel forms reliability.
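As a rough illustration of the first approach, the short Python sketch below computes coefficient alpha (Cronbach's alpha), a common index of internal consistency. The item responses are entirely hypothetical and the calculation is shown only to make the idea concrete; in practice you would rely on the reliability estimates reported for the instrument itself.

    from statistics import pvariance

    def cronbach_alpha(item_scores):
        """Estimate internal consistency (coefficient alpha).

        item_scores: one list per item, each holding one score per respondent.
        """
        k = len(item_scores)                 # number of items
        n = len(item_scores[0])              # number of respondents
        totals = [sum(item[i] for item in item_scores) for i in range(n)]
        sum_item_var = sum(pvariance(item) for item in item_scores)
        return (k / (k - 1)) * (1 - sum_item_var / pvariance(totals))

    # Hypothetical responses: three items answered by five clients.
    items = [
        [4, 3, 5, 2, 4],
        [4, 2, 5, 3, 4],
        [3, 3, 4, 2, 5],
    ]
    print(f"alpha = {cronbach_alpha(items):.2f}")  # about .87 for these data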
Since you will be measuring your client's problem before, during, and after therapy, test-retest reliability becomes very important. How, after all, can you tell if the apparent change in your client's problem is real change if the instrument you use is not stable? Without some evidence of stability you are less able to discern whether the observed change was real or simply reflected error in your instrument. Again, there are no concrete rules for determining how strong the test-retest coefficient of stability needs to be. A correlation of .69 or better over a one-month period between administrations is considered a "reasonable degree of stability" (Cronbach, 1970, p. 144). For shorter intervals, like a week or two, we suggest a stronger correlation, above .80, as an acceptable level of stability.
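To make this benchmark concrete, here is a minimal Python sketch, using made-up scores, that correlates two administrations of the same instrument and checks the result against the .80 level suggested for short retest intervals.

    from statistics import correlation  # available in Python 3.10 and later

    # Hypothetical scores for eight clients tested one week apart.
    time1 = [22, 35, 18, 41, 30, 27, 39, 25]
    time2 = [24, 33, 20, 40, 28, 29, 37, 26]

    r = correlation(time1, time2)
    print(f"test-retest r = {r:.2f}")
    if r >= 0.80:
        print("acceptable stability for a short retest interval")
    else:
        print("stability is questionable")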
With this reliability information you will be able to compare two instruments and, all other things being equal, use the one with the least amount of error.
Validity
Summary
Let us now summarize this material from a practical point of view. First of
all, to monitor and evaluate practice you will want to use instruments that are
reliable and valid. You can consider the instrument to be reliable if the items
are consistent with each other and the scores are stable from one administration
to the next; when available, different forms of the same instrument need to be
highly correlated in order to consider the scores consistent.
Additionally, you will want to use measures that provide relatively valid
assessments of the problem. The simplest way to address this issue is to examine
the content of the items and make a judgment about the face validity of the
instrument. More sophisticated methods are found when the researcher corre-
lates the scores with some criterion of future status (predictive validity) or
present status (concurrent validity). At times you will find measures that are
reported to have known-groups validity, which means the scores are different
for groups known to have and known not to have the problem. Finally, you may
come across instruments that are reported to have construct validity, which
means that within the same study, the measure correlates with theoretically
relevant variables and does not correlate with nonrelevant ones.
For an instrument to be valid, it must to some extent be reliable. Conversely,
a reliable instrument may not be valid. Thus, if an instrument reports only information on validity, we can assume some degree of reliability, but if it reports only information on reliability, we must use more caution in its application.
It is important to look beyond these validity terms in assessing research or
deciding whether to use a scale, because many researchers use the terms
inconsistently. Moreover, we must warn you that you will probably never find
all of this information regarding the reliability and validity of a particular
instrument. Since no measure in the behavioral and social sciences is com-
pletely reliable and valid, you will simply have to settle for some error in
measurement. Consequently, you must also be judicious in how you interpret
and use scores from instruments. However, we firmly believe that a measure with some substantiation is usually better than a measure with no substantiation or no measure at all.
We are now ready to discuss the next of our four basic sets of principles: how to make the number assigned to the variable being measured meaningful.
In order for a number to be meaningful it must be interpreted by comparing
it with other numbers. This section will discuss different scores researchers may
use when describing their instruments and methods of comparing scores in
order to interpret them. If you want to understand a score, it is essential that
you be familiar with some elementary statistics related to central tendency and
variability.
In order to interpret a score it is necessary to have more than one score. A group
of scores is called a sample, which has a distribution. We can describe a sample
by its central tendency and variability.
Central tendency is commonly described by the mean, mode, and median
of all the scores. The mean is what we frequently call the average (the sum of
all scores divided by the number of scores). The mode is the most frequently
occurring score. The median is that score which is the middle value of all the
scores when arranged from lowest to highest. In a "normal" distribution, the mean, mode, and median all have the same value. A normal distribution is displayed in Figure 2.1.
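Each of these measures of central tendency is simple to compute; the sketch below applies Python's standard library functions to a small, hypothetical set of scores.

    from statistics import mean, median, mode

    scores = [3, 5, 5, 6, 7, 8, 8, 8, 10]  # hypothetical scores

    print("mean   =", mean(scores))    # sum of scores divided by the number of scores
    print("median =", median(scores))  # middle value when the scores are ordered
    print("mode   =", mode(scores))    # most frequently occurring score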
By itself, a measure of central tendency is not very informative. We also
need to know how scores deviate from the central tendency, which is variability.
The basic measures of variability are range, variance, and standard deviation.
The range is the difference between the lowest and highest scores. The number
tells us very little, besides the difference between the two extreme scores.

[Figure 2.1. A normal distribution of test scores, with the horizontal axis marked in standard deviation units from -4 to +4 around the mean. Panels beneath the curve show the corresponding percentile (Panel A), Z score (Panel B), T score (Panel C), CEEB score (Panel D), and stanine (Panel E) scales.]

The variance, on the other hand, is a number representing the entire area of the
distribution, that is, all the scores taken together, and refers to the extent to
which scores tend to cluster around or scatter away from the mean of the entire
distribution. Variance is determined by the following formula:
variance = Σ(X − M)² ÷ (n − 1)
In this formula X represents each score, from which the mean (M) is subtracted. Each result is then squared and the squares are summed (Σ); this sum is then divided by the sample size (n) minus one.
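Expressed in code, the calculation looks like this; the scores are hypothetical, and the sum of squared deviations is divided by n − 1 exactly as in the formula above. The final line takes the square root, which, as explained next, gives the standard deviation.

    from math import sqrt

    scores = [3, 5, 5, 6, 7, 8, 8, 8, 10]  # hypothetical scores
    n = len(scores)
    m = sum(scores) / n                                      # the mean (M)
    variance = sum((x - m) ** 2 for x in scores) / (n - 1)   # sum of squared deviations / (n - 1)
    print(f"variance = {variance:.2f}")
    print(f"standard deviation = {sqrt(variance):.2f}")      # square root of the variance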
The square root of the variance is the standard deviation, which reflects the
deviation from the mean, or how far the scores on a measure deviate from the
mean. The standard deviation can also indicate what percentage of scores is
higher or lower than a particular score. With a normal distribution, half of the
area is below the mean and half is above. One standard deviation above the
mean represents approximately 34.13% of the area away from the mean. As
Figure 2.1 illustrates, an additional standard deviation incorporates another 13.59% of the area.
Raw scores can be transformed into Z scores with the following formula:

Z = (X − M) ÷ SD
where Z is the transformed score, X is the raw score, M is the sample mean, and
SD is the standard deviation. By using this formula you can transform raw scores from different instruments to a common scale, with a mean of zero and a standard deviation of one. These transformed scores then can be used to
compare performances on different instruments. They also appear frequently
in the literature.
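For instance, the following sketch, again with hypothetical raw scores, applies the formula and confirms that the resulting Z scores have a mean of zero and a standard deviation of one.

    from statistics import mean, stdev

    raw = [22, 35, 18, 41, 30, 27, 39, 25]   # hypothetical raw scores
    m, sd = mean(raw), stdev(raw)            # sample mean (M) and standard deviation (SD)

    z = [(x - m) / sd for x in raw]          # Z = (X - M) / SD
    print([round(score, 2) for score in z])
    print(round(mean(z), 2), round(stdev(z), 2))  # close to 0 and 1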
Raw scores can also be transformed into standardized scores. Standardized
scores are those where the mean and standard deviation are converted to some
agreed-upon convention or standard. Some of the more common standardized
scores you'll come across in the literature are T scores, CEEBs, and stanines.
A T score has a mean of 50 and a standard deviation of 10. T scores are
used with a number of standardized measures, such as the Minnesota Multiphasic Personality Inventory (MMPI). This is seen in Panel C of Figure 2.1. A CEEB, which
stands for College Entrance Examination Board, has a mean of 500 and a
standard deviation of 100. The CEEB score is found in Panel D of Figure 2.1.
The stanine, which is an abbreviation of standard nine, has a mean of 5, a standard deviation of approximately 2, and a range of one to nine. This standardized score is displayed in Panel E of Figure 2.1.
To summarize, transformed scores allow you to compare instruments that
have different ranges of scores; many authors will report data on the instruments
in the form of transformed scores.
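Because each standardized score is simply a Z score rescaled to an agreed-upon mean and standard deviation, the conversions are straightforward. The sketch below is illustrative only; it assumes the conventional stanine mean of 5 and standard deviation of about 2, with values rounded and limited to the one-to-nine range.

    def t_score(z):
        return 50 + 10 * z              # T score: mean 50, standard deviation 10

    def ceeb_score(z):
        return 500 + 100 * z            # CEEB score: mean 500, standard deviation 100

    def stanine(z):
        # Stanine: mean 5, standard deviation about 2, rounded and kept within 1-9.
        return max(1, min(9, round(5 + 2 * z)))

    z = 1.5  # a hypothetical Z score, one and a half standard deviations above the mean
    print(t_score(z), ceeb_score(z), stanine(z))  # 65.0 650.0 8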
A score can also be interpreted by comparing it with normative data, that is, with the scores of a group of people who have completed the instrument; this is a norm-referenced comparison. Such norms, however, must be relevant to your client. The Selfism scale (Phares and Erskine, 1984), for example, was designed to measure narcissism but was developed with undergraduates; these data might not be an appropriate norm with which to compare a narcissistic adult client's scores. It would make sense, though, to make norm-referenced comparisons on Shorkey's Rational Behavior Scale (Shorkey and Whiteman, 1977) for an adult client whose problem involves irrational beliefs, since these normative data are more representative of adult clients. Additionally, norms represent performance on an instrument at a specific point in time. Old normative data, say ten years or older, may not be relevant to clients you measure today.
To avoid these and other problems with normative comparisons, an alter-
native method of interpreting scores is to compare your client's scores with his
or her previous performance. This is the basis of the single-system evaluation
designs we discussed in Chapter 1 as an excellent way to monitor practice.
When used in single-system evaluation, self-referenced comparison means
you interpret scores by comparing performance throughout the course of
treatment. Clearly, this kind of comparison indicates whether your client's
scores reveal change over time.
There are several advantages to self-referenced comparisons. First of all,
you can usually use raw scores. Additionally, you can be more certain the
comparisons are relevant and appropriate since the scores are from your own
client and not some normative data which may or may not be relevant or
representative. Finally, self-referenced comparisons have the advantage of
being timely, rather than outdated, since the assessment occurs over the actual course of intervention.
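A self-referenced comparison can be as simple as comparing a client's average score during baseline with the average during intervention, as in the hypothetical sketch below (lower scores indicate less of the problem).

    from statistics import mean

    # Hypothetical weekly scores on a problem scale (lower is better).
    baseline = [42, 40, 44, 41]           # before intervention
    intervention = [38, 35, 31, 29, 27]   # during intervention

    print(f"baseline mean     = {mean(baseline):.1f}")
    print(f"intervention mean = {mean(intervention):.1f}")
    print(f"average change    = {mean(baseline) - mean(intervention):.1f} points lower")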
Utility
Utility refers to how much practical advantage you get from using an instrument
(Gottman and Leiblum, 1974). In clinical practice, an instrument that helps you plan, or improve upon, your services, or that provides accurate feedback on your client's progress, has utility.
Suitability and Acceptability
Sensitivity
Directness
Directness refers to how the score reflects the actual behavior, thoughts, or
feelings of your client. Instruments that tap an underlying disposition from
which you make inferences about your client are considered indirect. Direct
measures, then, are signs of the problem, while indirect ones are symbols of the
problem. Behavioral observations are considered relatively direct measures,
while the Rorschach Inkblot is a classic example of an indirect measure. Most
instruments, of course, are somewhere between these extremes. When deciding
which instrument to use you should try to find ones that measure the actual
problem as much as possible in terms of its manifested behavior or the client's
experience. By avoiding the most indirect measures, not only are you pre-
venting potential problems in reliability, but you can more validly ascertain the
magnitude or intensity of the client's problem.
Nonreactivity
You will also want to try to use instruments that are relatively nonreactive.
Reactivity refers to how the very act of measuring something changes it. Some
methods of measurement are very reactive, such as the self-monitoring of
cigarette smoking (e.g., Conway, 1977), while others are relatively nonreactive.
Nonreactive measures are also known as unobtrusive. Since you are interested
in how your treatment helps a client change, you certainly want an instrument
that in and of itself, by the act of measurement, does not change your client.
At first glance, reactivity may seem beneficial in your effort to change a
client's problem. After all, if measuring something can change it, why not just
give all clients instruments instead of delivering treatment? However, the change produced by a reactive measure is rarely the long-lasting change that therapeutic interventions are designed to produce. Consequently, you should
attempt to use instruments that do not artificially affect the results, that is, try
to use relatively nonreactive instruments. If you do use instruments that could
produce reactive changes, you have to be aware of this both in your administration of the measure, trying to minimize reactivity (see Bloom, Fischer, and Orme, 1999, for some suggestions), and in a more cautious interpretation of the results.
Appropriateness
For an instrument to be appropriate for routine use, it must require little time for your client to complete and little time
for you to score. Instruments that are lengthy or complicated to score may
provide valuable information, but cannot be used on a regular and frequent basis
throughout the course of treatment because they take up too much valuable
time.
Appropriateness is also considered in the context of the information gained
from using an instrument. In order for the information to be appropriate it must
be reliable and valid, have utility, be suitable and acceptable to your client, be sensitive to real change, and measure the problem in a relatively direct and nonreactive manner. Instruments that meet these practice principles
can provide important information which allows you to more fully understand
your client as you monitor his or her progress.