
2

BASIC PRINCIPLES OF MEASUREMENT

In this chapter we will review some basic principles of measurement. The major purpose of this review is to acquaint you with some
of the terms we use in Part II and in Volume 2 in describing the instruments
there, and to help you in selecting measures. Since the topic can be difficult,
we will try to present it in an understandable way by dividing the material into
four areas: a definition of measurement, the research principles defining a
"good" measure, the statistical principles involved in interpreting scores, and
the practice principles involved in using measures.

MEASUREMENT DEFINED

Measurement can actually be defined simply, although some writers have developed complex, technical and intimidating definitions. Researchers in
measurement are like other scientists as they, too, create a new vocabulary for
concepts most of us already know. For example, a mathematician works with
"integers," while the rest of us use numbers.
Most simply, measurement is the systematic process of assigning a number
to "something" (Nunnally, 1978, p. 3). The "thing" is known as a variable.
The variables of concern to clinical practice tend to be a client's behavior,
thoughts, feelings, or situation (the dependent or outcome variable); treatment
goals; and theoretical concepts such as self-esteem. The number assigned
represents a quantified attribute of the variable.
This simple definition of measurement reflects the process of quantifying
the "thing." Quantification is beneficial in work with clients because it allows


you to monitor change mathematically. The mathematical procedures of interest to us are addition, subtraction, multiplication, and division. For example,
suppose you have successfully treated "an explosive personality" such that at
the end of therapy you conclude the "explosive personality is in remission." By
assigning a number to, say, the symptom of anger by using Spielberger's (1982)
State-Trait Anger Scale, you can mathematically monitor how this symptom
has changed during treatment. It would be impossible, on the other hand, to
subtract an explosive personality in remission from an explosive personality.
In determining which math procedures to use, the first consideration is
known as the "level of measurement," of which there are four: nominal, ordinal,
interval, and ratio. These four levels differ from each other by the presence or
absence of four characteristics: exclusiveness, order, equivalency, and abso-
luteness.
The nominal level of measurement possesses only exclusiveness, which
means the number assigned to an attribute is distinct from others as it represents
one and only one attribute. With a nominal level the number is essentially the
same as a name given to the attribute. An example of a nominal variable is "sex"
with the attributes of "male" and "female." It is impossible to use mathematics
with nominal measures, just as it is with terms like "explosive personality."
The ordinal level of measurement possesses exclusiveness, but is also
ordered. To say the numbers are ordered means the numbers have a ranking.
An example of an ordinal level of measurement is the severity of a client's
problems or the ranking of a client's social functioning, as described in the
DSM-IV (American Psychiatric Association, 1994). With ordinal measures you
can compare the relative positions between the numbers assigned to some
variable. For example, you can compare "poor functioning"-which is as-
signed the number 5-in relation to "superior functioning"-which is assigned
the number 1. While you can compare the relative rankings of an ordinal level
of measurement, you cannot determine how much the two attributes differ by
using the mathematic procedures of addition, subtraction, multiplication, and
division. In other words, you cannot subtract "superior" from "poor."
The use of math procedures is appropriate with measures at the interval
level. An interval measure possesses exclusiveness and is ordered, but differs
from ordinal measures in having equivalency. Equivalency means the distances
between the numbers assigned to an attribute are equal. For example, the
difference, or distance, between 2 and 4 on a seven-point scale is equal to the
difference between 5 and 7.
With interval level measures, however, we do not have an absolute zero.
This means either that complete absence of the variable never occurs or that
our tool cannot assess it. For example, self-esteem is never completely absent
as illustrated with the Index of Self-Esteem (Hudson, 1992). Consequently, we
really should not multiply or divide scores from interval level measures since a score of 60 on an assertiveness questionnaire, for example, may not reflect twice as much assertion as a score of 30. Another example of this issue is temperature; 90 degrees is not twice as hot as 45 degrees.
The problem caused by not having an absolute zero is resolved by assuming
the measurement tool has an arbitrary zero. By assuming there is an arbitrary
score which reflects an absence of the variable, we can use all four mathematical
procedures: addition, subtraction, multiplication, and division.
The fourth level of measurement is ratio. Measures that are at a ratio level
possess exclusiveness, order, equivalency, and have an absolute zero. The only
difference between a ratio and an interval level of measurement is this char-
acteristic. With an interval level measure, we had to assume an arbitrary zero
instead of actually having and being able to measure an absolute zero. This
means there is an absolute absence of the variable which our measurement tool
can ascertain. Ratio measures are fairly rare in the behavioral and social
sciences, although numerous everyday variables can be measured on a ratio
scale, such as age, years of marriage, and so on. However, the assumption of
an arbitrary zero allows us to use all the mathematical procedures to monitor
our practice.
The benefits of measurement, namely our ability to better understand what is happening in our practice, are a result of our using math to help us monitor practice. The level of measurement that results from how a number is assigned to something determines what math procedures you will be able to use. As the remainder of this book will show you, measures at the interval and ratio levels allow you to determine the effects of your intervention most clearly and, therefore, to be a more accountable professional.

RESEARCH PRINCIPLES UNDERLYING MEASUREMENT

A good measure essentially is one that is reliable and valid. Reliability refers
to the consistency of an instrument in terms of the items measuring the same
entity and the total instrument measuring the same way every time. Validity
pertains to whether the measure accurately assesses what it was designed to
assess. Unfortunately, no instrument available for clinical practice is com-
pletely reliable or valid. Lack of reliability and lack of validity are referred to
as random and systematic error, respectively.

Reliability

There are three basic approaches to determining an instrument's reliability: whether the individual items of a measure are consistent with each other; whether scores are stable over time; and whether different forms of the same
instrument are equal to each other. These approaches to estimating reliability
are known as internal consistency, test-retest reliability, and parallel forms
reliability.

Internal consistency. Items of an instrument that are not consistent with one another are most likely measuring different things, and thus do not
contribute to-and may detract from-the instrument's assessment of the
particular variable in question. When using measures in practice you will want
to use those tools where the items are all tapping a similar aspect of a particular
construct domain.
The research procedure most frequently used to determine if the items are
internally consistent is Cronbach's coefficient alpha. This statistic is based on
the average correlations among the items. A correlation is a statistic reflecting
the amount of association between variables. In terms of reliability, the alpha
coefficient has a maximum value of 1.0. When an instrument has a high alpha it
means the items are tapping a similar domain, and, hence, that the instrument
is internally consistent. While there are no hard and fast rules, an alpha coeffi-
cient exceeding .80 suggests the instrument is more or less internally consistent.
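
To make the computation concrete, here is a minimal sketch, in Python, of coefficient alpha calculated from a small set of hypothetical item responses. The formula used, k/(k − 1) × (1 − sum of item variances ÷ variance of total scores), is the standard one, but the data shown are purely illustrative.

```python
# Minimal sketch: Cronbach's coefficient alpha for a small, hypothetical
# set of item responses (rows = respondents, columns = items).

def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(responses):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(responses[0])                                   # number of items
    items = [[row[i] for row in responses] for i in range(k)]
    totals = [sum(row) for row in responses]                # each person's total score
    item_var = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Five respondents answering four items on a 1-5 scale (hypothetical data).
data = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
]

print(f"alpha = {cronbach_alpha(data):.2f}")  # values above .80 suggest internal consistency
```
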
In addition to Cronbach's alpha, there are other similar methods for
estimating internal consistency. The essential logic of the procedures is to
determine the correlation among the items. There are three frequently en-
countered methods. The "Kuder-Richardson formula 20" is an appropriate
method for instruments with dichotomous items, such as true-false and forced-
choice questions. "Split-half reliability" is a method that estimates the consis-
tency by correlating the first half of the items with the second half; this method
can also divide the items into two groups by randomly assigning them. A special
form of split-half reliability is known as "odd-even." With odd-even reliability
the odd items are correlated with the even items. Any method of split-half
reliability underestimates consistency because reliability is influenced by the
total number of items in an instrument. Because of this, you will often find
references to the "Spearman-Brown formula," which corrects for this under-
estimation.
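
The logic of the odd-even split can be sketched the same way. The example below, again with invented responses, correlates odd-item and even-item half-scores and then applies the Spearman-Brown correction to offset the underestimation just described.

```python
# Odd-even split-half reliability with the Spearman-Brown correction,
# using hypothetical item responses (rows = respondents, columns = items).

def pearson(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(responses):
    # Score the odd items (index 0, 2, ...) and even items (index 1, 3, ...) separately.
    odd = [sum(row[0::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    r_half = pearson(odd, even)
    # Spearman-Brown: corrects for the half-length test underestimating reliability.
    return (2 * r_half) / (1 + r_half)

data = [
    [4, 5, 4, 4, 5, 4],
    [2, 2, 3, 2, 2, 3],
    [3, 4, 3, 3, 4, 4],
    [5, 5, 4, 5, 5, 5],
    [1, 2, 2, 1, 2, 1],
]

print(f"corrected split-half reliability = {split_half_reliability(data):.2f}")
```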

Test-retest reliability. Reliability also can be assessed in terms of the consistency of scores from different administrations of the instrument. If in
actuality the variable has not changed between the times you measure it, then
the scores should be relatively similar. This is known as test-retest reliability
and refers to the stability of an instrument. Test-retest reliability is also
estimated from a correlation. A strong correlation, say above .80, suggests that
the instrument is more or less stable over time.
When you use an instrument over a period of time (for example, before, during, and after therapy), test-retest reliability becomes very important. How, after
all, can you tell if the apparent change in your client's problem is real change
if the instrument you use is not stable? Without some evidence of stability you
are less able to discern if the observed change was real or simply reflected error
in your instrument. Again, there are no concrete rules for determining how
strong the test-retest coefficient of stability needs to be. Correlations of .69 or
better for a one-month period between administrations is considered a "reason-
able degree of stability" (Cronbach, 1970, p. 144). For shorter intervals, like a
week or two, we suggest a stronger correlation, above .80, as an acceptable
level of stability.
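
In practice, estimating stability amounts to correlating the two sets of scores. A brief sketch with hypothetical client scores from two administrations, two weeks apart, follows.

```python
# Test-retest stability: correlate scores from two administrations of the
# same instrument (hypothetical client scores, two weeks apart).
from statistics import correlation  # available in Python 3.10+

time1 = [42, 35, 50, 28, 61, 47, 33]
time2 = [44, 33, 52, 30, 58, 45, 36]

r = correlation(time1, time2)
print(f"test-retest r = {r:.2f}")
if r >= 0.80:
    print("meets the suggested threshold for a short retest interval")
```
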

Parallel forms. A third way to assess reliability is to determine if two forms of the same instrument are correlated. When two forms of the same
measure exist, such as the long and short forms of the Rathus Assertiveness
Schedule (Rathus, 1973; McCormick, 1985), then the scores for each should
be highly correlated. Here, correlations of above .80 are needed to consider two
parallel forms consistent.

Error. All three approaches to reliability are designed to detect the absence of error in the measure. Another way to look at reliability is to estimate
directly the amount of error in the instrument. This is known as the standard
error of measurement (SEM), and is basically an estimate of the standard
deviation of error. The SEM is calculated by multiplying the standard deviation
by the square root of 1 minus the reliability coefficient (SEM = SD × √(1 − r)). As an
index of error, the SEM can be used to determine what change in scores may be
due to error. For example, if the instrument's SEM is 5 and scores changed from
30 to 25 from one administration to the next, this change is likely due to error in
the measurement. Thus, only change greater than the SEM may be considered
real change.
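
A small sketch of this calculation follows, with an illustrative standard deviation and reliability coefficient rather than values from any particular instrument.

```python
# Standard error of measurement: SEM = SD * sqrt(1 - reliability),
# then used to judge whether an observed change exceeds likely error.
# The numbers below are illustrative only.

sd = 12.0            # standard deviation of scores on the instrument
reliability = 0.80   # e.g., coefficient alpha or test-retest r

sem = sd * (1 - reliability) ** 0.5
print(f"SEM = {sem:.1f}")

score_before, score_after = 30, 25
change = abs(score_before - score_after)
if change > sem:
    print("change exceeds the SEM and may reflect real change")
else:
    print("change is within the SEM and may simply be measurement error")
```
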
The SEM is also an important way to consider reliability because it is less
easily influenced by differences in the samples from which reliability is
estimated. The SEM has limitations because the number reflects the scale range
of the measurement tool. You cannot directly compare the SEM from different
instruments unless they have the same range of scores. For example, Zung's
(1965) Self-Rating Depression scale has a range of 20 to 80. Hudson's (1992)
Generalized Contentment scale, however, has a range of zero to 100. The size
of the SEM of both scales is affected by their respective ranges. However, in
general, the smaller the SEM, the more reliable the instrument (the less
measurement error).
One way to solve this problem is to convert the SEM into a percentage, by
dividing the SEM by the score range and multiplying by 100. This gives you
the percentage of scores which might be due to error. By making this conversion
you will be able to compare two instruments and, all other things being equal,
use the one with the least amount of error.
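
As an illustration, the sketch below expresses SEMs for the two scales just mentioned as percentages of their score ranges; the ranges come from the text, but the SEM values themselves are invented for the example.

```python
# Comparing SEMs across instruments with different score ranges by
# expressing each SEM as a percentage of its range.

def sem_percent(sem, low, high):
    """SEM as a percentage of the instrument's score range."""
    return sem / (high - low) * 100

zung_pct = sem_percent(4.0, 20, 80)   # hypothetical SEM of 4.0 on the 20-80 Zung scale
gcs_pct = sem_percent(4.5, 0, 100)    # hypothetical SEM of 4.5 on the 0-100 GCS

print(f"Zung SDS: {zung_pct:.1f}% of the score range may be error")
print(f"GCS:      {gcs_pct:.1f}% of the score range may be error")
```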

Validity

The validity of an instrument refers to how well it measures what it was designed to measure. There are three general approaches to validity: content
validity, criterion validity, and construct validity. The literature, however, is
full of inconsistent-and occasionally incorrect-use of these terms.

Content validity. Content validity assesses whether the substance of the items taps the entity you are trying to measure. More specifically, since it is not
possible to ask every question about your client's problem, content validity
indicates whether these particular scale items are a representative sample of the
content area.
There are two basic approaches to content validity, face and logical content
validity. Face validity asks if the items appear on the surface to tap the content.
Face validity is determined by examining the items and judging if they appear
to actually be measuring the content they claim to be measuring. To illustrate
this, select any instrument from Part II and look at the items. In your judgment
do they look like they measure the content they are supposed to? If so, you
would say the instrument has face validity.
This exercise demonstrates the major problem with face validity: it is
basically someone's subjective judgment. Logical content validity, however,
is more systematic. It refers to the procedure the instrument developer used to
evaluate the content of the items and whether they cover the entire content
domain. That is, do the items on the measure appear to be representative of all
content areas that should be included? When this information is available, it
will be presented by the researcher in a manual or in the research article on the
instrument. While you will want to use an instrument that has logical content
validity, the necessary information frequently is not available. Consequently,
you will often have to settle for your own judgment of face validity.

Criterion validity. This approach to validity has several different names, and therefore generates a great deal of confusion. It is also known as empirical
validity or predictive validity, among other terms. In general, criterion validity
asks whether the measure correlates significantly with other relevant variables.
Usually, these other variables are already established as valid measures. There
are two basic types of criterion validity: predictive validity asks whether the
instrument is correlated with some event that will occur in the future; concurrent validity refers to an instrument's correlation with an event that is assessed at the same time the measure is administered.
These approaches to validity are empirically based and are more sophisti-
cated than content validity. Quite simply, both approaches to criterion validity
are estimates of an instrument's association with some other already valid
measure where you would most likely expect to find a correlation. When such
information is available, you can be more confident that your instrument is
accurately measuring what it was designed to measure.
Another approach to criterion validity is known-groups validity. This
procedure (sometimes called discriminant validity) compares scores on the
measure for a group that is known to have the problem with a group that is known
not to have the problem. If the measure is valid, then these groups should have
significantly different scores. Different scores support the validity by sug-
gesting that the measure actually taps the presence of the variable.

Construct validity. The third type of validity asks whether the instrument taps a particular theoretical construct. For example, does the Splitting Scale
(Gerson, 1984) really measure this defense mechanism? The answer can be
partially determined from criterion validity, of course, but a more convincing
procedure is construct validation. In order to consider an instrument as having
construct validity, it should be shown to have convergent validity and discrim-
inant validity, although some authors use the terms as if they were separate
types of validity.
Convergent validity asks if a construct, such as loneliness, correlates with
some theoretically relevant variable, such as the amount of time a person spends
by him or herself. Other examples could be whether the measurement of
loneliness correlates with the number of friends or feelings of alienation. In
other words, do scores on a measure converge with theoretically relevant
variables? With convergent validity you want to find statistically significant
correlations between the instrument and other measures of relevant variables.
Discriminant validity (sometimes called divergent validity), on the other hand,
refers to the way theoretically nonrelevant and dissimilar variables should not
be associated with scores on the instrument. Here you would want to find
instruments that are not significantly correlated with measures with which they
should not be correlated, in order to believe the score is not measuring
something theoretically irrelevant.
Another approach to construct validity is factorial validity, by which
researchers determine if an instrument has convergent and discriminant validity
(Sundberg, 1977, p. 45). Factorial validity can be determined with a statistical
procedure known as factor analysis designed to derive groups of variables that
measure separate aspects of the problem, which are called "factors." If the
variables were similar, they would correlate with the same factor and would suggest convergent validity. Since this statistical procedure is designed to detect relatively uncorrelated factors, variables not associated with a particular factor
suggest discriminant validity.
This approach to factorial validity is often a statistical nightmare. First of
all, one needs a large number of subjects to use the statistic appropriately.
Moreover, the statistical procedure has numerous variations which are fre-
quently misapplied, and the specific values used for decision making-known
as eigenvalues-may not be sufficiently stringent to actually indicate the
variables form a meaningful factor. Consequently, the construct validity find-
ings can be misleading (Cattell, 1966; Comrey, 1978).
A second way to estimate factorial validity is to determine if individual
items correlate with the instrument's total score and do not correlate with
unrelated variables. This procedure, nicely demonstrated by Hudson (1982),
again tells you if the instrument converges with relevant variables and differs
from less relevant ones. Factorial validity, then, helps you decide if the
theoretical construct is indeed being measured by the instrument.

Summary

Let us now summarize this material from a practical point of view. First of
all, to monitor and evaluate practice you will want to use instruments that are
reliable and valid. You can consider the instrument to be reliable if the items
are consistent with each other and the scores are stable from one administration
to the next; when available, different forms of the same instrument need to be
highly correlated in order to consider the scores consistent.
Additionally, you will want to use measures that provide relatively valid
assessments of the problem. The simplest way to address this issue is to examine
the content of the items and make a judgment about the face validity of the
instrument. More sophisticated methods are found when the researcher corre-
lates the scores with some criterion of future status (predictive validity) or
present status (concurrent validity). At times you will find measures that are
reported to have known-groups validity, which means the scores are different
for groups known to have and known not to have the problem. Finally, you may
come across instruments that are reported to have construct validity, which
means that within the same study, the measure correlates with theoretically
relevant variables and does not correlate with nonrelevant ones.
For an instrument to be valid, it must to some extent be reliable. Conversely,
a reliable instrument may not be valid. Thus, if an instrument reports only in-
formation on validity, we can assume some degree of reliability but if it reports
only information on reliability, we must use more caution in its application.
It is important to look beyond these validity terms in assessing research or
deciding whether to use a scale, because many researchers use the terms
inconsistently. Moreover, we must warn you that you will probably never find
all of this information regarding the reliability and validity of a particular
instrument. Since no measure in the behavioral and social sciences is com-
pletely reliable and valid, you will simply have to settle for some error in
measurement. Consequently, you must also be judicious in how you interpret
and use scores from instruments. However, we firmly believe that a measure
with some substantiation is usually better than a measure with no substantiation or no measure at all.

STATISTICAL PRINCIPLES OF INTERPRETATION

We are now ready to discuss the next of our four basic sets of principles: how to make the number assigned to the variable being measured meaningful.
In order for a number to be meaningful it must be interpreted by comparing
it with other numbers. This section will discuss different scores researchers may
use when describing their instruments and methods of comparing scores in
order to interpret them. If you want to understand a score, it is essential that
you be familiar with some elementary statistics related to central tendency and
variability.

Measures of Central Tendency and Variability

In order to interpret a score it is necessary to have more than one score. A group
of scores is called a sample, which has a distribution. We can describe a sample
by its central tendency and variability.
Central tendency is commonly described by the mean, mode, and median
of all the scores. The mean is what we frequently call the average (the sum of
all scores divided by the number of scores). The mode is the most frequently
occurring score. The median is that score which is the middle value of all the
scores when arranged from lowest to highest. When the mean, mode, and
median all have the same value, the distribution is called "normal." A normal
distribution is displayed in Figure 2.1.
By itself, a measure of central tendency is not very informative. We also
need to know how scores deviate from the central tendency, which is variability.
The basic measures of variability are range, variance, and standard deviation.
The range is the difference between the lowest and highest scores. The number
tells us very little, besides the difference between the two extreme scores.

Figure 2.1. Normal Distribution and Transformed Scores (a normal curve over test scores from −4 to +4 standard deviations around the mean, with panels showing the corresponding percentile ranks, Z scores, T scores, CEEB scores, and stanines).

The variance, on the other hand, is a number representing the entire area of the
distribution, that is, all the scores taken together, and refers to the extent to
which scores tend to cluster around or scatter away from the mean of the entire
distribution. Variance is determined by the following formula:

variance = Σ(X − M)² ÷ (n − 1)

In this formula X represents each score, from which the mean (M) is subtracted. The result is then squared and summed (Σ); this total is then divided by the sample size (n) minus one.
The square root of the variance is the standard deviation, which reflects the
deviation from the mean, or how far the scores on a measure deviate from the
mean. The standard deviation can also indicate what percentage of scores is
higher or lower than a particular score. With a normal distribution, half of the
area is below the mean and half is above. One standard deviation above the
mean represents approximately 34.13% of the area away from the mean. As
Figure 2.1 illustrates, an additional standard deviation incorporates another 13.6% of the distribution, while a third standard deviation includes approximately 2.1% more, and a fourth represents about .13%.
These concepts are important because you can use them with different types
of scores, as well as with some methods of comparing scores; when they are
available, we will present them for each scale in the book.
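
Before turning to types of scores, here is a short sketch of the mean, sample variance, and standard deviation computed for a handful of hypothetical scores, following the formulas above.

```python
# Mean, sample variance, and standard deviation for a small set of
# hypothetical scores.

scores = [12, 15, 9, 18, 11, 14, 16]

n = len(scores)
mean = sum(scores) / n
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)  # sigma (X - M)^2 / (n - 1)
sd = variance ** 0.5                                        # square root of the variance

print(f"mean = {mean:.2f}, variance = {variance:.2f}, SD = {sd:.2f}")
```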

Raw scores and transformed scores. We now consider some of the different types of scores you might obtain yourself or come across in the
literature. The basic types are raw scores and transformed scores, of which we
will consider percentile ranks, standard scores, and three different standardized
scores. Raw scores are the straightforward responses to the items on a measure.
In your use of an instrument to monitor your practice you will most likely only
need to use raw scores.
Raw scores from instruments with different possible ranges of scores,
however, cannot be compared with each other. Consider the same issue we
discussed earlier in terms of comparing different SEMs. A score of 30 on a 20
to 80 scale cannot be compared directly with a score of 35 on a zero to 100
scale. In order to compare scores from such measures you will need to trans-
form the scores so that the ranges are the same.
Probably the most widely used transformed score-although not the best
for our purposes-is a percentile rank. A percentile rank represents the propor-
tion of scores which are lower than a particular raw score. With percentile ranks
the median score is the 50th percentile, which is displayed in Panel A of Figure
2.1. Because percentile ranks concern one person's score in comparison to
others' scores in the same sample, their use in practice evaluation is infrequent.
Rather, you will be concerned with your client's score in relation to his or her
previous scores.
You could use percentile rank, however, by comparing the percentile rank
at the end of treatment with that at the beginning of treatment. To do so, you
would simply count all scores with values less than the one you are interested
in, divide by the total number of scores, and then multiply by 100. If you did
this with two instruments with different ranges, you could compare the perfor-
mances as reflected in percentile ranks.
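
A minimal sketch of that computation, with an invented sample of scores, follows.

```python
# Percentile rank as described above: count the scores below the score of
# interest, divide by the total number of scores, and multiply by 100.

def percentile_rank(score, sample):
    below = sum(1 for s in sample if s < score)
    return below / len(sample) * 100

sample = [22, 35, 41, 28, 50, 33, 47, 39, 26, 44]  # hypothetical scores

print(f"pretreatment score of 44  -> {percentile_rank(44, sample):.0f}th percentile")
print(f"posttreatment score of 28 -> {percentile_rank(28, sample):.0f}th percentile")
```
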
A more useful transformed score is the standard score. Standard scores are
also known as "Z scores." Standard scores convert a raw score to a number that
reflects its distance from the mean. This transformed score derives its name
from the standard deviation as it is essentially a measure of the score in terms
of the extent of its deviation from the mean. The standard score, therefore,
usually has a range from -4 to +4. A standard score of +1 would indicate that the score falls one standard deviation above the mean, with about 34.1% of the distribution lying between it and the mean. With standard scores the mean is always
zero. The standard score is displayed in Panel B of Figure 2.1.
22 Measurement and Practice

Standard scores are derived from the following formula:

Z = (X − M) ÷ SD

where Z is the transformed score, X is the raw score, M is the sample mean, and
SD is the standard deviation. By using this formula you can transform raw
scores from different instruments to have the same range, a mean of zero and
a standard deviation of one. These transformed scores then can be used to
compare performances on different instruments. They also appear frequently
in the literature.
Raw scores can also be transformed into standardized scores. Standardized
scores are those where the mean and standard deviation are converted to some
agreed-upon convention or standard. Some of the more common standardized
scores you'll come across in the literature are T scores, CEEBs, and stanines.
A T score has a mean of 50 and a standard deviation of 10. T scores are
used with a number of standardized measures, such as the Minnesota Multiphasic Personality Inventory (MMPI). This is seen in Panel C of Figure 2.1. A CEEB, which
stands for College Entrance Examination Board, has a mean of 500 and a
standard deviation of 100. The CEEB score is found in Panel D of Figure 2.1.
The stanine, which is abbreviated from standard nine, has a mean of 5, a standard deviation of approximately 2, and a
range of one to nine. This standardized score is displayed in Panel E of Figure
2.1.
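
To illustrate, the sketch below converts a hypothetical raw score to a Z score and then to the T and CEEB conventions described above; the raw score, mean, and standard deviation are invented for the example.

```python
# Converting a raw score to a Z score and then to two common
# standardized scores (T and CEEB). All values are hypothetical.

raw, mean, sd = 65, 50, 12

z = (raw - mean) / sd     # standard (Z) score: mean 0, SD 1
t = 50 + 10 * z           # T score: mean 50, SD 10
ceeb = 500 + 100 * z      # CEEB score: mean 500, SD 100

print(f"Z = {z:.2f}, T = {t:.1f}, CEEB = {ceeb:.0f}")
```
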
To summarize, transformed scores allow you to compare instruments that
have different ranges of scores; many authors will report data on the instruments
in the form of transformed scores.

Methods of comparing raw and transformed scores. As we stated earlier, in order for a score to be interpreted it must be compared with other scores. We
will consider two basic procedures for comparing scores: norm-referenced
comparisons and self-referenced comparisons.
Norm-referenced comparisons allow you to interpret a score by comparing
it with an established "norm." Ideally, these normative data should be repre-
sentative of a population. When you compare your client's score with a norm,
you can interpret it in terms of how the performance relates to the sample mean
and standard deviation, in other words, how much above or below the mean
your client is in relation to the norm.
Unfortunately, we do not have many instruments for rapid assessment that
have well-established norms. Norm-referenced comparisons also have many
limitations. One limitation is that the samples used to develop instruments are
frequently not representative of a larger population. Secondly, even if the
sample is representative, it is quite possible that your client is dissimilar enough
to make the norm-referenced data noncomparable. Consider two examples. The
Selfism scale (Phares and Erskine, 1984) was designed to measure narcissism
and was developed with undergraduates; these data might not be an appropriate
norm with which to compare a narcissistic client's scores. It would make sense, though, to compare an adult client whose problem involves irrational beliefs with the norms for Shorkey's Rational Behavior Scale (Shorkey and Whiteman, 1977), since these normative data are more representative of adult clients.
ment at a specific point in time. Old normative data, say ten years or older, may
not be relevant to clients you measure today.
To avoid these and other problems with normative comparisons, an alter-
native method of interpreting scores is to compare your client's scores with his
or her previous performance. This is the basis of the single-system evaluation
designs we discussed in Chapter 1 as an excellent way to monitor practice.
When used in single-system evaluation, self-referenced comparison means
you interpret scores by comparing performance throughout the course of
treatment. Clearly, this kind of comparison indicates whether your client's
scores reveal change over time.
There are several advantages to self-referenced comparisons. First of all,
you can usually use raw scores. Additionally, you can be more certain the
comparisons are relevant and appropriate since the scores are from your own
client and not some normative data which may or may not be relevant or
representative. Finally, self-referenced comparisons have the advantage of
being timely-or not outdated-since the assessment is occurring over the
actual course of intervention.

PRACTICE PRINCIPLES INVOLVED IN MEASUREMENT

Having discussed what a "good" instrument is and how to statistically interpret scores, we turn now to the actual use of measures to monitor your client's
progress and evaluate your effectiveness. In choosing to use a particular
instrument you need to consider several factors that determine its practical
value. These include utility, suitability and acceptability, sensitivity, directness,
nonreactivity, and appropriateness. All of these issues concern the practical
value of using instruments in your practice.

Utility

Utility refers to how much practical advantage you get from using an instrument
(Gottman and Leiblum, 1974). In clinical practice, an instrument which helps
you plan-or improve upon-your services, or provides accurate feedback regarding your effectiveness would be considered to have some utility (Nelson, 1981).
Certain features of an instrument influence its utility. Chief among these
are the measure's purpose, length, your ability to score it, and the ease of
interpreting the score. Instruments that tap a clinically relevant problem, are
short, easy to score, and easy to interpret are the most useful for practice.

Suitability and Acceptability

A second element of an instrument's practical value is how suitable its content is to your client's intellectual ability and emotional state. Many instruments require fairly
sophisticated vocabulary or reading levels and may not be suitable for clients
with poor literacy skills or whose first language is not English. Similarly,
instruments may require the ability to discriminate between different emotional
states, a set of skills which may not be developed in young children or severely
disturbed clients. Psychotic clients, for example, are usually unable to accu-
rately fill out most instruments.
If the scores are not accurate reflections of your client's problem, then the
use of the instrument has little practical advantage. Furthermore, in order for
an instrument to have practical value, your client will need to perceive the
content as acceptable (Haynes, 1983), and the process of measuring the problem
throughout treatment as important. If your client does not realize that measuring
his or her problem is important, then the instrument may not be given serious
attention-or even completed at all. Similarly, if your client sees the content as
offensive, which might occur with some of the sexuality instruments, the
responses may be affected. As we will discuss later, in these circumstances you
will need to familiarize your client with the value of measurement in practice,
and select an instrument with an understanding of your client's point of view.

Sensitivity

Since measurement in practice is intended to observe change over time, you will need to use instruments that are sensitive enough to detect those changes. Without
a sensitive instrument your client's progress may go undetected. You might
ask, though, "Aren't scores supposed to be stable as an indicator of reliability?"
Yes, and in fact, you will want an instrument that is both stable and sensitive.
In other words, you will want to use instruments that are stable unless actual
change has occurred. When real change does occur you will want an instrument
that is sensitive enough to reveal that change. Sensitivity usually is revealed
when the instrument is used as a measure of change in a study with actual clients.

Directness

Directness refers to how the score reflects the actual behavior, thoughts, or
feelings of your client. Instruments that tap an underlying disposition from
which you make inferences about your client are considered indirect. Direct
measures, then, are signs of the problem, while indirect ones are symbols of the
problem. Behavioral observations are considered relatively direct measures,
while the Rorschach Inkblot is a classic example of an indirect measure. Most
instruments, of course, are somewhere between these extremes. When deciding
which instrument to use you should try to find ones that measure the actual
problem as much as possible in terms of its manifested behavior or the client's
experience. By avoiding the most indirect measures, not only are you pre-
venting potential problems in reliability, but you can more validly ascertain the
magnitude or intensity of the client's problem.

Nonreactivity

You will also want to try to use instruments that are relatively nonreactive.
Reactivity refers to how the very act of measuring something changes it. Some
methods of measurement are very reactive, such as the self-monitoring of
cigarette smoking (e.g., Conway, 1977), while others are relatively nonreactive.
Nonreactive measures are also known as unobtrusive. Since you are interested
in how your treatment helps a client change, you certainly want an instrument
that in and of itself, by the act of measurement, does not change your client.
At first glance, reactivity may seem beneficial in your effort to change a
client's problem. After all, if measuring something can change it, why not just
give all clients instruments instead of delivering treatment? However, the
change from a reactive measure rarely produces long-lasting change, which
therapeutic interventions are designed to produce. Consequently, you should
attempt to use instruments that do not artificially affect the results, that is, try
to use relatively nonreactive instruments. If you do use instruments that could
produce reactive changes, you have to be aware of this in both your adminis-
tration of the measure-and try to minimize it (see Bloom, Fischer, and Orme,
1999, for some suggestions)-and in a more cautious interpretation of results.

Appropriateness

This final criterion of an instrument's practical value is actually a composite of all the previous principles. Appropriateness refers to how compatible an
instrument is for single-system evaluation. In order for it to be appropriate for
routine use it must require little time for your client to complete and little time
for you to score. Instruments that are lengthy or complicated to score may
provide valuable information, but cannot be used on a regular and frequent basis
throughout the course of treatment because they take up too much valuable
time.
Appropriateness is also considered in the context of the information gained
from using an instrument. In order for the information to be appropriate it must
be reliable and valid, have utility, be suitable and acceptable to your client,
sensitive to measuring real change, and measure the problem in a relatively
direct and nonreactive manner. Instruments that meet these practice principles
can provide important information which allows you to more fully understand
your client as you monitor his or her progress.
