Lecture 2

This document discusses the goodness of measures in terms of reliability and validity. Reliability refers to the consistency of a measure and is assessed through test-retest reliability, alternate-form reliability, and internal consistency reliability. Validity refers to what a measure actually measures and is assessed through face validity, content validity, and construct validity. Reliability and validity are important for determining the quality of a measure.


GOODNESS OF MEASURES: RELIABILITY AND VALIDITY


Reliability
• Definition
• The degree of stability exhibited when a
measurement is repeated under identical
conditions.
• Lack of reliability may arise from:
– divergences between observers or instruments of measurement, or
– instability of the attribute being measured.
Assessment of reliability
• Reliability is assessed in three forms:
– Test-retest reliability
– Alternate-form reliability
– Internal consistency reliability
Test-retest reliability
• Most common form in surveys
• Measured by having the same respondents
complete a survey at two different points in time
to see how stable the responses are.
• Usually quantified with a correlation coefficient (r
value)
• In general, r values are considered good if r ≥ 0.70 (a short calculation sketch appears at the end of this section).
• If data are recorded by an observer, you can have
the same observer make two separate
measurements.
• The comparison between the two measurements
is intra-observer reliability.
• Be careful about test-retest with items or scales
that measure variables likely to change over a
short period of time, such as happiness, anxiety,
etc.
• If you do it, make sure that you test-retest over
very short periods of time.
• A potential problem with test-retest is the practice effect.
– Individuals become familiar with the items and
simply answer based on their memory of the
last answer.
• What effect does this have on your reliability
estimates?
• It inflates the reliability estimate.
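
As a minimal sketch of the quantification step, assuming hypothetical scores from five respondents who completed the same item at two points in time, Pearson's r can be computed with Python's statistics.correlation (available in Python 3.10+):

```python
# Minimal sketch: test-retest reliability as Pearson's r between two
# administrations of the same item. All scores below are hypothetical.
from statistics import correlation  # Python 3.10+

time1 = [3, 5, 4, 2, 5]  # responses at the first administration
time2 = [3, 4, 4, 2, 5]  # responses from the same respondents at retest

r = correlation(time1, time2)
print(f"test-retest r = {r:.2f}")  # r >= 0.70 is conventionally considered good
```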
Alternate-form reliability
• Use differently worded forms to measure the
same attribute.
• Questions or responses are reworded or their
order is changed to produce two items that are
similar but not identical.
• Be sure that the two items address the same
aspect of behavior with the same vocabulary and
the same level of difficulty.
– Items should differ in wording only.
• It is common to simply change the order of the
response alternatives.
– This forces respondents to read the response
alternatives carefully and thus reduces practice
effect.
Example: Assessment of Depression
Circle one item
Version A:
During the past 4 weeks, I have felt downhearted:
Every day 1
Some days 2
Never 3

Version B:
During the past 4 weeks, I have felt downhearted:
Never 1
Some days 2
Every day 3
You could also change the wording of the response
alternatives without changing the meaning.

Example: Assessment of urinary function


Version A:
During the past week, how often did you usually empty
your bladder?
1 to 2 times per day
3 to 4 times per day
5 to 8 times per day
12 times per day
More than 12 times per day
Version B:
During the past week, how often did you usually empty
your bladder?
Every 12 to 24 hours
Every 6 to 8 hours
Every 3 to 5 hours
Every 2 hours
More than every 2 hours
• You could also change the actual wording of the
question.
– Be careful to make sure that the two items are
equivalent.
– Items with different degrees of difficulty do not
measure the same attribute.
– What might they measure?
• Reading comprehension or cognitive
function.
Example: Assessment of loneliness
Version A:
How often in the past month have you felt alone in the world?
Every day
Some days
Occasionally
Never
Version B:
During the past 4 weeks, how often have you felt a sense of
loneliness?
All of the time
Sometimes
From time to time
Never
• Practice effects may occur even when alternate
forms are used.
• Even though you use different questions on the
parallel form, participants may respond similarly
on the second test because they are familiar with
your question format.
Internal consistency reliability
• Applied not to one item, but to groups of items that are
thought to measure different aspects of the same concept.
• Cronbach’s coefficient alpha
– Measures internal consistency reliability among a
group of items combined to form a single scale.
– It is a reflection of how well the different items
complement each other in their measurement of
different aspects of the same variable or quality.
– Interpreted like a correlation coefficient (α ≥ 0.70 is considered good)
Example: Assessment of physical function
                                                      Limited   Limited    Not
                                                      a lot     a little   limited
Vigorous activities, such as running, lifting heavy
  objects, participating in strenuous sports             1         2         3
Moderate activities, such as moving a table, pushing
  a vacuum cleaner, bowling, or playing golf             1         2         3
Lifting or carrying groceries                            1         2         3
Climbing several flights of stairs                       1         2         3
Bending, kneeling, or stooping                           1         2         3
Walking more than a mile                                 1         2         3
Walking several blocks                                   1         2         3
Walking one block                                        1         2         3
Bathing or dressing yourself                             1         2         3


Calculation of Cronbach’s coefficient alpha
Example: Assessment of emotional health
During the past month: Yes No
Have you been a very nervous person? 1 0
Have you felt downhearted and blue? 1 0
Have you felt so down in the dumps that
nothing could cheer you up? 1 0
Results

Patient      Item 1    Item 2    Item 3    Summed scale score
1              0         1         1              2
2              1         1         1              3
3              0         0         0              0
4              1         1         1              3
5              1         1         0              2
% positive   3/5=.6    4/5=.8    3/5=.6
Calculations
Mean score = 2
Sample variance = 1.5

$$\alpha = \left(1 - \frac{\sum_i (\%\,\text{pos})_i \,(\%\,\text{neg})_i}{\text{Var}}\right)\left(\frac{k}{k-1}\right) = \left(1 - \frac{(.6)(.4) + (.8)(.2) + (.6)(.4)}{1.5}\right)\left(\frac{3}{2}\right) = 0.86$$

Conclude that this scale has good reliability
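
The worked example above can be reproduced in code. This minimal Python sketch implements the same formula (the yes/no-item form of coefficient alpha) on the five patients' responses from the table:

```python
# Minimal sketch: Cronbach's alpha for the three yes/no emotional-health items.
from statistics import variance

# Rows = patients, columns = items 1-3 (1 = yes, 0 = no), as in the table above.
responses = [
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]

k = len(responses[0])                            # number of items (3)
n = len(responses)                               # number of patients (5)
scale_scores = [sum(row) for row in responses]   # summed scale scores
var_total = variance(scale_scores)               # sample variance = 1.5

# Sum of (% positive) * (% negative) over the items
sum_pq = 0.0
for i in range(k):
    p = sum(row[i] for row in responses) / n     # .6, .8, .6
    sum_pq += p * (1 - p)

alpha = (k / (k - 1)) * (1 - sum_pq / var_total)
print(f"Cronbach's alpha = {alpha:.2f}")         # 0.86
```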


• If internal consistency is low, you can add more items or re-examine existing items for clarity.
Validity
• Definition
• Validity is often defined as the extent to which an
instrument measures what it purports to measure.
• It is important in determining whether the
statements in the instrument are relevant to the
study.
Assessment of validity
• Validity is measured in three forms
– Face validity
– Content validity
– Construct validity
Face validity
• Face validity is related to checking whether the
instrument looks as if it measures what it is
supposed to measure.
• To establish face validity, investigators seek
experts to review the instrument for grammar,
organization, appropriateness, and confirmation
that it appears to flow logically.
Content validity
• Subjective measure of how appropriate the items
seem to a set of reviewers who have some
knowledge of the subject matter.
– Usually consists of an organized review of the
survey’s contents to ensure that it contains
everything it should and doesn’t include
anything that it shouldn’t.
• There is no universally accepted standard indicator
of content validity.
• However, calculating the content validity index
(CVI) is one of the most popular ways to evaluate
content validity.
• The CVI is based on experts’ rating of item
relevance.
• A panel of 3 to 10 experts is recommended to rate the instrument when computing its content validity index (CVI).
• In this process, reviewers are informed about the
objective of the study and the operational
definitions of the constructs.
• They are asked to rate each item using a 4-point
scale (1 = not relevant; 2 = somewhat relevant; 3 =
quite relevant; 4 = highly relevant).
• Responses for each item are then dichotomized:
questions that received a 1 or 2 are given a zero,
and items that scored 3 or 4 are given 1 point.
• The Item-level Content Validity Index (I-CVI) is computed by totaling the points for each item and then dividing by the total number of experts.
• For example, if an item was marked as quite
relevant or highly relevant by 5 of the 6 experts,
the item had a total score of 5, which was divided
by the number of experts (5/6 = .83).
• Then, the Scale-level Content Validity Index (S-
CVI) is obtained by averaging all the Item-level
Content Validity Indexes of the instruments.
• Experts have proposed that an Item-level Content Validity Index (I-CVI) of 1.00 is ideal when there are five or fewer experts, while an I-CVI of .83 or higher is recommended when there are more than five experts.
• However, most scholars argue that an I-CVI greater than .78 is acceptable overall.
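
As a minimal sketch, assuming hypothetical ratings from a panel of 6 experts on 3 items, the I-CVI and S-CVI calculations described above look like this in Python:

```python
# Minimal sketch: I-CVI and S-CVI from expert relevance ratings.
# Ratings are hypothetical; 4-point scale (1 = not relevant ... 4 = highly relevant).
ratings = [
    [4, 3, 4, 2, 4, 3],  # item 1: 5 of 6 experts rated 3 or 4 -> I-CVI = .83
    [4, 4, 3, 4, 4, 4],  # item 2: 6 of 6 -> I-CVI = 1.00
    [3, 2, 4, 3, 4, 4],  # item 3: 5 of 6 -> I-CVI = .83
]

n_experts = len(ratings[0])

# Dichotomize: a rating of 3 or 4 earns 1 point, a rating of 1 or 2 earns 0.
i_cvis = [sum(1 for r in item if r >= 3) / n_experts for item in ratings]

# S-CVI = average of the item-level indexes.
s_cvi = sum(i_cvis) / len(i_cvis)

for idx, cvi in enumerate(i_cvis, start=1):
    print(f"I-CVI item {idx}: {cvi:.2f}")
print(f"S-CVI: {s_cvi:.2f}")   # (.83 + 1.00 + .83) / 3 = .89
```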
Construct validity
• Most valuable and most difficult measure of
validity.
• Basically, it is a measure of how meaningful the
scale or instrument is when it is in practical use.
• Convergent validity: scores from two instruments intended to measure the same construct should correlate highly.
• Divergent (discriminant) validity: scores from instruments measuring different constructs should correlate weakly.
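
As a minimal illustration, assuming hypothetical scores on two depression scales and one physical-function scale from the same respondents, convergent evidence would appear as a high correlation between the two depression scales, and divergent evidence as a weak correlation with the physical-function scale:

```python
# Minimal illustration: convergent vs. divergent (discriminant) evidence.
# All scores and scale names below are hypothetical.
from statistics import correlation  # Python 3.10+

depression_a = [10, 14, 7, 20, 12, 16]   # scale A, intended construct: depression
depression_b = [11, 15, 8, 18, 13, 17]   # scale B, intended construct: depression
physical_fn  = [18, 25, 14, 20, 26, 15]  # scale C, intended construct: physical function

# Convergent: two measures of the same construct should correlate highly.
print(f"A vs. B: r = {correlation(depression_a, depression_b):.2f}")  # ~0.98

# Divergent: measures of different constructs should correlate weakly.
print(f"A vs. C: r = {correlation(depression_a, physical_fn):.2f}")   # ~0.23
```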
