Validity and Reliability in Assessment


This work is a summary of previous efforts by great educators.
A humble presentation by Dr Tarek Tawfik Amin
Measurement experts (and many educators) believe that every measurement device should possess certain qualities.

The two most common technical concepts in measurement are reliability and validity.
Reliability Definition (Consistency)

 The degree of consistency between two measures of the same thing. (Mehrens and Lehman, 1987)
 The measure of how stable, dependable, trustworthy, and consistent a test is in measuring the same thing each time. (Worthen et al., 1993)
Validity Definition (Accuracy)

 Truthfulness: Does the test measure what it purports to measure? The extent to which certain inferences can be made from test scores or other measurements. (Mehrens and Lehman, 1987)

 The degree to which tests accomplish the purpose for which they are being used. (Worthen et al., 1993)

 The term "validity" refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are "well-grounded or justifiable; being at once relevant and meaningful." (Messick S, 1995)

 "Content": related to objectives and their sampling.
 "Construct": referring to the theory underlying the target.
 "Criterion": related to concrete criteria in the real world. It can be concurrent or predictive.
 "Concurrent": correlating highly with another measure already validated.
 "Predictive": capable of anticipating some later measure.
 "Face": related to the test's overall appearance.

The usual (older) concepts of validity contrasted with sources of validity evidence in assessment.


All assessments in medical education require evidence of validity to be interpreted meaningfully.

In contemporary usage, all validity is construct validity, which requires multiple sources of evidence; construct validity is the whole of validity, but has multiple facets. (Downing S, 2003)
Construct (concepts, ideas and notions)

- Nearly all assessments in medical education deal with constructs: intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory.
- Educational achievement is a construct, inferred from performance on assessments: written tests over a domain of knowledge, oral examinations over specific problems or cases in medicine, or OSCEs assessing history-taking or communication skills.
- Educational ability or aptitude is another example of a construct, one that may be even more intangible and abstract than achievement. (Downing, 2003)
Sources of validity in assessment

 Content: do instrument items completely represent the construct?
 Response process: the relationship between the intended construct and the thought processes of subjects or observers.
 Internal structure: acceptable reliability and factor structure.
 Relations to other variables: correlation with scores from another instrument assessing the same construct.
 Consequences: do scores really make a difference?

(Downing 2003; Cook 2007)


Sources of validity in assessment
Content
- Examination blueprint
- Representativeness of test blueprint to achievement domain
- Test specification
- Match of item content to test specifications
- Representativeness of items to domain
- Logical/empirical relationship of content tested to the achievement domain
- Quality of test questions
- Item writer qualifications
- Sensitivity review

Response process
- Student format familiarity
- Quality control of electronic scanning/scoring
- Key validation of preliminary scores
- Accuracy in combining different format scores
- Quality control/accuracy of final scores/marks/grades
- Subscore/subscale analyses:
  1. Accuracy of applying pass-fail decision rules to scores
  2. Quality control of score reporting

Internal structure
- Item analysis data:
  1. Item difficulty/discrimination
  2. Item/test characteristic curves
  3. Inter-item correlations
  4. Item-total correlations (point-biserial)
- Score scale reliability
- Standard errors of measurement (SEM)
- Generalizability
- Item factor analysis
- Differential Item Functioning (DIF)

Relationship to other variables
- Correlation with other relevant variables (exams)
- Convergent correlations, internal/external: similar tests
- Divergent correlations, internal/external: dissimilar measures
- Test-criterion correlations
- Generalizability of evidence

Consequences
- Impact of test scores/results on students/society
- Consequences on learners/future learning
- Reasonableness of the method of establishing the pass-fail (cut) score
- Pass-fail consequences:
  1. P/F decision reliability-accuracy
  2. Conditional standard error of measurement
- False positives/negatives
Sources of validity
1- Internal structure

Statistical evidence of the hypothesized relationship between test item scores and the construct:

1- Reliability (internal consistency):
   - Test scale reliability
   - Rater reliability
   - Generalizability
2- Item analysis data (a brief computational sketch follows this list):
   - Item difficulty and discrimination
   - MCQ option function analysis
   - Inter-item correlations
3- Scale factor structure
4- Dimensionality studies
5- Differential item functioning (DIF) studies
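
To make the item-analysis evidence above concrete, here is a minimal sketch with hypothetical data and helper names (none of them come from the original slides). It computes item difficulty as the proportion of correct responses and item discrimination as the corrected point-biserial correlation between each item and the total score of the remaining items.

```python
import numpy as np

def item_analysis(responses: np.ndarray):
    """Item difficulty and corrected point-biserial discrimination.

    responses: examinees x items matrix of 0/1 scores (hypothetical data).
    """
    responses = np.asarray(responses, dtype=float)
    n_examinees, n_items = responses.shape
    difficulty = responses.mean(axis=0)            # proportion answering correctly
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest_score = responses.sum(axis=1) - responses[:, j]   # total excluding item j
        # Point-biserial = Pearson r between the 0/1 item and the rest-score
        discrimination[j] = np.corrcoef(responses[:, j], rest_score)[0, 1]
    return difficulty, discrimination

# Hypothetical example: 6 examinees, 4 MCQ items scored 0/1
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
])
p, d = item_analysis(scores)
print("Difficulty (p-values):", np.round(p, 2))
print("Discrimination (corrected point-biserial):", np.round(d, 2))
```

Items with very high or very low difficulty, or with low discrimination, are the ones flagged for review in the item-analysis evidence listed above.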
Sources of validity
2- Relationship to other variables

Statistical evidence of the hypothesized relationship between test scores and the construct:

- Criterion-related validity studies
- Correlations between test scores/subscores and other measures
- Convergent-divergent studies (illustrated in the sketch below)
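
A minimal sketch of the correlational evidence listed above, using hypothetical scores (none of the numbers or variable names come from the slides): a new test should correlate strongly with an already-validated measure of the same construct (convergent evidence) and weakly with a measure of a dissimilar construct (divergent evidence).

```python
import numpy as np

# Hypothetical scores for 8 examinees
new_test     = np.array([62, 71, 55, 80, 90, 67, 74, 58], dtype=float)
similar_test = np.array([60, 75, 50, 78, 88, 70, 72, 55], dtype=float)  # validated measure, same construct
unrelated    = np.array([ 3, 14,  9,  7, 11,  2, 13,  6], dtype=float)  # dissimilar construct

convergent_r = np.corrcoef(new_test, similar_test)[0, 1]  # expect high
divergent_r  = np.corrcoef(new_test, unrelated)[0, 1]     # expect low
print(f"Convergent r = {convergent_r:.2f}, divergent r = {divergent_r:.2f}")
```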
Keys of reliability assessment

 "Stability": related to consistency over time.
 "Internal": related to the instrument itself.
 "Inter-rater": related to consistency between examiners.
 "Intra-rater": related to consistency within one examiner.

Validity and reliability are closely related. A test cannot be considered valid unless the measurements resulting from it are reliable. Likewise, results from a test can be reliable and not necessarily valid.
Sources of reliability in assessment

Source of reliability: Internal consistency

Description:
- Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well.)
- We would expect high correlation between item scores measuring a single construct.
- Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument.
- Because instrument halves can be considered "alternate forms," internal consistency can be viewed as an estimate of parallel forms reliability.

Measures:
- Split-half reliability: correlation between scores on the first and second halves of a given instrument. Rarely used because the "effective" instrument is only half as long as the actual instrument; the Spearman-Brown† formula can adjust for this.
- Kuder-Richardson 20: similar concept to split-half, but accounts for all items. Assumes all items are equivalent, measure a single construct, and have dichotomous responses.
- Cronbach's alpha: a generalized form of the Kuder-Richardson formulas. Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data.
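
The internal-consistency measures in this table can be computed directly. The sketch below is a simplified illustration with hypothetical data (not from the slides): split-half reliability with the Spearman-Brown correction, Cronbach's alpha (which equals KR-20 for dichotomous items), and the standard error of measurement derived from the reliability estimate.

```python
import numpy as np

def split_half_reliability(responses: np.ndarray) -> float:
    """Correlate odd-item and even-item scores, then apply the
    Spearman-Brown correction so the estimate reflects the full-length test."""
    odd = responses[:, 0::2].sum(axis=1)
    even = responses[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(responses: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score).
    With dichotomous (0/1) items this is identical to Kuder-Richardson 20."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1).sum()
    total_variance = responses.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical 0/1 response matrix: 6 examinees x 4 items
x = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
], dtype=float)

alpha = cronbach_alpha(x)
print("Split-half (Spearman-Brown corrected):", round(split_half_reliability(x), 2))
print("Cronbach's alpha / KR-20:", round(alpha, 2))

# Standard error of measurement (SEM), listed earlier as internal-structure evidence:
# SEM = SD of total scores * sqrt(1 - reliability)
sem = x.sum(axis=1).std(ddof=1) * np.sqrt(1 - alpha)
print("SEM:", round(sem, 2))
```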
Sources of reliability in assessment
Source of reliability: Temporal stability
- Description: Does the instrument produce similar results when administered a second time?
- Measure: Test-retest reliability; administer the instrument to the same person at different times.
- Comment: Usually quantified using correlation (eg, Pearson's r).

Source of reliability: Parallel forms
- Description: Do different versions of the "same" instrument produce similar results?
- Measure: Alternate forms reliability; administer different versions of the instrument to the same individual at the same or different times.
- Comment: Usually quantified using correlation (eg, Pearson's r).

Source of reliability: Agreement (inter-rater reliability)
- Description: When using raters, does it matter who does the rating? Is one rater's score similar to another's?
- Measures:
  - Percent agreement: percentage of identical responses. Does not account for agreement that would occur by chance.
  - Phi: simple correlation. Does not account for chance agreement.
  - Kappa: agreement corrected for chance.
  - Kendall's tau: agreement on ranked data. Does not account for chance.
  - Intraclass correlation coefficient: ANOVA to estimate how well ratings from different raters coincide.
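
As an illustration of the inter-rater measures above, the following sketch (hypothetical ratings; standard textbook formulas rather than anything specified in the slides) computes percent agreement and Cohen's kappa, which corrects that agreement for chance.

```python
import numpy as np

def percent_agreement(r1, r2) -> float:
    """Proportion of cases on which the two raters gave identical ratings."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    return float(np.mean(r1 == r2))

def cohens_kappa(r1, r2) -> float:
    """kappa = (p_observed - p_chance) / (1 - p_chance)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    p_o = np.mean(r1 == r2)
    # Chance agreement: product of each rater's marginal proportions, summed over categories
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return float((p_o - p_e) / (1 - p_e))

# Hypothetical pass/borderline/fail ratings from two examiners on 10 students
rater_a = ["pass", "pass", "fail", "pass", "borderline", "fail", "pass", "pass", "fail", "borderline"]
rater_b = ["pass", "fail", "fail", "pass", "borderline", "fail", "pass", "borderline", "fail", "borderline"]
print("Percent agreement:", percent_agreement(rater_a, rater_b))
print("Cohen's kappa:", round(cohens_kappa(rater_a, rater_b), 2))
```

Here the raw agreement is high, but kappa is noticeably lower because some of that agreement would be expected by chance alone.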
Sources of reliability in assessment

Source of reliability: Generalizability theory
- Description: How much of the error in measurement is the result of each factor (eg, item, item grouping, subject, rater, day of administration) involved in the measurement process?
- Measure: Generalizability coefficient; a complex model that allows estimation of multiple sources of error.
- Comment: As the name implies, this elegant method is "generalizable" to virtually any setting in which reliability is assessed; for example, it can determine the relative contribution of internal consistency and inter-rater reliability to the overall reliability of a given instrument.
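
Generalizability analyses are usually run in dedicated software, but the core idea can be sketched by hand. The example below is a simplified, assumed illustration (a fully crossed persons x raters design with hypothetical ratings, not taken from the slides): variance components are estimated from a two-way ANOVA and combined into a relative G coefficient for a chosen number of raters.

```python
import numpy as np

def relative_g_coefficient(scores: np.ndarray, n_raters_decision: int) -> float:
    """Relative G coefficient for a fully crossed persons x raters design
    with one rating per cell (hypothetical illustrative model)."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    # Sums of squares for the two-way layout
    ss_p = n_r * np.sum((person_means - grand) ** 2)
    ss_r = n_p * np.sum((rater_means - grand) ** 2)
    ss_res = np.sum((scores - grand) ** 2) - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    var_person = max((ms_p - ms_res) / n_r, 0.0)   # "true score" variance between examinees
    var_residual = ms_res                          # person x rater interaction plus error

    # Error shrinks as more raters are averaged per examinee
    return var_person / (var_person + var_residual / n_raters_decision)

# Hypothetical OSCE station: 5 students, each scored by 3 raters
ratings = np.array([
    [7, 8, 6],
    [5, 7, 4],
    [9, 8, 9],
    [4, 6, 3],
    [6, 8, 5],
], dtype=float)

print("G coefficient with 1 rater:", round(relative_g_coefficient(ratings, 1), 2))
print("G coefficient with 3 raters:", round(relative_g_coefficient(ratings, 3), 2))
```

Running the same calculation with different numbers of raters is the "decision study" step: it shows how much reliability is gained by adding raters to the design.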

* "Items" are the individual questions on the instrument.
The "construct" is what is being measured, such as knowledge, attitude, skill, or symptom in a specific area.
† The Spearman-Brown "prophecy" formula allows one to calculate the reliability of an instrument's scores when the number of items is increased (or decreased).

(Cook and Beckman, Validity and Reliability of Psychometric Instruments, 2007)
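
The Spearman-Brown "prophecy" formula described in the footnote is simple enough to write out. This sketch (hypothetical numbers) projects the reliability of a lengthened test and solves for how much longer a test must be to reach a target reliability, which is the logic behind recommendations such as "at least 35-40 MCQs" later in these slides.

```python
def spearman_brown(r_current: float, length_factor: float) -> float:
    """Projected reliability when the test is length_factor times as long."""
    return length_factor * r_current / (1 + (length_factor - 1) * r_current)

def length_needed(r_current: float, r_target: float) -> float:
    """How many times longer the test must be to reach r_target."""
    return r_target * (1 - r_current) / (r_current * (1 - r_target))

# Hypothetical: a 20-item quiz with reliability 0.60
r20 = 0.60
print("Reliability if doubled to 40 items:", round(spearman_brown(r20, 2), 2))
factor = length_needed(r20, 0.90)
print("Length factor to reach 0.90:", round(factor, 1), "-> about", round(20 * factor), "items")
```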
Keys of reliability assessment

Different types of assessments require different kinds of reliability:

Written MCQs
- Scale reliability
- Internal consistency

Written essays
- Inter-rater agreement
- Rater reliability
- Generalizability theory

Performance exams (OSCEs)
- Rater reliability
- Generalizability theory

Oral exams
- Rater reliability
- Generalizability theory

Observational assessments
- Inter-rater agreement
- Generalizability theory
Keys of reliability assessment

Reliability: how high should it be?

- Very high stakes: > 0.90 (licensure tests)
- Moderate stakes: at least ~0.75 (OSCE)
- Low stakes: > 0.60 (quiz)
Keys of reliability assessment
How to increase reliability?

For written tests
- Use objectively scored formats
- At least 35-40 MCQs
- MCQs that differentiate between high- and low-performing students

For performance exams
- At least 7-12 cases
- Well-trained SPs
- Monitoring, QC

For observational exams
- Many independent raters (7-11)
- Standard checklists/rating scales
- Timely ratings
Conclusion

Validity = meaning
- Evidence to aid interpretation of assessment data
- The higher the test stakes, the more evidence is needed
- Multiple sources or methods
- Ongoing research studies

Reliability
- Consistency of the measurement
- One aspect of validity evidence
- Higher reliability is always better than lower
References

- National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Produced by the Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm.
- Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med. 2003;138:476-481.
- Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. 1998;73:688-695.
- Merriam-Webster Online. Available at: http://www.m-w.com/.
- Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
- Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
- Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971-977.
- Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med. 2002;136:358-367.
- Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA. 2003;290:953-958.
- Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med. 2004;141:1-8.
- Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess. 1995;7:248-260.
- American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
- Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ. 2002;324:606-607.
- Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830-837.
- 2005 Certification Examination in Internal Medicine Information Booklet. Produced by the American Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf.
- Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527-535.
- Messick S. Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741-749.
- Kane MT. Current concerns in validity theory. J Educ Meas. 2001;38:319-342.
- American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
- Downing SM, Haladyna TM. Validity threats: overcoming interference in the proposed interpretations of assessment data. Med Educ. 2004;38:327-333.
- Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess. 1995;7:238-247.
- Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
- Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38:1006-1012.
- Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess. 1995;7:309-319.
Resources
 For an excellent resource on item analysis:
http://www.utexas.edu/academic/ctl/assessment/iar/students/report/itemanalysis.php
 For a more extensive list of item-writing tips:
http://testing.byu.edu/info/handbooks/Multiple-Choice%20Item%20Writing%20Guidelines%20-%20Haladyna%20and%20Downing.pdf
http://homes.chass.utoronto.ca/~murdockj/teaching/MCQ_basic_tips.pdf
 For a discussion about writing higher-level multiple-choice items:
http://www.ascilite.org.au/conferences/perth04/procs/pdf/woodford.pdf
