PsychAssess (Prelim)


LESSON 1
Introduction to Psychological Testing and Assessment

Testing and Assessment
 1905 – Alfred Binet and a colleague published a test designed to help place Paris schoolchildren in appropriate classes.
 WWI – in 1917 the US military needed a way to screen large numbers of recruits quickly for intellectual and emotional problems.
 WWII – the US depended even more on psychological tests to screen recruits for service.

Why distinguish between testing and assessment?

Psychological assessment
- The gathering and integration of psychology-related data for the purpose of making a psychological evaluation that is accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatuses and measurement procedures.

Psychological Testing
- The process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior.

Testing

Objective
- Typically, to obtain some gauge, usually numerical in nature, with regard to an ability or attribute.

Process
- Testing may be individual or group in nature. After test administration, the tester will typically add up "the number of correct answers or the number of certain types of responses . . . with little if any regard for the how or mechanics of such content."

Skill of Evaluator
- Testing typically requires technician-like skills in terms of administering and scoring a test as well as in interpreting a test result.

Outcome
- Typically, testing yields a test score or series of test scores.

Assessment

Objective
- Typically, to answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation.

Process
- Assessment is typically individualized. In contrast to testing, assessment more typically focuses on how an individual processes rather than simply the results of that processing.

Skill of Evaluator
- Assessment typically requires an educated selection of tools of evaluation, skill in evaluation, and thoughtful organization and integration of data.

Outcome
- Typically, assessment entails a logical problem-solving approach that brings to bear many sources of data designed to shed light on a referral question.

Types of Assessment

Collaborative Psychological Assessment
- The assessor and assessee may work as "partners" from initial contact through final feedback.

Therapeutic Psychological Assessment
- Therapeutic self-discovery and new understandings are encouraged throughout the assessment process.

Dynamic Assessment
- An interactive approach to psychological assessment that usually follows a model of (1) evaluation, (2) intervention of some sort, and (3) evaluation.
- Typically employed in educational settings, as well as in correctional, corporate, neuropsychological, and clinical settings.

The Tools of Psychological Assessment

Psychological test
- A device or procedure designed to measure variables related to psychology (for example, intelligence, personality, aptitude, interests, attitudes, and values) (Cohen, 2005).
- A systematic procedure for obtaining samples of behavior, relevant to cognitive or affective functioning, and for scoring and evaluating those samples according to standards (Urbina, 2004).

Test format
- The form, plan, structure, arrangement, and layout of test items as well as related considerations such as time limits.
- Also used to refer to the form in which a test is administered: computerized, pencil-and-paper, or some other form.

Psychological Testing: Types of Tests

Psychological testing
- refers to all the possible uses, applications, and underlying concepts of psychological and educational tests.
- The main use of these tests, though, is to evaluate individual differences or variations among individuals. Such tests measure individual differences in ability and personality and assume that the differences shown on the test reflect actual differences among individuals.

Types of Tests:

Individual tests
- Tests that can be given to only one person at a time.

Group test
- Can be administered to more than one person at a time by a single examiner.

Types of Tests: Based on type of behavior measured
I. Ability tests: Measure skills in terms of speed, accuracy, or both.
A. Achievement: Measures previous learning.
B. Aptitude: Measures potential for acquiring a specific skill.
C. Intelligence: Measures potential to solve problems, adapt to changing circumstances, and profit from experience.
II. Personality tests: Measure typical behavior—traits, temperaments, and dispositions.
A. Structured (objective): Provides a self-report statement to which the person responds "True" or "False," "Yes" or "No."
B. Projective: Provides an ambiguous test stimulus; response requirements are unclear.

Types of Tests

Personality tests
- are related to the overt and covert dispositions of the individual.

Structured personality tests
- provide a statement, usually of the "self-report" variety, and require the subject to choose between two or more alternative responses such as "True" or "False."

Projective personality test
- In which either the stimulus (test materials) or the required response—or both—are ambiguous.
- For example, in the highly controversial Rorschach test, the stimulus is an inkblot.

The Tools of Psychological Assessment

 Scoring is the process of assigning evaluative codes or statements to performance on tests, tasks, interviews, or other behavior samples.
 A cut score (or simply a cutoff) is a reference point, usually numerical, derived by judgment and used to divide a set of data into two or more classifications.
 Psychometrics may be defined as the science of psychological measurement.
 Psychometric soundness of a test refers to how consistently and how accurately a psychological test measures what it purports to measure.
 Psychometric utility refers to the usefulness or practical value that a test or assessment technique has for a particular purpose.

 The Interview
- A method of gathering information through direct communication involving reciprocal exchange.
- May differ in purpose, length, and nature.
- May be used by psychologists in various specialty areas to help make diagnostic, treatment, selection, or other decisions.

 Panel Interview (board interview)
- A presumed advantage of this approach is that any idiosyncratic biases of a lone interviewer will be minimized by the use of two or more interviewers (Dipboye, 1992). A disadvantage of the panel interview relates to its utility: the cost of using multiple interviewers.

 The Portfolio
- constitutes work products—whether retained on paper, canvas, film, video, audio, or some other medium.
- As samples of one's ability and accomplishment, a portfolio may be used as a tool of evaluation, e.g., writing, painting.

 Case History Data
- records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee.
- may include files or excerpts from files maintained at institutions and agencies such as schools, hospitals, employers, religious institutions, and criminal justice agencies.
- Included are letters and written correspondence, photos and family albums, etc.

 Case History or Case Study
- concerns the assembly of case history data into an illustrative account.
- For example, a case study of a successful world leader, of an individual who assassinated a high-ranking political figure, or of groupthink.

 Behavioral Observation
- Monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding the actions.
- Often used as a diagnostic aid in various settings such as inpatient facilities, behavioral research laboratories, and classrooms, as well as for selection purposes in corporate settings.

 Naturalistic Observation
- Observing behavior in a natural setting—that is, the setting in which the behavior would typically be expected to occur.

 Role play
- Acting an improvised or partially improvised part in a simulated situation.

 Role-play Test
- A tool of assessment wherein assessees are directed to act as if they were in a particular situation. Assessees may then be evaluated with regard to their expressed thoughts, behaviors, abilities, and other variables.
- useful in evaluating various skills.
- Clinicians may attempt to obtain a baseline measure of abuse, cravings, or coping skills through role play.

 Computers as Tools
- Their more obvious role as a tool of assessment is their role in test administration, scoring, and interpretation.

 Extended Scoring Report
- includes statistical analyses of the testtaker's performance.

 Interpretive Report
- distinguished by its inclusion of numerical or narrative interpretive statements in the report.

 Consultative Report
- This type of report, usually written in language appropriate for communication between assessment professionals, may provide expert opinion concerning analysis of the data.

 Integrative Report
- Designed to integrate data from sources other than the test itself (medication records or behavioral observation data) into the interpretive report.

 CAPA (Computer-Assisted Psychological Assessment)
- The "assisted" refers to the assistance computers provide to the test user, not the testtaker.

 CAT (Computer Adaptive Testing)
- The "adaptive" in this term is a reference to the computer's ability to tailor the test to the testtaker's ability or testtaking pattern.

 Other Tools
- Specially created videos are widely used in training and evaluation contexts.
- Examples:
- video-presented incidents of sexual harassment in the workplace.
- Police personnel may be asked about how they would respond to various types of emergencies.
- Psychotherapists may be asked to respond with a diagnosis and a treatment plan for clients presented to them on videotape.

Users of tests

 Test developer
- The APA has estimated that more than 20,000 new psychological tests are developed each year.
 Test taker
- Even a deceased individual can be considered an assessee.
 Psychological Autopsy – a reconstruction of a deceased individual's psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee or with people who knew him or her.
- Should observers be parties to the assessment process?

Applications of Assessment

Educational settings

- School ability tests
- Achievement tests
- Diagnostic tests
- Informal evaluation

Clinical settings

- To determine maladjustment
- Effectiveness of psychotherapy
- Learning difficulties
- Expert witnessing
- Forensic settings (prisons)

Counseling settings

- improvement of the assessee in terms of adjustment, productivity, or some related variable.
- Measures of social and academic skills and measures of personality, interest, attitudes, and values.

Geriatric settings

- In the US, more than 12 million adults are currently in the age range of 75 to 84.
- At issue in many such assessments is the extent to which assessees are enjoying as good a quality of life as possible.

Business and military settings

- Recruitment
- Promotion
- Transfer
- Job satisfaction
- Performance
- Product design
- Marketing

Governmental and organizational credentialing

- Governmental licensing, certification, or general credentialing of professionals.

Other settings

- Program evaluation
- Health Psychology

Assessment of people with disabilities

Accommodation

- The adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs.

Alternate assessment

- An evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods designed to measure the same variable(s).
LESSON 2
Historical, Cultural, and Legal/Ethical Considerations

Historical Perspective

China
- State-sponsored multistage examinations were referred to as imperial examinations.
- Tests and testing programs first came into being in China as early as 2200 BCE.
- Testing was instituted as a means of selecting who, of many applicants, would obtain government jobs.
- The content of the examination changed over time and with the cultural expectations of the day:
- music, archery, horsemanship, writing, and arithmetic
- agriculture, geography, revenue, civil law, and military strategy
- Reports by British missionaries and diplomats encouraged the English East India Company in 1832 to copy the Chinese system as a method of selecting employees for overseas duty.
- The British government adopted a similar system of testing for its civil service in 1855.
- After the British endorsement of a civil service testing system, the French and German governments followed suit.
- In 1883, the U.S. government established the American Civil Service Commission.

Greco-Roman Tradition
- Attempts to categorize people in terms of personality types.
- Such categorizations typically included reference to an overabundance or deficiency in some bodily fluid (such as blood or phlegm) as a factor believed to influence personality.

Middle Ages
- The question of critical importance was "Who is in league with the Devil?" and various measurement procedures were devised to address this question.

Renaissance
- Measurement in the modern sense began to emerge.
- By the 18th century, Christian von Wolff had anticipated psychology as a science and psychological measurement as a specialty within that science.

Francis Galton
- Galton (1869) aspired to classify people according to their natural gifts and to ascertain their deviation from an average (statistical regression).
- credited with devising or contributing to the development of many contemporary tools of psychological assessment, including questionnaires, rating scales, and self-report inventories.
- pioneered the use of a statistical concept central to psychological experimentation and testing: the coefficient of correlation.
- His Anthropometric Laboratory (1884) measured variables such as height (standing), height (sitting), arm span, weight, breathing, etc.
- urged educational institutions to keep anthropometric records on their students.

Wilhelm Max Wundt
- Assessment was also an important activity at the first experimental psychology laboratory, founded at the University of Leipzig in Germany by Wundt.
- tried to formulate a general description of human abilities with respect to variables such as reaction time, perception, and attention span.
- Other students of Wundt at Leipzig included Charles Spearman.

Charles Spearman
- originated the concept of test reliability and the mathematical framework for factor analysis.

Victor Henri
- collaborated with Alfred Binet from 1895 on papers suggesting how mental tests could be used to measure higher mental processes.

Emil Kraepelin
- was an early experimenter with the word association technique as a formal test.

 1905 – Alfred Binet and Theodore Simon published a 30-item "measuring scale of intelligence" designed to help identify mentally retarded Paris schoolchildren. The Binet test would go through many revisions and translations—and, in the process, launch both the intelligence testing movement and the clinical testing movement.
 1939 – David Wechsler introduced a test designed to measure adult intelligence, the Wechsler-Bellevue Intelligence Scale, later renamed the Wechsler Adult Intelligence Scale (WAIS).
 For Wechsler, intelligence was "the aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment" (1939, p. 3).
Group intelligence tests
- came into being in the United States in response to the military's need for an efficient method of screening the intellectual ability of World War I and II recruits.

Personality tests
- During WWI, Robert S. Woodworth was assigned the task of developing a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits.

Woodworth Psychoneurotic Inventory
- This instrument was the first widely used self-report test of personality. (Post-WWI)

Culture and Assessment
- Culture may be defined as the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people (Cohen, 1994).
- Early 1900s – the U.S. Public Health Service used IQ tests to measure the intelligence of people seeking to immigrate to the United States.
- Henry H. Goddard raised questions about how meaningful such tests are when used with people from various cultural and language backgrounds.
- Efforts to address culture in testing:
- Culture-specific tests – tests designed for use with people from one culture but not from another.
- WAIS – data from the black population was deleted from the manual.
- WISC (1949, revised 1974) contained no minority children in its development.
- 1949 WISC: "If your mother sends you to the store for a loaf of bread and there is none, what do you do?"

Some Issues Regarding Culture and Assessment

Verbal communication
- When the language in which the assessment is conducted is not the assessee's primary language, he or she may not fully comprehend the instructions or the test items.
- The contents of tests from a particular culture are typically laden with items and material from that culture.

Nonverbal communication and behavior
- Facial expressions, finger and hand signs, and shifts in one's position in space may all convey messages, e.g., eye contact.
- Culture exerts effects over many aspects of nonverbal behavior.
- For example, a child may present as noncommunicative and having only minimal language skills when verbally examined.

Standards of evaluation
- Judgments related to certain psychological traits can be culturally relative.
- In practice, we must raise questions about the applicability of assessment-related findings to specific individuals.
- Ex: "How appropriate are the norms or other standards that will be used to make this evaluation?"

Tests and Group Membership
- "What happens when groups systematically differ in terms of scores on a particular test?"
- "What student is best qualified to be admitted to this school?"
- "Which job candidate should get the job?"
- The mere suggestion that differences in psychological variables exist arouses skepticism, if not charges of discrimination or bias.
- This is especially true when the observed group differences are deemed responsible for blocking one or another group from employment or educational opportunities.
- If systematic differences related to group membership were found to exist on job ability test scores, then what, if anything, should be done?

2 contrasting views:
- Tests were designed to measure job ability, and a test does what it was designed to do. In support of this view is evidence suggesting that group differences in scores on professionally developed tests do reflect differences in real-world performance.
- Efforts should be made to "level the playing field" between groups of people.
- Affirmative action refers to voluntary and mandatory efforts undertaken by federal, state, and local governments, private employers, and schools to combat discrimination and to promote equal opportunity in education and employment for all (American Psychological Association, 1996).
- Tests and other tools of assessment are portrayed as instruments that can have a momentous and immediate impact on one's life.

Legal and Ethical Considerations

 Laws are rules that individuals must obey for the good of the society as a whole—or rules thought to be for the good of society as a whole.
 Whereas a body of laws is a body of rules, a body of ethics is a body of principles of right, proper, or good conduct.
 To the extent that a code of professional ethics is recognized and accepted by members of a profession, it defines the standard of care expected of members of that profession.
 US Legislations:
 Minimum competency testing programs (1970s) – formal testing programs designed to be used in decisions regarding various aspects of students' education. The data from such programs was used in decision making about grade promotions, awarding of diplomas, and identification of areas for remedial instruction.
 Truth-in-testing legislation was also passed at the state level beginning in the 1980s. The primary objective of these laws was to provide test takers with a means of learning the criteria by which they are being judged.

The Concerns of the Profession

 Test-user qualifications
 Three levels of tests in terms of the degree of required knowledge of testing and psychology (APA):
 Level A: Tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working (for instance, achievement or proficiency tests).
 Level B: Tests or aids that require some technical knowledge of test construction and use and of supporting psychological and educational fields such as statistics, individual differences, psychology of adjustment, personnel psychology, and guidance (e.g., aptitude tests and adjustment inventories applicable to normal populations).
 Level C: Tests and aids that require substantial understanding of testing and supporting psychological fields together with supervised experience in the use of these devices (for instance, projective tests, individual mental tests).

Testing people with disabilities

 Difficulties:
 transforming the test into a form that can be taken by the testtaker
 transforming the responses of the testtaker so that they are scorable
 meaningfully interpreting the test data

Computerized test administration, scoring, and interpretation

 Computer-assisted psychological assessment (CAPA)

Major issues with regard to CAPA:
 Access to test administration, scoring, and interpretation software. Despite restrictions, software may still be copied.
 Comparability of pencil-and-paper and computerized versions of tests. In many instances, the comparability of the traditional and the computerized forms of the test has not been researched or has only insufficiently been researched.
 The value of computerized test interpretations. Many tests available for computerized administration also come with computerized scoring and interpretation procedures. Such interpretations in many cases are questionable.
 Unprofessional, unregulated "psychological testing" online. A growing number of Internet sites purport to provide, usually for a fee, online psychological tests. Yet the vast majority of the tests offered would not meet a psychologist's standards.

The Rights of Testtakers

The right of informed consent
 The disclosure of the information needed for consent must, of course, be in language the testtaker can understand.
 Example for a 2–3 year old: "I'm going to ask you to try to do some things so that I can see what you know how to do and what things you could use some more help with."
 If a testtaker is incapable of providing an informed consent to testing, such consent may be obtained from a parent or a legal representative.
 Consent must be in written rather than oral form. The written form should specify (1) the general purpose of the testing, (2) the specific reason it is being undertaken in the present case, and (3) the general type of instruments to be administered.
 One gray area with respect to the testtaker's right of fully informed consent before testing involves research and experimental situations wherein the examiner's complete disclosure of all facts pertinent to the testing (including the experimenter's hypothesis and so forth) might irrevocably contaminate the test data.

The right to be informed of test findings

The right to privacy and confidentiality
 Privacy right "recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behavior, and opinions" (Shah, 1969).
 Privileged Information – information that is protected by law from disclosure in a legal proceeding. State statutes have extended the concept of privileged information to parties who communicate with each other in the context of certain relationships, including the lawyer-client relationship, the doctor-patient relationship, the priest-penitent relationship, and the husband-wife relationship.
 Confidentiality may be distinguished from privilege in that, whereas "confidentiality concerns matters of communication outside the courtroom, privilege protects clients from disclosure in judicial proceedings" (Jagim et al., 1978).

The right to the least stigmatizing label

LESSON 3
Statistics Review

Levels of Measurement

Describing Data: Frequency Distributions

Frequency Distribution
- A list of raw scores in which all scores are listed alongside the number of times each score occurred.
- Scores can also be illustrated graphically.

Grouped Frequency Distribution
- Test-score intervals, also called class intervals, replace the actual test scores.

Describing Data: Graphs

Graph
- A diagram or chart composed of lines, points, bars, or other symbols that describe and illustrate data.

Histogram
- A graph with vertical lines drawn at the true limits of each test score (or class interval), forming a series of contiguous rectangles.
- Test scores appear along the X-axis; frequency of occurrence along the vertical or Y-axis.

Bar Graph
- Numbers indicative of frequency also appear on the Y-axis, and reference to some categorization (e.g., yes/no/maybe, male/female) appears on the X-axis. Here the rectangular bars typically are not contiguous.

Frequency Polygon
- Expressed by a continuous line connecting the points where test scores or class intervals (as indicated on the X-axis) meet frequencies (as indicated on the Y-axis).

Measures of Central Tendency
- A measure of central tendency is a statistic that indicates the average or midmost score between the extreme scores in a distribution.

Mean
- Equal to the sum of the observations (or test scores in this case) divided by the number of observations.

Median
- The middle score in a distribution; another commonly used measure of central tendency. We determine the median of a distribution of scores by ordering the scores in a list by magnitude, in either ascending or descending order.

Mode
- The most frequently occurring score in a distribution of scores.
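To make the three definitions concrete, here is a minimal runnable sketch using Python's standard library on a small, hypothetical set of test scores:

```python
from statistics import mean, median, mode

scores = [85, 90, 90, 95, 100, 100, 100, 105, 110]  # hypothetical raw scores

print(mean(scores))    # sum of observations / number of observations -> ~97.2
print(median(scores))  # middle score of the ordered list -> 100
print(mode(scores))    # most frequently occurring score -> 100
```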

Measures of Variability

Variability
- An indication of how scores in a distribution are scattered or dispersed.
- Statistics that describe the amount of variation in a distribution are referred to as measures of variability.

Measures of variability include:
- range
- interquartile range
- semi-interquartile range
- average deviation
- standard deviation, and the variance

Range of a distribution
- Equal to the difference between the highest and the lowest scores.

Interquartile Range
- A measure of variability equal to the difference between Q3 and Q1. Like the median, it is an ordinal statistic.

Semi-interquartile Range
- Equal to the interquartile range divided by 2.
- Knowledge of the relative distances of Q1 and Q3 from Q2 (the median) provides the seasoned test interpreter with immediate information as to the shape of the distribution of scores.
- In a perfectly symmetrical distribution, Q1 and Q3 will be exactly the same distance from the median.
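A short sketch of the range-based measures above, on hypothetical scores. Note that quartile conventions vary across textbooks and software; Python's statistics.quantiles defaults to the "exclusive" method, so hand-computed Q1 and Q3 may differ slightly:

```python
from statistics import quantiles

scores = [2, 4, 5, 6, 7, 8, 9, 11, 13, 17]   # hypothetical test scores
q1, q2, q3 = quantiles(scores, n=4)          # quartiles ('exclusive' method)

print(max(scores) - min(scores))  # range: highest score minus lowest score
print(q3 - q1)                    # interquartile range: Q3 - Q1
print((q3 - q1) / 2)              # semi-interquartile range: (Q3 - Q1) / 2
```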

 Average Deviation (AD)
- The mean of the absolute deviations of the scores about the mean of the distribution: AD = ∑|x| / n, where x is the deviation score (X − mean).

 Standard Deviation
- Equal to the square root of the average squared deviations about the mean.
- It is equal to the square root of the variance.

 Variance
- Equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean. The formula used to calculate the variance (s²) using deviation scores is:

s² = ∑x² / n, where x = X − mean
- Why is the normal curve important in
used to calculate the variance (s2) using
understanding the characteristics of
deviation scores is
psychological tests?

Skewness
- Distributions can be characterized by their skewness, or the nature and extent to which symmetry is absent.

Positive skew
- When relatively few of the scores fall at the high end of the distribution.
- May indicate that the test was too difficult. More items that were easier would have been desirable in order to better discriminate at the lower end of the distribution of test scores.

Negative skew
- When relatively few of the scores fall at the low end of the distribution.
- May indicate that the test was too easy. More items of a higher level of difficulty would make it possible to better discriminate between scores at the upper end of the distribution.

Kurtosis
- Refers to the steepness of a distribution in its center.
- Distributions may be platykurtic (relatively flat), leptokurtic (relatively peaked), or somewhere in the middle: mesokurtic.

The Normal Curve
- The normal curve is a bell-shaped, smooth, mathematically defined curve that is highest at its center.
- From the center it tapers on both sides, approaching the X-axis asymptotically (meaning that it approaches, but never touches, the axis).
- In theory, the distribution of the normal curve ranges from negative infinity to positive infinity. The curve is perfectly symmetrical, with no skewness.
- Because it is symmetrical, the mean, the median, and the mode all have the same exact value.
- Why is the normal curve important in understanding the characteristics of psychological tests?

The Area under the Normal Curve
- 50% of the scores occur above the mean and 50% of the scores occur below the mean.
- Approximately 34% of all scores occur between the mean and 1 standard deviation above the mean.
- Approximately 34% of all scores occur between the mean and 1 standard deviation below the mean.
- Approximately 68% of all scores occur between the mean and ±1 standard deviation.
- Approximately 95% of all scores occur between the mean and ±2 standard deviations.
- A normal curve has two tails. The area on the normal curve between 2 and 3 standard deviations above or below the mean is referred to as a tail.

Standard Scores

z Scores
- Result from the conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution.
- In essence, a z score is equal to the difference between a particular raw score and the mean divided by the standard deviation: z = (X − mean) / SD.

T Scores
- A scale with a mean set at 50 and a standard deviation set at 10.
- Devised by William McCall (1939) and named a T score in honor of his professor E. L. Thorndike.
- Composed of a scale that ranges from 5 standard deviations below the mean to 5 standard deviations above the mean. Thus, for example, a raw score that fell exactly at 5 standard deviations below the mean would be equal to a T score of 0, a raw score that fell at the mean would be equal to a T of 50, and a raw score 5 standard deviations above the mean would be equal to a T of 100.
- One advantage in using T scores is that none of the scores is negative.

Stanine (standard nine)
- A standard score with a mean of 5 and a standard deviation of approximately 2.
- Stanines are different from other standard scores in that they take on whole values from 1 to 9, which represent a range of performance that is half of a standard deviation in width.
- The 5th stanine indicates performance in the average range, from 1/4 standard deviation below the mean to 1/4 standard deviation above the mean, and captures the middle 20% of the scores in a normal distribution.
- The 4th and 6th stanines are also 1/2 standard deviation wide and capture the 17% of cases below and above (respectively) the 5th stanine.

A scores
- Another type of standard score is employed on tests such as the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE). Raw scores on those tests are converted to standard scores such that the resulting distribution has a mean of 500 and a standard deviation of 100. If the letter A is used to represent a standard score from a college or graduate school admissions test whose distribution has a mean of 500 and a standard deviation of 100, then the following is true: A = 500 + 100z.

Deviation IQ
- For most IQ tests, the distribution of raw scores is converted to IQ scores, whose distribution typically has a mean set at 100 and a standard deviation set at 15.
- The typical mean and standard deviation for IQ tests results in approximately 95% of deviation IQs ranging from 70 to 130, which is 2 standard deviations below and above the mean.
- Standard scores converted from raw scores may involve either linear or nonlinear transformations.
- A standard score obtained by a linear transformation is one that retains a direct numerical relationship to the original raw score. The magnitude of differences between such standard scores exactly parallels the differences between corresponding raw scores.
- Sometimes scores may undergo more than one transformation. For example, the creators of the SAT did a second linear transformation on their data to convert z scores into a new scale that has a mean of 500 and a standard deviation of 100.
- A nonlinear transformation may be required when the data under consideration are not normally distributed yet comparisons with normal distributions need to be made.
- As the result of a nonlinear transformation, the original distribution is said to have been normalized.
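The standard-score conversions above are all linear transformations of z. A small sketch with a hypothetical raw score, mean, and standard deviation:

```python
def z_score(raw, mean, sd):
    return (raw - mean) / sd   # z = (X - mean) / SD

z = z_score(75, 65, 8)   # hypothetical: raw score 75 on a test with mean 65, SD 8
t = 50 + 10 * z          # T score: mean 50, SD 10
a = 500 + 100 * z        # A score (SAT/GRE-type): mean 500, SD 100
iq = 100 + 15 * z        # deviation IQ: mean 100, SD 15

print(z, t, a, iq)       # -> 1.25 62.5 625.0 118.75
```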
Standard Score Equivalents
- If we convert a raw score into a z-score, we can determine its percentile rank.
- We can use this to determine the proportions of the distribution that lie between specific parts.

Characteristics of the Normal Distribution Curve
- The scale along the bottom of the distribution is in z-score units, that is, μ = 0, σ = 1.
- It has a bell-shaped distribution that is symmetrical (skewness, s3 = 0.0) and mesokurtic (kurtosis, s4 = 3.0).
- The proportions of the curve that lie under different parts of the distribution have corresponding values.

Areas within the Normal Distribution Curve
- Proportions of the area between μ and:
- ±1.0 SD is approximately .68 or 68%
- ±2.0 SD is approximately .95 or 95%
- ±3.0 SD is approximately .997 or 99.7%

Comparison of scores
- The value of the normal distribution lies in the fact that many real-world distributions, including values of a variable as well as values of sample statistics, approach the form of the theoretical distribution.
- One of the most common uses of the z-score is in comparing 2 different measures that have different means and standard deviations.

Normalized standard scores
- Many test developers hope that the test they are working on will yield a normal distribution of scores. Yet even after very large samples have been tested with the instrument under development, skewed distributions result. What should be done?
- One alternative available to the test developer is to normalize the distribution.
- Conceptually, normalizing a distribution involves "stretching" the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale that is technically referred to as a normalized standard score scale.
- Normalization of a skewed distribution of scores may also be desirable for purposes of comparability.
- One of the primary advantages of a standard score on one test is that it can readily be compared with a standard score on another test. However, such comparisons are appropriate only when the distributions from which they derived are the same.

Converting z-scores to Percentile, T-scores

Case 1: Area between 0 and any z-score
- The area reading for the z-value in the z-score table is the required probability.
1. Given μ = 100 and σ = 15 on a standard IQ test, Frank obtained a score of 125.
a. What percentage of cases fall between his score and the mean?
b. What is his percentile rank in the general population?

Case 2: Area in any tail
- Find the area given for the z-value in the z-score table. Subtract the area found from .5000 to find the probability.
- Or, read the proportion in the tail of the distribution directly (column C of the table).
2. (Lower tail) Corrine scores 93 on the IQ test. What is her percentile rank?
3. (Upper tail) Tilda scores 108 on the test. What percent of test takers obtained a score above hers?

Case 3: Area between two z-scores on the same side of the mean
- Find the areas given for both z-values in the z-score table. Subtract the smaller area from the larger area to get the probability.
4. What is the percentage of area between a score of 123 and a score of 135?

Case 4: Area between 2 z-values on opposite sides of the mean
- Find the areas given for both z-values in the z-score table. Add the two areas obtained from the table to get the probability.
5. What percentage of cases fall between a score of 120 and a score of 88?
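The five worked questions above can be checked without a z-score table by computing the normal cumulative proportion directly. A sketch using only the standard library; the helper phi() stands in for the table:

```python
from math import erf, sqrt

def phi(z):
    """Cumulative proportion of the normal curve falling below z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 100, 15
z = lambda x: (x - mu) / sigma

print(phi(z(125)) - 0.5)           # Case 1a: mean to 125 -> ~.452 (45%)
print(phi(z(125)))                 # Case 1b: percentile rank -> ~.952 (95th)
print(phi(z(93)))                  # Case 2, lower tail: Corrine -> ~.320 (32nd)
print(1 - phi(z(108)))             # Case 2, upper tail: above Tilda -> ~.297 (30%)
print(phi(z(135)) - phi(z(123)))   # Case 3: between 123 and 135 -> ~.053 (5%)
print(phi(z(120)) - phi(z(88)))    # Case 4: between 88 and 120 -> ~.697 (70%)
```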

LESSON 4
Test Reliability
What is Reliability?

 Reliability refers to consistency in measurement.
 Reliability coefficient is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.
 Concept of Reliability
- A score on a test is presumed to reflect not only the test-taker's true score on the ability being measured but also error.
 Variance from true differences is true variance, and variance from irrelevant, random sources is error variance.
 A systematic error source does not change the variability of the distribution or affect reliability. A systematic source of error would not affect score consistency.

Sources of Variance

Sources of Error Variance

 Test construction
- Item sampling or content sampling refer to variation among items within a test as well as to variation among items between tests.
- The extent to which a testtaker's score is affected by the content sampled.
 Test administration
- Test environment: the room temperature, the level of lighting, and the amount of ventilation and noise.
- Testtaker variables: emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication.
- Examiner-related variables: the examiner's physical appearance and demeanor—even the presence or absence of an examiner.
 Test scoring and interpretation
- Scorers and scoring systems are potential sources of error variance.
 Other sources of error
- forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting.
- underreporting or overreporting.

Reliability Estimates

 Three approaches to the estimation of reliability:
1. Test-retest
2. Alternate or parallel forms
3. Internal or inter-item consistency
 The method or methods employed will depend on a number of factors, such as the purpose of obtaining a measure of reliability.

Test-Retest Reliability

 An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
 Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait.
 The passage of time can be a source of error variance. The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.
 When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability.
 May be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments (brightness, loudness).

Parallel-Forms and Alternate-Forms Reliability

 The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence.
 Parallel forms
- Exist when, for each form of the test, the means and the variances of observed test scores are equal.
 Alternate forms
- Are simply different versions of a test that have been constructed so as to be parallel.
- Although they do not meet the requirements for the legitimate designation "parallel," alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty.
• Similarity with the test-retest method:
1. Two test administrations with the same group are required.
2. Test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy.
• Advantage
- It minimizes the effect of memory for the content of a previously administered form of the test.
- Certain traits are presumed to be relatively stable in people over time, and we would expect tests measuring those traits to reflect that stability. Ex: intelligence tests.
• Disadvantage
- Developing alternate forms of tests can be time-consuming and expensive.
- Error due to item sampling – the selection of items for inclusion in the test.
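Since a test-retest (or alternate-forms) estimate is simply the Pearson r between two administrations, here is a minimal sketch with hypothetical scores (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# hypothetical scores for the same five people on two administrations
time1 = [12, 15, 11, 18, 14]
time2 = [13, 16, 10, 19, 15]

r_tt = correlation(time1, time2)   # Pearson r as the test-retest estimate
print(round(r_tt, 3))              # -> ~0.978 for these made-up data
```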
Internal Consistency

 Split-Half Reliability
- Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
- A useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice.

Three steps:
1. Divide the test into equivalent halves.
2. Calculate a Pearson r between scores on the two halves of the test.
3. Adjust the half-test reliability using the Spearman-Brown formula.

- Simply dividing the test in the middle is not recommended because it's likely this procedure would spuriously raise or lower the reliability coefficient.

 Ways to split a test:
- Randomly assign items to one or the other half of the test.
- Assign odd-numbered items to one half of the test and even-numbered items to the other half. Referred to as odd-even reliability.
- Divide the test by content so that each half contains items equivalent with respect to content and difficulty.

 Spearman-Brown formula
- Allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
- Because the reliability of a test is affected by its length, a formula is necessary for estimating the reliability of a test that has been shortened or lengthened.
- The general Spearman-Brown (rSB) formula is:

rSB = n(rxy) / [1 + (n − 1)(rxy)]

Where:
- rSB is equal to the reliability adjusted by the Spearman-Brown formula
- rxy is equal to the Pearson r in the original-length test
- n is equal to the number of items in the revised version divided by the number of items in the original version.

 By determining the reliability of one half of a test, a test developer can use the Spearman-Brown formula to estimate the reliability of a whole test.
 Because a whole test is two times longer than half a test, n becomes 2 in the Spearman-Brown formula for the adjustment of split-half reliability. The symbol rhh stands for the Pearson r of scores in the two half tests:

rSB = 2(rhh) / (1 + rhh)

 The Spearman-Brown formula
- May be used to estimate the effect of shortening on the test's reliability.
- Also used to determine the number of items needed to attain a desired level of reliability.
- (Table: Odd-Even Reliability Coefficients before and after the Spearman-Brown Adjustment.)

• Increasing Test Reliability
- In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured.
- If the reliability of the original test is relatively low, then it may be impractical to increase the number of items to reach an acceptable level of reliability.
- Another alternative is to abandon the relatively unreliable instrument and locate—or develop—a suitable alternative.
- The reliability of the instrument could also be raised by creating new items, clarifying the test's instructions, or simplifying the scoring rules.

Other Methods of Estimating Internal Consistency

Inter-item consistency
- Refers to the degree of correlation among all the items on a scale.
- Calculated from a single administration of a single form of a test.
- Useful in assessing the homogeneity of the test.
- Tests are said to be homogeneous if they contain items that measure a single trait.
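A sketch of the split-half procedure described above: correlate two hypothetical half-test score lists, then apply the Spearman-Brown adjustment with n = 2 (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

def spearman_brown(r_xy, n):
    """General formula: rSB = n*rxy / (1 + (n - 1)*rxy)."""
    return n * r_xy / (1 + (n - 1) * r_xy)

# hypothetical odd- and even-item half-test totals for six examinees
odd_half = [7, 9, 5, 8, 6, 10]
even_half = [6, 9, 6, 7, 5, 9]

r_hh = correlation(odd_half, even_half)   # Pearson r between the two halves
r_whole = spearman_brown(r_hh, 2)         # n = 2: whole test is twice the half length
print(round(r_hh, 3), round(r_whole, 3))  # equivalently, rSB = 2*rhh / (1 + rhh)
```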
Homogeneity
- As an adjective used to describe test items, it is the degree to which a test measures a single factor.
- In other words, homogeneity is the extent to which items in a scale are unifactorial.

Heterogeneity
- In contrast to test homogeneity, describes the degree to which a test measures different factors.

 The more homogeneous a test is, the more inter-item consistency it can be expected to have.
- Because a homogeneous test samples a relatively narrow content area, it is to be expected to contain more inter-item consistency than a heterogeneous test.
 Test homogeneity is desirable because it allows relatively straightforward test-score interpretation.
- Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested.
- Testtakers with the same score on a more heterogeneous test may have quite different abilities.

The Kuder-Richardson formulas
- KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items).
- Kuder-Richardson formula 20, or KR-20, is so named because it was the twentieth formula developed in a series.
- Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar.
- If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.

KR-20 Formula:

rKR20 = [k / (k − 1)] [1 − (∑pq / σ²)]

Where:
- rKR20 stands for the Kuder-Richardson formula 20 reliability coefficient
- k is the number of test items
- σ² is the variance of total test scores
- p is the proportion of testtakers who pass the item
- q is the proportion of people who fail the item
- ∑pq is the sum of the pq products over all items

 The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha.
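A minimal sketch of the KR-20 computation above, using a hypothetical matrix of dichotomous (0/1) item scores and population variance (conventions on n versus n − 1 vary across texts):

```python
def kr20(item_scores):
    """KR-20 on a matrix of 0/1 scores: rows = testtakers, columns = items."""
    k = len(item_scores[0])                               # number of items
    n = len(item_scores)
    totals = [sum(person) for person in item_scores]      # total score per testtaker
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n    # variance of total scores
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in item_scores) / n  # proportion passing item j
        sum_pq += p * (1 - p)                             # q = 1 - p
    return (k / (k - 1)) * (1 - sum_pq / var_t)

data = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0]]
print(round(kr20(data), 3))   # hypothetical 5 testtakers x 4 items
```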
 An approximation of KR-20 can be obtained by the use of the twenty-first formula in the series developed by Kuder and Richardson.
 The KR-21 formula may be used if there is reason to assume that all the test items have approximately the same degree of difficulty.
 Let's add that this assumption is seldom justified. Formula KR-21 has become outdated in an era of calculators and computers. Way back when, KR-21 was sometimes used to estimate KR-20 only because it required many fewer calculations. Numerous modifications of Kuder-Richardson formulas have been proposed through the years.
 The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha. You may even hear it referred to as coefficient α-20. This expression incorporates both the Greek letter alpha (α) and the number 20, the latter a reference to KR-20.

Coefficient Alpha or Cronbach Alpha
- Appropriate for use on tests containing non-dichotomous items.
- The formula for coefficient alpha is:

rα = [k / (k − 1)] [1 − (∑σi² / σ²)]

Where:
- rα is coefficient alpha
- k is the number of items
- σi² is the variance of one item
- ∑σi² is the sum of the variances of each item
- σ² is the variance of the total test scores

Coefficient alpha
- is the preferred statistic for obtaining an estimate of internal consistency reliability.
- widely used because it requires only one administration of the test.
- Typically ranges in value from 0 to 1.
- provides answers about how similar sets of data are.
- A myth about alpha is that "bigger is always better": a value >.90 may indicate redundancy in the items.
- All indexes of reliability, coefficient alpha among them, provide an index that is a characteristic of a particular group of test scores, not of the test itself.
- Measures of reliability are estimates, and estimates are subject to error. The precise amount of error inherent in a reliability estimate will vary with the sample of testtakers from which the data were drawn.

Measures of Inter-Scorer Reliability
- The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
- If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training.

Coefficient of Inter-scorer Reliability
- A coefficient of correlation that determines the degree of consistency among scorers in the scoring of a test.

Using and Interpreting a Coefficient of Reliability
- "How high should the coefficient of reliability be?"
- If a test score carries with it life-or-death implications, then we need to hold that test to some high standards.
- If a test score is routinely used in combination with many other test scores and typically accounts for only a small part of the decision process, that test will not be held to the highest standards of reliability.
- As a rule of thumb, reliability may parallel many grading systems:
- .90s rates a grade of A (with a value of .95 or higher for the most important types of decisions)
- .80s rates a B (with below .85 being a clear B)
- .65–.70s rates as weak and unacceptable

The Purpose of the Reliability Coefficient
- If a specific test of employee performance is designed for use at various times over the course of the employment period, it would be reasonable to expect the test to demonstrate reliability across time. It would thus be desirable to have an estimate of the instrument's test-retest reliability. For a test designed for a single administration only, an estimate of internal consistency would be the reliability measure of choice. If the purpose of determining reliability is to break down the error variance into its parts, as shown in Figure 5–1, then a number of reliability coefficients would have to be calculated. Note that the various reliability coefficients do not all reflect the same sources of error variance. Thus, an individual reliability coefficient may provide an index of error from test construction, test administration, or test scoring and interpretation. A coefficient of inter-rater reliability, for example, provides information about error as a result of test scoring. Specifically, it can be used to answer questions about how consistently two scorers score the same test items.
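The KR-20 computation sketched earlier generalizes to non-dichotomous items by replacing ∑pq with the sum of item variances, as in the coefficient alpha formula above. A sketch on hypothetical Likert-type responses, again using population variances:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)   # population variance

def cronbach_alpha(item_scores):
    """Coefficient alpha: rows = testtakers, columns = (e.g., Likert) items."""
    k = len(item_scores[0])                                           # number of items
    item_vars = [variance([p[j] for p in item_scores]) for j in range(k)]
    total_var = variance([sum(p) for p in item_scores])               # variance of totals
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

data = [[3, 4, 3, 5], [2, 2, 3, 3], [4, 5, 4, 5], [1, 2, 2, 2], [3, 3, 4, 4]]
print(round(cronbach_alpha(data), 3))   # hypothetical 5 respondents x 4 items
```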

Sources of Variance in a Hypothetical Test
- In this hypothetical situation, 5% of the variance has not been identified by the test user. It could be accounted for by transient error, attributable to variations in the testtaker's feelings, moods, or mental state over time.
- It may also be due to other factors that are yet to be identified.

The Nature of the Test

Considerations:
- whether the test items are homogeneous or heterogeneous in nature
- whether the characteristic, ability, or trait being measured is presumed to be dynamic or static
- whether the range of test scores is or is not restricted
- whether the test is a speed or a power test
- whether the test is or is not criterion-referenced.
- Some tests present special problems regarding the measurement of their reliability.

Homogeneity versus heterogeneity of test items
- Tests designed to measure one factor (ability or trait) are expected to be homogeneous in items, resulting in a high degree of internal consistency.
- By contrast, in a heterogeneous test, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.

Dynamic versus static characteristics
- A dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.
- Example: the anxiety (dynamic) manifested by a stockbroker throughout a business day vs. his intelligence (static).

Restriction or Inflation of Range
- If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower.
- If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
- Also of critical importance is whether the range of variances employed is appropriate to the objective of the correlational analysis.
- (Figure: Two Scatterplots Illustrating Unrestricted and Restricted Ranges.)

Speed tests versus power tests

Power test
- When a time limit is long enough to allow testtakers to attempt all items, and some items are so difficult that no testtaker is able to obtain a perfect score.

Speed test
- Generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.
- The time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test.
- Score differences on a speed test are therefore based on performance speed.
- A reliability estimate of a speed test should be based on performance from two independent testing periods using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests.
 To understand why the KR-20 or split-half reliability coefficient will be spuriously high, consider the following example.
 When a group of testtakers completes a speed test,  In criterion-referenced testing, and particularly in
almost all the items completed will be correct. If mastery testing, how different the scores are from
reliability is examined using an odd-even split, and if one another is seldom a focus of interest.
the testtakers completed the items in order, then  The critical issue for the user of a mastery test is
testtakers will get close to the same number of odd whether or not a certain criterion score has been
as even items correct. A testtaker completing 82 achieved.
items can be expected to get approximately 41 odd  As individual differences (and the variability)
and 41 even items correct. A testtaker completing decrease, a traditional measure of reliability would
61 items may get 31 odd and 30 even items correct. also decrease
When the numbers of odd and even items c orrect  The person will ordinarily have a different universe
are correlated across a group of testtakers, the score for each universe. Mary’s universe score
correlation will be close to 1.00. Yet this impressive covering tests on May 5 will not agree perfectly
correlation coeffi cient actually tells us nothing with her universe score for the whole month of
about response consistency. Under the same May. . . . Some testers call the average over a large
scenario, a Kuder-Richardson reliability coeffi cient number of comparable observations a “true score”;
would yield a similar coeffi cient that would also be, e.g., “Mary’s true typing rate on 3-minute tests.”
well, equally useless. Recall that KR-20 reliability is Instead, we speak of a “universe score” to
based on the proportion of testtakers correct (p) emphasize that what score is desired depends on
and the proportion of testtakers incorrect (q) on the universe being considered. For any measure
each item. In the case of a speed test, it is there are many “true scores,” each corresponding
conceivable that p would equal 1.0 and q would to a different universe.
equal 0 for many of the items. Toward the end of  When we use a single observation as if it
the test—when many items would not even be represented the universe, we are generalizing. We
attempted because of the time limit— p might generalize over scorers, over selections typed,
equal 0 and q might equal 1.0. For many if not a perhaps over days. If the observed scores from a
majority of the items, then, the product pq would procedure agree closely with the universe score, we
equal or approximate 0. When 0 is substituted in can say that the observation is “accurate,” or
the KR-20 formula for ∑ pq, the reliability coefficient “reliable,” or “generalizable.” And since the
is 1.0 (a meaningless coefficient in this instance). observations then also agree with each other, we
Criterion-referenced tests

- designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.
- Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion.
- Traditional techniques of estimating reliability employ measures that take into account scores on the entire test.
- Such traditional procedures of estimating reliability are usually not appropriate for use with criterion-referenced tests.
 To understand why, recall that reliability is defined as the proportion of total variance (σ²) attributable to true variance (σ²tr). Total variance in a test score distribution equals the sum of the true variance plus the error variance (σ²e):

σ² = σ²tr + σ²e

 A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another.
 In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest.
 The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved.
 As individual differences (and the variability) decrease, a traditional measure of reliability would also decrease.
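As a small numeric sketch of that point, the variance figures below are invented: reliability is computed as the proportion of total variance that is true variance, and shrinking the between-person (true) variance, as happens when nearly everyone reaches mastery, drags the traditional estimate down even though the test itself is unchanged.

# Reliability as the proportion of total variance that is true variance.
true_var, error_var = 40.0, 10.0
total_var = true_var + error_var     # sigma^2 = sigma^2_tr + sigma^2_e
print(true_var / total_var)          # 0.80

# Mastery testing: individual differences shrink, error stays the same.
shrunken_true_var = 2.5
print(shrunken_true_var / (shrunken_true_var + error_var))   # 0.20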
Alternatives to the True Score Model

Generalizability theory

- The 1950s saw the development of an alternative theoretical model, one originally referred to as domain sampling theory and better known today in one of its many modified forms as generalizability theory.
- In domain sampling theory, a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample (Thorndike, 1985). A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test.
- In theory, the items in the domain are thought to have the same means and variances as those in the test that samples from the domain. Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.
- Developed by Lee J. Cronbach (1970) and his colleagues (Cronbach et al., 1972), generalizability theory is based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation.
- Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score. This universe is described in terms of its facets, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
- According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model:

 The person will ordinarily have a different universe score for each universe. Mary's universe score covering tests on May 5 will not agree perfectly with her universe score for the whole month of May. . . . Some testers call the average over a large number of comparable observations a "true score"; e.g., "Mary's true typing rate on 3-minute tests." Instead, we speak of a "universe score" to emphasize that what score is desired depends on the universe being considered. For any measure there are many "true scores," each corresponding to a different universe.
 When we use a single observation as if it represented the universe, we are generalizing. We generalize over scorers, over selections typed, perhaps over days. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is "accurate," or "reliable," or "generalizable." And since the observations then also agree with each other, we say that they are "consistent" and "have little error variance." To have so many terms is confusing, but not seriously so. The term most often used in the literature is "reliability." The author prefers "generalizability" because that term immediately implies "generalization to what?" . . . There is a different degree of generalizability for each universe. The older methods of analysis do not separate the sources of variation. They deal with a single source of variance, or leave two or more sources entangled. (Cronbach, 1970, pp. 153–154)
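The following toy sketch makes the universe-score idea concrete. The typing-rate scores, the five persons, and the three occasions are invented, occasions are treated as the only facet, and an occasion main effect is ignored, so this is an illustration of the logic rather than a full generalizability analysis.

import numpy as np

# rows = persons, columns = occasions (e.g., typing rate on three days)
scores = np.array([
    [52.0, 55.0, 50.0],
    [39.0, 42.0, 40.0],
    [56.0, 54.0, 57.0],
    [35.0, 38.0, 33.0],
    [50.0, 49.0, 52.0],
])
n_persons, n_occasions = scores.shape

person_means = scores.mean(axis=1)              # estimate of each universe score
error_var = scores.var(axis=1, ddof=1).mean()   # within-person variability
person_var = person_means.var(ddof=1) - error_var / n_occasions

# How generalizable is one observation vs. an average over three occasions?
print(person_var / (person_var + error_var))                 # single occasion
print(person_var / (person_var + error_var / n_occasions))   # mean of 3 occasions

Averaging over more observations from the universe brings the observed score closer to the universe score, which is the sense in which a score "generalizes."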
Item Response Theory (Lord, 1980)

- Item response theory procedures model the probability that a person with X amount of a particular personality trait will exhibit Y amount of that trait on a personality test designed to measure it.
- Called latent-trait theory because the psychological or educational construct being measured is so often physically unobservable (stated another way, is latent) and because the construct being measured may be a trait (or an ability).

Item Response Theory (IRT)

- IRT refers to a family of theories and methods, with many other names used to distinguish specific approaches.
- Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item's level of discrimination; items may be viewed as varying in terms of these, as well as other, characteristics.

Difficulty

 "Difficulty" in this sense refers to the attribute of not being easily accomplished, solved, or comprehended.
- In a mathematics test, for example, a test item tapping basic addition ability will have a lower difficulty level than a test item tapping basic algebra skills. The characteristic of difficulty as applied to a test item may also refer to physical difficulty—that is, how hard or easy it is for a person to engage in a particular activity.

Discrimination

- signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
- Consider two more ADLQ items: item 4, My mood is generally good; and item 5, I am able to walk one block on flat ground.
- Which of these two items do you think would be more discriminating in terms of the respondent's physical abilities?
- A number of different IRT models exist to handle data resulting from the administration of tests with various characteristics and in various formats. For example, there are IRT models designed to handle data resulting from the administration of tests with:
 Dichotomous test items - test items or questions that can be answered with only one of two alternative responses, such as true–false, yes–no, or correct–incorrect questions
 Polytomous test items - test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct

Important Differences Between Latent-Trait Models and Classical "True Score" Test Theory

- In classical TST theory, no assumptions are made about the frequency distribution of test scores. By contrast, such assumptions are inherent in latent-trait models.
- Some IRT models have very specific and stringent assumptions about the underlying distribution. In one group of IRT models developed by Rasch, each item on the test is assumed to have an equivalent relationship with the construct being measured by the test.
- The psychometric advantages of item response theory have made this model appealing, especially to commercial and academic test developers and to large-scale test publishers. It is a model that in recent years has found increasing application in standardized tests, professional licensing examinations, and questionnaires used in behavioral and social sciences.
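One widely used member of this family is the two-parameter logistic (2PL) model, in which difficulty and discrimination appear directly as item parameters. The parameter values below are invented, and the 2PL is offered as a representative sketch rather than as the specific model discussed above.

import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability of a keyed (correct)
    response for trait level theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

easy_item = dict(a=0.8, b=-1.5)   # easy, weakly discriminating
hard_item = dict(a=2.0, b=1.0)    # hard, highly discriminating

for theta in (-2.0, 0.0, 2.0):    # low, average, and high trait levels
    print(theta,
          round(p_correct(theta, **easy_item), 2),
          round(p_correct(theta, **hard_item), 2))

The steeper curve of the second item is its higher discrimination at work: near the item's difficulty, small differences in the trait translate into large differences in the probability of a keyed response.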
Reliability and Individual Scores
The Standard Error of Measurement

- The standard error of measurement, often abbreviated as SEM or σmeas, provides a measure of the precision of an observed test score.
- It provides an estimate of the amount of error inherent in an observed score or measurement.
- In general, the relationship between the SEM and the reliability of a test is inverse; the higher the reliability of a test (or individual subtest within a test), the lower the SEM.
- The usefulness of the reliability coefficient does not end with test construction and selection.
- By employing the reliability coefficient in the formula for the standard error of measurement, the test user now has another descriptive statistic relevant to test interpretation, this one useful in estimating the precision of a particular test score.
- To be hired at a company TRW as a word processor, a candidate must be able to word-process accurately at the rate of 50 words per minute. The personnel office administers a total of seven brief word-processing tests to Mary over the course of seven business days. In words per minute, Mary's scores on each of the seven tests are as follows: 52 55 39 56 35 50 54
- "Which is her 'true' score?"
- The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score.

Standard Error of Measurement

- The standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests.
- Also known as the standard error of a score and denoted by the symbol σmeas, the standard error of measurement is an index of the extent to which one individual's scores vary over tests presumed to be parallel.

Assumption:

- If the individual were to take a large number of equivalent tests, scores on those tests would tend to be normally distributed, with the individual's true score as the mean.
- Because the standard error of measurement functions like a standard deviation in this context, we can use it to predict what would happen if an individual took additional equivalent tests:
- approximately 68% (actually, 68.26%) of the scores would be expected to occur within ±1σmeas of the true score;
- approximately 95% (actually, 95.44%) of the scores would be expected to occur within ±2σmeas of the true score;
- approximately 99% (actually, 99.74%) of the scores would be expected to occur within ±3σmeas of the true score.
- The best estimate available of the individual's true score on the test is the test score already obtained.
- Thus, if a student achieved a score of 50 on one spelling test and if the test had a standard error of measurement of 4, then—using 50 as the point estimate—we can be:
- 68% (actually, 68.26%) confident that the true score falls within 50 ±1σmeas (or between 46 and 54, including 46 and 54);
- 95% (actually, 95.44%) confident that the true score falls within 50 ±2σmeas (or between 42 and 58, including 42 and 58);
- 99% (actually, 99.74%) confident that the true score falls within 50 ±3σmeas (or between 38 and 62, including 38 and 62).
 The standard error of measurement, like the reliability coefficient, is one way of expressing test reliability.
 If the standard deviation of a test is held constant, then the smaller the σmeas, the more reliable the test will be; as rxx increases, the σmeas decreases:

σmeas = σ√(1 − rxx)

For example, when a reliability coefficient equals .64 and σ equals 15, the standard error of measurement equals 9:

σmeas = 15√(1 − .64) = 15(.60) = 9

 With a reliability coefficient equal to .96 and σ still equal to 15, the standard error of measurement decreases to 3:

σmeas = 15√(1 − .96) = 15(.20) = 3

• In practice, the standard error of measurement is most frequently used in the interpretation of individual test scores.
• If the cutoff score for mental retardation is 70, how should scores that are close to the cutoff value of 70 be treated?
• How high above 70 must a score be for us to conclude confidently that the individual is unlikely to be retarded?
• The standard error of measurement provides such an estimate.
• Further, the standard error of measurement is useful in establishing a confidence interval: a range or band of test scores that is likely to contain the true score.
• Consider an application of a confidence interval with one hypothetical measure of adult intelligence.
• Suppose a 22-year-old testtaker obtained an FSIQ of 75. The test user can be 95% confident that this testtaker's true FSIQ falls in the range of 70 to 80. This is so because the 95% confidence interval is set by taking the observed score of 75, plus or minus 1.96 multiplied by the standard error of measurement. In the test manual we find that the standard error of measurement of the FSIQ for a 22-year-old testtaker is 2.37. With this information in hand, the 95% confidence interval is calculated as follows:

75 ± 1.96(2.37) = 75 ± 4.645

• The calculated interval of 4.645 is rounded to the nearest whole number, 5. We can therefore be 95% confident that this testtaker's true FSIQ on this particular test of intelligence lies somewhere in the range of the observed score of 75 plus or minus 5, or somewhere in the range of 70 to 80.
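Both worked examples can be checked in a few lines, assuming the usual formula σmeas = σ√(1 − rxx) and the normal-theory 95% multiplier of 1.96 used above; the numbers are the ones from this section.

import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

print(sem(15, 0.64))           # about 9
print(sem(15, 0.96))           # about 3

observed, sem_fsiq = 75, 2.37  # the hypothetical FSIQ example
margin = 1.96 * sem_fsiq       # about 4.645, rounded to 5 in the text
print(observed - margin, observed + margin)   # roughly 70 to 80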

The Standard Error of the Difference between Two Scores

 Error related to any of the number of possible variables operative in a testing situation can contribute to a change in a score achieved on the same test, or a parallel test, from one administration of the test to the next.
 The amount of error in a specific test score is embodied in the standard error of measurement.
 True differences in the characteristic being measured can also affect test scores.
• In the field of psychology, if the probability is more than 5% that the difference occurred by chance, then, for all intents and purposes, it is presumed that there was no difference. A more rigorous standard is the 1% standard. Applying the 1% standard, no statistically significant difference would be deemed to exist unless the observed difference could have occurred by chance alone less than one time in a hundred. The standard error of the difference between two scores can be the appropriate statistical tool to address three types of questions:
1. How did this individual's performance on test 1 compare with his or her performance on test 2?
2. How did this individual's performance on test 1 compare with someone else's performance on test 1?
3. How did this individual's performance on test 1 compare with someone else's performance on test 2?
• As you might have expected, when comparing scores achieved on the different tests, it is essential that the scores be converted to the same scale.
• The formula for the standard error of the difference between two scores is

σdiff = √(σ²meas1 + σ²meas2)

Where:
- σdiff is the standard error of the difference between two scores
- σ²meas1 is the squared standard error of measurement for test 1
- σ²meas2 is the squared standard error of measurement for test 2
 If we substitute reliability coefficients for the standard errors of measurement of the separate scores, the formula becomes

σdiff = σ√(2 − r1 − r2)

Where:
- r1 is the reliability coefficient of test 1
- r2 is the reliability coefficient of test 2
- σ is the standard deviation
 Note that both tests would have the same standard deviation because they must be on the same scale (or be converted to the same scale) before a comparison can be made.
 The standard error of the difference between two scores will be larger than the standard error of measurement for either score alone because the former is affected by measurement error in both scores.
 The value obtained by calculating the standard error of the difference is used in much the same way as the standard error of the mean. If we wish to be 95% confident that the two scores are different, we would want them to be separated by 2 standard errors of the difference. A separation of only 1 standard error of the difference would give us 68% confidence that the two true scores are different.
 As an illustration of the use of the standard error of the difference between two scores, consider the situation of a corporate personnel manager who is seeking a highly responsible person for the position of vice president of safety. The personnel officer in this hypothetical situation decides to use a new published test we will call the Safety-Mindedness Test (SMT) to screen applicants for the position.
 After placing an ad in the employment section of the local newspaper, the personnel officer tests 100 applicants for the position using the SMT. The personnel officer narrows the search for the vice president to the two highest scorers on the SMT: Moe, who scored 125, and Larry, who scored 134. Assuming the measured reliability of this test to be .92 and its standard deviation to be 14, should the personnel officer conclude that Larry performed significantly better than Moe? To answer this question, first calculate the standard error of the difference:

σdiff = 14√(2 − .92 − .92) = 14√.16 = 14(.40) = 5.6

 Note that in this application of the formula, the two test reliability coefficients are the same because the two scores being compared are derived from the same test. What does this standard error of the difference mean? For any standard error of the difference, we can be:
- 68% confident that two scores differing by 1σdiff
represent true score differences
- 95% confident that two scores differing by 2σdiff
represent true score differences
- 99.7% confident that two scores differing by
3σdiff represent true score differences
 Applying this information to the standard error of
the difference just computed for the SMT, we see
that the personnel officer can be:
- 68% confident that two scores differing by 5.6
represent true score differences
- 95% confident that two scores differing by 11.2
represent true score differences
- 99.7% confident that two scores differing by
16.8 represent true score differences
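As a quick check of the SMT example, the sketch below recomputes σdiff from the substituted formula and compares Larry's and Moe's observed difference against the 1σdiff and 2σdiff thresholds; the test name and all numbers are the hypothetical ones given above.

import math

sd, r = 14.0, 0.92                      # SMT standard deviation and reliability
sigma_diff = sd * math.sqrt(2 - r - r)  # 14 * sqrt(.16) = 5.6
print(sigma_diff)

larry, moe = 134, 125
diff = larry - moe                      # 9 points
print(diff >= 2 * sigma_diff)   # False: short of the 11.2 needed for 95% confidence
print(diff >= 1 * sigma_diff)   # True: exceeds 5.6, so only about 68% confidence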
 The difference between Larry’s and Moe’s scores is
only 9 points, not a large enough difference for the
personnel officer to conclude with 95% confidence
that the two individuals actually have true scores
that differ on this test. Stated another way: If Larry
and Moe were to take a parallel form of the SMT,
then the personnel officer could not be 95%
confident that, at the next testing, Larry would
again outperform Moe. The personnel officer in this
example would have to resort to other means to
decide whether Moe, Larry, or someone else would
be the best candidate for the position.