PsychAssess (Prelim)
Clinical settings
- To determine maladjustment
- Effectiveness of psychotherapy
- Learning difficulties
- Expert witnessing
- Forensic settings (prisons)
Counseling settings
Geriatric settings
- Recruitment
- Promotion
- Transfer
- Job satisfaction
- Performance
- Product design
- Marketing
Other settings
- Program evaluation
- Health Psychology
Accommodation
Alternate assessment
LESSON 3
Statistics Review
Levels of Measurement
Frequency Polygon
Frequency Distribution
Measures of Variability
Variability
z Scores
T Scores
- A scale with a mean set at 50 and a standard deviation set at 10.
- Devised by William McCall (1939) and named a T score in honor of his professor E. L. Thorndike.
- Composed of a scale that ranges from 5 standard deviations below the mean to 5 standard deviations above the mean. Thus, for example, a raw score that fell exactly at 5 standard deviations below the mean would be equal to a T score of 0, a raw score that fell at the mean would be equal to a T of 50, and a raw score 5 standard deviations above the mean would be equal to a T of 100.
- One advantage in using T scores is that none of the scores is negative.
Stanine (standard nine)
- A standard score with a mean of 5 and a standard deviation of approximately 2.
- Stanines are different from other standard scores in that they take on whole values from 1 to 9, which represent a range of performance that is half of a standard deviation in width.
Deviation IQ
- For most IQ tests, the distribution of raw scores is converted to IQ scores, whose distribution typically has a mean set at 100 and a standard deviation set at 15.
- The typical mean and standard deviation for IQ tests results in approximately 95% of deviation IQs ranging from 70 to 130, which is 2 standard deviations below and above the mean.
- Standard scores converted from raw scores may involve either linear or nonlinear transformations.
- A standard score obtained by a linear transformation is one that retains a direct numerical relationship to the original raw score. The magnitude of differences between such standard scores exactly parallels the differences between corresponding raw scores.
- Sometimes scores may undergo more than one transformation.
- For example, the creators of the SAT did a second linear transformation on their data to convert z scores into a new scale that has a mean of 500 and a standard deviation of 100.
- A nonlinear transformation may be required when the data under consideration are not normally distributed yet comparisons with normal distributions need to be made.
- As the result of a nonlinear transformation, the original distribution is said to have been normalized.
- The scale along the bottom of the distribution is in z-score units, that is, μ = 0, σ = 1.
- It has a bell-shaped distribution that is symmetrical (skewness, s3 = 0.0) and mesokurtic (kurtosis, s4 = 3.0).
- The proportions of the curve that lie under the different parts of the distribution have fixed corresponding values.
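The linear scales described in these notes (T scores, deviation IQs, and the SAT-style scale) share one pattern: new score = new mean + new SD × z. The Python sketch below illustrates that pattern under stated assumptions. The function names are hypothetical, and the stanine cut points (half-SD bands centered on the mean, clamped to 1 through 9) are a commonly used convention assumed here, not something the notes spell out.

```python
import math


def z_score(raw, mean, sd):
    """Express a raw score in standard deviation units."""
    return (raw - mean) / sd


def linear_standard_score(z, new_mean, new_sd):
    """Linear transformation of z onto a scale with the given mean and SD."""
    return new_mean + new_sd * z


def t_score(z):
    return linear_standard_score(z, 50, 10)    # mean 50, SD 10


def deviation_iq(z):
    return linear_standard_score(z, 100, 15)   # mean 100, SD 15


def sat_style_score(z):
    return linear_standard_score(z, 500, 100)  # mean 500, SD 100


def stanine(z):
    """Assumed convention: stanine 5 covers -0.25 <= z < 0.25 and every
    other band is half an SD wide; results are clamped to 1..9."""
    band = math.floor((z + 0.25) / 0.5) + 5
    return max(1, min(9, band))


if __name__ == "__main__":
    # Hypothetical raw score of 65 on a test with mean 50 and SD 10.
    z = z_score(65, 50, 10)
    print(z, t_score(z), deviation_iq(z), sat_style_score(z), stanine(z))
```

Because each scale is a linear function of z, score differences on one scale map proportionally onto the others; only the stanine step discards information by grouping scores into bands.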
Standard Score Equivalents
- If we convert a raw score into a z-score, we can
determine its percentile rank.
- We can use this to determine the proportion of the distribution that lies between specific points.
Converting z-scores to Percentile, T-scores
Case 1: Area between 0 and any z-score
Solution:
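One way to work Case 1 numerically, instead of reading a printed z table, is to use the cumulative distribution function of the standard normal curve. The sketch below relies on `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names are illustrative and do not come from the notes.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1 (z-score units).
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def percentile_rank(z):
    """Percentage of the distribution falling at or below z."""
    return 100.0 * STANDARD_NORMAL.cdf(z)


def area_between_zero_and(z):
    """Case 1: proportion of the curve between the mean (z = 0) and z."""
    return abs(STANDARD_NORMAL.cdf(z) - 0.5)


def area_between(z_low, z_high):
    """Proportion of the curve between two z-scores."""
    return STANDARD_NORMAL.cdf(z_high) - STANDARD_NORMAL.cdf(z_low)


if __name__ == "__main__":
    print(round(area_between_zero_and(1.0), 4))   # about 0.3413
    print(round(area_between(-2.0, 2.0), 4))      # about 0.9545
    print(round(percentile_rank(1.0), 1))         # about 84.1
```

The second call matches the statement in the notes that roughly 95% of deviation IQs fall between 70 and 130, that is, within 2 standard deviations of the mean.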
The Nature of the Test
Considerations:
- test items are homogeneous or heterogeneous in nature
- characteristic, ability, or trait being measured is presumed to be dynamic or static
- range of test scores is or is not restricted
- test is a speed or a power test
- the test is or is not criterion-referenced.
- Some tests present special problems regarding the measurement of their reliability.
Homogeneity versus heterogeneity of test items
- Tests designed to measure one factor (ability or trait) are expected to be homogeneous in items, resulting in a high degree of internal consistency.
- By contrast, in a heterogeneous test, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.
Dynamic versus static characteristics
- A dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a
Power test
- A test whose time limit is long enough to allow testtakers to attempt all items, and in which some items are so difficult that no testtaker is able to obtain a perfect score.
Speed test
- Generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.
- The time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test.
- Score differences on a speed test are therefore based on performance speed.
- A reliability estimate of a speed test should be based on performance from two independent testing periods using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests.
To understand why the KR-20 or split-half reliability coefficient will be spuriously high, consider the following example.
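The claim that these coefficients come out spuriously high on a speed test can also be checked numerically. The sketch below is a hypothetical illustration, not part of the source: it assumes a 100-item speed test, eight made-up testtakers who each answer items in order, and that every attempted item is correct. It computes an odd-even correlation and KR-20 using the population variance of total scores, which is one of several conventions.

```python
from math import sqrt
from statistics import mean, pvariance

N_ITEMS = 100
# Hypothetical number of items each testtaker finished before time ran out.
COMPLETED = [82, 61, 75, 48, 90, 55, 67, 70]


def response_matrix(completed, n_items):
    """1 = correct, 0 = not reached; every attempted item is assumed correct."""
    return [[1 if item < done else 0 for item in range(n_items)]
            for done in completed]


def odd_even_scores(row):
    """Number correct on odd-numbered and on even-numbered items."""
    return sum(row[0::2]), sum(row[1::2])   # items 1, 3, 5, ... and 2, 4, 6, ...


def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))


def kr20(matrix):
    """KR-20 = (k / (k - 1)) * (1 - sum(pq) / variance of total scores)."""
    k, n = len(matrix[0]), len(matrix)
    totals = [sum(row) for row in matrix]
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n   # proportion correct on item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))


matrix = response_matrix(COMPLETED, N_ITEMS)
odd, even = zip(*(odd_even_scores(row) for row in matrix))
print("odd-even r:", round(pearson_r(odd, even), 4))
print("KR-20:", round(kr20(matrix), 4))
```

Both coefficients land close to 1.0 even though the data contain no information about response consistency: the scores reflect only how far each testtaker got before time expired.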
When a group of testtakers completes a speed test, almost all the items completed will be correct. If reliability is examined using an odd-even split, and if the testtakers completed the items in order, then testtakers will get close to the same number of odd as even items correct. A testtaker completing 82 items can be expected to get approximately 41 odd and 41 even items correct. A testtaker completing 61 items may get 31 odd and 30 even items correct. When the numbers of odd and even items correct are correlated across a group of testtakers, the correlation will be close to 1.00. Yet this impressive correlation coefficient actually tells us nothing about response consistency.
Under the same scenario, a Kuder-Richardson reliability coefficient would yield a similar coefficient that would also be, well, equally useless. Recall that KR-20 reliability is based on the proportion of testtakers correct (p) and the proportion of testtakers incorrect (q) on each item. In the case of a speed test, it is conceivable that p would equal 1.0 and q would equal 0 for many of the items. Toward the end of the test, when many items would not even be attempted because of the time limit, p might equal 0 and q might equal 1.0. For many if not a majority of the items, then, the product pq would equal or approximate 0. When 0 is substituted in the KR-20 formula for ∑pq, the reliability coefficient is 1.0 (a meaningless coefficient in this instance).
Criterion-referenced tests
- Designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.
- Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion.
- Traditional techniques of estimating reliability employ measures that take into account scores on the entire test.
- Such traditional procedures of estimating reliability are usually not appropriate for use with criterion-referenced tests.
To understand why, recall that reliability is defined as the proportion of total variance (σ²) attributable to true variance (σ_tr²). Total variance in a test score distribution equals the sum of the true variance plus the error variance (σ_e²):
σ² = σ_tr² + σ_e²
A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another. In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest. The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved.
As individual differences (and the variability) decrease, a traditional measure of reliability would also decrease.
The person will ordinarily have a different universe score for each universe. Mary’s universe score covering tests on May 5 will not agree perfectly with her universe score for the whole month of May. . . . Some testers call the average over a large number of comparable observations a “true score”; e.g., “Mary’s true typing rate on 3-minute tests.” Instead, we speak of a “universe score” to emphasize that what score is desired depends on the universe being considered. For any measure there are many “true scores,” each corresponding to a different universe.
When we use a single observation as if it represented the universe, we are generalizing. We generalize over scorers, over selections typed, perhaps over days. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is “accurate,” or “reliable,” or “generalizable.” And since the observations then also agree with each other, we say that they are “consistent” and “have little error variance.” To have so many terms is confusing, but not seriously so. The term most often used in the literature is “reliability.” The author prefers “generalizability” because that term immediately implies “generalization to what?” . . . There is a different degree of generalizability for each universe. The older methods of analysis do not separate the sources of variation. They deal with a single source of variance, or leave two or more sources entangled. (Cronbach, 1970, pp. 153–154)
Alternatives to the True Score Model
Generalizability theory
- The 1950s saw the development of an alternative theoretical model, one originally referred to as domain sampling theory and better known today in one of its many modified forms as generalizability theory.
- In domain sampling theory, a test’s reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample (Thorndike, 1985). A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test.
- In theory, the items in the domain are thought to have the same means and variances of those in the test that samples from the domain. Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.
“Difficulty” in this sense refers to the attribute of not being easily accomplished, solved, or comprehended.
- In a mathematics test, for example, a test item tapping basic addition ability will have a lower difficulty level than a test item tapping basic algebra skills. The characteristic of difficulty as applied to a test item may also refer to physical difficulty, that is, how hard or easy it is for a person to engage in a particular activity.
Generalizability theory