Psychometrics - For Colleagues
1 – Presentation Reliability
Some error is involved in any measurement, whether it is the measurement of temperature, blood
pressure, or intelligence. One definition of the true score is that it represents the average score that
would be obtained over repeated testing.
-Measurement error causes obtained scores to vary from one testing to the next. The standard deviation of obtained scores over these (generally hypothetical) repeated testings for a given individual defines the "standard error of measurement".
-In particular, measurement error associated with extreme true scores is estimated to be
(1) Smaller than the error associated with true scores closer to the mean
(2) Skewed (positively for low scores and negatively for high scores)
-In contrast to definitions of reliability based upon the internal consistency or covariances among
components of a linear combination, "reliability" can also mean temporal stability which basically
concerns the correlation between scores over repeated testing.
Domain sampling - a particularly useful model of a process that gives rise to true scores
-Tests are constructed by selecting a specified number of measures at random from a homogeneous, infinitely large pool. Under these conditions, the correlation of any given test score with the average of all test scores (the reliability index) can be shown to equal the square root of the correlation of any given test score with another given test score (the reliability coefficient).
-In turn, the reliability coefficient can be shown to estimate the ratio of variance in true scores to the
variance in observed scores.
-An alternative model, the parallel-test model, assumes that two or more tests produce equal true scores but generate independent random measurement error.
-Although this model defines rather than estimates the reliability coefficient, its major predictions are the same as those of the domain-sampling model.
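To make these two predictions concrete, here is a minimal simulation sketch (assuming Python with NumPy; the sample size, number of parallel tests, and error variance are illustrative choices, not values from the text). It draws parallel measures as a common true score plus independent random error, then checks that (a) the correlation between two tests estimates the ratio of true-score variance to observed-score variance, and (b) the correlation of one test with the average of all tests approximates the square root of that coefficient.

import numpy as np

rng = np.random.default_rng(0)
n_people, n_tests = 10_000, 50         # hypothetical sample and domain sizes
true = rng.normal(0, 1, n_people)      # true scores, variance 1
error_sd = 1.0                         # independent random error per test

# Each column is one parallel test: equal true score + independent error
tests = true[:, None] + rng.normal(0, error_sd, (n_people, n_tests))

# Reliability coefficient: correlation between two arbitrary parallel tests
r_xx = np.corrcoef(tests[:, 0], tests[:, 1])[0, 1]

# It should estimate the ratio of true-score variance to observed-score variance
print(r_xx, 1.0 / (1.0 + error_sd**2))           # both near 0.50

# Reliability index: correlation of one test with the average of all tests,
# which should approximate the square root of the reliability coefficient
domain_avg = tests.mean(axis=1)
r_index = np.corrcoef(tests[:, 0], domain_avg)[0, 1]
print(r_index, np.sqrt(r_xx))                    # both near 0.71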
The role of factorial complexity in measures of reliability is considered; a key point is that a test may
measure more than one thing (factor) yet be highly reliable.
-Measurement error can be thought of as a measure deviating from its true value, and such measurement error can be a mixture of systematic and random processes. When it is systematic, it can either affect all observations equally and be a constant error, or affect certain types of observations differently than others and be a bias.
Example 1: A miscalibrated thermometer that always reads three degrees too high illustrates a constant error in the physical sciences. If the thermometer were sensitive to some irrelevant attribute, such as the color or the density of what was being measured, the error would be a bias.
Example 2: Random error would be introduced if the person reading the thermometer were to transpose digits from time to time when reading observations.
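A small numeric sketch of the three kinds of error (assuming Python with NumPy; the three-degree offset, the color-dependent bias, and the noise level are made-up illustrative values):

import numpy as np

rng = np.random.default_rng(1)
true_temp = rng.uniform(10, 30, 1000)            # true temperatures
is_dark = rng.random(1000) < 0.5                 # irrelevant attribute (e.g., color)

constant_error = true_temp + 3.0                      # every reading 3 degrees too high
bias = true_temp + np.where(is_dark, 3.0, 0.0)        # only dark objects read too high
random_error = true_temp + rng.normal(0, 2.0, 1000)   # transposition-like noise

print(np.mean(constant_error - true_temp))    # 3.0 for every reading (constant error)
print(np.mean(bias[is_dark] - true_temp[is_dark]),
      np.mean(bias[~is_dark] - true_temp[~is_dark]))  # 3.0 vs 0.0 (bias)
print(np.std(random_error - true_temp))       # about 2.0, averaging to about 0 (random error)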
Constant errors
There are obvious biases and random errors in the behavioral sciences, although the situation may be
less obvious with constant errors. If clinician A were to judge the intelligence of each of a series of
individuals five points higher than clinician B, they would be calibrated differently, but either or both
could have a constant error since the true IQ is unknown.
Likewise, unsystematic differences in ratings on repeated testing illustrate one form of random error
when it can be assumed the person rated did not change.
Even if the concept of constant error were meaningful in the behavioral sciences, it affects all observations equally and therefore does not influence group comparisons, so it need not be considered further. Indeed, it has no effect by definition unless a scale has a meaningful zero (for example, a ratio or absolute scale), since it affects only the location of the scale mean.
However, a clinician, a rater, or an evaluative process may be sensitive to irrelevant attributes such as race or gender, and thereby be biased!
Random errors
What is more important and meaningful is the presence of random errors. They are important because they limit the degree of lawfulness in nature by complicating relationships. They might, for example, make a curve appear jagged and therefore more complex rather than smoother and simpler.
Sources of random measurement error include, for example:
2. Luck in guessing
3. State of alertness
4. Clerical errors
-Random measurement errors are never completely eliminated, but one should seek to minimize them
as much as possible
-One definition of reliability is freedom from random error – how repeatable observations are
Measurement reliability
In other words, a measurement is reliable to the extent that it leads to the same or similar results whenever essentially the same results should be obtained, regardless of the opportunities for variation to occur. Reliable measures allow one to generalize from one particular use of the method to a wide variety of related circumstances.
-But be aware that high reliability does not mean high validity!
Example 3: One could measure intelligence by having individuals throw stones as far as possible.
Distances obtained by individuals on one occasion will correlate highly with distances obtained on
another occasion. Being repeatable, the measures are highly reliable; but stone tossing is obviously not
a valid measure of intelligence.
The results will correlate with measures of strength, not intelligence.
-Measurement error places limits on the validity of an instrument, but even its complete absence does not guarantee validity; thus, reliability is necessary but not sufficient for validity.
Estimates of reliability
Some measures are so readily obtainable that it is possible to retest the subject many times.
Example 4: Practice a sample of subjects at reaction time responses until their performance is stable to
help satisfy the above assumption. Then run them for at least 100 trials to produce a domain of
responses. Their reaction time on one trial is a one-item test.
Correlate their results on one arbitrarily chosen trial (e.g., the tenth) with:
(1) Their results on another arbitrarily chosen trial (e.g., the fifteenth), and
(2) The average of their results over all trials.
-These two correlations reflect the reliability of individual differences in a one-item test of reaction time and should closely approximate a square-root relationship (the correlation with the average of all trials being roughly the square root of the trial-to-trial correlation).
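A minimal simulation sketch of this reaction-time example (assuming Python with NumPy; the number of subjects, the 100 trials, and the millisecond values are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_trials = 5_000, 100

# Stable individual differences in reaction time plus independent trial-to-trial error
true_rt = rng.normal(400, 50, n_subjects)                     # ms, between-person differences
trials = true_rt[:, None] + rng.normal(0, 80, (n_subjects, n_trials))

r_coeff = np.corrcoef(trials[:, 9], trials[:, 14])[0, 1]        # trial 10 vs trial 15
r_index = np.corrcoef(trials[:, 9], trials.mean(axis=1))[0, 1]  # trial 10 vs average of all trials

print(r_coeff, r_index, np.sqrt(r_coeff))   # r_index should be close to sqrt(r_coeff)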
In psychometrics, reliability refers to the consistency and stability of a measurement instrument or test
over repeated administrations. It is a crucial aspect of any psychological assessment, as it indicates the
extent to which the instrument produces consistent and dependable results. One of the factors that
contribute to the reliability of a test is the number of items it contains.
The number of items in a psychological test plays a significant role in determining its reliability. A test
with too few items may not adequately capture the construct it intends to measure, leading to
unreliable results. On the other hand, a test with too many items may become cumbersome and time-
consuming, potentially leading to participant fatigue and decreased motivation, affecting the overall
quality of responses.
To understand the relationship between reliability and the number of items, it's essential to consider the
concept of internal consistency. Internal consistency is a measure of how well the items within a test
correlate with each other. One common method for assessing internal consistency is Cronbach's alpha
coefficient. As the number of items increases, it can impact the internal consistency of the test.
In general, increasing the number of items in a test can enhance its internal consistency up to a certain
point. This is because a larger number of items provides a more comprehensive sampling of the
construct being measured, increasing the likelihood of capturing its various facets. However, after a
certain threshold, adding more items may result in diminishing returns, and the additional items might
not contribute substantially to the overall reliability.
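One way to see this diminishing-returns pattern is the generalized Spearman-Brown formula, which predicts the reliability of a test lengthened to k items from the reliability of a single item. A short sketch (in Python; the single-item reliability of 0.20 is an illustrative assumption):

# Generalized Spearman-Brown: r_kk = k * r / (1 + (k - 1) * r)
def spearman_brown(r_single, k):
    return k * r_single / (1 + (k - 1) * r_single)

r_item = 0.20  # assumed reliability of a single item
for k in (5, 10, 20, 40, 80):
    print(k, round(spearman_brown(r_item, k), 3))
# 5 -> 0.556, 10 -> 0.714, 20 -> 0.833, 40 -> 0.909, 80 -> 0.952
# Each doubling of test length adds less and less reliability (diminishing returns).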
Researchers and test developers often conduct reliability analyses to determine the optimal number of
items required to achieve a balance between precision and practicality. They may use statistical
techniques such as item-total correlations, factor analysis, and reliability coefficients to assess the
internal consistency of the test and make informed decisions about whether to add or remove items.
Several common statistics are used to estimate the reliability of a set of items in psychometrics. These
statistics help assess the consistency and stability of measurements, providing valuable information
about the quality of a psychological test. Here are some of the common reliability statistics:
Cronbach's alpha is a widely used measure of internal consistency. It assesses how well
items in a test are correlated with each other, providing an overall estimate of reliability.
Values closer to 1 indicate higher internal consistency.
Developed by Lee Cronbach in 1951, it has become one of the most widely used
measures of internal consistency.
It is computed as
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)
where k is the number of items in the test, \sigma_i^2 is the variance of item i, and \sigma_X^2 is the variance of the total score X.
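A minimal implementation sketch of this formula (assuming Python with NumPy; the small response matrix is made-up illustrative data):

import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Made-up 5-point ratings from 6 respondents on 4 items
responses = [[4, 5, 4, 5],
             [2, 3, 2, 2],
             [5, 5, 4, 4],
             [3, 3, 3, 4],
             [1, 2, 2, 1],
             [4, 4, 5, 5]]
print(round(cronbach_alpha(responses), 3))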
Interpretation:
1. Range of Values: values typically range from 0 to 1, with values closer to 1 indicating higher internal consistency.
Practical Applications:
1. Educational Testing:
Used to assess the reliability of educational tests, ensuring that the test
consistently measures a student's knowledge or ability.
2. Psychological Research:
Used to evaluate whether questionnaires and rating scales measure their intended constructs consistently.
3. Personnel Selection:
Used to check the reliability of assessments that inform hiring and placement decisions.
Alternative terminology. Cronbach's alpha, when computed for binary (e.g., true/false)
items, is identical to the so-called Kuder-Richardson-20 formula of reliability for sum
scales. In either case, because the reliability is actually estimated from the consistency
of all items in the sum scales, the reliability coefficient computed in this manner is also
referred to as the internal-consistency reliability.
This just means that you can use Cronbach’s equation on polytomous data such as Likert rating
scales. In the case of dichotomous data such as multiple choice items, Cronbach’s alpha and KR-
20 are the exact same.
As a formula designed for dichotomous items, KR-20 is not suitable for tests with items that
have more than two response categories.
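A short sketch of that equivalence (assuming Python with NumPy; the 0/1 response matrix is made-up data, and population (ddof = 0) variances are used consistently so the two formulas agree exactly):

import numpy as np

def alpha_ddof0(items):
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0).sum() / items.sum(axis=1).var())

def kr20(items):
    k = items.shape[1]
    p = items.mean(axis=0)                      # proportion answering each item correctly
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / items.sum(axis=1).var())

# Made-up right/wrong (1/0) responses: 6 examinees x 5 items
x = np.array([[1, 1, 1, 0, 1],
              [0, 0, 1, 0, 0],
              [1, 1, 1, 1, 1],
              [0, 1, 0, 0, 1],
              [1, 0, 1, 1, 1],
              [0, 0, 0, 0, 1]])
print(alpha_ddof0(x), kr20(x))   # identical values for dichotomous items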
3. Split-Half Reliability:
This method involves dividing a test into two halves and comparing the scores on each half. The
correlation between the scores on the two halves is then calculated. The Spearman-Brown
formula is often applied to correct for the artificially low reliability that can result from splitting
the test.
The Spearman-Brown prophecy formula for a split-half design is
r_{adjusted} = \frac{2r}{1 + r}
where r is the observed correlation coefficient obtained from the split-half reliability analysis.
Example:
Suppose the observed correlation (r) from a split-half analysis is 0.70. Applying the Spearman-
Brown prophecy formula:
r_{adjusted} = \frac{2 \times 0.70}{1 + 0.70} = \frac{1.40}{1.70} \approx 0.824
In this example, the adjusted coefficient r_{adjusted} provides an estimate of the reliability of the full-length test.
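A sketch of the whole split-half procedure from raw data (assuming Python with NumPy; the odd/even split and the simulated responses are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(3)
n_people, n_items = 500, 20

# Simulated item scores driven by one underlying trait plus item-level noise
trait = rng.normal(0, 1, n_people)
items = trait[:, None] + rng.normal(0, 1.5, (n_people, n_items))

half_a = items[:, 0::2].sum(axis=1)   # odd-numbered items
half_b = items[:, 1::2].sum(axis=1)   # even-numbered items

r = np.corrcoef(half_a, half_b)[0, 1]          # split-half correlation
r_adjusted = 2 * r / (1 + r)                   # Spearman-Brown correction
print(round(r, 3), round(r_adjusted, 3))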
Other statistical techniques involve: Test-Retest Reliability, Intraclass Correlation Coefficient (ICC),
Coefficient Omega (Ω), Guttman Split-Half Coefficient, Average Inter-Item Correlation, Coefficient of
Stability and Factor Analysis
When evaluating the reliability of a set of items, researchers often consider a combination of these
statistics to gain a comprehensive understanding of the measurement instrument's quality. Each statistic
has its strengths and limitations, and the choice of which to use depends on the nature of the test and
the goals of the assessment.
Attenuation
Attenuation in the context of reliability and the number of items refers to the potential
underestimation of the true reliability of a measurement instrument due to measurement error
associated with a limited number of items. Reliability refers to the consistency and stability of
measurements, and it is typically assessed using reliability coefficients such as Cronbach's alpha. The
number of items in a scale or test can influence the observed reliability estimate, and attenuation is
a concern when the number of items is insufficient to accurately reflect the underlying reliability of
the construct.
Key Concepts:
Attenuation in Reliability:
Attenuation in reliability occurs when the observed reliability coefficient is lower than the true reliability
due to measurement error. This is particularly relevant when the number of items in the scale is limited.
A small number of items may not adequately cover the full range of the construct being
measured, leading to an incomplete representation and potential underestimation of reliability.
With fewer items, the reliability estimate becomes more sensitive to the variability of individual
items. If one or a few items have low variability or are not well-aligned with the construct, it can
disproportionately impact the observed reliability.
Impact of Random Measurement Error:
Random measurement error has a greater impact on reliability estimates when the number of
items is limited. Inconsistencies or fluctuations in responses to a small number of items can
contribute to attenuation.
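A simulation sketch of this point (assuming Python with NumPy; the item counts, noise level, and sample size are illustrative assumptions): scales built from the same item pool show lower and more variable alpha estimates when they contain few items.

import numpy as np

def cronbach_alpha(items):
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(4)
n_people = 200

for k in (3, 6, 12, 24):
    alphas = []
    for _ in range(500):                       # repeated samples of people
        trait = rng.normal(0, 1, n_people)
        items = trait[:, None] + rng.normal(0, 1.5, (n_people, k))
        alphas.append(cronbach_alpha(items))
    alphas = np.array(alphas)
    print(k, round(alphas.mean(), 3), round(alphas.std(), 3))
# Mean alpha rises and its sampling variability shrinks as items are added.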
Number of Items:
Whenever possible, include a sufficient number of items in the scale to improve reliability. More items provide a more robust and stable estimate of the underlying construct.
Item Quality:
Ensure that the selected items are of high quality, tapping into different facets of the construct.
Well-designed items contribute more to reliability than a larger number of poorly constructed or
redundant items.
Factor Analysis:
Conduct factor analysis to explore the underlying factor structure of the scale. It helps ensure
that the items are measuring the intended construct and identifies potential sources of
attenuation.
Replication Studies:
Replicate studies with different samples to assess the generalizability of reliability estimates.
Consistent reliability across diverse samples enhances confidence in the stability of the measure.