Module 4
Psychometric properties
History and theory of reliability
Classical test theory posits that an observed score (X) equals the true score (T)
plus the error score (E).
The theory assumes that each person has a true score, which would be obtained if
there were no measurement errors.
In practice, measuring instruments are imperfect, and errors can be introduced by
individual or situational factors unrelated to the attribute being measured.
The difference between the observed score and the true score is the error of
measurement (E), represented by the equation:
X − T = E
The classical theory of reliability assumes that measurement errors are random,
not systematic, similar to a rubber yardstick that stretches and contracts at
random.
Measurement errors are caused by varied and complex factors, acting like random
variables; they are equally likely to be positive or negative and are uncorrelated
with true scores or other test errors.
Key assumptions of classical theory:
1. The mean error of measurement is zero.
2. The correlation between true scores and error scores is zero.
3. The correlation between errors on different measures is zero.
Classical reliability theory posits that the variance of obtained scores is the sum of the
variance of true scores and the variance of error scores:
σ²X = σ²T + σ²E
If error scores contribute significantly to variability, test scores will be inconsistent
(low reliability). If errors have little effect, the test scores will be consistent,
reflecting true scores more accurately.
The reliability coefficient measures the relative impact of true scores and errors on
obtained test scores.
Generalizability theory:
Generalizability theory (Cronbach et al., 1972) offers an alternative approach to
studying the consistency of test scores, addressing the limitations of classical
reliability theory.
A key weakness of classical reliability theory is its assumption that measurement
errors are random, which Cronbach argued is inadequate as error factors vary
across different estimation methods.
Generalizability theory recognizes both random and systematic sources of
inconsistency that contribute to error scores, unlike classical theory, which only
considers random error.
Generalizability theory sees the classical theory as a special case, emphasizing
that errors are not always random but can be influenced by systematic factors.
The central question in this theory is about the conditions under which test scores
can be generalized, asking "What conditions allow generalization?" and
"Under what conditions might results differ?"
It shifts focus from simply asking if a test is reliable to understanding when and
where the test is reliable, recognizing that a test might be reliable in some
contexts but unreliable in others.
The theory underscores the importance of context in interpreting test reliability,
asserting that systematic differences in when, where, or how a test is taken affect
the generalizability and meaning of test scores.
Meaning of Reliability
Definition:
Reliability refers to the precision and accuracy of a measurement or score.
It indicates the consistency of scores or measurements, reproducible over time
or across equivalent test items.
Consistency Types:
1. Temporal Stability: Consistency of scores when a test is repeated on the same
individuals over time.
2. Internal Consistency: Consistency of scores between two equivalent sets of
items within a single test administration.
A reliable test produces similar results for examinees across different occasions
or test forms.
On a reliable test, high scorers on one set of items should also score high on
equivalent items, and the same applies to low scorers.
Reliability is not inherent to the test itself but depends on how it performs when
administered to examinees.
Statistical Measures:
1. Coefficient of Stability: Correlation between scores from testing and retesting to
measure temporal stability.
2. Alpha Coefficient: Correlation between scores of two equivalent item sets in a
single administration to measure internal consistency.
Reliability is described as the self-correlation of the test, derived from correlations
between repeated measures or equivalent sets of items.
The meaning of reliability can be further clarified by noting the following important
points:
1. Reliability as a Property of Test Scores:
o Reliability pertains to the test scores or results, not the assessment
instrument itself.
o An instrument can have different reliabilities depending on the context or
the group being assessed.
o It is more accurate to refer to the reliability of scores rather than the
test's reliability.
2. Relationship Between Reliability and Validity:
o Reliability is necessary but not sufficient for validity.
o Inconsistent results (low reliability) make validity impossible.
o High reliability does not guarantee validity if the test measures the wrong
construct or is misused.
3. Types of Consistency in Reliability:
o Reliability relates to specific types of consistency, such as:
Over time periods (temporal stability).
Across raters (inter-rater reliability).
Among samples of tasks (internal consistency).
o Scores may be consistent in one aspect but not in others.
Types of Errors:
Random Errors: Fluctuate between positive and negative, canceling out over time
(mean = 0).
Systematic Errors: Constantly inflate or depress scores, resulting in non-zero
mean errors.
Variance is defined as the standard deviation squared (SD²). In terms of an equation, the
situation may be written as:
σ²t = σ²∞ + σ²e
where
σ²t = total variance of the test scores;
σ²∞ = variance of true scores;
σ²e = variance of error scores.
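The decomposition can be illustrated with a small simulation. The Python sketch below is not part of the original notes; the sample size, means, and standard deviations are arbitrary assumptions chosen only to show that the observed-score variance is approximately the sum of true-score and error variances, and that reliability is the ratio of true to total variance.

```python
# Illustrative sketch only: invented parameters, classical model X = T + E.
import numpy as np

rng = np.random.default_rng(0)
n_examinees = 10_000

true_scores = rng.normal(loc=50, scale=8, size=n_examinees)  # T
errors = rng.normal(loc=0, scale=4, size=n_examinees)        # E: mean 0, uncorrelated with T
observed = true_scores + errors                               # X = T + E

var_t, var_e, var_x = true_scores.var(), errors.var(), observed.var()
print(f"var(T) + var(E) = {var_t + var_e:.1f}  vs  var(X) = {var_x:.1f}")
print(f"reliability = var(T)/var(X) = {var_t / var_x:.2f}")   # about 64 / 80 = 0.80
```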
Test-Retest Reliability
A single test is administered twice to the same sample with a reasonable time
gap.
The correlation between the two sets of scores provides the reliability coefficient
(temporal stability coefficient).
Purpose: Measures the consistency of scores over time, indicating whether
examinees retain their relative positions across administrations.
Appropriate Time Gap: A fortnight (approximately two weeks) is considered an
optimal time gap to balance carryover effects (if too short) and maturation effects
(if too long).
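As a computational illustration (not from the original notes; the score lists are invented), the coefficient of stability is simply the Pearson correlation between the two sets of scores:

```python
# Hypothetical data: scores of the same ten examinees on two administrations.
from scipy.stats import pearsonr

first_administration = [12, 18, 25, 30, 22, 15, 28, 20, 17, 26]
second_administration = [14, 17, 27, 29, 21, 16, 27, 22, 15, 25]

r_tt, _ = pearsonr(first_administration, second_administration)
print(f"test-retest (stability) coefficient = {r_tt:.2f}")
```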
Advantages:
Effective for estimating the reliability of:
1. Speed tests.
2. Power tests.
3. Heterogeneous tests.
Disadvantages and Sources of Error:
Time Sampling Error: Changes in scores due to factors occurring over time.
Examinee-related Issues: Variations in health, emotional state, motivation, and
mental readiness.
Examiner-related Issues: Differences in physical and mental states during test
administrations.
Environmental Changes: External factors altering testing conditions.
Maturational Effects: Particularly significant for young children over longer
intervals, causing fluctuations in scores.
Carryover Effects:
Familiarity with test items may:
1. Aid in the second administration through skill improvement or memory.
2. Inflate the reliability coefficient, contributing to true variance.
Limitations:
Time-consuming process.
Not suitable for tests assessing dynamic or rapidly changing characteristics.
Despite its limitations, test-retest reliability is an effective method for estimating
reliability for specific test types and provides valuable insights into temporal stability.
Split-Half Method:
Common method to estimate internal consistency reliability.
The test is divided into two equal or nearly equal parts, often using the odd-even
method:
o Odd-numbered items (e.g., 1, 3, 5) form one half.
o Even-numbered items (e.g., 2, 4, 6) form the other half.
Scores from each half are correlated using Product Moment (PM) correlation to
determine half-test reliability.
Spearman-Brown Formula:
Used to estimate the reliability of the entire test from the half-test reliability:
rtt = 2r½½ / (1 + r½½) = (2 × Reliability of the half test) / (1 + Reliability of the half test)
where
rtt = reliability coefficient of the whole test;
r½½ = reliability of the half test (correlation between the two half scores).
Rulon Formula:
Estimates the reliability of the whole test directly from the variance of the differences
between the two half scores:
rtt = 1 − (σ²d / σ²t)
where
rtt = reliability coefficient;
σ²d = variance of the difference between the two half scores for each examinee;
σ²t = variance of the total score.
Total score for an examinee is the sum of his scores on the two halves of the test.
σd = (1/N) √[N∑d² − (∑d)²]
where
d represents the difference between the scores on the two halves for each
examinee,
N is the number of examinees.
Advantages:
Provides a direct estimate of error variance using differences between two
halves of the test.
Convenient when the test can be split evenly.
Rulon Formula: Focuses on the variance of differences between half-test scores.
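A minimal Python sketch of the split-half procedure follows. It is illustrative only: the 0/1 item matrix is invented, and the code simply applies the odd-even split, the Spearman-Brown formula, and the Rulon formula described above.

```python
import numpy as np

# Invented item matrix: rows = examinees, columns = dichotomously scored items.
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
])

odd_half = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, 7
even_half = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8
total = scores.sum(axis=1)

# Half-test reliability (Pearson r between halves), stepped up by Spearman-Brown.
r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_spearman_brown = 2 * r_half / (1 + r_half)

# Rulon formula: rtt = 1 - variance of half-score differences / total-score variance.
d = odd_half - even_half
r_rulon = 1 - d.var() / total.var()

print(f"Spearman-Brown = {r_spearman_brown:.2f}   Rulon = {r_rulon:.2f}")
```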
Flanagan Formula
Purpose: Estimates reliability by analyzing the variance within each half of the
test, avoiding the use of differences between scores.
Formula:
rtt = 2 × [1 − (σ²1 + σ²2) / σ²t]
where
rtt = reliability coefficient;
σ²1 = variance of scores on the first half;
σ²2 = variance of scores on the second half;
σ²t = variance of the total score.
Advantages:
1. Does not rely on the difference of scores between two halves, making it easier to
compute for some datasets.
2. Provides similar reliability estimates as the Rulon formula.
Flanagan Formula: Focuses on the variances of the scores for each half
independently
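A short sketch of the Flanagan formula, using invented half-test scores (not from the notes):

```python
import numpy as np

# Hypothetical half-test scores for six examinees.
first_half = np.array([4, 2, 4, 0, 3, 4])
second_half = np.array([3, 1, 4, 1, 2, 3])
total = first_half + second_half

# Flanagan: rtt = 2 * (1 - (variance of half 1 + variance of half 2) / total variance).
r_flanagan = 2 * (1 - (first_half.var() + second_half.var()) / total.var())
print(f"Flanagan reliability estimate = {r_flanagan:.2f}")
```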
Comparison with the Spearman-Brown Formula
Both the Rulon formula and the Flanagan formula have a distinct advantage
over the Spearman-Brown formula:
o They directly estimate the reliability of the whole test, without requiring
computation of the half-test reliability coefficient.
These methods also apply to alternate forms of the test.
Both methods yield similar reliability coefficients and are useful alternatives to the
Spearman-Brown formula, particularly when analyzing the internal consistency of a
test.
Kuder-Richardson Formulas
The Kuder-Richardson Formulas (K-R20 and K-R21) were developed to address issues
with the split-half reliability method, providing alternative ways to estimate the
internal consistency of a test. These formulas, devised by Kuder and Richardson
(1937), are widely used for tests where items are scored dichotomously (e.g., right or
wrong, scored as 1 or 0).
The Kuder-Richardson formulas (K-R20 and K-R21) have specific requirements that
must be met to ensure accurate estimation of internal consistency reliability for a test;
these requirements are listed after the formula below.
KR20 = [n / (n − 1)] × [(σ²t − ∑pq) / σ²t]
where
KR20 = reliability coefficient by K-R 20;
n = number of items in the test;
σ²t = variance of scores on the test;
p = proportion of correct answers to each item;
q = proportion of incorrect answers to each item; hence it is equal to 1 − p.
Requirements:
Items must measure the same trait or construct (unifactorial test).
Items must be scored dichotomously (e.g., correct/incorrect).
Strengths:
Suitable for analyzing tests with varying item difficulties.
Provides a highly accurate reliability coefficient when item-level difficulty data is
available.
Limitations:
Requires an item analysis worksheet, which involves considerable effort to
compute the ∑pq term for each item.
Kuder-Richardson Formula 21 (K-R21)
Purpose: Simplifies the K-R20 formula by assuming all items have equal difficulty
levels.
Formula:
KR21 = [n / (n − 1)] × [(σ²t − npq) / σ²t]
Where:
npq: Simplified term assuming all items have the same difficulty index.
Requirements:
Does not require an item analysis worksheet, making it computationally easier.
Suitable when items are of equal difficulty levels or when such an assumption
is reasonable.
Limitations:
Produces less accurate results if item difficulty levels vary significantly.
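The two formulas can be compared on the same data. The sketch below is illustrative only (the item matrix is invented); it computes both K-R 20 and K-R 21 and shows that K-R 21 gives the lower value when item difficulties vary.

```python
import numpy as np

# Invented dichotomously scored item matrix: rows = examinees, columns = items.
scores = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 0, 1, 1, 1, 0],
])
n = scores.shape[1]                       # number of items
var_t = scores.sum(axis=1).var()          # variance of total scores

p = scores.mean(axis=0)                   # proportion correct for each item
q = 1 - p
kr20 = (n / (n - 1)) * (var_t - (p * q).sum()) / var_t

p_bar = p.mean()                          # K-R 21 assumes a single common difficulty
kr21 = (n / (n - 1)) * (var_t - n * p_bar * (1 - p_bar)) / var_t

print(f"K-R 20 = {kr20:.2f}   K-R 21 = {kr21:.2f}")  # K-R 21 <= K-R 20 here
```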
Limitations of K-R Formulas
1. Heterogeneous Tests: The formulas underestimate reliability if the test is not
unifactorial (measuring different constructs).
2. Item Difficulty Variability: K-R21 especially underestimates reliability when
items have unequal difficulty levels.
3. Speed Tests: Not appropriate for speed tests or tests that are similar to speed
tests.
Relation to Split-Half Reliability
The K-R formulas are mathematically equivalent to the mean of all split-half
reliability coefficients obtained using the Rulon formula.
This equivalence holds true when split-half coefficients are calculated using
Rulon and not the Spearman-Brown formula.
Coefficient Alpha (Cronbach's Alpha)
While K-R formulas are designed for dichotomous data, Cronbach's Alpha
generalizes this approach to non-dichotomous (continuous) data. Cronbach's
Alpha is considered an advanced and versatile method, capable of handling a wider
range of data types.
The K-R20 formula is preferred for more precise reliability estimates, provided
detailed item-level data is available. The K-R21 formula offers a simpler but less
accurate alternative when item difficulties are assumed to be uniform. Both
approaches emphasize the importance of test homogeneity and are instrumental in
evaluating the internal consistency of assessments.
rtt = α = [n / (n − 1)] × [1 − (∑σ²i / σ²t)]
Where:
n = number of items in the test;
σ²t = variance of the total test score;
σ²i = variance of the individual item scores for each item in the test;
∑σ²i = sum of the variances of the individual item scores across all items.
Internal Consistency: Coefficient alpha estimates the internal consistency of the
test. This tells us how well the items in the test are related to one another and
measure the same underlying construct. A higher alpha value indicates that the test
items are more consistent with each other.
Range of Alpha:
The coefficient alpha ranges from 0 (no internal consistency) to 1 (perfect internal
consistency). An alpha closer to 1 suggests that the items in the test are highly
consistent, while an alpha closer to 0 indicates that the items may not be
measuring the same thing.
A negative alpha suggests that the items are negatively correlated with one
another, implying that an inappropriate reliability model or test structure is being
used.
Use in Multi-Scored Items: While the Kuder-Richardson formulas (K-R20 and K-
R21) are suitable for tests with binary (0 or 1) scoring, Cronbach's alpha is
appropriate for tests with items scored on a scale (e.g., Likert scales) where the
response options are not binary.
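A minimal sketch of coefficient alpha for scaled (e.g., Likert) items follows; the response matrix is invented, and the code simply applies the formula given above.

```python
import numpy as np

# Hypothetical 1-5 Likert responses: rows = respondents, columns = items.
responses = np.array([
    [4, 5, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
])
n = responses.shape[1]                        # number of items
item_variances = responses.var(axis=0)        # variance of each item
total_variance = responses.sum(axis=1).var()  # variance of total scores

alpha = (n / (n - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")
```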
Alternate-Forms Reliability
Also known as parallel-forms reliability, equivalent-forms reliability, and
comparable-forms reliability.
Requires two comparable forms of the same test, administered to the same sample
either immediately or after a time interval (typically two weeks).
Types of Alternate-Forms Reliability:
1. Immediate: Calculated when the two forms are administered on the same day.
2. Delayed: Calculated when the two forms are administered after a time gap (e.g.,
two weeks).
Error Variance:
Immediate Reliability: Error variance comes from content sampling.
Delayed Reliability: Error variance includes time sampling, content sampling, and
content heterogeneity.
Reliability Measure:
The reliability is determined by the Pearson r correlation between scores from the
two forms.
This correlation is known as the coefficient of equivalence.
Factors Affecting Reliability:
Short time intervals: May lead to practice or memory effects, increasing true
variance and raising reliability.
Long time intervals: May resemble test-retest reliability, introducing demerits
like memory effects or recall bias.
Content differences: Significant changes between the forms reduce reliability by
increasing error variance.
Challenges in Test Equivalence:
The difficulty in ensuring that the two forms are truly equivalent, as unequal
means, variances, or correlations could distort the reliability coefficient.
Gulliksen’s Definition (1950): Parallel tests must have equal means, variances,
and inter-item correlations.
Criteria for Parallel Tests (Freeman, 1962):
Same number of items, similar content, difficulty range, and item homogeneity.
Distribution of difficulty levels and means/standard deviations should be similar.
Uniform administration and scoring.
Practical Challenges:
Meeting all criteria for equivalence is difficult, requiring considerable effort in test
design (e.g., writing items in different languages for two forms).
Alternate-forms reliability is especially useful in speed tests but can be applied to
power tests as well.
Scorer Reliability
Scorer reliability is essential for tests that involve subjective scoring (e.g., creativity
and projective personality tests).
It is determined by comparing scores from multiple examiners and calculating the
correlation coefficient between them.
Source of Error: The main source of error in scorer reliability is the differences
between examiners (inter-scorer differences).
Relative Reliability Methods:
Test-retest, internal consistency, and parallel-forms reliability use correlation
coefficients to measure reliability, known as relative reliability.
Analysis of Variance (ANOVA) is also applied as a measure of relative reliability.
Hoyt's Assumptions for ANOVA:
Total score on a test can be divided into four components: common components,
item-related components, examinee-related components, and error components.
Error variance is assumed to be equal and normally distributed for each item.
Error components are independent across items.
Formula:
rtt = 1 − (Vr / Ve)
where
rtt = reliability coefficient;
Vr = error variance;
Ve = variance among examinees.
Relation to K-R 20: Hoyt's formula yields the same reliability coefficient as the K-R
20 formula, and has similar limitations (e.g., should not be used for tests where speed
is crucial).
Application of ANOVA: ANOVA can also be applied to alternate forms of tests and
retests to estimate reliability.
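As an illustration of Hoyt's approach (not from the original notes; the data are invented and the mean squares are computed directly with numpy), the reliability is obtained from a two-way breakdown of an examinee × item score matrix:

```python
import numpy as np

# Hypothetical examinee x item matrix of dichotomous scores.
scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
], dtype=float)
n_persons, n_items = scores.shape
grand_mean = scores.mean()

ss_total = ((scores - grand_mean) ** 2).sum()
ss_persons = n_items * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_items = n_persons * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_error = ss_total - ss_persons - ss_items

v_e = ss_persons / (n_persons - 1)                   # variance among examinees (Ve)
v_r = ss_error / ((n_persons - 1) * (n_items - 1))   # error (residual) variance (Vr)

r_tt = 1 - v_r / v_e
print(f"Hoyt reliability = {r_tt:.2f}")   # same value K-R 20 / alpha would give here
```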
Intrinsic Factors
1. Length of the Test:
Longer Tests: Tend to yield higher reliability coefficients as they provide a broader
sampling of the content being measured.
Spearman-Brown Formula:
o Used to estimate reliability for lengthened tests.
o Assumes added items have similar properties (difficulty, variance, inter-item
correlations) and do not influence examinee responses.
Example: Doubling test length increases true variance fourfold but error variance
only doubles, improving overall reliability.
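The general Spearman-Brown prophecy form, rkk = k·r / (1 + (k − 1)·r), is not written out in these notes, but a quick sketch (with an invented starting reliability of 0.60) shows how lengthening raises reliability:

```python
def spearman_brown(r: float, k: float) -> float:
    """Estimated reliability of a test lengthened k times, given reliability r."""
    return k * r / (1 + (k - 1) * r)

r_original = 0.60                                                 # invented example value
print(f"doubled (k = 2): {spearman_brown(r_original, 2):.2f}")    # about 0.75
print(f"tripled (k = 3): {spearman_brown(r_original, 3):.2f}")    # about 0.82
```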
2. Range of Total Scores:
Wide Range: Higher variability among scores increases reliability.
Narrow Range: Lower variability decreases reliability.
Statistically, a higher standard deviation of total scores indicates higher reliability.
3. Homogeneity of Items:
Homogeneous Items: High inter-item correlations and measurement of the same
trait improve reliability.
Heterogeneous Items: Low inter-item correlations and measurement of diverse
traits reduce reliability.
4. Difficulty Value of Items:
Items with moderate difficulty levels (e.g., difficulty index around 0.5) yield higher
reliability.
Extremely easy or difficult items often fail to differentiate between examinees,
lowering reliability.
5. Discrimination Value:
High Discrimination: Items that distinguish well between superior and inferior
performers contribute to higher reliability.
Low Discrimination: Poorly discriminating items weaken item-total correlations
and reduce reliability.
6. Scorer Reliability:
Refers to consistency among scorers in evaluating responses.
Discrepancies between scorers lower the reliability of the test.
UNIT - 4 VALIDITY
Validity refers to the degree to which a test measures what it claims to measure.
It is determined by correlation with independent, external criteria, not the self-
correlation of the test.
Validity is about the accuracy, appropriateness, and usefulness of inferences made
from test scores.
Prominent definitions emphasize agreement between test scores and the traits or
abilities they aim to measure.
Generalizability: Validity ensures that conclusions from a test can be generalized to
a broader population.
Validity Coefficient: This is the correlation between a test and ideal independent
criteria measuring the same trait.
Modern Perspective:
Validity focuses on evidence supporting inferences from test scores, not the test
itself.
Evidence can be content-related, criterion-related, or construct-related.
Properties of Validity:
Relative: A test is valid only for a specific purpose or context.
Dynamic: Validity evolves over time and requires periodic reassessment.
Degree-Based: Validity is not absolute; it varies in degree.
Unitary Concept: Viewed as a single construct based on multiple types of
evidence, rather than separate types of validity.
Evaluative Judgement: Involves evaluating the appropriateness and
consequences of test interpretations and uses.
Aspects of validity
Purpose of Validity: A test's validity is tied to the purpose it serves, leading to
different aspects of validity based on the specific purpose of the test.
Three Main Testing Purposes:
1. Content Representation: Measures current performance in specific content
areas (e.g., an English spelling test evaluating students' spelling abilities).
2. Functional Relationship: Predicts future outcomes or determines present
standing on a related variable (e.g., mechanical aptitude predicting job
performance).
3. Hypothetical Traits: Assesses abstract, non-observable traits like extroversion,
intelligence, or neuroticism through test performance.
Three Types of Validity:
1. Content Validity: Ensures the test represents the content it aims to measure.
2. Criterion-Related Validity: Establishes a functional relationship between the test
and an external variable (present or future).
3. Construct Validity: Measures the degree to which the test assesses a
hypothetical trait or quality.
Evolution of Validity Understanding:
Earlier distinctions among validity types (e.g., predictive, criterion-related) have
been critiqued for creating unnecessary divisions.
Modern perspectives emphasize evidence-based validity rather than distinct
subcategories.
Overlap among validity categories suggests they are not entirely separate but part
of a unified concept.
Current Perspective:
The 1999 Standards for Educational and Psychological Tests recommend
focusing on various evidence categories for validity rather than treating validity
types as distinct.
Validity is now understood as a comprehensive evaluation of test evidence rather
than discrete classifications.
Criterion-related Validity
Criterion-related validity evaluates a test's effectiveness by correlating test scores
with an external, independent measure (criterion) of the same variable the test
aims to assess.
The criterion serves as the "true" or reliable external measure, and validity is
estimated as the correlation coefficient between test scores and criterion scores.
Subtypes of Criterion-Related Validity:
Predictive Validity: Assesses how well test scores can predict future
performance or outcomes related to the criterion.
Concurrent Validity: Examines the correlation between test scores and the
criterion when both are measured simultaneously.
Widely used in assessing the effectiveness of tests for predicting future performance
or confirming current abilities.
Predictive Validity
Predictive validity measures how well a test predicts future outcomes. Test scores
are obtained first, followed by a time gap (months or years), after which criterion
scores are measured and correlated with the test scores.
The correlation between test scores and future criterion scores is used as the
predictive validity coefficient (often calculated using Pearson’s product-
moment correlation).
Predictive validity is essential for tests forecasting future academic achievement,
vocational success, or long-term outcomes like therapy effectiveness.
Example:
Academic Example: Administering an intelligence test at postgraduate admission
and later correlating the test scores with students' grades after two years. A high
correlation indicates predictive validity.
Workplace Example: Using a mechanical aptitude test to predict job
performance, with future work performance (e.g., units produced) being the
criterion. A high correlation confirms predictive validity.
Concurrent Validity
Concurrent validity measures how well a test correlates with a criterion that is
available at the same time. Unlike predictive validity, there is no time gap between
when the test and criterion scores are obtained.
Example: A new intelligence test may be correlated with scores from an already
established intelligence test to determine its validity. Similarly, an intelligence test
might be validated against students' current academic marks.
Methods to Determine Concurrent Validity:
Relationship method: Administer the test and criterion to the same group, then
correlate the scores to calculate validity.
Discrimination method: Determines if the test can differentiate between
individuals with or without a certain characteristic (e.g., using a mental adjustment
inventory to distinguish between institutionalized and non-institutionalized
individuals).
Comparison to Predictive Validity: Concurrent validity is typically higher than
predictive validity. Over time, the correlation between test scores and future
outcomes tends to weaken, leading to lower predictive validity.
Challenges: The selection of an appropriate criterion is crucial. An inadequate or
inappropriate criterion can lower the validity coefficient. For example, factors like
motivation, emotional adjustment, or health can influence academic performance or
job productivity, distorting the correlation between the test and the criterion.
Correction for Attenuation: If the reliability of the criterion is low, it is
recommended to correct the validity coefficient for attenuation (to account for
measurement errors), as the correlation may be weaker than the true relationship
between the test and the criterion.
Correction for Attenuation
Correction for attenuation is used when both the test and the criterion are
unreliable (due to measurement errors or faulty construction), which leads to a
lower correlation coefficient between the test and the criterion. The purpose of the
correction is to adjust the validity coefficient to reflect the true relationship
between the test and criterion, accounting for measurement errors.
Types of Correction:
o Full Correction: Corrects for attenuation in both the test and the criterion.
o One-way Correction: Corrects for attenuation in only the criterion,
assuming the test is more reliable than the criterion. Here, the validity
coefficient is increased by correcting for the criterion's attenuation.
Interpretation:
o The full correction formula adjusts both the test and criterion to eliminate
measurement errors in both, which increases the validity coefficient.
o The one-way correction is more commonly applied, as the criterion is often
less reliable than the test. This adjustment increases the validity coefficient
as it accounts for measurement errors in the criterion.
Upper Limit: The correction formula sets an upper limit for the correlation
between the test and criterion. If both the test and criterion have perfect reliability
(1.0), the correction for attenuation will not alter the validity coefficient, as no
errors remain to be corrected.
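The standard attenuation formulas are not written out in these notes; the sketch below applies them with invented coefficients. The full correction divides the observed validity by the square root of the product of both reliabilities, and the one-way correction divides by the square root of the criterion reliability only.

```python
import math

r_xy = 0.42    # observed validity coefficient (invented)
r_xx = 0.90    # reliability of the test (invented)
r_yy = 0.60    # reliability of the criterion (invented)

full_correction = r_xy / math.sqrt(r_xx * r_yy)   # corrects both test and criterion
one_way_correction = r_xy / math.sqrt(r_yy)       # corrects the criterion only

print(f"observed = {r_xy:.2f}   one-way = {one_way_correction:.2f}   full = {full_correction:.2f}")
```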
Correlation Methods
Validity of a Test: Defined by its correlation with an independent external criterion.
Correlation Methods: Different methods to calculate the correlation coefficient
include:
Pearson r: Most commonly used.
Biserial r and Point Biserial r: Used when one variable is dichotomous (e.g.,
Pass/Fail) and the other is continuous.
Tetrachoric r and Phi coefficient: Used when both variables are divided into two
categories.
Multiple Correlation: Used when more than two measures are involved,
represented by R, which shows the relationship between one measure and the
composite of multiple measures.
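For example, the point biserial r can be computed directly; the sketch below uses invented pass/fail and test-score data:

```python
from scipy.stats import pointbiserialr

pass_fail = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]               # dichotomous criterion (hypothetical)
test_scores = [78, 55, 82, 70, 60, 88, 52, 74, 80, 58]   # continuous test scores (hypothetical)

r_pb, _ = pointbiserialr(pass_fail, test_scores)
print(f"point biserial r = {r_pb:.2f}")
```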
Expectancy Tables
Expectancy tables show the relationship between test scores and criterion
measures, typically expressed as probabilities or percentages.
Structure:
o Test scores are plotted against criterion scores, forming a bivariate
distribution.
o Correlation coefficients (validity coefficients) are computed from this
distribution.
Purpose: Used to estimate the predictive efficiency of a test by determining the
likelihood of achieving specific performance outcomes based on test scores.
Application Examples:
o Probability of a student with an IQ of 100 securing the first rank.
o Probability of a clerical test score correlating to an "excellent" performance
rating.
o Probability of a medical entrance test score predicting success after course
completion.
Flexibility: Expectancy tables vary depending on the specific probabilities being
determined but always predict performance based on known scores.
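A simple expectancy table can be built by cross-tabulating score bands against criterion categories and converting each row to percentages. The sketch below is illustrative only; the score bands, categories, and data are invented.

```python
import pandas as pd

# Hypothetical examinees with a test-score band and a criterion rating.
df = pd.DataFrame({
    "score_band": ["high", "high", "high", "middle", "middle", "middle",
                   "middle", "low", "low", "low"],
    "criterion": ["excellent", "excellent", "average", "excellent", "average",
                  "average", "poor", "average", "poor", "poor"],
})

# Each row gives, for a score band, the percentage of examinees in each
# criterion category (an expectancy table of probabilities).
expectancy = pd.crosstab(df["score_band"], df["criterion"], normalize="index") * 100
print(expectancy.round(0))
```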
Cut-off Score
A score that separates potentially superior examinees from potentially inferior
ones.
Purpose: To maximize the selection of superior examinees and eliminate inferior
ones.
Demerit: Screening using a cut-off score may not be perfect due to:
o Imperfect validity of the test.
o The inability of a single test to fully reveal an examinee's ability or potential.
o This can result in some inferior examinees being selected and some superior
ones being rejected.
Miscellaneous Techniques
Additional Techniques for Validity: Some less commonly used methods exist for
validating tests.
Contrasted Groups: Comparing test scores of distinct groups (e.g., normals vs.
neurotics, institutionalized mental defectives vs. normal schoolchildren) to
establish validity.
Age Group Comparison: Comparing success percentages of students across
different age groups as a method of validation.