
Module - 4

Psychometric properties
History and theory of reliability
 Classical test theory posits that an observed score (X) equals the true score (T)
plus the error score (E).
 The theory assumes that each person has a true score, which would be obtained if
there were no measurement errors.
 In practice, measuring instruments are imperfect, and errors can be introduced by
individual or situational factors unrelated to the attribute being measured.
 The difference between the observed score and the true score is the error of
measurement (E), represented by the equation:
X − T = E
 The classical theory of reliability assumes that measurement errors are random,
not systematic, similar to a rubber yardstick that stretches and contracts at
random.
 Measurement errors are caused by varied and complex factors, acting like random
variables; they are equally likely to be positive or negative and are uncorrelated
with true scores or other test errors.
 Key assumptions of classical theory:
1. The mean error of measurement = 0.
2. The correlation between true scores and error scores = 0.
3. The correlation between errors on different measures = 0.
Classical reliability theory posits that the variance of obtained scores is the sum of the
variance of true scores and the variance of error scores:
σ²_X = σ²_T + σ²_E
 If error scores contribute significantly to variability, test scores will be inconsistent
(low reliability). If errors have little effect, the test scores will be consistent,
reflecting true scores more accurately.
 The reliability coefficient measures the relative impact of true scores and errors on
obtained test scores.

Domain sampling model:


 The domain sampling model addresses the issue of using a limited number of
items to represent a larger, complex ability or construct.
For example, to evaluate spelling ability, a researcher would ideally test every word in
a dictionary, but instead, a sample of words is used.
 The true score would be the percentage of words correctly spelled if all words
were tested, but researchers aim to estimate how much error is introduced by
using a shorter test.
 The domain sampling model views reliability as the ratio of the variance of
observed scores on a shorter test to the variance of the true score over the long
run.
 Errors arise from using a sample of items rather than the entire domain, and
increasing the number of items in the sample increases reliability.
 Reliability is estimated through the correlation of the observed test score with the
true score, although finding the true score is usually not possible.
 It is assumed that when items are randomly selected from a domain, each test
provides an unbiased estimate of the true score.

Generalizability theory:
 Generalizability theory (Cronbach et al., 1972) offers an alternative approach to
studying the consistency of test scores, addressing the limitations of classical
reliability theory.
 A key weakness of classical reliability theory is its assumption that measurement
errors are random, which Cronbach argued is inadequate as error factors vary
across different estimation methods.
 Generalizability theory recognizes both random and systematic sources of
inconsistency that contribute to error scores, unlike classical theory, which only
considers random error.
 Generalizability theory sees the classical theory as a special case, emphasizing
that errors are not always random but can be influenced by systematic factors.
 The central question in this theory is about the conditions under which test scores
can be generalized, asking "What conditions allow generalization?" and
"Under what conditions might results differ?"
 It shifts focus from simply asking if a test is reliable to understanding when and
where the test is reliable, recognizing that a test might be reliable in some
contexts but unreliable in others.
 The theory underscores the importance of context in interpreting test reliability,
asserting that systematic differences in when, where, or how a test is taken affect
the generalizability and meaning of test scores.

Meaning of Reliability
Definition:
 Reliability refers to the precision and accuracy of a measurement or score.
 It indicates the consistency of scores or measurements, reproducible over time
or across equivalent test items.
Consistency Types:
1. Temporal Stability: Consistency of scores when a test is repeated on the same
individuals over time.
2. Internal Consistency: Consistency of scores between two equivalent sets of
items within a single test administration.
 A reliable test produces similar results for examinees across different occasions
or test forms.
 On a reliable test, high scorers on one set of items should also score high on
equivalent items, and the same applies to low scorers.
 Reliability is not inherent to the test itself but depends on how it performs when
administered to examinees.
Statistical Measures:
1. Coefficient of Stability: Correlation between scores from testing and retesting to
measure temporal stability.
2. Alpha Coefficient: Correlation between scores of two equivalent item sets in a
single administration to measure internal consistency.
Reliability is described as the self-correlation of the test, derived from correlations
between repeated measures or equivalent sets of items.
The meaning of reliability can be further clarified by noting the following important
points:
1. Reliability as a Property of Test Scores:
o Reliability pertains to the test scores or results, not the assessment
instrument itself.
o An instrument can have different reliabilities depending on the context or
the group being assessed.
o It is more accurate to refer to the reliability of scores rather than the
test's reliability.
2. Relationship Between Reliability and Validity:
o Reliability is necessary but not sufficient for validity.
o Inconsistent results (low reliability) make validity impossible.
o High reliability does not guarantee validity if the test measures the wrong
construct or is misused.
3. Types of Consistency in Reliability:
o Reliability relates to specific types of consistency, such as:
 Over time periods (temporal stability).
 Across raters (inter-rater reliability).
 Among samples of tasks (internal consistency).
o Scores may be consistent in one aspect but not in others.

Logical Meaning or Technical Meaning of Reliability


 Reliability measures the accuracy of test scores by accounting for errors in
measurement.
 A perfect reliability coefficient is +1.00, indicating no error, but this is an ideal
rarely achieved in practice.
 Components of an Obtained Score: Any obtained score consists of:
1. True score: Free from all errors.
2. Error score: Includes measurement errors.
Formula:
X_T = X_∞ + X_e
where,
 X_T = the actual obtained score;
 X_∞ = the true score (∞, the infinity sign, is used to represent the true score);
 X_e = the error score.

Types of Errors:
 Random Errors: Fluctuate between positive and negative, canceling out over time
(mean = 0).
 Systematic Errors: Constantly inflate or depress scores, resulting in non-zero
mean errors.
Variance is defined as the standard deviation squared, or SD². In terms of an equation, the
situation may be written as:
σ²_T = σ²_∞ + σ²_e
Where
 σ²_T = total variance of the test scores;
 σ²_∞ = variance of true scores;
 σ²_e = variance of error scores.

Reliability Coefficient - Reliability measures the proportion of true variance to the
total variance.
Formula:
r_tt = σ²_∞ / σ²_T = True variance / Obtained variance
r_tt = 1 − σ²_e / σ²_T = 1 − (Error variance / Obtained variance)

Impact of Error on Reliability:


 Smaller error scores result in higher reliability.
 Errors arise from factors like scoring issues, administration errors, examinee’s
motivation, guessing, and misunderstanding instructions.
Reliability is defined as the proportion of true variance within the total variance of
test scores.
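A brief worked example (with hypothetical numbers, not taken from the notes above): if the total variance of the obtained scores is σ²_T = 100 and the error variance is estimated to be σ²_e = 20, then r_tt = 1 − 20/100 = 0.80. In other words, 80% of the observed variance reflects true differences among examinees and 20% reflects error.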

Methods (or Types) of Reliability


Four Common Methods of Estimating Reliability Coefficients:
1. Test-Retest Reliability: Measures the consistency of scores over time.
2. Internal Consistency Reliability: Assesses the consistency of results within the
test itself.
3. Parallel-Forms Reliability (Alternate/Equivalent/Comparable-Forms
Reliability): Evaluates the consistency of scores between two equivalent forms of
a test.
4. Scorer Reliability: Examines the consistency of scores assigned by different
raters or scorers.
Classification of Methods:
External Consistency Procedures:
 Includes Test-Retest Reliability and Parallel-Forms Reliability.
 These methods compare results from two independent data collection
processes to verify reliability.
Purpose: Each method helps in estimating how reliable a test score is by analyzing
consistency across various dimensions (time, items, test forms, or raters).

Test-Retest Reliability
 A single test is administered twice to the same sample with a reasonable time
gap.
 The correlation between the two sets of scores provides the reliability coefficient
(temporal stability coefficient).
Purpose: Measures the consistency of scores over time, indicating whether
examinees retain their relative positions across administrations.
Appropriate Time Gap: A fortnight (approximately two weeks) is considered an
optimal time gap to balance carryover effects (if too short) and maturation effects
(if too long).
Advantages:
Effective for estimating the reliability of:
1. Speed tests.
2. Power tests.
3. Heterogeneous tests.
Disadvantages and Sources of Error:
 Time Sampling Error: Changes in scores due to factors occurring over time.
 Examinee-related Issues: Variations in health, emotional state, motivation, and
mental readiness.
 Examiner-related Issues: Differences in physical and mental states during test
administrations.
 Environmental Changes: External factors altering testing conditions.
 Maturational Effects: Particularly significant for young children over longer
intervals, causing fluctuations in scores.
Carryover Effects:
Familiarity with test items may:
1. Aid in the second administration through skill improvement or memory.
2. Inflate the reliability coefficient, contributing to true variance.
Limitations:
 Time-consuming process.
 Not suitable for tests assessing dynamic or rapidly changing characteristics.
Despite its limitations, test-retest reliability is an effective method for estimating
reliability for specific test types and provides valuable insights into temporal stability.
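A minimal computational sketch of the coefficient of stability, written in Python with purely hypothetical scores (the data and variable names are illustrative assumptions, not from the notes above):

import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same 8 examinees on two administrations
# of the same test, separated by about two weeks.
first_administration = np.array([12, 15, 9, 20, 17, 11, 14, 18])
second_administration = np.array([13, 14, 10, 19, 18, 10, 15, 17])

# The Pearson correlation between the two score sets is the coefficient
# of stability (the test-retest reliability coefficient).
stability, _ = pearsonr(first_administration, second_administration)
print(f"Coefficient of stability: {stability:.2f}")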

Internal Consistency Reliability


 Measures the homogeneity of a test, ensuring that all items assess the same trait
or function.
 A test is considered homogeneous if its internal consistency reliability is high.

Split-Half Method:
 Common method to estimate internal consistency reliability.
 The test is divided into two equal or nearly equal parts, often using the odd-even
method:
o Odd-numbered items (e.g., 1, 3, 5) form one half.
o Even-numbered items (e.g., 2, 4, 6) form the other half.
 Scores from each half are correlated using Product Moment (PM) correlation to
determine half-test reliability.
Spearman-Brown Formula:
Used to estimate the reliability of the entire test from the half-test reliability
r_tt = 2r_½½ / (1 + r_½½) = (2 × Reliability of the half test) / (1 + Reliability of the half test)

Sources of Error Variance:


Content Sampling/Item Sampling: Differences in item selection or item
characteristics can affect scores.
Advantages:
 Requires only a single administration of the test.
 Eliminates variability caused by differences in multiple administrations,
providing a quick and efficient estimate of reliability (termed "on-the-spot
reliability" by Guilford and Fruchter, 1973).
Disadvantages:
 Temporary Conditions: Fluctuations due to temporary internal (examinee-
related) or external (environmental) factors may skew results, inflating or
depressing the reliability coefficient.
 Not Suitable for Speed Tests: Overestimates reliability when applied to speed
tests.
 Variation in Split Methods: Different methods of dividing the test may yield
different reliability coefficients, affecting consistency.
Internal consistency reliability is useful for evaluating test homogeneity, but its
effectiveness depends on the test type and the method of item division.
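A minimal Python sketch of the odd-even split-half procedure with the Spearman-Brown correction, using a hypothetical item-response matrix (the data are illustrative assumptions, not from the notes above):

import numpy as np

# Hypothetical responses: rows = examinees, columns = items (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
])

# Odd-even split: items 1, 3, 5, ... form one half; items 2, 4, 6, ... the other.
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)

# Half-test reliability: product-moment correlation between the two half scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction estimates the reliability of the whole test.
r_full = 2 * r_half / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, whole-test r_tt = {r_full:.2f}")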

Rulon and Flanagan Formulas


These formulas provide estimates of the reliability of the whole test by calculating
the proportion of error variance in relation to the total variance of the test scores.
Both approaches emphasize that a lower error variance corresponds to higher
true variance and, consequently, higher reliability.
Rulon Formula
Purpose: Estimates reliability by analyzing the difference between two half-test
scores (e.g., odd-numbered and even-numbered items).
Formula:
r_tt = 1 − σ²_d / σ²_t

where
 r_tt = reliability coefficient;
 σ²_d = variance of the differences between the two half-test scores for each examinee;
 σ²_t = variance of the total score.

Total score for an examinee is the sum of his scores on the two halves of the test.

σ_d = (1/N) √( NΣd² − (Σd)² )

Where
 d represents the difference between the scores for the two halves for each
examinee,
 N is the number of examinees.

Advantages:
 Provides a direct estimate of error variance using differences between two
halves of the test.
 Convenient when the test can be split evenly.
Rulon Formula: Focuses on the variance of differences between half-test scores.
Flanagan Formula
Purpose: Estimates reliability by analyzing the variance within each half of the
test, avoiding the use of differences between scores.
Formula:

r_tt = 2 [ 1 − (σ²_1 + σ²_2) / σ²_t ]

where
 r_tt = reliability coefficient;
 σ²_1 = variance of scores on the first half;
 σ²_2 = variance of scores on the second half;
 σ²_t = variance of the total score.
Advantages:
1. Does not rely on the difference of scores between two halves, making it easier to
compute for some datasets.
2. Provides similar reliability estimates as the Rulon formula.
Flanagan Formula: Focuses on the variances of the scores for each half
independently
Comparison with the Spearman-Brown Formula
 Both the Rulon formula and the Flanagan formula have a distinct advantage
over the Spearman-Brown formula:
o They directly estimate the reliability of the whole test, without requiring
computation of the half-test reliability coefficient.
 These methods also apply to alternate forms of the test.
Both methods yield similar reliability coefficients and are useful alternatives to the
Spearman-Brown formula, particularly when analyzing the internal consistency of a
test.
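A minimal Python sketch comparing the two formulas on the same hypothetical half-test scores (illustrative data, not from the notes above); as noted above, both should yield the same coefficient:

import numpy as np

# Hypothetical half-test scores for 6 examinees.
half_1 = np.array([3, 2, 4, 1, 4, 0])   # e.g., odd-numbered items
half_2 = np.array([3, 1, 4, 1, 2, 2])   # e.g., even-numbered items
total = half_1 + half_2

# Rulon: based on the variance of the difference between the two halves.
d = half_1 - half_2
r_rulon = 1 - d.var() / total.var()

# Flanagan: based on the separate variances of the two halves.
r_flanagan = 2 * (1 - (half_1.var() + half_2.var()) / total.var())

print(f"Rulon r_tt = {r_rulon:.2f}, Flanagan r_tt = {r_flanagan:.2f}")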

Kuder-Richardson Formulas
The Kuder-Richardson Formulas (K-R20 and K-R21) were developed to address issues
with the split-half reliability method, providing alternative ways to estimate the
internal consistency of a test. These formulas, devised by Kuder and Richardson
(1937), are widely used for tests where items are scored dichotomously (e.g., right or
wrong, scored as 1 or 0).
The Kuder-Richardson formulas (K-R20 and K-R21) have specific requirements that
must be met to ensure accurate estimation of internal consistency reliability for a test.
These are:

Main Requirements for Using the K-R Formulas:


1. Homogeneity of Items:
o All items on the test should measure the same factor or construct. This
means the test should be unidimensional (homogeneous), where every
item assesses the same underlying trait.
o The items should exhibit high inter-item correlation to indicate that
they measure the same thing, ensuring internal consistency.
2. Dichotomous Scoring:
o Items should be scored in a dichotomous manner: either correct or
incorrect. This is usually represented by scoring correct answers as +1
and incorrect answers as 0.
3. Item Difficulty Consistency:
o For K-R20, the items should not vary much in their difficulty levels. If
item difficulties differ significantly, the reliability estimate may be
inaccurate.
o For K-R21, all items should have the same difficulty value. If this
condition is not met, the reliability estimate from K-R21 could be
substantially lower than that obtained from K-R20.
Kuder-Richardson Formula 20 (K-R20)
Purpose: Provides a precise estimate of test reliability based on the variance of
total test scores and the difficulty level of individual items.
Formula:

KR20 = [ n / (n − 1) ] × [ (σ²_t − Σpq) / σ²_t ]
where
 KR20 = reliability coefficient by K-R 20;
 n = number of items in the test;
 σ²_t = variance of scores on the test;
 p = proportion of correct answers to each item;
 q = proportion of incorrect answers to each item; hence q = 1 − p.

Requirements:
 Items must measure the same trait or construct (unifactorial test).
 Items must be scored dichotomously (e.g., correct/incorrect).

Strengths:
 Suitable for analyzing tests with varying item difficulties.
 Provides a highly accurate reliability coefficient when item-level difficulty data is
available.

Limitations:
Requires an item analysis worksheet, which involves considerable effort to
compute the ∑pq term for each item.
Kuder-Richardson Formula 21 (K-R21)
Purpose: Simplifies the K-R20 formula by assuming all items have equal difficulty
levels.
Formula:

KR21 = [ n / (n − 1) ] × [ (σ²_t − npq) / σ²_t ]

Where:
 npq = simplified term assuming all items have the same difficulty index.

Requirements:
 Does not require an item analysis worksheet, making it computationally easier.
 Suitable when items are of equal difficulty levels or when such an assumption
is reasonable.

Limitations:
 Produces less accurate results if item difficulty levels vary significantly
Limitations of K-R Formulas
1. Heterogeneous Tests: The formulas underestimate reliability if the test is not
unifactorial (measuring different constructs).
2. Item Difficulty Variability: K-R21 especially underestimates reliability when
items have unequal difficulty levels.
3. Speed Tests: Not appropriate for speed tests or tests that are similar to speed
tests.
Relation to Split-Half Reliability
 The K-R formulas are mathematically equivalent to the mean of all split-half
reliability coefficients obtained using the Rulon formula.
 This equivalence holds true when split-half coefficients are calculated using
Rulon and not the Spearman-Brown formula.
Coefficient Alpha (Cronbach's Alpha)
While K-R formulas are designed for dichotomous data, Cronbach's Alpha
generalizes this approach to non-dichotomous (continuous) data. Cronbach's
Alpha is considered an advanced and versatile method, capable of handling a wider
range of data types.
The K-R20 formula is preferred for more precise reliability estimates, provided
detailed item-level data is available. The K-R21 formula offers a simpler but less
accurate alternative when item difficulties are assumed to be uniform. Both
approaches emphasize the importance of test homogeneity and are instrumental in
evaluating the internal consistency of assessments.
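A minimal Python sketch of K-R 20 and K-R 21 computed from a hypothetical dichotomously scored item matrix (illustrative data and variable names, not from the notes above):

import numpy as np

# Hypothetical responses: rows = examinees, columns = items (1 = correct, 0 = incorrect).
X = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
])
n = X.shape[1]                 # number of items
var_t = X.sum(axis=1).var()    # variance of total test scores

# K-R 20: uses each item's difficulty p and q = 1 - p.
p = X.mean(axis=0)
kr20 = (n / (n - 1)) * (var_t - np.sum(p * (1 - p))) / var_t

# K-R 21: assumes every item has the same difficulty (the mean p is used).
p_bar = p.mean()
kr21 = (n / (n - 1)) * (var_t - n * p_bar * (1 - p_bar)) / var_t

print(f"K-R 20 = {kr20:.2f}, K-R 21 = {kr21:.2f}")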

Coefficient Alpha (Cronbach's Alpha)


Coefficient Alpha (also known as Cronbach's Alpha) is a generalized method for
estimating the internal consistency reliability of a test, especially for tests with
items scored on a scale rather than a binary right/wrong basis. This is commonly used
in tests such as personality inventories or surveys where items may have multiple
response options (e.g., "Never", "Rarely", "Sometimes", "Usually").
Formula
r_tt = α = [ n / (n − 1) ] × [ (σ²_t − Σσ²_i) / σ²_t ]
Where:
 n = number of items in the test
 σ²_t = variance of the total test score
 σ²_i = variance of the individual item scores for each item in the test
 Σσ²_i = sum of the variances of the individual item scores across all items
Internal Consistency: Coefficient alpha estimates the internal consistency of the
test. This tells us how well the items in the test are related to one another and
measure the same underlying construct. A higher alpha value indicates that the test
items are more consistent with each other.

Range of Alpha:
 The coefficient alpha ranges from 0 (no internal consistency) to 1 (perfect internal
consistency). An alpha closer to 1 suggests that the items in the test are highly
consistent, while an alpha closer to 0 indicates that the items may not be
measuring the same thing.
 A negative alpha suggests that the items are negatively correlated with one
another, implying that an inappropriate reliability model or test structure is being
used.
Use in Multi-Scored Items: While the Kuder-Richardson formulas (K-R20 and K-
R21) are suitable for tests with binary (0 or 1) scoring, Cronbach's alpha is
appropriate for tests with items scored on a scale (e.g., Likert scales) where the
response options are not binary.

Interpretation: In psychological and educational research, a good Cronbach's alpha is
typically above 0.60, with values closer to 0.90 being preferred for high-stakes or precise
measurements.

Average of Split-Half Reliability: Mathematically, Cronbach's alpha can be understood as
the average of all possible split-half reliability coefficients for the test. This makes it a
more general method than the split-half method itself.

Sources of Error Variance:


 Content Sampling: Variability in the items selected for the test.
 Content Heterogeneity: If the items in the test measure different constructs, this
may lower the alpha coefficient.

Practical Application (Acceptable Range): For good internal consistency in psychological
or educational tests, Cronbach's alpha should be at least 0.60, with values above 0.70 or
0.80 being preferable.
In summary, Cronbach's alpha is an essential measure in psychometrics, providing a
robust estimate of internal consistency, especially for tests with multi-option scoring.
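A minimal Python sketch of coefficient alpha for multi-point (e.g., Likert-type) items, using hypothetical responses (illustrative data, not from the notes above):

import numpy as np

# Hypothetical 1-5 Likert responses: rows = respondents, columns = items.
X = np.array([
    [4, 5, 3, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
    [4, 4, 5, 4],
])
n = X.shape[1]

item_variances = X.var(axis=0)        # variance of each item's scores
total_variance = X.sum(axis=1).var()  # variance of the total score

# Coefficient alpha = (n / (n - 1)) * (1 - sum of item variances / total variance).
alpha = (n / (n - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")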

Alternate-Forms Reliability
 Also known as parallel-forms reliability, equivalent-forms reliability, and
comparable-forms reliability.
 Requires two comparable forms of the same test, administered to the same sample
either immediately or after a time interval (typically two weeks).
Types of Alternate-Forms Reliability:
1. Immediate: Calculated when the two forms are administered on the same day.
2. Delayed: Calculated when the two forms are administered after a time gap (e.g.,
two weeks).
Error Variance:
 Immediate Reliability: Error variance comes from content sampling.
 Delayed Reliability: Error variance includes time sampling, content sampling, and
content heterogeneity.
Reliability Measure:
 The reliability is determined by the Pearson r correlation between scores from the
two forms.
 This correlation is known as the coefficient of equivalence.
Factors Affecting Reliability:
 Short time intervals: May lead to practice or memory effects, increasing true
variance and raising reliability.
 Long time intervals: May resemble test-retest reliability, introducing demerits
like memory effects or recall bias.
 Content differences: Significant changes between the forms reduce reliability by
increasing error variance.
Challenges in Test Equivalence:
 The difficulty in ensuring that the two forms are truly equivalent, as unequal
means, variances, or correlations could distort the reliability coefficient.
 Gulliksen’s Definition (1950): Parallel tests must have equal means, variances,
and inter-item correlations.
Criteria for Parallel Tests (Freeman, 1962):
 Same number of items, similar content, difficulty range, and item homogeneity.
 Distribution of difficulty levels and means/standard deviations should be similar.
 Uniform administration and scoring.
Practical Challenges:
 Meeting all criteria for equivalence is difficult, requiring considerable effort in test
design (e.g., writing items in different languages for two forms).
 Alternate-forms reliability is especially useful in speed tests but can be applied to
power tests as well.

Scorer Reliability
 Scorer reliability is essential for tests that involve subjective scoring (e.g., creativity
and projective personality tests).
 It is determined by comparing scores from multiple examiners and calculating the
correlation coefficient between them.
Source of Error: The main source of error in scorer reliability is the differences
between examiners (inter-scorer differences).
Relative Reliability Methods:
 Test-retest, internal consistency, and parallel-forms reliability use correlation
coefficients to measure reliability, known as relative reliability.
 Analysis of Variance (ANOVA) is also applied as a measure of relative reliability.
Hoyt's Assumptions for ANOVA:
 Total score on a test can be divided into four components: common components,
item-related components, examinee-related components, and error components.
 Error variance is assumed to be equal and normally distributed for each item.
 Error components are independent across items.
Formula
r_tt = 1 − Vr / Ve
where
 rtt = reliability coefficient;
 Vr = error variance;
 Ve = variance among examinees.
Relation to K-R 20: Hoyt's formula yields the same reliability coefficient as the K-R
20 formula, and has similar limitations (e.g., should not be used for tests where speed
is crucial).
Application of ANOVA: ANOVA can also be applied to alternate forms of tests and
retests to estimate reliability.

Standard Error of Measurement


 The Standard Error of Measurement (SEM) expresses the reliability of test
scores in absolute terms, unlike the relative reliability coefficient (correlation).
 SEM is preferred by many researchers as it remains unaffected by variability in the
range of scores, unlike the reliability coefficient.
Statistical Basis:
 SEM is defined as the standard deviation of the error component in test scores.
 Alternatively, SEM can be understood as the standard deviation of scores around
the examinee's true score.
Calculation Challenges:
 SEM cannot be calculated directly because true scores and error scores cannot
be measured directly.
 Hypothetically, repeating the same test on the same group infinitely would yield
true scores, and SEM could be computed as the standard deviation of those
scores.
Indirect Calculation:
 SEM is calculated indirectly using the formula

SE = σ √(1 − r_tt)


Where:
o SE = Standard Error of Measurement
o σ = Standard deviation of test scores
o rₜₜ = Reliability coefficient of the test
 If the reliability coefficient (rₜₜ) is 1.00 (perfect reliability), SEM is zero, indicating
no error in test scores (rare in practice).
 Lower SEM indicates higher reliability and more consistent test scores.
 SEM provides a direct measure of the accuracy of test scores and helps in
understanding the precision of the results obtained.
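A brief worked example (hypothetical numbers, not from the notes above): for a test with a standard deviation of σ = 15 and a reliability coefficient of r_tt = 0.91, SE = 15 × √(1 − 0.91) = 15 × 0.30 = 4.5. Assuming errors are normally distributed, an examinee's obtained score would be expected to fall within about ±4.5 points of the true score roughly two-thirds of the time.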

Factors influencing reliability of test scores


Extrinsic Factors
1. Group Variability:
 Homogeneous Groups: Lower reliability due to limited variability in scores.
 Heterogeneous Groups: Higher reliability as wider ability ranges enhance score
variability.
 Extreme cases with zero variability result in a reliability coefficient of zero.
2. Guessing by Examinees:
 Impact on Scores:
o Raises total scores artificially, inflating the reliability coefficient.
o Contributes to measurement errors, as guessing outcomes vary among
examinees.
 Example: Two individuals guessing on a test may score differently due to luck,
increasing error scores and reducing reliability.
3. Environmental Conditions:
 Uniform Testing Environment: Essential to maintain consistent reliability.
 Disruptions: Variations in light, noise, or comfort can reduce the reliability of test
scores.
4. Momentary Fluctuations in Examinees:
 Temporary factors like distractions, anxiety, or physical discomfort can influence
scores by increasing error variance.
 Examples include broken tools, sudden noises, or inability to correct mistakes.

Intrinsic Factors
1. Length of the Test:
 Longer Tests: Tend to yield higher reliability coefficients as they provide a broader
sampling of the content being measured.
 Spearman-Brown Formula:
o Used to estimate reliability for lengthened tests.
o Assumes added items have similar properties (difficulty, variance, inter-item
correlations) and do not influence examinee responses.
 Example: Doubling test length increases true variance fourfold but error variance
only doubles, improving overall reliability (a worked example follows this list).
2. Range of Total Scores:
 Wide Range: Higher variability among scores increases reliability.
 Narrow Range: Lower variability decreases reliability.
 Statistically, a higher standard deviation of total scores indicates higher reliability.
3. Homogeneity of Items:
 Homogeneous Items: High inter-item correlations and measurement of the same
trait improve reliability.
 Heterogeneous Items: Low inter-item correlations and measurement of diverse
traits reduce reliability.
4. Difficulty Value of Items:
 Items with moderate difficulty levels (e.g., difficulty index around 0.5) yield higher
reliability.
 Extremely easy or difficult items often fail to differentiate between examinees,
lowering reliability.
5. Discrimination Value:
 High Discrimination: Items that distinguish well between superior and inferior
performers contribute to higher reliability.
 Low Discrimination: Poorly discriminating items weaken item-total correlations
and reduce reliability.
6. Scorer Reliability:
 Refers to consistency among scorers in evaluating responses.
 Discrepancies between scorers lower the reliability of the test.
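Returning to the effect of test length (item 1 above), the general form of the Spearman-Brown formula for a test lengthened k times is commonly written as r_kk = k·r_tt / [1 + (k − 1)·r_tt]; this general form is a standard extension and is not quoted in the notes above. As a hypothetical worked example, if a test with reliability r_tt = 0.60 is doubled in length (k = 2), the estimated reliability becomes (2 × 0.60) / (1 + 0.60) = 0.75.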

UNIT - 4 VALIDITY
 Validity refers to the degree to which a test measures what it claims to measure.
 It is determined by correlation with independent, external criteria, not the self-
correlation of the test.
 Validity is about the accuracy, appropriateness, and usefulness of inferences made
from test scores.
 Prominent definitions emphasize agreement between test scores and the traits or
abilities they aim to measure.
Generalizability: Validity ensures that conclusions from a test can be generalized to
a broader population.
Validity Coefficient: This is the correlation between a test and ideal independent
criteria measuring the same trait.

Modern Perspective:
 Validity focuses on evidence supporting inferences from test scores, not the test
itself.
 Evidence can be content-related, criterion-related, or construct-related.

Properties of Validity:
 Relative: A test is valid only for a specific purpose or context.
 Dynamic: Validity evolves over time and requires periodic reassessment.
 Degree-Based: Validity is not absolute; it varies in degree.
 Unitary Concept: Viewed as a single construct based on multiple types of
evidence, rather than separate types of validity.
 Evaluative Judgement: Involves evaluating the appropriateness and
consequences of test interpretations and uses.

Sources of Evidence for Validity:


o Test content.
o Response processes.
o Internal structure.
o Relationships with other variables.
o Consequences of testing.
These points summarize the concept, definitions, properties, and evolving
understanding of validity in the context of testing and measurement.

Aspects of validity
Purpose of Validity: A test's validity is tied to the purpose it serves, leading to
different aspects of validity based on the specific purpose of the test.
Three Main Testing Purposes:
1. Content Representation: Measures current performance in specific content
areas (e.g., an English spelling test evaluating students' spelling abilities).
2. Functional Relationship: Predicts future outcomes or determines present
standing on a related variable (e.g., mechanical aptitude predicting job
performance).
3. Hypothetical Traits: Assesses abstract, non-observable traits like extroversion,
intelligence, or neuroticism through test performance.
Three Types of Validity:
1. Content Validity: Ensures the test represents the content it aims to measure.
2. Criterion-Related Validity: Establishes a functional relationship between the test
and an external variable (present or future).
3. Construct Validity: Measures the degree to which the test assesses a
hypothetical trait or quality.
Evolution of Validity Understanding:
 Earlier distinctions among validity types (e.g., predictive, criterion-related) have
been critiqued for creating unnecessary divisions.
 Modern perspectives emphasize evidence-based validity rather than distinct
subcategories.
 Overlap among validity categories suggests they are not entirely separate but part
of a unified concept.
Current Perspective:
 The 1999 Standards for Educational and Psychological Tests recommend
focusing on various evidence categories for validity rather than treating validity
types as distinct.
 Validity is now understood as a comprehensive evaluation of test evidence rather
than discrete classifications.

Content or Curricular Validity


 Content validity (also called intrinsic validity, curricular validity, etc.) ensures a test
measures the specific content or domain it claims to represent.
 Focuses on the relevance and representativeness of individual items and the test
as a whole.
 Key for tests assessing mastery of specific skills or knowledge (e.g., achievement
tests).
Components of Content Validity:
 Item Validity: Ensures individual test items measure the intended content.
 Sampling Validity: Ensures the test covers the entire content area
comprehensively.
Threats to Content Validity:
 Construct Under-Representation: Failure to include key components of the
domain (e.g., testing only ancient history in a history test).
 Construct-Irrelevant Variance: Inclusion of unrelated factors influencing scores
(e.g., anxiety affecting intelligence test results).
Methods to Assess Content Validity:
 Expert Judgement: Experts evaluate if test items represent the domain
comprehensively and suggest necessary additions or adjustments.
 Statistical Methods:
o Internal consistency tests to confirm items measure the same construct.
o Correlation of scores from similar tests to provide evidence of validity.
 Item-Discriminating Power: Evaluates items' ability to distinguish between high
and low performers.
Steps for Ensuring Content Validity:
 Explicitly define the content area, covering all major sections proportionally.
 Define the content area with clear objectives, factual knowledge, and applications
before item writing.
 Validate content relevance based on examinees’ responses rather than the
apparent relevance of items.
Applications:
 Most applicable to achievement or proficiency tests.
 Less relevant and potentially misleading for aptitude, intelligence, and personality
tests.
Distinction from Face Validity:
 Content Validity: Objective and technical, focusing on how well a test measures
the intended domain.
 Face Validity: Superficial appearance of validity, important for ensuring
examinees' cooperation and motivation.

Role of Face Validity:


 Enhances social acceptability and encourages examinees' participation.
 Improves wording and structure, indirectly supporting overall test validity.

Criterion-related Validity
 Criterion-related validity evaluates a test's effectiveness by correlating test scores
with an external, independent measure (criterion) of the same variable the test
aims to assess.
 The criterion serves as the "true" or reliable external measure, and validity is
estimated as the correlation coefficient between test scores and criterion scores.
Subtypes of Criterion-Related Validity:
 Predictive Validity: Assesses how well test scores can predict future
performance or outcomes related to the criterion.
 Concurrent Validity: Examines the correlation between test scores and the
criterion when both are measured simultaneously.
Widely used in assessing the effectiveness of tests for predicting future performance
or confirming current abilities.

Predictive Validity
 Predictive validity measures how well a test predicts future outcomes. Test scores
are obtained first, followed by a time gap (months or years), after which criterion
scores are measured and correlated with the test scores.
 The correlation between test scores and future criterion scores is used as the
predictive validity coefficient (often calculated using Pearson’s product-
moment correlation).
 Predictive validity is essential for tests forecasting future academic achievement,
vocational success, or long-term outcomes like therapy effectiveness.
Example:
 Academic Example: Administering an intelligence test at postgraduate admission
and later correlating the test scores with students' grades after two years. A high
correlation indicates predictive validity.
 Workplace Example: Using a mechanical aptitude test to predict job
performance, with future work performance (e.g., units produced) being the
criterion. A high correlation confirms predictive validity.

Concurrent Validity
 Concurrent validity measures how well a test correlates with a criterion that is
available at the same time. Unlike predictive validity, there is no time gap between
when the test and criterion scores are obtained.
Example: A new intelligence test may be correlated with scores from an already
established intelligence test to determine its validity. Similarly, an intelligence test
might be validated against students' current academic marks.
Methods to Determine Concurrent Validity:
 Relationship method: Administer the test and criterion to the same group, then
correlate the scores to calculate validity.
 Discrimination method: Determines if the test can differentiate between
individuals with or without a certain characteristic (e.g., using a mental adjustment
inventory to distinguish between institutionalized and non-institutionalized
individuals).
Comparison to Predictive Validity: Concurrent validity is typically higher than
predictive validity. Over time, the correlation between test scores and future
outcomes tends to weaken, leading to lower predictive validity.
Challenges: The selection of an appropriate criterion is crucial. An inadequate or
inappropriate criterion can lower the validity coefficient. For example, factors like
motivation, emotional adjustment, or health can influence academic performance or
job productivity, distorting the correlation between the test and the criterion.
Correction for Attenuation: If the reliability of the criterion is low, it is
recommended to correct the validity coefficient for attenuation (to account for
measurement errors), as the correlation may be weaker than the true relationship
between the test and the criterion.
Correction for Attenuation
 Correction for attenuation is used when both the test and the criterion are
unreliable (due to measurement errors or faulty construction), which leads to a
lower correlation coefficient between the test and the criterion. The purpose of the
correction is to adjust the validity coefficient to reflect the true relationship
between the test and criterion, accounting for measurement errors.
 Types of Correction:
o Full Correction: Corrects for attenuation in both the test and the criterion.
o One-way Correction: Corrects for attenuation in only the criterion,
assuming the test is more reliable than the criterion. Here, the validity
coefficient is increased by correcting for the criterion's attenuation.
 Interpretation:
o The full correction formula adjusts both the test and criterion to eliminate
measurement errors in both, which increases the validity coefficient.
o The one-way correction is more commonly applied, as the criterion is often
less reliable than the test. This adjustment increases the validity coefficient
as it accounts for measurement errors in the criterion.
 Upper Limit: The correction formula sets an upper limit for the correlation
between the test and criterion. If both the test and criterion have perfect reliability
(1.0), the correction for attenuation will not alter the validity coefficient, as no
errors remain to be corrected.
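For completeness, the formulas conventionally associated with these corrections (standard psychometric results, not quoted in the notes above) are: full correction, r_corrected = r_xy / √(r_xx × r_yy); one-way correction for the criterion only, r_corrected = r_xy / √(r_yy), where r_xy is the obtained validity coefficient and r_xx and r_yy are the reliabilities of the test and the criterion. As a hypothetical worked example of the one-way correction, if r_xy = 0.40 and the criterion reliability is r_yy = 0.64, the corrected validity coefficient is 0.40 / 0.80 = 0.50.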

Major Qualities Desired in a Criterion Measure


1. Relevance: A criterion must closely align with the behaviors of interest and
match the trait being predicted, ensuring content validity. For example, a
criterion for abstract reasoning should focus specifically on assessing abstract
reasoning.
2. Freedom from Bias: The criterion measure should offer equal opportunities for
scoring, regardless of personal characteristics or group affiliation, to avoid bias.
A biased criterion distorts the correlation between test scores and trait
measures.
3. Reliability: Criterion scores must be stable and reproducible over time or
situations. Unpredictable fluctuations reduce the ability to predict scores and
lower the validity of the test.
4. Availability: The criterion measure should be practical and easily accessible to
researchers, considering factors like time and cost to ensure it is not a
hindrance in the research process.
Construct Validity
 Construct validity refers to how well a test measures a theoretical construct or trait
(e.g., intelligence, anxiety). It is more complex and used when content or criterion-
related validity cannot be applied effectively.
Steps in Construct Validation:
 Specifying Possible Measures: The first step is to define the construct clearly
and identify various possible measures of that construct.
 Determining Correlations: Investigate the extent to which the specified
measures correlate with each other. High correlations suggest they measure the
same construct.
 Behavior of Measures: Examine if the measures behave in expected ways with
regard to other relevant variables. For example, intelligence measures should
correlate with academic performance.
Challenges in Construct Validation:
 Measures might form clusters of high correlations or fail to correlate strongly,
making it difficult to draw conclusions.
 The process requires ongoing research and empirical evidence, with results often
circumstantial rather than definitive.
Indicators of Construct Validity:
o Homogeneity of the test (measuring a single construct).
o Strong correlation with related measures.
o Developmental or age-related changes that align with the construct theory.
o Group differences consistent with theory.
o Intervention effects consistent with theory.
o Factor analysis results aligning with theoretical expectations.
 Overlap with Other Validity Types: Construct validity often incorporates aspects
of content and criterion validity. Many psychologists believe all forms of validity are
ultimately subsumed by construct validity.
Construct validity is the most comprehensive form of validity and should be the
primary focus in validation studies. Attempts to divide validity into categories like
content, criterion, and construct validity can be confusing, as they overlap
significantly.

Convergent validation and discriminant validation


 Convergent Validity: A test demonstrates convergent validity when it correlates
highly with other tests measuring the same construct. This indicates that the test is
measuring what it is intended to measure.
 Discriminant Validity: A test shows discriminant validity when it correlates poorly
with tests measuring different, unrelated constructs. This confirms that the test is
not measuring something it shouldn't be measuring.
 Campbell & Fiske's Multitrait-Multimethod Matrix (MTMM): This design
helps assess both convergent and discriminant validity by measuring multiple traits
with multiple methods. It compares correlations between different traits measured
by the same method (monomethod) and the same trait measured by different
methods (heteromethod).

Key Requirements for Construct Validity:


 Convergent Validity: Validity diagonals (correlations of the same trait measured
by different methods) should be large and statistically significant.
 Discriminant Validity: Validity diagonals should be higher than correlations
between different traits measured by the same method (monomethod) or different
methods (heteromethod).
 Pattern of Correlations: There should be a clear pattern in the correlations of
different traits measured by different methods, providing evidence for discriminant
validity.
If the test meets these requirements, it provides strong evidence for both convergent
and discriminant validity, ensuring the establishment of construct validity.

Statistical methods for calculating validity


There are several statistical methods that are employed in computing the validity
coefficient of a test.

Correlation Methods
Validity of a Test: Defined by its correlation with an independent external criterion.
Correlation Methods: Different methods to calculate the correlation coefficient
include:
 Pearson r: Most commonly used.
 Biserial r and Point Biserial r: Used when one variable is divided into categories
(e.g., Pass/Fail) and the other is continuous.
 Tetrachoric r and Phi coefficient: Used when both variables are divided into two
categories.
 Multiple Correlation: Used when more than two measures are involved,
represented by R, which shows the relationship between one measure and the
composite of multiple measures.
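A minimal Python sketch (using SciPy, with hypothetical data that are illustrative assumptions, not from the notes above) showing two of these coefficients:

import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Hypothetical test scores and criterion measures for the same examinees.
test_scores = np.array([55, 62, 48, 70, 66, 51, 59, 73])
criterion_scores = np.array([60, 64, 50, 75, 68, 49, 61, 77])   # continuous criterion
criterion_pass_fail = np.array([1, 1, 0, 1, 1, 0, 1, 1])        # dichotomous criterion

# Pearson r: both variables continuous.
r_pearson, _ = pearsonr(test_scores, criterion_scores)

# Point-biserial r: one dichotomous and one continuous variable.
r_pb, _ = pointbiserialr(criterion_pass_fail, test_scores)

print(f"Pearson r = {r_pearson:.2f}, point-biserial r = {r_pb:.2f}")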

Expectancy Tables
 Expectancy tables show the relationship between test scores and criterion
measures, typically expressed as probabilities or percentages.
 Structure:
o Test scores are plotted against criterion scores, forming a bivariate
distribution.
o Correlation coefficients (validity coefficients) are computed from this
distribution.
 Purpose: Used to estimate the predictive efficiency of a test by determining the
likelihood of achieving specific performance outcomes based on test scores.
 Application Examples:
o Probability of a student with an IQ of 100 securing the first rank.
o Probability of a clerical test score correlating to an "excellent" performance
rating.
o Probability of a medical entrance test score predicting success after course
completion.
 Flexibility: Expectancy tables vary depending on the specific probabilities being
determined but always predict performance based on known scores.

Cut-off Score
 A score that separates potentially superior examinees from potentially inferior
ones.
 Purpose: To maximize the selection of superior examinees and eliminate inferior
ones.
 Demerit: Screening using a cut-off score may not be perfect due to:
o Imperfect validity of the test.
o The inability of a single test to fully reveal an examinee's ability or potential.
o This can result in some inferior examinees being selected and some superior
ones being rejected.

Miscellaneous Techniques
 Additional Techniques for Validity: Some less commonly used methods exist for
validating tests.
 Contrasted Groups: Comparing test scores of distinct groups (e.g., normals vs.
neurotics, institutionalized mental defectives vs. normal schoolchildren) to
establish validity.
 Age Group Comparison: Comparing success percentages of students across
different age groups as a method of validation.

Factors influencing validity


Validity of a test is influenced by several factors. Some of the important factors are
enumerated below.
1. Length of the Test: Increasing the length of the test can improve both its
reliability and validity. A longer test tends to yield more stable results, improving
the correlation between the test and the criterion. However, the improvement in
validity is slower compared to reliability as the test length increases.
2. Range of Ability (Sample Heterogeneity): The validity of a test is influenced
by the range of abilities within the sample. A narrow ability range results in
lower validity, while a broader range of abilities increases validity as it captures
more variability in the test scores.
3. Ambiguous Directions: If the test instructions are unclear or ambiguous,
different test-takers may interpret the instructions differently, which can reduce
the accuracy and validity of the test. It can also lead to guessing, lowering the
test's validity.
4. Socio-Cultural Differences: Tests developed in one cultural context may not
be valid in another due to differences in socio-economic status, cultural norms,
gender roles, etc. Cross-cultural testing can help ensure validity across different
cultural groups, though the test must be adapted to reflect cultural differences.
5. Addition of Inappropriate Items: The inclusion of inappropriate or poorly
constructed items (e.g., vague questions or items that differ greatly in difficulty
from the original set) can lower both the reliability and validity of the test. Items
should be consistent in difficulty and relevance to the test’s purpose to maintain
validity.
