Module 2 - Measurement Fundamentals
Psychological Assessment
Tools of Psychological Assessment
1. Psychological test
2. Interview
3. Portfolio
4. Case History Data
5. Behavioral Observation
6. Role-Play Tests
7. Computers as Tools
8. Other Tools
A Taxonomy of Psychological Assessment
Psychological Test
According to Cronbach (1960) – “A psychological test is a systematic
procedure for comparing the behavior of two or more people”.
According to Anstey (1966) – “Psychological tests can be defined as devices
and techniques for the quantitative assessment of psychological attribute of an
individual”.
According to Anastasi (1969) – “A psychological test is an objective and
standardized measure of a sample of behavior”.
Criterion-referenced tests (CRTs) may be standardized or teacher-made and enable a different kind of comparison
than norm-referenced tests (NRTs). Compared to NRTs, CRTs are typically shorter in length and narrower in focus.
Instead of comparing current student performance to that of other students, CRTs enable comparisons to an
absolute standard or criterion. CRTs help us to determine what a student can or cannot do. Rather than stating
that “Marie is above average,” CRTs enable us to make judgments about a student’s (or group of students’) level
of proficiency or mastery over a skill or set of skills (e.g., “Marie is able to spell the words in the third
grade spelling list with greater than 80% accuracy”). For this reason, and because they are shorter than NRTs,
scores from a CRT are more likely to be useful for instructional decision making than scores from an NRT would be.
Definition of Measurement
• Measurement is the assignment of numerals to objects or events according to rules
(Stevens, 1946).
• Measurement is a process that involves three components – an object of measurement, a
set of numbers, and a system of rules – that serve to assign numbers to attributes or
magnitudes of the variable being measured.
• The rules are the specific procedures used to transform qualities of attributes into numbers
(Camilli, Cizek, & Lugg, 2001; Nunnally & Bernstein, 1994; Yanai, 2003).
• An educational or psychological test is a measuring device, and as such it involves rules
(e.g., specific items, administration, and scoring instructions) for assigning numbers to an
individual’s performance that are interpreted as reflecting characteristics of the individual.
Properties of Numbers
• Property of Identity
• Property of Order
• Property of Quantity
• The Number 0
Four Levels of Measurement
1. Nominal Scale: a scale in which the numbers or letters assigned to an object serve only
as labels for identification or classification, e.g. Gender (Male = 1, Female = 2).
2. Ordinal Scale: a scale that arranges objects or alternatives according to their magnitude
in an ordered relationship, e.g. Academic status (Freshman = 1, Sophomore = 2, Junior
= 3, etc.).
3. Interval Scale: a scale that arranges objects according to their magnitude and
distinguishes this ordered arrangement in units of equal intervals, but does not have a
natural zero representing absence of the given attribute, e.g. the Celsius temperature scale
(40°C is not twice as hot as 20°C).
4. Ratio Scale: a scale that has absolute rather than relative quantities and an absolute
(natural) zero where there is an absence of a given attribute, e.g. income, age.
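The interval-versus-ratio distinction can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the helper name `celsius_to_kelvin` is illustrative. It shows that the ratio of two Celsius readings changes when the same temperatures are expressed in Kelvin (a ratio scale with an absolute zero), while the difference between them does not.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


low_c, high_c = 20.0, 40.0

# The ratio of the Celsius numbers is 2, but it depends on the arbitrary zero point.
ratio_celsius = high_c / low_c

# On the Kelvin scale the same two temperatures have a ratio close to 1.07.
ratio_kelvin = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

# Differences are meaningful on both scales: the interval is 20 units either way.
diff_celsius = high_c - low_c
diff_kelvin = celsius_to_kelvin(high_c) - celsius_to_kelvin(low_c)

print(ratio_celsius, round(ratio_kelvin, 3), diff_celsius, round(diff_kelvin, 2))
```

Equal differences surviving the change of zero point, while ratios do not, is what separates an interval scale from a ratio scale.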
Association Between Property of Number and Level of Measurement
                     Level of Measurement
Property of Number   Nominal   Ordinal      Interval      Ratio
Identity             Yes       Yes          Yes           Yes
Order                No        Yes          Yes           Yes
Quantity             No        No           Yes           Yes
Absolute Zero        No        No           No            Yes
Example              Sex       Class Rank   Temperature   Distance
Why Level of Measurement Matters?
Types of Correlations
No.   Correlation                            Level of Measurement
1.    Phi, contingency                       Both variables nominal
2.    Spearman rank order, Kendall’s tau     Both variables ordinal
3.    Pearson product moment                 Both variables interval
4.    Pearson point biserial                 One variable interval, one variable naturally dichotomous/binary
5.    Pearson biserial                       One variable interval, one variable artificially dichotomous/binary
6.    Polychoric                             Both variables ordinal with underlying continuities
7.    Tetrachoric                            Both variables artificially dichotomous
Types of Graphs
Type                          Level of Measurement
Histogram/Frequency Polygon   Ordinal, interval, or ratio level data. Most often used with ratio
                              or interval level data
[Figure: ABA Design. Low self-esteem (y-axis, 0 to 70) plotted across observations 1 to 14 (x-axis), in three phases: Baseline, Intervention, Baseline.]
Definition of Variable
Types of Variables Based on How They are Measured
Exercise
1. The number of participants in the current batch.
2. A person’s weight is measured on which kind of scale?
3. Salaries of college professors.
4. What is the level of measurement of an IQ score?
5. The variables gender and ethnicity are measured on which scale?
6. Where do you live?
7. What is your political preference?
8. What is your hair color?
9. What is your social class?
10. Number of students present in the class
11. Time it takes to get to school
12. Number of red marbles in a jar
13. Distance traveled between classes
14. Students’ grade level
15. Height of students in class
16. Weight of students in class
17. Number of heads when flipping three coins
18. Temperature measured on the Kelvin scale
19. How satisfied are you with the training program?
Very Unsatisfied – 1
Unsatisfied – 2
Neutral – 3
Satisfied – 4
Very Satisfied – 5
20. How would you rate our app?
Excellent
Very Good
Good
Bad
Poor
21. In medical practice, burns are commonly described as
First-Degree
Second-Degree
Third-Degree
Main Differences Between Discrete Versus Continuous Variables
Basis of Comparison Discrete Variable Continuous Variable
Assignment
2. Explain all kinds of variables based on the role they play in a study and
present them pictorially.
Errors in Measurement
1. Random Error
2. Systematic Error
What Is Random Error?
Any factors that randomly affect measurement of the variable
across the sample. For instance, each person’s mood can
inflate or deflate performance on any occasion. Random error
adds variability to the data but does not affect average
performance for the group.
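The claim that random error adds variability without shifting the group average can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the true-score distribution (mean 50, SD 10), the error SD of 5, the sample size, and the seed are all hypothetical values chosen for illustration, not taken from the slides.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical true scores for 10,000 people.
true_scores = [random.gauss(50, 10) for _ in range(10_000)]

# Random error: mean zero, independent of the true score (e.g., mood on the day).
observed = [t + random.gauss(0, 5) for t in true_scores]

mean_true = statistics.mean(true_scores)
mean_observed = statistics.mean(observed)
var_true = statistics.variance(true_scores)
var_observed = statistics.variance(observed)

# The two means are nearly identical, but the observed variance is clearly larger.
print(round(mean_true, 2), round(mean_observed, 2))
print(round(var_true, 2), round(var_observed, 2))
```

A systematic error would behave differently: adding the same constant to every score would shift the mean and leave the variance unchanged.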
[Figure: Random Error. Frequency distribution of X with random error.]
Systematic Error
[Figure: Systematic Error. Frequency distributions of X with systematic error and with no systematic error.]
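Variance: $\sigma_X^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
Variance: $\sigma_X^2 = \frac{\sum (X - \bar{X})(X - \bar{X})}{N - 1}$
Covariance: $\sigma_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$
Correlation: $\rho_{XY} = \frac{\text{Covariance}}{\sigma_X \sigma_Y} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$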
Computing Variance
Student   X       $X - \bar{X}$   $(X - \bar{X})^2$
1         4       -2.6            6.76
2         5       -1.6            2.56
3         3       -3.6            12.96
4         8       1.4             1.96
5         8       1.4             1.96
6         6       -0.6            0.36
7         6       -0.6            0.36
8         7       0.4             0.16
9         9       2.4             5.76
10        10      3.4             11.56
Sum       66.00   0.00            44.40
Mean      6.60    0.00            N = 10
Variance = (Sum of Squared Deviations) / (N - 1) = 44.40 / 9 = 4.93
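The Computing Variance table can be reproduced in a few lines of Python. This is a minimal sketch; the variable names are illustrative, and the standard-library `statistics.variance` (which also divides by N - 1) is used only as a cross-check.

```python
import statistics

scores = [4, 5, 3, 8, 8, 6, 6, 7, 9, 10]  # X column of the table

n = len(scores)
mean = sum(scores) / n                      # 66 / 10
deviations = [x - mean for x in scores]     # X minus the mean
sum_of_squares = sum(d ** 2 for d in deviations)
variance = sum_of_squares / (n - 1)         # sample variance, N - 1 in the denominator

print(round(mean, 2), round(sum_of_squares, 2), round(variance, 2))
```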
Computing Covariance
Student   X       $X - \bar{X}$   Y        $Y - \bar{Y}$   $(X - \bar{X})(Y - \bar{Y})$
1         4       -2.6            20       2.6             -6.76
2         5       -1.6            19       1.6             -2.56
3         3       -3.6            21       3.6             -12.96
4         8       1.4             16       -1.4            -1.96
5         8       1.4             16       -1.4            -1.96
6         6       -0.6            18       0.6             -0.36
7         6       -0.6            18       0.6             -0.36
8         7       0.4             17       -0.4            -0.16
9         9       2.4             15       -2.4            -5.76
10        10      3.4             14       -3.4            -11.56
Sum       66.00   0.00            174.00   0.00            -44.40
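The covariance for this table follows from the sum of cross-products divided by N - 1. The sketch below, with illustrative variable names, carries the table through to the covariance and then to the correlation; because Y = 24 - X in these data, the correlation comes out as a perfect negative one.

```python
import math

x = [4, 5, 3, 8, 8, 6, 6, 7, 9, 10]
y = [20, 19, 21, 16, 16, 18, 18, 17, 15, 14]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum of cross-products of deviations (last column of the table).
cross_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
covariance = cross_products / (n - 1)

# Standard deviations, also with N - 1 in the denominator.
sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
correlation = covariance / (sd_x * sd_y)

print(round(cross_products, 2), round(covariance, 2), round(correlation, 2))
```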
Student       X      Y      X (rescaled)   Y (rescaled)
1             7      6      2.76           0.60
2             5      5      1.97           0.50
3             8      7      3.15           0.70
4             8      7      3.15           0.70
5             7      8      2.76           0.80
6             6      5      2.36           0.50
7             5      6      1.97           0.60
8             3      5      1.18           0.50
9             5      6      1.97           0.60
10            6      5      2.36           0.50
Mean          6      6      2.36           0.60
Covariance    0.91          0.04
Correlation   0.67          0.67
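$COV_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{N - 1}$
$COV_{XX} = \frac{\sum (X - \bar{X})(X - \bar{X})}{N - 1} = \frac{\sum XX - \frac{\sum X \sum X}{N}}{N - 1}$
$Var_X = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$
$Var_Y = \frac{\sum (Y - \bar{Y})^2}{N - 1} = \frac{\sum Y^2 - \frac{(\sum Y)^2}{N}}{N - 1}$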
Often multiple items are combined in order to create a composite score.
The variance of the composite is a combination of the variances and
covariances of the items creating it.
The General Variance Sum Law states that if X and Y are random variables:
$\sigma_{X \pm Y}^2 = \sigma_X^2 + \sigma_Y^2 \pm 2\sigma_{XY}$
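Variance Sum Law
$\sigma_{X \pm Y}^2 = \frac{\sum [(X \pm Y) - (\bar{X} \pm \bar{Y})]^2}{N - 1}$
$\sigma_{X \pm Y}^2 = \frac{\sum [(X - \bar{X}) \pm (Y - \bar{Y})]^2}{N - 1}$
$\sigma_{X \pm Y}^2 = \frac{\sum (X - \bar{X})^2}{N - 1} + \frac{\sum (Y - \bar{Y})^2}{N - 1} \pm \frac{2\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$
$\sigma_{X \pm Y}^2 = \sigma_X^2 + \sigma_Y^2 \pm 2\sigma_{XY}$
$\sigma_{X \pm Y}^2 = \sigma_X^2 + \sigma_Y^2 \pm 2\rho_{XY}\sigma_X\sigma_Y$
where Correlation $(\rho_{XY}) = \frac{\text{Covariance } (\sigma_{XY})}{\sigma_X \sigma_Y}$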
$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}$
Individual   X (Sleep)   Y (Awaken)   (X+Y)
1            4           20           24
2            5           19           24
3            3           21           24
4            8           16           24
5            8           16           24
6            6           18           24
7            6           18           24
8            7           17           24
9            9           15           24
10           10          14           24
Covariance   -4.93
Variance     4.93        4.93         0
Variance (X+Y) = 0
$\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}$
Individual   X (Sleep)   Y (Awaken)   (X-Y)
1            4           20           -16
2            5           19           -14
3            3           21           -18
4            8           16           -8
5            8           16           -8
6            6           18           -12
7            6           18           -12
8            7           17           -10
9            9           15           -6
10           10          14           -4
Covariance   -4.93
Variance     4.93        4.93         19.73
Variance (X-Y) = 19.73
$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 + 2\sigma_{XY}$
Individual   X (Sleep)   Y (Study)   (X+Y)
1            5           4           9
2            6           5           11
3            7           4           11
4            8           4           12
5            6           5           11
6            7           5           12
7            6           5           11
8            7           4           11
9            8           5           13
10           6           4           10
Covariance   0.00
Variance     0.93        0.28        1.21
Variance (X+Y) = 1.21
$\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}$
Individual   X (Sleep)   Y (Study)   (X-Y)
1            5           4           1
2            6           5           1
3            7           4           3
4            8           4           4
5            6           5           1
6            7           5           2
7            6           5           1
8            7           4           3
9            8           5           3
10           6           4           2
Covariance   0.00
Variance     0.93        0.28        1.21
Variance (X-Y) = 1.21
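The variance sum law can be verified against the Sleep/Awaken data by computing the variance of the composite directly and comparing it with the value predicted from the two variances and the covariance. This is a minimal sketch; the function names `covariance` and `variance_sum_law` are illustrative helpers, not part of the slides.

```python
import statistics


def covariance(x, y):
    """Sample covariance with N - 1 in the denominator."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def variance_sum_law(x, y):
    """Return the predicted variances of (X + Y) and (X - Y)."""
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    cov_xy = covariance(x, y)
    return var_x + var_y + 2 * cov_xy, var_x + var_y - 2 * cov_xy


sleep = [4, 5, 3, 8, 8, 6, 6, 7, 9, 10]
awake = [20, 19, 21, 16, 16, 18, 18, 17, 15, 14]

predicted_sum, predicted_diff = variance_sum_law(sleep, awake)

# Direct computation on the composite scores themselves.
direct_sum = statistics.variance([a + b for a, b in zip(sleep, awake)])
direct_diff = statistics.variance([a - b for a, b in zip(sleep, awake)])

print(round(predicted_sum, 2), round(direct_sum, 2))
print(round(predicted_diff, 2), round(direct_diff, 2))
```

Swapping in the Sleep/Study data gives a covariance of zero, so the composite variance is simply the sum of the two variances for both X + Y and X - Y.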
Theories of Measurement
Lord and Novick (1968) say “The classical test theory model is
based on a particular, mathematically convenient and conceptually
useful, definition of true score and on certain basic assumptions
concerning the relationships among true and error scores”.
Classical Test Theory
$X = T + E$
Observed Score = True Score + Random Error
Classical test theory also assumes that (a) the distribution of observed scores
that a person may have under repeated independent testing is normal and (b)
the standard deviation of the normal distribution, referred to as standard error
of measurement (SEM), is the same for all persons taking the test.
Classical test theory, in which the observed score results from the summation of a true score, T,
plus error, E, starts with common assumptions about items and their relationships to the latent
variable and sources of error:
The observed score (X) of a person is the sum of the true part (T) and the error part (E).
$X = T + E$
Assumptions
1. The amount of error associated with individual items varies randomly.
The error associated with individual items has a mean zero when
aggregated across a large number of people. Thus, items’ means tend to
be unaffected by error when a large number of respondents complete the
items.
$\bar{E} = 0$ ...(1)
Assumptions…
2. Measurement errors between two items of the same scale are
uncorrelated. That is one item’s error term is not correlated with
another item’s error term (i.e. assumption of local independence); the
only routes linking items always pass through the latent variable,
never through any error term.
$\rho_{E_1 E_2} = 0$ ...(2)
Assumptions…
3. Error terms are not correlated with the true score of the latent variable. For example, with a high value of T, the E is
not systematically lower or higher. So, the assumption is:
$\rho_{TE} = 0$ ...(3)
Important Deduction from the CTT
1. The observed score variance in a sample or population equals the true
score variance plus the error variance. Because the correlation between T
and E is zero (assumption 3), no covariance term between T and E has to be
added.
$\sigma_X^2 = \sigma_T^2 + \sigma_E^2$ ...(4)
$\rho_{XE} = \frac{Cov_{XE}}{\sigma_X \sigma_E} = \frac{Var_E}{\sigma_X \sigma_E} = \frac{\sigma_E \sigma_E}{\sigma_X \sigma_E} = \frac{\sigma_E}{\sigma_X}$ ...(7)
Theoretical Definition of Reliability
1. Reliability can be defined as the ratio of the true score to the observed
score.
$R = \frac{T}{X}$ ...(1)
2. Reliability is generally defined as the ratio of the true score variance to the
observed score variance.
Reliability $= \frac{\sigma_T^2}{\sigma_X^2}$ ...(2)
3. Reliability is the squared correlation between true score and observed
score.
Reliability $= \rho_{XT}^2$ ...(3)
Reliability $(\rho_{XX'}) = \frac{\sigma_T^2}{\sigma_X^2} = \rho_{XT}^2$ ...(4)
Theoretical Definition of Reliability…
You may be wondering how we can compute a reliability coefficient if we don’t know the
true scores of all the test takers. Fortunately, the answer is simple. There is another
definition of reliability/precision that is mathematically equivalent to the formula that uses
true score variance and observed score variance to calculate reliability. That definition is as
follows: Reliability/precision is equal to the correlation between the observed scores on two
parallel tests (Crocker & Algina, 1986).
Reliability $= \rho_{XX'}$
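The equivalence of the three definitions (variance ratio, squared correlation with the true score, and correlation between parallel tests) can be illustrated by simulation, where the true scores are known by construction. This is a sketch under stated assumptions: the true-score SD of 10 and error SD of 5 are hypothetical values, chosen so the theoretical reliability is 100 / (100 + 25) = 0.80, and `pearson` is an illustrative helper.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable


def pearson(x, y):
    """Pearson product-moment correlation."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


n = 20_000
true_scores = [random.gauss(100, 10) for _ in range(n)]

# Two parallel forms: same true score, independent errors with equal variance.
form_a = [t + random.gauss(0, 5) for t in true_scores]
form_b = [t + random.gauss(0, 5) for t in true_scores]

variance_ratio = statistics.variance(true_scores) / statistics.variance(form_a)
squared_r_xt = pearson(form_a, true_scores) ** 2
parallel_r = pearson(form_a, form_b)

# All three estimates should land close to the theoretical value of 0.80.
print(round(variance_ratio, 3), round(squared_r_xt, 3), round(parallel_r, 3))
```

Only the third quantity can be computed in practice, because it needs observed scores alone.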
Theoretical Definition of Reliability…
The theoretical reliability coefficient is not practical; we do not know each
person’s true score. So, we cannot compute reliability. If we can’t compute
reliability, perhaps the best we can do is to estimate it.
Test-Retest Reliability
Alternate/Equivalent/Parallel Forms Reliability
Internal Consistency Reliability
Inter-Rater /Inter-Scorer /Scorer Reliability
Each traditional estimation method – test-retest, parallel (equivalent) forms,
and internal consistency – defines reliability somewhat differently; none is
isomorphic with the theoretical definition.
Test-Retest Reliability
The test-retest method for estimating reliability involves
administering the same test to the same group of individuals on
two different occasions and then correlating the two sets of
scores. When using this method, the reliability coefficient
indicates the degree of stability (consistency) of examinees'
scores over time and is also known as the coefficient of stability.
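As a minimal sketch of the test-retest procedure, the code below correlates two administrations of the same test. The six pairs of scores are hypothetical and invented purely for illustration, and `pearson` is an illustrative helper rather than a prescribed method.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for the same six examinees on two occasions.
time_1 = [12, 15, 9, 20, 17, 11]
time_2 = [13, 14, 10, 19, 18, 12]

coefficient_of_stability = pearson(time_1, time_2)
print(round(coefficient_of_stability, 2))
```

A value near 1 indicates that examinees keep roughly the same relative standing across the two occasions.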
Now, to see how repeatable or consistent an observation is,
we can measure it twice.
If we can't compute reliability, perhaps the best we can do is to
estimate it. Maybe we can get an estimate of the variability of the
true scores. How do we do that? Remember our two observations,
X1 and X2? We assume that these two observations would be related
to each other to the degree that they share true scores. So, let’s
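calculate the correlation between X1 and X2. Here’s a simple
formula for the correlation:
Correlation $= \frac{\text{Covariance}(X_1, X_2)}{SD_{X_1} \times SD_{X_2}}$
Because the errors on the two occasions are uncorrelated with each other
and with the true scores, Covariance(X1, X2) = Variance(T); and if the two
observations have the same standard deviation, SD_X1 × SD_X2 = Variance(X).
Therefore:
Correlation $= \frac{\text{Variance}(T)}{\text{Variance}(X)}$
Correlation = Reliability
$\frac{\text{Variance}(T)}{\text{Variance}(X)}$ = Reliability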
Alternate/Equivalent/Parallel Forms Reliability
The theoretical reliability coefficient is not practical; we do not know each person’s
true score. Nevertheless, we can estimate the theoretical coefficient with the sample
correlation between scores on two parallel tests. Assume that X and X′ are two
strictly parallel tests (for simplicity) – that is, tests with equal means, variances,
covariances with each other, and equal covariances with any outside measure. The
Pearson product-moment correlation between parallel tests produces an estimate of
the theoretical reliability coefficient.
Kuder-Richardson 20 (KR 20):
a special case of alpha
applies only to dichotomous items
$KR_{20} = \frac{k}{k - 1} \times \frac{s_{Total}^2 - \sum pq}{s_{Total}^2}$
Where k is the number of items and pq is the variance of each dichotomous item:
the proportion of individuals who pass (p) multiplied by the proportion who fail (q) each item.
Calculate Kuder-Richardson 20 reliability coefficients on the following scores on an achievement test, where 1
indicates a right answer and 0 a wrong answer.
Item \ Examinee   A   B   C   D   E   p     q     pq
1                 1   1   0   1   1   0.8   0.2   0.16
2                 1   0   0   0   0   0.2   0.8   0.16
3                 1   1   1   1   1   1     0     0.00
4                 1   1   1   0   0   0.6   0.4   0.24
5                 1   0   1   1   0   0.6   0.4   0.24
$S_{Total}^2 = 1.2$, $\sum pq = 0.8$
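The exercise can be worked through in Python. This is a minimal sketch, assuming (to match the slide's value of 1.2) that the total-score variance uses N - 1 in the denominator while each pq is computed from simple proportions; variable names are illustrative.

```python
import statistics

# Rows are items 1 to 5; columns are examinees A to E (1 = right, 0 = wrong).
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
]

k = len(responses)          # number of items
n = len(responses[0])       # number of examinees

# p = proportion passing each item, q = 1 - p, pq = item variance.
p_values = [sum(item) / n for item in responses]
sum_pq = sum(p * (1 - p) for p in p_values)

# Total score per examinee (column sums), then its sample variance.
totals = [sum(column) for column in zip(*responses)]
total_variance = statistics.variance(totals)

kr20 = (k / (k - 1)) * (total_variance - sum_pq) / total_variance
print(totals, round(sum_pq, 2), round(total_variance, 2), round(kr20, 3))
```

With these inputs the coefficient is (5/4) × (1.2 - 0.8) / 1.2, or about 0.42.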
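Variance of a dichotomous item (p = Item Facility, q = Item Difficulty):
$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$
$\sigma^2 = \frac{\sum (X^2 - 2X\bar{X} + \bar{X}^2)}{N}$
$\sigma^2 = \frac{\sum X^2}{N} - \frac{2\bar{X}\sum X}{N} + \frac{N\bar{X}^2}{N}$
Because X = 0 or 1, $X^2 = X$, so $\frac{\sum X^2}{N} = \frac{\sum X}{N} = \bar{X} = p$:
$\sigma^2 = p - 2p^2 + \frac{N\bar{X}^2}{N}$
$\sigma^2 = p - 2p^2 + p^2$
$\sigma^2 = p(1 - p)$, so $\sigma^2 = pq$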
$\alpha = \frac{k}{k - 1} \times \frac{\sum s_{ij}}{s_{Total}^2}$
$s_{Total}^2$ is the composite variance (if items were summed)
$s_{ij}$ is the covariance between the ith and jth items, where i is not equal to j
k is the number of items
         Respondents
Items    Ria   Zia   Pia   Variance
1        6     5     4     1.00
2        6     4     5     1.00
3        5     3     3     1.33
4        4     4     4     .00
5        4     5     4     .34
Total    25    21    20
Variance of Total Scores = 7.0
Total of item variances = 3.67
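The quantities in the respondents table are enough to compute coefficient alpha in its item-variance form, $\alpha = \frac{k}{k-1} \times \frac{s_{Total}^2 - \sum s_i^2}{s_{Total}^2}$. The sketch below recomputes the item variances and total-score variance from the raw scores (with N - 1 in the denominator, which reproduces the tabled values) and then applies that formula; variable names are illustrative.

```python
import statistics

# Rows are items 1 to 5; columns are respondents Ria, Zia, Pia.
item_scores = [
    [6, 5, 4],
    [6, 4, 5],
    [5, 3, 3],
    [4, 4, 4],
    [4, 5, 4],
]

k = len(item_scores)

# Variance of each item across respondents.
item_variances = [statistics.variance(item) for item in item_scores]
sum_item_variances = sum(item_variances)

# Total score per respondent, and the variance of those totals.
totals = [sum(column) for column in zip(*item_scores)]
total_variance = statistics.variance(totals)

alpha = (k / (k - 1)) * (total_variance - sum_item_variances) / total_variance
print(totals, round(sum_item_variances, 2), total_variance, round(alpha, 3))
```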
A 3-item (X, Y, Z) psychological test is administered to 60 participants in Basic Psychometrics
and the following variance-covariance matrix is obtained:
      X       Y       Z
X     55.83   29.52   30.33
Y     29.52   17.49   16.15
Z     30.33   16.15   29.06
The sum of all the item variances is 102.38
The sum of all the item covariances is 152.03
The variance of the Total Test Scores is 254.41
$S_{Total}^2 = 55.83 + 17.49 + 29.06 + 2(29.52 + 30.33 + 16.15)$
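$\alpha = \frac{k}{k - 1} \times \frac{s_{Total}^2 - \sum s_i^2}{s_{Total}^2} = \frac{3}{3 - 1} \times \frac{254.41 - 102.38}{254.41} = .8964$
$\alpha = \frac{k}{k - 1} \times \frac{\sum s_{ij}}{s_{Total}^2} = \frac{3}{3 - 1} \times \frac{152.03}{254.41} = .8964$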
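Both forms of the alpha formula can be computed directly from a variance-covariance matrix. This is a minimal sketch using the rounded matrix entries printed above; summing those rounded entries gives a covariance total of 152.00 and a total variance of 254.38 rather than the slide's 152.03 and 254.41 (presumably a rounding difference), so the result agrees with .8964 only to about three decimals.

```python
# Variance-covariance matrix for items X, Y, Z (variances on the diagonal).
cov_matrix = [
    [55.83, 29.52, 30.33],
    [29.52, 17.49, 16.15],
    [30.33, 16.15, 29.06],
]

k = len(cov_matrix)

# Sum of the diagonal: the item variances.
sum_variances = sum(cov_matrix[i][i] for i in range(k))

# Variance of the summed score is the sum of every element of the matrix.
total_variance = sum(sum(row) for row in cov_matrix)

# Off-diagonal elements: each covariance counted twice (i,j and j,i).
sum_covariances = total_variance - sum_variances

alpha_from_variances = (k / (k - 1)) * (total_variance - sum_variances) / total_variance
alpha_from_covariances = (k / (k - 1)) * sum_covariances / total_variance

print(round(sum_variances, 2), round(sum_covariances, 2), round(total_variance, 2))
print(round(alpha_from_variances, 4), round(alpha_from_covariances, 4))
```

The two forms are algebraically identical, since the total variance minus the item variances is exactly the sum of the off-diagonal covariances.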
Models of Measurement
Just as routine tests to check for any violations of normality should be carried out, so equally should tests for
assumptions in reliability estimation be applied. Despite the seemingly obscure labels given to the models, all
are connected by four underlying and easily-described properties of a scale (e.g., see Graham, 2006). These
properties are as follows:
i) the extent to which each item measures the same underlying trait (unidimensionality);
ii) whether the true scores for different items have the same mean (sensitivity);
iii) whether the true scores for different items have the same variance; and
iv) whether the error variance is the same for each item.
Thus the degree to which one assumes either constancy or variability of properties ii) to iv) is what distinguishes
the essentially tau-equivalent from parallel or congeneric models.
Models of Measurement…
In CTT, measures of the same thing (e.g., items, subtests, or tests) can be classified by their levels of similarity.
In this section, I define four levels of similarity: parallel, tau-equivalent, essentially tau-equivalent, and
congeneric. Note that these levels are hierarchical in the sense that the highest level (parallel) requires the most
similarity, whereas levels lower in the hierarchy allow for less similarity in test properties. For example, parallel
measures must have equal true score variances, whereas congeneric measures do not require this. One useful
way of thinking about these levels is in terms of the relationships between the true scores of pairs of measures
(Komaroff, 1997). In CTT, the basic relationship between the true scores on two measures (ti and tj) is
$t_i = a_{ij} + b_{ij} t_j$
Congeneric Model
$X_k = \lambda_k \eta + a_k + E_k$ ...(A)
X is an observed score linearly related to a single latent trait η,
λ is the slope representing units/scales of measurement,
a is the intercept representing origin of scale, E is the residual (error term), and subscript k is the item
in question.
The least restrictive measurement model, referred to as the congeneric model, assumes
that a group of observed items
1. measure the same latent trait
2. can measure the latent trait on different scales/units (the slopes λk can be
different)
3. can measure the latent trait with different degrees of precision (dissimilar scale
origins: the intercepts ak can be different)
4. can measure the latent trait with different amounts of error (Var(Ek) can be
different).
Factor Analysis (typically) uses the Congeneric Measurement Model (Raykov, 1997a).
Essentially Tau-equivalent Model
$X_k = \eta + a_k + E_k$ ...(B)
Essentially Tau-Equivalent Model
A more restricted case of congeneric measures, referred to as essentially tau-equivalent
measures, occurs when only the first condition is in place:
1. λ1 = λ2 = λ3 (all slopes are equal).
As variables that differ by a constant have equal variances, one can say that
essentially tau-equivalent measures have equal true score variances but may have
unequal error variances.
Tau-equivalent Model
$X_k = \eta + E_k$ ...(C)
Parallel Model
$X_k = \eta + E$ ...(D)
Parallel Model
The most restricted case of congeneric measures, referred to as parallel measures, occurs
when the following three conditions are in place:
1. λ1 = λ2 = λ3 (all slopes are equal);
2. a1 = a2 = a3 (all intercepts are equal);
3. Var(E1) = Var(E2) = Var(E3) (all error variances are equal).
Thus, parallel measures have the same units of measurement, scale origins, and error
variances.
Setting a Confidence Interval
$Var_X = Var_T + Var_E$
$Var_E = Var_X - Var_T$
$\sqrt{Var_E} = \sqrt{Var_X - Var_T}$
$SD_E = \sqrt{Var_X \left(1 - \frac{Var_T}{Var_X}\right)}$
$SD_E = SD_X \sqrt{1 - R}$
Setting a Confidence Interval…
FIVE Steps:
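1. Estimate True Score (T) = (R)(X)
Where R = Reliability, X = Observed score (expressed as a deviation from the group mean;
in raw-score units the estimate is M + R(X − M), where M is the group mean)
2. Calculate Standard Error of Measurement (SE)
$SE = SD_X \sqrt{1 - R}$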
Setting a Confidence Interval…
• Then, if I want to be 95% confident that the score will fall
within a certain range, the z value associated with the 95%
confidence interval (1.96) will be multiplied with the value of
the SE (1.96 × 2.68 = 5.25). Then this value is added to and
subtracted from the estimated true score.
• This leads to the following two equations:
• 114 + 5.25 = 119.25 and
• 114 – 5.25 = 108.75
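The interval arithmetic above can be sketched in Python. The estimated true score (114) and SE (2.68) are taken from the example; the function names and the second call's inputs (an SD of 15 and a reliability of 0.91) are hypothetical, included only to show how an SE would be obtained from $SE = SD_X \sqrt{1 - R}$.

```python
import math


def standard_error_of_measurement(sd_x, reliability):
    """SE = SD_X * sqrt(1 - R)."""
    return sd_x * math.sqrt(1.0 - reliability)


def confidence_interval(estimated_true_score, se, z=1.96):
    """Return (lower, upper) bounds; z = 1.96 corresponds to 95% confidence."""
    margin = z * se
    return estimated_true_score - margin, estimated_true_score + margin


# Values from the worked example in the text.
lower, upper = confidence_interval(114, 2.68)
print(round(lower, 2), round(upper, 2))

# Hypothetical inputs, only to illustrate the SE formula.
hypothetical_se = standard_error_of_measurement(sd_x=15, reliability=0.91)
print(round(hypothetical_se, 2))
```

A different confidence level only changes the z value passed in, for example 2.58 for 99%.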
THANK YOU