DR RAJIV SAKSENA
Analyst Cum programmer
Department of Statistics
University of Lucknow, Lucknow
Disclaimer: The e-content is exclusively meant for academic purposes and for enhancing teaching and
learning. Any other use for economic/commercial purpose is strictly prohibited. The users of the content
shall not distribute, disseminate or share it with anyone else and its use is restricted to advancement of
individual knowledge. The information provided in this e-content is authentic and best as per my
knowledge.
Statistical Methods For Psychology and Education
BY
DR RAJIV SAKSENA
DEPARTMENT OF STATISTICS
UNIVERSITY OF LUCKNOW
INTRODUCTION
distributed. The zero-point and the units of the scale are chosen arbitrarily,
but the scale unit should be equal and remain stable throughout the scale. We
shall discuss in this section some of the common scaling procedures used in
psychology and education.
[Fig. Probability density of the assumed normal ability distribution]
DR RAJIV SAKSENA
Department of Statistics
University of Lucknow, Lucknow
3
Example: Suppose there are four items A, B, C and D, passed, respectively, by
90%, 80%, 70% and 60% of the individuals. Compare the difference in
difficulty between A and B with the difference in difficulty between C and D.
To find the difficulty value d_A of item A, we find the point on the
normal distribution with mean 0 and s.d. σ, the area to the right of which is
0.90. From the table of the area under the normal probability curve (Table I,
Appendix B), we have

d_A = −1.28σ.

Similarly, d_B = −0.84σ, d_C = −0.52σ and d_D = −0.25σ.

Hence the difference in difficulty between A and B is d_B − d_A = 0.44σ,
while that between C and D is d_D − d_C = 0.27σ: A and B differ more in
difficulty than C and D, although each pair differs by 10% in the proportion
passing.
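These table look-ups are just inverse-normal evaluations, so the example can be checked with Python's standard library (the item names and proportions are those of the example above):

```python
from statistics import NormalDist

def item_difficulty(p_pass):
    """Difficulty value of an item passed by a proportion p_pass of the
    group: the normal deviate (mean 0, s.d. 1) with area p_pass to its
    right, i.e. the (1 - p_pass) quantile."""
    return NormalDist().inv_cdf(1.0 - p_pass)

d = {item: item_difficulty(p)
     for item, p in [("A", 0.90), ("B", 0.80), ("C", 0.70), ("D", 0.60)]}
diff_AB = d["B"] - d["A"]   # ≈ 0.44 (in units of sigma)
diff_CD = d["D"] - d["C"]   # ≈ 0.27
```

So A and B differ more in difficulty than C and D, even though each pair is ten percentage points apart in pass rate.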
Percentile scaling
Z-Scaling or σ-Scaling
Here we assume that whatever differences may exist in the forms of the
raw score distributions may be attributed to chance or to the limitations of the
test. In fact, the distributions of the traits under consideration are assumed to
differ only in mean and s.d. Hence the scores on different tests should be
expressed in terms of the scores in a hypothetical distribution of the same form
as the trait distribution, with some arbitrarily chosen mean and s.d. The
transformed scores are called linear derived scores. In particular, if the mean is
arbitrarily taken to be zero and the s.d. to be unity, the scores are called
standard scores or σ-scores or z-scores. To avoid negative standard scores, in
linear derived scores the mean is generally taken to be 50 and the s.d. to be 10.
If a particular test has raw score mean and s.d. equal to μ and σ, respectively, then
the linear derived score corresponding to a score x on that test is given by

(x − μ)/σ = (w − 50)/10

or w = 50 + 10(x − μ)/σ = 50 + 10z, …(1)
where 𝑤 is the linear derived score with mean 50 and s.d. 10 and 𝑧 is the
standard score. This linear transformation changes only the mean and the s.d.,
while retaining the form of the original distribution.
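Equation (1) is a one-line computation; a minimal sketch in Python (the raw score, mean and s.d. in the usage line are made-up numbers):

```python
def linear_derived_score(x, mu, sigma, new_mean=50.0, new_sd=10.0):
    """Equation (1): w = 50 + 10*z with z = (x - mu)/sigma.

    Only the mean and s.d. change; the form of the raw score
    distribution is preserved (a linear transformation)."""
    z = (x - mu) / sigma
    return new_mean + new_sd * z

w = linear_derived_score(60, 40, 10)   # two s.d. above the mean -> w = 70
```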
T-Scaling
where Φ(τ) is the area under the standard normal curve from −∞ to τ.
TABLE 5.1
STANINE DISTRIBUTION

Stanine score                        1    2    3    4    5    6    7    8    9
Percentage on each score (rounded)   4    7   12   17   20   17   12    7    4
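A percentile rank can be converted to a stanine by scanning the cumulative percentages of Table 5.1; the boundary convention used here (a rank falling exactly on a cumulative percentage goes to the lower stanine) is an assumption of this sketch:

```python
# Cumulative upper bounds (percent) for stanines 1..9, from Table 5.1:
# 4, 4+7, 4+7+12, ..., 100.
STANINE_CUM = [4, 11, 23, 40, 60, 77, 89, 96, 100]

def stanine(percentile_rank):
    """Stanine score (1-9) for a percentile rank in [0, 100]."""
    for score, upper in enumerate(STANINE_CUM, start=1):
        if percentile_rank <= upper:
            return score
    return 9
```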
A transformation is nonlinear if it changes the form of the distribution.
Normalized scores and percentile scores are merely special cases of nonlinear
transformation of the raw score. For a nonlinear transformation any form of
distribution may be chosen.

Here we do not make any assumption about the distribution of the trait
under consideration. The appropriate trait distribution is obtained by
graduating the raw score distribution by an appropriate Pearsonian curve.

Equivalent scores can also be obtained from the score distributions for x
and y without going through the process of graduation. First the two ogives are
drawn on the same graph paper. Two scores x_i and y_i with the same relative
cumulative frequency are then regarded as equivalent.
Example: The raw score distributions for Vernacular and English for a group
of 500 students are given below. One of two students got 80 in Vernacular and
40 in English, while the other got 60 in both. Compare their performances by
(i) percentile scaling, (ii) linear derived scores, (iii) T-scaling and (iv)
equivalent scores (ogive method).

For Vernacular, P_80.5(Vern.) = (497 + 0.6)/500 × 100 = 99.52

and P_60.5(Vern.) = (436 + 7.2)/500 × 100 = 88.64.

Similarly, for English, P_40.5(Eng.) = (270 + 15.6)/500 × 100 = 57.12

and P_60.5(Eng.) = (476 + 3.6)/500 × 100 = 95.92.
              Frequency
Score     Vernacular   English
0-4           –            3
5-9           –            6
10-14         –           12
15-19         6           23
20-24         7           35
25-29        18           45
30-34        34           74
35-39        56           72
40-44        84           78
45-49        74           53
50-54       104           46
55-59        53           29
60-64        36           18
65-69        16            5
70-74         9            1
75-79         0            –
80-84         3            –
           Cumulative frequency
Score     Vernacular   English
0-4           –            3
5-9           –            9
10-14         –           21
15-19         6           44
20-24        13           79
25-29        31          124
30-34        65          198
35-39       121          270
40-44       205          348
45-49       279          401
50-54       383          447
55-59       436          476
60-64       472          494
65-69       488          499
70-74       497          500
75-79       497           –
80-84       500           –
Hence the total scaled score for student 1, getting 80 in Vernacular and
40 in English, is, by percentile scaling,

99.52 + 57.12 = 156.64,

while for student 2, getting 60 in both, it is

88.64 + 95.92 = 184.56.
Thus we see that the relative performances of the two students are
quite different although their total raw scores are equal.
For linear derived scores with mean 50 and s.d. 10, we require the
means and s.d.s of the scores in the two subjects. Denoting by x the score in
Vernacular and by y the score in English, we have

x̄ = 47.07, s_x = 11.32.

The total scaled score for student 1 is

79.07 + 51.63 = 130.70,

and for student 2 it is

61.40 + 66.89 = 128.29.
Linear derived scores, however, show that student 1 is slightly superior
to student 2.

By T-scaling, the totals are

75.90 + 51.79 = 127.69 for student 1, and

62.08 + 67.41 = 129.49 for student 2.
Fig. Determination of equivalent scores in English
and Vernacular from the ogives
By the equivalent-score (ogive) method, the totals are

80 + 49.8 = 129.8 for student 1, and

60 + 66.9 = 126.9 for student 2.
∫_{x₁}^{x₂} x (1/√(2π)) exp[−x²/2] dx / {Φ(x₂) − Φ(x₁)}

= [−(1/√(2π)) exp(−x²/2)]_{x₁}^{x₂} / {Φ(x₂) − Φ(x₁)}

= {φ(x₁) − φ(x₂)} / {Φ(x₂) − Φ(x₁)}, …(4)

where φ(x) = (1/√(2π)) exp[−x²/2] and Φ(x) = ∫_{−∞}^{x} (1/√(2π)) exp[−u²/2] du.
From the observed distribution of the ratings, it is easy to find Φ(x₁)
and Φ(x₂), and hence φ(x₁) and φ(x₂).
The method is due to Likert and the scale is known as Likert's scale. This
is also called the category-scale method. If, on the other hand, the n individuals
in the group are ranked by different judges, the scale values corresponding to
the ranks can be obtained under the same assumptions as before, i.e. under the
assumption of normality of the trait concerned.

Suppose there is no tie. Then the percentile rank (PR) of an individual with
rank R, i.e. the percentage of individuals who are ranked below him, is given by
PR = 100 − 100(R − ½)/n = P, say, …(5)

since the rank R of the individual really represents the interval from R − ½ to
R + ½. The scale value corresponding to this PR can now be obtained as the
value of the normal deviate below which the area is P/100. In the case of tied
ranks, the PR values can be obtained from the frequency distribution of ranks.
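Equation (5) and the normal-deviate look-up can be combined in a few lines (stdlib only; the group sizes in the usage lines are invented):

```python
from statistics import NormalDist

def rank_scale_value(R, n):
    """Scale value for rank R among n individuals (rank 1 = highest),
    assuming a normally distributed trait, equation (5).

    PR = 100 - 100*(R - 1/2)/n is the percentage ranked below the
    individual; the scale value is the normal deviate below which
    the area is PR/100."""
    PR = 100.0 - 100.0 * (R - 0.5) / n
    return NormalDist().inv_cdf(PR / 100.0)

top = rank_scale_value(1, 10)    # PR = 95, scale value ≈ 1.64
middle = rank_scale_value(5, 9)  # PR = 50, scale value 0
```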
Under the usual assumption of normality for the trait under
consideration, we obtain, for the ratings, the scale values as follows:

Rating                                          A       B        C        D        E
Scale value {φ(x₁) − φ(x₂)}/{Φ(x₂) − Φ(x₁)}   2.062   0.997   −0.040   −1.115   −2.267
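The scale value of each rating category is the mean of a doubly truncated standard normal, as in equation (4). A sketch, using a hypothetical symmetric five-category distribution (the percentages are invented, not those behind the table above):

```python
from math import exp, inf, pi, sqrt
from statistics import NormalDist

def phi(x):
    """Standard normal density."""
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)

def category_scale_value(p_below, p_in):
    """Equation (4): mean normal deviate of a category occupying the
    interval (x1, x2), where Phi(x1) = p_below and
    Phi(x2) = p_below + p_in."""
    nd = NormalDist()
    x1 = nd.inv_cdf(p_below) if p_below > 0.0 else -inf
    x2 = nd.inv_cdf(p_below + p_in) if p_below + p_in < 1.0 else inf
    lo = 0.0 if x1 == -inf else phi(x1)
    hi = 0.0 if x2 == inf else phi(x2)
    return (lo - hi) / p_in

# Hypothetical proportions, lowest category first: 10%, 20%, 40%, 20%, 10%
props = [0.10, 0.20, 0.40, 0.20, 0.10]
values, below = [], 0.0
for p in props:
    values.append(category_scale_value(below, p))
    below += p
```

By symmetry the middle category gets scale value 0 and the extreme categories get equal and opposite values.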
It often happens that the ability or the trait in which we are interested
cannot be expressed as a test score. This necessitates the construction of
product scales. In such scales excellence of performance is determined by
comparing an individual's product with various standard products, the values
of which are already determined by a number of competent and expert judges.
Handwritings, compositions, etc., are well-known examples.
Here p_ij is the proportion of judges preferring the i-th product to the j-th
one, and p_ji = 1 − p_ij. By convention, p_ii = 1/2.
p_ij = (1/(σ_{i−j}√(2π))) ∫₀^∞ exp[−{T − (S_i − S_j)}²/(2σ²_{i−j})] dT

     = (1/√(2π)) ∫_{−(S_i−S_j)/σ_{i−j}}^{∞} exp[−τ²/2] dτ,
where x_ij is the value of the normal deviate the area to the right of which is
p_ij. Equation (6) is known as Thurstone's law of comparative judgment.
Assuming that the distribution of judgments for each product has the same
s.d. σ and that the judgments for any two products are uncorrelated, σ_{i−j} = σ√2,
a constant. Taking this constant as the unit of measurement,

S_i − S_j = −x_ij. …(6a)
[Fig. Probability density of the differences in judgments: x_ij is the normal deviate cutting off area p_ij to its right]
values for the 𝑘 products. Alternatively, we could take the origin at the
minimum scale value and adjust the scale values accordingly.
Example: 200 individuals were asked about their preferences for 4 different
types of music. The matrix of normal deviates x_ij obtained from the proportion
matrix is given below. Find the scale values.

                             Music type
                       1        2        3        4
              1      0        .739    1.165    1.237
Music type    2     −.739     0        .653    1.015
              3    −1.17     −.653     0        .831
              4    −1.24    −1.015    −.831     0
Column mean        −.785     −.232     .247     .771

With the origin at S̄, the mean of the scale values, the column means give us the
corresponding scale values for the four music types. With the origin at S₁, on the
other hand, we get the following scale values:

Music type      1       2       3       4
Scale value     0      .553   1.032   1.556
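The arithmetic of the example (column means, then a shift of origin) can be reproduced directly; the matrix below copies the normal deviates x_ij of the worked example, so small rounding differences against the printed scale values are expected:

```python
# Normal deviates x_ij for the 4 music types (rows i, columns j).
X = [
    [0.0,    0.739,  1.165,  1.237],
    [-0.739, 0.0,    0.653,  1.015],
    [-1.17,  -0.653, 0.0,    0.831],
    [-1.24,  -1.015, -0.831, 0.0],
]
k = len(X)

# Column means: scale values with origin at the mean scale value.
col_means = [sum(X[i][j] for i in range(k)) / k for j in range(k)]

# Shift the origin to the minimum scale value (music type 1 here).
shifted = [m - min(col_means) for m in col_means]
```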
NORMS AND REFERENCE GROUPS
Many tests are used for several purposes and for several groups of
individuals. If the result of a test is to be used for comparison with several
groups, it is necessary to have norms for each of the groups separately, unless
they are known to be the same. To calculate the norms for several groups, the
test has to be administered to a random representative sample from the
population of the reference group. The sample should not be too small, so
that stable norms are obtained. Norm data are, however, not necessary in
practical situations where we want to select a number of individuals out of all
applicants on the basis of test scores, because the top individuals are to be
selected, no matter what the norms are.
TEST THEORY
assumed that the measuring instrument used would give us a stable and
consistent measure of the trait if we remeasured the trait under identical
conditions. Technically, this aspect of accuracy is known as the reliability
of the measuring instrument. The second requirement is that the measuring
instrument measures the trait which it is intended to measure; technically,
this is known as the validity of the measuring instrument.

The raw score (x) does not equal the unknown true score (t). The
difference (x − t), which may be due to various factors, is the error score (e).
In test theory we always consider only random errors (e). Constant or
systematic errors are assumed to be absent. Since we consider only random
errors, it is reasonable to make the following assumptions for the e's:

μ_e = 0,  ρ_te = 0,  ρ_{e_g e_h} = 0. …(8)
In words, the mean of the error scores is zero, the correlation between true
score and error score is zero, and the correlation between error scores from
different testing occasions (or for two parallel tests, g and h, to be defined
shortly) is zero. We note that under this model the estimates of μ_e, ρ_te and
ρ_{e_g e_h} will approach zero as the number of individuals (n) approaches infinity.
In practice, however, the estimates are assumed to satisfy these relations for
the given sample.
Two tests are said to be parallel when it makes no difference which one
is used. If g and h are two tests and if for the i-th individual t_ig ≠ t_ih, then we
cannot say that it makes no difference whether we use test g or h. So, in order
that g and h may be parallel tests, it is reasonable to assume that
t_ig = t_ih, …(9)

i.e., the true score of any individual should be the same on the two tests.

Next, consistent with the definition of error score (8), we assume about
the error scores on two parallel tests that

σ_{e_g} = σ_{e_h}, …(10)

i.e., the standard deviations of errors on the two tests should be the same.
Thus (9) and (10) define parallel tests in terms of unknown quantities. These
can be expressed in terms of the distributions of the raw scores, using the
relations (7), (8) and (9), as follows. From (7) and (8), we have σ²_x = σ²_t + σ²_e
for any test. Then we have

μ_{x_g} = μ_{x_h}  and  σ_{x_g} = σ_{x_h}. …(11)

Thus the means of raw scores on two parallel tests are equal; and so are
the standard deviations.
If we have more than two parallel tests (at least three, say g, h and k), we
have another condition to check, besides (11), before we can conclude that
the tests g, h and k are parallel. This condition is the equality of the
intercorrelations,

ρ_{x_g x_h} = ρ_{x_g x_k} = ρ_{x_h x_k}. …(12)
ρ_{x_g x_h} = cov(x_g, x_h)/(σ_{x_g} σ_{x_h})

          = cov(t_g, t_h)/σ²_{x_g}   (since g, h are parallel tests, the remaining
            covariance terms are all zero and σ_{x_g} = σ_{x_h})

          = ρ_{t_g t_h} σ_{t_g} σ_{t_h}/σ²_{x_g}

          = σ²_{t_g}/σ²_{x_g}   (since ρ_{t_g t_h} = 1 and σ_{t_g} = σ_{t_h}, g and h
            being parallel). …(13)
Equation (13) easily establishes equation (12) for a number of parallel tests.

Thus for three or more parallel tests the means of raw scores are equal;
so are the variances and the intercorrelations. In addition to satisfying these
criteria, parallel tests should also be similar with respect to the content and
nature of items, etc., which may be verified by expert judgment only.

Equations (8) define the error score. Then the true score (t) can be
regarded as the difference (x − e) between the raw score and the error score.
Thus, t_i = x_i − e_i.
or σ_e = σ_x √(1 − ρ_{x_g x_h}). …(15)

Equation (15) gives the standard deviation of the error scores, which
is technically known as the standard error of measurement.
DEFINITION OF RELIABILITY
If g is the given test and h any other test parallel to g, then the reliability of g
is measured by ρ_{x_g x_h} and will be denoted by ρ_gg.

Reliability can thus be defined as the ratio of the true score variance to
the raw score variance, or as the proportion of the raw score variance that is
true score variance. Reliability ranges from zero to one. ρ_gg = 1 when
σ_e = 0. But σ_e = 0 if and only if all e_i = 0, since μ_e = 0. Thus the test is
perfectly reliable (ρ_gg = 1) if x_i = t_i for each i, and then the raw scores are
the true scores. ρ_gg = 0 if σ_t = 0 (or, equivalently, if σ_e = σ_x), i.e., when
x_i = t + e_i for each i, and then the test is unreliable (here t denotes the true
score, the same for all i). Hence

0 ≤ ρ_gg ≤ 1.
By the length of a test we mean the number of items in the test. Let us
augment the length of the test by adding to it (k − 1) parallel tests of the same
length. The composite test is then made up of k parallel tests of the same length,
and the length of the composite test is k times the length of the original test.
The effects of this increase in length on the true score variance and the raw
score variance are the following:
Denoting the k parallel tests by g₁, g₂, ……, g_k and the composite test
by G, we have

σ²_{x_G} = σ²(x_{g₁} + x_{g₂} + ⋯ + x_{g_k}) = Σ_{i=1}^{k} σ²_{x_{g_i}} + Σ Σ_{i≠j} ρ_{x_{g_i} x_{g_j}} σ_{x_{g_i}} σ_{x_{g_j}}

(summation over i, j = 1, 2, ……, k). Since ρ_{x_{g_i} x_{g_j}} = ρ_gg (i.e. the
reliability) and σ_{x_{g_i}} = σ_{x_{g_j}} for parallel tests g_i, g_j,

σ²_{x_G} = kσ²_{x_{g₁}}[1 + (k − 1)ρ_gg]. …(18)
Using equation (16), we may write down the reliability of a test whose
length is increased k times (by adding k − 1 parallel tests) as ρ_GG = σ²_{t_G}/σ²_{x_G},
which can be expressed in terms of ρ_gg, using equations (17) and (18), as

ρ_GG = k²σ²_{t_{g₁}} / (kσ²_{x_{g₁}}[1 + (k − 1)ρ_gg])

     = kρ_gg / (1 + (k − 1)ρ_gg), …(19)

where ρ_gg is the reliability of the original test and ρ_GG is the reliability
of the lengthened test G, whose length is equal to k times the length of g₁.
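Equation (19) is easy to wrap as a function; the usage line reuses the reliability 0.67 that appears in the example further on:

```python
def spearman_brown(rho_gg, k):
    """General Spearman-Brown formula, equation (19): reliability of a
    test lengthened k times by adding parallel parts."""
    return k * rho_gg / (1.0 + (k - 1.0) * rho_gg)

doubled = spearman_brown(0.67, 2)   # the k = 2 case, equation (20): ≈ 0.80
```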
Formula (19) is known as the general Spearman–Brown formula. In the
usual case where k = 2, the Spearman–Brown formula for doubled test length is

ρ_GG = 2ρ_gg/(1 + ρ_gg). …(20)

The derivation of formulae (19) and (20) involves the assumption that
the additional test parts used in lengthening the original test are parallel to
those in the original test.
Solving (19) for k gives

k = ρ_GG(1 − ρ_gg)/(ρ_gg(1 − ρ_GG)), …(21)

where ρ_gg is the reliability of the original test and ρ_GG is the desired reliability
of the lengthened test after the original test is lengthened k times.

Example: Here ρ_gg = 0.67 and ρ_GG = 0.95. Then by equation (21), we have

k = 0.95(1 − 0.67)/(0.67(1 − 0.95)) = (0.95 × 0.33)/(0.67 × 0.05) = 0.3135/0.0335 = 9 (approximately).
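The lengthening factor of equation (21) can be sketched and checked against equation (19):

```python
def length_factor(rho_gg, rho_GG):
    """Equation (21): factor k by which a test of reliability rho_gg
    must be lengthened to reach the desired reliability rho_GG."""
    return rho_GG * (1.0 - rho_gg) / (rho_gg * (1.0 - rho_GG))

k = length_factor(0.67, 0.95)   # ≈ 9.36, i.e. about 9 in practice
```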
Reliability, as defined above and denoted by ρ_gg, is based on population
data (an infinite number of individuals being tested). In practice, we have only
a sample of finite size n, and the corresponding sample correlation estimates
the reliability. There are mainly four methods available for estimating test
reliability. These are:

(a) the parallel-test method, (b) the test-retest method, (c) the split-
half method and (d) the Kuder-Richardson method.

For many situations, this is the best method of estimating test reliability.
However, the ability measured should not change in the time interval between
the administrations of the tests. For many scholastic achievement and mental
ability tests, this condition is fulfilled. But there are cases where the ability
tested will change, e.g., in performance tests like typewriting tests, athletic
skill tests, etc., if the individuals continue practising during the interval
between the two administrations.
Test-retest method
Split-half method
Here the test is applied once and then the score is divided into two
equivalent halves, and the correlation between the scores on the half-tests
estimates the reliability of each half-test. Then, by the Spearman–Brown
formula (20), we may estimate the reliability of the original (full) test.

The test may be split into two parts in a number of ways. The
commonest way is to split the test on the basis of odd-numbered and even-
numbered items.
that there is no unique way of splitting the test and no unique split-half
correlation. In most power tests (where one does not emphasize the speed or
quickness with which the work can be performed), the items are arranged in
order of difficulty, and the odd-even split provides a unique estimate of
reliability. Alternative split-half estimates are

r_GG = 1 − s²_d/s²_x, …(22)

where s²_x is the variance of raw scores and s²_d is the variance of the difference
of the raw scores on the two halves of the test, and

r_GG = 2[1 − (s²₁ + s²₂)/s²_x], …(23)

where s²₁ and s²₂ are the variances of raw scores on the two halves.

Equations (20), (22) and (23) will give the same reliability coefficient
when s²₁ = s²₂, i.e., when the two halves have equal raw score variances. If
s²₁ ≠ s²₂, then the split-half reliability given by equation (20) will be the highest.
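The odd-even split and the step-up by equation (20) can be sketched as follows (pure stdlib; the small 0/1 item matrices used to exercise it are invented):

```python
from statistics import mean, pstdev

def pearson_r(a, b):
    """Product-moment correlation of two equal-length score lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (pstdev(a) * pstdev(b))

def split_half_reliability(item_matrix):
    """Odd-even split-half reliability of the full test.

    item_matrix: one row per person, one item score per column.
    The half-test correlation is stepped up by the Spearman-Brown
    formula for doubled length, equation (20)."""
    odd = [sum(row[0::2]) for row in item_matrix]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_matrix]   # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return 2.0 * r_half / (1.0 + r_half)
```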
Kuder-Richardson method
Since the items are all parallel, ρ_{x_g x_h} will be equal to ρ_gg (the reliability
of item g) for all g and h, and σ_{x_g} will be the same for all g.

Next, to obtain the reliability of the test of k parallel items from ρ_gg, we
apply the general Spearman–Brown formula (19):

ρ_GG = kρ_gg/(1 + (k − 1)ρ_gg)

     = [k/(k − 1)] × [(σ²_x − Σ_{g=1}^{k} σ²_{x_g})/σ²_x]. …(24)

In terms of sample quantities,

r_GG = [k/(k − 1)][(s²_x − Σ_{g=1}^{k} s²_{x_g})/s²_x], …(24a)

where s²_x is the sample variance of raw total scores and s²_{x_g} is the same for
item g.
If the items are scored 0 or 1, with p_g the proportion passing item g, then
s²_{x_g} = p_g(1 − p_g) and (24a) becomes

r_GG = [k/(k − 1)][(s²_x − Σ_{g=1}^{k} p_g(1 − p_g))/s²_x]. …(25)

If, further, all the items are of equal difficulty π, then

σ²_{x_g} = π(1 − π) = π − π²,   μ_x = kπ.

Thus,

σ²_{x_g} = μ_x/k − μ²_x/k².

Then from formula (24), we have

ρ_GG = [k/(k − 1)][1 − kσ²_{x_g}/σ²_x]

     = [k/(k − 1)][1 − (μ_x − μ²_x/k)/σ²_x], …(26)

in which μ_x and σ²_x are replaced in practice by x̄ and s²_x, the sample mean and
variance of raw total scores.
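Formulae (25) and (26) can be sketched directly; the 4-person, 3-item 0/1 matrix in the usage lines is invented for illustration:

```python
from statistics import mean, pvariance

def kr20(item_matrix):
    """Kuder-Richardson formula 20, equation (25); items scored 0/1."""
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    s2 = pvariance(totals)                       # variance of total scores
    p = [mean(row[g] for row in item_matrix) for g in range(k)]
    return (k / (k - 1)) * (s2 - sum(pg * (1 - pg) for pg in p)) / s2

def kr21(k, mean_x, var_x):
    """Kuder-Richardson formula 21, equation (26); assumes all k items
    are of equal difficulty, so it needs only the mean and variance
    of the total scores."""
    return (k / (k - 1)) * (1.0 - (mean_x - mean_x ** 2 / k) / var_x)

scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
r20 = kr20(scores)         # 0.75
r21 = kr21(3, 1.5, 1.25)   # 0.60; KR-21 never exceeds KR-20
```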
VALIDITY
recruiting persons in a vocation), etc. We discuss below the different concepts
of validity.

Predictive validity

This type of validity arises when we use a test for a trait for selecting
applicants for a particular course or job, and the criterion variable is the degree
of success at a later period, i.e., after the recruits have completed the course
or have been on the job for a sufficient period. The criterion variable is the
performance at that later period: grades or ratings on completion of the
course or after a certain period of employment. A test has high predictive
validity if it can forecast efficiently later performance on a particular
measurable aspect of life. And this is of importance in the selection or
recruitment of individuals for different courses of study or training
programmes or jobs.
Concurrent validity

Concurrent validity is obtained for tests for which the criterion variable
is available at the same time as the test results, so that we do not have to wait
as in the case of predictive validity. Such tests are constructed for measuring a
variable whose value could also be obtained directly, because a test is easier
and sometimes saves time and expenditure, while giving the same results as
the criterion variable. Concurrent validity is used for diagnostic tests
(e.g. in clinical diagnosis). Both types of validity (predictive and concurrent)
are obtained by computing the correlation between the test scores and the
criterion scores, and the validity is this correlation coefficient.
Content validity
drawing ability, etc. There are large numbers of items which measure these
areas, and in a test we have only a sample of these items. In assessing the
content validity of a test, we try to ascertain how far the test covers the field
of study under investigation or, in other words, how good the items of the test
are as a sample from the totality of all items for that test. It is, however, not
possible to express content validity as a validity coefficient, as is possible with
the previous two validities.
Construct validity
Let 𝑇𝑖 and 𝐶𝑖 be the observed test score and criterion score for the 𝑖𝑡ℎ
individual, 𝑡𝑖 and 𝑐𝑖 the corresponding true scores, and 𝑒𝑖 and 𝑒𝑖′ the errors.
Thus,
T_i = t_i + e_i  and  C_i = c_i + e_i′,

r_TT and r_CC being estimates of the reliability of the test scores and the
criterion scores. Thus,

r_tc = r_TC/√(r_TT r_CC). …(27)
But this coefficient is of little practical value, since a pair of perfectly reliable
test and criterion is rarely realized. Very often we shall be using test scores
which are contaminated with errors for the purpose of prediction. There it
may be of interest to know what the validity coefficient would be had a
perfectly reliable criterion been available. In the same way, we can find the
correlation between the true criterion score and the observed test score, as

r_Tc = r_TC/√(r_CC). …(28)
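Equations (27) and (28), the corrections for attenuation, in code (the correlations in the usage lines are invented):

```python
from math import sqrt

def true_score_correlation(r_TC, r_TT, r_CC):
    """Equation (27): correlation between true test and true criterion
    scores, correcting the observed validity r_TC for the
    unreliability of both measures."""
    return r_TC / sqrt(r_TT * r_CC)

def validity_against_true_criterion(r_TC, r_CC):
    """Equation (28): correlation of the observed test score with the
    true criterion score (only the criterion corrected)."""
    return r_TC / sqrt(r_CC)

r27 = true_score_correlation(0.56, 0.81, 0.64)      # 0.56/0.72 ≈ 0.778
r28 = validity_against_true_criterion(0.56, 0.64)   # 0.56/0.80 = 0.70
```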
EFFECT OF TEST LENGTH ON TEST PARAMETERS
We have seen in the previous section the effect of test length on the true
score variance (equation 17), on the observed score variance (equation 18) and
on the reliability of a test (equation 19). Using the notation already introduced,
it is easy to see the effect of test length on the true mean and the observed
score mean. To find the effect of test length on the validity of a test, we first
consider the case where the original test is lengthened by adding to it (k − 1)
parallel tests of the same length and the original criterion variable is lengthened
by adding to it (l − 1) parallel criterion variables of the same length, such that
each pair of component test and criterion variable gives the same validity
coefficient.
ρ_{x_G y_H} = Σ_{g=1}^{k} Σ_{h=1}^{l} ρ_{x_g y_h} σ_{x_g} σ_{y_h}
              / ({kσ²_{x_g} + k(k − 1)ρ_gg σ²_{x_g}}^{1/2} {lσ²_{y_h} + l(l − 1)ρ_hh σ²_{y_h}}^{1/2})

          = klρ_{x_g y_h} / ({k + k(k − 1)ρ_gg}^{1/2} {l + l(l − 1)ρ_hh}^{1/2}), …(31)

where ρ_{x_g y_h} is the validity of the original test with the original criterion
variable.
If the criterion variable is not lengthened, then the effect on the validity
of increasing only the test length is obtained from (31) by putting l = 1:

ρ_{x_G y_h} = kρ_{x_g y_h}/{k + k(k − 1)ρ_gg}^{1/2}. …(32)
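Equations (31) and (32) as one function; with l = 1 the criterion is left unchanged and (31) reduces to (32). The numbers in the usage line are invented:

```python
from math import sqrt

def lengthened_validity(r_xy, rho_gg, rho_hh, k, l=1):
    """Equation (31): validity after lengthening the test k times and
    the criterion l times; l = 1 gives equation (32)."""
    num = k * l * r_xy
    den = sqrt(k + k * (k - 1) * rho_gg) * sqrt(l + l * (l - 1) * rho_hh)
    return num / den

v = lengthened_validity(0.5, 0.8, 0.8, k=4)   # ≈ 0.542: a longer test raises validity
```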
ITEM ANALYSIS
The typical item analysis is carried out from two kinds of information:
an index of item difficulty and an index of item validity, which means how well
the item discriminates in agreement with the rest of the items of the test or
how well it predicts some external criterion. The most common index of item
difficulty is p_i, the proportion of subjects who pass the item. The commonly
used index of item validity is r_ic, the correlation of the item score with some
external criterion c, or, more often, r_it, the correlation of the item score with
the total score. The most common use of item analysis data is the selection of
the best items to compose the final test. It also enables the item-writer to
modify the items in the required directions. The important features of the test,
viz. mean, variance, reliability and validity, can be controlled by selecting
items of the right type of difficulty, the right spread of difficulty, and the right
degree of item intercorrelations and item validities.
The difficulty index p_i for the i-th item is the proportion of subjects
answering the item correctly. In a multiple-choice item with k alternatives,
Guilford has proposed a correction for guessing, on the assumption that a
subject either knows the answer or guesses at random. If R_i is the
number of persons answering the item correctly and W_i the number answering
wrongly, the number of lucky guesses, i.e. of those who guess correctly, is
estimated as W_i/(k − 1), so that the item difficulty corrected for guessing is

(R_i − W_i/(k − 1))/(R_i + W_i). …(33)

There are alternative formulae for correction for guessing too, based on other
assumptions. In some methods of item analysis, the correlation r_it is estimated
from those making extreme scores, generally the upper and lower 27% of the
total group. The estimation is, however, based on symmetry of the item score
and total score distributions and linearity of the regression of item score on
total score.
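Guilford's correction (33) above in code; the counts in the usage line are invented:

```python
def difficulty_corrected_for_guessing(R, W, k):
    """Equation (33): proportion answering correctly, corrected for
    random guessing among k alternatives; R right, W wrong."""
    return (R - W / (k - 1)) / (R + W)

p = difficulty_corrected_for_guessing(60, 40, 5)   # (60 - 10)/100 = 0.5
```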
Four coefficients of correlation are commonly used to indicate the
correlation of an item with a criterion (r_ic) or, more generally, of an item with
the total (r_it). They are the biserial (r_bi), the point biserial (r_pb), the
tetrachoric (r_t) and the Φ coefficient. If the ability measured by the item is
normally distributed and the criterion score is continuous, then r_bi can be used.
If the item score is limited to 0 and 1, r_pb should be used. If the criterion
variable and the ability measured by the item are both normally distributed,
r_t is called for. If the criterion is not a continuous variable, but a natural
division into two groups, one can use the Φ coefficient. Another index, known
as the index of discrimination between High and Low groups, is often used for
item selection.
more or less general agreement as to the procedure of measuring intelligence.
In an intelligence test, the following types of problems find a place:

(ii) Classification

A set of words is given. All but one of the words are in some respect the
same. The subject is to find out the odd word.

(iv) Mixed sentences
(v) Coding

Example: Code the following message by first reversing each word and then

(vi) Number series

A series of numbers is given and the subject is to supply the next number or
the next two.

(vii) Analogies

Three words, of which the first two are related in some way, are given.
The subject is to find or select a fourth word which is related to the third as
the second is to the first.
(viii) Inferences
Intelligence tests may be designed for application to individuals or to
groups of individuals. One of the well-known individual tests is
Binet's test. The revised version of this test is now being widely used for
measuring the intelligence of young children and for detecting mental
deficiency. Group tests were first widely used by the U.S. Army authorities for
recruitment, placement and promotion of personnel. The Alpha test was meant
for the majority, and the Beta test for illiterates and non-English-speaking
persons.
regarded as retarded if his MR is less than 1, and he is of average intelligence if
his MR equals 1.

Intelligence tests have found many uses. They are used for vocational
guidance and selection, in the grading of pupils and in diagnosing mental
deficiency. Thus an intelligence test, properly constructed and standardized, is
of immense use for various purposes.
ELEMENTS OF FACTOR ANALYSIS
The factor model expresses each standardized variable as

X_j = a_{j1}F₁ + a_{j2}F₂ + ⋯ + a_{jm}F_m + b_j S_j + c_j E_j,

F₁, F₂, ……, F_m being the common factors, S_j the specific factor and E_j the
error or unreliability.

h²_j = Σ_{k=1}^{m} a²_{jk} is called the communality of the variable X_j, which is
the part of the total variance attributable to the common factors, whereas b²_j and
c²_j are called the specificity and the unreliability of the variable, b²_j + c²_j being
called its uniqueness. h²_j + b²_j may be termed the reliability of the
variable, and a_{j1}, a_{j2}, ……, a_{jm} are the factor loadings of the m common
factors for the variable X_j. The basic problem of factor analysis is to determine
the factor loadings. When the factor loadings are determined, one can evaluate
the factors in terms of the variables.

Let us designate

r_{X_j U_j} = a_j. …(38)
and

    M = [ a11  a12 …… a1m   a1  0   0 …… 0
          a21  a22 …… a2m   0   a2  0 …… 0
          ……   ……  ……  ……   ……  ……  …… ……
          an1  an2 …… anm   0   0   0 …… an ].

Then X = MF. Now

    (1/N) XX′ = [ 1    r12 …… r1n
                  r21  1   …… r2n
                  ……   ……  ……  ……
                  rn1  rn2 …… 1  ] = R,

and

    (1/N) XX′ = (1/N)(MF)(F′M′) = M((1/N) FF′) M′ = MM′,

the factors being taken as uncorrelated with unit variances, so that

    R = MM′.

Thus, if we regard the correlation matrix R as the available data and the
factor pattern matrix M as the desired objective in a factor analysis, we have
n(n − 1)/2 experimentally given coefficients, which must exceed the number of
linearly independent coefficients in M. It will be seen that by limiting
ourselves to common factors, the factor problem becomes determinate even
though we admit the existence of unique factors.
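The relation R = MM′ can be verified numerically on a tiny hypothetical pattern: two variables, one common factor, invented loadings .8 and .6, and unique loadings chosen so that each variable has unit variance:

```python
from math import sqrt

a = [0.8, 0.6]                       # hypothetical common-factor loadings
d = [sqrt(1 - x * x) for x in a]     # unique loadings: d_j^2 = 1 - a_j^2

# Full pattern M: columns are F1, U1, U2; rows are the two variables.
M = [[a[0], d[0], 0.0],
     [a[1], 0.0, d[1]]]

# R = M M' reproduces a unit diagonal and r_12 = a_1 * a_2 = 0.48.
R = [[sum(M[i][s] * M[j][s] for s in range(3)) for j in range(2)]
     for i in range(2)]
```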
and compare them with the observed correlation coefficients to see how far
the assumed factor pattern explains the observed correlation coefficients.
When the factor loadings are determined, the estimation of any common factor F_s
(or a unique factor U_s) involves the determination of a regression function.
The solution is

β̂_sj = (1/R)[t_{1s}R_{1j} + t_{2s}R_{2j} + ⋯ + t_{ns}R_{nj}],

where R_{ij} is the cofactor of r_{ij} in the determinant R = |R|. Thus
β̂_s = t′_s R⁻¹, and

F̂ = S′R⁻¹X, …(39)
𝑟𝑋𝑗 𝐹𝑘 = 𝑡𝑗𝑘 = 𝑎𝑗𝑘