
STATISTICAL METHODS FOR
PSYCHOLOGY AND EDUCATION
DR RAJIV SAKSENA
Analyst Cum programmer
Department of Statistics
University of Lucknow, Lucknow
Disclaimer: The e-content is exclusively meant for academic purposes and for enhancing teaching and
learning. Any other use for economic/commercial purpose is strictly prohibited. The users of the content
shall not distribute, disseminate or share it with anyone else and its use is restricted to advancement of
individual knowledge. The information provided in this e-content is authentic and best as per my
knowledge.
INTRODUCTION
Psychometry is the branch of psychology which deals with the measurement of psychological traits or mental abilities like intelligence, aptitude, interest, opinion, attitude or, simply, scholastic achievement. Educational statistics may be considered a part of psychometry in which the main purpose is to rank a group of individuals according to their scholastic achievement. Although this task of ranking does not seem to present immediate problems, a close examination will reveal a number of pitfalls and weaknesses of the prevalent system.
Unlike physical or biological characteristics, psychological characteristics are rather abstract and hence can be measured only with some degree of unreliability. For the purpose of measurement, one has to develop a certain scale, which bears a strong analogy with a foot-rule used for measuring or comparing lengths. As on a foot-rule, equal distances on a psychological scale stand for empirically equal differences in the psychological trait being measured. But the zero-point of the psychological scale, unlike that of the foot-rule, is arbitrary. However, distances from the arbitrary zero are additive. In other words, a psychological scale is an interval scale and not a ratio scale, since there is no absolute zero-point on it.
SOME SCALING PROCEDURES

Most of the scaling procedures used for psychological or educational data are
based on the assumption that the trait under consideration is normally
distributed. The zero-point and the units of the scale are chosen arbitrarily,
but the scale unit should be equal and remain stable throughout the scale. We
shall discuss in this section some of the common scaling procedures used in
psychology and education.
SCALING INDIVIDUAL TEST-ITEMS IN TERMS OF DIFFICULTY
Here we have a number of items in a test administered to a large group of individuals. The proportion of individuals successful in each item is known. We assume in the construction of the difficulty scale that the ability ($x$) which the group of items is measuring is normally distributed with some mean $\mu$ and some s.d. $\sigma$. We can arbitrarily take the origin at $\mu$, i.e., write $\mu = 0$.
Let $p_i$ be the proportion of individuals passing the $i$th item. We determine the point on the $x$-axis for which the area to the right of the ordinate is $p_i$. Let the point be $k_i\sigma$. Thus $k_i\sigma$ is the amount of ability required for passing the item and hence may be taken as a measure of difficulty ($d_i$) for the $i$th item. Thus an equal difference in $d$ will mean an equal difference in ability required for passing the items.
[Figure: normal probability-density curve plotted against ability; the area $p_i$ lies to the right of the ordinate.]
Fig. Determining the difficulty-value of an item from the proportion of individuals passing the item
Example: Suppose there are four items A, B, C and D, passed, respectively, by 90%, 80%, 70% and 60% of the individuals. Compare the difference in difficulty between A and B with the difference in difficulty between C and D.
To find the difficulty value $d_A$ of the item A we find the point, on the normal distribution with mean 0 and s.d. $\sigma$, the area to the right of which is 0.90. From the table of the area under the normal probability curve (Table I, Appendix B), we have

$$d_A = -1.28\sigma.$$

Similarly,

$$d_B = -0.84\sigma, \qquad d_C = -0.52\sigma \qquad \text{and} \qquad d_D = -0.25\sigma.$$

Hence $d_B - d_A = 0.44\sigma$, whereas $d_D - d_C = 0.27\sigma$. Thus

$$\frac{d_B - d_A}{d_D - d_C} = \frac{0.44\sigma}{0.27\sigma} = 1.63.$$

The difficulty of B relative to A is 1.63 times the difficulty of D relative to C.
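
The table lookup in this example can be reproduced numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the function name `item_difficulty` is illustrative and not part of the original text.

```python
from statistics import NormalDist

def item_difficulty(p_pass: float) -> float:
    """Difficulty in sigma units: the point with area p_pass to its right."""
    # Area p_pass to the right means area (1 - p_pass) to the left.
    return NormalDist().inv_cdf(1.0 - p_pass)

# Proportions passing each item, as given in the example.
passing = {"A": 0.90, "B": 0.80, "C": 0.70, "D": 0.60}
d = {item: item_difficulty(p) for item, p in passing.items()}

# Ratio of the two differences in difficulty (sigma cancels).
ratio = (d["B"] - d["A"]) / (d["D"] - d["C"])

for item, value in d.items():
    print(f"d_{item} = {value:+.2f} sigma")
print(f"(d_B - d_A) / (d_D - d_C) = {ratio:.2f}")
```

With unrounded normal deviates the ratio comes out near 1.62; the value 1.63 in the text results from using the two-decimal table entries.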
SCALING OF TEST-SCORES IN SEVERAL TESTS
The main defect of the prevalent system of ranking in scholastic tests consists in adding the raw scores of an individual on several tests to get his composite or total score and ranking all individuals on the basis of the total scores. This is not a valid procedure, since the same raw score $x$ on different tests may involve different degrees of ability and hence may not be equivalent in different tests. Hence the raw scores have to be scaled under some assumption regarding the distribution of the trait which the test is measuring.
Percentile scaling
Here we assume that the distribution of the trait under consideration is rectangular, under which percentile differences are equal throughout the scale. To determine the scale value corresponding to a score $x$ we have to find the percentile position of an individual with score $x$, i.e., the percentage of individuals in the group having a score equal to or less than $x$, which can be easily obtained from the score distribution assuming that 'score' is a continuous variable. Regardless of the form of the original raw-score distribution, the distribution of percentile scores will be rectangular. However, the distribution of raw scores is rarely rectangular, so that the basic assumption underlying percentile scaling may not always be realistic. Thus while using this scaling method one should beware of its limitations.
Z-Scaling or 𝝈 scaling
Here we assume that whatever differences may exist in the forms of raw score distributions may be attributed to chance or to the limitations of the test. In fact, the distributions of the traits under consideration are assumed to differ only in mean and s.d. Hence the scores on different tests should be expressed in terms of the scores in a hypothetical distribution of the same form as the trait-distribution with some arbitrarily chosen mean and s.d. The transformed scores are called linear derived scores. In particular, if the mean is arbitrarily taken to be zero and the s.d. to be unity, the scores are called standard scores or $\sigma$-scores or z-scores. To avoid negative standard scores, in linear derived scores the mean is generally taken to be 50 and the s.d. to be 10. If a particular test has raw-score mean and s.d. equal to $\mu$ and $\sigma$, respectively, then the linear derived score corresponding to a score $x$ on that test is given by

$$\frac{x - \mu}{\sigma} = \frac{w - 50}{10}$$

or

$$w = 50 + 10 \times \frac{x - \mu}{\sigma} = 50 + 10z, \qquad \ldots(1)$$

where $w$ is the linear derived score with mean 50 and s.d. 10 and $z$ is the standard score. This linear transformation changes only the mean and the s.d., while retaining the form of the original distribution.
T-Scaling
In this case we assume that the trait-distribution is normal. The raw score distribution may deviate from normality, but the deviations from normality are attributed to chance or to limitations of the tests. The mean and s.d. of the normal distribution of the trait may be arbitrarily taken to be 50 and 10, respectively. To get the scaled score corresponding to a raw score $x$, first we find, as in percentile scaling, the percentile position ($P$) of an individual with score $x$ and then find the point ($T$) on a normal distribution with mean 50 and s.d. 10 below which the area is $P/100$. This is given by

$$\Phi\left(\frac{T - 50}{10}\right) = \frac{P}{100}, \qquad \ldots(2)$$

where $\Phi(\tau)$ is the area under the curve of the standard normal deviate from $-\infty$ to $\tau$.
The scaled score obtained by this process is called the T-score in memory of the psychologists Terman and Thorndike. The scale is due to McCall.
Normalized scores are also expressed as stanine (standard nine) scores. The stanine scale takes nine values from 1 to 9, with mean 5 and s.d. 2. When a distribution is transformed to a stanine scale, the frequencies are distributed as follows:
TABLE 5.1
STANINE DISTRIBUTION
Stanine score                          1    2    3    4    5    6    7    8    9
Percentage on each score (rounded)     4    7   12   17   20   17   12    7    4
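
One way to apply the stanine table is to look up a percentile position against the cumulative percentages. The sketch below is illustrative only: the function name `stanine` and the rule for positions falling exactly on a boundary are assumptions, not part of the original text. It also checks the stated mean and s.d. of the stanine scale.

```python
from bisect import bisect_left
from itertools import accumulate
from math import sqrt

# Rounded percentage of cases on stanines 1..9 (Table 5.1).
PERCENTAGES = [4, 7, 12, 17, 20, 17, 12, 7, 4]
# Cumulative upper bounds: 4, 11, 23, 40, 60, 77, 89, 96, 100.
UPPER_BOUNDS = list(accumulate(PERCENTAGES))

def stanine(percentile_position: float) -> int:
    """Map a percentile position (0-100) to a stanine score (1-9).

    Assumption: a position exactly on a boundary goes to the lower stanine.
    """
    if not 0 <= percentile_position <= 100:
        raise ValueError("percentile position must lie between 0 and 100")
    return bisect_left(UPPER_BOUNDS, percentile_position) + 1

# Mean and s.d. of the stanine scale implied by the rounded percentages.
mean = sum(s * p for s, p in enumerate(PERCENTAGES, start=1)) / 100
sd = sqrt(sum(p * (s - mean) ** 2
              for s, p in enumerate(PERCENTAGES, start=1)) / 100)
print(mean, round(sd, 2))   # 5.0 and 1.96, i.e. s.d. about 2
```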
A transformation is nonlinear if it changes the form of the distribution. Normalized scores and percentile scores are merely special cases of nonlinear transformation of the raw scores. For a nonlinear transformation any form of distribution may be chosen.
Method of equivalent scores
Here we do not make any assumption about the distribution of the trait under consideration. The appropriate trait distribution is obtained by graduating the raw score distribution by an appropriate Pearsonian curve.
Let $x$ and $y$ be the scores on two tests, having probability-density functions $f(x)$ and $h(y)$, respectively, obtained by some process of graduation. Now, two scores $x_i$ and $y_i$ on the two tests are to be considered equivalent, in the sense that they bring into play equal amounts of the trait, if and only if

$$\int_{-\infty}^{x_i} f(x)\,dx = \int_{-\infty}^{y_i} h(y)\,dy. \qquad \ldots(3)$$

For practical convenience an equivalence curve may be obtained by computing a number of pairs of equivalent scores $(x_i, y_i)$ and fitting to the corresponding set of points an appropriate curve, say $y = g(x)$.
Equivalent scores can also be obtained from the score distributions for $x$ and $y$ without going into the process of graduation. First the two ogives are drawn on the same graph paper. Two scores $x_i$ and $y_i$ with the same relative cumulative frequency are then regarded as equivalent.
For the purpose of comparison or combination, the raw scores on different tests may be converted into equivalent scores on a standard test. In this method the form of the distribution of equivalent (transformed) scores is the same as that of the standard test. If, however, the standard test score has a normal distribution, the method reduces to normalized scaling.
Example: The raw score distributions for Vernacular and English for a group of 500 students are given below. One of two students got 80 in Vernacular and 40 in English, while the other got 60 in both. Compare their performances by (i) percentile scaling, (ii) linear derived scores, (iii) T-scaling and (iv) equivalent scores (ogive method).
First we have to remember that a score of 80 is to be considered as an interval from 79.5 to 80.5, and similarly for the other scores. To obtain the percentile positions, we obtain the cumulative frequencies (less-than type) for both Vernacular and English. They are shown in the table of cumulative frequency distributions below. Within a class of width 5 the cases are taken to be evenly spread, so that one-fifth of the class frequency lies below the upper limit of its first score. Hence the percentile positions corresponding to 80.5 and 60.5 in Vernacular are given by

$$P_{80.5}(\text{Vern.}) = \frac{497 + 0.6}{500} \times 100 = 99.52$$

and

$$P_{60.5}(\text{Vern.}) = \frac{436 + 7.2}{500} \times 100 = 88.64.$$

Similarly, for English,

$$P_{40.5}(\text{Eng.}) = \frac{270 + 15.6}{500} \times 100 = 57.12$$

and

$$P_{60.5}(\text{Eng.}) = \frac{476 + 3.6}{500} \times 100 = 95.92.$$
TABLE: DISTRIBUTIONS OF SCORES IN VERNACULAR AND ENGLISH OF A GROUP OF 500 STUDENTS
(a dash marks a cell left empty in the original table)
Score      Frequency
           Vernacular    English
0-4             -            3
5-9             -            6
10-14           -           12
15-19           6           23
20-24           7           35
25-29          18           45
30-34          34           74
35-39          56           72
40-44          84           78
45-49          74           53
50-54         104           46
55-59          53           29
60-64          36           18
65-69          16            5
70-74           9            1
75-79           0            -
80-84           3            -
TABLE: CUMULATIVE FREQUENCY DISTRIBUTIONS OF SCORES IN VERNACULAR AND ENGLISH
(a dash marks a cell left empty in the original table)
Score      Cumulative Frequency
           Vernacular    English
0-4             -            3
5-9             -            9
10-14           -           21
15-19           6           44
20-24          13           79
25-29          31          124
30-34          65          198
35-39         121          270
40-44         205          348
45-49         279          401
50-54         383          447
55-59         436          476
60-64         472          494
65-69         488          499
70-74         497          500
75-79         497            -
80-84         500            -
Hence the total scaled score for student 1, getting 80 in Vernacular and 40 in English, is, by percentile scaling,
99.52 + 57.12 = 156.64,
and that of student 2, getting 60 in both Vernacular and English, is
88.64 + 95.92 = 184.56.
Thus we see that the relative performances of the two students are quite different although their total raw scores are equal.
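
The percentile positions can be recomputed directly from the grouped frequencies. The following is a minimal sketch; the function name `percentile_position` is illustrative, empty cells of the frequency table are taken as zero, and cases are assumed to be evenly spread within each class (the same interpolation used in the hand calculation).

```python
# Lower score limits of the classes 0-4, 5-9, ..., 80-84 (class width 5).
LOWER_LIMITS = list(range(0, 85, 5))
# Frequencies from the example; empty cells are taken as 0 (assumption).
VERNACULAR = [0, 0, 0, 6, 7, 18, 34, 56, 84, 74, 104, 53, 36, 16, 9, 0, 3]
ENGLISH = [3, 6, 12, 23, 35, 45, 74, 72, 78, 53, 46, 29, 18, 5, 1, 0, 0]

def percentile_position(score, freqs, width=5):
    """Percentage of the group at or below `score`.

    The integer score is treated as the interval (score - 0.5, score + 0.5),
    so the count is taken up to the upper limit score + 0.5.
    """
    upper = score + 0.5
    below = 0.0
    for lower_limit, f in zip(LOWER_LIMITS, freqs):
        start = lower_limit - 0.5                  # true lower class boundary
        if upper >= start + width:
            below += f                             # whole class lies below
        elif upper > start:
            below += f * (upper - start) / width   # part of the class
    return 100.0 * below / sum(freqs)

student_1 = percentile_position(80, VERNACULAR) + percentile_position(40, ENGLISH)
student_2 = percentile_position(60, VERNACULAR) + percentile_position(60, ENGLISH)
print(round(student_1, 2), round(student_2, 2))
```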
For linear derived scores with mean 50 and s.d. 10, we require the means and s.d.s of scores in the two subjects. Denoting by $x$ the score in Vernacular and by $y$ the score in English, we have

$$\bar{x} = 47.07, \quad s_x = 11.32, \qquad \bar{y} = 37.87 \quad \text{and} \quad s_y = 13.10.$$

Hence the $w$ scores are given by

$$w_{80}(\text{Vern.}) = 50 + \frac{80 - 47.07}{11.32} \times 10 = 79.09,$$

$$w_{60}(\text{Vern.}) = 50 + \frac{60 - 47.07}{11.32} \times 10 = 61.42,$$

$$w_{40}(\text{Eng.}) = 50 + \frac{40 - 37.87}{13.10} \times 10 = 51.63$$

and

$$w_{60}(\text{Eng.}) = 50 + \frac{60 - 37.87}{13.10} \times 10 = 66.89.$$

As such, the total $w$-score of student 1 is
79.09 + 51.63 = 130.72,
and that of student 2 is
61.42 + 66.89 = 128.31.
Linear derived scores, however, show that student 1 is slightly superior to student 2.
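
The means, s.d.s and linear derived scores can be checked from the grouped data. This is a sketch under stated assumptions: class midpoints stand in for the individual scores, the s.d. uses divisor n, empty table cells are taken as zero, and the mean and s.d. are rounded to two decimals before use, as in a hand calculation. The function names are illustrative.

```python
from math import sqrt

# Midpoints of the classes 0-4, 5-9, ..., 80-84.
MIDPOINTS = [2 + 5 * i for i in range(17)]
# Frequencies from the example; empty cells are taken as 0 (assumption).
VERNACULAR = [0, 0, 0, 6, 7, 18, 34, 56, 84, 74, 104, 53, 36, 16, 9, 0, 3]
ENGLISH = [3, 6, 12, 23, 35, 45, 74, 72, 78, 53, 46, 29, 18, 5, 1, 0, 0]

def grouped_mean_sd(freqs):
    """Mean and s.d. (divisor n) of a grouped distribution, to 2 decimals."""
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, MIDPOINTS)) / n
    variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, MIDPOINTS)) / n
    return round(mean, 2), round(sqrt(variance), 2)

def linear_derived(x, mean, sd):
    """w = 50 + 10 z, equation (1)."""
    return 50 + 10 * (x - mean) / sd

mean_v, sd_v = grouped_mean_sd(VERNACULAR)
mean_e, sd_e = grouped_mean_sd(ENGLISH)
total_1 = linear_derived(80, mean_v, sd_v) + linear_derived(40, mean_e, sd_e)
total_2 = linear_derived(60, mean_v, sd_v) + linear_derived(60, mean_e, sd_e)
print(mean_v, sd_v, mean_e, sd_e)
print(round(total_1, 2), round(total_2, 2))
```

Run as written, the grouped means come out at 47.07 and 37.87, and the total for student 1 exceeds that for student 2 by a little over two points.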
Now, for T-scaling, percentile positions have to be converted into T-scores. Writing $\tau_p$ for the standard normal deviate below which the area is $p$, we have

$$T_{80}(\text{Vern.}) = 50 + \tau_{.9952} \times 10 = 75.90,$$

$$T_{60}(\text{Vern.}) = 50 + \tau_{.8864} \times 10 = 62.08,$$

$$T_{40}(\text{Eng.}) = 50 + \tau_{.5712} \times 10 = 51.79$$

and

$$T_{60}(\text{Eng.}) = 50 + \tau_{.9592} \times 10 = 67.41.$$

Hence the total T-score of student 1 is
75.90 + 51.79 = 127.69
and the total T-score of student 2 is
62.08 + 67.41 = 129.49.
Thus T-scaling shows that student 2 is slightly superior to student 1.
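
The conversion from percentile position to T-score follows equation (2) directly. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` in place of the printed normal table; the function name `t_score` is illustrative.

```python
from statistics import NormalDist

def t_score(percentile_position: float) -> float:
    """T = 50 + 10 * tau, where Phi(tau) = P / 100 (equation (2))."""
    return 50 + 10 * NormalDist().inv_cdf(percentile_position / 100)

# Percentile positions obtained earlier in the example.
student_1 = t_score(99.52) + t_score(57.12)   # 80 in Vernacular, 40 in English
student_2 = t_score(88.64) + t_score(95.92)   # 60 in both subjects
print(round(student_1, 2), round(student_2, 2))
```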

[Figure: ogives (cumulative frequency curves) for marks in Vernacular (X) and marks in English (Y); horizontal axis: marks, from 4.5 to 84.5; vertical axis: cumulative frequency, from 40 to 520.]
Fig. Determination of equivalent scores in English and Vernacular from the ogives
In the equivalent score method, let us take Vernacular as the standard. From the figure we find that a score of 40 in English is equivalent to a score of 49.8 in Vernacular and a score of 60 in English is equivalent to a score of 66.9 in Vernacular.
Hence the total score of student 1 in terms of Vernacular scores is
80 + 49.8 = 129.8,
and that of student 2 is
60 + 66.9 = 126.9.
This method again shows that student 1 is slightly superior to student 2.
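
Reading equivalent scores off the two ogives can be imitated by linear interpolation between class boundaries. The sketch below is an assumption-laden stand-in for the graphical method: the ogives are taken as straight-line segments, empty table cells are taken as zero, an English score y is read at its upper limit y + 0.5 (as in the percentile calculation), and all function names are illustrative.

```python
from itertools import accumulate

# True class boundaries: -0.5, 4.5, ..., 84.5.
BOUNDARIES = [-0.5 + 5 * i for i in range(18)]
VERNACULAR = [0, 0, 0, 6, 7, 18, 34, 56, 84, 74, 104, 53, 36, 16, 9, 0, 3]
ENGLISH = [3, 6, 12, 23, 35, 45, 74, 72, 78, 53, 46, 29, 18, 5, 1, 0, 0]

def ogive(freqs):
    """Cumulative frequency at each class boundary (less-than type)."""
    return [0] + list(accumulate(freqs))

def cum_at(score, cum):
    """Cumulative frequency at `score`, by linear interpolation."""
    for i in range(len(cum) - 1):
        lo, hi = BOUNDARIES[i], BOUNDARIES[i + 1]
        if lo <= score <= hi:
            return cum[i] + (cum[i + 1] - cum[i]) * (score - lo) / (hi - lo)
    raise ValueError("score outside the tabulated range")

def score_at(target, cum):
    """Score whose cumulative frequency equals `target`."""
    for i in range(len(cum) - 1):
        if cum[i] <= target <= cum[i + 1] and cum[i + 1] > cum[i]:
            frac = (target - cum[i]) / (cum[i + 1] - cum[i])
            return BOUNDARIES[i] + frac * (BOUNDARIES[i + 1] - BOUNDARIES[i])
    raise ValueError("target outside the ogive")

def english_to_vernacular(y):
    """Vernacular score with the same cumulative frequency as English y."""
    return score_at(cum_at(y + 0.5, ogive(ENGLISH)), ogive(VERNACULAR))

eq_40 = english_to_vernacular(40)
eq_60 = english_to_vernacular(60)
print(round(80 + eq_40, 1), round(60 + eq_60, 1))
```

Both groups have 500 students, so equal cumulative frequencies correspond to equal relative cumulative frequencies; with groups of different sizes the ogives would need to be expressed as proportions first.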
SCALING OF RATINGS OR RANKINGS IN TERMS OF THE NORMAL CURVE

In many psychological problems, individuals are rated or ranked by judges for their possession of some characteristics not readily measurable in terms of performance. Honesty, responsibility, tactfulness, etc., are examples of such traits. Suppose that there are two judges rating a group of individuals and that the frequency distributions of ratings for the two judges are known. The problem is to assign 'weights' or numerical scores to the ratings, so that the ratings of the judges may be compared or combined.
Let us assume that the distribution of the trait (say $x$) is normal with mean 0 and s.d. 1. Now suppose that the individuals with trait values from $x_1$ to $x_2$ are given a particular rating. The scale value for the rating is taken to be the mean trait value of all these individuals and so is given by the formula:

$$\text{Scale value} = \frac{\displaystyle\int_{x_1}^{x_2} x\,\frac{1}{\sqrt{2\pi}} \exp[-x^2/2]\,dx}{\displaystyle\int_{x_1}^{x_2} \frac{1}{\sqrt{2\pi}} \exp[-x^2/2]\,dx} = \frac{\left[-\dfrac{1}{\sqrt{2\pi}} \exp[-x^2/2]\right]_{x_1}^{x_2}}{\Phi(x_2) - \Phi(x_1)} = \frac{\phi(x_1) - \phi(x_2)}{\Phi(x_2) - \Phi(x_1)}, \qquad \ldots(4)$$

where

$$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp[-x^2/2] \qquad \text{and} \qquad \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp[-u^2/2]\,du.$$

From the observed distribution of the ratings, it is easy to find $\Phi(x_1)$ and $\Phi(x_2)$, and hence $x_1$ and $x_2$ and the ordinates $\phi(x_1)$ and $\phi(x_2)$.
The method is due to Likert and the scale is known as Likert's scale. This is also called the category-scale method. If, on the other hand, the $n$ individuals in the group are ranked by different judges, the scale values corresponding to the ranks can be obtained under the same assumption as before, i.e., under the assumption of normality of the trait concerned.
Suppose there is no tie. Then the percentile rank (PR) of an individual with rank $R$, i.e., the percentage of individuals who are ranked below him, is given by

$$PR = 100 - \frac{100\left(R - \tfrac{1}{2}\right)}{n} = P, \text{ say}, \qquad \ldots(5)$$

since the rank $R$ of the individual really represents the interval from $R - \tfrac{1}{2}$ to $R + \tfrac{1}{2}$. The scale value corresponding to this PR can now be obtained as the value of a normal deviate below which the area is $P/100$. In the case of tied ranks, the PR values can be obtained from the frequency distribution of ranks.
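
Equation (5) and the normal-deviate lookup can be combined in a few lines. A minimal sketch for untied ranks, assuming rank 1 is the best and using `statistics.NormalDist` in place of the normal table; the function name `rank_scale_value` and the group size of 10 are illustrative choices, not taken from the text.

```python
from statistics import NormalDist

def rank_scale_value(rank: int, n: int) -> float:
    """Normal-curve scale value for an untied rank (rank 1 = best of n)."""
    pr = 100 - 100 * (rank - 0.5) / n          # percentile rank, equation (5)
    return NormalDist().inv_cdf(pr / 100)      # deviate with area PR/100 below

# Hypothetical group of 10 ranked individuals.
values = [rank_scale_value(r, 10) for r in range(1, 11)]
for r, v in enumerate(values, start=1):
    print(f"rank {r:2d}: scale value {v:+.3f}")
```

Because each rank is centred on its own interval, the percentile rank never reaches 0 or 100, so the normal deviate is always finite, and the scale values are symmetric about zero.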
Example. A group of 100 workers was rated by a supervisor on a five-point scale (A, B, C, D and E) with respect to efficiency, A being the highest rating and E the lowest. Obtain the scale value for each rating from the following frequency distribution of the ratings:
Rating       A    B    C    D    E
Frequency    5   24   45   23    3
Under the usual assumption of normality for the trait under consideration, we obtain the scale values for the ratings as follows:
Rating                                                          A        B        C        D        E
Area covered by the rating, $\Phi(x_2) - \Phi(x_1)$             0.05     0.24     0.45     0.23     0.03
Area below the rating, $\Phi(x_1)$                              0.95     0.71     0.26     0.03     0
Lower limit of the trait, $x_1$                                 1.645    0.553   -0.643   -1.881   $-\infty$
Upper limit of the trait, $x_2$                                 $\infty$ 1.645    0.553   -0.643   -1.881
Ordinate at the lower limit, $\phi(x_1)$                        0.1031   0.3424   0.3244   0.0680   0
Ordinate at the upper limit, $\phi(x_2)$                        0        0.1031   0.3424   0.3244   0.0680
Scale value, $[\phi(x_1) - \phi(x_2)] / [\Phi(x_2) - \Phi(x_1)]$   2.062    0.997   -0.040   -1.115   -2.267
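
The scale values in this table can be recomputed from the rating frequencies alone. The following is a sketch of equation (4), assuming a standard normal trait and `statistics.NormalDist` for the deviates and ordinates; the function name `category_scale_values` is illustrative, and every category is assumed to have a non-zero frequency.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()   # trait assumed N(0, 1)

def category_scale_values(freqs_low_to_high):
    """Mean trait value of each rating category, by equation (4).

    `freqs_low_to_high` lists the category frequencies from the lowest
    rating to the highest; each frequency is assumed to be positive.
    """
    n = sum(freqs_low_to_high)
    values, below = [], 0
    for f in freqs_low_to_high:
        upper = below + f
        # The ordinate vanishes at -infinity and +infinity.
        phi_1 = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(below / n)) if below > 0 else 0.0
        phi_2 = STD_NORMAL.pdf(STD_NORMAL.inv_cdf(upper / n)) if upper < n else 0.0
        values.append((phi_1 - phi_2) / (f / n))
        below = upper
    return values

# Ratings E, D, C, B, A (lowest to highest) from the example.
scale_e, scale_d, scale_c, scale_b, scale_a = category_scale_values([3, 23, 45, 24, 5])
print(round(scale_a, 3), round(scale_b, 3), round(scale_c, 3),
      round(scale_d, 3), round(scale_e, 3))
```

The results agree with the tabulated scale values up to the rounding of the table's ordinates.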
SCALING OF QUALITATIVE ANSWERS TO A QUESTIONNAIRE
The answers to the items in an attitude or personality test or a test of a similar type will be qualitative, e.g., 'Yes' and 'No', or 'Strongly approve', 'Approve', 'Undecided', 'Disapprove' and 'Strongly disapprove'. It is necessary to allot numerical scores to the answers so as to obtain the total score of an individual measuring his attitude or personality. The method of scaling is exactly similar to Likert's rating scale. The questionnaire is first administered to a group of individuals and the frequency distribution of the answers is obtained. From the observed distribution, Likert's scale values are then obtained for the different answers to the questionnaire.
SCALING OF JUDGMENTS OF A NUMBER OF PRODUCTS: PRODUCT SCALE
It often happens that the ability or the trait in which we are interested cannot be expressed as a test score. This necessitates the construction of product scales. In such scales excellence of performance is determined by comparing an individual's product with various standard products, the values of which have already been determined by a number of competent and expert judges. Handwriting, compositions, etc., are well-known examples.
We shall discuss the method of paired comparisons due to Thurstone. Suppose there are $k$ standard products judged by a group of $N$ judges. All possible pairs of products, $k(k-1)/2$ in all, are presented to each judge, who is to select one member of each pair in preference to the other. The results can be presented in the form of a proportion matrix.
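$$\begin{array}{cc|cccc}
 & & \multicolumn{4}{c}{\text{Product}} \\
 & & 1 & 2 & \cdots & k \\ \hline
 & 1 & p_{11} & p_{21} & \cdots & p_{k1} \\
\text{Product} & 2 & p_{12} & p_{22} & \cdots & p_{k2} \\
 & \vdots & \vdots & \vdots & & \vdots \\
 & k & p_{1k} & p_{2k} & \cdots & p_{kk}
\end{array}$$

Here $p_{ij}$ is the proportion of judges preferring the $i$th product to the $j$th one, and $p_{ji} = 1 - p_{ij}$. By convention $p_{ii} = 1/2$.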
Now, suppose that the distribution of the difference in judgments ($T$) of the $i$th and $j$th products is normal with mean $S_i - S_j$ (the difference of their scale values) and s.d. $\sigma_{i-j}$. Thus

$$p_{ij} = \frac{1}{\sigma_{i-j}\sqrt{2\pi}} \int_{0}^{\infty} \exp\left[-\frac{\{T - (S_i - S_j)\}^2}{2\sigma_{i-j}^2}\right] dT = \frac{1}{\sqrt{2\pi}} \int_{-(S_i - S_j)/\sigma_{i-j}}^{\infty} \exp[-\tau^2/2]\,d\tau,$$

so that

$$S_i - S_j = -x_{ij}\,\sigma_{i-j}, \qquad \ldots(6)$$

where $x_{ij}$ is the value of the normal deviate the area to the right of which is $p_{ij}$. Equation (6) is known as Thurstone's law of comparative judgment. Assuming that the distribution of judgments for each product has the same s.d. $\sigma$ and that judgments for any two products are uncorrelated, $\sigma_{i-j} = \sigma\sqrt{2}$, a constant.
Taking $\sigma_{i-j} = \sigma\sqrt{2}$ as the unit of the scale, we have

$$S_i - S_j = -x_{ij}. \qquad \ldots(6a)$$
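[Figure: normal probability-density curve plotted against the difference in judgements; the area $p_{ij}$ lies to the right of the ordinate at $x_{ij}$.]
Fig. Determination of the difference of scale-values of judgements ($S_i - S_j$) from the proportion $p_{ij}$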
Thus we get the ($S_i - S_j$) matrix:

$$\begin{array}{cc|cccc}
 & & \multicolumn{4}{c}{\text{Product}} \\
 & & 1 & 2 & \cdots & k \\ \hline
 & 1 & S_1 - S_1 & S_2 - S_1 & \cdots & S_k - S_1 \\
\text{Product} & 2 & S_1 - S_2 & S_2 - S_2 & \cdots & S_k - S_2 \\
 & \vdots & \vdots & \vdots & & \vdots \\
 & k & S_1 - S_k & S_2 - S_k & \cdots & S_k - S_k
\end{array}$$

The column means give $S_1, S_2, \ldots, S_k$ as deviations from $\bar{S} = \frac{1}{k}\sum_{i=1}^{k} S_i$. If we take the origin at $\bar{S}$, then the column means provide us with the scale-values for the $k$ products. Alternatively, we could take the origin at the minimum scale value and adjust the scale values accordingly.
Example. 200 individuals were asked about their preferences for 4 different types of music. The proportion matrix is given below. Find the scale values.
                      Music type
                 1       2       3       4
             1   .500    .770    .878    .892
Music type   2   .230    .500    .743    .845
             3   .122    .257    .500    .797
             4   .108    .155    .203    .500
Under the usual assumption of normality of the distribution of the difference in judgments, with means $S_i - S_j$ and s.d. $\sigma_{i-j}$, and with the constant $\sigma_{i-j}$ taken as the unit of the scale, we get the matrix of scale separations $S_i - S_j$ as follows:
                      Music type
                 1        2        3        4
             1    0        .739    1.165    1.237
Music type   2   -.739     0        .653    1.015
             3  -1.165    -.653     0        .831
             4  -1.237   -1.015    -.831     0
Column mean      -.785    -.232     .247     .771
With the origin at $\bar{S}$, the mean of the scale values, the column means give us the corresponding scale values for the four music types. With the origin at $S_1$, on the other hand, we get the following scale values:
Music type     1     2       3       4
Scale value    0     .553    1.032   1.556
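
The paired-comparison computation can be sketched in code. The block below applies equation (6a) entry by entry and averages each column; it assumes `statistics.NormalDist` in place of the normal table, stores only the proportions above the diagonal and fills the rest from $p_{ji} = 1 - p_{ij}$ and $p_{ii} = 1/2$. The names `UPPER`, `preference` and `separation` are illustrative.

```python
from statistics import NormalDist

K = 4  # number of music types
# Entries above the diagonal of the proportion matrix: UPPER[(j, i)] with
# j < i is the proportion of judges preferring type i to type j.
UPPER = {(1, 2): .770, (1, 3): .878, (1, 4): .892,
         (2, 3): .743, (2, 4): .845, (3, 4): .797}

def preference(i, j):
    """p_ij: proportion of judges preferring product i to product j."""
    if i == j:
        return 0.5                       # convention p_ii = 1/2
    if (j, i) in UPPER:
        return UPPER[(j, i)]
    return 1.0 - UPPER[(i, j)]           # p_ij = 1 - p_ji

def separation(i, j):
    """S_i - S_j = -x_ij, with sigma_{i-j} as the unit (equation (6a))."""
    # The deviate with area p_ij to its right is -inv_cdf(p_ij).
    return NormalDist().inv_cdf(preference(i, j))

# Column means of the separation matrix: S_i measured from the mean S-bar.
scale = [sum(separation(i, j) for j in range(1, K + 1)) / K
         for i in range(1, K + 1)]
# The same scale values with the origin moved to S_1.
from_s1 = [s - scale[0] for s in scale]
print([round(s, 3) for s in scale])
print([round(s, 3) for s in from_s1])
```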
NORMS AND REFERENCE GROUPS
By linear transformation or normalization of test scores, we get scale values with which we can combine the performances of an individual in different tests or make comparisons between individuals. But in many situations it is not sufficient to have the scale value; we also have to know on the basis of which group of individuals the scaling was done. We have to know the age, sex, education, occupation and other characteristics of the reference group. A scale value may not be so good with reference to a certain group, but it may be very good for another reference group. Thus, when we want to judge the performance of an individual by his test score, we must know what to compare it with, i.e., the norm we want to use. We must know the mean, standard deviation and percentile values for the group with which we compare an individual score. Thus a score may be good when compared to one norm (for a certain reference group) and not so good when compared to another.
Many tests are used for several purposes and for several groups of individuals. If the result of a test is to be used for comparison with several groups, it is necessary to have norms for each of the groups separately, unless they are known to be the same. To calculate the norms for several groups, the test has to be administered to a random representative sample from the population of each reference group. The size of the sample should not be too small, so that stable norms are obtained. Norm data are, however, not necessary in practical situations where we want to select a number of individuals out of all applicants on the basis of test score, because the top individuals are to be selected no matter what the norms are.
TEST THEORY
The measurements on the psychological characteristics considered in previous sections were collected by various types of methods such as tests, questionnaires or ratings. Whatever may be the method of obtaining measurements, we made the assumption, though not explicitly, that the measurements were meaningful and reproducible. To be more exact, we assumed that the measuring instrument used would give us a stable and consistent measure of the trait if we remeasured the trait under identical conditions. Technically, this aspect of accuracy is known as the reliability of the measuring instrument. The second requirement is that the measuring instrument measures the trait which it is intended to measure. Technically, this is known as the validity of the measuring instrument.
With physical measurements these present no problems at all. For we know that if we use a non-flexible, accurate measuring tape in the correct way, we shall get the exact length of an object, and this can be reproduced if remeasured under similar conditions. So physical measurements are, as a rule, reliable and valid. But we are not so sure about psychological measurements. We have to verify in each case that we are getting reliable and valid measurements, and only then can we use them with confidence.
Before we actually discuss reliability and validity, we shall consider some simple results in test theory under a very simple model.
LINEAR MODEL OF TEST THEORY
We are interested in getting the true measure of an individual's performance on a test. By applying a measuring instrument what we get is the individual's raw score (obtained score) on the test. We can consider various types of relationship between the true score of the $i$th individual ($t_i$) and his raw score ($x_i$). But the relationship that is usually adopted is the simplest one, a linear relationship. We assume that

$$x_i = t_i + e_i, \quad \text{for } i = 1, 2, \ldots, n, \qquad \ldots(7)$$

where $e_i = x_i - t_i$ is the error of measurement for the $i$th individual.
The raw score ($x$) does not equal the unknown true score ($t$). The difference ($x - t$), which may be due to various factors, is the error score ($e$).
In test theory we always consider only random errors ($e$). Constant or systematic errors are assumed to be absent in test theory. Since we consider only random errors, it is reasonable to make the following assumptions for the $e$'s:

$$\left.\begin{aligned} \mu_e &= 0 \\ \rho_{te} &= 0 \\ \rho_{e_g e_h} &= 0 \end{aligned}\right\} \qquad \ldots(8)$$

In words, the mean of the error scores is zero, the correlation between true score and error score is zero, and the correlation between error scores from different testing occasions (or for two parallel tests, g and h, to be defined shortly) is zero. We note that under this model the estimates of $\mu_e$, $\rho_{te}$ and $\rho_{e_g e_h}$ will approach zero if the number of individuals ($n$) approaches infinity. In practice, however, the estimates are assumed to satisfy these relations for the given sample.
Since only random errors are considered, for a large number of cases ($n$ large), the positive and negative errors of all magnitudes (small and large) will cancel each other with the result that the mean will be zero. Similarly, since only random errors are considered, there is no reason to expect any correlation between true scores and error scores for a large number of individuals. Large or small true scores will be expected to occur equally often with large or small error scores. This is reasonable for both positive and negative scores. Thus we assume $\rho_{te} = 0$. A similar argument will show that $\rho_{e_g e_h} = 0$ is also a reasonable assumption.
DEFINITION OF PARALLEL TEST
Two tests are said to be parallel when it makes no difference which one is used. If $g$ and $h$ are two tests and if for the $i$th individual $t_{ig} \neq t_{ih}$, then we cannot say that it makes no difference whether we use test g or h. So, in order that g and h may be parallel tests, it is reasonable to assume that

$$t_{ig} = t_{ih}, \quad \text{for } i = 1, 2, \ldots, n; \qquad \ldots(9)$$

i.e., the true score of any individual should be the same on the two tests.
Next, consistent with the definition of error score (8), we assume about the error scores on two parallel tests that

$$\sigma_{e_g} = \sigma_{e_h}; \qquad \ldots(10)$$

i.e., the standard deviations of errors on the two tests should be the same. Thus (9) and (10) define parallel tests in terms of unknown quantities. These can be expressed in terms of the distributions of the raw scores, using the relations (7), (8) and (9), as follows:
From (7), since $\mu_e = 0$, we have $\mu_t = \mu_x$ for any test.
From (9), we have $\mu_{t_g} = \mu_{t_h}$, $\sigma_{t_g} = \sigma_{t_h}$ and $\rho_{t_g t_h} = 1$.
Also, from (7) and (8), we have $\sigma_x^2 = \sigma_t^2 + \sigma_e^2$ for any test.
Then, together with (10), we have

$$\mu_{x_g} = \mu_{x_h} \quad \text{and} \quad \sigma_{x_g} = \sigma_{x_h} \qquad \ldots(11)$$

for two parallel tests g and h. Thus the means of raw scores on two parallel tests are equal, and so are the standard deviations.
If we have more than two parallel tests (at least three, say g, h and k), we have another condition to check, besides (11), before we can conclude that the tests g, h and k are parallel. This condition is

$$\rho_{x_g x_h} = \rho_{x_g x_k} = \rho_{x_h x_k}, \qquad \ldots(12)$$

the condition of equality of all inter-correlations between raw scores of the parallel tests.
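Now we establish (12) by first obtaining an expression for $\rho_{x_g x_h}$ in terms of $\sigma_t^2$ and $\sigma_x^2$:

$$\begin{aligned}
\rho_{x_g x_h} &= \frac{\operatorname{cov}(x_g, x_h)}{\sigma_{x_g} \times \sigma_{x_h}} \\
&= \frac{\operatorname{cov}(t_g, t_h) + \operatorname{cov}(t_g, e_h) + \operatorname{cov}(t_h, e_g) + \operatorname{cov}(e_g, e_h)}{\sigma_{x_g} \times \sigma_{x_h}} \\
&= \frac{\operatorname{cov}(t_g, t_h)}{\sigma_{x_g}^2} \quad \text{(since g, h are parallel tests, the remaining covariance terms are all zero and } \sigma_{x_g} = \sigma_{x_h}\text{)} \\
&= \frac{\rho_{t_g t_h}\,\sigma_{t_g}\sigma_{t_h}}{\sigma_{x_g}^2} \\
&= \sigma_{t_g}^2 / \sigma_{x_g}^2 \quad \text{(since } \rho_{t_g t_h} = 1 \text{ and } \sigma_{t_g} = \sigma_{t_h}\text{, g and h being parallel).}
\end{aligned}$$

Thus for two parallel tests g and h,

$$\rho_{x_g x_h} = \sigma_{t_g}^2 / \sigma_{x_g}^2 = \sigma_{t_h}^2 / \sigma_{x_h}^2 \quad (\text{since } \sigma_{t_g} = \sigma_{t_h},\ \sigma_{x_g} = \sigma_{x_h}). \qquad \ldots(13)$$

Equation (13) easily establishes equation (12) for a number of parallel tests.
Thus for three or more parallel tests the means of raw scores are equal; so are the variances and the inter-correlations. In addition to satisfying these criteria, parallel tests should also be similar with respect to the content and nature of items, etc., which may be verified by expert judgment only.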
DEFINITION OF TRUE SCORE
Equations (8) define the error score. Then the true score ($t$) can be regarded as the difference ($x - e$) between the raw score and the error score. Thus, $t_i = x_i - e_i$.
Alternatively, we may define the true score of an individual as the limit of the average of the raw scores of the individual on a number of parallel tests as that number $k$ approaches infinity, i.e.,

$$t_i = \lim_{k \to \infty} \left[\sum_{g=1}^{k} x_{ig} / k\right]. \qquad \ldots(14)$$

With this definition of $t$, the error score is defined as the difference $x - t$; i.e., $e = x - t$.
ERROR VARIANCE (STANDARD ERROR OF MEASUREMENT)
From equations (7) and (8), we have

$$\sigma_x^2 = \sigma_t^2 + \sigma_e^2,$$

and from equation (13), we have, if g and h are parallel tests,

$$\sigma_t^2 = \rho_{x_g x_h}\,\sigma_x^2.$$

Thus, combining the above two relations, we get

$$\sigma_x^2 = \sigma_x^2\,\rho_{x_g x_h} + \sigma_e^2$$

or

$$\sigma_e^2 = \sigma_x^2\,(1 - \rho_{x_g x_h})$$

or

$$\sigma_e = \sigma_x \sqrt{1 - \rho_{x_g x_h}}. \qquad \ldots(15)$$

Equation (15) gives the standard deviation of the error scores, which is technically known as the standard error of measurement.
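
As a small numerical illustration of equation (15), the sketch below splits a raw-score variance into its true-score and error parts. The figures (raw-score s.d. 10, correlation 0.84 between parallel tests) are hypothetical and the function name is illustrative.

```python
from math import sqrt

def standard_error_of_measurement(sd_raw: float, rho_parallel: float) -> float:
    """sigma_e = sigma_x * sqrt(1 - rho), equation (15)."""
    if not 0.0 <= rho_parallel <= 1.0:
        raise ValueError("the parallel-test correlation must lie in [0, 1]")
    return sd_raw * sqrt(1.0 - rho_parallel)

# Hypothetical test: raw-score s.d. 10, parallel-test correlation 0.84.
sd_x, rho = 10.0, 0.84
sem = standard_error_of_measurement(sd_x, rho)
true_variance = rho * sd_x ** 2          # sigma_t^2 = rho * sigma_x^2
error_variance = sem ** 2                # sigma_e^2
print(round(sem, 2), round(true_variance, 2), round(error_variance, 2))
```

For these hypothetical figures the raw-score variance of 100 divides into a true-score variance of 84 and an error variance of 16, so the standard error of measurement is 4.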
DEFINITION OF RELIABILITY
We define reliability as the reproducibility of the measurements when remeasured under identical conditions. Spearman first introduced the term 'reliability'. The reliability of a test (a measuring instrument) is given by the correlation between the raw scores of the given test and a parallel test. Thus, if g be the given test and h any other test parallel to g, then the reliability of g is measured by $\rho_{x_g x_h}$ and will be denoted as $\rho_{gg}$.
From equation (13), we know that

$$\rho_{gg} = \sigma_{t_g}^2 / \sigma_{x_g}^2 = 1 - \sigma_{e_g}^2 / \sigma_{x_g}^2 \qquad \ldots(16)$$

by virtue of the relation $\sigma_t^2 = \sigma_x^2 - \sigma_e^2$.
Reliability can thus be defined as the ratio of the true score variance to the raw score variance, or as the proportion of the raw score variance that is true score variance. Reliability ranges from zero to one. $\rho_{gg} = 1$ when $\sigma_e = 0$. But $\sigma_e = 0$ if and only if all $e_i = 0$, since $\mu_e = 0$. Thus, the test is perfectly reliable ($\rho_{gg} = 1$) if $x_i = t_i$ for each $i$, and then the raw scores are the true scores. $\rho_{gg} = 0$ if $\sigma_t = 0$ (or, equivalently, if $\sigma_e = \sigma_x$), i.e., when $x_i = t + e_i$ for each $i$, and then the test is unreliable (here $t$ denotes the common true score for all $i$).
For any test g, therefore,

$$0 \le \rho_{gg} \le 1.$$

It may be noted, however, that when the reliability is estimated from a sample of individuals, one may obtain a negative coefficient.
EFFECT OF TEST LENGTH ON THE RELIABILITY OF A TEST
By the length of a test we mean the number of items in the test. Let us augment the length of the test by adding to it ($k - 1$) parallel tests of the same length. So the composite test is now made of $k$ parallel tests of the same length and the length of the composite test is $k$ times the length of the original test. The effects of this increase in length on the true score variance and raw score variance are the following:
Denoting the $k$ parallel tests by $g_1, g_2, \ldots, g_k$ and the composite test by $G$, we have

$$\sigma_{t_G}^2 = \sigma^2(t_{g_1} + t_{g_2} + \cdots + t_{g_k}) = \sum_i \sum_j \rho_{t_{g_i} t_{g_j}}\,\sigma_{t_{g_i}} \sigma_{t_{g_j}} \quad (\text{summation over } i, j = 1, 2, \ldots, k)$$

$$= k^2 \sigma_{t_{g_1}}^2 \qquad \ldots(17)$$

(since the component tests are parallel, $\rho_{t_{g_i} t_{g_j}} = 1$ and $\sigma_{t_{g_i}} = \sigma_{t_{g_j}}$ for all $i, j$), and

$$\sigma_{x_G}^2 = \sigma^2(x_{g_1} + x_{g_2} + \cdots + x_{g_k}) = \sum_{i=1}^{k} \sigma_{x_{g_i}}^2 + \sum_{i \neq j} \sum \rho_{x_{g_i} x_{g_j}}\,\sigma_{x_{g_i}} \sigma_{x_{g_j}}$$

$$= k\,\sigma_{x_{g_1}}^2 + k(k - 1)\,\rho_{gg}\,\sigma_{x_{g_1}}^2, \qquad \ldots(18)$$

since $\rho_{x_{g_i} x_{g_j}} = \rho_{gg}$ (i.e., the reliability) and $\sigma_{x_{g_i}} = \sigma_{x_{g_j}}$ for parallel tests $g_i, g_j$.
Using equation (16), we may write down the reliability of a test whose length is increased $k$ times (by adding $k - 1$ parallel tests) as

$$\rho_{GG} = \sigma_{t_G}^2 / \sigma_{x_G}^2,$$

which can be expressed in terms of $\rho_{gg}$, by using equations (17) and (18), as

$$\rho_{GG} = \frac{k^2 \sigma_{t_{g_1}}^2}{k\,\sigma_{x_{g_1}}^2 \left[1 + (k - 1)\rho_{gg}\right]} = \frac{k\rho_{gg}}{1 + (k - 1)\rho_{gg}}, \qquad \ldots(19)$$

where $\rho_{gg}$ is the reliability of the original test and $\rho_{GG}$ is the reliability of the lengthened test G, whose length is equal to $k$ times the length of $g_1$.
Formula (19) is known as the general Spearman-Brown formula. In the
usual case where $k = 2$, the Spearman-Brown formula for doubled test length is
$$\rho_{GG} = \frac{2\rho_{gg}}{1 + \rho_{gg}}$$ …(20)
The derivation of formulae (19) and (20) involves the assumption that
the additional test parts used in lengthening the original test are parallel to
those in the original test.
The formula for determining $k$ is obtained by solving equation (19) for $k$:
$$k = \frac{\rho_{GG}(1 - \rho_{gg})}{\rho_{gg}(1 - \rho_{GG})},$$ …(21)
where $\rho_{gg}$ is the reliability of the original test and $\rho_{GG}$ is the
desired reliability of the test after the original test is lengthened $k$ times.
Example. What would be the reliability coefficient if the original
test, of reliability 0.50, were doubled in length?
We have in this case $\rho_{gg} = 0.50$ and $k = 2$. Then by equation (20) we
get, as the reliability of the lengthened test,
$$\rho_{GG} = \frac{2 \times 0.50}{1 + 0.50} = 0.67.$$
Example. By what amount should the length of a test of reliability 0.67
be increased so as to get a reliability of 0.95 for the lengthened test?
Here $\rho_{gg} = 0.67$ and $\rho_{GG} = 0.95$. Then by equation (21), we have
$$k = \frac{0.95(1 - 0.67)}{0.67(1 - 0.95)} = \frac{0.95 \times 0.33}{0.67 \times 0.05} = \frac{0.3135}{0.0335} = 9.36,$$
i.e., the test should be made approximately 9 times as long.
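The two worked examples can be checked numerically. The following is only a minimal sketch of formulae (19) and (21); the function names are hypothetical and chosen for illustration, and the inputs are the values used in the computations above.

```python
def spearman_brown(rho_gg: float, k: float) -> float:
    """Reliability of a test lengthened k times, as in formula (19)."""
    return k * rho_gg / (1 + (k - 1) * rho_gg)


def length_factor(rho_gg: float, rho_GG: float) -> float:
    """Lengthening factor k needed to reach reliability rho_GG, as in formula (21)."""
    return rho_GG * (1 - rho_gg) / (rho_gg * (1 - rho_GG))


doubled = spearman_brown(0.50, 2)      # first example: test doubled in length
needed_k = length_factor(0.67, 0.95)   # second example: target reliability 0.95
print(round(doubled, 2), round(needed_k, 2))  # prints 0.67 9.36
```

Feeding `needed_k` back into `spearman_brown(0.67, needed_k)` returns 0.95, which confirms that (21) is the inverse of (19).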
PRACTICAL METHOD OF ESTIMATING TEST RELIABILITY
Reliability, as defined above and denoted by 𝜌𝑔𝑔 , is based on population
data (an infinite number of individuals being tested). In practice, we have only
a sample of finite size 𝑛, and the corresponding sample correlation estimates
the reliability. There are mainly four methods available for estimating test
reliability. These are:
(a) the parallel-test method, (b) the test-retest method, (c) the split-half
method and (d) the Kuder-Richardson method.
Parallel-test method
Reliability was defined as the correlation between raw scores on two
parallel tests. In this method, two tests are constructed satisfying as far as
possible the conditions for parallelism. Then the two tests are administered to
the same group with a suitable time lag and the reliability (𝜌𝑔𝑔 ) is estimated
by the correlation (𝑟𝑔𝑔 ) between the raw scores on the parallel tests obtained
from the sample.
For many situations, this is the best method of estimating test reliability.
However, the ability measured should not change in the time interval between
the administrations of the tests. For many scholastic achievement and mental
ability tests, this condition is fulfilled. But there are cases where the
ability tested will change, 𝑒. 𝑔., in performance tests like typewriting tests,
athletic skills tests, etc., if the individuals continue practising during the
interval between the two administrations.
The parallel-test reliability may also be obtained by administering
both the tests at the same session. In this case also the scores on the second
test may be influenced either by familiarity with the material in the first test
or by fatigue.
Generally speaking, parallel-test reliability will give a satisfactory
result. But the difficulty is to construct two parallel tests. So when only one
test is available, we are to use one of the other methods.
Test-retest method
This method consists in administering the same test twice after a
suitable time interval to eliminate familiarity with the material, test fatigue,
etc., and then finding the correlation between the test scores and retest
scores. If, however, the individuals duplicate their first performance, then the
reliability will be over-estimated by this method.
If the test is repeated immediately, the memory effect, practice and
confidence will increase the scores on retesting. If sufficient time elapses
before the second administration, then these effects will be absent and the
test-retest correlation will give an estimate of the stability of the test scores.
As in the parallel-test method, here also, the experimenter will have to
adjust the time interval and control the activity of the individuals within the
time interval so as to minimize the effects due to memory, fatigue, practice, etc.
The difficulty with both these methods is that sometimes it is difficult to
get the individuals again after an interval of time. In such a case, we cannot
apply either the same test twice or two parallel tests. For such cases, we have
the following methods.
Split-half method
Here one test is applied once and then the test is divided into two
equivalent halves, and the correlation between the scores on the half-tests
estimates the reliability of each half-test. Then by the Spearman-Brown formula
(20), we may estimate the reliability of the original (full) test.
The test may be split into two parts in a number of ways. The
commonest way is to split the test on the basis of odd-numbered and even-
numbered items.
In many performance tests or personality tests, it is difficult to construct
parallel tests or to retest with the same test. So the split-half method is
regarded as the best method in such cases. The objection that is often raised is
that there is no unique way of splitting the test and hence no unique split-half
correlation. In most power tests (where one does not emphasize the speed or
quickness with which the work can be performed), the items are arranged in
order of difficulty, and the odd-even split provides a unique estimate of
reliability.
Rulon presented the following formula for estimating reliability from
two subtest scores (of the same test):
$$r_{gg} = 1 - \frac{s_d^2}{s_x^2}$$ …(22)
where $s_x^2$ is the variance of raw scores and $s_d^2$ is the variance of the
difference of raw scores on the two halves of the test.
Similar results may be obtained by using the formula due to Guttman,
which is simpler to apply:
$$r_{gg} = 2\left[1 - \frac{s_1^2 + s_2^2}{s_x^2}\right],$$ …(23)
where $s_1^2$ and $s_2^2$ are the variances of raw scores on the two halves.
Equations (20), (22) and (23) will give the same reliability coefficient
when $s_1^2 = s_2^2$, i.e., when the two halves have equal raw score variances.
If $s_1^2 \ne s_2^2$, then the split-half reliability given by equation (20)
will be the highest.
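The relation between the three split-half estimates can be illustrated with a short computation. This is a minimal sketch, assuming population-style variances (division by the number of individuals) throughout; the half-test scores are made-up illustrative data and the function names are hypothetical.

```python
from statistics import mean, pvariance


def pearson_r(a, b):
    """Product-moment correlation between two equally long score lists."""
    ma, mb = mean(a), mean(b)
    cov = mean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    return cov / (pvariance(a) ** 0.5 * pvariance(b) ** 0.5)


def split_half_estimates(odd, even):
    """Return the (Spearman-Brown, Rulon, Guttman) split-half estimates."""
    total = [a + b for a, b in zip(odd, even)]   # raw score on the full test
    diff = [a - b for a, b in zip(odd, even)]    # difference of the half scores
    s_x2 = pvariance(total)
    r_half = pearson_r(odd, even)
    sb = 2 * r_half / (1 + r_half)                                   # formula (20)
    rulon = 1 - pvariance(diff) / s_x2                               # formula (22)
    guttman = 2 * (1 - (pvariance(odd) + pvariance(even)) / s_x2)    # formula (23)
    return sb, rulon, guttman


# Hypothetical scores of six individuals on the odd and even halves.
odd_scores = [12, 15, 9, 18, 14, 11]
even_scores = [10, 16, 8, 15, 15, 9]
sb, rulon, guttman = split_half_estimates(odd_scores, even_scores)
print(round(sb, 4), round(rulon, 4), round(guttman, 4))
```

With these data the two halves have unequal variances, so the Rulon and Guttman values coincide with each other while the Spearman-Brown value is slightly higher, in line with the remark above.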
Kuder-Richardson method
We shall obtain the Kuder-Richardson formulae for estimating test
reliability by making the same assumptions as were made originally by Kuder
and Richardson. Let us consider a test of length $k$ which is made up of $k$
parallel items. Then the raw score variance is given by
$$\sigma_x^2 = \sigma^2(x_1 + x_2 + \cdots + x_k) = \sum_{g=1}^{k} \sigma^2_{x_g} + \sum_{g \ne h}\sum \rho_{x_g x_h}\, \sigma_{x_g} \sigma_{x_h}.$$
Since the items are all parallel, $\rho_{x_g x_h}$ will be equal to $\rho_{gg}$
(the reliability of item $g$) for all $g$ and $h$, and $\sigma_{x_g}$ will be the
same for all $g$. Thus,
$$\sigma_x^2 = k\sigma^2_{x_g} + k(k-1)\rho_{gg}\, \sigma^2_{x_g},$$
so that the item reliability ($\rho_{gg}$) can be expressed as follows:
$$\rho_{gg} = \frac{\sigma_x^2 - \sum_{g=1}^{k} \sigma^2_{x_g}}{(k-1)\sum_{g=1}^{k} \sigma^2_{x_g}}, \quad \text{since } \sum_{g=1}^{k} \sigma^2_{x_g} = k\sigma^2_{x_g}.$$
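Next, to obtain the reliability of the test of $k$ parallel items from $\rho_{gg}$,
we apply the general Spearman-Brown formula (19):
$$\rho_{GG} = \frac{k\rho_{gg}}{1 + (k-1)\rho_{gg}}$$
$$= k \times \frac{\sigma_x^2 - \sum_{g=1}^{k} \sigma^2_{x_g}}{(k-1)\sum_{g=1}^{k} \sigma^2_{x_g}} \times \frac{1}{1 + (k-1)\left[\left(\sigma_x^2 - \sum_{g=1}^{k} \sigma^2_{x_g}\right) \Big/ (k-1)\sum_{g=1}^{k} \sigma^2_{x_g}\right]}$$
$$= \left[\frac{k}{k-1}\right] \times \left[\frac{\sigma_x^2 - \sum_{g=1}^{k} \sigma^2_{x_g}}{\sigma_x^2}\right].$$ …(24)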
This is the Kuder-Richardson “formula 20” for obtaining the reliability
of a test of $k$ parallel items in terms of $k$, $\sigma_x^2$ and $\sigma^2_{x_g}$.
In practice, this is estimated by
$$r_{GG} = \left[\frac{k}{k-1}\right] \left[\frac{s_x^2 - \sum_{g=1}^{k} s^2_{x_g}}{s_x^2}\right]$$ …(24a)
where $s_x^2$ is the sample variance of raw total scores and $s^2_{x_g}$ is the
sample variance of scores on item $g$.
If the scoring of items be 1 for a correct response and 0 for a wrong
response, then $s^2_{x_g} = p_g(1 - p_g)$, where $p_g$ is the sample proportion of
correct responses for item $g$. Then formula (24a) simplifies to
$$r_{GG} = \left[\frac{k}{k-1}\right] \left[\frac{s_x^2 - \sum_{g=1}^{k} p_g(1 - p_g)}{s_x^2}\right]$$ …(25)
If in formula (24) we assume that the $k$ parallel items are of equal
difficulty, the scoring being 1 for a correct and 0 for a wrong response, with
$\pi$ as the common difficulty value for all items, then
$$\sigma^2_{x_g} = \pi(1 - \pi) = \pi - \pi^2.$$
Now, the mean of obtained scores on the test is
$$\mu_x = k\pi.$$
Thus,
$$\sigma^2_{x_g} = \frac{\mu_x}{k} - \frac{\mu_x^2}{k^2}.$$
Then from formula (24), we have
$$\rho_{GG} = \left[\frac{k}{k-1}\right] \left[1 - \frac{k\sigma^2_{x_g}}{\sigma_x^2}\right] = \left[\frac{k}{k-1}\right] \left[1 - \frac{\mu_x - \mu_x^2/k}{\sigma_x^2}\right]$$ …(26)
This is the Kuder-Richardson “formula 21” for obtaining the reliability
of a test of $k$ parallel items of equal difficulty in terms of $k$, $\sigma_x^2$
and $\mu_x$. In practice this is estimated by
$$r_{GG} = \left[\frac{k}{k-1}\right] \left[1 - \frac{\bar{x} - \bar{x}^2/k}{s_x^2}\right]$$ …(26a)
where $\bar{x}$ and $s_x^2$ are the sample mean and variance of raw total scores.
We have derived the Kuder-Richardson formulae under the original
assumptions. However, it is also possible to derive them under less restrictive
conditions.
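As an illustration of formulae (25) and (26a), the sketch below computes both estimates from a small matrix of 0/1 item scores. It assumes variances with divisor $n$ (so that the item variance is exactly $p_g(1 - p_g)$); the response data and function names are hypothetical.

```python
from statistics import mean, pvariance


def kr20(scores):
    """Kuder-Richardson formula 20, as in (25); scores[i][g] is 0 or 1."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]          # raw total score per person
    s_x2 = pvariance(totals)
    p = [mean([row[g] for row in scores]) for g in range(k)]
    item_var_sum = sum(pg * (1 - pg) for pg in p)  # sum of p_g (1 - p_g)
    return (k / (k - 1)) * (s_x2 - item_var_sum) / s_x2


def kr21(scores):
    """Kuder-Richardson formula 21, as in (26a); uses only mean and variance."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    x_bar, s_x2 = mean(totals), pvariance(totals)
    return (k / (k - 1)) * (1 - (x_bar - x_bar ** 2 / k) / s_x2)


# Hypothetical responses of five persons to four items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
r20 = kr20(responses)
r21 = kr21(responses)
print(round(r20, 3), round(r21, 3))
```

Here the items differ in difficulty, so formula 21 gives a lower value than formula 20; the two agree only when all $p_g$ are equal.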
VALIDITY
In the previous section, we considered one essential property of a
measuring instrument: the reliability. Now we shall consider the second
essential property: the validity. A psychological test (a measuring instrument)
should not only be reliable, but it should also be valid. By this we mean that
the test should measure what it is supposed or intended to measure. If we
want to measure a trait A for a group of individuals with a test, we must be
sure, before we can use the test confidently for that purpose, that it actually
measures the trait A and also measures it reliably. The term ‘validity’ is a
relative term: a test is valid for a particular trait, for a particular group
or for a particular situation. We may use the same test for measuring different
traits, and then we must obtain the validity separately for each case.
As with the reliability of physical measurements, the validity of such
measurements presents no great problem. But the situation is different with
psychological measurements.
To estimate the validity of a test we must know which particular trait
we want to measure. We make use of some known measure of the trait, called
the criterion variable. The validity of the test is then estimated by computing
a coefficient (the coefficient of validity) which measures the relationship
between the scores obtained on the test and the values of the criterion
variable. This requires getting measures on the criterion variable which are to
be compared with the scores on the test. Often it is difficult to get reliable
measures on the true criterion. What we get are only approximate measures on
the criterion variable. Depending upon the situation, the criterion scores may
be of any of the following kinds: ratings by judges (experts who know the
group) on the trait measured; scores on another valid test of the trait (we may
validate a newly constructed test for trait A by selecting as the criterion
variable the score on a well-established test for trait A); measures of later
success (for a test for recruiting persons in a vocation); etc. We discuss
below the different concepts of validity:
Predictive validity
This type of validity arises when we use a test for selecting
applicants for a particular course or job and the criterion variable is the
degree of success at a later period, 𝑖. 𝑒., after the recruits have completed
the course or have been on the job for a sufficient period. The criterion
variable is the performance at that later period: grades or ratings on
completion of the course or after a certain period of employment. A test has a
high predictive validity if it can forecast efficiently later performance on a
particular measurable aspect of life. This is of importance in the selection or
recruitment of individuals for different courses of study or training
programmes or jobs.
Concurrent validity
Concurrent validity is obtained for tests for which the criterion variable
is also available at the same time as the test results, so that we need not
wait as in the case of predictive validity. Such a test is constructed, even
though the criterion measure could be obtained without waiting, because the
test is easier to use and sometimes saves time and expenditure, while giving
the same results as the criterion variable. Concurrent validity is used for
diagnostic tests (𝑒. 𝑔. in clinical diagnosis). Both types of validity
(predictive and concurrent) are obtained by computing the correlation between
the test scores and criterion scores, and the validity is the correlation
coefficient.
Content validity
Sometimes tests are constructed to study the knowledge of the
individuals on certain specific areas of study, say verbal ability, geometrical
drawing ability, etc. There are large numbers of items which measure these
areas and, in a test, we have only a sample of these items. In content validity of
a test, we try to ascertain how far the test covers the field of study under
investigation or, in other words, how good the items of the test are as a sample
from the totality of all items for that test. It is, however, not possible to
express content validity as a validity coefficient, as is possible with the
previous two validities.
Construct validity
This is comparatively a new concept in validity theory. This concept is
found useful when either there is no external criterion variable or it is difficult
to obtain measurements on the criterion variables. This validity cannot be
expressed in a single measure as the correlation between test scores and
criterion scores. Validity in this case is demonstrated by showing that the
predictions expected on the basis of theory may be confirmed by the test.
Some of the common ways of establishing construct validity are the following:
(1) Correlating different items or parts of the test. These correlations should
be high if the test is measuring a unitary variable.
(2) Correlating different tests which measure the same variable.
CORRECTIONS FOR ATTENUATION:
A validity coefficient expresses the extent of agreement of the test score
with a measurement of the criterion variable. Both these measurements are,
however, liable to errors, which are due to unreliability of the measuring
instruments. It is possible to develop a correction for these errors, known as
the correction for attenuation.
The corrected value of the validity coefficient will estimate the
relationship of the test score and the criterion score, had both the
measurements been completely reliable.
Let $T_i$ and $C_i$ be the observed test score and criterion score for the $i$th
individual, $t_i$ and $c_i$ the corresponding true scores, and $e_i$ and $e_i'$ the
errors. Thus,
$$T_i = t_i + e_i \quad \text{and} \quad C_i = c_i + e_i',$$
all expressed as deviations from means.
Thus $r_{tc}$, the true validity coefficient, is
$$r_{tc} = \frac{\sum (T_i - e_i)(C_i - e_i')}{N s_t s_c}$$
($N$ being the total number of individuals), so that
$$r_{tc}\, s_t s_c = \frac{\sum T_i C_i}{N} - \frac{\sum T_i e_i'}{N} - \frac{\sum C_i e_i}{N} + \frac{\sum e_i e_i'}{N}.$$
Assuming independence of true scores and error scores and of error
scores themselves, the last three terms vanish and
$$r_{tc} = \frac{r_{TC}\, s_T s_C}{s_t s_c}.$$
From (16), we know
$$r_{TT} = \frac{s_t^2}{s_T^2} \quad \text{and} \quad r_{CC} = \frac{s_c^2}{s_C^2},$$
$r_{TT}$ and $r_{CC}$ being estimates of the reliability of test scores and
criterion scores. Thus,
$$r_{tc} = \frac{r_{TC}}{\sqrt{r_{TT}\, r_{CC}}}$$ …(27)
But this coefficient is of little practical value, since a pair of perfectly
reliable test and criterion is rarely realized. Very often we shall be using
test scores which are contaminated with errors for the purpose of prediction.
Then it may be of interest to know what the validity coefficient would be had a
perfectly reliable criterion been available. In the same way, we can find the
correlation between the true criterion score and the observed test score, as
$$r_{Tc} = \frac{r_{TC}}{\sqrt{r_{CC}}}.$$ …(28)
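The two corrections can be compared on a numerical case. The following is a minimal sketch of formulae (27) and (28); the function names and the input values (an observed validity of 0.42 with reliabilities 0.81 and 0.64) are hypothetical.

```python
from math import sqrt


def correct_both(r_TC: float, r_TT: float, r_CC: float) -> float:
    """Validity corrected for unreliability of test and criterion, as in (27)."""
    return r_TC / sqrt(r_TT * r_CC)


def correct_criterion_only(r_TC: float, r_CC: float) -> float:
    """Validity corrected for unreliability of the criterion alone, as in (28)."""
    return r_TC / sqrt(r_CC)


# Hypothetical observed validity and reliability estimates.
full = correct_both(0.42, 0.81, 0.64)
partial = correct_criterion_only(0.42, 0.64)
print(round(full, 3), round(partial, 3))
```

Both corrected values exceed the observed 0.42, and the correction for both sources of error is the larger of the two, since dividing by a reliability below 1 can only raise the coefficient.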
EFFECT OF TEST LENGTH ON TEST PARAMETERS
We have seen in a previous section the effect of test length on the true
score variance (equation 17), on the observed score variance (equation 18) and
on the reliability of a test (equation 19). Using notations already introduced,
it is easy to see the effect of test length on the true score mean and the
observed score mean:
$$\mu_{t_G} = k\mu_{t_g}$$ …(29)
and
$$\mu_{x_G} = k\mu_{x_g}$$ …(30)
To find the effect of test length on the validity of a test, we first consider
the case where the original test is lengthened by adding to it $(k - 1)$ parallel
tests of the same length and the original criterion variable is lengthened by
adding to it $(l - 1)$ parallel criterion variables of the same length, such that
each pair of component test and criterion variable gives the same validity
coefficient. Let us denote the total test score by $x_G$:
$$x_G = x_{g_1} + x_{g_2} + \cdots + x_{g_k}$$
and the total criterion score by $y_H$:
$$y_H = y_{h_1} + y_{h_2} + \cdots + y_{h_l}.$$
Now we obtain the correlation coefficient of augmented test scores with
the augmented criterion variable scores:
$$\rho_{x_G y_H} = \frac{\mathrm{cov}(x_G, y_H)}{\sigma_{x_G} \times \sigma_{y_H}}$$
$$= \frac{\mathrm{cov}(x_{g_1} + x_{g_2} + \cdots + x_{g_k},\; y_{h_1} + y_{h_2} + \cdots + y_{h_l})}{\sqrt{\mathrm{var}(x_{g_1} + x_{g_2} + \cdots + x_{g_k}) \times \mathrm{var}(y_{h_1} + y_{h_2} + \cdots + y_{h_l})}}$$
$$= \frac{\sum_{g=1}^{k} \sum_{h=1}^{l} \rho_{x_g y_h}\, \sigma_{x_g} \sigma_{y_h}}{\{k\sigma^2_{x_g} + k(k-1)\rho_{gg}\, \sigma^2_{x_g}\}^{1/2}\, \{l\sigma^2_{y_h} + l(l-1)\rho_{hh}\, \sigma^2_{y_h}\}^{1/2}}$$
$$= \frac{kl\, \rho_{x_g y_h}\, \sigma_{x_g} \sigma_{y_h}}{\{k + k(k-1)\rho_{gg}\}^{1/2}\, \{l + l(l-1)\rho_{hh}\}^{1/2}\, \sigma_{x_g} \sigma_{y_h}}$$
$$= \frac{kl\, \rho_{x_g y_h}}{\{k + k(k-1)\rho_{gg}\}^{1/2}\, \{l + l(l-1)\rho_{hh}\}^{1/2}}$$ …(31)
where $\rho_{x_g y_h}$ is the validity of the original test with the original
criterion variable,
$\rho_{x_G y_H}$ is the validity of the lengthened test (lengthened $k$ times)
with the lengthened criterion variable (lengthened $l$ times),
$\rho_{gg}$ is the reliability of the original test and
$\rho_{hh}$ is the reliability of the original criterion variable.
If the criterion variable is not lengthened, then the effect on the validity
of increasing only the test length is obtained from (31) by putting $l = 1$:
$$\rho_{x_G y_h} = \frac{k\rho_{x_g y_h}}{\{k + k(k-1)\rho_{gg}\}^{1/2}}$$ …(32)
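To see how quickly validity responds to lengthening, formulae (31) and (32) can be evaluated directly. This is only a sketch; the function name, the default arguments and the numerical inputs are assumptions made for illustration.

```python
from math import sqrt


def lengthened_validity(rho_xy, rho_gg, k, rho_hh=1.0, l=1):
    """Validity after lengthening the test k times and the criterion l times (31).

    With l = 1 the criterion factor equals 1 and the result reduces to (32),
    whatever value rho_hh takes.
    """
    numerator = k * l * rho_xy
    test_part = sqrt(k + k * (k - 1) * rho_gg)
    criterion_part = sqrt(l + l * (l - 1) * rho_hh)
    return numerator / (test_part * criterion_part)


# Hypothetical test: validity 0.40, reliability 0.50, lengthened 4 times.
v4 = lengthened_validity(0.40, 0.50, k=4)
v_large = lengthened_validity(0.40, 0.50, k=10_000)
print(round(v4, 3), round(v_large, 3))
```

For very large $k$ the value approaches $\rho_{x_g y_h} / \sqrt{\rho_{gg}}$, so lengthening a test can raise its validity only up to that ceiling.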
ITEM ANALYSIS
We have already seen that the quality of a test is determined by
its reliability and validity. Now, in developing a test, a large number of items
supposed to measure the ability under consideration are tried over a large
group of subjects. The question that naturally arises is: how well can the items
be selected so that the required reliability and validity of the test can be
achieved? This calls for item analysis.
The typical item analysis is carried out from two kinds of information:
an index of item difficulty and an index of item validity, which means how well
the item discriminates in agreement with the rest of the items of the test or
how well it predicts some external criterion. The most common index of item
difficulty is 𝑝𝑖 , the proportion of subjects who pass the item. The commonly
used index of item validity is 𝑟𝑖𝑐 , the correlation of the item score with some
external criterion 𝑐 or, more often, 𝑟𝑖𝑡 , the correlation of the item score with
the total score. The most common use of item analysis data is the selection of
the best items to compose the final test. It also enables the item-writer to
modify the items in the required directions. The important features of the test,
viz. mean, variance, reliability and validity, can be controlled by selecting
items of the right type of difficulty, the right spread of difficulty, the right
degree of item intercorrelations and item validities.
The difficulty index $p_i$ for the $i$th item is the proportion of subjects
answering the item correctly. In a multiple-choice item with $k$ alternatives,
Guilford has proposed a correction for guessing on the assumption that a
subject either knows the answer correctly or guesses at random. If $R_i$ is the
number of persons answering the item correctly and $W_i$ the number answering
wrongly, the number of lucky guesses, i.e., of those who guess correctly, is
estimated as $W_i/(k-1)$, so that the item difficulty corrected for guessing is
$$\frac{R_i - \dfrac{W_i}{k-1}}{R_i + W_i}$$ …(33)
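A small computation shows the size of the correction in (33). The sketch below assumes, as in the text, that every subject attempts the item; the function name and the counts are hypothetical.

```python
def corrected_difficulty(right: int, wrong: int, k: int) -> float:
    """Item difficulty corrected for guessing, as in formula (33).

    right, wrong: numbers answering correctly and wrongly;
    k: number of alternatives in the multiple-choice item.
    """
    lucky_guesses = wrong / (k - 1)   # estimated number who guessed correctly
    return (right - lucky_guesses) / (right + wrong)


# Hypothetical item with 5 alternatives: 60 right, 40 wrong.
uncorrected = 60 / (60 + 40)
corrected = corrected_difficulty(60, 40, 5)
print(uncorrected, corrected)  # prints 0.6 0.5
```

The fewer the alternatives, the larger the correction: with the same counts and only 2 alternatives the corrected value drops to 0.2.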
There are alternative formulae for correction for guessing too, based on other
assumptions. In some methods of item analysis, the correlation 𝑟𝑖𝑡 is estimated
from those making extreme scores, generally the upper and lower 27% of the total
group. The estimation is, however, based on symmetry of the item score and
total score distributions and linearity of regression of item score on total score.
Four coefficients of correlation are commonly used to indicate the
correlation of an item with a criterion (𝑟𝑖𝑐 ) or, more generally, of an item with
the total(𝑟𝑖𝑡 ). They are biserial(𝒓𝒃𝒊 ), point biserial (𝒓𝒑𝒃 ), tetrachoric (𝒓𝒕 )
and the Φ coefficient. If the ability measured by the item is normally
distributed and the criterion score is continuous, then 𝑟𝑏𝑖 can be used. If the
item score is limited to 0 and 1, 𝑟𝑝𝑏 should be used. If the criterion variable
and the ability measured by the item are both normally distributed, 𝑟𝑡 is called
for. If the criterion is not a continuous variable, but a natural division into two
groups, one can use the Φ coefficient. Another index, known as the index of
discrimination between High and Low groups, is often used for item selection.
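Two of these coefficients need no special formula: the point biserial and the Φ coefficient are the ordinary product-moment correlation applied to a 0/1 item score paired with, respectively, a continuous score and a second 0/1 variable. The sketch below relies on that standard identity, which is not spelled out in the text; the data and names are hypothetical. The biserial and tetrachoric coefficients require normal-curve calculations and are not attempted here.

```python
from statistics import mean, pvariance


def pearson_r(a, b):
    """Product-moment correlation between two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = mean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    return cov / (pvariance(a) ** 0.5 * pvariance(b) ** 0.5)


item = [1, 1, 0, 0]    # hypothetical 0/1 item scores of four subjects
total = [4, 3, 2, 1]   # their total test scores (continuous criterion)
group = [1, 1, 1, 0]   # a natural two-group criterion coded 0/1

r_pb = pearson_r(item, total)   # point biserial: 0/1 item with continuous score
phi = pearson_r(item, group)    # phi: 0/1 item with 0/1 criterion
print(round(r_pb, 4), round(phi, 4))
```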
INTELLIGENCE TESTS AND IQ
Interest in the nature and measurement of intelligence is gradually
increasing. Tests of intelligence and other mental qualities are being used in
different spheres of life. By intelligence is meant the capacity for relational
and constructive thinking for the attainment of some goal. In the discussion of
intelligence, Spearman’s two-factor theory holds an important place.
According to this theory, there is a common element, a general factor, in all
our cognitive abilities, i.e., abilities that are concerned with the
intellectual aspects of mind. Spearman named this the g-factor, and this
g-factor can be identified with intelligence. Besides the g-factor, which is
present in all abilities, there is according to Spearman a specific factor for
each ability. Spearman’s theory was not, however, universally accepted. Thomson
proposed a group-factor theory. According to Thomson, there are group
factors, each of which is present in a number of different abilities. Thus, while
they are more restricted than Spearman’s g-factor, they are less restricted
than his specific factors. Some of the group factors are the following: (i) verbal
ability; (ii) numerical ability; (iii) musical ability; (iv) mechanical ability.
All attempts to describe intelligence by a recourse to physiology have
failed. Though differences of opinion exist on the nature of intelligence, there
is more or less general agreement as to the procedure of measuring intelligence.
In an intelligence test, the following types of problem find a place:
(i) Synonyms and antonyms
One word is given, and the subject is required to select or to supply a
second word which has the same or the opposite meaning.
Example: (i) Superior is the opposite of …………. .
(ii) Cruel is the same as (rough, unkind, persecutor, inhuman).
(ii) Classification
A set of words is given. All but one word are in some respect the same.
The subject is to find out the odd word.
Example: (i) Shoot, stab, murder, write.
(ii) Rice, flour, bread, flower.
(iii) Sentence completions
An incomplete sentence is given. The subject is to complete it.
Example: (i) Man is superior to other animals because………,
(ii) A journey to moon can be made by……..,
(iv)Mixed sentences
A set of words is given. The subject is to rearrange them into a sentence
and say whether it is true or false.
Example: (i) Sword pen is then mightier. (True, False)
(ii) Is America a socialist country.(true, false)
(v)Coding
A sentence is given. The subject is to rewrite it on the basis of a given
code.
Example: Code the following message by first reversing each word and then
substituting each letter by the next: “Send reinforcements at once”.
(vi)Number series
A series of numbers is given and the subject is to supply the next or the
next two.
Example: (i) Supply the next two terms-.
(a) 1, 3, 7, 13, ….,…..,… .
(b) 81, 27,9, 3….,….,… .
(vii) Analogies
Three words, of which the first two are related in some way, are given.
The subject is to find or select the fourth word which is related to the third as
the second is to the first.
Example: Black is to white as intelligent is to ………. .
Man is to woman as god is to ………. .
(viii) Inferences
A problem demanding reasoning is given, and the subject is to select or
supply the solution.
Example: All men are mortal.
Some men are kind.
All mortals are kind. (True or false)
Intelligence tests may be designed for application to individuals or for
application to groups of individuals. One of the well-known individual tests is
Binet’s test. The revised version of this test is now being widely used for
measuring the intelligence of young children and for detecting mental deficiency.
Group tests were first widely used by the U.S. Army authorities for
recruitment, placement or promotion of personnel. The Alpha test was meant
for the majority and the Beta test for illiterates or non-English-speaking
persons.
Intelligence tests, like other tests, may again be verbal or non-verbal.
The former demand the intelligent manipulation of ideas expressed in words
while the latter call for the intelligent manipulation of objects.
After constructing an intelligence test, we must check its reliability and
validity by one of the methods discussed previously. When we are satisfied
that the intelligence test is reliable and valid, we must compute some
standard or norm which will aid us in assessing any given individual’s score.
We may compute either the mean and standard deviation or the percentile
norms, standard scores or T-scores for this purpose. It was in this connection
that Binet introduced the concept of mental age. An individual’s mental age
(MA) is the age at which an average person can pass the tests that the
individual passes. A number of intelligence tests so constructed are to be
applied to large numbers of children of different ages. Then one has to find at
what age (last birthday) each test is passed by 50% of the children of that age.
Thus for each age a number of intelligence tests, say 5, are fixed. If a subject
can answer correctly all the tests for age 9, 80% of those for age 10, 40% of
those for age 11 and 20% of those for age 12, his mental age would be
9 + 0.80 + 0.40 + 0.20 = 10.40 years. Later, mental ratio (MR) was defined as
$$\text{mental ratio} = \frac{\text{mental age}}{\text{chronological age}}$$ …(34)
Thus, if a boy of 10 years possesses an MA of 10.40 years, then his MR is
1.04. He is thus an advanced child, his MR being more than 1. A child will be
regarded as retarded if his MR is less than 1, and he is of average intelligence
if his MR equals 1.
The intelligence quotient, or IQ, has now replaced the MR. IQ is defined as
$$IQ = 100 \times \frac{MA}{CA} = 100 \times MR$$ …(35)
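The mental age example and the classical IQ can be reproduced in a few lines. This is a sketch under the scoring convention used in the example above, where each age level above the highest fully passed one adds the fraction of its tests that the subject passes; `basal_age` and the function names are hypothetical labels.

```python
def mental_age(basal_age: float, fractions_passed: list) -> float:
    """Mental age in years.

    basal_age: highest age level at which all tests are passed;
    fractions_passed: fraction of tests passed at each later age level.
    """
    return basal_age + sum(fractions_passed)


def classical_iq(ma: float, ca: float) -> float:
    """Classical IQ = 100 x MA / CA, as in formula (35)."""
    return 100 * ma / ca


ma = mental_age(9, [0.80, 0.40, 0.20])   # the subject in the example above
iq = classical_iq(ma, 10)                # chronological age of 10 years
print(round(ma, 2), round(iq, 1))
```

For the boy of 10 years in the text this gives an MA of 10.40 years and an IQ of 104, that is, 100 times the mental ratio 1.04.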
We now make some observations concerning the interpretation of IQ in
its classical form. The IQ will be 100 (lower than 100/greater than 100) for all
children who have the same (a lower/a higher) level of intellectual
development as (than) the average child of the same age. It is necessary that
the standard deviations of the IQ distribution of all age groups be
approximately the same for the same IQ to have the same relative position on
the distribution for different ages. This is essential for a proper interpretation
of an individual IQ. But as this is not fulfilled in many cases, the present trend
in standard tests is that the test is standardized and normalized into a set of
normalized scores (called IQ-equivalents) for each age with mean 100 and
standard deviation 15. Thus it is immaterial whether we use a T-scale or an
IQ-equivalent scale for the norm.
The use of intelligence tests has shown that intelligence may be
supposed to be normally distributed and that it depends on heredity. It has
also been found that intelligence grows with age up to about age 16 or 17 and
then remains steady. There is no evidence that intelligence and
sex are related. It has also been found that different occupations require
intelligence to varying degrees.
Intelligence tests have found many uses. They are used for vocational
guidance and selection, in the grading of pupils and in diagnosing mental
deficiency. Thus an intelligence test, properly constructed and standardized, is
of immense use for various purposes.
ELEMENTS OF FACTOR ANALYSIS
Factor analysis is that branch of statistical methods which is concerned
with the resolution of a set of variables $X_1, X_2, \ldots, X_n$ in terms of a
smaller number of factors $F_1, F_2, \ldots, F_m$, where $m < n$, so that the
purpose in view is not vitiated. The resolution is effected by the analysis of
the inter-correlations of the variables. The satisfactory solution is to use
factors which convey all the important and essential information of the
original set of variables, and the emphasis is on economy of description.
Factor analysis has its principal application in psychological measurements,
where the variables $X_1, X_2, \ldots, X_n$ are the scores on the $n$ tests of a
battery and $F_1, F_2, \ldots, F_m$ are $m$ mental abilities measured by the tests.
The simplest mathematical expression for describing a set of variables
in terms of several others is a linear one. In factor analysis also, a linear form
is taken to represent a variable $X_j$ in terms of a number of underlying factors
which are taken in the standardized form (i.e., with zero means and unit
s.d.'s). Several types of factors are employed. Common factors are
those which occur in more than one variable. Common factors are of two
types: (1) a general factor, which is common to all the variables, and (2) group
factors, which are present in several, but not in all, variables. A factor which
appears in the description of a single variable is called unique. Unique factors
are of two types: (1) specific factors, having a simple interpretation and
liable to be identified, and (2) unreliable or error factors, which are unreliable
and not identifiable. Thus we have
$$X_j = a_{j1} F_1 + a_{j2} F_2 + \cdots + a_{jm} F_m + b_j S_j + c_j E_j, \quad j = 1, 2, \ldots, n,$$ …(36)
$F_1, F_2, \ldots, F_m$ being the common factors, $S_j$ the specific factor and
$E_j$ the error or unreliability.
$h_j^2 = \sum_{k=1}^{m} a_{jk}^2$ is called the communality of the variable $X_j$,
which is the part of the total variance attributable to common factors, whereas
$b_j^2$ and $c_j^2$ are called the specificity and unreliability of the variable,
$b_j^2 + c_j^2$ being called its uniqueness. $h_j^2 + b_j^2$ may be termed the
reliability of the variable, and $a_{j1}, a_{j2}, \ldots, a_{jm}$ are the factor
loadings of the $m$ common factors for the variable $X_j$. The basic problem of
factor analysis is to determine the factor loadings. When the factor loadings
are determined, one can evaluate the factors in terms of the variables.
Let us designate
$$X_j = a_{j1} F_1 + a_{j2} F_2 + \cdots + a_{jm} F_m + a_j U_j, \quad j = 1, 2, \ldots, n,$$ …(37)
$U_j$ being the unique factor, as the factor pattern, and
$$r_{X_j F_k} = a_{j1} r_{F_k F_1} + a_{j2} r_{F_k F_2} + \cdots + a_{jk} + \cdots + a_{jm} r_{F_k F_m}, \qquad r_{X_j U_j} = a_j$$ …(38)
as the factor structure.
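If we have $N$ individuals for whom the values of the variable $X_j$ are
known, say $X_{j1}, X_{j2}, \ldots, X_{jN}$, let
$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1N} \\ X_{21} & X_{22} & \cdots & X_{2N} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{nN} \end{pmatrix},$$
$$F = \begin{pmatrix} F_{11} & F_{12} & \cdots & F_{1N} \\ F_{21} & F_{22} & \cdots & F_{2N} \\ \vdots & \vdots & & \vdots \\ F_{m1} & F_{m2} & \cdots & F_{mN} \\ U_{11} & U_{12} & \cdots & U_{1N} \\ U_{21} & U_{22} & \cdots & U_{2N} \\ \vdots & \vdots & & \vdots \\ U_{n1} & U_{n2} & \cdots & U_{nN} \end{pmatrix}$$
and
$$M = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1m} & a_1 & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & a_{2m} & 0 & a_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} & 0 & 0 & \cdots & a_n \end{pmatrix}.$$
Then $X = MF$.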
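Now
$$\frac{1}{N} XX' = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1n} \\ r_{21} & 1 & \cdots & r_{2n} \\ \vdots & \vdots & & \vdots \\ r_{n1} & r_{n2} & \cdots & 1 \end{pmatrix} = R,$$
the correlation matrix. Thus
$$R = \frac{1}{N} XX' = \frac{1}{N} (MF)(F'M') = M \left(\frac{1}{N} FF'\right) M'.$$
But if the factors are all orthogonal (so that $\frac{1}{N} FF'$ is the identity
matrix),
$$R = MM'.$$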
Thus, if we regard the correlation matrix $R$ as the available data and the
factor pattern matrix $M$ as the desired objective in a factor analysis, we have
$n(n-1)/2$ experimentally given coefficients, which must exceed the number of
linearly independent coefficients in $M$. It will be seen that by limiting
ourselves to common factors, the factor problem becomes determinate even
though we admit the existence of unique factors.
Now with the assumption of a particular factor pattern and the
assumption of orthogonality of factors, we can calculate the coefficients
$$\hat{r}_{jk} = \sum_{i=1}^{m} a_{ji} a_{ki}$$
and compare them with the observed correlation coefficients to see how far
the assumed factor pattern explains the observed correlation coefficients.
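The reproduced correlations and the communalities follow directly from the common-factor loadings. The sketch below uses a made-up loading matrix for three variables and two orthogonal common factors; the numbers and function names are purely illustrative.

```python
def reproduced_correlation(loadings, j, k):
    """r-hat_jk = sum over common factors of a_ji * a_ki (orthogonal factors)."""
    return sum(a_j * a_k for a_j, a_k in zip(loadings[j], loadings[k]))


def communality(loadings, j):
    """h_j^2 = sum of squared common-factor loadings of variable j."""
    return sum(a * a for a in loadings[j])


# Hypothetical loadings a_ji: rows are variables, columns are common factors.
A = [
    [0.8, 0.3],
    [0.7, 0.4],
    [0.6, -0.5],
]
r12 = reproduced_correlation(A, 0, 1)
r13 = reproduced_correlation(A, 0, 2)
r23 = reproduced_correlation(A, 1, 2)
h1_sq = communality(A, 0)
print(round(r12, 2), round(r13, 2), round(r23, 2), round(h1_sq, 2))
```

The differences between these reproduced values and the observed correlations (the residuals) indicate how well the assumed pattern fits; the uniqueness of the first variable is one minus its communality.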
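When the factor loadings are determined, estimation of any common factor $F_s$
(or a unique factor $U_s$) involves the determination of the regression function

\[ \hat{F}_s = \beta_{s1}X_1 + \beta_{s2}X_2 + \cdots + \beta_{sn}X_n. \]

The normal equations will be

\[
\begin{aligned}
\beta_{s1} + r_{12}\beta_{s2} + \cdots + r_{1n}\beta_{sn} &= t_{1s}, \\
r_{21}\beta_{s1} + \beta_{s2} + \cdots + r_{2n}\beta_{sn} &= t_{2s}, \\
&\;\;\vdots \\
r_{n1}\beta_{s1} + r_{n2}\beta_{s2} + \cdots + \beta_{sn} &= t_{ns},
\end{aligned}
\]

where $t_{js} = r_{X_jF_s}$. The solution is

\[ \hat{\beta}_{sj} = \frac{1}{|R|}\left[t_{1s}R_{1j} + t_{2s}R_{2j} + \cdots + t_{ns}R_{nj}\right], \]

where $R_{ij}$ is the cofactor of $r_{ij}$ in the determinant $|R|$.
Writing $\hat{\beta}_s = (\hat{\beta}_{s1}, \ldots, \hat{\beta}_{sn})'$ and $t_s = (t_{1s}, \ldots, t_{ns})'$, we thus have

\[ \hat{\beta}_s' = t_s'R^{-1}, \]

so that

\[ \hat{F}_s = t_s'R^{-1}(X_1, X_2, \ldots, X_n)'. \]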
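The cofactor solution and the matrix form $\hat{\beta}_s' = t_s'R^{-1}$ give the same weights. A minimal Python sketch of this equivalence follows. The correlation matrix `R` and the vector `t` are hypothetical values, and the helper names `cofactor` and `beta_by_cofactors` are introduced here only for illustration.

```python
import numpy as np

def cofactor(R, i, j):
    """Cofactor of element (i, j): signed determinant of the minor."""
    minor = np.delete(np.delete(R, i, axis=0), j, axis=1)
    return (-1) ** (i + j) * np.linalg.det(minor)

def beta_by_cofactors(R, t):
    """beta_j = (1/|R|) * sum_i t_i * R_ij, as in the cofactor solution."""
    n = R.shape[0]
    det_R = np.linalg.det(R)
    return np.array([
        sum(t[i] * cofactor(R, i, j) for i in range(n)) / det_R
        for j in range(n)
    ])

# Hypothetical correlation matrix of n = 3 variables.
R = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.4],
    [0.3, 0.4, 1.0],
])

# Hypothetical correlations t_js of the variables with one factor F_s.
t = np.array([0.8, 0.7, 0.5])

beta_cof = beta_by_cofactors(R, t)   # cofactor formula
beta_solve = np.linalg.solve(R, t)   # solves R beta = t directly

print(np.round(beta_cof, 4))
print(np.round(beta_solve, 4))
```

The cofactor route is shown only to mirror the formula; for anything beyond a handful of variables, solving the linear system directly is the more practical choice.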
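Combining for all factors, common and unique, we have

\[ \hat{F} = S'R^{-1}X, \tag{39} \]

where

\[
S = \begin{pmatrix}
t_{11} & t_{12} & \cdots & t_{1m} & a_1 & 0 & \cdots & 0 \\
t_{21} & t_{22} & \cdots & t_{2m} & 0 & a_2 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
t_{n1} & t_{n2} & \cdots & t_{nm} & 0 & 0 & \cdots & a_n
\end{pmatrix}.
\]

In case the factors are orthogonal,

\[ r_{X_jF_k} = t_{jk} = a_{jk}, \]

and the factor structure coincides with the loading matrix $M$, where

\[
M = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1m} & a_1 & 0 & \cdots & 0 \\
a_{21} & a_{22} & \cdots & a_{2m} & 0 & a_2 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nm} & 0 & 0 & \cdots & a_n
\end{pmatrix}.
\]

We have

\[ \hat{F} = M'R^{-1}X. \tag{40} \]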
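Equation (40) can be illustrated numerically. The sketch below is a Python example under stated assumptions, not the author's computation. The loadings are hypothetical, the scores in `X` are random numbers standing in for standardized test scores, $R$ is taken as exactly $MM'$ rather than an observed correlation matrix, and NumPy's general linear solver is used in place of any hand method.

```python
import numpy as np

# Hypothetical loadings of n = 4 variables on m = 2 orthogonal common factors.
A = np.array([
    [0.8, 0.3],
    [0.7, 0.4],
    [0.2, 0.9],
    [0.3, 0.6],
])

# Unique loadings, assuming unit-variance variables.
unique = np.sqrt(1.0 - (A ** 2).sum(axis=1))

# Pattern matrix M = [A | diag(a_j)] and the implied correlation matrix R = M M'.
M = np.hstack([A, np.diag(unique)])
R = M @ M.T

# Stand-in scores: n = 4 variables (rows) for N = 5 individuals (columns).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))

# Equation (40): F_hat = M' R^{-1} X, computed without forming R^{-1} explicitly.
F_hat = M.T @ np.linalg.solve(R, X)

# Rows 0..1 estimate the common factors, rows 2..5 the unique factors.
print(F_hat.shape)
```

Because $R = MM'$ here, the estimates satisfy $M\hat{F} = X$ exactly, which gives a convenient check on the arithmetic. With an observed correlation matrix in place of $MM'$, that identity would hold only approximately.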
In actual applications, the orthogonal factors are estimated conveniently by
the method of pivotal condensation.
References
1. Goon, Gupta and Dasgupta: Fundamentals of Statistics, Vol. II.