ch7 - CORELATION
$$r = \frac{\sum xy}{N\,\sigma_x \sigma_y} \qquad \ldots(1)$$
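or

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}} \qquad \ldots(2)$$

or

$$r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sqrt{\sum X^2 - \dfrac{(\sum X)^2}{N}}\,\sqrt{\sum Y^2 - \dfrac{(\sum Y)^2}{N}}} \qquad \ldots(3)$$

or

$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}} \qquad \ldots(4)$$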
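Formulas (2), (3) and (4) are the same quantity written in different ways. A short sketch of why, assuming only that the means are $\bar{X} = \sum X / N$ and $\bar{Y} = \sum Y / N$ and that every sum runs over the N pairs of observations:

```latex
\begin{align*}
\sum (X - \bar{X})(Y - \bar{Y})
  &= \sum XY - \bar{Y}\sum X - \bar{X}\sum Y + N\bar{X}\bar{Y}
   = \sum XY - \frac{(\sum X)(\sum Y)}{N} \\
\sum (X - \bar{X})^2
  &= \sum X^2 - 2\bar{X}\sum X + N\bar{X}^2
   = \sum X^2 - \frac{(\sum X)^2}{N}
\end{align*}
```

The same expansion holds for Y. Substituting these into formula (2) gives formula (3), and multiplying the numerator and the denominator of formula (3) by N gives formula (4).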
Properties of Correlation Coefficient
Let us now discuss the properties of the correlation coefficient.
• r has no unit. It is a pure number. It means units of measurement are not part of r. r between height in feet and weight in kilograms, for instance, is, say, 0.7.
• A negative value of r indicates an inverse relation. A change in one variable is associated with a change in the other variable in the opposite direction. When the price of a commodity rises, its demand falls. When the rate of interest rises, the demand for funds also falls, because funds have now become costlier.
• If r is positive the two variables move in the same direction. When the price of coffee, a substitute for tea, rises, the demand for tea also rises. Improvement in irrigation facilities is associated with higher yield. When temperature rises the sale of ice-creams becomes brisk.
CORRELATION 97
• If r = 0 the two variables are uncorrelated. There is no linear relation between them. However, other types of relation may be there.
• If r = 1 or r = −1 the correlation is perfect. The relation between them is exact.
• A high value of r indicates a strong linear relationship. Its value is said to be high when it is close to +1 or −1.
• A low value of r indicates a weak linear relation. Its value is said to be low when it is close to zero.
• The value of the correlation coefficient lies between minus one and plus one, $-1 \le r \le 1$. If, in any exercise, the value of r is outside this range it indicates an error in calculation.
• The value of r is unaffected by the change of origin and change of scale. Given two variables X and Y let us define two new variables

$$U = \frac{X - A}{B}; \quad V = \frac{Y - C}{D}$$

where A and C are assumed means of X and Y respectively. B and D are common factors. Then

$$r_{xy} = r_{uv}$$

This property is used to calculate the correlation coefficient in a highly simplified manner, as in the step deviation method.
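Two of the properties listed above can be checked numerically. The sketch below is illustrative only: the helper name `pearson_r`, both data sets and the constants A, B, C, D are assumptions chosen for the demonstration, and B and D are taken to be positive.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means (formula 2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, chosen only for illustration.
X = [10, 20, 30, 40, 50]
Y = [12, 25, 31, 48, 52]

# Change of origin and scale: U = (X - A)/B, V = (Y - C)/D.
A, B, C, D = 30, 10, 30, 2   # assumed values; B and D are positive
U = [(x - A) / B for x in X]
V = [(y - C) / D for y in Y]

r_xy = pearson_r(X, Y)
r_uv = pearson_r(U, V)
print(round(r_xy, 6), round(r_uv, 6))   # the two values agree

# r = 0 does not rule out a non-linear relation: here Y2 = X2 ** 2 exactly.
X2 = [-3, -2, -1, 1, 2, 3]
Y2 = [x * x for x in X2]
r_zero = pearson_r(X2, Y2)
print(abs(r_zero) < 1e-12)              # r is zero up to rounding error
```

The second data set is an exact quadratic relation, yet r is zero because the relation is not linear.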
As you have read in Chapter 1, statistical methods are no substitute for common sense. Here is another example, which highlights the need to understand the data properly before correlation is calculated. An epidemic spreads in some villages and the government sends a team of doctors to the affected villages. The correlation between the number of deaths and the number of doctors sent to the villages is found to be positive. Normally the health care provided by the doctors would be expected to reduce the number of deaths, giving a negative correlation. The positive value arose for other reasons. The data relate to a specific time period. Many of the reported deaths could be terminal cases where the doctors could do little. Moreover, the benefit of the presence of doctors becomes visible only after some time. It is also possible that the reported deaths are not due to the epidemic: for instance, a tsunami suddenly hits the state and the death toll rises.
Let us illustrate the calculation of
r by examining the relationship
between years of schooling of the
farmer and the annual yield per acre.
Example 1
No. of years of           Annual yield per acre
schooling of farmers      in '000 (Rs)
 0                         4
 2                         4
 4                         6
 6                        10
 8                        10
10                         8
12                         7
Formula (1) needs the values of $\sum xy$, $\sigma_x$ and $\sigma_y$.
98 STATISTICS FOR ECONOMICS
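From Table 7.1 we get,

$$\sum xy = 42, \quad \sigma_x = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{112}{7}}, \quad \sigma_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{N}} = \sqrt{\frac{38}{7}}$$

Substituting these values in formula (1)

$$r = \frac{42}{7\,\sqrt{\dfrac{112}{7}}\,\sqrt{\dfrac{38}{7}}} = 0.644$$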
The same value can be obtained from formula (2) also.

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}} \qquad \ldots(2)$$

$$r = \frac{42}{\sqrt{112}\,\sqrt{38}} = 0.644$$
Thus years of education of the farmers and annual yield per acre are positively correlated. The value of r is also fairly large. It suggests that the more years farmers invest in education, the higher the yield per acre tends to be. It underlines the importance of farmers' education.
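The arithmetic of Example 1 can be reproduced in a few lines. This is a minimal sketch using the X and Y values listed in the example; the variable names are illustrative.

```python
from math import sqrt

# Data of Example 1: years of schooling (X) and annual yield per acre (Y).
X = [0, 2, 4, 6, 8, 10, 12]
Y = [4, 4, 6, 10, 10, 8, 7]
N = len(X)

mean_x = sum(X) / N          # 6.0
mean_y = sum(Y) / N          # 7.0

# Column totals of the deviation table.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))   # 42
sum_xx = sum((x - mean_x) ** 2 for x in X)                        # 112
sum_yy = sum((y - mean_y) ** 2 for y in Y)                        # 38

# Formula (1): r = sum(xy) / (N * sigma_x * sigma_y)
sigma_x = sqrt(sum_xx / N)
sigma_y = sqrt(sum_yy / N)
r1 = sum_xy / (N * sigma_x * sigma_y)

# Formula (2): r = sum(xy) / (sqrt(sum x^2) * sqrt(sum y^2))
r2 = sum_xy / (sqrt(sum_xx) * sqrt(sum_yy))

print(round(r1, 3), round(r2, 3))   # 0.644 0.644
```

Both formulas give the same value because the factor N in formula (1) cancels against the two square roots of N inside the standard deviations.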
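To use formula (3)

$$r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sqrt{\sum X^2 - \dfrac{(\sum X)^2}{N}}\,\sqrt{\sum Y^2 - \dfrac{(\sum Y)^2}{N}}} \qquad \ldots(3)$$

the values of the following expressions have to be calculated, i.e. $\sum XY$, $\sum X^2$, $\sum Y^2$.
Now apply formula (3) to get the value of r.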
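As a worked check, the raw sums for the data of Example 1 can be substituted into formula (3). The sums below are computed here from the X and Y values listed in that example; they are not quoted from the text.

```latex
\begin{align*}
N &= 7, \quad \sum X = 42, \quad \sum Y = 49, \quad
\sum XY = 336, \quad \sum X^2 = 364, \quad \sum Y^2 = 381 \\[4pt]
r &= \frac{336 - \dfrac{42 \times 49}{7}}
          {\sqrt{364 - \dfrac{42^2}{7}}\,\sqrt{381 - \dfrac{49^2}{7}}}
   = \frac{336 - 294}{\sqrt{364 - 252}\,\sqrt{381 - 343}}
   = \frac{42}{\sqrt{112}\,\sqrt{38}} \approx 0.644
\end{align*}
```

The numerator and the two terms under the square roots reduce to 42, 112 and 38, the same totals used with formula (2).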
Let us now interpret different values of r. The correlation coefficient between marks secured in English and Statistics is, say, 0.1. It means that though the marks secured in the two subjects are positively correlated, the strength of the relationship is weak. Students with high marks in English may be getting relatively low marks in Statistics. Had the value of r been, say, 0.9, students with high marks in English would almost always get high marks in Statistics.
TABLE 7.1
Calculation of r between years of schooling of farmers and annual yield

Years of        X − X̄   (X − X̄)²   Annual yield per       Y − Ȳ   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)
education (X)                       acre in '000 Rs (Y)
  0              −6       36          4                      −3       9          18
  2              −4       16          4                      −3       9          12
  4              −2        4          6                      −1       1           2
  6               0        0         10                       3       9           0
  8               2        4         10                       3       9           6
 10               4       16          8                       1       1           4
 12               6       36          7                       0       0           0

ΣX = 42;  Σ(X − X̄)² = 112;  ΣY = 49;  Σ(Y − Ȳ)² = 38;  Σ(X − X̄)(Y − Ȳ) = 42
An example of negative correlation is the relation between the arrival of vegetables in the local mandi and the price of vegetables. If r is −0.9, a large vegetable supply in the local mandi will be accompanied by a lower price of vegetables. Had it been −0.1, a large vegetable supply would still be accompanied by a lower price, but not as low as the price when r is −0.9. The extent of the price fall depends on the absolute value of r. Had it been zero there would have been no fall in price, even after large supplies in the market. This is also a possibility if the increase in supply is taken care of by a good transport network transferring it to other markets.
Activity
Look at the following table. Calculate r between annual growth of national income at current price and the Gross Domestic Saving as percentage of GDP.

TABLE 7.2
Year       Annual growth of     Gross Domestic Saving
           National Income      as percentage of GDP
1992-93    14                   24
1993-94    17                   23
1994-95    18                   26
1995-96    17                   27
1996-97    16                   25
1997-98    12                   25
1998-99    16                   23
1999-00    11                   25
2000-01     8                   24
2001-02    10                   23
Source: Economic Survey (2004-05), Pg. 8, 9

Step deviation method to calculate correlation coefficient.
When the values of the variables are large, the burden of calculation can be considerably reduced by using a property of r: r is independent of change in origin and scale. Using this property is known as the step deviation method. It involves the transformation of the variables X and Y as follows:

$$U = \frac{X - A}{h}; \quad V = \frac{Y - B}{k}$$

where A and B are assumed means, h and k are common factors.
Then $r_{UV} = r_{XY}$.
This can be illustrated with the
exercise of analysing the correlation
between price index and money
supply.
Example 2
Price index (X)                   120    150    190    220    230
Money supply in Rs crores (Y)    1800   2000   2500   2700   3000

The simplification using the step deviation method is illustrated below.
Let A = 100; h = 10; B = 1700 and k = 100.
The table of transformed variables is as follows:

TABLE 7.3
Calculation of r between price index and money supply using step deviation method

U = (X − 100)/10    V = (Y − 1700)/100     U²      V²      UV
  2                   1                      4       1       2
  5                   3                     25       9      15
  9                   8                     81      64      72
 12                  10                    144     100     120
 13                  13                    169     169     169

ΣU = 41;  ΣV = 35;  ΣU² = 423;  ΣV² = 343;  ΣUV = 378

Substituting these values in formula (3)

$$r = \frac{\sum UV - \dfrac{(\sum U)(\sum V)}{N}}{\sqrt{\sum U^2 - \dfrac{(\sum U)^2}{N}}\,\sqrt{\sum V^2 - \dfrac{(\sum V)^2}{N}}} \qquad \ldots(3)$$

$$= \frac{378 - \dfrac{41 \times 35}{5}}{\sqrt{423 - \dfrac{(41)^2}{5}}\,\sqrt{343 - \dfrac{(35)^2}{5}}} = \frac{91}{\sqrt{86.8}\,\sqrt{98}} \approx 0.987$$
This strong positive correlation
between price index and money
supply is an important premise of
monetary policy. When the money
supply grows the price index also
rises.
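The claim that the step deviation method leaves r unchanged can be checked on the data of Example 2. The sketch below assumes the values A = 100, h = 10, B = 1700 and k = 100 used in that example; the helper name `pearson_r` is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r using raw sums, as in formula (3)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = sxy - sx * sy / n
    den = sqrt(sxx - sx ** 2 / n) * sqrt(syy - sy ** 2 / n)
    return num / den

X = [120, 150, 190, 220, 230]        # price index
Y = [1800, 2000, 2500, 2700, 3000]   # money supply in Rs crores
A, h, B, k = 100, 10, 1700, 100

# Integer division is exact here: every deviation is a multiple of the step.
U = [(x - A) // h for x in X]   # [2, 5, 9, 12, 13]
V = [(y - B) // k for y in Y]   # [1, 3, 8, 10, 13]

r_uv = pearson_r(U, V)
r_xy = pearson_r(X, Y)
print(round(r_uv, 3), round(r_xy, 3))   # 0.987 0.987
```

The transformed values are much smaller than the originals, which is the whole point of the method when the calculation is done by hand.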
Activity
Take some examples of India's
population and national income.
Calculate the correlation
between them using step
deviation method and see the
simplification.
Spearman's rank correlation
Spearman's rank correlation was developed by the British psychologist C.E. Spearman. It is used when the variables cannot be measured meaningfully in the way that price, income, weight etc. can. Ranking may also be more meaningful when the measurements of the variables are suspect. Consider the situation where we are required to calculate the correlation between height and weight of students in a remote village. Neither measuring rods nor weighing scales are available. The students can be easily ranked in terms of height and weight without using measuring rods and weighing scales.
There are also situations when you are required to quantify qualities such as fairness, honesty etc. Ranking may be a better alternative to quantification of qualities. Moreover, sometimes the correlation coefficient between two variables with extreme values may be quite different from the coefficient without the extreme values. Under these circumstances rank correlation provides a better alternative to simple correlation.
Rank correlation coefficient and simple correlation coefficient have the same interpretation. Its formula has been derived from the simple correlation coefficient, with individual values replaced by ranks. These ranks are used for the calculation of correlation. This coefficient provides a measure of linear association between the ranks assigned to these units, not their values. It is the Product Moment Correlation between the ranks. Its formula is
$$r_s = 1 - \frac{6\sum D^2}{n^3 - n} \qquad \ldots(4)$$

where n is the number of observations and D the deviation of ranks assigned to a variable from those assigned to the other variable. When the ranks are repeated the formula is

$$r_s = 1 - \frac{6\left[\sum D^2 + \dfrac{(m_1^3 - m_1)}{12} + \dfrac{(m_2^3 - m_2)}{12} + \ldots\right]}{n(n^2 - 1)}$$

where $m_1, m_2, \ldots$ are the number of repetitions of ranks and $\dfrac{m_1^3 - m_1}{12}, \ldots$ their corresponding correction factors. This correction is needed for every repeated value of both variables. If three values are repeated, there will be a correction for each value. Every time $m_1$ indicates the number of times a value is repeated.
All the properties of the simple correlation coefficient are applicable here. Like the Pearsonian coefficient of correlation it lies between 1 and −1. However, generally it is not as accurate as the ordinary method. This is due to the fact that not all the information concerning the data is utilised. The first differences of the values of the items in the series, arranged in order of magnitude, are almost never constant. The first difference is the difference of consecutive values. Usually the data cluster around the central values with smaller differences in the middle of the array. If the first differences were constant then r and $r_s$ would give identical results.
Rank correlation is preferred to the Pearsonian coefficient when extreme values are present. Often $r_s$ is less than or equal to r, though this need not hold for every data set.
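The remark about constant first differences can be illustrated numerically. The sketch below uses hypothetical series invented for the purpose; the helper names `pearson_r`, `ranks` and `spearman` are assumptions, and the ranking helper assumes no repeated values.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank 1 for the largest value; assumes no repeated values."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Rank correlation 1 - 6*sum(D^2)/(n^3 - n) for untied data."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n ** 3 - n)

# Hypothetical series: each, once sorted, rises in equal steps.
X = [10, 20, 30, 40, 50]      # steps of 10
Y = [9, 3, 6, 15, 12]         # sorted: 3, 6, 9, 12, 15 (steps of 3)

# Same ranks as Y, but one extreme value makes the steps unequal.
Y_extreme = [9, 3, 6, 150, 12]

print(round(pearson_r(X, Y), 3), round(spearman(X, Y), 3))
print(round(pearson_r(X, Y_extreme), 3), round(spearman(X, Y_extreme), 3))
```

With equal steps the two coefficients coincide. With the extreme value the ranks, and hence the rank correlation, are unchanged, while Pearson's r moves noticeably; in this particular data set it falls below the rank correlation.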
The calculation of rank correlation
will be illustrated under three
situations.
1. The ranks are given.
2. The ranks are not given. They have
to be worked out from the data.
3. Ranks are repeated.
Case 1: When the ranks are given
Example 3
Five persons are assessed by three
judges in a beauty contest. We have
to find out which pair of judges has
the nearest approach to common
perception of beauty.
            Competitors
Judge    1    2    3    4    5
A        1    2    3    4    5
B        2    4    1    5    3
C        1    3    5    2    4

There are 3 pairs of judges, necessitating calculation of rank correlation thrice. Formula (4) will be used:

$$r_s = 1 - \frac{6\sum D^2}{n^3 - n} \qquad \ldots(4)$$
The rank correlation between A and B is calculated as follows:

A      B      D = A − B     D²
1      2      −1            1
2      4      −2            4
3      1       2            4
4      5      −1            1
5      3       2            4
Total                       14

Substituting these values in formula (4)

$$r_s = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6 \times 14}{5^3 - 5} = 1 - \frac{84}{120} = 1 - 0.7 = 0.3$$
The rank correlation between A and C is calculated as follows:

A      C      D = A − C     D²
1      1       0            0
2      3      −1            1
3      5      −2            4
4      2       2            4
5      4       1            1
Total                       10
Substituting these values in formula (4) the rank correlation is 0.5. Similarly, for judges B and C the total of D² is 28, so the rank correlation between their rankings is −0.4. Thus, the perceptions of judges A and C are the closest. Judges B and C have very different tastes.
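All three pairwise coefficients can be computed in one pass. This is a minimal sketch using the rankings tabulated in Example 3; the dictionary layout and the helper name `rank_correlation` are illustrative. For these rankings the coefficients come out as 0.3 for A and B, 0.5 for A and C, and −0.4 for B and C.

```python
from itertools import combinations

# Ranks given by each judge to competitors 1..5.
ranks = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 1, 5, 3],
    "C": [1, 3, 5, 2, 4],
}

def rank_correlation(r1, r2):
    """Formula (4): 1 - 6*sum(D^2) / (n^3 - n); valid when no rank repeats."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

# One coefficient for every pair of judges.
results = {
    (j1, j2): rank_correlation(ranks[j1], ranks[j2])
    for j1, j2 in combinations(ranks, 2)
}
for pair, value in results.items():
    print(pair, round(value, 2))

closest = max(results, key=results.get)
print("closest pair:", closest)
```

Picking the pair with the largest coefficient identifies the judges whose perceptions agree most.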
Case 2: When the ranks are not given
Example 4
We are given the percentage of marks,
secured by 5 students in Economics
and Statistics. Then the ranking has
to be worked out and the rank
correlation is to be calculated.
Student    Marks in Statistics (X)    Marks in Economics (Y)
A          85                         60
B          60                         48
C          55                         49
D          65                         50
E          75                         55

Student    Ranking in Statistics (R_x)    Ranking in Economics (R_y)
A          1                              1
B          4                              5
C          5                              4
D          3                              3
E          2                              2
Once the ranking is complete
formula (4) is used to calculate rank
correlation.
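The ranking step and the final calculation for Example 4 can be sketched as follows. The marks are those listed in the example; the helper name `rank_descending` is an assumption, and it relies on no two students having the same mark in a subject.

```python
marks_statistics = {"A": 85, "B": 60, "C": 55, "D": 65, "E": 75}
marks_economics = {"A": 60, "B": 48, "C": 49, "D": 50, "E": 55}

def rank_descending(marks):
    """Rank 1 goes to the highest mark; assumes no two marks are equal."""
    ordered = sorted(marks, key=marks.get, reverse=True)
    return {student: position for position, student in enumerate(ordered, start=1)}

rank_x = rank_descending(marks_statistics)
rank_y = rank_descending(marks_economics)

n = len(rank_x)
# D is the difference between a student's two ranks.
sum_d2 = sum((rank_x[s] - rank_y[s]) ** 2 for s in rank_x)
r_s = 1 - 6 * sum_d2 / (n ** 3 - n)

print(rank_x)
print(rank_y)
print(sum_d2, round(r_s, 2))   # 2 0.9
```

Only students B and C swap places between the two subjects, so the sum of squared rank differences is 2 and the rank correlation is 0.9.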
Case 3: When the ranks are repeated
Example 5
The values of X and Y are given as

X    25   45   35   40   15   19   35   42
Y    55   80   30   35   40   42   36   48
In order to work out the rank correlation, the ranks of the values are worked out. Common ranks are given to the repeated items. The common rank is the mean of the ranks which those items would have assumed if they were slightly different from each other. The next item will be assigned the rank next to the rank already assumed. The formula of Spearman's rank correlation coefficient when the ranks are repeated is as follows
$$r_s = 1 - \frac{6\left[\sum D^2 + \dfrac{(m_1^3 - m_1)}{12} + \dfrac{(m_2^3 - m_2)}{12} + \ldots\right]}{n(n^2 - 1)}$$

where $m_1, m_2, \ldots$ are the number of repetitions of ranks and $\dfrac{m_1^3 - m_1}{12}, \ldots$ their corresponding correction factors.
X has the value 35 at both the 4th and 5th rank. Hence both are given the average rank, i.e. $\dfrac{4 + 5}{2} = 4.5$th rank.
X     Y     Rank of X (R')   Rank of Y (R'')   Deviation in ranking D = R' − R''   D²
25    55    6                2                  4                                  16
45    80    1                1                  0                                   0
35    30    4.5              8                 −3.5                                12.25
40    35    3                7                 −4                                  16
15    40    8                5                  3                                   9
19    42    7                4                  3                                   9
35    36    4.5              6                 −1.5                                 2.25
42    48    2                3                 −1                                   1
Total                                                                              ΣD² = 65.5
The necessary correction thus is

$$\frac{m^3 - m}{12} = \frac{2^3 - 2}{12} = \frac{1}{2}$$

Using this equation

$$r_s = 1 - \frac{6\left[\sum D^2 + \dfrac{(m^3 - m)}{12}\right]}{n^3 - n} \qquad \ldots(5)$$

Substituting the values of these expressions

$$r_s = 1 - \frac{6\,(65.5 + 0.5)}{8^3 - 8} = 1 - \frac{396}{504} = 1 - 0.786 = 0.214$$
Thus there is positive rank correlation
between X and Y. Both X and Y move
in the same direction. However, the
relationship cannot be described as
strong.
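The treatment of repeated ranks in Example 5 can be sketched in code. The X and Y values are those of the working table of the example; the helper names `average_ranks` and `tie_correction` are assumptions introduced for this illustration.

```python
from collections import Counter

X = [25, 45, 35, 40, 15, 19, 35, 42]
Y = [55, 80, 30, 35, 40, 42, 36, 48]

def average_ranks(values):
    """Rank 1 for the largest value; tied values share the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = ordered.index(v) + 1          # first rank occupied by v
        count = ordered.count(v)              # how many times v occurs
        ranks[v] = first + (count - 1) / 2    # mean of the occupied ranks
    return [ranks[v] for v in values]

def tie_correction(values):
    """Sum of (m^3 - m)/12 over every value repeated m times."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

rx, ry = average_ranks(X), average_ranks(Y)
n = len(X)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))        # 65.5
correction = tie_correction(X) + tie_correction(Y)        # 0.5
r_s = 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)
print(sum_d2, correction, round(r_s, 3))                  # 65.5 0.5 0.214
```

Only X contains a repeated value (35 occurs twice), so a single correction of 0.5 is added; Y contributes nothing.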
Activity
Collect data on marks scored by
10 of your classmates in class
IX and X examinations. Calculate
the rank correlation coefficient
between them. If your data do not
have any repetition, repeat the
exercise by taking a data set
having repeated ranks. What are
the circumstances in which rank
correlation coefficient is
preferred to simple correlation
coefficient? If data are precisely
measured will you still prefer
rank correlation coefficient to
simple correlation? When can
you be indifferent to the choice?
Discuss in class.
4. CONCLUSION
We have discussed some techniques for studying the relationship between two variables, particularly the linear relationship. The scatter diagram gives a visual presentation of the relationship and is not confined to linear relations. Measures of correlation such as Karl Pearson's coefficient of correlation and Spearman's rank correlation are strictly the measures of linear relationship. When the variables cannot be measured precisely, rank correlation can meaningfully be used. These measures however do not imply causation. The knowledge of correlation gives us an idea of the direction and intensity of change in a variable when the correlated variable changes.

Recap
• Correlation analysis studies the relation between two variables.
• Scatter diagrams give a visual presentation of the nature of relationship between two variables.
• Karl Pearson's coefficient of correlation r measures numerically only linear relationship between two variables. r lies between −1 and 1.
• When the variables cannot be measured precisely Spearman's rank correlation can be used to measure the linear relationship numerically.
• Repeated ranks need correction factors.
• Correlation does not mean causation. It only means covariation.

EXERCISES
1. The unit of correlation coefficient between height in feet and weight in kgs is
(i) kg/feet
(ii) percentage
(iii) non-existent
2. The range of simple correlation coefficient is
(i) 0 to infinity
(ii) minus one to plus one
(iii) minus infinity to infinity
3. If $r_{xy}$ is positive the relation between X and Y is of the type
(i) When Y increases X increases
(ii) When Y decreases X increases
(iii) When Y increases X does not change
4. If $r_{xy}$ = 0 the variables X and Y are
(i) linearly related
(ii) not linearly related
(iii) independent
5. Of the following three measures which can measure any type of relationship
(i) Karl Pearson's coefficient of correlation
(ii) Spearman's rank correlation
(iii) Scatter diagram
6. If precisely measured data are available the simple correlation coefficient
is
(i) more accurate than rank correlation coefficient
(ii) less accurate than rank correlation coefficient
(iii) as accurate as the rank correlation coefficient
7. Why is r preferred to covariance as a measure of association?
8. Can r lie outside the −1 and 1 range depending on the type of data?
9. Does correlation imply causation?
10. When is rank correlation more precise than simple correlation coefficient?
11. Does zero correlation mean independence?
12. Can simple correlation coefficient measure any type of relationship?
13. Collect the price of five vegetables from your local market every day for a
week. Calculate their correlation coefficients. Interpret the result.
14. Measure the height of your classmates. Ask them the height of their
benchmate. Calculate the correlation coefficient of these two variables.
Interpret the result.
15. List some variables where accurate measurement is difficult.
16. Interpret the values of r as 1, −1 and 0.
17. Why does rank correlation coefficient differ from Pearsonian correlation
coefficient?
18. Calculate the correlation coefficient between the heights of fathers in inches
(X) and their sons (Y)
X 65 66 57 67 68 69 70 72
Y 67 56 65 68 72 72 69 71
(Ans. r = 0.603)
19. Calculate the correlation coefficient between X and Y and comment on
their relationship:
X −3 −2 −1 1 2 3
Y 9 4 1 1 4 9
(Ans. r = 0)
20. Calculate the correlation coefficient between X and Y and comment on
their relationship
X 1 3 4 5 7 8
Y 2 6 8 10 14 16
(Ans. r = 1)
Activity
Use all the formulae discussed here to calculate r between
India's national income and export taking at least ten
observations.