Correlation Analysis
________________________________________________________________________________
Introduction
A distribution in which we consider two variables simultaneously for each item of the series is known
as a bivariate distribution.
Correlation analysis involves various methods and techniques used for studying or measuring the
strength of the relationship between two or more related variables.
Correlation is described or classified in several different ways. Three of the most important are:
(i) Positive and negative;
(ii) Simple, partial and multiple; and
(iii) Linear and non-linear.
(i) Positive and Negative Correlation. Whether correlation is positive or negative depends upon
the direction of change of the variables. If both variables vary in the same direction, i.e.,
if one variable increases the other, on average, also increases, or if one variable decreases
the other, on average, also decreases, the correlation is said to be positive. If, on the
other hand, the variables vary in opposite directions, i.e., as one variable increases the
other decreases or vice versa, the correlation is said to be negative.
(ii) Simple, Partial and Multiple Correlation. The distinction between simple, partial and multiple
correlation is based upon the number of variables studied. When only two variables are studied, it
is a problem of simple correlation. In multiple correlation three or more variables are studied
simultaneously. In partial correlation we recognize more than two variables, but consider only two
of them to be influencing each other, the effect of the other influencing variables being kept constant.
(iii) Linear and Non-linear Correlation. The distinction between linear and non-linear correlation is
based upon the constancy of the ratio of change between the variables. If the amount of change in
one variable tends to bear a constant ratio to the amount of change in the other variable, then the
correlation is said to be linear.
For example, observe the following two variables:
Advertisement expenditure (Tk. lakhs):  25  35  45  55  65
Sales (Tk. crores):                    120 140 160 180 200
It is clear that the ratio of change between the two variables is the same: every increase of
Tk. 10 lakhs in advertisement expenditure is accompanied by an increase of Tk. 20 crores in sales.
If such variables are plotted on graph paper, all the plotted points would fall on a straight line.
However, such a situation is rare in practice.
Correlation would be called non-linear or curvilinear if the amount of change in one variable does not
bear a constant ratio to the amount of change in the other variable. For example, if the amount of
rainfall is doubled, the production of rice or wheat would not necessarily be doubled. It may be pointed
out that in most practical cases we find a non-linear relationship between the variables.
Important methods of ascertaining whether two variables are correlated or not include the scatter
diagram, Karl Pearson's coefficient of correlation and Spearman's rank correlation coefficient; each is
discussed below.

Karl Pearson's Coefficient of Correlation
The correlation coefficient is a measure of the strength of linear relationship between two related
quantitative variables. In other words, a measure of the degree of linear relationship is called the
coefficient of correlation.
If X and Y are two variables, then the correlation coefficient \rho_{XY} between the variables is defined as

\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
We use the Greek letter \rho (rho) to represent the population correlation coefficient and r to represent
the sample correlation coefficient. If the two variables under study are x and y, the following
formula suggested by Karl Pearson can be used for measuring the degree of relationship:

r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;\sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}, where n is the number of data pairs,

OR

r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\sum x^{2} - \dfrac{\left(\sum x\right)^{2}}{n}}\;\sqrt{\sum y^{2} - \dfrac{\left(\sum y\right)^{2}}{n}}}
To interpret such values: if r = +0.70, the relationship between x and y is positive, implying that as
x increases, y also tends to increase. If r = -0.65, the relationship between x and y is negative,
implying that as x increases, y tends to decrease.
Example: From a random sample of 10 businessmen, the amount of investment in business (in
thousand taka) and profit (in thousand taka) for them are given below.
Calculate the coefficient of correlation between investment and profit and interpret the result.
Solution:
Suppose investment is denoted by x and profit is denoted by y. The correlation coefficient between x
and y can be computed by
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\sum x^{2} - \dfrac{\left(\sum x\right)^{2}}{n}}\;\sqrt{\sum y^{2} - \dfrac{\left(\sum y\right)^{2}}{n}}}
Table: Computation of the correlation coefficient.
x        y        x²        y²        xy
20       10       400       100       200
30       5        900       25        150
70       20       4900      400       1400
100      35       10000     1225      3500
60       36       3600      1296      2160
75       15       5625      225       1125
30       20       900       400       600
105      35       11025     1225      3675
92       30       8464      900       2760
15       5        225       25        75
Σx = 597  Σy = 211  Σx² = 46039  Σy² = 5821  Σxy = 15645
Now

r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\sum x^{2} - \dfrac{(\sum x)^{2}}{n}}\;\sqrt{\sum y^{2} - \dfrac{(\sum y)^{2}}{n}}}
  = \frac{15645 - \dfrac{597 \times 211}{10}}{\sqrt{46039 - \dfrac{597^{2}}{10}}\;\sqrt{5821 - \dfrac{211^{2}}{10}}}
  = \frac{3048.3}{\sqrt{10398.1 \times 1368.9}} \approx 0.81
Comment: The correlation coefficient r ≈ 0.81 indicates a high degree of positive correlation
between the variables investment and profit, implying that if investment increases, profit will also
tend to increase.
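The same computation can be reproduced in a few lines of code. Below is a minimal Python sketch
(assuming NumPy and SciPy are available) that evaluates the computational formula and cross-checks it
against scipy.stats.pearsonr; the arrays simply repeat the data from the table above.

```python
# Minimal sketch: Pearson's r for the investment/profit example.
import numpy as np
from scipy.stats import pearsonr

x = np.array([20, 30, 70, 100, 60, 75, 30, 105, 92, 15])  # investment ('000 Tk)
y = np.array([10, 5, 20, 35, 36, 15, 20, 35, 30, 5])      # profit ('000 Tk)
n = len(x)

# Computational formula: [Σxy - ΣxΣy/n] / sqrt([Σx² - (Σx)²/n][Σy² - (Σy)²/n])
numerator = (x * y).sum() - x.sum() * y.sum() / n
denominator = np.sqrt((x ** 2).sum() - x.sum() ** 2 / n) * np.sqrt((y ** 2).sum() - y.sum() ** 2 / n)
print("r (formula):", numerator / denominator)  # about 0.81

# Cross-check with SciPy's built-in implementation
r, p_value = pearsonr(x, y)
print("r (SciPy):  ", r)
```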
For Practice:
Example: The following data relate to the prices and supplies of a commodity during a period of
eight years:
Price (Tk/kg):   10 12 18 16 15 19 18 17
Supply (100 kg): 30 35 45 44 42 48 47 46
Calculate the coefficient of correlation between price and supply and interpret the result.
Scatter Diagram
A scatter diagram is a graphical device used to analyze the relationship between two variables. It
is constructed by plotting the pairs of observations on the two variables, with one variable along
the X-axis and the other along the Y-axis. It is a simple and effective way to identify the nature of
the relationship: positive, negative, linear or non-linear.
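As a quick illustration, the practice data above (price and supply) can be plotted as a scatter
diagram. This is a minimal sketch assuming matplotlib is installed; the labels simply follow the
units given in the exercise.

```python
# Minimal sketch: scatter diagram of the price/supply practice data.
import matplotlib.pyplot as plt

price = [10, 12, 18, 16, 15, 19, 18, 17]    # Tk/kg
supply = [30, 35, 45, 44, 42, 48, 47, 46]   # 100 kg

plt.scatter(price, supply)
plt.xlabel("Price (Tk/kg)")
plt.ylabel("Supply (100 kg)")
plt.title("Scatter diagram of price vs. supply")
plt.show()  # the points rise from left to right, suggesting positive correlation
```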
The scatter diagrams below show how different patterns of data produce different degrees of
correlation.
Perfect positive: If r = + 1, there is a perfect positive linear relationship between x and y variables;
all data points fall exactly on a straight line. The slope of the line is positive.
[Scatter diagram: r = +1]
Perfect negative: If r = -1, there is a perfect negative linear relationship between x and y variables;
all points lie on the line. The slope of the line is negative.
[Scatter diagram: r = -1]
Strong Positive correlation: A set of data pairs (x, y) for which as x increases, y tends to increase.
The closer r is to +1, say + 0.88, the stronger the positive association between the two variables.
[Scatter diagram: r = +0.88]
Strong Negative correlation: A set of data pairs (x, y) for which as x increases, y tends to decrease.
The closer r is to -1, say -0.94, the stronger the negative association between the two variables.
[Scatter diagram: r = -0.94]
Weak Positive Correlation: The value of Y increases slightly as the value of X increases. The farther
r is from +1 (i.e., the closer it is to 0), say r = +0.23, the weaker the positive association
between the two variables.
[Scatter diagram: r = +0.23]
Weak Negative Correlation: The value of Y decreases slightly as the value of X increases. The farther
r is from -1 (i.e., the closer it is to 0), say r = -0.22, the weaker the negative association
between the two variables.
[Scatter diagram: r = -0.22]
No correlation: A set of data pairs (x, y) for which there is no clear pattern between x and y . There
is no linear relationship among the points of the scatter diagram.
If the data points fall in a random pattern, then the correlation is equal to zero.
The price of milk and the price of pens have zero correlation with each other: if the price of pens
increases, nothing can be inferred about the price of milk, and vice versa, because their prices are
simply not dependent on each other.
[Scatter diagram: r = 0.0005]
In addition:
Suppose that the prices of coffee and of computers are observed and found to have a correlation
of +0.0005. This means that there is essentially no correlation, or relationship, between the
two variables.
A zero correlation exists when there is no relationship between two variables. For example, there
is no relationship between the amount of tea drunk and the level of intelligence.
[Scatter diagram: r = +0.06]
No correlation:
If r = 0, either the variables are independent or the relationship between the variables is not
linear.
If two variables are independent, then r = 0.
If r = 0, it does not mean that the variables are unrelated; it simply says that the relationship
is not linear. For example, if Y is related to X by Y = X² (with the X values symmetric about zero),
the value of r will be zero although the relationship between X and Y is perfect and quadratic.
[Scatter diagram: r = 0]
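A quick numerical illustration of this point, using a hypothetical data set (not from the text): take
X values symmetric about zero and let Y = X²; the computed r is exactly zero even though Y is
completely determined by X.

```python
# Minimal sketch: r = 0 for a perfect but non-linear (quadratic) relationship.
import numpy as np

x = np.arange(-5, 6)   # -5, -4, ..., 5 (symmetric about zero)
y = x ** 2             # Y is perfectly determined by X, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(r)               # 0.0, i.e. no linear relationship despite Y = X²
```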
....................................................................................
Property 1: The correlation coefficient lies between -1 and +1, i.e., -1 \le r \le +1.

In other words, since a sum of squares can never be negative,

\sum \left( \frac{X_i - \bar{X}}{s_x} \pm \frac{Y_i - \bar{Y}}{s_y} \right)^{2} \ge 0

\Rightarrow \frac{\sum (X_i - \bar{X})^{2}}{s_x^{2}} + \frac{\sum (Y_i - \bar{Y})^{2}}{s_y^{2}} \pm \frac{2 \sum (X_i - \bar{X})(Y_i - \bar{Y})}{s_x s_y} \ge 0

or, equivalently,

\frac{n s_x^{2}}{s_x^{2}} + \frac{n s_y^{2}}{s_y^{2}} \pm \frac{2 n r_{xy} s_x s_y}{s_x s_y} \ge 0 \;\Rightarrow\; 2n \pm 2n r_{xy} \ge 0 \;\Rightarrow\; -1 \le r_{xy} \le +1.
Property 2:
Assumptions for Testing the Significance of the Linear Correlation Coefficient
1. The data are quantitative and are obtained from a simple random sample.
2. The scatter plot shows that the data are approximately linearly related.
3. There are no outliers in the data.
4. The variables x and y must come from normally distributed populations.
Test of Significance:
Hypothesis:
H_0: \rho = 0 (this null hypothesis means that there is no correlation between the x and y variables
in the population)
H_1: \rho \ne 0 (this alternative hypothesis means that there is a significant correlation between the
variables in the population)

The test statistic is

t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}},

which is distributed as Student's t with n - 2 df.
Conclusion: If the computed value of |t| is greater than the tabulated t with the same df at the 5%
(or 1%) level of significance, then H_0 is rejected at that level; otherwise, H_0 is not rejected.
…………………….
When the null hypothesis is rejected at a specific level, it means that there is a significant
difference between the value of r and 0. When the null hypothesis is not rejected, it means
that the value of r is not significantly different from 0 (zero) and is probably due to chance.
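As an illustration, the test can be applied to the investment/profit example above (r ≈ 0.81, n = 10).
The sketch below assumes SciPy is available for the critical value and p-value.

```python
# Minimal sketch: t-test of H0: rho = 0 for the investment/profit example.
import math
from scipy import stats

r, n = 0.808, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # test statistic with n - 2 df
t_crit = stats.t.ppf(0.975, df=n - 2)              # two-sided 5% critical value
p_value = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"t = {t:.2f}, critical t = {t_crit:.2f}, p-value = {p_value:.4f}")
# |t| = 3.88 > 2.31, so H0 is rejected: the correlation is significant at the 5% level.
```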
………………
Spearman’s Rank Correlation Coefficient
Sometimes we come across statistical series in which the variables under consideration are not
capable of quantitative measurement but can be arranged in serial order. This happens when we are
dealing with qualitative characteristics (attributes) such as honesty, beauty, judgement, TV
programmes, leadership ability, color, taste, merit, intelligence, efficiency, etc., which cannot be
measured quantitatively but can be arranged serially. In such situations Karl Pearson's coefficient of
correlation cannot be used as such. Charles Edward Spearman, a British psychologist, developed a
formula in 1904 that consists in obtaining the correlation coefficient between the ranks of the n
individuals in the two attributes under study. It is also appropriate when one or both variables are
ordinal or skewed.
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^{2} - 1)}   .................. (1)

where d_i is the difference between the corresponding ranks of x and y, and n is the total number of
pairs of observations of x and y.
r_s = +1 means that the rankings have perfect positive association: the two rankings are exactly alike.
r_s = 0 means that the rankings have no correlation or association.
r_s = -1 means that the rankings have perfect negative association: the two rankings are exactly
reversed.
We shall discuss below the method of computing Spearman's rank correlation coefficient r_s
under the following situations:
Case 1: When actual ranks are given
When the ranks themselves are given (as in Example 1 below), the differences d_i between corresponding
ranks are taken directly and formula (1) is applied.

Case 2: When ranks are not given
When we are given the actual data and not the ranks, it is necessary to assign ranks first. Ranks can
be assigned by taking either the highest value as 1 or the lowest value as 1. It is immaterial in which
way (descending or ascending) the ranks are assigned; however, the same approach should be followed
for all the variables under consideration.
In some cases it may be found necessary to assign equal rank to two or more individuals or entries.
In such cases, it is customary to give each individual or entry an average rank. Thus if two
individuals are ranked equal at fifth place, they are each given the rank (5 + 6)/2 = 5.5, while if
three are ranked equal at fifth place, they are each given the rank (5 + 6 + 7)/3 = 6. In other words,
where two or more individuals are to be ranked equal, the rank assigned for the purpose of calculating
the coefficient of correlation is the average of the ranks which these individuals would have received
had they differed slightly from each other.
Where equal ranks are assigned to some entries, an adjustment in formula (1) for calculating the rank
coefficient of correlation is made. The adjustment consists of adding \frac{1}{12}(m^{3} - m) to the
value of \sum d_i^{2}, where m stands for the number of items whose ranks are common. If there is more
than one such group of items with common rank, this value is added as many times as the number of such
groups. The formula can thus be written as:

r_s = 1 - \frac{6\left[\sum_{i=1}^{n} d_i^{2} + \frac{1}{12}(m_1^{3} - m_1) + \frac{1}{12}(m_2^{3} - m_2) + \cdots\right]}{n(n^{2} - 1)}
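The tie-corrected formula above can be implemented in a few lines. The following is a minimal Python
sketch (the helper name spearman_with_ties is hypothetical, not from the text); scipy.stats.rankdata
assigns average ranks to tied values exactly as described above.

```python
# Minimal sketch: Spearman's r_s with the tie correction described above.
from collections import Counter
import numpy as np
from scipy.stats import rankdata

def spearman_with_ties(x, y):
    n = len(x)
    rx, ry = rankdata(x), rankdata(y)        # average ranks for tied values
    d2 = float(np.sum((rx - ry) ** 2))       # sum of squared rank differences
    # one (m^3 - m)/12 term for every group of m tied values, in x and in y
    correction = sum((m ** 3 - m) / 12.0
                     for data in (x, y)
                     for m in Counter(data).values() if m > 1)
    return 1 - 6 * (d2 + correction) / (n * (n ** 2 - 1))
```

Note that scipy.stats.spearmanr computes the coefficient as the Pearson correlation of the ranks; this
coincides with formula (1) when there are no ties but can differ slightly from the adjusted-Σd²
formula above when ties are present.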
Example 1: Ten competitors in a beauty contest are ranked by three judges in the
following order:
Judge 1 : 1 6 5 10 3 2 4 9 7 8
Judge 2 : 3 5 8 4 7 10 2 1 6 9
Judge 3 : 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to determine which pair of judges has the nearest
approach to common tastes in beauty.
Solution:
The pair of judges who have the nearest approach to common choice in beauty can be
obtained in
3C2 = 3 ways as follows:
(i) Judge 1 and judge 2
(ii) Judge 2 and judge 3
(iii) Judge 3 and judge 1
Calculations for comparing their rankings are shown below:

Judge 1 (R1)  Judge 2 (R2)  Judge 3 (R3)  (R1 - R2)²  (R2 - R3)²  (R3 - R1)²
1             3             6             4           9           25
6             5             4             1           1           4
5             8             9             9           1           16
10            4             8             36          16          4
3             7             1             16          36          4
2             10            2             64          64          0
4             2             3             4           1           1
9             1             10            64          81          1
7             6             5             1           1           4
8             9             7             1           4           1
Total                                     200         214         60

Applying formula (1) with n = 10:
r_s(Judge 1, Judge 2) = 1 - (6 × 200)/(10 × 99) = -0.21
r_s(Judge 2, Judge 3) = 1 - (6 × 214)/(10 × 99) = -0.30
r_s(Judge 3, Judge 1) = 1 - (6 × 60)/(10 × 99) = +0.64
Comment: Since r_s(Judge 3, Judge 1) = +0.64 is the largest of the three coefficients, judges 1 and 3
have the nearest approach to a common choice for beauty.
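Since these ranks contain no ties, scipy.stats.spearmanr reproduces formula (1) exactly; a minimal
sketch to verify the three coefficients:

```python
# Minimal sketch: pairwise rank correlations between the three judges.
from scipy.stats import spearmanr

j1 = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
j2 = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
j3 = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]

for label, a, b in [("1 & 2", j1, j2), ("2 & 3", j2, j3), ("3 & 1", j3, j1)]:
    rs, _ = spearmanr(a, b)
    print(f"Judges {label}: rs = {rs:+.3f}")
# Judges 1 & 2: -0.212, Judges 2 & 3: -0.297, Judges 3 & 1: +0.636
```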
Example 2: Calculate Spearman's rank correlation coefficient between advertisement cost and
sales from the following data:
Advertisement cost (‘000Tk.): 39 65 62 90 82 75 25 98 36 78
Sales (lakhs Tk.): 47 53 58 86 62 68 60 91 51 84
Solution:
Let X i denote the advertisement cost (‘000 Tk.) and Yi denote the sales (lakhs Tk.).
Xi      Yi      Rank of Xi (xi)   Rank of Yi (yi)   di = xi - yi   di²
39      47      8                 10                -2             4
65      53      6                 8                 -2             4
62      58      7                 7                 0              0
90      86      2                 2                 0              0
82      62      3                 5                 -2             4
75      68      5                 4                 1              1
25      60      10                6                 4              16
98      91      1                 1                 0              0
36      51      9                 9                 0              0
78      84      4                 3                 1              1
Total                                               Σdi = 0        Σdi² = 30
Here n =10
We have, from (1),

r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^{2} - 1)} = 1 - \frac{6 \times 30}{10 \times 99} = 1 - 0.18 = 0.82.
Example 3: Calculate Spearman’s rank Correlation for the following data
X 80 73 80 36 54 93 65 36 58 80
Y 15 83 15 64 32 16 67 64 85 64
Solution:
Prepare the following table and calculate the di's and di²'s as given below:
Xi      Yi      Rank of Xi (xi)   Rank of Yi (yi)   di = xi - yi   di²
80      15      8                 1.5               6.5            42.25
73      83      6                 9                 -3             9
80      15      8                 1.5               6.5            42.25
36      64      1.5               6                 -4.5           20.25
54      32      3                 4                 -1             1
93      16      10                3                 7              49
65      67      5                 8                 -3             9
36      64      1.5               6                 -4.5           20.25
58      85      4                 10                -6             36
80      64      8                 6                 2              4
Total                                                              Σdi² = 233
Equivalently, assigning the ranks in descending order (the highest value getting rank 1) gives the
same di² values:

Xi      Yi      Rank of Xi (xi)   Rank of Yi (yi)   di = xi - yi   di²
80      15      3                 9.5               -6.5           42.25
73      83      5                 2                 3              9
80      15      3                 9.5               -6.5           42.25
36      64      9.5               5                 4.5            20.25
54      32      8                 7                 1              1
93      16      1                 8                 -7             49
65      67      6                 3                 3              9
36      64      9.5               5                 4.5            20.25
58      85      7                 1                 6              36
80      64      3                 5                 -2             4
Total                                                              Σdi² = 233
Calculate the rank correlation coefficient as follows. X has one group of three tied values (80) and
one group of two (36); Y has one group of two tied values (15) and one group of three (64). Hence

r_s = 1 - \frac{6\left[233 + \frac{1}{12}(2^{3} - 2) + \frac{1}{12}(3^{3} - 3) + \frac{1}{12}(2^{3} - 2) + \frac{1}{12}(3^{3} - 3)\right]}{10(10^{2} - 1)} = 1 - \frac{6 \times 238}{990} \approx -0.44.
The significance of the rank correlation coefficient is tested by a t-test, as is done in the case of
Karl Pearson's correlation coefficient.
Hypothesis: H_0: there is no correlation between the two sets of rankings, against H_1: there is some
correlation between them.

The test statistic is

t = \frac{r_s \sqrt{n-2}}{\sqrt{1 - r_s^{2}}} \sim t_{n-2},

i.e., t follows Student's t distribution with n - 2 df, where n is the number of paired observations
and r_s is the rank correlation.
Comment: If |t| > tcrit , then we reject the null hypothesis of no correlation between the two sets of
ranking in favour of the alternative. We conclude that there is statistical evidence to suggest that
there is some correlation between the rankings.
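For example, applying this test to Example 2 (r_s ≈ 0.82, n = 10); a minimal sketch assuming SciPy is
available:

```python
# Minimal sketch: t-test for the rank correlation of Example 2.
import math
from scipy import stats

rs, n = 0.82, 10
t = rs * math.sqrt(n - 2) / math.sqrt(1 - rs ** 2)   # test statistic with n - 2 df
t_crit = stats.t.ppf(0.975, df=n - 2)                # two-sided 5% critical value

print(f"t = {t:.2f} vs critical t = {t_crit:.2f}")
# t = 4.05 > 2.31, so the rank correlation between advertisement cost and sales
# is significant at the 5% level.
```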