Lecture Sheet H
Correlation Analysis
1. Correlation:
If two (or more) quantities vary in such a way that movements in one are
accompanied by movements in the other (or others), these quantities are said to be
correlated.
For example:
(i) there exists some relationship between family income and expenditure on
luxury items;
(ii) price of a commodity and amount demanded;
(iii) increase in rainfall up to a point and production of rice, etc.
The statistical tool with the help of which the relationship between two (or more than two) variables is studied is called correlation.
A. M. Tuttle defines correlation as: “An analysis of the covariation of two or more
variables is usually called correlation.”
The measure of correlation is called the correlation coefficient or coefficient of
correlation.
2. Types of Correlation:
Correlation is described or classified in several different ways. Three of the most
important are:
(i) Positive and negative;
(ii) Simple, partial and multiple; and
(iii) Linear and non-linear.
If both the variables (e.g., x and y ) are varying in the same direction, i.e., if one
variable is increasing the other on an average is also increasing or, if one variable
is decreasing the other on an average is also decreasing, correlation is said to be
positive. The following examples would illustrate positive correlation:
Positive correlation:          Positive correlation:
  x     y                        x     y
 10    15                       80    50
 12    20                       70    45
 11    22                       60    30
 18    25                       40    20
 20    37                       30    10
If, on the other hand, the variables vary in opposite directions, i.e., as one variable increases the other decreases or vice versa, correlation is said to be negative. The following examples illustrate negative correlation:
Negative correlation:          Negative correlation:
  x     y                        x     y
 20    40                      100    10
 30    30                       90    20
 40    22                       60    30
 60    15                       40    40
 80    16                       30    50
Correlation is simple when only two variables are studied, and multiple when three or more variables are studied simultaneously. In partial correlation we recognize more than two variables, but consider only two of them to be influencing each other, the effect of the other influencing variables being kept constant.
If the amount of change in one variable tends to bear a constant ratio to the amount
of change in the other variable, then the correlation is said to be linear. For
example, observe the following two variables x and y :
x: 10 20 30 40 50
y: 70 140 210 280 350
It is clear that the ratio of change between the two variables is the same. If such
variables are plotted on a graph paper, all the plotted points would fall on a straight
line.
Correlation would be called non-linear or curvilinear if the amount of change in
one variable does not bear a constant ratio to the amount of change in the other
variable.
It may be pointed out that in most practical cases we find a non-linear relationship
between the variables. However, since techniques of analysis for measuring non-
linear correlation are far more complicated than those for linear correlation, we
generally make an assumption that the relationship between the variables is of the
linear type.
Of these methods, the first is based on graphical representation, whereas the others are mathematical. Each of these methods is discussed in detail in the following pages.
3. Karl Pearson's Correlation Coefficient:
Suppose x and y are two related variables, and let (x_1, y_1), (x_2, y_2), ……, (x_n, y_n) be n pairs of values recorded from n sample points on (x, y). Then Karl Pearson's correlation coefficient between the variables x and y, usually denoted by r_xy, is given by

    r_xy = Cov(x, y) / √[ Var(x) · Var(y) ]

where
    Cov(x, y) = (1/n) Σ (x_i − x̄)(y_i − ȳ),
    Var(x) = s_x² = (1/n) Σ (x_i − x̄)²,
    Var(y) = s_y² = (1/n) Σ (y_i − ȳ)²,
    r_xy = correlation coefficient between x and y.

Note: We have

    (i)   r_xy = [ (1/n) Σ (x_i − x̄)(y_i − ȳ) ] / √[ (1/n) Σ (x_i − x̄)² · (1/n) Σ (y_i − ȳ)² ]

    (ii)  r_xy = Σ (x_i − x̄)(y_i − ȳ) / √[ Σ (x_i − x̄)² · Σ (y_i − ȳ)² ]

    (iii) r_xy = [ Σ x_i y_i − (Σ x_i)(Σ y_i)/n ] / √{ [ Σ x_i² − (Σ x_i)²/n ] · [ Σ y_i² − (Σ y_i)²/n ] } ……..(*)

(*) is usually considered the working formula for calculating the correlation coefficient between x and y. r_xy is sometimes called the product moment correlation coefficient, the total correlation coefficient, or the coefficient of correlation.
By symmetry it can easily be shown that r_xy = r_yx; r_xy is sometimes denoted simply by r.
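The working formula (*) translates directly into code. A minimal sketch in Python (the function name and data are illustrative, not part of the sheet; a perfectly linear pair such as y = 7x should give r = 1):

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation coefficient via the working formula (*)."""
    n = len(x)
    num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n)
               * (sum(b * b for b in y) - sum(y) ** 2 / n))
    return num / den

# The linear example from the text (y = 7x) gives perfect correlation
print(pearson_r([10, 20, 30, 40, 50], [70, 140, 210, 280, 350]))  # 1.0
```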
4. Properties of the Coefficient of Correlation:
The following are important properties of the coefficient of correlation, r :
1. The coefficient of correlation lies between −1 and +1. Symbolically, −1 ≤ r ≤ +1, or | r | ≤ 1.
2. The coefficient of correlation is independent of change of origin and scale.
3. The coefficient of correlation is the geometric mean of the two regression coefficients. Symbolically, r = ±√[ β_2(yx) · β_2(xy) ].
4. If x and y are independent variables then coefficient of correlation is zero.
However, the converse is not true.
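Property 2 is easy to verify numerically. A minimal sketch in Python, reusing the working formula; the data are the first positive-correlation table above, and the particular origin and scale shifts are illustrative:

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation coefficient via the working formula."""
    n = len(x)
    num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n)
               * (sum(b * b for b in y) - sum(y) ** 2 / n))
    return num / den

x = [10, 12, 11, 18, 20]
y = [15, 20, 22, 25, 37]

r1 = pearson_r(x, y)
# Change of origin (subtract 5) and scale (divide by 2) on x,
# change of origin (subtract 10) on y: r is unchanged.
r2 = pearson_r([(a - 5) / 2 for a in x], [b - 10 for b in y])
print(round(r1, 6) == round(r2, 6))  # True
```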
Problem/Assignment:
Problem 1 (a): The following sample data relate to advertising expenditure (in
lakh Tk.) and their corresponding sales (in crore Tk.)
Advertising Sales
Expenditure
10 14
12 17
15 23
23 25
20 21
Solution:
Calculation of Correlation Coefficient

Adv. Exp. (x)  Sales (y)     x²       y²       xy
     10           14         100      196      140
     12           17         144      289      204
     15           23         225      529      345
     23           25         529      625      575
     20           21         400      441      420
 Σx_i = 80    Σy_i = 100  Σx_i² = 1398  Σy_i² = 2080  Σx_i y_i = 1684

Here n = 5. Therefore,

    r = [ Σ x_i y_i − (Σ x_i)(Σ y_i)/n ] / √{ [ Σ x_i² − (Σ x_i)²/n ] · [ Σ y_i² − (Σ y_i)²/n ] }
      = (1684 − 80 × 100/5) / √[ (1398 − 80²/5)(2080 − 100²/5) ]
      = 84 / √(118 × 80)
      = 0.865

There exists a high degree of positive correlation between sales and advertising expenditure.
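The arithmetic above can be cross-checked in a few lines of Python (a quick sketch; the variable names are illustrative):

```python
from math import sqrt

x = [10, 12, 15, 23, 20]   # advertising expenditure (lakh Tk.)
y = [14, 17, 23, 25, 21]   # sales (crore Tk.)
n = len(x)

# Working formula: numerator and denominator of r
num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n)
           * (sum(b * b for b in y) - sum(y) ** 2 / n))
print(round(num / den, 3))  # 0.865
```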
Regression Analysis
1. Regression analysis:
A. Regression analysis is a technique of studying the dependence of one variable (called the dependent variable) on one or more other variables (called independent or explanatory variables), with a view to estimating or predicting the average value of the dependent variable in terms of the known or fixed values of the independent variables.
B. The statistical tool with the help of which we are in a position to estimate (or
predict) the unknown values of one variable from known values of another
variable is called regression. With the help of regression analysis, we are in a
position to find out the average probable change in one variable given a certain
amount of change in another.
Regression analysis is generally classified into two kinds: simple and multiple.
Simple regression involves only two variables, one of which is dependent variable
and the other is explanatory variable (independent variable). The associated model
in the case of simple regression will be a simple regression model.
A regression analysis may involve a linear model or a nonlinear model. The term linear can be interpreted in two different ways: (i) linear in the variables and (ii) linear in the parameters. Of these, linearity in the parameters is what is relevant in regression analysis. For example, y = β_1 + β_2 x² + u is nonlinear in the variable x but linear in the parameters, and is therefore treated as a linear regression model.
4. Classical linear regression model / Population regression model:

    Y_i = β_1 + β_2 X_2i + β_3 X_3i + … + β_k X_ki + u_i ,   i = 1, 2, 3, …, n   ……(*)

where
    Y = the dependent variable,
    X_2 to X_k = the explanatory variables or regressors,
    β_1 to β_k = parameters; β_1 = the intercept, β_2 to β_k = the partial regression coefficients,
    u = stochastic disturbance term,
    i = ith observation,
    n = size of the population.
6. Interpretation of the model:

    Y_i = β_1 + β_2 X_2i + β_3 X_3i + … + β_k X_ki + u_i ,   i = 1, 2, 3, …, n
    E(Y_i) = β_1 + β_2 X_2i + β_3 X_3i + … + β_k X_ki

It gives the average or mean value of Y for the fixed (in repeated sampling) values of the X variables.
In the two-variable case the model is

    y = β_1 + β_2 x + u

where
    y = the dependent variable,
    x = the independent or explanatory variable,
    β_1 and β_2 = parameters,
    u = stochastic disturbance term.
Assumptions of the model:
1. The disturbance term has zero mean, i.e., E(u_i) = 0 for all i.
2. The disturbance term has a common variance, i.e., Var(u_i) = σ² for all i.
3. The random terms of different observations (u_i, u_j) are independent, i.e., u_i and u_j are independent for all i ≠ j. Symbolically, cov(u_i, u_j) = 0 for all i ≠ j.
4. The disturbance term is not correlated with the explanatory variable, i.e., u_i and x_i are independent. Symbolically, cov(u_i, x_i) = 0.
5. The u_i are normally distributed for all i. In conjunction with assumptions 1, 2 and 3, this implies that the u_i are independently and normally distributed with mean zero and a common variance σ². We write this as u_i ~ IN(0, σ²).
Important Note:
(1) We have

    y = β_1 + β_2 x + u ………….(i)

This is the population regression model. Eq. (i) is also called the stochastic population regression function (PRF). We can express the SRF in its stochastic form as follows:

    y_i = β̂_1 + β̂_2 x_i + û_i ………….(ii)

where
    β̂_1 = estimator of β_1,
    β̂_2 = estimator of β_2,
    û_i = the estimator of u_i.

û_i denotes the (sample) residual term, or simply the residual. Eq. (ii) is called the sample regression model.
We have

    E(y | x) = β_1 + β_2 x ………….(iii)

which is called the population regression equation. Eq. (iii) is also called the deterministic, or nonstochastic, population regression function (PRF). The sample regression function (SRF) in nonstochastic form, i.e., the sample counterpart of Eq. (iii), may be written as

    ŷ_i = β̂_1 + β̂_2 x_i ………….(iv)

where
    ŷ_i = estimator of E(y | x).

(2) Our primary objective in regression analysis is to estimate the (stochastic) PRF

    y_i = β_1 + β_2 x_i + u_i

on the basis of the SRF

    y_i = β̂_1 + β̂_2 x_i + û_i
9. Methods for estimating the regression parameters β_1 and β_2:
1. The method of moments.
2. The method of least squares.
3. The method of maximum likelihood.

In the method of least squares, the estimates β̂_1 and β̂_2 are chosen so that the residual sum of squares

    Σ û_i² = Σ ( y_i − β̂_1 − β̂_2 x_i )² ………….(i)

is minimized. This minimization can easily be achieved by using differential calculus. We require that

    ∂(Σ û_i²)/∂β̂_1 = 0  and  ∂(Σ û_i²)/∂β̂_2 = 0.

Differentiating (i) with respect to β̂_1 and β̂_2, and setting the derivatives equal to zero, we have

    −2 Σ ( y_i − β̂_1 − β̂_2 x_i ) = 0

and

    −2 Σ ( y_i − β̂_1 − β̂_2 x_i ) x_i = 0.

That is,

    Σ y_i = n β̂_1 + β̂_2 Σ x_i ………….(ii)
    Σ x_i y_i = β̂_1 Σ x_i + β̂_2 Σ x_i² ………….(iii)

Solving (ii) and (iii) gives

    β̂_2 = [ Σ x_i y_i − (Σ x_i)(Σ y_i)/n ] / [ Σ x_i² − (Σ x_i)²/n ] ………(iv)

and

    β̂_1 = ȳ − β̂_2 x̄ ……………(v)

where x̄ = Σ x_i / n and ȳ = Σ y_i / n.
Equations (iv) and (v) define the least squares estimates of β_1 and β_2, or (geometrically) the line of best fit. The method described is called ordinary least squares (OLS).
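Formulas (iv) and (v) translate directly into code. A minimal OLS sketch in Python (the function name and check data are illustrative; data lying exactly on a line should recover that line):

```python
def ols(x, y):
    """Least squares estimates via formulas (iv) and (v)."""
    n = len(x)
    b2 = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) \
         / (sum(a * a for a in x) - sum(x) ** 2 / n)   # slope, (iv)
    b1 = sum(y) / n - b2 * sum(x) / n                  # intercept, (v)
    return b1, b2

# Illustrative check: data lying exactly on y = 3 + 2x
b1, b2 = ols([1, 2, 3, 4], [5, 7, 9, 11])
print(b1, b2)  # 3.0 2.0
```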
Note: We have until now considered the regression of y on x. This is called the direct regression. Sometimes one has to consider the regression of x on y as well. This is called the reverse regression. In that case we have

    E(x | y) = β_1 + β_2 y
    x̂_i = β̂_1 + β̂_2 y_i
    x_i = β_1 + β_2 y_i + u_i
    x_i = β̂_1 + β̂_2 y_i + û_i

and

    β̂_2 = [ Σ x_i y_i − (Σ x_i)(Σ y_i)/n ] / [ Σ y_i² − (Σ y_i)²/n ]
    β̂_1 = x̄ − β̂_2 ȳ ……………(v)

where x̄ = Σ x_i / n and ȳ = Σ y_i / n.
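Direct and reverse regression give different lines, but the product of the two slope estimates equals r², consistent with property 3 of the correlation coefficient. A quick numerical sketch (the data are the advertising/sales figures from Problem 1; the variable names are illustrative):

```python
from math import sqrt

x = [10, 12, 15, 23, 20]
y = [14, 17, 23, 25, 21]
n = len(x)

# Corrected sums of products and squares
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
syy = sum(b * b for b in y) - sum(y) ** 2 / n

b_yx = sxy / sxx          # slope of the direct regression (y on x)
b_xy = sxy / syy          # slope of the reverse regression (x on y)
r = sxy / sqrt(sxx * syy)

print(round(b_yx * b_xy, 6) == round(r * r, 6))  # True
```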
C. Correlation vs. Regression:

1. Correlation literally means the relationship between two or more variables which vary in sympathy, so that movements in one tend to be accompanied by corresponding movements in the other(s). Regression means stepping back or returning to the average value, and is a mathematical measure expressing the average relationship between the two variables.

2. Correlation need not imply a cause and effect relationship between the variables under study. Regression analysis, however, clearly indicates the cause and effect relationship between the variables: the variable corresponding to cause is taken as the independent variable and the variable corresponding to effect is taken as the dependent variable.

3. Correlation analysis is confined only to the study of linear relationship between the variables and therefore has limited applications. Regression analysis has much wider applications, as it studies linear as well as non-linear relationships between the variables.

4. There may be non-sense correlation between two variables which is due to pure chance and has no practical relevance, e.g., the correlation between rainfall and the intelligence of a group of individuals. There is no such thing as non-sense regression.

5. The correlation coefficient r(x, y) between two variables x and y is a measure of the direction and degree of the linear relationship between them, and this relationship is mutual. It is symmetric, i.e., r(x, y) = r(y, x), and it is immaterial which of x and y is the dependent variable and which is the independent variable. Regression analysis, by contrast, aims at establishing the functional relationship between the two variables under study and then using this relationship to predict or estimate the value of the dependent variable for any given value of the independent variable. It also reflects the nature of the variables, i.e., which is the dependent variable and which is the independent variable. Regression coefficients are not symmetric in x and y, i.e., β_2(yx) ≠ β_2(xy).

6. The correlation coefficient r(x, y) is a relative measure of the linear relationship between x and y and is independent of the units of measurement. It is a pure number lying between ±1. The regression coefficients β_2(yx) and β_2(xy) are absolute measures representing the change in the value of the variable y (or x) for a unit change in the value of the variable x (or y).
Problem/Assignment:
Problem 1(b):
The following sample data relate to advertising expenditure (in lakh Tk.) and their
corresponding sales (in crore Tk.)
Advertising Sales
Expenditure
10 14
12 17
15 23
23 25
20 21
Solution:
Let the population regression model of y (sales) on x (advertising expenditure) be

    y = β_1 + β_2 x + u

The least squares estimates are

    β̂_2 = [ Σ x_i y_i − (Σ x_i)(Σ y_i)/n ] / [ Σ x_i² − (Σ x_i)²/n ]

and

    β̂_1 = ȳ − β̂_2 x̄ ,  where x̄ = Σ x_i / n and ȳ = Σ y_i / n.

Here n = 5, Σ x_i = 80, Σ y_i = 100, Σ x_i² = 1398, Σ y_i² = 2080, Σ x_i y_i = 1684.

Now we have

    x̄ = 80/5 = 16,  ȳ = 100/5 = 20,

    β̂_2 = (1684 − 80 × 100/5) / (1398 − 80²/5) = 84/118 = 0.712

and

    β̂_1 = 20 − (0.712)(16) = 8.61.

Hence the fitted regression line is ŷ = 8.61 + 0.712 x.
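The estimates can be cross-checked in Python (a quick sketch; the variable names are illustrative):

```python
x = [10, 12, 15, 23, 20]   # advertising expenditure (lakh Tk.)
y = [14, 17, 23, 25, 21]   # sales (crore Tk.)
n = len(x)

# Slope and intercept via formulas (iv) and (v)
b2 = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) \
     / (sum(a * a for a in x) - sum(x) ** 2 / n)
b1 = sum(y) / n - b2 * sum(x) / n

print(round(b2, 3), round(b1, 2))  # 0.712 8.61
```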
Problem/Assignment:
Problem 2:
Consider the following hypothetical data on weekly family consumption
expenditure y and family income x :
y ($) x ($)
70 80
65 100
90 120
95 140
110 160
115 180
120 200
140 220
155 240
150 260
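The sheet leaves Problem 2 as an exercise. As a check, applying the OLS formulas above to these data (a sketch; variable names are illustrative):

```python
y = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]      # consumption ($)
x = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]   # income ($)
n = len(x)

# Slope and intercept via formulas (iv) and (v)
b2 = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) \
     / (sum(a * a for a in x) - sum(x) ** 2 / n)
b1 = sum(y) / n - b2 * sum(x) / n

print(round(b2, 4), round(b1, 4))  # 0.5091 24.4545
```

So weekly consumption rises by about $0.51 for every extra dollar of weekly income.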