
Lecture Sheet H

Chapter 8: Correlation Analysis and Regression Analysis

Correlation Analysis
1. Correlation:
If two (or more) quantities vary in such a way that movements in one are
accompanied by movements in the other (or others), these quantities are said to be
correlated.
For example:
(i) there exists some relationship between family income and expenditure on
luxury items;
(ii) price of a commodity and amount demanded;
(iii) increase in rainfall up to a point and production of rice, etc.
The statistical tool with the help of which these relationships between two (or
more than two) variables are studied is called correlation.
A. M. Tuttle defines correlation as: “An analysis of the covariation of two or more
variables is usually called correlation.”
The measure of correlation is called the correlation coefficient or coefficient of
correlation.

2. Types of Correlation:
Correlation is described or classified in several different ways. Three of the most
important are:
(i) Positive and negative;
(ii) Simple, partial and multiple; and
(iii) Linear and non-linear.

(i) Positive and Negative Correlation:


Whether correlation is positive (direct) or negative (inverse) depends upon
the direction of change of the variables.

If both the variables (e.g., x and y ) are varying in the same direction, i.e., if one
variable is increasing the other on an average is also increasing or, if one variable
is decreasing the other on an average is also decreasing, correlation is said to be
positive. The following examples would illustrate positive correlation:

Positive correlation        Positive correlation
  x      y                     x      y
 10     15                    80     50
 12     20                    70     45
 11     22                    60     30
 18     25                    40     20
 20     37                    30     10

If, on the other hand, the variables are varying in opposite directions, i.e., as one
variable is increasing the other is decreasing, or vice versa, correlation is said to be
negative. The following examples would illustrate negative correlation:

Negative correlation        Negative correlation
  x      y                     x      y
 20     40                   100     10
 30     30                    90     20
 40     22                    60     30
 60     15                    40     40
 80     16                    30     50

(ii) Simple, Partial and Multiple Correlation:


The distinction between simple, partial and multiple correlation is based upon the
number of variables studied.

When only two variables are studied, it is a problem of simple correlation. When
three or more variables are studied, it is a problem of either multiple or partial
correlation.

In multiple correlation, three or more variables are studied simultaneously. For
example, when we study the relationship between the yield of rice per acre and
both the amount of rainfall and the amount of fertilizers used, it is a problem of
multiple correlation.

In partial correlation we recognize more than two variables, but consider only two
of them to be influencing each other, the effect of the other influencing variables
being kept constant.
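To make the idea concrete, here is a minimal computational sketch (in Python with NumPy, which this sheet does not otherwise use) of the standard first-order partial correlation formula $r_{xy \cdot z} = \dfrac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$. The formula and the function names are supplied here for illustration; they are not derived in this sheet.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    num = np.sum((a - a.mean()) * (b - b.mean()))
    den = np.sqrt(np.sum((a - a.mean()) ** 2) * np.sum((b - b.mean()) ** 2))
    return num / den

def partial_r(x, y, z):
    """Correlation of x and y, holding the influence of z constant."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))
```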

(iii) Linear and Non-linear (Curvilinear) Correlation:


The distinction between linear and non-linear correlation is based upon the
constancy of the ratio of change between the variables.

If the amount of change in one variable tends to bear a constant ratio to the amount
of change in the other variable, then the correlation is said to be linear. For
example, observe the following two variables x and y :

x: 10 20 30 40 50
y: 70 140 210 280 350

It is clear that the ratio of change between the two variables is the same (here
y = 7x throughout). If such variables are plotted on graph paper, all the plotted
points would fall on a straight line.
Correlation would be called non-linear or curvilinear if the amount of change in
one variable does not bear a constant ratio to the amount of change in the other
variable.

It may be pointed out that in most practical cases we find a non-linear relationship
between the variables. However, since techniques of analysis for measuring non-
linear correlation are far more complicated than those for linear correlation, we
generally make an assumption that the relationship between the variables is of the
linear type.

3. Methods of studying correlation:


The following are the important methods of ascertaining whether two variables are
correlated or not:

(i) Scatter diagram method;


(ii) Karl Pearson’s coefficient of correlation;
(iii) Spearman’s rank correlation coefficient; and
(iv) Method of least squares.

Of these, the first is based on the knowledge of graphs, whereas the others are
mathematical methods. Each of these methods shall be discussed in detail in
the following pages.

(i) Scatter diagram method:


The diagrammatic way of representing bivariate data is called a scatter diagram.
Suppose $x$ and $y$ are two variables related to each other, and let
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be $n$ pairs of values of $(x, y)$. If the values are plotted along
the $x$-axis and $y$-axis respectively in the $xy$ plane, the diagram of dots so obtained
is known as a scatter diagram. From the scatter diagram we can form a fairly good,
though vague, idea of whether the variables are correlated or not; e.g., if the points
are very dense, i.e. very close to each other, we should expect a fairly good
amount of correlation between the variables, and if the points are widely scattered,
a poor correlation is expected. This method, however, is not suitable if the number
of observations is fairly large.
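As an illustration, a minimal plotting sketch (assuming Python with matplotlib is available) that draws a scatter diagram for the advertising/sales data used later in Problem 1(a):

```python
import matplotlib.pyplot as plt

x = [10, 12, 15, 23, 20]  # advertising expenditure (from Problem 1(a))
y = [14, 17, 23, 25, 21]  # corresponding sales

plt.scatter(x, y)                 # one dot per (x, y) pair
plt.xlabel("Advertising expenditure (x)")
plt.ylabel("Sales (y)")
plt.title("Scatter diagram")
plt.show()
```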

(ii) Karl Pearson’s Coefficient of Correlation:


As a measure of the intensity or degree of linear relationship between two variables,
Karl Pearson (1857-1936), a British biometrician, developed a formula called the
correlation coefficient.

Suppose $x$ and $y$ are two related variables, and let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be $n$ pairs of values recorded from $n$ sample points on $(x, y)$. Then Karl Pearson's correlation coefficient between the variables $x$ and $y$, usually denoted by $r_{xy}$, is given by

$$r_{xy} = \frac{\operatorname{Cov}(x, y)}{\sqrt{\operatorname{Var}(x) \cdot \operatorname{Var}(y)}}$$

where

$$\operatorname{Cov}(x, y) = \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y}), \qquad \operatorname{Var}(x) = s_x^2 = \frac{1}{n} \sum (x_i - \bar{x})^2, \qquad \operatorname{Var}(y) = s_y^2 = \frac{1}{n} \sum (y_i - \bar{y})^2,$$

and $r_{xy}$ = the correlation coefficient between $x$ and $y$.
rxy  Correlation Coefficient between x and y .

Note: We have

(i) $\displaystyle r_{xy} = \frac{\frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2 \cdot \frac{1}{n}\sum (y_i - \bar{y})^2}}$

(ii) $\displaystyle r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

(iii) $\displaystyle r_{xy} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sqrt{\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right)\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}}$ ……..(*)

(*) is usually considered the working formula for calculating the correlation coefficient between $x$ and $y$. $r_{xy}$ is sometimes called the product moment correlation coefficient, the total correlation coefficient, or the coefficient of correlation.

By symmetry it can easily be shown that $r_{xy} = r_{yx}$; $r_{xy}$ is sometimes denoted simply
by $r$.
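A direct translation of the working formula (*) into code might look as follows (a minimal Python/NumPy sketch; the function name is illustrative):

```python
import numpy as np

def correlation_coefficient(x, y):
    """Pearson's r via the working formula (*)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    num = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    den = np.sqrt((np.sum(x**2) - np.sum(x)**2 / n) *
                  (np.sum(y**2) - np.sum(y)**2 / n))
    return num / den
```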

4. Properties of the Coefficient of Correlation:
The following are important properties of the coefficient of correlation, $r$:
1. The coefficient of correlation lies between $-1$ and $+1$. Symbolically, $-1 \le r \le +1$, or $|r| \le 1$.
2. The coefficient of correlation is independent of change of origin and scale.
3. The coefficient of correlation is the geometric mean of the two regression coefficients. Symbolically, $r = \pm\sqrt{\beta_{2(yx)} \cdot \beta_{2(xy)}}$.
4. If $x$ and $y$ are independent variables, then the coefficient of correlation is zero. However, the converse is not true.
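Property 2 can be checked numerically. A small sketch, assuming positive scale factors (a negative scale factor would flip the sign of $r$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

r = np.corrcoef(x, y)[0, 1]
# u = (x - 3) / 5 and v = (y + 7) / 2: shifted origin, positive rescaling
r_uv = np.corrcoef((x - 3) / 5, (y + 7) / 2)[0, 1]
print(np.isclose(r, r_uv))  # True: r is unchanged
```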

5. Interpreting the Coefficient of Correlation:

The following general guidelines help in interpreting the value of $r$:
1. When $r = +1$, there is perfect positive correlation between the variables.
2. When $r = -1$, there is perfect negative correlation between the variables.
3. When $r = 0$, there is no correlation between the variables, i.e., the variables are uncorrelated.
4. The closer $r$ is to $+1$ or $-1$, the closer the relationship between the variables; the closer $r$ is to $0$, the weaker the relationship.

Problem/Assignment:
Problem 1 (a): The following sample data relate to advertising expenditure (in
lakh Tk.) and their corresponding sales (in crore Tk.)

Advertising expenditure   Sales
10                        14
12                        17
15                        23
23                        25
20                        21

Let advertising expenditure be denoted by x and sales by y.

Find the correlation coefficient between x and y, and comment.

Solution:
Calculation of Correlation Coefficient

Adv. Exp. (x)   Sales (y)    x²      y²      xy
10              14           100     196     140
12              17           144     289     204
15              23           225     529     345
23              25           529     625     575
20              21           400     441     420
Σxᵢ = 80        Σyᵢ = 100    Σxᵢ² = 1398    Σyᵢ² = 2080    Σxᵢyᵢ = 1684

The correlation coefficient between $x$ and $y$, denoted by $r$, is given by

$$r = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sqrt{\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right)\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}} = \frac{1684 - \frac{80 \times 100}{5}}{\sqrt{\left(1398 - \frac{80^2}{5}\right)\left(2080 - \frac{100^2}{5}\right)}} = \frac{84}{\sqrt{118 \times 80}} = 0.865$$


Comment: There is a high degree of positive correlation between sales and
advertising expenditure.
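A quick numerical check of this result, reusing the working formula (*) (a minimal sketch; the data are from the problem above):

```python
import numpy as np

x = np.array([10, 12, 15, 23, 20], float)  # advertising expenditure
y = np.array([14, 17, 23, 25, 21], float)  # sales
n = len(x)

num = (x * y).sum() - x.sum() * y.sum() / n            # 1684 - 1600 = 84
den = np.sqrt(((x**2).sum() - x.sum()**2 / n) *
              ((y**2).sum() - y.sum()**2 / n))         # sqrt(118 * 80)
print(round(num / den, 3))                             # 0.865
```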

Regression Analysis
1. Regression analysis:
A. Regression analysis is a technique of studying the dependence of one
variable (the dependent variable) on one or more other variables (the independent
or explanatory variables), with a view to estimating or predicting the average value
of the dependent variable in terms of the known or fixed values of the independent
variables.

B. The statistical tool with the help of which we are in a position to estimate (or
predict) the unknown values of one variable from known values of another
variable is called regression. With the help of regression analysis, we are in a
position to find out the average probable change in one variable given a certain
amount of change in another.

2. Objectives of regression analysis:


(i). To estimate the mean, or average, value of the dependent variable, given the
values of the independent variables.
(ii). To determine the effect of each of the explanatory variables on the dependent
variable, controlling for the effects of all other explanatory variables.
(iii). To predict, or forecast, the mean value of the dependent variable, given the
values of the independent variable(s).

3. Types of regression analysis:

Regression analysis is generally classified into two kinds: simple and multiple.

Simple regression involves only two variables, one of which is the dependent
variable and the other the explanatory (independent) variable. The associated model
in the case of simple regression is a simple regression model.

Multiple regression involves three or more variables, one of which is the dependent
variable that is to be estimated on the basis of the values of all the other
explanatory variables. The model used for multiple regression is a multiple
regression model.

A regression analysis may involve a linear model or a nonlinear model. The term
linear can be interpreted in two different ways: (i) linear in the variables and
(ii) linear in the parameters. Of these, linearity in the parameters is what is
relevant in regression analysis.

4. Classical linear regression model / population regression model:

The classical linear regression model involving $k$ variables ($Y$ and $X_2, X_3, \ldots, X_k$) is given by

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \cdots + \beta_k X_{ki} + u_i, \qquad i = 1, 2, 3, \ldots, n \qquad (*)$$

where
$Y$ = the dependent variable,
$X_2$ to $X_k$ = the explanatory variables or regressors,
$\beta_1$ to $\beta_k$ = parameters; $\beta_1$ = the intercept; $\beta_2$ to $\beta_k$ = the partial regression coefficients,
$u$ = the stochastic disturbance term,
$i$ = the $i$th observation,
$n$ = the size of the population.

5. Assumptions of the classical linear regression model:


The classical linear regression model is based on several assumptions, which are
as follows:

1. The regression model is linear in the parameters, as shown in (*).
2. $X_2, X_3, \ldots, X_k$ are nonstochastic, or fixed in repeated sampling.
3. The disturbance $u_i$ is a random real variable.
4. The mean value of $u_i$ is zero, i.e., $E(u_i) = 0$ for all $i$.
5. The variance of $u_i$ is constant, or homoscedastic, i.e., $V(u_i) = \sigma^2$ for all $i$, so that $E(u_i^2) = \sigma^2$.
6. There is no autocorrelation in the disturbances, i.e., $\operatorname{Cov}(u_i, u_j) = 0$ for all $i \ne j$, so that $E(u_i u_j) = 0$.
7. $u_i$ is independent of the explanatory variables, i.e., $\operatorname{Cov}(u_i, X_{2i}) = \operatorname{Cov}(u_i, X_{3i}) = \cdots = \operatorname{Cov}(u_i, X_{ki}) = 0$.
8. The explanatory variables are measured without error.
9. The number of observations must be greater than the number of regressors.
10. There must be sufficient variability in the values taken by the regressors.
11. The regression model is correctly specified.
12. There is no exact linear relationship among the $X$ variables.
13. The random variable $u_i$ is normally distributed: $u_i \sim N(0, \sigma^2)$, i.e., $u_i \sim NID(0, \sigma^2)$.
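A data-generating process satisfying these assumptions can be simulated. A minimal sketch with a single regressor and hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x2 = np.linspace(1, 10, n)            # regressor fixed in repeated sampling
beta1, beta2, sigma = 4.0, 1.5, 2.0   # hypothetical parameter values

u = rng.normal(0.0, sigma, size=n)    # E(u)=0, constant variance, no autocorrelation
y = beta1 + beta2 * x2 + u            # linear in the parameters, correctly specified
```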

6. Interpretation of the model:
$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \cdots + \beta_k X_{ki} + u_i, \qquad i = 1, 2, 3, \ldots, n$$
$$\Rightarrow E(Y_i) = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \cdots + \beta_k X_{ki}$$
This gives the average, or mean, value of $Y$ for the fixed (in repeated sampling) values
of the $X$ variables.

7. The meaning of the partial regression coefficients/parameters:

$\beta_1$: the intercept term. It gives the mean or average effect on $Y$ of all the
variables excluded from the model, although its mechanical interpretation is the
average value of $Y$ when $X_2, X_3, \ldots,$ and $X_k$ are all set equal to zero.

$\beta_2$: the slope of $E(Y)$ with respect to $X_2$, holding $X_3, X_4, \ldots$ constant. $\beta_2$
measures the change in the mean value of $Y$, $E(Y)$, per unit change in $X_2$, holding
$X_3, X_4, \ldots$ constant.

$\beta_3$: the slope of $E(Y)$ with respect to $X_3$, holding $X_2, X_4, \ldots$ constant. $\beta_3$
measures the change in the mean value of $Y$, $E(Y)$, per unit change in $X_3$, holding
$X_2, X_4, \ldots$ constant.

The meaning of all other parameters can be explained similarly.

8. Simple linear regression model:

Simple regression involves only two variables: one is the dependent variable and the
other is the independent or explanatory variable. The model is given by

$$y = \beta_1 + \beta_2 x + u$$

where
$y$ = the dependent variable,
$x$ = the independent or explanatory variable,
$\beta_1$ and $\beta_2$ = parameters,
$u$ = the stochastic disturbance term.

This model is called the population regression model. The important
assumptions underlying the model are:

1. The mean, or expected, value of the random disturbance term $u_i$ is zero.
Symbolically, $E(u_i) = 0$ for all $i$.
2. The variance of $u_i$ is the same for all observations. Symbolically,
$\operatorname{var}(u_i) = \sigma^2$ for all $i$.

3. The random terms of different observations ($u_i$, $u_j$) are independent, i.e., $u_i$ and
$u_j$ are independent for all $i \ne j$. Symbolically,
$\operatorname{cov}(u_i, u_j) = 0$ for all $i \ne j$.
4. The disturbance term is not correlated with the explanatory variable, i.e., $u_i$
and $x_i$ are independent. Symbolically,
$\operatorname{cov}(u_i, x_i) = 0$.
5. The $u_i$ are normally distributed for all $i$. In conjunction with assumptions 1, 2 and 3,
this implies that the $u_i$ are independently and normally distributed with mean zero and
a common variance $\sigma^2$. We write this as $u_i \sim IN(0, \sigma^2)$.

Important Note:
(1) The model
$$y = \beta_1 + \beta_2 x + u \qquad \text{(i)}$$
is the population regression model. Eq. (i) is also called the stochastic
population regression function (PRF). We can express the SRF in its stochastic
form as follows:

$$y_i = \hat{\beta}_1 + \hat{\beta}_2 x_i + \hat{u}_i \qquad \text{(ii)}$$

where
$\hat{\beta}_1$ = estimator of $\beta_1$,
$\hat{\beta}_2$ = estimator of $\beta_2$,
$\hat{u}_i$ = the estimator of $u_i$.

$\hat{u}_i$ denotes the (sample) residual term, or simply the residual. Eq. (ii) is called the
sample regression model.

We have
$$E(y \mid x) = \beta_1 + \beta_2 x \qquad \text{(iii)}$$
which is called the population regression equation. Eq. (iii) is also called the
deterministic, or nonstochastic, population regression function (PRF). The sample
regression function (SRF) in nonstochastic form, i.e., the sample counterpart of
Eq. (iii), may be written as

$$\hat{y}_i = \hat{\beta}_1 + \hat{\beta}_2 x_i \qquad \text{(iv)}$$

where
$\hat{y}_i$ = estimator of $E(y \mid x)$.

(2) Our primary objective in regression analysis is to estimate the (stochastic) PRF
$$y_i = \beta_1 + \beta_2 x_i + u_i$$
on the basis of the SRF
$$y_i = \hat{\beta}_1 + \hat{\beta}_2 x_i + \hat{u}_i.$$
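The distinction between the PRF and the SRF can be seen in a simulation: the PRF uses the true (normally unknown) parameters, while the SRF is estimated from a sample. A minimal sketch with hypothetical parameter values, using NumPy's polyfit for the fit:

```python
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2 = 3.0, 0.5          # true (population) parameters -- hypothetical
x = rng.uniform(0, 10, size=100)
y = beta1 + beta2 * x + rng.normal(0, 1, size=100)  # stochastic PRF

b2_hat, b1_hat = np.polyfit(x, y, 1)  # SRF: polyfit returns slope, then intercept
print(b1_hat, b2_hat)                 # estimates of beta1, beta2
```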

9. Methods for estimating the regression parameters $\beta_1$ and $\beta_2$:
1. The method of moments.
2. The method of least squares.
3. The method of maximum likelihood.

10. Least squares estimators:

Consider the two-variable PRF:
$$y_i = \beta_1 + \beta_2 x_i + u_i$$
Since the PRF is not directly observable, we estimate it from the SRF
$$y_i = \hat{\beta}_1 + \hat{\beta}_2 x_i + \hat{u}_i = \hat{y}_i + \hat{u}_i$$
$$\Rightarrow \hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i$$
which shows that the $\hat{u}_i$ (the residuals) are simply the differences between the actual
and estimated $y$ values.

Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be $n$ pairs of values observed from a random
sample, and suppose we want to use this information to obtain good estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ of
the unknown parameters $\beta_1$ and $\beta_2$. Viewed graphically, we have a scatter
diagram with $n$ points and we want to obtain a line which in some sense fits the
data best.

A natural approach to the estimation problem is to choose $\hat{\beta}_1$ and $\hat{\beta}_2$ in such a way
that the residuals are as small as possible. In fact, we will choose $\hat{\beta}_1$ and $\hat{\beta}_2$ so that
the sum of the squared residuals is minimized. This criterion of best fit is known as
the principle of least squares. That is, we will select $\hat{\beta}_1$ and $\hat{\beta}_2$ so that

$$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_1 + \hat{\beta}_2 x_i) \right]^2 \qquad \text{(i)}$$

is minimized.
This minimization can easily be achieved by using differential calculus. We
require that

$$\frac{\partial \left( \sum \hat{u}_i^2 \right)}{\partial \hat{\beta}_1} = 0, \qquad \frac{\partial \left( \sum \hat{u}_i^2 \right)}{\partial \hat{\beta}_2} = 0.$$

Differentiating (i) with respect to $\hat{\beta}_1$ and $\hat{\beta}_2$, and setting the derivatives equal to zero,
we have

$$-2 \sum_{i=1}^{n} (y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i) = 0$$

and
$$-2 \sum_{i=1}^{n} (y_i - \hat{\beta}_1 - \hat{\beta}_2 x_i)\, x_i = 0.$$

That is,
$$\sum y_i = n \hat{\beta}_1 + \hat{\beta}_2 \sum x_i \qquad \text{(ii)}$$
$$\sum x_i y_i = \hat{\beta}_1 \sum x_i + \hat{\beta}_2 \sum x_i^2 \qquad \text{(iii)}$$

Equations (ii) and (iii) are known as the normal equations.

Solving the normal equations simultaneously, we obtain

$$\hat{\beta}_2 = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} \qquad \text{(iv)}$$

and

$$\hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x} \qquad \text{(v)}$$

where $\bar{x} = \frac{\sum x_i}{n}$ and $\bar{y} = \frac{\sum y_i}{n}$.

Equations (iv) and (v) define the least squares estimates of $\beta_1$ and $\beta_2$, or
(geometrically) the line of best fit. The method described is called ordinary least
squares (OLS).
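Eqs. (iv) and (v) translate directly into code. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def ols_fit(x, y):
    """OLS estimates of the intercept and slope, per Eqs. (iv) and (v)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b2 = (((x * y).sum() - x.sum() * y.sum() / n) /
          ((x**2).sum() - x.sum()**2 / n))   # Eq. (iv)
    b1 = y.mean() - b2 * x.mean()            # Eq. (v)
    return b1, b2
```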

Note: We have until now considered the regression of $y$ on $x$. This is called the
direct regression. Sometimes one has to consider the regression of $x$ on $y$ as well.
This is called the reverse regression. In that case we have

$$E(x \mid y) = \beta_1' + \beta_2' y$$
$$\hat{x}_i = \hat{\beta}_1' + \hat{\beta}_2' y_i$$
$$x_i = \beta_1' + \beta_2' y_i + u_i'$$
$$x_i = \hat{\beta}_1' + \hat{\beta}_2' y_i + \hat{u}_i'$$

with

$$\hat{\beta}_2' = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}$$

and

$$\hat{\beta}_1' = \bar{x} - \hat{\beta}_2' \bar{y}$$

where $\bar{x} = \frac{\sum x_i}{n}$ and $\bar{y} = \frac{\sum y_i}{n}$.

11. Properties of regression coefficients:

(i) The correlation coefficient is the geometric mean of the two regression
coefficients. Symbolically, $r = \pm\sqrt{\beta_{2(yx)} \cdot \beta_{2(xy)}}$.
(ii) If one of the regression coefficients is greater than unity, the other must be less
than unity, since the value of the coefficient of correlation cannot exceed unity.
(iii) Both regression coefficients will have the same sign, i.e., they will be
either both positive or both negative.
(iv) The coefficient of correlation will have the same sign as the regression
coefficients. For example, if $\beta_{2(yx)} = 0.8$ and $\beta_{2(xy)} = 0.2$, then $r = +\sqrt{0.2 \times 0.8} = 0.4$.
(v) The arithmetic mean of the two regression coefficients is greater than (or equal to) the
correlation coefficient. Symbolically,
$$\frac{\beta_{2(yx)} + \beta_{2(xy)}}{2} \ge r.$$
(vi) Regression coefficients are independent of change of origin but not of scale.
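Properties (i), (iii) and (v) can be verified numerically, here on the advertising/sales data from Problem 1 (a minimal sketch):

```python
import numpy as np

x = np.array([10, 12, 15, 23, 20], float)
y = np.array([14, 17, 23, 25, 21], float)
n = len(x)

sxy = (x * y).sum() - x.sum() * y.sum() / n   # 84
sxx = (x**2).sum() - x.sum()**2 / n           # 118
syy = (y**2).sum() - y.sum()**2 / n           # 80

b_yx = sxy / sxx                  # slope of y on x (direct regression)
b_xy = sxy / syy                  # slope of x on y (reverse regression)
r = sxy / np.sqrt(sxx * syy)

print(np.isclose(abs(r), np.sqrt(b_yx * b_xy)))  # property (i): geometric mean
print(np.sign(b_yx) == np.sign(b_xy))            # property (iii): same sign
print((b_yx + b_xy) / 2 >= abs(r))               # property (v): AM >= r
```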

12. Correlation Analysis vs. Regression Analysis:


A. There are two important points of difference between correlation and regression
analysis:
1. Whereas correlation coefficient is a measure of degree of relationship
between x and y , the objective of regression analysis is to study the ‘nature of
relationship’ between the variables.
2. The cause and effect relation is more clearly indicated through regression analysis
than by correlation. Correlation is merely a tool for ascertaining the degree of
relationship between two variables, and therefore we cannot say that one variable
is the cause and the other the effect.
B.
1. Correlation measures the degree of relationship between two variables. But
regression does not do so. It simply explains the nature of relationship, i .e ., the
average probable change in one variable given a certain amount of change in the
other.
2. Correlation does not help us in ascertaining whether one variable is the cause
and the other the effect. But regression helps us in studying the cause and effect
relationship between the two variables. In fact, in regression analysis, one variable
is taken as the dependent variable and the other as the independent variable.

C.
1. Correlation: Correlation literally means the relationship between two or more variables which vary in sympathy, so that movements in one tend to be accompanied by corresponding movements in the other(s).
   Regression: Regression means stepping back or returning to the average value, and is a mathematical measure expressing the average relationship between the two variables.
2. Correlation: Correlation need not imply a cause and effect relationship between the variables under study.
   Regression: Regression analysis, however, clearly indicates the cause and effect relationship between the variables. The variable corresponding to cause is taken as the independent variable and the variable corresponding to effect is taken as the dependent variable.
3. Correlation: Correlation analysis is confined to the study of linear relationship between the variables and therefore has limited applications.
   Regression: Regression analysis has much wider applications, as it studies linear as well as non-linear relationships between the variables.
4. Correlation: There may be non-sense correlation between two variables which is due to pure chance and has no practical relevance, e.g., the correlation between rainfall and the intelligence of a group of individuals.
   Regression: There is no such thing as non-sense regression.
5. Correlation: The correlation coefficient $r(x, y)$ between two variables $x$ and $y$ is a measure of the direction and degree of the linear relationship between them, and this relationship is mutual. It is symmetric, i.e., $r(x, y) = r(y, x)$, and it is immaterial which of $x$ and $y$ is the dependent variable and which is the independent variable.
   Regression: Regression analysis aims at establishing the functional relationship between the two variables under study and then using this relationship to predict or estimate the value of the dependent variable for any given value of the independent variable. It also reflects the nature of the variables, i.e., which is the dependent variable and which is the independent variable. Regression coefficients are not symmetric in $x$ and $y$, i.e., $\beta_{2(yx)} \ne \beta_{2(xy)}$.
6. Correlation: The correlation coefficient $r(x, y)$ is a relative measure of the linear relationship between $x$ and $y$ and is independent of the units of measurement. It is a pure number lying between $\pm 1$.
   Regression: The regression coefficients $\beta_{2(yx)}$ and $\beta_{2(xy)}$ are absolute measures representing the change in the value of the variable $y$ (or $x$) for a unit change in the value of the variable $x$ (or $y$).

Problem/Assignment:
Problem 1(b):
The following sample data relate to advertising expenditure (in lakh Tk.) and their
corresponding sales (in crore Tk.)

Advertising expenditure   Sales
10                        14
12                        17
15                        23
23                        25
20                        21

Let advertising expenditure be denoted by x and sales by y.

Fit the regression equation/line of y on x, and predict y when x = 30.

Solution:
The population regression model of $y$ on $x$ is
$$y = \beta_1 + \beta_2 x + u$$
and the population regression equation of $y$ on $x$ is
$$E(y \mid x) = \beta_1 + \beta_2 x.$$

The sample regression model of $y$ on $x$ is
$$y = \hat{\beta}_1 + \hat{\beta}_2 x + \hat{u}$$
and the sample regression equation, or fitted regression line, of $y$ on $x$ is
$$\hat{y} = \hat{\beta}_1 + \hat{\beta}_2 x$$
where $\hat{y}_i$ = estimator of $E(y \mid x)$.

Using the least squares method, we have

$$\hat{\beta}_2 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$

and

$$\hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x}, \qquad \text{where } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \; \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}.$$

Calculation of the Regression Equation

Adv. Exp. (x)   Sales (y)    x²      y²      xy
10              14           100     196     140
12              17           144     289     204
15              23           225     529     345
23              25           529     625     575
20              21           400     441     420
Σxᵢ = 80        Σyᵢ = 100    Σxᵢ² = 1398    Σyᵢ² = 2080    Σxᵢyᵢ = 1684

Here $n = 5$.

Now, we have
$$\bar{x} = \frac{\sum x_i}{n} = \frac{80}{5} = 16, \qquad \bar{y} = \frac{\sum y_i}{n} = \frac{100}{5} = 20,$$

$$\hat{\beta}_2 = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} = \frac{1684 - \frac{80 \times 100}{5}}{1398 - \frac{80^2}{5}} = \frac{84}{118} = 0.712$$

and

$$\hat{\beta}_1 = \bar{y} - \hat{\beta}_2 \bar{x} = 20 - 0.712 \times 16 = 8.608.$$

Thus the fitted regression line of $y$ on $x$ is
$$\hat{y} = \hat{\beta}_1 + \hat{\beta}_2 x = 8.608 + 0.712\,x.$$

If $x = 30$, then $\hat{y} = 8.608 + 0.712 \times 30 = 29.968$.
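A quick numerical check of the fitted line and the prediction (a minimal sketch; $\hat{\beta}_2$ is rounded to three decimals before computing $\hat{\beta}_1$, as in the hand calculation above):

```python
import numpy as np

x = np.array([10, 12, 15, 23, 20], float)  # advertising expenditure
y = np.array([14, 17, 23, 25, 21], float)  # sales
n = len(x)

b2 = ((x * y).sum() - x.sum() * y.sum() / n) / ((x**2).sum() - x.sum()**2 / n)
b2 = round(b2, 3)                         # 84/118 = 0.712 (to three decimals)
b1 = round(y.mean() - b2 * x.mean(), 3)   # 20 - 0.712*16 = 8.608
print(b1, b2)                             # 8.608 0.712
print(round(b1 + b2 * 30, 3))             # prediction at x = 30: 29.968
```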

Problem/Assignment:
Problem 2:
Consider the following hypothetical data on weekly family consumption
expenditure y and family income x :

y ($) x ($)
70 80
65 100
90 120
95 140
110 160
115 180
120 200
140 220
155 240
150 260

(a) Find the correlation coefficient between x and y, and comment.

(b) Estimate (fit) the regression equation of y on x. Interpret the estimated
parameters. Predict y when x = $280.

(Prof. Dr. Rahmat Ali)
