Unit II Notes: Correlation and Regression
JSPM’s
RAJARSHI SHAHU COLLEGE OF ENGINEERING, PUNE
(Autonomous Institute Affiliated to Savitribai Phule Pune University, Pune)
FIRST YEAR
BACHELOR OF TECHNOLOGY
(A.Y. 2021-2022, SEMESTER - II)
COMPUTER/IT /CSBS ENGINEERING
SUBJECT: STATISTICAL METHODS
2.1 CORRELATION
We have already discussed distributions involving one variable, or univariate distributions. In many problems of a practical nature, we are required to deal with two or more variables. Distributions involving two variables are called bivariate distributions. In such distributions, we are often interested in knowing whether there exists some kind (or degree) of relationship between the two variables. In statistics, this means asking whether there is correlation, or covariance, between the two variables. If a change in one variable is accompanied by a change in the other variable, the variables are said to be correlated and the relationship is called correlation. Correlation measures the intensity or degree of relationship between two variables.
For example, change in rainfall will affect the crop output and thus the variables 'Rainfall recorded'
and 'Crop output' are correlated. If the increase (or decrease) in one variable causes corresponding
increase (or decrease) in the other, the correlation is said to be positive or direct. On the other hand, if
increase in the value of one variable shows a corresponding decrease in the value of the other or vice
versa, the correlation is called negative or inverse. For example, if the income of a worker increases, his expenditure also naturally increases; hence the correlation between income and expenditure is positive or direct. If we consider the price and demand of a certain commodity, experience tells us that as the price of a commodity rises, its demand falls, and thus the correlation between these variables is negative or inverse. Correlation can also be classified as linear and non-linear. When the amount of change in one variable bears a constant ratio to the amount of change in the other variable, the correlation is linear; otherwise it is non-linear.
There are different methods to determine correlation between two variables. The graphical method called the 'scatter diagram' gives a rough idea about the correlation without giving any specific numerical value, while 'Karl Pearson's coefficient of correlation' gives a numerical measure of the intensity (or degree) of linear relationship between two variables and is widely used.
Remark: Besides the scatter diagram and Karl Pearson's coefficient of correlation, there is another measure called Spearman's coefficient of correlation, which measures the linear association between ranks assigned to individual items according to their attributes. It is of less consequence for our purposes.
We shall now discuss 'Karl Pearson's Coefficient of Correlation' which is widely used in practice.
2.2 KARL PEARSON'S COEFFICIENT OF CORRELATION:
To measure the intensity or degree of linear relationship between two variables, Karl Pearson
developed a formula called correlation coefficient.
The correlation coefficient between two variables x and y, denoted by r(x, y), is defined as

r(x, y) = cov(x, y) / (σ_x σ_y)                                  ... (1)

In a bivariate distribution, if (x_i, y_i) take the values (x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_n, y_n), we define

cov(x, y) = (1/n) Σ (x_i − x̄)(y_i − ȳ)                           ... (2)

where x̄, ȳ are the arithmetic means of the x and y series respectively.
Also, the standard deviations for the x and y series are:

σ_x = √[(1/n) Σ (x_i − x̄)²]  and  σ_y = √[(1/n) Σ (y_i − ȳ)²]    ... (3)
The correlation coefficient r(x, y) = cov(x, y)/(σ_x σ_y) can be calculated using the following simplifications:

cov(x, y) = (1/n) Σ (x_i − x̄)(y_i − ȳ)

On simplifying we get

cov(x, y) = (1/n) Σ x_i y_i − x̄ ȳ                                ... (4)

σ_x² = (1/n) Σ x_i² − x̄²                                         ... (5)

Similarly,

σ_y² = (1/n) Σ y_i² − ȳ²                                         ... (6)
For simplification of calculation, we put

u_i = x_i − A and v_i = y_i − B,  or  u_i = (x_i − A)/h and v_i = (y_i − B)/k,  then

cov(u, v) = (1/n) Σ u_i v_i − ū v̄                                ... (7)

σ_u² = (1/n) Σ u_i² − ū²  and  σ_v² = (1/n) Σ v_i² − v̄²          ... (8)

and r(u, v) is given by r(u, v) = cov(u, v)/(σ_u σ_v). It can be established that r(x, y) = r(u, v).
We note here that the calculation of r(u, v) is simpler as compared to that of r(x, y).
Using results (4), (5) and (6) in (1), we can write the formula as

r(x, y) = cov(x, y)/(σ_x σ_y)
        = [(1/n) Σ x_i y_i − x̄ ȳ] / [√((1/n) Σ x_i² − x̄²) · √((1/n) Σ y_i² − ȳ²)]
        = (Σ x_i y_i − n x̄ ȳ) / [√(Σ x_i² − n x̄²) · √(Σ y_i² − n ȳ²)]

Also, using results (7) and (8) in (1), we can write the formula as

r(u, v) = cov(u, v)/(σ_u σ_v)
        = (Σ u_i v_i − n ū v̄) / [√(Σ u_i² − n ū²) · √(Σ v_i² − n v̄²)]
Property: It can be shown that the coefficient of correlation always satisfies the condition −1 ≤ r ≤ 1.
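The simplified formula above is easy to check numerically. The following Python sketch is an illustration only (the function name `pearson_r` is our own, not part of the notes); it implements r = (Σxᵢyᵢ − n x̄ȳ) / √((Σxᵢ² − n x̄²)(Σyᵢ² − n ȳ²)):

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's coefficient of correlation via the simplified formula
    r = (sum(x*y) - n*xbar*ybar) / sqrt((sum(x^2) - n*xbar^2)*(sum(y^2) - n*ybar^2))."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    num = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    den = math.sqrt((sum(x * x for x in xs) - n * xbar ** 2) *
                    (sum(y * y for y in ys) - n * ybar ** 2))
    return num / den
```

Running it on the import/export data of Ex. 1 below reproduces r ≈ 0.9458.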
Ex. 1: Following are the values of import of raw material and export of finished product
Export 10 11 14 14 20 22 16 12 15 13
Import 12 14 15 16 21 26 21 15 16 14
Calculate the coefficient of correlation between the import and export values.
Solution: For n = 10, the data is tabulated as:
x      y      x²     y²     xy
10     12     100    144    120
11     14     121    196    154
14     15     196    225    210
14     16     196    256    224
20     21     400    441    420
22     26     484    676    572
16     21     256    441    336
12     15     144    225    180
15     16     225    256    240
13     14     169    196    182
Total: Σx = 147, Σy = 170, Σx² = 2291, Σy² = 3056, Σxy = 2638
Here x̄ = 147/10 = 14.7 and ȳ = 170/10 = 17.

r(x, y) = cov(x, y)/(σ_x σ_y)
        = (Σ x_i y_i − n x̄ ȳ) / [√(Σ x_i² − n x̄²) · √(Σ y_i² − n ȳ²)]
        = (2638 − 10 × 14.7 × 17) / [√(2291 − 10 × 14.7²) · √(3056 − 10 × 17²)]
        = 139 / (√130.1 × √166)
        = 0.9458
Sol.: Here, x̄ = Σx_i/n = 40/10 = 4, so x̄² = 16, and ȳ = Σy_i/n = 40/10 = 4, so ȳ² = 16.
Ex. 3: Given: r = 0.9, ΣXY = 70, σ_y = 3.5, ΣX² = 100. Find the number of items n, if X and Y are deviations from the arithmetic means.

Sol.: Since X = x − x̄, we have σ_x² = (1/n) Σ(x − x̄)² = (1/n) ΣX² = 100/n

r(x, y) = cov(x, y)/(σ_x σ_y) = [(1/n) Σ(x − x̄)(y − ȳ)] / (σ_x σ_y) = (ΣXY)/(n σ_x σ_y)

Squaring, we get

r² = (ΣXY)² / (n² σ_x² σ_y²)

0.9² = 70² / [n² × (100/n) × 3.5²], i.e. 0.81 = 4900 / (1225 n)

n = 4900 / (0.81 × 1225) = 4900 / 992.25 ≈ 5
2.3 t-TEST FOR A CORRELATION COEFFICIENT:
The most frequently used test to examine whether two variables X and Y are correlated is the t-test. To apply this test, we first set up the two hypotheses

H₀: ρ = 0 (the variables are uncorrelated) and H₁: ρ ≠ 0

and compute the test statistic

t = r √(n − 2) / √(1 − r²)

which, under H₀, follows Student's t-distribution with n − 2 degrees of freedom.
Soln.: In order to examine whether the relationship between the two variables is statistically significant, we apply the t-test.
The two hypotheses are H₀: ρ = 0 and H₁: ρ ≠ 0. As the calculated value of t is less than the critical value of t at the given degrees of freedom and level of significance, the null hypothesis is accepted. The conclusion is that the relationship between the two variables is not statistically significant.
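The test statistic can be sketched in Python (an illustration only; `corr_t_stat` is our own name). It evaluates t = r√(n−2)/√(1−r²), to be compared with the tabulated critical value at n − 2 degrees of freedom:

```python
import math

def corr_t_stat(r, n):
    """t statistic for testing H0: rho = 0 against H1: rho != 0.
    Under H0, t follows Student's t-distribution with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```

For example, with r = 0.9458 and n = 10 (Ex. 1 of Section 2.2), t ≈ 8.24, far above the usual critical values, so that correlation would be judged significant.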
2.4 ASSUMPTIONS OF THE KARL PEARSONIAN CORRELATION:
1. The two variables x and y are linearly related. This implies that when the individual pairs are plotted on a graph, the resulting scatter diagram is such that if the points are joined together, approximately a straight line will be formed.
2. The two variables are affected by several independent causes, so as to form a normal distribution. For example, relationships between price and demand, price and supply, advertising expenditure and sales, length of experience and earnings, and so on, are affected by several factors such that each series results in a normal distribution.
EXERCISE: 2.1
Ex.1 From a group of 10 students, marks obtained by each in two papers x and y are given below
x 23 28 42 17 26 35 29 37 16 46
y 25 22 38 21 27 39 24 32 18 44
Calculate coefficient of correlation between x and y .
Ex.2 Obtain the correlation coefficient between population density (per square mile) and death rate
(per thousand persons) from the following data:
Population density 200 500 400 700 300
Death rate 12 18 16 21 10
2.5 REGRESSION
After having established that the two variables are correlated, we are generally interested in
estimating the value of one variable for a given value of the other variable. For example, if we know
that rainfall affects the crop output then it is possible to predict the crop output at the end of a rainy
season. If the variables in a bivariate distribution are related, the points in scatter diagram cluster
round some curve called the curve of regression or the regression curve. If the curve is a straight line,
it is called the line of regression and in such case the regression between two variables is linear. The
line of regression gives best estimate for the value of one variable for some specified value of the
other variable.
The simplest regression curve is a straight line. Suppose the line of regression is

y = ax + b                                                       ... (1)

If the point (x_i, y_i) is assumed to lie on (1), then the y-coordinate of the point can be calculated as

y'_i = a x_i + b

If the point actually lies on (1), then y_i = y'_i. Otherwise, y_i − y'_i represents the deviation of the observed value y_i from the value y'_i calculated using formula (1).
In the method of least squares, we take the sum of the squares of these deviations and minimize this sum using the principles of maxima and minima. The values of a and b in (1) are calculated using this criterion, called the least-squares criterion. The fitted curve can be of any degree; using the least-squares criterion we can find its equation. We shall now discuss fitting a straight line and a second-degree parabola to a given set of points.
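The least-squares fit of a straight line can be sketched in Python (an illustration only; `fit_line` is our own name). It solves the two normal equations Σy = aΣx + nb and Σxy = aΣx² + bΣx for the slope a and intercept b:

```python
def fit_line(xs, ys):
    """Fit y = a*x + b by least squares, solving the normal equations
    sum(y) = a*sum(x) + n*b and sum(x*y) = a*sum(x^2) + b*sum(x)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a, b
```

For points lying exactly on a line, e.g. (0, 1), (1, 3), (2, 5), it returns a = 2 and b = 1.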
Consider the set of values (x_i, y_i), i = 1, 2, 3, ..., n. Let the line of regression of y on x be

y = mx + c

Using the method of least squares, the values of m and c are estimated from the normal equations, and the regression line of y on x reduces to

y − ȳ = r (σ_y/σ_x)(x − x̄),  i.e.  y − ȳ = b_yx (x − x̄)          ... (9)

where b_yx = r σ_y/σ_x is called the regression coefficient of y on x.
Similarly, the regression line of x on y is

x − x̄ = r (σ_x/σ_y)(y − ȳ),  i.e.  x − x̄ = b_xy (y − ȳ)          ... (10)

where b_xy = r σ_x/σ_y is called the regression coefficient of x on y.
For obtaining regression lines (9) and (10), we have to calculate r(x, y), x̄, ȳ, σ_x and σ_y.
If u_i = x_i − a and v_i = y_i − b, then x̄ = a + ū, ȳ = b + v̄, and

σ_x = σ_u,  σ_y = σ_v,  cov(u, v) = (1/n) Σ u_i v_i − ū v̄,
σ_u² = (1/n) Σ u_i² − ū²,  σ_v² = (1/n) Σ v_i² − v̄²

In particular, if u = (x − a)/h and v = (y − b)/k, then

r = r(x, y) = cov(x, y)/(σ_x σ_y) = cov(u, v)/(σ_u σ_v) = r(u, v)

where σ_x = h σ_u, σ_y = k σ_v, and x̄ = a + h ū, ȳ = b + k v̄.
9
Statistical Methods (Comp/IT)
Property 1: Since b_yx · b_xy = r (σ_y/σ_x) · r (σ_x/σ_y) = r², we have r = ±√(b_yx · b_xy).
Property 2: If θ is the acute angle between the two regression lines in the case of two variables x and y, then

tan θ = [(1 − r²)/r] · [σ_x σ_y / (σ_x² + σ_y²)]
Sol.: x̄ = Σx_i/n = 30/5 = 6 and ȳ = Σy_i/n = 40/5 = 8

σ_x² = Σx_i²/n − x̄² = 220/5 − 6² = 8

σ_y² = Σy_i²/n − ȳ² = 340/5 − 8² = 4

cov(x, y) = Σx_i y_i/n − x̄ ȳ = 214/5 − 6 × 8 = −5.2

b_yx = cov(x, y)/σ_x² = −5.2/8 = −0.65,  b_xy = cov(x, y)/σ_y² = −5.2/4 = −1.3

Regression line of y on x: y − ȳ = b_yx (x − x̄)
y − 8 = −0.65(x − 6)
y = −0.65x + 11.9

Regression line of x on y: x − x̄ = b_xy (y − ȳ)
x − 6 = −1.3(y − 8)
x = −1.3y + 16.4
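The regression coefficients in the worked example above can be computed directly from the raw sums. A Python sketch (an illustration only; `regression_coeffs` is our own name):

```python
def regression_coeffs(n, sx, sy, sxx, syy, sxy):
    """Return (b_yx, b_xy) from the raw sums, using the 1/n definitions:
    b_yx = cov(x, y)/var(x) and b_xy = cov(x, y)/var(y)."""
    xbar, ybar = sx / n, sy / n
    cov = sxy / n - xbar * ybar
    var_x = sxx / n - xbar ** 2
    var_y = syy / n - ybar ** 2
    return cov / var_x, cov / var_y
```

With n = 5, Σx = 30, Σy = 40, Σx² = 220, Σy² = 340, Σxy = 214 (the sums used above), it returns b_yx = −0.65 and b_xy = −1.3.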
Ex. 2: Obtain the regression lines for the following data, and estimate y for x = 14.5 and x for y = 29.5.
Sol.: We prepare the table, taking u = x − 26 and v = y − 26:
x      y      u = x − 26   v = y − 26   u²     v²     uv
10     12     −16          −14          256    196    224
14     16     −12          −10          144    100    120
19     18     −7           −8           49     64     56
26     26     0            0            0      0      0
30     29     4            3            16     9      12
34     35     8            9            64     81     72
39     38     13           12           169    144    156
Total:        Σu = −10     Σv = −8      Σu² = 698   Σv² = 594   Σuv = 640
Here n = 7, ū = −10/7 = −1.429, v̄ = −8/7 = −1.143

σ_u² = Σu_i²/n − ū² = 698/7 − (−1.429)² = 97.673, so σ_u = 9.883
σ_v² = Σv_i²/n − v̄² = 594/7 − (−1.143)² = 83.551, so σ_v = 9.14

cov(u, v) = Σu_i v_i/n − ū v̄ = 640/7 − (−1.429)(−1.143) = 89.795

r = r(x, y) = r(u, v) = cov(u, v)/(σ_u σ_v) = 89.795/(9.883 × 9.14) = 0.9941

b_yx = r σ_y/σ_x = r σ_v/σ_u = 0.9941 × 9.14/9.883 = 0.9194
b_xy = r σ_x/σ_y = r σ_u/σ_v = 0.9941 × 9.883/9.14 = 1.0749

x̄ = a + ū = 26 − 1.429 = 24.571
ȳ = b + v̄ = 26 − 1.143 = 24.857

Regression line of y on x: y − ȳ = b_yx (x − x̄)
y − 24.857 = 0.9194(x − 24.571), i.e. y = 0.9194x + 2.266
At x = 14.5: y = 0.9194 × 14.5 + 2.266 = 15.60

Regression line of x on y: x − x̄ = b_xy (y − ȳ)
x − 24.571 = 1.0749(y − 24.857), i.e. x = 1.0749y − 2.148
At y = 29.5: x = 1.0749 × 29.5 − 2.148 = 29.56
Ex. 3: The regression equations are 8x − 10y + 66 = 0 and 40x − 18y = 214. The variance of x is 9. Find: A) the mean values of x and y; B) the correlation coefficient; C) the standard deviation of y.

Sol.: A) Given regression equations are 8x − 10y + 66 = 0 and 40x − 18y = 214. Since both regression lines pass through the point (x̄, ȳ), solving the above equations we get

x̄ = 13 and ȳ = 17

B) Let 8x − 10y + 66 = 0 be the line of regression of y on x, and 40x − 18y = 214 be the line of regression of x on y. The equations can be written as

y = (8/10)x + 66/10  and  x = (18/40)y + 214/40

b_yx, the regression coefficient of y on x, = 0.8, and b_xy, the regression coefficient of x on y, = 0.45.

The coefficient of correlation between x and y is given by r² = b_xy · b_yx = 0.45 × 0.8 = 0.36, so r = ±0.6. Since both regression coefficients are positive, we take r = 0.6.

C) Variance of x = 9, i.e. σ_x² = 9, therefore σ_x = 3.
We have b_yx = r σ_y/σ_x, so 0.8 = 0.6 × σ_y/3, giving σ_y = 4.
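The steps of Ex. 3 can be replayed in a short Python sketch (an illustration only; the function name is our own): solve the two lines for their common point (x̄, ȳ), read off the regression coefficients, and recover r and σ_y:

```python
def solve_ex3():
    """Ex. 3: lines 8x - 10y + 66 = 0 (y on x) and 40x - 18y = 214 (x on y),
    with variance of x = 9. Returns (xbar, ybar, r, sigma_y)."""
    # Eliminate x: 5*(8x - 10y) - (40x - 18y) = 5*(-66) - 214  =>  -32y = -544
    ybar = 544 / 32                # 17
    xbar = (10 * ybar - 66) / 8    # 13
    b_yx = 8 / 10                  # from y = 0.8x + 6.6
    b_xy = 18 / 40                 # from x = 0.45y + 5.35
    r = (b_yx * b_xy) ** 0.5       # both coefficients positive, so r = +0.6
    sigma_y = b_yx * 3 / r         # from b_yx = r*sigma_y/sigma_x, sigma_x = 3
    return xbar, ybar, r, sigma_y
```

It reproduces x̄ = 13, ȳ = 17, r = 0.6 and σ_y = 4, matching the hand computation.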
EXERCISE 2.2
1. Obtain lines of regression for the following data.
x 40 44 28 30 44 38 31
y 32 39 26 30 38 34 28
2. Determine the equations of regression lines for the following data:
x 1 2 3 4 5 6 7 8 9
y 9 8 10 12 11 13 14 16 15
and obtain an estimate of y for x = 4.5
3. Obtain lines of regression for the following data.
x 6 2 10 4 8
y 9 11 5 8 7
and find x for y = 9.3
4. If the two lines of regression are 9x + y − λ = 0 and 4x + y = μ, and the means of x and y are 2 and −3 respectively, find the values of λ and μ, and the coefficient of correlation between x and y.
5. The regression equations are: 3x + 2y − 26 = 0 and 6x + y − 31 = 0. Then find:
(i) The mean values of x and y (ii) The correlation coefficient between x and y.
6. Fit a straight line of the form y = ax + b to the following data, using the least-squares method:
x 0 2 4 6 8 12 20
y 10 12 18 22 20 30 30
2.10 MULTIPLE CORRELATION
Introduction: Multiple correlation is based on three or more variables, without excluding the effect of any one of them. It is denoted by R, as against r, which is used to denote the simple bivariate correlation coefficient.
In the case of three variables X₁, X₂, X₃, the multiple correlation coefficients are R₁.₂₃, R₂.₁₃ and R₃.₁₂. The multiple correlation coefficient with X₁ as the dependent variable and X₂, X₃ as the independent variables is defined as

R₁.₂₃ = √[(r₁₂² + r₁₃² − 2 r₁₂ r₁₃ r₂₃) / (1 − r₂₃²)]

where r₁₂, r₁₃ and r₂₃ are the zero-order (simple) correlation coefficients.
As is the case with simple bivariate correlation, the coefficient of multiple correlation lies between 0 and 1. As R moves closer to 0, the relationship becomes more and more negligible; as it moves closer to 1, the relationship becomes stronger. If R is 1, the correlation is called perfect. It may be added that when R is 0, showing the absence of a linear relationship, it is still possible that there is a non-linear relationship among the variables. Another point to note is that the multiple coefficient of correlation is always positive. This is in contrast to the simple bivariate coefficient of correlation, which may vary from −1 to +1.
We can obtain the coefficient of multiple determination, R², by squaring the multiple coefficient of correlation.
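The defining formula for R₁.₂₃ can be evaluated with a small Python function (an illustration only; `multiple_R` is our own name):

```python
import math

def multiple_R(r12, r13, r23):
    """Multiple correlation of X1 on X2 and X3 from zero-order coefficients:
    R_{1.23} = sqrt((r12^2 + r13^2 - 2*r12*r13*r23) / (1 - r23^2))."""
    return math.sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))
```

For instance, with hypothetical values r₁₂ = 0.8, r₁₃ = 0.6, r₂₃ = 0.5, it gives R₁.₂₃ ≈ 0.83; note that the result is always non-negative, as stated above.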
Example 1: Given the following zero-order coefficients of correlation, calculate the multiple coefficient of correlation.
Solution:
Example 2: Given the following zero-order coefficients of correlation, calculate the multiple coefficient of correlation.
Solution:
Example 3: Calculate the multiple coefficient of correlation for the following data:
X 3 4 5 6 7 8 9
Y 2 5 6 4 3 2 4
Z 5 6 4 5 6 5 8
Solution:
X Y Z X2 Y2 Z2 XY XZ YZ
3 2 5 9 4 25 6 15 10
4 5 6 16 25 36 20 24 30
5 6 4 25 36 16 30 20 24
6 4 5 36 16 25 24 30 20
7 3 6 49 9 36 21 42 18
8 2 5 64 4 25 16 40 10
9 4 8 81 16 64 36 72 32
Total: ΣX = 42, ΣY = 26, ΣZ = 39, ΣX² = 280, ΣY² = 110, ΣZ² = 227, ΣXY = 153, ΣXZ = 243, ΣYZ = 144
negative correlation. When the points are extremely scattered on a graph, it becomes evident that there is almost no relationship between the two variables. However, when it comes to intermediate values of r, we have to be careful in their interpretation. Suppose we get a correlation of 0.9; we may be tempted to say that it is 'twice as good' or 'twice as strong' as a correlation of 0.45. This comparison is wrong. The strength of r is judged by the coefficient of determination, r². For r = 0.9, r² = 0.81; multiplying by 100, we get 81 percent. This suggests that when r is 0.9, 81 percent of the total variation in the dependent series can be attributed to the relationship with the other variable. When r = 0.45, r² = 0.2025, which in percentage terms is 20.25.
The multiple regression equation takes the form

y = a + b₁x₁ + b₂x₂ + ... + b_k x_k

where y is the dependent variable which is to be predicted; x₁, x₂, ..., x_k are the k known variables on which the predictions are to be based; and a, b₁, b₂, ..., b_k are parameters, the values of which are determined by the method of least squares.
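For k = 2 predictors, the three normal equations can be solved directly. The following self-contained Python sketch (an illustration only; `fit_two_predictors` is our own name) fits y = a + b₁x₁ + b₂x₂ by Gauss-Jordan elimination on the normal equations:

```python
def fit_two_predictors(ys, x1s, x2s):
    """Fit y = a + b1*x1 + b2*x2 by least squares.
    Builds the augmented matrix of the three normal equations and
    solves it by Gauss-Jordan elimination (no external libraries)."""
    n = len(ys)

    def sp(us, vs):  # sum of products
        return sum(u * v for u, v in zip(us, vs))

    A = [[n,        sum(x1s),     sum(x2s),     sum(ys)],
         [sum(x1s), sp(x1s, x1s), sp(x1s, x2s), sp(x1s, ys)],
         [sum(x2s), sp(x1s, x2s), sp(x2s, x2s), sp(x2s, ys)]]
    for i in range(3):                      # Gauss-Jordan elimination
        A[i] = [v / A[i][i] for v in A[i]]  # normalize the pivot row
        for j in range(3):
            if j != i:
                f = A[j][i]
                A[j] = [vj - f * vi for vj, vi in zip(A[j], A[i])]
    return A[0][3], A[1][3], A[2][3]        # a, b1, b2
```

On data generated exactly from y = 1 + 2x₁ + 3x₂, it recovers a = 1, b₁ = 2, b₂ = 3 up to rounding.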
Example 1) The following data relate to radio advertising expenditure, newspaper advertising expenditure and sales. Fit a regression equation.
Solution: It may be noted here that as there are three variables, viz. Y, X₁ and X₂, there will be three normal equations. The computations are tabulated as:

Y    X₁   X₂    Y²    YX₁   X₁²   YX₂   X₁X₂
4    1    7     16    4     1     28    7
7    2    12    49    14    4     84    24
9    5    17    81    45    25    153   85
Example 2) The following data relate to sales and advertising expenditure. Fit a regression equation.

Obs.   Y     X₁    X₂
1      100   40    10
2      80    30    10
3      60    20    7
4      120   50    15
5      150   60    20
6      90    40    12
7      70    20    8
8      130   60    14

Solution: It may be noted here that as there are three variables, viz. Y, X₁ and X₂, there will be three normal equations:

ΣY = n a + b₁ ΣX₁ + b₂ ΣX₂
ΣX₁Y = a ΣX₁ + b₁ ΣX₁² + b₂ ΣX₁X₂
ΣX₂Y = a ΣX₂ + b₁ ΣX₁X₂ + b₂ ΣX₂²