Unit II Notes Correlation and Regression

The document discusses statistical methods with a focus on correlation, particularly Karl Pearson's Coefficient of Correlation, which measures the linear relationship between two variables. It explains the concepts of correlation, including positive and negative correlations, and provides formulas for calculating the correlation coefficient along with examples. Additionally, it introduces the t-test for assessing the significance of correlation and outlines assumptions related to Pearsonian correlation.

Uploaded by

Rohit
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

Unit II Notes Correlation and Regression

The document discusses statistical methods with a focus on correlation, particularly Karl Pearson's Coefficient of Correlation, which measures the linear relationship between two variables. It explains the concepts of correlation, including positive and negative correlations, and provides formulas for calculating the correlation coefficient along with examples. Additionally, it introduces the t-test for assessing the significance of correlation and outlines assumptions related to Pearsonian correlation.

Uploaded by

Rohit
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

Statistical Methods (Comp/IT)

JSPM’s
RAJARSHI SHAHU COLLEGE OF ENGINEERING, PUNE
(Autonomous Institute Affiliated to Savitribai Phule Pune University, Pune)

FIRST YEAR
BACHELOR OF TECHNOLOGY
(A.Y. 2021 -2022, SEMESTER - II)
COMPUTER/IT /CSBS ENGINEERING
SUBJECT: STATISTICAL METHODS

UNIT – II: LINEAR STATISTICAL MODELS

2.1 CORRELATION

We have already discussed distributions involving one variable, or univariate distributions. In many
problems of a practical nature, we are required to deal with two or more variables. Distributions
involving two variables are called bivariate distributions. In such distributions, we are often interested
in knowing whether there exists some kind (or degree) of relationship between the two variables. In
statistics, this means asking whether there is correlation or covariance between the two variables. If a
change in one variable is accompanied by a change in the other variable, the variables are said to be
correlated and the relationship is called correlation. Correlation measures the intensity or degree of
relationship between two variables.
For example, a change in rainfall will affect the crop output, and thus the variables 'rainfall recorded'
and 'crop output' are correlated. If an increase (or decrease) in one variable causes a corresponding
increase (or decrease) in the other, the correlation is said to be positive or direct. On the other hand, if
an increase in the value of one variable is accompanied by a corresponding decrease in the value of the
other, or vice versa, the correlation is called negative or inverse. For example, if the income of a worker
increases, his expenditure naturally also increases; hence the correlation between income and expenditure
is positive or direct. If we consider the price and demand of a certain commodity, our experience
tells us that as the price of the commodity rises its demand falls, and thus the correlation between these
variables is negative or inverse. Correlation can also be classified as linear and non-linear. When the
amount of change in one variable tends to bear a constant ratio to the amount of change in the other
variable, the correlation is called linear; when it is not in a constant ratio, the correlation is called
non-linear.
There are different methods to determine the correlation between two variables. The graphical method
called the 'scatter diagram' gives a rough idea about the correlation without giving any specific
numerical value, while 'Karl Pearson's coefficient of correlation' gives a numerical measure of the
intensity (or degree) of the linear relationship between two variables and is widely used.
Remark: Besides the scatter diagram and Karl Pearson's coefficient of correlation, there is another
measure called Spearman's coefficient of rank correlation, which measures the association between the
ranks assigned to individual items according to their attributes; it is of less consequence here.
We shall now discuss Karl Pearson's coefficient of correlation, which is widely used in practice.
2.2 KARL PEARSON'S COEFFICIENT OF CORRELATION:
To measure the intensity or degree of linear relationship between two variables, Karl Pearson
developed a formula called the correlation coefficient.
The correlation coefficient between two variables x and y, denoted by r(x, y), is defined as
r(x, y) = cov(x, y) / (σx σy)   (1)
In a bivariate distribution where (xi, yi) takes the values (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), we define
cov(x, y) = (1/n) Σ (xi − x̄)(yi − ȳ)   (2)
where x̄, ȳ are the arithmetic means of the x and y series respectively.
Also, the standard deviations of the x and y series are
σx = √[(1/n) Σ (xi − x̄)²] and σy = √[(1/n) Σ (yi − ȳ)²]   (3)
The correlation coefficient r(x, y) = cov(x, y)/(σx σy) can be calculated using the following simplifications.
Starting from cov(x, y) = (1/n) Σ (xi − x̄)(yi − ȳ) and simplifying, we get
cov(x, y) = (1/n) Σ xi yi − x̄ ȳ   (4)
σx² = (1/n) Σ xi² − x̄²   (5)
Similarly, σy² = (1/n) Σ yi² − ȳ²   (6)
For simplification of the calculation, we put
ui = xi − A and vi = yi − B, or ui = (xi − A)/h and vi = (yi − B)/k; then
cov(u, v) = (1/n) Σ ui vi − ū v̄   (7)
σu² = (1/n) Σ ui² − ū² and σv² = (1/n) Σ vi² − v̄²   (8)
and r(u, v) is given by r(u, v) = cov(u, v)/(σu σv). It can be established that r(x, y) = r(u, v).
We note here that the calculation of r(u, v) is simpler than that of r(x, y).
Using results (4), (5) and (6) in (1), we can write the formula as
r(x, y) = (Σ xi yi − n x̄ ȳ) / √[(Σ xi² − n x̄²)(Σ yi² − n ȳ²)]
Also, using results (7) and (8) in (1), we can write the formula as
r(u, v) = (Σ ui vi − n ū v̄) / √[(Σ ui² − n ū²)(Σ vi² − n v̄²)]
Property: It can be shown that the coefficient of correlation must satisfy −1 ≤ r ≤ 1.
Ex. 1: Following are the values of import of raw material and export of finished product
Export 10 11 14 14 20 22 16 12 15 13
Import 12 14 15 16 21 26 21 15 16 14
Calculate the coefficient of correlation between the import value and export values.
Solution: For n = 10 the data is tabulated as:

x y x² y² xy
10 12 100 144 120
11 14 121 196 154
14 15 196 225 210
14 16 196 256 224
20 21 400 441 420
22 26 484 676 572
16 21 256 441 336
12 15 144 225 180
15 16 225 256 240
13 14 169 196 182
Σx = 147  Σy = 170  Σx² = 2291  Σy² = 3056  Σxy = 2638

x̄ = Σx/n = 147/10 = 14.7,  ȳ = Σy/n = 170/10 = 17

r(x, y) = cov(x, y)/(σx σy) = (Σxy − n x̄ ȳ) / √[(Σx² − n x̄²)(Σy² − n ȳ²)]
= (2638 − 10 × 14.7 × 17) / √[(2291 − 10 × 14.7²)(3056 − 10 × 17²)]
= 139 / √(130.1 × 166) = 0.9458

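The computation in Ex. 1 can be sketched in Python (this code is an illustration, not part of the original notes); the helper uses the simplified formula r = (Σxy − n x̄ ȳ)/√[(Σx² − n x̄²)(Σy² − n ȳ²)] and checks it against the export/import data above.

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's coefficient via the simplified formula
    r = (sum(xy) - n*xbar*ybar) / sqrt((sum(x^2) - n*xbar^2)(sum(y^2) - n*ybar^2))."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    sxx = sum(a * a for a in x) - n * xbar * xbar
    syy = sum(b * b for b in y) - n * ybar * ybar
    return sxy / sqrt(sxx * syy)

# data of Ex. 1: export and import values
export = [10, 11, 14, 14, 20, 22, 16, 12, 15, 13]
import_ = [12, 14, 15, 16, 21, 26, 21, 15, 16, 14]
print(pearson_r(export, import_))  # ≈ 0.9458, agreeing with the worked solution
```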
Ex. 2: Calculate the coefficient of correlation from the following information:
n = 10, Σx = 40, Σx² = 190, Σy = 40, Σy² = 200, Σxy = 150

Sol.: Here x̄ = Σx/n = 40/10 = 4, so x̄² = 16, and ȳ = Σy/n = 40/10 = 4, so ȳ² = 16.
The correlation coefficient between x and y is
r(x, y) = cov(x, y)/(σx σy) = (Σxy − n x̄ ȳ) / √[(Σx² − n x̄²)(Σy² − n ȳ²)]
= (150 − 10 × 4 × 4) / √[(190 − 10 × 4²)(200 − 10 × 4²)]
= −10 / √(30 × 40) ≈ −0.2887

Ex. 3: Given: r = 0.9, ΣXY = 70, σy = 3.5, ΣX² = 100. Find the number of items, if X and Y are
deviations from the arithmetic means.

Sol.: σx² = (1/n) Σ(x − x̄)² = (1/n) ΣX² = 100/n

r(x, y) = cov(x, y)/(σx σy) = [(1/n) Σ(x − x̄)(y − ȳ)] / (σx σy) = ΣXY / (n σx σy)

Squaring, we get r² = (ΣXY)² / (n² σx² σy²)

(0.9)² = 70² / [n² × (100/n) × (3.5)²], i.e. 0.81 = 4900 / (1225 n)

0.81 × 1225 n = 4900, so n = 4900/992.25 ≈ 4.94, i.e. n ≈ 5.
2.3 t-TEST FOR A CORRELATION COEFFICIENT:
The most frequently used test to examine whether two variables X and Y are correlated is
the t-test. To apply this test, we first set up the two hypotheses as follows:

H0: ρ = 0 (the variables are uncorrelated) and H1: ρ ≠ 0

where ρ is the population correlation coefficient.

The formula used for the t-test is

t = r √(n − 2) / √(1 − r²)

where r is the sample correlation coefficient.

The test statistic follows a t distribution with n − 2 degrees of freedom. Let us take an example.
Ex. 1: Given the sample correlation coefficient between advertising expenditure and sales revenue
and the sample size, determine whether there is a significant association between the two.
Soln.: We apply the t-test and use the formula t = r √(n − 2)/√(1 − r²).
The critical value of t for n − 2 df at the chosen level of significance is read from the t table. As the
calculated value of t is more than the critical value, the null hypothesis is rejected. This means that
there is a statistically significant correlation between the two variables.
Ex. 2: Calculate the correlation coefficient between the two series given below. Examine whether
there is a significant relationship between the two variables.
X 10 20 30 40 50
y 3 2 1 5 4

Soln.:
Here n = 5, x̄ = 30, ȳ = 3, Σxy = 500, Σx² = 5500, Σy² = 55, so
r = (Σxy − n x̄ ȳ) / √[(Σx² − n x̄²)(Σy² − n ȳ²)]
= (500 − 450) / √[(5500 − 4500)(55 − 45)] = 50/√10000 = 0.5
In order to examine whether the relationship between X and y is
statistically significant, we apply the t-test.
The two hypotheses are:

H0: ρ = 0 and H1: ρ ≠ 0

where ρ indicates the correlation coefficient of the population.

The formula for the t-test is
t = r √(n − 2)/√(1 − r²) = 0.5 × √3/√0.75 = 1.0
The critical value of t with n − 2 = 3 degrees of freedom at the 5% level of significance is 3.182. As the
calculated value of t = 1.0 is less than the critical value of 3.182, the null hypothesis is accepted. The
conclusion is that the relationship between X and y is not statistically significant.
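The test in Ex. 2 can be sketched in Python (an illustrative addition, not from the original notes); n = 5 and r = 0.5 come from the example above, and 3.182 is the standard two-tailed 5% critical value of t with 3 df.

```python
from math import sqrt

def t_statistic(r, n):
    """t = r*sqrt(n - 2)/sqrt(1 - r^2); compare with the t distribution on n - 2 df."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# Ex. 2 above: n = 5 observations gave r = 0.5
t = t_statistic(0.5, 5)
print(t)  # 0.5*sqrt(3)/sqrt(0.75) = 1.0
# two-tailed critical value at the 5% level with 3 df is 3.182,
# so 1.0 < 3.182 and H0 (rho = 0) is not rejected
```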
2.4 ASSUMPTIONS OF THE KARL PEARSONIAN CORRELATION:
1. The two variables x and y are linearly related. This implies that when the individual pairs are plotted
on a graph, the resulting scatter diagram is such that, if the points are joined together, an approximately
straight line is formed.
2. The two variables are affected by a large number of independent causes, so as to form a normal
distribution. For example, relationships between price and demand, price and supply, advertising
expenditure and sales, length of experience and earnings, and so on, are affected by several factors such
that the series tend towards a normal distribution.

EXERCISE: 2.1
Ex.1 From a group of 10 students, marks obtained by each in two papers x and y are given below

x 23 28 42 17 26 35 29 37 16 46
y 25 22 38 21 27 39 24 32 18 44
Calculate coefficient of correlation between x and y .

Ex.2 Obtain the correlation coefficient between population density (per square mile) and death rate
(per thousand persons) from the following data:
Population density 200 500 400 700 300
Death rate 12 18 16 21 10

Ex.3 Calculate the coefficient of correlation for the following distribution


x 35 34 40 43 56 20 38
y 32 30 31 32 53 20 33
Ex.4 Find coefficient of correlation for the data:
x 10 14 18 22 26 30
y 18 12 24 06 30 36

2.5 REGRESSION
After having established that two variables are correlated, we are generally interested in
estimating the value of one variable for a given value of the other. For example, if we know
that rainfall affects the crop output, then it is possible to predict the crop output at the end of a rainy
season. If the variables in a bivariate distribution are related, the points in the scatter diagram cluster
around some curve called the curve of regression or the regression curve. If the curve is a straight line,
it is called the line of regression, and in such a case the regression between the two variables is linear.
The line of regression gives the best estimate for the value of one variable for some specified value of
the other variable.

2.6 CURVE FITTING


In experimental work, we often encounter the problem of fitting a curve to data obtained from
observations connecting two variables. Representing such data by means of polynomials has already
been considered. There are different methods of fitting a curve: the scatter diagram is a graphical
method, while the least-squares approximation is the most commonly applied technique for the
best fit.

2.7 SCATTER DIAGRAM


To find a relationship between a set of paired observations, we plot the corresponding values on a
graph, taking one of the variables along the x-axis and the other along the y-axis. The resulting
diagram, showing a collection of dots, is called a scatter diagram. A smooth curve that approximates
this set of points is known as the approximating curve.


2.8 LEAST SQUARE APPROXIMATION


As a result of a certain experiment, suppose the values of the variables (xi, yi) are recorded for i = 1, 2,
3, …, n. If these points are plotted, it is usually observed that a smooth curve passes through most of
these points, while some of the points lie slightly away from this curve. The curve passing through
these points may be a first-degree curve, i.e. a straight line y = ax + b, a second-degree parabola
such as
y = ax² + bx + c,
or in general an nth-degree curve
y = a0xⁿ + a1xⁿ⁻¹ + a2xⁿ⁻² + … + an.
To determine the equation of the curve which very nearly passes through the set of points, we assume
some form of relation between x and y (a straight line, or a parabola of second degree, third
degree and so on) which we expect to be the best fit.
In the figure below, we observe that a straight line very nearly passes through the set of points. We may
assume the equation of the straight line as y = ax + b (1)

If the point (xi, yi) is assumed to lie on (1), then the y co-ordinate of the point can be calculated as
y′i = axi + b
If the point actually lies on (1), then yi = y′i.
Otherwise yi − y′i represents the deviation of the observed value yi from the value y′i calculated
using formula (1).
In the method of least squares, we take the sum of the squares of these deviations and minimize this sum
using the principle of maxima and minima. The values of a and b in (1) are calculated using this
criterion, called the least-squares criterion. A curve of any degree can be fitted using the least-squares
criterion. We shall now discuss the fitting of a straight line and of a second-degree parabola to a given
set of points.
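The straight-line case can be sketched in Python (an illustrative addition). The normal equations for y = ax + b are Σy = aΣx + nb and Σxy = aΣx² + bΣx; the data points below are hypothetical, chosen to lie near y = 2x + 1.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b via the normal equations
    sum(y) = a*sum(x) + n*b  and  sum(xy) = a*sum(x^2) + b*sum(x)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# hypothetical observations lying near y = 2x + 1
a, b = fit_line([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
print(a, b)  # close to the underlying slope 2 and intercept 1
```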

2.9 REGRESSION LINES:

Consider the set of values (xi, yi), i = 1, 2, 3, …, n. Let the line of regression of y on x be
y = mx + c
Using the method of least squares, the normal equations for estimating m and c give the
regression line of y on x as
y − ȳ = r (σy/σx)(x − x̄), i.e. y − ȳ = byx (x − x̄)   (9)

where byx = r σy/σx is called the regression coefficient of y on x.
Similarly, the regression line of x on y is
x − x̄ = r (σx/σy)(y − ȳ), i.e. x − x̄ = bxy (y − ȳ)   (10)

where bxy = r σx/σy is called the regression coefficient of x on y.
For obtaining the regression lines (9) and (10), we have to calculate r(x, y), σx and σy.

For simplification of the calculation, we can use the change-of-origin and change-of-scale property.
If u = x − a, v = y − b, then

r(x, y) = r(u, v) = cov(x, y)/(σx σy) = cov(u, v)/(σu σv)

where σx = σu, σy = σv, cov(u, v) = (1/n) Σ uv − ū v̄, σu² = (1/n) Σ ui² − ū²,
σv² = (1/n) Σ vi² − v̄², and
x̄ = a + ū, ȳ = b + v̄
In particular, if u = (x − a)/h, v = (y − b)/k, then
r = r(x, y) = cov(u, v)/(σu σv)

where σx = h σu, σy = k σv and x̄ = a + h ū, ȳ = b + k v̄.

Property 1: Since byx × bxy = [r (σy/σx)] × [r (σx/σy)] = r², we have r = ±√(byx × bxy).
Property 2: If θ is the acute angle between the two regression lines in the case of two variables
x and y, then
tan θ = [(1 − r²)/r] × [σx σy/(σx² + σy²)].

Ex.1: Obtain the lines of regression for the following data.

x 6 2 10 4 8
y 9 11 5 8 7
Ans: We prepare the table:
x y x² y² xy
6 9 36 81 54
2 11 4 121 22
10 5 100 25 50
4 8 16 64 32
8 7 64 49 56
Σx = 30  Σy = 40  Σx² = 220  Σy² = 340  Σxy = 214
No. of observations = n = 5

x̄ = Σxi/n = 30/5 = 6 and ȳ = Σyi/n = 40/5 = 8
σx² = Σxi²/n − x̄² = 220/5 − 6² = 8
σy² = Σyi²/n − ȳ² = 340/5 − 8² = 4

cov(x, y) = Σxi yi/n − x̄ ȳ = 214/5 − 6 × 8 = −5.2

byx = cov(x, y)/σx² = −5.2/8 = −0.65,  bxy = cov(x, y)/σy² = −5.2/4 = −1.3
Regression line of y on x: y − ȳ = byx (x − x̄)
y − 8 = −0.65 (x − 6)
y = −0.65x + 11.9
Regression line of x on y: x − x̄ = bxy (y − ȳ)
x − 6 = −1.3 (y − 8)
x = −1.3y + 16.4

Ex.2: Compute the regression lines for the following data:

x 10 14 19 26 30 34 39
y 12 16 18 26 29 35 38

and estimate y for x = 14.5 and x for y = 29.5.
Sol: We prepare the table:

x y u = x − 26 v = y − 26 u² v² uv
10 12 −16 −14 256 196 224
14 16 −12 −10 144 100 120
19 18 −7 −8 49 64 56
26 26 0 0 0 0 0
30 29 4 3 16 9 12
34 35 8 9 64 81 72
39 38 13 12 169 144 156
Total  −10 −8 698 594 640

Here n = 7, ū = −10/7 = −1.429, v̄ = −8/7 = −1.143

cov(u, v) = (1/n) Σ uv − ū v̄ = 640/7 − (−1.429)(−1.143) = 89.795

σu² = (1/n) Σ ui² − ū² = 698/7 − (−1.429)² = 97.672, so σu = 9.883

σv² = (1/n) Σ vi² − v̄² = 594/7 − (−1.143)² = 83.551, so σv = 9.14

r = r(x, y) = r(u, v) = cov(u, v)/(σu σv) = 89.795/(9.883 × 9.14) = 0.9941

byx = r σy/σx = r σv/σu = 0.9941 × 9.14/9.883 = 0.9194
bxy = r σx/σy = r σu/σv = 0.9941 × 9.883/9.14 = 1.0749
x̄ = 26 + ū = 26 − 1.429 = 24.571
ȳ = 26 + v̄ = 26 − 1.143 = 24.857
Regression line of y on x: y − ȳ = byx (x − x̄)

y − 24.857 = 0.9194 (x − 24.571)   (1)

Regression line of x on y: x − x̄ = bxy (y − ȳ)

x − 24.571 = 1.0749 (y − 24.857)   (2)

To estimate y for x = 14.5, put the value of x in (1): we get y = 15.5977
To estimate x for y = 29.5, put the value of y in (2): we get x = 29.5618

Ex. 3: The regression equations are 8x − 10y + 66 = 0 and 40x − 18y = 214. The variance of x
is 9. Find: A) the mean values of x and y, B) the correlation coefficient, C) the standard deviation of
y.
Sol.: A) The given regression equations are 8x − 10y + 66 = 0 and 40x − 18y = 214. Since both
regression lines pass through the point (x̄, ȳ), solving the above equations we get
x̄ = 13 and ȳ = 17
B) Let 8x − 10y + 66 = 0 be the line of regression of y on x, and 40x − 18y = 214 be the line of
regression of x on y. The equations can then be written as
y = (8/10)x + 66/10 and x = (18/40)y + 214/40
byx, the regression coefficient of y on x, = 0.8, and bxy, the regression coefficient of x on y, = 0.45.
The coefficient of correlation between x and y is given by r = ±√(bxy × byx) = ±√(0.45 × 0.8) = ±0.6
Since both regression coefficients are positive, we take r = 0.6.
C) Variance of x = 9, i.e. σx² = 9, therefore σx = 3.
We have byx = r σy/σx, so 0.8 = 0.6 × σy/3, giving σy = 4.
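The arithmetic of Ex. 3 can be verified with a short Python sketch (an illustrative addition); the elimination steps in the comments mirror the solution above.

```python
from math import sqrt

# Ex. 3 above: regression lines 8x - 10y + 66 = 0 and 40x - 18y = 214.
# Both lines pass through (xbar, ybar), so solve the 2x2 system by elimination:
# multiply 8x - 10y = -66 by 5 to get 40x - 50y = -330; subtracting it from
# 40x - 18y = 214 gives 32y = 544.
ybar = 544 / 32                 # = 17
xbar = (10 * ybar - 66) / 8     # = 13
byx = 8 / 10                    # slope of y = (8/10)x + 6.6
bxy = 18 / 40                   # slope of x = (18/40)y + 5.35
r = sqrt(byx * bxy)             # positive root, since both coefficients are positive
print(xbar, ybar, r)  # ≈ 13, 17, 0.6
```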
EXERCISE 2.2
1. Obtain lines of regression for the following data.
x 40 44 28 30 44 38 31
y 32 39 26 30 38 34 28
2. Determine the equations of regression lines for the following data:
x 1 2 3 4 5 6 7 8 9
y 9 8 10 12 11 13 14 16 15
and obtain an estimate of y for x  4.5
3. Obtain lines of regression for the following data.
x 6 2 10 4 8
y 9 11 5 8 7
and find x for y  9.3
4. If the two lines of regression are 9x + y = λ and 4x + y = μ, and the means of x and y are 2
and −3 respectively, find the values of λ, μ and the coefficient of correlation between x and y.
5. The regression equations are: 3x + 2y − 26 = 0 and 6x + y − 31 = 0. Find:
(i) the mean values of x and y, (ii) the correlation coefficient between x and y.

6. Fit a straight line of the form y = ax + b to the following data by using the least-squares method.

x 0 2 4 6 8 12 20
y 10 12 18 22 20 30 30

2.10 MULTIPLE CORRELATION

Introduction: Multiple correlation is based on three or more variables, without excluding the effect of
any one of them. It is denoted by R, as against r, which is used to denote the simple bivariate correlation
coefficient.
In the case of three variables x1, x2, x3, the multiple correlation coefficients are given as follows.
The multiple correlation coefficient with x1 as the dependent variable and x2, x3 as the
independent variables is defined as

R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)]

The multiple correlation coefficient with x2 as the dependent variable and x1, x3 as the
independent variables is defined as

R2.13 = √[(r12² + r23² − 2 r12 r13 r23) / (1 − r13²)]

The multiple correlation coefficient with x3 as the dependent variable and x1, x2 as the
independent variables is defined as

R3.12 = √[(r13² + r23² − 2 r12 r13 r23) / (1 − r12²)]

As is the case with simple bivariate correlation, the coefficient of multiple correlation lies between
0 and 1. As R moves closer to 0, the relationship becomes more and more negligible; as it moves
closer to 1, the relationship becomes stronger. If R is 1, the correlation is called perfect. It may be
added that when R is 0, showing the absence of a linear relationship, it is still possible that there is a
non-linear relationship among the variables. Another point to note is that the multiple coefficient of
correlation is always positive. This is in contrast to the simple bivariate coefficient of correlation,
which may vary from −1 to +1.
We can obtain the coefficient of multiple determination by squaring the multiple coefficient of
correlation.
Example 1: Given the zero-order coefficients of correlation r12, r13 and r23, calculate the multiple
coefficients of correlation.
Solution: Substitute the given values into the formulas for R1.23, R2.13 and R3.12 above.

Example 2: Given another set of zero-order coefficients of correlation, calculate the multiple
coefficients of correlation.
Solution: Proceed exactly as in Example 1.

Example 3: Calculate the zero-order correlation coefficients and the multiple coefficient of
correlation from the following data:

X 3 4 5 6 7 8 9

Y 2 5 6 4 3 2 4

Z 5 6 4 5 6 5 8

Solution:
X Y Z X² Y² Z² XY XZ YZ

3 2 5 9 4 25 6 15 10

4 5 6 16 25 36 20 24 30

5 6 4 25 36 16 30 20 24

6 4 5 36 16 25 24 30 20

7 3 6 49 9 36 21 42 18

8 2 5 64 4 25 16 40 10

9 4 8 81 16 64 36 72 32

Totals: ΣX = 42, ΣY = 26, ΣZ = 39, ΣX² = 280, ΣY² = 110, ΣZ² = 227, ΣXY = 153, ΣXZ = 243, ΣYZ = 144

Then the zero-order correlation coefficients are given by
rXY = (ΣXY/n − X̄ Ȳ)/(σX σY), rXZ = (ΣXZ/n − X̄ Z̄)/(σX σZ) and rYZ = (ΣYZ/n − Ȳ Z̄)/(σY σZ).
With n = 7 these give rXY ≈ −0.155, rXZ ≈ 0.546 and rYZ ≈ −0.075, and hence the multiple
coefficient of correlation of X on Y and Z is RX.YZ ≈ 0.558.
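A Python sketch (an illustrative addition, not from the original notes) reproduces these values from the table data, using the standard R1.23 formula with X as the dependent variable:

```python
from math import sqrt

def pearson_r(a, b):
    """Zero-order (simple) correlation, population variances as in the notes."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    cov = sum(p * q for p, q in zip(a, b)) / n - abar * bbar
    sa = sqrt(sum(p * p for p in a) / n - abar ** 2)
    sb = sqrt(sum(q * q for q in b) / n - bbar ** 2)
    return cov / (sa * sb)

def multiple_R(r12, r13, r23):
    """Standard formula for R1.23: variable 1 dependent, 2 and 3 independent."""
    return sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))

# data of Example 3
X = [3, 4, 5, 6, 7, 8, 9]
Y = [2, 5, 6, 4, 3, 2, 4]
Z = [5, 6, 4, 5, 6, 5, 8]
rxy, rxz, ryz = pearson_r(X, Y), pearson_r(X, Z), pearson_r(Y, Z)
print(rxy, rxz, ryz)              # ≈ -0.155, 0.546, -0.075
print(multiple_R(rxy, rxz, ryz))  # R of X on Y, Z ≈ 0.558
```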

2.11 COEFFICIENT OF DETERMINATION:


When r = 0 or r = ±1, the interpretation of r does not pose any problem. When r = ±1, all the points
lie on a straight line in the graph, showing a perfect positive or a perfect
negative correlation. When the points are extremely scattered on the graph, it becomes evident that
there is almost no relationship between the two variables. However, when it comes to other values of
r, we have to be careful in the interpretation. Suppose we get a correlation of r = 0.9; we may be
tempted to say that it is 'twice as good' or 'twice as strong' as a correlation of r = 0.45. It may be noted
that this comparison is wrong. The strength of r is judged by the coefficient of determination, r².
For r = 0.9, r² = 0.81; we multiply it by 100, thus getting 81 percent. This suggests that when r
is 0.9, 81 percent of the total variation in the y series can be attributed to the
relationship with x. When r = 0.45, r² = 0.2025, and in percentage terms it is 20.25.

2.12 MULTIPLE REGRESSION:


The multiple linear regression takes the following form:

y = a + b1 x1 + b2 x2 + … + bk xk

where y is the dependent variable which is to be predicted; x1, x2, …, xk are the k known variables
on which the predictions are to be based; and a, b1, b2, …, bk are parameters, the values of which are
determined by the method of least squares.

Example 1) The following data relate to radio advertising expenditures, newspaper advertising
expenditures and sales. Fit a regression of sales on the two advertising expenditures.

Radio advertising expenditures ('000 Rs) (X1): 4 7 9 12

Newspaper advertising expenditures ('000 Rs) (X2): 1 2 5 8

Sales (Rs Lakh) (Y): 7 12 17 20

Solution: It may be noted here that as there are three variables, viz. Y, X1 and X2, there will be three
normal equations as below:

ΣY = na + b1 ΣX1 + b2 ΣX2
ΣX1Y = a ΣX1 + b1 ΣX1² + b2 ΣX1X2
ΣX2Y = a ΣX2 + b1 ΣX1X2 + b2 ΣX2²

X1 X2 Y X1² X1X2 X2² X1Y X2Y

4 1 7 16 4 1 28 7

7 2 12 49 14 4 84 24

9 5 17 81 45 25 153 85

12 8 20 144 96 64 240 160

Totals: ΣX1 = 32, ΣX2 = 16, ΣY = 56, ΣX1² = 290, ΣX1X2 = 159, ΣX2² = 94, ΣX1Y = 505, ΣX2Y = 276

Applying the above values in the normal equations:

56 = 4a + 32 b1 + 16 b2   ….. (1)
505 = 32a + 290 b1 + 159 b2   …. (2)
276 = 16a + 159 b1 + 94 b2   ….. (3)
After solving the above equations, we get a ≈ 0.644, b1 ≈ 1.661, b2 ≈ 0.017.

The regression equation is Y ≈ 0.644 + 1.661 X1 + 0.017 X2.
Example 2) The following data relate to sales and advertising. Fit a regression of sales on advertising
expenditure and the number of selling agents.

Sales Territory | Sales (Lakh Rs) (Y) | Advertising ('000 Rs) (X1) | Number of selling agents (X2)

1 100 40 10

2 80 30 10

3 60 20 7

4 120 50 15

5 150 60 20

6 90 40 12

7 70 20 8

8 130 60 14

Solution: It may be noted here that as there are three variables, viz. Y, X1 and X2, there will be three
normal equations, as in Example 1.

X1 X2 Y X1² X1X2 X2² X1Y X2Y

40 10 100 1600 400 100 4000 1000

30 10 80 900 300 100 2400 800

20 7 60 400 140 49 1200 420

50 15 120 2500 750 225 6000 1800

60 20 150 3600 1200 400 9000 3000

40 12 90 1600 480 144 3600 1080

20 8 70 400 160 64 1400 560

60 14 130 3600 840 196 8400 1820

Totals: ΣX1 = 320, ΣX2 = 96, ΣY = 800, ΣX1² = 14600, ΣX1X2 = 4270, ΣX2² = 1278, ΣX1Y = 36000, ΣX2Y = 10480

Applying the above values in the normal equations:

800 = 8a + 320 b1 + 96 b2   ….. (1)
36000 = 320a + 14600 b1 + 4270 b2   …. (2)
10480 = 96a + 4270 b1 + 1278 b2   ….. (3)
After solving the above equations, we get a ≈ 19.05, b1 ≈ 3.00, b2 ≈ −3.25.

The regression equation is Y ≈ 19.05 + 3.00 X1 − 3.25 X2.
