PS - Module 3 - ViRa

Probability and Statistics

Module 3 - Correlation and Regression

Vignesh Ravi

Division of Mathematics, School of Advanced Sciences,

Vellore Institute of Technology, Chennai.

Contact: [email protected]

1 / 104
Outline
Correlation
Types of correlation
Coefficient of correlation
Problems
Rank Correlation Coefficient
Regression
Problems
Partial Correlation
Multi-linear Regression
Problems for Practice

2 / 104
Correlation

Correlation refers to the relationship between two or more variables.
We know that a relationship exists between the heights of a father
and a son, and between wages and the price index. The study of such
relationships is called “Correlation”.
Definition
Correlation is a statistical analysis which measures and analyses
the degree or extent to which two variables fluctuate with reference
to each other. The correlation expresses the relationship or
interdependence of two sets of variables upon each other. One variable
may be called the subject (independent) and the other the relative
(dependent).

3 / 104
Types of correlation

Correlation is classified into many types.


▶ Positive and negative
▶ Simple and multiple
▶ Partial and total
▶ Linear and non-linear

4 / 104
Types of correlation

Positive correlation
If two variables tend to move together in the same direction, that is,
an increase in the value of one variable is accompanied by an
increase in the value of the other variable, or a decrease in the
value of one variable is accompanied by a decrease in the value of
the other variable, then the correlation is called “positive or direct
correlation”.

Example
height and weight, rainfall and yield of crops, price and supply

5 / 104
Types of correlation

Negative correlation
If two variables tend to move in opposite directions, that is, an
increase in the value of one variable is accompanied by a decrease
in the value of the other variable, or a decrease in one is
accompanied by an increase in the other, then the correlation is
called “negative or inverse correlation”.

Example

6 / 104
Types of correlation

Simple correlation
When only two variables are studied, the relationship is described
as simple correlation.

Example
Quantity of money and price level, demand and price.

7 / 104
Types of correlation

Multiple correlation
When more than two variables are studied simultaneously, the
relationship is described as multiple correlation.

Example
The relationship of price, demand and supply of a commodity.

8 / 104
Types of correlation

Partial correlation
The study of two variables excluding some other variables is called
“partial correlation”.

Example
Price and demand, eliminating the supply side.

Note
In total correlation, all the facts are taken into account.

9 / 104
Types of correlation

Linear correlation
If the ratio of change between two variables is uniform, then there
will be linear correlation between them.

The ratio of change between the variables is the same. If we plot
these on the graph, we get a straight line.

10 / 104
Types of correlation

Non linear correlation


In a curvilinear or non-linear correlation, the amount of change in
one variable does not bear a constant ratio to the amount of
change in the other variable. The graph of a non-linear or
curvilinear relationship will be a curve.

11 / 104
Coefficient of correlation

Karl Pearson’s Coefficient of correlation


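A mathematical method for measuring the magnitude of the linear
relationship between two variables. It is denoted by r.

\[ r = \frac{\text{covariance of } x \text{ and } y}{\sigma_x \sigma_y} \]

Direct method

\[ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\; \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} \]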
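The direct method can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `pearson_direct` is hypothetical, and the sample data are borrowed from Practice Problem 9 of this module (whose stated answer is r = 0.8062).

```python
import math

def pearson_direct(xs, ys):
    """Pearson's r by the direct (raw sums) formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sx, sy = sum(xs), sum(ys)                  # sum of x_i, sum of y_i
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum of x_i * y_i
    sxx = sum(x * x for x in xs)               # sum of x_i^2
    syy = sum(y * y for y in ys)               # sum of y_i^2
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    if den == 0:
        raise ValueError("r is undefined when one variable is constant")
    return num / den

# Data from Practice Problem 9 (assumed here only as an illustration).
x = [1, 2, 3, 4, 5]
y = [2, 5, 3, 8, 7]
print(round(pearson_direct(x, y), 4))  # 0.8062
```

The raw-sums form avoids computing the means first, which is why it suits hand calculation from a table of x, y, xy, x² and y² columns.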

12 / 104
Coefficient of correlation

Actual mean method

\[ r = \frac{\sum XY}{\sqrt{\sum X^2}\; \sqrt{\sum Y^2}} \]

where X = x_i − x̄, Y = y_i − ȳ, and x̄ and ȳ are the means of x and y.

Assumed mean method

Step deviation method

13 / 104
Properties of correlation coefficient

▶ The correlation coefficient r lies between -1 and 1; symbolically,
|r| ≤ 1 or −1 ≤ r ≤ 1.
▶ The coefficient of correlation is independent of the change of
origin and scale of measurements.
▶ If X, Y are random variables and a, b, c, d are any numbers
such that a ≠ 0 and c ≠ 0, then
\[ r(aX + b,\; cY + d) = \frac{ac}{|ac|}\, r(X, Y) \]
▶ Two independent variables are uncorrelated, i.e., if X and Y are
independent variables, then r(X, Y) = 0.
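The change of origin and scale property can be checked numerically. The sketch below is an illustration only: the helper `pearson` (written with the actual mean method) and the constants a, b, c, d are assumptions chosen for the demonstration, not values from the slides.

```python
import math

def pearson(xs, ys):
    """Pearson's r using deviations from the actual means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    num = sum(p * q for p, q in zip(dx, dy))
    den = math.sqrt(sum(p * p for p in dx)) * math.sqrt(sum(q * q for q in dy))
    return num / den

x = [1, 2, 3, 4, 5]
y = [2, 5, 3, 8, 7]
r = pearson(x, y)

# ac > 0: the coefficient is unchanged.
same_sign = pearson([2 * v + 3 for v in x], [0.5 * v - 7 for v in y])
# ac < 0: the coefficient keeps its magnitude but flips sign.
opposite_sign = pearson([2 * v + 3 for v in x], [-5 * v + 1 for v in y])

print(math.isclose(same_sign, r), math.isclose(opposite_sign, -r))
```

Note that the sign factor ac/|ac| is what makes "independent of scale" true only up to sign when the two scale factors have opposite signs.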

14 / 104
Problem 1

Calculate coefficient of correlation from the following data:

Solution:

15 / 104
Problem 1 Contd.

16 / 104
Problem 2

Find if there is any significant correlation between the heights and
weights given below:

Solution:

17 / 104
Problem 2 Contd.

18 / 104
Problem 3

Find if there is any significant correlation between the heights and
weights given below:

19 / 104
Problem 3 Contd.

Solution:

20 / 104
Problem 3 Contd.

21 / 104
Problem 4

Psychological tests of intelligence and of engineering ability were
applied to 10 students. Here is a record of ungrouped data
showing Intelligence Ratio (I.R.) and Engineering Ratio (E.R.).
Calculate the correlation coefficient.

22 / 104
Problem 4 Contd.

23 / 104
Problem 4 Contd.

24 / 104
Problem 5

With the following data in 6 cities, calculate the coefficient of
correlation by Pearson’s method between the density of population
and the death rate.

25 / 104
Problem 5 Contd.

26 / 104
Problem 6

Calculate Karl Pearson’s correlation coefficient for the following
paired data:

27 / 104
Problem 6 Contd.

28 / 104
Problem 6 Contd.

29 / 104
Problem 7

Find a suitable coefficient of correlation for the following data:

30 / 104
Problem 7 Contd.

31 / 104
Problem 7 Contd.

32 / 104
Problem 8

The following table gives the distribution of the total population
and those who are totally and partially blind among them. Find
out if there is any relation between age and blindness.

33 / 104
Problem 8 Contd.

34 / 104
Problem 8 Contd.

35 / 104
Problem 8 Contd.

36 / 104
Rank Correlation Coefficient

Spearman’s Rank correlation

\[ \rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]

where ρ is the rank coefficient of correlation, ΣD² is the sum of the
squares of the differences of two ranks and N is the number of
paired observations.
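As a minimal sketch of the formula (valid when there are no tied ranks), the function below takes two lists of ranks directly. The name `spearman_rho` is hypothetical, and the ranks are those of Practice Problem 8 of this module, whose stated answer is −0.2970.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation for untied ranks."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("need two equal-length rank lists with at least 2 items")
    # Sum of squared rank differences, i.e. the sum of D^2.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

maths = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
physics = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]
print(round(spearman_rho(maths, physics), 4))  # -0.297
```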

37 / 104
Rank Correlation Coefficient

Properties of Rank correlation coefficient


▶ The value of ρ lies between -1 and 1.
▶ If ρ = 1, there is complete agreement in the order of the ranks
and the direction of the rank is same.
▶ If ρ = −1, there is complete disagreement in the order of the
ranks and they are in opposite direction.

38 / 104
Equal and repeated ranks
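The content of this slide did not survive extraction. The sketch below therefore assumes the commonly taught adjustment: tied items share the average of the ranks they would occupy, and a correction m(m² − 1)/12 is added to ΣD² for every group of m tied values. The function names are hypothetical, and the data are from Practice Problem 7 of this module (stated answer ρ = 0.5450).

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every group of m tied values."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_tied(xs, ys):
    """Rank correlation with the assumed textbook adjustment for ties."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    cf = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (d2 + cf) / (n * (n * n - 1))

x = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
y = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]
print(round(spearman_tied(x, y), 3))  # 0.545
```

For these data ΣD² = 72 and the total correction is 3, which reproduces the stated practice answer to the precision given there.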

39 / 104
Problem 1

Following are the ranks obtained by 10 students in two subjects,
Statistics and Mathematics. To what extent is the knowledge of the
students in the two subjects related?

40 / 104
Problem 1 Contd.

41 / 104
Problem 2

A random sample of 5 college students is selected and their grades
in Mathematics and Statistics are found to be

42 / 104
Problem 2 Contd.

43 / 104
Problem 3

44 / 104
Problem 3 Contd.

45 / 104
Problem 4

From the following data, calculate the rank correlation coefficient
after making adjustment for tied ranks.

46 / 104
Problem 4 Contd.

47 / 104
Problem 4 Contd.

48 / 104
Problem 5

Obtain the rank correlation coefficient for the following data:

49 / 104
Problem 5 Contd.

50 / 104
Problem 5 Contd.

51 / 104
Problem 6

A sample of 12 fathers and their elder sons gave the following data
about their elder sons. Calculate the coefficient of rank correlation.

52 / 104
Problem 6 Contd.

53 / 104
Problem 6 Contd.

54 / 104
Regression
Definition
In regression, we can estimate the value of one variable with the
value of the other variable which is known. The statistical method
which helps us to estimate the unknown value of one variable from
the known value of the related variable is called regression.

Line of regression
The line describing the average relationship between two
variables is known as the line of regression.

Example
▶ Used to estimate the relation between two economic variables
like Income and Expenditure. Also in prediction analysis.
▶ It is useful in statistical estimation of demand curves, supply
curves, production function, cost function and consumption
function, etc.
55 / 104
Regression equation

Regression equation of Y on X

Y = a + bX

Normal equations are:


\[ \sum Y = Na + b \sum X \]
\[ \sum XY = a \sum X + b \sum X^2 \]

a, b are constants, a is the Y-intercept and b is called the slope
or regression coefficient of Y on X.
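The two normal equations form a 2 × 2 linear system in a and b, which can be solved in closed form. The sketch below is illustrative only: `fit_y_on_x` is a hypothetical name, and the data are from Practice Problem 9 of this module, whose stated line of Y on X is y = 1.3x + 1.1.

```python
def fit_y_on_x(xs, ys):
    """Solve the normal equations for Y = a + bX and return (a, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx          # determinant of the 2 x 2 system
    if det == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    b = (n * sxy - sx * sy) / det    # slope (regression coefficient of Y on X)
    a = (sy - b * sx) / n            # intercept, from the first normal equation
    return a, b

a, b = fit_y_on_x([1, 2, 3, 4, 5], [2, 5, 3, 8, 7])
print(round(a, 2), round(b, 2))  # 1.1 1.3
```

The regression of X on Y follows from the same function with the arguments swapped, i.e. `fit_y_on_x(ys, xs)`.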

56 / 104
Regression equation

Regression equation of X on Y

X = a + bY

Normal equations are:


\[ \sum X = Na + b \sum Y \]
\[ \sum XY = a \sum Y + b \sum Y^2 \]

a, b are constants, a is the X-intercept and b is called the slope
or regression coefficient of X on Y.

57 / 104
Deviations taken from the arithmetic means of X and Y

Regression equation of Y on X

\[ Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X}) \]

The regression coefficient of Y on X is

\[ b_{yx} = r \frac{\sigma_y}{\sigma_x} = \frac{\sum xy}{\sum x^2} \]

where x = X − X̄ and y = Y − Ȳ.

58 / 104
Deviations taken from the arithmetic means of X and Y

Regression equation of X on Y

\[ X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y}) \]

The regression coefficient of X on Y is

\[ b_{xy} = r \frac{\sigma_x}{\sigma_y} = \frac{\sum xy}{\sum y^2} \]

where x = X − X̄ and y = Y − Ȳ.
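A short sketch of the deviation method follows. It computes both regression coefficients as Σxy/Σx² (Y on X) and Σxy/Σy² (X on Y), and then recovers r from the standard identity r² = b_yx · b_xy, with r taking the common sign of the two coefficients. That identity is not stated on the slides and is used here as a known result; the function name is hypothetical and the data are from Practice Problem 9.

```python
import math

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) using deviations from the actual means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    dx = [v - mx for v in xs]                 # x = X - mean(X)
    dy = [v - my for v in ys]                 # y = Y - mean(Y)
    sxy = sum(p * q for p, q in zip(dx, dy))
    sxx = sum(p * p for p in dx)
    syy = sum(q * q for q in dy)
    b_yx = sxy / sxx                          # regression coefficient of Y on X
    b_xy = sxy / syy                          # regression coefficient of X on Y
    # r is the geometric mean of the two coefficients, carrying their sign.
    r = math.copysign(math.sqrt(b_yx * b_xy), sxy)
    return b_yx, b_xy, r

b_yx, b_xy, r = regression_coefficients([1, 2, 3, 4, 5], [2, 5, 3, 8, 7])
print(round(b_yx, 2), round(b_xy, 2), round(r, 4))  # 1.3 0.5 0.8062
```

With means X̄ = 3 and Ȳ = 5, these coefficients give the lines y − 5 = 1.3(x − 3) and x − 3 = 0.5(y − 5), which match the answers stated for Practice Problem 9.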

59 / 104
Problem 1

Determine the equation of a straight line which best fits the data.

60 / 104
Problem 1 Contd.

61 / 104
Problem 2

A panel of two judges P and Q graded seven dramatic
performances by independently awarding marks as follows:

62 / 104
Problem 2 Contd.

63 / 104
Problem 2 Contd.

64 / 104
Problem 3
From the following data, calculate

65 / 104
Problem 4

Given the following data, calculate the expected value of Y when
X = 12.

66 / 104
Problem 4 Contd.

67 / 104
Problem 5

Determine the equation of a straight line which best fits the data:

68 / 104
Problem 5 Contd.

69 / 104
Problem 5 Contd.

70 / 104
Problem 5 Contd.

71 / 104
Problem 5 Contd.

72 / 104
Problem 6

Calculate the regression equation of Y on X from the data given
below, taking deviations from actual means of X and Y.

73 / 104
Problem 6 Contd.

74 / 104
Partial Correlation

▶ Simple correlation is a measure of the relationship between a
dependent variable and an independent variable.
▶ For example, if the performance of a sales person depends
only on the training that he has received, then the relationship
between the training and the sales performance is measured
by the simple correlation coefficient r .
▶ A dependent variable may depend on several variables. For
example, the yarn produced in a factory may depend on the
efficiency of the machine, the quality of cotton, the efficiency
of workers, etc.

75 / 104
Partial Correlation

▶ The technique of partial correlation proves useful when one
has to develop a model with 3 to 5 variables.
▶ Suppose Y is a dependent variable, depending on n other
variables X1, X2, ..., Xn. Partial correlation is a measure of
the relationship between Y and any one of the variables
X1, X2, ..., Xn, as if the other variables have been eliminated
from the situation.
▶ Let r12.3 denote the correlation of X1 and X2 by eliminating
the effect of X3 .
▶ Let r12 be the simple correlation coefficient between X1 and
X2 .
▶ Let r13 be the simple correlation coefficient between X1 and
X3 .

76 / 104
Formula
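The formula on this slide did not survive extraction. The sketch below assumes the standard first-order partial correlation formula, r12.3 = (r12 − r13 r23) / √((1 − r13²)(1 − r23²)), with the other two coefficients obtained by permuting the indices. The function name `partial_corr` is hypothetical; the inputs are those of Practice Problems 13 and 17 of this module.

```python
import math

def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation r_ab.c (effect of c eliminated)."""
    den = math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))
    if den == 0:
        raise ValueError("undefined when |r_ac| or |r_bc| equals 1")
    return (r_ab - r_ac * r_bc) / den

# Practice Problem 13: r12 = 0.75, r13 = 0.80, r23 = 0.70, find r13.2.
r13_2 = partial_corr(0.80, 0.75, 0.70)   # a = 1, b = 3, c = 2
print(round(r13_2, 3))

# Practice Problem 17: r12 = 0.6, r13 = 0.5, r23 = 0.45.
r12_3 = partial_corr(0.6, 0.5, 0.45)
r23_1 = partial_corr(0.45, 0.6, 0.5)
print(round(r12_3, 2), round(r23_1, 2))  # 0.48 0.22
```

The first value comes out close to the 0.5823 stated for Practice Problem 13; the small difference in the last digit is consistent with rounding in hand calculation.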

77 / 104
Problem 1

78 / 104
Problem 1 Contd.

79 / 104
Problem 1 Contd.

80 / 104
Multiple Correlation

▶ When the value of a variable is influenced by another variable,
the relationship between them is a simple correlation. In a real
life situation, a variable may be influenced by many other
variables.
▶ For example, the sales achieved for a product may depend on
the income of the consumers, the price, the quality of the
product, sales promotion techniques, the channels of
distribution, etc.
▶ In this case, we have to consider the joint influence of several
independent variables on the dependent variable.
▶ The multiple correlation coefficients are denoted by the
letter R. The dependent variable is denoted by X1 . The
independent variables are denoted by X2 , X3 , X4 , . . ..

81 / 104
Multiple Correlation

▶ R1.23 denotes the multiple correlation of the dependent
variable X1 with two independent variables X2 and X3. It is a
measure of the relationship that X1 has with X2 and X3.
▶ R2.13 is the multiple correlation of the dependent variable X2
with two independent variables X1 and X3 .
▶ R3.12 is the multiple correlation of the dependent variable X3
with two independent variables X1 and X2 .

82 / 104
Formula

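\[ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}} \]

\[ R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{13}^2}} \]

\[ R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{12}^2}} \]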
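All three formulas share one pattern: the two simple correlations involving the dependent variable appear in the numerator, and the correlation between the two independent variables appears in the denominator. A minimal sketch under that reading (the function name `multiple_corr` is hypothetical; the inputs are from Practice Problem 14, stated answer R2.13 = 0.8552):

```python
import math

def multiple_corr(r_ab, r_ac, r_bc):
    """Multiple correlation R_a.bc of variable a with variables b and c."""
    if abs(r_bc) >= 1:
        raise ValueError("undefined when the two predictors are perfectly correlated")
    num = r_ab ** 2 + r_ac ** 2 - 2 * r_ab * r_ac * r_bc
    return math.sqrt(num / (1 - r_bc ** 2))

# Practice Problem 14: r12 = 0.7, r23 = 0.85, r13 = 0.75.
r12, r23, r13 = 0.7, 0.85, 0.75
R2_13 = multiple_corr(r12, r23, r13)   # a = 2, b = 1, c = 3
print(round(R2_13, 4))  # 0.8552
```

R1.23 and R3.12 follow from `multiple_corr(r12, r13, r23)` and `multiple_corr(r13, r23, r12)` respectively.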

83 / 104
Problem 1

84 / 104
Problem 1 Contd.

85 / 104
Problem 1 Contd.

86 / 104
Multi-linear Regression

In simple linear regression, there is only one independent variable
and one dependent variable. In multiple regression, there is a set
of independent variables that helps us to better explain or predict
the dependent variable Y.

The multiple regression equation is given by

\[ Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_k X_k \]

where X1, X2, ..., Xk are the k independent variables, Y is the
dependent variable, a is the intercept and b1, b2, ..., bk are the
slopes (regression coefficients).

87 / 104
Steps to follow

1. Calculate X1², X2², X1Y, X2Y and X1X2 for each observation.

2. Calculate the regression sums.

3. Calculate b0 (the intercept, written a in the multiple regression
equation), b1, and b2.

4. Place b0, b1, and b2 in the estimated linear regression equation.
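The four steps can be sketched for the two-predictor case as follows. The slides do not spell out the regression sums, so the code assumes the usual mean-corrected sums, e.g. Σx1² = ΣX1² − (ΣX1)²/n and Σx1y = ΣX1Y − (ΣX1)(ΣY)/n. The function name is hypothetical, and the data are from Problem 2 of this section, whose stated answer is Y = 16.48 + 0.39X1 − 0.62X2.

```python
def fit_two_predictors(y, x1, x2):
    """Least-squares fit of Y = b0 + b1*X1 + b2*X2; returns (b0, b1, b2)."""
    n = len(y)
    if not (n == len(x1) == len(x2)) or n < 3:
        raise ValueError("need three equal-length samples with at least 3 points")
    # Step 1: raw sums of squares and cross products.
    s_y, s_1, s_2 = sum(y), sum(x1), sum(x2)
    s_11 = sum(a * a for a in x1)
    s_22 = sum(a * a for a in x2)
    s_1y = sum(a * b for a, b in zip(x1, y))
    s_2y = sum(a * b for a, b in zip(x2, y))
    s_12 = sum(a * b for a, b in zip(x1, x2))
    # Step 2: regression sums (mean-corrected, an assumed convention).
    ss_11 = s_11 - s_1 ** 2 / n
    ss_22 = s_22 - s_2 ** 2 / n
    ss_1y = s_1y - s_1 * s_y / n
    ss_2y = s_2y - s_2 * s_y / n
    ss_12 = s_12 - s_1 * s_2 / n
    # Step 3: slopes by Cramer's rule, then the intercept from the means.
    det = ss_11 * ss_22 - ss_12 ** 2
    if det == 0:
        raise ValueError("X1 and X2 are perfectly collinear")
    b1 = (ss_22 * ss_1y - ss_12 * ss_2y) / det
    b2 = (ss_11 * ss_2y - ss_12 * ss_1y) / det
    b0 = s_y / n - b1 * s_1 / n - b2 * s_2 / n
    return b0, b1, b2

Y = [4, 6, 7, 9, 13, 15]
X1 = [15, 12, 8, 6, 4, 3]
X2 = [30, 24, 20, 14, 10, 4]
b0, b1, b2 = fit_two_predictors(Y, X1, X2)
# Step 4: the estimated equation is Y = b0 + b1*X1 + b2*X2.
print([round(v, 2) for v in (b0, b1, b2)])  # [16.48, 0.39, -0.62]
```

For more than two predictors the same idea generalizes to solving a larger linear system, which is usually delegated to a linear algebra library rather than done by hand.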

88 / 104
Problem 1

89 / 104
Problem 1 Contd.

90 / 104
Problem 1 Contd.

91 / 104
Problem 1 Contd.

92 / 104
Problem 1 Contd.

93 / 104
Problem 1 Contd.

94 / 104
Problem 1 Contd.

95 / 104
Problem 2

Find the regression line Y on X1 and X2 .

Y X1 X2
4 15 30
6 12 24
7 8 20
9 6 14
13 4 10
15 3 4

Ans: Y = 16.48 + 0.39X1 − 0.62X2

96 / 104
Practice Problems I
1. Find the correlation co-efficient for the following data.
Sales 15 18 25 27 30 35
Advertising Expense 50 65 82 95 110 120
Ans: r = 0.99
2. Find the correlation co-efficient for the following data.
x 65 66 67 67 68 69 70 72
y 67 68 65 68 72 72 69 71
Ans: r = 0.6030
3. A computer, while calculating r_xy from 25 pairs of
observations, obtained the following constants: n = 25,
Σx = 125, Σx² = 650, Σy = 100, Σy² = 460 and Σxy = 508.
A recheck showed that two pairs of values (6, 14), (8, 6)
were wrong while the correct values were (8, 12), (6, 8).
Obtain the correct value of correlation co-efficient.
Ans: r = 0.6670
97 / 104
Practice Problems II
4. The marks secured by the recruits in the selection test X and
in the proficiency test Y are given below. Calculate the rank
correlation co-efficient.
S. No. 1 2 3 4 5 6 7 8 9
X 10 15 12 17 13 16 24 14 22
Y 30 42 45 46 33 34 40 35 39
Ans: ρ = 0.4
5. Calculate rank correlation coefficient from the following data:

Ans: 0.96
6. Calculate rank correlation coefficient from the following data:

Expenditure on Ads. 10 15 14 25 14 14 20 22
Profit 6 25 12 18 25 40 10 7
98 / 104
Practice Problems III
Ans: -0.024
7. Calculate the rank co-efficient of correlation for the following
data:
X 68 64 75 50 64 80 75 40 55 64
Y 62 58 68 45 81 60 68 48 50 70
Ans: ρ = 0.5450
8. The ranking of 10 students in two subjects, maths and
physics, are as follows:
Maths 3 5 8 4 7 10 2 1 6 9
Physics 6 4 9 8 1 2 3 10 5 7
Ans: r = −0.2970
9. Find the correlation co-efficient and the equation of the
regression lines for the following data
X 1 2 3 4 5
Y 2 5 3 8 7
99 / 104
Practice Problems IV
Ans: r = 0.8062, y = 1.3x + 1.1, x = 0.5y + 0.5.
10. Marks obtained by ten students in Mathematics X and
Statistics Y are given below. Find the two regression lines.
Also, find y when x = 55.
X 60 34 40 50 45 40 22 43 42 64
Y 75 32 33 40 45 33 12 30 34 51
Ans: Y = 1.1865X − 13.7060, X = 0.6414Y + 19.3061;
Y = 51.55 when X = 55.
11. In a correlation analysis the equations of the two regression
lines are 3x + 12y = 19 and 3y + 9x = 46. Find
i.) Correlation Co-efficient. Ans: r = −1/(2√3)
ii.) Mean values of X and Y. Ans: x̄ = 5, ȳ = 1/3.
12. For the following data, find the most likely price at Madras
corresponding to the price 70 at Bombay and that at Bombay
corresponding to the price 68 at Madras. S. D. of the
difference between the prices at Madras and Bombay is 3.1.
100 / 104
Practice Problems V
Madras Bombay
Average Price 65 67
S. D. of Price 0.5 3.5
Ans: For the price 68 at Madras, the most likely price at
Bombay is 84.43.
Ans: For the price 70 at Bombay, the most likely price at
Madras is 65.36.
13. If r12 = 0.75, r13 = 0.80, r23 = 0.70. Find the partial
correlation r13.2 . Ans: r13.2 = 0.5823
14. Given that r12 = 0.7, r32 = 0.85, r31 = 0.75. Determine R2.13 .
Ans: R2.13 = 0.8552
15. Determine all the multiple correlation co-efficients for the
following data:
X1 Number of Students 35 45 60 64
X2 Marks Obtained 60 72 68 80
X3 Number of Activity 4 3 7 5
101 / 104
Practice Problems VI
Ans: r12 = 0.77435, r13 = 0.66799, r23 = 0.04688;R1.23 =
0.9997, R2.13 = 0.9995, R3.12 = 0.9994.
16. Determine the multiple correlation co-efficient R1.23 for the
following data:
X1 2 5 7 11
X2 3 6 10 12
X3 1 3 6 10
Ans: r12 = 0.9692, r13 = 0.9922, r23 = 0.9713;
R1.23 = 0.9937.
17. Given that r12 = 0.6, r32 = 0.45, r31 = 0.5. Determine all the
partial correlation co-efficient. Ans:
r12.3 = 0.48, r13.2 = 0.32, r23.1 = 0.22.
18. Determine all the partial correlation co-efficient from the
following data:

102 / 104
Practice Problems VII
X1 20 15 25 26 28 40 38
X2 12 13 16 15 23 15 28
X3 13 15 12 16 14 18 14
Ans: r12 = 0.59, r13 = 0.59, r23 = −0.18, r12.3 =
0.88, r13.2 = 0.88, r23.1 = −0.81.
19. Determine the R1.23 , R2.13 and R3.12 for the data given in 18.
20. Predict the value of Y for subject 6 from the given dataset
that contains values for X1 , X2 , and Y by using a Multiple
Regression Model.
Subject Y X1 X2
S1 −3.7 3 8
S2 3.5 4 5
S3 2.5 5 7
S4 11.5 6 3
S5 5.7 2 1
S6 ? 3 2
103 / 104
Thank You

104 / 104
