Correlation and Regression
\[ = \frac{hk \sum (u_i - \bar{u})(v_i - \bar{v})}{hk \sqrt{\sum (u_i - \bar{u})^2 \cdot \sum (v_i - \bar{v})^2}} = r_{uv} \]
∴ $r_{xy} = r_{uv}$
Thus, $r$ is independent of change of origin $(a, b)$ and change of scale $(h, k)$.
Remark: The above theorem can also be stated as: "If $x = au + b$, $y = cv + d$, where
$a, b, c, d$ are constants, then $r_{xy} = r_{uv}$."
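The invariance can be checked numerically. The sketch below is only an illustration: the data values and the helper name `pearson_r` are assumptions made here, and the scales h and k are taken to be positive (with scales of opposite sign the coefficient would change sign).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two paired samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data (not taken from the text).
x = [100, 200, 300, 400, 500]
y = [32, 41, 48, 63, 70]

# Change of origin (a, b) and change of scale (h, k), with h, k > 0.
a, h = 300, 100
b, k = 50, 10
u = [(xi - a) / h for xi in x]
v = [(yi - b) / k for yi in y]

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)
print(r_xy, r_uv)  # the two values agree up to rounding error
```

Working with u and v keeps the arithmetic small, which is why the shifted variables are used in the hand computations.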
2. If x and y are two correlated variables with the same standard deviation and
having correlation coefficient r, show that the correlation coefficient between x
and x + y is $\sqrt{\frac{1 + r}{2}}$.
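Solution: By definition,
\[ r(x, x + y) = \frac{\mathrm{cov}(x, x + y)}{\sigma_x \, \sigma_{x+y}} \qquad (1) \]
Now,
\[ \mathrm{cov}(x, x + y) = \frac{1}{n} \sum (x_i - \bar{x})(x_i + y_i - \bar{x} - \bar{y}) \]
\[ = \frac{1}{n} \sum (x_i - \bar{x})\left[(x_i - \bar{x}) + (y_i - \bar{y})\right] \]
\[ = \frac{1}{n} \sum (x_i - \bar{x})^2 + \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y}) \]
\[ = \sigma_x^2 + \mathrm{cov}(x, y) \]
But $r = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$, ∴ $\mathrm{cov}(x, y) = r \sigma_x \sigma_y$
\[ \therefore \mathrm{cov}(x, x + y) = \sigma_x^2 + r \sigma_x \sigma_y = \sigma^2 + r \sigma^2 = \sigma^2 (1 + r) \qquad (\because \sigma_x = \sigma_y = \sigma, \text{ say, by data}) \qquad (2) \]
Now to find $\sigma_{x+y}$, let $z = x + y$.
∴ $E(z) = E(x) + E(y)$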
2 | APPLIED MATHEMATICS - IV
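\[ \sigma_z^2 = E[z - E(z)]^2 = E[(x + y) - (E(x) + E(y))]^2 \]
\[ = E[(x - E(x)) + (y - E(y))]^2 \]
\[ = E[x - E(x)]^2 + E[y - E(y)]^2 + 2E[(x - E(x))(y - E(y))] \]
\[ \therefore \sigma_{x+y}^2 = \sigma_x^2 + \sigma_y^2 + 2\,\mathrm{cov}(x, y) \]
But $r = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$, ∴ $\mathrm{cov}(x, y) = r \sigma_x \sigma_y$
\[ \therefore \sigma_{x+y}^2 = \sigma_x^2 + \sigma_y^2 + 2r \sigma_x \sigma_y = 2\sigma^2 + 2r\sigma^2 = 2\sigma^2 (1 + r) \qquad (\because \sigma_x = \sigma_y = \sigma) \qquad (3) \]
Putting the values from (2) and (3) in (1),
\[ r(x, x + y) = \frac{\sigma^2 (1 + r)}{\sigma \sqrt{2\sigma^2 (1 + r)}} = \frac{\sigma^2 (1 + r)}{\sigma^2 \sqrt{2(1 + r)}} = \sqrt{\frac{1 + r}{2}} \]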
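The result can be verified on a small data set. In the sketch below the numbers are assumptions chosen for illustration: y is a rearrangement of x, so the two variables have the same standard deviation, as the problem requires. The helper name `pearson_r` is also introduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two paired samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data: y is a permutation of x, hence sigma_x = sigma_y.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
z = [xi + yi for xi, yi in zip(x, y)]   # z = x + y

r = pearson_r(x, y)
lhs = pearson_r(x, z)            # correlation between x and x + y
rhs = math.sqrt((1 + r) / 2)     # value claimed by the result
print(r, lhs, rhs)
```

If the standard deviations differ, the two printed values on the right no longer agree, which shows where the assumption $\sigma_x = \sigma_y$ enters.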
4. Two random variables x and y are jointly normally distributed and U and V are
defined by $U = x \cos a + y \sin a$, $V = y \cos a - x \sin a$.
Show that U and V will be uncorrelated if
\[ \tan 2a = \frac{2 r \sigma_x \sigma_y}{\sigma_x^2 - \sigma_y^2} \]
Solution: $V - E(V) = \cos a \, [y - E(y)] - \sin a \, [x - E(x)]$
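One way the argument can be carried through is sketched below. The shorthand $X = x - E(x)$ and $Y = y - E(y)$ is introduced here only for brevity and is not part of the original notation.

```latex
\begin{aligned}
U - E(U) &= X\cos a + Y\sin a, \qquad V - E(V) = Y\cos a - X\sin a \\
\mathrm{cov}(U, V) &= E\big[(X\cos a + Y\sin a)(Y\cos a - X\sin a)\big] \\
&= (\cos^2 a - \sin^2 a)\,E(XY) - \sin a\cos a\,\big[E(X^2) - E(Y^2)\big] \\
&= \cos 2a \; r\sigma_x\sigma_y - \tfrac{1}{2}\sin 2a\,(\sigma_x^2 - \sigma_y^2)
\end{aligned}
```

U and V are uncorrelated when $\mathrm{cov}(U, V) = 0$, that is, when $\tan 2a = \frac{2 r \sigma_x \sigma_y}{\sigma_x^2 - \sigma_y^2}$.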
X: 23 27 28 29 30 31 33 35 36 39
Y: 18 22 23 24 25 26 28 29 30 32
Solution:
x     y     x²     y²     xy
23    18    529    324    414
\[ r = \frac{81710 - 79927}{\sqrt{(98750 - 96721)(67630 - 66049)}} \]
\[ r = \frac{1783}{1791.047} = 0.9955 \]
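The sums used in this computation can be reproduced directly from the X and Y values listed for this problem. The sketch below uses the raw-sum form of the formula; the variable names are chosen here for illustration.

```python
import math

X = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]
Y = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))

# r = (n*Sxy - Sx*Sy) / sqrt((n*Sx2 - Sx^2) * (n*Sy2 - Sy^2))
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(numerator, round(denominator, 3), round(r, 4))
```

The printed numerator and denominator can be compared term by term with the hand computation.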
6. Calculate the correlation coefficient from the following data:
X      Y     u = (x − 300)/100    v = (y − 50)/10    u²    v²    uv
100    30    −2                   −2                 4     4     4
200    40    −1                   −1                 1     1     1
300    50    0                    0                  0     0     0
400    60    1                    1                  1     1     1
500    70    2                    2                  4     4     4
Total        0                    0                  10    10    10
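$n = 5$
\[ r_{x,y} = r_{u,v} = \frac{n \sum uv - \sum u \sum v}{\sqrt{\left[n \sum u^2 - (\sum u)^2\right]\left[n \sum v^2 - (\sum v)^2\right]}} \]
\[ = \frac{5(10) - (0)(0)}{\sqrt{\left[5(10) - 0^2\right]\left[5(10) - 0^2\right]}} = 1 \]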
X: 18 20 34 52 12
Y: 39 23 35 18 46
Solution: Let R be Spearman's rank correlation coefficient of x and y.
We have,
\[ R = 1 - \frac{6 \sum d_i^2}{N^3 - N} \]
where N = number of observations; here N = 5.
Consider the following table.
X     Y     $R_x$   $R_y$   $d_i = R_x - R_y$   $d_i^2$
18    39    4       2       2                   4
20    23    3       4       −1                  1
34    35    2       3       −1                  1
52    18    1       5       −4                  16
12    46    5       1       4                   16
Total                                           38
We have,
\[ R = 1 - \frac{6(38)}{5^3 - 5} = 1 - \frac{6(38)}{125 - 5} = 1 - \frac{228}{120} \]
R = −0.9
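The ranking and the formula can be sketched in a few lines. The helper names below are hypothetical, rank 1 is given to the largest value as in the table, and the code assumes there are no tied values.

```python
def ranks_desc(values):
    """Rank 1 = largest value. Assumes all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """R = 1 - 6 * sum(d^2) / (N^3 - N), valid when there are no ties."""
    n = len(xs)
    rx, ry = ranks_desc(xs), ranks_desc(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n ** 3 - n)

X = [18, 20, 34, 52, 12]
Y = [39, 23, 35, 18, 46]

R = spearman_no_ties(X, Y)
print(ranks_desc(X), ranks_desc(Y), R)
```

Ranking in ascending order instead would give the same R, because only the differences between the two sets of ranks matter.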
8. Calculate 𝑹 and 𝒓 from the following data.
X: 12 17 22 27 32
Y: 113 119 117 115 121
Interpret your result.
Solution: The values of R and r come out to be equal (both are 0.6).
REGRESSION
1. Definition: Regression is a method of estimating the value of one variable when that
of the other is known and the variables are correlated.
2. Types of lines of regression: There are two types, as follows.
(a) Line of regression of y on x, which is given as
$y = a + bx$
(b) Line of regression of x on y, which is given as
$x = a + by$
3. Methods of obtaining the regression line:
(a) Method of scatter diagram: In this method, we plot a graph in which one
variable is plotted on the X-axis and the other variable is plotted on the Y-axis.
If the variables are perfectly correlated, the plotted points lie on a straight line.
Example: Given the following pairs of values of the variables x and y,
X: 1 2 3 4
Y: 2 2 3 4
plot the points on a graph and draw the line of regression.
(b) Method of least squares: This method is classified into two types, as follows.
(i) By using summation: Suppose we want to derive the regression of y on x, which
is given as
$y = a + bx$
We then have the following two summation equations:
\[ \sum y = an + b \sum x \]
\[ \sum xy = a \sum x + b \sum x^2 \]
Using the given data, we find the required terms and substitute them; solving the two
equations gives the constants a and b. Similarly, the regression of x on y is given as
$x = a + by$
\[ \sum x = an + b \sum y \]
\[ \sum xy = a \sum y + b \sum y^2 \]
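The summation method can be sketched as a short routine that solves the two summation equations for a and b. The function name `fit_line` is hypothetical; the data are the four points of the scatter diagram example.

```python
def fit_line(xs, ys):
    """Solve  sum(y)  = a*n      + b*sum(x)
              sum(xy) = a*sum(x) + b*sum(x^2)
    for the intercept a and slope b of the line y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # eliminate a
    a = (sy - b * sx) / n                           # back-substitute
    return a, b

X = [1, 2, 3, 4]
Y = [2, 2, 3, 4]

a_yx, b_yx = fit_line(X, Y)   # regression of y on x: y = a + b*x
a_xy, b_xy = fit_line(Y, X)   # regression of x on y: swap the roles
print(a_yx, b_yx, a_xy, b_xy)
```

Swapping the arguments is enough to obtain the line of x on y, since its summation equations have the same form with x and y interchanged.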
6. Choice of Regression Line: To estimate the value of one variable when the other is
known, we have to select the appropriate line of regression. Suppose we have two lines
of regression as follows:
$y = a + bx$ (equation 1) and $x = a + by$ (equation 2)
and we want to estimate y for x = 20.
The following procedure is to be followed.
Step (i) Find the slope of the 1st equation and denote it by $b_{yx}$; similarly, find the
coefficient for the 2nd equation and denote it by $b_{xy}$.
Step (ii) Find the value of r from $r^2 = b_{yx} b_{xy}$.
(a) If $r^2$ lies between 0 and 1, the two lines of regression are as assumed, and to
estimate y we should use equation 1.
(b) If $r^2$ does not lie between 0 and 1, the two lines are not as assumed, and to
estimate y we have to reverse the roles of the two equations.
SOLVED PROBLEM
1. The following data give the marks obtained by 5 students in two short
examinations in mathematics.
Marks in the 1st exam (x): 6 2 10 4 8
Marks in the 2nd exam (y): 9 11 5 8 7
Determine the line of regression of y on x.
Solution: The line of regression of y on x is
\[ y = a + bx \qquad (1) \]
The summation equations are
\[ \sum y = an + b \sum x \qquad (2) \]
\[ \sum xy = a \sum x + b \sum x^2 \qquad (3) \]
Sr. no.   x     y     x²     xy
1         6     9     36     54
2         2     11    4      22
3         10    5     100    50
4         4     8     16     32
5         8     7     64     56
Total     30    40    220    214
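A sketch of the remaining arithmetic, assuming the column totals $\sum x = 30$, $\sum y = 40$, $\sum x^2 = 220$, $\sum xy = 214$ computed from the five rows of the table and $n = 5$:

```latex
\begin{aligned}
40 &= 5a + 30b \\
214 &= 30a + 220b \\
\text{From the first equation: } a &= 8 - 6b \\
214 &= 30(8 - 6b) + 220b = 240 + 40b \;\Rightarrow\; b = -0.65 \\
a &= 8 - 6(-0.65) = 11.9 \\
\therefore\; y &= 11.9 - 0.65x
\end{aligned}
```

The negative slope agrees with the data: students with higher marks in the first examination tend to have lower marks in the second.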
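4. State true or false with justification: If the two lines of regression are
$x + 3y - 5 = 0$ and $4x + 3y - 8 = 0$, then the correlation coefficient is $+0.5$.
Solution: $x + 3y - 5 = 0$
∴ $3y = 5 - x$
\[ \therefore y = -\frac{1}{3}x + \frac{5}{3} \qquad (1) \]
And $4x + 3y - 8 = 0$
∴ $3y = -4x + 8$
\[ \therefore y = -\frac{4}{3}x + \frac{8}{3} \qquad (2) \]
Let $b_1 = -\frac{1}{3}$ and $b_2 = -\frac{4}{3}$.
Since $|b_1| < |b_2|$, $b_{yx} = b_1 = -\frac{1}{3}$ and $b_{xy} = \frac{1}{b_2} = -\frac{3}{4}$.
Hence equation (1) is the regression equation of y on x and equation (2) is the regression
equation of x on y.
\[ \therefore r = \pm\sqrt{b_{yx} b_{xy}} = \pm\sqrt{\left(-\frac{1}{3}\right)\left(-\frac{3}{4}\right)} = \pm\sqrt{\frac{1}{4}} = \pm\frac{1}{2} = \pm 0.5 \]
Since $b_{yx}$ and $b_{xy}$ are both negative, r takes the negative sign. Hence $r = -0.5$
and the statement is false.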
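The selection of the two lines and the sign of r can be sketched as a small routine. The function name and the convention of passing each line as the coefficients (A, B, C) of Ax + By + C = 0 are assumptions made for this illustration.

```python
import math

def correlation_from_lines(line1, line2):
    """Each line is (A, B, C), meaning A*x + B*y + C = 0.
    Tentatively treat line1 as y on x and line2 as x on y;
    if that gives b_yx * b_xy > 1 (impossible, since r^2 <= 1), swap them."""
    a1, b1, _ = line1
    a2, b2, _ = line2
    b_yx = -a1 / b1          # slope of y on x from line1
    b_xy = -b2 / a2          # coefficient of x on y from line2
    if b_yx * b_xy > 1:
        b_yx = -a2 / b2      # line2 is y on x
        b_xy = -b1 / a1      # line1 is x on y
    # r carries the common sign of the two regression coefficients.
    r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
    return b_yx, b_xy, r

# Lines x + 3y - 5 = 0 and 4x + 3y - 8 = 0.
b_yx, b_xy, r = correlation_from_lines((1, 3, -5), (4, 3, -8))
print(b_yx, b_xy, r)
```

Passing the two lines in the opposite order leads to the same result, because the routine swaps the roles whenever the product of the coefficients exceeds 1.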
5. Obtain the equation of the line of regression of cost on age from the following table
giving the age of cars of a certain make and the annual maintenance cost. Also
find the maintenance cost if the age of the car is 9 years.
X    Y      u = X − 5    v = Y − 8    u²    uv
2    5      −3           −3           9     9
4    7      −1           −1           1     1
6    8.5    1            0.5          1     0.5
8    11     3            3            9     9
Σ           0            −0.5         20    19.5
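\[ \bar{x} = a + c\bar{u} = a + c\,\frac{\sum u}{n} = 5 + 1 \times \frac{0}{4} = 5 \]
\[ \bar{y} = b + c\bar{v} = b + c\,\frac{\sum v}{n} = 8 + 1 \times \frac{-0.5}{4} = 7.875 \]
\[ b_{yx} = b_{vu} = \frac{n \sum uv - \sum u \sum v}{n \sum u^2 - (\sum u)^2} = \frac{4(19.5) - 0(-0.5)}{4(20) - 0^2} = 0.975 \]
∴ The regression equation of y on x is
$y - \bar{y} = b_{yx}(x - \bar{x})$
∴ $y - 7.875 = 0.975(x - 5)$
$y = 0.975x + 3$
When x is 9 years,
∴ $y = 0.975(9) + 3 = 11.775$
∴ Maintenance cost for a car 9 years old is 11.775 × 1000 = 11775 units.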
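The computation can be retraced with the shifted variables u = x − 5 and v = y − 8 used in the table. The sketch below is illustrative; the helper name `predict` is introduced here.

```python
# Age (x) and maintenance cost (y) as listed in the solution table.
x = [2, 4, 6, 8]
y = [5, 7, 8.5, 11]

a, b = 5, 8                      # origins used in the table: u = x - 5, v = y - 8
u = [xi - a for xi in x]
v = [yi - b for yi in y]
n = len(x)

su, sv = sum(u), sum(v)
suu = sum(ui * ui for ui in u)
suv = sum(ui * vi for ui, vi in zip(u, v))

# The regression coefficient is unchanged by a change of origin: b_yx = b_vu.
b_yx = (n * suv - su * sv) / (n * suu - su ** 2)
x_bar = a + su / n
y_bar = b + sv / n

def predict(x_new):
    """Estimate y from the line y - y_bar = b_yx * (x - x_bar)."""
    return y_bar + b_yx * (x_new - x_bar)

print(b_yx, x_bar, y_bar, predict(9))
```

Shifting the origin only simplifies the sums; the slope and the fitted line are the same as those obtained from the raw x and y values.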
6. It is given that the means of x and y are 5 and 10. If the line of regression of y on x
is parallel to the line 𝟐𝟎𝒚 = 𝟗𝒙 + 𝟒𝟎, estimate the value of y for 𝒙 = 𝟑𝟎.
Solution: Given that the means of x and y are 5 and 10,
∴ $\bar{x} = 5$; $\bar{y} = 10$
The given line is $20y = 9x + 40$
\[ \therefore y = \frac{9}{20}x + \frac{40}{20} \]
Slope of the above line: $m_1 = \frac{9}{20}$
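The remaining steps can be sketched as follows, using the facts that parallel lines have equal slopes and that the line of regression of y on x passes through $(\bar{x}, \bar{y})$:

```latex
\begin{aligned}
b_{yx} &= m_1 = \frac{9}{20} \\
y - \bar{y} &= b_{yx}(x - \bar{x}) \;\Rightarrow\; y - 10 = \frac{9}{20}(x - 5) \\
\text{For } x = 30:\quad y &= 10 + \frac{9}{20}(25) = 21.25
\end{aligned}
```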
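7. The regression lines of a sample are $x + 6y = 6$ and $3x + 2y = 10$. Find (i) the means
of x and y and (ii) the coefficient of correlation between x and y.
Solution: (i) The given regression lines are
$x + 6y - 6 = 0$; $3x + 2y - 10 = 0$
Since they pass through the point $(\bar{x}, \bar{y})$, we get
\[ \bar{x} + 6\bar{y} = 6 \qquad (1) \]
\[ 3\bar{x} + 2\bar{y} = 10 \qquad (2) \]
3 × Equation (1) − Equation (2):
$3\bar{x} + 18\bar{y} = 18$
$3\bar{x} + 2\bar{y} = 10$
Subtracting,
$16\bar{y} = 8$
\[ \bar{y} = \frac{8}{16} = \frac{1}{2} \]
Put $\bar{y} = \frac{1}{2}$ in equation (1):
$\bar{x} + 6 \times \frac{1}{2} = 6$
$\bar{x} + 3 = 6$
$\bar{x} = 3$
The means of x and y are $\bar{x} = 3$ and $\bar{y} = \frac{1}{2}$.
(ii) If equation (1) is taken as the line of x on y and equation (2) as the line of y on x,
then $b_{xy} = -6$ and $b_{yx} = -\frac{3}{2}$, so that $b_{xy} b_{yx} = 9 > 1$, which is impossible
because $r^2 \le 1$. Hence equation (1) is the line of y on x, $y = 1 - \frac{1}{6}x$, giving
$b_{yx} = -\frac{1}{6}$, and equation (2) is the line of x on y, $x = \frac{10}{3} - \frac{2}{3}y$, giving
$b_{xy} = -\frac{2}{3}$.
\[ r = -\sqrt{b_{xy} b_{yx}} = -\sqrt{\left(-\frac{2}{3}\right)\left(-\frac{1}{6}\right)} = -\sqrt{\frac{1}{9}} = -\frac{1}{3} \]
The negative sign is taken because both regression coefficients are negative.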
PROBLEMS
1. You are given the following data:
                      X     Y
Arithmetic mean       36    85
Standard deviation    11    8
X: 1 2 3 4 5 6 7 8 9
Y: 2 6 7 8 10 11 11 10 9
11. The following marks have been obtained by a class of students in Stats (out of 100):
Paper I 45 55 56 58 60 65 68 70 75 80 85
Paper II 56 50 48 60 62 64 65 70 74 82 90
x 1 2 3 4 5 6
y 49 54 60 73 80 86
13. The following data give the growth of employment (in lakhs) in the organized sector in
India between 1988 and 1995:
Find the correlation coefficient between the employment in the public sector and the
private sector and give your comments.
14. Two Judges in a beauty contest gave the following marks out of 50 to 9 contestants:
Judge A 20 25 22 27 23 26 34 24 32
Judge B 30 42 45 46 33 34 40 35 39
Do the two judges appear to agree in their standards? When would the agreement be
complete?
15. Show that the second degree curve fitting the following data is given by
$v = 3 + 0.85u - 0.27u^2$, where $u = x - 5$, $v = y - 7$. Also find y when x = 10.
x 1 2 3 4 5 6 7 8 9
y 2 6 7 8 10 11 11 10 9
16. From the following data find the equation of line of regression of y on x and
estimate the most probable value of y when x = 80.
x 89 86 74 65 64 64 66 67 72 79
y 92 91 84 75 73 72 71 75 78 84
17. While calculating the correlation coefficient between x and y, the following constants
were obtained: $N = 25$, $\sum x = 125$, $\sum y = 100$, $\sum x^2 = 650$, $\sum y^2 = 460$, $\sum xy = 508$.
It was later discovered that two pairs had been recorded as x = 6, y = 14 and x = 8, y = 6,
while the correct values were x = 8, y = 12 and x = 6, y = 8. Calculate the correct
correlation coefficient.
18. The following table shows the heights of a sample of 12 fathers and their sons. Find
the rank correlation coefficient.
x 65 63 67 64 68 62 70 66 68 67 69 71
y 68 66 68 65 69 66 68 65 71 67 68 80
19. The following marks have been obtained by a class of students in stats (out of 100):
Paper I 45 55 56 58 60 65 68 70 75 80 85
Paper II 56 50 48 60 62 64 65 70 74 82 90
Compute the coefficient of correlation for the above data. Find also the equations of
lines of regression.
20. If $x = au + b$, $y = cv + d$, where $a, b, c, d$ are constants, then prove that $r_{xy} = r_{uv}$,
where $r_{xy}$ is the coefficient of correlation between x and y.
21. Fit a second degree curve for the following data:
x 1 2 3 4 5
y 1250 1400 1650 1950 2300
24. Fit a second degree parabola to following data and estimate the value of y for x = 6
X: 1 2 3 4 5
Y: 25 28 33 39 46
25. The following data give the age of cars of a certain make and the annual maintenance
cost. Obtain the equation of the line of regression of cost on age.
Age (year): 2 4 6 8
Maintenance cost: 10 20 25 30
26. Find the equation of the line of regression for the following data.
X: 5 6 7 8 9 10 11
Y: 11 14 14 15 12 17 16
Also find r.
27. Find the two equations of the lines of regression and estimate the value of y for x = 7 from
the following data.
X: 0 1 2 3 4 5 6
Y: 5 9 8 10 11 9 11
28. Given the following results for the weight and height of 10000 students:
𝑥 = 150𝑦 = 68 inch𝜎𝑥 = 𝜎𝑥𝜎𝑦 = 𝜎𝑦 r = 0.6
29. The following results regarding 100 college students are given.
30. The two lines of regression are 6y = 5x + 90 and 15x = 8y + 130. Estimate y for
x = 60.
31. The regression lines of a sample are x + 6y = 6 and 3x + 2y = 10. Find y when x = 12.
32. Find the angle between the lines of regression using the following data:
N=10 𝑥 = 270 𝑦 = 630 ,𝜎𝑦 = 5, 𝑟𝑥𝑦 = 0.
33. If $\sigma_x = \sigma_y = \sigma$ and the angle between the lines of regression is $\tan^{-1}$ . Find
the coefficient of regression.
34. The equations of the two lines of regression are x = 19.13 − 0.874 and y = 11.64 − 0.5x. Find
(a) the means of x and y,
(b) the coefficient of correlation between x and y.
35. Calculate the rank correlation coefficient from the following data.
Rank in English ∶ 1, 3, 7, 5, 4, 6, 2, 10, 9, 8
Rank in Statistics: 3, 1, 4, 5, 6, 9, 7, 8, 10, 2
Ans:
\[ R = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right]}{n^3 - n} \]
36. Obtain the rank correlation coefficient from the following data.
𝑋 ∶ 10, 12, 18, 18, 15, 40.
𝑌 ∶ 12, 18, 25, 25, 50, 25.
Ans: 1 − 0.4571 = 0.5429
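The stated answer can be checked with the tie-corrected formula, in which each group of m repeated values adds $\frac{1}{12}(m^3 - m)$ to $\sum d^2$. The sketch below uses hypothetical helper names, gives rank 1 to the largest value, and assigns tied values the average of the ranks they occupy.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest; tied values share the average of their positions."""
    order = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = order.index(v) + 1        # first position occupied by v
        m = order.count(v)                # number of times v repeats
        rank_of[v] = first + (m - 1) / 2  # average of first .. first + m - 1
    return [rank_of[v] for v in values]

def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    cf = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (d2 + cf) / (n ** 3 - n)

X = [10, 12, 18, 18, 15, 40]
Y = [12, 18, 25, 25, 50, 25]

R = spearman_with_ties(X, Y)
print(average_ranks(X), average_ranks(Y), round(R, 4))
```

Here 18 occurs twice in X and 25 occurs three times in Y, so the correction terms are 0.5 and 2 respectively.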
37. (a) Let $r_{xy} = 0.4$, $\mathrm{cov}(x, y) = 1.6$, $\sigma_y^2 = 25$. Find $\sigma_x$.
(b) If $R_{xy} = 0.143$ and the sum of the squares of the differences between the ranks
is 48, find N.
Ans: N = 7; the other roots of N are imaginary.