Unit 1 Correlation, Regression and Curve Fitting 2024-25
CORRELATION
Correlation is the relationship that exists between two or more variables. Two variables are
said to be correlated if a change in one variable is accompanied by a change in the other variable.
EXAMPLE:
1. Relationship between heights and weights.
2. Relationship between price and demand of commodity.
3. Relationship between rainfall and yield of crops.
Types Of Correlations
1. Positive and Negative correlations.
2. Simple and Multiple correlations.
3. Partial and Total correlations.
4. Linear and Non-linear correlations.
2. Multiple Correlation: -
When more than two variables are studied, the relationship is described as multiple
correlation, e.g., relationship of price, demand, and supply of a commodity.
2. Nonlinear Correlation:
If the ratio of change between two variables is not constant, the correlation is said to be
nonlinear. The graph of a nonlinear or curvilinear relationship will be a curve.
X 15 22 25 30 35 40
Y 4 5 8 9 10 12
Methods of studying correlation:
There are two different methods.
1. Graphic methods.
2. Mathematical methods.
Graphic methods:
1. Scatter diagram.
2. Simple graph.
Mathematical methods:
1. Karl Pearson’s coefficient of correlation.
2. Spearman’s rank coefficient of correlation.
Scatter diagram:
This is a very simple method of studying the relationship between two variables. In this method
one variable is taken on the X-axis and the other variable is taken on the Y-axis, and a point is
plotted for each pair of values, as in the following examples:
Example 1:
After standard deduction from total income, 20% income tax is imposed on the
remaining income. The information regarding the taxable income and the tax to be paid
is given below for five persons.
Person 1 2 3 4 5
Taxable Income (thousand ₹ ) 𝒙 50 30 80 20 100
Income Tax (thousand ₹ ) 𝒚 10 6 16 4 20
Draw a scatter diagram from this information and discuss the correlation.
Solution:
The following scatter diagram is obtained by plotting the points corresponding to the
ordered pairs (50,10), (30,6), (80,16), (20,4) and (100,20) of 𝑥 and 𝑦.
We can see that all the points lie on the same line in the scatter diagram. We can
also see that as the values of variable 𝑋 change, the values of variable 𝑌 also change in
the same direction with a constant proportion. Hence, we can see that there is a perfect
positive correlation between two variables 𝑋 and 𝑌.
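The collinearity claimed above can also be checked numerically. The following is a minimal sketch, using the five (x, y) pairs of Example 1: if the ratio y/x is the same for every person, all points lie on one straight line through the origin. The variable names are illustrative only.

```python
# Data from Example 1: taxable income x and income tax y (thousand rupees).
x = [50, 30, 80, 20, 100]
y = [10, 6, 16, 4, 20]

# Ratio y/x for each person; a constant ratio means every point lies on
# the same straight line through the origin (here the 20% tax rate).
ratios = [yi / xi for xi, yi in zip(x, y)]
constant_ratio = all(abs(r - ratios[0]) < 1e-12 for r in ratios)

print(ratios)
print(constant_ratio)  # True
```

A constant positive ratio corresponds to the perfect positive correlation read off the scatter diagram.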
Example: 2
To know the relation between monthly expenditure and monthly savings for middle class
families, the information regarding expenditure and savings for 5 families is given
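below. (The monthly income of each family is ₹ 20,000)
Family 1 2 3 4 5
Monthly Expenditure (thousand ₹ ) 𝒙 15 18 8 10 12
Monthly Savings (thousand ₹ ) 𝒚 5 2 12 10 8
Draw a scatter diagram indicating the relation between monthly expenditure and
monthly savings from this information and discuss their correlation.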
Solution:
The following scatter diagram is obtained by plotting the points of ordered pairs (15,5),
(18,2), (8,12), (10,10), (12,8) of 𝑋 and 𝑌 on the graph paper.
We can see that all the points lie on the same line in the scatter diagram. We can also see
that as the values of variable 𝑋 change, the values of variable 𝑌 also change in the
opposite direction with a constant proportion. Hence, we can see that there is a perfect
negative correlation between 𝑋 and 𝑌.
Exercise
1. A ball pen making company wants to know the relation between the price (in ₹ ) and
supply (in thousand units) of its most selling Gel Pen. The following information is
collected for it: Draw a scatter diagram and interpret it.
Price (in ₹ ) 14 16 12 11 15 13 17
Monthly Supply 32 50 20 12 45 30 53
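$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum (x - \bar{x})(y - \bar{y})$$

$$\sigma_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}, \qquad \sigma_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n}}$$

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}$$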
Solution:
n=6
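x      y      x²     y²     xy
2      18     4      324    36
4      12     16     144    48
5      10     25     100    50
6      8      36     64     48
8      7      64     49     56
11     5      121    25     55
Σx = 36   Σy = 60   Σx² = 266   Σy² = 706   Σxy = 293

$$r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}\,\sqrt{\sum y^2 - \frac{(\sum y)^2}{n}}} = \frac{293 - \frac{(36)(60)}{6}}{\sqrt{266 - \frac{(36)^2}{6}}\,\sqrt{706 - \frac{(60)^2}{6}}} = -0.9203$$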
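The raw-sums form of the formula used above translates directly into a few lines of code. This is a minimal sketch using the six (x, y) pairs from the table; the helper name `pearson_r` is chosen here for illustration and is not part of the notes.

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's r computed from the raw sums Σx, Σy, Σx², Σy², Σxy."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    numerator = sxy - sx * sy / n
    denominator = sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
    return numerator / denominator

x = [2, 4, 5, 6, 8, 11]
y = [18, 12, 10, 8, 7, 5]
r = pearson_r(x, y)
print(round(r, 4))  # -0.9203
```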
Example : 2
Calculate the correlation coefficient between the following data:
𝑥 5 9 13 17 21
𝑦 12 20 25 33 35
(Summer 2023)
Solution:
𝑛 = 5
$$\bar{x} = \frac{\sum x}{n} = \frac{65}{5} = 13, \qquad \bar{y} = \frac{\sum y}{n} = \frac{125}{5} = 25$$

x      y      (x − x̄)   (y − ȳ)   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
5      12     −8         −13        64          169         104
9      20     −4         −5         16          25          20
13     25     0          0          0           0           0
17     33     4          8          16          64          32
21     35     8          10         64          100         80
Σx = 65   Σy = 125   Σ(x − x̄) = 0   Σ(y − ȳ) = 0   Σ(x − x̄)² = 160   Σ(y − ȳ)² = 358   Σ(x − x̄)(y − ȳ) = 236
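The column totals above are what the deviation form of the formula needs, so the coefficient follows by substitution. The sketch below rebuilds the totals from the data of this example and then evaluates r; the variable names are illustrative.

```python
from math import sqrt

x = [5, 9, 13, 17, 21]
y = [12, 20, 25, 33, 35]
n = len(x)

x_bar = sum(x) / n  # 13.0
y_bar = sum(y) / n  # 25.0

# Sums of squared deviations and of products of deviations.
sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

r = sxy / sqrt(sxx * syy)
print(sxx, syy, sxy)  # 160.0 358.0 236.0
print(round(r, 3))    # 0.986
```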
Demand in quintals 65 66 67 67 68 69 70 72
Price in rupees per kg 67 68 65 68 72 72 69 71
Solution
Let the demand in quintal be denoted by x and the price in rupees per kg be denoted by y.
𝑛 = 8
$$\bar{x} = \frac{\sum x}{n} = \frac{544}{8} = 68, \qquad \bar{y} = \frac{\sum y}{n} = \frac{552}{8} = 69$$

x      y      (x − x̄)   (y − ȳ)   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
65     67     −3         −2         9           4           6
66     68     −2         −1         4           1           2
67     65     −1         −4         1           16          4
67     68     −1         −1         1           1           1
68     72     0          3          0           9           0
69     72     1          3          1           9           3
70     69     2          0          4           0           0
72     71     4          2          16          4           8
Σx = 544   Σy = 552   Σ(x − x̄) = 0   Σ(y − ȳ) = 0   Σ(x − x̄)² = 36   Σ(y − ȳ)² = 44   Σ(x − x̄)(y − ȳ) = 24

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{24}{\sqrt{36}\sqrt{44}} = 0.603$$
Example : 4
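Given $n = 10$, $\sigma_x = 5.4$, $\sigma_y = 6.2$, and the sum of the products of deviations from the
means of 𝑥 and 𝑦 is 66. Find the correlation coefficient.
Solution
$n = 10$, $\sigma_x = 5.4$, $\sigma_y = 6.2$

$$\sum (x - \bar{x})(y - \bar{y}) = 66$$

$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum (x - \bar{x})(y - \bar{y}) = \frac{66}{10} = 6.6$$

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} = \frac{6.6}{5.4 \times 6.2} = 0.197$$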
Exercise:
1. Find the Pearson’s Correlation Coefficient of the following data:
𝑥 100 101 102 102 100 99 97 98 96 95
𝑦 98 99 99 97 95 92 95 94 90 91
2. Calculate Karl Pearson’s coefficient of correlation for the data given below:
𝑥 10 14 18 22 26 30 10
𝑦 18 12 24 6 30 36 18
𝑥 17 19 21 26 20
𝑦 23 27 25 26 27
Rank the data: For each variable, rank the data from lowest to highest, assigning a rank to
each value. If there are ties, assign each tied value the average of the ranks it would have
received if there were no tie.
Calculate the differences between ranks: For each pair of data points, find the difference
between their ranks.
Spearman’s Rank correlation coefficient:
Calculated by the following formula:

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

When ranks are tied, the term $\frac{m(m^2 - 1)}{12}$ is added to $\sum d^2$ for each tie, where 𝑚 is
the number of times an item is repeated.
The value of 𝒓 lies between -1 and 1. A positive value indicates a positive monotonic
relationship, while a negative value indicates a negative monotonic relationship. A value of
0 indicates no monotonic relationship.
Example: 1
Two judges have given ranks to 10 students for their honesty. Find the rank correlation
coefficient of the following data:
1ST Judge 3 5 8 4 7 10 2 1 6 9
2nd Judge 6 4 9 8 1 2 3 10 5 7
Solution
Rank given by 1st judge   Rank given by 2nd judge   Difference in ranks d   d²
3                         6                         −3                      9
5                         4                         1                       1
8                         9                         −1                      1
4                         8                         −4                      16
7                         1                         6                       36
10                        2                         8                       64
2                         3                         −1                      1
1                         10                        −9                      81
6                         5                         1                       1
9                         7                         2                       4
Σd² = 214

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(214)}{10(100 - 1)} = -0.297$$
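Because the judges' data are already ranks with no ties, the coefficient can be computed directly from the two rank lists. The following minimal sketch recomputes Σd² and r from the data given in this example; the variable names are illustrative.

```python
# Ranks given to the 10 students by the two judges (data of this example).
rank_1 = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
rank_2 = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]
n = len(rank_1)

# Sum of squared rank differences, then Spearman's formula (no ties).
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_1, rank_2))
r = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2)       # 214
print(round(r, 3))  # -0.297
```

The negative value indicates that the two judges tend to disagree in their ordering of the students.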
Example: 3
Find the Coefficient of rank correlation of the following data: (Summer 2022-23)
𝑥 35 40 42 43 40 53 54 49 41 55
Solution
𝑛 = 10
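x      y      Rank in x   Rank in y   d = (rank in x) − (rank in y)   d²
35     102    10          1           9                               81
43     98     5           4           1                               1
49     92     4           9           −5                              25
Σd² = 200.50

$$r = 1 - \frac{6\left[\sum d^2 + \frac{m_1(m_1^2 - 1)}{12} + \frac{m_2(m_2^2 - 1)}{12} + \frac{m_3(m_3^2 - 1)}{12} + \frac{m_4(m_4^2 - 1)}{12}\right]}{n(n^2 - 1)}$$

$$= 1 - \frac{6\{200.50 + 0.5 + 0.5 + 0.5 + 0.5\}}{990} = -0.227$$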
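The tie correction used above can be sketched in code. The values n = 10 and Σd² = 200.50 are taken from the solution; the tie sizes m = 2 are an inference from the four correction terms of 0.5 each, since m(m² − 1)/12 = 0.5 only for m = 2. The function name `tie_correction` is hypothetical.

```python
def tie_correction(m):
    """Correction term m(m^2 - 1)/12 for a value repeated m times."""
    return m * (m ** 2 - 1) / 12

n = 10
sum_d2 = 200.50
ties = [2, 2, 2, 2]  # assumed: four ties of size 2, each contributing 0.5

corrected = sum_d2 + sum(tie_correction(m) for m in ties)
r = 1 - 6 * corrected / (n * (n ** 2 - 1))

print(corrected)    # 202.5
print(round(r, 3))  # -0.227
```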
Exercise:
1.Compute Spearman’s rank correlation coefficient from the following data:
𝑥 18 20 34 52 12
𝑦 39 23 35 52 12
2.Obtain the rank correlation coefficient from the following data.
𝑥 10 12 18 18 15 40
𝑦 12 18 25 25 50 25
(Summer 2023-24)
REGRESSION:
By studying correlation, we can establish the existence, degree and direction of the relationship
between two variables, but we cannot answer questions of the type: if there is a certain
amount of change in one variable, what will be the corresponding change in the other
variable? Questions of this type can be answered if we can establish a quantitative
relationship between the two related variables. The statistical tool by which it is possible to
predict or estimate the unknown values of one variable from known values of another
variable is called regression. A line of regression is a straight line.
LINES OF REGRESSION
If the variables, which are highly correlated, are plotted on a graph then the points lie in a
narrow strip. If all the points in the scatter diagram cluster around a straight line, the line is
called the line of regression. The line of regression is the line of best fit and is obtained by
the principle of least squares.
Line of Regression of 𝒚 on 𝒙:
It is the line which gives the best estimate for the values of 𝑦 for any given values of 𝑥.
The regression equation of 𝑦 on 𝑥 is given by
$$(y - \bar{y}) = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x})$$
It is also written as
𝑦 = 𝑎 + 𝑏𝑥
Line of regression of 𝒙 on 𝒚:
It is the line which gives the best estimate for the values of x for any given values of y. The
regression equation for x on y is given by
$$(x - \bar{x}) = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y})$$
It is also written as
𝑥 = 𝑎 + 𝑏𝑦
where $\bar{x}$ and $\bar{y}$ are the means of the 𝑥 series and 𝑦 series respectively, $\sigma_x$ and $\sigma_y$ are the
standard deviations of the 𝑥 series and 𝑦 series respectively, and 𝑟 is the correlation coefficient
between 𝑥 and 𝑦.
REGRESSION COEFFICIENTS
The slope $b_{yx}$ of the line of regression of 𝑦 on 𝑥 is also called the coefficient of regression of
𝑦 on 𝑥. It represents the increment in the value of 𝑦 corresponding to a unit change in the
value of 𝑥.

$$b_{yx} = \text{Regression coefficient of } y \text{ on } x = r\,\frac{\sigma_y}{\sigma_x}$$

Similarly, the slope $b_{xy}$ of the line of regression of 𝑥 on 𝑦 is called the coefficient of regression
of 𝑥 on 𝑦. It represents the increment in the value of 𝑥 corresponding to a unit change in the
value of 𝑦.

$$b_{xy} = \text{Regression coefficient of } x \text{ on } y = r\,\frac{\sigma_x}{\sigma_y}$$
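and

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}$$

(b)

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$

and

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}}$$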
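$$b_{yx} = 0.1$$

$$b_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}} = \frac{1022250 - \frac{(15000)(6800)}{100}}{463025 - \frac{(6800)^2}{100}} = 3.6$$

$$r = \sqrt{b_{yx}\, b_{xy}} = \sqrt{(0.1)(3.6)} = 0.6$$

$$\bar{x} = \frac{\sum x}{n} = \frac{15000}{100} = 150, \qquad \bar{y} = \frac{\sum y}{n} = \frac{6800}{100} = 68$$

The equation of the line of regression of 𝑦 on 𝑥 is:

$$(y - \bar{y}) = b_{yx}(x - \bar{x})$$
$$(y - 68) = 0.1(x - 150)$$
$$y = 0.1x + 53$$

The equation of the line of regression of 𝑥 on 𝑦 is:

$$(x - \bar{x}) = b_{xy}(y - \bar{y})$$
$$x - 150 = 3.6(y - 68)$$
$$x = 3.6y - 94.8$$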
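The arithmetic of this solution can be reproduced from the summary sums. The sketch below takes n, Σx, Σy, Σxy and Σy² from the worked lines above; the value b_yx = 0.1 is taken as given in the solution, because the sum of x² needed to recompute it does not appear in the surviving text. The variable names are illustrative.

```python
from math import sqrt

n = 100
sum_x, sum_y = 15000, 6800
sum_xy, sum_y2 = 1022250, 463025
b_yx = 0.1  # taken as given in the solution (not recomputed here)

# Regression coefficient of x on y from the raw sums.
b_xy = (sum_xy - sum_x * sum_y / n) / (sum_y2 - sum_y ** 2 / n)
r = sqrt(b_yx * b_xy)  # positive root, since both coefficients are positive

x_bar, y_bar = sum_x / n, sum_y / n
intercept_y_on_x = y_bar - b_yx * x_bar  # y = 0.1 x + 53
intercept_x_on_y = x_bar - b_xy * y_bar  # x = 3.6 y - 94.8

print(round(b_xy, 4), round(r, 4))
print(round(intercept_y_on_x, 4), round(intercept_x_on_y, 4))
```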
Example: 2
Find the regression coefficients $b_{yx}$ and $b_{xy}$ and hence find the correlation coefficient
between x and y for the following data:
𝑥 4 2 3 4 2
𝑦 2 3 2 4 4
Solution
𝑛=5
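x      y      x²     y²     xy
4      2      16     4      8
2      3      4      9      6
3      2      9      4      6
4      4      16     16     16
2      4      4      16     8
Σx = 15   Σy = 15   Σx² = 49   Σy² = 49   Σxy = 44

$$b_{yx} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} = \frac{44 - \frac{(15)(15)}{5}}{49 - \frac{(15)^2}{5}} = -0.25$$

$$b_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}} = \frac{44 - \frac{(15)(15)}{5}}{49 - \frac{(15)^2}{5}} = -0.25$$

The correlation coefficient has the same sign as the regression coefficients. Since both are
negative here, the negative square root is taken:

$$r = -\sqrt{b_{yx}\, b_{xy}} = -\sqrt{(-0.25)(-0.25)} = -0.25$$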
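The square root of the product of the regression coefficients gives only the magnitude of r. The sign of r is the common sign of the two coefficients. The following minimal sketch, using the data of this example, makes that step explicit with `math.copysign`; the variable names are illustrative.

```python
from math import sqrt, copysign

x = [4, 2, 3, 4, 2]
y = [2, 3, 2, 4, 4]
n = len(x)

# Corrected sums: Σxy − ΣxΣy/n, Σx² − (Σx)²/n, Σy² − (Σy)²/n.
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
sxx = sum(a * a for a in x) - sum(x) ** 2 / n
syy = sum(b * b for b in y) - sum(y) ** 2 / n

b_yx = sxy / sxx
b_xy = sxy / syy

# r takes the common sign of the two regression coefficients.
r = copysign(sqrt(b_yx * b_xy), b_yx)
print(b_yx, b_xy, r)  # -0.25 -0.25 -0.25
```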
Exercise
1.Find the regression coefficient of y on x for the following data:
𝒙 1 2 3 4 5
𝒚 160 180 140 180 200
2. Find the equation of regression lines from the following data and also estimate 𝑦
for 𝑥 = 1 and 𝑥 for 𝑦 = 4.
𝒙 3 2 -1 6 4 -2 5 7
𝒚 5 13 12 -1 2 20 0 -3
3.Find the equation of regression lines and the correlation coefficient from the following data:
𝒙 28 41 40 38 35 33 46 32 36 33
𝒚 30 34 31 34 30 26 28 31 26 31
4. The following information is obtained for two variables x and y. Find the regression equation of 𝑦
on 𝑥. n = 10; ∑𝑥 = 130; ∑𝑥² = 2288; ∑𝑥𝑦 = 3467.
CURVE FITTING
Curve fitting is the process of finding the ‘best-fit’ curve for a given set of data. It is the
representation of the relationship between two variables by means of an algebraic equation.
On the basis of this mathematical equation, predictions can be made in many statistical problems.
Suppose a set of 𝑛 points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ of the two variables 𝑥 and 𝑦 is
given. These values are plotted on a rectangular coordinate system, i.e., the 𝑥𝑦-plane. The
resulting set of points is known as a scatter diagram (Fig. 5.1). The scatter diagram exhibits the
trend, and it is possible to visualize a smooth curve approximating the data. Such a curve is known
as an approximating curve.
METHODS:
1.Linear Regression: One of the simplest forms of curve fitting, linear regression assumes a
linear relationship between the independent and dependent variables. The goal is to find the best-
fitting line (or hyperplane in higher dimensions) that minimizes the sum of squared differences
between observed and predicted values.(𝑌 = 𝐴𝑋 + 𝐵 𝑜𝑟 𝑋 = 𝐴𝑌 + 𝐵).
$$f(x) = ax + b$$

How can we pick the coefficients 𝑎 and 𝑏 that best fit the line to the data?
First question: what makes a particular straight line a ‘good’ fit? Why does one line
(the blue line in the figure) appear to us to fit the trend better?
• Consider the vertical distance between each data point and the corresponding point on the line.
• Add up the lengths of all these vertical distances (the red and blue vertical lines in the figure).
• This sum is an expression of the ‘error’ between the data and the fitted line.
• The one line that gives the minimum error is then the ‘best’ straight line.
Quantifying errors in a curve fit
(1) Positive and negative errors should count equally (whether a data point lies above or
below the line).
(2) Larger errors should be weighted more heavily.
We can do both of these things by squaring the distance. Denote the data values as (x, y) and
the points on the fitted line as (x, f(x)), and sum the squared error over the data points:
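$$\text{err} = \sum_{i=1}^{n} d_i^2 = (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \dots + (y_n - f(x_n))^2 = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2$$

Setting the partial derivatives of err with respect to 𝑎 and 𝑏 equal to zero gives

$$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \qquad \text{and} \qquad \sum_{i=1}^{n} y_i = a \sum_{i=1}^{n} x_i + n b$$

These are the normal equations:

$$\sum y = a \sum x + n b \qquad (1)$$
$$\sum xy = a \sum x^2 + b \sum x \qquad (2)$$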
x      y      x²     xy
1 2.4 1 2.4
2 3 4 6
3 3.6 9 10.8
4 4 16 16
6 5 36 30
8 6 64 48
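Solving normal equations (1) and (2) for 𝑎 and 𝑏 can be sketched as follows, using the six (x, y) pairs tabulated above. The closed-form expressions are the standard solution of the two equations; the function name `fit_line` is chosen here for illustration.

```python
def fit_line(x, y):
    """Least squares line y = a*x + b from the normal equations
    Σy = aΣx + nb and Σxy = aΣx² + bΣx."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a, b

x = [1, 2, 3, 4, 6, 8]
y = [2.4, 3, 3.6, 4, 5, 6]
a, b = fit_line(x, y)
print(round(a, 4), round(b, 4))  # 0.5059 1.9765
```

A value of 𝑦 at any 𝑥 can then be estimated as `a * x + b`.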
Example: 2 Fit a straight line to the following data. Also, estimate the value of y at 𝑥 = 2.5.
𝑥 0 1 2 3 4
(Winter 2022-23)
Example: 3 Fit a straight line using least square method.
𝑥 0 0.5 1 1.5 2 2.5
𝑦 0 1.5 3 4.5 6 7.5
(Winter 2023-24)
Example: 4 Fit a straight line to the following data and hence find 𝑦 when 𝑥 = 70
𝑥 71 68 73 69 67 65 66 67
𝑦 69 72 70 70 68 67 68 64
(Summer 2023-24)
Polynomial Regression: We started the linear curve fit by choosing a generic form of the
straight line 𝑓(𝑥) = 𝑎𝑥 + 𝑏.
This is just one kind of function. There are infinitely many generic forms we could
choose from, for almost any shape we want. Let’s start with a simple extension of the linear
regression concept. Recall the examples of sampled data.
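For a quadratic $f(x) = a + bx + cx^2$, the sum of squared errors is

$$\text{err} = \sum_{i=1}^{n} d_i^2 = \left(y_1 - (a + b x_1 + c x_1^2)\right)^2 + \left(y_2 - (a + b x_2 + c x_2^2)\right)^2 + \dots + \left(y_n - (a + b x_n + c x_n^2)\right)^2 = \sum_{i=1}^{n} \left(y_i - (a + b x_i + c x_i^2)\right)^2$$

Setting the partial derivatives with respect to 𝑎, 𝑏 and 𝑐 equal to zero:

$$\frac{\partial\, \text{err}}{\partial a} = -2 \sum_{i=1}^{n} \left(y_i - a - b x_i - c x_i^2\right) = 0$$

$$\frac{\partial\, \text{err}}{\partial b} = -2 \sum_{i=1}^{n} x_i \left(y_i - a - b x_i - c x_i^2\right) = 0$$

$$\frac{\partial\, \text{err}}{\partial c} = -2 \sum_{i=1}^{n} x_i^2 \left(y_i - a - b x_i - c x_i^2\right) = 0$$

Rearranging gives the normal equations:

$$\sum_{i=1}^{n} y_i = a n + b \sum_{i=1}^{n} x_i + c \sum_{i=1}^{n} x_i^2$$

$$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i^3$$

$$\sum_{i=1}^{n} x_i^2 y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^4$$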
Example: 1
Fit a least squares quadratic curve to the following data:
𝑥 1 2 3 4
𝑦 1.7 1.8 2.3 3.2
Estimate 𝑦(2.4).
Solution:
Let the equation of the least squares quadratic curve (parabola) be $y = a + bx + cx^2$.
The normal equations are

$$\sum y = na + b \sum x + c \sum x^2 \qquad (1)$$
$$\sum xy = a \sum x + b \sum x^2 + c \sum x^3 \qquad (2)$$
$$\sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4 \qquad (3)$$

Here, 𝑛 = 4

x      y      x²     x³     x⁴     xy     x²y
1      1.7    1      1      1      1.7    1.7
2      1.8    4      8      16     3.6    7.2
3      2.3    9      27     81     6.9    20.7
4      3.2    16     64     256    12.8   51.2
Σx = 10   Σy = 9   Σx² = 30   Σx³ = 100   Σx⁴ = 354   Σxy = 25   Σx²y = 80.8
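Substituting the column totals into the three normal equations gives a 3 × 3 linear system in 𝑎, 𝑏 and 𝑐. The sketch below builds that system from the data and solves it by Gaussian elimination, then estimates 𝑦(2.4). The helper names `solve`, `power_sum` and `moment` are hypothetical, introduced only for this illustration.

```python
def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(row) + [v] for row, v in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z

x = [1, 2, 3, 4]
y = [1.7, 1.8, 2.3, 3.2]

def power_sum(p):
    return sum(v ** p for v in x)                    # Σx^p

def moment(p):
    return sum((u ** p) * v for u, v in zip(x, y))   # Σx^p y

# Normal equations (1)-(3) for y = a + b x + c x^2.
A = [[len(x),       power_sum(1), power_sum(2)],
     [power_sum(1), power_sum(2), power_sum(3)],
     [power_sum(2), power_sum(3), power_sum(4)]]
rhs = [moment(0), moment(1), moment(2)]

a, b, c = solve(A, rhs)
y_est = a + b * 2.4 + c * 2.4 ** 2
print(round(a, 3), round(b, 3), round(c, 3))  # 2.0 -0.5 0.2
print(round(y_est, 3))                        # 1.952
```

With these coefficients the fitted parabola is y = 2 − 0.5x + 0.2x², which satisfies all three normal equations exactly.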
Example: 2
Fit a second-degree polynomial using least square method to the following data:
𝑥 0 1 2 3 4
𝑦 1 1.8 1.3 2.5 6.3
Example: 3
Fit a second order polynomial $y = a + bx + cx^2$ to the following data, using the least squares method.
(Summer 2022-23)
𝑥 0 5 10 15 20
𝑦 7 11 16 20 26
$$\sum_{i=1}^{n} Y_i = a \sum_{i=1}^{n} X_i + n b \qquad (1)$$

$$\sum_{i=1}^{n} X_i Y_i = a \sum_{i=1}^{n} X_i^2 + b \sum_{i=1}^{n} X_i \qquad (2)$$

After getting the values of 𝑎 and 𝑏, A = antilog 𝑎 and C = antilog 𝑏.
Solution:
X_i = x_i   y_i    Y_i = ln y_i   X_i²    X_i Y_i
1           10     2.302585       1       2.302585
5           15     2.70805        25      13.54025
7           12     2.484906       49      17.39435
9           15     2.70805        81      24.37245
12          21     3.044522       144     36.53427
ΣX_i = 34   ΣY_i = 13.24811   ΣX_i² = 300   ΣX_i Y_i = 94.1439

Substituting into the normal equations (with 𝑛 = 5):

$$13.24811 = 34A + 5B$$
$$94.1439 = 300A + 34B$$

Solving, A = 0.058964, B = 2.248664.
Taking antilogarithms to base 𝑒 (since Y = ln y):
a = antilog(0.058964) = 1.0607, b = antilog(2.248664) = 9.475068
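The two linear equations in A and B formed from the table can be solved mechanically. The sketch below recomputes the logarithms from the tabulated (x, y) data, solves the normal equations for the slope A and intercept B of Y = ln y against X = x, and takes base-𝑒 antilogarithms. The variable names are illustrative, and small differences in the last digits relative to the table come from rounding in the tabulated logarithms.

```python
from math import log, exp

x = [1, 5, 7, 9, 12]
y = [10, 15, 12, 15, 21]
Y = [log(v) for v in y]  # natural logarithms, as in the table
n = len(x)

sX, sY = sum(x), sum(Y)
sXX = sum(v * v for v in x)
sXY = sum(u * v for u, v in zip(x, Y))

# Solve ΣY = A ΣX + n B and ΣXY = A ΣX² + B ΣX.
A = (n * sXY - sX * sY) / (n * sXX - sX ** 2)  # slope
B = (sY - A * sX) / n                          # intercept

print(f"{A:.4f} {B:.4f}")            # 0.0590 2.2487
print(f"{exp(A):.4f} {exp(B):.4f}")  # 1.0607 9.4751
```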
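(2) $y = b x^a$
Taking logarithms gives the linear form Y = B + AX, where Y = log y, X = log x, B = log b
and A = a. The normal equations are

$$\sum_{i=1}^{n} Y_i = nB + A \sum_{i=1}^{n} X_i \qquad (1)$$

$$\sum_{i=1}^{n} X_i Y_i = B \sum_{i=1}^{n} X_i + A \sum_{i=1}^{n} X_i^2 \qquad (2)$$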
It is known that 𝑣 and 𝑡 are connected by the relation $v = b t^a$. Find the best possible values
of 𝑎 and 𝑏.
v      t      Y = log v   X = log t   X²            XY
350    61     2.544068    1.78533     3.18740262    4.542001
400    26     2.60206     1.414973    2.002149575   3.681846
500    7      2.69897     0.845098    0.714190697   2.280894
600    2.6    2.778151    0.414973    0.17220288    1.152859
ΣY = 10.62325   ΣX = 4.460375   ΣX² = 6.075945772   ΣXY = 11.6576

Substituting in the normal equations

$$\sum_{i=1}^{n} Y_i = nB + A \sum_{i=1}^{n} X_i \qquad (1)$$

$$\sum_{i=1}^{n} X_i Y_i = B \sum_{i=1}^{n} X_i + A \sum_{i=1}^{n} X_i^2 \qquad (2)$$

gives

$$10.62325 = 4B + 4.460375A$$
$$11.6576 = 4.460375B + 6.075945772A$$

On solving these equations, B = 2.845 and A = a = −0.17.

$$b = \operatorname{antilog}(2.845) = 699.842$$
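The same log-linear procedure can be sketched in code for the relation between 𝑣 and 𝑡. The block below uses base-10 logarithms, as in the table, and solves the two normal equations for A (the exponent a) and B = log b. The variable names are illustrative. Because the code keeps full precision instead of rounding A to −0.17 before computing B, its value of b differs slightly from the hand-computed 699.842.

```python
from math import log10

v = [350, 400, 500, 600]
t = [61, 26, 7, 2.6]
X = [log10(u) for u in t]
Y = [log10(u) for u in v]
n = len(X)

sX, sY = sum(X), sum(Y)
sXX = sum(u * u for u in X)
sXY = sum(p * q for p, q in zip(X, Y))

# Solve ΣY = nB + AΣX and ΣXY = BΣX + AΣX².
A = (n * sXY - sX * sY) / (n * sXX - sX ** 2)  # exponent a
B = (sY - A * sX) / n                          # log10 of b
b = 10 ** B

print(round(A, 2), round(B, 2))  # -0.17 2.85
print(round(b))
```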
3) The following values of 𝑇 and 𝑙 follow the law $T = a\,l^n$. Test if this is so and find the best
values of 𝑎 and 𝑛.
T 1.0 1.5 2.0 2.5
l 25 56.2 100 1.56