Unit 1 Correlation, Regression and Curve Fitting 2024-25


PARUL UNIVERSITY

FACULTY OF ENGINEERING AND TECHNOLOGY


DEPARTMENT OF APPLIED SCIENCE AND
HUMANITIES
4th SEMESTER B. TECH PROGRAMME
PROBABILITY, STATISTICS AND NUMERICAL
METHODS (303191251)
ACADEMIC YEAR 2024-2025
UNIT: 1 CORRELATION, REGRESSION AND CURVE
FITTING

CORRELATION
Correlation is the relationship that exists between two or more variables. Two variables are
said to be correlated if a change in one variable is accompanied by a change in the other variable.
EXAMPLE:
1. Relationship between heights and weights.
2. Relationship between price and demand of commodity.
3. Relationship between rainfall and yield of crops.
Types Of Correlations
1. Positive and Negative correlations.
2. Simple and Multiple correlations.
3. Partial and Total correlations.
4. Linear and Non-linear correlations.

POSITIVE AND NEGATIVE CORRELATIONS


POSITIVE CORRELATIONS (Same direction)
If both the variables vary in the same direction, the correlation is said to be positive.
In other words, if the value of one variable increases, the value of the other variable also
increases, and if the value of one decreases, the value of the other also decreases.
Height (cm) 120 130 135 140 145
Weight(kg) 50 55 60 65 70

NEGATIVE CORRELATIONS (Opposite direction)


If the two variables vary in opposite directions, the correlation is said to be
negative. In other words, if the value of one variable increases, the value of the other variable
decreases, and vice versa.
Height (cm) 120 130 135 140 145
Weight(kg) 70 65 60 55 50
SIMPLE AND MULTIPLE CORRELATIONS
1. Simple Correlation: -
When only two variables are studied, the relationship is described as simple correlation, e.g.,
the quantity of money and price level, demand and price, etc.

2. Multiple Correlation: -
When more than two variables are studied, the relationship is described as multiple
correlation, e.g., relationship of price, demand, and supply of a commodity.

PARTIAL AND TOTAL CORRELATIONS


1. Partial Correlation
When more than two variables are involved but the relationship between only two of them is
studied, with the effect of the other variables excluded (held constant), the relationship
is termed partial correlation.
2. Total Correlation
When more than two variables are studied without excluding the effect of any variable, the
relationship is termed total correlation.

Linear and Nonlinear Correlations


1. Linear Correlation:
If the ratio of change between two variables is constant, the correlation is said to be linear.
If such variables are plotted on a graph paper, a straight line is obtained.
X 5 10 15 20 25 30
Y 2 4 6 8 10 12

2. Nonlinear Correlation:
If the ratio of change between two variables is not constant, the correlation is said to be
nonlinear. The graph of a nonlinear or curvilinear relationship will be a curve.
X 15 22 25 30 35 40
Y 4 5 8 9 10 12
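
The distinction between the two tables above can be checked by computing the ratio of change (change in Y divided by change in X) between consecutive pairs. The following is a minimal Python sketch; the helper names `change_ratios` and `is_linear` are illustrative and not part of the original notes, and the check assumes consecutive X values are distinct.

```python
from fractions import Fraction


def change_ratios(xs, ys):
    """Return the ratio of change (delta y / delta x) for each consecutive pair of points."""
    points = list(zip(xs, ys))
    return [
        Fraction(y2 - y1, x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]


def is_linear(xs, ys):
    """True when every consecutive ratio of change is the same."""
    return len(set(change_ratios(xs, ys))) == 1


# The two tables from the notes above.
linear_x, linear_y = [5, 10, 15, 20, 25, 30], [2, 4, 6, 8, 10, 12]
curved_x, curved_y = [15, 22, 25, 30, 35, 40], [4, 5, 8, 9, 10, 12]

print(is_linear(linear_x, linear_y))  # True: every ratio is 2/5
print(is_linear(curved_x, curved_y))  # False: the ratios differ
```

Exact fractions are used so that equal ratios compare as equal without floating-point rounding.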

Methods of studying correlation:
There are two different methods.
1. Graphic methods.
2. Mathematical methods.

Graphic methods:
1. Scatter diagram.
2. Simple graph.
Mathematical methods:
1. Karl Pearson’s coefficient of correlation.
2. Spearman’s rank coefficient of correlation.

Scatter diagram:
This is a very simple method of studying the relationship between two variables. In this method
one variable is taken on the X-axis and the other variable is taken on the Y-axis, and for each pair
of values a point is plotted, as in the following examples:

Example 1:
After standard deduction from total income, 20% income tax is imposed on the
remaining income. The information regarding the taxable income and the tax to be paid
is given below for five persons.

Person 1 2 3 4 5
Taxable Income (thousand ₹ ) 𝒙 50 30 80 20 100
Income Tax (thousand ₹ ) 𝒚 10 6 16 4 20

Draw a scatter diagram from this information and discuss the correlation.
Solution:
The following scatter diagram is obtained by plotting the points corresponding to the
ordered pairs (50,10), (30,6), (80,16), (20,4) and (100,20) of 𝑥 and 𝑦.
We can see that all the points lie on the same line in the scatter diagram. We can
also see that as the values of variable 𝑋 change, the values of variable 𝑌 also change in
the same direction with a constant proportion. Hence, we can see that there is a perfect
positive correlation between two variables 𝑋 and 𝑌.

Example: 2
To know the relation between monthly expenditure and monthly savings for middle class
families, the information regarding expenditure and savings for 5 families is given
below. (The monthly income of each family is ₹ 20,000)

Monthly Expenditure (thousand ₹) x    15   18   8    10   12
Monthly Savings (thousand ₹) y        5    2    12   10   8

Draw a scatter diagram indicating the relation between monthly expenditure and
monthly savings from this information and discuss their correlation.
Solution:
The following scatter diagram is obtained by plotting the points of ordered pairs (15,5),
(18,2), (8,12), (10,10), (12,8) of 𝑋 and 𝑌 on the graph paper.
We can see that all the points lie on the same line in the scatter diagram. We can also see
that as the values of variable 𝑋 change, the values of variable 𝑌 also change in the
opposite direction with a constant proportion. Hence, we can see that there is a perfect
negative correlation between 𝑋 and 𝑌.

Exercise
1. A ball pen making company wants to know the relation between the price (in ₹ ) and
supply (in thousand units) of its most selling Gel Pen. The following information is
collected for it: Draw a scatter diagram and interpret it.
Price (in ₹ ) 14 16 12 11 15 13 17
Monthly Supply 32 50 20 12 45 30 53

2. The following information is collected to study the relationship between the
minimum day temperature and the sale of woollen clothes during a particular day of
winter for six different cities.
Minimum day temperature (Celsius)           12   20   8    5    15   24
Sale of woollen clothes (thousand units)    35   10   45   70   20   8
Draw a scatter diagram from this information and interpret it.
Karl Pearson’s Coefficient of Correlation
The coefficient of correlation is the measure of correlation between two random variables
X and Y, and is denoted by r.
\[ r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} \]
where $\operatorname{cov}(X, Y)$ is the covariance of variables X and Y,
$\sigma_x$ is the standard deviation of variable X,
and $\sigma_y$ is the standard deviation of variable Y.
This expression is known as Karl Pearson’s coefficient of correlation or Karl Pearson’s
product-moment coefficient of correlation.

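\[ \operatorname{cov}(X, Y) = \frac{1}{n} \sum (x - \bar{x})(y - \bar{y}) \]

\[ \sigma_x = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}, \qquad \sigma_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n}} \]

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} \]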

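The above expression can be further modified so that it uses only the raw sums:

\[ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}\,\sqrt{\sum y^2 - \dfrac{(\sum y)^2}{n}}} \]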
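
Both forms of Karl Pearson's coefficient can be sketched in a few lines of Python. This is an illustrative sketch rather than a prescribed implementation: the function names `pearson_r` and `pearson_r_shortcut` are assumptions, and the data are those of Example 1 that follows.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Deviation form: sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pearson_r_shortcut(xs, ys):
    """Raw-sum form: uses only sum x, sum y, sum x^2, sum y^2 and sum xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    num = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    den_x = sum(x * x for x in xs) - sum_x ** 2 / n
    den_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return num / sqrt(den_x * den_y)


# Data of Example 1 in the notes.
xs = [2, 4, 5, 6, 8, 11]
ys = [18, 12, 10, 8, 7, 5]
print(round(pearson_r(xs, ys), 4))           # -0.9203
print(round(pearson_r_shortcut(xs, ys), 4))  # -0.9203
```

The two functions agree up to floating-point rounding; the raw-sum form is usually more convenient for hand calculation, while the deviation form is convenient when the means are whole numbers.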

Properties of Coefficient of Correlation


1. The coefficient of correlation lies between -1 and 1, i.e., −1 ≤ 𝑟 ≤ 1.
2. Correlation coefficient is independent of change of origin and change of scale.
3. Two independent variables are uncorrelated.
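
Property 2 can be checked numerically. The sketch below (the function name `pearson_r` and the particular shift and scale values are illustrative assumptions) transforms the data of Example 2 that follows with u = (x - 13)/4 and v = (y - 25)/5 and compares the two coefficients. Positive scale factors are assumed; scale factors of opposite sign would reverse the sign of r.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Karl Pearson's coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [5, 9, 13, 17, 21]
ys = [12, 20, 25, 33, 35]

# Change of origin (subtract a constant) and change of scale (divide by a positive constant).
us = [(x - 13) / 4 for x in xs]
vs = [(y - 25) / 5 for y in ys]

print(round(pearson_r(xs, ys), 3))  # 0.986
print(round(pearson_r(us, vs), 3))  # 0.986
```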
Example : 1
Calculate the correlation coefficient between 𝑥 and 𝑦 using the following data:
𝑥 2 4 5 6 8 11
𝑦 18 12 10 8 7 5

Solution:
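n = 6

x      y      x²     y²     xy
2      18     4      324    36
4      12     16     144    48
5      10     25     100    50
6      8      36     64     48
8      7      64     49     56
11     5      121    25     55
Σx = 36    Σy = 60    Σx² = 266    Σy² = 706    Σxy = 293

\[ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}\,\sqrt{\sum y^2 - \dfrac{(\sum y)^2}{n}}} = \frac{293 - \dfrac{(36)(60)}{6}}{\sqrt{266 - \dfrac{(36)^2}{6}}\,\sqrt{706 - \dfrac{(60)^2}{6}}} = \frac{-67}{\sqrt{50}\,\sqrt{106}} = -0.9203 \]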

Example : 2
Calculate the correlation coefficient between the following data:
𝑥 5 9 13 17 21
𝑦 12 20 25 33 35
(Summer 2023)
Solution:
n = 5
\[ \bar{x} = \frac{\sum x}{n} = \frac{65}{5} = 13, \qquad \bar{y} = \frac{\sum y}{n} = \frac{125}{5} = 25 \]

x      y      x − x̄    y − ȳ    (x − x̄)²    (y − ȳ)²    (x − x̄)(y − ȳ)
5      12     -8       -13      64          169         104
9      20     -4       -5       16          25          20
13     25     0        0        0           0           0
17     33     4        8        16          64          32
21     35     8        10       64          100         80
Σx = 65    Σy = 125    Σ(x − x̄) = 0    Σ(y − ȳ) = 0    Σ(x − x̄)² = 160    Σ(y − ȳ)² = 358    Σ(x − x̄)(y − ȳ) = 236

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{236}{\sqrt{160}\,\sqrt{358}} = 0.986 \]
Example : 3
Calculate the correlation coefficient for the following values of demand and the
corresponding price of a commodity:

Demand in quintals 65 66 67 67 68 69 70 72
Price in rupees per kg 67 68 65 68 72 72 69 71

Solution
Let the demand in quintals be denoted by x and the price in rupees per kg be denoted by y.
n = 8
\[ \bar{x} = \frac{\sum x}{n} = \frac{544}{8} = 68, \qquad \bar{y} = \frac{\sum y}{n} = \frac{552}{8} = 69 \]

x      y      x − x̄    y − ȳ    (x − x̄)²    (y − ȳ)²    (x − x̄)(y − ȳ)
65     67     -3       -2       9           4           6
66     68     -2       -1       4           1           2
67     65     -1       -4       1           16          4
67     68     -1       -1       1           1           1
68     72     0        3        0           9           0
69     72     1        3        1           9           3
70     69     2        0        4           0           0
72     71     4        2        16          4           8
Σx = 544    Σy = 552    Σ(x − x̄) = 0    Σ(y − ȳ) = 0    Σ(x − x̄)² = 36    Σ(y − ȳ)² = 44    Σ(x − x̄)(y − ȳ) = 24

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{24}{\sqrt{36}\,\sqrt{44}} = 0.603 \]

Example : 4
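Given n = 10, $\sigma_x$ = 5.4, $\sigma_y$ = 6.2, and the sum of the products of deviations from the means of x and
y is 66. Find the correlation coefficient.

Solution
n = 10, $\sigma_x$ = 5.4, $\sigma_y$ = 6.2

\[ \sum (x - \bar{x})(y - \bar{y}) = 66 \]
\[ \operatorname{cov}(X, Y) = \frac{1}{n} \sum (x - \bar{x})(y - \bar{y}) = \frac{66}{10} = 6.6 \]
\[ r = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} = \frac{6.6}{5.4 \times 6.2} = 0.197 \]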

Exercise:
1. Find the Pearson’s Correlation Coefficient of the following data:
𝑥 100 101 102 102 100 99 97 98 96 95
𝑦 98 99 99 97 95 92 95 94 90 91

2. Calculate Karl Pearson’s coefficient of correlation for the data given below:
𝑥 10 14 18 22 26 30 10
𝑦 18 12 24 6 30 36 18

3. Find the Pearson’s Correlation Coefficient of the following data:


𝑥 9 8 7 6 5 4 3 2 1
𝑦 15 16 14 1 11 12 10 8 9
(Winter 2022-23)
4. Find Coefficient of Correlation between the following data:
𝑥 1 2 3 4 5 6 7 8 9
𝑦 9 8 10 12 11 13 14 16 15
(Winter 2023-24)
5. Calculate Karl Pearson’s coefficient of correlation for the data given below:

𝑥 17 19 21 26 20
𝑦 23 27 25 26 27

6. Given n = 10, $\sigma_x$ = 10.8, $\sigma_y$ = 12.4, and the sum of the products of deviations
from the means of x and y is 132. Find the correlation coefficient.

Spearman’s Rank correlation coefficient:


Spearman's rank correlation coefficient, often denoted by the symbol 𝝆 (rho), is a non-
parametric measure of statistical dependence between two variables.

Here’s a brief explanation of how Spearman’s rank correlation coefficient is calculated.

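Rank the data: For each variable, rank the data from lowest to highest (or from highest to
lowest, as in the examples below, provided the same convention is used for both variables),
assigning a rank to each value. If there are ties, assign each tied value the average of the
ranks it would have received if there were no tie.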
Calculate the differences between ranks: For each pair of data points, find the difference
between their ranks.
Apply the formula: The coefficient is calculated by the following formula:
\[ r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} \]
where d is the difference between the two ranks of a pair and n is the number of pairs.


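When some observations are tied (repeated), the above formula for the rank correlation
coefficient is modified as follows:
\[ r = 1 - \frac{6\left[\sum d^2 + \dfrac{m_1(m_1^2 - 1)}{12} + \dfrac{m_2(m_2^2 - 1)}{12} + \cdots\right]}{n(n^2 - 1)} \]

That is, for each group of tied values the term $\dfrac{m(m^2 - 1)}{12}$ is added to $\sum d^2$,
where m is the number of times the item is repeated.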

The value of 𝝆 lies between -1 and 1. A positive value indicates a positive monotonic
relationship, while a negative value indicates a negative monotonic relationship. A value of
0 indicates no monotonic relationship.
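
The ranking, tie-averaging and tie-correction steps described above can be sketched in Python as follows. The helper names (`average_ranks`, `tie_correction`, `spearman_r`) are illustrative assumptions, ranks are assigned from highest to lowest as in the worked examples below, and the data are those of Examples 2 and 3.

```python
def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share the average rank."""
    order = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = order.index(v) + 1          # best rank occupied by this tie group
        count = order.count(v)              # m, the number of repetitions
        ranks[v] = first + (count - 1) / 2  # average of ranks first .. first + count - 1
    return [ranks[v] for v in values]


def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over the distinct values (zero when m = 1)."""
    return sum(m * (m * m - 1) / 12 for m in (values.count(v) for v in set(values)))


def spearman_r(xs, ys):
    """Spearman's rank correlation with the tie-corrected sum of squared differences."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    sum_d2 += tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Example 2 (no ties) and Example 3 (ties in both series) from the notes.
maths = [8, 36, 98, 25, 75, 82, 92, 62, 65, 35]
physics = [84, 51, 91, 60, 68, 62, 86, 58, 35, 49]
x3 = [35, 40, 42, 43, 40, 53, 54, 49, 41, 55]
y3 = [102, 101, 97, 98, 38, 101, 97, 92, 95, 95]

print(round(spearman_r(maths, physics), 3))  # 0.455
print(round(spearman_r(x3, y3), 3))          # -0.227
```

When there are no ties the correction term is zero, so the same function covers both formulas.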

Example: 1
Two judges have given ranks to 10 students for their honesty. Find the rank correlation
coefficient of the following data:
1st Judge    3    5    8    4    7    10   2    1    6    9
2nd Judge    6    4    9    8    1    2    3    10   5    7
Solution
Rank given by 1st judge    Rank given by 2nd judge    Difference in ranks d    d²
3                          6                          -3                       9
5                          4                          1                        1
8                          9                          -1                       1
4                          8                          -4                       16
7                          1                          6                        36
10                         2                          8                        64
2                          3                          -1                       1
1                          10                         -9                       81
6                          5                          1                        1
9                          7                          2                        4
                                                                               Σd² = 214

\[ r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(214)}{10(100 - 1)} = 1 - \frac{1284}{990} = 1 - 1.297 = -0.297 \approx -0.3 \]
Example: 2
Ten students got the following percentage of marks in mathematics and physics.
(x)maths 8 36 98 25 75 82 92 62 65 35
(y)physics 84 51 91 60 68 62 86 58 35 49

Find the rank correlation coefficient.


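Solution
n = 10
Ranks are assigned from the highest mark (rank 1) to the lowest (rank 10); $R_x$ and $R_y$ denote the ranks in maths and physics.

x      y      Rank in maths (R_x)    Rank in physics (R_y)    d = R_x − R_y    d²
8      84     10                     3                        7                49
36     51     7                      8                        -1               1
98     91     1                      1                        0                0
25     60     9                      6                        3                9
75     68     4                      4                        0                0
82     62     3                      5                        -2               4
92     86     2                      2                        0                0
62     58     6                      7                        -1               1
65     35     5                      10                       -5               25
35     49     8                      9                        -1               1
                                                              Σd = 0           Σd² = 90

\[ r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(90)}{10(100 - 1)} = 0.455 \]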

Example: 3
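Find the coefficient of rank correlation of the following data: (Summer 2022-23)

x    35    40    42   43   40   53    54   49   41   55
y    102   101   97   98   38   101   97   92   95   95

Solution
n = 10

x      y      Rank in x (R_x)    Rank in y (R_y)    d = R_x − R_y    d²
35     102    10                 1                  9                81
40     101    8.5                2.5                6                36
42     97     6                  5.5                0.5              0.25
43     98     5                  4                  1                1
40     38     8.5                10                 -1.5             2.25
53     101    3                  2.5                0.5              0.25
54     97     2                  5.5                -3.5             12.25
49     92     4                  9                  -5               25
41     95     7                  7.5                -0.5             0.25
55     95     1                  7.5                -6.5             42.25
                                                                     Σd² = 200.50

In the x series, 40 is repeated twice. In the y series, 101, 97 and 95 are each repeated twice.
So there are four tie groups with m = 2, and each contributes $\dfrac{m(m^2 - 1)}{12} = \dfrac{2(4 - 1)}{12} = 0.5$.

\[ r = 1 - \frac{6\left[\sum d^2 + \dfrac{m_1(m_1^2 - 1)}{12} + \dfrac{m_2(m_2^2 - 1)}{12} + \dfrac{m_3(m_3^2 - 1)}{12} + \dfrac{m_4(m_4^2 - 1)}{12}\right]}{n(n^2 - 1)} \]
\[ = 1 - \frac{6\{200.50 + 0.5 + 0.5 + 0.5 + 0.5\}}{990} = 1 - \frac{1215}{990} = -0.227 \]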

Exercise:
1.Compute Spearman’s rank correlation coefficient from the following data:
𝑥 18 20 34 52 12
𝑦 39 23 35 52 12
2.Obtain the rank correlation coefficient from the following data.
𝑥 10 12 18 18 15 40
𝑦 12 18 25 25 50 25

(Summer 2023-24)
REGRESSION:
By studying correlation, we can determine the existence, degree and direction of the relationship
between two variables, but we cannot answer questions of the type: if there is a certain
amount of change in one variable, what will be the corresponding change in the other
variable? Questions of this type can be answered if we can establish a quantitative
relationship between the two related variables. The statistical tool by which it is possible to
predict or estimate the unknown values of one variable from known values of another
variable is called regression. A line of regression is a straight line.

LINES OF REGRESSION
If the variables, which are highly correlated, are plotted on a graph then the points lie in a
narrow strip. If all the points in the scatter diagram cluster around a straight line, the line is
called the line of regression. The line of regression is the line of best fit and is obtained by
the principle of least squares.
Line of regression of y on x:
It is the line which gives the best estimate for the values of y for any given values of x.
The regression equation of y on x is given by
\[ (y - \bar{y}) = r \frac{\sigma_y}{\sigma_x} (x - \bar{x}) \]

It is also written as
\[ y = a + bx \]
Line of regression of x on y:
It is the line which gives the best estimate for the values of x for any given values of y. The
regression equation of x on y is given by
\[ (x - \bar{x}) = r \frac{\sigma_x}{\sigma_y} (y - \bar{y}) \]

It is also written as
\[ x = a + by \]
where $\bar{x}$ and $\bar{y}$ are the means of the x series and y series respectively, $\sigma_x$ and $\sigma_y$ are the standard
deviations of the x series and y series respectively, and r is the correlation coefficient between
x and y.

REGRESSION COEFFICIENTS
The slope $b_{yx}$ of the line of regression of y on x is also called the coefficient of regression of
y on x. It represents the increment in the value of y corresponding to a unit change in the
value of x.

\[ b_{yx} = \text{Regression coefficient of } y \text{ on } x = r \frac{\sigma_y}{\sigma_x} \]
Similarly, the slope $b_{xy}$ of the line of regression of x on y is called the coefficient of regression
of x on y. It represents the increment in the value of x corresponding to a unit change in the
value of y.

\[ b_{xy} = \text{Regression coefficient of } x \text{ on } y = r \frac{\sigma_x}{\sigma_y} \]

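Expressions for Regression Coefficients:

(a) In terms of deviations from the means:
\[ b_{yx} = r \frac{\sigma_y}{\sigma_x} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \]
and
\[ b_{xy} = r \frac{\sigma_x}{\sigma_y} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} \]

(b) In terms of the raw sums:
\[ b_{yx} = r \frac{\sigma_y}{\sigma_x} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} \]
and
\[ b_{xy} = r \frac{\sigma_x}{\sigma_y} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum y^2 - \dfrac{(\sum y)^2}{n}} \]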

Properties of Regression Coefficient:


(1) The coefficient of correlation is the geometric mean of the coefficients of regression,
i.e., $r = \pm\sqrt{b_{yx}\, b_{xy}}$, where r takes the same sign as the two regression coefficients.
(2) If one of the regression coefficients is numerically greater than one, the other must be
numerically less than one.
(3) The arithmetic mean of the regression coefficients is greater than or equal to the coefficient
of correlation.
(4) Regression coefficients are independent of the change of origin but not of scale.
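
Properties (1) to (3) can be checked numerically. The following Python sketch is illustrative only: the function names are assumptions, and the data are those of Example 1 in the Karl Pearson section above, for which r = -0.9203.

```python
from math import copysign, sqrt


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) computed from the raw-sum expressions."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    return sxy / sxx, sxy / syy


def r_from_coefficients(b_yx, b_xy):
    """Geometric mean of the two coefficients, carrying their common sign."""
    return copysign(sqrt(b_yx * b_xy), b_yx)


xs = [2, 4, 5, 6, 8, 11]
ys = [18, 12, 10, 8, 7, 5]
b_yx, b_xy = regression_coefficients(xs, ys)
r = r_from_coefficients(b_yx, b_xy)

print(round(b_yx, 4), round(b_xy, 4))   # -1.34 -0.6321
print(round(r, 4))                      # -0.9203
print(abs(b_yx) > 1 and abs(b_xy) < 1)  # True (property 2, in absolute value)
print(abs(b_yx + b_xy) / 2 >= abs(r))   # True (property 3, in absolute value)
```

Both coefficients are negative here, so the geometric mean is given a negative sign.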
Example: 1
The following data regarding the heights (y) and weights (x) of 100 college students are
given:
Σx = 15000, Σx² = 2272500, Σxy = 1022250,
Σy = 6800, Σy² = 463025.
Find the coefficient of correlation between height and weight and also the equations of the
lines of regression of height on weight and of weight on height.
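Solution:
n = 100
\[ b_{yx} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{1022250 - \dfrac{(15000)(6800)}{100}}{2272500 - \dfrac{(15000)^2}{100}} = \frac{2250}{22500} = 0.1 \]

\[ b_{xy} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum y^2 - \dfrac{(\sum y)^2}{n}} = \frac{1022250 - \dfrac{(15000)(6800)}{100}}{463025 - \dfrac{(6800)^2}{100}} = \frac{2250}{625} = 3.6 \]

\[ r = \sqrt{b_{yx}\, b_{xy}} = \sqrt{(0.1)(3.6)} = 0.6 \]

\[ \bar{x} = \frac{\sum x}{n} = \frac{15000}{100} = 150, \qquad \bar{y} = \frac{\sum y}{n} = \frac{6800}{100} = 68 \]
The equation of the line of regression of y on x is
\[ (y - \bar{y}) = b_{yx} (x - \bar{x}) \]
\[ (y - 68) = 0.1(x - 150) \]
\[ y = 0.1x + 53 \]
The equation of the line of regression of x on y is
\[ (x - \bar{x}) = b_{xy} (y - \bar{y}) \]
\[ x - 150 = 3.6(y - 68) \]
\[ x = 3.6y - 94.8 \]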
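
The calculation in this example needs only the six summary values, so it can be sketched as a small Python function. The name `regression_from_sums` and its parameter names are illustrative assumptions, not part of the notes.

```python
from math import sqrt


def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return b_yx, b_xy, r and both regression lines as (slope, intercept) pairs."""
    mean_x, mean_y = sum_x / n, sum_y / n
    sxy = sum_xy - sum_x * sum_y / n
    b_yx = sxy / (sum_x2 - sum_x ** 2 / n)
    b_xy = sxy / (sum_y2 - sum_y ** 2 / n)
    # r is the geometric mean of the coefficients and shares their sign.
    r = sqrt(b_yx * b_xy) * (1 if b_yx >= 0 else -1)
    y_on_x = (b_yx, mean_y - b_yx * mean_x)  # y = slope * x + intercept
    x_on_y = (b_xy, mean_x - b_xy * mean_y)  # x = slope * y + intercept
    return b_yx, b_xy, r, y_on_x, x_on_y


b_yx, b_xy, r, y_on_x, x_on_y = regression_from_sums(
    n=100, sum_x=15000, sum_y=6800, sum_x2=2272500, sum_y2=463025, sum_xy=1022250
)
print(round(b_yx, 4), round(b_xy, 4), round(r, 4))  # 0.1 3.6 0.6
print([round(v, 4) for v in y_on_x])                # [0.1, 53.0]
print([round(v, 4) for v in x_on_y])                # [3.6, -94.8]
```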
Example: 2
Find the regression coefficients $b_{yx}$ and $b_{xy}$ and hence find the correlation coefficient
between x and y for the following data:
𝑥 4 2 3 4 2
𝑦 2 3 2 4 4

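Solution
n = 5

x      y      x²     y²     xy
4      2      16     4      8
2      3      4      9      6
3      2      9      4      6
4      4      16     16     16
2      4      4      16     8
Σx = 15    Σy = 15    Σx² = 49    Σy² = 49    Σxy = 44

\[ b_{yx} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{44 - \dfrac{(15)(15)}{5}}{49 - \dfrac{(15)^2}{5}} = \frac{-1}{4} = -0.25 \]
\[ b_{xy} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum y^2 - \dfrac{(\sum y)^2}{n}} = \frac{44 - \dfrac{(15)(15)}{5}}{49 - \dfrac{(15)^2}{5}} = \frac{-1}{4} = -0.25 \]

Since both regression coefficients are negative, r is also negative:
\[ r = -\sqrt{b_{yx}\, b_{xy}} = -\sqrt{(-0.25)(-0.25)} = -0.25 \]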

Exercise
1.Find the regression coefficient of y on x for the following data:
𝒙 1 2 3 4 5
𝒚 160 180 140 180 200

2. Find the equation of regression lines from the following data and also estimate 𝑦
for 𝑥 = 1 and 𝑥 for 𝑦 = 4.

𝒙 3 2 -1 6 4 -2 5 7

𝒚 5 13 12 -1 2 20 0 -3
3.Find the equation of regression lines and the correlation coefficient from the following data:
𝒙 28 41 40 38 35 33 46 32 36 33
𝒚 30 34 31 34 30 26 28 31 26 31

4. The following information is obtained for two variables x and y. Find the regression equation of y
on x. n = 10; Σx = 130; Σx² = 2288; Σxy = 3467.

CURVE FITTING
Curve fitting is the process of finding the ‘best-fit’ curve for a given set of data. It is the
representation of the relationship between two variables by means of an algebraic equation.
On the basis of this mathematical equation, predictions can be made in many statistical problems.
Suppose a set of n points of values $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ of the two variables x and y are
given. These values are plotted on a rectangular coordinate system, i.e., the xy-plane. The
resulting set of points is known as a scatter diagram (Fig. 5.1). The scatter diagram exhibits the
trend and it is possible to visualize a smooth curve approximating the data. Such a curve is known
as an approximating curve.

METHODS:
1. Linear Regression: One of the simplest forms of curve fitting, linear regression assumes a
linear relationship between the independent and dependent variables. The goal is to find the
best-fitting line (or hyperplane in higher dimensions) that minimizes the sum of squared differences
between observed and predicted values ($Y = AX + B$ or $X = AY + B$).

2. Polynomial Regression: Polynomial regression extends linear regression by allowing the
model to include higher-degree polynomials. This flexibility enables a better fit for nonlinear
relationships in the data ($Y = AX^2 + BX + C$ or $X = AY^2 + BY + C$).
3. Exponential and Logarithmic: Exponential and logarithmic curve fitting is suitable for
datasets exhibiting exponential growth or decay. These models are often used in fields like
biology, physics, and finance (for example, $Y = Ce^{AX}$).
Linear Regression:
Given the general form of a straight line
\[ f(x) = ax + b \]
how can we pick the coefficients so that the line best fits the data?
First question: what makes a particular straight line a 'good' fit? Why does one line
appear to us to fit the trend better than another?
• Consider the vertical distance between each data point and the corresponding point on the line.
• Add up the lengths of all these vertical segments.
• This sum is an expression of the 'error' between the data and the fitted line.
• The one line that gives the minimum error is then the 'best' straight line.
Quantifying errors in a curve fit
(1) Positive and negative errors should have the same value (whether the data point is above or below the line).
(2) Greater errors should be weighted more heavily.
We can do both of these things by squaring the distance. Denote the data values as (x, y) and
the points on the fitted line as (x, f(x)), and sum the squared errors over all n data points:

\[ err = \sum_{i=1}^{n} d_i^2 = (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \cdots + (y_n - f(x_n))^2 \]
\[ = (y_1 - (ax_1 + b))^2 + (y_2 - (ax_2 + b))^2 + \cdots + (y_n - (ax_n + b))^2 \]
\[ = \sum_{i=1}^{n} (y_i - ax_i - b)^2 \]

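The error is minimum when the first-order partial derivatives with respect to a and b are zero:

\[ \frac{\partial\, err}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - ax_i - b) = 0, \qquad \frac{\partial\, err}{\partial b} = -2 \sum_{i=1}^{n} (y_i - ax_i - b) = 0 \]
\[ \sum_{i=1}^{n} x_i y_i - a \sum_{i=1}^{n} x_i^2 - b \sum_{i=1}^{n} x_i = 0, \qquad \sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i - b \sum_{i=1}^{n} 1 = 0 \]
\[ \sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \quad \text{and} \quad \sum_{i=1}^{n} y_i = a \sum_{i=1}^{n} x_i + nb \]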

Solve the normal equations for a and b:

\[ \sum y = a \sum x + nb \quad (1) \]

\[ \sum xy = a \sum x^2 + b \sum x \quad (2) \]

Example: 1 Fit a straight line to the following data:
x    1     2    3     4    6    8
y    2.4   3    3.6   4    5    6
Solution
Let the straight line to be fitted to the data be
\[ y = a + bx \]
(here a is the intercept and b is the slope). The normal equations are
\[ \sum y = na + b \sum x \quad (1) \]
\[ \sum xy = a \sum x + b \sum x^2 \quad (2) \]
n = 6

x      y      x²     xy
1      2.4    1      2.4
2      3      4      6
3      3.6    9      10.8
4      4      16     16
6      5      36     30
8      6      64     48
Σx = 24    Σy = 24    Σx² = 130    Σxy = 113.2

Substituting these values in Eqs (1) and (2),
24 = 6a + 24b (3)
113.2 = 24a + 130b (4)
Solving Eqs (3) and (4), we get
a = 1.9765
b = 0.5059
Hence, the required equation of the straight line is y = 1.9765 + 0.5059x
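
The two normal equations can be solved in closed form by eliminating a, which gives a short routine for fitting a straight line. The sketch below is a minimal illustration in Python (the function name `fit_line` is an assumption), applied to the data of Example 1 with the line written as y = a + bx.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x obtained from the two normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Eliminating a between  sum_y = n*a + b*sum_x  and  sum_xy = a*sum_x + b*sum_x2:
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = fit_line([1, 2, 3, 4, 6, 8], [2.4, 3, 3.6, 4, 5, 6])
print(round(a, 4), round(b, 4))  # 1.9765 0.5059
```

The denominator is zero only when all x values are equal, in which case no unique least-squares line of this form exists.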

Example: 2 Fit a straight line to the following data. Also, estimate the value of y at 𝑥 = 2.5.
𝑥 0 1 2 3 4

𝑦 1 1.8 3.3 4.5 6.3

(Winter 2022-23)
Example: 3 Fit a straight line using least square method.
𝑥 0 0.5 1 1.5 2 2.5
𝑦 0 1.5 3 4.5 6 7.5
(Winter 2023-24)

Example: 4 Fit a straight line to the following data and hence find 𝑦 when 𝑥 = 70
𝑥 71 68 73 69 67 65 66 67
𝑦 69 72 70 70 68 67 68 64
(Summer 2023-24)
Polynomial Regression: We started the linear curve fit by choosing a generic form of the
straight line $f(x) = ax + b$.
This is just one kind of function. There are infinitely many generic forms we could
choose from, for almost any shape we want. Let us start with a simple extension of the linear
regression concept: a second-degree polynomial $f(x) = a + bx + cx^2$.

Error - Least squares approach

\[ err = \sum_{i=1}^{n} d_i^2 = (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \cdots + (y_n - f(x_n))^2 \]
\[ = (y_1 - (a + bx_1 + cx_1^2))^2 + (y_2 - (a + bx_2 + cx_2^2))^2 + \cdots + (y_n - (a + bx_n + cx_n^2))^2 \]
\[ = \sum_{i=1}^{n} (y_i - a - bx_i - cx_i^2)^2 \]

To minimize the error, set the partial derivatives with respect to a, b and c equal to 0.

\[ \frac{\partial\, err}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - bx_i - cx_i^2) = 0 \]

\[ \frac{\partial\, err}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - bx_i - cx_i^2) = 0 \]

\[ \frac{\partial\, err}{\partial c} = -2 \sum_{i=1}^{n} x_i^2 (y_i - a - bx_i - cx_i^2) = 0 \]

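Simplifying these equations, we get the normal equations

\[ \sum_{i=1}^{n} y_i = an + b \sum_{i=1}^{n} x_i + c \sum_{i=1}^{n} x_i^2 \]

\[ \sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i^3 \]

\[ \sum_{i=1}^{n} x_i^2 y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^4 \]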
Example: 1
Fit a least squares quadratic curve to the following data:
𝑥 1 2 3 4
𝑦 1.7 1.8 2.3 3.2
Estimate 𝑦(2.4).
Solution:
Let the equation of the least squares quadratic curve (parabola) be $y = a + bx + cx^2$.
The normal equations are

\[ \sum y = na + b \sum x + c \sum x^2 \quad (1) \]

\[ \sum xy = a \sum x + b \sum x^2 + c \sum x^3 \quad (2) \]

\[ \sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4 \quad (3) \]

Here, n = 4

x      y      x²     x³     x⁴     xy     x²y
1      1.7    1      1      1      1.7    1.7
2      1.8    4      8      16     3.6    7.2
3      2.3    9      27     81     6.9    20.7
4      3.2    16     64     256    12.8   51.2
Σx = 10    Σy = 9    Σx² = 30    Σx³ = 100    Σx⁴ = 354    Σxy = 25    Σx²y = 80.8

Substitute these values in equations (1), (2) and (3),


9 = 4𝑎 + 10𝑏 + 30𝑐
25 = 10𝑎 + 30𝑏 + 100𝑐
80.8 = 30𝑎 + 100𝑏 + 354𝑐

Solving the above equations, we get
a = 2, b = -0.5, c = 0.2
Hence, the required equation of the quadratic curve is
\[ y = 2 - 0.5x + 0.2x^2 \]
\[ y(2.4) = 2 - (0.5)(2.4) + (0.2)(2.4)^2 = 1.952 \]
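
The three normal equations form a 3 × 3 linear system, which can be solved mechanically. The sketch below uses plain Gaussian elimination in Python so that it needs no external libraries; the function names `solve_linear_system` and `fit_quadratic` are illustrative assumptions, and the data are those of Example 1.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in reversed(range(n)):  # back substitution
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_quadratic(xs, ys):
    """Least-squares parabola y = a + b*x + c*x^2 via the three normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # s[k] = sum of x^k, s[0] = n
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # t[k] = sum of x^k * y
    matrix = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    return solve_linear_system(matrix, t)


a, b, c = fit_quadratic([1, 2, 3, 4], [1.7, 1.8, 2.3, 3.2])
print(round(a, 3), round(b, 3), round(c, 3))  # 2.0 -0.5 0.2
print(round(a + b * 2.4 + c * 2.4 ** 2, 3))   # 1.952
```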

Example: 2
Fit a second-degree polynomial using least square method to the following data:
𝑥 0 1 2 3 4
𝑦 1 1.8 1.3 2.5 6.3
Example: 3
Fit a second order polynomial $y = a + bx + cx^2$ to the following data, using the least squares method.
(Summer 2022-23)
𝑥 0 5 10 15 20
𝑦 7 11 16 20 26

Curve fitting - Other nonlinear fits (exponential)


Q: Will a polynomial of any order necessarily fit any set of data?
A: No. Many phenomena do not follow a polynomial form. They may be, for example,
exponential.
(1) General exponential equation $f(x) = Ce^{Ax}$

Taking the natural logarithm of both sides of $y = Ce^{Ax}$, we get
\[ \ln y = \ln C + Ax \]
\[ Y = b + aX, \quad \text{where } Y = \ln y,\; X = x,\; b = \ln C \text{ and } a = A \]
This is the equation of a straight line: the original data in the xy-plane are mapped into the
XY-plane. This is called linearization. The data (x, y) are transformed to (x, ln y).

To find the values of a and b we use the normal equations

\[ \sum_{i=1}^{n} Y_i = a \sum_{i=1}^{n} X_i + nb \quad (1) \]
\[ \sum_{i=1}^{n} X_i Y_i = a \sum_{i=1}^{n} X_i^2 + b \sum_{i=1}^{n} X_i \quad (2) \]
After getting the values of a and b, we have $A = a$ and $C = e^{b}$ (the antilog of b to base e).

Example: An experiment gave the following values:

x    1    5    7    9    12
y    10   15   12   15   21

Fit an exponential curve $y = Ce^{Ax}$.

Solution:
Taking natural logarithms gives ln y = ln C + Ax, i.e. Y = AX + B with Y = ln y, X = x and B = ln C.

X_i = x_i    y_i    Y_i = ln y_i    X_i²    X_i Y_i
1            10     2.302585        1       2.302585
5            15     2.70805         25      13.54025
7            12     2.484906        49      17.39435
9            15     2.70805         81      24.37245
12           21     3.044522        144     36.53427
ΣX_i = 34    ΣY_i = 13.24811    ΣX_i² = 300    ΣX_i Y_i = 94.1439

The normal equations are
13.24811 = 34A + 5B
94.1439 = 300A + 34B

Solving, A ≈ 0.05896 and B ≈ 2.2487, so that C = e^B = antilog(2.2487) ≈ 9.475.

Hence, the best fit curve is $y = 9.475\, e^{0.05896x}$.
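
The linearization procedure for the exponential model can be sketched in Python as below. This is an illustrative sketch: the function name `fit_exponential` is an assumption, and it solves the two normal equations of this example directly for the slope (which equals A) and the intercept (which equals ln C).

```python
from math import exp, log


def fit_exponential(xs, ys):
    """Fit y = C * exp(A * x) by a least-squares line through the points (x, ln y)."""
    n = len(xs)
    big_y = [log(y) for y in ys]  # Y = ln y
    sum_x, sum_y = sum(xs), sum(big_y)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, big_y))
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope = A
    intercept = (sum_y - slope * sum_x) / n                           # intercept = ln C
    return exp(intercept), slope


C, A = fit_exponential([1, 5, 7, 9, 12], [10, 15, 12, 15, 21])
print(round(C, 3), round(A, 4))  # 9.475 0.059
```

Note that this minimizes the squared error in ln y rather than in y itself, which is the usual consequence of fitting through linearization.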

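(2) Power curve $y = bx^{a}$

Taking $\log_{10}$ on both sides,

\[ \log_{10} y = \log_{10} b + a \log_{10} x \]

\[ Y = B + AX, \quad \text{where } Y = \log_{10} y,\; X = \log_{10} x,\; A = a \text{ and } B = \log_{10} b \]

The normal equations are
\[ \sum_{i=1}^{n} Y_i = nB + A \sum_{i=1}^{n} X_i \quad (1) \]
\[ \sum_{i=1}^{n} X_i Y_i = B \sum_{i=1}^{n} X_i + A \sum_{i=1}^{n} X_i^2 \quad (2) \]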

Example: An experiment gave the following values:

v (ft/min)    350   400   500   600
t (min)       61    26    7     2.6

It is known that v and t are connected by the relation $v = bt^{a}$. Find the best possible values
of a and b.

Solution:
Taking $\log_{10}$ on both sides gives Y = B + AX with Y = log v, X = log t, A = a and B = log b.

v      t      Y = log v    X = log t    X²             XY
350    61     2.544068     1.78533      3.18740262     4.542001
400    26     2.60206      1.414973     2.002149575    3.681846
500    7      2.69897      0.845098     0.714190697    2.280894
600    2.6    2.778151     0.414973     0.17220288     1.152859
ΣY_i = 10.62325    ΣX_i = 4.460375    ΣX_i² = 6.075945772    ΣX_i Y_i = 11.6576

Substituting in the normal equations
\[ \sum_{i=1}^{n} Y_i = nB + A \sum_{i=1}^{n} X_i \quad (1) \]
\[ \sum_{i=1}^{n} X_i Y_i = B \sum_{i=1}^{n} X_i + A \sum_{i=1}^{n} X_i^2 \quad (2) \]
we get
10.62325 = 4B + 4.460375A
11.6576 = 4.460375B + 6.075945772A
On solving these equations, A = a ≈ -0.1709 and B ≈ 2.8463.
b = antilog(2.8463) ≈ 702
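
The same linearization, this time with base-10 logarithms of both variables, can be sketched in Python. The function name `fit_power` is an illustrative assumption. Carried at full floating-point precision, the fit gives a ≈ -0.171 and b ≈ 702; hand calculations that round intermediate values can differ slightly in the last digits.

```python
from math import log10


def fit_power(xs, ys):
    """Fit y = b * x**a by a least-squares line through (log10 x, log10 y)."""
    n = len(xs)
    big_x = [log10(x) for x in xs]  # X = log10 x
    big_y = [log10(y) for y in ys]  # Y = log10 y
    sum_x, sum_y = sum(big_x), sum(big_y)
    sum_x2 = sum(x * x for x in big_x)
    sum_xy = sum(x * y for x, y in zip(big_x, big_y))
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope A = a
    big_b = (sum_y - a * sum_x) / n                               # intercept B = log10 b
    return a, 10 ** big_b


t = [61, 26, 7, 2.6]
v = [350, 400, 500, 600]
a, b = fit_power(t, v)
print(round(a, 3))  # -0.171
print(round(b))     # 702
```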

3) The following values of T and l follow the law $T = al^{n}$. Test if this is so and find the best
values of a and n.
T 1.0 1.5 2.0 2.5
L 25 56.2 100 1.56
