Correlation, Regression & Curve Fitting
Bivariate Distribution:
A bivariate distribution is a distribution involving two variables. For example, if we measure the
heights and weights of a certain group of persons, we get a bivariate distribution, with one
variable relating to height and the other relating to weight.
Correlation:
In a bivariate distribution, if a change in one variable is associated with a change in the other
variable, the variables are said to be correlated. If the two variables deviate in the same
direction, i.e., if an increase (or decrease) in one results in a corresponding increase (or
decrease) in the other, correlation is said to be direct or positive. But if they constantly deviate
in opposite directions, i.e., if an increase (or decrease) in one results in a corresponding decrease
(or increase) in the other, correlation is said to be diverse or negative.
For example, the correlation between (i) the heights and weights of a group of persons, (ii) the
income and expenditure is positive and the correlation between (i) price and demand of a
commodity, (ii) the volume and pressure of a perfect gas, is negative.
Scatter Diagram:
For a bivariate distribution $(x_i, y_i)$, $i = 1, 2, \ldots, n$, if the values of the variables $X$ and $Y$ are plotted
along the x-axis and y-axis respectively in the $xy$ plane, the diagram of dots so obtained is
known as a scatter diagram.
If the points are very dense, i.e. very close to each other, we should expect a fairly good amount
of correlation between the variables and if the points are widely scattered, a poor correlation is
expected. This method however is not suitable if the number of observations is fairly large.
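Note:
Correlation coefficient is independent of change of origin and scale:
If $U = \frac{X - a}{h}$ and $V = \frac{Y - b}{k}$ (with $h > 0$, $k > 0$), so that $X = hU + a$ and $Y = kV + b$, then $r(X, Y) = r(U, V)$.
In particular, if $U = X - \bar{x}$ and $V = Y - \bar{y}$,
$$r(X, Y) = \frac{\frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left[\frac{1}{n}\sum_i (x_i - \bar{x})^2 \cdot \frac{1}{n}\sum_i (y_i - \bar{y})^2\right]^{1/2}} = \frac{\sum_i UV}{\left[\sum_i U^2 \sum_i V^2\right]^{1/2}}$$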
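The invariance stated in the note can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `pearson_r`, the data, and the constants `a`, `h`, `b`, `k` are all illustrative assumptions, and the scale factors are taken to be positive.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r(X, Y) computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    u = [x - x_bar for x in xs]  # U = X - x_bar
    v = [y - y_bar for y in ys]  # V = Y - y_bar
    sum_uv = sum(ui * vi for ui, vi in zip(u, v))
    return sum_uv / sqrt(sum(ui * ui for ui in u) * sum(vi * vi for vi in v))

# Illustrative data (an assumption, not taken from the notes)
xs = [65, 66, 67, 67, 68, 69, 70, 72]
ys = [67, 68, 65, 68, 72, 72, 69, 71]

# Hypothetical change of origin and scale: U = (X - a)/h, V = (Y - b)/k, with h, k > 0
a, h, b, k = 60, 2, 50, 5
us = [(x - a) / h for x in xs]
vs = [(y - b) / k for y in ys]

r_xy = pearson_r(xs, ys)
r_uv = pearson_r(us, vs)
print(round(r_xy, 4), round(r_uv, 4))  # the two values agree
```

If `h` and `k` had opposite signs, the two coefficients would differ in sign, which is why the sketch restricts itself to positive scale factors.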
Case a) Assuming that no two individuals are bracketed equal in either classification, each of the
variables 𝑋 and 𝑌 takes the values 1,2, … , 𝑛.
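Then $\bar{x} = \bar{y} = \frac{1}{n}(1 + 2 + 3 + \cdots + n) = \frac{n + 1}{2}$.
$\sigma_X^2 = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2 = \frac{n^2 - 1}{12}$. Similarly, $\sigma_Y^2 = \frac{n^2 - 1}{12} = \sigma_X^2$.
Let $d_i = x_i - y_i$; then the rank correlation coefficient between A and B, $\rho$, is given by:
$$\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$$
which is Spearman's formula for the rank correlation coefficient.
Note: $\sum_i d_i = \sum_i (x_i - y_i) = n(\bar{x} - \bar{y}) = 0$ (since $\bar{x} = \bar{y}$). Use this for verification.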
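As a rough illustration of Spearman's formula for untied ranks, the sketch below computes $\rho$ from two rankings and uses the check that the differences sum to zero. The function name `spearman_rho` and the two rankings are assumptions made for this example.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman's rank correlation for two rankings of 1..n with no ties."""
    n = len(ranks_x)
    d = [x - y for x, y in zip(ranks_x, ranks_y)]
    # Verification step from the note: the differences must sum to zero
    if sum(d) != 0:
        raise ValueError("sum of d_i should be 0; check the ranks")
    return 1 - 6 * sum(di * di for di in d) / (n * (n * n - 1))

# Hypothetical ranks given to 10 individuals by two judges
ranks_x = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
ranks_y = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]

rho = spearman_rho(ranks_x, ranks_y)
print(round(rho, 4))  # sum of d^2 is 200, so rho = 1 - 1200/990
```

Identical rankings give $\rho = 1$ and exactly reversed rankings give $\rho = -1$.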
Case b) Tied Ranks:
If some of the individuals receive the same rank in a ranking of merit, they are said to be tied.
Let us suppose that $m$ of the individuals, say the $(k+1)$th, $(k+2)$th, ..., $(k+m)$th, are tied.
Then each of these $m$ individuals is assigned a common rank, which is the arithmetic mean of
the ranks $k+1, k+2, k+3, \ldots, k+m$.
Suppose that there are $s$ such sets of tied ranks in the X-series, the $i$-th set containing $m_i$ ranks, and define:
$$T_X = \frac{1}{12}\sum_{i=1}^{s} m_i(m_i^2 - 1)$$
Similarly, suppose that there are $t$ such sets of tied ranks in the other series $Y$, the $j$-th set
containing $m'_j$ ranks, and define:
$$T_Y = \frac{1}{12}\sum_{j=1}^{t} m'_j({m'_j}^2 - 1)$$
The rank correlation coefficient between A and B, $\rho$, is given by:
$$\rho = 1 - \frac{6\left(\sum_i d_i^2 + T_X + T_Y\right)}{n(n^2 - 1)}$$
Note: Limits for the rank correlation coefficient are given by $-1 \le \rho \le 1$.
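The tied-rank procedure can be sketched as follows: tied values share the mean of the ranks they occupy, and a correction term $\frac{1}{12} m(m^2 - 1)$ is added for each tied group of size $m$. The helper names (`average_ranks`, `tie_correction`, `spearman_rho_tied`), the convention that rank 1 goes to the largest value, and the marks used as data are all assumptions for illustration.

```python
from collections import Counter

def average_ranks(values):
    """Rank values (1 = largest); tied values share the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    first_pos = {}
    for pos, v in enumerate(ordered, start=1):
        first_pos.setdefault(v, pos)  # position k+1 of each distinct value
    counts = Counter(values)
    # Mean of ranks k+1, ..., k+m is (k+1) + (m-1)/2
    return [first_pos[v] + (counts[v] - 1) / 2 for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over all groups of equal values (untied values add 0)."""
    return sum(m * (m * m - 1) for m in Counter(values).values()) / 12

def spearman_rho_tied(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
    t_x, t_y = tie_correction(xs), tie_correction(ys)
    return 1 - 6 * (sum_d2 + t_x + t_y) / (n * (n * n - 1))

# Hypothetical marks of 10 students in two subjects (ties in both series)
xs = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
ys = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]

rho = spearman_rho_tied(xs, ys)
print(round(rho, 4))
```

In this data the X-series has one tied pair and one tied triple, so $T_X = (6 + 24)/12 = 2.5$, and the Y-series has one tied pair, so $T_Y = 0.5$.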
REGRESSION
The term “regression” literally means “stepping back towards the average”.
Definition:
Regression analysis is a mathematical measure of the average relationship between two or more
variables in terms of the original units of the data.
In regression analysis there are two types of variables. The variable whose value is influenced or
is to be predicted is called the dependent variable, and the variable which influences the values or
is used for prediction is called the independent variable. The independent variable is also known
as the regressor, predictor or explanatory variable, while the dependent variable is also known as
the regressed or explained variable.
Lines of Regression.
If the variables in a bivariate distribution are related, we find that the points in the scatter
diagram will cluster round some curve called the "curve of regression". If the curve is a straight
line, it is called the line of regression and there is said to be linear regression between the
variables otherwise regression is said to be curvilinear.
The line of regression is the line which gives the best estimate of the value of one variable for
any specific value of the other variable. Let us suppose that in the bivariate distribution
$(x_i, y_i)$, $i = 1, 2, \ldots, n$, $Y$ is the dependent variable and $X$ is the independent variable, and it is
required to find the line of regression $Y = a + bX$, where $b$ represents the slope of the line.
The line of regression of $Y$ on $X$ passes through the point $(\bar{x}, \bar{y})$, has slope $b = r\frac{\sigma_Y}{\sigma_X}$ and is given by:
$$Y - \bar{y} = r\frac{\sigma_Y}{\sigma_X}(X - \bar{x})$$
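A short sketch of how the line of regression of $Y$ on $X$ might be computed from the quantities $\bar{x}$, $\bar{y}$, $\sigma_X$, $\sigma_Y$ and $r$. The function name `regression_line_y_on_x` and the data are assumptions for illustration only.

```python
from math import sqrt

def regression_line_y_on_x(xs, ys):
    """Return (intercept, slope) of the line of regression of Y on X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sigma_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    r = cov / (sigma_x * sigma_y)
    slope = r * sigma_y / sigma_x       # b = r * sigma_Y / sigma_X
    intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
    return intercept, slope

# Illustrative data (an assumption, not taken from the notes)
xs = [65, 66, 67, 67, 68, 69, 70, 72]
ys = [67, 68, 65, 68, 72, 72, 69, 71]

intercept, slope = regression_line_y_on_x(xs, ys)
estimate_at_70 = intercept + slope * 70  # estimate of Y for X = 70
print(round(intercept, 3), round(slope, 3), round(estimate_at_70, 3))
```

For this data $\bar{x} = 68$ and $\bar{y} = 69$, so the fitted line satisfies $Y - 69 = \frac{2}{3}(X - 68)$.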
Regression Coefficients:
The slope of the line of regression of 𝑌 on 𝑋 is also called the coefficient of regression of 𝑌 on
𝑋. It represents the increment in the value of dependent variable 𝑌 corresponding to a unit
change in the value of independent variable 𝑋.
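$$b_{YX} = \text{Regression coefficient of } Y \text{ on } X = r\frac{\sigma_Y}{\sigma_X}$$
$$b_{YX} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X^2} \qquad \left(\text{since } r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}\right)$$
Similarly, the coefficient of regression of $X$ on $Y$ indicates the change in the value of variable $X$
corresponding to a unit change in the value of variable $Y$ and is given by
$$b_{XY} = \text{Regression coefficient of } X \text{ on } Y = r\frac{\sigma_X}{\sigma_Y}$$
$$b_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_Y^2} \qquad \left(\text{since } r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}\right)$$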
Note:
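a. Correlation coefficient is the geometric mean between the regression coefficients.
$$b_{YX} \times b_{XY} = r\frac{\sigma_Y}{\sigma_X} \times r\frac{\sigma_X}{\sigma_Y} = r^2 \quad \text{or} \quad r = \pm\sqrt{b_{YX} \times b_{XY}}$$
where the sign of $r$ is the common sign of the two regression coefficients.
b. If one of the regression coefficients is greater than unity, the other must be less than unity.
c. Regression coefficients are independent of the change of origin but not of scale.
If $U = \frac{X - a}{h}$ and $V = \frac{Y - b}{k}$, so that $X = hU + a$ and $Y = kV + b$, then $b_{YX} = \frac{k}{h} b_{VU}$ and
$b_{XY} = \frac{h}{k} b_{UV}$.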
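The three notes above can be checked on a small data set. This is only a sketch: the function name `regression_coefficients`, the data, and the constants `a`, `h`, `b`, `k` are hypothetical.

```python
def regression_coefficients(xs, ys):
    """Return (b_YX, b_XY) = (Cov/Var(X), Cov/Var(Y)) for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    var_y = sum((y - y_bar) ** 2 for y in ys) / n
    return cov / var_x, cov / var_y

# Illustrative data (an assumption, not taken from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]

b_yx, b_xy = regression_coefficients(xs, ys)
r_squared = b_yx * b_xy  # note a: the product of the coefficients is r^2

# Note c: hypothetical change of origin and scale, U = (X - a)/h, V = (Y - b)/k
a, h, b, k = 3, 2, 5, 4
us = [(x - a) / h for x in xs]
vs = [(y - b) / k for y in ys]
b_vu, b_uv = regression_coefficients(us, vs)

print(b_yx, (k / h) * b_vu)  # these agree up to rounding
print(b_xy, (h / k) * b_uv)  # these agree up to rounding
```

Here $b_{YX}$ exceeds unity while $b_{XY}$ is below it, in line with note b.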
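Normal equations for the straight line $Y = a + bX$:
$$\sum Y = na + b\sum X \qquad (1)$$
$$\sum XY = a\sum X + b\sum X^2 \qquad (2)$$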
Note: The calculations are simplified when the central value of $X$ is 0. If it is not, it is
advisable to make it 0 by a change of origin. That is, if the central value of $X$ is $A$:
i. Set $U = X - A$ (change of origin).
ii. Fit the line $Y = a + bU$.
iii. Substitute $U = X - A$ back to obtain the line in terms of $X$.
Change of scale simplifies the calculations even further!
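The steps in the note can be sketched in code: solve the two normal equations for a straight line, once directly and once after shifting the origin to the central value of $X$. The function name `fit_line`, the yearly data, and the choice $A = 2012$ are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Solve the normal equations of Y = a + bX for (a, b) by Cramer's rule."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b

# Hypothetical yearly data; the central value of X is A = 2012
xs = [2010, 2011, 2012, 2013, 2014]
ys = [2, 5, 3, 8, 7]
A = 2012

# i. Change of origin: U = X - A, so that sum(U) = 0
us = [x - A for x in xs]
# ii. Fit Y = a_u + b*U; with sum(U) = 0 the normal equations decouple
a_u, b_shifted = fit_line(us, ys)
# iii. Substitute U = X - A back: Y = (a_u - b*A) + b*X
a_shifted = a_u - b_shifted * A

# Direct fit on the original X values, for comparison
a_direct, b_direct = fit_line(xs, ys)
print(a_shifted, b_shifted)
print(a_direct, b_direct)
```

After the shift, $\sum U = 0$, so $a = \sum Y / n$ and $b = \sum UY / \sum U^2$ can be read off without solving a system.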
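Normal equations for the straight line $X = a + bY$:
$$\sum X = na + b\sum Y \qquad (1)$$
$$\sum XY = a\sum Y + b\sum Y^2 \qquad (2)$$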
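Normal equations for the second-degree parabola $Y = a + b_1 X + b_2 X^2$:
$$\sum Y = na + b_1\sum X + b_2\sum X^2 \qquad (1)$$
$$\sum XY = a\sum X + b_1\sum X^2 + b_2\sum X^3 \qquad (2)$$
$$\sum X^2 Y = a\sum X^2 + b_1\sum X^3 + b_2\sum X^4 \qquad (3)$$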
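The three normal equations for a second-degree parabola $Y = a + b_1 X + b_2 X^2$ form a $3 \times 3$ linear system. Below is a minimal sketch that builds and solves it with a small Gauss-Jordan routine; the helper names `solve_linear_system` and `fit_parabola` are assumptions, and the data was chosen to lie exactly on $Y = 1 + 2X + 3X^2$ so the result is easy to check.

```python
def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_parabola(xs, ys):
    """Return (a, b1, b2) solving the three normal equations."""
    n = len(xs)

    def power_sum(p):
        return sum(x ** p for x in xs)

    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    matrix = [
        [n, power_sum(1), power_sum(2)],
        [power_sum(1), power_sum(2), power_sum(3)],
        [power_sum(2), power_sum(3), power_sum(4)],
    ]
    return solve_linear_system(matrix, [sy, sxy, sx2y])

# Hypothetical data lying exactly on Y = 1 + 2X + 3X^2
xs = [-2, -1, 0, 1, 2]
ys = [9, 2, 1, 6, 17]

a, b1, b2 = fit_parabola(xs, ys)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

Because the $X$ values are centred on 0, the odd power sums vanish and the system partly decouples, which is the simplification that a change of origin provides.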
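Normal equations for the linear form $Y = A + bX$:
$$\sum Y = nA + b\sum X \qquad (1)$$
$$\sum XY = A\sum X + b\sum X^2 \qquad (2)$$
Solve (1) and (2) as simultaneous equations for $A$ and $b$, where $X = \ln x$, $Y = \ln y$ and $A = \ln a$ (thus, $a = e^A$).
Substitute these values of $a$ and $b$ in $y = ax^b$, which is the required curve of best fit.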
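A sketch of the power-curve fit $y = ax^b$ by linearization: take logarithms of both variables, fit a straight line by the normal equations, then recover $a = e^A$. The function names and the data (generated from $y = 2x^{1.5}$) are assumptions, and all $x$ and $y$ values must be positive for the logarithms to exist.

```python
from math import exp, log

def fit_line(xs, ys):
    """Solve the normal equations of Y = A + bX for (A, b)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx
    return (sy * sxx - sx * sxy) / det, (n * sxy - sx * sy) / det

def fit_power_curve(xs, ys):
    """Fit y = a * x**b via ln y = ln a + b ln x (requires x > 0 and y > 0)."""
    log_x = [log(x) for x in xs]  # X = ln x
    log_y = [log(y) for y in ys]  # Y = ln y
    big_a, b = fit_line(log_x, log_y)
    return exp(big_a), b          # a = e^A

# Hypothetical data generated from y = 2 * x**1.5
xs = [1, 2, 3, 4, 5]
ys = [2 * x ** 1.5 for x in xs]

a, b = fit_power_curve(xs, ys)
print(round(a, 6), round(b, 6))
```

The fit minimizes squared errors in $\ln y$, not in $y$ itself, so on noisy data it can differ from a direct least-squares fit of the power curve.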
2. Exponential Curve: $y = ae^{bx}$
Taking natural logarithm on both sides we get:
$\ln y = \ln a + bx$, which has the linear form:
$Y = A + bx$ where $Y = \ln y$ and $A = \ln a$. Use the linear form to fit the curve as above.
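The exponential fit differs from the straight-line case only in that $y$ is replaced by $\ln y$ while $x$ is left unchanged. A minimal sketch, with assumed function names and data generated from $y = 3e^{0.5x}$:

```python
from math import exp, log

def fit_line(xs, ys):
    """Solve the normal equations of Y = A + bx for (A, b)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx
    return (sy * sxx - sx * sxy) / det, (n * sxy - sx * sy) / det

def fit_exponential_curve(xs, ys):
    """Fit y = a * exp(b * x) via ln y = ln a + b x (requires y > 0)."""
    log_y = [log(y) for y in ys]  # Y = ln y; x is used as it is
    big_a, b = fit_line(xs, log_y)
    return exp(big_a), b          # a = e^A

# Hypothetical data generated from y = 3 * exp(0.5 * x)
xs = [0, 1, 2, 3, 4]
ys = [3 * exp(0.5 * x) for x in xs]

a, b = fit_exponential_curve(xs, ys)
print(round(a, 6), round(b, 6))
```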