
Correlation, Regression & Curve Fitting

The document discusses correlation and regression analysis, explaining concepts such as bivariate distribution, correlation coefficients, and the methods for calculating and interpreting these relationships. It covers the Pearson coefficient of correlation, rank correlation coefficient, and the fitting of regression lines, including the necessary equations and conditions for analysis. Additionally, it highlights the importance of understanding the relationship between dependent and independent variables in statistical modeling.


CORRELATION

Bivariate Distribution:
A distribution involving two variables. For example, if we measure the heights and weights of a
certain group of persons, we obtain what is known as a bivariate distribution: one variable
relating to height and the other to weight.

Correlation:
In a bivariate distribution, if a change in one variable is accompanied by a change in the other
variable, the variables are said to be correlated. If the two variables deviate in the same
direction, i.e., if an increase (or decrease) in one results in a corresponding increase (or
decrease) in the other, the correlation is said to be direct or positive. But if they constantly
deviate in opposite directions, i.e., if an increase (or decrease) in one results in a
corresponding decrease (or increase) in the other, the correlation is said to be inverse or negative.

For example, the correlation between (i) the heights and weights of a group of persons, and (ii)
income and expenditure, is positive; the correlation between (i) the price and demand of a
commodity, and (ii) the volume and pressure of a perfect gas, is negative.

Correlation is said to be perfect if the deviation in one variable is followed by a corresponding
and proportional deviation in the other.

Scatter Diagram:
For a bivariate distribution (xᵢ, yᵢ), i = 1, 2, …, n, if the values of the variables X and Y are
plotted along the x-axis and y-axis respectively in the xy-plane, the diagram of dots so obtained
is known as a scatter diagram.

If the points are very dense, i.e., very close to each other, we should expect a fairly good amount
of correlation between the variables; if the points are widely scattered, poor correlation is
expected. This method, however, is not suitable if the number of observations is fairly large.

Karl Pearson Coefficient of Correlation (Product-Moment Correlation Coefficient):


The correlation coefficient between two random variables X and Y, usually denoted by r(X, Y) or
simply r, is a numerical measure of the linear relationship between them and is defined as

r(X, Y) = Cov(X, Y) / (σx σy)

For a bivariate distribution (xᵢ, yᵢ), i = 1, 2, …, n:

Cov(X, Y) = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ) = μ₁₁  or  Cov(X, Y) = (1/n) Σᵢ xᵢyᵢ − x̄ȳ

σx² = (1/n) Σᵢ xᵢ² − x̄²  and  σy² = (1/n) Σᵢ yᵢ² − ȳ²

Thus

r(X, Y) = [(1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)] / √{[(1/n) Σᵢ (xᵢ − x̄)²][(1/n) Σᵢ (yᵢ − ȳ)²]}

or

r(X, Y) = [(1/n) Σᵢ xᵢyᵢ − x̄ȳ] / √{[(1/n) Σᵢ xᵢ² − x̄²][(1/n) Σᵢ yᵢ² − ȳ²]}
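As a concrete illustration, the computational form of the formula can be sketched in Python. The data below is invented, chosen so that y is an exact linear function of x and r comes out as +1.

```python
import math

# Invented data: y = 2x exactly, so r should be +1.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Computational forms: Cov = (1/n)Σxy − x̄ȳ, σ² = (1/n)Σx² − x̄²
cov = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y
var_x = sum(a * a for a in x) / n - mean_x ** 2
var_y = sum(b * b for b in y) / n - mean_y ** 2

r = cov / math.sqrt(var_x * var_y)
print(r)  # → 1.0 (perfect positive correlation)
```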

Limits for Karl Pearson Coefficient of Correlation:


Correlation Coefficient cannot exceed unity numerically. It always lies between −1 and + 1
{Use for verification}. If 𝑟 = + 1, the correlation is perfect and positive and if 𝑟 = −1,
correlation is perfect and negative.

Note:
The correlation coefficient is independent of change of origin and scale:
If U = (X − a)/h and V = (Y − b)/k, so that X = hU + a and Y = kV + b, then r(X, Y) = r(U, V).

In particular, if U = X − x̄ and V = Y − ȳ, then

r(X, Y) = [(1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)] / √{[(1/n) Σᵢ (xᵢ − x̄)²][(1/n) Σᵢ (yᵢ − ȳ)²]} = Σᵢ UᵢVᵢ / √(Σᵢ Uᵢ² · Σᵢ Vᵢ²)
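A small numerical check of this invariance property, with invented data: shifting and rescaling X and Y (with h, k > 0) leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Product-moment r via the computational formula."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum(a * b for a, b in zip(xs, ys)) / n - mx * my
    vx = sum(a * a for a in xs) / n - mx ** 2
    vy = sum(b * b for b in ys) / n - my ** 2
    return cov / math.sqrt(vx * vy)

x = [3, 7, 4, 9, 6]
y = [10, 22, 15, 27, 20]
u = [(xi - 5) / 2 for xi in x]      # U = (X - a)/h with a = 5, h = 2
v = [(yi - 12) / 4 for yi in y]     # V = (Y - b)/k with b = 12, k = 4

print(abs(pearson_r(x, y) - pearson_r(u, v)) < 1e-9)   # → True
```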

RANK CORRELATION COEFFICIENT


A group of n individuals is arranged in order of merit or proficiency in possession of two
characteristics A and B. Let (xᵢ, yᵢ), i = 1, 2, …, n be the ranks of the i-th individual in the two
characteristics A and B respectively. The Pearson coefficient of correlation between the ranks xᵢ's
and yᵢ's is called the rank correlation coefficient between A and B for that group of individuals.

Case a) Assuming that no two individuals are bracketed equal in either classification, each of the
variables X and Y takes the values 1, 2, …, n.
Then x̄ = ȳ = (1/n)(1 + 2 + 3 + ⋯ + n) = (n + 1)/2.

σx² = (1/n) Σᵢ xᵢ² − x̄² = (n² − 1)/12. Similarly, σy² = (n² − 1)/12 = σx².

Let dᵢ = xᵢ − yᵢ. Then the rank correlation coefficient between A and B, ρ, is given by:

ρ = 1 − (6 Σᵢ dᵢ²) / (n(n² − 1))

which is Spearman's formula for the rank correlation coefficient.
Note: Σᵢ dᵢ = Σᵢ (xᵢ − yᵢ) = n(x̄ − ȳ) = 0 (∵ x̄ = ȳ) {Use for verification}
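A minimal sketch of Spearman's formula on invented, untied ranks; it also checks the verification identity Σᵢ dᵢ = 0 from the note.

```python
# Invented rank data with no ties.
x_ranks = [1, 2, 3, 4, 5, 6]
y_ranks = [3, 1, 2, 5, 6, 4]

n = len(x_ranks)
d = [a - b for a, b in zip(x_ranks, y_ranks)]
print(sum(d))       # → 0, as the verification note predicts

rho = 1 - 6 * sum(di * di for di in d) / (n * (n * n - 1))
print(rho)          # Σd² = 12, so ρ = 1 − 72/210 = 23/35 ≈ 0.657
```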
Case b) Tied Ranks:
If some of the individuals receive the same rank in a ranking of merit, they are said to be tied.
Let us suppose that m of the individuals, say the (k+1)th, (k+2)th, …, (k+m)th, are tied.
Then each of these m individuals is assigned a common rank, which is the arithmetic mean of
the ranks k + 1, k + 2, k + 3, …, k + m.
Suppose that there are s such sets of tied ranks in the X-series, the j-th set containing mⱼ tied
individuals; define:

Tx = (1/12) Σⱼ mⱼ(mⱼ² − 1)

Similarly, suppose that there are t such sets of tied ranks with respect to the other series Y;
define:

Ty = (1/12) Σⱼ mⱼ(mⱼ² − 1)

The rank correlation coefficient between A and B, ρ, is then given by:

ρ = 1 − 6(Σᵢ dᵢ² + Tx + Ty) / (n(n² − 1))

Note: Limits for the rank correlation coefficient are given by −1 ≤ ρ ≤ 1.
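The tied-rank procedure can be sketched end to end: tied observations share the mean of the rank positions they span, each tied group of size m contributes m(m² − 1)/12 to its series' correction term, and the corrected formula is then applied. The data and helper names below are invented for illustration.

```python
from collections import Counter

def average_ranks(values):
    """Assign ranks 1..n, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2       # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def tie_correction(values):
    """Sum of m(m² − 1)/12 over every tied group of size m > 1."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

x = [10, 20, 20, 30, 40]      # one tie of size 2
y = [15, 25, 35, 35, 45]      # one tie of size 2
rx, ry = average_ranks(x), average_ranks(y)   # [1, 2.5, 2.5, 4, 5] and [1, 2, 3.5, 3.5, 5]

n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
Tx, Ty = tie_correction(x), tie_correction(y)
rho = 1 - 6 * (d2 + Tx + Ty) / (n * (n * n - 1))
print(rho)  # → 0.875
```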

REGRESSION
The term “regression” literally means “stepping back towards the average”.

Definition:
Regression analysis is a mathematical measure of the average relationship between two or more
variables in terms of the original units of the data.

In regression analysis there are two types of variables. The variable whose value is influenced or
is to be predicted is called dependent variable and the variable which influences the values or is
used for prediction is called independent variable. In regression analysis independent variable is
also known as regressor or predictor or explanatory variable while the dependent variable is
also known as regressed or explained variable.

Lines of Regression.
If the variables in a bivariate distribution are related, we find that the points in the scatter
diagram will cluster round some curve called the "curve of regression". If the curve is a straight
line, it is called the line of regression and there is said to be linear regression between the
variables otherwise regression is said to be curvilinear.
The line of regression is the line which gives the best estimate of the value of one variable for
any specified value of the other variable. Let us suppose that in the bivariate distribution
(xᵢ, yᵢ), i = 1, 2, …, n, Y is the dependent variable and X is the independent variable, and it is
required to find the line of regression Y = a + bX, where b represents the slope of the line.

The line of regression of Y on X passes through the point (x̄, ȳ), has slope b = r σy/σx, and is
given by:

Y − ȳ = r (σy/σx) (X − x̄)

Similarly, the equation of the line of regression of X on Y (X = a + bY) is given by:

X − x̄ = r (σx/σy) (Y − ȳ)

Note:
The values (x̄, ȳ) can be obtained as the point of intersection of the two regression lines.
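The note above can be verified numerically: with invented data, writing both regression lines in the form aX + bY = c and intersecting them by Cramer's rule recovers exactly (x̄, ȳ).

```python
# Invented data; both regression lines should intersect at (x̄, ȳ) = (3, 4).
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
vx = sum(a * a for a in x) / n - mx ** 2
vy = sum(b * b for b in y) / n - my ** 2

b_yx = cov / vx      # slope of the line of Y on X  (= r·σy/σx)
b_xy = cov / vy      # slope of the line of X on Y  (= r·σx/σy)

# Line of Y on X:  b_yx·X − Y = b_yx·x̄ − ȳ
# Line of X on Y:  X − b_xy·Y = x̄ − b_xy·ȳ
a1, c1 = b_yx, b_yx * mx - my
a2, c2 = 1.0, mx - b_xy * my
det = a1 * (-b_xy) - a2 * (-1.0)          # = 1 − b_yx·b_xy = 1 − r²
X_int = (c1 * (-b_xy) - c2 * (-1.0)) / det
Y_int = (a1 * c2 - a2 * c1) / det
print(round(X_int, 9), round(Y_int, 9))   # → 3.0 4.0
```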

Regression Coefficients:
The slope of the line of regression of Y on X is also called the coefficient of regression of Y on
X. It represents the increment in the value of the dependent variable Y corresponding to a unit
change in the value of the independent variable X.

b_YX = regression coefficient of Y on X = r σy/σx

b_YX = Cov(X, Y)/σx²  (since r = Cov(X, Y)/(σx σy))

Similarly, the coefficient of regression of X on Y indicates the change in the value of variable X
corresponding to a unit change in the value of variable Y and is given by

b_XY = regression coefficient of X on Y = r σx/σy

b_XY = Cov(X, Y)/σy²  (since r = Cov(X, Y)/(σx σy))

Note:
a. The correlation coefficient is the geometric mean between the regression coefficients:
b_YX × b_XY = r(σy/σx) × r(σx/σy) = r²  or  r = ±√(b_YX × b_XY)

b. If one of the regression coefficients is greater than unity, the other must be less than unity.
c. Regression coefficients are independent of the change of origin but not of scale.
If U = (X − a)/h and V = (Y − b)/k, so that X = hU + a and Y = kV + b, then b_YX = (k/h) b_VU
and b_XY = (h/k) b_UV.
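Notes a and b can be checked on a small invented data set: the product of the two regression coefficients equals r², and when one coefficient exceeds unity in magnitude the other falls below it.

```python
import math

# Invented data with a negative association.
x = [2, 4, 6, 8]
y = [5, 3, 4, 1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
vx = sum(a * a for a in x) / n - mx ** 2
vy = sum(b * b for b in y) / n - my ** 2

r = cov / math.sqrt(vx * vy)
b_yx = cov / vx       # here ≈ −0.55  (numerically < 1)
b_xy = cov / vy       # here ≈ −1.257 (numerically > 1)

print(abs(b_yx * b_xy - r * r) < 1e-12)   # → True: the product equals r²
```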

FITTING OF A STRAIGHT LINE


Suppose it is required to fit a regression line Y = a + bX to a given set of observations
(xᵢ, yᵢ), i = 1, 2, …, n. The normal equations are given by:

Σ Y = na + b Σ X   (1)

Σ XY = a Σ X + b Σ X²   (2)

Solve (1) and (2) as simultaneous equations for a and b.

Substitute the values of a and b in Y = a + bX, which is the required line of best fit.

Note: The calculations get simplified when the central value of 𝑋 is 0. It is therefore advisable to
make the central value 0, if it is not so. i.e., if the central value of 𝑋 is 𝐴,
i. Set 𝑈 = 𝑋 − 𝐴 (Change of origin)
ii. Fit the curve 𝑌 = a + 𝑏𝑈
iii. Resubstitute the value of U to obtain the line 𝑌 = 𝑎 + 𝑏𝑋.
Change of scale simplifies the calculations even further!
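The two normal equations form a 2×2 linear system that can be solved directly (here by Cramer's rule); the invented data lies exactly on y = 1 + 2x, so the fit should recover a = 1 and b = 2.

```python
# Invented data lying exactly on y = 1 + 2x.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]

n = len(x)
Sx, Sy = sum(x), sum(y)
Sxx = sum(a * a for a in x)
Sxy = sum(a * b for a, b in zip(x, y))

# Normal equations:  Sy = n·a + b·Sx   and   Sxy = a·Sx + b·Sxx
det = n * Sxx - Sx * Sx
a = (Sy * Sxx - Sxy * Sx) / det
b = (n * Sxy - Sx * Sy) / det
print(a, b)  # → 1.0 2.0
```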

To fit the regression line X = a + bY to a given set of observations (xᵢ, yᵢ), i = 1, 2, …, n, the
normal equations are given by:

Σ X = na + b Σ Y   (1)

Σ XY = a Σ Y + b Σ Y²   (2)

Solve (1) and (2) as simultaneous equations for a and b.

Substitute the values of a and b in X = a + bY, which is the required line of best fit.
FITTING OF A SECOND-DEGREE PARABOLA
Suppose it is required to fit a parabola Y = a + b₁X + b₂X² to a given set of observations
(xᵢ, yᵢ), i = 1, 2, …, n. The normal equations are given by:

Σ Y = na + b₁ Σ X + b₂ Σ X²   (1)

Σ XY = a Σ X + b₁ Σ X² + b₂ Σ X³   (2)

Σ X²Y = a Σ X² + b₁ Σ X³ + b₂ Σ X⁴   (3)

Solve (1), (2) and (3) as simultaneous equations for a, b₁ and b₂.

Substitute the values of a, b₁ and b₂ in Y = a + b₁X + b₂X², which is the required curve of best
fit.
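The three normal equations form a 3×3 linear system; a sketch with a small hand-rolled Gaussian-elimination solver and invented data lying exactly on y = 1 + x + 2x² (the centred x values keep the odd power sums at zero, which also illustrates why a change of origin simplifies the work):

```python
# Invented data lying exactly on y = 1 + x + 2x².
x = [-2, -1, 0, 1, 2]
y = [7, 2, 1, 4, 11]

def solve3(M, v):
    """Gaussian elimination with partial pivoting for a 3x3 system M·s = v."""
    A = [row[:] + [rhs] for row, rhs in zip(M, v)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    sol = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        sol[r] = (A[r][3] - sum(A[r][c] * sol[c] for c in range(r + 1, 3))) / A[r][r]
    return sol

S = [sum(xi ** k for xi in x) for k in range(5)]   # power sums: S[k] = Σ x^k, S[0] = n
Sy = sum(y)
Sxy = sum(a * b for a, b in zip(x, y))
Sx2y = sum(a * a * b for a, b in zip(x, y))

# The three normal equations in matrix form.
M = [[S[0], S[1], S[2]],
     [S[1], S[2], S[3]],
     [S[2], S[3], S[4]]]
a, b1, b2 = solve3(M, [Sy, Sxy, Sx2y])
print(a, b1, b2)  # recovers a = 1, b1 = 1, b2 = 2
```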

Note: Change of origin and scale simplifies calculations.


Substitute U = (X − A)/h, fit the curve Y = a + b₁U + b₂U², and then resubstitute the value of U to
obtain the required parabola Y = a + b₁X + b₂X².

FITTING OF CURVES OF OTHER TYPES


1. Power Curve: y = axᵇ
Taking natural logarithm on both sides we get:
ln y = ln a + b ln x, which has the linear form:
Y = A + bX, where Y = ln y, A = ln a, and X = ln x

Normal equations are given by:

Σ Y = nA + b Σ X   (1)

Σ XY = A Σ X + b Σ X²   (2)

Solve (1) and (2) as simultaneous equations for A and b, where A = ln a (thus, a = eᴬ).
Substitute these values of a and b in y = axᵇ, which is the required curve of best fit.
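A sketch of the linearisation: invented data lying exactly on y = 3x², so the log-log fit should recover a ≈ 3 and b ≈ 2.

```python
import math

# Invented data lying exactly on y = 3·x².
x = [1, 2, 4, 8]
y = [3, 12, 48, 192]

# Linearised variables: Y = ln y, X = ln x.
X = [math.log(v) for v in x]
Y = [math.log(v) for v in y]

n = len(X)
SX, SY = sum(X), sum(Y)
SXX = sum(v * v for v in X)
SXY = sum(u * v for u, v in zip(X, Y))

det = n * SXX - SX * SX
A = (SY * SXX - SXY * SX) / det   # A = ln a
b = (n * SXY - SX * SY) / det
a = math.exp(A)
print(round(a, 6), round(b, 6))   # → 3.0 2.0
```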

2. Exponential Curve: y = aeᵇˣ
Taking natural logarithm on both sides we get:
ln y = ln a + bx, which has the linear form:
Y = A + bx, where Y = ln y and A = ln a. Use the linear form to fit the curve as above.
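The same pattern for the exponential curve, with invented data lying exactly on y = 2e^(0.5x): only ln y is transformed, and x stays as-is.

```python
import math

# Invented data lying exactly on y = 2·e^(0.5x).
x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]

Ylog = [math.log(v) for v in y]        # Y = ln y = ln a + b·x
n = len(x)
Sx, SY = sum(x), sum(Ylog)
Sxx = sum(v * v for v in x)
SxY = sum(u * v for u, v in zip(x, Ylog))

det = n * Sxx - Sx * Sx
A = (SY * Sxx - SxY * Sx) / det        # A = ln a
b = (n * SxY - Sx * SY) / det
a = math.exp(A)
print(round(a, 6), round(b, 6))        # → 2.0 0.5
```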
