Correlation and Regression
CORRELATION:
Correlation is the relationship between two or more
variables. Examples include the relationship between
cost and price, between demand and supply, between
distance and velocity, and between crop production and
factors such as soil fertility, amount of rainfall and
relative humidity. There are three types of correlation:
(i) Simple correlation,
(ii) Partial correlation and
(iii) Multiple correlation.
SIMPLE CORRELATION:
The relationship between two variables is called simple
correlation or linear correlation.
The numerical measure of the strength (degree) of the
relationship between two variables is known as the
simple correlation coefficient. If x and y are two
variables, the simple correlation coefficient between
them is denoted by $r_{xy}$. The variables x and y are
interchangeable: if one is treated as the dependent
variable, the other is the independent variable. The
simple correlation coefficient can be computed from any
of the following equivalent formulas:
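1. $r_{xy} = \dfrac{s_{xy}}{\sqrt{s_{xx}}\,\sqrt{s_{yy}}}$, where
   $s_{xy} = \sum (x-\bar{x})(y-\bar{y}) = \sum xy - \dfrac{\sum x \sum y}{n}$,
   $s_{xx} = \sum (x-\bar{x})^2 = \sum x^2 - \dfrac{(\sum x)^2}{n}$ and
   $s_{yy} = \sum (y-\bar{y})^2 = \sum y^2 - \dfrac{(\sum y)^2}{n}$
2. $r_{xy} = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2}\,\sqrt{\sum (y-\bar{y})^2}}$
3. $r_{xy} = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
4. $r_{xy} = \dfrac{n\sum uv - \sum u \sum v}{\sqrt{n\sum u^2 - (\sum u)^2}\,\sqrt{n\sum v^2 - (\sum v)^2}}$, where $u = \dfrac{x-a}{h}$ and $v = \dfrac{y-a}{h}$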
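The formulas above are algebraically equivalent, which can be checked numerically. The following is a minimal sketch in Python, using the nine observations (1,4), (2,8), ..., (9,18) from the exercise later in these notes; the function names are illustrative and not part of the original notes.

```python
from math import sqrt

def sums_of_squares(xs, ys):
    """Return (s_xy, s_xx, s_yy) using the shortcut forms of formula 1."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy, s_xx, s_yy

def corr_formula_1(xs, ys):
    """r = s_xy / (sqrt(s_xx) * sqrt(s_yy))."""
    s_xy, s_xx, s_yy = sums_of_squares(xs, ys)
    return s_xy / (sqrt(s_xx) * sqrt(s_yy))

def corr_formula_3(xs, ys):
    """r computed from raw sums only (formula 3)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

# Observations (x_i, y_i) taken from the regression exercise in these notes.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [4, 8, 2, 12, 10, 14, 16, 6, 18]

print(sums_of_squares(xs, ys))   # s_xy, s_xx, s_yy
print(corr_formula_1(xs, ys))
print(corr_formula_3(xs, ys))
```

For this data s_xy = 80, s_xx = 60 and s_yy = 240, so both functions give r ≈ 0.667.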
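Theorem:
Prove that the simple correlation coefficient always lies
between -1 and +1, i.e. $-1 \le r_{xy} \le 1$.
Proof:
We know that
$s_x^2 = \dfrac{\sum (x-\bar{x})^2}{n-1}$  ∴ $\sum (x-\bar{x})^2 = (n-1)\,s_x^2$
$s_y^2 = \dfrac{\sum (y-\bar{y})^2}{n-1}$  ∴ $\sum (y-\bar{y})^2 = (n-1)\,s_y^2$
$r_{xy} = \dfrac{1}{n-1}\,\dfrac{\sum (x-\bar{x})(y-\bar{y})}{s_x s_y}$  ∴ $\sum (x-\bar{x})(y-\bar{y}) = (n-1)\,r_{xy}\,s_x s_y$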
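The proof as given stops after these three identities. One standard way to finish it (a sketch, not necessarily the argument the original notes intended) is to expand a sum of squares of standardized deviations, which can never be negative, and substitute the three identities:

```latex
\begin{align*}
0 &\le \sum \left( \frac{x-\bar{x}}{s_x} \pm \frac{y-\bar{y}}{s_y} \right)^2 \\
  &= \frac{\sum (x-\bar{x})^2}{s_x^2} + \frac{\sum (y-\bar{y})^2}{s_y^2}
     \pm \frac{2\sum (x-\bar{x})(y-\bar{y})}{s_x s_y} \\
  &= (n-1) + (n-1) \pm 2(n-1)\,r_{xy} \\
  &= 2(n-1)\,(1 \pm r_{xy}).
\end{align*}
```

Since n > 1, this gives both $1 + r_{xy} \ge 0$ and $1 - r_{xy} \ge 0$, hence $-1 \le r_{xy} \le 1$.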
6.
The simple correlation coefficient is the geometric mean of
the two regression coefficients, i.e. $r_{xy} = \pm\sqrt{b_{xy}\,b_{yx}}$,
taking the common sign of $b_{xy}$ and $b_{yx}$.
INTERPRETING THE LINEAR CORRELATION COEFFICIENT
The value of $r_{xy}$ always falls between -1 and +1
inclusive. If r is close to zero, we conclude that there is
no significant linear correlation between x and y; if it
is close to -1 or +1, we conclude that there is a
significant linear correlation between x and y.
COEFFICIENT OF LINEAR DETERMINATION:
If $r_{xy}$ is the linear correlation coefficient between two
variables x and y, then $r_{xy}^2$ is the coefficient of
determination. It is used to interpret the value of the
correlation coefficient: it gives the proportion of the
variation in one variable that is explained by the other
variable. For example, if $r_{xy} = 0.6$ then
$r_{xy}^2 = 0.36 = 36\%$, which means that 36% of the
variation in one variable is explained by the other variable.
PARTIAL CORRELATION
Partial correlation is the relationship between two
variables, one dependent and one independent, among
three or more variables when the remaining independent
variables are held constant.
The numerical measure of the strength of the relationship
between a dependent variable and an independent
variable, with the remaining independent variables held
constant, is known as the partial correlation coefficient.
For example, the relationship between the quantity of
crops produced and the fertility of the soil, with the
amount of rainfall, quality of seeds, etc. held constant,
is a partial correlation.
If x₁ is a dependent variable and x₂ and x₃ are
independent variables, the partial correlation coefficient
between x₁ and x₂ with x₃ held constant is denoted by
r₁₂.₃ and is given by:
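$r_{12.3} = \dfrac{r_{12} - r_{13}\,r_{23}}{\sqrt{1-r_{13}^2}\,\sqrt{1-r_{23}^2}}$,
Similarly,
$r_{13.2} = \dfrac{r_{13} - r_{12}\,r_{32}}{\sqrt{1-r_{12}^2}\,\sqrt{1-r_{32}^2}}$ and
$r_{23.1} = \dfrac{r_{23} - r_{21}\,r_{31}}{\sqrt{1-r_{21}^2}\,\sqrt{1-r_{31}^2}}$.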
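All three partial correlation formulas have the same shape, so a single helper covers them. The sketch below is illustrative only: the function name is not from the notes, and the three simple correlations are made-up values chosen to show the arithmetic.

```python
from math import sqrt

def partial_corr(r_ab, r_ac, r_bc):
    """Partial correlation r_ab.c: correlation of a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / (sqrt(1 - r_ac ** 2) * sqrt(1 - r_bc ** 2))

# Hypothetical simple correlations (not data from the notes).
r12, r13, r23 = 0.8, 0.6, 0.5

r12_3 = partial_corr(r12, r13, r23)  # x1 and x2, x3 held constant
r13_2 = partial_corr(r13, r12, r23)  # x1 and x3, x2 held constant
r23_1 = partial_corr(r23, r12, r13)  # x2 and x3, x1 held constant

print(r12_3, r13_2, r23_1)
```

Because simple correlations are symmetric (r₃₂ = r₂₃, r₂₁ = r₁₂, r₃₁ = r₁₃), only three input values are needed. With these inputs r₁₂.₃ = 0.5 / (0.8 × √0.75) ≈ 0.722.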
MULTIPLE CORRELATION
Multiple correlation is the relationship between a
dependent variable and two or more independent
variables, in which the independent variables are
considered jointly.
The numerical measure of the strength of the relationship
between a dependent variable and two or more
independent variables taken together is known as the
multiple correlation coefficient. The multiple
correlation coefficient between a dependent variable x₁
and independent variables x₂ and x₃ is denoted by R₁.₂₃
and is given by:
$R_{1.23} = \sqrt{\dfrac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}}$
Similarly,
$R_{2.31} = \sqrt{\dfrac{r_{23}^2 + r_{21}^2 - 2\,r_{23}\,r_{21}\,r_{31}}{1 - r_{13}^2}}$
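The multiple correlation formula can be evaluated the same way. This is a minimal sketch with an illustrative function name and made-up simple correlations; it is not taken from the notes.

```python
from math import sqrt

def multiple_corr(r_ab, r_ac, r_bc):
    """Multiple correlation R_a.bc of a on b and c jointly."""
    return sqrt((r_ab ** 2 + r_ac ** 2 - 2 * r_ab * r_ac * r_bc) / (1 - r_bc ** 2))

# Hypothetical simple correlations (not data from the notes).
r12, r13, r23 = 0.8, 0.6, 0.5

R1_23 = multiple_corr(r12, r13, r23)  # x1 on x2 and x3
R2_31 = multiple_corr(r23, r12, r13)  # x2 on x3 and x1

print(R1_23, R2_31)
```

Here R₁.₂₃ = √(0.52 / 0.75) ≈ 0.833, which is at least as large as either |r₁₂| or |r₁₃|, as expected when a second predictor is added.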
SIMPLE REGRESSION:
What is Simple Linear Regression?
Simple linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables:
One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
Because the other terms are used less frequently today, we'll use the "predictor" and "response"
terms to refer to the variables encountered in this course. The other terms are mentioned only to
make you aware of them should you encounter them. Simple linear regression gets its adjective
"simple," because it concerns the study of only one predictor variable. In contrast, multiple linear
regression, which we study later in this course, gets its adjective "multiple," because it concerns the
study of two or more predictor variables.
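METHOD OF LEAST SQUARES:
Regression Coefficient
Definition: The Regression Coefficient is the constant ‘b’ in the regression equation y = a + bx. It gives
the change in the value of the dependent variable corresponding to a unit change in the
independent variable.
If there are two regression equations (y on x, and x on y), then there will be two regression coefficients, $b_{yx}$ and $b_{xy}$.
When the deviations are taken from the actual means, the following formulas are used:
$b_{yx} = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = \dfrac{s_{xy}}{s_{xx}}$ and $b_{xy} = \dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2} = \dfrac{s_{xy}}{s_{yy}}$
When the deviations $d_x = x - A$ and $d_y = y - B$ are taken from assumed means A and B, $b_{yx}$ can be calculated by using the following formula:
$b_{yx} = \dfrac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_x^2 - (\sum d_x)^2}$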
1. The correlation coefficient is the geometric mean of the two regression coefficients. Symbolically, it can be
expressed as: $r = \pm\sqrt{b_{xy}\,b_{yx}}$, with the sign of the regression coefficients.
2. The absolute value of the coefficient of correlation cannot exceed unity, i.e. 1. Therefore, if one of the regression
coefficients is greater than unity in absolute value, the other must be less than unity in absolute value.
3. Both regression coefficients have the same sign, i.e. they are either both positive or both negative.
Thus, it is not possible for one regression coefficient to be negative while the other is positive.
4. The coefficient of correlation has the same sign as the regression coefficients: if
the regression coefficients are positive, then “r” is positive, and vice versa.
5. The arithmetic mean of the two regression coefficients is, in absolute value, greater than or equal to the absolute value of the correlation coefficient.
All these properties should be kept in mind when solving for the regression coefficients.
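These properties can be checked numerically. The sketch below reuses the nine observations (1,4), (2,8), ..., (9,18) from the exercise later in these notes; the function name is illustrative, not from the original.

```python
from math import sqrt, copysign

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) computed from s_xy, s_xx and s_yy."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    b_yx = s_xy / s_xx               # slope of the regression line of y on x
    b_xy = s_xy / s_yy               # slope of the regression line of x on y
    r = s_xy / sqrt(s_xx * s_yy)     # simple correlation coefficient
    return b_yx, b_xy, r

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [4, 8, 2, 12, 10, 14, 16, 6, 18]
b_yx, b_xy, r = regression_coefficients(xs, ys)

# Property 1: r is the geometric mean of the two coefficients, with their sign.
r_from_b = copysign(sqrt(b_yx * b_xy), b_yx)
# Property 5: the mean of the coefficients is at least |r| in absolute value.
mean_b = (b_yx + b_xy) / 2

print(b_yx, b_xy, r, r_from_b, mean_b)
```

For this data b_yx = 80/60 = 4/3 and b_xy = 80/240 = 1/3, so √(b_yx · b_xy) = 2/3 = r, while the mean of the two coefficients is 5/6, which is larger than r.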
MULTIPLE REGRESSION PLANE:
If x, y and z are three variables, then the regression
plane of y on x and z is given by
y = a + bx + cz ……………..(i)
The normal equations of (i) are
$\sum y = na + b\sum x + c\sum z$ …………..(ii)
$\sum xy = a\sum x + b\sum x^2 + c\sum xz$ …………..(iii)
$\sum yz = a\sum z + b\sum xz + c\sum z^2$ …………..(iv)
Solving equations (ii), (iii) and (iv) gives the values of
a, b and c. Substituting these values into equation (i)
gives the regression plane of y on x and z.
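The three normal equations form a 3×3 linear system in a, b and c. The following is a minimal sketch that solves it with Cramer's rule in plain Python. The function names are illustrative, and the check data is made up: the y values are generated exactly from the plane y = 1 + 2x + 3z so that the recovered coefficients can be verified.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_plane(xs, ys, zs):
    """Solve normal equations (ii)-(iv) for (a, b, c) in y = a + b*x + c*z."""
    n = len(xs)
    sx, sy, sz = sum(xs), sum(ys), sum(zs)
    sxx = sum(x * x for x in xs)
    szz = sum(z * z for z in zs)
    sxz = sum(x * z for x, z in zip(xs, zs))
    sxy = sum(x * y for x, y in zip(xs, ys))
    syz = sum(y * z for y, z in zip(ys, zs))
    # Coefficient matrix and right-hand side, in the order (ii), (iii), (iv).
    lhs = [[n, sx, sz], [sx, sxx, sxz], [sz, sxz, szz]]
    rhs = [sy, sxy, syz]
    d = det3(lhs)
    solution = []
    for j in range(3):
        # Cramer's rule: replace column j with the right-hand side.
        mj = [row[:] for row in lhs]
        for i in range(3):
            mj[i][j] = rhs[i]
        solution.append(det3(mj) / d)
    return tuple(solution)

# Made-up check data lying exactly on y = 1 + 2x + 3z.
xs = [0, 1, 2, 3, 4]
zs = [1, 0, 2, 1, 3]
ys = [1 + 2 * x + 3 * z for x, z in zip(xs, zs)]

a, b, c = fit_plane(xs, ys, zs)
print(a, b, c)
```

Cramer's rule is used here only because the system is small; for larger models a general linear solver would be the usual choice.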
INFERENCE CONCERNING LEAST SQUARE METHOD:
The regression equation y = a + bx is obtained from
sample data. We are often interested in the
corresponding equation y = α + βx for the population
from which the sample is drawn. The following tests
assume a normal population.
A TEST OF HYPOTHESIS CONCERNING THE SLOPE
PARAMETER β.
To test the hypothesis that the regression coefficient β
is equal to some specified value, we use the test statistic
$t = \dfrac{b-\beta}{s_e/\sqrt{s_{xx}}}$ with n-2 degrees of freedom,
where b is the slope estimated from the sample and β is
the value specified by the hypothesis.
Similarly, the test statistic for inference about the intercept α is
$t = \dfrac{a-\alpha}{s_e}\sqrt{\dfrac{n\,s_{xx}}{s_{xx}+n\bar{x}^2}}$ with n-2 degrees of freedom,
where $s_e = \sqrt{\dfrac{s_{yy} - \dfrac{(s_{xy})^2}{s_{xx}}}{n-2}}$
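Now;
$s_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 532000 - \dfrac{(2000)^2}{10} = 132000$
$s_{yy} = \sum y^2 - \dfrac{(\sum y)^2}{n} = 9.1097 - \dfrac{(8.35)^2}{10} = 2.13745$
$s_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 2175.40 - \dfrac{(2000)(8.35)}{10} = 505.40$
$\bar{x} = \dfrac{\sum x}{n} = 200$
$\bar{y} = \dfrac{\sum y}{n} = 0.835$
$b = b_{yx} = \dfrac{s_{xy}}{s_{xx}} = \dfrac{505.40}{132000} = 0.00383$
$a = \bar{y} - b\bar{x} = 0.835 - (0.00383)(200) = 0.069$
$s_e = \sqrt{\dfrac{s_{yy} - \dfrac{(s_{xy})^2}{s_{xx}}}{n-2}} = \sqrt{\dfrac{2.13745 - \dfrac{(505.40)^2}{132000}}{10-2}} = \sqrt{0.0253} = 0.159$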
(1-𝜶)100%=95% ∴ 𝜶=0.05
(i) The equation of the straight line that best fits
the given data in the sense of least squares is
y = a + bx = 0.069 + 0.00383x
∴ y = 0.069 + 0.00383x
When x = 190 cm/sec then
y = 0.069 + (0.00383)(190) = 0.80 mm²/sec
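(ii) 95% confidence intervals for the slope β and the
intercept α.
For intercept α:
C.I. = $a \pm t_{\alpha/2,\,n-2} \times s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{s_{xx}}}$
= $0.069 \pm (2.306)(0.159) \times \sqrt{\dfrac{1}{10} + \dfrac{(200)^2}{132000}}$
= 0.069 ± 0.233
= (-0.164, 0.302)
For slope β:
C.I. = $b \pm t_{\alpha/2,\,n-2} \times s_e\sqrt{\dfrac{1}{s_{xx}}}$
=(…. , ……)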
(iii) A test of hypothesis concerning the slope, β = 0.
STEP I:
H0: β = 0
H1: β ≠ 0
STEP II:
α = 5% = 0.05
STEP III:
$t_{tab} = t_{\alpha/2,\,n-2} = t_{0.025,\,8} = 2.306$
STEP IV:
Test statistic under the null hypothesis H0: β = 0
$t_{cal} = \dfrac{b-\beta}{s_e/\sqrt{s_{xx}}} = \dfrac{0.00383 - 0}{0.159/\sqrt{132000}} = 8.75$
STEP V: (Decision)
∴ tcal > ttab, so the null hypothesis is rejected and
the alternative hypothesis is accepted.
STEP VI: (Conclusion)
From the above procedure we conclude that the
slope β ≠ 0.
(iv) Test of hypothesis concerning the intercept α:
(do it yourself)
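The arithmetic of this worked example can be reproduced from its summary statistics. The sketch below uses the values quoted in the solution (n = 10, s_xx = 132000, s_yy = 2.13745, s_xy = 505.40, x̄ = 200, ȳ = 0.835) and the tabulated value t₀.₀₂₅,₈ = 2.306; the variable names are illustrative. Because it does not round intermediate results, its output may differ from the hand calculation in the last digit.

```python
from math import sqrt

# Summary statistics quoted in the worked example above.
n = 10
s_xx, s_yy, s_xy = 132000.0, 2.13745, 505.40
x_bar, y_bar = 200.0, 0.835
t_crit = 2.306  # t_{0.025, 8}, taken from the text rather than computed

b = s_xy / s_xx                                   # slope estimate
a = y_bar - b * x_bar                             # intercept estimate
s_e = sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))   # standard error of estimate

se_b = s_e / sqrt(s_xx)                           # standard error of the slope
se_a = s_e * sqrt(1 / n + x_bar ** 2 / s_xx)      # standard error of the intercept

t_slope = (b - 0) / se_b                          # test statistic for H0: beta = 0
reject_h0 = abs(t_slope) > t_crit                 # two-sided test at the 5% level

ci_slope = (b - t_crit * se_b, b + t_crit * se_b)
ci_intercept = (a - t_crit * se_a, a + t_crit * se_a)

print("b =", b, "a =", a, "s_e =", s_e)
print("t for slope =", t_slope, "reject H0:", reject_h0)
print("95% CI for slope:", ci_slope)
print("95% CI for intercept:", ci_intercept)
```

The standard error of the intercept used here, s_e·√(1/n + x̄²/s_xx), is the same quantity that appears in the intercept test statistic, since √((s_xx + n·x̄²)/(n·s_xx)) = √(1/n + x̄²/s_xx).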
2. Ten steel wires of diameter 0.5 mm and length 2.5 m
were extended in a laboratory by applying vertical
forces of varying magnitudes. The results are as follows:
Force (kg)               15   19   25   35   42   48   53   56   62   65
Increase in length (mm)  1.7  2.1  2.5  3.4  3.9  4.9  5.4  5.7  6.6  7.2
(a) Estimate the parameters of a simple linear
regression model with force as the explanatory
variable.
(b) Find the 95% confidence limits for the slope of the
line.
2. Find the equation of the regression line of y on x, if
the observations (xi , yi) are the following:
(1,4),(2,8),(3,2),(4,12),(5,10),(6,14),(7,16),(8,6),(9,18)
3. The following table shows the weight z to the
nearest pound, height x to the nearest inch, and
age y to the nearest year, of 12 boys:
Weight(z) 64 71 53 67 55 58 77 57 56 51 76 68
Height(x) 57 59 49 62 51 50 55 48 52 42 61 57
Age (y) 8 10 6 11 8 7 10 9 10 6 12 9
z     y     x     x²    y²    z²    xy    yz    zx
64    8     57
71    10    59
53    6     49
67    11    62
55    8     51
58    7     50
77    10    55
57    9     48
56    10    52
51    6     42
76    12    61
68    9     57
∑x = 643, ∑y = 106, ∑z = 753, ∑x² = 34843, ∑y² = 976, ∑z² = 48139, ∑xy = 5779, ∑yz = 6796, ∑zx = 40830
R = 3.48 - 0.002V + 0.0029V²
Comparing with R = A + BV + CV², we get
A = 3.48, B = -0.002 and C = 0.0029
Scatter Plots
A Scatter (XY) Plot has points that show the relationship between two
sets of data.
In this example, each dot shows one person's weight versus their height.
(The data is plotted on the graph as "Cartesian (x,y) Coordinates")
Example:
The local ice cream shop keeps track of how much ice cream they sell versus
the noon temperature on that day. Here are their figures for the last 12 days:
Temperature   Ice cream sales
14.2°         $215
16.4°         $325
11.9°         $185
15.2°         $332
18.5°         $406
22.1°         $522
19.4°         $412
25.1°         $614
23.4°         $544
18.1°         $421
22.6°         $445
17.2°         $408
Try to have the line as close as possible to all points, and as many points
above the line as below.
But for better accuracy we can calculate the line using Least Squares
Regression and the Least Squares Calculator.
As well as using a graph (like above) we can create a formula to help us.
Example: Straight Line Equation
We can estimate a straight line equation from two points on the graph above.
Let's estimate two points on the line near actual values: (12°,
$180) and (25°, $610)
First find the slope: m = (610 − 180) / (25 − 12) = 430 / 13 ≈ 33
Now put the slope and the point (12°, $180) into the "point-slope" formula:
y − y1 = m(x − x1)
y − 180 = 33(x − 12)
y = 33x − 216
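The text notes that a least squares calculation gives a more accurate line than one drawn by eye. As a minimal sketch, the following Python fits the least squares line to the 12 ice cream observations listed above and compares it with the two-point estimate y = 33x − 216; the function names are illustrative.

```python
temps = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    slope = s_xy / s_xx
    intercept = sum(ys) / n - slope * sum(xs) / n
    return intercept, slope

def two_point_estimate(x):
    """The by-eye line through (12, 180) and (25, 610) from the text."""
    return 33 * x - 216

a, b = least_squares_line(temps, sales)
print("least squares line: y = %.2fx + (%.2f)" % (b, a))

# Compare the two lines at a few temperatures inside the observed range.
for t in (12, 18, 25):
    print(t, round(a + b * t, 1), two_point_estimate(t))
```

The fitted slope is about 30.09 and the intercept about −159.47, so the two lines agree closely inside the observed temperature range and drift apart outside it.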
INTERPOLATING
The values are close to what we got on the graph. But that doesn't mean they
are more (or less) accurate. They are all just estimates.
Don't use extrapolation too far! What sales would you expect at 0° ?