Lecture_8 Regression and Correlation - Copy
1 Regression
Introduction
Regression Line
Least Squares Estimation
2 Correlation
Correlation Coefficient
Introduction
Y = β0 + β1 x1 + β2 x2 + · · · + βr xr (1)
would hold.
Y = β0 + β1 x1 + β2 x2 + · · · + βr xr + e (2)
E[Y | x] = β0 + β1 x1 + β2 x2 + · · · + βr xr (3)
It can be expressed as
Y = β0 + β1 x + e (4)
where x is the value of the independent variable, also called the input
level, Y is the response, and e represents the random error, i.e. a random
variable having mean 0.
Example
Consider the following 10 data pairs (xi, yi), i = 1, . . . , 10, relating y, the
percent yield of a laboratory experiment, to x, the temperature at which
the experiment was run.
i 1 2 3 4 5 6 7 8 9 10
xi 100 110 120 130 140 150 160 170 180 190
yi 45 52 54 63 62 68 75 76 92 88
The individual random error terms ei have mean zero and variance σ², and
are assumed to be normally distributed, i.e. ei ∼ N(0, σ²).
That is, we minimize the sum of the squared vertical distances of each
point to the fitted line.
This vertical distance of a point from the fitted line is called a residual.
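To minimize g(β̂0, β̂1) = Σ (yi − (β̂0 + β̂1 xi))², summed over i = 1, . . . , n,
we take the partial derivatives of g with respect to β̂0 and β̂1, set them
equal to zero, and solve the resulting equations.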
Dr.Emmanuel(Udsm) IS141: Probability and Statistics Regression and Correlation 12 / 34
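$$\frac{\partial g}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right] = 0$$
and
$$\frac{\partial g}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right] x_i = 0$$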
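Dividing each condition by −2 and separating the sums gives two linear equations in β̂0 and β̂1, commonly called the normal equations. The block below is a sketch of that intermediate step, using only symbols already defined:

```latex
\begin{aligned}
n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} y_i, \\
\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} x_i y_i .
\end{aligned}
```

Solving the first equation for β̂0 and substituting the result into the second leaves a single equation in β̂1.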
It follows that,
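$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$
and
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$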
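The closed-form slope can be checked numerically on the temperature/yield data from the earlier 10-pair example. This is a minimal sketch in plain Python; the variable names are illustrative, and the intercept is taken as ȳ − β̂1 x̄, which follows from setting the derivative with respect to β̂0 to zero.

```python
# Temperature (x) and percent yield (y) from the 10-pair example.
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# S_xy = sum(x_i * y_i) - (sum x_i)(sum y_i) / n
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
# S_xx = sum((x_i - x_bar)^2)
s_xx = sum((xi - x_bar) ** 2 for xi in x)

beta1_hat = s_xy / s_xx                 # slope estimate
beta0_hat = y_bar - beta1_hat * x_bar   # intercept estimate

print(f"slope = {beta1_hat:.4f}, intercept = {beta0_hat:.4f}")
```

For these data the slope comes out at roughly 0.496 and the intercept at roughly −4.47.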
Example
In this table y is the purity of oxygen produced in a chemical distillation
process, and x is the percentage of hydrocarbons that are present in the
main condenser of the distillation unit. Fit a simple linear regression model
to the oxygen purity data given below:
Figure: Scatter plot of oxygen purity y versus hydrocarbon level x and regression
model y = 74.283 + 14.947x
Estimation of σ²
Another unknown parameter in the regression model is σ² (the variance of
the error term e).
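Substituting ŷi = β̂0 + β̂1 xi into SSE = Σ (yi − ŷi)² and simplifying, we get
$$SSE = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 - \hat{\beta}_1 S_{xy} = SST - \hat{\beta}_1 S_{xy},$$
where SST = Σ yi² − nȳ² is the total sum of squares. Since
E(SSE) = (n − 2)σ², the quantity SSE/(n − 2) is an unbiased estimator of σ².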
For the above example, the estimate of σ² for the oxygen purity data is
$$\hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2 - \hat{\beta}_1 S_{xy}}{n-2} = \frac{170{,}044.5321 - 20(92.1605)^2 - (14.94748)(10.17744)}{20 - 2} = 1.18$$
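The arithmetic in this estimate can be reproduced step by step. The sketch below uses only the summary quantities quoted in the worked example (n, Σ yi², ȳ, β̂1 and Sxy); the variable names are illustrative.

```python
# Summary quantities quoted in the oxygen purity example.
n = 20
sum_y_squared = 170044.5321
y_bar = 92.1605
beta1_hat = 14.94748
s_xy = 10.17744

# Total sum of squares: SST = sum(y_i^2) - n * y_bar^2
sst = sum_y_squared - n * y_bar ** 2
# Error sum of squares: SSE = SST - beta1_hat * S_xy
sse = sst - beta1_hat * s_xy
# Variance estimate with n - 2 degrees of freedom
sigma2_hat = sse / (n - 2)

print(f"SST = {sst:.4f}, SSE = {sse:.4f}, sigma^2 estimate = {sigma2_hat:.2f}")
```

The divisor is n − 2 rather than n because two parameters, β̂0 and β̂1, have been estimated from the data.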
The total variation is made up of two parts: the sum of squares for error
(SSE) and the sum of squares for regression (SSR), i.e. SST = SSR + SSE.
Coefficient of Determination, R 2
The coefficient of determination is the proportion of the total variation in
the dependent variable that is explained by variation in the independent
variable.
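In terms of sums of squares, this proportion is commonly written R² = SSR/SST = 1 − SSE/SST. As a hedged illustration, the sketch below applies that standard formula to the summary quantities quoted for the oxygen purity example; the function name `r_squared` is an assumption made for this sketch.

```python
def r_squared(sst, sse):
    """Proportion of total variation explained: R^2 = 1 - SSE / SST."""
    return 1.0 - sse / sst


# Summary quantities quoted in the oxygen purity example.
n = 20
sst = 170044.5321 - n * 92.1605 ** 2   # total sum of squares
ssr = 14.94748 * 10.17744              # regression sum of squares, beta1_hat * S_xy
sse = sst - ssr                        # error sum of squares

r2 = r_squared(sst, sse)
print(f"R^2 = {r2:.3f}")
```

A value near 1 means the fitted line accounts for most of the variation in y, while a value near 0 means it accounts for very little.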
Correlation
Correlation Coefficient
Features of ρ and r
Unit free
Range between −1 and 1
The closer to −1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship.
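The unit-free and range properties can be illustrated numerically. The sketch below assumes the standard sample formula r = Sxy / √(Sxx Syy) and reuses the temperature/yield data from the regression example. The helper name `pearson_r` and the rescaling constants are hypothetical choices for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_yy = sum((b - y_bar) ** 2 for b in ys)
    return s_xy / sqrt(s_xx * s_yy)


x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

r_original = pearson_r(x, y)

# Hypothetical change of units: shift and rescale x, rescale y.
x_rescaled = [1.8 * a + 32 for a in x]
y_rescaled = [b / 100 for b in y]
r_rescaled = pearson_r(x_rescaled, y_rescaled)

# Reversing the direction of y flips the sign of r but not its magnitude.
r_flipped = pearson_r(x, [-b for b in y])

print(r_original, r_rescaled, r_flipped)
```

The first two values agree up to floating-point rounding, which is what "unit free" means in practice, and the third is the negative of the first.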
Example
Using the following table, calculate the correlation coefficient r .
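$$r = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left[ n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 \right] \left[ n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 \right]}}$$
$$= \frac{8(3142) - (73)(321)}{\sqrt{[8(713) - (73)^2][8(14111) - (321)^2]}} = 0.886$$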
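The same computation can be sketched in Python from the column totals that appear in the worked formula (n = 8, Σx = 73, Σy = 321, Σxy = 3142, Σx² = 713, Σy² = 14111). The variable names are illustrative.

```python
from math import sqrt

# Column totals taken from the worked example.
n = 8
sum_x, sum_y = 73, 321
sum_xy, sum_x2, sum_y2 = 3142, 713, 14111

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"r = {r:.3f}")
```

Working from the totals avoids computing the deviations (xi − x̄) and (yi − ȳ) for each pair, which is convenient for hand calculation.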