Linear Regression

[Figure: scatter plot with fitted regression line, $y = 3.5702x + 62.366$, $R^2 = 0.8215$.]
Simple Linear Regression Model
Basic model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where
$Y_i$ is the response variable (or dependent variable).
$X_i$ is the predictor variable (or independent or explanatory variable).
$\varepsilon_i$ is a random error term with $E[\varepsilon_i] = 0$ and $\mathrm{Var}[\varepsilon_i] = \sigma^2$.
$\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for $i \neq j$.
$\beta_0$ and $\beta_1$ are parameters (intercept and slope).
Simple Linear Regression Model
Fixed versus random X
Some results for regression analysis assume that X is fixed (controlled).
This requirement can often be relaxed.
Simple Linear Regression Model (with fixed X)
Regression function:
$$E(Y_i \mid x_i) = E(\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 + \beta_1 x_i + E(\varepsilon_i) = \beta_0 + \beta_1 x_i$$
Variance of $Y_i$:
$$\mathrm{Var}(Y_i) = \mathrm{Var}(\varepsilon_i) = \sigma^2$$
Correlation of the $Y_i$'s:
$$\mathrm{Corr}(Y_i, Y_j) = 0, \quad \text{for } i \neq j$$
Simple Linear Regression Model (with fixed X)
[Figure: the regression line $Y = \beta_0 + \beta_1 X$ with a data point $(X_i, Y_i)$ and its error $\varepsilon_i$.]
Estimation: Least Squares (LS) method
Principle: minimize the sum of squared errors of the regression model:
$$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$
Derivation:
$$\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) = 0 \quad\Rightarrow\quad \sum_{i=1}^{n} Y_i = n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} X_i$$
$$\frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} X_i (Y_i - \beta_0 - \beta_1 X_i) = 0 \quad\Rightarrow\quad \sum_{i=1}^{n} X_i Y_i = \hat\beta_0 \sum_{i=1}^{n} X_i + \hat\beta_1 \sum_{i=1}^{n} X_i^2$$
These two conditions are the normal equations. Solving them gives
$$\hat\beta_1 = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X$$
Example: Toluca Company
Matlab code:
>> Ex = mean(X);
>> Ey = mean(Y);
>> b1 = sum( (X-Ex).*(Y-Ey) ) / sum( (X-Ex).^2 )
b1 =
3.5702
>> b0 = Ey - b1*Ex
b0 =
62.3659
>> plot(X,Y,'o'), hold on
>> plot( [20,120], [b0+b1*20, b0+b1*120])
[Figure: scatter plot of the Toluca data (X from 20 to 120) with the fitted regression line.]
Point estimation of mean response
For a given X, the mean response is estimated as
$$\hat Y = \hat\beta_0 + \hat\beta_1 X$$
Properties of the LS regression function
1. The sum of residuals is zero: $\sum_{i=1}^{n} e_i = 0$, where $e_i = Y_i - \hat Y_i$.
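This zero-sum property follows from the first normal equation and is easy to verify numerically. Below is a small Python sketch (the slides' own examples use MATLAB; the data here are made-up illustration values, not the Toluca data):

```python
# Verify properties of least-squares residuals (Python sketch;
# the data below are made-up illustration values, not the Toluca data).
X = [30, 50, 70, 90, 110]
Y = [150, 250, 300, 390, 450]
n = len(X)

xbar = sum(X) / n
ybar = sum(Y) / n

# Least-squares estimates from the normal equations on the slides
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

# Residuals e_i = Y_i - Yhat_i
e = [y - (b0 + b1 * x) for x, y in zip(X, Y)]

print(abs(sum(e)) < 1e-9)                             # True: residuals sum to zero
print(abs(sum(x * r for x, r in zip(X, e))) < 1e-6)   # True: residuals orthogonal to X
```

The second check corresponds to the second normal equation: the residuals are also uncorrelated with the predictor values.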
Estimation of the error variance $\sigma^2$
The error variance is estimated as:
$$S^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat Y_i)^2$$
Test of the Simple Linear Regression model
Does X help to explain Y? Test the null hypothesis $\beta_1 = 0$, or equivalently, that the mean response $E(Y_i)$ does not depend on $X_i$.
Sampling distribution of $\hat\beta_1$
$\beta_1$ is estimated as
$$\hat\beta_1 = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2} = \sum_i k_i Y_i, \qquad \text{where } k_i = \frac{X_i - \bar X}{\sum_j (X_j - \bar X)^2}$$
so $\hat\beta_1$ is a linear combination of the $Y_i$.
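The identity above, that the slope estimate is a weighted sum of the responses, can be checked numerically. A short Python sketch (made-up data; the slides' own examples use MATLAB):

```python
# Check that b1 is a linear combination of the Y_i with weights k_i
# (Python sketch with made-up data, not the Toluca data set).
X = [20, 40, 60, 80, 100]
Y = [120, 210, 260, 330, 420]

n = len(X)
xbar = sum(X) / n
ybar = sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)

# Direct least-squares slope
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / Sxx

# Same slope written as a weighted sum of the responses: b1 = sum_i k_i * Y_i
k = [(x - xbar) / Sxx for x in X]
b1_linear = sum(ki * y for ki, y in zip(k, Y))

print(abs(b1 - b1_linear) < 1e-12)   # True: the two forms agree
print(abs(sum(k)) < 1e-12)           # True: the weights sum to zero
```

The weights summing to zero is what makes the two forms agree: adding any constant to all the $Y_i$ leaves $\hat\beta_1$ unchanged.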
Sampling distribution of $\hat\beta_1$
$$\frac{\hat\beta_1 - \beta_1}{\sqrt{\mathrm{Var}(\hat\beta_1)}} \sim N(0,1), \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar X)^2}$$
Replacing $\sigma^2$ by its estimate $S^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat Y_i)^2$ gives $S^2_{\hat\beta_1} = S^2 / \sum_{i=1}^{n} (X_i - \bar X)^2$, and
$$\frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} \sim t(n-2)$$
t-test of the hypothesis $\beta_1 = 0$
Null hypothesis: $\beta_1 = 0$
Alternative: $\beta_1 \neq 0$
Reject $H_0$ if $\left| \hat\beta_1 / S_{\hat\beta_1} \right| > t_{1-\alpha/2}(n-2)$.
[Figure: t(n−2) density with rejection probability $\alpha/2$ in each tail beyond $\pm t_{1-\alpha/2}$.]
Example: Toluca Company
Matlab code:
>> b0 = 62.3659;
>> b1 =3.5702;
>> n = 25;
>> e = Y - (b0+b1*X);
>> Se = sqrt( sum(e.^2/(n-2) ));
>> Sb1 = Se / sqrt(sum((X-mean(X)).^2));
>> p = 2*tcdf(-abs(b1/Sb1),n-2)
p =
4.4488e-010
Sampling distribution of $\hat\beta_0$
$\hat\beta_0$ is normally distributed with
$$E(\hat\beta_0) = \beta_0 \ \text{(unbiased)} \qquad \text{and} \qquad \mathrm{Var}(\hat\beta_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar X^2}{\sum_i (X_i - \bar X)^2} \right)$$
Therefore
$$\frac{\hat\beta_0 - \beta_0}{S_{\hat\beta_0}} \sim t(n-2), \qquad \text{where } S^2_{\hat\beta_0} = S^2 \left( \frac{1}{n} + \frac{\bar X^2}{\sum_i (X_i - \bar X)^2} \right)$$
t-test of the hypothesis $\beta_0 = 0$
Null hypothesis: $\beta_0 = 0$ (regression line goes through the origin)
Alternative: $\beta_0 \neq 0$
Reject $H_0$ if $\left| \hat\beta_0 / S_{\hat\beta_0} \right| > t_{1-\alpha/2}(n-2)$.
[Figure: t(n−2) density with rejection probability $\alpha/2$ in each tail beyond $\pm t_{1-\alpha/2}$.]
Example: Toluca Company
Matlab code
>> b0 = 62.3659;
>> b1 =3.5702;
>> n = 25;
>> e = Y - (b0+b1*X);
>> Se = sqrt( sum(e.^2/(n-2) ));
>> Sb0 = sqrt( Se^2*(1/n + ...
mean(X)^2/(sum((X-mean(X)).^2)) ) );
>> p = 2*tcdf(-abs(b0/Sb0),n-2)
p =
0.0267
Analysis of Variance (ANOVA) of regression model
[Figure: regression line $Y = \beta_0 + \beta_1 X$ showing the decomposition of the total deviation $Y_i - \bar Y$ into the error part $Y_i - \hat Y_i$ and the regression part $\hat Y_i - \bar Y$.]
Variance breakdown
$$Y_i - \bar Y = (Y_i - \hat Y_i) + (\hat Y_i - \bar Y)$$
$$\sum_i (Y_i - \bar Y)^2 = \sum_i (Y_i - \hat Y_i)^2 + \sum_i (\hat Y_i - \bar Y)^2$$
where
– Total sum of squares: $SST = \sum_i (Y_i - \bar Y)^2$
– Error sum of squares: $SSE = \sum_i (Y_i - \hat Y_i)^2$
– Regression sum of squares: $SSR = \sum_i (\hat Y_i - \bar Y)^2$
so that $SST = SSE + SSR$.
Breakdown of degrees of freedom
Total sum of squares: $SST = \sum_{i=1}^{n} (Y_i - \bar Y)^2$, with df = n−1.
There are n different Y's, but only n−1 degrees of freedom, since the mean value is estimated.
Error sum of squares: $SSE = \sum_{i=1}^{n} (Y_i - \hat Y_i)^2$, with df = n−2.
Two degrees of freedom are lost to the estimation of $\hat\beta_0$ and $\hat\beta_1$.
Regression sum of squares: $SSR = \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2$, with df = 1.
$\hat Y_i$ has 2 degrees of freedom ($\hat\beta_0$ and $\hat\beta_1$), but one is lost to estimation of the mean value.
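The decomposition SST = SSE + SSR can be confirmed numerically. A short Python sketch (made-up illustration data; the slides' own examples use MATLAB):

```python
# Numerical check of the sum-of-squares decomposition SST = SSE + SSR
# (Python sketch; the data are made-up illustration values).
X = [10, 20, 30, 40, 50, 60]
Y = [95, 130, 180, 230, 265, 320]

n = len(X)
xbar = sum(X) / n
ybar = sum(Y) / n

# Least-squares fit
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar
Yhat = [b0 + b1 * x for x in X]

# The three sums of squares from the slides
SST = sum((y - ybar) ** 2 for y in Y)
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Yhat))
SSR = sum((yh - ybar) ** 2 for yh in Yhat)

print(abs(SST - (SSE + SSR)) < 1e-6)   # True: the sums of squares add up
```

The cross term $2\sum (Y_i - \hat Y_i)(\hat Y_i - \bar Y)$ vanishes because the residuals sum to zero and are orthogonal to X, which is why the decomposition is exact.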
ANOVA Table

Source       Sum of squares                                  df     Mean square
Regression   $SSR = \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2$    1      $SSR / 1$
Error        $SSE = \sum_{i=1}^{n} (Y_i - \hat Y_i)^2$       n−2    $\frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat Y_i)^2$
Total        $SST = \sum_{i=1}^{n} (Y_i - \bar Y)^2$         n−1    $\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar Y)^2$
F-distribution
[Figure: density of the F(1,18)-distribution.]
F-test of $\beta_1 = 0$
The null hypothesis is rejected for large F:
$$F = \frac{SSR / 1}{SSE / (n-2)} > F_{1-\alpha}(1, n-2) \quad\Rightarrow\quad \text{reject } H_0$$
The F-test is one-sided!
Matlab code:
>> SSE = sum( (Y-(b0+b1*X)).^2 );
>> SSR = sum( (b0+b1*X-mean(Y)).^2 );
>> p=1-fcdf(SSR/(SSE/23),1,23)
p =
4.4489e-010
Coefficient of determination ($R^2$)
The coefficient of determination is defined as:
$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}$$
$R^2$ is the fraction of the variance of Y explained by the regression model.
$R^2 = 1$ when the fit $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$ is perfect ($\hat Y_i = Y_i$ for all $i$); $R^2 = 0$ when $\hat Y = \bar Y$ (the regression explains nothing).
For simple linear regression, $R = \mathrm{corr}(X, Y)$ up to the sign of the slope, i.e. $R^2 = \mathrm{corr}(X, Y)^2$.
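The two forms of $R^2$ and the link to the sample correlation can be checked numerically. A Python sketch with made-up data (not the Toluca data):

```python
# R^2 computed as 1 - SSE/SST and as SSR/SST, compared with corr(X, Y)^2
# (Python sketch with made-up illustration data).
import math

X = [15, 35, 55, 75, 95]
Y = [110, 190, 240, 330, 400]

n = len(X)
xbar = sum(X) / n
ybar = sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

# Least-squares fit
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
Yhat = [b0 + b1 * x for x in X]

SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Yhat))
SSR = sum((yh - ybar) ** 2 for yh in Yhat)
SST = Syy

r2_a = 1 - SSE / SST
r2_b = SSR / SST
r2_corr = (Sxy / math.sqrt(Sxx * Syy)) ** 2   # squared sample correlation

print(abs(r2_a - r2_b) < 1e-12)     # True: both definitions agree
print(abs(r2_a - r2_corr) < 1e-12)  # True: equals corr(X, Y)^2
```

The last equality holds only for simple linear regression, where $SSR = \hat\beta_1^2 S_{xx} = S_{xy}^2 / S_{xx}$.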
Matrix approach to linear regression
The regression model can be formulated in terms of matrices:
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
where
$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \qquad \mathbf{X} = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}, \qquad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \qquad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
and
$$E(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta}, \qquad E(\boldsymbol{\varepsilon}) = \mathbf{0}, \qquad \mathrm{Cov}(\boldsymbol{\varepsilon}) = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 \mathbf{I}$$
Matrix approach to linear regression
The normal equations were derived earlier:
$$n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i$$
$$\hat\beta_0 \sum_{i=1}^{n} X_i + \hat\beta_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i$$
or, in matrix form, $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{Y}$, so that
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
Example: Toluca Company
Matlab code:
>> X = [ones(25,1) X];
>> b = inv(X'*X)*X'*Y
b =
62.3659
3.5702
>> b = regress(Y,X)
b =
62.3659
3.5702