3 Simple Linear Regression
Suppose there are n data points {(x_i, y_i)}, i = 1, 2, …, n, which are the i-th realizations of the random variables X and Y respectively. The goal is to find the equation of the straight line

$$ y = a + bx + \varepsilon $$

which would provide a "best" fit for the data points. In the above model, the intercept a and the slope b are unknown constants and $\varepsilon$ is a random error component. The errors are assumed to have mean zero and unknown variance $\sigma^2$. Additionally, we usually assume that the errors are uncorrelated; that is, the value of one error does not depend on the value of any other error. So, we assume that

$$ E(y \mid x) = a + bx, $$
$$ \operatorname{Var}(y \mid x) = \operatorname{Var}(a + bx + \varepsilon) = \sigma^2. $$
Thus, the mean of y is a linear function of x although the variance
of y does not depend on the value of x. Furthermore, because the
errors are uncorrelated, the responses are also uncorrelated.
The least squares estimates of a and b minimize

$$ Q = \sum_{i=1}^{n} (y_i - a - b x_i)^2, $$

so they must satisfy

$$ \left.\frac{\partial Q}{\partial a}\right|_{\hat a, \hat b} = -2\sum_{i=1}^{n} (y_i - \hat a - \hat b x_i) = 0, \qquad \left.\frac{\partial Q}{\partial b}\right|_{\hat a, \hat b} = -2\sum_{i=1}^{n} (y_i - \hat a - \hat b x_i)\, x_i = 0. $$

On simplification, they give us

$$ \sum_{i=1}^{n} y_i = n\hat a + \hat b \sum_{i=1}^{n} x_i, $$
$$ \sum_{i=1}^{n} x_i y_i = \hat a \sum_{i=1}^{n} x_i + \hat b \sum_{i=1}^{n} x_i^2. $$

These equations are known as the least squares normal equations.
From the first normal equation, we have

$$ \hat a = \frac{\sum_{i=1}^{n} y_i - \hat b \sum_{i=1}^{n} x_i}{n} = \bar y - \hat b \bar x. $$

Now, putting the expression for $\hat a$ in the second normal equation and simplifying, we have

$$ \hat b = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}}{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{S_{xy}}{S_{xx}}. $$
To confirm that these estimates minimize Q, note that the Hessian matrix of second partial derivatives is

$$ H(\hat a, \hat b) = 2\begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix}, $$

and the above is clearly positive definite, since its leading entry $2n > 0$ and its determinant $4\left(n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right) = 4n\sum_{i=1}^{n}(x_i - \bar x)^2 > 0$.
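Since the estimates have closed form, they are straightforward to compute. Below is a minimal Python sketch (NumPy assumed; the helper name `fit_line` is ours, not from the notes) implementing $\hat b = S_{xy}/S_{xx}$ and $\hat a = \bar y - \hat b \bar x$:

```python
import numpy as np

def fit_line(x, y):
    """Least squares estimates for y = a + b*x via the normal equations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum(x**2) - np.sum(x)**2 / n             # S_xx
    sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # S_xy
    b_hat = sxy / sxx                                 # slope
    a_hat = y.mean() - b_hat * x.mean()               # intercept
    return a_hat, b_hat
```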
Example 1

Consider the following data on hours studied (x) and test score (y).

Hours Studied, x   Test Score, y   x^2    xy     y^2
4                  31              16     124    961
9                  58              81     522    3364
10                 65              100    650    4225
14                 73              196    1022   5329
4                  37              16     148    1369
7                  44              49     308    1936
12                 60              144    720    3600
22                 91              484    2002   8281
1                  21              1      21     441
17                 84              289    1428   7056
TOTALS: 100        564             1376   6945   36562
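As a quick numerical check of the totals above (the code and the rounded results are our own, using the `fit_line` sketch from earlier):

```python
x = [4, 9, 10, 14, 4, 7, 12, 22, 1, 17]
y = [31, 58, 65, 73, 37, 44, 60, 91, 21, 84]
a_hat, b_hat = fit_line(x, y)   # uses the sketch defined above
print(a_hat, b_hat)             # approx. 21.69 and 3.47
```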
PROPERTIES OF REGRESSION COEFFICIENTS
Linearity
$$ \hat b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar x)\, y_i}{S_{xx}} = \sum_{i=1}^{n} w_i y_i, \quad \text{where } w_i = \frac{x_i - \bar x}{S_{xx}}. $$

Thus $\hat b$ is a linear combination of the observations $y_i$.
Unbiasedness
Let us now investigate the bias and variance properties of these
estimates.
$$ \hat b = \sum_{i=1}^{n} w_i y_i = \sum_{i=1}^{n} w_i (a + b x_i + \varepsilon_i) = a \sum_i w_i + b \sum_i w_i x_i + \sum_i w_i \varepsilon_i. $$

Now, clearly,

$$ \sum_i w_i = 0 \quad \text{and} \quad \sum_i w_i x_i = 1. $$

Therefore,

$$ E(\hat b) = E\left(b + \sum_i w_i \varepsilon_i\right) = E(b) + \sum_i w_i E(\varepsilon_i) = b, \quad \because E(\varepsilon_i) = 0. $$
Similarly,

$$ \hat a = \bar y - \hat b \bar x = \frac{1}{n}\sum_i (a + b x_i + \varepsilon_i) - \hat b \bar x = a + b\bar x + \bar\varepsilon - \hat b \bar x = a - (\hat b - b)\bar x + \bar\varepsilon. $$

Since $E(\hat b) = b$ and $E(\bar\varepsilon) = 0$, we get $E(\hat a) = a$. Thus, $\hat a$ is also an unbiased estimate of the true intercept a.
$$ \operatorname{Var}(\hat a) = E\left[\left(\hat a - E(\hat a)\right)^2\right] = E\left[\left(\bar\varepsilon - (\hat b - b)\bar x\right)^2\right] $$
$$ = E(\bar\varepsilon^2) + \bar x^2 E\left[(\hat b - b)^2\right] - 2\bar x E\left[\bar\varepsilon (\hat b - b)\right] = E(\bar\varepsilon^2) + \bar x^2 \operatorname{Var}(\hat b) - 2\bar x E\left[\bar\varepsilon (\hat b - b)\right]. $$

Now, $E(\bar\varepsilon^2) = \sigma^2/n$, $\operatorname{Var}(\hat b) = \operatorname{Var}\left(\sum_i w_i y_i\right) = \sigma^2 \sum_i w_i^2 = \sigma^2/S_{xx}$, and

$$ E\left[\bar\varepsilon (\hat b - b)\right] = E\left[\frac{1}{n}\sum_i \varepsilon_i \cdot \sum_i w_i \varepsilon_i\right] = E\left[\frac{1}{n}\sum_i w_i \varepsilon_i^2 + \text{cross product terms}\right] = \frac{1}{n}\sum_i w_i E(\varepsilon_i^2) = \frac{\sigma^2}{n}\sum_i w_i = 0. $$

Therefore,

$$ \operatorname{Var}(\hat a) = \frac{\sigma^2}{n} + \frac{\bar x^2 \sigma^2}{S_{xx}} = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right). $$
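A small simulation, our own illustration rather than part of the notes, can corroborate the unbiasedness and variance results derived above:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, sigma = 2.0, 0.5, 1.0
x = np.linspace(0, 10, 20)
n = len(x)
sxx = np.sum((x - x.mean())**2)

a_hats, b_hats = [], []
for _ in range(20000):
    y = a + b * x + rng.normal(0, sigma, n)
    bh = np.sum((x - x.mean()) * y) / sxx    # b_hat = S_xy / S_xx
    ah = y.mean() - bh * x.mean()
    a_hats.append(ah); b_hats.append(bh)

# empirical mean/variance vs. theory: E(b_hat)=b, Var(b_hat)=sigma^2/S_xx
print(np.mean(b_hats), np.var(b_hats), sigma**2 / sxx)
# E(a_hat)=a, Var(a_hat)=sigma^2 (1/n + x_bar^2 / S_xx)
print(np.mean(a_hats), np.var(a_hats),
      sigma**2 * (1/n + x.mean()**2 / sxx))
```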
To show that the OLS estimates are best (i.e. have the least variance), we will show that if there exists another linear unbiased estimator other than $\hat b$, then its variance must be greater than or equal to that of $\hat b$.
Let $\hat b^* = \sum_i k_i y_i$ be any other linear estimator, and write $k_i = w_i + c_i$. Then

$$ \hat b^* = \sum_i (w_i + c_i)(a + b x_i + \varepsilon_i) = a\sum_i w_i + a\sum_i c_i + b\sum_i w_i x_i + b\sum_i c_i x_i + \sum_i (w_i + c_i)\varepsilon_i $$
$$ = a\sum_i c_i + b + b\sum_i c_i x_i + \sum_i (w_i + c_i)\varepsilon_i. $$

For $\hat b^*$ to be unbiased we need $\sum_i c_i = 0$ and $\sum_i c_i x_i = 0$, so that

$$ \hat b^* = b + \sum_i (w_i + c_i)\varepsilon_i. $$

Now,

$$ \operatorname{Var}(\hat b^*) = \operatorname{Var}\left(b + \sum_i (w_i + c_i)\varepsilon_i\right) = \sum_i (w_i + c_i)^2 \operatorname{Var}(\varepsilon_i) = \sigma^2 \sum_i (w_i + c_i)^2 $$
$$ = \sigma^2 \sum_i w_i^2 + \sigma^2 \sum_i c_i^2 = \operatorname{Var}(\hat b) + \sigma^2 \sum_i c_i^2 \ge \operatorname{Var}(\hat b), $$

since $\sum_i w_i c_i = \frac{1}{S_{xx}}\left(\sum_i c_i x_i - \bar x \sum_i c_i\right) = 0$.
The above establishes that, for the family of linear and unbiased estimators $\hat b^*$, each of the alternative estimators has variance that is greater than or equal to that of the least squares estimator $\hat b$. The only time that $\operatorname{Var}(\hat b^*) = \operatorname{Var}(\hat b)$ is when all the $c_i = 0$, in which case $\hat b^* = \hat b$. Thus, there is no other linear and unbiased estimator of b that is better than $\hat b$. Hence the OLS estimate $\hat b$ is BLUE.
ESTIMATION OF $\sigma^2$

The residuals are $e_i = y_i - \hat y_i = y_i - \hat a - \hat b x_i$, and the residual (error) sum of squares is

$$ SS_E = \sum_i e_i^2 = \sum_i \left(y_i - \hat a - \hat b x_i\right)^2 = \sum_i \left[(y_i - \bar y) - \hat b (x_i - \bar x)\right]^2 $$
$$ = S_{yy} + \hat b^2 S_{xx} - 2\hat b S_{xy} = S_{yy} + \hat b S_{xy} - 2\hat b S_{xy} \quad \left[\because \hat b = \frac{S_{xy}}{S_{xx}}\right] $$
$$ = S_{yy} - \hat b S_{xy}. $$

The quantity

$$ MS_E = \frac{SS_E}{n - 2} $$

gives an unbiased estimate of $\sigma^2$.
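The shortcut $SS_E = S_{yy} - \hat b S_{xy}$ avoids accumulating residuals one at a time; a sketch (helper name ours):

```python
import numpy as np

def mse(x, y):
    """Unbiased estimate of sigma^2 via SS_E = S_yy - b_hat * S_xy."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean())**2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    syy = np.sum((y - y.mean())**2)
    b_hat = sxy / sxx
    sse = syy - b_hat * sxy          # residual sum of squares
    return sse / (n - 2)             # MS_E
```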
In simple linear regression the estimated standard error of the slope is

$$ se(\hat b) = \sqrt{\frac{\hat\sigma^2}{S_{xx}}} $$

and the estimated standard error of the intercept is

$$ se(\hat a) = \sqrt{\hat\sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)}, \quad \text{where } \hat\sigma^2 = MS_E. $$

To test $H_0: b = b_0$ against $H_1: b \ne b_0$, we use the statistic

$$ t_0 = \frac{\hat b - b_0}{\sqrt{\hat\sigma^2 / S_{xx}}} = \frac{\hat b - b_0}{\sqrt{MS_E / S_{xx}}}, $$

that has a t distribution with n − 2 degrees of freedom under the null hypothesis. Thus we would reject the null hypothesis if

$$ |t_0| > t_{\alpha/2,\, n-2}. $$
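A sketch of this test, assuming SciPy is available for the t reference distribution (the helper `slope_t_test` is ours):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y, b0=0.0, alpha=0.05):
    """t test of H0: b = b0 against a two-sided alternative."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean())**2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    syy = np.sum((y - y.mean())**2)
    b_hat = sxy / sxx
    ms_e = (syy - b_hat * sxy) / (n - 2)
    t0 = (b_hat - b0) / np.sqrt(ms_e / sxx)
    p = 2 * stats.t.sf(abs(t0), df=n - 2)                  # two-sided p-value
    reject = abs(t0) > stats.t.ppf(1 - alpha / 2, n - 2)   # compare to t_{alpha/2, n-2}
    return t0, p, reject
```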
Similarly, to test

$$ H_0: a = a_0 \quad \text{vs} \quad H_1: a \ne a_0, $$

we would use the statistic

$$ t_0 = \frac{\hat a - a_0}{\sqrt{MS_E\left(\dfrac{1}{n} + \dfrac{\bar x^2}{S_{xx}}\right)}}. $$
An important special case is the test for significance of regression:

$$ H_0: b = 0 \quad \text{vs} \quad H_1: b \ne 0. $$
Example 2

Mean square error is given by $MS_E = \dfrac{SS_E}{n-2}$.
ANOVA APPROACH FOR TESTING SIGNIFICANCE OF REGRESSION

$$ \sum_i (y_i - \bar y)^2 = \sum_i (\hat y_i - \bar y)^2 + \sum_i (y_i - \hat y_i)^2, $$

i.e.,

$$ S_{yy} = SS_R + SS_E, $$

where $S_{yy} = \sum_{i=1}^{n} (y_i - \bar y)^2$ is the total corrected sum of squares of y. Now, we have already noted that $SS_E = S_{yy} - \hat b S_{xy}$, or equivalently, $SS_R = \hat b S_{xy} = S_{yy} - SS_E$. The total SS has n − 1 degrees of freedom; $SS_R$ and $SS_E$ have 1 and n − 2 degrees of freedom, respectively.
The test statistic for significance of regression is

$$ F_0 = \frac{SS_R/1}{SS_E/(n-2)} = \frac{MS_R}{MS_E}, $$

and we would reject $H_0$ if $f_0 > f_{\alpha,\,1,\,n-2}$. The test procedure is usually represented in an ANOVA table, as given below.

Source        DF     SS       MS                   F0
Regression    1      SS_R     MS_R = SS_R/1        MS_R/MS_E
Error         n-2    SS_E     MS_E = SS_E/(n-2)
TOTAL         n-1    S_yy

Note: $\hat\sigma^2 = MS_E$. The t and F tests of $H_0: b = 0$ are equivalent, since

$$ T_0^2 = \frac{\hat b^2 S_{xx}}{MS_E} = \frac{\hat b S_{xy}}{MS_E} = \frac{SS_R}{MS_E} = \frac{MS_R}{MS_E} = F_0. $$
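A sketch assembling these ANOVA quantities (helper name ours; SciPy assumed for the F p-value):

```python
import numpy as np
from scipy import stats

def anova_regression(x, y):
    """Return SS_R, SS_E, F0 and its p-value for H0: b = 0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    sxx = np.sum((x - x.mean())**2)
    syy = np.sum((y - y.mean())**2)
    ss_r = sxy**2 / sxx                      # = b_hat * S_xy
    ss_e = syy - ss_r
    f0 = (ss_r / 1) / (ss_e / (n - 2))       # MS_R / MS_E
    return ss_r, ss_e, f0, stats.f.sf(f0, 1, n - 2)
```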
CONFIDENCE INTERVALS

A 100(1 − α)% confidence interval on the slope b is

$$ \hat b - t_{\alpha/2,\,n-2}\sqrt{\frac{\hat\sigma^2}{S_{xx}}} \le b \le \hat b + t_{\alpha/2,\,n-2}\sqrt{\frac{\hat\sigma^2}{S_{xx}}}, $$

and on the intercept a is

$$ \hat a - t_{\alpha/2,\,n-2}\sqrt{\hat\sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)} \le a \le \hat a + t_{\alpha/2,\,n-2}\sqrt{\hat\sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right)}. $$
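Both intervals can be computed in a few lines; a sketch (helper name ours, SciPy assumed):

```python
import numpy as np
from scipy import stats

def coef_intervals(x, y, alpha=0.05):
    """100(1-alpha)% confidence intervals for intercept a and slope b."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean())**2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    syy = np.sum((y - y.mean())**2)
    b_hat = sxy / sxx
    a_hat = y.mean() - b_hat * x.mean()
    ms_e = (syy - b_hat * sxy) / (n - 2)
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    hb = t * np.sqrt(ms_e / sxx)                           # slope half-width
    ha = t * np.sqrt(ms_e * (1 / n + x.mean()**2 / sxx))   # intercept half-width
    return (a_hat - ha, a_hat + ha), (b_hat - hb, b_hat + hb)
```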
The point estimate of the mean response at $x = x_0$ is

$$ \hat E(y \mid x_0) = \hat\mu_{y|x_0} = \hat a + \hat b x_0, $$

with

$$ V(\hat\mu_{y|x_0}) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right). $$

Hence

$$ \frac{\hat\mu_{y|x_0} - E(y \mid x_0)}{\sqrt{\hat\sigma^2\left(\dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{xx}}\right)}} $$

has a t-distribution with n − 2 degrees of freedom. This leads to the following confidence interval definition:

$$ \hat\mu_{y|x_0} \pm t_{\alpha/2,\,n-2}\sqrt{\hat\sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}, $$
i.e.,

$$ \hat\mu_{y|x_0} - t_{\alpha/2,\,n-2}\sqrt{\hat\sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)} \le E(y \mid x_0) \le \hat\mu_{y|x_0} + t_{\alpha/2,\,n-2}\sqrt{\hat\sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}. $$
Again, the actual observed value $y_0$ will vary about the mean response with variance $\sigma^2$. So the variance of the prediction error $y_0 - \hat y_0$ at $x = x_0$ is given by

$$ \operatorname{var}(y_0 - \hat y_0) = \sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right). $$

Hence

$$ \frac{y_0 - \hat y_0}{\sqrt{\hat\sigma^2\left(1 + \dfrac{1}{n} + \dfrac{(x_0 - \bar x)^2}{S_{xx}}\right)}} $$

has a t-distribution with n − 2 degrees of freedom, and the 100(1 − α)% prediction interval is

$$ \hat y_0 - t_{\alpha/2,\,n-2}\sqrt{\hat\sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)} \le y_0 \le \hat y_0 + t_{\alpha/2,\,n-2}\sqrt{\hat\sigma^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}. $$
The prediction interval is of minimum width at $x_0 = \bar x$ and widens as $|x_0 - \bar x|$ increases.
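A sketch computing both intervals at a given $x_0$ (helper name ours; note the prediction interval differs from the mean-response interval only by the extra "1 +" inside the square root):

```python
import numpy as np
from scipy import stats

def intervals_at(x, y, x0, alpha=0.05):
    """CI for the mean response and PI for a new observation at x0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean())**2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    syy = np.sum((y - y.mean())**2)
    b_hat = sxy / sxx
    a_hat = y.mean() - b_hat * x.mean()
    ms_e = (syy - b_hat * sxy) / (n - 2)
    y0 = a_hat + b_hat * x0
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    lev = 1 / n + (x0 - x.mean())**2 / sxx
    ci = t * np.sqrt(ms_e * lev)          # mean-response half-width
    pi = t * np.sqrt(ms_e * (1 + lev))    # prediction half-width (extra "1 +")
    return (y0 - ci, y0 + ci), (y0 - pi, y0 + pi)
```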
1. Errors are
a) normally distributed,
b) distributed with mean 0 and constant variance $\sigma^2$, and
c) uncorrelated.
The residuals, unlike the errors, do not all have the same variance: the variance depends on how far the corresponding x-value is from the average x-value. Because the variances of the residuals differ, even though the variances of the true errors are all equal to each other, it does not make sense to compare residuals at different data points without some sort of standardization.
Standardized Residuals
One may also standardize the residuals by computing

$$ d_i = \frac{e_i}{\sqrt{\hat\sigma^2}} = \frac{e_i}{\sqrt{MS_E}}, \quad i = 1, 2, \ldots, n. $$
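A sketch of this standardization (helper name ours):

```python
import numpy as np

def standardized_residuals(x, y):
    """d_i = e_i / sqrt(MS_E)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean())**2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    b_hat = sxy / sxx
    a_hat = y.mean() - b_hat * x.mean()
    e = y - (a_hat + b_hat * x)           # raw residuals
    ms_e = np.sum(e**2) / (n - 2)
    return e / np.sqrt(ms_e)
```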
[Figure: residual plots illustrating homoscedastic and heteroscedastic error patterns]
It is frequently helpful to plot the residuals (1) against the $\hat y_i$ and (2) against the $x_i$. If the plot is evenly and randomly distributed around the zero-residual line, we will assume that there is no abnormal pattern in the residuals. If the plot is funnel-shaped around the zero-residual line, the variance of the observations is not constant over the magnitude of $y_i$ or $x_i$.
[Figure: residuals-versus-x plots illustrating independent and non-independent error patterns]
The quantity

$$ R^2 = \frac{SS_R}{S_{yy}} = 1 - \frac{SS_E}{S_{yy}} $$

is called the coefficient of determination, and is often used to judge the adequacy of the regression model. It should be noted that $R^2$ represents the proportion of the variability in the data explained or accounted for by the regression model, and since $0 \le SS_R \le S_{yy}$, we have $0 \le R^2 \le 1$.
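A one-function sketch of $R^2$ (helper name ours):

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination R^2 = SS_R / S_yy."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    sxx = np.sum((x - x.mean())**2)
    syy = np.sum((y - y.mean())**2)
    return (sxy**2 / sxx) / syy    # SS_R = S_xy^2 / S_xx
```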
Lack of Fit Test

Here we will test for the goodness of fit of the regression model. Specifically, we wish to test

$$ H_0: \text{the straight-line model adequately fits the data} \quad \text{vs} \quad H_1: \text{it does not.} $$

The SS for pure error is obtained by summing, over those levels of x that contain repeat observations, the squared deviations of the responses about their level means:

$$ SS_{PE} = \sum_{i=1}^{m} \sum_{u=1}^{n_i} (y_{iu} - \bar y_i)^2, $$

where m is the number of distinct levels of x and $n_i$ is the number of observations at the i-th level.
The degrees of freedom associated with the pure error SS is

$$ \sum_{i=1}^{m} (n_i - 1) = n - m. $$

The lack of fit SS is simply $SS_{LOF} = SS_E - SS_{PE}$, with m − 2 degrees of freedom, and the test statistic is

$$ F_0 = \frac{SS_{LOF}/(m-2)}{SS_{PE}/(n-m)} = \frac{MS_{LOF}}{MS_{PE}}. $$
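A sketch of the whole procedure, assuming SciPy for the F p-value (the helper name and the grouping-by-level approach are ours):

```python
import numpy as np
from collections import defaultdict
from scipy import stats

def lack_of_fit_test(x, y):
    """Partition SS_E into pure error and lack of fit; return F0 and p-value."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    groups = defaultdict(list)            # responses grouped by distinct x level
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    m = len(groups)
    ss_pe = sum(np.sum((np.array(g) - np.mean(g))**2) for g in groups.values())
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    sxx = np.sum((x - x.mean())**2)
    syy = np.sum((y - y.mean())**2)
    ss_e = syy - (sxy / sxx) * sxy        # SS_E = S_yy - b_hat * S_xy
    ss_lof = ss_e - ss_pe
    f0 = (ss_lof / (m - 2)) / (ss_pe / (n - m))
    return f0, stats.f.sf(f0, m - 2, n - m)
```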
Example 2

Test the fitted straight line for lack of fit, given the data below.

x     y
1.0   2.3, 1.8
2.0   2.8
3.3   1.8, 3.7
4.0   2.6, 2.6, 2.2
5.0   2.0
5.6   3.5, 2.8, 2.1
6.0   3.4, 3.2
6.5   3.4
6.9   5.0
The pure error SS is computed from the levels of x with repeat observations:

Level of x    $\sum_{u=1}^{n_i}(y_{iu} - \bar y_i)^2$    Degrees of freedom
1.0           0.1250                                      1
3.3           1.8050                                      1
4.0           0.1066                                      2
5.6           0.9800                                      2
6.0           0.0200                                      1
Totals        3.0366                                      7
So, the lack of fit SS is $SS_{LOF} = SS_E - SS_{PE} = 7.3372 - 3.0366 = 4.3006$.

Source          DF    SS        MS       F0     P-value
Regression      1     3.4930    3.4930   6.66   0.0218
Error           14    7.3372    0.5241
(Lack of Fit)   7     4.3006    0.6144   1.42   0.3276
(Pure Error)    7     3.0366    0.4338
Total           15    10.8302

Since the P-value for lack of fit (0.3276) is large, there is no evidence of lack of fit, while the regression itself is significant (P = 0.0218).
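The numbers in this table can be reproduced with the `lack_of_fit_test` sketch above (the flattened data lists are our own transcription of Example 2):

```python
x = [1.0, 1.0, 2.0, 3.3, 3.3, 4.0, 4.0, 4.0, 5.0,
     5.6, 5.6, 5.6, 6.0, 6.0, 6.5, 6.9]
y = [2.3, 1.8, 2.8, 1.8, 3.7, 2.6, 2.6, 2.2, 2.0,
     3.5, 2.8, 2.1, 3.4, 3.2, 3.4, 5.0]
print(lack_of_fit_test(x, y))   # F0 ~ 1.42, p ~ 0.33, as in the table
```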
CORRELATION
The correlation coefficient in such cases is defined as

$$ \rho = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}. $$

The estimate of $\rho$ is the simple correlation coefficient and can be given by

$$ r = \frac{\sum_{i=1}^{n} y_i (x_i - \bar x)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2 \cdot \sum_{i=1}^{n} (y_i - \bar y)^2}} = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}. $$

Note that

$$ r^2 = \frac{S_{xy}^2}{S_{xx}\, S_{yy}} = \frac{S_{xy}}{S_{xx}} \cdot \frac{S_{xy}}{S_{yy}} = \frac{\hat b\, S_{xy}}{S_{yy}} = \frac{SS_R}{S_{yy}} = R^2. $$
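A sketch (helper name ours); NumPy's built-in `np.corrcoef` returns the same value:

```python
import numpy as np

def corr(x, y):
    """Sample correlation r = S_xy / sqrt(S_xx * S_yy)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    sxx = np.sum((x - x.mean())**2)
    syy = np.sum((y - y.mean())**2)
    return sxy / np.sqrt(sxx * syy)

# np.corrcoef(x, y)[0, 1] gives the same value
```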
Several non-linear models can be fitted by least squares after a suitable transformation:

Non-linear form             Transformed linear form       Remark
Y = a e^{bx} ε              ln Y = ln a + bx + ln ε       ln ε should be NID(0, σ²)
Y = a + b(1/x) + ε          Y = a + bz + ε                Using z = 1/x
Y = 1/exp(a + bx + ε)       ln Y* = a + bx + ε            Using Y* = 1/Y
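As an illustration of the first row, a sketch that fits $Y = a e^{bx}\varepsilon$ by regressing ln Y on x; the synthetic data and parameter values are our own:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 5, 40)
y = 2.0 * np.exp(0.3 * x) * np.exp(rng.normal(0, 0.05, x.size))  # Y = a e^{bx} eps

ln_y = np.log(y)
b_hat, ln_a_hat = np.polyfit(x, ln_y, 1)   # slope and intercept of ln Y on x
a_hat = np.exp(ln_a_hat)                   # back-transform the intercept
print(a_hat, b_hat)                        # approx. 2.0 and 0.3
```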
Example 1

Obs #        1      2      3      4      5      6      7      8      9
HC Level %   0.99   1.02   1.15   1.29   1.46   1.36   0.87   1.23   1.55
Purity %     90.01  89.05  91.43  93.74  96.73  94.45  87.59  91.77  99.42

Obs #        10     11     12     13     14     15     16     17     18
HC Level %   1.40   1.19   1.15   0.98   1.01   1.11   1.20   1.26   1.32
Purity %     93.65  93.54  92.52  90.56  89.54  89.85  90.39  93.25  93.41
a) Calculate the least squares estimates of the slope and intercept.
b) What percentage of the total variability in Purity % is accounted for by the model?
c) Test the significance of the model thus obtained using ANOVA.
d) Obtain 95% confidence intervals on i) the slope and ii) the intercept.
e) Construct a 95% confidence interval on the mean purity level at an HC level of 1.01.
f) Construct a 95% prediction interval at an HC level % of 1.00.
Soln.

c) ANOVA table

Source of Variation    DF    SS        MS        f0        Remark
Regression             1     152.085   152.085   128.559   Significant
Error                  18    21.292    1.183
Total                  19    173.377
d) 95% confidence interval on

i) slope: $\hat b - t_{0.025,18}\sqrt{\dfrac{MS_E}{S_{xx}}} \le b \le \hat b + t_{0.025,18}\sqrt{\dfrac{MS_E}{S_{xx}}}$
Exercise 1