Lecture Notes
Yi = β0 + β1 Xi + εi
where
I Yi is the value of the response variable in the ith trial
Subject i:                 1    2    3
Age (Xi):                 20   55   30
Number of attempts (Yi):   5   12   10
Notation: n = 3, the observations for the first subject were
(X1 , Y1 ) = (20, 5), and similarly for the other subjects.
Method of Least Squares
I To find "good" estimators of the regression parameters
β0 and β1, we employ the method of least squares.
I For the observations (Xi, Yi) for each case, the method of
least squares considers the deviation of Yi from its
expected value: Yi − (β0 + β1 Xi)
I The method of least squares requires that we consider the
sum of the n squared deviations. This criterion is denoted
by
Q = Σᵢ₌₁ⁿ (Yi − β0 − β1 Xi)²
I According to the method of least squares, the estimators
of β0 and β1 are those values b0 and b1 respectively, that
minimize the criterion Q for the given sample
observations (X1, Y1), (X2, Y2), ..., (Xn, Yn).
Example: Least Squares Criterion Q
1. Y = 9.0 + 0X, hence
Q = (5 − 9)² + (12 − 9)² + (10 − 9)² = 26
2. Y = 2.81 + 0.177X, hence
Q = (5 − 6.35)² + (12 − 12.55)² + (10 − 8.12)² = 5.7
I Thus, a better fit of the regression line to the data
corresponds to a smaller sum Q.
I The objective of the method of least squares is to find
estimates b0 and b1 for β0 and β1 , respectively, for which
Q is a minimum.
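As a quick numeric illustration (a minimal Python sketch, for illustration only), Q can be evaluated for the two candidate lines above using the three (age, attempts) observations from the earlier table:

import numpy as np

# Data from the three-subject example: age (X) and number of attempts (Y)
X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])

def Q(b0, b1):
    # Least squares criterion: sum of squared deviations of Y from b0 + b1*X
    return np.sum((Y - (b0 + b1 * X)) ** 2)

print(Q(9.0, 0.0))     # about 26.0 for the line Y = 9.0 + 0X
print(Q(2.81, 0.177))  # about 5.7 for the line Y = 2.81 + 0.177X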
Least Squares Estimators
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = 70690 / 19800 = 3.5702
b0 = Ȳ − b1 X̄ = 312.28 − 3.5702(70.0) = 62.37
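These formulas are easy to apply numerically. A minimal Python sketch (using the three-subject age and attempts data from earlier, since the raw data behind the sums 70690 and 19800 are not listed here) reproduces the fitted line Y = 2.81 + 0.177X used in the criterion example:

import numpy as np

# Three (age, attempts) observations from the earlier example
X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])

# Least squares estimators from the formulas above
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

print(b1, b0)  # roughly 0.177 and 2.81, matching the fitted line used earlier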
Example: Fitted Regression
E [Y ] = β0 + β1 X
Ŷ = b0 + b1 X
I The sum of the residuals is zero: Σ ei = 0
I The sum of the squared residuals, Σ ei², is a minimum; this
is the least squares requirement.
I The sum of the observed values equals the sum of the
fitted values: Σ Yi = Σ Ŷi
I The sum of the weighted residuals is zero when the
residual in the ith trial is weighted by the level of the
predictor variable in the ith trial, as well as when it is
weighted by the fitted value for the ith trial:
Σ Xi ei = Σ Ŷi ei = 0
I The regression line always goes through the point (X̄, Ȳ).
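These properties can be verified numerically; a small Python sketch, again using the three-subject example data for illustration:

import numpy as np

X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

Y_hat = b0 + b1 * X      # fitted values
e = Y - Y_hat            # residuals

print(np.sum(e))          # ~0: residuals sum to zero
print(np.sum(X * e))      # ~0: residuals weighted by X sum to zero
print(np.sum(Y_hat * e))  # ~0: residuals weighted by fitted values sum to zero
print(np.sum(Y), np.sum(Y_hat))      # equal sums of observed and fitted values
print(b0 + b1 * X.mean(), Y.mean())  # the line passes through (X-bar, Y-bar)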
Point Estimator of σ²: Single Population
s² = Σ(Yi − Ȳ)² / (n − 1)
which is an unbiased estimator of the variance σ 2 of an
infinite population.
I The sample variance is often called a mean square,
because a sum of squares has been divided by the
appropriate number of degrees of freedom.
Point Estimator of σ²: Regression Model
s² = MSE = SSE / (n − 2) = Σ(Yi − Ŷi)² / (n − 2) = Σ ei² / (n − 2)
I MSE is an unbiased estimator of σ 2 for the regression
model
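A minimal sketch of the MSE computation, continuing the same illustrative three-observation example (with n = 3 there is only 1 degree of freedom, so the estimate is very crude):

import numpy as np

X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])
n = len(Y)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)

SSE = np.sum(e ** 2)
MSE = SSE / (n - 2)   # unbiased point estimate of sigma^2 under the regression model
print(SSE, MSE)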
Lecture 3
∂(Σᵢ₌₁ⁿ ei²)/∂b2 = −2 Σᵢ₌₁ⁿ xi (yi − b1 − b2 xi) = 0    (2)

(Σ xi) b1 + (Σ xi²) b2 = Σ xi yi

in matrix notation

[ n      Σ xi  ] [ b1 ]   [ Σ yi    ]
[ Σ xi   Σ xi² ] [ b2 ] = [ Σ xi yi ]
Example
Table: Example
Case   X   Y      XY      X²
1      0   2.1    0.0      0
2      1   7.7    7.7      1
3      2   13.6   27.2     4
4      3   27.2   81.6     9
5      4   40.9   163.6   16
6      5   61.1   305.5   25
n = 6, Σ xi = 15, Σ yi = 152.6, Σ xi yi = 585.6, and Σ xi² = 55
[ 6    15 ] [ b1 ]   [ 152.6 ]
[ 15   55 ] [ b2 ] = [ 585.6 ]

that is,

6 b1 + 15 b2 = 152.6
15 b1 + 55 b2 = 585.6

Multiplying the first equation by 5 and the second by 2 gives

30 b1 + 75 b2 = 763.0
30 b1 + 110 b2 = 1171.2

Subtracting the first from the second, 35 b2 = 408.2, so
we get b2 = 11.66 and b1 = −3.72
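The same solution can be obtained by solving the 2 × 2 normal equations directly, for example with NumPy (a sketch, not part of the worked elimination above):

import numpy as np

# Normal equations for the six-case example, in matrix form A @ b = c
A = np.array([[6.0, 15.0],
              [15.0, 55.0]])
c = np.array([152.6, 585.6])

b = np.linalg.solve(A, c)
print(b)  # approximately [-3.72, 11.66], i.e. b1 (intercept) and b2 (slope)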
Introduction
The densities for all three sample observations (250, 265, and 259)
for the two candidate values of µ are as follows:
Method of Maximum likelihood
I The method of maximum likelihood uses the product of
the densities (i.e., here, the product of the three heights)
as the measure of consistency of the parameter value with
the sample data.
I The product is called the likelihood value of the
parameter value µ and is denoted by L(µ).
I If the value of µ is consistent with the sample data, the
densities will be relatively large and so will be the product
L(µ), the likelihood value.
I If the value of µ is not consistent with the data, the
densities will be small and the product L(µ) will be small.
Maximum likelihood Estimate
I There are two methods of finding maximum likelihood
estimates: by a systematic numerical search and by use of
an analytical solution.
I For some problems, analytical solutions for the maximum
likelihood estimators are available.
I For others, a computerized numerical search must be
conducted.
I The product of the densities viewed as a function of the
unknown parameters is called the likelihood function. For
our example, where σ = 10, the likelihood function is:
" #3 " 2 # " 2 #
1 1 250 − µ 1 265 − µ
p exp − exp −
(2π)102 2 10 2 10
" 2 #
1 259 − µ
∗ exp −
2 10
(3)
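A minimal numerical sketch of this idea: evaluate L(µ) on a grid of candidate values for the three observations in (3) with σ = 10. The grid limits 230 to 290 are an arbitrary choice for illustration; the maximum occurs near the sample mean 258, which is the analytical MLE of µ:

import numpy as np

# Likelihood of mu for the three observations in equation (3), sigma = 10 known
y = np.array([250.0, 265.0, 259.0])
sigma = 10.0

def L(mu):
    dens = (1.0 / np.sqrt(2 * np.pi * sigma**2)) * np.exp(-0.5 * ((y - mu) / sigma)**2)
    return np.prod(dens)

mus = np.linspace(230, 290, 601)
vals = np.array([L(m) for m in mus])
print(mus[np.argmax(vals)])  # close to 258, the sample mean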
Regression model
I The density of an observation Yi for the normal error
regression model Yi = β0 + β1 Xi + εi, where the εi are
independent N(0, σ²), is as follows, utilizing the fact that
E[Yi] = β0 + β1 Xi and σ²[Yi] = σ²:
fi = [1/√(2πσ²)] exp[−(1/2)((Yi − β0 − β1 Xi)/σ)²]
I The likelihood function for the n observations Y1, Y2, ..., Yn is
the product of the individual densities.
I The variance σ² of the error terms is usually unknown, so the
likelihood is a function of three parameters, β0, β1, and σ²:
L(β0, β1, σ²) = ∏ᵢ₌₁ⁿ [1/(2πσ²)^(1/2)] exp[−(1/(2σ²))(Yi − β0 − β1 Xi)²]
             = [1/(2πσ²)^(n/2)] exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (Yi − β0 − β1 Xi)²]
MLE
Σᵢ₌₁ⁿ Xi (Yi − β̂0 − β̂1 Xi) = 0

σ̂² = Σᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi)² / n

The first equation is the same as the corresponding least squares
normal equation, and σ̂² is a biased estimator of σ² (it divides by
n rather than n − 2).
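A short sketch contrasting the biased MLE of σ² (division by n) with MSE (division by n − 2), using the illustrative three-observation data from earlier:

import numpy as np

X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])
n = len(Y)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
SSE = np.sum((Y - (b0 + b1 * X)) ** 2)

sigma2_mle = SSE / n    # biased (maximum likelihood) estimator
MSE = SSE / (n - 2)     # unbiased estimator
print(sigma2_mle, MSE)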
Properties of MLE
The maximum likelihood estimators β̂0 and β̂1 are the same
as the least squares estimators b0 and b1, so they have the
properties of all least squares estimators:
I They are unbiased.
I They have minimum variance among all unbiased linear
estimators.
In addition, the maximum likelihood estimators b0 and b1 for
the normal error regression model have other desirable
properties:
I They are consistent
I They are sufficient
I They are minimum variance unbiased; that is, they have
minimum variance in the class of all unbiased estimators
(linear or otherwise).
Thus, for the normal error model, the estimators b0 and b1 ,
have many desirable properties.
Exercise
The data below, involving 10 shipments, were collected on the
number of times the carton was transferred from one aircraft
to another over the shipment route (X) and the number of
ampules found to be broken upon arrival (Y). Assume that the
simple linear regression model is appropriate.
Inference in Regression
The estimator b1 can be written as a linear combination of the
observations, b1 = Σ Ki Yi, where
Ki = (Xi − X̄) / Σ(Xi − X̄)²
Notes
Σ Ki = Σ (Xi − X̄)/Σ(Xi − X̄)² = [1/Σ(Xi − X̄)²] Σ (Xi − X̄) = 0

Σ Ki Xi = Σ Xi(Xi − X̄)/Σ(Xi − X̄)² = [1/Σ(Xi − X̄)²] (Σ Xi² − X̄ Σ Xi)
        = [1/Σ(Xi − X̄)²] (Σ Xi² − nX̄²) = 1

Σ Ki² = Σ (Xi − X̄)²/[Σ(Xi − X̄)²]² = 1/Σ(Xi − X̄)²
Sampling distribution of b1
The variance of b1 is
σ²[b1] = σ²[Σ Ki Yi] = Σ Ki² σ²[Yi] = σ² Σ Ki² = σ² / Σ(Xi − X̄)²
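These identities for the weights Ki, and the resulting variance formula, can be checked numerically; the X values and σ² below are arbitrary choices for illustration only:

import numpy as np

X = np.array([1.0, 3.0, 4.0, 7.0, 10.0])   # illustrative predictor values
K = (X - X.mean()) / np.sum((X - X.mean()) ** 2)

print(np.sum(K))                                      # ~0
print(np.sum(K * X))                                  # ~1
print(np.sum(K ** 2), 1 / np.sum((X - X.mean()) ** 2))  # equal

sigma2 = 4.0  # an assumed error variance, for illustration
print(sigma2 * np.sum(K ** 2))  # equals sigma^2 / sum((Xi - X-bar)^2)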
Estimated Variance
(b1 − β1) / s[b1] ∼ t(n − 2)
The 1 − α confidence limits for β1 are b1 ± t(1 − α/2; n − 2) s[b1].
H0 : β1 = 0
Ha : β1 ≠ 0
The analyst wishes to control the risk of a Type I error at
α = .05.
Test statistic
I An explicit test of the alternatives is based on the test
statistic:
t* = b1 / s[b1]
I The decision rule with this test statistic for controlling
the level of significance at α is:
If |t*| ≤ t(1 − α/2; n − 2), conclude H0
If |t*| > t(1 − α/2; n − 2), conclude Ha
b0 = Ȳ − b1 X̄
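A sketch of the computation, using quantities that appear in these notes (b1 = 3.5702, MSE = 2384, Σ(Xi − X̄)² = 19800) together with the standard relation s²[b1] = MSE/Σ(Xi − X̄)²; n = 25 is assumed, as implied by the 1/25 term on the next slide:

import numpy as np
from scipy import stats

b1, MSE, Sxx, n = 3.5702, 2384.0, 19800.0, 25

s_b1 = np.sqrt(MSE / Sxx)           # estimated standard error of b1
t_star = b1 / s_b1                  # test statistic for H0: beta1 = 0
t_crit = stats.t.ppf(0.975, n - 2)  # two-sided critical value at alpha = .05

print(s_b1, t_star, t_crit)         # roughly 0.347, 10.3, 2.07 -> conclude Ha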
Toluca Example
s²[b0] = MSE [1/n + X̄²/Σ(Xi − X̄)²] = 2384 [1/25 + 70.0²/19800] = 685.34

s[b0] = √685.34 = 26.18
Confidence Interval for β0
The 1 − α confidence interval is:
17.5 ≤ β0 ≤ 107.2
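The interval can be reproduced from b0 = 62.37 and s[b0] = 26.18; the limits above correspond to a 90 percent confidence level (an assumption, since the level is not stated on this slide):

from scipy import stats

b0, s_b0, n = 62.37, 26.18, 25
t_mult = stats.t.ppf(0.95, n - 2)   # t(0.95; 23)

lower, upper = b0 - t_mult * s_b0, b0 + t_mult * s_b0
print(lower, upper)                 # approximately 17.5 and 107.2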
Prediction
s[Ŷh] = √(s²[Ŷh]) = 9.918
The 1 − α confidence limits for E[Yh] are:
277.4, 311.4
We conclude with confidence coefficient .90 that the mean
number of work hours required when lots of 65 units are
produced is somewhere between 277.4 and 311.4 hours. We
see that our estimate of the mean number of work hours is
moderately precise.
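A sketch reproducing these limits from the quantities given in the notes (Xh = 65, b0 = 62.37, b1 = 3.5702, s[Ŷh] = 9.918, the stated .90 confidence coefficient, and n = 25 assumed):

from scipy import stats

b0, b1, s_yhat, n, Xh = 62.37, 3.5702, 9.918, 25, 65.0

y_hat = b0 + b1 * Xh                # estimated mean response at Xh
t_mult = stats.t.ppf(0.95, n - 2)   # 90 percent two-sided multiplier
print(y_hat - t_mult * s_yhat, y_hat + t_mult * s_yhat)  # roughly 277.4 and 311.4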
Exercise I
Source of variation    SS                    df      MS                   E{MS}
Regression             SSR = Σ(Ŷi − Ȳ)²      1       MSR = SSR/1          σ² + β1² Σ(Xi − X̄)²
Error                  SSE = Σ(Yi − Ŷi)²     n − 2   MSE = SSE/(n − 2)    σ²
Total                  SSTO = Σ(Yi − Ȳ)²     n − 1
F Test of β1 = 0 versus β1 ≠ 0
I The analysis of variance approach provides us with a
battery of highly useful tests for regression models (and
other linear statistical models).
I For the simple linear regression case considered here, the
analysis of variance provides us with a test for:
H0 : β1 = 0
Ha : β1 ≠ 0
I Test statistic: F* compares MSR and MSE in the
following fashion:
F* = MSR / MSE
I Sampling distribution: when H0 holds, F* follows the
F(1, n − 2) distribution.
Decision Rule
I If F ∗ ≤ F (1 − α; 1, n − 2) , conclude H0
I If F ∗ > F (1 − α; 1, n − 2), conclude Ha
where F (1 − α; 1, n − 2) is the (1 − α) 100 percentile of the
appropriate F distribution
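A sketch of the F test assembled from summary quantities appearing earlier in these notes; SSR is computed as b1² Σ(Xi − X̄)², as suggested by the E{MS} column of the ANOVA table, and n = 25 is assumed:

from scipy import stats

b1, Sxx, MSE, n = 3.5702, 19800.0, 2384.0, 25

MSR = (b1 ** 2) * Sxx / 1             # regression mean square with 1 df
F_star = MSR / MSE
F_crit = stats.f.ppf(0.95, 1, n - 2)

print(F_star, F_crit)                 # roughly 105.9 versus 4.28 -> conclude Ha
print(stats.f.sf(F_star, 1, n - 2))   # corresponding p-value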
Example: Toluca Company
r = [Σxy − (Σx)(Σy)/n] / √{ [Σx² − (Σx)²/n] [Σy² − (Σy)²/n] }
  = cov(x, y) / [(sd of x)(sd of y)]
Example
A sample of 6 children was selected, and data about their age in
years and weight in kilograms were recorded as shown in the
following table. It is required to find the correlation between
age and weight.
b1 = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n]
   = [461 − (41)(66)/6] / [291 − (41)²/6] = 0.923
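The same value follows directly from the summary sums (a one-line check in Python):

# Slope computed directly from the summary sums given above
n = 6
sum_x, sum_y = 41.0, 66.0
sum_xy, sum_x2 = 461.0, 291.0

b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
print(b1)  # approximately 0.923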
Using correlation and SD
b1 = r (sd of Y) / (sd of X)
y = α + β1 x1 + β2 x1² + ε
Σᵢ₌₁ⁿ (yi − ȳ)²  =  Σᵢ₌₁ⁿ (ŷi − ȳ)²  +  Σᵢ₌₁ⁿ (yi − ŷi)²
Total sum of squares = Regression sum of squares + Residual sum of squares
------------------------------------------------------------------------------
headcirc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gestage | .7800532 .0630744 12.367 0.000 .6548841 .9052223
_cons | 3.914264 1.829147 2.140 0.035 .2843818 7.544146
------------------------------------------------------------------------------
. lm(headcirc~birthwt)
------------------------------------------------------------------------------
headcirc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
birthwt | .0074918 .0005699 13.146 0.000 .0063609 .0086228
_cons | 18.21758 .6446606 28.259 0.000 16.93827 19.49689
------------------------------------------------------------------------------
------------------------------------------------------------------------------
headcirc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gestage | .4487328 .067246 6.673 0.000 .3152682 .5821975
birthwt | .0047123 .0006312 7.466 0.000 .0034596 .005965
_cons | 8.308015 1.578943 5.262 0.000 5.174251 11.44178
------------------------------------------------------------------------------
Results
1. The overall F statistic of the regression is associated with a
p-value p < 0.0001. This means that there is a significant
overall linear association between gestational age and birth
weight (the independent variables) and head circumference
(the dependent variable). Note that the F test does not
distinguish the contributions or significance of the
individual variables.
2. The least-squares equation that describes head circumference
(Y) in terms of gestational age (X1) and birth weight (X2) is
as follows:
Y = 8.308 + 0.4487X1 + 0.0047X2
This means that, all else being equal, each one-week increase in
gestational age is associated with a 0.45 centimeter increase in
head circumference, while (again all else being equal) each increase
of one gram in birth weight is associated with a 0.005 centimeter
increase in head circumference (equivalently, one kilogram, i.e.
1,000 grams, corresponds to roughly a 4.7 centimeter increase).
Yi = β0 + β1 Xi + εi
Y1 = β0 + β1 X1 + ε1
Y2 = β0 + β1 X2 + ε2
...
Yn = β0 + β1 Xn + εn
Matrix
Y = [Y1, Y2, ..., Yn]ᵀ
X = the n × 2 matrix whose ith row is (1, Xi)
β = [β0, β1]ᵀ
ε = [ε1, ε2, ..., εn]ᵀ
Thus the linear regression equation in matrix form is
Y = Xβ + ε
and
E [Y ] = X β
Normal equation
Recall the normal equations
n b0 + b1 Σ Xi = Σ Yi
b0 Σ Xi + b1 Σ Xi² = Σ Xi Yi
In matrix notation,
XᵀX b = XᵀY
where b = [b0, b1]ᵀ is the vector of the least squares regression
coefficients. Thus
[ n      Σ Xi  ] [ b0 ]   [ Σ Yi    ]
[ Σ Xi   Σ Xi² ] [ b1 ] = [ Σ Xi Yi ]
Estimated regression coefficients
b = (XᵀX)⁻¹ XᵀY

XᵀX = [ 1    1  ⋯  1  ] [ 1  80 ]   [ 25     1750   ]
      [ 80  30  ⋯  70 ] [ 1  30 ] = [ 1750   142300 ]
                        [ ⋮   ⋮  ]
                        [ 1  70 ]
Cont: Example
XᵀY = [ 1    1  ⋯  1  ] [ 399 ]   [ 7807   ]
      [ 80  30  ⋯  70 ] [ 121 ] = [ 617180 ]
                        [  ⋮  ]
                        [ 323 ]
(XᵀX)⁻¹ = [  0.287475   −0.003535   ]
          [ −0.003535    0.00005051 ]

b = [ b0 ] = (XᵀX)⁻¹ (XᵀY) = [  0.287475   −0.003535   ] [ 7807   ]   [ 62.37  ]
    [ b1 ]                   [ −0.003535    0.00005051 ] [ 617180 ] = [ 3.5702 ]
Fitted values
Ŷ = Xb
Thus
Ŷ = X(XᵀX)⁻¹XᵀY
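A sketch of these matrix computations in NumPy, using the XᵀX and XᵀY values shown above (the raw design matrix X is not listed here, so only the coefficient step is reproduced):

import numpy as np

# Matrix form of the least squares estimates, using the X'X and X'Y values above
XtX = np.array([[25.0, 1750.0],
                [1750.0, 142300.0]])
XtY = np.array([7807.0, 617180.0])

XtX_inv = np.linalg.inv(XtX)
b = XtX_inv @ XtY                 # b = (X'X)^{-1} X'Y
print(XtX_inv)                    # matches the inverse shown above
print(b)                          # approximately [62.37, 3.5702]

# With the raw X matrix and Y vector available, the fitted values would be
# Y_hat = X @ b, i.e. Y_hat = X (X'X)^{-1} X'Y.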