
Stats 100C – Linear Models

University of California, Los Angeles

Duc Vu
Fall 2021

This is Stats 100C – Linear Models taught by Professor Christou. There is no official textbook
for the course; instead, handouts and reference materials are distributed and can be accessed
through the class website. You can find other math/stats lecture notes on my personal blog.
Let me know by email if you notice anything mathematically wrong or concerning. Thank you!

Contents

1 Lec 1: Sep 27, 2021 4


1.1 Simple Linear Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Prediction Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Lec 2: Sep 29, 2021 6


2.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Lec 3: Oct 1, 2021 10


3.1 Gauss-Markov Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Estimation of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Distribution Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4 Lec 4: Oct 4, 2021 13


4.1 Centered Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 Distribution Theory Using the Centered Model . . . . . . . . . . . . . . . . . . . . . 13

5 Lec 5: Oct 6, 2021 16


5.1 Distribution Theory Using Non-Centered Model . . . . . . . . . . . . . . . . . . . . . 16
5.2 A Note on Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.3 Coefficient of Determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

6 Lec 6: Oct 8, 2021 19


6.1 Variance & Covariance Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
6.3 Prediction Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

7 Lec 7: Oct 11, 2021 22


7.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

8 Lec 8: Oct 13, 2021 25


8.1 Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
8.2 Power Analysis in Simple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

9 Lec 9: Oct 15, 2021 28


9.1 Extra Sum of Squares Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
9.2 Power Analysis Using Non-Central F Distribution . . . . . . . . . . . . . . . . . . . 30

10 Lec 10: Oct 18, 2021 31


10.1 Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

11 Lec 11: Oct 20, 2021 34


11.1 Multiple Regression (Cont’d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

12 Lec 12: Oct 22, 2021 37


12.1 Gauss-Markov Theorem in Multiple Regression . . . . . . . . . . . . . . . . . . . . . 37
12.2 Gauss-Markov Theorem For a Linear Combination . . . . . . . . . . . . . . . . . . . 38
12.3 Review of Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . 38

13 Lec 13: Oct 25, 2021 40


13.1 Theorems in Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . 40

14 Lec 14: Oct 27, 2021 44


14.1 Mean and Variance in Multivariate Normal Distribution . . . . . . . . . . . . . . . . 44
14.2 Independent Vectors in Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . 45
14.3 Partial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

15 Lec 15: Oct 29, 2021 47


15.1 Partial Regression (Cont’d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

16 Lec 16: Nov 1, 2021 50


16.1 Partial Regression (Cont’d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
16.2 Partial Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

17 Lec 17: Nov 3, 2021 53


17.1 Constrained Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

18 Lec 18: Nov 5, 2021 56


18.1 Quadratic Forms of Normally Distributed Random Variables . . . . . . . . . . . . . 56

19 Lec 19: Nov 8, 2021 60


19.1 Quadratic Forms and Their Distribution – Overview . . . . . . . . . . . . . . . . . . 60
19.2 Another Proof of Quadratic Forms and Their Distribution . . . . . . . . . . . . . . . 61
19.3 Efficiency of Least Squares Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . 62

20 Lec 20: Nov 10, 2021 63


20.1 Information Matrix and Efficient Estimator . . . . . . . . . . . . . . . . . . . . . . . 63
20.2 Centered Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

21 Lec 21: Nov 12, 2021 66


21.1 Confidence Intervals in Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . 66
21.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

22 Lec 22: Nov 15, 2021 69


22.1 F Test for the General Linear Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . 69
22.2 F Statistics and t statistics in Multiple Regression . . . . . . . . . . . . . . . . . . . 69
22.3 Power Analysis in Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 70
22.4 F Statistics Using the Extra Sum of Squares . . . . . . . . . . . . . . . . . . . . . . . 70


23 Lec 23: Nov 17, 2021 72


23.1 Testing the Overall Significance of the Model . . . . . . . . . . . . . . . . . . . . . . 72
23.2 Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
23.3 Multi-Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

24 Lec 24: Nov 19, 2021 75


24.1 Centered and Scaled Model in Matrix/Vector Form . . . . . . . . . . . . . . . . . . . 75

25 Lec 25: Nov 22, 2021 78


25.1 Multi-Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
25.2 Generalized Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

26 Lec 26: Nov 24, 2021 81


26.1 Generalized Least Squares (Cont’d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
26.2 Comparing Regression Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

27 Lec 27: Nov 29, 2021 83


27.1 Comparing Regression Equations (Cont’d) . . . . . . . . . . . . . . . . . . . . . . . . 83
27.2 Deleting a Single Point in Multiple Regression . . . . . . . . . . . . . . . . . . . . . . 84

28 Lec 28: Dec 1, 2021 85


28.1 Deleting a Single Point in Multiple Regression (Cont’d) . . . . . . . . . . . . . . . . 85
28.2 Influential Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

29 Lec 29: Dec 3, 2021 87


29.1 Influential Analysis (Cont’d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
29.2 Externally Studentized Residual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
29.3 A Note on Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89


§1 Lec 1: Sep 27, 2021

§1.1 Simple Linear Regression Models


Consider
\[ Y_i = \mu + \varepsilon_i \]
with $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma)$; specifically, $Y_1, \dots, Y_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma)$. We want to estimate $\mu$ and $\sigma^2$ using least squares or the method of maximum likelihood (MML).

Method of Least Squares (OLS – Ordinary Least Squares):
\[ \min Q = \sum_{i=1}^n (Y_i - \mu)^2, \qquad \frac{\partial Q}{\partial \mu} = -2\sum (Y_i - \mu) = 0, \qquad \sum Y_i - n\hat\mu = 0 \;\Longrightarrow\; \hat\mu = \bar Y \]

Method of Maximum Likelihood (MML):
\[ f(y_i) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2} = \big(2\pi\sigma^2\big)^{-\frac12} e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2} \]
\[ L = f(y_1)\cdots f(y_n) = \big(2\pi\sigma^2\big)^{-\frac n2} e^{-\frac{1}{2\sigma^2}\sum (y_i - \mu)^2}, \qquad \ln L = -\frac n2 \ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum (y_i - \mu)^2 \]
\[ \frac{\partial \ln L}{\partial \mu} = 0, \qquad \frac{\partial \ln L}{\partial \sigma^2} = 0 \]
Solving the above, we obtain the MLEs of $\mu$ and $\sigma^2$:
\[ \hat\mu = \bar y, \qquad \hat\sigma^2 = \frac{\sum (y_i - \hat\mu)^2}{n} = \frac{\sum (y_i - \bar y)^2}{n} \]
Notice that $\hat\sigma^2$ is biased, and we adjust it to be unbiased as follows:
\[ S^2 = \frac{\sum (y_i - \bar y)^2}{n - 1} \]

§1.2 Prediction Problem


Given $Y_1, \dots, Y_n$, we want to predict a new $Y$, e.g., $Y_0$. An educated guess here is $\hat Y_0 = \bar Y$.

1. Predictor assumption: $\hat Y_0 = \sum_{i=1}^n a_i Y_i$.

2. We want $\hat Y_0$ to be unbiased, i.e., $E\hat Y_0 = \mu$:
\[ E\sum a_i Y_i = \mu, \qquad \sum a_i EY_i = \mu \;\Longrightarrow\; \sum a_i = 1 \]


3. Minimize the mean square error of prediction, i.e.,
\[ E\big(Y_0 - \hat Y_0\big)^2 \quad \text{s.t.} \quad \sum a_i = 1 \]
Notice that this is a constrained optimization problem; we use the method of Lagrange multipliers to obtain
\[ \min Q = E\big(Y_0 - \hat Y_0\big)^2 - 2\lambda\Big(\sum a_i - 1\Big) \]
Note: $EW^2 = \operatorname{var}(W) + (EW)^2$, and here $E(Y_0 - \hat Y_0) = 0$, so
\begin{align*}
\min Q &= \operatorname{var}\big(Y_0 - \hat Y_0\big) - 2\lambda\Big(\sum a_i - 1\Big) \\
&= \operatorname{var}(Y_0) + \operatorname{var}(\hat Y_0) - 2\operatorname{cov}\big(Y_0, \hat Y_0\big) - 2\lambda\Big(\sum a_i - 1\Big) \\
&= \sigma^2 + \sigma^2 \sum a_i^2 - 2\lambda\Big(\sum a_i - 1\Big),
\end{align*}
where $\operatorname{cov}(Y_0, \hat Y_0) = 0$ because the new $Y_0$ is independent of $Y_1, \dots, Y_n$. Then
\[ \frac{\partial Q}{\partial a_i} = 2\sigma^2 a_i - 2\lambda = 0 \;\Longrightarrow\; a_i = \frac{\lambda}{\sigma^2} \]
Notice that $a_1 = a_2 = \dots = a_n = \lambda/\sigma^2$. So
\[ \sum a_i = \frac{n\lambda}{\sigma^2} = 1 \;\Longrightarrow\; \lambda = \frac{\sigma^2}{n} \]
Thus we can see that $a_i = \frac1n$, and therefore, since $\hat Y_0 = \sum a_i Y_i$, it follows that $\hat Y_0 = \bar Y$.
Prediction Interval:
\[ Y_0 - \hat Y_0 \sim N\!\left(0,\ \sigma\sqrt{1 + \tfrac1n}\right) \]
Recall from 100B that
\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \]
So,
\[ \frac{\dfrac{Y_0 - \hat Y_0 - 0}{\sigma\sqrt{1 + \frac1n}}}{\sqrt{\dfrac{(n-1)S^2}{\sigma^2}\Big/(n-1)}} = \frac{Y_0 - \hat Y_0}{S\sqrt{1 + \frac1n}} \sim t_{n-1} \]
We can now construct the prediction interval for $Y_0$ as follows:
\[ P\!\left(-t_{\alpha/2;\,n-1} \le \frac{Y_0 - \hat Y_0}{S\sqrt{1 + \frac1n}} \le t_{\alpha/2;\,n-1}\right) = 1 - \alpha \]
Finally, $Y_0 \in \hat Y_0 \pm t_{\alpha/2;\,n-1}\, S\sqrt{1 + \frac1n}$.

Remark 1.1. Compare this to the confidence interval for $\mu$: $\mu \in \bar Y \pm t_{\alpha/2;\,n-1}\, S/\sqrt n$.
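As a quick numerical sketch of these two intervals (the sample values below are made up for illustration, not from class), scipy's t quantile does all the work:

```python
import numpy as np
from scipy import stats

# Hypothetical sample
y = np.array([4.1, 5.3, 4.8, 5.9, 5.1, 4.6, 5.4, 5.0])
n = len(y)
ybar = y.mean()
S = y.std(ddof=1)                         # unbiased S (divides by n - 1)
alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)

# Prediction interval for a new Y0: Ybar +/- t * S * sqrt(1 + 1/n)
pi = (ybar - tcrit * S * np.sqrt(1 + 1/n), ybar + tcrit * S * np.sqrt(1 + 1/n))
# Confidence interval for mu: Ybar +/- t * S / sqrt(n)
ci = (ybar - tcrit * S / np.sqrt(n), ybar + tcrit * S / np.sqrt(n))
print(pi, ci)                             # the PI is always wider than the CI
```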


§2 Lec 2: Sep 29, 2021

§2.1 Linear Regression


Consider a simple regression model

Yi = β0 + β1 Xi + εi
or Yi = β1 Xi + εi

Data:

y      x
y_1    x_1
...    ...
y_n    x_n

[Flowchart: data and theory feed into the model; the model is estimated and checked ("is it adequate?"); if no, return to the model step; if yes, use it.]

where the parameters are
\[ \begin{cases} \beta_0: \text{intercept} \\ \beta_1: \text{slope} \end{cases} \]
and $X_1, \dots, X_n$ are predictors that are not random; $\varepsilon_1, \dots, \varepsilon_n$ are random error terms (also called disturbances or stochastic terms); and $Y_1, \dots, Y_n$ are the random response variables.
Assumption (Gauss-Markov Conditions):
\[ E(\varepsilon_i) = 0, \qquad \operatorname{var}(\varepsilon_i) = \sigma^2, \]
and $\varepsilon_1, \dots, \varepsilon_n$ are independent. Using the Gauss-Markov conditions,
\[ EY_i = \beta_0 + \beta_1 X_i, \qquad \operatorname{var}(Y_i) = \sigma^2 \]
Least squares minimizes
\[ \min Q = \sum \varepsilon_i^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2 \]
\[ \frac{\partial Q}{\partial \beta_0} = -2\sum (Y_i - \beta_0 - \beta_1 X_i) = 0, \qquad \frac{\partial Q}{\partial \beta_1} = -2\sum (Y_i - \beta_0 - \beta_1 X_i)\, X_i = 0 \]


So,
\[ \begin{cases} \sum y_i - n\beta_0 - \beta_1 \sum x_i = 0 \\ \sum x_i y_i - \beta_0 \sum x_i - \beta_1 \sum x_i^2 = 0 \end{cases}
\;\Longrightarrow\;
\begin{cases} n\beta_0 + \beta_1 \sum x_i = \sum y_i \\ \beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i \end{cases} \quad \text{– normal equations} \]
We can solve the above to get $\hat\beta_0, \hat\beta_1$. In matrix form,
\[ \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}
\;\Longrightarrow\;
\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix} \]
Determinant of the matrix:
\[ n\sum x_i^2 - \Big(\sum x_i\Big)^2 = n\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right] = n\sum (x_i - \bar x)^2 \ge 0 \]

If $x_1 = x_2 = \dots = x_n = \bar x$, then $\sum (x_i - \bar x)^2 = 0$ and the matrix is singular. Otherwise, from the normal equations we get
\[ \hat\beta_0 = \bar y - \hat\beta_1 \bar x \quad \text{from (1)} \]
and plugging (1) into (2) gives
\[ \hat\beta_1 = \frac{\sum x_i y_i - \frac1n (\sum x_i)(\sum y_i)}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}
= \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}
= \frac{\sum (x_i - \bar x)\, y_i}{\sum (x_i - \bar x)^2} \tag{*} \]
\[ = \frac{\sum (y_i - \bar y)\, x_i}{\sum (x_i - \bar x)^2}
= \frac{\sum x_i y_i - n\bar x \bar y}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} \]
Note: From (*), we have
\[ \hat\beta_1 = \frac{\sum (x_i - \bar x)\, y_i}{\sum (x_i - \bar x)^2}
= \frac{(x_1 - \bar x)\, y_1}{\sum (x_i - \bar x)^2} + \dots + \frac{(x_n - \bar x)\, y_n}{\sum (x_i - \bar x)^2}
= k_1 y_1 + \dots + k_n y_n = \sum_{i=1}^n k_i y_i \]


where $k_i = \frac{x_i - \bar x}{\sum (x_i - \bar x)^2}$. Notice that
\[ \sum k_i = 0, \qquad \sum k_i^2 = \frac{1}{\sum (x_i - \bar x)^2}, \qquad \sum k_i x_i = \frac{\sum (x_i - \bar x)\, x_i}{\sum (x_i - \bar x)^2} = 1 \]
Properties of $\hat\beta_1$:
\[ E\hat\beta_1 = E\sum k_i y_i = \sum k_i E y_i = \sum k_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum k_i + \beta_1 \sum k_i x_i = \beta_1 \quad \text{– unbiased} \]
For the variance,
\[ \operatorname{var}(\hat\beta_1) = \operatorname{var}\Big(\sum k_i y_i\Big) = \sum k_i^2 \operatorname{var}(Y_i) = \frac{\sigma^2}{\sum (x_i - \bar x)^2} \]

Properties of $\hat\beta_0$:
\[ \hat\beta_0 = \bar y - \hat\beta_1 \bar x = \sum \frac{y_i}{n} - \bar x \sum k_i y_i = \sum \Big(\frac1n - \bar x k_i\Big)\, y_i = \sum_{i=1}^n l_i y_i \]
where $l_i = \frac1n - \bar x k_i$, and the properties of $l_i$ are
\[ \sum l_i = 1, \qquad \sum l_i^2 = \sum \Big(\frac1n - \bar x k_i\Big)^2 = \sum \Big(\frac{1}{n^2} + \bar x^2 k_i^2 - \frac{2}{n}\bar x k_i\Big) = \frac1n + \frac{\bar x^2}{\sum (x_i - \bar x)^2}, \qquad \sum l_i x_i = 0 \]
Now, we can easily show that $\hat\beta_0$ is unbiased:
\[ E\hat\beta_0 = E\sum l_i y_i = \sum l_i E y_i = \sum l_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum l_i + \beta_1 \sum l_i x_i = \beta_0 \]


Thus,
\[ \operatorname{var}\big(\hat\beta_0\big) = \operatorname{var}\Big(\sum l_i y_i\Big) = \sigma^2 \sum l_i^2 = \sigma^2 \left(\frac1n + \frac{\bar x^2}{\sum (x_i - \bar x)^2}\right) \]
The fitted value is
\[ \hat y_i = \hat\beta_0 + \hat\beta_1 x_i = \bar y + \hat\beta_1 (x_i - \bar x) \]
and the residual is defined as $e_i = y_i - \hat y_i$, with properties
\[ \sum e_i = 0, \qquad \sum e_i x_i = 0, \qquad \sum e_i \hat y_i = 0 \]

Estimation Using MML:

Assume $\varepsilon_1, \dots, \varepsilon_n \overset{\text{i.i.d.}}{\sim} N(0, \sigma)$. Then $Y_i \sim N(\beta_0 + \beta_1 X_i, \sigma)$. The log-likelihood function is
\[ \ln L = -\frac n2 \ln 2\pi\sigma^2 - \frac{1}{2\sigma^2} \sum (y_i - \beta_0 - \beta_1 x_i)^2 \]
So we need to solve
\[ \frac{\partial \ln L}{\partial \beta_0} = 0, \qquad \frac{\partial \ln L}{\partial \beta_1} = 0 \]
to get $\hat\beta_0, \hat\beta_1$, which are the same as from the least squares method. Also,
\[ \frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum (y_i - \beta_0 - \beta_1 x_i)^2 = 0 \;\Longrightarrow\; \hat\sigma^2 = \frac{\sum e_i^2}{n} \]
Then,
\[ \sum (y_i - \bar y)^2 = \sum \big(\underbrace{y_i - \hat y_i}_{e_i} + \hat y_i - \bar y\big)^2 \]
which we expand to get
\[ \underbrace{\sum (y_i - \bar y)^2}_{\text{SST}} = \underbrace{\sum e_i^2}_{\text{SSE}} + \underbrace{\sum (\hat y_i - \bar y)^2}_{\text{SSR}} \]
in which
SST: sum of squares total
SSE: sum of squares error
SSR: sum of squares regression
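A small simulation sketch of these formulas and of the decomposition SST = SSE + SSR (the data-generating values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)                  # fixed predictors
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=50)    # hypothetical beta0 = 2, beta1 = 0.5, sigma = 1

Sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * y) / Sxx            # beta1_hat = sum((xi - xbar) yi) / Sxx
b0 = y.mean() - b1 * x.mean()                    # beta0_hat = ybar - beta1_hat * xbar
yhat = b0 + b1 * x
e = y - yhat

SST = np.sum((y - y.mean())**2)
SSE = np.sum(e**2)
SSR = np.sum((yhat - y.mean())**2)
assert np.isclose(SST, SSE + SSR)                # SST = SSE + SSR
assert np.isclose(e.sum(), 0) and np.isclose((e * x).sum(), 0)   # residual properties
```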


§3 Lec 3: Oct 1, 2021

§3.1 Gauss-Markov Theorem
Recall $\hat\beta_1 = \sum k_i Y_i$, where $k_i = \frac{x_i - \bar x}{\sum (x_i - \bar x)^2}$. Consider now
\[ b_1 = \sum a_i Y_i, \]
another linear unbiased estimator of $\beta_1$ (not the least squares one). Then $E b_1 = \beta_1$, or $E\sum a_i Y_i = \beta_1$. So
\[ \beta_1 = \sum a_i E Y_i = \sum a_i (\beta_0 + \beta_1 X_i) = \beta_0 \sum a_i + \beta_1 \sum a_i X_i \]
Thus,
\[ \begin{cases} \sum a_i = 0 \\ \sum a_i x_i = 1 \end{cases} \]
and we know that
\[ \operatorname{var}(b_1) = \operatorname{var}\Big(\sum_{i=1}^n a_i Y_i\Big) = \sigma^2 \sum a_i^2, \qquad \operatorname{var}(\hat\beta_1) = \sigma^2 \sum k_i^2 = \frac{\sigma^2}{\sum (x_i - \bar x)^2} \]
Now let $a_i = k_i + d_i$. Then
\[ \operatorname{var}(b_1) = \sigma^2 \sum (k_i + d_i)^2 = \sigma^2 \sum k_i^2 + \sigma^2 \sum d_i^2 + 2\sigma^2 \sum k_i d_i \]
We need to show $\sum k_i d_i = 0$:
\[ \sum k_i d_i = \sum k_i (a_i - k_i) = \sum k_i a_i - \sum k_i^2
= \frac{\sum (x_i - \bar x)\, a_i}{\sum (x_i - \bar x)^2} - \frac{1}{\sum (x_i - \bar x)^2}
= \frac{\sum x_i a_i}{\sum (x_i - \bar x)^2} - \frac{\bar x \sum a_i}{\sum (x_i - \bar x)^2} - \frac{1}{\sum (x_i - \bar x)^2} = 0, \]
using $\sum a_i x_i = 1$ and $\sum a_i = 0$. So $\operatorname{var}(b_1) \ge \operatorname{var}(\hat\beta_1)$, and therefore $\hat\beta_1$ is the best linear unbiased estimator (BLUE).

§3.2 Estimation of Variance
Using MML,
\[ \hat\sigma^2 = \frac{\sum e_i^2}{n} \]
Is it unbiased?
\[ E\hat\sigma^2 = \frac{\sum E e_i^2}{n} = \frac{\sum \big[\operatorname{var}(e_i) + (E e_i)^2\big]}{n} \]


Note: ei = Yi − Ŷi = Yi − β̂0 − β̂1 Xi . So


h i
Eei = E Yi − β̂0 − β̂1 Xi = (β0 + β1 Xi ) − (β0 + β1 Xi ) = 0

Then, P
var(ei )
E σ̂ 2 =
n
Notice that
ei = Yi − β̂0 − β̂1 Xi
or
ei = Yi − Y − β̂1 (Xi − X)
where Ŷi = Y + β̂1 (Xi − X). Substitute in and we get
h i
var(ei ) = var Yi − Y − β̂1 (Xi − X)

= var(Yi ) + var(Y ) + (Xi − X)2 var(β̂1 ) − 2 cov(Yi , Y ) − 2(Xi − X) cov(Yi , β̂1 )


+ 2(Xi − X) cov(Y , β̂1 )

Let’s compute each term there.

Yi = β0 + β1 Xi + εi
var(Yi ) = σ 2
P
εi
Y = β0 + β1 X +
n
σ2
var(Y ) =
n  
Y1 + . . . + Yi + . . . + Yn
cov(Yi , Y ) = cov Yi ,
n
1 1 1
= cov(Yi , Y1 ) + . . . + cov(Yi , Yi ) + . . . + cov(Yi , Yn )
n n n
σ2
=
n
X
cov(Yi , β̂1 ) = cov(Yi , ki Yi )
= cov(Yi , k1 Y1 ) + . . . + cov(Yi , ki Yi ) + . . . + cov(Yi , kn Yn )
= k1 cov(Yi , Y1 ) + . . . + ki cov(Yi , Yi ) + . . . + kn cov(Y1 , Yn )
xi − x
= σ 2 ki = σ 2 P
(xi − x)2
Note: A property of covariance

cov(aY, bQ) = ab cov(Y, Q)

And for the last term,


 
Y1 + . . . + Yn
cov(Y , β̂1 ) = cov , k1 Y1 + . . . + kn Yn
n
Y1 Yn
= cov( , k1 Y1 + . . . + kn Yn ) + . . . + cov( , k1 Y1 + . . . + kn Yn )
n n
σ2 σ2 σ2
= k1 + k2 + . . . + kn
n n n
σ2 X
= ki = 0
n


Now we're ready to compute the variance:
\[ \operatorname{var}(e_i) = \sigma^2 + \frac{\sigma^2}{n} + \frac{\sigma^2 (x_i - \bar x)^2}{\sum (x_i - \bar x)^2} - \frac{2\sigma^2}{n} - \frac{2\sigma^2 (x_i - \bar x)^2}{\sum (x_i - \bar x)^2}
= \sigma^2 \left(1 - \frac1n - \frac{(x_i - \bar x)^2}{\sum (x_i - \bar x)^2}\right) \]
Therefore,
\[ E\hat\sigma^2 = \frac{\sum \operatorname{var}(e_i)}{n} = \sigma^2\, \frac{\sum_{i=1}^n \left(1 - \frac1n - \frac{(x_i - \bar x)^2}{\sum (x_i - \bar x)^2}\right)}{n} = \frac{n-2}{n}\,\sigma^2 \]
It follows that the unbiased estimator of $\sigma^2$ is
\[ S_e^2 = \frac{n}{n-2}\,\hat\sigma^2 = \frac{\sum e_i^2}{n-2} \]
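A short Monte Carlo sketch of this result (the true values below are hypothetical): repeatedly fitting the line and averaging the values of $\sum e_i^2/(n-2)$ should land near $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1, sigma = 30, 1.0, 2.0, 1.5      # hypothetical true values
x = np.linspace(0, 5, n)
Sxx = np.sum((x - x.mean())**2)

reps = []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - x.mean()) * y) / Sxx
    b0 = y.mean() - b1 * x.mean()
    e = y - b0 - b1 * x
    reps.append(np.sum(e**2) / (n - 2))          # S_e^2 for this sample
print(np.mean(reps), sigma**2)                   # average should be close to sigma^2 = 2.25
```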

§3.3 Distribution Theory


Let $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ and assume $\varepsilon_1, \dots, \varepsilon_n \overset{\text{i.i.d.}}{\sim} N(0, \sigma)$. Then
\[ \hat\beta_1 = \sum k_i Y_i \;\Longrightarrow\; \hat\beta_1 \sim N\!\left(\beta_1,\ \frac{\sigma}{\sqrt{\sum (x_i - \bar x)^2}}\right) \]
\[ \hat\beta_0 = \sum l_i Y_i \;\Longrightarrow\; \hat\beta_0 \sim N\!\left(\beta_0,\ \sigma\sqrt{\frac1n + \frac{\bar x^2}{\sum (x_i - \bar x)^2}}\right) \]
We will show $\frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$ in the next lecture.


§4 Lec 4: Oct 4, 2021

§4.1 Centered Model
Consider the model: Yi = β0 + β1 Xi + εi , i = 1, . . . , n and Gauss-Markov conditions hold, i.e.,

E [εi ] = 0
var [εi ] = σ 2

i.i.d
for i = 1, . . . , n and ε1 , . . . , εn are independent (we assume ε1 , . . . , εn ∼ N (0, σ)). This is non-
centered model. Let’s look at a centered model

Yi = β0 + β1 Xi ± β1 X + εi
Yi = β0 + β1 X + β1 (Xi − X) + εi
Yi = γ0 + β1 Zi + εi – centered model

where P
γ 0 = β0 P + β1 X and Zi = Xi − X.
Note: zi = (xi − x) = 0 and z = 0. So,
P P P
(zi − z)yi zi yi (xi − x)yi
β̂1 = P 2
= P 2 = P – same as non-centered model
(zi − z) zi (xi − x)2
γ̂0 = y − β̂1 z = y

Notice ŷi = y + β̂1 (xi − x) which is the same as ŷi of the non-centered model.

§4.2 Distribution Theory Using the Centered Model
Have
 
Yi ∼ N γ0 + β1 Xi − X , σ
!
σ
β̂1 ∼ β1 , pP
(xi − x)2
 
σ
γ̂0 = Y ∼ N γ0 , √
n
(n−2)Se2 2
Now, let’s show that σ2 ∼ Xn−2 . We have

Yi − γ0 − β1 (Xi − X)
∼ N (0, 1)
σ
 2
Yi − γ0 − β1 (Xi − X)
∼ X12
σ2
It follows that Pn  2
i=1 Yi − γ0 − β1 (Xi − X)
∼ Xn2
σ2
(n−2)Se2 e2i
P
Notice that σ2 = σ2 . Let’s manipulate this expression. First, let

Ph i2
Yi − γ0 − β1 (Xi − X) ± γ̂0 ± β̂1 (Xi − X)
L=
σ2


Then,
Ph i2
yi − γ̂0 − β̂1 (xi − x) + (γ̂0 − γ0 ) + (β̂1 − β1 )(xi − x)
L=
σ2
Ph i2
ei + (γ̂0 − γ0 ) + (β̂1 − β1 )(xi − x
=
σ2
2
e2i (β̂1 − β1 )2 (xi − x)2
P P P
n (γ̂0 − γ0 ) 2(γ̂0 − γ0 ) ei
= + + +
σ2 σ2 σ2 σ2
P P
2(β̂1 − β1 ) ei (xi − x) 2(γ̂0 − γ0 )(β̂1 − β1 ) (xi − x)
+ +
σ2 σ2
So far,
" #2
2
(n − 2)Se2 γ̂0 − γ0
P
[yi − γ0 − β1 (xi − x)] β̂1 − β1
= + √ +
σ{z2 2
pP
| } | σ{z } |σ/{z n} σ/ (xi − x)2
2 Xn ? | {z }
X12 X12

Q = Q1 + Q2 + Q3
Let’s use moment generating function to find “?”. Notice that Q1 , Q2 , Q3 are independent why?

MQ (t) = MQ1 +Q2 +Q3


MQ (t) = MQ1 (t) · MQ2 (t) · MQ3 (t)
We have
n
Q ∼ Xn2 =⇒ MQ (t) = (1 − 2t)− 2
1
Q2 ∼ X12 =⇒ MQ2 (t) = (1 − 2t)− 2
1
Q3 ∼ X12 =⇒ MQ3 (t) = (1 − 2t)− 2
−n+2
=⇒ MQ1 (t) = (1 − 2t) 2

(n − 2)Se2 2
=⇒ Q1 = ∼ Xn−2
σ2
Note: If Y ∼ Γ(α, β) then
MY (t) = (1 − βt)−α
and
McY (t) = MY (ct)
Let’s now find the distribution of s2e .
σ2
Se2 = Q1
n−2
σ2
 
MSe2 (t) = M σ2 (t) = MQ1 t
n−2 Q1 n−2
 −n+2
2σ 2
 2

MSe2 (t) = 1− t
n−2
Therefore,
n − 2 2σ 2
 
Se2 ∼Γ ,
2 n−2
2σ 4
ESe2 = σ 2 , var(Se2 ) =
n−2


Another way to show this result is to use the non-centered model


P 2
Yi − β0 − β1 Xi ± β̂0 ± β̂1 Xi
σ2


§5 Lec 5: Oct 6, 2021

§5.1 Distribution Theory Using Non-Centered Model
(n−2)Se2 2
Recall that we want to show σ2 ∼ Xn−2 using the non-centered model Yi = β0 + β1 Xi + εi for
i.i.d
ε1 , . . . , εn ∼ N (0, σ). Then, Yi ∼ N (β0 + β1 Xi , σ). Let
P 2
Yi − β0 − β1 Xi ± β̂0 ± β̂1 Xi
M= ∼ Xn2
σ2
Then,
P 2
yi − β̂0 − β̂1 xi + (β̂0 − β0 ) + (β̂1 − β1 )xi
M=
σ2
e2i 2
(β̂1 − β1 )2 x2i
P P P P
n(β̂0 − β0 ) 2(β̂0 − β0 ) ei 2(β̂1 − β1 ) ei xi
= + + + +
σ2 σ2 σ2 σ2 σ2
P
2(β̂0 − β0 )(β̂1 − β1 ) xi
+
σ2
P 2
n(β̂0 − β0 )2 (β̂1 − β1 )2 x2i
P P
ei 2(β̂0 − β0 )(β̂1 − β1 ) xi
= 2
+ + + (**)
| σ{z } | σ2 σ2 {z σ2 }
2
(n−2)Se ?
σ2

Let D = β̂0 + β̂1 X = Y and consider


2
(β̂1 − β1 )2 (D − (β0 + β1 x))
+ (*)
var(β̂1 ) var(D)
 
σ
Note: β̂1 ∼ N β1 , √P and
(xi −x)2

Yi = β0 + β1 Xi + εi
P P
Yi εi
Y = = β0 + β1 X +
n n
 
So Y ∼ N β0 + β1 X, √σn and thus D−(β 0 +β1 X)

σ/ n
∼ N (0, 1). It follows that each term in (*) follows
chi-square distribution with 1 degree of freedom. Now, we have

(β̂1 − β1 )2 X 2 n(β̂0 − β0 )2 (β̂1 − β1 )2 2 2(β̂0 − β0 )(β̂1 − β1 ) X


(∗) = (x i − x) + + nx + xi
σ2 σ2 σ2 σ2
x2i − nx2

(β̂1 − β1 )2 (β̂1 − β1 )2 nx2
P
n(β̂0 − β0 )2
P
2(β̂0 − β0 )(β̂1 − β1 ) xi
= + + +
σ2 σ2 σ2 σ2
which is equivalent to the last three terms of (**). We just need to show that

cov(Y , β̂1 ) = 0
cov(Y , ei ) = 0
cov(β̂1 , ei ) = 0

Remark 5.1. Under normality, zero covariance implies independence.


§5.2 A Note on Gamma Distribution


Let Q ∼ Γ(α, β). Then

EQ = αβ
var(Q) = αβ 2
MQ (t) = (1 − βt)−α
Γ(α + k)β k
EQk =
Γ(α)

where Z ∞
Γ(α) = xα−1 e−x dx
0
is the Gamma function.
Property:

Γ(α) = (α − 1)Γ(α − 1)
Γ(α + 1) = αΓ(α)

If α is an integer, then
Γ(α) = (α − 1)!
 2

n−2 2σ
Recall that Se2 ∼ Γ 2 , n−2

2σ 4
ESe2 = σ 2 , var(Se2 ) =
n−2
Is Se unbiased estimator of σ?
 1
ESe = E Se2 2
  12
n−2 1 2σ 2
Γ 2 + 2 n−2
= n−2

Γ 2
r    
2 n−1 n−2
=σ Γ /Γ
n−2 2 2
= σA
Se
Thus, it’s biased and we can adjust the result to be unbiased, i.e., A.
If Y ∼ Xn2 , then
n
MY (t) = (1 − 2t)− 2
which is Γ n2 , 2 .


§5.3 Coefficient of Determination
Recall X X X
(yi − y)2 = e2i + (ŷi − y)2
| {z } | {z } | {z }
SST SSE SSR

where Ŷi = y + β̂1 (xi − x). We define R2 as

SSR SSE
R2 = or R2 = 1 −
SST SST


and 0 ≤ R2 ≤ 1. We have
 
var(Ŷi ) = var y + β̂1 (xi − x)
!
2 1 (xi − x)2
=σ +P 2
n (xi − x)

Another way to show this is to express Ŷi as a linear combination of Y1 , . . . , Yn .

Ŷi = y + β̂1 (xi − x)


P
yj X
= + (xi − x) kj yj
n
X1 
= + (xi − x)kj yj
n
X1 2
var(Ŷi ) = σ 2 + (xi − x)kj
n
X 1 2

= σ2 + (x i − x)2 2
kj + (x i − x)kj
n2 n
(xi − x)2
 
1
= σ2 +P
n (xi − x)2

Consider
P X 
X yl X 1
ei = yi − ŷi = yi − y − β̂1 (xi − x) = al yl − − (xi − x) kl yl = al − − (xi − x)kl yl
n n

where (
1, if l = i
al =
0, otherwise


§6 Lec 6: Oct 8, 2021

§6.1 Variance & Covariance Operations


Have
X X  Xn X
n X X
cov ai Yi , bj Yj = ai bj cov(Yi , Yj ) = ai bi cov(Yi , Yi ) = σ 2 ai bi
i=1 j=1

because Y1 , . . . , Yn are independent.

Example 6.1
Consider β̂0 and β̂1
X X 
cov(β̂0 , β̂1 ) = cov li Yi , ki Yj
X
= σ2 li ki
X  1  
2
=σ − ki x ki
n
1X X
= σ2 ki − σ 2 x ki2
n
σ2 x
= −P
(xi − x)2

Or
   
cov β̂0 , β̂1 = cov Y − β̂1 X, β̂1
 
= cov Y , β̂1 − X var(β̂1 )
−xσ 2
=P
(xi − x)2

Example 6.2
Consider Ŷi and Ŷj
   
cov Ŷi , Ŷj = cov y + β̂1 (xi − x), y + β̂1 (xj − x)
σ2 (xi − x)(xj − x) 2
= +0+0+ P σ
n (xi − x)2
 
1 (xi − x)(xj − x)
= σ2 + P
n (xi − x)2

When i = j,
(xi − x)2
 
2 1
var(Ŷi ) = σ +P
n (xi − x)2


Example 6.3 (Cont’d)


Notice that
P
yl X
Ŷi = y + β̂1 (xi − x) = + (xi − x) k l yl
 n
X 1 X
= + (xi − x)kl yl = al yl
n
X
Yˆj = . . . = bv yv
  X
cov Ŷi , Ŷj = σ 2 al bl
X1 
1

2
=σ + (xi − x)kl + (xj − x)kl
n n
 
1 (xi − x)(xj − x)
= σ2 + P
n (xi − x)2

§6.2 Inference
Construct a $1-\alpha$ confidence interval for $\beta_1$: $P(L \le \beta_1 \le U) = 1 - \alpha$. We know
\[ \hat\beta_1 \sim N\!\left(\beta_1,\ \frac{\sigma}{\sqrt{\sum (x_i - \bar x)^2}}\right), \qquad \frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2} \]
Consider $\operatorname{cov}(\hat\beta_1, e_i) = 0$. Under normality, since their covariance is 0, $\hat\beta_1$ and $S_e^2$ are independent. Thus,
\[ \frac{\dfrac{\hat\beta_1 - \beta_1}{\sigma/\sqrt{\sum (x_i - \bar x)^2}}}{\sqrt{\dfrac{(n-2)S_e^2}{\sigma^2}\Big/(n-2)}} = \frac{\hat\beta_1 - \beta_1}{S_e/\sqrt{\sum (x_i - \bar x)^2}} \sim t_{n-2} \]
Pivot Method:
\[ P\!\left(-t_{\alpha/2;\,n-2} \le \frac{\hat\beta_1 - \beta_1}{S_e/\sqrt{\sum (x_i - \bar x)^2}} \le t_{\alpha/2;\,n-2}\right) = 1 - \alpha \]
and after some manipulation we get
\[ P\!\left(\hat\beta_1 - t_{\alpha/2;\,n-2}\cdot\frac{S_e}{\sqrt{\sum (x_i - \bar x)^2}} \le \beta_1 \le \hat\beta_1 + t_{\alpha/2;\,n-2}\cdot\frac{S_e}{\sqrt{\sum (x_i - \bar x)^2}}\right) = 1 - \alpha \]
We are $1-\alpha$ confident that
\[ \beta_1 \in \left[\hat\beta_1 \pm t_{\alpha/2;\,n-2}\cdot\frac{S_e}{\sqrt{\sum (x_i - \bar x)^2}}\right] \]
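A minimal helper that computes this interval, assuming x and y are numpy arrays from a simple regression data set (the function name is just illustrative):

```python
import numpy as np
from scipy import stats

def slope_ci(x, y, alpha=0.05):
    """1 - alpha confidence interval for beta1 in Y = beta0 + beta1*X + eps."""
    n = len(y)
    Sxx = np.sum((x - x.mean())**2)
    b1 = np.sum((x - x.mean()) * y) / Sxx
    b0 = y.mean() - b1 * x.mean()
    e = y - b0 - b1 * x
    se2 = np.sum(e**2) / (n - 2)                 # S_e^2
    se_b1 = np.sqrt(se2 / Sxx)                   # Se / sqrt(sum (xi - xbar)^2)
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return b1 - t * se_b1, b1 + t * se_b1
```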


For β̂0 ,  
s
2
1 x
β̂0 ∼ N β0 , σ +P 
n (xi − x)2
and we proceed similarly to obtain
 s 
2
1 x
β0 ∈ β̂0 ± t α2 ;n−2 · Se +P 
n (xi − x)2

Say if we want to construct a confidence interval for β0 − 2β1 :


var(β̂0 − 2β̂1 ) = var(β̂0 ) + 4 var(β̂1 ) − 4 cov(β̂0 , β̂1 )
x2
 
2 1 4 4x
=σ + P + P + P
n (xi − x)2 (xi − x)2 (xi − x)2
(x + 2)2
 
1
= σ2 +P
n (xi − x)2
So, s !
1 (x + 2)2
β̂0 − 2β̂1 ∼ N β0 − 2β1 , σ +P
n (xi − x)2
Thus, the C.I. is " s #
1 (x + 2)2
β0 − 2β1 ∈ β̂0 − 2β̂1 ± t α2 ; n−2 · Se +P
n (xi − x)2

§6.3 Prediction Interval
Prediction interval for $Y_0$ when $X = X_0$. Let's begin with the error of prediction, $Y_0 - \hat Y_0$. We know

• $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
• $Y_0 = \beta_0 + \beta_1 X_0 + \varepsilon_0$
• $\hat Y_0 = \hat\beta_0 + \hat\beta_1 X_0$

So
\[ E(Y_0 - \hat Y_0) = 0, \qquad \operatorname{var}(Y_0 - \hat Y_0) = \operatorname{var}(Y_0) + \operatorname{var}(\hat Y_0) - 2\operatorname{cov}(Y_0, \hat Y_0) = \sigma^2\left(1 + \frac1n + \frac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}\right) \]
We apply the same procedure as in the inference section: with
\[ Y_0 - \hat Y_0 \sim N\!\left(0,\ \sigma\sqrt{1 + \frac1n + \frac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}}\right), \qquad \frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}, \]
\[ \Longrightarrow\; Y_0 \in \hat Y_0 \pm t_{\alpha/2;\,n-2}\, S_e\sqrt{1 + \frac1n + \frac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}} \]
C.I. for $EY_0$ for a given $X = X_0$: with
\[ \hat Y_0 \sim N\!\left(\beta_0 + \beta_1 X_0,\ \sigma\sqrt{\frac1n + \frac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}}\right), \qquad \frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}, \]
\[ \Longrightarrow\; EY_0 \in \hat Y_0 \pm t_{\alpha/2;\,n-2}\cdot S_e\sqrt{\frac1n + \frac{(x_0 - \bar x)^2}{\sum (x_i - \bar x)^2}} \]
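A small sketch combining both intervals at a given $x_0$ (the helper name and flag are mine, not from the notes); the only difference between them is the extra "1 +" inside the square root:

```python
import numpy as np
from scipy import stats

def interval_at_x0(x, y, x0, alpha=0.05, mean_response=False):
    """PI for a new Y0 at X = x0, or (mean_response=True) a CI for E[Y0]."""
    n = len(y)
    Sxx = np.sum((x - x.mean())**2)
    b1 = np.sum((x - x.mean()) * y) / Sxx
    b0 = y.mean() - b1 * x.mean()
    e = y - b0 - b1 * x
    se = np.sqrt(np.sum(e**2) / (n - 2))
    y0_hat = b0 + b1 * x0
    extra = 0.0 if mean_response else 1.0        # the PI carries the extra "1 +"
    half = stats.t.ppf(1 - alpha / 2, n - 2) * se * np.sqrt(extra + 1/n + (x0 - x.mean())**2 / Sxx)
    return y0_hat - half, y0_hat + half
```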


§7 Lec 7: Oct 11, 2021

§7.1 Hypothesis Testing
Consider the model:
Yi = β0 + β1 Xi + εi

Example 7.1
Hypothesis testing examples

H0 : β1 = 0, Ha : β1 6= 0
H0 : β1 = 1, Ha : β1 6= 1
H0 : β0 = 0, Ha : β0 6= 0
H0 : β0 + β1 = 0, Ha : β0 + β1 6= 0
)
β0 = β0∗
H0 : , Ha : not true
β1 = β1∗

Let’s consider the following two-sided test

H0 : β 1 = 0
Ha : β1 6= 0

Recall under H0 ,
 
σ
β̂1 ∼ N 0, √P

 β̂
(xi −x)2
=⇒ t = pP 1 ∼ tn−2
(n−2)Se2 2
 Se / (xi − x)2
∼ Xn−2

σ2

We reject H0 if t > t α2 ; n−2 or t < −t α2 ; n−2 . Using a 1 − α C.I.

Se
β1 ∈ β̂1 ± t α2 ; n−2 pP
(xi − x)2
For example, for −2 ≤ β1 ≤ 2, we do not reject H0 .

p − value = 2P (t > t∗ )

We reject H0 if p-value < α.


Test H0 : β1 = 0 using the F statistics. Under H0 ,
!
σ
β̂1 ∼ N 0, pP
(xi − x)2
β̂ − 0
pP1 ∼ N (0, 1)
σ/ (xi − x)2
Then,
β̂12 (xi − x)2
P
∼ X12
σ2
and we know
(n − 2)Se2 2
∼ Xn−2
σ2


Therefore, we can form the F statistics


β̂12(xi −x)2
P
β̂12 (xi − x)2
P
σ2 /1
(n−2)Se2
= ∼ F1, n−2
/(n − 2) Se2
σ2

Definition 7.2 (F Distribution) — Let U ∼ Xn2 and V ∼ Xm


2
and U, V are independent. Then,
U
n
V
∼ Fn,m
m

We can observe that t2n−2 = F1, n−2 . In general,

Z ∼ N (0, 1)
U ∼ Xn2
Z, U are independent
Z Z 2 /1
p ∼ tn =⇒ ∼ F1, n
U/n U/n

F1,n−2
α

F1−α;1,n−2

Let’s find the expected value of the F statistics.


• Denominator:
ESe2 = σ 2

• Numerator:
X X
E β̂12 (xi − x)2 = (xi − x)2 E β̂12
X  
= (xi − x)2 var(β̂1 + (E β̂1 )2
σ2
X  
= (xi − x)2 P + β 2
1
(xi − x)2
X
= σ 2 + β12 (xi − x)2

Under H0 the ratio is approximately equal to 1. If H0 is not true the ratio is greater than 1.
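A minimal sketch of this F test of $H_0: \beta_1 = 0$ (function name is illustrative); it also returns the t statistic so one can check $t^2 = F$ numerically:

```python
import numpy as np
from scipy import stats

def f_test_slope(x, y):
    """F test of H0: beta1 = 0 in simple regression, with the equivalent t statistic."""
    n = len(y)
    Sxx = np.sum((x - x.mean())**2)
    b1 = np.sum((x - x.mean()) * y) / Sxx
    b0 = y.mean() - b1 * x.mean()
    e = y - b0 - b1 * x
    se2 = np.sum(e**2) / (n - 2)
    F = b1**2 * Sxx / se2                        # ~ F(1, n-2) under H0
    t = b1 / np.sqrt(se2 / Sxx)                  # t^2 equals F
    p = stats.f.sf(F, 1, n - 2)                  # p-value from the F(1, n-2) tail
    return F, t, p
```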


Now, for β̂0 ,


 q 2

β̂0 ∼ N 0, σ n1 + P x
(xi −x)2
 β̂0
=⇒ t = q ∼ tn−2
(n−2)Se2 2 Se 1
+ P x
2
∼ Xn−2

σ2 n (xi −x)2

and consider H0 : β1 = 1(β1 − 1 = 0) and Ha : β1 6= 1 (β1 − 1 6= 0). Then under H0 ,

β̂ − 1
pP1 ∼ N (0, 1)
σ/ (xi − x)2

Test Statistics:
β̂1 − 1
pP ∼ tn−2
Se / (xi − x)2
Using F statistics
(β̂1 − 1)2 (xi − x)2
P
∼ X12
σ2
and thus
(β̂1 − 1)2 (xi − x)2
P
2
∼ F1, n−2
Se


§8 Lec 8: Oct 13, 2021

§8.1 Likelihood Ratio Test
Consider
Yi = β1 Xi + εi
H0 : β1 = 0
Ha : β1 6= 0
We know
!
σ
β̂1 ∼ N 0, pP
x2i
(n − 1)Se2 2
∼ Xn−1
σ2
β̂12 x2i
P
β̂1

So ttest : ∼ tn−1 and Ftest : Se2 ∼ F1,n−1 .
x2i
P
Se /
Likelihood Ratio Test (LRT):

For testing: H0 : β1 = 0
For the model: Yi = β0 + β1 Xi + εi
Show that this LRT is equivalent to the F statistic.
We reject H0 if
L(ŵ)
Λ= <k
L(ω̂)
where L(ŵ) is the maximized likelihood function under H0 and L(ω̂) is maximized likelihood
function under no restrictions. Under H0 : β1 = 0, we have Yi = β0 + εi . The likelihood function is
n 1
P 2
L = (2πσ 2 )− 2 e− 2σ2 (yi −β0 )
n 1 X
ln L = − ln 2πσ 2 − 2 (yi − β0 )2
2 2σ
β̂0 = y
(yi − y)2
P
2
σ̂0 =
n

e2i
P
Under no restriction, the estimates are the MLEs of β0 , β1 , σ 2 which are β̂0 , β̂1 and σ̂12 = n . Back
to LRT, we have
L(ŵ)
Λ=
L(ω̂)
1
(yi −y)2
P
n −
(2πσ02 )− 2 e 2σ02

= − 1
P
e2i
<k
2
(2πσ12 )e 2σ1

Note:
X
(yi − y)2 = nσ02
X
e2i = nσ12


So,
n n
(2πσ̂02 )− 2 e− 2
n n < k
(2πσ̂12 )− 2 e− 2
σ̂12 2
< kn
σ̂02
P 2
ei /n 2
P < kn
(yi − y)2 /n

Notice that
X X X
(yi − y)2 = e2i + (ŷi − y)2
X X X
(yi − y)2 = e2i + β̂12 (xi − x)2

So,

e2i
P
2
< kn
e2i + β̂12 (xi − x)2
P P

1 2
< kn
β̂12 (xi −x)2
P
1+ P 2
ei
2
β̂1 (xi − x)2
P
2
> k− n − 1
(n − 2)Se2
β̂1 (xi − x)2
P  2 
−n
> (n − 2) k − 1 = k0
Se2

We reject H0 if
β̂12 (xi − x)2
P
> k0
Se2
Recall we stated that we reject H0 if Λ = L( ŵ)
L(ω̂) < k. Let’s find k. First, we need α (type I error).
Before that, we know
β̂12 (xi − x)2
P
∼ F1,n−2
Se2
So,
 
P F1,n−2 > k 0 H0 is true = α

§8.2 Power Analysis in Simple Regression
Using the non-central t distribution

Definition 8.1 (Non-central t) — Let Z ∼ N (δ, 1) and U ∼ Xn2 and Z and U are independent.
Then,
Z
p ∼ tn (NCP = δ)
U/n

Back to the t ratio. If H0 is true,


√Pβ̂1
σ/ (xi −x)2
q
(n−2)Se2
σ2 /(n − 2)


follows a central $t_{n-2}$, in which the numerator follows a standard normal distribution. If $H_0$ is not true, then the numerator follows $N\!\left(\frac{\beta_1\sqrt{\sum (x_i - \bar x)^2}}{\sigma},\ 1\right)$. Thus the ratio follows $t_{n-2}$ with noncentrality parameter $\mathrm{NCP} = \frac{\beta_1\sqrt{\sum (x_i - \bar x)^2}}{\sigma}$. Finally, the power is
\[ 1 - \beta = P\big(t_{n-2}(\mathrm{NCP}) > t_{\alpha/2;\,n-2}\big) + P\big(t_{n-2}(\mathrm{NCP}) < -t_{\alpha/2;\,n-2}\big) \]
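A small sketch of this power computation using scipy's noncentral t (the design x, $\beta_1$, and $\sigma$ are whatever the user supplies; the function name is illustrative):

```python
import numpy as np
from scipy import stats

def slope_test_power(beta1, sigma, x, alpha=0.05):
    """Power of the two-sided t test of H0: beta1 = 0, via the noncentral t distribution."""
    n = len(x)
    Sxx = np.sum((x - x.mean())**2)
    ncp = beta1 * np.sqrt(Sxx) / sigma           # noncentrality parameter from the notes
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    nt = stats.nct(df=n - 2, nc=ncp)             # noncentral t with n-2 df
    return nt.sf(tcrit) + nt.cdf(-tcrit)

# Example: power grows with |beta1|, with Sxx, and as sigma shrinks
x = np.linspace(0, 10, 25)
print(slope_test_power(beta1=0.3, sigma=2.0, x=x))
```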


§9 Lec 9: Oct 15, 2021

§9.1 Extra Sum of Squares Method


So far, we have learnt several ways for hypothesis testing for, e.g.,

Yi = β0 + β1 Xi + εi
H0 : β1 = 0
Ha : β1 6= 0

which are

1. t statistics
2. F statistics
3. Likelihood ratio test

4. Extra sum of square principle (reduced and full model)

(SSER − SSEF )/(dfR − dfF )


∼ F1,n−2
SSEF /dfF
P 2)
SSEF = ei
dfF = n − 2

Under H0 : β1 = 0 we have a reduced model

Yi = β0 + εi =⇒ β̂0 = y

(yi − y)2 and dfR = n − 1. Thus,


P
Therefore SSER =
P 
(yi − y)2 − e2i / (n − 1 − (n − 2))
P
P 2
ei /(n − 2)

Note: X X X
(yi − y)2 = e2i + β̂12 (xi − x)2
| {z } | {z } | {z }
SST SSE SSR

So,

β̂12 (xi − x)2


P
∼ F1,n−2
Se2
!2
β̂1
pP ∼ t2n−2
Se / (xi − x)2


Example 9.1
Use the extra sum of squares method to test

H0 : β1 = 1
Ha : β1 6= 1

Reduced model: Yi = β0 + xi + εi

Yi − xi = β0 + εi
β̂0 = y − x
X 2
SSER = (yi − xi − (y − x))
X 2
= (yi − y − (xi − x))
X X X
= (yi − y)2 + (xi − x)2 − 2 (xi − x)(yi − y) (*)

Note:
X X X
(yi − y)2 = e2i + β̂12 (xi − x)2
P
(xi − x)(yi − y)
β̂1 = P
(xi − x)2
X X
=⇒ (xi − x)(yi − y) = β̂1 (xi − x)2

So, we have
X X X X
(∗) = (xi − x)2 +
e2i + β̂12 (xi − x)2 − 2β̂1 (xi − x)2
X X
SSER = e2i + (β̂1 − 1)2 (xi − x)2

Test statistics:
(SSER − SSEF )/(dfR − dfF )
SSEF /dfF
P P 
e2i + (β̂1 − 1)2 (xi − x)2 − e2i / (n − 1 − (n − 2))
P
P 2
ei /(n − 2)
2
(xi − x)2
P
(β̂1 − 1)
∼ F1,n−2
Se2

Proof. Under H0 ,   
β̂ ∼ N 1, √P σ
1
(xi −x)2
 (n−2)Se2 2
σ2 ∼ Xn−2
So,
 2
β̂1 −1)
√(P /1
σ/ (xi −x)2
(n−2)Se2
σ2 /(n − 2)
2
(xi − x)2
P
(β̂1 − 1)
∼ F1,n−2
Se2


§9.2 Power Analysis Using Non-Central F Distribution

Definition 9.2 — 1. Y ∼ N (µ, 1) then Y 2 ∼ X12 (θ = µ2 )

2. Suppose Y ∼ N (µ, σ)

Y µ 
∼N ,1
σ σ
Y2 µ2
2
∼ X12 (θ = 2 )
σ σ

MGF of Y ∼ X12 (NCP = θ). Then


− 21 t
MY (t) = (1 − 2t) eθ 1−2t
− 12
If θ = 0 =⇒ MY (t) = (1 − 2t) .
Consider now
i.i.d
Y1 , Y2 , . . . , Yn ∼ N (µ, σ)
Y12 Yn2
Find distribution of Q = σ2 + ... + σ2 .
 n
µ2 t
− 21 σ2 1−2t
MQ (t) = (1 − 2t) e
n nµ2 t
= (1 − 2t)− 2 e σ2 1−2t
P 2
nµ2
 
Yi 2
Q= ∼ X n θ =
σ2 σ2

Non-Central F Distribution: Let U ∼ Xn2 (NCP = θ) and V ∼ Xm


2
. If U, V are independent, then

U/n
∼ Fn,m (NCP = θ)
V /m

Back to simple regression:


!
σ
β̂1 ∼ N β 1 , pP
(xi − x)2
!
β̂1 β1
pP ∼N pP ,1
σ/ (xi − x)2 σ/ (xi − x)2
β̂12 (xi − x)2 β12 (xi − x)2
P  P 
2
∼ X 1 θ =
σ2 σ2
(n − 2)Se2 2
∼ Xn−2
σ2
β̂12 (xi −x)2
P
β12 (xi − x)2
 P 
σ2 /1
(n−2)Se2
∼ F1,n−2 θ =
2 /(n − 2) σ2
σ

Thus,
1 − β = P (F1,n−2 (θ) > F1−α;1,n−2 )


§10 Lec 10: Oct 18, 2021

§10.1 Multiple Regression


Consider:
Yi = β0 + β1 Xi1 + β2 Xi2 + . . . + βk Xik + εi , i = 1, . . . , n
where we have k predictors
      
Y1 1 x11 ... x1k β0 ε1
 Y2  1 x12 ... x2k   β1   ε 2 
 ..  =  .. ..   ..  +  .. 
      
..
 .  . . .  .   . 
Yn 1 xn1 ... xnk βk εn
Y = Xβ + ε

where

Y : n × 1 response vector
X : n × (k + 1) regression matrix
β : (k + 1) × 1 parameter vector
ε : n × 1 error vector

Assumption: Gauss-Markov conditions



E [εi ] = 0, i = 1, . . . , n 

var(εi ) = σ 2 , i = 1, . . . , n =⇒ E [ε] = 0, var(ε) = σ 2 I

ε1 , ε2 , . . . , εn are independent

 
Y1
Let Y =  ...  be a random vector with mean vector
 

Yn

    
Y1 EY1 µ1
 ..   ..   .. 
µ = E [Y] = E  .  =  .  =  . 
Yn EYn µn

and variance covariance matrix


σ12
 
σ12 ... σ1n
 σ21 σ22 ... σ2n 
Σ = E [Y − µ] [Y − µ] =  .
 
.. .. .. 
 .. . . . 
σn1 σn2 ... σn2
 
Y1 − µ1
 Y2 − µ2   
E  Y1 − µ1 Y2 − µ2 ... Yn − µn
 
..
 . 
Yn − µn


 
a1
Properties: Let a =  ...  be a vector of constants and let a0 Y be a linear combination Y. Then
 

an
X
E [a0 Y] = a0 EY = a0 µ = ai µi
0 0
var (a Y) = a Σa

Let A be an m × n matrix of constant and consider AY (m × 1 vector). Then

E [AY] = AEY = Aµ
var (AY) = AΣA0

Using the Gauss-Markov conditions

EY = E [Xβ + ε] = Xβ
var(Y) = var (Xβ + ε) = σ 2 I

Estimation of β using Least Squares:


1. Geometric interpretation of least squares – orthogonal projection

X0 (Y − Xβ) = 0
X0 Y − X0 Xβ = 0
X0 Xβ = X0 Y
−1
β̂ = (X0 X) X0 Y

which is the least squares estimator of β.


2. Minimize the error sum of squares
X
min Q = ε2i

or min Q = ε0 ε but Y = Xβ + ε. Or
0
min Q = (Y − Xβ) (Y − Xβ)

Then,

min Q = Y0 Y − Y0 Xβ − βX0 Y + β 0 X0 Xβ
= Y0 Y − 2Y0 Xβ + βX0 Xβ
∂Q
=0 (*)
∂β

Note: Matrix
 and  vector differentiation:
θ1
Let θ =  ...  and g(θ) be a function of θ. Then
 

θp
 ∂g(θ) 
∂θ1
∂g(θ)  . 
=  .. 

∂θ ∂g(θ)

∂θp


Let g(θ) = c0 θ. Then,


∂g(θ)
=c
∂θ
Let A be a symmetric matrix and consider g(θ) = θ 0 Aθ. Then,

∂g(θ)
= 2Aθ
∂θ
So applying these results to (*), we obtain
\[ -2X'Y + 2X'X\beta = 0 \;\Longrightarrow\; X'X\beta = X'Y, \]
which are the normal equations. Notice that
\[ \hat\beta = (X'X)^{-1} X'Y, \]
which is the OLS estimator of $\beta$.
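A quick numerical sketch of the normal equations (the simulated X, $\beta$, and $\sigma$ below are hypothetical); solving $X'X\beta = X'Y$ directly is numerically preferable to forming $(X'X)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3
X0 = rng.normal(size=(n, k))                     # k predictors
X = np.column_stack([np.ones(n), X0])            # regression matrix with intercept column
beta = np.array([1.0, 0.5, -2.0, 0.0])           # hypothetical true coefficients
y = X @ beta + rng.normal(0, 1, n)

# Solve the normal equations X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)         # same answer as a least-squares solver
```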


§11 Lec 11: Oct 20, 2021

§11.1 Multiple Regression (Cont'd)
Recall that

Y = Xβ + ε
E [ε] = 0
var(ε) = σ 2 I

Least squares: X
min ε2i = ε0 ε = (Y − Xβ)0 (Y − Xβ)
Normal Equations:
−1
X0 Xβ = X0 Y =⇒ β̂ = (X0 X) X0 Y
Note that X is not a square matrix, so X0 X has to go together in order for it to be invertible.

X = (1, x1 , x2 , . . . , xk )
0
 
1
10 x 1 10 x 2 10 xk
 
x01  n ...
 0   x1 1 x01 x1
 0
x01 x2 ... x01 xk 
X0 X =  x 2  1 x1 x2 . . . xk =  .
  
.. .. ..
 ..

 ..  . . . 
 . 
x0k 1 x0k x1 ... x0k xk
x0k

which is a symmetric (k + 1) × (k + 1) matrix. We have


X
x1 x1 = x2i1
X
x1 0 x2 = xi1 xi2

Partition X and β

X = 1 X(0)
 
β0
β=
β(0)

Model:
 
Y= 1 X(0) β0 β(0) + ε
Y = β0 1 + X(0) β(0) + ε

Then,

10
 
0

XX= 1 X(0)
X(0) 0
10 X(0)
 
n
=
X(0) 0 1 X(0) 0 X(0)

So  
β̂0
β̂1  
10 X(0) 10 Y
   
β̂0 n
β̂ =  .  = ˆ =
 
 ..  β(0) X(0) 0 1 X(0) 0 X(0) X(0) Y
β̂k


Also,  0
10
  
0 1Y
XY= Y=
X(0) 0 X0(0) Y
Fitted Values:

Ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + . . . + β̂k xik


 
    β̂0
Ŷ1 1 x11 x12 . . . x1k  
 Ŷ2  1 x21 x22 . . . x2k   β̂1 
..  β̂2 
 
 .  =  ..
  
. .. .. 
 .  . . . .   .. 

.
Ŷn 1 xn1 xn2 . . . xnk
β̂k
or
Ŷ = Xβ̂
or
−1
Ŷ = X (X0 X) X0 Y = HY
−1
where H = X (X0 X) X0 which is n × n “hat” matrix.
Properties of H:

1. $H' = H$ (symmetric), using $(ABC)' = C'B'A'$.

2. $HH = H$ (idempotent): $X(X'X)^{-1}X'X(X'X)^{-1}X' = H$.

3. $\operatorname{tr} H = \operatorname{tr}\big[X(X'X)^{-1}X'\big] = \operatorname{tr}\big[(X'X)^{-1}X'X\big] = \operatorname{tr} I_{k+1} = k + 1$, using the cyclic property of the trace: $\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB) \ne \operatorname{tr}(BAC)$ in general.

4. $HX = X$, or $H(1, x_1, \dots, x_k) = (1, x_1, \dots, x_k)$.
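A quick numerical check of these four properties with a small simulated design matrix (values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix

assert np.allclose(H, H.T)                       # 1. symmetric
assert np.allclose(H @ H, H)                     # 2. idempotent
assert np.isclose(np.trace(H), k + 1)            # 3. trace = k + 1
assert np.allclose(H @ X, X)                     # 4. HX = X
```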
Residuals:

ei = yi − ŷi i = 1, . . . , n
e = y − ŷ
e = y − xβ̂
e = Y − HY
e = (I − H) Y = (I − H) (Xβ + ε)
= (I − H) Xβ + (I − H) ε
= (I − H) ε

Overall, we have two expressions for e

e = (I − H) Y
e = (I − H) ε

Notice that the error sum of squares


0
X
SSE = e2i = e0 e = [(I − H) Y] [(I − H) Y] = Y0 (I − H) Y

or
0
SSE = [(I − H) ε] [(I − H) ε] = ε0 (I − H) ε


Properties of β̂:
h i
−1 −1
E β̂ = E X0 X X0 Y = (X0 X) X0 X |{z}
EY = β

which is unbiased.
h i
−1 −1 −1
var (β) = var (X0 X) X0 Y = (X0 X) X0 σ 2 IX (X0 X)
−1
= σ 2 (X0 X)

which is variance covariance matrix of β̂. Specifically,


 
v00 v01 ... v0k
  v10 v11 ... v1k 
−1
var β̂ = σ 2 (X0 X) = σ2  .
 
.. .. .. 
 .. . . . 
vk0 vk1 ... vkk
 
var β̂0 = σ 2 v00

var(β̂1 ) = σ 2 v11
 
cov β̂1 , β̂2 = σ 2 v12

where
−1
(X0 X) = {vij }i=1,...,n;j=1,...,n


§12 Lec 12: Oct 22, 2021

§12.1 Gauss-Markov Theorem in Multiple Regression
−1
Let β̂ = (X0 X) X0 Y be the least squares estimator of β and ∗
 let b =M Y be an unbiased
∗ 0 −1 0
estimator of β (not the least squares). Let’s define M = M + X X X .
b is unbiased
Eb = β
because
EM∗ Y = β
or
h   i
−1
E M + X0 X X0 Y = β
 
−1
M + (X0 X) X0 Xβ = β
MXβ + β = β
MX = 0

Check var(b). h i
−1
var(b) = var (M∗ Y) = var M + (X0 X) X0 Y

Note:
var(AY) = AΣA0
where var(Y) = σ 2 I. Then,
h ih i
−1 −1
var(b) = σ 2 M + (X0 X) X0 M0 + X (X0 X)
−1 −1
= σ 2 MM0 + σ 2 MX (X0 X) + σ 2 (X0 X) X0 M0
−1 −1
+ σ 2 (X0 X) X0 X (X0 X)
−1
= σ 2 MM0 + σ 2 (X0 X)
= σ 2 MM0 + var(β̂1 )

A matrix B is positive definite if for a non zero vector a

a0 Ba > 0

Aside Note:
var (aY0 ) = a0 Σa > 0
Now, let a be a non zero vector
0
a0 MM0 a = (M0 a) (M0 a)
= q0 q
X
= qi2 > 0
 
Therefore, MM0 is a positive definite matrix and thus var(b) ≥ var β̂ .


§12.2 Gauss-Markov Theorem For a Linear Combination
We have
   
var a0 β̂ = a0 var β̂ a
−1
= σ 2 a0 (X0 X) a
or
 
var a0 β̂0 + a1 β̂1 + a2 β̂2 = a20 var(β̂0 ) + a21 var(β̂1 ) + a22 var(β̂2 ) + 2a0 a1 cov(β̂0 , β̂1 )

+ 2a0 a2 cov(β̂0 , β̂2 ) + 2a1 a1 cov(β̂1 , β̂2 )


0
Let’s compare it to var(a b).
var (a0 b) = a0 var(b)a
h i
−1
= σ 2 a0 MM0 + (X0 X) a
−1
= σ 2 a0 MM0 a + σ 2 a0 (X0 X) a
 
= σ 2 a0 MM0 a + var a0 β̂
 
Thus, var(a0 b) ≥ var a0 β̂ .
Special Case:
 
0
0
 .. 
 
.
 
1
a= 
.
 .. 
 
0
0
var(bi ) ≥ var(β̂i )

§12.3 Review of Multivariate Normal Distribution
i.i.d
Normality assumption: ε1 , . . . , εn ∼ N (0, δ)
ε ∼ Nn 0, σ 2 I


Let Y ∼ Nn (µ, Σ)
1 − 12 − 12 (y−µ)0 Σ−1 (y−µ)
f (y) = n |Σ| e
(2π) 2
Consider
f (ε) = f (ε1 ) · f (ε2 ) . . . f (εn )
)
1 − 12 1 2
I)−1 ε
1 2 = n σ2 I e− 2 ε(σ
f (εi ) = √1 e− 2σ2 Σi (2π) 2
σ 2π
So 1 0
n
f (ε) = (2πσ 2 )− 2 e− 2σ2 ε ε =⇒ ε ∼ Nn (0, σ 2 I)
Joint MGF: Let Y ∼ Nn (µ, Σ). Then
0 0 1 0
MY (t) = Eet Y = et µ+ 2 t Σt
 
t1
 .. 
where t =  . .
tn


Theorem 12.1
Let Y ∼ Nn (µ, Σ) and let A be m × n matrix of constant and c m × 1 vector of constants.
Using the joint mgf

AY ∼ Nm (Aµ, AΣA0 )
AY + c ∼ Nm (Aµ + c, AΣA0 )

Notice that
\[ Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I) \;\Longrightarrow\; Y \sim N_n(X\beta, \sigma^2 I), \]
i.e., $EY = X\beta$ and $\operatorname{var}(Y) = \sigma^2 I$.


§13 Lec 13: Oct 25, 2021

§13.1 Theorems in Multivariate Normal Distribution
Consider: Y ∼ Nn (µ, Σ)
1 − 12 − 12 0
f (y) = n |Σ| e (y − µ) Σ−1 (y − µ)
(2π) 2
0 0 1 0
My (t) = Eet y = et µ+ 2 t Σt
1
Proof. Let Z ∼ N (0, I) and Y = Σ 2 Z + µ. Then, the spectral decomposition of Σ is
 
λ1 0
Σ = PΛP0 , Λ=
 .. 
. 
0 λn
1 1
Σ 2 = PΛ 2 P0

So,
1 2
MZi (ti ) = Eeti zi = e 2 ti
0
MZ (t) = Eet z = Eet1 z1 +...+tn zn
= Eet1 z1 · Eet2 z2 . . . Eetn zn
1 0
= e2t t
MY (t) = M 1 (t)
Σ 2 Z+µ
 1 
t0 Σ 2 Z+µ
= Ee
 1 0
t0 µ Σ2 t Z
=e Ee
1
Let t∗ = Σ 2 t. Then
0 ∗0
MY (t) = et µ Eet Z

0 0 1 ∗0
= et µ MZ (t∗ ) = et µ e 2 t T BA

0 1 0
= et µ+ 2 t Σt

Theorem 13.1
Let A be m × n matrix of constants and C be m × 1 vector of constants. Then

AY + C ∼ Nm (Aµ + C, AΣA0 )
AY ∼ Nm (Aµ, AΣA0 )

Proof. We have
0
MAY+C (t) = Eet (AY+C)
0 0 0
= et C · Ee(A t) Y


Let t∗ = A0 t. Then
0
MAY+C (t) = et C · MY (t∗ )
0 ∗0 0
µ+ 21 t∗ Σt∗
= et C · et
0 1 0 0
= et (Aµ+C)+ 2 t AΣA t

Thus, AY + C ∼ Nm (Aµ + C, AΣA0 ).

Theorem 13.2
Let Y Nn (µ, Σ),      
Q1 µ1 Σ11 Σ12
Y= , µ= , Σ=
Q2 µ2 Σ21 Σ22
Note that 

Y1
 Y2 
 
 
Y=
 
Y
 3 

 Y4 
Y5
Then,

Q1 ∼ Np (µ1 , Σ11 )
Q2 ∼ Nn−p (µ2 , Σ22 )

Proof. Use the above theorem 


Q1 = Ip 0 Y = AY
Then,

EQ1 = EAY
 
 µ1
= I 0
µ2
= µ1
var (Q1 ) = var (AY)
  
 Σ11 Σ12 I
= AΣA0 = I 0
Σ21 Σ22 00
= Σ11

 √ 
If A = a0 (row vector), then a0 Y ∼ N a0 µ, a0 Σa .


Theorem 13.3  
Q1
Independence for Y =
Q2
 
µ1
Y ∼ N (µ, Σ) , µ =
µ2
 
Σ11 Σ12
Σ=
Σ21 Σ22

Then, the MGF is


0 1 0
MY (t) = et µ+ 2 t Σt
0 0 1 0 1 0 0
= et1 µ1 +t2 µ2 + 2 t1 Σ11 t1 + 2 t2 Σ22 t2 +t1 Σ12 t2

If Σ12 = 0, then
0 1 0 0 1 0
MY (t) = et1 µ1 + 2 t1 Σ11 t1 et2 µ2 + 2 t2 Σ22 t2
= MQ1 (t1 ) · MQ2 (t2 )

Thus, Q1 , Q2 are independent ⇐⇒ cov (Q1 , Q2 ) = 0.

For AY, BY, we have


cov (AY, BY) = AΣB0

Theorem 13.4
We have
Q1 Q2 ∼ Np µ1 + Σ12 Σ−1 −1

22 (Q2 − µ2 ) , Σ11 − Σ12 Σ22 Σ21

Back to multiple regression


ε ∼ N 0, σ 2 I

Y = Xβ + ε,
Y ∼ N Xβ, σ 2 I


Then, the likelihood function is


1 − 1 2 −1
σ 2 I 2 e− 2 (y−xβ)(σ I) (y−xβ)
1
L = f (y) = n
(2π) 2
or
n 1 2 0
L = (2πσ 2 )− 2 e− 2 σ (y−xβ) (y−xβ)
n 1 0
ln L = − ln 2πσ 2 − 2 (y − xβ) (y − xβ)
2 2σ
Thus for β,
∂ ln L −1
= 0 =⇒ β̂ = (X0 X) X0 Y
∂β
and estimation for σ 2
∂ ln L µ 1 0
= − 2 + 4 (y − xβ) (y − xβ) = 0
∂σ 2 2σ 2σ
 0  
Y − Xβ̂ Y − Xβ̂ e0 e
σ̂ 2 = =
n n


Now, e = (I − H) Y = (I − H)ε. Therefore,

e0 e = Y0 (I − H) Y = ε0 (I − H) ε
e0 e
σ̂ 2 =
n
Y0 (I − H)Y
=
n
ε0 (I − H) ε
=
n


§14 Lec 14: Oct 27, 2021

§14.1 Mean and Variance in Multivariate Normal Distribution
Consider
Y = Xβ + ε
ε ∼ Nn 0, σ 2 I


=⇒ Y ∼ Nn Xβ, σ 2 I


Joint pdf of Y is
− n2 1 0
f (y) = 2πσ 2 e− 2σ2 (y−xβ) (y−xβ)
Using the method of maximum we obtain the MLEs of β and σ 2
−1
β̂ = (X0 X) X0 Y
which is the same as the least squares estimator. And
 0  
y − xβ̂ y − xβ̂ e0 e
σ̂ 2 = =
n n
Note that e = (I − H)Y or e = (I − H)ε. Therefore,
e0 e = Y0 (I − H)Y or e0 e = ε0 (I − H)ε
So
1 0
E σ̂ 2 = Ee e
n  
1  0
= E ε (I − H)ε
n | {z }
scalar
1
= E [tr(I − H)εε0 ]
n
1
= tr [E (I − H) εε0 ]
n
1
= tr [(I − H) E (εε0 )]
n
Note:
0
Σ = E (Y − µ) (Y − µ)
E [YY0 ] = Σ + µµ0
where
E(ε) = 0
var(ε) = σ 2 I
Then,
1 
E σ̂ 2 = tr (I − H) σ 2 I + 000

n
1
= tr (I − H) σ 2 I
n
σ2
= tr (I − H)
n


Let’s compute tr (I − H).

tr (I − H) = tr(I) − tr(H)
h i
−1
= tr(I) − tr X (X0 X) X0
h i
−1
= tr(I) − tr (X0 X) X0 X
= tr(In ) − tr(Ik+1 )
=n−k−1

So,
n−k−1
E σ̂ 2 = σ 2
n
which is biased. Therefore, the unbiased estimator of σ 2 is

n e0 e n e0 e
Se2 = σ̂ 2 = =
n−k−1 n n−k−1 n−k−1
In simple regression (k = 1 – one predictor)

e0 e
P 2
ei
Se2 = =
n−2 n−2

Now, let’s find the mean and variance of Ŷ and e.

Ŷ = HY
E Ŷ = HEY
= HXβ
= Xβ

Note: HX = X.
 
var Ŷ = var (HY)
= σ2 H

For e,

Ee = E [(I − H)Y]
= E [Y − HY]
= Xβ − Xβ
=0
var(e) = var [(I − H)Y]
= σ 2 (I − H)

§14.2 Independent Vectors in Multiple Regression
If Y ∼ Nn (µ, Σ), then AY and BY are independent iff

cov (AY, BY) = AΣB0 = 0

Apply this result for multiple regression


   
cov Ŷ, e , cov β̂, e


or use      
Ŷ HY H
= = Y = AY
e (I − H)Y I−H

Y ∼ Nn Xβ, σ 2 I .

var (AY) = A var(Y)A0


 
2 H 
=σ H I−H
I−H
 
2 H 0

0 I−H

Ŷ and e are independent. Similarly, we can show that β̂ and e are independent.

§14.3 Partial Regression
Consider 
X = X1 X2
with the following three models
−1
Y = X1 β 1 + ε =⇒ β̂ 1 = (X01 X1 ) X01 Y
−1
Y = X2 β 2 + ε =⇒ β̂ 2 = (X02 X2 ) X02 Y

and
Y = Xβ + ε or Y = X1 β 1 + X2 β 2 + ε


§15 Lec 15: Oct 29, 2021

§15.1 Partial Regression (Cont'd)
Normal equation:
X0 Xβ̂ = X0 Y
using
X01
   
0 β̂ 12
X = and β̂ =
X02 β̂ 21
Then,

X01 X01 X1 X01 X2


   
0

XX= X1 X2 =
X02 X02 X1 X02 X2

and
X01
   0 
0 X1 Y
XY= Y=
X02 X02 Y
and the normal equations are
 0
X01 X2
   0 
X1 X1 β̂ 12 X1 Y
=
X02 X1 X02 X2 β̂ 21 X02 Y

Then,

X01 X1 β̂ 12 + X01 X2 β̂ 21 = X01 Y (1)


X02 X1 β̂ 12 + X02 X2 β̂ 21 = X02 Y (2)

From (1),
X01 X1 β̂ 12 = X01 Y − X01 X2 β̂ 21
So,
−1 −1 0
β̂ 12 = (X01 X1 ) X01 Y − (X01 X1 ) X1 X2 β̂ 21 (3)
| {z }
β̂ 1

Let’s find β̂ 21 by substitute (3) into (2).


h i 0
−1 −1
X02 X1 (X01 X1 ) X01 Y − (X01 X1 ) X01 X2 β̂ 21 + X2 X2 β̂ 21 = X02 Y
−1 −1
X02 X1 (X01 X1 ) X01 Y − X02 X1 (X01 X1 ) X01 X2 β̂ 21 + (X02 X2 ) β̂ 21 = X2 0 Y

Then,
 
−1 −1
X02 X2 β̂ 21 − X02 X1 (X01 X1 ) X01 X2 β̂ 21 = X02 Y − X02 X1 (X01 X1 ) X01 Y
h i h i
−1 −1
X02 I − X1 (X01 X1 ) X01 X2 β̂ 21 = X02 I − X1 (X01 X1 ) X01 Y

X02 [I − H1 ] X2 β̂ 21 = X02 [I − H1 ] Y
X02 (I − H1 ) (I − H1 ) X2 β̂ 21 = X02 (I − H1 ) (I − H1 ) Y
0 0
[(I − H) X2 ] [(I − H) X2 ] β̂ 21 = [(I − H) X2 ] [(I − H1 ) Y]

Note:
(I − H1 ) Y = Y∗


which is residuals from regression of Y on X1 . Suppose



X2 = x3 x4 x5
Here k = 5 and 
X= 1 x1 x2 x3 x4 x5
where  
X1 = 1 x1 x2 , X2 = x 3 x4 x5
Then,

(I − H1 ) X2 = (I − H1 ) x3 x4 x5
 
= (I − H1 ) x3 (I − H1 ) x4 (I − H1 ) x5
= X2 ∗

So,  
0 0
X∗2 X∗2 β̂ 21 = X∗2 Y∗
and thus  0 −1 0
β̂ 21 = X∗2 X∗2 X∗2 Y∗
Special Case 1:

X = 1 X(0)
 
β0
β=
β (0)

Now, let’s use partial regression to find β̂ (0) .


Regression Y on 1: Y = β0 1 + ε and
 
  y1 − y
1
Y∗ = (I − H1 ) Y = I − 1 (10 1) 10 Y = I − 110 Y =  ... 
h i
−1  
n
yn − y
X∗(0) regress X(0) on 1

X∗(0) = (I − H1 ) X(0)
 
1
= I − 110 X(0)
n
    
1 1
= I − 110 x1 , . . . , I − 110 xk
n n
 
x11 − x1 . . . x1k − xk
 x21 − x1 . . . x2k − xk 
=
 
.. .. 
 . . 
xn1 − x1 ... xnk − xk
 
β1
 .. 
Finally, to estimate the vector of the slopes β (0) = . 
βk
 
  x11 − x1 ... x1k − xk
y1 − y  x21 − x1 ... x2k − xk 
We regress  ...  on
   
 .. .. 
 . . 
yn − y
xn1 − x1 ... xnk − xk


 0 −1 0
to get β̂ (0) = X∗(0) X∗(0) X∗(0) Y∗ where
 
1 0
X∗(0) = I−
11 X(0)
n
 
1
Y∗ = I − 110 Y
n


§16 Lec 16: Nov 1, 2021

§16.1 Partial Regression (Cont'd)
Consider:
Y = Xβ + ε
Then,
−1
β̂ = (X0 X) X0 Y
The partial regression of Y∗ on X∗2
 0 −1 0
β̂ 21 = X∗2 X∗2 X∗2 Y∗

i.e., Y∗ = X∗2 β 2 + ε.
Special Case 2: Begin with
Y = Xβ + ε
with k predictors. Then, we add an extra predictor Z. The new model is

Y = Xβ + cZ + ε

Use partial regression to estimate c.


1. Regress Y on X → e residuals
2. Regress Z on X → Z∗ residuals.
3. Regress e on $Z^*$ to get
\[ \hat c = \big(Z^{*\prime} Z^{*}\big)^{-1} Z^{*\prime} e
\quad\text{or}\quad
\hat c = \frac{Z^{*\prime} e}{Z^{*\prime} Z^{*}} = \frac{e' Z^{*}}{Z^{*\prime} Z^{*}} \]
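A numerical sketch of this three-step residual-on-residual procedure (simulated data; helper names are mine), checking that $\hat c$ matches the coefficient of Z from fitting the full model directly:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # existing predictors (with intercept)
z = rng.normal(size=n)                                        # extra predictor Z
y = X @ np.array([1.0, 2.0, -1.0]) + 0.7 * z + rng.normal(0, 1, n)

def resid(A, v):
    """Residuals from regressing v on the columns of A."""
    return v - A @ np.linalg.lstsq(A, v, rcond=None)[0]

e = resid(X, y)                              # step 1: regress Y on X
z_star = resid(X, z)                         # step 2: regress Z on X
c_hat = (z_star @ e) / (z_star @ z_star)     # step 3: regress e on Z*

# Same coefficient as fitting the full model [X, Z] directly
full = np.linalg.lstsq(np.column_stack([X, z]), y, rcond=None)[0]
assert np.isclose(c_hat, full[-1])
```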
Change in the error sum of squares when a new predictor is added in the model

Y = Xβ + ε (1)
Y = Xβ + cZ + ε (2)

Residuals using (1)


e = Y − Xβ̂
Residuals using (2)
u = Y − Xδ̂ − ĉZ
Now, we need to find δ̂

Y = Xβ + cZ + ε
 
 β
Y= X Z +ε
c
 
β
Y=w +ε
c
Y = wη + ε

Normal equations:
w0 wη = w0 Y


or

X0 Xδ̂ + X0 Zĉ = X0 Y (1)


0 0 0
Z Xδ̂ + Z Zĉ = Z Y (2)

From (1)
−1
δ̂ = (X0 X) [X0 Y − X0 Zĉ]
or
−1
δ̂ = β̂ − (X0 X) X0 Zĉ
Now back to u
−1
u = Y − X β̂ + X (X0 X) X0 Zĉ − ĉZ
 
−1
u = e − I − X (X0 X) X0 Zĉ
= e − [I − H] Zĉ
= e − Z∗ ĉ

The SSE is

SSEXZ = u0 u
0
= (e − Z∗ ĉ) (e − Z∗ ĉ)
0 0
= e0 e − 2Z∗ eĉ + Z∗ Z∗ ĉ2
0 0
= e0 e − 2ĉ2 Z∗ Z∗ + Z∗ Z∗ ĉ2
0
= e0 e − Z∗ Z∗ ĉ2

Thus, we can conclude that adding a new predictor would never increase SSE, i.e., u0 u ≤ e0 e. Note
that the new R2 is

2 u0 u
RXZ =1−
SST
0
e0 e Z∗ Z∗ ĉ2
=1− +
SST SST
∗0 ∗ 2
2 Z Z ĉ
= RX +
SST
2 2
So, RXZ ≥ RX .

§16.2 Partial Correlation
Consider
Yi = β0 + β1 Xi1 + β2 Xi2 + εi
where 
Yi : income

Xi1 : age

Xi2 : number of years of education

• Regress Y on X1 → Y∗ residuals.
• Regress X2 on X1 → X∗2 residuals.


2 cov2 (Y∗ , X∗2 )


rYX =
2 |X1
var(X2 ∗ ) var(Y∗ )
hP  ∗
 ∗
i2
Y1∗ − Y ∗
Xi2 − X 2 /(n − 1)
=
(Yi∗ −Y ∗ )
2 2
(Xi2 2 )
∗ −X ∗
P P

n−1 n−1
P ∗ 2
( Y Xi2 )
= P ∗2i P
( X2 ) ( Yi∗2 )
 0 2
Y∗ X∗2
= 0
X∗2 X∗2 (Y∗0 Y∗ )


Another method:
• Regress Y on X1 , X2 , . . . , Xk−1 → Y∗ .
• Regress Xk on X1 , X2 , . . . , Xk−1 → X2k .
SSE (Y on X1 , . . . , Xk−1 ) − SSE (Y on X1 , . . . , Xk )
rY2 Xk |X1 ,...,xk−1 =
SSE (Y on X1 , . . . , Xk−1 )


§17 Lec 17: Nov 3, 2021

§17.1 Constrained Least Squares


Consider
Y = Xβ + ε
We want to estimate β subject to a set of linear constraints of the form cβ = γ where C : m × k + 1,
β : k + 1 × 1 and γ : m × 1.
Suppose k = 4 (
β0 + 2β1 − 3β2 + 5β3 − β4 = 5
2β0 − β1 + β2 + 3β3 = 10
or  
β0
  β1   
1 2 −3 5 −1  β2  = 5

2 −1 1 3 0    10
β3 
β4
We still minimize $(Y - X\beta)'(Y - X\beta)$, but now subject to $c\beta = \gamma$.
Method of Lagrange Multipliers:
0
min Q = (Y − Xβ) (Y − Xβ) + 2λ0 (cβ − γ)
So,
∂Q
= −2X0 Y + 2X0 Xβ + 2cλ = 0
∂β
Solve for β to get β̂c
−1
β̂ c = (X0 X) [X0 Y − c0 λ]
−1
β̂ c = β̂ − (X0 X) c0 λ
Now, we need to find λ. So
−1
cβ̂ c = cβ̂ − c (X0 X) c0 λ
−1
γ = cβ̂ − c (X0 X) c0 λ
h i−1  
−1
λ = c (X0 X) c0 cβ̂ − γ
Therefore,
h i−1  
−1 −1
β̂ c = β̂ − (X0 X) c0 c (X0 X) c0 cβ̂ − γ
Fitted values:
h i−1  
−1 −1
Ŷc = Xβ̂ c = Xβ̂ − X (X0 X) c0 c (X0 X) c0 cβ̂ − γ
Residuals:
h i−1  
−1 −1
ec = Y − Ŷc = e + X (X0 X) c0 c (X0 X) c0 cβ̂ − γ
Error sum of squares:
0
SSEc = ec ec
 0  
−1 0 −1 −1 0 −1
h i  h i 
0 −1 0 0 0 −1 0 0
= e + X (X X) c c (X X) c cβ̂ − γ e + X (X X) c c (X X) c cβ̂ − γ

= e0 e + e0 X [. . .] + [. . .] X0 e
 0 h i−1 h i−1  
−1 −1 −1 −1
+ cβ̂ − γ c (X0 X) c0 c (X0 X) X0 X (X0 X) c0 c (X0 X) c0 cβ̂ − γ


Finally,
\[ e_c' e_c = e'e + \big(c\hat\beta - \gamma\big)' \big[c\, (X'X)^{-1} c'\big]^{-1} \big(c\hat\beta - \gamma\big) \]
We can deduce that $SSE_c \ge SSE$.
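A small numerical sketch of the constrained estimator (the constraint $\beta_1 - \beta_2 = 0$ and the simulated data are hypothetical), checking that the constraint holds exactly and that $SSE_c \ge SSE$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 4                                     # p = k + 1 coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(0, 1, n)

C = np.array([[0.0, 1.0, -1.0, 0.0]])            # constraint: beta1 - beta2 = 0
g = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                            # unconstrained beta_hat
A = XtX_inv @ C.T @ np.linalg.inv(C @ XtX_inv @ C.T)
b_c = b - A @ (C @ b - g)                        # constrained estimator

assert np.allclose(C @ b_c, g)                   # the constraint holds exactly
sse   = np.sum((y - X @ b)**2)
sse_c = np.sum((y - X @ b_c)**2)
assert sse_c >= sse                              # SSE_c >= SSE
```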


MLE of σ 2
e0 e
σ̂ 2 =
n
For the constrained model 0
ec ec
σ̂c2 =
n
and
(n − k − 1 + m)σ 2
E σ̂c2 =
n
Method Using the Canonical Form of the Model:

cβ = γ
 
 β1
c = c1 c2 , β =
β2
c1 β 1 + c2 β 2 = γ
 
 β
−3 5 −1  2 
     
1 2 β0 5
+ β3 =
2 −1 β1 1 3 0 10
β4

Back to the model using the same partition we get

Y = X1 β 1 + X2 β 2 + ε

Then,

Y = X1 c−1
1 [γ − c2 β 2 ] + X2 β 2 + ε
Y − X1 c1 γ = X2 − X1 c−1
−1

1 c2 β 2 + ε
Yr = X2r β 2 + ε

which is the same form as Y = Xβ + ε. Thus,


 0 −1 0
β̂ 2c = X2r X2r X2r Yr

and therefore,  
β̂1c = c−1
1 γ − c 2 β̂ 2c

Overall,
h i−1  
−1 −1
β̂ c = β̂ − (X0 X) c0 c (X0 X) c0 cβ̂ − γ
or  
β̂
β̂ c = 1c
β̂ 2c

which is from canonical form. Next, let’s find the mean and variance of β̂ c .

E β̂ c = β

Notice that  
−1 0 −1
h i
0 −1 0 0
β̂ c = I − (X X) c c (X X) c c β̂ + const = Aβ̂


So
−1
var(β̂ c ) = σ 2 A (X0 X) A0
or using the canonical model
  
var(β̂1c ) cov β̂ 1c , β̂ 2c
var(β̂ c ) =    
cov β̂ 1c , β̂ 2c var(β̂ 2c )


§18 Lec 18: Nov 5, 2021
Consider:

Y = Xβ + ε
i.i.d
ε1 , . . . , εn ∼ N (0, σ)
ε ∼ Nn 0, σ 2 I



Then, Y ∼ Nn Xβ, σ 2 I
−1
β̂ = (X0 X) X0 Y
 
−1
β̂ ∼ Nk+1 β, σ 2 (X0 X)

β̂ 1 ∼ N (β1 , σ v11 )
 
v00 v01 . . . v0k
v10 v11 . . . v1k 
−1
(X0 X) =  .
 
.. .. 
 .. . . 
vk1 vk2 ... vkk

§18.1 Quadratic Forms of Normally Distributed Random Variables
We have

a) Z ∼ Nn (0, I)
i.i.d
Z1 , . . . , Zn ∼ N (0, 1)
Zi2 ∼ X12
X
Zi2 ∼ Xn2
Z0 Z ∼ Xn2

b) Z ∼ Nn 0, σ 2 I . Then,

Zi ∼ N (0, σ)
Zi
∼ N (0, 1)
σ
Zi2
2
∼ X12
σ
P 2
Zi
∼ Xn2
σ2
Z
∼ Nn (0, I)
σ
Z0 Z
∼ Xn2
σ2
In multiple regression,

ε ∼ Nn 0, σ 2 I


ε0 ε
∼ Xn2
σ2



c) Y ∼ Nn µ, σ 2 I

Yi − µi
Yi ∼ N (µi , σ) =⇒ ∼ N (0, 1)
σ
 2
Yi − µi
∼ X12
σ
X  Yi − µi 2
∼ Xn2
σ
0
(Y − µ) (Y − µ)
∼ Xn2
σ2
In multiple regression

Y ∼ Nn Xβ, σ 2 I

0
(Y − Xβ) (Y − Xβ)
∼ Xn2
σ2
1
d) Y ∼ Nn (µ, Σ), use V = Σ− 2 (Y − µ). Σ is symmetric matrix

Σ = PΛP0
 
λ1 0
 λ2 
Λ=
 
.. 
 . 
0 λn

where |Σ − λI| = 0. If x is a new zero vector such that Σx = λx, we say that x is an
eigenvector of Σ. Normalize the eigenvectors so that they have length 1

(λ1 , e1 ), (λ2 , e2 ), . . . , (λn , en )


e0i ej = 0, e0i ei = 1

P = e1 . . . en
PP0 = I
Σ = PΛP0 = λ1 e1 e01 + λ2 e2 e02 + . . . + λn en e0n

Result:
1 1
Σ− 2 = PΛ− 2 P0
Properties:
 1
0 1
Σ− 2 = Σ− 2
1 1
Σ− 2 Σ− 2 = Σ−1
1 1
Σ 2 = PΛ 2 P0
1 1
Σ 2 = PΛ 2 P0
 1 0 1
Σ2 = Σ2
1 1
Σ2 Σ2 = Σ


Back to the transformation


1
V = Σ− 2 (Y − µ)
1
EV = Σ− 2 E (Y − µ) = 0
h 1
i
var(V) = var Σ− 2 (Y − µ)
h 1 1
i
= var Σ− 2 Y − Σ− 2 µ
 1

= var Σ− 2 Y
1 1
= Σ− 2 var(Y)Σ− 2
1 1
= Σ− 2 ΣΣ− 2
=I
1
So, V ∼ Nn (0, I). Then V0 V ∼ Xn2 and because V = Σ− 2 (Y − µ), it follows that
 1
  1
 1 1
0 0
Σ− 2 (Y − µ) Σ− 2 (Y − µ) = (Y − µ) Σ− 2 Σ− 2 (Y − µ) ∼ Xn2

Therefore,
0
(Y − µ) Σ−1 (Y − µ) ∼ Xn2
In multiple regression  
−1
β̂ ∼ Nk+1 β, σ 2 (X0 X)
1
 
We want to create a X 2 random variable using the distribution of β̂. Let V = (X0 X) 2 β̂ − β .

1
 
EV = (X0 X) 2 E β̂ − β = 0
1
h  i
var(V) = var (X0 X) 2 β̂ − β
1 1
= (X0 X) 2 var(β̂) (X0 X) 2
1 1
−1
= σ 2 (X0 X) 2 (X0 X) (X0 X) 2
= σ2 I

We have so far

V ∼ Nk+1 0, σ 2 I


V0 V 2
∼ Xk+1
σ2
 0  
β̂ − β X0 X β̂ − β
2
∼ Xk+1
σ
Summary:
( (Y−Xβ)0 (Y−Xβ)
σ2 ∼ Xn2
0 0
(β̂−β) X X(β̂−β) 2
σ2 ∼ Xk+1

(n−k−1)Se2 2
Problem 18.1. Show that σ2 ∼ Xn−k−1
Proof. Have  0  
Y − Xβ ± Xβ̂ Y − Xβ ± Xβ̂
∼ Xn2
σ2


Rearrange and expand


  0       0
e + X β̂ − β e + X β̂ − β ee 0 e0 X β̂ − β β̂ − βX0 e
= 2 + +
σ2 σ σ2 σ2
 0  
β̂ − β X0 X β̂ − β
+
σ2
 0  
0
ee0 β̂ − β X X β̂ − β
= 2 +
σ σ2
e0 e
Note: Se2 = n−k−1 =⇒ e0 e = (n − k − 1)Se2
 0  
(n − k − 1)S 2 β̂ − β X0 X β̂ − β
0 e
(Y − Xβ) (Y − Xβ) /σ 2 = +
| {z } σ2 σ{z2
∼X 2
| }
n 2
∼Xk+1

We know cov(β̂, e) = 0.

Q = Q1 + Q2
MQ (t) = MQ1 (t) · MQ2 (t)
MQ (t)
MQ1 (t) =
MQ2 (t)
n
(1 − 2t)− 2
= k+1
(1 − 2t)− 2

n−k−1
= (1 − 2t)− 2

(n−k−1)Se2 2
So, Q1 = σ2 ∼ Xn−k−1 .
In simple regression, k = 1,

(n − 2)Se2 2
∼ Xn−2
σ2
σ2
Se2 = Q1
n−k−1
So,

MSe2 (t) = M σ2 (t)


n−k−1 Q1

σ2 t
 
= MQ1
n−k−1
− n−k−1
2σ 2 t
 2

= 1−
n−k−1
 
n−k−1 2σ 2
Thus, Se2 ∼ Γ 2 , n−k−1

ESe2 = σ 2
2σ 4
var(Se2 ) =
n−k−1


§19 Lec 19: Nov 8, 2021

§19.1 Quadratic Forms and Their Distribution – Overview
1. Z ∼ N (0, I)
Z0 Z ∼ Xn2

2. Z ∼ N 0, σ 2 I
Z0 Z
∼ Xn2
σ2
and
ε0 ε
∼ Xn2
σ2

3. Y ∼ Nn µ, σ 2 I
0
(Y − µ) (Y − µ)
∼ Xn2
σ2
or 0
(Y − Xβ) (Y − Xβ)
∼ Xn2
σ2
4. Y ∼ N (µ, Σ). From the spectral decomposition,
1
V = Σ− 2 (Y − µ)

Then,

V ∼ Nn (0, I)

From 1), V0 V ∼ Xn2 or


0
(Y − µ) Σ−1 (Y − µ) ∼ Xn2
 
−1
β̂ ∼ Nk+1 β, σ 2 (X0 X)

1
 
V = (X0 X) 2 β̂ − β
V ∼ Nk+1 0, σ 2 I


From 2),
V0 V 2
∼ Xk+1
σ2
Finally,  0  
β̂ − β X0 X β̂ − β
2
∼ Xk+1
σ2
Also, recall that we showed in last lecture

(n − k − 1) Se2 2
∼ Xn−k−1
σ2


§19.2 Another Proof of Quadratic Forms and Their Distribution
1. Let Y ∼ Nn (0, I) and Z = P0 Y where P is an orthogonal matrix where P0 P = I. Then,
Z ∼ Nn (0, I).

2. Let A be a symmetric and idempotent matrix. Then the eigenvalues are 0 or 1.

Proof. Have Ax = λx. Multiply both sides by x0

x0 Ax = λx0 x
x0 AAx = λx0 x
0
(Ax) (Ax) = λx0 x
λ2 x0 x = λx0 x

Therefore, λ = 0 or λ = 1.
Question 19.1. How many 1’s?

Using the trace of A,

tr A = tr (PΛP0 )
= tr (ΛPP0 )
= tr Λ

3. Let Y ∼ N (0, I) and suppose A is a symmetric and idempotent matrix. Then Y0 AY ∼ X12
where r = tr (A) (number of eigenvalues equal to 1).

A = PΛP0 =⇒ Y0 AY = Y0 PΛP0 Y = Z0 ΛZ from 1)

Then,
Y0 AY = z12 + z22 + . . . + zr2 ∼ Xr2
where Z ∼ Nn (O, I) =⇒ zi ∼ N (0, 1), and so zi2 ∼ X12
(n−k−1)Se2 2
4. Use the previous theorem (3.) to show that σ2 ∼ Xn−k−1

e0 e
Se2 = =⇒ e0 e = (n − k − 1)Se2
n−k−1
e0 e 2
WTS: σ2 ∼ Xn−k−1

Proof. Have )
e = (I − H) Y
=⇒ e = (I − H) ε
Y = Xβ + ε
Therefore,
e0 e ε0 (I − H) ε ε ε 0

2
= 2
= (I − H) = ε∗ (I − H) ε∗
σ σ σ σ
where ε ∼ N (0, I). Using the theorem above (3.), we conclude that

0 (n − k − 1)Se2
ε∗ (I − H) ε∗ = 2
∼ Xtr(I−H) = Xn−k−1
σ2


§19.3 Efficiency of Least Squares Estimators


Let θ̂ be an unbiased estimator of θ. Then,
  1
var θ̂ ≥
nI(θ)

This is known as the Cramer-Rao Lower Bound. Recall the score function

∂ ln f (x; θ)
S=
∂θ
and the information matrix
2
E∂ 2 f (x; θ)

∂ ln f (x; θ)
I(θ) = E =− = var(S)
∂θ ∂θ2

and nI(θ) is the information in the sample. An estimator is efficient if


• It is unbiased
• its variance is equal to the Cramer-Rao lower bound.

Also,
∂ 2 ln L
I(θ) = −E
∂θ2
for Y1 , . . . , Yn i.i.d


§20 Lec 20: Nov 10, 2021

§20.1 Information Matrix and Efficient Estimator
Let $Y_1, Y_2, \dots, Y_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$. Is $\bar{y}$ an efficient estimator for $\mu$, where
\[ E\bar{y} = \mu, \qquad \operatorname{var}(\bar{y}) = \frac{\sigma^2}{n} \; ? \]
Consider the pdf
\[ f(y_i) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2}, \qquad
 L = (2\pi\sigma^2)^{-\frac{n}{2}} e^{-\frac{1}{2\sigma^2}\sum (y_i - \mu)^2} \]
\[ \ln L = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum (y_i - \mu)^2 \]
\[ \frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum (y_i - \mu) = \frac{1}{\sigma^2}\left(\sum y_i - n\mu\right), \qquad
 \frac{\partial^2 \ln L}{\partial \mu^2} = -\frac{n}{\sigma^2} \]
Cramer-Rao lower bound:
\[ \frac{1}{-E\left[\frac{\partial^2 \ln L}{\partial \mu^2}\right]} = \frac{1}{\frac{n}{\sigma^2}} = \frac{\sigma^2}{n} \]
Thus $\bar{y}$ is an efficient estimator for $\mu$.
Let $\hat{\theta}$ be an estimator of $\theta = (\theta_1, \dots, \theta_p)'$. Check:
1. $E\hat{\theta} = \theta$;
2. find $\operatorname{var}(\hat{\theta})$ and compare it with the inverse of the information matrix $I^{-1}(\theta)$, where
\[ I(\theta) = -E \begin{pmatrix}
\frac{\partial^2 \ln L}{\partial \theta_1^2} & \frac{\partial^2 \ln L}{\partial \theta_1 \partial \theta_2} & \dots & \frac{\partial^2 \ln L}{\partial \theta_1 \partial \theta_p} \\
\frac{\partial^2 \ln L}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 \ln L}{\partial \theta_2^2} & \dots & \frac{\partial^2 \ln L}{\partial \theta_2 \partial \theta_p} \\
\vdots & \vdots & & \vdots \\
\frac{\partial^2 \ln L}{\partial \theta_p \partial \theta_1} & \frac{\partial^2 \ln L}{\partial \theta_p \partial \theta_2} & \dots & \frac{\partial^2 \ln L}{\partial \theta_p^2}
\end{pmatrix} \]

In multiple regression the parameters are $\beta_0, \beta_1, \dots, \beta_k, \sigma^2$:
\[ Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I), \qquad Y \sim N_n(X\beta, \sigma^2 I) \]
\[ \implies \ln L = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta)
 = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\left(Y'Y - 2Y'X\beta + \beta'X'X\beta\right) \]
Then, with $\theta = (\beta', \sigma^2)'$, the information matrix (written element by element over $\beta_0, \dots, \beta_k, \sigma^2$) has the block form
\[ I(\theta) = -E \begin{pmatrix}
\frac{\partial^2 \ln L}{\partial \beta \partial \beta'} & \frac{\partial^2 \ln L}{\partial \beta \partial \sigma^2} \\
\frac{\partial^2 \ln L}{\partial \sigma^2 \partial \beta'} & \frac{\partial^2 \ln L}{\partial (\sigma^2)^2}
\end{pmatrix} \]


Then,
\[ \frac{\partial \ln L}{\partial \beta} = -\frac{1}{2\sigma^2}(-2X'Y + 2X'X\beta), \qquad
 \frac{\partial^2 \ln L}{\partial \beta \partial \beta'} = -\frac{1}{2\sigma^2}(2X'X) = -\frac{X'X}{\sigma^2} \]
\[ \frac{\partial^2 \ln L}{\partial \beta \partial \sigma^2} = \frac{1}{2\sigma^4}(-2X'Y + 2X'X\beta), \qquad
 E\left[\frac{\partial^2 \ln L}{\partial \beta \partial \sigma^2}\right] = 0 \]
\[ \frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(Y - X\beta)'(Y - X\beta) \]
\[ \frac{\partial^2 \ln L}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}(Y - X\beta)'(Y - X\beta), \qquad
 E\left[\frac{\partial^2 \ln L}{\partial (\sigma^2)^2}\right] = \frac{n}{2\sigma^4} - \frac{n}{\sigma^4} = -\frac{n}{2\sigma^4} \]
Thus,
\[ I(\theta) = \begin{pmatrix} \frac{X'X}{\sigma^2} & 0 \\ 0' & \frac{n}{2\sigma^4} \end{pmatrix}, \qquad
 I^{-1}(\theta) = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0' & \frac{2\sigma^4}{n} \end{pmatrix} \]
Notice that $E\hat{\beta} = \beta$ and $\operatorname{var}(\hat{\beta}) = \sigma^2(X'X)^{-1}$, which attains the bound, so $\hat{\beta}$ is an efficient estimator of $\beta$. For the error variance,
\[ S_e^2 = \frac{e'e}{n-k-1}, \qquad ES_e^2 = \sigma^2, \qquad \operatorname{var}(S_e^2) = \frac{2\sigma^4}{n-k-1} \]

§20.2 Centered Model
Consider $Y = X\beta + \varepsilon$ with
\[ X = \begin{pmatrix} 1 & X_{(0)} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_{(0)} \end{pmatrix} \]
Then,
\[ Y = \beta_0 1 + X_{(0)}\beta_{(0)} + \varepsilon \pm \frac{1}{n}11'X_{(0)}\beta_{(0)} \]
Rearranging this expression we obtain
\[ Y = \beta_0 1 + \frac{1}{n}11'X_{(0)}\beta_{(0)} + \left(I - \frac{1}{n}11'\right)X_{(0)}\beta_{(0)} + \varepsilon
 = 1\left(\beta_0 + \frac{1}{n}1'X_{(0)}\beta_{(0)}\right) + \left(I - \frac{1}{n}11'\right)X_{(0)}\beta_{(0)} + \varepsilon
 = \gamma_0 1 + Z\beta_{(0)} + \varepsilon \]
Estimate the centered model:
\[ \begin{pmatrix} \hat{\gamma}_0 \\ \hat{\beta}_{(0)} \end{pmatrix}
 = \begin{pmatrix} 1'1 & 1'Z \\ Z'1 & Z'Z \end{pmatrix}^{-1}\begin{pmatrix} 1'Y \\ Z'Y \end{pmatrix}
 = \begin{pmatrix} n & 0' \\ 0 & Z'Z \end{pmatrix}^{-1}\begin{pmatrix} 1'Y \\ Z'Y \end{pmatrix} \]


Thus,
\[ \hat{\gamma}_0 = \bar{y} \]
\[ \hat{\beta}_{(0)} = (Z'Z)^{-1}Z'Y
 = \left[X_{(0)}'\left(I - \tfrac{1}{n}11'\right)X_{(0)}\right]^{-1}X_{(0)}'\left(I - \tfrac{1}{n}11'\right)Y
 = \left(X_{(0)}^{*\prime}X_{(0)}^*\right)^{-1}X_{(0)}^{*\prime}Y^* \]
Observe that $Y \sim N_n\left(\gamma_0 1 + Z\beta_{(0)}, \sigma^2 I\right)$. Then,
\[ \frac{\left(Y - \gamma_0 1 - Z\beta_{(0)}\right)'\left(Y - \gamma_0 1 - Z\beta_{(0)}\right)}{\sigma^2} \sim \chi^2_n \]
• Fitted values: $\hat{Y} = 1\hat{\gamma}_0 + Z\hat{\beta}_{(0)}$
• Residuals: $e = Y - \hat{Y} = Y - 1\hat{\gamma}_0 - Z\hat{\beta}_{(0)}$
Note: fitted values and residuals are the same for both the centered and the non-centered model.


§21 Lec 21: Nov 12, 2021

§21.1 Confidence Intervals in Multiple Regression
Consider
\[ \hat{\beta} \sim N_{k+1}\left(\beta, \sigma^2(X'X)^{-1}\right) \]
Let's find a $1-\alpha$ confidence interval for $\beta_1$. With $v_{11}$ the corresponding diagonal element of $(X'X)^{-1}$,
\[ \hat{\beta}_1 \sim N\left(\beta_1, \sigma\sqrt{v_{11}}\right), \qquad \frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1} \]
\[ \frac{\frac{\hat{\beta}_1 - \beta_1}{\sigma\sqrt{v_{11}}}}{\sqrt{\frac{(n-k-1)S_e^2}{\sigma^2}/(n-k-1)}} = \frac{\hat{\beta}_1 - \beta_1}{S_e\sqrt{v_{11}}} \sim t_{n-k-1} \]
\[ P\left(-t_{\frac{\alpha}{2};n-k-1} \le \frac{\hat{\beta}_1 - \beta_1}{S_e\sqrt{v_{11}}} \le t_{\frac{\alpha}{2};n-k-1}\right) = 1 - \alpha \]
Finally,
\[ \beta_1 \in \hat{\beta}_1 \pm t_{\frac{\alpha}{2};n-k-1}\cdot S_e\sqrt{v_{11}} \]
In general, to construct a confidence interval for $a'\beta$:
\[ a'\hat{\beta} \sim N\left(a'\beta, \sigma\sqrt{a'(X'X)^{-1}a}\right) \]
Then,
\[ \frac{\frac{a'\hat{\beta} - a'\beta}{\sigma\sqrt{a'(X'X)^{-1}a}}}{\sqrt{\frac{(n-k-1)S_e^2}{\sigma^2}/(n-k-1)}} = \frac{a'\hat{\beta} - a'\beta}{S_e\sqrt{a'(X'X)^{-1}a}} \sim t_{n-k-1} \]
Finally,
\[ a'\beta \in a'\hat{\beta} \pm t_{\frac{\alpha}{2};n-k-1}\cdot S_e\sqrt{a'(X'X)^{-1}a} \]

If $a' = (0, 1, 0, 0, \dots, 0)$, then $a'\beta = \beta_1$.
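The interval for $a'\beta$ is easy to compute directly. Below is a minimal sketch (simulated data, not from the course; it assumes numpy and scipy) that builds the $95\%$ interval $a'\hat{\beta} \pm t_{\alpha/2;n-k-1}\,S_e\sqrt{a'(X'X)^{-1}a}$ for a single coefficient.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
bhat = XtX_inv @ X.T @ y
e = y - X @ bhat
Se2 = e @ e / (n - k - 1)

a = np.array([0.0, 1.0, 0.0])                  # picks out beta_1, so a'beta = beta_1
est = a @ bhat
se = np.sqrt(Se2 * (a @ XtX_inv @ a))
tcrit = stats.t.ppf(0.975, df=n - k - 1)       # alpha = 0.05
print(est - tcrit * se, est + tcrit * se)      # 95% CI for beta_1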


Prediction interval for $Y_0$: for a given $x_0' = \begin{pmatrix} 1 & x_{01} & x_{02} & \dots & x_{0k} \end{pmatrix}$, with the model $Y = X\beta + \varepsilon$, the predictor of the new response is
\[ \hat{Y}_0 = x_0'\hat{\beta} \]

66
Duc Vu (Fall 2021) 21 Lec 21: Nov 12, 2021

 
The error of the prediction is $Y_0 - \hat{Y}_0$, with $E(Y_0 - \hat{Y}_0) = EY_0 - E\hat{Y}_0 = x_0'\beta - x_0'\beta = 0$. Note that $Y_0 = x_0'\beta + \varepsilon_0$, and $Y_0$ is independent of $\hat{Y}_0$, so
\[ \operatorname{var}(Y_0 - \hat{Y}_0) = \operatorname{var}(Y_0) + \operatorname{var}(\hat{Y}_0)
 = \sigma^2 + \sigma^2 x_0'(X'X)^{-1}x_0 = \sigma^2\left(1 + x_0'(X'X)^{-1}x_0\right) \]
Then,
\[ Y_0 - \hat{Y}_0 \sim N\left(0, \sigma\sqrt{1 + x_0'(X'X)^{-1}x_0}\right), \qquad \frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1} \]
With this, we can construct a t ratio
\[ \frac{\frac{Y_0 - \hat{Y}_0}{\sigma\sqrt{1 + x_0'(X'X)^{-1}x_0}}}{\sqrt{\frac{(n-k-1)S_e^2}{\sigma^2}/(n-k-1)}}
 = \frac{Y_0 - \hat{Y}_0}{S_e\sqrt{1 + x_0'(X'X)^{-1}x_0}} \sim t_{n-k-1} \]
and the prediction interval for $Y_0$ is
\[ Y_0 \in \hat{Y}_0 \pm t_{\frac{\alpha}{2};n-k-1}\cdot S_e\sqrt{1 + x_0'(X'X)^{-1}x_0} \]

For a given $x_0' = \begin{pmatrix} 1 & x_{01} & x_{02} & \dots & x_{0k} \end{pmatrix}$, $\hat{Y}_0 = x_0'\hat{\beta}$ and
\[ \hat{Y}_0 \sim N\left(x_0'\beta, \sigma\sqrt{x_0'(X'X)^{-1}x_0}\right), \qquad \frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1} \]
So the confidence interval for $EY_0$ is
\[ EY_0 = x_0'\beta \in \hat{Y}_0 \pm t_{\frac{\alpha}{2};n-k-1}\cdot S_e\sqrt{x_0'(X'X)^{-1}x_0} \]

§21.2 Hypothesis Testing
Suppose $k = 5$; then
\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \beta_5 X_{i5} + \varepsilon_i \]
Suppose we want to test
1. $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$
2. $H_0: \beta_3 = 2$, $H_a: \beta_3 \neq 2$
3. $H_0: \beta_2 - \beta_5 = 0$, $H_a: \beta_2 - \beta_5 \neq 0$
4. $H_0: \beta_2 = \beta_5 = 0$, $H_a:$ not true
5. $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ (i.e. $\beta_{(0)} = 0$), $H_a:$ not true
All of the above can be expressed in the form
\[ H_0: C\beta = \gamma, \qquad H_a: C\beta \neq \gamma \]



1. $C = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 \end{pmatrix}$, $\gamma = 0$

2. $C = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix}$, $\gamma = 2$

3. $C = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & -1 \end{pmatrix}$, $\gamma = 0$

4. $C = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$, $\gamma = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. Check:
\[ C\beta = \begin{pmatrix} \beta_2 \\ \beta_5 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \]

5. \[ C = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \qquad \gamma = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} \]
or $C = \begin{pmatrix} 0 & I \end{pmatrix}$.

In general, $C$ is an $m \times (k+1)$ matrix.

\[ H_0: C\beta = \gamma \iff C\beta - \gamma = 0, \qquad H_a: C\beta \neq \gamma \iff C\beta - \gamma \neq 0 \]
Consider $C\hat{\beta} - \gamma$ and find its distribution under $H_0$:
\[ E\left(C\hat{\beta} - \gamma\right) = 0, \qquad \operatorname{var}\left(C\hat{\beta} - \gamma\right) = \sigma^2 C(X'X)^{-1}C' \]
Therefore,
\[ C\hat{\beta} - \gamma \sim N_m\left(0, \sigma^2 C(X'X)^{-1}C'\right) \]
and let
\[ V = \left[C(X'X)^{-1}C'\right]^{-\frac{1}{2}}\left(C\hat{\beta} - \gamma\right) \]
Then $EV = 0$ and $\operatorname{var}(V) = \sigma^2 I_{m\times m}$. So $V \sim N_m(0, \sigma^2 I)$ and $\frac{V'V}{\sigma^2} \sim \chi^2_m$, i.e.
\[ \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{\sigma^2} \sim \chi^2_m \]
Also,
\[ \frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1} \]
and $\hat{\beta}$ and $S_e^2$ are independent. Therefore,
\[ \frac{\frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{\sigma^2}/m}{\frac{(n-k-1)S_e^2}{\sigma^2}/(n-k-1)} \sim F_{m,\,n-k-1} \]
or
\[ \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{mS_e^2} \sim F_{m,\,n-k-1} \]


§22 Lec 22: Nov 15, 2021

§22.1 F Test for the General Linear Hypothesis
Consider:
\[ H_0: C\beta = \gamma, \qquad H_a: C\beta \neq \gamma \]
Under $H_0$: $C\hat{\beta} - \gamma \sim N_m\left(0, \sigma^2 C(X'X)^{-1}C'\right)$, so
\[ \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{\sigma^2} \sim \chi^2_m, \qquad \frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1} \]
\[ \implies F = \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{mS_e^2} \sim F_{m,\,n-k-1} \]
Reject $H_0$ if $F > F_{1-\alpha;\,m,\,n-k-1}$.
Note: $E(mS_e^2) = m\sigma^2$ is the expected value of the denominator. The expected value of the numerator (using properties of the trace) is
\[ m\sigma^2 + (C\beta - \gamma)'\left[C(X'X)^{-1}C'\right]^{-1}(C\beta - \gamma) \]
If $H_0$ is true, the second term is 0, the two expected values match, and the $F$ ratio is on average close to 1; otherwise the numerator is inflated and $F$ tends to be large.
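For concreteness, here is a minimal sketch of the computation (simulated data, hypothetical hypothesis $H_0: \beta_2 = \beta_5 = 0$; numpy/scipy assumed). It forms the quadratic-form numerator, divides by $mS_e^2$, and reads off the p-value from the $F_{m, n-k-1}$ distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, k = 80, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 0.0, 1.5, 2.0, 0.0, -1.0])
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
bhat = XtX_inv @ X.T @ y
Se2 = (y - X @ bhat) @ (y - X @ bhat) / (n - k - 1)

C = np.zeros((2, k + 1)); C[0, 2] = 1.0; C[1, 5] = 1.0    # H0: beta_2 = beta_5 = 0, so m = 2
gamma = np.zeros(2)
m = C.shape[0]

d = C @ bhat - gamma
F = d @ np.linalg.solve(C @ XtX_inv @ C.T, d) / (m * Se2)
p_value = stats.f.sf(F, m, n - k - 1)
print(F, p_value)                              # reject H0 if F > F_{1-alpha; m, n-k-1}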

§22.2 F Statistics and t statistics in Multiple Regression


Suppose $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$, $k = 5$ and $m = 1$. Then
\[ C = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad \gamma = 0 \]
and $C\hat{\beta} - \gamma = \hat{\beta}_1$. Then the $F$ statistic is
\[ \frac{\hat{\beta}_1^2}{S_e^2 v_{11}} \sim F_{1,\,n-k-1} \]
Now test $H_0: \beta_1 = 0$ using the $t$ statistic:
\[ \hat{\beta}_1 \sim N\left(0, \sigma\sqrt{v_{11}}\right), \quad \frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}
 \implies \frac{\hat{\beta}_1}{S_e\sqrt{v_{11}}} \sim t_{n-k-1} \]
Thus $t_{n-k-1}^2 = F_{1,\,n-k-1}$.


Suppose
\[ H_0: a'\beta = 0, \qquad H_a: a'\beta \neq 0 \]
$t$ statistic: under $H_0$, $a'\hat{\beta} \sim N\left(0, \sigma\sqrt{a'(X'X)^{-1}a}\right)$ and $\frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}$. Then,
\[ \frac{a'\hat{\beta}}{S_e\sqrt{a'(X'X)^{-1}a}} \sim t_{n-k-1} \]


§22.3 Power Analysis in Multiple Regression
Let $Y \sim N_n(\mu, I)$. Then $Y'Y \sim \chi^2_n(\text{NCP} = \mu'\mu)$. Let $Y \sim N_n(\mu, \sigma^2 I)$. Then,
\[ \frac{Y}{\sigma} \sim N_n\left(\frac{\mu}{\sigma}, I\right), \qquad \frac{Y'Y}{\sigma^2} \sim \chi^2_n\left(\text{NCP} = \frac{\mu'\mu}{\sigma^2}\right) \]
Let $Q \sim \chi^2_n(\text{NCP} = \theta)$. Its moment generating function is
\[ M_Q(t) = (1-2t)^{-\frac{n}{2}} e^{\theta\frac{t}{1-2t}} \]
When $H_0$ is not true,
\[ C\hat{\beta} - \gamma \sim N_m\left(C\beta - \gamma, \sigma^2 C(X'X)^{-1}C'\right) \]
Let $V = \frac{\left[C(X'X)^{-1}C'\right]^{-\frac{1}{2}}\left(C\hat{\beta} - \gamma\right)}{\sigma}$. Then,
\[ V \sim N_m\left(\frac{\left[C(X'X)^{-1}C'\right]^{-\frac{1}{2}}(C\beta - \gamma)}{\sigma}, I\right) \]
\[ \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{\sigma^2}
 \sim \chi^2_m\left(\text{NCP} = \frac{(C\beta - \gamma)'\left[C(X'X)^{-1}C'\right]^{-1}(C\beta - \gamma)}{\sigma^2}\right) \]
Recall the non-central $F$ distribution: if $U \sim \chi^2_n(\text{NCP} = \theta)$ and $V \sim \chi^2_m$ are independent, then
\[ \frac{U/n}{V/m} \sim F_{n,m}(\text{NCP} = \theta) \]
Apply this for the power analysis:
\[ \frac{\frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{\sigma^2}/m}{\frac{(n-k-1)S_e^2}{\sigma^2}/(n-k-1)}
 = \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{mS_e^2} \sim F_{m,\,n-k-1}(\text{NCP} = \theta) \]
with $\theta = \frac{(C\beta - \gamma)'\left[C(X'X)^{-1}C'\right]^{-1}(C\beta - \gamma)}{\sigma^2}$.
The power is
\[ 1 - \beta = P\left(F_{m,\,n-k-1}(\text{NCP} = \theta) > F_{1-\alpha;\,m,\,n-k-1}\right) \]

§22.4 F Statistics Using the Extra Sum of Squares


Under $H_0: C\beta = \gamma$, we have a constrained least squares problem with
\[ \hat{\beta}_c = \hat{\beta} - (X'X)^{-1}C'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right) \]
and
\[ \text{SSE}_c = e_c'e_c = e'e + \left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right) \]
Using the extra sum of squares,
\[ \frac{(\text{SSE}_c - \text{SSE}_F)/(df_R - df_F)}{\text{SSE}_F/df_F} \sim F_{df_R - df_F,\,df_F} \]


where dfF = n − k − 1 and dfR = n − (k − m) − 1 so dfR − dfF = m, e.g.,

k=5
H0 : β1 = β2 = 0
Full : n − 5 − 1
Reduced : n − 3 − 1
=⇒ m = 2

Thus,
\[ \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{mS_e^2} \sim F_{m,\,n-k-1} \]
which is the same as method 1.
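The numerical equivalence of the two methods is easy to confirm. A small sketch (simulated data; $H_0: \beta_1 = \beta_2 = 0$ chosen just for illustration) computes the general-linear-hypothesis F and the full-versus-reduced F and prints both.

import numpy as np

rng = np.random.default_rng(5)
n, k = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ rng.normal(size=k + 1) + rng.normal(size=n)

def sse(Xmat, y):
    b, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    r = y - Xmat @ b
    return r @ r

# Method 1: general linear hypothesis F for H0: beta_1 = beta_2 = 0
XtX_inv = np.linalg.inv(X.T @ X)
bhat = XtX_inv @ X.T @ y
Se2 = sse(X, y) / (n - k - 1)
C = np.zeros((2, k + 1)); C[0, 1] = 1.0; C[1, 2] = 1.0
d = C @ bhat
F1 = d @ np.linalg.solve(C @ XtX_inv @ C.T, d) / (2 * Se2)

# Method 2: extra sum of squares with the reduced model dropping X1 and X2
X_red = X[:, [0, 3, 4, 5]]
F2 = ((sse(X_red, y) - sse(X, y)) / 2) / (sse(X, y) / (n - k - 1))
print(F1, F2)                                  # the two F statistics agree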


§23 Lec 23: Nov 17, 2021

§23.1 Testing the Overall Significance of the Model
Consider
\[ H_0: \beta_{(0)} = 0, \qquad H_a: \beta_{(0)} \neq 0 \]
We can test this hypothesis using the $F$ test for the general linear hypothesis,
\[ \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{kS_e^2} \sim F_{k,\,n-k-1} \]
where $m = k$ in this case. We can also use the test statistic
\[ \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/k}{\text{SSE}/(n-k-1)} \]
Note: $\text{SSR} = \left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)$ where $C = \begin{pmatrix} 0 & I_k \end{pmatrix}$.

§23.2 Likelihood Ratio Test
Consider:
\[ H_0: C\beta = \gamma, \qquad H_a: C\beta \neq \gamma \]
We reject $H_0$ if
\[ \Lambda = \frac{L(\hat{w})}{L(\hat{\omega})} < k \]
where
• $L(\hat{w})$: maximized likelihood function under $H_0$
• $L(\hat{\omega})$: maximized likelihood function under no restriction
Note that
\[ Y = X\beta + \varepsilon \implies Y \sim N_n\left(X\beta, \sigma^2 I\right) \]
Thus, the likelihood function is
\[ L = (2\pi\sigma^2)^{-\frac{n}{2}} e^{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)} \]
Without any restrictions,
\[ \hat{\beta} = (X'X)^{-1}X'Y \quad \text{and} \quad \hat{\sigma}_1^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{n} = \frac{e'e}{n} \]
Under $H_0$,
\[ \hat{\beta}_c = \hat{\beta} - (X'X)^{-1}C'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right) \]


and
\[ \hat{\sigma}_0^2 = \frac{(y - X\hat{\beta}_c)'(y - X\hat{\beta}_c)}{n} = \frac{e_c'e_c}{n} \]
Back to the LRT, we have
\[ \Lambda = \frac{(2\pi\hat{\sigma}_0^2)^{-\frac{n}{2}} e^{-\frac{1}{2\hat{\sigma}_0^2}e_c'e_c}}{(2\pi\hat{\sigma}_1^2)^{-\frac{n}{2}} e^{-\frac{1}{2\hat{\sigma}_1^2}e'e}} < k \]
Replacing
\[ e_c'e_c = n\hat{\sigma}_0^2, \qquad e'e = n\hat{\sigma}_1^2 \]
(so that both exponents equal $-n/2$ and cancel), we obtain
\[ \frac{e'e}{e_c'e_c} < k^{\frac{2}{n}} \]
Also,
\[ e_c'e_c = e'e + \left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right) \]
Thus,
\[ \frac{1}{1 + \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{e'e}} < k^{\frac{2}{n}} \]
We see that if $H_0$ is true then $C\hat{\beta} \approx \gamma$ and therefore the ratio above is approximately equal to 1. If $H_0$ is not true then the ratio above is less than 1. Manipulating the above expression, we have
\[ \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{mS_e^2} > \left(k^{-\frac{2}{n}} - 1\right)\frac{n-k-1}{m} = k' \]
Use the significance level $\alpha$ (type I error) to find $k'$ (rejection region):
\[ P\left(F_{m,\,n-k-1} > k'\right) = \alpha \]
Therefore, $k' = F_{1-\alpha;\,m,\,n-k-1}$: reject $H_0$ when the $F$ statistic falls in the upper $\alpha$ tail of the $F_{m,\,n-k-1}$ distribution.

§23.3 Multi-Collinearity
This is a problem that arises when some predictors are highly correlated with other predictors.


Example 23.1
Suppose $k = 2$. Test
\[ H_0: \beta_1 = \beta_2 = 0 \quad \text{vs.} \quad H_a: \text{at least one } \beta_i \neq 0 \]
using the $F$ statistic. Suppose we reject $H_0$ (at least one $\beta_i \neq 0$). Then test $\beta_1 = 0$ and $\beta_2 = 0$ individually:
\[ H_0: \beta_1 = 0, \; H_a: \beta_1 \neq 0 \qquad \text{and} \qquad H_0: \beta_2 = 0, \; H_a: \beta_2 \neq 0 \]
Suppose we don't reject $H_0$ in both tests. This contradiction between the $F$ statistic and the $t$ statistics is a problem caused by multi-collinearity.

Multi-collinearity inflates the variance of $\hat{\beta}_i$ and therefore the corresponding $t$ statistics will be small. To explain this we will use the centered and scaled model
\[ Y = \gamma_0 1 + Z\beta_{(0)} + \varepsilon \]
where $Z = \left(I - \frac{1}{n}11'\right)X_{(0)}$. Or
\[ Y_i = \gamma_0 + \beta_1(X_{i1} - \bar{X}_1) + \beta_2(X_{i2} - \bar{X}_2) + \dots + \beta_k(X_{ik} - \bar{X}_k) + \varepsilon_i, \qquad i = 1, 2, \dots, n, \]
or
\[ Y_i = \gamma_0 + \beta_1 Z_{i1} + \beta_2 Z_{i2} + \dots + \beta_k Z_{ik} + \varepsilon_i \]
where $Z_1, \dots, Z_k$ are the centered predictors.


§24 Lec 24: Nov 19, 2021

§24.1 Centered and Scaled Model in Matrix/Vector Form
Consider the centered model
\[ Y = \gamma_0 1 + Z\beta_{(0)} + \varepsilon \]
\[ \gamma_0 = \beta_0 + \frac{1}{n}1'X_{(0)}\beta_{(0)} = \beta_0 + \beta_1\bar{x}_1 + \dots + \beta_k\bar{x}_k, \qquad Z = \left(I - \frac{1}{n}11'\right)X_{(0)} \]
or
\[ Y_i = \gamma_0 + \beta_1(x_{i1} - \bar{x}_1) + \beta_2(x_{i2} - \bar{x}_2) + \dots + \beta_k(x_{ik} - \bar{x}_k) + \varepsilon_i
 = \gamma_0 + \beta_1 Z_{i1} + \beta_2 Z_{i2} + \dots + \beta_k Z_{ik} + \varepsilon_i \]
Centering and scaling: multiply and divide each centered predictor by $\sqrt{\sum_i (x_{ij} - \bar{x}_j)^2}$. Then,
\[ Y_i = \gamma_0 + \beta_1\sqrt{\sum_i (x_{i1} - \bar{x}_1)^2}\,\frac{x_{i1} - \bar{x}_1}{\sqrt{\sum_i (x_{i1} - \bar{x}_1)^2}} + \dots + \beta_k\sqrt{\sum_i (x_{ik} - \bar{x}_k)^2}\,\frac{x_{ik} - \bar{x}_k}{\sqrt{\sum_i (x_{ik} - \bar{x}_k)^2}} + \varepsilon_i \]
or
\[ Y_i = \gamma_0 + \delta_1 Z_{s_{i1}} + \dots + \delta_k Z_{s_{ik}} + \varepsilon_i \]
where
\[ \delta_j = \beta_j\sqrt{\sum_i (x_{ij} - \bar{x}_j)^2}, \qquad Z_{s_{ij}} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_i (x_{ij} - \bar{x}_j)^2}} \]
From the centered model,
\[ Y = \gamma_0 1 + \underbrace{ZD^{-1}}_{Z_s}\underbrace{D\beta_{(0)}}_{\delta_{(0)}} + \varepsilon \]
where
\[ D = \begin{pmatrix} \sqrt{\sum_i (x_{i1} - \bar{x}_1)^2} & & 0 \\ & \ddots & \\ 0 & & \sqrt{\sum_i (x_{ik} - \bar{x}_k)^2} \end{pmatrix} \]
Then
\[ Y = \gamma_0 1 + Z_s\delta_{(0)} + \varepsilon \]
Note:
1. $1'Z_s = 1'ZD^{-1} = 1'\left(I - \frac{1}{n}11'\right)X_{(0)}D^{-1} = 0'$
2. $Z_s'1 = 0$
3. $Z_s'Z_s$ has entries $Z_{s_j}'Z_{s_l}$, for $j, l = 1, \dots, k$.


Let's examine $Z_{s_1}'Z_{s_1}$ and $Z_{s_1}'Z_{s_2}$:
\[ Z_{s_1}'Z_{s_1} = \sum_i \frac{(x_{i1} - \bar{x}_1)^2}{\sum_i (x_{i1} - \bar{x}_1)^2} = 1 \]
Similarly for $Z_{s_1}'Z_{s_2}$,
\[ Z_{s_1}'Z_{s_2} = \frac{\sum_i (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)/(n-1)}{\sqrt{\sum_i (x_{i1} - \bar{x}_1)^2}\sqrt{\sum_i (x_{i2} - \bar{x}_2)^2}/(n-1)} = r_{12}, \]
the sample correlation between $X_1$ and $X_2$. Then,
\[ Z_s'Z_s = R = \begin{pmatrix}
1 & r_{12} & \dots & r_{1k} \\
r_{21} & 1 & \dots & r_{2k} \\
\vdots & \vdots & & \vdots \\
r_{k1} & r_{k2} & \dots & 1
\end{pmatrix} \]
Estimation of $Y = \gamma_0 1 + Z_s\delta_{(0)} + \varepsilon$:
\[ \begin{pmatrix} \hat{\gamma}_0 \\ \hat{\delta}_{(0)} \end{pmatrix}
 = \begin{pmatrix} 1'1 & 1'Z_s \\ Z_s'1 & Z_s'Z_s \end{pmatrix}^{-1}\begin{pmatrix} 1'Y \\ Z_s'Y \end{pmatrix}
 = \begin{pmatrix} n & 0' \\ 0 & Z_s'Z_s \end{pmatrix}^{-1}\begin{pmatrix} 1'Y \\ Z_s'Y \end{pmatrix}
 = \begin{pmatrix} \frac{1}{n}1'Y \\ (Z_s'Z_s)^{-1}Z_s'Y \end{pmatrix} \]

So
\[ \hat{\gamma}_0 = \bar{y}, \]
which is the same as the estimate of $\gamma_0$ in the centered model. And
\[ \hat{\delta}_{(0)} = (Z_s'Z_s)^{-1}Z_s'Y \]
Properties:
\[ E\hat{\delta}_{(0)} = (Z_s'Z_s)^{-1}Z_s'EY = (Z_s'Z_s)^{-1}Z_s'\left(\gamma_0 1 + Z_s\delta_{(0)}\right) = 0 + (Z_s'Z_s)^{-1}Z_s'Z_s\delta_{(0)} = \delta_{(0)} \]
\[ \operatorname{var}(\hat{\delta}_{(0)}) = \operatorname{var}\left[(Z_s'Z_s)^{-1}Z_s'Y\right] = \sigma^2 R^{-1} \]
Non-centered model:
\[ Y = X\beta + \varepsilon \quad \text{or} \quad Y = \beta_0 1 + X_{(0)}\beta_{(0)} + \varepsilon \]
Centered model:
\[ Y = \gamma_0 1 + Z\beta_{(0)} + \varepsilon \]
Centered/scaled model:
\[ Y = \gamma_0 1 + Z_s\delta_{(0)} + \varepsilon \]
where
\[ \gamma_0 = \beta_0 + \frac{1}{n}1'X_{(0)}\beta_{(0)}, \qquad \delta_{(0)} = D\beta_{(0)} \]


So
\[ \hat{\beta}_0 = \hat{\gamma}_0 - \frac{1}{n}1'X_{(0)}\hat{\beta}_{(0)} = \bar{y} - \frac{1}{n}1'X_{(0)}D^{-1}\hat{\delta}_{(0)}, \qquad \hat{\beta}_{(0)} = D^{-1}\hat{\delta}_{(0)} \]
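The relationships among the three parameterizations can be verified numerically. A minimal sketch with simulated predictors (hypothetical data, not from the course) constructs $Z_s$, checks that $Z_s'Z_s$ equals the correlation matrix $R$, and recovers the non-centered OLS coefficients from $\hat{\delta}_{(0)}$.

import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 3
X0 = rng.normal(size=(n, k))                    # predictors without the intercept column
y = 2.0 + X0 @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

Z = X0 - X0.mean(axis=0)                        # centered predictors
scale = np.sqrt((Z**2).sum(axis=0))             # diagonal of D
Zs = Z / scale                                  # centered and scaled predictors

R = Zs.T @ Zs
print(np.allclose(R, np.corrcoef(X0, rowvar=False)))   # Zs'Zs is the correlation matrix R

gamma0_hat = y.mean()
delta_hat = np.linalg.solve(R, Zs.T @ y)        # (Zs'Zs)^{-1} Zs'y
beta_hat = delta_hat / scale                    # beta_(0) = D^{-1} delta_(0)
beta0_hat = gamma0_hat - X0.mean(axis=0) @ beta_hat

full = np.linalg.lstsq(np.column_stack([np.ones(n), X0]), y, rcond=None)[0]
print(np.allclose(full, np.r_[beta0_hat, beta_hat]))   # agrees with OLS on the non-centered model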


§25 Lec 25: Nov 22, 2021

§25.1 Multi-Collinearity
Let $X_1, X_2, \dots, X_k$ be predictors, some of which are highly correlated with other predictors. Earlier we saw that $\delta_1 = \beta_1\sqrt{\sum_i (x_{i1} - \bar{x}_1)^2}$, so
\[ \hat{\delta}_1 = \hat{\beta}_1\sqrt{\sum_i (x_{i1} - \bar{x}_1)^2}, \qquad \hat{\beta}_1 = \frac{\hat{\delta}_1}{\sqrt{\sum_i (x_{i1} - \bar{x}_1)^2}}, \qquad \operatorname{var}(\hat{\beta}_1) = \frac{\operatorname{var}(\hat{\delta}_1)}{\sum_i (x_{i1} - \bar{x}_1)^2} \]
So let's find the variance of $\hat{\delta}_1$ using the centered and scaled model:
\[ \operatorname{var}(\hat{\delta}_{(0)}) = \sigma^2 R^{-1} = \sigma^2 \begin{pmatrix}
1 & r_{12} & r_{13} & \dots & r_{1k} \\
r_{21} & 1 & r_{23} & \dots & r_{2k} \\
\vdots & \vdots & & & \vdots \\
r_{k1} & r_{k2} & r_{k3} & \dots & 1
\end{pmatrix}^{-1} \]
Therefore, $\operatorname{var}(\hat{\delta}_1) = \sigma^2 R^{-1}[1,1]$, the first diagonal element. Using the inverse of a partitioned matrix,
\[ \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1}
 = \begin{pmatrix} C_{11}^{-1} & -C_{11}^{-1}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}C_{11}^{-1} & A_{22}^{-1} + A_{22}^{-1}A_{21}C_{11}^{-1}A_{12}A_{22}^{-1} \end{pmatrix} \]
Here $A_{11} = 1$, $A_{12} = r'$, $A_{21} = r$, $A_{22} = R_{22}$, and $C_{11} = A_{11} - A_{12}A_{22}^{-1}A_{21}$. Therefore,
\[ \operatorname{var}(\hat{\delta}_1) = \frac{\sigma^2}{1 - r'R_{22}^{-1}r} \]
We will show that $\operatorname{var}(\hat{\delta}_1) = \frac{\sigma^2}{1 - R_1^2}$, where $R_1^2$ is the R-square from the regression of $X_1$ on $X_2, X_3, \dots, X_k$. Instead, we can regress $Z_{s_1}$ on $Z_{s_2}, Z_{s_3}, \dots, Z_{s_k}$, because we have seen that the three models (non-centered, centered, and centered/scaled) are equivalent. Here is the model:
\[ Z_{s_{i1}} = \alpha_0 + \alpha_1 Z_{s_{i2}} + \alpha_2 Z_{s_{i3}} + \dots + \alpha_{k-1} Z_{s_{ik}} + \varepsilon_i \]
\[ R_1^2 = \frac{\text{SSR}}{\text{SST}}, \qquad \text{SST} = \sum_i \left(Z_{s_{i1}} - \bar{Z}_{s_1}\right)^2 = \sum_i Z_{s_{i1}}^2 = Z_{s_1}'Z_{s_1} = 1, \]
since $\bar{Z}_{s_1} = 0$. So we have $R_1^2 = \text{SSR} = \sum_i \left(\hat{Z}_{s_{i1}} - \bar{Z}_{s_1}\right)^2 = \hat{Z}_{s_1}'\hat{Z}_{s_1}$. Here $\hat{Z}_{s_1} = HZ_{s_1}$, where $H$ is the hat matrix using $Z_{s_2}, Z_{s_3}, \dots, Z_{s_k}$. Therefore $R_1^2 = Z_{s_1}'HZ_{s_1}$, or
\[ R_1^2 = Z_{s_1}'Z_s^*\left(Z_s^{*\prime}Z_s^*\right)^{-1}Z_s^{*\prime}Z_{s_1} = r'R_{22}^{-1}r \]


Earlier we found that
\[ \operatorname{var}(\hat{\delta}_1) = \frac{\sigma^2}{1 - r'R_{22}^{-1}r} = \frac{\sigma^2}{1 - R_1^2} \]
Now back to the variance of $\hat{\beta}_1$:
\[ \operatorname{var}(\hat{\beta}_1) = \frac{\operatorname{var}(\hat{\delta}_1)}{\sum_i (x_{i1} - \bar{x}_1)^2} \]
Replacing $\operatorname{var}(\hat{\delta}_1) = \frac{\sigma^2}{1 - R_1^2}$, we obtain
\[ \operatorname{var}(\hat{\beta}_1) = \frac{\sigma^2}{(1 - R_1^2)\sum_i (x_{i1} - \bar{x}_1)^2} \]
Therefore, if $R_1^2$ is close to 1, then $\operatorname{var}(\hat{\beta}_1)$ is large.


Detection of multi-collinearity: use the variance inflation factor (VIF). For each predictor $j$ compute
\[ \text{VIF}_j = \frac{1}{1 - R_j^2} \]
where $R_j^2$ is the R-square from the regression of predictor $x_j$ on the other predictors. For example, if $\text{VIF}_j > 10$, then $R_j^2 > 0.90$, which means $x_j$ is highly correlated with the other predictors.
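Below is a minimal VIF computation written directly from this definition (the helper function and the nearly collinear example data are hypothetical, for illustration only).

import numpy as np

def vif(X0):
    """Variance inflation factors for the columns of a predictor matrix X0 (no intercept)."""
    n, k = X0.shape
    out = []
    for j in range(k):
        y = X0[:, j]
        Xj = np.column_stack([np.ones(n), np.delete(X0, j, axis=1)])  # regress X_j on the others
        yhat = Xj @ np.linalg.lstsq(Xj, y, rcond=None)[0]
        r2 = 1.0 - ((y - yhat)**2).sum() / ((y - y.mean())**2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Example with two nearly collinear predictors
rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)           # almost a copy of x1
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))       # first two VIFs are large, third is near 1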

§25.2 Generalized Least Squares


Consider the model
\[ Y = X\beta + \varepsilon \]
So far we assumed the Gauss-Markov conditions. Suppose now
\[ E\varepsilon = 0, \qquad \operatorname{var}(\varepsilon) = \sigma^2 V \]
where $V$ is a known symmetric (positive definite) matrix of constants. If we use the ordinary least squares (OLS) estimator $\hat{\beta} = (X'X)^{-1}X'Y$, we still get $E\hat{\beta} = \beta$ because $EY = X\beta$, but
\[ \operatorname{var}(\hat{\beta}) = \operatorname{var}\left[(X'X)^{-1}X'Y\right] = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1} \]
Therefore $\hat{\beta}$ is not BLUE, because the Gauss-Markov conditions do not hold. We transform the model as follows: let $V^{-\frac{1}{2}}$ be the inverse square root matrix of $V$. Multiply the model on both sides by $V^{-\frac{1}{2}}$:
\[ V^{-\frac{1}{2}}Y = V^{-\frac{1}{2}}X\beta + V^{-\frac{1}{2}}\varepsilon \quad \text{or} \quad Y^* = X^*\beta + \varepsilon^* \]
\[ E\varepsilon^* = E\left(V^{-\frac{1}{2}}\varepsilon\right) = 0, \qquad \operatorname{var}(\varepsilon^*) = \operatorname{var}\left(V^{-\frac{1}{2}}\varepsilon\right) = \sigma^2 V^{-\frac{1}{2}}VV^{-\frac{1}{2}} = \sigma^2 I \]
With this transformation we see that the Gauss-Markov conditions hold. Therefore, we estimate $\beta$ using
\[ \hat{\beta}_{GLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}Y^* \]
Replacing $X^* = V^{-\frac{1}{2}}X$ and $Y^* = V^{-\frac{1}{2}}Y$, we get
\[ \hat{\beta}_{GLS} = \left(X'V^{-1}X\right)^{-1}X'V^{-1}Y \]


Then the mean is
\[ E\hat{\beta}_{GLS} = \left(X'V^{-1}X\right)^{-1}X'V^{-1}EY = \left(X'V^{-1}X\right)^{-1}X'V^{-1}X\beta = \beta, \]
so the estimator is unbiased. And
\[ \operatorname{var}(\hat{\beta}_{GLS}) = \operatorname{var}\left[\left(X'V^{-1}X\right)^{-1}X'V^{-1}Y\right] = \sigma^2\left(X'V^{-1}X\right)^{-1} \]
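A minimal GLS sketch follows (hypothetical heteroscedastic data with a known diagonal $V$, not from the notes): it forms $\hat{\beta}_{GLS} = (X'V^{-1}X)^{-1}X'V^{-1}Y$ and the unbiased error-variance estimate based on the transformed residuals.

import numpy as np

rng = np.random.default_rng(8)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -1.0])

v = rng.uniform(0.5, 4.0, size=n)               # known, unequal error variances (sigma^2 = 1 here)
V = np.diag(v)
y = X @ beta + rng.normal(size=n) * np.sqrt(v)

V_inv = np.diag(1.0 / v)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)

# transformed residuals V^{-1/2}(y - X beta_gls), using the Cholesky factor of V
e_gls = np.linalg.solve(np.linalg.cholesky(V), y - X @ beta_gls)
Se2_gls = e_gls @ e_gls / (n - k - 1)           # unbiased estimator of sigma^2
print(beta_gls, Se2_gls)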


§26 Lec 26: Nov 24, 2021

§26.1 Generalized Least Squares (Cont'd)
Estimate $\beta$ by direct minimization of the error sum of squares using the transformed model
\[ Y^* = X^*\beta + \varepsilon^*, \]
i.e. $\min \varepsilon^{*\prime}\varepsilon^*$ or $\min (Y^* - X^*\beta)'(Y^* - X^*\beta)$. Replacing $Y^* = V^{-\frac{1}{2}}Y$ and $X^* = V^{-\frac{1}{2}}X$, we minimize
\[ Q = (Y - X\beta)'V^{-1}(Y - X\beta) = Y'V^{-1}Y - 2Y'V^{-1}X\beta + \beta'X'V^{-1}X\beta \]
\[ \frac{\partial Q}{\partial \beta} = -2X'V^{-1}Y + 2X'V^{-1}X\beta = 0 \implies \hat{\beta}_{GLS} = \left(X'V^{-1}X\right)^{-1}X'V^{-1}Y \]
Assume now $\varepsilon \sim N_n(0, \sigma^2 V)$. Then $Y \sim N_n(X\beta, \sigma^2 V)$ and
\[ L = \frac{1}{(2\pi)^{\frac{n}{2}}}\left|\sigma^2 V\right|^{-\frac{1}{2}} e^{-\frac{1}{2}(y - X\beta)'(\sigma^2 V)^{-1}(y - X\beta)} \]
\[ \ln L = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2}\ln|V| - \frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta) \]
Setting $\frac{\partial \ln L}{\partial \beta} = 0$ gives again
\[ \hat{\beta}_{GLS} = \left(X'V^{-1}X\right)^{-1}X'V^{-1}Y \]

Estimation of $\sigma^2$:
\[ \frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'V^{-1}(y - X\beta) = 0 \]
\[ \hat{\sigma}^2 = \frac{(y - X\hat{\beta}_{GLS})'V^{-1}(y - X\hat{\beta}_{GLS})}{n} = \frac{(y^* - X^*\hat{\beta}_{GLS})'(y^* - X^*\hat{\beta}_{GLS})}{n} = \frac{e_{GLS}'e_{GLS}}{n} \]
Use the properties of the trace to find $E\hat{\sigma}^2$:
\[ E\hat{\sigma}^2 = \frac{1}{n}E\left[(y - X\hat{\beta}_{GLS})'V^{-1}(y - X\hat{\beta}_{GLS})\right]
 = \frac{1}{n}\left[\operatorname{tr}\left(V^{-1}\operatorname{var}(Y - X\hat{\beta}_{GLS})\right) + 0\right], \]
because $E[Y - X\hat{\beta}_{GLS}] = 0$. So
\[ Y - X\hat{\beta}_{GLS} = Y - X\left(X'V^{-1}X\right)^{-1}X'V^{-1}Y = \left[I - X\left(X'V^{-1}X\right)^{-1}X'V^{-1}\right]Y \]
and
\[ \operatorname{var}(Y - X\hat{\beta}_{GLS}) = \sigma^2\left[I - X\left(X'V^{-1}X\right)^{-1}X'V^{-1}\right]V\left[I - V^{-1}X\left(X'V^{-1}X\right)^{-1}X'\right]
 = \sigma^2 V - \sigma^2 X\left(X'V^{-1}X\right)^{-1}X' \]



Back to the expectation:
\[ E\hat{\sigma}^2 = \frac{1}{n}\operatorname{tr}\left[V^{-1}\left(\sigma^2 V - \sigma^2 X\left(X'V^{-1}X\right)^{-1}X'\right)\right]
 = \frac{1}{n}\left(\sigma^2\operatorname{tr} I_n - \sigma^2\operatorname{tr} I_{k+1}\right) = \frac{n-k-1}{n}\sigma^2 \]
Thus, the unbiased estimator of $\sigma^2$ is $S_{e_{GLS}}^2 = \frac{e_{GLS}'e_{GLS}}{n-k-1}$.

§26.2 Comparing Regression Equations


Suppose we have two data sets on the same variables:
\[ Y_1 = X_1\beta_1 + \varepsilon_1, \qquad Y_2 = X_2\beta_2 + \varepsilon_2 \]
Let $\beta_1 = \begin{pmatrix} \beta^{(1)} \\ \beta_1^{(2)} \end{pmatrix}$, $\beta_2 = \begin{pmatrix} \beta^{(1)} \\ \beta_2^{(2)} \end{pmatrix}$. Note that
\[ \beta^{(1)}: p \times 1, \qquad \beta_1^{(2)}: (k+1-p) \times 1, \qquad \beta_2^{(2)}: (k+1-p) \times 1 \]
Suppose we want to test $\beta_1^{(2)} = \beta_2^{(2)}$ (assume that the first $p$ elements of $\beta_1$ and $\beta_2$ are the same). We can construct one model as follows:
\[ \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
 = \begin{pmatrix} X_1^{(1)} & X_1^{(2)} & 0 \\ X_2^{(1)} & 0 & X_2^{(2)} \end{pmatrix}
 \begin{pmatrix} \beta^{(1)} \\ \beta_1^{(2)} \\ \beta_2^{(2)} \end{pmatrix}
 + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix} \]
Therefore $Y = X\beta + \varepsilon$, and the hypothesis $\beta_1^{(2)} = \beta_2^{(2)}$ can be tested using
\[ H_0: C\beta = 0, \qquad H_a: C\beta \neq 0 \]


§27 Lec 27: Nov 29, 2021

§27.1 Comparing Regression Equations (Cont'd)
We can use the $F$ test for the general linear hypothesis:
\[ \frac{\left(C\hat{\beta} - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat{\beta} - \gamma\right)}{mS_e^2} \sim F_{m,\,n-k-1} \]

Example 27.1
Suppose $k = 5$ and $p = 3$. We want to test
\[ H_0: \beta_3^{(1)} = \beta_3^{(2)}, \; \beta_4^{(1)} = \beta_4^{(2)}, \; \beta_5^{(1)} = \beta_5^{(2)}, \qquad H_a: \text{not true} \]
\[ C = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1
\end{pmatrix} \]
and
\[ \beta = \left(\beta_0, \beta_1, \beta_2, \beta_3^{(1)}, \beta_4^{(1)}, \beta_5^{(1)}, \beta_3^{(2)}, \beta_4^{(2)}, \beta_5^{(2)}\right)' \]
In general,
\[ C = \begin{pmatrix} 0_{k+1-p,\,p} & I_{k+1-p} & -I_{k+1-p} \end{pmatrix} \]
Therefore, $C$ is $(k+1-p) \times \left(2(k+1) - p\right)$.

We can also test the hypothesis using the extra sum of squares principle. Under the null hypothesis, the model is expressed as follows:
\[ \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
 = \begin{pmatrix} X_1^{(1)} & X_1^{(2)} \\ X_2^{(1)} & X_2^{(2)} \end{pmatrix}
 \begin{pmatrix} \beta^{(1)} \\ \beta^* \end{pmatrix}
 + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix} \]
where $\beta^*$ is the common beta subvector under $H_0$. Therefore,
\[ \frac{(\text{SSE}_R - \text{SSE}_F)/(df_R - df_F)}{\text{SSE}_F/df_F} \sim F_{df_R - df_F,\,df_F} \]
\[ df_F = n - p - 2(k+1-p) = n + p - 2(k+1), \qquad df_R = n - p - (k+1-p) = n - k - 1 \]


Example 27.2
Suppose $k = 5$, $p = 4$:
\[ H_0: \beta_4^{(1)} = \beta_4^{(2)}, \; \beta_5^{(1)} = \beta_5^{(2)}, \qquad H_a: \text{not true} \]
Formulation: stack the two data sets, keeping the first four coefficients common and giving each data set its own $\beta_4$ and $\beta_5$:
\[ \begin{pmatrix} y_{11} \\ y_{21} \\ \vdots \\ y_{n1} \\ y_{12} \\ y_{22} \\ \vdots \\ y_{n2} \end{pmatrix}
 = \begin{pmatrix}
1 & x_{11}^{(1)} & x_{12}^{(1)} & x_{13}^{(1)} & x_{14}^{(1)} & x_{15}^{(1)} & 0 & 0 \\
1 & x_{21}^{(1)} & x_{22}^{(1)} & x_{23}^{(1)} & x_{24}^{(1)} & x_{25}^{(1)} & 0 & 0 \\
\vdots & & & & & & & \vdots \\
1 & x_{n1}^{(1)} & x_{n2}^{(1)} & x_{n3}^{(1)} & x_{n4}^{(1)} & x_{n5}^{(1)} & 0 & 0 \\
1 & x_{11}^{(2)} & x_{12}^{(2)} & x_{13}^{(2)} & 0 & 0 & x_{14}^{(2)} & x_{15}^{(2)} \\
1 & x_{21}^{(2)} & x_{22}^{(2)} & x_{23}^{(2)} & 0 & 0 & x_{24}^{(2)} & x_{25}^{(2)} \\
\vdots & & & & & & & \vdots \\
1 & x_{n1}^{(2)} & x_{n2}^{(2)} & x_{n3}^{(2)} & 0 & 0 & x_{n4}^{(2)} & x_{n5}^{(2)}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4^{(1)} \\ \beta_5^{(1)} \\ \beta_4^{(2)} \\ \beta_5^{(2)} \end{pmatrix}
 + \begin{pmatrix} \varepsilon_{11} \\ \varepsilon_{21} \\ \vdots \\ \varepsilon_{n1} \\ \varepsilon_{12} \\ \varepsilon_{22} \\ \vdots \\ \varepsilon_{n2} \end{pmatrix} \]
\[ C = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 \end{pmatrix} \]
\[ df_F = n - 8, \qquad df_R = n - 6 \]

§27.2 Deleting a Single Point in Multiple Regression
We want to explore the effect of deleting a single point in multiple regression
• Effect on β̂
• Effect on Se2
• Effect on fitted values

We can delete one point at a time and run a new regression each time to see the effect. But this
will require n + 1 regressions (one on the full data set and n regressions when we delete data point
i, i = 1, . . . , n). There is a more automated way and the result is based on the residuals, ei , and
leverage values hii from the regression of the full data set.

Consider $Y = X\beta + \varepsilon$ and suppose we want to delete data point $i$. Partition the model as
\[ \begin{pmatrix} Y_{(i)} \\ Y_i \end{pmatrix} = \begin{pmatrix} X_{(i)} \\ x_i' \end{pmatrix}\beta + \begin{pmatrix} \varepsilon_{(i)} \\ \varepsilon_i \end{pmatrix} \]
where $Y_{(i)}$, $X_{(i)}$ are the vector $Y$ and the matrix $X$ after deleting point $i$ from the data set. We are now working with the model
\[ Y_{(i)} = X_{(i)}\beta + \varepsilon_{(i)}, \qquad \hat{\beta}_{(i)} = \left(X_{(i)}'X_{(i)}\right)^{-1}X_{(i)}'Y_{(i)}, \]
where $\hat{\beta}_{(i)}$ is the estimator of the vector $\beta$ after deleting data point $i$.


§28 Lec 28: Dec 1, 2021

§28.1 Deleting a Single Point in Multiple Regression (Cont'd)
Let's find expressions for $X_{(i)}'X_{(i)}$ and $X_{(i)}'Y_{(i)}$:
\[ X'X = \begin{pmatrix} X_{(i)}' & x_i \end{pmatrix}\begin{pmatrix} X_{(i)} \\ x_i' \end{pmatrix} = X_{(i)}'X_{(i)} + x_i x_i'
 \implies X_{(i)}'X_{(i)} = X'X - x_i x_i' \]
Result: let $A$ be a square invertible matrix and $b$ a vector with $b'A^{-1}b \neq 1$. Then
\[ \left(A - bb'\right)^{-1} = A^{-1} + \frac{A^{-1}bb'A^{-1}}{1 - b'A^{-1}b} \]
We can verify this by multiplying both sides by $A - bb'$: the product is the identity matrix. In our problem $A$ is $X'X$ and $b$ is $x_i$, so the result gives $\left(X_{(i)}'X_{(i)}\right)^{-1}$. Now let's find $X_{(i)}'Y_{(i)}$. From
\[ X'Y = \begin{pmatrix} X_{(i)}' & x_i \end{pmatrix}\begin{pmatrix} Y_{(i)} \\ y_i \end{pmatrix} = X_{(i)}'Y_{(i)} + x_i y_i \]
we get $X_{(i)}'Y_{(i)} = X'Y - x_i y_i$. Substituting both into $\hat{\beta}_{(i)} = \left(X_{(i)}'X_{(i)}\right)^{-1}X_{(i)}'Y_{(i)}$ and simplifying, we find
\[ \hat{\beta}_{(i)} = \hat{\beta} - \frac{(X'X)^{-1}x_i e_i}{1 - h_{ii}} \]
Then
\[ \hat{\beta} - \hat{\beta}_{(i)} = \frac{(X'X)^{-1}x_i e_i}{1 - h_{ii}} \]
This is the difference in the estimator of $\beta$ before and after deleting data point $i$.
Effect on fitted values:
\[ \hat{Y}_i - \hat{Y}_{i(i)} = x_i'\hat{\beta} - x_i'\hat{\beta}_{(i)} = x_i'\left(\hat{\beta} - \hat{\beta}_{(i)}\right)
 = x_i'\frac{(X'X)^{-1}x_i e_i}{1 - h_{ii}} = \frac{h_{ii}}{1 - h_{ii}}e_i \]
Finally, we can show that the error sum of squares after deleting data point $i$ is connected with the error sum of squares of the full data set as follows:
\[ (n-k-2)S_{e(i)}^2 = (n-k-1)S_e^2 - \frac{e_i^2}{1 - h_{ii}} \]
• $S_{e(i)}^2$ is the unbiased estimator of $\sigma^2$ after deleting data point $i$
• $S_e^2$ is the unbiased estimator of $\sigma^2$ using the full data set
• $e_i$, $h_{ii}$ are residual $i$ and leverage $i$ using the full data set
\[ e = (I - H)Y, \qquad e_i = \left[(I - H)Y\right]_i, \qquad h_{ii} = x_i'(X'X)^{-1}x_i \]
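These downdate identities are straightforward to verify numerically. A small sketch (simulated data; the deleted index is arbitrary) compares the closed-form $\hat{\beta}_{(i)}$ with an actual refit on $n-1$ observations and checks the SSE relation.

import numpy as np

rng = np.random.default_rng(9)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
bhat = XtX_inv @ X.T @ y
e = y - X @ bhat
H = X @ XtX_inv @ X.T

i = 7                                            # data point to delete (arbitrary choice)
hii, ei = H[i, i], e[i]

beta_i_formula = bhat - XtX_inv @ X[i] * ei / (1 - hii)        # closed-form downdate
keep = np.delete(np.arange(n), i)
beta_i_refit = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
print(np.allclose(beta_i_formula, beta_i_refit))

# SSE relation: SSE_(i) = SSE - e_i^2 / (1 - h_ii)
r = y[keep] - X[keep] @ beta_i_refit
print(np.isclose(r @ r, e @ e - ei**2 / (1 - hii)))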


Adding a data point in multiple regression:
\[ \begin{pmatrix} Y \\ y_0 \end{pmatrix} = \begin{pmatrix} X \\ x_0' \end{pmatrix}\beta + \begin{pmatrix} \varepsilon \\ \varepsilon_0 \end{pmatrix}, \qquad Y_{new} = X_{new}\beta + \varepsilon_{new} \]
\[ X_{new}'X_{new} = \begin{pmatrix} X' & x_0 \end{pmatrix}\begin{pmatrix} X \\ x_0' \end{pmatrix} = X'X + x_0 x_0' \]
Result: let $A$ be a square invertible matrix and $b$ a vector such that $1 + b'A^{-1}b \neq 0$. Then
\[ \left(A + bb'\right)^{-1} = A^{-1} - \frac{A^{-1}bb'A^{-1}}{1 + b'A^{-1}b} \]
Here $A$ is $X'X$ and $b$ is $x_0$. Also,
\[ X_{new}'Y_{new} = \begin{pmatrix} X' & x_0 \end{pmatrix}\begin{pmatrix} Y \\ y_0 \end{pmatrix} = X'Y + x_0 y_0 \]
Finally,
\[ \hat{\beta}_{new} = \hat{\beta} + \frac{(X'X)^{-1}x_0 e_0}{1 + h_{00}} \]
where $e_0 = y_0 - x_0'\hat{\beta}$ and $h_{00} = x_0'(X'X)^{-1}x_0$.

§28.2 Influential Analysis
Internally studentized residuals:
\[ e_i \sim N\left(0, \sigma\sqrt{1 - h_{ii}}\right), \quad \frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}
 \implies \frac{\frac{e_i}{\sigma\sqrt{1 - h_{ii}}}}{\sqrt{\frac{(n-k-1)S_e^2}{\sigma^2}/(n-k-1)}} = \frac{e_i}{S_e\sqrt{1 - h_{ii}}} \]
This ratio does not follow a $t$ distribution because $e_i$ and $S_e$ are not independent. Let $r_i = \frac{e_i}{S_e\sqrt{1 - h_{ii}}}$. We will show that
\[ \frac{r_i^2}{n-k-1} \sim \text{beta}\left(\frac{1}{2}, \frac{1}{2}(n-k-2)\right) \]
So we need to show that
\[ \frac{e_i^2}{\text{SSE}(1 - h_{ii})} \sim \text{beta}\left(\frac{1}{2}, \frac{1}{2}(n-k-2)\right) \]

1. $e = (I - H)\varepsilon$, thus
\[ e_i = c_i'(I - H)\varepsilon \]
where $c_i' = \begin{pmatrix} 0 & 0 & \dots & 1 & 0 & \dots & 0 \end{pmatrix}$ with the 1 at the $i$th position. So
\[ e_i^2 = e_i e_i = \varepsilon'(I - H)c_i c_i'(I - H)\varepsilon \]

2. $\text{SSE} = \varepsilon'(I - H)\varepsilon$

§29 Lec 29: Dec 3, 2021

§29.1 Influential Analysis (Cont'd)
Let's express $\frac{e_i^2}{\text{SSE}(1 - h_{ii})}$ as follows:
\[ \frac{\varepsilon'(I - H)c_i c_i'(I - H)\varepsilon}{\varepsilon'(I - H)\varepsilon\,(1 - h_{ii})} \]
Dividing numerator and denominator by $\sigma^2$,
\[ \frac{\frac{\varepsilon'}{\sigma}\,\frac{(I - H)c_i c_i'(I - H)}{1 - h_{ii}}\,\frac{\varepsilon}{\sigma}}{\frac{\varepsilon'}{\sigma}(I - H)\frac{\varepsilon}{\sigma}} = \frac{Z'QZ}{Z'(I - H)Z} \]
Here $Z = \frac{\varepsilon}{\sigma} \sim N_n(0, I)$ and $Q = \frac{(I - H)c_i c_i'(I - H)}{1 - h_{ii}}$.

3. We have
\[ QQ = \frac{(I - H)c_i c_i'(I - H)(I - H)c_i c_i'(I - H)}{(1 - h_{ii})^2}
 = \frac{(I - H)c_i\left[c_i'(I - H)c_i\right]c_i'(I - H)}{(1 - h_{ii})^2}
 = \frac{(I - H)c_i c_i'(I - H)}{1 - h_{ii}} = Q, \]
since $c_i'(I - H)c_i = 1 - h_{ii}$. Thus $Q$ is a symmetric and idempotent matrix. Because $Z \sim N_n(0, I)$, it follows that $Z'QZ \sim \chi^2_{\operatorname{tr}(Q)}$, and
\[ \operatorname{tr}(Q) = \operatorname{tr}\left[\frac{(I - H)c_i c_i'(I - H)}{1 - h_{ii}}\right] = \frac{c_i'(I - H)(I - H)c_i}{1 - h_{ii}} = 1 \]
Thus $Z'QZ \sim \chi^2_1$.
4. Back to the ratio:
\[ \frac{Z'QZ}{Z'(I - H)Z} = \frac{Z'QZ}{Z'(I - H - Q)Z + Z'QZ} \]
Moreover,
\[ (I - H - Q)Q = Q - HQ - QQ = Q - 0 - Q = 0, \qquad \text{since } HQ = \frac{H(I - H)c_i c_i'(I - H)}{1 - h_{ii}} = 0 \]
Thus $Z'(I - H - Q)Z$ and $Z'QZ$ are independent.


5. Consider
\[ (I - H - Q)(I - H - Q) = (I - H)(I - H) - (I - H)Q - Q(I - H) + QQ = (I - H) - Q - Q + Q = I - H - Q \]
Therefore $I - H - Q$ is a symmetric and idempotent matrix. It follows that
\[ Z'(I - H - Q)Z \sim \chi^2_{\operatorname{tr}(I - H - Q)}, \quad \text{i.e.} \quad Z'(I - H - Q)Z \sim \chi^2_{n-k-2} \]


Result: let $X \sim \Gamma(\alpha_1, \beta)$, $Y \sim \Gamma(\alpha_2, \beta)$, with $X, Y$ independent. Let $U = X + Y$ and $V = \frac{X}{X+Y}$. Then $U, V$ are independent and
\[ U \sim \Gamma(\alpha_1 + \alpha_2, \beta), \qquad V \sim \text{beta}(\alpha_1, \alpha_2) \]
Here $Z'QZ \sim \chi^2_1$, i.e. $Z'QZ \sim \Gamma\left(\frac{1}{2}, 2\right)$, and $Z'(I - H - Q)Z \sim \chi^2_{n-k-2}$, i.e. $\sim \Gamma\left(\frac{1}{2}(n-k-2), 2\right)$. Therefore,
\[ \frac{Z'QZ}{Z'(I - H - Q)Z + Z'QZ} \sim \text{beta}\left(\frac{1}{2}, \frac{1}{2}(n-k-2)\right) \]
We can conclude that
\[ \frac{r_i^2}{n-k-1} \sim \text{beta}\left(\frac{1}{2}, \frac{1}{2}(n-k-2)\right) \]

§29.2 Externally Studentized Residual
Consider the ratio
\[ t_i = \frac{e_i}{S_{e(i)}\sqrt{1 - h_{ii}}} \]
where $S_{e(i)}^2$ is the unbiased estimator of $\sigma^2$ after data point $i$ is deleted from the data set. Notice that
\[ t_i^2 = \frac{e_i^2}{S_{e(i)}^2(1 - h_{ii})} = \frac{e_i^2(n-k-2)}{(n-k-2)S_{e(i)}^2(1 - h_{ii})} \]
But $(n-k-2)S_{e(i)}^2 = (n-k-1)S_e^2 - \frac{e_i^2}{1 - h_{ii}}$. Then,
\[ t_i^2 = \frac{e_i^2(n-k-2)}{\left[(n-k-1)S_e^2 - \frac{e_i^2}{1 - h_{ii}}\right](1 - h_{ii})} = \frac{e_i^2(n-k-2)}{(n-k-1)S_e^2(1 - h_{ii}) - e_i^2} \]
Note: $r_i^2 = \frac{e_i^2}{S_e^2(1 - h_{ii})} \implies e_i^2 = r_i^2 S_e^2(1 - h_{ii})$. Then,
\[ t_i^2 = \frac{r_i^2 S_e^2(1 - h_{ii})(n-k-2)}{(n-k-1)S_e^2(1 - h_{ii}) - r_i^2 S_e^2(1 - h_{ii})} = \frac{r_i^2(n-k-2)}{n-k-1-r_i^2} = (n-k-2)\frac{B}{1-B}, \]
where $B = \frac{r_i^2}{n-k-1}$, and $\frac{r_i^2}{n-k-1} \sim \text{beta}\left(\frac{1}{2}, \frac{1}{2}(n-k-2)\right)$. From homework #10, exercise #5: if $B \sim \text{beta}\left(\frac{1}{2}\alpha, \frac{1}{2}\beta\right)$, then
\[ \frac{\beta B}{\alpha(1 - B)} \sim F_{\alpha, \beta} \]
Here $\alpha = 1$, $\beta = n - k - 2$. It follows that
\[ t_i^2 = (n-k-2)\frac{B}{1-B} \sim F_{1,\,n-k-2} \]
and therefore
\[ t_i = \frac{e_i}{S_{e(i)}\sqrt{1 - h_{ii}}} \sim t_{n-k-2} \]
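Both kinds of studentized residuals can be computed without refitting, using only $e_i$, $h_{ii}$, and SSE from the full fit. A minimal sketch (simulated data, not from the notes) computes them and checks the algebraic relation $t_i = r_i\sqrt{(n-k-2)/(n-k-1-r_i^2)}$ implied by the derivation above.

import numpy as np

rng = np.random.default_rng(10)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
h = np.diag(H)
SSE = e @ e
Se2 = SSE / (n - k - 1)

r = e / np.sqrt(Se2 * (1 - h))                       # internally studentized residuals
Se2_i = (SSE - e**2 / (1 - h)) / (n - k - 2)         # leave-one-out variance estimates
t = e / np.sqrt(Se2_i * (1 - h))                     # externally studentized residuals

print(np.allclose(t, r * np.sqrt((n - k - 2) / (n - k - 1 - r**2))))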


§29.3 A Note on Variable Selection
Effect on the regression when predictors are removed from the model:

a) Effect on $\hat{\beta}$. Suppose
\[ Y = X\beta + \varepsilon \quad \text{or} \quad Y = X_1\beta_1 + X_2\beta_2 + \varepsilon \]
is the correct model, but we decide to use $Y = X_1\beta_1 + \varepsilon$. Then $\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'Y$ and therefore
\[ E\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2, \]
which is not unbiased (unless $X_1'X_2 = 0$ or $\beta_2 = 0$).

b) Effect on the variance-covariance matrix of $\hat{\beta}$:
• From the short regression: $\operatorname{var}(\hat{\beta}_1) = \sigma^2(X_1'X_1)^{-1}$
• From the long regression:
\[ \operatorname{var}(\hat{\beta}_{1\cdot 2}) = \sigma^2\left(X_1^{*\prime}X_1^*\right)^{-1} = \sigma^2\left(X_1'(I - H_2)X_1\right)^{-1} = \sigma^2\left(X_1'X_1 - X_1'H_2X_1\right)^{-1} \]
Result: if $A^{-1} \ge B^{-1}$ then $A \le B$. Then
\[ \left[\operatorname{var}(\hat{\beta}_1)\right]^{-1} - \left[\operatorname{var}(\hat{\beta}_{1\cdot 2})\right]^{-1} = \frac{1}{\sigma^2}X_1'H_2X_1 \ge 0
 \implies \operatorname{var}(\hat{\beta}_1) \le \operatorname{var}(\hat{\beta}_{1\cdot 2}) \]
