Stats 100C
Duc Vu
Fall 2021
This is Stats 100C – Linear Models, taught by Professor Christou. There is no official textbook for the course; instead, handouts and reference materials are distributed and can be accessed through the class website. You can find other math/stats lecture notes through my personal blog. Let me know by email if you notice anything mathematically wrong or concerning. Thank you!
Contents
§1 Lec 1: Sep 27, 2021
$$f(y_i) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2\sigma^2}(y_i-\mu)^2} = \left(2\pi\sigma^2\right)^{-\frac12}e^{-\frac{1}{2\sigma^2}(y_i-\mu)^2}$$

$$L = f(y_1)\cdots f(y_n) = \left(2\pi\sigma^2\right)^{-\frac n2}e^{-\frac{1}{2\sigma^2}\sum(y_i-\mu)^2}$$

$$\ln L = -\frac n2\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum(y_i-\mu)^2$$

Setting

$$\frac{\partial\ln L}{\partial\mu} = 0, \qquad \frac{\partial\ln L}{\partial\sigma^2} = 0$$

gives $\hat\mu = \bar y$ and $\hat\sigma^2 = \frac1n\sum(y_i-\bar y)^2$, while the unbiased sample variance is

$$S^2 = \frac{\sum(y_i-\bar y)^2}{n-1}$$
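As a quick numerical sketch (made-up data, not from the course), the closed-form MLEs above do maximize the normal log-likelihood, and $S^2$ differs from $\hat\sigma^2$ only by the $n-1$ versus $n$ divisor:

```python
import numpy as np

# Illustrative check that mu_hat = ybar and sigma2_hat = sum((y-ybar)^2)/n
# maximize the normal log-likelihood, while S^2 divides by n - 1.
rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=50)
n = len(y)

def loglik(mu, s2):
    return -n / 2 * np.log(2 * np.pi * s2) - np.sum((y - mu) ** 2) / (2 * s2)

mu_hat = y.mean()
s2_hat = np.sum((y - mu_hat) ** 2) / n      # MLE (divides by n)
S2 = np.sum((y - mu_hat) ** 2) / (n - 1)    # unbiased sample variance

# The log-likelihood at the MLE beats nearby parameter values.
best = loglik(mu_hat, s2_hat)
assert all(best >= loglik(mu_hat + d, s2_hat) for d in (-0.1, 0.1))
assert all(best >= loglik(mu_hat, s2_hat + d) for d in (-0.1, 0.1))
```

The MLE $\hat\sigma^2$ is always slightly smaller than $S^2$, which is why it is biased downward.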
$$\hat Y_0 = \bar Y$$

1. Predictor assumption: $\hat Y_0 = \sum_{i=1}^n a_i Y_i$
Notice that this is a constrained optimization problem, so we use the method of Lagrange multipliers to obtain

$$\min Q = E\left(Y_0 - \hat Y_0\right)^2 - 2\lambda\left(\sum a_i - 1\right)$$

Remark 1.1. Compare this to the confidence interval for $\mu$: $\mu \in \bar Y \pm t_{\frac\alpha2;\,n-1}\frac{S}{\sqrt n}$.
§2 Lec 2: Sep 29, 2021
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \quad\text{or}\quad Y_i = \beta_1 X_i + \varepsilon_i$$

Data: pairs $(x_i, y_i)$ for $i = 1,\dots,n$.

[Diagram: is a theoretical model available? If yes, use it; if no, build a model from the data.]

$$E(\varepsilon_i) = 0, \qquad \operatorname{var}(\varepsilon_i) = \sigma^2$$

$$EY_i = \beta_0 + \beta_1 X_i, \qquad \operatorname{var}(Y_i) = \sigma^2$$
$$\min Q = \sum\varepsilon_i^2 = \sum\left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$

$$\frac{\partial Q}{\partial\beta_0} = -2\sum\left(Y_i - \beta_0 - \beta_1 X_i\right) = 0$$

$$\frac{\partial Q}{\partial\beta_1} = -2\sum\left(Y_i - \beta_0 - \beta_1 X_i\right)X_i = 0$$
So,

$$\begin{cases}\sum y_i - n\beta_0 - \beta_1\sum x_i = 0\\ \sum x_iy_i - \beta_0\sum x_i - \beta_1\sum x_i^2 = 0\end{cases}$$

$$\implies \begin{cases}n\beta_0 + \beta_1\sum x_i = \sum y_i\\ \beta_0\sum x_i + \beta_1\sum x_i^2 = \sum x_iy_i\end{cases} \quad\text{-- normal equations}$$

$$\hat\beta_1 = \frac{\sum x_iy_i - \frac1n\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}$$

or

$$\hat\beta_1 = \frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}$$

or

$$\hat\beta_1 = \frac{\sum(x_i-\bar x)y_i}{\sum(x_i-\bar x)^2} \tag{*}$$

or

$$\hat\beta_1 = \frac{\sum(y_i-\bar y)x_i}{\sum(x_i-\bar x)^2}$$

or

$$\hat\beta_1 = \frac{\sum x_iy_i - n\bar x\bar y}{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}$$
Note: From (*), we have

$$\hat\beta_1 = \frac{\sum(x_i-\bar x)y_i}{\sum(x_i-\bar x)^2} = \frac{(x_1-\bar x)y_1}{\sum(x_i-\bar x)^2} + \ldots + \frac{(x_n-\bar x)y_n}{\sum(x_i-\bar x)^2} = k_1y_1 + \ldots + k_ny_n = \sum_{i=1}^n k_iy_i$$

where $k_i = \frac{x_i-\bar x}{\sum(x_j-\bar x)^2}$.
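A quick numerical sketch (illustrative data) confirming that the five algebraically equivalent slope expressions above all agree:

```python
import numpy as np

# Check that the five equivalent formulas for the least squares slope agree.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 + 3.0 * x + rng.normal(size=30)
n = len(x)
xb, yb = x.mean(), y.mean()
Sxx = np.sum((x - xb) ** 2)

b1_a = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
b1_b = np.sum((x - xb) * (y - yb)) / Sxx
b1_c = np.sum((x - xb) * y) / Sxx          # form (*): weights k_i = (x_i - xbar)/Sxx
b1_d = np.sum((y - yb) * x) / Sxx
b1_e = (np.sum(x * y) - n * xb * yb) / (np.sum(x**2) - np.sum(x)**2 / n)

assert np.allclose([b1_a, b1_b, b1_c, b1_d, b1_e], b1_a)
```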
Properties of $\hat\beta_1$:

$$E\hat\beta_1 = E\sum k_iy_i = \sum k_iEy_i = \sum k_i(\beta_0+\beta_1x_i) = \beta_0\sum k_i + \beta_1\sum k_ix_i = \beta_1 \quad\text{-- unbiased}$$

since $\sum k_i = 0$ and $\sum k_ix_i = 1$.
Properties of $\hat\beta_0$:

$$\hat\beta_0 = \bar y - \hat\beta_1\bar x = \sum\frac{y_i}{n} - \bar x\sum k_iy_i = \sum\left(\frac1n - \bar xk_i\right)y_i = \sum_{i=1}^n l_iy_i$$

where $l_i = \frac1n - \bar xk_i$, and the properties of $l_i$ are

$$\sum l_i = 1$$

$$\sum l_i^2 = \sum\left(\frac1n - \bar xk_i\right)^2 = \sum\left(\frac1{n^2} + \bar x^2k_i^2 - \frac2n\bar xk_i\right) = \frac1n + \frac{\bar x^2}{\sum(x_i-\bar x)^2}$$

$$\sum l_ix_i = 0$$
Thus,

$$\operatorname{var}\hat\beta_0 = \operatorname{var}\sum l_iy_i = \sigma^2\sum l_i^2 = \sigma^2\left(\frac1n + \frac{\bar x^2}{\sum(x_i-\bar x)^2}\right)$$

The fitted value is

$$\hat y_i = \hat\beta_0 + \hat\beta_1x_i = \bar y + \hat\beta_1(x_i-\bar x)$$

and the residual is defined as

$$e_i = y_i - \hat y_i$$

with properties

$$\sum e_i = 0, \qquad \sum e_ix_i = 0, \qquad \sum e_i\hat y_i = 0$$

in which

SST: sum of squares total
SSE: sum of squares error
SSR: sum of squares regression
§3 Lec 3: Oct 1, 2021
§3.1 Gauss-Markov Theorem
Recall

$$\hat\beta_1 = \sum k_iY_i$$

For any other linear unbiased estimator $b_1 = \sum a_iY_i$, unbiasedness requires

$$\begin{cases}\sum a_i = 0\\ \sum a_ix_i = 1\end{cases}$$

and we know that

$$\operatorname{var}(b_1) = \operatorname{var}\left(\sum_{i=1}^n a_iY_i\right) = \sigma^2\sum a_i^2$$

and

$$\operatorname{var}(\hat\beta_1) = \sigma^2\sum k_i^2 = \frac{\sigma^2}{\sum(x_i-\bar x)^2}$$

Now let $a_i = k_i + d_i$. Then,

$$\operatorname{var}(b_1) = \sigma^2\sum(k_i+d_i)^2 = \sigma^2\sum k_i^2 + \sigma^2\sum d_i^2 + 2\sigma^2\sum k_id_i$$

We need to show $\sum k_id_i = 0$:

$$\sum k_i(a_i-k_i) = \sum k_ia_i - \sum k_i^2 = \frac{\sum(x_i-\bar x)a_i}{\sum(x_i-\bar x)^2} - \frac{1}{\sum(x_i-\bar x)^2} = \frac{\sum x_ia_i}{\sum(x_i-\bar x)^2} - \frac{\bar x\sum a_i}{\sum(x_i-\bar x)^2} - \frac{1}{\sum(x_i-\bar x)^2} = 0$$

since $\sum a_i = 0$ and $\sum a_ix_i = 1$. So $\operatorname{var}(b_1) \ge \operatorname{var}(\hat\beta_1)$, and therefore $\hat\beta_1$ is the best linear unbiased estimator (BLUE).
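A small numerical sketch of the Gauss-Markov comparison (made-up data): any weights $a_i$ satisfying the two constraints give an unbiased linear estimator, but the least squares weights $k_i$ have the smallest $\sum a_i^2$, and the cross term $\sum k_id_i$ vanishes exactly:

```python
import numpy as np

# Compare least squares weights k_i with perturbed weights a_i = k_i + d_i
# that still satisfy sum(a_i) = 0 and sum(a_i x_i) = 1.
rng = np.random.default_rng(2)
x = rng.normal(size=25)
xb = x.mean()
Sxx = np.sum((x - xb) ** 2)
k = (x - xb) / Sxx                       # least squares weights

d = rng.normal(size=25)
d -= d.mean()                            # enforce sum(d) = 0
d -= (np.sum(d * x) / Sxx) * (x - xb)    # enforce sum(d * x) = 0
a = k + d

assert abs(np.sum(a)) < 1e-10 and abs(np.sum(a * x) - 1) < 1e-10
# var(b1)/sigma^2 = sum(a^2) >= sum(k^2) = var(beta1_hat)/sigma^2
assert np.sum(a ** 2) >= np.sum(k ** 2)
assert abs(np.sum(k * d)) < 1e-10        # the cross term vanishes
```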
§3.2 Estimation of Variance

Using the method of maximum likelihood,

$$\hat\sigma^2 = \frac{\sum e_i^2}{n}$$

Is it unbiased?

$$E\hat\sigma^2 = \frac{\sum Ee_i^2}{n} = \frac{\sum\left[\operatorname{var}(e_i) + (Ee_i)^2\right]}{n}$$
Then, since $Ee_i = 0$,

$$E\hat\sigma^2 = \frac{\sum\operatorname{var}(e_i)}{n}$$

Notice that

$$e_i = Y_i - \hat\beta_0 - \hat\beta_1X_i \quad\text{or}\quad e_i = Y_i - \bar Y - \hat\beta_1(X_i-\bar X)$$

where $\hat Y_i = \bar Y + \hat\beta_1(X_i-\bar X)$. Substituting in, we get

$$\operatorname{var}(e_i) = \operatorname{var}\left[Y_i - \bar Y - \hat\beta_1(X_i-\bar X)\right]$$

We have

$$Y_i = \beta_0 + \beta_1X_i + \varepsilon_i, \qquad \operatorname{var}(Y_i) = \sigma^2$$

$$\bar Y = \beta_0 + \beta_1\bar X + \frac{\sum\varepsilon_i}{n}, \qquad \operatorname{var}(\bar Y) = \frac{\sigma^2}{n}$$

$$\operatorname{cov}(Y_i,\bar Y) = \operatorname{cov}\left(Y_i, \frac{Y_1+\ldots+Y_i+\ldots+Y_n}{n}\right) = \frac1n\operatorname{cov}(Y_i,Y_1) + \ldots + \frac1n\operatorname{cov}(Y_i,Y_i) + \ldots + \frac1n\operatorname{cov}(Y_i,Y_n) = \frac{\sigma^2}{n}$$

$$\operatorname{cov}(Y_i,\hat\beta_1) = \operatorname{cov}\left(Y_i, \sum k_jY_j\right) = k_1\operatorname{cov}(Y_i,Y_1) + \ldots + k_i\operatorname{cov}(Y_i,Y_i) + \ldots + k_n\operatorname{cov}(Y_i,Y_n) = \sigma^2k_i = \frac{\sigma^2(x_i-\bar x)}{\sum(x_i-\bar x)^2}$$

Note: this uses the bilinearity of covariance.
Therefore,

$$E\hat\sigma^2 = \frac{\sum\operatorname{var}(e_i)}{n} = \frac{\sigma^2\sum_{i=1}^n\left[1 - \frac1n - \frac{(x_i-\bar x)^2}{\sum(x_i-\bar x)^2}\right]}{n} = \frac{(n-2)}{n}\sigma^2$$

It follows that the unbiased estimator of $\sigma^2$ is

$$S_e^2 = \frac{n}{n-2}\hat\sigma^2 = \frac{\sum e_i^2}{n-2}$$

We will show $\frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$ in the next lecture.
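The residual variances above can be checked deterministically: the terms $1 - \frac1n - \frac{(x_i-\bar x)^2}{\sum(x_i-\bar x)^2}$ are exactly the diagonal of $I - H$ for the simple regression design, and they sum to $n-2$ (illustrative $x$ values; $\sigma^2$ factors out):

```python
import numpy as np

# Deterministic check of sum(var(e_i))/sigma^2 = n - 2 via the hat matrix.
rng = np.random.default_rng(3)
n = 20
x = rng.normal(size=n)
xb = x.mean()
Sxx = np.sum((x - xb) ** 2)

X = np.column_stack([np.ones(n), x])     # design matrix (1, x)
H = X @ np.linalg.inv(X.T @ X) @ X.T
diag_IH = 1 - np.diag(H)                 # var(e_i)/sigma^2

formula = 1 - 1 / n - (x - xb) ** 2 / Sxx
assert np.allclose(diag_IH, formula)
assert np.isclose(diag_IH.sum(), n - 2)
```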
§4 Lec 4: Oct 4, 2021
§4.1 Centered Model
Consider the model: $Y_i = \beta_0 + \beta_1X_i + \varepsilon_i$, $i = 1,\dots,n$, and the Gauss-Markov conditions hold, i.e.,

$$E[\varepsilon_i] = 0, \qquad \operatorname{var}[\varepsilon_i] = \sigma^2$$

for $i = 1,\dots,n$, and $\varepsilon_1,\dots,\varepsilon_n$ are independent (we assume $\varepsilon_1,\dots,\varepsilon_n \overset{\text{i.i.d.}}{\sim} N(0,\sigma)$). This is the non-centered model. Let's look at a centered model:

$$Y_i = \beta_0 + \beta_1X_i \pm \beta_1\bar X + \varepsilon_i$$
$$Y_i = \beta_0 + \beta_1\bar X + \beta_1(X_i-\bar X) + \varepsilon_i$$
$$Y_i = \gamma_0 + \beta_1Z_i + \varepsilon_i \quad\text{-- centered model}$$

where $\gamma_0 = \beta_0 + \beta_1\bar X$ and $Z_i = X_i - \bar X$.
Note: $\sum z_i = \sum(x_i-\bar x) = 0$ and $\bar z = 0$. So,

$$\hat\beta_1 = \frac{\sum(z_i-\bar z)y_i}{\sum(z_i-\bar z)^2} = \frac{\sum z_iy_i}{\sum z_i^2} = \frac{\sum(x_i-\bar x)y_i}{\sum(x_i-\bar x)^2} \quad\text{-- same as the non-centered model}$$

$$\hat\gamma_0 = \bar y - \hat\beta_1\bar z = \bar y$$

Notice $\hat y_i = \bar y + \hat\beta_1(x_i-\bar x)$, which is the same as $\hat y_i$ of the non-centered model.
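A short check on toy data that the centered model reproduces the non-centered slope and fitted values, with $\hat\gamma_0 = \bar y$:

```python
import numpy as np

# Centered vs non-centered simple regression give identical fits.
rng = np.random.default_rng(4)
x = rng.normal(size=15)
y = 1.0 - 2.0 * x + rng.normal(size=15)
z = x - x.mean()                          # centered predictor

b1 = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
b1_centered = np.sum(z * y) / np.sum(z ** 2)
g0 = y.mean()                             # gamma0_hat = ybar

fitted_noncentered = (y.mean() - b1 * x.mean()) + b1 * x
fitted_centered = g0 + b1_centered * z
assert np.isclose(b1, b1_centered)
assert np.allclose(fitted_noncentered, fitted_centered)
```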
§4.2 Distribution Theory Using the Centered Model
We have

$$Y_i \sim N\left(\gamma_0 + \beta_1(X_i-\bar X),\ \sigma\right)$$

$$\hat\beta_1 \sim N\left(\beta_1,\ \frac{\sigma}{\sqrt{\sum(x_i-\bar x)^2}}\right)$$

$$\hat\gamma_0 = \bar Y \sim N\left(\gamma_0,\ \frac{\sigma}{\sqrt n}\right)$$

Now, let's show that $\frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$. We have

$$\frac{Y_i - \gamma_0 - \beta_1(X_i-\bar X)}{\sigma} \sim N(0,1), \qquad \left[\frac{Y_i - \gamma_0 - \beta_1(X_i-\bar X)}{\sigma}\right]^2 \sim \chi^2_1$$

It follows that

$$\frac{\sum_{i=1}^n\left[Y_i - \gamma_0 - \beta_1(X_i-\bar X)\right]^2}{\sigma^2} \sim \chi^2_n$$

Notice that $\frac{(n-2)S_e^2}{\sigma^2} = \frac{\sum e_i^2}{\sigma^2}$. Let's manipulate this expression. First, let

$$L = \frac{\sum\left[Y_i - \gamma_0 - \beta_1(X_i-\bar X) \pm \hat\gamma_0 \pm \hat\beta_1(X_i-\bar X)\right]^2}{\sigma^2}$$
Then,

$$L = \frac{\sum\left[y_i - \hat\gamma_0 - \hat\beta_1(x_i-\bar x) + (\hat\gamma_0-\gamma_0) + (\hat\beta_1-\beta_1)(x_i-\bar x)\right]^2}{\sigma^2} = \frac{\sum\left[e_i + (\hat\gamma_0-\gamma_0) + (\hat\beta_1-\beta_1)(x_i-\bar x)\right]^2}{\sigma^2}$$

$$= \frac{\sum e_i^2}{\sigma^2} + \frac{n(\hat\gamma_0-\gamma_0)^2}{\sigma^2} + \frac{(\hat\beta_1-\beta_1)^2\sum(x_i-\bar x)^2}{\sigma^2} + \frac{2(\hat\gamma_0-\gamma_0)\sum e_i}{\sigma^2} + \frac{2(\hat\beta_1-\beta_1)\sum e_i(x_i-\bar x)}{\sigma^2} + \frac{2(\hat\gamma_0-\gamma_0)(\hat\beta_1-\beta_1)\sum(x_i-\bar x)}{\sigma^2}$$

The cross terms vanish since $\sum e_i = 0$, $\sum e_i(x_i-\bar x) = 0$, and $\sum(x_i-\bar x) = 0$. So far,

$$\underbrace{\frac{\sum\left[y_i-\gamma_0-\beta_1(x_i-\bar x)\right]^2}{\sigma^2}}_{\chi^2_n} = \underbrace{\frac{(n-2)S_e^2}{\sigma^2}}_{?} + \underbrace{\left[\frac{\hat\gamma_0-\gamma_0}{\sigma/\sqrt n}\right]^2}_{\chi^2_1} + \underbrace{\left[\frac{\hat\beta_1-\beta_1}{\sigma/\sqrt{\sum(x_i-\bar x)^2}}\right]^2}_{\chi^2_1}$$

$$Q = Q_1 + Q_2 + Q_3$$
Let's use moment generating functions to find "?". Notice that $Q_1, Q_2, Q_3$ are independent (why?).

$$\implies Q_1 = \frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$$

Note: If $Y \sim \Gamma(\alpha,\beta)$ then

$$M_Y(t) = (1-\beta t)^{-\alpha} \quad\text{and}\quad M_{cY}(t) = M_Y(ct)$$

Let's now find the distribution of $S_e^2$:

$$S_e^2 = \frac{\sigma^2}{n-2}Q_1$$

$$M_{S_e^2}(t) = M_{\frac{\sigma^2}{n-2}Q_1}(t) = M_{Q_1}\left(\frac{\sigma^2}{n-2}t\right) = \left(1 - \frac{2\sigma^2}{n-2}t\right)^{-\frac{n-2}{2}}$$

Therefore,

$$S_e^2 \sim \Gamma\left(\frac{n-2}{2},\ \frac{2\sigma^2}{n-2}\right)$$

$$ES_e^2 = \sigma^2, \qquad \operatorname{var}(S_e^2) = \frac{2\sigma^4}{n-2}$$
§5 Lec 5: Oct 6, 2021
§5.1 Distribution Theory Using Non-Centered Model
Recall that we want to show $\frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$ using the non-centered model $Y_i = \beta_0 + \beta_1X_i + \varepsilon_i$ for $\varepsilon_1,\dots,\varepsilon_n \overset{\text{i.i.d.}}{\sim} N(0,\sigma)$. Then $Y_i \sim N(\beta_0+\beta_1X_i,\sigma)$. Let

$$M = \frac{\sum\left(Y_i - \beta_0 - \beta_1X_i \pm \hat\beta_0 \pm \hat\beta_1X_i\right)^2}{\sigma^2} \sim \chi^2_n$$

Then,

$$M = \frac{\sum\left[y_i - \hat\beta_0 - \hat\beta_1x_i + (\hat\beta_0-\beta_0) + (\hat\beta_1-\beta_1)x_i\right]^2}{\sigma^2}$$

$$= \frac{\sum e_i^2}{\sigma^2} + \frac{n(\hat\beta_0-\beta_0)^2}{\sigma^2} + \frac{(\hat\beta_1-\beta_1)^2\sum x_i^2}{\sigma^2} + \frac{2(\hat\beta_0-\beta_0)\sum e_i}{\sigma^2} + \frac{2(\hat\beta_1-\beta_1)\sum e_ix_i}{\sigma^2} + \frac{2(\hat\beta_0-\beta_0)(\hat\beta_1-\beta_1)\sum x_i}{\sigma^2}$$

$$= \underbrace{\frac{\sum e_i^2}{\sigma^2}}_{\frac{(n-2)S_e^2}{\sigma^2}} + \underbrace{\frac{n(\hat\beta_0-\beta_0)^2}{\sigma^2} + \frac{(\hat\beta_1-\beta_1)^2\sum x_i^2}{\sigma^2} + \frac{2(\hat\beta_0-\beta_0)(\hat\beta_1-\beta_1)\sum x_i}{\sigma^2}}_{?} \tag{**}$$

Also,

$$Y_i = \beta_0 + \beta_1X_i + \varepsilon_i, \qquad \bar Y = \frac{\sum Y_i}{n} = \beta_0 + \beta_1\bar X + \frac{\sum\varepsilon_i}{n}$$

So $\bar Y \sim N\left(\beta_0+\beta_1\bar X, \frac{\sigma}{\sqrt n}\right)$, and thus $\frac{\bar Y - (\beta_0+\beta_1\bar X)}{\sigma/\sqrt n} \sim N(0,1)$. It follows that each such standardized term squared follows a chi-square distribution with 1 degree of freedom. Now, we have

$$\operatorname{cov}(\bar Y,\hat\beta_1) = 0, \qquad \operatorname{cov}(\bar Y, e_i) = 0, \qquad \operatorname{cov}(\hat\beta_1, e_i) = 0$$
For $Q \sim \Gamma(\alpha,\beta)$,

$$EQ = \alpha\beta, \qquad \operatorname{var}(Q) = \alpha\beta^2, \qquad M_Q(t) = (1-\beta t)^{-\alpha}, \qquad EQ^k = \frac{\Gamma(\alpha+k)\beta^k}{\Gamma(\alpha)}$$

where

$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}\,dx$$

is the Gamma function.
Properties:

$$\Gamma(\alpha) = (\alpha-1)\Gamma(\alpha-1), \qquad \Gamma(\alpha+1) = \alpha\Gamma(\alpha)$$

If $\alpha$ is an integer, then $\Gamma(\alpha) = (\alpha-1)!$.

Recall that $S_e^2 \sim \Gamma\left(\frac{n-2}{2}, \frac{2\sigma^2}{n-2}\right)$, so

$$ES_e^2 = \sigma^2, \qquad \operatorname{var}(S_e^2) = \frac{2\sigma^4}{n-2}$$

Is $S_e$ an unbiased estimator of $\sigma$?

$$ES_e = E\left(S_e^2\right)^{\frac12} = \frac{\Gamma\left(\frac{n-2}{2}+\frac12\right)\left(\frac{2\sigma^2}{n-2}\right)^{\frac12}}{\Gamma\left(\frac{n-2}{2}\right)} = \sigma\sqrt{\frac{2}{n-2}}\,\Gamma\left(\frac{n-1}{2}\right)\Big/\Gamma\left(\frac{n-2}{2}\right) = \sigma A$$

Thus, it's biased, and we can adjust the result to be unbiased, i.e., use $\frac{S_e}{A}$.

If $Y \sim \chi^2_n$, then

$$M_Y(t) = (1-2t)^{-\frac n2}$$

which is $\Gamma\left(\frac n2, 2\right)$.
§5.3 Coefficient of Determination

Recall

$$\underbrace{\sum(y_i-\bar y)^2}_{\text{SST}} = \underbrace{\sum e_i^2}_{\text{SSE}} + \underbrace{\sum(\hat y_i-\bar y)^2}_{\text{SSR}}$$

$$R^2 = \frac{\text{SSR}}{\text{SST}} \quad\text{or}\quad R^2 = 1 - \frac{\text{SSE}}{\text{SST}}$$

and $0 \le R^2 \le 1$. We have

$$\operatorname{var}(\hat Y_i) = \operatorname{var}\left[\bar y + \hat\beta_1(x_i-\bar x)\right] = \sigma^2\left(\frac1n + \frac{(x_i-\bar x)^2}{\sum(x_i-\bar x)^2}\right)$$

Consider

$$e_i = y_i - \hat y_i = y_i - \bar y - \hat\beta_1(x_i-\bar x) = \sum a_ly_l - \frac{\sum y_l}{n} - (x_i-\bar x)\sum k_ly_l = \sum\left[a_l - \frac1n - (x_i-\bar x)k_l\right]y_l$$

where

$$a_l = \begin{cases}1, & \text{if } l = i\\ 0, & \text{otherwise}\end{cases}$$
§6 Lec 6: Oct 8, 2021
Example 6.1
Consider $\hat\beta_0$ and $\hat\beta_1$:

$$\operatorname{cov}(\hat\beta_0,\hat\beta_1) = \operatorname{cov}\left(\sum l_iY_i, \sum k_jY_j\right) = \sigma^2\sum l_ik_i = \sigma^2\sum\left(\frac1n - \bar xk_i\right)k_i = \frac{\sigma^2}{n}\sum k_i - \sigma^2\bar x\sum k_i^2 = -\frac{\sigma^2\bar x}{\sum(x_i-\bar x)^2}$$

Or

$$\operatorname{cov}\left(\hat\beta_0,\hat\beta_1\right) = \operatorname{cov}\left(\bar Y - \hat\beta_1\bar X, \hat\beta_1\right) = \operatorname{cov}\left(\bar Y,\hat\beta_1\right) - \bar X\operatorname{var}(\hat\beta_1) = \frac{-\bar x\sigma^2}{\sum(x_i-\bar x)^2}$$

Example 6.2
Consider $\hat Y_i$ and $\hat Y_j$:

$$\operatorname{cov}\left(\hat Y_i,\hat Y_j\right) = \operatorname{cov}\left[\bar y + \hat\beta_1(x_i-\bar x),\ \bar y + \hat\beta_1(x_j-\bar x)\right] = \frac{\sigma^2}{n} + 0 + 0 + \frac{(x_i-\bar x)(x_j-\bar x)}{\sum(x_i-\bar x)^2}\sigma^2 = \sigma^2\left[\frac1n + \frac{(x_i-\bar x)(x_j-\bar x)}{\sum(x_i-\bar x)^2}\right]$$

When $i = j$,

$$\operatorname{var}(\hat Y_i) = \sigma^2\left[\frac1n + \frac{(x_i-\bar x)^2}{\sum(x_i-\bar x)^2}\right]$$
§6.2 Inference
Construct a $1-\alpha$ confidence interval for $\beta_1$:

$$P(L \le \beta_1 \le U) = 1-\alpha$$

We know

$$\hat\beta_1 \sim N\left(\beta_1, \frac{\sigma}{\sqrt{\sum(x_i-\bar x)^2}}\right) \quad\text{and}\quad \frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$$

Consider $\operatorname{cov}(\hat\beta_1, e_i) = 0$. Under normality, since their covariance is 0, $\hat\beta_1$ and $S_e^2$ are independent. Thus,

$$\frac{\dfrac{\hat\beta_1-\beta_1}{\sigma/\sqrt{\sum(x_i-\bar x)^2}}}{\sqrt{\dfrac{(n-2)S_e^2}{\sigma^2}\Big/(n-2)}} = \frac{\hat\beta_1-\beta_1}{S_e/\sqrt{\sum(x_i-\bar x)^2}} \sim t_{n-2}$$

Pivot Method:

$$P\left(-t_{\frac\alpha2;n-2} \le \frac{\hat\beta_1-\beta_1}{S_e/\sqrt{\sum(x_i-\bar x)^2}} \le t_{\frac\alpha2;n-2}\right) = 1-\alpha$$

and after some manipulation we get

$$P\left(\hat\beta_1 - t_{\frac\alpha2;n-2}\cdot\frac{S_e}{\sqrt{\sum(x_i-\bar x)^2}} \le \beta_1 \le \hat\beta_1 + t_{\frac\alpha2;n-2}\cdot\frac{S_e}{\sqrt{\sum(x_i-\bar x)^2}}\right) = 1-\alpha$$
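A sketch of computing this interval on toy data. To keep the example dependency-free, the t quantile $t_{0.025;18} \approx 2.101$ is hard-coded rather than looked up:

```python
import numpy as np

# 95% confidence interval for the slope in simple regression (toy data).
rng = np.random.default_rng(5)
n = 20
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)

xb = x.mean()
Sxx = np.sum((x - xb) ** 2)
b1 = np.sum((x - xb) * y) / Sxx
b0 = y.mean() - b1 * xb
e = y - b0 - b1 * x
Se = np.sqrt(np.sum(e ** 2) / (n - 2))   # sqrt of unbiased variance estimate

t_crit = 2.101                           # t_{alpha/2; n-2} for alpha = 0.05, n = 20
half_width = t_crit * Se / np.sqrt(Sxx)
lower, upper = b1 - half_width, b1 + half_width
assert lower < b1 < upper
```

With `scipy` available, `t_crit` could instead be `scipy.stats.t.ppf(0.975, n - 2)`.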
For $\hat\beta_0$,

$$\hat\beta_0 \sim N\left(\beta_0,\ \sigma\sqrt{\frac1n + \frac{\bar x^2}{\sum(x_i-\bar x)^2}}\right)$$

and we proceed similarly to obtain

$$\beta_0 \in \hat\beta_0 \pm t_{\frac\alpha2;n-2}\cdot S_e\sqrt{\frac1n + \frac{\bar x^2}{\sum(x_i-\bar x)^2}}$$
§6.3 Prediction Interval

Prediction interval for $Y_0$ when $X = X_0$. Let's begin with the error of prediction, $Y_0 - \hat Y_0$. We know

• $Y_i = \beta_0 + \beta_1X_i + \varepsilon_i$
• $Y_0 = \beta_0 + \beta_1X_0 + \varepsilon_0$
• $\hat Y_0 = \hat\beta_0 + \hat\beta_1X_0$

So

$$E(Y_0-\hat Y_0) = 0$$

$$\operatorname{var}(Y_0-\hat Y_0) = \operatorname{var}(Y_0) + \operatorname{var}(\hat Y_0) - 2\operatorname{cov}(Y_0,\hat Y_0) = \sigma^2\left[1 + \frac1n + \frac{(x_0-\bar x)^2}{\sum(x_i-\bar x)^2}\right]$$

We apply the same procedure as in the inference section:

$$Y_0 - \hat Y_0 \sim N\left(0,\ \sigma\sqrt{1 + \frac1n + \frac{(x_0-\bar x)^2}{\sum(x_i-\bar x)^2}}\right), \qquad \frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$$

$$\implies Y_0 \in \hat Y_0 \pm t_{\frac\alpha2;n-2}\,S_e\sqrt{1 + \frac1n + \frac{(x_0-\bar x)^2}{\sum(x_i-\bar x)^2}}$$

C.I. for $EY_0$ for a given $X = X_0$:

$$\hat Y_0 \sim N\left(\beta_0+\beta_1X_0,\ \sigma\sqrt{\frac1n + \frac{(x_0-\bar x)^2}{\sum(x_i-\bar x)^2}}\right), \qquad \frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$$

$$\implies EY_0 \in \hat Y_0 \pm t_{\frac\alpha2;n-2}\cdot S_e\sqrt{\frac1n + \frac{(x_0-\bar x)^2}{\sum(x_i-\bar x)^2}}$$
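A sketch comparing the two intervals at the same $x_0$ on toy data (with $t_{0.025;18} \approx 2.101$ hard-coded to avoid a scipy dependency). The prediction interval is always wider because it carries the extra "1" for the new observation's own noise:

```python
import numpy as np

# Prediction interval for Y0 vs confidence interval for E[Y0] at x0.
rng = np.random.default_rng(6)
n = 20
x = rng.normal(size=n)
y = 2.0 + 0.8 * x + rng.normal(size=n)

xb = x.mean()
Sxx = np.sum((x - xb) ** 2)
b1 = np.sum((x - xb) * y) / Sxx
b0 = y.mean() - b1 * xb
Se = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x0 = 1.0
y0_hat = b0 + b1 * x0
t_crit = 2.101
leverage = 1 / n + (x0 - xb) ** 2 / Sxx
pi_half = t_crit * Se * np.sqrt(1 + leverage)   # half-width of prediction interval
ci_half = t_crit * Se * np.sqrt(leverage)       # half-width of C.I. for E[Y0]
assert pi_half > ci_half
```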
§7 Lec 7: Oct 11, 2021
§7.1 Hypothesis Testing

Consider the model:

$$Y_i = \beta_0 + \beta_1X_i + \varepsilon_i$$

Example 7.1
Hypothesis testing examples:

$H_0: \beta_1 = 0$, $H_a: \beta_1 \ne 0$
$H_0: \beta_1 = 1$, $H_a: \beta_1 \ne 1$
$H_0: \beta_0 = 0$, $H_a: \beta_0 \ne 0$
$H_0: \beta_0 + \beta_1 = 0$, $H_a: \beta_0 + \beta_1 \ne 0$
$H_0: \beta_0 = \beta_0^*$ and $\beta_1 = \beta_1^*$, $H_a$: not both true

For $H_0: \beta_1 = 0$ vs $H_a: \beta_1 \ne 0$, recall that under $H_0$,

$$\hat\beta_1 \sim N\left(0, \frac{\sigma}{\sqrt{\sum(x_i-\bar x)^2}}\right), \qquad \frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2} \implies t = \frac{\hat\beta_1}{S_e/\sqrt{\sum(x_i-\bar x)^2}} \sim t_{n-2}$$

Equivalently, use the confidence interval

$$\beta_1 \in \hat\beta_1 \pm t_{\frac\alpha2;n-2}\frac{S_e}{\sqrt{\sum(x_i-\bar x)^2}}$$

For example, if the interval is $-2 \le \beta_1 \le 2$ (which contains 0), we do not reject $H_0$.

$$p\text{-value} = 2P(t > |t^*|)$$
$$Z \sim N(0,1), \qquad U \sim \chi^2_n, \qquad Z, U \text{ independent} \implies \frac{Z}{\sqrt{U/n}} \sim t_n \implies \frac{Z^2/1}{U/n} \sim F_{1,n}$$

[Figure: density of $F_{1,n-2}$, with upper-tail area $\alpha$ to the right of $F_{1-\alpha;1,n-2}$.]

• Numerator:

$$E\left[\hat\beta_1^2\sum(x_i-\bar x)^2\right] = \sum(x_i-\bar x)^2\,E\hat\beta_1^2 = \sum(x_i-\bar x)^2\left[\operatorname{var}(\hat\beta_1) + (E\hat\beta_1)^2\right] = \sum(x_i-\bar x)^2\left[\frac{\sigma^2}{\sum(x_i-\bar x)^2} + \beta_1^2\right] = \sigma^2 + \beta_1^2\sum(x_i-\bar x)^2$$

Under $H_0$ the ratio is approximately equal to 1. If $H_0$ is not true, the ratio is greater than 1.
$$\frac{\hat\beta_1 - 1}{\sigma/\sqrt{\sum(x_i-\bar x)^2}} \sim N(0,1)$$

Test statistic:

$$\frac{\hat\beta_1 - 1}{S_e/\sqrt{\sum(x_i-\bar x)^2}} \sim t_{n-2}$$

Using the F statistic,

$$\frac{(\hat\beta_1-1)^2\sum(x_i-\bar x)^2}{\sigma^2} \sim \chi^2_1$$

and thus

$$\frac{(\hat\beta_1-1)^2\sum(x_i-\bar x)^2}{S_e^2} \sim F_{1,n-2}$$
§8 Lec 8: Oct 13, 2021
§8.1 Likelihood Ratio Test
Consider

$$Y_i = \beta_1X_i + \varepsilon_i, \qquad H_0: \beta_1 = 0, \quad H_a: \beta_1 \ne 0$$

We know

$$\hat\beta_1 \sim N\left(0, \frac{\sigma}{\sqrt{\sum x_i^2}}\right), \qquad \frac{(n-1)S_e^2}{\sigma^2} \sim \chi^2_{n-1}$$

So $t_{\text{test}}: \frac{\hat\beta_1}{S_e/\sqrt{\sum x_i^2}} \sim t_{n-1}$ and $F_{\text{test}}: \frac{\hat\beta_1^2\sum x_i^2}{S_e^2} \sim F_{1,n-1}$.

Likelihood Ratio Test (LRT):
For testing $H_0: \beta_1 = 0$ in the model $Y_i = \beta_0 + \beta_1X_i + \varepsilon_i$, show that the LRT is equivalent to the F statistic.
We reject $H_0$ if

$$\Lambda = \frac{L(\hat\omega)}{L(\hat\Omega)} < k$$

where $L(\hat\omega)$ is the maximized likelihood function under $H_0$ and $L(\hat\Omega)$ is the maximized likelihood function under no restrictions. Under $H_0: \beta_1 = 0$, we have $Y_i = \beta_0 + \varepsilon_i$. The likelihood function is

$$L = (2\pi\sigma^2)^{-\frac n2}e^{-\frac1{2\sigma^2}\sum(y_i-\beta_0)^2}, \qquad \ln L = -\frac n2\ln(2\pi\sigma^2) - \frac1{2\sigma^2}\sum(y_i-\beta_0)^2$$

$$\hat\beta_0 = \bar y, \qquad \hat\sigma_0^2 = \frac{\sum(y_i-\bar y)^2}{n}$$

Under no restrictions, the estimates are the MLEs of $\beta_0, \beta_1, \sigma^2$, which are $\hat\beta_0$, $\hat\beta_1$, and $\hat\sigma_1^2 = \frac{\sum e_i^2}{n}$. Back to the LRT, we have

$$\Lambda = \frac{L(\hat\omega)}{L(\hat\Omega)} = \frac{(2\pi\hat\sigma_0^2)^{-\frac n2}e^{-\frac1{2\hat\sigma_0^2}\sum(y_i-\bar y)^2}}{(2\pi\hat\sigma_1^2)^{-\frac n2}e^{-\frac1{2\hat\sigma_1^2}\sum e_i^2}} < k$$

Note:

$$\sum(y_i-\bar y)^2 = n\hat\sigma_0^2, \qquad \sum e_i^2 = n\hat\sigma_1^2$$
So,

$$\frac{(2\pi\hat\sigma_0^2)^{-\frac n2}e^{-\frac n2}}{(2\pi\hat\sigma_1^2)^{-\frac n2}e^{-\frac n2}} < k$$

$$\frac{\hat\sigma_1^2}{\hat\sigma_0^2} < k^{\frac2n}$$

$$\frac{\sum e_i^2/n}{\sum(y_i-\bar y)^2/n} < k^{\frac2n}$$

Notice that

$$\sum(y_i-\bar y)^2 = \sum e_i^2 + \sum(\hat y_i-\bar y)^2 = \sum e_i^2 + \hat\beta_1^2\sum(x_i-\bar x)^2$$

So,

$$\frac{\sum e_i^2}{\sum e_i^2 + \hat\beta_1^2\sum(x_i-\bar x)^2} < k^{\frac2n}$$

$$\frac{1}{1 + \frac{\hat\beta_1^2\sum(x_i-\bar x)^2}{\sum e_i^2}} < k^{\frac2n}$$

$$\frac{\hat\beta_1^2\sum(x_i-\bar x)^2}{(n-2)S_e^2} > k^{-\frac2n} - 1$$

$$\frac{\hat\beta_1^2\sum(x_i-\bar x)^2}{S_e^2} > (n-2)\left(k^{-\frac2n}-1\right) = k'$$

We reject $H_0$ if

$$\frac{\hat\beta_1^2\sum(x_i-\bar x)^2}{S_e^2} > k'$$

Recall we stated that we reject $H_0$ if $\Lambda = \frac{L(\hat\omega)}{L(\hat\Omega)} < k$. Let's find $k$. First, we need $\alpha$ (type I error). Before that, we know

$$\frac{\hat\beta_1^2\sum(x_i-\bar x)^2}{S_e^2} \sim F_{1,n-2}$$

So,

$$P\left(F_{1,n-2} > k' \,\middle|\, H_0 \text{ is true}\right) = \alpha$$
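A numerical illustration (toy data) that $\Lambda$ is a monotone decreasing function of the F statistic, which is exactly why "$\Lambda < k$" can be rewritten as "$F > k'$": the derivation above gives $\Lambda = \left(1 + \frac{F}{n-2}\right)^{-n/2}$.

```python
import numpy as np

# Verify Lambda = (1 + F/(n-2))^(-n/2) numerically for one data set.
rng = np.random.default_rng(7)
n = 25
x = rng.normal(size=n)
y = 1.0 + 0.7 * x + rng.normal(size=n)

xb, yb = x.mean(), y.mean()
Sxx = np.sum((x - xb) ** 2)
b1 = np.sum((x - xb) * y) / Sxx
e = y - yb - b1 * (x - xb)
sse = np.sum(e ** 2)
sst = np.sum((y - yb) ** 2)

Se2 = sse / (n - 2)
F = b1 ** 2 * Sxx / Se2
Lambda = (sse / sst) ** (n / 2)          # (sigma1_hat^2 / sigma0_hat^2)^(n/2)

assert np.isclose(Lambda, (1 + F / (n - 2)) ** (-n / 2))
```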
§8.2 Power Analysis in Simple Regression

Using the non-central t distribution.

Definition 8.1 (Non-central t) — Let $Z \sim N(\delta, 1)$ and $U \sim \chi^2_n$, with $Z$ and $U$ independent. Then,

$$\frac{Z}{\sqrt{U/n}} \sim t_n(\text{NCP} = \delta)$$
§9 Lec 9: Oct 15, 2021
$$Y_i = \beta_0 + \beta_1X_i + \varepsilon_i, \qquad H_0: \beta_1 = 0, \quad H_a: \beta_1 \ne 0$$

which can be tested with:

1. the t statistic
2. the F statistic
3. the likelihood ratio test

Under $H_0$: $Y_i = \beta_0 + \varepsilon_i \implies \hat\beta_0 = \bar y$.
Note:

$$\underbrace{\sum(y_i-\bar y)^2}_{\text{SST}} = \underbrace{\sum e_i^2}_{\text{SSE}} + \underbrace{\hat\beta_1^2\sum(x_i-\bar x)^2}_{\text{SSR}}$$
Example 9.1
Use the extra sum of squares method to test

$$H_0: \beta_1 = 1, \qquad H_a: \beta_1 \ne 1$$

Reduced model: $Y_i = \beta_0 + x_i + \varepsilon_i$, i.e. $Y_i - x_i = \beta_0 + \varepsilon_i$, so $\hat\beta_0 = \bar y - \bar x$.

$$\text{SSE}_R = \sum\left[y_i - x_i - (\bar y - \bar x)\right]^2 = \sum\left[(y_i-\bar y) - (x_i-\bar x)\right]^2 = \sum(y_i-\bar y)^2 + \sum(x_i-\bar x)^2 - 2\sum(x_i-\bar x)(y_i-\bar y) \tag{*}$$

Note:

$$\sum(y_i-\bar y)^2 = \sum e_i^2 + \hat\beta_1^2\sum(x_i-\bar x)^2$$

$$\hat\beta_1 = \frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2} \implies \sum(x_i-\bar x)(y_i-\bar y) = \hat\beta_1\sum(x_i-\bar x)^2$$

So, we have

$$(*) = \sum e_i^2 + \hat\beta_1^2\sum(x_i-\bar x)^2 + \sum(x_i-\bar x)^2 - 2\hat\beta_1\sum(x_i-\bar x)^2$$

$$\text{SSE}_R = \sum e_i^2 + (\hat\beta_1-1)^2\sum(x_i-\bar x)^2$$

Test statistic:

$$\frac{(\text{SSE}_R - \text{SSE}_F)/(df_R - df_F)}{\text{SSE}_F/df_F} = \frac{\left[\sum e_i^2 + (\hat\beta_1-1)^2\sum(x_i-\bar x)^2 - \sum e_i^2\right]\big/\left[n-1-(n-2)\right]}{\sum e_i^2/(n-2)} = \frac{(\hat\beta_1-1)^2\sum(x_i-\bar x)^2}{S_e^2} \sim F_{1,n-2}$$

Proof. Under $H_0$,

$$\hat\beta_1 \sim N\left(1, \frac{\sigma}{\sqrt{\sum(x_i-\bar x)^2}}\right), \qquad \frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$$

So,

$$\frac{\left[\dfrac{\hat\beta_1-1}{\sigma/\sqrt{\sum(x_i-\bar x)^2}}\right]^2\Big/1}{\dfrac{(n-2)S_e^2}{\sigma^2}\Big/(n-2)} = \frac{(\hat\beta_1-1)^2\sum(x_i-\bar x)^2}{S_e^2} \sim F_{1,n-2}$$
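A quick check on toy data that the extra-sum-of-squares form of the F statistic equals the direct form $(\hat\beta_1-1)^2\sum(x_i-\bar x)^2/S_e^2$:

```python
import numpy as np

# Extra sum of squares F statistic for H0: beta1 = 1, two equivalent ways.
rng = np.random.default_rng(8)
n = 30
x = rng.normal(size=n)
y = 0.3 + 1.2 * x + rng.normal(size=n)

xb, yb = x.mean(), y.mean()
Sxx = np.sum((x - xb) ** 2)
b1 = np.sum((x - xb) * y) / Sxx
e = y - yb - b1 * (x - xb)
sse_full = np.sum(e ** 2)                              # df = n - 2

# Reduced model Y_i - x_i = beta0 + eps_i: beta0_hat = ybar - xbar
sse_reduced = np.sum((y - x - (yb - x.mean())) ** 2)   # df = n - 1

F_extra = (sse_reduced - sse_full) / 1 / (sse_full / (n - 2))
F_direct = (b1 - 1) ** 2 * Sxx / (sse_full / (n - 2))
assert np.isclose(F_extra, F_direct)
```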
§9.2 Power Analysis Using Non-Central F Distribution

2. Suppose $Y \sim N(\mu,\sigma)$. Then

$$\frac Y\sigma \sim N\left(\frac\mu\sigma, 1\right), \qquad \frac{Y^2}{\sigma^2} \sim \chi^2_1\left(\theta = \frac{\mu^2}{\sigma^2}\right)$$

$$\frac{U/n}{V/m} \sim F_{n,m}(\text{NCP} = \theta)$$

Thus,

$$1-\beta = P\left(F_{1,n-2}(\theta) > F_{1-\alpha;1,n-2}\right)$$
§10 Lec 10: Oct 18, 2021
Consider the model $Y = X\beta + \varepsilon$, where

$Y$: $n\times1$ response vector
$X$: $n\times(k+1)$ regression matrix
$\beta$: $(k+1)\times1$ parameter vector
$\varepsilon$: $n\times1$ error vector

Let $Y = (Y_1,\dots,Y_n)'$ be a random vector with mean vector

$$\mu = E[Y] = E\begin{pmatrix}Y_1\\\vdots\\Y_n\end{pmatrix} = \begin{pmatrix}EY_1\\\vdots\\EY_n\end{pmatrix} = \begin{pmatrix}\mu_1\\\vdots\\\mu_n\end{pmatrix}$$
Properties: Let $a = (a_1,\dots,a_n)'$ be a vector of constants and let $a'Y$ be a linear combination of $Y$. Then

$$E[a'Y] = a'EY = a'\mu = \sum a_i\mu_i, \qquad \operatorname{var}(a'Y) = a'\Sigma a$$

$$E[AY] = AEY = A\mu, \qquad \operatorname{var}(AY) = A\Sigma A'$$

For the regression model,

$$EY = E[X\beta + \varepsilon] = X\beta, \qquad \operatorname{var}(Y) = \operatorname{var}(X\beta + \varepsilon) = \sigma^2I$$

$$X'(Y - X\beta) = 0 \implies X'Y - X'X\beta = 0 \implies X'X\beta = X'Y \implies \hat\beta = (X'X)^{-1}X'Y$$

or $\min Q = \varepsilon'\varepsilon$ with $Y = X\beta + \varepsilon$, i.e.

$$\min Q = (Y - X\beta)'(Y - X\beta)$$

Then,

$$Q = Y'Y - Y'X\beta - \beta'X'Y + \beta'X'X\beta = Y'Y - 2Y'X\beta + \beta'X'X\beta$$

$$\frac{\partial Q}{\partial\beta} = 0 \tag{*}$$

Note: matrix and vector differentiation. Let $\theta = (\theta_1,\dots,\theta_p)'$ and let $g(\theta)$ be a function of $\theta$. Then

$$\frac{\partial g(\theta)}{\partial\theta} = \begin{pmatrix}\frac{\partial g(\theta)}{\partial\theta_1}\\\vdots\\\frac{\partial g(\theta)}{\partial\theta_p}\end{pmatrix}$$

and for $g(\theta) = \theta'A\theta$ with $A$ symmetric,

$$\frac{\partial g(\theta)}{\partial\theta} = 2A\theta$$

Applying these results to (*), we obtain

$$-2X'Y + 2X'X\beta = 0 \implies X'X\beta = X'Y$$
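A sketch of the normal equations in matrix form on a toy design: solving $X'X\beta = X'Y$ reproduces numpy's QR-based least squares solution.

```python
import numpy as np

# Normal equations vs numpy's built-in least squares.
rng = np.random.default_rng(9)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
Y = X @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)  # QR-based least squares
assert np.allclose(beta_hat, beta_lstsq)
```

In practice `lstsq` (or a QR/SVD solve) is numerically preferable to forming $X'X$ explicitly.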
§11 Lec 11: Oct 20, 2021
§11.1 Multiple Regression (Cont'd)
Recall that

$$Y = X\beta + \varepsilon, \qquad E[\varepsilon] = 0, \qquad \operatorname{var}(\varepsilon) = \sigma^2I$$

Least squares:

$$\min\sum\varepsilon_i^2 = \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta)$$

Normal equations:

$$X'X\beta = X'Y \implies \hat\beta = (X'X)^{-1}X'Y$$

Note that $X$ is not a square matrix, so $X'X$ has to go together in order for it to be invertible.

$$X = (\mathbf 1, x_1, x_2, \dots, x_k)$$

$$X'X = \begin{pmatrix}\mathbf 1'\\x_1'\\\vdots\\x_k'\end{pmatrix}(\mathbf 1, x_1, \dots, x_k) = \begin{pmatrix}n & \mathbf 1'x_1 & \mathbf 1'x_2 & \dots & \mathbf 1'x_k\\ x_1'\mathbf 1 & x_1'x_1 & x_1'x_2 & \dots & x_1'x_k\\ \vdots & \vdots & \vdots & & \vdots\\ x_k'\mathbf 1 & x_k'x_1 & x_k'x_2 & \dots & x_k'x_k\end{pmatrix}$$

Partition $X$ and $\beta$:

$$X = \begin{pmatrix}\mathbf 1 & X_{(0)}\end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_0\\\beta_{(0)}\end{pmatrix}$$

Model:

$$Y = \begin{pmatrix}\mathbf 1 & X_{(0)}\end{pmatrix}\begin{pmatrix}\beta_0\\\beta_{(0)}\end{pmatrix} + \varepsilon = \beta_0\mathbf 1 + X_{(0)}\beta_{(0)} + \varepsilon$$

Then,

$$X'X = \begin{pmatrix}\mathbf 1'\\X_{(0)}'\end{pmatrix}\begin{pmatrix}\mathbf 1 & X_{(0)}\end{pmatrix} = \begin{pmatrix}n & \mathbf 1'X_{(0)}\\ X_{(0)}'\mathbf 1 & X_{(0)}'X_{(0)}\end{pmatrix}$$

So

$$\hat\beta = \begin{pmatrix}\hat\beta_0\\\vdots\\\hat\beta_k\end{pmatrix} = \begin{pmatrix}\hat\beta_0\\\hat\beta_{(0)}\end{pmatrix} = \begin{pmatrix}n & \mathbf 1'X_{(0)}\\ X_{(0)}'\mathbf 1 & X_{(0)}'X_{(0)}\end{pmatrix}^{-1}\begin{pmatrix}\mathbf 1'Y\\X_{(0)}'Y\end{pmatrix}$$
Also,

$$X'Y = \begin{pmatrix}\mathbf 1'\\X_{(0)}'\end{pmatrix}Y = \begin{pmatrix}\mathbf 1'Y\\X_{(0)}'Y\end{pmatrix}$$

Fitted values: $\hat Y = X\hat\beta = X(X'X)^{-1}X'Y = HY$, where $H = X(X'X)^{-1}X'$ is the hat matrix. Properties:

2. $HH = H$ -- idempotent:

$$X(X'X)^{-1}X'X(X'X)^{-1}X' = H$$

3. $\operatorname{tr}H = \operatorname{tr}\left[X(X'X)^{-1}X'\right] = \operatorname{tr}\left[(X'X)^{-1}X'X\right] = \operatorname{tr}I_{k+1} = k+1$. Notice that the property of trace is

$$\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB) \ne \operatorname{tr}(BAC)$$

4. $HX = X$, or $H(\mathbf 1, x_1, \dots, x_k) = (\mathbf 1, x_1, \dots, x_k)$

Residuals:

$$e_i = y_i - \hat y_i, \quad i = 1,\dots,n, \qquad e = Y - \hat Y = Y - X\hat\beta = Y - HY$$

$$e = (I-H)Y = (I-H)(X\beta + \varepsilon) = (I-H)X\beta + (I-H)\varepsilon = (I-H)\varepsilon$$

So $e = (I-H)Y$ and $e = (I-H)\varepsilon$, and

$$\text{SSE} = \left[(I-H)\varepsilon\right]'\left[(I-H)\varepsilon\right] = \varepsilon'(I-H)\varepsilon$$
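A direct numerical check of the hat matrix properties listed above, on a toy design:

```python
import numpy as np

# H is symmetric, idempotent, has trace k + 1, and fixes the column space of X.
rng = np.random.default_rng(10)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(H, H.T)                 # symmetric
assert np.allclose(H @ H, H)               # idempotent
assert np.isclose(np.trace(H), k + 1)      # trace = k + 1
assert np.allclose(H @ X, X)               # HX = X
```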
Properties of $\hat\beta$:

$$E\hat\beta = E\left[(X'X)^{-1}X'Y\right] = (X'X)^{-1}X'\underbrace{EY}_{X\beta} = \beta$$

which is unbiased.

$$\operatorname{var}(\hat\beta) = \operatorname{var}\left[(X'X)^{-1}X'Y\right] = (X'X)^{-1}X'\,\sigma^2I\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}$$

$$\operatorname{var}(\hat\beta_1) = \sigma^2v_{11}, \qquad \operatorname{cov}\left(\hat\beta_1,\hat\beta_2\right) = \sigma^2v_{12}$$

where

$$(X'X)^{-1} = \{v_{ij}\}_{i,j=0,\dots,k}$$
§12 Lec 12: Oct 22, 2021
§12.1 Gauss-Markov Theorem in Multiple Regression
Let $\hat\beta = (X'X)^{-1}X'Y$ be the least squares estimator of $\beta$ and let $b = M^*Y$ be an unbiased estimator of $\beta$ (not the least squares). Let's define $M^* = M + (X'X)^{-1}X'$.

$b$ is unbiased: $Eb = \beta$ because $EM^*Y = \beta$, or

$$E\left[M + (X'X)^{-1}X'\right]Y = \beta$$
$$\left[M + (X'X)^{-1}X'\right]X\beta = \beta$$
$$MX\beta + \beta = \beta$$
$$MX = 0$$

Check $\operatorname{var}(b)$:

$$\operatorname{var}(b) = \operatorname{var}(M^*Y) = \operatorname{var}\left[\left(M + (X'X)^{-1}X'\right)Y\right]$$

Note: $\operatorname{var}(AY) = A\Sigma A'$, where here $\operatorname{var}(Y) = \sigma^2I$. Then,

$$\operatorname{var}(b) = \sigma^2\left[M + (X'X)^{-1}X'\right]\left[M' + X(X'X)^{-1}\right]$$
$$= \sigma^2MM' + \sigma^2MX(X'X)^{-1} + \sigma^2(X'X)^{-1}X'M' + \sigma^2(X'X)^{-1}X'X(X'X)^{-1}$$
$$= \sigma^2MM' + \sigma^2(X'X)^{-1} = \sigma^2MM' + \operatorname{var}(\hat\beta)$$

using $MX = 0$ (and hence $X'M' = 0$).

Aside note: for a covariance matrix $\Sigma$, $\operatorname{var}(a'Y) = a'\Sigma a \ge 0$.

Now, let $a$ be a non-zero vector:

$$a'MM'a = (M'a)'(M'a) = q'q = \sum q_i^2 \ge 0$$

Therefore $MM'$ is positive semi-definite, and thus $\operatorname{var}(b) \ge \operatorname{var}(\hat\beta)$.
§12.2 Gauss-Markov Theorem For a Linear Combination

We have

$$\operatorname{var}\left(a'\hat\beta\right) = a'\operatorname{var}(\hat\beta)a = \sigma^2a'(X'X)^{-1}a$$

or, written out for three coefficients,

$$\operatorname{var}\left(a_0\hat\beta_0 + a_1\hat\beta_1 + a_2\hat\beta_2\right) = a_0^2\operatorname{var}(\hat\beta_0) + a_1^2\operatorname{var}(\hat\beta_1) + a_2^2\operatorname{var}(\hat\beta_2) + 2a_0a_1\operatorname{cov}(\hat\beta_0,\hat\beta_1) + 2a_0a_2\operatorname{cov}(\hat\beta_0,\hat\beta_2) + 2a_1a_2\operatorname{cov}(\hat\beta_1,\hat\beta_2)$$

§12.3 Review of Multivariate Normal Distribution

Normality assumption: $\varepsilon_1,\dots,\varepsilon_n \overset{\text{i.i.d.}}{\sim} N(0,\sigma)$, i.e.

$$\varepsilon \sim N_n\left(0, \sigma^2I\right)$$

Let $Y \sim N_n(\mu,\Sigma)$:

$$f(y) = \frac{1}{(2\pi)^{\frac n2}}|\Sigma|^{-\frac12}e^{-\frac12(y-\mu)'\Sigma^{-1}(y-\mu)}$$

Consider

$$f(\varepsilon) = f(\varepsilon_1)\cdot f(\varepsilon_2)\cdots f(\varepsilon_n)$$

With $f(\varepsilon_i) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2\sigma^2}\varepsilon_i^2}$, this gives

$$f(\varepsilon) = (2\pi\sigma^2)^{-\frac n2}e^{-\frac{1}{2\sigma^2}\varepsilon'\varepsilon} \implies \varepsilon \sim N_n(0,\sigma^2I)$$

Joint MGF: Let $Y \sim N_n(\mu,\Sigma)$. Then

$$M_Y(t) = Ee^{t'Y} = e^{t'\mu + \frac12t'\Sigma t}$$

where $t = (t_1,\dots,t_n)'$.
Theorem 12.1
Let $Y \sim N_n(\mu,\Sigma)$, let $A$ be an $m\times n$ matrix of constants, and let $c$ be an $m\times1$ vector of constants. Using the joint mgf,

$$AY \sim N_m(A\mu, A\Sigma A')$$
$$AY + c \sim N_m(A\mu + c, A\Sigma A')$$

Notice that

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0,\sigma^2I) \implies Y \sim N_n\left(X\beta, \sigma^2I\right)$$

$$EY = X\beta, \qquad \operatorname{var}(Y) = \sigma^2I$$
§13 Lec 13: Oct 25, 2021
§13.1 Theorems in Multivariate Normal Distribution
Consider: $Y \sim N_n(\mu,\Sigma)$

$$f(y) = \frac{1}{(2\pi)^{\frac n2}}|\Sigma|^{-\frac12}e^{-\frac12(y-\mu)'\Sigma^{-1}(y-\mu)}, \qquad M_Y(t) = Ee^{t'Y} = e^{t'\mu + \frac12t'\Sigma t}$$

Proof. Let $Z \sim N_n(0,I)$ and $Y = \Sigma^{\frac12}Z + \mu$. The spectral decomposition of $\Sigma$ is

$$\Sigma = P\Lambda P', \qquad \Lambda = \begin{pmatrix}\lambda_1 & & 0\\ & \ddots & \\ 0 & & \lambda_n\end{pmatrix}, \qquad \Sigma^{\frac12} = P\Lambda^{\frac12}P'$$

So,

$$M_{Z_i}(t_i) = Ee^{t_iz_i} = e^{\frac12t_i^2}$$

$$M_Z(t) = Ee^{t'Z} = Ee^{t_1z_1 + \ldots + t_nz_n} = Ee^{t_1z_1}\cdot Ee^{t_2z_2}\cdots Ee^{t_nz_n} = e^{\frac12t't}$$

$$M_Y(t) = M_{\Sigma^{1/2}Z+\mu}(t) = Ee^{t'\left(\Sigma^{1/2}Z+\mu\right)} = e^{t'\mu}Ee^{\left(\Sigma^{1/2}t\right)'Z}$$

Let $t^* = \Sigma^{\frac12}t$. Then

$$M_Y(t) = e^{t'\mu}Ee^{t^{*\prime}Z} = e^{t'\mu}M_Z(t^*) = e^{t'\mu}e^{\frac12t^{*\prime}t^*} = e^{t'\mu + \frac12t'\Sigma t}$$

Theorem 13.1
Let $A$ be an $m\times n$ matrix of constants and $C$ an $m\times1$ vector of constants. Then

$$AY + C \sim N_m(A\mu + C, A\Sigma A'), \qquad AY \sim N_m(A\mu, A\Sigma A')$$

Proof. We have

$$M_{AY+C}(t) = Ee^{t'(AY+C)} = e^{t'C}\cdot Ee^{(A't)'Y}$$
Let $t^* = A't$. Then

$$M_{AY+C}(t) = e^{t'C}\cdot M_Y(t^*) = e^{t'C}\cdot e^{t^{*\prime}\mu + \frac12t^{*\prime}\Sigma t^*} = e^{t'(A\mu+C) + \frac12t'A\Sigma A't}$$

Theorem 13.2
Let $Y \sim N_n(\mu,\Sigma)$ with the partition

$$Y = \begin{pmatrix}Q_1\\Q_2\end{pmatrix}, \qquad \mu = \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \qquad \Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}$$

(for example, $Y = (Y_1,\dots,Y_5)'$ split into its first $p$ and last $n-p$ components). Then,

$$Q_1 \sim N_p(\mu_1, \Sigma_{11}), \qquad Q_2 \sim N_{n-p}(\mu_2, \Sigma_{22})$$

With $A = (I\ \ 0)$,

$$EQ_1 = EAY = (I\ \ 0)\begin{pmatrix}\mu_1\\\mu_2\end{pmatrix} = \mu_1$$

$$\operatorname{var}(Q_1) = \operatorname{var}(AY) = A\Sigma A' = (I\ \ 0)\begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\begin{pmatrix}I\\0\end{pmatrix} = \Sigma_{11}$$

If $A = a'$ (a row vector), then $a'Y \sim N\left(a'\mu, \sqrt{a'\Sigma a}\right)$.
Theorem 13.3
Independence for $Y = \begin{pmatrix}Q_1\\Q_2\end{pmatrix}$:

$$Y \sim N(\mu,\Sigma), \qquad \mu = \begin{pmatrix}\mu_1\\\mu_2\end{pmatrix}, \qquad \Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}$$

If $\Sigma_{12} = 0$, then

$$M_Y(t) = e^{t_1'\mu_1 + \frac12t_1'\Sigma_{11}t_1}\,e^{t_2'\mu_2 + \frac12t_2'\Sigma_{22}t_2} = M_{Q_1}(t_1)\cdot M_{Q_2}(t_2)$$

so $Q_1$ and $Q_2$ are independent.

Theorem 13.4
We have

$$Q_1 \mid Q_2 \sim N_p\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(Q_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$$
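A tiny numeric illustration of Theorem 13.4 with an assumed $2\times2$ partition ($p = 1$) and made-up $\mu$, $\Sigma$, and conditioning value:

```python
import numpy as np

# Conditional mean and variance of Q1 given Q2 = q2 for a bivariate normal.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
q2 = 3.0

cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (q2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
assert np.isclose(cond_mean, 1.8)    # 1 + 0.8 * (3 - 2)
assert np.isclose(cond_var, 1.36)    # 2 - 0.8^2 / 1
```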
$$e'e = Y'(I-H)Y = \varepsilon'(I-H)\varepsilon$$

$$\hat\sigma^2 = \frac{e'e}{n} = \frac{Y'(I-H)Y}{n} = \frac{\varepsilon'(I-H)\varepsilon}{n}$$
§14 Lec 14: Oct 27, 2021
§14.1 Mean and Variance in Multivariate Normal Distribution
Consider

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim N_n\left(0, \sigma^2I\right) \implies Y \sim N_n\left(X\beta, \sigma^2I\right)$$

The joint pdf of $Y$ is

$$f(y) = \left(2\pi\sigma^2\right)^{-\frac n2}e^{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)}$$

Using the method of maximum likelihood, we obtain the MLEs of $\beta$ and $\sigma^2$:

$$\hat\beta = (X'X)^{-1}X'Y$$

which is the same as the least squares estimator, and

$$\hat\sigma^2 = \frac{\left(y - X\hat\beta\right)'\left(y - X\hat\beta\right)}{n} = \frac{e'e}{n}$$

Note that $e = (I-H)Y$ or $e = (I-H)\varepsilon$. Therefore,

$$e'e = Y'(I-H)Y \quad\text{or}\quad e'e = \varepsilon'(I-H)\varepsilon$$

So

$$E\hat\sigma^2 = \frac1nEe'e = \frac1nE\underbrace{\left[\varepsilon'(I-H)\varepsilon\right]}_{\text{scalar}} = \frac1nE\left[\operatorname{tr}\left((I-H)\varepsilon\varepsilon'\right)\right] = \frac1n\operatorname{tr}\left[(I-H)E(\varepsilon\varepsilon')\right]$$

Note:

$$\Sigma = E\left[(Y-\mu)(Y-\mu)'\right], \qquad E[YY'] = \Sigma + \mu\mu'$$

where $E(\varepsilon) = 0$ and $\operatorname{var}(\varepsilon) = \sigma^2I$. Then,

$$E\hat\sigma^2 = \frac1n\operatorname{tr}\left[(I-H)\left(\sigma^2I + 00'\right)\right] = \frac1n\operatorname{tr}\left[(I-H)\sigma^2I\right] = \frac{\sigma^2}{n}\operatorname{tr}(I-H)$$
$$\operatorname{tr}(I-H) = \operatorname{tr}(I) - \operatorname{tr}(H) = \operatorname{tr}(I) - \operatorname{tr}\left[X(X'X)^{-1}X'\right] = \operatorname{tr}(I) - \operatorname{tr}\left[(X'X)^{-1}X'X\right] = \operatorname{tr}(I_n) - \operatorname{tr}(I_{k+1}) = n-k-1$$

So,

$$E\hat\sigma^2 = \frac{n-k-1}{n}\sigma^2$$

which is biased. Therefore, the unbiased estimator of $\sigma^2$ is

$$S_e^2 = \frac{n}{n-k-1}\hat\sigma^2 = \frac{n}{n-k-1}\cdot\frac{e'e}{n} = \frac{e'e}{n-k-1}$$

In simple regression ($k = 1$, one predictor),

$$S_e^2 = \frac{e'e}{n-2} = \frac{\sum e_i^2}{n-2}$$

For the fitted values $\hat Y = HY$:

$$E\hat Y = HEY = HX\beta = X\beta \qquad (\text{Note: } HX = X)$$

$$\operatorname{var}\left(\hat Y\right) = \operatorname{var}(HY) = \sigma^2H$$

For $e$:

$$Ee = E[(I-H)Y] = E[Y - HY] = X\beta - X\beta = 0$$

$$\operatorname{var}(e) = \operatorname{var}[(I-H)Y] = \sigma^2(I-H)$$
§14.2 Independent Vectors in Multiple Regression

If $Y \sim N_n(\mu,\Sigma)$, then $AY$ and $BY$ are independent iff $A\Sigma B' = 0$.

Or use

$$\begin{pmatrix}\hat Y\\e\end{pmatrix} = \begin{pmatrix}HY\\(I-H)Y\end{pmatrix} = \begin{pmatrix}H\\I-H\end{pmatrix}Y = AY$$

with $Y \sim N_n\left(X\beta, \sigma^2I\right)$. Since $H\,\sigma^2I\,(I-H)' = \sigma^2(H - H) = 0$, $\hat Y$ and $e$ are independent. Similarly, we can show that $\hat\beta$ and $e$ are independent.
§14.3 Partial Regression

Consider

$$X = \begin{pmatrix}X_1 & X_2\end{pmatrix}$$

with the following three models:

$$Y = X_1\beta_1 + \varepsilon \implies \hat\beta_1 = (X_1'X_1)^{-1}X_1'Y$$
$$Y = X_2\beta_2 + \varepsilon \implies \hat\beta_2 = (X_2'X_2)^{-1}X_2'Y$$

and

$$Y = X\beta + \varepsilon \quad\text{or}\quad Y = X_1\beta_1 + X_2\beta_2 + \varepsilon$$
§15 Lec 15: Oct 29, 2021
§15.1 Partial Regression (Cont'd)
Normal equation:

$$X'X\hat\beta = X'Y$$

using

$$X' = \begin{pmatrix}X_1'\\X_2'\end{pmatrix} \quad\text{and}\quad \hat\beta = \begin{pmatrix}\hat\beta_{12}\\\hat\beta_{21}\end{pmatrix}$$

Then

$$X'Y = \begin{pmatrix}X_1'\\X_2'\end{pmatrix}Y = \begin{pmatrix}X_1'Y\\X_2'Y\end{pmatrix}$$

and the normal equations are

$$\begin{pmatrix}X_1'X_1 & X_1'X_2\\ X_2'X_1 & X_2'X_2\end{pmatrix}\begin{pmatrix}\hat\beta_{12}\\\hat\beta_{21}\end{pmatrix} = \begin{pmatrix}X_1'Y\\X_2'Y\end{pmatrix}$$

From the first block equation,

$$X_1'X_1\hat\beta_{12} = X_1'Y - X_1'X_2\hat\beta_{21}$$

So,

$$\hat\beta_{12} = \underbrace{(X_1'X_1)^{-1}X_1'Y}_{\hat\beta_1} - (X_1'X_1)^{-1}X_1'X_2\hat\beta_{21} \tag{3}$$

Then, substituting (3) into the second block equation,

$$X_2'X_2\hat\beta_{21} - X_2'X_1(X_1'X_1)^{-1}X_1'X_2\hat\beta_{21} = X_2'Y - X_2'X_1(X_1'X_1)^{-1}X_1'Y$$

$$X_2'\left[I - X_1(X_1'X_1)^{-1}X_1'\right]X_2\hat\beta_{21} = X_2'\left[I - X_1(X_1'X_1)^{-1}X_1'\right]Y$$

$$X_2'(I - H_1)X_2\hat\beta_{21} = X_2'(I - H_1)Y$$

$$X_2'(I - H_1)(I - H_1)X_2\hat\beta_{21} = X_2'(I - H_1)(I - H_1)Y$$

$$\left[(I - H_1)X_2\right]'\left[(I - H_1)X_2\right]\hat\beta_{21} = \left[(I - H_1)X_2\right]'\left[(I - H_1)Y\right]$$

Note:

$$(I - H_1)Y = Y^*, \qquad (I - H_1)X_2 = X_2^*$$
So,

$$X_2^{*\prime}X_2^*\hat\beta_{21} = X_2^{*\prime}Y^*$$

and thus

$$\hat\beta_{21} = \left(X_2^{*\prime}X_2^*\right)^{-1}X_2^{*\prime}Y^*$$

Special Case 1:

$$X = \begin{pmatrix}\mathbf 1 & X_{(0)}\end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_0\\\beta_{(0)}\end{pmatrix}$$

Here $X_1 = \mathbf 1$, so $H_1 = \frac1n\mathbf 1\mathbf 1'$ and

$$X_{(0)}^* = (I - H_1)X_{(0)} = \left(I - \frac1n\mathbf 1\mathbf 1'\right)X_{(0)} = \left[\left(I - \frac1n\mathbf 1\mathbf 1'\right)x_1,\ \dots,\ \left(I - \frac1n\mathbf 1\mathbf 1'\right)x_k\right] = \begin{pmatrix}x_{11}-\bar x_1 & \dots & x_{1k}-\bar x_k\\ x_{21}-\bar x_1 & \dots & x_{2k}-\bar x_k\\ \vdots & & \vdots\\ x_{n1}-\bar x_1 & \dots & x_{nk}-\bar x_k\end{pmatrix}$$

Finally, to estimate the vector of the slopes $\beta_{(0)} = (\beta_1,\dots,\beta_k)'$, we regress

$$\begin{pmatrix}y_1-\bar y\\\vdots\\y_n-\bar y\end{pmatrix} \quad\text{on}\quad \begin{pmatrix}x_{11}-\bar x_1 & \dots & x_{1k}-\bar x_k\\ \vdots & & \vdots\\ x_{n1}-\bar x_1 & \dots & x_{nk}-\bar x_k\end{pmatrix}$$
to get $\hat\beta_{(0)} = \left(X_{(0)}^{*\prime}X_{(0)}^*\right)^{-1}X_{(0)}^{*\prime}Y^*$ where

$$X_{(0)}^* = \left(I - \frac1n\mathbf 1\mathbf 1'\right)X_{(0)}, \qquad Y^* = \left(I - \frac1n\mathbf 1\mathbf 1'\right)Y$$
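A numerical check of Special Case 1 (the Frisch-Waugh idea): regressing the centered response on the centered predictors reproduces the slope part of the full fit with intercept.

```python
import numpy as np

# Centered regression recovers the slopes of the full model with intercept.
rng = np.random.default_rng(11)
n, k = 50, 3
X0 = rng.normal(size=(n, k))
y = 2.0 + X0 @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# Full fit with intercept
X = np.column_stack([np.ones(n), X0])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

# Centered fit: slopes only
X0c = X0 - X0.mean(axis=0)
yc = y - y.mean()
slopes = np.linalg.solve(X0c.T @ X0c, X0c.T @ yc)
assert np.allclose(beta_full[1:], slopes)
```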
§16 Lec 16: Nov 1, 2021
§16.1 Partial Regression (Cont'd)
Consider:

$$Y = X\beta + \varepsilon \implies \hat\beta = (X'X)^{-1}X'Y$$

The partial regression of $Y^*$ on $X_2^*$ gives

$$\hat\beta_{21} = \left(X_2^{*\prime}X_2^*\right)^{-1}X_2^{*\prime}Y^*$$

i.e., the model $Y^* = X_2^*\beta_2 + \varepsilon$.

Special Case 2: Begin with $Y = X\beta + \varepsilon$ with $k$ predictors. Then, we add an extra predictor $Z$. The two models are

$$Y = X\beta + \varepsilon \tag{1}$$
$$Y = X\delta + cZ + \varepsilon \tag{2}$$

(writing $\delta$ for the coefficient of $X$ in the extended model, since it need not equal $\beta$). In matrix form,

$$Y = \begin{pmatrix}X & Z\end{pmatrix}\begin{pmatrix}\delta\\c\end{pmatrix} + \varepsilon = w\begin{pmatrix}\delta\\c\end{pmatrix} + \varepsilon = w\eta + \varepsilon$$

Normal equations:

$$w'w\hat\eta = w'Y$$
From the first block of the normal equations,

$$\hat\delta = (X'X)^{-1}\left[X'Y - X'Z\hat c\right] \quad\text{or}\quad \hat\delta = \hat\beta - (X'X)^{-1}X'Z\hat c$$

Now back to the residual vector of the extended model, $u = Y - X\hat\delta - Z\hat c$:

$$u = Y - X\hat\beta + X(X'X)^{-1}X'Z\hat c - Z\hat c = e - \left[I - X(X'X)^{-1}X'\right]Z\hat c = e - (I - H)Z\hat c = e - Z^*\hat c$$

The SSE is

$$\text{SSE}_{XZ} = u'u = (e - Z^*\hat c)'(e - Z^*\hat c) = e'e - 2Z^{*\prime}e\,\hat c + Z^{*\prime}Z^*\hat c^2 = e'e - 2\hat c^2Z^{*\prime}Z^* + Z^{*\prime}Z^*\hat c^2 = e'e - Z^{*\prime}Z^*\hat c^2$$

Thus, we can conclude that adding a new predictor can never increase the SSE, i.e., $u'u \le e'e$. Note that the new $R^2$ is

$$R_{XZ}^2 = 1 - \frac{u'u}{\text{SST}} = 1 - \frac{e'e}{\text{SST}} + \frac{Z^{*\prime}Z^*\hat c^2}{\text{SST}} = R_X^2 + \frac{Z^{*\prime}Z^*\hat c^2}{\text{SST}}$$

So $R_{XZ}^2 \ge R_X^2$.
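An illustration on toy data that adding a predictor never increases the SSE (so $R^2$ never decreases), even when the new predictor is pure noise:

```python
import numpy as np

# R^2 is non-decreasing when a predictor is added, even a useless one.
rng = np.random.default_rng(12)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
Z = rng.normal(size=n)                 # unrelated noise predictor

def sse(design, y):
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    r = y - design @ b
    return r @ r

sse_X = sse(X, y)
sse_XZ = sse(np.column_stack([X, Z]), y)
sst = np.sum((y - y.mean()) ** 2)
assert sse_XZ <= sse_X
assert 1 - sse_XZ / sst >= 1 - sse_X / sst     # R^2_{XZ} >= R^2_X
```

This is why adjusted $R^2$ or a formal F test, not raw $R^2$, should be used to judge whether an added predictor is worthwhile.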
§16.2 Partial Correlation

Consider

$$Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \varepsilon_i$$

where $Y_i$: income, $X_{i1}$: age, $X_{i2}$: number of years of education.

• Regress $Y$ on $X_1$ → $Y^*$ residuals.
• Regress $X_2$ on $X_1$ → $X_2^*$ residuals.

Then the squared partial correlation is

$$r_{YX_2|X_1}^2 = \frac{\left(\sum Y_i^*X_{i2}^*\right)^2}{\left(\sum X_{i2}^{*2}\right)\left(\sum Y_i^{*2}\right)} = \frac{\left(Y^{*\prime}X_2^*\right)^2}{\left(X_2^{*\prime}X_2^*\right)\left(Y^{*\prime}Y^*\right)}$$

Another method:

• Regress $Y$ on $X_1, X_2, \dots, X_{k-1}$ → $Y^*$.
• Regress $X_k$ on $X_1, X_2, \dots, X_{k-1}$ → $X_k^*$.

$$r_{YX_k|X_1,\dots,X_{k-1}}^2 = \frac{\text{SSE}(Y \text{ on } X_1,\dots,X_{k-1}) - \text{SSE}(Y \text{ on } X_1,\dots,X_k)}{\text{SSE}(Y \text{ on } X_1,\dots,X_{k-1})}$$
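A check on toy data that the residual-based form and the SSE-reduction form of the squared partial correlation agree:

```python
import numpy as np

# Two equivalent computations of the squared partial correlation r^2_{Y X2 | X1}.
rng = np.random.default_rng(13)
n = 60
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

def residuals(v, design):
    b, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ b

D1 = np.column_stack([np.ones(n), x1])
y_star = residuals(y, D1)          # residuals of Y on X1
x2_star = residuals(x2, D1)        # residuals of X2 on X1

r2_resid = (y_star @ x2_star) ** 2 / ((x2_star @ x2_star) * (y_star @ y_star))

sse_1 = np.sum(residuals(y, D1) ** 2)
sse_12 = np.sum(residuals(y, np.column_stack([D1, x2])) ** 2)
r2_sse = (sse_1 - sse_12) / sse_1
assert np.isclose(r2_resid, r2_sse)
```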
§17 Lec 17: Nov 3, 2021
$$e_c'e_c = e'e + e'X[\cdots] + [\cdots]'X'e + \left(c\hat\beta - \gamma\right)'\left[c(X'X)^{-1}c'\right]^{-1}c(X'X)^{-1}X'X(X'X)^{-1}c'\left[c(X'X)^{-1}c'\right]^{-1}\left(c\hat\beta - \gamma\right)$$

where the cross terms vanish because $X'e = 0$.
Finally,

$$e_c'e_c = e'e + \left(c\hat\beta - \gamma\right)'\left[c(X'X)^{-1}c'\right]^{-1}\left(c\hat\beta - \gamma\right)$$

The constraint is $c\beta = \gamma$; for example, with one constraint,

$$c = \begin{pmatrix}c_1 & c_2\end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_1\\\beta_2\end{pmatrix}, \qquad c_1\beta_1 + c_2\beta_2 = \gamma$$

(A worked numerical example with two constraints on $\beta = (\beta_0,\dots,\beta_4)'$ appeared here.)

Partition the model as

$$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon$$

Then, solving the constraint for $\beta_1$,

$$Y = X_1c_1^{-1}\left[\gamma - c_2\beta_2\right] + X_2\beta_2 + \varepsilon$$

$$Y - X_1c_1^{-1}\gamma = \left(X_2 - X_1c_1^{-1}c_2\right)\beta_2 + \varepsilon$$

$$Y_r = X_{2r}\beta_2 + \varepsilon$$

and therefore,

$$\hat\beta_{1c} = c_1^{-1}\left(\gamma - c_2\hat\beta_{2c}\right)$$

Overall,

$$\hat\beta_c = \hat\beta - (X'X)^{-1}c'\left[c(X'X)^{-1}c'\right]^{-1}\left(c\hat\beta - \gamma\right)$$

or

$$\hat\beta_c = \begin{pmatrix}\hat\beta_{1c}\\\hat\beta_{2c}\end{pmatrix}$$

which is from the canonical form. Next, let's find the mean and variance of $\hat\beta_c$:

$$E\hat\beta_c = \beta$$

Notice that

$$\hat\beta_c = \underbrace{\left[I - (X'X)^{-1}c'\left(c(X'X)^{-1}c'\right)^{-1}c\right]}_{A}\hat\beta + \text{const} = A\hat\beta + \text{const}$$
So

$$\operatorname{var}(\hat\beta_c) = \sigma^2A(X'X)^{-1}A'$$

or, using the canonical model,

$$\operatorname{var}(\hat\beta_c) = \begin{pmatrix}\operatorname{var}(\hat\beta_{1c}) & \operatorname{cov}\left(\hat\beta_{1c},\hat\beta_{2c}\right)\\ \operatorname{cov}\left(\hat\beta_{1c},\hat\beta_{2c}\right) & \operatorname{var}(\hat\beta_{2c})\end{pmatrix}$$
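A sketch of restricted least squares using the closed form derived above, with one assumed linear constraint on toy data; the restricted estimate satisfies the constraint exactly and can never fit better than the unrestricted one:

```python
import numpy as np

# Restricted least squares: beta_c = beta_hat - (X'X)^{-1} c' [c (X'X)^{-1} c']^{-1} (c beta_hat - gamma)
rng = np.random.default_rng(14)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

c = np.array([[0.0, 1.0, 1.0]])     # constraint: beta1 + beta2 = gamma
gamma = np.array([5.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
adjust = XtX_inv @ c.T @ np.linalg.inv(c @ XtX_inv @ c.T) @ (c @ beta_hat - gamma)
beta_c = beta_hat - adjust

assert np.allclose(c @ beta_c, gamma)
# The restricted SSE is never smaller than the unrestricted SSE.
assert np.sum((y - X @ beta_c) ** 2) >= np.sum((y - X @ beta_hat) ** 2)
```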
§18 Lec 18: Nov 5, 2021
Consider:

$$Y = X\beta + \varepsilon, \qquad \varepsilon_1,\dots,\varepsilon_n \overset{\text{i.i.d.}}{\sim} N(0,\sigma), \qquad \varepsilon \sim N_n\left(0, \sigma^2I\right)$$

Then $Y \sim N_n\left(X\beta, \sigma^2I\right)$ and

$$\hat\beta = (X'X)^{-1}X'Y, \qquad \hat\beta \sim N_{k+1}\left(\beta, \sigma^2(X'X)^{-1}\right), \qquad \hat\beta_1 \sim N\left(\beta_1, \sigma\sqrt{v_{11}}\right)$$

$$(X'X)^{-1} = \begin{pmatrix}v_{00} & v_{01} & \dots & v_{0k}\\ v_{10} & v_{11} & \dots & v_{1k}\\ \vdots & \vdots & & \vdots\\ v_{k0} & v_{k1} & \dots & v_{kk}\end{pmatrix}$$
§18.1 Quadratic Forms of Normally Distributed Random Variables

We have:

a) $Z \sim N_n(0,I)$: then $Z_1,\dots,Z_n \overset{\text{i.i.d.}}{\sim} N(0,1)$, $Z_i^2 \sim \chi^2_1$, $\sum Z_i^2 \sim \chi^2_n$, i.e.

$$Z'Z \sim \chi^2_n$$

b) $Z \sim N_n\left(0,\sigma^2I\right)$: then $Z_i \sim N(0,\sigma)$, $\frac{Z_i}{\sigma} \sim N(0,1)$, $\frac{Z_i^2}{\sigma^2} \sim \chi^2_1$, $\frac{\sum Z_i^2}{\sigma^2} \sim \chi^2_n$, and $\frac Z\sigma \sim N_n(0,I)$, so

$$\frac{Z'Z}{\sigma^2} \sim \chi^2_n$$

In multiple regression, $\varepsilon \sim N_n\left(0,\sigma^2I\right)$, so

$$\frac{\varepsilon'\varepsilon}{\sigma^2} \sim \chi^2_n$$
c) $Y \sim N_n\left(\mu, \sigma^2I\right)$:

$$Y_i \sim N(\mu_i,\sigma) \implies \frac{Y_i-\mu_i}{\sigma} \sim N(0,1), \qquad \left(\frac{Y_i-\mu_i}{\sigma}\right)^2 \sim \chi^2_1, \qquad \sum\left(\frac{Y_i-\mu_i}{\sigma}\right)^2 \sim \chi^2_n$$

$$\frac{(Y-\mu)'(Y-\mu)}{\sigma^2} \sim \chi^2_n$$

In multiple regression, $Y \sim N_n\left(X\beta, \sigma^2I\right)$, so

$$\frac{(Y-X\beta)'(Y-X\beta)}{\sigma^2} \sim \chi^2_n$$

d) $Y \sim N_n(\mu,\Sigma)$: use $V = \Sigma^{-\frac12}(Y-\mu)$. $\Sigma$ is a symmetric matrix with spectral decomposition

$$\Sigma = P\Lambda P', \qquad \Lambda = \begin{pmatrix}\lambda_1 & & 0\\ & \ddots & \\ 0 & & \lambda_n\end{pmatrix}$$

where the $\lambda_i$ solve $|\Sigma - \lambda I| = 0$. If $x$ is a non-zero vector such that $\Sigma x = \lambda x$, we say that $x$ is an eigenvector of $\Sigma$. Normalize the eigenvectors so that they have length 1.

Result:

$$\Sigma^{-\frac12} = P\Lambda^{-\frac12}P'$$

Properties:

$$\left(\Sigma^{-\frac12}\right)' = \Sigma^{-\frac12}, \qquad \Sigma^{-\frac12}\Sigma^{-\frac12} = \Sigma^{-1}$$

$$\Sigma^{\frac12} = P\Lambda^{\frac12}P', \qquad \left(\Sigma^{\frac12}\right)' = \Sigma^{\frac12}, \qquad \Sigma^{\frac12}\Sigma^{\frac12} = \Sigma$$
Therefore,

$$(Y-\mu)'\Sigma^{-1}(Y-\mu) \sim \chi^2_n$$

In multiple regression,

$$\hat\beta \sim N_{k+1}\left(\beta, \sigma^2(X'X)^{-1}\right)$$

We want to create a $\chi^2$ random variable using the distribution of $\hat\beta$. Let $V = (X'X)^{\frac12}\left(\hat\beta - \beta\right)$:

$$EV = (X'X)^{\frac12}E\left(\hat\beta - \beta\right) = 0$$

$$\operatorname{var}(V) = \operatorname{var}\left[(X'X)^{\frac12}\left(\hat\beta - \beta\right)\right] = (X'X)^{\frac12}\operatorname{var}(\hat\beta)(X'X)^{\frac12} = \sigma^2(X'X)^{\frac12}(X'X)^{-1}(X'X)^{\frac12} = \sigma^2I$$

We have so far

$$V \sim N_{k+1}\left(0, \sigma^2I\right), \qquad \frac{V'V}{\sigma^2} \sim \chi^2_{k+1}, \qquad \frac{\left(\hat\beta - \beta\right)'X'X\left(\hat\beta - \beta\right)}{\sigma^2} \sim \chi^2_{k+1}$$

Summary:

$$\begin{cases}\dfrac{(Y-X\beta)'(Y-X\beta)}{\sigma^2} \sim \chi^2_n\\[2mm] \dfrac{\left(\hat\beta-\beta\right)'X'X\left(\hat\beta-\beta\right)}{\sigma^2} \sim \chi^2_{k+1}\end{cases}$$

Problem 18.1. Show that $\frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}$.

Proof. We have

$$\frac{\left(Y - X\beta \pm X\hat\beta\right)'\left(Y - X\beta \pm X\hat\beta\right)}{\sigma^2} \sim \chi^2_n$$
We know $\operatorname{cov}(\hat\beta, e) = 0$, so the two quadratic forms are independent and we can write $Q = Q_1 + Q_2$, where $Q \sim \chi^2_n$, $Q_2 = \frac{(\hat\beta-\beta)'X'X(\hat\beta-\beta)}{\sigma^2} \sim \chi^2_{k+1}$, and $Q_1 = \frac{(n-k-1)S_e^2}{\sigma^2}$. Then

$$M_Q(t) = M_{Q_1}(t)\cdot M_{Q_2}(t) \implies M_{Q_1}(t) = \frac{M_Q(t)}{M_{Q_2}(t)} = \frac{(1-2t)^{-\frac n2}}{(1-2t)^{-\frac{k+1}{2}}} = (1-2t)^{-\frac{n-k-1}{2}}$$

So $Q_1 = \frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}$.
In simple regression, $k = 1$:

$$\frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2_{n-2}$$

With $S_e^2 = \frac{\sigma^2}{n-k-1}Q_1$,

$$M_{S_e^2}(t) = M_{Q_1}\left(\frac{\sigma^2t}{n-k-1}\right) = \left(1 - \frac{2\sigma^2t}{n-k-1}\right)^{-\frac{n-k-1}{2}}$$

Thus $S_e^2 \sim \Gamma\left(\frac{n-k-1}{2},\ \frac{2\sigma^2}{n-k-1}\right)$, so

$$ES_e^2 = \sigma^2, \qquad \operatorname{var}(S_e^2) = \frac{2\sigma^4}{n-k-1}$$
§19 Lec 19: Nov 8, 2021
§19.1 Quadratic Forms and Their Distribution – Overview
1. $Z \sim N_n(0,I)$:

$$Z'Z \sim \chi^2_n$$

2. $Z \sim N_n\left(0,\sigma^2I\right)$:

$$\frac{Z'Z}{\sigma^2} \sim \chi^2_n \quad\text{and}\quad \frac{\varepsilon'\varepsilon}{\sigma^2} \sim \chi^2_n$$

3. $Y \sim N_n\left(\mu,\sigma^2I\right)$:

$$\frac{(Y-\mu)'(Y-\mu)}{\sigma^2} \sim \chi^2_n \quad\text{or}\quad \frac{(Y-X\beta)'(Y-X\beta)}{\sigma^2} \sim \chi^2_n$$

4. $Y \sim N(\mu,\Sigma)$. From the spectral decomposition,

$$V = \Sigma^{-\frac12}(Y-\mu) \implies V \sim N_n(0,I)$$

For $V = (X'X)^{\frac12}\left(\hat\beta - \beta\right)$, $V \sim N_{k+1}\left(0,\sigma^2I\right)$; from 2),

$$\frac{V'V}{\sigma^2} \sim \chi^2_{k+1} \implies \frac{\left(\hat\beta-\beta\right)'X'X\left(\hat\beta-\beta\right)}{\sigma^2} \sim \chi^2_{k+1}$$

Also, recall that we showed in the last lecture

$$\frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}$$
§19.2 Another Proof of Quadratic Forms and Their Distribution
1. Let $Y \sim N_n(0,I)$ and $Z = P'Y$ where $P$ is an orthogonal matrix ($P'P = I$). Then $Z \sim N_n(0,I)$.

2. Let $A$ be symmetric and idempotent with $Ax = \lambda x$, $x \ne 0$. Then

$$x'Ax = \lambda x'x, \qquad x'AAx = \lambda x'x, \qquad (Ax)'(Ax) = \lambda x'x, \qquad \lambda^2x'x = \lambda x'x$$

Therefore, $\lambda = 0$ or $\lambda = 1$.

Question 19.1. How many 1's?

$$\operatorname{tr}A = \operatorname{tr}(P\Lambda P') = \operatorname{tr}(\Lambda P'P) = \operatorname{tr}\Lambda$$

3. Let $Y \sim N_n(0,I)$ and suppose $A$ is a symmetric and idempotent matrix. Then $Y'AY \sim \chi^2_r$ where $r = \operatorname{tr}(A)$ (the number of eigenvalues equal to 1). Then,

$$Y'AY = z_1^2 + z_2^2 + \ldots + z_r^2 \sim \chi^2_r$$

where $Z \sim N_n(0,I) \implies z_i \sim N(0,1)$, and so $z_i^2 \sim \chi^2_1$.

4. Use the previous theorem (3.) to show that $\frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}$:

$$S_e^2 = \frac{e'e}{n-k-1} \implies e'e = (n-k-1)S_e^2$$

WTS: $\frac{e'e}{\sigma^2} \sim \chi^2_{n-k-1}$.

Proof. We have

$$\left.\begin{aligned}e &= (I-H)Y\\ Y &= X\beta + \varepsilon\end{aligned}\right\} \implies e = (I-H)\varepsilon$$

Therefore,

$$\frac{e'e}{\sigma^2} = \frac{\varepsilon'(I-H)\varepsilon}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)'(I-H)\left(\frac{\varepsilon}{\sigma}\right) = \varepsilon^{*\prime}(I-H)\varepsilon^*$$

where $\varepsilon^* \sim N_n(0,I)$. Using the theorem above (3.), we conclude that

$$\varepsilon^{*\prime}(I-H)\varepsilon^* = \frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{\operatorname{tr}(I-H)} = \chi^2_{n-k-1}$$
This is known as the Cramer-Rao Lower Bound. Recall the score function

$$S = \frac{\partial\ln f(x;\theta)}{\partial\theta}$$

and the information matrix

$$I(\theta) = E\left[\left(\frac{\partial\ln f(x;\theta)}{\partial\theta}\right)^2\right] = -E\left[\frac{\partial^2\ln f(x;\theta)}{\partial\theta^2}\right] = \operatorname{var}(S)$$

Also,

$$I(\theta) = -E\left[\frac{\partial^2\ln L}{\partial\theta^2}\right]$$

for $Y_1,\dots,Y_n$ i.i.d.
§20 Lec 20: Nov 10, 2021
§20.1 Information Matrix and Efficient Estimator
i.i.d
Let Y1 , Y2 , . . . , Yn ∼ N (µ, σ). Is y an efficient estimator for µ where
Ey = µ
σ2
var(y) =
n
Consider the pdf
1 1 2
√ e− 2σ2 (yi −µ)
f (yi ) =
σ 2π
− n 1
P 2
L = 2πσ 2 2 e− 2σ2 (yi −µ)
n 1 X
ln L = − ln 2πσ 2 − 2 (yi − µ)2
2 2σ
∂ ln L 2 X 1 X
= 2
(yi − µ) = 2 yi − nµ
∂µ 2σ σ
2
∂ ln L µ
=− 2
∂µ2 σ
Cramer-Rao Lower Bound:
1 1 σ2
= =
− − σn2
2
−E ∂ ∂µln2L n
Thus, y is an efficient estimator for µ.
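A quick Monte Carlo sketch of this conclusion: simulate many samples and compare the variance of $\bar{y}$ with the bound $\sigma^2/n$. All parameter choices below are arbitrary, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 2.0, 3.0, 50, 100_000

# one sample mean per row of simulated data
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
crlb = sigma**2 / n   # Cramer-Rao lower bound for unbiased estimators of mu

print(means.var(), crlb)  # both close to 0.18
```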
To check whether an estimator $\hat\theta$ of a vector parameter $\theta = (\theta_1, \ldots, \theta_p)'$ is efficient:

1. Check that $E\hat\theta = \theta$.

2. Find $\operatorname{var}(\hat\theta)$ and compare it with the inverse of the information matrix $I^{-1}(\theta)$, where
\[
I(\theta) = -E\begin{pmatrix}
\frac{\partial^2 \ln L}{\partial\theta_1^2} & \frac{\partial^2 \ln L}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2 \ln L}{\partial\theta_1\partial\theta_p} \\
\frac{\partial^2 \ln L}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 \ln L}{\partial\theta_2^2} & \cdots & \frac{\partial^2 \ln L}{\partial\theta_2\partial\theta_p} \\
\vdots & \vdots & & \vdots \\
\frac{\partial^2 \ln L}{\partial\theta_p\partial\theta_1} & \frac{\partial^2 \ln L}{\partial\theta_p\partial\theta_2} & \cdots & \frac{\partial^2 \ln L}{\partial\theta_p^2}
\end{pmatrix}
\]
In multiple regression the parameters are $\beta_0, \beta_1, \ldots, \beta_k, \sigma^2$:
\[
Y = X\beta + \varepsilon, \qquad \varepsilon \sim N\left(0, \sigma^2 I\right), \qquad Y \sim N_n\left(X\beta, \sigma^2 I\right)
\]
\[
\implies \ln L = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta)
\]
\[
\ln L = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\left(Y'Y - 2Y'X\beta + \beta'X'X\beta\right)
\]
Then the $(k+2) \times (k+2)$ information matrix, written in block form with respect to $\beta$ and $\sigma^2$, is
\[
I(\theta) = -E\begin{pmatrix}
\frac{\partial^2 \ln L}{\partial\beta\,\partial\beta'} & \frac{\partial^2 \ln L}{\partial\beta\,\partial\sigma^2} \\[4pt]
\frac{\partial^2 \ln L}{\partial\sigma^2\,\partial\beta'} & \frac{\partial^2 \ln L}{\partial(\sigma^2)^2}
\end{pmatrix}
\]
Then,
\[
\frac{\partial \ln L}{\partial\beta} = -\frac{1}{2\sigma^2}\left(-2X'Y + 2X'X\beta\right)
\]
\[
\frac{\partial^2 \ln L}{\partial\beta\,\partial\beta'} = -\frac{1}{2\sigma^2}(2X'X) = -\frac{X'X}{\sigma^2}
\]
\[
\frac{\partial^2 \ln L}{\partial\beta\,\partial\sigma^2} = \frac{1}{2\sigma^4}\left(-2X'Y + 2X'X\beta\right), \qquad \text{with expectation } 0 \text{ since } E(X'Y) = X'X\beta
\]
\[
\frac{\partial \ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(Y - X\beta)'(Y - X\beta)
\]
\[
\frac{\partial^2 \ln L}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}(Y - X\beta)'(Y - X\beta)
\]
\[
E\,\frac{\partial^2 \ln L}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{n}{\sigma^4} = -\frac{n}{2\sigma^4}
\]
Thus,
\[
I(\theta) = \begin{pmatrix} \frac{X'X}{\sigma^2} & 0 \\ 0' & \frac{n}{2\sigma^4} \end{pmatrix}, \qquad
I^{-1}(\theta) = \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0' & \frac{2\sigma^4}{n} \end{pmatrix}
\]
For
\[
S_e^2 = \frac{e'e}{n-k-1}, \qquad ES_e^2 = \sigma^2, \qquad \operatorname{var}(S_e^2) = \frac{2\sigma^4}{n-k-1}
\]
Since $\frac{2\sigma^4}{n-k-1} > \frac{2\sigma^4}{n}$, $S_e^2$ does not attain the lower bound exactly, though it does asymptotically.
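A simulation sketch of the last claim, checking $ES_e^2 = \sigma^2$ and $\operatorname{var}(S_e^2) = 2\sigma^4/(n-k-1)$ on a fixed simulated design (all sizes below are arbitrary demonstration choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma, reps = 20, 2, 1.5, 100_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # fixed design
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H   # residual maker: e = M * eps

eps = rng.normal(0.0, sigma, size=(reps, n))
sse = ((eps @ M) * eps).sum(axis=1)   # e'e = eps' M eps for each replication
se2 = sse / (n - k - 1)

print(se2.mean(), sigma**2)                       # unbiasedness
print(se2.var(), 2 * sigma**4 / (n - k - 1))      # variance vs. theory
```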
§20.2 Centered Model

Consider $Y = X\beta + \varepsilon$ with
\[
X = \begin{pmatrix} 1 & X_{(0)} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_{(0)} \end{pmatrix}
\]
Then, adding and subtracting $\frac{1}{n}11'X_{(0)}\beta_{(0)}$,
\[
Y = \beta_0 1 + X_{(0)}\beta_{(0)} + \varepsilon \pm \frac{1}{n}11'X_{(0)}\beta_{(0)}
\]
Rearrange this expression and we obtain
\[
Y = \beta_0 1 + \frac{1}{n}11'X_{(0)}\beta_{(0)} + \left(I - \frac{1}{n}11'\right)X_{(0)}\beta_{(0)} + \varepsilon
\]
\[
= 1\left(\beta_0 + \frac{1}{n}1'X_{(0)}\beta_{(0)}\right) + \left(I - \frac{1}{n}11'\right)X_{(0)}\beta_{(0)} + \varepsilon
\]
\[
= \gamma_0 1 + Z\beta_{(0)} + \varepsilon
\]
Thus,
\[
\hat\gamma_0 = \bar{y}
\]
\[
\hat\beta_{(0)} = (Z'Z)^{-1}Z'Y
= \left[X_{(0)}'\left(I - \tfrac{1}{n}11'\right)X_{(0)}\right]^{-1} X_{(0)}'\left(I - \tfrac{1}{n}11'\right)Y
= \left(X_{(0)}^{*\prime} X_{(0)}^{*}\right)^{-1} X_{(0)}^{*\prime} Y^{*}
\]
Observe that $Y \sim N_n\left(\gamma_0 1 + Z\beta_{(0)}, \sigma^2 I\right)$. Then,
\[
\frac{\left(Y - \gamma_0 1 - Z\beta_{(0)}\right)'\left(Y - \gamma_0 1 - Z\beta_{(0)}\right)}{\sigma^2} \sim \chi^2_n
\]
Note: Fitted values and residuals are the same for both models.
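This note is easy to check numerically: fit the non-centered and centered models on the same data and confirm the fitted values and residuals coincide. The data below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 25, 2
X0 = rng.normal(size=(n, k))                     # X_(0): predictors only
y = 1.0 + X0 @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Non-centered model: X = [1  X_(0)]
X = np.column_stack([np.ones(n), X0])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
fit_nc = X @ beta_hat

# Centered model: Z = (I - 11'/n) X_(0), gamma0_hat = ybar
Z = X0 - X0.mean(axis=0)
beta0_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
fit_c = y.mean() + Z @ beta0_hat

assert np.allclose(fit_nc, fit_c)          # same fitted values
assert np.allclose(y - fit_nc, y - fit_c)  # same residuals
```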
§21 Lec 21: Nov 12, 2021

§21.1 Confidence Intervals in Multiple Regression

Consider
\[
\hat\beta \sim N_{k+1}\left(\beta, \sigma^2(X'X)^{-1}\right)
\]
Let $v_{11}$ be the diagonal element of $(X'X)^{-1}$ corresponding to $\beta_1$, so that $\frac{\hat\beta_1 - \beta_1}{S_e\sqrt{v_{11}}} \sim t_{n-k-1}$. Finally,
\[
\beta_1 \in \hat\beta_1 \pm t_{\frac{\alpha}{2};\,n-k-1} \cdot S_e\sqrt{v_{11}}
\]
In general, to construct a confidence interval for $a'\beta$,
\[
a'\hat\beta \sim N\left(a'\beta, \sigma\sqrt{a'(X'X)^{-1}a}\right)
\]
Then,
\[
\frac{\dfrac{a'\hat\beta - a'\beta}{\sigma\sqrt{a'(X'X)^{-1}a}}}{\sqrt{\dfrac{(n-k-1)S_e^2}{\sigma^2}\Big/(n-k-1)}} \sim t_{n-k-1}
\]
\[
\frac{a'\hat\beta - a'\beta}{S_e\sqrt{a'(X'X)^{-1}a}} \sim t_{n-k-1}
\]
Finally,
\[
a'\beta \in a'\hat\beta \pm t_{\frac{\alpha}{2};\,n-k-1} \cdot S_e\sqrt{a'(X'X)^{-1}a}
\]
If $a = \begin{pmatrix} 0 & 1 & 0 & 0 & \ldots & 0 \end{pmatrix}'$ then $a'\beta = \beta_1$.
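The interval $a'\hat\beta \pm t_{\alpha/2;\,n-k-1}\,S_e\sqrt{a'(X'X)^{-1}a}$ can be sketched in a few lines. The data are simulated, and since only numpy is assumed available, the $t$ critical value is hardcoded (approximately $t_{.025;\,21}$).

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
se2 = e @ e / (n - k - 1)                # Se^2

a = np.array([0.0, 1.0, 0.0, 0.0])       # picks out beta_1
se_ab = np.sqrt(se2 * (a @ XtX_inv @ a)) # Se * sqrt(a'(X'X)^{-1} a)
t_crit = 2.080                           # approx t_{.025; 21}
lo, hi = a @ beta_hat - t_crit * se_ab, a @ beta_hat + t_crit * se_ab
print((lo, hi))                          # 95% interval for beta_1
```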
Prediction: consider the model $Y = X\beta + \varepsilon$ with $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$, and a new observation whose predictor values are collected in the vector $x_0' = \begin{pmatrix} 1 & x_{01} & x_{02} & \ldots & x_{0k} \end{pmatrix}$. So the predictor is
\[
\hat{Y}_0 = x_0'\hat\beta
\]
The error of the prediction is $Y_0 - \hat{Y}_0$, with $E(Y_0 - \hat{Y}_0) = EY_0 - E\hat{Y}_0 = x_0'\beta - x_0'\beta = 0$. Note that $Y_0 = x_0'\beta + \varepsilon_0$, and since $Y_0$ and $\hat{Y}_0$ are independent,
\[
\operatorname{var}\left(Y_0 - \hat{Y}_0\right) = \operatorname{var}(Y_0) + \operatorname{var}(\hat{Y}_0)
= \sigma^2 + \sigma^2 x_0'(X'X)^{-1}x_0
= \sigma^2\left(1 + x_0'(X'X)^{-1}x_0\right)
\]
Then,
\[
Y_0 - \hat{Y}_0 \sim N\left(0, \sigma\sqrt{1 + x_0'(X'X)^{-1}x_0}\right), \qquad
\frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}
\]
With this, we can construct a $t$ ratio
\[
\frac{\dfrac{Y_0 - \hat{Y}_0}{\sigma\sqrt{1 + x_0'(X'X)^{-1}x_0}}}{\sqrt{\dfrac{(n-k-1)S_e^2}{\sigma^2}\Big/(n-k-1)}}
= \frac{Y_0 - \hat{Y}_0}{S_e\sqrt{1 + x_0'(X'X)^{-1}x_0}} \sim t_{n-k-1}
\]
The same construction applied to $\hat{Y}_0 - x_0'\beta$, which has variance $\sigma^2 x_0'(X'X)^{-1}x_0$, shows that the confidence interval for $EY_0$ is
\[
EY_0 = x_0'\beta \in \hat{Y}_0 \pm t_{\frac{\alpha}{2};\,n-k-1} \cdot S_e\sqrt{x_0'(X'X)^{-1}x_0}
\]
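A sketch contrasting the two interval half-widths at a new point $x_0$: the prediction interval uses $1 + x_0'(X'X)^{-1}x_0$, the interval for $EY_0$ uses $x_0'(X'X)^{-1}x_0$ alone, so the former is always wider. Data, $x_0$, and the hardcoded $t$ critical value (approximately $t_{.025;\,27}$) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
se = np.sqrt(e @ e / (n - k - 1))

x0 = np.array([1.0, 0.3, -0.2])
h0 = x0 @ XtX_inv @ x0
t_crit = 2.052                               # approx t_{.025; 27}
half_mean = t_crit * se * np.sqrt(h0)        # CI for E(Y0)
half_pred = t_crit * se * np.sqrt(1.0 + h0)  # prediction interval for Y0

assert half_pred > half_mean
print(x0 @ beta_hat, half_mean, half_pred)
```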
§21.2 Hypothesis Testing

Suppose $k = 5$. Then consider
\[
H_0: C\beta = \gamma, \qquad H_a: C\beta \ne \gamma
\]
1. $C = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 \end{pmatrix}$, $\gamma = 0$ (tests $\beta_1 = 0$)

2. $C = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix}$, $\gamma = 2$ (tests $\beta_3 = 2$)

3. $C = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & -1 \end{pmatrix}$, $\gamma = 0$ (tests $\beta_2 = \beta_5$)

4. $C = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$, $\gamma = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. Check:
\[
C\beta = \begin{pmatrix} \beta_2 \\ \beta_5 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
\]

5.
\[
C = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}, \qquad
\gamma = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},
\]
or $C = \begin{pmatrix} 0 & I \end{pmatrix}$.

In general, $C$ is an $m \times (k+1)$ matrix.
\[
H_0: C\beta = \gamma \implies C\beta - \gamma = 0
\]
\[
H_a: C\beta \ne \gamma \implies C\beta - \gamma \ne 0
\]
Therefore, under $H_0$,
\[
C\hat\beta - \gamma \sim N_m\left(0, \sigma^2 C(X'X)^{-1}C'\right)
\]
and let
\[
V = \left[C(X'X)^{-1}C'\right]^{-\frac{1}{2}}\left(C\hat\beta - \gamma\right)
\]
Then $EV = 0$ and $\operatorname{var}(V) = \sigma^2 I_{m \times m}$. So $V \sim N_m\left(0, \sigma^2 I\right)$ and $\frac{V'V}{\sigma^2} \sim \chi^2_m$:
\[
\frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{\sigma^2} \sim \chi^2_m, \qquad
\frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}
\]
$\hat\beta$ and $S_e^2$ are independent. Therefore,
\[
\frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)\big/(\sigma^2 m)}{\dfrac{(n-k-1)S_e^2}{\sigma^2}\Big/(n-k-1)} \sim F_{m,\,n-k-1}
\]
or
\[
\frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{m S_e^2} \sim F_{m,\,n-k-1}
\]
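The final F statistic translates directly into code. The sketch below tests $H_0: \beta_1 = \beta_2 = 0$ ($m = 2$) on simulated data; all sizes and coefficients are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, -0.6]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
se2 = e @ e / (n - k - 1)

C = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
gamma = np.zeros(2)
m = C.shape[0]

# F = (C bhat - g)' [C (X'X)^{-1} C']^{-1} (C bhat - g) / (m Se^2)
d = C @ beta_hat - gamma
F = d @ np.linalg.inv(C @ XtX_inv @ C.T) @ d / (m * se2)
print(F)  # compare with F_{1-alpha; m, n-k-1}
```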
§22 Lec 22: Nov 15, 2021

§22.1 F Test for the General Linear Hypothesis

Consider:
\[
H_0: C\beta = \gamma, \qquad H_a: C\beta \ne \gamma
\]
Under $H_0$: $C\hat\beta - \gamma \sim N_m\left(0, \sigma^2 C(X'X)^{-1}C'\right)$, so
\[
\frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{\sigma^2} \sim \chi^2_m, \qquad
\frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1}
\]
\[
\implies F = \frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{m S_e^2} \sim F_{m,\,n-k-1}
\]
Reject $H_0$ if $F > F_{1-\alpha;\,m,\,n-k-1}$.

Note: $E(mS_e^2) = m\sigma^2$ is the expected value of the denominator. The expected value of the numerator (using properties of trace) is
\[
m\sigma^2 + (C\beta - \gamma)'\left[C(X'X)^{-1}C'\right]^{-1}(C\beta - \gamma)
\]
If $H_0$ is true, the second term is 0, so the ratio of the two expected values equals 1; otherwise the numerator is inflated and $F$ tends to be large.
§22.2 t Test

For a single linear combination, take $m = 1$ and $C = a'$. Combining $a'\hat\beta \sim N\left(a'\beta, \sigma\sqrt{a'(X'X)^{-1}a}\right)$ with
\[
\frac{(n-k-1)S_e^2}{\sigma^2} \sim \chi^2_{n-k-1},
\]
under $H_0: a'\beta = 0$ we obtain
\[
\frac{a'\hat\beta}{S_e\sqrt{a'(X'X)^{-1}a}} \sim t_{n-k-1}
\]
§22.3 Power Analysis in Multiple Regression

Let $Y \sim N_n(\mu, I)$. Then $Y'Y \sim \chi^2_n\left(\text{NCP} = \mu'\mu\right)$. Let $Y \sim N_n\left(\mu, \sigma^2 I\right)$. Then,
\[
\frac{Y}{\sigma} \sim N_n\left(\frac{\mu}{\sigma}, I\right), \qquad
\frac{Y'Y}{\sigma^2} \sim \chi^2_n\left(\text{NCP} = \frac{\mu'\mu}{\sigma^2}\right)
\]
When $H_0$ is not true,
\[
C\hat\beta - \gamma \sim N_m\left(C\beta - \gamma, \sigma^2 C(X'X)^{-1}C'\right)
\]
Let $V = \dfrac{\left[C(X'X)^{-1}C'\right]^{-\frac{1}{2}}\left(C\hat\beta - \gamma\right)}{\sigma}$. Then,
\[
V \sim N_m\left(\frac{\left[C(X'X)^{-1}C'\right]^{-\frac{1}{2}}(C\beta - \gamma)}{\sigma},\; I\right)
\]
\[
\frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{\sigma^2}
\sim \chi^2_m\left(\text{NCP} = \frac{(C\beta - \gamma)'\left[C(X'X)^{-1}C'\right]^{-1}(C\beta - \gamma)}{\sigma^2}\right)
\]
and
\[
\text{SSE}_c = e_c'e_c = e'e + \left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)
\]
Example (extra sum of squares): suppose $k = 5$ and
\[
H_0: \beta_1 = \beta_2 = 0
\]
The full model has $n - 5 - 1$ error degrees of freedom and the reduced model has $n - 3 - 1$, so
\[
m = (n - 3 - 1) - (n - 5 - 1) = 2
\]
Thus,
\[
\frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{m S_e^2} \sim F_{m,\,n-k-1},
\]
which is the same as method 1 (the $F$ test for the general linear hypothesis).
§23 Lec 23: Nov 17, 2021

§23.1 Testing the Overall Significance of the Model

Consider
\[
H_0: \beta_{(0)} = 0, \qquad H_a: \beta_{(0)} \ne 0
\]
We can test this hypothesis using the $F$ test for the general linear hypothesis
\[
\frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{k S_e^2} \sim F_{k,\,n-k-1}
\]
where $m = k$ in this case. We can also use the following test statistic
\[
\frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/k}{\text{SSE}/(n-k-1)}
\]
Note: $\text{SSR} = \left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)$, where $C = \begin{pmatrix} 0 & I_k \end{pmatrix}$ and $\gamma = 0$.
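The note claims the quadratic form with $C = \begin{pmatrix} 0 & I_k \end{pmatrix}$ equals the usual regression sum of squares. A sketch verifying the identity numerically (simulated data, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 35, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, 0.5, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
fitted = X @ beta_hat
ssr = ((fitted - y.mean())**2).sum()           # SSR = sum (yhat_i - ybar)^2

C = np.column_stack([np.zeros(k), np.eye(k)])  # C = [0  I_k]
d = C @ beta_hat                               # gamma = 0
quad = d @ np.linalg.inv(C @ XtX_inv @ C.T) @ d

assert np.isclose(ssr, quad)   # the two computations of SSR agree
```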
§23.2 Likelihood Ratio Test

Consider:
\[
H_0: C\beta = \gamma, \qquad H_a: C\beta \ne \gamma
\]
We reject $H_0$ if
\[
\Lambda = \frac{L(\hat\omega)}{L(\hat\Omega)} < k
\]
where

• $L(\hat\omega)$: maximized likelihood function under $H_0$ (the restricted model)

• $L(\hat\Omega)$: maximized likelihood function under no restriction (the full model)
Under $H_0$ the MLE of $\sigma^2$ is
\[
\hat\sigma_0^2 = \frac{\left(y - X\hat\beta_c\right)'\left(y - X\hat\beta_c\right)}{n} = \frac{e_c'e_c}{n}
\]
Back to the LRT, we have
\[
\Lambda = \frac{\left(2\pi\hat\sigma_0^2\right)^{-\frac{n}{2}} e^{-\frac{1}{2\hat\sigma_0^2}e_c'e_c}}{\left(2\pi\hat\sigma_1^2\right)^{-\frac{n}{2}} e^{-\frac{1}{2\hat\sigma_1^2}e'e}} < k
\]
Replace
\[
e_c'e_c = n\hat\sigma_0^2, \qquad e'e = n\hat\sigma_1^2
\]
(the exponentials then cancel) and we obtain
\[
\frac{e'e}{e_c'e_c} < k^{\frac{2}{n}}
\]
Also,
\[
e_c'e_c = e'e + \left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)
\]
Thus,
\[
\frac{1}{1 + \dfrac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{e'e}} < k^{\frac{2}{n}}
\]
We see that if $H_0$ is true then $C\hat\beta \approx \gamma$ and therefore the ratio above is approximately equal to 1. If $H_0$ is not true then the ratio above is less than 1. Manipulating the above expression, we have
\[
\frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{m S_e^2} > \left(k^{-\frac{2}{n}} - 1\right)\frac{n-k-1}{m} = k'
\]
and we choose $k'$ so that $P\left(F_{m,\,n-k-1} > k'\right) = \alpha$. Therefore, $k' = F_{1-\alpha;\,m,\,n-k-1}$.
(Figure: density of $F_{m,\,n-k-1}$, with area $1-\alpha$ to the left of $F_{1-\alpha;\,m,\,n-k-1}$ and the rejection region of area $\alpha$ to its right.)
§23.3 Multi-Collinearity

This is a problem that occurs when some predictors are highly correlated with other predictors.
Example 23.1
Suppose $k = 2$ and we test
\[
H_0: \beta_1 = \beta_2 = 0, \qquad H_a: \text{at least one } \beta_i \ne 0
\]
using the $F$ statistic. Suppose we reject $H_0$ (at least one $\beta_i \ne 0$). Then test $\beta_1 = 0$ and $\beta_2 = 0$ individually:
\[
H_0: \beta_1 = 0 \text{ vs. } H_a: \beta_1 \ne 0, \qquad
H_0: \beta_2 = 0 \text{ vs. } H_a: \beta_2 \ne 0
\]
Suppose we don't reject $H_0$ in both tests. This contradiction between the $F$ statistic and the $t$ statistics is a problem caused by multi-collinearity.

Multi-collinearity inflates the variance of $\hat\beta_i$ and therefore the corresponding $t$ statistics will be small. To explain this we will use the centered and scaled model. Starting from the centered model
\[
Y = \gamma_0 1 + Z\beta_{(0)} + \varepsilon, \qquad
Y_i = \gamma_0 + \beta_1\left(X_{i1} - \bar{X}_1\right) + \beta_2\left(X_{i2} - \bar{X}_2\right) + \ldots + \beta_k\left(X_{ik} - \bar{X}_k\right) + \varepsilon_i
\]
for $i = 1, 2, \ldots, n$, we scale each centered predictor as follows.
§24 Lec 24: Nov 19, 2021

§24.1 Centered and Scaled Model in Matrix/Vector Form

Consider the centered model
\[
Y = \gamma_0 1 + Z\beta_{(0)} + \varepsilon
\]
\[
\gamma_0 = \beta_0 + \frac{1}{n}1'X_{(0)}\beta_{(0)} = \beta_0 + \beta_1\bar{x}_1 + \ldots + \beta_k\bar{x}_k, \qquad
Z = \left(I - \frac{1}{n}11'\right)X_{(0)}
\]
or, multiplying and dividing each term by the length of the centered predictor,
\[
Y_i = \gamma_0 + \beta_1\sqrt{\textstyle\sum(x_{i1} - \bar{x}_1)^2}\;\frac{x_{i1} - \bar{x}_1}{\sqrt{\sum(x_{i1} - \bar{x}_1)^2}} + \ldots + \beta_k\sqrt{\textstyle\sum(x_{ik} - \bar{x}_k)^2}\;\frac{x_{ik} - \bar{x}_k}{\sqrt{\sum(x_{ik} - \bar{x}_k)^2}} + \varepsilon_i
\]
or
\[
Y_i = \gamma_0 + \delta_1 Z_{s,i1} + \ldots + \delta_k Z_{s,ik} + \varepsilon_i
\]
where
\[
\delta_j = \beta_j\sqrt{\textstyle\sum_i (x_{ij} - \bar{x}_j)^2}, \qquad
Z_{s,ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_i (x_{ij} - \bar{x}_j)^2}}
\]
In matrix form $\delta_{(0)} = D\beta_{(0)}$ and $Z_s = ZD^{-1}$, where
\[
D = \begin{pmatrix}
\sqrt{\sum(x_{i1} - \bar{x}_1)^2} & & 0 \\
& \ddots & \\
0 & & \sqrt{\sum(x_{ik} - \bar{x}_k)^2}
\end{pmatrix}
\]
Then
\[
Y = \gamma_0 1 + Z_s\delta_{(0)} + \varepsilon
\]
Note:

1. $Z_s'1 = 0$ (each scaled column still sums to zero).

2. Writing $Z_s = \begin{pmatrix} Z_{s1} & \ldots & Z_{sk} \end{pmatrix}$, each column has unit length:
\[
Z_{s1}'Z_{s1} = \begin{pmatrix} \frac{x_{11} - \bar{x}_1}{\sqrt{\sum(x_{i1} - \bar{x}_1)^2}} & \ldots & \frac{x_{n1} - \bar{x}_1}{\sqrt{\sum(x_{i1} - \bar{x}_1)^2}} \end{pmatrix}
\begin{pmatrix} \frac{x_{11} - \bar{x}_1}{\sqrt{\sum(x_{i1} - \bar{x}_1)^2}} \\ \vdots \\ \frac{x_{n1} - \bar{x}_1}{\sqrt{\sum(x_{i1} - \bar{x}_1)^2}} \end{pmatrix}
= \frac{\sum_i (x_{i1} - \bar{x}_1)^2}{\sum_i (x_{i1} - \bar{x}_1)^2} = 1
\]
Then,
\[
Z_s'Z_s = R = \begin{pmatrix}
1 & r_{12} & \ldots & r_{1k} \\
r_{21} & 1 & \ldots & r_{2k} \\
\vdots & \vdots & & \vdots \\
r_{k1} & r_{k2} & \ldots & 1
\end{pmatrix},
\]
the correlation matrix of the predictors. Estimation of $Y = \gamma_0 1 + Z_s\delta_{(0)} + \varepsilon$:
\[
\begin{pmatrix} \hat\gamma_0 \\ \hat\delta_{(0)} \end{pmatrix}
= \begin{pmatrix} 1'1 & 1'Z_s \\ Z_s'1 & Z_s'Z_s \end{pmatrix}^{-1}
\begin{pmatrix} 1'Y \\ Z_s'Y \end{pmatrix}
= \begin{pmatrix} n & 0' \\ 0 & Z_s'Z_s \end{pmatrix}^{-1}
\begin{pmatrix} 1'Y \\ Z_s'Y \end{pmatrix}
= \begin{pmatrix} \frac{1}{n}1'Y \\ (Z_s'Z_s)^{-1}Z_s'Y \end{pmatrix}
\]
So
\[
\hat\gamma_0 = \bar{y},
\]
which is the same as the estimate of $\gamma_0$ in the centered model. And
\[
\hat\delta_{(0)} = (Z_s'Z_s)^{-1}Z_s'Y
\]
Properties:
\[
E\hat\delta_{(0)} = (Z_s'Z_s)^{-1}Z_s'EY = (Z_s'Z_s)^{-1}Z_s'\left(\gamma_0 1 + Z_s\delta_{(0)}\right) = 0 + (Z_s'Z_s)^{-1}Z_s'Z_s\delta_{(0)} = \delta_{(0)}
\]
\[
\operatorname{var}(\hat\delta_{(0)}) = \operatorname{var}\left[(Z_s'Z_s)^{-1}Z_s'Y\right] = \sigma^2(Z_s'Z_s)^{-1} = \sigma^2 R^{-1}
\]
Non-Centered Model:
\[
Y = X\beta + \varepsilon \quad \text{or} \quad Y = \beta_0 1 + X_{(0)}\beta_{(0)} + \varepsilon
\]
Centered Model:
\[
Y = \gamma_0 1 + Z\beta_{(0)} + \varepsilon
\]
Centered/Scaled Model:
\[
Y = \gamma_0 1 + Z_s\delta_{(0)} + \varepsilon
\]
where
\[
\gamma_0 = \beta_0 + \frac{1}{n}1'X_{(0)}\beta_{(0)}, \qquad \delta_{(0)} = D\beta_{(0)}
\]
So
\[
\hat\beta_0 = \hat\gamma_0 - \frac{1}{n}1'X_{(0)}\hat\beta_{(0)} = \bar{y} - \frac{1}{n}1'X_{(0)}D^{-1}\hat\delta_{(0)}, \qquad
\hat\beta_{(0)} = D^{-1}\hat\delta_{(0)}
\]
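The back-transformation can be checked end to end: fit the centered/scaled model, then recover the non-centered coefficients via $D^{-1}\hat\delta_{(0)}$. Data below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 30, 2
X0 = rng.normal(size=(n, k))
y = 0.5 + X0 @ np.array([1.5, -2.0]) + rng.normal(size=n)

# Non-centered fit for reference
X = np.column_stack([np.ones(n), X0])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Centered/scaled fit: Z_s = (X0 - mean) / column lengths, D diagonal
Xc = X0 - X0.mean(axis=0)
d = np.sqrt((Xc**2).sum(axis=0))           # diagonal of D
Zs = Xc / d
delta_hat = np.linalg.lstsq(Zs, y, rcond=None)[0]

# Z_s'Z_s is the correlation matrix R of the predictors
assert np.allclose(Zs.T @ Zs, np.corrcoef(X0, rowvar=False))
# beta_(0) = D^{-1} delta_(0); intercept recovered from gamma0_hat = ybar
assert np.allclose(delta_hat / d, beta_hat[1:])
assert np.isclose(y.mean() - X0.mean(axis=0) @ (delta_hat / d), beta_hat[0])
```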
§25 Lec 25: Nov 22, 2021

§25.1 Multi-Collinearity

Let $X_1, X_2, \ldots, X_k$ be predictors, some of which are highly correlated with other predictors. Earlier we saw that $\delta_1 = \beta_1\sqrt{\sum(x_{i1} - \bar{x}_1)^2}$, so
\[
\hat\delta_1 = \hat\beta_1\sqrt{\textstyle\sum(x_{i1} - \bar{x}_1)^2}, \qquad
\hat\beta_1 = \frac{\hat\delta_1}{\sqrt{\sum(x_{i1} - \bar{x}_1)^2}}, \qquad
\operatorname{var}(\hat\beta_1) = \frac{\operatorname{var}(\hat\delta_1)}{\sum(x_{i1} - \bar{x}_1)^2}
\]
So let's find the variance of $\hat\delta_1$ using the centered and scaled model:
\[
\operatorname{var}(\hat\delta_{(0)}) = \sigma^2 R^{-1}, \qquad
R = \begin{pmatrix}
1 & r_{12} & r_{13} & \ldots & r_{1k} \\
r_{21} & 1 & r_{23} & \ldots & r_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
r_{k1} & r_{k2} & r_{k3} & \ldots & 1
\end{pmatrix}
\]
Therefore, $\operatorname{var}(\hat\delta_1) = \sigma^2\left[R^{-1}\right]_{11}$. Partition $R = \begin{pmatrix} 1 & r' \\ r & R_{22} \end{pmatrix}$ and use the inverse of a partitioned matrix: the $(1,1)$ block of $\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1}$ is $C_{11}^{-1}$, where $C_{11} = A_{11} - A_{12}A_{22}^{-1}A_{21}$. Here $A_{11} = 1$, $A_{12} = r'$, $A_{21} = r$, $A_{22} = R_{22}$. Therefore,
\[
\operatorname{var}(\hat\delta_1) = \frac{\sigma^2}{1 - r'R_{22}^{-1}r}
\]
We will show that $\operatorname{var}(\hat\delta_1) = \frac{\sigma^2}{1 - R_1^2}$, where $R_1^2$ is the R-square from the regression of $X_1$ on $X_2, X_3, \ldots, X_k$. Instead, we can regress $Z_{s1}$ on $Z_{s2}, Z_{s3}, \ldots, Z_{sk}$, because we have seen that the non-centered, centered, and centered/scaled models have the same fitted values and residuals, and hence the same R-square.
Combining the two results,
\[
\operatorname{var}(\hat\beta_1) = \frac{\sigma^2}{(1 - R_1^2)\sum(x_{i1} - \bar{x}_1)^2}
\]
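This variance formula, and the variance inflation factor $1/(1 - R_1^2)$ it implies, is an exact algebraic identity and can be verified from the design matrix alone. The correlated predictors below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + 0.2 * rng.normal(size=n)   # x1 highly correlated with x2
X = np.column_stack([np.ones(n), x1, x2])

sigma2 = 1.0
var_beta = sigma2 * np.linalg.inv(X.T @ X)  # var(beta_hat) = sigma^2 (X'X)^{-1}
var_b1_exact = var_beta[1, 1]

# R1^2 from regressing x1 on (1, x2)
A = np.column_stack([np.ones(n), x2])
fit = A @ np.linalg.lstsq(A, x1, rcond=None)[0]
R1sq = 1 - ((x1 - fit)**2).sum() / ((x1 - x1.mean())**2).sum()
S11 = ((x1 - x1.mean())**2).sum()

assert np.isclose(var_b1_exact, sigma2 / ((1 - R1sq) * S11))
print(1 / (1 - R1sq))   # the variance inflation factor for x1
```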
§25.2 Generalized Least Squares

Consider the model $Y = X\beta + \varepsilon$ with
\[
E\varepsilon = 0, \qquad \operatorname{var}(\varepsilon) = \sigma^2 V
\]
The OLS estimator $\hat\beta = (X'X)^{-1}X'Y$ is still unbiased, but its variance is now
$\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$. Therefore, $\hat\beta$ is not BLUE, because the Gauss-Markov conditions do not hold. We transform the model as follows: let $V^{-\frac{1}{2}}$ be the inverse square root matrix of $V$. Multiply the model on both sides by $V^{-\frac{1}{2}}$:
\[
V^{-\frac{1}{2}}Y = V^{-\frac{1}{2}}X\beta + V^{-\frac{1}{2}}\varepsilon
\]
or
\[
Y^* = X^*\beta + \varepsilon^*
\]
\[
E\varepsilon^* = E\,V^{-\frac{1}{2}}\varepsilon = 0, \qquad
\operatorname{var}(\varepsilon^*) = \operatorname{var}\left(V^{-\frac{1}{2}}\varepsilon\right) = \sigma^2 V^{-\frac{1}{2}}VV^{-\frac{1}{2}} = \sigma^2 I
\]
With this transformation we see that the Gauss-Markov conditions hold. Therefore, we estimate $\beta$ using
\[
\hat\beta_{GLS} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}Y^*
\]
Replace $X^* = V^{-\frac{1}{2}}X$ and $Y^* = V^{-\frac{1}{2}}Y$ to get
\[
\hat\beta_{GLS} = \left(X'V^{-1}X\right)^{-1}X'V^{-1}Y
\]
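A sketch of both routes to the GLS estimator, on a synthetic heteroskedastic example: transform with $V^{-1/2}$ and run OLS, or use the closed form directly; the two must agree.

```python
import numpy as np

rng = np.random.default_rng(10)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
V = np.diag(rng.uniform(0.5, 2.0, size=n))   # heteroskedastic example
y = X @ np.array([1.0, 2.0, -1.0]) + rng.multivariate_normal(np.zeros(n), V)

# inverse square root of V via eigendecomposition (V is diagonal here,
# but the recipe works for any positive definite V)
w, P = np.linalg.eigh(V)
V_inv_half = P @ np.diag(w**-0.5) @ P.T

Xs, ys = V_inv_half @ X, V_inv_half @ y
beta_transformed = np.linalg.lstsq(Xs, ys, rcond=None)[0]

Vinv = np.linalg.inv(V)
beta_gls = np.linalg.inv(X.T @ Vinv @ X) @ X.T @ Vinv @ y

assert np.allclose(beta_transformed, beta_gls)
```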
§26 Lec 26: Nov 24, 2021

§26.1 Generalized Least Squares (Cont'd)
Estimate $\beta$ by direct minimization of the error sum of squares using the transformed model $Y^* = X^*\beta + \varepsilon^*$: minimize $\varepsilon^{*\prime}\varepsilon^*$, i.e. $(Y^* - X^*\beta)'(Y^* - X^*\beta)$. Replacing $Y^* = V^{-\frac{1}{2}}Y$ and $X^* = V^{-\frac{1}{2}}X$, we minimize
\[
Q = (Y - X\beta)'V^{-1}(Y - X\beta) = Y'V^{-1}Y - 2Y'V^{-1}X\beta + \beta'X'V^{-1}X\beta
\]
\[
\frac{\partial Q}{\partial\beta} = -2X'V^{-1}Y + 2X'V^{-1}X\beta = 0
\]
\[
\hat\beta_{GLS} = \left(X'V^{-1}X\right)^{-1}X'V^{-1}Y
\]
Assume now $\varepsilon \sim N_n\left(0, \sigma^2 V\right)$. Then $Y \sim N_n\left(X\beta, \sigma^2 V\right)$ and
\[
L = \frac{1}{(2\pi)^{\frac{n}{2}}}\left|\sigma^2 V\right|^{-\frac{1}{2}} e^{-\frac{1}{2}(y - X\beta)'(\sigma^2 V)^{-1}(y - X\beta)}
\]
\[
\ln L = -\frac{n}{2}\ln 2\pi\sigma^2 - \frac{1}{2}\ln|V| - \frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta)
\]
Setting $\frac{\partial \ln L}{\partial\beta} = 0$ again gives
\[
\hat\beta_{GLS} = \left(X'V^{-1}X\right)^{-1}X'V^{-1}Y
\]
Estimation of $\sigma^2$:
\[
\frac{\partial \ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'V^{-1}(y - X\beta) = 0
\]
\[
\hat\sigma^2 = \frac{\left(y - X\hat\beta_{GLS}\right)'V^{-1}\left(y - X\hat\beta_{GLS}\right)}{n}
= \frac{\left(y^* - X^*\hat\beta_{GLS}\right)'\left(y^* - X^*\hat\beta_{GLS}\right)}{n}
= \frac{e_{GLS}'e_{GLS}}{n}
\]
Use the properties of trace to find $E\hat\sigma^2$:
\[
E\hat\sigma^2 = \frac{1}{n}E\left(Y - X\hat\beta_{GLS}\right)'V^{-1}\left(Y - X\hat\beta_{GLS}\right)
= \frac{1}{n}\operatorname{tr}\left[V^{-1}\operatorname{var}\left(Y - X\hat\beta_{GLS}\right)\right] + 0
\]
because $E\left[Y - X\hat\beta_{GLS}\right] = 0$. So
\[
Y - X\hat\beta_{GLS} = Y - X\left(X'V^{-1}X\right)^{-1}X'V^{-1}Y = \left[I - X\left(X'V^{-1}X\right)^{-1}X'V^{-1}\right]Y
\]
So
\[
\operatorname{var}\left(Y - X\hat\beta_{GLS}\right)
= \sigma^2\left[I - X\left(X'V^{-1}X\right)^{-1}X'V^{-1}\right]V\left[I - V^{-1}X\left(X'V^{-1}X\right)^{-1}X'\right]
= \sigma^2 V - \sigma^2 X\left(X'V^{-1}X\right)^{-1}X'
\]
Back to the expectation:
\[
E\hat\sigma^2 = \frac{1}{n}\operatorname{tr}\left[V^{-1}\left(\sigma^2 V - \sigma^2 X\left(X'V^{-1}X\right)^{-1}X'\right)\right]
= \frac{1}{n}\left(\sigma^2\operatorname{tr} I_n - \sigma^2\operatorname{tr} I_{k+1}\right)
= \frac{n-k-1}{n}\sigma^2
\]
Thus, the unbiased estimator of $\sigma^2$ is $S_{e,GLS}^2 = \frac{e_{GLS}'e_{GLS}}{n-k-1}$.
§26.2 Comparing Regression Equations

Consider two regressions
\[
Y_1 = X_1\beta_1 + \varepsilon_1, \qquad Y_2 = X_2\beta_2 + \varepsilon_2
\]
Let $\beta_1 = \begin{pmatrix} \beta^{(1)} \\ \beta_1^{(2)} \end{pmatrix}$, $\beta_2 = \begin{pmatrix} \beta^{(1)} \\ \beta_2^{(2)} \end{pmatrix}$. Note that
\[
\beta^{(1)}: p \times 1, \qquad \beta_1^{(2)}: (k+1-p) \times 1, \qquad \beta_2^{(2)}: (k+1-p) \times 1
\]
Suppose we want to test $\beta_1^{(2)} = \beta_2^{(2)}$ (we assume that the first $p$ elements of $\beta_1$ and $\beta_2$ are the same). We can construct one model as follows:
\[
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
= \begin{pmatrix} X_1^{(1)} & X_1^{(2)} & 0 \\ X_2^{(1)} & 0 & X_2^{(2)} \end{pmatrix}
\begin{pmatrix} \beta^{(1)} \\ \beta_1^{(2)} \\ \beta_2^{(2)} \end{pmatrix}
+ \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}
\]
\[
H_0: C\beta = 0, \qquad H_a: C\beta \ne 0
\]
§27 Lec 27: Nov 29, 2021

§27.1 Comparing Regression Equations (Cont'd)

We can use the $F$ test for the general linear hypothesis
\[
\frac{\left(C\hat\beta - \gamma\right)'\left[C(X'X)^{-1}C'\right]^{-1}\left(C\hat\beta - \gamma\right)}{m S_e^2} \sim F_{m,\,n-k-1}
\]
Example 27.1
Suppose $k = 5$ and $p = 3$. We want to test
\[
H_0: \beta_3^{(1)} = \beta_3^{(2)}, \quad \beta_4^{(1)} = \beta_4^{(2)}, \quad \beta_5^{(1)} = \beta_5^{(2)}, \qquad
H_a: \text{not all equalities hold}
\]
\[
C = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1
\end{pmatrix}
\]
and
\[
\beta = \begin{pmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3^{(1)} & \beta_4^{(1)} & \beta_5^{(1)} & \beta_3^{(2)} & \beta_4^{(2)} & \beta_5^{(2)} \end{pmatrix}'
\]
In general,
\[
C = \begin{pmatrix} 0_{k+1-p,\,p} & I_{k+1-p} & -I_{k+1-p} \end{pmatrix}
\]
Therefore,
\[
C: (k+1-p) \times (2(k+1) - p)
\]
We can also test the hypothesis using the extra sum of squares principle. Under the null hypothesis the common coefficients $\beta_1^{(2)} = \beta_2^{(2)} = \beta^{(2)}$ are pooled, and the reduced model is expressed as follows:
\[
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
= \begin{pmatrix} X_1^{(1)} & X_1^{(2)} \\ X_2^{(1)} & X_2^{(2)} \end{pmatrix}
\begin{pmatrix} \beta^{(1)} \\ \beta^{(2)} \end{pmatrix}
+ \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}
\]
Example 27.2
Suppose $k = 5$, $p = 4$, and
\[
H_0: \beta_4^{(1)} = \beta_4^{(2)}, \quad \beta_5^{(1)} = \beta_5^{(2)}, \qquad H_a: \text{not all equalities hold}
\]
Formulation: each observation in group 1 gets the design row $\begin{pmatrix} 1 & x_{i1}^{(1)} & x_{i2}^{(1)} & x_{i3}^{(1)} & x_{i4}^{(1)} & x_{i5}^{(1)} & 0 & 0 \end{pmatrix}$ and each observation in group 2 gets the row $\begin{pmatrix} 1 & x_{i1}^{(2)} & x_{i2}^{(2)} & x_{i3}^{(2)} & 0 & 0 & x_{i4}^{(2)} & x_{i5}^{(2)} \end{pmatrix}$, with
\[
\beta = \begin{pmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 & \beta_4^{(1)} & \beta_5^{(1)} & \beta_4^{(2)} & \beta_5^{(2)} \end{pmatrix}'
\]
\[
C = \begin{pmatrix}
0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & -1
\end{pmatrix}
\]
\[
df_F = n - 8, \qquad df_R = n - 6
\]
§27.2 Deleting a Single Point in Multiple Regression

We want to explore the effect of deleting a single point in multiple regression:

• Effect on $\hat\beta$

• Effect on $S_e^2$

• Effect on fitted values

We can delete one point at a time and run a new regression each time to see the effect. But this will require $n + 1$ regressions (one on the full data set and $n$ regressions when we delete data point $i$, $i = 1, \ldots, n$). There is a more automated way, and the result is based on the residuals $e_i$ and leverage values $h_{ii}$ from the regression on the full data set
\[
Y = X\beta + \varepsilon
\]
Let $Y_{(i)}$, $X_{(i)}$ be the vector $Y$ and matrix $X$ after deleting point $i$ from the data set. We are working now with the model $Y_{(i)} = X_{(i)}\beta + \varepsilon_{(i)}$ and its estimator
\[
\hat\beta_{(i)} = \left(X_{(i)}'X_{(i)}\right)^{-1}X_{(i)}'Y_{(i)},
\]
where $\hat\beta_{(i)}$ is the estimator of the vector $\beta$ after deleting data point $i$.
§28.1 Deleting a Single Point in Multiple Regression (Cont'd)

Let's find expressions for $X_{(i)}'X_{(i)}$ and $X_{(i)}'Y_{(i)}$:
\[
X'X = \begin{pmatrix} X_{(i)}' & x_i \end{pmatrix}\begin{pmatrix} X_{(i)} \\ x_i' \end{pmatrix} = X_{(i)}'X_{(i)} + x_ix_i'
\implies X_{(i)}'X_{(i)} = X'X - x_ix_i'
\]
Recall the rank-one update result: for invertible $A$ and vector $b$ with $b'A^{-1}b \ne 1$,
\[
\left(A - bb'\right)^{-1} = A^{-1} + \frac{A^{-1}bb'A^{-1}}{1 - b'A^{-1}b},
\]
which can be checked by multiplying each side by $A - bb'$ and obtaining the identity matrix both times. In our problem $A$ is $X'X$ and $b$ is $x_i$. Using the result we can find $\left(X_{(i)}'X_{(i)}\right)^{-1}$. Now let's find $X_{(i)}'Y_{(i)}$. From
\[
X'Y = \begin{pmatrix} X_{(i)}' & x_i \end{pmatrix}\begin{pmatrix} Y_{(i)} \\ y_i \end{pmatrix} = X_{(i)}'Y_{(i)} + x_iy_i
\]
we get $X_{(i)}'Y_{(i)} = X'Y - x_iy_i$. Substituting both into $\hat\beta_{(i)} = \left(X_{(i)}'X_{(i)}\right)^{-1}X_{(i)}'Y_{(i)}$, we find
\[
\hat\beta_{(i)} = \hat\beta - \frac{(X'X)^{-1}x_i}{1 - h_{ii}}\,e_i
\]
Then
\[
\hat\beta - \hat\beta_{(i)} = \frac{(X'X)^{-1}x_ie_i}{1 - h_{ii}}
\]
This is the difference in the estimator of $\beta$ before and after deleting data point $i$.

Effect on $S_e^2$:
\[
(n-k-2)S_{e(i)}^2 = (n-k-1)S_e^2 - \frac{e_i^2}{1 - h_{ii}}
\]
where $S_{e(i)}^2$ is the unbiased estimator of $\sigma^2$ after deleting data point $i$, and
\[
e = (I - H)Y, \qquad e_i = \left[(I - H)Y\right]_i, \qquad h_{ii} = x_i'(X'X)^{-1}x_i
\]
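Both deletion formulas can be verified against an explicit refit without point $i$. The data set and the deleted index below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(11)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
H = X @ XtX_inv @ X.T
se2 = e @ e / (n - k - 1)

i = 7                                   # arbitrary point to delete
xi, hii = X[i], H[i, i]

# beta_hat_(i) = beta_hat - (X'X)^{-1} x_i e_i / (1 - h_ii)
beta_i_formula = beta_hat - XtX_inv @ xi * e[i] / (1 - hii)
# (n-k-2) Se_(i)^2 = (n-k-1) Se^2 - e_i^2/(1 - h_ii)
se2_i_formula = ((n - k - 1) * se2 - e[i]**2 / (1 - hii)) / (n - k - 2)

# Explicit refit with point i removed
Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)
beta_i_refit = np.linalg.lstsq(Xd, yd, rcond=None)[0]
ed = yd - Xd @ beta_i_refit
se2_i_refit = ed @ ed / (n - k - 2)     # n-1 points, k+1 parameters

assert np.allclose(beta_i_formula, beta_i_refit)
assert np.isclose(se2_i_formula, se2_i_refit)
```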
Result: Let $A$ be a square invertible matrix and $b$ a vector such that $1 + b'A^{-1}b \ne 0$. Then
\[
\left(A + bb'\right)^{-1} = A^{-1} - \frac{A^{-1}bb'A^{-1}}{1 + b'A^{-1}b}
\]
Finally, if a new point $(y_0, x_0)$ is added to the data set,
\[
\hat\beta_{new} = \hat\beta + \frac{(X'X)^{-1}x_0e_0}{1 + h_{00}}
\]
where $e_0 = y_0 - x_0'\hat\beta$ and $h_{00} = x_0'(X'X)^{-1}x_0$.
§28.2 Influential Analysis

Internally studentized residuals: since
\[
\left.\begin{aligned}
e_i &\sim N\left(0, \sigma\sqrt{1 - h_{ii}}\right) \\
\frac{(n-k-1)S_e^2}{\sigma^2} &\sim \chi^2_{n-k-1}
\end{aligned}\right\}
\implies
\frac{\dfrac{e_i}{\sigma\sqrt{1 - h_{ii}}}}{\sqrt{\dfrac{(n-k-1)S_e^2}{\sigma^2}\Big/(n-k-1)}}
= \frac{e_i}{S_e\sqrt{1 - h_{ii}}}
\]
This ratio does not follow a $t$ distribution because $e_i$ and $S_e$ are not independent. Let $v_i = \frac{e_i}{S_e\sqrt{1 - h_{ii}}}$ (the internally studentized residual, also written $r_i$). Show that
\[
\frac{v_i^2}{n-k-1} \sim \text{beta}\left(\frac{1}{2}, \frac{1}{2}(n-k-2)\right)
\]
So we need to show that
\[
\frac{e_i^2}{\text{SSE}\,(1 - h_{ii})} \sim \text{beta}\left(\frac{1}{2}, \frac{1}{2}(n-k-2)\right)
\]
1. $e = (I - H)\varepsilon$, thus $e_i = c_i'(I - H)\varepsilon$, where $c_i' = \begin{pmatrix} 0 & 0 & \ldots & 1 & 0 & \ldots & 0 \end{pmatrix}$ in which the 1 is at the $i$th position. So
\[
e_i^2 = e_ie_i = \varepsilon'(I - H)c_ic_i'(I - H)\varepsilon
\]
2. $\text{SSE} = \varepsilon'(I - H)\varepsilon$
§29.1 Influential Analysis (Cont'd)

Let's express $\frac{e_i^2}{\text{SSE}(1 - h_{ii})}$ as follows:
\[
\frac{\varepsilon'(I - H)c_ic_i'(I - H)\varepsilon}{\varepsilon'(I - H)\varepsilon\,(1 - h_{ii})}
\]
Divide numerator and denominator by $\sigma^2$:
\[
\frac{\frac{\varepsilon'}{\sigma}\,\frac{(I - H)c_ic_i'(I - H)}{1 - h_{ii}}\,\frac{\varepsilon}{\sigma}}{\frac{\varepsilon'}{\sigma}(I - H)\frac{\varepsilon}{\sigma}}
= \frac{Z'QZ}{Z'(I - H)Z}
\]
Here $Z = \frac{\varepsilon}{\sigma} \sim N(0, I)$ and $Q = \frac{(I - H)c_ic_i'(I - H)}{1 - h_{ii}}$.

3. $Q$ is idempotent:
\[
QQ = \frac{(I - H)c_ic_i'(I - H)(I - H)c_ic_i'(I - H)}{(1 - h_{ii})^2}
= \frac{(I - H)c_i\left[c_i'(I - H)c_i\right]c_i'(I - H)}{(1 - h_{ii})^2}
= \frac{(I - H)c_ic_i'(I - H)}{1 - h_{ii}} = Q
\]
using $c_i'(I - H)c_i = 1 - h_{ii}$. Also,
\[
\operatorname{tr}(Q) = \operatorname{tr}\frac{(I - H)c_ic_i'(I - H)}{1 - h_{ii}}
= \frac{c_i'(I - H)(I - H)c_i}{1 - h_{ii}}
= \frac{1 - h_{ii}}{1 - h_{ii}} = 1
\]
Thus, $Z'QZ \sim \chi^2_1$.

4. Back to the ratio:
\[
\frac{Z'QZ}{Z'(I - H)Z} = \frac{Z'QZ}{Z'(I - H - Q)Z + Z'QZ}
\]
Since
\[
HQ = \frac{H(I - H)c_ic_i'(I - H)}{1 - h_{ii}} = 0,
\]
we have $(I - H)Q = Q - HQ = Q$, hence $(I - H - Q)Q = Q - HQ - QQ = 0$, and
\[
(I - H - Q)(I - H - Q) = (I - H) - (I - H)Q - Q(I - H) + QQ = I - H - Q - Q + Q = I - H - Q
\]
so $I - H - Q$ is symmetric and idempotent. Therefore,
\[
Z'(I - H - Q)Z \sim \chi^2_{\operatorname{tr}(I - H - Q)} = \chi^2_{n-k-2}
\]
Result: Let $X \sim \Gamma(\alpha_1, \beta)$, $Y \sim \Gamma(\alpha_2, \beta)$, with $X, Y$ independent. Let $U = X + Y$ and $V = \frac{X}{X+Y}$. Then $U, V$ are independent and
\[
U \sim \Gamma(\alpha_1 + \alpha_2, \beta), \qquad V \sim \text{beta}(\alpha_1, \alpha_2)
\]
Since $Z'QZ \sim \chi^2_1 = \Gamma\left(\frac{1}{2}, 2\right)$ and $Z'(I - H - Q)Z \sim \chi^2_{n-k-2} = \Gamma\left(\frac{1}{2}(n-k-2), 2\right)$, we therefore have
\[
\frac{Z'QZ}{Z'(I - H - Q)Z + Z'QZ} \sim \text{beta}\left(\frac{1}{2}, \frac{1}{2}(n-k-2)\right)
\]
We can conclude that $\frac{r_i^2}{n-k-1} \sim \text{beta}\left(\frac{1}{2}, \frac{1}{2}(n-k-2)\right)$.
§29.2 Externally Studentized Residual

Consider the ratio
\[
t_i = \frac{e_i}{S_{e(i)}\sqrt{1 - h_{ii}}}
\]
where $S_{e(i)}^2$ is the unbiased estimator of $\sigma^2$ after data point $i$ is deleted from the data set. Notice that
\[
t_i^2 = \frac{e_i^2}{S_{e(i)}^2(1 - h_{ii})} = \frac{e_i^2(n-k-2)}{(n-k-2)S_{e(i)}^2(1 - h_{ii})}
\]
But $(n-k-2)S_{e(i)}^2 = (n-k-1)S_e^2 - \frac{e_i^2}{1 - h_{ii}}$. Then,
\[
t_i^2 = \frac{e_i^2(n-k-2)}{\left[(n-k-1)S_e^2 - \frac{e_i^2}{1 - h_{ii}}\right](1 - h_{ii})}
= \frac{e_i^2(n-k-2)}{(n-k-1)S_e^2(1 - h_{ii}) - e_i^2}
\]
Note: $r_i^2 = \frac{e_i^2}{S_e^2(1 - h_{ii})} \implies e_i^2 = r_i^2S_e^2(1 - h_{ii})$. Then,
\[
t_i^2 = \frac{r_i^2(n-k-2)}{n-k-1-r_i^2}
\]
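The algebraic relation $t_i^2 = r_i^2(n-k-2)/(n-k-1-r_i^2)$ between the externally and internally studentized residuals is an exact identity, easy to check for every point of a simulated fit:

```python
import numpy as np

rng = np.random.default_rng(12)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, -1.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
e = y - H @ y
se2 = e @ e / (n - k - 1)
h = np.diag(H)

r = e / np.sqrt(se2 * (1 - h))                                # internally studentized
se2_i = ((n - k - 1) * se2 - e**2 / (1 - h)) / (n - k - 2)    # deleted-point Se_(i)^2
t = e / np.sqrt(se2_i * (1 - h))                              # externally studentized

assert np.allclose(t**2, r**2 * (n - k - 2) / (n - k - 1 - r**2))
```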
§29.3 A Note on Variable Selection

Effect on the regression when predictors are removed from the model:

a) Effect on $\hat\beta$. Suppose
\[
Y = X\beta + \varepsilon, \quad \text{i.e.} \quad Y = X_1\beta_1 + X_2\beta_2 + \varepsilon,
\]
is the correct model, but we decided to use $Y = X_1\beta_1 + \varepsilon$. Then $\hat\beta_1 = (X_1'X_1)^{-1}X_1'Y$ and therefore
\[
E\hat\beta_1 = (X_1'X_1)^{-1}X_1'\left(X_1\beta_1 + X_2\beta_2\right) = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2,
\]
so $\hat\beta_1$ is biased unless $X_1'X_2 = 0$ or $\beta_2 = 0$.
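The omitted-variable bias $(X_1'X_1)^{-1}X_1'X_2\beta_2$ can be illustrated by averaging the short regression's estimates over many noise draws with the design held fixed (all sizes and coefficients below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(13)
n = 40
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.8 * X1[:, 1] + 0.6 * rng.normal(size=n)).reshape(-1, 1)  # correlated with X1
b1, b2 = np.array([1.0, 2.0]), np.array([1.5])

A = np.linalg.inv(X1.T @ X1) @ X1.T
bias = A @ X2 @ b2                       # (X1'X1)^{-1} X1'X2 beta_2

reps = 20_000
eps = rng.normal(size=(reps, n))
Y = (X1 @ b1 + X2 @ b2) + eps            # data from the correct model
beta1_hats = Y @ A.T                     # short-regression estimates per draw

print(beta1_hats.mean(axis=0), b1 + bias)  # the two should nearly agree
```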