University of Amsterdam
Week 2. Lecture 2
February, 2024
Overview
The plan for this lecture
Recap: Linear model
Recap: OLS estimator
Using matrix notation we showed that the formula for the OLS estimator is
given by:
\[
\hat{\beta} = (X'X)^{-1} X'y = \left( \sum_{i=1}^{n} x_i x_i' \right)^{-1} \sum_{i=1}^{n} x_i y_i . \tag{2}
\]
We will see that both of these expressions are very convenient for studying the
statistical properties of the OLS estimator.
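As a quick numerical check (not part of the original slides), the following NumPy sketch verifies on simulated data that the matrix form and the summation form in (2) coincide; the design and all variable names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3
X = rng.normal(size=(n, K))          # simulated regressors (hypothetical design)
beta0 = np.array([1.0, -2.0, 0.5])   # true coefficients used in the simulation
y = X @ beta0 + rng.normal(size=n)   # y = X beta0 + eps

# Matrix form: (X'X)^{-1} X'y
beta_hat_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Summation form: (sum_i x_i x_i')^{-1} sum_i x_i y_i
S_xx = sum(np.outer(X[i], X[i]) for i in range(n))
S_xy = sum(X[i] * y[i] for i in range(n))
beta_hat_sum = np.linalg.solve(S_xx, S_xy)

print(np.allclose(beta_hat_matrix, beta_hat_sum))  # True: both expressions agree
```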
1. The Frisch-Waugh-Lovell theorem
1.1. OLS vs. one-by-one regressions
OLS vs. Naive estimators
\[
\ddot{\beta}^{(k)} = \left( (x^{(k)})' x^{(k)} \right)^{-1} (x^{(k)})' y
= \left( \sum_{i=1}^{n} x_{k,i}^2 \right)^{-1} \sum_{i=1}^{n} x_{k,i}\, y_i . \tag{3}
\]
\[
(X'X)^{-1} = \left( \sum_{i=1}^{n} x_i x_i' \right)^{-1} . \tag{4}
\]
Example. Single regressor setup.
In the case of the simple linear regression model, the naive estimators would be
given by:
\[
\ddot{\alpha} = \bar{y} , \tag{5}
\]
\[
\ddot{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} . \tag{6}
\]
Example. When equivalent?
As it turns out, this is just a special case of a more general result on partial
regression.
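To illustrate when the one-by-one ("naive") estimates coincide with joint OLS, here is a small simulation sketch (not from the slides; the data-generating choices are hypothetical): with correlated regressors the two differ, while with orthogonal regressors they agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Correlated regressors: naive one-by-one estimates differ from joint OLS.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_naive = np.array([x1 @ y / (x1 @ x1), x2 @ y / (x2 @ x2)])
print(beta_ols, beta_naive)                   # clearly different

# Orthogonal regressors (X1'X2 = 0): the two approaches coincide.
Q, _ = np.linalg.qr(rng.normal(size=(n, 2)))  # orthonormal columns
y2 = Q @ np.array([1.0, 2.0]) + rng.normal(size=n)
beta_ols2 = np.linalg.solve(Q.T @ Q, Q.T @ y2)
beta_naive2 = np.array([Q[:, 0] @ y2 / (Q[:, 0] @ Q[:, 0]),
                        Q[:, 1] @ y2 / (Q[:, 1] @ Q[:, 1])])
print(np.allclose(beta_ols2, beta_naive2))    # True
```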
1.2. The Theorem
Preliminaries
The Theorem
where $M_{X_2} = I_n - X_2 (X_2' X_2)^{-1} X_2'$ and $M_{X_1} = I_n - X_1 (X_1' X_1)^{-1} X_1'$.
Intuition
Take $\hat{\beta}_1$. From the above formula it is clear that the estimate $\hat{\beta}_1$ can be
equivalently obtained in two steps:
Step 1 Regress X1 (each column) on X2. The corresponding
residuals are given by $E_{1\perp 2} = M_{X_2} X_1$.
Step 2 Regress y on these residuals $E_{1\perp 2}$:
\[
\tilde{\beta}_1 = (E_{1\perp 2}'\, E_{1\perp 2})^{-1} (E_{1\perp 2}'\, y) . \tag{12}
\]
Hence, the OLS estimator $\hat{\beta}_1$ measures the scaled covariance between X1 and y
after we project away (or residualize) the part of X1 that is also explained
by X2.
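A minimal NumPy sketch of this two-step route on simulated data (the design below is hypothetical, not from the slides): it checks that the residualize-then-regress coefficients equal the X1 block of the full OLS fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X1 = rng.normal(size=(n, 2))               # regressors of interest
X2 = np.column_stack([np.ones(n),          # controls, including a constant
                      rng.normal(size=n)])
y = X1 @ np.array([1.0, -1.0]) + X2 @ np.array([0.5, 2.0]) + rng.normal(size=n)

# Full OLS on [X1, X2]; the first two coefficients belong to X1.
X = np.column_stack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)[:2]

# FWL two-step: residualize X1 on X2, then regress y on those residuals.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)  # annihilator of X2
E = M2 @ X1                                              # E_{1 perp 2}
beta_fwl = np.linalg.solve(E.T @ E, E.T @ y)

print(np.allclose(beta_full, beta_fwl))    # True: the two routes agree
```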
Intuition. Special case
In the special case where $X_1' X_2 = O$, i.e. the informational content of the
two matrices is non-overlapping (the corresponding column spaces are
orthogonal to each other), it follows that:
\[
\hat{\beta}_1 = (X_1' X_1)^{-1} X_1' y ,
\]
i.e. the coefficients on X1 can be obtained by regressing y on X1 alone.
Fundamental Result
Proof
\[
X' (y - X\hat{\beta}) = 0_K . \tag{15}
\]
Plug that expression for $\hat{\beta}_1$ into the second set of equations:
Using the $M_{X_1} = I_n - P_{X_1}$ notation:
Empirical results. Model with interactions.
Empirical results. Regression of interaction term on everything else.
Empirical results. Regression of price on residual
2. OLS under Classical Assumptions
2.1. Moments of vectors
Vector-valued random variables
\[
\mu = E[z] = \big( E[z_1], \ldots, E[z_M] \big)' .
\]
The variance-covariance matrix
This matrix is symmetric, as $\sigma_{ij} = \sigma_{ji}$ for any $i$ and $j$, and it is positive
semi-definite, i.e. $a' \Sigma\, a \ge 0$ for any $a \in \mathbb{R}^M$.
Linear transformations. Mean.
\[
\tilde{z} = Az + b , \tag{24}
\]
for any vector b and any matrix A. Then, using the corresponding
definitions:
\[
E[\tilde{z}] = A\, E[z] + b = A\mu + b .
\]
Linear transformations. Variance.
\[
\operatorname{var}(\tilde{z}) = \operatorname{var}(Az + b) = A\, \Sigma\, A' .
\]
This is just a generalization of the fact that if you scale a random variable by
a constant $c$, the variance of that random variable is multiplied by $c^2$.
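A short simulation sketch of this rule (not from the slides; the matrices A, b and Σ below are made up for illustration): the sample covariance of A z + b should match A Σ A', and the shift b should drop out.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 3
A = rng.normal(size=(2, M))
b = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])         # variance-covariance matrix of z

# Simulate many draws of z and of the transformed vector A z + b.
z = rng.multivariate_normal(mean=np.zeros(M), cov=Sigma, size=200_000)
z_tilde = z @ A.T + b

print(np.cov(z_tilde, rowvar=False))        # empirical variance of A z + b
print(A @ Sigma @ A.T)                      # theoretical A Sigma A' (b drops out)
```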
Comparing two variance-covariance matrices
Consider two random variables $a$ with $\operatorname{var}(a) = \sigma_a^2$ and $b$ with $\operatorname{var}(b) = \sigma_b^2$.
We say that the random variable $a$ has no larger variance than the random
variable $b$ when:
\[
\sigma_b^2 \ge \sigma_a^2 . \tag{26}
\]
Can we generalize this to random vectors (of the same dimension) $a$ and $b$
with variance-covariance matrices $\Sigma_a$ and $\Sigma_b$?
We say that the random vector a has no larger variance than the random
vector b when:
\[
\Sigma_b \ge \Sigma_a . \tag{27}
\]
Here $\ge$ means that the difference between the two matrices is a positive
semi-definite matrix, i.e. $q'(\Sigma_b - \Sigma_a)\, q \ge 0$ for any $q$.
Why this specific definition? It implies that the variance of the
random variable $q'a$ (a scalar!) is no larger than that of $q'b$ (again a scalar)
for any $q$. Hence, for any possible linear combination $q$ we always get that one
variable has no larger variance than the other.
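A minimal NumPy sketch of this ordering (illustrative matrices only, not from the slides): the difference of the two covariance matrices is checked for positive semi-definiteness via its eigenvalues, and the scalar consequence is verified for one particular q.

```python
import numpy as np

# Two covariance matrices; Sigma_b "dominates" Sigma_a if Sigma_b - Sigma_a is PSD.
Sigma_a = np.array([[1.0, 0.2],
                    [0.2, 1.0]])
Sigma_b = np.array([[2.0, 0.2],
                    [0.2, 1.5]])

diff = Sigma_b - Sigma_a
eigvals = np.linalg.eigvalsh(diff)          # eigenvalues of the symmetric difference
print(eigvals, np.all(eigvals >= -1e-12))   # all non-negative => Sigma_b >= Sigma_a

# Consequence: for any q, var(q'b) = q'Sigma_b q >= q'Sigma_a q = var(q'a).
q = np.array([0.7, -1.3])
print(q @ Sigma_b @ q >= q @ Sigma_a @ q)   # True
```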
2.2. Mean and variance of OLS
Classical Assumptions in vector notation
We extend the classical assumptions to general K using matrix notation.
Assumption 1 (Fixed regressors/Exogeneity). X can be treated as fixed
numbers (i.e. one can condition on them). They satisfy
$X'X > 0$ (or with probability 1).
Assumption 2 (Random disturbances). The n disturbances/error terms ε
are random variables with $E[\varepsilon|X] = 0_n$.
Assumption 3 (Homoscedasticity). The variances of the disturbances exist and are all
equal: $E[\varepsilon\varepsilon'|X] = \sigma_0^2 I_n$ for some $\sigma_0^2 > 0$.
Assumption 4 (Independence). $\varepsilon_1, \ldots, \varepsilon_n$ are independent after
conditioning on X.
Assumption 5 (Fixed parameters). The true parameters $\beta_0$ and $\sigma_0^2$ are fixed
unknown parameters.
Assumption 6 (Linear model). The data on y have been generated by:
\[
y = X\beta_0 + \varepsilon . \tag{29}
\]
Mean of OLS
Under these assumptions the OLS estimator is unbiased:
\[
E[\hat{\beta}] = \beta_0 . \tag{30}
\]
Substituting the model into the OLS formula gives:
\[
\hat{\beta} = (X'X)^{-1} X'y = (X'X)^{-1} X'(X\beta_0 + \varepsilon) = \beta_0 + (X'X)^{-1} X'\varepsilon .
\]
Hence:
\begin{align*}
E[\hat{\beta}|X] &= \beta_0 + E[(X'X)^{-1} X'\varepsilon \,|\, X] \\
&= \beta_0 + (X'X)^{-1} X'\, E[\varepsilon|X] \\
&= \beta_0 .
\end{align*}
Variance of OLS
\[
\operatorname{Var}(\hat{\beta}|X) = E\big[ (\hat{\beta} - E[\hat{\beta}|X]) (\hat{\beta} - E[\hat{\beta}|X])' \,\big|\, X \big] . \tag{31}
\]
\begin{align*}
\operatorname{Var}(\hat{\beta}|X) &= E[(X'X)^{-1} X' \varepsilon \varepsilon' X (X'X)^{-1} \,|\, X] \\
&= (X'X)^{-1} X'\, E[\varepsilon\varepsilon'|X]\, X (X'X)^{-1} \\
&= (X'X)^{-1} X'\, \sigma_0^2 I_n\, X (X'X)^{-1} \\
&= \sigma_0^2 (X'X)^{-1} .
\end{align*}
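A Monte Carlo sketch of these two results (simulated design, not from the slides): across many redrawn error vectors, the average of the OLS estimates should be close to β0, and their sample covariance close to σ0² (X'X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, sigma0 = 50, 2, 1.5
X = rng.normal(size=(n, K))                  # fixed regressors (conditioned on)
beta0 = np.array([1.0, -0.5])

R = 20_000
betas = np.empty((R, K))
for r in range(R):
    eps = sigma0 * rng.normal(size=n)        # homoscedastic, mean-zero errors
    y = X @ beta0 + eps
    betas[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(betas.mean(axis=0))                    # close to beta0 (unbiasedness)
print(np.cov(betas, rowvar=False))           # close to sigma0^2 (X'X)^{-1}
print(sigma0**2 * np.linalg.inv(X.T @ X))
```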
2.3. OLS is BLUE
Is OLS an efficient estimator? The Gauss-Markov theorem
Linear unbiased estimator
\[
\tilde{\beta} = A y , \tag{32}
\]
\[
\tilde{\beta} = AX\beta_0 + A\varepsilon . \tag{33}
\]
Hence:
\[
E[\tilde{\beta}|X] = E[AX|X]\, \beta_0 + A\, E[\varepsilon|X] = AX\beta_0 . \tag{34}
\]
As a result, $\tilde{\beta}$ is conditionally unbiased as long as $AX = I_K$.
Best estimator
Proof
Note that:
\[
\tilde{\beta} = \hat{\beta} + (\tilde{\beta} - \hat{\beta}) , \tag{36}
\]
and also
\[
\tilde{\beta} - \beta_0 = \hat{\beta} - \beta_0 + \big( \tilde{\beta} - \beta_0 - (\hat{\beta} - \beta_0) \big) . \tag{37}
\]
Denote $A_{OLS} = (X'X)^{-1} X'$; then the previous equality can be expressed as:
\[
\tilde{\beta} - \beta_0 = A_{OLS}\, \varepsilon + (A - A_{OLS})\, \varepsilon . \tag{38}
\]
From here:
\begin{align*}
\operatorname{var}(\tilde{\beta}|X) &= E[(\tilde{\beta} - \beta_0)(\tilde{\beta} - \beta_0)' \,|\, X] \\
&= E[A_{OLS}\, \varepsilon\varepsilon' A_{OLS}' \,|\, X] + E[(A - A_{OLS})\, \varepsilon\varepsilon' (A - A_{OLS})' \,|\, X] \\
&\quad + E[A_{OLS}\, \varepsilon\varepsilon' (A - A_{OLS})' \,|\, X] + E[(A - A_{OLS})\, \varepsilon\varepsilon' A_{OLS}' \,|\, X] .
\end{align*}
Note that:
\[
E[A_{OLS}\, \varepsilon\varepsilon' A_{OLS}' \,|\, X] = \sigma_0^2\, A_{OLS} A_{OLS}' = \sigma_0^2 (X'X)^{-1} = \operatorname{var}(\hat{\beta}|X) .
\]
However:
\[
E[A_{OLS}\, \varepsilon\varepsilon' (A - A_{OLS})' \,|\, X] = \sigma_0^2 (X'X)^{-1} \big( (AX)' - I_K \big) = O .
\]
Here the second equality follows from the fact that $AX = I_K$ (by
unbiasedness). As a result, both cross terms vanish.
Combining all results:
\begin{align*}
\operatorname{var}(\tilde{\beta}|X) &= E[(\tilde{\beta} - \beta_0)(\tilde{\beta} - \beta_0)' \,|\, X] \\
&= E[A_{OLS}\, \varepsilon\varepsilon' A_{OLS}' \,|\, X] + E[(A - A_{OLS})\, \varepsilon\varepsilon' (A - A_{OLS})' \,|\, X] \\
&= \operatorname{var}(\hat{\beta}|X) + E[(A - A_{OLS})\, \varepsilon\varepsilon' (A - A_{OLS})' \,|\, X] .
\end{align*}
Since the last term is positive semi-definite, it follows that:
\[
\operatorname{var}(\tilde{\beta}|X) \ge \operatorname{var}(\hat{\beta}|X) . \tag{42}
\]
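A numerical illustration of the Gauss-Markov comparison (simulated design, not from the slides): a weighted estimator with arbitrary positive weights is another linear unbiased estimator (A X = I_K), and the difference between its variance and the OLS variance is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(5)
n, K, sigma0 = 60, 2, 1.0
X = rng.normal(size=(n, K))

# OLS: A_OLS = (X'X)^{-1} X'.  Alternative linear unbiased estimator:
# A = (X'WX)^{-1} X'W with arbitrary positive weights, which satisfies A X = I_K.
W = np.diag(rng.uniform(0.5, 2.0, size=n))
A_ols = np.linalg.solve(X.T @ X, X.T)
A_alt = np.linalg.solve(X.T @ W @ X, X.T @ W)
print(np.allclose(A_alt @ X, np.eye(K)))     # True: the alternative is unbiased

# Under homoscedastic errors, var(A y | X) = sigma0^2 * A A' for any linear A.
V_ols = sigma0**2 * A_ols @ A_ols.T
V_alt = sigma0**2 * A_alt @ A_alt.T
print(np.linalg.eigvalsh(V_alt - V_ols))     # non-negative: OLS has no larger variance
```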
3. Estimation of σ²
Plug-in approach
Note that if in practice we want to compute the variance of $\hat{\beta}$:
\[
\operatorname{var}(\hat{\beta}|X) = \sigma_0^2 (X'X)^{-1} , \tag{43}
\]
we need an estimate of the unknown $\sigma_0^2$.
Given that $E[\varepsilon_i^2] = \sigma_0^2$, the most natural approach to estimate the
unknown quantity is based on the method-of-moments (or plug-in) principle:
\[
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{e}_i)^2 = \frac{1}{n}\, \hat{e}'\hat{e} , \tag{44}
\]
where $\hat{e} = y - X\hat{\beta} = M_X y$. However, as we will see, this estimator is biased:
\[
E[\hat{\sigma}^2] \neq \sigma_0^2 . \tag{45}
\]
Preliminaries: Matrix Trace
Preliminaries: Matrix Trace. Circular permutations.
\[
\operatorname{tr}(CD') = \operatorname{tr}(D'C) . \tag{49}
\]
We will not attempt to prove this fact in this course; please refer to any
textbook on advanced linear algebra.
Application: Trace of Projection matrix
\[
M_B = I_q - B (B'B)^{-1} B' . \tag{50}
\]
Using additivity and invariance under circular permutations:
\[
\operatorname{tr}(M_B) = \operatorname{tr}(I_q) - \operatorname{tr}\big( B (B'B)^{-1} B' \big) = q - \operatorname{tr}\big( (B'B)^{-1} B'B \big) ,
\]
which equals $q$ minus the number of columns of $B$.
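A quick NumPy check of this trace result (illustrative dimensions only, not from the slides): for a full-column-rank matrix B with q rows and k columns, the trace of the annihilator matrix equals q − k.

```python
import numpy as np

rng = np.random.default_rng(6)
q, k = 30, 4
B = rng.normal(size=(q, k))                          # generically full column rank

M_B = np.eye(q) - B @ np.linalg.solve(B.T @ B, B.T)  # annihilator matrix of B
print(np.trace(M_B))                                 # approximately q - k = 26
```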
3.2. Unbiased estimation
Biased estimator?
\[
\hat{\sigma}^2 = \frac{1}{n}\, \hat{e}'\hat{e} = \frac{1}{n}\, y' M_X y . \tag{51}
\]
Some trace tricks
TRICK 1: Note that any scalar a is the trace of itself, i.e. $a = \operatorname{tr}(a)$. Hence:
\begin{align*}
\hat{\sigma}^2 = \operatorname{tr}(\hat{\sigma}^2) &= \frac{1}{n} \operatorname{tr}(\varepsilon' M_X \varepsilon) \\
&= \frac{1}{n} \operatorname{tr}(M_X \varepsilon\varepsilon') .
\end{align*}
In the final line we use invariance under circular permutations.
Some trace tricks
TRICK 2: From the definition of $\operatorname{tr}(\cdot)$ it is clear that $E[\operatorname{tr}(A)] = \operatorname{tr}(E[A])$
for any appropriate matrix A. Using this trick:
\[
E[\hat{\sigma}^2 | X] = \frac{1}{n} \operatorname{tr}\big( M_X\, E[\varepsilon\varepsilon'|X] \big) = \frac{\sigma_0^2}{n} \operatorname{tr}(M_X) = \sigma_0^2\, \frac{n-K}{n} .
\]
Towards unbiased estimation
The degrees-of-freedom corrected estimator $s^2 = \hat{e}'\hat{e}/(n-K)$ is unbiased, and for
large n:
\[
s^2 \approx \hat{\sigma}^2 . \tag{55}
\]
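A Monte Carlo sketch of the bias result (simulated design, not from the slides): the plug-in estimator e'e/n averages close to σ0²(n−K)/n, while the degrees-of-freedom corrected estimator e'e/(n−K) averages close to σ0².

```python
import numpy as np

rng = np.random.default_rng(7)
n, K, sigma0 = 30, 5, 2.0
X = rng.normal(size=(n, K))
beta0 = rng.normal(size=K)

R = 20_000
s2_plugin, s2_corrected = np.empty(R), np.empty(R)
for r in range(R):
    y = X @ beta0 + sigma0 * rng.normal(size=n)
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)   # OLS residuals
    s2_plugin[r] = e @ e / n                         # plug-in estimator (biased)
    s2_corrected[r] = e @ e / (n - K)                # degrees-of-freedom correction

print(sigma0**2)                  # 4.0
print(s2_plugin.mean())           # approx sigma0^2 * (n - K)/n, biased downward
print(s2_corrected.mean())        # approx sigma0^2
```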
3.3. Standard errors
Estimating variance of βb
\[
\widehat{\operatorname{var}}(\hat{\beta}|X) = s^2 (X'X)^{-1} . \tag{56}
\]
Standard error
\[
\widehat{\operatorname{var}}(\hat{\beta}_k|X) = s^2 \big[ (X'X)^{-1} \big]_{[k,k]} . \tag{57}
\]
Note that here we first take the inverse of the $[K \times K]$ matrix $(X'X)$, and
only then its $[k,k]$ element, not the other way around (a typical mistake!).
The corresponding standard error for the k'th regressor is given by the square
root of this variance, i.e.:
\[
se(\hat{\beta}_k) = \sqrt{ s^2 \big[ (X'X)^{-1} \big]_{[k,k]} } .
\]
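A minimal NumPy sketch that puts the pieces together on simulated data (hypothetical design, not from the slides): compute the OLS coefficients, the unbiased estimate s², and the standard errors from the diagonal of s² (X'X)⁻¹, taking the inverse before selecting the [k, k] elements.

```python
import numpy as np

rng = np.random.default_rng(8)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2 = e @ e / (n - K)                         # unbiased estimate of sigma0^2

# First invert the K x K matrix X'X, then take its [k, k] elements.
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(s2 * np.diag(XtX_inv))          # standard errors, one per regressor
print(beta_hat)
print(se)
```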
Empirical results. Model 2.
4. Summary
Summary today
In this lecture
- We proved the celebrated Frisch-Waugh-Lovell theorem.
- We studied the statistical properties of the OLS estimator.
- We proved that this estimator is the Best Linear Unbiased Estimator (BLUE).
Next Week