
Econometrics 1 (6012B0374Y)

dr. Artūras Juodis

University of Amsterdam

Week 2. Lecture 2

February, 2024

1 / 59
Overview

1 The Frisch-Waugh-Lovell theorem
    OLS vs. one-by-one regressions
    The Theorem
2 OLS under Classical Assumptions
    Moments of vectors
    Mean and variance of OLS
    OLS is BLUE
3 Estimation of σ²
    Some matrix algebra
    Unbiased estimation
    Standard errors
4 Summary

2 / 59
The plan for this lecture

I We prove the Frisch-Waugh-Lovell theorem.


I We study the statistical properties of the OLS estimator using the
matrix notation.
I We prove the Gauss-Markov theorem.

3 / 59
Recap: Linear model

On Monday, we introduced the multiple regression model in vector notation:

y_i = x_i′β + ε_i,   i = 1, . . . , n,   (1)

where β is the [K × 1] vector of unknown regression coefficients. We illustrated how models with K > 2 arise naturally if one combines restricted models with K = 2.

In particular, we showed how two different measures of the impact of distance can be combined into a richer model that better explains the variation in hotel prices in Vienna.

4 / 59
Recap: OLS estimator

Using matrix notation, we showed that the OLS estimator is given by:

β̂ = (X′X)⁻¹ X′y = (∑_{i=1}^n x_i x_i′)⁻¹ (∑_{i=1}^n x_i y_i).   (2)

We will see that both of these expressions are very convenient for studying the statistical properties of the OLS estimator.
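As a quick illustration (a minimal numpy sketch on simulated data; not part of the original slides), both expressions in (2) give the same estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # intercept + 2 regressors
beta0 = np.array([1.0, 2.0, -0.5])
y = X @ beta0 + rng.normal(size=n)

# Matrix form: (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Summation form: (sum_i x_i x_i')^{-1} (sum_i x_i y_i)
S_xx = sum(np.outer(X[i], X[i]) for i in range(n))
S_xy = sum(X[i] * y[i] for i in range(n))
beta_hat_sum = np.linalg.solve(S_xx, S_xy)

print(np.allclose(beta_hat, beta_hat_sum))  # True
```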

5 / 59
1. The Frisch-Waugh-Lovell theorem

6 / 59
1.1. OLS vs. one-by-one regressions

7 / 59
OLS vs. Naive estimators

Note that the OLS estimator β̂ is not equivalent to the set of K separate one-by-one regressions of the form:

β̈^(k) = ((x^(k))′ x^(k))⁻¹ (x^(k))′ y = (∑_{i=1}^n x_{k,i}²)⁻¹ (∑_{i=1}^n x_{k,i} y_i),   (3)

for all k = 1, . . . , K. These naive estimators only use the variation in each of the K regressors individually. The OLS estimator, in contrast, through the

(X′X)⁻¹ = (∑_{i=1}^n x_i x_i′)⁻¹   (4)

term, also weights the regressors according to their covariances.
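A small simulated illustration (a numpy sketch, not from the slides) of how the naive per-regressor estimates in (3) differ from joint OLS when the regressors are correlated:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Two correlated regressors plus an intercept
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(scale=0.6, size=n)
X = np.column_stack([np.ones(n), x2, x3])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)                       # joint OLS
beta_naive = (X * y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)   # one-by-one regressions

print(beta_ols)    # close to [1, 2, -1]
print(beta_naive)  # noticeably different: ignores covariances between regressors
```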

8 / 59
Example. Single-regressor setup.

In the simple linear regression model, the naive estimators would be given by:

α̈ = ȳ,   (5)
β̈ = (∑_{i=1}^n x_i y_i) / (∑_{i=1}^n x_i²).   (6)

These clearly differ from the estimators α̂, β̂ we derived before, again because the covariance between x_{1,i} = 1 and x_{2,i} = x_i is not taken into account!

9 / 59
Example. When equivalent?

However, the two estimators are equivalent if the following condition is satisfied:

(x^(1))′ x^(2) = ∑_{i=1}^n x_i = 0,   (7)

i.e. x̄ = 0. In particular, this happens if you demean the data before running the regression, i.e. you use x̃_i ≡ x_i − x̄ as the regressor.

Hence, β̂ is equivalent to a two-step estimator: in the first step you demean the variable x_i (orthogonalize x^(2) with respect to ι_n), and in the second step you estimate a simple regression without an intercept.

As it turns out, this is just a special case of a more general result on partial regression.
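A quick numerical check (a numpy sketch on simulated data; not from the slides): regressing y on the demeaned x_i without an intercept reproduces the slope from the regression with an intercept:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.normal(loc=5.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Joint regression with intercept
X = np.column_stack([np.ones(n), x])
alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Two-step: demean x, then regress y on the demeaned regressor (no intercept)
x_tilde = x - x.mean()
beta_two_step = (x_tilde @ y) / (x_tilde @ x_tilde)

print(np.isclose(beta_hat, beta_two_step))  # True
```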

10 / 59
1.2. The Theorem

11 / 59
Preliminaries

We consider a partition of the full matrix of regressors X into two non-overlapping blocks:

X = (X_1, X_2),   (8)

where K_1 + K_2 = K, with K_1 the number of columns in X_1 and likewise for K_2. Consider also the corresponding decomposition of β̂:

β̂ = (β̂_1′, β̂_2′)′.   (9)

12 / 59
The Theorem

The main statement of this theorem can be summarized in the following two equations:

β̂_1 = (X_1′ M_{X_2} X_1)⁻¹ (X_1′ M_{X_2} y),   (10)
β̂_2 = (X_2′ M_{X_1} X_2)⁻¹ (X_2′ M_{X_1} y),   (11)

where M_{X_2} = I_n − X_2(X_2′X_2)⁻¹X_2′ and M_{X_1} = I_n − X_1(X_1′X_1)⁻¹X_1′.

13 / 59
Intuition

Take β̂_1. From the above formula it is clear that the estimate β̂_1 can be equivalently obtained in two steps:

Step 1 Regress each column of X_1 on X_2. The corresponding residuals are given by E_{1⊥2} = M_{X_2} X_1.
Step 2 Regress y on these residuals E_{1⊥2}:

β̃_1 = (E_{1⊥2}′ E_{1⊥2})⁻¹ (E_{1⊥2}′ y).   (12)

The desired result β̃_1 = β̂_1 then follows from the fact that M_{X_2} M_{X_2} = M_{X_2} (idempotence).

Hence, the OLS estimator β̂_1 measures the scaled covariance between X_1 and y after we project away (residualize) the part of X_1 that is also explained by X_2.
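A small verification of the two-step construction (a numpy sketch on simulated data; not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X1 = rng.normal(size=(n, 2))
X2 = np.column_stack([np.ones(n), rng.normal(size=n) + 0.5 * X1[:, 0]])
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=n)

# Full OLS; the first two coefficients correspond to X1
beta_full = np.linalg.solve(X.T @ X, X.T @ y)[:2]

# FWL two-step: residualize X1 on X2, then regress y on the residuals
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
E = M2 @ X1
beta_fwl = np.linalg.solve(E.T @ E, E.T @ y)

print(np.allclose(beta_full, beta_fwl))  # True
```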

14 / 59
Intuition. Special case

In the special case where X_1′X_2 = O, i.e. the informational content of the two blocks does not overlap (the corresponding column spaces are orthogonal to each other), it follows that:

β̂_1 = (X_1′X_1)⁻¹ (X_1′y),   (13)
β̂_2 = (X_2′X_2)⁻¹ (X_2′y).   (14)

Thus, the multivariate regression coefficients with all K_1 + K_2 regressors included jointly are equivalent to the coefficients from two separate regressions.

15 / 59
Fundamental Result

Next week we will use this result extensively to study:

I How regression coefficients change when we add or delete a variable from the model.
I How the model fit changes when we add or delete variables.

16 / 59
Proof

The proof is fairly straightforward. Consider the first-order conditions (normal equations) associated with the OLS estimator:

X′(y − Xβ̂) = 0_K.   (15)

Using the partition into X_1 and X_2, the above reads as:

X_1′(y − X_1β̂_1 − X_2β̂_2) = 0_{K_1},
X_2′(y − X_1β̂_1 − X_2β̂_2) = 0_{K_2}.

Using the first K_1 equations, solve for β̂_1 as a function of β̂_2:

β̂_1 = (X_1′X_1)⁻¹ X_1′(y − X_2β̂_2).   (16)

17 / 59
Plug that expression for β̂_1 into the second set of equations:

X_2′(y − X_1(X_1′X_1)⁻¹X_1′(y − X_2β̂_2) − X_2β̂_2) = 0_{K_2}.   (17)

Using the M_{X_1} = I_n − P_{X_1} notation:

X_2′(y − P_{X_1}(y − X_2β̂_2) − X_2β̂_2) = 0_{K_2}.   (18)

Collecting all terms:

X_2′(M_{X_1}y − M_{X_1}X_2β̂_2) = 0_{K_2}.   (19)

From here the result follows:

β̂_2 = (X_2′M_{X_1}X_2)⁻¹ (X_2′M_{X_1}y).   (20)

18 / 59
Empirical results. Model with interactions.

Figure: Regression of Price on distance_i, the dummy variable D_i = 1(distance_i < 2 km), and their interaction_i.

19 / 59
Empirical results. Regression of interaction term on
everything else.

Figure: Regression of interaction_i on distance_i and the dummy variable D_i.

20 / 59
Empirical results. Regression of price on residual

Figure: Regression of Price on the residual from the previous step.

21 / 59
2. OLS under Classical Assumptions

22 / 59
2.1. Moments of vectors

23 / 59
Vector-valued random variables

Consider an [M × 1] vector-valued random variable of the form:

z = (z_1, . . . , z_M)′.   (21)

The associated mean vector μ_z ≡ E[z] is given by:

μ_z = (μ_1, . . . , μ_M)′ = (E[z_1], . . . , E[z_M])′.   (22)

Hence, the mean of a random vector is just the vector of means of its entries.

24 / 59
The variance-covariance matrix

The [M × M] variance-covariance matrix is defined in the usual way:

Σ_z ≡ E[(z − E[z])(z − E[z])′],   (23)

where we usually write this matrix entry by entry as:

Σ_z = ( σ_11 … σ_1M
         ⋮   ⋱   ⋮
        σ_M1 … σ_MM ),   with σ_ij = E[(z_i − E[z_i])(z_j − E[z_j])].

This matrix is symmetric, since σ_ij = σ_ji for any i and j, and it is positive semi-definite, i.e. a′Σ_z a ≥ 0 for any a ∈ R^M.

25 / 59
Linear transformations. Mean.

Next, consider a linear transformation of z:

z̃ = Az + b,   (24)

for any vector b and any matrix A. Then, using the corresponding definitions:

μ_{z̃} = E[z̃] = E[Az + b] = A E[z] + b = Aμ_z + b.   (25)

26 / 59
Linear transformations. Variance.

As for the variance-covariance matrix of z̃, note that z̃ − E[z̃] = A(z − E[z]). Hence the corresponding variance-covariance matrix is:

Σ_{z̃} ≡ E[(z̃ − E[z̃])(z̃ − E[z̃])′] = A E[(z − E[z])(z − E[z])′] A′ = AΣ_z A′.

This is just the generalization of the fact that if you scale a random variable by a constant c, its variance is multiplied by c².
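A quick simulation check (a numpy sketch with an arbitrary covariance matrix; not in the slides) that Var(Az + b) = AΣ_z A′:

```python
import numpy as np

rng = np.random.default_rng(4)
M, n_draws = 3, 200_000

# A random covariance matrix Sigma_z (positive semi-definite by construction)
L = rng.normal(size=(M, M))
Sigma_z = L @ L.T

Z = rng.multivariate_normal(mean=np.zeros(M), cov=Sigma_z, size=n_draws)
A = rng.normal(size=(2, M))
b = np.array([1.0, -2.0])
Z_tilde = Z @ A.T + b   # each row is A z_i + b

print(np.round(np.cov(Z_tilde, rowvar=False), 2))
print(np.round(A @ Sigma_z @ A.T, 2))  # close to the sample covariance above
```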

27 / 59
Comparing two variance-covariance matrices

Consider two random variables a with var(a) = σ_a² and b with var(b) = σ_b². We say that the random variable a has no larger variance than the random variable b when:

σ_b² ≥ σ_a².   (26)

Can we generalize this to random vectors (of the same dimension) a and b with variance-covariance matrices Σ_a and Σ_b?

28 / 59
We say that the random vector a has no larger variance than the random vector b when:

Σ_b ≥ Σ_a.   (27)

Here ≥ means that the difference between the two matrices is a positive semi-definite matrix, i.e.:

q′(Σ_b − Σ_a)q ≥ 0,   ∀q ∈ R^{dim(a)}.   (28)

Why this specific definition? It implies that the variance of the random variable q′a (a scalar!) is no larger than that of q′b (again a scalar) for any q. Hence, for any possible linear combination q, one variable has no larger variance than the other.
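One convenient way to check condition (28) numerically is via the eigenvalues of the difference. A numpy sketch (the helper name is mine, not from the slides):

```python
import numpy as np

def no_larger_variance(Sigma_a, Sigma_b, tol=1e-10):
    """Return True if Sigma_b - Sigma_a is positive semi-definite,
    i.e. the vector a has no larger variance than b."""
    diff = Sigma_b - Sigma_a
    # For a symmetric matrix, PSD <=> all eigenvalues >= 0
    eigvals = np.linalg.eigvalsh((diff + diff.T) / 2)
    return bool(eigvals.min() >= -tol)

Sigma_a = np.array([[1.0, 0.2], [0.2, 1.0]])
Sigma_b = Sigma_a + 0.5 * np.eye(2)   # strictly larger variance
print(no_larger_variance(Sigma_a, Sigma_b))  # True
print(no_larger_variance(Sigma_b, Sigma_a))  # False
```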

29 / 59
2.2. Mean and variance of OLS

30 / 59
Classical Assumptions in vector notation
We extend the classical assumptions to general K using matrix notation.
Assumption 1 (Fixed regressors/Exogeneity). X can be treated as fixed numbers (i.e. one can condition on them). They satisfy X′X > 0 (or with probability 1).
Assumption 2 (Random disturbances). The n disturbances/error terms ε are random variables with E[ε|X] = 0_n.
Assumption 3 (Homoscedasticity). The variances of the elements of ε exist and are all equal: E[εε′|X] = σ_0²I_n with σ_0² > 0.
Assumption 4 (Independence). ε_1, . . . , ε_n are independent after conditioning on X.
Assumption 5 (Fixed parameters). The true parameters β_0 and σ_0² are fixed unknown parameters.
Assumption 6 (Linear model). The data on y have been generated by:

y = Xβ_0 + ε.   (29)

Assumption 7 (Normality). The disturbances ε are jointly normally distributed, ε ∼ N(0_n, σ_0²I_n), conditionally on X.
31 / 59
OLS is unbiased

First, we show that the OLS estimator is unbiased:

E[β̂] = β_0.   (30)

For this, note that using y = Xβ_0 + ε:

β̂ = (X′X)⁻¹X′y = β_0 + (X′X)⁻¹X′ε.

Hence:

E[β̂|X] = β_0 + E[(X′X)⁻¹X′ε|X]
        = β_0 + (X′X)⁻¹X′ E[ε|X]
        = β_0.

By the Law of Iterated Expectations we also get that E[β̂] = β_0.

32 / 59
Variance of OLS

Consider the variance-covariance matrix of β̂. By definition:

Var(β̂|X) = E[(β̂ − E[β̂|X])(β̂ − E[β̂|X])′ |X].   (31)

Using previous results, β̂ − E[β̂|X] = (X′X)⁻¹X′ε. Hence, the estimator is just a linear combination of ε (after conditioning on X). As a result:

Var(β̂|X) = E[(X′X)⁻¹X′εε′X(X′X)⁻¹ |X]
          = (X′X)⁻¹X′ E[εε′|X] X(X′X)⁻¹
          = (X′X)⁻¹X′ σ_0²I_n X(X′X)⁻¹
          = σ_0²(X′X)⁻¹.
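A Monte Carlo check of these two results (a numpy sketch with X held fixed across replications; not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
n, K, sigma0, R = 100, 3, 1.5, 20_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # held fixed
beta0 = np.array([1.0, 2.0, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

draws = np.empty((R, K))
for r in range(R):
    y = X @ beta0 + sigma0 * rng.normal(size=n)
    draws[r] = XtX_inv @ (X.T @ y)

print(np.round(draws.mean(axis=0), 3))           # ~ beta0 (unbiasedness)
print(np.round(np.cov(draws, rowvar=False), 4))  # ~ sigma0^2 (X'X)^{-1}
print(np.round(sigma0 ** 2 * XtX_inv, 4))
```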

33 / 59
2.3. OLS is BLUE

34 / 59
Is OLS an efficient estimator? The Gauss-Markov theorem

The OLS estimator is unbiased, but is it possible to find another estimator with smaller variance (in the positive semi-definite sense)? The answer to this question is negative if one restricts attention to the class of linear and unbiased estimators.

The celebrated Gauss-Markov theorem states that the OLS estimator β̂ is the Best Linear Unbiased Estimator (BLUE) of the unknown β_0.

Here "best" means that it has the smallest variance.

35 / 59
Linear unbiased estimator

We say that an estimator β̃ is linear in y if it is of the form:

β̃ = Ay,   (32)

where A is a (possibly non-linear) function of X. When is such an estimator unbiased? Using y = Xβ_0 + ε:

β̃ = AXβ_0 + Aε.   (33)

Hence:

E[β̃|X] = E[AX|X]β_0 + A E[ε|X] = AXβ_0.   (34)

As a result, β̃ is (conditionally) unbiased as long as AX = I_K.

36 / 59
Best estimator

We will show that for any choice of A (with AX = I_K) we have:

var(β̃|X) ≥ var(β̂|X),   (35)

where again ≥ should be interpreted in the positive semi-definite sense.

37 / 59
Proof

Note that:

β̃ = β̂ + (β̃ − β̂),   (36)

and also:

β̃ − β_0 = β̂ − β_0 + (β̃ − β_0 − (β̂ − β_0)).   (37)

Denote A_OLS = (X′X)⁻¹X′; then, since β̃ − β_0 = Aε (by AX = I_K) and β̂ − β_0 = A_OLS ε, the previous equality can be expressed as:

β̃ − β_0 = A_OLS ε + (A − A_OLS)ε.   (38)

From here:

var(β̃|X) = E[(β̃ − β_0)(β̃ − β_0)′|X]
          = E[A_OLS εε′A_OLS′|X] + E[(A − A_OLS)εε′(A − A_OLS)′|X]
          + E[A_OLS εε′(A − A_OLS)′|X] + E[(A − A_OLS)εε′A_OLS′|X].

We next show that E[A_OLS εε′(A − A_OLS)′|X] = O.

38 / 59
Note that:

E[A_OLS εε′(A − A_OLS)′|X] = σ_0² A_OLS(A − A_OLS)′ = σ_0²(A_OLS A′ − A_OLS A_OLS′).

However:

A_OLS A_OLS′ = (X′X)⁻¹X′X(X′X)⁻¹ = (X′X)⁻¹,   (39)
A_OLS A′ = (X′X)⁻¹X′A′ = (X′X)⁻¹.   (40)

Here the second equality in (40) follows from the fact that AX = I_K (by unbiasedness). As a result:

E[A_OLS εε′(A − A_OLS)′|X] = E[(A − A_OLS)εε′A_OLS′|X] = O.   (41)

39 / 59
Combining all results:

var(β̃|X) = E[(β̃ − β_0)(β̃ − β_0)′|X]
          = E[A_OLS εε′A_OLS′|X] + E[(A − A_OLS)εε′(A − A_OLS)′|X]
          = var(β̂|X) + E[(A − A_OLS)εε′(A − A_OLS)′|X].

Given that E[(A − A_OLS)εε′(A − A_OLS)′|X] = σ_0²(A − A_OLS)(A − A_OLS)′ is a positive semi-definite matrix (by construction), we conclude that:

var(β̃|X) ≥ var(β̂|X).   (42)

This completes the proof.
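As an illustration (a numpy sketch, not part of the proof), compare OLS with an alternative linear unbiased estimator, here a weighted least-squares estimator A = (X′DX)⁻¹X′D with arbitrary positive weights D; the difference of the two conditional variance matrices is positive semi-definite, as the theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(6)
n, K, sigma0 = 80, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
D = np.diag(rng.uniform(0.2, 5.0, size=n))   # arbitrary fixed positive weights

A_ols = np.linalg.solve(X.T @ X, X.T)            # (X'X)^{-1} X'
A_wls = np.linalg.solve(X.T @ D @ X, X.T @ D)    # (X'DX)^{-1} X'D, also satisfies A X = I_K

var_ols = sigma0 ** 2 * A_ols @ A_ols.T          # sigma_0^2 (X'X)^{-1}
var_wls = sigma0 ** 2 * A_wls @ A_wls.T          # variance of the alternative estimator

print(np.allclose(A_wls @ X, np.eye(K)))                       # unbiasedness condition holds
print(np.linalg.eigvalsh(var_wls - var_ols).min() >= -1e-10)   # difference is PSD: OLS is "best"
```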

40 / 59
3. Estimation of σ²

41 / 59
Plug-in approach
Note that if, in practice, we want to compute the variance of β̂:

var(β̂|X) = σ_0²(X′X)⁻¹,   (43)

then this variance expression is not feasible, i.e. it cannot be computed, because σ_0² is unknown.

Given that E[ε_i²] = σ_0², the most natural approach to estimating this unknown quantity is based on the method-of-moments (or plug-in) principle:

σ̂² = (1/n) ∑_{i=1}^n ê_i² = (1/n) ê′ê,   (44)

where ê = y − Xβ̂ = M_X y.

In what follows we show that:

E[σ̂²] ≠ σ_0².   (45)

Thus, this estimator is not unbiased.


42 / 59
3.1. Some matrix algebra

43 / 59
Preliminaries: Matrix Trace

For any [q × q] (square) matrix A of the form:

A = ( a_11 … a_1q
       ⋮   ⋱   ⋮
      a_q1 … a_qq ),   (46)

the trace is defined as the sum of the diagonal elements, i.e.:

tr(A) = ∑_{s=1}^q a_ss.   (47)

It is trivial to see that for two [q × q] matrices A and B we have:

tr(A + B) = ∑_{s=1}^q (a_ss + b_ss) = ∑_{s=1}^q a_ss + ∑_{s=1}^q b_ss = tr(A) + tr(B).   (48)

44 / 59
Preliminaries: Matrix Trace. Circular permutations.

Let C and D be two matrices of dimensions [q × u], so that CD′ and D′C are both square matrices. Then:

tr(CD′) = tr(D′C).   (49)

Hence, the trace operator is invariant to circular permutations. This holds even though the two resulting matrices CD′ ([q × q]) and D′C ([u × u]) are potentially of different dimensions.

We will not attempt to prove this fact in this course; please refer to any textbook on advanced linear algebra.

45 / 59
Application: Trace of Projection matrix

For any [q × w] matrix B with rank(B) = w, consider the corresponding orthogonal projection matrix M_B:

M_B = I_q − B(B′B)⁻¹B′.   (50)

We show that tr(M_B) = q − w.

46 / 59
Using additivity and invariance under circular permutations:

tr(M_B) = tr(I_q − B(B′B)⁻¹B′)
        = tr(I_q) − tr(B(B′B)⁻¹B′)
        = tr(I_q) − tr(B′B(B′B)⁻¹)
        = tr(I_q) − tr(I_w)
        = q − w.
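A quick numerical check of this trace result (a numpy sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
q, w = 50, 4
B = rng.normal(size=(q, w))          # full column rank with probability one

M_B = np.eye(q) - B @ np.linalg.solve(B.T @ B, B.T)   # I_q - B (B'B)^{-1} B'
print(round(np.trace(M_B)))          # q - w = 46
```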

47 / 59
3.2. Unbiased estimation

48 / 59
Biased estimator?

Note that the original estimator is given by:

σ̂² = (1/n) ê′ê = (1/n) y′M_X y.   (51)

Using the true model y = Xβ_0 + ε and the fact that M_X X = O, it can be equivalently expressed as:

σ̂² = (1/n) ε′M_X ε.   (52)

49 / 59
Some trace tricks

TRICK 1: Note that any scalar a is the trace of itself, i.e. a = tr(a). Hence:

σ̂² = tr(σ̂²) = (1/n) tr(ε′M_X ε) = (1/n) tr(M_X εε′).

In the final step we use invariance under circular permutations.

50 / 59
Some trace tricks
TRICK 2: From the definition of tr(·) it is clear that E[tr(A)] = tr(E[A]) for any appropriate matrix A. Using this trick:

E[σ̂²|X] = (1/n) E[tr(M_X εε′)|X]
         = (1/n) tr(E[M_X εε′|X])
         = (1/n) tr(M_X E[εε′|X])
         = (1/n) tr(M_X σ_0²I_n)
         = (σ_0²/n) tr(M_X)
         = σ_0² (n − K)/n.

In the final line we used the property of traces of projection matrices that we proved previously, tr(M_X) = n − K.

51 / 59
Towards unbiased estimation

From the above result it is clear that:

s² ≡ (1/(n − K)) ê′ê,   (53)

is unbiased, i.e. E[s²] = σ_0², because:

s² = (n/(n − K)) σ̂².   (54)

Obviously, for large values of n the two are roughly equivalent, i.e.:

s² ≈ σ̂².   (55)

But for any fixed value of n we always have s² ≥ σ̂².

The n − K adjustment is usually referred to as the degrees-of-freedom (df) correction.
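A Monte Carlo illustration of the bias of σ̂² and the effect of the df correction (a numpy sketch on simulated data; not in the slides):

```python
import numpy as np

rng = np.random.default_rng(9)
n, K, sigma0, R = 30, 4, 2.0, 50_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P

sig2_hat, s2 = [], []
for _ in range(R):
    eps = sigma0 * rng.normal(size=n)
    rss = eps @ M @ eps              # e'e = eps' M_X eps
    sig2_hat.append(rss / n)
    s2.append(rss / (n - K))

print(np.mean(sig2_hat), sigma0 ** 2 * (n - K) / n)  # biased: ~ sigma_0^2 (n-K)/n
print(np.mean(s2), sigma0 ** 2)                      # unbiased: ~ sigma_0^2
```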

52 / 59
3.3. Standard errors

53 / 59
Estimating the variance of β̂

Given that s² is an unbiased estimator of σ_0², the feasible estimator for the variance-covariance matrix of β̂ is given by:

v̂ar(β̂|X) = s²(X′X)⁻¹.   (56)

Given the data (y, X), the quantity v̂ar(β̂|X) is observed and can be computed.

54 / 59
Standard error

By the definition of the variance-covariance matrix, the estimated variance of each element β̂_k is given by the k-th diagonal element of v̂ar(β̂|X), i.e.:

v̂ar(β̂_k|X) = s² [(X′X)⁻¹]_{[k,k]}.   (57)

Note that we first take the inverse of the [K × K] matrix X′X, and only then take its [k, k] element, not the other way around (a typical mistake!).

The corresponding standard error for the k-th regressor is given by the square root of this variance, i.e.:

SE_k = √(v̂ar(β̂_k|X)) = √(s² [(X′X)⁻¹]_{[k,k]}).   (58)
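Putting the last few slides together (a numpy sketch on simulated data; not from the slides), the residuals, s², and the standard errors can be computed as:

```python
import numpy as np

rng = np.random.default_rng(8)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ y)
resid = y - X @ beta_hat

s2 = resid @ resid / (n - K)     # unbiased estimator of sigma_0^2 (df correction)
var_hat = s2 * XtX_inv           # estimated variance-covariance matrix of beta_hat
se = np.sqrt(np.diag(var_hat))   # standard errors: square roots of the diagonal elements

print(beta_hat, se)
```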

55 / 59
Empirical results. Model 2.

Figure: Regression of Price on distance_i, the dummy variable D_i = 1(distance_i < 2 km), and their interaction_i.

56 / 59
4. Summary

57 / 59
Summary today

In this lecture:
I We proved the celebrated Frisch-Waugh-Lovell theorem.
I We studied the statistical properties of the OLS estimator.
I We proved that this estimator is the Best Linear Unbiased Estimator (BLUE).

58 / 59
Next Week

I We investigate how the properties of the OLS estimator change when we add/remove regressors from the model.
I We argue why removing regressors can sometimes be beneficial and sometimes harmful. We introduce the bias-variance tradeoff.
I We show how statistical inference on β_0 can be conducted using the OLS estimator.

59 / 59
