
EC221: Principles of Econometrics

Introducing Lent

Dr M. Schafgans

London School of Economics

Lent 2022

Dr Schafgans (LSE) EC221: Introducing Lent 1 / 17


Bivariate Regression Analysis - recap I

In Michaelmas term, we introduced the bivariate regression model

Y = β0 + β1 X + ε

A simple way to summarize the relationship in our population.

When displaying your data (sample) {(Xi , Yi )}, i = 1, ..., n, you will never see
that all (Xi , Yi )-pairs lie on a straight line.

The error term, εi , accounts for this discrepancy. It reflects our ignorance
(everything we haven't modelled).

Aim: we want to estimate the causal effect β1 (the effect of X on Y
holding other factors, lumped into ε, fixed - ceteris paribus).

Dr Schafgans (LSE) EC221: Introducing Lent 2 / 17
Bivariate Regression Analysis - recap II

Y = β0 + β1 X + ε

Terminology
Y is the dependent variable
X the independent, explanatory variable
ε is the error term (our ignorance), E(ε) = 0

Our desire to interpret β1 as causal requires us to restrict the relation
between the explanatory variables X and the unobservables ε.

Specifically, we will need to assume mean independence

E(ε|X) = E(ε) = 0

This assumption ensures there is no correlation between the errors and
regressor(s) - more details later.

Assumption implies E(Y|X) = β0 + β1 X and β1 = ∂E(Y|X)/∂X. VN1.1

Dr Schafgans (LSE) EC221: Introducing Lent 3 / 17


Multivariate Regression Analysis - recap I

Last term you discussed concerns about your ability to interpret β1 as
a causal relationship when omitting relevant variables - confounders.
The problem: your ignorance (ε) is correlated with the included
regressors (X).

The multivariate regression model enabled us to control explicitly for
other factors that might affect the outcome

Y = β0 + β1 X1 + β2 X2 + ε

Recall: Adding more explanatory variables allows us to:

Explain more of the variation in outcomes (higher R²)
Incorporate more general functional form relationships (e.g., add
quadratic terms)
Move towards establishing causality

Dr Schafgans (LSE) EC221: Introducing Lent 4 / 17


Multivariate Regression Analysis - recap II

To give our parameters a causal interpretation, we require

E(ε|X1 , X2 ) ≡ E(ε|X) = 0.

This guarantees that there is no correlation between the errors (ε) and
regressors (X1 , X2 ).

As this ensures E(Y|X) = β0 + β1 X1 + β2 X2 , we obtain, e.g.,

∂E(Y|X)/∂X1 = ∂(β0 + β1 X1 + β2 X2 )/∂X1 = β1

β1 is the marginal effect that X1 has on E(Y|X) holding everything else
constant, including X2 . So called: partial effect.

We also say, β1 is the effect X1 has on Y after controlling for X2
(Regression anatomy, Frisch-Waugh-Lovell)

Dr Schafgans (LSE) EC221: Introducing Lent 5 / 17


Bivariate Regression Analysis - recap III

To estimate the parameters, we will need to obtain a sample from the
population, {(Yi , Xi )}, i = 1, ..., n.

Yi = β0 + β1 Xi + εi , i = 1, .., n.

We obtain a sample {(Yi , Xi )}, i = 1, ..., n, effectively by drawing from the
joint distribution of (X, ε).

In LT, we will consider a range of assumptions regarding this Data
Generating Process (DGP). These assumptions (e.g., the
Gauss-Markov Assumptions) are crucial as
They will help establish finite and asymptotic properties of our
estimators.
They will explain the suitability/optimality of particular estimation
procedures.

Dr Schafgans (LSE) EC221: Introducing Lent 6 / 17


Bivariate Regression Analysis - recap IV

Yi = β0 + β1 Xi + εi , i = 1, .., n.

Depending on the type of data, different assumptions may be needed.

Cross-sectional data:
It may be reasonable to assume that (εi , Xi ) are i.i.d. drawings from
the population.
Time-series data:
Time-series data typically exhibit dependence and we should account
for dependence between (εi , Xi ) and (εj , Xj ), i ≠ j.

Correlation between Xi and εi is common (e.g., measurement error,
omitted variables, simultaneity).
We will formally show that any correlation between Xi and εi renders
the OLS estimator undesirable.

Dr Schafgans (LSE) EC221: Introducing Lent 7 / 17


Bivariate Regression Analysis - OLS estimator I

Given a sample, we could consider using the Ordinary Least Squares
(OLS) estimator for β0 and β1 .

(β̂0 , β̂1 ) = arg min over (b0 , b1 ) of ∑ni=1 (Yi − b0 − b1 Xi )²

Terminology:
β̂0 and β̂1 are estimators (random variables). Realisations of these
estimators for a particular sample are called estimates.

Ŷi = β̂0 + β̂1 Xi are fitted values

ε̂i = Yi − β̂0 − β̂1 Xi are residuals. Not the same as εi !!!

This is an intuitive estimator: if we assume a linear relation exists
between Y and X, we should choose the line that minimizes the
squared distances between the observed Yi and b0 + b1 Xi .

Dr Schafgans (LSE) EC221: Introducing Lent 8 / 17
Bivariate Regression Analysis - OLS estimator II

(β̂0 , β̂1 ) = arg min over (b0 , b1 ) of ∑ni=1 (Yi − b0 − b1 Xi )²

The F.O.C. are given by

∂/∂b0 : −2 ∑ni=1 (Yi − β̂0 − β̂1 Xi ) = 0
∂/∂b1 : −2 ∑ni=1 (Yi − β̂0 − β̂1 Xi ) Xi = 0

Solving these two equations (see PS2) yields

β̂0 = Ȳ − β̂1 X̄
β̂1 = ∑ni=1 (Yi − Ȳ)(Xi − X̄) / ∑ni=1 (Xi − X̄)² = Sample Cov(Yi , Xi ) / Sample Var(Xi )

Observe that we do need sample variability in the regressors!

Dr Schafgans (LSE) EC221: Introducing Lent 9 / 17
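To make the formulas above concrete, here is a minimal NumPy sketch (an illustration of my own, not part of the original slides); the coefficient values and sample size are arbitrary. It computes β̂0 and β̂1 exactly as sample covariance over sample variance and checks the normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a bivariate DGP: Y = beta0 + beta1*X + eps (illustrative values)
n, beta0, beta1 = 200, 1.0, 2.0
X = rng.normal(size=n)
eps = rng.normal(size=n)
Y = beta0 + beta1 * X + eps

# OLS slope: sample covariance of (Y, X) over sample variance of X
beta1_hat = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
# OLS intercept: Ybar - beta1_hat * Xbar
beta0_hat = Y.mean() - beta1_hat * X.mean()

# Fitted values and residuals; residuals are not the same thing as the errors eps
Y_fit = beta0_hat + beta1_hat * X
resid = Y - Y_fit

print(beta0_hat, beta1_hat)             # close to (1.0, 2.0) in this sample
print(resid.sum(), (resid * X).sum())   # both ~0: the F.O.C. (normal equations) hold
```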


Bivariate Regression Analysis - recap V

Under repeated sampling β̂0 and β̂1 will take different values (r.v.)
We will consider various aspects of the sampling distribution of β̂1 .

Unbiasedness: E(β̂1 ) = β1
Is the estimator under repeated sampling correct on average?
Unbiasedness ensures that we do not make systematic errors when
estimating a parameter.

Variability: Var(β̂1 ) indicates how dispersed the realisations of β̂1
are under repeated sampling. Is our estimator efficient?
Standard errors of our OLS estimators are defined as VN1.2

SE(β̂1 ) = √(Var̂(β̂1 ))

We will formally discuss the need and use of robust standard errors in
empirical research.

Dr Schafgans (LSE) EC221: Introducing Lent 10 / 17
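A small Monte Carlo experiment (my own illustration, not from the slides) makes the "repeated sampling" idea tangible: redraw many samples from the same DGP, re-estimate β̂1 each time, and look at the mean and dispersion of the estimates. All numbers below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1, n_reps = 100, 1.0, 2.0, 5000  # illustrative choices

slope_estimates = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(size=n)
    eps = rng.normal(size=n)          # E(eps|X) = 0 by construction
    Y = beta0 + beta1 * X + eps
    slope_estimates[r] = (np.sum((Y - Y.mean()) * (X - X.mean()))
                          / np.sum((X - X.mean()) ** 2))

# Mean of the estimates is close to the true beta1 (unbiasedness);
# the standard deviation describes the sampling variability.
print(slope_estimates.mean(), slope_estimates.std())
```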
Bivariate Regression Analysis - recap VI

Sampling distribution: Needed to conduct inference

If we cannot establish the finite sample distribution (may require too
strong an assumption), we will use asymptotic theory to enable us to
obtain a good approximation (CLT), suitable as long as our sample is
large.

In EC221 we heavily rely on linear algebra.

While many properties of the OLS estimators in the bivariate regression
model can be derived using plain calculus, the use of linear algebra is
much more elegant (easier).
The use of linear algebra is quite natural (excel spreadsheet: matrix;
columns of data (variables): vector)

In EC221 we will provide more complete derivations of statistical
properties and inference.

Dr Schafgans (LSE) EC221: Introducing Lent 11 / 17


Some Notational Issues

From now on, we will use {(yi , xi )}, i = 1, ..., n, to denote random samples
(and realisations alike), instead of explicitly using {(Yi , Xi )}, i = 1, ..., n, to
distinguish.

Small letters will be used to denote vectors (and scalars).

E.g., xi = (xi1 , ..., xik )' - the characteristics of individual i

E.g., x = (x1 , ..., xn )' - n observations on the explanatory variable x.

By convention, vectors are written in columns. To denote a row vector
you use the transpose notation (x' or x^T).

Capital letters will be used to denote matrices.

E.g., X = [x z], the n × 2 matrix whose ith row is (xi , zi ) - n observations
on the explanatory variables x and z.

Dr Schafgans (LSE) EC221: Introducing Lent 12 / 17
Multivariate Statistics I

Important to recall some multivariate distribution theory: VN1.3

Joint density of n observations of the same variable:
f(x1 , ..., xn ) ≡ f(x), x being a vector.

Marginal density: f(x1 ) = ∫ ... ∫ f(x1 , ..., xn ) dx2 ... dxn

Conditional density: f(ε|x) = f(ε, x) / f(x)
describes the distribution of ε for a given value of x

The conditional mean

E(ε|x) = ∫ ε f(ε|x) dε

Under independence: f(ε|x) = f(ε), E(ε|x) = E(ε)
Joint density is the product of the marginals: f(ε, x) = f(ε)f(x)

Dr Schafgans (LSE) EC221: Introducing Lent 13 / 17


Multivariate Statistics II

Covariance matrix: An important concept that will describe the
co-movement of elements in a vector of random variables.

Covariance between 2 random variables:

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y)

Positive covariances tell us that we expect large values of the two
random variables to coincide.
If X and Y are independent then Cov(X, Y) = 0, but the reverse is
not always true! (see Statistics Revision notes)

Dr Schafgans (LSE) EC221: Introducing Lent 14 / 17


Multivariate Statistics III

The covariance matrix of ε = (ε1 , .., εn )' is denoted

Var(ε), with [Var(ε)]ij = σij

Matrix that contains the variances on the diagonal and all covariances
on the off-diagonal. VN1.4

Var(ε) = E[(ε − E(ε))(ε − E(ε))'],  with ε − E(ε) = (ε1 − E(ε1 ), ..., εn − E(εn ))'

(ε − E(ε) is n×1 and its transpose is 1×n, so Var(ε) is n×n.)

Verify that indeed by expanding the above we get:

Var(ε) =

Dr Schafgans (LSE) EC221: Introducing Lent 15 / 17
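Carrying out the expansion the slide asks for (my own completion of the blank above, following directly from the definitions [Var(ε)]ij = σij with σii = Var(εi ) and σij = Cov(εi , εj )):

```latex
\operatorname{Var}(\varepsilon)
  = E\big[(\varepsilon - E(\varepsilon))(\varepsilon - E(\varepsilon))'\big]
  = \begin{pmatrix}
      \operatorname{Var}(\varepsilon_1) & \operatorname{Cov}(\varepsilon_1,\varepsilon_2) & \cdots & \operatorname{Cov}(\varepsilon_1,\varepsilon_n)\\
      \operatorname{Cov}(\varepsilon_2,\varepsilon_1) & \operatorname{Var}(\varepsilon_2) & \cdots & \operatorname{Cov}(\varepsilon_2,\varepsilon_n)\\
      \vdots & \vdots & \ddots & \vdots\\
      \operatorname{Cov}(\varepsilon_n,\varepsilon_1) & \operatorname{Cov}(\varepsilon_n,\varepsilon_2) & \cdots & \operatorname{Var}(\varepsilon_n)
    \end{pmatrix}
  = \begin{pmatrix}
      \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n}\\
      \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n}\\
      \vdots & \vdots & \ddots & \vdots\\
      \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}
    \end{pmatrix}
```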


Useful results: Mean of sums of random variables

The expected value of the sum of random variables is:

E[∑ni=1 xi ] = ∑ni=1 E[xi ]

For constants c and a1 , .., an :

E[c + ∑ni=1 ai xi ] = c + ∑ni=1 ai E[xi ]

Using ∑ni=1 ai xi = a'x, where a and x are vectors:

E[c + a'x] = c + a'E(x)

If c is a vector of constants and A is a matrix of constants:

E[c + Ax] = c + AE(x)

Dr Schafgans (LSE) EC221: Introducing Lent 16 / 17


Useful results: Variance of sums of random variables

The variance of the sum of random variables is

Var[∑ni=1 xi ] = ∑ni=1 Var[xi ] + ∑ni=1 ∑nj=1, j≠i Cov(xi , xj )

For constants c and a1 , .., an :

Var[c + ∑ni=1 ai xi ] = ∑ni=1 ai² Var[xi ] + ∑ni=1 ∑nj=1, j≠i ai aj Cov(xi , xj )

Using linear algebra notation: Var[c + a'x] = a'Var(x)a

If c is a vector of constants and A is a matrix of constants:

Var[c + Ax] = A Var(x) A'   VN1.5

Represents the covariance matrix of the vector c + Ax.

Dr Schafgans (LSE) EC221: Introducing Lent 17 / 17
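A quick NumPy check of the last rule (my own illustration; the particular A, c and covariance matrix are arbitrary): simulate a random vector x with a known covariance matrix, transform it as c + Ax, and compare the simulated covariance of the transform with A Var(x) A'.

```python
import numpy as np

rng = np.random.default_rng(2)

# Known covariance for a 3-dimensional x (illustrative values)
V = np.array([[1.0, 0.3, 0.0],
              [0.3, 2.0, 0.5],
              [0.0, 0.5, 1.5]])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
c = np.array([5.0, -1.0])

# Draw many x's with covariance V, form z = c + A x for each draw
x = rng.multivariate_normal(mean=np.zeros(3), cov=V, size=200_000)
z = c + x @ A.T

print(np.cov(z, rowvar=False))   # simulated Var(c + Ax)
print(A @ V @ A.T)               # theoretical A Var(x) A' -- should match closely
```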
EC221: Principles of Econometrics
Multiple Linear Regression Model

Dr M. Schafgans

London School of Economics

Lent 2022

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 1 / 52


Multiple Linear Regression Model
Preliminaries

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 2 / 52


Multiple Linear Regression Model – Preliminaries I

Let us first consider the multiple linear regression model

yi = α + β1 xi1 + β2 xi2 + εi , i = 1, ..., n

OLS procedure

(α̂, β̂1 , β̂2 ) = arg min over (a, b1 , b2 ) of ∑ni=1 (yi − a − b1 xi1 − b2 xi2 )²   VN2.1

Normal Equations (FOC)

∑ni=1 ε̂i = 0
∑ni=1 xi1 ε̂i = 0     Solve for α̂, β̂1 and β̂2
∑ni=1 xi2 ε̂i = 0

where ε̂i = yi − α̂ − β̂1 xi1 − β̂2 xi2 are the residuals

α̂ = ȳ − β̂1 x̄1 − β̂2 x̄2 , easy to obtain from the 1st equation. Can you
express β̂j in terms of {(yi , xi1 , xi2 )}, i = 1, ..., n, only using plain calculus?

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 3 / 52


Multiple Linear Regression Model – Preliminaries II

General k variable regression model

yi = β1 + β2 xi2 + ... + βk xik + εi , i = 1, ..., n

Special case of

yi = β1 xi1 + β2 xi2 + ... + βk xik + εi , where xi1 = 1 for all i

Normal equations

∑ni=1 ε̂i = 0
∑ni=1 xi2 ε̂i = 0
...                     solve for β̂1 , β̂2 , ..., β̂k
∑ni=1 xik ε̂i = 0        with ε̂i = yi − β̂1 − β̂2 xi2 − ... − β̂k xik

Explicit solution: Matrix Algebra

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 4 / 52


Multiple Linear Regression Model – Matrix Notation I

yi = β1 xi1 + β2 xi2 + ... + βk xik + εi , i = 1, ..., n

The ith observation of the model can be rewritten as:

yi = xi'β + εi , where β = (β1 , β2 , ..., βk )' and xi = (xi1 , xi2 , ..., xik )'

β is a vector of (fixed) parameters
xi is a vector of characteristics of individual i.
NB. xi'β = β'xi (scalar)

Our model has n observations: so let's stack them

(y1 , y2 , ..., yn )' = (x1'β + ε1 , x2'β + ε2 , ..., xn'β + εn )'

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 5 / 52
Multiple Linear Regression Model – Matrix Notation II

(y1 , ..., yn )' = (x1'β + ε1 , ..., xn'β + εn )' = (x1'β, ..., xn'β)' + (ε1 , ..., εn )'

As β appears for all observations, we can write this as

y = [x1 , ..., xn ]' β + ε,  with dimensions (n×1) = (n×k)(k×1) + (n×1)

Using matrix algebra this may be further simplified as

y = Xβ + ε,  with X the n×k matrix whose (i, j) element is xij ;
written in terms of its columns, X = [x1 , x2 , ..., xk ] (one column per explanatory variable)

y and ε are the n×1 dimensional vectors
X is like an excel spreadsheet - a matrix.
Empirical setting: VN2.2

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 6 / 52
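As a concrete rendering of the "spreadsheet" analogy (an illustration of my own, not from the slides; variable names and numbers are arbitrary), the sketch below stacks a column of ones and two observed regressors into an n×k design matrix X and writes the model as y = Xβ + ε.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6                                   # a tiny illustrative sample

# Two observed regressors plus a constant term (x_{i1} = 1 for all i)
educ = rng.integers(10, 18, size=n).astype(float)
exper = rng.integers(0, 20, size=n).astype(float)
X = np.column_stack([np.ones(n), educ, exper])   # n x k design matrix, k = 3

beta = np.array([1.0, 0.08, 0.02])      # illustrative "true" parameters
eps = rng.normal(scale=0.1, size=n)

y = X @ beta + eps                      # the stacked model y = X beta + eps
print(X.shape, y.shape)                 # (6, 3) (6,)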


Multiple Linear Regression model
y = Xβ + ε

Gauss-Markov Assumptions

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 7 / 52


MLR model - Gauss-Markov Assumptions

(A.1) True model - linear in parameters: y = Xβ + ε, with E(ε) = 0,
where X = [x1 , x2 , ..., xk ] is the n×k matrix with (i, j) element xij and β = (β1 , ..., βk )'

Linearity refers to the manner in which the parameters and the
disturbance enter the equation, not necessarily to the relationship.
The equations yi = α + βxi + εi , log(yi ) = α + β log(xi ) + εi ,
yi = α + β/xi + εi are all linear in some function of xi .
The interpretation of β depends on the functional form (MT)

If we have reason to specify E(ε) ≠ 0, we should build it into the
systematic part of the regression. VN2.3

Obviously we can also imagine settings where the parameters enter in a
non-linear fashion (LDV) – more later

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 8 / 52
MLR model - Gauss-Markov Assumptions

(A.2) No perfect multicollinearity

No exact linear relationships among the independent variables.
Example 1: Our model cannot be

yi = β1 + β2 xi2 + β3 xi3 + εi where xi3 = 2xi2

(data will not be informative about β2 and β3 ). Instead:

yi = β1 + β2* xi2 + εi where β2* = β2 + 2β3

Classic example of perfect multicollinearity: the dummy variable trap.

Equivalent to assumption: X has full column rank k ≤ n. VN2.4

This ensures that X'X is invertible (see PS1, Q1)

In the simple linear regression model: ∑ni=1 (xi − x̄)² > 0.

In the absence of sample variation, i.e., when xi = c for all i, the model reduces
to yi = α + βc + εi ; α and β are not separately identified.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 9 / 52
MLR model - Gauss-Markov Assumptions

(A.3) Zero conditional mean

E(ε|X) = 0   VN2.5   ⇒ Uncorrelatedness between X and ε

Ensures the exogeneity of the explanatory variables.

Allows us to interpret the regression of y on X as the conditional mean
of y, i.e., E(y|X) = Xβ:

E(y|X) = E(Xβ + ε|X) = Xβ + E(ε|X) = Xβ.

As
∂E(y|X)/∂xj = ∂(β1 x1 + ... + βk xk )/∂xj = βj

βj provides the marginal effect of the explanatory variable xj on the
conditional expectation of y, ceteris paribus (holding everything else
constant).

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 10 / 52


MLR model - Gauss-Markov Assumptions

(A.3) Zero conditional mean (Exogeneity of the explanatory variables)

E(ε|X) = 0 ⇒ Uncorrelatedness between X and ε

The need for the conditioning on the regressors in this assumption is
associated with the fact that the regressors are stochastic (random).
Intuition: In obtaining a new sample {(yi , xi )}, i = 1, ..., n, it is difficult in
general to control (keep fixed) the x's!
If X is fixed (non-stochastic), A.3 reduces to E(ε) = 0.

Convenient assumption:
First pretend we can keep X fixed – we do this by conditioning on X
(impose E(ε|X) = 0 (stronger) instead of E(ε) = 0)
Deal with the stochastic nature of the regressors afterwards
(Use the Law of Iterated Expectations)

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 11 / 52


Law of Iterated Expectations (LIE)

Let x and y be two (vectors of) random variables (rv),
f(x, y) = f(y|x)f(x).

Theorem
Law of iterated expectations

E(h(x, y)) = E[E[h(x, y)|x]]

assuming these expectations exist.

Enables us to deal with each random variable consecutively.

A proof of this uses the definition of (conditional) expectations:

E(h(x, y)) = ∫∫ h(x, y) f(x, y) dy dx
           = ∫∫ h(x, y) f(y|x) f(x) dy dx     using Bayes' theorem
           = ∫ [∫ h(x, y) f(y|x) dy] f(x) dx = ∫ [E(h(x, y)|x)] f(x) dx
           = E(E[h(x, y)|x])

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 12 / 52
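A small simulation (my own illustration, not from the slides) of the LIE with h(x, y) = y: averaging the conditional means E(y|x) over the distribution of x recovers the unconditional mean E(y). The DGP below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# A simple joint DGP: x discrete, y depends on x (illustrative)
x = rng.integers(0, 3, size=n)                 # x in {0, 1, 2}
y = 2.0 * x + rng.normal(size=n)               # E(y|x) = 2x

# Outer expectation of the conditional means, weighted by P(x)
cond_means = np.array([y[x == v].mean() for v in (0, 1, 2)])
probs = np.array([(x == v).mean() for v in (0, 1, 2)])

print(y.mean())                  # E(y)
print(cond_means @ probs)        # E(E(y|x)) -- essentially the same number
```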
MLR model - Gauss-Markov Assumptions

(A.3) Zero conditional mean (Exogeneity of the explanatory variables)
E(ε|X) = 0 ⇒ Uncorrelatedness between X and ε

Assumption ensures: Cov(xis , εj ) = 0 for all i, j = 1, .., n and s = 1, .., k
There is no correlation between the error of observation j and the
regressors of all observations.
Show result using the Law of Iterated Expectations: VN2.5

We also say that the explanatory variables are strictly exogenous.

In a time series setting, this assumption is often deemed to be too
strong as it would not allow shocks (εj ) to affect regressors in the
future (xj+h,s ). More later.

If our sample {(yi , xi )}, i = 1, ..., n, or equivalently {(εi , xi )}, i = 1, ..., n, is drawn
independently (cross-section) then
E(εi |X) = E(εi |xi ).

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 13 / 52


MLR model - Gauss-Markov Assumptions

(A.3) Zero conditional mean (Exogeneity of the explanatory variables)

E(ε|X) = 0 ⇒ Uncorrelatedness between X and ε

Even if {(εi , xi )}, i = 1, ..., n, is drawn independently (random sampling), this
does not preclude correlation between εi and xi .

Independence between the errors ε and regressors X would ensure
the uncorrelatedness as well, but is stronger than needed: we only
need mean independence E(εi |X) = E(εi )!

Simply assuming uncorrelatedness only is insufficient

Independence ⇒ Mean independence ⇒ Uncorrelatedness
The reverse typically doesn't hold

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 14 / 52


MLR model - Gauss-Markov Assumptions

(A.4) Homoskedasticity and no autocorrelation

Var(ε|X) = σ²I

The covariance matrix summarizes the fact that

The variances Var(εi |X) = σ² are constant (homoskedasticity)
[Diagonal terms]
The covariances Cov(εi , εj |X) = 0 for all i ≠ j (no autocorrelation)
[Off-diagonal terms]

By definition

Var(ε|X) = E[(ε − E(ε|X))(ε − E(ε|X))' | X] = E(εε'|X)   using A.3

If X is fixed (non-stochastic), A.4 reduces to Var(ε) = σ²I.

To allow for stochastic regressors, we require Var(ε|X) = σ²I. This
implies Var(ε) = σ²I by the law of iterated expectations.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 15 / 52
MLR model - Gauss-Markov Assumptions

(A.4) Homoskedasticity and no autocorrelation

Var(ε|X) = E(εε'|X) = σ²I   (using A.3)

In cross sectional data (as done last term) we often argue that
{(εi , xi )}, i = 1, ..., n, are drawn independently.
Independence ensures Cov(εi , εj |X) = 0 and Var(εi |X) = Var(εi |xi ).

When Var(εi |xi ) depends on xi we have heteroskedasticity.

E.g., not only the mean E(yi |xi ) but also the variance Var(yi |xi ) of
food expenditures (y) may depend on income (x)
Note: Var(yi |xi ) = Var(xi'β + εi |xi ) = Var(εi |xi ).

Spatial econometrics argues that in cross sectional data there may be
dependence across individuals.
Unobserved characteristics of neighbors (or people from the same
community) may exhibit correlations - may depend on distance between
individuals (geographic or socio-economic)

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 16 / 52
MLR model - Gauss-Markov Assumptions

(A.4) Homoskedasticity and no autocorrelation

Var(ε|X) = E(εε'|X) = σ²I   (using A.3)

In time series data, we do not think it is reasonable to assume
independence of observations in our sample {(yt , xt )}, t = 1, ..., T, and indeed
we expect that {εt }, t = 1, ..., T, is correlated over time.
With time series data we often use subscripts t to denote a particular
observation (time has a natural ordering). With cross sectional data we
use subscript i to denote a particular individual.

By imposing a stationarity assumption (later), we will maintain the
assumption that the variance is homoskedastic.
Nevertheless, we typically want to allow for autocorrelation: in time
series data it is likely that εt and εt+s are correlated.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 17 / 52


MLR model - Gauss-Markov Assumptions

(A.1) True model - linear in parameters: y = Xβ + ε, with E(ε) = 0,
where X = [x1 , x2 , ..., xk ] is the n×k matrix with (i, j) element xij and β = (β1 , ..., βk )'

(A.2) No perfect multicollinearity

(A.3) Zero conditional mean (Exogeneity of the explanatory variables)

E(ε|X) = 0 ⇒ Uncorrelatedness between X and ε

(A.4) Homoskedasticity and no autocorrelation

Var(ε|X) = E(εε'|X) = σ²I   (using A.3)

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 18 / 52


MLR model - Gauss-Markov Assumptions

Graphically:

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 19 / 52


OLS estimator
Multiple Linear Regression model
y = Xβ+ε

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 20 / 52


OLS Estimator I

The OLS estimator β̂ = (β̂1 , ..., β̂k )' minimizes the residual sum of squares:

S(b) = ∑ni=1 (yi − b1 xi1 − b2 xi2 − ... − bk xik )²

The normal equations (FOC) are:
∑ni=1 xi1 ε̂i = 0
...
∑ni=1 xik ε̂i = 0
where ε̂i = yi − β̂1 xi1 − β̂2 xi2 − ... − β̂k xik

Using matrix algebra, we can rewrite the normal equations VN2.6

X'ε̂ = 0,   with dimensions (k×n)(n×1) = (k×1)

Recognizing that ε̂ = y − Xβ̂, this will enable us to solve for β̂. VN2.7

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 21 / 52


OLS Estimator II

Provided X'X is invertible (A.2),

β̂ = (X'X)⁻¹ X'y   ⇐ Our OLS estimator for β!

If there is perfect multicollinearity, X'X is singular (cannot be inverted).
In that case the estimator β̂ cannot be computed.

In the presence of perfect multicollinearity β is not identified.

In the simple linear regression model, we can only estimate the slope
provided ∑ni=1 (xi − x̄)² > 0

Recap VN2.8

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 22 / 52
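A minimal NumPy sketch (my own illustration, not from the slides; all numbers arbitrary) of the matrix formula: build y and X, then solve the normal equations X'Xβ̂ = X'y. Using np.linalg.solve rather than an explicit inverse is numerically preferable but gives the same β̂.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # n x k, first column = intercept
beta_true = np.array([1.0, 0.5, -2.0])                          # illustrative parameters
y = X @ beta_true + rng.normal(size=n)

# OLS: beta_hat = (X'X)^{-1} X'y, computed by solving X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat
print(beta_hat)              # close to beta_true
print(X.T @ resid)           # ~0: the normal equations X'resid = 0 hold
```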


OLS Estimation - Derivation FOC using matrix algebra

OLS estimator minimizes the residual sum of squares (RSS):

S(b) = ∑ni=1 (yi − b1 xi1 − b2 xi2 − ... − bk xik )² = (y − Xb)'(y − Xb)

(Recall: z'z = ∑ni=1 zi²)

1 First: Expand the expression for S(b):

S(b) = y'y − 2b'X'y + b'X'Xb   since b'X'y = y'Xb (a scalar);
the middle term is the linear term, the last term is the quadratic term.

2 Second: Using the result from PS1, Q4, derive the FOC:

∂S(b)/∂b = (∂S(b)/∂b1 , ..., ∂S(b)/∂bk )' evaluated at b = β̂ equals 0   VN2.9

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 23 / 52


OLS estimator = Method of Moments (MM) estimator

The normal equations that define the OLS estimator are

(∑ni=1 xi1 ε̂i , ..., ∑ni=1 xik ε̂i )' = (0, ..., 0)'

These equations are proportional to the requirement that

(1/n) ∑ni=1 xis ε̂i = 0   for s = 1, .., k

They represent the sample analogues of the population requirement
that the errors and regressors are uncorrelated

E(xis εi ) = 0   for s = 1, .., k

This realization tells us that β̂ is also a MM estimator.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 24 / 52
Method of Moments estimator

The starting point of a Method of Moments estimator, as its name
suggests, are Population Moments
(here E(xis εi ) = 0 with εi = yi − xi'β).
Important: we need at least as many moments as parameters of
interest. (Identification)

The MME chooses parameters to ensure they satisfy the
Sample Analogues of the Population Moments
(here (1/n) ∑i xis ε̂i = 0 with ε̂i = yi − xi'β̂MME )

Our GM assumptions give us another population moment condition
that we can use to estimate σ²: E(εi²) = σ².
What would its MM estimator be?

Its sample analogue: σ̂²MME = (1/n) ∑ni=1 ε̂i² = ε̂'ε̂ / n

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 25 / 52


Finite Sample Properties
of
β̂ = (X'X)⁻¹ X'y

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 26 / 52


Classical (Gauss-Markov) Linear Regression Assumptions

(A.1) True model - linear in parameters: y = Xβ + ε, with E(ε) = 0,
where X = [x1 , x2 , ..., xk ] is the n×k matrix with (i, j) element xij and β = (β1 , ..., βk )'

(A.2) No perfect multicollinearity

(A.3) Zero conditional mean (Exogeneity of the explanatory variables)

E(ε|X) = 0 ⇒ Uncorrelatedness between X and ε

(A.4) Homoskedasticity and no autocorrelation

Var(ε|X) = E(εε'|X) = σ²I   (using A.3)

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 27 / 52


Finite Sample Properties of β̂ – Unbiased

Theorem
Under A.1-A.3, β̂ is Unbiased, i.e., E(β̂) = β

β̂ = (X'X)⁻¹ X'y is properly defined under A.2

Proof follows a two step procedure: VN2.10-11

Step 1: Plug in the true model – A.1

β̂ = β + (X'X)⁻¹ X'ε

Step 2: Take expectations – use assumption A.3

The proof is easiest under the assumption that X is fixed (deterministic). The
more reasonable setting where X is stochastic requires the use of the
law of iterated expectations.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 28 / 52
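Writing out the two steps (a standard fill-in of my own for the proof sketched above, using A.1 in step 1 and A.3 plus the Law of Iterated Expectations in step 2):

```latex
\hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon
\quad\Rightarrow\quad
E(\hat\beta \mid X) = \beta + (X'X)^{-1}X'\,E(\varepsilon \mid X) = \beta,
\qquad
E(\hat\beta) = E\big[E(\hat\beta \mid X)\big] = \beta .
```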


Finite Sample Properties of β̂ – Variance

Theorem
Under A.1-A.4, Var(β̂|X) = σ²(X'X)⁻¹

To obtain this result we use Assumption A.4 (homoskedasticity and
no autocorrelation).

We recall that Var(β̂|X) is the k×k matrix with Var(β̂j |X) on the diagonal
and Cov(β̂i , β̂j |X) on the off-diagonal.

By analogy to the proof of unbiasedness:
Step 1: Plug in the true model – A.1
β̂ = β + (X'X)⁻¹ X'ε

Step 2: Obtain the variance – uses Assumption A.4 VN2.12-13

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 29 / 52


Finite sample properties of β̂

Unbiasedness and Variability are two important finite sample properties
that reveal how the realizations of β̂ behave under repeated sampling.

Theorem
Under A.1-A.3, β̂ is unbiased, i.e., E(β̂) = β

Unbiasedness ensures that we will not make systematic errors when
estimating a parameter (on average correct).

Theorem
Under A.1-A.4, Var(β̂|X) = σ²(X'X)⁻¹

Question: What does Var(β̂|X) = σ²(X'X)⁻¹ tell us about aspects
that will help lower the variability of our estimates under repeated
sampling (improve the precision of our parameter estimates)?

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 30 / 52
Variance of β̂ – Simple linear regression model

In PS2, Q1, you are asked to show that Var(β̂|X) = σ²(X'X)⁻¹,
when applied to the simple linear regression setting

yi = β1 + β2 xi + εi ,

yields the following variance of the slope parameter: VN2.14

Var(β̂2 |X) = σ² / ∑ni=1 (xi − x̄)² = σ² / (n sx²)

with sx² = (1/n) ∑ni=1 (xi − x̄)² > 0 (sample variance of the regressor)

Reveals that we can increase the precision (reduce the variability of our
estimator in repeated sampling) by:
1 Increasing the sample size (n).
2 Having a greater variability (excitation) of the regressors (sx²).
3 Having a lower variance of the errors (σ²).

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 31 / 52
Variance of β̂ – Near Multicollinearity

In the multiple linear regression model we also need to consider the
possibility that regressors may be highly correlated.
When regressors are highly correlated, it is difficult to disentangle the
separate effects of the regressors on the dependent variable.
This is reflected in high variances of our estimates.

Consider yi = α + xi1 β1 + xi2 β2 + εi ; we can show (see PS3, Q2)

Var(β̂j |X) = σ² / [∑ni=1 (xij − x̄j )² (1 − r²x1,x2 )] = σ² / [n sx²j (1 − r²x1,x2 )]

with sx²j = (1/n) ∑ni=1 (xij − x̄j )² and rx1,x2 the sample correlation

This shows that as r²x1,x2 → 1 the variances become very large! Add:
4 Ensuring regressors are not highly collinear.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 32 / 52
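To see the formula at work, a short Monte Carlo sketch (my own illustration, with arbitrary parameter values): estimate the two-regressor model repeatedly with weakly and with strongly correlated regressors and compare the dispersion of β̂1 across replications.

```python
import numpy as np

rng = np.random.default_rng(6)

def slope_dispersion(rho, n=200, n_reps=2000):
    """Std. dev. of beta1_hat across replications when corr(x1, x2) = rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    draws = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        X = np.column_stack([np.ones(n), x])
        y = 1.0 + 0.5 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        draws[r] = beta_hat[1]
    return draws.std()

print(slope_dispersion(rho=0.1))   # modest sampling variability
print(slope_dispersion(rho=0.95))  # much larger: (1 - r^2) in the denominator is tiny
```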
Variance of β̂ – Estimation and Standard errors

To estimate the variance of β̂ we need to replace the unknown error
variance σ² by an estimate.
The (unbiased) estimator for σ² used is given by¹

s² = (1/(n − k)) ∑ni=1 ε̂i² = ε̂'ε̂ / (n − k),   E(s²) = σ²

The square root of s², s, is called the standard error of the regression.

The estimated variance-covariance matrix:

Var̂(β̂) = s²(X'X)⁻¹

Its jth diagonal element is the estimated variance of β̂j , the square
root of which gives the standard error of β̂j :

SE(β̂j ) = √(Var̂(β̂j ))   VN2.15

¹ In Wooldridge (2013) you will find the formula s² = SSR/(n − k − 1). There, the
number of parameters equals k (slopes) + 1 (intercept). Here it equals k.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 33 / 52
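Continuing the NumPy sketch from the OLS estimator above (again my own illustration, with arbitrary values), the estimated covariance matrix and the standard errors follow directly from the residuals:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# s^2 = resid'resid / (n - k) is the unbiased estimator of sigma^2
s2 = resid @ resid / (n - k)

# Estimated covariance matrix s^2 (X'X)^{-1}; SEs are the square roots of its diagonal
var_beta_hat = s2 * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(var_beta_hat))

print(beta_hat)
print(std_errors)
```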
Finite Sample Properties of β̂ – Efficiency: BLUE

Theorem
Gauss-Markov Theorem: Under A.1-A.4, the OLS estimator β̂ is the best
linear unbiased estimator (BLUE) of β.

Proof: Let us show that β̂ = (X'X)⁻¹ X'y is BLUE and assume (for
simplicity) that X is fixed (the result holds when X is stochastic too). We
use the following three steps, details VN2.16-17 .

Step 1: Introduce another linear estimator for β, β̃ = Cy,
where C is k×n and y is n×1 so that β̃ is k×1.

Step 2: Need to ensure β̃ is unbiased, i.e., E(β̃) = β.
This will impose a restriction on the C matrix.

Step 3: Verify that β̂ is efficient. Need to show

Var(β̂) ≤ Var(β̃)

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 34 / 52
Finite Sample Properties of β̂ – Efficiency: BLUE II

Corollary: For any vector of constants, c, the minimum variance linear
unbiased estimator of c'β in the classical regression model (given
assumptions A.1–A.4) is c'β̂, where β̂ is the least squares estimator:

Var(c'β̂) ≤ Var(c'β̃)

The BLUE of c'β is c'β̂!
Each coefficient βj is estimated at least as efficiently by β̂j as by any
other linear unbiased estimator.

Taking c = (0, ..., 0, 1, 0, ..., 0)' with the 1 in the jth position gives c'β = βj .

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 35 / 52


Finite Sample Properties of β̂ – Sampling Distribution I

(A.5) Normality: ε|X ∼ N(0, σ²I)

In order to obtain the sampling distribution (pdf) for β̂, we need to
add an assumption about the distribution of the errors.
We assume that the εi (conditional on X) are independent normal.
We can ignore the conditioning again when X is fixed.
The sampling distribution for β̂ plays an important role when testing
hypotheses!

The multiple linear regression model under Assumptions A.1-A.5
(Gauss-Markov + normality) is often referred to as the Classical
Linear Regression Model.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 36 / 52


Finite Sample Properties β̂ – Sampling Distribution II
(A.5) Normality: ε | X ~ N(0, σ²I)
The exact sampling distribution for β̂ then follows:
Theorem
Under A.1–A.4 and A.5, β̂ | X ~ N(β, σ²(X'X)⁻¹)
The proof requires little work, recalling β̂ = β + (X'X)⁻¹X'ε.
Any linear combination of normal random variables is normal, and we already established the mean and variance.
Without Assumption A.5, we may still be able to rely on this result being a good approximation (CLT) - later.
All marginals are normal as well:
β̂ⱼ | X ~ N(βⱼ, σ²cⱼⱼ), where cⱼⱼ = [(X'X)⁻¹]ⱼⱼ,
i.e. cⱼⱼ is the (j, j) element of the (X'X)⁻¹ matrix.


Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 37 / 52
Finite Sample Properties β̂ – E¢ ciency: MVUE

(A.5) Normality: ε | X ~ N(0, σ²I)
Adding normality allows us to show that the OLS estimator equals the Maximum Likelihood estimator: β̂ = β̂_MLE
To use MLE we need to fully describe the pdf of y | X (more than the GM assumptions is needed).
The result depends critically on the assumption of normality! β̂_MLE will be different under different distributional assumptions.
Given this extra assumption we can argue that OLS is best among all unbiased – linear and non-linear – estimators.
Theorem
Under A.1-A.5, the OLS estimator β̂ is the minimum variance unbiased estimator
(MVUE) of β.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 38 / 52


Maximum Likelihood Estimator (more later)
To use the MLE procedure to estimate β and σ² we need the joint (conditional) density of the data y₁, ..., yₙ.
While GM makes assumptions about the mean and variance/covariance of our errors, its (joint) distribution is left unspecified.
Given Assumption A.5: y | X ~ N(Xβ, σ²I)
The joint density of the observations y₁, ..., yₙ (which defines the likelihood function) is [using independence]:
$$ L(\beta,\sigma^2) \equiv f(y_1,\ldots,y_n;\beta,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - x_i'\beta)^2} $$
The MLE chooses β and σ² so as to maximize the likelihood or, equivalently, the log-likelihood:
$$ \log L(\beta,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i'\beta)^2 $$

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 39 / 52


Maximum Likelihood Estimator (more later)

The log-likelihood function includes the component ∑ᵢ(yᵢ − xᵢ'β)², which gives (for given β) the sum of squared residuals S(β):
$$ \ln L(\beta,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2} S(\beta) $$
The first order conditions are (Proficiency Quiz, Q8):
$$ \frac{\partial}{\partial\beta}: \;\; -\frac{1}{2\hat\sigma^2_{MLE}} \left.\frac{\partial S(\beta)}{\partial\beta}\right|_{\hat\beta_{MLE}} = 0 $$
$$ \frac{\partial}{\partial\sigma^2}: \;\; -\frac{n}{2\hat\sigma^2_{MLE}} + \frac{1}{2\hat\sigma^4_{MLE}} S(\hat\beta) = 0 $$
The first set of FOCs is the same as the FOC of OLS ⇒
Under normality: β̂_MLE = β̂_OLS = β̂_MME
The last FOC yields σ̂²_MLE = (1/n)∑ᵢ ε̂ᵢ² = ε̂'ε̂/n = σ̂²_MME
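A minimal numerical sketch (not from the slides; data and seed are made up) checking that maximising the normal log-likelihood numerically reproduces the OLS estimates, as the FOCs above imply:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

def neg_loglik(theta):
    beta, log_s2 = theta[:2], theta[2]     # parameterise sigma^2 = exp(log_s2) > 0
    s2 = np.exp(log_s2)
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * s2) + 0.5 * resid @ resid / s2

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(beta_ols, res.x[:2])                 # the two should agree up to numerical tolerance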

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 40 / 52


Estimation methods - summary

Let me summarize the three estimation principles we have discussed


for parameters of the Multiple Linear Regression Model
Recap: VN2.18

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 41 / 52


Algebraic/Geometric Aspects
of the
OLS estimator

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 42 / 52


Geometric Aspects of the OLS

Recall the normal equations

$$ X'y - X'X\hat\beta = X'(y - X\hat\beta) = X'\hat\varepsilon = 0. $$
These normal equations imply that the residuals and regressors are orthogonal to each other.
Recall: the vectors w and z are orthogonal iff w'z = 0.
If our model contains an intercept, which is equivalent to saying that one of the X columns is filled with 1's, this gives us the following results:
∑ᵢ ε̂ᵢ = 0 : the least squares residuals sum to zero
mean(ŷ) = ȳ : the fitted values and the observed values have the same mean
ȳ = x̄'β̂ : the fitted regression "line" passes through the point of means

Review these statements for the simple regression model. VN2.19

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 43 / 52


Algebraic and Geometric Aspects of the Solution I

Model: y = Xβ + ε
Estimator: β̂ = (X'X)⁻¹X'y
Fitted values: ŷ = Xβ̂ = X(X'X)⁻¹X'y
Residuals: ε̂ = y − ŷ = y − X(X'X)⁻¹X'y = (Iₙ − X(X'X)⁻¹X')y
We introduce the following two matrices:
P = X(X'X)⁻¹X'
M = Iₙ − X(X'X)⁻¹X' = Iₙ − P
This allows us to write
Fitted values: ŷ = Py   (P: projection matrix)
Residuals: ε̂ = My   (M: residual maker)

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 44 / 52


Algebraic and Geometric Aspects of the Solution II

P = X(X'X)⁻¹X'
M = Iₙ − X(X'X)⁻¹X' = Iₙ − P
Properties of the P and M matrices (PS1, Q2):
Symmetric and idempotent
Projection: PX = X(X'X)⁻¹X'X = X  ⇒  MX = 0 (M is orthogonal to X)
Orthogonal: PM = 0
These matrices (and their properties) are useful in various proofs.
E.g., to show that the OLS residuals and the fitted values are orthogonal, i.e., ŷ'ε̂ = 0:
ŷ'ε̂ = (Py)'(My) = y'P'My = y'(PM)y = 0
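For concreteness, a small numpy sketch (illustrative only, random made-up data) verifying these properties numerically:

import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

print(np.allclose(P, P.T), np.allclose(P @ P, P))    # symmetric, idempotent
print(np.allclose(M @ X, 0), np.allclose(P @ M, 0))  # MX = 0, PM = 0
y_hat, e_hat = P @ y, M @ y
print(np.isclose(y_hat @ e_hat, 0))                  # fitted values orthogonal to residuals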

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 45 / 52


OLS: Orthogonal Projection - Graphical Discussion

OLS has a very nice geometric interpretation: it is an orthogonal


projection!

Fitted values: yb = Py
Projects y onto the space of all linear combination of the regressors

Residuals bε = My are orthogonal to the projection space


Hence we say orthogonal projection. VN2.20

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 46 / 52


Estimator for σ2
We use the following estimator of σ2
$$ s^2 = \frac{RSS}{n-k} = \frac{\hat\varepsilon'\hat\varepsilon}{n-k} $$
Using the matrix notation ε̂ = My, we note:
RSS = ε̂'ε̂ = (My)'(My) = y'My   (M symmetric/idempotent)
 = (Xβ + ε)'M(Xβ + ε)   (plugging in y = Xβ + ε)
 = ε'Mε   (since MX = 0, X'M = 0)
⇒ s² = ε'Mε / (n − k)   (using A.1)
Recall, using s² we can obtain an estimate of the variance of β̂:
$$ \widehat{Var}(\hat\beta \mid X) = s^2 (X'X)^{-1} $$

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 47 / 52


Finite sample properties of β̂

We completed our discussion of the …nite sample properties of the


1
OLS (MM) estimator β̂ = (X 0 X ) X 0 y

Theorem
Under A.1–A.5, β̂ | X ~ N(β, σ²(X'X)⁻¹)
OLS is an orthogonal projection: ŷ = Py, with P = X(X'X)⁻¹X'
Residuals: ε̂ = My, with M = I − X(X'X)⁻¹X'
Residuals are orthogonal to ŷ: ŷ'ε̂ = 0

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 48 / 52


Finite Sample Properties
of

$$ s^2 = \frac{RSS}{n-k} = \frac{\hat\varepsilon'\hat\varepsilon}{n-k} $$

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 49 / 52


Finite Sample Properties s 2 – Unbiased

Theorem
Under A.1–A.4, s² is unbiased, i.e., E(s²) = σ²
$$ s^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-k} $$
Proof follows a two-step procedure: VN2.21-22
Step 1: Simplify s² (i.e., write it in terms of ε) [plug in the true model]:
$$ s^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-k} = \frac{\varepsilon' M \varepsilon}{n-k} $$
Step 2: Take expectations (recognizing s² is a scalar!)
The proof is easiest under the assumption that X is fixed (deterministic). The more reasonable setting, where X is stochastic, requires the use of the law of iterated expectations.

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 50 / 52


Finite Sample Properties s 2 – Sampling Distribution

We need an extra assumption to obtain sampling distribution of s 2


Theorem
Under A.1–A.4 and A.5, (n − k)σ⁻²s² ~ χ²ₙ₋ₖ
The proof uses the fact that, using A.1, we can write s² = ε'Mε/(n − k). This gives us
$$ (n-k)\sigma^{-2} s^2 = \frac{\varepsilon' M \varepsilon}{\sigma^2}. $$
Quadratic form of normal rv's; M is a symmetric, idempotent matrix with rank(M) = n − k (see PS2, Q3) VN2.23
$$ \frac{\varepsilon' M \varepsilon}{\sigma^2} \sim \chi^2_{n-k} \quad \text{if } \varepsilon \sim N(0,\sigma^2 I) \;[X, M \text{ fixed}] $$
If X is stochastic, we first claim ε'Mε/σ² | X ~ χ²ₙ₋ₖ (pretend X fixed).
Accounting for the randomness of X, the distribution remains the same. VN2.24

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 51 / 52


Zero-covariance

For testing purposes it is useful to note that under the GM assumptions plus normality: β̂ and s² are independent.
Proof: Due to our normality assumption, it suffices to prove that β̂ and ε̂ are uncorrelated.
By normality, uncorrelatedness implies independence.
If β̂ is independent of ε̂, it is also independent of any function of ε̂, such as s²!
Show Cov(β̂, ε̂) = 0 (for simplicity treat X as fixed) VN2.25

Dr M. Schafgans (LSE) EC221: Multiple Linear Regression 52 / 52


EC221: Principles of Econometrics
Multiple Linear Regression Model and
Speci…cation Issues

Dr M. Schafgans

London School of Economics

Lent 2022

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 1 / 22


Partitioned Regression
Frish-Waugh-Lovell Theorem
Regression Anatomy

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 2 / 22


Partitioned Regression

Let us partition the matrix X with the k explanatory variables into X₁ (n × k₁) and X₂ (n × k₂), where k₁ + k₂ = k:
$$ y = X\beta + \varepsilon = \begin{bmatrix} X_1 & X_2 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \varepsilon \quad \text{or} \quad y = X_1\beta_1 + X_2\beta_2 + \varepsilon $$
Reason:
Interest may only be in a subset of parameters;
Allows us to deal with computational issues when k₁ + k₂ is very big (X'X is a very large matrix to invert), e.g. fixed effect panel data models.
We are interested in the partitioned regression estimators (β̂₁, β̂₂).

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 3 / 22


Partitioned Regression Estimator - derivation using FOC

The partitioned regression estimators for β₁ and β₂ can be derived by solving the FOC:
$$ X'X\hat\beta = X'y \;\Rightarrow\; \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = \begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix} \quad \text{VN3.1} $$
We can expand this expression to yield two sets of equations which can be solved for (β̂₁, β̂₂):
X₁'X₁β̂₁ + X₁'X₂β̂₂ = X₁'y
X₂'X₁β̂₁ + X₂'X₂β̂₂ = X₂'y
Yielding
β̂₁ = (X₁'M₂X₁)⁻¹X₁'M₂y, with M₂ = I − X₂(X₂'X₂)⁻¹X₂'
β̂₂ = (X₂'M₁X₂)⁻¹X₂'M₁y, with M₁ = I − X₁(X₁'X₁)⁻¹X₁'
(Show: PS3 Q2a)


Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 4 / 22
Simple example: Partitioned Regression
Write the simple linear regression model in partitioned regression form:
y = β₁ + β₂x + ε;  let X₁ = (1, ..., 1)' = ι (n × 1) and X₂ = (x₁, ..., xₙ)' = x
Use the partitioned regression formula to define the estimator for the slope:
$$ \hat\beta_2 = (X_2'M_1X_2)^{-1}X_2'M_1y = (x'M_1x)^{-1}x'M_1y, \quad \text{where } M_1 = I_n - \iota(\iota'\iota)^{-1}\iota' = I_n - \tfrac{1}{n}\iota\iota' $$
Recall: M₁z = (z₁ − z̄, ..., zₙ − z̄)', where z̄ = (1/n)∑ᵢ zᵢ (PS1, Q2)
x'M₁x = (M₁x)'(M₁x) = ∑ᵢ(xᵢ − x̄)²
x'M₁y = (M₁x)'(M₁y) = ∑ᵢ(xᵢ − x̄)(yᵢ − ȳ)
Hence
$$ \hat\beta_2 = \frac{\sum_{i=1}^n (x_i-\bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} $$
as expected!
Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 5 / 22
Partitioned Regression (recap)

y = X1 β 1 + X2 β 2 + ε

While β̂ = (X'X)⁻¹X'y, the partitioned OLS estimator is
β̂₁ = (X₁'M₂X₁)⁻¹X₁'M₂y, with M₂ = I − X₂(X₂'X₂)⁻¹X₂'
Computational advantages (the dimension of X'X may be huge)
If X₁'X₂ = 0, it is easy to ignore one set of regressors as
β̂₁ = (X₁'M₂X₁)⁻¹X₁'M₂y = (X₁'X₁)⁻¹X₁'y
lwageᵢ = α₁maleᵢ + α₂femaleᵢ + εᵢ,  i = 1, ..., n
yₜ = β₁s₁ₜ + β₂s₂ₜ + β₃s₃ₜ + β₄s₄ₜ + εₜ,  t = 1, ..., T
In general X₁'X₂ ≠ 0, so we cannot simply ignore a set of regressors (OVB)

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 6 / 22


Partitioned Regression - Orthogonal Regressors

y = X β + ε = X1 β 1 + X2 β 2 + ε

If X₁ ⊥ X₂ (X₁'X₂ = 0) then the partitioned regression estimator of β₁ simplifies VN3.2
β̂₁ = (X₁'M₂X₁)⁻¹X₁'M₂y = (X₁'X₁)⁻¹X₁'y
Implication: to obtain the parameter estimates for β₁ we can ignore the presence of X₂ if X₁'X₂ = 0
E.g., yₜ = α₁s₁ₜ + α₂s₂ₜ + ... + α₄s₄ₜ + εₜ, where the sⱼ are seasonal dummies:
sⱼₜ = 1 if the observation falls in the j-th quarter, 0 otherwise
To estimate α₁ we can simply regress: yₜ = α₁s₁ₜ + vₜ
In general, though, X₁'X₂ ≠ 0!

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 7 / 22


Partitioned Regression - Frisch-Waugh-Lovell Theorem I

y = X β + ε = X1 β 1 + X2 β 2 + ε

To allow us to "drop X₂" when estimating β₁, we pre-multiply our model by M₂ (using M₂X₂ = 0):
M₂y = M₂X₁β₁ + error, with error = M₂ε
or y* = X₁*β₁ + error
This is a residual based model:
y* = M₂y contains the residuals from a regression of y on X₂
X₁* = M₂X₁ contains the residuals from k₁ regressions of the columns of X₁ on X₂.
FWL: We can obtain β̂₁ by performing OLS on the residual based model. VN3.3
The residual based regression has the correlation that X₁ and y have with X₂ removed before estimating β₁.
Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 8 / 22
Partitioned Regression - Frisch-Waugh-Lovell Theorem II
The precision of β̂₁ (Show: PS3, Q2b) is given by
Var(β̂₁ | X) = σ²(X₁'M₂X₁)⁻¹
To obtain SEs we need an unbiased estimator for σ²:
s² = ε̂'ε̂/(n − k₁ − k₂), with ε̂ = y − X₁β̂₁ − X₂β̂₂
FWL: The residuals from the residual based regression
y* = X₁*β₁ + error
are identical to the residuals from the full regression
y = X₁β₁ + X₂β₂ + ε
We don't need to explicitly estimate β̂₂ to obtain SEs!
s² = (êrror)'(êrror)/(n − k₁ − k₂), with êrror = y* − X₁*β̂₁

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 9 / 22


Partitioned Regression – Frisch Waugh Lovell (Summary)
Partition the matrix X: X = [X₁ : X₂]
y = Xβ + ε = X₁β₁ + X₂β₂ + ε
β̂₁ = (X₁'M₂X₁)⁻¹X₁'M₂y, with M₂ = I − X₂(X₂'X₂)⁻¹X₂'
Theorem (Frisch-Waugh-Lovell)
We can obtain β̂₁ by estimating the "residual based model"
y* = X₁*β₁ + error, with y* = M₂y and X₁* = M₂X₁
The residual sums of squares are identical, as êrror = ε̂.
For the computation of SEs we need to recognize that we have used, explicitly or implicitly, k₁ + k₂ regressors.
While we can also obtain β̂₁ by estimating y = X₁*β₁ + e (using the original y rather than y*), its residuals will not be correct for obtaining s².
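A numerical sketch of the FWL result (illustrative only; the data-generating process is made up): the coefficient on X₁ from the full regression equals the coefficient from the residual-based regression.

import numpy as np

rng = np.random.default_rng(4)
n = 500
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X1 = (X2 @ np.array([0.3, 0.7]) + rng.normal(size=n)).reshape(-1, 1)  # correlated with X2
y = 2.0 * X1[:, 0] + X2 @ np.array([1.0, -1.0]) + rng.normal(size=n)

X = np.hstack([X1, X2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]            # full regression

M2 = np.eye(n) - X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
y_star, X1_star = M2 @ y, M2 @ X1                           # residuals w.r.t. X2
beta_fwl = np.linalg.lstsq(X1_star, y_star, rcond=None)[0]  # residual-based regression

print(beta_full[0], beta_fwl[0])                            # the two estimates of beta_1 coincide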
Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 10 / 22
Partitioned Regression - example 2
Consider the partitioned regression of the model (Quarterly data)
y = S γ + X δ + ε, where
S = [s₁ₜ  s₂ₜ  s₃ₜ  s₄ₜ]   (matrix of seasonal dummies)
X = [economic variables (NO INTERCEPT)]
Let us interpret the residual based regression we can use to estimate the causal effect δ:
MS y = MS X δ + error
deseasonalisation:
MS y are the residuals from running a regression of y on the season
dummies (also called the deseasonalised y variable). VN3.4
You are asked to discuss this further in PS3, Q1
Many time series processes are already deseasonalized before we can
access them.
Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 11 / 22
Partitioned Regression - example 3 (extension)

Panel data model, with unobserved heterogeneity

yᵢₜ = αᵢ + zᵢₜ'β + εᵢₜ,  i = 1, ..., n, and t = 1, ..., T
Frisch-Waugh-Lovell tells us that
β̂ = (Z'M_D Z)⁻¹Z'M_D y
where D is a matrix of dummy variables (one for each individual)
The residual based regression that provides this result is here
yᵢₜ − ȳᵢ = (zᵢₜ − z̄ᵢ)'β + error, with ȳᵢ = (1/T)∑ₜ yᵢₜ
Intuition: By using group-mean differenced variables we can control for unobserved characteristics that are constant over time
Details: VN3.5-8

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 12 / 22


Speci…cation issue:
Omission of Relevant Variables

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 13 / 22


Omission of Relevant Variables I
Classic omissions are: special factors such as strikes, wars; seasonality: the Christmas effect; and dynamics.
Consider the setting of deterministic regressors (X, Z) (for simplicity).
Consider the following two models:
True model ("long regression"): y = Xβ + Zδ + ε,  E(ε) = 0, Var(ε) = σ²I
Misspecified model which we estimate ("short regression"): y = Xβ + v
The OLS estimator based on the estimated model is β̃ = (X'X)⁻¹X'y

This estimator, in general, is biased.


The reported standard errors (and test statistics) are invalid.
Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 14 / 22
Omission of Relevant Variables II

Let us derive this omitted variable bias (OVB)


Plug in the true model:
$$ \tilde\beta = (X'X)^{-1}X'(X\beta + Z\delta + \varepsilon) = \beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'\varepsilon $$
Taking expectations (assuming X, Z fixed):
$$ E\tilde\beta = E\big(\beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'\varepsilon\big) = \beta + \underbrace{(X'X)^{-1}X'Z\delta}_{BIAS} + (X'X)^{-1}X'E(\varepsilon) $$
Observe: Necessary conditions for omitted variable bias:
δ ≠ 0, i.e., Z is relevant
X'Z ≠ 0, i.e., X and Z are correlated
The bias reflects the indirect effect X has on the expected value of y associated with changes in Z
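A Monte Carlo sketch of omitted variable bias (illustrative only; the true coefficients and correlation are hypothetical): the short regression of y on x is biased for β when z is omitted and x, z are correlated.

import numpy as np

rng = np.random.default_rng(5)
beta, delta = 1.0, 2.0
est_short, est_long = [], []
for _ in range(2000):
    n = 200
    x = rng.normal(size=n)
    z = 0.8 * x + rng.normal(size=n)             # z correlated with x
    y = beta * x + delta * z + rng.normal(size=n)
    est_short.append((x @ y) / (x @ x))          # short regression slope
    est_long.append(np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)[0][0])
print(np.mean(est_short), np.mean(est_long))     # ~ beta + 0.8*delta  vs  ~ beta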

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 15 / 22


Omission of Relevant Variables III

Let us consider the variance of β̃ = (X'X)⁻¹X'y, recalling
β̃ = β + (X'X)⁻¹X'Zδ + (X'X)⁻¹X'ε
$$ Var(\tilde\beta) = Var\big(\beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'\varepsilon\big) = (X'X)^{-1}X'Var(\varepsilon)X(X'X)^{-1} = \sigma^2(X'X)^{-1} \;\text{ as } Var(\varepsilon)=\sigma^2 I_n $$
The estimator of σ² using the short regression equals
s̃² = v̂'v̂/(n − k_X), with v̂ = y − Xβ̃
Unfortunately, this estimator will not be an unbiased estimator of σ², hence the t-statistics and F-statistics based on s̃²(X'X)⁻¹ will be INVALID. VN3.9
Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 16 / 22
Example of Omitted Variable Bias
Production Function - Cobb-Douglas

The Cobb-Douglas function, with K and L as factors of production:


ln Yi = a0 + a1 ln Ki + a2 ln Li + vi
Consider the impact of omitting a relevant factor of production: M
managerial input
ln Yi = β0 + β1 ln Ki + β2 ln Li + β3 ln Mi + εi true model
OVB results when managerial input is correlated with K and/or L.
The estimated elasticities of capital and labour on output, â₁ and â₂, will incorporate the effect managerial input has on output:
ln Mᵢ = γ₁ + γ₂ ln Kᵢ + γ₃ ln Lᵢ + eᵢ
What impact will it have on the returns to scale, s = β₁ + β₂ + β₃, which will be estimated as â₁ + â₂?
Bias = E(â₁ + â₂ − s)
See also PS4 Q2
Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 17 / 22
Speci…cation issue:
Inclusion of Irrelevant Variables

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 18 / 22


Inclusion of irrelevant variables I
It is less of a problem than omission of relevant variables.
True model:
y = X β + v , E (v ) = 0, Var (v ) = σ2 I
Misspecified model which we estimate:
y = Xβ + Zδ + ε
The OLS estimator based on the estimated model is
$$ \tilde{\tilde\beta} = (X'M_ZX)^{-1}X'M_Zy $$
(partitioned regression formula)
This estimator is still unbiased.
We can trust the SEs and inference will be valid.
Leaving out irrelevant regressors (imposing correct restrictions), though, would yield more efficient parameter estimates (power!)
Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 19 / 22
Inclusion of irrelevant variables
Let us convince ourselves of the unbiasedness.
Plug in the true model:
$$ \tilde{\tilde\beta} = (X'M_ZX)^{-1}X'M_Z(X\beta + v) = \beta + (X'M_ZX)^{-1}X'M_Zv $$
Take expectations (X, Z deterministic):
$$ E\tilde{\tilde\beta} = E\big(\beta + (X'M_ZX)^{-1}X'M_Zv\big) = \beta + (X'M_ZX)^{-1}X'M_ZE(v) = \beta $$
Can we trust SEs? The variance of β̃̃ is given by
$$ Var(\tilde{\tilde\beta}) = Var\big(\beta + (X'M_ZX)^{-1}X'M_Zv\big) = (X'M_ZX)^{-1}X'M_ZVar(v)M_Z'X(X'M_ZX)^{-1} = \sigma^2(X'M_ZX)^{-1} $$
and our estimate of σ², s² = ε̂'ε̂/(N − k_X − k_Z), is unbiased! VN3.10 ; we can trust the SEs!
Efficiency loss: σ²(X'M_ZX)⁻¹ is bigger than σ²(X'X)⁻¹! VN3.11

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 20 / 22


Selecting Regressors

It is good practice to select the set of potentially relevant variables on


the basis of economic arguments rather than statistical ones.
There is always a small (but not ignorable) probability of drawing the wrong conclusion.
For example, there is always a probability of rejecting the null hypothesis that a coefficient is zero while the null is actually true.
Such type I errors are likely to happen more often than intended if we use a sequence of many tests to select the regressors to include in the model.
This process is referred to as data mining.
This process is referred to as data mining.

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 21 / 22


Selecting Regressors

In presenting your estimation results, it is not a 'sin' to have insignificant variables included in your specification.
Of course, you should be careful with including many variables in your model that are multicollinear, so that, in the end, almost none of the variables appear individually significant.
For selecting regressors the goodness-of-fit measure R² = 1 − RSS/TSS may not be that suitable.
Reason: the R² cannot go down if we add additional regressors (why?)
A measure that allows for a tradeoff between fit and parsimony is the adjusted R²:
$$ \bar R^2 = 1 - \frac{RSS/(n-k)}{TSS/(n-1)} $$

Dr M. Schafgans (LSE) EC221: CLRM and Speci…cation Issues 22 / 22


EC221: Principles of Econometrics
Multiple Linear Regression and Hypothesis Testing

Dr M. Schafgans

London School of Economics

Lent 2022

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 1 / 32


Hypothesis Testing under Assumptions A1-A5
Using the exact sampling distributions of b
β and s 2 (under CLRM
assumptions) we can proceed to develop tests for hypotheses
regarding β and σ2 .
For simplicity we treat X as …xed in this handout. (Tests are identical,
whether we treat X as …xed or stochastic.)

1 Exact sampling distribution of β̂:
β̂ ~ N(β, σ²(X'X)⁻¹)
We note: β̂ᵢ ~ N(βᵢ, σ²cᵢᵢ), where cᵢᵢ = [(X'X)⁻¹]ᵢᵢ
2 Exact sampling distribution of s²:
(n − k)σ⁻²s² ~ χ²ₙ₋ₖ
3 Independence of β̂ and s²
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 2 / 32
Hypothesis Testing
Single Linear Restrictions - t-test
Under GM assumptions + normality

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 3 / 32


Hypothesis of single parameter
Let us test a specific hypothesis about the true value of the unknown parameter, say β₂ (scalar):
H₀: β₂ = 5
Specify the alternative hypothesis:
Hₐ: β₂ ≠ 5  (2-sided)
Hₐ': β₂ > 5 or Hₐ'': β₂ < 5  (1-sided)
Obviously we'd want to reject if β̂₂ is not close to 5.
What is far depends on the sampling distribution of β̂₂ under H₀.
Unfortunately this distribution is not nice to work with directly (which values of β̂₂ are too far away from 5 for a given significance level?)
Under H₀: β̂₂ ~ N(5, σ²c₂₂), where c₂₂ = [(X'X)⁻¹]₂₂
Discussion visualizer: VN4.1-4

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 4 / 32


T-test: Hypothesis of single parameter I
If σ² is known, we can use the test statistic
$$ z = \frac{\hat\beta_2 - 5}{\sigma\sqrt{c_{22}}} \sim N(0,1) \text{ under } H_0 $$
to test our hypothesis.
The rejection rule (two-sided test, α = 5%):
Reject H₀ if |β̂₂ − 5| / (σ√c₂₂) > z_{α/2} = 1.96   (σ² assumed known).
If σ² is unknown, we instead use the test statistic
$$ t = \frac{\hat\beta_2 - 5}{s\sqrt{c_{22}}} = \frac{\hat\beta_2 - 5}{SE(\hat\beta_2)} \sim t_{n-k} \text{ under } H_0. $$
The rejection rule (two-sided test):
Reject H₀ if |β̂₂ − 5| / (s√c₂₂) > t_{n−k,α/2}   (σ² assumed unknown).

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 5 / 32


T-test: Hypothesis of single parameter II
Derivation of the t distribution:
$$ t = \frac{\hat\beta_2 - 5}{s\sqrt{c_{22}}} = \frac{(\hat\beta_2 - 5)/(\sigma\sqrt{c_{22}})}{\sqrt{\big[(n-k)\sigma^{-2}s^2\big]/(n-k)}} \sim t_{n-k} \text{ under } H_0 $$
The numerator is N(0,1) and (n − k)σ⁻²s² in the denominator is χ²ₙ₋ₖ, using the independence of β̂ and s².
The distribution changes to take account of the additional imprecision
associated with the fact that we need to use an estimator for σ2 .
For small n, the critical values are typically larger when σ2 is not known
The t distribution has fatter tails than the N (0, 1) distribution
Deviations from the null need to be bigger before you start rejecting to
ensure the signi…cance level remains the same.
For large n, the critical values are identical, σ2 is known or not
Due to the fact that the t distribution converges to the N (0, 1)
distribution as the degrees of freedom increase!
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 6 / 32
T-test / p-values: Hypothesis of single parameter

Regression output, typically provides us with t-ratios aside from the


coe¢ cient estimates.
These are the t tests of the hypothesis H₀: βᵢ = 0 against the alternative Hₐ: βᵢ ≠ 0:  β̂ᵢ/SE(β̂ᵢ).

Associated with the tests statistics you may also …nd p-values.
De…nition
P-values: the lowest level of signi…cance at which you want to reject H0 .

Example: Say the realisation of our t-test statistic for H₀: β₂ = 5 against Hₐ: β₂ ≠ 5 takes the value 2.660.
The p-value equals Pr(|t₆₀| > 2.660). Graph: VN4.5
Using the tables: p-value = 0.01! At the 1% level of significance we would reject H₀.

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 7 / 32


Con…dence Interval
Instead of providing point estimates β̂2 (and associated precision,
SE β̂2 ), you may report interval estimates.

The 100(1 − α)% confidence interval for β₂ (σ² unknown) is given by
[β̂₂ − t_{n−k,α/2} SE(β̂₂),  β̂₂ + t_{n−k,α/2} SE(β̂₂)]   VN4.5
Depending on the sample we have, the interval will be different!
Under repeated sampling, these intervals will contain the true value with probability 1 − α.
The confidence interval is typically wider when σ² is unknown.
When σ² is known, we have
[β̂₂ − z_{α/2} Stdev(β̂₂),  β̂₂ + z_{α/2} Stdev(β̂₂)]

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 8 / 32


T-test and Con…dence Interval
Hypothesis of single parameter

Assume σ² is unknown: Test H₀: β₂ = b₂ against H₁: β₂ ≠ b₂
The following two statements are equivalent:
1 With significance level α, we reject H₀: β₂ = b₂ using the t test:
Reject H₀ if |β̂₂ − b₂| / SE(β̂₂) > t_{n−k,α/2}.
2 b₂ does NOT lie in the (1 − α)100% confidence region:
b₂ ∉ [β̂₂ − t_{n−k,α/2} SE(β̂₂),  β̂₂ + t_{n−k,α/2} SE(β̂₂)]

In PS4, Q3 you are asked to convince yourself of this result.

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 9 / 32


Hypothesis of a single linear restriction

Often a hypothesis of economic interest implies a linear restriction on


more than one coe¢ cient, such as

β2 + β3 + .. + βk = 1

(constant returns to scale: Cobb-Douglas production function).

In general, we can formulate such an hypothesis as

H0 : r 0 β = c

Obviously we’d want to reject if β̂2 + β̂3 + .. + β̂k is not close to 1


That is when r 0 β̂ c is not close to 0.
Visualizer notes: VN4.7-8

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 10 / 32


T-test: Hypothesis of a single linear restriction

To test H₀: r'β = c, use the result: under CLRM Assumptions A.1–A.5,
under H₀: r'β̂ ~ N(c, σ²r'(X'X)⁻¹r).
If σ² is known, our test statistic is obtained by standardizing:
$$ z = \frac{r'\hat\beta - c}{\sigma\sqrt{r'(X'X)^{-1}r}} = \frac{r'\hat\beta - c}{Stdev(r'\hat\beta - c)} \sim N(0,1) \text{ under } H_0 $$
Reject if |z| > z_{α/2} (two-sided test) given significance level α.
When σ² is unknown, we need to replace σ² again by its unbiased estimator s²:
$$ t = \frac{r'\hat\beta - c}{s\sqrt{r'(X'X)^{-1}r}} = \frac{r'\hat\beta - c}{SE(r'\hat\beta - c)} \sim t_{n-k} \text{ under } H_0 $$
Reject if |t| > t_{n−k,α/2} (two-sided test) given significance level α.

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 11 / 32


Empirical implementation: Test CRS

yi = β1 + β2 ki + β3 li + εi

Hypothesis of Constant Returns to Scale (CRS):


H₀: β₂ + β₃ = 1,  Hₐ: β₂ + β₃ ≠ 1
Test statistic:
$$ \frac{\hat\beta_2 + \hat\beta_3 - 1}{SE(\hat\beta_2 + \hat\beta_3 - 1)} = \frac{\hat\beta_2 + \hat\beta_3 - 1}{SE(\hat\beta_2 + \hat\beta_3)} \sim t_{N-3} \text{ under } H_0 $$
Reject when the absolute value of the test statistic exceeds t_{α/2,N−3}.
Practical concern: SE(β̂₂ + β̂₃) ≠ SE(β̂₂) + SE(β̂₃)
By reparameterising we can avoid the problem:
Introduce a new parameter γ = β₂ + β₃ − 1, rewrite the model so that it includes γ, and test H₀: γ = 0! (see PS5 Q1b)
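As an illustration (a sketch with made-up data, not from the slides), SE(β̂₂ + β̂₃) can be computed directly from the estimated variance-covariance matrix using r = (0, 1, 1)':

import numpy as np

rng = np.random.default_rng(6)
n = 100
k_, l_ = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.4 * k_ + 0.6 * l_ + rng.normal(scale=0.5, size=n)   # CRS holds by construction

X = np.column_stack([np.ones(n), k_, l_])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
s2 = (y - X @ b) @ (y - X @ b) / (n - 3)

r = np.array([0.0, 1.0, 1.0])
t_stat = (r @ b - 1) / np.sqrt(s2 * r @ XtX_inv @ r)   # t-test of beta2 + beta3 = 1
print(b, t_stat)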
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 12 / 32
Hypothesis Testing Recipe

1 De…ne H0 and H1 .
2 Formulate a test statistic and provide its distribution under H0
A test statistic is a random variable which
is computable from the data and does not comprise any unknown
quantities.
has a well de…ned distribution needed to de…ne the rejection region.
3 De…ne the signi…cance level of your test, and provide the associated
critical values.
4 State the rejection rule and interpret your …ndings.
Do not use the terminology: "accept H0 "!
5 Clearly indicate the assumptions you make for validity of test (e.g.,
GM + normality)
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 13 / 32
Hypothesis Testing
Multiple Linear Restrictions - F-test
Under GM assumptions + normality

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 14 / 32


Testing J linear restrictions on the parameters I

How do we test when there are more than one linear restrictions?
Formalize the null and alternative hypothesis:
H₀: Rβ = c vs Hₐ: Rβ ≠ c (only consider 2-sided)   [compare the single restriction case H₀: r'β = c, Hₐ: 2-sided/1-sided]
With J equalling the number of restrictions, J ≤ k < n.
R is a J × k matrix of known constants, with full rank (no redundant restrictions). The number of rows equals the number of restrictions.
c is a J × 1 vector of known constants.

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 15 / 32


Testing J linear restrictions on the parameters II

Rβ = c, with R a J × k matrix of known constants and c a J × 1 vector. Examples: let k = 4 and β = (β₁, β₂, β₃, β₄)'.
β₂ = β₃:  R = (0, 1, −1, 0), c = 0   (single restriction, Rβ = r'β)
β₂ = 2 and β₃ = 1:
$$ R = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \quad c = \begin{pmatrix} 2 \\ 1 \end{pmatrix} \quad (2 \text{ restrictions}) $$
β₂ = β₃ = β₄ = 0:
$$ R = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad c = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \quad (3 \text{ restrictions}) $$
How would you write the following restrictions: β₂ = β₃ = 2β₄? VN4.9

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 16 / 32


Testing J linear restrictions on the parameters III

Hypothesis testing can be approached from two viewpoints


1 Do the estimates come reasonably close to satisfying the restrictions
implied by the hypothesis?
R β̂ c
(Wald testing principle)

2 Does imposing the restrictions lead to a signi…cant loss of …t?


Since unrestricted least squares is by de…nition “least squares”,
imposing some restrictions cannot yield an improvement in …t!
Assess whether the loss of …t is so large as to cast doubt on the validity
of the restrictions.

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 17 / 32


APPROACH 1: Wald Test Principle I

H₀: Rβ = c,  Hₐ: Rβ ≠ c
Given our estimate β̂, we ask whether the discrepancy of Rβ̂ − c from zero is statistically significant or simply due to sampling error.
Calculate d = Rβ̂ − c : the discrepancy vector.
We should reject if the elements of d are large in absolute value.
Since, in general, d is a vector, we use the quadratic form
d'[Var(d)]⁻¹d
It is a scalar measure of the size of a vector that has a 'nice' distribution, given that under H₀, d is normally distributed centered around zero.
We should reject if d'[Var(d)]⁻¹d is too large.
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 18 / 32
APPROACH 1: Wald Test Principle II

Recall:
If z is an n-dimensional vector of random variables distributed as N(0, V), where V is non-singular, then z'V⁻¹z ~ χ²ₙ (PS2, Q3).
Visualizer notes VN4.10-11
Under H₀: d = Rβ̂ − c ~ N(0, Var(d)), with Var(d) = σ²[R(X'X)⁻¹R']
Under H₀:
$$ d'[Var(d)]^{-1}d = \frac{(R\hat\beta - c)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - c)}{\sigma^2} \sim \chi^2_J $$
To ensure Var(d) is invertible, R needs to have full row rank (no redundant restrictions).

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 19 / 32


APPROACH 1: Wald Test Principle III

If σ² is known, we can use this as our test statistic. We would reject if our test statistic is too large, relative to the critical value given by the χ²_J distribution:
$$ \frac{(R\hat\beta - c)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - c)}{\sigma^2} = d'[Var(d)]^{-1}d \sim \chi^2_J \text{ under } H_0 $$
If σ² is unknown this is not a test statistic (it cannot be computed). We need to replace σ² by its unbiased estimator s² = ε̂'ε̂/(n − k):
$$ W = \frac{(R\hat\beta - c)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - c)/J}{s^2} = d'[\widehat{Var}(d)]^{-1}d \,/\, J \sim F_{J,n-k} \text{ under } H_0 $$
To ensure the resulting distribution is 'nice' we need to divide the result by J (the number of restrictions).
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 20 / 32
APPROACH 1: Wald Test Principle IV
The F distribution follows as
$$ W = \frac{(R\hat\beta - c)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - c)/J}{s^2} $$
can be rewritten as
$$ W = \frac{\Big[(R\hat\beta - c)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - c)/\sigma^2\Big]\,/\,J}{\Big[(n-k)\,s^2/\sigma^2\Big]\,/\,(n-k)} $$
where the bracketed term in the numerator is d'[Var(d)]⁻¹d ~ χ²_J and the bracketed term in the denominator is χ²ₙ₋ₖ.
The numerator and denominator are independent since β̂ and s² are.
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 21 / 32
APPROACH 1: Wald Test Principle V

Thus to test the hypothesis we proceed as follows (see PS 5, Q2):


1 Calculate W based on the results of a least squares regression of y on
X
2 If we test at a 5% level of signi…cance …nd that value K such that 5%
at the area under an F distribution with (J, n k ) degrees of freedom
lies to the right of K
3 Reject H0 if

W >K
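A small numerical sketch (illustrative only; data and restrictions are hypothetical) of the Wald/F statistic for J linear restrictions Rβ = c, using the formula above:

import numpy as np

def wald_F(X, y, R, c):
    """F statistic for H0: R beta = c (J restrictions), under A.1-A.5."""
    n, k = X.shape
    J = R.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - k)
    d = R @ b - c                                   # discrepancy vector
    return d @ np.linalg.inv(R @ XtX_inv @ R.T) @ d / (J * s2)   # compare with F(J, n-k)

# Hypothetical example: test beta_2 = beta_3 = 0 in y = b1 + b2*x2 + b3*x3 + e
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = 1.0 + rng.normal(size=200)                      # H0 true by construction
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(wald_F(X, y, R, np.zeros(2)))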

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 22 / 32


APPROACH 1: Wald Test Principle – aside
Observe, if J = 1 (1 restriction): [d = Rβ̂ − c is a scalar]
Test statistic when σ² is known:
$$ d'[Var(d)]^{-1}d = \frac{d^2}{Var(d)} = \Big(\frac{d}{Stdev(d)}\Big)^2 \sim \chi^2_1 \text{ under } H_0 $$
Test statistic when σ² is unknown:
$$ d'[\widehat{Var}(d)]^{-1}d\,/\,J = \frac{d^2}{\widehat{Var}(d)}\,/\,1 = \Big(\frac{d}{SE(d)}\Big)^2 \sim t^2_{n-k} \text{ under } H_0 $$
where t²ₙ₋ₖ = F_{1,n−k}
So the Wald test of significance of a parameter is the square of the usual t-statistic for significance of a parameter.
By taking squares we hide the direction of our violation(s).
By taking squares we remove the possibility of a one-sided test.
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 23 / 32
APPROACH 2: Based on comparing RRSS and URSS I

H0 : R β = c, HA : R β 6 = c

Does imposing the restrictions lead to a signi…cant loss of …t?


1 Regress y on X.
Call this unconstrained least squares estimator β̂. Calculate
URSS = ε̂'ε̂ = (y − Xβ̂)'(y − Xβ̂)   (unrestricted residual sum of squares)
2 Regress y on X subject to the constraint Rβ = c.
That is, minimize (y − Xβ)'(y − Xβ) s.t. Rβ = c.
Call this constrained least squares estimator β*. Calculate
RRSS = ε*'ε* = (y − Xβ*)'(y − Xβ*)   (restricted residual sum of squares)
3 Compute test statistic:
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 24 / 32
APPROACH 2: Based on comparing RRSS and URSS II

The F test statistic is given by the ratio:

$$ F = \frac{(RRSS - URSS)/J}{URSS/(n-k)} \sim F_{J,n-k} \text{ under } H_0 $$
J is the total number of restrictions
n − k is the degrees of freedom of the unrestricted model
k is the total number of parameters estimated in the unconstrained model
The ratio F measures the % increase in residual variance (loss in fit) due to imposing H₀.

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 25 / 32


APPROACH 2: Based on comparing RRSS and URSS II

Thus to test the hypothesis we proceed as follows (see PS5, Q1c, and
PS5-extra, Q1):
1 Calculate F based on the results of the restricted and unrestricted least
squares regression.
2 If we test at a 5% level of signi…cance …nd that value K such that 5%
at the area under an F distribution with (J, n k ) degrees of freedom
lies to the right of K
3 Reject H0 if

F >K

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 26 / 32


Example Approach 2:Test Signi…cance of the Regression I

yi = β1 + β2 xi 2 + .. + βk xik + εi = xi0 β + εi

Let us look at an example of the F test:


Test the signi…cance of the regression

H0 : β2 = β3 = ... = βk = 0
HA : β2 6= 0 and/or β3 6= 0 and/or...βk 6= 0

The alternative states that at least one of the coe¢ cients is not equal
to zero.

How can we obtain the ingredients of the F statistic?

$$ F = \frac{(RRSS - URSS)/\#\text{restrictions}}{URSS/\text{df of unrestricted model}} \sim F_{\#\text{restrictions},\; \text{df of unrestricted model}} \text{ under } H_0 $$

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 27 / 32


Example Approach 2:Test Signi…cance of the Regression II

The unrestricted model is our original model:

yi = xi0 β + εi

From our OLS regression we obtain URSS = ∑ᵢ(yᵢ − xᵢ'β̂)² = RSS
The restricted model (which imposes the null) is
yᵢ = β₁ + εᵢ
OLS: β₁* = arg min_{b₁} ∑ᵢ(yᵢ − b₁)²
F.O.C.: −2∑ᵢ(yᵢ − β₁*) = 0, or β₁* = ȳ
⇒ ε̂ᵢ* = yᵢ − β₁* = yᵢ − ȳ
⇒ RRSS = ∑ᵢ(ε̂ᵢ*)² = ∑ᵢ(yᵢ − ȳ)² = TSS
The number of restrictions: J = k − 1;
The degrees of freedom of the unrestricted model: n − k
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 28 / 32
Example Approach 2:Test Signi…cance of the Regression III

The test statistic for the significance of the regression therefore is
$$ F = \frac{(TSS - RSS)/(k-1)}{RSS/(n-k)} = \frac{\frac{TSS-RSS}{TSS}/(k-1)}{\frac{RSS}{TSS}/(n-k)} \quad \text{[divide numerator and denominator by TSS]} $$
$$ \;\;= \frac{R^2/(k-1)}{(1-R^2)/(n-k)} \sim F_{k-1,n-k} \text{ under } H_0 $$
where we use R² = 1 − RSS/TSS (R² = coefficient of determination).

Note when the R 2 is large we should observe large values for F


(signals relevance of regressors).
Large values of F give evidence against the validity of the hypothesis.
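A one-line numerical check (illustrative, with made-up data) of the equivalence between the RSS-based and the R²-based forms of the significance-of-regression F statistic:

import numpy as np

rng = np.random.default_rng(8)
n, k = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.3, 0.0, -0.2]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ b) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss

F_rss = ((tss - rss) / (k - 1)) / (rss / (n - k))
F_r2 = (r2 / (k - 1)) / ((1 - r2) / (n - k))
print(F_rss, F_r2)   # identical: the two formulas coincide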
Dr M. Schafgans (LSE) CLRM and Hypothesis testing 29 / 32
APPROACH 2: Based on comparing RRSS and URSS III
Constrained least squares estimation

Our aim is to find the β which minimizes
(y − Xβ)'(y − Xβ) subject to Rβ = c
To perform such constrained optimization we can either simply plug in the restrictions or explicitly use the Lagrangian function (see also PS4-extra Q2).
Lagrangian function: L(b, λ) = (y − Xb)'(y − Xb) − 2λ'(Rb − c)
Obtain the first order conditions and solve for (β*, λ*).
λ: Lagrange multipliers, the "shadow prices" of imposing the restrictions

Details visualizer notes VN4.17

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 30 / 32


APPROACH 2: Based on comparing RRSS and URSS IV

We have derived the restricted parameters β* and λ*:
$$ \beta^* = \hat\beta - (X'X)^{-1}R'\big(R(X'X)^{-1}R'\big)^{-1}(R\hat\beta - c) $$
$$ \lambda^* = \big(R(X'X)^{-1}R'\big)^{-1}(c - R\hat\beta) $$
Observe: if Rβ̂ = c (discrepancy = 0) then β* = β̂ and λ* = 0.
The fact that under A.1–A.5
$$ F = \frac{(RRSS - URSS)/J}{URSS/(n-k)} \sim F_{J,n-k} \text{ under } H_0 $$
follows from the result that we have shown
$$ F = W \equiv d'[\widehat{Var}(d)]^{-1}d\,/\,J $$
where the latter under H₀ is F_{J,n−k}.


Dr M. Schafgans (LSE) CLRM and Hypothesis testing 31 / 32
Single vs Joint Hypothesis testing

It is important to observe the following practice: single hypotheses


tests need to be followed up by a joint hypothesis test
It is perfectly possible for the t tests on individual variables' coefficients not to be significant, while the F test for a number of these coefficients is highly significant (multicollinearity).
In the setting of near multicollinearity the marginal contribution of each explanatory variable, when added last, may be quite small.
Near multicollinearity may give the impression of insignificant parameters (high standard errors), but jointly the parameters are significant.

Dr M. Schafgans (LSE) CLRM and Hypothesis testing 32 / 32


EC221: Principles of Econometrics
Multiple Linear Regression and Asymptotic Theory

Dr M. Schafgans

London School of Economics

Lent 2022

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 1 / 18


Large Sampling (Asymptotic) Theory I

Maybe we do NOT know (…nite sample properties)


I whether an estimator is unbiased; or
I what the …nite sampling distribution of our estimator is.
(need to compute moments / requires existence of moments).

Rely on asymptotic results, and treat the n ! ∞ results as


approximations.
Two concepts that we will be looking at
Consistency (convergence in probability)
Property of ANY useful estimator!!!
If your sample size is large, we don’t need to worry about bias as long
as the estimator is consistent!
Limiting distribution (convergence in distribution)
Necessary to allow us to do hypothesis testing in the absence of a …nite
(exact) sampling distribution.
Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 2 / 18
Large Sampling (Asymptotic) Theory II
Consistency

Consistency: An estimator θ̂ of θ is consistent if, when the sample size increases, θ̂ gets "closer" to θ:
plim θ̂ = θ, or θ̂ →ᵖ θ
Consistency is a convergence in probability result.
Definition
Convergence in probability (consistency). Let Xₙ be a random variable indexed by the size of a sample. Xₙ converges in probability to X (Xₙ →ᵖ X) if
lim_{n→∞} P(|Xₙ − X| > ε) = 0 for any positive ε.

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 3 / 18


Large Sampling (Asymptotic) Theory II
Consistency

Sufficient conditions for consistency:
lim_{n→∞} E(θ̂) = θ (asymptotic unbiasedness) and lim_{n→∞} Var(θ̂) = 0
This guarantees the stronger concept of convergence in mean squares (Chebyshev's inequality):
$$ P\big(|\hat\theta - \theta| > \varepsilon\big) \le \frac{E\big[(\hat\theta - \theta)^2\big]}{\varepsilon^2} = \frac{MSE(\hat\theta)}{\varepsilon^2} \to 0 $$

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 4 / 18


Consistency - Sample Mean I

Suppose that X1 , X2 , ..., Xn is a random sample (i.i.d.) with mean µ


and variance σ2 .

Show X̄ = (1/n)∑ᵢ Xᵢ is a consistent estimator of µ:
Recall: E[X̄] = µ and Var[X̄] = σ²/n.
As n → ∞: lim E[X̄] = µ and lim Var[X̄] = 0.
These are sufficient conditions that guarantee that the sample mean, X̄, converges to the population mean E(Xᵢ), so:
plim (1/n)∑ᵢ₌₁ⁿ Xᵢ = µ
In fact, we imposed stronger conditions than are needed for its consistency! (sufficiency)
We do not need to assume that σ² is finite!

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 5 / 18


Large Sampling (Asymptotic) Theory III
Consistency

Laws of large numbers (LLN) provide an alternative method for


proving consistency.
LLNs give conditions under which sample averages converge to their
population counterparts.
Theorem
Khinchine's Weak Law of Large Numbers (WLLN): If X₁, ..., Xₙ is a random sample (i.i.d.) from a probability distribution with finite mean µ, then
plim(X̄) = plim (1/n)∑Xᵢ = E(Xᵢ) ≡ µ
Similar regularity conditions exist such that, e.g.,
plim (1/n)∑Xᵢ² = E(Xᵢ²) or plim (1/n)∑Xᵢεᵢ = E(Xᵢεᵢ)
Alternative LLNs exist for cases where the Xᵢ (εᵢ) are not i.i.d.
Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 6 / 18
Consistency - Sample Mean II

Return to question whether X̄ is a consistent estimator of µ!


The proof is a direct application of the WLLN (Khinchine)
With X1 , .., Xn i.i.d. random variables, as long as EXi is …nite

plim X̄ = E (Xi )

Since E (Xi ) µ consistency established!

Advantage of approach:
Recognizes that Var (Xi ) = σ2 < ∞ is not needed for consistency
Proof does not require us to derive Var (X̄ ), we just need to look at
plims of averages!

As it turns out, the plim is a nice operator - nicer than the expectation
operator.

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 7 / 18


Large Sampling (Asymptotic) Theory IV
plim operator and Slutsky’s Theorem

The probability limits operator exhibits some nice intuitive properties:


If Xn and Yn are random variables with plim Xn = a and plim Yn = b
then
plim(Xn + Yn ) = a + b
plim(Xn Yn ) = ab
plim(Xn /Yn ) = a/b provided b 6= 0

Theorem
Slutsky Theorem. For a continuous function g (Xn ) that is not a function of n

plim g (Xn ) = g (plim Xn ).

Visualizer notes VN5.1-2

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 8 / 18


Consistency - OLS estimator I

Show that β̂ consistently estimates β in the simple linear regression model:
$$ \hat\beta = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = \beta + \frac{\sum_{i=1}^n (x_i - \bar x)(\varepsilon_i - \bar\varepsilon)}{\sum_{i=1}^n (x_i - \bar x)^2} $$
Approach 1 (sufficient conditions):
E(β̂) = β
$$ Var(\hat\beta \mid X) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{1}{n}\cdot\frac{\sigma^2}{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2} \to 0 \text{ as } n \to \infty. $$
Approach 2 (law of large numbers):
$$ \text{plim}(\hat\beta) = \text{plim}\Big(\beta + \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(\varepsilon_i - \bar\varepsilon)}{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2}\Big) \quad \text{rewritten in terms of averages} $$
Slutsky Theorem:
$$ \text{plim}\,\hat\beta = \beta + \frac{\text{plim}\,\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(\varepsilon_i - \bar\varepsilon)}{\text{plim}\,\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2} = \beta + \frac{\text{plim SampleCov}(x_i,\varepsilon_i)}{\text{plim SampleVar}(x_i)} $$
Law of Large Numbers:
$$ \text{plim}\,\hat\beta = \beta + \frac{Cov(x_i,\varepsilon_i)}{Var(x_i)} = \beta \quad \text{[say, assume } (x_i,\varepsilon_i) \text{ i.i.d.]} $$
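A quick Monte Carlo sketch (illustrative only; the true slope and sample sizes are hypothetical): the spread of the OLS slope estimator around the true β shrinks as n grows, in line with consistency.

import numpy as np

rng = np.random.default_rng(9)
beta = 2.0
for n in [50, 500, 5000]:
    estimates = []
    for _ in range(500):
        x = rng.normal(size=n)
        y = beta * x + rng.normal(size=n)
        xd, yd = x - x.mean(), y - y.mean()
        estimates.append(xd @ yd / (xd @ xd))        # OLS slope
    print(n, np.mean(estimates), np.std(estimates))  # mean ~ beta, spread shrinks with n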

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 9 / 18


Consistency - OLS estimator (Aside)

How can we see, e.g., that plim SampleCov(xᵢ, εᵢ) = Cov(xᵢ, εᵢ)? Observe
SampleCov(xᵢ, εᵢ) = (1/n)∑ᵢ(xᵢ − x̄)(εᵢ − ε̄) = (1/n)∑ᵢ xᵢεᵢ − x̄ε̄
By the Slutsky Theorem
plim SampleCov(xᵢ, εᵢ) = plim (1/n)∑ᵢ xᵢεᵢ − (plim x̄)(plim ε̄)
By the Law of Large Numbers
plim SampleCov(xᵢ, εᵢ) = E(xᵢεᵢ) − E(xᵢ)E(εᵢ) =def Cov(xᵢ, εᵢ)
By a suitable LLN: plim (1/n)∑ᵢ xᵢεᵢ = E(xᵢεᵢ)
Similarly: plim (1/n)∑ᵢ xᵢ = E(xᵢ) and plim (1/n)∑ᵢ εᵢ = E(εᵢ)
Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 10 / 18
Consistency - OLS estimator II

Show that β̂ consistently estimates β:
β̂ = (X'X)⁻¹X'y = β + (X'X)⁻¹X'ε
Write the estimator in terms of averages:
$$ \hat\beta = \beta + \Big(\frac{X'X}{n}\Big)^{-1}\frac{X'\varepsilon}{n} $$
Use the Slutsky Theorem and Law of Large Numbers on the averages.

VN5.3-5

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 11 / 18


Consistency - OLS estimator III

Show that s² consistently estimates σ²:
$$ s^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-k} = \frac{\varepsilon' M \varepsilon}{n-k} $$
Note, first we rewrite s² in terms of averages:
$$ s^2 = \frac{n}{n-k}\cdot\frac{\varepsilon'M\varepsilon}{n} = \frac{n}{n-k}\left(\frac{\varepsilon'\varepsilon}{n} - \frac{\varepsilon'X}{n}\Big(\frac{X'X}{n}\Big)^{-1}\frac{X'\varepsilon}{n}\right) $$
Apply the Slutsky Theorem and Law of Large Numbers (see PS5-extra, Q2).
Using the sufficient conditions would, e.g., require the existence of the 4th moment of ε.

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 12 / 18


Large Sampling (Asymptotic) Theory V
Asymptotic Distribution

If we do not know the …nite sample distribution of our estimator we


cannot perform hypothesis testing!
We will then need to assume that there is a distribution we can use that
approximates its distribution arbitrarily well for su¢ ciently large sample.

Related to the asymptotic property of convergence in distribution

De…nition
(Convergence in Distribution): The sequence Zn with cumulative distribution
functions FZ n (z ), converges in distribution to a random variable Z with
cumulative distribution function FZ (z ) if limn !∞ jFZ n (z ) FZ (z )j = 0, at all
points of continuity of FZ (z ).

Intuitively, the distribution of Zn starts more and more resembling that


of Z as n ! ∞.
Example: The t test (tn k distribution) starts behaving as if it has a
N (0, 1) distribution when n ! ∞.
Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 13 / 18
Large Sampling (Asymptotic) Theory VI
Asymptotic Distribution, Central Limit Theorem

For the asymptotic distribution we rely on Central Limit Theorems:


Theorem
Lindeberg-Levy CLT: If X₁, ..., Xₙ are a random sample (i.i.d.) from a probability distribution with finite mean µ and finite variance σ², then
$$ \sqrt{n}(\bar X - \mu) \equiv \frac{1}{\sqrt n}\sum_{i=1}^n \big(X_i - E(X_i)\big) \xrightarrow{d} N(0,\sigma^2) $$
"√n(X̄ − µ) has a N(0, σ²) limiting distribution"
The CLT ensures that the normal distribution is a good approximation of the distribution of X̄ when n is large:
X̄ ~ᵃ N(µ, σ²/n)
Quite a powerful result, since if we draw X₁, ..., Xₙ from an unknown distribution with mean µ and variance σ², all we know is E(X̄) = µ and Var(X̄) = σ²/n. VN5.6
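A short simulation sketch of the Lindeberg-Levy CLT (illustrative only; the exponential distribution and sample size are arbitrary choices): standardized sample means of a very non-normal distribution look approximately N(0,1) for large n.

import numpy as np

rng = np.random.default_rng(10)
n, reps = 500, 10000
mu, sigma = 1.0, 1.0                      # mean and sd of an Exponential(1)
draws = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (draws.mean(axis=1) - mu) / sigma
print(z.mean(), z.std())                  # ~ 0 and ~ 1
print(np.mean(np.abs(z) > 1.96))          # ~ 0.05, as for a standard normal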
Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 14 / 18
Asymptotic Distribution - OLS estimator
Recall: to obtain the sampling distribution of b
β we needed to add a
distributional assumption to our GM assumptions (A5).

A1–A5: ⇒ β̂ | X ~ N(β, σ²(X'X)⁻¹)
Can we still do hypothesis testing if we are not happy to make this assumption?
Yes! We will then rely on a CLT that tells us that even if the errors are not normally distributed, but, say, simply random (i.i.d.) with zero mean and finite variance σ²:
A1–A4: ⇒ β̂ | X ~ᵃ N(β, σ²(X'X)⁻¹)
Asymptotically, the distribution of β̂ conditional on X behaves as if it were normally distributed, centered around the truth, with variance given by σ²(X'X)⁻¹.
For curious students, I provide a heuristic discussion VN5.7-8 - EC309
Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 15 / 18
Hypothesis Testing (Asymptotic)
Single/Multiple Linear Restrictions
Under GM assumptions only (without A5: NORMALITY)!

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 16 / 18


Hypothesis Testing - Asymptotic Test Statistics

Using A1-A4, our asymptotic test statistics become


Single linear restriction: H₀: c'β = γ:
$$ z = \frac{c'\hat\beta - \gamma}{SE(c'\hat\beta - \gamma)} \sim^a N(0,1) \text{ under } H_0 $$
Use N(0,1) (asymptotic test) instead of t_{n−k} (finite sample test).
Under A1–A4,
$$ \frac{c'\hat\beta - \gamma}{\sqrt{Var(c'\hat\beta - \gamma)}} \sim^a N(0,1) \text{ under } H_0 $$
Asymptotically it doesn't matter whether we know σ² or not, as s² is a consistent estimator of σ², and
$$ \frac{c'\hat\beta - \gamma}{\sqrt{Var(c'\hat\beta - \gamma)}} \overset{a}{\approx} \frac{c'\hat\beta - \gamma}{\sqrt{\widehat{Var}(c'\hat\beta - \gamma)}} $$

Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 17 / 18


Hypothesis Testing - Asymptotic Test Statistics
Using A1-A4, our asymptotic test statistics become
Multiple linear restrictions: H₀: Rβ = c
$$ W = (R\hat\beta - c)'\big(s^2 R(X'X)^{-1}R'\big)^{-1}(R\hat\beta - c) = d'[\widehat{Var}(d)]^{-1}d \sim^a \chi^2_J \text{ under } H_0 \quad \text{VN5.9} $$
Use the χ²-test (asymptotic test) instead of the F-test (finite sample test).
Under A1–A4, (Rβ̂ − c)'[Var(Rβ̂ − c)]⁻¹(Rβ̂ − c) ~ᵃ χ²_J under H₀.
Asymptotically there is no need to use the F distribution to deal with the imprecision of s²:
$$ d'[\widehat{Var}(d)]^{-1}d \overset{a}{\approx} d'[Var(d)]^{-1}d $$
The t and the F tests are exact tests that rely on the assumption of normality; the z and the χ²-tests don't!
Dr M. Schafgans (LSE) EC221: CLRM and Asymptotic Theory 18 / 18
EC221: Principles of Econometrics
Stationary Time Series Models

Dr M. Schafgans

London School of Economics

Lent 2022

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 1 / 22


Time Series (TS) versus Cross Sectional (CS) data

In this handout we will focus our attention to the application of OLS


using TS data:
f(yt , xt1 , . . . , xtk ) : t = 1, ..., T g, where T denotes the sample size
With the help of various time-series models, we will highlight the
restrictiveness of GM assumptions in the time series setting.
OLS estimator in TS models typically biased!
Hence use of large sample analysis is even more important in time
series context.
In order for us to rely on LLN and CLT and conduct standard
statistical inference we will have to limit the dependence inherent in
time series processes (weak dependence). We will also require
stationarity (need to ensure that the process is stable over time).

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 2 / 22


Models using Time series data (stationary) I

(1) Static model


yt = β0 + β1 xt + εt , t = 1, ..., T

The name "static model" comes from the fact that we are modeling a
contemporaneous relationship between y and x.
Example: Static Phillips Curve. One way to write a static Phillips curve is
infₜ = β₀ + β₁unemₜ + εₜ,
where infₜ is, say, the annual rate of inflation during year t, and unemₜ is the annual unemployment rate during year t. β₁ is supposed to measure the trade-off between inflation and unemployment.
Assumes that a change in xₜ at time t has an immediate effect on yₜ.
As long as all GM assumptions are satisfied, OLS will be BLUE.

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 3 / 22


Models using Time series data (stationary) II

(1) Static model


yt = β0 + β1 xt + εt , t = 1, ..., T

It is likely that in static models our ignorance exhibits dependence over time (i.e., Cov(εₜ, εₛ) ≠ 0 for some t ≠ s).
This could be caused by failing to recognize dynamics in this relation.
The presence of autocorrelation is a violation of GM, but as long as E(ε | X) = 0 (regressors strictly exogenous), OLS remains unbiased.
For inference, we will need to use robust SEs to deal with the autocorrelation in the errors, as Var(β̂ | X) ≠ σ²(X'X)⁻¹ (next handout).

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 4 / 22


Models using Time series data (stationary) III

(2) Finite Distributed Lag Models

yt = α + xt γ1 + xt 1 γ2 + xt 2 γ3 + εt

Enables us to capture e¤ects that take place with a lag.

Minimum Wage and Employment. Suppose we have monthly data on


employment and minimum wage. A change in the minimum wage may not
have its total e¤ect on employment in the same month.

The γⱼ, j = 1, 2, 3, are the distributed lag coefficients:
γ₁ is the contemporaneous effect – the impact propensity;
γ₁ + γ₂ + γ₃ – the long run propensity

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 5 / 22


Models using Time series data (stationary) IV

(2) Finite Distributed Lag Models

yt = α + xt γ1 + xt 1 γ2 + xt 2 γ3 + εt

Our matrix of regressors is given by X = [1  xₜ  xₜ₋₁  xₜ₋₂] (row t shown).
If xₜ changes slowly over time, we may get very imprecise estimates for the γⱼ due to the problem of near multicollinearity.
If xₜ changes slowly over time, xₜ and xₜ₋₁ are highly correlated.
Fortunately, in general, we can estimate the LR effect γ₁ + γ₂ + γ₃ precisely.

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 6 / 22


Models using Time series data (stationary) V
(2) Finite Distributed Lag Models
yt = α + xt γ1 + xt 1 γ2 + xt 2 γ3 + εt

Property of OLS estimator


For unbiasedness we have to assume strict exogeneity:
E(εₜ | X) = E(εₜ | x₁, x₂, ..., x_T) = 0
This would require εₜ to be uncorrelated with all past, current, and future values of our regressors.
The fact that xₜ₊₁ may respond to εₜ invalidates this strict exogeneity (minimum wage responding to past employment shocks) - see also PS6, Q1.
More reasonable to assume:
E(εₜ | xₜ, xₜ₋₁, xₜ₋₂, ...) = 0

The OLS estimator of FDL (and static) models will typically be biased!
Dr M. Schafgans (LSE) EC221: Time Series - Stationary 7 / 22
Models using Time series data (stationary) VI
(2) Finite Distributed Lag Models
yt = α + xt γ1 + xt 1 γ2 + xt 2 γ3 + εt

Under the weaker exogeneity assumption (more reasonable) it is important to ensure good asymptotic properties are satisfied.
Result: Under stationarity and weak dependence, OLS will still be consistent! Stationarity & weak dependence enable us to use the LLN.
$$ \text{plim}\,\hat\beta = \beta + \Big[\text{plim}\,\frac{X'X}{T}\Big]^{-1} \text{plim}\,\frac{X'\varepsilon}{T}. $$
Let us assume plim X'X/T = D exists and is invertible. For consistency, we require
$$ \text{plim}\,\frac{X'\varepsilon}{T} = \begin{pmatrix} \text{plim}\,\frac{1}{T}\sum_{t=1}^T \varepsilon_t \\ \text{plim}\,\frac{1}{T}\sum_{t=1}^T x_t\varepsilon_t \\ \text{plim}\,\frac{1}{T}\sum_{t=1}^T x_{t-1}\varepsilon_t \\ \text{plim}\,\frac{1}{T}\sum_{t=1}^T x_{t-2}\varepsilon_t \end{pmatrix} = 0 $$
Satisfied when E(εₜ | xₜ, xₜ₋₁, xₜ₋₂, ...) = 0
Dr M. Schafgans (LSE) EC221: Time Series - Stationary 8 / 22
Models using Time series data (stationary) VII
(3) Dynamic: Autoregressive distributed lag model – ADL(1,2)
yₜ = α + φyₜ₋₁ + xₜγ₁ + xₜ₋₁γ₂ + xₜ₋₂γ₃ + εₜ, |φ| < 1 and E(εₜ | xₜ, xₜ₋₁, xₜ₋₂, ..., yₜ₋₁, yₜ₋₂, ...) = 0.
Due to the dependence inherent in time series processes, it is likely that yₜ₋₁ and yₜ are correlated, i.e., φ ≠ 0.
If yₜ₋₁ helps to explain yₜ, omitting yₜ₋₁ in the presence of a correlation between yₜ₋₁ and xₜ would result in OVB! VN6.1
Controlling for yₜ₋₁ while estimating the effect of xₜ is effective when estimating the causal effect of xₜ on yₜ.
Minimum Wage and Growth in Employment. By controlling for gempₜ₋₁, we allow the possibility that gminwageₜ reacts to past employment growth.
γ₁ is the contemporaneous effect; (γ₁ + γ₂ + γ₃)/(1 − φ) – the long run propensity
Dr M. Schafgans (LSE) EC221: Time Series - Stationary 9 / 22
Models using Time series data (stationary) VII
(3) Dynamic: Autoregressive distributed lag model – ADL(1,2)
yₜ = α + φyₜ₋₁ + xₜγ₁ + xₜ₋₁γ₂ + xₜ₋₂γ₃ + εₜ, |φ| < 1 and E(εₜ | xₜ, xₜ₋₁, xₜ₋₂, ..., yₜ₋₁, yₜ₋₂, ...) = 0.
Let X = [1  yₜ₋₁  xₜ  xₜ₋₁  xₜ₋₂] (row t shown) and β = (α, φ, γ₁, γ₂, γ₃)'.
Properties of β̂_OLS:
E(ε | X) = 0 cannot be satisfied here ⇒ β̂_OLS will be biased.
Reason: it requires εₜ to be uncorrelated with all leads and lags of yₜ₋₁ (one of the regressors). VN6.2 See also PS6, Q2.
β̂_OLS will be consistent (given stationarity and weak dependence):
$$ \text{plim}\,\hat\beta_{OLS} = \beta + \Big(\text{plim}\,\frac{X'X}{T}\Big)^{-1}\text{plim}\,\frac{X'\varepsilon}{T} = \beta + M^{-1}\cdot 0 = \beta $$
plim X'ε/T is a vector containing, e.g., plim (1/T)∑ yₜ₋₁εₜ = E(yₜ₋₁εₜ) by the LLN, and our assumptions ensure E(yₜ₋₁εₜ) = 0 and E(xₛεₜ) = 0 for s ≤ t.
Dr M. Schafgans (LSE) EC221: Time Series - Stationary 10 / 22
Models using Time series data (stationary) X

(3) Dynamic: Autoregressive distributed lag model – ADL(1,2)

yt = α + φyt 1 + xt γ1 + xt 1 γ2 + xt 2 γ3 + εt , jφj < 1 and


E (εt jxt , xt 1 , xt 2 , ., yt 1 , yt 2 ..) = 0.

If εₜ exhibits autocorrelation, OLS will NOT be consistent.
In this setting yₜ₋₁ is an endogenous variable (correlated with the error term), hence E(εₜ | xₜ, xₜ₋₁, xₜ₋₂, ..., yₜ₋₁, yₜ₋₂, ...) = 0 CANNOT be satisfied - more later.

The ADL cannot be estimated by OLS (inconsistent) in the presence of


autocorrelation.
The DL and static model can still be estimated by OLS (consistent) in
the presence of autocorrelation, but we will need to correct the SE’s

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 11 / 22


Models using Time series data (summary)
Considered three models
1 Static models, used when we are interested in a contemporaneous
relationship.
errors will likely exhibit autocorrelation

2 Finite Distributed Lag models, enable us to capture e¤ects that take


place with a lag.
short and long run impact; issues of near multicollinearity.

3 Autoregressive Distributed Lag models


yt = α + φyt 1 + xt γ1 + xt 1 γ2 + xt 2 γ3 + εt , jφj < 1

preferred for forecasting because they do allow lagged outcomes on y


to directly a¤ect current outcomes - causal interpretation.
Statistically, models with lagged dependent variables are more di¢ cult
to study
OLS will always be biased- violate strict exogeneity.
Dr M. Schafgans (LSE) EC221: Time Series - Stationary 12 / 22
Models using Time series data (summary)

As we argued that strict exogeneity is unlikely to be satistifed in time


series models, it is important to address asymptotic properties of our
OLS estimators under weaker exogeneity assumptions.
For the ADL model

yt = α + φyt 1 + xt γ1 + xt 1 γ2 + xt 2 γ3 + εt , jφj < 1

specifically
$$ E\big(\varepsilon_t \mid \overbrace{x_t, x_{t-1}, \ldots, y_{t-1}, y_{t-2}, \ldots}^{I_t}\big) = 0 \quad \text{or simply} \quad E(\varepsilon_t \mid x_t, x_{t-1}, x_{t-2}, y_{t-1}) = 0. $$

To ensure we can apply LLN and CLT we will rely on two assumptions
(stationarity and weak dependence) which we discuss today.

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 13 / 22


Models using Time series data (summary)

Final summary comments:

What is the matrix of regressors X in the three models and what is


0
plim Xn ε (assuming suitable LLN exist)?

1 The static model and the FDL model will be consistent under weak
exogeneity assumptions even if the error term exhibits weak
dependence.

2 The ADL model will only be consistent under a weak exogeneity


assumption if the error does NOT exhibit weak dependence.

If the error term in the ADL model does exhibit dependence, OLS will
not be consistent either because E (εt yt 1 ) 6= 0. (Will discuss this
later).

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 14 / 22


Stationarity and Weak Dependent Time Series I

Key concepts required to enable to use of LLN and CLT in the


presence of dependence
1 Weak dependence:
While it is unreasonable to assume independence in TS, we will need to
limit this dependence. (replacing independence)
2 Stationarity:
We will require that the joint distribution of (xt1 , ..., xtm ) is identical to
that (xt1 +h , ..., xtm +h ) for any h. (replacing identically distributed) -
stability requirement

For example, 10-year blocks (x1970 , . . . , x1979 ), (x1974 , . . . , x1983 ),


(x1990 , . . . , x1999 ) have the same distribution.

This is stronger than identically distributed (corresponds to case where


m = 1). It ensures that the dependence structure among the elements
remains the same as well.

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 15 / 22


Stationarity and Weakly Dependent Time Series II
Stationarity
The type of stationarity (weak) we typically impose is that of
covariance stationarity:

De…nition
A stochastic process fxt ; t = 1, 2, ..g is covariance stationary if
(i) E (xt ) is …nite and constant
(ii) Var (xt ) is …nite and constant
(iii) Cov (xt , xt +h ) is …nite and depends only on distance in time, h

The …rst two moments need to exist and not change (be identical) over
time!

But we can allow for deterministic trending behaviour as long as our


regression includes a deterministic trend (deterministic trends do not
have any impact on the dependence of the process over time). VN6.3
FWL - including trend in regression same a running regression with
detrended variables
Dr M. Schafgans (LSE) EC221: Time Series - Stationary 16 / 22
Stationarity and Weakly Dependent Time Series II

Dependence
To describe the dependence in {x_t}_{t=1}^T over time, we consider
Corr(x_t, x_{t+h}) as a function of h:

Corr(x_t, x_{t+h}) = Cov(x_t, x_{t+h}) / sqrt(Var(x_t) Var(x_{t+h}))

Given stationarity, we can use the Autocorrelation Function (ACF)
ρ(h) := Corr(x_t, x_{t+h}); its sample analogue is the correlogram.

Loosely speaking, weak dependence requires

Corr(x_t, x_{t+h}) → 0 as h → ∞

Corr(x_t, x_{t+h}) can be non-zero but should vanish eventually or
asymptotically. (The formal definition is technical.)
Intuitively, weak dependence means asymptotic independence.

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 17 / 22


Common dependence processes – AR, MA and ARMA
Autoregressive process of order one: AR (1)

yt = φyt 1 + εt , εt i.i.d. (0, σ2 ) white noise

Stationarity requires jφj < 1

Moving average process of order one: MA(1)

yt = εt + θεt 1, εt i.i.d. (0, σ2 )

Stationary

Autoregressive moving average process:


Combining the two processes, we get an ARMA(1, 1)
yt = φyt 1 + εt + θεt 1 , εt i.i.d. (0, σ2 )

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 18 / 22


Condition for AR(1) to be stationary:

y_t = φ y_{t-1} + ε_t,  t = 1, ..., T with |φ| < 1
ε_t ~ i.i.d.(0, σ²), ε_t is indep. of y_{t-1}, y_{t-2}, ....

The assumption that |φ| < 1 is needed to ensure a finite mean and a
finite (positive) variance – stationarity.
Recursive substitution clarifies this:

y_t = φ(φ y_{t-2} + ε_{t-1}) + ε_t = φ² y_{t-2} + φ ε_{t-1} + ε_t
    = φ²(φ y_{t-3} + ε_{t-2}) + φ ε_{t-1} + ε_t = φ³ y_{t-3} + φ² ε_{t-2} + φ ε_{t-1} + ε_t
    = ...
    = ε_t + φ ε_{t-1} + φ² ε_{t-2} + φ³ ε_{t-3} + ...   (as φ^s → 0)

The situation where φ = 1 (non-stationary) has received a lot of
attention in the macroeconometrics literature, and is called
"unit roots". (Augmented) Dickey-Fuller test.

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 19 / 22


Autocorrelation Function (ACF) and Correlogram

From the ACF (and correlogram) we can infer the extent to which
one value of the process is correlated with previous values and thus
the length and strength of the memory of the process.
It indicates how long (and how strongly) a shock in the process (ε_t)
affects the current and future values of y_t. Let h > 0.

For the MA(1) process, we have ρ(0) = 1, ρ(1) = θ/(1 + θ²) and ρ(h) = 0,
h = 2, 3, 4, ....
A shock in an MA(1) process affects y_t in two periods only.
For the stationary AR(1) process, |φ| < 1, we have ρ(h) = φ^h.
A shock in the AR(1) process affects all future observations with a
decreasing (geometric) effect.

We want to ensure that this dependence dies out sufficiently fast.
Satisfied for both the MA(1) and the stationary AR(1). (A simulation sketch follows below.)

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 20 / 22
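A small numerical sketch of these ACF results (Python/NumPy, illustrative parameter values assumed): simulate an AR(1) and an MA(1) and compare the sample ACF (correlogram) with the theoretical ρ(h) = φ^h and ρ(1) = θ/(1 + θ²).

import numpy as np

rng = np.random.default_rng(1)
T, phi, theta = 5000, 0.8, 0.5                  # assumed illustrative values
eps = rng.normal(size=T + 1)

# AR(1): y_t = phi*y_{t-1} + eps_t ;  MA(1): z_t = eps_t + theta*eps_{t-1}
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t-1] + eps[t]
z = eps[1:] + theta * eps[:-1]

def sample_acf(x, h):
    x = x - x.mean()
    return (x[:-h] * x[h:]).sum() / (x**2).sum()

for h in (1, 2, 3):
    print(h,
          round(sample_acf(y, h), 3), round(phi**h, 3),                       # AR(1)
          round(sample_acf(z, h), 3), round(theta/(1+theta**2) if h == 1 else 0.0, 3))  # MA(1)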


Derivation ACF

For an MA(1) process

y_t = ε_t + θ ε_{t-1},   ε_t i.i.d. (0, σ²)

it is fairly straightforward to show that

E(y_t) = 0,
Var(y_t) = (1 + θ²)σ², and
Cov(y_{t+1}, y_t) = θσ², while
Cov(y_{t+h}, y_t) = 0 for h > 1.

VN6.4 (see PS6-extra Q1)

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 21 / 22


Derivation ACF

For a stationary AR(1) process

yt = φyt 1 + εt , t = 1, ..., T with jφj < 1


εt s i.i.d.(0, σ2 ), εt is indep. of yt 1 , yt 2 , ....

The easiest way to determine the mean, variances and covariances, is


to use the properties of covariance stationarity. VN6.5-6

Dr M. Schafgans (LSE) EC221: Time Series - Stationary 22 / 22


VN6.1
VN6.2
VN6.3
VN6.4
VN6.5
VN6.6
EC221: Principles of Econometrics
Generalized Linear Regression Model

Dr M. Schafgans

London School of Economics

Lent 2022

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 1 / 43


Generalized Linear Regression Model

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 2 / 43


Generalized Linear Regression Model

The generalized linear regression model extends the linear regression
model by relaxing the assumption Var(ε|X) = σ²I.

Definition
A1: True model - linear in parameters y = Xβ + ε with E(ε) = 0
A2: No perfect multicollinearity
A3: Zero conditional mean E(ε|X) = 0
A4: General covariance matrix Var(ε|X) = E(εε'|X) = Σ   (the equality uses A3)

where Σ is a symmetric positive definite matrix.

Often the matrix Σ will be written as Σ = σ²Ω, where σ² is an
unknown scaling parameter.

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 3 / 43


Special case GLS: Heteroskedasticity

Disturbances are heteroskedastic when they have different variances

y_i = x_i'β + ε_i,  i = 1, ..., n  (Cross-section)

Cov(ε_i, ε_j | X) = σ_i² if i = j, and 0 if i ≠ j

In this case Σ is a diagonal matrix

Σ = diag(σ_1², σ_2², ..., σ_n²)

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 4 / 43


Special case GLS: Autocorrelation

Autocorrelation is usually found in time-series data

y_t = x_t'β + ε_t,  t = 1, ..., n  (Time-series)

Cov(ε_t, ε_s | X) ≠ 0, for some t ≠ s.

Economic time series often display a "memory" in that variation around
the regression function is not independent from one period to the next.

(AR(1) process): ε_t = ρ ε_{t-1} + v_t,  |ρ| < 1,  v_t iid(0, σ²).

Σ = σ²/(1 - ρ²) times the matrix with (t, s) element ρ^|t-s|, i.e. rows
(1, ρ, ρ², ..., ρ^{n-1}), (ρ, 1, ρ, ..., ρ^{n-2}), ..., (ρ^{n-1}, ρ^{n-2}, ..., 1)   "fading memory"

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 5 / 43


Consequences for the OLS Estimator I

β̂ is (in general) no longer BLUE.
Efficiency is lost since not all GM assumptions are satisfied.

β̂ remains Unbiased (and Consistent) and Linear:

E(β̂) = β since E(ε|X) = 0 by A.3

Var(β̂|X) ≠ σ²(X'X)^{-1}   VN7.1

Var(β̂|X) = (X'X)^{-1} X'ΣX (X'X)^{-1}   given A.4
          = σ² (X'X)^{-1} X'ΩX (X'X)^{-1}

Consequently, standard t- and F-tests (based on σ²(X'X)^{-1}) will
be invalid.

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 6 / 43


Solutions GLS Model

These consequences indicate two ways of handling the problems of


heteroskedasticity/autocorrelation.

1 We decide to stick with the OLS estimator but we use a CORRECT


estimator of the covariance matrix.
OLS with robust standard errors. b
β
Advantage: Does not require us to be speci…c about the form of
heteroskedastictiy/serial correlation
Disadvantage: Loss of e¢ ciency
2 We may want to derive an alternative estimator that is BLUE
Generalized Least Squares estimator b
βGLS
Advantage: Allows us to regain (asymptotically) e¢ ciency
Disadvantage: Requires us to be speci…c about the form of
heteroskedasticity/serial correlation

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 7 / 43


GLS Estimator - E¢ ciency

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 8 / 43


Generalized Linear Regression: BLUE estimation I
Idea: Transform the model y = Xβ + ε, Var(ε|X) = σ²Ω, in such a
way that the transformed model does satisfy the Gauss-Markov
assumptions.

Find a matrix R (square, nonsingular), such that

Ry = RXβ + Rε
y* = X*β + ε*   (Define: y* = Ry, X* = RX, ε* = Rε)

satisfies the Gauss-Markov conditions.

OLS on the transformed model is BLUE (because the transformed
model satisfies the GM assumptions).

The matrix R that would guarantee this needs to satisfy the condition:
R'R = Ω^{-1}

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 9 / 43


Generalized Linear Regression: BLUE estimation II

Show that a transformation R such that R'R = Ω^{-1} ensures the GM
assumptions are satisfied:

Ry = RXβ + Rε
y* = X*β + ε*   (Define: y* = Ry, X* = RX, ε* = Rε)

i.e.,

X*: absence of perfect multicollinearity
E(ε*|X*) = 0
Var(ε*|X*) is a scalar covariance matrix

Visualizer notes VN7.2-3

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 10 / 43


Generalized Linear Regression: BLUE estimation III

Show that a transformation R such that R'R = Ω^{-1} exists:

Since Ω is symmetric, we can diagonalize it as

Ω = CΛC'   with C'C = I_n

Since all eigenvalues in Λ are positive (Ω is pos. def.), Λ^{-1} exists and

Ω^{-1} = CΛ^{-1}C'

Hence, a square nonsingular matrix that satisfies R'R = Ω^{-1} is

R = CΛ^{-1/2}C'

R'R = (CΛ^{-1/2}C')'(CΛ^{-1/2}C') = CΛ^{-1/2}C'CΛ^{-1/2}C' = CΛ^{-1}C' = Ω^{-1}

(A numerical check of this construction follows below.)

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 11 / 43
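A minimal numerical sketch of the construction above (Python/NumPy): for an arbitrary positive definite Ω (assumed here purely for illustration), build R = CΛ^{-1/2}C' from the eigendecomposition and verify R'R = Ω^{-1}.

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
Omega = A @ A.T + 4 * np.eye(4)              # an arbitrary symmetric positive definite Omega

lam, C = np.linalg.eigh(Omega)               # Omega = C diag(lam) C', with C'C = I
R = C @ np.diag(lam**-0.5) @ C.T             # R = C Lambda^{-1/2} C'

print(np.allclose(R.T @ R, np.linalg.inv(Omega)))   # True: R'R = Omega^{-1}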


Generalised Least Squares Estimator I

OLS on the transformed linear regression model yields an estimator
with desirable optimality properties,
as the OLS estimator in a model that satisfies all Gauss-Markov
conditions is BLUE.

We call this estimator the GLS estimator: VN7.4-5

β̂_GLS = (X*'X*)^{-1} X*'y* = (X'Ω^{-1}X)^{-1} X'Ω^{-1}y

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 12 / 43


Generalised Least Squares Estimator I
Mathematically
GLS estimator minimizes the generalised sum of squares

S (b ) = (y Xb )0 Ω 1 (y Xb )

i.e.,
β̂GLS = arg min S (b )
b
(Derive FOC, and solve). VN7.6

In contrast: OLS estimator minimizes the sum of squares

S (b ) = (y Xb )0 (y Xb )

In the setting of heteroskedasticity, the GLS estimator accounts for the


fact that some observations are associated with a higher variability
than others by giving their discrepancies (residuals) less weight!
Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 13 / 43
Properties of β̂_GLS
Deriving the properties of β̂_GLS is easiest done by using the familiar
proofs on the transformed model:

y* = X*β + ε*, with E(ε*|X*) = 0 and Var(ε*|X*) = σ²I

E(β̂_GLS) = β as before
Var(β̂_GLS|X) = σ²(X*'X*)^{-1} as before.

Then replace X* = RX and use R'R = Ω^{-1} to reveal: VN7.7

Var(β̂_GLS|X) = σ²(X'Ω^{-1}X)^{-1}

In PS 7:

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 14 / 43


Properties of s²_GLS and sampling distribution of β̂_GLS

The unbiased estimator of the variance parameter σ² is given by

s²_GLS = ε̂*'ε̂*/(N - k) = (y* - X*β̂_GLS)'(y* - X*β̂_GLS)/(N - k)
       = (y - Xβ̂_GLS)'Ω^{-1}(y - Xβ̂_GLS)/(N - k)   VN7.8-9

Proof of unbiasedness is easiest done by using the familiar proof on the
transformed model.

If ε|X ~ N(0, σ²Ω), then we can show

β̂_GLS|X ~ N(β, σ²(X'Ω^{-1}X)^{-1})

and β̂_GLS is independent of s²_GLS.

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 15 / 43


Feasible GLS
Generalized least squares is generally not feasible as Ω typically is
unknown.

The GLS estimator that uses a consistent estimator of Ω is called the
Feasible GLS estimator:

β̂_FGLS = (X'Ω̂^{-1}X)^{-1} X'Ω̂^{-1}y

To enable a consistent estimator for Ω, we will need to parameterize
the matrix Ω in terms of a finite-dimensional parameter vector θ:
Ω = Ω(θ)
Use the classical OLS residuals ε̂ = y - Xβ̂ to obtain θ̂ and Ω̂ = Ω(θ̂).

Consistency of θ̂ ensures consistency of Ω̂ (Slutsky):
plim Ω̂ = plim Ω(θ̂) = Ω(plim θ̂) = Ω(θ) = Ω

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 16 / 43


Feasible GLS - Properties

b
βFGLS has only desirable asymptotic properties!

Consistent
b
βFGLS typically is biased; in fact, β̂FGLS is not even linear! VN7.10

Hence bβFGLS cannot be BLUE


Given small sample, using FGLS may not be a good idea.
Asymptotic normal:

a 1
β̂FGLS jX N ( β, σ2 X 0 Ω 1 X )

even is εjX N (0, σ2 Ω)

"Only asymptotically does β̂FGLS inherit the desirability of β̂GLS ".


Given large sample, using FGLS is a good idea.

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 17 / 43


Cross-Section and Heteroskedasticity

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 18 / 43


Heteroskedasticity

Heteroskedasticity refers to the situation where the variance of the
errors ε_i differs across observations: Var(ε_i|X) = σ_i² ≠ σ².

This problem is frequently encountered in cross-sectional models.

In cross sections, one usually deals with members of a population at a
given point in time (individual consumers, firms, industries).
Under random sampling,

Var(ε_i|X) = Var(ε_i|x_i) = σ_i²

In the presence of heteroskedasticity σ_i² is a function of x_i.

Engel curve: Variation of food expenditure among high-income
households is much larger than the variation among low-income
households. VN7.11

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 19 / 43


Heteroskedasticity
Let us consider setting where only GM violation is presence of
Heteroskedasticity

yi = xi0 β + εi , E (εjX ) = 0 but Var (εjX ) = Σ = diag (σ21 , ..., σ2n )

Var (εi jX ) = σ2i 6= σ2 ; while Cov (εi , εj jX ) = 0


Approaches:
1 We decide to stick with the OLS estimator but we use a CORRECT
estimator of the covariance matrix.
Heteroskedasticity-robust Inference
2 We may want to derive an alternative estimator that is BLUE
GLS = Weighted Least Squares

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 20 / 43


Heteroskedasticity-robust Inference (OLS)
In the presence of heteroskedasticity and absence of autocorrelation: VN7.12

Var(β̂|X) = (Σ_{i=1}^n x_i x_i')^{-1} (Σ_{i=1}^n σ_i² x_i x_i') (Σ_{i=1}^n x_i x_i')^{-1}

When the {σ_i²}_{i=1}^n are unknown: we simply replace σ_i² with ε̂_i² to obtain
robust SEs, where the ε̂_i are the OLS residuals:

Var-hat(β̂|X) = (Σ_{i=1}^n x_i x_i')^{-1} (Σ_{i=1}^n ε̂_i² x_i x_i') (Σ_{i=1}^n x_i x_i')^{-1}

Important: This does not mean that Σ̂ = diag(ε̂_1², ..., ε̂_n²) is a consistent
estimator of Σ! The result relies on the fact that the matrix (1/n)Σ ε̂_i² x_i x_i' is a
good approximation of (1/n)Σ σ_i² x_i x_i' (White, 1980).
The heteroskedasticity-consistent standard errors, or simply White SEs, are
obtained by taking square roots of the diagonal elements. (A sketch of this
sandwich formula follows below.)
Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 21 / 43
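A minimal sketch of the White sandwich formula (Python/NumPy, simulated heteroskedastic data with an assumed variance pattern; Stata's `robust` option applies small finite-sample scaling adjustments on top of this):

import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
eps = rng.normal(size=n) * (0.5 + np.abs(x))     # heteroskedastic errors (assumed form)
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
b = np.linalg.inv(X.T @ X) @ X.T @ y
e = y - X @ b                                    # OLS residuals

bread = np.linalg.inv(X.T @ X)
meat = (X * e[:, None]**2).T @ X                 # sum_i e_i^2 x_i x_i'
V_white = bread @ meat @ bread                   # sandwich estimator of Var(beta_hat | X)
print(np.sqrt(np.diag(V_white)))                 # White (robust) standard errors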
GLS efficiency - Heteroskedasticity = WLS

What transformation allows us to rid ourselves of the problem of
heteroskedasticity (which resulted in inefficiency)?
We show: VN7.13-14

The transformed model that satisfies all GM assumptions (the problem
of heteroskedasticity removed) is given by

y_i/σ_i = (x_i/σ_i)'β + ε_i/σ_i,  i = 1, ..., n.

OLS on the transformed model gives us the GLS estimator:

β̂_GLS = [Σ_{i=1}^n x_i x_i'/σ_i²]^{-1} Σ_{i=1}^n x_i y_i/σ_i² = [Σ_{i=1}^n r_i x_i x_i']^{-1} Σ_{i=1}^n r_i x_i y_i,

It is also called the weighted least squares estimator (WLS), with
weights r_i given by (1/σ_i²). (A small sketch follows below.)

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 22 / 43
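A minimal sketch of WLS as OLS on the transformed data (Python/NumPy), assuming σ_i is known up to scale in this simulated design:

import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)
sigma_i = 0.5 + np.abs(x)                        # assumed known (up to scale) std devs
y = 1.0 + 2.0 * x + sigma_i * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
Xs, ys = X / sigma_i[:, None], y / sigma_i       # divide each observation by sigma_i
b_wls = np.linalg.inv(Xs.T @ Xs) @ Xs.T @ ys     # = [sum x_i x_i'/sigma_i^2]^{-1} sum x_i y_i/sigma_i^2
print(b_wls)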


GLS e¢ ciency - Heteroskedasticity = WLS

" # 1
n n
xi x 0 xi yi
β̂ = ∑ 2i ∑ 2
i =1 σ i i =1 σ i

If σ2i is known (up to scale), we can calculate this estimator, and apply
the standard t- and F -tests as in the linear regression model. VN7.14

Stata: use option in regression to provide variable which needs to be


used as weight.
Stata: weights: fweights (frequency), pweight (probability), aweight
(analytic), iweight (importance) 1/σ2i

An example where this may be the case is where we observe only group
averages and the size of each group di¤ers. See PS7, Q2

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 23 / 43


Feasible GLS - Feasible WLS I

Typically the {σ_i²}_{i=1}^n are not known ⇒ Need to consider Feasible WLS!

This approach requires us to use consistent estimates of σ_i²:

β̂_FGLS = [Σ_{i=1}^n (1/σ̂_i²) x_i x_i']^{-1} Σ_{i=1}^n (1/σ̂_i²) x_i y_i

which will require us to impose further structure (parameterize)!
By A.3, σ_i² = Var(ε_i|x_i) = E(ε_i²|x_i).
We consider two cases:
Linear Heteroskedasticity VN7.15

σ_i² = δ_0 + z_i'δ_1   [= E(ε_i²|x_i)]

Exponential Heteroskedasticity VN7.16-17

σ_i² = exp(δ_0 + z_i'δ_1)

where z_i are functions of x_i (x_i, x_i², other).
Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 24 / 43
Testing for Heteroskedasticity

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 25 / 43


Testing for Heteroskedasticity

Tests designed to detect heteroskedasticity will, in most cases, be


applied to the ordinary least squares residuals.
Based on the idea that OLS is unbiased and consistent estimator of β
in the presence of heteroskedasticity, and as such, the OLS residuals
will mimic the heteroskedasticity of the true disturbance.

The choice of the most appropriate test for heteroskedasticity is


determined by how explicit we want to be about the form of
heteroskedasticity.
The more explicit we are, the more powerful the test will be. However,
if the true heteroskedasticity is of a di¤erent form, the chosen test may
not indicate the presence of heteroskedasticity at all.

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 26 / 43


Testing for Heteroskedasticity

1 Test signi…cance of regression (in FGLS setting) (Wald test)


Test for heteroskedasticity with speci…c form

2 Breusch-Pagan/Godfrey test (LM test)


Test for heteroskedasticity - exact form not needed

3 White test
Does not specify anything about the form of heteroskedastic

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 27 / 43


Testing significance of regression (in FGLS setting)
In this test, we test, e.g., H0: σ_i² = σ² for all i
against HA: σ_i² = δ_0 + z_i'δ_1.
Under the alternative, the form of heteroskedasticity is given!

The test is equivalent to testing

H0: δ_1 = 0
HA: at least one element of δ_1 non-zero
It requires us to test the significance of the regression:(1)
ε̂_i² = δ_0 + z_i'δ_1 + v_i
You either use the F statistic from the regression output, or, more appropriately
(we don't have the ε_i²!), use the asymptotic version of it:
δ̂_1' [Var-hat(δ̂_1)]^{-1} δ̂_1 ~a χ²_p under H0, where p = dim(δ_1)

(1) If we specify HA: σ_i² = exp(δ_0 + z_i'δ_1), we would use the test of the significance of
the regression ln(ε̂_i²) = δ_0 + z_i'δ_1 + v_i instead.
Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 28 / 43
Breusch-Pagan Test for Heteroskedasticity I

In this test, proposed by Breusch and Pagan, we test

H0: σ_i² = σ² for all i
HA: σ_i² = σ² h(δ_0 + z_i'δ_1)

Under HA the heteroskedasticity is an unknown function of δ_0 + z_i'δ_1.

h(·) is some unknown, continuously differentiable function.

Note that in this case we cannot estimate δ_1 itself, so use of the Wald test
δ̂_1' [Var-hat(δ̂_1)]^{-1} δ̂_1 is not possible.

The Breusch-Pagan test for heteroskedasticity is an example of a
Lagrange Multiplier (LM) test (we will discuss these tests in more detail
later).

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 29 / 43


Breusch-Pagan Test for Heteroskedasticity II

H0: σ_i² = σ² for all i
HA: σ_i² = σ² h(δ_0 + z_i'δ_1)

This LM test for heteroskedasticity takes the simple form of nR² of
an auxiliary regression (n denoting the sample size):

LM = nR² ~a χ²_p under H0, where p = dim(δ_1).

The auxiliary regression in this setting is given by

ε̂_i² = δ_0 + z_i'δ_1 + v_i   (with ε̂_i the OLS residual)

Reject H0 if nR² > χ²_{p,α}. A large R² gives evidence of a significant
relation between z_i and ε̂_i², indicative of heteroskedasticity!
We are not looking at δ_1 itself (F test) because the form is not explicit.
Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 30 / 43
White Test for Heteroskedasticity I

The White test tests a general hypothesis of the form:

H0 : σ2i = σ2 for all i


HA : Not H0

It is useful when one has no idea of the potential form of


heteroskeasticity
Equivalently, the null can be written as

H0 : E (ε2i jxi ) = σ2 for all i

The idea of the test is that if there is homoskedasticity, then xi or


functions of xi should not help to explain E (ε2i jxi ).

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 31 / 43


White Test for Heteroskedasticity II
The test works as follows:
Using the OLS residuals, run the auxilliary regression
OLS on ε̂2i = γ0 + zi0 γ1 + vi

zi may include some or all of the variables in xi as well as other


variables that depend on xi .
White’s original suggestion was to use xi plus the set of all unique
squares and cross products of variables in xi .

Test the hypothesis that γ1 = 0

Compute nR 2 from this auxilliary regression


We reject H0 if nR 2 > χ2p,α , as
a
nR 2 χ2p under H0 where p = dim (γ1 ) .

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 32 / 43


White Test for Heteroskedasticity III

Example: Consider the model

y_i = β_1 + β_2 x_{i2} + β_3 x_{i3} + ε_i,  i = 1, ..., N

To perform the White test (for heteroskedasticity):

1 Calculate the least-squares residuals ε̂_i.
2 Run the following (auxiliary) regression:

ε̂_i² = α_1 + α_2 x_{i2} + α_3 x_{i3} + α_4 x_{i2}² + α_5 x_{i3}² + α_6 x_{i2}x_{i3} + ν_i

3 Compute nR² from this (auxiliary) regression.

4 Reject if nR² > χ²_{5,α}. (A numerical sketch of these steps follows below.)

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 33 / 43
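A minimal numerical sketch of the four White-test steps on simulated data (Python/NumPy + SciPy; the variance pattern and parameter values are assumptions made only to illustrate the mechanics):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 400
x2, x3 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 0.5*x2 - 0.3*x3 + rng.normal(size=n) * (1 + np.abs(x2))   # heteroskedastic errors

X = np.column_stack([np.ones(n), x2, x3])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]                  # step 1: OLS residuals

Z = np.column_stack([np.ones(n), x2, x3, x2**2, x3**2, x2*x3])    # step 2: auxiliary regressors
g = np.linalg.lstsq(Z, e**2, rcond=None)[0]
fit = Z @ g
R2 = 1 - ((e**2 - fit)**2).sum() / ((e**2 - (e**2).mean())**2).sum()

stat = n * R2                                                     # step 3: nR^2
print(stat, chi2.ppf(0.95, 5), stat > chi2.ppf(0.95, 5))          # step 4: reject?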


Time Series and Autocorrelation

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 34 / 43


Autocorrelation-robust inference
Autocorrelation refers to the situation where the errors ε_i are
correlated across observations: Cov(ε_i, ε_j|X) ≠ 0 for some i ≠ j.
This problem is frequently encountered in time-series models.

If we have determined that OLS still has desirable large sample
properties despite this autocorrelation, we will want to use robust
standard errors. VN7.18

Var(β̂|X) = (X'X)^{-1} X'ΣX (X'X)^{-1}

In the presence of autocorrelation we cannot use OLS when we have
lagged dependent variables.
Robust standard errors need to account for the autocorrelation - we briefly
discuss HAC robust standard errors.
We will not discuss GLS/FGLS in this setting (do we really know the
dependence structure?).

How do we detect the presence of autocorrelation? VN7.19

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 35 / 43


Autocorrelation-robust Inference (technical detail) I

For simplicity, consider the bivariate regression

y_t = β_0 + β_1 x_t + ε_t,

Suppose A.1-A.3 hold, Var(ε_t|X) = σ² and
Cov(ε_t, ε_{t+s}|X) = γ_s with γ_s = 0 when s > lag.

Recall β̂_1 = β_1 + Σ_{t=1}^T d_t ε_t with d_t = (x_t - x̄)/Σ_{t=1}^T (x_t - x̄)².

As the ε_t's are no longer independent, Var(β̂_1|X) is more complex,
and equals

Var(Σ_{t=1}^T d_t ε_t | X) = Σ_{t=1}^T Var(d_t ε_t|X) + Σ_{t=1}^T Σ_{s≠t} Cov(d_t ε_t, d_s ε_s|X)

where the second (covariance) term is not zero!

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 36 / 43


Serial Correlation-Robust SE (technical detail) II
Using the properties of the conditional variance:

Var(Σ_{t=1}^T d_t ε_t | X) = Σ_{t=1}^T Var(d_t ε_t|X) + Σ_{t=1}^T Σ_{s≠t} Cov(d_t ε_t, d_s ε_s|X)
                           = Σ_{t=1}^T d_t² Var(ε_t|X) + Σ_{t=1}^T Σ_{s≠t} d_t d_s Cov(ε_t, ε_s|X)
                             [Var(ε_t|X) = σ²]          [only 0 if |t - s| > lag]

Intuitively: an estimator for Var(β̂_1|X) becomes

Var-hat(Σ_{t=1}^T d_t ε_t | X) = Σ_{t=1}^T d_t² ε̂_t²  +  Σ_{t=1}^T Σ_{s≠t, |t-s|≤lag} d_t d_s ε̂_t ε̂_s
                                 [heterosk. robust part]   [serial correlation robust part]

(this extends the ideas by White on heteroskedasticity robust standard errors).

The HAC estimator (in the book, nonexaminable) is a modification hereof
that allows the lag to grow with the sample size. (A small sketch of this
truncated variance estimator follows below.)
Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 37 / 43
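A minimal sketch of the truncated serial-correlation-robust variance for the slope above (Python/NumPy; the AR(1) error design and the truncation lag are assumptions). The full Newey-West/HAC estimator additionally downweights the j-th autocovariance term by a Bartlett weight so the variance stays positive.

import numpy as np

rng = np.random.default_rng(6)
T, rho = 300, 0.6
eps = np.zeros(T)
for t in range(1, T):                       # AR(1) errors (illustrative)
    eps[t] = rho * eps[t-1] + rng.normal()
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + eps

xd = x - x.mean()
d = xd / (xd**2).sum()                      # d_t from the slide
b1 = (xd * y).sum() / (xd**2).sum()         # OLS slope
b0 = y.mean() - b1 * x.mean()
ehat = y - b0 - b1 * x                      # OLS residuals

lag = 4                                     # chosen truncation lag (an assumption)
V = (d**2 * ehat**2).sum()                  # heteroskedasticity-robust part
for j in range(1, lag + 1):                 # |t - s| = j covariance terms (counted twice)
    V += 2 * (d[j:] * d[:-j] * ehat[j:] * ehat[:-j]).sum()
print(np.sqrt(V))                           # serial-correlation-robust SE of beta1_hat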
Serial Correlation-Robust Standard error

To make inference robust to presence of heteroskedasticity, we used

reg y x1 x2 ... xk, robust

(White standard errors)


To make inference robust to general forms of serial correlation, we use

newey y x1 x2 ... xk, lag(q)

(Newey-West or HAC standard errors)


The N-W standard errors are not as automated as the adjustment for
heteroskedasticity because we have to choose a lag.
If we choose q = 0, we get the White robust SE’s.
With annual data, the lag is usually fairly short, such as lag = 2, but
with quarterly or monthly data we tend to try longer lags.

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 38 / 43


Testing for Autocorrelation

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 39 / 43


Testing for Serial Correlation: Preliminaries

When testing for zero autocorrelation, we need to consider a
particular autocorrelation model.
The most common is an AR(1) process

ε_t = ρ ε_{t-1} + v_t,  where v_t is white noise

We want to test the single hypothesis:

H0: ρ = 0 against HA: ρ ≠ 0 (or ρ > 0)

If we could observe {ε_t}, we would just estimate the AR(1) model for ε_t and
use a t-test for ρ = 0.
Since we don't observe {ε_t}, we will base our test on the OLS
residuals ε̂_t (similar to the heteroskedasticity tests).

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 40 / 43


Serial Correlation test under strict exogeneity
1 Estimate the model

y_t = β_0 + β_1 x_{1t} + ... + β_k x_{kt} + u_t

by OLS, and save the residuals {û_t, t = 1, ..., T}.

2 Run the AR(1) regression of
û_t on û_{t-1}
(you may add an intercept).
3 Implement the usual or heteroskedasticity-robust t test for H0: ρ = 0.
4 If we do not reject the null, the usual inference is fine. If we reject the
null, OLS inference is invalid and some action is required.
(A numerical sketch of steps 1-3 follows below.)

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 41 / 43
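A minimal sketch of the residual-based test (Python/NumPy, simulated data with an assumed AR(1) error): regress the OLS residuals on their lag and compute the t statistic for ρ = 0.

import numpy as np

rng = np.random.default_rng(7)
T = 300
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t-1] + rng.normal()                 # serially correlated errors (illustrative)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(T), x])
uhat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]    # step 1: OLS residuals

u1, u0 = uhat[1:], uhat[:-1]
W = np.column_stack([np.ones(T-1), u0])                # step 2: regress u_t on u_{t-1} (with intercept)
g = np.linalg.lstsq(W, u1, rcond=None)[0]
res = u1 - W @ g
s2 = (res @ res) / (len(u1) - 2)
se = np.sqrt(s2 * np.linalg.inv(W.T @ W)[1, 1])
print(g[1] / se)                                       # step 3: t statistic for H0: rho = 0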


Serial Correlation test under strict exogeneity - Remarks

We can easily add lags to the above test and use the F test. For
example we can regress

ût on ût 1 , ût 2

and test for joint signi…cance.

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 42 / 43


Serial Correlation test under contemporaneous exogeneity

A simple adjustment is needed if the regressors are not strictly exogenous.

All we have to do is add all regressors and an intercept along with the lagged
OLS residual.
Precisely, Step 2 should be replaced with OLS of

û_t on û_{t-1}, x_{t1}, ..., x_{tk}

Heuristic explanation (replacing (û_t, û_{t-1}) above with (u_t, u_{t-1})):

adding the regressors accounts for the fact that u_{t-1} might be
correlated with x_{t1}, ..., x_{tk} if the x_{tj}'s are not strictly exogenous.
For example, this test should be used if the x_{tj}'s contain lagged y
(because strict exogeneity is then always violated).

Dr M. Schafgans (LSE) EC221: Generalized Linear Regression Model 43 / 43


VN7.1
VN7.2
VN7.3
VN7.4
VN7.5
VN7.6
VN7.7
VN7.8
VN7.9
VN7.10
VN7.11a
VN7.11b
VN7.12
VN7.13
VN7.14
VN7.15
VN7.16
VN7.17
VN7.18
VN7.19
EC221: Principles of Econometrics
Endogeneity

Dr M. Schafgans

London School of Economics

Lent 2022

Dr M. Schafgans (LSE) EC221: Endogeneity 1 / 50


Endogeneity (Overview) I
There may be statistical or economic reasons why we might expect
that the errors and regressors are correlated.

omitted variables
lagged dependent variables when errors exhibit dependence
measurement errors in the regressors
simultaneity

In the presence of correlation between errors and regressors

E ( ε jX ) 6 = 0

The problem is called Endogeneity: a key concept in econometrics and


econometric applications
A regressor x with Cov (x , u ) 6= 0 is called endogenous regressor.
A regressor x with Cov (x , u ) = 0 is called exogenous regressor.

Correlation between errors and regressors is a serious GM violation.


Dr M. Schafgans (LSE) EC221: Endogeneity 2 / 50
Endogeneity: Consequences for OLS Estimation I

Multiple linear regression model y = Xβ + ε, OLS estimator:

β̂ = (X'X)^{-1}X'y = β + (X'X)^{-1}X'ε.

For unbiasedness we required assumption A.3: E(ε|X) = 0.

Correlation between (stochastic) regressors and errors violates this
assumption, E(ε|X) ≠ 0 ⇒ the estimator becomes biased:
E(β̂|X) = β + (X'X)^{-1}X'E(ε|X) ≠ β and
E(β̂) = E[E(β̂|X)] ≠ β
Moreover, our estimator will be inconsistent:

plim β̂ = β + plim[(X'X/n)^{-1}(X'ε/n)]
        = β + [plim(X'X/n)]^{-1} plim(X'ε/n) ≠ β

Recall (LLN): plim(X'ε/n) = plim(1/n)Σ x_i ε_i = E(x_i ε_i) ≠ 0.
Dr M. Schafgans (LSE) EC221: Endogeneity 3 / 50


Endogeneity: Consequences for OLS Estimation II

Is it surprising that OLS performs bad?


Recall: The OLS estimator ensures that the following normal equations
are satis…ed
n
X 0 ε̂ = ∑i =1 xi ε̂i = 0.

OLS residuals are required to be orthogonal to the regressors.


If the true errors and regressors are correlated, E (xi εi ) 6= 0, imposing
that the residuals and regressors are orthogonal is inappropriate
(sample analogue)

Since it is inappropriate to impose X 0 ε̂ = 0 ) OLS will be a bad


estimator

The classic example is the Omitted Variable Bias problem we


discussed last term. VN8.1-2

Dr M. Schafgans (LSE) EC221: Endogeneity 4 / 50


Endogeneity
Lagged dependent variables as regressors
and
serially correlated errors

Dr M. Schafgans (LSE) EC221: Endogeneity 5 / 50


Lagged DepVar Regressors and Serially Correl Errors

Suppose the model of interest is given by

y_t = β_1 + β_2 x_t + β_3 y_{t-1} + ε_t,  |β_3| < 1,  t = 1, ..., T
ε_t = ρ ε_{t-1} + v_t,  |ρ| < 1,  v_t i.i.d.(0, σ_v²)
v_t uncorrelated with any x_s and with ε_{t-1}, ε_{t-2}, ....

In this case we can show Cov(y_{t-1}, ε_t) = E(y_{t-1} ε_t) ≠ 0.

Hence y_{t-1} is an endogenous explanatory variable.

Reason (intuition): ε_t is a function of ε_{t-1}, and y_{t-1} is a function of
ε_{t-1} as well; therefore ε_t and y_{t-1} are correlated.

A similar problem arises in PS8-extra, Q2, where ε_t has an MA(1) dependence.
Dr M. Schafgans (LSE) EC221: Endogeneity 6 / 50


Lagged DepVar Regressors and Serially Correl Errors II

y_t = β_1 + β_2 x_t + β_3 y_{t-1} + ε_t,  |β_3| < 1,  t = 1, ..., T
ε_t = ρ ε_{t-1} + v_t,  |ρ| < 1,  v_t i.i.d.(0, σ_v²)
v_t uncorrelated with any x_s and with ε_{t-1}, ε_{t-2}, ....

Let us see how we can obtain an expression for Cov(ε_t, y_{t-1}):

Cov(ε_t, y_{t-1}) = Cov(ρ ε_{t-1} + v_t, β_1 + β_2 x_{t-1} + β_3 y_{t-2} + ε_{t-1})
                  = ρ Var(ε_{t-1}) + ρ β_3 Cov(ε_{t-1}, y_{t-2})   (all other terms are zero!)
                  = ρ σ_ε² + ρ β_3 Cov(ε_t, y_{t-1})   by covariance stationarity

⇒ Cov(ε_t, y_{t-1}) = ρ σ_ε²/(1 - ρβ_3) ≠ 0   [ρβ_3 ≠ 1]

Dr M. Schafgans (LSE) EC221: Endogeneity 7 / 50


Lagged DepVar Regressors and Serially Correl Errors III

This correlation will make our OLS estimates β̂_1, β̂_2, and β̂_3
inconsistent:

Using the matrix notation y = Xβ + ε, with X having rows (1, x_t, y_{t-1}),

plim β̂ = β + plim(X'X/T)^{-1} plim(X'ε/T)
        = β + M_XX^{-1} (plim(1/T)Σ ε_t, plim(1/T)Σ x_t ε_t, plim(1/T)Σ y_{t-1} ε_t)'
        = β + M_XX^{-1} (E ε_t, E x_t ε_t, E y_{t-1} ε_t)'   (by the LLN)
        = β + M_XX^{-1} (0, 0, Cov(y_{t-1}, ε_t))' ≠ β

Typically all parameters are inconsistent!

Dr M. Schafgans (LSE) EC221: Endogeneity 8 / 50


Endogeneity
Measurement Error in Regressors

Dr M. Schafgans (LSE) EC221: Endogeneity 9 / 50


Measurement Error in regressors I

Measurement error is a widespread problem in practice, since a lot of
economic data are poorly measured.
E.g., a common problem in surveys: faulty responses due to unclear
questions, memory errors, deliberate distortion of responses (e.g.,
prestige bias), or misrecording of responses.

Say we have a model that satisfies the GM assumptions

y_i = x_i*'β + ε_i,  E(ε_i|x_i*) = 0,  (x_i*, ε_i) i.i.d.

We observe x_i* with measurement error:

x_i = x_i* + u_i

u_i denotes the measurement error (vector). One or more variables may
be measured with error.
Dr M. Schafgans (LSE) EC221: Endogeneity 10 / 50
Measurement Error in regressors II
Our estimable linear regression model is then given by
y_i = (x_i - u_i)'β + ε_i
y_i = x_i'β + v_i,  where v_i = ε_i - u_i'β is a composite error

Endogeneity will result as the measurement error will be contained in
both x_i and the composite error term.
In order to derive E(x_i v_i), assumptions need to be made about the
measurement error.

Classical Measurement Error assumption (CME):

u_i is independent of x_i* and ε_i;  u_i i.i.d.(0, σ_u²)

These are strong assumptions: the true value does not reveal any
information about the size, sign or value of the measurement error.
Other assumptions can be made here, and the exact result will depend
on what is assumed.
Dr M. Schafgans (LSE) EC221: Endogeneity 11 / 50
Measurement Error in regressors III

y_i = x_i'β + v_i,  where v_i = ε_i - u_i'β is a composite error

The estimable model will yield an inconsistent estimator β̂:

plim β̂ = β + plim(X'X/n)^{-1} plim(X'v/n)   (Slutsky Thm)
        = β + E(x_i x_i')^{-1} E(x_i v_i) ≠ β   (WLLN)

Under CME VN8.3

E(x_i v_i) = -E(u_i u_i')β ≠ 0
E(x_i x_i') = E(x_i* x_i*') + E(u_i u_i')

In general, all parameters are affected by measurement error in one or
more of the explanatory variables.
When only one explanatory variable is measured with a CME error, its
parameter will exhibit "attenuation bias" (PS8, Q1; a small simulation follows below).
Little can be said about the direction of the bias if more than one variable is
measured with error.
Dr M. Schafgans (LSE) EC221: Endogeneity 12 / 50
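A minimal Monte Carlo sketch of attenuation bias under CME (Python/NumPy; the single-regressor design, zero means, and Var(x*) = 1 are assumptions): the OLS slope converges to β·Var(x*)/(Var(x*) + σ_u²) rather than β.

import numpy as np

rng = np.random.default_rng(8)
n, beta, s2u = 100_000, 2.0, 1.0
xstar = rng.normal(size=n)                       # true regressor, Var = 1
x = xstar + np.sqrt(s2u) * rng.normal(size=n)    # observed with classical measurement error
y = beta * xstar + rng.normal(size=n)

b = (x @ y) / (x @ x)                            # OLS slope (no intercept; zero-mean design)
print(b, beta * 1.0 / (1.0 + s2u))               # close to the attenuated value beta/2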
Endogeneity
Simultaneity

Dr M. Schafgans (LSE) EC221: Endogeneity 13 / 50


Simultaneity I

Simultaneity arises when some of the X ’s are jointly determined with


the dependent variable y in the same economic model.
Familiar examples include market equilibrium, models of the
macroeonomy, and set of factor or commodity demand equations.

Demand/supply model (endogenous variables, pt , qt )

Demand : qt = α1 + α2 pt + α3 mt + u1t
Supply : qt = β1 + β2 pt + β3 ct + u2t

Keynesian model (endogenous variables, Ct , Yt )

Ct = α + βYt + εt
Yt = Ct + It

These, so called structural form equations, cannot be estimated


equation-by-equation by OLS because of this endogeneity.

Dr M. Schafgans (LSE) EC221: Endogeneity 14 / 50


Simultaneity (Keynesian Model ) II
Structural Form Equations and OLS

Ct = α + βYt + εt , E (εt ) = 0
Yt = C t + I t
The structural form (SF) equations provide the relationships of
interest (behavioural relations)
Ct and Yt are endogenous variables, which are jointly determined
It is assumed to be exogenous, or predetermined (i.e., determined
outside the model). I.e., It is independent of the error εt .
OLS to the structural form equation yields inconsistent parameter
estimates for (α, β) because
Cov (Yt , εt ) 6= 0
To obtain the Cov (Yt , εt ) we need to obtain the reduced form of this
model.
Reduced form: express the endogenous variables in terms of the
exogenous variables and errors only.
Dr M. Schafgans (LSE) EC221: Endogeneity 15 / 50
Simultaneity (Keynesian Model) III
Reduced Form

The reduced form (RF) representation of our model is:

Y_t = π_11 + π_12 I_t + v_{1t}
C_t = π_21 + π_22 I_t + v_{2t}

To get the reduced form for Y_t (plug the 1st equation into the 2nd):

Y_t = α + β Y_t + ε_t + I_t
⇒ Y_t = α/(1 - β) + I_t/(1 - β) + ε_t/(1 - β)

We need to assume that 1 - β ≠ 0.

Clearly Cov(Y_t, ε_t) = Cov(α/(1 - β) + I_t/(1 - β) + ε_t/(1 - β), ε_t) = σ_ε²/(1 - β) ≠ 0.
To get the reduced form for C_t, we plug this back into the structural
equation.
OLS on the reduced form equations yields consistent parameter
estimates.
Dr M. Schafgans (LSE) EC221: Endogeneity 16 / 50
Simultaneity (Demand/Supply Model) IV
Inconsistency of OLS on the Structural form

Demand: q_t = α_1 + α_2 p_t + α_3 m_t + u_{1t}
Supply:  q_t = β_1 + β_2 p_t + β_3 c_t + u_{2t}

E(u_t) = 0,  E(u_t u_t') = [σ_1², σ_12; σ_12, σ_2²],  m_t, c_t exogenous, indep. of (u_{1t}, u_{2t})

OLS on either the demand or the supply equation will be inconsistent.

Show: Cov(p_t, u_{1t}) ≠ 0 and Cov(p_t, u_{2t}) ≠ 0.
Derive the reduced form (RF) representation of our model

p_t = π_11 + π_12 m_t + π_13 c_t + v_{1t}
q_t = π_21 + π_22 m_t + π_23 c_t + v_{2t}

See PS8, Q2
Dr M. Schafgans (LSE) EC221: Endogeneity 17 / 50


Simultaneity (Demand/Supply Model) V
Inconsistency of OLS on the Structural form

Demand: q_t = α_1 + α_2 p_t + α_3 m_t + u_{1t}
Supply:  q_t = β_1 + β_2 p_t + β_3 c_t + u_{2t}

E(u_t) = 0,  E(u_t u_t') = [σ_1², σ_12; σ_12, σ_2²],  m_t, c_t exogenous, indep. of (u_{1t}, u_{2t})

Proof of inconsistency for the demand equation:

Write q = X_1 α + u_1, with X_1 having rows (1, p_t, m_t), so that
α̂ = (X_1'X_1)^{-1}X_1'q = α + (X_1'X_1/T)^{-1}(X_1'u_1/T)

plim α̂ = α + M_{X1X1}^{-1} (E u_{1t}, E p_t u_{1t}, E m_t u_{1t})' ≠ α,  with M_{X1X1} = E(x_{1t} x_{1t}')
Dr M. Schafgans (LSE) EC221: Endogeneity 18 / 50
Solution to the Endogeneity Problem

E ( ε jX ) 6 = 0

Instrumental Variable (IV) Estimation

Dr M. Schafgans (LSE) EC221: Endogeneity 19 / 50


Solution Endogeneity: IV Estimation

y = X β + ε with E (εjX ) 6= 0
Recall, OLS estimator was bad as it imposes
1 0 1
X ε̂ = X 0 (y X b βOLS ) = 0
n n
when in fact the errors and regressors are correlated.
Suppose there exist a set of explanatory variables Z which
are correlated with our regressors X (relevance) and
are NOT correlated with the error ε (validity).
These variables can be used for estimation purposes and are known as:
Instrumental Variables.
De…ne our Instrumental Variable estimator b
βIV by imposing:
1 0 IV 1
Z ε̂ = Z 0 (y Xb
βIV ) = 0.
n n

Dr M. Schafgans (LSE) EC221: Endogeneity 20 / 50


Instrumental Variables Estimator

Our Instrumental Variable estimator β̂_IV is given by the sample
analogue of the moment condition E(z_i ε_i) = 0:

(1/n)Z'ε̂_IV = (1/n)Z'(y - Xβ̂_IV) = 0.

If we want to be able to solve for β̂_IV, we have to have enough
conditions here (order condition)!!
Provided Z'X is square and non-singular (just-identified) this gives:
VN8.4

β̂_IV = (Z'X)^{-1}Z'y

The requirement of its invertibility (just-identified setting) is related to
the rank condition - it captures both relevance and the requirement
that instruments cannot affect the outcome directly.
Dr M. Schafgans (LSE) EC221: Endogeneity 21 / 50


Properties of the IV Estimator (large sample)

β̂_IV = (Z'X)^{-1}Z'y   (just-identified)

In general E(β̂_IV) ≠ β:
E(β̂_IV|X, Z) = β + (Z'X)^{-1}Z'E(ε|X, Z) ≠ β
If there is endogeneity, then E(ε|X, Z) ≠ 0 because Cov(ε, X) ≠ 0!
Therefore β̂_IV is in general biased (finite sample property).

But plim β̂_IV = β:

plim β̂_IV = β + [plim(Z'X/n)]^{-1} plim(Z'ε/n) = β

Relevance and Exclusion ensure plim(Z'X/n) = M_ZX = E(z_i x_i')
(invertible).
Validity ensures plim(Z'ε/n) = E(z_i ε_i) = 0.
Therefore β̂_IV is consistent! (A small simulation sketch follows below.)

Dr M. Schafgans (LSE) EC221: Endogeneity 22 / 50
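A minimal simulation sketch of the just-identified IV estimator (Python/NumPy; the data generating design and coefficients are assumptions for illustration): OLS is inconsistent because the regressor is built to be correlated with the error, while (Z'X)^{-1}Z'y recovers the true slope.

import numpy as np

rng = np.random.default_rng(9)
n = 5000
z = rng.normal(size=n)                       # instrument (independent of the error)
e = rng.normal(size=n)
x = 0.8 * z + 0.5 * e + rng.normal(size=n)   # regressor correlated with the error e
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
b_ols = np.linalg.inv(X.T @ X) @ X.T @ y     # inconsistent under endogeneity
b_iv = np.linalg.inv(Z.T @ X) @ Z.T @ y      # just-identified IV: (Z'X)^{-1} Z'y
print(b_ols, b_iv)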


Properties of the IV Estimator

y = Xβ + ε with E(ε|Z) = 0
β̂_IV = (Z'X)^{-1}Z'y   (just-identified)

For statistical inference, we will use the following asymptotic
distributional result (CLT) for β̂_IV:

β̂_IV ~a N(β, σ²(Z'X)^{-1}Z'Z(X'Z)^{-1})

This assumes (conditional) homoskedasticity and zero autocorrelation:

Var(ε|Z) = σ²I

A natural estimator of σ² (recall E(ε_i²) = σ²) is

s²_IV = RSS/(n - k) = (y - Xβ̂_IV)'(y - Xβ̂_IV)/(n - k)

Dr M. Schafgans (LSE) EC221: Endogeneity 23 / 50


Conditions on our Instruments I

Let us consider the requirements on our instruments in more detail:

Consider the model (random sampling)

yt = xt0 β + εt , E (εt jxt ) 6= 0


zt instruments and β is a k 1 vector

Validity: Cov (zt , εt ) = E (zt εt ) = 0 implied by E (εt jzt ) = 0


The instruments should be uncorrelated with the error term.
Relevance: E (zt xt0 ) rank k (if square, invertible)
We need to have at least as many instruments as regressors (order
condition).
The instruments should be correlated with the original regressors.
To ensure full rank, the instruments for endogenous variables cannot
have a direct e¤ect on y

Dr M. Schafgans (LSE) EC221: Endogeneity 24 / 50


Conditions on our Instruments II

Consider the simple linear regression model

y_t = α + β w_t + ε_t  where Cov(w_t, ε_t) ≠ 0

Let x_t = (1, w_t)' and z_t = (1, d_t)'.

Validity: E(z_t ε_t) = (E ε_t, E d_t ε_t)' = 0
This shows that d_t should be uncorrelated with ε_t, since E(d_t ε_t) = Cov(d_t, ε_t).

Relevance: E(z_t x_t') = [1, E(w_t); E(d_t), E(d_t w_t)] needs to be invertible.
Non-zero determinant: E(d_t w_t) - E(d_t)E(w_t) = Cov(d_t, w_t) ≠ 0.
This shows that d_t should be correlated with w_t.

Dr M. Schafgans (LSE) EC221: Endogeneity 25 / 50


Conditions on our Instruments III
Consider the model where d_t has a direct effect on y_t:
y_t = α + β w_t + γ d_t + ε_t  where Cov(w_t, ε_t) = E(w_t ε_t) ≠ 0

Let x_t = (1, d_t, w_t)' and consider z_t = (1, d_t)'.

While validity is still satisfied, the rank of E(z_t x_t') cannot be equal to 3.

There are not enough instruments.
Choosing z_t = (1, d_t, d_t)' would not help:

E(z_t x_t') then contains two identical rows (both generated by d_t), so rank < 3.

We cannot use d_t as an instrument for w_t if it affects y_t directly.
Exclusion requirement!
Dr M. Schafgans (LSE) EC221: Endogeneity 26 / 50
Recap (calculus based)

Dr M. Schafgans (LSE) EC221: Endogeneity 27 / 50


Variance of IV estimator - simple linear regression

y_i = β_0 + β_1 x_i + u_i with E(u_i) = 0 but E(u_i x_i) ≠ 0

Our IV estimator β̂_{1,IV} can be written as VN8.5-9

β̂_{1,IV} = β_1 + Σ_{i=1}^n d_i u_i  with  d_i = (z_i - z̄)/Σ_{i=1}^n (z_i - z̄)(x_i - x̄)

Assuming (conditional) homoskedasticity and zero autocorrelation,

Var(β̂_{1,IV}|x, z) = σ² Σ_{i=1}^n d_i² = σ² / [Σ_{i=1}^n (x_i - x̄)² · R²_zx]

with R²_zx the squared sample correlation between the instrument (z) and the regressor (x).

The higher the correlation, the more precise the IV estimator!
If OLS is consistent: we would lose efficiency by using IV since R²_zx < 1.

Dr M. Schafgans (LSE) EC221: Endogeneity 28 / 50


Choice of Instruments

Dr M. Schafgans (LSE) EC221: Endogeneity 29 / 50


IV Choice of Instruments: General Principles I

We recall that good instruments Z are not only uncorrelated with
the error ε (which gives us consistency of β̂_IV) but are also highly
correlated with the X's (which gives accuracy to our estimator).

AVar(β̂_IV) = σ²(Z'X)^{-1}Z'Z(X'Z)^{-1}

X'Z/n can be viewed as the matrix of sample covariances between
the instruments Z and the original regressors X. VN8.10
If the X's and Z's are not too well correlated, then X'Z will have
elements all close to zero, hence (X'Z)^{-1} will have huge elements.
Consequently β̂_IV will have huge standard errors, i.e., it will be pretty
inaccurate.

We should choose Z = X if all regressors are "good":

β̂_IV = (Z'X)^{-1}Z'y = β̂.

Dr M. Schafgans (LSE) EC221: Endogeneity 30 / 50


IV Choice of Instruments: General Principles II

β̂IV = (Z 0 X ) 1
Z 0 y (just-identi…ed)

Common strategy
Variables in X that are thought to be "good" (exogenous) are included
in Z
AVar (b
βIV ) = σ2 (Z 0 X ) 1 Z 0 Z (X 0 Z ) 1
Instruments for endogenous variables are variables that do not enter
the regression equation itself
These variables a¤ect the dependent variable only through regressors
(exclusion restriction)
In dynamic models, lags of the included exogenous variables may
provide suitable instruments for endogenous lagged dependent variable.
In simultaneous equation models, considering the other equations in
this system provides a natural way to propose suitable instruments.

Dr M. Schafgans (LSE) EC221: Endogeneity 31 / 50


Simultaneity (Demand/Supply Model) V
Suitable Instruments (exactly identified)

Demand: q_t = α_1 + α_2 p_t + α_3 m_t + u_{1t}
Supply:  q_t = β_1 + β_2 p_t + β_3 c_t + u_{2t}

Recall Cov(p_t, u_{1t}) ≠ 0 and Cov(p_t, u_{2t}) ≠ 0. p_t is an endogenous
("bad") regressor in the demand and supply equations. VN8.11
Demand
We need an instrument for p_t. A "good" variable (valid), correlated with
p_t (relevance), which doesn't appear in the equation itself: c_t.
α̂_IV = (Z_1'X_1)^{-1}Z_1'q,  X_1 with rows (1, p_t, m_t);  Z_1 with rows (1, c_t, m_t)
Supply
We need an instrument for p_t. A "good" variable (valid), correlated with
p_t (relevance), which doesn't appear in the equation itself: m_t.
β̂_IV = (Z_2'X_2)^{-1}Z_2'q,  X_2 with rows (1, p_t, c_t);  Z_2 with rows (1, m_t, c_t)
Dr M. Schafgans (LSE) EC221: Endogeneity 32 / 50
Two Stage Least Squares
or
Optimal IV

Dr M. Schafgans (LSE) EC221: Endogeneity 33 / 50


Two-Stage Least Squares
In the case where there are more instruments (Z) than regressors (X)
(overidentified), we would like to choose our instruments optimally:

β̂_IV = (Z^opt' X)^{-1} Z^opt' y

The best choice is obtained by using the fitted values of an OLS
regression of the columns of X on all instruments, that is

Z^opt = X̂ = Z(Z'Z)^{-1}Z'X = P_Z X

X̂ gives the linear combination of the instruments (valid) with the highest
correlation with X (relevance).

Using these "optimal instruments", we obtain: VN8.12

β̂_IV = (X̂'X)^{-1}X̂'y = (X'P_Z X)^{-1}X'P_Z y

We can show (see PS8, Q4) this is equal to β̂_IV = (X̂'X̂)^{-1}X̂'y.

Dr M. Schafgans (LSE) EC221: Endogeneity 34 / 50


Two Stage Least Squares
This suggests that β̂_IV can be computed in two steps:

Step 1: Compute X̂ by OLS. The j-th column of X̂ contains the fitted
values of a regression of the j-th regressor (column of X) on all
instruments Z:
X̂ = Z(Z'Z)^{-1}Z'X = P_Z X
Step 2: Perform OLS of y on X̂:
β̂_IV = (X̂'X̂)^{-1}X̂'y

When using a two-step approach, typically one needs to correct the
standard errors.
If the computer does not know that X̂ contains fitted values, the estimator
of σ² used will be
s² = (y - X̂β̂_IV)'(y - X̂β̂_IV)/(n - k),  which is inconsistent for σ²!
To obtain a consistent estimator, s²_IV, one should use X rather than X̂.
(A numerical sketch of the two steps, with the corrected σ² estimate, follows below.)
Dr M. Schafgans (LSE) EC221: Endogeneity 35 / 50
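A minimal sketch of 2SLS via the two OLS steps (Python/NumPy; the overidentified design with two instruments is an assumption for illustration). Note the σ² estimate uses residuals formed with the original X, as stressed above.

import numpy as np

rng = np.random.default_rng(10)
n = 5000
z1, z2 = rng.normal(size=n), rng.normal(size=n)      # two instruments (overidentified)
e = rng.normal(size=n)
x = 0.6*z1 + 0.6*z2 + 0.5*e + rng.normal(size=n)     # endogenous regressor
y = 1.0 + 2.0*x + e

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])

Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]      # Step 1: fitted values Xhat = P_Z X
b_2sls = np.linalg.lstsq(Xhat, y, rcond=None)[0]     # Step 2: OLS of y on Xhat

u = y - X @ b_2sls                                   # residuals with the ORIGINAL X
s2 = (u @ u) / (n - X.shape[1])
V = s2 * np.linalg.inv(Xhat.T @ Xhat)                # = s2 (X'P_Z X)^{-1}
print(b_2sls, np.sqrt(np.diag(V)))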
IV for lagged dependent variable
with serial correlated errors

Dr M. Schafgans (LSE) EC221: Endogeneity 36 / 50


IV for lagged DepVar Regr with Serially Correl errors

Suppose the model of interest is given by

y_t = β_1 x_t + β_2 y_{t-1} + ε_t,  t = 1, ..., T
ε_t = ρ ε_{t-1} + v_t,  |ρ| < 1,  v_t i.i.d.(0, σ_v²)   AR(1) model
ε_t is uncorrelated with x_t
v_t uncorrelated with ε_{t-1}, y_{t-1}, x_{t-1}, ε_{t-2}, y_{t-2}, x_{t-2}, ...

We want to find a vector of IVs for the set of regressors (x_t, y_{t-1})'.

For instruments, we can use the current and lagged values of the
exogenous regressor.
E.g., we can use z_t = (x_t, x_{t-1})' as instruments for the set of
regressors (x_t, y_{t-1})'. Note: y_{t-1} is a function of x_{t-1} (and y_{t-2}).
We could also use more lagged values: z_t = (x_t, x_{t-1}, x_{t-2}, ...)'.
Large sample: higher correlation when including more; finite sample:
not a good idea to include too many (bias).
In PS8-extra Q2, we consider the same model with MA(1) errors.
Dr M. Schafgans (LSE) EC221: Endogeneity 37 / 50
IV in the presence of simultaneity
Overidenti…ed / Exact identi…ed / Under identi…ed

Dr M. Schafgans (LSE) EC221: Endogeneity 38 / 50


Demand: q_t = α_1 + α_2 p_t + α_3 m_{1t} + α_4 m_{2t} + u_{1t}
Supply:  q_t = β_1 + β_2 p_t + β_3 c_t + u_{2t}
Cov(p_t, u_{1t}) ≠ 0 and Cov(p_t, u_{2t}) ≠ 0.

p_t is an endogenous regressor in the demand and supply equations.

Demand (just identified)
We need an instrument for p_t. We have exactly one "good" variable (valid),
correlated with p_t (relevance), which is not in the equation itself: c_t.
X_1 with rows (1, p_t, m_{1t}, m_{2t});  Z_1 with rows (1, c_t, m_{1t}, m_{2t});
Z_1'X_1 invertible!

Supply (over identified)

We need an instrument for p_t. We have more "good" variables (valid),
correlated with p_t (relevance), which are not in the equation itself: m_{1t}, m_{2t}.
X_2 with rows (1, p_t, c_t);  Z_2 with rows (1, m_{1t}, m_{2t}, c_t);
Z_2'X_2 not invertible (not square)! VN8.13

Dr M. Schafgans (LSE) EC221: Endogeneity 39 / 50


Demand: q_t = α_1 + α_2 p_t + α_3 m_{1t} + α_4 m_{2t} + u_{1t}
Supply:  q_t = β_1 + β_2 p_t + β_3 c_t + u_{2t}
Cov(p_t, u_{1t}) ≠ 0 and Cov(p_t, u_{2t}) ≠ 0.

To estimate the Supply equation (overidentified) we should use β̂_2SLS = β̂_OptIV.

2SLS: β̂_2SLS = (X̂_2'X̂_2)^{-1} X̂_2'q
Step 1: Obtain p̂_t by estimating the RF

p_t = π_11 + π_12 m_{1t} + π_13 m_{2t} + π_14 c_t + v_{1t}

Let Z have rows (1, m_{1t}, m_{2t}, c_t):  p̂ = Zπ̂_1 = P_Z p
Step 2: Estimate the equation

q_t = β_1 + β_2 p̂_t + β_3 c_t + e_{2t}

Let X_2 have rows (1, p_t, c_t) and X̂_2 have rows (1, p̂_t, c_t).

Optimal IV: β̂_OptIV = (X̂_2'X_2)^{-1} X̂_2'q

Dr M. Schafgans (LSE) EC221: Endogeneity 40 / 50


Simultaneity (Demand/Supply Model)
Suitable Instruments? Exact/Over or Under Identified - SUMMARY

Demand: q_t = α_1 + α_2 p_t + α_3 m_{1t} + α_4 m_{2t} + u_{1t}
Supply:  q_t = β_1 + β_2 p_t + β_3 c_t + u_{2t}
Cov(p_t, u_{1t}) ≠ 0 and Cov(p_t, u_{2t}) ≠ 0.

p_t is an endogenous regressor in the demand and supply equations.

Demand: Just identified
Here α̂_IV = (Z_1'X_1)^{-1}Z_1'q = (X̂_1'X̂_1)^{-1}X̂_1'q = α̂_2SLS with X̂_1 = P_Z X_1.
We use the supply shifter (c_t) to estimate the demand equation.
Supply: Over identified
Here β̂_2SLS = (X̂_2'X̂_2)^{-1}X̂_2'q = (Z_2^opt' X_2)^{-1} Z_2^opt' q with Z_2^opt = P_Z X_2.
We use the demand shifters (m_{1t}, m_{2t}) to estimate the supply equation.

Under identified - the setting where there are no (or insufficient) instruments.
In this case we have a lack of identification.
Dr M. Schafgans (LSE) EC221: Endogeneity 41 / 50
Under Identi…cation vs Exact Identi…cation
A graphical explanation
VN8.14-15

Dr M. Schafgans (LSE) EC221: Endogeneity 42 / 50


Testing for Endogeneity
(Wooldridge, Ch. 15.5)

Dr M. Schafgans (LSE) EC221: Endogeneity 43 / 50


Testing for Endogeneity I

Let us consider tests that can be used to detect whether y_2 and u_1
are uncorrelated in

y_1 = α_0 + α_1 y_2 + α_2 x_1 + u_1,  E(u_1) = E(x_1 u_1) = 0

We want to test:

H0: Cov(y_2, u_1) = 0; y_2 is exogenous
H1: Cov(y_2, u_1) ≠ 0; y_2 is endogenous

The Hausman test is based on testing whether the discrepancy between
the OLS and IV parameter estimates is significant, d = α̂_OLS - α̂_IV:
VN8.16

d' [Var-hat(d)]^{-1} d ~a χ²_{dim(α)}

Large differences provide evidence of endogeneity.

Dr M. Schafgans (LSE) EC221: Endogeneity 44 / 50


Testing for Endogeneity II

A simple regression-based test is given next.

y_1 = α_0 + α_1 y_2 + α_2 x_1 + u_1,  E(u_1) = 0,  E(x_1 u_1) = 0
Let z_1 and z_2 be instruments for y_2.

Obtain the fitted values ŷ_2 (Step 1 of 2SLS):

y_2 = π_0 + π_1 z_1 + π_2 z_2 + π_3 x_1 + v_2

ŷ_2 is a "good" regressor where the endogeneity in y_2 is "washed out".

The "bad" component of y_2 must be contained in v̂_2 = y_2 - ŷ_2.

Add v̂_2 to our original regression:

y_1 = α_0 + α_1 y_2 + α_2 x_1 + δ v̂_2 + error

Dr M. Schafgans (LSE) EC221: Endogeneity 45 / 50


Testing for Endogeneity III

y_1 = α_0 + α_1 y_2 + α_2 x_1 + δ v̂_2 + error

To test the endogeneity of y_2, we test H0: δ = 0.

We simply use δ̂/SE(δ̂) and use critical values of N(0, 1).

If we reject H0:
To estimate the α's we need to "control" for the endogeneity of y_2 by
including v̂_2.
The resulting parameter estimates are the same as those obtained by 2SLS.
If we do not reject H0:
To estimate the α's we can estimate our original model without v̂_2.
We can obtain parameter estimates simply by using OLS.
(A numerical sketch of this regression-based test follows below.)

Dr M. Schafgans (LSE) EC221: Endogeneity 46 / 50
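A minimal sketch of the regression-based (control function) test (Python/NumPy; the design below builds in endogeneity on purpose, and all parameter values are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(11)
n = 4000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
x1 = rng.normal(size=n)                                         # exogenous regressor
u1 = rng.normal(size=n)
y2 = 0.5*z1 + 0.5*z2 + 0.3*x1 + 0.7*u1 + rng.normal(size=n)     # endogenous by construction
y1 = 1.0 + 1.5*y2 + 0.5*x1 + u1

Z = np.column_stack([np.ones(n), z1, z2, x1])                   # first stage
v2hat = y2 - Z @ np.linalg.lstsq(Z, y2, rcond=None)[0]

W = np.column_stack([np.ones(n), y2, x1, v2hat])                # augmented regression
g = np.linalg.lstsq(W, y1, rcond=None)[0]
res = y1 - W @ g
s2 = (res @ res) / (n - W.shape[1])
se = np.sqrt(s2 * np.linalg.inv(W.T @ W)[3, 3])
print(g[3] / se)                                                # t statistic for H0: delta = 0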


Testing for Endogeneity IV

EXAMPLE: Using College Proximity as an IV for education
(CARD.DTA). Consider a model

lwage = β_0 + β_1 educ + β_2 exper + β_3 exper² + ... + u

In Wooldridge, both OLS and IV (using nearc4 as instrument) results are
presented in Table 15.1.

While the return to education is estimated at 0.075 (.003) by OLS, it
equals 0.132 (.055) by IV (2SLS).

Are the differences statistically significant? If so, this suggests evidence of the
endogeneity problem (note the large SEs of IV - weak instrument).
Dr M. Schafgans (LSE) EC221: Endogeneity 47 / 50


Testing for Endogeneity V

Here, implementation of the regression-based test does not give
evidence of endogeneity:

[Stata regression output shown on the slide.]
Dr M. Schafgans (LSE) EC221: Endogeneity 48 / 50
Testing for Endogeneity VI

If we repeat the exercise using both nearc4 and nearc2 as
instruments, the evidence of endogeneity becomes stronger:

While not significant at the 5% level, it is significant at the 10% level.

The return to education now exhibits a larger difference (OLS versus
2SLS): it equals 0.157 (.052) by 2SLS against 0.075 (.003) by OLS.
Dr M. Schafgans (LSE) EC221: Endogeneity 49 / 50
Testing for Endogeneity VII

Including the first-stage residuals "controls" for the endogeneity of
educ.
The parameter estimates for the α's obtained after adding v̂_2 are
identical to those we obtain using 2SLS.

See the EC221 summer exam 2019 for a question that addresses this point.
Dr M. Schafgans (LSE) EC221: Endogeneity 50 / 50
VN8.1
VN8.2
VN8.3
VN8.4
VN8.5
VN8.6
VN8.7
VN8.8
VN8.9
VN8.10
VN8.11
VN8.12
VN8.13
VN8.14
VN8.15
VN8.16
EC221: Principles of Econometrics
Maximum Likelihood Estimation and
Trinity of Classical Testing

Dr M. Schafgans

London School of Economics

Lent 2022

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 1 / 57


Maximum Likelihood Estimation

The Maximum Likelihood Estimator (MLE) is an optimization


method which assumes that the joint density function of the
observations is known up to a set of parameters, θ.

The intuition of the ML principle is as follows.

From the (assumed) distribution of the data fyi gni=1 : f (y ; θ )


determine the likelihood of obtaining the speci…c sample we
happen to observe as a function of the unknown parameters θ
characterizing the distribution.

Choose as our MLE, θ̂ MLE , those values for the unknown parameters
that give us the highest likelihood. VN9.1

ML provides a means of choosing an asymptotically e¢ cient estimator


for our parameter(s).
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 2 / 57
MLE: The Likelihood Function

The likelihood function, defined as a function of the unknown
parameter vector θ, is the joint density.
Assuming our random variables y_1, y_2, ..., y_n are distributed
independently:

L(θ; y_1, ..., y_n) = f(y_1, ..., y_n; θ) = Π_{i=1}^n f(y_i; θ)   (by independence)

It is typically more convenient to work with the log-likelihood function:

log L(θ; y_1, ..., y_n) = ln f(y_1, ..., y_n; θ) = Σ_{i=1}^n ln f(y_i; θ)
                        = Σ_{i=1}^n log L_i(θ)   (summations!)

log L_i(θ) is the individual contribution to the log-likelihood.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 3 / 57


MLE: Estimator

The method of maximum likelihood defines as the ML estimator
θ̂_MLE that value of θ that maximizes the (logarithm of the) likelihood
function, i.e.,

θ̂_MLE = arg max_{θ∈Θ} log L(θ; y_1, ..., y_n) ≡ arg max_{θ∈Θ} log L(θ; y)

The first order conditions of this problem imply that

∂ log L(θ; y)/∂θ = 0 at θ = θ̂_MLE

To ensure it is a maximum, we require

∂² log L(θ; y)/∂θ∂θ' at θ = θ̂_MLE to be a negative definite matrix.

Only in special cases can we analytically determine the ML estimator.
Typically, numerical optimization is required (a small sketch follows below). VN9.2
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 4 / 57
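A minimal sketch of numerical maximization of a log-likelihood (Python/SciPy; the normal sample, the (μ, log σ) parametrization and all values are assumptions for illustration). Minimizing the negative log-likelihood with BFGS also returns an inverse-Hessian approximation, which plays the role of (-H(θ̂))^{-1} in this parametrization.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(12)
y = rng.normal(loc=2.0, scale=1.5, size=500)       # illustrative sample

def negloglik(theta):                               # theta = (mu, log_sigma)
    mu, logsig = theta
    sig2 = np.exp(2 * logsig)
    return 0.5 * np.sum(np.log(2*np.pi*sig2) + (y - mu)**2 / sig2)

res = minimize(negloglik, x0=np.array([0.0, 0.0]))  # default BFGS
mu_hat, sig_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sig_hat)                              # close to the sample mean and ML std dev
print(np.sqrt(np.diag(res.hess_inv)))               # BFGS inverse-Hessian approximation (rough SEs)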
MLE: Common terminology
The score vector (or gradient) s is the (k x 1) vector of first derivatives
of log L(θ; y) with respect to the vector θ:

s(θ) = ∂ log L/∂θ = Σ_{i=1}^n ∂ log L_i(θ)/∂θ = Σ_{i=1}^n s_i(θ)

The Hessian matrix H is the (k x k) matrix of second derivatives of
log L(θ; y) with respect to the vector θ:

H(θ) = Σ_{i=1}^n ∂² log L_i(θ)/∂θ∂θ' = Σ_{i=1}^n H_i(θ)

See also PS1 Q4.

The Information matrix I(θ) is the (k x k) matrix obtained by taking
minus the expectation of the Hessian matrix:

I(θ) = -E(H(θ))
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 5 / 57


Maximum Likelihood Estimator
Properties (Asymptotic)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 6 / 57


Properties MLE estimator
If we assume that certain MLE regularity conditions are met, the
MLE will have the following asymptotic properties:

Consistent. It may, however, be biased.

Asymptotically Normal:
θ̂ ~a N(θ, I(θ)^{-1}),  where I(θ) = -E(H(θ))

Important as this ensures that we can do hypothesis testing.

Particularly noteworthy as not all ML estimators have an explicit form.

Asymptotically Efficient:
The inverse of the information matrix, I(θ)^{-1}, provides a lower bound
on the asymptotic covariance matrix for any consistent, asymptotically
normal estimator of θ (Cramér-Rao lower bound).
Important as this ensures that our estimates are precise and tests using
them powerful.
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 7 / 57
MLE: Estimation of the Asymptotic Variance of the MLE

θ̂ ~a N(θ, I(θ)^{-1}),  where I(θ) = -E(H(θ))

Estimate of AVar(θ̂) = [-Σ_{i=1}^n E(∂² log L_i(θ)/∂θ∂θ')]^{-1} = (-E[H(θ)])^{-1}

EITHER evaluate (minus) the second derivatives matrix of the log-likelihood
function at the MLE estimates:

AVar-hat(θ̂) = [-Σ_{i=1}^n ∂² log L_i(θ)/∂θ∂θ' |_{θ̂}]^{-1} = (-H(θ̂))^{-1}

OR evaluate the outer product of the first derivatives of the
log-likelihood function at the MLE estimates:

AVar-hat(θ̂) = [Σ_{i=1}^n (∂ log L_i(θ)/∂θ)(∂ log L_i(θ)/∂θ)' |_{θ̂}]^{-1} = [Σ_{i=1}^n s_i(θ̂)s_i(θ̂)']^{-1}

The first estimator has better properties in small samples.


Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 8 / 57
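To illustrate the two estimators, a sketch continuing the hypothetical Exponential(θ) example above, where si(θ) = 1/θ − yi and ∂² log Li(θ)/∂θ² = −1/θ², so both formulas can be evaluated directly at θ̂ = 1/ȳ (illustrative numbers only):

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.exponential(scale=1 / 2.0, size=500)
    theta_hat = 1 / y.mean()                      # analytic MLE in the Exponential(theta) model

    # (1) inverse of minus the Hessian: H(theta_hat) = -n / theta_hat**2
    avar_hessian = 1 / (len(y) / theta_hat**2)

    # (2) outer product of the individual scores: s_i(theta_hat) = 1/theta_hat - y_i
    scores = 1 / theta_hat - y
    avar_opg = 1 / np.sum(scores**2)

    print(np.sqrt(avar_hessian), np.sqrt(avar_opg))   # two standard-error estimates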
Maximum Likelihood Estimator
Two simple (scalar parameter) examples

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 9 / 57


Maximum Likelihood Estimator: Binary Choice

Suppose you have a random sample Y1, ..., Yn drawn from the p.d.f.

    f(y) = θ^y (1−θ)^{1−y},   y = 0, 1   (and 0 otherwise)

Find the MLE of θ and its asymptotic distribution!


Y is a Bernoulli random variable (binary), where

    y = 1 with probability θ =: f(1)
    y = 0 with probability 1 − θ =: f(0)

Note: E(Y) = 1 · Pr(Y = 1) + 0 · Pr(Y = 0) = θ!


This may already suggest what a suitable estimator for θ is.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 10 / 57


Maximum Likelihood Estimator: Binary Choice

Step 1: Construct the joint p.d.f.

    f(Y1, ..., Yn; θ) = ∏_{i=1}^n f(Yi)    (independence)
                      = θ^{Y1}(1−θ)^{1−Y1} · θ^{Y2}(1−θ)^{1−Y2} · ... · θ^{Yn}(1−θ)^{1−Yn}
                      = θ^{∑_{i=1}^n Yi} (1−θ)^{n − ∑_{i=1}^n Yi} = L(θ; Y)
Step 2: Write down the log-likelihood VN9.3-4

    log L(θ; Y) = ∑_{i=1}^n Yi · log θ + (n − ∑_{i=1}^n Yi) · log(1−θ)

Step 3: Obtain θ̂_MLE

    ∂ log L/∂θ = (∑_{i=1}^n Yi)/θ − (n − ∑_{i=1}^n Yi)/(1−θ)

θ̂_MLE satisfies

    (∑_{i=1}^n Yi)/θ̂_MLE − (n − ∑_{i=1}^n Yi)/(1 − θ̂_MLE) = 0  ⇒  θ̂_MLE = (∑_{i=1}^n Yi)/n = Ȳ

Ȳ is an estimator (random variable); ȳ its realisation (estimate).
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 11 / 57
Maximum Likelihood Estimator: Binary Choice

θ̂_MLE = Ȳ is consistent and unbiased:

    plim (1/n) ∑_{i=1}^n Yi = E(Yi) = θ   (law of large numbers)
    E(θ̂_MLE) = (1/n) ∑_{i=1}^n E(Yi) = (1/n) · nθ = θ

θ̂_MLE is asymptotically normal:

    θ̂_MLE ∼a N(θ, θ(1−θ)/n)

To show this: obtain the information matrix I(θ) = −E(∂² ln L/∂θ²). VN9.5

Note: here, it is easy to obtain SE(θ̂_MLE) = √(θ̂_MLE(1 − θ̂_MLE)/n)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 12 / 57


Maximum Likelihood Estimator: Count Data

Suppose you have a random sample Y1, ..., Yn drawn from the p.d.f.

    f(y) = exp(−λ) λ^y / y!,   y = 0, 1, 2, 3, ...

with, e.g., 5! = 1 · 2 · 3 · 4 · 5.

Y is a Poisson random variable (counts).
Note: E(Y) = Var(Y) = λ

Find the MLE of λ and discuss its asymptotic distribution! (PS9, Q2)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 13 / 57


Maximum Likelihood Estimator
Linear Regression Model and Normality

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 14 / 57


Maximum Likelihood Estimator
Multiple Linear Regression Model

Let yi = xi′β + ui,   ui|xi i.i.d. N(0, σ²),   i = 1, ..., n

    log L(β, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) (y − Xβ)′(y − Xβ).   VN9.6

Our ML estimators need to satisfy ∂ log L(θ)/∂θ |_{θ=θ̂} = 0, where

    ∂ log L/∂β = −(1/(2σ²)) (−2X′y + 2X′Xβ)
    ∂ log L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) (y − Xβ)′(y − Xβ)

    β̂_MLE = (X′X)^{-1} X′y
    σ̂²_MLE = û′û/n   with û = y − Xβ̂_MLE

To allow us to obtain the asymptotic distribution of (β̂_MLE, σ̂²_MLE)
we need the Hessian (matrix of second order derivatives).
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 15 / 57
Maximum Likelihood Estimator
Multiple Linear Regression Model

The required Hessian here is

    I(β, σ²) = −E [ ∂² log L/∂β∂β′    ∂² log L/∂β∂σ²  ]  =  [ −E(∂² log L/∂β∂β′)    −E(∂² log L/∂β∂σ²)  ]
                   [ ∂² log L/∂σ²∂β′   ∂² log L/(∂σ²)² ]     [ −E(∂² log L/∂σ²∂β′)   −E(∂² log L/(∂σ²)²) ]

You can show:

    ∂² log L/∂β∂β′ = −X′X/σ²
    ∂² log L/(∂σ²)² = n/(2σ⁴) − (y − Xβ)′(y − Xβ)/σ⁶ = n/(2σ⁴) − u′u/σ⁶
    ∂² log L/∂β∂σ² = −(X′y − X′Xβ)/σ⁴ = −X′u/σ⁴

And,

    −E(∂² log L/∂β∂β′) = X′X/σ²    (for simplicity we assume fixed regressors)
    −E(∂² log L/(∂σ²)²) = n/(2σ⁴)    (⇐ E(u′u) = E(∑_{i=1}^n ui²) = nσ²)
    −E(∂² log L/∂β∂σ²) = 0    (⇐ E(X′u) = X′E(u) = 0)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 16 / 57


Maximum Likelihood Estimator
Multiple Linear Regression Model

The information matrix becomes:

    I(β, σ²) = [ X′X/σ²   0       ]
               [ 0        n/(2σ⁴) ]

To obtain the asymptotic variance, take the inverse (easy here with block
diagonality):

    I(β, σ²)^{-1} = [ (X′X/σ²)^{-1}   0              ]  =  [ σ²(X′X)^{-1}   0      ]
                    [ 0               (n/(2σ⁴))^{-1} ]     [ 0              2σ⁴/n  ]

This gives:

    [ β̂_MLE  ]  ∼a  N( [ β  ] , [ σ²(X′X)^{-1}   0      ] )
    [ σ̂²_MLE ]         [ σ² ]   [ 0              2σ⁴/n  ]

Some observations: zero covariance + normality ⇒ independence.

    β̂_MLE ∼ N(β, σ²(X′X)^{-1}) asymptotically. In fact this is not only true
    asymptotically, but true for any n!
    σ̂²_MLE ∼ N(σ², 2σ⁴/n) asymptotically, or √n(σ̂²_MLE − σ²) →d N(0, 2σ⁴)
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 17 / 57
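A quick numerical check of these results (a sketch with simulated data; the design and numbers are assumptions for illustration only): under normality β̂_MLE coincides with OLS, while σ̂²_MLE divides the residual sum of squares by n rather than n − k.

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 200, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    beta_true = np.array([1.0, 0.5, -2.0])
    y = X @ beta_true + rng.normal(scale=1.5, size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # beta_MLE = OLS estimator
    u_hat = y - X @ beta_hat
    sigma2_mle = u_hat @ u_hat / n                   # divides by n (biased in finite samples)
    sigma2_unbiased = u_hat @ u_hat / (n - k)        # the usual unbiased estimator

    avar_beta = sigma2_mle * np.linalg.inv(X.T @ X)  # estimate of sigma^2 (X'X)^{-1}
    print(beta_hat, sigma2_mle, sigma2_unbiased)
    print(np.sqrt(np.diag(avar_beta)))               # asymptotic standard errors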
Trinity of Classical Tests
Wald, Likelihood Ratio and Lagrange Multiplier

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 18 / 57


Trinity of Classical Testing I

On the basis of the MLE, a large number of alternative tests can be
constructed. The trinity of classical tests is based upon one of three
principles:
Wald
Likelihood Ratio
Lagrange Multiplier
Although any of the three principles can be used to construct a test for
a given hypothesis, each of these tests has its own merits.

Suppose we are interested in testing one or more linear restrictions of


the parameter vector θ = (θ1, ..., θk)′, say

    Rθ = c.

The three test principles can be summarized as follows:

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 19 / 57


Trinity of Classical Testing II
Wald test (W).
Using an estimate of θ (the unrestricted MLE) we need to determine
whether the discrepancy Rθ̂ − c is close to zero.
This is the idea that underlies the well-known t-test.

Likelihood Ratio Test (LR).


Determine whether imposing a restriction on the MLE leads to a significant
loss of fit.
Compare the fit of the unrestricted MLE, log L(θ̂), with the fit of the
restricted MLE, log L(θ̃), where Rθ̃ = c. Is the difference
log L(θ̂) − log L(θ̃) significantly different from zero?
Similar to the idea of comparing URSS and RRSS when using OLS.

Lagrange Multiplier Test (LM).


Using the restricted MLE (giving θ̃), verify whether ∂ log L(θ)/∂θ |_{θ̃} is
close to zero. Equivalent to testing whether the Lagrange multipliers
are significantly different from zero.
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 20 / 57
Trinity of Classical Testing III

The three tests look at different aspects of the likelihood function
(distance, height, slope). VN9.7

Asymptotically the tests are equivalent. Nevertheless, they can
behave differently in a small sample.
The choice among them is typically made on the basis of ease of
computations.
If the restricted and unrestricted estimator are simple to compute, the
likelihood ratio test is easy to apply.

The Wald test only requires the unrestricted estimator, and the
Lagrange multiplier test only requires the restricted estimator.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 21 / 57


Trinity of Classical Testing
Wald Test I

Uses only the unrestricted MLE estimator.

Standard output for MLE results provides test statistics for βj = 0:

    z = β̂j / SE(β̂j) →d N(0, 1) under H0

The standard errors are obtained from the estimated variance-covariance
matrix based either on the first or second order derivatives
of the log-likelihood.
Since MLE estimators are typically only asymptotically normal, this is referred
to as the z-test (asymptotic) rather than the t-test (finite sample).

The test statistic for joint linear hypotheses, H0: Rβ = c against
H1: Rβ ≠ c, with r equalling the number of restrictions:

    W = d′(Vâr(d))^{-1} d →d χ²_r under H0, with d = Rβ̂ − c
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 22 / 57
Trinity of Classical Testing
Wald Test II

Derivation of the Wald test: similar to before.

Main differences:
  The use of MLE requires us to use the result that β̂_MLE ∼a N(β, V), with
  AVar(β̂_MLE) = V given by the inverse of the information matrix.

Our test statistic will be valid only asymptotically

Assumptions underlying the tests: MLE regularity conditions.

Let us look at some details for the Wald test of H0 : R β = c VN9.8

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 23 / 57


Trinity of Classical Testing
Likelihood Ratio Test

Uses both unrestricted and restricted MLE estimators.

The test statistic is given by

    2(log LU − log LR) →d χ²_r under H0

log LU is the value of the log likelihood function evaluated at the


unrestricted MLE estimator
log LR is the value of the log likelihood function evaluated at the
restricted MLE estimator
r equals the number of restrictions.

Intuition: tries to assess whether the loss in fit from imposing a
restriction is statistically significant.
If H0 is true, we would expect the likelihood ratio LR/LU ≈ 1 (i.e.,
log LU − log LR ≈ 0).
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 24 / 57
Trinity of Classical Testing
Lagrange Multiplier Test

Uses only the restricted MLE estimator.


    max log L(θ)  s.t.  Rθ = c
    Lagrangian: log L(θ) + λ′(Rθ − c)

Widely used regression diagnostic procedure.
If the shadow prices are high, we would like to reject the restrictions.
VN9.9

The LM test statistic tests whether λ̃ ≈ 0:

    LM = λ̃′ (Vâr(λ̃))^{-1} λ̃ →d χ²_r under H0

with r the number of restrictions.

Or equivalently, whether s(θ̃) = ∑_{i=1}^n ∂ log Li(θ̃)/∂θ ≈ 0 (Score test):

    LM = s(θ̃)′ [Vâr(s(θ̃))]^{-1} s(θ̃) →d χ²_r under H0

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 25 / 57


Trinity of Classical Testing
Lagrange Multiplier Test

In fact, we can rewrite the latter as

    LM = [∑_{i=1}^n si(θ̃)]′ [∑_{i=1}^n si(θ̃)si(θ̃)′]^{-1} [∑_{i=1}^n si(θ̃)]

We can obtain this test statistic by computing nR² of the following
auxiliary regression:

    1i = si(θ̃)′γ + vi,   i = 1, ..., n

where si = ∂ log Li(θ̃)/∂θ is the vector of individual scores. We should
reject H0 if nR² is too large,

    LM = nR² →d χ²_r under H0. VN9.10-11

We will consider empirical implementation of all tests later.


Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 26 / 57
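A sketch of the nR² computation, assuming one already has an (n × k) array S whose rows are the individual scores si(θ̃) evaluated at the restricted estimates; the helper functions below are hypothetical and only illustrate the algebra (note that the relevant R² is the uncentred one, since the regressand is identically 1):

    import numpy as np

    def lm_statistic(S):
        """LM = n * R^2 from regressing a vector of ones on the individual scores S (n x k)."""
        n = S.shape[0]
        ones = np.ones(n)
        gamma_hat, *_ = np.linalg.lstsq(S, ones, rcond=None)
        resid = ones - S @ gamma_hat
        # uncentred R^2, since the regressand has no variation
        r2 = 1 - resid @ resid / (ones @ ones)
        return n * r2

    def lm_statistic_direct(S):
        # equivalently: LM = s'(sum_i s_i s_i')^{-1} s, with s the column sums of S
        s = S.sum(axis=0)
        return s @ np.linalg.solve(S.T @ S, s)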
Empirical examples of MLE

Next we will discuss two empirically relevant examples where we will


use Maximum Likelihood Estimation. VN9.12

We will discuss the estimation and interpretation of results from
these models, and explain the drawbacks of using OLS in these
settings.

We will discuss the implementation of the Trinity of Tests in these


settings.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 27 / 57


Empirical Application: Binary Choice
Married Women’s
Labor Market participation
(Mroz, 1987)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 28 / 57


Binary Choice Model
In binary choice models, the dependent variable is a binary variable
that reflects a choice.
Examples: decision to join the labour market, decision to join a
training program, and decision to join a union.
Let us denote y = 1 when the answer is "yes" and y = 0 when the
answer is "no".

In binary choice models, we are interested in explaining


Pr(y = 1|x): the probability of participation given characteristics
∂ Pr(y = 1|x)/∂xk: the marginal effect that an explanatory variable
has on the probability of participation, ceteris paribus

With y being a discrete random variable, we note

E(y|x) = 1 · Pr(y = 1|x) + 0 · Pr(y = 0|x) = Pr(y = 1|x)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 29 / 57


Binary Choice Model (OLS): Linear Probability Model

Using OLS when our dependent variable is binary assumes we have a


Linear Probability Model

    yi = xi′β + εi,   E(εi|xi) = 0   (Linear Probability Model)

where yi ∈ {0, 1}.

With E(εi|xi) = 0,

    E(yi|xi) ≡ Pr(yi = 1|xi) = xi′β   is linear in the parameters!

The interpretation of βj is nice (easy): "How does the j-th explanatory
variable affect the probability that y = 1, ceteris paribus?"
The marginal effects are constant.

There are various drawbacks to using the LPM

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 30 / 57


Binary Choice Model (OLS): Linear Probability Model

Drawbacks of using OLS when yi = 0, 1:

xi′β̂ is not necessarily constrained to lie between 0 and 1. VN9.13
  [We should interpret xi′β as Pr(yi = 1|xi) ≡ E(yi|xi)]
  Related to the fact that the marginal effects are constant in the LPM.

The error term is highly non-normal (inference).
  For given xi, εi = yi − xi′β can take only two values, 1 − xi′β or −xi′β,
  rather than the full range of values assumed in the linear regression
  setting.

The error term is heteroskedastic (inefficiency/validity of SEs).
  Var(yi|xi) = (xi′β)(1 − xi′β) is clearly not constant!
  Note: Var(y) = E(y²) − [E(y)]² and E(y) = E(y²) = Pr(y = 1)
  Var(yi|xi) = Var(xi′β + εi|xi) = Var(εi|xi)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 31 / 57


Binary Choice Model (OLS): Linear Probability Model

We can obtain the following estimated LPM (Mroz): n = 753, R² = .264

    inlf-hat = .586 − .0034 nwifeinc + .038 educ + .039 exper
              (.152)  (.0015)          (.007)      (.006)
               − .00060 exper² − .016 age − .262 kidslt6 + .013 kidsge6
                (.00019)         (.002)     (.032)         (.014)

Heteroskedasticity-robust standard errors in brackets.

Each year of education increases the probability of inlf by .038, or


3.8 percentage points (do not say 3.8%).
Having young children has a very large negative effect: the probability of
being in the labor force falls by .262 for each young child.
It is unwise to extrapolate and consider the effect of large changes.

Past workforce experience has a positive but diminishing effect.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 32 / 57
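Outside Stata, the same LPM could be estimated as in the sketch below (Python/statsmodels is an assumption here, as is the file path to MROZ.DTA); it also checks how many fitted probabilities fall outside [0, 1]:

    import pandas as pd
    import statsmodels.formula.api as smf

    mroz = pd.read_stata("MROZ.DTA")   # hypothetical path to the Mroz data

    lpm = smf.ols("inlf ~ nwifeinc + educ + exper + I(exper**2) + age + kidslt6 + kidsge6",
                  data=mroz).fit(cov_type="HC1")   # heteroskedasticity-robust SEs
    print(lpm.summary())

    # fitted probabilities are not guaranteed to lie between 0 and 1
    print((lpm.fittedvalues < 0).sum(), (lpm.fittedvalues > 1).sum())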


Binary Choice Model - Nonlinear!

To ensure predicted values will lie between 0 and 1, we may decide to model

    P(y = 1|x) = F(x′β)

using some function F that takes values between zero and one.
A natural choice for F(·) is to use a cumulative distribution function. VN9.14

The leading cases are

    F(z) = Logistic CDF = Λ(z) = exp(z)/[1 + exp(z)]   (logit)
    F(z) = N(0,1) CDF = Φ(z) = ∫_{−∞}^{z} (1/√(2π)) exp(−½u²) du   (probit)

When z = x′β is large, the probability of y = 1 is close to one.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 33 / 57


Binary Choice Model - Nonlinear!

    P(y = 1|x) = F(x′β)

We could consider using the non-linear regression model:

    yi = F(xi′β) + εi

Using Non-Linear Least Squares:

    β̂_NLLS = arg min_b ∑_{i=1}^n (yi − F(xi′b))².   VN9.15

The problem of heteroskedasticity remains.

Instead, we propose to use MLE.
Once we define Pr(yi = 1|xi) = F(xi′β), we can fully characterize the
joint density of the data.
MLE will provide the asymptotically efficient parameter estimates.
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 34 / 57
MLE: Binary Choice Cont’d I
yi : Binary variable that indicates whether an individual works for
wages.
yi is a Bernoulli random variable; recall

    f(yi) = θ^{yi} (1−θ)^{1−yi},   θ = Pr(yi = 1)   (and 0 otherwise)

As we expect that Pr(yi = 1) differs from individual to individual,
θ should depend on i!

    θi = Pr(yi = 1|xi)

We will specify a relation between θi and his/her characteristics, xi,
using a CDF as suggested before.

Common choices: θi = Φ(xi′β) (probit, with Φ the CDF of a N(0,1) r.v.)
or θi = exp(xi′β)/[1 + exp(xi′β)] (logit, the CDF of a logistic r.v.)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 35 / 57


MLE: Binary Choice Cont’d II

Let us specify θi = Φ(xi′β), that is, we consider the Probit model.

Our likelihood function is defined by the (conditional) joint density of
the data:

    L(β) = f(y1, ..., yn|x1, ..., xn) = ∏_{i=1}^n f(yi|xi)   (by independence)
         = ∏_{i=1}^n Φ(xi′β)^{yi} [1 − Φ(xi′β)]^{1−yi}

Taking logs gives us the following log-likelihood function:

    log L(β) = ∑_{i=1}^n [ yi log Φ(xi′β) + (1 − yi) log(1 − Φ(xi′β)) ]

where Φ(xi′β) = Pr(yi = 1|xi) and 1 − Φ(xi′β) = Pr(yi = 0|xi).

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 36 / 57


MLE: Binary Choice Cont’d III
    log L(β) = ∑_{i=1}^n [ yi log Φ(xi′β) + (1 − yi) log(1 − Φ(xi′β)) ]

β̂_MLE: ∂ log L(β)/∂β |_{β̂_MLE} = 0   (yields no explicit formula for β̂_MLE)

The FOC (see PS9, Q3) can be written as

    ∂ ln L(β)/∂β |_{β̂_ML} = ∑_{i=1}^n ε̂ᵢᴳ xi = 0

Note, regarding the use of the chain rule:

    ∂[y ln Φ(x′β)]/∂βk = (∂[y ln Φ(x′β)]/∂Φ(x′β)) · (∂Φ(x′β)/∂(x′β)) · (∂(x′β)/∂βk)
                       = (y/Φ(x′β)) · φ(x′β) · xk

    ε̂ᵢᴳ = [yi − Φ(xi′β̂_ML)] φ(xi′β̂_ML) / { Φ(xi′β̂_ML)[1 − Φ(xi′β̂_ML)] } : so-called generalized residuals
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 37 / 57
MLE: Binary Choice Cont’d IV
    log L(β) = ∑_{i=1}^n [ yi log Φ(xi′β) + (1 − yi) log(1 − Φ(xi′β)) ]

Intuition of MLE: choose β̂_MLE in such a way that
  Individuals with yi = 1 have a high P̂r(yi = 1|xi) = Φ(xi′β̂_MLE)
  Individuals with yi = 0 have a high P̂r(yi = 0|xi) = 1 − Φ(xi′β̂_MLE).

Numerical procedures are required to obtain β̂_MLE (the FOC have no explicit
solution).

β̂_MLE has well-known properties: consistent, asymptotically normal, etc.

    β̂_MLE ∼a N(β, I(β)^{-1}), where I(β) = −E(H(β))

To find SEs we use (−∂² log L(β)/∂β∂β′ |_{β̂_MLE})^{-1} = (−H(β̂_MLE))^{-1}
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 38 / 57
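A sketch of probit estimation by direct numerical maximization of this log-likelihood (simulated data and Python rather than Stata; a packaged probit routine would give the same estimates). The design, seed, and coefficient values are assumptions for illustration only:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    n = 1000
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta_true = np.array([0.2, 1.0, -0.5])
    y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)   # probit data-generating process

    def neg_loglik(beta, y, X):
        p = norm.cdf(X @ beta)
        p = np.clip(p, 1e-10, 1 - 1e-10)       # guard against log(0)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize(neg_loglik, x0=np.zeros(X.shape[1]), args=(y, X), method="BFGS")
    beta_hat = res.x
    # BFGS approximation to the inverse Hessian of the negative log-likelihood (rough SEs)
    se = np.sqrt(np.diag(res.hess_inv))
    print(beta_hat, se)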
Logit and Probit - Empirical example
Many software packages allow us to estimate this model directly, e.g.,
Stata: probit/logit (fast)
EXAMPLE: Married Women’s Labor Force Participation (MROZ.DTA)
The variable inlf is one if a woman worked for a wage during a certain
year, and zero if not.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 39 / 57


Empirical Application: MLE Binary Choice Cont’d III
Given our probit estimates, we can evaluate the predicted probability
that individual i works: P̂r(yi = 1|xi) = Φ(xi′β̂_MLE).

For continuous explanatory variables, the marginal effect is

    ∂ Pr(y = 1|x)/∂xj = ∂Φ(x′β)/∂xj = φ(x′β) βj

"How does the probability of working change with the j-th explanatory
variable, holding everything else constant."
Important: βj itself cannot be interpreted as a marginal effect, unlike in the
LPM. Here only the sign of βj is informative (as φ(x′β) > 0).

When x1 is a dummy explanatory variable (treatment), then

    ∆ Pr(y = 1|x)/∆x1 = Φ(β0 + β1·1 + β2 x2 + ...) − Φ(β0 + β1·0 + β2 x2 + ...)

"Effect of treatment on the probability of working, c.p.".
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 40 / 57
Empirical Application: MLE Binary Choice Cont’d IV

Unlike in the LPM, the marginal effects in our probit (logit) model are
not constant, but depend on the individual characteristics:

    ∂ Pr(y = 1|x)/∂xk = ∂Φ(x′β)/∂xk = φ(x′β) βk   (continuous variables)

Evaluate this at the means of the characteristics: φ(x̄′β̂) β̂k
  [In Stata: margins, dydx( ) atmeans. Does individual x̄ exist?]

Report the average over all individuals: (1/n) ∑_{i=1}^n φ(xi′β̂) β̂k
  [In Stata: margins, dydx( )]

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 41 / 57
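Continuing the sketch above, both versions of the marginal effects can be computed directly from the probit estimates (illustrative Python mirroring what margins, dydx() reports; beta_hat and X are assumed to come from the earlier probit sketch):

    import numpy as np
    from scipy.stats import norm

    def probit_marginal_effects(beta_hat, X):
        # average marginal effect: mean over individuals of phi(x_i'beta) * beta
        ame = norm.pdf(X @ beta_hat).mean() * beta_hat
        # marginal effect at the means of the characteristics
        x_bar = X.mean(axis=0)
        mem = norm.pdf(x_bar @ beta_hat) * beta_hat
        return ame, mem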


Empirical Application: MLE Binary Choice Cont’d VI

On average, one extra year of schooling increases the probability of
being in the labor force by 3.9 percentage points, ceteris paribus.
The marginal effect is not the same for all individuals: φ(xi′β̂) β̂_educ!

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 42 / 57


MLE: Binary Choice

Let me summarize the results for Binary Choice Models thus far.
VN9.16-18

Next, we will discuss, for our Probit/Logit model, the Wald, LR and
LM tests for, say,
    H0: β3 = 0 against H1: β3 ≠ 0.
See also PS10 Q1.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 43 / 57


MLE: Binary Choice – Trinity of Hypothesis Tests I

Wald Test:
Estimate the unrestricted Probit model, and verify whether β̂3 ≈ 0. VN9.19

Test statistic:

    z = β̂3 / SE(β̂3) ∼a N(0, 1) under H0

Recall, under suitable regularity conditions, β̂_MLE ∼a N(β, I(β)^{-1}).
Recall how to estimate Var(β̂_MLE): (−∂² log L(β)/∂β∂β′ |_{β̂_MLE})^{-1}.
The square roots of its diagonal elements give SE(β̂j).

Reject if |z| > z_{α/2}.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 44 / 57


MLE: Binary Choice – Trinity of Hypothesis Tests II

Likelihood Ratio test : VN9.20

Estimate the unrestricted Probit model (as above)


Obtain log LU

Estimate the restricted Probit model by imposing the null


Obtain log LR
If we wanted to test H0 : β3 = 1, then we could make use of Stata’s
commands
constr 1 variable3=1
probit depvar variable1 variable2 variable3..., constr(1)

Test statistic (1 restriction):

    LR = 2(log LU − log LR) →d χ²_1

Reject if LR > χ²_{1,α}

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 45 / 57


MLE: Binary Choice – Trinity of Hypothesis Tests III

Lagrange Multiplier test VN9.21

Estimate the restricted Probit model leaving out the third variable
( β3 = 0)
Let the restricted parameter estimates be:

    β̂ʳ = (β̂ʳ0, β̂ʳ1, β̂ʳ2, 0)

We need to evaluate whether the score, when evaluated at these
restricted parameters, is zero: ∂ log L/∂β |_{β̂ʳ} ≈ 0 (i.e., all four derivatives
zero!)
Test statistic:

    LM = nR²

of the auxiliary regression

    1i = ε̂ᵢᴳ γ1 + ε̂ᵢᴳ x1i γ2 + ε̂ᵢᴳ x2i γ3 + ε̂ᵢᴳ x3i γ4 + υi

Reject if LM > χ²_{1,α}

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 46 / 57


MLE - Goodness of fit I

On the right, the LR test provides us with a test indicating the
"significance of the regression": joint significance of the slopes
(comparable to the F-test in the linear regression model).

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 47 / 57


MLE - Goodness of fit II

Test of joint significance of the slopes:

To obtain log LR: perform the probit without explanatory variables.

From before, log LU = −401.30219

    LR test = 2(−401.302 − (−514.8732)) = 227.14

Since there are 7 slopes, this test statistic has an (asymptotic) χ²_7 distribution.
Clearly our explanatory variables are highly significant (p < 0.001).

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 48 / 57


LR Test – Example "Chow" Test

Extension: how can we test whether the labour market participation
for women is the same for urban women (city = 1) as it is for rural women
(city = 0)?

We want to test:

    H0: βj^rural = βj^urban for all j = 0, 1, ..., k
    H1: At least one βj^rural ≠ βj^urban

We have k + 1 restrictions we want to test (intercept + slopes).

LR test statistic:

    LR = 2(log Lur − log Lr) ∼a χ²_{k+1}

Reject H0 if LR > χ²_α(k + 1).

How do we obtain log Lr and log Lur?
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 49 / 57
LR Test – Example "Chow" Test

Restricted model
The LF participation decision is the same for urban and rural women
To obtain log Lr , we simply perform probit using all observations

    probit inlf nwifeinc educ exper expersq age kidslt6 kidsge6

    log Lr = −401.302

Unrestricted model
The LF participation decisions for rural and urban women are different.
To obtain log Lur we can run separate probit regressions for the urban
and rural samples:
    probit inlf nwifeinc educ exper expersq age kidslt6 kidsge6 if city==0
    probit inlf nwifeinc educ exper expersq age kidslt6 kidsge6 if city==1
From here we compute log Lur = log Lurban + log Lrural

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 50 / 57


Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 51 / 57
The combined sample: 484 + 269 = 753 (either urban or rural)
log LU = −255.55 − 142.727 = −398.28
From before, log LR = −401.302, so that

    LR = 2(−398.28 − (−401.302)) = 6.044 ∼a χ²_8

With the critical value at 5% equalling 15.507, we cannot reject the null.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 52 / 57
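A sketch of how this LR computation might be reproduced outside Stata (Python/statsmodels is an assumption, as are the file path and the presence of the city dummy and expersq in the data); llf denotes the maximized log-likelihood of each fitted model:

    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import chi2

    mroz = pd.read_stata("MROZ.DTA")   # hypothetical path
    formula = "inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6"

    pooled = smf.probit(formula, data=mroz).fit(disp=0)                  # restricted: same coefficients
    urban = smf.probit(formula, data=mroz[mroz.city == 1]).fit(disp=0)   # unrestricted pieces
    rural = smf.probit(formula, data=mroz[mroz.city == 0]).fit(disp=0)

    lr = 2 * (urban.llf + rural.llf - pooled.llf)
    df = len(pooled.params)                                              # k + 1 restrictions
    print(lr, chi2.sf(lr, df))                                           # statistic and p-value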


Empirical Application: Count Model
Patents, R&D and Technological Spillovers
(Cincera, 1997)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 53 / 57


MLE: Count Data Cont’d (The problem)

A count variable is a variable that takes on nonnegative integer values.


Examples: the number of times someone is arrested during a year, the number
of cigarettes smoked per day, and the number of patents applied for by a
firm during a given year.

The drawback of using OLS to estimate E(yi|xi) is, similar to the
drawback in the setting of binary choice, that E(yi|xi) should be
non-negative for all x (a linear model does not guarantee this).

It is better to model E(yi|xi) directly in a way that ensures
non-negativity, e.g., E(yi|xi) = exp(xi′β), and implement NLLS.
Weighted non-linear least squares would be more efficient, as all of the
standard distributions for count data imply heteroskedasticity.
MLE, though, is even more efficient!

The Poisson regression model for count data uses the Poisson distribution
(MLE).
Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 54 / 57
Empirical Application: MLE Count Data Cont’d I

Let yi indicate the number of patents applied for by a firm during a
given year.

    Recall: f(yi) = exp(−λ) λ^{yi} / yi!,   yi = 0, 1, 2, 3, ...

where λ = E(yi) = Var(yi).

Clearly, we can expect that E(yi) is not the same for all firms.
⇒ λ is not a constant but will depend on i!

We will specify a relation between λi and a firm's characteristics, xi.
In particular assume (this guarantees the non-negativity of λi)

    λi = E(yi|xi) = exp(xi′β).

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 55 / 57


Empirical Application: MLE Count Data Cont’d II

Our MLE procedure will now involve finding β̂_MLE.

    ln L(β) = ∑_{i=1}^n ln[ exp(−λi) λi^{yi} / yi! ]   (by independence)
            = ∑_{i=1}^n [ −λi + yi ln(λi) − ln(yi!) ]
            = ∑_{i=1}^n [ −exp(xi′β) + yi (xi′β) − ln(yi!) ]

β̂_MLE: ∂ ln L(β)/∂β |_{β̂_MLE} = 0   (yields no explicit formula for β̂_MLE)

Numerical optimization is required: the problem is again globally concave.
Can you write these FOC in the nice form ∑_{i=1}^n ε̂ᵢᴿ xi = 0, where
ε̂ᵢᴿ = yi − exp(xi′β̂_MLE)?

β̂_MLE has well-known properties: consistent, asymptotically normal, etc.
To find standard errors for β̂_MLE we use (−∂² ln L(β)/∂β∂β′ |_{β̂_MLE})^{-1}

Many software packages allow us to estimate this model directly.

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 56 / 57


Empirical Application: MLE Count Data Cont’d III

Important: we cannot interpret β directly as a marginal effect.

A useful marginal effect to consider:

    ∂E(y|x)/∂xk = ∂ exp(x′β)/∂xk = βk exp(x′β)

"How does the expected number of patents change due to the k-th
explanatory variable (say R&D expenditures), holding everything else
constant."
It is typical to evaluate this at the means of the characteristics or to report
the average over all individuals/firms.
In Stata: margins, dydx( )
Illustration: patents and R&D expenditure (see PS10, Q2)

Dr M. Schafgans (LSE) EC221: Maximum Likelihood Estimation 57 / 57
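A sketch of a Poisson regression and its marginal effects (simulated data standing in for the patents/R&D application; Python/statsmodels is an assumption, not the course software, and all numbers are illustrative):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 500
    X = sm.add_constant(rng.normal(size=(n, 2)))
    beta_true = np.array([0.5, 0.8, -0.3])
    y = rng.poisson(np.exp(X @ beta_true))          # E(y|x) = exp(x'beta)

    res = sm.Poisson(y, X).fit(disp=0)              # MLE of the Poisson regression
    print(res.params)

    # average marginal effect of regressor k: mean over i of beta_k * exp(x_i'beta_hat)
    ame = np.exp(X @ res.params).mean() * res.params
    print(ame)
    # statsmodels can also report this directly: res.get_margeff(at='overall').summary()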


