
G023: Econometrics

Jérôme Adda

[email protected]

Office # 203

I am grateful to Andrew Chesher for giving me access to his G023 course notes on which
most of these slides are based.
G023. I
Syllabus

Course Description:
This course is an intermediate econometrics course. There will be
3 hours of lectures per week and a class (sometimes in the computer
lab) each week. Previous knowledge of econometrics is assumed. By
the end of the term, you are expected to be at ease with modern
econometric techniques. The computer classes introduce you to
real-life problems, and will help you to understand the theoretical
content of the lectures. You will also learn to use a powerful and
widely used econometric software package, STATA.
Understanding these techniques will be of great help for your
thesis over the summer, and will help you in your future workplace.
For any contact or query, please send me an email or visit my
web page at:
http://www.ucl.ac.uk/~uctpjea/teaching.html
My web page contains documents which might prove useful, such as
notes, previous exams and answers.

Books:
There are several good intermediate econometrics books, but the
main reference is Wooldridge (J. Wooldridge (2003), Econometric
Analysis of Cross-Section and Panel Data, MIT Press). Other useful
books are:

• Andrew Chesher's notes, on which most of these slides are based.
• Gujarati, "Basic Econometrics", McGraw-Hill. (Introductory
  text book)
• Greene, "Econometric Analysis", Prentice Hall International,
  Inc. (Intermediate text book)

G023. I
Course Content

1. Introduction
What is econometrics? Why is it useful?
2. The linear model and Ordinary Least Squares
Model specification.
3. Hypothesis Testing
Goodness of fit, R2 . Hypothesis tests (t and F).

4. Approximate Inference
Slutsky’s Theorem; Limit Theorems. Approximate distribu-
tion of the OLS and GLS estimators.
5. Maximum Likelihood Methods
Properties; Limiting distribution; Logit and Probit; Count
data.
6. Likelihood based Hypothesis Testing
Wald and Score tests.
7. Endogeneity and Instrumental Variables
Indirect Least Squares, IV, GMM; Asymptotic properties.

G023. I
Definition and Examples

Econometrics: statistical tools applied to economic problems.

Examples: using data to:

• Test economic hypotheses.

• Establish a link between two phenomena.

• Assess the impact and effectiveness of a given policy.

• Provide an evaluation of the impact of future public policies.

Provide a qualitative but also a quantitative answer.

G023. I
Example 1: Global Warming

• Measuring the extent of global warming.

  – When did it start?
  – How large is the effect?
  – Has it increased more in the last 50 years?

• What are the causes of global warming?

  – Does carbon dioxide cause global warming?
  – Are there other determinants?
  – What is their respective importance?

• Average temperature in 50 years if nothing is done?

• Average temperature if carbon dioxide concentration is reduced
  by 10%?

G023. I
Example 1: Global Warming

Average Temperature in Central England (1700-1997)

Atmospheric Concentration of Carbon Dioxide (1700-1997)

G023. I
Causality

• We often observe that two variables are correlated.


– Examples:
∗ Individuals with higher education earn more.
∗ Parental income is correlated with child’s education.
∗ Smoking is correlated with peer smoking.
∗ Income and health are correlated.

• However this does NOT establish causal relationships.

G023. I
Causality

• If a variable Y is causally related to X, then changing X will


LEAD to a change in Y.
– For example: increasing VAT may cause a reduction in
demand.
– Correlation may not be due to a causal relationship:
∗ Part or all of the correlation may be induced by both
variables depending on some common factor, and so does
not imply causality.
∗ For example: individuals who smoke may be more
likely to be found in similar jobs. Hence, smokers are
more likely to be surrounded by smokers, which is usu-
ally taken as a sign of peer effects. The question is how
much an increase in smoking by peers results in higher
own smoking.
∗ Brighter people have more education AND earn more.
The question is how much of the increase in earnings
is caused by the increased education.

G023. I
Linear Model

• We write the linear model as:

      y = Xβ + ε

  where ε is an n×1 vector of values of the unobservable.

• X is an n×k matrix of regressors (or explanatory variables).
• y is the n×1 vector of values of the dependent variable.

      y = (Y1, Y2, . . . , Yn)',    ε = (ε1, ε2, . . . , εn)',    β = (β0, . . . , βk−1)',

  and X is the n×k matrix whose i-th row is xi' = (xi1, xi2, . . . , xik),
  so that

      X = [x1', x2', . . . , xn']'.

G023. I
Model Specifications

• Linear model:

      Yi = β0 + β1 Xi + εi
      ∂Yi/∂Xi = β1

  Interpretation: When X goes up by 1 unit, Y goes up by β1 units.

• Log-log model (constant elasticity model):

      ln(Yi) = β0 + β1 ln(Xi) + εi
      Yi = e^{β0} Xi^{β1} e^{εi}
      ∂Yi/∂Xi = e^{β0} β1 Xi^{β1−1} e^{εi}
      (∂Yi/Yi) / (∂Xi/Xi) = β1

  Interpretation: When X goes up by 1%, Y goes up by β1 %.

• Log-lin model:

      ln(Yi) = β0 + β1 Xi + εi
      ∂Yi/∂Xi = β1 e^{β0} e^{β1 Xi} e^{εi}
      (∂Yi/Yi) / ∂Xi = β1

  Interpretation: When X goes up by 1 unit, Y goes up by 100 β1 %.

G023. I
Example: Global Warming

Dependent variable:      Temperature (Celsius)          Log Temperature

CO2 (ppm)           0.0094 (.0018)       -           0.00102 (.0002)      -
Log CO2                  -          2.85 (.5527)          -          0.31 (.0607)
Constant            6.47 (.5345)   -6.94 (3.13)      1.92 (.5879)    0.46 (.3452)

• An increase of 1 ppm in CO2 raises temperature by 0.0094 degrees
  Celsius. Hence, since 1700, a rise of about 60 ppm leads to an
  increase in temperature of about 0.5 degrees Celsius.
• A one percent increase in CO2 concentration leads to an increase
  of 0.0285 degrees.
• An increase of one ppm in CO2 concentration leads to an increase
  in temperature of 0.1%.
• A 1% increase in CO2 concentration leads to a 0.3% increase
  in temperature.

G023. I
Assumptions of the Classical
Linear Regression Model

• Assumption 1: E[ε|X] = 0
– The expected value of the error term has mean zero given
any value of the explanatory variable. Thus observing a
high or a low value of X does not imply a high or a low
value of ε.
X and ε are uncorrelated.
– This implies that changes in X are not associated with
changes in ε in any particular direction - Hence the asso-
ciated changes in Y can be attributed to the impact of X.
– This assumption allows us to interpret the estimated coef-
ficients as reflecting causal impacts of X on Y .
– Note that we condition on the whole set of data for X in
the sample, not just on a single observation.

G023. I
Assumptions of the Classical Linear Regression Model

• Assumption 2: rank(X) = k.
• In this case, for all non-zero k × 1 vectors, c, Xc ≠ 0.
• When the rank of X is less than k, there exists a non-zero vector
c such that Xc = 0. In words, there is a linear combination of
the columns of X which is a vector of zeros. In this situation
the OLS estimator cannot be calculated. β cannot be defined
by using the information contained in X.
• Perhaps one could obtain other values of x and then be in a
position to define β. But sometimes this is not possible, and
then β is not identifiable given the information in X. Perhaps
we could estimate functions (e.g. linear functions) of β that
would be identifiable even without more x values.

G023. I
OLS Estimator

• Assumption 1 states that E[ε|X] = 0, which implies that:

      E[y − Xβ|X] = 0
      X'E[y − Xβ|X] = E[X'(y − Xβ)|X]
                    = E[X'y|X] − X'Xβ
                    = 0

• and so, given that X'X has full rank (Assumption 2):

      β = (X'X)⁻¹E[X'y|X]

• Replacing E[X'y|X] by X'y leads to the Ordinary Least Squares
  estimator:
      β̂ = (X'X)⁻¹X'y

• Note that we can write the estimator as:

      β̂ = (X'X)⁻¹X'(Xβ + ε)
         = β + (X'X)⁻¹X'ε

G023. I
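The course software is STATA, but to make the matrix algebra above concrete, here is a minimal sketch in Python/numpy (not part of the original notes). The data and coefficient values are simulated and purely illustrative; the snippet simply evaluates β̂ = (X'X)⁻¹X'y.

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3

# Simulated design matrix with a constant in the first column.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 0.5, -2.0])
eps = rng.normal(size=n)
y = X @ beta + eps

# OLS estimator: beta_hat = (X'X)^{-1} X'y
# (solving the normal equations is numerically preferable to forming the inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to [1.0, 0.5, -2.0]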
Properties of the OLS Estimator

• Variance of the OLS estimator:

      Var(β̂|X) = E[(β̂ − E[β̂|X])(β̂ − E[β̂|X])'|X]
                = E[(β̂ − β)(β̂ − β)'|X]
                = E[(X'X)⁻¹X'εε'X(X'X)⁻¹|X]
                = (X'X)⁻¹X'E[εε'|X]X(X'X)⁻¹
                = (X'X)⁻¹X'ΣX(X'X)⁻¹

  where Σ = Var[ε|X].
• If Σ = σ²In (homoskedasticity and no autocorrelation) then

      Var(β̂|X) = σ²(X'X)⁻¹

• If we are able to get an estimator of σ²:

      V̂ar(β̂|X) = σ̂²(X'X)⁻¹

• We can re-write the variance as:

      V̂ar(β̂|X) = (σ̂²/n) (X'X/n)⁻¹

  We can expect X'X/n to remain fairly constant as the sample
  size n increases, which means that we get more accurate OLS
  estimators in larger samples.

G023. I
Alternative Way

• The OLS estimator is also defined as

      min_β (1/n) ε'ε = min_β (1/n) (y − Xβ)'(y − Xβ)

• The first order conditions for this problem are:

      X'(y − Xβ̂) = 0

  This is a k×1 system of equations defining the OLS estimator.

G023. I
Goodness of Fit

• We measure how well the model fits the data using the R².
• This is the ratio of the explained sum of squares to the total
  sum of squares.

  – Total sum of squares:      TSS = Σi (Yi − Ȳ)²
  – Explained sum of squares:  ESS = Σi [β̂(Xi − X̄)]²
  – Residual sum of squares:   RSS = Σi ε̂i²

• Then we define

      R² = ESS/TSS = 1 − RSS/TSS

• This is a measure of how much of the variance of Y is explained
  by the regressor X.
• The computed R² following an OLS regression is always be-
  tween 0 and 1.
• A low R² is not necessarily an indication that the model is
  wrong - just that the included X have low explanatory power.
• The key to whether the results are interpretable as causal im-
  pacts is whether the explanatory variable is uncorrelated with
  the error term.

G023. I
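As an illustration only (not from the original notes), the following Python/numpy sketch computes TSS, ESS, RSS and R² from an OLS fit on simulated data; when the regression contains a constant the two expressions for R² coincide.

import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
resid = y - y_hat

TSS = np.sum((y - y.mean()) ** 2)       # total sum of squares
RSS = np.sum(resid ** 2)                # residual sum of squares
ESS = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares

R2 = 1 - RSS / TSS                      # equals ESS / TSS with a constant included
print(R2, ESS / TSS)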
Goodness of Fit

• The R² is non-decreasing in the number of explanatory variables.
• To compare two different models, one would like to adjust for
  the number of explanatory variables: the adjusted R²:

      R̄² = 1 − [Σi ε̂i²/(N − k)] / [Σi (Yi − Ȳ)²/(N − 1)]

• The adjusted and non-adjusted R² are related:

      R̄² = 1 − (1 − R²)(N − 1)/(N − k)

• Note that to compare two different R² the dependent variable
  must be the same:

      ln Yi = β0 + β1 Xi + ui
      Yi = α0 + α1 Xi + vi

  cannot be compared, as the Total Sums of Squares are different.

G023. I
Alternative Analogue Estimators

• Let H be an n×k matrix containing elements which are functions
  of the elements of X:

      E[H'ε|X] = 0
      E[H'(y − Xβ)|X] = 0
      E[H'y|X] − E[H'X|X]β = 0
      E[H'y|X] − (H'X)β = 0

• If the matrix H'X has full rank k then

      β = (H'X)⁻¹E[H'y|X]

      β̂H = (H'X)⁻¹H'y

      Var(β̂H|X) = (H'X)⁻¹H'ΣH(X'H)⁻¹

  with Σ = Var(ε|X). If we can write Σ = σ²In, where In is the
  n×n identity matrix, then:

      Var(β̂H|X) = σ²(H'X)⁻¹H'H(X'H)⁻¹

• Different choices of H lead to different estimators. We need a
  criterion that ranks estimators: usually the estimator with the
  smallest variance.

G023. I
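A hedged numerical sketch of the analogue estimator β̂H = (H'X)⁻¹H'y (my own illustration, not from the notes). The particular H below, built from functions of X, is an arbitrary choice just to show the mechanics; with H = X the formula reproduces OLS.

import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# H: functions of the columns of X (here a constant and x^3, purely illustrative).
H = np.column_stack([np.ones(n), x ** 3])

beta_H = np.linalg.solve(H.T @ X, H.T @ y)     # (H'X)^{-1} H'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # the choice H = X gives back OLS
print(beta_H, beta_ols)                        # both close to [1.0, 2.0]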
Misspecification

• Suppose the true model is not linear but takes the following
  (more general) form:

      E[Y|X = x] = g(x, θ)

  so that
      Y = g(x, θ) + ε,    E[ε|X] = 0.

  Define G(X, θ) = [g(x1, θ), . . . , g(xn, θ)]'. Then

      E[β̂|X] = E[(X'X)⁻¹X'y|X]
              = (X'X)⁻¹X'G(X, θ)
              ≠ β

• The OLS estimator is biased. The bias depends on the values
  of x and the parameters θ. Different researchers faced with
  different values of x will come up with different conclusions
  about the value of β if they use a linear model.
• The variance of the OLS estimator is:

      Var(β̂|X) = E[(β̂ − E[β̂|X])(β̂ − E[β̂|X])'|X]
                = E[(X'X)⁻¹X'(y − G(X, θ))(y − G(X, θ))'X(X'X)⁻¹|X]
                = E[(X'X)⁻¹X'εε'X(X'X)⁻¹|X]
                = (X'X)⁻¹X'ΣX(X'X)⁻¹

  exactly as it is when the regression function is correctly speci-
  fied.

G023. I
Omitted Regressors

• Suppose the true model generating y is:

      y = Zγ + ε,    E[ε|X, Z] = 0

• Consider the OLS estimator β̂ = (X'X)⁻¹X'y calculated using
  data X, where X and Z may have common columns.

      E[β̂|X, Z] = E[(X'X)⁻¹X'y|X, Z]
                 = E[(X'X)⁻¹X'(Zγ + ε)|X, Z]
                 = (X'X)⁻¹X'Zγ + (X'X)⁻¹X'E[ε|X, Z]
                 = (X'X)⁻¹X'Zγ

• Let Z = [X Q] and γ' = [γX' γQ'], so that the matrix X used
  to calculate β̂ is a part of the matrix Z. In the fitted model,
  the variables Q have been omitted.

      E[β̂|X, Z] = E[β̂|Z]
                 = (X'X)⁻¹X'Zγ
                 = (X'X)⁻¹[X'X X'Q]γ
                 = [I (X'X)⁻¹X'Q]γ
                 = γX + (X'X)⁻¹X'QγQ

• If X'Q = 0 or γQ = 0 then E[β̂|X, Z] = γX. In other words,
  omitting a variable from a regression biases the coefficients
  unless the omitted variable is uncorrelated with the included
  explanatory variables (or its coefficient is zero).

G023. I
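To illustrate the formula E[β̂|X, Z] = γX + (X'X)⁻¹X'QγQ, here is a small simulation (my own illustration, not from the notes): when the omitted variable is correlated with the included regressor, the short-regression coefficient is biased.

import numpy as np

rng = np.random.default_rng(3)
n = 5000

# True model: y = 1 + 1*x + 1*q + eps, with x and q positively correlated.
q = rng.normal(size=n)
x = 0.8 * q + rng.normal(size=n)
y = 1.0 + 1.0 * x + 1.0 * q + rng.normal(size=n)

X_short = np.column_stack([np.ones(n), x])       # q omitted
X_long = np.column_stack([np.ones(n), x, q])     # q included

b_short = np.linalg.solve(X_short.T @ X_short, X_short.T @ y)
b_long = np.linalg.solve(X_long.T @ X_long, X_long.T @ y)

# The slope in the short regression picks up gamma_X + (X'X)^{-1}X'Q gamma_Q.
print("short:", b_short)   # slope biased away from 1
print("long: ", b_long)    # slope close to 1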
Omitted Regressors: Example

• Health and income in a sample of Swedish individuals.

• Relate log income to a measure of overweight (body mass index).

                            Log Income
  BMI low            -0.42   (.016)     -0.15    (.014)
  BMI high           -0.00   (.021)     -0.12    (.018)
  age                                    0.13    (.0012)
  age square                            -0.0013  (.00001)
  constant            6.64   (.0053)     3.76    (.0278)

• Are obese individuals earning less than others?

• Obesity, income and age are related:

  Age            Log income    Prevalence of Obesity
  <20            4.73          0.007
  20-40          6.76          0.033
  40-60          7.01          0.0759
  60 and over    6.34          0.084

G023. I
Measurement Error

• Data is often measured with error.


– reporting errors.
– coding errors.
• The measurement error can affect either the dependent vari-
able or the explanatory variables. The effect is dramatically
different.

G023. I
Measurement Error on Dependent Variable

• Yi is measured with error. We assume that the measurement
  error is additive and not correlated with Xi.
• We observe Y̌ = Y + ν and regress Y̌ on X. Since Y = Xβ + ε,
  the equation we estimate is

      Y̌ = Xβ + ε + ν
         = Xβ + w

• The assumptions we have made for OLS to be unbiased and
  BLUE are not violated: the OLS estimator is unbiased.
• The variance of the slope coefficient is:

      V̂ar(β̂) = Var(w)(X'X)⁻¹
              = Var(ε + ν)(X'X)⁻¹
              = [Var(ε) + Var(ν)](X'X)⁻¹
              ≥ Var(ε)(X'X)⁻¹

• The variance of the estimator is larger with measurement error
  on Y.

G023. I
Measurement Error on Explanatory Variables

• X is measured with error. We assume that the error is addi-
  tive and not correlated with X: E[ν|X] = 0.
• We observe X̌ = X + ν instead. The regression we perform is
  Y on X̌. The estimator of β is expressed as:

      β̂ = (X̌'X̌)⁻¹X̌'y
         = (X'X + ν'ν + X'ν + ν'X)⁻¹(X + ν)'(Xβ + ε)
      E[β̂|X] ≈ (X'X + ν'ν)⁻¹X'Xβ

• Measurement error on X leads to a biased OLS estimate,
  biased towards zero. This is also called attenuation bias.

G023. I
Example

• True model:

      Yi = β0 + β1 Xi + ui,   with β0 = 1, β1 = 1

• Xi is measured with error. We observe X̃i = Xi + νi.

• Regression results:

              Var(νi)/Var(Xi)
              0      0.2     0.4     0.6
      β0      1      1.08    1.28    1.53
      β1      2      1.91    1.7     1.45

G023. I
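The table above comes from the author's own simulation; the sketch below (illustrative only, with its own made-up sample size, coefficients and seed) reproduces the qualitative pattern: as Var(ν)/Var(X) grows, the OLS slope is attenuated towards zero.

import numpy as np

rng = np.random.default_rng(4)
n = 20000
beta0, beta1 = 1.0, 1.0

x = rng.normal(size=n)                        # true regressor
y = beta0 + beta1 * x + rng.normal(size=n)

for noise_ratio in [0.0, 0.2, 0.4, 0.6]:
    # Measurement error with Var(nu)/Var(x) equal to noise_ratio.
    nu = rng.normal(scale=np.sqrt(noise_ratio * x.var()), size=n)
    x_obs = x + nu
    X = np.column_stack([np.ones(n), x_obs])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(noise_ratio, b)                     # the slope shrinks toward zero as noise grows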
Estimation of linear functions of β

• Sometimes we are interested in a particular combination of the
  elements of β, say c'β:

  – the first element of β: c' = [1, 0, 0, . . . , 0]
  – the sum of the first two elements of β: c' = [1, 1, 0, . . . , 0]
  – the expected value of Y at x = [x1, . . . , xk], which might be
    used in predicting the value of Y at those values: c' = [x1, . . . , xk]

• An obvious estimator of c'β is c'β̂, whose variance is:

      Var(c'β̂|X) = σ²c'(X'X)⁻¹c

G023. I
Minimum Variance Property of OLS

• The OLS estimator possesses an optimality property when
  Var[ε|X] = σ²In, namely that among the class of linear functions
  of y that are unbiased estimators of β, the OLS estimator has the
  smallest variance, in the sense that, considering any other estimator,

      β̃ = Q(X)y

  (a linear function of y, with Q(X) chosen so that β̃ is unbiased),

      Var(c'β̃) ≥ Var(c'β̂) for all c.

  This is known as the Gauss-Markov theorem.
  OLS is the best linear unbiased (BLU) estimator.
• To show this result, let

      Q(X) = (X'X)⁻¹X' + R'

  where R may be a function of X, and note that

      E[β̃|X] = β + R'Xβ.

  This is equal to β for all β only when R'X = 0. This condition
  is required if β̃ is to be a linear unbiased estimator. Imposing
  that condition,

      Var[β̃|X] − Var[β̂|X] = σ²R'R,

  and

      Var(c'β̃) − Var(c'β̂) = σ²d'd = σ² Σi di² ≥ 0

  where d = Rc.

G023. I
M Estimation

• A different strategy to define an estimator.

• Estimator that "fits the data".
• Not obvious that this is the most desirable goal: risk of over-
  fitting.
• One early approach to this problem was due to the French
  mathematician Laplace: least absolute deviations,

      β̃ = argmin_b Σi |Yi − b'xi|

  The estimator is quite robust to measurement error but quite
  difficult to compute.
• Note that the OLS estimator belongs to the class of M estimators,
  as it can be defined as:

      β̂ = argmin_b Σi (Yi − b'xi)²

G023. I
Frisch-Waugh-Lovell Theorem

• Suppose X is partitioned into two blocks: X = [X1, X2], so that

      y = X1β1 + X2β2 + ε

  where β1 and β2 are the elements of the conformable partition of
  β. Let
      M1 = I − X1(X1'X1)⁻¹X1'

  then β̂2 can be written:

      β̂2 = ((M1X2)'(M1X2))⁻¹(M1X2)'M1y
      β̂2 = (X2'M1X2)⁻¹X2'M1y

• Proof: writing X'y = (X'X)β̂ in partitioned form gives

      X1'y = X1'X1β̂1 + X1'X2β̂2
      X2'y = X2'X1β̂1 + X2'X2β̂2

  so that
      β̂1 = (X1'X1)⁻¹X1'y − (X1'X1)⁻¹X1'X2β̂2

  Substituting into the second equation,

      X2'y − X2'X1(X1'X1)⁻¹X1'y = X2'X2β̂2 − X2'X1(X1'X1)⁻¹X1'X2β̂2

  which after rearrangement is

      X2'M1y = (X2'M1X2)β̂2

• Interpretation: the term M1X2 is the matrix of residuals from
  the OLS estimation of X2 on X1. The term M1y is the vector of
  residuals of the OLS regression of y on X1. So to get the OLS
  estimate of β2 we can perform an OLS estimation using residuals
  as left and right hand side variables.
G023. I
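A quick numerical check of the Frisch-Waugh-Lovell result (my own illustration, not from the notes): the coefficients on X2 from the full regression coincide with those from regressing the M1-residuals of y on the M1-residuals of X2.

import numpy as np

rng = np.random.default_rng(5)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # first block (includes a constant)
X2 = rng.normal(size=(n, 2))                             # second block
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Full OLS.
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

# FWL: residual-on-residual regression reproduces the X2 coefficients.
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # annihilator of X1
X2_res = M1 @ X2
y_res = M1 @ y
beta2_fwl = np.linalg.solve(X2_res.T @ X2_res, X2_res.T @ y_res)

print(beta_full[2:], beta2_fwl)   # the two agree up to rounding error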
Generalised Least Squares Estimation

• The simple result Var(β̂|X) = σ²(X'X)⁻¹ is true when
  Var(ε|X) = σ²In, which is independent of X.
• There are many situations in which we would expect to find
  some dependence on X, so that Var[ε|X] ≠ σ²In.
• Example: in a household expenditure survey we might ex-
pect to find people with high values of time purchasing large
amounts infrequently (e.g. of food, storing purchases in a
freezer) and poor people purchasing small amounts frequently.
If we just observed households’ expenditures for a week (as in
the British National Food Survey) then we would expect to see
that, conditional on variables X that are correlated with the
value of time, the variance of expenditure depends on X.
• When this happens we talk of the disturbances, ε, as being
heteroskedastic.
• In other contexts we might expect to find correlation among
the disturbances, in which case we talk of the disturbances as
being serially correlated.

G023. I
Generalised Least Squares Estimation

• The BLU property of the OLS estimator does not usually apply
  when Var[ε|X] ≠ σ²In.
• Insight: suppose that Y has a much larger conditional variance
at one value of x, x∗ , than at other values. Realisations pro-
duced at x∗ will be less informative about the location of the
regression function than realisations obtained at other values
of x. It seems natural to give realisations obtained at x∗ less
weight when estimating the regression function.
• We know how to produce a BLU estimator when V ar[ε|X] =
σ 2 In .
• Our strategy for producing a BLU estimator when this condi-
tion does not hold is to transform the original regression model
so that the conditional variance of the transformed Y is pro-
portional to an identity matrix and apply the OLS estimator
in the context of that transformed model.

G023. I
Generalised Least Squares Estimation

• Suppose Var[ε|X] = Σ is positive definite.

• Then we can find a matrix P such that PΣP' = I. Let Λ be a
  diagonal matrix with the (positive valued) eigenvalues of Σ on
  its diagonal, and let C be the matrix of associated orthonormal
  eigenvectors. Then CΣC' = Λ and so Λ^{−1/2}CΣC'Λ^{−1/2} = I.
  The required matrix P is Λ^{−1/2}C. Transforming the model,

      z = Py = PXβ + u

  where u = Pε and Var[u|X] = I.
• In the context of this model the OLS estimator,

      β̆ = (X'P'PX)⁻¹X'P'Py,

  does possess the BLU property.

• Further, its conditional variance given X is (X'P'PX)⁻¹. Since
  PΣP' = I, it follows that Σ = P⁻¹P'⁻¹ = (P'P)⁻¹, so that
  P'P = Σ⁻¹. The estimator β̆ and its conditional mean and
  variance can therefore be written as

      β̆ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y
      E[β̆|X] = β
      Var[β̆|X] = (X'Σ⁻¹X)⁻¹

  The estimator is known as the generalised least squares (GLS)
  estimator.

G023. I
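A minimal sketch of the GLS formula β̆ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y on simulated heteroskedastic data, assuming (purely for illustration) that Σ is known and diagonal. Not part of the original notes.

import numpy as np

rng = np.random.default_rng(6)
n = 400
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])

# Heteroskedastic errors: Var(eps_i | x_i) = sigma2_i, treated as known here.
sigma2 = 0.5 * x ** 2
eps = rng.normal(scale=np.sqrt(sigma2))
y = X @ np.array([1.0, 2.0]) + eps

Sigma_inv = np.diag(1.0 / sigma2)                  # Sigma^{-1}, known in this sketch
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
var_gls = np.linalg.inv(X.T @ Sigma_inv @ X)       # Var[beta_gls | X]

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)       # OLS for comparison
print(beta_gls, np.sqrt(np.diag(var_gls)), beta_ols)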
Feasible Generalised Least Squares Estimation

• Obviously the estimator cannot be calculated unless Σ is known,
  which is rarely the case. However, sometimes it is possible to
  produce a well behaved estimator Σ̂, in which case the feasible
  GLS estimator

      β̆ = (X'Σ̂⁻¹X)⁻¹X'Σ̂⁻¹y

  could be used.
• To study the properties of this estimator requires the use of
asymptotic approximations and we return to this later.

G023. I
Feasible GLS

• To produce the feasible GLS estimator we must impose some


structure on the variance matrix of the unobservables, Σ.
• If not we would have to estimate n(n + 1)/2 parameters using
data containing just n observations: infeasible.
• One way to proceed is to impose the restriction that the diag-
onal elements of Σ are constant and allow nonzero off diagonal
elements but only close to the main diagonal of Σ. This re-
quires ε to have homoskedastic variation with X but allows a
degree of correlation between values of ε for observations that
are close together (e.g. in time if the data are in time order in
the vector y).
• One could impose a parametric model on the variation of ele-
ments of Σ. You will learn more about this in the part of the
course dealing with time series.
• Heteroskedasticity: a parametric approach is occasionally em-
  ployed, using a model that requires σii = f(xi). For example,
  with the model σii = γ'xi one could estimate γ, for example by
  calculating an OLS estimator of γ in the model with equation

      ε̂i² = γ'xi + ui

  where ε̂i² is the squared i-th residual from an OLS estimation.
  Then an estimate of Σ could be produced using γ̂'xi as the i-th
  main diagonal element.

G023. I
Feasible GLS

• Economics rarely suggests suitable parametric models for vari-


ances of unobservables.
• One may therefore not wish to pursue the gains in efficiency
that GLS in principle offers.
• If the OLS estimator is used and Σ ≠ σ²In, one must still be
  aware that the formula yielding standard errors, Var(β̂) =
  σ²(X'X)⁻¹, is generally incorrect. The correct one is:

      Var(β̂) = (X'X)⁻¹X'ΣX(X'X)⁻¹.

• One popular strategy is to proceed with the OLS estimator
  but to use an estimate of the matrix (X'X)⁻¹X'ΣX(X'X)⁻¹
  to construct standard errors.
• In models in which the off-diagonal elements of Σ are zero but
  heteroskedasticity is potentially present this can be done by
  using
      Var(β̂) = (X'X)⁻¹X'Σ̂X(X'X)⁻¹

  where Σ̂ is a diagonal matrix with squared OLS residuals, ε̂i²,
  on its main diagonal.
• There exist more elaborate (and non parametric) estimators of
Σ which can be used to calculate (heteroskedasticity) robust
standard errors.

G023. I
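A sketch (not from the notes) of the "sandwich" variance estimate (X'X)⁻¹X'Σ̂X(X'X)⁻¹ with squared OLS residuals on the diagonal of Σ̂, compared with the conventional σ̂²(X'X)⁻¹ formula, on simulated heteroskedastic data.

import numpy as np

rng = np.random.default_rng(7)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
eps = rng.normal(size=n) * (0.5 + np.abs(x))   # heteroskedastic disturbances
y = X @ np.array([1.0, 2.0]) + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Conventional (homoskedastic) variance estimate.
sigma2_hat = resid @ resid / (n - X.shape[1])
V_conv = sigma2_hat * XtX_inv

# Heteroskedasticity-robust sandwich: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}.
meat = (X * resid[:, None] ** 2).T @ X
V_robust = XtX_inv @ meat @ XtX_inv

print(np.sqrt(np.diag(V_conv)), np.sqrt(np.diag(V_robust)))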
Inference: Sampling Distributions

• Suppose that y given X (equivalently ε given X) is normally
  distributed.
• The OLS estimator is a linear function of y and is therefore,
  conditional on X, normally distributed. (The same argument
  applies to the GLS estimator employing Σ.) For the OLS esti-
  mator with Var[ε|X] = σ²I:

      β̂|X ∼ Nk[β, σ²(X'X)⁻¹]

  and when Var[ε|X] = Σ, for the GLS estimator,

      β̆|X ∼ Nk[β, (X'Σ⁻¹X)⁻¹].

• Sticking with the homoskedastic case, consider a linear combi-
  nation of β, c'β:

      c'β̂|X ∼ N[c'β, σ²c'(X'X)⁻¹c].

G023. I
Inference: Confidence Intervals

• Let Z ∼ N[0, 1] and let zL(α) and zU(α) be the closest pair
  of values such that P[zL(α) ≤ Z ≤ zU(α)] = α. zL(α) is
  the (1 − α)/2 quantile of the standard normal distribution.
  Choosing α = 0.95 gives zU(α) = 1.96, zL(α) = −1.96.
• The result above concerning the distribution of c'β̂ implies that

      P[zL(α) ≤ (c'β̂ − c'β) / (σ (c'(X'X)⁻¹c)^{1/2}) ≤ zU(α)] = α

  which in turn implies that

      P[c'β̂ − zU(α)σ(c'(X'X)⁻¹c)^{1/2} ≤ c'β ≤ c'β̂ − zL(α)σ(c'(X'X)⁻¹c)^{1/2}] = α.

  Consider the interval

      [c'β̂ − zU(α)σ(c'(X'X)⁻¹c)^{1/2} , c'β̂ − zL(α)σ(c'(X'X)⁻¹c)^{1/2}].

  This random interval covers the value c'β with probability α.
  This is known as a 100α% confidence interval for c'β.
• Note that this interval cannot be calculated without knowledge
  of σ. In practice, here and in the tests and interval estimators
  that follow, one will use an estimator of σ².

G023. I
Estimation of σ²

• Note that

      σ² = n⁻¹ E[(y − Xβ)'(y − Xβ)|X]

  which suggests the analogue estimator

      σ̂² = n⁻¹ (y − Xβ̂)'(y − Xβ̂)
         = n⁻¹ ε̂'ε̂
         = n⁻¹ y'My

  where ε̂ = y − Xβ̂ = My and M = I − X(X'X)⁻¹X'.

• σ̂² is a biased estimator and the bias is in the downward direc-
  tion:
      E[σ̂²] = ((n − k)/n) σ² < σ²

  but note that the bias is negligible unless k, the number of
  covariates, is large relative to n, the sample size.
• Intuitively, the bias arises from the fact that the OLS estimator
minimises the sum of squared residuals.
• It is possible to correct the bias using the estimator (n − k)⁻¹ε̂'ε̂,
but the effect is small in most economic data sets.
• Under certain conditions to be discussed shortly the estimator
σ̂ 2 is consistent. This means that in large samples the inaccu-
racy of the estimator is small and that if in the tests described
below the unknown σ 2 is replaced by σ̂ 2 the tests are still ap-
proximately correct.

G023. I
Estimation of σ²

• Proof of E[σ̂²] = ((n − k)/n) σ² < σ²:
• First note that My = Mε because MX = 0. So

      σ̂² = n⁻¹ y'My = n⁻¹ ε'Mε

      E[σ̂²|X] = n⁻¹ E[ε'Mε|X]
               = n⁻¹ E[trace(ε'Mε)|X]
               = n⁻¹ E[trace(Mεε')|X]
               = n⁻¹ trace(M E[εε'|X])
               = n⁻¹ trace(MΣ)

  and when Σ = σ²In,

      n⁻¹ trace(MΣ) = n⁻¹ σ² trace(M)
                    = n⁻¹ σ² trace(In − X(X'X)⁻¹X')
                    = σ² (n − k)/n.

G023. I
Confidence regions

• Sometimes we need to make probability statements about the


values of more than one linear combination of β. We can do
this by developing confidence regions.
• For j linear combinations, a 100α% confidence region is a sub-
  set of ℝ^j which covers the unknown (vector) value of the j
  linear combinations with probability α.
• We continue to work under the assumption that y given X
  (equivalently ε given X) is normally distributed.
• Let the j linear combinations of interest be Rβ = r, say, where
  R is j × k with rank j. The OLS estimator of r is Rβ̂ and

      Rβ̂ ∼ N[r, σ²R(X'X)⁻¹R']

  which implies that

      (Rβ̂ − r)'(R(X'X)⁻¹R')⁻¹(Rβ̂ − r)/σ² ∼ χ²(j)     (1)

  where χ²(j) denotes a Chi-square random variable with parame-
  ter (degrees of freedom) j.

G023. I
Chi Square Distribution

• Let the ν×1 vector Z ∼ N(0, Iν).

• Then ξ = Z'Z = Σi Zi² (positive valued) has a distribution
  known as a Chi-square distribution, written ξ ∼ χ²(ν). The
  probability density function associated with the χ²(ν) distribu-
  tion is positively skewed. For small ν its mode is at zero.
• The expected value and variance of a Chi-square random vari-
  able are:

      E[χ²(ν)] = ν
      Var[χ²(ν)] = 2ν.

  For large ν, the distribution is approximately normal.

• Partial proof: if Zi ∼ N(0, 1) then V[Zi] = E[Zi²] = 1. There-
  fore E[Σi Zi²] = ν.
• Generalisation: let A ∼ Nν[µ, Σ] and let P be such that
  PΣP' = I, which implies that P'P = Σ⁻¹. Then
  Z = P(A − µ) ∼ Nν[0, I], so that

      ξ = Z'Z = (A − µ)'Σ⁻¹(A − µ) ∼ χ²(ν).

G023. I
Confidence regions continued

• Let qχ²(j)(α) denote the α-quantile of the χ²(j) distribution.
  Then
      P[χ²(j) ≤ qχ²(j)(α)] = α

  implies that

      P[(Rβ̂ − r)'(R(X'X)⁻¹R')⁻¹(Rβ̂ − r)/σ² ≤ qχ²(j)(α)] = α.

  The region in ℝ^j defined by

      {r : (Rβ̂ − r)'(R(X'X)⁻¹R')⁻¹(Rβ̂ − r)/σ² ≤ qχ²(j)(α)}

  is a 100α% confidence region for r, covering r with probability
  α. The boundary of the region is an ellipsoid centred on the
  point Rβ̂.
• Setting R equal to a vector c' (note then j = 1) and letting
  c* = c'β, produces

      α = P[(c'β̂ − c*)'(c'(X'X)⁻¹c)⁻¹(c'β̂ − c*)/σ² ≤ qχ²(1)(α)]

        = P[(c'β̂ − c*)² / (σ²c'(X'X)⁻¹c) ≤ qχ²(1)(α)]

        = P[−(qχ²(1)(α))^{1/2} ≤ (c'β̂ − c*) / (σ(c'(X'X)⁻¹c)^{1/2}) ≤ (qχ²(1)(α))^{1/2}]

        = P[zL(α) ≤ (c'β̂ − c*) / (σ(c'(X'X)⁻¹c)^{1/2}) ≤ zU(α)]

  where we have used the relationship χ²(1) = N(0, 1)².

G023. I
Tests of hypotheses

• The statistics developed to construct confidence intervals can


also be used to conduct tests of hypotheses.
• For example, suppose we wish to conduct a test of the null
  hypothesis H0: Rβ − r = 0 against the alternative H1: Rβ − r ≠ 0.
  The statistic

      S = (Rβ̂ − r)'(R(X'X)⁻¹R')⁻¹(Rβ̂ − r)/σ²     (2)

  has a χ²(j) distribution under the null hypothesis. Under the
  alternative, let
      Rβ − r = δ ≠ 0.
  Then
      Rβ̂ − r ∼ N[δ, σ²R(X'X)⁻¹R']

  and the statistic S will tend to be larger than we would ex-
  pect to obtain from a χ²(j) distribution. So we reject the null
  hypothesis for large values of S.

G023. I
Tests of Hypotheses

• The size of a test of H0 is the probability of rejecting H0 when


H0 is true.
• The power of a test against a specific alternative H1 is the
probability of rejecting H0 when H1 is true.
• The following test procedure has size λ.

Decision rule: Reject H0 if S > qχ2 (j) (1−λ), otherwise


do not reject H0 .

Here qχ2 (j) (1−λ) is the (1−λ) quantile of the χ2 (j) distribution.
• Note that we do not talk in terms of accepting H0 as an alter-
native to rejection. The reason is that a value of S that does
not fall in the rejection region of the test is consonant with
many values of Rβ − r that are close to but not equal to 0.

G023. I
Tests of Hypotheses

• To obtain a test concerning a single linear combination of β,
  H0: c'β = c*, we can use the procedure above with j = 1, giving

      S = (c'β̂ − c*)² / (σ²c'(X'X)⁻¹c)

  and the following size λ test procedure.

      Decision rule: Reject H0 if S > qχ²(1)(1 − λ), otherwise
      do not reject H0.

• Alternatively we can proceed directly from the sampling dis-
  tribution of c'β̂. Since, when H0 is true,

      T = (c'β̂ − c*) / (σ(c'(X'X)⁻¹c)^{1/2}) ∼ N(0, 1),

  we can obtain zL(α), zU(α), such that

      P[zL(α) < N(0, 1) < zU(α)] = α = 1 − λ.

  The following test procedure has size (probability of rejecting
  a true null hypothesis) equal to λ.

      Decision rule: Reject H0 if T > zU(α) or T < zL(α), otherwise
      do not reject H0.
• Because of the relationship between the standard normal N (0, 1)
distribution and the χ2(1) distribution the tests are identical.

G023. I
Confidence Interval: Example

• We regress log of income on age, sex and education dummies


in a sample of 39000 Swedish individuals.

Coef. Std. Err


Age 0.1145845 0.0010962
Age square -0.0010657 0.0000109
Male 0.060531 0.0078549
High school degree 0.5937122 0.0093677
College degree 0.7485223 0.0115236
Constant 3.563253 0.0249524
R square: 0.3435

• 95% confidence interval for College Education:

[0.748 − 1.96 ∗ 0.0115, 0.748 + 1.96 ∗ 0.0115] = [0.726, 0.771]

• Test of H0 no gender differences in income:

0.06/0.00785 = 7.71

Reject H0 .
• Test of H0 Effect of College degree equal to High School degree:

(0.748 − 0.593)/0.0115 = 13.39

Reject H0 .

G023. I
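The arithmetic of this slide can be reproduced directly from the reported coefficients and standard errors; the snippet below (illustrative only) does so, and, exactly as the slide does, it ignores the covariance between the college and high-school coefficients in the last test.

# Confidence interval and test statistics from the reported estimates
# (coefficients and standard errors copied from the table above).
coef_college, se_college = 0.7485223, 0.0115236
coef_hs = 0.5937122
coef_male, se_male = 0.060531, 0.0078549

# 95% interval for the college-degree coefficient.
ci = (coef_college - 1.96 * se_college, coef_college + 1.96 * se_college)
print(ci)                                   # roughly (0.726, 0.771)

# H0: no gender difference in income.
print(coef_male / se_male)                  # about 7.7, reject at conventional levels

# H0: college effect equals high-school effect (treating the two estimates
# as independent, as in the slide).
print((coef_college - coef_hs) / se_college)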
Detecting structural change

• A common application of this testing procedure in economet-


rics arises when attempting to detect “structural change”.
• In a time series application one might imagine that up to some
  time Ts the vector β = βb and after Ts, β = βa, that is, that
  there are two regimes with switching occurring at time Ts. This
  situation can be captured by specifying the model

      yb = Xbβb + εb,    ya = Xaβa + εa,

  i.e. y = Xβ + ε with y = (yb', ya')', ε = (εb', εa')', β = (βb', βa')'
  and X block diagonal with blocks Xb and Xa, where Xb contains
  data for the period before Ts and Xa contains data for the period
  after Ts. The null hypothesis of no structural change is expressed
  by H0: βb = βa. If all the coefficients are allowed to alter across
  the structural break then

      ε̂U'ε̂U = ε̂b'ε̂b + ε̂a'ε̂a

  where, e.g., ε̂b'ε̂b is the sum of squared residuals from estimating

      yb = Xbβb + εb.

  The test statistic developed above, specialised to this problem,
  can then be written

      S = (ε̂'ε̂ − (ε̂b'ε̂b + ε̂a'ε̂a)) / σ²

  where ε̂'ε̂ is the sum of squared residuals from estimating with
  the constraint β̂a = β̂b imposed and σ² is the common variance
  of the errors.
• When the errors are identically and independently normally
  distributed, S has a χ²(k) distribution under H0.

G023. I
Detecting Structural Change

• In practice an estimate of σ² is used - for example there is the
  statistic

      S* = (ε̂'ε̂ − (ε̂b'ε̂b + ε̂a'ε̂a)) / ((ε̂b'ε̂b + ε̂a'ε̂a)/n)

  where n is the total number of observations in the two periods
  combined. S* has approximately a χ²(k) distribution under H0.
• This application of the theory of tests of linear hypotheses is
given the name, “Chow test”, after Gregory Chow who popu-
larised the procedure some 30 years ago.
• The test can be modified in various ways.
– We might wish to keep some of the elements of β constant
across regimes.
– In microeconometrics the same procedure can be employed
to test for differences across groups of households, firms
etc.

G023. I
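A hedged sketch of the computation of S*: the helper below (chow_statistic is my own name, not from the notes) fits the pooled and the two sub-period regressions, forms the sums of squared residuals and returns the statistic. The data are simulated with a deliberate slope change; nothing here comes from the temperature series used on the next slide.

import numpy as np

def chow_statistic(y, X, break_index):
    """S* = (RSS_pooled - (RSS_b + RSS_a)) / ((RSS_b + RSS_a)/n), following the
    slide; approximately chi-squared with k degrees of freedom under H0."""
    def rss(yy, XX):
        b = np.linalg.solve(XX.T @ XX, XX.T @ yy)
        e = yy - XX @ b
        return e @ e

    n = len(y)
    rss_pooled = rss(y, X)
    rss_b = rss(y[:break_index], X[:break_index])
    rss_a = rss(y[break_index:], X[break_index:])
    return (rss_pooled - (rss_b + rss_a)) / ((rss_b + rss_a) / n)

# Illustration on simulated data with a slope change half way through the sample.
rng = np.random.default_rng(8)
n = 300
t = np.arange(n, dtype=float)
X = np.column_stack([np.ones(n), t])
slope = np.where(t < n / 2, 0.01, 0.03)
y = 1.0 + slope * t + rng.normal(scale=0.5, size=n)

print(chow_statistic(y, X, n // 2))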
Example: Structural Break in Temperature

[Figure: temperature and fitted values plotted against year, 1700-2000.]

• Temperature as a function of time:

• We test for a break in 1900:

                                Coef.      Std Err       Coef.       Std Err
  Time (Years)                  0.0015     0.00039      -0.00054     0.00069
  Time after 1900 (Years)         -           -          0.0061      0.0022

• We can test whether the slope after 1900 is different from the
  general one:

      (0.0061 + 0.00054)/0.0022 = 3.03,    Prob = 0.0077

• Or conduct a Chow test: S = 14.62. We come to the same


conclusion. There is a break in the trend.

G023. I
Estimation in non-linear regression models

• An obvious extension to the linear regression model studied so
  far is the non-linear regression model:

      E[Y|X = x] = g(x, θ)

  equivalently, in regression function plus error form:

      Y = g(x, θ) + ε,    E[ε|X = x] = 0.

  Consider M-estimation and in particular the non-linear least
  squares estimator obtained as follows:

      θ̂ = argmin_{θ*} n⁻¹ Σi (Yi − g(xi; θ*))²

• For now we just consider how a minimising value θ̂ can be


found. Many of the statistical software packages have a routine
to conduct non-linear optimisation and some have a non-linear
least squares routine. Many of these routines employ a variant
of Newton’s method.

G023. I
Numerical optimisation: Newton’s method and variants

• Write the minimisation problem as:

      θ̂ = argmin_{θ*} Q(θ*).

Newton’s method involves taking a sequence of steps, θ0 , θ1 , . . . ,


θm , . . . θM from a starting value, θ0 to an approximate minimis-
ing value θM which we will use as our estimator θ̂.
• The starting value is provided by the user. One of the tricks
is to use a good starting value near to the final solution. This
sometimes requires some thought.
• Suppose we are at θm . Newton’s method considers a quadratic
approximation to Q(θ) which is constructed to be an accurate
approximation in a neighbourhood of θm , and moves to the
value θm+1 which minimises this quadratic approximation.
• At θm+1 a new quadratic approximation, accurate in a neigh-
bourhood of θm+1 is constructed and the next value in the
sequence, θm+2 , is chosen as the value of θ minimising this new
approximation.
• Steps are taken until a convergence criterion is satisfied. Usu-
ally this involves a number of elements. For example one might
continue until the following conditions is satisfied:

      Qθ(θm)'Qθ(θm) ≤ δ1,    |Q(θm) − Q(θm−1)| < δ2.

Convergence criteria vary from package to package. Some care


is required in choosing these criteria. Clearly δ1 and δ2 above
should be chosen bearing in mind the orders of magnitude of
the objective function and its derivative.

G023. I
Numerical optimisation: Newton’s method and variants

• The quadratic approximation used at each stage is a quadratic


Taylor series approximation. At θ = θm,

      Q(θ) ≈ Q(θm) + (θ − θm)'Qθ(θm) + (1/2)(θ − θm)'Qθθ'(θm)(θ − θm) = Qa(θ, θm).

The derivative of Qa(θ, θm) with respect to θ is

      Qaθ(θ, θm) = Qθ(θm) + Qθθ'(θm)(θ − θm)

and θm+1 is chosen as the value of θ that solves Qaθ(θ, θm) = 0,
namely
      θm+1 = θm − Qθθ'(θm)⁻¹Qθ(θm).

There are a number of points to consider here.

1. Obviously the procedure can only work when the objective


function is twice differentiable with respect to θ.
2. The procedure will stop whenever Qθ (θm ) = 0, which can
occur at a maximum and saddlepoint as well as at a min-
imum. The Hessian, Qθθ0 (θm ), should be positive definite
at a minimum of the function.
3. When a minimum is found there is no guarantee that it
is a global minimum. In problems where this possibility
arises it is normal to run the optimisation from a variety
of start points to guard against using an estimator that
corresponds to a local minimum.
4. If, at a point in the sequence, Qθθ0 (θm ) is not positive def-
inite then the algorithm may move away from the mini-
mum and there may be no convergence. Many minimisa-
tion (maximisation) problems we deal with involve globally
convex (concave) objective functions and for these there is
no problem. For other cases, Newton’s method is usually
modified, e.g. by taking steps

θm+1 = θm − A(θm )−1 Qθ (θm )


where A(θm ) is constructed to be positive definite and in
cases in which Qθθ0 (θm ) is in fact positive definite, to be a
good approximation to Qθθ0 (θm ).
5. The algorithm may “overstep” the minimum to the ex-
tent that it takes an “uphill” step, i.e. so that Q(θm+1 ) >
Q(θm ). This is guarded against in many implementations
of Newton’s method by taking steps

θm+1 = θm − α(θm )A(θm )−1 Qθ (θm )

where α(θm ) is a scalar step scaling factor, chosen to ensure


that Q(θm+1 ) < Q(θm ).
6. In practice it may be difficult to calculate exact expressions
for the derivatives that appear in Newton’s method. In
some cases symbolic computational methods can help. In
others we can use a numerical approximation, e.g.

      Qθi(θm) ≈ (Qθ(θm + δi ei) − Qθ(θm)) / δi
where ei is a vector with a one in position i and zeros else-
where, and δi is a small perturbing value, possibly varying
across the elements of θ.

G023. I
Numerical Optimisation: Example

• Function: y = sin(x/10) · x².
• This function has infinitely many local minima.
• Start the nonlinear optimisation at various points.
[Figure: the function sin(x/10)·x² over x ∈ [0, 300], with optimisation runs started at x = 0.5, 50 and 150.]

G023. I
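To make the example concrete, here is a small Python sketch (my own illustration, not from the notes) of a plain Newton iteration applied to Q(x) = sin(x/10)·x² from the three starting values in the figure. It uses hand-coded first and second derivatives and none of the safeguards discussed above, so it simply stops at a stationary point, which may be a minimum, a maximum or an inflection point depending on where it starts.

import numpy as np

def Q(x):
    return np.sin(x / 10.0) * x ** 2

def Q1(x):   # first derivative of Q
    return np.cos(x / 10.0) * x ** 2 / 10.0 + 2.0 * x * np.sin(x / 10.0)

def Q2(x):   # second derivative of Q
    return (-np.sin(x / 10.0) * x ** 2 / 100.0
            + 0.4 * x * np.cos(x / 10.0)
            + 2.0 * np.sin(x / 10.0))

def newton(x0, tol=1e-8, max_iter=100):
    """Plain Newton iteration x_{m+1} = x_m - Q''(x_m)^{-1} Q'(x_m).
    No positive-definiteness check or step scaling, so it stops at a
    stationary point, not necessarily a minimum."""
    x = float(x0)
    for _ in range(max_iter):
        step = Q1(x) / Q2(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# The end point depends strongly on the starting value, because the
# function has infinitely many local minima (and maxima).
for x0 in [0.5, 50.0, 150.0]:
    x_star = newton(x0)
    print(x0, x_star, Q(x_star))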
Approximate Inference

• The results set out in the previous notes let us make inferences
about coefficients of regression functions, β, when the distrib-
ution of y given X is Gaussian (normal) and the variance of
the unobservable disturbances is known.
• In practice the normal distribution at best holds approximately
and we never know the value of the nuisance parameter σ. So
how can we proceed?
• The most common approach and the one outlined here involves
employing approximations to exact distributions.
• They have the disadvantage that they can be inaccurate and
the magnitude of the inaccuracy can vary substantially from
case to case. They have the following advantages:

1. They are usually very much easier to derive than exact


distributions,
2. They are often valid for a wide family of distributions for
y while exact distributional results are only valid for a
specific distribution of y, and we rarely know which distri-
bution to use to produce an exact distributional result.

• The most common sort of approximation employed in econo-


metrics is what is known as a large sample approximation.
• The main, and increasingly popular, alternative to the use of
approximations is to use the bootstrap. This is based on in-
tensive computer replications.

G023. II
Approximate Inference

• Suppose we have a statistic, Sn , computed using n realisations,


for example the OLS estimator, β̂, or the variance estimator
σ̂ 2 , or one of the test statistics developed earlier.
• To produce a large sample approximation to the distribution
of the statistic, Sn , we regard this statistic as a member of a
sequence of statistics, S1 , . . . , Sn , . . . , indexed by n, the number
of realisations. We write this sequence as {Sn }∞ n=1 .

• Denote the distribution function of Sn by P [Sn ≤ s] = FSn (s).


We then consider how the distribution function FSn (s) behaves
as we pass through the sequence, that is as n takes larger and
larger values.
• In particular we ask what properties the distribution function
has as n tends to infinity. The distribution associated with the
limit of the sequence of statistics is sometimes referred to as a
limiting distribution.
• Sometimes this distribution can be used to produce an approx-
imation to FSn (s) which can be used to conduct approximate
inference using Sn .

G023. II
Convergence in probability

• In many cases of interest the distributions of a sequence of sta-


tistics becomes concentrated on a single point, say c, as we pass
through the sequence, increasing n. That is, FSn (s) becomes
closer and closer to a step function as n is increased, a step
function which is zero up to c, and at c, jumps to 1. In this
case we say that Sn converges in probability to the constant c.
• A sequence of (possibly vector valued) statistics converges in
  probability to a constant (possibly a vector), c, if, for all ε > 0,

      lim (n→∞) P[||Sn − c|| > ε] = 0,

  that is, if for every ε, δ > 0, there exists N (which typically
  depends upon ε and δ), such that for all n > N

      P[||Sn − c|| > ε] < δ.

• Here the notation || · || is used to denote the Euclidean length
  of a vector, that is: ||z|| = (z'z)^{1/2}. This is the absolute value
  of z when z is a scalar.

• We then write plim (n→∞) Sn = c, or Sn →p c, and c is referred
  to as the probability limit of Sn.

G023. II
Convergence in Probability

• When Sn = θ̂n is an estimator of a parameter, θ, which takes
  the value θ0, and θ̂n →p θ0, we say that θ̂n is a consistent
  estimator.

• If every member of the sequences {E[θ̂n]} and {Var[θ̂n]},
  n = 1, 2, . . . , exists, and

      lim (n→∞) E[θ̂n] = θ
      lim (n→∞) Var[θ̂n] = 0

  then we say that θ̂n converges in mean square to θ. It is quite
  easily shown that convergence in mean square implies conver-
  gence in probability. It is often easy to derive expected values
  and variances of statistics. So a quick route to proving consis-
  tency is to prove convergence in mean square.

G023. II
Convergence in Probability

• Note, though, that an estimator can be consistent but not con-


verge in mean square. There are commonly occurring cases in
econometrics where estimators are consistent but the sequences
of moments required for consideration of convergence in mean
square do not exist. (For example, the two stage least squares
estimator in just identified linear models, i.e. the indirect least
squares estimator).
• Consistency is generally regarded as a desirable property for
an estimator to possess.
• Note though that in all practical applications of econometric
methods we have a finite sized sample at our disposal. The
consistency property on its own does not tell us about the
quality of the estimate that we calculate using such a sample.
It might be better sometimes to use an inconsistent estima-
tor that generally takes values close to the unknown θ than a
consistent estimator that is very inaccurate except at a much
larger sample size than we have available.
• The consistency property does tell us that with a large enough
sample our estimate would likely be close to the unknown truth,
but not how close, nor even how large a sample is required to
get an estimate close to the unknown truth.

G023. II
Convergence in distribution

• A sequence of statistics {Sn} that converges in probability
  to a constant has a variance (if one exists) which becomes small
  as we pass to larger values of n.
• If we multiply Sn by a function of n, chosen so that the variance
of the transformed statistic remains approximately constant as
we pass to larger values of n, then we may obtain a sequence
of statistics which converge not to a constant but to a random
variable.
• If we can work out what the distribution of this random vari-
able is, then we can use this distribution to approximate the
distributions of the transformed statistics in the sequence.
• Consider a sequence of random variables {Tn}. Denote the
  distribution function of Tn by

      P[Tn ≤ t] = FTn(t).

  Let T be a random variable with distribution function

      P[T ≤ t] = FT(t).

  We say that {Tn} converges in distribution to T if for all
  ε > 0 there exists N (which will generally depend upon ε) such
  that for all n > N,

      |FTn(t) − FT(t)| < ε

  at all points t at which FT(t) is continuous. Then we write

      Tn →d T.
• The definition applies for vector and scalar random variables.
In this situation we will also talk in terms of Tn converging in
probability to (the random variable) T .

G023. II
Convergence in Distribution

• Now return to the sequence {Sn} that converges in proba-
  bility to a constant.
• Let Tn = h(n)(Sn) with h(·) > 0 chosen so that {Tn} con-
  verges in distribution to a random variable T that has a non-
  degenerate distribution.
• A common case that will arise is that in which h(n) = n^α. In
  this course we will only encounter the special case in which
  α = 1/2, that is, h(n) = n^{1/2}.
• We can use the limiting random variable T to make approxi-
  mate probability statements as follows. Since Sn = Tn/h(n),

      P[Sn ≤ s] = P[Tn/h(n) ≤ s]
                = P[Tn ≤ s × h(n)]
                ≈ P[T ≤ s × h(n)]
                = FT(s × h(n))

which allows approximate probability statements concerning


the random variable Sn .

G023. II
Convergence in Distribution: Example

• Consider the mean, X̄n of n independently and identically dis-


tributed random variables with common mean and variance
respectively µ and σ 2 .
• One of the simplest Central Limit Theorems (see below) says
  that, if Tn = n^{1/2}(X̄n − µ)/σ, then Tn →d T ∼ N(0, 1).
• We can use this result to say that Tn ≈ N(0, 1), where "≈"
  here means "is approximately distributed as". This sort of
  result can be used to make approximate probability statements.
  Since T has a standard normal distribution,

      P[−1.96 ≤ T ≤ 1.96] = 0.95

  and so, approximately,

      P[−1.96 ≤ n^{1/2}(X̄n − µ)/σ ≤ 1.96] ≈ 0.95

  leading, if σ² were known, to the approximate 95% confidence
  interval for µ,

      {X̄n − 1.96 σ/n^{1/2} , X̄n + 1.96 σ/n^{1/2}},

  approximate in the sense that

      P[X̄n − 1.96 σ/n^{1/2} ≤ µ ≤ X̄n + 1.96 σ/n^{1/2}] ≈ 0.95

G023. II
Approximate Inference: Some Thoughts

• It is very important to realise that in making this approxima-


tion there is no sense in which we ever think of the sample size
actually becoming large.
• The sequence {Sn }∞n=1 indexed by the sample size is just a
hypothetical construct in the context of which we can develop
an approximation to the distribution of a statistic.
• For example we know that when y given X is normally distrib-
uted the OLS estimator is exactly normally distributed con-
ditional on X. For non-normal y, under some conditions, as
we will see, the limiting distribution of an appropriately scaled
OLS estimator is normal. The quality of that normal approxi-
mation depends upon the sample size, but also upon the extent
of the departure of the distribution of y given X from normal-
ity and upon the disposition of the values of the covariates. For
y close to normality the normal approximation to the distrib-
ution of the OLS estimator is good even at very small sample
sizes.
• The extent to which, at the value of n that we have, the de-
viations |FTn (t) − FT (t)| are large or small can be studied by
Monte Carlo simulation or by considering higher order approx-
imations.

G023. II
Functions of statistics - Slutsky’s Theorem

• Slutsky’s Theorem states that if Tn is a sequence of random


variables that converges in probability to a constant c, and g(·)
is a continuous function, then g(Tn ) converges in probability to
g(c).
• Tn can be a vector or matrix of random variables in which case
c is a vector or matrix of constants. Sometimes c is called the
probability limit of Tn .
• A similar result holds for convergence to a random variable,
namely that if Tn is a sequence of random variables that con-
verges in probability to a random variable T , and g(·) is a con-
tinuous function, then g(Tn ) converges in probability to g(T ).
• For example, if

      Tn' = [Tn1'  Tn2']

  and
      Tn →d T = [T1'  T2']'

  then
      Tn1 + Tn2 →d T1 + T2

G023. II
Limit theorems

• The Lindberg-Levy Central Limit Theorem gives the limiting
  distribution of a mean of identically distributed random vari-
  ables. The Theorem states that if {Yi} are mutually inde-
  pendent random (vector) variables, each with expected value µ
  and positive definite covariance matrix Ω, then if Ȳn = n⁻¹ Σi Yi,

      n^{1/2}(Ȳn − µ) →d Z,    Z ∼ N(0, Ω).

• Many of the statistics we encounter in econometrics can be ex-
  pressed as means of non-identically distributed random vectors,
  whose limiting distribution is the subject of the Lindberg-Feller
  Central Limit Theorem. The Theorem states that if {Yi} are
  independently distributed random variables with E[Yi] = µi,
  Var[Yi] = Ωi, with finite third moments, and

      Ȳn = (1/n) Σi Yi,    µ̄n = (1/n) Σi µi

      lim (n→∞) (1/n) Σi µi = µ,    lim (n→∞) (1/n) Σi Ωi = Ω,

  where Ω is finite and positive definite, and for each j

      lim (n→∞) (Σi Ωi)⁻¹ Ωj = 0,     (3)

  then
      n^{1/2}(Ȳn − µ̄n) →d Z,    Z ∼ N(0, Ω).

G023. II
Limit Theorem: Example

• Start with n uniform random variables {Yi}, i = 1, . . . , n, over [0, 1].
• Denote by Ȳn the mean of the Yi based on a sample of size n.
• The graph plots the distribution of n^{1/2}(Ȳn − 0.5):

[Figure: simulated distribution of n^{1/2}(Ȳn − 0.5) for n = 1, 10 and 100.]

G023. II
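The figure can be reproduced by simulation; the sketch below (illustrative, with an arbitrary number of replications) draws means of Uniform[0,1] samples and checks that the mean and variance of n^{1/2}(Ȳn − 0.5) approach those of the N(0, 1/12) limit.

import numpy as np

rng = np.random.default_rng(9)
n_rep = 100_000

# Distribution of n^(1/2) * (Ybar_n - 0.5) for means of Uniform[0,1] draws.
for n in [1, 10, 100]:
    ybar = rng.uniform(0.0, 1.0, size=(n_rep, n)).mean(axis=1)
    t = np.sqrt(n) * (ybar - 0.5)
    # The variance of a Uniform[0,1] variable is 1/12, so the limit is N(0, 1/12).
    print(n, t.mean(), t.var(), 1 / 12)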
Approximate Distribution of the OLS Estimator

• Consider the OLS estimator Sn = β̂n = (Xn'Xn)⁻¹Xn'yn, where
  we index by n to indicate that a sample of size n is involved.
  We know that when

      yn = Xnβ + εn,    E[εn|Xn] = 0,    Var[εn|Xn] = σ²In

  then
      E[β̂n|Xn] = β
  and

      Var[β̂n|Xn] = σ²(Xn'Xn)⁻¹ = n⁻¹σ²(n⁻¹Xn'Xn)⁻¹ = n⁻¹σ²(n⁻¹ Σi xixi')⁻¹.

• Consistency: if the xi's were independently sampled from
  some distribution such that

      n⁻¹ Σi xixi' = n⁻¹X'X →p Σxx

  and if this matrix of expected squares and cross-products is
  non-singular, then

      lim (n→∞) Var[β̂n|Xn] = 0.

  In this case β̂n converges in mean square to β (recall that
  E[β̂n|Xn] = β), so β̂n →p β and the OLS estimator is consistent.

G023. II
OLS Estimator: Limiting distribution

• To make large sample approximate inference using the OLS
  estimator, consider the centred statistic

      Sn = β̂n − β

  and the associated scaled statistic

      Tn = n^{1/2} Sn
         = n^{1/2}(β̂n − β)
         = (n⁻¹Xn'Xn)⁻¹ n^{−1/2} Xn'εn.

  Assuming (n⁻¹Xn'Xn)⁻¹ →p Σxx⁻¹, consider the term

      n^{−1/2} Xn'εn = n^{−1/2} Σi xiεi.

  Let Ri = xiεi and note that

      E[Ri] = 0,    Var[Ri] = σ²xixi'.

  Under suitable conditions on the vectors xi, the Ri's satisfy the
  conditions of the Lindberg-Feller Central Limit Theorem and
  we have

      n^{−1/2} Σi Ri = n^{−1/2} Xn'εn →d N(0, σ²Σxx).

  Finally, by Slutsky's Theorem,

      Tn = n^{1/2}(β̂n − β) →d N(0, σ²Σxx⁻¹).

• We use this approximation to say that

      n^{1/2}(β̂n − β) ≈ N(0, σ²Σxx⁻¹).

G023. II
OLS Estimator: Limiting distribution

• In practice σ² and Σxx are unknown and we replace them by
  estimates, e.g. σ̂n² and n⁻¹Xn'Xn.
• If these are consistent estimates then we can use Slutsky's The-
  orem to obtain the limiting distributions of the resulting sta-
  tistics.
• Example: testing the hypothesis H0: Rβ = r, we have already
  considered the statistic

      Sn = (Rβ̂n − r)'(R(Xn'Xn)⁻¹R')⁻¹(Rβ̂n − r)/σ²

  where the subscript "n" is now appended to indicate the sample
  size under consideration. In the normal linear model, Sn ∼ χ²(j).
• When y given X is non-normally distributed the limiting dis-
  tribution result given above can be used, as follows. Rewrite
  Sn as

      Sn = (n^{1/2}(Rβ̂n − r))'(R(n⁻¹Xn'Xn)⁻¹R')⁻¹(n^{1/2}(Rβ̂n − r))/σ².

  Let Pn be such that

      Pn(R(n⁻¹Xn'Xn)⁻¹R')Pn' = Ij

  and consider the sequence of random variables

      Tn = (n^{1/2}/σ) Pn(Rβ̂n − r).

  Then Tn →d N(0, Ij) as long as Pn →p P, where P(RΣxx⁻¹R')P' = Ij.
  Application of the results on limiting distributions of functions
  of random variables gives

      Tn'Tn →d χ²(j).

G023. II
OLS Estimator: Limiting Distribution

• Now

      Tn'Tn = (n/σ²)(Rβ̂n − r)'(R(n⁻¹Xn'Xn)⁻¹R')⁻¹(Rβ̂n − r)

  where we have used

      Pn'Pn = (R(n⁻¹Xn'Xn)⁻¹R')⁻¹.

  Cancelling the terms involving n:

      Tn'Tn = Sn →d χ²(j).

• Finally, if σ̂n² is a consistent estimator of σ², then it can replace
  σ² in the formula for Sn and the approximate χ²(j) distribution
  still applies, that is:

      (n^{1/2}(Rβ̂n − r))'(R(n⁻¹Xn'Xn)⁻¹R')⁻¹(n^{1/2}(Rβ̂n − r))/σ̂n² →d χ²(j).

• The other results we developed earlier for the normal linear
  model with "known" σ² also work as approximations when the
  normality restriction does not hold and when σ² is replaced
  by a consistent estimator.

G023. II
Approximate Distribution of the GLS Estimator

• Consider the following linear model:

      y = Xβ + ε,    E[ε|X] = 0,    Var[ε|X] = Ω

• The GLS estimator β̃ = (X'Ω⁻¹X)⁻¹X'Ω⁻¹y is BLU, and
  when y given X is normally distributed:

      β̃ ∼ N(β, (X'Ω⁻¹X)⁻¹).

• When y given X is non-normally distributed we can proceed
  as above, working in the context of a transformed model in
  which transformed y given X has an identity covariance ma-
  trix, giving, under suitable conditions, β̃ →p β and the limiting
  distribution:

      n^{1/2}(β̃ − β) →d N(0, (n⁻¹X'Ω⁻¹X)⁻¹).

• We noted that in practice Ω is unknown and suggested using
  a feasible GLS estimator, β̃ = (X'Ω̂⁻¹X)⁻¹X'Ω̂⁻¹y, in which
  Ω̂ is some estimate of the conditional variance of y given X.
• Suppose Ω̂ is a consistent estimator of Ω. Then it can be
  shown that β̃ is a consistent estimator of β and under suitable
  conditions

      n^{1/2}(β̃ − β) →d N(0, (n⁻¹X'Ω⁻¹X)⁻¹).

G023. II
Approximate Distribution of the GLS Estimator

• When Ω̂ is a consistent estimator the limiting distribution of


the feasible GLS estimator is the same as the limiting distrib-
ution of the estimator that employs Ω.
• The exact distributions differ in a finite sized sample to an
extent that depends upon the accuracy of the estimator of Ω
in that finite sized sample.
• When the elements of Ω are functions of a finite number of pa-
rameters it may be possible to produce a consistent estimator,
Ω̂.
• Example: consider a heteroscedastic model in which Ω is diag-
  onal with diagonal elements

      ωii = f(xi, γ).

  A first step OLS estimation produces residuals, ε̂i, and

      E[ε̂i²|X] = (MΩM)ii = Mi'ΩMi = ωiiMii

  where Mi' is the i-th row of M and Mii is the (i, i) element of
  M. This simplification follows from the diagonality of Ω and
  the idempotency of M. We can therefore write

      ε̂i²/Mii = f(xi, γ) + ui

  where E[ui|X] = 0, and under suitable conditions a nonlinear
  least squares estimation will produce a consistent estimator of
  γ, leading to a consistent estimator of Ω.

G023. II
Approximate Distribution of M-Estimators

• It is difficult to develop exact distributions for these estimators,


except under very special circumstances (e.g. for the OLS es-
timator with normally distributed y given X)
• Consider an M-estimator defined as

      θ̂n = argmax_θ U(Zn, θ)

  where θ is a vector of parameters and Zn is a vector random
  variable. In the applications we will consider, Zn contains n
  random variables representing outcomes observed in a sample
  of size n. We wish to obtain the limiting distribution of θ̂n.

• The first step is to show that θ̂n →p θ0, the true value of θ. This
  is done by placing conditions on U and on the distribution of
  Zn which ensure that:

  1. for θ in a neighbourhood of θ0, U(Zn, θ) →p U*(θ);
  2. the sequence of values (indexed by n) of θ that maximise
     U(Zn, θ) converges in probability to the value of θ that
     maximises U*(θ);
  3. the value of θ that uniquely maximises U*(θ) is θ0, the
     unknown parameter value (identification).

G023. II
Approximate Distribution of M-Estimators

´ ³
1/2
• To obtain the limiting distribution of n θ̂n − θ0 , consider
situations in which the M-estimator can be defined as the
unique solution to first order conditions


      Uθ(Zn, θ̂n) = 0   where   Uθ(Zn, θ̂n) = (∂/∂θ)U(Zn, θ)|_{θ=θ̂n}
This is certainly the case when U (Zn , θ) is concave.
• We first consider a Taylor series expansion of U (Zn , θ) regarded
as a function of θ around θ = θ0 , as follows:

Uθ (Zn , θ) = Uθ (Zn , θ0 ) + Uθθ (Zn , θ0 ) (θ − θ0 ) + R(θ, θ0 , Zn )

Evaluating this at θ = θ̂n gives:


      0 = Uθ(Zn, θ̂n) = Uθ(Zn, θ0) + Uθθ(Zn, θ0)(θ̂n − θ0) + R(θ̂n, θ0, Zn)

where
      Uθθ(Zn, θ) = (∂²/∂θ∂θ')U(Zn, θ).
The remainder term, R(θ̂n , θ0 , Zn ), involves the third deriva-
tives of U (Zn , θ) and in many situations converges in probabil-
ity to zero as n becomes large. This allows us to write:
      Uθ(Zn, θ0) + Uθθ(Zn, θ0)(θ̂n − θ0) ≈ 0

and then
      θ̂n − θ0 ≈ −Uθθ(Zn, θ0)^{-1} Uθ(Zn, θ0).

Equivalently:
      n^{1/2}(θ̂n − θ0) ≈ −(n^{-1}Uθθ(Zn, θ0))^{-1} n^{-1/2}Uθ(Zn, θ0).

G023. II
Approximate Distribution of M-Estimator

• In the situations we will encounter it is possible to find condi-


tions under which
      n^{-1}Uθθ(Zn, θ0) →p A(θ0),   n^{-1/2}Uθ(Zn, θ0) →d N(0, B(θ0)),
for some matrices A(θ0) and B(θ0), concluding that
      n^{1/2}(θ̂n − θ0) →d N(0, A(θ0)^{-1}B(θ0)A(θ0)^{-1}').

• Example: OLS estimator


      θ̂n = arg max_θ { −Σ_{i=1}^n (Yi − x_i'θ)² }

when Yi = x_i'θ0 + εi and the εi's are independently distributed
with expected value zero and common variance σ0².

      U(Zn, θ) = −Σ_{i=1}^n (Yi − x_i'θ)²
      n^{-1/2}Uθ(Zn, θ) = 2n^{-1/2} Σ_{i=1}^n (Yi − x_i'θ) x_i
      n^{-1}Uθθ(Zn, θ) = −2n^{-1} Σ_{i=1}^n x_i x_i'

and, defining Σ_XX ≡ plim_{n→∞} n^{-1} Σ_{i=1}^n x_i x_i':

      A(θ0) = −2Σ_XX
which does not depend upon θ0 in this special case,
      B(θ0) = 4σ0²Σ_XX
      A(θ0)^{-1}B(θ0)A(θ0)^{-1}' = σ0²Σ_XX^{-1}
and finally the OLS estimator has the following limiting normal
distribution:
      n^{1/2}(θ̂n − θ0) →d N(0, σ0²Σ_XX^{-1}).
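
• A small numerical check of this sandwich formula on simulated data
  (the simulation design is illustrative only); it computes Â, B̂ and the
  implied approximate standard errors for the OLS estimator:

import numpy as np

rng = np.random.default_rng(0)
n, sigma0 = 5000, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta0 = np.array([1.0, 2.0, -0.5])
y = X @ theta0 + sigma0 * rng.normal(size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)           # OLS / M-estimator
sigma2_hat = np.mean((y - X @ theta_hat) ** 2)
Sxx = X.T @ X / n                                       # estimate of Sigma_XX
A = -2.0 * Sxx                                          # plim of n^{-1} U_theta_theta
B = 4.0 * sigma2_hat * Sxx                              # variance of n^{-1/2} U_theta
V = np.linalg.inv(A) @ B @ np.linalg.inv(A).T           # = sigma^2 Sigma_XX^{-1}
se = np.sqrt(np.diag(V) / n)                            # approximate std. errors of theta_hat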

G023. II
Approximate distributions of functions of estimators
the “delta method”

• We proceed in a more general context in which we are interested


in a scalar function of a vector of parameters, h(θ), and suppose
that we have a consistent estimator θ̂ of θ whose approximate
distribution is given by
      n^{1/2}(θ̂ − θ0) →d N(0, Ω)

where θ0 is the data generating value of θ.


• What is the approximate distribution of h(θ̂)?
Consider a Taylor series expansion of h(θ) around θ = θ0 as
follows
      h(θ) = h(θ0) + (θ − θ0)'hθ(θ0) + ½(θ − θ0)'hθθ(θ*)(θ − θ0)
where hθ (θ0 ) is the vector of derivatives of h(θ) evaluated at θ =
θ0 , hθθ (θ∗ ) is the matrix of second derivatives of h(θ) evaluated
at θ = θ∗ , a value between θ and θ0 . Evaluate this at θ = θ̂
and rearrange to give
      n^{1/2}(h(θ̂) − h(θ0)) = n^{1/2}(θ̂ − θ0)'hθ(θ0) + ½ n^{1/2}(θ̂ − θ0)'hθθ(θ̂*)(θ̂ − θ0)
where θ̂∗ lies between θ̂ and θ0 . Since θ̂ is consistent, θ̂∗ must
converge to θ0 and if hθθ (θ0 ) is bounded then the second term
above disappears¹ as n → ∞. So, we have
      n^{1/2}(h(θ̂) − h(θ0)) →d hθ(θ0)'Z
where
      n^{1/2}(θ̂ − θ0) →d Z ∼ N(0, Ω).
Using our result on linear functions of normal random variables
      n^{1/2}(h(θ̂) − h(θ0)) →d N(0, hθ(θ0)'Ωhθ(θ0)).

G023. II

¹ n^{1/2}(θ̂ − θ0) →d N(0, Ω) and (θ̂ − θ0) →p 0.
Delta Method: Example

• Suppose we have θ = [θ1, θ2]' and that we are interested in
  h(θ) = θ2/θ1, leading to

      hθ(θ) = [ −θ2/θ1² ,  1/θ1 ]'.

Write the approximate variance of n^{1/2}(θ̂ − θ0) as

      Ω = [ ω11  ω12
            ω12  ω22 ].
Then the approximate variance of n^{1/2}(h(θ̂) − h(θ0)) is

      hθ(θ)'Ωhθ(θ) = (θ2²/θ1⁴)ω11 − 2(θ2/θ1³)ω12 + (1/θ1²)ω22
                   = (1/θ1²)( (θ2/θ1)²ω11 − 2(θ2/θ1)ω12 + ω22 )

in which θ1 and θ2 are here taken to indicate the data generating


values. Clearly if θ1 is very close to zero then this will be large.
Note that if θ1 were actually zero then the development above
would not go through because the condition on hθθ (θ0 ) being
bounded would be violated.
The method we have used here is sometimes called the “delta
method”.
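
• A minimal sketch of this computation for the ratio h(θ) = θ2/θ1 just
  derived; the numerical inputs are purely illustrative:

import numpy as np

def delta_method_ratio(theta_hat, Omega, n):
    t1, t2 = theta_hat
    grad = np.array([-t2 / t1 ** 2, 1.0 / t1])   # h_theta evaluated at theta_hat
    avar = grad @ Omega @ grad                   # variance of n^{1/2}(h(theta_hat) - h(theta_0))
    return t2 / t1, np.sqrt(avar / n)            # point estimate and approximate std. error

est, se = delta_method_ratio(np.array([2.0, 1.0]),
                             np.array([[0.5, 0.1],
                                       [0.1, 0.3]]), n=400)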

G023. II
Maximum Likelihood Methods

• Some of the models used in econometrics specify the complete


probability distribution of the outcomes of interest rather than
just a regression function.
• Sometimes this is because of special features of the outcomes
under study - for example because they are discrete or censored,
or because there is serial dependence of a complex form.
• When the complete probability distribution of outcomes given
covariates is specified we can develop an expression for the
probability of observation of the responses we see as a function
of the unknown parameters embedded in the specification.
• We can then ask what values of these parameters maximise
this probability for the data we have. The resulting statistics,
functions of the observed data, are called maximum likelihood
estimators. They possess important optimality properties and
have the advantage that they can be produced in a rule directed
fashion.

G023. III
Estimating a Probability

• Suppose Y1 , . . . Yn are binary independently and identically dis-


tributed random variables with P [Yi = 1] = p, P [Yi = 0] = 1−p
for all i.
• We might use such a model for data recording the occurrence
or otherwise of an event for n individuals, for example being
in work or not, buying a good or service or not, etc.
• Let y1 , . . . , yn indicate the data values obtained and note that
in this model
      P[Y1 = y1 ∩ · · · ∩ Yn = yn, p] = Π_{i=1}^n p^{yi} (1 − p)^{(1−yi)}
                                      = p^{Σ_{i=1}^n yi} (1 − p)^{Σ_{i=1}^n (1−yi)}
                                      = L(p; y).

With any set of data L(p; y) can be calculated for any value of
p between 0 and 1. The result is the probability of observing
the data to hand for each chosen value of p.
• One strategy for estimating p is to use that value that max-
imises this probability. The resulting estimator is called the
maximum likelihood estimator (MLE) and the maximand, L(p; y),
is called the likelihood function.

G023. III
Log Likelihood Function

• The maximum of the log likelihood function, l(p; y) = log L(p, y),
is at the same value of p as is the maximum of the likelihood
function (because the log function is monotonic).
• It is often easier to maximise the log likelihood function (LLF).
For the problem considered here the LLF is
à n ! n
X X
l(p; y) = yi log p + (1 − yi ) log(1 − p).
i=1 i=1

Let
      p̂ = arg max_p L(p; y) = arg max_p l(p; y).

On differentiating we have the following.


      lp(p; y) = (1/p) Σ_{i=1}^n yi − (1/(1 − p)) Σ_{i=1}^n (1 − yi)
      lpp(p; y) = −(1/p²) Σ_{i=1}^n yi − (1/(1 − p)²) Σ_{i=1}^n (1 − yi).

Note that lpp(p; y) is always negative for admissible p, so the
optimisation problem has a unique solution corresponding to a
maximum. The solution to lp(p̂; y) = 0 is

      p̂ = (1/n) Σ_{i=1}^n yi,

just the mean of the observed values of the binary indicators,


equivalently the proportion of 1’s observed in the data.

G023. III
Likelihood Functions and Estimation in General

• Let Yi , i = 1, . . . , n be continuously distributed random vari-


ables with joint probability density function f (y1 , . . . , yn , θ).
• The probability that Y falls in infinitesimal intervals of width
dy1 , . . . dyn centred on values y1 , . . . , yn is

A = f (y1 , . . . , yn , θ)dy1 dy2 . . . dyn

Here only the joint density function depends upon θ and the
value of θ that maximises f (y1 , . . . , yn , θ) also maximises A.
• In this case the likelihood function is defined to be the joint
density function of the Yi ’s.
• When the Yi ’s are discrete random variables the likelihood func-
tion is the joint probability mass function of the Yi ’s, and in
cases in which there are discrete and continuous elements the
likelihood function is a combination of probability density ele-
ments and probability mass elements.
• In all cases the likelihood function is a function of the observed
data values that is equal to, or proportional to, the probability
of observing these particular values, where the constant of pro-
portionality does not depend upon the parameters which are
to be estimated.

G023. III
Likelihood Functions and Estimation in General

• When Yi , i = 1, . . . , n are independently distributed the joint


density (mass) function is the product of the marginal density
(mass) functions of each Yi , the likelihood function is
      L(y; θ) = Π_{i=1}^n fi(yi; θ),

and the log likelihood function is the sum:


      l(y; θ) = Σ_{i=1}^n log fi(yi; θ).

There is a subscript i on f to allow for the possibility that each


Yi has a distinct probability distribution.
• This situation arises when modelling conditional distributions
of Y given some covariates x. In particular, fi (yi ; θ) = fi (yi |xi ; θ),
and often fi (yi |xi ; θ) = f (yi |xi ; θ).
• In time series and panel data problems there is often depen-
dence among the Yi ’s. For any list of random variables Y =
{Y1 , . . . , Yn } define the i − 1 element list Yi− = {Y1 , . . . , Yi−1 }.
The joint density (mass) function of Y can be written as
      f(y) = ( Π_{i=2}^n f_{yi|yi−}(yi|yi−) ) f_{y1}(y1),

G023. III
Invariance

• Note that (parameter free) monotonic transformations of the


Yi ’s (for example, a change of units of measurement, or use of
logs rather than the original y data) usually leads to a change
in the value of the maximised likelihood function when we work
with continuous distributions.
• If we transform from y to z where y = h(z) and the joint
density function of y is fy (y; θ) then the joint density function
of z is

      fz(z; θ) = |∂h(z)/∂z| fy(h(z); θ).
• For any given set of values, y ∗ , the value of θ that maximises
the likelihood function fy (y ∗ , θ) also maximises the likelihood
function fz (z ∗ ; θ) where y ∗ = h(z ∗ ), so the maximum likelihood
estimator is invariant with respect to such changes in the way
the data are presented.
• However the maximised likelihood functions will differ by a
  factor equal to |∂h(z)/∂z| evaluated at z = z*.

• The reason for this is that we omit the infinitesimals dy1 , . . . dyn
from the likelihood function for continuous variates and these
change when we move from y to z because they are denomi-
nated in the units in which y or z are measured.

G023. III
Maximum Likelihood: Properties

• Maximum likelihood estimators possess another important in-


variance property. Suppose two researchers choose different
ways in which to parameterise the same model. One uses θ,
and the other uses λ = h(θ) where this function is one-to-one.
Then faced with the same data and producing estimators θ̂ and
λ̂, it will always be the case that λ̂ = h(θ̂).
• There are a number of important consequences of this:
– For instance, if we are interested in the ratio of two para-
meters, the MLE of the ratio will be the ratio of the ML
estimators.
– Sometimes a re-parameterisation can improve the numeri-
cal properties of the likelihood function. Newton’s method
and its variants may in practice work better if parameters
are rescaled.

G023. III
Maximum Likelihood: Improving Numerical Properties

• An example of this often arises when, in index models, elements


of x involve squares, cubes, etc., of some covariate, say x1 .
Then maximisation of the likelihood function may be easier
if instead of x1², x1³, etc., you use x1²/10, x1³/100, etc., with
consequent rescaling of the coefficients on these covariates. You
can always recover the MLEs you would have obtained without
the rescaling by rescaling the estimates.
• There are some cases in which a re-parameterisation can pro-
duce a globally concave likelihood function where in the origi-
nal parameterisation there was not global concavity.
• An example of this arises in the “Tobit” model.
– This is a model in which each Yi is N (x0i β, σ 2 ) with negative
realisations replaced by zeros. The model is sometimes
used to model expenditures and hours worked, which are
necessarily non-negative.
– In this model the likelihood as parameterised here is not
globally concave, but re-parameterising to λ = β/σ, and
γ = 1/σ, produces a globally concave likelihood function.
– The invariance property tells us that having maximised the
“easy” likelihood function and obtained estimates λ̂ and γ̂,
we can recover the maximum likelihood estimates we might
have had difficulty finding in the original parameterisation
by calculating β̂ = λ̂/γ̂ and σ̂ = 1/γ̂.

G023. III
Properties Of Maximum Likelihood Estimators

• First we just sketch the main results:

– Let l(θ; Y ) be the log likelihood function now regarded as


a random variable, a function of a set of (possibly vector)
random variables Y = {Y1 , . . . , Yn }.
– Let lθ (θ; Y ) be the gradient of this function, itself a vector
of random variables (scalar if θ is scalar) and let lθθ (θ; Y )
be the matrix of second derivatives of this function (also a
scalar if θ is a scalar).
      – Let
            θ̂ = arg max_θ l(θ; Y).

In order to make inferences about θ using θ̂ we need to


determine the distribution of θ̂. We consider developing a
large sample approximation. The limiting distribution for
a quite wide class of maximum likelihood problems is as
follows:

      n^{1/2}(θ̂ − θ) →d N(0, V0)
where
      V0 = −plim_{n→∞}(n^{-1}lθθ(θ0; Y))^{-1}
and θ0 is the unknown parameter value. To get an ap-
proximate distribution that can be used in practice we use
(n−1 lθθ (θ̂; Y ))−1 or some other consistent estimator of V0
in place of V0 .

G023. III
Properties Of Maximum Likelihood Estimators

• We apply the method for dealing with M-estimators.


• Suppose θ̂ is uniquely determined as the solution to the first
order condition
lθ (θ̂; Y ) = 0
and that θ̂ is a consistent estimator of the unknown value of
the parameter, θ0 . Weak conditions required for consistency
are quite complicated and will not be given here.
• Taking a Taylor series expansion around θ = θ0 and then eval-
uating this at θ = θ̂ gives

0 ' lθ (θ0 ; Y ) + lθθ0 (θ0 ; Y )(θ̂ − θ0 )

and rearranging and scaling by powers of the sample size n


      n^{1/2}(θ̂ − θ0) ≈ −(n^{-1}lθθ'(θ0; Y))^{-1} n^{-1/2}lθ(θ0; Y).

As in our general treatment of M-estimators if we can show


that
      n^{-1}lθθ'(θ0; Y) →p A(θ0)
and
      n^{-1/2}lθ(θ0; Y) →d N(0, B(θ0))
then

      n^{1/2}(θ̂ − θ0) →d N(0, A(θ0)^{-1}B(θ0)A(θ0)^{-1}').

G023. III
Maximum Likelihood: Limiting Distribution

• What is the limiting distribution of n−1/2 lθ (θ0 ; Y )?


• First note that in problems for which the Yi ’s are indepen-
dently distributed, n−1/2 lθ (θ0 ; Y ) is a scaled mean of random
variables and we may be able to find conditions under which
a central limit theorem applies, indicating a limiting normal
distribution.
• We must now find the mean and variance of this distribution.
Since L(θ; Y ) is a joint probability density function (we just
consider the continuous distribution case here),
      ∫ L(θ; y)dy = 1

where multiple integration is over the support of Y . If this


support does not depend upon θ, then
      (∂/∂θ) ∫ L(θ; y)dy = ∫ Lθ(θ; y)dy = 0.
But, because l(θ; y) = log L(θ; y), and lθ (θ; y) = Lθ (θ; y)/L(θ; y),
we have
      ∫ Lθ(θ; y)dy = ∫ lθ(θ; y)L(θ; y)dy = E[lθ(θ; Y)]

and so E [lθ (θ; Y )] = 0.


• This holds for any value of θ, in particular for θ0 above. If the
variance of lθ (θ0 ; Y ) converges to zero as n becomes large then
lθ (θ0 ; Y ) will converge in probability to zero and the mean of
the limiting distribution of n−1/2 lθ (θ0 ; Y ) will be zero.

G023. III
Maximum Likelihood: Limiting Distribution

• We turn now to the variance of the limiting distribution. We


have just shown that
      ∫ lθ(θ; y)L(θ; y)dy = 0.

Differentiating again
      (∂/∂θ') ∫ lθ(θ; y)L(θ; y)dy = ∫ ( lθθ'(θ; y)L(θ; y) + lθ(θ; y)Lθ'(θ; y) ) dy
                                  = ∫ ( lθθ'(θ; y) + lθ(θ; y)lθ(θ; y)' ) L(θ; y)dy
                                  = E[ lθθ'(θ; Y) + lθ(θ; Y)lθ(θ; Y)' ]
                                  = 0.

Separating the two terms in the penultimate line,

E [lθ (θ; Y )lθ (θ; Y )0 ] = −E [lθθ0 (θ; Y )] (4)

and note that, since E [lθ (θ; Y )] = 0,

V ar[lθ (θ; Y )] = E [lθ (θ; Y )lθ (θ; Y )0 ]

and so

      V ar[lθ(θ; Y)] = −E[lθθ'(θ; Y)]
      ⇒ V ar[n^{-1/2}lθ(θ; Y)] = −E[n^{-1}lθθ'(θ; Y)]

giving
      B(θ0) = −plim_{n→∞} n^{-1}lθθ'(θ0; Y).
The matrix
I(θ) = −E [lθθ (θ; Y )]
plays a central role in likelihood theory - it is called the Infor-
mation Matrix .
Finally, because B(θ0) = −A(θ0),
      A(θ0)^{-1}B(θ0)A(θ0)^{-1}' = ( −plim_{n→∞} n^{-1}lθθ'(θ0; Y) )^{-1}.
• Of course a number of conditions are required to hold for the
results above to hold. These include the boundedness of third
order derivatives of the log likelihood function, independence or
at most weak dependence of the Yi ’s, existence of moments of
derivatives of the log likelihood, or at least of probability limits
of suitably scaled versions of them, and lack of dependence of
the support of the Yi ’s on θ.
• The result in equation (4) above leads, under suitable condi-
tions concerning convergence, to
      plim_{n→∞}( n^{-1}lθ(θ; Y)lθ(θ; Y)' ) = −plim_{n→∞}( n^{-1}lθθ'(θ; Y) ).
This gives an alternative way of “estimating” V0, namely
      V̂0^o = { n^{-1} lθ(θ̂; Y)lθ(θ̂; Y)' }^{-1}
which compared with
      Ṽ0^o = { −n^{-1} lθθ'(θ̂; Y) }^{-1}
has the advantage that only first derivatives of the log like-
lihood function need to be calculated. Sometimes V̂0o is re-
ferred to as the “outer product of gradient” (OPG) estimator.
Both these estimators use the “observed” values of functions
of derivatives of the LLF. It may be possible to derive
explicit expressions for the expected values of these functions.
Then one can estimate V0 by
      V̂0^e = { E[n^{-1}lθ(θ; Y)lθ(θ; Y)']|_{θ=θ̂} }^{-1}
           = { −E[n^{-1}lθθ'(θ; Y)]|_{θ=θ̂} }^{-1}.
These two sorts of estimators are sometimes referred to as “ob-
served information” (V̂0o , Ṽ0o ) and “expected information” (V̂0e )
estimators.
• Maximum likelihood estimators possess an optimality property,
namely that, among the class of consistent and asymptotically
normally distributed estimators, the variance matrix of their
limiting distribution is the smallest that can be achieved in the
sense that other estimators in the class have limiting distribu-
tions with variance matrices exceeding the MLE’s by a positive
semidefinite matrix.
G023. III
Estimating a Conditional Probability

• Suppose Y1 , . . . Yn are binary independently and identically dis-


tributed random variables with

P [Yi = 1|X = xi ] = p(xi , θ)


P [Yi = 0|X = xi ] = 1 − p(xi , θ).

This is an obvious extension of the model in the previous sec-


tion.
• The likelihood function for this problem is
      P[Y1 = y1 ∩ · · · ∩ Yn = yn|x] = Π_{i=1}^n p(xi, θ)^{yi} (1 − p(xi, θ))^{(1−yi)}
                                     = L(θ; y).

where y denotes the complete set of values of yi and dependence


on x is suppressed in the notation. The log likelihood function
is
      l(θ; y) = Σ_{i=1}^n yi log p(xi, θ) + Σ_{i=1}^n (1 − yi) log(1 − p(xi, θ))

and the maximum likelihood estimator of θ is

      θ̂ = arg max_θ l(θ; y).

So far this is an obvious generalisation of the simple problem


met in the last section.

G023. III
Estimating a Conditional Probability

• To implement the model we choose a form for the function


p(x, θ), which must of course lie between zero and one.

– One common choice is


            p(x, θ) = exp(x'θ) / (1 + exp(x'θ))
which produces what is commonly called a logit model .
– Another common choice is
            p(x, θ) = Φ(x'θ) = ∫_{−∞}^{x'θ} φ(w)dw,   φ(w) = (2π)^{-1/2} exp(−w²/2)

in which Φ is the standard normal distribution function.


This produces what is known as a probit model .

• Both models are widely used. Note that in both cases a single
index model is specified, the probability functions are monotonic
increasing, probabilities arbitrarily close to zero or one are ob-
tained when x0 θ is sufficiently large or small, and there is a
symmetry in both of the models in the sense that p(−x, θ) =
1 − p(x, θ).
• Any or all of these properties might be inappropriate in a par-
ticular application but there is rarely discussion of this in the
applied econometrics literature.

G023. III
More on Logit and Probit

• Both models can also be written as a linear model involving a


latent variable.
• We define a latent variable Yi∗ , which is unobserved, but
determined by the following model:

Yi∗ = Xi θ + εi

We observe the variable Yi which is linked to Yi∗ as:



      Yi = 0   if Yi* < 0
      Yi = 1   if Yi* ≥ 0

• The probability of observing Yi = 1 is:

pi = P (Yi = 1) = P (Yi∗ ≥ 0)
= P (Xi θ + εi ≥ 0)
= P (εi ≥ −Xi θ)
= 1 − Fε (−Xi θ)

where Fε is the cumulative distribution function of the random


variable ε.
• If εi is distributed normally, the model is the probit model.
• If εi follows an extreme value distribution, the model is the
logit model.

G023. III
Shape of Logit and Probit Models

G023. III
Odds-Ratio

• Define the ratio pi /(1−pi ) as the odds-ratio. This is the ratio


of the probability of outcome 1 over the probability of outcome
0. If this ratio is equal to 1, then both outcomes have equal
probability (pi = 0.5). If this ratio is equal to 2, say, then
outcome 1 is twice as likely than outcome 0 (pi = 2/3).
• In the logit model, the log odds-ratio is linear in the parame-
ters:
      ln( pi / (1 − pi) ) = Xi θ
• In the logit model, θ is the marginal effect of X on the log
  odds-ratio. A unit increase in X raises the log odds-ratio by θ,
  i.e. multiplies the odds-ratio by exp(θ) (approximately a 100·θ %
  increase when θ is small).

G023. III
Marginal Effects

• Logit model:
      ∂pi/∂X = [ θ exp(Xiθ)(1 + exp(Xiθ)) − θ exp(Xiθ)² ] / (1 + exp(Xiθ))²
             = θ exp(Xiθ) / (1 + exp(Xiθ))²
             = θ pi(1 − pi)

A one unit increase in X leads to an increase in the probability


of choosing option 1 of θpi (1 − pi ).

• Probit model:
      ∂pi/∂Xi = θ φ(Xiθ)
A one unit increase in X leads to an increase in the probability
of choosing option 1 of θφ(Xi θ).
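
• A short sketch of these two marginal-effect formulas, evaluated at
  user-supplied covariate values x and coefficients θ (illustrative inputs):

import numpy as np
from scipy.stats import norm

def logit_marginal_effects(x, theta):
    p = np.exp(x @ theta) / (1.0 + np.exp(x @ theta))
    return theta * p * (1.0 - p)                 # dp/dx = theta * p * (1 - p)

def probit_marginal_effects(x, theta):
    return theta * norm.pdf(x @ theta)           # dp/dx = theta * phi(x'theta)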

G023. III
Maximum Likelihood in Single Index Models

• We can cover both cases by considering general single index


models, so for the moment rewrite p(x, θ) as g(w) where w =
x0 θ.
• The log-likelihood is then:
      l(θ, y) = Σ_{i=1}^n yi log(g(wi)) + Σ_{i=1}^n (1 − yi) log(1 − g(wi))

• The first derivative of the log likelihood function is:

      lθ(θ; y) = Σ_{i=1}^n [ yi gw(xi'θ)xi / g(xi'θ) − (1 − yi) gw(xi'θ)xi / (1 − g(xi'θ)) ]
               = Σ_{i=1}^n (yi − g(xi'θ)) gw(xi'θ) / [ g(xi'θ)(1 − g(xi'θ)) ] xi

Here gw (w) is the derivative of g(w) with respect to w.


• The expression for the second derivative is rather messy. Here
we just note that its expected value given x is quite simple,
namely
      E[lθθ(θ; y)|x] = −Σ_{i=1}^n gw(xi'θ)² / [ g(xi'θ)(1 − g(xi'θ)) ] xi xi',

the negative of which is the Information Matrix for general


single index binary data models.

G023. III
Asymptotic Properties of the Logit Model

• For the logit model there is major simplification


      g(w) = exp(w) / (1 + exp(w))
      gw(w) = exp(w) / (1 + exp(w))²
      ⇒ gw(w) / [ g(w)(1 − g(w)) ] = 1.
Therefore in the logit model the MLE satisfies
      Σ_{i=1}^n ( yi − exp(xi'θ̂)/(1 + exp(xi'θ̂)) ) xi = 0,

the Information Matrix is


      I(θ) = Σ_{i=1}^n exp(xi'θ) / (1 + exp(xi'θ))² xi xi',
the MLE has the limiting distribution
      n^{1/2}(θ̂n − θ0) →d N(0, V0)
      V0 = plim_{n→∞} ( n^{-1} Σ_{i=1}^n exp(xi'θ) / (1 + exp(xi'θ))² xi xi' )^{-1},
and we can conduct approximate inference using the following
approximation
      n^{1/2}(θ̂n − θ0) ≈ N(0, V0)
using the estimator
      V̂0 = ( n^{-1} Σ_{i=1}^n exp(xi'θ̂) / (1 + exp(xi'θ̂))² xi xi' )^{-1}

when producing approximate hypothesis tests and confidence


intervals.
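
• A minimal Newton-Raphson sketch of the logit MLE based on the score
  and Information Matrix above; X is the n × k covariate matrix and y a
  0/1 vector (no convergence checks are included in this sketch):

import numpy as np

def logit_mle(y, X, iters=25):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))          # exp(x'theta)/(1 + exp(x'theta))
        score = X.T @ (y - p)                          # sum_i (y_i - p_i) x_i
        info = (X * (p * (1.0 - p))[:, None]).T @ X    # sum_i p_i(1 - p_i) x_i x_i'
        theta = theta + np.linalg.solve(info, score)   # Newton step
    se = np.sqrt(np.diag(np.linalg.inv(info)))         # approximate standard errors
    return theta, se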
G023. III
Asymptotic Properties of the Probit Model

• In the probit model

      g(w) = Φ(w)
      gw(w) = φ(w)
      ⇒ gw(w) / [ g(w)(1 − g(w)) ] = φ(w) / [ Φ(w)(1 − Φ(w)) ].
Therefore in the probit model the MLE satisfies
      Σ_{i=1}^n ( yi − Φ(xi'θ̂) ) φ(xi'θ̂) / [ Φ(xi'θ̂)(1 − Φ(xi'θ̂)) ] xi = 0,

the Information Matrix is


      I(θ) = Σ_{i=1}^n φ(xi'θ)² / [ Φ(xi'θ)(1 − Φ(xi'θ)) ] xi xi',

the MLE has the limiting distribution


      n^{1/2}(θ̂n − θ0) →d N(0, V0)
      V0 = plim_{n→∞} ( n^{-1} Σ_{i=1}^n φ(xi'θ)² / [ Φ(xi'θ)(1 − Φ(xi'θ)) ] xi xi' )^{-1},

and we can conduct approximate inference using the following


approximation

      n^{1/2}(θ̂n − θ0) ≈ N(0, V0)

using the estimator

      V̂0 = ( n^{-1} Σ_{i=1}^n φ(xi'θ̂)² / [ Φ(xi'θ̂)(1 − Φ(xi'θ̂)) ] xi xi' )^{-1}
i=1 Φ(xi θ̂)(1 −

when producing approximate tests and confidence intervals.

G023. III
Example: Logit and Probit

• We have data from households in Kuala Lumpur (Malaysia)


describing household characteristics and their concern about
the environment. The question is
”Are you concerned about the environment? Yes / No”.
We also observe their age, sex (coded as 1 men, 0 women), in-
come and quality of the neighborhood measured as air quality.
The latter is coded with a dummy variable smell, equal to 1 if
there is a bad smell in the neighborhood. The model is:

Concerni = β0 +β1 agei +β2 sexi +β3 log incomei +β4 smelli +ui

• We estimate this model with three specifications, linear prob-


ability model (LPM), logit and probit:

Probability of being concerned by Environment


Variable LPM Logit Probit
Est. t-stat Est. t-stat Est. t-stat
age .0074536 3.9 .0321385 3.77 .0198273 3.84
sex .0149649 0.3 .06458 0.31 .0395197 0.31
log income .1120876 3.7 .480128 3.63 .2994516 3.69
smell .1302265 2.5 .5564473 2.48 .3492112 2.52
constant -.683376 -2.6 -5.072543 -4.37 -3.157095 -4.46
Some Marginal Effects
Age .0074536 .0077372 .0082191
log income .1120876 .110528 .1185926
smell .1302265 .1338664 .1429596

G023. III
Multinomial Logit

• The logit model was dealing with two qualitative outcomes.


This can be generalized to multiple outcomes:
– choice of transportation: car, bus, train...
– choice of dwelling: house, apartment, social housing.
• The multinomial logit: Denote the outcomes as j = 1, . . . , J
and pj the probability of outcome j.

      pj = exp(Xθ^j) / Σ_{k=1}^J exp(Xθ^k)

where θ^j is a vector of parameters associated with outcome j.

G023. III
Identification

• If we multiply all the coefficients by a factor λ this does not


change the probabilities pj , as the factor cancel out. This
means that there is under identification. We have to normalize
the coefficients of one outcome, say, J to zero. All the results
are interpreted as deviations from the baseline choice.
• We write the probability of choosing outcome j = 1, . . . , J − 1
as:
      pj = exp(Xθ^j) / ( 1 + Σ_{k=1}^{J−1} exp(Xθ^k) )

• We can express the log odds-ratio as:

      ln( pj / pJ ) = Xθ^j

• The odds-ratio of choice j versus J is only expressed as a


function of the parameters of choice j, but not of those other
choices: Independence of Irrelevant Alternatives (IIA).

G023. III
Independence of Irrelevant Alternatives

An anecdote which illustrates a violation of this property has


been attributed to Sidney Morgenbesser:

After finishing dinner, Sidney Morgenbesser decides to order


dessert. The waitress tells him he has two choices: apple pie and
blueberry pie. Sidney orders the apple pie.

After a few minutes the waitress returns and says that they also
have cherry pie at which point Morgenbesser says ”In that case I’ll
have the blueberry pie.”

G023. III
Independence of Irrelevant Alternatives

• Consider travelling choices, by car or with a red bus. Assume


for simplicity that the choice probabilities are equal:
      P(car) = P(red bus) = 0.5   ⟹   P(car) / P(red bus) = 1

• Suppose we introduce a blue bus, (almost) identical to the red


bus. The probability that individuals will choose the blue bus
is therefore the same as for the red bus and the odds ratio is:

      P(blue bus) = P(red bus)   ⟹   P(blue bus) / P(red bus) = 1

• However, the IIA implies that odds ratios are the same whether
  or not another alternative exists. The only probabilities for
which the three odds ratios are equal to one are:

P (car) = P (blue bus) = P (red bus) = 1/3

However, the prediction we ought to obtain is:

P (red bus) = P (blue bus) = 1/4 P (car) = 0.5

G023. III
Marginal Effects: Multinomial Logit

• θj can be interpreted as the marginal effect of X on the log


odds-ratio of choice j to the baseline choice.
• The marginal effect of X on the probability of choosing out-
come j can be expressed as:
      ∂pj/∂X = pj [ θ^j − Σ_{k=1}^J pk θ^k ]

Hence, the marginal effect on choice j involves not only the


coefficients relative to j but also the coefficients relative to the
other choices.
• Note that we can have θj < 0 and ∂pj /∂X > 0 or vice versa.
Due to the non linearity of the model, the sign of the coefficients
does not indicate the direction nor the magnitude of the effect
of a variable on the probability of choosing a given outcome.
One has to compute the marginal effects.
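
• A sketch of these probabilities and marginal effects for a single
  observation x, with θ stored as a k × (J − 1) matrix and the baseline
  outcome J normalised to zero (all inputs illustrative):

import numpy as np

def mnl_probs(x, theta):                          # theta: k x (J-1), baseline J set to 0
    expu = np.exp(x @ theta)                      # exp(x'theta_j), j = 1, ..., J-1
    return np.append(expu, 1.0) / (1.0 + expu.sum())

def mnl_marginal_effects(x, theta):
    p = mnl_probs(x, theta)
    full = np.column_stack([theta, np.zeros(theta.shape[0])])  # include theta_J = 0
    pbar = full @ p                               # sum_k p_k theta_k
    return p * (full - pbar[:, None])             # column j holds dp_j/dx = p_j(theta_j - pbar)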

G023. III
Example

• We analyze here the choice of dwelling: house, apartment or


low cost flat, the latter being the baseline choice. We include as
explanatory variables the age, sex and log income of the head
of household:

Variable Estimate Std. Err. Marginal Effect


Choice of House
age .0118092 .0103547 -0.002
sex -.3057774 .2493981 -0.007
log income 1.382504 .1794587 0.18
constant -10.17516 1.498192
Choice of Apartment
age .0682479 .0151806 0.005
sex -.89881 .399947 -0.05
log income 1.618621 .2857743 0.05
constant -15.90391 2.483205

G023. III
Ordered Models

• In the multinomial logit, the choices were not ordered. For


instance, we cannot rank cars, busses or train in a meaningful
way. In some instances, we have a natural ordering of the out-
comes even if we cannot express them as a continuous variable:
– Yes / Somehow / No.
– Low / Medium / High
• We can analyze these answers with ordered models.

G023. III
Ordered Probit

• We code the answers by arbitrary assigning values:

Yi = 0 if No, Yi = 1 if Somehow, Yi = 2 if Yes

• We define a latent variable Yi∗ which is linked to the explana-


tory variables:
Yi∗ = Xi0 θ + εi
Yi = 0 if Yi∗ < 0
Yi = 1 if Yi∗ ∈ [0, µ[
Yi = 2 if Yi∗ ≥ µ
µ is a threshold and an auxiliary parameter which is estimated
along with θ.
• We assume that εi is distributed normally.
• The probability of each outcome is derived from the normal
cdf:
      P(Yi = 0) = Φ(−Xi'θ)
      P(Yi = 1) = Φ(µ − Xi'θ) − Φ(−Xi'θ)
      P(Yi = 2) = 1 − Φ(µ − Xi'θ)
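
• A minimal sketch of these three probabilities for given x, θ and
  threshold µ (illustrative inputs):

import numpy as np
from scipy.stats import norm

def ordered_probit_probs(x, theta, mu):
    xb = x @ theta
    p0 = norm.cdf(-xb)                            # P(Y = 0)
    p1 = norm.cdf(mu - xb) - norm.cdf(-xb)        # P(Y = 1)
    p2 = 1.0 - norm.cdf(mu - xb)                  # P(Y = 2)
    return np.array([p0, p1, p2])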

G023. III
Ordered Probit

• Marginal Effects:
      ∂P(Yi = 0)/∂Xi = −θ φ(−Xi'θ)
      ∂P(Yi = 1)/∂Xi = θ ( φ(Xi'θ) − φ(µ − Xi'θ) )
      ∂P(Yi = 2)/∂Xi = θ φ(µ − Xi'θ)

• Note that if θ > 0, ∂P (Yi = 0)/∂Xi < 0 and ∂P (Yi = 2)/∂Xi >
0:
– If Xi has a positive effect on the latent variable, then by
increasing Xi , fewer individuals will stay in category 0.
– Similarly, more individuals will be in category 2.
– In the intermediate category, the fraction of individual will
either increase or decrease, depending on the relative size
of the inflow from category 0 and the outflow to category 2.

G023. III
Ordered Probit: Example

• We want to investigate the determinants of health.


• Individuals are asked to report their health status in three cat-
egories: poor, fair or good.
• We estimate an ordered probit and calculate the marginal ef-
fects at the mean of the sample.

Variable Coeff sd. err. Marginal Effects Sample


Poor Fair Good Mean
Age 18-30 -1.09** .031 -.051** -.196** .248** .25
Age 30-50 -.523** .031 -.031** -.109** .141** .32
Age 50-70 -.217** .026 -.013** -.046** .060** .24
Male -.130** .018 -.008** -.028** .037** .48
Income low third .428** .027 .038** .098** -.136** .33
Income medium third .264** .022 .020** .059** -.080** .33
Education low .40** .028 .031** .091** -.122** .43
Education Medium .257** .026 .018** .057** -.076** .37
Year of interview -.028 .018 -.001 -.006 .008 1.9
Household size -.098** .008 -.006** -.021** .028** 2.5
Alcohol consumed .043** .041 .002** .009** -.012** .04
Current smoker .160** .018 .011** .035** -.046** .49
cut1 .3992** .058
cut2 1.477** .059

Age group Proportion


Poor Health Fair Health Good Health
Age 18-30 .01 .08 .90
Age 30-50 .03 .13 .83
Age 50-70 .07 .28 .64
Age 70 + .15 .37 .46

G023. III
Ordered Probit: Example

• Marginal Effects differ by individual characteristics.


• Below, we compare the marginal effects from an ordered probit
and a multinomial logit.

Marginal Effects for Good Health


Variable Ordered X Ordered Multinomial
Probit at mean Probit at X Logit at X
Age 18-30 .248** 1 .375** .403**
Age 30-50 .141** 0 .093** .077**
Age 50-70 .060** 0 .046** .035**
Male .037** 1 .033** .031**
Income low third -.136** 1 -.080** -.066**
Income medium third -.080** 0 -.071** -.067**
Education low -.122** 1 -.077** -.067**
Education Medium -.076** 0 -.069** -.064**
Year of interview .008 1 .006 .003
Household size .028** 2 .023** .020**
Alcohol consumed -.012** 0 -.010** -.011**
Current smoker -.046** 0 -.041** -.038**

G023. III
Tobit Model

• First proposed by Tobin (1958).²
• We define a latent (unobserved) variable Y* such that:

      Y* = Xβ + ε,   ε ∼ N(0, σ²)

• We only observe a variable Y which is related to Y* as:

      Y = Y*   if Y* > a
      Y = a    if Y* ≤ a
[Figure: observed Y and latent Y* plotted against X]

G023. III
² Tobin, J. (1958), Estimation of Relationships for Limited Dependent Variables, Econometrica 26, 24-36.
Truncation Bias

• The conditional mean of Y given X takes the form:


      E[Y|Y* > a, X] = Xβ + σ φ(α) / (1 − Φ(α))

with α = (a − Xβ)/σ. The ratio φ(α)/(1 − Φ(α)) is called the inverse
Mills ratio.
• Therefore, if you regress only the Ys which are above a on
the corresponding Xs then, due to the latter term, the OLS
parameters estimate of β will be biased and inconsistent.
• Proof: Note that the conditional c.d.f of Y ∗ |Y ∗ > a is:
      H(y|Y* > a, X) = P(Y* ≤ y|Y* > a) = P(a < Y* ≤ y) / P(Y* > a)
                     = P(a − Xβ < ε ≤ y − Xβ) / P(ε > a − Xβ)
                     = [ Φ((y − Xβ)/σ) − Φ((a − Xβ)/σ) ] / [ 1 − Φ((a − Xβ)/σ) ]

so that the conditional distribution is:

      h(y|Y* > a, X) = ∂H(y|Y* > a, X)/∂y = φ((y − Xβ)/σ) / [ σ(1 − Φ((a − Xβ)/σ)) ]

      E[Y|Y* > a, X] = ∫_a^{∞} y h(y|Y* > a, X) dy
                     = 1/[σ(1 − Φ(α))] ∫_a^{∞} y φ((y − Xβ)/σ) dy
                     = 1/(1 − Φ(α)) ∫_{(a−Xβ)/σ}^{∞} (Xβ + σz) φ(z) dz
                     = Xβ − σ/(1 − Φ(α)) ∫_{(a−Xβ)/σ}^{∞} φ'(z) dz
                     = Xβ + σ φ(α)/(1 − Φ(α))
G023. III
Tobit Model: Marginal Effects

• How do we interpret the coefficient β?


      β = ∂Y*/∂X
This is the marginal effect of X on the (latent) variable Y ∗ .
• Note that

E[Y |X] = Xβ (1 − Φ(α)) + σφ(α)

Therefore, if you treat the censored values of Y as regular de-


pendent variable values in a linear regression model the OLS
parameters estimate of β will be biased and inconsistent as
well.

G023. III
Likelihood for Tobit Model

• The conditional c.d.f of Y given X is:

      G(y|X, β, σ) = P(Y ≤ y|X)
                   = P(Y ≤ y|X, Y > a)P(Y > a|X) + P(Y ≤ y|X, Y = a)P(Y = a|X)
                   = I(y > a)H(y|Y > a, X)(1 − Φ((a − Xβ)/σ)) + I(y = a)Φ((a − Xβ)/σ)

where I(.) is the indicator function: I(true) = 1, I(false) = 0.


• The corresponding conditional density is:
      g(y|X, β, σ) = I(y > a)h(y|Y > a, X)(1 − Φ((a − Xβ)/σ)) + I(y = a)Φ((a − Xβ)/σ)

• The log-likelihood function of the Tobit model is:

      l(β, σ) = Σ_{i=1}^n log( g(Yi|Xi, β, σ) )
              = Σ_{i=1}^n I(yi > a) log( h(Yi|Yi > a, Xi) )
                + Σ_{i=1}^n I(yi > a) log( 1 − Φ((a − Xiβ)/σ) )
                + Σ_{i=1}^n I(yi = a) log( Φ((a − Xiβ)/σ) )
              = Σ_{i=1}^n I(yi > a) ( −½(Yi − Xiβ)²/σ² − log(σ) − log(√(2π)) )
                + Σ_{i=1}^n I(yi = a) log( Φ((a − Xiβ)/σ) )

• This can be maximised with respect to β, σ or γ = 1/σ and


λ = β/σ.
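
• A sketch of this log-likelihood written as a function that a numerical
  optimiser can minimise; it parameterises the scale as log σ to keep
  σ positive (an implementation choice, not something imposed by the notes):

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def tobit_negloglik(params, y, X, a=0.0):
    beta, sigma = params[:-1], np.exp(params[-1])     # last element is log(sigma)
    xb = X @ beta
    ll = np.where(y > a,
                  norm.logpdf((y - xb) / sigma) - np.log(sigma),  # uncensored: density term
                  norm.logcdf((a - xb) / sigma))                  # censored: mass at a
    return -ll.sum()

# fit = minimize(tobit_negloglik, np.zeros(X.shape[1] + 1), args=(y, X), method="BFGS")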

G023. III
Example: WTP

• The WTP is censored at zero. We can compare the two regres-


sions:
OLS: W T Pi = β0 + β1 lny + β2 agei + β3 smelli + ui

Tobit: W T Pi∗ = β0 + β1 lny + β2 agei + β3 smelli + ui


W T Pi = W T Pi∗ if W T Pi∗ > 0
W T Pi = 0 if W T Pi∗ < 0

OLS Tobit
Variable Estimate t-stat Estimate t-stat Marginal effect
lny 2.515 2.74 2.701 2.5 2.64
age -.1155 -2.00 -.20651 -3.0 -0.19
sex .4084 0.28 .14084 0.0 .137
smell -1.427 -0.90 -1.8006 -0.9 -1.76
constant -4.006 -0.50 -3.6817 -0.4

G023. III
Models for Count Data

• The methods developed above are useful when we want to


model the occurrence or otherwise of an event. Sometimes
we want to model the number of times an event occurs. In
general it might be any nonnegative integer. Count data are
being used increasingly in econometrics.
• An interesting application is to the modelling of the returns to
R&D investment in which data on numbers of patents filed in a
series of years by a sample of companies is studied and related
to data on R&D investments.
• Binomial and Poisson probability models provide common start-
ing points in the development of count data models.
• If Z1 , . . . , Zm are identically and independently distributed bi-
nary random variables with P [Zi = 1] = p, P [Zi = 0] = 1 − p,
then the sum of the Zi ’s has a Binomial distribution,
      Y = Σ_{i=1}^m Zi ∼ Bi(m, p)

and

      P[Y = j] = m!/(j!(m − j)!) p^j (1 − p)^{m−j},   j ∈ {0, 1, 2, . . . , m}

G023. III
Models for Count Data

• As m becomes large, m1/2 (m−1 Y − p) becomes approximately


normally distributed, N (0, p(1 − p)), and as m becomes large
while mp = λ remains constant, Y comes to have a Poisson
distribution,
Y ∼ P o(λ)
and
      P[Y = j] = (λ^j / j!) exp(−λ),   j ∈ {0, 1, 2, . . . }.

• In each case letting p or λ be functions of covariates creates


a model for the conditional distribution of a count of events
given covariate values.
• The Poisson model is much more widely used, in part because
there is no need to specify or estimate the parameter m.
• In the application to R&D investment one might imagine that
a firm seeds a large number of research projects in a period
of time, each of which has only a small probability of produc-
ing a patent. This is consonant with the Poisson probability
model but note that one might be concerned about the under-
lying assumption of independence across projects built into the
Poisson model.

G023. III
Models for Count Data

• The estimation of the model proceeds by maximum likelihood.


The Poisson model is used as an example. Suppose that we
specify a single index model:
      P[Yi = yi|xi] = ( λ(xi'θ)^{yi} / yi! ) exp(−λ(xi'θ)),   yi ∈ {0, 1, 2, . . . }.

• The log likelihood function is


      l(θ, y) = Σ_{i=1}^n ( yi log λ(xi'θ) − λ(xi'θ) − log yi! )

with first derivative


      lθ(θ, y) = Σ_{i=1}^n ( yi λw(xi'θ)/λ(xi'θ) − λw(xi'θ) ) xi
               = Σ_{i=1}^n ( yi − λ(xi'θ) ) λw(xi'θ)/λ(xi'θ) xi

where λw (w) is the derivative of λ(w) with respect to w.


• The MLE satisfies
      Σ_{i=1}^n ( yi − λ(xi'θ̂) ) λw(xi'θ̂)/λ(xi'θ̂) xi = 0.

G023. III
Models for Count Data

• The second derivative matrix is


      lθθ(θ, y) = Σ_{i=1}^n (yi − λ(xi'θ)) ( λww(xi'θ)/λ(xi'θ) − (λw(xi'θ)/λ(xi'θ))² ) xi xi'
                  − Σ_{i=1}^n λw(xi'θ)²/λ(xi'θ) xi xi'

where, note, the first term has expected value zero. Therefore
the Information Matrix for this conditional Poisson model is
      I(θ) = Σ_{i=1}^n λw(xi'θ)²/λ(xi'θ) xi xi'.

The limiting distribution of the MLE is (under suitable condi-


tions)

      n^{1/2}(θ̂ − θ0) →d N(0, V0)
      V0 = plim_{n→∞} ( n^{-1} Σ_{i=1}^n λw(xi'θ)²/λ(xi'θ) xi xi' )^{-1}

and we can make approximate inference about θ0 using


      (θ̂ − θ0) ≈ N(0, n^{-1}V0)

with V0 estimated by
à n
!−1
X λw (x0 θ̂)2
i
V̂0 = n−1 xi x0i .
0
λ(xi θ̂)
i=1

• In applied work a common choice is λ(w) = exp(w) for which


      λw(w)/λ(w) = 1,   λw(w)²/λ(w) = exp(w).
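
• A minimal Newton-Raphson sketch for this Poisson model with λ(w) = exp(w),
  so the score is Σ(yi − exp(xi'θ))xi and the Information Matrix is
  Σ exp(xi'θ) xi xi' (no convergence checks included in this sketch):

import numpy as np

def poisson_mle(y, X, iters=25):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        lam = np.exp(X @ theta)
        score = X.T @ (y - lam)                   # first derivative of the LLF
        info = (X * lam[:, None]).T @ X           # sum_i exp(x_i'theta) x_i x_i'
        theta = theta + np.linalg.solve(info, score)
    se = np.sqrt(np.diag(np.linalg.inv(info)))    # approximate standard errors
    return theta, se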

G023. III
Likelihood Based Hypothesis Testing

• We now consider test of hypotheses in econometric models in


which the complete probability distribution of outcomes given
conditioning variables is specified.
• There are three natural ways to develop tests of hypotheses
when a likelihood function is available.

1. Is the unrestricted ML estimator significantly far from the


hypothesised value? This leads to what is known as the
Wald test.
2. If the ML estimator is restricted to satisfy the hypothesis,
is the value of the maximised likelihood function signifi-
cantly smaller than the value obtained when the restric-
tions of the hypothesis are not imposed? This leads to
what is known as the likelihood ratio test.
3. If the ML estimator is restricted to satisfy the hypothesis,
are the Lagrange multipliers associated with the restric-
tions of the hypothesis significantly far from zero? This
leads to what is known as the Lagrange multiplier or score
test.

G023. IV
Likelihood Based Hypothesis Testing

• In the normal linear regression model all three approaches, af-


ter minor adjustments, lead to the same statistic which has an
F(j, n − k) distribution when the null hypothesis is true and there
are j restrictions.
• Outside that special case, in general the three methods lead to
different statistics, but in large samples the differences tend to
be small.
• All three statistics have, under certain weak conditions, χ2(j)
limiting distributions when the null hypothesis is true and there
are j restrictions.
• The exact distributional result in the normal linear regres-
  sion model fits into this large sample theory on noting that
  plim_{n→∞} j·F(j, n − k) = χ²(j).

G023. IV
Test of Hypothesis

• We now consider tests of a hypothesis H0: θ2 = 0 where the
  full parameter vector is partitioned into θ' = [θ1' θ2'] and θ2
  contains j elements. Recall that the MLE has the approximate
  distribution

      n^{1/2}(θ̂ − θ) →d N(0, V0)

where

      V0 = −plim_{n→∞}(n^{-1}lθθ(θ0; Y))^{-1} = I(θ0)^{-1}

and I(θ0 ) is the asymptotic information matrix per observation.

G023. IV
Wald Test

• This test is obtained by making a direct comparison of θ̂2 with


the hypothesised value of θ2 , zero.
• Using the approximate distributional result given above leads
to the following test statistic.

      SW = n θ̂2' Ŵ22^{-1} θ̂2

where Ŵ22 is a consistent estimator of the lower right hand


j × j block of V0 .
• Under the null hypothesis SW →d χ²(j) and we reject the null
hypothesis for large values of SW .
• Using one of the formulas for the inverse of a partitioned matrix
the Wald statistic can also be written as
      SW = n θ̂2' ( Î(θ̂)22 − Î(θ̂)21 Î(θ̂)11^{-1} Î(θ̂)12 ) θ̂2

where the elements Î(θ̂)ij are consistent estimators of the ap-


propriate blocks of the asymptotic Information Matrix per ob-
servation evaluated at the (unrestricted) MLE.

G023. IV
The Score - or Lagrange Multiplier - test

• Sometimes we are in a situation where a model has been es-


timated with θ2 = 0, and we would like to see whether the
model should be extended by adding additional parameters
and perhaps associated conditioning variables or functions of
ones already present.
• It is convenient to have a method of conducting a test of the
hypothesis that the additional parameters are zero ( in which
case we might decide not to extend the model) without having
to estimate the additional parameters. The score test provides
such a method.

G023. IV
The Score - or Lagrange Multiplier - test

• The score test considers the gradient of the log likelihood func-
tion evaluated at the point

      θ̂R = [θ̂1R', 0]'

and examines the departure from zero of that part of the gra-
dient of the log likelihood function that is associated with θ2 .
• Here θ̂1R is the MLE of θ1 when θ2 is restricted to be zero. If
the unknown value of θ2 is in fact zero then this part of the
gradient should be close to zero. The score test statistic is

      SS = n^{-1} lθ(θ̂R; Y)' Î(θ̂R)^{-1} lθ(θ̂R; Y)
and SS →d χ²(j) under the null hypothesis. There are a variety
of ways of estimating Î(θ0) and hence its inverse.

• Note that the complete score (gradient) vector appears in this


formula. Of course the part of that associated with θ1 is zero
because we are evaluating at the restricted MLE. That means
the score statistic can also be written, using the formula for the
inverse of a partitioned matrix, as the algebraically identical
      SS = n^{-1} lθ2(θ̂R; Y)' ( Î(θ̂R)22 − Î(θ̂R)21 Î(θ̂R)11^{-1} Î(θ̂R)12 )^{-1} lθ2(θ̂R; Y).

• When the information matrix is block diagonal, which means


that the MLEs of θ1 and θ2 are asymptotically uncorrelated,
the second term in the inverse above vanishes.

G023. IV
Likelihood ratio tests

• The final method for constructing hypothesis tests that we will


consider involves comparing the value of the maximised likeli-
hood function at the restricted MLE ( θ̂R ) and the unrestricted
MLE (now written as θ̂U ).
• This likelihood ratio test statistic takes the form
      SL = 2( l(θ̂U; Y) − l(θ̂R; Y) )

and it can be shown that under H0, SL →d χ²(j).
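
• A sketch of how the statistic is used in practice: fit the model with
  and without the j restrictions, form SL, and compare it with the χ²(j)
  distribution:

from scipy.stats import chi2

def likelihood_ratio_test(loglik_unrestricted, loglik_restricted, j):
    SL = 2.0 * (loglik_unrestricted - loglik_restricted)
    p_value = chi2.sf(SL, df=j)                   # reject H0 for large SL
    return SL, p_value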

G023. IV
Specification Testing

• Maximum likelihood estimation requires a complete specifica-


tion of the probability distribution of the random variables
whose realisations we observe.
• In practice we do not know this distribution though we may be
able to make a good guess. If our guess is badly wrong then we
may produce poor quality estimates, for example badly biased
estimates, and the inferences we draw using the properties of
the likelihood function may be incorrect.
• In regression models the same sorts of problems occur. If there
is heteroskedasticity or serial correlation then, though we may
produce reasonable point estimates of regression coefficients if
we ignore these features of the data generating process, our
inferences will usually be incorrect if these features are not
allowed for, because we will use incorrect formulae for standard
errors and so forth.
• It is important then to seek for evidence of departure from a
model specification, that is to conduct specification tests.
• In a likelihood context the score test provides an easy way of
generating specification tests.
• The score specification test does not tell us exactly how the
model should be extended.

G023. IV
Detecting Heteroskedasticity

• We consider one example here, namely detecting heteroskedas-


ticity in a normal linear regression model.
• In the model considered, Y1, . . . , Yn are independently distrib-
  uted with Yi given xi being N(xi'β, σ²h(zi'α)) where h(0) = 1
  and h'(0) = 1, both achievable by suitable scaling of h(·).
• Let θU = [β, σ 2 , α] and let θR = [β, σ 2 , 0]. A score test of
H0 : α = 0 will provide a specification test to detect het-
eroskedasticity.
• The log likelihood function when α = 0, in which case there is
homoskedasticity, is as follows.
      l(θR; y|x) = −(n/2) log 2π − (n/2) log σ² − 1/(2σ²) Σ_{i=1}^n (yi − xi'β)²

whose gradients with respect to β and σ² are

      lβ(θR; y|x) = (1/σ²) Σ_{i=1}^n (yi − xi'β) xi
      lσ²(θR; y|x) = −n/(2σ²) + 1/(2σ⁴) Σ_{i=1}^n (yi − xi'β)²

which lead to the restricted MLEs under homoskedasticity, as


follows.
      β̂ = (X'X)^{-1}X'y
      σ̂² = (1/n) Σ_{i=1}^n (yi − xi'β̂)²

G023. IV
Detecting Heteroskedasticity

• The log likelihood function for the unrestricted model is


      l(θU; y|x) = −(n/2) log 2π − (n/2) log σ² − ½ Σ_{i=1}^n log h(zi'α) − 1/(2σ²) Σ_{i=1}^n (yi − xi'β)²/h(zi'α)

whose gradient with respect to α is


      lα(θU; y|x) = −½ Σ_{i=1}^n [h'(zi'α)/h(zi'α)] zi + 1/(2σ²) Σ_{i=1}^n [(yi − xi'β)² h'(zi'α)/h(zi'α)²] zi

which evaluated at the restricted MLE (for which α = 0) is

      lα(θ̂R; y|x) = −½ Σ_{i=1}^n zi + 1/(2σ̂²) Σ_{i=1}^n (yi − xi'β̂)² zi
                  = 1/(2σ̂²) Σ_{i=1}^n (ε̂i² − σ̂²) zi.

• The specification test examines the correlation between the


squared OLS residuals and zi . The score test will lead to re-
jection when this correlation is large.
• Details of calculation of this test are given in the intermediate
textbooks and the test (Breusch-Pagan-Godfrey) is built into
many of the econometric software packages.
• Note that the form of the function h(·) does not figure in the
score test. This would not be the case had we developed either
a Wald test or a Likelihood Ratio test.
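
• A sketch of one common way of computing a test in this spirit, the
  n·R² form of the auxiliary regression of squared OLS residuals on z
  (one of several asymptotically equivalent versions; the statistic in the
  textbooks may be scaled differently):

import numpy as np
from scipy.stats import chi2

def heteroskedasticity_score_test(y, X, Z):
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e2 = (y - X @ beta) ** 2                       # squared OLS residuals
    Zc = np.column_stack([np.ones(len(y)), Z])     # auxiliary regressors with a constant
    g = np.linalg.solve(Zc.T @ Zc, Zc.T @ e2)
    fitted = Zc @ g
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    stat = len(y) * r2                             # approximately chi-squared(j) under H0
    return stat, chi2.sf(stat, df=Z.shape[1])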

G023. IV
Information Matrix Tests

• We have seen that the results on the limiting distribution of


the MLE rest at one point on the Information Matrix Equality
E[lθ (θ0 , Y )lθ (θ0 , Y )0 ] = −E[lθθ0 (θ0 , Y )]
where Y = (Y1 , . . . , Yn ) are n random variables whose realisa-
tions constitute our data.
• In the case relevant to much microeconometric work the log
likelihood function is a sum of independently distributed ran-
dom variables, e.g. in the continuous Y case:
      l(θ, Y) = Σ_{i=1}^n log f(Yi, θ),

where f (Yi , θ) is the probability density function of Yi . Here


the Information Matrix Equality derives from the result
      E[ (∂/∂θ) log f(Y, θ) · (∂/∂θ') log f(Y, θ) + (∂²/∂θ∂θ') log f(Y, θ) ] = 0.

• Given a value θ̂ of the MLE we can calculate a sample analogue


of the left hand side of this equation:
      IM = (1/n) Σ_{i=1}^n ( (∂/∂θ) log f(Yi, θ) · (∂/∂θ') log f(Yi, θ) + (∂²/∂θ∂θ') log f(Yi, θ) )|_{θ=θ̂}

• If the likelihood function is a correct specification for the data


generating process, then we expect the resulting statistic (which
is a matrix of values unless θ is scalar) to be close to zero (a
matrix of zeros).
• A general purpose statistic for detecting incorrect specification
of a likelihood function is produced by considering a quadratic
form in a vectorised version of all or part of n1/2 IM . This
Information Matrix Test statistic was introduced by Halbert
White³ in 1982.
G023. IV
³ See “Maximum Likelihood Estimation in Misspecified Models”, Halbert White.
Endogeneity and Instrumental Variables
Endogeneity and Simultaneity

• In many problems studied in econometrics it is not possible to


maintain restrictions requiring that the expected value of the
latent variable in an equation is zero given the values of the
right hand side variables in the equation:

      E(ε|X) ≠ 0

• This leads to a biased OLS estimate.


• There are many cases in which the OLS identification assump-
tion does not hold:
– simultaneous equations.
– explanatory variables measured with error.
– omitted variables correlated with explanatory variables.

G023. V
Simultaneity

• Definition: Simultaneity arises when the causal relationship


between Y and X runs both ways. In other words, the ex-
planatory variable X is a function of the dependent variable
Y , which in turn is a function of X.

[Diagram: a direct effect running from X to Y and an indirect effect running from Y back to X]

• This arises in many economic examples:


– Income and health.
– Sales and advertizing.
– Investment and productivity.
• What are we estimating when we run an OLS regression of Y
on X? Is it the direct effect, the indirect effect, or a mixture of
both?

G023. V
Examples

Advertisement ↔ Higher Sales (feedback through higher revenues)

Investment ↔ Higher Productivity (feedback through higher revenues)

Low income ↔ Poor health (feedback through reduced hours of work)

G023. V
Implications of Simultaneity

•     Yi = β0 + β1 Xi + ui      (direct effect)
      Xi = α0 + α1 Yi + vi      (indirect effect)
• Replacing the second equation in the first one, we get an equa-
tion expressing Yi as a function of the parameters and the error
terms ui and vi only. Substituting this into the second equa-
tion, we get Xi also as a function of the parameters and the
error terms:
      Yi = (β0 + β1α0)/(1 − α1β1) + (β1vi + ui)/(1 − α1β1) = B0 + ũi

      Xi = (α0 + α1β0)/(1 − α1β1) + (vi + α1ui)/(1 − α1β1) = A0 + ṽi

• This is the reduced form of our model. In this rewritten


model, Yi is not a function of Xi and vice versa. However, Yi
and Xi are both a function of the two original error terms ui
and vi .
• Now that we have an expression for Xi , we can compute:
      cov(Xi, ui) = cov( (α0 + α1β0)/(1 − α1β1) + (vi + α1ui)/(1 − α1β1), ui )
                  = [α1/(1 − α1β1)] V ar(ui)
which, in general is different from zero. Hence, with simultane-
ity, our assumption 1 is violated. An OLS regression of Yi
on Xi will lead to a biased estimate of β1 . Similarly, an
OLS regression of Xi on Yi will lead to a biased estimate of α1 .

G023. V
What are we estimating?

• For the model:


      Yi = β0 + β1 Xi + ui

• The OLS estimate is:


      β̂1 = β1 + cov(Xi, ui)/V ar(Xi)
          = β1 + [α1/(1 − α1β1)] · V ar(ui)/V ar(Xi)

• So
      – E β̂1 ≠ β1
      – E β̂1 ≠ α1
      – E β̂1 ≠ an average of β1 and α1.

G023. V
Identification

• Suppose a more general model:


      Yi = β0 + β1 Xi + β2 Ti + ui
      Xi = α0 + α1 Yi + α2 Zi + vi

• We have two sorts of variables:


– Endogenous: Yi and Xi because they are determined
within the system. They appear on the right and left hand
side.
– Exogenous: Ti and Zi . They are determined outside of
our model, and in particular are not caused by either Xi
or Yi . They appear only on the right-hand-side.

G023. V
Example

• Consider a simple version of the Mincer model for returns to


schooling with the following structural equations.

W = α0 + α1 S + α2 Z + ε1
S = β0 + β1 Z + ε2

Here W is the log wage, S is years of schooling, Z is some


characteristic of the individual, and ε1 and ε2 are unobservable
latent random variables.
• We might expect those who receive unusually high levels of
schooling given Z to also receive unusually high wages given Z
and S, a situation that would arise if ε1 and ε2 were affected
positively by ability, a characteristic not completely captured
by variation in Z.
• In this problem we might be prepared to impose the following
restrictions.

E[ε1 |Z = z] = 0
E[ε2 |Z = z] = 0

but not
E[ε1 |S = s, Z = z] = 0
unless ε1 was believed to be uncorrelated with ε2 .
• Considering just the first (W ) equation,

E[W |S = s, Z = z] = α0 + α1 s + α2 z + E[ε1 |S = s, Z = z]

• A variable like S, appearing in a structural form equation and


correlated with the latent variable in the equation, is called an
endogenous variable.

G023. V
Reduced Form Equations

• Substitute for S in the wage equation:

W = (α0 + α1 β0 ) + (α1 β1 + α2 ) Z + ε1 + α1 ε2
S = β0 + β1 Z + ε2

• Equations like this, in which each equation involves exactly one


endogenous variable are called reduced form equations.
• The restrictions E[ε1 |Z = z] = 0 and E[ε2 |Z = z] = 0 imply
that

E[W |Z = z] = (α0 + α1 β0 ) + (α1 β1 + α2 ) z


E[S|Z = z] = β0 + β1 z

• Given enough (at least 2) distinct values of z and knowledge


of the left hand side quantities we can solve for (α0 + α1 β0 ),
(α1 β1 + α2 ), β0 and β1 . So, the values of these functions of
parameters of the structural equations can be identified.
• In practice we do not know the left hand side quantities but
with enough data we can estimate the data generating values
of (α0 + α1 β0 ), (α1 β1 + α2 ), β0 and β1 , for example by OLS
applied first to (W, Z) data and then to (S, Z) data.
• The values of β0 and β1 are identified but the values of α0 ,
α1 and α2 are not, for without further restrictions their values
cannot be deduced from knowledge of (α0 + α1 β0 ), (α1 β1 + α2 ),
β0 .

G023. V
Identification using an Exclusion Restriction

• One restriction we might be prepared to add to the model is


the restriction α2 = 0. Whether or not that is a reasonable
restriction to maintain depends on the nature of the variable
Z.
• If Z were a measure of some characteristic of the environment
of the person at the time that schooling decisions were made
(for example the parents’ income, or some measure of an event
that perturbed the schooling choice) then we might be prepared
to maintain the restriction that, given schooling achieved (S),
Z does not affect W , i.e. that α2 = 0.
• This restriction may be sufficient to identify the remaining pa-
rameters. If the restriction is true then the coefficients on Z
become α1 β1 .
• We have already seen that (the value of) the coefficient β1
is identified. If β1 is not itself zero (that is Z does indeed
affect years of schooling) then α1 is identified as the ratio of the
coefficients on Z in the regressions of W and S on Z. With α1
identified and β0 already identified, identification of α0 follows
directly.

G023. V
Indirect Least Squares Estimation

• Estimation could proceed under the restriction α2 = 0 by cal-


culating OLS (or GLS) estimates of the “reduced form” equa-
tions:

W = π01 + π11 Z + U1
S = π02 + π12 Z + U2

where
π01 = α0 + α1 β0 π11 = α1 β1
π02 = β0 π12 = β1
U1 = ε1 + α1 ε2 U2 = ε2
and
E[U1 |Z = z] = 0 E[U2 |Z = z] = 0
solving the equations:

π̂01 = α̂0 + α̂1 β̂0 π̂11 = α̂1 β̂1


π̂02 = β̂0 π̂12 = β̂1

given values of the π̂’s for values of the α̂’s and β̂’s, as follows.
α̂0 = π̂01 − π̂02 (π̂11 /π̂12 ) α̂1 = π̂11 /π̂12
β̂0 = π̂02 β̂1 = π̂12

• Estimators obtained in this way, by solving the equations re-


lating structural form parameters to reduced form parameters
with OLS estimates replacing the reduced form parameters,
are known as Indirect Least Squares estimators. They were
first proposed by Jan Tinbergen in 1930.
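
• A sketch of this calculation for the just-identified example above:
  OLS on each reduced form, then solve for the structural parameters:

import numpy as np

def indirect_least_squares(W, S, Z):
    Zc = np.column_stack([np.ones(len(Z)), Z])
    pi_W = np.linalg.solve(Zc.T @ Zc, Zc.T @ W)    # (pi_01, pi_11)
    pi_S = np.linalg.solve(Zc.T @ Zc, Zc.T @ S)    # (pi_02, pi_12)
    alpha1 = pi_W[1] / pi_S[1]                     # alpha_1 = pi_11 / pi_12
    alpha0 = pi_W[0] - pi_S[0] * alpha1            # alpha_0 = pi_01 - pi_02 (pi_11 / pi_12)
    beta0, beta1 = pi_S                            # beta_0 = pi_02, beta_1 = pi_12
    return alpha0, alpha1, beta0, beta1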

G023. V
Over Identification

• Suppose that there are two covariates, Z1 and Z2 whose impact


on the structural equations we are prepared to restrict so that
both affect schooling choice but neither affect the wage given
the amount of schooling achieved:
W = α0 + α1 S + ε1
S = β0 + β1 Z1 + β2 Z2 + ε2
• the reduced form equations are as follows
W = π01 + π11 Z1 + π21 Z2 + U1
S = π02 + π12 Z1 + π22 Z2 + U2
where
π01 = α0 + α1 β0 π11 = α1 β1 π21 = α1 β2
π02 = β0 π12 = β1 π22 = β2
and
U1 = ε1 + α1 ε2 U2 = ε2 .
• The values of the reduced form equations' coefficients are iden-
  tified under these restrictions.
• Note, there are two ways in which the coefficient α1 can be
identified, as follows
      α1 = α1^{Z1} = π11/π12      α1 = α1^{Z2} = π21/π22
• In this situation we say that the value of the parameter α1 is
over identified .
• We will usually find that α̂1^{Z1} ≠ α̂1^{Z2} even though these are both
estimates of the value of the same structural form parameter.
• If the discrepancy was found to be very large then we might
doubt whether the restrictions of the model are correct. This
suggests that tests of over identifying restrictions can detect
misspecification of the econometric model.
• If the discrepancy is not large then there is scope for combining
the estimates to produce a single estimate that is more efficient
than either taken alone.
G023. V
Instrumental Variables

• Consider the linear model for an outcome Y given covariates X

Y = Xβ + ε

• Suppose that the restriction E[ε|X = x] = 0 cannot be main-


tained but that there exist m variables Z for which the restric-
tion E[ε|Z = z] = 0 can be maintained. It implies:

E[Y − Xβ|Z = z] = 0

and thus that

E[Z′(Y − Xβ)|Z = z] = 0

which implies that, unconditionally

E[Z′(Y − Xβ)] = 0.

and thus
E[Z′Y ] = E[Z′X]β.

• First suppose m = k, and that E[Z′X] has rank k. Then β can be
  expressed in terms of moments of Y , X and Z as follows

β = E[Z′X]⁻¹ E[Z′Y ].

and β is (just) identifiable. This leads directly to an analogue
type estimator:

β̂ = (Z′X)⁻¹(Z′Y )

In the context of the just identified returns to schooling model
this is the Indirect Least Squares estimator.
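• A minimal sketch in Python (illustrative simulated data) of this
  analogue estimator, checking numerically that with a constant and
  one instrument it coincides with the Indirect Least Squares ratio
  of reduced form coefficients.

  import numpy as np

  rng = np.random.default_rng(2)
  n = 5_000
  z = rng.normal(size=n)
  eps2 = rng.normal(size=n)
  eps1 = 0.5 * eps2 + rng.normal(size=n)
  s = 11.0 + 0.5 * z + eps2            # endogenous regressor (schooling)
  w = 1.0 + 0.08 * s + eps1            # outcome (wage)

  X = np.column_stack([np.ones(n), s])  # regressors: constant and S
  Z = np.column_stack([np.ones(n), z])  # instruments: constant and Z (m = k = 2)

  beta_iv = np.linalg.solve(Z.T @ X, Z.T @ w)      # (Z'X)^{-1} Z'y

  # Indirect Least Squares ratio for comparison
  piW = np.linalg.lstsq(Z, w, rcond=None)[0]
  piS = np.linalg.lstsq(Z, s, rcond=None)[0]
  print(beta_iv[1], piW[1] / piS[1])               # identical up to rounding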

G023. V
Generalised Method of Moments estimation

• Suppose that m > k. In general the sample analogue of the moment
  equations then has no exact solution, since there are m > k
  equations in k unknowns.
• Define a family of estimators, β̂W , as

β̂W = arg min_β (Z′Y − Z′Xβ)′ W (Z′Y − Z′Xβ)

  where W is an m × m full rank, positive definite symmetric matrix.
• This M-estimator is an example of what is known as the Gen-
eralised Method of Moments (GMM) estimator.
• Different choices of W lead to different estimators unless m =
k.
• The choice among these is commonly made by considering their
  accuracy. We consider the limiting distribution of the GMM
  estimator for alternative choices of W and choose W to minimise
  the variance of the limiting distribution of n^{1/2}(β̂W − β0 ).
• In standard cases this means choosing W to be proportional to a
  consistent estimator of the inverse of the variance of the
  limiting distribution of n^{-1/2}(Z′Y − Z′Xβ).
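• Given the definition above, in this linear case the minimiser can
  be written in closed form as β̂W = (X′Z W Z′X)⁻¹ X′Z W Z′Y,
  provided the inverse exists. A minimal sketch in Python (simulated
  data with assumed parameter values, not from the notes) compares
  two choices of W when m > k.

  import numpy as np

  rng = np.random.default_rng(3)
  n = 5_000
  z1, z2 = rng.normal(size=n), rng.normal(size=n)
  eps2 = rng.normal(size=n)
  eps1 = 0.5 * eps2 + rng.normal(size=n)
  s = 11.0 + 0.5 * z1 - 0.3 * z2 + eps2
  y = 1.0 + 0.08 * s + eps1

  X = np.column_stack([np.ones(n), s])         # k = 2 regressors
  Z = np.column_stack([np.ones(n), z1, z2])    # m = 3 instruments

  def gmm_linear(y, X, Z, W):
      # closed-form minimiser of (Z'y - Z'X b)' W (Z'y - Z'X b)
      A = X.T @ Z @ W @ Z.T @ X
      b = X.T @ Z @ W @ Z.T @ y
      return np.linalg.solve(A, b)

  W_identity = np.eye(Z.shape[1])
  W_zz = np.linalg.inv(Z.T @ Z)                # the choice used on the next slide
  print(gmm_linear(y, X, Z, W_identity))
  print(gmm_linear(y, X, Z, W_zz))             # different W, (slightly) different estimates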

G023. V
Generalised Instrumental Variables Estimation

• Write β̂W explicitly in terms of sample moments:

β̂W = arg min_β ((Zn′yn − Zn′Xn β)/n^{1/2})′ W ((Zn′yn − Zn′Xn β)/n^{1/2})

• Consider what the (asymptotically) efficient choice of W is by
  examining the variance of n^{-1/2}(Zn′yn − Zn′Xn β).
• We have, since yn = Xn β + εn ,

n^{-1/2}(Zn′yn ) − n^{-1/2}(Zn′Xn )β = n^{-1/2}(Zn′εn )

  and if we suppose that Var(εn |Zn ) = σ²In ,

Var(n^{-1/2}(Zn′εn )|Zn ) = σ²(n⁻¹Zn′Zn ).

This suggests choosing W = (n⁻¹Zn′Zn )⁻¹, leading to the following
minimisation problem:

β̂n = arg min_β (Zn′yn − Zn′Xn β)′ (Zn′Zn )⁻¹ (Zn′yn − Zn′Xn β)

• The first order conditions for this problem, satisfied by β̂n , are:

2(Xn′Zn )(Zn′Zn )⁻¹(Zn′Xn )β̂n − 2(Xn′Zn )(Zn′Zn )⁻¹(Zn′yn ) = 0

leading to the following estimator:

β̂ = (X′Z(Z′Z)⁻¹Z′X)⁻¹ X′Z(Z′Z)⁻¹Z′y

This is known as the generalised instrumental variable estimator
(GIVE).
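• A minimal sketch in Python (illustrative simulated data) computing
  the GIVE formula directly and checking the first order conditions
  above numerically: at β̂ the vector X′Z(Z′Z)⁻¹(Z′y − Z′Xβ̂) is zero
  up to rounding error.

  import numpy as np

  rng = np.random.default_rng(4)
  n = 5_000
  z1, z2 = rng.normal(size=n), rng.normal(size=n)
  eps2 = rng.normal(size=n)
  eps1 = 0.5 * eps2 + rng.normal(size=n)
  s = 11.0 + 0.5 * z1 - 0.3 * z2 + eps2
  y = 1.0 + 0.08 * s + eps1
  X = np.column_stack([np.ones(n), s])
  Z = np.column_stack([np.ones(n), z1, z2])

  ZZ_inv = np.linalg.inv(Z.T @ Z)
  A = X.T @ Z @ ZZ_inv @ Z.T @ X                       # X'Z(Z'Z)^{-1}Z'X
  beta_give = np.linalg.solve(A, X.T @ Z @ ZZ_inv @ Z.T @ y)

  foc = X.T @ Z @ ZZ_inv @ (Z.T @ y - Z.T @ X @ beta_give)
  print(beta_give)   # close to (1.0, 0.08)
  print(foc)         # numerically zero: the first order conditions hold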

G023. V
GIVE: Asymptotic Properties

• The asymptotic properties of this estimator are obtained as
  follows. Substituting yn = Xn β + εn gives

β̂n = β + (Xn′Zn (Zn′Zn )⁻¹Zn′Xn )⁻¹ Xn′Zn (Zn′Zn )⁻¹Zn′εn
    = β + (n⁻¹Xn′Zn (n⁻¹Zn′Zn )⁻¹ n⁻¹Zn′Xn )⁻¹ n⁻¹Xn′Zn (n⁻¹Zn′Zn )⁻¹ n⁻¹Zn′εn

and if

plim_{n→∞}(n⁻¹Zn′Zn ) = ΣZZ
plim_{n→∞}(n⁻¹Xn′Zn ) = ΣXZ
plim_{n→∞}(n⁻¹Zn′εn ) = 0

with ΣZZ having full rank (m) and ΣXZ having full rank (k), then

plim_{n→∞} β̂n = β

and we have a consistent estimator.

G023. V
GIVE Asymptotic Properties

• To obtain the limiting distribution of n^{1/2}(β̂ − β) note that

n^{1/2}(β̂ − β) = (n⁻¹Xn′Zn (n⁻¹Zn′Zn )⁻¹ n⁻¹Zn′Xn )⁻¹ n⁻¹Xn′Zn (n⁻¹Zn′Zn )⁻¹ n^{-1/2}Zn′εn

• Under the conditions in the previous slide we have the limiting
  distribution:

plim n^{1/2}(β̂ − β) = (ΣXZ ΣZZ⁻¹ ΣZX )⁻¹ ΣXZ ΣZZ⁻¹ plim(n^{-1/2}Zn′εn )

where ΣZX = Σ′XZ , and if a Central Limit Theorem applies to
n^{-1/2}Zn′εn ,

plim(n^{-1/2}Zn′εn ) = N (0, σ²ΣZZ )
then

plim n^{1/2}(β̂ − β) = N (0, V )

where

V = σ² (ΣXZ ΣZZ⁻¹ ΣZX )⁻¹ ΣXZ ΣZZ⁻¹ ΣZZ ΣZZ⁻¹ ΣZX (ΣXZ ΣZZ⁻¹ ΣZX )⁻¹
  = σ² (ΣXZ ΣZZ⁻¹ ΣZX )⁻¹

and so

plim n^{1/2}(β̂ − β) ≈ N (0, σ²(ΣXZ ΣZZ⁻¹ ΣZX )⁻¹).
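• A minimal sketch in Python (illustrative simulated data,
  homoskedastic errors as assumed above) of how this result is used
  in practice: estimate σ² from the structural residuals y − Xβ̂ and
  form approximate standard errors from σ̂²(X′Z(Z′Z)⁻¹Z′X)⁻¹, the
  sample counterpart of V /n.

  import numpy as np

  rng = np.random.default_rng(5)
  n = 5_000
  z1, z2 = rng.normal(size=n), rng.normal(size=n)
  eps2 = rng.normal(size=n)
  eps1 = 0.5 * eps2 + rng.normal(size=n)
  s = 11.0 + 0.5 * z1 - 0.3 * z2 + eps2
  y = 1.0 + 0.08 * s + eps1
  X = np.column_stack([np.ones(n), s])
  Z = np.column_stack([np.ones(n), z1, z2])

  PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)      # Z(Z'Z)^{-1}Z'X
  A = X.T @ PZX                                    # X'Z(Z'Z)^{-1}Z'X
  beta_hat = np.linalg.solve(A, PZX.T @ y)

  resid = y - X @ beta_hat                         # structural residuals
  sigma2_hat = resid @ resid / n
  V_hat = sigma2_hat * np.linalg.inv(A)            # approximate Var(beta_hat)
  print(beta_hat)
  print(np.sqrt(np.diag(V_hat)))                   # approximate standard errors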

G023. V
GIVE and Two Stage OLS

• Suppose there is a model for X,

X = ZΦ + V

where E[V |Z] = 0. The OLS estimator of Φ is

Φ̂n = (Zn′Zn )⁻¹ Zn′Xn

and the “predicted value” of X for a given Z is

X̂n = Zn (Zn′Zn )⁻¹ Zn′Xn .

Note that

X̂n′X̂n = Xn′Zn (Zn′Zn )⁻¹Zn′Xn

and

X̂n′yn = Xn′Zn (Zn′Zn )⁻¹Zn′yn .
So the Generalised Instrumental Variables Estimator can be written as

β̂n = (X̂n′X̂n )⁻¹ X̂n′yn ,

that is, as the OLS estimator of the coefficients of a linear
relationship between yn and the predicted values of Xn obtained
from OLS estimation of a linear relationship between Xn and the
instrumental variables Zn .
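• A minimal sketch in Python (illustrative simulated data) of this
  equivalence: the OLS regression of yn on the first stage fitted
  values X̂n reproduces the GIVE coefficients.

  import numpy as np

  rng = np.random.default_rng(6)
  n = 5_000
  z1, z2 = rng.normal(size=n), rng.normal(size=n)
  eps2 = rng.normal(size=n)
  eps1 = 0.5 * eps2 + rng.normal(size=n)
  s = 11.0 + 0.5 * z1 - 0.3 * z2 + eps2
  y = 1.0 + 0.08 * s + eps1
  X = np.column_stack([np.ones(n), s])
  Z = np.column_stack([np.ones(n), z1, z2])

  # first stage: OLS of X on Z, then fitted values X_hat = Z(Z'Z)^{-1}Z'X
  Phi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
  X_hat = Z @ Phi_hat

  # second stage: OLS of y on X_hat
  beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

  # GIVE computed directly, for comparison
  A = X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
  beta_give = np.linalg.solve(A, X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ y))
  print(beta_2sls, beta_give)   # identical up to rounding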

G023. V
Examples: Measurement Errors

• Suppose we are measuring the impact of income, X, on consumption,
  Y . The true model is:

Yi = β0 + β1 Xi + ui

β0 = 0, β1 = 1

• Suppose we have two measures of income, both with measurement
  errors:
  – X̌1i = Xi + v1i , s.d.(v1i ) = 0.2 × Ȳ
  – X̌2i = Xi + v2i , s.d.(v2i ) = 0.4 × Ȳ
If we use X̌2 to instrument X̌1 , we get:

β̂1 = [ Σ_{i=1}^{N} (X̌2i − X̄2 )(Yi − Ȳ ) ] / [ Σ_{i=1}^{N} (X̌2i − X̄2 )(X̌1i − X̄1 ) ]

where X̄1 and X̄2 denote the sample means of X̌1 and X̌2 .

• Results:

Method                            Estimate of β1
OLS regressing Y on X̌1                0.88
OLS regressing Y on X̌2                0.68
IV, using X̌2 as instrument            0.99
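• A minimal simulation sketch in Python of this example. The
  standard deviations of the measurement errors follow the slide
  (0.2Ȳ and 0.4Ȳ), but the variance of true income is an assumed
  value not given on the slide, so the OLS figures differ from the
  table; the qualitative pattern (attenuation towards zero,
  corrected by IV) is the same.

  import numpy as np

  rng = np.random.default_rng(7)
  n = 100_000
  x = rng.normal(10.0, 5.0, size=n)            # true income (mean and s.d. assumed)
  y = 0.0 + 1.0 * x + rng.normal(size=n)       # consumption: beta0 = 0, beta1 = 1

  x1 = x + rng.normal(0.0, 0.2 * y.mean(), size=n)   # first noisy measure of income
  x2 = x + rng.normal(0.0, 0.4 * y.mean(), size=n)   # second, noisier measure

  def slope_ols(y, x):
      return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

  def slope_iv(y, x, z):
      return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

  print(slope_ols(y, x1))      # attenuated below 1 (errors-in-variables bias)
  print(slope_ols(y, x2))      # attenuated further (noisier measure)
  print(slope_iv(y, x1, x2))   # using x2 as instrument for x1: close to 1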

G023. V
