G023: Econometrics
Jérôme Adda
Office # 203
I am grateful to Andrew Chesher for giving me access to his G023 course notes on which
most of these slides are based.
G023. I
Syllabus
Course Description:
This course is an intermediate econometrics course. There will be
3 hours of lectures per week and a class (sometimes in the computer
lab) each week. Previous knowledge of econometrics is assumed. By
the end of the term, you are expected to be at ease with modern
econometric techniques. The computer classes introduce you to
real-life problems and will help you to understand the theoretical
content of the lectures. You will also learn to use a powerful and
widely used econometric software package, STATA.
Understanding these techniques will be of great help for your
thesis over the summer, and will help you in your future workplace.
For any contact or query, please send me an email or visit my
web page at:
http://www.ucl.ac.uk/~uctpjea/teaching.html.
My web page contains documents which might prove useful such as
notes, previous exams and answers.
Books:
There are several good intermediate econometrics books, but the
main book to be used for reference is Wooldridge (J. Wooldridge
(2003) Econometric Analysis of Cross-Section and Panel Data, MIT
Press). Other useful books are:
G023. I
Course Content
1. Introduction
What is econometrics? Why is it useful?
2. The linear model and Ordinary Least Squares
Model specification.
3. Hypothesis Testing
Goodness of fit, R2 . Hypothesis tests (t and F).
4. Approximate Inference
Slutsky’s Theorem; Limit Theorems. Approximate distribution
of the OLS and GLS estimators.
5. Maximum Likelihood Methods
Properties; Limiting distribution; Logit and Probit; Count
data.
6. Likelihood based Hypothesis Testing
Wald and Score tests.
7. Endogeneity and Instrumental Variables
Indirect Least Squares, IV, GMM; Asymptotic properties.
G023. I
Definition and Examples
G023. I
Example 1: Global Warming
G023. I
Example 1: Global Warming
G023. I
Causality
G023. I
Causality
G023. I
Linear Model
y = Xβ + ε
G023. I
Model Specifications
• Linear model:
Yi = β0 + β1 Xi + εi
∂Yi/∂Xi = β1
Interpretation: When X goes up by 1 unit, Y goes up by β1
units.
• Log-log model:
ln(Yi) = β0 + β1 ln(Xi) + εi
∂Yi/∂Xi = e^β0 β1 Xi^(β1−1) e^εi
(∂Yi/Yi) / (∂Xi/Xi) = β1
Interpretation: When X goes up by 1%, Y goes up by β1 %.
• Log-lin model:
ln(Yi) = β0 + β1 Xi + εi
∂Yi/∂Xi = β1 e^β0 e^(β1 Xi) e^εi
(∂Yi/Yi) / ∂Xi = β1
Interpretation: When X goes up by 1 unit, Y goes up by (approximately) 100 β1 %.
G023. I
Example: Global Warming
G023. I
Assumptions of the Classical
Linear Regression Model
• Assumption 1: E[ε|X] = 0
– The expected value of the error term has mean zero given
any value of the explanatory variable. Thus observing a
high or a low value of X does not imply a high or a low
value of ε.
X and ε are uncorrelated.
– This implies that changes in X are not associated with
changes in ε in any particular direction. Hence the associated
changes in Y can be attributed to the impact of X.
– This assumption allows us to interpret the estimated coefficients
as reflecting causal impacts of X on Y .
– Note that we condition on the whole set of data for X in
the sample, not just on one observation.
G023. I
Assumptions of the Classical Linear Regression Model
• Assumption 2: rank(X) = k.
• In this case, for all non-zero k × 1 vectors, c, Xc ≠ 0.
• When the rank of X is less than k, there exists a non-zero vector
c such that Xc = 0. In words, there is a linear combination of
the columns of X which is a vector of zeros. In this situation
the OLS estimator cannot be calculated. β cannot be defined
by using the information contained in X.
• Perhaps one could obtain other values of x and then be in a
position to define β. But sometimes this is not possible, and
then β is not identifiable given the information in X. Perhaps
we could estimate functions (e.g. linear functions) of β that
would be identifiable even without more x values.
G023. I
OLS Estimator
E[y − Xβ|X] = 0
X′E[y − Xβ|X] = E[X′(y − Xβ)|X]
= E[X′y|X] − X′Xβ
= 0
The OLS estimator β̂ solves the sample analogue X′(y − Xβ̂) = 0:
β̂ = (X′X)^{-1}X′y = (X′X)^{-1}X′(Xβ + ε)
= β + (X′X)^{-1}X′ε
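As an illustration (not part of the original notes), the sketch below computes β̂ = (X′X)^{-1}X′y on simulated data; the data-generating values and variable names are arbitrary choices.

```python
# Sketch: the OLS estimator via the normal equations, using numpy only.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

# Solve X'X beta = X'y (numerically safer than forming an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)        # close to [1, 2]
```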
G023. I
Properties of the OLS Estimator
• Var(β̂|X) = (X′X)^{-1}X′ΣX(X′X)^{-1}, where Σ = Var[ε|X].
• If Σ = σ²In (homoskedasticity and no autocorrelation) then
Var(β̂|X) = σ²(X′X)^{-1}
which is estimated by
V̂ar(β̂|X) = σ̂²(X′X)^{-1}
G023. I
Alternative Way
X 0 (y − X β̂) = 0
G023. I
Goodness of Fit
• We measure how well the model fits the data using the R2 .
• This is the ratio of the explained sum of squares to the total
sum of squares
– Define the Total Sum of Squares as: TSS = Σ_{i=1}^N (Yi − Ȳ)²
– Define the Explained Sum of Squares as: ESS = Σ_{i=1}^N [β̂(Xi − X̄)]²
– Define the Residual Sum of Squares as: RSS = Σ_{i=1}^N ε̂i²
• Then we define
R² = ESS/TSS = 1 − RSS/TSS
• This is a measure of how much of the variance of Y is explained
by the regressor X.
• The computed R2 following an OLS regression is always be-
tween 0 and 1.
• A low R2 is not necessarily an indication that the model is
wrong - just that the included X have low explanatory power.
• The key to whether the results are interpretable as causal im-
pacts is whether the explanatory variable is uncorrelated with
the error term.
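A minimal sketch (illustrative; the simulated data and names are my own) of the TSS/ESS/RSS decomposition and the two equivalent expressions for R² after an OLS fit with an intercept:

```python
# Sketch: computing TSS, ESS, RSS and R^2 after an OLS regression.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
resid = y - y_hat

TSS = np.sum((y - y.mean()) ** 2)
ESS = np.sum((y_hat - y.mean()) ** 2)
RSS = np.sum(resid ** 2)
print(ESS / TSS, 1 - RSS / TSS)   # identical up to rounding when an intercept is included
```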
G023. I
Goodness of Fit
G023. I
Alternative Analogue Estimators
E[H 0 ε|X] = 0
E[H 0 (y − Xβ)|X] = 0
E[H 0 y|X] − E[H 0 X|X]β = 0
E[H 0 y|X] − (H 0 X)β = 0
β̂H = (H 0 X)−1 H 0 y
G023. I
Misspecification
• Suppose the true model is not linear but take the following
(more general) form:
E[Y |X = x] = g(x, θ)
so that
Y = g(x, θ) + ε E[ε|X] = 0
Define
G(X, θ) = [g(x1, θ), . . . , g(xn, θ)]′
then
E[β̂|X] = (X′X)^{-1}X′G(X, θ)
G023. I
Omitted Regressors
y = Zγ + ε E[ε|X, Z] = 0
• Let Z = [X ⋮ Q] and γ′ = [γX′ ⋮ γQ′] so that the matrix
X used to calculate β̂ is a part of the matrix Z. In the fitted
model, the variables Q have been omitted.
E[β̂|X, Z] = E[β̂|Z]
= (X′X)^{-1}X′Zγ
= (X′X)^{-1}[X′X ⋮ X′Q]γ
= [I ⋮ (X′X)^{-1}X′Q]γ
= γX + (X′X)^{-1}X′Q γQ
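The following sketch (illustrative only; the design and coefficient values are assumptions) simulates this result: the short regression of y on X recovers γX + (X′X)^{-1}X′Q γQ rather than γX when the omitted Q is correlated with X.

```python
# Sketch: omitted-variable bias in the short regression.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
q = 0.8 * x + rng.normal(size=n)            # omitted regressor, correlated with x
y = 1.0 + 2.0 * x + 3.0 * q + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])        # short regression: q omitted
beta_short = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_short)                            # slope approx 2 + 3*0.8 = 4.4, not 2
```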
G023. I
Omitted Regressors: Example
Dependent variable: Log Income
                        (1)               (2)
BMI low            -0.42 (.016)     -0.15 (.014)
BMI high           -0.00 (.021)     -0.12 (.018)
age                                  0.13 (.0012)
age squared                         -0.0013 (.00001)
constant            6.64 (.0053)     3.76 (.0278)
(Standard errors in parentheses; column (1) omits the age controls.)
G023. I
Measurement Error
G023. I
Measurement Error on Dependent Variable
Y̌ = Xβ + ε
Y = Xβ + ε − ν
= Xβ + w
G023. I
Measurement Error on Explanatory Variables
β̂ = (X̌ 0 X̌)−1 X̌ 0 y
= (X 0 X + ν 0 ν + X 0 ν + ν 0 X)−1 (X + ν)0 (Xβ + ε)
E[β̂|X] = (X 0 X + ν 0 ν)−1 X 0 Xβ
G023. I
Example
• True model:
Yi = β0 + β1 Xi + ui   with β0 = 1, β1 = 2
• Estimates of β0 and β1 as the measurement error variance increases:
Var(νi)/Var(Xi):    0      0.2     0.4     0.6
β̂0:                 1      1.08    1.28    1.53
β̂1:                 2      1.91    1.7     1.45
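A small simulation in the same spirit (illustrative; the design below is not the one behind the table, so the numbers will not match it exactly) shows the attenuation of the slope as Var(ν)/Var(X) grows:

```python
# Sketch: attenuation bias from classical measurement error in X.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

for ratio in [0.0, 0.2, 0.4, 0.6]:            # Var(nu)/Var(x)
    x_obs = x + rng.normal(scale=np.sqrt(ratio), size=n)
    X = np.column_stack([np.ones(n), x_obs])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(ratio, b[1].round(3))               # slope shrinks roughly like 2/(1 + ratio)
```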
G023. I
Estimation of linear functions of β
G023. I
Minimum Variance Property of OLS
β̃ = Q(X)y
E[β̃|X] = β + R0 Xβ.
V ar[β̃|X] − V ar[β̂|X] = σ 2 R0 R,
and
Var(c′β̃) − Var(c′β̂) = σ² d′d = σ² Σ_{i=1}^k di² ≥ 0
where d = Rc.
G023. I
M Estimation
G023. I
Frisch-Waugh-Lovell Theorem
so that
β̂1 = (X10 X1 )−1 X10 y − (X10 X1 )−1 X10 X2 β̂2
substituting
X20 y − X20 X1 (X10 X1 )−1 X10 y = X20 X2 β̂2 − X20 X1 (X10 X1 )−1 X10 X2 β̂2
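A numerical check of the Frisch-Waugh-Lovell result (a sketch with simulated data; dimensions and values are arbitrary): the coefficient on X2 from the full regression equals the coefficient from regressing M1 y on M1 X2, where M1 = I − X1(X1′X1)^{-1}X1′.

```python
# Sketch: verifying the FWL partialling-out result numerically.
import numpy as np

rng = np.random.default_rng(4)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 1)) + 0.5 * X1[:, [1]]           # correlated with X1
y = X1 @ np.array([1.0, 0.5]) + 2.0 * X2[:, 0] + rng.normal(size=n)

X = np.hstack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)    # residual-maker for X1
y_t, X2_t = M1 @ y, M1 @ X2
beta_fwl = np.linalg.solve(X2_t.T @ X2_t, X2_t.T @ y_t)
print(beta_full[-1], beta_fwl[0])                          # identical up to rounding
```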
G023. I
Generalised Least Squares Estimation
• The BLU property of the OLS estimator does not usually apply
when Var[ε|X] ≠ σ²In.
• Insight: suppose that Y has a much larger conditional variance
at one value of x, x∗ , than at other values. Realisations pro-
duced at x∗ will be less informative about the location of the
regression function than realisations obtained at other values
of x. It seems natural to give realisations obtained at x∗ less
weight when estimating the regression function.
• We know how to produce a BLU estimator when V ar[ε|X] =
σ 2 In .
• Our strategy for producing a BLU estimator when this condi-
tion does not hold is to transform the original regression model
so that the conditional variance of the transformed Y is pro-
portional to an identity matrix and apply the OLS estimator
in the context of that transformed model.
G023. I
Generalised Least Squares Estimation
z = P y = P Xβ + u
where u = P ε and V ar[u|X] = I
• In the context of this model the OLS estimator,
β̆ = (X 0 P 0 P X)−1 X 0 P 0 P y,
G023. I
Feasible Generalised Least Squares Estimation
could be used.
• To study the properties of this estimator requires the use of
asymptotic approximations and we return to this later.
G023. I
Feasible GLS
ε̂i² = γ′xi + ui
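A sketch of the two-step feasible GLS idea (illustrative; the clipping of fitted variances away from zero is a pragmatic choice of mine, not part of the notes): fit OLS, regress the squared residuals on x as in the equation above, then reweight.

```python
# Sketch: two-step feasible GLS with a variance linear in x.
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
sigma2 = 0.5 + 1.0 * x                          # true conditional variance
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2), size=n)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
e2 = (y - X @ b_ols) ** 2

g = np.linalg.solve(X.T @ X, X.T @ e2)          # regress squared residuals on x
var_hat = np.clip(X @ g, 1e-3, None)            # keep fitted variances positive

W = X / var_hat[:, None]                        # rows of X weighted by 1/var_i
b_fgls = np.linalg.solve(W.T @ X, W.T @ y)      # (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
print(b_ols, b_fgls)
```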
G023. I
Feasible GLS
G023. I
Inference: Sampling Distributions
G023. I
Inference: Confidence Intervals
• Let Z ∼ N [0, 1] and let zL (α) and zU (α) be the closest pair
of values such that P [zL (α) ≤ Z ≤ zU (α)] = α. zL (α) is
the (1 − α)/2 quantile of the standard normal distribution.
Choosing α = 0.95 gives zU (α) = 1.96, zL (α) = −1.96.
• The result above concerning the distribution of c′β̂ implies that
P[zL(α) ≤ (c′β̂ − c′β) / (σ (c′(X′X)^{-1}c)^{1/2}) ≤ zU(α)] = α
which in turn implies that
P[c′β̂ − zU(α) σ (c′(X′X)^{-1}c)^{1/2} ≤ c′β ≤ c′β̂ − zL(α) σ (c′(X′X)^{-1}c)^{1/2}] = α.
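For example (a sketch, not from the notes), a 95% interval for c′β can be computed as above with σ replaced by an estimate (see the following slides on estimating σ²):

```python
# Sketch: a 95% confidence interval for c'beta using the normal approximation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

k = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)                   # estimate of sigma^2

c = np.array([0.0, 1.0])                       # interval for the slope, c'beta
se = np.sqrt(s2 * c @ np.linalg.solve(X.T @ X, c))
z = stats.norm.ppf(0.975)                      # z_U(0.95) = 1.96
print(beta_hat @ c - z * se, beta_hat @ c + z * se)
```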
G023. I
Estimation of σ 2
• Note that σ̂² = n^{-1} Σ_{i=1}^n ε̂i² = n^{-1} ε̂′ε̂, where ε̂ = y − Xβ̂ = My and M = I − X(X′X)^{-1}X′.
G023. I
Estimation of σ 2
• Proof of E[σ̂²] = ((n − k)/n) σ² < σ²:
• First note that My = Mε because MX = 0. So
σ̂² = n^{-1} y′My = n^{-1} ε′Mε
and when Σ = σ²In,
E[σ̂²|X] = n^{-1} E[ε′Mε|X] = n^{-1} σ² tr(M) = ((n − k)/n) σ².
G023. I
Confidence regions
G023. I
Chi Square Distribution
E[χ2(ν) ] = ν
V ar[χ2(ν) ] = 2ν.
G023. I
Confidence regions continued
• Let qχ2 (j) (α) denote the α−quantile of the χ2(j) distribution.
Then
P [χ2(j) ≤ qχ2 (j) (α)] = α
implies that
P[(Rβ̂ − r)′ (R(X′X)^{-1}R′)^{-1} (Rβ̂ − r) / σ² ≤ qχ²(j)(α)] = α.
• With a single restriction (j = 1), writing R = c′ and r = c*, this is
P[ (c′β̂ − c*)² / (σ² c′(X′X)^{-1}c) ≤ qχ²(1)(α)]
= P[ −(qχ²(1)(α))^{1/2} ≤ (c′β̂ − c*) / (σ (c′(X′X)^{-1}c)^{1/2}) ≤ (qχ²(1)(α))^{1/2}]
= P[ zL(α) ≤ (c′β̂ − c*) / (σ (c′(X′X)^{-1}c)^{1/2}) ≤ zU(α)]
G023. I
Tests of hypotheses
G023. I
Tests of Hypotheses
Here qχ2 (j) (1−λ) is the (1−λ) quantile of the χ2 (j) distribution.
• Note that we do not talk in terms of accepting H0 as an alter-
native to rejection. The reason is that a value of S that does
not fall in the rejection region of the test is consonant with
many values of Rβ − r that are close to but not equal to 0.
G023. I
Tests of Hypotheses
G023. I
Confidence Interval: Example
0.06/0.00785 = 7.71
Reject H0 .
• Test of H0 Effect of College degree equal to High School degree:
Reject H0 .
G023. I
Detecting structural change
where, e.g., ε̂b′ε̂b is the sum of squared residuals from estimating
yb = Xb βb + εb.
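A sketch of a Chow-type statistic built from these sums of squared residuals (illustrative simulated data; the break point and parameter values are assumptions):

```python
# Sketch: testing for a structural break by comparing pooled and split-sample SSRs.
import numpy as np
from scipy import stats

def ssr(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(7)
n_a, n_b, k = 150, 150, 2
Xa = np.column_stack([np.ones(n_a), rng.normal(size=n_a)])
Xb = np.column_stack([np.ones(n_b), rng.normal(size=n_b)])
ya = Xa @ np.array([1.0, 1.0]) + rng.normal(size=n_a)
yb = Xb @ np.array([1.0, 2.0]) + rng.normal(size=n_b)     # slope changes after the break

ssr_pooled = ssr(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))
ssr_split = ssr(Xa, ya) + ssr(Xb, yb)
F = ((ssr_pooled - ssr_split) / k) / (ssr_split / (n_a + n_b - 2 * k))
print(F, 1 - stats.f.cdf(F, k, n_a + n_b - 2 * k))         # statistic and p-value
```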
G023. I
Detecting Structural Change
G023. I
Example: Structural Break in Temperature
• We can test whether the slope after 1900 is different from the
general one:
G023. I
Estimation in non-linear regression models
E[Y |X = x] = g(x, θ)
Y = g(x, θ) + ε
E[ε|X = x] = 0.
G023. I
Numerical optimisation: Newton’s method and variants
θ̂ = arg min_{θ*} Q(θ*).
G023. I
Numerical optimisation: Newton’s method and variants
G023. I
Numerical Optimisation: Example
• Function: y = sin(x/10) · x².
• This function has infinitely many local minima.
• Start off the nonlinear optimisation at various points.
[Figure: y = sin(x/10) · x² over x ∈ [0, 300], with optimisation runs started at x = 0.5, 50 and 150 converging to different local minima.]
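A sketch of the experiment (using scipy's BFGS quasi-Newton routine as one of the "variants"; this is my choice, not the exact routine in the notes): each starting value can end in a different local minimum.

```python
# Sketch: local minima of Q(x) = sin(x/10) * x^2 found from different starting values.
import numpy as np
from scipy.optimize import minimize

def Q(x):
    # objective from the slide
    return np.sin(x[0] / 10.0) * x[0] ** 2

for x0 in [0.5, 50.0, 150.0]:
    res = minimize(Q, x0=np.array([x0]), method="BFGS")
    print(x0, round(float(res.x[0]), 2), round(float(res.fun), 1))
```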
G023. I
Approximate Inference
Approximate Inference
• The results set out in the previous notes let us make inferences
about coefficients of regression functions, β, when the distribution
of y given X is Gaussian (normal) and the variance of
the unobservable disturbances is known.
• In practice the normal distribution at best holds approximately
and we never know the value of the nuisance parameter σ. So
how can we proceed?
• The most common approach and the one outlined here involves
employing approximations to exact distributions.
• They have the disadvantage that they can be inaccurate and
the magnitude of the inaccuracy can vary substantially from
case to case. They have the following advantages:
G023. II
Approximate Inference
G023. II
Convergence in probability
G023. II
Convergence in Probability
G023. II
Convergence in Probability
G023. II
Convergence in distribution
P[T ≤ t] = FT(t).
Tn →d T.
• The definition applies for vector and scalar random variables.
In this situation we will also talk in terms of Tn converging in
distribution to (the random variable) T .
G023. II
Convergence in Distribution
G023. II
Convergence in Distribution: Example
G023. II
Approximate Inference: Some Thoughts
G023. II
Functions of statistics - Slutsky’s Theorem
G023. II
Limit theorems
n^{1/2}(Ȳn − µ) →d Z,  Z ∼ N(0, Ω).
then
n^{1/2}(Ȳn − µ̄n) →d Z,  Z ∼ N(0, Ω).
G023. II
Limit Theorem: Example
• Start with n uniform random variables {Yi }ni=1 over [0, 1].
• Denote by Ȳn the mean of Yi based on a sample of size n.
• The graph plots the distribution of n1/2 (Ȳn − 0.5):
[Figure: densities of n^{1/2}(Ȳn − 0.5) for n = 1, 10, 100, approaching a normal density as n grows.]
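The figure can be reproduced with a few lines of simulation (a sketch; the number of replications is arbitrary). Here the empirical 97.5% quantile of n^{1/2}(Ȳn − 0.5) is compared with the normal-approximation value 1.96 · (1/12)^{1/2}:

```python
# Sketch: central limit theorem for means of Uniform[0, 1] draws.
import numpy as np

rng = np.random.default_rng(8)
reps = 200_000
sd = np.sqrt(1 / 12)                              # standard deviation of a Uniform[0, 1]
for n in [1, 10, 100]:
    ybar = rng.uniform(size=(reps, n)).mean(axis=1)
    t = np.sqrt(n) * (ybar - 0.5)
    # empirical 97.5% quantile of t versus the normal-approximation value 1.96 * sd
    print(n, np.quantile(t, 0.975).round(3), round(1.96 * sd, 3))
```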
G023. II
Approximate Distribution Of The Ols Estimator
then
E[β̂n |Xn ] = β
and
Var[β̂n|Xn] = σ²(Xn′Xn)^{-1} = n^{-1}σ²(n^{-1}Xn′Xn)^{-1} = n^{-1}σ²(n^{-1} Σ_{i=1}^n xi xi′)^{-1}.
G023. II
OLS Estimator: Limiting distribution
Sn = β̂n − β
Tn = n1/2 Sn
= n1/2 (β̂n − β)
= (n−1 Xn0 Xn )−1 n−1/2 Xn0 εn .
Assuming (n^{-1}Xn′Xn)^{-1} →p Σxx^{-1}, consider the term
n^{-1/2} Xn′εn = n^{-1/2} Σ_{i=1}^n xi εi.
G023. II
OLS Estimator: Limiting distribution
G023. II
OLS Estimator: Limiting Distribution
• Now
Tn′Tn = (n/σ²) (Rβ̂n − r)′ (R(n^{-1}Xn′Xn)^{-1}R′)^{-1} (Rβ̂n − r)
where we have used
Pn′Pn = (R(n^{-1}Xn′Xn)^{-1}R′)^{-1}.
G023. II
Approximate Distribution of the GLS Estimator
y = Xβ + ε
E[ε|X] = 0
V ar[ε|X] = Ω
• The GLS estimator β̃ = (X′Ω^{-1}X)^{-1}X′Ω^{-1}y is BLU, and
when y given X is normally distributed:
β̃ ∼ N(β, (X′Ω^{-1}X)^{-1}).
G023. II
Approximate Distribution of the GLS Estimator
where Mi0 is the ith row of M and Mii is the (i, i) element of
M . This simplification follows from the diagonality of Ω and
the idempotency of M . We can therefore write
ε̂i² / Mii = f(xi, γ) + ui
where E[ui |X] = 0, and under suitable conditions a nonlinear
least squares estimation will produce a consistent estimator of
γ, leading to a consistent estimator of Ω.
G023. II
Approximate Distribution of M-Estimators
G023. II
Approximate Distribution of M-Estimators
• To obtain the limiting distribution of n^{1/2}(θ̂n − θ0), consider
situations in which the M-estimator can be defined as the
unique solution to first order conditions
Uθ(Zn, θ̂n) = 0  where  Uθ(Zn, θ̂n) = ∂U(Zn, θ)/∂θ |θ=θ̂n
This is certainly the case when U(Zn, θ) is concave.
• We first consider a Taylor series expansion of Uθ(Zn, θ) regarded
as a function of θ around θ = θ0, as follows:
0 = Uθ(Zn, θ̂n) = Uθ(Zn, θ0) + Uθθ(Zn, θ0)(θ̂n − θ0) + R(θ̂n, θ0, Zn)
where
Uθθ(Zn, θ) = ∂²U(Zn, θ) / ∂θ∂θ′.
The remainder term, R(θ̂n, θ0, Zn), involves the third derivatives
of U(Zn, θ) and in many situations converges in probability
to zero as n becomes large. This allows us to write:
Uθ(Zn, θ0) + Uθθ(Zn, θ0)(θ̂n − θ0) ≃ 0
and then
θ̂n − θ0 ≃ −Uθθ(Zn, θ0)^{-1} Uθ(Zn, θ0).
Equivalently:
n^{1/2}(θ̂n − θ0) ≃ −(n^{-1}Uθθ(Zn, θ0))^{-1} n^{-1/2} Uθ(Zn, θ0).
G023. II
Approximate Distribution of M-Estimator
when Yi = xi′θ0 + εi and the εi’s are independently distributed
with expected value zero and common variance σ0².
U(Zn, θ) = −Σ_{i=1}^n (Yi − xi′θ)²
n^{-1/2} Uθ(Zn, θ) = 2 n^{-1/2} Σ_{i=1}^n (Yi − xi′θ) xi
n^{-1} Uθθ(Zn, θ) = −2 n^{-1} Σ_{i=1}^n xi xi′
and, defining ΣXX ≡ plim_{n→∞} n^{-1} Σ_{i=1}^n xi xi′:
A(θ0) = −2 ΣXX
which does not depend upon θ0 in this special case,
B(θ0) = 4 σ0² ΣXX
A(θ0)^{-1} B(θ0) A(θ0)^{-1}′ = σ0² ΣXX^{-1}
and finally the OLS estimator has the following limiting normal
distribution:
n^{1/2}(θ̂n − θ0) →d N(0, σ0² ΣXX^{-1}).
G023. II
Approximate distributions of functions of estimators
the “delta method”
G023. II
n^{1/2}(θ̂ − θ0) →d N(0, Ω)  and  (θ̂ − θ0) →p 0.
Delta Method: Example
G023. II
Maximum Likelihood Methods
Maximum Likelihood Methods
G023. III
Estimating a Probability
P[Y1 = y1, . . . , Yn = yn] = Π_{i=1}^n p^{yi}(1 − p)^{1−yi} = L(p; y).
With any set of data L(p; y) can be calculated for any value of
p between 0 and 1. The result is the probability of observing
the data to hand for each chosen value of p.
• One strategy for estimating p is to use that value that max-
imises this probability. The resulting estimator is called the
maximum likelihood estimator (MLE) and the maximand, L(p; y),
is called the likelihood function.
G023. III
Log Likelihood Function
• The maximum of the log likelihood function, l(p; y) = log L(p, y),
is at the same value of p as is the maximum of the likelihood
function (because the log function is monotonic).
• It is often easier to maximise the log likelihood function (LLF).
For the problem considered here the LLF is
à n ! n
X X
l(p; y) = yi log p + (1 − yi ) log(1 − p).
i=1 i=1
Let
p̂ = arg maxL(p; y) = arg maxl(p; y).
p p
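A quick numerical check (illustrative; simulated Bernoulli data): maximising l(p; y) gives p̂ equal to the sample mean of the yi.

```python
# Sketch: maximising the Bernoulli log likelihood numerically.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(9)
y = rng.binomial(1, 0.3, size=1000)

def neg_loglik(p):
    return -(y.sum() * np.log(p) + (len(y) - y.sum()) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, y.mean())          # the two agree
```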
G023. III
Likelihood Functions and Estimation in General
Here only the joint density function depends upon θ and the
value of θ that maximises f (y1 , . . . , yn , θ) also maximises A.
• In this case the likelihood function is defined to be the joint
density function of the Yi ’s.
• When the Yi ’s are discrete random variables the likelihood func-
tion is the joint probability mass function of the Yi ’s, and in
cases in which there are discrete and continuous elements the
likelihood function is a combination of probability density ele-
ments and probability mass elements.
• In all cases the likelihood function is a function of the observed
data values that is equal to, or proportional to, the probability
of observing these particular values, where the constant of pro-
portionality does not depend upon the parameters which are
to be estimated.
G023. III
Likelihood Functions and Estimation in General
G023. III
Invariance
• The reason for this is that we omit the infinitesimals dy1 , . . . dyn
from the likelihood function for continuous variates and these
change when we move from y to z because they are denomi-
nated in the units in which y or z are measured.
G023. III
Maximum Likelihood: Properties
G023. III
Maximum Likelihood: Improving Numerical Properties
G023. III
Properties Of Maximum Likelihood Estimators
n^{1/2}(θ̂ − θ0) →d N(0, V0)
where
V0 = −plim_{n→∞} (n^{-1} lθθ(θ0; Y))^{-1}
and θ0 is the unknown parameter value. To get an ap-
proximate distribution that can be used in practice we use
(n−1 lθθ (θ̂; Y ))−1 or some other consistent estimator of V0
in place of V0 .
G023. III
Properties Of Maximum Likelihood Estimators
n^{1/2}(θ̂ − θ0) →d N(0, A(θ0)^{-1} B(θ0) A(θ0)^{-1}′).
G023. III
Maximum Likelihood: Limiting Distribution
G023. III
Maximum Likelihood: Limiting Distribution
Differentiating again,
∂/∂θ′ ∫ lθ(θ; y) L(θ; y) dy = ∫ (lθθ′(θ; y) L(θ; y) + lθ(θ; y) Lθ′(θ; y)) dy
= ∫ (lθθ′(θ; y) + lθ(θ; y) lθ(θ; y)′) L(θ; y) dy
= E[lθθ′(θ; Y) + lθ(θ; Y) lθ(θ; Y)′]
= 0.
and so
E[lθ(θ; Y) lθ(θ; Y)′] = −E[lθθ′(θ; Y)]
giving
B(θ0) = −plim_{n→∞} n^{-1} lθθ′(θ0; Y).
The matrix
I(θ) = −E[lθθ(θ; Y)]
plays a central role in likelihood theory - it is called the Information
Matrix.
Finally, because B(θ0) = −A(θ0),
A(θ)^{-1} B(θ) A(θ)^{-1}′ = −(plim_{n→∞} n^{-1} lθθ′(θ; Y))^{-1}.
• Of course a number of conditions are required to hold for the
results above to hold. These include the boundedness of third
order derivatives of the log likelihood function, independence or
at most weak dependence of the Yi ’s, existence of moments of
derivatives of the log likelihood, or at least of probability limits
of suitably scaled versions of them, and lack of dependence of
the support of the Yi ’s on θ.
• The result in equation (4) above leads, under suitable conditions
concerning convergence, to
plim_{n→∞} (n^{-1} lθ(θ; Y) lθ(θ; Y)′) = −plim_{n→∞} (n^{-1} lθθ′(θ; Y)).
This gives an alternative way of “estimating” V0, namely
V̂0^o = {n^{-1} lθ(θ̂; Y) lθ(θ̂; Y)′}^{-1}
which compared with
Ṽ0^o = {−n^{-1} lθθ′(θ̂; Y)}^{-1}
has the advantage that only first derivatives of the log likelihood
function need to be calculated. Sometimes V̂0^o is referred
to as the “outer product of gradient” (OPG) estimator.
Both these estimators use the “observed” values of functions of
derivatives of the LLF. It may be possible to derive
explicit expressions for the expected values of these functions.
Then one can estimate V0 by
V̂0^e = {E[n^{-1} lθ(θ; Y) lθ(θ; Y)′]|θ=θ̂}^{-1}
= {−E[n^{-1} lθθ′(θ; Y)]|θ=θ̂}^{-1}.
These two sorts of estimators are sometimes referred to as “observed
information” (V̂0^o, Ṽ0^o) and “expected information” (V̂0^e)
estimators.
• Maximum likelihood estimators possess an optimality property,
namely that, among the class of consistent and asymptotically
normally distributed estimators, the variance matrix of their
limiting distribution is the smallest that can be achieved in the
sense that other estimators in the class have limiting distribu-
tions with variance matrices exceeding the MLE’s by a positive
semidefinite matrix.
G023. III
Estimating a Conditional Probability
G023. III
Estimating a Conditional Probability
• Both models are widely used. Note that in both cases a single
index model is specified, the probability functions are monotonic
increasing, probabilities arbitrarily close to zero or one are ob-
tained when x0 θ is sufficiently large or small, and there is a
symmetry in both of the models in the sense that p(−x, θ) =
1 − p(x, θ).
• Any or all of these properties might be inappropriate in a par-
ticular application but there is rarely discussion of this in the
applied econometrics literature.
G023. III
More on Logit and Probit
Yi∗ = Xi θ + εi
pi = P (Yi = 1) = P (Yi∗ ≥ 0)
= P (Xi θ + εi ≥ 0)
= P (εi ≥ −Xi θ)
= 1 − Fε (−Xi θ)
G023. III
Shape of Logit and Probit Models
G023. III
Odds-Ratio
G023. III
Marginal Effects
• Logit model:
∂pi/∂Xi = [θ exp(Xiθ)(1 + exp(Xiθ)) − θ exp(Xiθ)²] / (1 + exp(Xiθ))²
= θ exp(Xiθ) / (1 + exp(Xiθ))²
= θ pi (1 − pi)
• Probit model:
∂pi/∂Xi = θ φ(Xiθ)
A one unit increase in X leads to an increase in the probability
of choosing option 1 of θ φ(Xiθ).
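A small sketch evaluating these marginal-effect formulas at a few values of the index Xθ (the values of θ and of the index below are arbitrary):

```python
# Sketch: logit and probit marginal effects at chosen values of the index X*theta.
import numpy as np
from scipy import stats

theta = 0.5
xb = np.array([-2.0, 0.0, 2.0])                 # illustrative values of X*theta

p_logit = 1 / (1 + np.exp(-xb))
me_logit = theta * p_logit * (1 - p_logit)      # theta * p * (1 - p)
me_probit = theta * stats.norm.pdf(xb)          # theta * phi(X*theta)
print(me_logit.round(4), me_probit.round(4))    # largest where the index is near zero
```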
G023. III
Maximum Likelihood in Single Index Models
lθ(θ; y) = Σ_{i=1}^n [ yi gw(xi′θ) xi / g(xi′θ) − (1 − yi) gw(xi′θ) xi / (1 − g(xi′θ)) ]
= Σ_{i=1}^n (yi − g(xi′θ)) gw(xi′θ) xi / [g(xi′θ)(1 − g(xi′θ))]
G023. III
Asymptotic Properties of the Probit Model
g(w) = Φ(w)
gw(w) = φ(w)
⇒ gw(w) / [g(w)(1 − g(w))] = φ(w) / [Φ(w)(1 − Φ(w))].
Therefore in the probit model the MLE satisfies
Σ_{i=1}^n (yi − Φ(xi′θ̂)) φ(xi′θ̂) xi / [Φ(xi′θ̂)(1 − Φ(xi′θ̂))] = 0,
G023. III
Example: Logit and Probit
Concerni = β0 + β1 agei + β2 sexi + β3 log incomei + β4 smelli + ui
G023. III
Multinomial Logit
pj = exp(Xθj) / Σ_{k=1}^J exp(Xθk)
G023. III
Identification
G023. III
Independence of Irrelevant Alternatives
After a few minutes the waitress returns and says that they also
have cherry pie at which point Morgenbesser says ”In that case I’ll
have the blueberry pie.”
G023. III
Independence of Irrelevant Alternatives
• However, the IIA implies that odds ratios are the same whether
or not another alternative exists. The only probabilities for
which the three odds ratios are equal to one are:
G023. III
Marginal Effects: Multinomial Logit
G023. III
Example
G023. III
Ordered Models
G023. III
Ordered Probit
G023. III
Ordered Probit
• Marginal Effects:
∂P(Yi = 0)/∂Xi = −θ φ(−Xi′θ)
∂P(Yi = 1)/∂Xi = θ (φ(Xi′θ) − φ(µ − Xi′θ))
∂P(Yi = 2)/∂Xi = θ φ(µ − Xi′θ)
• Note that if θ > 0, ∂P (Yi = 0)/∂Xi < 0 and ∂P (Yi = 2)/∂Xi >
0:
– If Xi has a positive effect on the latent variable, then by
increasing Xi , fewer individuals will stay in category 0.
– Similarly, more individuals will be in category 2.
– In the intermediate category, the fraction of individuals will
either increase or decrease, depending on the relative size
of the inflow from category 0 and the outflow to category 2.
G023. III
Ordered Probit: Example
G023. III
Ordered Probit: Example
G023. III
Tobit Model
• First proposed by Tobin (1958).²
• We define a latent (unobserved) variable Y* such that:
Y* = Xβ + ε,  ε ∼ N(0, σ²)
G023. III
² Tobin, J. (1958), “Estimation of Relationships for Limited Dependent Variables”, Econometrica 26, 24–36.
Truncation Bias
With α ≡ (a − Xβ)/σ:
E[Y*|Y* > a, X] = ∫_a^{+∞} y h(y|Y* > a, X) dy
= 1/(σ(1 − Φ(α))) ∫_a^{+∞} y φ((y − Xβ)/σ) dy
= 1/(1 − Φ(α)) ∫_{(a−Xβ)/σ}^{+∞} (Xβ + σz) φ(z) dz
= Xβ − σ/(1 − Φ(α)) ∫_{(a−Xβ)/σ}^{+∞} φ′(z) dz
= Xβ + σ φ(α)/(1 − Φ(α))
G023. III
Tobit Model: Marginal Effects
G023. III
Likelihood for Tobit Model
G(y|X, β, σ) = P(Y ≤ y|X)
= P(Y ≤ y|X, Y > a) P(Y > a|X)
  + P(Y ≤ y|X, Y = a) P(Y = a|X)
= I(y > a) H(y|Y > a, X) (1 − Φ((a − Xβ)/σ))
  + I(y = a) Φ((a − Xβ)/σ)
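A sketch of the corresponding log likelihood with censoring from below at a = 0, maximised numerically (the parameterisation, optimiser and starting values are my own choices, not from the notes):

```python
# Sketch: Tobit log likelihood with censoring at a = 0, maximised numerically.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(10)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.0
y_star = X @ beta_true + rng.normal(scale=sigma_true, size=n)
y = np.maximum(y_star, 0.0)                        # observed outcome, censored at a = 0

def neg_loglik(params):
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)                      # enforce sigma > 0
    xb = X @ beta
    cens = (y <= 0.0)
    ll = np.where(cens,
                  stats.norm.logcdf(-xb / sigma),                        # P(Y* <= 0)
                  stats.norm.logpdf((y - xb) / sigma) - np.log(sigma))   # density part
    return -ll.sum()

res = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(res.x[:2], np.exp(res.x[2]))                 # close to beta_true and sigma_true
```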
G023. III
Example: WTP
OLS Tobit
Variable Estimate t-stat Estimate t-stat Marginal effect
lny 2.515 2.74 2.701 2.5 2.64
age -.1155 -2.00 -.20651 -3.0 -0.19
sex .4084 0.28 .14084 0.0 .137
smell -1.427 -0.90 -1.8006 -0.9 -1.76
constant -4.006 -0.50 -3.6817 -0.4
G023. III
Models for Count Data
and
P[Y = j] = (m! / (j!(m − j)!)) p^j (1 − p)^{m−j},  j ∈ {0, 1, 2, . . . , m}
G023. III
Models for Count Data
G023. III
Models for Count Data
G023. III
Models for Count Data
where, note, the first term has expected value zero. Therefore
the Information Matrix for this conditional Poisson model is
I(θ) = Σ_{i=1}^n (λw(xi′θ)² / λ(xi′θ)) xi xi′.
with V0 estimated by
V̂0 = (n^{-1} Σ_{i=1}^n (λw(xi′θ̂)² / λ(xi′θ̂)) xi xi′)^{-1}.
G023. III
Likelihood Based Hypothesis
Testing
Likelihood Based Hypothesis Testing
G023. IV
Likelihood Based Hypothesis Testing
G023. IV
Test of Hypothesis
G023. IV
Wald Test
SW = n θ̂2′ Ŵ22^{-1} θ̂2
G023. IV
The Score - or Lagrange Multiplier - test
G023. IV
The Score - or Lagrange Multiplier - test
• The score test considers the gradient of the log likelihood function
evaluated at the restricted estimate, θ̂^R = (θ̂1R′, 0′)′,
and examines the departure from zero of that part of the gradient
of the log likelihood function that is associated with θ2.
• Here θ̂1R is the MLE of θ1 when θ2 is restricted to be zero. If
the unknown value of θ2 is in fact zero then this part of the
gradient should be close to zero. The score test statistic is
G023. IV
Likelihood ratio tests
G023. IV
Specification Testing
G023. IV
Detecting Heteroskedasticity
G023. IV
Detecting Heteroskedasticity
lα(θ̂^R; y|x) = −(1/2) Σ_{i=1}^n zi + (1/(2σ̂²)) Σ_{i=1}^n (yi − xi′β̂)² zi
= (1/(2σ̂²)) Σ_{i=1}^n (ε̂i² − σ̂²) zi.
G023. IV
Information Matrix Tests
E(ε|X) ≠ 0
G023. V
Simultaneity
[Diagram: X has a direct effect on Y, and Y feeds back onto X (indirect effect).]
G023. V
Examples
[Diagram: examples of feedback, e.g. higher revenues leading to reduced hours of work.]
G023. V
Implications of Simultaneity
•
Yi = β0 + β1 Xi + ui (direct effect)
Xi = α0 + α1 Yi + vi (indirect effect)
• Replacing the second equation in the first one, we get an equa-
tion expressing Yi as a function of the parameters and the error
terms ui and vi only. Substituting this into the second equa-
tion, we get Xi also as a function of the parameters and the
error terms:
Yi = (β0 + β1α0)/(1 − α1β1) + (β1vi + ui)/(1 − α1β1) = B0 + ũi
Xi = (α0 + α1β0)/(1 − α1β1) + (vi + α1ui)/(1 − α1β1) = A0 + ṽi
G023. V
What are we estimating?
• So
– E[β̂1] ≠ β1
– E[β̂1] ≠ α1
– E[β̂1] ≠ an average of β1 and α1.
G023. V
Identification
G023. V
Example
W = α0 + α1 S + α2 Z + ε1
S = β0 + β1 Z + ε2
E[ε1 |Z = z] = 0
E[ε2 |Z = z] = 0
but not
E[ε1 |S = s, Z = z] = 0
unless ε1 was believed to be uncorrelated with ε2 .
• Considering just the first (W ) equation,
E[W |S = s, Z = z] = α0 + α1 s + α2 z + E[ε1 |S = s, Z = z]
G023. V
Reduced Form Equations
W = (α0 + α1 β0 ) + (α1 β1 + α2 ) Z + ε1 + α1 ε2
S = β0 + β1 Z + ε2
G023. V
Identification using an Exclusion Restriction
G023. V
Indirect Least Squares Estimation
W = π01 + π11 Z + U1
S = π02 + π12 Z + U2
where
π01 = α0 + α1 β0 π11 = α1 β1
π02 = β0 π12 = β1
U1 = ε1 + α1 ε2 U2 = ε2
and
E[U1 |Z = z] = 0 E[U2 |Z = z] = 0
solving these equations, given values of the π̂’s, for values of the α̂’s and β̂’s, as follows:
α̂0 = π̂01 − π̂02 (π̂11 /π̂12 ) α̂1 = π̂11 /π̂12
β̂0 = π̂02 β̂1 = π̂12
G023. V
Over Identification
Y = Xβ + ε
E[Y − Xβ|Z = z] = 0
E[Z 0 (Y − Xβ) |Z = z] = 0
E[Z 0 (Y − Xβ)] = 0.
and thus
E[Z 0 Y ] = E[Z 0 X]β.
G023. V
Generalised Method of Moments estimation
G023. V
Generalised Instrumental Variables Estimation
• The first order conditions for this problem, satisfied by β̂n are:
G023. V
GIVE: Asymptotic Properties
and if
with ΣZZ having full rank (m) and ΣXZ having full rank (k)
then
plim β̂n = β
n→∞
and we have a consistent estimator.
G023. V
GIVE Asymptotic Properties
and so
n^{1/2}(β̂ − β) →d N(0, σ² (ΣXZ ΣZZ^{-1} ΣZX)^{-1}).
G023. V
GIVE and Two Stage OLS
X = ZΦ + V
Note that
X̂n′X̂n = Xn′Zn(Zn′Zn)^{-1}Zn′Xn
and
X̂n′yn = Xn′Zn(Zn′Zn)^{-1}Zn′yn.
So the Generalised Instrumental Variables Estimator can be
written as
β̂n = (X̂n′X̂n)^{-1} X̂n′yn,
that is, as the OLS estimator of the coefficients of a linear
relationship between yn and the predicted values of Xn obtained
from OLS estimation of a linear relationship between Xn and the
instrumental variables Zn.
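A sketch of this two-stage construction on simulated data (illustrative; the instruments, coefficients and endogeneity structure below are assumptions): OLS is inconsistent while the GIVE/2SLS estimator recovers the true coefficients.

```python
# Sketch: GIVE / two-stage least squares via first-stage fitted values.
import numpy as np

rng = np.random.default_rng(11)
n = 5000
z = rng.normal(size=(n, 2))                        # two instruments
v = rng.normal(size=n)
u = 0.8 * v + rng.normal(size=n)                   # error correlated with v -> endogeneity
x = 1.0 + z @ np.array([1.0, -0.5]) + v            # endogenous regressor
y = 2.0 + 1.5 * x + u

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)      # first-stage fitted values
beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols, beta_2sls)                         # OLS biased upward, 2SLS close to [2, 1.5]
```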
G023. V
Examples: Measurement Errors
Yi = β0 + β1 Xi + ui
β0 = 0, β1 = 1
• Results:
Method Estimate of β1
OLS regressing Y on X̌1 0.88
OLS regressing Y on X̌2 0.68
IV, using X̌2 as instrument 0.99
G023. V