EC221: Principles of Econometrics
Introducing Lent
Dr M. Schafgans
Lent 2022
The error term, εi, accounts for this discrepancy. It reflects our ignorance (everything we haven't modelled).
Aim: we want to estimate the causal effect β1 (the effect of X on Y holding other factors, lumped into ε, fixed - ceteris paribus).
Bivariate Regression Analysis - recap II
Y = β0 + β1 X + ε
Terminology
Y is the dependent variable
X the independent, explanatory variable
ε is the error term (our ignorance), E(ε) = 0
Y = β0 + β1 X1 + β2 X2 + ε
E(ε | X1, X2) = E(ε | X) = 0.
This guarantees that there is no correlation between the errors (ε) and the regressors (X1, X2).
∂E(Y | X)/∂X1 = ∂(β0 + β1 X1 + β2 X2)/∂X1 = β1
Yi = β0 + β1 Xi + εi, i = 1, .., n.
Terminology:
β̂0 and β̂1 are estimators (random variables). Realisations of these estimators for a particular sample are called estimates.
∂/∂b0: −2 ∑_{i=1}^n (Yi − β̂0 − β̂1 Xi) = 0
∂/∂b1: −2 ∑_{i=1}^n (Yi − β̂0 − β̂1 Xi) Xi = 0
β̂0 = Ȳ − β̂1 X̄
β̂1 = ∑_{i=1}^n (Yi − Ȳ)(Xi − X̄) / ∑_{i=1}^n (Xi − X̄)² = Sample Cov(Yi, Xi) / Sample Var(Xi)
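Illustration (added, not from the original slides): a minimal NumPy sketch of these two formulas on simulated data. The data-generating process and variable names are my own assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)                    # regressor
eps = rng.normal(scale=2.0, size=n)       # error term
Y = 1.0 + 0.5 * X + eps                   # assumed true beta0 = 1, beta1 = 0.5

# From the first-order conditions: slope = sample Cov(Y, X) / sample Var(X)
beta1_hat = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
print(beta0_hat, beta1_hat)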
Unbiasedness: E(β̂1) = β1
Is the estimator under repeated sampling correct on average? Unbiasedness ensures that we do not make systematic errors when estimating a parameter.
Variability: Var(β̂1) indicates how dispersed the realisations of β̂1 are under repeated sampling. Is our estimator efficient?
Standard errors of our OLS estimators are defined as VN1.2
SE(β̂1) = √(Var̂(β̂1))
We will formally discuss the need and use of robust standard errors in empirical research.
Bivariate Regression Analysis - recap VI
Matrix that contains the variances on the diagonal and all covariances on the off-diagonal. VN1.4
Var(ε) = E[(ε − E(ε))(ε − E(ε))'], where ε − E(ε) is the n × 1 vector with elements εi − E(εi) (so the outer product is n × n).
Var(ε) is thus the n × n matrix with the Var(εi) on the diagonal and the Cov(εi, εj) on the off-diagonal.
EC221: Principles of Econometrics
Multiple Linear Regression
Dr M. Schafgans
Lent 2022
yi = α + β1 xi1 + β2 xi2 + εi, i = 1, ..., n
OLS procedure
(α̂, β̂1, β̂2) = arg min_{a,b1,b2} ∑_{i=1}^n (yi − a − b1 xi1 − b2 xi2)²   VN2.1
α̂ = ȳ − β̂1 x̄1 − β̂2 x̄2, easy to obtain from the 1st equation. Can you express β̂j in terms of {(yi, xi1, xi2)}_{i=1}^n only using plain calculus?
Special case of the general normal equations:
∑_{i=1}^n ε̂i = 0
∑_{i=1}^n xi2 ε̂i = 0
⋮
∑_{i=1}^n xik ε̂i = 0
which we solve for β̂1, β̂2, ..., β̂k, where ε̂i = yi − β̂1 − β̂2 xi2 − ... − β̂k xik.
Gauss-Markov Assumption
yi = β1 + β2 xi2 + β3 xi3 + εi where xi3 = 2xi2
is observationally equivalent to yi = β1 + β2* xi2 + εi where β2* = β2 + 2β3 (perfect multicollinearity: β2 and β3 are not separately identified).
E(y|X) = E(Xβ + ε|X) = Xβ + E(ε|X) = Xβ.
As
∂E(y|X)/∂xj = ∂(β1 x1 + ... + βk xk)/∂xj = βj
βj provides the marginal effect of the explanatory variable xj on the conditional expectation of y, ceteris paribus (holding everything else constant).
Convenient assumption:
First pretend we can keep X fixed – we do this by conditioning on X (impose E(ε|X) = 0 (stronger) instead of E(ε) = 0)
Deal with the stochastic nature of the regressors afterwards (use the Law of Iterated Expectations)
Theorem (Law of Iterated Expectations)
E(Y) = E[E(Y|X)]; hence E(ε|X) = 0 implies E(ε) = 0.
Var(ε|X) = σ² I
Graphically:
S(b) = ∑_{i=1}^n (yi − b1 xi1 − b2 xi2 − ... − bk xik)²
The normal equations (FOC) are:
∑_{i=1}^n xi1 ε̂i = 0
⋮
∑_{i=1}^n xik ε̂i = 0
where ε̂i = yi − β̂1 xi1 − β̂2 xi2 − ... − β̂k xik
X'ε̂ = 0   (dimensions: k×n times n×1 gives k×1)
β̂ = (X'X)⁻¹ X'y  ⇐  our OLS estimator for β!
In the simple linear regression model, we can only estimate the slope provided ∑_{i=1}^n (xi − x̄)² > 0.
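Illustration (added, not from the original slides): a NumPy sketch of the matrix OLS formula, solving the normal equations X'ε̂ = 0 directly; the simulated design and names are my own assumptions.

import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # includes an intercept
beta = np.array([1.0, 0.5, -2.0])
y = X @ beta + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
print(beta_hat)
print(X.T @ resid)   # X'e_hat: numerically zero by construction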
Recap VN2.8
(1/n) ∑_{i=1}^n xis ε̂i = 0 for s = 1, .., k
β̂ = (X'X)⁻¹ X'y
Theorem
Under A.1-A.3, β̂ is unbiased, i.e., E(β̂) = β
β̂ = (X'X)⁻¹ X'y is properly defined under A.2
Theorem
Under A.1-A.4, Var(β̂|X) = σ²(X'X)⁻¹
Question: What does Var(β̂|X) = σ²(X'X)⁻¹ tell us about aspects that will help lower the variability of our estimates under repeated sampling (improve the precision of our parameter estimates)?
Variance of β̂ – Simple linear regression model
In PS2, Q1, you are asked to show that Var(β̂|X) = σ²(X'X)⁻¹, when applied to the simple linear regression setting
yi = β1 + β2 xi + εi,
yields the following variance of the slope parameter: VN2.14
Var(β̂2|X) = σ² / ∑_{i=1}^n (xi − x̄)² = σ² / (n sx²)
with sx² = (1/n) ∑_{i=1}^n (xi − x̄)² > 0 (sample variance of the regressor)
Reveals that we can increase the precision (reduce the variability of our estimator in repeated sampling) by:
1 Increasing the sample size (n).
2 Having a greater variability (excitation) of the regressors (sx²).
3 Having a lower variance of the errors (σ²).
Variance of β̂ – Near Multicollinearity
Var(β̂j|X) = σ² / [∑_{i=1}^n (xij − x̄j)² (1 − r²_{x1,x2})] = σ² / [n sxj² (1 − r²_{x1,x2})]
with sxj² = (1/n) ∑_{i=1}^n (xij − x̄j)² and r_{x1,x2} the sample correlation
This shows that as r²_{x1,x2} → 1 the variances become very large! Add:
Ensure regressors are not highly collinear
Variance of β̂ – Estimation and Standard errors
To estimate the variance of β̂ we need to replace the unknown error variance σ² by an estimate.
The (unbiased) estimator for σ² used is given by¹
s² = 1/(n − k) ∑_{i=1}^n ε̂i² = ε̂'ε̂/(n − k),   E(s²) = σ²
¹ In Wooldridge (2013) you will find the formula s² = SSR/(n − k − 1). There, the number of parameters equals k (slopes) + 1 (intercept). Here it equals k.
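Illustration (added, not from the original slides): a sketch of how s² and the conventional standard errors would be computed from (X, y); the function name is my own.

import numpy as np

def ols_se(X, y):
    """OLS coefficients, s^2 = e'e/(n - k) and conventional standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)            # unbiased estimator of sigma^2
    se = np.sqrt(s2 * np.diag(XtX_inv))     # square roots of the diagonal of s^2 (X'X)^{-1}
    return beta_hat, s2, se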
Finite Sample Properties β̂ – Efficiency: BLUE
Theorem
Gauss-Markov Theorem: Under A.1-A.4, the OLS estimator β̂ is the best linear unbiased estimator (BLUE) of β.
Proof: Let us show that β̂ = (X'X)⁻¹ X'y is BLUE and assume (for simplicity) that X is fixed (the result holds when X is stochastic too). We use the following three steps, details VN2.16-17.
Var(c'β̂) ≤ Var(c'β̃)
The BLUE of c'β is c'β̂!
Each coefficient βj is estimated at least as efficiently by β̂j as by any other linear unbiased estimator.
c = (0, ..., 0, 1, 0, ..., 0)' with the 1 in the j-th position  ⇒  c'β = βj
Theorem
Under A.1-A.4 and A.5, β̂|X ~ N(β, σ²(X'X)⁻¹)
The joint density of the observations y1, y2, ..., yn (which defines the likelihood function) is [using independence]:
L(β, σ²) ≡ f(y1, ..., yn; β, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp(−(yi − xi'β)²/(2σ²))
The MLE chooses β and σ² so as to maximize the likelihood, or, equivalently, the log-likelihood:
log L(β, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) ∑_{i=1}^n (yi − xi'β)²
∂/∂β: −(1/(2σ̂²_MLE)) ∂S(β)/∂β |_{β̂_MLE} = 0
∂/∂σ²: −n/(2σ̂²_MLE) + (1/(2σ̂⁴_MLE)) S(β̂) = 0
Under normality: β̂_MLE = β̂_OLS = β̂_MME
The last FOC yields σ̂²_MLE = (1/n) ∑_{i=1}^n ε̂i² = ε̂'ε̂/n = σ̂²_MME
X'y − X'Xβ̂ = X'(y − Xβ̂) = X'ε̂ = 0.
These normal equations imply that the residuals and regressors are orthogonal to each other.
Recall: the vectors w and z are orthogonal iff w'z = 0.
Model: y = Xβ + ε
Estimator: β̂ = (X'X)⁻¹ X'y
Fitted values: ŷ = Xβ̂ = X(X'X)⁻¹ X'y
Residuals: ε̂ = y − ŷ = y − X(X'X)⁻¹ X'y = (In − X(X'X)⁻¹ X')y
P = X(X'X)⁻¹ X'
M = In − X(X'X)⁻¹ X' = In − P
Projection: PX = X(X'X)⁻¹ X'X = X  ⇒  MX = 0 (M is orthogonal to X)
Orthogonal: PM = 0
Fitted values: ŷ = Py
P projects y onto the space of all linear combinations of the regressors
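Illustration (added, not from the original slides): a small NumPy check of the projection-matrix properties stated above, on an arbitrary simulated X and y of my own choosing.

import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto col(X)
M = np.eye(n) - P                             # residual maker

print(np.allclose(P @ P, P), np.allclose(M @ M, M))    # both idempotent
print(np.allclose(P @ M, np.zeros((n, n))))            # PM = 0
print(np.allclose(M @ X, np.zeros((n, k))))            # MX = 0
print(np.allclose(P @ y, X @ np.linalg.solve(X.T @ X, X.T @ y)))  # Py = fitted values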
Theorem
Under A.1-A.5, β̂|X ~ N(β, σ²(X'X)⁻¹)
OLS is an orthogonal projection: ŷ = Py, with P = X(X'X)⁻¹ X'
Residuals: ε̂ = My, with M = I − X(X'X)⁻¹ X'
s² = RSS/(n − k) = ε̂'ε̂/(n − k)
Theorem
Under A.1-A.4, s² is unbiased, i.e., E(s²) = σ²
s² = ε̂'ε̂/(n − k) = ε'Mε/(n − k)
The proof uses the fact that under A.1 we can write s² = ε'Mε/(n − k).
This gives us
(n − k) s²/σ² = ε'Mε/σ².
Quadratic form of normal rv's; M is a symmetric, idempotent matrix with rank(M) = n − k (see PS2, Q3) VN2.23
ε'Mε/σ² ~ χ²_{n−k} if ε ~ N(0, σ²I) [X, M fixed]
If X is stochastic, we first claim ε'Mε/σ² | X ~ χ²_{n−k} (pretend X fixed).
Show Cov(β̂, ε̂) = 0 (for simplicity treat X as fixed) VN2.25
Dr M. Schafgans
Lent 2022
Reason:
Interest may only be in a subset of parameters;
Allows us to deal with computational issues when k1 + k2 is very big.
X'X is a very large matrix to invert.
E.g., fixed effect panel data models.
We are interested in the partitioned regression estimators β̂1 and β̂2.
X1'X1 β̂1 + X1'X2 β̂2 = X1'y
X2'X1 β̂1 + X2'X2 β̂2 = X2'y
Yielding
β̂1 = (X1'M2 X1)⁻¹ X1'M2 y, with M2 = I − X2(X2'X2)⁻¹ X2'
β̂2 = (X2'M1 X2)⁻¹ X2'M1 y, with M1 = I − X1(X1'X1)⁻¹ X1'
y = Xβ + ε = X1 β1 + X2 β2 + ε
While β̂ = (X'X)⁻¹ X'y, the partitioned OLS estimator is
β̂1 = (X1'M2 X1)⁻¹ X1'M2 y, with M2 = I − X2(X2'X2)⁻¹ X2'
If X1'X2 = 0, this simplifies:
β̂1 = (X1'M2 X1)⁻¹ X1'M2 y = (X1'X1)⁻¹ X1'y
Implication: to obtain the parameter estimates for β1 we can ignore the presence of X2 if X1'X2 = 0
E.g., yt = α1 s1t + α2 s2t + ... + α4 s4t + εt, where the sjt are seasonal dummies
y = Xβ + ε = X1 β1 + X2 β2 + ε
Var(β̂1|X) = σ²(X1'M2 X1)⁻¹
Theorem (Frisch-Waugh-Lovell)
We can obtain β̂1 by estimating the "residual based model".
The residual sums of squares are identical, as the residuals of the residual-based model equal ε̂.
For the computation of SEs we need to recognize that we have used, explicitly or implicitly, k1 + k2 regressors.
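Illustration (added, not from the original slides): a NumPy sketch of the FWL result, comparing β̂1 from the full regression with the estimate obtained after residualising y and X1 on X2; the simulated data are my own assumptions.

import numpy as np

rng = np.random.default_rng(3)
n = 300
X1 = rng.normal(size=(n, 2))
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X1 @ np.array([1.0, -1.0]) + X2 @ np.array([0.5, 2.0, 0.3]) + rng.normal(size=n)

X = np.column_stack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)[:2]        # beta_1 from the full regression

M2 = np.eye(n) - X2 @ np.linalg.inv(X2.T @ X2) @ X2.T    # residual maker of X2
X1r, yr = M2 @ X1, M2 @ y                                # residualise X1 and y on X2
beta_fwl = np.linalg.solve(X1r.T @ X1r, X1r.T @ yr)      # (X1'M2X1)^{-1} X1'M2 y

print(np.allclose(beta_full, beta_fwl))                  # FWL: the two coincide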
β̂ = (Z'M_D Z)⁻¹ Z'M_D y
E(β̃) = E(β + (X'X)⁻¹ X'Zδ + (X'X)⁻¹ X'ε)
     = β + (X'X)⁻¹ X'Zδ + (X'X)⁻¹ X'E(ε) = β + (X'X)⁻¹ X'Zδ, where (X'X)⁻¹ X'Zδ is the BIAS
Observe: necessary conditions for omitted variable bias:
δ ≠ 0, i.e., Z is relevant
X'Z ≠ 0, i.e., X affects Z
The bias reflects the indirect effect X has on the expected value of y associated with changes in Z
β̃ = β + (X'X)⁻¹ X'Zδ + (X'X)⁻¹ X'ε
Var(β̃) = Var(β + (X'X)⁻¹ X'Zδ + (X'X)⁻¹ X'ε)
        = (X'X)⁻¹ X'Var(ε)X(X'X)⁻¹ = σ²(X'X)⁻¹ as Var(ε) = σ² In
s̃² = v̂'v̂/(n − kX), with v̂ = y − Xβ̃
The estimator of β from the regression that also includes the irrelevant Z remains unbiased:
E(β + (X'M_Z X)⁻¹ X'M_Z v) = β + (X'M_Z X)⁻¹ X'M_Z E(v) = β
Efficiency loss: σ²(X'M_Z X)⁻¹ is bigger than σ²(X'X)⁻¹! VN3.11
EC221: Principles of Econometrics
CLRM and Hypothesis Testing
Dr M. Schafgans
Lent 2022
We note:
1 β̂i ~ N(βi, σ² cii), where cii = [(X'X)⁻¹]ii
2 (n − k) s²/σ² ~ χ²_{n−k}
3 Independence of β̂ and s²
Hypothesis Testing
Single Linear Restrictions - t-test
Under GM assumptions + normality
Unfortunately this distribution is not nice (which values of β̂2 are too far away from 5 for a given significance level?)
Under H0: β̂2 ~ N(5, σ² c22), where c22 = [(X'X)⁻¹]22
Discussion visualizer: VN4.1-4
Associated with the test statistics you may also find p-values.
Definition
P-value: the lowest level of significance at which we would reject H0.
Reject H0 if |β̂2 − b2| / SE(β̂2) > t_{n−k, α/2}.
Equivalently, reject H0 if b2 ∉ [β̂2 − t_{n−k,α/2} SE(β̂2), β̂2 + t_{n−k,α/2} SE(β̂2)]
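Illustration (added, not from the original slides): a sketch of the two-sided t-test and confidence interval using scipy.stats; the function and argument names are my own.

import numpy as np
from scipy import stats

def t_test(beta_hat_j, se_j, hypothesised, df, alpha=0.05):
    """Two-sided t-test of H0: beta_j equals the hypothesised value."""
    t_stat = (beta_hat_j - hypothesised) / se_j
    crit = stats.t.ppf(1 - alpha / 2, df)                 # t_{n-k, alpha/2}
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    ci = (beta_hat_j - crit * se_j, beta_hat_j + crit * se_j)
    return t_stat, p_value, ci, abs(t_stat) > crit        # last entry: reject H0?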
β2 + β3 + .. + βk = 1
H0: r'β = c
Under H0: r'β̂ ~ N(c, σ² r'(X'X)⁻¹ r).
z = (r'β̂ − c) / (σ √(r'(X'X)⁻¹ r)) = (r'β̂ − c) / Stdev(r'β̂ − c) ~ N(0, 1) under H0
yi = β1 + β2 ki + β3 li + εi
Test statistic:
(β̂2 + β̂3 − 1) / SE(β̂2 + β̂3 − 1) = (β̂2 + β̂3 − 1) / SE(β̂2 + β̂3) ~ t_{N−3} under H0
1 Define H0 and H1.
2 Formulate a test statistic and provide its distribution under H0.
A test statistic is a random variable which
is computable from the data and does not comprise any unknown quantities.
has a well-defined distribution needed to define the rejection region.
3 Define the significance level of your test, and provide the associated critical values.
4 State the rejection rule and interpret your findings.
Do not use the terminology "accept H0"!
5 Clearly indicate the assumptions you make for validity of the test (e.g., GM + normality)
Hypothesis Testing
Multiple Linear Restrictions - F-test
Under GM assumptions + normality
How do we test when there is more than one linear restriction?
Formalize the null and alternative hypotheses
H0: Rβ = c,  HA: Rβ ≠ c (only consider 2-sided)
(compare the single-restriction case H0: r'β = c, where HA may be 2-sided or 1-sided)
Rβ = c, where R is a J × k matrix and c a J × 1 vector.
Examples: Let k = 4 and β = (β1, β2, β3, β4)'.
β2 = β3:
R = (0  1  −1  0), c = 0   (single restriction, Rβ = r'β)
β2 = 2 and β3 = 1:
R = [0 1 0 0; 0 0 1 0], c = (2, 1)'   (2 restrictions)
β2 = β3 = β4 = 0:
R = [0 1 0 0; 0 0 1 0; 0 0 0 1], c = (0, 0, 0)'   (3 restrictions)
How would you write the following restrictions: β2 = β3 = 2β4? VN4.9
H0: Rβ = c, HA: Rβ ≠ c
Calculate d = Rβ̂ − c: the discrepancy vector, and consider d'[Var(d)]⁻¹ d.
Recall: if z is an n-dimensional vector of random variables distributed as N(0, V), where V is non-singular, then z'V⁻¹z ~ χ²_n (PS2, Q3).
Under H0: d = Rβ̂ − c ~ N(0, Var(d)), with Var(d) = σ² R(X'X)⁻¹ R'
Under H0: d'[Var(d)]⁻¹ d = (Rβ̂ − c)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − c)/σ² ~ χ²_J
W = { (Rβ̂ − c)'[R(X'X)⁻¹R']⁻¹(Rβ̂ − c)/σ² } / J  divided by  { (n − k) s²/σ² } / (n − k)
The numerator quadratic form d'[Var(d)]⁻¹d is χ²_J and (n − k)s²/σ² is χ²_{n−k}.
Reject H0 if W > K.
H0: Rβ = c, HA: Rβ ≠ c
F = [(RRSS − URSS)/J] / [URSS/(n − k)] ~ F_{J, n−k} under H0
Thus to test the hypothesis we proceed as follows (see PS5, Q1c, and PS5-extra, Q1; a small code sketch follows these steps):
1 Calculate F based on the results of the restricted and unrestricted least squares regressions.
2 If we test at a 5% level of significance, find the value K such that 5% of the area under an F distribution with (J, n − k) degrees of freedom lies to the right of K.
3 Reject H0 if F > K
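Illustration (added, not from the original slides): a sketch of this restricted/unrestricted F-test using scipy.stats; the function name and arguments are my own.

import numpy as np
from scipy import stats

def f_test_rss(rrss, urss, J, n, k, alpha=0.05):
    """F = [(RRSS - URSS)/J] / [URSS/(n - k)], compared with F_{J, n-k}."""
    F = ((rrss - urss) / J) / (urss / (n - k))
    K = stats.f.ppf(1 - alpha, J, n - k)     # critical value
    p_value = stats.f.sf(F, J, n - k)
    return F, K, p_value, F > K              # reject H0 if F > K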
yi = β1 + β2 xi2 + ... + βk xik + εi = xi'β + εi
H0: β2 = β3 = ... = βk = 0
HA: β2 ≠ 0 and/or β3 ≠ 0 and/or ... βk ≠ 0
The alternative states that at least one of the coefficients is not equal to zero.
Unrestricted model: yi = xi'β + εi
Restricted model: yi = β1 + εi
where we use R² = 1 − RSS/TSS (R² = coefficient of determination).
Minimize (y − Xβ)'(y − Xβ) subject to Rβ = c
β̃ = β̂ − (X'X)⁻¹ R' (R(X'X)⁻¹R')⁻¹ (Rβ̂ − c)
λ̃ = (R(X'X)⁻¹R')⁻¹ (c − Rβ̂)
Observe: if Rβ̂ = c (discrepancy = 0) then β̃ = β̂ and λ̃ = 0.
EC221: Principles of Econometrics
CLRM and Asymptotic Theory
Dr M. Schafgans
Lent 2022
Definition
Convergence in probability (consistency). Let Xn be a random variable indexed by the size of a sample. Xn converges in probability to X (Xn →p X) if, for every δ > 0, lim_{n→∞} Pr(|Xn − X| > δ) = 0.
These are sufficient conditions that guarantee that the sample mean, X̄, converges to the population mean E(Xi), so:
plim (1/n) ∑_{i=1}^n Xi = µ
plim(X̄) = plim (1/n) ∑ Xi = E(Xi) ≡ µ
Alternative LLNs exist for cases where the Xi (εi) are not i.i.d.
Consistency - Sample Mean II
plim X̄ = E (Xi )
Advantage of approach:
Recognizes that Var (Xi ) = σ2 < ∞ is not needed for consistency
Proof does not require us to derive Var(X̄); we just need to look at plims of averages!
As it turns out, the plim is a nice operator - nicer than the expectation
operator.
Theorem
Slutsky Theorem. For a continuous function g(Xn) that is not a function of n, plim g(Xn) = g(plim Xn).
Show that β̂ consistently estimates β in the simple linear regression model:
β̂ = ∑_{i=1}^n (xi − x̄)(yi − ȳ) / ∑_{i=1}^n (xi − x̄)² = β + ∑_{i=1}^n (xi − x̄)(εi − ε̄) / ∑_{i=1}^n (xi − x̄)²
By the Slutsky Theorem,
plim β̂ = β + plim[(1/n) ∑_{i=1}^n (xi − x̄)(εi − ε̄)] / plim[(1/n) ∑_{i=1}^n (xi − x̄)²] = β + plim SampleCov(xi, εi) / plim SampleVar(xi)
How can we see, e.g., that plim SampleCov(xi, εi) = Cov(xi, εi)?
Observe
SampleCov(xi, εi) = (1/n) ∑_{i=1}^n (xi − x̄)(εi − ε̄) = (1/n) ∑_{i=1}^n xi εi − x̄ ε̄
By the Slutsky Theorem,
plim SampleCov(xi, εi) = plim (1/n) ∑_{i=1}^n xi εi − plim x̄ · plim ε̄
Show that β̂ consistently estimates β:
β̂ = (X'X)⁻¹ X'y = β + (X'X)⁻¹ X'ε   VN5.3-5
s² = ε̂'ε̂/(n − k) = ε'Mε/(n − k)
Apply the Slutsky Theorem and Law of Large Numbers (see PS5-extra, Q2)
Using the sufficient conditions would, e.g., require the existence of the 4th moment of ε.
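Illustration (added, not from the original slides): a quick Monte Carlo sketch of consistency — β̂ and s² settle down to the true values as n grows even with non-normal errors. The data-generating process is my own assumption.

import numpy as np

rng = np.random.default_rng(4)
beta = np.array([1.0, 0.5])

for n in (50, 500, 5000, 50000):
    x = rng.normal(size=n)
    eps = rng.standard_t(df=5, size=n)           # non-normal errors: fine for consistency
    y = beta[0] + beta[1] * x + eps
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    s2 = np.sum((y - X @ b) ** 2) / (n - 2)
    print(n, b, s2)                              # b -> beta and s2 -> Var(eps) as n grows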
Definition
(Convergence in Distribution): The sequence Zn with cumulative distribution functions F_{Zn}(z) converges in distribution to a random variable Z with cumulative distribution function F_Z(z) if lim_{n→∞} |F_{Zn}(z) − F_Z(z)| = 0 at all points of continuity of F_Z(z).
A1–A5 ⇒ β̂|X ~ N(β, σ²(X'X)⁻¹)
Can we still do hypothesis testing if we are not happy to make this assumption?
Yes! We will then rely on a CLT that tells us that, even if the errors are not normally distributed but, say, simply random (i.i.d.) with zero mean and finite variance σ²,
A1–A4 ⇒ β̂|X ~a N(β, σ²(X'X)⁻¹)
z = (c'β̂ − γ) / SE(c'β̂ − γ) ~a N(0, 1) under H0
The t and the F tests are exact tests that rely on the assumption of normality; the z and the χ²-tests don't!
EC221: Principles of Econometrics
Stationary Time Series Models
Dr M. Schafgans
Lent 2022
The name "static model" comes from the fact that we are modeling a contemporaneous relationship between y and x.
Example: Static Phillips Curve. One way to write a static Phillips curve is
inft = β0 + β1 unemt + εt,
where inft is, say, the annual rate of inflation during year t, and unemt is the annual unemployment rate during year t. β1 is supposed to measure the trade-off between inflation and unemployment.
For inference, we will need to use robust SEs to deal with the autocorrelation in the errors as Var(β̂|X) ≠ σ²(X'X)⁻¹ (next handout).
yt = α + xt γ1 + xt−1 γ2 + xt−2 γ3 + εt
Our matrix of regressors is given by X = [1  xt  xt−1  xt−2] (a typical row shown).
If xt changes slowly over time, we may get very imprecise estimates for the γj due to the problem of near multicollinearity.
If xt changes slowly over time, xt and xt−1 are highly correlated.
Fortunately, in general we can estimate the LR effect γ1 + γ2 + γ3 precisely.
The OLS estimator of FDL (and static) models will typically be biased!
Models using Time series data (stationary) VI
(2) Finite Distributed Lag Models
yt = α + xt γ1 + xt−1 γ2 + xt−2 γ3 + εt
specifically
E(εt | xt, xt−1, ..., yt−1, yt−2, ...) = 0 (conditioning on the information set It), or simply
E(εt | xt, xt−1, xt−2, yt−1) = 0.
To ensure we can apply the LLN and CLT we will rely on two assumptions (stationarity and weak dependence) which we discuss today.
1 The static model and the FDL model will be consistent under weak exogeneity assumptions even if the error term exhibits weak dependence.
If the error term in the ADL model does exhibit dependence, OLS will not be consistent either, because E(εt yt−1) ≠ 0. (Will discuss this later.)
Definition
A stochastic process {xt; t = 1, 2, ..} is covariance stationary if
(i) E(xt) is finite and constant
(ii) Var(xt) is finite and constant
(iii) Cov(xt, xt+h) is finite and depends only on the distance in time, h
The first two moments need to exist and not change (be identical) over time!
Dependence
To describe the dependence in {xt}_{t=1}^T over time, we consider Corr(xt, xt+h) as a function of h:
Corr(xt, xt+h) = Cov(xt, xt+h) / √(Var(xt) Var(xt+h))
Corr(xt, xt+h) → 0 as h → ∞
Stationary
The assumption that |φ| < 1 is needed to ensure a finite and positive mean and variance – stationarity.
Recursive substitution clarifies this:
yt = φ(φyt−2 + εt−1) + εt = φ²yt−2 + φεt−1 + εt
   = φ²(φyt−3 + εt−2) + φεt−1 + εt = φ³yt−3 + φ²εt−2 + φεt−1 + εt
   = ...
   = εt + φεt−1 + φ²εt−2 + φ³εt−3 + ...   as φ^s → 0
From the ACF (and correlogram) we can infer the extent to which one value of the process is correlated with previous values and thus the length and strength of the memory of the process.
It indicates how long (and how strongly) a shock in the process (εt) affects the current and future values of yt. Let h > 0.
We want to ensure that this dependence dies out sufficiently fast.
Satisfied for both the MA(1) and the stationary AR(1).
E(yt) = 0,
Var(yt) = (1 + θ²)σ², and
Cov(yt+1, yt) = θσ², while
Cov(yt+h, yt) = 0 for h > 1.
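Illustration (added, not from the original slides): simulating a stationary AR(1) and an MA(1) and computing their sample ACFs, to see the geometric decay versus the one-lag memory. Parameter values are my own choices.

import numpy as np

rng = np.random.default_rng(5)
T, phi, theta = 2000, 0.8, 0.5
eps = rng.normal(size=T + 1)

y = np.zeros(T)                          # AR(1): y_t = phi*y_{t-1} + eps_t
for t in range(1, T):
    y[t] = phi * y[t - 1] + eps[t]
z = eps[1:] + theta * eps[:-1]           # MA(1): z_t = eps_t + theta*eps_{t-1}

def acf(x, h):
    """Sample autocorrelation at lag h."""
    x = x - x.mean()
    return np.sum(x[h:] * x[:-h]) / np.sum(x * x)

print([round(acf(y, h), 2) for h in range(1, 6)])   # roughly phi^h: geometric decay
print([round(acf(z, h), 2) for h in range(1, 6)])   # roughly theta/(1+theta^2) at h=1, ~0 beyond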
EC221: Principles of Econometrics
Generalized Linear Regression Model
Dr M. Schafgans
Lent 2022
Definition
A1: True model - linear in parameters, y = Xβ + ε with E(ε) = 0
A2: No perfect multicollinearity
A3: Zero conditional mean, E(ε|X) = 0
A4: General covariance matrix Var(ε|X) = E(εε'|X) = Σ
E.g., heteroskedasticity: Cov(εi, εj|X) = σi² if i = j, and 0 if i ≠ j
E.g., Σ = σ²/(1 − ρ²) times the matrix with (i, j) element ρ^{|i−j|}:
first row (1, ρ, ρ², ρ³, ..., ρ^{n−1}), second row (ρ, 1, ρ, ..., ρ^{n−2}), ..., last row (ρ^{n−1}, ρ^{n−2}, ..., 1) — "fading memory"
β̂ remains Unbiased (and Consistent) and Linear
E(β̂) = β since E(ε|X) = 0 by A.3
Var(β̂|X) ≠ σ²(X'X)⁻¹  VN7.1
Var(β̂|X) = (X'X)⁻¹ X'ΣX (X'X)⁻¹  given A.4
          = σ²(X'X)⁻¹ X'ΩX (X'X)⁻¹
y = Xβ + ε.  Define y* = Ry, X* = RX, ε* = Rε.
Ry = RXβ + Rε, i.e., y* = X*β + ε* satisfies the Gauss-Markov conditions.
The matrix R that would guarantee this needs to satisfy the condition R'R = Ω⁻¹, i.e.,
Ω = CΛC' with C'C = In
Ω⁻¹ = CΛ⁻¹C'
R = CΛ^{−1/2}C'
R'R = (CΛ^{−1/2}C')'(CΛ^{−1/2}C') = CΛ^{−1/2}C'CΛ^{−1/2}C' = CΛ⁻¹C'
β̂_GLS = (X*'X*)⁻¹ X*'y* = (X'Ω⁻¹X)⁻¹ X'Ω⁻¹y,
which minimizes S(b) = (y − Xb)'Ω⁻¹(y − Xb), i.e.,
β̂_GLS = arg min_b S(b)
(Derive the FOC, and solve.) VN7.6
(Compare OLS, which minimizes S(b) = (y − Xb)'(y − Xb).)
Var(β̂_GLS|X) = σ²(X'Ω⁻¹X)⁻¹
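Illustration (added, not from the original slides): a sketch of GLS via the transformation R'R = Ω⁻¹, using a Cholesky factor of a known Ω; the function name is my own.

import numpy as np

def gls(X, y, Omega):
    """GLS through the transformed model y* = Ry, X* = RX with R'R = Omega^{-1}."""
    L = np.linalg.cholesky(Omega)      # Omega = L L'
    R = np.linalg.inv(L)               # so R'R = (L L')^{-1} = Omega^{-1}
    Xs, ys = R @ X, R @ y              # transformed model satisfies Gauss-Markov
    return np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)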
In PS7:
s²_GLS = ε̂*'ε̂*/(N − k) = (y* − X*β̂_GLS)'(y* − X*β̂_GLS)/(N − k)
       = (y − Xβ̂_GLS)'Ω⁻¹(y − Xβ̂_GLS)/(N − k)   VN7.8-9
Consistency of θ̂ ensures consistency of Ω̂ (Slutsky):
plim Ω̂ ≡ plim Ω(θ̂) = Ω(plim θ̂) = Ω(θ) = Ω
β̂_FGLS has only desirable asymptotic properties!
Consistent
β̂_FGLS typically is biased; in fact, β̂_FGLS is not even linear! VN7.10
β̂_FGLS|X ~a N(β, σ²(X'Ω⁻¹X)⁻¹)
Important: This does not mean that Σ̂ = diag(ε̂1², ..., ε̂n²) is a consistent estimator of Σ! The result relies on the fact that the matrix (1/n) ∑ ε̂i² xi xi' is a good approximation of (1/n) ∑ σi² xi xi' (White, 1980).
The heteroskedasticity-consistent standard errors, or simply White SEs, are obtained by taking square roots of its diagonal elements.
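Illustration (added, not from the original slides): a sketch of the White sandwich formula (X'X)⁻¹ [∑ ε̂i² xi xi'] (X'X)⁻¹; the function name is my own.

import numpy as np

def white_se(X, y):
    """OLS with heteroskedasticity-consistent (White) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    meat = X.T @ (X * e[:, None] ** 2)        # sum_i e_i^2 x_i x_i'
    V = XtX_inv @ meat @ XtX_inv              # sandwich estimator of Var(beta_hat)
    return beta_hat, np.sqrt(np.diag(V))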
GLS Efficiency - Heteroskedasticity = WLS
β̂ = [∑_{i=1}^n xi xi'/σi²]⁻¹ ∑_{i=1}^n xi yi/σi²
If σi² is known (up to scale), we can calculate this estimator, and apply the standard t- and F-tests as in the linear regression model. VN7.14
An example where this may be the case is where we observe only group averages and the size of each group differs. See PS7, Q2
3 White test
Does not specify anything about the form of heteroskedasticity
¹ If we specify HA: σi² = exp(δ0 + zi'δ1), we would use a test of the significance of the regression ln(ε̂i²) = δ0 + zi'δ1 + vi instead.
Breusch-Pagan Test for Heteroskedasticity I
yi = β1 + β2 xi2 + β3 xi3 + εi, i = 1, ..., N
yt = β0 + β1 xt + εt ,
If we could observe {εt}, we would just estimate an AR(1) model for εt and use a t-test for ρ = 0.
Since we don't observe {εt}, we will base our test on the OLS residuals, ε̂t (similar to the heteroskedasticity test).
yt = β0 + β1 x1t + .. + βk xkt + ut
We can easily add lags to the above test and use the F test. For
example we can regress
EC221: Principles of Econometrics
Endogeneity
Dr M. Schafgans
Lent 2022
omitted variables
lagged dependent variables when errors exhibit dependence
measurement errors in the regressors
simultaneity
E(ε|X) ≠ 0
yt = β1 + β2 xt + β3 yt−1 + εt, |β3| < 1, t = 1, ..., T
εt = ρεt−1 + vt, |ρ| < 1, vt i.i.d.(0, σv²)
vt uncorrelated with any xs and with εt−1, εt−2, ...
⇒ Cov(εt, yt−1) = ρ σε²/(1 − ρβ3) ≠ 0   [ρβ3 ≠ 1]
xi = xi* + ui   (observed regressor = true value + measurement error)
These are strong assumptions: the true value does not reveal any
information about the size, sign or value of the measurement error.
Other assumptions can be made here, and the exact result will depend
on what is assumed
Measurement Error in regressors III
E(xi vi) = −E(ui ui')β ≠ 0
E(xi xi') = E(xi* xi*') + E(ui ui')
In general, all parameters are affected by measurement error in one or more of the explanatory variables.
When only one explanatory variable is measured with a CME error, its parameter will exhibit "attenuation bias" (PS8, Q1). Little can be said about the direction of the bias if more than one variable is measured with error.
Endogeneity
Simultaneity
Demand : qt = α1 + α2 pt + α3 mt + u1t
Supply : qt = β1 + β2 pt + β3 ct + u2t
Ct = α + βYt + εt, E(εt) = 0
Yt = Ct + It
The structural form (SF) equations provide the relationships of
interest (behavioural relations)
Ct and Yt are endogenous variables, which are jointly determined
It is assumed to be exogenous, or predetermined (i.e., determined
outside the model). I.e., It is independent of the error εt .
OLS to the structural form equation yields inconsistent parameter
estimates for (α, β) because
Cov(Yt, εt) ≠ 0
To obtain the Cov (Yt , εt ) we need to obtain the reduced form of this
model.
Reduced form: express the endogenous variables in terms of the
exogenous variables and errors only.
Simultaneity (Keynesian Model ) III
Reduced Form
Yt = π11 + π12 It + v1t
Ct = π21 + π22 It + v2t
To get the reduced form for Yt (plug the 1st equation into the 2nd):
Yt = α + βYt + εt + It
⇒ Yt = α/(1 − β) + 1/(1 − β) It + 1/(1 − β) εt
Demand: qt = α1 + α2 pt + α3 mt + u1t
Supply: qt = β1 + β2 pt + β3 ct + u2t
E(ut) = 0, E(ut ut') = [σ1²  σ12; σ12  σ2²], mt, ct exogenous, independent of (u1t, u2t)
Reduced form:
pt = π11 + π12 mt + π13 ct + v1t
qt = π21 + π22 mt + π23 ct + v2t
See PS8, Q2
E(ε|X) ≠ 0
y = Xβ + ε with E(ε|X) ≠ 0
Recall, the OLS estimator was bad as it imposes
(1/n) X'ε̂ = (1/n) X'(y − Xβ̂_OLS) = 0
when in fact the errors and regressors are correlated.
Suppose there exists a set of explanatory variables Z which
are correlated with our regressors X (relevance) and
are NOT correlated with the error ε (validity).
These variables can be used for estimation purposes and are known as: Instrumental Variables.
Define our Instrumental Variable estimator β̂_IV by imposing:
(1/n) Z'ε̂_IV = (1/n) Z'(y − Xβ̂_IV) = 0.
β̂_IV = (Z'X)⁻¹ Z'y   (just-identified)
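Illustration (added, not from the original slides): the just-identified IV estimator in one line of NumPy; the function name is my own.

import numpy as np

def iv_just_identified(y, X, Z):
    """beta_IV = (Z'X)^{-1} Z'y, valid when Z has as many columns as X."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)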
In general E(β̂_IV) ≠ β
E(β̂_IV|X, Z) = β + (Z'X)⁻¹ Z'E(ε|X, Z) ≠ β
If there is endogeneity, then E(ε|X, Z) ≠ 0 because Cov(ε, X) ≠ 0!
Therefore β̂_IV is in general biased (finite sample property)
y = Xβ + ε with E(ε|Z) = 0
β̂_IV = (Z'X)⁻¹ Z'y   (just-identified)
Var(ε|Z) = σ² I
s²_IV = RSS/(n − k) = (y − Xβ̂_IV)'(y − Xβ̂_IV)/(n − k)
Validity: E(zt εt) = (E(εt), E(dt εt))' = 0
Shows that dt should be uncorrelated with εt, since E(dt εt) = Cov(dt, εt).
Relevance: E(zt xt') = [1, E(wt); E(dt), E(dt wt)] needs to be invertible.
Non-zero determinant: E(dt wt) − E(dt)E(wt) = Cov(dt, wt) ≠ 0.
Shows that dt should be correlated with wt.
β̂_1,IV = β1 + ∑_{i=1}^n di ui with di = (zi − z̄) / ∑_{i=1}^n (zi − z̄)(xi − x̄)
Var(β̂_1,IV|x, z) = σ² ∑_{i=1}^n di² = σ² / [∑_{i=1}^n (xi − x̄)²] · 1/R²_zx
AVar(β̂_IV) = σ²(Z'X)⁻¹ Z'Z (X'Z)⁻¹
β̂_IV = (Z'X)⁻¹ Z'y   (just-identified)
Common strategy
Variables in X that are thought to be "good" (exogenous) are included in Z
AVar(β̂_IV) = σ²(Z'X)⁻¹ Z'Z (X'Z)⁻¹
Instruments for endogenous variables are variables that do not enter the regression equation itself
These variables affect the dependent variable only through the regressors (exclusion restriction)
In dynamic models, lags of the included exogenous variables may provide suitable instruments for an endogenous lagged dependent variable.
In simultaneous equation models, considering the other equations in the system provides a natural way to propose suitable instruments.
Demand: qt = α1 + α2 pt + α3 mt + u1t
Supply: qt = β1 + β2 pt + β3 ct + u2t
Recall Cov(pt, u1t) ≠ 0 and Cov(pt, u2t) ≠ 0. pt is an endogenous ("bad") regressor in the demand and supply equations. VN8.11
Demand
We need an instrument for pt: a "good" (valid) variable, correlated with pt (relevance), which doesn't appear in the equation itself: ct
α̂_IV = (Z1'X1)⁻¹ Z1'q, with X1 = [1  pt  mt] and Z1 = [1  ct  mt] (typical rows shown)
Supply
We need an instrument for pt: a "good" (valid) variable, correlated with pt (relevance), which doesn't appear in the equation itself: mt
β̂_IV = (Z2'X2)⁻¹ Z2'q, with X2 = [1  pt  ct] and Z2 = [1  mt  ct] (typical rows shown)
Two Stage Least Squares
or
Optimal IV
β̂_IV = (X̂'X)⁻¹ X̂'y = (X'P_Z X)⁻¹ X'P_Z y
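Illustration (added, not from the original slides): a sketch of 2SLS / optimal IV via the projection P_Z; the function name is my own.

import numpy as np

def two_sls(y, X, Z):
    """2SLS: beta = (X'P_Z X)^{-1} X'P_Z y, with P_Z = Z(Z'Z)^{-1}Z'."""
    Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # projection onto the instrument space
    Xhat = Pz @ X                            # first stage: fitted values of X
    return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)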
yt = β1 xt + β2 yt−1 + εt, t = 1, ..., T
εt = ρεt−1 + vt, |ρ| < 1, vt i.i.d.(0, σv²)   AR(1) model
εt is uncorrelated with xt
vt uncorrelated with εt−1, yt−1, xt−1, εt−2, yt−2, xt−2, ...
qt = β1 + β2 p̂t + β3 ct + e2t
Let X2 = [1  pt  ct] and X̂2 = [1  p̂t  ct] (typical rows shown)
Optimal IV: β̂_OptIV = (X̂2'X2)⁻¹ X̂2'q
Under-identified: a setting where there are no (or insufficient) instruments. In this case we have a lack of identification.
Under-Identification vs Exact Identification
A graphical explanation VN8.14-15
y1 = α0 + α1 y2 + α2 x1 + u1, E(u1) = 0, E(x1 u1) = 0
We want to test whether y2 is endogenous in this equation.
Let z1 and z2 be instruments for y2:
y2 = π0 + π1 z1 + π2 z2 + π3 x1 + v2
y1 = α0 + α1 y2 + α2 x1 + δ v̂2 + error
Reject H0:
To estimate the α's we need to "control" for the endogeneity of y2 by including v̂2
The resulting parameter estimates are the same as obtained by 2SLS
Do not reject H0:
To estimate the α's we can estimate our original model without v̂2
We can obtain parameter estimates simply by using OLS
Testing for Endogeneity VI
See EC221 summer exam 2019 for a question that addresses this point.
EC221: Principles of Econometrics
Maximum Likelihood Estimation and
Trinity of Classical Testing
Dr M. Schafgans
Lent 2022
Choose as our MLE, θ̂ MLE , those values for the unknown parameters
that give us the highest likelihood. VN9.1
∂² log L(θ; y)/∂θ∂θ' evaluated at θ̂_MLE is a negative definite matrix
Asymptotically Normal
θ̂ ~a N(θ, I(θ)⁻¹), where I(θ) = −E(H(θ))
Asymptotically Efficient
The inverse of the information matrix, I(θ)⁻¹, provides a lower bound on the asymptotic covariance matrix for any consistent, asymptotically normal estimator of θ (Cramér-Rao lower bound).
Important, as this ensures that our estimates are precise and tests using them powerful.
MLE: Estimation of the Asymptotic Variance of the MLE
θ̂ ~a N(θ, I(θ)⁻¹), where I(θ) = −E(H(θ))
Estimate of AVar(θ̂) = { ∑_{i=1}^n −E[∂² log Li(θ)/∂θ∂θ'] }⁻¹ = (−E[H(θ)])⁻¹
EITHER evaluate the second derivatives matrix of the log-likelihood function at the MLE estimates:
AVar̂(θ̂) = { −∑_{i=1}^n ∂² log Li(θ)/∂θ∂θ' |_{θ̂} }⁻¹ = (−H(θ̂))⁻¹
Suppose you have a random sample Y1, ..., Yn drawn from the p.d.f.
f(y) = θ^y (1 − θ)^{1−y} for y = 0, 1 (and 0 otherwise)
log L(θ; Y) = ∑_{i=1}^n Yi log θ + (n − ∑_{i=1}^n Yi) log(1 − θ)
Step 3: Obtain θ̂_MLE
∂ log L/∂θ = ∑_{i=1}^n Yi / θ − (n − ∑_{i=1}^n Yi)/(1 − θ)
θ̂_MLE satisfies ∑_{i=1}^n Yi / θ̂_MLE − (n − ∑_{i=1}^n Yi)/(1 − θ̂_MLE) = 0  ⇒  θ̂_MLE = (1/n) ∑_{i=1}^n Yi = Ȳ
Ȳ is an estimator (random variable); ȳ its realisation (estimate).
Maximum Likelihood Estimator: Binary Choice
θ̂_MLE = Ȳ is consistent and unbiased:
plim (1/n) ∑_{i=1}^n Yi = E(Yi) = θ  – law of large numbers
E(θ̂_MLE) = (1/n) ∑_{i=1}^n E(Yi) = (1/n)·nθ = θ
θ̂_MLE is asymptotically normal:
θ̂_MLE ~a N(θ, θ(1 − θ)/n)
To show this: obtain the information matrix I(θ) = −E(∂² ln L/∂θ²)  VN9.5
Note: here, it is easy to obtain SE(θ̂_MLE) = √(θ̂_MLE(1 − θ̂_MLE)/n)
Suppose you have a random sample Y1, ..., Yn drawn from the p.d.f.
f(y) = exp(−λ) λ^y / y!,  y = 0, 1, 2, 3, ....
with, e.g., 5! = 1·2·3·4·5.
Y is a Poisson random variable (counts)
Note: E(Y) = Var(Y) = λ
Find the MLE of λ and discuss its asymptotic distribution! (PS9, Q2)
Our ML estimators need to satisfy ∂ log L(θ)/∂θ |_{θ=θ̂} = 0, where
∂ log L/∂β = −(1/(2σ²))(−2X'y + 2X'Xβ)
∂ log L/∂σ² = −n/(2σ²) + (1/(2σ⁴))(y − Xβ)'(y − Xβ)
β̂_MLE = (X'X)⁻¹ X'y
σ̂²_MLE = û'û/n with û = y − Xβ̂_MLE
The information matrix becomes: I(β, σ²) = [X'X/σ²  0; 0  n/(2σ⁴)]
To obtain the asymptotic variance, take the inverse (easy here with block diagonality):
I(β, σ²)⁻¹ = [σ²(X'X)⁻¹  0; 0  2σ⁴/n]
This gives:
(β̂_MLE, σ̂²_MLE) ~a N( (β, σ²), [σ²(X'X)⁻¹  0; 0  2σ⁴/n] )
Rθ = c.
The Wald test only requires the unrestricted estimator, and the
Lagrange multiplier test only requires the restricted estimator.
The LM test statistic tests whether λ̃ ≈ 0:
LM = λ̃'[Var̂(λ̃)]⁻¹λ̃ →d χ²_r under H0
Or, equivalently, whether the score s(θ̃) = ∑_{i=1}^n ∂ log Li(θ̃)/∂θ ≈ 0 (Score test):
LM = s(θ̃)'[Var̂(s(θ̃))]⁻¹ s(θ̃) →d χ²_r under H0
1i = si(θ̃)'γ + vi, i = 1, .., n
inlf̂ = .586 − .0034 nwifeinc + .038 educ + .039 exper − .00060 exper² − .016 age − .262 kidslt6 + .013 kidsge6
(standard errors: .152, .0015, .007, .006, .00019, .002, .032, .014)
F(z) = Logistic CDF = Λ(z) = exp(z)/[1 + exp(z)]   (logit)
F(z) = N(0, 1) CDF = Φ(z) = ∫_{−∞}^z (1/√(2π)) exp(−u²/2) du   (probit)
P(y = 1|x) = F(x'β)
yi = F(xi'β) + εi
f(yi) = θ^{yi}(1 − θ)^{1−yi}, θ = Pr(yi = 1) (and 0 otherwise)
θi = Pr(yi = 1|xi)
Common: θi = Φ(xi'β) (probit; CDF of a N(0,1) rv) or θi = exp(xi'β)/(1 + exp(xi'β)) (logit; CDF of a logistic rv)
L(β) = f(y1, .., yn|x1, ..., xn) =(indep.) ∏_{i=1}^n f(yi|xi)
     = ∏_{i=1}^n Φ(xi'β)^{yi} [1 − Φ(xi'β)]^{1−yi}
ε̂i^G = φ(xi'β̂_ML) [yi − Φ(xi'β̂_ML)] / { Φ(xi'β̂_ML)[1 − Φ(xi'β̂_ML)] } : so-called generalized residuals
MLE: Binary Choice Cont’d IV
log L(β) = ∑_{i=1}^n [ yi log Φ(xi'β) + (1 − yi) log(1 − Φ(xi'β)) ]
(the first term involves Pr(yi = 1|xi) and the second Pr(yi = 0|xi))
Unlike in the LPM, the marginal effects in our probit (logit) model are not constant, but depend on the individual characteristics:
Report the average over all individuals: (1/n) ∑_{i=1}^n φ(xi'β̂) β̂k
[In Stata: margins, dydx( )]
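Illustration (added, not from the original slides): a sketch of probit estimation by maximising the log-likelihood above numerically, followed by the average marginal effects (1/n) ∑ φ(xi'β̂)β̂; the function name and optimiser choice are my own assumptions.

import numpy as np
from scipy import optimize, stats

def probit_mle(y, X):
    """Probit MLE and average marginal effects."""
    def neg_loglik(b):
        p = np.clip(stats.norm.cdf(X @ b), 1e-12, 1 - 1e-12)   # guard against log(0)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    res = optimize.minimize(neg_loglik, np.zeros(X.shape[1]), method="BFGS")
    b_hat = res.x
    ame = stats.norm.pdf(X @ b_hat).mean() * b_hat              # average marginal effects
    return b_hat, ame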
Let me summarize the results for Binary Choice Models thus far.
VN9.16-18
Next, we will discuss, for our Probit/Logit model, the Wald, LR and LM tests for, say,
H0: β3 = 0 against H1: β3 ≠ 0.
See also PS10 Q1.
Wald Test:
Estimate the unrestricted Probit model, and verify whether β̂3 ≈ 0. VN9.19
Test statistic:
z = β̂3 / SE(β̂3) ~a N(0, 1) under H0
Recall, under suitable regularity conditions, β̂_MLE ~a N(β, I(β)⁻¹)
Recall how to estimate Var(β̂_MLE): [−∂² log L(β)/∂β∂β']⁻¹ evaluated at β̂_MLE.
LR = 2(log L_U − log L_R) →d χ²_1
Estimate the restricted Probit model leaving out the third variable (β3 = 0)
Let the restricted parameter estimates be: β̂^r = (β̂0^r, β̂1^r, β̂2^r, 0)
1i = ε̂i^G γ1 + ε̂i^G x1i γ2 + ε̂i^G x2i γ3 + ε̂i^G x3i γ4 + υi
We want to test:
H0: βj^rural = βj^urban for all j = 0, 1, ..., k
H1: At least one βj^rural ≠ βj^urban
Restricted model
The LF participation decision is the same for urban and rural women
To obtain log Lr , we simply perform probit using all observations
Unrestricted model
The LF participation decisions for rural and urban women are different
To obtain log Lur we can run separate probit regressions for the urban
and rural sample
probit inlf nwifeinc educ exper expersq age kidslt6 kidsge6 if city==0
probit inlf nwifeinc educ exper expersq age kidslt6 kidsge6 if city==1
From here we compute log Lur = log Lurban + log Lrural
The drawback of using OLS to estimate E(yi|xi) is, similar to the drawback in the setting of binary choice, that E(yi|xi) should be non-negative for all x.
The Poisson regression model for count data uses the Poisson distribution (MLE).
Empirical Application: MLE Count Data Cont’d I
Recall: f(yi) = exp(−λ) λ^{yi} / yi!,  yi = 0, 1, 2, 3, ....
where λ = E(yi) = Var(yi).
Clearly, we can expect that E(yi) is not the same for all firms: λ is not a constant but will depend on i, with λi = exp(xi'β), so that
∂E(y|x)/∂xk = ∂ exp(x'β)/∂xk = βk exp(x'β)
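Illustration (added, not from the original slides): a sketch of Poisson regression with λi = exp(xi'β), estimated by maximising the log-likelihood, together with the average of the marginal effects βk exp(xi'β̂); names and optimiser choice are my own assumptions.

import numpy as np
from scipy import optimize

def poisson_mle(y, X):
    """Poisson regression MLE and average marginal effects."""
    def neg_loglik(b):
        xb = X @ b
        return -np.sum(y * xb - np.exp(xb))      # log(y_i!) dropped: free of beta
    res = optimize.minimize(neg_loglik, np.zeros(X.shape[1]), method="BFGS")
    b_hat = res.x
    ame = np.exp(X @ b_hat).mean() * b_hat       # average of beta_k * exp(x_i'beta_hat)
    return b_hat, ame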