
Lecture Notes 11

Review

What is the difference between

- robust estimators (White)
- HAC robust estimators (Newey-West: heteroscedasticity-and-autocorrelation consistent)
- the GLS estimator (Generalized Least Squares, i.e. general heteroskedasticity)?

- White is for heteroscedasticity with no auto-correlation


- Newey-West is for auto-correlation and heteroscedasticity
- both calculate the correct V(b),
- which OLS regression packages do not do
- since OLS assumes V(ε|X) = σ²I
- making V(b) = σ²(X'X)⁻¹
- but with heteroscedasticity, V(b) = (X'X)⁻¹ X'ΣX (X'X)⁻¹

- why not use GLS instead of OLS?
- after all, it is efficient
- but you have to specify the exact structure of the heteroscedasticity
- White and Newey-West robust estimators are especially for the case where you don't think you have a heteroscedasticity problem
- check to see if you have a problem
- OLS is still consistent
- but White & Newey-West allow correct inference
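The sandwich formula above is easy to compute directly. A minimal numpy sketch (the function and variable names are ours, not from the notes; this is the HC0 form of the White estimator):

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS with classical and White (HC0) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y              # OLS coefficients
    e = y - X @ b                      # residuals
    # Classical: V(b) = s^2 (X'X)^{-1}, valid only under homoscedasticity
    s2 = e @ e / (n - k)
    V_classical = s2 * XtX_inv
    # White sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
    meat = X.T @ (X * e[:, None] ** 2)
    V_white = XtX_inv @ meat @ XtX_inv
    return b, np.sqrt(np.diag(V_classical)), np.sqrt(np.diag(V_white))
```

With simulated heteroscedastic data (error variance growing with |x|), the robust standard errors come out noticeably larger than the classical ones, while the coefficient estimate itself is unchanged.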

Testing IV assumptions

1. E(ε|X) ≠ 0 - Hausman test for endogeneity
- test if b = β̂_2SLS
2. E(ε|Z) = 0 - Overidentification test (only possible if L > K)
- test if β̂ is the same with or without the extra L−K instruments
3. E(Z'X) = Q_ZX ≠ 0 - Weak instrument test
- test the correlation of Z and X from the first stage of 2SLS

Important because one is using IV in the first place out of doubt about endogeneity, and it is never obvious that instruments are both exogenous and highly correlated with X.

- several variations on each of these tests


- also different versions if you allow robust standard errors (H or AC or HAC)

1. Hausman (and Wu, Durbin) test

Is there an endogeneity problem in the first place?


i.e. is E(ε|X) = 0?
Can't test E(ε|X) = 0 directly
- OLS residuals e are constructed so that X'e = 0
If E(ε|X) = 0, then OLS is consistent, and so is IV
- because it is still true that E(ε|Z) = 0 and E(Z'X) = Q_ZX
But if E(ε|X) ≠ 0, then OLS is inconsistent, but IV is consistent
Hausman tests whether β̂_2SLS − b = 0

H_0: β̂_2SLS − b = 0
H_A: β̂_2SLS − b ≠ 0

Use Wald statistic

H = (β̂_2SLS − b)′ {Est.Asy.Var[β̂_2SLS − b]}⁻¹ (β̂_2SLS − b)

Asy.Var[β̂_2SLS − b] = Asy.Var[β̂_2SLS] + Asy.Var[b] − 2 Asy.Cov[β̂_2SLS, b]

But what is Asy.Cov[β̂_2SLS, b]?

First, Hausman noted that under H 0 , OLS is efficient and IV is not

Asy.Var[β̂_2SLS] − Asy.Var[b] = (σ²/n) plim(X̂'X̂/n)⁻¹ − (σ²/n) plim(X'X/n)⁻¹

since X̂ is an estimate of X
- it is less correlated with X than X is with itself
- (unless the columns of Z perfectly predict the columns of X)

plim(X̂'X̂/n)⁻¹ > plim(X'X/n)⁻¹, so Asy.Var[β̂_2SLS] > Asy.Var[b]
Second, he proved that
the Cov between an efficient estimator (b)
and the difference with an inefficient estimator ( β̂ 2SLS )
for the same parameter is zero.

So Cov[b, β̂_2SLS − b] = Cov[b, β̂_2SLS] − V[b] = 0

or Cov[b, β̂_2SLS] = V[b]

so Asy.Var[β̂_2SLS − b] = Asy.Var[β̂_2SLS] − Asy.Var[b]

Est.Asy.Var[β̂_2SLS − b] = s²(X̂'X̂)⁻¹ − s²(X'X)⁻¹

so H = (β̂_2SLS − b)′ {Est.Asy.Var[β̂_2SLS − b]}⁻¹ (β̂_2SLS − b)

H = (β̂_2SLS − b)′ (V̂[β̂_2SLS] − V̂[b])⁻¹ (β̂_2SLS − b)
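The statistic above can be sketched in a few lines (illustrative only; the function name is ours, and a pseudo-inverse is used because the estimated variance difference need not be positive definite in finite samples):

```python
import numpy as np
from scipy import stats

def hausman(b_iv, V_iv, b_ols, V_ols):
    """Hausman test: H = d' [V_iv - V_ols]^+ d, with d = b_iv - b_ols.
    Under H0 (no endogeneity), H ~ chi-squared, df = rank(V_iv - V_ols)."""
    d = b_iv - b_ols
    dV = V_iv - V_ols                       # Hausman's simplification
    H = float(d @ np.linalg.pinv(dV) @ d)   # Wald statistic
    df = np.linalg.matrix_rank(dV)
    p = stats.chi2.sf(H, df)
    return H, df, p
```

A small p-value rejects H_0 that OLS and 2SLS estimate the same parameter, i.e. it signals endogeneity.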

2. Overidentification test
- only possible if L > K
E(z_i ε_i) = 0 : orthogonality condition

E(m̄) = E[(1/n) Σ_{i=1}^n z_i ε_i] = 0, even though not exactly true in sample

So test whether (1/n) Σ_{i=1}^n z_i e_IV,i = 0 when L > K
- i.e. test m̄ = 0
- use m̄ = (1/n) Σ_{i=1}^n z_i e_IV,i = (1/n) Σ_{i=1}^n z_i (y_i − x_i′β̂_IV)
- then m̄′[Var(m̄)]⁻¹ m̄ ∼ χ²_{L−K}

- only L-K degrees of freedom because

β̂ IV already forces first K moment conditions to be exactly equal to
zero
- Est.Var(m̄) = (1/n²) Σ_{i=1}^n (z_i e_IV,i)(z_i e_IV,i)′ = (1/n²) Z′ diag(e²_IV) Z
- m̄ = (1/n) Σ_{i=1}^n z_i e_IV,i = (1/n) Z′e_IV
- so the Wald stat is χ² = e_IV′Z [Z′ diag(e²_IV) Z]⁻¹ Z′e_IV

- can view this as a test of whether the instruments give the same answer
as each other.
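A numpy sketch of the moment-based statistic (the heteroscedasticity-robust form; the simulated data and names are ours):

```python
import numpy as np

def overid_stat(Z, e_iv):
    """Overidentification statistic m'[Var(m)]^{-1} m
       = e'Z [Z' diag(e^2) Z]^{-1} Z'e, compared to chi2(L - K)."""
    Ze = Z.T @ e_iv                         # n times m-bar
    S = Z.T @ (Z * e_iv[:, None] ** 2)      # sum of e_i^2 z_i z_i'
    return float(Ze @ np.linalg.inv(S) @ Ze)
```

Under H_0 (all instruments valid) the statistic behaves like a χ²_{L−K} draw; a large value says the instruments are giving conflicting answers.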

3. Test for weak instruments


- testing for E(Z'X) ≠ 0 : whether Z are sufficiently correlated with X
- if just one endogenous variable, then the first stage of 2SLS is the regression
  x_i = z_i′γ + υ_i
- how to test correlation?
- just test that all γ = 0
- how would we carry this out?
- more complicated if more than one endogenous variable
- if weak correlation of X and Z:
  Asy.Var[β̂_IV] = σ²[X′Z(Z′Z)⁻¹Z′X]⁻¹
- if X′Z → 0, then Asy.Var[β̂_IV] → ∞


- Godfrey test compares the variances of b and β̂_2SLS
- for just one endogenous x_k, with the ratio

  R²_k = (X′X)_kk / (X̂′X̂)_kk ,

  then R²_k(n − L) / [(1 − R²_k)(L − 1)] ∼ F
- more complicated with multiple endogenous xk
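The "test that all γ = 0" step above is just the first-stage F-statistic. A sketch for the one-endogenous-variable case (names are ours; the common F < 10 rule of thumb for weak instruments is a convention, not from the notes):

```python
import numpy as np

def first_stage_F(x, Z):
    """F-statistic for H0: all first-stage slopes are zero in x = Z g + v
    (constant included). Low F suggests weak instruments."""
    n = len(x)
    W = np.column_stack([np.ones(n), Z])    # constant + instruments
    g = np.linalg.lstsq(W, x, rcond=None)[0]
    e = x - W @ g
    ssr_u = e @ e                            # unrestricted SSR
    ssr_r = ((x - x.mean()) ** 2).sum()      # restricted: constant only
    L = Z.shape[1]
    return ((ssr_r - ssr_u) / L) / (ssr_u / (n - L - 1))
```

A strongly correlated instrument gives a very large F; a nearly irrelevant one gives an F near 1.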

Measurement Error

yi* = β xi* + ε i
yi = yi* + υi
xi = xi* + ui
- if only error in yi , no problem
yi = β xi* + ε i + υi = β xi* + ε i′
- if error in xi , big problem
yi = β xi + ε i − β ui = β xi + wi
Cov[ xi ,wi ] = Cov ⎡⎣ xi* + u i , ε i − β ui ⎤⎦ = − βσ u2
- violates exogeneity of x
plim b = β / (1 + σ²_u / plim(x*′x*/n)) : attenuation bias - b too small

- in a multivariable context, we don't know the direction of the bias
- even if just one x has measurement error, all of b is biased
- to fix, use IV
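The attenuation result is easy to confirm by simulation (a sketch; all the parameter values are ours):

```python
import numpy as np

# Simulate y = beta * x_star + eps, then observe x = x_star + u.
rng = np.random.default_rng(0)
n, beta, s2_u = 100_000, 2.0, 1.0
x_star = rng.normal(size=n)                            # Var(x*) = 1
y = beta * x_star + rng.normal(size=n)
x = x_star + rng.normal(scale=np.sqrt(s2_u), size=n)   # mismeasured regressor
b = (x @ y) / (x @ x)                                  # OLS slope, no constant
# plim b = beta / (1 + s2_u / Var(x*)) = 2 / (1 + 1): half the true beta
```

With σ²_u equal to the variance of x*, OLS recovers only half the true coefficient, exactly as the formula predicts.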

Panel Data

Have cross section data on units (the “panel”) repeatedly measured over time
- AKA cross-section time-series data (“xt” in Stata)

Nothing inherently problematic, just allows you to correct for more issues
- an opportunity to make more precise estimates
- in particular, to control for all unchanging individual characteristics
- with an individual-specific constant term

Panel data typically expensive and difficult to collect


- attrition bias
- not typically random who drops out of panel over time
- important to have dedicated surveyors who track everyone down

Have both an individual subscript i and a time subscript t

yit = x it′ β + ε it

if T is the same for all individuals, then a “balanced panel”

if Ti is different for each individual, “unbalanced panel”
- in general, just complicates the notation a bit
- rarely a substantive issue, unless you are programming estimators

How many observations?

- nT if balanced, or Σ_i T_i if unbalanced

Most important issues


1. How do we estimate individual effects?
- fixed effects or random effects models?
2. Do the coefficient estimates ( βi ) vary by individual?
- random coefficients model
3. How do we model autoregressive errors?
- Arellano-Bond GMM estimators

Fixed vs. Random effects

yit = x it′ β + α i + ε it , where x it doesn’t have a column of ones (why not?)

i.e. why can’t we estimate yit = x it′ β + β 0 + α i + ε it


y = Xβ + iβ₀ + d₁α₁ + d₂α₂ + … + d_nα_n + ε

Σ_{i=1}^n d_i = i, so i is collinear with the d_i

The X matrix (including i and the d_i) will not be full rank

- this is just the usual dummy variable problem
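The collinearity is easy to see numerically. An illustrative sketch with n = 3 individuals observed T = 2 times (the construction is ours):

```python
import numpy as np

n, T = 3, 2
ones = np.ones((n * T, 1))                 # the column i of ones
D = np.kron(np.eye(n), np.ones((T, 1)))    # individual dummies d_1 .. d_n
M = np.hstack([ones, D])                   # candidate columns of X
# rank is n, not n + 1: the dummies sum to the column of ones
rank = np.linalg.matrix_rank(M)
```

Dropping either the overall constant or one dummy restores full rank, which is the usual fix.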

This is the big deal of most panel data estimation


- we can estimate an individual-specific constant term
- means we can control for all unchanging individual characteristics
- another tool for reducing endogeneity
- why don’t we estimate this for cross-sectional data?
yi = x i′β + α i + ε i
because we would be estimating n+K coefficients
- with n observations
- failure of identification

with A1-A4, we can estimate this with OLS
consistent and efficient
known as “fixed effects”, but doesn’t mean that α i are not random variables
- misnomer

Issues:
1) α i not consistently estimated
- each α i just estimated from T observations
- imagine we just had data on 1 individual
- could still estimate that α i
- since T is typically small, too few obs for consistent estimate
- typically less than 25, almost certainly less than 100
- often said that “T is assumed fixed”
- not a good way to say it
- T just too small for accurate estimates
- and asymptotic approximations
- therefore can’t trust value of α i
but we have controlled for all unchanging individual characteristics

Aside: sample size doesn’t only matter for asymptotics


- even with small-sample statistics,
- we still have inaccurate estimates in small samples
- we just have more confidence that we know the true variance

2) cannot estimate effect of any other unchanging characteristic


- e.g. effect of ability on earnings
- can control for effect of ability if it is unchanging
- can’t independently estimate effect of education
- since unchanging among adults
- lack of identification

Estimating Fixed Effects:


- if 1000s of individuals, 1000s of individual effects
- each with its own dummy variable
- regression with 1000s of indep. variables
computationally inefficient,
especially since we don’t care about value of α i

instead subtract off individual means:

yit = x it′ β + α i + ε it
ȳ_i ≡ (1/T) Σ_{t=1}^T y_it

ȳ_i = x̄_i′β + α_i + ε̄_i     n.b. ᾱ_i = α_i

y_it − ȳ_i = (x_it′ − x̄_i′)β + α_i − α_i + ε_it − ε̄_i

y_it − ȳ_i = (x_it′ − x̄_i′)β + ε_it − ε̄_i
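The within transformation above can be sketched directly (illustrative; the function and variable names are ours):

```python
import numpy as np

def within_transform(v, ids):
    """Subtract each individual's time mean: v_it - vbar_i.
    `ids` gives the individual index of each row."""
    v = np.asarray(v, dtype=float)
    out = v.copy()
    for i in np.unique(ids):
        mask = ids == i
        out[mask] -= v[mask].mean(axis=0)
    return out

def fixed_effects_beta(y, X, ids):
    """Fixed-effects slopes via OLS on the demeaned data."""
    yd, Xd = within_transform(y, ids), within_transform(X, ids)
    return np.linalg.lstsq(Xd, yd, rcond=None)[0]
```

In a simulation where the individual effect α_i is correlated with x, pooled OLS is biased while the fixed-effects estimate recovers the true β, which is exactly the point of the transformation.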

Are A1-A4 still met for regressing y_it − ȳ_i on x_it′ − x̄_i′?

- Is [x_it′ − x̄_i′] full column rank?
- yes - subtracting off means doesn't change that
- Is E(ε_it − ε̄_i | X) = 0?
- yes - because E(ε_it) = 0 ∀ i,t, so E(ε̄_i) = 0
- Is V(ε_it − ε̄_i | X) = σ²?

V(ε_it − ε̄_i | X) = V(ε_it | X) + V(ε̄_i | X) − 2Cov(ε_it, ε̄_i | X)

V(ε̄_i | X) = V((ε_i1 + … + ε_iT)/T | X) = (1/T²)Tσ² = σ²/T

Cov(ε_it, ε̄_i | X) = Cov(ε_it, (ε_i1 + … + ε_iT)/T | X) = Cov(ε_it, ε_it/T | X)
because Cov(ε_it, ε_is | X) = 0 ∀ t ≠ s

Cov(ε_it, ε̄_i | X) = σ²/T

so V(ε_it − ε̄_i | X) = σ² + σ²/T − 2σ²/T = (1 − 1/T)σ²

Variance no longer equal to σ 2 , but still homoscedastic

How about autocorrelation?

Cov(ε_it − ε̄_i, ε_is − ε̄_i | X) = Cov(ε_it, ε_is | X) − Cov(ε̄_i, ε_is | X) − Cov(ε_it, ε̄_i | X) + Cov(ε̄_i, ε̄_i | X)

Cov(ε_it, ε_is | X) = 0

Cov(ε̄_i, ε_is | X) = Cov(ε_it, ε̄_i | X) = σ²/T

Cov(ε̄_i, ε̄_i | X) = V(ε̄_i | X) = σ²/T, so

Cov(ε_it − ε̄_i, ε_is − ε̄_i | X) = −2σ²/T + σ²/T = −σ²/T
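Both moments of the demeaned errors are easy to confirm by simulation (a sketch; T, σ², and the replication count are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
T, reps = 5, 200_000
eps = rng.normal(scale=2.0, size=(reps, T))     # sigma^2 = 4
dev = eps - eps.mean(axis=1, keepdims=True)     # eps_it - eps-bar_i
var_hat = dev[:, 0].var()                       # near (1 - 1/T) * 4 = 3.2
cov_hat = (dev[:, 0] * dev[:, 1]).mean()        # near -4 / T = -0.8
```

The sample variance lands on (1 − 1/T)σ² and the cross-period covariance on −σ²/T, matching the derivation.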

Let Σ_i be the T×T matrix

    Σ_i = ⎡ (1 − 1/T)σ²    −σ²/T      ⋯      −σ²/T    ⎤
          ⎢   −σ²/T      (1 − 1/T)σ²  ⋯      −σ²/T    ⎥
          ⎢     ⋮             ⋮       ⋱        ⋮      ⎥
          ⎣   −σ²/T        −σ²/T      ⋯   (1 − 1/T)σ² ⎦

and let [ε̄_i] = (ε̄_1, …, ε̄_1, …, ε̄_n, …, ε̄_n)′ stack each ε̄_i T times, then

    V[ε − [ε̄_i] | X] = ⎡ Σ_1   0   ⋯   0  ⎤
                        ⎢  0   Σ_2  ⋯   0  ⎥
                        ⎢  ⋮    ⋮   ⋱   ⋮  ⎥
                        ⎣  0    0   ⋯  Σ_n ⎦
How big is this matrix?
nT x nT
Are our OLS assumptions met?
No - Autocorrelation within individual time series
Use GLS - easy to form the P matrices
- because we just need the estimate s²/T of σ²/T

Time and individual fixed effects:

yit = x it′ β + α i + δ t + ε it
if ȳ_t = (1/n) Σ_{i=1}^n y_it , then regress y_it − ȳ_i − ȳ_t + ȳ on x_it′ − x̄_i′ − x̄_t′ + x̄′
(the grand means ȳ and x̄ are added back so that, in a balanced panel, the transformation removes both effects exactly)
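A sketch of the two-way demeaning for a balanced panel (names are ours):

```python
import numpy as np

def two_way_demean(v, i_idx, t_idx):
    """Return v_it - vbar_i - vbar_t + vbar for a balanced panel.
    `i_idx` and `t_idx` give each row's individual and period."""
    v = np.asarray(v, dtype=float)
    out = v.copy()
    for i in np.unique(i_idx):
        out[i_idx == i] -= v[i_idx == i].mean()   # subtract vbar_i
    for t in np.unique(t_idx):
        out[t_idx == t] -= v[t_idx == t].mean()   # subtract vbar_t
    return out + v.mean()                          # add back grand mean
```

A quick sanity check: applied to data that are purely an individual effect plus a time effect, the transformation returns exactly zero, i.e. both effects are swept out.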
