Econometricks-Short Guide
Davud Rostam-Afschar
December 1, 2024
Abstract
Short guides to econometrics illustrate statistical methods and demonstrate how they work in theory and practice, with many examples.
* These guides were developed based on lectures delivered by Davud Rostam-Afschar at the University of Mannheim. I am grateful for the valuable input provided by numerous cohorts of PhD students at the Graduate School of Economic and Social Sciences, University of Mannheim. I thank the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) for financial support through CRC TRR 266 Accounting for Transparency (Davud Rostam-Afschar, Project-ID 403041268). Replication files and updates are available here.
University of Mannheim, 68131 Mannheim, Germany; GLO; IZA; NeST (e-mail: rostam-afschar@uni-mannheim.de).
Contents
1 Review of Probability Theory
1.1 Introduction
2 Specific Distributions
2.1 Normal distribution
4.4 Properties of the OLS Estimator in the Small and in the Large
5 Simplifying Linear Regressions using Frisch-Waugh-Lovell
5.1 Frisch-Waugh-Lovell theorem in equation algebra
8 Conclusion
References
1 Review of Probability Theory
1.1 Introduction
This guide takes a look under the hood of widely used methods in econometrics: least squares, maximum likelihood, and the generalized method of moments. It shows when and why these methods work with simple examples. This guide also provides an overview of the most important fundamentals of probability theory and distribution theory on which these methods are based, and shows how to analyze the estimators in small samples and as the sample size grows infinite.
Discrete probabilities
For values x of a discrete random variable X, the probability mass function (pmf) $f(x) = \mathrm{Prob}(X = x)$ satisfies

$$0 \leq \mathrm{Prob}(X = x) \leq 1, \qquad \sum_x f(x) = 1.$$

Example

For the roll of a six-sided die, $f(x) = 1/6$ and $F(X \leq x) = x/6$ for $x = 1, \ldots, 6$, so that

$$F(X \geq 5) = 1 - F(X \leq 4) = 1 - 2/3 = 1/3.$$
Continuous probabilities

For values x of a continuous random variable X, the probability of any single point is zero, but the area under the probability density function (pdf) $f(x) \geq 0$ in the range from a to b is

$$\mathrm{Prob}(a \leq x \leq b) = \mathrm{Prob}(a < x < b) = \int_a^b f(x)\,dx \geq 0,$$

with

$$\int_{-\infty}^{+\infty} f(x)\,dx = 1 \qquad\text{and}\qquad f(x) = \frac{dF(x)}{dx}.$$

Properties of the cdf:

$$0 \leq F(x) \leq 1, \qquad F(+\infty) = 1, \qquad F(-\infty) = 0.$$
Symmetric distributions

For symmetric distributions,

$$f(\mu - x) = f(\mu + x) \qquad\text{and}\qquad 1 - F(x) = F(-x).$$
1.3 Mean and variance
Mean of a random variable (Discrete)

$$E[x] = \sum_x x\,\mathrm{Prob}(X = x) = \sum_x x f(x).$$

Example

For the roll of a six-sided die, $E[x] = \sum_{x=1}^{6} x/6 = 3.5$.
Mean of a random variable (Continuous)

$$E[x] = \int_x x f(x)\,dx.$$

Example

For the uniform distribution on [a, b] with $f(x) = 1/(b-a)$,

$$E[x] = \int_a^b \frac{x}{b-a}\,dx = \frac{1}{b-a}\int_a^b x\,dx.$$

The antiderivative of x is $x^2/2$, so

$$E[x] = \frac{1}{b-a}\left(\frac{b^2}{2} - \frac{a^2}{2}\right) = \frac{(b-a)(b+a)}{2(b-a)} = \frac{a+b}{2}.$$
For a function g(x) of x, the expected value is $E[g(x)] = \sum_x g(x)\,\mathrm{Prob}(X = x)$ in the discrete case or $E[g(x)] = \int_x g(x) f(x)\,dx$ in the continuous case. If $g(x) = a + bx$ for constants a and b, then $E[a + bx] = a + bE[x]$.
Example

Roll of a six-sided die. What is the variance V[x] of the outcome?

The probability of observing x, $\mathrm{Prob}(X = x) = 1/n$, is discrete uniform, so

$$E[x] = \frac{n+1}{2}, \qquad (E[x])^2 = \frac{(n+1)^2}{4},$$

$$E[x^2] = \sum_x x^2\,\mathrm{Prob}(X = x) = \frac{1}{n}\sum_{x=1}^{n} x^2 = \frac{(n+1)(2n+1)}{6}$$

by the formula for the sum of squares. Hence

$$V[x] = E[x^2] - (E[x])^2 = \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} = \frac{n^2-1}{12} = \frac{35}{12} \approx 2.92 \quad\text{for } n = 6.$$
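A minimal Python sketch of the die example above, computing the mean, second moment, and variance directly from the discrete uniform pmf; the numbers are simple arithmetic, not additional results from the text.

```python
# Check the die example: E[x], E[x^2], and V[x] for a discrete uniform on 1..6.
import numpy as np

n = 6
x = np.arange(1, n + 1)      # faces of the die
p = np.full(n, 1 / n)        # Prob(X = x) = 1/n

Ex = np.sum(x * p)           # (n + 1) / 2 = 3.5
Ex2 = np.sum(x**2 * p)       # (n + 1)(2n + 1) / 6 = 91/6
Vx = Ex2 - Ex**2             # (n^2 - 1) / 12 = 35/12

print(Ex, Ex2, Vx)           # 3.5  15.1666...  2.9166...
```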
Chebychev inequality

$$\Pr(\mu - k\sigma < x < \mu + k\sigma) \geq 1 - \frac{1}{k^2}.$$

For normally distributed x, 95% of the observations lie within 1.96 standard deviations of the mean. If x is not normal, Chebychev's inequality guarantees 95% coverage only within at most $k = \sqrt{20} \approx 4.47$ standard deviations.
Normal coverage

[Figure: coverage of the normal distribution within k standard deviations; lost in extraction.]

The r-th central moment is

$$\mu_r = E[(x - \mu)^r].$$
Higher order moments

For the random variable X with probability density function f(x), the moment-generating function (MGF), if it exists, is

$$M(t) = E[e^{tx}].$$

The n-th moment is the n-th derivative of the moment-generating function, evaluated at t = 0.

Example

For the normal distribution,

$$M_x(t) = e^{\mu t + \sigma^2 t^2/2},$$

which for the standard normal z reduces to $M_z(t) = e^{t^2/2}$.
Example

The first derivative of the MGF of the normal distribution is

$$E[(x-\mu)^1] = M_x'(t) = \frac{d\,\exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)}{dt} = \frac{d\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)}{dt}\,\exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right) = (\mu + \sigma^2 t)\exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right).$$

If $x \sim N(0, 1)$:

Example

$$E[(x-\mu)^1] = M_x'(t) = (\mu + \sigma^2 t)\exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right) \quad\text{with } \mu = 0,\ \sigma = 1,\ t = 0:\ E[x] = \mu = 0,$$

$$E[(x-\mu)^2] = M_x''(t) = \left[\sigma^2 + (\mu + \sigma^2 t)^2\right]\exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right) \quad\text{with } \mu = 0,\ \sigma = 1,\ t = 0:\ E[(x-\mu)^2] = \sigma^2 = 1,$$

$$E[(x-\mu)^3] = M_x'''(t) = \left[3\sigma^2(\mu + \sigma^2 t) + (\mu + \sigma^2 t)^3\right]\exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right) \quad\text{with } \mu = 0,\ \sigma = 1,\ t = 0:\ E[(x-\mu)^3] = 0,$$

$$E[(x-\mu)^4] = M_x^{(4)}(t) = \left[3\sigma^4 + 6\sigma^2(\mu + \sigma^2 t)^2 + (\mu + \sigma^2 t)^4\right]\exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right) \quad\text{with } \mu = 0,\ \sigma = 1,\ t = 0:\ E[(x-\mu)^4] = 3.$$
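A short symbolic check of the derivatives above; this is an illustrative sketch using sympy, not part of the original text.

```python
# Differentiate the normal MGF exp(mu*t + sigma^2*t^2/2) and evaluate at t = 0
# with mu = 0, sigma = 1; the first four moments should be 0, 1, 0, 3.
import sympy as sp

t, mu, sigma = sp.symbols('t mu sigma')
M = sp.exp(mu * t + sigma**2 * t**2 / 2)

for r in range(1, 5):
    moment = sp.diff(M, t, r).subs({t: 0, mu: 0, sigma: 1})
    print(r, sp.simplify(moment))   # 1 -> 0, 2 -> 1, 3 -> 0, 4 -> 3
```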
For a function g(x), $E[g(x)] = \int_x g(x) f(x)\,dx$ and

$$\mathrm{Var}[g(x)] = \int_x (g(x) - E[g(x)])^2 f(x)\,dx.$$

E[g(x)] and Var[g(x)] can be approximated by a first-order linear Taylor series:
Taylor approximation Order 1

$$g(x) \approx g(x_0) + g'(x_0)(x - x_0). \qquad (1)$$

A natural choice for the expansion point is $x_0 = \mu = E[x]$. Inserting this value in Eq. (1) gives

$$g(x) \approx g(\mu) + g'(\mu)(x - \mu),$$

so that

$$E[g(x)] \approx g(\mu),$$

and

$$\mathrm{Var}[g(x)] \approx g'(\mu)^2\,\mathrm{Var}[x].$$
Example

Isoelastic utility $u(c) = c^{1/2}$ with $c_{bad} = 10.00$ Euro, $c_{good} = 100.00$ Euro, and probability of the good outcome 50%.

Example

Isoelastic utility $u(c) = \ln(c)$ with $c_{bad} = 10.00$ Euro, $c_{good} = 100.00$ Euro, probability of the good outcome 50%, and $\mu = 55.00$ Euro.

Jensen's inequality:

$$E[g(x)] \leq g(E[x]) \quad \text{if } g''(x) < 0.$$

Useful results:

$$E[x^2] = \sigma^2 + \mu^2, \qquad \mathrm{Var}[a] = 0, \qquad \Pr(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}, \qquad E[g(x)] \approx g(\mu).$$
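A minimal sketch of the utility examples above, comparing E[u(c)] with u(E[c]) for the two concave utility functions; the payoffs and probabilities come from the text, everything else is computed here for illustration.

```python
# Jensen's inequality with c_bad = 10, c_good = 100, and a 50% chance of the
# good outcome: for concave u, E[u(c)] <= u(E[c]).
import numpy as np

c = np.array([10.0, 100.0])
p = np.array([0.5, 0.5])
mu = np.sum(p * c)                        # 55.00 Euro

for name, u in [("sqrt", np.sqrt), ("log", np.log)]:
    Eu = np.sum(p * u(c))                 # expected utility E[u(c)]
    print(name, Eu, u(mu), Eu <= u(mu))   # True in both cases
```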
2 Specific Distributions
Discrete distributions

Bernoulli distribution

$$\mathrm{Prob}(x = 1) = p, \qquad \mathrm{Prob}(x = 0) = 1 - p,$$

with $E[x] = p$ and $V[x] = p(1-p)$.

Binomial distribution

For the number of successes x in n independent Bernoulli trials, $E[x] = np$ and $V[x] = np(1-p)$.
Poisson distribution

$$\mathrm{Prob}(X = x) = \frac{e^{-\lambda}\lambda^x}{x!},$$

with $E[x] = \lambda$ and $V[x] = \lambda$.
2.1 Normal distribution
The normal distribution has density

$$f(x\,|\,\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.$$

The density is denoted $\phi(x)$ and the cumulative distribution function is denoted $\Phi(x)$ for the standard normal. [Figure: example of a standard normal, $x \sim N[0, 1]$, and a normal with different parameters; lost in extraction.]
2.2 Method of transformations
Transformation of random variables
Method of transformations
For a one-to-one transformation $y = g(x)$ with inverse $x = g^{-1}(y)$,

$$\mathrm{Prob}(y \leq b) = \int_{-\infty}^{b} f_x\big(g^{-1}(y)\big)\,\big|g^{-1\prime}(y)\big|\,dy = \int_{-\infty}^{b} f_y(y)\,dy.$$

Example

If $x \sim N[\mu, \sigma^2]$, then the distribution of $y = g(x) = \frac{x-\mu}{\sigma}$ is found as follows:

$$g^{-1}(y) = x = \sigma y + \mu, \qquad g^{-1\prime}(y) = \frac{dx}{dy} = \sigma.$$

Therefore, with $f_x(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left[(g^{-1}(y)-\mu)^2/\sigma^2\right]}$,

$$f_y(y) = f_x\big(g^{-1}(y)\big)\,\big|g^{-1\prime}(y)\big| = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-[(\sigma y + \mu) - \mu]^2/2\sigma^2}\,|\sigma| = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}.$$
Properties of the normal distribution

Preservation under linear transformation: if $x \sim N[\mu, \sigma^2]$, then $a + bx \sim N[a + b\mu, b^2\sigma^2]$. The standard normal density is

$$\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}.$$

Example

$$f_x(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}, \qquad y = g(x) = x^2.$$

Since $g^{-1}(y) = x = \pm\sqrt{y}$, there are two solutions $g_1, g_2$, with

$$g^{-1\prime}(y) = \frac{dx}{dy} = \pm\tfrac{1}{2}y^{-1/2},$$

so that

$$f_y(y) = f_x\big(g_1^{-1}(y)\big)\,\big|g_1^{-1\prime}(y)\big| + f_x\big(g_2^{-1}(y)\big)\,\big|g_2^{-1\prime}(y)\big| = f_x(\sqrt{y})\,\big|\tfrac{1}{2}y^{-1/2}\big| + f_x(-\sqrt{y})\,\big|-\tfrac{1}{2}y^{-1/2}\big| = \frac{1}{2\sqrt{2\pi y}}\, e^{-\frac{y}{2}} + \frac{1}{2\sqrt{2\pi y}}\, e^{-\frac{y}{2}} = \frac{1}{\sqrt{2\pi y}}\, e^{-\frac{y}{2}},$$

which is the $\chi^2[1]$ density. For n independent $\chi^2[1]$ variables $x_i$,

$$\sum_{i=1}^{n} x_i \sim \chi^2[n].$$
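A quick simulation sketch of the transformation result above; the sample size and random seed are arbitrary choices for illustration.

```python
# Squaring a standard normal gives a chi-squared variable with 1 degree of
# freedom; check moments and compare with the chi2(1) cdf.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.standard_normal(100_000) ** 2

print(y.mean(), y.var())                              # close to E[x] = 1, V[x] = 2
print(stats.kstest(y, stats.chi2(df=1).cdf).pvalue)   # large p-value: same distribution
```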
Normal

Parameters: µ ∈ R, σ ∈ R>0
Support: x ∈ R
PDF: $\phi\!\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
CDF: $\Phi\!\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$
Mean: µ
Median: µ
Mode: µ
Variance: σ²
Skewness: 0
Ex. Kurtosis: 0
MGF: $\exp(\mu t + \sigma^2 t^2/2)$

If $z_i$, $i = 1, \ldots, n$, are independent standard normal variables, then

$$\sum_{i=1}^{n} z_i^2 \sim \chi^2[n];$$

if the $z_i$ are independent $N[0, \sigma^2]$ variables, then

$$\sum_{i=1}^{n}\left(\frac{z_i}{\sigma}\right)^2 \sim \chi^2[n];$$

and if $x_1 \sim \chi^2[n_1]$ and $x_2 \sim \chi^2[n_2]$ are independent, then

$$x_1 + x_2 \sim \chi^2[n_1 + n_2].$$
2.3 The χ² distribution

The χ² distribution with n degrees of freedom has

$$E[x] = n \qquad\text{and}\qquad V[x] = 2n.$$

Approximating a χ²

For degrees of freedom greater than 30, the distribution of the chi-squared variable x is approximately normal.
χ²

Parameters: n ∈ N>0
Support: x ∈ R>0 if n = 1, else x ∈ R≥0
PDF: $\frac{1}{2^{n/2}\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}$
CDF: $\frac{1}{\Gamma(n/2)}\,\gamma\!\left(\frac{n}{2}, \frac{x}{2}\right)$
Mean: n
Median: no simple closed form
Mode: max(n − 2, 0)
Variance: 2n
Skewness: $\sqrt{8/n}$
Ex. Kurtosis: 12/n

Here $\gamma(s, x) = \int_0^x t^{s-1} e^{-t}\,dt$ is the lower incomplete gamma function.
2.4 The F-distribution
The F-distribution
If $x_1$ and $x_2$ are two independent chi-squared variables with degrees of freedom parameters $n_1$ and $n_2$, respectively, then the ratio

$$F[n_1, n_2] = \frac{x_1/n_1}{x_2/n_2}$$

has the F distribution with $n_1$ and $n_2$ degrees of freedom.
F

Parameters: n₁, n₂ ∈ N>0
Support: x ∈ R>0 if n₁ = 1, else x ∈ R≥0
PDF: $\frac{\Gamma\!\left(\frac{n_1+n_2}{2}\right)}{\Gamma\!\left(\frac{n_1}{2}\right)\Gamma\!\left(\frac{n_2}{2}\right)}\; \frac{n_1^{n_1/2}\, n_2^{n_2/2}\, x^{n_1/2-1}}{(n_1 x + n_2)^{(n_1+n_2)/2}}$
CDF: $I\!\left(\frac{n_1 x}{n_1 x + n_2}, \frac{n_1}{2}, \frac{n_2}{2}\right)$
Mean: $\frac{n_2}{n_2-2}$ for n₂ > 2
Median: no simple closed form
Mode: $\frac{n_1-2}{n_1}\cdot\frac{n_2}{n_2+2}$ for n₁ > 2
Variance: $\frac{2 n_2^2 (n_1+n_2-2)}{n_1 (n_2-2)^2 (n_2-4)}$ for n₂ > 4
Skewness: $\frac{(2n_1+n_2-2)\sqrt{8(n_2-4)}}{(n_2-6)\sqrt{n_1(n_1+n_2-2)}}$ for n₂ > 6
Ex. Kurtosis: $12\,\frac{n_1(5n_2-22)(n_1+n_2-2) + (n_2-4)(n_2-2)^2}{n_1(n_2-6)(n_2-8)(n_1+n_2-2)}$ for n₂ > 8
MGF: does not exist

Regularized incomplete beta function: $I(x, a, b) = \frac{B(x, a, b)}{B(a, b)}$ with $B(x, a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt$.
2.5 The Student t-distribution

The Student t-distribution

If $z \sim N[0, 1]$ and $x \sim \chi^2[n]$ are independent, then $t = z/\sqrt{x/n}$ follows the t distribution with n degrees of freedom. [Figure: the t distributions with 3 and 10 degrees of freedom compared with the standard normal distribution; lost in extraction.]
t

Parameters: n ∈ R>0 (degrees of freedom)
Support: x ∈ R
PDF: $\frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{\pi n}\,\Gamma\!\left(\frac{n}{2}\right)}\left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}$
CDF: $\frac{1}{2} + x\,\Gamma\!\left(\frac{n+1}{2}\right)\,\frac{{}_2F_1\!\left(\frac{1}{2}, \frac{n+1}{2}; \frac{3}{2}; -\frac{x^2}{n}\right)}{\sqrt{\pi n}\,\Gamma\!\left(\frac{n}{2}\right)}$
Mean: 0 for n > 1
Median: 0
Mode: 0
Variance: $\frac{n}{n-2}$ for n > 2, ∞ for 1 < n ≤ 2
Skewness: 0 for n > 3
Ex. Kurtosis: $\frac{6}{n-4}$ for n > 4, ∞ for 2 < n ≤ 4
MGF: does not exist
2.6 The lognormal distribution
The lognormal distribution
The lognormal distribution, denoted LN[µ, σ²], has been particularly useful in modeling positive, right-skewed variables. Its density is

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-\frac{1}{2}\left[(\ln x - \mu)/\sigma\right]^2}, \qquad x > 0,$$

with

$$E[x] = e^{\mu + \sigma^2/2} \qquad\text{and}\qquad \mathrm{Var}[x] = e^{2\mu + \sigma^2}\big(e^{\sigma^2} - 1\big).$$
Log-normal

Parameters: µ ∈ R, σ ∈ R>0
Support: x ∈ R>0
PDF: $\frac{1}{x\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$
CDF: $\frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{\ln x - \mu}{\sigma\sqrt{2}}\right)\right] = \Phi\!\left(\frac{\ln(x) - \mu}{\sigma}\right)$
Mean: $\exp\!\left(\mu + \frac{\sigma^2}{2}\right)$
Median: $\exp(\mu)$
Mode: $\exp(\mu - \sigma^2)$
Variance: $\left[\exp(\sigma^2) - 1\right]\exp(2\mu + \sigma^2)$
Skewness: $\left[\exp(\sigma^2) + 2\right]\sqrt{\exp(\sigma^2) - 1}$
Ex. Kurtosis: $\exp(4\sigma^2) + 2\exp(3\sigma^2) + 3\exp(2\sigma^2) - 6$
MGF: does not exist; the distribution is not determined by its moments
2.7 The gamma distribution
The gamma distribution
Many familiar distributions are special cases, including the exponential distribution (α = 1) and the chi-squared (β = 1/2, α = n/2). The Erlang distribution results if α is a positive integer. The mean is α/β, and the variance is α/β². The inverse gamma distribution is the distribution of 1/x when x has a gamma distribution.
Gamma

Parameters: shape k and scale θ (first column), or shape α and rate β (second column)
PDF: $f(x) = \frac{1}{\Gamma(k)\theta^k}\, x^{k-1} e^{-x/\theta}$  or  $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}$
CDF: $F(x) = \frac{1}{\Gamma(k)}\,\gamma\!\left(k, \frac{x}{\theta}\right)$  or  $F(x) = \frac{1}{\Gamma(\alpha)}\,\gamma(\alpha, \beta x)$
Mean: $k\theta$  or  $\frac{\alpha}{\beta}$
Mode: $(k-1)\theta$ for k ≥ 1, 0 for k < 1  or  $\frac{\alpha-1}{\beta}$ for α ≥ 1, 0 for α < 1
Variance: $k\theta^2$  or  $\frac{\alpha}{\beta^2}$
Skewness: $\frac{2}{\sqrt{k}}$  or  $\frac{2}{\sqrt{\alpha}}$
Ex. Kurtosis: $\frac{6}{k}$  or  $\frac{6}{\alpha}$
MGF: $(1 - \theta t)^{-k}$ for $t < \frac{1}{\theta}$  or  $\left(1 - \frac{t}{\beta}\right)^{-\alpha}$ for $t < \beta$

$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$, $\Re(z) > 0$, for complex numbers with a positive real part. The lower incomplete gamma function is $\gamma(s, x) = \int_0^x t^{s-1} e^{-t}\,dt$.
2.8 The beta distribution
The beta distribution
For a variable constrained between 0 and c > 0, the beta distribution has proved useful. Its density is

$$f(x) = \frac{1}{c}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\left(\frac{x}{c}\right)^{\alpha-1}\left(1 - \frac{x}{c}\right)^{\beta-1}.$$

It is symmetric if α = β and asymmetric otherwise. The mean is cα/(α + β), and the variance is c²αβ/[(α + β + 1)(α + β)²].
Beta

Parameters: α, β ∈ R>0
Support: x ∈ [0, 1] or x ∈ (0, 1)
PDF: $\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$
CDF: $I(x, \alpha, \beta)$
Mean: $\frac{\alpha}{\alpha+\beta}$
Median: $I^{[-1]}_{1/2}(\alpha, \beta) \approx \frac{\alpha - \frac{1}{3}}{\alpha + \beta - \frac{2}{3}}$ for α, β > 1
Mode: $\frac{\alpha-1}{\alpha+\beta-2}$ for α, β > 1; any value in (0, 1) for α, β = 1; {0, 1} (bimodal) for α, β < 1; 0 for α ≤ 1, β > 1; 1 for α > 1, β ≤ 1
Variance: $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
Skewness: $\frac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}}$
Ex. Kurtosis: $\frac{6\left[(\alpha-\beta)^2(\alpha+\beta+1) - \alpha\beta(\alpha+\beta+2)\right]}{\alpha\beta(\alpha+\beta+2)(\alpha+\beta+3)}$
MGF: $1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}\right)\frac{t^k}{k!}$

$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$, where Γ is the gamma function, $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt$, $\Re(z) > 0$. The regularized incomplete beta function is $I(x, a, b) = \frac{B(x, a, b)}{B(a, b)}$ with $B(x, a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt$.
2.9 The logistic distribution
The logistic distribution
The logistic distribution is an alternative if the normal cannot model the mass in the tails. Its cdf is

$$F(x) = \Lambda(x) = \frac{1}{1 + e^{-x}}.$$

The density is $f(x) = \Lambda(x)\left[1 - \Lambda(x)\right]$. The mean and variance of this random variable are 0 and π²/3.
Logistic

Parameters: µ ∈ R, s ∈ R>0
Support: x ∈ R
PDF: $\lambda\!\left(\frac{x-\mu}{s}\right) = \frac{e^{-(x-\mu)/s}}{s\left(1 + e^{-(x-\mu)/s}\right)^2}$
CDF: $\Lambda\!\left(\frac{x-\mu}{s}\right) = \frac{1}{1 + e^{-(x-\mu)/s}}$
Mean: µ
Median: µ
Mode: µ
Variance: $\frac{s^2\pi^2}{3}$
Skewness: 0
Ex. Kurtosis: 6/5
MGF: $e^{\mu t}\,B(1 - st, 1 + st)$ for $t \in (-1/s, 1/s)$

The Wishart distribution describes the random matrix of sums of squares and cross products $W = \sum_{i=1}^{n} x_i x_i'$, where $x_i$ is the i-th of n K-element random vectors from the multivariate normal distribution with mean vector µ and covariance matrix Σ. The density of the Wishart random matrix is

$$f(W) = \frac{\exp\!\left(-\tfrac{1}{2}\mathrm{trace}(\Sigma^{-1}W)\right)\,|W|^{\frac{1}{2}(n-K-1)}}{2^{nK/2}\,|\Sigma|^{n/2}\,\pi^{K(K-1)/4}\,\prod_{j=1}^{K}\Gamma\!\left(\frac{n+1-j}{2}\right)}.$$
3 Review of Distribution Theory
3.1 Joint and marginal bivariate distributions
Bivariate distributions
For observations of two discrete variables y ∈ {1, 2} and x ∈ {1, 2, 3}, we can calculate the joint frequencies, the marginal distribution f(x) = n_x/N, and the conditional distribution f(y|x):

freq. n_{x,y}   y = 1   y = 2   f(x) = n_x/N     cond. distr. f(y|x)   y = 1   y = 2
x = 1           1       2       3/10             x = 1                 1/3     2/3
x = 2           1       2       3/10             x = 2                 1/3     2/3
x = 3           0       4       4/10             x = 3                 0       1
3.2 The joint density function
The joint density function
For discrete random variables X and Y,

$$\mathrm{Prob}(a \leq x \leq b,\ c \leq y \leq d) = \sum_{a \leq x \leq b}\ \sum_{c \leq y \leq d} f(x, y);$$

for continuous random variables,

$$\mathrm{Prob}(a \leq x \leq b,\ c \leq y \leq d) = \int_a^b\!\int_c^d f(x, y)\,dy\,dx.$$

Example

joint distr.    y = 1    y = 2
f(x = 1, y)     1/10     2/10
f(x = 2, y)     1/10     2/10
f(x = 3, y)     0        4/10

$$\mathrm{Prob}(1 \leq x \leq 2,\ 2 \leq y \leq 2) = f(y = 2, x = 1) + f(y = 2, x = 2) = 2/5.$$

For values x and y of two discrete random variables X and Y, the probability distribution satisfies

$$f(x, y) = \mathrm{Prob}(X = x, Y = y), \qquad f(x, y) \geq 0, \qquad \sum_x\sum_y f(x, y) = 1.$$
If X and Y are continuous,

$$\int_x\!\int_y f(x, y)\,dx\,dy = 1.$$

The bivariate normal distribution is the joint distribution of two normally distributed variables, with density

$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\, e^{-\frac{1}{2}\left[(\epsilon_x^2 + \epsilon_y^2 - 2\rho\epsilon_x\epsilon_y)/(1-\rho^2)\right]},$$

where $\epsilon_x = \frac{x-\mu_x}{\sigma_x}$ and $\epsilon_y = \frac{y-\mu_y}{\sigma_y}$.

The probability of a joint event of X and Y is given by the joint cumulative distribution function

$$F(x, y) = \mathrm{Prob}(X \leq x, Y \leq y) = \sum_{X \leq x}\sum_{Y \leq y} f(x, y)$$

in the discrete case and

$$F(x, y) = \mathrm{Prob}(X \leq x, Y \leq y) = \int_{-\infty}^{x}\!\int_{-\infty}^{y} f(t, s)\,ds\,dt$$

in the continuous case.
Example

joint distr.    y = 1    y = 2
f(x = 1, y)     1/10     2/10
f(x = 2, y)     1/10     2/10
f(x = 3, y)     0        4/10

$$\mathrm{Prob}(X \leq 2, Y \leq 2) = f(x = 1, y = 1) + f(x = 2, y = 1) + f(x = 1, y = 2) + f(x = 2, y = 2) = 3/5.$$

For values x and y of two discrete random variables X and Y, the cumulative probability distribution satisfies

$$F(x, y) = \mathrm{Prob}(X \leq x, Y \leq y), \qquad 0 \leq F(x, y) \leq 1,$$
$$F(\infty, \infty) = 1, \qquad F(-\infty, y) = 0, \qquad F(x, -\infty) = 0.$$
3.4 The marginal probability density

The marginal probability density

To obtain the marginal distributions $f_x(x)$ and $f_y(y)$ from the joint density f(x, y), it is necessary to sum (or integrate) out the other variable, e.g., $f_x(x) = \sum_y f(x, y)$ or $f_x(x) = \int_y f(x, y)\,dy$.

Example

joint distr.    y = 1    y = 2    f_x(x)
f(x = 1, y)     1/10     2/10     3/10
f(x = 2, y)     1/10     2/10     3/10
f(x = 3, y)     0        4/10     4/10

$$f_x(x = 1) = f(x = 1, y = 1) + f(x = 1, y = 2) = 3/10.$$
$$f_y(y = 2) = f(x = 1, y = 2) + f(x = 2, y = 2) + f(x = 3, y = 2) = 4/5.$$
The bivariate normal distribution

[Figure: density of the bivariate normal distribution; lost in extraction.]

Expectations

If x and y are discrete,

$$E[x] = \sum_x x f_x(x) = \sum_x x\left[\sum_y f(x, y)\right] = \sum_x\sum_y x f(x, y).$$

If x and y are continuous,

$$E[x] = \int_x x f_x(x)\,dx = \int_x\!\int_y x f(x, y)\,dy\,dx.$$

Variances

$$\mathrm{Var}[x] = \sum_x (x - E[x])^2 f_x(x) = \sum_x\sum_y (x - E[x])^2 f(x, y).$$
3.5 Covariance and correlation
For any function g(x, y),

$$E[g(x, y)] = \begin{cases}\sum_x\sum_y g(x, y) f(x, y) & \text{in the discrete case,}\\[4pt] \int_x\int_y g(x, y) f(x, y)\,dy\,dx & \text{in the continuous case.}\end{cases}$$

The covariance of x and y is $\sigma_{xy} = E[(x - \mu_x)(y - \mu_y)]$. If x and y are independent, so that $f(x, y) = f_x(x) f_y(y)$, then

$$\sigma_{xy} = \sum_x\sum_y f_x(x) f_y(y)(x - \mu_x)(y - \mu_y) = \sum_x (x - \mu_x) f_x(x)\,\sum_y (y - \mu_y) f_y(y) = E[x - \mu_x]\,E[y - \mu_y] = 0.$$

The correlation is $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$.

Two random variables are statistically independent if and only if their joint density is the product of the marginal densities, $f(x, y) = f_x(x) f_y(y)$. If (and only if) x and y are independent, then the marginal cdfs factor the cdf as well:

$$F(x, y) = F_x(x)\,F_y(y).$$
Example

Using the table above, $P(x \leq 2)\,P(y \leq 2) = 6/10 \times 1 = 3/5$, which equals $\mathrm{Prob}(X \leq 2, Y \leq 2)$ computed earlier.

The conditional distribution over y for each value of x (and vice versa) has conditional densities

$$f(y|x) = \frac{f(x, y)}{f_x(x)}, \qquad f(x|y) = \frac{f(x, y)}{f_y(y)}.$$

The marginal distribution of x averages the probability of x given y over the distribution of all values of y, $f_x(x) = E_y[f(x|y) f(y)]$. If x and y are independent, knowing the value of y does not change the conditional distribution of x: $f(x|y) = f_x(x)$.

Example
$$f(x = 3\,|\,y = 2) = \frac{f(x = 3, y = 2)}{f_y(y = 2)} = 4/10 \times 10/8 = 1/2.$$

$$f_x(x = 2) = E_y[f(x = 2|y) f(y)] = f(x = 2|y = 1) f(y = 1) + f(x = 2|y = 2) f(y = 2) = \tfrac{1}{2}\cdot\tfrac{2}{10} + \tfrac{1}{4}\cdot\tfrac{8}{10} = 3/10.$$

The variable y can always be decomposed into its conditional mean and a deviation from it,

$$y = E[y|x] + (y - E[y|x]) = E[y|x] + \epsilon.$$

Definition

The conditional mean predicts y at values of x:

$$E[y|x = 1] = \sum_y y f(y|x = 1) = 1 \times 1/3 + 2 \times 2/3 = 5/3.$$
Conditional variance

The conditional variance is the variance of the conditional distribution,

$$V[y|x] = E\big[(y - E[y|x])^2\,\big|\,x\big] = E[y^2|x] - (E[y|x])^2.$$

Example

The average of the variances within each x, E[V[y|x]], is less than or equal to the total variance V[y].

Example

The variation of the conditional means across x, that is, how much the prediction of y varies with the variation in x, is on average V[E[y|x]]. For the conditional means E[y|x = 1] = 5/3, E[y|x = 2] = 5/3, and E[y|x = 3] = 2, y varies with

$$V[E[y|x]] = E[(E[y|x])^2] - (E[E[y|x]])^2 = 3/10 \times (5/3)^2 + 3/10 \times (5/3)^2 + 4/10 \times 2^2 - (9/5)^2 = 2/75.$$

Decomposition of the variance:

$$E[V[y|x]] + V[E[y|x]] = V[y] = 2/15 + 2/75 = 4/25.$$

With degree of freedom correction (n − 1) (as reported in software):

$$E[V[y|x]] + V[E[y|x]] = V[y] = 2/15 \times \tfrac{10}{9} + 2/75 \times \tfrac{10}{9} = 8/45.$$
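A small numerical check of the decomposition above, rebuilding the ten observations from the frequency table; only the counts come from the text, the code itself is an illustrative sketch.

```python
# Verify E[V[y|x]] + V[E[y|x]] = V[y] for the 10 observations behind the table.
import numpy as np

data = [(1, 1)] * 1 + [(1, 2)] * 2 + [(2, 1)] * 1 + [(2, 2)] * 2 + [(3, 2)] * 4
x, y = np.array(data, dtype=float).T

groups = [y[x == v] for v in (1.0, 2.0, 3.0)]
px = np.array([len(g) for g in groups]) / len(y)

EV = np.sum(px * np.array([g.var() for g in groups]))                     # 2/15
VE = np.sum(px * np.array([g.mean() for g in groups])**2) - y.mean()**2   # 2/75
print(EV, VE, EV + VE, y.var())                                           # 0.1333 0.0267 0.16 0.16
```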
The bivariate normal density is

$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\, e^{-\frac{1}{2}\left[(\epsilon_x^2 + \epsilon_y^2 - 2\rho\epsilon_x\epsilon_y)/(1-\rho^2)\right]},$$

where $\epsilon_x = \frac{x-\mu_x}{\sigma_x}$ and $\epsilon_y = \frac{y-\mu_y}{\sigma_y}$. The covariance is $\sigma_{xy} = \rho_{xy}\sigma_x\sigma_y$.

If x and y are bivariately normally distributed, $(x, y) \sim N_2[\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho_{xy}]$, the conditional mean of y given x is linear,

$$E[y|x] = \alpha + \beta x \qquad\text{with}\qquad \alpha = \mu_y - \beta\mu_x, \quad \beta = \frac{\sigma_{xy}}{\sigma_x^2},$$

and $f(x, y) = f_x(x) f_y(y)$ if $\rho_{xy} = 0$: x and y are independent if and only if they are uncorrelated.
3.9 Useful rules

Correlation: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$

Linearity / Law of iterated expectations:

$$E[y] = E_x[E[y|x]]$$

Independence and mean independence:

$$E[y] = E[y|x], \qquad E[(y - E[y|x])h(x)] = 0, \qquad E[xy] = E[x\,E[y|x]].$$

Eve's Law (EVVE) / Law of Total Variance:

$$V[y] = E[V[y|x]] + V[E[y|x]].$$
4 The Least Squares Estimator
4.1 What is the Relationship between Two Variables?
Political Connections and Firms
Firm profits increase with the degree of political connections.
4.2 The Econometric Model
Specification of a Linear Regression

$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_K x_{iK} + u_i,$$

where

- the dependent variable $y_i$ = profits of firm i
- the explanatory variables $x_{i1}, \ldots, x_{iK}$ = political connections and other firm characteristics
- $x_{i0} = 1$ is a constant
- $u_i$ is called the error term.

The linear regression model rests on the following assumptions:

LRM1: Linearity
LRM2: Simple Random Sampling
LRM3: Exogeneity
LRM4: Error Variance
LRM5: Identifiability
Data Generating Process: Linearity

LRM1: Linearity

The relationship between y and the regressors is linear in the parameters.

Anscombe's Quartet

Figure 1: All four sets are identical when examined using linear statistics, but very different when graphed. In each set, the correlation between x and y is 0.816 and the fitted linear regression is y = 3.00 + 0.50x.
Data Generating Process: Random Sampling

LRM2: Simple Random Sampling

$\{x_{i1}, \ldots, x_{iK}, y_i\}_{i=1}^{N}$ i.i.d. (independent and identically distributed)

The assumption is violated, for example, by systematic non-response or truncation.
Data Generating Process: Exogeneity

LRM3: Exogeneity

1. $u_i\,|\,x_{i1}, \ldots, x_{iK} \sim N(0, \sigma_i^2)$ (normality)
2. $u_i$ independent of $x_{i1}, \ldots, x_{iK}$ (independent)
3. $E[u_i\,|\,x_{i1}, \ldots, x_{iK}] = 0$ (mean independent)
4. $\mathrm{cov}(x_{ik}, u_i) = 0\ \forall k$ (uncorrelated)

LRM3a states that the error term is normally distributed conditional on the explanatory variables. LRM3b states that the error term is distributed independently of the explanatory variables. LRM3c states that the mean of the error term is independent of the explanatory variables. LRM3d means that the error term and the explanatory variables are uncorrelated. Exogeneity fails, for example, if an omitted variable is correlated with the variable of interest $x_{i1}$.
Data Generating Process: Error Variance

LRM4: Error Variance

1. $V[u_i\,|\,x_{i1}, \ldots, x_{iK}] = \sigma^2 < \infty$ (homoskedasticity)
2. $V[u_i\,|\,x_{i1}, \ldots, x_{iK}] = \sigma_i^2 = g(x_{i1}, \ldots, x_{iK}) < \infty$ (conditional heteroskedasticity)

LRM4a assumes a constant error variance. LRM4b allows the variance of the error term to depend on a function g of the explanatory variables.

Heteroskedasticity

[Figure: homoskedastic versus heteroskedastic errors; lost in extraction.]
Data Generating Process: Identifiability

LRM5: Identifiability

The regressors are not perfectly collinear, i.e., no variable is a linear combination of the others, and all regressors (but the constant) have strictly positive variance both in expectation and in the sample.

Figure 5: The number of red and blue dots is the same. Using which would you get a more accurate regression line?
4.3 Estimation with OLS
Ordinary least squares (OLS) minimizes the squared distances (SD) between the observed values $y_i$ and the regression line:

$$\min_{\beta_0, \ldots, \beta_K} SD(\beta_0, \ldots, \beta_K), \qquad\text{where}\qquad SD = \sum_{i=1}^{N}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \ldots + \beta_K x_{iK})\right]^2.$$
Invention of OLS

[Quotation from a letter of 1827 on the priority dispute over the discovery of least squares (Plackett, 1972), shown with the portrait by Boilly (1820), the only existing portrait; lost in extraction.]
Invention of OLS

$$\hat\beta_0 = \bar y - \hat\beta_1\bar x, \qquad \hat\beta_1 = \frac{\sum_{i=1}^{N}(x_{i1} - \bar x)(y_i - \bar y)}{\sum_{i=1}^{N}(x_{i1} - \bar x)^2} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)},$$
where $R \equiv \mathrm{cov}(x, y)/(s_x s_y)$ is Pearson's correlation coefficient with $s_z$ denoting the standard deviation of z. Then

$$R = \frac{s_x}{s_y}\,\hat\beta_1 = \frac{\sqrt{\sum_{i=1}^{N}(\hat\beta_1 x_{i1} - \hat\beta_1\bar x)^2}}{\sqrt{\sum_{i=1}^{N}(y_i - \bar y)^2}} = \frac{\sqrt{\sum_{i=1}^{N}(\hat y_i - \bar y)^2}}{\sqrt{\sum_{i=1}^{N}(y_i - \bar y)^2}}.$$

Squaring gives

$$R^2 = \frac{\sum_{i=1}^{N}(\hat y_i - \bar y)^2}{\sum_{i=1}^{N}(y_i - \bar y)^2} = 1 - \frac{\sum_{i=1}^{N}\hat u_i^2}{\sum_{i=1}^{N}(y_i - \bar y)^2}.$$

In matrix notation, with dimensions (K+1)×1, (K+1)×(K+1), (K+1)×N, and N×1,

$$\hat\beta = (X'X)^{-1}X'y.$$

Adjusted R²:

$$\bar R^2 = 1 - \frac{N-1}{N-K-1}\,\frac{\sum_{i=1}^{N}\hat u_i^2}{\sum_{i=1}^{N}(y_i - \bar y)^2}.$$
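A minimal Python sketch of the matrix formulas above on simulated data; the data-generating process, sample size, and coefficient values are illustrative assumptions, not taken from the text.

```python
# OLS via beta_hat = (X'X)^{-1} X'y, plus R^2 and adjusted R^2.
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])   # constant + K regressors
beta = np.array([2.0, 0.5, -1.0])
y = X @ beta + rng.normal(size=N)

b = np.linalg.solve(X.T @ X, X.T @ y)                        # OLS coefficients
u = y - X @ b                                                # residuals
tss = np.sum((y - y.mean())**2)
R2 = 1 - (u @ u) / tss
R2_adj = 1 - (N - 1) / (N - K - 1) * (u @ u) / tss
print(b, R2, R2_adj)
```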
Figure 8: Scatter cloud visualized with GRAPH3D for Stata.
4.4 Properties of the OLS Estimator in the Small and in the Large

Properties of the OLS Estimator

Small sample properties of β̂:

unbiased
normally distributed
efficient

Large sample properties of β̂:

consistent
approximately normal
asymptotically efficient
Small Sample Properties

Figure 10: What is a small sample? Source: Familien-Duell, Grundy Light Entertainment.

Figure 11: What is a small sample? (Wooldridge, 2009, p. 755): But large sample approximations have been known to work well for sample sizes as small as N = 20. Source: Familien-Duell, Grundy Light Entertainment.
Unbiasedness and Normality of β̂k

Assuming LRM1, LRM2, LRM3a, LRM4, and LRM5, β̂k is normally distributed with

$$E(\hat\beta_k\,|\,x_{11}, \ldots, x_{NK}) = \beta_k$$

and, in the simple regression, estimated variance

$$\widehat V = \frac{\hat\sigma^2}{\sum_{i=1}^{N}(x_i - \bar x)^2} \qquad\text{with}\qquad \hat\sigma^2 = \frac{\sum_{i=1}^{N}\hat u_i^2}{N - K - 1}.$$

β̂k is the BLUE (best linear unbiased estimator; non-linear least squares, e.g., can be biased).
Unbiasedness

The OLS estimator of β is unbiased.

Plug y = Xβ + u into the formula for β̂ and then use the law of iterated expectations to first take the expectation with respect to u conditional on X and then take the unconditional expectation:

$$E[\hat\beta] = E_{X,u}\big[(X'X)^{-1}X'(X\beta + u)\big] = \beta + E_{X,u}\big[(X'X)^{-1}X'u\big] = \beta + E_X\big[E_{u|X}\big[(X'X)^{-1}X'u\,|\,X\big]\big] = \beta + E_X\big[(X'X)^{-1}X'\,E_{u|X}[u|X]\big] = \beta,$$

where the last step uses $E_{u|X}[u|X] = 0$.
Variance

The OLS estimator β̂ has variance $V(\hat\beta\,|\,x_{11}, \ldots, x_{NK}) = \sigma^2(X'X)^{-1}$.

Let σ²I denote the covariance matrix of u. Then

$$E\big[(\hat\beta - \beta)(\hat\beta - \beta)'\big] = E\big[((X'X)^{-1}X'u)((X'X)^{-1}X'u)'\big] = E\big[(X'X)^{-1}X'uu'X(X'X)^{-1}\big] = E\big[(X'X)^{-1}X'\sigma^2 X(X'X)^{-1}\big] = E\big[\sigma^2(X'X)^{-1}X'X(X'X)^{-1}\big] = \sigma^2(X'X)^{-1},$$

where we used the fact that β̂ − β is just an affine transformation of u by the matrix $(X'X)^{-1}X'$.

For the simple regression with $x_i = (1, x_i)'$,

$$\sigma^2(X'X)^{-1} = \sigma^2\left(\sum_i x_i x_i'\right)^{-1} = \sigma^2\left(\sum_i (1, x_i)'(1, x_i)\right)^{-1} = \sigma^2\begin{pmatrix}N & \sum_i x_i\\ \sum_i x_i & \sum_i x_i^2\end{pmatrix}^{-1} = \sigma^2\,\frac{1}{N\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\begin{pmatrix}\sum_i x_i^2 & -\sum_i x_i\\ -\sum_i x_i & N\end{pmatrix} = \sigma^2\,\frac{1}{N\sum_{i=1}^{N}(x_i - \bar x)^2}\begin{pmatrix}\sum_i x_i^2 & -\sum_i x_i\\ -\sum_i x_i & N\end{pmatrix},$$

so that

$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{N}(x_i - \bar x)^2}.$$
Example

Suppose the data generating process is

$$y_i = \beta_0 + \beta_1 x_{i1} + u_i, \qquad \beta_0 = 2.00, \quad \beta_1 = 0.5, \quad u_i \sim N(0.00, 1.00).$$

Try it yourself: simulate samples of size N = 3, N = 5, N = 10, N = 25, N = 100, N = 1000 and compare the sampling distributions of β̂1.
How to Establish Asymptotic Properties of β̂k ?
Law of Large Numbers
As N increases, the distribution of β̂k becomes more tightly centered around βk .
Central Limit Theorem

As N increases, the distribution of β̂k becomes normal (starting from a t-distribution).

Consistency and asymptotic normality can be established using the law of large numbers and the central limit theorem for large samples:

$$\mathrm{plim}\,\hat\beta_k = \beta_k.$$
(Avar means asymptotic variance.)

Under heteroskedasticity (LRM4b), the asymptotic variance of β̂1 is estimated with the robust Eicker-Huber-White (or sandwich) estimator

$$\widehat{\mathrm{Avar}}(\hat\beta_1) = \frac{\sum_{i=1}^{N}\hat u_i^2 (x_{i1} - \bar x)^2}{\left[\sum_{i=1}^{N}(x_{i1} - \bar x)^2\right]^2}.$$

Note: In practice we can almost never be sure that the errors are homoskedastic, and the robust estimator is therefore typically used.
The estimator β̂ can be written as

$$\hat\beta = \left(\tfrac{1}{N}X'X\right)^{-1}\tfrac{1}{N}X'y = \beta + \left(\tfrac{1}{N}X'X\right)^{-1}\tfrac{1}{N}X'u = \beta + \left(\tfrac{1}{N}\sum_{i=1}^{N} x_i x_i'\right)^{-1}\tfrac{1}{N}\sum_{i=1}^{N} x_i u_i.$$

We can use the law of large numbers to establish that

$$\tfrac{1}{N}\sum_{i=1}^{N} x_i x_i' \xrightarrow{p} E[x_i x_i'] = Q_{xx}, \qquad \tfrac{1}{N}\sum_{i=1}^{N} x_i u_i \xrightarrow{p} E[x_i u_i] = 0.$$

By Slutsky's theorem and the continuous mapping theorem these results can be combined to establish consistency of the estimator β̂:

$$\hat\beta \xrightarrow{p} \beta + Q_{xx}^{-1}\cdot 0 = \beta.$$

The central limit theorem tells us that

$$\tfrac{1}{\sqrt N}\sum_{i=1}^{N} x_i u_i \xrightarrow{d} N\big[0, V\big], \qquad\text{where}\quad V = \mathrm{Var}[x_i u_i] = E[u_i^2 x_i x_i'] = E\big[E[u_i^2|x_i]\,x_i x_i'\big] = \sigma^2 Q_{xx}.$$

Applying Slutsky's theorem again, we have

$$\sqrt N(\hat\beta - \beta) = \left(\tfrac{1}{N}\sum_{i=1}^{N} x_i x_i'\right)^{-1}\tfrac{1}{\sqrt N}\sum_{i=1}^{N} x_i u_i \xrightarrow{d} Q_{xx}^{-1}\,N\big[0, \sigma^2 Q_{xx}\big] = N\big[0, \sigma^2 Q_{xx}^{-1}\big],$$

so that β̂ is approximately distributed as $N\big[\beta,\ \sigma^2 Q_{xx}^{-1}/N\big]$ in large samples.
LRM1: linearity              fulfilled
LRM5: identifiability        fulfilled
LRM4: error variance
- LRM4a: homoskedastic       ✓ ✓ ✓ × × ×
- LRM4b: heteroskedastic     × × × ✓ ✓ ✓
LRM3: exogeneity
- LRM3a: normality           ✓ × × ✓ × ×
- LRM3b: independent         ✓ ✓ × × × ×
- LRM3c: mean indep.         ✓ ✓ ✓ ✓ ✓ ×
- LRM3d: uncorrelated        ✓ ✓ ✓ ✓ ✓ ✓
Tests in Small Samples I
Assume LRM1, LRM2, LRM3a, LRM4a, and LRM5. A simple null hypothesis of the form H0: βk = q is tested with the t-test. If the null hypothesis is true, the t-statistic

$$t = \frac{\hat\beta_k - q}{\widehat{se}(\hat\beta_k)} \sim t_{N-K-1}$$

follows a t-distribution with N − K − 1 degrees of freedom. The standard error is $\widehat{se}(\hat\beta_k) = \sqrt{\widehat V(\hat\beta_k)}$.

A joint null hypothesis of J linear restrictions, H0: Rβ = q, is tested with the F-test. If the null hypothesis is true, the F-statistic

$$F = \frac{(R\hat\beta - q)'\big[R\widehat V(\hat\beta|X)R'\big]^{-1}(R\hat\beta - q)}{J} \sim F_{J,\,N-K-1},$$

or, equivalently,

$$F = \frac{(R^2 - R^2_{\text{restricted}})/J}{(1 - R^2)/(N - K - 1)} \sim F_{J,\,N-K-1},$$

where $R^2_{\text{restricted}}$ is estimated by restricted least squares, which minimizes SD(β) subject to $r_{j1}\beta_1 + \ldots + r_{jK}\beta_K = q_j$ for all j.

Exclusionary restrictions of the form H0: βk = 0, βm = 0, ... are a special case of H0: $r_{j1}\beta_1 + \ldots + r_{jK}\beta_K = q_j$ for all j. In this case, restricted least squares is simply OLS on the model that excludes the restricted regressors.

If the F distribution has 1 numerator degree of freedom and N − K − 1 denominator degrees of freedom, then it can be shown that $t^2 = F(1, N - K - 1)$.

The (1 − α) confidence interval is

$$\Big[\hat\beta_k - t_{(1-\alpha/2),(N-K-1)}\,\widehat{se}(\hat\beta_k),\ \hat\beta_k + t_{(1-\alpha/2),(N-K-1)}\,\widehat{se}(\hat\beta_k)\Big].$$
[Illustration of type I and type II errors with a court analogy (Convicted!/Freed!); lost in extraction.]
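A small sketch applying the small-sample formulas above to simulated data; the data and seed are illustrative assumptions, only the formulas follow the text.

```python
# t-statistic and 95% confidence interval for beta_1_hat in a simple regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N, K = 30, 1
x = rng.normal(size=N)
y = 2.0 + 0.5 * x + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ b
sigma2 = (u @ u) / (N - K - 1)
se_b1 = np.sqrt(sigma2 / np.sum((x - x.mean())**2))

t_stat = (b[1] - 0) / se_b1                      # H0: beta_1 = 0
t_crit = stats.t.ppf(0.975, df=N - K - 1)
print(t_stat, (b[1] - t_crit * se_b1, b[1] + t_crit * se_b1))
```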
Asymptotic Tests
Assume LRM1, LRM2, LRM3d, LRM4a or LRM4b, and LRM5. A simple null hypothesis of the form H0: βk = q is tested with the z-test. If the null hypothesis is true, the z-statistic

$$z = \frac{\hat\beta_k - q}{\widehat{se}(\hat\beta_k)} \overset{A}{\sim} N(0, 1)$$

follows approximately the standard normal distribution. The standard error is $\widehat{se}(\hat\beta_k) = \sqrt{\widehat{\mathrm{Avar}}(\hat\beta_k)}$.

For example, to perform a two-sided test of H0 against the alternative hypothesis H1: βk ≠ q, reject H0 at significance level α if $|z| > z_{(1-\alpha/2)}$. The (1 − α) confidence interval is

$$\Big[\hat\beta_k - z_{(1-\alpha/2)}\,\widehat{se}(\hat\beta_k),\ \hat\beta_k + z_{(1-\alpha/2)}\,\widehat{se}(\hat\beta_k)\Big].$$

For example, the 95% confidence interval is $\Big[\hat\beta_k - 1.96\,\widehat{se}(\hat\beta_k),\ \hat\beta_k + 1.96\,\widehat{se}(\hat\beta_k)\Big]$.
OLS Properties in the Small and in the Large

LRM1: linearity              fulfilled
LRM5: identifiability        fulfilled
LRM4: error variance
- LRM4a: homoskedastic       ✓ ✓ ✓ × × ×
- LRM4b: heteroskedastic     × × × ✓ ✓ ✓
LRM3: exogeneity
- LRM3a: normality           ✓ × × ✓ × ×
- LRM3b: independent         ✓ ✓ × × × ×
- LRM3c: mean indep.         ✓ ✓ ✓ ✓ ✓ ×
- LRM3d: uncorrelated        ✓ ✓ ✓ ✓ ✓ ✓
4.5 Politically Connected Firms: Causality or Correlation?
Arguments For Causality of Effect

Simultaneous causality: profits may be higher because of political connections, or firms with higher profits may find it easier to acquire political connections.
5 Simplifying Linear Regressions using Frisch-Waugh-Lovell

5.1 Frisch-Waugh-Lovell theorem in equation algebra

From the multivariate to the bivariate regression

Regress $y_i$ on two explanatory variables, where $x_{2i}$ is the variable of interest and $x_{1i}$ (or a set of control variables) is partialled out:

$$y_i = \beta_0 + \beta_2 x_{2i} + \beta_1 x_{1i} + \varepsilon_i. \qquad (2)$$

We can obtain exactly the same coefficient and residuals from a regression on the detrended variables,

$$\tilde y_i = \beta_0 + \beta_2\tilde x_{2i} + \varepsilon_i,$$

and we can obtain exactly the same coefficient and residuals from a regression of two residualized variables,

$$\varepsilon^y_i = \beta_2\varepsilon^2_i + \varepsilon_i.$$

This is useful to simplify the analysis of large models and absorbs fixed effects to reduce computation time (see reghdfe for Stata).
First regress $y_i$ and $x_{2i}$ on the control $x_{1i}$ using OLS,

$$y_i = \delta_0 + \delta_1 x_{1i} + \varepsilon^y_i, \qquad x_{2i} = \gamma_0 + \gamma_1 x_{1i} + \varepsilon^2_i;$$

this implies $\mathrm{Cov}(x_{1i}, \varepsilon^2_i) = 0$ and $\mathrm{Cov}(x_{1i}, \varepsilon^y_i) = 0$. The detrended variables keep the constants,

$$\tilde x_{2i} = \gamma_0 + \varepsilon^2_i, \qquad \tilde y_i = \delta_0 + \varepsilon^y_i.$$

Finally, regress $\tilde y_i$ on $\tilde x_{2i}$.
Decomposition theorem

For the regression

$$y_i = \beta_0 + \beta_2 x_{2i} + \beta_1 x_{1i} + \varepsilon_i,$$

the same regression coefficients will be obtained with any non-empty subset of the explanatory variables once the remaining ones are partialled out, and examining either set of residuals will convey precisely the same information about the relationship of interest.

Detrended variables

Show that β2 is unchanged: plug the auxiliary regressions $y_i = \delta_0 + \delta_1 x_{1i} + \varepsilon^y_i$ and $x_{2i} = \gamma_0 + \gamma_1 x_{1i} + \varepsilon^2_i$ into equation (2).
Because we partialled out $x_{1i}$ using OLS, $x_{1i}$ is mechanically uncorrelated with $\varepsilon^2_i$ and with $\varepsilon^y_i$. Therefore, the regression coefficient $(\beta_2\gamma_1 - \delta_1 + \beta_1)$ on the partialled-out variable $x_{1i}$ is zero. The equation simplifies with $\tilde x_{2i} = \gamma_0 + \varepsilon^2_i$ to

$$\tilde y_i = \beta_0 + \beta_2\tilde x_{2i} + \varepsilon_i.$$

Regression anatomy: detrending only $x_{2i}$ and not $y_i$ also recovers β2; the regression constant and the residuals, however, differ from those of the full regression.

Residualized variables

The same result of the FWL theorem holds as well for a regression of the residualized variables, because $\beta_0 = \delta_0 - \beta_2\gamma_0$:

$$\varepsilon^y_i = \beta_2\varepsilon^2_i + \varepsilon_i.$$

In matrix algebra, the fitted regression decomposes y into fitted values and residuals,

$$y = \hat y + e = Xb + e = Py + My.$$
Projection matrix

$$Py = Xb = X(X'X)^{-1}X'y \quad\Rightarrow\quad P = X(X'X)^{-1}X'.$$

Properties: P is symmetric and idempotent, and PX = X, so P projects onto the column space of X.
Project y on the column space of X, i.e., regress y on x and predict $E[y] = \hat y$:

$$y = \begin{pmatrix}1\\2\\3\end{pmatrix}, \qquad Py = \begin{pmatrix}1/2 & 0 & 1/2\\ 0 & 1 & 0\\ 1/2 & 0 & 1/2\end{pmatrix}\begin{pmatrix}1\\2\\3\end{pmatrix} = \hat y = \begin{pmatrix}2\\2\\2\end{pmatrix}.$$
Residual maker (annihilator) matrix

$$My = e = y - Xb = y - X(X'X)^{-1}X'y = \big(I - X(X'X)^{-1}X'\big)y \quad\Rightarrow\quad M = I - X(X'X)^{-1}X' = I - P.$$

Properties: M is symmetric and idempotent, MX = 0 (annihilator matrix), and M is orthogonal to P: PM = MP = 0.

Example (continued):

$$M = I - P = \begin{pmatrix}1&0&0\\0&1&0\\0&0&1\end{pmatrix} - \begin{pmatrix}1/2&0&1/2\\0&1&0\\1/2&0&1/2\end{pmatrix} = \begin{pmatrix}1/2&0&-1/2\\0&0&0\\-1/2&0&1/2\end{pmatrix},$$

$$MX = \begin{pmatrix}1/2&0&-1/2\\0&0&0\\-1/2&0&1/2\end{pmatrix}\begin{pmatrix}1&0\\1&1\\1&0\end{pmatrix} = \begin{pmatrix}0&0\\0&0\\0&0\end{pmatrix} = 0.$$

The residuals e = My lie in the space orthogonal to the column space of X.
Decomposing the normal equations
The normal equations in matrix form are $X'Xb = X'y$. If X is partitioned into $[X_1, X_2]$, they can be written as

$$\begin{pmatrix}X_1'X_1 & X_1'X_2\\ X_2'X_1 & X_2'X_2\end{pmatrix}\begin{pmatrix}b_1\\ b_2\end{pmatrix} = \begin{pmatrix}X_1'y\\ X_2'y\end{pmatrix},$$

or block by block as

$$\begin{bmatrix}X_1'X_1 & X_1'X_2\end{bmatrix}\begin{pmatrix}b_1\\ b_2\end{pmatrix} = X_1'y \qquad (3)$$

$$\begin{bmatrix}X_2'X_1 & X_2'X_2\end{bmatrix}\begin{pmatrix}b_1\\ b_2\end{pmatrix} = X_2'y. \qquad (4)$$

Solving for b₂

Idea: solve equation (3) for $b_1$ in terms of $b_2$, then substitute that solution into equation (4).

From (3):

$$X_1'X_1 b_1 + X_1'X_2 b_2 = X_1'y$$
$$X_1'X_1 b_1 = X_1'y - X_1'X_2 b_2$$
$$b_1 = (X_1'X_1)^{-1}X_1'y - (X_1'X_1)^{-1}X_1'X_2 b_2 = (X_1'X_1)^{-1}X_1'(y - X_2 b_2).$$

From (4):

$$X_2'X_1 b_1 + X_2'X_2 b_2 = X_2'y$$
$$X_2'X_1(X_1'X_1)^{-1}X_1'(y - X_2 b_2) + X_2'X_2 b_2 = X_2'y.$$
$$X_2'X_1(X_1'X_1)^{-1}X_1'(y - X_2 b_2) + X_2'X_2 b_2 = X_2'y.$$

The middle part of the first term is $X_1(X_1'X_1)^{-1}X_1'$. This is the projection matrix $P_{X_1}$ from a regression of y on $X_1$. Replacing it with $I - M_{X_1}$ and collecting terms gives

$$X_2'M_{X_1}X_2\,b_2 = X_2'M_{X_1}y \quad\Rightarrow\quad b_2 = (X_2'M_{X_1}X_2)^{-1}X_2'M_{X_1}y = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'\tilde y,$$

where $\tilde X_2 = M_{X_1}X_2$ and $\tilde y = M_{X_1}y$. The residualizer matrix is symmetric and idempotent, such that $M_{X_1} = M_{X_1}M_{X_1} = M_{X_1}'M_{X_1}$.

This is the OLS solution for $b_2$, with $\tilde X_2$ instead of X and $\tilde y$ instead of y. The solution for the regression coefficients $b_2$ in a regression that includes other regressors $X_1$ is the same as first regressing all of $X_2$ and y on $X_1$, then regressing the residuals from the y regression on the residuals from the $X_2$ regression.
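A minimal numerical check of the FWL result above; the simulated design (regressors, coefficients, seed) is an illustrative assumption.

```python
# b2 from the full regression equals the coefficient from regressing the
# residualized y on the residualized X2 (both residualized on X1).
import numpy as np

rng = np.random.default_rng(3)
N = 100
X1 = np.column_stack([np.ones(N), rng.normal(size=N)])     # constant + control
X2 = rng.normal(size=(N, 1)) + 0.5 * X1[:, [1]]            # correlated with the control
y = X1 @ np.array([1.0, 2.0]) + 0.7 * X2[:, 0] + rng.normal(size=N)

X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)

M1 = np.eye(N) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)     # residual maker M_X1
X2t, yt = M1 @ X2, M1 @ y
b2 = np.linalg.solve(X2t.T @ X2t, X2t.T @ yt)
print(b_full[-1], b2[0])                                   # identical up to rounding
```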
6 The Maximum Likelihood Estimator
6.1 From Probability to Likelihood
The Likelihood Principle
Suppose you have three credit cards. You forgot which of them have money on them. Thus, the number of credit cards with money, call it θ, might be 0, 1, 2, or 3. You can try your cards one at a time, drawn uniformly at random, and record

$$y_i = \begin{cases}1, & \text{if the } i\text{th draw is a card with money on it,}\\ 0, & \text{otherwise.}\end{cases}$$

Since you chose the $y_i$'s uniformly, they are i.i.d. and $y_i \sim \mathrm{Bernoulli}(\theta/3)$. After checking, we find $y_1 = 1, y_2 = 0, y_3 = 1, y_4 = 1$. We observe 3 cards with money and 1 without.

$$\mathrm{Prob}(y_i = y) = \begin{cases}\theta/3, & \text{for } y = 1,\\ 1 - \theta/3, & \text{for } y = 0.\end{cases}$$

Since the $y_i$'s are independent, the joint pmf of $y_1, y_2, y_3$, and $y_4$ can be written as

$$\mathrm{Prob}(y_1, y_2, y_3, y_4\,|\,\theta) = \mathrm{Prob}(y_1)\,\mathrm{Prob}(y_2)\,\mathrm{Prob}(y_3)\,\mathrm{Prob}(y_4),$$

so the likelihood of the observed sample is

$$L(\theta|y) = \mathrm{Prob}(y_1 = 1, y_2 = 0, y_3 = 1, y_4 = 1\,|\,\theta) = \frac{\theta}{3}\Big(1 - \frac{\theta}{3}\Big)\frac{\theta}{3}\,\frac{\theta}{3} = \Big(\frac{\theta}{3}\Big)^3\Big(1 - \frac{\theta}{3}\Big).$$
Trial    1    2    3    4
y_i      1    0    1    1

Evaluating the likelihood at the candidate values θ = 0, 1, 2, 3 gives L(θ|y) = 0, 2/81, 8/81, 0. The likelihood is zero at θ = 0 and θ = 3 because our sample included both cards with and without money. The observed data are most likely for θ = 2.

Likelihood principle: choose the θ that maximizes the likelihood of observing the actual sample to get an estimator for θ₀. The likelihood is the joint probability or density f(y, X|θ), viewed as a function of the vector θ given the data (y, X).
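A few lines of Python evaluating the likelihood of the credit-card example above at each candidate θ; the numbers reproduce the arithmetic in the text.

```python
# Likelihood of y = (1, 0, 1, 1) for theta in {0, 1, 2, 3} with p = theta/3.
theta_values = [0, 1, 2, 3]
y = [1, 0, 1, 1]

for theta in theta_values:
    p = theta / 3
    L = 1.0
    for yi in y:
        L *= p if yi == 1 else (1 - p)
    print(theta, L)    # 0, 2/81 ~ 0.0247, 8/81 ~ 0.0988, 0 -> argmax at theta = 2
```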
Model         Range of y      Density f(y)                                 Common Parametrization
Bernoulli     0 or 1          $p^y(1-p)^{1-y}$                             $p = \frac{e^{x'\beta}}{1 + e^{x'\beta}}$
Poisson       0, 1, 2, ...    $e^{-\lambda}\lambda^y/y!$                   $\lambda = e^{x'\beta}$
Exponential   (0, ∞)          $\lambda e^{-\lambda y}$                     $\lambda = e^{x'\beta}$ or $1/\lambda = e^{x'\beta}$
Normal        (−∞, ∞)         $(2\pi\sigma^2)^{-1/2} e^{-(y-\mu)^2/2\sigma^2}$   $\mu = x'\beta$, $\sigma^2 = \sigma^2$

For observations $(y_i, x_i)$ independent over i and distributed with f(y|X, θ), the joint density is

$$f(y|X, \theta) = \prod_{i=1}^{N} f(y_i|x_i, \theta),$$

and the average log-likelihood is

$$\frac{1}{N}L_N(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ln f(y_i|x_i, \theta).$$
The maximum likelihood estimator (MLE) is the estimator that maximizes the (conditional) log-likelihood function $L_N(\theta)$. The MLE is the local maximum that solves the first-order conditions

$$\frac{1}{N}\frac{\partial L_N(\theta)}{\partial\theta} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial\ln f(y_i|x_i, \theta)}{\partial\theta} = 0.$$

The derivative $\partial\ln f(y_i|x_i, \theta)/\partial\theta$ is called the score or gradient of the log density, and when evaluated at θ₀ it is called the efficient score.

As for OLS, assume simple random sampling: $\{x_{i1}, \ldots, x_{iK}, y_i\}_{i=1}^{N}$ i.i.d. (independent and identically distributed), with no systematic non-response or truncation.
I.i.d. data simplify the maximization, as the joint density of two observations is simply the product of the marginal densities. With dependence, one would instead have to work with joint densities such as

$$\frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}}\, e^{-\frac{1}{2\sigma^2(1-\rho^2)}\left[(y_1-\mu)^2 + (y_2-\mu)^2 - 2\rho(y_1-\mu)(y_2-\mu)\right]}.$$

A key property of the score is that its expectation at the true parameters is zero:

$$E_f\big[g(\theta)\big] = E_f\left[\frac{\partial\ln f(y|x,\theta)}{\partial\theta}\right] = \int\frac{\partial\ln f(y|x,\theta)}{\partial\theta}\,f(y|x,\theta)\,dy = 0.$$
Example

Differentiate the identity $\int f(y|\theta)\,dy = 1$ with respect to θ:

$$\frac{\partial}{\partial\theta}\int f(y|\theta)\,dy = \int\frac{\partial f(y|\theta)}{\partial\theta}\,dy = 0.$$

Since

$$\frac{\partial f(y|\theta)}{\partial\theta} = \frac{\partial\ln f(y|\theta)}{\partial\theta}\,f(y|\theta),$$

it follows that

$$\int\frac{\partial\ln f(y|\theta)}{\partial\theta}\,f(y|\theta)\,dy = 0.$$
Fisher Information

The information matrix is the expectation of the outer product of the score vector,

$$\mathcal{I} = E_f\left[\frac{\partial\ln f(y|x,\theta)}{\partial\theta}\,\frac{\partial\ln f(y|x,\theta)}{\partial\theta'}\right].$$

The Fisher information $\mathcal{I}$ equals the variance of the score, since $\partial L_N(\theta)/\partial\theta$ has mean zero.

Large values of $\mathcal{I}$ mean that small changes in θ lead to large changes in the log-likelihood, so the maximum is sharply defined. Small values of $\mathcal{I}$ mean that the maximum is shallow and there are many nearby values of θ with almost the same log-likelihood.
Example

For a vector moment function, e.g., $m(y, \theta) = \frac{\partial\ln f(y|\theta)}{\partial\theta}$ with $E[m(y, \theta)] = 0$,

$$\int m(y, \theta)\,f(y|\theta)\,dy = 0.$$

Differentiating with respect to θ' gives

$$\int\left[\frac{\partial m(y,\theta)}{\partial\theta'}\,f(y|\theta) + m(y,\theta)\,\frac{\partial f(y|\theta)}{\partial\theta'}\right]dy = 0,$$

$$\int\left[\frac{\partial m(y,\theta)}{\partial\theta'}\,f(y|\theta) + m(y,\theta)\,\frac{\partial\ln f(y|\theta)}{\partial\theta'}\,f(y|\theta)\right]dy = 0,$$

so that

$$E\left[\frac{\partial m(y,\theta)}{\partial\theta'}\right] + E\left[m(y,\theta)\,\frac{\partial\ln f(y|\theta)}{\partial\theta'}\right] = 0;$$

the expected Hessian of the log-likelihood equals minus the expected outer product of the score (information matrix equality).

After taking the expected value, θ̂ is substituted for θ. Problem: taking the expected value of the second derivatives is often not possible analytically.
There exist two alternatives which are asymptotically equivalent.

One alternative is to take the negative of the second derivatives of the log-likelihood, evaluated at θ̂:

$$\hat{\mathcal{I}}(\hat\theta) = -\frac{\partial^2\ln L}{\partial\hat\theta\,\partial\hat\theta'}.$$

Berndt-Hall-Hall-Hausman (BHHH) algorithm: never take a second derivative and sum over the outer product of the (first-derivative) scores,

$$\check{\mathcal{I}}(\hat\theta) = \sum_{i=1}^{n}\hat g_i\hat g_i' = \sum_{i=1}^{n}\frac{\partial\ln f(y_i, \hat\theta)}{\partial\hat\theta}\,\frac{\partial\ln f(y_i, \hat\theta)}{\partial\hat\theta'}.$$

Large sample properties of θ̂:

consistent
approximately normal
asymptotically efficient
invariant
Consistency
Law of Large Numbers
As N increases, the distribution of θ̂ becomes more tightly centered around θ.
Likelihood Inequality
The expected value of the log-likelihood is maximized at the true value of the parameters.
Approximate Normality
Central Limit Theorem
As N becomes large,
$$\hat\theta \overset{a}{\sim} N\left[\theta,\ \left(-E\left[\frac{\partial^2 L_N(\theta)}{\partial\theta\,\partial\theta'}\right]\right)^{-1}\right].$$

Figure 16: Sampling distribution of θ̂ drawn from a Bernoulli distribution, together with the normal approximation, at N = 100. True θ = 0.6.

Efficiency

The precision of the estimate θ̂ is limited by the Fisher information $\mathcal{I}$ of the likelihood,

$$\mathrm{Var}(\hat\theta) \geq \frac{1}{\mathcal{I}(\theta)}.$$

For large samples, this is the so-called Cramér-Rao lower bound for the variance matrix of consistent asymptotically normal estimators, with convergence to normality of $\sqrt N(\hat\theta - \theta_0)$ uniform in compact intervals of θ₀. Under the strong assumption of correct specification of the conditional density, the MLE has the smallest asymptotic variance among root-N consistent estimators.
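A minimal simulation in the spirit of Figure 16; the number of replications and the seed are illustrative assumptions, while θ = 0.6 and N = 100 follow the figure caption.

```python
# Sampling distribution of the Bernoulli MLE (the sample mean) at N = 100,
# true theta = 0.6; its variance is close to the Cramer-Rao bound theta(1-theta)/N.
import numpy as np

rng = np.random.default_rng(0)
theta, N, R = 0.6, 100, 5000

theta_hat = rng.binomial(N, theta, size=R) / N
print(theta_hat.mean(), theta_hat.var(), theta * (1 - theta) / N)   # ~0.6, ~0.0024, 0.0024
```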
Example
For an unbiased estimator θ̂,

$$E\big[\hat\theta - \theta\,\big|\,\theta\big] = \int(\hat\theta - \theta)\,f(y;\theta)\,dy = 0 \quad\text{regardless of the value of }\theta.$$

This expression is zero independent of θ, so its partial derivative with respect to θ must also be zero. By the product rule, this partial derivative is also equal to

$$0 = \frac{\partial}{\partial\theta}\int(\hat\theta - \theta)\,f(y;\theta)\,dy = \int(\hat\theta - \theta)\,\frac{\partial f}{\partial\theta}\,dy - \int f\,dy.$$

For each θ, the likelihood function is a probability density function, and therefore $\int f\,dy = 1$. By using the chain rule on the partial derivative of ln f,

$$\frac{\partial f}{\partial\theta} = f\,\frac{\partial\ln f}{\partial\theta}.$$

Using these two facts, we get

$$\int(\hat\theta - \theta)\,f\,\frac{\partial\ln f}{\partial\theta}\,dy = 1.$$

Factoring the integrand gives

$$\int\Big[(\hat\theta - \theta)\sqrt f\,\Big]\Big[\sqrt f\,\frac{\partial\ln f}{\partial\theta}\Big]dy = 1.$$

Squaring the expression in the integral, the Cauchy-Schwarz inequality yields

$$1 = \left(\int\Big[(\hat\theta - \theta)\sqrt f\,\Big]\Big[\sqrt f\,\frac{\partial\ln f}{\partial\theta}\Big]dy\right)^2 \leq \left[\int(\hat\theta - \theta)^2 f\,dy\right]\cdot\left[\int\Big(\frac{\partial\ln f}{\partial\theta}\Big)^2 f\,dy\right].$$

The first factor is the expected mean-squared error (the variance) of the estimator θ̂, and the second factor is the Fisher information, which yields the Cramér-Rao bound.
Invariance
The MLE of γ = c(θ) is $\hat\gamma = c(\hat\theta)$ if c(θ) is a continuous and continuously differentiable function. This greatly simplifies working with transformations of an MLE.
Example
Suppose that the normal log-likelihood is parameterized in terms of the precision parameter θ² = 1/σ²,

$$\ln L(\mu, \theta^2) = -\frac{N}{2}\ln(2\pi) + \frac{N}{2}\ln\theta^2 - \frac{\theta^2}{2}\sum_{i=1}^{N}(y_i - \mu)^2.$$

The MLE for µ is ȳ. But the likelihood equation for θ² is now

$$\frac{\partial\ln L(\mu, \theta^2)}{\partial\theta^2} = \frac{1}{2}\left[\frac{N}{\theta^2} - \sum_{i=1}^{N}(y_i - \mu)^2\right] = 0,$$

which has solution $\hat\theta^2 = N/\sum_{i=1}^{N}(y_i - \mu)^2 = 1/\hat\sigma^2$, as implied by invariance.
The MLE is also equivariant with respect to certain transformations of the data. If y = c(x), where c is one-to-one and does not depend on the parameters to be estimated, then the densities satisfy

$$f_Y(y) = \frac{f_X(x)}{|c'(x)|},$$

and hence the likelihood functions for x and y differ only by a factor that does not depend on the parameters. For example, the MLE parameters of the log-normal distribution are the same as those of the normal distribution fitted to the logarithm of the data.
7 The Generalized Method of Moments
7.1 How to choose from too many restrictions?
Minimize the quadratic form
The overidentified GMM estimator $\hat\theta_{GMM}(W_n)$ for K parameters in θ identified by L > K moment conditions is a function of the weighting matrix $W_n$ for a sample of i = 1, ..., n observations:

$$\hat\theta_{GMM}(W_n) = \arg\min_\theta\ q_n(\theta),$$

where the quadratic form $q_n(\theta) = \bar m_n(\theta)'\,W_n\,\bar m_n(\theta)$ is the criterion function and is given as a function of the sample moments

$$\bar m_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} m(X_i, Z_i, \theta).$$

The weighting matrix W is symmetric and positive definite (that is, x'Wx > 0 for all non-zero x).
A mean value expansion of the sample moments around the true parameters gives

$$\bar m_n(\hat\theta_{GMM}) \approx \bar m_n(\theta_0) + \bar G_n(\bar\theta)\,(\hat\theta_{GMM} - \theta_0),$$

where $\bar G_n(\bar\theta) = \frac{\partial\bar m_n(\bar\theta)}{\partial\bar\theta'}$ and θ̄ is a point between $\hat\theta_{GMM}$ and θ₀, introduced by the approximation.

Do the minimization

To minimize the quadratic form criterion, we take the first derivative of

$$q_n(\theta) = \bar m_n(\theta)'\,W\,\bar m_n(\theta)$$

and set it to zero,

$$\frac{\partial q_n(\hat\theta_{GMM})}{\partial\hat\theta_{GMM}} = 2\,\bar G_n(\hat\theta_{GMM})'\,W_n\,\bar m_n(\hat\theta_{GMM}) = 0,$$

to obtain, after substituting the mean value expansion,

$$\hat\theta_{GMM} - \theta_0 \approx -\big(\bar G_n(\hat\theta_{GMM})'W_n\bar G_n(\bar\theta)\big)^{-1}\bar G_n(\hat\theta_{GMM})'W_n\,\bar m_n(\theta_0).$$

So the estimate $\hat\theta_{GMM}$ is approximately the true parameter θ₀ plus a sampling error that depends on the sample moments evaluated at θ₀.
7.3 The econometric model
Three assumptions are needed: moment conditions and identification, a law of large numbers, and a central limit theorem.

GMM1: Moment Conditions and Identification

The population moment conditions $E[m(X_i, Z_i, \theta_0)] = 0$ hold at the true parameter θ₀. Identification implies that the probability limit of the GMM criterion function is uniquely minimized at θ = θ₀.

GMM2: A Law of Large Numbers Applies

$$\bar m_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} m(X_i, Z_i, \theta) \xrightarrow{p} E[m(X_i, Z_i, \theta)].$$

The data meet the conditions for a law of large numbers to apply, so that we may assume the sample moments converge to the population moments.

GMM3: A Central Limit Theorem Applies

$$\sqrt n\,\bar m_n(\theta_0) = \frac{\sqrt n}{n}\sum_{i=1}^{n} m(X_i, Z_i, \theta_0) \xrightarrow{d} N[0, \Phi].$$

The empirical moments obey a central limit theorem. This assumes that the moments have a finite covariance matrix Φ.

7.4 Consistency

Recall the useful approximation of the estimator:

$$\hat\theta_{GMM} - \theta_0 \approx -\big(\bar G_n(\hat\theta_{GMM})'W_n\bar G_n(\bar\theta)\big)^{-1}\bar G_n(\hat\theta_{GMM})'W_n\,\bar m_n(\theta_0).$$
Assumption GMM2 implies that

$$\bar m_n(\theta_0) = \frac{1}{n}\sum_{i=1}^{n} m(X_i, Z_i, \theta_0) \xrightarrow{p} E[m(X_i, Z_i, \theta_0)] = \bar m(\theta_0).$$

That is, the sample moment equals the population moment in probability. Assumption GMM1 implies that the population moment is zero at the true parameter,

$$\bar m(\theta_0) = 0.$$

Then

$$\bar m_n(\theta_0) \xrightarrow{p} \bar m(\theta_0) = 0,$$

such that

$$\hat\theta_{GMM} \xrightarrow{p} \theta_0 \quad\text{for } N \to \infty.$$

Rewrite the approximation to obtain

$$\sqrt n(\hat\theta_{GMM} - \theta_0) \approx -\big(\bar G_n(\hat\theta_{GMM})'W_n\bar G_n(\bar\theta)\big)^{-1}\bar G_n(\hat\theta_{GMM})'W_n\,\sqrt n\,\bar m_n(\theta_0).$$

The right-hand side has several parts for which we made assumptions on what happens as n grows large:

$$\sqrt n\,\bar m_n(\theta_0) \xrightarrow{d} N[0, \Phi], \qquad \mathrm{plim}\,W_n = W,$$

$$\mathrm{plim}\,\bar G_n(\hat\theta_{GMM}) = \mathrm{plim}\,\bar G_n(\bar\theta) = \mathrm{plim}\,\frac{\partial m(X_i, Z_i, \theta_0)}{\partial\theta_0'} = E\left[\frac{\partial\bar m(\theta_0)}{\partial\theta_0'}\right] = \Gamma(\theta_0).$$

With plim $W_n = W$ and these results,
the expression

$$\sqrt n(\hat\theta_{GMM} - \theta_0) \approx -\big(\bar G_n(\hat\theta_{GMM})'W_n\bar G_n(\bar\theta)\big)^{-1}\bar G_n(\hat\theta_{GMM})'W_n\,\sqrt n\,\bar m_n(\theta_0)$$

becomes

$$\sqrt n(\hat\theta_{GMM} - \theta_0) \approx -\big(\Gamma(\theta_0)'W\Gamma(\theta_0)\big)^{-1}\Gamma(\theta_0)'W\,\sqrt n\,\bar m_n(\theta_0),$$

so that

$$\sqrt n(\hat\theta_{GMM} - \theta_0) \xrightarrow{d} N[0, V]$$

with

$$V = \big(\Gamma(\theta_0)'W\Gamma(\theta_0)\big)^{-1}\Gamma(\theta_0)'W\,\Phi\,W\Gamma(\theta_0)\big(\Gamma(\theta_0)'W\Gamma(\theta_0)\big)^{-1}.$$

That is, by GMM1, GMM2, and GMM3 the GMM estimator $\hat\theta_{GMM}$ is asymptotically normal.

The variance of the GMM estimator V depends on the choice of W. So let us minimize V to get the optimal weight matrix. Try, from GMM3,

$$\operatorname*{plim}_{n\to\infty} W_n = W = \Phi^{-1}.$$

Then

$$V_{GMM,optimal} = \frac{1}{n}\big[\Gamma(\theta_0)'\Phi^{-1}\Gamma(\theta_0)\big]^{-1}\big[\Gamma(\theta_0)'\Phi^{-1}\Phi\Phi^{-1}\Gamma(\theta_0)\big]\big[\Gamma(\theta_0)'\Phi^{-1}\Gamma(\theta_0)\big]^{-1} = \frac{1}{n}\big[\Gamma(\theta_0)'\Phi^{-1}\Gamma(\theta_0)\big]^{-1}.$$

If Φ is small, there is little variation of this specific sample moment around zero and the moment condition is very informative about θ₀, so it is best to assign a high weight to it. If Γ is large, there is a large penalty from violating the moment condition by evaluating at θ ≠ θ₀. Then the moment condition is very informative about θ₀. V is inversely related to Γ.
Estimate the variance in practice
$$\widehat V_{GMM,optimal} = \frac{1}{n}\big[\bar G_n(\hat\theta)'\,\Phi_n^{-1}\,\bar G_n(\hat\theta)\big]^{-1},$$

with the consistent estimators

$$\Phi_n = N\,\widehat V\big(\bar m_n(\hat\theta)\big), \qquad \bar G_n(\hat\theta) = \frac{\partial\bar m_n(\hat\theta)}{\partial\hat\theta'}.$$
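A compact two-step GMM sketch, assuming a linear model with more instruments than parameters and the moment conditions $m_i(\theta) = z_i(y_i - x_i'\theta)$; the data-generating process, names, and seed are illustrative assumptions, not from the text.

```python
# Two-step GMM for a linear model with L = 3 instruments and K = 2 parameters.
import numpy as np

rng = np.random.default_rng(5)
n = 500
z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # instruments (L = 3)
v = rng.normal(size=n)
x = np.column_stack([np.ones(n), z[:, 1] + z[:, 2] + v])     # endogenous regressor (K = 2)
y = x @ np.array([1.0, 0.5]) + v + rng.normal(size=n)

def gmm(W):
    # Linear moments allow a closed-form solution of G'W m_bar(theta) = 0.
    G = -(z.T @ x) / n                                       # Jacobian of the sample moments
    return np.linalg.solve(G.T @ W @ G, -G.T @ W @ (z.T @ y) / n)

b1 = gmm(np.eye(3))                                          # step 1: identity weighting
u = y - x @ b1
m = z * u[:, None]
Phi = (m.T @ m) / n                                          # estimated moment covariance
b2 = gmm(np.linalg.inv(Phi))                                 # step 2: optimal W = Phi^{-1}
print(b1, b2)                                                # both near (1.0, 0.5)
```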
8 Conclusion
Congratulations! If you made it through this document, you are ready to read some econometrics papers, program and develop new estimators, and analyze statistical properties. If this caught your interest, check out non-parametric and Bayesian econometrics.
References

Angrist, J. D., and J.-S. Pischke (2009): Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Filoso, V. (2013): "Regression Anatomy, Revealed," The Stata Journal, 13(1), 92-106.

Frisch, R., and F. V. Waugh (1933): "Partial Time Regressions as Compared with Individual Trends," Econometrica, 1(4), 387-401.

Lovell, M. C. (2008): "A Simple Proof of the FWL Theorem," The Journal of Economic Education, 39(1), 88-91.

Plackett, R. L. (1972): "The Discovery of the Method of Least Squares," Biometrika, 59(2), 239-251.

Verbeek, M. (2012): A Guide to Modern Econometrics. John Wiley & Sons, 3rd edn.