
Econometricks: Short Guides to Econometrics

Davud Rostam-Afschar†

December 1, 2024

Abstract
Short guides to econometrics illustrate statistical methods and demonstrate how
they work in theory and practice, with many examples.

Keywords: Econometrics, Ordinary Least Squares, Maximum Likelihood, Generalized
Method of Moments, Probability Theory, Distribution Theory, Frisch-Waugh-Lovell,
Monte Carlo Simulation
JEL classification: A20, A23, C01, C10, C12, C13

* These guides were developed based on lectures delivered by Davud Rostam-Afschar at the University
of Mannheim. I am grateful for the valuable input provided by numerous cohorts of PhD students at
the Graduate School of Economic and Social Sciences, University of Mannheim. I thank the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation) for financial support through CRC TRR
266 Accounting for Transparency (Davud Rostam-Afschar, Project-ID 403041268). Replication files and
updates are available here.
† University of Mannheim, 68131 Mannheim, Germany; GLO; IZA; NeST (e-mail: rostam-afschar@uni-mannheim.de).
Contents
1 Review of Probability Theory 4
  1.1 Introduction 4
  1.2 Probability fundamentals 4
  1.3 Mean and variance 7
  1.4 Moments of a random variable 10
  1.5 Useful rules 15
2 Specific Distributions 17
  2.1 Normal distribution 20
  2.2 Method of transformations 21
  2.3 The χ² distribution 24
  2.4 The F-distribution 26
  2.5 The Student t-distribution 28
  2.6 The lognormal distribution 30
  2.7 The gamma distribution 32
  2.8 The beta distribution 34
  2.9 The logistic distribution 36
  2.10 The Wishart distribution 37
3 Review of Distribution Theory 38
  3.1 Joint and marginal bivariate distributions 38
  3.2 The joint density function 39
  3.3 The joint cumulative density function 40
  3.4 The marginal probability density 42
  3.5 Covariance and correlation 44
  3.6 The conditional density function 45
  3.7 Conditional mean aka regression 46
  3.8 The bivariate normal 49
  3.9 Useful rules 50
4 The Least Squares Estimator 52
  4.1 What is the Relationship between Two Variables? 52
  4.2 The Econometric Model 53
  4.3 Estimation with OLS 60
  4.4 Properties of the OLS Estimator in the Small and in the Large 66
  4.5 Politically Connected Firms: Causality or Correlation? 79
5 Simplifying Linear Regressions using Frisch-Waugh-Lovell 81
  5.1 Frisch-Waugh-Lovell theorem in equation algebra 81
  5.2 Projection and residual maker matrices 83
6 The Maximum Likelihood Estimator 90
  6.1 From Probability to Likelihood 90
  6.2 The Econometric Model 92
  6.3 Properties of the Maximum Likelihood Estimator 96
7 The Generalized Method of Moments 101
  7.1 How to choose from too many restrictions? 101
  7.2 Get the sampling error (at least approximately) 101
  7.3 The econometric model 104
  7.4 Consistency 104
  7.5 Asymptotic normality 105
  7.6 Asymptotic efficiency 106
8 Conclusion 109
References 110
1 Review of Probability Theory
1.1 Introduction
This guide takes a look under the hood of widely used methods in econometrics and
beyond. It focuses on Ordinary Least Squares, Maximum Likelihood, and the Generalized
Method of Moments, and shows when and why these methods work with simple examples.
The guide also provides an overview of the most important fundamentals of Probability
Theory and Distribution Theory on which these methods are based, and shows how to
analyze them with the Frisch-Waugh-Lovell decomposition and with Monte Carlo Simulation.

1.2 Probability fundamentals


Discrete and continuous random variables

Discrete Random Variable
A random variable X is discrete if the set of outcomes x is either finite or countably
infinite.

Continuous Random Variable
The random variable X is continuous if the set of outcomes x is infinitely divisible and,
hence, not countable.
Discrete probabilities
For values x of a discrete random variable X, the probability mass function (pmf) is

    f(x) = Prob(X = x).

The axioms of probability require

    0 ≤ Prob(X = x) ≤ 1,
    Σ_x f(x) = 1.

Discrete cumulative probabilities
For values x of a discrete random variable X, the cumulative distribution function is

    F(x) = Σ_{X ≤ x} f(x) = Prob(X ≤ x),

where

    f(x_i) = F(x_i) − F(x_{i−1}).

Example

Roll of a six-sided die

    x    f(x)         F(X ≤ x)
    1    f(1) = 1/6   F(X ≤ 1) = 1/6
    2    f(2) = 1/6   F(X ≤ 2) = 2/6
    3    f(3) = 1/6   F(X ≤ 3) = 3/6
    4    f(4) = 1/6   F(X ≤ 4) = 4/6
    5    f(5) = 1/6   F(X ≤ 5) = 5/6
    6    f(6) = 1/6   F(X ≤ 6) = 6/6

What's the probability that you roll a 5 or higher?

    Prob(X ≥ 5) = 1 − F(X ≤ 4) = 1 − 2/3 = 1/3.
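The die example can be checked numerically; the following minimal Python sketch (using numpy, not part of the original guide) tabulates the pmf and cdf and recovers Prob(X ≥ 5) = 1/3:

    import numpy as np

    x = np.arange(1, 7)
    pmf = np.full(6, 1 / 6)        # f(x) = 1/6 for x = 1, ..., 6
    cdf = np.cumsum(pmf)           # F(x) = Prob(X <= x)

    print(dict(zip(x.tolist(), cdf.round(3))))
    print(1 - cdf[3])              # Prob(X >= 5) = 1 - F(4) = 1/3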
Continuous probabilities
For values x of a continuous random variable X, the probability of any single point is
zero; probabilities are areas under the probability density function (pdf) f(x) ≥ 0. For
the range from a to b,

    Prob(a ≤ x ≤ b) = Prob(a < x < b) = ∫_a^b f(x) dx ≥ 0.

The axioms of probability require

    ∫_{−∞}^{+∞} f(x) dx = 1,

and f(x) = 0 outside the range of x.

The cumulative distribution function (cdf) is

    F(x) = ∫_{−∞}^x f(t) dt,    f(x) = dF(x)/dx.

Cumulative distribution function
For continuous and discrete variables, F(x) satisfies

Properties of the cdf
• 0 ≤ F(x) ≤ 1
• If x > y, then F(x) ≥ F(y)
• F(+∞) = 1
• F(−∞) = 0

and

    Prob(a < x ≤ b) = F(b) − F(a).

Symmetric distributions
For symmetric distributions

    f(µ − x) = f(µ + x)

and

    1 − F(x) = F(−x).
1.3 Mean and variance
Mean of a random variable (Discrete)
The mean, or expected value, of a discrete random variable is

    µ = E[x] = Σ_x x f(x).

Example

Roll of a six-sided die

    x        f(x) = 1/n    F(X ≤ x) = (x − a + 1)/n
    a = 1    f(1) = 1/6    F(X ≤ 1) = 1/6
    2        f(2) = 1/6    F(X ≤ 2) = 2/6
    3        f(3) = 1/6    F(X ≤ 3) = 3/6
    4        f(4) = 1/6    F(X ≤ 4) = 4/6
    5        f(5) = 1/6    F(X ≤ 5) = 5/6
    b = 6    f(6) = 1/6    F(X ≤ 6) = 6/6

What's the expected value from rolling the die?

    E[x] = 1/6 + 2/6 + 3/6 + 4/6 + 5/6 + 6/6 = 3.5.

This is the mean (and the median) of a discrete uniform distribution, (n + 1)/2 = (a + b)/2 = 3.5.
Mean of a random variable (Continuous)
For a continuous random variable x, the expected value is

    E[x] = ∫_x x f(x) dx.

Example

The continuous uniform distribution has density 1/(b − a) for a ≤ x ≤ b and 0 otherwise.

    E[x] = ∫_a^b x/(b − a) dx = 1/(b − a) ∫_a^b x dx.

The antiderivative of x is x²/2, so

    E[x] = 1/(b − a) (b²/2 − a²/2) = (b − a)(b + a)/(2(b − a)) = (a + b)/2.

With a = 1 and b = 6, the mean (and the median) is again (a + b)/2 = 3.5.

For a function g(x) of x, the expected value is E[g(x)] = Σ_x g(x) Prob(X = x) or
E[g(x)] = ∫_x g(x) f(x) dx. If g(x) = a + bx for constants a and b, then E[a + bx] = a + bE[x].

Variance of a random variable
The variance of a random variable, σ² > 0, is

    σ² = Var[x] = E[(x − µ)²] = Σ_x (x − µ)² f(x)       if x is discrete,
                              = ∫_x (x − µ)² f(x) dx    if x is continuous.

Example

Roll of a six-sided die. What's the variance V[x] from rolling the die?
The probability of observing x, Pr(X = x) = 1/n, is discretely uniformly distributed.

    E[x] = (n + 1)/2;    (E[x])² = (n + 1)²/4.

    E[x²] = Σ_x x² Pr(X = x) = (1/n) Σ_{x=1}^n x² = (n + 1)(2n + 1)/6,

using the formula for the sum of squares.

    V[x] = E[x²] − (E[x])² = (n + 1)(2n + 1)/6 − (n + 1)²/4 = (n² − 1)/12 = (6² − 1)/12 ≈ 2.92.

Chebychev inequality
For any random variable x and any positive constant k > 1,

    Pr(µ − kσ < x < µ + kσ) ≥ 1 − 1/k².

The share outside k standard deviations is at most 1/k².

If x is normally distributed, the share outside k standard deviations is exactly
1 − (2Φ(k) − 1).

For normally distributed x, 95% of the observations are within 1.96 standard deviations.
If x is not normal, up to 4.47 standard deviations may be needed to cover 95% of the
observations.

[Figure: Normal coverage]
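As a numerical illustration (a Python/scipy sketch, not part of the original text), the Chebychev bound can be compared with the exact normal coverage 2Φ(k) − 1:

    from scipy.stats import norm

    for k in (1.96, 2.0, 3.0, 4.47):
        chebyshev = 1 - 1 / k**2        # lower bound on Pr(|x - mu| < k sigma), any distribution
        normal = 2 * norm.cdf(k) - 1    # exact coverage if x is normal
        print(f"k = {k:4.2f}: Chebychev >= {chebyshev:.3f}, normal = {normal:.3f}")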

1.4 Moments of a random variable

Central moments of a random variable
The central moments are

    µ_r = E[(x − µ)^r].

Example

Moments: Two measures often used to describe a probability distribution are

• expectation = E[(x − µ)¹]
• variance = E[(x − µ)²]
• skewness = E[(x − µ)³]
• kurtosis = E[(x − µ)⁴]

The skewness is zero for symmetric distributions.
Higher order moments

Moment generating function
For the random variable X with probability density function f(x), if the function

    M(t) = E[e^{tx}]

exists, then it is the moment generating function (MGF).

• Often a simpler alternative to working directly with probability density functions or
  cumulative distribution functions
• Not all random variables have moment-generating functions

The nth moment is the nth derivative of the moment-generating function, evaluated at
t = 0.

Example

The MGF for the standard normal distribution with µ = 0, σ = 1 is

    M_z(t) = e^{µt + σ²t²/2} = e^{t²/2}.

If x and y are independent, then the MGF of x + y is M_x(t)M_y(t).

For x ∼ N(µ, σ²) with moment generating function M_x(t) = exp(µt + ½σ²t²), the first
derivative of the MGF is

    M_x′(t) = (µ + σ²t) exp(µt + ½σ²t²),

so that E[x] = M_x′(0) = µ.

Example

By the chain rule,

    M_x′(t) = d exp(µt + ½σ²t²)/dt
            = [d(µt + ½σ²t²)/dt] [d exp(µt + ½σ²t²)/d(µt + ½σ²t²)]
            = (µ + σ²t) exp(µt + ½σ²t²).

If x ∼ N(0, 1),

• the skewness is E[(x − µ)³] = 0 and
• the kurtosis is E[(x − µ)⁴] = 3.

Example

Evaluating the successive derivatives of M_x(t) at t = 0 (with µ = 0 and σ = 1 the raw
moments coincide with the central moments):

    M_x′(t)   = (µ + σ²t) exp(µt + ½σ²t²)                           with µ = 0, σ = 1, t = 0:  E[x] = µ = 0
    M_x″(t)   = [σ² + (µ + σ²t)²] exp(µt + ½σ²t²)                    with µ = 0, σ = 1, t = 0:  E[(x − µ)²] = σ² = 1
    M_x‴(t)   = [3σ²(µ + σ²t) + (µ + σ²t)³] exp(µt + ½σ²t²)           with µ = 0, σ = 1, t = 0:  E[(x − µ)³] = 0
    M_x⁽⁴⁾(t) = [3σ⁴ + 6σ²(µ + σ²t)² + (µ + σ²t)⁴] exp(µt + ½σ²t²)    with µ = 0, σ = 1, t = 0:  E[(x − µ)⁴] = 3.
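The derivatives above can be checked symbolically; a short sympy sketch (an illustration, not part of the original text) differentiates the MGF and evaluates at t = 0:

    import sympy as sp

    t = sp.symbols("t", real=True)
    mu = sp.symbols("mu", real=True)
    sigma = sp.symbols("sigma", positive=True)
    M = sp.exp(mu * t + sigma**2 * t**2 / 2)            # MGF of N(mu, sigma^2)

    for r in range(1, 5):
        moment = sp.diff(M, t, r).subs(t, 0)            # rth raw moment E[x^r]
        print(r, sp.simplify(moment.subs({mu: 0, sigma: 1})))   # standard normal: 0, 1, 0, 3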

Approximating mean and variance

For any two functions g₁(x) and g₂(x),

    E[g₁(x) + g₂(x)] = E[g₁(x)] + E[g₂(x)].

For the general case of a possibly nonlinear g(x),

    E[g(x)] = ∫_x g(x) f(x) dx,

and

    Var[g(x)] = ∫_x (g(x) − E[g(x)])² f(x) dx.

E[g(x)] and Var[g(x)] can be approximated by a first order linear Taylor series:

First order linear Taylor series

    g(x) ≈ [g(x₀) − g′(x₀)x₀] + g′(x₀)x.                              (1)

Taylor approximation of order 1

A natural choice for the expansion point is x₀ = µ = E(x). Inserting this value in Eq. (1)
gives

    g(x) ≈ [g(µ) − g′(µ)µ] + g′(µ)x,

so that

    E[g(x)] ≈ g(µ),

and

    Var[g(x)] ≈ [g′(µ)]² Var[x].
Example

Isoelastic utility. c_bad = 10.00 Euro; c_good = 100.00 Euro; probability of the good
outcome 50%.

    µ = E[c] = 1/2 × c_bad + 1/2 × c_good = 55.00 Euro

    u(c) = c^{1/2}

    u(µ) = 7.42 approximates E[u(c)] = 1/2 × 10^{1/2} + 1/2 × 100^{1/2} = 6.58.

Example

Isoelastic utility.
c_bad = 10.00 Euro; c_good = 100.00 Euro; probability of the good outcome 50%; µ = 55.00 Euro

    u(c) = ln(c)

    u(µ) = 4.01 approximates E[u(c)] = 1/2 × ln(10) + 1/2 × ln(100) = 3.45.

Jensen's inequality:

    E[g(x)] ≤ g(E[x]) if g″(x) < 0.

The Taylor approximation of the variance (with the probability weights 1/2) is

    V[u(c)] ≈ (1/55)² × [1/2 × (10 − 55)² + 1/2 × (100 − 55)²] = 0.67,

while the exact variance is

    V[u(c)] = 1/2 × (ln(10) − E[u(c)])² + 1/2 × (ln(100) − E[u(c)])² = 1.32.
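The approximation error can be made concrete with a few lines of Python (numpy; a sketch under the example's assumption of two equally likely outcomes):

    import numpy as np

    c = np.array([10.0, 100.0])
    p = np.array([0.5, 0.5])
    mu = p @ c                                            # E[c] = 55

    for u, du in [(np.sqrt, lambda x: 0.5 / np.sqrt(x)), (np.log, lambda x: 1 / x)]:
        exact_mean = p @ u(c)                             # E[u(c)]
        exact_var = p @ (u(c) - exact_mean) ** 2          # Var[u(c)]
        approx_mean = u(mu)                               # g(mu)
        approx_var = du(mu) ** 2 * (p @ (c - mu) ** 2)    # [g'(mu)]^2 Var[c]
        print(u.__name__, round(exact_mean, 2), round(approx_mean, 2),
              round(exact_var, 2), round(approx_var, 2))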

1.5 Useful rules

• Var[x] = E[x²] − µ²
• E[x²] = σ² + µ²
• If a and b are constants, Var[a + bx] = b² Var[x]
• Var[a] = 0
• If g(x) = a + bx and a and b are constants, E[a + bx] = a + bE[x]
• Coverage: Pr(|X − µ| ≥ kσ) ≤ 1/k²
• Skewness = E[(x − µ)³]
• Kurtosis = E[(x − µ)⁴]
• For symmetric distributions f(µ − x) = f(µ + x); 1 − F(x) = F(−x)
• E[g(x)] ≈ g(µ)
• Var[g(x)] ≈ [g′(µ)]² Var[x]
2 Specific Distributions

Discrete distributions
Bernoulli distribution
The Bernoulli distribution for a single binomial outcome (trial) is

    Prob(x = 1) = p,
    Prob(x = 0) = 1 − p,

where 0 ≤ p ≤ 1 is the probability of success.

• E[x] = p and
• V[x] = E[x²] − E[x]² = p − p² = p(1 − p).

The distribution for x successes in n trials is the binomial distribution,

    Prob(X = x) = n!/[(n − x)! x!] p^x (1 − p)^{n−x},    x = 0, 1, . . . , n.

The mean and variance of x are

• E[x] = np and
• V[x] = np(1 − p).

Example of a binomial [n = 15, p = 0.5] distribution:
[Figure: binomial(15, 0.5) pmf]
Poisson distribution
The limiting form of the binomial distribution, as n → ∞ with np = λ held fixed, is the
Poisson distribution,

    Prob(X = x) = e^{−λ} λ^x / x!.

The mean and variance of x are

• E[x] = λ and
• V[x] = λ.

Example of a Poisson [3] distribution:
[Figure: Poisson(3) pmf]
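To see the limit numerically, one can compare binomial and Poisson pmfs for large n with p = λ/n; a brief scipy sketch (not part of the original guide):

    from scipy.stats import binom, poisson

    lam, n = 3, 1000
    p = lam / n
    for x in range(6):
        print(x, round(binom.pmf(x, n, p), 4), round(poisson.pmf(x, lam), 4))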
2.1 Normal distribution
The normal distribution
A random variable x ∼ N[µ, σ²] is distributed according to the normal distribution with
mean µ and standard deviation σ, with density

    f(x|µ, σ) = 1/(σ√(2π)) e^{−½((x−µ)/σ)²}.

For the standard normal, the density is denoted ϕ(x) and the cumulative distribution
function is denoted Φ(x). Example of a standard normal, x ∼ N[0, 1], and a normal with
mean 0.5 and standard deviation 1.3:
[Figure: standard normal and N(0.5, 1.3²) densities]
2.2 Method of transformations
Transformation of random variables
A continuous variable x may be transformed to a discrete variable y. Calculate the mean
of variable x in the respective interval:

    Prob(Y = µ₁) = P(−∞ < X ≤ a),
    Prob(Y = µ₂) = P(a < X ≤ b),
    Prob(Y = µ₃) = P(b < X ≤ ∞).

Method of transformations
If x is a continuous random variable with pdf f_x(x) and if y = g(x) is a continuous
monotonic function of x, then the density of y is obtained by

    Prob(y ≤ b) = ∫_{−∞}^b f_x(g⁻¹(y)) |dg⁻¹(y)/dy| dy.

With f_y(y) = f_x(g⁻¹(y)) |dg⁻¹(y)/dy|, this equation can be written as

    Prob(y ≤ b) = ∫_{−∞}^b f_y(y) dy.

Example

If x ∼ N[µ, σ²], then the distribution of y = g(x) = (x − µ)/σ is found as follows:

    g⁻¹(y) = x = σy + µ,
    dg⁻¹(y)/dy = dx/dy = σ.

Therefore, with f_x(g⁻¹(y)) = 1/(σ√(2π)) e^{−½[(g⁻¹(y)−µ)²/σ²]},

    f_y(y) = f_x(g⁻¹(y)) |dg⁻¹(y)/dy| = 1/(σ√(2π)) e^{−[(σy+µ)−µ]²/(2σ²)} |σ| = 1/√(2π) e^{−y²/2}.
Properties of the normal distribution
• Preservation under linear transformation:
  If x ∼ N[µ, σ²], then (a + bx) ∼ N[a + bµ, b²σ²].
• Convenient transformation a = −µ/σ and b = 1/σ:
  The resulting variable z = (x − µ)/σ has the standard normal distribution with density

      ϕ(z) = 1/√(2π) e^{−z²/2}.

• If x ∼ N[µ, σ²], then f(x) = (1/σ) ϕ[(x − µ)/σ].
• Prob(a ≤ x ≤ b) = Prob((a − µ)/σ ≤ (x − µ)/σ ≤ (b − µ)/σ).
• ϕ(−z) = ϕ(z) and Φ(−x) = 1 − Φ(x) because of symmetry.
• If z ∼ N[0, 1], then z² ∼ χ²[1] with pdf 1/√(2πy) e^{−y/2}.

Example

    f_x(x) = 1/√(2π) e^{−x²/2}
    y = g(x) = x²
    g⁻¹(y) = x = ±√y, so there are two solutions g₁, g₂.
    dg⁻¹(y)/dy = dx/dy = ±½ y^{−1/2}
    f_y(y) = f_x(g₁⁻¹(y)) |g₁⁻¹′(y)| + f_x(g₂⁻¹(y)) |g₂⁻¹′(y)|
    f_y(y) = f_x(√y) |½ y^{−1/2}| + f_x(−√y) |−½ y^{−1/2}|
    f_y(y) = 1/(2√(2πy)) e^{−y/2} + 1/(2√(2πy)) e^{−y/2} = 1/√(2πy) e^{−y/2}.

Distributions derived from the normal
• If z ∼ N[0, 1], then z² ∼ χ²[1] with E[z²] = 1 and V[z²] = 2.
• If x₁, ..., x_n are n independent χ²[1] variables, then

      Σ_{i=1}^n x_i ∼ χ²[n].
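A quick simulation (Python/scipy sketch, not from the original text) confirms that z² behaves like a χ²[1] variable:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    z = rng.standard_normal(1_000_000)
    y = z**2

    print(round(y.mean(), 3), round(y.var(), 3))                           # close to E[z^2] = 1, V[z^2] = 2
    print(round((y > 3.84).mean(), 4), round(1 - chi2.cdf(3.84, df=1), 4)) # both about 0.05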
Normal

    Parameters    µ ∈ R, σ ∈ R>0
    Support       x ∈ R
    PDF           ϕ((x − µ)/σ) = 1/(σ√(2π)) e^{−½((x−µ)/σ)²}
    CDF           Φ((x − µ)/σ) = ½[1 + erf((x − µ)/(σ√2))]
    Mean          µ
    Median        µ
    Mode          µ
    Variance      σ²
    Skewness      0
    Ex. Kurtosis  0
    MGF           exp(µt + σ²t²/2)

• PDF denotes probability density function, CDF cumulative distribution function,
  MGF moment-generating function.
• µ is the mean (location); σ and s are scale parameters.
• Excess Kurtosis is defined as Kurtosis minus 3.
• The Gauss error function is erf z = (2/√π) ∫_0^z e^{−t²} dt.
• If z_i, i = 1, ..., n, are independent N[0, 1] variables, then

      Σ_{i=1}^n z_i² ∼ χ²[n].

• If z_i, i = 1, ..., n, are independent N[0, σ²] variables, then

      Σ_{i=1}^n (z_i/σ)² ∼ χ²[n].

• If x₁ and x₂ are independent χ² variables with n₁ and n₂ degrees of freedom, then

      x₁ + x₂ ∼ χ²[n₁ + n₂].
2.3 The χ² distribution
The χ² distribution
A random variable x ∼ χ²[n] is distributed according to the chi-squared distribution
with n degrees of freedom,

    f(x|n) = x^{n/2−1} e^{−x/2} / [2^{n/2} Γ(n/2)],

where Γ is the gamma function (see the gamma distribution below).

• E[x] = n
• V[x] = 2n

Example of a χ²[3] distribution:
[Figure: χ²(3) density]

Approximating a χ²
For degrees of freedom greater than 30, the distribution of the chi-squared variable x is
approximated by

    z = (2x)^{1/2} − (2n − 1)^{1/2},

which is approximately standard normally distributed. Thus,

    Prob(χ²[n] ≤ a) ≈ Φ[(2a)^{1/2} − (2n − 1)^{1/2}].
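The quality of this approximation is easy to check; a small Python/scipy sketch (the choices of n and a here are illustrative):

    import numpy as np
    from scipy.stats import chi2, norm

    n = 40
    for a in (30, 40, 50, 60):
        exact = chi2.cdf(a, df=n)
        approx = norm.cdf(np.sqrt(2 * a) - np.sqrt(2 * n - 1))
        print(a, round(exact, 4), round(approx, 4))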
χ²

    Parameters    n ∈ N>0
    Support       x ∈ R>0 if n = 1, else x ∈ R≥0
    PDF           1/[2^{n/2} Γ(n/2)] x^{n/2−1} e^{−x/2}
    CDF           γ(n/2, x/2) / Γ(n/2)
    Mean          n
    Median        No simple closed form
    Mode          max(n − 2, 0)
    Variance      2n
    Skewness      √(8/n)
    Ex. Kurtosis  12/n
    MGF           (1 − 2t)^{−n/2} for t < 1/2

• n, n₁, n₂ are known as degrees of freedom.
• Regularized incomplete beta function I(x, a, b) = B(x, a, b)/B(a, b) with
  B(x, a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt.
2.4 The F-distribution
The F-distribution
If x₁ and x₂ are two independent chi-squared variables with degrees of freedom parameters
n₁ and n₂, respectively, then the ratio

    F[n₁, n₂] = (x₁/n₁) / (x₂/n₂)

has the F distribution with n₁ and n₂ degrees of freedom.

F

    Parameters    n₁, n₂ ∈ N>0
    Support       x ∈ R>0 if n₁ = 1, else x ∈ R≥0
    PDF           [n₁^{n₁/2} n₂^{n₂/2} Γ((n₁+n₂)/2) / (Γ(n₁/2)Γ(n₂/2))] · x^{n₁/2−1} / (n₁x + n₂)^{(n₁+n₂)/2}
    CDF           I(n₁x/(n₁x + n₂), n₁/2, n₂/2)
    Mean          n₂/(n₂ − 2) for n₂ > 2
    Median        No simple closed form
    Mode          [(n₁ − 2)/n₁] · [n₂/(n₂ + 2)] for n₁ > 2
    Variance      2n₂²(n₁ + n₂ − 2) / [n₁(n₂ − 2)²(n₂ − 4)] for n₂ > 4
    Skewness      (2n₁ + n₂ − 2)√(8(n₂ − 4)) / [(n₂ − 6)√(n₁(n₁ + n₂ − 2))] for n₂ > 6
    Ex. Kurtosis  12[n₁(5n₂ − 22)(n₁ + n₂ − 2) + (n₂ − 4)(n₂ − 2)²] / [n₁(n₂ − 6)(n₂ − 8)(n₁ + n₂ − 2)] for n₂ > 8
    MGF           does not exist

• n, n₁, n₂ are known as degrees of freedom.
• Regularized incomplete beta function I(x, a, b) = B(x, a, b)/B(a, b) with
  B(x, a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt.
2.5 The Student t-distribution
The Student t-distribution
If x₁ is an N[0, 1] variable, often denoted by z, and x₂ is χ²[n₂] and is independent of x₁,
then the ratio

    t[n₂] = x₁ / √(x₂/n₂)

has the t distribution with n₂ degrees of freedom.

Example for the t distributions with 3 and 10 degrees of freedom together with the
standard normal distribution:
[Figure: t(3), t(10), and standard normal densities]

Comparing (2.4) with n₁ = 1 and (2.5): if t ∼ t[n], then t² ∼ F[1, n].

The t[30] approximates the standard normal.

t

    Parameters    n ∈ R>0
    Support       x ∈ R
    PDF           Γ((n+1)/2) / [√(πn) Γ(n/2)] · (1 + x²/n)^{−(n+1)/2}
    CDF           ½ + x Γ((n+1)/2) · ₂F₁(½, (n+1)/2; 3/2; −x²/n) / [√(πn) Γ(n/2)]
    Mean          0 for n > 1
    Median        0
    Mode          0
    Variance      n/(n − 2) for n > 2, ∞ for 1 < n ≤ 2
    Skewness      0 for n > 3
    Ex. Kurtosis  6/(n − 4) for n > 4, ∞ for 2 < n ≤ 4
    MGF           does not exist

• n denotes degrees of freedom.
• ₂F₁(·, ·; ·; ·) is a particular instance of the hypergeometric function.
2.6 The lognormal distribution
The lognormal distribution
The lognormal distribution, denoted LN[µ, σ²], has been particularly useful in modeling
size distributions. Its density is

    f(x) = 1/(√(2π) σx) e^{−½[(ln x − µ)/σ]²},    x > 0.

A lognormal variable x has

• E[x] = e^{µ+σ²/2}, and
• Var[x] = e^{2µ+σ²}(e^{σ²} − 1).

If y ∼ LN[µ, σ²], then ln y ∼ N[µ, σ²].

Log-normal

    Parameters    µ ∈ R, σ ∈ R>0
    Support       x ∈ R>0
    PDF           1/(xσ√(2π)) exp(−(ln x − µ)²/(2σ²))
    CDF           ½[1 + erf((ln x − µ)/(σ√2))] = Φ((ln(x) − µ)/σ)
    Mean          exp(µ + σ²/2)
    Median        exp(µ)
    Mode          exp(µ − σ²)
    Variance      [exp(σ²) − 1] exp(2µ + σ²)
    Skewness      [exp(σ²) + 2] √(exp(σ²) − 1)
    Ex. Kurtosis  exp(4σ²) + 2 exp(3σ²) + 3 exp(2σ²) − 6
    MGF           not determined by its moments
2.7 The gamma distribution
The gamma distribution
The general form of the gamma distribution is

    f(x) = β^α/Γ(α) e^{−βx} x^{α−1},    x ≥ 0, β = 1/θ > 0, α = k > 0.

Many familiar distributions are special cases, including the exponential distribution
(α = 1) and the chi-squared (β = 1/2, α = n/2). The Erlang distribution results if α is a
positive integer. The mean is α/β, and the variance is α/β². The inverse gamma
distribution is the distribution of 1/x, where x has the gamma distribution.

Γ (shape k, scale θ)

    Parameters    k > 0 ∈ R (shape), θ > 0 ∈ R (scale)
    Support       x ∈ (0, ∞)
    PDF           f(x) = 1/(Γ(k)θ^k) x^{k−1} e^{−x/θ}
    CDF           F(x) = γ(k, x/θ)/Γ(k)
    Mean          kθ
    Median        No simple closed form
    Mode          (k − 1)θ for k ≥ 1, 0 for k < 1
    Variance      kθ²
    Skewness      2/√k
    Ex. Kurtosis  6/k
    MGF           (1 − θt)^{−k} for t < 1/θ

Γ (shape α, rate β)

    Parameters    α > 0 ∈ R (shape), β > 0 ∈ R (rate)
    Support       x ∈ (0, ∞)
    PDF           f(x) = β^α/Γ(α) x^{α−1} e^{−βx}
    CDF           F(x) = γ(α, βx)/Γ(α)
    Mean          α/β
    Median        No simple closed form
    Mode          (α − 1)/β for α ≥ 1, 0 for α < 1
    Variance      α/β²
    Skewness      2/√α
    Ex. Kurtosis  6/α
    MGF           (1 − t/β)^{−α} for t < β

• Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt, ℜ(z) > 0, for complex numbers with a positive real part.
• The lower incomplete gamma function is γ(s, x) = ∫_0^x t^{s−1} e^{−t} dt, for complex
  numbers with a positive real part.
2.8 The beta distribution
The beta distribution
For a variable constrained between 0 and c > 0, the beta distribution has proved useful.
Its density is

    f(x) = Γ(α + β)/(Γ(α)Γ(β)) (x/c)^{α−1} (1 − x/c)^{β−1} (1/c),    0 ≤ x ≤ c.

It is symmetric if α = β, asymmetric otherwise. The mean is cα/(α + β), and the variance
is c²αβ/[(α + β + 1)(α + β)²].

B (standard case, c = 1)

    Parameters    α, β ∈ R>0
    Support       x ∈ [0, 1] or x ∈ (0, 1)
    PDF           x^{α−1}(1 − x)^{β−1} / B(α, β)
    CDF           I(x, α, β)
    Mean          α/(α + β)
    Median        I^{[−1]}(1/2; α, β) ≈ (α − 1/3)/(α + β − 2/3) for α, β > 1
    Mode          *
    Variance      αβ / [(α + β)²(α + β + 1)]
    Skewness      2(β − α)√(α + β + 1) / [(α + β + 2)√(αβ)]
    Ex. Kurtosis  6[(α − β)²(α + β + 1) − αβ(α + β + 2)] / [αβ(α + β + 2)(α + β + 3)]
    MGF           1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) t^k/k!

• B(α, β) = Γ(α)Γ(β)/Γ(α + β) and Γ is the Gamma function.
• Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt, ℜ(z) > 0, for complex numbers with a positive real part.
• Regularized incomplete beta function I(x, a, b) = B(x, a, b)/B(a, b) with
  B(x, a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt.
• * Mode: (α − 1)/(α + β − 2) for α, β > 1; any value in (0, 1) for α, β = 1; {0, 1}
  (bimodal) for α, β < 1; 0 for α ≤ 1, β > 1; 1 for α > 1, β ≤ 1.
2.9 The logistic distribution
The logistic distribution
The logistic distribution is an alternative if the normal cannot model the mass in the
tails; the cdf for a logistic random variable with µ = 0, s = 1 is

    F(x) = Λ(x) = 1/(1 + e^{−x}).

The density is f(x) = Λ(x)[1 − Λ(x)]. The mean and variance of this random variable are
zero and σ² = π²/3.

Logistic

    Parameters    µ ∈ R, s ∈ R>0
    Support       x ∈ R
    PDF           λ((x − µ)/s) = e^{−(x−µ)/s} / [s(1 + e^{−(x−µ)/s})²]
    CDF           Λ((x − µ)/s) = 1/(1 + e^{−(x−µ)/s})
    Mean          µ
    Median        µ
    Mode          µ
    Variance      s²π²/3
    Skewness      0
    Ex. Kurtosis  6/5
    MGF           e^{µt} B(1 − st, 1 + st) for t ∈ (−1/s, 1/s)

2.10 The Wishart distribution

The Wishart distribution
The Wishart distribution describes the distribution of a random matrix obtained as

    W = Σ_{i=1}^n (x_i − µ)(x_i − µ)′,

where x_i is the ith of n K-element random vectors from the multivariate normal
distribution with mean vector µ and covariance matrix Σ. The density of the Wishart
random matrix is

    f(W) = exp[−½ trace(Σ⁻¹W)] |W|^{(n−K−1)/2} / [2^{nK/2} |Σ|^{n/2} π^{K(K−1)/4} ∏_{j=1}^K Γ((n + 1 − j)/2)].

The mean matrix is nΣ. For the individual pairs of elements in W,

    Cov[w_ij, w_rs] = n(σ_ir σ_js + σ_is σ_jr).

The Wishart distribution is a multivariate extension of the χ² distribution. If W ∼ W(n, σ²)
with K = 1, then W/σ² ∼ χ²[n].
3 Review of Distribution Theory
3.1 Joint and marginal bivariate distributions
Bivariate distributions
For observations of two discrete variables y ∈ {1, 2} and x ∈ {1, 2, 3}, we can calculate

• the frequencies n_{x,y},

    freq. n_{x,y}    y = 1   y = 2   f(x) = n_x/N
    x = 1            1       2       3/10
    x = 2            1       2       3/10
    x = 3            0       4       4/10
    f(y) = n_y/N     2/10    8/10    1

• conditional distributions f(y|x) and f(x|y),
• joint distributions f(x, y), and
• marginal distributions f_y(y) and f_x(x).

    cond. distr. f(y|x)          y = 1   y = 2   Σ_y
    f(y|x = 1)                   1/3     2/3     1
    f(y|x = 2)                   1/3     2/3     1
    f(y|x = 3)                   0       1       1
    f(y|x = 1, x = 2, x = 3)     1/5     4/5     1

    cond. distr. f(x|y)    f(x|y = 1)   f(x|y = 2)   f(x|y = 1, y = 2)
    x = 1                  1/2          1/4          3/10
    x = 2                  1/2          1/4          3/10
    x = 3                  0            1/2          4/10
    Σ_x                    1            1            1

    joint distr. f(x, y)   f(x, y = 1)  f(x, y = 2)  f_x(x)
    f(x = 1, y)            1/10         2/10         3/10
    f(x = 2, y)            1/10         2/10         3/10
    f(x = 3, y)            0            4/10         4/10
    marginal pr. f_y(y)    2/10         8/10         1
3.2 The joint density function
The joint density function
Two random variables X and Y have a joint density function f(x, y) such that

• if x and y are discrete,

    Prob(a ≤ x ≤ b, c ≤ y ≤ d) = Σ_{a≤x≤b} Σ_{c≤y≤d} f(x, y),

• if x and y are continuous,

    Prob(a ≤ x ≤ b, c ≤ y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx.

Example

With a = 1, b = 2, c = 2, d = 2 and the following f(x, y):

    joint distr. f(x, y)   f(x, y = 1)   f(x, y = 2)
    f(x = 1, y)            1/10          2/10
    f(x = 2, y)            1/10          2/10
    f(x = 3, y)            0             4/10

    Prob(1 ≤ x ≤ 2, 2 ≤ y ≤ 2) = f(x = 1, y = 2) + f(x = 2, y = 2) = 2/5.

For values x and y of two discrete random variables X and Y, the probability distribution is

    f(x, y) = Prob(X = x, Y = y).

The axioms of probability require

    f(x, y) ≥ 0,
    Σ_x Σ_y f(x, y) = 1.
If X and Y are continuous,

    ∫_x ∫_y f(x, y) dy dx = 1.

Bivariate normal distribution
The bivariate normal distribution is the joint distribution of two normally distributed
variables. The density is

    f(x, y) = 1/(2πσ_x σ_y √(1 − ρ²)) e^{−½[(ε_x² + ε_y² − 2ρε_xε_y)/(1 − ρ²)]},

where ε_x = (x − µ_x)/σ_x and ε_y = (y − µ_y)/σ_y.
3.3 The joint cumulative density function

The joint cumulative density function
The probability of a joint event of X and Y is given by the joint cumulative density function:

• if x and y are discrete,

    F(x, y) = Prob(X ≤ x, Y ≤ y) = Σ_{X≤x} Σ_{Y≤y} f(x, y),

• if x and y are continuous,

    F(x, y) = Prob(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y f(t, s) ds dt.

Example

With x = 2, y = 2 and the following f(x, y):

    f(x, y)        f(x, y = 1)   f(x, y = 2)
    f(x = 1, y)    1/10          2/10
    f(x = 2, y)    1/10          2/10
    f(x = 3, y)    0             4/10

    Prob(X ≤ 2, Y ≤ 2) = f(x = 1, y = 1) + f(x = 2, y = 1) + f(x = 1, y = 2) + f(x = 2, y = 2) = 3/5.

Cumulative probability distribution
For values x and y of two discrete random variables X and Y, the cumulative probability
distribution is

    F(x, y) = Prob(X ≤ x, Y ≤ y).

The axioms of probability require

    0 ≤ F(x, y) ≤ 1,
    F(∞, ∞) = 1,
    F(−∞, y) = 0,
    F(x, −∞) = 0.

The marginal probabilities can be found from the joint cdf:

    F_x(x) = Prob(X ≤ x) = Prob(X ≤ x, Y ≤ ∞) = F(x, ∞).
3.4 The marginal probability density
The marginal probability density
To obtain the marginal distributions f_x(x) and f_y(y) from the joint density f(x, y), it is
necessary to sum or integrate out the other variable. For example,

• if x and y are discrete,

    f_x(x) = Σ_y f(x, y),

• if x and y are continuous,

    f_x(x) = ∫_y f(x, s) ds.

Example

    f(x, y)        f(x, y = 1)   f(x, y = 2)   f_x(x)
    f(x = 1, y)    1/10          2/10          3/10
    f(x = 2, y)    1/10          2/10          3/10
    f(x = 3, y)    0             4/10          4/10
    f_y(y)         2/10          8/10          1

    f_x(x = 1) = f(x = 1, y = 1) + f(x = 1, y = 2) = 3/10.
    f_y(y = 2) = f(x = 1, y = 2) + f(x = 2, y = 2) + f(x = 3, y = 2) = 4/5.
[Figure: the bivariate normal distribution]

Why do we care about marginal distributions?
Means, variances, and higher moments of the variables in a joint distribution are defined
with respect to the marginal distributions.

• Expectations
  If x and y are discrete,

      E[x] = Σ_x x f_x(x) = Σ_x x [Σ_y f(x, y)] = Σ_x Σ_y x f(x, y).

  If x and y are continuous,

      E[x] = ∫_x x f_x(x) dx = ∫_x ∫_y x f(x, y) dy dx.

• Variances

      Var[x] = Σ_x (x − E[x])² f_x(x) = Σ_x Σ_y (x − E[x])² f(x, y).
3.5 Covariance and correlation
For any function g(x, y),

    E[g(x, y)] = Σ_x Σ_y g(x, y) f(x, y)         in the discrete case,
               = ∫_x ∫_y g(x, y) f(x, y) dy dx    in the continuous case.

The covariance of x and y is a special case:

    Cov[x, y] = E[(x − µ_x)(y − µ_y)] = E[xy] − µ_xµ_y = σ_xy.

If x and y are independent, then f(x, y) = f_x(x)f_y(y) and

    σ_xy = Σ_x Σ_y f_x(x)f_y(y)(x − µ_x)(y − µ_y)
         = Σ_x (x − µ_x)f_x(x) Σ_y (y − µ_y)f_y(y) = E[x − µ_x] E[y − µ_y] = 0.

• correlation ρ_xy = σ_xy/(σ_xσ_y)
• σ_xy = 0 does not imply independence (except for the bivariate normal).

Independence: pdf and cdf from marginal densities
• Two random variables are statistically independent if and only if their joint density
  is the product of the marginal densities:

      f(x, y) = f_x(x)f_y(y)  ⇔  x and y are independent.

• If (and only if) x and y are independent, then the marginal cdfs factor the cdf as well:

      F(x, y) = F_x(x)F_y(y) = Prob(X ≤ x, Y ≤ y) = Prob(X ≤ x)Prob(Y ≤ y).
Example

    f(x, y)        f(x, y = 1)   f(x, y = 2)   f_x(x)
    f(x = 1, y)    1/6           1/6           1/3
    f(x = 2, y)    1/6           1/6           1/3
    f(x = 3, y)    1/6           1/6           1/3
    f_y(y)         1/2           1/2           1

    F(x, y)        F(x, y = 1)   F(x, y = 2)
    F(x = 1, y)    1/6           2/6
    F(x = 2, y)    2/6           4/6
    F(x = 3, y)    3/6           1

    f_x(x = 3) × f_y(y = 2) = 1/3 × 1/2 = 1/6 = f(x = 3, y = 2).

    P(x ≤ 2)P(y ≤ 2) = [f(x = 1, y = 1) + f(x = 1, y = 2) + f(x = 2, y = 1) + f(x = 2, y = 2)] × 1
                     = 4/6 = F(x = 2, y = 2).

3.6 The conditional density function

The conditional density function
The conditional distribution over y for each value of x (and vice versa) has conditional
densities

    f(y|x) = f(x, y)/f_x(x),        f(x|y) = f(x, y)/f_y(y).

The marginal distribution of x averages the conditional probability of x given y over the
distribution of all values of y: f_x(x) = Σ_y f(x|y)f_y(y) = E_y[f(x|y)]. If x and y are
independent, knowing the value of y does not provide any information about x, so
f_x(x) = f(x|y).

Example

    cond. distr. f(x|y)    f(x|y = 1)   f(x|y = 2)   f(x|y = 1, y = 2)
    x = 1                  1/2          1/4          3/10
    x = 2                  1/2          1/4          3/10
    x = 3                  0            1/2          4/10
    Σ_x                    1            1            1

    joint distr. f(x, y)   f(x, y = 1)  f(x, y = 2)  f_x(x)
    f(x = 1, y)            1/10         2/10         3/10
    f(x = 2, y)            1/10         2/10         3/10
    f(x = 3, y)            0            4/10         4/10
    marginal pr. f_y(y)    2/10         8/10         1

    f(x = 3|y = 2) = f(x = 3, y = 2)/f_y(y = 2) = 4/10 × 10/8 = 1/2.

    f_x(x = 2) = E_y[f(x = 2|y)] = f(x = 2|y = 1)f(y = 1) + f(x = 2|y = 2)f(y = 2)
               = 1/2 × 2/10 + 1/4 × 8/10 = 1/10 + 2/10 = 3/10.

3.7 Conditional mean aka regression

A random variable may always be written as

    y = E[y|x] + (y − E[y|x]) = E[y|x] + ϵ.

Definition
The regression of y on x is obtained from the conditional mean

    E[y|x] = Σ_y y f(y|x)         if y is discrete,
           = ∫_y y f(y|x) dy      if y is continuous.

Predict y at values of x:

    Σ_y y f(y|x = 1) = 1 × 1/3 + 2 × 2/3 = 5/3.
Conditional variance
A conditional variance is the variance of the conditional distribution:

    Var[y|x] = Σ_y (y − E[y|x])² f(y|x)        if y is discrete,
             = ∫_y (y − E[y|x])² f(y|x) dy     if y is continuous.

The computation can be simplified by using

    Var[y|x] = E[y²|x] − (E[y|x])² ≥ 0.

Decomposition of variance: Var[y] = E_x[Var[y|x]] + Var_x[E[y|x]]

• When we condition on x, the variance of y reduces on average: Var[y] ≥ E_x[Var[y|x]].
• E_x[Var[y|x]] is the average of the variances within each x.
• Var_x[E[y|x]] is the variance between the y averages in each x.
• E[y|x = 1] = 1.67, E[y|x = 2] = 1.67, and E[y|x = 3] = 2
• V[y|x = 1] = 0.22, V[y|x = 2] = 0.22, and V[y|x = 3] = 0
Example

    f(y|x)         y = 1   y = 2   Σ_y
    f(y|x = 1)     1/3     2/3     1
    f(y|x = 2)     1/3     2/3     1
    f(y|x = 3)     0       1       1

    f(x, y)        f(x, y = 1)   f(x, y = 2)   f_x(x)
    f(x = 1, y)    1/10          2/10          3/10
    f(x = 2, y)    1/10          2/10          3/10
    f(x = 3, y)    0             4/10          4/10
    f_y(y)         2/10          8/10          1

    E[y|x = 1] = 1/3 × 1 + 2/3 × 2 = 5/3,    V[y|x = 1] = 1² × 1/3 + 2² × 2/3 − (5/3)² = 2/9,
    E[y|x = 2] = 1/3 × 1 + 2/3 × 2 = 5/3,    V[y|x = 2] = 1² × 1/3 + 2² × 2/3 − (5/3)² = 2/9,
    E[y|x = 3] = 0 × 1 + 1 × 2 = 2,          V[y|x = 3] = 1² × 0 + 2² × 1 − 2² = 0.

Alternatively (requiring more differences),

    V[y|x = 1] = (1 − 5/3)² × 1/3 + (2 − 5/3)² × 2/3 = 2/9.

The average of the variances within each x, E[V[y|x]], is less than or equal to the total
variance V[y].

Example

• Use the conditional mean to calculate E[y]:

    E[y] = E_x[E[y|x]] = E[y|x = 1]f(x = 1) + E[y|x = 2]f(x = 2) + E[y|x = 3]f(x = 3)
         = 5/3 × 3/10 + 5/3 × 3/10 + 2 × 4/10 = 9/5,

    E[y] = Σ_y y f_y(y) = 1 × 2/10 + 2 × 8/10 = 9/5.

• The variation in y within each x, V[y|x = 1] = 0.22, V[y|x = 2] = 0.22, and
  V[y|x = 3] = 0, is on average

    E[V[y|x]] = 3/10 × 2/9 + 3/10 × 2/9 + 4/10 × 0 = 2/15.

• Across the conditional means E[y|x = 1] = 5/3, E[y|x = 2] = 5/3, and E[y|x = 3] = 2,
  y varies with

    V[E[y|x]] = E_x[(E[y|x])²] − (E_x[E[y|x]])²
              = 3/10 × (5/3)² + 3/10 × (5/3)² + 4/10 × (2)² − (9/5)² = 2/75.

• E[V[y|x]] + V[E[y|x]] = V[y] = 2/15 + 2/75 = 4/25.
  With degree of freedom correction (N − 1) (as reported in software):
  V[y] = (2/15 + 2/75) × 10/(10 − 1) = 8/45.
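The decomposition can be verified on the 10 observations behind the example (a Python sketch with numpy; population variances, i.e. divisor N):

    import numpy as np

    x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
    y = np.array([1, 2, 2, 1, 2, 2, 2, 2, 2, 2])

    groups = [y[x == v] for v in (1, 2, 3)]
    weights = np.array([len(g) for g in groups]) / len(y)        # f(x) = 3/10, 3/10, 4/10

    within = sum(w * g.var() for w, g in zip(weights, groups))                     # E[V[y|x]] = 2/15
    between = sum(w * (g.mean() - y.mean())**2 for w, g in zip(weights, groups))   # V[E[y|x]] = 2/75
    print(round(y.var(), 4), round(within + between, 4))         # both 0.16 = 4/25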

3.8 The bivariate normal

Properties of the bivariate normal
Recall that the bivariate normal distribution is the joint distribution of two normally
distributed variables. The density is

    f(x, y) = 1/(2πσ_x σ_y √(1 − ρ²)) e^{−½[(ε_x² + ε_y² − 2ρε_xε_y)/(1 − ρ²)]},

where ε_x = (x − µ_x)/σ_x and ε_y = (y − µ_y)/σ_y.
The covariance is σ_xy = ρ_xy σ_x σ_y, where

• −1 < ρ_xy < 1 is the correlation between x and y,
• µ_x, σ_x, µ_y, σ_y are the means and standard deviations of the marginal distributions
  of x and y.

If x and y are bivariately normally distributed, (x, y) ∼ N₂[µ_x, µ_y, σ_x², σ_y², ρ_xy],

• the marginal distributions are normal:

      f_x(x) = N[µ_x, σ_x²],
      f_y(y) = N[µ_y, σ_y²],

• the conditional distributions are normal:

      f(y|x) = N[α + βx, σ_y²(1 − ρ²)],
      α = µ_y − βµ_x;    β = σ_xy/σ_x²,

• f(x, y) = f_x(x)f_y(y) if ρ_xy = 0: x and y are independent if and only if they are
  uncorrelated.
3.9 Useful rules

• ρ_xy = σ_xy/(σ_xσ_y)
• E[ax + by + c] = aE[x] + bE[y] + c
• Var[ax + by + c] = a²Var[x] + b²Var[y] + 2abCov[x, y] = Var[ax + by]
• Cov[ax + by, cx + dy] = acVar[x] + bdVar[y] + (ad + bc)Cov[x, y]
• If X and Y are uncorrelated, then Var[x + y] = Var[x − y] = Var[x] + Var[y].
• Linearity

    E[ax + by|z] = aE[x|z] + bE[y|z].

• Adam's Law / Law of Iterated Expectations

    E[y] = E_x[E[y|x]]

• Adam's general Law / Law of Iterated Expectations

    E[y|g₂(g₁(x))] = E[E[y|g₁(x)]|g₂(g₁(x))]

• Independence
  If x and y are independent, then

    E[y] = E[y|x],
    E[g₁(x)g₂(y)] = E[g₁(x)]E[g₂(y)].

• Taking out what is known

    E[g₁(x)g₂(y)|x] = g₁(x)E[g₂(y)|x].

• Projection of y by E[y|x], such that the projection error is orthogonal to any h(x):

    E[(y − E[y|x])h(x)] = 0.

• Keeping just what is needed (the part of y predictable from x is needed, not the residual):

    E[xy] = E[xE[y|x]].

• Eve's Law (EVVE) / Law of Total Variance

    Var[y] = E_x[Var[y|x]] + Var_x[E[y|x]]

• ECCE Law / Law of Total Covariance

    Cov[x, y] = E_z[Cov[y, x|z]] + Cov_z[E[x|z], E[y|z]]

• Cov[x, y] = Cov_x[x, E[y|x]] = ∫_x (x − E[x]) E[y|x] f_x(x) dx.
• If E[y|x] = α + βx, then α = E[y] − βE[x] and β = Cov[x, y]/Var[x].
• Regression variance: Var_x[E[y|x]], because E[y|x] varies with x.
• Residual variance: E_x[Var[y|x]] = Var[y] − Var_x[E[y|x]], because y varies around
  the conditional mean.
• Decomposition of variance: Var[y] = Var_x[E[y|x]] + E_x[Var[y|x]]
• Coefficient of determination = regression variance / total variance
• If E[y|x] = α + βx and if Var[y|x] is a constant, then

    Var[y|x] = Var[y](1 − Corr²[y, x]) = σ_y²(1 − ρ_xy²).
4 The Least Squares Estimator
4.1 What is the Relationship between Two Variables?
Political Connections and Firms
Firm profits increase with the degree of political connections.

• Learn how to represent relationships between two or more variables
• How to quantify and predict effects of shocks and policy changes
• Show properties of the OLS estimator in small & large samples
• Apply Monte Carlo Simulations to assess properties of OLS
4.2 The Econometric Model
Specification of a Linear Regression
• dependent variable: y_i = profits of firm i
• explanatory variables x_{i1}, . . . , x_{iK}, k = 1, . . . , K: political connections, other firm
  characteristics
• x_{i0} = 1 is a constant
• parameters to be estimated: β₀, β₁, . . . , β_K (K + 1 in total)
• u_i is called the error term

Two example specifications:

    y_i = (β₀ = 4) + (β₁ = 0)x_{i1} + u_i,
    y_i = (β₀ = 2.36) + (β₁ = 0.01)x_{i1} + u_i.

How Were the Data Generated?

The data generating process is fully described by a set of assumptions.

The Five Assumptions of the Econometric Model
• LRM1: Linearity
• LRM2: Simple random sampling
• LRM3: Exogeneity
• LRM4: Error variance
• LRM5: Identifiability
Data Generating Process: Linearity
LRM1: Linearity

    y_i = β₀ + β₁x_{i1} + . . . + β_Kx_{iK} + u_i and E(u_i) = 0.

LRM1 assumes that the
• functional relationship is linear in the parameters β_k,
• error term u_i enters additively,
• parameters β_k are constant across individual firms i and j ≠ i.

Anscombe's Quartet
Figure 1: All four sets are identical when examined using linear statistics, but very
different when graphed. The correlation between x and y is 0.816. Linear regression:
y = 3.00 + 0.50x.

Data Generating Process: Random Sampling
LRM2: Simple Random Sampling

    {x_{i1}, . . . , x_{iK}, y_i}_{i=1}^N i.i.d. (independent and identically distributed)

LRM2 means that
• observation i has no information content for observation j ≠ i,
• all observations i come from the same distribution.

This assumption is guaranteed by simple random sampling provided there is no systematic
non-response or truncation.

Density of Population and Truncated Sample
Figure 2: Distribution of a dependent variable and an independent variable truncated at
y* = 15.
Data Generating Process: Exogeneity
LRM3: Exogeneity

1. u_i|x_{i1}, . . . , x_{iK} ∼ N(0, σ_i²)
   LRM3a assumes that the error term is normally distributed conditional on the
   explanatory variables.
2. u_i ⊥ x_{ik} ∀k (independent), pdf_{u,x}(u_i, x_{ik}) = pdf_u(u_i)pdf_x(x_{ik})
   LRM3b means that the error term is independent of the explanatory variables.
3. E(u_i|x_{i1}, . . . , x_{iK}) = E(u_i) = 0 (mean independent)
   LRM3c states that the mean of the error term is independent of the explanatory
   variables.
4. cov(x_{ik}, u_i) = 0 ∀k (uncorrelated)
   LRM3d means that the error term and the explanatory variables are uncorrelated.

LRM3a or LRM3b imply LRM3c and LRM3d. LRM3c implies LRM3d.

Figure 3: Distributions of the dependent variable conditional on values of an independent
variable.

A weaker exogeneity assumption suffices if interest is only in, say, x_{i1}:

Conditional Mean Independence: E(u_i|x_{i1}, x_{i2}, . . . , x_{iK}) = E(u_i|x_{i2}, . . . , x_{iK}).
Given the control variables x_{i2}, . . . , x_{iK}, the mean of u_i does not depend on the variable
of interest x_{i1}.
Data Generating Process: Error Variance
LRM4: Error Variance

1. V(u_i|x_{i1}, . . . , x_{iK}) = σ² < ∞ (homoskedasticity)
   LRM4a means that the variance of the error term is a constant.
2. V(u_i|x_{i1}, . . . , x_{iK}) = σ_i² = g(x_{i1}, . . . , x_{iK}) < ∞ (conditional heteroskedasticity)
   LRM4b allows the variance of the error term to depend on a function g of the
   explanatory variables.

Heteroskedasticity
Figure 4: The simple regression model under homo- and heteroskedasticity.
Var(profits|lobbying, employees) increasing with lobbying.

Data Generating Process: Identifiability
LRM5: Identifiability

    (x_{i0}, x_{i1}, . . . , x_{iK}) are not linearly dependent,
    0 < V(x_{ik}) < ∞ ∀k > 0.

LRM5 assumes that
• the regressors are not perfectly collinear, i.e. no variable is a linear combination of
  the others,
• all regressors (but the constant) have strictly positive variance both in expectation
  and in the sample, and not too many extreme values.

LRM5 means that every explanatory variable adds additional information.

The Identifying Variation from x_{ik}
Figure 5: The number of red and blue dots is the same. Using which would you get a
more accurate regression line?
4.3 Estimation with OLS
Ordinary least squares (OLS) minimizes the squared distances (SD) between the
observed and the predicted dependent variable y:

    min_{β₀,...,β_K} SD(β₀, . . . , β_K),

    where SD = Σ_{i=1}^N [y_i − (β₀ + β₁x_{i1} + . . . + β_Kx_{iK})]².

How to Describe the Relationship Best?
[Figures: scatter plots with candidate regression lines]
Invention of OLS

Legendre to Jacobi (Paris, 30 November 1827, Plackett, 1972): "...How can Mr. Gauss
have dared to tell you that the greater part of your theorems were known to him...?
... this is the same man ... who wanted to appropriate in 1809 the method of least
squares published in 1805. Other examples will be found in other places, but a man of
honour should refrain from imitating them."

Figure 6: Watercolor caricature of Legendre by Boilly (1820), the only existing portrait
known.
Figure 7: Portrait of Gauss by Jensen (1840).

Estimation with OLS

For the bivariate regression model, the OLS estimators of β₀ and β₁ are

    β̂₀ = ȳ − β̂₁x̄,
    β̂₁ = Σ_{i=1}^N (x_{i1} − x̄)(y_i − ȳ) / Σ_{i=1}^N (x_{i1} − x̄)² = cov(x, y)/var(x),
    β̂₁ = cov(x, y)/(s_x s_x) = R s_y/s_x,

where R ≡ cov(x, y)/(s_x s_y) is Pearson's correlation coefficient, with s_z denoting the
standard deviation of z.

The OLS Estimator Measures Linear Correlation

Equivalently,

    R = (s_x/s_y) β̂₁ = β̂₁ √(Σ_{i=1}^N (x_{i1} − x̄)²) / √(Σ_{i=1}^N (y_i − ȳ)²)
                      = √(Σ_{i=1}^N (β̂₁x_{i1} − β̂₁x̄)²) / √(Σ_{i=1}^N (y_i − ȳ)²).

Squaring gives

    R² = Σ_{i=1}^N (ŷ_i − ȳ)² / Σ_{i=1}^N (y_i − ȳ)² = 1 − Σ_{i=1}^N û_i² / Σ_{i=1}^N (y_i − ȳ)².

R² as a measure of the goodness of fit:
The fit improves with the fraction of the sample variation in y that is explained by the x.

The Case with K Explanatory Variables

The more general case with K explanatory variables is

    β̂ = (X′X)⁻¹ X′ y,

with dimensions β̂: (K+1)×1, (X′X)⁻¹: (K+1)×(K+1), X′: (K+1)×N, y: N×1.

Given the OLS estimator, we can predict
• the dependent variable by ŷ_i = β̂₀ + β̂₁x_{i1} + . . . + β̂_Kx_{iK},
• the error term by û_i = y_i − ŷ_i.

û_i is called the residual.

    Adjusted R² = 1 − (N − 1)/(N − K − 1) × Σ_{i=1}^N û_i² / Σ_{i=1}^N (y_i − ȳ)².
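The matrix formula can be applied directly; a minimal numpy sketch on simulated data (using the DGP of the Monte Carlo section below, β₀ = 2, β₁ = 0.5):

    import numpy as np

    rng = np.random.default_rng(42)
    N = 100
    x1 = rng.normal(size=N)
    y = 2.0 + 0.5 * x1 + rng.normal(size=N)        # DGP with beta0 = 2, beta1 = 0.5

    X = np.column_stack([np.ones(N), x1])          # include the constant x_i0 = 1
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
    u_hat = y - X @ beta_hat
    r2 = 1 - (u_hat @ u_hat) / ((y - y.mean()) @ (y - y.mean()))
    print(beta_hat.round(3), round(r2, 3))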
Figure 8: Scatter cloud visualized with GRAPH3D for Stata.
Figure 9: OLS surface visualized with GRAPH3D for Stata.
4.4 Properties of the OLS Estimator in the Small and in the Large
Properties of the OLS Estimator
• Small sample properties of β̂
  - unbiased
  - normally distributed
  - efficient
• Large sample properties of β̂
  - consistent
  - approx. normal
  - asymptotically efficient

Small Sample Properties
Figure 10: What is a small sample? Source: Familien-Duell, Light Entertainment.
Figure 11: What is a small sample? (Wooldridge, 2009, p. 755): "But large sample
approximations have been known to work well for sample sizes as small as N = 20."
Source: Familien-Duell, Grundy Light Entertainment.
Unbiasedness and Normality of β̂_k
Assuming LRM1, LRM2, LRM3a, LRM4, and LRM5, the following properties can be
established even for small samples.

• The OLS estimator of β is unbiased:

    E(β̂_k|x_{11}, . . . , x_{NK}) = β_k.

• The OLS estimator is (multivariate) normally distributed:

    β̂_k|x_{11}, . . . , x_{NK} ∼ N(β_k, V(β̂_k)).

• Under homoskedasticity (LRM4a), the variance V(β̂_k|x_{11}, . . . , x_{NK}) can be
  estimated without bias.

Variance of β̂_k and Efficiency

• For the bivariate regression model, it is estimated as

    V̂ = σ̂² / Σ_{i=1}^N (x_i − x̄)²    with    σ̂² = Σ_{i=1}^N û_i² / (N − K − 1).

• Gauss-Markov Theorem: under homoskedasticity (LRM4a), β̂_k is the BLUE (best
  linear unbiased estimator; e.g., non-linear least squares is biased).
• V̂(β̂_k) inflates with
  - micronumerosity (small sample size),
  - multicollinearity (high (but not perfect) correlation between two or more of the
    independent variables).

Unbiasedness
• The OLS estimator of β is unbiased.
  Plug y = Xβ + u into the formula for β̂ and then use the law of iterated expectations
  to first take the expectation with respect to u conditional on X and then take the
  unconditional expectation:

    E[β̂] = E_{X,u}[(X′X)⁻¹X′(Xβ + u)]
          = β + E_{X,u}[(X′X)⁻¹X′u]
          = β + E_X[E_{u|X}[(X′X)⁻¹X′u|X]]
          = β + E_X[(X′X)⁻¹X′E_{u|X}[u|X]]
          = β,

  where E[u|X] = 0 by the assumptions of the model.

Variance
• The OLS estimator β̂ has variance V(β̂|x_{11}, . . . , x_{NK}) = σ²(X′X)⁻¹.
  Let σ²I denote the covariance matrix of u. Then,

    E[(β̂ − β)(β̂ − β)′] = E[((X′X)⁻¹X′u)((X′X)⁻¹X′u)′]
                         = E[(X′X)⁻¹X′uu′X(X′X)⁻¹]
                         = E[(X′X)⁻¹X′σ²IX(X′X)⁻¹]
                         = E[σ²(X′X)⁻¹X′X(X′X)⁻¹]
                         = σ²(X′X)⁻¹,

  where we used the fact that β̂ − β is just an affine transformation of u by the matrix
  (X′X)⁻¹X′.

Estimator for Variance

For a simple linear regression model, where β = [β₀, β₁]′ (β₀ is the y-intercept and β₁ is
the slope), one obtains

    σ²(X′X)⁻¹ = σ² (Σ x_i x_i′)⁻¹
              = σ² (Σ (1, x_i)′(1, x_i))⁻¹
              = σ² [ N      Σx_i  ]⁻¹
                   [ Σx_i   Σx_i² ]
              = σ² · 1/(NΣx_i² − (Σx_i)²) [  Σx_i²   −Σx_i ]
                                          [ −Σx_i     N    ]
              = σ² · 1/(N Σ_{i=1}^N (x_i − x̄)²) [  Σx_i²   −Σx_i ]
                                                [ −Σx_i     N    ],

so that

    Var(β̂₁) = σ² / Σ_{i=1}^N (x_i − x̄)².

Parameter Values for Simulations

Monte Carlo Simulations show the distribution of the estimate. Suppose the data
generating process is

    y_i = β₀ + β₁x_{i1} + u_i,

with

• β₀ = 2.00
• β₁ = 0.5
• u_i ∼ N(0.00, 1.00)
• N = 3, N = 5, N = 10, N = 25, N = 100, N = 1000.

Try it yourself...
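A Monte Carlo sketch in Python (numpy) along these lines draws many samples for each N and summarizes the sampling distribution of β̂₁; the number of replications is an arbitrary choice:

    import numpy as np

    rng = np.random.default_rng(0)

    def ols_slope(N):
        x = rng.normal(size=N)
        y = 2.0 + 0.5 * x + rng.normal(size=N)             # DGP: beta0 = 2, beta1 = 0.5, u ~ N(0, 1)
        return np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)  # beta1_hat = cov(x, y)/var(x)

    for N in (3, 5, 10, 25, 100, 1000):
        draws = np.array([ols_slope(N) for _ in range(2000)])
        print(N, round(draws.mean(), 3), round(draws.std(), 3))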
How to Establish Asymptotic Properties of β̂_k?
Law of Large Numbers
As N increases, the distribution of β̂_k becomes more tightly centered around β_k.
[Figure: sampling distributions of β̂_k for (a) N = 3, (b) N = 5, (c) N = 10, (d) N = 100]

Central Limit Theorem
As N increases, the distribution of β̂_k becomes normal (starting from a t-distribution).
[Figure: sampling distributions of β̂_k for (a) N = 3, (b) N = 5, (c) N = 10, (d) N = 100]

Consistency, Asymptotic Normality

Assuming LRM1, LRM2, LRM3d, LRM4a or LRM4b, and LRM5, the following properties
can be established using the law of large numbers and the central limit theorem for
large samples.

• The OLS estimator is consistent:

    plim β̂_k = β_k.

  That is, for all ε > 0,

    lim_{N→∞} Pr(|β̂_k − β_k| > ε) = 0.

• The OLS estimator is asymptotically normally distributed:

    √N(β̂_k − β_k) →d N(0, Avar(β̂_k) × N)

  (Avar means asymptotic variance).

• The OLS estimator is approximately normally distributed:

    β̂_k ∼A N(β_k, Avar(β̂_k)).

Efficiency and Asymptotic Variance

For the bivariate regression under LRM4a (homoskedasticity), the asymptotic variance
can be consistently estimated as

    Âvar(β̂₁) = σ̂² / Σ_{i=1}^N (x_{i1} − x̄)²,    with    σ̂² = Σ_{i=1}^N û_i² / (N − 2).

Under LRM4b (heteroskedasticity), Avar(β̂) can be consistently estimated with the
robust, or Eicker-Huber-White, estimator. The robust variance estimator is calculated as

    Âvar(β̂₁) = Σ_{i=1}^N û_i²(x_{i1} − x̄)² / [Σ_{i=1}^N (x_{i1} − x̄)²]².

Note: In practice we can almost never be sure that the errors are homoskedastic and
should therefore always use robust standard errors.
Sketch of Proof for Asymptotic Properties

• The OLS estimator β̂ is consistent and asymptotically normal.

The estimator β̂ can be written as

    β̂ = (N⁻¹X′X)⁻¹ N⁻¹X′y = β + (N⁻¹X′X)⁻¹ N⁻¹X′u
       = β + (N⁻¹ Σ_{i=1}^N x_i x_i′)⁻¹ (N⁻¹ Σ_{i=1}^N x_i u_i).

We can use the law of large numbers to establish that

    N⁻¹ Σ_{i=1}^N x_i x_i′ →p E[x_i x_i′] = Q_xx,
    N⁻¹ Σ_{i=1}^N x_i u_i →p E[x_i u_i] = 0.

By Slutsky's theorem and the continuous mapping theorem these results can be combined
to establish consistency of the estimator β̂:

    β̂ →p β + Q_xx⁻¹ · 0 = β.

The central limit theorem tells us that

    N^{−1/2} Σ_{i=1}^N x_i u_i →d N(0, V),

where V = Var[x_i u_i] = E[u_i² x_i x_i′] = E[E[u_i²|x_i] x_i x_i′] = σ²Q_xx.

Applying Slutsky's theorem again, we have

    √N(β̂ − β) = (N⁻¹ Σ_{i=1}^N x_i x_i′)⁻¹ (N^{−1/2} Σ_{i=1}^N x_i u_i)
              →d Q_xx⁻¹ N(0, σ²Q_xx) = N(0, σ²Q_xx⁻¹).

OLS Properties in the Small and in the Large

    Set of assumptions               (1)  (2)  (3)  (4)  (5)  (6)
    LRM1: linearity                   fulfilled
    LRM2: simple random sampling      fulfilled
    LRM5: identifiability             fulfilled
    LRM4: error variance
    - LRM4a: homoskedastic            ✓    ✓    ✓    ×    ×    ×
    - LRM4b: heteroskedastic          ×    ×    ×    ✓    ✓    ✓
    LRM3: exogeneity
    - LRM3a: normality                ✓    ×    ×    ✓    ×    ×
    - LRM3b: independent              ✓    ✓    ×    ×    ×    ×
    - LRM3c: mean indep.              ✓    ✓    ✓    ✓    ✓    ×
    - LRM3d: uncorrelated             ✓    ✓    ✓    ✓    ✓    ✓

    Small sample properties of β̂
    - unbiased                        ✓    ✓    ✓    ✓    ✓    ×
    - normally distributed            ✓    ×    ×    ✓    ×    ×
    - efficient                       ✓    ✓    ✓    ×    ×    ×

    Large sample properties of β̂
    - consistent                      ✓    ✓    ✓    ✓    ✓    ✓
    - approx. normal                  ✓    ✓    ✓    ✓    ✓    ✓
    - asymptotically efficient        ✓    ✓    ✓    ×    ×    ×

• Notes: ✓ = fulfilled, × = violated
Tests in Small Samples I
Assume LRM1, LRM2, LRM3a, LRM4a, and LRM5. A simple null hypothesis of the
form H₀: β_k = q is tested with the t-test. If the null hypothesis is true, the t-statistic

    t = (β̂_k − q)/se(β̂_k) ∼ t_{N−K−1}

follows a t-distribution with N − K − 1 degrees of freedom. The standard error is
se(β̂_k) = √V̂(β̂_k).

For example, to perform a two-sided test of H₀ against the alternative hypothesis
H_A: β_k ≠ q at the 5% significance level, we calculate the t-statistic and compare its
absolute value to the 0.975-quantile of the t-distribution. With N = 30 and K = 2, H₀ is
rejected if |t| > 2.052.
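The critical value cited above can be reproduced with scipy (a sketch, not part of the original text):

    from scipy.stats import t as t_dist

    df = 30 - 2 - 1                                   # N - K - 1 = 27
    print(round(t_dist.ppf(0.975, df), 3))            # 2.052
    print(round(2 * (1 - t_dist.cdf(2.5, df)), 3))    # two-sided p-value for an example t = 2.5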

Tests in Small Samples II

A null hypothesis of the form H₀: r_{j1}β₁ + . . . + r_{jK}β_K = q_j, in matrix notation
H₀: Rβ = q, with J linear restrictions j = 1 . . . J, is jointly tested with the F-test.
If the null hypothesis is true, the F-statistic follows an F distribution with J numerator
degrees of freedom and N − K − 1 denominator degrees of freedom:

    F = (Rβ̂ − q)′[RV̂(β̂|X)R′]⁻¹(Rβ̂ − q) / J ∼ F_{J,N−K−1}.

For example, to perform a test of H₀ against the alternative hypothesis
H_A: r_{j1}β₁ + . . . + r_{jK}β_K ≠ q_j for some j at the 5% significance level, we calculate
the F-statistic and compare it to the 0.95-quantile of the F-distribution.
With N = 30, K = 2 and J = 2, H₀ is rejected if F > 3.35. We cannot perform two-sided
F-tests because the F distribution has only one tail.

Tests in Small Samples III

Only under homoskedasticity (LRM4a), the F-statistic can also be computed as

    F = [(R² − R²_restricted)/J] / [(1 − R²)/(N − K − 1)] ∼ F_{J,N−K−1},

where R²_restricted is estimated by restricted least squares, which minimizes SD(β) subject
to r_{j1}β₁ + . . . + r_{jK}β_K = q_j for all j.

Exclusionary restrictions of the form H₀: β_k = 0, β_m = 0, . . . are a special case of
H₀: r_{j1}β₁ + . . . + r_{jK}β_K = q_j for all j. In this case, restricted least squares is simply
estimated as a regression where the explanatory variables k, m, . . . are excluded, e.g. a
regression with a constant only.

If the F distribution has 1 numerator degree of freedom and N − K − 1 denominator
degrees of freedom, then it can be shown that t² = F(1, N − K − 1).

Confidence Intervals in Small Samples

Assuming LRM1, LRM2, LRM3a, LRM4a, and LRM5, we can construct confidence
intervals for a particular coefficient β_k. The (1 − α) confidence interval is given by

    [β̂_k − t_{(1−α/2),(N−K−1)} se(β̂_k),  β̂_k + t_{(1−α/2),(N−K−1)} se(β̂_k)],

where t_{(1−α/2),(N−K−1)} is the (1 − α/2) quantile of the t-distribution with (N − K − 1)
degrees of freedom. For example, the 95% confidence interval with N = 30 and K = 2 is

    [β̂_k − 2.052 se(β̂_k),  β̂_k + 2.052 se(β̂_k)].

Recall: α is the maximum acceptable probability of a Type I error.

                                H₀ is valid (Innocent)       H₀ is invalid (Guilty)
    Reject H₀                   Type I (α = 0.05) error:     Correct outcome:
    "I think he is guilty!"     False positive. Convicted!   True positive. Convicted!
    Don't reject H₀             Correct outcome:             Type II (β) error:
    "I think he is innocent!"   True negative. Freed!        False negative. Freed!

Asymptotic Tests
Assume LRM1, LRM2, LRM3d, LRM4a or LRM4b, and LRM5. A simple null hypothesis
of the form H₀: β_k = q is tested with the z-test. If the null hypothesis is true, the
z-statistic

    z = (β̂_k − q)/se(β̂_k) ∼A N(0, 1)

follows approximately the standard normal distribution. The standard error is
se(β̂_k) = √Âvar(β̂_k).

For example, to perform a two-sided test of H₀ against the alternative hypothesis
H_A: β_k ≠ q at the 5% significance level, we calculate the z-statistic and compare its
absolute value to the 0.975-quantile of the standard normal distribution. H₀ is rejected
if |z| > 1.96.

We talk about the Wald test later...

Confidence Intervals in Large Samples

Assuming LRM1, LRM2, LRM3d, LRM5, and LRM4a or LRM4b, we can construct
confidence intervals for a particular coefficient β_k. The (1 − α) confidence interval is
given by

    [β̂_k − z_{(1−α/2)} se(β̂_k),  β̂_k + z_{(1−α/2)} se(β̂_k)],

where z_{(1−α/2)} is the (1 − α/2) quantile of the standard normal distribution.

For example, the 95% confidence interval is [β̂_k − 1.96 se(β̂_k),  β̂_k + 1.96 se(β̂_k)].
OLS Properties in the Small and in the Large

    Set of assumptions               (1)  (2)  (3)  (4)  (5)  (6)
    LRM1: linearity                   fulfilled
    LRM2: simple random sampling      fulfilled
    LRM5: identifiability             fulfilled
    LRM4: error variance
    - LRM4a: homoskedastic            ✓    ✓    ✓    ×    ×    ×
    - LRM4b: heteroskedastic          ×    ×    ×    ✓    ✓    ✓
    LRM3: exogeneity
    - LRM3a: normality                ✓    ×    ×    ✓    ×    ×
    - LRM3b: independent              ✓    ✓    ×    ×    ×    ×
    - LRM3c: mean indep.              ✓    ✓    ✓    ✓    ✓    ×
    - LRM3d: uncorrelated             ✓    ✓    ✓    ✓    ✓    ✓

    Small sample properties of β̂
    - unbiased                        ✓    ✓    ✓    ✓    ✓    ×
    - normally distributed            ✓    ×    ×    ✓    ×    ×
    - efficient                       ✓    ✓    ✓    ×    ×    ×
    t-test, F-test                    ✓    ×    ×    ×    ×    ×

    Large sample properties of β̂
    - consistent                      ✓    ✓    ✓    ✓    ✓    ✓
    - approx. normal                  ✓    ✓    ✓    ✓    ✓    ✓
    - asymptotically efficient        ✓    ✓    ✓    ×    ×    ×
    z-test, Wald test                 ✓    ✓    ✓    ✓*   ✓*   ✓*

• Notes: ✓ = fulfilled, × = violated, * = corrected standard errors.
4.5 Politically Connected Firms: Causality or Correlation?
Arguments For Causality of the Effect

Econometric methods need to address concerns, including:

• Misspecification: Results robust to different functional forms
• Errors-in-variables: little concern with administrative data
• External validity: Similar effect found in independent studies.

Arguments Against Causality of the Effect

• Omitted variable bias: e.g., business acumen
  → Panel data models
• Sample selection bias: lobbying expenditures only observed if in the transparency
  register.
  → Selection correction models
• Simultaneous causality:
  - profits may be higher because of political connections
  - firms may become connected because of their high profits

All of those concerns may be addressed with
→ instrumental variable models. What would be a good instrument/experiment?
5 Simplifying Linear Regressions using Frisch-Waugh-
Lovell
5.1 Frisch-Waugh-Lovell theorem in equation algebra
From the multivariate to the bivariate regression
Regress yi on two explanatory variables, where x2i is the variable of interest and x1i (or

further variables) are not of interest.

yi = β0 + β2 x2i + β1 x1i + εi .

Surprising and useful result:

ˆ We can obtain exactly the same coefficient and residuals from a regression of two demeaned variables

ỹi = β0 + β2 x̃2i + εi .

ˆ We can obtain exactly the same coefficient and residuals from a regression of two residualized variables

εyi = β2 ε2i + εi .

Why is the decomposition useful?


Allows breaking a multivariate model with K independent variables into K bivariate

models.

ˆ Relationship between two variables from a multivariate model can be shown in a

two-dimensional scatter plot

ˆ Absorbs fixed effects to reduce computation time (see reghdfe for Stata)

ˆ Allows separating the variability between the regressors (multicollinearity) from the variability between the residualized variable x̃2i and the dependent variable yi .

ˆ Understand biases in multivariate models tractably.

How to decompose yi and x2i ?


Partial out x1i from yi and from x2i .

ˆ Regress x2i on all x1i and get residuals ε2i :

x2i = γ0 + γ1 x1i + ε2i ,

this implies Cov(x1i , ε2i ) = 0,

ˆ Regress yi on all x1i and get residuals εyi :

yi = δ0 + δ1 x1i + εyi .

This implies Cov(x1i , εyi ) = 0.


From the residuals and the constants γ0 and δ0 generate

ˆ x̃2i = γ0 + ε2i ,

ˆ ỹi = δ0 + εyi .
Finally,

ỹi = β̃0 + β̃1 x̃2i + ε̃i = β0 + β2 x̃2i + εi .

Decomposition theorem

For multivariate regressions and detrended regressions, e.g.,

yi = β0 + β2 x2i + β1 x1i + εi ,

ỹi = β̃0 + β̃1 x̃2i + ε̃i ,

the same regression coefficients will be obtained with any non-empty subset of the explanatory variables, such that

β̃1 = β2 and also ε̃i = εi .

Examining either set of residuals will convey precisely the same information about the

properties of the unobservable stochastic disturbances.

Detrended variables
Show that

yi = β0 + β2 x2i + β1 x1i + εi (2)

= ỹi = β̃0 + β̃1 x̃2i + ε̃i .

Plug in the variables yi = δ0 + δ1 x1i + εyi and x2i = γ0 + γ1 x1i + ε2i in the equation (2)

yi = δ0 + δ1 x1i + εyi = β0 + β2 (γ0 + γ1 x1i + ε2i ) + β1 x1i + εi


ỹi = δ0 + εyi = β0 + β2 (γ0 + ε2i ) + (β2 γ1 − δ1 + β1 )x1i + εi .

Because we partialled out x1i using OLS, x1i is mechanically uncorrelated with ε2i and with εyi . Therefore, the coefficient (β2 γ1 − δ1 + β1 ) on the partialled-out variable x1i is zero. With x̃2i = γ0 + ε2i , the equation simplifies to

ỹi = δ0 + εyi = β0 + β2 (γ0 + ε2i ) + εi .

Regression anatomy: only detrending x2i and not yi . The regression constant, the residuals, and the standard errors change, but β2 remains the same:

yi = δ0 + δ1 x1i + εyi = (β0 + δ1 x̄1 ) + β2 (γ0 + ε2i ) + (εi + δ1 (x1i − x̄1 ))

yi = κ + β2 x̃2i + ϵi .

Residualized variables

ỹi = δ0 + εyi = β0 + β2 (γ0 + ε2i ) + εi


εyi = β0 − δ0 + β2 γ0 + β2 ε2i + εi .

The same result of the FWL Theorem holds as well for a regression of the residualized

variables because β0 = δ0 − β2 γ0 :

εyi = β2 ε2i + εi .
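A quick numerical check of this result is sketched below in Python (simulated data; the design and variable names are arbitrary assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # x2 correlated with x1
y = 1.0 + 2.0 * x2 - 1.0 * x1 + rng.normal(size=n)

ones = np.ones(n)

# full regression: y on constant, x2, x1
X = np.column_stack([ones, x2, x1])
b_full = np.linalg.lstsq(X, y, rcond=None)[0]

# partial out x1: residuals from x2 on (1, x1) and from y on (1, x1)
X1 = np.column_stack([ones, x1])
e2 = x2 - X1 @ np.linalg.lstsq(X1, x2, rcond=None)[0]
ey = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]

# regression of residualized y on residualized x2 gives the same beta2
beta2_resid = (e2 @ ey) / (e2 @ e2)
print(b_full[1], beta2_resid)               # both equal the coefficient on x2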

5.2 Projection and residual maker matrices


Partition of y
Least squares partitions the vector y into two orthogonal parts

y = ŷ + e = Xb + e = P y + M y.

ˆ n×1 vector of data y

ˆ n×n projection matrix P

ˆ n×n residual maker matrix M

ˆ n×1 vector of residuals e

Projection matrix

P y = Xb = X(X ′ X)−1 X ′ y

→ P = X(X ′ X)−1 X ′ .


Properties.

ˆ symmetric, P = P ′ , so the projection is orthogonal

ˆ idempotent, P = P ² , so it is indeed a projection

ˆ reproduces the regressors: P X = X

Example for projection matrix


Example

Show P X = X(X ′ X)−1 X ′ X = X.


   
Writing matrices row by row (rows separated by semicolons),

X = [1 0; 1 1; 1 0],   X′X = [3 1; 1 1],   (X′X)−1 = [1/2 −1/2; −1/2 3/2],

X(X′X)−1 X′ = [1/2 0 1/2; 0 1 0; 1/2 0 1/2],

P X = [1/2 0 1/2; 0 1 0; 1/2 0 1/2] [1 0; 1 1; 1 0] = [1 0; 1 1; 1 0].

Project y on the column space of X, i.e. regress y on X and predict E[y] = ŷ:

y = [1; 2; 3],   P y = [1/2 0 1/2; 0 1 0; 1/2 0 1/2] [1; 2; 3] = ŷ = [2; 2; 2].

Residual maker matrix

M y = e = y − Xb = y − X(X ′ X)−1 X ′ y
M y = (I − X(X ′ X)−1 X ′ )y

→ M = I − X(X ′ X)−1 X ′ = (I − P ).


Properties.

ˆ symmetric such that M = M′

ˆ idempotent such that M = M ²

ˆ annihilator matrix MX = 0

ˆ orthogonal to P : P M = M P = 0.

Example for residual maker matrix


Example

Show M X = (I − X(X ′ X)−1 X ′ )X = (I − P )X = X − X = 0.


   
I = [1 0 0; 0 1 0; 0 0 1],   X = [1 0; 1 1; 1 0],

M = I − P = [1 0 0; 0 1 0; 0 0 1] − [1/2 0 1/2; 0 1 0; 1/2 0 1/2] = [1/2 0 −1/2; 0 0 0; −1/2 0 1/2],

M X = [1/2 0 −1/2; 0 0 0; −1/2 0 1/2] [1 0; 1 1; 1 0] = [0 0; 0 0; 0 0].

Obtain the residuals from a projection of y on the column space of X, i.e. regress y on X and predict y − E[y] = y − ŷ:

y = [1; 2; 3],   M y = [1/2 0 −1/2; 0 0 0; −1/2 0 1/2] [1; 2; 3] = y − ŷ = [−1; 0; 1].

The column space of X is spanned by x0 = [1; 1; 1] and x1 = [0; 1; 0]. For y = [1; 2; 3], the fitted values are ŷ = [2; 2; 2] and the residuals are y − ŷ = [−1; 0; 1].

The closest point to the vector y′ = [1, 2, 3] in the column space of X is ŷ = Xb, here ŷ′ = [2, 2, 2]. At this point, we can draw a line orthogonal to the column space of X.
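The worked example above can be verified directly. The following is a minimal sketch in Python (only numpy is assumed):

import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 2.0, 3.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix
M = np.eye(3) - P                        # residual maker matrix

print(P @ y)    # fitted values  [2. 2. 2.]
print(M @ y)    # residuals      [-1. 0. 1.]
print(M @ X)    # annihilates X: all zeros
print(np.allclose(P @ P, P), np.allclose(P @ M, 0))   # idempotent, orthogonal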

Decomposing the normal equations

The normal equations in matrix form are X ′ Xb = X ′ y . (They are called normal equations because y − Xb is normal to the range of X .) If X is partitioned into an interesting segment X2 and an uninteresting segment X1 , the normal equations are

[X1′ X1  X1′ X2 ; X2′ X1  X2′ X2 ] [b1 ; b2 ] = [X1′ y ; X2′ y ].

The multiplication of the two block rows can be done separately:

[X1′ X1  X1′ X2 ] [b1 ; b2 ] = X1′ y ,    (3)

[X2′ X1  X2′ X2 ] [b1 ; b2 ] = X2′ y .    (4)

How can we find an expression for b2 that does not involve b1 ?

Solving for b2

Idea: solve equation (3) for b1 in terms of b2 , then substitute that solution into equation (4).

 
[X1′ X1  X1′ X2 ] [b1 ; b2 ] = X1′ y
X1′ X1 b1 + X1′ X2 b2 = X1′ y
X1′ X1 b1 = X1′ y − X1′ X2 b2
b1 = (X1′ X1 )−1 X1′ y − (X1′ X1 )−1 X1′ X2 b2
   = (X1′ X1 )−1 X1′ (y − X2 b2 ).

Multiplying out equation (4) gives

 
[X2′ X1  X2′ X2 ] [b1 ; b2 ] = X2′ y
X2′ X1 b1 + X2′ X2 b2 = X2′ y .

Plugging in the solution for b1 gives

 
X2′ X1 (X1′ X1 )−1 X1′ (y − X2 b2 ) + X2′ X2 b2 = X2′ y.

The middle part of the first term is X1 (X1′ X1 )−1 X1′ . This is the projection matrix PX1 from a regression of y on X1 .

X2′ PX1 y − X2′ PX1 X2 b2 + X2′ X2 b2 = X2′ y.

We can multiply by an identity matrix I without changing anything

X2′ PX1 y − X2′ PX1 X2 b2 + X2′ IX2 b2 = X2′ Iy.


X2′ Iy − X2′ PX1 y = X2′ IX2 b2 − X2′ PX1 X2 b2 .
X2′ (I − PX1 )y = X2′ (I − PX1 )X2 b2 .

Now (I − PX1 ) is the residual maker matrix MX1

X2′ MX1 y = X2′ MX1 X2 b2 .

Solving for b2 gives

b2 = (X2′ MX1 X2 )−1 X2′ MX1 y.


The residual maker matrix is symmetric (MX1 = MX1 ′ ) and idempotent (MX1 = MX1 MX1 ), such that MX1 = MX1 ′ MX1 .

b2 = (X2′ MX1 ′ MX1 X2 )−1 X2′ MX1 ′ MX1 y

   = [ (MX1 X2 )′ (MX1 X2 ) ]−1 (MX1 X2 )′ (MX1 y)

   = (X̃2′ X̃2 )−1 X̃2′ ỹ.

This is the OLS solution for b2 , with X̃2 instead of X and ỹ instead of y.

ˆ X̃2 are residuals from a regression of X2 on X1

ˆ ỹ are residuals from a regression of y on X1

The solution for the regression coefficients b2 in a regression that includes other regressors X1 is the same as first regressing X2 and y on X1 , and then regressing the residuals from the y regression on the residuals from the X2 regression.
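The partitioned formula b2 = (X2′ MX1 X2 )−1 X2′ MX1 y can also be checked numerically. The sketch below uses simulated data in Python; here X1 is assumed to contain the constant and the uninteresting regressor, and X2 the regressor of interest:

import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 0.5 + 1.5 * x2 + 2.0 * x1 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])   # uninteresting block (with constant)
X2 = x2.reshape(-1, 1)                   # block of interest

# residual maker for X1 and the partitioned solution for b2
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
b2 = np.linalg.solve(X2.T @ M1 @ X2, X2.T @ M1 @ y)

# full OLS on [X1, X2] gives the same coefficient on x2
b_full = np.linalg.lstsq(np.column_stack([X1, X2]), y, rcond=None)[0]
print(b2, b_full[-1])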

6 The Maximum Likelihood Estimator
6.1 From Probability to Likelihood
The Likelihood Principle
Suppose you have three credit cards. You forgot which of them have money on them. Thus, the number of credit cards with money, call it θ, might be 0, 1, 2, or 3. You can try your cards four times at random to check whether you can make a payment.

The checks are random variables y1 , y2 , y3 , and y4 . They are


yi = 1, if the card tried in the ith check has money on it,
yi = 0, otherwise.

Since you chose the cards uniformly at random, the yi 's are i.i.d. and yi ∼ Bernoulli(θ/3). After checking, we find y1 = 1, y2 = 0, y3 = 1, y4 = 1: three checks succeed and one fails. The number of credit cards with money could still be 0, 1, 2, or 3.

Which value is most likely?

From Probability to Likelihood


You could test for the true θ0 in many samples. Conversely, you can check each possible value of θ to find the probability of observing the sample (y1 = 1, y2 = 0, y3 = 1, y4 = 1). Since yi ∼ Bernoulli(θ/3), we have

P rob(yi = y) = θ/3 for y = 1,   and   P rob(yi = y) = 1 − θ/3 for y = 0.

Since the yi 's are independent, the joint PMF of y1 , y2 , y3 , and y4 can be written as

P rob(y1 = y1 , y2 = y2 , y3 = y3 , y4 = y4 |θ) = P rob(y1 )P rob(y2 )P rob(y3 )P rob(y4 ).

This depends on θ and is called the likelihood function:

L(θ|y) = P rob(y1 = 1, y2 = 0, y3 = 1, y4 = 1|θ) = (θ/3)(1 − θ/3)(θ/3)(θ/3) = (θ/3)³ (1 − θ/3).

θ            0        1        2        3
L(θ|y)       0.0000   0.0247   0.0988   0.0000

Values of the likelihood L(θ|y) for different θ.


The probability of the observed sample for θ=0 and θ=3 is zero. This makes sense

because our sample included both cards with and without money. The observed data is

most likely to occur for θ = 2.
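The likelihood values in the table are easy to reproduce. A minimal sketch in Python (numpy assumed):

import numpy as np

theta = np.arange(4)                       # candidate numbers of cards with money
likelihood = (theta / 3) ** 3 * (1 - theta / 3)
print(np.round(likelihood, 4))             # [0.  0.0247  0.0988  0.  ]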

Likelihood principle: choose θ that maximizes the likelihood of observing the actual
sample to get an estimator for θ0 .
The likelihood is taken from the

ˆ probability mass function if the data are discrete,

ˆ probability density function if the data are continuous.

From Likelihood to Log-Likelihood

ˆ The likelihood function LN (θ|y, X) is the joint probability mass function or

density f (y, X|θ), viewed as a function of vector θ given the data (y, X).

ˆ Maximizing the likelihood LN (θ) is equivalent to maximizing the log-likelihood function ln LN (θ), because taking the logarithm is a monotonic transformation: a maximum of ln LN (θ) corresponds to a maximum of LN (θ). Some densities commonly used in maximum likelihood estimation are:

Model         Range of y        Density f (y)                         Common parametrization
Bernoulli     0 or 1            p^y (1 − p)^(1−y)                     p = e^(−x′β) / (1 + e^(−x′β))
Poisson       0, 1, 2, . . .    e^(−λ) λ^y / y!                       λ = e^(x′β)
Exponential   (0, ∞)            λ e^(−λy)                             λ = e^(x′β) or 1/λ = e^(x′β)
Normal        (−∞, ∞)           (2πσ²)^(−1/2) e^(−(y−µ)²/(2σ²))       µ = x′β, σ² = σ²

6.2 The Econometric Model


Specification of a Likelihood Function

The conditional likelihood LN (θ) = f (y, X|θ)/f (X|θ) = f (y|X, θ) does not require the specification of the marginal distribution of X .

For observations (yi , xi ) independent over i and distributed with f (y|X, θ),

ˆ the joint density is

f (y|X, θ) = Π_{i=1}^{N} f (yi |xi , θ),

ˆ the log-likelihood function divided by N is

(1/N ) LN (θ) = (1/N ) Σ_{i=1}^{N} ln f (yi |xi , θ).

Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) is the estimator that maximizes the (conditional) log-likelihood function LN (θ). The MLE is the local maximum that solves the first-order conditions

(1/N ) ∂LN (θ)/∂θ = (1/N ) Σ_{i=1}^{N} ∂ ln f (yi |xi , θ)/∂θ = 0.

This estimator is an extremum estimator based on the conditional density of y given x. The gradient vector ∂LN (θ)/∂θ is called the score vector, as it sums the first derivatives of the log density; when evaluated at θ0 , it is called the efficient score.

How Were the Data Generated?


Simple Random Sampling

{xi1 , . . . , xiK , yi }, i = 1, . . . , N , i.i.d. (independent and identically distributed)

This assumption means that

ˆ observation i has no information content for observation j ̸= i

ˆ all observations i come from the same distribution

This assumption is guaranteed by simple random sampling provided there is no systematic

non-response or truncation.

I.i.d. data simplify the maximization as the joint density of the two variables is simply

the product of the two marginal densities.

For example, with a normal joint pdf and two observations,

f (y1 , y2 ) = fY1 (y1 ) fY2 (y2 ) = (1/(2πσ²)) exp{ −[(y1 − µ)² + (y2 − µ)²]/(2σ²) }.

With dependent observations we would have to maximize the following likelihood function, where ρ is the correlation:

(1/(2πσ² √(1 − ρ²))) exp{ −[(y1 − µ)² + (y2 − µ)² − 2ρ(y1 − µ)(y2 − µ)]/(2σ²(1 − ρ²)) }.

The Score has Expected Value Zero


Likelihood Equation:

Ef [g(θ)] = Ef [ ∂ ln f (y|x, θ)/∂θ ] = ∫ [ ∂ ln f (y|x, θ)/∂θ ] f (y|x, θ) dy = 0.

Example

Since f (y|θ) is a density, ∫ f (y|θ) dy = 1. Differentiating both sides with respect to θ gives

∂/∂θ ∫ f (y|θ) dy = ∫ ∂f (y|θ)/∂θ dy = 0.

Because ∂ ln f (y|θ)/∂θ = [∂f (y|θ)/∂θ]/f (y|θ), we have

∂f (y|θ)/∂θ = [ ∂ ln f (y|θ)/∂θ ] f (y|θ),

and therefore

∫ [ ∂ ln f (y|θ)/∂θ ] f (y|θ) dy = 0.

Fisher Information
The information matrix is the expectation of the outer product of the score vector,

 
I = Ef [ (∂ ln f (y|x, θ)/∂θ) (∂ ln f (y|x, θ)/∂θ′) ].

The Fisher information I equals the variance of the score, since ∂LN (θ)/∂θ has mean zero.

ˆ Large values of I mean that small changes in θ lead to large changes in the log-

likelihood

→ LN (θ) contains considerable information about θ,

ˆ Small values of I mean that the maximum is shallow and there are many nearby

values of θ with a similar log-likelihood.

Information Matrix Equality


The Fisher information I equals the negative of the expected Hessian H:

−Ef [H(θ)] = −Ef [ ∂² ln f (y|x, θ)/∂θ∂θ′ ] = Ef [ (∂ ln f (y|x, θ)/∂θ) (∂ ln f (y|x, θ)/∂θ′) ].

Example

For a vector moment function, e.g., the score m(y, θ) = ∂ ln f (y|θ)/∂θ with E[m(y, θ)] = 0,

∫ m(y, θ) f (y|θ) dy = 0.

Differentiating with respect to θ′ gives

∫ [ ∂m(y, θ)/∂θ′ f (y|θ) + m(y, θ) ∂f (y|θ)/∂θ′ ] dy = 0,

∫ [ ∂m(y, θ)/∂θ′ f (y|θ) + m(y, θ) (∂ ln f (y|θ)/∂θ′) f (y|θ) ] dy = 0,

E[ ∂m(y, θ)/∂θ′ ] = −E[ m(y, θ) ∂ ln f (y|θ)/∂θ′ ].

With m(y, θ) equal to the score, the left-hand side is the expected Hessian and the right-hand side is minus the expected outer product of the scores, which is the information matrix equality.

The Information Matrix in Practice


The variance of the sum of the random score vectors follows from the information matrix equality:

Var[ Σ_{i=1}^{n} gi (θ) ] = Var[g(θ)] = −Ef [H(θ)] = −E[ ∂² ln L/∂θ∂θ′ ].

After taking the expected value, θ̂ is substituted for θ. Problem: taking the expected value of the second derivative matrix is frequently infeasible.

There exist two alternatives which are asymptotically equivalent:

ˆ Ignore the expected value operator:

Î(θ̂) = −∂² ln L/∂ θ̂ ∂ θ̂′ .

ˆ Berndt-Hall-Hall-Hausman (BHHH) algorithm: never take a second derivative; instead sum the outer products of the scores (the first derivatives per observation):

Ǐ(θ̂) = Σ_{i=1}^{n} ĝi ĝi′ = Σ_{i=1}^{n} [ ∂ ln f (yi , θ̂)/∂ θ̂ ] [ ∂ ln f (yi , θ̂)/∂ θ̂ ]′ .
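As an illustration of both variance estimates, the sketch below works out the Bernoulli case in Python (numpy assumed; for this model the outer-product and Hessian estimates coincide exactly at the MLE):

import numpy as np

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.6, size=500)        # Bernoulli data with true p = 0.6
p_hat = y.mean()                          # MLE of p

# scores and Hessians per observation, evaluated at the MLE
scores = y / p_hat - (1 - y) / (1 - p_hat)
hessians = -y / p_hat**2 - (1 - y) / (1 - p_hat)**2

var_bhhh = 1 / np.sum(scores**2)          # outer product of the scores (BHHH)
var_hess = 1 / (-np.sum(hessians))        # observed information
print(var_bhhh, var_hess, p_hat * (1 - p_hat) / y.size)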

6.3 Properties of the Maximum Likelihood Estimator


Properties of the MLE
ˆ Small sample properties of θ̂
- may be biased
- may have unknown distribution
- variance may be biased, even towards zero

ˆ Large sample properties of θ̂
- consistent
- approx. normal
- asymptotically efficient
- invariant

Consistency
Law of Large Numbers
As N increases, the distribution of θ̂ becomes more tightly centered around θ.

(Figure: sampling distribution of θ̂ for (a) N = 3, (b) N = 10, (c) N = 25, (d) N = 100.)

Likelihood Inequality

E[(1/N ) ln L(θ0 )] ≥ E[(1/N ) ln L(θ)].

The expected value of the log-likelihood is maximized at the true value of the parameters.

Figure 15: θ̂, Likelihood and Log-Likelihood as n → ∞. True θ = 0.6.

lim_{n→∞} P (|θ̂ − θ| > ϵ) = 0,    lim_{n→∞} E[θ̂] = θ.

Approximate Normality
Central Limit Theorem
As N becomes large,
θ̂ is approximately distributed as N ( θ, [ −E( ∂² LN (θ)/∂θ∂θ′ ) ]−1 ).

Figure 16: Sampling distribution of θ̂ drawn from Bernoulli distribution and normal
distribution at N = 100. True θ = 0.6.
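A small Monte Carlo experiment illustrates the approximate normality. The sketch below mirrors the Bernoulli design with θ = 0.6 and N = 100 (Python with numpy assumed; the number of replications is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(3)
theta0, N, reps = 0.6, 100, 5000

# MLE of a Bernoulli parameter is the sample mean; repeat over many samples
theta_hats = rng.binomial(1, theta0, size=(reps, N)).mean(axis=1)

print(theta_hats.mean())                        # close to 0.6
print(theta_hats.std(ddof=1))                   # close to the asymptotic value
print(np.sqrt(theta0 * (1 - theta0) / N))       # sqrt(theta0*(1-theta0)/N), about 0.049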

Eciency
The precision of the estimate θ̂ is limited by the Fisher information I of the likelihood:

Var(θ̂) ≥ 1/I(θ).

For large samples, this is the so-called Cramér-Rao lower bound for the variance matrix of consistent, asymptotically normal estimators, with convergence to normality of √N (θ̂ − θ0 ) uniform in compact intervals of θ0 .
Under the strong assumption of correct specification of the conditional density, the MLE has the smallest asymptotic variance among root-N consistent estimators.

Example

For an unbiased estimator θ̂,

E[ θ̂ − θ | θ ] = ∫ ( θ̂ − θ ) f (y; θ) dy = 0 regardless of the value of θ.

This expression is zero independent of θ, so its partial derivative with respect to θ must

also be zero. By the product rule, this partial derivative is also equal to

0 = ∂/∂θ ∫ ( θ̂ − θ ) f (y; θ) dy = ∫ ( θ̂ − θ ) ∂f /∂θ dy − ∫ f dy.
For each θ, the likelihood function is a probability density function, and therefore ∫ f dy = 1. By using the chain rule on the partial derivative of ln f and then dividing and multiplying by f (y; θ), one can verify that

∂f /∂θ = f ∂ ln f /∂θ.
Using these two facts, we get

∫ ( θ̂ − θ ) (∂ ln f /∂θ) f dy = 1.

Factoring the integrand gives

∫ [ ( θ̂ − θ ) √f ] [ √f ∂ ln f /∂θ ] dy = 1.
Squaring the expression in the integral, the Cauchy-Schwarz inequality yields

1 = ( ∫ [ ( θ̂ − θ ) √f ] [ √f ∂ ln f /∂θ ] dy )² ≤ [ ∫ ( θ̂ − θ )² f dy ] · [ ∫ (∂ ln f /∂θ)² f dy ].

The first factor is the expected mean-squared error (the variance) of the estimator θ̂; the second factor is the Fisher information.

Invariance
The MLE of γ = c(θ) is γ̂ = c(θ̂) if c(θ) is a continuous and continuously differentiable function.
function.

ˆ This can simplify the log-likelihood,

ˆ This allows a function of θ̂ to serve as MLE if it is desired to analyze a function of an MLE.

Example

Suppose that the normal log-likelihood is parameterized in terms of the precision param-

eter, θ2 = 1/σ 2 . The log-likelihood becomes

ln L(µ, θ2 ) = −(N/2) ln(2π) + (N/2) ln θ2 − (θ2 /2) Σ_{i=1}^{N} (yi − µ)².

The MLE for µ is ȳ. But the likelihood equation for θ2 is now

∂ ln L(µ, θ2 )/∂θ2 = (1/2) [ N/θ2 − Σ_{i=1}^{N} (yi − µ)² ] = 0,

which has solution θ̂2 = N / Σ_{i=1}^{N} (yi − µ̂)² = 1/σ̂².
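The invariance property can be checked by maximizing the reparameterized log-likelihood numerically and comparing the maximizer with 1/σ̂². A sketch in Python (scipy assumed; the optimizer, its bounds, and the simulated sample are arbitrary illustrative choices):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
y = rng.normal(loc=2.0, scale=1.5, size=400)
mu_hat = y.mean()
sigma2_hat = np.mean((y - mu_hat) ** 2)          # MLE of sigma^2

def neg_loglik(theta2):
    # normal log-likelihood parameterized by the precision theta2 = 1/sigma^2
    n = y.size
    return -(-0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(theta2)
             - 0.5 * theta2 * np.sum((y - mu_hat) ** 2))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1 / sigma2_hat)                     # the two agree (invariance)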

The MLE is also equivariant with respect to certain transformations of the data.
If y = c(x) where c is one to one and does not depend on the parameters to be

estimated, then the density functions satisfy

fY (y) = fX (x)/|c′ (x)| ,

and hence the likelihood functions for x and y differ only by a factor that does not depend on the model parameters.


Example

The MLE parameters of the log-normal distribution are the same as those of the normal distribution fitted to the logarithm of the data.

7 The Generalized Method of Moments
7.1 How to choose from too many restrictions?
Minimize the quadratic form
The overidentified GMM estimator θ̂GMM (Wn ) for K parameters in θ identified by L > K moment conditions is a function of the weighting matrix Wn for a sample of i = 1, . . . , n observations:

θ̂GMM (Wn ) = argmin_θ qn (θ),

where the quadratic form qn (θ) is the criterion function and is given as a function of the sample moments m̄n (θ),

qn (θ) = m̄n (θ)′ Wn m̄n (θ).

The sample moments

m̄n (θ) = (1/n) Σ_{i=1}^{n} m(Xi , Zi , θ)

are a function of the model variables Xi , the instruments Zi , and the parameter vector θ.

What are the properties of the quadratic form

qn (θ) = m̄n (θ)′ Wn m̄n (θ) ?
(1×1)    (1×L)  (L×L) (L×1)

The quadratic form criterion function qn (θ) ≥ 0 is a scalar. The weighting matrix Wn is symmetric and positive definite, that is, x′ Wn x > 0 for all non-zero x.

7.2 Get the sampling error (at least approximately)


Get an approximate deviation from the true θ0
First order Taylor expansion of sample moments m̄n (θ̂GM M ) around m̄n (θ0 ) at true pa-

rameters gives:

m̄n (θ̂GM M ) ≈ m̄n (θ0 ) + Ḡn (θ̄)(θ̂GM M − θ0 ),

where Ḡn (θ̄) = ∂ m̄n (θ̄)/∂ θ̄′ and θ̄ is a point between θ̂GMM and θ0 .

Check the dimensions


First order Taylor expansion of sample moments m̄n (θ̂GM M ) around m̄n (θ0 ) at true pa-

rameters gives:

m̄n (θ̂GMM ) ≈ m̄n (θ0 ) + Ḡn (θ̄)(θ̂GMM − θ0 ),
 (L×1)       (L×1)     (L×K)   (K×1)

where Ḡn (θ̄) = ∂ m̄n (θ̄)/∂ θ̄′ (the derivative of an L×1 vector with respect to a 1×K vector, hence L×K) and θ̄ is a point between θ̂GMM and θ0 , because of the mean value theorem.

Approximation introduced θ̄
...where Ḡn (θ̄) = ∂ m̄n (θ̄)/∂ θ̄′ and θ̄ is a point between θ̂GMM and θ0 .

Mean value theorem

Ḡn (θ̄) = [ m̄n (θ̂GMM ) − m̄n (θ0 ) ] / ( θ̂GMM − θ0 )   for θ0 < θ̄ < θ̂GMM .

Do the minimization
To minimize the quadratic form criterion qn (θ) = m̄n (θ)′ Wn m̄n (θ), we take the first derivative and set it to zero:

∂qn (θ̂GMM )/∂ θ̂GMM = 2Ḡn (θ̂GMM )′ Wn m̄n (θ̂GMM ) = 0.

Express as much as possible asymptotically


∂qn (θ̂GMM )/∂ θ̂GMM = 2Ḡn (θ̂GMM )′ Wn m̄n (θ̂GMM ) = 0.
Plug in the approximation from before

m̄n (θ̂GM M ) ≈ m̄n (θ0 ) + Ḡn (θ̄)(θ̂GM M − θ0 )

to obtain

Ḡn (θ̂GM M )′ Wn m̄n (θ0 ) + Ḡn (θ̂GM M )′ Wn Ḡn (θ̄)(θ̂GM M − θ0 ) ≈ 0

which we rearrange to get the very useful

θ̂GM M ≈ θ0 − (Ḡn (θ̂GM M )′ Wn Ḡn (θ̄))−1 Ḡn (θ̂GM M )′ Wn m̄n (θ0 ).

So the estimate θ̂GM M is approximately the true parameter θ0 plus a sampling error that

depends on the sample moment m̄n (θ0 ).

Quickly check dimensions


Useful approximation

θ̂GMM ≈ θ0 − ( Ḡn (θ̂GMM )′ Wn Ḡn (θ̄) )−1 Ḡn (θ̂GMM )′ Wn m̄n (θ0 ).
 (K×1)  (K×1)   (K×L)  (L×L) (L×K)       (K×L)   (L×L)  (L×1)

So the estimate θ̂GM M is approximately the true parameter θ0 plus a sampling error that

depends on the sample moment m̄n (θ0 ).

7.3 The econometric model
Three assumptions: moment conditions
GMM1: Moment Conditions and Identification

m̄(θa ) ̸= m̄(θ0 ) = E[m(Xi , Zi , θ0 )] = 0   for any θa ̸= θ0 .

Identification implies that the probability limit of the GMM criterion function is uniquely minimized at the true parameters.

Three assumptions: law of large numbers


GMM2: Law of Large Numbers Applies

m̄n (θ) = (1/n) Σ_{i=1}^{n} m(Xi , Zi , θ) →p E[m(Xi , Zi , θ)].

The data meets the conditions for a law of large numbers to apply, so that we may assume

that the empirical moments converge in probability to their expectation.

Three assumptions: central limit theorem


GMM3: Central Limit Theorem Applies

√n m̄n (θ0 ) = (1/√n) Σ_{i=1}^{n} m(Xi , Zi , θ0 ) →d N [0, Φ].

The empirical moments obey a central limit theorem. This assumes that the moments have a finite asymptotic covariance matrix E[m(Xi , Zi , θ0 )m(Xi , Zi , θ0 )′ ] = Φ.

7.4 Consistency
Recall the useful approximation of the estimator:

θ̂GM M ≈ θ0 − (Ḡn (θ̂GM M )′ Wn Ḡn (θ̄))−1 Ḡn (θ̂GM M )′ Wn m̄n (θ0 ).

Assumption GMM2 implies that

m̄n (θ0 ) = (1/n) Σ_{i=1}^{n} m(Xi , Zi , θ0 ) →p E[m(Xi , Zi , θ0 )] = m̄(θ0 ).

That is, the sample moment equals the population moment in probability. Assumption

GMM1 implies that

m̄(θ0 ) = 0.

Then

m̄n (θ0 ) →p m̄(θ0 ) = 0,

such that

θ̂GMM →p θ0   as n → ∞.

That is, by GMM1 and GMM2 the GMM estimator is consistent.

7.5 Asymptotic normality


Recall the useful approximation of the estimator:

θ̂GM M ≈ θ0 − (Ḡn (θ̂GM M )′ Wn Ḡn (θ̄))−1 Ḡn (θ̂GM M )′ Wn m̄n (θ0 ).

Rewrite to obtain

√n (θ̂GMM − θ0 ) ≈ −( Ḡn (θ̂GMM )′ Wn Ḡn (θ̄) )−1 Ḡn (θ̂GMM )′ Wn √n m̄n (θ0 ).

The right hand side has several parts for which we made assumptions on what happens

when N → ∞. Under the central limit theorem (GMM3)

√n m̄n (θ0 ) →d N [0, Φ],

plim Wn = W,

plim Ḡn (θ̂GMM ) = plim Ḡn (θ̄) = E[ ∂m(Xi , Zi , θ0 )/∂θ0′ ] = ∂ m̄(θ0 )/∂θ0′ = Γ(θ0 ).
With plimWn = W and

plimḠn (θ̂GM M ) = plimḠn (θ̄) = Γ(θ0 )

the expression

√n (θ̂GMM − θ0 ) ≈ −( Ḡn (θ̂GMM )′ Wn Ḡn (θ̄) )−1 Ḡn (θ̂GMM )′ Wn √n m̄n (θ0 )

becomes

√n (θ̂GMM − θ0 ) ≈ −( Γ(θ0 )′ W Γ(θ0 ) )−1 Γ(θ0 )′ W √n m̄n (θ0 ),

from which we get the asymptotic variance. So θ̂GMM is approximately distributed as N [θ0 , V ] with the K×K matrix

V = (1/n) [Γ(θ0 )′ W Γ(θ0 )]−1 [Γ(θ0 )′ W ΦW ′ Γ(θ0 )] [Γ(θ0 )′ W Γ(θ0 )]−1 .

That is, by GMM1, GMM2, and GMM3 the GMM estimator is asymptotically normal.

7.6 Asymptotic efficiency


Which weighting matrix W gives the smallest possible asymptotic variance of the GMM estimator θ̂GMM ?
The variance of the GMM estimator V depends on the choice of W

V = 1/n[Γ(θ0 )′ W Γ(θ0 )]−1 [Γ(θ0 )′ W ΦW ′ Γ(θ0 )][Γ(θ0 )′ W Γ(θ0 )]−1

So let us minimize V to get the optimal weight matrix. Try from GMM3

plim Wn = W = Φ−1   as n → ∞.


VGM M,optimal = 1/n[Γ(θ0 )′ Φ−1 Γ(θ0 )]−1 [Γ(θ0 )′ Φ−1 ΦΦ−1 Γ(θ0 )][Γ(θ0 )′ Φ−1 Γ(θ0 )]−1

which can be simplified to

VGM M,optimal = 1/n[Γ(θ0 )′ Φ−1 Γ(θ0 )]−1


If Φ is small, there is little variation of this specific sample moment around zero and the moment condition is very informative about θ0 . So it is best to assign a high weight to it.


If Γ is large, there is a large penalty from violating the moment condition by evaluating
at θ ̸= θ0 . Then the moment condition is very informative about θ0 . V is inversely related

to Γ.

Estimate the variance in practice
V̂GMM,optimal = (1/n) [ Ḡn (θ̂)′ Φ̂n−1 Ḡn (θ̂) ]−1 ,

with the consistent estimators

Φ̂n = n V̂( m̄n (θ̂) ),

Ḡn (θ̂) = ∂ m̄n (θ̂)/∂ θ̂′ = (1/n) Σ_{i=1}^{n} ∂m(Xi , Zi , θ̂)/∂ θ̂′ .
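A minimal two-step GMM sketch for a linear model with one endogenous regressor and two instruments is given below in Python. The simulated data, the identity first-step weighting matrix, and all names are illustrative assumptions, not part of the text above:

import numpy as np

rng = np.random.default_rng(5)
n = 2000
z1, z2, v = rng.normal(size=(3, n))
x = 0.5 * z1 + 0.5 * z2 + v                      # endogenous regressor
u = 0.5 * v + rng.normal(size=n)                 # error correlated with x
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])             # K = 2 parameters
Z = np.column_stack([np.ones(n), z1, z2])        # L = 3 moment conditions

def gmm_step(W):
    # solve (X'Z W Z'X) b = X'Z W Z'y
    A = X.T @ Z @ W @ Z.T @ X
    c = X.T @ Z @ W @ Z.T @ y
    return np.linalg.solve(A, c)

# step 1: identity weighting matrix
b1 = gmm_step(np.eye(Z.shape[1]))

# step 2: optimal weighting matrix W = Phi^{-1}, Phi = (1/n) sum u_i^2 z_i z_i'
u_hat = y - X @ b1
Phi = (Z * u_hat[:, None]).T @ (Z * u_hat[:, None]) / n
b2 = gmm_step(np.linalg.inv(Phi))

# estimated variance: V = (1/n) [G' Phi^{-1} G]^{-1}, with G = Z'X / n (sign irrelevant)
G = Z.T @ X / n
V = np.linalg.inv(G.T @ np.linalg.inv(Phi) @ G) / n
print(b2, np.sqrt(np.diag(V)))                   # estimates and standard errors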

8 Conclusion
Congratulations! If you made it through this document, you are ready to read some econo-

metrics papers, program and develop new estimators, and analyze statistical properties.

If this caught your interest, check out non-parametric and Bayesian econometrics.

109
References
Angrist, J. D., and J.-S. Pischke (2009): Mostly Harmless Econometrics: An Em-
piricist's Companion . Princeton University Press.

Cameron, A. C., and P. K. Trivedi (2005): Microeconometrics: Methods and Appli-


cations. Cambridge University Press, 3rd edn.

Filoso, V. (2013): Regression Anatomy, Revealed, The Stata Journal, 13(1), 92–106.

Frisch, R., and F. V. Waugh (1933): Partial Time Regressions as Compared with

Individual Trends, Econometrica, 1(4), 387–401.

Greene, W. H. (2011): Econometric Analysis . Prentice Hall, 5th edn.

Hansen, L. P. (1982): Large Sample Properties of Generalized Method of Moments

Estimators, Econometrica, 50(4), 1029–1054.

(2012): Proofs for large sample properties of generalized method of moments

estimators, Journal of Econometrics, 170(2), 325–330, Thirtieth Anniversary of Gen-

eralized Method of Moments.

Hill, R. C., W. E. Griffiths, and G. C. Lim (2010): Principles of Econometrics .

John Wiley & Sons, 4th edn.

Kennedy, P. (2008): A Guide to Econometrics . Blackwell Publishing, 6th edn., In par-

ticular, Chapters 7 and 8.1–8.3.

Lovell, M. C. (2008): A Simple Proof of the FWL Theorem, The Journal of Economic
Education, 39(1), 88–91.

Pishro-Nik, H. (2014): Introduction to Probability, Statistics, and Random Processes .

Kappa Research LLC.

Plackett, R. L. (1972): Studies in the History of Probability and Statistics. XXIX:

The discovery of the method of least squares, Biometrika, 59(2), 239–251.

Rostam-Afschar, D., and R. Jessen (2014):  GRAPH3D: Stata module to draw

colored, scalable, rotatable 3D plots, Statistical Software Components, Boston College

Department of Economics.

Stock, J. H., and M. W. Watson (2012): Introduction to Econometrics . Pearson

Addison-Wesley, 3rd edn.

Verbeek, M. (2012): A Guide to Modern Econometrics . John Wiley & Sons, 3rd edn.

Wooldridge, J. M. (2002): Econometric Analysis of Cross Section and Panel Data .

MIT Press.

(2009): Introductory Econometrics: A Modern Approach. Cengage Learning, 4th edn.

