
Econometricks: Short Guides to Econometrics

Davud Rostam-Afschar†

December 1, 2024

Abstract
Short guides to econometrics illustrate statistical methods and demonstrate how
they work in theory and practice, with many examples.

Keywords: Econometrics, Ordinary Least Squares, Maximum Likelihood, Generalized
Method of Moments, Probability Theory, Distribution Theory, Frisch-Waugh-Lovell,
Monte Carlo Simulation
JEL classification: A20, A23, C01, C10, C12, C13

* These guides were developed based on lectures delivered by Davud Rostam-Afschar at the University
of Mannheim. I am grateful for the valuable input provided by numerous cohorts of PhD students at
the Graduate School of Economic and Social Sciences, University of Mannheim. I thank the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation) for financial support through CRC TRR
266 Accounting for Transparency (Davud Rostam-Afschar, Project-ID 403041268). Replication files and
updates are available here.
† University of Mannheim, 68131 Mannheim, Germany; GLO; IZA; NeST (e-mail: rostam-afschar@uni-mannheim.de).
Contents
1 Review of Probability Theory 4
  1.1 Introduction 4
  1.2 Probability fundamentals 4
  1.3 Mean and variance 7
  1.4 Moments of a random variable 10
  1.5 Useful rules 15
2 Specific Distributions 17
  2.1 Normal distribution 20
  2.2 Method of transformations 21
  2.3 The χ² distribution 24
  2.4 The F-distribution 26
  2.5 The Student t-distribution 28
  2.6 The lognormal distribution 30
  2.7 The gamma distribution 32
  2.8 The beta distribution 34
  2.9 The logistic distribution 36
  2.10 The Wishart distribution 37
3 Review of Distribution Theory 38
  3.1 Joint and marginal bivariate distributions 38
  3.2 The joint density function 39
  3.3 The joint cumulative density function 40
  3.4 The marginal probability density 42
  3.5 Covariance and correlation 44
  3.6 The conditional density function 45
  3.7 Conditional mean aka regression 46
  3.8 The bivariate normal 49
  3.9 Useful rules 50
4 The Least Squares Estimator 52
  4.1 What is the Relationship between Two Variables? 52
  4.2 The Econometric Model 53
  4.3 Estimation with OLS 60
  4.4 Properties of the OLS Estimator in the Small and in the Large 66
  4.5 Politically Connected Firms: Causality or Correlation? 79
5 Simplifying Linear Regressions using Frisch-Waugh-Lovell 81
  5.1 Frisch-Waugh-Lovell theorem in equation algebra 81
  5.2 Projection and residual maker matrices 83
6 The Maximum Likelihood Estimator 90
  6.1 From Probability to Likelihood 90
  6.2 The Econometric Model 92
  6.3 Properties of the Maximum Likelihood Estimator 96
7 The Generalized Method of Moments 101
  7.1 How to choose from too many restrictions? 101
  7.2 Get the sampling error (at least approximately) 101
  7.3 The econometric model 104
  7.4 Consistency 104
  7.5 Asymptotic normality 105
  7.6 Asymptotic efficiency 106
8 Conclusion 109
References 110
1 Review of Probability Theory
1.1 Introduction
This guide takes a look under the hood of widely used methods in econometrics and
beyond. It focuses on Ordinary Least Squares, Maximum Likelihood, and the Generalized
Method of Moments, and shows when and why these methods work with simple examples.
The guide also provides an overview of the most important fundamentals of Probability
Theory and Distribution Theory on which these methods are based, and shows how to
analyze them with the Frisch-Waugh-Lovell decomposition and with Monte Carlo Simulation.

1.2 Probability fundamentals


Discrete and continuous random variables

Discrete Random Variable
A random variable X is discrete if the set of outcomes x is either finite or countably
infinite.

Continuous Random Variable
The random variable X is continuous if the set of outcomes x is infinitely divisible and,
hence, not countable.
Discrete probabilities
For values x of a discrete random variable X, the probability mass function (pmf) is

    f(x) = Prob(X = x).

The axioms of probability require

    0 ≤ Prob(X = x) ≤ 1,
    Σ_x f(x) = 1.

Discrete cumulative probabilities
For values x of a discrete random variable X, the cumulative distribution function is

    F(x) = Σ_{X ≤ x} f(x) = Prob(X ≤ x),

where

    f(x_i) = F(x_i) − F(x_{i−1}).

Example

Roll of a six-sided die

    x    f(x)         F(X ≤ x)
    1    f(1) = 1/6   F(X ≤ 1) = 1/6
    2    f(2) = 1/6   F(X ≤ 2) = 2/6
    3    f(3) = 1/6   F(X ≤ 3) = 3/6
    4    f(4) = 1/6   F(X ≤ 4) = 4/6
    5    f(5) = 1/6   F(X ≤ 5) = 5/6
    6    f(6) = 1/6   F(X ≤ 6) = 6/6

What's the probability that you roll a 5 or higher?

    Prob(X ≥ 5) = 1 − F(X ≤ 4) = 1 − 2/3 = 1/3.
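The die example can be checked numerically; the following minimal Python sketch (using numpy, not part of the original guide) tabulates the pmf and cdf and recovers Prob(X ≥ 5) = 1/3:

    import numpy as np

    x = np.arange(1, 7)
    pmf = np.full(6, 1 / 6)        # f(x) = 1/6 for x = 1, ..., 6
    cdf = np.cumsum(pmf)           # F(x) = Prob(X <= x)

    print(dict(zip(x.tolist(), cdf.round(3))))
    print(1 - cdf[3])              # Prob(X >= 5) = 1 - F(4) = 1/3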
Continuous probabilities
For values x of a continuous random variable X, the probability of any single point is
zero; probabilities are areas under the probability density function (pdf) f(x) ≥ 0. For
the range from a to b,

    Prob(a ≤ x ≤ b) = Prob(a < x < b) = ∫_a^b f(x) dx ≥ 0.

The axioms of probability require

    ∫_{−∞}^{+∞} f(x) dx = 1,

and f(x) = 0 outside the range of x.

The cumulative distribution function (cdf) is

    F(x) = ∫_{−∞}^x f(t) dt,    f(x) = dF(x)/dx.

Cumulative distribution function
For continuous and discrete variables, F(x) satisfies

Properties of the cdf
• 0 ≤ F(x) ≤ 1
• If x > y, then F(x) ≥ F(y)
• F(+∞) = 1
• F(−∞) = 0

and

    Prob(a < x ≤ b) = F(b) − F(a).

Symmetric distributions
For symmetric distributions

    f(µ − x) = f(µ + x)

and

    1 − F(x) = F(−x).
1.3 Mean and variance
Mean of a random variable (Discrete)
The mean, or expected value, of a discrete random variable is

    µ = E[x] = Σ_x x f(x).

Example

Roll of a six-sided die

    x        f(x) = 1/n    F(X ≤ x) = (x − a + 1)/n
    a = 1    f(1) = 1/6    F(X ≤ 1) = 1/6
    2        f(2) = 1/6    F(X ≤ 2) = 2/6
    3        f(3) = 1/6    F(X ≤ 3) = 3/6
    4        f(4) = 1/6    F(X ≤ 4) = 4/6
    5        f(5) = 1/6    F(X ≤ 5) = 5/6
    b = 6    f(6) = 1/6    F(X ≤ 6) = 6/6

What's the expected value from rolling the die?

    E[x] = 1/6 + 2/6 + 3/6 + 4/6 + 5/6 + 6/6 = 3.5.

This is the mean (and the median) of a discrete uniform distribution, (n + 1)/2 = (a + b)/2 = 3.5.
Mean of a random variable (Continuous)
For a continuous random variable x, the expected value is

    E[x] = ∫_x x f(x) dx.

Example

The continuous uniform distribution has density 1/(b − a) for a ≤ x ≤ b and 0 otherwise.

    E[x] = ∫_a^b x/(b − a) dx = 1/(b − a) ∫_a^b x dx.

The antiderivative of x is x²/2, so

    E[x] = 1/(b − a) (b²/2 − a²/2) = (b − a)(b + a)/(2(b − a)) = (a + b)/2.

With a = 1 and b = 6, the mean (and the median) is again (a + b)/2 = 3.5.

For a function g(x) of x, the expected value is E[g(x)] = Σ_x g(x) Prob(X = x) or
E[g(x)] = ∫_x g(x) f(x) dx. If g(x) = a + bx for constants a and b, then E[a + bx] = a + bE[x].

Variance of a random variable
The variance of a random variable, σ² > 0, is

    σ² = Var[x] = E[(x − µ)²] = Σ_x (x − µ)² f(x)       if x is discrete,
                              = ∫_x (x − µ)² f(x) dx    if x is continuous.

Example

Roll of a six-sided die. What's the variance V[x] from rolling the die?
The probability of observing x, Pr(X = x) = 1/n, is discretely uniformly distributed.

    E[x] = (n + 1)/2;    (E[x])² = (n + 1)²/4.

    E[x²] = Σ_x x² Pr(X = x) = (1/n) Σ_{x=1}^n x² = (n + 1)(2n + 1)/6,

using the formula for the sum of squares.

    V[x] = E[x²] − (E[x])² = (n + 1)(2n + 1)/6 − (n + 1)²/4 = (n² − 1)/12 = (6² − 1)/12 ≈ 2.92.

Chebychev inequality
For any random variable x and any positive constant k > 1,

    Pr(µ − kσ < x < µ + kσ) ≥ 1 − 1/k².

The share outside k standard deviations is at most 1/k².

If x is normally distributed, the share outside k standard deviations is exactly
1 − (2Φ(k) − 1).

For normally distributed x, 95% of the observations are within 1.96 standard deviations.
If x is not normal, up to 4.47 standard deviations may be needed to cover 95% of the
observations.

[Figure: Normal coverage]
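As a numerical illustration (a Python/scipy sketch, not part of the original text), the Chebychev bound can be compared with the exact normal coverage 2Φ(k) − 1:

    from scipy.stats import norm

    for k in (1.96, 2.0, 3.0, 4.47):
        chebyshev = 1 - 1 / k**2        # lower bound on Pr(|x - mu| < k sigma), any distribution
        normal = 2 * norm.cdf(k) - 1    # exact coverage if x is normal
        print(f"k = {k:4.2f}: Chebychev >= {chebyshev:.3f}, normal = {normal:.3f}")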

1.4 Moments of a random variable

Central moments of a random variable
The central moments are

    µ_r = E[(x − µ)^r].

Example

Moments: Two measures often used to describe a probability distribution are

• expectation = E[(x − µ)¹]
• variance = E[(x − µ)²]
• skewness = E[(x − µ)³]
• kurtosis = E[(x − µ)⁴]

The skewness is zero for symmetric distributions.
Higher order moments

Moment generating function
For the random variable X with probability density function f(x), if the function

    M(t) = E[e^{tx}]

exists, then it is the moment generating function (MGF).

• Often a simpler alternative to working directly with probability density functions or
  cumulative distribution functions
• Not all random variables have moment-generating functions

The nth moment is the nth derivative of the moment-generating function, evaluated at
t = 0.

Example

The MGF for the standard normal distribution with µ = 0, σ = 1 is

    M_z(t) = e^{µt + σ²t²/2} = e^{t²/2}.

If x and y are independent, then the MGF of x + y is M_x(t)M_y(t).

For x ∼ N(µ, σ²) with moment generating function M_x(t) = exp(µt + ½σ²t²), the first
derivative of the MGF is

    M_x′(t) = (µ + σ²t) exp(µt + ½σ²t²),

so that E[x] = M_x′(0) = µ.

Example

By the chain rule,

    M_x′(t) = d exp(µt + ½σ²t²)/dt
            = [d(µt + ½σ²t²)/dt] [d exp(µt + ½σ²t²)/d(µt + ½σ²t²)]
            = (µ + σ²t) exp(µt + ½σ²t²).

If x ∼ N(0, 1),

• the skewness is E[(x − µ)³] = 0 and
• the kurtosis is E[(x − µ)⁴] = 3.

Example

Evaluating the successive derivatives of M_x(t) at t = 0 (with µ = 0 and σ = 1 the raw
moments coincide with the central moments):

    M_x′(t)   = (µ + σ²t) exp(µt + ½σ²t²)                           with µ = 0, σ = 1, t = 0:  E[x] = µ = 0
    M_x″(t)   = [σ² + (µ + σ²t)²] exp(µt + ½σ²t²)                    with µ = 0, σ = 1, t = 0:  E[(x − µ)²] = σ² = 1
    M_x‴(t)   = [3σ²(µ + σ²t) + (µ + σ²t)³] exp(µt + ½σ²t²)           with µ = 0, σ = 1, t = 0:  E[(x − µ)³] = 0
    M_x⁽⁴⁾(t) = [3σ⁴ + 6σ²(µ + σ²t)² + (µ + σ²t)⁴] exp(µt + ½σ²t²)    with µ = 0, σ = 1, t = 0:  E[(x − µ)⁴] = 3.
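The derivatives above can be checked symbolically; a short sympy sketch (an illustration, not part of the original text) differentiates the MGF and evaluates at t = 0:

    import sympy as sp

    t = sp.symbols("t", real=True)
    mu = sp.symbols("mu", real=True)
    sigma = sp.symbols("sigma", positive=True)
    M = sp.exp(mu * t + sigma**2 * t**2 / 2)            # MGF of N(mu, sigma^2)

    for r in range(1, 5):
        moment = sp.diff(M, t, r).subs(t, 0)            # rth raw moment E[x^r]
        print(r, sp.simplify(moment.subs({mu: 0, sigma: 1})))   # standard normal: 0, 1, 0, 3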

Approximating mean and variance

For any two functions g₁(x) and g₂(x),

    E[g₁(x) + g₂(x)] = E[g₁(x)] + E[g₂(x)].

For the general case of a possibly nonlinear g(x),

    E[g(x)] = ∫_x g(x) f(x) dx,

and

    Var[g(x)] = ∫_x (g(x) − E[g(x)])² f(x) dx.

E[g(x)] and Var[g(x)] can be approximated by a first order linear Taylor series:

First order linear Taylor series

    g(x) ≈ [g(x₀) − g′(x₀)x₀] + g′(x₀)x.                              (1)

Taylor approximation of order 1

A natural choice for the expansion point is x₀ = µ = E(x). Inserting this value in Eq. (1)
gives

    g(x) ≈ [g(µ) − g′(µ)µ] + g′(µ)x,

so that

    E[g(x)] ≈ g(µ),

and

    Var[g(x)] ≈ [g′(µ)]² Var[x].
Example

Isoelastic utility. c_bad = 10.00 Euro; c_good = 100.00 Euro; probability of the good
outcome 50%.

    µ = E[c] = 1/2 × c_bad + 1/2 × c_good = 55.00 Euro

    u(c) = c^{1/2}

    u(µ) = 7.42 approximates E[u(c)] = 1/2 × 10^{1/2} + 1/2 × 100^{1/2} = 6.58.

Example

Isoelastic utility.
c_bad = 10.00 Euro; c_good = 100.00 Euro; probability of the good outcome 50%; µ = 55.00 Euro

    u(c) = ln(c)

    u(µ) = 4.01 approximates E[u(c)] = 1/2 × ln(10) + 1/2 × ln(100) = 3.45.

Jensen's inequality:

    E[g(x)] ≤ g(E[x]) if g″(x) < 0.

The Taylor approximation of the variance (with the probability weights 1/2) is

    V[u(c)] ≈ (1/55)² × [1/2 × (10 − 55)² + 1/2 × (100 − 55)²] = 0.67,

while the exact variance is

    V[u(c)] = 1/2 × (ln(10) − E[u(c)])² + 1/2 × (ln(100) − E[u(c)])² = 1.32.
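The approximation error can be made concrete with a few lines of Python (numpy; a sketch under the example's assumption of two equally likely outcomes):

    import numpy as np

    c = np.array([10.0, 100.0])
    p = np.array([0.5, 0.5])
    mu = p @ c                                            # E[c] = 55

    for u, du in [(np.sqrt, lambda x: 0.5 / np.sqrt(x)), (np.log, lambda x: 1 / x)]:
        exact_mean = p @ u(c)                             # E[u(c)]
        exact_var = p @ (u(c) - exact_mean) ** 2          # Var[u(c)]
        approx_mean = u(mu)                               # g(mu)
        approx_var = du(mu) ** 2 * (p @ (c - mu) ** 2)    # [g'(mu)]^2 Var[c]
        print(u.__name__, round(exact_mean, 2), round(approx_mean, 2),
              round(exact_var, 2), round(approx_var, 2))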

1.5 Useful rules

• Var[x] = E[x²] − µ²
• E[x²] = σ² + µ²
• If a and b are constants, Var[a + bx] = b² Var[x]
• Var[a] = 0
• If g(x) = a + bx and a and b are constants, E[a + bx] = a + bE[x]
• Coverage: Pr(|X − µ| ≥ kσ) ≤ 1/k²
• Skewness = E[(x − µ)³]
• Kurtosis = E[(x − µ)⁴]
• For symmetric distributions f(µ − x) = f(µ + x); 1 − F(x) = F(−x)
• E[g(x)] ≈ g(µ)
• Var[g(x)] ≈ [g′(µ)]² Var[x]
2 Specific Distributions

Discrete distributions
Bernoulli distribution
The Bernoulli distribution for a single binomial outcome (trial) is

    Prob(x = 1) = p,
    Prob(x = 0) = 1 − p,

where 0 ≤ p ≤ 1 is the probability of success.

• E[x] = p and
• V[x] = E[x²] − E[x]² = p − p² = p(1 − p).

The distribution for x successes in n trials is the binomial distribution,

    Prob(X = x) = n!/[(n − x)! x!] p^x (1 − p)^{n−x},    x = 0, 1, . . . , n.

The mean and variance of x are

• E[x] = np and
• V[x] = np(1 − p).

Example of a binomial [n = 15, p = 0.5] distribution:
[Figure: binomial(15, 0.5) pmf]
Poisson distribution
The limiting form of the binomial distribution, as n → ∞ with np = λ held fixed, is the
Poisson distribution,

    Prob(X = x) = e^{−λ} λ^x / x!.

The mean and variance of x are

• E[x] = λ and
• V[x] = λ.

Example of a Poisson [3] distribution:
[Figure: Poisson(3) pmf]
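To see the limit numerically, one can compare binomial and Poisson pmfs for large n with p = λ/n; a brief scipy sketch (not part of the original guide):

    from scipy.stats import binom, poisson

    lam, n = 3, 1000
    p = lam / n
    for x in range(6):
        print(x, round(binom.pmf(x, n, p), 4), round(poisson.pmf(x, lam), 4))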
2.1 Normal distribution
The normal distribution
A random variable x ∼ N[µ, σ²] is distributed according to the normal distribution with
mean µ and standard deviation σ, with density

    f(x|µ, σ) = 1/(σ√(2π)) e^{−½((x−µ)/σ)²}.

For the standard normal, the density is denoted ϕ(x) and the cumulative distribution
function is denoted Φ(x). Example of a standard normal, x ∼ N[0, 1], and a normal with
mean 0.5 and standard deviation 1.3:
[Figure: standard normal and N(0.5, 1.3²) densities]
2.2 Method of transformations
Transformation of random variables
A continuous variable x may be transformed to a discrete variable y. Calculate the mean
of variable x in the respective interval:

    Prob(Y = µ₁) = P(−∞ < X ≤ a),
    Prob(Y = µ₂) = P(a < X ≤ b),
    Prob(Y = µ₃) = P(b < X ≤ ∞).

Method of transformations
If x is a continuous random variable with pdf f_x(x) and if y = g(x) is a continuous
monotonic function of x, then the density of y is obtained by

    Prob(y ≤ b) = ∫_{−∞}^b f_x(g⁻¹(y)) |dg⁻¹(y)/dy| dy.

With f_y(y) = f_x(g⁻¹(y)) |dg⁻¹(y)/dy|, this equation can be written as

    Prob(y ≤ b) = ∫_{−∞}^b f_y(y) dy.

Example

If x ∼ N[µ, σ²], then the distribution of y = g(x) = (x − µ)/σ is found as follows:

    g⁻¹(y) = x = σy + µ,
    dg⁻¹(y)/dy = dx/dy = σ.

Therefore, with f_x(g⁻¹(y)) = 1/(σ√(2π)) e^{−½[(g⁻¹(y)−µ)²/σ²]},

    f_y(y) = f_x(g⁻¹(y)) |dg⁻¹(y)/dy| = 1/(σ√(2π)) e^{−[(σy+µ)−µ]²/(2σ²)} |σ| = 1/√(2π) e^{−y²/2}.
Properties of the normal distribution
• Preservation under linear transformation:
  If x ∼ N[µ, σ²], then (a + bx) ∼ N[a + bµ, b²σ²].
• Convenient transformation a = −µ/σ and b = 1/σ:
  The resulting variable z = (x − µ)/σ has the standard normal distribution with density

      ϕ(z) = 1/√(2π) e^{−z²/2}.

• If x ∼ N[µ, σ²], then f(x) = (1/σ) ϕ[(x − µ)/σ].
• Prob(a ≤ x ≤ b) = Prob((a − µ)/σ ≤ (x − µ)/σ ≤ (b − µ)/σ).
• ϕ(−z) = ϕ(z) and Φ(−x) = 1 − Φ(x) because of symmetry.
• If z ∼ N[0, 1], then z² ∼ χ²[1] with pdf 1/√(2πy) e^{−y/2}.

Example

    f_x(x) = 1/√(2π) e^{−x²/2}
    y = g(x) = x²
    g⁻¹(y) = x = ±√y, so there are two solutions g₁, g₂.
    dg⁻¹(y)/dy = dx/dy = ±½ y^{−1/2}
    f_y(y) = f_x(g₁⁻¹(y)) |g₁⁻¹′(y)| + f_x(g₂⁻¹(y)) |g₂⁻¹′(y)|
    f_y(y) = f_x(√y) |½ y^{−1/2}| + f_x(−√y) |−½ y^{−1/2}|
    f_y(y) = 1/(2√(2πy)) e^{−y/2} + 1/(2√(2πy)) e^{−y/2} = 1/√(2πy) e^{−y/2}.

Distributions derived from the normal
• If z ∼ N[0, 1], then z² ∼ χ²[1] with E[z²] = 1 and V[z²] = 2.
• If x₁, ..., x_n are n independent χ²[1] variables, then

      Σ_{i=1}^n x_i ∼ χ²[n].
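A quick simulation (Python/scipy sketch, not from the original text) confirms that z² behaves like a χ²[1] variable:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    z = rng.standard_normal(1_000_000)
    y = z**2

    print(round(y.mean(), 3), round(y.var(), 3))                           # close to E[z^2] = 1, V[z^2] = 2
    print(round((y > 3.84).mean(), 4), round(1 - chi2.cdf(3.84, df=1), 4)) # both about 0.05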
Normal

    Parameters    µ ∈ R, σ ∈ R>0
    Support       x ∈ R
    PDF           ϕ((x − µ)/σ) = 1/(σ√(2π)) e^{−½((x−µ)/σ)²}
    CDF           Φ((x − µ)/σ) = ½[1 + erf((x − µ)/(σ√2))]
    Mean          µ
    Median        µ
    Mode          µ
    Variance      σ²
    Skewness      0
    Ex. Kurtosis  0
    MGF           exp(µt + σ²t²/2)

• PDF denotes probability density function, CDF cumulative distribution function,
  MGF moment-generating function.
• µ is the mean (location); σ and s are scale parameters.
• Excess Kurtosis is defined as Kurtosis minus 3.
• The Gauss error function is erf z = (2/√π) ∫_0^z e^{−t²} dt.
• If z_i, i = 1, ..., n, are independent N[0, 1] variables, then

      Σ_{i=1}^n z_i² ∼ χ²[n].

• If z_i, i = 1, ..., n, are independent N[0, σ²] variables, then

      Σ_{i=1}^n (z_i/σ)² ∼ χ²[n].

• If x₁ and x₂ are independent χ² variables with n₁ and n₂ degrees of freedom, then

      x₁ + x₂ ∼ χ²[n₁ + n₂].
2.3 The χ² distribution
The χ² distribution
A random variable x ∼ χ²[n] is distributed according to the chi-squared distribution
with n degrees of freedom,

    f(x|n) = x^{n/2−1} e^{−x/2} / [2^{n/2} Γ(n/2)],

where Γ is the gamma function (see the gamma distribution below).

• E[x] = n
• V[x] = 2n

Example of a χ²[3] distribution:
[Figure: χ²(3) density]

Approximating a χ²
For degrees of freedom greater than 30, the distribution of the chi-squared variable x is
approximated by

    z = (2x)^{1/2} − (2n − 1)^{1/2},

which is approximately standard normally distributed. Thus,

    Prob(χ²[n] ≤ a) ≈ Φ[(2a)^{1/2} − (2n − 1)^{1/2}].
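The quality of this approximation is easy to check; a small Python/scipy sketch (the choices of n and a here are illustrative):

    import numpy as np
    from scipy.stats import chi2, norm

    n = 40
    for a in (30, 40, 50, 60):
        exact = chi2.cdf(a, df=n)
        approx = norm.cdf(np.sqrt(2 * a) - np.sqrt(2 * n - 1))
        print(a, round(exact, 4), round(approx, 4))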
χ²

    Parameters    n ∈ N>0
    Support       x ∈ R>0 if n = 1, else x ∈ R≥0
    PDF           1/[2^{n/2} Γ(n/2)] x^{n/2−1} e^{−x/2}
    CDF           γ(n/2, x/2) / Γ(n/2)
    Mean          n
    Median        No simple closed form
    Mode          max(n − 2, 0)
    Variance      2n
    Skewness      √(8/n)
    Ex. Kurtosis  12/n
    MGF           (1 − 2t)^{−n/2} for t < 1/2

• n, n₁, n₂ are known as degrees of freedom.
• Regularized incomplete beta function I(x, a, b) = B(x, a, b)/B(a, b) with
  B(x, a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt.
2.4 The F-distribution
The F-distribution
If x₁ and x₂ are two independent chi-squared variables with degrees of freedom parameters
n₁ and n₂, respectively, then the ratio

    F[n₁, n₂] = (x₁/n₁) / (x₂/n₂)

has the F distribution with n₁ and n₂ degrees of freedom.

F

    Parameters    n₁, n₂ ∈ N>0
    Support       x ∈ R>0 if n₁ = 1, else x ∈ R≥0
    PDF           [n₁^{n₁/2} n₂^{n₂/2} Γ((n₁+n₂)/2) / (Γ(n₁/2)Γ(n₂/2))] · x^{n₁/2−1} / (n₁x + n₂)^{(n₁+n₂)/2}
    CDF           I(n₁x/(n₁x + n₂), n₁/2, n₂/2)
    Mean          n₂/(n₂ − 2) for n₂ > 2
    Median        No simple closed form
    Mode          [(n₁ − 2)/n₁] · [n₂/(n₂ + 2)] for n₁ > 2
    Variance      2n₂²(n₁ + n₂ − 2) / [n₁(n₂ − 2)²(n₂ − 4)] for n₂ > 4
    Skewness      (2n₁ + n₂ − 2)√(8(n₂ − 4)) / [(n₂ − 6)√(n₁(n₁ + n₂ − 2))] for n₂ > 6
    Ex. Kurtosis  12[n₁(5n₂ − 22)(n₁ + n₂ − 2) + (n₂ − 4)(n₂ − 2)²] / [n₁(n₂ − 6)(n₂ − 8)(n₁ + n₂ − 2)] for n₂ > 8
    MGF           does not exist

• n, n₁, n₂ are known as degrees of freedom.
• Regularized incomplete beta function I(x, a, b) = B(x, a, b)/B(a, b) with
  B(x, a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt.
2.5 The Student t-distribution
The Student t-distribution
If x₁ is an N[0, 1] variable, often denoted by z, and x₂ is χ²[n₂] and is independent of x₁,
then the ratio

    t[n₂] = x₁ / √(x₂/n₂)

has the t distribution with n₂ degrees of freedom.

Example for the t distributions with 3 and 10 degrees of freedom together with the
standard normal distribution:
[Figure: t(3), t(10), and standard normal densities]

Comparing (2.4) with n₁ = 1 and (2.5): if t ∼ t[n], then t² ∼ F[1, n].

The t[30] approximates the standard normal.

t

    Parameters    n ∈ R>0
    Support       x ∈ R
    PDF           Γ((n+1)/2) / [√(πn) Γ(n/2)] · (1 + x²/n)^{−(n+1)/2}
    CDF           ½ + x Γ((n+1)/2) · ₂F₁(½, (n+1)/2; 3/2; −x²/n) / [√(πn) Γ(n/2)]
    Mean          0 for n > 1
    Median        0
    Mode          0
    Variance      n/(n − 2) for n > 2, ∞ for 1 < n ≤ 2
    Skewness      0 for n > 3
    Ex. Kurtosis  6/(n − 4) for n > 4, ∞ for 2 < n ≤ 4
    MGF           does not exist

• n denotes degrees of freedom.
• ₂F₁(·, ·; ·; ·) is a particular instance of the hypergeometric function.
2.6 The lognormal distribution
The lognormal distribution
The lognormal distribution, denoted LN[µ, σ²], has been particularly useful in modeling
size distributions. Its density is

    f(x) = 1/(√(2π) σx) e^{−½[(ln x − µ)/σ]²},    x > 0.

A lognormal variable x has

• E[x] = e^{µ+σ²/2}, and
• Var[x] = e^{2µ+σ²}(e^{σ²} − 1).

If y ∼ LN[µ, σ²], then ln y ∼ N[µ, σ²].

Log-normal

    Parameters    µ ∈ R, σ ∈ R>0
    Support       x ∈ R>0
    PDF           1/(xσ√(2π)) exp(−(ln x − µ)²/(2σ²))
    CDF           ½[1 + erf((ln x − µ)/(σ√2))] = Φ((ln(x) − µ)/σ)
    Mean          exp(µ + σ²/2)
    Median        exp(µ)
    Mode          exp(µ − σ²)
    Variance      [exp(σ²) − 1] exp(2µ + σ²)
    Skewness      [exp(σ²) + 2] √(exp(σ²) − 1)
    Ex. Kurtosis  exp(4σ²) + 2 exp(3σ²) + 3 exp(2σ²) − 6
    MGF           not determined by its moments
2.7 The gamma distribution
The gamma distribution
The general form of the gamma distribution is

    f(x) = β^α/Γ(α) e^{−βx} x^{α−1},    x ≥ 0, β = 1/θ > 0, α = k > 0.

Many familiar distributions are special cases, including the exponential distribution
(α = 1) and the chi-squared (β = 1/2, α = n/2). The Erlang distribution results if α is a
positive integer. The mean is α/β, and the variance is α/β². The inverse gamma
distribution is the distribution of 1/x, where x has the gamma distribution.

Γ (shape k, scale θ)

    Parameters    k > 0 ∈ R (shape), θ > 0 ∈ R (scale)
    Support       x ∈ (0, ∞)
    PDF           f(x) = 1/(Γ(k)θ^k) x^{k−1} e^{−x/θ}
    CDF           F(x) = γ(k, x/θ)/Γ(k)
    Mean          kθ
    Median        No simple closed form
    Mode          (k − 1)θ for k ≥ 1, 0 for k < 1
    Variance      kθ²
    Skewness      2/√k
    Ex. Kurtosis  6/k
    MGF           (1 − θt)^{−k} for t < 1/θ

Γ (shape α, rate β)

    Parameters    α > 0 ∈ R (shape), β > 0 ∈ R (rate)
    Support       x ∈ (0, ∞)
    PDF           f(x) = β^α/Γ(α) x^{α−1} e^{−βx}
    CDF           F(x) = γ(α, βx)/Γ(α)
    Mean          α/β
    Median        No simple closed form
    Mode          (α − 1)/β for α ≥ 1, 0 for α < 1
    Variance      α/β²
    Skewness      2/√α
    Ex. Kurtosis  6/α
    MGF           (1 − t/β)^{−α} for t < β

• Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt, ℜ(z) > 0, for complex numbers with a positive real part.
• The lower incomplete gamma function is γ(s, x) = ∫_0^x t^{s−1} e^{−t} dt, for complex
  numbers with a positive real part.
2.8 The beta distribution
The beta distribution
For a variable constrained between 0 and c > 0, the beta distribution has proved useful.
Its density is

    f(x) = Γ(α + β)/(Γ(α)Γ(β)) (x/c)^{α−1} (1 − x/c)^{β−1} (1/c),    0 ≤ x ≤ c.

It is symmetric if α = β, asymmetric otherwise. The mean is cα/(α + β), and the variance
is c²αβ/[(α + β + 1)(α + β)²].

B (standard case, c = 1)

    Parameters    α, β ∈ R>0
    Support       x ∈ [0, 1] or x ∈ (0, 1)
    PDF           x^{α−1}(1 − x)^{β−1} / B(α, β)
    CDF           I(x, α, β)
    Mean          α/(α + β)
    Median        I^{[−1]}(1/2; α, β) ≈ (α − 1/3)/(α + β − 2/3) for α, β > 1
    Mode          *
    Variance      αβ / [(α + β)²(α + β + 1)]
    Skewness      2(β − α)√(α + β + 1) / [(α + β + 2)√(αβ)]
    Ex. Kurtosis  6[(α − β)²(α + β + 1) − αβ(α + β + 2)] / [αβ(α + β + 2)(α + β + 3)]
    MGF           1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) t^k/k!

• B(α, β) = Γ(α)Γ(β)/Γ(α + β) and Γ is the Gamma function.
• Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt, ℜ(z) > 0, for complex numbers with a positive real part.
• Regularized incomplete beta function I(x, a, b) = B(x, a, b)/B(a, b) with
  B(x, a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt.
• * Mode: (α − 1)/(α + β − 2) for α, β > 1; any value in (0, 1) for α, β = 1; {0, 1}
  (bimodal) for α, β < 1; 0 for α ≤ 1, β > 1; 1 for α > 1, β ≤ 1.
2.9 The logistic distribution
The logistic distribution
The logistic distribution is an alternative if the normal cannot model the mass in the
tails; the cdf for a logistic random variable with µ = 0, s = 1 is

    F(x) = Λ(x) = 1/(1 + e^{−x}).

The density is f(x) = Λ(x)[1 − Λ(x)]. The mean and variance of this random variable are
zero and σ² = π²/3.

Logistic

    Parameters    µ ∈ R, s ∈ R>0
    Support       x ∈ R
    PDF           λ((x − µ)/s) = e^{−(x−µ)/s} / [s(1 + e^{−(x−µ)/s})²]
    CDF           Λ((x − µ)/s) = 1/(1 + e^{−(x−µ)/s})
    Mean          µ
    Median        µ
    Mode          µ
    Variance      s²π²/3
    Skewness      0
    Ex. Kurtosis  6/5
    MGF           e^{µt} B(1 − st, 1 + st) for t ∈ (−1/s, 1/s)

2.10 The Wishart distribution

The Wishart distribution
The Wishart distribution describes the distribution of a random matrix obtained as

    W = Σ_{i=1}^n (x_i − µ)(x_i − µ)′,

where x_i is the ith of n K-element random vectors from the multivariate normal
distribution with mean vector µ and covariance matrix Σ. The density of the Wishart
random matrix is

    f(W) = exp[−½ trace(Σ⁻¹W)] |W|^{(n−K−1)/2} / [2^{nK/2} |Σ|^{n/2} π^{K(K−1)/4} ∏_{j=1}^K Γ((n + 1 − j)/2)].

The mean matrix is nΣ. For the individual pairs of elements in W,

    Cov[w_ij, w_rs] = n(σ_ir σ_js + σ_is σ_jr).

The Wishart distribution is a multivariate extension of the χ² distribution. If W ∼ W(n, σ²)
with K = 1, then W/σ² ∼ χ²[n].
3 Review of Distribution Theory
3.1 Joint and marginal bivariate distributions
Bivariate distributions
For observations of two discrete variables y ∈ {1, 2} and x ∈ {1, 2, 3}, we can calculate

• the frequencies n_{x,y},

    freq. n_{x,y}    y = 1   y = 2   f(x) = n_x/N
    x = 1            1       2       3/10
    x = 2            1       2       3/10
    x = 3            0       4       4/10
    f(y) = n_y/N     2/10    8/10    1

• conditional distributions f(y|x) and f(x|y),
• joint distributions f(x, y), and
• marginal distributions f_y(y) and f_x(x).

    cond. distr. f(y|x)          y = 1   y = 2   Σ_y
    f(y|x = 1)                   1/3     2/3     1
    f(y|x = 2)                   1/3     2/3     1
    f(y|x = 3)                   0       1       1
    f(y|x = 1, x = 2, x = 3)     1/5     4/5     1

    cond. distr. f(x|y)    f(x|y = 1)   f(x|y = 2)   f(x|y = 1, y = 2)
    x = 1                  1/2          1/4          3/10
    x = 2                  1/2          1/4          3/10
    x = 3                  0            1/2          4/10
    Σ_x                    1            1            1

    joint distr. f(x, y)   f(x, y = 1)  f(x, y = 2)  f_x(x)
    f(x = 1, y)            1/10         2/10         3/10
    f(x = 2, y)            1/10         2/10         3/10
    f(x = 3, y)            0            4/10         4/10
    marginal pr. f_y(y)    2/10         8/10         1
3.2 The joint density function
The joint density function
Two random variables X and Y have a joint density function f(x, y) such that

• if x and y are discrete,

    Prob(a ≤ x ≤ b, c ≤ y ≤ d) = Σ_{a≤x≤b} Σ_{c≤y≤d} f(x, y),

• if x and y are continuous,

    Prob(a ≤ x ≤ b, c ≤ y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx.

Example

With a = 1, b = 2, c = 2, d = 2 and the following f(x, y):

    joint distr. f(x, y)   f(x, y = 1)   f(x, y = 2)
    f(x = 1, y)            1/10          2/10
    f(x = 2, y)            1/10          2/10
    f(x = 3, y)            0             4/10

    Prob(1 ≤ x ≤ 2, 2 ≤ y ≤ 2) = f(x = 1, y = 2) + f(x = 2, y = 2) = 2/5.

For values x and y of two discrete random variables X and Y, the probability distribution is

    f(x, y) = Prob(X = x, Y = y).

The axioms of probability require

    f(x, y) ≥ 0,
    Σ_x Σ_y f(x, y) = 1.
If X and Y are continuous,

    ∫_x ∫_y f(x, y) dy dx = 1.

Bivariate normal distribution
The bivariate normal distribution is the joint distribution of two normally distributed
variables. The density is

    f(x, y) = 1/(2πσ_x σ_y √(1 − ρ²)) e^{−½[(ε_x² + ε_y² − 2ρε_xε_y)/(1 − ρ²)]},

where ε_x = (x − µ_x)/σ_x and ε_y = (y − µ_y)/σ_y.
3.3 The joint cumulative density function

The joint cumulative density function
The probability of a joint event of X and Y is given by the joint cumulative density function:

• if x and y are discrete,

    F(x, y) = Prob(X ≤ x, Y ≤ y) = Σ_{X≤x} Σ_{Y≤y} f(x, y),

• if x and y are continuous,

    F(x, y) = Prob(X ≤ x, Y ≤ y) = ∫_{−∞}^x ∫_{−∞}^y f(t, s) ds dt.

Example

With x = 2, y = 2 and the following f(x, y):

    f(x, y)        f(x, y = 1)   f(x, y = 2)
    f(x = 1, y)    1/10          2/10
    f(x = 2, y)    1/10          2/10
    f(x = 3, y)    0             4/10

    Prob(X ≤ 2, Y ≤ 2) = f(x = 1, y = 1) + f(x = 2, y = 1) + f(x = 1, y = 2) + f(x = 2, y = 2) = 3/5.

Cumulative probability distribution
For values x and y of two discrete random variables X and Y, the cumulative probability
distribution is

    F(x, y) = Prob(X ≤ x, Y ≤ y).

The axioms of probability require

    0 ≤ F(x, y) ≤ 1,
    F(∞, ∞) = 1,
    F(−∞, y) = 0,
    F(x, −∞) = 0.

The marginal probabilities can be found from the joint cdf:

    F_x(x) = Prob(X ≤ x) = Prob(X ≤ x, Y ≤ ∞) = F(x, ∞).
3.4 The marginal probability density
The marginal probability density
To obtain the marginal distributions f_x(x) and f_y(y) from the joint density f(x, y), it is
necessary to sum or integrate out the other variable. For example,

• if x and y are discrete,

    f_x(x) = Σ_y f(x, y),

• if x and y are continuous,

    f_x(x) = ∫_y f(x, s) ds.

Example

    f(x, y)        f(x, y = 1)   f(x, y = 2)   f_x(x)
    f(x = 1, y)    1/10          2/10          3/10
    f(x = 2, y)    1/10          2/10          3/10
    f(x = 3, y)    0             4/10          4/10
    f_y(y)         2/10          8/10          1

    f_x(x = 1) = f(x = 1, y = 1) + f(x = 1, y = 2) = 3/10.
    f_y(y = 2) = f(x = 1, y = 2) + f(x = 2, y = 2) + f(x = 3, y = 2) = 4/5.
[Figure: the bivariate normal distribution]

Why do we care about marginal distributions?
Means, variances, and higher moments of the variables in a joint distribution are defined
with respect to the marginal distributions.

• Expectations
  If x and y are discrete,

      E[x] = Σ_x x f_x(x) = Σ_x x [Σ_y f(x, y)] = Σ_x Σ_y x f(x, y).

  If x and y are continuous,

      E[x] = ∫_x x f_x(x) dx = ∫_x ∫_y x f(x, y) dy dx.

• Variances

      Var[x] = Σ_x (x − E[x])² f_x(x) = Σ_x Σ_y (x − E[x])² f(x, y).
3.5 Covariance and correlation
For any function g(x, y),

    E[g(x, y)] = Σ_x Σ_y g(x, y) f(x, y)         in the discrete case,
               = ∫_x ∫_y g(x, y) f(x, y) dy dx    in the continuous case.

The covariance of x and y is a special case:

    Cov[x, y] = E[(x − µ_x)(y − µ_y)] = E[xy] − µ_xµ_y = σ_xy.

If x and y are independent, then f(x, y) = f_x(x)f_y(y) and

    σ_xy = Σ_x Σ_y f_x(x)f_y(y)(x − µ_x)(y − µ_y)
         = Σ_x (x − µ_x)f_x(x) Σ_y (y − µ_y)f_y(y) = E[x − µ_x] E[y − µ_y] = 0.

• correlation ρ_xy = σ_xy/(σ_xσ_y)
• σ_xy = 0 does not imply independence (except for the bivariate normal).

Independence: pdf and cdf from marginal densities
• Two random variables are statistically independent if and only if their joint density
  is the product of the marginal densities:

      f(x, y) = f_x(x)f_y(y)  ⇔  x and y are independent.

• If (and only if) x and y are independent, then the marginal cdfs factor the cdf as well:

      F(x, y) = F_x(x)F_y(y) = Prob(X ≤ x, Y ≤ y) = Prob(X ≤ x)Prob(Y ≤ y).
Example

    f(x, y)        f(x, y = 1)   f(x, y = 2)   f_x(x)
    f(x = 1, y)    1/6           1/6           1/3
    f(x = 2, y)    1/6           1/6           1/3
    f(x = 3, y)    1/6           1/6           1/3
    f_y(y)         1/2           1/2           1

    F(x, y)        F(x, y = 1)   F(x, y = 2)
    F(x = 1, y)    1/6           2/6
    F(x = 2, y)    2/6           4/6
    F(x = 3, y)    3/6           1

    f_x(x = 3) × f_y(y = 2) = 1/3 × 1/2 = 1/6 = f(x = 3, y = 2).

    P(x ≤ 2)P(y ≤ 2) = [f(x = 1, y = 1) + f(x = 1, y = 2) + f(x = 2, y = 1) + f(x = 2, y = 2)] × 1
                     = 4/6 = F(x = 2, y = 2).

3.6 The conditional density function

The conditional density function
The conditional distribution over y for each value of x (and vice versa) has conditional
densities

    f(y|x) = f(x, y)/f_x(x),        f(x|y) = f(x, y)/f_y(y).

The marginal distribution of x averages the conditional probability of x given y over the
distribution of all values of y: f_x(x) = Σ_y f(x|y)f_y(y) = E_y[f(x|y)]. If x and y are
independent, knowing the value of y does not provide any information about x, so
f_x(x) = f(x|y).

Example

    cond. distr. f(x|y)    f(x|y = 1)   f(x|y = 2)   f(x|y = 1, y = 2)
    x = 1                  1/2          1/4          3/10
    x = 2                  1/2          1/4          3/10
    x = 3                  0            1/2          4/10
    Σ_x                    1            1            1

    joint distr. f(x, y)   f(x, y = 1)  f(x, y = 2)  f_x(x)
    f(x = 1, y)            1/10         2/10         3/10
    f(x = 2, y)            1/10         2/10         3/10
    f(x = 3, y)            0            4/10         4/10
    marginal pr. f_y(y)    2/10         8/10         1

    f(x = 3|y = 2) = f(x = 3, y = 2)/f_y(y = 2) = 4/10 × 10/8 = 1/2.

    f_x(x = 2) = E_y[f(x = 2|y)] = f(x = 2|y = 1)f(y = 1) + f(x = 2|y = 2)f(y = 2)
               = 1/2 × 2/10 + 1/4 × 8/10 = 1/10 + 2/10 = 3/10.

3.7 Conditional mean aka regression

A random variable may always be written as

    y = E[y|x] + (y − E[y|x]) = E[y|x] + ϵ.

Definition
The regression of y on x is obtained from the conditional mean

    E[y|x] = Σ_y y f(y|x)         if y is discrete,
           = ∫_y y f(y|x) dy      if y is continuous.

Predict y at values of x:

    Σ_y y f(y|x = 1) = 1 × 1/3 + 2 × 2/3 = 5/3.
Conditional variance
A conditional variance is the variance of the conditional distribution:

    Var[y|x] = Σ_y (y − E[y|x])² f(y|x)        if y is discrete,
             = ∫_y (y − E[y|x])² f(y|x) dy     if y is continuous.

The computation can be simplified by using

    Var[y|x] = E[y²|x] − (E[y|x])² ≥ 0.

Decomposition of variance: Var[y] = E_x[Var[y|x]] + Var_x[E[y|x]]

• When we condition on x, the variance of y reduces on average: Var[y] ≥ E_x[Var[y|x]].
• E_x[Var[y|x]] is the average of the variances within each x.
• Var_x[E[y|x]] is the variance between the y averages in each x.
• E[y|x = 1] = 1.67, E[y|x = 2] = 1.67, and E[y|x = 3] = 2
• V[y|x = 1] = 0.22, V[y|x = 2] = 0.22, and V[y|x = 3] = 0
Example

    f(y|x)         y = 1   y = 2   Σ_y
    f(y|x = 1)     1/3     2/3     1
    f(y|x = 2)     1/3     2/3     1
    f(y|x = 3)     0       1       1

    f(x, y)        f(x, y = 1)   f(x, y = 2)   f_x(x)
    f(x = 1, y)    1/10          2/10          3/10
    f(x = 2, y)    1/10          2/10          3/10
    f(x = 3, y)    0             4/10          4/10
    f_y(y)         2/10          8/10          1

    E[y|x = 1] = 1/3 × 1 + 2/3 × 2 = 5/3,    V[y|x = 1] = 1² × 1/3 + 2² × 2/3 − (5/3)² = 2/9,
    E[y|x = 2] = 1/3 × 1 + 2/3 × 2 = 5/3,    V[y|x = 2] = 1² × 1/3 + 2² × 2/3 − (5/3)² = 2/9,
    E[y|x = 3] = 0 × 1 + 1 × 2 = 2,          V[y|x = 3] = 1² × 0 + 2² × 1 − 2² = 0.

Alternatively (requiring more differences),

    V[y|x = 1] = (1 − 5/3)² × 1/3 + (2 − 5/3)² × 2/3 = 2/9.

The average of the variances within each x, E[V[y|x]], is less than or equal to the total
variance V[y].

Example

• Use the conditional mean to calculate E[y]:

    E[y] = E_x[E[y|x]] = E[y|x = 1]f(x = 1) + E[y|x = 2]f(x = 2) + E[y|x = 3]f(x = 3)
         = 5/3 × 3/10 + 5/3 × 3/10 + 2 × 4/10 = 9/5,

    E[y] = Σ_y y f_y(y) = 1 × 2/10 + 2 × 8/10 = 9/5.

• The variation in y within each x, V[y|x = 1] = 0.22, V[y|x = 2] = 0.22, and
  V[y|x = 3] = 0, is on average

    E[V[y|x]] = 3/10 × 2/9 + 3/10 × 2/9 + 4/10 × 0 = 2/15.

• Across the conditional means E[y|x = 1] = 5/3, E[y|x = 2] = 5/3, and E[y|x = 3] = 2,
  y varies with

    V[E[y|x]] = E_x[(E[y|x])²] − (E_x[E[y|x]])²
              = 3/10 × (5/3)² + 3/10 × (5/3)² + 4/10 × (2)² − (9/5)² = 2/75.

• E[V[y|x]] + V[E[y|x]] = V[y] = 2/15 + 2/75 = 4/25.
  With degree of freedom correction (N − 1) (as reported in software):
  V[y] = (2/15 + 2/75) × 10/(10 − 1) = 8/45.
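The decomposition can be verified on the 10 observations behind the example (a Python sketch with numpy; population variances, i.e. divisor N):

    import numpy as np

    x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
    y = np.array([1, 2, 2, 1, 2, 2, 2, 2, 2, 2])

    groups = [y[x == v] for v in (1, 2, 3)]
    weights = np.array([len(g) for g in groups]) / len(y)        # f(x) = 3/10, 3/10, 4/10

    within = sum(w * g.var() for w, g in zip(weights, groups))                     # E[V[y|x]] = 2/15
    between = sum(w * (g.mean() - y.mean())**2 for w, g in zip(weights, groups))   # V[E[y|x]] = 2/75
    print(round(y.var(), 4), round(within + between, 4))         # both 0.16 = 4/25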

3.8 The bivariate normal

Properties of the bivariate normal
Recall that the bivariate normal distribution is the joint distribution of two normally
distributed variables. The density is

    f(x, y) = 1/(2πσ_x σ_y √(1 − ρ²)) e^{−½[(ε_x² + ε_y² − 2ρε_xε_y)/(1 − ρ²)]},

where ε_x = (x − µ_x)/σ_x and ε_y = (y − µ_y)/σ_y.
The covariance is σ_xy = ρ_xy σ_x σ_y, where

• −1 < ρ_xy < 1 is the correlation between x and y,
• µ_x, σ_x, µ_y, σ_y are the means and standard deviations of the marginal distributions
  of x and y.

If x and y are bivariately normally distributed, (x, y) ∼ N₂[µ_x, µ_y, σ_x², σ_y², ρ_xy],

• the marginal distributions are normal:

      f_x(x) = N[µ_x, σ_x²],
      f_y(y) = N[µ_y, σ_y²],

• the conditional distributions are normal:

      f(y|x) = N[α + βx, σ_y²(1 − ρ²)],
      α = µ_y − βµ_x;    β = σ_xy/σ_x²,

• f(x, y) = f_x(x)f_y(y) if ρ_xy = 0: x and y are independent if and only if they are
  uncorrelated.
3.9 Useful rules

• ρ_xy = σ_xy/(σ_xσ_y)
• E[ax + by + c] = aE[x] + bE[y] + c
• Var[ax + by + c] = a²Var[x] + b²Var[y] + 2abCov[x, y] = Var[ax + by]
• Cov[ax + by, cx + dy] = acVar[x] + bdVar[y] + (ad + bc)Cov[x, y]
• If X and Y are uncorrelated, then Var[x + y] = Var[x − y] = Var[x] + Var[y].
• Linearity

    E[ax + by|z] = aE[x|z] + bE[y|z].

• Adam's Law / Law of Iterated Expectations

    E[y] = E_x[E[y|x]]

• Adam's general Law / Law of Iterated Expectations

    E[y|g₂(g₁(x))] = E[E[y|g₁(x)]|g₂(g₁(x))]

• Independence
  If x and y are independent, then

    E[y] = E[y|x],
    E[g₁(x)g₂(y)] = E[g₁(x)]E[g₂(y)].

• Taking out what is known

    E[g₁(x)g₂(y)|x] = g₁(x)E[g₂(y)|x].

• Projection of y by E[y|x], such that the projection error is orthogonal to any h(x):

    E[(y − E[y|x])h(x)] = 0.

• Keeping just what is needed (the part of y predictable from x is needed, not the residual):

    E[xy] = E[xE[y|x]].

• Eve's Law (EVVE) / Law of Total Variance

    Var[y] = E_x[Var[y|x]] + Var_x[E[y|x]]

• ECCE Law / Law of Total Covariance

    Cov[x, y] = E_z[Cov[y, x|z]] + Cov_z[E[x|z], E[y|z]]

• Cov[x, y] = Cov_x[x, E[y|x]] = ∫_x (x − E[x]) E[y|x] f_x(x) dx.
• If E[y|x] = α + βx, then α = E[y] − βE[x] and β = Cov[x, y]/Var[x].
• Regression variance: Var_x[E[y|x]], because E[y|x] varies with x.
• Residual variance: E_x[Var[y|x]] = Var[y] − Var_x[E[y|x]], because y varies around
  the conditional mean.
• Decomposition of variance: Var[y] = Var_x[E[y|x]] + E_x[Var[y|x]]
• Coefficient of determination = regression variance / total variance
• If E[y|x] = α + βx and if Var[y|x] is a constant, then

    Var[y|x] = Var[y](1 − Corr²[y, x]) = σ_y²(1 − ρ_xy²).
4 The Least Squares Estimator
4.1 What is the Relationship between Two Variables?
Political Connections and Firms
Firm profits increase with the degree of political connections.

• Learn how to represent relationships between two or more variables
• How to quantify and predict effects of shocks and policy changes
• Show properties of the OLS estimator in small & large samples
• Apply Monte Carlo Simulations to assess properties of OLS
4.2 The Econometric Model
Specification of a Linear Regression
• dependent variable: y_i = profits of firm i
• explanatory variables x_{i1}, . . . , x_{iK}, k = 1, . . . , K: political connections, other firm
  characteristics
• x_{i0} = 1 is a constant
• parameters to be estimated: β₀, β₁, . . . , β_K (K + 1 in total)
• u_i is called the error term

Two example specifications:

    y_i = (β₀ = 4) + (β₁ = 0)x_{i1} + u_i,
    y_i = (β₀ = 2.36) + (β₁ = 0.01)x_{i1} + u_i.

How Were the Data Generated?

The data generating process is fully described by a set of assumptions.

The Five Assumptions of the Econometric Model
• LRM1: Linearity
• LRM2: Simple random sampling
• LRM3: Exogeneity
• LRM4: Error variance
• LRM5: Identifiability
Data Generating Process: Linearity
LRM1: Linearity

    y_i = β₀ + β₁x_{i1} + . . . + β_Kx_{iK} + u_i and E(u_i) = 0.

LRM1 assumes that the
• functional relationship is linear in the parameters β_k,
• error term u_i enters additively,
• parameters β_k are constant across individual firms i and j ≠ i.

Anscombe's Quartet
Figure 1: All four sets are identical when examined using linear statistics, but very
different when graphed. The correlation between x and y is 0.816. Linear regression:
y = 3.00 + 0.50x.

Data Generating Process: Random Sampling
LRM2: Simple Random Sampling

    {x_{i1}, . . . , x_{iK}, y_i}_{i=1}^N i.i.d. (independent and identically distributed)

LRM2 means that
• observation i has no information content for observation j ≠ i,
• all observations i come from the same distribution.

This assumption is guaranteed by simple random sampling provided there is no systematic
non-response or truncation.

Density of Population and Truncated Sample
Figure 2: Distribution of a dependent variable and an independent variable truncated at
y* = 15.
Data Generating Process: Exogeneity
LRM3: Exogeneity

1. u_i|x_{i1}, . . . , x_{iK} ∼ N(0, σ_i²)
   LRM3a assumes that the error term is normally distributed conditional on the
   explanatory variables.
2. u_i ⊥ x_{ik} ∀k (independent), pdf_{u,x}(u_i, x_{ik}) = pdf_u(u_i)pdf_x(x_{ik})
   LRM3b means that the error term is independent of the explanatory variables.
3. E(u_i|x_{i1}, . . . , x_{iK}) = E(u_i) = 0 (mean independent)
   LRM3c states that the mean of the error term is independent of the explanatory
   variables.
4. cov(x_{ik}, u_i) = 0 ∀k (uncorrelated)
   LRM3d means that the error term and the explanatory variables are uncorrelated.

LRM3a or LRM3b imply LRM3c and LRM3d. LRM3c implies LRM3d.

Figure 3: Distributions of the dependent variable conditional on values of an independent
variable.

A weaker exogeneity assumption suffices if interest is only in, say, x_{i1}:

Conditional Mean Independence: E(u_i|x_{i1}, x_{i2}, . . . , x_{iK}) = E(u_i|x_{i2}, . . . , x_{iK}).
Given the control variables x_{i2}, . . . , x_{iK}, the mean of u_i does not depend on the variable
of interest x_{i1}.
Data Generating Process: Error Variance
LRM4: Error Variance

1. V(u_i|x_{i1}, . . . , x_{iK}) = σ² < ∞ (homoskedasticity)
   LRM4a means that the variance of the error term is a constant.
2. V(u_i|x_{i1}, . . . , x_{iK}) = σ_i² = g(x_{i1}, . . . , x_{iK}) < ∞ (conditional heteroskedasticity)
   LRM4b allows the variance of the error term to depend on a function g of the
   explanatory variables.

Heteroskedasticity
Figure 4: The simple regression model under homo- and heteroskedasticity.
Var(profits|lobbying, employees) increasing with lobbying.

Data Generating Process: Identifiability
LRM5: Identifiability

    (x_{i0}, x_{i1}, . . . , x_{iK}) are not linearly dependent,
    0 < V(x_{ik}) < ∞ ∀k > 0.

LRM5 assumes that
• the regressors are not perfectly collinear, i.e. no variable is a linear combination of
  the others,
• all regressors (but the constant) have strictly positive variance both in expectation
  and in the sample, and not too many extreme values.

LRM5 means that every explanatory variable adds additional information.

The Identifying Variation from x_{ik}
Figure 5: The number of red and blue dots is the same. Using which would you get a
more accurate regression line?
4.3 Estimation with OLS
Ordinary least squares (OLS) minimizes the squared distances (SD) between the
observed and the predicted dependent variable y:

    min_{β₀,...,β_K} SD(β₀, . . . , β_K),

    where SD = Σ_{i=1}^N [y_i − (β₀ + β₁x_{i1} + . . . + β_Kx_{iK})]².

How to Describe the Relationship Best?
[Figures: scatter plots with candidate regression lines]
Invention of OLS

Legendre to Jacobi (Paris, 30 November 1827, Plackett, 1972): "...How can Mr. Gauss
have dared to tell you that the greater part of your theorems were known to him...?
... this is the same man ... who wanted to appropriate in 1809 the method of least
squares published in 1805. Other examples will be found in other places, but a man of
honour should refrain from imitating them."

Figure 6: Watercolor caricature of Legendre by Boilly (1820), the only existing portrait
known.
Figure 7: Portrait of Gauss by Jensen (1840).

Estimation with OLS

For the bivariate regression model, the OLS estimators of β₀ and β₁ are

    β̂₀ = ȳ − β̂₁x̄,
    β̂₁ = Σ_{i=1}^N (x_{i1} − x̄)(y_i − ȳ) / Σ_{i=1}^N (x_{i1} − x̄)² = cov(x, y)/var(x),
    β̂₁ = cov(x, y)/(s_x s_x) = R s_y/s_x,

where R ≡ cov(x, y)/(s_x s_y) is Pearson's correlation coefficient, with s_z denoting the
standard deviation of z.

The OLS Estimator Measures Linear Correlation

Equivalently,

    R = (s_x/s_y) β̂₁ = β̂₁ √(Σ_{i=1}^N (x_{i1} − x̄)²) / √(Σ_{i=1}^N (y_i − ȳ)²)
                      = √(Σ_{i=1}^N (β̂₁x_{i1} − β̂₁x̄)²) / √(Σ_{i=1}^N (y_i − ȳ)²).

Squaring gives

    R² = Σ_{i=1}^N (ŷ_i − ȳ)² / Σ_{i=1}^N (y_i − ȳ)² = 1 − Σ_{i=1}^N û_i² / Σ_{i=1}^N (y_i − ȳ)².

R² as a measure of the goodness of fit:
The fit improves with the fraction of the sample variation in y that is explained by the x.

The Case with K Explanatory Variables

The more general case with K explanatory variables is

    β̂ = (X′X)⁻¹ X′ y,

with dimensions β̂: (K+1)×1, (X′X)⁻¹: (K+1)×(K+1), X′: (K+1)×N, y: N×1.

Given the OLS estimator, we can predict
• the dependent variable by ŷ_i = β̂₀ + β̂₁x_{i1} + . . . + β̂_Kx_{iK},
• the error term by û_i = y_i − ŷ_i.

û_i is called the residual.

    Adjusted R² = 1 − (N − 1)/(N − K − 1) × Σ_{i=1}^N û_i² / Σ_{i=1}^N (y_i − ȳ)².
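The matrix formula can be applied directly; a minimal numpy sketch on simulated data (using the DGP of the Monte Carlo section below, β₀ = 2, β₁ = 0.5):

    import numpy as np

    rng = np.random.default_rng(42)
    N = 100
    x1 = rng.normal(size=N)
    y = 2.0 + 0.5 * x1 + rng.normal(size=N)        # DGP with beta0 = 2, beta1 = 0.5

    X = np.column_stack([np.ones(N), x1])          # include the constant x_i0 = 1
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
    u_hat = y - X @ beta_hat
    r2 = 1 - (u_hat @ u_hat) / ((y - y.mean()) @ (y - y.mean()))
    print(beta_hat.round(3), round(r2, 3))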
Figure 8: Scatter cloud visualized with GRAPH3D for Stata.
Figure 9: OLS surface visualized with GRAPH3D for Stata.
4.4 Properties of the OLS Estimator in the Small and in the Large
Properties of the OLS Estimator
• Small sample properties of β̂
  - unbiased
  - normally distributed
  - efficient
• Large sample properties of β̂
  - consistent
  - approx. normal
  - asymptotically efficient

Small Sample Properties
Figure 10: What is a small sample? Source: Familien-Duell, Light Entertainment.
Figure 11: What is a small sample? (Wooldridge, 2009, p. 755): "But large sample
approximations have been known to work well for sample sizes as small as N = 20."
Source: Familien-Duell, Grundy Light Entertainment.
Unbiasedness and Normality of β̂_k
Assuming LRM1, LRM2, LRM3a, LRM4, and LRM5, the following properties can be
established even for small samples.

• The OLS estimator of β is unbiased:

    E(β̂_k|x_{11}, . . . , x_{NK}) = β_k.

• The OLS estimator is (multivariate) normally distributed:

    β̂_k|x_{11}, . . . , x_{NK} ∼ N(β_k, V(β̂_k)).

• Under homoskedasticity (LRM4a), the variance V(β̂_k|x_{11}, . . . , x_{NK}) can be
  estimated without bias.

Variance of β̂_k and Efficiency

• For the bivariate regression model, it is estimated as

    V̂ = σ̂² / Σ_{i=1}^N (x_i − x̄)²    with    σ̂² = Σ_{i=1}^N û_i² / (N − K − 1).

• Gauss-Markov Theorem: under homoskedasticity (LRM4a), β̂_k is the BLUE (best
  linear unbiased estimator; e.g., non-linear least squares is biased).
• V̂(β̂_k) inflates with
  - micronumerosity (small sample size),
  - multicollinearity (high (but not perfect) correlation between two or more of the
    independent variables).

Unbiasedness
• The OLS estimator of β is unbiased.
  Plug y = Xβ + u into the formula for β̂ and then use the law of iterated expectations
  to first take the expectation with respect to u conditional on X and then take the
  unconditional expectation:

    E[β̂] = E_{X,u}[(X′X)⁻¹X′(Xβ + u)]
          = β + E_{X,u}[(X′X)⁻¹X′u]
          = β + E_X[E_{u|X}[(X′X)⁻¹X′u|X]]
          = β + E_X[(X′X)⁻¹X′E_{u|X}[u|X]]
          = β,

  where E[u|X] = 0 by the assumptions of the model.

Variance
• The OLS estimator β̂ has variance V(β̂|x_{11}, . . . , x_{NK}) = σ²(X′X)⁻¹.
  Let σ²I denote the covariance matrix of u. Then,

    E[(β̂ − β)(β̂ − β)′] = E[((X′X)⁻¹X′u)((X′X)⁻¹X′u)′]
                         = E[(X′X)⁻¹X′uu′X(X′X)⁻¹]
                         = E[(X′X)⁻¹X′σ²IX(X′X)⁻¹]
                         = E[σ²(X′X)⁻¹X′X(X′X)⁻¹]
                         = σ²(X′X)⁻¹,

  where we used the fact that β̂ − β is just an affine transformation of u by the matrix
  (X′X)⁻¹X′.

Estimator for Variance

For a simple linear regression model, where β = [β₀, β₁]′ (β₀ is the y-intercept and β₁ is
the slope), one obtains

    σ²(X′X)⁻¹ = σ² (Σ x_i x_i′)⁻¹
              = σ² (Σ (1, x_i)′(1, x_i))⁻¹
              = σ² [ N      Σx_i  ]⁻¹
                   [ Σx_i   Σx_i² ]
              = σ² · 1/(NΣx_i² − (Σx_i)²) [  Σx_i²   −Σx_i ]
                                          [ −Σx_i     N    ]
              = σ² · 1/(N Σ_{i=1}^N (x_i − x̄)²) [  Σx_i²   −Σx_i ]
                                                [ −Σx_i     N    ],

so that

    Var(β̂₁) = σ² / Σ_{i=1}^N (x_i − x̄)².

Parameter Values for Simulations

Monte Carlo Simulations show the distribution of the estimate. Suppose the data
generating process is

    y_i = β₀ + β₁x_{i1} + u_i,

with

• β₀ = 2.00
• β₁ = 0.5
• u_i ∼ N(0.00, 1.00)
• N = 3, N = 5, N = 10, N = 25, N = 100, N = 1000.

Try it yourself...
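A Monte Carlo sketch in Python (numpy) along these lines draws many samples for each N and summarizes the sampling distribution of β̂₁; the number of replications is an arbitrary choice:

    import numpy as np

    rng = np.random.default_rng(0)

    def ols_slope(N):
        x = rng.normal(size=N)
        y = 2.0 + 0.5 * x + rng.normal(size=N)             # DGP: beta0 = 2, beta1 = 0.5, u ~ N(0, 1)
        return np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)  # beta1_hat = cov(x, y)/var(x)

    for N in (3, 5, 10, 25, 100, 1000):
        draws = np.array([ols_slope(N) for _ in range(2000)])
        print(N, round(draws.mean(), 3), round(draws.std(), 3))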
How to Establish Asymptotic Properties of β̂_k?
Law of Large Numbers
As N increases, the distribution of β̂_k becomes more tightly centered around β_k.
[Figure: sampling distributions of β̂_k for (a) N = 3, (b) N = 5, (c) N = 10, (d) N = 100]

Central Limit Theorem
As N increases, the distribution of β̂_k becomes normal (starting from a t-distribution).
[Figure: sampling distributions of β̂_k for (a) N = 3, (b) N = 5, (c) N = 10, (d) N = 100]

Consistency, Asymptotic Normality

Assuming LRM1, LRM2, LRM3d, LRM4a or LRM4b, and LRM5, the following properties
can be established using the law of large numbers and the central limit theorem for
large samples.

• The OLS estimator is consistent:

    plim β̂_k = β_k.

  That is, for all ε > 0,

    lim_{N→∞} Pr(|β̂_k − β_k| > ε) = 0.

• The OLS estimator is asymptotically normally distributed:

    √N(β̂_k − β_k) →d N(0, Avar(β̂_k) × N)

  (Avar means asymptotic variance).

• The OLS estimator is approximately normally distributed:

    β̂_k ∼A N(β_k, Avar(β̂_k)).

Efficiency and Asymptotic Variance

For the bivariate regression under LRM4a (homoskedasticity), the asymptotic variance
can be consistently estimated as

    Âvar(β̂₁) = σ̂² / Σ_{i=1}^N (x_{i1} − x̄)²,    with    σ̂² = Σ_{i=1}^N û_i² / (N − 2).

Under LRM4b (heteroskedasticity), Avar(β̂) can be consistently estimated with the
robust, or Eicker-Huber-White, estimator. The robust variance estimator is calculated as

    Âvar(β̂₁) = Σ_{i=1}^N û_i²(x_{i1} − x̄)² / [Σ_{i=1}^N (x_{i1} − x̄)²]².

Note: In practice we can almost never be sure that the errors are homoskedastic and
should therefore always use robust standard errors.
Sketch of Proof for Asymptotic Properties

• The OLS estimator β̂ is consistent and asymptotically normal.

The estimator β̂ can be written as

    β̂ = (N⁻¹X′X)⁻¹ N⁻¹X′y = β + (N⁻¹X′X)⁻¹ N⁻¹X′u
       = β + (N⁻¹ Σ_{i=1}^N x_i x_i′)⁻¹ (N⁻¹ Σ_{i=1}^N x_i u_i).

We can use the law of large numbers to establish that

    N⁻¹ Σ_{i=1}^N x_i x_i′ →p E[x_i x_i′] = Q_xx,
    N⁻¹ Σ_{i=1}^N x_i u_i →p E[x_i u_i] = 0.

By Slutsky's theorem and the continuous mapping theorem these results can be combined
to establish consistency of the estimator β̂:

    β̂ →p β + Q_xx⁻¹ · 0 = β.

The central limit theorem tells us that

    N^{−1/2} Σ_{i=1}^N x_i u_i →d N(0, V),

where V = Var[x_i u_i] = E[u_i² x_i x_i′] = E[E[u_i²|x_i] x_i x_i′] = σ²Q_xx.

Applying Slutsky's theorem again, we have

    √N(β̂ − β) = (N⁻¹ Σ_{i=1}^N x_i x_i′)⁻¹ (N^{−1/2} Σ_{i=1}^N x_i u_i)
              →d Q_xx⁻¹ N(0, σ²Q_xx) = N(0, σ²Q_xx⁻¹).

OLS Properties in the Small and in the Large

    Set of assumptions               (1)  (2)  (3)  (4)  (5)  (6)
    LRM1: linearity                   fulfilled
    LRM2: simple random sampling      fulfilled
    LRM5: identifiability             fulfilled
    LRM4: error variance
    - LRM4a: homoskedastic            ✓    ✓    ✓    ×    ×    ×
    - LRM4b: heteroskedastic          ×    ×    ×    ✓    ✓    ✓
    LRM3: exogeneity
    - LRM3a: normality                ✓    ×    ×    ✓    ×    ×
    - LRM3b: independent              ✓    ✓    ×    ×    ×    ×
    - LRM3c: mean indep.              ✓    ✓    ✓    ✓    ✓    ×
    - LRM3d: uncorrelated             ✓    ✓    ✓    ✓    ✓    ✓

    Small sample properties of β̂
    - unbiased                        ✓    ✓    ✓    ✓    ✓    ×
    - normally distributed            ✓    ×    ×    ✓    ×    ×
    - efficient                       ✓    ✓    ✓    ×    ×    ×

    Large sample properties of β̂
    - consistent                      ✓    ✓    ✓    ✓    ✓    ✓
    - approx. normal                  ✓    ✓    ✓    ✓    ✓    ✓
    - asymptotically efficient        ✓    ✓    ✓    ×    ×    ×

• Notes: ✓ = fulfilled, × = violated
Tests in Small Samples I
Assume LRM1, LRM2, LRM3a, LRM4a, and LRM5. A simple null hypothesis of the
form H₀: β_k = q is tested with the t-test. If the null hypothesis is true, the t-statistic

    t = (β̂_k − q)/se(β̂_k) ∼ t_{N−K−1}

follows a t-distribution with N − K − 1 degrees of freedom. The standard error is
se(β̂_k) = √V̂(β̂_k).

For example, to perform a two-sided test of H₀ against the alternative hypothesis
H_A: β_k ≠ q at the 5% significance level, we calculate the t-statistic and compare its
absolute value to the 0.975-quantile of the t-distribution. With N = 30 and K = 2, H₀ is
rejected if |t| > 2.052.
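The critical value cited above can be reproduced with scipy (a sketch, not part of the original text):

    from scipy.stats import t as t_dist

    df = 30 - 2 - 1                                   # N - K - 1 = 27
    print(round(t_dist.ppf(0.975, df), 3))            # 2.052
    print(round(2 * (1 - t_dist.cdf(2.5, df)), 3))    # two-sided p-value for an example t = 2.5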

Tests in Small Samples II

A null hypothesis of the form H₀: r_{j1}β₁ + . . . + r_{jK}β_K = q_j, in matrix notation
H₀: Rβ = q, with J linear restrictions j = 1 . . . J, is jointly tested with the F-test.
If the null hypothesis is true, the F-statistic follows an F distribution with J numerator
degrees of freedom and N − K − 1 denominator degrees of freedom:

    F = (Rβ̂ − q)′[RV̂(β̂|X)R′]⁻¹(Rβ̂ − q) / J ∼ F_{J,N−K−1}.

For example, to perform a test of H₀ against the alternative hypothesis
H_A: r_{j1}β₁ + . . . + r_{jK}β_K ≠ q_j for some j at the 5% significance level, we calculate
the F-statistic and compare it to the 0.95-quantile of the F-distribution.
With N = 30, K = 2 and J = 2, H₀ is rejected if F > 3.35. We cannot perform two-sided
F-tests because the F distribution has only one tail.

Tests in Small Samples III

Only under homoskedasticity (LRM4a), the F-statistic can also be computed as

    F = [(R² − R²_restricted)/J] / [(1 − R²)/(N − K − 1)] ∼ F_{J,N−K−1},

where R²_restricted is estimated by restricted least squares, which minimizes SD(β) subject
to r_{j1}β₁ + . . . + r_{jK}β_K = q_j for all j.

Exclusionary restrictions of the form H₀: β_k = 0, β_m = 0, . . . are a special case of
H₀: r_{j1}β₁ + . . . + r_{jK}β_K = q_j for all j. In this case, restricted least squares is simply
estimated as a regression where the explanatory variables k, m, . . . are excluded, e.g. a
regression with a constant only.

If the F distribution has 1 numerator degree of freedom and N − K − 1 denominator
degrees of freedom, then it can be shown that t² = F(1, N − K − 1).

Confidence Intervals in Small Samples

Assuming LRM1, LRM2, LRM3a, LRM4a, and LRM5, we can construct confidence
intervals for a particular coefficient β_k. The (1 − α) confidence interval is given by

    [β̂_k − t_{(1−α/2),(N−K−1)} se(β̂_k),  β̂_k + t_{(1−α/2),(N−K−1)} se(β̂_k)],

where t_{(1−α/2),(N−K−1)} is the (1 − α/2) quantile of the t-distribution with (N − K − 1)
degrees of freedom. For example, the 95% confidence interval with N = 30 and K = 2 is

    [β̂_k − 2.052 se(β̂_k),  β̂_k + 2.052 se(β̂_k)].

Recall: α is the maximum acceptable probability of a Type I error.

                                H₀ is valid (Innocent)       H₀ is invalid (Guilty)
    Reject H₀                   Type I (α = 0.05) error:     Correct outcome:
    "I think he is guilty!"     False positive. Convicted!   True positive. Convicted!
    Don't reject H₀             Correct outcome:             Type II (β) error:
    "I think he is innocent!"   True negative. Freed!        False negative. Freed!

Asymptotic Tests
Assume LRM1, LRM2, LRM3d, LRM4a or LRM4b, and LRM5. A simple null hypothesis
of the form H₀: β_k = q is tested with the z-test. If the null hypothesis is true, the
z-statistic

    z = (β̂_k − q)/se(β̂_k) ∼A N(0, 1)

follows approximately the standard normal distribution. The standard error is
se(β̂_k) = √Âvar(β̂_k).

For example, to perform a two-sided test of H₀ against the alternative hypothesis
H_A: β_k ≠ q at the 5% significance level, we calculate the z-statistic and compare its
absolute value to the 0.975-quantile of the standard normal distribution. H₀ is rejected
if |z| > 1.96.

We talk about the Wald test later...

Confidence Intervals in Large Samples

Assuming LRM1, LRM2, LRM3d, LRM5, and LRM4a or LRM4b, we can construct
confidence intervals for a particular coefficient β_k. The (1 − α) confidence interval is
given by

    [β̂_k − z_{(1−α/2)} se(β̂_k),  β̂_k + z_{(1−α/2)} se(β̂_k)],

where z_{(1−α/2)} is the (1 − α/2) quantile of the standard normal distribution.

For example, the 95% confidence interval is [β̂_k − 1.96 se(β̂_k),  β̂_k + 1.96 se(β̂_k)].
OLS Properties in the Small and in the Large

    Set of assumptions               (1)  (2)  (3)  (4)  (5)  (6)
    LRM1: linearity                   fulfilled
    LRM2: simple random sampling      fulfilled
    LRM5: identifiability             fulfilled
    LRM4: error variance
    - LRM4a: homoskedastic            ✓    ✓    ✓    ×    ×    ×
    - LRM4b: heteroskedastic          ×    ×    ×    ✓    ✓    ✓
    LRM3: exogeneity
    - LRM3a: normality                ✓    ×    ×    ✓    ×    ×
    - LRM3b: independent              ✓    ✓    ×    ×    ×    ×
    - LRM3c: mean indep.              ✓    ✓    ✓    ✓    ✓    ×
    - LRM3d: uncorrelated             ✓    ✓    ✓    ✓    ✓    ✓

    Small sample properties of β̂
    - unbiased                        ✓    ✓    ✓    ✓    ✓    ×
    - normally distributed            ✓    ×    ×    ✓    ×    ×
    - efficient                       ✓    ✓    ✓    ×    ×    ×
    t-test, F-test                    ✓    ×    ×    ×    ×    ×

    Large sample properties of β̂
    - consistent                      ✓    ✓    ✓    ✓    ✓    ✓
    - approx. normal                  ✓    ✓    ✓    ✓    ✓    ✓
    - asymptotically efficient        ✓    ✓    ✓    ×    ×    ×
    z-test, Wald test                 ✓    ✓    ✓    ✓*   ✓*   ✓*

• Notes: ✓ = fulfilled, × = violated, * = corrected standard errors.
4.5 Politically Connected Firms: Causality or Correlation?
Arguments For Causality of the Effect

Econometric methods need to address concerns, including:

• Misspecification: Results robust to different functional forms
• Errors-in-variables: little concern with administrative data
• External validity: Similar effect found in independent studies.

Arguments Against Causality of the Effect

• Omitted variable bias: e.g., business acumen
  → Panel data models
• Sample selection bias: lobbying expenditures only observed if in the transparency
  register.
  → Selection correction models
• Simultaneous causality:
  - profits may be higher because of political connections
  - firms may become connected because of their high profits

All of those concerns may be addressed with
→ instrumental variable models. What would be a good instrument/experiment?
5 Simplifying Linear Regressions using Frisch-Waugh-
Lovell
5.1 Frisch-Waugh-Lovell theorem in equation algebra
From the multivariate to the bivariate regression
Regress yi on two explanatory variables, where x2i is the variable of interest and x1i (or

further variables) are not of interest.

yi = β0 + β2 x2i + β1 x1i + εi .

Surprising and useful result:

ˆ We can obtain exactly the same coefficient and residuals from a regression of two demeaned variables

ỹi = β0 + β2 x̃2i + εi .

ˆ We can obtain exactly the same coefficient and residuals from a regression of two residualized variables

εyi = β2 ε2i + εi .

Why is the decomposition useful?


Allows breaking a multivariate model with K independent variables into K bivariate

models.

ˆ Relationship between two variables from a multivariate model can be shown in a

two-dimensional scatter plot

ˆ Absorbs fixed effects to reduce computation time (see reghdfe for Stata)

ˆ Allows separating the variability between the regressors (multicollinearity) from the variability between the residualized variable x̃2i and the dependent variable yi .

ˆ Understand biases in multivariate models tractably.

How to decompose yi and x2i ?


Partial out x1i from yi and from x2i .

ˆ Regress x2i on all x1i and get residuals ε2i :

x2i = γ0 + γ1 x1i + ε2i ,

this implies Cov(x1i , ε2i ) = 0,

ˆ Regress yi on all x1i and get residuals εyi :

yi = δ0 + δ1 x1i + εyi .

This implies Cov(x1i , εyi ) = 0.


From the residuals and the constants γ0 and δ0 generate

ˆ x̃2i = γ0 + ε2i ,

ˆ ỹi = δ0 + εyi .
Finally,

ỹi = β̃0 + β̃1 x̃2i + ε̃i = β0 + β2 x̃2i + εi .

Decomposition theorem

For multivariate regressions and detrended regressions, e.g.,

yi = β0 + β2 x2i + β1 x1i + εi ,

ỹi = β̃0 + β̃1 x̃2i + ε̃i ,

the same regression coefficients will be obtained with any non-empty subset of the explanatory variables, such that

β̃1 = β2 and also ε̃i = εi .

Examining either set of residuals will convey precisely the same information about the

properties of the unobservable stochastic disturbances.

Detrended variables
Show that

yi = β0 + β2 x2i + β1 x1i + εi (2)

= ỹi = β̃0 + β̃1 x̃2i + ε̃i .

Plug in the variables yi = δ0 + δ1 x1i + εyi and x2i = γ0 + γ1 x1i + ε2i in the equation (2)

yi = δ0 + δ1 x1i + εyi = β0 + β2 (γ0 + γ1 x1i + ε2i ) + β1 x1i + εi


ỹi = δ0 + εyi = β0 + β2 (γ0 + ε2i ) + (β2 γ1 − δ1 + β1 )x1i + εi .

Because we partialled out x1i using OLS, x1i is mechanically uncorrelated with ε2i and with εyi . Therefore, the coefficient (β2 γ1 − δ1 + β1 ) on the partialled-out variable x1i is zero. With x̃2i = γ0 + ε2i , the equation simplifies to

ỹi = δ0 + εyi = β0 + β2 (γ0 + ε2i ) + εi .

Regression anatomy: only detrending x2i and not yi . The regression constant, the residuals, and the standard errors change, but β2 remains the same:

yi = δ0 + δ1 x1i + εyi = (β0 + δ1 x̄1 ) + β2 (γ0 + ε2i ) + (εi + δ1 (x1i − x̄1 ))

yi = κ + β2 x̃2i + ϵi .

Residualized variables

ỹi = δ0 + εyi = β0 + β2 (γ0 + ε2i ) + εi


εyi = β0 − δ0 + β2 γ0 + β2 ε2i + εi .

The same result of the FWL Theorem holds as well for a regression of the residualized

variables because β0 = δ0 − β2 γ0 :

εyi = β2 ε2i + εi .
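A quick numerical check of this result is sketched below in Python (simulated data; the design and variable names are arbitrary assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # x2 correlated with x1
y = 1.0 + 2.0 * x2 - 1.0 * x1 + rng.normal(size=n)

ones = np.ones(n)

# full regression: y on constant, x2, x1
X = np.column_stack([ones, x2, x1])
b_full = np.linalg.lstsq(X, y, rcond=None)[0]

# partial out x1: residuals from x2 on (1, x1) and from y on (1, x1)
X1 = np.column_stack([ones, x1])
e2 = x2 - X1 @ np.linalg.lstsq(X1, x2, rcond=None)[0]
ey = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]

# regression of residualized y on residualized x2 gives the same beta2
beta2_resid = (e2 @ ey) / (e2 @ e2)
print(b_full[1], beta2_resid)               # both equal the coefficient on x2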

5.2 Projection and residual maker matrices


Partition of y
Least squares partitions the vector y into two orthogonal parts

y = ŷ + e = Xb + e = P y + M y.

ˆ n×1 vector of data y

ˆ n×n projection matrix P

ˆ n×n residual maker matrix M

ˆ n×1 vector of residuals e

Projection matrix

P y = Xb = X(X ′ X)−1 X ′ y

→ P = X(X ′ X)−1 X ′ .


Properties.

ˆ symmetric, P = P ′ , so the projection is orthogonal

ˆ idempotent, P = P ² , so it is indeed a projection

ˆ reproduces the regressors: P X = X

Example for projection matrix


Example

Show P X = X(X ′ X)−1 X ′ X = X.


   
Writing matrices row by row (rows separated by semicolons),

X = [1 0; 1 1; 1 0],   X′X = [3 1; 1 1],   (X′X)−1 = [1/2 −1/2; −1/2 3/2],

X(X′X)−1 X′ = [1/2 0 1/2; 0 1 0; 1/2 0 1/2],

P X = [1/2 0 1/2; 0 1 0; 1/2 0 1/2] [1 0; 1 1; 1 0] = [1 0; 1 1; 1 0].

Project y on the column space of X, i.e. regress y on X and predict E[y] = ŷ:

y = [1; 2; 3],   P y = [1/2 0 1/2; 0 1 0; 1/2 0 1/2] [1; 2; 3] = ŷ = [2; 2; 2].

Residual maker matrix

M y = e = y − Xb = y − X(X ′ X)−1 X ′ y
M y = (I − X(X ′ X)−1 X ′ )y

→ M = I − X(X ′ X)−1 X ′ = (I − P ).


Properties.

ˆ symmetric such that M = M′

ˆ idempotent such that M = M ²

ˆ annihilator matrix MX = 0

ˆ orthogonal to P : P M = M P = 0.

Example for residual maker matrix


Example

Show M X = (I − X(X ′ X)−1 X ′ )X = (I − P )X = X − X = 0.


   
I = [1 0 0; 0 1 0; 0 0 1],   X = [1 0; 1 1; 1 0],

M = I − P = [1 0 0; 0 1 0; 0 0 1] − [1/2 0 1/2; 0 1 0; 1/2 0 1/2] = [1/2 0 −1/2; 0 0 0; −1/2 0 1/2],

M X = [1/2 0 −1/2; 0 0 0; −1/2 0 1/2] [1 0; 1 1; 1 0] = [0 0; 0 0; 0 0].

Obtain the residuals from a projection of y on the column space of X, i.e. regress y on X and predict y − E[y] = y − ŷ:

y = [1; 2; 3],   M y = [1/2 0 −1/2; 0 0 0; −1/2 0 1/2] [1; 2; 3] = y − ŷ = [−1; 0; 1].

The column space of X is spanned by x0 = [1; 1; 1] and x1 = [0; 1; 0]. For y = [1; 2; 3], the fitted values are ŷ = [2; 2; 2] and the residuals are y − ŷ = [−1; 0; 1].

The closest point to the vector y′ = [1, 2, 3] in the column space of X is ŷ = Xb, here ŷ′ = [2, 2, 2]. At this point, we can draw a line orthogonal to the column space of X.
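The worked example above can be verified directly. The following is a minimal sketch in Python (only numpy is assumed):

import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 2.0, 3.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix
M = np.eye(3) - P                        # residual maker matrix

print(P @ y)    # fitted values  [2. 2. 2.]
print(M @ y)    # residuals      [-1. 0. 1.]
print(M @ X)    # annihilates X: all zeros
print(np.allclose(P @ P, P), np.allclose(P @ M, 0))   # idempotent, orthogonal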

Decomposing the normal equations

The normal equations in matrix form are X ′ Xb = X ′ y . (They are called normal equations because y − Xb is normal to the range of X .) If X is partitioned into an interesting segment X2 and an uninteresting segment X1 , the normal equations are

[X1′ X1  X1′ X2 ; X2′ X1  X2′ X2 ] [b1 ; b2 ] = [X1′ y ; X2′ y ].

The multiplication of the two block rows can be done separately:

[X1′ X1  X1′ X2 ] [b1 ; b2 ] = X1′ y ,    (3)

[X2′ X1  X2′ X2 ] [b1 ; b2 ] = X2′ y .    (4)

How can we find an expression for b2 that does not involve b1 ?

Solving for b2

Idea: solve equation (3) for b1 in terms of b2 , then substitute that solution into equation (4).

 
[X1′ X1  X1′ X2 ] [b1 ; b2 ] = X1′ y
X1′ X1 b1 + X1′ X2 b2 = X1′ y
X1′ X1 b1 = X1′ y − X1′ X2 b2
b1 = (X1′ X1 )−1 X1′ y − (X1′ X1 )−1 X1′ X2 b2
   = (X1′ X1 )−1 X1′ (y − X2 b2 ).

Multiplying out equation (4) gives

 
[X2′ X1  X2′ X2 ] [b1 ; b2 ] = X2′ y
X2′ X1 b1 + X2′ X2 b2 = X2′ y .

Plugging in the solution for b1 gives

 
X2′ X1 (X1′ X1 )−1 X1′ (y − X2 b2 ) + X2′ X2 b2 = X2′ y.

The middle part of the first term is X1 (X1′ X1 )−1 X1′ . This is the projection matrix PX1 from a regression of y on X1 .

X2′ PX1 y − X2′ PX1 X2 b2 + X2′ X2 b2 = X2′ y.

We can multiply by an identity matrix I without changing anything

X2′ PX1 y − X2′ PX1 X2 b2 + X2′ IX2 b2 = X2′ Iy.


X2′ Iy − X2′ PX1 y = X2′ IX2 b2 − X2′ PX1 X2 b2 .
X2′ (I − PX1 )y = X2′ (I − PX1 )X2 b2 .

Now (I − PX1 ) is the residual maker matrix MX1

X2′ MX1 y = X2′ MX1 X2 b2 .

Solving for b2 gives

b2 = (X2′ MX1 X2 )−1 X2′ MX1 y.


The residual maker matrix is symmetric (MX1 = MX1 ′ ) and idempotent (MX1 = MX1 MX1 ), such that MX1 = MX1 ′ MX1 .

b2 = (X2′ MX1 ′ MX1 X2 )−1 X2′ MX1 ′ MX1 y

   = [ (MX1 X2 )′ (MX1 X2 ) ]−1 (MX1 X2 )′ (MX1 y)

   = (X̃2′ X̃2 )−1 X̃2′ ỹ.

This is the OLS solution for b2 , with X̃2 instead of X and ỹ instead of y.

ˆ X̃2 are residuals from a regression of X2 on X1

ˆ ỹ are residuals from a regression of y on X1

The solution for the regression coefficients b2 in a regression that includes other regressors X1 is the same as first regressing X2 and y on X1 , and then regressing the residuals from the y regression on the residuals from the X2 regression.
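The partitioned formula b2 = (X2′ MX1 X2 )−1 X2′ MX1 y can also be checked numerically. The sketch below uses simulated data in Python; here X1 is assumed to contain the constant and the uninteresting regressor, and X2 the regressor of interest:

import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 0.5 + 1.5 * x2 + 2.0 * x1 + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x1])   # uninteresting block (with constant)
X2 = x2.reshape(-1, 1)                   # block of interest

# residual maker for X1 and the partitioned solution for b2
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
b2 = np.linalg.solve(X2.T @ M1 @ X2, X2.T @ M1 @ y)

# full OLS on [X1, X2] gives the same coefficient on x2
b_full = np.linalg.lstsq(np.column_stack([X1, X2]), y, rcond=None)[0]
print(b2, b_full[-1])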

6 The Maximum Likelihood Estimator
6.1 From Probability to Likelihood
The Likelihood Principle
Suppose you have three credit cards. You forgot which of them have money on them. Thus, the number of credit cards with money, call it θ, might be 0, 1, 2, or 3. You can try your cards four times at random to check whether you can make a payment.

The checks are random variables y1 , y2 , y3 , and y4 . They are


yi = 1, if the card tried in the ith check has money on it,
yi = 0, otherwise.

Since you chose the cards uniformly at random, the yi 's are i.i.d. and yi ∼ Bernoulli(θ/3). After checking, we find y1 = 1, y2 = 0, y3 = 1, y4 = 1: three checks succeed and one fails. The number of credit cards with money could still be 0, 1, 2, or 3.

Which value is most likely?

From Probability to Likelihood


You could test for the true θ0 in many samples. Conversely, you can check each possible value of θ to find the probability of observing the sample (y1 = 1, y2 = 0, y3 = 1, y4 = 1). Since yi ∼ Bernoulli(θ/3), we have

P rob(yi = y) = θ/3 for y = 1,   and   P rob(yi = y) = 1 − θ/3 for y = 0.

Since the yi 's are independent, the joint PMF of y1 , y2 , y3 , and y4 can be written as

P rob(y1 = y1 , y2 = y2 , y3 = y3 , y4 = y4 |θ) = P rob(y1 )P rob(y2 )P rob(y3 )P rob(y4 ).

This depends on θ and is called the likelihood function:

L(θ|y) = P rob(y1 = 1, y2 = 0, y3 = 1, y4 = 1|θ) = (θ/3)(1 − θ/3)(θ/3)(θ/3) = (θ/3)³ (1 − θ/3).

θ            0        1        2        3
L(θ|y)       0.0000   0.0247   0.0988   0.0000

Values of the likelihood L(θ|y) for different θ.


The probability of the observed sample for θ=0 and θ=3 is zero. This makes sense

because our sample included both cards with and without money. The observed data is

most likely to occur for θ = 2.
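The likelihood values in the table are easy to reproduce. A minimal sketch in Python (numpy assumed):

import numpy as np

theta = np.arange(4)                       # candidate numbers of cards with money
likelihood = (theta / 3) ** 3 * (1 - theta / 3)
print(np.round(likelihood, 4))             # [0.  0.0247  0.0988  0.  ]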

Likelihood principle: choose θ that maximizes the likelihood of observing the actual
sample to get an estimator for θ0 .
The likelihood is taken from the

ˆ probability mass function if the data are discrete,

ˆ probability density function if the data are continuous.

From Likelihood to Log-Likelihood

ˆ The likelihood function LN (θ|y, X) is the joint probability mass function or

density f (y, X|θ), viewed as a function of vector θ given the data (y, X).

ˆ Maximizing the likelihood LN (θ) is equivalent to maximizing the log-likelihood function ln LN (θ), because taking the logarithm is a monotonic transformation: a maximum of ln LN (θ) corresponds to a maximum of LN (θ). Some densities commonly used in maximum likelihood estimation are:

Model         Range of y        Density f (y)                         Common parametrization
Bernoulli     0 or 1            p^y (1 − p)^(1−y)                     p = e^(−x′β) / (1 + e^(−x′β))
Poisson       0, 1, 2, . . .    e^(−λ) λ^y / y!                       λ = e^(x′β)
Exponential   (0, ∞)            λ e^(−λy)                             λ = e^(x′β) or 1/λ = e^(x′β)
Normal        (−∞, ∞)           (2πσ²)^(−1/2) e^(−(y−µ)²/(2σ²))       µ = x′β, σ² = σ²

6.2 The Econometric Model


Specification of a Likelihood Function

The conditional likelihood LN (θ) = f (y, X|θ)/f (X|θ) = f (y|X, θ) does not require the specification of the marginal distribution of X .

For observations (yi , xi ) independent over i and distributed with f (y|X, θ),

ˆ the joint density is

f (y|X, θ) = Π_{i=1}^{N} f (yi |xi , θ),

ˆ the log-likelihood function divided by N is

(1/N ) LN (θ) = (1/N ) Σ_{i=1}^{N} ln f (yi |xi , θ).

Maximum Likelihood Estimator

The maximum likelihood estimator (MLE) is the estimator that maximizes the (conditional) log-likelihood function LN (θ). The MLE is the local maximum that solves the first-order conditions

(1/N ) ∂LN (θ)/∂θ = (1/N ) Σ_{i=1}^{N} ∂ ln f (yi |xi , θ)/∂θ = 0.

This estimator is an extremum estimator based on the conditional density of y given x. The gradient vector ∂LN (θ)/∂θ is called the score vector, as it sums the first derivatives of the log density; when evaluated at θ0 , it is called the efficient score.

How Were the Data Generated?


Simple Random Sampling

{xi1 , . . . , xiK , yi }, i = 1, . . . , N , i.i.d. (independent and identically distributed)

This assumption means that

ˆ observation i has no information content for observation j ̸= i

ˆ all observations i come from the same distribution

This assumption is guaranteed by simple random sampling provided there is no systematic

non-response or truncation.

I.i.d. data simplify the maximization as the joint density of the two variables is simply

the product of the two marginal densities.

For example, with a normal joint pdf and two observations,

f (y1 , y2 ) = fY1 (y1 ) fY2 (y2 ) = (1/(2πσ²)) exp{ −[(y1 − µ)² + (y2 − µ)²]/(2σ²) }.

With dependent observations we would have to maximize the following likelihood function, where ρ is the correlation:

(1/(2πσ² √(1 − ρ²))) exp{ −[(y1 − µ)² + (y2 − µ)² − 2ρ(y1 − µ)(y2 − µ)]/(2σ²(1 − ρ²)) }.

The Score has Expected Value Zero


Likelihood Equation:

Ef [g(θ)] = Ef [ ∂ ln f (y|x, θ)/∂θ ] = ∫ [ ∂ ln f (y|x, θ)/∂θ ] f (y|x, θ) dy = 0.

Example

Since f (y|θ) is a density, ∫ f (y|θ) dy = 1. Differentiating both sides with respect to θ gives

∂/∂θ ∫ f (y|θ) dy = ∫ ∂f (y|θ)/∂θ dy = 0.

Because ∂ ln f (y|θ)/∂θ = [∂f (y|θ)/∂θ]/f (y|θ), we have

∂f (y|θ)/∂θ = [ ∂ ln f (y|θ)/∂θ ] f (y|θ),

and therefore

∫ [ ∂ ln f (y|θ)/∂θ ] f (y|θ) dy = 0.

Fisher Information
The information matrix is the expectation of the outer product of the score vector,

 
I = Ef [ (∂ ln f (y|x, θ)/∂θ) (∂ ln f (y|x, θ)/∂θ′) ].

The Fisher information I equals the variance of the score, since ∂LN (θ)/∂θ has mean zero.

ˆ Large values of I mean that small changes in θ lead to large changes in the log-

likelihood

→ LN (θ) contains considerable information about θ,

ˆ Small values of I mean that the maximum is shallow and there are many nearby

values of θ with a similar log-likelihood.

Information Matrix Equality


The Fisher information I equals the negative of the expected Hessian H:

−Ef [H(θ)] = −Ef [ ∂² ln f (y|x, θ)/∂θ∂θ′ ] = Ef [ (∂ ln f (y|x, θ)/∂θ) (∂ ln f (y|x, θ)/∂θ′) ].

Example

For a vector moment function, e.g., the score m(y, θ) = ∂ ln f (y|θ)/∂θ with E[m(y, θ)] = 0,

∫ m(y, θ) f (y|θ) dy = 0.

Differentiating with respect to θ′ gives

∫ [ ∂m(y, θ)/∂θ′ f (y|θ) + m(y, θ) ∂f (y|θ)/∂θ′ ] dy = 0,

∫ [ ∂m(y, θ)/∂θ′ f (y|θ) + m(y, θ) (∂ ln f (y|θ)/∂θ′) f (y|θ) ] dy = 0,

E[ ∂m(y, θ)/∂θ′ ] = −E[ m(y, θ) ∂ ln f (y|θ)/∂θ′ ].

With m(y, θ) equal to the score, the left-hand side is the expected Hessian and the right-hand side is minus the expected outer product of the scores, which is the information matrix equality.

The Information Matrix in Practice


The variance of the sum of the random score vectors follows from the information matrix equality:

Var[ Σ_{i=1}^{n} gi (θ) ] = Var[g(θ)] = −Ef [H(θ)] = −E[ ∂² ln L/∂θ∂θ′ ].

After taking the expected value, θ̂ is substituted for θ. Problem: taking the expected value of the second derivative matrix is frequently infeasible.

There exist two alternatives which are asymptotically equivalent:

ˆ Ignore the expected value operator:

Î(θ̂) = −∂² ln L/∂ θ̂ ∂ θ̂′ .

ˆ Berndt-Hall-Hall-Hausman (BHHH) algorithm: never take a second derivative; instead sum the outer products of the scores (the first derivatives per observation):

Ǐ(θ̂) = Σ_{i=1}^{n} ĝi ĝi′ = Σ_{i=1}^{n} [ ∂ ln f (yi , θ̂)/∂ θ̂ ] [ ∂ ln f (yi , θ̂)/∂ θ̂ ]′ .
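As an illustration of both variance estimates, the sketch below works out the Bernoulli case in Python (numpy assumed; for this model the outer-product and Hessian estimates coincide exactly at the MLE):

import numpy as np

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.6, size=500)        # Bernoulli data with true p = 0.6
p_hat = y.mean()                          # MLE of p

# scores and Hessians per observation, evaluated at the MLE
scores = y / p_hat - (1 - y) / (1 - p_hat)
hessians = -y / p_hat**2 - (1 - y) / (1 - p_hat)**2

var_bhhh = 1 / np.sum(scores**2)          # outer product of the scores (BHHH)
var_hess = 1 / (-np.sum(hessians))        # observed information
print(var_bhhh, var_hess, p_hat * (1 - p_hat) / y.size)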

6.3 Properties of the Maximum Likelihood Estimator


Properties of the MLE
ˆ Small sample properties of θ̂
- may be biased
- may have unknown distribution
- variance may be biased, even towards zero

ˆ Large sample properties of θ̂
- consistent
- approx. normal
- asymptotically efficient
- invariant

Consistency
Law of Large Numbers
As N increases, the distribution of θ̂ becomes more tightly centered around θ.

(Figure: sampling distribution of θ̂ for (a) N = 3, (b) N = 10, (c) N = 25, (d) N = 100.)

Likelihood Inequality

E[(1/N ) ln L(θ0 )] ≥ E[(1/N ) ln L(θ)].

The expected value of the log-likelihood is maximized at the true value of the parameters.

Figure 15: θ̂, Likelihood and Log-Likelihood as n → ∞. True θ = 0.6.

lim_{n→∞} P (|θ̂ − θ| > ϵ) = 0,    lim_{n→∞} E[θ̂] = θ.

Approximate Normality
Central Limit Theorem
As N becomes large,
θ̂ is approximately distributed as N ( θ, [ −E( ∂² LN (θ)/∂θ∂θ′ ) ]−1 ).

Figure 16: Sampling distribution of θ̂ drawn from Bernoulli distribution and normal
distribution at N = 100. True θ = 0.6.
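A small Monte Carlo experiment illustrates the approximate normality. The sketch below mirrors the Bernoulli design with θ = 0.6 and N = 100 (Python with numpy assumed; the number of replications is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(3)
theta0, N, reps = 0.6, 100, 5000

# MLE of a Bernoulli parameter is the sample mean; repeat over many samples
theta_hats = rng.binomial(1, theta0, size=(reps, N)).mean(axis=1)

print(theta_hats.mean())                        # close to 0.6
print(theta_hats.std(ddof=1))                   # close to the asymptotic value
print(np.sqrt(theta0 * (1 - theta0) / N))       # sqrt(theta0*(1-theta0)/N), about 0.049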

Eciency
The precision of the estimate θ̂ is limited by the Fisher information I of the likelihood:

Var(θ̂) ≥ 1/I(θ).

For large samples, this is the so-called Cramér-Rao lower bound for the variance matrix of consistent, asymptotically normal estimators, with convergence to normality of √N (θ̂ − θ0 ) uniform in compact intervals of θ0 .
Under the strong assumption of correct specification of the conditional density, the MLE has the smallest asymptotic variance among root-N consistent estimators.

Example

For an unbiased estimator θ̂,

E[ θ̂ − θ | θ ] = ∫ ( θ̂ − θ ) f (y; θ) dy = 0 regardless of the value of θ.

This expression is zero independent of θ, so its partial derivative with respect to θ must

also be zero. By the product rule, this partial derivative is also equal to

0 = ∂/∂θ ∫ ( θ̂ − θ ) f (y; θ) dy = ∫ ( θ̂ − θ ) ∂f /∂θ dy − ∫ f dy.
For each θ, the likelihood function is a probability density function, and therefore ∫ f dy = 1. By using the chain rule on the partial derivative of ln f and then dividing and multiplying by f (y; θ), one can verify that

∂f /∂θ = f ∂ ln f /∂θ.
Using these two facts, we get

∫ ( θ̂ − θ ) (∂ ln f /∂θ) f dy = 1.

Factoring the integrand gives

∫ [ ( θ̂ − θ ) √f ] [ √f ∂ ln f /∂θ ] dy = 1.
Squaring the expression in the integral, the Cauchy-Schwarz inequality yields

1 = ( ∫ [ ( θ̂ − θ ) √f ] [ √f ∂ ln f /∂θ ] dy )² ≤ [ ∫ ( θ̂ − θ )² f dy ] · [ ∫ (∂ ln f /∂θ)² f dy ].

The first factor is the expected mean-squared error (the variance) of the estimator θ̂; the second factor is the Fisher information.

Invariance
The MLE of γ = c(θ) is γ̂ = c(θ̂) if c(θ) is a continuous and continuously differentiable function.
function.

ˆ This can simplify the log-likelihood,

ˆ This allows a function of θ̂ to serve as MLE if it is desired to analyze a function of an MLE.

Example

Suppose that the normal log-likelihood is parameterized in terms of the precision param-

eter, θ2 = 1/σ 2 . The log-likelihood becomes

ln L(µ, θ2 ) = −(N/2) ln(2π) + (N/2) ln θ2 − (θ2 /2) Σ_{i=1}^{N} (yi − µ)².

The MLE for µ is ȳ. But the likelihood equation for θ2 is now

∂ ln L(µ, θ2 )/∂θ2 = (1/2) [ N/θ2 − Σ_{i=1}^{N} (yi − µ)² ] = 0,

which has solution θ̂2 = N / Σ_{i=1}^{N} (yi − µ̂)² = 1/σ̂².
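The invariance property can be checked by maximizing the reparameterized log-likelihood numerically and comparing the maximizer with 1/σ̂². A sketch in Python (scipy assumed; the optimizer, its bounds, and the simulated sample are arbitrary illustrative choices):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
y = rng.normal(loc=2.0, scale=1.5, size=400)
mu_hat = y.mean()
sigma2_hat = np.mean((y - mu_hat) ** 2)          # MLE of sigma^2

def neg_loglik(theta2):
    # normal log-likelihood parameterized by the precision theta2 = 1/sigma^2
    n = y.size
    return -(-0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(theta2)
             - 0.5 * theta2 * np.sum((y - mu_hat) ** 2))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1 / sigma2_hat)                     # the two agree (invariance)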

The MLE is also equivariant with respect to certain transformations of the data.
If y = c(x) where c is one to one and does not depend on the parameters to be

estimated, then the density functions satisfy

fY (y) = fX (x)/|c′ (x)| ,

and hence the likelihood functions for x and y differ only by a factor that does not depend on the model parameters.


Example

The MLE parameters of the log-normal distribution are the same as those of the normal distribution fitted to the logarithm of the data.

7 The Generalized Method of Moments
7.1 How to choose from too many restrictions?
Minimize the quadratic form
The overidentified GMM estimator θ̂GMM (Wn ) for K parameters in θ identified by L > K moment conditions is a function of the weighting matrix Wn for a sample of i = 1, . . . , n observations:

θ̂GMM (Wn ) = argmin_θ qn (θ),

where the quadratic form qn (θ) is the criterion function and is given as a function of the sample moments m̄n (θ),

qn (θ) = m̄n (θ)′ Wn m̄n (θ).

The sample moments

m̄n (θ) = (1/n) Σ_{i=1}^{n} m(Xi , Zi , θ)

are a function of the model variables Xi , the instruments Zi , and the parameter vector θ.

What are the properties of the quadratic form

qn (θ) = m̄n (θ)′ Wn m̄n (θ) ?
(1×1)    (1×L)  (L×L) (L×1)

The quadratic form criterion function qn (θ) ≥ 0 is a scalar. The weighting matrix Wn is symmetric and positive definite, that is, x′ Wn x > 0 for all non-zero x.

7.2 Get the sampling error (at least approximately)


Get an approximate deviation from the true θ0
First order Taylor expansion of sample moments m̄n (θ̂GM M ) around m̄n (θ0 ) at true pa-

rameters gives:

m̄n (θ̂GM M ) ≈ m̄n (θ0 ) + Ḡn (θ̄)(θ̂GM M − θ0 ),

where Ḡn (θ̄) = ∂ m̄n (θ̄)/∂ θ̄′ and θ̄ is a point between θ̂GMM and θ0 .

Check the dimensions


First order Taylor expansion of sample moments m̄n (θ̂GM M ) around m̄n (θ0 ) at true pa-

rameters gives:

m̄n (θ̂GMM ) ≈ m̄n (θ0 ) + Ḡn (θ̄)(θ̂GMM − θ0 ),
 (L×1)       (L×1)     (L×K)   (K×1)

where Ḡn (θ̄) = ∂ m̄n (θ̄)/∂ θ̄′ (the derivative of an L×1 vector with respect to a 1×K vector, hence L×K) and θ̄ is a point between θ̂GMM and θ0 , because of the mean value theorem.

Approximation introduced θ̄
...where Ḡn (θ̄) = ∂ m̄n (θ̄)/∂ θ̄′ and θ̄ is a point between θ̂GMM and θ0 .

Mean value theorem

Ḡn (θ̄) = [ m̄n (θ̂GMM ) − m̄n (θ0 ) ] / ( θ̂GMM − θ0 )   for θ0 < θ̄ < θ̂GMM .

Do the minimization
To minimize the quadratic form criterion qn (θ) = m̄n (θ)′ Wn m̄n (θ), we take the first derivative and set it to zero:

∂qn (θ̂GMM )/∂ θ̂GMM = 2Ḡn (θ̂GMM )′ Wn m̄n (θ̂GMM ) = 0.

Express as much as possible asymptotically


∂qn (θ̂GMM )/∂ θ̂GMM = 2Ḡn (θ̂GMM )′ Wn m̄n (θ̂GMM ) = 0.
Plug in the approximation from before

m̄n (θ̂GM M ) ≈ m̄n (θ0 ) + Ḡn (θ̄)(θ̂GM M − θ0 )

to obtain

Ḡn (θ̂GM M )′ Wn m̄n (θ0 ) + Ḡn (θ̂GM M )′ Wn Ḡn (θ̄)(θ̂GM M − θ0 ) ≈ 0

which we rearrange to get the very useful

θ̂GM M ≈ θ0 − (Ḡn (θ̂GM M )′ Wn Ḡn (θ̄))−1 Ḡn (θ̂GM M )′ Wn m̄n (θ0 ).

So the estimate θ̂GM M is approximately the true parameter θ0 plus a sampling error that

depends on the sample moment m̄n (θ0 ).

Quickly check dimensions


Useful approximation

θ̂GMM ≈ θ0 − ( Ḡn (θ̂GMM )′ Wn Ḡn (θ̄) )−1 Ḡn (θ̂GMM )′ Wn m̄n (θ0 ).
 (K×1)  (K×1)   (K×L)  (L×L) (L×K)       (K×L)   (L×L)  (L×1)

So the estimate θ̂GM M is approximately the true parameter θ0 plus a sampling error that

depends on the sample moment m̄n (θ0 ).

7.3 The econometric model
Three assumptions: moment conditions
GMM1: Moment Conditions and Identification

m̄(θa ) ̸= m̄(θ0 ) = E[m(Xi , Zi , θ0 )] = 0   for any θa ̸= θ0 .

Identification implies that the probability limit of the GMM criterion function is uniquely minimized at the true parameters.

Three assumptions: law of large numbers


GMM2: Law of Large Numbers Applies

m̄n (θ) = (1/n) Σ_{i=1}^{n} m(Xi , Zi , θ) →p E[m(Xi , Zi , θ)].

The data meets the conditions for a law of large numbers to apply, so that we may assume

that the empirical moments converge in probability to their expectation.

Three assumptions: central limit theorem


GMM3: Central Limit Theorem Applies

√n m̄n (θ0 ) = (1/√n) Σ_{i=1}^{n} m(Xi , Zi , θ0 ) →d N [0, Φ].

The empirical moments obey a central limit theorem. This assumes that the moments have a finite asymptotic covariance matrix E[m(Xi , Zi , θ0 )m(Xi , Zi , θ0 )′ ] = Φ.

7.4 Consistency
Recall the useful approximation of the estimator:

θ̂GM M ≈ θ0 − (Ḡn (θ̂GM M )′ Wn Ḡn (θ̄))−1 Ḡn (θ̂GM M )′ Wn m̄n (θ0 ).

Assumption GMM2 implies that

m̄n (θ0 ) = (1/n) Σ_{i=1}^{n} m(Xi , Zi , θ0 ) →p E[m(Xi , Zi , θ0 )] = m̄(θ0 ).

That is, the sample moment equals the population moment in probability. Assumption

GMM1 implies that

m̄(θ0 ) = 0.

Then

m̄n (θ0 ) →p m̄(θ0 ) = 0,

such that

θ̂GMM →p θ0   as n → ∞.

That is, by GMM1 and GMM2 the GMM estimator is consistent.

7.5 Asymptotic normality


Recall the useful approximation of the estimator:

θ̂GM M ≈ θ0 − (Ḡn (θ̂GM M )′ Wn Ḡn (θ̄))−1 Ḡn (θ̂GM M )′ Wn m̄n (θ0 ).

Rewrite to obtain

√n (θ̂GMM − θ0 ) ≈ −( Ḡn (θ̂GMM )′ Wn Ḡn (θ̄) )−1 Ḡn (θ̂GMM )′ Wn √n m̄n (θ0 ).

The right hand side has several parts for which we made assumptions on what happens

when N → ∞. Under the central limit theorem (GMM3)

√n m̄n (θ0 ) →d N [0, Φ],

plim Wn = W,

plim Ḡn (θ̂GMM ) = plim Ḡn (θ̄) = E[ ∂m(Xi , Zi , θ0 )/∂θ0′ ] = ∂ m̄(θ0 )/∂θ0′ = Γ(θ0 ).
With plimWn = W and

plimḠn (θ̂GM M ) = plimḠn (θ̄) = Γ(θ0 )

the expression

√n (θ̂GMM − θ0 ) ≈ −( Ḡn (θ̂GMM )′ Wn Ḡn (θ̄) )−1 Ḡn (θ̂GMM )′ Wn √n m̄n (θ0 )

becomes

√n (θ̂GMM − θ0 ) ≈ −( Γ(θ0 )′ W Γ(θ0 ) )−1 Γ(θ0 )′ W √n m̄n (θ0 ),

from which we get the asymptotic variance. So θ̂GMM is approximately distributed as N [θ0 , V ] with the K×K matrix

V = (1/n) [Γ(θ0 )′ W Γ(θ0 )]−1 [Γ(θ0 )′ W ΦW ′ Γ(θ0 )] [Γ(θ0 )′ W Γ(θ0 )]−1 .

That is, by GMM1, GMM2, and GMM3 the GMM estimator is asymptotically normal.

7.6 Asymptotic efficiency


Which weighting matrix W gives the smallest possible asymptotic variance of the GMM estimator θ̂GMM ?
The variance of the GMM estimator V depends on the choice of W

V = 1/n[Γ(θ0 )′ W Γ(θ0 )]−1 [Γ(θ0 )′ W ΦW ′ Γ(θ0 )][Γ(θ0 )′ W Γ(θ0 )]−1

So let us minimize V to get the optimal weight matrix. Try from GMM3

plim Wn = W = Φ−1   as n → ∞.


VGM M,optimal = 1/n[Γ(θ0 )′ Φ−1 Γ(θ0 )]−1 [Γ(θ0 )′ Φ−1 ΦΦ−1 Γ(θ0 )][Γ(θ0 )′ Φ−1 Γ(θ0 )]−1

which can be simplified to

VGM M,optimal = 1/n[Γ(θ0 )′ Φ−1 Γ(θ0 )]−1


If Φ is small, there is little variation of this specific sample moment around zero and the moment condition is very informative about θ0 . So it is best to assign a high weight to it.


If Γ is large, there is a large penalty from violating the moment condition by evaluating
at θ ̸= θ0 . Then the moment condition is very informative about θ0 . V is inversely related

to Γ.

Estimate the variance in practice
V̂GMM,optimal = (1/n) [ Ḡn (θ̂)′ Φ̂n−1 Ḡn (θ̂) ]−1 ,

with the consistent estimators

Φ̂n = n V̂( m̄n (θ̂) ),

Ḡn (θ̂) = ∂ m̄n (θ̂)/∂ θ̂′ = (1/n) Σ_{i=1}^{n} ∂m(Xi , Zi , θ̂)/∂ θ̂′ .
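A minimal two-step GMM sketch for a linear model with one endogenous regressor and two instruments is given below in Python. The simulated data, the identity first-step weighting matrix, and all names are illustrative assumptions, not part of the text above:

import numpy as np

rng = np.random.default_rng(5)
n = 2000
z1, z2, v = rng.normal(size=(3, n))
x = 0.5 * z1 + 0.5 * z2 + v                      # endogenous regressor
u = 0.5 * v + rng.normal(size=n)                 # error correlated with x
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])             # K = 2 parameters
Z = np.column_stack([np.ones(n), z1, z2])        # L = 3 moment conditions

def gmm_step(W):
    # solve (X'Z W Z'X) b = X'Z W Z'y
    A = X.T @ Z @ W @ Z.T @ X
    c = X.T @ Z @ W @ Z.T @ y
    return np.linalg.solve(A, c)

# step 1: identity weighting matrix
b1 = gmm_step(np.eye(Z.shape[1]))

# step 2: optimal weighting matrix W = Phi^{-1}, Phi = (1/n) sum u_i^2 z_i z_i'
u_hat = y - X @ b1
Phi = (Z * u_hat[:, None]).T @ (Z * u_hat[:, None]) / n
b2 = gmm_step(np.linalg.inv(Phi))

# estimated variance: V = (1/n) [G' Phi^{-1} G]^{-1}, with G = Z'X / n (sign irrelevant)
G = Z.T @ X / n
V = np.linalg.inv(G.T @ np.linalg.inv(Phi) @ G) / n
print(b2, np.sqrt(np.diag(V)))                   # estimates and standard errors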

8 Conclusion
Congratulations! If you made it through this document, you are ready to read some econo-

metrics papers, program and develop new estimators, and analyze statistical properties.

If this caught your interest, check out non-parametric and Bayesian econometrics.

109
References
Angrist, J. D., and J.-S. Pischke (2009): Mostly Harmless Econometrics: An Em-
piricist's Companion . Princeton University Press.

Cameron, A. C., and P. K. Trivedi (2005): Microeconometrics: Methods and Appli-


cations. Cambridge University Press, 3rd edn.

Filoso, V. (2013): Regression Anatomy, Revealed, The Stata Journal, 13(1), 92–106.

Frisch, R., and F. V. Waugh (1933): Partial Time Regressions as Compared with

Individual Trends, Econometrica, 1(4), 387–401.

Greene, W. H. (2011): Econometric Analysis . Prentice Hall, 5th edn.

Hansen, L. P. (1982): Large Sample Properties of Generalized Method of Moments

Estimators, Econometrica, 50(4), 1029–1054.

(2012): Proofs for large sample properties of generalized method of moments

estimators, Journal of Econometrics, 170(2), 325–330, Thirtieth Anniversary of Gen-

eralized Method of Moments.

Hill, R. C., W. E. Griffiths, and G. C. Lim (2010): Principles of Econometrics .

John Wiley & Sons, 4th edn.

Kennedy, P. (2008): A Guide to Econometrics . Blackwell Publishing, 6th edn., In par-

ticular, Chapters 7 and 8.1–8.3.

Lovell, M. C. (2008): A Simple Proof of the FWL Theorem, The Journal of Economic
Education, 39(1), 88–91.

Pishro-Nik, H. (2014): Introduction to Probability, Statistics, and Random Processes .

Kappa Research LLC.

Plackett, R. L. (1972): Studies in the History of Probability and Statistics. XXIX:

The discovery of the method of least squares, Biometrika, 59(2), 239–251.

Rostam-Afschar, D., and R. Jessen (2014):  GRAPH3D: Stata module to draw

colored, scalable, rotatable 3D plots, Statistical Software Components, Boston College

Department of Economics.

Stock, J. H., and M. W. Watson (2012): Introduction to Econometrics . Pearson

Addison-Wesley, 3rd edn.

Verbeek, M. (2012): A Guide to Modern Econometrics . John Wiley & Sons, 3rd edn.

Wooldridge, J. M. (2002): Econometric Analysis of Cross Section and Panel Data .

MIT Press.

(2009): Introductory Econometrics: A Modern Approach. Cengage Learning, 4th edn.

