
Lecture Notes on Maximum Likelihood Estimation

Michael Peress*

December 30, 2024

* Department of Political Science, SUNY-Stony Brook. [email protected]


Contents

1 Overview and Logit
1.1 What Makes a Model Nonlinear?
1.2 Types of Data
1.3 Maximum Likelihood Estimation
1.4 Nonlinear Model Complications

2 Logit
2.1 Estimation
2.2 Marginal Effects
2.3 Measures of Model Fit
2.4 When is Using a Logit Model Necessary
2.5 Applications
2.6 Suggested Reading
2.6.1 Background

3 Numerical Optimization
3.1 The Need for Numerical Optimization
3.2 One-Dimensional Root-Finding
3.3 One-Dimensional Optimization
3.4 Multi-Dimensional Optimization
3.5 Checking for an Optimum
3.6 Suggested Reading
3.6.1 Background

4 Theory of MLE
4.1 Examples
4.2 Consistency
4.3 Identification of a Statistical Model
4.4 Asymptotic Normality
4.5 Efficiency and the Cramer-Rao Lower Bound
4.6 Suggested Reading
4.6.1 Background
4.6.2 Advanced

5 Probit, More on Logit, and Ordered Probit
5.1 The Probit Model
5.2 Perfect Separation
5.3 Testing Hypotheses
5.4 Interactions in Binary Choice Models
5.5 Ordered Probit
5.6 Ordered Logit
5.7 Applications
5.8 Suggested Reading
5.8.1 Background
5.8.2 Examples

6 Multinomial Choice
6.1 Multinomial Logit
6.2 Conditional Logit
6.3 The IIA Property
6.4 Multinomial Probit
6.5 Substantive Effects
6.6 Measures of Model Fit
6.7 Suggested Reading
6.7.1 Background
6.7.2 Examples

7 Count Models
7.1 Poisson Regression
7.2 Negative Binomial
7.3 Zero-Inflated Poisson Regression
7.4 Semi-parametric Analysis of the Count Regression Model
7.5 Model Fit for Parametric Models
7.6 Suggested Reading
7.6.1 Background
7.6.2 Examples

8 Censoring, Selection, and Truncation
8.1 The Tobit Model (Censored Regression)
8.2 Truncated Regression
8.3 The Heckman Selection Model
8.4 Applications
8.5 Suggested Reading
8.5.1 Background
8.5.2 Examples

9 Duration Models
9.1 Survival Functions and Hazard Rates
9.2 Parametric Models
9.3 The Cox Proportional Hazard Model
9.4 Discrete Duration Models
9.5 Time Varying Covariates in Continuous Models
9.6 Suggested Reading
9.6.1 Background
9.6.2 Examples

10 Monte Carlo Simulation and the Bootstrap
10.1 Monte Carlo Simulation
10.2 The Bootstrap
10.3 Recommended Reading
10.3.1 Background
10.3.2 Advanced

11 Nonlinear Panel Data
11.1 Clustered Standard Errors
11.2 Nonlinear Fixed Effects Models
11.3 Conditional Fixed Effects Estimators
11.4 Random Effects Estimators
11.5 Generalized Estimating Equations
11.6 Recommended Reading
11.6.1 Background
11.6.2 Examples
11.6.3 Advanced

12 Time Series Dependence in Nonlinear Models
12.1 Parametric Models
12.2 Semiparametric Approach
12.3 Recommended Reading
12.3.1 Advanced

13 References

A Appendix
A.1 Proof that Asymptotic Normality Implies Asymptotic Unbiasedness
A.2 Derivation of the Expected Value of yn in the Heckman Selection Model

1 Overview and Logit
1.1 What Makes a Model Nonlinear?
Consider the classical linear regression model,

y_n = \beta_0' x_n + \varepsilon_n    (1)

with E[\varepsilon_n | x_n] = 0. Here, we have E[y_n | x_n] = \beta_0' x_n, so that the expectation of the dependent variable is linear in the independent variables. Consider alternatively the logistic regression model,

\Pr(y_n = 1 | x_n) = \frac{e^{\beta_0' x_n}}{1 + e^{\beta_0' x_n}}    (2)

We have that,

E[y_n | x_n] = \Pr(y_n = 1 | x_n) = \frac{e^{\beta_0' x_n}}{1 + e^{\beta_0' x_n}}    (3)

so that the expectation of the dependent variable is not linear in the independent variables.

1.2 Types of Data


An important motivation for considering nonlinear models is the need to handle special types of data. In the classical linear model, the dependent variable is continuous, has an interval scale, and ranges from negative infinity to infinity. The linear regression framework can handle deviations from this; it can often be applied as long as the expectation of the DV conditional on the IVs is linear in the IVs, but this approach has some limitations. We can consider the following alternative forms of dependent variables:

- Logit and Probit: The dependent variable is binary

- Ordered Probit: The dependent variable is ordered, but does not have an interval-level scale

- Multinomial Logit, Multinomial Probit, and Conditional Logit: The dependent variable is discrete and unordered

- Tobit: The dependent variable is censored at 0 (or possibly other values). The dependent variable in this case is neither discrete nor continuous; it is "mixed"

- Poisson and Negative Binomial Regression: The dependent variable is discrete and weakly positive (i.e., count data)

For each of these types of variables, we cannot avoid nonlinearity of the expectation. To see this, consider the case of a binary dependent variable. If E[y_n | x_n] = \Pr(y_n = 1 | x_n) = \beta_0' x_n, then for x_n such that \beta_0' x_n is small (large), \Pr(y_n = 1 | x_n) must dip below zero (rise above one), so this is not a proper probability model unless the range of x_n is restricted.
For each of the above models, we specify the probability distribution of the dependent variable conditional on the independent variables. When the dependent variable is discrete, we specify the probability mass function of the DV given the IVs. When the dependent variable is continuous, we specify the density of the DV given the IVs. The case where the dependent variable is mixed is somewhat more complicated to deal with, but we can characterize the dependent variable using the cumulative distribution function. We write the probability model of the data as follows:

- p_{Y|X}(y | x; \theta) (y is discrete)

- f_{Y|X}(y | x; \theta) (y is continuous)

- F_{Y|X}(y | x; \theta) (y is mixed)

In most cases, we will assume that (y_n, x_n) are independent, in which case we have,

- p_{Y|X}(y | x; \theta) = \prod_{n=1}^{N} p_{Y_n|X_n}(y_n | x_n; \theta) (y is discrete)

- f_{Y|X}(y | x; \theta) = \prod_{n=1}^{N} f_{Y_n|X_n}(y_n | x_n; \theta) (y is continuous)

- F_{Y|X}(y | x; \theta) = \prod_{n=1}^{N} F_{Y_n|X_n}(y_n | x_n; \theta) (y is mixed)

1.3 Maximum Likelihood Estimation


Given the types of models described above, maximum likelihood estimation is a procedure for deriving an estimator from a probability model. The MLE is given by,¹

- \hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} p_{Y|X}(y | x; \theta) (y is discrete)

- \hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} f_{Y|X}(y | x; \theta) (y is continuous)

¹ The definition of the MLE for mixed random variables is more complicated, so we will deal with it when we consider the Tobit model.

Here, L(\theta) = p_{Y|X}(y | x; \theta) in the discrete case and L(\theta) = f_{Y|X}(y | x; \theta) in the continuous case is called the "likelihood", which is the probability of observing the data in the discrete case and the density of the data in the continuous case. The MLE selects the parameters that make it most likely that we would observe the data we actually observed. When (y_n, x_n) are independent, we have,

- L(\theta) = \prod_{n=1}^{N} p_{Y_n|X_n}(y_n | x_n; \theta) (y is discrete)

- L(\theta) = \prod_{n=1}^{N} f_{Y_n|X_n}(y_n | x_n; \theta) (y is continuous)

We define the log-likelihood l(\theta) as the log of the likelihood,

- l(\theta) = \sum_{n=1}^{N} \log p_{Y_n|X_n}(y_n | x_n; \theta) (y is discrete)

- l(\theta) = \sum_{n=1}^{N} \log f_{Y_n|X_n}(y_n | x_n; \theta) (y is continuous)

Maximizing the log-likelihood is equivalent to maximizing the likelihood (since log is a monotonically increasing transformation). It is often more convenient to maximize the log-likelihood because the derivatives of the log-likelihood are easier to compute (the product in the likelihood means we would have to apply the product rule N times) and because the product in the definition of the likelihood can often lead to having to work with very small numbers.
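As a small illustration of the numerical point, the following is a minimal Python sketch (with made-up Bernoulli probabilities, not an example from the text): multiplying many probabilities underflows to zero in double-precision arithmetic, while summing their logs remains perfectly well behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, size=5000)   # hypothetical per-observation probabilities

likelihood = np.prod(p)                # underflows to exactly 0.0 in double precision
log_likelihood = np.sum(np.log(p))     # an ordinary finite number

print(likelihood)       # 0.0
print(log_likelihood)   # roughly -4150
```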
Under general conditions, the MLE is guaranteed to have several “nice” properties. The
MLE is not the only such procedure—Method of Moments, Generalized Method of Moments,
and Bayesian estimation are alternative procedures for automatically generating an estimator
for a statistical model that is guaranteed to have nice properties. Under specified conditions,
MLEs can be shown to have the following nice properties:

1. Invariance

2. Consistency

3. Asymptotic Normality

4. Asymptotic Efficiency

5. The Information Equality

We can compare this to the list of nice properties of the ordinary least squares estimator for
linear regression:

1. Unbiasedness

2. Known finite sample distribution (normal, chi-squared, t, etc.)

3. Gauss-Markov Theorem

4. Rao-Blackwell Theorem

5. Consistency

6. Asymptotic Normality

7. Efficiency

Besides invariance (which is a very weak property), the nice properties of MLEs only hold
in large samples. The ordinary least squares estimator has nice properties that hold in finite
samples.

1.4 Nonlinear Model Complications


Nonlinear models present a number of complications. In general, there will not be an analytical solution to the maximization problem that defines the MLE.² The lack of analytical expressions for the estimator means we cannot use simple math to derive the properties of MLEs. Marginal effects are also more complicated to calculate in nonlinear models.

² We have already encountered this problem for some estimators of the linear model, such as the MLEs for the ARMA model and the linear model with heteroskedastic errors.

2 Logit
Consider a binary dependent variable y_n \in \{0, 1\}. We can specify a probability model for y|x by assuming that (y_n, x_n) are independent and specifying \Pr(y_n = 1 | x_n); this completes the model since \Pr(y_n = 0 | x_n) = 1 - \Pr(y_n = 1 | x_n). We would like the model to have the property that 0 \le \Pr(y_n = 1 | x_n) \le 1 for all x_n, and for the model to depend on x_n only through \beta' x_n. There are many possibilities; one possibility is,

\Pr(y_n = 1 | x_n) = \frac{e^{\beta_0' x_n}}{1 + e^{\beta_0' x_n}}    (4)

Note that \frac{e^z}{1 + e^z} is monotonically increasing in z, with \lim_{z \to -\infty} \frac{e^z}{1 + e^z} = 0 and \lim_{z \to \infty} \frac{e^z}{1 + e^z} = 1.
One way to motivate the logistic regression model is as a model where a latent variable is linear. We have,

y_n^* = \beta' x_n + \varepsilon_n    (5)

where y_n = 1\{y_n^* \ge 0\}. The logistic distribution has density function,

f(z; \mu, s) = \frac{e^{-\frac{z-\mu}{s}}}{s\left(1 + e^{-\frac{z-\mu}{s}}\right)^2}    (6)

and cumulative distribution function,

F(z; \mu, s) = \frac{1}{1 + e^{-\frac{z-\mu}{s}}}    (7)

where \mu is a location parameter and s is a scale parameter. The mean, median, and mode of the logistic distribution are all \mu, and the variance is s^2 \pi^2 / 3. If we take \varepsilon_n to have the Logistic(0, 1) distribution, we have that F_{\varepsilon_n}(z) = \frac{1}{1 + e^{-z}}. Consequently,

\Pr(y_n = 1 | x_n) = \Pr(y_n^* \ge 0 | x_n) = \Pr(\beta' x_n + \varepsilon_n \ge 0 | x_n) = \Pr(\varepsilon_n \ge -\beta' x_n | x_n)    (8)

= 1 - \Pr(\varepsilon_n \le -\beta' x_n | x_n) = 1 - F_{\varepsilon_n}(-\beta' x_n) = 1 - \frac{1}{1 + e^{\beta' x_n}} = \frac{1 + e^{\beta' x_n}}{1 + e^{\beta' x_n}} - \frac{1}{1 + e^{\beta' x_n}} = \frac{e^{\beta' x_n}}{1 + e^{\beta' x_n}}

Alternatively, selecting \varepsilon_n \sim N(0, 1) yields the (binomial) probit model,

\Pr(y_n = 1 | x_n) = \Phi(\beta_0' x_n)    (9)

Other choices of εn will lead to alternative models, but these are not common.
In many cases, we can think of yn as a censored random variable. For example, let yn∗
be the two-party vote share in excess of 50% of the incumbent candidate for governor. We
could imagine a variable yn = 1{yn∗ ≥ 0}, in which case yn = 1 if the incumbent governor is

reelected. Alternatively, we could let yn∗ denote the preference for voting for a Republican
candidate. In this case, yn = 1 could denote a survey respondent indicating a preference
for the Republican candidate. The variable yn∗ is not observable, but we can still think of
elections as censoring a latent preference between the two parties. For almost any binary
DV, we can think of a theoretically continuous DV that is censored—for civil war, we can
imagine a continuum under which certain events or characteristics of countries make civil
wars more likely. When a threshold is reached, a civil war is observed.

2.1 Estimation
To estimate the logit model, we start with the definition of the MLE,

\hat{\beta}_{MLE} = \arg\max_{\beta \in B} l(\beta) = \sum_{n=1}^{N} \log p_{Y_n|X_n}(y_n | x_n; \beta)    (10)

= \arg\max_{\beta \in B} \sum_{n=1}^{N} \log\left[ \Pr(y_n = 1 | x_n)^{y_n} \Pr(y_n = 0 | x_n)^{1-y_n} \right]

= \arg\max_{\beta \in B} \sum_{n=1}^{N} \left[ y_n \log \Pr(y_n = 1 | x_n) + (1 - y_n) \log \Pr(y_n = 0 | x_n) \right]

= \arg\max_{\beta \in B} \sum_{n=1}^{N} \left[ y_n \log\left( \frac{e^{\beta' x_n}}{1 + e^{\beta' x_n}} \right) + (1 - y_n) \log\left( \frac{1}{1 + e^{\beta' x_n}} \right) \right]

It is tempting to solve this problem by taking first-order conditions, but this approach does not lead anywhere good. Instead, we use numerical optimization to maximize l(\beta). To compute standard errors for \hat{\beta}_{MLE}, we rely on the following result,

\sqrt{N}(\hat{\beta}_{MLE} - \beta_0) \xrightarrow{dist.} N(0, V_0)    (11)

where,

V_0 = E\left[ -\frac{\partial^2}{\partial\beta\,\partial\beta'} \log p_{Y_n|X_n}(y_n | x_n; \beta_0) \right]^{-1}    (12)

To estimate V_0, we use \hat{V} = \left( -\frac{1}{N} l_{\beta\beta}(\hat{\beta}_{MLE}) \right)^{-1}, where l_{\beta\beta} is the matrix of second derivatives of the log-likelihood function evaluated at the MLE estimate \hat{\beta}_{MLE}. To calculate l_{\beta\beta}, we would again typically use numerical methods rather than computing it by hand.
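To make the recipe concrete, here is a minimal Python sketch (not the author's code; the simulated data, the use of scipy, and the finite-difference Hessian are assumptions). It maximizes the logit log-likelihood numerically and computes standard errors from the inverse of the Hessian of the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

def logit_negloglik(beta, y, X):
    """Negative log-likelihood of the logit model."""
    xb = X @ beta
    # y*log(p) + (1-y)*log(1-p) simplifies to y*xb - log(1 + e^{xb})
    return -np.sum(y * xb - np.logaddexp(0.0, xb))

# Simulated data for illustration only
rng = np.random.default_rng(1)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

# Maximize l(beta) by minimizing -l(beta) with a quasi-Newton method
res = minimize(logit_negloglik, x0=np.zeros(2), args=(y, X), method="BFGS")
beta_hat = res.x

def numerical_hessian(f, x, h=1e-5):
    """Finite-difference Hessian, as in the formulas of Section 3.4."""
    k = len(x)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei) - f(x + ej) + f(x)) / h**2
    return H

# Standard errors: inverse of the Hessian of -l evaluated at the MLE
H = numerical_hessian(lambda b: logit_negloglik(b, y, X), beta_hat)
se = np.sqrt(np.diag(np.linalg.inv(H)))
print(beta_hat, se)
```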

2.2 Marginal Effects
Consider the linear probability model, \Pr(y_n = 1 | x_n) = \beta' x_n. In this case, marginal effects are very easy to calculate,

\frac{\partial}{\partial x_{nk}} \Pr(y_n = 1 | x_n) = \beta_k    (13)

The marginal effect of an IV on the DV is given by a single number, which is just the estimated coefficient. For the logit model, things are more complicated,

\frac{\partial}{\partial x_{nk}} \frac{e^{\beta' x_n}}{1 + e^{\beta' x_n}} = \frac{(1 + e^{\beta' x_n}) e^{\beta' x_n} \beta_k - e^{\beta' x_n} e^{\beta' x_n} \beta_k}{(1 + e^{\beta' x_n})^2} = \frac{e^{\beta' x_n}}{(1 + e^{\beta' x_n})^2} \beta_k = \pi_n (1 - \pi_n) \beta_k    (14)

where \pi_n = \frac{e^{\beta' x_n}}{1 + e^{\beta' x_n}}. Since 0 < \pi_n < 1, the sign of the marginal effect is the same as the sign of the estimated coefficient, but the actual marginal effect depends on the entire vector x_n. If we start with a value of \beta' x_n such that \pi_n = \frac{e^{\beta' x_n}}{1 + e^{\beta' x_n}} is far from \frac{1}{2}, the marginal effect will be smaller.
We can compute the marginal effect for a given individual using \hat{\pi}_n (1 - \hat{\pi}_n) \hat{\beta}_k, where \hat{\pi}_n = \frac{e^{\hat{\beta}' x_n}}{1 + e^{\hat{\beta}' x_n}}, but then we have a separate marginal effect for each point in our sample, or a marginal effect for each hypothetical value of x_n. Two alternative approaches are used. In the first approach, we compute the marginal effect at the mean value of x_n in the sample, \bar{x}. This marginal effect is given by,

\widehat{ME}_1 = \frac{e^{\hat{\beta}' \bar{x}}}{(1 + e^{\hat{\beta}' \bar{x}})^2} \hat{\beta}_k    (15)

This can be interpreted as the marginal effect for an individual whose independent variables are the average in the sample. An alternative is to compute the marginal effect for each unit in the sample and report the average,

\widehat{ME}_2 = \frac{1}{N} \sum_{n=1}^{N} \frac{e^{\hat{\beta}' x_n}}{(1 + e^{\hat{\beta}' x_n})^2} \hat{\beta}_k    (16)

This can be interpreted as the average marginal effect in the sample.
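The two quantities are easy to compute once \hat{\beta} is in hand. The following is a minimal Python sketch (the estimates and design matrix are hypothetical stand-ins, not values from the text):

```python
import numpy as np

def logit_prob(beta, X):
    """Pr(y=1|x) under the logit model."""
    return 1 / (1 + np.exp(-(X @ beta)))

# Hypothetical estimates and data, standing in for beta_hat and the sample design matrix
beta_hat = np.array([-0.5, 1.0, 0.3])
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=500), rng.binomial(1, 0.4, size=500)])
k = 1  # covariate whose marginal effect we want

# ME1: marginal effect evaluated at the sample mean of x
p_bar = logit_prob(beta_hat, X.mean(axis=0))
ME1 = p_bar * (1 - p_bar) * beta_hat[k]

# ME2: average over the sample of the individual effects pi_n (1 - pi_n) beta_k
p_n = logit_prob(beta_hat, X)
ME2 = np.mean(p_n * (1 - p_n)) * beta_hat[k]

print(ME1, ME2)
```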
Instead of computing the marginal effect, we can compute the change in the predicted probability for a specific change in one of the independent variables. We could consider,

\frac{e^{\hat{\beta}_{-k}' x_{n,-k} + \hat{\beta}_k \delta_1}}{1 + e^{\hat{\beta}_{-k}' x_{n,-k} + \hat{\beta}_k \delta_1}} - \frac{e^{\hat{\beta}_{-k}' x_{n,-k} + \hat{\beta}_k \delta_0}}{1 + e^{\hat{\beta}_{-k}' x_{n,-k} + \hat{\beta}_k \delta_0}}    (17)

Again, this would take on a different value for each individual in our sample. We could report the average value in the sample,

\frac{1}{N} \sum_{n=1}^{N} \left[ \frac{e^{\hat{\beta}_{-k}' x_{n,-k} + \hat{\beta}_k \delta_1}}{1 + e^{\hat{\beta}_{-k}' x_{n,-k} + \hat{\beta}_k \delta_1}} - \frac{e^{\hat{\beta}_{-k}' x_{n,-k} + \hat{\beta}_k \delta_0}}{1 + e^{\hat{\beta}_{-k}' x_{n,-k} + \hat{\beta}_k \delta_0}} \right]    (18)

We could think of x_{nk} as representing the growth rate and y_n as indicating whether a survey respondent votes for the incumbent candidate. If we set \delta_1 = 3 and \delta_0 = 1, the quantity above would indicate the increase in the probability of voting for incumbent candidates when growth increases from 1 percent to 3 percent. We could get a different answer if we selected \delta_1 = 4 and \delta_0 = 2. To resolve this indeterminacy, it is common to consider a change between one standard deviation below the mean and one standard deviation above the mean:

\widehat{CP}_1 = \frac{e^{\hat{\beta}' \bar{x} + \hat{\beta}_k s_{x_k}}}{1 + e^{\hat{\beta}' \bar{x} + \hat{\beta}_k s_{x_k}}} - \frac{e^{\hat{\beta}' \bar{x} - \hat{\beta}_k s_{x_k}}}{1 + e^{\hat{\beta}' \bar{x} - \hat{\beta}_k s_{x_k}}}    (19)

Alternatively, we can consider the average magnitude of the change in an individual's probability of a success when every individual's covariate k is changed by two standard deviations:

\widehat{CP}_2 = \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{e^{\hat{\beta}' x_n + \hat{\beta}_k s_{x_k}}}{1 + e^{\hat{\beta}' x_n + \hat{\beta}_k s_{x_k}}} - \frac{e^{\hat{\beta}' x_n - \hat{\beta}_k s_{x_k}}}{1 + e^{\hat{\beta}' x_n - \hat{\beta}_k s_{x_k}}} \right]    (20)
For each type of marginal effect, we may also wish to compute a measure of uncertainty. The delta method provides one way of doing this. Each marginal effect can be expressed as a function of the parameter \beta, c(\beta). In each case above, c(\beta) is a differentiable function. In this case, the delta method tells us that,

\sqrt{N}(c(\hat{\beta}) - c(\beta_0)) \xrightarrow{dist.} c_\beta(\beta_0) \sqrt{N}(\hat{\beta} - \beta_0)    (21)

where c_\beta is the matrix of derivatives of the vector-valued function c(\beta). Using the fact that \sqrt{N}(\hat{\beta} - \beta_0) \xrightarrow{dist.} N(0, V_0), we have that,

\sqrt{N}(c(\hat{\beta}) - c(\beta_0)) \xrightarrow{dist.} N(0, \hat{c}_\beta \hat{V} \hat{c}_\beta')    (22)

where \hat{c}_\beta = c_\beta(\hat{\beta}) and \hat{V} is an estimate of the asymptotic covariance matrix of \hat{\beta}.
Consider first \widehat{ME}_1, where,
c(\beta) = \frac{e^{\beta' \bar{x}}}{(1 + e^{\beta' \bar{x}})^2} \beta_k    (23)

In principle, we could take the derivatives by hand to obtain:

[c_\beta(\beta)]_j = \begin{cases} \dfrac{(1 + e^{\beta' \bar{x}})(1 + \bar{x}_k \beta_k) - 2 e^{\beta' \bar{x}} \beta_k \bar{x}_k}{(1 + e^{\beta' \bar{x}})^3}\, e^{\beta' \bar{x}}, & j = k \\[1ex] \dfrac{(1 - e^{\beta' \bar{x}})\, e^{\beta' \bar{x}}}{(1 + e^{\beta' \bar{x}})^3}\, \beta_k \bar{x}_j, & j \ne k \end{cases}    (24)

Even this very simple example is very tedious. A simpler approach is to apply numerical differentiation. Specifically, recall the definition of the derivative,

f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}    (25)

If we select a "small" value of h (for example, h = 0.000001), we can approximate the derivative using,

f'(x) \approx \frac{f(x + h) - f(x)}{h}    (26)

To compute \hat{c}_\beta, we could use,

\hat{c}_{\beta,j} \approx \frac{c(\hat{\beta} + e_j h) - c(\hat{\beta})}{h}    (27)

where e_j is a unit vector.
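A minimal Python sketch of this delta-method calculation follows (the estimates, covariance matrix, and data are hypothetical stand-ins, not values from the text): the gradient of c(\beta) is approximated by forward differences and combined with \hat{V} to produce a standard error for the marginal effect at the means.

```python
import numpy as np

def me_at_means(beta, X, k):
    """c(beta): marginal effect of covariate k evaluated at the sample means."""
    p = 1 / (1 + np.exp(-(X.mean(axis=0) @ beta)))
    return p * (1 - p) * beta[k]

def numerical_gradient(f, beta, h=1e-6):
    """Forward-difference approximation to c_beta."""
    g = np.zeros_like(beta)
    for j in range(len(beta)):
        e_j = np.zeros_like(beta)
        e_j[j] = h
        g[j] = (f(beta + e_j) - f(beta)) / h
    return g

# Hypothetical estimates, covariance matrix, and data
beta_hat = np.array([-0.5, 1.0, 0.3])
V_hat = np.diag([0.02, 0.01, 0.015])      # estimated covariance of beta_hat
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500), rng.binomial(1, 0.4, 500)])

c_hat = me_at_means(beta_hat, X, k=1)
c_beta = numerical_gradient(lambda b: me_at_means(b, X, 1), beta_hat)
se_me = np.sqrt(c_beta @ V_hat @ c_beta)   # delta-method standard error
print(c_hat, se_me)
```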
Next, we mention one simple shortcut for interpreting the coefficients. Since the marginal effect is \hat{\pi}_n (1 - \hat{\pi}_n) \hat{\beta}_k and 0 < \hat{\pi}_n (1 - \hat{\pi}_n) < \frac{1}{4}, the largest possible marginal effect is \frac{1}{4} \hat{\beta}_k. We can therefore obtain some idea of the marginal effect just by dividing the coefficient by 4, though this can be seriously misleading in certain circumstances. If most of the observations have \hat{\pi}_n close to 0.5, this will give us a sense of the average marginal effect, but having most of the data with \hat{\pi}_n close to 0.5 corresponds to a model with little predictive power. If many observations have \hat{\pi}_n close to zero or one, the average marginal effect will be much smaller.
One more alternative approach for interpreting logistic regression estimates is given by the odds ratio. The odds ratio for x_{nk} = a vs. x_{nk} = b is given by,

\frac{\Pr(y_n = 1 | x_{nk} = a, x_{n,-k}) / \Pr(y_n = 0 | x_{nk} = a, x_{n,-k})}{\Pr(y_n = 1 | x_{nk} = b, x_{n,-k}) / \Pr(y_n = 0 | x_{nk} = b, x_{n,-k})} = \frac{\left. \frac{e^{\beta_{-k}' x_{-k} + \beta_k a}}{1 + e^{\beta_{-k}' x_{-k} + \beta_k a}} \right/ \frac{1}{1 + e^{\beta_{-k}' x_{-k} + \beta_k a}}}{\left. \frac{e^{\beta_{-k}' x_{-k} + \beta_k b}}{1 + e^{\beta_{-k}' x_{-k} + \beta_k b}} \right/ \frac{1}{1 + e^{\beta_{-k}' x_{-k} + \beta_k b}}} = \frac{e^{\beta_{-k}' x_{-k} + \beta_k a}}{e^{\beta_{-k}' x_{-k} + \beta_k b}} = e^{\beta_k (a - b)}    (28)

Suppose that y_n is voter turnout (0 or 1) and x_{nk} indicates whether someone is assigned to a GOTV call (a = 1 and b = 0). Then e^{\beta_k} is an estimate of the ratio of the odds of voting among those receiving a call to the odds of voting among those not receiving a call. If 75% of people vote when receiving no GOTV call, the odds of voting are 3 to 1. If, when receiving a phone call, 80% of people vote, the odds of voting are 4 to 1. In this case, the odds ratio would be 4/3 = 1.333, indicating a 33.3% increase in the odds of voting from receiving a GOTV call.
The odds ratio would typically be used for a dummy IV (in which case a = 1 and b = 0). More generally, the odds ratio indicates the change in the odds of observing a success when changing an IV from b to a. Odds ratios can be viewed as more interpretable than logistic regression coefficients; however, predicted probabilities are arguably more interpretable than odds ratios. The example above suggests why odds ratios are sometimes used: they make effects seem bigger. A 33.3% increase in the odds of voting sounds bigger than a 5 percentage point increase in voter turnout.
In some cases, odds ratios may be reported instead of regression coefficients. In this case, to determine whether a variable has a positive effect, the odds ratio must be compared to 1 rather than 0. We can compute standard errors for odds ratios using the delta method. Specifically, since c(\beta) = e^{\beta_k}, we have c_\beta(\beta) = e^{\beta_k}, in which case we can compute the standard error of the odds ratio e^{\hat{\beta}_k} using e^{\hat{\beta}_k} se(\hat{\beta}_k), where se(\hat{\beta}_k) is the standard error of the logistic regression coefficient. We can also reverse this process to get coefficient estimates and their standard errors from odds ratios and their standard errors using \hat{\beta}_k = \log(\widehat{OR}_k) and se(\hat{\beta}_k) = se(\widehat{OR}_k) / \widehat{OR}_k.

2.3 Measures of Model Fit


A number of different measures of model fit are available for the logistic regression model. The most basic is the percent correctly predicted,

PCP = \frac{1}{N} \sum_{n=1}^{N} \left[ y_n 1\{\hat{\beta}' x_n > 0\} + (1 - y_n) 1\{\hat{\beta}' x_n < 0\} \right]    (29)

Alternatively, one might desire a measure more closely related to the R-squared from the linear regression model. There are many different variants of the "pseudo R-squared" for the logistic regression model; two are presented below. The McKelvey and Zavoina pseudo R-squared starts with the latent variable formulation, y_n^* = \beta_0' x_n + \varepsilon_n, which is very closely related to the linear regression model, y_n = \beta_0' x_n + \varepsilon_n. In the linear regression model, we could calculate,

R^2 = \frac{ESS}{TSS} = \frac{\sum_{n=1}^{N}(\hat{y}_n - \bar{y})^2}{\sum_{n=1}^{N}(y_n - \bar{y})^2} = \frac{\frac{1}{N-1}\sum_{n=1}^{N}(\hat{y}_n - \bar{y})^2}{\frac{1}{N-1}\sum_{n=1}^{N}(y_n - \bar{y})^2} \approx \frac{Var(\hat{y}_n)}{Var(y_n)}    (30)

= \frac{Var(\hat{\beta}' x_n)}{Var(\beta_0' x_n + \varepsilon_n)} = \frac{Var(\hat{\beta}' x_n)}{Var(\beta_0' x_n) + Var(\varepsilon_n)} = \frac{\hat{\beta}' Var(x_n) \hat{\beta}}{\beta_0' Var(x_n) \beta_0 + Var(\varepsilon_n)} \approx \frac{\hat{\beta}' S_x \hat{\beta}}{\hat{\beta}' S_x \hat{\beta} + Var(\varepsilon_n)}    (31)

where S_x = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \bar{x})(x_n - \bar{x})' is the sample covariance matrix. For the Logistic(0, 1) distribution, we have Var(\varepsilon_n) = \pi^2/3. Analogizing to the linear regression model, we could select,

pseudo-R^2 \approx \frac{\hat{\beta}' S_x \hat{\beta}}{\hat{\beta}' S_x \hat{\beta} + \frac{\pi^2}{3}}    (32)

Note that if the model has no explanatory power (\hat{\beta} = 0), pseudo-R^2 = 0. If \hat{\beta}' S_x \hat{\beta} is very large relative to the error variance, pseudo-R^2 will be very close to 1. This particular measure has the advantage that it is, apart from sampling error, the R-squared one would obtain if the latent variable were not censored. This version of the R-squared can be applied to other models that can be interpreted as censored versions of the linear regression model, such as probit, ordered logit, ordered probit, and tobit models.
An alternative measure is McFadden's R-squared, which is defined by,

pseudo-R^2 = 1 - \frac{l}{l_0}    (33)

where l is the log-likelihood and l_0 is the log-likelihood from a model that only includes an intercept. The idea is that a model with only an intercept term does not explain any variance around the mean and will have likelihood L_0 \in (0, 1]. A perfect model will have a likelihood of L = 1. All models will have L_0 \le L \le 1, or \log L_0 \le \log L \le 0, or l_0 \le l \le 0. The worst case is pseudo-R^2 = 1 - \frac{l_0}{l_0} = 0, while the best case is pseudo-R^2 = 1 - \frac{0}{l_0} = 1. All in-between models will have 0 < pseudo-R^2 < 1. This version of the R-squared can be applied to other models of discrete dependent variables, such as probit, ordered logit, ordered probit, multinomial logit, multinomial probit, and Poisson regression models.
In addition, AIC and BIC are alternative measures of model fit which we can use to compare models. We have,

AIC = -2l + 2k    (34)

BIC = -2l + k \log n    (35)

where we select models for which the AIC and BIC are small. These measures have the disadvantage that their magnitudes are not interpretable, but have the advantage that they penalize models for having many parameters that do not contribute to improved model fit.
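The fit measures above are simple functions of the fitted log-likelihood. The following is a minimal Python sketch (function and variable names are assumptions, and the intercept-only log-likelihood is computed under the assumption that the first column of X is a constant):

```python
import numpy as np

def logit_loglik(beta, y, X):
    """Logit log-likelihood l(beta)."""
    xb = X @ beta
    return np.sum(y * xb - np.logaddexp(0.0, xb))

def fit_measures(beta_hat, y, X):
    """PCP, McFadden pseudo-R^2, AIC, and BIC for a fitted logit model."""
    xb = X @ beta_hat
    pcp = np.mean(y * (xb > 0) + (1 - y) * (xb < 0))
    l = logit_loglik(beta_hat, y, X)
    # Intercept-only model: the MLE of the constant is the log-odds of the sample mean
    beta0 = np.zeros_like(beta_hat)
    beta0[0] = np.log(y.mean() / (1 - y.mean()))
    l0 = logit_loglik(beta0, y, X)
    k = len(beta_hat)
    return {"PCP": pcp, "McFadden R2": 1 - l / l0,
            "AIC": -2 * l + 2 * k, "BIC": -2 * l + k * np.log(len(y))}
```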

2.4 When is Using a Logit Model Necessary


If the goal is simply to identify the signs of coefficients, then the logit model will rarely produce estimates that differ substantially from the linear probability model. If the goal is to report average marginal effects, again the logit model will rarely produce estimates that differ substantially from the linear probability model. The logit model should be used when the goal is to obtain predictions for individual observations, because the predictions from the linear probability model can fall outside the zero-one interval. In addition, if the goal is to consider more specific or complicated counterfactuals, the logit model should be used. There is rarely any drawback to estimating a logit model, though: while the logit model may be slightly more computationally complex, it would be very rare to find a dataset large enough for this to be an issue. The logit model does make instrumental variable estimation more difficult. The logit model also makes it more difficult to include fixed effects, since fixed effects logit requires the number of observations per fixed effect to go to infinity for consistency (while the linear probability model with fixed effects only requires the total number of observations to go to infinity for consistent estimation of the main effects). This issue will be considered in Section 11.

2.5 Applications
Application 2.1 (Individual Level Model of the Economic Vote).

In this application, we consider an individual level model of the economic vote. The data
come from the Comparative Study of Electoral Systems, modules 1 and 2, which merges
survey data from multiple elections in multiple countries. The dependent variable is equal
to 1 if the individual voted for the prime minister’s party and 0 if the individual voted for
someone else. Table 1 presents the estimates of the model, along with odds ratios and four
measures of substantive effect sizes. Table 2 compares a basic model (with only growth
included as an explanatory variable) with a model that includes additional controls.

Model Odds Ratio ME1 ME2 CP1 CP2

Independent Variables:
Constant -0.231**
(0.085)
Distance -0.361*** 0.697ˆ -0.072*** -0.070*** -0.066*** -0.065***
(0.008) (0.005) (0.002) (0.001) (0.001) (0.001)
Education -0.111*** 0.895ˆ -0.022*** -0.022*** -0.022*** -0.021***
(0.007) (0.007) (0.001) (0.001) (0.001) (0.001)
Age 0.006*** 1.006ˆ 0.001*** 0.001*** 0.001*** 0.001***
(0.001) (0.001) (0.000) (0.000) (0.000) (0.000)
Gender 0.021 1.021 0.004 0.004 0.004 0.004
(0.023) (0.024) (0.005) (0.004) (0.005) (0.004)
Income 0.070*** 1.072ˆ 0.014*** 0.014*** 0.014*** 0.014***
(0.009) (0.010) (0.002) (0.002) (0.002) (0.002)
Growth 0.132*** 1.141ˆ 0.027*** 0.026*** 0.027*** 0.026***
(0.008) (0.009) (0.002) (0.002) (0.002) (0.002)
Unemployment -0.038*** 0.963ˆ -0.008*** -0.007*** -0.008*** -0.007***
(0.005) (0.004) (0.001) (0.001) (0.001) (0.001)

N 39550

Table 1: Binomial Logit Model and Substantive Effects. Standard errors in parentheses. + p < .10, * p < .05, ** p < .01, and *** p < .001 indicate that the coefficient or marginal effect is statistically significantly different from 0. ^ p < .05 indicates that the odds ratio is statistically significantly different from 1.

Application 2.2 (Predicting the Likelihood of Civil War).

In this application, we consider predicting the likelihood of a civil war. The data come from the Correlates of War project, with each observation being a country-year. The dependent variable is equal to 1 if a country was experiencing a civil war in that year and is equal to zero otherwise. Table 3 reports a binomial logit model, four types of associated substantive effects, and a linear probability model.

2.6 Suggested Reading


2.6.1 Background

[1] Greene (2000)

(1) (2)

Independent Variables:
Constant -1.258*** -0.231**
(0.026) (0.085)
Growth 0.140*** 0.132***
(0.008) (0.008)
Distance -0.361***
(0.008)
Education -0.111***
(0.007)
Age 0.006***
(0.001)
Gender 0.021
(0.023)
Income 0.070***
(0.009)
Unemployment -0.038***
(0.005)

N 39550 39550
AIC 47783 45123
BIC 47800 45192
McKelvey R2 0.012 0.121
McFadden R2 0.007 0.062

Table 2: Binomial Logit Models and Fit Statistics. Standard errors in parentheses. + p < .10, * p < .05, ** p < .01, *** p < .001.

[2] Kennedy (1992)

[3] King (1998)

3 Numerical Optimization
3.1 The Need for Numerical Optimization
Consider the following very simple equation, e^x + .3x = .7. Since e^x + .3x is monotonically increasing, there must be a unique x that solves e^x + .3x = .7; however, we cannot solve this equation for x by hand, so we say that it has no analytical solution. Consider the optimization problem,

\min_x \; e^x + .15x^2 - .7x    (36)

To find the minimum, we can attempt to take first-order conditions,

FOC = e^x + .3x - .7 = 0    (37)

or equivalently,

Logit ME1 ME2 CP1 CP2 OLS

Independent Variables:
Constant -6.623*** -0.381***
(0.359) (0.034)
Log(Population) 0.519*** 0.035*** 0.048*** 0.044*** 0.055*** 0.053***
(0.029) (0.002) (0.003) (0.003) (0.003) (0.003)
Western Democracies -2.201*** -0.150*** -0.204*** -0.065*** -0.111*** -0.206***
(0.275) (0.018) (0.025) (0.004) (0.006) (0.024)
Eastern Europe -1.945*** -0.133*** -0.180*** -0.063*** -0.105*** -0.117***
(0.341) (0.023) (0.031) (0.005) (0.009) (0.023)
Latin America -0.195 -0.013 -0.018 -0.012 -0.017 -0.002
(0.222) (0.015) (0.021) (0.013) (0.018) (0.022)
Sub-Saharan Africa -0.055 -0.004 -0.005 -0.004 -0.005 0.028
(0.206) (0.014) (0.019) (0.013) (0.019) (0.021)
Asia -0.329 -0.022 -0.030 -0.020+ -0.028+ 0.057*
(0.207) (0.014) (0.019) (0.011) (0.016) (0.024)
Polity Score 0.045*** 0.003*** 0.004*** 0.003*** 0.004*** 0.005***
(0.007) (0.000) (0.001) (0.001) (0.001) (0.001)
GDP per Capita -0.199*** -0.014*** -0.018*** -0.012*** -0.017*** -0.006***
(0.027) (0.002) (0.003) (0.001) (0.002) (0.001)
Former British Colony -0.017 -0.001 -0.002 -0.001 -0.002 -0.013
(0.124) (0.008) (0.011) (0.008) (0.011) (0.012)
Former French Colony 0.237 0.016 0.022 0.018 0.023 -0.015
(0.145) (0.010) (0.013) (0.012) (0.015) (0.015)
Per. Mountainous Terrain 0.012*** 0.001*** 0.001*** 0.001*** 0.001*** 0.001***
(0.002) (0.000) (0.000) (0.000) (0.000) (0.000)
Num. of Contiguous Countries 1.386*** 0.095*** 0.128*** 0.168*** 0.176*** 0.135***
(0.135) (0.009) (0.012) (0.024) (0.021) (0.016)
Ethnic Fractionalization 0.667*** 0.046*** 0.062*** 0.061** 0.073** 0.081***
(0.183) (0.013) (0.017) (0.021) (0.023) (0.020)
Per. Muslim 0.002 0.000 0.000 0.000 0.000 0.001***
(0.002) (0.000) (0.000) (0.000) (0.000) (0.000)

N 6214 6214
r.squared 0.174

Table 3: Binomial Logit Model and Substantive Effects. Standard errors in parentheses. + p < .10, * p < .05, ** p < .01, and *** p < .001.

e^x + .3x = .7    (38)

but we already found that this has a unique solution that cannot be computed by hand. We can take second-order conditions,

SOC = e^x + .3 > 0    (39)

The second-order conditions indicate that e^x + .15x^2 - .7x is globally convex, and hence the solution to e^x + .3x = .7 is the unique minimizer of e^x + .15x^2 - .7x, but we still need a way to compute the value of the minimizer.

3.2 One-Dimensional Root-Finding


Consider solving f(x) = 0. Suppose that we know that f(\underline{x}) < 0 and f(\bar{x}) > 0, where \underline{x} < \bar{x}. Suppose further that f is continuous. When we have \underline{x} < \bar{x} with f(\underline{x}) < 0 and f(\bar{x}) > 0, we say that we bracket a solution to f(x) = 0. From the intermediate value theorem, it follows that there is an x \in (\underline{x}, \bar{x}) such that f(x) = 0.
To find the solution to f(x) = 0, we could consider the following algorithm. We set y = \frac{1}{2}\underline{x} + \frac{1}{2}\bar{x}. If f(y) = 0, we have found a solution. If f(y) < 0, we now bracket a solution between y and \bar{x}, so we set the new \underline{x} = y. If f(y) > 0, we now bracket a solution between \underline{x} and y, so we set the new \bar{x} = y. In either case, we bracket a new solution, but \underline{x} and \bar{x} are half as far apart. Each time we repeat this process, \underline{x} and \bar{x} get closer together and we close in on the solution. We can stop this procedure when f(y) is close enough to zero (say, within 0.00000001). This algorithm is called the bisection method.
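A minimal Python sketch of bisection, applied to the running example e^x + .3x = .7 (the starting bracket is an assumption):

```python
import math

def bisect(f, lo, hi, tol=1e-8):
    """Bisection method: assumes f(lo) < 0 < f(hi) and f continuous."""
    while True:
        mid = 0.5 * (lo + hi)
        val = f(mid)
        if abs(val) < tol:
            return mid
        if val < 0:
            lo = mid   # root is now bracketed between mid and hi
        else:
            hi = mid   # root is now bracketed between lo and mid

root = bisect(lambda x: math.exp(x) + 0.3 * x - 0.7, -2.0, 2.0)
print(root)   # roughly -0.253
```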
An alternative algorithm assumes that f is differentiable. Let us start with the guess x_0. We take a Taylor expansion of f around x_0: f(x) \approx f(x_0) + f'(x_0)(x - x_0). We want to set f(x) = 0, so we solve x = x_0 - \frac{f(x_0)}{f'(x_0)}. We use this to set up the following iteration,

x_k = x_{k-1} - \frac{f(x_{k-1})}{f'(x_{k-1})}    (40)

We stop this process when f(x_k) is close enough to zero. This process is called Newton's method.
In some cases, we can calculate f'(x_{k-1}) by hand. In other cases, we might approximate it using numerical differentiation,

f'(x_{k-1}) \approx \frac{f(x_{k-1} + h) - f(x_{k-1})}{h}    (41)

A third option is the secant method. The secant method starts with two iterates, x_1 and x_0. It approximates the derivative in the formula for Newton's method using,

f'(x_{k-1}) \approx \frac{f(x_{k-1}) - f(x_{k-2})}{x_{k-1} - x_{k-2}}    (42)

The secant method algorithm becomes,

x_k = x_{k-1} - \frac{f(x_{k-1})(x_{k-1} - x_{k-2})}{f(x_{k-1}) - f(x_{k-2})}    (43)
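The two derivative-based root finders can be sketched in a few lines of Python (starting values and tolerances are assumptions), again using the running example:

```python
import math

def newton(f, fprime, x0, tol=1e-10, max_iter=100):
    """Newton's method for f(x) = 0."""
    x = x0
    for _ in range(max_iter):
        if abs(f(x)) < tol:
            break
        x = x - f(x) / fprime(x)
    return x

def secant(f, x0, x1, tol=1e-10, max_iter=100):
    """Secant method: replaces f'(x) with a finite-difference slope."""
    for _ in range(max_iter):
        if abs(f(x1)) < tol:
            break
        x0, x1 = x1, x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
    return x1

f = lambda x: math.exp(x) + 0.3 * x - 0.7
print(newton(f, lambda x: math.exp(x) + 0.3, x0=0.0))
print(secant(f, 0.0, 1.0))   # both converge to roughly -0.253
```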

3.3 One-Dimensional Optimization


Consider the problem of finding the minimum of the function f (x). Suppose that we have
3 points, a < b < c such that f (a) > f (b) and f (c) > f (b). Suppose that the function
f (x) is continuous, then f must have a local minimum between a and c. Since f goes down
between a and b, it must turn around to reach f (c) at c. At some point the slope must go
from negative to positive, and this point will be a local minimum. The algorithm selects
an intermediate point between a and c, d. If this new point has f (d) < f (b), it replaces
b. Otherwise, it replaces one of the boundary points. Continuing this procedure, we will
obtain smaller and smaller intervals and converge to a local minimum. This process is called
golden-section search.
An alternative procedure is Newton's method. Newton's method uses a quadratic approximation for f(x) around x_0,

f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2    (44)

If x is the minimum, we must have,

f'(x) \approx f'(x_0) + f''(x_0)(x - x_0) = 0    (45)

Solving this for x, we obtain,

x = x_0 - \frac{f'(x_0)}{f''(x_0)}    (46)

This suggests the following iteration procedure,

x_k = x_{k-1} - \frac{f'(x_{k-1})}{f''(x_{k-1})}    (47)

Note that we now must know both f' and f''.
It is possible to approximate f'' by applying the finite-difference formula twice,

f'(x + h) \approx \frac{f(x + h) - f(x)}{h}    (48)

f'(x) \approx \frac{f(x) - f(x - h)}{h}    (49)

f''(x) \approx \frac{f'(x + h) - f'(x)}{h} \approx \frac{\frac{f(x+h) - f(x)}{h} - \frac{f(x) - f(x-h)}{h}}{h} = \frac{f(x + h) - 2f(x) + f(x - h)}{h^2}    (50)

However, in practice, approximating the second derivative at each step of Newton's method is rarely done. Instead, the second derivative is typically approximated using secant-like methods.
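A minimal Python sketch of one-dimensional Newton minimization with finite-difference derivatives, applied to the running example \min_x e^x + .15x^2 - .7x (the step size and tolerance are assumptions):

```python
import math

def newton_minimize(f, x0, h=1e-5, tol=1e-5, max_iter=100):
    """1-D Newton's method for minimization, with finite-difference derivatives."""
    x = x0
    for _ in range(max_iter):
        f1 = (f(x + h) - f(x)) / h                       # approximate f'(x)
        f2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2     # approximate f''(x)
        if abs(f1) < tol:
            break
        x = x - f1 / f2
    return x

x_star = newton_minimize(lambda x: math.exp(x) + 0.15 * x**2 - 0.7 * x, x0=0.0)
print(x_star)   # roughly -0.25, the solution of e^x + .3x = .7
```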

3.4 Multi-Dimensional Optimization


The first set of algorithms we consider is based on Newton's method. In particular, we approximate the objective function f(x) with a quadratic function,

f(x^*) \approx f(x) + f'(x)(x^* - x) + \frac{1}{2}(x^* - x)' f''(x)(x^* - x)    (51)

Now, suppose that f''(x) is positive definite. If x^* is the minimum of this function, then we must have,

f'(x) + f''(x)(x^* - x) = 0    (52)

which is obtained by taking the first derivative of the right-hand side with respect to x^*. We can solve this for x^* to obtain x^* \approx x - f''(x)^{-1} f'(x), which suggests an iterative approach for finding the minimum of f.
Let x_k denote the current iterate, let g_k = \nabla f(x_k), and let H_k = \nabla^2 f(x_k). We can generate a new iterate using the scheme,

x_{k+1} = x_k - H_k^{-1} g_k    (53)

Notice that if the current iterate is approximately a minimum, then g_k = \nabla f(x_k) \approx 0, indicating that our algorithm will stop near a local minimum (it will also stop when exactly at a local maximum).
In practice, Newton's method is unlikely to converge without modification. We can achieve better results if we use,

x_{k+1} = x_k - \lambda H_k^{-1} g_k    (54)

for some 0 < \lambda \le 1. The question becomes, how can we choose \lambda? A back-tracking procedure can be effective. First, we try the full Newton step, with \lambda = 1. We then see if this decreases the function value. If it does not, we reduce the step by some factor (usually one half) and continue, until we reach a point where the function decreases.
Two drawbacks still exist with Newton's method. The first is that we need to supply the Hessian, which may be expensive to compute. To see why the Hessian is difficult to compute, note that H is a matrix with components H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x). We can compute,

\frac{\partial f}{\partial x_i} \approx \frac{f(x + e_i h) - f(x)}{h}    (55)

where e_i is a unit vector with the ith element equal to 1. We then have,

\frac{\partial^2 f}{\partial x_i \partial x_j}(x) \approx \frac{\frac{\partial f}{\partial x_i}(x + e_j h) - \frac{\partial f}{\partial x_i}(x)}{h} \approx \frac{\frac{f(x + e_j h + e_i h) - f(x + e_j h)}{h} - \frac{f(x + e_i h) - f(x)}{h}}{h} = \frac{f(x + e_j h + e_i h) - f(x + e_j h) - f(x + e_i h) + f(x)}{h^2}    (56)

Computing the Hessian H using numerical derivatives thus requires computing the objective function f at J(J + 1)/2 points beyond what is needed to compute the gradient.
The second drawback to using the Hessian in Newton's method is that, in many applications, the Hessian will fail to be positive definite far from the minimum. We can actually solve both problems simultaneously by considering quasi-Newton methods. These methods extend the idea of secant methods, and approximate the Hessian using recent values of the objective function and its gradient.
The most widely used approximation to the Hessian is the BFGS update,

H_{k+1} = H_k + \frac{(g_k - g_{k-1})(g_k - g_{k-1})'}{(g_k - g_{k-1})'(x_k - x_{k-1})} - \frac{H_k(x_k - x_{k-1})(x_k - x_{k-1})' H_k}{(x_k - x_{k-1})' H_k (x_k - x_{k-1})}    (57)

This process is guaranteed to produce a symmetric positive definite updated Hessian as long as (g_k - g_{k-1})'(x_k - x_{k-1}) > 0. The BFGS algorithm then proceeds by only updating when this condition is met.
To summarize, we have the following algorithm (a small code sketch follows the list),

1. Initialize x_0, f_0 = f(x_0), g_0 = f'(x_0), and H_0 = I (or some other diagonal matrix).

2. In iteration k,

   (a) If g_k is small enough, terminate successfully.
   (b) Set x_{k+1} = x_k - H_k^{-1} g_k and compute f_{k+1} = f(x_{k+1}).
   (c) If f_{k+1} > f_k - \delta (with \delta > 0), then initialize \lambda = 1 and start backtracking. Otherwise, go to (d).
       i. Set \lambda = \frac{1}{2}\lambda.
       ii. Set x_{k+1} = x_k - \lambda H_k^{-1} g_k and compute f_{k+1} = f(x_{k+1}).
       iii. If f_{k+1} > f_k - \delta, then go to (i); otherwise go to (d).
   (d) Compute g_{k+1} = f'(x_{k+1}).
   (e) Update H_{k+1} = H_k + \frac{(g_{k+1} - g_k)(g_{k+1} - g_k)'}{(g_{k+1} - g_k)'(x_{k+1} - x_k)} - \frac{H_k(x_{k+1} - x_k)(x_{k+1} - x_k)' H_k}{(x_{k+1} - x_k)' H_k (x_{k+1} - x_k)}.
   (f) Go to (a).
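The following is a minimal Python sketch of the algorithm above (not the author's code; the test objective, tolerances, and stopping rules are assumptions). It combines the BFGS update with step-halving backtracking:

```python
import numpy as np

def quasi_newton(f, grad, x0, tol=1e-6, max_iter=200):
    """Quasi-Newton minimization with a BFGS Hessian approximation and backtracking."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    H = np.eye(len(x))                      # initial Hessian approximation
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(H, g)        # H^{-1} g
        lam = 1.0
        x_new = x - lam * step
        while f(x_new) > f(x) - 1e-12 and lam > 1e-10:
            lam *= 0.5                      # backtrack until the function decreases
            x_new = x - lam * step
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 0:                       # only update when the curvature condition holds
            Hs = H @ s
            H = H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
        x, g = x_new, g_new
    return x

# Example: minimize a simple smooth stand-in objective
f = lambda x: np.exp(x[0]) + 0.15 * x[0]**2 - 0.7 * x[0] + (x[1] - 1.0)**2
grad = lambda x: np.array([np.exp(x[0]) + 0.3 * x[0] - 0.7, 2 * (x[1] - 1.0)])
print(quasi_newton(f, grad, x0=[0.0, 0.0]))   # roughly [-0.25, 1.0]
```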

Quasi-Newton methods are most effective in problems that are sufficiently smooth. The Nelder-Mead simplex method does not rely on derivatives. The basic idea behind the Nelder-Mead simplex method is to maintain a simplex of n + 1 points (for an n-dimensional minimization problem) that spans the n-dimensional space. We start by reflecting the worst point in the simplex through the centroid of the remaining points. If the new function value is an improvement over the best function value, then we take a double step. Alternatively, if the reflected point is worse than the second-highest, we contract by a factor of one-half. We continue this process until our simplex is very small. Since contraction will only occur when there are no nearby points with lower function values, this procedure will tend to converge to a local minimum.

3.5 Checking for an Optimum
A necessary condition for a stationary point is that f'(x^*) = 0. A sufficient condition for a local minimum is that f''(x^*) is positive definite. Determining whether x^* is a global minimum is more difficult. If f is globally convex, then a local minimum is also a global minimum, but the objective functions we work with are typically not globally convex.
If we use a quasi-Newton method to minimize f(x), it is fairly likely to find a point where f'(x^*) = 0. It is also very unlikely to converge to a point that is not a local minimum. Nonetheless, we will typically check that f''(x^*) is positive definite, because computing f''(x^*) is necessary for computing the asymptotic variance of the parameter estimates. We can check whether f''(x^*) is positive definite by checking that its eigenvalues are strictly positive. The most typical way in which f''(x^*) will fail to be positive definite is the case where it has zero eigenvalues, which can happen for an unidentified model (see Section 4.3). This often means that we either specified the model incorrectly or that some of the variables are perfect linear combinations of the others.
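A minimal Python sketch of this eigenvalue check (the tolerance and the example matrix are assumptions; a rank-deficient Hessian of this sort is what perfectly collinear regressors would produce):

```python
import numpy as np

def check_minimum(hessian, tol=1e-8):
    """Classify a numerically computed Hessian by inspecting its eigenvalues."""
    H = 0.5 * (hessian + hessian.T)          # symmetrize before the eigen-decomposition
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "positive definite: local minimum"
    if np.any(np.abs(eigvals) <= tol):
        return "near-zero eigenvalue: possible identification problem"
    return "negative eigenvalue: not a local minimum"

H_bad = np.array([[2.0, 2.0], [2.0, 2.0]])   # eigenvalues 4 and 0
print(check_minimum(H_bad))
```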

3.6 Suggested Reading


3.6.1 Background

[1] Greene (2000)

[2] King (1998)

4 Theory of MLE
In developing the theory of MLE, we will assume that observations are independent over n. We will assume that the data are described by a probability mass function (in the case of discrete data) or a probability density function (in the case of continuous data). The model may be a model for a data vector x_n or a model for the dependent variable conditional on a vector of independent variables, y_n | x_n. Using independence, we have that the log-likelihood function is given by,

- l(\theta) = \sum_{n=1}^{N} \log p(x_n; \theta) (discrete case)

- l(\theta) = \sum_{n=1}^{N} \log f(x_n; \theta) (continuous case)

We can define the MLE using,

\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} l(\theta)    (58)

4.1 Examples
Example 4.1 (Poisson Distribution MLE).

Consider the Poisson distribution, with probability mass function p(x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, for x \in \{0, 1, 2, ...\} and \lambda \ge 0. We can write the log-likelihood as,

l(\lambda) = \sum_{n=1}^{N} \log(e^{-\lambda}) + \sum_{n=1}^{N} \log(\lambda^{x_n}) - \sum_{n=1}^{N} \log(x_n!) = -N\lambda + \log(\lambda)\sum_{n=1}^{N} x_n - \sum_{n=1}^{N} \log(x_n!)    (59)

The first-order condition yields,

l_\lambda(\hat{\lambda}) = -N + \hat{\lambda}^{-1}\sum_{n=1}^{N} x_n = 0    (60)

which implies \hat{\lambda} = \bar{x}. The second-order condition yields,

l_{\lambda\lambda}(\hat{\lambda}) = -\hat{\lambda}^{-2}\sum_{n=1}^{N} x_n = -N\bar{x}^{-1} < 0    (61)

indicating that \hat{\lambda} = \bar{x} is a local maximum. Since it is the only local maximum, and l(\lambda) \to -\infty as \lambda \to 0 and as \lambda \to \infty, it is also the global maximum, indicating that \hat{\lambda} = \bar{x} is the unique MLE.
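The result \hat{\lambda} = \bar{x} is easy to confirm numerically. A minimal Python sketch (the simulated data and the grid search are illustrative assumptions, not part of the text):

```python
import numpy as np
from math import lgamma

def poisson_loglik(lam, x):
    """Poisson log-likelihood: -N*lam + log(lam)*sum(x) - sum(log(x!))."""
    return -len(x) * lam + np.log(lam) * np.sum(x) - np.sum([lgamma(xi + 1) for xi in x])

rng = np.random.default_rng(4)
x = rng.poisson(lam=3.2, size=1000)

grid = np.linspace(0.5, 8.0, 2000)
lam_grid = grid[np.argmax([poisson_loglik(l, x) for l in grid])]
print(lam_grid, x.mean())   # the grid maximizer matches the sample mean (up to grid spacing)
```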

Example 4.2 (Exponential Distribution MLE).

Suppose that x_n has the exponential distribution, with probability density function f(x; \lambda) = \lambda e^{-\lambda x} for x \ge 0 and \lambda \ge 0. The log-likelihood function is given by,

l(\lambda) = \sum_{n=1}^{N} \left[ \log(\lambda) - \lambda x_n \right]    (62)

The first-order condition yields,

l_\lambda(\hat{\lambda}) = N\hat{\lambda}^{-1} - \sum_{n=1}^{N} x_n = 0    (63)

which implies \hat{\lambda} = \bar{x}^{-1}. The second-order condition yields,

l_{\lambda\lambda}(\hat{\lambda}) = -N\hat{\lambda}^{-2} = -N\bar{x}^2 < 0    (64)

indicating that \hat{\lambda} = \bar{x}^{-1} is a local maximum. Since it is the only local maximum, and l(\lambda) \to -\infty as \lambda \to 0 and as \lambda \to \infty, it is also the global maximum, indicating that \hat{\lambda} = \bar{x}^{-1} is the unique MLE.

Example 4.3 (Normal Distribution MLE).

Suppose x_n has the normal distribution, with probability density function,

f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}    (65)

where -\infty < x < \infty and \sigma > 0. We can form the log-likelihood function as,

l(\mu, \sigma) = \sum_{n=1}^{N} \log\left( \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(x_n - \mu)^2} \right) = -N\log\sigma - \frac{N}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2    (66)

We can take first-order conditions to find the MLE,

l_\mu(\hat{\mu}, \hat{\sigma}) = \frac{1}{\hat{\sigma}^2}\sum_{n=1}^{N}(x_n - \hat{\mu}) = 0    (67)

l_\sigma(\hat{\mu}, \hat{\sigma}) = -\frac{N}{\hat{\sigma}} + \frac{1}{\hat{\sigma}^3}\sum_{n=1}^{N}(x_n - \hat{\mu})^2 = 0    (68)

The first equation indicates that,

\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n = \bar{x}    (69)

while the second equation yields,

\hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N}(x_n - \hat{\mu})^2 = \frac{N-1}{N} s_x^2    (70)

The second-order conditions yield,

l_{\mu\mu}(\hat{\mu}, \hat{\sigma}) = -\frac{N}{\hat{\sigma}^2}    (71)

l_{\mu\sigma}(\hat{\mu}, \hat{\sigma}) = l_{\sigma\mu}(\hat{\mu}, \hat{\sigma}) = -\frac{2}{\hat{\sigma}^3}\sum_{n=1}^{N}(x_n - \hat{\mu}) = 0    (72)

l_{\sigma\sigma}(\hat{\mu}, \hat{\sigma}) = \frac{N}{\hat{\sigma}^2} - \frac{3}{\hat{\sigma}^4}\sum_{n=1}^{N}(x_n - \hat{\mu})^2 = -\frac{2N}{\hat{\sigma}^2}    (73)

Since the matrix of second derivatives is negative definite, \hat{\mu} = \bar{x} and \hat{\sigma}^2 = \frac{N-1}{N} s_x^2 is a local maximum. Since it is the only local maximum, it is also a global maximum, indicating that \hat{\mu} = \bar{x} and \hat{\sigma}^2 = \frac{N-1}{N} s_x^2 is the unique MLE.

Example 4.4 (Uniform Distribution MLE).

Suppose that x_n is uniformly distributed on the interval [0, \theta]. The probability density function is given by f(x; \theta) = \frac{1}{\theta} 1\{0 \le x \le \theta\}. The likelihood function is given by,

L(\theta) = \prod_{n=1}^{N} \frac{1}{\theta} 1\{x_n \le \theta\}    (74)

Notice that this likelihood function is zero whenever there exists an x_n > \theta, so it must be the case that L(\theta) is maximized at a point for which x_n \le \theta for all n. Notice further that among such points, decreasing \theta always increases the likelihood. Thus, the maximum will occur at \hat{\theta} = \max_{1 \le n \le N} \{x_n\}.

4.2 Consistency
We will derive the consistency of MLEs from the consistency of M-estimators. An M-estimator is defined by,

\hat{\theta} = \arg\max_{\theta \in \Theta} \frac{1}{N}\sum_{n=1}^{N} \psi(x_n; \theta)    (75)

MLEs are a special case of M-estimators, where \psi(x_n; \theta) = \log p(x_n; \theta) or \psi(x_n; \theta) = \log f(x_n; \theta). MLEs inherit the properties of M-estimators, but have some additional properties (see subsection 4.4).
The basic idea of the proof of consistency is as follows. First, we show that the objective function \frac{1}{N}\sum_{n=1}^{N}\psi(x_n; \theta) converges to its expectation E[\psi(x_n; \theta)]. From this, we can determine that \hat{\theta} converges to the maximizer of E[\psi(x_n; \theta)]. Finally, we show that \theta_0 is the unique maximizer of E[\psi(x_n; \theta)] (this is called the identification condition).

Theorem 4.1 (Consistency of M-Estimators). Suppose that x_n are independent and identically distributed, and that \theta_0 is the unique maximizer of E[\psi(x_n; \theta)]. Suppose that \Theta is compact and \psi(x; \theta) is continuous at each \theta \in \Theta with probability 1. Suppose that E\left[\sup_{\theta \in \Theta}|\psi(x_n; \theta)|\right] < \infty. Define \hat{\theta} = \arg\max_{\theta \in \Theta} \frac{1}{N}\sum_{n=1}^{N}\psi(x_n; \theta). Then \hat{\theta} \xrightarrow{prob.} \theta_0.³

³ Adapted from Newey and McFadden (1994).

Recall that maximum likelihood estimators are a special case of M-estimators. In order
for maximum likelihood estimators to be consistent, it must be the case that certain reg-
ularity conditions are met and that the MLE objective function identifies the population
parameters. We can show that the condition that the MLE objective function identifies θ0
holds automatically.
Suppose that θ̂ is a maximum likelihood estimator and consider the discrete case (the
continuous case is very similar). We have ψ(xn ; θ) = log p(xn ; θ) where xn has probability
mass function p(xn ; θ0 ). We would like to show that E[log p(xn ; θ)] has a unique maximum
at θ = θ0 .

Theorem 4.2 (Identification of Maximum Likelihood Estimators).

(i) Suppose that there does not exist a \theta \ne \theta_0 such that p(x; \theta) = p(x; \theta_0) for all x. Then \theta_0 is the unique maximizer of E[\log p(x_n; \theta)]. (discrete case)

(ii) Suppose that there does not exist a \theta \ne \theta_0 such that f(x; \theta) = f(x; \theta_0) for all x \in \mathcal{X}, where \int_{x \in \mathcal{X}} f(x; \theta_0)\,dx = 1. Then \theta_0 is the unique maximizer of E[\log f(x_n; \theta)].⁴ (continuous case)

⁴ Adapted from Gallant (1997).

Proof. We prove the discrete case. The proof for the continuous case is nearly identical. We want to show that for all \theta \ne \theta_0,

E[\log p(x_n; \theta)] < E[\log p(x_n; \theta_0)]    (76)

Notice that -1 + z \ge \log z for all z > 0, with -1 + z > \log z for z \ne 1. Setting z = \frac{p(x; \theta)}{p(x; \theta_0)}, we have,

-1 + \frac{p(x; \theta)}{p(x; \theta_0)} \ge \log\left(\frac{p(x; \theta)}{p(x; \theta_0)}\right)    (77)

Multiplying both sides by p(x; \theta_0), we obtain,

-p(x; \theta_0) + p(x; \theta) \ge p(x; \theta_0)\log\left(\frac{p(x; \theta)}{p(x; \theta_0)}\right)    (78)

Summing both sides over x, we obtain,

-\sum_x p(x; \theta_0) + \sum_x p(x; \theta) \ge \sum_x p(x; \theta_0)\log\left(\frac{p(x; \theta)}{p(x; \theta_0)}\right)    (79)

Since p(x; \theta_0) and p(x; \theta) are probability mass functions, they each sum to 1, so the left-hand side is zero. Moreover, the right-hand side is the expectation of \log\left(\frac{p(x_n; \theta)}{p(x_n; \theta_0)}\right), so we have,

0 \ge E\left[\log\left(\frac{p(x_n; \theta)}{p(x_n; \theta_0)}\right)\right] = E[\log p(x_n; \theta)] - E[\log p(x_n; \theta_0)]    (80)

This can be rearranged to give,

E[\log p(x_n; \theta_0)] \ge E[\log p(x_n; \theta)] \quad \forall \theta \in \Theta    (81)

To complete the proof, we must show that inequality (81) holds strictly when \theta \ne \theta_0. By assumption, there is at least one value of x such that p(x; \theta) \ne p(x; \theta_0) when \theta \ne \theta_0. Therefore, -1 + z > \log z for at least one value of z, and when we sum equation (78) over x for \theta \ne \theta_0, the inequality is strict, proving the result.

The condition that there does not exist a θ 6= θ0 such that p(x; θ) = p(x; θ0 ) for all x is
called identification of the statistical model. We will study it in detail in the next subsection.
For now, we state a consistency result for maximum likelihood estimators.

Theorem 4.3 (Consistency of Maximum Likelihood Estimators).

(i) Suppose that x_n are independent and identically distributed with probability mass function p(x; \theta_0). Suppose that there does not exist a \theta \ne \theta_0 such that p(x; \theta) = p(x; \theta_0) for all x. Suppose that \Theta is compact and \log p(x; \theta) is continuous at each \theta \in \Theta with probability 1. Suppose that E\left[\sup_{\theta \in \Theta}|\log p(x_n; \theta)|\right] < \infty. Then \hat{\theta}_{MLE} \xrightarrow{prob.} \theta_0. (discrete case)

(ii) Suppose that x_n are independent and identically distributed with probability density function f(x; \theta_0). Suppose that there does not exist a \theta \ne \theta_0 such that f(x; \theta) = f(x; \theta_0) for all x \in \mathcal{X}, where \int_{x \in \mathcal{X}} f(x; \theta_0)\,dx = 1. Suppose that \Theta is compact and \log f(x; \theta) is continuous at each \theta \in \Theta with probability 1. Suppose that E\left[\sup_{\theta \in \Theta}|\log f(x_n; \theta)|\right] < \infty. Then \hat{\theta}_{MLE} \xrightarrow{prob.} \theta_0.⁵ (continuous case)

⁵ Adapted from Newey and McFadden (1994).

4.3 Identification of a Statistical Model


Suppose that the data xn are i.i.d. Then the statistical model can be described by a cdf
F (x; θ0 ) where θ0 represents the true parameter value. The true parameter varies in the class
θ ∈ Θ. Identification answers the question, “are the data informative about the parameter
of interest”.

Definition 4.1 (Identification of a Statistical Model). We say that \theta_0 is identified if there does not exist a \theta \ne \theta_0 such that F(x; \theta) = F(x; \theta_0) for all x.⁶

⁶ To be technically correct, we would replace "for all x" with "for x on a set of measure 1".

Identification is a necessary condition for a consistent estimator to exist (Gabrielsen, 1978; Rao, 1992; Martín and Quintana, 2002). Since identification of the underlying statistical model is a necessary condition, we will spend some time describing how to check identification. We have the following definitions of identification.

Definition 4.2 (Identification of a Statistical Model in the Continuous Case). Let \mathcal{X} be the set of x such that f(x; \theta_0) > 0. We say that \theta_0 is identified if there does not exist a \theta \ne \theta_0 such that f(x; \theta) = f(x; \theta_0) for all x \in \mathcal{X}.

Definition 4.3 (Identification of a Statistical Model in the Discrete Case). Let \mathcal{X} be the set of x such that p(x; \theta_0) > 0. We say that \theta_0 is identified if there does not exist a \theta \ne \theta_0 such that p(x; \theta) = p(x; \theta_0) for all x \in \mathcal{X}.

Example 4.1 (continued). We wish to show that there does not exist a \lambda > 0 with \lambda \ne \lambda_0 such that p(x; \lambda) = p(x; \lambda_0) for all x \in \{0, 1, 2, ...\}. Consider a \lambda such that p(0; \lambda) = p(0; \lambda_0). In this case, we have p(0; \lambda) = e^{-\lambda} and p(0; \lambda_0) = e^{-\lambda_0}. Equating these two implies that \lambda = \lambda_0, so there does not exist a \lambda \ne \lambda_0 such that p(x; \lambda) = p(x; \lambda_0) for all x.

Example 4.2 (continued). We wish to show that there does not exist a \lambda > 0 with \lambda \ne \lambda_0 such that f(x; \lambda) = f(x; \lambda_0) for all x \ge 0. Consider a \lambda such that f(0; \lambda) = f(0; \lambda_0). In this case, we have f(0; \lambda) = \lambda and f(0; \lambda_0) = \lambda_0. Equating these two implies that \lambda = \lambda_0, so there does not exist a \lambda \ne \lambda_0 such that f(x; \lambda) = f(x; \lambda_0) for all x.
Example 4.3 (continued). Suppose that x_n \sim N(\mu_0, \sigma_0^2) and we observe x_n; then (\mu_0, \sigma_0^2) are identified. To see that this is the case, we can show that there is no (\mu, \sigma^2) \ne (\mu_0, \sigma_0^2) such that \phi(x; \mu, \sigma^2) = \phi(x; \mu_0, \sigma_0^2) for all x. We will show that \phi(x; \mu, \sigma^2) = \phi(x; \mu_0, \sigma_0^2) for x = 0, x = \mu_0, and x = 2\mu_0 implies that (\mu, \sigma^2) = (\mu_0, \sigma_0^2). The three conditions imply that,

\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\mu^2/\sigma^2} = \frac{1}{\sigma_0\sqrt{2\pi}} e^{-\frac{1}{2}\mu_0^2/\sigma_0^2}    (82)

\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\mu_0 - \mu)^2/\sigma^2} = \frac{1}{\sigma_0\sqrt{2\pi}}    (83)

\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(2\mu_0 - \mu)^2/\sigma^2} = \frac{1}{\sigma_0\sqrt{2\pi}} e^{-\frac{1}{2}\mu_0^2/\sigma_0^2}    (84)

The first and third conditions imply that,

\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\mu^2/\sigma^2} = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(2\mu_0 - \mu)^2/\sigma^2}    (85)

which implies that \mu = \mu_0. Plugging this into the second condition gives,

\frac{1}{\sigma\sqrt{2\pi}} = \frac{1}{\sigma_0\sqrt{2\pi}}    (86)

or that \sigma = \sigma_0. Hence, the model is identified.

Example 4.5 (Identification Failure).

Suppose that xn ∼ N (0, σ02 ), but that we observe, yn = 1{xn ≥ 0}. Hence, we only
observe the number
 of positive occurrences. In this case, the distribution of yn is given by

0
P (yn = 0) = Φ − σ0 = Φ (0) = 21 and P (yn = 1) = 21 . Since these probabilities hold for all
σ 6= σ0 with σ > 0, the model is not identified.

Example 4.6 (Identification of the Uniform Distribution).

Suppose that xn ∼ U (a0 , b0 ) and that we observe xn . Then the parameters (a0 , b0 ) are
identified. To see that this is the case, we have,

1
f (x; a, b) = b−a
1{a ≤ x ≤ b} (87)

We wish to show that for any (a0 , b0 ) 6= (a, b), we have f (x; a, b) 6= f (x; a0 , b0 ) for some x.
Suppose that a < a0 . When a0 < x < a,

32
1 1 1
f (x; a, b) − f (x; a0 , b0 ) = b−a
1{a ≤ x ≤ b} − b0 −a0
1{a0 ≤ x ≤ b0 } = b−a
6= 0 (88)

Now suppose that a > a0 . When a < x < a0 ,

1 1 1
f (x; a, b) − f (x; a0 , b0 ) = b−a
1{a ≤ x ≤ b} − b0 −a0
1{a0 ≤ x ≤ b0 } = − b0 −a0
6= 0 (89)

Now suppose that b > b0 . When b < x < b0 , we have,

1 1 1
f (x; a, b) − f (x; a0 , b0 ) = b−a
1{a ≤ x ≤ b} − b0 −a0
1{a0 ≤ x ≤ b0 } = − b0 −a0
6= 0 (90)

Finally, suppose that b < b0 . When b0 < x < b, we have,

1 1 1
f (x; a, b) − f (x; a0 , b0 ) = b−a
1{a ≤ x ≤ b} − b0 −a0
1{a0 ≤ x ≤ b0 } = b−a
6= 0 (91)

Example 4.7 (Identification of the (Overspecified) Logit Model).

Consider the logit model, but assume that the error term does not have scale parameter
1. That is, suppose that, yn∗ = β 0 xn + εn , εn ∼ Logistic(0, s2 ), and yn = 1{yn∗ ≥ 0}. We have
that,

Pr(yn = 1|xn ; β, s2 ) = Pr(yn ∗ ≥ 0|xn ; β, s2 ) = Pr(β 0 xn + εn ≥ 0|xn ; β, s2 ) (92)

= Pr(εn ≥ −β 0 xn |xn ; β, s2 ) = 1 − Pr(εn < −β 0 xn |xn ; β, s2 )

−β 0 xn
1 e s
=1− −β 0 xn
= −β 0 xn
1+ e s 1+ e s
Does there exist a (β, s2 ) 6= (β0 , s20 ) such that,

β 0 xn β00 xn
e s e s0
β 0 xn
= β00 xn
(93)
1+ e s 1+ e s0
2
for all xn ? Consider (β, s ) = (2β0 , 2s20 ). We have,

33
2β00 xn β00 xn
e 2s0 e s0
2β00 xn
= β00 xn
(94)
1+ e 2s0 1+ e s0
which is true for all xn , so (β0 , s20 ) is not identified.

Example 4.8 (Identification of the (Properly Specified) Logit Model).

Consider the standard logit model, with yn∗ = β 0 xn + εn , εn ∼ Logistic(0, 1), and yn =
1{yn∗ ≥ 0}. We have that,
0
e β xn
Pr(yn = 1|xn ; β, σ 2 ) = (95)
1 + eβ 0 xn
We will check whether there exists a β 6= β0 such that,
0 0
e β xn eβ0 xn
= 0 (96)
1 + eβ 0 xn 1 + eβ0 xn
We can rearrange this to obtain,

0 0
eβ xn = eβ0 xn (97)

Taking logs, we have,

β 0 xn = β00 xn (98)

or,

(β − β0 )0 xn = 0 (99)

Does there exists β 6= β0 such that (β − β0 )0 xn = 0 for all xn ? All long as xn has full support,
the answer is no. This conditional can fail if some of the xn ’s are perfect linear combinations
of the other xn ’s.

4.4 Asymptotic Normality


prob.
Consider an M-estimator defined by θ̂ = arg max N1 N
P
n=1 ψ(xn ; θ). Suppose that θ̂ −→ θ0 .
θ∈Θ
Define Q(x; θ) = N1 N
P
n=1 ψ(x n ; θ) and suppose that Q is twice continuously differentiable
(notice that Qθ (x; θ) = N n=1 ψθ (xn ; θ) and Qθθ (x; θ) = N1 N
1
PN P
n=1 ψθθ (xn ; θ)). We have that
Qθ (x; θ̂) = 0. Taking a Taylor expansion of Qθ around θ0 , we have,

34
Qθ (x; θ) = Qθ (x; θ0 ) + Qθθ (x; θ̄)(θ − θ0 ) (100)

where θ̄ = λθ + (1 − λ)θ0 and λ ∈ (0, 1). Plugging in θ̂ for θ, we obtain,

0 = Qθ (x; θ̂) = Qθ (x; θ0 ) + Qθθ (x; θ̄)(θ̂ − θ0 ) (101)

We can rearrange this to obtain,

(θ̂ − θ0 ) = −Qθθ (x; θ̄)−1 Qθ (x; θ0 ) (102)

or,

√ √
N (θ̂ − θ0 ) = −Qθθ (x; θ̄)−1 N Qθ (x; θ0 ) (103)

We can write,
" N
#−1 N
√ X X
1 √1
N (θ̂ − θ0 ) = − N
ψθθ (xn ; θ̄) N
ψθ (xn ; θ0 ) (104)
n=1 n=1

A central limit theorem implies that,

N
dist.
X
√1 (ψθ (xn ; θ0 ) − E[ψθ (xn ; θ0 )]) −→ N (0, V ar(ψθ (xn ; θ0 )) (105)
N
n=1

Since θ0 = arg max E[ψ(xn ; θ)], we have E[ψθ (xn ; θ0 )] = 0 which implies that,
θ∈Θ
N
dist.
X
√1 ψθ (xn ; θ0 ) −→ N (0, V ar(ψθ (xn ; θ0 )) (106)
N
n=1

We further have that,

V ar(ψθ (xn ; θ0 )) = E[ψθ (xn ; θ0 )ψθ (xn ; θ0 )0 ] − E[ψθ (xn ; θ0 )]E[ψθ (xn ; θ0 )]0 (107)

= E[ψθ (xn ; θ0 )ψθ (xn ; θ0 )0 ]

so that,

N
dist.
X
√1
N
ψθ (xn ; θ0 ) −→ N (0, E[ψθ (xn ; θ0 )ψθ (xn ; θ0 )0 ]) (108)
n=1

35
prob.
Using the fact that θ̄ = λθ̂ + (1 − λ)θ0 −→ θ0 , we have,

N N
prob.
X X
1
N
ψθθ (xn ; θ̄) −→ N1 ψθθ (xn ; θ0 ) (109)
n=1 n=1

A law of large numbers implies that,

N
prob.
X
1
N
ψθθ (xn ; θ0 ) −→ E[ψθθ (xn ; θ0 )] (110)
n=1

which together implies that,

N
prob.
X
1
N
ψθθ (xn ; θ̄) −→ E[ψθθ (xn ; θ0 )] (111)
n=1

Slutsky’s theorem then implies that,

√ dist.
N (θ̂ − θ0 ) −→ N (0, E[ψθθ (xn ; θ0 )]−1 E[ψθ (xn ; θ0 )ψθ (xn ; θ0 )0 ]E[ψθθ (xn ; θ0 )]−1 ) (112)

We state this result formally below,

Theorem 4.4 (Asymptotic Normality of M-Estimators). Suppose that the conditions of


Theorem 4.1 hold, θ0 ∈ int(Θ), ψ(xn ; θ) is twice continuous differentiable in a neighborhood
N of θ0 , V ar(ψθ (xn ; θ0 )) = E[ψθ (xn ; θ0 )ψθ (xn ; θ0 )0 ] = B0 is finite, C(θ) = E[ψθθ (xn ; θ)] is
prob.
continuous at θ0 and sup || N1 N
P
n=1 ψθθ (xn ; θ)−C(θ)|| −→ 0 where C0 = C(θ0 ) is nonsingular.
θ∈N
√ dist.
Define θ̂ as θ̂ ∈ arg max N1 N N (θ̂ − θ0 ) −→ N (0, C0−1 B0 C0−1 ).7
P
n=1 ψ(xn ; θ). Then
θ∈Θ

To apply this result, we need estimators for B0 and C0 . The obvious estimators are,

N
X
B̂ = 1
N
ψθ (xn ; θ̂)ψθ (xn ; θ̂)0 (113)
n=1

N
X
1
Ĉ = N
ψθθ (xn ; θ̂) (114)
n=1

An important property of maximum likelihood estimators is the information equality.


Here, we demonstrate the result in the discrete case (the proof for the continuous case is
7
Adapted from Newey and McFadden (1994).

36
P
nearly identical. We begin with x p(x; θ)dx = 1, which we differentiate on both sides to
yield,

X

∂θ
p(x; θ)dx =0 (115)
x

Notice that,


∂ ∂θ
p(x; θ)
∂θ
log p(x; θ) = (116)
p(x; θ)
so that,

∂ ∂
p(x; θ)[ ∂θ log p(x; θ)] = ∂θ
p(x; θ) (117)

We sum both sides over x to yield,

X X
∂ ∂
[ ∂θ log p(x; θ)]p(x; θ) = ∂θ
p(x; θ) =0 (118)
x x

We can differentiate these one more time to give,

X 2
X

[ ∂θ∂θ 0 log p(x; θ)]p(x; θ) +

[ ∂θ ∂
log p(x; θ)][ ∂θ p(x; θ)dx]0 (119)
x x

X 2
X
= ∂
[ ∂θ∂θ 0 log p(x; θ)]p(x; θ) +

[ ∂θ ∂
log p(x; θ)][ ∂θ log p(x; θ)]0 p(x; θ) = 0
x x

We therefore have,
h i
∂2
log p(x; θ)]0
 ∂ ∂

E ∂θ∂θ0
log p(x; θ) = −E [ ∂θ log p(x; θ)][ ∂θ (120)

This last result is known as the information equality. We define the information matrix by
log p(x; θ0 )]0 in the discrete case and J0 = E [ ∂θ log f (x; θ0 )]0
 ∂ ∂
  ∂ ∂

J0 = E [ ∂θ log p(x; θ0 )][ ∂θ log f (x; θ0 )][ ∂θ
in the continuous case.

Theorem
h 2 4.5 (Information
i Equality for Maximum Likelihood Estimators). We have,

log p(x; θ0 )]0 (discrete case)
 ∂ ∂

E ∂θ∂θ0 log p(x; θ0 ) = −E [ ∂θ log p(x; θ0 )][ ∂θ
h i
∂2
log f (x; θ0 )]0 (continuous case)
 ∂ ∂

E ∂θ∂θ0
log f (x; θ0 ) = −E [ ∂θ log f (x; θ0 )][ ∂θ
In the special case where θ̂ is a MLE,

37
B0 = E[ψθ (xn ; θ0 )ψθ (xn ; θ0 )0 ] = E[ ∂θ
∂ ∂
log p(xn ; θ0 ) ∂θ log p(xn ; θ0 )0 ] (121)

∂ 2
C0 = E[ψθθ (xn ; θ0 )] = E[ ∂θ∂θ 0 log p(xn ; θ0 )] (122)

The information equality states that,


h i
∂2
log p(x; θ0 )]0
 ∂ ∂

E ∂θ∂θ0
log p(x; θ0 ) = −E [ ∂θ log p(x; θ0 )][ ∂θ (123)

or that C0 = −B0 , which yields the following 3 equivalent formulas for the asymptotic
variance of the MLE,

2 −1 2

V0 = E[ ∂θ∂θ 0 log f (xn ; θ0 )]

E[ ∂θ ∂
log p(xn ; θ0 ) ∂θ log p(xn ; θ0 )0 ]E[ ∂θ∂θ

0 log f (xn ; θ0 )]
−1
(124)

∂ 2 −1
V0 = −E[ ∂θ∂θ 0 log p(xn ; θ0 )] (125)


V0 = E[ ∂θ ∂
log p(xn ; θ0 ) ∂θ log p(xn ; θ0 )0 ]−1 (126)

Applying Theorem 4.4 yields the following result,

Theorem 4.6 (Asymptotic Normality of Maximum Likelihood Estimators). .

(i) Suppose that the conditions of Theorem 4.3 are satisfied, θ0 ∈ int(Θ), p(xn ; θ) is twice
∂2
continuously differentiable, p(xn ; θ) > 0 in a neighborhood N of θ 0 , J0 = −E[ ∂θ∂θ 0 log p(xn ; θ0 )]

 
∂2 dist.
exists and is non-singular, and E sup || ∂θ∂θ 0 log p(xn ; θ)|| < ∞. Then N (θ̂−θ0 ) −→
θ∈N
N (0, V0 ) where V0 = J0−1 . (discrete case)

(ii) Suppose that the conditions of Theorem 4.3 are satisfied, θ0 ∈ int(Θ), f (xn ; θ) is twice
R
continuously differentiable, f (xn ; θ) > 0 in a neighborhood N of θ0 , sup ||fθ (xn ; θ)||dx <
θ∈N
∂2
R
∞, sup ||fθθ (xn ; θ)||dx < ∞, J0 = −E[ ∂θ∂θ 0 log f (x ;
n 0θ )] exists and is non-singular,
θ∈N

 
∂2 dist.
and E sup || ∂θ∂θ 0 log f (xn ; θ)|| < ∞. Then N (θ̂−θ0 ) −→ N (0, V0 ) where V0 = J0−1 .8
θ∈N
(continuous case)
8
Adapted from Newey and McFadden (1994).

38
To estimate V0 , we can use one of the following three formulas,

N
!−1 N
! N
!−1
X X X
∂2 ∂2
V̂ = 1
N ∂θ∂θ0
log f (xn ; θ̂) 1
N

∂θ

log f (xn ; θ̂) ∂θ log f (xn ; θ̂)0 1
N ∂θ∂θ0
log f (xn ; θ̂)
n=1 n=1 n=1
(127)

N
!−1
X
1 ∂2
V̂ = − N ∂θ∂θ0
log f (xn ; θ̂) (128)
n=1

N
!−1
X
V̂ = 1
N

∂θ

log f (xn ; θ̂) ∂θ log f (xn ; θ̂)0 (129)
n=1

Note that,

N
X N
X
∂2 1 ∂2 1 1 ∂2
∂θ∂θ0 N
l(θ̂) = ∂θ∂θ0 N
log f (xn ; θ̂) = N ∂θ∂θ0
log f (xn ; θ̂) (130)
n=1 n=1

∂2 1
Since we must compute l(θ̂)
in the course of checking whether θ̂ satisfied the SOC,
∂θ∂θ0 N  −1
∂2 1
the simplest estimator is V̂ = − ∂θ∂θ0 N l(θ̂) . The first of the three estimators above is
called the sandwich estimator and is the estimator we would use if specifying the “, robust”
option in stata. It is important to note that in general, the standard errors will not be robust
to departures from the assumed model since consistency of θ̂ is only guaranteed when the
assumed model is correct. Exceptions to this are the linear model, where the MLE will be
consistent even if the error terms of heteroskedastic or not normally distributed. In this case,
the information equality will fail to hold and the sandwich estimator for the variance must
be used. For most MLEs, there is no benefit (but also no harm) to using robust standard
errors.

4.5 Efficiency and the Cramer Rao Lower Bound


Why do people like maximum likelihood estimators so much? To guarantee a consistent
estimator, the limiting objective function must satisfy the identification condition. Finding
an objective function that satisfies this condition is hard however, and maximum likelihood
gives an automatic way to satisfy the identification condition.
There is another important reason to prefer maximum likelihood estimators over alter-
native consistent and asymptotically normal estimators—they are efficient. Consider any

39
√ dist.
estimator θ̂ (not even, necessarily, an M-estimator) such that N (θ̂ − θ0 ) −→ N (0, V0 )
where V0 is positive definite. The Cramer Rao Lower Bound implies that V0 − J0−1 is a posi-
tive semi-definite matrix. Since the maximum likelihood estimator achieves the lower bound
J0−1 , we say that the maximum likelihood estimator is efficient. There are other estimators
that achieve this lower bound—for example, a properly constructed Bayesian estimator.

Theorem 4.7 (Cramer Rao Lower Bound). Let θ̂ be an estimator of θ0 such that N (θ̂ −
dist.
θ0 ) −→ N (0, V0 ) where V0 is positive definite. Then V0 − J0−1 is a positive semi-definite
matrix.
√ dist.
Proof. Suppose that N (θ̂ − θ0 ) −→ N (0, V0 ). This implies that lim E[θ̂ − θ0 ] = 0.9
N →∞
Consider,
Z
E[θ̂ − θ] = (θ̂ − θ)f (x; θ)dx (131)
x

Differentiating with respect to θ0 , we obtain,


Z Z

∂θ0
E[θ̂ − θ] = −I f (x; θ)dx + (θ̂ − θ)fθ (x; θ)0 dx (132)
x x
fθ (x;θ)0
Note that ∂
∂θ0
log f (x; θ) = f (x;θ)
, or f (x; θ) ∂θ∂ 0 log f (x; θ) = fθ (x; θ)0 , so that,
Z

∂θ0
E[θ̂ − θ] = −I + ∂
(θ̂ − θ) ∂θ log f (x; θ)0 f (x; θ)dx (133)
x

Evaluating this at θ = θ0 and taking limits of both sides, we obtain,


Z
0= lim ∂ 0 E[θ̂ − θ0 ] = −I + lim ∂
(θ̂ − θ0 ) ∂θ log f (x; θ0 )0 f (x; θ0 )dx (134)
N →∞ ∂θ N →∞ x

From this, we can obtain,

Z √ N
X
I = lim N (θ̂ − θ0 ) √1N ∂
∂θ
log f (xn ; θ0 )0 f (x; θ0 )dx (135)
N →∞ x n=1
or,
" N
#
√ X
0
I = lim E N (θ̂ − θ0 ) √1N ∂
∂θ
log f (xn ; θ0 ) (136)
N →∞
n=1

Define,
9
See A.1

40

V0 = lim V ar( N (θ̂ − θ0 )) (137)
N →∞

Note that,

N
X
V ar( √1N ∂
∂θ
log f (xn ; θ0 )0 ) = V ar( ∂θ

log f (xn ; θ0 )0 ) (138)
n=1


= E[ ∂θ ∂
log f (x; θ0 ) ∂θ log f (x; θ0 )0 ]


lim Cov( N (θ̂ − θ0 ), √1N ∂θ∂ 0 log f (x; θ0 ) (139)
N →∞

" N
#
√ X
0
= lim E N (θ̂ − θ0 ) √1N ∂
∂θ
log f (xn ; θ0 ) =I
N →∞
n=1

Combined, we have,
√ ! " #
N (θ̂ − θ0 ) V0 I
lim V ar PN ∂ = (140)
√1
n=1 ∂θ log f (xn ; θ0 ) I J0
N →∞
N

Since the variance must be positive semi-definite, we have that for all z,
" #
V0 I
z0 z≥0 (141)
I J0
We have, z10 V0 z1 + z20 J0 z2 + 2z10 Iz2 ≥ 0. Setting z1 = α and z2 = −J0 −1 α for any α, we obtain
α0 (V0 − J0 −1 )α ≥ 0 for all α. It follows that V0 − J0 −1 is positive semi-definite.

It is possible that an alternative estimator could be “superefficient” (to have lower


variance than the Cramer Rao lower bound). This is ruled out by the requirement that
√ dist.
N (θ̂ − θ0 ) −→ N (0, V0 ) where V0 is positive definite. Superefficient estimators converge at

a rate faster than N . van der Vaart (1997), for example, argues that while such estimators
exist, they are superefficient for θ0 on a set of measure zero. It is also possible for an esti-
mator that is asymptotically biased to be “superefficient”. Both of these are technicalities
that are not worth worrying about because they do not lead to estimators that are better in
practice.

Example 4.9 (An Inefficient Estimator of the Mean of the Normal Distribution).

41
Suppose that Xn ∼ N (µ, 1). We can show that the MLE is µ̂ = X̄. Consider the alterna-
tive estimator, µ̂2 = N2 (X1 + X3 + ... + X2N −1 ), which is also consistent and asymptotically
normal. We have that,

N
! N N
X X X
1 1 1 1
V ar(µ̂) = V ar N
Xn = N2
V ar(Xn ) = N2
1= N
(142)
n=1 n=1 n=1

V ar(µ̂2 ) = V ar( N2 (X1 + X3 + ... + X2N −1 )) (143)

4 4 N 2
= N2
(V ar(X1 ) + V ar(X3 ) + ... + V ar(X2N −1 )) = N2 2
1 = N

Note that V ar(µ̂2 ) > V ar(µ̂), or that the alternative estimator has larger variance than the
MLE.

Example 4.10 (Another Inefficient Estimator of the Mean of the Normal Distribution).

Suppose that Xn ∼ N (µ, σn2 ) where σn2 is known. The log-likelihood is given by,

N N
X 1 2 /σ 2
X
l(µ) = log σ 1


e− 2 (Xn −µ) n
= − log σn − 21 log(2π) − 12 (Xn − µ)2 /σn2 (144)
n
n=1 n=1

Using first-order conditions, we can determine that the MLE is given by,
PN Xn
2
n=1 σn
µ̂ = PN 1 (145)
2
n=1 σn

Consider the alternative estimator X̄ which is also consistent and asymptotically normal.
We can show that,
PN PN
V ar( Xσ2n )
n=1 n=1
1 2
σn4 σn 1 1
N
V ar(µ̂) = P n2 = P 2 = PN 1
= 1
PN 1 (146)
N 1 N 1 2 2
2 2 n=1 σn N n=1 σn
n=1 σn n=1 σn

N
X
V ar(X̄) = 1
N2
σn2 (147)
n=1

Consider the special case where σn2 = (1, 2, 1, 2, ...). We have,

42
1 1
N N 4
V ar(µ̂) = 1
PN 1 1
= 1 3N
= 3N
(148)
N n=1 (1 + 2
+ 1 + + ...)
2 N 2 2

N
X
1 1
V ar(X̄) = N2
(1 + 2 + 1 + 2 + ...) = 3N
N2 2
= 3
2N
(149)
n=1
3 4
Since 2N > 3N , we have that the alternative estimator has smaller variance than the MLE.
More generally, let φ(x) = x−2 . We have φ00 (x) = 6x−4 > 0 indicating that φ is convex.
Jensen’s inequality states that for a convex function, φ(λ1 x1 + ... + λN xN ) ≤ λ1 φ(x1 ) + ... +
λN φ(xN ) for λ1 + ... + λN = 1 where equality holds if and only if x1 = ... = xN . Applying
this, we have,

N
1
X
1 1
N σn2 > 1
PN (150)
n=1 N n=1 σn2
or,

N
X 1
1
N
σn2 > 1
PN 1
(151)
n=1 N n=1 σn2

which further implies,

N
X 1
1
N2
σn2 > 1
PN
N 1
(152)
n=1 N n=1 σn2

Hence, V ar(X̄) > V ar(µ̂), i.e. the alternative estimator has greater variance than the MLE.

4.6 Suggested Reading


4.6.1 Background

[1] Gallant (1997)

[2] King (1998)

[3] Greene (2000)

[4] Wooldridge (2002)

43
4.6.2 Advanced

[1] Newey and McFadden (1994)

[2] White (1984)

5 Probit, More on Logit, and Ordered Probit


5.1 The Probit Model
The Probit model is an alternative binary choice model where,

Pr(yn = 1|xn ; β) = Φ(β 0 xn ) (153)

We can derive the Probit model as a latent variable model. Suppose that yn∗ = β 0 xn + εn
and εn ∼ N (0, 1). In this case, we have,

Pr(yn = 1|xn ; β) = Pr(yn∗ ≥ 0|xn ; β) = Pr(β 0 xn + εn ≥ 0|xn ; β) (154)

= Pr(εn ≥ −β 0 xn |xn ; β) = 1 − Pr(εn < −β 0 xn |xn ; β) = 1 − Fε (−β 0 xn )

= 1 − Φ(−β 0 xn ) = Φ(β 0 xn )

There is a third interpretation of the probit model, which is the random utility interpre-
tation. Let u0n and u1n denote the utilities of option 0 and option 1. We can specify,

u0n = 0 (155)

u1n = β 0 xn + εn (156)

where εn ∼ N (0, 1). We “normalize” u0n = 0 since only difference in utilities matter. We
observe the individual choosing 1 over 0 if u1n ≥ u0n or if u1n − u0n = β 0 xn + εn . Notice that
this is equivalent to the latent variable formulation, so we can interpret yn∗ as the latent
utility difference between option 1 and option 0 and β as indicating whether covariate xn is
associated with increasing the utility difference between option 1 and option 0.
For the Probit model, we can compute marginal effects as follows,

44
1 0 2

Pr(yn = 1|xn ; β) = φ(β 0 xn )βk = √1 e− 2 (β xn ) βk (157)
∂xn,k 2π

An alternative approach for computing substantive effects is to rely on differences in predicted


probabilities Φ(β 0 a) − Φ(β 0 b), for two vectors of covariates a and b.
It is possible to compare logit and probit coefficients, but a scale adjustment must be
2
made. It is conventional to assume that V ar(εn ) = 1 for a probit model and V ar(εn ) = π3
for a logit model. For this reasons, to compare the logit and probit models, we would use
βlogit ≈ √π3 βprobit .

5.2 Perfect Separation


Consider a logit model Pr(Yn = 1; β) = Λ(β1 + β2 Xn ) and suppose that the data are such
that Yn = 1 for Xn = −0.8, −0.6, −0.5, −0.3 and Yn = 0 for Xn = 0.2, 0.3, 1.5, 8.1. We can
write the log likelihood as,

l(β) = log Λ(β1 −β2 ∗0.8)+log Λ(β1 −β2 ∗0.6)+log Λ(β1 −β2 ∗0.5)+log Λ(β1 −β2 ∗0.3) (158)

+ log(1 − Λ(β1 + β2 ∗ 0.2)) + log(1 − Λ(β1 + β2 ∗ 0.3))

+ log(1 − Λ(β1 + β2 ∗ 1.5)) + log(1 − Λ(β1 + β2 ∗ 8.1))

If we select β1 = 0, we can make the log-likelihood arbitrarily close to zero by making β2


arbitrarily close to negative infinity. We can say that the maximum will occur when β2 = −∞
and β1 is any finite number. In this case, there are multiple optima that are only obtained
asymptotically.
This type of situation is called perfect separation. When perfect separation is present,
there is no unique optimum. This is different than an identification problem because if
the underlying probit model is correct, this cannot occur in large samples. Since the error
distribution has full support, there is always some probability of observing a zero and a one
for every combination of independent variables.

45
5.3 Testing Hypotheses

Suppose that we want to test the null hypothesis H0 : c(β0 ) = 0 and suppose that N (β̂ −
dist. √
β0 ) −→ N (0, V0 ). The same derivation used by the delta method implies that N (c(β̂) −
dist.
c(β0 )) −→ N (0, cβ V0 c0β ). Under the null hypothesis, we have,

√ dist.
N c(β̂) −→ N (0, cβ V0 c0β ) (159)

or,

√ dist.
(cβ V0 c0β )−1/2 N c(β̂) −→ N (0, I) (160)

or,

dist.
W = N c(β̂)0 (cβ V0 c0β )−1 c(β̂) −→ χ2R (161)

where R = dim(c). We can use this test (called the Wald test) to test potentially nonlinear
hypotheses about multiple parameter values.

5.4 Interactions in Binary Choice Models


Consider a linear regression model with two independent variables and no interaction term.
We have,

E[yn |xn , zn ; β] = β1 + β2 xn + β3 zn (162)

∂2 ∂
∂xn ∂zn
E[yn |xn , zn ; β] = β
∂zn 2
=0 (163)

Consider alternatively a binary choice model with two independent variables and no inter-
action term (this can be either a logit or probit model),

Pr(yn = 1|xn , zn ; β) = G(β1 + β2 xn + β3 zn ) (164)


ex
where G(x) = Λ(x) = 1+ex
or G(x) = Φ(x). We have,

∂2
∂xn ∂zn
Pr(yn = 1|xn , zn ; β) = ∂
∂zn
g(β1 + β2 xn + β3 zn )β2 = g 0 (β1 + β2 xn + β3 zn )β2 β3 (165)

46
ex
We can calculate g and g 0 for the logit and probit models as follows. If G(x) = 1+ex
, then
1 2
ex (1−ex )ex
g(x) = and g 0 (x) = . If G(x) = Φ(x), then g(x) = φ(x) = √1 e− 2 x and
(1+ex )2 (1+ex )3 2π
1 2
g 0 (x) = − √12π xe− 2 x .
The marginal effect of xn depends on zn even though no interaction
term is explicitly included in the model.
Does this mean that we can study interaction effects without an interaction term? In
general, this is not a good idea because even though the interaction effects are non-zero,
their form is not flexibly estimated.
Consider the same calculation with an interaction term,

∂2 ∂
∂xn ∂zn
Pr(yn = 1|xn , zn ; β) = ∂zn
g(β1 + β2 xn + β3 zn + β4 xn zn )(β2 + β4 zn ) (166)

= g 0 (β1 + β2 xn + β3 zn + β4 xn zn )(β2 + β4 zn )(β3 + β4 xn ) + g(β1 + β2 xn + β3 zn + β4 xn zn )β4

Even in this case, it turns out that regardless of the coefficient vector β, there will always
exists values of (xn , zn ) such that the interaction is positive and other values of (xn , zn ) such
that the interaction is negative. For example, consider the probit model. Suppose that
(xn , zn ) = (B, B) for large B > 0. We have,

∂2
∂xn ∂zn
Pr(yn = 1|xn , zn ; β) (167)

= g 0 (β1 + β2 B + β3 B + β4 BB)(β2 + β4 B)(β3 + β4 B) + g(β1 + β2 B + β3 B + β4 BB)β4

1 2 2
→ g 0 (β4 BB)(β4 B)2 + g(β4 B 2 )β4 = √1 e− 2 (β4 B ) −(β4 B)2 + β4
 

1 2 )2
→ − √12π e− 2 (β4 B (β4 B)2 < 0

If we instead suppose that (xn , zn ) = (B, −B) for large B > 0, we have,

∂2
∂xn ∂zn
Pr(yn = 1|xn , zn ; β) = (168)

47
= g 0 (β1 + β2 B − β3 B − β4 BB)(β2 − β4 B)(β3 + β4 B) + g(β1 + β2 B − β3 B − β4 BB)β4

1 2 2
→ −g 0 (−β4 BB)(β4 B)2 + g(−β4 B 2 )β4 = √1 e− 2 (β4 B ) (β4 B)2 + β4
 

1 2 2
→ √1 e− 2 (β4 B ) (β4 B)2 >0

There are a few consequences of this though—it may not make sense to theorize about
interaction effects for binary variables since interaction effects on the probability of observing
a 1 will always be present and both signs will always be possible. So for example, it does not
make sense to develop a theory that says that the positions of candidates and the economy
have a positive interactive effect on the probability of voting for the incumbent, since there
will always be covariates values for which this is true and covariate values for which this is
false. We can however develop theories about interaction effects on the latent propensity to
vote for the incumbent or about the latent utility for voting for the incumbent. However, if
we develop these theories, we would test them based on the coefficient β4 rather than some
2
measure of the marginal effect ∂xn∂ ∂zn Pr(yn = 1|xn , zn ; β) and the difference in predicted
probabilities.

5.5 Ordered Probit


The ordered probit model is typically motivated using a latent variable approach. Suppose
that yn∗ = β 0 xn + εn where εn ∼ N (0, 1). Suppose that yn ∈ {1, 2, ..., J} where we observe,

yn = 1 ⇔ yn∗ < τ1 (169)

yn = 2 ⇔ τ1 < yn∗ < τ2 (170)

... (171)

yn = J − 1 ⇔ τJ−2 < yn∗ < τJ−1 (172)

48
yn = J ⇔ yn∗ > τJ−1 (173)

where τ1 < τ2 < ... < τJ−1 . From this, we obtain,

Pr(yn = 1) = Φ(τ1 − β 0 xn ) (174)

Pr(yn = 2) = Φ(τ2 − β 0 xn ) − Φ(τ1 − β 0 xn ) (175)

... (176)

Pr(yn = J − 1) = Φ(τJ−1 − β 0 xn ) − Φ(τJ−2 − β 0 xn ) (177)

Pr(yn = J) = 1 − Φ(τJ−1 − β 0 xn ) (178)

To identify the model, we must omit a constant term (or alternatively, normalize τJ−1 = 0).
To simplify writing the likelihood, we define τ0 = −∞ and τJ = ∞ so that Pr(yn = j) =
Φ(τj − β 0 xn ) − Φ(τj−1 − β 0 xn ). This allows us to write the log-likelihood as,

N
X
l(β, τ ) = log [Φ(τyn − β 0 xn ) − Φ(τyn −1 − β 0 xn )] (179)
n=1

or equivalently,

N X
X J
l(β, τ ) = 1{yn = j} log [Φ(τj − β 0 xn ) − Φ(τj−1 − β 0 xn )] (180)
n=1 j=1

Marginal effects are fairly complicated with the ordered probit model. Consider,

Pr(yn = 2|xn ; β) = Φ(τ2 − β 0 xn ) − Φ(τ1 − β 0 xn ) (181)


∂xnk
Pr(yn = 2|xn ; β) = −(φ(τ2 − β 0 xn ) − φ(τ1 − β 0 xn ))βk (182)

1 0 2 1 0 2 1 0 2 1 0 2
= √1 (e− 2 (−β xn +τ1 ) − e− 2 (−β xn +τ2 ) )βk = √1 (e− 2 (β xn −τ1 ) − e− 2 (β xn −τ2 ) )βk
2π 2π

49
1 2 1 2
= √1 (e− 2 (z−τ1 ) − e− 2 (z−τ2 ) )βk

For z < 12 (τ1 +τ2 ), we have (z −τ2 )2 > (z −τ1 )2 which implies − 21 (z −τ2 )2 < − 12 (z −τ1 )2 which
1 2 1 2
further implies e− 2 (z−τ2 ) < e− 2 (z−τ1 ) . Hence, sign(M E) = sign(βk ). For z > 21 (τ1 + τ2 ),
we have (z − τ2 )2 < (z − τ1 )2 which implies − 12 (z − τ2 )2 > − 21 (z − τ1 )2 which further implies
1 2 1 2
e− 2 (z−τ2 ) > e− 2 (z−τ1 ) . Hence, sign(M E) 6= sign(βk ). The sign of the marginal effect will
reverse depending on how large z is, for any intermediate category (for the end categories,
it will be monotonic).
In general, marginal effects will not be useful summaries of effect sizes because they will
generally not be monotonic. Instead, the best approach is to predict the probabilities of each
value of the dependent variable for different values of x. When interpreting the coefficients
directly, we can use the sign of a coefficient to determine the effect of the IV on the latent
variable. From this, a positive (negative) coefficient implies means that as the IV increases,
we will observe higher (lower) values of the DV. However, we cannot say anything about the
effect of the IV on observing particular, intermediate, values of the DV, without computing
substantive effects.

5.6 Ordered Logit


The ordered logit is nearly identical to the ordered probit model. We set yn∗ = β 0 xn + εn
where εn ∼ Logistic(0, 1). We can write,

Pr(yn = j) = Λ(τj − β 0 xn ) − Λ(τj−1 − β 0 xn ) (183)


e z
where Λ(z) = 1+e z is the CDF of the standard logistic distribution. Besides replacing the

normal CDF Φ with the logistic CDF Λ in the various expression, the only difference between
the ordered logit and ordered probit models is that you can interpret effects sizes using odds
ratios.
Note that,
0
eτj −β xn 1
Pr(yn > j) = 1 − Λ(τj − β 0 xn ) = 1 − 0x = (184)
1+e jτ −β n 1 + e j −β 0 xn
τ

0
1 eτj −β xn
Pr(yn ≤ j) = 1 − Pr(yn > j) = 1 − = (185)
1 + eτj −β 0 xn 1 + eτj −β 0 xn

50
Thus, if we calculate the odds of yn > j over yn ≤ j at xn = (xn,k = a, xn,−k ), and define
β−k = (β1 , · · · , βk−1 , βk+1 , · · · , βK ), then,

1
Pr(yn > j) τ 0
j −β xn 1
= 1 +τ e−β 0 xn = τj −β 0 xn = e−(τj −β xn )
0
(186)
Pr(yn ≤ j) ej e
0
1 + eτj −β xn
We can calculate the odds ratio when xn,k = a + 1 over xn,k = a, keeping the rest of the
variables at their mean value (that is, xn,−k ),

Pr(yn > j|xnk = a + 1)


0
Pr(yn ≤ j|xnk = a + 1) e−τj +β−k xn,−k +βk ·(a+1)
= 0 x
−τj +β−k
= eβk (187)
Pr(yn > j|xnk = a) n,−k +βk ·a
e
Pr(yn ≤ j|xnk = a)
This implies that if we exponentiate βk , we have the odds ratio of an outcome greater than
j to an outcome less than or equal to j, for a one unit increase in xnk .

5.7 Applications
Application 2.1: Individual Level Model of the Economic Vote (continued).

In Table 4, we compare the results of a logit and probit model. As can be see, the
coefficient generally have the same signs, but are of different magnitudes. The third column
examines the ratios of the logit to probit coefficients and the fourth column compares these
directly to π3 . Table 5 compares the marginal effects of the logit and probit models. These
p

are nearly identical, suggesting that there is little consequence of the choice between a logit
and probit model.

Application 5.1 (The Effect of Election Day Registration on Voter Turnout).

This application is drawn from Berry, Demeritt and Esaray (2010), who compare specifi-
cations from Wolfinger and Rosenstone (1980) and Nagler (1991). The main effect of interest
is the effect of the closing date for registration on voter turnout. Voters turnout is modeled
using a probit model. Closing date is measured in terms of days to an election before reg-
istration closes (with a value of zero indicating election day registration). Table 6 presents
specification with and without interaction terms. Figure 1 plots the effect sizes using both
models.

Application 5.2 (The Effect of Anti-war Speeches on Support for War).

51
q
π
Logit Probit Ratio Ratio / 3
Independent Variables:
Constant -0.231** -0.157** 1.467 0.809
(0.085) (0.050)
Distance -0.361*** -0.214*** 1.686 0.929
(0.008) (0.004)
Education -0.111*** -0.067*** 1.670 0.921
(0.007) (0.004)
Age 0.006*** 0.004*** 1.680 0.926
(0.001) (0.000)
Gender 0.021 0.014 1.550 0.855
(0.023) (0.014)
Income 0.070*** 0.040*** 1.739 0.959
(0.009) (0.005)
Growth 0.132*** 0.078*** 1.683 0.928
(0.008) (0.005)
Unemployment -0.038*** -0.021*** 1.827 1.007
(0.005) (0.003)

N 39550 39550

Table 4: Comparing Logit and Probit — Standard errors in parentheses. + p < .10,∗ p < .05,∗∗ p < .01, and
∗∗∗
p < .001.

ME1 ME2 CP1 CP2


Logit Probit Logit Probit Logit Probit Logit Probit

Independent Variables:
Distance -0.072*** -0.072*** -0.070*** -0.069*** -0.066*** -0.067*** -0.065*** -0.065***
(0.002) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
Education -0.022*** -0.023*** -0.022*** -0.022*** -0.022*** -0.022*** -0.021*** -0.021***
(0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
Age 0.001*** 0.001*** 0.001*** 0.001*** 0.001*** 0.001*** 0.001*** 0.001***
(0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
Gender 0.004 0.005 0.004 0.004 0.004 0.005 0.004 0.004
(0.005) (0.005) (0.004) (0.004) (0.005) (0.005) (0.004) (0.004)
Income 0.014*** 0.014*** 0.014*** 0.013*** 0.014*** 0.014*** 0.014*** 0.013***
(0.002) (0.002) (0.002) (0.002) (0.002) (0.002) (0.002) (0.002)
Growth 0.027*** 0.026*** 0.026*** 0.025*** 0.027*** 0.027*** 0.026*** 0.026***
(0.002) (0.002) (0.002) (0.002) (0.002) (0.002) (0.002) (0.002)
Unemployment -0.008*** -0.007*** -0.007*** -0.007*** -0.008*** -0.007*** -0.007*** -0.007***
(0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)

Table 5: Comparing Logit and Probit Marginal Effects — Standard errors in parentheses. + p < .10,∗ p <
.05,∗∗ p < .01, and ∗∗∗
p < .001.

52
(1) (2)

Independent Variables:
Constant -2.523*** -2.743***
(0.048) (0.109)
Closing Date -0.008*** 0.001
(0.000) (0.004)
Education 0.182*** 0.265***
(0.015) (0.043)
Education Sq. 0.012*** 0.005
(0.001) (0.004)
Age 0.070*** 0.070***
(0.001) (0.001)
Age Sq. -0.001*** -0.001***
(0.000) (0.000)
South -0.116*** -0.115***
(0.011) (0.011)
Governor Race 0.003 0.003
(0.012) (0.012)
Closing Date * Education -0.003*
(0.002)
Closing Date * Education Sq. 0.000+
(0.000)

N 99676 99676
AIC 111652 111651
BIC 111728 111746
McFadden R2 0.117 0.117
McKelvey R2 0.233 0.233
P-Value for Joint Test of Interaction Terms 0.066+
Effect of Moving Closing Date to Election Day 0.060*** 0.059***

Table 6: The Effect of Registration Closing Dates on Voter Turnout — Standard errors in parentheses.
+
p < .10,∗ p < .05,∗∗ p < .01, and ∗∗∗
p < .001.

53
Effect of Moving Closing Date from Mean to Zero

−0.10 −0.05 0.00 0.05 0.10

0
2
4
6

Education Level
8
No Interaction Model

54
Effect of Moving Closing Date from Mean to Zero

Figure 1
−0.10 −0.05 0.00 0.05 0.10

0
2
4
6

Education Level
8
Interaction Model
0.5

● ●
0.4

1.0
0.3

Ordered Logit Coefficients



OLS Coefficients



0.2

0.5
0.1



0.0

0.0



−0.1
−0.2

● ●

−0.2 0.0 0.2 0.4 0.6 −0.2 0.0 0.2 0.4 0.6

Ordered Probit Coefficients Ordered Probit Coefficients

Figure 2: Comparing OLS, Ordered Probit, and Ordered Logit Coefficients

OLS Ordered Probit Ordered Logit

Independent Variables:
Constant 1.603***
(0.149)
Anti-war Speeches -0.047** -0.067** -0.107**
(0.015) (0.023) (0.040)
Republican 0.482*** 0.712*** 1.326***
(0.059) (0.093) (0.167)
Democrat -0.195** -0.283** -0.425**
(0.064) (0.092) (0.161)
Age 0.004* 0.006* 0.011*
(0.002) (0.002) (0.004)
Education 0.087** 0.120** 0.225**
(0.028) (0.042) (0.077)
Male 0.287*** 0.426*** 0.721***
(0.049) (0.073) (0.126)
White 0.228** 0.319** 0.584**
(0.077) (0.113) (0.192)
Cutpoint 1|2 0.001 0.196
(0.223) (0.389)
Cutpoint 2|3 1.185*** 2.213***
(0.228) (0.395)
Cutpoint 3|4 2.840*** 5.147***
(0.248) (0.442)

N 956 956 956


AIC 2157 2112 2105
BIC 2201 2160 2153
R2 0.209
McKelvey R2 0.238 0.228
McFadden R2 0.095 0.098

+
Table 7: Comparing OLS, Ordered Probit, and Ordered Logit — Standard errors in parentheses. p<
∗ ∗∗ ∗∗∗
.10, p < .05, p < .01, and p < .001.

55
1.0

Stay the Course:1


0.8

Stay the Course:2


Stay the Course:3
Stay the Course:4
0.6
Prob.

0.4
0.2
0.0

0 5 10 15

Anti−war Speaches

Figure 3: Substantive Effects of Anti-War Speeches

56
1.0

Stay the Course:1


Stay the Course:2
0.8

Stay the Course:3


Stay the Course:4
0.6
Prob.

0.4
0.2
0.0

−40 −20 0 20 40

Anti−war Speaches

Figure 4: Substantive Effects of Anti-War Speeches, Extended Range

57
Ordered Probit Ordered Logit
1.0

1.0
Stay the Course:1
0.8

0.8
Stay the Course:2
Stay the Course:3
Stay the Course:4
0.6

0.6
Prob.

Prob.
0.4

0.4
0.2

0.2
0.0

0.0
0 5 10 15 0 5 10 15

Anti−war Speaches Anti−war Speaches

Figure 5: Substantive Effects of Anti-War Speeches, Ordered Probit vs. Ordered Logit

High Info Low Info


1.0

1.0

Stay the Course:1 Stay the Course:1


0.8

0.8

Stay the Course:2 Stay the Course:2


Stay the Course:3 Stay the Course:3
Stay the Course:4 Stay the Course:4
0.6

0.6
Prob.

Prob.
0.4

0.4
0.2

0.2
0.0

0.0

0 5 10 15 0 5 10 15

Anti−war Speaches Anti−war Speaches

Figure 6: Substantive Effects of Anti-War Speeches, Interactive Model

58
(1) (2)

Independent Variables:
Anti-war Speeches -0.067** -0.025
(0.023) (0.024)
Know Party 0.014
(0.077)
Anti-war Speeches * Know Party -0.092+
(0.050)
Republican 0.712*** 0.708***
(0.093) (0.094)
Democrat -0.283** -0.281**
(0.092) (0.092)
Age 0.006* 0.006*
(0.002) (0.002)
Education 0.120** 0.123**
(0.042) (0.043)
Male 0.426*** 0.430***
(0.073) (0.074)
White 0.319** 0.323**
(0.113) (0.113)
Cutpoint 1|2 0.001 0.020
(0.223) (0.223)
Cutpoint 2|3 1.185*** 1.208***
(0.228) (0.227)
Cutpoint 3|4 2.840*** 2.864***
(0.248) (0.248)

N 956 956
AIC 2112 2112
BIC 2160 2171
McKelvey R2 0.238 0.241
McFadden R2 0.095 0.097

Table 8: Ordered Probit Models — Standard errors in parentheses. +


p < .10,∗ p < .05,∗∗ p < .01,∗∗∗ p <
.001.

59
Application 5.3 (Measuring the Locations of National Policy).

This application is drawn from Richman (2011).

Status Quo Estimates Prior to 2000 Congressionl Election

Spend: Nat. Def. ●


Spend: Arts ●
Spend: Env. Prog. ●
Spend: Welfare ●
Spend: Inter. Aid ●
Spend: Educ. ●
Spend: Nat. Parks ●
Spend: Army Training ●
Spend: Covert Intel. ●
Spend: Defense Plant Conv. ●
Spend: Mili. Hardware ●
Spend: Mili. Space Shuttle ●
Spend: Weapon Modern. ●
Spend: Mis. Defense ●
Spend: R&D ●
Spend: Army Readiness ●
Tax: Retiree Inc. > 40k ●
Tax: Fam. Inc. < 25k ●
Tax: Fam. Inc. 25−75k ●
Tax: Fam. Inc. 75−150k ●
Tax: Fam. Inc. > 150k ●
Tax: Alcohol ●
Tax: Cap. Gains ●
Tax: Char. Deduct. ●
Tax: Cigarette ●
Tax: Corporate ●
Tax: Earned Inc. Tax. Cred. ●
Tax: Estate ●
Tax: Med. Exp. Deduct. ●
Tax: Mort. Deduct. ●
Tax: Gasoline ●
Tax: Stud. Loan Tax Cred. ●

−10 −5 0 5

W−Nominate Score

Figure 7: Status Quo Policy Outcomes

5.8 Suggested Reading


5.8.1 Background

[1] Berry, Demeritt and Esaray (2010)

[2] Greene (2000)

[3] Kennedy (1992)

5.8.2 Examples

[1] Kriner and Shen (2014)

[2] Richman (2011)

60
6 Multinomial Choice
6.1 Multinomial Logit
Consider an unordered categorical variable yn ∈ {1, 2, ..., J}. We will motivate the model
using the random utility formulation. Let unj denote the utility individual n gets from choice
j. We specify,

unj = βj0 xn + εnj (188)

where we normalize β1 = 0. We will let εnj ∼ EV1 (0, 1) and we assume that the errors are
independent across choices. Here, EV1 (0, 1) denotes the type-I extreme value distribution,
−x −x
with PDF fX (x) ∼ e−x e−e and CDF FX (x) ∼ e−e .
We observe a choice of yn = j if unj ≥ unk for all k 6= j. We can write,

Pr(yn = j|xn ) = Pr(unj ≥ unk ∀k 6= j) = Pr(βj0 xn + εnj ≥ βk0 xn + εnk ∀k 6= j) (189)

Z Z
= fεn (εn )dεn = fεn (εn,−j |εnj )fεnj (εnj )dεn
εn :βj0 xn +εnj ≥βk0 xn +εnk ∀k6=j εn :βj0 xn +εnj ≥βk0 xn +εnk ∀k6=j

Z ∞ Z
= fεnj (εnj ) fεn (εn,−j |εnj )dεn,−j dεnj
εnj =−∞ εn,−j :εnk ≤(βj −βk )0 xn +εnj ∀k6=j

Z ∞ Z ∞
Y −εnj Y −εnj −(βj −βk )0 xn
0
= fεnj (εnj ) Fεnk ((βj −βk ) xn +εnj )dεnj = e−εnj e−e e−e dεnj
εnj =−∞ k6=j εnj =−∞ k6=j

−εnj 0x 0x
e−(βj −βk ) e−βj −βk )
R∞ P n) R∞ P n)
= εnj =−∞
e−εnj e−e (1+ k6=j dεnj = u=0
e−u(1+ k6=j du

P −(βj −βk )0 xn ∞ 0
e−u(1+ k6=j e )
1 1 eβj xn
= = = 0 =
−(1 + k6=j e−(βj −βk )0 xn ) 1 + k6=j e−(βj −βk )0 xn 0 βk0 xn
PJ
1 + k6=j e−βj xn eβk xn
P P P
u=0 k=1 e

61
We thus have that,
0
eβj xn
Pr(yn = j|xn ) = PJ 0
(190)
k=1 eβk xn
We can write the log-likelihood for the multinomial logit model as,

N 0
X eβyn xn
l(β) = log PJ 0
(191)
n=1 k=1 eβk xn
For identification purposes, we will normalize β1 = 0. To see why this is necessary,
consider the system of equations,
0 0
eβj x eβj0 x
PJ 0
= PJ 0
∀j, x ∈ X (192)
k=1 eβk x k=1 eβk0 x
Consider βk = βk0 + a with a 6= 0. We have,
0 0 0 0
e(βj0 +a) x eβj0 x ea x eβj0 x
PJ = PJ 0
= PJ 0
∀j, x ∈ X (193)
(βk0 +a)0 x βk0 x a0 x βk0 x
k=1 e k=1 e e k=1 e

Hence, β0 cannot be identified.


Instead, consider β1 = 0. Suppose that,
0 0
eβj x eβj0 x
PJ 0
= PJ 0
∀j, x ∈ X (194)
k=1 eβk x k=1 eβk0 x
We can divide each equation for j > 1 by each equation for j = 1 to obtain,

0 0
eβj x = eβj0 x ∀j > 1, x ∈ X (195)

which is equivalent to,

0
e(βj −βj0 ) x = 1∀j > 1, x ∈ X (196)

which is equivalent to,

(βj − βj0 )0 x = 0∀j > 1, x ∈ X (197)

This will hold as long as X has full support. Hence, the model is identified once the restriction
is made.
With the multinomial logit model, we can report odds ratios. For j > 1, we have,

62
β0 x +βj,l a
e j,−l −l
β0 x +βk,l a
e k,−l −l
PJ
k=1
Pr(yn =j|xnl =a,xn,−l ) 1
0
β0
Pr(yn =1|xnl =a,xn,−l )
PJ x +βk,l a
e k,−l −l eβj,−l x−l +βj,l a
Pr(yn =j|xnl =b,xn,−l )
= k=1
β0 x +βj,l b
= 0
βj,−l x−l +βj,l b
= eβj,l (a−b) (198)
e j,−l −l e
Pr(yn =1|xnl =b,xn,−l ) PJ β0 x−l +βk,l b
e k,−l
k=1
1
β0 x +βk,l b
e k,−l −l
PJ
k=1

If we exponentiate the coefficient on choice j > 1, we can interpret this as the ratio of odds
of observing choice j relative to choice 1, from a one unit increase in the covariate.

6.2 Conditional Logit


Under the conditional logit model, unj = β 0 xnj + εnj , where εnj are independent EV1 (0, 1)
random variables. As before, we assume that we observe yn = j if unj ≥ unk for all k 6= j.
This slight difference between these models is that the covariates here are choice specific. A
similar derivation as before leads to,
0
eβ xnj
Pr(yn = j|xn ) = PJ (199)
k=1 eβ 0 xnk
In the multinomial logit model, we always restrict the choice set to be {1, 2, ..., J}. With the
conditional logit model, we can generalize the choice set to Jn ⊂ {1, 2, ..., J} in which case
we have,
0
eβ xnj
Pr(yn = j|xn ) = P β 0 xnk
∀j ∈ Jn (200)
k∈Jn e

We can write the log-likelihood as,

N 0
X eβ xn,yn
l(β) = log P β 0 xnk
(201)
n=1 k∈Jn e

The identification of the conditional logit model is slightly more tricky. Note that,
0 0
e β xj eβ0 xj
P β 0 xk
= P β00 xk
∀j, x ∈ X (202)
k∈Jn e k∈Jn e

which holds if and only if,


0 0
e β xj eβ0 xj
= 0 ∀j, l, x ∈ X (203)
e β 0 xl eβ0 xl

63
which holds if and only if,

(β − β0 )0 (xj − xl ) = 0∀j, l, x ∈ X (204)

If x has some common covariates, the coefficients on those common covariates cannot be
identified because if xj = (z, wj ), we have xj − xl = (0, wj − wl ). We can interact a common
variable with a choice-specific dummy, but we cannot interact it with every choice specific
dummy. We also cannot include a constant term.

6.3 The IIA Property


The multinomial logit model has a potentially unattractive property. Consider the proba-
bility of choosing j over l. We have,

βj0 xn
e 0
Pr(yn = j|xn ) P
ke
β 0 xn
k eβj xn 0
= β 0 xn
= β 0 xn = e(βj −βl ) xn (205)
Pr(yn = l|xn ) e l
P β 0 xn
el
ke
k

Note that the probability does not depend on the choices in the choice set. This is called
the Independence of Irrelevant Alternative (IIA) property. A similar problem occurs with
the conditional logit model.
Consider the choice of three transportation options: Bus, Fancy Bus, Car. Suppose that
the Fancy Bus is eliminated. Suppose that Pr(Bus) / Pr(Car) = 2 (i.e. twice as many
people take the regular bus than the car. Suppose that the fancy bus is eliminated. We
would expect that that Pr(Bus) / Pr(Car) > 2 since the people who take the fancy bus
probably don’t have a car. This is impossible in the multinomial logit framework (at least
conditional on the x’s).
Consider another example. Suppose that George W. Bush, Al Gore, and Ralph Nader
are running. Suppose that Bush gets 48% of the vote, Gore gets 49% of the vote, and Nader
gets 3%. This means that Gore’s two-party vote share would be .49 / (.49 + .48). If Nader
were to leave the race, the multinomial logit model would predict that Gore’s 2 party vote
share would stay exactly the same. In reality, we would expect that far more of the Nader
voters would go to Gore than Bush. This limitation motivates the multinomial probit model,
considered in the next subsection.

64
6.4 Multinomial Probit
A solution to this problem is the multinomial probit model. As with the multinomial logit
model, we specify unj = βj0 xn + εnj , but we now assume that εn ∼ N (0, Ω). As before, we
assume that we observe yn = j if unj ≥ unk for all k 6= j. We have,

Pr(yn = j|xn ) = Pr(unj ≥ unk ∀k 6= j) = Pr(βj0 xn + εnj ≥ βk0 xn + εnk ∀k 6= j) (206)

Z Z
= fεn (εn )dεn = φ(εn ; 0, Ω)dεn
εn :βj0 xn +εnj ≥βk0 xn +εnk ∀k6=j εn :βj0 xn +εnj ≥βk0 xn +εnk ∀k6=j

The integral above has no closed form solution.


The log-likelihood for the multinomial probit model is given by,

N
X Z
l(β, Ω) = log φ(εn ; 0, Ω)dεn (207)
n=1 εn :βy0 n xn +εn,yn ≥βk0 xn +εnk ∀k6=yn

In order to compute the MLE, we need a way of approximating the integral above. Older
software had typically used Gaussian quadrature to compute the integral, but this approach
is not very effective, especially when there are more than three choices. Two approaches
are now used—the first is to approximate the integral above using the GHK simulator.
The second approach is to abandon maximum likelihood estimation and used a Bayesian
approach, which also relies of simulation (Markov Chain Monte Carlo methods in particular).
The GHK simulator can be used to compute integrals of the form,
Z
Pr(a ≤ x ≤ b) = φ(x; µ, Ω)dx (208)
a≤x≤b

which are called rectangles of the normal distribution. One approach for computing such
integrals would use the fact that,
Z
Pr(a ≤ x ≤ b) = φ(x; µ, Ω)dx (209)
a≤x≤b

Z
= 1{a ≤ x ≤ b}φ(x; µ, Ω)dx = E[1{a ≤ x ≤ b}]
x

65
Suppose that {ṽr }R
r=1 are independent draws from φ(x; 0, I). We can form x̃r = µ + Ω
1/2
ṽr ,
which will be independent draws from the N (µ, Ω) distribution. A law of large numbers
would then imply that,

R
prob.
X
1
R
g(x̃r ) −→ E[g(x)] (210)
r=1

Applying this to rectangles of the normal distribution, we have,

R Z
prob.
X
1 1/2
R
1{a ≤ µ + Ω ṽr ≤ b} −→ φ(x; µ, Ω)dx (211)
r=1 a≤x≤b

This approach, while effective for compute rectangles of the normal distribution for particu-
lar values of (a, b, µ, Ω) is problematic when applied as an approximation within a likelihood
function that is being optimized. This is because R1 R 1/2
P
r=1 1{a ≤ µ+Ω ṽr ≤ b} varies discon-
tinuously in the parameters (a, b, µ, Ω). The numerical optimizers that are used to compute
maximum likelihood estimators rely on continuity and differentiability of the likelihood func-
tion and will not function correctly when applied to a discontinuous and non-differentiable
approximation to the likelihood function.
The GHK simulator is designed to approximate rectangles of the normal distribution in
a way that is continuous and differentiable in (a, b, µ, Ω). We can obtain the following using
a change of variables,
Z Z
Pr(a ≤ x ≤ b) = φ(x; µ, Ω)dx = φ(v; 0, I)dv (212)
a≤x≤b a≤µ+Lv≤b

= Pr(a ≤ µ + Lv ≤ b)

where L is the (lower triangular) Cholesky decomposition of Ω.


We illustrate the GHK simulator in the two dimensional case. We have,

Pr(a ≤ µ + Lv ≤ b) (213)

= Pr(a1 ≤ µ1 + L11 v1 ≤ b1 , a2 ≤ µ2 + L21 v1 + L22 v2 ≤ b2 )

= Pr(a1 ≤ µ1 + L11 v1 ≤ b1 ) Pr(a2 ≤ µ2 + L21 v1 + L22 v2 ≤ b2 |a1 ≤ µ1 + L11 v1 ≤ b1 )

66
 
a1 − µ 1 b1 − µ 1
= Pr ≤ v1 ≤
L11 L11

a2 − µ2 − L21 v1 b2 − µ2 − L21 v1 a1 − µ1 b1 − µ 1
∗ Pr( ≤ v2 ≤ | ≤ v1 ≤ )
L22 L22 L11 L11
Note first that,
     
a1 − µ 1 b1 − µ 1 b1 − µ 1 a1 − µ 1
Pr ≤ v1 ≤ =Φ −Φ (214)
L11 L11 L11 L11
 
For the second probability, note that v1 ∼ T N 0, 1, a1L−µ
11
1 b1 −µ1
, L11
where T N denotes the
truncated normal distribution. Let {ṽ1r }R
r=1 denote samples from this truncated normal
distribution. We can approximate the second probability using,

a2 − µ2 − L21 v1 b2 − µ2 − L21 v1 a1 − µ1 b1 µ 1
Pr( ≤ v2 ≤ | ≤ v1 ≤ ) (215)
L22 L22 L11 L11

R    
1
X b2 − µ2 − L21 ṽ1r a2 − µ2 − L21 ṽ1r
≈ R
Φ −Φ
r=1
L22 L22

Putting this together, we have,

Pr(a ≤ µ + Lv ≤ b) ≈ (216)

R          
1
X b1 − µ 1 a1 − µ 1 b2 − µ2 − L21 ṽ1r a2 − µ2 − L21 ṽ1r
≈ R
Φ −Φ Φ −Φ
r=1
L 11 L11 L 22 L22

Note that unlike the expression in equation (211), equation (216) is continuous and differ-
entiable in (µ, Ω, a, b).
To identify the multinomial probit model, we typically normalize the β1 = 0. We also
assume that εn1 = 0 (so that Ωjk = 0 if j = 1 or k = 1) and we assume that Ω22 = 1. A
slight variation of the multinomial probit model allows for choice specific covariates, unj =
βj0 xnj + εnj , and similar to the conditional logit model, there are different considerations for
identification of β0 .

67
6.5 Substantive Effects
When computing effect sizes for the multinomial logit and conditional logit models, the
easiest approach is to compute Pr(y = j|x) for different possible values of x and observe the
probabilities of all the choices. Like the ordered probit model, marginal effects will often
not be very informative summaries because marginal effects will be non-monotonic in the
covariates.
To compute substantive effects for the multinomial probit model, we have to use simu-
lation. Here, two approaches are possible—the first based on equation (211) and the second
based on the GHK simulator. To illustrate the first approach, let (β̂, Ω̂) denote the maximum
likelihood estimator and suppose we would like to compute the probabilities P r(y = 1|x),
P r(y = 2|x), ..., P r(y = J|x), for a particular value of x. We have,
Z
Pr(y = j|x) = φ(ε; 0, Ω̂)dε (217)
ε:β̂j0 x+εj ≥β̂k0 x+εk ∀k6=j

Z
= 1{β̂j0 x + εj ≥ β̂k0 x + εk ∀k 6= j}φ(ε; 0, Ω̂)dε
ε

R
X
≈ 1
R
1{β̂j0 x + ε̃rj ≥ β̂k0 x + ε̃rk ∀k 6= j}
r=1

where ε˜r are draws from the N (0, Ω̂) distribution, again relying on the law of large numbers
to deliver an approximation.
This approach will approximate Pr(y = j|x), but will deliver an approximation that is not
continuous in the parameters (β̂, Ω̂). This, in turn, means that the delta method cannot be
used to perform inferences on the substantive effects. If inferences are desired, the bootstrap
√ dist.
can be used (discussed further in subsection 10.2). Suppose that N (θ̂ − θ0 ) −→ N (0, V0 )
prob.
where V̂ −→ V0 . Consider a function C that is not necessarily continuous and differentiable.
Let {θ̃}Ss=1 be a sample from N (θ̂, V̂ ). We have that,
 
prob.
Pr q̂(C(θ̃), α/2) ≤ C(θ0 ) ≤ q̂(C(θ̃), 1 − α/2) −→ α (218)

as S → ∞, where q̂(x, α) represent empirical quantiles of the data x.


An alternative procedure for computing Pr(y = j|x) is to use the GHK simulator. Since
the GHK simulator is continuous and differentiable in the model parameters, the delta
method can be applied to perform inferences on the substantive effects.

68
6.6 Measures of Model Fit
For all these models, we do not simply have a censored version of the linear regression model,
so we can’t apply a McKelvey-Zavoina R-Squared. We can however consider a McFadden
R-squared, by including only a choice specific constant in the multinomial probit and multi-
nomial logit models. If the model we are computing the McKelvey R-Squared for does not
have a choice specific intercept, there is no guarantee that the R-squared will be positive
(this could be the case for the conditional logit model, or the multinomial probit model with
choice-specific intercepts).

6.7 Suggested Reading


6.7.1 Background

[1] Greene (2000)

[2] Kennedy (1992)

[3] Train (1992)

6.7.2 Examples

[1] Alvarez and Nagler (1995)

[2] Burden et al. (2014)

[3] Cox and McCubbins (1993), Chapter 7

[4] Kayser and Peress (2012)

7 Count Models
7.1 Poisson Regression
The Poisson distribution is given by,

λx e−λ
pX (x; λ) = Pr(X = x; λ) = (219)
x!
for x = 0, 1, .... The Poisson regression model is used to model a dependent variable that
can only take on non-negative integers as values. Note that,

69
∞ ∞ ∞ ∞
X X xλx e−λ X xλx e−λ −λ
X λx−1
E[X] = xpX (x) = = =e λ (220)
x=0 x=0
x! x=1
x! x=1
(x − 1)!

X λy
= e−λ λ = e−λ λeλ = λ
y=0
y!
0
The Poisson regression model specifies the mean to be eβ xn —the exponential transformation
here is used to force the mean to be positive. The Poisson regression model assumes that,

0 β 0 xn
(eβ xn )j e−e
Pr(yn = j|xn ; β) = (221)
j!
The log-likelihood is given by,

N 0 β 0 xn N
X (eβ xn )yn e−e X 0
l(β) = log = yn β 0 xn − eβ xn − log yn ! (222)
n=1
yn ! n=1

We can compute the marginal effect as,

0
E[yn |xn ] = eβ xn (223)

0

∂xnk
E[yn |xn ] = eβ xn βk (224)

We can thus interpret the coefficient βk as the effect of a one-unit change on the dependent
variable in percentage terms.
Consider the following property of the Poisson distribution,

∞ ∞ ∞
X X x(x − 1)λx e−λ X x(x − 1)λx e−λ
E[X(X − 1)] = x(x − 1)pX (x) = = (225)
x=0 x=0
x! x=2
x!

∞ ∞
−λ 2
X λx−2 −λ 2
X λy
=e λ =e λ = e−λ λ2 eλ = λ2
x=1
(x − 2)! y=0
y!

E[X(X − 1)] = E[X 2 ] − E[X] (226)

E[X 2 ] = E[X(X − 1)] + E[X] = λ2 + λ (227)

70
V ar(X) = E[X 2 ] − E[X]2 = λ2 + λ − λ2 = λ (228)

The Poisson distribution has a variance equal to its’ mean, a property that is unlikely to
hold.

7.2 Negative Binomial


The negative binomial model is designed to allow for over-dispersion of count data, or to
allow the variance to be greater than the mean. The negative binomial distribution is given
by,

α−1  x
Γ(α−1 + x) α−1

λ
pX (x; λ, α) = Pr(X = x; λ, α) = (229)
Γ(α−1 )Γ(x + 1) α−1 + λ −1
α +λ

for x = 0, 1, ... where λ > 0, α > 0, and Γ denotes the gamma function,
Z ∞
Γ(x) = ux−1 e−u du (230)
u=0

The negative binomial distribution has mean E[X] = λ and variance V ar(X) = λ(1 + αλ) >
λ. The parameter α measures the degree of over-dispersion (note that we cannot have
under-dispersion).
The negative binomial regression model has,

α−1  0 j
Γ(α−1 + j) α−1 e β xn

Pr(yn = j|xn ; β, α) = (231)
Γ(α−1 )Γ(j + 1) α−1 + eβ 0 xn α−1 + eβ 0 xn

and the log-likelihood is given by,

N α−1  0 yn
Γ(α−1 + yn ) α−1 e β xn
X 
l(β, α) = log (232)
n=1
Γ(α−1 )Γ(yn + 1) α−1 + eβ 0 xn α−1 + eβ 0 xn

Similarly to the Poisson regression model, marginal effects can be interpreted in terms of
percentage changes in the dependent variable.

71
7.3 Zero-Inflated Poisson Regression
An alternative count model is the zero-inflated Poisson regression. The zero-inflated model
presumes that the data-generating process for zero outcomes is different (this can be useful
if there are many zeros in the data). To derive the zero-inflated model, assume that there is
a π probability of observing a zero and a 1 − π probability of observing a Poisson random
variable. We have the following distribution for a zero-inflated Poisson random variable,

λx e−λ
pX (x; λ) = Pr(X = x; λ, π) = π1{x = 0} + (1 − π) (233)
x!
0
0 eγ zn
for x = 0, 1, .... We can specify λ as eβ xn and π and 1+eγ 0 zn
to obtain,

0 0 β 0 xn
e γ zn 1 (eβ xn )j e−e
Pr(yn = j|xn , zn ) = 1{j = 0} 0 + 0 (234)
1 + eγ zn 1 + eγ zn j!
Here, the notation allows for the parameters λ and π to depend on different covariates. We
allows for xn and zn to contain some (or all) of the same covariates.
We have that the mean and variance of a zero-inflated Poisson random variable is given
by,

E[X] = λ(1 − π) (235)

V ar(X) = λ(1 − π)(1 + λπ) > E[X] (236)

It follows that,
0
e β xn
E[yn |xn , zn ] = (237)
1 + e γ 0 zn
There is no simple way to interpret marginal effects in a zero-inflated Poisson regression
model. Instead, let us write xn = (xn1 , xn2 ) and zn = (xn1 , xn3 ) where xn1 , xn2 , and xn3 are
distinct covariates. We have,
0 0
eβ1 xn1 +β2 xn2
E[yn |xn1 , xn2 , xn3 ] = 0 0 (238)
1 + eγ1 xn1 +γ2 xn3

0 0 0 0 0 0 0 0
∂ (1 + eγ1 xn1 +γ2 xn3 )eβ1 xn1 +β2 xn2 β1 − eβ1 xn1 +β2 xn2 eγ1 xn1 +γ2 xn3 γ1
∂xn1
E[yn |xn1 , xn2 , xn3 ] = 0 0 (239)
(1 + eγ1 xn1 +γ2 xn3 )2

72
0 0 0 0
eβ1 xn1 +β2 xn2 β1 + (β1 − γ1 )eγ1 xn1 +γ2 xn3
= 0 0 0 0
(1 + eγ1 xn1 +γ2 xn3 ) (1 + eγ1 xn1 +γ2 xn3 )
0 0
∂ eβ1 xn1 +β2 xn2
∂xn2
E[yn |xn1 , xn2 , xn3 ] = β2 0 0 = β2 E[yn |xn1 , xn2 , xn3 ] (240)
1 + eγ1 xn1 +γ2 xn3
0 0
∂ γ2 eβ1 xn1 +β2 xn2
∂xn3
E[yn |xn1 , xn2 , xn3 ] =− 0 0 (241)
(1 + eγ1 xn1 +γ2 xn3 )2
Covariates that only appear in the specification for λ can be interpreted as percentage
changes on the dependent variable. For the other formulas, the marginal effects depend on
the other variables. We can interpret γ as the marginal effect of the IVs on the number
of “non-Poisson zeros” and we can interpret β as the marginal effect of the IVs on the
“Poisson expected value”, but this type of interpretation is silly. Instead, we can calculate
0
eβ xn
E[yn |xn , zn ] = 1+e γ 0 zn under difference scenarios for (xn , zn ). In principle, we can calculate

the probability of observing any outcome. For example,


0
eγ zn 1 0
−eβ xn
Pr(yn = 0|xn , zn ) = 0 + 0 e (242)
1 + e γ zn 1 + e γ zn
Using a similar approach, we can derive a zero-inflated version of the negative binomial
model. The zero-inflated negative binomial model has the same expected value as the zero-
inflated Poisson, but has a different variance.

7.4 Semi-parametric Analysis of the Count Regression Model


Consider the MLE for the Poisson regression model,

N N
0
X X
β̂ = arg max 1
N
ψ(xn , yn ; β) = arg max 1
N
yn β 0 xn − eβ xn − log yn ! (243)
β n=1 β n=1

0
β 0 xn j −eβ xn
Consistency of β̂ under the assumption that Pr(yn = j|xn ; β) = (e ) j!e follows from
the consistency of MLEs. Here, we will establish that β̂ is consistent under a weaker set of
0
conditions—we assume that E[yn |xn ; β] = eβ xn . We will treat β̂ as an M-estimator. Recall
that the identification condition for an M-estimator is β0 = arg max E[ψ(xn , yn ; β)]. In this
β
0 0
case, we have ψ(xn , yn ; β) = yn log(eβ xn ) − eβ xn − log yn !. We have,

73
0
E[ψ(xn , yn ; β)] = E[yn β 0 xn − eβ xn − log yn !] (244)

We use first order conditions to β 0 , we obtain,

0 0 0 0 0
0 = E[(yn −eβ xn )xn ] = E[E[(yn −eβ xn )xn |xn ]] = E[xn (E[yn |xn ]−eβ xn )] = E[xn (eβ0 xn −eβ xn )]
(245)
It is clear that β = β0 is a solution to the first order conditions. To show that it is the only
solution, consider the Taylor expansion,

0 0 0 0 0
eβ xn = eβ0 xn + eβ̄(xn ) xn x0n (β − β0 ) = eβ0 xn + eβ̄(xn ) xn x0n (β − β0 ) (246)

We have,

0
E[xn x0n eβ̄(xn ) xn ](β − β0 ) = 0 (247)
0
There will be a unique solution if E[xn x0n eβ̄(xn ) xn ] has full rank. We will demonstrate that
0
E[xn x0n eβ̄(xn ) xn ] is positive definite (and hence has full rank). Consider z 6= 0. We have,

0 0 0
z 0 E[xn x0n eβ̄(xn ) xn ]z = E[z 0 xn x0n zeβ̄(xn ) xn ] = E[(z 0 xn )2 eβ̄(xn ) xn ] (248)
0 0
We have eβ̄(xn ) xn > 0 and (z 0 xn )2 ≥ 0. Strict positivity of E[(z 0 xn )2 eβ̄(xn ) xn ] will follow if
(z 0 xn )2 > 0 with positive probability. Suppose that z 0 xn = 0 for z 6= 0 with probability 1.
Then one of xn is a perfect linear combination of the other covariates with probability 1. If we
0
assume that this is not the case, then β0 is the unique minimizer of E[yn β 0 xn −eβ xn −log yn !].
Using this and certain technical conditions, we can establish that β̂ is consistent.
Applying this result, the Poisson regression estimator is consistent as long as the con-
ditional mean is correctly specified and we can compute the asymptotic variance using the
sandwich estimator—or robust standard errors—were the standard errors are robust to zero-
inflation, over/under-dispersion, or any other deviation from the Poisson distribution for the
dependent variable. This result allows to avoid modeling the exact distribution of the vari-
able, which may make calculating marginal effects much more difficult. With the Poisson
regression model with robust standard errors, we can compute marginal effects in terms of
percentage changes on the dependent variable.
A similar result holds for the negative binomial regression model. The limiting objective
function is given by,

74
" α−1  0 yn #
Γ(α−1 + yn ) α−1 e β xn

E log (249)
Γ(α−1 )Γ(yn + 1) α−1 + eβ 0 xn α−1 + eβ 0 xn

Ignoring terms that do not involve β, the limiting maximizer will maximize,
h  0
i
E yn β 0 xn − (yn + α−1 ) log α−1 + eβ xn (250)

Taking first-order conditions, we have,


E\left[ y_n x_n - (y_n + α^{-1}) \frac{e^{β'x_n}}{α^{-1} + e^{β'x_n}} x_n \right] = E\left[ \left( e^{β_0'x_n} - (e^{β_0'x_n} + α^{-1}) \frac{e^{β'x_n}}{α^{-1} + e^{β'x_n}} \right) x_n \right]   (251)

= E\left[ α^{-1} \frac{e^{β_0'x_n} - e^{β'x_n}}{e^{β'x_n} + α^{-1}} x_n \right] = 0
It is clear that for all α−1 , β = β0 is a solution. To demonstrate that it is the only solution,
we take a Taylor expansion,

e^{β'x_n} = e^{β_0'x_n} + e^{β̄(x_n)'x_n} x_n'(β - β_0)   (252)

We have,

E\left[ \frac{e^{β̄(x_n)'x_n} x_n x_n'}{e^{β'x_n} + α^{-1}} \right](β - β_0) = 0   (253)

Consider,

z' E\left[ \frac{e^{β̄(x_n)'x_n} x_n x_n'}{e^{β'x_n} + α^{-1}} \right] z = E\left[ \frac{e^{β̄(x_n)'x_n} (z'x_n)^2}{e^{β'x_n} + α^{-1}} \right] ≥ 0   (254)

with strict positivity if (z'x_n)^2 > 0 with positive probability. This is guaranteed to be the case under the same no-perfect-collinearity condition as before, implying that β = β_0 is the only solution to the first-order conditions. Hence, the negative binomial regression estimator of β_0 will be consistent as long as the conditional mean is correctly specified.

7.5 Model Fit for Parametric Models


If we estimate a Poisson regression model, we can test the fit of the model using the following
goodness of fit statistic,

G = \sum_{n=1}^{N} \frac{(y_n - μ̂_n)^2}{μ̂_n}   (255)
where μ̂_n = e^{β̂'x_n}. Under the null hypothesis that y_n | x_n ∼ Poisson(e^{β'x_n}), we have that G \xrightarrow{dist.} χ^2_{N-K}, where K is the number of estimated parameters. In principle, if we fail to
reject the Poisson model, we would be justified in using the Poisson model without robust
standard errors. Otherwise, we would have to consider robust standard errors, the negative
binomial model, the zero-inflated Poisson model, etc.
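A minimal sketch of this test in Python follows; the fitted means are assumed to come from a Poisson fit such as the one sketched earlier, and the function name is illustrative.

import numpy as np
from scipy.stats import chi2

def poisson_gof(y, mu_hat, K):
    # Pearson goodness-of-fit statistic for a Poisson regression
    # y: observed counts, mu_hat: fitted means exp(X beta_hat), K: number of estimated parameters
    G = np.sum((y - mu_hat) ** 2 / mu_hat)
    df = len(y) - K
    p_value = chi2.sf(G, df)   # small p-values lead us to reject the Poisson model
    return G, df, p_value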

7.6 Suggested Reading


7.6.1 Background

[1] Cameron and Trivedi (2001)

[2] Greene (2000)

[3] Kennedy (1992)

[4] King (1989)

7.6.2 Examples

[1] Nepal, Bohara and Gawande (2011)

[2] Weghorst and Lindberg (2013)

[3] Wilson and Piazza (2013)

8 Censoring, Selection, and Truncation


8.1 The Tobit Model (Censored Regression)
Consider the latent variable model given by yn∗ = β 0 xn + εn where εn ∼ N (0, σ 2 ). Suppose
that we observe a censored version of yn∗ —we observe,
y_n = \begin{cases} y_n^*, & y_n^* ≥ 0 \\ 0, & y_n^* < 0 \end{cases}   (256)

The random variable yn is “mixed”—it is neither discrete nor continuous. This censored
regression model is called the Tobit model.
We can determine that the discrete part is characterized by,

\Pr(y_n^* < 0 | x_n) = \Pr(β'x_n + ε_n < 0 | x_n) = \Pr(ε_n < -β'x_n | x_n) = Φ\left( \frac{-β'x_n}{σ} \right)   (257)

and the continuous part is characterized by,

f(y_n^* | x_n) = \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}(y_n^* - β'x_n)^2/σ^2}   (258)

We can combine these two expressions to form the log-likelihood,

l(β, σ^2) = \sum_{n=1}^{N} 1\{y_n > 0\} \log\left( \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}(y_n - β'x_n)^2/σ^2} \right) + 1\{y_n = 0\} \log Φ\left( \frac{-β'x_n}{σ} \right)   (259)

The above does not qualify as a full derivation of the Tobit model because we simply assumed, without proof, that we can combine the discrete part and the continuous part in this way.
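As a sketch, the log-likelihood in equation (259) can be coded directly and maximized numerically. A minimal Python version follows, with illustrative names; the variance is parameterized as log σ so the optimization is unconstrained.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def tobit_negloglik(params, y, X):
    # Negative log-likelihood for the Tobit model censored at zero
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    uncensored = y > 0
    ll = np.sum(norm.logpdf(y[uncensored], loc=xb[uncensored], scale=sigma))
    ll += np.sum(norm.logcdf(-xb[~uncensored] / sigma))
    return -ll

def fit_tobit(y, X):
    start = np.append(np.zeros(X.shape[1]), 0.0)
    res = minimize(tobit_negloglik, start, args=(y, X), method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])   # beta_hat, sigma_hat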
Suppose that X ∼ N(µ, σ^2); conditional on a < X < b, X follows a truncated normal (TN) distribution. One can show that,

E[X | a < X < b] = µ - σ \frac{ φ\left(\frac{b-µ}{σ}\right) - φ\left(\frac{a-µ}{σ}\right) }{ Φ\left(\frac{b-µ}{σ}\right) - Φ\left(\frac{a-µ}{σ}\right) }   (260)

When a = 0 and b = ∞, we obtain,


E[X | X > 0] = µ + σ \frac{ φ(µ/σ) }{ Φ(µ/σ) }   (261)

Note that,

E[yn∗ |xn ] = β 0 xn (262)

E[yn |xn ] = E[yn |xn , yn = 0] Pr(yn = 0|xn ) + E[yn |xn , yn > 0] Pr(yn > 0|xn ) (263)

= E[y_n^* | x_n, y_n^* > 0] \Pr(y_n^* > 0 | x_n) = \left[ β'x_n + σ \frac{ φ(β'x_n/σ) }{ Φ(β'x_n/σ) } \right] Φ\left( \frac{β'x_n}{σ} \right)

= Φ\left( \frac{β'x_n}{σ} \right) β'x_n + σ φ\left( \frac{β'x_n}{σ} \right)

The estimator of β_0 directly reveals the marginal effect of an independent variable x_k on the latent (uncensored) dependent variable. The marginal effect of an independent variable on the (censored) dependent variable can be derived from equation (263).
A generalization of the Tobit model allows for censoring of the dependent variable on
both sides at arbitrary points, a and b. In the more general case, we observe

y_n = \begin{cases} a, & y_n^* < a \\ y_n^*, & a ≤ y_n^* ≤ b \\ b, & y_n^* > b \end{cases}   (264)

The random variable yn is again mixed. We have,

\Pr(y_n = a | x_n) = \Pr(y_n^* < a | x_n) = \Pr(β'x_n + ε_n < a | x_n) = \Pr(ε_n < a - β'x_n | x_n) = Φ\left( \frac{a - β'x_n}{σ} \right)   (265)

\Pr(y_n = b | x_n) = \Pr(y_n^* > b | x_n) = \Pr(β'x_n + ε_n > b | x_n) = \Pr(ε_n > b - β'x_n | x_n) = 1 - Φ\left( \frac{b - β'x_n}{σ} \right)   (266)

f(y_n^* | x_n) = \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}(y_n^* - β'x_n)^2/σ^2}   (267)

We can combine these expressions to form the log-likelihood,

l(β, σ^2) = \sum_{n=1}^{N} 1\{a ≤ y_n ≤ b\} \log\left( \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}(y_n - β'x_n)^2/σ^2} \right) + 1\{y_n = a\} \log Φ\left( \frac{a - β'x_n}{σ} \right)   (268)

+ 1\{y_n = b\} \log\left( 1 - Φ\left( \frac{b - β'x_n}{σ} \right) \right)
In the general case (with arbitrary censoring from above and below), we have

E[yn∗ |xn ] = β 0 xn (269)

E[y_n | x_n] = a \Pr(y_n = a | x_n) + b \Pr(y_n = b | x_n) + \Pr(a < y_n < b | x_n) E[y_n | a < y_n < b, x_n]   (270)

= a Φ\left( \frac{a - β'x_n}{σ} \right) + b \left( 1 - Φ\left( \frac{b - β'x_n}{σ} \right) \right) + \left( Φ\left( \frac{b - β'x_n}{σ} \right) - Φ\left( \frac{a - β'x_n}{σ} \right) \right) \left[ β'x_n - σ \frac{ φ\left(\frac{b - β'x_n}{σ}\right) - φ\left(\frac{a - β'x_n}{σ}\right) }{ Φ\left(\frac{b - β'x_n}{σ}\right) - Φ\left(\frac{a - β'x_n}{σ}\right) } \right]

Suppose that yn∗ is the number of hours people desire to work (which may depend on
their wage, number of children, etc.). If people’s desired hours are less than zero, we observe
them working zero hours. We let yn denote the number of hours people work. If we are
interested in the effect of an independent variable on the number of desired hours, we could
simply look at β_k. If we are interested in the effect of an independent variable on the number of hours actually worked, we would have to consider E[y_n | x_n] = Φ(β'x_n/σ)\, β'x_n + σ φ(β'x_n/σ).
We have,

\frac{∂}{∂x_{nk}} E[y_n | x_n] = Φ\left( \frac{β'x_n}{σ} \right) β_k   (271)

Now, suppose that we were to apply OLS to yn . We would obtain,

β̂_{OLS} \xrightarrow{prob.} E[x_n x_n']^{-1} E[x_n y_n]   (272)

E[x_n y_n] = E[x_n E[y_n | x_n]]   (273)

E[y_n | x_n] = Φ\left( \frac{β_0'x_n}{σ_0} \right) β_0'x_n + σ_0 φ\left( \frac{β_0'x_n}{σ_0} \right)   (274)

β̂_{OLS} \xrightarrow{prob.} E[x_n x_n']^{-1} E\left[ Φ\left( \frac{β_0'x_n}{σ_0} \right) x_n x_n' β_0 + x_n σ_0 φ\left( \frac{β_0'x_n}{σ_0} \right) \right] ≠ β_0   (275)

The OLS estimator of β0 would be a biased estimator for the marginal effect of the inde-
pendent variables on the desired number of work hours. It could be considered a reasonable
estimator for the marginal effect of the independent variables on the actual number of work
hours.
When considering whether to apply a tobit model, it is not enough to simply examine
whether there is a point mass at zero. Consider first a case where the data collection method
censors the dependent variable. Suppose, for example, that we are interested in the effect
of education on income, but our dataset sets all incomes above $250,000 equal to $250,000.
If the data is censored in this way, we would almost certainly be interested in estimating a
tobit model since we are interested in the effect of education on income, not of education on
censored income.
Instead, consider the effect of ideological location on PAC contributions to the party leadership (Jenkins and Monroe, 2012). Here, there are multiple possibilities. We could imagine that the latent intentions of the party leadership involve punishing members with negative contributions, but this is not actually feasible. In this case, a Tobit model could be used to assess the effect of ideological positions on intended PAC contributions. Alternatively, if we are not interested in intentions or if this model of intentions does not make complete sense, but an actual contribution model makes sense, a Tobit model would not be appropriate.
We could also consider wages, where we observe a point mass at the minimum wage.
We could attempt to explain wages by assuming that each individual has a specified labor
productivity and that individuals are paid in proportion to their labor productivity. In this
case, one may think that since an employer cannot pay less than the minimum wage, the
process would be censored at the minimum wage. In this case, it would not make sense to apply a Tobit because, according to this theory, individuals who are less productive than the minimum wage should simply not be hired (which suggests truncation rather than censoring, considered below).

8.2 Truncated Regression


Consider again the model yn = β 0 xn + εn where εn ∼ N (0, σ 2 ), but suppose that we observe
yn if yn > 0 and observe nothing otherwise. In particular, we do not observe covariates when
yn < 0. This is the truncated regression model. The density of the data is given by the

truncated normal distribution,

f(y_n | x_n) = \frac{ \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}(y_n - β'x_n)^2/σ^2} }{ Φ(β'x_n/σ) }   (276)
We have a log-likelihood of,

l(β, σ) = \sum_{n=1}^{N} \log \frac{ \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}(y_n - β'x_n)^2/σ^2} }{ Φ(β'x_n/σ) }   (277)

= \sum_{n=1}^{N} \left[ - \log σ - \tfrac{1}{2} \log(2π) - \tfrac{1}{2}(y_n - β'x_n)^2/σ^2 - \log Φ(β'x_n/σ) \right]
n=1

This type of model could be used if we observed a sample of hours worked (as before), but
we collected the data for a sample of workers. In this case, we would not observe individuals
working zero hours because they prefer to work less than zero hours. We could also let w_n denote an individual's wage, w the minimum wage, and y_n = w_n - w the amount above the minimum wage an individual makes. In this case, we would never observe wages for those whose productivity is below the minimum wage and who hence are not employed.
If we wanted to calculate the marginal effect of an IV on hours of work desired, we would
use,

\frac{∂}{∂x_{nk}} E[y_n | x_n] = β_k   (278)

If we wanted to calculate the marginal effect of an IV on hours of work desired among the employed, we would calculate,

\frac{∂}{∂x_{nk}} E[y_n | y_n > 0, x_n] = \frac{∂}{∂x_{nk}} \left[ β'x_n + σ \frac{ φ(β'x_n/σ) }{ Φ(β'x_n/σ) } \right]   (279)

= β_k \left[ 1 + \frac{ Φ(β'x_n/σ)\, φ'(β'x_n/σ) - φ(β'x_n/σ)^2 }{ Φ(β'x_n/σ)^2 } \right]

8.3 The Heckman Selection Model


Consider the model yn = β 0 xn + εn , but suppose that we only observe yn if wn = 1 where
wn = 1 ⇔ wn∗ = γ 0 zn + ηn ≥ 0 and where,

(ε_n, η_n) ∼ N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} σ_ε^2 & σ_{εη} \\ σ_{εη} & σ_η^2 \end{bmatrix} \right)   (280)
For identification purposes, we normalize ση2 = 1. We therefore simply write σε2 as σ 2 and
we write σεη as σρ. We can observe (yn , 1) or (., 0). We have,10
f(y_n, w_n = 1) = \int_{(ε_n, η_n):\, γ'z_n + η_n ≥ 0,\; y_n = β'x_n + ε_n} f(ε_n, η_n)\, d(ε_n, η_n)   (281)

= \frac{ φ\left( \frac{y_n - β'x_n}{σ} \right) }{ σ Φ(γ'z_n) } Φ\left( \frac{ ρ \frac{y_n - β'x_n}{σ} + γ'z_n }{ \sqrt{1 - ρ^2} } \right)
\Pr(w_n = 0) = \int_{η_n:\, γ'z_n + η_n < 0} f(η_n)\, dη_n = 1 - Φ(γ'z_n)   (282)

The likelihood for the Heckman selection model is given by,

l(β, γ, σ^2, ρ) = \sum_{n=1}^{N} w_n \log\left[ \frac{ φ\left( \frac{y_n - β'x_n}{σ} \right) }{ σ Φ(γ'z_n) } Φ\left( \frac{ ρ \frac{y_n - β'x_n}{σ} + γ'z_n }{ \sqrt{1 - ρ^2} } \right) \right] + (1 - w_n) \log(1 - Φ(γ'z_n))   (283)

To see an example, suppose that we are interested in the effect of education on wages.
We will only observe wages among the employed, so we let yn be wages and wn = 1 indicate
employment. We can show that,11

E[y_n | x_n, z_n, w_n = 1] = β'x_n + ρσ \frac{ φ(-γ'z_n) }{ 1 - Φ(-γ'z_n) } = β'x_n + ρσ λ(-γ'z_n)   (284)

Here, λ(u) = \frac{φ(u)}{1 - Φ(u)} is called the inverse Mills ratio. Using this, we can derive the large-sample bias of OLS when applied to sample-selected data. We have,

β̂ \xrightarrow{prob.} E[x_n x_n']^{-1} E[x_n y_n | w_n = 1] = E[x_n x_n']^{-1} E[x_n E[y_n | x_n, z_n, w_n = 1]]   (285)

= E[x_n x_n']^{-1} E[x_n (β'x_n + ρσ λ(-γ'z_n))] = β + ρσ E[x_n x_n']^{-1} E[x_n λ(-γ'z_n)]

The term ρσ E[x_n x_n']^{-1} E[x_n λ(-γ'z_n)] indicates the bias of OLS when applied to sample-selected data. It is zero when ρ = 0 and will generally be non-zero otherwise.

10 See Appendix A.3 for a derivation.
11 See Appendix A.4 for a derivation.
Returning to the example, suppose that intelligence is unobserved. It is likely to enter
both equations—more intelligent individuals are more likely to be employed (holding con-
stant the covariates) and more likely to earn higher wages when employed (holding constant
the covariates). This is likely to lead to bias in the coefficients in the wage equation.
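Equation (284) also suggests the familiar two-step (Heckit) estimator: estimate γ by probit on the selection indicator, construct the inverse Mills ratio, and include it as an extra regressor in the outcome equation on the selected sample. A minimal sketch follows, with illustrative names; the second-step OLS standard errors would need the usual two-step correction, which the sketch omits.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def probit_mle(w, Z):
    # Probit MLE for the selection equation Pr(w = 1 | z) = Phi(gamma'z)
    def negll(g):
        zg = Z @ g
        return -np.sum(w * norm.logcdf(zg) + (1 - w) * norm.logcdf(-zg))
    return minimize(negll, np.zeros(Z.shape[1]), method="BFGS").x

def heckman_two_step(y, X, w, Z):
    # y: outcome observed only where w == 1; X, Z: outcome and selection covariates
    gamma = probit_mle(w, Z)
    zg = Z @ gamma
    imr = norm.pdf(zg) / norm.cdf(zg)          # lambda(-gamma'z) = phi(gamma'z)/Phi(gamma'z)
    sel = w == 1
    Xs = np.column_stack([X[sel], imr[sel]])   # augment the outcome regressors with the IMR
    coef, *_ = np.linalg.lstsq(Xs, y[sel], rcond=None)
    return coef[:-1], coef[-1], gamma          # beta_hat, estimate of rho*sigma, gamma_hat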
There is also a variation of the Heckman selection model where the DV is binary. The outcome is modeled using a latent variable, y_n^* = β'x_n + ε_n, where we observe y_n = 1\{y_n^* ≥ 0\}. The selection equation is also governed by a latent variable, w_n^* = γ'z_n + η_n, where we observe w_n = 1\{w_n^* ≥ 0\}. We assume that (ε_n, η_n) are bivariate normal,

(ε_n, η_n) ∼ N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & ρ \\ ρ & 1 \end{bmatrix} \right)   (286)

where the variances of ε_n and η_n are normalized for identification purposes. We only observe y_n if w_n = 1, so there are three things we can observe: (y_n, w_n) = (1, 1), (y_n, w_n) = (0, 1), and (y_n, w_n) = (., 0).
We can easily calculate,

Pr(yn = ., wn = 0|xn , zn ; β, γ, ρ) = Pr(wn = 0) = 1 − Φ(γ 0 zn ) (287)

The expressions for Pr(yn = 0, wn = 1|xn , zn ; β, γ, ρ) and Pr(yn = 1, wn = 1|xn , zn ; β, γ, ρ)


are more complicated as they involve double integrals. We have,

\Pr(y_n = 1, w_n = 1 | x_n, z_n; β, γ, ρ) = \Pr(y_n^* ≥ 0, w_n^* ≥ 0 | x_n, z_n; β, γ, ρ)   (288)

= \Pr(β'x_n + ε_n ≥ 0, γ'z_n + η_n ≥ 0 | x_n, z_n; β, γ, ρ) = \Pr(ε_n ≥ -β'x_n, η_n ≥ -γ'z_n | x_n, z_n; β, γ, ρ)

= \int_{-β'x_n}^{∞} \int_{-γ'z_n}^{∞} f\left( (ε, η); \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & ρ \\ ρ & 1 \end{bmatrix} \right) d(ε, η)
We can similarly demonstrate that,

\Pr(y_n = 0, w_n = 1 | x_n, z_n; β, γ, ρ) = \int_{-∞}^{-β'x_n} \int_{-γ'z_n}^{∞} f\left( (ε, η); \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & ρ \\ ρ & 1 \end{bmatrix} \right) d(ε, η)   (289)
Using these expressions, we can form the log-likelihood as,

l(β, γ, ρ) = \sum_{n=1}^{N} (1 - w_n) \log(1 - Φ(γ'z_n))   (290)

+ w_n y_n \log\left( \int_{-β'x_n}^{∞} \int_{-γ'z_n}^{∞} f\left( (ε, η); \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & ρ \\ ρ & 1 \end{bmatrix} \right) d(ε, η) \right)

+ w_n (1 - y_n) \log\left( \int_{-∞}^{-β'x_n} \int_{-γ'z_n}^{∞} f\left( (ε, η); \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & ρ \\ ρ & 1 \end{bmatrix} \right) d(ε, η) \right)
Unlike the expressions for the Heckman selection model with a continuous dependent variable, the integrals in the likelihood for the Heckman selection model with a binary dependent variable do not reduce to simpler closed-form expressions. We thus need a way of computing the integrals numerically. The GHK simulator (also used for estimating multinomial probit models) provides one approach.
There is another variation (called a switching model) where we observe one of two equa-
tions depending on the outcome of the selection process.

8.4 Applications
Application 8.1 (The Effect of Election Proximity on Punitive Sentences).

Application 8.2 (The Effect of Ideological Location on Campaign Contributions from Party
Leadership).

Application 8.3 (The Politics of EU Enlargement).

Plumper, Schneider and Troeger (2006) describe a two-step process by which countries
can join the European Union (EU). Countries must first apply and the EU next decides
whether to allow the countries to join the EU. One hypothesis is that countries that are more democratic are more likely to be allowed to join the EU. One could imagine testing this
using a logit model where being admitted to the EU is the dependent variable, which is mea-
sured on a sample of countries that applied to the EU. This approach could potentially yield

a biased and inconsistent estimate of the effect of level of democracy on EU admittance (if we are interested in whether level of democracy would increase EU acceptance in a world in which every country applied) because of selection bias (which, in this case, might be more accurately called selection inconsistency). It would yield a consistent estimate of EU acceptance conditional on applying, but this is arguably less theoretically relevant.
omitted variable present in both stages—for example, legal sophistication. Countries with a
more sophisticated legal profession may be more inclined to apply. They may be also more
effective in placating the EU to the point of being admitted. If democracies have on average
more legal sophistication, the effect of level of democracy we estimate using a logit model
might pick up the effect of legal sophistication.
One could consider two alternative analyses. First, one could code the dependent variable
based on whether the country was admitted to the EU, coding both countries that did
not apply and countries that were denied admittance as zeros. In this case, applying a
logit model would yield a consistent estimate of the effect of the level of democracy on
admittance to the EU. This analysis would not tell us whether this effect occurred because
democracies were differentially likely to apply, or whether democracies would have been
differentially likely to be accepted in a world in which every country applied. Second, one
could estimate just the selection stage—that is, we could use a logit model to consistently estimate the effect of level of democracy on applying for EU accession. Suppose we found that (a) democratic countries were more likely to be admitted to the EU, (b) democratic countries were more likely to apply, and (c) democratic countries were more likely to be admitted conditional on applying. It might be tempting to conclude that the mechanism by which democratic countries are disproportionately admitted to the EU is a combination of democratic countries being more likely to apply and the EU being more favorable towards democratic countries. This inference, however, might not be justified because there could be unobservables correlated with level of democracy that increase the likelihood that a country applies to the EU and increase the favorability of the EU towards the country (i.e., selection inconsistency is present).

Application 8.4 (The Effect of Distributive Spending on Political Participation).

Chen (2013) examines the effects of receiving disaster aid on voter turnout. He hypothe-
sizes that co-partisans of the president will increase their turnout and non-co-partisans will
decrease their turnout when having their aid requests met. In a probit analysis on the sample
of disaster relief applicants, he finds evidence for this claim. Chen considers the possibility
that those who apply for aid are different than those who do not apply for aid and worries

about “potentially limiting the external validity of the article’s findings”. Chen proposes
a Heckman selection model where the first stage is whether an individual applied for aid
and the second stage captures the effect of receiving aid on turnout. Chen uses a number of
exclusion restrictions—median income, home value, gender, race, age, and wind speed are all
included in the selection equation, but excluded from the outcome equation. Of these, only
wind speed seems potentially valid. Beyond this, there is the interpretation of the selection
model. It will provide the effect of receiving aid on turnout if everybody applied for aid.
This is probably not an interesting counterfactual. A partisan member of the government would likely be more interested in the effect of receiving aid on turnout among those who applied (to determine whether politicizing aid is effective in mobilizing voters).

Application 8.5 (International Treaties and Current Account Restrictions).

Another example of a potential selection problem that has seen some coverage in the
literature is whether international treaties affect the behavior of countries. Simmons (2000)
argued that countries that sign Article VII are less likely to impose current account restric-
tions. She determines this using a probit model where being an Article VII signatory is
the main independent variable and current account restrictions is the dependent variable.
Stein (2005) argues that treaties do not constrain but instead screen. In particular, Stein
argues that countries who were otherwise more likely to refrain from placing current account
restrictions are exactly those countries who are more likely to sign. She tests this using a
switching model and finds that (i) the unobservables in the selection and outcome equations
are negatively correlated and (ii) that once this is taken into account, countries that sign
treaties are no less likely to impose current account restrictions. Von Stein’s model however
does not justify her exclusion restrictions.
It is worthwhile to comment on the difference between selection models and endogeneity.
Consider the effect of education on wages. More intelligent individuals are likely to spend
more time in school and also likely to receive higher wages. Failing to control for intelligence
when estimating the effect of education on wages may lead to an upward bias on the education
coefficient. We may instead conceive of education as a dummy variable, indicating whether
an individual chooses to attend college. We may even be tempted to say that individuals
“select” into whether to attend college. The difference between this and a Heckman selection
model is that while we observe wages for individuals who choose not to attend college, the
Heckman model assumes that we do not observe the dependent variable for non-selected
individuals. The example from Stein (2005) above somewhat confuses this distinction—in a
switching model, we observe the dependent variable for selected and un-selected observations.

In fact, we could have applied an IV estimator to Stein’s application. In a typical IV setup,
we would be assuming that an endogenous IV shifts the mean level of the DV. In a switching model, we have an entirely different equation for selected and unselected individuals. Chen (2013) presents an example where both selection and reverse causality could be at play: individuals may select into applying for aid, and the incumbent government may choose to reward individuals it believes are likely to turn out.

8.5 Suggested Reading


8.5.1 Background

[1] Greene (2000)

[2] Kennedy (1992)

8.5.2 Examples

[1] Chen (2013)

[2] Huber and Gordon (2004)

[3] Jenkins and Monroe (2012)

[4] Plumper, Schneider and Troeger (2006)

[5] Simmons (2000)

[6] Stein (2005)

9 Duration Models
Duration data refers to data that records the length of time until an event happens. Examples
could be the length of time before a cabinet government dissolves, the length of time between
civil wars, or the length of time an individual is unemployed. Duration data can be treated as
discrete or continuous. Duration data cannot take on negative values. Beyond this, duration
data is often censored—some individuals may still be unemployed when our data set ends
or cabinet governments may be dissolved because of a mandatory election. Duration data
consists of a length of time (discrete or continuous) along with a variable indicating whether
an observation is censored.

9.1 Survival Functions and Hazard Rates
Consider a continuous (uncensored) duration Tn∗ . We can model this duration using the CDF,
F (t) = Pr(Tn∗ ≤ t). We can model it using the survival function S(t) = 1−F (t) = Pr(Tn∗ ≥ t).
We can use the density function f (t). Since S(t) = 1 − F (t), we have that S 0 (t) = −f (t).
Finally, we can use the hazard rate,

h(t) = \lim_{δ→0} \frac{ \Pr(t ≤ T_n^* ≤ t + δ \mid T_n^* ≥ t) }{ δ } = \lim_{δ→0} \frac{ \Pr(t ≤ T_n^* ≤ t + δ,\, T_n^* ≥ t) }{ δ \Pr(T_n^* ≥ t) }   (291)

= \lim_{δ→0} \frac{ \Pr(t ≤ T_n^* ≤ t + δ) }{ δ \Pr(T_n^* ≥ t) } = \lim_{δ→0} \frac{ F(t + δ) - F(t) }{ δ S(t) } = \frac{F'(t)}{S(t)} = \frac{f(t)}{S(t)} = - \frac{S'(t)}{S(t)}
Define the integrated hazard to be H(t) = - \log S(t). This implies that S(t) = e^{-H(t)}. Differentiating implies that S'(t) = -e^{-H(t)} H'(t), or S'(t) = -S(t)H'(t), which demonstrates that H'(t) = h(t). From this, we obtain H(t) = \int_0^t h(u)\, du.
Let D_n = 1 indicate that the duration is censored, let T_n denote the observed duration, and let X_n be a vector of covariates. Let τ_n be the censoring point for observation n. We have T_n = (1 - D_n)T_n^* + D_n τ_n.

9.2 Parametric Models


If we assume a constant hazard rate, we have,

h(t) = \frac{f(t)}{S(t)} = - \frac{S'(t)}{S(t)} = λ   (292)

This yields the differential equation,

S(t) = - \frac{1}{λ} S'(t)   (293)

which has solution S(t) = Ke−λt . The condition that S(0) = 1 implies that S(t) = e−λt which
is the exponential model. It corresponds to an exponential distribution for the durations.
We have that, E[Tn∗ ] = λ−1 . Note that F (t) = 1 − e−λt and f (t) = λe−λt .
Specify λ = e^{-β'X_n}. For any D_n = 0, we have that the duration is characterized by the density f(T_n) = e^{-β'X_n} e^{-e^{-β'X_n} T_n}, and when D_n = 1, we have that the probability of observing a censored observation is given by S(T_n) = e^{-e^{-β'X_n} T_n}. We have the following log-likelihood for the exponential duration model,

l(β) = \sum_{n=1}^{N} (1 - D_n) \log\left( e^{-β'X_n} e^{-e^{-β'X_n} T_n} \right) + D_n \log\left( e^{-e^{-β'X_n} T_n} \right)   (294)

= \sum_{n=1}^{N} \left[ -(1 - D_n) β'X_n - e^{-β'X_n} T_n \right]

Notice that E[T_n^* | X_n] = (e^{-β'X_n})^{-1} = e^{β'X_n}, which allows us to interpret the model in terms
of percentage changes in the expected value of the uncensored duration (an approximation
which is based on small changes in the independent variable). The exponential model can
be too restrictive however.
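A minimal sketch of the censored exponential log-likelihood in equation (294), in Python with illustrative names:

import numpy as np
from scipy.optimize import minimize

def exp_duration_negloglik(beta, T, D, X):
    # Negative log-likelihood for the exponential duration model
    # T: observed durations, D: 1 if censored, X: covariates
    xb = X @ beta
    lam = np.exp(-xb)                     # lambda_n = exp(-beta'X_n)
    ll = np.sum(-(1 - D) * xb - lam * T)
    return -ll

def fit_exponential(T, D, X):
    res = minimize(exp_duration_negloglik, np.zeros(X.shape[1]),
                   args=(T, D, X), method="BFGS")
    return res.x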
More generally, we have,

N
X
l(θ) = (1 − Dn ) log f (Tn |Xn ; θ) + Dn log S(Tn |Xn ; θ) (295)
n=1

We can consider the following alternative models,

• Weibull:

f(t; λ, α) = αλ^α t^{α-1} e^{-(λt)^α}, \quad S(t; λ, α) = e^{-(λt)^α}, \quad E[T] = λ^{-1} Γ(α^{-1} + 1)   (296)

• Log-Normal:

f(t; µ, σ^2) = \frac{1}{tσ\sqrt{2π}} e^{-\frac{1}{2}(\log t - µ)^2/σ^2}, \quad S(t; µ, σ^2) = 1 - Φ\left( \frac{\log t - µ}{σ} \right), \quad E[T] = e^{µ + σ^2/2}   (297)

For the Weibull model, we have E[T_n | X_n] = e^{β'X_n} Γ(α^{-1} + 1), or,

\log E[T_n | X_n] = β'X_n + \log Γ(α^{-1} + 1)   (298)

so that \frac{∂}{∂X_{nk}} \log E[T_n | X_n] = β_k. For the log-normal, we have E[T_n | X_n] = e^{β'X_n + σ^2/2}, so that \frac{∂}{∂X_{nk}} \log E[T_n | X_n] = β_k as well. In either case, we can interpret a one unit change in X_{nk} as causing a β_k percent change in the uncensored duration.

9.3 The Cox Proportional Hazard Model
0
Suppose that h(Tn ) = h0 (Tn )eβ Xn . The Cox-Proportional Hazard model develops an esti-
mator for β0 that does not require knowledge of h0 , which is called the baseline hazard. To
estimate β0 , we maximize the following partial log-likelihood,

l(β) = \sum_{n=1}^{N} (1 - D_n) \log\left( \frac{ e^{β'X_n} }{ \sum_{m: T_m ≥ T_n} e^{β'X_m} } \right)   (299)

Note that log h(Tn ) = log h0 (Tn )+β 0 Xn . This implies that we can interpret βk as the increase
in the log-hazard that occurs when Xnk increases by 1 unit, or as a percentage change in the
hazard rate.
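A minimal sketch of the partial log-likelihood in equation (299) in Python (no handling of tied durations; names are illustrative):

import numpy as np
from scipy.optimize import minimize

def cox_partial_negloglik(beta, T, D, X):
    # Negative Cox partial log-likelihood (no tie correction)
    xb = X @ beta
    ll = 0.0
    for n in range(len(T)):
        if D[n] == 0:                     # only uncensored durations contribute
            risk_set = T >= T[n]          # units still at risk at time T_n
            ll += xb[n] - np.log(np.sum(np.exp(xb[risk_set])))
    return -ll

def fit_cox(T, D, X):
    res = minimize(cox_partial_negloglik, np.zeros(X.shape[1]),
                   args=(T, D, X), method="BFGS")
    return res.x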
An advantage of the Cox Proportional Hazard Model is that it generalizes the exponential and Weibull duration models. To see that this is the case, note that for the exponential model, we have f(T_n) = e^{-β'X_n} e^{-e^{-β'X_n} T_n} and S(T_n) = e^{-e^{-β'X_n} T_n}, which implies that h(T_n) = \frac{f(T_n)}{S(T_n)} = e^{-β'X_n}. Using h_0(T_n) = 1, we have h(T_n) = h_0(T_n) e^{-β'X_n}, which is a special case of the Cox Proportional Hazard model with the sign of β flipped. For the Weibull model, we have f(T_n; β, α) = α(e^{-β'X_n})^α T_n^{α-1} e^{-(e^{-β'X_n} T_n)^α} and S(T_n; β, α) = e^{-(e^{-β'X_n} T_n)^α}, which implies that h(T_n; β, α) = \frac{f(T_n; β, α)}{S(T_n; β, α)} = α e^{-αβ'X_n} T_n^{α-1}. If we use the re-parameterization γ = -αβ, we have h(T_n; γ, α) = α T_n^{α-1} e^{γ'X_n}. Using h_0(T_n) = α T_n^{α-1}, we have h(T_n; γ, α) = h_0(T_n) e^{γ'X_n}, which indicates that the Weibull model is a special case of the Cox Proportional Hazards model. The log-normal model is not a special case of the Cox Proportional Hazards model.
A drawback of the Cox Proportional Hazards model is that we cannot calculate substantive effects (even in principle), since calculating substantive effects requires knowledge of the baseline hazard h_0. To understand why, we can start with h(T_n) = h_0(T_n) e^{β'X_n}. The integrated hazard satisfies H(t) = \int_0^t h(u)\, du, which in our case implies H(t) = \int_0^t h_0(u) e^{β'X_n}\, du = e^{β'X_n} \int_0^t h_0(u)\, du = e^{β'X_n} H_0(t). Since S(t) = e^{-H(t)}, we have S'(t) = -h(t) e^{-H(t)} = -e^{β'X_n} h_0(t) e^{-e^{β'X_n} H_0(t)}. Since f(t) = -S'(t), we have f(t) = e^{β'X_n} h_0(t) e^{-e^{β'X_n} H_0(t)}. Finally, E[T_n] = \int_0^∞ t\, e^{β'X_n} h_0(t) e^{-e^{β'X_n} H_0(t)}\, dt. We therefore cannot calculate E[T_n] without knowing h_0(t), which the Cox Proportional Hazards estimator does not reveal. We also cannot calculate differences or ratios of E[T_n] for different values of X_n without knowing h_0(t). Instead, we can interpret a positive β_k as indicating an increase in the hazard rate, which implies a decrease in the expected value of the uncensored duration.
9.4 Discrete Duration Models
With discrete duration models, we model the probability of observing a positive outcome
for each unit n and each time period t. We use a logit or probit model, i.e. Pr(Ynt = 1) =
F (β 0 Xnt ) where F = Λ or F = Φ. Notice that this approach naturally allows for time
varying covariates. We capture duration dependence by including time-since-last-event in
the covariates X_{nt}. For example, we could include t - \arg\max_{t^* < t} 1\{Y_{nt^*} = 1\}. Another common
practice is to model duration dependence using cubic splines (Beck, Katz and Tucker, 1998).
A third common practice is to include the square and cubic duration terms (Carter and
Signorino, 2010).
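A minimal sketch of the last approach: augment the unit-period covariates with t, t^2, and t^3 (t measured as time since the last event) and fit a logit. The Python below is illustrative and not from the notes.

import numpy as np
from scipy.optimize import minimize

def add_duration_terms(X, t):
    # Append t, t^2, t^3 (time since last event) to the covariate matrix
    return np.column_stack([X, t, t ** 2, t ** 3])

def logit_mle(y, X):
    # Plain logit MLE on unit-period observations
    def negll(b):
        xb = X @ b
        return -np.sum(y * xb - np.log1p(np.exp(xb)))
    return minimize(negll, np.zeros(X.shape[1]), method="BFGS").x

# Usage sketch: X_nt are unit-period covariates, t_nt is periods since the last event
# beta_hat = logit_mle(y_nt, add_duration_terms(X_nt, t_nt))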

9.5 Time Varying Covariates in Continuous Models

Data Type | Time-Varying Covariates? | Censored Durations? | Models
Discrete Duration | No | No | Count data (count models, OLS, logged OLS, etc.) / logit or probit with duration terms
Discrete Duration | No | Yes | Logit or probit with duration terms
Discrete Duration | Yes | No | Logit or probit with duration terms
Discrete Duration | Yes | Yes | Logit or probit with duration terms
Continuous Duration | No | No | Positive data (OLS, logged OLS, etc.)
Continuous Duration | No | Yes | "Traditional" duration models (exponential, Weibull, Cox); basically a Tobit mixed with a model for positive data
Continuous Duration | Yes | No | Discretize: logit or probit with duration terms / don't discretize: "traditional" models with data structures that indicate when IVs change
Continuous Duration | Yes | Yes | Discretize: logit or probit with duration terms / don't discretize: "traditional" models with data structures that indicate when IVs change

Table 9: Summary of Duration Models
Continuous time models can handle time varying covariates. The main difference is that
the time varying covariates do not truly vary continuously. Instead, the data structure only
reveals when the covariates change. Based on the integrated hazard rates, we can derive the
log-likelihood.

9.6 Suggested Reading


9.6.1 Background

[1] Beck, Katz and Tucker (1998)

[2] Carter and Signorino (2010)

[3] Greene (2000)

9.6.2 Examples

[1] Simmons (2000)

[2] Stein (2005)

10 Monte Carlo Simulation and the Bootstrap


10.1 Monte Carlo Simulation
Consider a probit model, yn∗ = β00 xn +εn , yn = 1{yn∗ ≥ 0}, and εn ∼ N (0, 1) where εn are i.i.d.
The MLE for β0 should be consistent and asymptotically normal with variance-covariance
matrix V0 . We can examine the finite sample properties of this estimator as follows. For
each r, draw a vector of covariates x_n from some distribution for each n and draw ε_n for each n. We use these to compute y_n^*, use this to further compute y_n, and then estimate β_0 on this simulated data set using maximum likelihood estimation. Call this estimate β̂_r and let V̂_r be the associated variance-covariance matrix. We can investigate the finite sample properties of the MLE as follows, fixing N and letting R be relatively large,

\widehat{Bias}^N_k ≈ \frac{1}{R} \sum_{r=1}^{R} β̂_{r,k} - β_{0,k}   (300)

\widehat{RMSE}^N_k ≈ \sqrt{ \frac{1}{R} \sum_{r=1}^{R} (β̂_{r,k} - β_{0,k})^2 }   (301)

\widehat{Overconfidence}^N_k ≈ \frac{ \widehat{RMSE}^N_k }{ \sqrt{ \frac{1}{R} \sum_{r=1}^{R} \hat{V}_{r,kk} } }   (302)

\widehat{Coverage}^{N,α}_k ≈ \frac{1}{R} \sum_{r=1}^{R} 1\left\{ β̂_{r,k} - z_{α/2} \sqrt{\hat{V}_{r,kk}} ≤ β_{0,k} ≤ β̂_{r,k} + z_{α/2} \sqrt{\hat{V}_{r,kk}} \right\}   (303)

The MLE will have good finite sample properties if \widehat{Bias}^N_k is small, \widehat{RMSE}^N_k is small, \widehat{Overconfidence}^N_k is close to 1, and \widehat{Coverage}^{N,α}_k is close to 1 - α. The same logic can be used when N is large to check whether an MLE is coded properly based on its theoretical large sample properties: we should have \widehat{Bias}^N_k → 0, \widehat{RMSE}^N_k → 0, \widehat{Overconfidence}^N_k → 1, and \widehat{Coverage}^{N,α}_k → 1 - α.
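A minimal sketch of such a Monte Carlo study for the probit MLE, in Python; R, N, and the data-generating process are illustrative choices, not part of the notes.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def probit_fit(y, X):
    # Probit MLE and an estimate of its variance-covariance matrix
    def negll(b):
        xb = X @ b
        return -np.sum(y * norm.logcdf(xb) + (1 - y) * norm.logcdf(-xb))
    res = minimize(negll, np.zeros(X.shape[1]), method="BFGS")
    return res.x, res.hess_inv            # inverse Hessian approximates Var(beta_hat)

def monte_carlo(beta0, N=200, R=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    z = norm.ppf(1 - alpha / 2)
    draws, ses, covered = [], [], []
    for _ in range(R):
        X = np.column_stack([np.ones(N), rng.normal(size=N)])
        y = (X @ beta0 + rng.normal(size=N) >= 0).astype(float)
        b, V = probit_fit(y, X)
        se = np.sqrt(np.diag(V))
        draws.append(b); ses.append(se)
        covered.append((b - z * se <= beta0) & (beta0 <= b + z * se))
    draws, ses, covered = map(np.array, (draws, ses, covered))
    bias = draws.mean(axis=0) - beta0
    rmse = np.sqrt(((draws - beta0) ** 2).mean(axis=0))
    overconfidence = rmse / np.sqrt((ses ** 2).mean(axis=0))
    coverage = covered.mean(axis=0)
    return bias, rmse, overconfidence, coverage

print(monte_carlo(np.array([0.5, 1.0])))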

10.2 The Bootstrap


Consider an estimator θ̂ of θ_0. The bootstrap provides a way to conduct inference on c(θ_0) for some continuous function c using simulation methods. There are a number of different versions of the bootstrap. The nonparametric bootstrap works as follows. Consider the data (y_n, x_n), where the data are i.i.d. We form R new datasets by drawing random individuals (with replacement) from the observed data. Denote these new datasets by (y^r, x^r). We then compute the same estimator θ̂_r on each of these data sets. We then perform inferences on c(θ_0) using the empirical distribution of c(θ̂_r). Let ĉ_r = c(θ̂_r) and let ĉ_{(r)} denote the order statistics of ĉ. We could form a 95% confidence interval for c(θ_0) using [ĉ_{(.025R)}, ĉ_{(.975R)}].
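A minimal sketch of the nonparametric bootstrap for a generic estimator; `estimator` is any function of the data returning θ̂ and `c` is the function of interest, both illustrative assumptions.

import numpy as np

def nonparametric_bootstrap(y, X, estimator, c, R=999, alpha=0.05, seed=0):
    # Percentile bootstrap confidence interval for c(theta_0)
    rng = np.random.default_rng(seed)
    N = len(y)
    c_draws = np.empty(R)
    for r in range(R):
        idx = rng.integers(0, N, size=N)   # resample individuals with replacement
        c_draws[r] = c(estimator(y[idx], X[idx]))
    lower, upper = np.quantile(c_draws, [alpha / 2, 1 - alpha / 2])
    return lower, upper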
The parametric bootstrap works as follows. We have a parametric model for the data (x_n, y_n). Suppose that this data is characterized by the CDF F(x_n, y_n; θ). We obtain an estimate of θ_0, which is θ̂ = θ̂(x, y); the function θ̂(x, y) denotes the estimator as a function of the data. We then take new draws (x_n^r, y_n^r) from the distribution F(x_n, y_n; θ̂). We obtain new estimates for each random sample, θ̂_r = θ̂(x^r, y^r). We define ĉ_r = c(θ̂_r). We could again form a 95% confidence interval for c(θ_0) using [ĉ_{(.025R)}, ĉ_{(.975R)}].
An alternative version of the parametric bootstrap begins with the asymptotic approximation, \sqrt{N}(θ̂ - θ_0) \xrightarrow{dist.} N(0, V_0), and derives an asymptotic approximation for the distribution of \sqrt{N}(c(θ̂) - c(θ_0)), where c is a continuous function. Suppose V̂ \xrightarrow{prob.} V_0. We can draw from the estimated asymptotic distribution of θ̂, θ̂_r = θ̂ + N^{-1/2} V̂^{1/2} u_r, where u_r ∼ N(0, I). We again perform inferences using the empirical distribution of c(θ̂_r). The technique is described in King, Tomz and Wittenberg (2000), though it is not described there as the parametric bootstrap. This particular version of the parametric bootstrap apparently predates that article.
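A minimal sketch of this last version: draw simulated parameter vectors from the estimated asymptotic distribution of θ̂ and summarize c(θ̂_r). The names are illustrative, and `V_hat` is assumed to be the estimated variance-covariance matrix of θ̂ itself (i.e., already scaled by 1/N).

import numpy as np

def asymptotic_draws(theta_hat, V_hat, c, R=1000, alpha=0.05, seed=0):
    # King-Tomz-Wittenberg-style simulation from N(theta_hat, V_hat)
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(V_hat)               # a matrix square root of V_hat
    u = rng.normal(size=(R, len(theta_hat)))
    theta_draws = theta_hat + u @ L.T
    c_draws = np.array([c(t) for t in theta_draws])
    return np.quantile(c_draws, [alpha / 2, 1 - alpha / 2])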

10.3 Recommended Reading


10.3.1 Background

[1] King, Tomz and Wittenberg (2000)

10.3.2 Advanced

[1] Horowitz (2001)

11 Nonlinear Panel Data


Consider the case where the dependent variable ynt and the independent variables xnt are
indexed by two things (this could be individuals and time, countries and time, individuals
within states, etc.). The case where the DV and IV are indexed by two (or more) things goes
by a bunch of different names–grouped data, clustered data, longitudinal data, hierarchical
models, multi-level models, and time-series cross section data.

11.1 Clustered Standard Errors


Consider a panel data model with the data given by the xnt . Suppose that xnt ∼ f (xnt ; θ)
and suppose that xnt are i.i.d. over both n and t. The MLE is then given by,

θ̂ = \arg\max_{θ} \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \log f(x_{nt}; θ)   (304)

Consistency of the MLE follows from the fact that,

\frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \log f(x_{nt}; θ) \xrightarrow{prob.} E[\log f(x_{nt}; θ)]   (305)

and the fact that the information inequality implies that E[\log f(x_{nt}; θ)] is maximized at θ = θ_0.
With panel data, the assumption that x_{nt} are i.i.d. over both n and t may be implausible. Suppose instead that the marginal distribution of x_{nt} is still f(x_{nt}; θ), but that x_{nt} is independent over n, though not over t. It turns out that the MLE derived under the assumption that the data were independent is still consistent. The reason for this is that \arg\max_{θ} \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \log f(x_{nt}; θ) can be considered an M-estimator with limiting objective function E[\log f(x_{nt}; θ)], and the limiting objective function does not depend on the joint distribution of x_{nt}, but only on the marginal distribution.
What remains is to characterize the asymptotic distribution. We can write,

ψ(xnt ; θ) = log f (xnt ; θ) (306)

θ̂ = \arg\max_{θ} \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} ψ(x_{nt}; θ) = \arg\max_{θ} \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{1}{T} \sum_{t=1}^{T} ψ(x_{nt}; θ) \right]   (307)

= \arg\max_{θ} \frac{1}{N} \sum_{n=1}^{N} φ(x_n; θ)

The theory of M-estimators implies that,

√ dist.
N (θ̂ − θ0 ) −→ N (0, C0−1 B0 C0−1 ) (308)

where,
C_0 = E[φ_{θθ}(x_n; θ_0)] = E\left[ \frac{1}{T} \sum_{t=1}^{T} \frac{∂^2}{∂θ∂θ'} \log f(x_{nt}; θ_0) \right]   (309)

B_0 = E[φ_θ(x_n; θ_0) φ_θ(x_n; θ_0)']   (310)

= E\left[ \left( \frac{1}{T} \sum_{t=1}^{T} \frac{∂}{∂θ} \log f(x_{nt}; θ_0) \right) \left( \frac{1}{T} \sum_{t=1}^{T} \frac{∂}{∂θ} \log f(x_{nt}; θ_0) \right)' \right]

We can estimate these using,

Ĉ = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \frac{∂^2}{∂θ∂θ'} \log f(x_{nt}; θ̂)   (311)

B̂ = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{1}{T} \sum_{t=1}^{T} \frac{∂}{∂θ} \log f(x_{nt}; θ̂) \right) \left( \frac{1}{T} \sum_{t=1}^{T} \frac{∂}{∂θ} \log f(x_{nt}; θ̂) \right)'   (312)

This process is known as clustering the standard errors.


In the derivation, we assumed that the data were independent across individuals, but
not across time. In a panel data framework, this allows for individual specific unobservables
that are mean zero. It also allows for time series dependence in the error terms.
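A minimal sketch of the cluster-robust sandwich variance for a generic MLE; the per-observation score contributions and the Hessian are assumed to be supplied by the user (in practice they would be computed analytically or numerically at θ̂).

import numpy as np

def clustered_vcov(score_nt, hessian, cluster_ids):
    # Cluster-robust sandwich variance for an MLE
    # score_nt: (NT x K) per-observation score contributions evaluated at theta_hat
    # hessian: (K x K) Hessian of the total log-likelihood at theta_hat
    # cluster_ids: length-NT array identifying the cluster (e.g., the individual n)
    K = score_nt.shape[1]
    B = np.zeros((K, K))
    for g in np.unique(cluster_ids):
        s_g = score_nt[cluster_ids == g].sum(axis=0)   # sum scores within a cluster
        B += np.outer(s_g, s_g)
    Hinv = np.linalg.inv(hessian)
    return Hinv @ B @ Hinv                             # sandwich: H^{-1} B H^{-1}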

11.2 Nonlinear Fixed Effects Models


Consider the linear fixed effects framework,

ynt = αn0 + δt0 + β00 xnt + εnt (313)

where E[εnt |xnt ] = 0. The OLS estimator is given by,

(β̂, α̂_1, ..., α̂_N, δ̂_1, ..., δ̂_T) = \arg\min_{(β, α_1, ..., α_N, δ_1, ..., δ_T)} \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} (y_{nt} - α_n - δ_t - β'x_{nt})^2   (314)

We have that β̂ \xrightarrow{prob.} β_0 if N → ∞ or T → ∞, α̂_n \xrightarrow{prob.} α_{n0} if T → ∞, and δ̂_t \xrightarrow{prob.} δ_{t0} if N → ∞.
Consider alternatively a nonlinear model with individual level fixed effects,

(β̂, α̂_1, ..., α̂_N) = \arg\max_{(β, α_1, ..., α_N)} \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} ψ(x_{nt}; β, α_n)   (315)

We have that β̂ \xrightarrow{prob.} β_0 and α̂_n \xrightarrow{prob.} α_{n0} if T → ∞, but β̂ and α̂_n are both inconsistent if T is fixed and N → ∞. This becomes relevant because we cannot apply the nonlinear fixed effects estimator in short panels (even though we could apply the linear fixed effects model in short panels). When estimating models with fixed effects, it is more common to apply the linear model for that reason. In fact, applications of the linear probability model in published work are often cases where panels are short and it is desired to include fixed effects.

11.3 Conditional Fixed Effects Estimators
The conditional fixed effects logit estimator is an alternative estimator for the fixed effects logit model that is consistent in short panels. The estimator is based on the following conditional likelihood,

\Pr\left( y_{n1}, y_{n2}, ..., y_{nT} \,\Big|\, \sum_{t=1}^{T} y_{nt} = τ, x_{nt}; β, α_n \right)   (316)

= \frac{ \Pr\left( y_{n1}, y_{n2}, ..., y_{nT}, \sum_{t=1}^{T} y_{nt} = τ \,\big|\, x_{nt}; β, α_n \right) }{ \Pr\left( \sum_{t=1}^{T} y_{nt} = τ \,\big|\, x_{nt}; β, α_n \right) } = \frac{ \Pr\left( y_{n1}, y_{n2}, ..., y_{nT} \,\big|\, x_{nt}; β, α_n \right) }{ \Pr\left( \sum_{t=1}^{T} y_{nt} = τ \,\big|\, x_{nt}; β, α_n \right) }

\Pr\left( y_{n1}, y_{n2}, ..., y_{nT} \,\big|\, x_{nt}; β, α_n \right) = \prod_{t=1}^{T} \Pr(y_{nt} | x_{nt}; β, α_n) = \prod_{t=1}^{T} \frac{ (e^{α_n + β'x_{nt}})^{y_{nt}} }{ 1 + e^{α_n + β'x_{nt}} }   (317)

\Pr\left( \sum_{t=1}^{T} y_{nt} = τ \,\big|\, x_{nt}; β, α_n \right) = \sum_{d: \sum_t d_t = τ} \prod_{t=1}^{T} \frac{ (e^{α_n + β'x_{nt}})^{d_t} }{ 1 + e^{α_n + β'x_{nt}} }   (318)

\Pr\left( y_{n1}, y_{n2}, ..., y_{nT} \,\Big|\, \sum_{t=1}^{T} y_{nt} = τ, x_{nt}; β, α_n \right) = \frac{ \prod_{t=1}^{T} \frac{ (e^{α_n + β'x_{nt}})^{y_{nt}} }{ 1 + e^{α_n + β'x_{nt}} } }{ \sum_{d: \sum_t d_t = τ} \prod_{t=1}^{T} \frac{ (e^{α_n + β'x_{nt}})^{d_t} }{ 1 + e^{α_n + β'x_{nt}} } }   (319)

= \frac{ (e^{α_n})^{τ} \prod_{t=1}^{T} \frac{ (e^{β'x_{nt}})^{y_{nt}} }{ 1 + e^{α_n} e^{β'x_{nt}} } }{ \sum_{d: \sum_t d_t = τ} (e^{α_n})^{τ} \prod_{t=1}^{T} \frac{ (e^{β'x_{nt}})^{d_t} }{ 1 + e^{α_n} e^{β'x_{nt}} } } = \frac{ \prod_{t=1}^{T} (e^{β'x_{nt}})^{y_{nt}} }{ \sum_{d: \sum_t d_t = τ} \prod_{t=1}^{T} (e^{β'x_{nt}})^{d_t} } = \frac{ e^{β' \sum_{t=1}^{T} x_{nt} y_{nt}} }{ \sum_{d: \sum_t d_t = τ} e^{β' \sum_{t=1}^{T} x_{nt} d_t} }

This derivation suggests the following (conditional) log-likelihood,

l(β) = \sum_{n=1}^{N} \log\left( \frac{ e^{β' \sum_{t=1}^{T} x_{nt} y_{nt}} }{ \sum_{d_n: \sum_t d_{nt} = \sum_t y_{nt}} e^{β' \sum_{t=1}^{T} x_{nt} d_{nt}} } \right)   (320)

The advantage of this estimator is that it produces consistent estimates of β_0 when T is small and N → ∞. The disadvantage is that it cannot produce a consistent estimate of α_{n0}. This, in turn, means that we cannot compute consistent estimates of substantive effects (since the substantive effects would depend on α_{n0}).
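A minimal sketch of the conditional log-likelihood in equation (320), enumerating all within-unit sequences with the observed number of ones (feasible only for small T; names are illustrative):

import numpy as np
from itertools import combinations
from scipy.optimize import minimize

def cfe_logit_negloglik(beta, y, X, unit_ids):
    # Negative conditional fixed-effects logit log-likelihood
    ll = 0.0
    for n in np.unique(unit_ids):
        yn, Xn = y[unit_ids == n], X[unit_ids == n]
        T, tau = len(yn), int(yn.sum())
        if tau == 0 or tau == T:
            continue                                   # all-zero / all-one units drop out
        num = np.exp(beta @ (Xn * yn[:, None]).sum(axis=0))
        denom = sum(np.exp(beta @ Xn[list(idx)].sum(axis=0))
                    for idx in combinations(range(T), tau))
        ll += np.log(num / denom)
    return -ll

# Usage sketch:
# beta_hat = minimize(cfe_logit_negloglik, np.zeros(K),
#                     args=(y, X, unit_ids), method="BFGS").x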
There are conditional fixed effects estimators for the ordered logit and Poisson regression models (though I have never seen them used in published work). For the Poisson model, the conditional fixed effects estimator turns out to be equivalent to the (unconditional) fixed effects estimator, implying that the fixed effects estimator of β_0 is consistent when T is small and N → ∞. There is believed to be no conditional fixed effects probit estimator.

11.4 Random Effects Estimators


Random effects models are an alternative panel data model where β_0 can be estimated consistently if T is small and N → ∞. They are an alternative to the fixed effects model: while less general (we typically assume that the random effects are uncorrelated with the independent variables), they do not require a long panel for consistency and allow for time-invariant covariates to be included in the model. They are also an alternative to clustered standard errors: while not valid in as general a set of circumstances, they can be more efficient when the assumptions of the random effects model are met.
We can specify the random effects probit as follows,


y_{nt}^* = β_0'x_{nt} + ν_n + ε_{nt}   (321)

y_{nt} = 1\{y_{nt}^* ≥ 0\}   (322)

ε_{nt} ∼ N(0, 1)   (323)

ν_n ∼ N(0, σ_ν^2)   (324)

where ε_{nt} are i.i.d., ν_n are i.i.d., and ε_{nt} and ν_m are independent for all n, m, t. Under this framework, we have that,

y_{nt}^* | ν_n ∼ N(β'x_{nt} + ν_n, 1)   (325)

We have,

\Pr(y_n | ν_n) = \prod_{t=1}^{T} Φ(β'x_{nt} + ν_n)^{y_{nt}} (1 - Φ(β'x_{nt} + ν_n))^{1 - y_{nt}}   (326)

\Pr(y_n) = \int_{ν_n} \Pr(y_n | ν_n) \frac{1}{σ_ν \sqrt{2π}} e^{-\frac{1}{2} ν_n^2 / σ_ν^2} \, dν_n   (327)

We can form the likelihood for the model as,

l(β, σ_ν^2) = \sum_{n=1}^{N} \log \Pr(y_n)   (328)

Here, the expression for Pr(yn ) involves a one-dimensional integral. We can approxi-
mate this integral using simulation. Let unr ∼ N (0, 1). Then σν unr ∼ N (0, σν2 ). We can
approximate,

l(β, σ_ν^2) ≈ \sum_{n=1}^{N} \log \frac{1}{R} \sum_{r=1}^{R} \prod_{t=1}^{T} Φ(β'x_{nt} + σ_ν u_{nr})^{y_{nt}} (1 - Φ(β'x_{nt} + σ_ν u_{nr}))^{1 - y_{nt}}   (329)

Here, the draws unr should be fixed ahead of time (not redrawn each time the likelihood
function is evaluated).
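A minimal sketch of the simulated log-likelihood in equation (329), with the draws u_{nr} generated once and reused while (β, σ_ν) is optimized; the names are illustrative.

import numpy as np
from scipy.stats import norm

def re_probit_sim_loglik(beta, sigma_nu, y, X, unit_ids, u):
    # Simulated log-likelihood for the random effects probit
    # u: (n_units x R) array of N(0,1) draws, fixed in advance
    ll = 0.0
    for i, n in enumerate(np.unique(unit_ids)):
        yn, Xn = y[unit_ids == n], X[unit_ids == n]
        xb = Xn @ beta
        idx = xb[:, None] + sigma_nu * u[i][None, :]   # (T x R) linear index
        p = norm.cdf(idx)
        lik_r = np.prod(np.where(yn[:, None] == 1, p, 1 - p), axis=0)
        ll += np.log(np.mean(lik_r))                   # average over the R draws
    return ll

# The draws are created once, e.g. u = np.random.default_rng(0).normal(size=(n_units, R)),
# and held fixed each time the likelihood is evaluated.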

When computing marginal effects or substantive effects, note that y_{nt}^* ∼ N(β'x_{nt}, σ_ν^2 + 1). This implies that,

\Pr(y_{nt} = 1 | x_{nt}; β, σ_ν) = Φ\left( \frac{β'x_{nt}}{\sqrt{σ_ν^2 + 1}} \right)   (330)

11.5 Generalized Estimating Equations

Model | Conditional Fixed Effects
Linear-OLS | Not necessary
Logit-MLE | β̂ consistent as N → ∞, no α̂_n
Probit-MLE | n/a
Ordered Logit-MLE | Uncommon
Ordered Probit-MLE | n/a
Multinomial Logit-MLE | n/a
Conditional Logit-MLE | n/a
Multinomial Probit-MLE | n/a
Poisson-MLE | Equal to fixed effects; consistent as N → ∞, no α̂_n
Negative Binomial-MLE | n/a (*some software erroneously implements a CFE estimator)
Tobit-MLE | n/a

For Linear-OLS, the regular estimator gives β̂ consistent as N or T → ∞ and α̂_n consistent as T → ∞. For the nonlinear MLEs, the regular estimator gives β̂ consistent as N or T → ∞ (use clustered standard errors to account for correlated errors within clusters); the random effects estimator gives β̂ consistent as N → ∞; and the fixed effects estimator gives β̂ and α̂_n consistent as T → ∞ (*for the Poisson model, β̂ is consistent under the assumptions necessary for deriving the Poisson CFE estimator).

Table 10: Summary of Nonlinear Panel Data Models


Liang and Zeger (1986) proposed the following approach for obtaining estimators that are
consistent under more general conditions than random effects estimators, but are potentially
more efficient than applying clustered standard errors to conventional estimators. Here we
will derive Liang and Zeger’s estimator using an equivalent Classical Minimum Distance
(CMD) estimator. Consider a dependent variable ynt and a mean function µ(xnt ; β) =
E[ynt |xnt ]. Suppose that the following identification condition holds,

µ(xnt ; β) = µ(xnt ; β0 ) (331)

if and only if β = β0 . The classical minimum distance estimator is defined by,

β̂ = \arg\min_{β} \frac{1}{N} \sum_{n=1}^{N} (y_n - µ(x_n; β))' W (y_n - µ(x_n; β))   (332)

where,

µ(x_n; β) = \begin{bmatrix} µ(x_{n1}; β) \\ µ(x_{n2}; β) \\ \vdots \\ µ(x_{nT}; β) \end{bmatrix}   (333)
and W is a positive definite weighting matrix. Liang and Zeger's estimator works off of the first-order conditions, which is, of course, equivalent to optimizing the objective function above, apart from the possibility that the stationary point may not be unique. Liang and Zeger set β̂ to be the unique solution to the nonlinear system,

\frac{1}{N} \sum_{n=1}^{N} \frac{∂µ}{∂β}(x_n; β)' V^{-1} (y_n - µ(x_n; β)) = 0   (334)

To see why this is a reasonable estimator, note that,

E\left[ \frac{∂µ}{∂β}(x_n; β)' V^{-1} (y_n - µ(x_n; β)) \right] = E\left[ E\left[ \frac{∂µ}{∂β}(x_n; β)' V^{-1} (y_n - µ(x_n; β)) \,\Big|\, x_n \right] \right]   (335)

= E\left[ \frac{∂µ}{∂β}(x_n; β)' V^{-1} E[(y_n - µ(x_n; β)) | x_n] \right]   (336)

= E\left[ \frac{∂µ}{∂β}(x_n; β)' V^{-1} E[y_n | x_n] - \frac{∂µ}{∂β}(x_n; β)' V^{-1} µ(x_n; β) \right]

= E\left[ \frac{∂µ}{∂β}(x_n; β)' V^{-1} µ(x_n; β_0) - \frac{∂µ}{∂β}(x_n; β)' V^{-1} µ(x_n; β) \right] = E\left[ \frac{∂µ}{∂β}(x_n; β)' V^{-1} (µ(x_n; β_0) - µ(x_n; β)) \right]

This is equal to zero if β = β_0. Demonstrating that this is the unique solution is more difficult (Newey and McFadden, 1994). Provided the solution is unique and a Law of Large Numbers holds, i.e.,

\frac{1}{N} \sum_{n=1}^{N} \frac{∂µ}{∂β}(x_n; β)' V^{-1} (y_n - µ(x_n; β)) \xrightarrow{prob.} E\left[ \frac{∂µ}{∂β}(x_n; β)' V^{-1} (y_n - µ(x_n; β)) \right]   (337)

the estimator will be consistent.


Next, consider the distribution of the estimator. We can use a Taylor expansion to obtain,

µ(x_n; β) ≈ µ(x_n; β_0) + \frac{∂µ}{∂β}(x_n; β_0)(β - β_0)   (338)

Using the fact that,

\sum_{n=1}^{N} \frac{∂µ}{∂β}(x_n; β̂)' V^{-1} (y_n - µ(x_n; β̂)) = 0   (339)

we obtain,

\sum_{n=1}^{N} \frac{∂µ}{∂β}(x_n; β̂)' V^{-1} (y_n - µ(x_n; β_0)) ≈ \sum_{n=1}^{N} \frac{∂µ}{∂β}(x_n; β̂)' V^{-1} \frac{∂µ}{∂β}(x_n; β_0) (β̂ - β_0)   (340)


\sqrt{N} (β̂ - β_0)   (341)

\xrightarrow{dist.} \left[ \frac{1}{N} \sum_{n=1}^{N} \frac{∂µ}{∂β}(x_n; β_0)' V^{-1} \frac{∂µ}{∂β}(x_n; β_0) \right]^{-1} \left[ \frac{1}{\sqrt{N}} \sum_{n=1}^{N} \frac{∂µ}{∂β}(x_n; β_0)' V^{-1} (y_n - µ(x_n; β_0)) \right]

\sqrt{N} (β̂ - β_0) \xrightarrow{dist.} N(0, C_0^{-1} B_0 C_0^{-1})   (342)

where,

C_0 = E\left[ \frac{∂µ}{∂β}(x_n; β_0)' V^{-1} \frac{∂µ}{∂β}(x_n; β_0) \right]   (343)

B_0 = \mathrm{Var}\left( \frac{∂µ}{∂β}(x_n; β_0)' V^{-1} (y_n - µ(x_n; β_0)) \right)   (344)

We can consider a number of alternative specifications for µ(x_{nt}; β). For the linear model, we have µ(x_{nt}; β) = β'x_{nt}, for the probit model we have µ(x_{nt}; β) = Φ(β'x_{nt}), and for the Poisson model, we have µ(x_{nt}; β) = e^{β'x_{nt}}.
Note that,

B_0 = \mathrm{Var}\left( \frac{∂µ}{∂β}(x_n; β_0)' V^{-1} (y_n - µ(x_n; β_0)) \right)   (345)

= E\left[ \mathrm{Var}\left( \frac{∂µ}{∂β}(x_n; β_0)' V^{-1} (y_n - µ(x_n; β_0)) \,\Big|\, x_n \right) \right] + \mathrm{Var}\left( E\left[ \frac{∂µ}{∂β}(x_n; β_0)' V^{-1} (y_n - µ(x_n; β_0)) \,\Big|\, x_n \right] \right)

= E\left[ \frac{∂µ}{∂β}(x_n; β_0)' V^{-1} \mathrm{Var}(y_n | x_n) V^{-1} \frac{∂µ}{∂β}(x_n; β_0) \right]

Suppose that \mathrm{Var}(y_n | x_n) does not depend on x_n and let V_0 = \mathrm{Var}(y_n). If we select V = V_0, then B_0 = C_0 = E\left[ \frac{∂µ}{∂β}(x_n; β_0)' V_0^{-1} \frac{∂µ}{∂β}(x_n; β_0) \right]. It can be shown that this leads to the smallest possible variance. This calculation suggests that we should select V to be close to V_0.
To specify V , Liang and Zeger (1986) consider four cases:

1. Independence: V = I

2. Exchangeability:

V = \begin{pmatrix} 1 & ρ & \cdots & ρ \\ ρ & 1 & \cdots & ρ \\ \vdots & \vdots & \ddots & \vdots \\ ρ & ρ & \cdots & 1 \end{pmatrix}   (346)

3. AR(1) process:

V = \begin{pmatrix} 1 & ρ & ρ^2 & \cdots & ρ^{T-1} \\ ρ & 1 & ρ & \cdots & ρ^{T-2} \\ ρ^2 & ρ & 1 & \cdots & ρ^{T-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ ρ^{T-1} & ρ^{T-2} & ρ^{T-3} & \cdots & 1 \end{pmatrix}   (347)

4. Unstructured:

V = \begin{pmatrix} 1 & ρ_{12} & ρ_{13} & \cdots & ρ_{1T} \\ ρ_{12} & 1 & ρ_{23} & \cdots & ρ_{2T} \\ ρ_{13} & ρ_{23} & 1 & \cdots & ρ_{3T} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ ρ_{1T} & ρ_{2T} & ρ_{3T} & \cdots & 1 \end{pmatrix}   (348)

The independence case leads to an estimator that is equivalent to the MLE. For the other cases, the approach is to obtain a preliminary consistent estimator of β_0 and to use this to estimate the parameters of the V matrix. For example, in the unstructured case, we would set r_{nt} = y_{nt} - µ(x_{nt}; β̂) and estimate,

ρ̂_{ts} = \frac{ \frac{1}{N} \sum_{n=1}^{N} r_{nt} r_{ns} }{ \sqrt{ \frac{1}{N} \sum_{n=1}^{N} r_{nt}^2 } \sqrt{ \frac{1}{N} \sum_{n=1}^{N} r_{ns}^2 } }   (349)
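A minimal sketch of this step for the unstructured case, assuming a balanced panel with units in the rows (names are illustrative):

import numpy as np

def unstructured_working_correlation(y, mu_hat):
    # Estimate the T x T working correlation from residuals r_nt = y_nt - mu_hat_nt
    # y, mu_hat: (N x T) arrays for a balanced panel
    r = y - mu_hat
    scale = np.sqrt((r ** 2).mean(axis=0))     # sqrt of (1/N) sum_n r_nt^2, per period
    R = (r.T @ r) / len(r)                     # (1/N) sum_n r_nt r_ns
    V = R / np.outer(scale, scale)             # rho_hat_ts as in equation (349)
    np.fill_diagonal(V, 1.0)
    return V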

11.6 Recommended Reading


11.6.1 Background

[1] Greene (2000)

[2] Hsiao (2003)

[3] Wooldridge (2002)

11.6.2 Examples

[1] Kayser and Peress (2012)

[2] Nepal, Bohara and Gawande (2011)

[3] Richman (2011)

11.6.3 Advanced

[1] Liang and Zeger (1986)

[2] Newey and McFadden (1994)

12 Time Series Dependence in Nonlinear Models


12.1 Parametric Models
Consider the following model,

yt∗ = β00 xt + εt (350)

yt = 1{yt∗ ≥ 0} (351)

εt = ρεt−1 + ut (352)

where ut ∼ N (0, 1) and ut are i.i.d. To form the MLE, we must consider,

\Pr(y | x; β, ρ) = \Pr(y_t^* ≥ 0 \text{ for } t: y_t = 1,\; y_t^* < 0 \text{ for } t: y_t = 0)   (353)

This involves a T -dimensional integral of the normal distribution over a region. The integral
would typically be computed using the GHK simulator. The log-likelihood is given by,

l(β, ρ) = log Pr(y|x; β, ρ) (354)

A similar difficulty shows up in other time series models of limited dependent variables.

12.2 Semiparametric Approach


Consider the same model as before, but consider Pr(yt = 1|xt ). We have that the stationary
distribution for ε_t is given by ε_t ∼ N\left( 0, \frac{1}{1 - ρ^2} \right). This implies that,

\Pr(y_t = 1 | x_t; β, ρ) = Φ\left( \sqrt{1 - ρ^2}\, β'x_t \right)   (355)

In fact, for simplicity, we can specify ε_t = ρε_{t-1} + \sqrt{1 - ρ^2}\, u_t, in which case the stationary distribution is ε_t ∼ N(0, 1), so that,

\Pr(y_t = 1 | x_t; β, ρ) = Φ(β'x_t)   (356)

Since identification of an M-estimator only depends on the marginal distribution, this means
that probit will generally be consistent even if the errors have time series dependence. We
still have to correct the standard errors, but this can be accomplished using the Newey-West procedure (Newey and West, 1987, 1994).
If we have panel data, an effective approach is to cluster the standard errors by individual
(provided that we have many individuals). If we have long panels with few individuals, we
may want to apply Newey-West standard errors to deal with time-series dependence.

12.3 Recommended Reading


12.3.1 Advanced

[1] Andrews (1991)

[2] Andrews and Monahan (1992)

[3] Newey and West (1987)

[4] Newey and West (1994)

13 References

References
Alvarez, Michael and Jonathan Nagler. 1995. “Economics, Issues, and the Perot Candidacy:
Voter Choice in the 1992 Presidential Election.” American Political Science Review 3:714–
744.

Andrews, Donald W.K. 1991. “Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation.” Econometrica 59:817–858.

Andrews, Donald W.K. and J. Christopher Monahan. 1992. “An Improved Heteroskedastic-
ity and Autocorrelation Consistent Covariance Matrix Estimator.” Econometrica 60:953–
966.

Beck, Nathaniel, Jonathan N. Katz and Richard Tucker. 1998. “Taking Time Seriously:
Time-Series-Cross-Section Analysis with a Binary Dependent Variable.” American Journal
of Political Science 42:1260–1288.

Berry, William, Jacqueline H. R. Demeritt and Justin Esaray. 2010. “Testing for Interaction
in Binary Logit and Probit Models: Is a Product Term Essential?” American Journal of
Political Science 54:248–266.

Burden, Barry C., David T. Cannon, Kenneth R. Mayer and Donald P. Moynihan. 2014.
“Election Laws, Mobilization, and Turnout: The Unanticipated Consequences of Electoral
Reform.” American Journal of Political Science 58:95–109.

Cameron, A. Colin and Pravin K. Trivedi. 2001. Essentials of Count Data Regression. In A
Companion to Theoretical Econometrics, ed. Badi H. Baltagi. Oxford: Blackwell pp. 331–
348.

Carter, David B. and Curtis S. Signorino. 2010. “Back to the Future: Modeling Time
Dependence in Binary Data.” Political Analysis 18:271–292.

Chen, Jowei. 2013. “Voter Partisanship and the Effect of Distributive Spending on Political
Participation.” American Journal of Political Science 57:200–217.

Cox, Gary W. and Mathew D. McCubbins. 1993. Legislative Leviathan: Party Government
in the House. Berkeley and Los Angeles: University of California Press.

Gabrielsen, Arne. 1978. “Consistency and identifiability.” Journal of Econometrics 8:261–263.

Gallant, Ronald A. 1997. An Introduction to Econometric Theory. Princeton University Press.

Greene, William H. 2000. Econometric Analysis. 4th ed. New York: Prentice Hall.

Horowitz, Joel L. 2001. The Bootstrap. In Handbook of Econometrics, Volume 5, ed. James J.
Heckman and Edward Leamer. Amsterdam: North-Holland pp. 3159–3228.

Hsiao, Cheng. 2003. Analysis of Panel Data. Cambridge University Press.

Huber, Gregory A. and Sanford C. Gordon. 2004. “Accountability and Coercion: Is Justice
Blind when It Runs for Office?” American Journal of Political Science 48:247–263.

Jenkins, Jeffrey A. and Nathan W. Monroe. 2012. “Buying Negative Agenda Control in the
U.S. House.” American Journal of Political Science 56:897–912.

Kayser, Mark A. and Michael Peress. 2012. “Benchmarking Across Borders: Electoral
Accountability and the Necessity of Comparison.” American Political Science Review
106:661–684.

Kennedy, Peter. 1992. A Guide to Econometric Methods. 3rd ed. Cambridge, MA: MIT
Press.

King, Gary. 1989. “Representation through Legislative Redistricting: A Stochastic Model.” American Journal of Political Science 33(4):787–824.

King, Gary. 1998. Unifying Political Methodology: The Likelihood Theory of Statistical
Inference. Ann Arbor: University of Michigan Press.

King, Gary, Michael Tomz and Jason Wittenberg. 2000. “Making the Most of Statisti-
cal Analyses: Improving Interpretation and Presentation.” American Journal of Political
Science 44:347–361.

Kriner, Douglas and Francis Shen. 2014. “Responding to War on Capitol Hill: Battlefield
Casualties, Congressional Response, and Public Support for the War in Iraq.” American
Journal of Political Science 58:157–174.

Liang, Kung-Yee and Scott L. Zeger. 1986. “Longitudinal Data Analysis using Generalized
Linear Models.” Biometrika 73:13–22.

Martı́n, Ernesto San and Fernando Quintana. 2002. “Consistency and Identifiability Revis-
ited.” Brazilian Journal of Probability and Statistics 16:99–106.

Nagler, Jonathan. 1991. “The Effect of Registration Laws and Education on U.S. Voter
Turnout.” American Political Science Review 85:1393–1405.

Nepal, Mani, Alok K. Bohara and Kishore Gawande. 2011. “More Inequality, More Killings:
The Maoist Insurgency in Nepal.” American Journal of Political Science 55:886–906.

Newey, Whitney and Daniel McFadden. 1994. Estimation and Inference in Large Samples.
In Handbook of Econometrics, Volume 4. New York: North Holland.

Newey, Whitney K. and Kenneth D. West. 1987. “A Simple Positive Definite Heteroskedas-
ticity and Autocorrelation Consistent Covariance Matrix.” Econometrica 55:703–708.

Newey, Whitney K. and Kenneth D. West. 1994. “Automatic Lag Selection in Covariance
Matrix Estimation.” Review of Economic Studies 61:631–653.

Plumper, Thomas, Christina J. Schneider and Vera E. Troeger. 2006. “The Politics of EU
Eastern Enlargement: Evidence from a Heckman Selection Model.” British Journal of
Political Science 36:17–38.

Rao, B.L.S. Prakasa. 1992. Identifiability in Stochastic Models: Characterization of Probability Distributions. Boston: Academic Press.

Richman, Jesse. 2011. “Parties, Pivots, and Policy: The Status Quo Test.” American Polit-
ical Science Review 105:151–165.

Simmons, Beth A. 2000. “International Law and State Behavior: Commitment and Compli-
ance in International Monetary Affairs.” American Political Science Review 94:819–835.

Stein, Jana Von. 2005. “Do Treaties Constrain or Screen? Selection Bias and Treaty Com-
pliance.” American Political Science Review 99:611–622.

Train, Kenneth E. 1992. Discrete Choice Methods with Simulation. Cambridge: Cambridge
University Press.

van der Vaart, A. W. 1997. Superefficiency. In Festschrift for Lucien Le Cam, ed. D. Pollard,
E. Torgersen and G. L. Yang. New York: Springer pp. 397–410.

Weghorst, Keith R. and Staffan I. Lindberg. 2013. “What Drives the Swing Voter in Africa?”
American Journal of Political Science 57:717–734.

White, Halbert. 1984. Asymptotic Theory for Econometricians. San Diego: Academic Press.

Wilson, Matthew C. and James A. Piazza. 2013. “Autocracies and Terrorism: Conditioning
Effects of Authoritarian Regime Type of Terrorist Attacks.” American Journal of Political
Science 57:941–955.

Wolfinger, Raymond and Steven Rosenstone. 1980. Who Votes? New Haven, CT: Yale
University Press.

Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cam-
bridge University Press.

A Appendix
A.1 Proof that Asymptotic Normality Implies Asymptotic Unbi-
asedness
An estimator θ̂ of a population parameter θ_0 is said to be asymptotically unbiased if \lim_{N→∞} E[θ̂ - θ_0] = 0. Suppose that \sqrt{N}(θ̂ - θ_0) \xrightarrow{dist.} N(0, V_0) where V_0 is positive definite. We will show that this implies that the estimator is asymptotically unbiased. We have,

V_0^{-1} \sqrt{N} E[θ̂ - θ_0] = V_0^{-1} \sqrt{N} \int_{θ̂} (θ̂ - θ_0)\, dF_{θ̂}(θ̂)   (357)

where V_0^{-1} exists because V_0 is assumed to be positive definite. Using the transformation x = V_0^{-1} \sqrt{N} (θ̂ - θ_0), we have,

V_0^{-1} \sqrt{N} E[θ̂ - θ_0] = \int_{x} x\, dF_{V_0^{-1}\sqrt{N}(θ̂ - θ_0)}(x)   (358)

Define the mean functional, H(G) = \int_{x} x\, dG(x). We have,


V_0^{-1} \sqrt{N} E[θ̂ - θ_0] = H(F_{V_0^{-1}\sqrt{N}(θ̂ - θ_0)})   (359)

By assumption, we have that,

\lim_{N→∞} F_{V_0^{-1}\sqrt{N}(θ̂ - θ_0)}(x) = Φ_J(x) \quad ∀x   (360)

where Φ_J is the CDF of a standard multivariate normal distribution. Continuity of the mean functional implies that,

\lim_{N→∞} H(F_{V_0^{-1}\sqrt{N}(θ̂ - θ_0)}) = H(Φ_J)   (361)

Hence, we have,

\lim_{N→∞} V_0^{-1} \sqrt{N} E[θ̂ - θ_0] = H(Φ_J) = 0   (362)

which implies that \lim_{N→∞} E[θ̂ - θ_0] = 0.

A.2 Derivation of the Expected Value of yn in the Heckman Selection Model
We have,

E[y_n | x_n, z_n, w_n = 1] = E[β'x_n + ε_n | x_n, z_n, w_n = 1] = β'x_n + E[ε_n | z_n, w_n = 1]   (363)

= β'x_n + E[ε_n | z_n, η_n ≥ -γ'z_n] = β'x_n + \frac{1}{\Pr(η_n ≥ -γ'z_n)} \int_{(ε_n, η_n): η_n ≥ -γ'z_n} ε_n\, dF_{ε,η}(ε_n, η_n)

Using the formula for the marginal and conditional normal distributions, we have,

E[y_n | x_n, z_n, w_n = 1]   (364)

= β'x_n + \frac{1}{1 - Φ(-γ'z_n)} \int_{(ε_n, η_n): η_n ≥ -γ'z_n} ε_n \frac{1}{\sqrt{2π}} e^{-\frac{1}{2}η_n^2} \frac{1}{\sqrt{2πσ^2(1 - ρ^2)}} e^{-\frac{1}{2σ^2(1 - ρ^2)}(ε_n - ρση_n)^2} d(ε_n, η_n)

= β'x_n + \frac{1}{1 - Φ(-γ'z_n)} \int_{η_n ≥ -γ'z_n} \frac{1}{\sqrt{2π}} e^{-\frac{1}{2}η_n^2} \left( \int_{ε_n} ε_n \frac{1}{\sqrt{2πσ^2(1 - ρ^2)}} e^{-\frac{1}{2σ^2(1 - ρ^2)}(ε_n - ρση_n)^2} dε_n \right) dη_n

= β'x_n + \frac{1}{1 - Φ(-γ'z_n)} \int_{η_n ≥ -γ'z_n} ρση_n \frac{1}{\sqrt{2π}} e^{-\frac{1}{2}η_n^2} dη_n = β'x_n + ρσ E[η_n | η_n ≥ -γ'z_n]

Finally, using the formula for the mean of a truncated normal distribution,

E[y_n | x_n, z_n, w_n = 1] = β'x_n + ρσ \frac{ φ(-γ'z_n) }{ 1 - Φ(-γ'z_n) }   (365)
