Analytical Methods for Social Research presents texts on empirical and formal methods
for the social sciences. Volumes in the series address both the theoretical underpinnings of
analytical techniques as well as their application in social research. Some series volumes are
broad in scope, cutting across a number of disciplines. Others focus mainly on methodological
applications within specific fields such as political science, sociology, demography, and public
health. The series serves a mix of students and researchers in the social sciences and statistics.
Series Editors
R. Michael Alvarez, California Institute of Technology
Nathaniel L. Beck, New York University
Lawrence L. Wu, New York University
Time Series Analysis for the Social Sciences, by Janet M. Box-Steffensmeier, John R. Freeman,
Jon C.W. Pevehouse and Matthew Perry Hitt
Event History Modeling: A Guide for Social Scientists, by Janet M. Box-Steffensmeier and
Bradford S. Jones
Ecological Inference: New Methodological Strategies, edited by Gary King, Ori Rosen, and
Martin A. Tanner
Spatial Models of Parliamentary Voting, by Keith T. Poole
Essential Mathematics for Political and Social Research, by Jeff Gill
Political Game Theory: An Introduction, by Nolan McCarty and Adam Meirowitz
Data Analysis Using Regression and Multilevel/Hierarchical Models, by Andrew Gelman and
Jennifer Hill
Counterfactuals and Causal Inference, by Stephen L. Morgan and Christopher Winship
Maximum Likelihood for Social Science
Strategies for Analysis
M I C H A E L D. WA R D
Duke University
J O H N S . A H L QU I S T
University of California, San Diego
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906
www.cambridge.org
Information on this title: www.cambridge.org/9781107185821
DOI: 10.1017/9781316888544
© Michael D. Ward and John S. Ahlquist 2018
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2018
Printed in the United States of America by Sheridan Books, Inc.
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Ward, Michael Don, 1948– author. | Ahlquist, John S., author.
Title: Maximum likelihood for social science : strategies for analysis/
Michael D. Ward, John S. Ahlquist.
Description: 1 Edition. | New York : Cambridge University Press, 2018. |
Series: Analytical methods for social research
Identifiers: LCCN 2018010101 | ISBN 9781107185821 (hardback) |
ISBN 9781316636824 (paperback)
Subjects: LCSH: Social sciences–Research. | BISAC: POLITICAL SCIENCE / General.
Classification: LCC H62 .W277 2018 | DDC 300.72–dc23
LC record available at https://lccn.loc.gov/2018010101
ISBN 978-1-107-18582-1 Hardback
ISBN 978-1-316-63682-4 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy
of URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
This project began many years ago at the University of Washington’s Center
for Statistics and the Social Sciences (CSSS). There two ambitious graduate
students, John S. Ahlquist and Christian Breunig (now at the University of
Konstanz), asked Michael D. Ward if he would supervise their training in
maximum likelihood methods so that they could be better prepared for taking
more advanced CSSS courses as well as those in the statistics and biostatistics
departments. Ward gave them a stack of materials and asked them to start by
preparing a lecture on ordinal regression models. Ward subsequently developed
a class on maximum likelihood methods, which he has taught at the University
of Washington (where it is still taught by Christopher Adolph) and, more
recently, at Duke University. Ahlquist has gone on to teach a similar course
at Florida State, the University of Wisconsin, and UC San Diego.
The point of the course was singular, and this book has a simple goal: to
introduce social scientists to the maximum likelihood principle in a practical
way. This praxis includes (a) being able to recognize where maximum likelihood
methods are useful, (b) being able to interpret results from such analyses, and (c)
being able to implement these methods both in terms of creating the likelihood
and in terms of specifying it in a computational language that permits empirical
analysis to be undertaken using the developed model.
The text is aimed at advanced PhD students in the social sciences, especially
political science and sociology. We assume familiarity with basic probability
concepts, the application of multivariate calculus to optimization problems,
and the basics of matrix algebra.
our approach
We take a resolutely applied perspective here, emphasizing core concepts,
computation, and model evaluation and interpretation. While we include a
chapter that introduces some of the important theoretical results and their
derivations, we spend relatively little space discussing formal statistical prop-
erties. We made this decision for three reasons. First, there are several ways
to motivate the likelihood framework. We find a focus on a method's
"desirable properties" in a frequentist setting to be a less persuasive reason to
study maximum likelihood estimators (MLEs). Instead we prefer to emphasize
the powerful conceptual jump that likelihood-based reasoning represents in the
study of statistics, one that enables us to move to a Bayesian setting relatively
easily. Second, the statistical theory underlying the likelihood framework is
well understood; it has been for decades. The requisite theorems and proofs
are already collected in other excellent volumes, so we allocate only a single
chapter to recapitulating them here. Rather, we seek to provide something
that is missing: an applied text emphasizing modern applications of maximum
likelihood in the social sciences. Third, and perhaps most important, we
find that students learn more and have a more rewarding experience when
the acquisition of new technical tools is directly bound to the substantive
applications motivating their study.
Many books and even whole graduate training programs start with so-called
Ordinary Least Squares (OLS). There is a certain logic to that. OLS is easy to
teach, implement, and utilize while introducing a variety of important statistical
concepts. OLS was particularly attractive in a world before powerful modern
computers fit in our pockets. But OLS can be viewed as a special case of a
more general class of models. Practically speaking, a limited range of social
science data fit into this special case. Data in the social sciences tend to be
lumpier, often categorical. Nominal, truncated, and bounded variables emerge
not just from observational datasets but in researcher-controlled experiments
as well (e.g., treatment selection and survival times). Indeed, the vast majority
of social science data comes in forms that are profitably analyzed without
resort to the special case of OLS. While OLS is a pedagogical benchmark,
you will have to look hard for recent, state-of-the-art empirical articles that
analyze observational data based on this approach. After reading this book
and working through the examples, students should be able to fit, choose, and
interpret many of the statistical models that appear in published research. These
models are designed for binary, categorical, ordered, and count data that are
neither continuous nor distributed normally.
what follows
We have pruned this book down from versions that appeared earlier online.
We wanted the main text to focus entirely on the method and application of
maximum likelihood principles. The text is divided into four parts.
Part I (Chapters 1–4) introduces the concept of likelihood and how it fits
with both classical and Bayesian statistics. We discuss OLS only in passing.
special features
This volume contains several special features and sections that deserve further
elaboration.
R Code
This is an applied, computational text. We are particularly interested in helping
students transform mathematical statements into executable computer code.
R has become the dominant language in statistical computing because it is
object-oriented and vector-based; has the best statistical graphics; and is
open-source, meaning it is free to students and supported by a large network of
contributors submitting new libraries almost daily. The newest statistical tools
generally appear in R first.
We include code directly in the text in offset and clearly marked boxes. We
include our own comments in the code chunks so students can see annotation
clarifying computational steps. We also include R output and warnings in
various places to aid in interpreting actual R output as well as trouble-shooting.
All analysis and graphics are generated in R. The online repository contains the
R code needed to reproduce all tables and graphics.
Unlike equations, tables, figures, and code chunks, we do not refer to the boxes
directly in the main text. The titles of the boxes reflect their function and status;
they present supplemental information for the curious.
“Further Reading”
Each chapter ends with a "further reading" section. These sections all follow a
similar format, with subheadings for "applications," "past work," "advanced
study," and "software notes," as appropriate for the topic.
The “applications” section highlights two to four studies using the tools
discussed in that chapter and published in major social science journals in the
last four years. These studies are meant to be examples of the types of papers
students might consider when choosing replication projects.
The “past work” section is designed to provide pointers to the major
contributors to the development and popularization of these tools in the social
sciences. The “advanced study” section collects references to more advanced
texts and articles where interested students can look for more detail on the math
or computational algorithms. We consulted many of these texts in writing this
book.
In the “software notes” sections we collect references to the major R libraries
that we found useful in preparing the book or in conducting analysis ourselves.
Since R is open-source, these references will surely become stale. We neverthe-
less thought it beneficial to collect references to R packages in a single place in
each chapter.
notation glossary
In our experience students often find mathematical notation a particularly
frustrating barrier. To mitigate that problem we have included a notation
"glossary" at the beginning of the book.
online resources
The online repository, maxlikebook.com, accompanying this volume contains the R code and data needed to reproduce the analyses in the book.
CONCEPTS, THEORY, AND IMPLEMENTATION

1 Introduction to Maximum Likelihood
eaten at either restaurant. They each flip a single coin, deciding that a heads
will indicate a vote for the brewpub. The result is two heads and one tails. The
friends deposit the coin in the parking meter and go to the brewpub.
We might wonder whether the coin was, in fact, fair. As a data analysis
problem, these coin flips were not obtained in a traditional sampling frame-
work, nor are we interested in making inferences about the general class of
restaurant coin flips. Rather, the three flips of a single coin are all the data
that exist, and we just want to know how the decision was taken. This is
a binary outcomes problem. The data are described by the following set in
which 1 represents heads: {1, 1, 0}. Call the probability of a flip in favor of
eating at the brewpub θ ; the probability of a flip in favor of eating Ethiopian
is thereby 1 − θ . In other words, we assume a Bernoulli distribution for a
coin flip.
What value of the parameter, θ , best describes the observed data? Prior
experience may lead us to believe that coin flips are equiprobable; θ̂ = 0.5
seems a reasonable guess. Further, one might also reason that since there are
three pieces of data, the probability of the joint outcome of three flips is
0.53 = 0.125. This may be a reasonable summary of our prior expectations,
but this calculation fails to take advantage of the actual data at hand to inform
our estimate.
A simple tabulation reveals this insight more clearly. We know that in this
example, θ is defined on the interval [0, 1], i.e., 0 ≤ θ ≤ 1. We also know that
unconditional probabilities compound themselves so that the probability of a
head on the first coin toss times the probability of a head on the second times the
probability of tails on the third produces the joint probability of the observed
data: θ ×θ ×(1−θ ). Given this expression we can easily calculate the probability
of getting the observed data for different values of θ . Computationally, the
results are given by Pr(y1 | θ̂) × Pr(y2 | θ̂ ) × Pr(y3 | θ̂ ), where yi is the value of
each observation, i ∈ {1, 2, 3} and | θ̂ is read, “given the proposed value of θ .”
Table 1.1 displays these calculations in increments of 0.1.
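This tabulation is easy to reproduce in R; a minimal sketch (the object names here are ours, not the book's):

theta <- seq(0, 1, by = 0.1)    # candidate values of theta-hat
lik   <- theta^2 * (1 - theta)  # probability of observing {1, 1, 0}
round(cbind(theta, lik), 3)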
Table 1.1

Observed Data
y            θ̂       θ^(# 1s) × (1 − θ)^(# 0s)    f_B(y | θ̂)
{1, 1, 0}    0.00    0.00² × (1 − 0.00)¹          0.000
{1, 1, 0}    0.10    0.10² × (1 − 0.10)¹          0.009
{1, 1, 0}    0.20    0.20² × (1 − 0.20)¹          0.032
{1, 1, 0}    0.30    0.30² × (1 − 0.30)¹          0.063
{1, 1, 0}    0.40    0.40² × (1 − 0.40)¹          0.096
{1, 1, 0}    0.50    0.50² × (1 − 0.50)¹          0.125
{1, 1, 0}    0.60    0.60² × (1 − 0.60)¹          0.144
{1, 1, 0}    0.67    0.67² × (1 − 0.67)¹          0.148
{1, 1, 0}    0.70    0.70² × (1 − 0.70)¹          0.147
{1, 1, 0}    0.80    0.80² × (1 − 0.80)¹          0.128
{1, 1, 0}    0.90    0.90² × (1 − 0.90)¹          0.081
{1, 1, 0}    1.00    1.00² × (1 − 1.00)¹          0.000
The a priori guess of 0.5 turns out not to be the most likely to have generated
these data. Rather, the value of 2/3 is the most likely value for θ. It is not
necessary to do all of this by guessing values of θ . This case can be solved
analytically.
When we have data on each of the trials (flips), the Bernoulli probability
model, fB , is a natural place to start. We will call the expression that describes
the joint probability of the observed data as a function of the parameters
the likelihood function, denoted L(y; θ ). We can use the tools of differential
calculus to solve for the maximum; we take the logarithm of the likelihood for
computational convenience:
$$
\begin{aligned}
L &= \theta^2 (1 - \theta)^1 \\
\log L &= 2 \log \theta + 1 \log(1 - \theta) \\
\frac{\partial \log L}{\partial \theta} &= \frac{2}{\theta} - \frac{1}{1 - \theta} = 0 \\
\hat{\theta} &= \frac{2}{3}.
\end{aligned}
$$
The value of θ that maximizes the likelihood function is called the maximum
likelihood estimate, or MLE.
It is clear that, in this case, it does not matter who gets heads and who gets
tails. Only the number of heads out of three flips matters. When Bernoulli data
are grouped in such a way, we can describe them equivalently with the closely
related binomial distribution.
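For reference, the binomial probability mass function for y successes in m trials is (a standard form, stated here for completeness):

$$f_{\text{Bin}}(y \mid m, \theta) = \binom{m}{y}\,\theta^{y}(1 - \theta)^{m - y}.$$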
1 In statistics, asymptotic analysis refers to theoretical results describing the limiting behavior of a
function as a value, typically the sample size, tends to infinity.
2 This is the most basic statement of the Central Limit Theorem. We state the theorem more
formally in Section 2.2.2.
figure 1.1 The likelihood/probability of getting two heads in three coin tosses, over various values of θ̂.
This basic result is often used to interpret output from statistical models
as if the observed data are a sample from a population for which the mean
and variance are unknown. We can use a random sample to calculate our
best guesses as to what those population values are. If we have either a large
enough random sample of the population or enough independent, random
samples – as in the American National Election Study or the Eurobarometer
surveys, for example – then we can retrieve good estimates of the population
parameters of interest without having to actually conduct a census of the entire
population. Indeed, most statistical procedures are based on the idea that they
perform well in repeated samples, i.e., they have good sampling distribution
properties.
Observed data in the social sciences frequently fail to conform to a “sample”
in the classical sense. They instead consist of observations on a particular
(nonrandom) selection of units, such as the 50 US states, all villages in
Afghanistan safe for researchers to visit, all the terror events that newspapers
choose to report, or, as in the example that follows, all the data available from
the World Bank on GDP and CO2 emissions from 2012. In such situations,
parameters. In the same way, we argue that the estimators we employ are
good if they produce unbiased estimates of population parameters. Thus we
conceptualize the problem as having estimates that are shifted around by
estimators and sample sizes.
But there is another way to think about all of this, one that is completely
different yet complementary.
In case you were wondering … 1.3 Bias and mean squared error
A statistical estimator is simply a formula or algorithm for calcu-
lating some unknown quantity using observed data. Let T(X) be an
estimator for θ . The bias of T(X), denoted bias(θ ), is
$$\text{bias}(\theta) = E[T(X)] - \theta.$$

The mean squared error, MSE(θ), is given as

$$\text{MSE}(\theta) = E\left[(T(X) - \theta)^2\right] = \text{var}(T(X)) + \text{bias}(\theta)^2.$$
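These definitions are easy to explore by simulation. A sketch (ours, not the book's) using the maximum likelihood estimator of a normal variance, which divides by n and is therefore biased, as Chapter 2 notes:

# Simulate the bias and MSE of the MLE for a normal variance,
# with true sigma^2 = 1 and n = 10.
set.seed(42)
sims <- replicate(5000, {
  x <- rnorm(10)
  mean((x - mean(x))^2)  # the MLE of the variance
})
bias.hat <- mean(sims) - 1           # E[T(X)] - theta; negative here
mse.hat  <- var(sims) + bias.hat^2   # var(T(X)) + bias^2
c(bias = bias.hat, MSE = mse.hat)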
[Figure: scatterplot of 2012 GDP per capita at PPP (log $US) against 2012 CO2 emissions (log kilotons); see the caption below.]
figure 1.3 2012 GDP per capita and CO2 emissions. The prediction equation is
shown as a straight line, with intercept and slope as reported in Table 1.2. The large
solid dot represents the United States and the length of the arrow is its residual value
given the model.
identically distributed (iid) – they follow the same distribution and contain no
dependencies – then we write
$$Y_i \overset{\text{iid}}{\sim} N(\mu_i, \sigma^2). \tag{1.1}$$
Equation 1.1 reads as “Yi is distributed iid normal with mean μi and variance
σ 2 .” When used as a part of a likelihood model, we will adopt the following
notational convention:
Yi ∼ fN (yi ; μi , σ 2 ).
is equivalent to assuming that the errors come from a normal distribution with
mean zero and a fixed, constant variance.
Setting aside, for the moment, the connection between εi and the independent
variable (log per capita GDP), we can further specify the probability distribu-
tion for the outcome variable from Equation 1.2:
$$f_N(\varepsilon_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-\varepsilon_i^2}{2\sigma^2}\right).$$

This yields:

$$f_N(\varepsilon_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right).$$
$$
L = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left(\frac{-(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)
  = (2\pi\sigma^2)^{-n/2} \exp\left(\frac{-1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\right).
$$
The likelihood is a function of the parameters (here, β and σ ) and all of the data
on the dependent and independent variables, i.e., it is the formula for the joint
probability distribution of the sample. With a likelihood function and the data
we can now use the tools of optimization to find the set of parameter values
that maximize the value of this likelihood function.
Before doing this, however, the likelihood function can be simplified quite a
bit. First, because we are interested in maximizing the function, any monotoni-
cally increasing function of the likelihood can serve as the maximand. Since the
logarithmic transformation is monotonic and sums are easier to manage than
products, we take the natural log. Thus, the log-likelihood is:
$$
\begin{aligned}
\log L &= \log\left[(2\pi\sigma^2)^{-n/2} \exp\left(\frac{-1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right)\right] \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 \\
&= -\frac{1}{2}n\log(2\pi) - n\log\sigma - \frac{\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}.
\end{aligned}
$$
Terms that are not functions of parameters to be estimated may be dropped,
since these terms only scale the likelihood while leaving it proportional to the
original. The maximizing arguments are unchanged. Dropping $\frac{1}{2}n\log(2\pi)$ we
have,

$$\log L = -n\log\sigma - \frac{\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}, \tag{1.3}$$
To further simplify matters and because many computer optimization programs
default to minimization, we often use −2 log L,
$$-2\log L = 2n\log\sigma + \frac{\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2}{\sigma^2},$$
and for a fixed or known σ this simplifies to a quantity proportional to the sum
of squared errors:
$$-2\log L \propto \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2. \tag{1.4}$$
In practice, there are two ways to solve the likelihood maximization prob-
lem. First, if an analytic solution is apparent, then we can solve the likelihood
for its extrema directly by taking partial derivatives with respect to each
parameter and setting them to zero, as we did with the coin-flipping example. In
many instances, however, the derivatives of the likelihood function do not have
nice, closed-form solutions. Maximization then occurs via numerical techniques
described briefly below and in more detail in Chapter 4.
In the linear-normal case we can find the MLE using analytical methods.
Nevertheless, it can be useful to plot the likelihood as a function of parameter
values. Figure 1.4 does just that, displaying the likelihood surface for β1 . We
can see that the maximum occurs near 1.06.
library(WDI)  # library for retrieving WDI data
wdi <- WDI(country = "all",
           indicator = c("EN.POP.DNST",         # pop density
                         "EN.ATM.CO2E.KT",      # CO2 emissions
                         "NY.GDP.PCAP.PP.CD"),  # GDPpcPPP
           start = 2012, end = 2012, extra = TRUE, cache = NULL)
wdi <- subset(wdi, region != "Aggregates")  # removing country aggregates
wdi <- na.omit(wdi)  # omit cases with NA; see chapter on missing data
names(wdi)[4:7] <- c("pop.den", "co2.kt", "gdp.pc.ppp", "wb.code")
attach(wdi)
Step 2: Define the log-likelihood function, which can have several different
parameterizations. Here we use log L as expressed in Equation 1.3.
R Code Example 1.3 Step 2: Log-likelihood function
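A minimal sketch of such a function, following Equation 1.3 (the name llnormal and the log-σ parameterization are illustrative choices, not necessarily the book's):

# Log-likelihood for the linear-normal model, per Equation 1.3.
# theta = c(beta0, beta1, log.sigma); optimizing over log(sigma)
# keeps the search unconstrained.
llnormal <- function(theta, y, x) {
  beta0 <- theta[1]
  beta1 <- theta[2]
  sigma <- exp(theta[3])
  -length(y) * log(sigma) - sum((y - beta0 - beta1 * x)^2) / (2 * sigma^2)
}

Step 3 passes this function to an optimizer. A sketch using optim(), creating the mle.fit object used in the post-estimation code below (starting values and settings are ours):

# Maximize the log-likelihood numerically. fnscale = -1 tells optim()
# to maximize; hessian = TRUE returns the Hessian at the optimum.
y <- log(co2.kt)
x <- log(gdp.pc.ppp)
mle.fit <- optim(c(0, 0, 0), llnormal, y = y, x = x, method = "BFGS",
                 hessian = TRUE, control = list(fnscale = -1))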
Step 4: Now, we can calculate all the standard diagnostics. We provide code
for making a table of regression estimates, including standard errors,
z-scores, and p-values. To preview the theory introduced in Chapter 2,
inverting the Hessian matrix will provide the variance-covariance
matrix, the square root of the diagonal of which contains the estimated
standard errors for the parameters. The ratio β̂i /σβ̂i is the z-score (or
asymptotic t-score).
R Code Example 1.5 Step 4: Post-estimation analysis
# Calculate standard BUTON output.
std.errors <- sqrt(-diag(solve(mle.fit$hessian)))
z <- mle.fit$par / std.errors
p.z <- 2 * (1 - pnorm(abs(z)))  # p-values
out.table <- data.frame(Est = mle.fit$par, SE = std.errors, Z = z, pval = p.z)
round(out.table, 2)

    Est   SE     Z pval
1 -0.30 1.21 -0.25 0.80
2  1.06 0.13  8.08 0.00
These results imply that, in the set of countries examined, higher levels
of GDP are associated with higher levels of CO2 emissions. The estimated
coefficient for GDP (β1) is approximately 1.06 (with a standard error of 0.13,
resulting in a t-ratio of 8). This means that for every order-of-magnitude
increase in GDP per capita, there will be an order-of-magnitude increase in
annual CO2 emissions, on average.
in the case with one independent variable and assuming a linear relationship
between the independent variables and the mean parameter, we get:
stochastic component : Yi ∼ f (θi )
systematic component : θi = β0 + β1 xi .
Here, f is some probability distribution or mass function. We might choose the
Gaussian, but it could be a variety of others such as the binomial, Poisson, or
Weibull that we explore in this book.
Once we specify the systematic and stochastic components, we can construct
a likelihood for the data at hand.
Step 1: Express the joint probability of the data. For the case of independent
data and distribution or mass function f with parameter(s) θ , we have:
$$
\begin{aligned}
\Pr(y_1 \mid \theta_1) &= f(y_1; \theta_1) \\
\Pr(y_1, y_2 \mid \theta_1, \theta_2) &= f(y_1; \theta_1) \times f(y_2; \theta_2) \\
&\ \ \vdots \\
\Pr(y_1, \ldots, y_n \mid \theta_1, \ldots, \theta_n) &= \prod_{i=1}^{n} f(y_i; \theta_i)
\end{aligned}
$$
Step 2: Convert the joint probability into a likelihood. Note the constant, h(y),
which reinforces the fact that the likelihood does not have a direct
interpretation as a probability. A likelihood is defined only up to a
multiplicative constant:
$$L(\theta \mid y) = h(y) \times \Pr(y \mid \theta) \propto \Pr(y \mid \theta) = \prod_{i=1}^{n} f(y_i \mid \theta_i)$$
Step 4: Simplify the expression by first taking the log and then eliminating
terms that do not depend on unknown parameters.
$$
\begin{aligned}
\log L(\mu, \sigma^2 \mid y) &= \sum_{i=1}^{n} \log\left[(2\pi\sigma^2)^{-1/2} \exp\left(\frac{-(y_i - \mu_i)^2}{2\sigma^2}\right)\right] \\
&= \sum_{i=1}^{n} \left[-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\big(y_i - (\beta_0 + \beta_1 x_i)\big)^2\right] \\
-2\log L(\mu, \sigma^2 \mid y) &= \sum_{i=1}^{n} \left[\log(2\pi) + \log(\sigma^2) + \frac{1}{\sigma^2}\big(y_i - (\beta_0 + \beta_1 x_i)\big)^2\right]
\end{aligned}
$$
1.6 conclusion
This chapter presented the core ideas for the likelihood approach to statistical
modeling that we explore through the rest of the book. The key innovation in
the likelihood framework is treating the observed data as fixed and asking what
combination of probability model and parameter values are the most likely
to have generated these specific data. We showed that the OLS estimator can
be recast as a maximum likelihood estimator and introduced two examples,
including one example of a likelihood programmed directly into the statistical
package R.
What are the advantages of the MLE approach in the case of ordinary least
squares? None; they are equivalent, and the OLS estimator can be derived
under weaker assumptions. But the OLS approach assumes a linear model and
unbounded, continuous outcome. The linear model is a good one, but the world
around us can be both nonlinear and bounded. Indeed, we had to take logarithms
of CO2 and per capita GDP in order to make the example work, forcing
linearity on our analysis that is not apparent in the untransformed data. It is
a testament to “marketing” that the OLS model is called the ordinary model,
because in many ways it is a restricted, special case. The maximum likelihood
approach permits us to specify nonlinear models quite easily if warranted by
either our theory or data. The flexibility to model categorical and bounded
Applications
A great strength of the likelihood approach is its flexibility. Researchers can
derive and program their own likelihood functions that reflect the specific
research problem at hand. Mebane and Sekhon (2002) and Carrubba and Clark
(2012) are two examples of likelihoods customized or designed for specific
empirical applications. Other custom applications in political science come
from Curtis Signorino (1999; 2002).
Past Work
Several other texts describing likelihood principles applied to the social sciences
have appeared in the past quarter-century. These include King (1989a), Fox
(1997), and Long (1997).
Software Notes
For gentle introductions to the R language and regression, we recommend
Faraway (2004) and more recently Fox and Weisberg (2011) and James E.
Monogan, III (2015). Venables and Ripley (2002) is a canonical R reference
accompanying the MASS library.
2 Theory and Properties of Maximum Likelihood Estimators
Thus, the likelihood has no scale nor any direct interpretation as a probabil-
ity. Rather, we interpret likelihoods relative to one another. In what follows we
suppress the term h(x), as it only adds to notational complexity, and it cannot
be estimated in any case.
Information about θ ∗ comes from both the observed data and our assump-
tions about the data-generating process (DGP), as embodied in the probability
model f (x; θ ). The likelihood function does more than generate values for
tables. It provides a mathematical structure in which to incorporate new data
or make probabilistic claims about yet-to-be-observed events.
As we saw in Chapter 1, likelihoods from independent samples sharing a
DGP are easily combined. If x1 and x2 are independent events governed by the
same f (x; θ ∗ ), then L(θ | x1 , x2 ) = f (x1 ; θ )f (x2 ; θ ), and the log-likelihood is
log L(θ | x1 , x2 ) = log f (x1 ; θ ) + log f (x2 ; θ ).
From Bayes’s Rule we see that the posterior distribution of θ is the prior
times the likelihood, scaled by the marginal probability of the data, x. As the
volume of observed data increases, the information in the data dominate our
prior beliefs and the Bayesian posterior distribution comes to resemble the
likelihood function. If we assume that Pr(θ ) follows a uniform distribution
with support that includes θ̂ , then the mode of the posterior distribution will be
the MLE. Under somewhat more general conditions, the posterior distribution
for θ approximates the asymptotic distribution for the MLE discussed in
Section 2.2.2. Although Bayesian thinking and the analysis of fully Bayesian
models is much more involved, the likelihood framework and the logic of
Bayesian statistics are closely linked. In many situations, a parametric Bayesian
approach and the likelihood framework will yield similar results, even if the
interpretation of these estimates differs.
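In symbols, the relationship described at the start of this paragraph is

$$\Pr(\theta \mid x) = \frac{\Pr(\theta)\, L(\theta \mid x)}{\Pr(x)} \propto \Pr(\theta)\, L(\theta \mid x).$$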
2.1.2 Regularity
The method of maximum likelihood summarizes the information in the data
and likelihood function with θ̂ . To find this maximum we might simply try many
values of θ and pick the one yielding the largest value for L(θ ). This is tedious
and carries no guarantee that the biggest value we found is the biggest that
could be obtained. Calculus lets us find the extrema of functions more quickly
and with greater certainty. The conditions under which we can apply all the
machinery of differential calculus in the likelihood framework are referred to
as “regularity conditions.”
These regularity conditions can be stated in several ways. Some are quite
technical. Figures 1.4 and 1.1 provide some intuition. For a likelihood to be
“regular,” we would like it to be smooth without any kinks or holes (i.e.,
continuous), at least in the neighborhood of θ̂ . In regular problems, we also
require that the MLE not fall on some edge or boundary of the parameter space,
nor does the support of the distribution or mass function for X depend on the
value of θ . Likelihoods with a single maximum, rather than many small, local
hills or plateaus, are easier to work with.
With regularity concerns satisfied, the log-likelihood will be at least twice
differentiable around the MLE. Each of these derivatives of the log-likelihood
is important enough to have a name, so we address each in turn.
2.1.3 Score
The vector of first partial derivatives of a multivariate function is called the
gradient, denoted ∇. The score function is the gradient of the log-likelihood.
Evaluated at a particular point θ , the score function will describe the steepness
of the log-likelihood. If the function is continuous, then at the maximum or
minimum, this vector of first partial derivatives equals the zero vector, 0.
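In symbols, restating the definition just given:

$$S(\theta) \equiv \nabla_\theta \log L(\theta \mid x), \qquad S(\hat\theta) = \mathbf{0} \ \text{at an interior extremum.}$$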
If we imagine that the data we observe are but one realization of all
observable possible data sets given the assumed probability model, f (x; θ ),
then we can consider the score function as a random variable. As a matter
of vocabulary, we refer to the score statistic when the score is considered as a
random variable (with θ fixed at the “true” value for the DGP). In Theorem 2.2
we derive the expected value of the score statistic.
Proof

$$
\begin{aligned}
E[S(\theta)] &= \int_{\mathcal{X}} \frac{\partial}{\partial\theta} \log L(\theta)\, f(x;\theta)\, dx && \text{(score def., expectation)} \\
&= \int_{\mathcal{X}} \frac{\frac{\partial}{\partial\theta} L(\theta)}{L(\theta)}\, f(x;\theta)\, dx && \text{(Chain Rule, derivative of ln)} \\
&= \int_{\mathcal{X}} \frac{\frac{\partial}{\partial\theta} L(\theta)}{f(x;\theta)}\, f(x;\theta)\, dx && \text{(likelihood def.)} \\
&= \int_{\mathcal{X}} \frac{\partial}{\partial\theta} L(\theta)\, dx && \text{(algebra)} \\
&= \frac{\partial}{\partial\theta} \int_{\mathcal{X}} L(\theta)\, dx && \text{(regularity conditions)} \\
&= \frac{\partial}{\partial\theta} \int_{\mathcal{X}} f(x;\theta)\, dx && \text{(likelihood def.)} \\
&= \frac{\partial}{\partial\theta} 1 = 0 && \text{(pdf def., derivative of a constant).}
\end{aligned}
$$
$$\mathcal{I}(\theta) \equiv E\left[S(\theta)S(\theta)^\top\right],$$

where the expectation is taken over possible data X with support $\mathcal{X}$.
There are several equivalent ways of writing the expected information,
including as the negative expected derivative of the score:
Theorem 2.3. Assuming appropriate regularity conditions for f(x; θ),

$$\mathcal{I}(\theta) = \text{var}[S(\theta)] \tag{2.1}$$

$$\phantom{\mathcal{I}(\theta)} = -E\left[\frac{\partial}{\partial\theta} S(\theta)\right], \tag{2.2}$$

where the expectation is taken over possible data X with support $\mathcal{X}$.
Proof We state the proof for the case where θ is single-valued (scalar), but
it readily generalizes to the case where θ is a vector. Noting that var(Y) =
E[Y 2 ] − E[Y]2 , and by applying Theorem 2.2, we obtain Equation (2.1).
To prove Equation (2.2), note that $S(\theta) = \frac{\partial}{\partial\theta}\log f(x;\theta) = \frac{f'(x;\theta)}{f(x;\theta)}$. Then by
the quotient rule,

$$
\frac{\partial}{\partial\theta} S(\theta) = \frac{f''(x;\theta)f(x;\theta) - f'(x;\theta)^2}{f(x;\theta)^2}
= \frac{f''(x;\theta)}{f(x;\theta)} - \left(\frac{f'(x;\theta)}{f(x;\theta)}\right)^2
= \frac{f''(x;\theta)}{f(x;\theta)} - S(\theta)^2.
$$

It follows that

$$
E\left[\frac{\partial}{\partial\theta} S(\theta)\right]
= \int_{\mathcal{X}} \left(\frac{f''(x;\theta)}{f(x;\theta)} - S(\theta)^2\right) f(x;\theta)\, dx
= \int_{\mathcal{X}} f''(x;\theta)\, dx - \int_{\mathcal{X}} S(\theta)^2 f(x;\theta)\, dx
$$
1 The name Hessian refers to the mathematician Ludwig Otto Hesse, not the German mercenaries
contracted by the British Empire.
$$\mathcal{I}(\theta) = -\sum_{i=1}^{n} E\left[H(\theta \mid x_i)\right] = n\,\mathcal{I}(\theta \mid x_i).$$
In other words, the expected Fisher information from a random sample of size
n is simply n times the expected information from a single observation.
2.2.1 Invariance
One frequently useful property of the MLE is that, for any sample size, the
MLE is functionally invariant: if θ̂ is the MLE of θ, then for any function g(·),
the MLE of g(θ) is g(θ̂).

Proof We prove this result supposing that g(·) is injective (this is a simplifying,
though not necessary, assumption). If g(·) is one-to-one, then $L(g^{-1}(g(\theta))) = L(\theta)$.
Then, since θ̂ is the MLE, our previous statement implies that $\hat\theta = g^{-1}\big(\widehat{g(\theta)}\big)$.
Therefore, applying g(·) to both sides, we see that $g(\hat\theta) = \widehat{g(\theta)}$.
Theorem (Law of Large Numbers). Let $X_1, \ldots, X_n$ be independent and identically distributed random
variables, $X_i$, with finite expected value, μ. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then
$\bar{X}_n \xrightarrow{p} \mu$ as $n \to \infty$.
Proof We sketch the proof for the case of scalar θ, but it readily generalizes
to the multiparameter case. First note that the sample likelihood function
converges to the expected log-likelihood by the Law of Large Numbers:

$$\frac{1}{n}\log L(\theta \mid x) = \frac{1}{n}\sum_{i=1}^{n} \log L(\theta \mid x_i) \xrightarrow{p} E\left[\log L(\theta \mid x)\right].$$

We then show that θ* is the maximizer of the expected log-likelihood by
showing that θ* satisfies the first-order conditions (FOC) for a maximum. The
FOC are

$$\frac{\partial}{\partial\theta} E\left[\log L(\theta^* \mid x)\right] = \int_{\mathcal{X}} \frac{f'(x;\theta)}{f(x;\theta)}\, f(x;\theta^*)\, dx = 0,$$

which follows from the regularity conditions and the definition of the log-
likelihood. The FOC are satisfied when θ = θ* by Theorem 2.2.
Proof The proof begins with a Taylor series expansion of the FOC around θ̂ₙ,
where θ₀ is some point between θ̂ₙ and θ*:

$$\frac{1}{n} H(\theta_0) = \frac{1}{n}\sum_{i=1}^{n} H(\theta_0 \mid x_i) \xrightarrow{p} E\left[H(\theta^*)\right] = -\mathcal{I}(\theta^*)$$

$$\frac{1}{\sqrt{n}} S(\theta^*) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} S(\theta^* \mid x_i) \xrightarrow{d} N\left(0, \mathcal{I}(\theta^*)\right).$$
Other Properties
We mention two other properties of the MLE only in passing. First, the
asymptotic distribution of the MLE implies that the MLE is efficient in that it
reaches the Cramér-Rao lower bound. The MLE achieves this asymptotically.
In other words, the MLE attains the smallest possible variance among all
consistent estimators. Equivalently, the MLE achieves (asymptotically) the
smallest mean-squared error of all consistent estimators.
There is no guarantee that the MLE is unbiased. For example, the MLE for
the normal variance is biased. In fact, the MLE is frequently biased in small
samples. But bias is a second-order concern here for three reasons. First, bias
disappears as n increases (Theorem 2.7). In most applications, we are well
away from the point where small-sample bias will be material. Second, bias
in the MLE can be estimated and accounted for, if needed, using the cross-
validation methods discussed in Chapter 5. Third, even if T(X) is unbiased, any
nonlinear transformation of T(X) will be biased; unbiasedness is not robust to
re-parameterization.
on the exact same observed data for both. The only thing that differs between
the likelihood functions in the numerator and denominator of the ratio is the
candidate values of θ . With this likelihood ratio we can assess the relative
strength of evidence in favor of particular θ values.
The likelihood ratio is used to compare “nested” models, i.e., two models
in which one is a restricted or special case of the other. Recall the simple
regression of CO2 on per capita GDP from Section 1.4.1. In that example
θ R = (β0 , β1 , σ 2 ). Suppose we fit a second model, wherein we include
population density as a covariate with associated regression coefficient β2 .
In the second model, we have θ G = (β0 , β1 , β2 , σ 2 ). In the first model, we
implicitly constrained the β2 coefficient to be 0. The first model is thus a
special case of the more general second one. If the restriction is appropriate,
i.e., θ ∗ lies in the more restricted parameter space, then the (log) likelihoods of
the two models should be approximately equal at their respective MLEs. The
likelihood ratio should be approximately one or, equivalently, the difference in
the log-likelihoods should be about 0. For reasons that will be clear shortly, the
conventional definition of the likelihood ratio statistic is
$$LR(\theta_R, \theta_G \mid x) = -2\log\left[\frac{L(\theta_R \mid x)}{L(\theta_G \mid x)}\right].$$
Some presentations reverse the numerator and denominator of the likelihood
ratio, leading to a likelihood ratio statistic without the negative sign.
Proof We sketch the proof for scalar θ . We suppress the subscript n on the
MLEs.
Begin with a second-order Taylor expansion of the log-likelihood around
the MLE:
$$
\begin{aligned}
\log L(\theta_r) &\approx \log L(\hat\theta) + (\hat\theta - \theta_r)S(\hat\theta) - \frac{1}{2}\mathcal{I}(\hat\theta)(\hat\theta - \theta_r)^2 \\
&\approx \log L(\hat\theta) - \frac{1}{2}\mathcal{I}(\hat\theta)(\hat\theta - \theta_r)^2 \qquad \big(\text{by } S(\hat\theta) = 0\big).
\end{aligned}
$$

This implies that the likelihood ratio statistic is

$$
\begin{aligned}
LR &\approx -2\left[\log L(\hat\theta) - \frac{1}{2}\mathcal{I}(\hat\theta)(\hat\theta - \theta_r)^2 - \log L(\hat\theta)\right] \\
&\approx \mathcal{I}(\hat\theta)(\hat\theta - \theta_r)^2 \\
&= \mathcal{I}(\hat\theta)(\hat\theta - \theta^*)^2 \quad \text{under the null hypothesis that } \theta_r = \theta^*.
\end{aligned}
$$

From Theorem 2.9 we know that $\sqrt{n}(\hat\theta - \theta^*) \xrightarrow{d} N\big(0, \mathcal{I}(\theta^*)^{-1}\big)$, so
$\sqrt{n}\,\mathcal{I}(\hat\theta)^{1/2}(\hat\theta - \theta^*) \xrightarrow{d} N(0, 1)$. Therefore $n\,\mathcal{I}(\hat\theta)(\hat\theta - \theta^*)^2 \xrightarrow{d} \chi_1^2$.
This theorem states that the likelihood ratio statistic for two nested models is
distributed χ 2 under the “null” hypothesis, which states that the more restricted
model is the “correct” model. The degrees of freedom of the null distribution
equals the number of parameter restrictions imposed as we go from the general
to restricted model. We can therefore calculate the probability of observing a
likelihood ratio statistic at least as great as the one we observe for our data set
and proposed models. In the CO2 emissions example, we calculate this p-value
as 0.02.
Statistical packages will often report a model χ 2 . This refers to the χ 2 value
for the likelihood ratio comparing the estimated model to the null model – a
model with all coefficients except the intercept constrained to equal 0. This
statistic answers the usually boring question: “Is the model you fit noticeably
better than doing nothing at all?”
A related quantity is model deviance, which compares the model we built
to a “perfect” or “saturated” model (i.e., a model with a parameter for every
observation). Deviance is given by
$$D(x) = -2\left[\log L(\hat\theta_b \mid x) - \log L(\hat\theta_s \mid x)\right],$$
where θ̂_b is the MLE of θ_b, representing the model we built, and θ̂_s is the MLE
of the saturated model. Since this is simply a likelihood ratio, it too follows a
$\chi^2_{n-k}$ distribution, where n denotes the number of observations in our sample
and k denotes the number of parameters in our model.
Statistical packages routinely report “null deviance” and “residual deviance.”
The former is the deviance of the null model, and the latter is the deviance of
the built model. Letting log L(θ̂ 0 ) be the maximized log-likelihood of the null
model and noting that
$$
\begin{aligned}
\text{Residual deviance} - \text{null deviance}
&= -2\left[\log L(\hat\theta_b \mid x) - \log L(\hat\theta_s \mid x)\right] \\
&\qquad - \left(-2\left[\log L(\hat\theta_0 \mid x) - \log L(\hat\theta_s \mid x)\right]\right) \\
&= -2\left[\log L(\hat\theta_b \mid x) - \log L(\hat\theta_0 \mid x)\right],
\end{aligned}
$$
it is easy to see that the difference between null and residual deviances is simply
another way of calculating the “model χ 2 .”
Wald Tests
We also have the Wald test, which is simply the squared difference of the
estimated parameters from some constrained value (typically zero), weighted
by the curvature of the log-likelihood.
Definition 2.9 (Wald test). Given a log-likelihood, log L(θ), and a hypothesized
value θ_h, the Wald test for scalar θ is given as

$$W = \frac{(\hat\theta - \theta_h)^2}{\mathcal{I}(\hat\theta)^{-1}} = \mathcal{I}(\hat\theta)(\hat\theta - \theta_h)^2.$$

If $\theta^* = \theta_h$ then $W \overset{\cdot}{\sim} \chi_1^2$.
For a k-dimensional parameter θ, in which the hypothesized θ_h imposes p ≤
k constraints, the Wald statistic is

$$W = (\hat{\theta} - \theta_h)^\top\, \mathcal{I}(\hat{\theta})\, (\hat{\theta} - \theta_h).$$

Under the maintained hypothesis, $W \overset{\cdot}{\sim} \chi_p^2$.
The Wald test is a generalized version of the standard t-test, which imposes
only one restriction. For scalar θ, we have $\hat\theta \sim N\big(\theta, \mathcal{I}(\hat\theta)^{-1}\big)$, which implies a
t-ratio of $\frac{\hat\theta - \theta_0}{\mathcal{I}(\hat\theta)^{-1/2}}$. The t-ratio is asymptotically standard normal, so the square
of this ratio, the Wald statistic, is asymptotically $\chi_1^2$.
[Figure: likelihood diagnostics, comparing log L(θ̂) and log L(θ₀) along the log-likelihood curve.]
where

$$\Omega = E_g\left[S(\theta)S(\theta)^\top\right]\Big|_{\theta^*}$$

$$\mathcal{I}(\theta^*) = -E_g\left[H(\theta)\right]\Big|_{\theta^*}.$$

We say that $\hat\theta_n \overset{\cdot}{\sim} N\left(\theta^*,\ \mathcal{I}(\theta^*)^{-1}\,\Omega\,\mathcal{I}(\theta^*)^{-1}\right)$.
Theorems 2.11 and 2.9 look nearly identical and their proofs proceed in similar
fashions (i.e., Taylor expansion around the MLE). The key difference is that
in Theorem 2.11 the expectation is taken with respect to all possible data, which
is governed by the unknown distribution or mass function g(x); in Theorem 2.9 we
assumed that we had the distributional assumptions correct. If our assumption
about the DGP is correct and g(x) = f(x), then the term $\mathcal{I}(\theta^*)^{-1}\,\Omega\,\mathcal{I}(\theta^*)^{-1}$
collapses to the conventional inverse Fisher information, $\mathcal{I}(\theta^*)^{-1}$.
In actual estimation we replace the expectations by their corresponding
sample analogues, i.e.,
$$\hat\Omega = \frac{1}{n}\sum_{i=1}^{n} S(\hat\theta \mid x_i)\, S(\hat\theta \mid x_i)^\top$$

$$\hat{\mathcal{I}}(\hat\theta) = -\frac{1}{n}\sum_{i=1}^{n} H_i(\hat\theta) = -\frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial\theta} S(\hat\theta \mid x_i)$$
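These sample analogues are implemented in the sandwich package cited in the Software Notes below. A minimal sketch on simulated data (the data-generating process and model here are ours, for illustration only):

# Robust (sandwich) standard errors for a logit fit to simulated data.
library(sandwich)
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-0.5 + x))  # simulated binary outcome
fit <- glm(y ~ x, family = binomial)
robust.vcov <- sandwich(fit)  # estimates I^(-1) Omega I^(-1)
sqrt(diag(robust.vcov))       # robust standard errors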
2.5 conclusion
We derived the important features of MLEs that we will rely upon in subsequent
chapters. First, we highlighted that parametric models require the assumption
of a specific probability model. These distributional assumptions must be
stated and evaluated. In the subsequent chapters we will present out-of-sample
heuristics from which we can make pragmatic judgments about modeling
assumptions. Once a model is specified, we showed how the MLE, under
general conditions, is asymptotically consistent and normally distributed. The
covariance of the asymptotic distribution is given by the negative inverse of
the expected Hessian, i.e., the inverse Fisher information.
Past Work
Aldrich (1997) provides a history of Fisher’s development and justifications of
maximum likelihood. See Pawitan (2013, ch. 1) and Stigler (2007) for excellent
intellectual histories of the development of the theory and method of maximum
likelihood in the mid-20th century.
Hirotugu Akaike (1981) notes that “AIC is an acronym for ‘an information
criterion’,” despite it often being denoted Akaike’s Information Criterion
(Akaike, 1976, 42).
Advanced Study
Edwards (1992) argues for a fully developed likelihood paradigm for all
scientific inference, focussing on the complete likelihood function rather than
just the MLE. LeCam (1990) collects and summarizes several examples of
how the likelihood approach can go awry by way of establishing “Principle
0: Don’t trust any principle.” See Mayo (2014) with the associated discussion
for a critical treatment of the likelihood principle. More general and rigorous
proofs of several results in this chapter can be found in standard advanced texts
such as Greene (2011) and Cox and Barndorff-Nielsen (1994). General and
rigorous treatments of basic probability concepts, the Law of Large Numbers,
the Central Limit Theorem, and other elementary topics can be found in
Resnick (2014).
The sandwich variance estimator has been extended in numerous ways,
particularly to data that are “clustered” in space or time. See, for example,
Cameron et al. (2008); Cameron and Miller (2015). King and Roberts (2014)
argue that the divergence between robust and conventional standard errors can
42 Theory and Properties of Maximum Likelihood Estimators
be used as a diagnostic tool for model misspecification, but see also Aronow
(2016).
Software Notes
Several R libraries implement various robust and clustered standard error
estimators for a variety of different models and data structures. These include
clusterSEs (Esarey, 2017), pcse (Bailey and Katz, 2011), plm (Croissant
and Millo, 2008), rms (Harrell, Jr., 2017), and sandwich (Zeileis, 2004).
3 Maximum Likelihood for Binary Outcomes
In the left panel of Table 3.1 we see the traditional case-based orientation. On
the right, we display the same data using the grouped binary data orientation.
Responses are reported in the form of the number of successes (yk ) in each of
k covariate classes, out of mk possible successes, such that 0 ≤ yk ≤ mk .
Grouped versus ungrouped is an important distinction even though the
underlying binary data may be identical. Ungrouped (case-based) data has a
natural connection with the Bernoulli distribution, whereas grouped data are
easily described using the closely related binomial distribution. With grouped
data, the normal approximation is more readily useful; asymptotic assumptions
can be based on imagining that m or n approach ∞, although grouped data
become increasingly cumbersome as the number of covariates (and therefore
combinations thereof) increases, especially when covariates are continuously
valued.
This simple example holds another tiny point that will reemerge often in
the analysis of nominal data: more often than not, the data are not actually
recorded as numbers but as discrete categories, or labels, often referred to as
factors. Some software packages deal with this nuance automatically, but often
the analyst must translate these words into integers. For the record, the integer
1 is almost always the code for the occurrence of the event in probability space.
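A minimal sketch of such a translation in R (the labels here mirror the Mroz example used later in this chapter):

# Convert a factor with text labels to a 0/1 integer; 1 codes the
# occurrence of the event.
status <- factor(c("inLF", "notinLF", "inLF"))
y <- as.integer(status == "inLF")
y  # 1 0 1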
force participation, i.e., the decision to seek employment in the formal paid
labor market. For this example we use five of the variables described in Long
(1997, p. 37), as shown in Table 3.2.
Table 3.3

                                         Young Children
In Labor Force?                       0       1       2       3
No                                  231      72      19       3
Yes                                 375      46       7       0
Odds of LFP                        1.62    0.64    0.37    0.00
Risk, relative to 0 young children 1.00    0.63    0.44    0.00
of these two numbers is the odds, ω. The odds of a woman being employed in
these data is:

$$\hat\omega = \frac{\hat{p}}{1 - \hat{p}} = \frac{0.568}{1 - 0.568} = 1.3,$$

which means that it is about 30% more likely that a woman is employed
than not.
Consider the following data from the Mroz study, given in Table 3.3. How
does the presence of young children influence a woman's odds of being a paid
participant in the labor force? Compare the odds of being in the labor force for
women with no young children to those with one young child. Such
calculations are given as:

• The probability of labor force participation (LFP) for a woman with zero
young children is p̂₀ = 375/(375 + 231) = 0.619, which implies odds of
ω̂₀ = 0.619/(1 − 0.619) = 1.62.
• The probability of LFP for a woman with one young child is
p̂₁ = 46/(46 + 72) = 0.39, which implies odds of ω̂₁ = 0.39/(1 − 0.39) = 0.64.
Her risk of LFP relative to a woman with no young children is
0.39/0.619 = 0.63.
• The probability of LFP for a woman with two young children is
p̂₂ = 7/(7 + 19) = 0.27, which implies odds of ω̂₂ = 0.27/(1 − 0.27) = 0.37.
Her risk of LFP relative to a woman with no young children is
0.27/0.619 = 0.44.
• The probability of employment for a woman with three young children is
p̂₃ = 0/3 = 0.0, which implies odds of ω̂₃ = 0/(1 − 0.0) = 0.0. Her risk of
LFP relative to a woman with no young children is 0/0.619 = 0.
This illustrates that the odds of being employed are about 1.6 (to 1) for
women with no young children, whereas these odds fall to about 0.6 for
those with just one child younger than six years of age, and further fall to
approximately 0.4 if there are two young children. A woman with one young
child has a 37% lower probability of being in the labor force than a woman
with no young children.
In some fields, especially medicine, people are interested in the odds ratio.
For example, the odds ratio for being employed given you have zero versus
one young child is simply 1.62/0.64 ≈ 2.5, which means that the odds a woman is
employed if she has no young children are about two and one-half times the odds
of her being employed if she has a single child under six. Similarly, a woman
with one young child is not quite twice as likely to be employed as a woman
with two young children (0.64/0.37 = 1.73). But all these women are enormously
more likely to be employed than a woman with three young children under six.
The odds are notoriously difficult to describe and interpret. There is extensive
empirical work documenting that people's interpretations of risk assessments,
including odds and odds ratios, do not coincide with probability theory.
[Figure 3.1: three diagnostic panels for the linear probability model of LFP, showing observed LFP against predicted values, residuals, and standardized residuals; observation 252 stands out.]
model produces many predictions outside the [0,1] interval. Three predicted
probabilities are below zero, the most extreme at −0.14, indicating that one
woman has a negative fourteen percent chance of being employed, clearly
nonsensical. Similarly, there are five predictions that are greater than one,
including the largest at 1.12.
It is clear that the linear probability model can and will produce results that
are nonsensical and difficult to explain to the secretary of labor should she
ask you how you arrived at a negative probability. Other problems arising
with the LPM appear in the diagnostic plots in Figure 3.1. The left panel
shows the dependent variable plotted against the predicted values of the
dependent variable, along with the line representing the regression line. All
the true values of the dependent variable are at the top and the bottom
of this plot, but the prediction line takes on continuous, linear values both
inside the range [0, 1] and outside. The middle panel shows two clusters of
3.3 The Linear Probability Model 49
residuals, one representing the values when the predicted values are subtracted
from 1, another when predicted values are subtracted from 0. These residuals
clearly cannot be normally distributed, nor do several other conditions for
the linear model remain credible. In particular, the variance of the residuals
around the regression line is not constant but is rather grouped in two clumps
that correspond to the 1s and 0s in the data. The model is heteroskedastic
by construction since the calculated variance of the dependent variable as
estimated depends not only on the values of the independent variables but also
on the estimated values of the parameters (β̂). The bottom line is that the model
produces results that are nonsensical in the domain of prediction, even if the
coefficients may remain unbiased.
In short, the LPM:
• produces predicted probabilities outside the [0, 1] interval;
• has residuals that cannot be normally distributed; and
• is heteroskedastic by construction.
[Figure: the inverse logit function, logit⁻¹(xᵢᵀβ), plotted against the linear predictor xᵢᵀβ, with slope 1/4 marked at the midpoint.]
figure 3.2 The logistic function is nonlinear with range bounded by 0 and 1. It is
nearly linear in the midrange, but is highly nonlinear towards extrema.
$$\text{systematic:}\quad \theta_i \equiv \text{logit}^{-1}\big(x_i^\top\beta\big) = \frac{1}{1 + e^{-x_i^\top\beta}},$$
where for each observation i, Yi is the binary dependent variable, xi is the
k-vector of k − 1 independent variables and a constant, and β is a vector of k
regression parameters to be estimated, as always. Since the mean of a Bernoulli
variable must be between 0 and 1, the inverse logit function is one way to
link the systematic component to the outcome domain. For reference, some
presentations will highlight the inverse logit, while others will present material
in terms of the logit.
The inverse logit transformation is shown graphically in Figure 3.2. The
x-axis represents the value of xi β, called the linear predictor, while the curve
portrays the logistic mapping of these linear predictor values into the [0, 1]
probability interval on the y-axis. The midpoint of the line is at a probability
of 0.5 on the y-axis. The logistic curve is nearly linear around this point with a
slope of approximately 1/4.
Given the systematic and stochastic components of this kind of model, the
joint probability of the data is straightforward:
$$\Pr(y \mid \theta) = \prod_{i=1}^{n} \theta_i^{y_i} (1 - \theta_i)^{1 - y_i}$$
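Taking logs gives the Bernoulli log-likelihood (a standard step, stated here for completeness); the negLL function in the code later in this chapter implements its negative:

$$\log L(\beta \mid y) = \sum_{i=1}^{n} \left[y_i \log\theta_i + (1 - y_i)\log(1 - \theta_i)\right].$$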
3.4.2 Estimation
This particular class of problems is easy to solve by the computation of
numerical derivatives, but it may be helpful to know that the partial derivative
of the log-likelihood with respect to a particular slope parameter, βj , is given by
3.4 The Logit Model, a.k.a. Logistic Regression 53
$$\frac{\partial \log L}{\partial \beta_j} = \sum_{i=1}^{n} (y_i - \theta_i)\, x_{ij}.$$

# getting data
colnames(lfp) <- c("LFP", "young.kids", "school.kids", "age",
                   "college.woman", "college.man", "wage", "income")
lfp$lfpbin <- rep(0, length(lfp$LFP))
lfp$lfpbin[lfp$LFP == "inLF"] <- 1
attach(lfp)
x <- cbind(young.kids, school.kids, age,
           as.numeric(college.woman) - 1, wage)
y <- lfpbin

# simple optimization of logit for Mroz data
binreg <- function(X, y) {
  X <- cbind(1, X)  # 1s for intercept
  negLL <- function(b, X, y) {  # the negative log-likelihood
    p <- as.vector(1 / (1 + exp(-X %*% b)))
    -sum(y * log(p) + (1 - y) * log(1 - p))
  }
  # pass the likelihood to the optimizer
  results <- optim(rep(0, ncol(X)), negLL, hessian = TRUE, method = "BFGS",
                   X = X, y = y)
  list(coefficients = results$par,
       varcovariance = solve(results$hessian),
       deviance = 2 * results$value,
       converged = results$convergence == 0)  # output and convergence check
}
mlebin.fit <- binreg(x, y)
# some results...
round(mlebin.fit$coefficients, 2)
> [1]  2.88 -1.45 -0.09 -0.07  0.61  0.56
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.0970 -1.1148  0.6463  0.9821  2.2055

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)            2.87807    0.62291   4.620 3.83e-06
young.kids            -1.44670    0.19363  -7.471 7.94e-14
school.kids           -0.08883    0.06676  -1.331 0.183336
age                   -0.06796    0.01246  -5.452 4.99e-08
college.womanCollege   0.61112    0.19372   3.155 0.001607
wage                   0.55867    0.14893   3.751 0.000176

(Dispersion parameter for binomial family taken to be 1)

Number of Fisher Scoring iterations: 4
distribution. The default is for glm to return deviance residuals, which give each
observation's contribution to the residual deviance as defined in Section 2.3.
We can see here that the deviance residuals are not centered on 0 and do not
appear to be symmetric.
The main output is the description of the regression coefficients and their
estimated standard errors. Below that R tells us that it fit a canonical Gen-
eralized Linear Model with an assumed dispersion parameter; we take this
up in Chapter 7. The bottom section of the output displays basic model fit
information, including the AIC. Finally, the summary output tells us how many
steps the optimizer needed to arrive at its answer. In this case, it was four; we
go into details in the next chapter.
While the signs and p-values of these estimates are similar to those found
using the LPM, the estimates themselves are quite different. These estimates, as
presented in the table, require more care in interpretation for two basic reasons.
First, the underlying model is nonlinear, so, unlike OLS, the effect of a particular
covariate on the response is not constant across all levels of the independent
variable. To see this, we calculate the marginal effect, or the rate of change of
the outcome variable with respect to a particular independent variable, xk . We
take the derivative of the systematic component with respect to the independent
variable of interest:
$$\frac{\partial E[Y_i]}{\partial x_{ki}} = \frac{\partial \theta_i}{\partial x_{ki}} = \beta_k \frac{\exp(x_i \beta)}{\left[1 + \exp(x_i \beta)\right]^2}.$$
This equation shows that the change in the predicted outcome induced by a
change in xk is reflected not just in the regression coefficient βk . The marginal
effect also depends on the value of xk and the values of all the other covariates
in the model. In other words, a variable’s marginal effect in a logit model is not
constant; it depends on the covariate values at which it is evaluated.
As a first-order approximation, we can use the fact that the logistic curve
is steepest in the middle. Since the slope of the inverse logit is 0.25 at that
point, dividing β̂k by 4 gives an estimate of the maximum difference a one-unit
change in xk can induce in the probability of a success (Gelman and Hill, 2007,
p. 82). Thus, an additional young child reduces the probability of labor force
participation by at most about 36 percentage points, i.e., $-1.45/4 \approx -0.36$.
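Both calculations are easy to reproduce from the fitted coefficients. A minimal sketch, assuming the glm fit summarized above is stored in an object we call logit.fit (a name introduced here for illustration) and using the "typical" covariate values from Table 3.6:

b <- coef(logit.fit)
b["young.kids"] / 4                       # divide-by-4 bound: about -0.36
# marginal effect of young.kids at a "typical" covariate profile
x0 <- c(1, 0, 1, 42.54, 0, 1.10)          # intercept, k5, k618, age, coll, wage
theta0 <- plogis(sum(b * x0))             # inverse logit of the linear predictor
b["young.kids"] * theta0 * (1 - theta0)   # beta_k * exp(xb) / (1 + exp(xb))^2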
Second, the logit model is a linear regression on the log odds. As a result,
the exponentiated regression coefficients are odds ratios; a coefficient greater
than 1 represents an increase in the relative probability of obtaining a 1 in
the dependent variable. An exponentiated coefficient less than 1 represents a
decrease in the probability of, in this case, being employed.
This exponentiation trick can be useful for analysts, but if you want to
confuse someone – say a student, client, or journalist – try describing your
results in terms of either log odds or odds ratios. In some fields, such as
medicine, these are routinely reported, and scholars in these fields seem to have
a firm handle on their meaning, but in general it is better to interpret your
results in terms of the original scales on which the data were measured. To this
end, two different approaches have evolved for interpreting logistic regression
results on the probability scale.
Variable              Central Tendency (full)   No College   College
intercept                     1.00                 1.00        1.00
young kids (k5)               0.00                 0.00        0.00
school kids (k618)            1.00                 1.00        1.00
age                          42.54                43.00       41.50
college (coll)                0.00                 0.00        1.00
wage                          1.10                 0.98        1.40
$$\Pr\left(y = 1 \mid \text{coll} = 0, x_{\neg \text{coll}}\right) = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 \text{k5} + \hat{\beta}_2 \text{k618} + \hat{\beta}_3 \text{age} + \hat{\beta}_4 \cdot 0 + \hat{\beta}_5 \text{wage})}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 \text{k5} + \hat{\beta}_2 \text{k618} + \hat{\beta}_3 \text{age} + \hat{\beta}_4 \cdot 0 + \hat{\beta}_5 \text{wage})}$$
$$\Pr\left(y = 1 \mid \text{coll} = 1, x_{\neg \text{coll}}\right) = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 \text{k5} + \hat{\beta}_2 \text{k618} + \hat{\beta}_3 \text{age} + \hat{\beta}_4 \cdot 1 + \hat{\beta}_5 \text{wage})}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 \text{k5} + \hat{\beta}_2 \text{k618} + \hat{\beta}_3 \text{age} + \hat{\beta}_4 \cdot 1 + \hat{\beta}_5 \text{wage})}$$
Substituting these values into the linear predictor, xi β̂, and then using the
inverse logit transformation, we calculate θ̂i for this “typical” respondent as
0.63. A similar calculation in which the woman attended college produces 0.75.
The difference between these two scenarios is 0.13; the model predicts that
having attended college increases the probability of being in the labor force by
20%, holding all the other measured attributes of the woman at “typical” levels.
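A minimal sketch of this calculation, using the coefficient estimates from the summary output above (logit.fit is again the assumed name for that fitted model):

b <- coef(logit.fit)
x.nocoll <- c(1, 0, 1, 42.54, 0, 1.10)   # "typical" woman, no college
x.coll   <- c(1, 0, 1, 42.54, 1, 1.10)   # same profile, college = 1
plogis(sum(b * x.nocoll))                # approximately 0.63
plogis(sum(b * x.coll))                  # approximately 0.75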
But suppose we want to take account of the fact that women in the sample
who have attended college look different than those who did not on a variety of
dimensions. For example, Table 3.6 shows that college-attending women have
a median wage rate of 1.4, against a median of 0.98 for women who did not
attend. To make a comparison in this case we construct two more scenarios.
In the first the values of the covariates take on those for a “typical” woman
who never attended college, as shown in the second column of Table 3.6.
The model predicts that such a woman has a 0.60 probability of being in the
labor force. The second scenario uses the typical covariate values for college-
attending women, as displayed in the third column of the table. This woman is
predicted to have a 0.79 probability of being in the labor force, for a difference
of 0.19 between these scenarios. The typical woman who attended college in
1973 was 24% more likely to be in the labor force than the typical woman
who did not.
These examples illustrate how the interpretation of nonlinear models requires
the analyst to decide what types of comparisons best illustrate the model’s
implications for the purpose at hand. The structure of the model reinforces
a more general point: there is no single-number summary that communicates
all the model’s interesting insights.
figure 3.3 Plot displaying the 95% confidence bands for the predicted probability of
LFP across different wage rates for women with and without young children. The
estimated relationship between wages and employment probability differs between the
two groups of women.
# the wage values for the scenarios
lwg.range <- seq(from = min(lfp$wage), to = max(lfp$wage), by = .1)
# women w/o young kids
x.lo <- c(1,  # intercept
          0,  # young.kids
          median(lfp$school.kids),
          median(lfp$age),
          0,  # college
          median(lfp$wage))
X.lo <- matrix(x.lo, nrow = length(x.lo), ncol = length(lwg.range))
X.lo[6, ] <- lwg.range  # replacing with different wage values
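# The scenario summaries s.lo and s.hi used below are constructed on a page
# not reproduced here. A minimal sketch of one way to build them, by
# simulating coefficients from the estimated sampling distribution
# (logit.fit and the one-young-child scenario matrix X.hi are assumed names):
library(MASS)  # for mvrnorm
beta.sim <- mvrnorm(1000, mu = coef(logit.fit), Sigma = vcov(logit.fit))
s.lo <- apply(plogis(beta.sim %*% X.lo), 2, quantile,
              probs = c(0.025, 0.5, 0.975))  # rows: lower, median, upper
s.hi <- apply(plogis(beta.sim %*% X.hi), 2, quantile,
              probs = c(0.025, 0.5, 0.975))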
# plotting the results
plot(lwg.range, s.lo[2, ], ylim = c(0, .9), xlab = "wage index",
     ylab = "Predicted Probability of LFP",
     main = "Wages, Children, and LFP", bty = "n",
     col = "white")
polygon(x = c(lwg.range, rev(lwg.range)),  # confidence region
        y = c(s.lo[1, ], rev(s.lo[3, ])),
        col = grey(0.8), border = NA)
polygon(x = c(lwg.range, rev(lwg.range)),  # confidence region
        y = c(s.hi[1, ], rev(s.hi[3, ])),
        col = grey(0.8), border = NA)
lines(lwg.range, s.hi[2, ], lty = 3, lwd = 2)
lines(lwg.range, s.lo[2, ], lwd = 2)
legend(-2, 0.9, legend = c("No Young Children",
                           "One Young Child"), lty = c(1, 3), lwd = 3)
Dependent Variable: Vote "Nay" on Coburn Amendment

                                              (Simple Model)   (Full Model)
Democrat                                       3.120 (0.575)    3.300 (0.853)
Gender (Female)                                                −0.408 (0.977)
Political Science Major in College                              1.160 (0.907)
Number of Top 20 Political Science Programs                     2.080 (0.996)
Number of Top 50 Political Science Programs                     1.020 (0.770)
Total Number of Political Science Programs                     −0.183 (0.433)
Percentage with Advanced Degrees                                2.350 (1.250)
Number of Amendment Petitioners                                −0.011 (0.016)
Number of NSF Grants 2008                                       0.157 (0.345)
Years to Next Election                                          0.486 (0.229)
Member of Labor HHS Subcommittee                                1.390 (0.968)
Seniority                                                      −0.0002 (0.041)
Constant                                      −0.802 (0.334)   −3.780 (1.600)
n                                                 98               98
log L                                         −42.800          −30.600
AIC                                            89.700           87.300
BIC                                            94.800          121.000
[A coefficient plot for the full model appears here, displaying point estimates for Seniority, Gender (Female), Democrat, and the other covariates on a scale from −2 to 4.]
Definition 3.1 (Brier Score). The Brier score (Brier, 1950) for a binary
classifier such as logistic regression is defined as
$$B_b \equiv \frac{1}{n} \sum_{i=1}^{n} \left(\hat{\theta}_i - y_i\right)^2,$$
where the predicted probabilities are θ̂i and the observed binary outcomes
are given as yi .
Lower values reflect better predictions. The Brier score for the logit model
containing only partisanship is 0.14 against 0.09 for the more complicated
model.
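The calculation is a one-line function in R; a minimal sketch, where theta.hat is a vector of predicted probabilities and y the observed outcomes:

brier <- function(theta.hat, y) mean((theta.hat - y)^2)  # lower is better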
ROC Curves
The Receiver Operating Characteristic (ROC) Curve (don’t ask about the
name) builds on the idea of comparing correct predictions against false pos-
itives. But rather than choosing one particular threshold, ROC curves display
this comparison for all thresholds between 0 and 1. ROC curves are based on
the idea that the relative costs of mis-predicting a failure (false negative) versus
mis-predicting a success (false positive) can vary depending on the problem
at hand. More formally, let the cost of a false negative relative to the cost of a
false positive be denoted C. The prediction rule that minimizes total expected
cost is to predict an event (ŷ = 1) when the predicted probability exceeds the
threshold t = 1/(1 + C), and ŷ = 0 otherwise. Hence, if false positives and false
negatives are equally costly, then C = 1 and t = 1/2. If, however, mis-predicting
an event is twice as costly as mis-predicting the absence of an event, then C = 2
and the cutoff would be at t = 1/3.
of course, a policy problem. The threshold should be established in terms of
the human and physical costs of mis-predicting, say, the absence of war versus
mis-predicting a war. The ROC curve is a way of summarizing the ratio of the
rate of false positives to the rate of false negatives over the entire range of t.
ROC curves plot the true positive rate (percent of actual successes correctly
predicted, for some fixed threshold) against the false positive rate (percent of
false positives out of actual failures).1 As an example, the classification table
below cross-tabulates predicted and observed outcomes at a fixed threshold:

              Observed 0   Observed 1
Predicted 0       27            8
Predicted 1        7           56

[Figure 3.5 appears here: ROC curves for the two models, plotting the true
positive rate against the false positive rate.]

The threshold values, t, that generate
each particular point in the curve are not visible in the plot. Better models
will have low false positive rates when they also have high true positive rates,
whereas worse-performing models will have ROC curves that are close to the
diagonal.
Figure 3.5 illustrates the ROC curves generated by both models in Table 3.8.
The simpler model (the gray line) has one step, corresponding to the fact that
the underlying model has only one (binary, categorical) predictor. ROC curves
from competing models can be used heuristically to compare the fit of various
model specifications, though this is a bit of an art. In this example the more
complicated model, as reported in Uscinski and Klofstad's original paper, is
better at predicting in-sample for all values of t.
1 The false positive rate is also 1 – specificity, where specificity is the percent of failures correctly
predicted as such.
Some prefer the “area under the ROC curve” (AUC) as a single-number sum-
mary of model performance as derived from the ROC; we report these values
in the caption. These single number summaries are unfortunate since they lend
a false sense of precision and certainty in comparing model performance while
discarding the fact that model performance may differ across t. Two models
could conceivably provide very different trade-offs between false positives and
false negatives relevant to decision makers. This would be masked by simple
comparison of AUCs.
Separation Plots
Many models in the social sciences will not have any predicted probabilities
that are greater than 0.5. This is not necessarily a symptom of a poorly fitting
model. Rather, it is driven, in part, by the underlying frequency of events. When
the observed number of events is small relative to the number of trials, as is
the case with international conflict, we should expect small absolute predicted
probabilities. What really concerns us is the model’s ability to distinguish
more likely events from less likely ones, even if they all have small absolute
probabilities.
To visually compare different models’ abilities to usefully discriminate
between cases, we can sort the observations by their predicted probabilities
and then compare this sorting to actual, observed events. This is exactly the
strategy followed by Greenhill et al. (2011) in developing their separation plot.
In these plots, dark vertical bars are observed events, in this case nay votes.
Light bars are nonevents. If the model perfectly discriminated between events
and nonevents, then successes would cluster to the right of the plot and failures
to the left; the plot would appear as two starkly defined color blocks. A model
that performs poorly would appear as a set of randomly distributed vertical
lines.
Figure 3.6 displays separation plots for each of the models in Table 3.8. The
predicted probabilities are ranked from low to high (shown as a black line), a
red bar indicates a vote against the amendment, and a cream colored bar is a
vote in favor. Consistent with the results from the ROC plot, we see that the
full model is better at discriminating nay votes from the other senators.
Model Interpretation
There is some evidence that the more complicated full model of Uscinski and
Klofstad is better at predicting Senate votes on the Coburn Amendment. How
can we interpret these findings? One way is to use the exponentiated logit
coefficients, as presented in Table 3.10. These values represent the odds ratios
for different values of the covariate. This makes some sense in the context of
partisanship; the odds of a Democrat voting nay are 27 times those of a
Republican. But it is harder to interpret what these values say about continuous
predictors. Odds ratios are a tough sell.
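Odds ratios and their confidence bounds, like those in Table 3.10, are easy to compute from a fitted glm object; a minimal sketch (full.fit is an assumed name for the full model):

exp(cbind(OR = coef(full.fit), confint(full.fit)))  # exponentiated coefficients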
figure 3.6 Separation plots for the partisan-only and full models of US Senate
voting on the Coburn Amendment.
2 There were two independent senators at the time, Joe Lieberman of Connecticut and Bernie
Sanders of Vermont, both of whom voted nay.
OR 2.5 % 97.5 %
Democrat 27.00 6.08 190.48
Gender (Female) 0.66 0.10 4.89
Political Science Major in College 3.20 0.57 21.44
Number of Top 20 Political Science Programs 8.01 1.42 89.54
Number of Top 50 Political Science Programs 2.78 0.65 14.47
Total Number of Political Science Programs 0.83 0.34 1.92
Percentage with Advanced Degrees 10.53 1.21 181.70
Number of Amendment Petitioners 0.99 0.96 1.02
Number of NSF Grants, 2008 1.17 0.57 2.30
Years before Next Election 1.63 1.06 2.66
Member of Labor HHS Subcommittee 4.00 0.65 31.22
Seniority 1.00 0.92 1.08
figure 3.7 The predicted probability of voting "nay" on the Coburn Amendment as
reelection approaches for Democrat and non-Democrat US senators. All other
covariates are held at their central tendencies. The vertical bars are 95% confidence
bands.
for both a Democrat in that year as well as for the “typical” non-Democrat
who is facing an election within the year. In other words, the “effect” of
proximity to an election appears weak and only visible among non-Democrats.
This insight would not be discernible simply by examining BUTON or odds
ratios.
where the last step follows from the symmetry of the normal distribution, i.e.,
$\Phi(x) = 1 - \Phi(-x)$. Predicted probabilities and statistical inference under a
probit specification are nearly identical to those from a logit. The coefficient
values will differ; it turns out that logit coefficients divided by 1.6 should give
a close approximation to the estimated coefficients from a probit regression.
Our approach for generating meaningful interpretations of model estimates
and implications fits the probit case just as easily.
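A quick way to see the rescaling is to fit both models and compare; a minimal sketch, assuming logit.fit and probit.fit are glm objects fit to the same data:

cbind(logit.rescaled = coef(logit.fit) / 1.6, probit = coef(probit.fit))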
[A figure appears here plotting $\theta$ against $x_i^{T}\beta$ for the cloglog, probit, and logit link functions.]
variable such that “nays” are 0 would have no effect on the estimates reported
in Table 3.8 beyond flipping their signs. Nothing else, including model fit, would
change. In a cloglog model, however, this symmetry does not hold.
Dependent Variable:
Civil Conflict Onset
logit probit het. probit cloglog log-log
(1) (2) (3) (4) (5)
Military 0.96 0.45 0.17 0.87 −0.32
(0.28) (0.14) (0.07) (0.27) (0.11)
Personal 0.15 0.09 0.07 0.14 −0.09
(0.65) (0.32) (0.10) (0.62) (0.24)
Party −0.45 −0.22 −0.05 −0.40 0.14
(0.67) (0.30) (0.10) (0.65) (0.21)
GDP (lagged) −0.30 −0.15 −0.04 −0.27 0.11
(0.17) (0.08) (0.03) (0.17) (0.05)
Population (lagged) 0.53 0.24 0.13 0.51 −0.16
(0.12) (0.06) (0.03) (0.10) (0.05)
Peace years −0.45 −0.23 −0.06 −0.42 0.17
(0.18) (0.09) (0.04) (0.16) (0.07)
n 2575 2575 2575 2575 2575
log L −239 −240 −236 −238 −242
AIC 498 500 495 496 503
BIC 557 558 559 555 562
Note: Following Cook and Savun, standard errors are clustered by country and cubic
splines estimated but omitted from the table. Lagged population is the only variance-term
covariate for the heteroskedastic probit model.
we repeat their logit analysis, while the second column fits the same model as
a probit. In the third column we fit a heteroskedastic probit using population
as the only covariate in the model for σ . In the last two columns we fit two
cloglog models. The first of these continues the coding, assigning conflict onset
a 1 and nonconflict a 0, whereas in the “log-log” model we reverse the coding
of the dependent variable.
Unsurprisingly, different distributional assumptions produce different
numerical estimates for the regression parameters. But the raw values reported
in the table turn out to yield very similar descriptions of the DGP, as
indicated by the log-likelihood, AIC, and BIC values. This is further confirmed
in Figure 3.9, which shows that model fit is virtually identical across these
alternatives. The ratios of coefficient estimates to their standard errors are
also similar across models. Note, however, that the parameter estimates for
the cloglog and log-log models are not opposites, reflecting the asymmetry in
the assumed distribution. In this example several model variations produce
effectively the same answer.
[Figure 3.9 appears here: ROC curves for the five models, which are nearly indistinguishable; the y-axis is the true positive rate.]
Applications
Logit and probit models are widely employed across the social sciences in both
observational and experimental work. Among many recent examples, Einstein
and Glick (2017) look at how a constituent’s race affects whether and how
public housing bureaucrats respond to information requests. Baturo (2017)
examines the predictors of political leaders’ activities after leaving office.
Past Work
Joseph Berkson was a statistician at the Mayo Clinic, best known
for his pioneering studies of the link between tobacco smoking and lung
cancer in the 1960s. Berkson's work on logistic regression – as an explicit
correction to using a normal distribution for studying probabilities – began
in 1944 (Berkson, 1944). Subsequently, Berkson produced several important
works relating to this topic (Berkson, 1946, 1953, 1955). However, work on
using sigmoid curves (akin to logistic curves) dates back to the work of Bliss
(1935), who worked with Fisher at the Galton Laboratory, University College,
London; R. A. Fisher followed Karl Pearson as the Galton Professor at the UCL
in 1934.
Glasgow and Alvarez (2008) provide a recent summary of likelihood-based
models for discrete choice, along with their extensions. Alvarez and Brehm
(1995, 2002) is an early derivation and application of the heteroskedastic probit
in the study of American public opinion. Nagler (1994) proposes the scobit
model, which scales the logistic distribution so as not to require
an assumption of symmetry around 0. A challenge with the scobit model is that
there is often insufficient information in the covariates to cleanly estimate both
the regression weights and the ancillary parameter governing symmetry of the
distribution.
Software Notes
We used the arm library's coefplot command (Gelman and Su, 2016), but
the coefplot package (Lander, 2016) extends this functionality in a number of ways.
The verification (NCAR – Research Applications Laboratory, 2015) and
scoring (Merkle and Steyvers, 2013) libraries can be used to calculate the
Brier score. The separationplot library (Greenhill et al., 2015) calculates
and displays separation plots. glmx (Zeileis et al., 2015) provides a way to
estimate probits (and other models) allowing for heteroskedasticity.
The R package Zelig (Imai et al., 2008, 2009) and its progeny have
implemented a general syntax for estimating several classes of models, including
logit, probit, and many others. For nonprogrammers, it facilitates
the calculation and display of quantities of interest and associated uncertainty
under different scenarios.
4
Implementing MLE
$$f(x) = \begin{cases} \frac{1}{b-a} & x \in [a, b] \\ 0 & \text{otherwise,} \end{cases}$$
with $E[X] = \frac{a+b}{2}$ and $\operatorname{var}(X) = \frac{(b-a)^2}{12}$.
4.1.1 Examples
Irregular Likelihood Surface
Let x = (x1 , . . . , xn ) be n independent draws from Unif[−θ , θ ]. In this example
the parameter, θ , determines the support for the probability model, something
we recognized as a violation of regularity conditions. Nevertheless, we can
construct a likelihood using our standard procedure:
$$L(\theta \mid \mathbf{x}) = \prod_{i=1}^{n} \frac{1}{2\theta} = (2\theta)^{-n}.$$
This likelihood assigns positive probability for any x such that xi ∈ [−θ , θ ], i ∈
$\{1, \ldots, n\}$ and 0 otherwise. Figure 4.1 plots the likelihood function for $\theta$.

[Figure 4.1 appears here: the likelihood is 0 for $\theta < \max\{|x_i|\}$ and declines monotonically in $\theta$ above that point.]
figure 4.1 The likelihood surface for $L(\theta \mid \mathbf{x})$ where the probability model is
$X \sim \text{Unif}(-\theta, \theta)$.
[Figure 4.2 appears here: a likelihood surface with a flat plateau, over which many values of $\theta$ are equally likely.]
figure 4.2 The likelihood surface for $L(\theta \mid \mathbf{x})$ where the probability model is
$X \sim \text{Unif}(\theta - 2, \theta + 2)$.
results. Applying the theoretical results from Chapter 2 to such output can lead
to erroneous conclusions.
For a parameter of interest $\theta_1$ and nuisance parameters $\theta_2$, the profile likelihood is
$$L_p(\theta_1) \equiv \max_{\theta_2} L(\theta_1, \theta_2) \equiv L(\theta_1, \hat{\theta}_2(\theta_1)).$$
In other words, the profile likelihood function returns, for each value of θ 1 ,
the maximum value of the likelihood function for the remaining parameters,
θ 2 . Maximizing the profile likelihood will return the MLE for θ 1 . We can plot
profile likelihoods, construct likelihood ratios, and even build likelihood-based
confidence intervals. We have already seen a profile likelihood once: Figure 1.4
displays the profile likelihood for a regression coefficient treating the other
coefficients and the variance as nuisance parameters.
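Profiling is easy to do by brute force. A minimal sketch, assuming a logit of y on two covariates in a hypothetical data frame dat: fix the coefficient on x1 at each point on a grid via an offset, maximize over the remaining parameters, and record the resulting log-likelihood.

b1.grid <- seq(-1, 2, by = 0.05)
profLL <- sapply(b1.grid, function(b1) {
  fit <- glm(y ~ x2 + offset(b1 * x1), family = binomial, data = dat)
  as.numeric(logLik(fit))  # likelihood maximized over intercept and x2
})
plot(b1.grid, profLL, type = "l",
     xlab = "beta 1", ylab = "profile log-likelihood")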
4.2.1 Newton-Raphson
The Newton-Raphson (N-R) algorithm is a general procedure for finding the
root(s) of equations of the form g(θ ) = 0. For scalar θ , given an initial guess
θ0 , the algorithm updates by relying on a linear (Taylor series) expansion of the
target function:
$$\theta_1 = \theta_0 - \frac{g(\theta_0)}{g'(\theta_0)}.$$
In maximum likelihood problems the target is the score function, $S(\theta)$, whose
derivative is the Hessian, $H(\theta)$. One way to stabilize convergence is by
adjusting the size of the step by some factor, $\delta \in (0, 1]$. The algorithm is now:
$$\theta_1 = \theta_0 - \delta H(\theta_0)^{-1} S(\theta_0).$$
Other stabilization procedures can be used in place of or in addition to adjusting
the step size. Most software packages allow users to adjust the step size in
addition to supplying different starting values and setting the tolerance.
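A bare-bones illustration in R (a hypothetical example introduced here, not from the text): Newton-Raphson for the MLE of a Poisson rate $\lambda$, where the answer is known to be the sample mean.

y <- c(2, 4, 3, 7, 1)
score   <- function(l)  sum(y) / l - length(y)  # d logL / d lambda
hessian <- function(l) -sum(y) / l^2            # second derivative
lambda <- 1       # starting value
delta  <- 1       # step-size factor
for (i in 1:50) {
  step <- delta * score(lambda) / hessian(lambda)
  lambda <- lambda - step
  if (abs(step) < 1e-8) break  # tolerance check
}
lambda  # converges to mean(y) = 3.4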
Given a regular likelihood there should be no problem inverting the Hessian
matrix, but in some situations $H(\theta_0)$ may fail to be negative definite or may be
difficult to invert, especially if we are far from the MLE. Replacing the observed Fisher
information, $-H(\theta) = I(\theta)$, in the updating step with the expected Fisher
information, $\mathcal{I}(\theta)$, can solve this problem, in addition to speeding up con-
vergence. Optimizers that replace observed with expected Fisher information
are said to follow Fisher scoring. In the context of the Generalized Linear
Model (see Chapter 7), where observed and expected Fisher information are
the same, Fisher scoring is automatic. The default optimizer for R’s glm
function, Iteratively (re)Weighted Least Squares (IWLS), uses the Fisher scoring
approach. Standard summary output from glm reports the number of Fisher
scoring (i.e., N-R) iterations.
EM’s “trick” is to use the value of θ at the current iteration, combined with
the likelihood function, to fill in the missing values in the observed data. EM’s
tremendous popularity is testament to its flexibility and the ubiquity of missing
data problems. EM has drawbacks, however: It can be slow to converge; it does
not readily produce uncertainty estimates around θ̂ ; and it can get trapped at
local maxima.
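As a toy illustration (a hypothetical example, not from the text), consider EM for a two-component normal mixture with known unit variances, estimating the component means and the mixing weight:

y <- c(rnorm(100, 0), rnorm(100, 4))  # simulated data
mu <- c(-1, 1); p <- 0.5              # starting values
for (iter in 1:200) {
  # E-step: posterior probability each observation came from component 2
  d1 <- (1 - p) * dnorm(y, mu[1])
  d2 <- p * dnorm(y, mu[2])
  w <- d2 / (d1 + d2)
  # M-step: update parameters given the "filled-in" memberships
  mu <- c(weighted.mean(y, 1 - w), weighted.mean(y, w))
  p <- mean(w)
}
c(mu, p)  # should approach 0, 4, and 0.5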
where the maxima (minima) will be located. If you can supply only first-order
conditions, then the BHHH algorithm is a good choice; BFGS and DFP are
good algorithms when you don’t have good starting values and don’t know the
formulas for the Hessians. One strategy is to start with a coarse algorithm
and then move serially to better ones, carrying along the final estimates
from one run as starting values for the next and supplying the calculated
Hessian from the preceding run to the call for the current run. A sketch of
this chaining appears below.
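A minimal sketch, reusing the negLL function and intercept-augmented design matrix from the binreg example in Chapter 3 (assumed to be in scope here as negLL and X, with outcome y):

pass1 <- optim(rep(0, ncol(X)), negLL, method = "Nelder-Mead",
               X = X, y = y, control = list(maxit = 2000))  # coarse first pass
pass2 <- optim(pass1$par, negLL, method = "BFGS", hessian = TRUE,
               X = X, y = y)  # refine, starting from the first-pass estimates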
different combinations of values for covariates are equally likely, given the data.
Flat likelihoods also arise in more complex likelihood functions, especially ones
with multiple modes. The flat area of the likelihood could trap the optimizer
even when there is a unique and identifiable maximum. The best solution is to
plot the likelihood surface in a variety of directions, if at all possible. From here
it may make sense to give the optimizer different starting values (away from the
plateau) or increase the step size (so the optimizer can jump off the flat area).
not generate a warning that perfect separation was detected, leading to material
differences in estimates across computing platforms.
A third place to look for the explanation is in the default choice of opti-
mization algorithm, especially for more complicated or customized likelihood
problems. Different optimizers with different default tolerances, etc., can
generate different answers. Whether this is an indication of a problem or
simple rounding differences requires further investigation by the analyst. If
different optimization algorithms are giving wildly different answers, then this
is a symptom of problems described above (multiple modes, etc.). A closer
inspection of the likelihood surface is in order.
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
is, in essence, too good of a predictor. One estimation strategy is the so-called
Firth logistic regression (Firth, 1993), which is a penalized likelihood approach.
Bell and Miller (2015) use Firth regression and show that Rauchhaus’s finding
disappears.
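A minimal sketch using the logistf package (discussed in the Software Notes below), with a hypothetical data frame dat and covariates x1 and x2:

library(logistf)
firth.fit <- logistf(y ~ x1 + x2, data = dat)  # penalized (Firth) logit
summary(firth.fit)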
4.6 conclusion
Estimating models using maximum likelihood is often fast and painless. But one
of the strengths of the likelihood approach is its flexibility in accommodating
more complicated data structures and custom-designed models. As a result,
some understanding of the topography of a likelihood function and of the details
of numerical optimization is valuable.
Applications
Perfect separation is a common problem in political science data; recent
examples include Ahlquist (2010a); Barrilleaux and Rainey (2014); and Mares
(2015, ch. 9). Chalmers (2017) uses both rare events and Firth logit in modeling
banks’ decisions to lobby the Basel Committee.
Past Work
Heinze and Schemper (2002) provide an applied discussion in support of
Firth regression. Zorn (2005) provides an applied, political-science-focused
discussion of perfect separation problems.
Advanced Study
Pawitan (2013) provides an extended discussion of profile likelihoods and
likelihood-based confidence intervals and inference.
Mebane and Sekhon (1998, 2011) develop and implement a flexible opti-
mization algorithm that has seen some use in the numerical optimization of
more complicated likelihood functions.
When looking at separation and rare events, Gelman et al. (2008) take a
Bayesian approach and propose a proper but uninformative Cauchy prior over
the regression coefficients. See Rainey (2016) for more on the limitations of
Firth regression and the importance of priors when using Bayesian methods to
address perfect separation.
Kordas (2006) uses a binary quantile regression framework for modeling
unbalanced and rare events binary data. Wang and Dey (2010) discuss the use
of the Generalized Extreme Value distribution for modeling rare events using a
more flexible, asymmetric link function.
Software Notes
The R library ProfileLikelihood (Choi, 2011) calculates, plots, and
constructs likelihood-based confidence intervals from profile likelihoods for
specified parameters for many commonly used models.
The R libraries maxLik (Henningsen and Toomet, 2011) and bbmle
(Bolker and R Development Core Team, 2016) provide wrapper functions
and easier access to a number of R’s numerical optimization algorithms
for maximum likelihood estimation. rgenoud implements the Mebane and
Sekhon GENOUD algorithm.
Penalized and Firth regression are implemented in logistf (Heinze and
Ploner, 2016) and brglm (Kosmidis, 2017). King and Zeng’s rare events logit
model is implemented within the Zelig library (Imai et al., 2009). Gelman
et al.'s Bayesian approach to separation and rare events is implemented in the
bayesglm() function in the arm library (Gelman and Su, 2016).
part ii
MODEL EVALUATION AND INTERPRETATION
5
Model Evaluation and Selection
5.1.1 BUTON
BUTONs are ubiquitous. Several appear in this book; Table 5.2 later in
this chapter is a good example. This table satisfies at least one basic idea
about presenting your research: science should be transparent; procedures
and results should be widely available. Of course, it is important to also
share the data, so that these results may be replicated by other scholars in
different laboratories around the world, using different computers and different
programs. In addition to these kinds of standard numerical displays, many
scholars will include a variety of useful model diagnostics and “goodness-of-fit”
statistics such as likelihood ratios, R2 , BIC, F- and other Wald tests. All these
pieces of information are conditional on the sample used to fit the model; they
tell us little about the extent to which the model is highly tuned to a particular
data set.
BUTON are rarely compelling when trying to convince readers of a
particular model’s benefits. Why? Here are three easy exercises to illustrate
the problem:
1. Think of your favorite empirical study. Write down a coefficient and a
standard error from that model. On paper. Without looking. Put the
answers here:
$\hat{\beta}$: ______ ; $\sigma_{\hat{\beta}}$: ______
2. Okay, try this one. Write down the estimated coefficient from any article
you read in the past week. Put your answer here: $\hat{\beta}$: ______
3. What was the estimated intercept for the model from last week's
homework? Answer: $\hat{\beta}_0$: ______
All those coefficients, all those standard errors. Like so many through the years,
they are forgotten. This suggests that having a table of numbers somewhere
for the careful scholar to review is important, but presenting a large table of
numbers is unlikely to be compelling or memorable. The current publishing
norm is to present tables of regression output, explaining it as you go.
Nevertheless, this is frequently not the best option for making your analysis
stick in your readers’ minds. Nor is it necessarily a good strategy for your own
model checking. It seems a sad waste of effort to reduce hard modeling work to
simple tabular summaries, especially when the estimated models contain all the
necessary components to build a simulation of the process you began studying
in the first place. We want to take advantage of the models’ richness to explore
• Tables are best for cataloging and documenting your results. Fill them with
details for those carefully studying your work. Think of them as entries in
the scientific record.
• BUTON are not well suited for quickly transmitting the crux of your findings
in the body of an article or in a presentation to a wide audience.
• Tables should facilitate precise, analytical comparisons.
• Comparisons should flow from left to right.
• There should be enough white space to allow the eyes to construct focused
comparisons easily.
• There should be no unnecessary rules (i.e., lines) that separate columns or
rows within the table.
• Tables are most useful for reporting data: counts of things, or percentages.
• Numbers should be right justified to a common number of decimal places to
facilitate comparisons.
• Tables should present information only as precisely as necessary; entries
should reflect a reasonable degree of realism in the accuracy of measurement.
This means 8% or 8.3% is generally better than 8.34214%. Items that yield
extremely large (9.061e+57) or small (5.061e-57) quantities should be
rescaled or, in the latter case, simply called 0.
• Rows and columns should be informatively and adequately labeled with
meaningful English-language text, in groups or hierarchies if necessary.
• Entries in the table should be organized in substantively meaningful ways,
not alphabetically.
• Give the table an informative title and footnotes to detail information inside
the tabular display.
Readers tend to be faster and more accurate in making comparisons when using graphs
compared to tables (Feliciano et al., 1963). Subsequent recall of relationships
or trends tends to be better when presented graphically. Maps (or map-like
visualizations) are particularly memorable (Saket et al., 2015).
But presenting figures alone is not the answer, either (Gelman, 2011).
Academic researchers (much less lay audiences or policy makers) routinely
misinterpret confidence intervals of the sort displayed in Figure 3.7. Researchers
can also produce graphics that are information-sparse, communicating rela-
tively little compared to a table containing the same information. For example,
Mutz and Reeves (2005) use experiments to study the impact of televised
incivility upon political trust. They use bar charts to display their findings.
These charts appear as Figure 3 in their article, commanding a prominent place
and appearing in color in the journal’s digital edition. The figure’s number
is apt because their display contains just three pieces of information. Three
comparisons of interest appear in 31 square inches; the entire page is 62 square
inches. Compared to displays such as Figure 3.3 this is a low content-to-
space ratio.
In designing compelling graphical displays, Tufte (1992) has excellent advice.
There are
myriad examples in the literature that discuss in detail the statistical significance
of estimated regression coefficients in different models but fail to comment on
the comparative fit of these same models. Indeed, we observe authors arguing
for or against models that are practically indistinguishable from one another
as models.
This practice of fitting models based on theory (perhaps post hoc) and then
filtering results based on the p-values for some covariates is problematic. There
is a well-established literature that points away from the uncritical use of p-
values in model selection for three reasons. First, p-values do not have the
same interpretation after measurement and model selection as they do ex ante.
Second, any statement of magnitude is conditional on the model employed, as
is any claim to statistical significance. Identifying “significance,” substantive or
otherwise, therefore requires that authors first justify their preferred models
and the comparison scenarios. Only then can we meaningfully turn to the
magnitude of the difference implied by the chosen models, covariates, and
scenarios. Third, using p-values ignores the issue of model uncertainty. Any
statement about parameter values is conditional on the model and its underly-
ing assumptions. If we are uncertain about which of a variety of possible models
is most appropriate, then using p-values to justify selection is nonsensical.
Theory, by itself, is a weak justification for preferring a particular model
specification, especially if one of the researchers’ goals is to “test” that exact
theory. Rather, evaluating model performance can be viewed as an integral
part of the research enterprise: a good and useful theory should lead to better
prediction. If the data support the theoretical claim, then the theoretically
motivated model specification should outperform feasible (and simpler) com-
petitors. Unfortunately, standard practice often provides no explicit reason to
prefer the models presented compared to competitors; p-values on regression
coefficients do not help in making this determination. We argue that out-of-
sample prediction is a sensible, flexible, and powerful method for adjudicating
between competing models and justifying model selection.
actual value in the test set. Good models will have good performance in the
training and test data sets.
More formally, suppose we observe our outcome of interest, y, and covariates
X for a set of n units. We can partition our data into our training set, S, and
our test set, V, such that $S \cup V = \{1, \ldots, n\}$ and $S \cap V = \emptyset$. We also have a model
for $Y_i$, denoted $M(x_i; \theta)$, which is a function of the covariates and parameters,
θ . The term θ̂ S is the parameter estimate based on S, the observations in the
training set.
For continuous Y we denote a prediction based on $M(x_i; \hat{\theta}_S)$ as $\hat{y}_i$. The most
commonly used loss function for continuous Y is squared error loss:
$$\text{Loss}(y_i, M(x_i; \hat{\theta}_S)) = (y_i - \hat{y}_i)^2.$$
For a categorical outcome with G categories, let $\hat{M}_g$ denote the model's
predicted probability for category g and $1_g(y_i)$ indicate whether observation i
actually falls in category g. One common loss function is absolute error:
$$\text{Loss}(y_i, M(x_i; \hat{\theta}_S)) = \sum_{g=1}^{G} \left| \hat{M}_g - 1_g(y_i) \right|.$$
Also commonly used is the bounded loss function, which takes on a value of
unity whenever $\hat{y}_i \neq y_i$ and 0 otherwise. The predicted category, $\hat{y}_i$, is typically
the category with the largest predicted probability: $\hat{y}_i = \arg\max_g \hat{M}_g$. Finally,
there is deviance, given as
$$\text{Loss}(y_i, M(x_i; \hat{\theta}_S)) = -2 \sum_{g=1}^{G} 1_g(y_i) \log \hat{M}_g.$$
The quantity of interest is the expected prediction error, or Err in the notation
of Hastie et al. (2008):
$$\text{Err} = E\left[\text{Loss}\left(Y, M(X; \hat{\theta}_S)\right)\right],$$
where expectations are taken over all that is random (partition of training and
test set, etc.).
Cross-Validation
The question naturally arises as to where the test and training sets come from in
real-world applications. In a data-rich environment we might consider actually
withholding some of the data for later use as the test set. Or we might expect a
new sample to arrive later in time. But both of these are relatively uncommon
in the social sciences. Cross-validation is a way to use the data we do have as
both training and test sets, just not at the same time.
Initial work on cross-validation followed the thinking of Seymour Geisser,
who believed that the inferential framework of hypothesis testing was mislead-
ing. Instead, he believed that using prediction-based tools would lead to the
selection of better, i.e., more useful models, even if these models were not the
“true” models.
More specifically, k-fold cross validation divides the data randomly into
k disjoint subsets (or folds) of approximately equal size.3 The division into
subsets must be independent of all covariates for estimating generalization
error. Each of the k subsets will serve as the test set for the model fit using the
data in the remaining k − 1 subsets as the training set. For each observation,
we calculate the prediction error based on the predicted value generated by
the model fit to the training data. More formally, if κ(i) is the fold containing
observation i and −κ(i) is its complement, then the k-fold cross-validation
estimate of Err is given by
$$\text{Err}_{cv}(M, y, X) = \frac{1}{n} \sum_{i=1}^{n} \text{Loss}\left(y_i, M(x_i; \hat{\theta}_{-\kappa(i)})\right).$$
Alternatively we can write the k-fold cross-validation estimate as the average
of prediction errors within each of the k folds:
$$CV_j(M, y, X) = \frac{1}{|\kappa_j|} \sum_{i \in \kappa_j} \text{Loss}\left(y_i, M(x_i; \hat{\theta}_{-\kappa(i)})\right)$$
$$\text{Err}_{cv}(M, y, X) = \frac{1}{k} \sum_{j=1}^{k} CV_j(M, y, X),$$
where $\kappa_j$ denotes the set of observations i in each fold $j \in \{1, 2, \ldots, k\}$, and $|\kappa_j|$
denotes the cardinality, or number of observations, of this set.
What about k? One choice is “leave-one-out” cross-validation, in which
k = n. Leave-one-out has some nice properties but has higher variance and
is computationally more expensive than setting k to some smaller value. Shao
(1993) shows that leave-one-out cross-validation will not lead to the selection
of the “true” model as n → ∞ but leaving out a larger number of observations
will – if we are willing to entertain the notion of a “true” model. As a result,
cross-validation setting k = 5 or 10 is fairly common. Under such a decision
there may be some bias in our estimate of Err, but the bias is upwards. As
n grows large the distinction becomes less relevant. In comparing models,
3 Note that we are temporarily departing from our notation convention in other chapters, where
k was used to denote the number of covariates in a model. Here we follow the literature on
cross-validation and use k to denote the number of “folds.”
the important consideration is using the same k for all the cross-validation
estimates. The exact value of k is less important.
We clearly prefer models with lower prediction error. How big of a difference
in prediction error is "big enough?" Currently, a general description of the
distribution of the cross-validation estimator does not exist. But we can estimate
the empirical variance of the cross-validation Err:
$$\widehat{\text{var}}\left[\text{Err}_{cv}\right] = \frac{1}{k} \text{var}\left[CV_1, \ldots, CV_k\right].$$
The square root of this quantity is the standard error of the cross-validation
estimate of prediction error. Hastie et al. (2008) suggest the “one standard
error rule” in which we select the most parsimonious model whose cross-
validation-estimated prediction error is within one standard deviation of the
model with the smallest prediction error. Put another way, if a simpler model’s
prediction error falls within one standard deviation of a more complicated
model’s prediction error then the simpler model is to be preferred.
When should cross-validation occur, and how does it relate to model building
more generally? Hastie et al. (2008) argue that “In general, with a multi-step
modeling procedure, cross-validation must be applied to the entire sequence of
modeling steps. In particular, samples must be ‘left out’ before any selection or
filtering steps are applied.” (p. 245–249). It is important to note, however, that
they are discussing cross-validation in the context of machine learning, where
there are a very large number of predictors and little to inform model selection
(e.g., some genomics applications). Most social science applications, on the
other hand, present arguments justifying the inclusion of specific predictors or
decisions about functional form (e.g., “interaction terms”). In such situations,
some model selection has already been accomplished. Cross-validation can
then be used to compare competing models and justify model choices without
necessarily building models from scratch for each of the k folds.
Example of Cross-Validation
We look to Fearon and Laitin's (2003) classic article on civil wars for an
example. We display the BUTON for a reestimation of their logistic regression
model in Table 5.1.
For exposition we conduct a two-fold cross-validation. We randomly split
the data into two sets denoted κ1 and κ2 . We first fit the model using just
the data in κ1 . Using those estimated coefficients, along with covariate data,
we produce an in-sample predicted probability for κ1 and an out-of-sample
predicted probability for the cases in κ2 . We then refit the model, reversing
the roles for κ1 and κ2 , producing both in-sample and out-of-sample predicted
probabilities for each observation. From here we can construct a variety of
displays and undertake various calculations summarizing the differences, if any.
A ROC plot summarizing in-sample and out-of-sample predictive performance
is one example, shown in Figure 5.1. The model’s out-of-sample predictive
[Table 5.1 appears here: coefficient estimates ($\beta$) and standard errors ($\sigma_{\hat{\beta}}$) for the reestimated Fearon and Laitin model.]
[Figure 5.1 appears here: in-sample and out-of-sample ROC curves for the Fearon and Laitin model, plotting the true positive rate against the false positive rate.]
library(pROC)
flmdw <- read.csv("flmdw.csv")
infitset <- sample(rownames(flmdw), size = dim(flmdw)[[1]]/2)  # divide the sample
totalset <- rownames(flmdw)
intestset <- setdiff(totalset, infitset)
fl.trainset <- flmdw[infitset, ]
fl.testset <- flmdw[intestset, ]
# estimate
training.fit <- glm(as.factor(onset) ~ warl + gdpenl + lpopl1 +
                      lmtnest + ncontig + Oil + nwstate + instab + polity2l +
                      ethfrac + relfrac, family = binomial(link = logit),
                    data = fl.trainset)
pi.train <- predict(training.fit, type = "response")  # in-sample predictions
pi.test <- predict(training.fit, type = "response",
                   newdata = fl.testset)  # out-of-sample predictions
# plot
par(bty = "n", las = 1)
plot.roc(flmdw$onset, flmdw$is.fit, xlab = "False positive rate",
         ylab = "True positive rate", lwd = 3,
         legacy.axes = TRUE, se = TRUE)
plot.roc(flmdw$onset, flmdw$oos.fit, add = TRUE, lty = 3, lwd = 3)
legend("topleft", legend = c("In-sample", "Out-of-sample"),
       lty = c(1, 3), bty = "n")
Bootstrap Estimation
The bootstrap differs from cross-validation, although both rely on randomly
sampling the data we have to build up distributional insights for complicated
problems. Bootstrapping is an estimation method, whereas cross-validation is
typically a post-estimation tool for formal evaluation.
Bootstrap methods involve repeatedly estimating a statistic of interest on a
series of random subsamples from the observed data. Importantly, bootstrapping
relies on sampling with replacement, whereas cross-validation partitions the
data into disjoint subsets. Thus, bootstrap methods aim to build up distribu-
tional information about the statistic of interest (calculating standard errors,
for example).
Suppose we have a data set with 1,000 observations, and we are interested
in the model $Y \sim f(y; \theta)$. We could find bootstrap estimates of the standard
errors around $\hat{\theta}$ by randomly selecting 1,000 observations (with replacement),
estimating the regression model, and storing the estimate of $\hat{\theta}$. Repeating
this procedure many times builds up a distribution for $\hat{\theta}$. This distribution
is treated as the sampling distribution, because it is. The goal is to derive
robust estimates of population parameters when inference to the population
is especially problematic or complicated.
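A minimal sketch for a logit, assuming a hypothetical data frame dat and model formula f:

B <- 1000
boot.est <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)  # resample rows with replacement
  coef(glm(f, family = binomial, data = dat[idx, ]))
})
apply(boot.est, 1, sd)  # bootstrap standard errors for each coefficient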
Jackknife
The jackknife is an even older resampling technique designed to build up sam-
pling information about a statistic. The jackknife involves iteratively removing
each data point and calculating the statistic on the n−1 remaining observations
and then averaging across all these values. The jackknife is similar to leave-
one-out cross-validation in that it serially leaves a single observation out of the
estimation and then computes an average of the statistics being examined from
the n−1 jackknifed subsample statistics. The variance of the statistics calculated
on these resamples provides an estimate of the variance of the original sample
statistic, one that is robust in small samples as well.
are labeled as having received the stimulus and still retrieve approximately the
same outcome distributions as what was observed. In “shuffling” the label, the
same proportion of the subject pool is labeled as “stimulated”; we merely alter
which subjects those are. In other words, we are sampling without
replacement.
In a permutation test, the researcher actually generates all $\binom{n}{s}$ distinct ways of
assigning stimulus to s subjects out of n, constructing the distribution of the
statistic of interest under the sharp null. We can then determine exactly where
in this distribution our observed data fall, generating exact p-values.
Permutation tests can become cumbersome in even moderate samples.
For example, a study with 30 subjects, half of whom receive a stimulus,
has 155,117,520 different combinations. In order to determine whether our
observed data are “surprising,” we do not necessarily need to construct all
possible permutations. We can simply do many of them. When inference is
based on such a random sample of all possible combinations, we refer to a
randomization test.
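A minimal sketch of a randomization test for a difference in means, assuming a hypothetical outcome vector y and binary treatment indicator w:

obs <- mean(y[w == 1]) - mean(y[w == 0])  # observed treatment effect
perm <- replicate(10000, {
  w.s <- sample(w)  # shuffle the treatment labels (sampling without replacement)
  mean(y[w.s == 1]) - mean(y[w.s == 0])
})
mean(abs(perm) >= abs(obs))  # two-sided randomization p-value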
where i and j index countries and t indexes time. Interest centers on the trading
pair or “dyad,” ij. In this example, all models treat dyad-years as conditionally
independent. The dependent variable, Y, is averaged bilateral trade; X is a
matrix of possibly time-varying covariates, including year indicator variables.
The parameters $\beta$ and $\sigma_{ij}^2$ are to-be-estimated quantities. The subscripts on
$\sigma^2$ indicate the commonly recognized problem that repeated measurements
on the same dyad should show some dependence; this is most commonly
addressed by using a sandwich estimator and “clustering” standard errors at
the dyad level.
We begin by replicating the Rose (2004) and Tomz et al. (2007) (TGR)
analysis. For the sake of direct comparison, we use the same slate of covariates
as Rose and TGR and the data set provided on Tomz’s website. In addition
to indicators for dyad-level GATT/WTO involvement, X includes the log great
circle distances between countries, the log product of real GDP, the log product
of real GDP per capita, the log product of land area, and indicators for
colonial ties, involvement in regional trade agreements, shared currency, and
the Generalized System of Preferences, shared language, shared borders, shared
colonizing powers, whether i ever colonized j, whether i was ever a territory of
j, the number of landlocked countries in the dyad, and the number of island
nations in the dyad.
little reason to believe that, among these competitors, the models including
GATT/WTO covariates should be privileged over those that do not.
5.3 conclusion
This chapter provided some tools for model selection and evaluation, in
contrast to current practice, in which model selection is often implicit and
small p-values are prized. Model selection is a broad topic, and we have only
scratched the surface. But the key point is that we cannot declare victory simply
because we fit a model with “significant” results. It is easy to overfit the data and
build a model that only describes the data already in hand. That is almost never
the goal, because the researcher already has that data and can make models
arbitrarily close to perfect with those data. The issue is whether the estimated
model will be useful for additional data.
We must also compare models to one another and justify our preferred
specifications prior to making inference about any parameters. Out-of-sample
prediction is a powerful tool for annealing results against overfitting. Cross-
validation is the most common way of conducting out-of-sample tests.
Most statistical software packages contain functionality for out-of-sample
prediction, but it is often worthwhile to design predictive exercises to speak
directly to questions of substantive interest. Existing software’s default settings
for loss function or method of dividing the data may not be best for a
particular application. In certain circumstances we may want to see how a
model works in specific kinds of cases, not necessarily in all of them. By dividing
our sample into different sets and using a cross-validation strategy, we can
probe the dependencies between the model and the data. If it is possible to
keep some data isolated from the estimation process altogether, then all the
better.
We are not, however, arguing for a pure data-mining approach, absent
substantive knowledge and theoretical reflections. A good theory should lead
us to specify models that predict better, but better-predicting models do not
necessarily reflect a “true” or even causal set of relationships. We must also
be careful in constructing out-of-sample prediction exercises when there are
dependencies in the data (e.g., temporal correlation) or when we have reason
to believe that the fundamental processes at work may have changed. As a
result, the prediction heuristic is general, powerful, simple to understand, and
relatively easy to build in a computer. But it will not solve all our model-building
problems nor will it obviate the need for careful reflection on how the data were
obtained.
Applications
Hill and Jones (2014) use cross-validation to systematically evaluate a variety of
competing empirical models of government repression. Grimmer and Stewart
(2013) show how cross-validation and other out-of-sample methods are critical
to the burgeoning field of machine learning, especially in the context of text
analysis. Titiunik and Feher (2017) use randomization inference in examining
the effects of term limits in the Arkansas legislature. Ho and Imai (2006) use
randomization inference in the context of California's complicated candidate-
ordering rules.
Previous Work
On the misinterpretation of confidence intervals and p-values, see Belia et al.
(2005); Cumming et al. (2004); Hoekstra et al. (2014).
See Singal (2015) for a detailed description of the Science retraction of the
LaCour and Green study. Important recent articles about reproducibility in
the social sciences have begun to appear (Benoit et al., 2016; Laitin and Reich,
2017; Miguel et al., 2014). Many political science journals now require publicly
visible data repositories and more (Bueno de Mesquita et al., 2003; DA-RT,
2015; Gleditsch et al., 2003; King, 1995).
Many current recommendations on research reproducibility stem from
Knuth’s invention of literate programming (1984), which was a way of inte-
grating textual documentation with computer programs. Gentleman and Tem-
ple Lang (2007) expanded this idea to statistical programming, and recently
Xie (2015) further updated these ideas with the use of markdown and pandoc
(MacFarlane, 2013).
Regarding the WTO-trade dispute, Park (2012) and Imai and Tingley
(2012) revisit the Goldstein et al. (2007) findings in the context of other
methodological discussions. Both show the GATT/WTO finding to be fragile.
Ward et al. (2013) dispute the assumption of dyadic conditional independence
in gravity models of international trade, arguing for models that incorporate
higher-order network dependencies in the data.
Advanced Study
On the interpretation and (mis)use of p-values in model selection, see Freedman
(1983); Gill (1999); Raftery (1995); Rozeboom (1960).
Model selection need not imply that we choose one “winner.” Rather, in a
Bayesian framework, we can average across models (Bartels, 1997; Raftery,
1995). This approach has received renewed interest in political science (Mont-
gomery et al., 2012a,b; Nyhan and Montgomery, 2010).
Hastie et al. (2016) is the canonical text for cross-validation and applied
machine learning. Arlot and Celisse (2010) provide a recent review of the
state of the art. Stone (1977) shows that choosing models based on leave-
one-out cross validation is asymptotically equivalent to minimizing the AIC,
whereas Shao (1997) links k-fold cross-validation to the BIC. Hastie et al.
(2008) observe that leave-one-out cross validation is approximately unbiased
as an estimator of the expected prediction error. Markatou et al. (2005) present
some inferential approaches for cross-validation results in the linear regression
case. Efron and Tibshirani (1998) give a detailed treatment of the bootstrap
and jackknife procedures; see also Davison and Hinkley (1997). Gerber and
Green (2012) provide extensive discussion of randomization and permutation
inference in political science.
Software Notes
There are many cross-validation and related routines in R. The cv.glm func-
tion in the boot library (Canty and Ripley, 2016) produces cross-validation
estimates for many of the models explored in this book. One disadvantage of
this implementation is that it simply returns another single number summary,
which, while informative, can be improved upon. The crossval function in
the bootstrap package (Tibshirani and Leisch, 2017) requires some user
manipulation but has more flexibility. Both boot and bootstrap enable
bootstrap and jackknife resampling. cvTools (Alfons, 2012) and caret
(Kuhn, 2016) contain cross-validation functionality as well. The ri (Aronow
and Samii, 2012) package enables randomization and permutation inference.
6
Inference and Interpretation
The claim of statistical significance is shorthand for saying “if βf = 0.0, and
we were to repeatedly generate new independent, random samples and calculate
a β̂f for each, then we should see β̂f values at least as large as the one we just
calculated less than 5% of the time.” You can see why we have developed a
shorthand phrase. Alternatively, the author might construct a 95% confidence
interval around β̂f , which has a similar interpretation.
1 Permutation tests and randomization inference, discussed in Chapter 5, exploit this exact
property.
                                          (1)        (2)
Labor endowment                        −0.116     −0.500
                                       (0.175)    (0.175)
World trade                            −0.020     −0.027
                                       (0.012)    (0.012)
Labor endowment × world trade                      0.014
                                                  (0.006)
Global % democracies                    0.020      0.018
                                       (0.004)    (0.004)
Neighborhood % democracies              0.470      0.467
                                       (0.202)    (0.202)
Prior democratic failure                0.324      0.312
                                       (0.061)    (0.061)
Communist                              −0.620     −0.616
                                       (0.227)    (0.227)
Gold Standard                           0.171      0.105
                                       (0.146)    (0.146)
Interwar                               −0.163     −0.241
                                       (0.235)    (0.235)
Post–Bretton Woods                      0.394      0.419
                                       (0.216)    (0.216)
Neighborhood democratic transition      0.386      0.373
                                       (0.133)    (0.133)
n                                       8,347      8,347
log L                                    −621       −619
AIC                                     1,279      1,278
BIC                                     1,406      1,419
Note: Estimated model is a dynamic probit; interaction terms with the
lagged dependent variable are omitted for simplicity, as is the constant
term. Robust standard errors are reported, following Ahlquist and
Wibbels (2012).
2 Ahlquist and Wibbels estimate a dynamic probit model that accounts for transitions out of
democracy as well. We omit those terms from the table here for simplicity; they are readily
available in the original paper or with the data and code accompanying this volume.
Looking at in-sample performance using the AIC and BIC, they find that the
models with and without the interaction term perform almost identically,
providing little reason to prefer the more-complicated version. Nevertheless,
a standard form of inference might proceed to look at the point estimate and
standard error of the endowment × trade term and notice that 0.014/0.006 ≈
2.3, which implies a (two-sided) p-value of about 0.03. What knowledge claims
are we to make with this result?
conditional on fixed covariate values and the estimated model parameters. The
difference between the two is that the predicted value incorporates fundamental
uncertainty, whereas the expected value does not.
Examples will help. Recall the linear regression of log CO2 on log per capita
GDP, reported in Table 1.2. In this model we estimated that
$$Y_i \sim f_N(\mu_i,\, 2.112^2)$$
$$\mu_i = -0.08 + 1.04\, X_i.$$
Suppose we are interested in the model's predictions of India's CO2 emissions
if its per capita GDP were 10% larger than in 2012, i.e., if $x_i = \log(1.1 \times 4{,}921.84)$. The expected amount of CO2 is then $\exp(1.04 \times \log(1.1 \times 4{,}921.84) - 0.08) = \exp(8.86) = 6{,}915.3$ kT. The predicted amount, however, will
take account of the fundamental uncertainty, as represented by the normal
distribution with variance estimated at $2.112^2$. To do this, we can take a
draw from $\mathcal{N}(8.86, 2.112^2)$. One such draw is 6.71; $\exp(6.71) = 821$, which
represents one predicted value.
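A minimal sketch of this simulation, using only the quantities reported above:

mu <- 1.04 * log(1.1 * 4921.84) - 0.08   # systematic component, about 8.86
sigma <- 2.112                           # estimated standard deviation
set.seed(1)
y.star <- rnorm(1000, mean = mu, sd = sigma)  # draws incorporating fundamental uncertainty
exp(y.star[1])   # one predicted value of CO2, in kT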
As a second example, recall the exercise in Chapter 3 in which we estimated
the predicted probability of a woman being in the paid labor force when she
had a college degree, holding the other covariates at their central tendencies.
Based on a logistic regression model we calculated this expected value as
$\operatorname{logit}^{-1}\bigl(\mathbf{x}_{\neg\text{coll}}\hat{\boldsymbol\beta}\,\big|_{\text{coll}=1}\bigr) = 0.75$. A predicted value in this case must be either 0
or 1. To generate a predicted value that accounts for fundamental uncertainty,
we would sample from a Bernoulli distribution with θ = 0.75. Or flip an
appropriately weighted coin.
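The analogous sketch for the binary case, using the expected value from the text:

set.seed(1)
rbinom(10, size = 1, prob = 0.75)   # ten predicted outcomes, each 0 or 1
mean(rbinom(1e4, 1, 0.75))          # the average converges back to about 0.75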
Exploring how expected values change under different values of covariates
helps interpret the systematic component of the model we fit. This is frequently
of primary interest. And, in general, we will not generate a single predicted
value; we will take advantage of our computing power and generate many, the
average of which will necessarily converge on the expected value. Nevertheless,
the predicted values are useful for understanding how much fundamental
uncertainty remains after we have fit our model. Does fundamental uncertainty
still swamp whatever systematic relationship we have? If so, this is usually
reason for both some humility in our claims and a need for more research,
including gathering more data.
6.2.2 Scenarios
In calculating our expected and predicted quantities, we must choose specific
values for the covariates in the model. Since our model posits a description
of the data-generating process, these vectors of covariate values represent
scenarios, or possible situations that we are interested in. These scenarios are
sometimes referred to as counterfactuals, since they provide an answer to the
question: "What would we expect to happen if an independent variable took
on a different value?" The further the
scenarios are from the support of the data used to estimate the model, the more
model dependent the implied predictions become. As our modeling assumptions
increasingly determine our predictions, our uncertainty about these “extreme
counterfactuals” is understated since the calculated confidence bands fail to
incorporate our uncertainty about the model itself. As model uncertainty comes
to dominate, our inferences about model predictions become increasingly
tenuous.
The difference between interpolation and extrapolation provides a relatively
intuitive criterion for whether our counterfactuals deviate too far from our
experience. To define these terms we must first understand the convex hull
of our data. The convex hull is the smallest convex set that can contain all
the data points. The easiest way to get a feel for the hull is to visualize it,
as in Figure 6.1. Statisticians have long defined interpolation as a way of
constructing new data points that are inside the convex hull of the data, whereas
extrapolation involves going outside the hull and necessarily further from our
recorded experience.
Once we move out of two dimensions, calculating and visualizing convex
hulls becomes difficult, but software exists to assist us.
Our perspective is that predictions involving scenarios within the convex
hull of the data, i.e., interpolation, are better for describing a model’s impli-
cations. But pressing policy problems may require that we pose questions
beyond the realm of our recorded experience, i.e., our data. Should we
disregard the data we do have and the models we can fit in these situa-
tions? Probably not. Nevertheless, an important first step in making such
from which we can easily calculate the standard error of the marginal effect.
Several observations emerge from these expressions:
These calculations for the standard error of the marginal effect only apply to
linear regression. But the observations echo points raised in Chapter 3 about
marginal effects in a logit context. This means that our strategy of specifying
scenarios of interest, simulating from the model’s sampling distribution, and
then constructing displays of the model’s predicted consequences encompasses
the interpretation of interactive, conditional, and other relationships that are
nonlinear in the covariates. We do not need to derive marginal effects (and
variance) expressions for each specific class of models; we can simulate them
directly. But our discussion of model validation and selection adds another
important observation: before engaging in model interpretation and hypoth-
esis testing, analysts proposing models with conditional, nonlinear, or other
complicated functional relationships have a special burden to show that their
more-complicated models outperform their simpler cousins.
labor-scarce and the dashed line depicting the labor-abundant autocracy. The
shaded regions represent the 95% confidence intervals around the predictions.
If the data were consistent with the literature’s predictions, the plot would
look like an “X,” with the risk of transition increasing in world trade for
the labor-abundant scenario, and the reverse for the labor-scarce. While the
risk of a transition decreases for labor-scarce autocracies as trade increases,
there is no discernible relationship between trade and democratization in labor-
abundant autocracies. The confidence intervals overlap substantially for all
values of trade, making it difficult to sustain the claim that there is an important
conditional relationship at work here, notwithstanding the small p-value on
the interaction term. Had the authors simply declared victory based on theory
and a small p-value on an interaction term, erroneous inference could have
occurred.
[figure: predicted probability of democratic transition (vertical axis, 0.00–0.08) across levels of world trade (horizontal axis, 15–30) for the labor-scarce and labor-abundant scenarios]
6.3 conclusion
Conventional statistical inference focuses on deciding whether particular model
parameters are sufficiently far away from an arbitrary “null” value, based on
estimation uncertainty and a particular model of data sampling. While useful,
this mode of reasoning can be misleading or miss the point entirely, especially
when the data do not conform to an actual sample from a population or a
researcher-controlled experiment. Moreover, in many models the parameters do
not have ready interpretations applicable to the substantive question at hand,
meaning statements of statistical significance, as the primary mode of inference,
do not have immediate meaning to audiences.
A more readily accessible form of inference explores a model’s implications
by using substantively relevant scenarios to compare predicted outcomes under
different conditions. Model interpretation is almost always most effective when
communicated on the scale of the dependent variable, whether as an expected
or predicted value. By incorporating (and displaying) estimation uncertainty
around these implications we can develop a richer understanding of both
the size of a model’s predicted “effects” as well as the level of uncertainty
around them.
Applications
Shih et al. (2012) provide an excellent interpretation and presentation of
conditional effects from interaction terms in a custom-designed model for
rank data. They explicitly consider the convex hull in their interpretive
scenarios.
Previous Work
King et al. (2000) discuss the construction and display of meaningful quantities
of interest for model interpretation.
Ai and Norton (2003); Berry et al. (2010, 2012); Brambor et al. (2006);
Braumoeller (2004); and Kam and Franzese (2007) all discuss the inclusion
and interpretation of multiplicative interaction terms in the linear
predictor.
Advanced Study
Cox (2006) provides a detailed discussion of the theory, practice, and history
of statistical inference.
Software Notes
Tomz et al. (2001) developed the stand-alone CLARIFY package for calculating
quantities of interest and estimation uncertainty around them. This functional-
ity has been incorporated into R’s Zelig library.
There are several R packages for calculating convex hulls. The basic chull
function only handles two-dimensional data. Stoll et al. (2014) developed
WhatIf that allows us to compare proposed scenarios to the convex hull of our
data. The geometry package (Habel et al., 2015) also contains functionality
for calculating the convex hull in more than two dimensions.
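For the two-dimensional case, a minimal sketch using base R, with data simulated purely for illustration:

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
hull <- chull(X)          # row indices of the points on the convex hull
plot(X)
polygon(X[hull, ])        # draw the hull boundary
# geometry::convhulln() handles three or more dimensions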
part iii
7.3.1 GLMs in R
The syntax of the R function glm is built on the theory of the GLM. In
particular, glm requires that the user:
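In practice this amounts to supplying a model formula for the linear predictor, an exponential-family distribution, and (optionally) a link function inside the family argument. A minimal sketch with hypothetical data and variable names:

# hypothetical data frame dat: binary outcome y, covariates x1 and x2
fit <- glm(y ~ x1 + x2,
           family = binomial(link = "logit"),   # distribution and link
           data = dat)
summary(fit)   # coefficient table plus the dispersion parameter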
As we have just seen, in the normal model the dispersion parameter corresponds
to $\sigma^2$. Thus $\hat\sigma^2 = 0.22$ in the Mroz LPM example. The square root of this
quantity for a linear-normal GLM corresponds to the standard error of the
regression. Table 3.4 reports a regression standard error of 0.46, which is
approximately $\sqrt{0.2155109}$. In fitting the logit model for the Mroz example
the code chunk in Section 3.4.2 shows that R reports the dispersion parameter
fixed at 1, as the binomial GLM requires.
$$\mathbb{E}[Y_i] = \mu_i(\boldsymbol\beta) = g^{-1}(\mathbf{x}_i\boldsymbol\beta), \qquad \operatorname{var}(Y_i) = \phi V(\mu_i),$$

$$\hat\phi = \frac{1}{n-k}\sum_{i=1}^{n}\frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)}. \tag{7.3}$$
estimates relies on the functional connection between the mean and variance
that the exponential family provides. What differs between MLE and quasi-
likelihood is the covariance matrix and therefore the standard errors. The chief
benefit of a quasilikelihood approach is the ability to partially relax some of the
distributional assumptions inherent in a fully-specified likelihood. For example,
quasilikelihood allows us to account for overdispersion in binary or count data.
But the quasilikelihood does not calculate an actual likelihood, so likelihood-
based quantities like the AIC and BIC as well as likelihood-based tests are not
available.
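As a quick check, the Pearson-based estimator in Equation 7.3 is one line in R; this sketch assumes a fitted Poisson glm object called fit (hypothetical):

# Pearson estimate of the dispersion parameter (Equation 7.3)
phi.hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi.hat   # values well above 1 suggest overdispersion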
7.5 conclusion
This brief chapter introduced the concept and notation of the Generalized Lin-
ear Model, something we have already seen in the context of linear and logistic
regression. The GLM provides a unified way of specifying and estimating a
variety of models with different distributional assumptions. So long as we are
hypothesizing probability models that rely on distributions in the exponential
family, we can decompose our model building into the specification of a linear
predictor term – covariates and regression parameters – and the link function
that maps the linear predictor into the expected value.
Subsequent chapters in this part of the book introduce commonly used
GLMs for outcomes that are integer counts as well as ordered and unordered
categorical variables. Other GLMs exist for outcomes that are strictly positive,
negative, or bounded, as well as many others, a testament to the flexibility of
the exponential family of distributions. But it is also important to recognize that
GLMs are not the only ways we can apply the method of maximum likelihood.
We take up more complicated modeling tools in the last part of the book.
Advanced Study
McCullagh and Nelder (1989) is the canonical citation for the GLM. A variety
of subsequent texts have expanded on these ideas, including Agresti (2002).
Wedderburn (1974) introduced quasilikelihood. A widely used extension
of these ideas appears in the theory and implementation of Generalized
Estimating Equations (GEE) (Hardin and Hilbe, 2012; Ziegler, 2011). Zorn
(2000) summarizes the GEE approach with applications to political science.
8

Ordered Categorical Variable Models
8.1 motivation
Many variables that social scientists study are neither continuous nor binary,
but rather consist of a (usually small) set of categories that are ranked from
low to high. The most familiar example is the five point scale that is widely
used in surveys such as the American National Election Study (ANES). Survey
designers frequently phrase questions so that respondents must choose among
five ordered choices ranging from strongly disagree (at the low end) to strongly
agree (at the high end). Intermediate categories are Agree, Don’t Know, and
Disagree. This assumes an underlying dimension, $y^*$, in which the various
responses are ordered, but we lack a meaningful metric for distance. How far
is “disagree” from “strongly agree”?
[figure: latent density positioned at $\mathbf{x}_i^T\boldsymbol\beta$ and $\mathbf{x}_j^T\boldsymbol\beta$, with cutpoints separating response categories such as "agree" and "strongly agree"]
$$\log L(\boldsymbol\beta, \boldsymbol\tau \mid \mathbf{Y}, \mathbf{X}) = \sum_{i=1}^{n}\sum_{m=1}^{M} 1_{im}\,\log\bigl[F(\tau_m - \mathbf{x}_i\boldsymbol\beta) - F(\tau_{m-1} - \mathbf{x}_i\boldsymbol\beta)\bigr], \tag{8.3}$$

where F denotes the assumed CDF.
Just as in the binary case, the choice between logit and probit is virtually always
inconsequential.
A change in xj shifts the density up or down the axis, while the positions of
the cutpoints stay the same. For a positive β̂j the “mass of probability” in the
table 8.1 Selected variables and descriptors from 2016 ANES pilot study.
[estimates table with columns β̂ and σβ̂ not recovered]
[figure: predicted probabilities π̂ (0.0–1.0) for each placement category, Pr(extremely liberal) through Pr(extremely conservative), across values of party identification]
The system of logit models fit to each of these Ỹm is called the cumulative logit
model.
library(MASS)
# requires anes.sub data from repository
attach(anes.sub)
polr.out <- polr(as.ordered(Obama.LR) ~ follow.politics + pid +
                   age + education + income + white,
                 data = anes.sub, method = "logistic", Hess = TRUE)
beta <- coef(polr.out)
tau <- polr.out$zeta          # estimated cutpoints
# create predicted probabilities
X <- cbind(median(follow.politics),  # scenario of interest
           min(pid):max(pid),        # across range of party ID
           median(age), median(education),
           median(income), median(white))
p1 <- plogis(tau[1] - X %*% beta)
p2 <- plogis(tau[2] - X %*% beta) - plogis(tau[1] - X %*% beta)
p3 <- plogis(tau[3] - X %*% beta) - plogis(tau[2] - X %*% beta)
p4 <- plogis(tau[4] - X %*% beta) - plogis(tau[3] - X %*% beta)
p5 <- plogis(tau[5] - X %*% beta) - plogis(tau[4] - X %*% beta)
p6 <- plogis(tau[6] - X %*% beta) - plogis(tau[5] - X %*% beta)
p7 <- 1.0 - plogis(tau[6] - X %*% beta)
Note that in this general specification each equation has its own set of slope
parameters, β m . The ordered logit and probit models estimate these equations
simultaneously, with the constraint that slope parameters are equal across
equations, i.e., β m = β ∀ m. The intercepts (τm ) vary across equations, as
they must if we are to discriminate between categories. This assumption of
common slope parameters across levels of the response variable goes by the
name “parallel regressions” or, in the logit case, “proportional odds.” The
phrase parallel regressions describes how the impact of the covariates may shift
the predicted probability curves to the right or the left but does not change
the basic slope of these curves across any two categories. This means that the
partial derivative for the probability that Y is in any category with respect to
the covariate information should be equal:
$$\frac{\partial \Pr(y \le m \mid x)}{\partial x} = \frac{\partial \Pr(y \le m-1 \mid x)}{\partial x} = \frac{\partial \Pr(y \le m-2 \mid x)}{\partial x} = \cdots$$
For better or worse, the parallel regressions assumption is rarely examined
in practice, so the extent to which violations are substantively important is
not well-explored. But it is possible for a covariate to increase the probability
of being in category 2 relative to 1 and yet have a null or even negative
relationship at other levels. An inappropriate constraint could result in missing
this relationship. The parallel regressions assumption is easier to satisfy when
# scenarios
X.wd <- cbind(median(follow.politics), 1,    # PID=1 weak democrat
              median(age), median(education), median(income), TRUE)
X.ind <- cbind(median(follow.politics), 3,   # PID=3 independent
               median(age), median(education), median(income), TRUE)
# coefficient vectors; note inclusion of cutpoints. mvrnorm is from MASS.
draws <- mvrnorm(1000, c(coef(polr.out), polr.out$zeta),
                 solve(polr.out$Hessian))
B <- draws[, 1:length(coef(polr.out))]
Taus <- draws[, (length(coef(polr.out)) + 1):ncol(draws)]
# predicted probabilities
pi.lib.wd <- plogis(Taus[, 2] - B %*% t(X.wd)) - plogis(Taus[, 1] - B %*% t(X.wd))
pi.lib.ind <- plogis(Taus[, 2] - B %*% t(X.ind)) - plogis(Taus[, 1] - B %*% t(X.ind))
pi.mod.wd <- plogis(Taus[, 4] - B %*% t(X.wd)) - plogis(Taus[, 3] - B %*% t(X.wd))
pi.mod.ind <- plogis(Taus[, 4] - B %*% t(X.ind)) - plogis(Taus[, 3] - B %*% t(X.ind))
# differences
fd.lib <- pi.lib.ind - pi.lib.wd
fd.mod <- pi.mod.ind - pi.mod.wd
# plotting
plot(density(fd.mod, adjust = 1.5), xlim = c(-0.2, 0.2), ylim = c(0, 50),
     xlab = "Change in predicted probability", bty = "n", col = 1,
     yaxt = "n", lwd = 2, main = "", ylab = "")
lines(density(fd.lib, adjust = 1.5), col = grey(0.5), lwd = 2, lty = 2)
text(x = 0.11, y = 42, labels = "Pr(`liberal` | PID=`independent`) -\n Pr(`liberal` | PID=`weak dem`)", cex = .8)
text(x = -.12, y = 35, labels = "Pr(`moderate` | PID=`independent`) -\n Pr(`moderate` | PID=`weak dem`)", cex = .8)
detach(anes.sub)
there are just a few categories but becomes more difficult to meet as the number
of categories grows.
The parallel regressions assumption can be tested in several ways. One way is
to fit m − 1 binary regressions after expanding the ordinal dependent variable
into m − 1 binary Ỹm . With these regression results in hand, the equality of
β̂ m can be examined with standard Wald-type tests or simply visualized. A
somewhat easier test to execute is to compare the ordered logit/probit model
with a cumulative logit or a multinomial logit/probit. We provide an example
of the former below. Multinomial models are discussed in Chapter 9.
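A minimal sketch of the binary-expansion check, assuming a hypothetical data frame dat with an ordered numeric outcome y and a single covariate x:

levs <- sort(unique(dat$y))
M <- length(levs)
# one binary logit per threshold; collect the slope on x from each
coefs <- sapply(levs[-M], function(m)
  coef(glm(I(y <= m) ~ x, family = binomial, data = dat))["x"])
coefs   # roughly equal slopes are consistent with parallel regressions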
equations using the cumulative logit conceptualization. Table 8.5 displays how
the dependent variables, the Ỹm , are constructed.
Figure 8.5 displays the results. The vertical axis is the β̂k . If parallel regression
holds, then these estimates should be approximately equal. The plots are scaled
so that the vertical axis distance roughly covers ±2 standard errors from
the maximum and minimum coefficient estimates. Again, we see unusual and
unstable patterns in the regression coefficients, even those like state failure
(stfl) that appeared as significant in the ordered logit fit.
The results here are consistent with the imprecisely estimated threshold
parameters reported in Table 8.4, implying that several of the categories are
difficult to distinguish from one another. We might fit the OLS model, as above,
or consider combining some categories together, especially where thresholds
are imprecisely estimated. But either way this application violates the parallel
regressions assumption.
                    Ỹim = 1                                   Ỹim = 0
eq. 1    Ỹi,0       0                                         0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5
eq. 2    Ỹi,0.5     0, 0.5                                    1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5
eq. 3    Ỹi,1       0, 0.5, 1                                 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5
eq. 4    Ỹi,1.5     0, 0.5, 1, 1.5                            2, 2.5, 3, 3.5, 4, 4.5, 5
eq. 5    Ỹi,2       0, 0.5, 1, 1.5, 2                         2.5, 3, 3.5, 4, 4.5, 5
eq. 6    Ỹi,2.5     0, 0.5, 1, 1.5, 2, 2.5                    3, 3.5, 4, 4.5, 5
eq. 7    Ỹi,3       0, 0.5, 1, 1.5, 2, 2.5, 3                 3.5, 4, 4.5, 5
eq. 8    Ỹi,3.5     0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5            4, 4.5, 5
eq. 9    Ỹi,4       0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4         4.5, 5
eq. 10   Ỹi,4.5     0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5    5
8.4.3 Extensions
Several extensions to the ordered logit/probit model have been proposed to
partially relax the parallel regressions assumption and allow for other nuances
like unequal variances across units. These models, going by names like partial
proportional odds and generalized ordered logit, have not seen extensive use
in the social sciences for several reasons. First, an even more general and
flexible approach, the multinomial model taken up in the next chapter, is
widely understood. The main cost of the multinomial model is that it estimates
many more parameters than an ordered alternative. With datasets growing
bigger and computers becoming ever faster, the relative cost of the multinomial
alternative is falling fast. Second, most of these models require theoretical or
other reasons for constraining some predictors to respect the parallel regressions
assumption while allowing others not to. Or, as with the heteroskedastic probit model,
some covariates must be used to model scale parameters, often with little
improvement to model fit. Rarely do we have theories at this level of precision.
Third, interpretation of these hybrid ordered models is more complicated,
including the fact that it is possible for certain models to generate negative
predicted probabilities (McCullagh and Nelder, 1989).
[figure panels: conditional means of intrvlag, icntglag, maglag, genyr, regtype, stfl, marg, coldwar, and ethkrain plotted against levels of magnitud; n = 273 in each panel]
figure 8.4 Plot of the conditional means of the regressors at different levels of the response variable, magnitud. If the parallel regressions assumption holds, the means will show a strong trend across values of the response variable and line up neatly. The broken line is the loess smoother describing this trend.
[figure panels: estimated coefficients from the ten cumulative logit equations, one panel per regressor]
figure 8.5 Plot of the estimated regression coefficients from M − 1 regressions on Ỹm, the indicators that Y ≤ m. If the parallel regressions assumption holds, the estimated coefficients should be stable across levels of Ỹ.
8.5 ordered categorical variables as regressors
[figure: log child mortality plotted against ordered categories 1–7 of executive constraints (XCONST)]
using categorical treatments of XCONST fit the data better in-sample than
the one treating XCONST as continuous. On a BIC basis, the decision is
less clear. Using a five-fold cross validation heuristic, we find that the first
model returns an MSE of 1.07; the other two return MSEs of 0.926, an
improvement of almost 14% in out-of-sample predictive performance. Better
modeling of the categorical data on the right hand side of the regression
has helped us better understand the relationship between the variables of
interest here.
Applications
Walter (2017) uses ordered logit to analyze European Social Survey data
reporting people's perceptions of labor market risk. Kriner and Shen (2016)
analyze an experiment on the effect of the military draft on support for military action.
Past Work
McKelvey and Zavoina (1975) present an early introduction to the ordered
probit in the social sciences. Fullerton (2009) develops a typology of the various
types of logit-based models for ordered categorical outcome variables.
For a nuanced exploration of an ordered categorical variable commonly
treated as interval-scaled (Polity scores), see Treier and Jackman (2008).
Advanced Study
On the “generalized ordered logit” that constrains some regression parameters
to be equal across categories while allowing others to vary, see Peterson and
Harrell (1990); Williams (2016).
Software Notes
See Venables and Ripley (2002) for more extended discussion of contrasts
in R and how they are constructed. Successive differences contrasts are
part of the MASS library, as is the polr function. The ordinal package
(Christensen, 2015) implements a variety of models for ordinal data, including
partial proportional odds and heteroskedastic ordered logit. The oglmx package
(Carroll, 2017) implements heteroskedastic ordered logit and probit models.
9

Models for Nominal Data
9.1 introduction
In Chapter 8 we extended the latent variable model for binary outcomes to
incorporate ordered response variables. We managed this by adding a set of
estimated “thresholds” that served to map the assumed probability density into
discrete but ordered categories. Our ability to do this relies on the “parallel
regressions” assumption: the relationship between covariates and the outcome
variable is constant across different categories of the outcome.
Unordered, polychotomous dependent variables are simply variables in
which the categories cannot be ordered in any mathematically meaningful way.
These are also called nominal variables having more than the two categories
found in a dichotomous variable. There are lots of good examples in the social
sciences: vote choice (Christian Democrat, Social Democrat, Greens, etc.);
occupation (doctor, lawyer, mechanic, astronaut, student, etc.); marital status
(single, married, divorced, etc.); college major (art history, modern history,
Greek history, etc.); language (French, German, Urdu, etc.); ethnicity (Serb,
Croat, Bosniak, Avar, Lek, etc.); and many, many others. Sometimes these
nominal groups can represent ascriptive categories. But often these groups are
the objects of some choice process. This, in turn, has consequences for how
we think about a statistical model and find relevant covariates. For example,
there may be covariates relevant to the choice, such as price, color, or party
platform. Or we might be interested in covariates relevant to the chooser or
choice process, like income, age, or gender.
Constructing likelihood-based models for nominal variables builds on a
generalization of the binomial distribution, the appropriately named multi-
nomial. From there we develop a set of linear predictors that allow us to
incorporate both covariates and link functions that map the linear predictor
into the appropriate interval.
$$Z_i \sim f_c(z_i;\, p_1, \ldots, p_M).$$
We can write the expectation of a categorical variable using the
notation $1_m$ for the indicator variable taking on 1 if $Z_i = m$ and 0
otherwise. $\mathbb{E}[1_m] = p_m$ and $\operatorname{var}(1_m) = p_m(1 - p_m)$. The covariance
between any two categories, $a, b \in S$, is given as $\operatorname{cov}(1_a, 1_b) = -p_a p_b$. The Bernoulli distribution is a special case of the categorical
distribution in which M = 2.
Now suppose we have n independent realizations of $Z_i$. Let $n_j \le n$, $j \in S$, denote the number of realizations in category j such that
$\sum_{j \in S} n_j = n$. Let $\mathbf{Y} = (n_1, \ldots, n_M)$. We say that Y follows a
multinomial distribution with parameter vector $\boldsymbol\theta = (n, p_1, \ldots, p_M)$
and probability mass function

$$\mathbf{Y} \sim f_m(\mathbf{y};\, \boldsymbol\theta) = \frac{n!}{\prod_{j \in S} n_j!} \prod_{j \in S} p_j^{n_j},$$

with $\mathbb{E}[\mathbf{Y}] = (np_1, \ldots, np_M)$ and $\operatorname{var}(\mathbf{Y}) = (np_1(1-p_1), \ldots, np_M(1-p_M))$. The covariance between the counts in any two categories, $a, b \in S$, is given
as $\operatorname{cov}(n_a, n_b) = -np_a p_b$. The binomial distribution is a special case
of the multinomial distribution in which M = 2.
odds of being in category m versus the reference category. It then follows that
any multinomial model can be reestimated with a different reference category.
This will not change the overall fit of the model or its implications, but it will
generally change the regression coefficients that you see on your computer
screen and their immediate interpretations. But, as we will see later in this
section, we can still recover pairwise comparisons across categories not used
as a baseline.
Thinking of the multinomial model as a set of binary models also highlights
another important fact: multinomial models are very demanding of the data.
Multinomial models burn degrees of freedom rapidly as we estimate M − 1
parameters for every new explanatory variable. In a multinomial framework
all the problems of perfect separation and rare events that we encountered
in binary data are amplified. With so many more predictor–outcome combi-
nations, small numbers of observations in any of these cells can arise more
easily.
The multinomial likelihood can be formed following our usual steps and
using $1_{ih}$ as an indicator variable taking on 1 when observation i is in
the hth category and 0 otherwise. For notational simplicity, the equations
include sums over all M categories. To identify the model, let $\boldsymbol\beta_1 = \mathbf{0}$ so
$\exp(\mathbf{x}_i\boldsymbol\beta_1) = 1$.

$$L_i = \prod_{h=1}^{M}\left[\frac{\exp(\mathbf{x}_i\boldsymbol\beta_h)}{\sum_{\ell=1}^{M}\exp(\mathbf{x}_i\boldsymbol\beta_\ell)}\right]^{1_{ih}},$$

$$L = \prod_{i=1}^{n}\prod_{h=1}^{M}\left[\frac{\exp(\mathbf{x}_i\boldsymbol\beta_h)}{\sum_{\ell=1}^{M}\exp(\mathbf{x}_i\boldsymbol\beta_\ell)}\right]^{1_{ih}},$$

$$\log L = \sum_{i=1}^{n}\sum_{h=1}^{M} 1_{ih}\left[\mathbf{x}_i\boldsymbol\beta_h - \log\sum_{\ell=1}^{M}\exp(\mathbf{x}_i\boldsymbol\beta_\ell)\right].$$
This likelihood is nice in all the standard ways: it is globally concave and quick
to converge, producing estimates that are (in the limit) consistent, asymptotically
normal, and efficient.
of these characteristics across alternatives: μi (m) = xi β m . Being clever, the
individual chooses among the alternatives so as to maximize utility:
$$\begin{aligned}
\Pr(Y_i = m) &= \Pr\bigl(U_i(m) > U_i(d)\ \forall\ d \ne m \in S\bigr)\\
&= \Pr\bigl(\mu_i(m) + \epsilon_{im} > \mu_i(d) + \epsilon_{id}\ \forall\ d \ne m \in S\bigr)\\
&= \Pr\bigl(\mathbf{x}_i\boldsymbol\beta_m + \epsilon_{im} > \mathbf{x}_i\boldsymbol\beta_d + \epsilon_{id}\ \forall\ d \ne m \in S\bigr)\\
&= \Pr\bigl(\epsilon_{im} - \epsilon_{id} > \mathbf{x}_i(\boldsymbol\beta_d - \boldsymbol\beta_m)\ \forall\ d \ne m \in S\bigr). \end{aligned} \tag{9.3}$$
If the difference between the stochastic component and that of any alternative
is greater than the difference in the systematic parts, it has the highest
utility, because either $\epsilon_{im}$ is large or $\mathbf{x}_i\boldsymbol\beta_m$ is large, or both. The expression in
Equation 9.3 relies on differences between coefficient vectors across categories,
so once again we see that model identification requires that we fix one category
as the baseline. To complete the model we need to choose an expression for the
stochastic component, i.e., we specify the distribution of the error terms, $\epsilon_{im}$.
If we choose a multivariate normal distribution we arrive at the multinomial
probit model. If we use a standard type-I extreme value distribution (EV-I), and
the errors are i.i.d. across categories, we arrive at the multinomial logit model
already introduced.
The choice of EV-I appears arbitrary. The following theorem justifies this
choice by linking the EV-I distribution to the logistic distribution we are already
familiar with from Chapters 3 and 8.
Theorem 9.1. If $A, B \overset{\text{i.i.d.}}{\sim} f_{EV1}(0, 1)$ then $A - B \sim f_L(0, 1)$.

Proof The proof involves the convolution of two type-I extreme value distributions. Let $C = A - B$.
9.2.2 IIA
IIA stands for the independence of irrelevant alternatives. IIA is an assumption
about the nature of the choice process: under IIA, an individual’s choice does
not depend on the availability or characteristics of inaccessible alternatives. IIA
is closely related to the notion of transitive (or acyclic) preferences.
Returning to the Australian election example, suppose a voter is asked
whether she prefers the Labor Party or the Liberal Party and she responds with
“Liberal.” The interviewer then reminds her that the Green Party is also fielding
candidates, and she switches her choice. IIA says that the only admissible switch
she could make is to the Green Party. She cannot say Labor because she could
have chosen Labor before (when it was only Labor v. Liberal) but decided not
to. More formally, IIA says that if you hold preferences {Liberal ≻ Labor} when
those are the only two options, then you must also hold preferences {Green ≻
Liberal ≻ Labor} or {Liberal ≻ Labor ≻ Green} or {Liberal ≻ Green ≻ Labor}
when the Green party is available. Orderings like {Labor ≻ Liberal ≻ Green},
in which Labor and Liberal switch positions once Green becomes available, are
not admissible under the IIA assumption.
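The Hausman-McFadden statistic takes the familiar Wald form,

$$HM = (\hat{\boldsymbol\beta}_r - \hat{\boldsymbol\beta}_u)'\,(\hat{\mathbf{V}}_r - \hat{\mathbf{V}}_u)^{-1}\,(\hat{\boldsymbol\beta}_r - \hat{\boldsymbol\beta}_u),$$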
where β̂ r are the estimates from the restricted model (i.e., the model with
an omitted alternative), β̂ u is the vector of estimates for the unrestricted
model (i.e., the one with all the alternatives included), and V̂r and V̂u are
the estimated variance-covariance matrices for the two sets of coefficients,
respectively.1 Under the null hypothesis that IIA holds, the test is distributed
1 Let $\boldsymbol\beta_m$ be the coefficients for the choice category that is omitted from the restricted model but
included in the unrestricted one. For these vectors and matrices to be conformable, we omit all
the $\boldsymbol\beta_m$ elements from $\hat{\boldsymbol\beta}_u$ and $\hat{\mathbf{V}}_u$.
9.3.1 Evaluation
Before interpreting the output of a multinomial model, we have several tools
available to describe model fit. With multinomial models we, of course, have
access to the usual in-sample test statistics as well as the AIC and BIC.
Most statistical software packages report the likelihood ratio model diagnostic
comparing a null model (M − 1 intercepts) to the one specified. This tells us
whether all coefficients estimated are equal to 0 for all M outcomes and k
variables. This is typically not very informative, like most null models.
R Code Example 9.1 Wald test for combining categories in a multinomial logit
Stated in words, all coefficients (except the intercepts) for outcomes C and O
are equal. This can be restated as a simple linear constraint: the differences in
coefficients (except the intercepts) for each of the two outcomes are zero under
the null, which leads naturally to a Wald test. Code Example 9.1 calculates
the model in Table 9.1 and then performs the Wald test that the Coalition and
Other can be combined. This produces a p-value ≈ 0, leading to the conclusion
that we can distinguish Coalition from Other.2
Do Australian voters' choices between Labor and Coalition depend on the
presence of Other alternatives? We can use Equation 9.6 to conduct the
Hausman-McFadden test for IIA. Leaving out the Other category gives us a
test statistic of −2, implying that preferences between Labor and Coalition are
not affected by the presence of Other, as IIA requires.
Prediction Heuristics
Alongside these hypothesis tests we can take advantage of a variety of tools for
describing how well a multinomial model predicts outcomes, whether in- or
out-of-sample. Many of the tools below are generalizations of those described
for binary outcomes. All start by generating predicted probabilities for each
observation across all M categories: p̂i = (p̂i1 , . . . , p̂iM ). From these predictions
it is common to define $\hat{y}_i = \arg\max_m \{\hat{p}_{im}\}$, i.e., observation i is classified into
the category with the highest predicted probability.
One of the simplest diagnostic tools, a confusion matrix, derives from
a comparison of ŷi and yi . Table 9.2 displays an in-sample version of this
matrix for the model in Table 9.1. From the table we can see that the model
2 This test can also be constructed as a likelihood ratio in which we constrain all the coefficients
except the intercept for one of the categories to be 0.
                       Predicted
                       Labor    Coalition    Other
Actual   Labor           396          665       20
         Coalition       246        1,305       21
         Other           243          419       27

                       Predicted
                       Labor    non-Labor
Actual   Labor           396          685
         non-Labor       489        1,772

                       Coalition    non-Coalition
Actual   Coalition         1,305              267
         non-Coalition     1,084              686

                       Other    non-Other
Actual   Other            27           662
         non-Other        41         2,612
pmnl <- predict(mnl.fit)
conmat <- table(mnl.fit$model[, 1], pmnl,
                dnn = list("actual", "predicted"))   # confusion matrix
sum(diag(conmat)) / sum(conmat)                      # overall accuracy
oneVall <- lapply(1:ncol(conmat),                    # one v. all matrices
  function(i) {
    v <- c(conmat[i, i],                                   # true positives
           rowSums(conmat)[i] - conmat[i, i],              # false negatives
           colSums(conmat)[i] - conmat[i, i],              # false positives
           sum(conmat) - rowSums(conmat)[i] - colSums(conmat)[i] + conmat[i, i])  # true negatives
    return(matrix(v, nrow = 2, byrow = TRUE,
                  dimnames = list(
                    c(paste("actual", colnames(conmat)[i]),
                      paste("actual non", colnames(conmat)[i])),
                    c(paste("predicted", colnames(conmat)[i]),
                      paste("predicted non", colnames(conmat)[i])))))
  })
pcerr <- lapply(oneVall, function(x)   # per-class error
  return(1 - sum(diag(x)) / sum(x)))
[figure: ROC curves (true positive rate against false positive rate) for each one-versus-all comparison]
9.3.2 Interpretation
How does one interpret the (really big) table of numbers generated from
multinomial models? We proceed in the same way as in earlier chapters:
simulate outcomes from the assumed data-generating process described by the
model under meaningful scenarios. As usual, we include the systematic and
stochastic components in all their glory. Here this means using the fundamental
probability statement of the multinomial model from Equation 9.2. From there
we can construct graphical displays, tables of first differences, or calculated
marginal effects. With multiple outcome categories there are multiple compar-
isons that should be reflected in any interpretation.
figure 9.2 The three-dimensional unit simplex.
[figure 9.3: predicted probabilities for the union and nonunion scenarios plotted on the unit simplex with vertices Labor, Coalition, and Other, with 95% confidence ellipses]
figure 9.4 Predicted vote choice in the 2013 Australian federal elections across age
cohorts. Older voters are more likely to support the Liberal-National Coalition.
Another common interpretive device is the dreaded odds ratio. Since the multinomial logit is a log-odds model, it
may be useful to note that the log of the ratio of two probabilities is a linear
function of the independent variables:

$$\log\frac{\Pr(Y_i = m \mid \mathbf{x}_i)}{\Pr(Y_i = d \mid \mathbf{x}_i)} = \mathbf{x}_i(\hat{\boldsymbol\beta}_m - \hat{\boldsymbol\beta}_d).$$
Since we set the coefficients of one category – the baseline – to zero for
identification, we can calculate the log odds that i is in m relative to the baseline
using:
$$\log\frac{\Pr(Y_i = m \mid \mathbf{x}_i)}{\Pr(Y_i = 1 \mid \mathbf{x}_i)} = \mathbf{x}_i\hat{\boldsymbol\beta}_m.$$
This approach is linear in the parameters. We can calculate hypothetical
changes in the odds ratio for category m associated with a particular covariate
xj by exponentiation (i.e., exp(β̂m,j )). In this way we can inspect Table 9.1
and see that a Catholic has 21% lower odds of voting for the Coalition
over Labor relative to a Protestant (exp(−0.24) = 0.79); this relationship is
large relative to the estimated standard error. However, such statements are
cumbersome because the model is about comparing several categories. Simply
exponentiating coefficients (or their differences) privileges some comparisons
over others. Standard errors and statements of significance may vary depending
on the comparison made. For example, looking back at Table 9.1, we see that
Catholic is a “significant” predictor of voting for Labor relative to Other (left
half), but it is not a “significant” predictor of voting for the Coalition relative
to Other (right half). Relying on just the BUTON to view and interpret the
implications of a multinomial model may not be the most effective means of
communicating the model’s substantive implications to audiences. It certainly
fails to display many quantities that may be of interest to both readers and
analysts.
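For instance, with the fitted multinom object mnl.fit used in this chapter's code examples, exponentiating the coefficient matrix returns every such odds ratio at once:

exp(coef(mnl.fit))   # odds ratios relative to the omitted baseline category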
B <- mvrnorm(1000,
             mu = c(coef(mnl.fit)[1, ], coef(mnl.fit)[2, ]),
             Sigma = vcov(mnl.fit))
X.u <- c(1, median(myoz$income2[myoz$union == "Yes"]),
         1, 0, 0, 0, 0, median(myoz$age[myoz$union == "Yes"]))
X.nu <- c(1, median(myoz$income2[myoz$union == "Yes"]),
          0, 0, 0, 0, 0, median(myoz$age[myoz$union == "Yes"]))
k <- dim(coef(mnl.fit))[2]
denom.u <- 1 + exp(B[, 1:k] %*% X.u) + exp(B[, (k+1):(2*k)] %*% X.u)     # denominator of multinomial
denom.nu <- 1 + exp(B[, 1:k] %*% X.nu) + exp(B[, (k+1):(2*k)] %*% X.nu)  # denominator of multinomial
pp.coal.u <- exp(B[, 1:k] %*% X.u) / denom.u
pp.other.u <- exp(B[, (k+1):(2*k)] %*% X.u) / denom.u
pp.coal.nu <- exp(B[, 1:k] %*% X.nu) / denom.nu
pp.other.nu <- exp(B[, (k+1):(2*k)] %*% X.nu) / denom.nu
union.pp <- cbind(pp.coal.u, pp.other.u, (1 - pp.coal.u - pp.other.u))
nounion.pp <- cbind(pp.coal.nu, pp.other.nu, (1 - pp.coal.nu - pp.other.nu))
colnames(union.pp) <- colnames(nounion.pp) <- c("Coalition", "Other", "Labor")
library(compositions)
# Getting ellipse radius for CR
# See van den Boogaart and Tolosana-Delgado 2013, p. 83
df1 <- ncol(union.pp) - 1
df2 <- nrow(union.pp) - ncol(union.pp) + 1
rconf <- sqrt(qf(p = 0.95, df1, df2) * df1 / df2)
rprob <- sqrt(qchisq(p = 0.95, df = df1))
plot(acomp(union.pp), col = grey(0.8))
plot(acomp(nounion.pp), col = grey(0.8), pch = 3, add = TRUE)
isoPortionLines(col = grey(0.5), lty = 2, lwd = .7)
ellipses(mean(acomp(union.pp)), var(acomp(union.pp)), r = rprob, col = "red", lwd = 2)
ellipses(mean(acomp(nounion.pp)), var(acomp(nounion.pp)), r = rprob, col = "blue", lwd = 2)
text(.35, .2, "non\n union")
text(.52, .4, "union")
This expression depends not only on the values of the other covariates but also
on all the other regression coefficients, $\boldsymbol\beta_d$, $d \ne m$. As a result, the marginal effect
of a particular covariate for a specific category may not have the same sign as an
estimated coefficient for covariate j appearing in a BUTON. A full accounting
of marginal effects would require yet another BUTON in which we display the
marginal effects of each variable for each category, each of which is conditional
on a specific scenario embodied in xi . Whether a particular marginal effect is
large relative to estimation uncertainty will depend on the scenario chosen, the
coefficient for the variable of interest, and all the other coefficients in the model.
As a result, this approach to interpretation is often best avoided.
9.4 conditional multinomial logit

The parameter vector δ has a constant and a regression coefficient for each of
k − 1 regressors. This formulation goes by the name conditional (multinomial)
logit.3 Note that in the conditional logit model the covariate values differ across
categories and the parameters are constant, whereas in the multinomial model
the covariate values are fixed for each individual across choice categories, and
the parameters differ. Another way to see this is to consider the ratio of the
probabilities of choosing m over d under the conditional logit:
$$\frac{\Pr(Y_i = m)}{\Pr(Y_i = d)} = \exp\bigl[(\mathbf{w}_{im} - \mathbf{w}_{id})\,\boldsymbol\delta\bigr]. \tag{9.6}$$
3 Note a “conditional logit” model means different things in different disciplines. In economics,
political science, and sociology “conditional logit” usually refers to the model described in this
section, following McFadden (1974). In epidemiology, “conditional logit” refers to a matched
case-control logit model, sometimes called “fixed effects/panel logit” in other disciplines. It is
possible to show that the McFadden conditional logit is a special case of the fixed effects version,
but we do not take that up here. Note that STATA has different model commands for each of
these.
Equation 9.6 is just a restatement of the IIA condition. Comparing Equation 9.4
to Equation 9.6 highlights the difference between the multinomial and
conditional logit specifications.
Of course the multinomial and the conditional logits can each be viewed as
special cases of the other. Their log-likelihood and methods of interpretation
are nearly identical. It is also possible to build a model that combines both
category- and chooser-specific covariates:
$$\Pr(Y_i = m) = \frac{\exp(\mathbf{w}_{im}\boldsymbol\delta + \mathbf{x}_i\boldsymbol\beta_m)}{\sum_{h=1}^{M}\exp(\mathbf{w}_{ih}\boldsymbol\delta + \mathbf{x}_i\boldsymbol\beta_h)}.$$
                              Conditional       Mixed
Contact                          0.36            0.33
                                (0.06)          (0.06)
Labor: intercept                −0.32            0.32
                                (0.04)          (0.21)
Other: intercept                −0.72            0.04
                                (0.05)          (0.24)
Labor: income                                   −0.05
                                                (0.01)
Other: income                                   −0.03
                                                (0.01)
Labor: union member                              1.02
                                                (0.10)
Other: union member                              0.81
                                                (0.12)
Labor: Catholic                                  0.24
                                                (0.11)
Other: Catholic                                 −0.11
                                                (0.14)
Labor: not religious                             0.71
                                                (0.11)
Other: not religious                             1.09
                                                (0.12)
Labor: other religion                            0.22
                                                (0.13)
Other: other religion                            0.43
                                                (0.15)
Labor: female                                    0.06
                                                (0.08)
Other: female                                    0.11
                                                (0.10)
Labor: age                                      −0.01
                                                (0.00)
Other: age                                      −0.02
                                                (0.00)
n                                3,342           3,342
log L                           −3,476          −3,302
AIC                              6,957           6,639
BIC                              6,976           6,743
table 9.6 A typical rectangular data structure for a multinomial model. Note the inclusion of the
choice-specific variables.

                                                                        Contact
Respondent  Vote Choice  Income  Union  Religion    Sex     Age   Labor  Coalition  Other
1           Labor          20.0  No     Protestant  Female   43     1       1         0
2           Labor          20.0  No     Protestant  Male     61     1       1         1
3           Other          15.5  Yes    None        Male     52     1       1         1
4           Coalition       7.5  No     Protestant  Female   69     1       1         1
5           Other           1.2  No     Other       Female   61     1       1         1
6           Labor           9.5  Yes    None        Female   20     0       0         0
…           …               …    …      …           …        …      …       …         …
3,342       Other           9.5  Yes    None        Female   40     1       1         0
table 9.7 A grouped-response data structure enabling conditional and mixed logit estimation.
Each row is a person-category.

Respondent  Category   Vote Choice  Contact  Income  Union  Religion    Sex     Age
1           Coalition  FALSE        1        20      No     Protestant  Female  43
1           Labor      TRUE         1        20      No     Protestant  Female  43
1           Other      FALSE        0        20      No     Protestant  Female  43
2           Coalition  FALSE        1        20      No     Protestant  Male    61
2           Labor      TRUE         1        20      No     Protestant  Male    61
2           Other      FALSE        1        20      No     Protestant  Male    61
…           …          …            …        …       …      …           …       …
3,342       Coalition  FALSE        1        9.5     Yes    None        Female  40
3,342       Labor      FALSE        1        9.5     Yes    None        Female  40
3,342       Other      TRUE         0        9.5     Yes    None        Female  40
The data in Tables 9.6 and 9.7 contain exactly the same information. Refor-
matting only serves computational convenience. Understanding how the data
are organized is important for understanding how to effectively sample from
the limiting distribution of the parameters and generate predicted probabilities.
Viewing the data in grouped-response format has the additional benefit of
highlighting the connection between a conditional or mixed multinomial logit
model and panel data. A panel data set in which each individual is observed
repeatedly is also typically organized as stacked individual-level matrices. In
fact a “conditional logit” is one way of analyzing binary panel data, something
we take up further in Chapter 11.
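A minimal sketch of the reshaping step with the mlogit library (Croissant, 2013); the data frame and the column positions of the choice-specific variables are hypothetical:

library(mlogit)
# convert the rectangular (wide) data to one row per respondent-category
oz.long <- mlogit.data(oz, choice = "vote", shape = "wide",
                       varying = 8:10)   # columns holding the contact indicators
head(oz.long)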
9.5 extensions
All of the models above rely on the IIA assumption. The IIA, in turn, results
from the assumption that (1) the errors are independent across categories,
(2) the errors are identically distributed, and (3) the errors all follow an EV-I
distribution. A variety of alternatives have been developed to partially relax IIA
or, equivalently, allow for unequal variance or correlation across the outcome
categories. In formulating the likelihood, all these models continue to maintain
the assumption of (conditional) independence across individuals, i.
7 The nested logit model can also be derived from the latent variable construction by assuming
that the M-vector of errors, $\boldsymbol\epsilon_i$, follows a Generalized Extreme Value distribution.
8 A λ outside the (0,1] interval is commonly viewed as evidence of model misspecification.
sum coefficient commonly reported. With these expressions we can now define
another logit,

$$\Pr(A_d) = \frac{\exp(W_d + \lambda_d Z_d)}{\sum_{\ell=1}^{D} \exp(W_\ell + \lambda_\ell Z_\ell)}.$$
The nested logit requires that we pre-specify the set of nests or meta
categories. In most situations there are several possible nesting structures, and
it may not be obvious which is “correct.” In the California recall example, state
law held that only those voting in favor of the recall could cast valid votes for
the successor, conforming to the nested logit structure and providing an obvious
way to construct a nesting structure. This provision was challenged in court
and found unconstitutional during the 2003 recall campaign. As a result, in the
actual election, voters could vote for both the retention of Gray Davis as well
as his successor, a violation of the nesting assumption. In the 2013 Australian
election data, we might imagine that voters first decide whether to vote for
a mainstream party (Labor or Coalition) or to cast a protest vote (Other).
Alternatively, we might imagine that voters first decide whether to vote left
(Labor, Other) or right (Coalition). Evaluating these options involves another
layer of model selection, requiring tools such as likelihood ratios, BIC, and out-
of-sample evaluation. In the particular case of the Australian survey data, there
is no benefit to including either of these nesting structures, based on the BIC.
$$\boldsymbol\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{iM}) \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma),$$

we arrive at the multinomial probit model. The covariance matrix $\boldsymbol\Sigma$ ($M \times M$)
allows for arbitrary correlation across choice categories. In other words, the
multinomial probit does not require the IIA assumption. But the cost is a
much heavier computational burden. To see this, note the choice probability
for individual i:
the choice probabilities across categories. In this model the $\boldsymbol\beta$ varies according
to some distribution, with its own parameters given by $\boldsymbol\theta$. It is common to
assume a normal distribution, implying that $\boldsymbol\theta = (\boldsymbol\mu_\beta, \boldsymbol\Sigma_\beta)$, although others
are feasible.
The mixed logit retains the logit probability, but treats it as a function of $\boldsymbol\beta$.
That is,

$$\Pr(Y_i = m \mid \boldsymbol\beta_i) = \frac{\exp(\mathbf{x}_{im}\boldsymbol\beta_i)}{\sum_{\ell=1}^{M}\exp(\mathbf{x}_{i\ell}\boldsymbol\beta_i)}.$$

But now we must specify the distribution for $\boldsymbol\beta$ and then integrate over it in the
process of maximization:

$$\Pr(Y_i = m) = \int_{\boldsymbol\beta} \Pr(Y_i = m \mid \boldsymbol\beta)\, f(\boldsymbol\beta \mid \boldsymbol\theta)\, d\boldsymbol\beta.$$
9.6 conclusion
In this chapter we generalized our treatment of binary and ordered categorical
variables to include multi-category, unordered outcomes. The standard model
is the multinomial logit, which can accommodate covariates describing attributes
of both the outcome categories as well as the units. The multinomial logit model
relies on the assumed IIA, which may be too restrictive for some applications.
Several extensions provide ways to relax this assumption.
The generality of the multinomial model comes at a cost. Multinomial
models ask a lot of the data, estimating a large number of parameters. These
models can become quite rich – and complex – very rapidly with the inclusion of
additional covariates or outcome categories. Decisions about what to include or
leave out necessitate a hierarchy of model evaluation decisions. Once estimated,
the presentation and interpretation of model results also entails additional care
and effort, all the more so when the models involve nested outcome categories
or hierarchical, random coefficients.
Applications
Eifert et al. (2010) use multinomial logit to analyze the choice of identity
group in a collection of African countries. Glasgow (2001) profitably uses the
mixed logit to analyze voter behavior in multi-party UK elections. Martin and
Stevenson (2010) use the conditional logit to examine coalition bargaining and
government formation.
Past Work
Alvarez and Nagler (1998) present an early systematic comparison of several
of the models described in this chapter; also see Alvarez (1998).
Advanced Study
Train (2009) is the more-advanced text on the specification and computation
of a variety of multinomial models. Bagozzi (2016) presents a "zero-inflated"
multinomial model with application to international relations. Mebane and
Sekhon (2004) discuss and extend the use of the multinomial model in the
context of counts across categories. See van den Boogaart and Tolosana-
Delgado (2013) on the calculation and plotting of confidence ellipses for
compositional data.
Software Notes
In R, the multinom function in the nnet package (Ripley and Venables,
2016; Venables and Ripley, 2002) only handles rectangular data and only fits
a standard multinomial model with individual-level covariates. The mlogit
(Croissant, 2013) and mnlogit (Hasan et al., 2016) libraries contain tools for
restructuring data sets and estimating all the models discussed in this chapter.
The MNP library (Imai and van Dyk, 2005) fits a Bayesian multinomial probit.
The compositions library (van den Boogaart et al., 2014) has a variety of
functions for working with compositional data such as the probability simplex.
10
10.1 introduction
According to legend, the mathematician and logician Leopold Kronecker
believed that mathematics should be entirely based on whole numbers, noting,
“God made the natural numbers; all else is the work of man.” Counts of discrete
events in time and space are integers. These counts could be the number of
bombs falling in a particular neighborhood, the number of coups d’état, the
number of suicides, the number of fatalities owing to particular risk categories,
such as traffic accidents, the frequency of strikes, the number of governmental
sanctions, terrorist incidents, militarized disputes, trade negotiations, word
counts in the speeches of presidential candidates, or any wide range of political
and social phenomena that are counted.
Models of dependent variables that are counts of events are unsurprisingly
called event count models. Event count models describe variables that map
only to the nonnegative integers: Y ∈ {0, 1, 2, . . .}. While grouped binary or
categorical data can be thought of as counts, such data sets are generally not
analyzed with count models.1 Count data have two important characteristics:
they are discrete and bounded from below.
Using ordinary least squares to directly model integer counts can lead to
problems for the same reason it does with binary data. The variance of a count
increases with the mean (there is more error around larger values), implying
inherent heteroskedasticity. More worryingly, OLS will generate predictions
that are impossible to observe in nature: negative counts and non-integer values.
Some try to salvage a least squares approach by taking the logarithm of the
dependent variable. This strategy, however, requires a decision about what to
do with the zero counts, since log(0) = −∞. One option is to simply discard
1 A major exception is the analysis of vote totals across candidates or parties. See, for example,
Mebane and Sekhon (2004).
the zero-count observations and instead only model the positive counts. This
has numerous drawbacks, including the potential to discard a large proportion
of the observed data. A second approach is to add some constant to the
outcome before taking logs. Aside from its arbitrariness, this approach
exacerbates problems with nonconstant variance and complicates interpretation.
In this chapter we introduce a series of models
that are explicitly designed to model event count data as what they are – integer
counts.
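As a quick illustration of the contrast, consider the following sketch, with a hypothetical data frame dat containing an integer count y and a covariate x:

# OLS on the logged outcome requires an arbitrary constant to handle zeros
ols.log <- lm(log(y + 1) ~ x, data = dat)

# a Poisson GLM models the integer counts directly, with no ad hoc constant
pois.fit <- glm(y ~ x, family = poisson, data = dat)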
10.2 the poisson distribution
The last term, $-\log(y_i!)$, is ignorable. This log-likelihood is regular and well-
behaved, so the standard tools apply. We can also derive the score equation for
the Poisson model:
$$\frac{\partial \log L}{\partial \beta} = \sum_{i=1}^{n} \left[ y_i x_i - x_i \exp(x_i\beta) \right] = \sum_{i=1}^{n} \left( y_i - \exp(x_i\beta) \right) x_i = 0. \tag{10.1}$$
i=1
Equation 10.1 will appear below because we can also view it from an estimating
equation or quasi-likelihood approach.
[Table 10.1 about here: OLS models of the mediation counts – on the raw counts, log(yi + 0.01), and log(yi + 10) – alongside Poisson models without exposure, with an exposure offset, and with exposure entered as a covariate.]
[Figure 10.1 about here: residual quantile (Q-Q) and residual-versus-fitted plots for the OLS model of the raw counts and the OLS model of log(yi + 10).]
amount is easily absorbed by the intercept and has the effect of artificially
reducing residual variance.2
Figure 10.1 displays residual quantiles and fitted-residual plots for the stan-
dard OLS model as well as the OLS on log(yi + 10). Both clearly indicate that
the OLS residuals are non-normal (with fat tails) and severely heteroskedastic,
as we would expect from count data. Adding a constant and transforming the
data has the drawback of exacerbating both these problems.
2 Technically the log-likelihoods and related quantities for the second and third models are
not comparable since they are fit to “different” outcome variables. This distinction is rarely
appreciated in applied work; the constant added to y is often not explicitly stated.
The last three columns of Table 10.1 present results from a series of Poisson
regressions. The first of these fits a model with the same linear predictor as the
OLS regression. Based on the AIC and BIC, the OLS model might be preferable
to the Poisson. But the general expression for the Poisson distribution takes into
account the extent of “exposure” for each subject. In the mediation example,
countries that have been around longer have had more opportunities to be
requested as mediators. If observations have different exposure windows, then
their expected counts should differ proportionally. Differential exposure needs
to be incorporated into the model. There are two approaches. The first is known
as an offset: include the size of the exposure window, hi , in the regression but
constrain its coefficient to be 1.0:
$$\lambda_i = \exp[x_i\beta + \log(h_i)].$$
The second approach simply includes the exposure variable (in logs) in the
regression equation and allows a coefficient to be determined empirically.
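In R's glm() the two approaches differ only in whether log exposure enters via offset() or as an ordinary covariate. A sketch, with hypothetical count y, covariate x, and exposure h:

# (1) offset: coefficient on log(h) constrained to 1.0
m.offset <- glm(y ~ x + offset(log(h)), family = poisson, data = dat)
# (2) exposure as covariate: coefficient on log(h) estimated freely
m.expose <- glm(y ~ x + log(h), family = poisson, data = dat)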
The last two columns of Table 10.1 reestimate the Bercovitch and Schneider
example, taking account of each country’s “exposure” with a variable indexing
the number of years a country has been a member of the international system.
The fifth column includes this log duration variable as an offset, while the model
in the sixth estimates a parameter. Accounting for exposure improves model
fit, based on the likelihood ratio and the information criteria. Comparing the
models in the final two columns, we see little reason to estimate a parameter
for the exposure variable. The log-likelihoods for the two models are identical.
A Wald test of the null hypothesis that the log duration coefficient equals 1.0 gives
a statistic of (1.11 − 1)/0.19 ≈ 0.6 and a two-sided p-value of about
0.6. In general, accounting for unequal exposure
is important for count models. Failing to do so will tend to inflate the putative
impact of covariates.
10.2.2 Interpretation
As with many of the models we have seen in this volume, the relationship
between a covariate and the response is nonlinear and depends on the value of
other covariates in the model. Suppose we have k − 1 covariates in the model.
The marginal effect of a particular regressor, $X_j$, on $E[Y_i]$ is
$$\frac{\partial \lambda_i}{\partial x_{ij}} = \beta_j \exp(x_i\beta). \tag{10.2}$$
In such situations we can follow our usual strategy: construct meaningful
scenarios; sample from the limiting distribution of the parameters; and combine
the two to generate predicted values, along with our estimation uncertainty.
In Figure 10.2 we do just that, presenting both the predicted and expected
number of mediations as a function of per capita GDP and membership on
the UNSC. For each of these scenarios, the black line represents the expected
[Figure 10.2 about here: predicted and expected numbers of mediations as a function of per capita GDP, for UNSC members and nonmembers.]
value (λ̂i ), while the red line represents the predicted value – an integer. Gray
bands are 95% confidence intervals around the expected value. In constructing
these scenarios we vary GDP per capita from its 20th to 80th sample percentile
values. We set duration in the international system at its sample median value
for UNSC nonmembers. We set it to log 40 ≈ 3.69 for UNSC members, its
minimum value for that group of countries. The plot shows that richer countries
are more likely to be mediators and that UNSC members are about eight times
more likely to be asked across income levels.
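Here is a sketch of the simulation strategy behind Figure 10.2, assuming the fitted Poisson model pois.fit from above and a hypothetical scenario vector x.s ordered as in coef(pois.fit), with any offset ignored for simplicity:

library(MASS)  # for mvrnorm

# sample parameter vectors from the limiting (multivariate normal) distribution
B <- mvrnorm(1000, mu = coef(pois.fit), Sigma = vcov(pois.fit))
lambda.s <- exp(B %*% x.s)                 # simulated expected counts
quantile(lambda.s, c(0.025, 0.5, 0.975))   # expected value and 95% interval
y.s <- rpois(length(lambda.s), lambda.s)   # predicted values: integers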
The log-linear construction of the Poisson model lends itself to other inter-
pretive strategies that you may encounter in your reading. Since log λi = xi β
we know that βj is the change in log λi for a one-unit change in xij . The quantity exp(β̂j )
therefore represents a multiplicative change in λ̂i , as shown in Equation 10.2.
In the mediation example, exp(2.18) ≈ 8.8, so countries on the UNSC serve as
mediators about nine times more frequently than countries not on the council.
If a particular covariate is included on the log scale, then the coefficient of that
covariate in a Poisson model has a direct interpretation in terms of elasticities,
or a percent change in the outcome for a 1% change in the covariate. For
example, if we included per capita GDP in logarithms (rather than thousands
of dollars), then we obtain β̂GDP = 0.20; a ten-percent increase in per capita
GDP is associated with a two-percent increase in the arrival rate of mediation
requests.
10.3 dispersion
Intuitively we expect greater variation around the mean when there is a
large number of expected events. We therefore expect count data to be inher-
ently heteroskedastic. The Poisson distribution captures this fact; the variance
increases with the mean, one-for-one. But this one-to-one relationship is quite
restrictive and often violated in real-world data. The (very frequent) situation
in which the variance of the residuals is larger than the mean is known as over-
dispersion. Under-dispersion occurs when the variance is too small; it is much
less commonly observed in social science data.
Poisson assumption ↔ E[Y] = var(Y)
Over-dispersion ↔ E[Y] < var(Y)
Under-dispersion ↔ E[Y] > var(Y) ↔ 0 < σ < 1
Over-dispersion can arise for several reasons, all of which have consequences
for model building and interpretation. At the simplest level, over-dispersion
may simply be the result of a more variable process than the Poisson distribu-
tion is capable of capturing, perhaps due to heterogeneity in the underlying
population. If this is the case then the model for the mean may still be
adequate, but the variance estimates will be too small, perhaps wildly so. But
over-dispersion may also arise for more complicated reasons that have both
substantive and modeling implications. For example, over-dispersion may arise
if there is an excess of zeros in the data or, most importantly, if events are
positively correlated (previous events increase the rate of subsequent events),
violating a basic assumption of the Poisson process. These challenges imply that
the model for the mean is no longer adequate, and we should expect problems
of inconsistency and potentially erroneous inference. In short, over-dispersion
should be viewed as a symptom that needs to be investigated. The results of
this investigation usually turn up substantively interesting aspects of the data-
generating process.
Graphical Methods
One visualization tool, the Poissonness plot (Hoaglin, 1980), examines the
distribution of the dependent variable relative to theoretically expected values
under a Poisson distribution.3 The horizontal axis is the number of events,
denoted y. The vertical axis is the so-called metameter. For the Poisson
distribution the metameter for count value y is given as $\log(y!\, n_y) - \log n$,
where $n_y$ is the observed frequency of y events. If the data conform to the
proposed distribution, the points should line up, similar to a quantile-quantile
(Q-Q) plot for continuous distributions. Moreover, a regression of the
metameter on y should have slope log(λ) and intercept −λ under the Poisson
distribution.
3 Hoaglin and Tukey (1985) extend the Poissonness plot to the negative binomial and binomial.
Each distribution has its own metameter calculation.
[Figure 10.3 about here: Poissonness plot, metameter versus number of events.]
figure 10.3 A Poissonness plot for international mediations data from Bercovitch
and Schneider (2000). If the data follow a Poisson distribution, then the points
should line up, and the slope of the regression should be log λ, whereas the intercept
should be −λ. With these data λ̂ = 3.5. The slope of the regression line is 3.7, and the
intercept is −23.
Figure 10.3 displays a Poissonness plot for the international mediations data.
The points clearly fail to line up on the regression line.4 If these data followed
a Poisson distribution we should observe an intercept of −3.5 and a regression
slope of log 3.5 = 1.25. Instead, we observe a slope of about 3.7 and an
intercept of −23. Based on the Poissonness plot, the mediation data do not
conform to the Poisson distribution. But a weakness of the plot is its inability
to tell us anything about over-dispersion per se or about a particular model fit
to those data.
4 The outlier is, unsurprisingly, the United States. Excluding the United States implies a λ̂ = 2.75.
The corresponding Poissonness plot yields an intercept of −12.5 and a slope of 2.7.
To address both these concerns, Kleiber and Zeileis (2016) advocate per-
suasively for a hanging rootogram, due to Tukey (1977). In a rootogram the
horizontal axis is again counts of events. The vertical axis is the square root of
the frequency, where the square root transformation ensures that large values
do not dominate the plot. Let $E[n_y]$ be the expected frequency of event count
value y under the proposed model. For a Poisson model, $\widehat{E[n_y]} = \sum_{i=1}^{n} f_P(y; \hat\lambda_i)$.
The vertical bars are drawn from $\sqrt{\widehat{E[n_y]}}$ to $\sqrt{\widehat{E[n_y]}} - \sqrt{n_y}$. In other words,
the vertical boxes hang down from the (square root of the) expected frequency and represent
how much the observed and expected frequencies differ at various values of
the outcome variable. The zero line presents a convenient reference. A bar
that fails to reach 0 means the model is overpredicting counts at that value
($E[n_y] - n_y > 0$). A bar that crosses 0 implies underprediction at a particular
value of y.
Figure 10.4 displays two rootograms. The plot on the left refers to the
Poisson regression (with exposure offset) from Table 10.1. We can see that
the Poisson model severely underpredicts 0 while also underpredicting large
counts. It overpredicts counts near the sample mean of 3.5. Wave-like patterns
and underpredictions of 0s are consistent with over-dispersion. But rootograms
can be used to describe model fit more generally, whether in- or out-of-sample.
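Both displays are easy to produce in R. A sketch, assuming a count vector y and a fitted count model m; distplot() is from the vcd package and rootogram() from the countreg package, both cited in this chapter's software notes:

library(vcd)
distplot(y, type = "poisson")   # Poissonness plot: metameter against counts

library(countreg)
rootogram(m)                    # hanging rootogram for the fitted model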
[Figure 10.4 about here: two hanging rootograms, square root of frequency versus number of mediations.]
figure 10.4 A hanging rootogram plotting expected versus actual counts. The curve
represents the square root of the expected frequency of mediations under the specified
model, while the vertical bars are drawn from the expected frequency to the observed
frequency (both in square roots). The display on the left is from the Poisson model
(with offset) from Table 10.1; the display on the right is from the negative binomial
model in Table 10.2.
Formal Tests
There are several tests for over-dispersion. Most of them are constructed based
on the assumptions that (1) the model for the mean is correct and (2) we can
think of over-dispersion as taking the form
$$\operatorname{var}(Y_i \mid x_i) = h_i\lambda_i + \gamma\, w(h_i\lambda_i),$$
where $w(\cdot)$ is some function. The most frequently used alternatives are $w(h_i\lambda_i) = h_i\lambda_i$
and $w(h_i\lambda_i) = (h_i\lambda_i)^2$. Cameron and Trivedi (1990) develop one regression-based
test based on this logic. Define
$$\hat{e}_i = \frac{(y_i - h_i\hat\lambda_i)^2 - y_i}{h_i\hat\lambda_i}, \tag{10.3}$$
then we can estimate the OLS regression (omitting the intercept)
$$\hat{e}_i = \gamma\,\frac{w(h_i\hat\lambda_i)}{h_i\hat\lambda_i} + \varepsilon_i.$$
For over-dispersion we are interested in testing the one-tailed hypothesis that
$\gamma > 0$ against the null of equi-dispersion, $\gamma = 0$. Applied to the Poisson model
(with offset) from Table 10.1 and assuming $w(x) = x$, we obtain $\hat\gamma = 8.9$ with
a one-sided p-value of 0.005, consistent with over-dispersion.
Gelman and Hill (2007) describe a version of the Pearson χ 2 test that takes
advantage of the fact that the standard deviation of the Poisson is equal to the
square root of the mean. Define the standardized (or Pearson) residuals from a
Poisson model as
$$z_i = \frac{y_i - h_i\hat\lambda_i}{\sqrt{h_i\hat\lambda_i}}.$$
Under the Poisson assumption the sum of the squared Pearson residuals should
follow a $\chi^2$ distribution with n − k degrees of freedom, giving a simple check
for over-dispersion.
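Both diagnostics are quick to compute. A sketch for a fitted Poisson model m, where mu holds the fitted means $h_i\hat\lambda_i$; the dispersiontest() function in the AER package implements the Cameron-Trivedi test:

mu <- fitted(m)
y <- m$y

# Cameron-Trivedi regression-based test; with w(x) = x the regressor
# w(mu)/mu is identically 1, so gamma is the intercept-only coefficient
# (halve the reported two-sided p-value for the one-tailed test)
e.hat <- ((y - mu)^2 - y) / mu
summary(lm(e.hat ~ 1))

library(AER)
dispersiontest(m, trafo = 1)    # equivalent canned version, w(x) = x

# Pearson chi-squared check: compare to chi^2 with n - k df
z <- (y - mu) / sqrt(mu)
pchisq(sum(z^2), df.residual(m), lower.tail = FALSE)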
10.4.1 Quasipoisson
One way to proceed is with a quasi-likelihood approach in which we specify
a model for the mean and a variance function rather than a full probability
model (see Section 7.4). If our model for the mean is λi = exp(xi β) and our
variance function is φV(λi ) = φλi , then the set of quasiscore equations reduces
to Equation 10.1. This implies that the solution to the Poisson score equations
and the solution to the quasiscore equations where the mean is modeled as
exp(xi β) is the same. In other words, the Poisson MLE for β̂ are the same as
the quasipoisson estimates. As a result the quasipoisson approach will not alter
point estimates of predicted outcomes.
Where quaispoisson and the full Poisson GLM differ is in the standard
errors. In the quasipoisson setup we estimate the dispersion parameter, φ, using
Equation 7.3, which is exactly the formula for the Pearson residuals we used
earlier divided by n − k. Quasipoisson standard errors are then calculated
from Equation 7.2. With φ > 1 the quasipoisson standard errors will be
larger than those from the Poisson GLM. While we do not have access to the
maximized likelihood or information criteria from the quasipoisson model, we
can still generate predicted counts and use simulation techniques to generate
uncertainty estimates.
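In R the quasipoisson family implements this directly; a sketch continuing the hypothetical data frame from above:

qp.fit <- glm(y ~ x + offset(log(h)), family = quasipoisson, data = dat)
summary(qp.fit)$dispersion   # phi-hat: squared Pearson residuals / (n - k)
# point estimates equal the Poisson MLE; only the standard errors change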
6 The choice of the gamma distribution is somewhat arbitrary. Its primary convenience is that it
yields a closed-form expression for the marginal distribution of Yi . In Bayesian terminology, the
gamma distribution is the conjugate prior for the Poisson.
$$Y \sim f_\Gamma(y; a, b) = \frac{b^a y^{a-1}}{\Gamma(a)} \exp(-by),$$
with shape parameter a > 0 and rate (or inverse scale) parameter b > 0.
$E[Y] = a/b$ and $\operatorname{var}(Y) = a/b^2$.
Fixing a = b = α yields the one-parameter gamma distribution,
$f_\Gamma(y; \alpha)$, with expected value of 1 and a variance of $\alpha^{-1}$.
$$Y_i \mid \lambda_i \sim f_P(\lambda_i), \qquad \lambda_i = \exp(x_i\beta + u_i) = \exp(x_i\beta)\exp(u_i),$$
where $u_i$ is an error term in the expression for the Poisson mean, $\lambda_i$. If we let
$\mu_i = \exp(x_i\beta)$ and $\nu_i = \exp(u_i)$ we can complete the model:
$$\lambda_i = \mu_i\nu_i, \qquad \nu_i \sim f_\Gamma(\alpha).$$
The $\nu_i$ are now unit-mean multiplicative error terms for the Poisson mean.
As before, $E[Y_i] = \lambda_i$, but now $E[\lambda_i] = \mu_i$, implying that $E[Y_i] = \mu_i$. The
$\mu_i$ are typically modeled in a log-linear fashion as $\exp(x_i\beta)$. Integrating over
$\nu_i$ gives us the marginal distribution for $Y_i$, which is a negative binomial
distribution.
For μ, α > 0, if $Y \sim f_{Nb}(y; \mu, \alpha)$, then
$$\Pr(Y = y \mid \mu, \alpha) = \frac{\Gamma(y + \alpha)}{y!\,\Gamma(\alpha)} \left(\frac{\alpha}{\alpha + \mu}\right)^{\alpha} \left(\frac{\mu}{\alpha + \mu}\right)^{y}. \tag{10.4}$$
The negative binomial has $E[Y] = \mu$ and $\operatorname{var}(Y) = \mu(1 + \alpha^{-1}\mu) = \mu + \alpha^{-1}\mu^2$.
Via the parameter α, the negative binomial allows the variance to be greater
than the mean. Cameron and Trivedi (2013) refer to the version just described
as the NB2 model, with the 2 referring to the variance’s quadratic dependence
on the mean. They develop other versions as well. For example, if we substitute
$\alpha\mu_i$ for α in Equation 10.4, then we arrive at the NB1 model, with $\operatorname{var}(Y) =
\mu(1 + 1/\alpha)$. More generally, substituting $\alpha\mu^{2-p}$ for α in Equation 10.4 yields
the NBp model, with $\operatorname{var}(Y) = \mu(1 + \alpha^{-1}\mu^{p-1})$. Greene (2008) provides a
general expression of the NBp log-likelihood:
$$\mu_i = \exp(x_i\beta), \qquad r_i = \frac{\alpha}{\alpha + \mu_i}, \qquad q_i = \alpha\mu_i^{2-p},$$
$$\log L(\beta, \alpha \mid X, y, p) = \sum_{i=1}^{n} \Big[ \log\Gamma(y_i + q_i) - \log\Gamma(q_i) - \log\Gamma(y_i + 1) + q_i \log r_i + y_i \log(1 - r_i) \Big]. \tag{10.5}$$
The NB2 model is by far the most commonly used. Differences between NB1
and NB2 appear to be small in most applications. As these models are not
nested versions of one another, choosing between them is best accomplished
using information criteria and out-of-sample fit heuristics. All these versions of
the negative binomial estimate the k regression parameters along with α, which
governs the mean-variance relationship.
While both the quasipoisson and negative binomial models allow for over-
dispersion, they approach it differently. The quasipoisson model directly esti-
mates a dispersion parameter from the data and uses it to adjust standard
errors while retaining the Poisson estimating equation for β. The negative
binomial model retains the φ = 1 assumption of the Poisson and instead
uses α to account for over-dispersion; α is not a dispersion parameter in the
exponential family sense. This fact is directly visible in the R summary output
for the negative binomial GLM – Dispersion parameter for Negative
Binomial(0.3326) family taken to be 1 – where 0.3326 is the estimate
for $\alpha^{-1}$. As we can see from Equation 10.5, the negative binomial
likelihood differs from the Poisson even when the linear predictor terms are
identical. Model estimates and implications can differ between the two.
Within the negative binomial class of models, the α parameter is a weight in a
polynomial function of the mean. As the underlying polynomial changes, α also
changes. As a result the α from an NB2 is not directly comparable to α from an
NB1, etc., notwithstanding the fact that most texts use the same notation across
model parameterizations. However, all NBp models collapse to a simple Poisson
as α → ∞. Because the Poisson is a limiting case of the negative binomial we
can implement another over-dispersion test by constructing a likelihood ratio
between a Poisson and negative binomial model.7
10.4.3 Mediations
We return to the international mediations example to examine how the
quasipoisson and negative binomial models perform. Table 10.2 displays results
7 Because we obtain a Poisson distribution at the boundary of the negative binomial parameter
space, the likelihood ratio between a Poisson and negative binomial model has a nonstandard
distribution, with half of its mass at 0 and half as a $\chi^2_1$. As a result, the critical value for a test
at the a level is the $\chi^2_1$ value associated with a 2a test. For example, the critical value for a 95%
test is 2.71 rather than the 3.84 normally associated with a $\chi^2_1$.
for quasipoisson, NB2, and NB1 models. All three models include an offset
for log duration to account for unequal exposure windows.
The quasipoisson returns exactly the β̂ from the third model in Table 10.1.
What has fundamentally changed are the standard errors, which are substan-
tially larger, reflecting more uncertainty, owing to the over-dispersion that
is no longer being ignored. In general, ignoring over-dispersion will lead to
dramatically underestimated standard errors. Note that the estimated disper-
sion parameter for the quasipoisson, φ̂, is exactly the sum of squared Pearson
residuals calculated for the over-dispersion test divided by the residual degrees
of freedom, 1,477/143.
Based on the information criteria, both of the negative binomial models
represent large improvements over the OLS and Poisson models in Table 10.1.
The right panel of Figure 10.4 displays the rootogram for the NB2 model,
which clearly fits the data better than the Poisson alternative. It is far more
accurate in its predictions of 0s and does not show the wave-like pattern of
over- and underprediction across outcome values of the Poisson specification.
The negative binomial models are preferable to the Poisson or quasipoisson.
But, based on the in-sample fit statistics, the two negative binomial models are
nearly identical, a common occurrence.
Consistent with the over-dispersion in the data the negative binomial models
return substantial estimates for α. In R the glm.nb function fits the NB2 model
and uses the parameterization and notation in Venables and Ripley (2002). In
their parameterization of the negative binomial distribution they use “θ ,” which
is equivalent to $\alpha^{-1}$ in our notation.8 We can recover an approximate standard
error for α̂ from glm.nb output by calculating $\hat\sigma_\theta/\hat\theta^2$. In the NB2 case we
therefore recover $\hat\alpha = 3$, $\sigma_{\hat\alpha} = 0.52$.
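A sketch of the NB2 fit and the θ-to-α conversion just described; glm.nb() stores theta and SE.theta in the fitted object, and odTest() in the pscl package implements the boundary-corrected likelihood ratio test discussed above. Variable names are hypothetical:

library(MASS)
nb2.fit <- glm.nb(y ~ x + offset(log(h)), data = dat)

alpha <- 1 / nb2.fit$theta
se.alpha <- nb2.fit$SE.theta / nb2.fit$theta^2   # delta-method approximation

library(pscl)
odTest(nb2.fit)   # Poisson vs. negative binomial likelihood ratio test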
In the mediation example the negative binomial models give substantively
different results from the quasipoisson; per capita GDP is not a strong predictor
of mediation demand once we account for over-dispersion. Differences between
the quasipoisson and negative binomial sometimes arise due to differences in
the weights attached to large counts in the fitting of the model (Ver Hoef
and Boveng, 2007). But in this example the negative binomial model fits the
data substantially better than the (quasi)Poisson. This divergence can happen
when the process generating over-dispersion is more complicated than simple
heterogeneity in the underlying population. For example, the negative binomial
distribution and over-dispersion can result from positive contagion across
events (Gurland, 1959; Winkelmann, 1995), something that appears to be at
work here.
Interpretation of the negative binomial models follows our usual procedures.
For example, take China, a member of the UNSC but relatively poor during this
8 The dnbinom function (and other associated functions) in R can take several parameterizations of the
negative binomial distribution. In our notation the mu argument to dnbinom corresponds to μ
and size corresponds to α.
time. Had China not been on the UNSC, the model predicts that it would have
been requested as a mediator between two and three times (λ̂ = 2.54, with a
standard error of 0.49). But China as a UNSC member is expected to have been
asked to mediate 31 times. Doubling China’s per capita GDP has no effect on
these predictions.
10.4.4 Under-Dispersion
While under-dispersion is less common than over-dispersion, it does arise.
Under-dispersed count data can be thought of as having some kind of negative
contagion. For example, the counties neighboring a toxic waste site will be
less likely to create their own, for a variety of reasons.
Many of the same tools used for diagnosing over-dispersion can be repur-
posed for under-dispersion. We can examine the mean of the sample relative
to its variance. We can use the Cameron-Trivedi regression-based test in
Equation 10.3, only specifying an alternative hypothesis of γ < 0. Note,
however, that dispersion tests relying on a likelihood ratio comparing a Poisson
model to a negative binomial will not capture under-dispersion.
We can model under-dispersed data using the quasipoisson approach (in
which case we should recover φ̂ ∈ (0, 1)). The quasipoisson model for under-
dispersion entails all the same restrictions and drawbacks we encountered
for over-dispersion. Several fully parameterized likelihood approaches exist.
King (1989a) details the “continuous parameter binomial” (CPB) for under-
dispersion and the generalized event count model that estimates dispersion
directly. The generalized Poisson model enables estimation of over-dispersion
and some forms of under-dispersion (Hilbe, 2014). The generalized Poisson dis-
tribution entails some restrictions that limit the degree of under-dispersion that
it can capture. The Conway-Maxwell-Poisson (Conway and Maxwell, 1962) is
a relatively recent addition to the applied literature; its two-parameter structure
does not readily accommodate a linear predictor for the mean, making it
difficult to construct easily interpretable regression models (Sellers et al., 2012).
10.5 hurdling 0s and 1s
[Figure about here: frequency distribution of the observed event counts (values 0 through 156).]
$$\pi_i = \begin{cases} 0 & \text{for } y_i = 0, \\ 1 & \text{for } y_i > 0, \end{cases}$$
$$\text{stochastic:}\quad \pi_i \sim f_B(\theta_i), \qquad Y_i^* \sim f_P(y_i; \lambda_i \mid y_i > 0),$$
$$\text{systematic:}\quad \theta_i = \operatorname{logit}^{-1}(z_i\delta), \qquad \lambda_i = \exp(x_i\beta).$$
This specification highlights four things. First, the hurdle model is convention-
ally set up such that crossing the hurdle (i.e., yi > 0) is coded as a “success”
for the logit portion. Second, different vectors of covariates (with different
regression parameters) can govern the “hurdle” (zi ) and the count (xi ) processes.
Third, the hurdle model can also accommodate too few zeros relative to our
expectations. Fourth, the process describing positive counts is a zero-truncated
Poisson distribution. Specifically,
$$f_P(y_i; \lambda_i \mid y_i > 0) = \frac{f_P(y_i \mid \lambda_i)}{1 - f_P(0 \mid \lambda_i)} = \frac{\lambda_i^{y_i}}{(\exp(\lambda_i) - 1)\, y_i!}.$$
Thus the probability statement for $Y_i$ is
$$\Pr(Y_i = y_i) = \begin{cases} 1 - \theta_i & \text{for } y_i = 0, \\[4pt] \theta_i\, \dfrac{\lambda_i^{y_i}}{(\exp(\lambda_i) - 1)\, y_i!} & \text{for } y_i > 0. \end{cases}$$
9 The Vuong test is used to test non-nested models; it has been frequently employed as a way to test
for excess zeros in count models (Desmarais and Harden, 2013). But there is controversy about
its applicability in these situations (Wilson, 2015), so we emphasize broader model-selection
heuristics.
The hurdle negative binomial model is derived similarly, replacing the Poisson
with the negative binomial distribution. The hurdle Poisson allows for some
over-dispersion,10 but the negative binomial is more flexible in its ability to
model the variance of the integer counts.
10 This is owing to the fact that the variance of a zero-truncated Poisson random variable, X, is
$E[X](1 + \lambda - E[X])$, where $E[X] = \lambda\exp(\lambda)/(\exp(\lambda) - 1)$. As a result, the variance is not restricted to be
exactly equal to the mean, but they are tied together.
11 The model developed here uses a logit link, but probit, cloglog, or others are feasible.
Note two important differences between the ZIP and the hurdle model. First,
the ZIP/ZINB is conventionally parameterized such that a zero is counted as a
“success” for the Bernoulli part of the model. In other words, the Bernoulli
model describes the probability that an observation is an always-zero as
opposed to the probability of crossing the hurdle. Second, the ZIP/ZINB does
not truncate the count distribution. As a result, the ZIP probability statement
differs from the hurdle Poisson model:
$$\Pr(Y_i = y_i) = \begin{cases} \theta_i + (1 - \theta_i)\, f_P(0 \mid \lambda_i) & \text{for } y_i = 0, \\[4pt] (1 - \theta_i)\, \dfrac{\exp(-\lambda_i)\,\lambda_i^{y_i}}{y_i!} & \text{for } y_i > 0. \end{cases}$$
$$\begin{aligned} \log L_{ZIP} = {} & \sum_{i=1}^{n} \pi_i \log\!\left[\exp(z_i\delta) + \exp(-\exp(x_i\beta))\right] \\ & + \sum_{i=1}^{n} (1 - \pi_i)\left(y_i x_i\beta - \exp(x_i\beta)\right) \\ & - \sum_{i=1}^{n} \log\!\left(1 + \exp(z_i\delta)\right) - \sum_{i=1}^{n} (1 - \pi_i)\log y_i!. \end{aligned}$$
This expression is difficult to maximize largely because the first sum yields
a complicated gradient with no closed form. Introducing a latent/unobserved
variable indicating whether i is an always-zero can separate the δ from the β in
the maximization problem. We can proceed using EM to iteratively maximize
the log-likelihood (see Section 4.2.3).
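In R both models are available in the pscl (and countreg) libraries; a sketch with hypothetical count-equation covariates x1 and zero/hurdle-equation covariates z1:

library(pscl)

# hurdle: binary part models Pr(y > 0); counts are zero-truncated
h.fit <- hurdle(y ~ x1 | z1, data = dat, dist = "negbin")

# zero-inflation: binary part models Pr(always zero); counts not truncated
zi.fit <- zeroinfl(y ~ x1 | z1, data = dat, dist = "negbin")

AIC(h.fit, zi.fit)   # compare model fits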
drought and civil conflict, and a spatial lag of the dependent variable.12 The
BUTON appears as Table 10.3.13
Examining the model-fit statistics, we see clear evidence that the ZINB
model is preferred to both the NB2 and the hurdle model. The zero-inflation
hypothesis – that there are some observations that are effectively immune to
atrocities in this period – is in better agreement with these data.
The hurdle and ZINB models frequently give very similar results. Not
entirely so in this example. Aside from model fit statistics, inspection of the
drought coefficients in the models for zeros highlights this (recall that the
ZINB and hurdle models code successes in opposite ways!). In the zero-inflation
model, drought is not a strong predictor of a cell being an always-zero, yet in the
hurdle model, drought is a strong predictor of crossing from a zero to a positive
count. Both models, however, predict that droughts increase the number of
events.14
The ZINB model also illustrates how covariates can influence the predicted
outcomes through two channels: whether an observation is likely to be always-
zero and how many events are predicted to occur. In these data, drought and the
proportion of land in urban settlements affect atrocities through their increase
in the count; they are not good predictors of whether a cell is an always-zero,
given the other covariates in the model.
10.6 summary
Understanding how we can model count data begins with the Poisson dis-
tribution. But the Poisson's assumption of equi-dispersion means that it is
rarely sufficient, since social science data are often over-dispersed and frequently
characterized by an inflated number of zeros, and may also have “natural”
thresholds between successive numbers of events. As a result, it is often useful
to employ a simple mixture model in which a binomial (or other) process
is combined with a Poisson process, to capture the full range of the data-
generating processes.
12 The spatial lag is the total number of atrocities in immediately adjacent cells. Cells differ in
area, becoming smaller as we move away from the equator. We might imagine that this would
allow for different observational windows. Including cell area as an offset made no difference,
so we omit it. But good for you if the issue concerned you.
13 The zeroinfl and hurdle procedures in R return log(theta) and its standard error.
Recall that theta corresponds to $\alpha^{-1}$ in our notation for the negative binomial distribution.
By the invariance property of the MLE, we know that $\widehat{\log\theta} = \log\hat\theta$. We also know that the
MLE is asymptotically normal. So we can apply the delta method to calculate the approximate
standard error for $\hat\theta$ as $\hat\theta\,\hat\sigma_{\log\theta}$. These are the quantities reported in the ZINB and hurdle columns
of Table 10.3.
14 Moving drought from 0 to 2 increases the number of predicted events from 0.03 to 0.04 in the
ZINB and from 0.01 to 0.02 in the hurdle model, with no neighbors experiencing atrocities but
with some civil conflict and all other covariates set to sample means.
Applications
Nanes (2017) uses negative binomial models to describe Palestinian casualties
in the West Bank and Gaza. Edwards et al. (2017) use both OLS and negative
binomial models to describe the number of US cities and counties that are split
by Congressional districts.
Past Work
King (1988) is an early discussion of count models in political science. King
(1989b); King and Signorino (1996) develop an alternative “generalized event
count” model for over- and under-dispersion. Land et al. (1996) compare
Poisson and negative binomial models with semiparametric versions. Zorn
(1998) compares zero-inflated and hurdle models in the context of the US
Congress and Supreme Court.
Advanced Study
The classic text in this field is Cameron and Trivedi (2013). Greene (2008)
describes the likelihoods and computation for the NBp and related models; see
also Hilbe (2008). Mebane and Sekhon (2004) discuss and extend the use of
the multinomial model in the context of counts across categories for multiple
units.
Software Notes
The countreg package (Zeileis and Kleiber, 2017) collects many of the models
and statistical tests for count data previously scattered across multiple libraries,
including pscl (Zeileis et al., 2008). The VGAM package (Yee, 2010) enables
generalized Poisson regression. Friendly (2000) describes many useful graphical
displays for count data, including the rootogram and distribution plots. Many
of the tools described in that volume are collected in the vcd R library (Meyer
et al., 2016). The likelihood ratio test for the negative binomial model relative
to a Poisson is implemented as odTest in R's pscl, AER, and countreg
libraries.
part iv
ADVANCED TOPICS
11
Strategies for Temporal Dependence
Up to this point in the book we have derived likelihood functions and evaluated
models under the assumption that the $Y_i$ are conditionally independent, i.e.,
$\Pr(y_1, y_2, \ldots, y_n \mid \theta) = P(y_1 \mid \theta_1) \times P(y_2 \mid \theta_2) \times \cdots \times P(y_n \mid \theta_n)$. There are many threats
to this assumption. For example, observations close together in space may all
be influenced by some neighborhood-specific factor, such as a leaky nuclear
reactor. This is known as spatial dependence. Dependence can also arise when
events or observations take place over time.
There are several ways of thinking about temporal dependence. For example,
we can consider frequency rather than time (Beck, 1991). Another approach
considers specific values or events in a temporal sequence. In such time series
we have repeated observations of some unit i at fixed intervals t, t + 1, . . .. We
are concerned that there might be correlation across them, often referred to
as serial (auto)correlation (Hamilton, 1994). Time series models, such as an
autoregressive integrated moving average (ARIMA) and many others, typically
organize the data to examine their characteristics in the time domain. There is
a vast and highly developed literature on these topics. We do not pursue them
here because that would require an entire volume. Instead, we concentrate on
a third approach to temporal dependence: modeling the time between or until
discrete events.
11.1 introduction
Social scientists are frequently concerned with how long things last and how
often they change. How long will one nation be at war with another? How long
is a leader’s tenure in office? How long after an election until a government is
formed? How long can we expect a worker to remain unemployed? Analysis
of this sort of data goes by different names across disciplines. In political
science and sociology, models of these processes are most commonly called
Case   Duration in Months   X1       X2     X3     X4     Censored?
1      7.00                 93.00    1.00   1.00   5.00   Yes
2      27.00                62.00    1.00   1.00   1.00   Yes
3      6.00                 97.00    1.00   1.00   1.00   Yes
4      49.00                106.00   1.00   1.00   1.00   No
5      7.00                 93.00    1.00   1.00   4.00   Yes
...
314    48.00                60.00    0.00   1.00   1.00   No
1 The distinction between panel and TSCS data depends on whether n (the number of units) is
much greater than the number of time periods, T. When n >> T we are in a panel world. T >> n
is referred to as TSCS data. The distinction is not hard-and-fast and derives from whether a
particular estimator relies on asymptotics in T or n to derive its properties. See Beck and Katz
(1995).
table 11.2 Counting process data, from Ahlquist (2010b).
Index  id         Country        Event Time  Start  Stop  Status  tsle.ciep  lag.infl  lag.unemp  growth  enpp
1      AUS1974Q1  Australia      1           0      1     0       36.00      16.38     2.13       3.79    2.52
2      AUS1974Q2  Australia      2           1      2     0       44.00      16.55     2.10       3.79    2.52
3      AUS1974Q3  Australia      3           2      3     0       4.00       15.90     2.11       3.79    2.52
4      AUS1974Q4  Australia      4           3      4     0       12.00      15.25     2.64       3.79    2.52
...
36     AUS1982Q4  Australia      36          35     36    0       65.00      9.12      7.02       2.92    2.64
37     AUS1983Q1  Australia      37          36     37    1       74.00      8.54      8.72       −3.00   2.23
...
2,381  USA1999Q1  United States  101         100    101   0       12.50      1.29      4.43       3.39    1.99
2,382  USA1999Q2  United States  102         101    102   0       25.00      1.44      4.29       3.39    1.99
2,383  USA1999Q3  United States  103         102    103   0       37.50      1.62      4.25       3.39    1.99
2,384  USA1999Q4  United States  104         103    104   0       50.00      1.80      4.25       3.39    1.99
[Figure about here: a stylized BTSCS conflict series. Rows: Numeric Data (0 0 0 1 1 1 0 0 0 1 1 0), Conflict Instances, Conflict Onset, Conflict Termination, and the corresponding peaceful and conflict spells.]
The first row contains the 0s and 1s that represent whether a particular country in the sample set being analyzed has
a conflict during a particular year. Much conflict data are organized as a collec-
tion of such vectors in BTSCS format. The second row is a visual representation
of the same information, with filled squares representing conflicts, and empty
ones shown for years in which there was no conflict in that country. The third
row uses shading to identify the years in which there was an observed transition
from peace to conflict whereas the fourth row identifies conflict termination–
years of transition from conflict to peace. The row labeled peaceful spell
#1 shows that the first four time periods make up the first spell of non-conflict
until (and including) the year in which the peace ends and the conflict begins.
This spell is left-censored because we do not observe when it begins. The final
row encodes the years consisting of the second spell of conflict, one in which we
observe both the onset and termination. Standard analyses often focus on the
numeric data, but duration models focus on explaining the lengths of the spells.
Standard analyses use covariates measured at each t, whereas duration
models are concerned, if at all, with what the covariates look like during
the spells.
adoption by US states, we only observe whether a state adopts the law in a year;
all adoptions within a year are considered equivalent.
Beck et al. (1998) argue that fitting a logit, probit, or complementary log-log
model to this pooled, stacked data is both appropriate and equivalent to event
history modeling so long as the analyst includes either (1) indicator variables for
event time or (2) (cubic) splines in event time.2 The event time indicators that
Beck et al. have in mind are T − 1 dummy variables for the number of periods
since that unit last experienced an event. These time dummies have the direct
interpretation of a “baseline hazard,” i.e., the probability of an event occurring
in any particular interval given that it has not yet occurred and before we
consider covariates. It turns out this is equivalent mathematically to a duration
model, developed in the next section.
Formally, the logit model with time indicators is
$$Y_{it} \sim f_B(y_{it}; \theta_{it}), \qquad \theta_{it} = \operatorname{logit}^{-1}\!\left(x_{it}\beta + \tau_t 1_{t_i}\right),$$
where $1_{t_i}$ is a dummy variable that equals one whenever unit i has gone t periods
without an event and zero otherwise.
This BTSCS approach is simple to implement when the data are organized
as a panel, rather than as a collection of episodes. BTSCS models can be
interpreted along the lines of simple binary GLMs. But these models do
require explicit description of temporal dependence across observations in the
form of time indicators or splines. Carter and Signorino (2010) highlight two
problems here. First, splines are complicated and often implemented incorrectly,
in addition to being difficult to interpret. Second, a common problem with
the indicator-variable approach is the small number of observations with
very high times-to-failure. These small numbers can induce perfect separa-
tion. Carter and Signorino (2010) propose using a cubic polynomial in
event time $(\tau_1 t_i + \tau_2 t_i^2 + \tau_3 t_i^3)$ rather than splines or time dummies to get
around these problems. In our experience a simple linear trend often
works well.
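A sketch of the BTSCS logit with the Carter-Signorino cubic polynomial, assuming hypothetical unit-period data dat with a binary outcome event and event time t (periods since the last event):

cs.fit <- glm(event ~ x + t + I(t^2) + I(t^3),
              family = binomial(link = "logit"), data = dat)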
The BTSCS approach does have weaknesses. It requires data organized in
unit-period form but, at the same time, it does not force explicit consideration of censoring.
The BTSCS approach does not direct attention to the time between events or
how covariates may affect that interval, focusing instead only on event
occurrence. While the BTSCS approach continues to see use, there are also
flexible and easy-to-estimate duration models, to which we now turn.
2 Splines are piecewise polynomials, with the pieces defined by the number of “knots,” or locations
in which the values of the two polynomials on either side are constrained to be equal. Temporal
splines have the advantage of using many fewer degrees of freedom (one per knot) than indicator
variables for each period, but the number and location of the knots is not an obvious decision.
We do not take up splines further here. We also note that there are many other ways of addressing
this degrees of freedom problem than just splines.
The hazard function describes the probability that a spell will end at t given
that it has survived until t. The value of the hazard function at t is referred
to as the hazard rate. Empirically, the hazard rate is simply the proportion of
units still at risk for failure at t that fail between t and t + 1. We can sum the
hazard rates in continuous time to express the “accumulation” of hazard. This
function, H(t), is called the cumulative or integrated hazard rate.
Theorem 11.1 (Integrated hazard rate).
$$H(t) \equiv \int_0^t h(x)\,dx = -\log S(t).$$
11.4.2 A Likelihood
The problem in constructing a likelihood for survival data is that we generally
do not observe failure times for all units. Units for which our observation
window ends before we observe failure are right-censored. Units that begin
their risk exposure prior to the beginning of the observation period are called
left-censored. The models discussed herein are capable of incorporating both
left- and right-censoring but, for brevity, we focus on the far more common
right-censored case. These observations contribute information on survival up
to the censoring point; they do not contribute any information about failure.
Thus, if we define a censoring indicator δi , which equals 1 if i is censored and
0 otherwise, the generic form of the likelihood is
228 Strategies for Temporal Dependence
n
L= f (ti )1−δi S(ti )δi . (11.3)
i=1
Proportional Hazards
The more common way of writing down duration models is in the hazard rate
form. The exponential and Weibull models, as well as the Cox model described
below, are often expressed in terms of hazard rates:
$$h_i(t) = h_0(t)\exp(x_i\beta), \tag{11.4}$$
where $h_0(t)$ is the baseline hazard, possibly a function of time (but not
covariates), and common to all units. The specification for $h_0$ determines the
model. For example, if $h_0(t) = pt^{p-1}$, then we have the Weibull model; with
p = 1 we again have the exponential.
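Parametric duration models are available in the survival package; note that survreg() fits the Weibull and exponential in accelerated failure time form rather than the proportional hazards form of Equation 11.4. A sketch with hypothetical duration, event, and x:

library(survival)
w.fit <- survreg(Surv(duration, event) ~ x, data = dat, dist = "weibull")
e.fit <- survreg(Surv(duration, event) ~ x, data = dat, dist = "exponential")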
Tied Events
The derivation of the Cox model assumed that no two units failed in the same
time interval. Indeed, if we were really observing events in continuous time, then
the chances of two events actually occurring at the same instant are vanishingly
small. However, data are aggregated in time slices (seconds, weeks, years). In
practice ties occur regularly.
One of the great strengths of the Cox model is its flexibility in handling tied
events. Several different methods have emerged to adjust the partial likelihood
to accommodate ties. They go by the names Breslow, Efron, averaged or exact
partial, and exact discrete or exact marginal. The Breslow method is the default
method for most statistical packages but not in R. It is generally held to be the
least accurate, with this problem becoming more severe the more ties there are.
The Efron method tends to work better (and is the default in R). The two
“exact” methods are the most precise, but can impose severe computational
burdens if there are a lot of ties.
Cox-Snell   Cox-Snell residuals are used to examine overall model fit. They
should be distributed as unit exponential. Residual plots allow
us to visually examine whether this is (approximately) true for a
particular model.
Schoenfeld   Schoenfeld residuals are used for evaluating the proportional
hazards property. Box-Steffensmeier and Jones (2004, 121) note that
Schoenfeld residuals can “essentially be thought of as the observed
minus the expected values of the covariates at each failure time”
(emphasis added).
Martingale   Martingale residuals are the most intuitive to think about but
perhaps the most mathematically complex. They are given by
the observed censoring indicator minus the expected number of
events, as given by the integrated hazard rate. Martingale residuals
are $\hat{M}_i = \delta_i - \hat{H}_0(t_i)\exp(x_i\hat\beta)$, where $\delta_i$ is an indicator of the
event for observation i, and $\hat{H}_0(t_i)$ is the estimated cumulative
hazard at the final time for observation i. Martingale residuals
are plotted against included (or possibly excluded) covariates to
evaluate whether they are included appropriately.
[Table about here: Cox model estimates of coalition duration, reporting β̂, exp(β̂), and σ_β̂ for each covariate.]
[Figure about here: scaled Schoenfeld residuals, Beta(t), plotted against time for polarization, fractionalization, formation_attempts, and opposition_party.]
in Table 11.4. For none of the variables are we able to reject the null of no
relationship. There is no reason to believe the proportional hazards assumption
is violated here.
Interpreting Results
A quick glance at the (exponentiated) coefficients in the BUTON implies that,
for example, requiring a vote of investiture increases the hazard of coalition
termination by about 74%. Coalitions controlling a legislative majority have
about a 38% longer survival time than those in the minority, all else constant.
But, as usual, the nonlinear form of the model makes it difficult to envision
what we mean by “all else constant.” If we want to interpret the model on
something like the scale of the dependent variable, more work is needed. We
can construct meaningful scenarios and then generate model predictions on an
interpretable scale.
                       ρ       χ²     p-Value
Majority Government    0.00    0.01   0.94
Investiture           −0.05    0.52   0.47
Volatility            −0.01    0.05   0.81
Polarization          −0.01    0.01   0.89
Fractionalization     −0.06    0.86   0.35
Crisis                 0.01    0.07   0.79
Formation Attempts    −0.03    0.26   0.61
Opposition Party      −0.06    1.23   0.26
GLOBAL                   NA    2.93   0.93
Hazard rates are often difficult to understand or convey to audiences for two
reasons: they are conditional, and the sign of an estimated coefficient in a propor-
tional hazards model is the opposite of its implication for survival. Presenting
results on the survival scale is often easier to comprehend. Figure 11.3 displays
the expected survival times for governments depending on whether the cabinet
requires a vote of investiture. We see that only about 35% of governments
requiring investiture are expected to survive for at least twenty months; the rate
is about 55% for those not needing such a vote. Put another way, we expect
about half of governments requiring an investiture vote to have fallen at around
15 months, compared to about 22 months for those not requiring a vote. This
particular plot leaves off estimates of uncertainty around these predictions, but
our standard simulation techniques are capable of generating them.
Code Example 11.1 produces the Cox analysis and graphics.
[Figure 11.3 about here: estimated survival curves, Ŝ(t) against months, for cabinets requiring and not requiring a vote of investiture.]
figure 11.3 Expected government survival times depending on whether the cabinet
requires a vote of investiture, with all other covariates held at central tendencies.
5 Stratification in a Cox model estimates separate baseline hazard rates for different values of
the stratified variable. Competing risks models are those that model multiple types of transitions
at once. “Frailty” models are a form of random coefficient or “random effects” Cox model in
which observational units are heterogeneous in the propensity to survive.
library(survival)
library(foreign)

sdat <- read.dta("coalum2.dta")
cox.fit <- coxph(Surv(time = duration, event = censor12) ~ majority_government +
                   investiture + volatility + polarization + fractionalization +
                   crisis + formation_attempts + opposition_party, data = sdat)
cox.zph(cox.fit)  # testing the proportional hazards assumption

deets <- coxph.detail(cox.fit)
times <- c(0, deets$time)   # observed failure times
h0 <- c(0, deets$hazard)    # hazard evaluated at sample means of the covariates
x0 <- 0 - cox.fit$means
h0 <- h0 * drop(exp(t(coef(cox.fit)) %*% x0))  # baseline hazard

x.inv <- c(1, 1, median(sdat$volatility),      # investiture scenario
           mean(sdat$polarization),
           mean(sdat$fractionalization), mean(sdat$crisis),
           median(sdat$formation_attempts), mean(sdat$opposition_party))
x.ninv <- x.inv
x.ninv[2] <- 0  # no-investiture scenario

h.inv <- h0 * drop(exp(t(coef(cox.fit)) %*% x.inv))
h.ninv <- h0 * drop(exp(t(coef(cox.fit)) %*% x.ninv))
Sinv <- exp(-cumsum(h.inv))    # survival function, investiture
Sninv <- exp(-cumsum(h.ninv))  # survival function, no investiture

plot(times, Sinv, type = "l", xlab = "Months", lwd = 2,
     ylab = expression(hat(S)(t)), bty = "n", las = 1,
     ylim = c(0, 1), xlim = c(0, 50))
lines(times, Sninv, lwd = 2, lty = 2)
text(x = 20, y = c(0.2, 0.7), labels = c("investiture", "no investiture"))
Africa and the Middle East experienced irregular transitions in the past decade.
The important point from a modeling perspective is identifying countries, or
in this case country-months, which are “at risk” of failure. From a policy
perspective these are the places where estimating an expected duration until
the next event makes sense.
Figure 11.4 illustrates the intuition behind this approach. As shown in the
left panel, there are two types of polities. First are those that may have had an
event but essentially are immune from further events (Country B). These include
countries that never had any events but are not shown in this illustration.
The second type of polity is at risk for future events (Country A). The split-
population approach models first the separation of locations into type A or
B, denoted by the if in the panel. The next part of the model determines
the duration of time until the next event, denoted by when. The right panel
illustrates the differences in base hazard rates under the assumption that all
locations have the same risk profile (the standard Weibull) compared to the
baseline risk that assumes the population of locations consists of two types:
those at risk and those immune from risk.
The basic likelihood of this kind of situation may be thought of as a mixture
of two distributions: a Bernoulli distribution determining whether a unit is at
risk and then a second distribution describing duration. We define the variable
$\pi_i(t)$ to be an indicator variable equal to one if $t_i$ is part of a spell that ultimately
ends in an observed event of interest and 0 otherwise. We can build the model as
$$\begin{aligned} T_i &= \min\{Y_i^*, C_i^*\}, \\ \pi_i &\sim f_B(\theta_i), \\ \theta_i &= \operatorname{logit}^{-1}(z_i\gamma), \\ T_i &\sim f(t_i; \lambda_i), \\ \lambda_i &= \exp(-x_i\beta), \\ S(t_i \mid x, z) &= \theta_i S(t_i) + (1 - \theta_i). \end{aligned}$$
Building on Equation 11.3, we can derive the likelihood as a product of the risk
and duration components:
$$L = \prod_{i=1}^{n} \left[\theta_i f(t_i)\right]^{1-\delta_i} \times \left[(1 - \theta_i) + \theta_i S(t_i)\right]^{\delta_i},$$
where f (ti ) is the failure rate at time ti , S(ti ) is the survival function, and δi is
the indicator of right-censoring. The split-population model is set up for two
populations, one of them at risk for an event, the other “immune.”
This likelihood function reflects a mixture of two equations: a first step
classifying risk and immunity, and a second step describing expected duration
in a spell. One advantage of this modeling approach is that it allows covariates
to have both a long-term and a short-term impact, depending on whether
they’ appear in the z or the x vector. Variables that enter the at-risk equation
11.6 A Split-Population Model 239
[Figure 11.4 about here. Panel (a), “Splitting Population”: Country A is at risk of the event of interest (Pr(EOI) > 0); Country B is immune (Pr(EOI) = 0). Panel (b), “Baseline Hazard”: hazard over time under the standard Weibull versus the split-population model.]
figure 11.4 Country A is at risk; Country B is not at risk. Mixing these two yields a
risk assessment that is too low while overestimating the risk decline. EOI refers to the
Event Of Interest.
To complete the model we must choose a distribution function for f (t). The
R package spduration (Beger et al., 2017) currently admits two: the Weibull
and log-logistic. The Weibull density allows for hazard rates that are increasing,
constant, or decreasing over survival time, while the log-logistic density can fit
rates that have a peak at a particular survival time.
Note that the model focuses on whether the observational unit at time t
is in the risk set – e.g., a country at a particular point in time – not the unit
per se. As the covariates change over time, so can our estimate of whether an
observation is in the risk pool at that point in time. So, while it is helpful on
a conceptual level to speak of observational units as being at risk or immune,
in a technical sense we should refer to them at a specific point in time. In an
analysis of countries, Canada may not be at risk in 2014, but it may be at risk
in 2015 if some unanticipated disaster leads to conditions that we associated
with country-months in the risk pool.
Survival data consist of two types of spells, those which end in an event
of interest and those that are right-censored. The key assumption with the
split-population approach involves the coding of spells or country-months
as susceptible (πit = 1). The split-duration approach “retroactively” codes
πit = 1 if period t is part of a spell that ended in an observed event in unit
i. Spells that are right-censored take on πit = 0, as do spells that end when an
observation leaves the data set with no event taking place in the last spell (e.g.,
an observation ceases to exist). Treating right-censored spells as “cured” can
be problematic, since they may later, after we observe more data, end in failure.
The probabilistic model for πi partially mitigates this by both incorporating the
length of a censored spell and sharing information across cases known to be at
risk as well as those coded as cured.
are subject to experiencing the events. That hardly seems sustainable, since we
know that some observations are immune to the particular risk under study.
To that end, a split-population duration model also tries to group observations
that are at risk separately from those that are effectively “cured.”
The risk probability is an estimate that a given observation at a specific time
falls into either the susceptible group (π = 1) or the cured group (π = 0). It
is an estimate at a given time because it also depends on how much time has
passed since the last event. To calculate the conditional hazard, we combine
our estimated hazard rate with the estimated probability that a country at
a given time is susceptible to an event. In other words, conditional hazard
= unconditional hazard × risk probability. More formally, given the density
distribution and estimated parameters we are interested in, the conditional
hazard h(t, θ ), where both the at-risk probabilities and hazard are conditional
on survival to time t:
1−θ
θ (t) = , (11.5)
S(t) + (1 − θ )(1 − S(t))
f (t, θ ) θ (t)f (t)
h(t, θ ) = = . (11.6)
S(t, θ ) (1 − θ (t)) + θ (t)S(t)
Equation 11.6 shows that the conditional risk rate is decreasing over event
time because, as time passes, the surviving cases increasingly consist of the
immune (1 − θ ) that will never fail. In Equation 11.6 for the conditional
hazard, the failure rate in the numerator is conditional on the probability that
a case is in the risk set, given survival up to time t. The denominator is an
adjusted survivor function that accounts for the fraction of cured cases by
time t: (1 − θ (t)).
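Because Equations 11.5 and 11.6 are simple algebra, they are easy to transcribe directly. The sketch below is a literal translation of the equations as printed; theta, f, and S are assumed to come from an already-estimated model.

## Literal transcription of Equations 11.5 and 11.6.
cond_risk <- function(theta, S) {
  (1 - theta) / (S + (1 - theta) * (1 - S))        # Equation 11.5
}
cond_hazard <- function(theta, f, S) {
  theta_t <- cond_risk(theta, S)
  (theta_t * f) / ((1 - theta_t) + theta_t * S)    # Equation 11.6
}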
We estimate the probability of an event in (say) February 2014 to be the
unconditional probability of that event when it has been so many (say, 23)
months since the last event times the probability that the observation is in
the “at risk” group given its characteristics and given that it has been 23
months since the last event. In this way we can get a probability estimate for
an event that takes into account the changing hazard of events over time but
which also corrects for the fact that some observations will never experience
an event.
The coefficient estimates from a split-population model can be interpreted much as we interpreted ZIP coefficients. The coefficients in the risk equation are logistic regression parameters that indicate whether a change in a variable increases or decreases the probability that a country-month is in the set of susceptible units; exponentiated coefficients give the factor change in the odds of being at risk associated with a one-unit change in the variable. The duration part of the model is in AFT format; for interpretation it is convenient to think of the dependent variable as survival time or, equivalently, time to failure. A negative coefficient shortens survival and thus hastens failure (a higher probability of an event at time t), while a positive coefficient lengthens survival and delays failure.
11.6.2 An Example
In order to illustrate practical implementation of a split-population model, we
adapt the Beger et al. (2017) reexamination of Belkin and Schofer (2003). Belkin
and Schofer model coups d’état. They explicitly distinguish long-term structural
risk factors for coups from short-term triggering causes that can explain the
timing of a coup in an at-risk regime. The data we analyze include 213 coups.
As examples, we fit a conventional Weibull model and Weibull and log-logistic
split-population duration models, including Belkin and Schofer’s index of
structural coup risk in the risk equation. Table 11.5 is the requisite BUTON
reporting coefficients from the duration equation and then estimates from the
risk equation.6 The duration models are in accelerated failure time format, and
the coefficient estimates are on the log of expected time to failure. The negative
coefficient for military regimes, for example, means that the expected time to a
coup is shorter in military regimes than nonmilitary regimes, holding all other
factors constant. In the risk equation, positive coefficients mean a higher risk of
coup. Thus, military regimes have a higher risk of experiencing a coup. Looking
at the AIC and BIC, we see that the split-population model outperforms the
Weibull AFT model. The split-population models are indistinguishable, so we
will continue to focus on the log-logistic form.
The conditional hazard is the probability of a coup at time t, conditional on the covariates in the risk and duration equations and on survival up to time t. We fix covariates at their sample means. The resulting plots are shown in Figures 11.5 and 11.6. Figure 11.6 compares the conditional hazard with covariates held at their mean values (panel A) with the hazard when covariates are set to high-risk, military-regime values (panel B). The conditional hazard is much higher and steeper in panel B than in panel A, reflecting the increased risk of a coup.
Code for estimating these models appears as Code Example 11.2.
Out-of-Sample Testing
One of the strengths of a parametric model is its ability to generate out-of-sample forecasts. Here we use data from 1996 onward as the test set and prior data for training. For the training data we must subset first and then construct the duration variables, so that coups in the test period do not influence the risk coding in the training data. For the test data we add the duration variables first and then subset. Because the test set is later in time than the training set, this does not contaminate the risk coding; had we subset the data before building the duration variables, all duration counters would restart at 1996, when in fact we can safely use the earlier coup history.
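Here is a minimal sketch of that bookkeeping. It assumes bscoup and add_duration() are set up as in Code Example 11.2 and that 1996 is the training/test cutoff.

## Training set: subset FIRST, then build duration counters, so that
## coups in the test period cannot leak into the risk coding.
train <- bscoup[bscoup$year < 1996, ]
train <- add_duration(train, "coup", unitID = "countryid",
                      tID = "year", freq = "year", ongoing = FALSE)
## Test set: build duration counters on the full history FIRST, then
## subset, so counters in 1996 reflect the pre-1996 coup record.
test <- add_duration(bscoup, "coup", unitID = "countryid",
                     tID = "year", freq = "year", ongoing = FALSE)
test <- test[test$year >= 1996, ]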
6 The AFT parameterization of the Weibull model does not report an intercept.
Table 11.5 Conventional Weibull and split-population duration models of coups. Standard errors in parentheses.

                               Split Population
                    Weibull    Weibull    log-Logistic
Duration Model
Intercept                       3.22       2.40
                               (0.17)     (0.21)
Instability         −0.06      −0.09      −0.09
                    (0.01)     (0.01)     (0.02)
Military regime     −2.33      −1.55      −1.13
                    (0.17)     (0.19)     (0.21)
Regional conflict    6.03       5.08      −2.52
                    (2.44)     (2.72)     (2.16)
log p               −0.15       0.03      −0.45
                    (0.05)     (0.05)     (0.06)
Risk Model
Intercept                      −0.44       2.93
                               (3.89)     (1.87)
Risk index                      1.65       0.59
                               (0.83)     (0.32)
log GDPpc                       0.35      −0.36
                               (0.68)     (0.28)
Military regime                11.57      10.82
                               (3.92)     (9.29)
Recent war                     −2.19      −0.53
                               (1.97)     (0.94)
Regional conflict              −5.30      −5.43
                              (12.49)     (5.62)
South America                  −0.55       2.10
                               (2.18)     (1.45)
Central America                −1.02      −0.39
                               (1.41)     (0.73)
n                   4,250      4,250      4,250
Num. events           213        213        213
log L                −704       −662       −662
AIC                 1,417      1,330      1,331
BIC                 1,442      1,349      1,350
[Figure 11.5 here: conditional hazard over time for the split-population Weibull (left) and log-logistic (right) models, with covariates held at their sample means.]
[Figure 11.6 here: two panels, A and B, plotting the conditional hazard over time.]
Figure 11.6 Plots of the hazard rate for the log-logistic model of coups. Panel A uses the default mean values for the covariates, while panel B uses user-specified values for a high-risk military regime.
11.7 conclusion
This chapter introduced a framework for using likelihood principles when there is temporal dependence between observations. But rather than focusing on the serial approach that emphasizes repeated measurements over time, we focused on duration models that explicitly model time itself. We have developed the likelihoods and illustrated how they may be modified to address a variety of complications, including right-censoring and subpopulations that are never at risk.
Code Example 11.2 Estimating a split-population duration model of coups.

data(bscoup)  # in the spduration library
bscoup$coup <- ifelse(bscoup$coup == "yes", 1, 0)
bscoup <- add_duration(bscoup, "coup", unitID = "countryid",
                       tID = "year", freq = "year",
                       ongoing = FALSE)  # formatting for the duration model
weib_model <- spdur(
  duration ~ milreg + instab + regconf,
  atrisk ~ couprisk + wealth + milreg + rwar + regconf +
    samerica + camerica, data = bscoup, silent = TRUE)
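Code Example 11.2 fits only the split-population Weibull. A log-logistic variant, along with the fit comparison reported in Table 11.5, might look like the following; this sketch assumes the distr argument documented for spdur() and that the usual logLik()/AIC() methods apply to the fitted objects.

## Hedged sketch: log-logistic split-population model and AIC comparison.
loglog_model <- spdur(
  duration ~ milreg + instab + regconf,
  atrisk ~ couprisk + wealth + milreg + rwar + regconf +
    samerica + camerica,
  data = bscoup, distr = "loglog", silent = TRUE)
AIC(weib_model)    # assumes an AIC method is available for spdur fits
AIC(loglog_model)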
Applications
Jones and Branton (2005) compare the BTSCS approach to the Cox model in
the study of policy adoption across US states. Thrower (2017) uses the Cox
model to investigate the duration of executive orders from the US president.
Wolford (2017) also uses the Cox model to describe the duration of peace among
a coalition of victors in interstate wars.
Past Work
Freedman (2008) provides a critical primer to survival analysis from the
perspective of medical research and experimentation. Box-Steffensmeier and
DeBoef (2006); Box-Steffensmeier and Zorn (2001, 2002); and Box-Steffensmeier
et al. (2003) introduce several of the more-complicated versions of the Cox
model to political science audiences. Beck and Katz (2011) provide a broad
review of the state of the art in the analysis of panel time-series data.
Advanced Study
Therneau and Grambsch (2000) is an excellent resource for survival mod-
els. Kalbfleisch and Prentice (2002) present a detailed treatment as well.
Box-Steffensmeier and Jones (2004) develop a political science-focused pre-
sentation of event history models. Park and Hendry (2015) argue for a revised
practice in the evaluation of proportional hazards in the Cox model.
Time series analysis, much of which relies on the likelihood approach, is
an area of vast research, with many texts available. The canonical text for
analyzing time series and panel data remains Wooldridge (2010). Several recent
contributions include Box-Steffensmeier et al. (2015); Brandt and Williams
(2007); and Prado et al. (2017).
For an accessible introduction to spatial dependence in the context of
regression models, see Ward and Gleditsch (2018). Beck et al. (2006) and
Franzese and Hayes (2007) further discuss the interplay of spatial and temporal
dependence.
Software Notes
Therneau and Grambsch (2000) underpins the survival library in R. The
eha package (Broström, 2012, 2017) provides other parameterizations of
common survival models, including parametric models capable of handling
time-varying covariates. flexsurv (Jackson, 2016) and rms (Harrell, Jr.,
2017) are other alternatives that build on survival. The spduration
library (Beger et al., 2017) implements the split-population survival model used
in this chapter. See also smcure (Cai et al., 2012).
12
Strategies for Missing Data
12.1 introduction
Missingness in data is like tooth decay. Everyone has it. It is easy to ignore.
It causes serious problems. Plus, virtually no one likes to go to the dentist,
which itself is painful and costly. While ignoring missing data is still widespread,
developments in recent decades have made it easier to address the issue in a
principled fashion. This chapter will review the problems of missing data and
survey some techniques for dealing with it, focusing on one technique that can
be inserted at the point of need: in the likelihood function.
Statistical modeling that ignores missingness in the data can result in biased
estimates and standard errors that are wildly deflated or inflated (King et al.,
2001; Molenberghs et al., 2014; Rubin, 1976). Yet many applied researchers
still ignore this problem, even if it is well established (Lall, 2016). Principled
approaches to missing data were first introduced in the 1970s, and now there
are a variety of tools available. The general idea across all of them is to use some
algorithm to fill in the missing data with estimates of what real data would look
like were it available. Because these imputed values are uncertain, we want to
do this many times and incorporate this uncertainty in our analysis. This means
creating multiple “complete” data sets, which are then analyzed separately and
combined to obtain final estimates of quantities of interest, accounting for the
uncertainty due to the missing data (Rubin, 1996, 2004).
Data we collect from the real world, whether from the most carefully designed experiment or from the underfunded statistical agency of a poor country, will, with near certainty, have some values that are missing. Missingness can arise from individuals deciding not to answer certain questions or dropping out of a study altogether, from human error, or from data getting lost in the shuffle of paper and computer files. Or data can be systematically not reported.
It may be that data are missing because they are collected asynchronously so
that, for example, some data for a particular observation may be available
at the beginning of a year but other data not available until the end. Some
repository databases, such as those curated by the World Bank, may rely on
collection mechanisms – such as national reporting agencies – that vary in their
schedules. This can lead some data to be unavailable for various intervals of
time. Another potential cause of missing data is that some reporting agencies
may choose to delay the reporting of statistics if the data are perceived to be
politically damaging. Additionally, natural disasters such as earthquakes and
tsunamis disrupt data collection. Similarly, civil and international wars may
cause some data to go unreported, either because the data were never generated
or the institutions that gather and curate societal data were unavailable during
the conflict.
Consider the example of Nigeria. During Nigeria’s colonial period there
was a national census in the early 1950s, yielding an estimate of about 32
million inhabitants. After independence, the national census became politicized
and several attempted censuses were undertaken during the 1960s, yielding
estimates between 50 and 60 million. After the Nigerian civil war ended in
1970, there were attempts to hold a census, but they were never completed
because of the political controversy over which ethnic groups would be counted
in which areas. It wasn’t until the 1990s that Nigeria was able to conduct a
census, and currently the population is estimated to be around 175 million.
In many databases, Nigerian population was recorded as missing for the 1970s
and 1980s. Obviously the Nigerian population didn’t go missing, but as a result
Nigeria was not included in many scholarly studies.
Missingness can take many forms and result from different events and
circumstances. Each can lead to different patterns of missingness with different
implications for analysis. But it is evident that social science data are rarely
missing at random. Thus, we can often assume that the data are missing for
reasons that we could, in principle, know or describe.
[Figure 12.1 here: a missingness map ("Missing" vs. "Observed") for a labor-strikes data set; variables such as gop.gov, taft.hartley, and political.strike.count run along the horizontal axis, observations along the vertical axis.]
[Figure 12.2 here.]
Figure 12.2 The left panel displays the proportion of observations missing for a selection of variables. The right panel displays the frequency of combinations of missing and non-missing variables.
[Figure 12.3 here: a missingness map for a cross-national panel; countries (AZE, BGD, ESP, GAB, LBR, POL, SAU, SDN, TWN, ZWE) on the vertical axis and variables on the horizontal axis.] The space allotted to each observational unit on the left axis represents the number of time periods that unit was observed. In these example data Azerbaijan (AZE) has a relatively short period of observation, i.e., the time series is unbalanced.
A similar problem arises when the producers of data use numeric values to encode specific states. For example, the widely used Polity IV index measures regime type on a scale from −10 to +10 but also uses the special codes −66, −77, and −88 to mark foreign interruptions, interregna, and transition periods; treated as ordinary numbers, these codes will badly distort an analysis.
More problematic, however, is the fact that complete case analysis yields
biased parameter estimates and invalid standard errors when the data are MAR
or NMAR. Again, there is no way to be sure about the missingness mechanism;
using complete case analysis assumes that missingness is MCAR.
If we can calculate this integral, we can then construct a likelihood L(θ | X_obs). The trouble is that this integral almost never has a convenient solution, except in special circumstances. Little and Rubin (2002) use the EM algorithm (see Section 4.2.3) to find θ̂. The algorithm consists of an E-step and an M-step; we start with a set of initial parameter values θ^0 and then iterate. Let θ^t be the t-th iterate of the algorithm:

E-step: Find the conditional expectation of the missing data, given the observed data and θ^{t−1}.

M-step: Find θ^t by maximizing L(θ | X_obs, X̃_mis), where X̃_mis are the current predicted values generated by the parameters θ^{t−1}.
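As a toy illustration of these two steps (not Little and Rubin's general treatment), consider a bivariate normal in which x is fully observed and y is partly missing; the E-step fills in the conditional expectations of y and y², and the M-step refits the parameters on the completed data.

## Toy EM for a regression with missing outcomes; illustrative only.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[sample(n, 150)] <- NA                 # make some y values missing
miss <- is.na(y)
beta <- c(0, 0)
sigma2 <- var(y, na.rm = TRUE)          # initial values
for (iter in 1:100) {
  ## E-step: conditional expectations of the missing y and y^2
  ey  <- ifelse(miss, beta[1] + beta[2] * x, y)
  ey2 <- ifelse(miss, (beta[1] + beta[2] * x)^2 + sigma2, y^2)
  ## M-step: re-estimate parameters from the completed data
  fit <- lm(ey ~ x)
  beta <- coef(fit)
  sigma2 <- mean(ey2 - 2 * ey * fitted(fit) + fitted(fit)^2)
}
beta     # converges to the observed-data MLE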
12.5.2 MICE
An alternative approach is to focus on each variable in turn and model its conditional distribution given the other variables. Multiple Imputation via Chained Equations (MICE) (Van Buuren, 2012) is a fully conditional specification method that starts with an initial guess for Y_mis (most often the variable's mean). This "imputed" variable is used as the dependent variable in a regression on all other variables, and the resulting estimates are used to impute the missing observations. The estimates and parameters are then updated, cycling through the variables and imputing each one given the most current values, until the imputations stop changing appreciably. A similar technique is used in the mi package in R (Su et al., 2011). Each variable is imputed based on an "appropriate generalized linear model for each variable's conditional distribution" (Kropko et al., 2014, 501).
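In practice the chained-equations cycle is a few lines of code. A minimal sketch with the mice package, where my_df, y, x1, and x2 are placeholders for the analyst's own data and model:

library(mice)
imp <- mice(my_df, m = 5, printFlag = FALSE)  # five imputed data sets
fits <- with(imp, lm(y ~ x1 + x2))            # analyze each completed data set
summary(pool(fits))                           # combine via Rubin's rules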
where f_X(·) and f_Y(·) are the univariate density functions for X and Y.
The last step is to choose a particular copula function C(·, ·; θ) to characterize the functions in Equations (12.1) and (12.2). The Gaussian copula can accommodate both positive and negative dependency:

\[
C(u, v; \theta) = \int_{-\infty}^{\Phi^{-1}(u)} \int_{-\infty}^{\Phi^{-1}(v)} \frac{1}{2\pi (1 - \theta^2)^{1/2}} \exp\!\left( \frac{-(s^2 - 2\theta s t + t^2)}{2(1 - \theta^2)} \right) ds\, dt,
\]
Standard errors around these estimates are a combination of our within- and across-imputation uncertainty:

\[
V_k^{\mathrm{within}} = \frac{1}{q} \sum_{d=1}^{q} \left( \tilde{\sigma}_k^{d} \right)^2,
\]
\[
V_k^{\mathrm{between}} = \frac{1}{q-1} \sum_{d=1}^{q} \left( \tilde{\beta}_k^{d} - \hat{\beta}_{k,MI} \right)^2,
\]
\[
\hat{\sigma}_{k,MI}^2 = V_k^{\mathrm{within}} + \frac{q+1}{q}\, V_k^{\mathrm{between}}.
\]
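These combining rules are easy to implement by hand. The sketch below takes the q imputation-specific estimates and standard errors for a single coefficient and returns the pooled estimate and standard error:

## Rubin's combining rules for one coefficient.
combine_mi <- function(betas, ses) {
  q <- length(betas)
  beta_mi <- mean(betas)                       # pooled point estimate
  v_within <- mean(ses^2)
  v_between <- sum((betas - beta_mi)^2) / (q - 1)
  se_mi <- sqrt(v_within + (q + 1) / q * v_between)
  c(estimate = beta_mi, se = se_mi)
}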
Table 12.1 Missingness in Sorens and Ruger data. Variable names taken from original study.
3 Generating five data sets through Amelia II took 124 minutes, MICE took 55 minutes to create
20 imputed data sets, and sbgcop.mcmc ran 2,000 imputations in 11 minutes. Each of the
imputation algorithms was run on a single core.
Table 12.2 Root mean square error on imputations of Sorens and Ruger (2012) data. Variable names taken from original study.
rm(list = ls())
library(sbgcop)
# deleted code to gather a data frame, md.df:
# md.df is a data frame with 195 rows (countries)
# run sbgcop.mcmc with defaults
new.df <- sbgcop.mcmc(md.df)
colnames(new.df$Y.pmean) <- c("area.mi", "gdp.mi", "democ.mi",
                              "autoc.mi", "polity.mi")
raw.df.mi <- cbind(raw.df, new.df$Y.pmean)
plot(density(raw.df.mi$polity, na.rm = TRUE), las = 1, xlim = c(-20, 20),
     bty = "n", xlab = "Polity Score", main = "", lwd = 3)
lines(density(raw.df.mi$polity.mi), col = "red", lwd = 3)
[Figure 12.5 here.]
Figure 12.5 Using sbgcop to impute missing Polity scores. Distribution of the variable for only the cases with data (shown in black) compared to the distribution for the variable with missing values imputed (shown in gray/red).
12.8 conclusion
Ultimately, the analyst has to make a case as to which is worse: analyzing
complete cases or imputing the missing values. In general, imputation is
much preferred to throwing away data and thereby making poor inferences.
Fortunately, there are modern methods that facilitate the imputation of missing
data that preserve the uncertainty that comes from imputing values. Like
modern dentistry, early detection of missingness and use of these approaches
can prevent a lot of subsequent pain. The general strategy: detect missingness early, consider the mechanism that likely generated it, impute multiply, and carry the imputation uncertainty through to the final estimates.
Applications
We searched the 22 quantitative empirical articles in the July 2017 issue of
the American Journal of Political Science and the August 2017 issue of the
American Political Science Review for the terms “attrition,” “drop,” “omit,”
“missing,” “impute,” “imputation,” “nonresponse,” and, “non-response.” Only
seven mention anything at all about the handling of missing data. Two
undertake some form of imputation.
Wright and Frantz (2017) use multiple imputation to replicate an existing
paper linking hydrocarbon income to autocratic survival and show that the
original results fail to hold. Klasnja (2017) imputes missing data in the ANES.
Past Work
Schafer (1999) provides an early primer on multiple imputation for applied
researchers. King et al. (2001) introduced the initial version of Amelia. Horton
and Kleinman (2007) outline a variety of imputation approaches, ranging from
EM through chained equations to MCMC. Ross (2006) is a famous application
in political science, highlighting how poorer and more-authoritarian countries
are less likely to report data.
Advanced Study
Two major texts on multiple imputation are Little and Rubin (2002) and
Van Buuren (2012).
On the theory of copulas, see Nelsen (1970) and Trivedi and Zimmer (2005).
On Gaussian copulas, see Chen et al. (2006); Klaassen et al. (1997); and
Pitt et al. (2006). Hoff (2007) uses copulas in the context of rank likelihoods
(Pettitt, 1982).
Software Notes
The Amelia library contains several useful features for diagnosing missing
data, including missmap. The VIM package (Kowarik and Templ, 2016)
contains many others.
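For instance, a missingness map like Figure 12.1 takes one line; df here stands in for any data frame containing NA values.

library(Amelia)
missmap(df)   # plots missing vs. observed cells, as in Figure 12.1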
The machine-learning revolution has focused a lot of attention on ways to
impute missing data. As a result, R contains many packages with imputation
tools. Kowarik and Templ (2016) present one recent overview, surveying a
variety of single-imputation algorithms, including yaImpute (Crookston and
Finley, 2007), missMDA (Josse and Husson, 2016), AmeliaII (Honaker and
King, 2010), mix (Schafer, 2017), missForest (Stekhoven and Buehlmann,
2012), and others. The VIM package includes a variety of approaches for mixed,
scaled variables in large databases with outliers. This package also supports
the multiple imputation methods that use the EM approach. The MICE library
(Van Buuren and Groothuis-Oudshoorn, 2011) implements MICE, while the
mi (Su et al., 2011) library also implements a conditional modeling approach.
mitools (Lumley, 2014) and Zelig contain a variety of tools for managing
and combining the results from multiple imputations. Van Buuren maintains
an updated list of imputation-related software, including R packages, at http://
www.stefvanbuuren.nl/mi/Software.html.
part v
A LOOK AHEAD
13
Epilogue
Good examples of this abound in the network literature, especially Hoff (2009,
2015).
Mixture models – as we introduced in the context of zero-inflation and split-
population models – relax the assumption that all our observed data are being
generated by the same process, i.e., they relax the “identical” part of i.i.d. The
strategy and estimation of mixture models, however, is much broader and can
be considerably more complex than the examples we have provided. Some
recent uses include unsupervised classification models (Ahlquist and Breunig,
2012). The application of the Dirichlet Process model has allowed for the
construction of mixture models, where the number of components emerges
as part of the estimation procedure itself. This set of tools is seeing wide
application in machine learning, especially as applied to text corpora (Grimmer,
2010; Lucas et al., 2015; Spirling and Quinn, 2010). Future applications will
likely extend to computer recognition of image and video data.
Bibliography

Abayomi, Kobi, Gelman, Andrew, and Levy, Marc. 2008. Diagnostics for Multivariate
Imputations. Journal of the Royal Statistical Society Series C, 57(3), 273–291.
Agresti, Alan. 2002. Categorical Data Analysis. 2nd edn. New York: John Wiley & Sons,
Ltd.
Ahlquist, John S. 2010a. Building Strategic Capacity: The Political Underpinnings of
Coordinated Wage Bargaining. American Political Science Review, 104(1), 171–188.
Ahlquist, John S. 2010b. Policy by Contract: Electoral Cycles, Parties, and Social Pacts,
1974–2000. Journal of Politics, 72(2), 572–587.
Ahlquist, John S., and Breunig, Christian. 2012. Model-Based Clustering and Typologies
in the Social Sciences. Political Analysis, 20(1), 92–112.
Ahlquist, John S., and Wibbels, Erik. 2012. Riding the Wave: World Trade and Factor-
Based Models of Democratization. American Journal of Political Science, 56(2),
447–464.
Ai, Chunrong, and Norton, Edward C. 2003. Interaction Terms in Logit and Probit
Models. Economics Letters, 80(1), 123–129.
Akaike, Hirotugu. 1974. A New Look at the Statistical Model Identification. IEEE
Transactions on Automatic Control, 19(6), 716–723.
Aldrich, John. 1997. R. A. Fisher and the Making of Maximum Likelihood 1912–1922.
Statistical Science, 12(3), 162–176.
Alfons, Andreas. 2012. cvTools: Cross-Validation Tools for Regression Models. R
package version 0.3.2.
Alvarez, R. Michael. 1998. Information and Elections. Ann Arbor, MI: University of
Michigan Press.
Alvarez, R. Michael, and Brehm, John. 1995. American Ambivalence towards Abortion
Policy: Development of a Heteroskedastic Probit Model of Competing Values. Amer-
ican Journal of Political Science, 39(4), 1055–1082.
Alvarez, R. Michael, and Brehm, John. 2002. Hard Choices, Easy Answers: Values,
Information, and American Public Opinion. Princeton, NJ: Princeton University
Press.
Alvarez, R. Michael, and Nagler, Jonathan. 1998. When Politics and Models Collide:
Estimating Models of Multiparty Elections. American Journal of Political Science,
42(1), 55–96.
Andersen, Per Kragh, and Gill, Richard D. 1982. Cox’s Regression Model for Counting
Processes: A Large Sample Study. Annals of Statistics, 10(4), 1100–1120.
Angrist, Joshua D., and Pischke, Jorn-Steffen. 2009. Mostly Harmless Econometrics.
Princeton, NJ: Princeton University Press.
Arel-Bundock, Vincent. 2013. WDI: World Development Indicators (World Bank). R
package version 2.4.
Arlot, Sylvain, and Celisse, Alain. 2010. A Survey of Cross-Validation Procedures for
Model Selection. Statistics Surveys, 4, 40–79.
Aronow, Peter M. 2016. A Note on ‘How Robust Standard Errors Expose Methodolog-
ical Problems They Do Not Fix, and What to Do about It.’ arXiv:1609.01774.
Aronow, Peter M., and Samii, Cyrus. 2012. ri: R Package for Performing
Randomization-Based Inference for Experiments. R package version 0.9.
Bagozzi, Benjamin E. 2016. The Baseline-Inflated Multinomial Logit Model for Interna-
tional Relations Research. Conflict Management and Peace Science, 33(2), 174–197.
Bagozzi, Benjamin E., Koren, Ore, and Mukherjee, Bumba. 2017. Droughts, Land
Appropriation, and Rebel Violence in the Developing World. Journal of Politics, 79(3),
1057–1072.
Bailey, Delia, and Katz, Jonathan N. 2011. Implementing Panel-Corrected Standard
Errors in R: The pcse Package. Journal of Statistical Software, Code Snippets, 42(1),
1–11.
Barrilleaux, Charles, and Rainey, Carlisle. 2014. The Politics of Need: Examining
Governors’ Decisions to Oppose the “Obamacare” Medicaid Expansion. State Politics
and Policy Quarterly, 14(4), 437–460.
Bartels, Larry M. 1997. Specification, Uncertainty, and Model Averaging. American
Journal of Political Science, 41(2), 641–674.
Baturo, Alexander. 2017. Democracy, Development, and Career Trajectories of Former
Political Leaders. Comparative Political Studies, 50(8), 1023–1054.
Beck, Nathaniel. 1991. The Illusion of Cycles in International Relations. International
Studies Quarterly, 35(4), 455–476.
Beck, Nathaniel, and Katz, Jonathan N. 1995. What to Do (and Not to Do) with
Pooled Time-Series Cross-Section Data. American Political Science Review, 89(3),
634–647.
Beck, Nathaniel, and Katz, Jonathan N. 2011. Modeling Dynamics in Time-Series
Cross-Section Political Economy Data. Annual Review of Political Science, 14(1),
331–352.
Beck, Nathaniel, Katz, Jonathan N., and Tucker, Richard. 1998. Taking Time Seriously:
Time-Series-Cross-Section Analysis with a Binary Dependent Variable. American
Journal of Political Science, 42(2), 1260–1288.
Beck, Nathaniel, Gleditsch, Kristian Skrede, and Beardsley, Kyle. 2006. Space Is More
Than Geography: Using Spatial Econometrics in the Study of Political Economy.
International Studies Quarterly, 50(1), 27–44.
Beger, Andreas, Hill, Jr., Daniel W., Metternich, Nils W., Minhas, Shahryar, and Ward,
Michael D. 2017. Splitting It Up: The spduration Split-Population Duration
Regression Package. The R Journal, 9(2), 474–486.
Belia, Sarah, Fidler, Fiona, Williams, Jennifer, and Cumming, Geoff. 2005. Researchers
Misunderstand Confidence Intervals and Standard Error Bars. Psychological Meth-
ods, 10(4), 389–396.
Belkin, Aaron, and Schofer, Evan. 2003. Toward a Structural Understanding of Coup
Risk. Journal of Conflict Resolution, 47(5), 594–620.
Bell, Mark S., and Miller, Nicholas L. 2015. Questioning the Effect of Nuclear Weapons
on Conflict. Journal of Conflict Resolution, 59(1), 74–92.
Benoit, Kenneth, Conway, Drew, Lauderdale, Benjamin E., Laver, Michael, and
Mikhaylov, Slava. 2016. Crowd-Sourced Text Analysis: Reproducible and Agile
Production of Political Data. American Political Science Review, 110(2), 278–295.
Bercovitch, Jacob, and Schneider, Gerald. 2000. Who Mediates? The Political Economy
of International Conflict Management. Journal of Peace Research, 37(2), 145–165.
Berkson, Joseph. 1944. Application of the Logistic Function to Bio-Assay. Journal of the
American Statistical Association, 39(227), 357–365.
Berkson, Joseph. 1946. Approximation of Chi-Square by ‘Probits’ and by ‘Logits’.
Journal of the American Statistical Association, 41(233), 70–74.
Berkson, Joseph. 1953. A Statistically Precise and Relatively Simple Method of Estimat-
ing the Bioassay with Quantal Response, Based on the Logistic Function. Journal of
the American Statistical Association, 48(263), 565–599.
Berkson, Joseph. 1955. Maximum Likelihood and Minimum χ² Estimates of the
Logistic Function. Journal of the American Statistical Association, 50(269), 130–162.
Berry, William D., DeMeritt, Jacqueline H. R., and Esarey, Justin. 2010. Testing for
Interaction in Binary Logit and Probit Models: Is a Product Term Essential? American
Journal of Political Science, 54(1), 248–266.
Berry, William, Golder, Matt, and Milton, Daniel. 2012. Improving Tests of Theories
Positing Interaction. Journal of Politics, 74(3), 653–671.
Bhat, Chandra R. 1996. A Heterocedastic Extreme Value Model of Intercity Travel
Mode Choice. Transportation Research B, 29(6), 471–483.
Bliss, Chester Ittner. 1935. The Calculation of the Dosage Mortality Curve. Annals of
Applied Biology, 22, 134–167.
Bolker, Ben, and R Development Core Team. 2016. bbmle: Tools for General Maximum
Likelihood Estimation. R package version 1.0.18.
Bortkiewicz, Ladislaus von. 1898. Das Gesetz der kleinen Zahlen. Leipzig: B. G. Teubner.
Box-Steffensmeier, Janet M., and DeBoef, Suzanna. 2006. Repeated Events Survival
Models: The Conditional Frailty Model. Statistics in Medicine, 25, 1260–1288.
Box-Steffensmeier, Janet M., Freeman, John R., Hitt, Matthew, and Pevehouse, Jon C.
W. (eds.). 2015. Time Series Analysis for the Social Sciences. New York: Cambridge
University Press.
Box-Steffensmeier, Janet M., and Jones, Bradford S. 2004. Event History Modeling: A
Guide for Social Scientists. New York: Cambridge University Press.
Box-Steffensmeier, Janet M., Reiter, Dan, and Zorn, Christopher. 2003. Nonpropor-
tional Hazards and Event History Analysis in International Relations. Journal of
Conflict Resolution, 47(1), 33–53.
Box-Steffensmeier, Janet M., and Zorn, Christopher J. W. 2001. Duration Models and
Proportional Hazards in Political Science. American Journal of Political Science, 45(4),
972–988.
Box-Steffensmeier, Janet M., and Zorn, Christopher. 2002. Duration Models for
Repeated Events. The Journal of Politics, 64(4), 1069–1094.
Brambor, Thomas, Clark, William Roberts, and Golder, Matt. 2006. Understanding
Interaction Models: Improving Empirical Analyses. Political Analysis, 14(1), 63–82.
Brandt, Patrick T., Freeman, John R., and Schrodt, Philip A. 2014. Evaluating Forecasts
of Political Conflict Dynamics. International Journal of Forecasting, 30(4), 944–962.
Brandt, Patrick T., and Williams, John T. 2007. Multiple Time Series Models. Thousand
Oaks, CA: Sage Publications.
Braumoeller, Bear F. 2004. Hypothesis Testing and Multiplicative Interaction Terms.
International Organization, 58(4), 807–820.
Brier, Glenn W. 1950. Verification of Forecasts Expressed in Terms of Probability.
Monthly Weather Review, 78(1), 1–3.
Broockman, David, Kalla, Joshua, and Aronow, Peter. 2015. Irregularities in LaCour
(2014).
Broström, Göran. 2012. Event History Analysis with R. Boca Raton, FL: Chapman &
Hall/CRC Press.
Broström, Göran. 2017. eha: Event History Analysis. R package version 2.4-5.
Bueno de Mesquita, Bruce, Gleditsch, Nils Petter, James, Patrick, King, Gary, Metelits,
Claire, Ray, James Lee, Russett, Bruce, Strand, Håvard, and Valeriano, Brandon. 2003.
Symposium on Replication in International Studies Research. International Studies
Perspectives, 4(1), 72–107.
Cai, Chao, Zou, Yubo, Peng, Yingwei, and Zhang, Jiajia. 2012. smcure: An R Pack-
age for Estimating Semiparametric Mixture Cure Models. Computer Methods and
Programs in Biomedicine, 108(3), 1255–1260.
Cameron, A. Colin, Gelbach, Jonah B., and Miller, Douglas L. 2008. Bootstrap-
Based Improvements for Inference with Clustered Errors. Review of Economics and
Statistics, 90(3), 414–427.
Cameron, A. Colin, and Miller, Douglas L. 2015. A Practitioner’s Guide to Cluster-
Robust Inference. Journal of Human Resources, 50(2), 317–373.
Cameron, A. Colin, and Trivedi, Pravin K. 1990. Regression-Based Tests for Overdis-
persion in the Poisson Model. Journal of Econometrics, 46, 347–364.
Cameron, A. Colin, and Trivedi, Pravin K. 2013. Regression Analysis of Count Data.
2nd edn. New York: Cambridge University Press.
Canty, Angelo, and Ripley, Brian. 2016. boot: Bootstrap R (S-Plus) Functions. R package
version 1.3-18.
Carroll, Nathan. 2017. oglmx: Estimation of Ordered Generalized Linear Models. R
package version 2.0.0.3.
Carrubba, Clifford J., and Clark, Tom S. 2012. Rule Creation in a Political Hierarchy.
American Political Science Review, 106(3), 622–643.
Carter, David B., and Signorino, Curtis S. 2010. Back to the Future: Modeling Time
Dependence in Binary Data. Political Analysis, 18(3), 271–292.
Chalmers, Adam William. 2017. When Banks Lobby: The Effects of Organizational
Characteristics and Banking Regulations on International Bank Lobbying. Business
and Politics, 19(1), 107–134.
Chen, Xiaohong, Fan, Yanqin, and Tsyrennikov, Viktor. 2006. Efficient Estimation of
Semiparametric Multivariate Copula Models. Journal of the American Statistical
Association, 101(475), 1228–1240.
Cheng, Simon, and Long, J. Scott. 2007. Testing for IIA in the Multinomial Logit Model.
Sociological Methods & Research, 35(4), 583–600.
Chiba, Daina, Metternich, Nils W., and Ward, Michael D. 2015. Every Story Has a
Beginning, Middle, and an End (But Not Always in That Order): Predicting Duration
Dynamics in a Unified Framework. Political Science Research and Methods, 3(3), 515–
541.
Choi, Leena. 2011. ProfileLikelihood: Profile Likelihood for a Parameter in Commonly
Used Statistical Models. R package version 1.1.
Choirat, Christine, Imai, Koske, King, Gary, and Lau, Olivia. 2009. Zelig: Everyone’s
Statistical Software. Version 5.0-17. http://zeligproject.org/.
Christensen, Rune Haubo Bojesen. 2015. ordinal: Regression Models for Ordinal Data.
R package version 2015.6-28. www.cran.r-project.org/package=ordinal/.
Clarke, Kevin A. 2006. A Simple Distribution-Free Test for Nonnested Model Selection.
Political Analysis, 15(3), 347–363.
Cleveland, William S., and McGill, Robert. 1984. Graphical Perception: Theory, Exper-
imentation, and Application to the Development of Graphical Methods. Journal of
the American Statistical Association, 79(387), 531–554.
Conway, Richard W., and Maxwell, William L. 1962. A Queuing Model with State
Dependent Service Rates. Journal of Industrial Engineering, 12(2), 132–136.
Cook, Scott J., and Savun, Burcu. 2016. New Democracies and the Risk of Civil Conflict:
The Lasting Legacy of Military Rule. Journal of Peace Research, 53(6), 745–757.
Cox, David R. 1958. The Regression Analysis of Binary Sequences (with Discussion).
Journal of the Royal Statistical Society Series B, 20(2), 215–242.
Cox, David R. 2006. Principles of Statistical Inference. New York: Cambridge University
Press.
Cox, David R, and Barndorff-Nielsen, Ole E. 1994. Inference and Asymptotics. Boca
Raton, FL: Chapman & Hall/CRC Press.
Croissant, Yves. 2013. mlogit: Multinomial Logit Model. R package version 0.2-4.
Croissant, Yves, and Millo, Giovanni. 2008. Panel Data Econometrics in R: The plm
Package. Journal of Statistical Software, 27(2), 1–43.
Crookston, Nicholas L., and Finley, Andrew O. 2007. yaImpute: An R Package for kNN
Imputation. Journal of Statistical Software, 23(10), 1–16.
Cumming, Geoff, Williams, Jennifer, and Fidler, Fiona. 2004. Replication and
Researchers’ Understanding of Confidence Intervals and Standard Error Bars. Under-
standing Statistics, 3(4), 299–311.
DA-RT. 2015. The Journal Editors’ Transparency Statement (JETS).
Davidson, Russell, and MacKinnon, James G. 1984. Convenient Specification Tests for
Logit and Probit Models. Journal of Econometrics, 25(3), 241–262.
Davison, A. C., and Hinkley, D. V. 1997. Bootstrap Methods and Their Applications.
New York: Cambridge University Press.
Dempster, Arthur P., Laird, Nan M., and Rubin, Donald B. 1977. Maximum Likelihood
from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society
Series B, 39(1), 1–38.
Desmarais, Bruce A., and Harden, Jeffrey J. 2013. Testing for Zero Inflation in Count
Models: Bias Correction for the Vuong Test. Stata Journal, 13(4), 810–835.
Edwards, Anthony W. F. 1992. Likelihood. Expanded edn. Baltimore, MD: Johns
Hopkins University Press.
Edwards, Barry, Crespin, Michael, Williamson, Ryan D., and Palmer, Maxwell. 2017.
Institutional Control of Redistricting and the Geography of Representation. Journal
of Politics, 79(2), 722–726.
Efron, Bradley, and Tibshirani, Robert J. 1998. An Introduction to the Bootstrap. Boca
Raton, FL: Chapman & Hall/CRC Press.
Eifert, Benn, Miguel, Edward, and Posner, Daniel N. 2010. Political Competition and
Ethnic Identification in Africa. American Journal of Political Science, 54(2), 494–510.
Einstein, Katherine Levine, and Glick, David M. 2017. Does Race Affect Access to
Government Services? An Experiment Exploring Street-Level Bureaucrats and Access
to Public Housing. American Journal of Political Science, 61(1), 100–116.
Esarey, Justin. 2017. clusterSEs: Calculate Cluster-Robust p-Values and Confidence
Intervals. R package version 2.3.3.
Faraway, Julian J. 2004. Linear Models with R. Boca Raton, FL: Chapman & Hall/CRC
Press.
Fearon, James D., and Laitin, David D. 2003. Ethnicity, Insurgency, and Civil War.
American Political Science Review, 97(1), 75–90.
Feliciano, Gloria D., Powers, Richard D., and Kearl, Bryant E. 1963. The Presentation
of Statistical Information. Audiovisual Communication Review, 11(3), 32–39.
Firth, David. 1993. Bias Reduction of Maximum Likelihood Estimates. Biometrika,
80(1), 27–38.
Fox, John. 1997. Applied Regression Analysis, Linear Models, and Related Methods.
Thousand Oaks, CA: SAGE Publications.
Fox, John, and Weisberg, Sanford. 2011. An R Companion to Applied Regression. 2nd
edn. Thousand Oaks, CA: SAGE Publications.
Franzese, Robert, and Hayes, Jude C. 2007. Spatial Econometric Models of Cross-
Sectional Interdependence in Political Science Panel and Time-Series-Cross-Section
Data. Political Analysis, 15(2), 140–164.
Freedman, David A. 1983. A Note on Screening Regression Equations. The American
Statistician, 37(2), 152–155.
Freedman, David A. 2006. On The So-Called “Huber Sandwich Estimator” and “Robust
Standard Errors.” The American Statistician, 60(4), 299–302.
Freedman, David A. 2008. Survival Analysis: A Primer. American Statistician, 62(2),
110–119.
Friendly, Michael. 2000. Visualizing Categorical Data. Cary, NC: SAS Institute.
Fullerton, Andrew S. 2009. A Conceptual Framework for Ordered Logistic Regression
Models. Sociological Methods & Research, 38(2), 306–347.
Gelman, Andrew. 2008. Scaling Regression Inputs by Dividing by Two Standard
Deviations. Statistics in Medicine, 27(15), 2865–2873.
Gelman, Andrew. 2011. Why Tables Are Really Much Better than Graphs. Journal of
Computational and Graphical Statistics, 20(1), 3–7.
Gelman, Andrew, and Hill, Jennifer. 2007. Data Analysis Using Regression and Mul-
tilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge:
Cambridge University Press.
Gelman, Andrew, Jakulin, Aleks, Pittau, Maria Grazia, and Su, Yu-Sung. 2008. A
Weakly Informative Prior Distribution for Logistic and Other Regression Models.
Annals of Applied Statistics, 2(4), 1360–1383.
Gelman, Andrew, Pasarica, Cristian, and Dodhia, Rahul. 2002. Let’s Practice What We
Preach: Turning Tables into Graphs. American Statistician, 56(2), 121–130.
Gelman, Andrew, and Su, Yu-Sung. 2016. arm: Data Analysis Using Regression and
Multilevel/Hierarchical Models. R package version 1.9-3.
Gentleman, Robert, and Temple Lang, Duncan. 2007. Statistical Analyses and Repro-
ducible Research. Journal of Computational and Graphical Statistics, 16(1), 1–23.
Gerber, Alan S., and Green, Donald P. 2012. Field Experiments: Design, Analysis, and
Interpretation. New York: W. W. Norton.
Gerber, Alan S., Green, Donald P., and Nickerson, David. 2001. Testing for Publication
Bias in Political Science. Political Analysis, 9(4), 385–392.
Gerber, Alan S., and Malhotra, Neil. 2008. Do Statistical Reporting Standards Affect
What Is Published? Publication Bias in Two Leading Political Science Journals.
Quarterly Journal of Political Science, 3, 313–326.
Gill, Jeff. 1999. The Insignificance of Null Hypothesis Significance Testing. Political
Research Quarterly, 52(3), 647–674.
Glasgow, Garrett. 2001. Mixed Logit Models for Multiparty Elections. Political Analy-
sis, 9(1), 116–136.
Glasgow, Garrett, and Alvarez, R. Michael. 2008. Discrete Choice Methods. In Box-
Steffensmeier, Janet M., Brady, Henry E., and Collier, David (eds.), The Oxford
Handbook of Political Methodology. New York: Oxford University Press, 513–529.
Gleditsch, Kristian Skrede, and Ward, Michael D. 2013. Forecasting Is Difficult, Espe-
cially the Future: Using Contentious Issues to Forecast Interstate Disputes. Jounal of
Peace Research, 50(1), 17–31.
Gleditsch, Nils Petter, Metelits, Claire, and Strand, Havard. 2003. Posting Your Data:
Will You Be Scooped or Will You Be Famous? International Studies Perspectives, 4(1),
89–97.
Goldstein, Judith, Rivers, Douglas, and Tomz, Michael. 2007. Institutions in Interna-
tional Relations: Understanding the Effects of the GATT and the WTO on World
Trade. International Organization, 61(1), 37–67.
Graham, John W. 2009. Missing Data Analysis: Making It Work in the Real World.
Annual Review of Psychology, 60(1), 549–576.
Greene, William H. 2008. Functional Forms for the Negative Binomial Model for Count
Data. Economic Letters, 99(3), 585–590.
Greene, William H. 2011. Fixed Effects Vector Decomposition: A Magical Solution to
the Problem of Time-Invariant Variables in Fixed Effects Models? Political Analysis,
19(2), 135–146.
Greenhill, Brian D., Ward, Michael D., and Sacks, Audrey. 2011. The Separation Plot:
A New Visual Method for Evaluating the Fit of Binary Data. American Journal of
Political Science, 55(4), 991–1002.
Greenhill, Brian D., Ward, Michael D., and Sacks, Audrey. 2015. separationplot:
Separation Plots. R package version 1.1.
Grimmer, Justin. 2010. An Introduction to Bayesian Inference via Variational Approxi-
mations. Political Analysis, 19(1), 32–47.
Grimmer, Justin, and Stewart, Brandon M. 2013. Text as Data: The Promise and Pitfalls
of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3),
267–297.
Gurland, John. 1959. Some Applications of the Negative Binomial and Other Conta-
gious Distributions. American Journal of Public Health, 49(10), 1388–1399.
Habel, Kai, Grasman, Raoul, Gramacy, Robert B., Stahel, Andreas, and Sterratt,
David C. 2015. geometry: Mesh Generation and Surface Tesselation. R package
version 0.3-6.
Hamilton, James D. 1994. Time Series Analysis. Princeton, NJ: Princeton University
Press.
Hardin, James W., and Hilbe, Joseph M. 2012. Generalized Estimating Equations. 2nd
edn. Chapman & Hall/CRC Press.
Harrell, Jr., Frank E. 2017. rms: Regression Modeling Strategies. R package version
5.1-0.
Hasan, Asad, Wang, Zhiyu, and Mahani, Alireza S. 2016. Fast Estimation of Multino-
mial Logit Models: R Package mnlogit. Journal of Statistical Software, 75(3), 1–24.
Hastie, Trevor, Tibshirani, Robert J., and Friedman, Jerome. 2008. Elements of Statistical
Learning. New York: Springer-Verlag.
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. 2016. The Elements of
Statistical Learning: Data Mining, Inference, and Prediction. 2nd edn. New York:
Springer.
Hausman, Jerry A., and McFadden, Daniel. 1984. Specification Tests for the Multino-
mial Logit Model. Econometrica, 52(5), 1219–1240.
Heckman, James J., and Snyder, Jr., James M. 1997. Linear Probability Models of the
Demand for Attributes with an Application to Estimating Preferences of Legislators.
RAND Journal of Economics, 28(Special Issue in Honor of Richard E. Quandt),
142–189.
Heinze, Georg and Ploner, Meinhard. 2016. logistf : Firth’s Bias-Reduced Logistic
Regression. R package version 1.22. https://CRAN.R-project.org/package=logistf.
Heinze, Georg, and Schemper, Michael. 2002. A Solution to the Problem of Separation
in Logistic Regression. Statistics in Medicine, 21(16), 2409–2419.
Henningsen, Arne, and Toomet, Ott. 2011. maxLik: A Package for Maximum Likeli-
hood Estimation in R. Computational Statistics, 26(3), 443–458.
Herndon, Thomas, Ash, Michael, and Pollin, Robert. 2014. Does High Public Debt
Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff. Cambridge
Journal of Economics, 38(2), 257–279.
Hilbe, Joseph M. 2008. Negative Binomial Regression. New York: Cambridge University
Press.
Hilbe, Joseph M. 2014. Modeling Count Data. New York: Cambridge University Press.
Hill, Daniel W., and Jones, Zachary M. 2014. An Empirical Evaluation of Explanations
for State Repression. American Political Science Review, 108(03), 661–687.
Ho, Daniel E., and Imai, Kosuke. 2006. Randomization Inference with Natural Exper-
iments: An Analysis of Ballot Effects in the 2003 California Recall Election. Journal
of the American Statistical Association, 101(475), 888–900.
Hoaglin, David C. 1980. A Poissonness Plot. American Statistician, 34(3), 146–149.
Hoaglin, David C., and Tukey, John W. 1985. Checking the Shape of Discrete Distribu-
tions. In Hoaglin, David C., Mosteller, F., and Tukey, John W. (eds.), Exploring Data
Tables, Trends, and Shapes. Hoboken, NJ: John Wiley & Sons, 345–416.
Hoekstra, Rink, Morey, Richard D., Rouder, Jeffrey N., and Wagenmakers, Eric-Jan.
2014. Robust Misinterpretation of Confidence Intervals. Psychometric Bulletin and
Review, 21(5), 1157–1164.
Hoff, Peter. 2009. Multiplicative Latent Factor Models for Description and Prediction
of Social Networks. Computational and Mathematical Organization Theory, 15(4),
261–272.
Hoff, Peter D. 2007. Extending the Rank Likelihood for Semiparametric Copula
Estimation. Annals of Applied Statistics, 1(1), 265–283.
Hoff, Peter D. 2015. Multilinear Tensor Regression for Longitudinal Relational Data.
Annals of Applied Statistics, 9(3), 1169–1193.
Hoff, Peter D., Niu, Xiaoyue, and Wellner, Jon A. 2014. Information Bounds for Gaussian
Copulas. Bernoulli, 20(2), 604–622.
Hollenbach, Florian M., Metternich, Nils W., Minhas, Shahryar, and Ward, Michael D.
2014. Fast & Easy Imputation of Missing Social Science Data. arXiv preprint
arXiv:1411.0647. Sociological Methods and Research, in press.
Honaker, James, and King, Gary. 2010. What to Do about Missing Values in Time-Series
Cross-Section Data. American Journal of Political Science, 54(2010), 561–581.
Honaker, James, King, Gary, and Blackwell, Matthew. 2011. Amelia II: A Program for
Missing Data. Journal of Statistical Software, 45(7), 1–47.
Horrace, William C., and Oaxaca, Ronald L. 2006. Results on the Bias and Inconsistency
of Ordinary Least Squares for the Linear Probability Model. Economic Letters, 90(3),
321–327.
Horton, Nicholas J., and Kleinman, Ken P. 2007. Much Ado about Nothing: A Com-
parison of Missing Data Methods and Software to Fit Incomplete Data Regression
Models. American Statistician, 61(1), 79–90.
Huber, Peter J. 1967. The Behavior of Maximum Likelihood Estimates under Non-
Standard Conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, 1, 221–233.
Imai, Kosuke, King, Gary, and Lau, Olivia. 2008. Toward a Common Framework
for Statistical Analysis and Development. Journal of Computational and Graphical
Statistics, 17(4), 892–913.
Imai, Kosuke, and Tingley, Dustin H. 2012. A Statistical Method for Empirical Testing
of Competing Theories. American Journal of Political Science, 56(1), 218–236.
Imai, Kosuke, and van Dyk, David A. 2005. MNP: R Package for Fitting the Multinomial
Probit Model. Journal of Statistical Software, 14(3), 1–32.
Ioannidis, John P.A. 2005. Why Most Published Research Findings Are False. PLoS
Medicine, 2(8), E124.
Jackson, Christopher. 2016. flexsurv: A Platform for Parametric Survival Modeling
in R. Journal of Statistical Software, 70(8), 1–33.
Jones, Bradford, and Branton, Regina P. 2005. Beyond Logit and Probit: Cox Duration
Models of Single, Repeating, and Competing Events for State Policy Adoption. State
Politics and Policy Quarterly, 5(4), 420–443.
Josse, Julie, and Husson, François. 2016. missMDA: A Package for Handling Missing
Values in Multivariate Data Analysis. Journal of Statistical Software, 70(1), 1–31.
Kalbfleisch, John D., and Prentice, Ross L. 2002. The Statistical Analysis of Failure Time
Data. Hoboken, NJ: Wiley-Interscience.
Kam, Cindy D., and Franzese, Robert J. 2007. Modeling and Interpreting Interactive
Hypotheses in Regression Analysis. Ann Arbor, MI: University of Michigan Press.
Kastellec, Jonathan P., and Leoni, Eduardo L. 2007. Using Graphs Instead of Tables in
Political Science. Perspectives on Politics, 5(4), 755–771.
King, Gary. 1988. Statistical Models for Political Science Event Counts: Bias in Con-
ventional Procedures and Evidence for the Exponential Poisson Regression model.
American Journal of Political Science, 32(3), 838–863.
King, Gary. 1989a[1998]. Unifying Political Methodology: The Likelihood Theory of
Statistical Inference. Ann Arbor, MI: University of Michigan Press.
King, Gary. 1989b. Variance Specification in Event Count Models: From Restrictive
Assumptions to a Generalized Estimator. American Journal of Political Science, 33(3),
762–784.
King, Gary. 1995. Replication, Replication. PS: Political Science & Politics, 28(3),
444–452.
King, Gary, Alt, James, Burns, Nancy, and Laver, Michael. 1990. A Unified Model of
Cabinet Dissolution in Parliamentary Democracies. American Journal of Political
Science, 34(3), 846–871.
King, Gary, Honaker, James, Joseph, Anne, and Scheve, Kenneth. 2001. Analyzing
Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation.
American Political Science Review, 95(1), 49–69.
King, Gary, and Roberts, Margaret E. 2014. How Robust Standard Errors Expose
Methodological Problems They Do Not Fix, and What to Do about It. Political
Analysis, 23(2), 159–179.
King, Gary, and Signorino, Curtis S. 1996. The Generalization in the Generalized Event
Count Model. Political Analysis, 6, 225–252.
King, Gary, Tomz, Michael, and Wittenberg, Jason. 2000. Making the Most of Statistical
Analyses: Improving Interpretation and Presentation. American Journal of Political
Science, 44(2), 341–355.
King, Gary, and Zeng, Langche. 2001. Logistic Regression in Rare Events Data. Political
Analysis, 9(2), 137–163.
King, Gary, and Zeng, Langche. 2006. The Dangers of Extreme Counterfactuals.
Political Analysis, 14(2), 131–159.
King, Gary, and Zeng, Langche. 2007. When Can History be Our Guide? The Pitfalls
of Counterfactual Inference. International Studies Quarterly, 51(1), 183–210.
Klaassen, Chris A. J., and Wellner, Jon A. 1997. Efficient Estimation in the Bivariate Normal
Copula Model: Normal Margins Are Least Favourable. Bernoulli, 3(1), 55–77.
Klasnja, Marko. 2017. Uninformed Voters and Corrupt Politicians. American Politics
Research, 45(2), 256–279.
Kleiber, Christian, and Zeileis, Achim. 2016. Visualizing Count Data Regressions Using
Rootograms. American Statistician, 70(3), 296–303.
Knuth, Donald Ervin. 1984. Literate Programming. Computer Journal, 27(2), 97–111.
Kordas, Gregory. 2006. Smoothed Binary Regression Quantiles. Journal of Applied
Econometrics, 21(3), 387–407.
Kosmidis, Ioannis. 2017. brglm: Bias Reduction in Binary-Response Generalized
Linear Models. R package version 0.6.1.
Kowarik, Alexander, and Templ, Matthias. 2016. Imputation with the R Package VIM.
Journal of Statistical Software, 74(7), 1–16.
Krain, Matthew. 2005. International Intervention and the Severity of Genocides and
Politicides. International Studies Quarterly, 49(3), 363–388.
Kriner, Douglas L., and Shen, Francis X. 2016. Conscription, Inequality, and Partisan
Support for War. Journal of Conflict Resolution, 60(8), 1419–1445.
Kropko, Jonathan, Goodrich, Ben, Gelman, Andrew, and Hill, Jennifer. 2014. Multiple
Imputation for Continuous and Categorical Data: Comparing Joint Multivariate
Normal and Conditional Approaches. Political Analysis, 22(4), 497–519.
Kuhn, Max. 2016. A Short Introduction to the caret Package. R package version 1.6.8.
Laitin, David D., and Reich, Rob. 2017. Trust, Transparency, and Replication in Political
Science. PS: Political Science & Politics, 50(1), 172–175.
Lall, Ranjit. 2016. How Multiple Imputation Makes a Difference. Political Analysis,
24(4), 414–443.
Land, Kenneth C., McCall, Patricia L., and Nagin, Daniel S. 1996. A Comparison of
Poisson, Negative Binomial, and Semiparametric Mixed Poisson Regression Models
with Empirical Applications to Criminal Careers Data. Sociological Methods &
Research, 24(4), 387–442.
Lander, Jared P. 2016. coefplot: Plots Coefficients from Fitted Models. R package version
1.2.4.
LeCam, Lucian. 1990. Maximum Likelihood: An Introduction. International Statistical
Review, 58(2), 153–171.
Li, Jialiang, and Fine, Jason P. 2008. ROC Analysis with Multiple Classes and Multiple
Tests: Methodology and Its Application in Microarray Studies. Biostatistics, 9(3),
566–576.
Little, Roderick J. A., and Rubin, Donald B. 2002. Statistical Analysis with Missing Data.
Hoboken, NJ: Wiley.
Lo, Adeline, Chernoff, Herman, Zheng, Tian, and Lo, Shaw-Hwa. 2015. Why Signif-
icant Variables Aren’t Automatically Good Predictors. Proceedings of the National
Academy of Sciences of the United States of America, 112(45), 13892–13897.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent
Variables. Thousand Oaks, CA: SAGE Publications.
Lucas, Christopher, Nielsen, Richard A., Roberts, Margaret E., Stewart, Brandon M.,
Storer, Alex, and Tingley, Dustin. 2015. Computer-Assisted Text Analysis for Com-
parative Politics. Political Analysis, 23(2), 254–277.
Lumley, Thomas. 2014. mitools: Tools for Multiple Imputation of Missing Data. R
package version 2.3.
MacFarlane, John. 2014. Pandoc: A Universal Document Converter. http://pandoc.org.
Mares, Isabela. 2015. From Open Secrets to Secret Voting: Democratic Electoral
Reforms and Voter Autonomy. New York: Cambridge University Press.
Markatou, Marianthi, Tian, Hong, Biswas, Shameek, and Hripcsak, George. 2005.
Analysis of Variance of Cross-Validation Estimators of the Generalization Error.
Journal of Machine Learning Research, 6(Jul), 1127–1168.
Marshall, Monty G., Gurr, Ted Robert, and Jaggers, Keith. 2016. Polity IV Project:
Political Regime Characteristics and Transitions, 1800–2015, Dataset Users’ Manual.
College Park, MD: Center for Systemic Peace.
Martin, Lanny, and Stevenson, Randolph T. 2010. The Conditional Impact of
Incumbency on Government Formation. American Political Science Review, 104(3),
503–518.
Mayo, Deborah G. 2014. On the Birnbaum Argument for the Strong Likelihood
Principle (with Discussion & Rejoinder). Statistical Science, 29(2), 227–266.
McCullagh, Peter, and Nelder, John A. 1989. Generalized Linear Models. 2nd edn.
London: Chapman & Hall.
McFadden, Daniel L. 1974. Conditional Logit Analysis of Qualitative Choice Behavior.
In Zarembka, Paul (ed.), Frontiers in Econometrics. New York: Academic Press,
105–142.
McKelvey, Richard D., and Zavoina, William. 1975. A Statistical Model for the Analysis
of Ordinal Level Dependent Variables. Journal of Mathematical Sociology, 4(1),
103–120.
Mebane, Walter R., and Sekhon, Jasjeet S. 1998. Genetic Optimization Using Deriva-
tives: Theory and Application to Nonlinear Models. Political Analysis, 7, 189–210.
Mebane, Walter R., and Sekhon, Jasjeet S. 2002. Coordinates and Policy Moderation at
Midterm. American Political Science Review, 96(1), 141–157.
Mebane, Walter R., and Sekhon, Jasjeet S. 2004. Robust Estimation and Outlier
Detection for Overdispersed Multinomial Models of Count Data. American Journal
of Political Science, 48(2), 392–411.
Mebane, Walter R., and Sekhon, Jasjeet S. 2011. Genetic Optimization Using Deriva-
tives: The rgenoud Package for R. Journal of Statistical Software, 42(11), 1–26.
Merkle, Edgar C., and Steyvers, Mark. 2013. Choosing a Strictly Proper Scoring Rule.
Decision Analysis, 10(4), 292–304.
Meyer, David, Zeileis, Achim, and Hornik, Kurt. 2016. vcd: Visualizing Categorical
Data. R package version 1.4-3.
Miguel, Edward, Camerer, Colin, Casey, Katherine, Cohen, Joshua, Esterling, Kevin M.,
Gerber, Alan, Glennerster, Rachel, Green, Don P., Humphreys, Macartan, Imbens,
Guido, et al. 2014. Promoting Transparency in Social Science Research. Science,
343(6166), 30–31.
Miles, Andrew. 2016. Obtaining Predictions from Models Fit to Multiply Imputed Data.
Sociological Methods and Research, 45(1), 175–185.
Molenberghs, Geert, Fitzmaurice, Garrett, Kenward, Michael G., Tsiatis, Anastasios,
and Verbeke, Geert. 2014. Handbook of Missing Data Methodology. Boca Raton,
FL: Chapman & Hall/CRC Press.
Monogan, James E., III. 2015. Political Analysis Using R. New York: Springer.
Montgomery, Jacob M., Hollenbach, Florian M., and Ward, Michael D. 2012a. Ensem-
ble Predictions of the 2012 US Presidential Election. PS: Political Science & Politics,
45(4), 651–654.
Montgomery, Jacob M., Hollenbach, Florian M., and Ward, Michael D. 2012b. Improv-
ing Predictions Using Ensemble Bayesian Model Averaging. Political Analysis, 20(3),
271–291.
Mroz, Thomas A. 1987. The Sensitivity of an Empirical Model of Married Women’s
Hours of Work to Economic and Statistical Assumptions. Econometrica, 55(4),
765–799.
Murray, Jared S., and Reiter, Jerome P. 2014. Multiple Imputation of Missing Categor-
ical and Continuous Values via Bayesian Mixture Models with Local Dependence.
arXiv.
Mutz, Diana C., and Reeves, Byron. 2005. The New Videomalaise: Effects of Televised
Incivility on Political Trust. American Political Science Review, 99(1), 1–15.
Nagler, Jonathan. 1994. Scobit: An Alternative Estimator to Logit and Probit. American
Journal of Political Science, 38(1), 230–255.
Nanes, Matthew J. 2017. Political Violence Cycles: Electoral Incentives and the Provi-
sion of Counterterrorism. Comparative Political Studies, 50(2), 171–199.
NCAR - Research Applications Laboratory. 2015. verification: Weather Forecast Veri-
fication Utilities. R package version 1.42.
Nelder, John A., and Wedderburn, Robert W. M. 1972. Generalized Linear Models.
Journal of the Royal Statistical Society Series A, 135(3), 370–384.
Nelsen, Roger B. 2006. An Introduction to Copulas. 2nd edn. New York: Springer.
Nyhan, Brendan, and Montgomery, Jacob. 2010. Bayesian Model Averaging: Theoreti-
cal Developments and Practical Applications. Political Analysis, 18(2), 245–270.
Open Science Collaboration. 2015. Estimating the Reproducibility of Psychological
Science. Science, 349(6251), aac4716.
Park, Jong Hee. 2012. A Unified Method for Dynamic and Cross-Sectional Heterogene-
ity: Introducing Hidden Markov Panel Models. American Journal of Political Science,
56(4), 1040–1054.
Park, Sunhee, and Hendry, David J. 2015. Reassessing Schoenfeld Residual Tests of
Proportional Hazards in Political Science Event History Analyses. American Journal
of Political Science, 59(4), 1072–1087.
Pawitan, Yudi. 2013. In All Likelihood: Statistical Modeling and Inference Using
Likelihood. 1st edn. Oxford: Oxford University Press.
Peterson, Bercedis, and Harrell, Frank E., Jr. 1990. Partial Proportional Odds Models
for Ordinal Response Variables. Applied Statistics, 39(2), 205–217.
Pettitt, Anthony N. 1982. Inference for the Linear Model Using a Likelihood Based
on Ranks. Journal of the Royal Statistical Society Series B (Methodological), 44(2),
234–243.
Pitt, Michael, Chan, David, and Kohn, Robert. 2006. Efficient Bayesian Inference for
Gaussian Copula Regression Models. Biometrika, 93(3), 537–554.
Poisson, Siméon Denis. 1837. Recherches sur la probabilité des jugements en matière
criminelle et en matière civile: Précédées des règles générales du calcul des probabilités.
Paris: Bachelier.
Prado, Raquel, Ferreira, Marco A. R., and West, Mike. 2017. Time Series: Modelling,
Computation & Inference. 2nd (forthcoming) edn. Boca Raton, FL: Chapman &
Hall/CRC Press.
Raftery, Adrian E. 1995. Bayesian Model Selection in Social Research. Sociological
Methodology, 25, 111–163.
Rainey, Carlisle. 2016. Dealing with Separation in Logistic Regression Models. Political
Analysis, 24(3), 339–355.
Rauchhaus, Robert. 2009. Evaluating the Nuclear Peace Hypothesis: A Quantitative
Approach. Journal of Conflict Resolution, 53(2), 258–277.
Resnick, Sidney I. 2014. A Probability Path. Boston, MA: Birkhäuser.
Ripley, Brian, and Venables, William. 2016. Package nnet. R package version 7.3-12.
Rose, Andrew K. 2004. Do We Really Know That the WTO Increases Trade? American
Economic Review, 94(1), 98–114.
Ross, Michael L. 2006. Is Democracy Good for the Poor? American Journal of Political
Science, 50(4), 860–874.
Rozeboom, William W. 1960. The Fallacy of the Null-Hypothesis Significance Test.
Psychological Bulletin, 57, 416–428.
Rubin, Donald B. 1976. Inference and Missing Data. Biometrika, 63(3), 581–592.
Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York:
John Wiley & Sons.
Rubin, Donald B. 1996. Multiple Imputation after 18+ Years. Journal of the American
Statistical Association, 91(434), 473–489.
Rubin, Donald B. 2004. Multiple Imputation for Nonresponse in Surveys. New York:
John Wiley & Sons.
Saket, Bahador, Scheidegger, Carlos, Kobourov, Stephen G., and Börner, Katy. 2015.
Map-Based Visualizations Increase Recall Accuracy of Data. Computer Graphics
Forum, 34(3), 441–450.
Schafer, Joseph L. 1999. Multiple Imputation: A Primer. Statistical Methods in Medical
Research, 8(1), 3–15.
Titiunik, Rocío, and Feher, Andrew. 2017. Legislative Behaviour Absent Re-Election
Incentives: Findings from a Natural Experiment in the Arkansas Senate. Journal of
the Royal Statistical Society Series A, 180(pt. 3), 1–28.
Tomz, Michael, Goldstein, Judith, and Rivers, Douglas. 2007. Do We Really Know That
the WTO Increases Trade? American Economic Review, 97(5), 2005–2018.
Tomz, Michael, Wittenberg, Jason, and King, Gary. 2001. CLARIFY: Software for
Interpreting and Presenting Statistical Results. Version 2.0. Harvard University,
Cambridge, MA.
Train, Kenneth E. 2009. Discrete Choice Methods with Simulation. 2nd edn. New York:
Cambridge University Press.
Treier, Shawn, and Jackman, Simon. 2008. Democracy as a Latent Variable. American
Journal of Political Science, 52(1), 201–217.
Trivedi, Pravin K., and Zimmer, David M. 2005. Copula Modeling: An Introduction for
Practitioners. Boston: Now Publishers, Inc.
Tufte, Edward R. 1992. The Visual Display of Quantitative Information. Cheshire, CT:
Graphics Press.
Tukey, John W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Uscinski, Joseph E., and Klofstad, Casey A. 2010. Who Likes Political Science? Deter-
minants of Senators’ Votes on the Coburn Amendment. PS: Political Science & Politics,
43(4), 701–706.
Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman
& Hall/CRC Press.
Van Buuren, Stef, and Groothuis-Oudshoorn, Karin. 2011. MICE: Multivariate Imputa-
tion by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.
van den Boogaart, K. Gerald, and Tolosana-Delgado, Raimon. 2013. Analyzing Com-
positional Data with R. Berlin: Springer-Verlag.
van den Boogaart, K. Gerald, Tolosana, Raimon, and Bren, Matevz. 2014. compositions:
Compositional Data Analysis. R package version 1.40-1.
Venables, William N., and Ripley, Brian D. 2002. Modern Applied Statistics with S.
4th edn. New York: Springer.
Ver Hoef, Jay M., and Boveng, Peter L. 2007. Quasi-Poisson vs. Negative Binomial
Regression: How Should We Model Overdispersed Count Data? Ecology, 88(11),
2766–2772.
von Hippel, Paul. 2009. How to Impute Interactions, Squares, and Other Transformed
Variables. Sociological Methodology, 39(1), 265–291.
Walter, Stefanie. 2017. Globalization and the Demand-Side of Politics: How Globaliza-
tion Shapes Labor Market Risk Perceptions and Policy Preferences. Political Science
Research and Methods, 5(1), 55–80.
Wang, Xia, and Dey, Dipak K. 2010. Generalized Extreme Value Regression for Binary
Response Data: An Application to B2B Electronic Payments System Adoption. Annals
of Applied Statistics, 4(4), 2000–2023.
Ward, Michael D., and Gleditsch, Kristian Skrede. 2018. Spatial Regression Models.
Thousand Oaks, CA: SAGE Publications.
Ward, Michael D., Greenhill, Brian D., and Bakke, Kristin M. 2010. The Perils of
Policy by p-Value: Predicting Civil Conflicts. Journal of Peace Research, 47(4),
363–375.
Ward, Michael D., Metternich, Nils W., Dorff, Cassy L., Gallop, Max, Hollenbach,
Florian M., Schultz, Anna, and Weschle, Simon. 2013. Learning from the Past
and Stepping into the Future: Toward a New Generation of Conflict Prediction.
International Studies Review, 15(4), 473–490.
Wedderburn, Robert W. M. 1974. Quasi-Likelihood Functions, Generalized Linear
Models, and the Gauss-Newton Method. Biometrika, 61(3), 439–447.
White, Halbert L. 1980. A Heteroskedasticity-Consistent Covariance Matrix Estimator
and a Direct Test for Heteroskedasticity. Econometrica, 48, 817–838.
Williams, Richard. 2016. Understanding and Interpreting Generalized Ordered Logit
Models. Journal of Mathematical Sociology, 40(1), 7–20.
Wilson, Paul. 2015. The Misuse of the Vuong Test for Non-Nested Models to Test for
Zero-Inflation. Economics Letters, 127, 51–53.
Winkelmann, Rainer. 1995. Duration Dependence and Dispersion in Count-Data Mod-
els. Journal of Business & Economic Statistics, 13(4), 467–474.
Wolford, Scott. 2017. The Problem of Shared Victory: War-Winning Coalitions and
Postwar Peace. Journal of Politics, 79(2), 702–716.
Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data.
2nd edn. Cambridge, MA: MIT Press.
Wright, Joseph, and Frantz, Erica. 2017. How Oil Income and Missing Hydrocarbon
Rents Data Influence Autocratic Survival: A Response to Lucas and Richter (2016).
Research & Politics, 4(3).
Xie, Yihui. 2015. Dynamic Documents with R and knitr. 2nd edn. Boca Raton, FL:
Chapman & Hall/CRC Press.
Yee, Thomas W. 2010. The VGAM Package for Categorical Data Analysis. Journal of
Statistical Software, 32(10), 1–34.
Zeileis, Achim. 2004. Econometric Computing with HC and HAC Covariance Matrix
Estimators. Journal of Statistical Software, 11(10), 1–17.
Zeileis, Achim, and Kleiber, Christian. 2017. countreg: Count Data Regression. R
package version 0.2-0.
Zeileis, Achim, Kleiber, Christian, and Jackman, Simon. 2008. Regression Models for
Count Data in R. Journal of Statistical Software, 27(8), 1–25.
Zeileis, Achim, Koenker, Roger, and Doebler, Philipp. 2015. glmx: Generalized Linear
Models Extended. R package version 0.1-1.
Ziegler, Andreas. 2011. Generalized Estimating Equations. New York: Springer.
Zorn, Christopher. 1998. An Analytic and Empirical Examination of Zero-Inflated and
Hurdle Poisson Specifications. Sociological Methods and Research, 26(3), 368–400.
Zorn, Christopher J. W. 2000. Generalized Estimating Equation Models for Correlated
Data: A Review with Applications. American Journal of Political Science, 45(2),
470–490.
Zorn, Christopher. 2005. A Solution to Separation in Binary Response Models. Political
Analysis, 13(2), 157–170.