James E. Monogan III
Political Analysis Using R
Use R! Series
Series Editors: Robert Gentleman, Kurt Hornik, Giovanni Parmigiani
James E. Monogan III
Department of Political Science
University of Georgia
Athens, GA, USA
The purpose of this volume is twofold: to help readers who are new to political
research to learn the basics of how to use R and to provide details to intermediate
R users about techniques they may not have used before. R has become prominent
in political research because it is free, easily incorporates user-written packages,
and offers user flexibility in creating unique solutions to complex problems. All of
the examples in this book are drawn from various subfields in Political Science, with
data drawn from American politics, comparative politics, international relations, and
public policy. The datasets come from the types of sources common to political and
social research, such as surveys, election returns, legislative roll call votes, nonprofit
organizations' assessments of practices across countries, and field experiments. Of
course, while the examples are drawn from Political Science, all of the techniques
described are valid for any discipline. Therefore, this book is appropriate for anyone
who wants to use R for social or political research.
All of the example and homework data, as well as copies of all of the example
code in the chapters, are available through the Harvard Dataverse:
http://dx.doi.org/10.7910/DVN/ARKOTI. As an overview of the examples, the following
list itemizes the data used in this book and the chapters in which the data are
referenced:
• 113th U.S. Senate roll call data (Poole et al. 2011). Chapter 8
• American National Election Study, 2004 subset used by Hanmer and Kalkan
  (2013). Chapters 2 and 7
• Comparative Study of Electoral Systems, 30-election subset analyzed in Singh
  (2014a), 60-election subset analyzed in Singh (2014b), and 77-election subset
  analyzed in Singh (2015). Chapters 7 and 8
• Democratization and international border settlements data, 200 countries from
  1918–2007 (Owsiak 2013). Chapter 6
• Drug policy monthly TV news coverage (Peake and Eshbaugh-Soha 2008).
  Chapters 3, 4, and 7
• Energy policy monthly TV news coverage (Peake and Eshbaugh-Soha 2008).
  Chapters 3, 7, 8, and 9
• Health lobbying data from the U.S. states (Lowery et al. 2008). Chapter 3
• Japanese monthly electricity consumption by sector and policy action (Wakiyama
  et al. 2014). Chapter 9
• Kansas Event Data System, weekly actions from 1979–2003 in the
  Israeli-Palestinian conflict (Brandt and Freeman 2006). Chapter 9
• Monte Carlo analysis of strategic multinomial probit of international strategic
  deterrence model (Signorino 1999). Chapter 11
• National Supported Work Demonstration, as analyzed by LaLonde (1986).
  Chapters 4, 5, and 8
• National Survey of High School Biology Teachers, as analyzed by Berkman and
  Plutzer (2010). Chapters 6 and 8
• Nineteenth century militarized interstate disputes data, drawn from Bueno de
  Mesquita and Lalman (1992) and Jones et al. (1996). Example applies the method
  of Signorino (1999). Chapter 11
• Optimizing an insoluble party electoral competition game (Monogan 2013b).
  Chapter 11
• Political Terror Scale data on human rights, 1993–1995 waves (Poe and Tate
  1994; Poe et al. 1999). Chapter 2
• Quarterly U.S. monetary policy and economic data from 1959–2001 (Enders
  2009). Chapter 9
• Salta, Argentina field experiment on e-voting versus traditional voting (Alvarez
  et al. 2013). Chapters 5 and 8
• United Nations roll call data from 1946–1949 (Poole et al. 2011). Chapter 8
• U.S. House of Representatives elections in 2010 for Arizona and Tennessee
  (Monogan 2013a). Chapter 10
Like many other statistical software books, each chapter contains example code
that the reader can use to practice using the commands with real data. The examples
in each chapter are written as if the reader will work through all of the code in
one chapter in a single session. Therefore, a line of code may depend on prior
lines of code within the chapter. However, no chapter will assume that any code
from previous chapters has been run during the current session. Additionally, to
distinguish ideas clearly, the book uses fonts and colors to help distinguish input
code, output printouts, variable names, concepts, and definitions. Please see Sect. 1.2
on p. 4 for a description of how these fonts are used.
To the reader, are you a beginning or intermediate user? To the course instructor,
in what level of class are you assigning this text? This book offers information
at a variety of levels. The first few chapters are intended for beginners, while the
later chapters introduce progressively more advanced topics. The chapters can be
approximately divided into three levels of difficulty, so various chapters can be
introduced in different types of courses or read based on readers' needs:
• The book begins with basic information; in fact, Chap. 1 assumes that the reader
  has never installed R or done any substantial data analysis. Chapter 2 continues
  by describing how to input, clean, and export data in R. Chapters 3–5 describe
  graphing techniques and basic inferences, offering a description of the techniques
  as well as code for implementing them in R. The content in the first five chapters
  should be accessible to undergraduate students taking a quantitative methods
  class, or could be used as a supplement when introducing these concepts in a
  first-semester graduate methods class.
• Afterward, the book turns to content that is more commonly taught at the
  graduate level: Chap. 6 focuses on linear regression and its diagnostics, though
  this material is sometimes taught to undergraduates. Chapter 7 offers code
  for generalized linear models: models like logit, ordered logit, and count
  models that are often taught in a course on maximum likelihood estimation.
  Chapter 8 introduces students to the concept of using packages in R to apply
  advanced methods, so this could be worthwhile in the final required course of a
  graduate methods sequence or in any upper-level course. Specific topics that are
  sampled in Chap. 8 are multilevel modeling, simple Bayesian statistics, matching
  methods, and measurement with roll call data. Chapter 9 introduces a variety
  of models for time series analysis, so it would be useful as a supplement in a
  course on that topic, or perhaps even an advanced regression course that wanted
  to introduce time series.
• The last two chapters, Chaps. 10 and 11, offer an introduction to R programming.
  Chapter 10 focuses specifically on matrix-based math in R. This chapter actually
  could be useful in a "math for social science" class, if students should learn
  how to conduct linear algebra using software. Chapter 11 introduces a variety
  of concepts important to writing programs in R: functions, loops, branching,
  simulation, and optimization.
As a final word, there are several people I wish to thank for help throughout the
writing process. For encouragement and putting me into contact with Springer to
produce this book, I thank Keith L. Dougherty. For helpful advice and assistance
along the way, I thank Lorraine Klimowich, Jon Gurstelle, Eric D. Lawrence, Keith
T. Poole, Jeff Gill, Patrick T. Brandt, David Armstrong, Ryan Bakker, Philip Durbin,
Thomas Leeper, Kerem Ozan Kalkan, Kathleen M. Donovan, Richard G. Gardiner,
Gregory N. Hawrelak, students in several of my graduate methods classes, and
several anonymous reviewers. The content of this book draws from past short
courses I have taught on R. These courses in turn drew from short courses taught
by Luke J. Keele, Evan Parker-Stephen, William G. Jacoby, Xun Pang, and Jeff
Gill. My thanks to Luke, Evan, Bill, Xun, and Jeff for sharing this information. For
sharing data that were used in the examples in this book, I thank R. Michael Alvarez,
Ryan Bakker, Michael B. Berkman, Linda Cornett, Matthew Eshbaugh-Soha, Brian
J. Fogarty, Mark Gibney, Virginia H. Gray, Michael J. Hanmer, Peter Haschke,
Kerem Ozan Kalkan, Luke J. Keele, Linda Camp Keith, Gary King, Marcelo Leiras,
Ines Levin, Jeffrey B. Lewis, David Lowery, Andrew P. Owsiak, Jeffrey S. Peake,
Eric Plutzer, Steven C. Poe, Julia Sofía Pomares, Keith T. Poole, Curtis S. Signorino,
Shane P. Singh, C. Neal Tate, Takako Wakiyama, Reed M. Wood, and Eric Zusman.
Contents

4 Descriptive Statistics   53
  4.1 Measures of Central Tendency   54
    4.1.1 Frequency Tables   57
  4.2 Measures of Dispersion   60
    4.2.1 Quantiles and Percentiles   61
  4.3 Practice Problems   62
5 Basic Inferences and Bivariate Association   63
  5.1 Significance Tests for Means   64
    5.1.1 Two-Sample Difference of Means Test, Independent Samples   66
    5.1.2 Comparing Means with Dependent Samples   69
  5.2 Cross-Tabulations   71
  5.3 Correlation Coefficients   74
  5.4 Practice Problems   76
6 Linear Models and Regression Diagnostics   79
  6.1 Estimation with Ordinary Least Squares   80
  6.2 Regression Diagnostics   84
    6.2.1 Functional Form   85
    6.2.2 Heteroscedasticity   89
    6.2.3 Normality   90
    6.2.4 Multicollinearity   93
    6.2.5 Outliers, Leverage, and Influential Data Points   94
  6.3 Practice Problems   96
7 Generalized Linear Models   99
  7.1 Binary Outcomes   100
    7.1.1 Logit Models   101
    7.1.2 Probit Models   104
    7.1.3 Interpreting Logit and Probit Results   105
  7.2 Ordinal Outcomes   110
  7.3 Event Counts   116
    7.3.1 Poisson Regression   117
    7.3.2 Negative Binomial Regression   119
    7.3.3 Plotting Predicted Counts   121
  7.4 Practice Problems   123
8 Using Packages to Apply Advanced Models   127
  8.1 Multilevel Models with lme4   128
    8.1.1 Multilevel Linear Regression   128
    8.1.2 Multilevel Logistic Regression   131
  8.2 Bayesian Methods Using MCMCpack   134
    8.2.1 Bayesian Linear Regression   134
    8.2.2 Bayesian Logistic Regression   138
References   233
Index   237
Chapter 1
Obtaining R and Downloading Packages
This chapter is written for the user who has never downloaded R onto his or her
computer, much less opened the program. The chapter offers some brief background
on what the program is, then proceeds to describe how R can be downloaded and
installed completely free of charge. Additionally, the chapter lists some internet-
based resources on R that users may wish to consult whenever they have questions
not addressed in this book.
Fig. 1.1 The comprehensive R archive network (CRAN) homepage. The top box, Download and
Install R, offers installation links for the three major operating systems
Each of these links leads to the appropriate files for download and installation. In
each operating system, opening the respective file will guide the user through the
automated installation process.1 In this simple procedure,
a user can install R on his or her personal machine within five minutes.
As months and years pass, users will observe the release of new versions of R.
There are no update patches for R, so as new versions are released, you must
completely install a new version whenever you would like to upgrade to the latest
edition. Users need not reinstall every single version that is released. However, as
time passes, add-on libraries (discussed later in this chapter) will cease to support
older versions of R. One potential guide on this point is to upgrade to the newest
version whenever a library of interest cannot be installed due to lack of support.
The only major inconvenience that complete reinstallation poses is that user-created
add-on libraries will have to be reinstalled, but this can be done on an as-needed
basis.
Once you have installed R, there will be an icon either under the Windows
Start menu (with an option of placing a shortcut on the Desktop) or in the Mac
Applications folder (with an option of keeping an icon in the workspace Dock).
Clicking or double-clicking the icon will start R. Figure 1.2 shows the window
associated with the Mac version of the software. You will notice that R does have
a few push-button options at the top of the window. Within Mac, the menu bar at
the top of the workspace will also feature a few pull-down menus. In Windows,
pull down menus also will be presented within the R window. With only a handful
of menus and buttons, however, commands in R are entered primarily through user
code. Users desiring the fallback option of having more menus and buttons available
may wish to install RStudio or a similar program that adds a point-and-click front
end to R, but a knowledge of the syntax is essential. Figure 1.3 shows the window
associated with the Mac version of RStudio.2
Users can submit their code either through script files (the recommended choice,
described in Sect. 1.3) or on the command line displayed at the bottom of the R
console. In Fig. 1.2, the prompt looks like this:
>
1. The names of these files change as new versions of R are released. As of this printing,
the respective files are R-3.1.2-snowleopard.pkg or R-3.1.2-mavericks.pkg for
various versions of Mac OS X and R-3.1.2-win.exe for Windows. Linux users will find it
easier to install from a terminal. Terminal code is available by following the Download R for
Linux link on the CRAN page, then choosing a Linux distribution on the next page, and using the
terminal code listed on the resulting page. At the time of printing, Debian, various forms of Red
Hat, OpenSUSE, and Ubuntu are all supported.
2. RStudio is available at http://www.rstudio.com.
Whenever typing code directly into the command prompt, if a user types a single
command that spans multiple lines, then the command prompt turns into a plus sign
(+) to indicate that the command is not complete. The plus sign does not indicate
any problem or error, but just reminds the user that the previous command is not
yet complete. The cursor automatically places itself there so the user can enter
commands.
In the electronic edition of this book, input syntax and output printouts from R
will be color coded to help distinguish what the user should type in a script file from
what results to expect. Input code will be written in blue teletype font.
R output will be written in black teletype font. Error messages that R
returns will be written in red teletype font. These colors correspond to
the color coding R uses for input and output text. While the colors may not be
visible in the print edition, the book's text also will distinguish inputs from outputs.
Additionally, names of variables will be written in bold. Conceptual keywords from
statistics and programming, as well as emphasized text, will be written in italics.
Finally, when the meaning of a command is not readily apparent, identifying initials
will be underlined and bolded in the text. For instance, the lm command stands for
linear model.
When writing R syntax on the command line or in a script file, users should bear
a few important preliminaries in mind:
• Expressions and commands in R are case-sensitive. For example, the function
  var returns the variance of a variable: a simple function discussed in Chap. 4.
  By contrast, the VAR command from the vars library estimates a Vector
  Autoregression model, an advanced technique discussed in Chap. 9. Similarly, if
  a user names a dataset mydata, then it cannot be called with the names MyData,
  MYDATA, or myData. R would assume each of these names indicates a different
  meaning.
• Command lines do not need to be separated by any special character like a
  semicolon as in Limdep, SAS, or Gauss. A simple hard return will do.
• R ignores anything following the pound character (#) as a comment. This applies
  when using the command line or script files, but is especially useful when saving
  notes in script files for later use.
• An object name must start with an alphabetical character, but may contain
  numeric characters thereafter. A period may also form part of the name of an
  object. For example, x.1 is a valid name for an object in R.
• You can use the arrow keys on the keyboard to scroll back to previous commands.
  One push of the up arrow recalls the previously entered command and places it in
  the command line. Each additional push of the arrow moves to a command prior
  to the one listed in the command line, while the down arrow calls the command
  following the one listed in the command line.
Aside from this handful of important rules, the command prompt in R tends
to behave in an intuitive way, returning responses to input commands that could
be easily guessed. For instance, at its most basic level R functions as a high-end
calculator. Some of the key arithmetic commands are: addition (+), subtraction
(-), multiplication (*), division (/), exponentiation (^), the modulo function (%%),
and integer division (%/%). Parentheses ( ) specify the order of operations. For
example, if we type the following input:
(3+5/78)^3*7
Then R prints the following output:
[1] 201.3761
As another example, we could ask R what the remainder is when dividing 89 by 13
using the modulo function:
89%%13
R then provides the following answer:
[1] 11
If we wanted R to perform integer division, we could type:
89%/%13
Our output answer to this is:
[1] 6
The options command allows the user to tweak attributes of the output. For
example, the digits argument offers the option to adjust how many digits are
displayed. This is useful, for instance, when considering how precisely you wish to
present results on a table. Other useful built-in functions from algebra and trigonom-
etry include: sin(x), cos(x), tan(x), exp(x), log(x), sqrt(x), and pi.
To apply a few of these functions, first we can expand the number of digits printed
out, and then ask for the value of the constant π:
options(digits=16)
pi
R accordingly prints out the value of π to 16 digits:
[1] 3.141592653589793
We also may use commands such as pi to insert the value of such a constant into
a function. For example, if we wanted to compute the sine of a π/2 radians (or 90°)
angle, we could type:
sin(pi/2)
R correctly prints that sin(π/2) = 1:
[1] 1
1.3 Saving Input and Output
When analyzing data or programming in R, a user will never get into serious trouble
provided he or she follows two basic rules:
1. Always leave the original datafiles intact. Any revised version of data should be
written into a new file. If you are working with a particularly large and unwieldy
dataset, then write a short program that winnows-down to what you need, save
the cleaned file separately, and then write code that works with the new file.
2. Write all input code in a script that is saved. Users should usually avoid writing
code directly into the console. This includes code for cleaning, recoding, and
reshaping data as well as for conducting analysis or developing new programs.
If these two rules are followed, then the user can always recover his or her work up to
the point of some error or omission. So even if, in data management, some essential
information is dropped or lost, or even if a journal reviewer names a predictor that
a model should add, the user can always retrace his or her steps. By calling the
original dataset with the saved program, the user can make minor tweaks to the
code to incorporate a new feature of analysis or recover some lost information. By
contrast, if the original data are overwritten or the input code is not saved, then the
user likely will have to restart the whole project from the beginning, which is a waste
of time.
A script file in R is simply plain text, usually saved with the suffix .R. To create
a new script file in R, simply choose File → New Document in the drop down menu
to open the document. Alternatively, the console window shown in Fig. 1.2 shows
an icon that looks like a blank page at the top of the screen (second icon from the
right). Clicking on this will also create a new R script file. Once open, the normal
Save and Save As commands from the File menu apply. To open an existing R script,
choose File → Open Document in the drop down menu, or click the icon at the top
of the screen that looks like a page with writing on it (third icon from the right in
Fig. 1.2). When working with a script file, any code within the file can be executed
in the console by simply highlighting the code of interest, and typing the keyboard
shortcut Ctrl+R in Windows or Cmd+Return in Mac. Besides the default script
file editor, more sophisticated text editors such as Emacs and RWinEdt also are
available.
The product of any R session is saved in the working directory. The working
directory is the default file path for all files the user wants to read in or write out
to. The command getwd (meaning get working directory) will print R's current
working directory, while setwd (set working directory) allows you to change the
working directory as desired. Within a Windows machine the syntax for checking,
and then setting, the working directory would look like this:
getwd()
setwd("C:/temp/")
This now writes any output files, be they data sets, figures, or printed output to the
folder temp in the C: drive. Observe that R expects forward slashes to designate
subdirectories, which contrasts from Windows's typical use of backslashes. Hence,
specifying C:/temp/ as the working directory points to C:\temp\ in normal
Windows syntax. Meanwhile for Mac or Unix, setting a working directory would be
similar, and the path directory is printed exactly as these operating systems designate
them with forward slashes:
setwd("/Volumes/flashdisk/temp")
Note that setwd can be called multiple times in a session, as needed. Also,
specifying the full path for any file overrides the working directory.
To save output from your session in R, try the sink command. As a general
computing term, a sink is an output point for a program where data or results are
written out. In R, this term accordingly refers to a file that records all of our printed
output. To save your session's output to the file Rintro.txt within the working
directory type:
sink("Rintro.txt")
Now that we have created an output file, any output that normally would print to
the console will instead print to the file Rintro.txt. (For this reason, in a first
run of new code, it is usually advisable to allow output to print to the screen and then
rerun the code later to print to a file.) The print command is useful for creating
output that can be easily followed. For instance, the command:
print("The mean of variable x is...")
By way of explanation: this syntax draws randomly 1000 times from a standard
normal distribution and assigns the values to the vector x. Observe the arrow (<-),
formed with a less than sign and a hyphen, which is Rs assignment operator. Any
time we assign something with the arrow (<-) the name on the left (x in this
case) allows us to recall the result of the operation on the right (rnorm(1000)
in this case).3 Now we can print the mean of these 1000 draws (which should be
close to 0 in this case) to our output file as follows:
cat("The mean of variable x is...", mean(x), "\n")
With this syntax, objects from R can be embedded into the statement you print. The
character \n puts in a carriage return. You also can print any statistical output using
either the print or cat commands. Remember, your output does not go to the log
file unless you use one of the print commands. Another option is to simply copy and
paste results from the R console window into Word or a text editor. To turn off the
sink command, simply type:
sink()
As a rule, it is a good idea to use the rm command at the start of any new program.
If the previous user saved his or her workspace, then they may have used objects
sharing the same name as yours, which can create confusion.
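A common way to do this, shown here as a sketch rather than as the author's own
example, is to clear every object from the workspace at once:
rm(list=ls())   # ls() lists all objects in memory; rm() then removes them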
To quit R, either close the console window or type:
q()
At this point, R will ask if you wish to save the workspace image. Generally, it is
advisable not to do this, as starting with a clean slate in each session is more likely
to prevent programming errors or confusion on the versions of objects loaded in
memory.
Finally, in many R sessions, we will need to load packages, or batches of
code and data offering additional functionality not written in Rs base code.
Throughout this book we will load several packages, particularly in Chap. 8, where
they are used to apply several advanced models.
3. The arrow (<-) is the traditional assignment operator, though a single equals sign (=) also can
serve for assignments.
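For instance, installing and then loading the MCMCpack package discussed below
might look like this sketch:
install.packages("MCMCpack")   # downloads the package from a CRAN mirror
library(MCMCpack)              # loads the package for the current session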
Package installation is case and spelling sensitive. R will likely prompt you at this
point to choose one of the CRAN mirrors from which to download this package. For
faster downloading, users typically choose the mirror that is most geographically
proximate. The install.packages command only needs to be run once per R
installation for a particular package to be available on a machine. The library
command needs to be run for every session that a user wishes to use the package.
Hence, in the next session that we want to use MCMCpack, we need only type:
library(MCMCpack).
1.5 Resources
Given the wide array of base functions that are available in R, much less the even
wider array of functionality created by R packages, a book such as this cannot
possibly address everything R is capable of doing. This book should serve as a
resource introducing how a researcher can use R as a basic statistics program and
offer some general pointers about the usage of packages and programming features.
As questions emerge about topics not covered in this space, there are several other
resources that may be of use:
• Within R, the Help pull-down menu (also available by typing help.start()
  in the console) offers several manuals of use, including an "Introduction to R"
  and "Writing R Extensions." This also opens an HTML-based search engine of
  the help files.
• UCLA's Institute for Digital Research and Education offers several nice tutorials
  (http://www.ats.ucla.edu/stat/r/). The CRAN website also includes a variety of
  online manuals (http://www.cran.r-project.org/other-docs.html).
• Some nice interactive tutorials include swirl, which is a package you install in
  your own copy of R (more information: http://www.swirlstats.com/), and Try R,
  which is completed online (http://tryr.codeschool.com/).
• Within the R console, the commands ?, help(), and help.search() all
  serve to find documentation. For instance, ?lm would find the documentation
  for the linear model command. Alternatively, help.search("linear model")
  would search the documentation for a phrase, as in the short sketch after this list.
1.6 Practice Problems
Each chapter will end with a few practice problems. If you have tested all of the
code from the in-chapter examples, you should be able to complete these on your
own. If you have not done so already, go ahead and install R on your machine for
free and try the in-chapter code. Then try the following questions.
1. Compute the following in R:
   (a) (7 × 23)/8
   (b) 8² + 1
   (c) cos π
   (d) √81
   (e) ln e⁴
2. What does the command cor do? Find documentation about it and describe what
the function does.
3. What does the command runif do? Find documentation about it and describe
what the function does.
4. Create a vector named x that consists of 1000 draws from a standard normal
distribution, using code just like you see in Sect. 1.3. Create a second vector
named y in the same way. Compute the correlation coefficient between the two
vectors. What result do you get, and why do you get this result?
5. Get a feel for how to decide when add-on packages might be useful for you.
Log in to http://www.rseek.org and look up what the stringr package does.
What kinds of functionality does this package give you? When might you want
to use it?
Chapter 2
Loading and Manipulating Data
We now turn to using R to conduct data analysis. Our first basic steps are simply
to load data into R and to clean the data to suit our purposes. Data cleaning and
recoding are an often tedious task of data analysis, but nevertheless are essential
because miscoded data will yield erroneous results when a model is estimated using
them. (In other words, garbage in, garbage out.) In this chapter, we will load various
types of data using differing commands, view our data to understand their features,
practice recoding data in order to clean them as we need to, merge data sets, and
reshape data sets.
Our working example in this chapter will be a subset of Poe et al.'s (1999)
Political Terror Scale data on human rights, which is an update of the data in Poe
and Tate (1994). Whereas their complete data cover 1976–1993, we focus solely on
the year 1993. The eight variables this dataset contains are:
country: A character variable listing the country by name.
democ: The country's score on the Polity III democracy scale. Scores range from 0
(least democratic) to 10 (most democratic).
sdnew: The U.S. State Department scale of political terror. Scores range from
1 (low state terrorism, fewest violations of personal integrity) to 5 (highest
violations of personal integrity).
military: A dummy variable coded 1 for a military regime, 0 otherwise.
gnpcats: Level of per capita GNP in five categories: 1 = under $1000, 2 = $1000–
$1999, 3 = $2000–$2999, 4 = $3000–$3999, 5 = over $4000.
lpop: Logarithm of national population.
civ_war: A dummy variable coded 1 if involved in a civil war, 0 otherwise.
int_war: A dummy variable coded 1 if involved in an international war, 0 otherwise.
Getting data into R is quite easy. There are three primary ways to import data:
Inputting the data manually (perhaps written in a script file), reading data from
a text-formatted file, and importing data from another program. Since it is a bit
less common in Political Science, inputting data manually is illustrated in the
examples of Chap. 10, in the context of creating vectors, matrices, and data frames.
In this section, we focus on importing data from saved files.
First, we consider how to read in a delimited text file with the read.table
command. R will read in a variety of delimited files. (For all the options associated
with this command type ?read.table in R.) In text-based data, typically each
line represents a unique observation and some designated delimiter separates each
variable on the line. The default for read.table is a space-delimited file wherein
any blank space designates differing variables. Since our Poe et al. data file,
named hmnrghts.txt, is space-separated, we can read our file into R using the
following line of code. This data file is available from the Dataverse named on page
vii or the chapter content link on page 13. You may need to use the setwd command
introduced in Chap. 1 to point R to the folder where you have saved the data. After
this, run the following code:
hmnrghts<-read.table("hmnrghts.txt",
header=TRUE, na="NA")
Note: As was mentioned in the previous chapter, R allows the user to split a single
command across multiple lines, which we have done here. Throughout this book,
commands that span multiple lines will be distinguished with hanging indentation.
Moving to the code itself, observe a few features: One, as was noted in the previous
chapter, the left arrow symbol (<-) assigns our input file to an object. Hence,
hmnrghts is the name we allocated to our data file, but we could have called it
any number of things. Second, the first argument of the read.table command
calls the name of the text file hmnrghts.txt. We could have preceded this
argument with the file= option (and we would have needed to do so if we had
not listed this as the first argument), but R recognizes that the file itself is normally
the first argument this command takes. Third, we specified header=TRUE, which
conveys that the first row of our text file lists the names of our variables. It is
essential that this argument be correctly identified, or else variable names may be
erroneously assigned as data or data as variable names.1 Finally, within the text file,
the characters NA are written whenever an observation of a variable is missing. The
option na="NA" conveys to R that this is the data sets symbol for a missing value.
(Other common symbols of missingness are a period (.) or the number -9999.)
The command read.table also has other important options. If your text file
uses a delimiter other than a space, then this can be conveyed to R using the sep
option. For instance, including sep="\t" in the previous command would have
1. A closer look at the file itself will show that our header line of variable names actually has one
fewer element than each line of data. When this is the case, R assumes that the first item on each
line is an observation index. Since that is true in this case, our data are read correctly.
allowed us to read in a tab-separated text file, while sep="," would have allowed a
comma-separated file. The commands read.csv and read.delim are alternate
versions of read.table that merely have differing default values. (Particularly,
read.csv is geared to read comma-separated values files and read.delim
is geared to read tab-delimited files, though a few other defaults also change.)
Another important option for these commands is quote. The defaults vary across
these commands for which characters designate string-based variables that use
alphabetic text as values, such as the name of the observation (e.g., country, state,
candidate). The read.table command, by default, uses either single or double
quotation marks around the entry in the text file. This would be a problem if
double quotes were used to designate text, but apostrophes were in the text. To
compensate, simply specify the option quote = "\"" to only allow double
quotes. (Notice the backslash to designate that the double-quote is an argument.)
Alternatively, read.csv and read.delim both only allow double quotes by
default, so specifying quote = "\"'" would allow either single or double
quotes, or quote = "\'" would switch to single quotes. Authors also can specify
other characters in this option, if need be.
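As a sketch of these options in use, suppose we had tab-delimited and comma-separated
copies of the same data (the file names below are illustrative, and na.strings is the
full name of the argument abbreviated as na in the earlier call):
hmnrghts.tab <- read.table("hmnrghts.tab", header=TRUE, sep="\t", na.strings="NA")
hmnrghts.csv <- read.csv("hmnrghts.csv", na.strings="NA", quote="\"")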
Once you download a file, you have the option of specifying the full path
directory in the command to open the file. Suppose we had saved hmnrghts.txt
into the path directory C:/temp/, then we could load the file as follows:
hmnrghts <- read.table("C:/temp/hmnrghts.txt",
header=TRUE, na="NA")
As was mentioned when we first loaded the file, another option would have been to
use the setwd command to set the working directory, meaning we would not have
to list the entire file path in the call to read.table. (If we do this, all input and
output files will go to this directory, unless we specify otherwise.) Finally, in any
GUI-based system (e.g., non-terminal), we could instead type:
hmnrghts<-read.table(file.choose(),header=TRUE,na="NA")
The file.choose() option will open a file browser allowing the user to locate
and select the desired data file, which R will then assign to the named object in
memory (hmnrghts in this case). This browser option is useful in interactive
analysis, but less useful for automated programs.
Another format of text-based data is a fixed width file. Files of this format do not
use a character to delimit variables within an observation. Instead, certain columns
of text are consistently dedicated to a variable. Thus, R needs to know which
columns define each variable to read in the data. The command for reading a fixed
width file is read.fwf. As a quick illustration of how to load this kind of data, we
will load a different datasetroll call votes from the 113th United States Senate, the
term running from 2013 to 2015.2 This dataset will be revisited in a practice problem
2. These data were gathered by Jeff Lewis and Keith Poole. For more information, see
http://www.voteview.com/senate113.htm.
in Chap. 8. To open these data, start by downloading the file sen113kh.ord from
the Dataverse listed on page vii or the chapter content link on page 13. Then type:
senate.113<-read.fwf("sen113kh.ord",
widths=c(3,5,2,2,8,3,1,1,11,rep(1,657)))
The first argument of read.fwf is the name of the file, which here is the file we
just downloaded to the working directory. (The file extension is .ord, but the
format is plain text. Try opening
the file in Notepad or TextEdit just to get a feel for the formatting.) The second
argument, widths, is essential. For each variable, we must enter the number of
characters allocated to that variable. In other words, the first variable has three
characters, the second has five characters, and so on. This procedure should make
it clear that we must have a codebook for a fixed width file, or inputting the data
is a hopeless endeavor. Notice that the last component in the widths argument
is rep(1,657). This means that our data set ends with 657 variables that are
one character long. These are the 657 votes that the Senate cast during that term of
Congress, with each variable recording whether each senator voted yea, nay, present,
or did not vote.
With any kind of data file, including fixed width, if the file itself does not have
names of the variables, we can add these in R. (Again, a good codebook is useful
here.) The commands read.table, read.csv, and read.fwf all include an
option called col.names that allows the user to name every variable in the dataset
when reading in the file. In the case of the Senate roll calls, though, it is easier for
us to name the variables afterwards as follows:
colnames(senate.113)[1:9]<-c("congress","icpsr","state.code",
"cd","state.name","party","occupancy","attaining","name")
for(i in 1:657){colnames(senate.113)[i+9]<-paste("RC",i,sep="")}
In this case, we use the colnames command to set the names of the variables. To
the left of the arrow, by specifying [1:9], we indicate that we are only naming the
first nine variables. To the right of the arrow, we use one of the most fundamental
commands in R: the combine command (c), which combines several elements into
a vector. In this case, our vector includes the names of the variables in text. On the
second line, we proceed to name the 657 roll call votes RC1, RC2, . . . , RC657.
To save typing we make this assignment using a for loop, which is described in
more detail in Chap. 11. Within the for loop, we use the paste command, which
simply prints our text ("RC") and the index number i, separated by nothing (hence
the empty quotes at the end). Of course, by default, R assigns generic variable names
(V1, V2, etc.), so a reader who is content to use generic names can skip this step,
if preferred. (Bear in mind, though, that if we name the first nine variables like we
did, the first roll call vote would be named V10 without our applying a new name.)
Turning back to our human rights example, you also can import data from many
other statistical programs. One of the most important libraries in R is the foreign
package, which makes it very easy to bring in data from other statistical packages,
such as SPSS, Stata, and Minitab.3 As an alternative to the text version of the human
rights data, we also could load a Stata-formatted data file, hmnrghts.dta.
Stata files generally have a file extension of dta, which is what the read.dta
command refers to. (Similarly, read.spss will read an SPSS-formatted file with
the .sav file extension.) To open our data in Stata format, we need to download the
file hmnrghts.dta from the Dataverse linked on page vii or the chapter content
linked on page 13. Once we save it to our hard drive, we can either set our working
directory, list the full file path, or use the file.choose() command to access our
data. For example, if we downloaded the file, which is named hmnrghts.dta,
into our C:\temp\ folder, we could open it by typing:
library(foreign)
setwd("C:/temp/")
hmnrghts.2 <- read.dta("hmnrghts.dta")
Any data in Stata format that you select will be converted to R format. One word of
warning: by default, if there are value labels on Stata-formatted data, R will import
the labels as a string-formatted variable. If this is an issue for you, try importing
the data without value labels to save the variables using the numeric codes. See the
beginning of Chap. 7 for an example of the convert.factors=FALSE option.
(One option for data sets that are not excessively large is to load two copies of a
Stata dataset: one with the labels as text to serve as a codebook and another with
numerical codes for analysis.) It is always good to see exactly how the data are
formatted by inspecting the spreadsheet after importing with the tools described in
Sect. 2.2.
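For instance, a sketch of reading the Stata file with the numeric codes rather than the
value labels (the object name hmnrghts.codes is illustrative):
hmnrghts.codes <- read.dta("hmnrghts.dta", convert.factors=FALSE)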
R distinguishes between vectors, lists, data frames, and matrices. Each of these
is an object of a different class in R. Vectors are indexed by length, and matrices
are indexed by rows and columns. Lists are not pervasive to basic analyses, but
are handy for complex storage and are often thought of as generic vectors where
each element can be any class of object. (For example, a list could be a vector of
3. The foreign package is so commonly used that it now downloads with any new R installation.
In the unlikely event, though, that the package will not load with the library command,
simply type install.packages("foreign") in the command prompt to download it.
Alternatively, for users wishing to import data from Excel, two options are present: One is to
save the Excel file in comma-separated values format and then use the read.csv command. The
other is to install the XLConnect library and use the readWorksheetFromFile command.
model results, or a mix of data frames and maps.) A data frame is a matrix that
R designates as a data set. With a data frame, the columns of the matrix can be
referred to as variables. After reading in a data set, R will treat your data as a data
frame, which means that you can refer to any variable within a data frame by adding
$VARIABLENAME to the name of the data frame.4 For example, in our human rights
data we can print out the variable country in order to see which countries are in
the dataset:
hmnrghts$country
Another option for calling variables, though an inadvisable one, is to use the
attach command. R allows the user to load multiple data sets at once (in contrast
to some of the commercially available data analysis programs). The attach
command places one dataset at the forefront and allows the user to call directly the
names of the variables without referring to the name of the data frame. For example:
attach(hmnrghts)
country
With this code, R would recognize country in isolation as part of the attached
dataset and print it just as in the prior example. The problem with this approach is
that R may store objects in memory with the same name as some of the variables in
the dataset. In fact, when recoding data the user should always refer to the data frame
by name, otherwise R confusingly will create a copy of the variable in memory that
is distinct from the copy in the data frame. For this reason, I generally recommend
against attaching data. If, for some circumstance, a user feels attaching a data frame
is unavoidable, then the user can conduct what needs to be done with the attached
data and then use the detach command as soon as possible. This command works
as would be expected, removing the designation of a working data frame and no
longer allowing the user to call variables in isolation:
detach(hmnrghts)
To export data you are using in R to a text file, use the functions write.table or
write.csv. Within the foreign library, write.dta allows the user to write
out a Stata-formatted file. As a simple example, we can generate a matrix with four
observations and six variables, counting from 1 to 24. Then we can write this to a
comma-separated values file, a space-delimited text file, and a Stata file:
x <- matrix(1:24, nrow=4)
write.csv(x, file="sample.csv")
write.table(x, file="sample.txt")
write.dta(as.data.frame(x), file="sample.dta")
4. More technically, data frames are objects of the S3 class. For all S3 objects, attributes of the
object (such as variables) can be called with the dollar sign ($).
2.2 Viewing Attributes of the Data
Once data are input into R, the first task should be to inspect the data and make sure
they have loaded properly. With a relatively small dataset, we can simply print the
whole data frame to the screen:
hmnrghts
Printing the entire dataset, of course, is not recommended or useful with large
datasets. Another option is to look at the names of the variables and the first few
lines of data to see if the data are structured correctly through a few observations.
This is done with the head command:
head(hmnrghts)
For a quick list of the names of the variables in our dataset (which can also be useful
if exact spelling or capitalization of variable names is forgotten) type:
names(hmnrghts)
A route to getting a comprehensive look at the data is to use the fix command:
fix(hmnrghts)
This presents the data in a spreadsheet allowing for a quick view of observations or
variables of interest, as well as a chance to see that the data matrix loaded properly.
An example of this data editor window that fix opens is presented in Fig. 2.1. The
user has the option of editing data within the spreadsheet window that fix creates,
though unless the revised data are written to a new file, there will be no permanent
record of these changes.5 Also, it is key to note that before continuing an R session
5. The View command is similar to fix, but does not allow editing of observations. If you prefer to
only be able to see the data without editing values (perhaps even by accidentally leaning on your
keyboard), then View might be preferable.
with more commands, you must close the data editor window. The console is frozen
as long as the fix window is open.
We will talk more about descriptive statistics in Chap. 4. In the meantime, though,
it can be informative to see some of the basic descriptive statistics (including the
mean, median, minimum, and maximum) as well as a count of the number of
missing observations for each variable:
summary(hmnrghts)
Alternatively, this information can be gleaned for only a single variable, such as
logged population:
summary(hmnrghts$lpop)
As we turn to cleaning data that are loaded in R, an essential toolset is the group of
logical statements. Logical (or Boolean) statements in R are evaluated as to whether
they are TRUE or FALSE. Table 2.1 summarizes the common logical operators in R.
Note that the Boolean statement "is equal to" is designated by two equals signs (==),
whereas a single equals sign (=) instead serves as an assignment operator.
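As a quick illustration of how a few of these operators evaluate (a sketch standing in
for Table 2.1, which is not reproduced here):
5 > 3          # greater than: TRUE
5 >= 5         # greater than or equal to: TRUE
2 == 3         # is equal to: FALSE
2 != 3         # is not equal to: TRUE
TRUE & FALSE   # logical "and": FALSE
TRUE | FALSE   # logical "or": TRUE
!TRUE          # logical "not": FALSE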
To apply some of these Boolean operators from Table 2.1 in practice, suppose, for
example, we wanted to know which countries were in a civil war and had an above
average democracy score in 1993. We could generate a new variable in our working
dataset, which I will call dem.civ (though the user may choose the name). Then
we can view a table of our new variable and list all of the countries that fit these
criteria:
hmnrghts$dem.civ <- as.numeric(hmnrghts$civ_war==1 &
hmnrghts$democ>5.3)
table(hmnrghts$dem.civ)
hmnrghts$country[hmnrghts$dem.civ==1]
On the first line, hmnrghts$dem.civ defines our new variable within the human
rights dataset.6 On the right, we have a two-part Boolean statement: The first asks
whether the country is in a civil war, and the second asks if the country's democracy
score is higher than the average of 5.3. The ampersand (&) requires that both
statements must simultaneously be true for the whole statement to be true. All of
this is embedded within the as.numeric command, which encodes our Boolean
output as a numeric variable. Specifically, all values of TRUE are set to 1 and
FALSE values are set to 0. Such a coding is usually more convenient for modeling
purposes. The next line gives us a table of the relative frequencies of 0s and 1s.
It turns out that only four countries had above-average democracy levels and were
involved in a civil war in 1993. To see which countries, the last line asks R to print
the names of countries, but the square braces following the vector indicate which
observations to print: Only those scoring 1 on this new variable.7
6. Note, though, that any new variables we create, observations we drop, or variables we recode only
change the data in working memory. Hence, our original data file on disk remains unchanged and
therefore safe for recovery. Again, we must use one of the commands from Sect. 2.1.3 if we want
to save a second copy of the data including all of our changes.
7. The output prints the four country names, and four values of NA. This means in four cases, one of
the two component statements was TRUE but the other statement could not be evaluated because
the variable was missing.
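The statements referred to in the next few sentences do not appear above; reconstructed
as a sketch from the description, they would be:
is.na(hmnrghts$democ)          # for each observation, is democ missing?
table(is.na(hmnrghts$democ))   # aggregate those TRUE/FALSE values
is.matrix(hmnrghts)            # is the dataset stored as a matrix?
is.data.frame(hmnrghts)        # is it stored as a data frame?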
The first statement asks for each observation whether the value of democracy is
missing. The table command then aggregates this and informs us that 31 observa-
tions are missing. The next two statements ask whether our dataset hmnrghts
is saved as a matrix, then as a data frame. The is.matrix statement returns
FALSE, indicating that matrix-based commands will not work on our data, and the
is.data.frame statement returns TRUE, which indicates that it is stored as a
data frame. With a sense of logical statements in R, we can now apply these to the
task of cleaning data.
One of the first tasks of data cleaning is deciding how to deal with missing data. R
designates missing values with NA. It translates missing values from other statistics
packages into the NA missing format. However a scholar deals with missing data,
it is important to be mindful of the relative proportion of unobserved values in the
data and what information may be lost. One (somewhat crude) option to deal with
missingness would be to prune the dataset through listwise deletion, or removing
every observation for which a single variable is not recorded. To create a new data
set that prunes in this way, type:
hmnrghts.trim <- na.omit(hmnrghts)
This diminishes the number of observations from 158 to 127, so a tangible amount
of information has been lost.
Most modeling commands in R give users the option of estimating the model
over complete observations only, implementing listwise deletion on the fly. As a
warning, listwise deletion is actually the default in the base commands for linear
and generalized linear models, so data loss can fly under the radar if the user is
not careful. Users with a solid background on regression-based modeling are urged
to consider alternative methods for dealing with missing data that are superior to
listwise deletion. In particular, the mice and Amelia libraries implement the
useful technique of multiple imputation (for more information see King et al. 2001;
Little and Rubin 1987; Rubin 1987).
If, for some reason, the user needs to redesignate missing values as having
some numeric value, the is.na command can be useful. For example, if it were
beneficial to list missing values as -9999, then these could be coded as:
hmnrghts$democ[is.na(hmnrghts$democ)]<- -9999
In other words, all values of democracy for which the value is missing will take
on the value of -9999. Be careful, though, as R and all of its modeling commands
will now regard the formerly missing value as a valid observation and will insert
the misleading value of -9999 into any analysis. This sort of action should only be
taken if it is required for data management, a special kind of model where strange
values can be dummied-out, or the rare case where a missing observation actually
can take on a meaningful value (e.g., a budget dataset where missing items represent
a $0 expenditure).
In many cases, it is convenient to subset our data. This may mean that we only want
observations of a certain type, or it may mean we wish to winnow-down the number
of variables in the data frame, perhaps because the data include many variables that
are of no interest. If, in our human rights data, we only wanted to focus on countries
that had a democracy score from 6 to 10, we could call this subset dem.rights and
create it as follows:
dem.rights <- subset(hmnrghts, subset=democ>5)
This creates a 73 observation subset of our original data. Note that observations
with a missing (NA) value of democ will not be included in the subset. Missing
observations also would be excluded if we made a greater than or equal to
statement.8
As an example of variable selection, if we wanted to focus only on democracy
and wealth, we could keep only these two variables and an index for all observations:
dem.wealth<-subset(hmnrghts,select=c(country, democ, gnpcats))
Additionally, users have the option of calling both the subset and select
options if they wish to choose a subset of variables over a specific set of
observations.
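A sketch combining both options (the object name dem.wealth.sub is illustrative):
dem.wealth.sub <- subset(hmnrghts, subset=democ>5,
   select=c(country, democ, gnpcats))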
8. This contrasts from programs like Stata, which treat missing values as positive infinity. In Stata,
whether missing observations are included depends on the kind of Boolean statement being made.
R is more consistent in that missing cases are always excluded.
A final aspect of data cleaning that often arises is the need to recode variables. This
may emerge because the functional form of a model requires a transformation of a
variable, such as a logarithm or square. Alternatively, some of the values of the data
may be misleading and thereby need to be recoded as missing or another value. Yet
another possibility is that the variables from two datasets need to be coded on the
same scale: For instance, if an analyst fits a model with survey data and then makes
forecasts using Census data, then the survey and Census variables need to be coded
the same way.
For mathematical transformations of variables, the syntax is straightforward and
follows the form of the example below. Suppose we want the actual population of
each country instead of its logarithm:
hmnrghts$pop <- exp(hmnrghts$lpop)
Quite simply, we are applying the exponential function (exp) to a logged value to
recover the original value. Yet any type of mathematical operator could be substituted in for exp. A variable could be squared (^2), logged (log()), have the square root taken (sqrt()), etc. Addition, subtraction, multiplication, and division are also valid, either with a scalar of interest or with another variable.
Suppose we wanted to create an ordinal variable coded 2 if a country was in both a
civil war and an international war, 1 if it was involved in either, and 0 if it was not
involved in any wars. We could create this by adding the civil war and international
war variables:
hmnrghts$war.ord<-hmnrghts$civ_war+hmnrghts$int_war
A quick table of our new variable, however, reveals that no nations had both kinds of conflict going on in 1993.
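The quick table mentioned here could be produced as follows (output omitted):
table(hmnrghts$war.ord)   # frequency of each value of the new ordinal variable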
Another common issue to address is when data are presented in an undesirable
format. Our variable gnpcats is actually coded as a text variable. However, we
may wish to recode this as a numeric ordinal variable. There are two means of
accomplishing this. The first, though taking several lines of code, can be completed
quickly with a fair amount of copy-and-paste:
hmnrghts$gnp.ord <- NA
hmnrghts$gnp.ord[hmnrghts$gnpcats=="<1000"]<-1
hmnrghts$gnp.ord[hmnrghts$gnpcats=="1000-1999"]<-2
hmnrghts$gnp.ord[hmnrghts$gnpcats=="2000-2999"]<-3
hmnrghts$gnp.ord[hmnrghts$gnpcats=="3000-3999"]<-4
hmnrghts$gnp.ord[hmnrghts$gnpcats==">4000"]<-5
Here, a blank variable was created, and then the values of the new variable filled-in
contingent on the values of the old using Boolean statements.
A second option for recoding the GNP data can be accomplished through John Fox's Companion to Applied Regression (car) library. As a user-written library, we
must download and install it before the first use. The installation of a library is
straightforward. First, type:
install.packages("car")
Once the library is installed (again, a step which need not be repeated unless R is
reinstalled), the following lines will generate our recoded per capita GNP measure:
library(car)
hmnrghts$gnp.ord.2<-recode(hmnrghts$gnpcats,'"<1000"=1;
"1000-1999"=2;"2000-2999"=3;"3000-3999"=4;">4000"=5')
Be careful, as the recode command is delicate. Between the apostrophes, all of the reassignments from old values to new are defined, separated by semicolons. A single space between the apostrophes will generate an error. Despite this,
recode can save users substantial time on data cleaning. The basic syntax of
recode, of course, could be used to create dummy variables, ordinal variables,
or a variety of other recoded variables. So now two methods have created a new
variable, each coded 1 to 5, with 5 representing the highest per capita GNP.
Another standard type of recoding we might want to do is to create a dummy
variable that is coded as 1 if the observation meets certain conditions and 0
otherwise. For example, suppose instead of having categories of GNP, we just want
to compare the highest category of GNP to all the others:
hmnrghts$gnp.dummy<-as.numeric(hmnrghts$gnpcats==">4000")
As with our earlier example of finding democracies involved in a civil war, here we
use a logical statement and modify it with the as.numeric statement, which turns
each TRUE into a 1 and each FALSE into a 0.
Categorical variables in R can be given a special designation as factors. If you
designate a categorical variable as a factor, R will treat it as such in statistical operations and create dummy variables for each level when it is used in a regression.
If you import a variable with no numeric coding, R will automatically call the
variable a character vector, and convert the character vector into a factor in most
analysis commands. If we prefer, though, we can designate that a variable is a factor
ahead of time and open up a variety of useful commands. For example, we can
designate country as a factor:
hmnrghts$country <- as.factor(hmnrghts$country)
levels(hmnrghts$country)
Notice that R allows the user to put the same quantity (in this case, the variable
country) on both sides of an assignment operator. This recursive assignment takes
the old values of a quantity, makes the right-hand side change, and then replaces the
new values into the same place in memory. The levels command reveals to us the
different recorded values of the factor.
To change which level is the first level (e.g., to change which category R will
use as the reference category in a regression) use the relevel command. The
following code sets united states as the reference category for country:
hmnrghts$country<-relevel(hmnrghts$country,"united states")
levels(hmnrghts$country)
Now when we view the levels of the factor, united states is listed as the first level,
and the first level is always our reference group.
Two final tasks that are common to data management are merging data sets and
reshaping panel data. As we consider examples of these two tasks, let us consider an
update of Poe et al.'s (1999) data: Specifically, Gibney et al. (2013) have continued
to code data for the Political Terror Scale. We will use the 1994 and 1995 waves of
the updated data. The variables in these waves are:
Country: A character variable listing the country by name.
COWAlpha: Three-character country abbreviation from the Correlates of War
dataset.
COW: Numeric country identification variable from the Correlates of War dataset.
WorldBank: Three-character country abbreviation used by the World Bank.
Amnesty.1994/Amnesty.1995: Amnesty International's scale of political terror.
Scores range from 1 (low state terrorism, fewest violations of personal integrity)
to 5 (highest violations of personal integrity).
StateDept.1994/StateDept.1995: The U.S. State Department scale of political
terror. Scores range from 1 (low state terrorism, fewest violations of personal
integrity) to 5 (highest violations of personal integrity).
For the last two variables, the name of the variable depends on which wave of data
is being studied, with the suffix indicating the year. Notice that these data have four
identification variables: This is designed explicitly to make these data easier for
researchers to use. Each index makes it easy for a researcher to link these political
terror measures to information provided by either the World Bank or the Correlates
of War dataset. This should show how ubiquitous the act of merging data is to
Political Science research.
To this end, let us practice merging data. In general, merging data is useful
when the analyst has two separate data frames that contain information about the
same observations. For example, if a Political Scientist had one data frame with
economic data by country and a second data frame containing election returns by
country, the scholar might want to merge the two data sets to link economic and
political factors within each country. In our case, suppose we simply wanted to
link each countrys political terror score from 1994 to its political terror score from
1995. First, download the data sets pts1994.csv and pts1995.csv from the
Dataverse on page vii or the chapter content link on page 13. As before, you may
need to use setwd to point R to the folder where you have saved the data. After
this, run the following code to load the relevant data:
hmnrghts.94<-read.csv("pts1994.csv")
hmnrghts.95<-read.csv("pts1995.csv")
These data are comma separated, so read.csv is the best command in this case.
If we wanted to take a look at the first few observations of our 1994 wave, we could type head(hmnrghts.94), which prints the first six rows of the data frame.
To combine our 1994 and 1995 data, we now turn to the merge command.9
We type:
hmnrghts.wide<-merge(x=hmnrghts.94,y=hmnrghts.95,by=c("COW"))
Within this command, the option x refers to one dataset, while y is the other. Next
to the by option, we name an identification variable that uniquely identifies each
observation. The by command actually allows users to name multiple variables
if several are needed to uniquely identify each observation: For example, if a
researcher was merging data where the unit of analysis was a country-year, a country
variable and a year variable might be essential to identify each row uniquely. In such
a case the syntax might read, by=c("COW","year"). As yet another option,
if the two datasets had the same index, but the variables were named differently, R allows syntax such as by.x=c("COW"), by.y=c("cowCode"), which conveys that the differently named index variables are the same.
Once we have merged our data, we can preview the finished product by typing
head(hmnrghts.wide). This prints:
COW Country COWAlpha WorldBank Amnesty.1994
1 2 United States USA USA 1
2 20 Canada CAN CAN 1
3 31 Bahamas BHM BHS 1
4 40 Cuba CUB CUB 3
5 41 Haiti HAI HTI 5
6 42 Dominican Republic DOM DOM 2
StateDept.1994 Amnesty.1995 StateDept.1995
1 NA 1 NA
2 1 NA 1
3 2 1 1
4 3 4 3
5 4 2 3
6 2 2 2
As we can see, the 1994 and 1995 scores for Amnesty and StateDept are recorded
in one place for each country. Hence, our merge was successful. By default, R
excludes any observation from either dataset that does not have a linked observation
(e.g., equivalent value) from the other data set. So if you use the defaults and the new
dataset includes the same number of rows as the two old datasets, then all observations
were linked and included. For instance, we could type:
dim(hmnrghts.94); dim(hmnrghts.95); dim(hmnrghts.wide)
This would quickly tell us that we have 179 observations in both of the inputs, as well
as the output dataset, showing we did not lose any observations. Other options within
merge are all.x, all.y, and all, which allow you to specify whether to force the inclusion of all observations from the dataset x, the dataset y, and from either dataset, respectively. In this case, R would encode NA values for observations that did not have a linked case in the other dataset.

9 Besides the merge command, the dplyr package offers several data-joining commands that you also may find of use, depending on your needs.
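As an illustration of the all.x, all.y, and all options, a minimal sketch (not part of the original example) that keeps every 1994 observation even when it lacks a 1995 match:
hmnrghts.94.all<-merge(x=hmnrghts.94,y=hmnrghts.95,
     by=c("COW"),all.x=TRUE)   # unmatched 1994 rows get NA for 1995 variables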
As a final point of data management, sometimes we need to reshape our data. In
the case of our merged data set, hmnrghts.wide, we have created a panel data set
(e.g., a data set consisting of repeated observations of the same individuals) that is in
wide format. Wide format means that each row in our data defines an individual of
study (a country) while our repeated observations are stored in separate variables (e.g.,
Amnesty.1994 and Amnesty.1995 record Amnesty International scores for two
separate years). In most models of panel data, we need our data to be in long format,
or stacked format. Long format means that we need two index variables to identify
each row, one for the individual (e.g., country) and one for time of observation (e.g.,
year). Meanwhile, each variable (e.g., Amnesty) will only use one column. R allows
us to reshape our data from wide to long, or from long to wide. Hence, whatever the
format of our data, we can reshape it to our needs.
To reshape our political terror data from wide format to long, we use the reshape
command:
hmnrghts.long<-reshape(hmnrghts.wide,varying=c("Amnesty.1994",
"StateDept.1994","Amnesty.1995","StateDept.1995"),
timevar="year",idvar="COW",direction="long",sep=".")
Within the command, the first argument is the name of the data frame we wish
to reshape. The varying term lists all of the variables that represent repeated
observations over time. Tip: Be sure that repeated observations of the same variable
have the same prefix name (e.g., Amnesty or StateDept) and then the suffix
(e.g., 1994 or 1995) consistently reports time. The timevar term allows us to
specify the name of our new time index, which we call year. The idvar term
lists the variable that uniquely identifies individuals (countries, in our case). With
direction we specify that we want to convert our data into long format. Lastly,
the sep command offers R a cue of what character separates our prefixes and suffixes
in the repeated observation variables: Since a period (.) separates these terms in each
of our Amnesty and StateDept variables, we denote that here.
A preview of the result can be seen by typing head(hmnrghts.long). This
prints:
COW Country COWAlpha WorldBank year
2.1994 2 United States USA USA 1994
20.1994 20 Canada CAN CAN 1994
31.1994 31 Bahamas BHM BHS 1994
40.1994 40 Cuba CUB CUB 1994
41.1994 41 Haiti HAI HTI 1994
42.1994 42 Dominican Republic DOM DOM 1994
Amnesty StateDept
2.1994 1 NA
20.1994 1 1
31.1994 1 2
40.1994 3 3
41.1994 5 4
42.1994 2 2
Notice that we now have only one variable for Amnesty and one for
StateDept. We now have a new variable named year, so between COW
and year, each row uniquely identifies each country-year. Since the data
are naturally sorted, the top of our data only show 1994 observations.
Typing head(hmnrghts.long[hmnrghts.long$year==1995,]) shows
us the first few 1995 observations:
COW Country COWAlpha WorldBank year
2.1995 2 United States USA USA 1995
20.1995 20 Canada CAN CAN 1995
31.1995 31 Bahamas BHM BHS 1995
40.1995 40 Cuba CUB CUB 1995
41.1995 41 Haiti HAI HTI 1995
42.1995 42 Dominican Republic DOM DOM 1995
Amnesty StateDept
2.1995 1 NA
20.1995 NA 1
31.1995 1 1
40.1995 4 3
41.1995 2 3
42.1995 2 2
As we can see, all of the information is preserved, now in long (or stacked) format.
As a final illustration, suppose we had started with a data set that was in long format
and wanted one in wide format. To try this, we will reshape hmnrghts.long and
try to recreate our original wide data set. To do this, we type:
hmnrghts.wide.2<-reshape(hmnrghts.long,
v.names=c("Amnesty","StateDept"),
timevar="year",idvar="COW",direction="wide",sep=".")
A few options have now changed: We now use the v.names command to indicate
the variables that include repeated observations. The timevar parameter now needs
to be a variable within the dataset, just as idvar is, in order to separate individuals
from repeated time points. Our direction term is now wide because we want
to convert these data into wide format. Lastly, the sep command specifies the
character that R will use to separate prefixes from suffixes in the final form. By
typing head(hmnrghts.wide.2) into the console, you will now see that this
new dataset recreates the original wide dataset.
This chapter has covered the variety of means of importing and exporting data in
R. It also has discussed data management issues such as missing values, subsetting,
recoding data, merging data, and reshaping data. With the capacity to clean and
manage data, we now are ready to start analyzing our data. We next proceed to data
visualization.
2.6 Practice Problems
As a practice dataset, we will download and open a subset of the 2004 American
National Election Study used by Hanmer and Kalkan (2013). This dataset is named
hanmerKalkanANES.dta, and it is available from the Dataverse referenced on
page vii or in the chapter content link on page 13. These data are in Stata format, so
be sure to load the correct library and use the correct command when opening. (Hint:
When using the proper command, be sure to specify the convert.factors=F
option within it to get an easier-to-read output.) The variables in this dataset all
relate to the 2004 U.S. presidential election, and they are: a respondent identification
number (caseid), retrospective economic evaluations (retecon), assessment of George W. Bush's handling of the war in Iraq (bushiraq), an indicator for whether the
respondent voted for Bush (presvote), partisanship on a seven-point scale (partyid),
ideology on a seven-point scale (ideol7b), an indicator of whether the respondent is
white (white), an indicator of whether the respondent is female (female), age of the
respondent (age), level of education on a seven-point scale (educ1_7), and income on
a 23-point scale (income). (The variable exptrnout2 can be ignored.)
1. Once you have loaded the data, do the following to check your work:
(a) If you ask R to return the variable names, what does the list say? Is it correct?
(b) Using the head command, what do the first few lines look like?
(c) If you use the fix command, do the data look like an appropriate spreadsheet?
2. Use the summary command on the whole data set. What can you learn immedi-
ately? How many missing observations do you have?
3. Try subsetting the data in a few ways:
(a) Create a copy of the dataset that removes all missing observations with listwise
deletion. How many observations remain in this version?
(b) Create a second copy that only includes the respondent identification number,
retrospective economic evaluations, and evaluation of Bush's handling of Iraq.
4. Create a few new variables:
(a) The seven-point partisanship scale (partyid) is coded as follows: 0 = Strong
Democrat, 1 = Weak Democrat, 2 = Independent Leaning Democrat, 3 =
Independent No Leaning, 4 = Independent Leaning Republican, 5 = Weak
Republican, and 6 = Strong Republican. Create two new indicator variables.
The first should be coded 1 if the person identifies as Democrat in any way
(including independents who lean Democratic), and 0 otherwise. The second
new variable should be coded 1 if the person identifies as Republican in any
way (including independents who lean Republican), and 0 otherwise. For each
of these two new variables, what does the summary command return for
them?
(b) Create a new variable that is the squared value of the respondents age in years.
What does the summary command return for this new variable?
(c) Create a new version of the income variable that has only four categories. The first category should include all values of income that range from 1 to 12, the second from 13 to 17, the third from 18 to 20, and the last from 21 to 23.
Use the table command to see the frequency of each category.
(d) Bonus: Use the table command to compare the 23-category version of
income to the four-category version of income. Did you code the new version
correctly?
Chapter 3
Visualizing Data
Visually presenting data and the results of models has become a centerpiece
of modern political analysis. Many of Political Science's top journals, including
the American Journal of Political Science, now ask for figures in lieu of tables
whenever both can convey the same information. In fact, Kastellec and Leoni
(2007) make the case that figures convey empirical results better than tables.
Cleveland (1993) and Tufte (2001) wrote two of the leading volumes that describe
the elements of good quantitative visualization, and Yau (2011) has produced a
more recent take on graphing. Essentially these works serve as style manuals for
graphics.1 Beyond the suggestions these scholars offer for the sake of readers, viewing one's own data visually conveys substantial information about the data's univariate, bivariate, and multivariate features: Does a variable appear skewed?
Do two variables substantively appear to correlate? What is the proper functional
relationship between variables? How does a variable change over space or time?
Answering these questions for oneself as an analyst and for the reader generally can
raise the quality of analysis presented to the discipline.
On the edge of this graphical movement in quantitative analysis, R offers
state-of-the-art data and model visualization. Many of the commercial statistical programs have tried for years to catch up to R's graphical capacities. This chapter
showcases these capabilities, turning first to the plot function that is automatically
available as part of the base package. Second, we discuss some of the other graph-
ing commands offered in the base library. Finally, we turn to the lattice library, which allows the user to create Trellis Graphics, a framework for visualization.

2 Nebraska and North Carolina are each missing observations of the Ranney index.
3.1 Univariate Graphs in the base Package
As a first look at our data, displaying a single variable graphically can convey a sense
of the distribution of the data, including its mode, dispersion, skew, and kurtosis. The
lattice library actually offers a few more commands for univariate visualization
than base does, but we start with the major built-in univariate commands. Most
graphing commands in the base package call the plot function, but hist and
boxplot are noteworthy exceptions.
The hist command is useful to simply gain an idea of the relative frequency
of several common values. We start by loading our data on energy policy television
news coverage. Then we create a histogram of this time series of monthly story
counts with the hist command. First, download Peake and Eshbaugh-Soha's data
on energy policy coverage, the file named PESenergy.csv. The file is available
from the Dataverse named on page vii or the chapter content link on page 33. You
may need to use setwd to point R to the folder where you have saved the data.
After this, run the following code:
pres.energy<-read.csv("PESenergy.csv")
hist(pres.energy$Energy,xlab="Television Stories",main="")
abline(h=0,col="gray60")
box()
The result this code produces is presented in Fig. 3.1. In this code, we begin by reading Peake and Eshbaugh-Soha's (2008) data. The data file itself is a comma-
separated values file with a header row of variable names, so the defaults of
read.csv suit our purposes. Once the data are loaded, we plot a histogram of
our variable of interest using the hist command: pres.energy$Energy calls
the variable of interest from its data frame. We use the xlab option, which allows us
to define the label R prints on the horizontal axis. Since this axis shows us the values of the variable, we simply wish to see the phrase "Television Stories," describing in brief what these numbers mean. The main option defines a title printed over the top
of the figure. In this case, the only way to impose a blank title is to include quotes
with no content between them. A neat feature of plotting in the base package is
Fig. 3.1 Histogram of the monthly count of television news stories related to energy
that a few commands can add additional information to a plot that has already been
drawn. The abline command is a flexible and useful tool. (The name a-b line
refers to the linear formula y = a + bx. Hence, this command can draw lines with a
slope and intercept, or it can draw a horizontal or vertical line.) In this case, abline
adds a horizontal line along the 0 point on the vertical axis, hence h=0. This is added
to clarify where the base of the bars in the figure is. Finally, the box() command
encloses the whole figure in a box, often useful in printed articles for clarifying
where graphing space ends and other white space begins. As the histogram shows,
there is a strong concentration of observations at and just above 0, and a clear
positive skew to the distribution. (In fact, these data are reanalyzed in Fogarty and
Monogan (2014) precisely to address some of these data features and discuss useful
means of analyzing time-dependent media counts.)
Another univariate graph is a box-and-whisker plot. R allows us to obtain this
solely for the single variable, or for a subset of the variable based on some other
available measure. First drawing this for a single variable:
boxplot(pres.energy$Energy,ylab="Television Stories")
The result of this is presented in panel (a) of Fig. 3.2. In this case, the values of the
monthly counts are on the vertical axis; hence, we use the ylab option to label the
vertical axis (or y-axis label) appropriately. In the figure, the bottom of the box
represents the first quartile value (25th percentile), the large solid line inside the box
represents the median value (second quartile, 50th percentile), and the top of the
box represents the third quartile value (75th percentile). The whiskers, by default,
extend to the lowest and highest values of the variable that are no more than 1.5
times the interquartile range (or difference between third and first quartiles) away
from the box. The purpose of the whiskers is to convey the range over which the
bulk of the data fall. Data falling outside of this range are portrayed as dots at their
respective values. This boxplot fits our conclusion from the histogram: small values
including 0 are common, and the data have a positive skew.
Box-and-whisker plots also can serve to offer a sense of the conditional
distribution of a variable. For our time series of energy policy coverage, the first
major event we observe is Nixon's November 1973 speech on the subject. Hence,
we might create a simple indicator where the first 58 months of the series (through
October 1973) are coded 0, and the remaining 122 months of the series (November
1973 onward) are coded 1. Once we do this, the boxplot command allows us to
condition on a variable:
pres.energy$post.nixon<-c(rep(0,58),rep(1,122))
boxplot(pres.energy$Energy~pres.energy$post.nixon,
axes=F,ylab="Television Stories")
axis(1,at=c(1,2),labels=c("Before Nov. 1973",
"After Nov. 1973"))
axis(2)
box()
This output is presented in panel (b) of Fig. 3.2. The first line of code defines our
pre v. post November 1973 variable. Notice here that we again define a vector
with c. Within c, we use the rep command (for repeat). So rep(0,58) produces
58 zeroes, and rep(1,122) produces 122 ones. The second line draws our
boxplots, but we add two important caveats relative to our last call to boxplot:
First, we list pres.energy$Energy~pres.energy$post.nixon as our
data argument. The argument before the tilde (~) is the variable for which we want
the distribution, and the argument afterward is the conditioning variable. Second,
we add the axes=F command. (We also could write axes=FALSE, but R accepts
F as an abbreviation.) This gives us more control over how the horizontal and
Fig. 3.2 Box-and-whisker plots of the distribution of the monthly count of television news stories
related to energy. Panel (a) shows the complete distribution, and panel (b) shows the distributions
for the subsets before and after November 1973
vertical axes are presented. In the subsequent command, we add axis 1 (the bottom
horizontal axis), adding text labels at the tick marks of 1 and 2 to describe the values
of the conditioning variable. Afterward, we add axis 2 (the left vertical axis), and a
box around the whole figure. Panel (b) of Fig. 3.2 shows that the distribution before
and after this date is fundamentally different. Much smaller values persist before
Nixon's speech, while there is a larger mean and a greater spread in values afterward. Of course, this is only a first look, and the effect of Nixon's speech is confounded with a variety of factors, such as the price of oil, presidential approval, and the unemployment rate, that contribute to this difference.
Bar graphs can be useful whenever we wish to illustrate the value some statistic
takes for a variety of groups as well as for visualizing the relative proportions of
nominal or ordinally measured data. For an example of barplots, we turn now to
the other example data set from this chapter, on health lobbying in the 50 American
states. Lowery et al. offer a bar graph of the means across all states of the lobbying participation rate, or the number of lobbyists as a percentage of the number of firms, for all health lobbyists and for seven subgroups of health lobbyists (2008, Fig. 3).
We can recreate that figure in R by taking the means of these eight variables and
then applying the barplot function to the set of means. First we must load
the data. To do this, download Lowery et al.'s data on lobbying, the file named
constructionData.dta. The file is available from the Dataverse named on
page vii or the chapter content link on page 33. Again, you may need to use setwd
to point R to the folder where you have saved the data. Since these data are in Stata
format, we must use the foreign library and then the read.dta command:
library(foreign)
health.fin<-read.dta("constructionData.dta")
To create the actual figure itself, we can create a subset of our data that only
includes the eight predictors of interest and then use the apply function to obtain
the mean of each variable.
part.rates<-subset(health.fin,select=c(
partratetotalhealth,partratedpc,
partratepharmprod,partrateprofessionals,partrateadvo,
partratebusness,partrategov,rnmedschoolpartrate))
lobby.means<-apply(part.rates,2,mean)
names(lobby.means)<-c("Total Health Care",
"Direct Patient Care","Drugs/Health Products",
"Health Professionals","Health Advocacy",
"Health Finance","Local Government","Health Education")
In this case, part.rates is our subsetted data frame that only includes the eight
lobby participation rates of interest. On the last line, the apply command allows
us to take a matrix or data frame (part.rates) and apply a function of interest
(mean) to either the rows or the columns of the data frame. We want the mean of
each variable, and the columns of our data set represent the variables. The 2 that is
the second component of this command therefore tells apply that we want to apply
mean to the columns of our data. (By contrast, an argument of 1 would apply to the
rows. Row-based computations would be handy if we needed to compute some new
quantity for each of the 50 states.) If we simply type lobby.means into the R
console now, it will print the eight means of interest for us. To set up our figure
in advance, we can attach an English-language name to each quantity that will be reported in our figure's margin. We do this with the names command, and then
assign a vector with a name for each quantity.
To actually draw our bar graph, we use the following code:
par(mar=c(5.1, 10 ,4.1 ,2.1))
barplot(lobby.means,xlab="Percent Lobby Registration",
xlim=c(0,26),horiz=T,cex.names=.8,las=1)
text(x=lobby.means,y=c(.75,1.75,3,4.25,5.5,6.75,8,9),
labels=paste(round(lobby.means,2)),pos=4)
box()
The results are plotted in Fig. 3.3. The first line calls the par command, which
allows the user to change a wide array of defaults in the graphing space. In our
Fig. 3.3 Bar graph of the mean lobbying participation rate in health care and in seven sub-guilds
across the 50 U.S. states, 1997
case, we need a bigger left margin, so we used the mar option to change this,
setting the second value to the relatively large value of 10. (In general, the margins
are listed as bottom, left, top, then right.) Anything adjusted with par is reset to
the defaults after the plotting window (or device, if writing directly to a file) is
closed. Next, we actually use the barplot command. The main argument is
lobby.means, which is the vector of variable means. The default for barplot is
to draw a graph with vertical lines. In this case, though, we set the option horiz=T
to get horizontal bars. We also use the options cex.names (character expansion
for axis names) and las=1 (label axis style) to shrink our bar labels to 80 % of their
default size and force them to print horizontally, respectively.3 The xlab command
allows us to describe the variable for which we are showing the means, and the
xlim (x-axis limits) command allows us to set the space of our horizontal axis.
Finally, we use the text command to print the mean for each lobby registration rate
at the end of the bar. The text command is useful any time we wish to add text to
a graph, be these numeric values or text labels. This command takes x coordinates
for its position along the horizontal axis, y coordinates for its position along the
vertical axis, and labels values for the text to print at each spot. The pos=4
option specifies to print the text to the right of the given point (alternatively 1, 2,
and 3 would specify below, left, and above, respectively), so that our text does not
overlap with the bar.
3.2 The plot Function

We turn now to plot, the workhorse graphical function in the base package. The
plot command lends itself naturally to bivariate plots. To see the total sum of
arguments that one can call using plot, type args(plot.default), which
returns the following:
function (x, y=NULL, type="p", xlim=NULL, ylim=NULL,
log="", main=NULL, sub=NULL, xlab=NULL, ylab=NULL,
ann=par("ann"), axes=TRUE, frame.plot=axes,
panel.first=NULL, panel.last=NULL, asp=NA, ...)
Obviously there is a lot going on underneath the generic plot function. For the
purpose of getting started with figure creation in R we want to ask what is essential.
The answer is straightforward: one variable x must be specified. Everything else
has either a default value or is not essential. To start experimenting with plot, we
continue to use the 1997 state health lobbying data loaded in Sect. 3.1.1.
3 The default las value is 0, which prints labels parallel to the axis. 1, our choice here, prints them horizontally. 2 prints perpendicular to the axis, and 3 prints them vertically.

Fig. 3.4 Lobby participation rate of the health finance industry alone and against the number of health finance business establishments. (a) Index. (b) Number of Health Establishments

With plot, we can plot the variables separately with the command plot(varname), though this is definitively less informative than the kinds of graphs just presented in Sect. 3.1. That said, if we simply wanted to see all of the
observed values of the lobby participation rate by state of health finance firms
(partratebusness), we simply type:
plot(health.fin$partratebusness,
ylab="Lobby Participation Rate")
Figure 3.4a is returned in the R graphics interface. Note that this figure plots
the lobby participation rate against the row number in the data frame: With
cross-sectional data this index is essentially meaningless. By contrast, if we were
studying time series data, and the data were sorted on time, then we could
observe how the series evolves over time. Note that we use the ylab option
because otherwise the default will label our vertical axis with the tacky-looking
health.fin$partratebusness. (Try it, and ask yourself what a journal
editor would think of how the output looks.)
Of course, we are more often interested in bivariate relationships. We can
explore these easily by incorporating a variable x on the horizontal axis (usually
an independent variable) and a variable y on the vertical axis (usually a dependent
variable) in the call to plot:
plot(y=health.fin$partratebusness,x=health.fin$supplybusiness,
ylab="Lobby Participation Rate",
xlab="Number of Health Establishments")
This produces Fig. 3.4b, where our horizontal axis is defined by the number of health
finance firms in a state, and the vertical axis is defined by the lobby participation rate
of these firms in the respective state. This graph shows what appears to be a decrease
in the participation rate as the number of firms rises, perhaps in a curvilinear
relationship.
One useful tool is to plot the functional form of a bivariate model onto the scatterplot of the two variables. In the case of Fig. 3.4b, we may want to compare how well a linear and a quadratic functional form fit the data.
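The model-fitting code itself is not reproduced in this excerpt; a minimal sketch consistent with the object name finance.linear used below (the exact call is an assumption) is:
finance.linear<-lm(partratebusness~supplybusiness,
     data=health.fin)   # bivariate linear regression of participation on firm count
summary(finance.linear)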
The lm (linear model) command fits our models, and the summary command
summarizes our results. Again, details of lm will be discussed in Chap. 6. With
the model that is a linear function of number of firms, we can simply feed the name
of our fitted model (finance.linear) into the command abline in order to
add our fitted regression line to the plot:
plot(y=health.fin$partratebusness,x=health.fin$supplybusiness,
ylab="Lobby Participation Rate",
xlab="Number of Health Establishments")
abline(finance.linear)
Fig. 3.5 Lobby participation rate of the health finance industry against the number of health
establishments, linear and quadratic models. (a) Linear function. (b) Quadratic function
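The code for the quadratic version likewise is not shown here; a sketch consistent with the description in the next paragraph (object names other than quad.fit are assumptions) is:
plot(y=health.fin$partratebusness,x=health.fin$supplybusiness,
     ylab="Lobby Participation Rate",
     xlab="Number of Health Establishments")
finance.quadratic<-lm(partratebusness~supplybusiness+
     I(supplybusiness^2),data=health.fin)        # quadratic functional form
health.fin$quad.fit<-finance.quadratic$fitted.values
health.fin<-health.fin[order(health.fin$supplybusiness),]   # sort by the input variable
lines(y=health.fin$quad.fit,x=health.fin$supplybusiness)    # add the fitted curve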
This outcome is presented in Fig. 3.5b. While we will not get into lm's details yet, notice that I(supplybusiness^2) is used as a predictor. I means "as is," so it allows us to compute a mathematical formula on the fly. After redrawing our original
scatterplot, we estimate our quadratic model and save the fitted values to our data
frame as the variable quad.fit. On the fourth line, we reorder our data frame
health.fin according to the values of our input variable supplybusiness.
This is done by using the order command, which lists vector indices in order
of increasing value. Finally, the lines command takes our predicted values as
the vertical coordinates (y) and our values of the number of firms as the horizontal
coordinates (x). This adds the line to the plot showing our quadratic functional form.
So far, our analyses have relied on the plot default of drawing a scatterplot. In
time series analysis, though, a line plot over time is often useful for observing the
properties of the series and how it changes over time. (Further information on this
is available in Chap. 9.) Returning to the data on television news coverage of energy
policy first raised in Sect. 3.1, let us visualize the outcome of energy policy coverage
and an input of oil price.
Starting with number of energy stories by month, we create this plot as follows:
plot(x=pres.energy$Energy,type="l",axes=F,
xlab="Month", ylab="Television Stories on Energy")
axis(1,at=c(1,37,73,109,145),labels=c("Jan. 1969",
"Jan. 1972","Jan. 1975","Jan. 1978","Jan. 1981"),
cex.axis=.7)
axis(2)
abline(h=0,col="gray60")
box()
This produces Fig. 3.6a. In this case, our data are already sorted by month, so if we only specify x with no y, R will show all of the values in correct temporal order.4
Fig. 3.6 Number of television news stories on energy policy and the price of oil per barrel,
respectively, by month. (a) News coverage. (b) Oil price
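The code for the oil-price panel does not appear in this excerpt, and the name of the price variable is not given here; assuming it were called oil.price (a hypothetical name), the panel could be drawn the same way as the energy series:
plot(x=pres.energy$oil.price,type="l",axes=F,   # oil.price is a hypothetical name
     xlab="Month",ylab="Cost of Oil")
axis(1,at=c(1,37,73,109,145),labels=c("Jan. 1969",
     "Jan. 1972","Jan. 1975","Jan. 1978","Jan. 1981"),
     cex.axis=.7)
axis(2)
box()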
Again, the data are sorted, so only one variable is necessary. Figure 3.6b presents
this graph.
4 Alternatively, though, if a user had some time index in the data frame, a similar plot could be produced by typing something to the effect of: pres.energy$Time<-1:180; plot(y=pres.energy$Energy,x=pres.energy$Time,type="l").
Having tried our hand with plots from the base package, we will now itemize in
detail the basic functions and options that bring considerable flexibility to creating
figures in R. Bear in mind that R actually offers the useful option of beginning with
a blank slate and adding items to the graph bit-by-bit.
The Coordinate System: In Fig. 3.4, we were not worried about establishing the
coordinate system because the data effectively did this for us. But often, you
will want to establish the dimensions of the figure before plotting anything
especially if you are building up from the blank canvas. The most important point
here is that your x and y must be of the same length. This is perhaps obvious,
but missing data can create difficulties that will lead R to balk.
Plot Types: We now want to plot these series, but the plot function allows for
different types of plots. The different types that one can include within the
generic plot function include:
type="p" This is the default and it plots the x and y coordinates as points.
type="l" This plots the x and y coordinates as lines.
type="n" This plots the x and y coordinates as nothing (it sets up the
coordinate space only).
type="o" This plots the x and y coordinates as points and lines overlaid
(i.e., it overplots).
type="h" This plots the x and y coordinates as histogram-like vertical lines.
(Also called a spike plot.)
type="s" This plots the x and y coordinates as stair-step like lines.
Axes: It is possible to turn off the axes, to adjust the coordinate space by using the
xlim and ylim options, and to create your own labels for the axes.
axes= Allows you to control whether the axes appear in the figure or not. If
you have strong preferences about how your axes are created, you may turn
them off by selecting axes=F within plot and then create your own labels
using the separate axis command:
axis(side=1,at=c(2,4,6,8,10,12),labels=c("Feb",
"Apr","June","Aug","Oct","Dec"))
xlim=, ylim= For example, if we wanted to expand the space from the R
default, we could enter:
plot(x=ind.var, y=dep.var, type="o", xlim=c(-5,
17),ylim=c(-5, 15))
xlab="", ylab="" Creates labels for the x- and y-axis.
Style: There are a number of options to adjust the style in the figure, including
changes in the line type, line weight, color, point style, and more. Some common
commands include:
asp= Defines the aspect ratio of the plot. Setting asp=1 is a powerful and
useful option that allows the user to declare that the two axes are measured
on the same scale. See Fig. 5.1 on page 76 and Fig. 8.4 on page 153 as two
examples of this option.
lty= Selects the type of line (solid, dashed, short-long dash, etc.).
lwd= Selects the line width (fat or skinny lines).
pch= Selects the plotting symbol, can either be a numbered symbol (pch=1)
or a letter (pch="D").
col= Selects the color of the lines or points in the figure.
cex= Character expansion factor that adjusts the size of the text and symbols
in the figure. Similarly, cex.axis adjusts axis annotation size, cex.lab
adjusts font size for axis labels, cex.main adjusts the font size of the title,
and cex.sub adjusts subtitle font size.
Graphing Parameters: The par function brings added functionality to plotting
in R by giving the user control over the graphing parameters. One noteworthy
feature of par is that it allows you to plot multiple calls to plot in a single
graphic. This is accomplished by selecting par(new=T) while a plot window
(or device) is still open and before the next call to plot. Be careful, though. Any
time you use this strategy, include the xlim and ylim commands in each call to
make sure the graphing space stays the same. Also be careful that graph margins
are not changing from one call to the next.
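To illustrate, a minimal sketch (not from the original text) that overlays two calls to plot; the rescaled post.nixon indicator in the second call is used purely for demonstration:
plot(x=pres.energy$Energy,type="l",xlim=c(1,180),ylim=c(0,220),
     xlab="Month",ylab="Television Stories")
par(new=T)   # the next call to plot draws in the same graphing space
plot(x=220*pres.energy$post.nixon,type="s",lty=2,col="gray60",
     xlim=c(1,180),ylim=c(0,220),xlab="",ylab="",axes=F)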
There are also a number of add-on functions that one can use once the basic
coordinate system has been created using plot. These include:
arrows(x1, y1, x2, y2) Create arrows within the plot (useful for label-
ing particular data points, series, etc.).
text(x1, x2, "text") Create text within the plot (modify size of text
using the character expansion option cex).
lines(x, y) Create a plot that connects lines.
points(x, y) Create a plot of points.
polygon() Create a polygon of any shape (rectangles, triangles, etc.).
legend(x, y, legend=c("", "")) Create a legend to identify the components in the figure.
axis(side) Add an axis with default or customized labels to one of the sides
of a plot. Set the side to 1 for bottom, 2 for left, 3 for top, and 4 for right.
mtext(text, side) Command to add margin text. This lets you add an axis
label to one of the sides with more control over how the label is presented. See
the code that produces Fig. 7.1 on page 108 for an example of this.
3.3 Using lattice Graphics in R
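This excerpt does not show the step of loading the package; since lattice is a recommended package that ships with R, loading it is usually all that is needed:
# install.packages("lattice")   # only if the package is somehow missing
library(lattice)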
To obtain a scatterplot similar to the one we drew with plot, this can be
accomplished in lattice using the xyplot command:
xyplot(partratebusness~supplybusiness,data=health.fin,
col="black",ylab="Lobby Participation Rate",
xlab="Number of Health Establishments")
Figure 3.7a displays this graph. The syntax differs from the plot function
somewhat: In this case, we can specify an option, data=health.fin, that allows
us to type the name of the relevant data frame once, rather than retype it for each
variable. Also, both variables are listed together in a single argument using the
form, vertical.variablehorizontal.variable. In this case, we also
specify the option, col="black" for the sake of producing a black-and-white
figure. By default lattice colors results cyan in order to allow readers to easily
separate data information from other aspects of the display, such as axes and labels
(Becker et al. 1996, p. 153). Also, by default, xyplot prints tick marks on the third
and fourth axes to provide additional reference points for the viewer.
Fig. 3.7 Lobby participation rate of the health finance industry against the number of health
establishments, (a) scatterplot and (b) dotplot
Fig. 3.8 (a) Density plot and (b) histogram showing the univariate distribution of the monthly
count of television news stories related to energy
The lattice package also contains functions that draw graphs that are similar
to a scatterplot, but instead use a rank-ordering of the vertical axis variable. This is
how the stripplot and dotplot commands work, and they offer another view
of a relationship and its robustness. The dotplot command may be somewhat
more desirable as it also displays a line for each rank-ordered value, offering a sense
that the scale is different. The dotplot syntax looks like this:
dotplot(partratebusness~supplybusiness,
data=health.fin,col="black",
ylab="Lobby Participation Rate (Rank Order)",
xlab="Number of Health Establishments")
Figure 3.7b displays this result. The stripplot function uses similar syntax.
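A sketch of the analogous stripplot call, mirroring the dotplot syntax above (the exact arguments are an assumption):
stripplot(partratebusness~supplybusiness,
     data=health.fin,col="black",
     ylab="Lobby Participation Rate",
     xlab="Number of Health Establishments")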
Lastly, the lattice library again gives us an option to look at the distribution
of a single variable by plotting either a histogram or a density plot. Returning to the
presidential time series data we first loaded in Sect. 3.1, we can now draw a density
plot using the following line of code:
densityplot(~Energy,data=pres.energy,
xlab="Television Stories",col="black")
This is presented in Fig. 3.8a. This output shows points scattered along the base,
each representing the value of an observation. The smoothed line across the graph
represents the estimated relative density of the variables values.
Alternatively, a histogram in lattice can be drawn with the histogram
function:
histogram(~Energy, data=pres.energy,
xlab="Television Stories", col="gray60")
The result is shown in Fig. 3.8b. In this case, a medium gray still allows each bar to be clearly distinguished.
A final interesting feature of histogram is left to the reader: The func-
tion will draw conditional histogram distributions. If you still have the
post.nixon variable available that we created earlier, you might try typing
histogram(~Energy|post.nixon,data=pres.energy), where the
vertical pipe (|) is followed by the conditioning variable.
A final essential point is a word on how users can export their R graphs into a
desired word processor or desktop publisher. The first option is to save the screen output of a figure. On Mac machines, users may select the figure output window and then use the dropdown menu File → Save As... to save the figure as a PDF file. On Windows machines, a user can simply right-click on the figure output window itself and then choose to save the figure as either a metafile (which can be used in programs such as Word) or as a postscript file (for use in LaTeX). Also by right-
clicking in Windows, users may copy the image and paste it into Word, PowerPoint,
or a graphics program.
A second option allows users more precision over the final product. Specifically,
the user can write the graph to a graphics device, of which there are several options.
For example, in writing this book, I exported Fig. 3.5a by typing:
postscript("lin.partrate.eps",horizontal=FALSE,width=3,
height=3,onefile=FALSE,paper="special",pointsize=7)
plot(y=health.fin$partratebusness,x=health.fin$supplybusiness,
ylab="Lobby Participation Rate",
xlab="Number of Health Establishments")
abline(finance.linear)
dev.off()
The first line calls the postscript command, which created a file called
lin.partrate.eps that I saved the graph as. Among the key options in this
command are width and height, each of which I set to three inches. The
pointsize command shrank the text and symbols to neatly fit into the space
I allocated. The horizontal command changes the orientation of the graphic
from landscape to portrait orientation on the page. Change it to TRUE to have the
graphic adopt a landscape orientation. Once postscript was called, all graphing
commands wrote to the file and not to the graphing window. Hence, it is typically a
good idea to perfect a graph before writing it to a graphics device. Thus, the plot
and abline commands served to write all of the output to the file. Once I was
finished writing to the file, the dev.off() command closed the file so that no
other graphing commands would write to it.
Of course, postscript graphics are most frequently used by writers who use the desktop publishing language of LaTeX. Writers who use more traditional word
processors such as Word or Pages will want to use other graphics devices. The
available options include: jpeg, pdf, png, and tiff.5 To use any of these four
graphics devices, substitute a call for the relevant function where postscript
is in the previous code. Be sure to type ?png to get a feel for the syntax of these
alternative devices, though, as each of the five has a slightly different syntax.
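For example, a sketch of the same linear-fit figure written to a PDF file instead (pdf's arguments differ slightly from postscript's):
pdf("lin.partrate.pdf",width=3,height=3,pointsize=7)   # open the PDF device
plot(y=health.fin$partratebusness,x=health.fin$supplybusiness,
     ylab="Lobby Participation Rate",
     xlab="Number of Health Establishments")
abline(finance.linear)
dev.off()                                              # close the file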
As a special circumstance, graphs drawn from the lattice package use a
different graphics device, called trellis.device. It is technically possible to
use the other graphics devices to write to a file, but unadvisable because the device
options (e.g., size of graph or font size) will not be passed to the graph. In the case
of Fig. 3.7b, I generated the output using the following code:
trellis.device("postscript",file="dotplot.partrate.eps",
theme=list(fontsize=list(text=7,points=7)),
horizontal=FALSE,width=3,height=3,
onefile=FALSE,paper="special")
dotplot(partratebusness~supplybusiness,
data=health.fin,col="black",
ylab="Lobby Participation Rate (Rank Order)",
xlab="Number of Health Establishments")
dev.off()
The first argument of the trellis.device command declares which driver the
author wishes to use. Besides postscript, the author can use jpeg, pdf, or
png. The second argument lists the file to write to. Font and character size must
be set through the theme option, and the remaining arguments declare the other
preferences about the output.
This chapter has covered univariate and bivariate graphing functions in R.
Several commands from both the base and lattice packages have been
addressed. This is far from an exhaustive list of Rs graphing capabilities, and
users are encouraged to learn more about the available options. This primer should,
however, serve to introduce users to various means by which data can be visualized
in R. With a good sense of how to get a feel for our data's attributes visually, the
next chapter turns to numerical summaries of our data gathered through descriptive
statistics.
5 My personal experience indicates that png often looks pretty clear and is versatile.
Chapter 4
Descriptive Statistics
Before developing any models with or attempting to draw any inferences from a
data set, the user should first get a sense of the features of the data. This can be
accomplished through the data visualization methods described in Chap. 3, as well
as through descriptive statistics of a variables central tendency and dispersion,
described in this chapter. Ideally, the user will perform both tasks, regardless
of whether the results become part of the final published product. A traditional
recommendation to analysts who estimate functions such as regression models is
that the first table of the article ought to describe the descriptive statistics of all
input variables and the outcome variable. While some journals have now turned
away from using scarce print space on tables of descriptive statistics, a good data
analyst will always create this table for him or herself. Frequently this information
can at least be reported in online appendices, if not in the printed version of the
article.
As we work through descriptive statistics, the working example in this chapter
will be policy-focused data from LaLonde's (1986) analysis of the National Sup-
ported Work Demonstration, a 1970s program that helped long-term unemployed
individuals find private sector jobs and covered the labor costs of their employment
for a year. The variables in this data frame are:
treated: Indicator variable for whether the participant received the treatment.
age: Measured in years.
education: Years of education.
black: Indicator variable for whether the participant is African-American.
married: Indicator variable for whether the participant is married.
nodegree: Indicator variable for not possessing a high school diploma.
4.1 Measures of Central Tendency

Our first task will be to calculate centrality measures, which give us a sense of a
typical value of a distribution. The most common measures of central tendency are
the mean, median, and mode. The inter-quartile range, offering the middle 50 %
of the data, is also informative. To begin with some example calculations, we first
must load LaLondes data (named LL). These are available as part of the Coarsened
Exact Matching package (cem), which we will discuss at greater length in Chap. 8.
As with any other user-defined package, our first task is to install the package:
install.packages("cem")
library(cem)
data(LL)
After installing the package, we load the library, as we will have to do in every
session in which we use the package. Once the library is loaded, we can load the
data simply by calling the data command, which loads this saved data frame from
the cem package into working memory. We conveniently can refer to the data frame
by the name LL1 .
For all of the measures of central tendency that we compute, suppose we have
a single variable x, with n different values: x1 ; x2 ; x3 ; : : : ; xn . We also could sort the
values from smallest to largest, which is designated differently with order statistics
as: x.1/ ; x.2/ ; x.3/ ; : : : x.n/ . In other words, if someone asked you for the second order
statistic, you would tell them the value of x.2/ , the second smallest value of the
variable.
With a variable like this, the most commonly used measure of centrality is the
sample mean. Mathematically, we compute this as the average of the observed
values:
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (4.1)
Within R, we can apply Eq. (4.1)'s formula using the mean function. So if x in this
case was the income participants in the National Supported Work Demonstration
earned in 1974, we would apply the function to the variable re74:
mean(LL$re74)

1 These data also are available in comma-separated format in the file named LL.csv. This data file can be downloaded from the Dataverse on page vii or the chapter content link on page 53.

Fig. 4.1 Density plot of real earnings in 1974 from the National Supported Work Demonstration data
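The code that produced Fig. 4.1 is not reproduced in this excerpt; a minimal sketch consistent with the description that follows (the object name and labels are assumptions) is:
income.dens<-density(LL$re74,from=0)    # density of 1974 earnings, bounded below at 0
plot(income.dens,main="",xlab="Income in 1974")
abline(v=mean(LL$re74))                 # vertical line at the sample mean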
On the first line, the density command allows us to compute the density of
observations at each value of income. With the from option, we can specify that
the minimum possible value of income is 0 (and the to option would have let us set
a maximum). On the second line, we simply plot this density object. Lastly, we use
abline to add a vertical line where our computed mean of $3,630.74 is located.
The resulting graph is shown in Fig. 4.1. This figure is revealing: The bulk of
the data fall below the mean. The mean is as high as it is because a handful of very
large incomes (shown in the long right tail of the graph) are drawing it upward. With
the picture, we quickly get a sense of the overall distribution of the data.
Turning back to statistical representations, another common measure of central
tendency is the sample median. One advantage of computing a median is that it is
more robust to extreme values than the mean. Imagine if our sample had somehow
included Warren Buffett: our estimate of mean income would have increased
substantially just with one observation. The median, by contrast, would move very
little in response to such an extreme observation. Our formula for computing a
median with observed data turns to the order statistics we defined above:
$$\tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{where } n \text{ is odd} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(1+\frac{n}{2}\right)}\right) & \text{where } n \text{ is even} \end{cases} \qquad (4.2)$$
Note that notation for the median is somewhat scattered, and $\tilde{x}$ is one of the several
commonly used symbols. Formally, whenever we have an odd number of values, we
simply take the middle order statistic (or middle value when the data are sorted from
smallest to largest). Whenever we have an even number of values, we take the two
middle order statistics and average between them. (E.g., for ten observations, split
the difference between $x_{(5)}$ and $x_{(6)}$ to get the median.) R will order our data, find
the middle values, and take any averages to report the median if we simply type:
median(LL$re74)
In this case, R prints [1] 823.8215, so we can report $\tilde{x} = 823.8215$ as the
median income for program participants in 1974. Observe that the median value is
much lower than the mean value, $2,806.92 lower, to be exact. This is consistent
with what we saw in Fig. 4.1: We have a positive skew to our data, with some
extreme values pulling the mean up somewhat. Later, we will further verify this
by looking at quantiles of our distribution.
A third useful measure of central tendency reports a range of central values.
The inter-quartile range is the middle 50 % of the data. Using order statistics, we
compute the lower and upper bounds of this quantity as:
$$\text{IQR}_x = \left[ x_{\left(\frac{n}{4}\right)},\; x_{\left(\frac{3n}{4}\right)} \right] \qquad (4.3)$$
The two quantities reported are called the first and third quartiles. The first quartile
is a value for which 25 % of the data are less than or equal to the value. Similarly,
75 % of the data are less than or equal to the third quartile. In this way, the middle
50 % of the data falls between these two values. In R, there are two commands we
can type to get information on the inter-quartile range:
summary(LL$re74)
IQR(LL$re74)
The summary command is useful because it presents the median and mean of the
variable in one place, along with the minimum, maximum, first quartile, and third
quartile. Our output from R looks like this:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 823.8 3631.0 5212.0 39570.0
This is handy for getting three measures of central tendency at once, though note
that the values of the mean and median by default are rounded to fewer digits than the
separate commands reported. Meanwhile, the interquartile range can be read from
the printed output as $\text{IQR}_x = [0, 5212]$. Normally, we would say that at least 50 % of
participants had an income between $0 and $5,212. In this case, though, we know no
one earned a negative income, so 75 % of respondents fell into this range. Finally,
the IQR command reports the difference between the third and first quartiles, in this
case printing: [1] 5211.795. This command, then, simply reports the spread
between the bottom and top of the interquartile range, again with less rounding than
we would have gotten by using the numbers reported by summary.
In most circumstances, rounding and the slight differences in outputs that these
commands produce pose little issue. However, if more digits are desired, the user
can control a variety of global options that shape the way R presents results with
the options command that was mentioned in Chap. 1. The digits argument
specifically shapes the number of digits presented. So, for example, we could type:
options(digits=9)
summary(LL$re74)
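The discussion that follows refers to running summary on the entire data frame rather than on a single variable; presumably a call along these lines:
summary(LL)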
This reports the same descriptive statistics as before, but for all variables at once. If
any observations are missing a value for a variable, this command will print the
number of NA values for the variable. Beware, though, not all of the quantities
reported in this table are meaningful. For indicator variables such as treated, black,
married, nodegree, hispanic, u74, and u75, remember that the variables are not
continuous. The mean essentially reports the proportion of respondents receiving a
1 rather than a 0, and the count of any missing values is useful. However, the other
information is not particularly informative.
For variables that are measured nominally or ordinally, the best summary of
information is often a simple table showing the frequency of each value. In R,
the table command reports this for us. For instance, our data include a simple
indicator coded 1 if the respondent is African-American and 0 otherwise. To get the
relative frequencies of this variable, we type:
table(LL$black)
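The bar-plot code that the next paragraph walks through is not reproduced above; a sketch consistent with that description and with Fig. 4.2 (the exact axis labels and character-expansion values are assumptions):
barplot(table(LL$education),xlab="Years of Education",
    ylab="Frequency",cex.axis=.9,cex.names=.9)
abline(h=0)
box()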
On the first line, we specify that we are drawing a bar plot of table(LL$education).
Notice that we use cex.axis and cex.names to reduce the size of the text on the
vertical and horizontal axes, respectively. Afterward, we add a baseline at 0 and draw
a box around the full figure. The result is shown in Fig. 4.2.
Fig. 4.2 Distribution of number of years of education from the National Supported Work Demonstration data (bar plot; x-axis: years of education; y-axis: frequency)
With this plot, we can easily spot that the highest bar, our mode, is at 11 years of
education. The graph also gives us a quick sense of the spread of the other values.
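If we want R to report the mode directly rather than reading it off the graph, one quick option is to ask which cell of the frequency table is largest; a brief sketch:
names(which.max(table(LL$education)))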
As a side point, suppose an analyst wanted not just a table of frequencies, but
the percentage of values in each category. This could be accomplished simply by
dividing the table by the number of cases and multiplying by 100. So for the percent
of respondents falling into each category on the education variable, we type:
100*table(LL$education)/sum(table(LL$education))
R then will print:
3 4 5 6
0.1385042 0.8310249 0.6925208 0.9695291
7 8 9 10
2.0775623 8.5872576 15.2354571 22.4376731
11 12 13 14
27.0083102 16.8975069 3.1855956 1.5235457
15 16
0.2770083 0.1385042
This output now shows the percentage of observations falling into each category.
4.2 Measures of Dispersion
Besides getting a sense of the center of our variable, we also would like to know
how spread out our observations are. The most common measures of this are the
variance and standard deviation, though we also will discuss the median absolute
deviation as an alternative measure. Starting with the sample variance, our formula
for this quantity is:
$$\text{Var}(x) = s_x^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \qquad (4.4)$$
In R, we obtain this quantity with the function var. For income in 1974, we type:
var(LL$re74)
This prints the value: [1] 38696328. Hence, we can write $\text{Var}(x) = 38{,}696{,}328$.
Of course, the variance is in a squared metric. Since we may not want to think of
the spread in terms of 38.7 million squared dollars, we will turn to alternative
measures of dispersion as well. That said, the variance is an essential quantity that
feeds into a variety of other calculations of interest.
The standard deviation is simply the square root of the variance:
$$\text{SD}(x) = s_x = \sqrt{\text{Var}(x)} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2} \qquad (4.5)$$
This simple transformation of the variance has the nice property of putting our
measure of dispersion back onto the original scale. We could either take the square
root of a computed variance, or allow R to do all steps of the calculation for us:
sd(LL$re74)
In this case, R prints: [1] 6220.637. Hence, $s_x = 6220.637$. When a variable
is shaped like a normal distribution (which our income variable is not), a useful
approximation is the 68-95-99.7 rule. This means that approximately 68 % of our
data fall within one standard deviation of the mean, 95 % within two standard
deviations, and 99.7 % within three standard deviations. For income in 1974, a heavy
concentration of incomes at $0 throws this rule off, but with many other variables
the rule will hold approximately.
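One way to see how far this variable departs from the rule is to compute the observed share of cases within one, two, and three standard deviations of the mean; a brief sketch:
sapply(1:3,function(k) mean(abs(LL$re74-mean(LL$re74))<=k*sd(LL$re74)))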
A very different measure of dispersion is the median absolute deviation. We
define this as:
$$\text{MAD}(x) = \text{median}\left(\,|x_i - \text{median}(x)|\,\right) \qquad (4.6)$$
In this case, we use the median as our measure of centrality, rather than the mean.
Then we compute the absolute difference between each observation and the median.
Lastly, we compute the median of the deviations. This offers us a sense of a typical
deviation from the median. In R the command is typed:
mad(LL$re74)
Here, R returns a value of 1221.398. Like the standard deviation, this is on the
scale of the original variable, in dollars. Unlike the standard deviation, this statistic
turns out to be much smaller in this case. Again, extreme values can really run
up variances and standard deviations, just as they can distort a mean. The median
absolute deviation, by contrast, is less sensitive to extreme values.
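The quantile commands that the next passage describes are not reproduced above; a sketch of the presumed calls, first with the default quartiles and then with a vector of eleven probabilities to request deciles:
quantile(LL$re74)
quantile(LL$re74,probs=seq(0,1,by=.1))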
The first command prints our default quantiles, though it reports them with the
rescaled percentile labels:
0% 25% 50% 75% 100%
0.0000 0.0000 823.8215 5211.7946 39570.6797
Essentially, this information repeats the quartile information that summary pro-
vided us earlier. On our second line of code, we add a vector of 11 quantiles of
interest to request deciles, which give us the cut points for each additional 10 % of
the data. This result is:
0% 10% 20% 30%
0.0000 0.0000 0.0000 0.0000
40% 50% 60% 70%
0.0000 823.8215 1837.2208 3343.5705
80% 90% 100%
6651.6747 10393.2177 39570.6797
This is revealing as it shows that at least 40 % of our respondents had an income
of $0 in 1974. Further, going from the 90th percentile to the 100th percentile
(or maximum), we see a jump from $10,393 to $39,570, suggesting that some
particularly extreme values are in the top 10 % of our data. Hence, these data do
have a substantial positive skew to them, explaining why our computed median is
so different from the mean.
In this chapter, we have covered the various means by which we can compute
measures of centrality and dispersion in R. We have also discussed frequency tables
and quantiles. Together with the graphing techniques of Chap. 3, we now have a big
basket of tools for assessing and reporting the attributes of a data set. In the coming
chapter, we will turn to drawing inferences from our data.
Consider again Peake and Eshbaugh-Soha's (2008) analysis of drug policy cov-
erage, which was introduced in the practice problems for Chap. 3. Recall that the
comma-separated data file is named drugCoverage.csv. If you do not have
it downloaded already, please visit the Dataverse (see page vii) or this chapter's
online content (see page 53). Again, the variables are: a character-based time index
showing month and year (Year), news coverage of drugs (drugsmedia), an indicator
for a speech on drugs that Ronald Reagan gave in September 1986 (rwr86), an
indicator for a speech George H.W. Bush gave in September 1989 (ghwb89), the
president's approval rating (approval), and the unemployment rate (unemploy).
1. What can you learn simply by applying the summary command to the full data
set? What jumps out most clearly from this output? Are there any missing values
in these data?
2. Using the mean function, compute the following:
What is the mean of the indicator for George H.W. Bush's 1989 speech? Given
that this variable only takes on values of 0 and 1, how would you interpret this
quantity?
What is the mean level of presidential approval? How would you interpret this
quantity?
3. What is the median level of media coverage of drug-related issues?
4. What is the interquartile range of media coverage of drug-related issues?
5. Report two frequency tables:
In the first, report the frequency of values for the indicator for Ronald
Reagan's 1986 speech.
In the second, report the frequency of values for the unemployment rate in a
given month.
What is the modal value of unemployment? In what percentage of months
does the mode occur?
6. What are the variance, standard deviation, and median absolute deviation for
news coverage of drugs?
7. What are the 10th and 90th percentiles of presidential approval in this 1977–1992
time frame?
Chapter 5
Basic Inferences and Bivariate Association
and the reader is urged to read about random sampling if more information is desired
about the theory of statistical inference.
Before computing any inferential statistics, we must load LaLonde's data once
again. Users who have already installed the cem library can simply type:
library(cem)
data(LL)
Users who did not install cem in Chap. 4 will need to type install.packages
("cem") before the two preceding lines of code will work properly. Once these
data are loaded in memory, again by the name LL, we can turn to applied analysis.¹
We begin by testing hypotheses about the mean of a population (or multiple
populations). We first consider the case where we want to test whether the mean
of some population of interest differs from some value of interest. To conduct this
significance test, we need: (1) our estimate of the sample mean, (2) the standard
error of our mean estimate, and (3) a null and alternative hypothesis. The sample
mean is defined earlier in Eq. (4.1), and the standard error of our estimate is simply
the standard deviation of the variable [defined in Eq. (4.5)] divided by the square
root of our sample size, $s_x/\sqrt{n}$.
When defining our null and alternative hypotheses, we define the null hypothesis
based on some value of interest that we would like to rule out as a possible value of
the population parameter. Hence, if we say:
$$H_0: \mu = \mu_0$$
This means that our null hypothesis ($H_0$) is that the population mean ($\mu$) is equal
to some numeric value we set ($\mu_0$). Our research hypothesis is the alternative
hypothesis we would like to reject this null in favor of. We have three choices for
potential research hypotheses:
$$H_A: \mu > \mu_0$$
$$H_A: \mu < \mu_0$$
$$H_A: \mu \neq \mu_0$$
The first two are called one-tailed tests and indicate that we believe the population
mean should be, respectively, greater than or less than the proposed value $\mu_0$. Most
research hypotheses should be considered as one of the one-tailed tests, though
occasionally the analyst does not have a strong expectation on whether the mean
¹ As before, these data also are available in comma-separated format in the file named LL.csv.
This data file can be downloaded from the Dataverse on page vii or the chapter content link on
page 63.
should be bigger or smaller. The third alternative listed defines the two-tailed test,
which asks whether the mean is simply different from (or not equal to) the value $\mu_0$.
Once we have formulated our hypothesis, we compute a t-ratio as our test statistic
for the hypothesis. Our test statistic includes the sample mean, standard error, and
the population mean defined by the null hypothesis ($\mu_0$). This formula is:
$$t = \frac{\bar{x} - \mu_0}{SE(\bar{x} \mid H_0)} = \frac{\bar{x} - \mu_0}{s_x/\sqrt{n}} \qquad (5.1)$$
In this case, assume t is the actual value of our statistic that we compute. The
typical action in this case is to have a pre-defined confidence level and decide
whether to reject the null hypothesis or not based on whether the p-value indicates
that rejection can be done with that level of confidence. For instance, if an analyst
was willing to reject a null hypothesis if he or she could do so with 90 % confidence,
then if p < 0.10, he or she would reject the null and conclude that the research
hypothesis is correct. Many users also proceed to report the p-value so that readers
can draw conclusions about significance themselves.
R makes all of these calculations very straightforward, doing all of this in a single
line of user code. Suppose that we had a hypothesis that, in 1974, the population of
long-term unemployed Americans had a lower income than $6,059, a government
estimate of the mean income for the overall population of Americans. In this case,
our hypothesis is:
$$H_0: \mu = 6059$$
$$H_A: \mu < 6059$$
This is a one-tailed test because we do not even entertain the idea that the long-term
unemployed could have an on-average higher income than the general population.
Rather, we simply ask whether the mean of our population of interest is discernibly
lower than $6,059 or not. To test this hypothesis in R, we type:
t.test(LL$re74, mu=6059, alternative="less")
² This statistic has a t distribution because the sample mean has a normally distributed sampling
distribution and the sample standard error has a $\chi^2$ sampling distribution with $n-1$ degrees of
freedom. The ratio of these two distributions yields a t distribution.
The first argument of the t.test lists our variable of interest, LL$re74, for
which R automatically computes the sample mean and standard error. Second, the
mu=6059 argument lists the value of interest from our null hypothesis. Be sure to
include this argument: If you forget, the command will still run assuming you want
mu=0, which is silly in this case. Finally, we specify our alternative hypothesis
as "less". This means we believe the population mean to be less than the null
quantity presented. The result of this command prints as:
One Sample t-test
data: LL$re74
t = -10.4889, df = 721, p-value < 2.2e-16
alternative hypothesis: true mean is less than 6059
95 percent confidence interval:
-Inf 4012.025
sample estimates:
mean of x
3630.738
This presents a long list of information: At the end, it reports the sample mean of
3630.738. Earlier, it shows us the value of our t-ratio is -10.4889, along with the
fact that our t distribution has 721 degrees of freedom. As for the p-value, when R
prints p-value < 2.2e-16, this means that p is so minuscule that it is smaller
than R's level of decimal precision, much less any common significance threshold.
Hence, we can reject the null hypothesis and conclude that long-term unemployed
Americans had a significantly lower income than $6,059 in 1974.
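To see how Eq. (5.1) maps onto this output, the t-ratio and its one-tailed p-value could also be computed by hand; a brief sketch:
x<-LL$re74
t.ratio<-(mean(x)-6059)/(sd(x)/sqrt(length(x)))
t.ratio
pt(t.ratio,df=length(x)-1) # one-tailed p-value for the "less" alternative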
We also can conduct a difference of means test, comparing the means of two
populations, x and y. In this case, our null hypothesis is that the two population
means are equal:
$$H_0: \mu_x = \mu_y$$
Again, we will pair this with one of the three alternative hypotheses:
$$H_A: \mu_x < \mu_y$$
$$H_A: \mu_x > \mu_y$$
$$H_A: \mu_x \neq \mu_y$$
Again, the first two possible alternative hypotheses are one-tailed tests where we
have a clear expectation as to which population's mean should be bigger. The
third possible alternative simply evaluates whether the means are different. When
building our test statistic from this null hypothesis, we rely on the fact that H0 also
implies $\mu_x - \mu_y = 0$. Using this fact, we construct our t-ratio as:
$$t = \frac{(\bar{x} - \bar{y}) - (\mu_x - \mu_y)}{SE(\bar{x} - \bar{y} \mid H_0)} \qquad (5.2)$$
The last question is how we calculate the standard error. Our calculation depends
on whether we are willing to assume that the variance is the same in each population.
Under the assumption of unequal variance, we compute the standard error as:
$$SE(\bar{x} - \bar{y} \mid H_0) = \sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}} \qquad (5.3)$$
As an example, we can conduct a test with the last observation of income in the
National Supported Work Demonstration, which was measured in 1978. Suppose
our hypothesis is that income in 1978 was higher among individuals who received
the treatment of participating in the program (y) than it was among those who were
control observations and did not get to participate in the program (x). Our hypothesis
in this case is:
$$H_0: \mu_x = \mu_y$$
$$H_A: \mu_x < \mu_y$$
Again, this is a one-tailed test because we are not entertaining the idea that
the treatment could have reduced long-term income. Rather, the treatment either
increased income relative to the control observations, or it had no discernible
effect. R allows us to conduct this two-sample t-test using either assumption. The
commands for unequal and equal variances, respectively, are:
t.test(re78~treated,data=LL,alternative="less",var.equal=F)
t.test(re78~treated,data=LL,alternative="less",var.equal=T)
In each command, the formula re78~treated splits income in 1978 by treatment
status, and alternative="less" specifies that the average for the lower value of
treated (group 0, the control) should be lower than the average for the higher value
of treated (group 1, the treated group). The only difference between the commands
is that the first sets var.equal=F so that variances are assumed unequal, and the
second sets var.equal=T so that variances are assumed equal.
The results print as follows. For the assumption of unequal variances, we see:
Welch Two Sample t-test
For a paired sample, in which each unit is observed twice, we instead define w as
the difference between the two observations for each unit and test whether its
population mean, $\mu_w$, equals zero against one of three alternatives:
$$H_A: \mu_w < 0$$
$$H_A: \mu_w > 0$$
$$H_A: \mu_w \neq 0$$
The test statistic in this case is given by Eq. (5.5), computed for the new variable w.
$$t = \frac{\bar{w} - 0}{SE(\bar{w} \mid H_0)} = \frac{\bar{w}}{s_w/\sqrt{n}} \qquad (5.5)$$
As can be seen, this is effectively the same test statistic as in Eq. (5.1) with w as
the variable of interest and 0 as the null value. The user technically could create
the w variable him or herself and then simply apply the code for a single-sample
significance test for a mean.
More quickly, though, this procedure could be automated by inserting two
separate variables for the linked observations into the t-test command. Suppose,
for instance, that we wanted to know if our control observations saw a rise in
their income from 1974 to 1978. It is possible that wages may not increase over
this time because these numbers are recorded in real terms. However, if wages did
increase, then observing how they changed for the control group can serve as a good
baseline for comparing change in the treated group's wages in the same time frame.
To conduct this paired sample t-test for our control observations, we type:
LL.0<-subset(LL,treated==0)
t.test(LL.0$re74,LL.0$re78,paired=T,alternative="less")
In the first line we create a subset only of our control observations. In the second
line, our first argument is the measure of income in 1974, and the second is income
in 1978. Third, we specify the option paired=T: This is critical, otherwise R will
assume each variable forms an independent sample, but in our case this is a paired
sample where each individual has been observed twice. (To this end, by typing
paired=F instead, this gives us the syntax for a two-sample t-test in which the two
samples are stored in separate variables.)
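As noted above, the same paired test could be run by constructing the difference variable ourselves and applying a one-sample test; an equivalent sketch:
LL.0<-subset(LL,treated==0)
w<-LL.0$re74-LL.0$re78
t.test(w,mu=0,alternative="less")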
5.2 Cross-Tabulations
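The command that produced the cross-tabulation discussed below is not reproduced above; a sketch of the presumed call, using the CrossTable function from the gmodels package (which requires installation on first use):
install.packages("gmodels")
library(gmodels)
CrossTable(y=LL$u75,x=LL$u74,prop.c=F,prop.t=F,
    prop.chisq=F,chisq=T,format="SPSS")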
In this code, y specifies the column variable, and x specifies the row variable. This
means our dependent variable makes up the columns and the independent variable
makes up the rows. Because we want the conditional distribution of the dependent
variable for each given value of the independent variable, the options prop.c, prop.t,
and prop.chisq are all set to FALSE (referring to the proportion of the column,
the proportion of the total sample, and the contribution to the chi-squared statistic).
This means that each cell
only contains the raw frequency and the row-percentage, which corresponds to the
distribution conditional on the independent variable. The option chisq=T reports
Pearson's chi-squared ($\chi^2$) test. Under this test, the null hypothesis is that the two
variables are independent of each other. The alternative hypothesis is that knowing
the value of one variable changes the expected distribution of the other.³ By setting
the format option to SPSS, rather than SAS, we are presented with percentages
in our cells, rather than proportions.
The results of this command are printed below:
Cell Contents
|-------------------------|
| Count |
| Row Percent |
|-------------------------|
| LL$u75
LL$u74 | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
0 | 386 | 9 | 395 |
| 97.722% | 2.278% | 54.709% |
-------------|-----------|-----------|-----------|
1 | 47 | 280 | 327 |
| 14.373% | 85.627% | 45.291% |
-------------|-----------|-----------|-----------|
Column Total | 433 | 289 | 722 |
-------------|-----------|-----------|-----------|
³ Note that this is a symmetric test of association. The test itself has no notion of which is the
dependent or independent variable.
As we can see, among those who were employed in 1974 (u74=0), 97.7 % were
employed in 1975. Among those who were unemployed in 1974 (u74=1), 14.4 %
were employed in 1975.⁴ This corresponds to an 83.3 percentage point difference
between the categories. This vast effect indicates that employment status in one
year does, in fact, beget employment status in the following year. Further, our
test statistic is $\chi^2_{1\,\text{df}} = 517.7155$ with a minuscule corresponding p-value. Hence,
we reject the null hypothesis that employment status in 1974 is independent from
employment status in 1975 and conclude that employment status in 1974 conditions
the distribution of employment status in 1975.
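The same Pearson statistic can be double-checked with base R's chisq.test function; a brief sketch (correct=FALSE suppresses the continuity correction that chisq.test otherwise applies to 2 × 2 tables):
chisq.test(table(LL$u74,LL$u75),correct=FALSE)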
As a more interesting question, we might ask whether receiving the treatment
from the National Supported Work Demonstration shapes employment status in
1975. We would test this hypothesis with the code:
CrossTable(y=LL$u75,x=LL$treated,prop.c=F,prop.t=F,
prop.chisq=F,chisq=T,format="SPSS")
| LL$u75
LL$treated | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
0 | 247 | 178 | 425 |
| 58.118% | 41.882% | 58.864% |
-------------|-----------|-----------|-----------|
1 | 186 | 111 | 297 |
| 62.626% | 37.374% | 41.136% |
-------------|-----------|-----------|-----------|
Column Total | 433 | 289 | 722 |
-------------|-----------|-----------|-----------|
⁴ To get more meaningful levels than 0 and 1 in this case, we would need to create copies of the
variables u74 and u75 that recorded each value as text (e.g., Unemployed and Employed). The
recode command from the car library offers a straightforward way of doing this, if desired.
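The correlation commands that the next paragraph describes are not reproduced above; a sketch of the presumed calls, relating years of education to income in 1974:
cor(LL$education,LL$re74)
cor(LL$education,LL$re74)^2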
The first line computes the actual correlation coefficient itself. R returns a printout
of: [1] 0.08916458. Hence, our correlation coefficient is $r = 0.0892$. The
second line recalculates the correlation and squares the result for us all at once.
This tells us that $r^2 = 0.0080$. The implication of this finding is that by knowing a
respondent's number of years of education, we could explain 0.8 % of the variance
in 1974 income. On its face, this seems somewhat weak, but as a general word of
advice, always gauge $r^2$ (or multiple $R^2$, in the next chapter) values by comparing
them with other findings in the same area. Some sorts of models will routinely
explain 90 % of the variance, while others do well to explain 5 % of the variance.
As a final example, we can consider the idea that income begets income. Consider
how well income in 1975 correlates with income in 1978. We compute this by
typing:
cor(LL$re75,LL$re78)
cor(LL$re75,LL$re78)^2
The first line returns the correlation coefficient between these two variables,
printing: [1] 0.1548982. Our estimate of $r = 0.1549$ indicates that high values
of income in 1975 do generally correspond to high values of income in 1978. In this
case, the second line returns $r^2 = 0.0240$. This means we can explain 2.4 % of the
variance of income in 1978 by knowing what someone earned in 1975.
Remember that the graphing tools from Chap. 3 can help us understand our data,
including any results that we quantify such as correlation coefficients. If we are
wondering why earlier income does not do a better job of predicting later income,
we could draw a scatterplot as follows:
plot(x=LL$re75,y=LL$re78,xlab="1975 Income",ylab="1978 Income",
asp=1,xlim=c(0,60000),ylim=c(0,60000),pch=".")
Notice that we have used the asp=1 option to set the aspect ratio of the two axes
at 1. This guarantees that the scale of the two axes is held to be the same, which
is appropriate since both variables in the figure are measured in inflation-adjusted
dollars. The output is reported in Fig. 5.1. As can be seen, many of the observations
cluster at zero in one or both of the years, so there is a limited degree to which a
linear relationship characterizes these data.
We now have several basic inferences in hand: t-tests on means and $\chi^2$ tests
for cross-tabulations. Difference in means tests, cross-tabulations, and correlation
coefficients each give us a way to assess the bivariate association between two variables.
⁵ The cor command also provides a method option for which available arguments are pearson,
kendall (which computes Kendall's $\tau$, a rank correlation), and spearman (which computes
Spearman's $\rho$, another rank correlation). Users are encouraged to read about the alternate methods
before using them. Here, we focus on the default Pearson method.
Fig. 5.1 Scatterplot of income in 1975 and 1978 from National Supported Work Demonstration data (x-axis: 1975 income; y-axis: 1978 income)
Please load the foreign library and download Alvarez et al.'s (2013) data, which
are saved in the Stata-formatted file alpl2013.dta. This file is available from
the Dataverse named on page vii or the chapter content named on page 63. These
data are from a field experiment in Salta, Argentina in which some voters cast
ballots through e-voting, and others voted in the traditional setting. The variables
are: an indicator for whether the voter used e-voting or traditional voting (EV), age
group (age_group), education (educ), white collar worker (white_collar), not a
full time worker (not_full_time), male (male), a count variable for number of six
possible technological devices used (tech), an ordinal scale for political knowledge
(pol_info), a character vector naming the polling place (polling_place), whether the
respondent thinks poll workers are qualified (capable_auth), whether the voter eval-
uated the voting experience positively (eval_voting), whether the voter evaluated
the speed of voting as quick (speed), whether the voter is sure his or her vote is being
counted (sure_counted), whether the voter thought voting was easy (easy_voting),
whether the voter is confident in ballot secrecy (conf_secret), whether the voter
thinks Salta's elections are clean (how_clean), whether the voter thinks e-voting
should replace traditional voting (agree_evoting), and whether the voter prefers
selecting candidates from different parties electronically (eselect_cand).
1. Consider the number of technological devices. Test the hypothesis that the
average Salta voter has used more than three of these six devices. (Formally:
$H_0: \mu = 3$; $H_A: \mu > 3$.)
2. Conduct two independent sample difference of means tests:
a. Is there any difference between men and women in how many technological
devices they have used?
b. Is there any difference in how positively voters view the voting experience
(eval_voting) based on whether they used e-voting or traditional voting (EV)?
3. Construct two cross-tabulations:
a. Construct a cross-tabulation where the dependent variable is how positively
voters view the voting experience (eval_voting) and the independent variable
is whether they used e-voting or traditional voting (EV). Does the distribution
of voting evaluation depend on whether the voter used e-voting? This cross-
tabulation addresses the same question as is raised in #2.b. Which approach
is more appropriate here?
b. Construct a cross-tabulation where the dependent variable is how positively
voters view the voting experience (eval_voting) and the independent variable
is the ordinal scale of political knowledge (pol_info). Does the distribution of
voting evaluation change with the voter's level of political knowledge?
4. Consider the correlation between level of education (educ) and political knowl-
edge (pol_info):
a. Compute Pearsons r between these two variables.
b. Many argue that, with two ordinal variables, a more appropriate correlation
measure is Spearman's $\rho$, which is a rank correlation. Compute Spearman's $\rho$
and contrast the results with those from Pearson's r.
Chapter 6
Linear Models and Regression Diagnostics
The linear regression model estimated with ordinary least squares (OLS) is a
workhorse model in Political Science. Even when a scholar uses a more advanced
method that may make more accurate assumptions about his or her data (such as
probit regression, a count model, or even a uniquely crafted Bayesian model),
the researcher often draws from the basic form of a model that is linear in the
parameters. By a similar token, many of the R commands for these more advanced
techniques use functional syntax that resembles the code for estimating a linear
regression. Therefore, an understanding of how to use R to estimate, interpret, and
diagnose the properties of a linear model lends itself to sophisticated use of models
with a similar structure.
This chapter proceeds by describing the lm (linear model) command in R, which
estimates a linear regression model with OLS, and the command's various options.
Then, the chapter describes how to conduct regression diagnostics of a linear model.
These diagnostics serve to evaluate whether critical assumptions of OLS estimation
hold up, or if our results may be subject to bias or inefficiency.
Throughout the chapter, the working example is an analysis of the number of
hours high school biology teachers spend teaching evolution. The model replicates
work by Berkman and Plutzer (2010, Table 7.2), who argue that this policy outcome
is affected by state-level factors (such as curriculum standards) and teacher attributes
(such as training). The data are from the National Survey of High School Biology
Teachers and consist of 854 observations of high school biology teachers who were
surveyed in the spring of 2007. The outcome of interest is the number of hours a
teacher devotes to human and general evolution in his or her high school biology
class (hrs_allev), and the twelve input variables are as follows:
phase1: An index of the rigor of ninth & tenth grade evolution standards in 2007
for the state the teacher works in. This variable is coded on a standardized scale
with mean 0 and standard deviation 1.
senior_c: An ordinal variable for the seniority of the teacher. Coded −3 for 1–2
years of experience, −2 for 3–5 years, −1 for 6–10 years, 0 for 11–20 years, and 1
for 21+ years.
ph_senior: An interaction between standards and seniority.
notest_p: An indicator variable coded 1 if the teacher reports that the state does
not have an assessment test for high school biology, 0 if the state does have such
a test.
ph_notest_p: An interaction between standards and no state test.
female: An indicator variable coded 1 if the teacher is female, 0 if male. Missing
values are coded 9.
biocred3: An ordinal variable for how many biology credit hours the teacher has
(both graduate and undergraduate). Coded 0 for 24 hours or less, 1 for 25–40
hours, and 2 for 40+ hours.
degr3: The number of science degrees the teacher holds, from 0 to 2.
evol_course: An indicator variable coded 1 if the instructor took a specific college-
level course on evolution, 0 otherwise.
certified: An indicator coded 1 if the teacher has normal state certification, 0
otherwise.
idsci_trans: A composite measure, ranging from 0 to 1, of the degree to which the
teacher thinks of him or herself as a scientist.
confident: Self-rated expertise on evolutionary theory. Coded −1 for "less than
many other teachers," 0 for "typical of most teachers," 1 for "very good
compared to most high school biology teachers," and 2 for "exceptional and on
par with college-level instructors."
To start, we need to load the survey data, which we will name evolution. In
this example, we load a Stata-formatted data set. This is easily possible through the
foreign library, which provides us with the read.dta command:¹
rm(list=ls())
library(foreign)
evolution<-read.dta("BPchap7.dta",convert.factors=FALSE)
¹ Berkman and Plutzer's data file, named BPchap7.dta, is available from the Dataverse linked
on page vii or the chapter content linked on page 79. Remember that you may need to use the
setwd command to point to where you have saved the data.
As a rule, we want to start by viewing the descriptive statistics from our data
set. At minimum, use the summary command, and perhaps some of the other
commands described in Chaps. 3 and 4:
summary(evolution)
In addition to the descriptive statistics summary gives us, it will also list the number
of missing observations we have on a given variable (under NAs), if any are
missing. The default condition for most modeling commands in R is to delete any
case that is missing an observation on any variable in a model. Hence, the researcher
needs to be aware not only of variation in relevant variables, but also how many
cases lack an observation.² Additionally, researchers should be careful to notice
anything in the descriptive statistics that deviates from a variable's values that are
listed in the codebook. For example, in this case the variable female has a maximum
value of 9. If we know from our codebook that 0 and 1 are the only valid observed
values of this variable, then we know that anything else is either a miscode or (in this
case) a missing value.
Before proceeding, we need to reclassify the missing observations of female:
evolution$female[evolution$female==9]<-NA
summary(evolution)
evolution<-subset(evolution,!is.na(female))
This command recodes only the values of female coded as a 9 as missing. As the
subsequent call to summary shows, the 13 values coded as a 9 are now listed as
missing, so they will automatically be omitted in our subsequent analysis. To make
sure any computations we make focus only on the observations over which we fit
the model, we subset our data to exclude the missing observations. As an alternative
to using subset here, if we had missing values on multiple variables, we instead
may have wanted to type: evolution<-na.omit(evolution).
Having cleaned our data, we now turn to the model of hours spent teaching
evolution described at the start of the chapter. We estimate our linear model
using OLS:
mod.hours<-lm(hrs_allev~phase1*senior_c+phase1*notest_p+
female+biocred3+degr3+evol_course+certified+idsci_trans+
confident,data=evolution)
summary(mod.hours)
The standard syntax for specifying the formula for a model is to list the
outcome variable to the left of the tilde (~), and the input variables on the
right-hand side separated by plus signs. Notice that we did include two special
terms: phase1*senior_c and phase1*notest_p. Considering the first,
phase1*senior_c, this interactive notation adds three terms to our model:
phase1, senior_c, and the product of the two. Such interactive models allow the
effect of each variable in the interaction to depend on the value of the other.³
² A theoretically attractive alternative to listwise deletion as a means of handling missing data is
multiple imputation. See Little and Rubin (1987), Rubin (1987), and King et al. (2001) for more
details.
Residuals:
Min 1Q Median 3Q Max
-20.378 -6.148 -1.314 4.744 32.148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.2313 1.1905 8.594 < 2e-16 ***
phase1 0.6285 0.3331 1.886 0.0596 .
senior_c -0.5813 0.3130 -1.857 0.0636 .
notest_p 0.4852 0.7222 0.672 0.5019
female -1.3546 0.6016 -2.252 0.0246 *
biocred3 0.5559 0.5072 1.096 0.2734
degr3 -0.4003 0.3922 -1.021 0.3077
evol_course 2.5108 0.6300 3.985 7.33e-05 ***
certified -0.4446 0.7212 -0.617 0.5377
idsci_trans 1.8549 1.1255 1.648 0.0997 .
confident 2.6262 0.4501 5.835 7.71e-09 ***
phase1:senior_c -0.5112 0.2717 -1.881 0.0603 .
phase1:notest_p -0.5362 0.6233 -0.860 0.3899
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
³ See Brambor et al. (2006) for further details on interaction terms. Also, note that
an equivalent specification of this model could be achieved by replacing phase1*
senior_c and phase1*notest_p with the terms phase1+senior_c+ph_senior+
notest_p+ph_notest_p. We are simply introducing each of the terms separately in this way.
Table 6.1 Linear model of hours of class time spent teaching evolution by high
school biology teachers (OLS estimates)
Predictor                          Estimate   Std. Error   t value   Pr(>|t|)
Intercept                           10.2313     1.1905       8.59     0.0000
Standards index 2007                 0.6285     0.3331       1.89     0.0596
Seniority (centered)                -0.5813     0.3130      -1.86     0.0636
Standards × seniority               -0.5112     0.2717      -1.88     0.0603
Believes there is no test            0.4852     0.7222       0.67     0.5019
Standards × believes no test        -0.5362     0.6233      -0.86     0.3899
Teacher is female                   -1.3546     0.6016      -2.25     0.0246
Credits earned in biology (0-2)      0.5559     0.5072       1.10     0.2734
Science degrees (0-2)               -0.4003     0.3922      -1.02     0.3077
Completed evolution class            2.5108     0.6300       3.99     0.0001
Has normal certification            -0.4446     0.7212      -0.62     0.5377
Identifies as scientist              1.8549     1.1255       1.65     0.0997
Self-rated expertise (-1 to +2)      2.6262     0.4501       5.84     0.0000
Notes: N = 841. R² = 0.1226. F(12, 828) = 9.641 (p < 0.001). Data from
Berkman and Plutzer (2010)
The top of the printout repeats the user-specified model command and then
provides some descriptive statistics for the residuals. The table that follows presents
the results of primary interest: The first column lists every predictor in the model,
including an intercept. The second column presents the OLS estimate of the partial
regression coefficient. The third column presents the t-ratio for a null hypothesis
that the partial regression coefficient is zero, and the fourth column presents a two-
tailed p-value for the t-ratio. Finally, the table prints dots and stars based on the
thresholds that the two-tailed p-value crosses.⁴ Below the table, several fit statistics
are reported: The standard error of regression (or residual standard error), the R2
and adjusted R2 values, and the F-test for whether the model as a whole explains
a significant portion of variance. The results of this model also are presented more
formally in Table 6.1.⁵
⁴ Users are reminded that for one-tailed tests, in which the user wishes to test that the partial
coefficient specifically is either greater than or less than zero, the p-value will differ. If the sign of
the coefficient matches the alternative hypothesis, then the corresponding p-value is half of what is
reported. (Naturally, if the sign of the coefficient is opposite the sign of the alternative hypothesis,
the data do not fit with the researcher's hypothesis.) Additionally, researchers may want to test a
hypothesis in which the null hypothesis is something other than zero: In this case, the user can
construct the correct t-ratio using the reported estimate and standard error.
⁵ Researchers who write their documents with LaTeX can easily transfer the results of a linear
model from R to a table using the xtable library. (HTML is also supported by xtable.)
On first use, install with: install.packages("xtable"). Once installed, simply entering
library(xtable); xtable(mod.hours) would produce LaTeX-ready code for a table
that is similar to Table 6.1. As another option for outputting results, see the rtf package about
how to output results into Rich Text Format.
Many researchers, rather than reporting the t-ratios and p-values presented in
the default output of lm, will instead report confidence intervals of their estimates.
One must be careful in the interpretation of confidence intervals, so readers
unfamiliar with these are urged to consult a statistics or econometrics textbook for
more information (such as Gujarati and Porter 2009, pp. 108–109). To construct
such a confidence interval in R, the user must choose a confidence level and use the
confint command:
confint(mod.hours,level=0.90)
The level option is where the user specifies the confidence level. 0.90 corre-
sponds to 90 % confidence, while level=0.99, for instance, would produce a
99 % confidence interval. The results of our 90 % confidence interval are reported
as follows:
5 % 95 %
(Intercept) 8.27092375 12.19176909
phase1 0.07987796 1.17702352
senior_c -1.09665413 -0.06587642
notest_p -0.70400967 1.67437410
female -2.34534464 -0.36388231
biocred3 -0.27927088 1.39099719
degr3 -1.04614354 0.24552777
evol_course 1.47336072 3.54819493
certified -1.63229086 0.74299337
idsci_trans 0.00154974 3.70834835
confident 1.88506881 3.36729476
phase1:senior_c -0.95856134 -0.06377716
phase1:notest_p -1.56260919 0.49020149
Among other features, one useful attribute of these is that a reader can examine a
90 % (for instance) confidence interval and reject any null hypothesis that proposes
a value outside of the interval's range for a two-tailed test. For example, the interval
for the variable confident does not include zero, so we can conclude with 90 %
confidence that the partial coefficient for this variable is different from zero.⁶
6.2 Regression Diagnostics
We are only content to use OLS to estimate a linear model if it is the Best Linear
Unbiased Estimator (BLUE). In other words, we want to obtain estimates that
on average yield the true population parameter (unbiased), and among unbiased
⁶ In fact, we also could conclude that the coefficient is greater than zero at the 95 % confidence
level. For more on how confidence intervals can be useful for one-tailed tests as well, see Gujarati
and Porter (2009, p. 115).
estimators we want the estimator that minimizes the error variance of our estimates
(best or efficient). Under the Gauss–Markov theorem, OLS is BLUE and valid for
inferences if four assumptions hold:
1. Fixed or exogenous input values. In other words, the predictors (X) must be independent of the error term: $\text{Cov}(X_{2i}, u_i) = \text{Cov}(X_{3i}, u_i) = \cdots = \text{Cov}(X_{ki}, u_i) = 0$.
2. Correct functional form. In other words, the conditional mean of the disturbance must be zero: $E(u_i \mid X_{2i}, X_{3i}, \ldots, X_{ki}) = 0$.
3. Homoscedasticity or constant variance of the disturbances ($u_i$): $\text{Var}(u_i) = \sigma^2$.
4. There is no autocorrelation between disturbances: $\text{Cov}(u_i, u_j) = 0$ for $i \neq j$.
While we never observe the values of disturbances, as these are population terms,
we can predict residuals ($\hat{u}$) after estimating a linear model. Hence, we typically will
use residuals in order to assess whether we are willing to make the Gauss–Markov
assumptions. In the following subsections, we conduct regression diagnostics to
assess the various assumptions and describe how we might conduct remedial
measures in R to correct for apparent violations of the Gauss–Markov assumptions.
The one exception is that we do not test the assumption of no autocorrelation
because we cannot reference our example data by time or space. See Chap. 9 for
examples of autocorrelation tests and corrections. Additionally, we describe how
to diagnose whether the errors have a normal distribution, which is essential for
statistical inference. Finally, we consider the presence of two notable data features,
multicollinearity and outlier observations, that are not part of the Gauss–Markov
assumptions but nevertheless are worth checking for.
6.2.1 Functional Form
It is critical to have the correct functional form in a linear model; otherwise, its
results will be biased. Therefore, upon estimating a linear model we need to assess
whether we have specified the model correctly, or whether we need to include
nonlinear aspects of our predictors (such as logarithms, square roots, squares, cubes,
or splines). As a rule, an essential diagnostic for any linear model is to do a
scatterplot of the residuals ($\hat{u}$). These plots ought to be done against both the fitted
values ($\hat{Y}$) and against the predictors (X). To construct a plot of residuals against
fitted values, we would simply reference attributes of the model we estimated in a
call to the plot command:
plot(y=mod.hours$residuals,x=mod.hours$fitted.values,
xlab="Fitted Values",ylab="Residuals")
Notice that mod.hours$residuals allowed us to reference the model's
residuals ($\hat{u}$), and mod.hours$fitted.values allowed us to call the predicted
values ($\hat{Y}$). We can reference many features with the dollar sign ($). Type
names(mod.hours) to see everything that is saved. Turning to our output
plot, it is presented in Fig. 6.1. As analysts, we should check this plot for a few
features: Does the local average of the residuals tend to stay around zero? If the
Fig. 6.1 Scatterplot of residuals against fitted values from model of hours of teaching evolution (x-axis: fitted values; y-axis: residuals)
residuals show a clear pattern of rising or falling over any range, then the functional
form of some variable may be wrong. Does the spread of the residuals differ at any
portion in the graph? If so, there may be a heteroscedasticity issue. One apparent
feature of Fig. 6.1 is that the residuals appear to hit a diagonal floor near the
bottom of the cloud. This emerges because a teacher cannot spend fewer than zero
hours teaching evolution. Hence, this natural floor reflects a limit in the dependent
variable. A functional form limitation such as this is often best addressed within the
Generalized Linear Model framework, which will be considered in the next chapter.
Another useful tool is to draw figures of the residuals against one or more
predictors. Figure 6.2 shows two plots of the residuals from our model against the
composite scale of the degree to which the teacher self-identifies as a scientist.
Figure 6.2a shows the basic plot using the raw data, which a researcher should
always look at. In this case, the predictor of interest takes on 82 unique values,
but many observations take on the same values, particularly at the upper end of the
scale. In cases like this, many points on the plot will be superimposed on each other.
By jittering the values of idsci_trans, or adding a small randomly drawn number,
it becomes easier to see where a preponderance of the data are. Figure 6.2b shows
a revised plot that jitters the predictor. The risk of the jittered figure is that moving
the data can distort a true pattern between the predictor and residuals. However,
in a case of an ordinal (or perhaps semi-ordinal) input variable, the two subfigures
Fig. 6.2 Scatterplot of residuals against the degree to which a teacher identifies as a scientist. (a) Raw data. (b) Jittered data
can complement each other to offer the fullest possible picture. The two scatterplots
from Fig. 6.2 are produced as follows:
plot(y=mod.hours$residuals,x=evolution$idsci_trans,
xlab="Identifies as Scientist",ylab="Residuals")
plot(y=mod.hours$residuals,x=jitter(evolution$idsci_trans,
amount=.01),xlab="Identifies as Scientist (Jittered)",
ylab="Residuals")
Much like the residual-to-fitted value plot of Fig. 6.1, we examine the residual-to-
predictor plots of Fig. 6.2 for changes in the local mean as well as differences in the
spread of the residuals, each contingent on the predictor value. On functional form,
there is little to suggest that the running mean is changing markedly across values.
Hence, as with the residual-to-fitted plot, we see little need to respecify our model
with a nonlinear version of this predictor. However, the spread of the residuals looks
a bit concerning, so we will revisit this issue in the next section.
In addition to graphical methods, one common test statistic for diagnosing a
misspecified functional form is Ramseys RESET test (regression specification error
test). This test proceeds by reestimating the original model, but this time including
the fitted values from the original model in some nonlinear form (such as a quadratic
or cubic formula). Using an F-ratio to assess whether the new model explains
significantly more variance than the old model serves as a test of whether a different
form of one or more predictors should be included in the model. We can conduct
this test for a potential cubic functional form as follows:
evolution$fit<-mod.hours$fitted.values
reset.mod<-lm(hrs_allev~phase1*senior_c+phase1*notest_p+
female+biocred3+degr3+evol_course+certified+idsci_trans+
confident+I(fit^2)+I(fit^3), data=evolution)
anova(mod.hours, reset.mod)
The first line of code saves the fitted values from the original model as a variable in
the data frame. The second line adds squared and cubed forms of the fitted values
into the regression model. By embedding these terms within the I function (again
meaning, as is), we can algebraically transform the input variable on the fly as we
estimate the model. Third, the anova command (for analysis of variance) presents
the results of an F-test that compares the original model to the model including
a quadratic and cubic form of the fitted values. In this case, we get a result of
$F_{2,826} = 2.5626$, with a p-value of $p = 0.07772$. This indicates that the model
with the cubic polynomial of fitted values does fit significantly better at the 90 %
level, implying another functional form would be better.
To determine which predictor could be the culprit of the misspecified functional
form, we can conduct Durbin–Watson tests on the residuals, sorting on the predictor
that may be problematic. (Note that traditionally Durbin–Watson tests sort on time
to test for temporal autocorrelation. This idea is revisited in Chap. 9.) A discernible
result indicates that residuals take similar values at similar values of the input, a
sign that the predictor needs to be respecified. The lmtest library (users will need
to install with install.packages the first time) provides commands for several
diagnostic tests, including the Durbin–Watson test. Sorting the residuals on the rigor
of evolution standards (phase1), we run the test:
install.packages("lmtest")
library(lmtest)
dwtest(mod.hours, order.by=evolution$phase1)
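The definition of mod.cubic that the next paragraph refers to is not reproduced above. A sketch of a specification consistent with that description, in which seniority and the no-test indicator each interact with a cubic polynomial in the standards index (the exact arrangement of terms is an assumption):
mod.cubic<-lm(hrs_allev~senior_c*(phase1+I(phase1^2)+I(phase1^3))+
    notest_p*(phase1+I(phase1^2)+I(phase1^3))+
    female+biocred3+degr3+evol_course+certified+idsci_trans+
    confident,data=evolution)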
As with the RESET test itself, our new model (mod.cubic) illustrates how we can
use additional features of the lm command. Again, by using the I function, we can
perform algebra on any input variable within the model command. As before, the
caret (^) raises a variable to a power, allowing our polynomial function. Again, for
interaction terms, simply multiplying two variables with an asterisk (*) ensures that
the main effects and product terms of all variables in the interaction are included.
Hence, we allow seniority and whether there is no assessment test each to interact
with the full polynomial form of evolution standards.
6.2.2 Heteroscedasticity
When the error variance in the residuals is not uniform across all observations,
a model has heteroscedastic error variance, the estimates are inefficient, and
the standard errors are biased to be too small. The first tool we use to assess
whether the error variance is homoscedastic (or constant for all observations)
versus heteroscedastic is a simple scatterplot of the residuals. Figure 6.1 offered
us the plot of our residuals against the fitted values, and Fig. 6.2 offers an example
plot of the residuals against a predictor. Besides studying the running mean to
evaluate functional form, we also assess the spread of residuals. If the dispersion
of the residuals is a constant band around zero, then we may use this as a visual
confirmation of homoscedasticity. However, in the two panels of Fig. 6.2, we can
see that the preponderance of residuals is more narrowly concentrated close to zero
for teachers who are less inclined to self-identify as a scientist, while the residuals
are more spread-out among those who are more inclined to identify as a scientist.
(The extreme residuals are about the same for all values of X, making this somewhat
tougher to spot, but the spread of concentrated data points in the middle expands at
higher values.) All of this suggests that self-identification as a scientist corresponds
with heteroscedasticity for this model.
Besides visual methods, we also have the option of using a test statistic in
a Breusch–Pagan test. Using the lmtest library (which we loaded earlier), the
syntax is as follows:
bptest(mod.hours, studentize=FALSE)
The default of bptest is to use Koenker's studentized version of this test. Hence,
the studentize=FALSE option gives the user the choice of using the original
version of the Breusch–Pagan test. The null hypothesis in this chi-squared test is
homoscedasticity. In this case, our test statistic is $\chi^2_{12\,\text{df}} = 51.7389$ ($p < 0.0001$).
Hence, we reject the null hypothesis and conclude that the residuals are not
homoscedastic.
Without homoscedasticity, our results are not efficient, so how might we correct
for this? Perhaps the most common solution to this issue is to use Huber–White
robust standard errors, or sandwich standard errors (Huber 1967; White 1980). The
downside of this method is that it ignores the inefficiency of the OLS estimates and
continues to report these as the parameter estimates. The upside, however, is that
although OLS estimates are inefficient under heteroscedasticity, they are unbiased.
Since the standard errors are biased, correcting them fixes the biggest problem
heteroscedasticity presents us. Computing Huber–White standard errors can be
accomplished using the sandwich (needing a first-time install) and lmtest
libraries:
install.packages("sandwich")
library(sandwich)
coeftest(mod.hours,vcov=vcovHC)
The lmtest library makes the coeftest command available, and the
sandwich library makes the variance-covariance matrix vcovHC available within
this. (Both libraries require installation on first use.) The coeftest command will
now present the results of mod.hours again, with the same OLS estimates as
before, the new Huber–White standard errors, and values of t and p that correspond
to the new standard errors.
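If a particular flavor of the robust variance estimator is desired, vcovHC also accepts a type argument; this is an optional refinement rather than something the chapter requires:
coeftest(mod.hours,vcov=vcovHC(mod.hours,type="HC1")) # HC1 adds a small-sample degrees-of-freedom correction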
Finally, we also have the option to reestimate our model using WLS. To do this,
the analyst must construct a model of the squared residuals as a way of forecasting
the heteroscedastic error variance for each observation. While there are a few ways
to do this effectively, here is the code for one approach. First, we save the squared
residuals and fit an auxiliary model of the logarithm of these squared residuals:
evolution$resid2<-mod.hours$residuals^2
weight.reg<-lm(log(resid2)~phase1*senior_c+phase1*notest_p+
female+biocred3+degr3+evol_course+certified+idsci_trans+
confident, data=evolution)
A key caveat of WLS is that all weights must be nonnegative. To guarantee this, the
code here models the logarithm of the squared residuals; therefore, the exponential
of the fitted values from this auxiliary regression serve as positive predictions of
the squared residuals. (Other solutions to this issue exist as well.) The auxiliary
regression simply includes all of the predictors from the original regression in
their linear form, but the user is not tied to this assumption. In fact, WLS offers
the BLUE under heteroscedasticity, but only if the researcher properly models the
error variance. Hence, proper specification of the auxiliary regression is essential. In
WLS, we essentially want to give heavy weight to observations with a low error variance
and little weight to those with a high error variance. Hence, for our final WLS regression,
the weights command takes the reciprocal of the predicted values (exponentiated
to be on the original scale of the squared residuals):
wls.mod<-lm(hrs_allev~phase1*senior_c+phase1*notest_p+
female+biocred3+degr3+evol_course+certified+idsci_trans+
confident,data=evolution,
weights=I(1/exp(weight.reg$fitted.values)))
summary(wls.mod)
This presents us with a set of estimates that accounts for heteroscedasticity in the
residuals.
6.2.3 Normality
[Fig. 6.3: Histogram of the residuals from the model of hours of teaching evolution (Frequency by Residuals)]
This histogram is reported in Fig. 6.3. Generally, we would like a symmetric bell
curve that is neither excessively flat nor peaked. If both skew (referring to whether
the distribution is symmetric or has one tail longer than the other) and kurtosis
(referring to the distribution's peakedness) are similar to a normal distribution, we may
use this figure in favor of our assumption. In this case, the residuals appear to be
right-skewed, suggesting that normality is not a safe assumption here.
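The code that draws this histogram falls outside the excerpted text; a minimal sketch that would produce a comparable figure from the fitted model is:
hist(mod.hours$residuals,xlab="Residuals",main="") # histogram of the model's residuals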
A slightly more complex figure (albeit potentially more informative) is called
a quantile–quantile plot. In this figure, the quantiles of the empirical values of
the residuals are plotted against the quantiles of a theoretical normal distribution.
The less these quantities correspond, the less reasonable it is to assume the residuals
are distributed normally. Such a figure is constructed in R as follows:
qqnorm(mod.hours$residuals)
qqline(mod.hours$residuals,col="red")
The first line of code (qqnorm) actually creates the quantile–quantile plot. The
second line (qqline) adds a guide line to the existing plot. The complete graph is
located in Fig. 6.4. As can be seen, at lower and higher quantiles, the sample values
Fig. 6.4 Normal quantile–quantile plot for residuals from model of hours of teaching evolution
deviate substantially from the theoretical values. Again, the assumption of normality
is questioned by this figure.
Besides these substantively focused assessments of the empirical distribution,
researchers also can use test statistics. The most commonly used test statistic in this
case is the Jarque–Bera test, which is based on the skew and kurtosis of the residuals'
empirical distribution. This test uses the null hypothesis that the residuals are
normally distributed and the alternative hypothesis that they are not.7 The tseries
library can calculate this statistic, which we install on first use:
install.packages("tseries")
library(tseries)
jarque.bera.test(mod.hours$residuals)
In our case, χ² = 191.5709, so we reject the null hypothesis and conclude that the
residuals are not normally distributed. Like diagnostics for heteroscedasticity, we
would prefer a null result since we prefer not to reject the assumption.
All three diagnostics indicate a violation of the normality assumption, so how
might we respond to this violation? In many cases, the best answer probably lies
in the next chapter on Generalized Linear Models (GLMs). Under this framework,
we can assume a wider range of distributions for the outcome variable, and we
also can transform the outcome variable through a link function. Another somewhat
7. In other words, if we fail to reject the null hypothesis for a Jarque–Bera test, then we conclude
that there is not significant evidence of non-normality. Note that this is different from concluding
that we do have normality. However, this is the strongest conclusion we can draw with this test
statistic.
similar option would be to transform the dependent variable somehow. In the case
of our running example on hours spent on evolution, our outcome variable cannot
be negative, so we might add 1 to each teacher's response and take the logarithm
of our dependent variable. Bear in mind, though, that this has a bigger impact on
the model's functional form (see Gujarati and Porter 2009, pp. 162–164), and we
have to assume that the disturbances of the model with a logged dependent variable
are normally distributed for inferential purposes.
6.2.4 Multicollinearity
8. A VIF of 10 means that 90 % of the variance in a predictor can be explained by the other
predictors, which in most contexts can be regarded as a large degree of common variance.
Unlike other diagnostic tests, though, this rule of thumb should not be regarded as a test statistic.
Ultimately the researcher must draw a substantive conclusion from the results.
The VIFs calculated in this way are presented in Table 6.2. As can be seen in the
table, all of the VIFs are small, implying that multicollinearity is not a major issue
in this model. In situations where multicollinearity does emerge, though, sometimes
the best advice is to do nothing. For a discussion of how to decide whether doing
nothing is the best approach or another solution would work better, see Gujarati and
Porter (2009, pp. 342–346).
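The code that computes these VIFs is likewise not shown in this excerpt. A minimal sketch, assuming the car package (used elsewhere in this book) and the mod.hours model, would be:
library(car)
vif(mod.hours) # variance inflation factors (generalized VIFs when interaction terms are present)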
Fig. 6.5 Cook's distances, Studentized residuals, and hat values from model of hours teaching
evolution
influenceIndexPlot(mod.hours, # from the car package: index plots of influence diagnostics
  vars=c("Cook","Studentized","hat"),id.n=5) # label the five most extreme observations in each panel
The values of these three quantities are reported in Fig. 6.5, which shows Cook's
distances, Studentized residuals, and hat values, respectively. In any of these
plots, an extreme value relative to the others indicates that an observation may
be particularly problematic. In this figure, none of the observations stand out
particularly, and none of the values of Cook's distance are remotely close to 1
(which is a common rule-of-thumb threshold for this quantity). Hence, none of the
observations appear to be particularly problematic for this model. In an instance
where some observations do appear to exert influence on the results, the researcher
must decide whether it is reasonable to keep the observations in the analysis or if
any of them ought to be removed. Removing data from a linear model can easily be
accomplished with the subset option of lm.
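As a hypothetical sketch of that subset option (the row numbers 120 and 185 below are made up purely for illustration and are not observations the book flags), one could refit the model while excluding specific rows as follows:
mod.drop<-lm(hrs_allev~phase1*senior_c+phase1*notest_p+
  female+biocred3+degr3+evol_course+certified+idsci_trans+
  confident,data=evolution,subset=-c(120,185)) # drop hypothetical rows 120 and 185 before fitting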
We now have considered how to fit linear models in R and how to conduct several
diagnostics to determine whether OLS presents us with the BLUE. While this is a
common model in Political Science, researchers frequently need to model limited
dependent variables in the study of politics. To address dependent variables of this
nature, we turn in the next chapter to GLMs. These models build on the linear model
framework but allow outcome variables that are bounded or categorical in nature.
This set of practice problems will draw from Owsiak's (2013) work on democrati-
zation, in which he shows that states that settle all of their international borders tend
to become more democratic. Please load the foreign library and then download a
subset of Owsiaks data, saved in the Stata-formatted file owsiakJOP2013.dta.
This file can be downloaded from the Dataverse linked on page vii or the chapter
content linked on page 79. These are panel data that include observations for 200
countries from 1918 to 2007, with a total of 10,434 country-years forming the data.
The countries in these data change over time (just as they changed in your history
book), making this what we call an unbalanced panel. Hence, our subsequent model
includes lagged values of several variables, or values from the previous year. See
Chap. 8 for more about nested data, and Chap. 9 for more about temporal data. For
this exercise, our standard OLS tools will work well.
1. Start by using the na.omit command, described on page 81, to eliminate
missing observations from these data. Then compute the descriptive statistics
for the variables in this data set.
2. To replicate Model 2 from Owsiak (2013), estimate a linear regression with
OLS using the following specification (with variable names in parentheses): The
dependent variable is Polity score (polity2), and the predictors are an indicator
for having all borders settled (allsettle), lagged GDP (laggdpam), lagged change
in GDP (laggdpchg), lagged trade openness (lagtradeopen), lagged military
personnel (lagmilper), lagged urban population (lagupop), lagged previous non-
democratic movement (lagsumdown), and lagged Polity score (lagpolity).
3. Plot the residuals against the fitted values.
4. Is there heteroscedasticity in the residuals? Based on scatterplots and a
Breusch–Pagan test, what do you conclude?
a. Estimate Huber–White standard errors for this model with the sandwich
library and coeftest command.
b. For bonus credit, you can reproduce Owsiak's (2013) results exactly by
computing clustered standard errors, clustering on country (variable name:
ccode). You can do this in three steps: First, install the multiwayvcov
library. Second, define an error variance-covariance matrix using the
cluster.vcov command. Third, use that error variance-covariance matrix
as an argument in the coeftest command from the lmtest library.
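A minimal sketch of those three steps, purely to illustrate the mechanics (the object names mod.owsiak and owsiak below are hypothetical placeholders for your fitted model and data frame):
install.packages("multiwayvcov")
library(multiwayvcov)
library(lmtest)
clustered.vcov<-cluster.vcov(mod.owsiak,owsiak$ccode) # error variance-covariance matrix clustered by country
coeftest(mod.owsiak,vcov=clustered.vcov)              # coefficient table with clustered standard errors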
While the linear regression model is common to Political Science, many of the
outcome measures researchers wish to study are binary, ordinal, nominal, or count
variables. When we study these limited dependent variables, we turn to techniques
such as logistic regression, probit regression, ordered logit (and probit) regression,
multinomial logit (and probit) regression, Poisson regression, and negative binomial
regression. A review of these and several other methods can be seen in volumes such
as King (1989) and Long (1997).
In fact, all of these techniques can be thought of as special cases of the
generalized linear model, or GLM (Gill 2001). The GLM approach in brief is to
transform the mean of our outcome in some way so that we can apply the usual logic
of linear regression modeling to the transformed mean. This way, we can model
a broad class of dependent variables for which the distribution of the disturbance
terms violates the normality assumption from the Gauss–Markov theorem. Further,
in many cases, the outcome is bounded, so the link function we use to transform
the mean of the outcome may reflect a more realistic functional form (Gill 2001,
pp. 31–32).
The glm command in R is flexible enough to allow the user to specify many of
the most commonly used GLMs, such as logistic and Poisson regression. A handful
of models that get somewhat regular usage, such as ordered logit and negative
binomial regression, actually require unique commands that we also will cover. In
general, though, the glm command is a good place to look first when a researcher
has a limited dependent variable. In fact, the glm command takes an argument
called family that allows the user to specify what kind of model he or she wishes
to estimate. By typing ?family into the R console, the user can get a quick
overview of which models the glm command can estimate.
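As a quick hedged illustration (with placeholder names y, x, and dat rather than real data), a few of the family specifications that glm accepts look like this:
?family                                           # overview of available families and link functions
glm(y~x,family=binomial(link="logit"),data=dat)   # logistic regression
glm(y~x,family=binomial(link="probit"),data=dat)  # probit regression
glm(y~x,family=poisson(link="log"),data=dat)      # Poisson regression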
We first consider binary outcome variables, or variables that take only two possible
values. Usually these outcomes are coded 0 or 1 for simplicity of interpreta-
tion. As our example in this section, we use survey data from the Comparative
Study of Electoral Systems (CSES). Singh (2014a) studies a subset of these data
that consists of 44,897 survey respondents from 30 elections. These elections
occurred between the years of 1996–2006 in the countries of Canada, the Czech
Republic, Denmark, Finland, Germany, Hungary, Iceland, Ireland, Israel, Italy, the
Netherlands, New Zealand, Norway, Poland, Portugal, Slovenia, Spain, Sweden,
Switzerland, and the United Kingdom.
Singh uses these data to assess how ideological distance shapes individuals' vote
choice and willingness to vote in the first place. Building on the spatial model of
politics advanced by Hotelling (1929), Black (1948), Downs (1957), and others, the
article shows that linear differences in ideology do a better job of explaining voter
behavior than squared differences. The variables in the data set are as follows:
voted: Indicator coded 1 if the respondent voted, 0 if not.
votedinc: Indicator coded 1 if the respondent voted for the incumbent party, 0 if he
or she voted for another party. (Non-voters are missing.)
cntryyear: A character variable listing the country and year of the election.
cntryyearnum: A numeric index identifying the country-year of the election.
distanceinc: Distance between the survey respondent and the incumbent party on
a 0–10 ideology scale.
distanceincsq: Squared distance between the voter and incumbent party.
distanceweighted: Distance between the survey respondent and the most similar
political party on a 0–10 ideology scale, weighted by the competitiveness of the
election.
distancesqweighted: Squared weighted distance between the voter and most
similar ideological party.
The data are saved in Stata format, so we will need to load the foreign library.
Download the file SinghJTP.dta from the Dataverse linked on page vii or the
chapter content linked on page 97. Then open the data as follows:
library(foreign)
voting<-read.dta("SinghJTP.dta",convert.factors=FALSE)
A good immediate step here would be to use commands such as summary as well
as graphs to get a feel for the descriptive attributes of the data. This is left for the
reader.
As a first model from these data, we will model the probability that a respondent
voted for the incumbent party, rather than another party. We will use only one
predictor in this case, and that is the distance between the voter and the incumbent
party. We craft this as a logistic regression model. The syntax for this model is:
inc.linear<-glm(votedinc~distanceinc,
family=binomial(link="logit"),data=voting)
The syntax of glm (generalized linear model) is nearly identical to lm: We still
start with a functional specification that puts the dependent variable to the left
of the tilde (~) and predictors on the right separated with plus signs. Again,
we reference our dataset with the data option. Now, however, we must use
the family option to specify which GLM we want to estimate. By specify-
ing binomial(link="logit"), we declare a binary outcome and that we
are estimating a logit, rather than probit, model. After estimation, by typing
summary(inc.linear) we get the output from our logistic regression model,
which is as follows:
Call:
glm(formula = votedinc ~ distanceinc, family = binomial(link = "logit"),
data = voting)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2608 -0.8632 -0.5570 1.0962 2.7519
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.19396 0.01880 10.32 <2e-16 ***
distanceinc -0.49469 0.00847 -58.41 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 7.1 Logit model of probability of voting for incumbent party, 30 cross-national elections

Predictor    Estimate   Std. error   z value   Pr(>|z|)
Intercept      0.1940     0.0188      10.32     0.0000
Distance      -0.4947     0.0085     -58.41     0.0000

Notes: N = 38,211. AIC = 42,914. 69 % correctly predicted. Data from Singh (2014a)
The printout looks similar to the printout of the linear model we estimated in
Chap. 6.¹ A more formal presentation of our results can be found in Table 7.1.²
Comparing these results to the linear model, however, a few differences are
important to note. First, the coefficient estimates themselves are not as meaningful
as those of the linear model. A logit model transforms our outcome of interest, the
probability of voting for the incumbent party, because it is bounded between 0 and 1.
The logit transform is worthwhile because it allows us to use a linear prediction
framework, but it calls for an added step of effort for interpretation. (See Sect. 7.1.3
for more on this.) A second difference in the output is that it reports z ratios instead
of t ratios: Just as before, these are computed around the null hypothesis that the
coefficient is zero, and the formula for the ratio uses the estimate and standard
error in the same way. Yet, we now need to assume these statistics follow a normal
distribution, rather than a t distribution.3 Third, different fit statistics are presented:
deviance scores and the Akaike information criterion (AIC).4
In Table 7.1, we report the coefficients, standard errors, and inferential infor-
mation. We also report the AIC, which is a good fit index and has the feature of
penalizing for the number of parameters. Unlike R² in linear regression, though,
the AIC has no natural metric that gives an absolute sense of model fit. Rather, it
works better as a means of comparing models, with lower values indicating a better
penalized fit. To include a measure of fit that does have a natural scale to it, we also
report what percentage of responses our model correctly predicts. To compute this,
all we need to do is determine whether the model would predict a vote for the
incumbent party and compare this to how the respondent actually voted. In R,
we can roll our own computation:
1. In this case, the coefficient estimates we obtain are similar to those reported by Singh (2014a).
However, our standard errors are smaller (and hence z and p values are bigger) because Singh
clusters the standard errors. This is a useful idea because the respondents are nested within
elections, though multilevel models (which Singh also reports) address this issue as well; see
Sect. 8.1.
2. LaTeX users can create a table similar to this quickly by typing: library(xtable);
xtable(inc.linear).
3. An explanation of how the inferential properties of this model are derived can be found in
Eliason (1993, pp. 26–27).
4. Deviance is calculated as -2 times the logged ratio of the fitted likelihood to the saturated
likelihood. Formally, -2 log(L1/L2), where L1 is the fitted likelihood and L2 is the saturated likelihood.
R reports two quantities: the null deviance computes this for an intercept-only model that always
predicts the modal value, and the residual deviance calculates this for the reported model.
predicted<-as.numeric(
predict.glm(inc.linear,type="response")>.5)
true<-voting$votedinc[voting$voted==1]
correct<-as.numeric(predicted==true)
100*table(correct)/sum(table(correct))
On the first line, we create a vector of the predictions from the model. The
code uses the predict.glm command, which usefully can forecast from any
model estimated with the glm command. By specifying type="response" we
clarify that we want our predictions to be on the probability scale (instead of the
default scale of latent utility). We then ask if each probability is greater than 0.5.
By wrapping all of this in the as.numeric command, we count all probabilities
above 0.5 as predicted values of 1 (for the incumbent) and all that are less than 0.5
as predicted values of 0 (against the incumbent). On the second line, we simply
subset the original vector of the outcome from the original data to those that voted
and hence were included in the model. This subsetting step is essential because the
glm command automatically deletes missing data from estimation. Hence, without
subsetting, our predicted and true values would not properly link together. On the
third line, we create a vector coded 1 if the predicted value matches the true value,
and on the fourth line we create a table of this vector. The printout is:
correct
0 1
30.99108 69.00892
Hence, we know that the model correctly predicts 69 % of the outcome values,
which we report in Table 7.1.
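An equivalent shortcut for the same computation, since the mean of a logical comparison is a proportion, would be:
100*mean(predicted==true) # percentage of observations correctly predicted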
As one more example of logistic regression, Singh (2014a) compares a model
with linear ideological distances to one with squared ideological distances. To fit
this alternative model, we type:
inc.squared<-glm(votedinc~distanceincsq,
family=binomial(link="logit"),data=voting)
summary(inc.squared)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1020 -0.9407 -0.5519 1.2547 3.6552
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.179971 0.014803 -12.16 <2e-16 ***
Logit models have gained traction over the years for the sake of simplicity in
computation and interpretation. (For instance, logit models can be interpreted with
odds ratios.) However, a key assumption of logit models is that the error term in
the latent variable model (or the latent utility) has a logistic distribution. We may
be more content to assume that the error term of our model is normally distributed,
given the prevalence of this distribution in nature and in asymptotic results.5 Probit
regression allows us to fit a model with a binary outcome variable with a normally
distributed error term in the latent variable model.
To show how this alternative model of a binary outcome works, we turn to a
model of the probability a survey respondent voted at all. Singh (2014a) models
this as a function of the ideological proximity to the nearest party weighted by the
competitiveness of the election. The theory here is that individuals with a relatively
proximate alternative in a competitive election are more likely to find it worthwhile
to vote. We fit this model as follows:
turnout.linear<-glm(voted~distanceweighted,
family=binomial(link="probit"),data=voting)
summary(turnout.linear)
5. Additionally, in advanced settings for which we need to develop a multivariate distribution for
multiple outcome variables, the normal distribution is relatively easy to work with.
Call:
glm(formula = voted ~ distanceweighted, family = binomial(link = "probit"),
data = voting)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9732 0.5550 0.5550 0.5776 0.6644
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)       1.068134   0.009293 114.942  < 2e-16 ***
distanceweighted -0.055074   0.011724  -4.698 2.63e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The layout of these probit model results looks similar to the results of the logit model.
Note, though, that changing the distribution of the latent error term to a normal
distribution changes the scale of the coefficients, so the values will be different
between logit and probit models. The substantive implications typically are similar
between the models, so the user must decide which model works best in terms of
assumptions and interpretation for the data at hand.
An important feature of GLMs is that the use of a link function makes the
coefficients more difficult to interpret. With a linear regression model, as estimated
in Chap. 6, we could simply interpret the coefficient in terms of change in the
expected value of the outcome itself, holding the other variables equal. With a GLM,
though, the mean of the outcome has been transformed, and the coefficient speaks
to change in the transformed mean. Hence, for analyses like logit and probit models,
we need to take additional steps in order to interpret the effect an input has on the
outcome of interest.
For a logistic regression model, the analyst can quickly calculate the odds ratio
for each coefficient simply by taking the exponential of the coefficient.6 Recall that
the odds of an event is the ratio of the probability the event occurs to the probability
it does not occur: p/(1 - p). The odds ratio tells us the multiplicative factor by which the
odds will change for a unit increase in the predictor. Within R, if we want the odds
ratio for our distance coefficient in Table 7.1, we simply type:
exp(inc.linear$coefficients[-1])
This syntax will take the exponential of every coefficient estimate from our model,
no matter the number of covariates. The [-1] omits the intercept, for which an
odds ratio would be meaningless. Having only one predictor, the printout in this
case is:
distanceinc
0.6097611
We need to be careful when interpreting the meaning of odds ratios. In this case,
for a one point increase in distance from the incumbent party on the ideology scale,
the odds that a respondent will vote for the incumbent party diminish by a factor of
0.61. (With multiple predictors, we would need to add the ceteris paribus caveat.)
If, instead of interpreting as a multiplicative factor, the analyst preferred to discuss
change in percentage terms, type:
100*(exp(inc.linear$coefficients[-1])-1)
In this case a value of -39.02389 is returned. Hence, we can say: for a one point
increase in distance from the incumbent party on the ideology scale, the odds that a
respondent will vote for the incumbent party diminish by 39 %. Remember, though,
that all of these statements relate specifically to odds, so in this case we are referring
to a 39 % decrease in the ratio of the probability of voting for the incumbent to the
probability of voting for any other party.
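Although the chapter does not report them, odds ratios are often paired with confidence intervals. One hedged way to obtain a Wald-based 95 % interval on the odds-ratio scale from this model is:
exp(confint.default(inc.linear))[-1,] # Wald confidence interval for the distance odds ratio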
An alternative interpretation that often is easier to explain in text is to report
predicted probabilities from a model. For a logistic regression model, inputting the
predictions from the linear function (the latent utilities) into the logistic cumulative
distribution function produces the predicted probability that the outcome takes a
value of 1. A simple approach to intuitively illustrate the effect of a predictor is
to plot the predicted probabilities at every value that a predictor can take, which
shows how the probability changes in a nonlinear way as the predictor changes.
We proceed first by creating our predicted probabilities:
distances<-seq(0,10,by=.1)
inputs<-cbind(1,distances)
colnames(inputs)<-c("constant","distanceinc")
inputs<-as.data.frame(inputs)
6. This is because the logit link function is the log-odds, or logarithm of the odds of the event.
forecast.linear<-predict(inc.linear,newdata=inputs,
type="response")
On the first line, we create a sequence of possible distances from the incumbent
party, ranging from the minimum (0) to the maximum (10) in small increments
(0.1). We then create a matrix named inputs that stores predictor values of
interest for all predictors in our model (using the column bind, cbind, command to
combine two vectors as columns in a matrix). Subsequently, we name the columns
to match our variable names and recategorize this matrix as a data frame. On the
final line, we use the predict command, which saves the predicted probabilities
to a vector. Observe the use of the newdata option to specify our data frame
of predictor values and the type option to specify that we want our predicted
values on the response scale. By setting this to the response scale, the command
returns predicted probabilities of voting for the incumbent party at each hypothetical
distance.
As an alternative to the model in which voting for the incumbent is a function
of the linear ideological distance between the voter and the party, we also fitted a
model using the squared distance. We easily can compute the predicted probability
from this alternative model against the value of distance on its original scale. Again,
the predicted probabilities are computed by typing:
inputs2<-cbind(1,distances^2)
colnames(inputs2)<-c("constant","distanceincsq")
inputs2<-as.data.frame(inputs2)
forecast.squared<-predict(inc.squared,newdata=inputs2,
type="response")
In this case, we use the original vector distances that captured hypothetical pre-
dictor values, and square them. By using these squared values, we save our predicted
probabilities from the alternative model into the vector forecast.squared.
To plot the predicted probabilities from each model on the same space, we type:
plot(y=forecast.linear,x=distances,ylim=c(0,.6),type="l",
lwd=2,xlab="",ylab="")
lines(y=forecast.squared,x=distances,lty=2,col="blue",lwd=2)
legend(x=6,y=.5,legend=c("linear","squared"),lty=c(1,2),
col=c("black","blue"),lwd=2)
mtext("Ideological Distance",side=1,line=2.75,cex=1.2)
mtext("Probability of Voting for Incumbent",side=2,
line=2.5,cex=1.2)
On the first line, we plot the predicted probabilities from the model with linear
distance. On the vertical axis (y) are the probabilities, and on the horizontal axis
(x) are the values of distance. We bound the probabilities between 0 and 0.6 for
a closer look at the changes, set type="l" to produce a line plot, and use the
option lwd=2 to increase the line thickness. We also set the x- and y-axis labels to
be empty (xlab="",ylab="") so that we can fill in the labels later with a more
precise command. On the second line, we add another line to the open figure of the
Fig. 7.1 Predicted probability of voting for the incumbent party as a function of ideological distance
from the incumbents, based on a linear and a quadratic functional form
predicted probabilities from the model with squared distance. This time, we color
the line blue and make it dashed (lty=2) to distinguish it from the other models
predicted probabilities. On the third line, we add a legend to the plot, located at the
coordinates where x=6 and y=0.5, that distinguishes the lines based on the linear
and squared distances. Finally, on the last two lines we add axis labels using the
mtext command: The side option lets us declare which axis we are writing on,
the line command determines how far away from the axis the label is printed, and
the cex command allows us to expand the font size (to 120 % in this case). The
full results are presented in Fig. 7.1. As the figure shows, the model with squared
distance is more responsive at middle values, with a flatter response at extremes.
Hence, Singhs (2014a) conclusion that linear distance fits better has substantive
implications for voter behavior.
As one final example of reporting predicted probabilities, we turn to an example
from the probit model we estimated of turnout. Predicted probabilities are computed
in a similar way for probit models, except that the linear predictions (or utilities) are
now input into a normal cumulative distribution function. In this example, we will
add to our presentation of predicted probabilities by including confidence intervals
around our predictions, which convey to the reader the level of uncertainty in our
forecast. We begin as we did in the last example, by creating a data frame of
hypothetical data values and producing predicted probabilities with them:
wght.dist<-seq(0,4,by=.1)
inputs.3<-cbind(1,wght.dist)
colnames(inputs.3)<-c("constant","distanceweighted")
inputs.3<-as.data.frame(inputs.3)
forecast.probit<-predict(turnout.linear,newdata=inputs.3,
type="link",se.fit=TRUE)
In this case, weighted ideological distance from the nearest ideological party is our
one predictor. This predictor ranges from approximately 0–4, so we create a vector
spanning those values. On the last line of the above code, we have changed two
features: First, we have specified type="link". This means that our predictions
are now linear predictions of the latent utility, and not the probabilities in which
we are interested. (This will be corrected in a moment.) Second, we have added
the option se.fit=TRUE, which provides us with a standard error of each linear
prediction. Our output object, forecast.probit now contains both the linear
forecasts and the standard errors.
The reason we saved the linear utilities instead of the probabilities is that doing
so will make it easier for us to compute confidence intervals that stay within the
probability limits of 0 and 1. To do this, we first compute the confidence intervals
of the linear predictions. For the 95 % confidence level, we type:
lower.ci<-forecast.probit$fit-1.95996399*forecast.probit$se.fit
upper.ci<-forecast.probit$fit+1.95996399*forecast.probit$se.fit
plot(y=pnorm(forecast.probit$fit),x=wght.dist,ylim=c(.7,.9),
type="l",lwd=2,xlab="Weighted Ideological Distance",
ylab="Probability of Turnout")
lines(y=pnorm(lower.ci),x=wght.dist,lty=3,col="red",lwd=2)
lines(y=pnorm(upper.ci),x=wght.dist,lty=3,col="red",lwd=2)
In the first line, we plot the predicted probabilities themselves. To obtain the proba-
bilities for the vertical axis, we type y=pnorm(forecast.probit$fit). The
pnorm function is the normal cumulative distribution function, so this converts
our linear utility predictions into actual probabilities. Meanwhile, x=wght.dist
places the possible values of weighted distance to the nearest party on the horizontal
axis. On the second line, we plot the lower bound of the 95 % confidence interval
of predicted probabilities. Here, pnorm(lower.ci) converts the confidence
interval forecast onto the probability scale. Finally, we repeat the process on line
three to plot the upper bound of the confidence interval. The full output can be seen
in Fig. 7.2. A noteworthy feature of this plot is that the confidence interval becomes
noticeably wide for the larger values of weighted distance. This is because the mean
of the variable is low and there are few observations at these higher values.
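As a small aside, the hard-coded critical value 1.95996399 used above could equivalently be computed with qnorm, which makes it easy to switch to a different confidence level:
crit<-qnorm(.975)  # two-tailed 95 % critical value from the standard normal
lower.ci<-forecast.probit$fit-crit*forecast.probit$se.fit
upper.ci<-forecast.probit$fit+crit*forecast.probit$se.fit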
The predicted probabilities in both of these cases were simple because they
included only one predictor. For any GLM, including a logit or probit model,
predicted probabilities and their level of responsiveness depend on the value of
all covariates. Whenever a researcher has multiple predictors in a GLM model,
Fig. 7.2 Predicted probability of turning out to vote as a function of weighted ideological distance
from the nearest party, with 95 % confidence intervals
reasonable values of the control variables must be included in the forecasts. See
Sect. 7.3.3 for an example of using the predict function for a GLM that includes
multiple predictors.
We now turn to ordinal outcome measures. Ordinal variables have multiple cate-
gories as responses that can be ranked from lowest to highest, but other than the
rankings the numeric values have no inherent meaning. As an example of an ordinal
outcome variable, we again use survey data from the CSES, this time from Singh's
(2014b) study of satisfaction with democracy. Relative to the previous example,
these data have a wider scope, including 66,908 respondents from 62 elections. The
variables in this data set are as follows:
satisfaction: Respondent's level of satisfaction with democracy. Ordinal scale
coded 1 (not at all satisfied), 2 (not very satisfied), 3 (fairly satisfied), or 4
(very satisfied).
cntryyear: A character variable listing the country and year of the election.
cntryyearnum: A numeric index identifying the country-year of the election.
freedom: Freedom House scores for a country's level of freedom. Scores range
from 5.5 (least free) to 1 (most free).
gdpgrowth: Percentage growth in Gross Domestic Product (GDP).
gdppercapPPP: GDP per capita, computed using purchasing power parity (PPP),
chained to 2000 international dollars, in thousands of dollars.
CPI: Corruption Perceptions Index. Scores range from 0 (least corrupt) to 7.6
(most corrupt).
efficacy: Respondent thinks that voting can make a difference. Ordinal scale from
1 (disagree) to 5 (agree).
educ: Indicator coded 1 if the respondent graduated from college, 0 if not.
abstained: Indicator coded 1 if the respondent abstained from voting, 0 if the
respondent voted.
prez: Indicator coded 1 if the country has a presidential system, 0 if not.
majoritarian_prez: Indicator coded 1 if the country has a majoritarian system, 0
if not.
winner: Indicator coded 1 if the respondent voted for the winning party, 0 if not.
voted_ID: Indicator coded 1 if the respondent voted for the party he or she
identifies with, 0 if not.
voted_affect: Indicator coded 1 if the respondent voted for the party he or she
rated highest, 0 if not.
voted_ideo: Indicator coded 1 if the respondent voted for the most similar party
on ideology, 0 if not.
optimality: Vote optimality scale ranging from 0 to 3, coded by adding voted_ID,
voted_affect, and voted_ideo.
winnerXvoted_ID: Interaction term between voting for the winner and voting by
party identification.
winnerXvoted_affect: Interaction term between voting for the winner and voting
for the highest-rated party.
winnerXvoted_ideo: Interaction term between voting for the winner and voting
by ideological similarity.
winnerXoptimality: Interaction term between voting for the winner and the vote
optimality scale.
These data are also in Stata format, so if the foreign library is not
already loaded, it will need to be called. To load our data, download the file
SinghEJPR.dta from the Dataverse linked on page vii or the chapter content
linked on page 97. Then type:
library(foreign)
satisfaction<-read.dta("SinghEJPR.dta")
require us to quantify adjectives such as very and fairly. Hence, an ordered logit
or ordered probit model is going to be appropriate for this analysis.
As our first example, Singh (2014b, Table SM2) fits a model in which satisfaction
in democracy is based on whether the respondent voted for the candidate most
ideologically proximate, whether the respondent voted for the winner, and the
interaction between these two variables.7 The most important of these terms is the
interaction, as it tests the hypothesis that individuals who were on the winning
side and voted for the most similar party ideologically will express the greatest
satisfaction.
Turning to specifics, for ordinal regression models we actually must use the
special command polr (short for proportional odds logistic regression), which is
part of the MASS package. Most R distributions automatically install MASS, though
we still need to load it with the library command.8 In order to load the MASS
package and then estimate an ordered logit model with these data, we would type:
library(MASS)
satisfaction$satisfaction<-ordered(as.factor(
satisfaction$satisfaction))
ideol.satisfaction<-polr(satisfaction~voted_ideo*winner+
abstained+educ+efficacy+majoritarian_prez+
freedom+gdppercapPPP+gdpgrowth+CPI+prez,
method="logistic",data=satisfaction)
summary(ideol.satisfaction)
7. This and the next example do not exactly replicate the original results, which also include random
effects by country-year. Also, the next example illustrates ordered probit regression, instead of the
ordered logistic model from the original article. Both of the examples are based on models found
in the online supporting material at the European Journal of Political Research website.
8. If a user does need to install the package, install.packages("MASS") will do the job.
9. An equivalent specification would have been to include voted_ideo+winner+
winnerXvoted_ideo as three separate terms from the data.
Coefficients:
Value Std. Error t value
voted_ideo -0.02170 0.023596 -0.9198
winner 0.21813 0.020638 10.5694
abstained -0.25425 0.020868 -12.1838
educ 0.08238 0.020180 4.0824
efficacy 0.16246 0.006211 26.1569
majoritarian_prez 0.05705 0.018049 3.1609
freedom 0.04770 0.014087 3.3863
gdppercapPPP 0.01975 0.001385 14.2578
gdpgrowth 0.06653 0.003188 20.8673
CPI -0.23153 0.005810 -39.8537
prez -0.11503 0.026185 -4.3930
voted_ideo:winner 0.19004 0.037294 5.0957
Intercepts:
Value Std. Error t value
1|2 -2.0501 0.0584 -35.1284
2|3 -0.0588 0.0575 -1.0228
3|4 2.7315 0.0586 46.6423
The directional expectation here is for the interaction term to be positive. Hence, we can obtain our one-tailed p
value by typing:
1-pnorm(5.0957)
10. Unfortunately the xtable command does not produce ready-made LaTeX tables for results from
polr. By creating a matrix with the relevant results, though, LaTeX users can produce a table faster
than hand coding, though some revisions of the final product are necessary. Try the following:
coef<-c(ideol.satisfaction$coefficients,ideol.satisfaction$zeta)
se<-sqrt(diag(vcov(ideol.satisfaction)))
z<-coef/se
p<-2*(1-pnorm(abs(z)))
xtable(cbind(coef,se,z,p),digits=4)
exp(-ideol.satisfaction$coefficients)         # odds ratios, using the negative of each coefficient
100*(exp(-ideol.satisfaction$coefficients)-1) # corresponding percentage changes in the odds
Besides changing one of the interacted variables, the only difference between this
code and the prior ordered logit command is the specification of
method="probit". This changes the scale of our coefficients somewhat, but the
substantive implications of results are generally similar regardless of this choice.
By typing summary(affect.satisfaction), we obtain the output:
Call:
polr(formula = satisfaction ~ voted_affect * winner +
abstained +
    educ + efficacy + majoritarian_prez + freedom + gdppercapPPP +
    gdpgrowth + CPI + prez, method = "probit", data = satisfaction)
Coefficients:
Value Std. Error t value
voted_affect 0.03543 0.0158421 2.237
winner 0.04531 0.0245471 1.846
abstained -0.11307 0.0170080 -6.648
educ 0.05168 0.0115189 4.487
efficacy 0.09014 0.0035177 25.625
majoritarian_prez 0.03359 0.0101787 3.300
freedom 0.03648 0.0082013 4.448
gdppercapPPP 0.01071 0.0007906 13.546
gdpgrowth 0.04007 0.0018376 21.803
CPI -0.12897 0.0033005 -39.075
prez -0.03751 0.0147650 -2.540
voted_affect:winner 0.14278 0.0267728 5.333
Intercepts:
Value Std. Error t value
1|2 -1.1559 0.0342 -33.7515
2|3 -0.0326 0.0340 -0.9586
3|4 1.6041 0.0344 46.6565
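Although this excerpt does not show it, fitted category probabilities can also be recovered from a polr object; a minimal sketch using the model just estimated:
head(predict(affect.satisfaction,type="probs")) # predicted probability of each satisfaction category, first six respondents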
As a third type of GLM, we turn to models of event counts. Whenever our dependent
variable is the number of events that occurs within a defined period of time, the
variable will have the feature that it can never be negative and must take on a
discrete value (e.g., 0, 1, 2, 3, 4, ...). Count outcomes therefore tend to have a strong
right skew and a discrete probability distribution such as the Poisson or negative
binomial distribution.
As an example of count data, we now return to Peake and Eshbaugh-Soha's
(2008) data that were previously discussed in Chap. 3. Recall that the outcome
variable in this case is the number of television news stories related to energy policy
in a given month. (See Chap. 3 for additional details on the data.) The number of
news stories in a month certainly is an event count. However, note that because
these are monthly data, they are time dependent, which is a feature we ignore at this
time. In Chap. 9 we revisit this issue and consider models that account for time. For
now, though, this illustrates how we usually fit count models in R.
First, we load the data again:11
pres.energy<-read.csv("PESenergy.csv")
After viewing the descriptive statistics on our variables and visualizing the data as
we did in Chap. 3, we can now turn to fitting a model.
The simplest count model we can fit is a Poisson model. If we were to type:
energy.poisson<-glm(Energy~rmn1173+grf0175+grf575+jec477+
jec1177+jec479+embargo+hostages+oilc+Approval+Unemploy,
family=poisson(link=log),data=pres.energy)
then a summary of the estimated model (summary(energy.poisson)) would begin with the
following output:
Deviance Residuals:
Min 1Q Median 3Q Max
-8.383 -2.994 -1.054 1.536 11.399
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.250093 0.329121 40.259 < 2e-16 ***
rmn1173 0.694714 0.077009 9.021 < 2e-16 ***
11. For users who do not have the file handy from Chapter 3, please download the file from the
Dataverse linked on page vii or the chapter content linked on page 97.
12. Note that the presidential speech terms are coded 1 only in the month of the speech, and 0 in all
other months. The terms for the oil embargo and hostage crisis were coded 1 while these events
were ongoing and 0 otherwise.
exp(-.034096)
Here, we simply inserted the estimated coefficient from the printed output. The
result gives us a count ratio of 0.9664787. We could interpret this as meaning that for
a percentage point increase in the president's approval rating, coverage of energy
policy diminishes by 3.4 % on average, holding all other predictors equal. As
a quick way to get the count ratio and percentage change for every coefficient, we
could type:
exp(energy.poisson$coefficients[-1])
100*(exp(energy.poisson$coefficients[-1])-1)
In both lines the [-1] index for the coefficient vector throws away the intercept
term, for which we do not want a count ratio. The printout from the second line
reads:
rmn1173 grf0175 grf575 jec477 jec1177
100.313518 59.726654 -12.240295 202.987027 78.029428
jec479 embargo hostages oilc Approval
193.425875 155.434606 -9.017887 -19.224639 -3.352127
Unemploy
-8.625516
From this list, we can simply read off the percentage changes for a one-unit increase
in the input, holding the other inputs equal. For a graphical means of interpretation,
see Sect. 7.3.3.
An intriguing feature of the Poisson distribution is that the variance is the same as
the mean. Hence, when we model the logarithm of the mean, our model is simul-
taneously modeling the variance. Often, however, we find that the variance of our
count variable is wider than we would expect given the covariates, a phenomenon
called overdispersion. Negative binomial regression offers a solution to this problem
by estimating an extra dispersion parameter that allows the conditional variance to
differ from the conditional mean.
In R, negative binomial regression models actually require a special command
from the MASS library called glm.nb. If the MASS library is not loaded, be sure
to type library(MASS) first. Then, we can fit the negative binomial model by
typing:
energy.nb<-glm.nb(Energy~rmn1173+grf0175+grf575+jec477+
jec1177+jec479+embargo+hostages+oilc+Approval+Unemploy,
data=pres.energy)
Notice that the syntax is similar to the glm command, but there is no family
option since the command itself specifies that. By typing summary(energy.nb)
the following results print:
Call:
glm.nb(formula = Energy ~ rmn1173 + grf0175 + grf575 +
jec477 +
jec1177 + jec479 + embargo + hostages + oilc +
Approval +
Unemploy, data = pres.energy, init.theta =
2.149960724, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7702 -0.9635 -0.2624 0.3569 2.2034
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 15.299318 1.291013 11.851 < 2e-16 ***
rmn1173 0.722292 0.752005 0.960 0.33681
grf0175 0.288242 0.700429 0.412 0.68069
grf575 -0.227584 0.707969 -0.321 0.74786
jec477 0.965964 0.703611 1.373 0.16979
Theta: 2.150
Std. Err.: 0.242
2 x log-likelihood: -1500.427
The coefficients reported in this output can be interpreted in the same way that
coefficients from a Poisson model are interpreted because both model the logarithm
of the mean. The key addition, reported at the end of the printout, is the dispersion
parameter . In this case, our estimate is O D 2:15, and with a standard error
of 0.242 the result is discernible. This indicates that overdispersion is present in
this model. In fact, many of the inferences drawn vary between the Poisson and
negative binomial models. The two models are presented side by side in Table 7.3.
As the results show, many of the discernible results from the Poisson model are not
discernible in the negative binomial model. Further, the AIC is substantially lower
for the negative binomial model, indicating a better fit even when penalizing for the
extra overdispersion parameter.
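A quick way to place those two penalized-fit figures side by side is the AIC function, which accepts several fitted models at once:
AIC(energy.poisson,energy.nb) # lower AIC indicates the better penalized fit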
While count ratios are certainly a simple way to interpret coefficients from count
models, we have the option of graphing our results as well. In this case, we model
the logarithm of our mean parameter, so we must exponentiate our linear prediction
to predict the expected count given our covariates. As with logit and probit models,
for count outcomes the predict command makes forecasting easy.
Suppose we wanted to plot the effect of presidential approval on the number of
TV news stories on energy, based on the two models of Table 7.3. This situation
contrasts a bit from the graphs we created in Sect. 7.1.3. In all of the logit and probit
examples, we only had one predictor. By contrast, in this case we have several
other predictors, so we have to set those to plausible alternative values. For this
example, we will set the value of all dummy variable predictors to their modal value
of zero, while the price of oil and unemployment are set to their mean. If we fail
to insert reasonable values for the covariates, the predicted counts will not resemble
the actual mean and the size of the effect will not be reasonable.13 In this example,
the way in which we use the predict command to forecast average counts with
multiple predictors can be used in exactly the same way for a logit or probit model
to forecast predicted probabilities with multiple predictors.
Table 7.3 Two count models of monthly TV news stories on energy policy, 1969–1983

                          Poisson                           Negative binomial
Parameter                Estimate   Std. error  Pr(>|z|)    Estimate   Std. error  Pr(>|z|)
Intercept                 13.2501   0.3291      0.0000      15.2993    1.2910      0.0000
Nixon 11/73                0.6947   0.0770      0.0000       0.7223    0.7520      0.3368
Ford 1/75                  0.4683   0.0962      0.0000       0.2882    0.7004      0.6807
Ford 5/75                 -0.1306   0.1622      0.4208      -0.2276    0.7080      0.7479
Carter 4/77                1.1085   0.1222      0.0000       0.9660    0.7036      0.1698
Carter 11/77               0.5768   0.1555      0.0002       0.5732    0.7025      0.4145
Carter 4/79                1.0765   0.0951      0.0000       1.1415    0.6949      0.1005
Arab oil embargo           0.9378   0.0511      0.0000       1.1409    0.3501      0.0011
Iran hostage crisis       -0.0945   0.0462      0.0406       0.0894    0.1975      0.6507
Price of oil              -0.2135   0.0081      0.0000      -0.2766    0.0301      0.0000
Presidential approval     -0.0341   0.0014      0.0000      -0.0321    0.0058      0.0000
Unemployment              -0.0902   0.0097      0.0000      -0.0770    0.0376      0.0407
θ                                                            2.1500    0.2419      0.0000
AIC                     3488.2830                          1526.4272

Notes: N = 180. Data from Peake and Eshbaugh-Soha (2008)
13. Besides this approach of making predictions using central values of control variables, Hanmer
and Kalkan (2013) make the case that forecasting outcomes based on the observed values of control
variables in the data set is preferable. Readers are encouraged to consult their article for further
advice on this issue.
Turning to specifics, in our data the variable Approval ranges from 24 %
approval to 72.3 %. Thus, we construct a vector that includes the full range of
approval as well as plausible values of all of the other predictors:
approval<-seq(24,72.3,by=.1)
inputs.4<-cbind(1,0,0,0,0,0,0,0,0,mean(pres.energy$oilc),
approval,mean(pres.energy$Unemploy))
colnames(inputs.4)<-c("constant","rmn1173","grf0175",
"grf575","jec477","jec1177","jec479","embargo","hostages",
"oilc","Approval","Unemploy")
inputs.4<-as.data.frame(inputs.4)
The first line above creates the vector of hypothetical values of our predictor of
interest. The second line creates a matrix of hypothetical data values, setting the
indicator variables to zero, the continuous variables to their means, and approval
to its range of hypothetical values. The third line names the columns of the matrix
after the variables in our model. On the last line, the matrix of predictor values is
converted to a data frame.
Once we have the data frame of predictors in place, we can use the predict
command to forecast the expected counts for the Poisson and negative binomial
models:
forecast.poisson<-predict(energy.poisson,newdata=inputs.4,
type="response")
forecast.nb<-predict(energy.nb,newdata=inputs.4,type="response")
These two lines only differ in the model from which they draw coefficient estimates
for the forecast.14 In both cases, we specify type="response" to obtain
predictions on the count scale.
To graph our forecasts from each model, we can type:
plot(y=forecast.poisson,x=approval,type="l",lwd=2,
ylim=c(0,60),xlab="Presidential Approval",
ylab="Predicted Count of Energy Policy Stories")
lines(y=forecast.nb,x=approval,lty=2,col="blue",lwd=2)
legend(x=50,y=50,legend=c("Poisson","Negative Binomial"),
lty=c(1,2),col=c("black","blue"),lwd=2)
The first line plots the Poisson predictions as a line with the type="l" option.
The second line adds the negative binomial predictions, coloring the line blue
and dashing it with lty=2. Finally, the legend command allows us to quickly
distinguish which line represents which model. The full output is presented in
Fig. 7.3. The predictions from the two models are similar and show a similar
negative effect of approval. The negative binomial model has a slightly lower
forecast at low values of approval and a slightly shallower effect of approval, such
that the predicted counts overlap at high values of approval.
14. As a side note, by using R's matrix algebra commands, described further
in Chap. 10, the user can compute predicted counts easily with alternate
syntax. For instance, for the negative binomial model, we could have typed:
forecast.nb<-exp(as.matrix(inputs.4)%*%energy.nb$coefficients).
Fig. 7.3 Predicted count of energy policy stories on TV news as a function of presidential
approval, holding continuous predictors at their mean and nominal predictors at their mode.
Predictions based on Poisson and negative binomial model results
After the first seven chapters of this volume, users should now be able to perform
most of the basic tasks that statistical software is designed to do: manage data,
compute simple statistics, and estimate common models. In the remaining four
chapters of this book, we now turn to the unique features of R that allow the
user greater flexibility to apply advanced methods with packages developed by
other users and tools for programming in R.
variable. Suppose you were worried about reciprocal causation bias in the
model of retrospective economic evaluations. Which independent variable or
variables would be most suspect to this criticism?
3. Count model: In the practice problems from Chapters 3 and 4, we introduced
Peake and Eshbaugh-Soha's (2008) analysis of drug policy coverage. If you
do not have their data from before, download drugCoverage.csv from
the Dataverse linked on page vii or the chapter content linked on page 97.
The outcome variable is news coverage of drugs (drugsmedia), and the four
inputs are an indicator for a speech on drugs that Ronald Reagan gave in
September 1986 (rwr86), an indicator for a speech George H.W. Bush gave in
September 1989 (ghwb89), the president's approval rating (approval), and the
unemployment rate (unemploy).15
a. Estimate a Poisson regression model of drug policy coverage as a function of
the four predictors.
b. Estimate a negative binomial regression model of drug policy coverage as a
function of the four predictors. Based on your models results, which model
is more appropriate, Poisson or negative binomial? Why?
c. Compute the count ratio for the presidential approval predictor for each
model. How would you interpret each quantity?
d. Plot the predicted counts from each model contingent on the unemployment
level, ranging from the minimum to maximum observed values. Hold the
two presidential speech variables at zero, and hold presidential approval
at its mean. Based on this figure, what can you say about the effect of
unemployment in each model?
15. Just as in the example from the chapter, these are time series data, so methods from Chap. 9 are
more appropriate.
Chapter 8
Using Packages to Apply Advanced Models
In the first seven chapters of this book, we have treated R like a traditional statistical
software program and reviewed how it can perform data management, report simple
statistics, and estimate a variety of regression models. In the remainder of this book,
though, we turn to the added flexibility that R offers, both in terms of programming
capacity that is available for the user as well as providing additional applied tools
through packages. In this chapter, we focus on how loading additional batches of
code from user-written packages can add functionality that many software programs
will not allow. Although we have used packages for a variety of purposes in the
previous seven chapters (including car, gmodels, and lattice, to name a
few), here we will highlight packages that enable unique methods. While the CRAN
website lists numerous packages that users may install at any given time, we will
focus on four particular packages to illustrate the kinds of functionality that can be
added.
The first package we will discuss, lme4, allows users to estimate multilevel
models, thereby offering an extension to the regression models discussed in Chaps. 6
and 7 (Bates et al. 2014). The other three were all developed specifically by Political
Scientists to address data analytic problems that they encountered in their research:
MCMCpack allows users to estimate a variety of models in a Bayesian framework
using Markov chain Monte Carlo (MCMC) (Martin et al. 2011). cem allows the user
to conduct coarsened exact matching, a method for causal inference with field data
(Iacus et al. 2009, 2011). Lastly, wnominate allows the user to scale choice data,
such as legislative roll call data, to estimate ideological ideal points of legislators
or respondents (Poole and Rosenthal 1997; Poole et al. 2011). The following four
sections will consider each package separately, so each section will introduce its
data example in turn. These sections are designed to offer a brief overview of the
kinds of capacities R packages offer, though some readers may be unfamiliar with
the background behind some of these methods. The interested reader is encouraged
to consult with some of the cited resources to learn more about the theory behind
these methods.
In this example, we return to our example from Chap. 6 on the number of hours
teachers spend in the classroom teaching evolution. Originally, we fitted a linear
model using ordinary least squares (OLS) as our estimator. However, Berkman and
Plutzer (2010) make the point that teachers in the same state are likely to share
similar features. These features could be similarities in training, in the local culture,
or in state law. To account for these unobserved similarities, we can think of teachers
as being nested within states. For this reason, we will add a random effect for each
state. Random effects account for intraclass correlation, or error correlation among
1
Note that this multilevel approach to panel data is most sensible for short panels such as these
where there are many individuals relative to the number of time points. For long panels in which
there are many time points relative to the number of individuals, more appropriate models are
described as pooled time series cross-section methods. For more on the study of short panels, see
Monogan (2011) and Fitzmaurice et al. (2004).
observations within the same group. In the presence of intraclass correlation, OLS
estimates are inefficient because the disturbance terms are not independent, so a
random effects model accounts for this problem.
First, we reload the data from the National Survey of High School Biology
Teachers as follows:2
rm(list=ls())
library(foreign)
evolution<-read.dta("BPchap7.dta")
evolution$female[evolution$female==9]<-NA
evolution<-subset(evolution,!is.na(female))
library(lme4)
Once we have loaded the lme4 library, we fit our multilevel linear model using the lmer
(linear mixed effects regression) command:
hours.ml<-lmer(hrs_allev~phase1+senior_c+ph_senior+notest_p+
ph_notest_p+female+biocred3+degr3+evol_course+certified+
idsci_trans+confident+(1|st_fip),data=evolution)
The syntax to the lmer command is nearly identical to the code we used when
fitting a model with OLS using lm. In fact, the only attribute we added is the
additional term (1|st_fip) on the right-hand side of the model. This adds a
random intercept by state. On any occasion for which we wish to include a random
effect, in parentheses we place the term for which to include the effect followed by a
vertical pipe and the variable that identifies the upper-level units. So in this case we
wanted a random intercept (hence the use of 1), and we wanted these to be assigned
by state (hence the use of st_fip).
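As an illustration of this syntax only (this model is not fitted in the text), if we suspected that the effect of teachers' self-rated expertise differed from state to state, we could add a random slope on confident alongside the random intercept; the object name hours.ml.slope is arbitrary:
# Hypothetical extension: random intercept plus a random slope for confident by state
hours.ml.slope<-lmer(hrs_allev~phase1+senior_c+ph_senior+notest_p+
    ph_notest_p+female+biocred3+degr3+evol_course+certified+
    idsci_trans+confident+(1+confident|st_fip),data=evolution)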
We obtain our results by typing:
summary(hours.ml)
2
If you do not have these data from before, you can download the file BPchap7.dta from the
Dataverse on page vii or the chapter content on page 125.
3
See also the nlme library, which was a predecessor to lme4.
In our result output, R prints the correlation between all of the fixed effects, or
estimated regression parameters. This part of the printout is omitted below:
Linear mixed model fit by REML ['lmerMod']
Formula: hrs_allev ~ phase1 + senior_c + ph_senior + notest_p +
    ph_notest_p + female + biocred3 + degr3 + evol_course +
    certified + idsci_trans + confident + (1 | st_fip)
Data: evolution
Scaled residuals:
Min 1Q Median 3Q Max
-2.3478 -0.7142 -0.1754 0.5566 3.8846
Random effects:
Groups Name Variance Std.Dev.
st_fip (Intercept) 3.089 1.758
Residual 67.873 8.239
Number of obs: 841, groups: st_fip, 49
Fixed effects:
Estimate Std. Error t value
(Intercept) 10.5676 1.2138 8.706
phase1 0.7577 0.4431 1.710
senior_c -0.5291 0.3098 -1.708
ph_senior -0.5273 0.2699 -1.953
notest_p 0.1134 0.7490 0.151
ph_notest_p -0.5274 0.6598 -0.799
female -0.9702 0.6032 -1.608
biocred3 0.5157 0.5044 1.022
degr3 -0.4434 0.3887 -1.141
evol_course 2.3894 0.6270 3.811
certified -0.5335 0.7188 -0.742
idsci_trans 1.7277 1.1161 1.548
confident 2.6739 0.4468 5.984
The output first prints a variety of fit statistics: AIC, BIC, log-likelihood, deviance,
and restricted maximum likelihood deviance. Second, it prints the variance and
standard deviation of the random effects. In this case, the variance for the st_fip
term is the variance of our state-level random effects. The residual variance
corresponds to the error variance of regression that we would normally compute for
our residuals. Last, the fixed effects that are reported are synonymous with linear
regression coefficients that we normally are interested in, albeit now our estimates
have accounted for the intraclass correlation among teachers within the same state.
Table 8.1 compares our OLS and multilevel estimates side-by-side. As can be seen,
the multilevel model now divides the unexplained variance into two components
(state and individual-level), and coefficient estimates have changed somewhat.
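Should we want to inspect the estimated components themselves, lme4 supplies extractor functions; for instance (output not shown):
fixef(hours.ml)     # fixed-effect (regression) coefficient estimates
ranef(hours.ml)     # estimated random intercept for each state
VarCorr(hours.ml)   # variance components of the random effects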
Table 8.1 Two models of hours of class time spent teaching evolution by high school biology
teachers
                                          OLS                              Multilevel
Parameter                         Estimate  Std. error  Pr(>|z|)   Estimate  Std. error  Pr(>|z|)
Intercept                          10.2313    1.1905     0.0000     10.5675    1.2138     0.0000
Standards index 2007                0.6285    0.3331     0.0596      0.7576    0.4431     0.0873
Seniority (centered)               -0.5813    0.3130     0.0636     -0.5291    0.3098     0.0876
Standards × seniority              -0.5112    0.2717     0.0603     -0.5273    0.2699     0.0508
Believes there is no test           0.4852    0.7222     0.5019      0.1135    0.7490     0.8795
Standards × believes no test       -0.5362    0.6233     0.3899     -0.5273    0.6598     0.4241
Teacher is female                  -1.3546    0.6016     0.0246     -0.9703    0.6032     0.1077
Credits earned in biology (0–2)     0.5559    0.5072     0.2734      0.5157    0.5044     0.3067
Science degrees (0–2)              -0.4003    0.3922     0.3077     -0.4434    0.3887     0.2540
Completed evolution class           2.5108    0.6300     0.0001      2.3894    0.6270     0.0001
Has normal certification           -0.4446    0.7212     0.5377     -0.5335    0.7188     0.4580
Identifies as scientist             1.8549    1.1255     0.0997      1.7277    1.1161     0.1216
Self-rated expertise (−1 to +2)     2.6262    0.4501     0.0000      2.6738    0.4468     0.0000
State-level variance                                                 3.0892
Individual-level variance          69.5046                          67.8732
Notes: N = 841. Data from Berkman and Plutzer (2010)
While somewhat more complex, the logic of multilevel modeling can also be applied
when studying limited dependent variables. There are two broad approaches to
extending GLMs into a multilevel framework: marginal models, which have a
population-averaged interpretation, and generalized linear mixed models (GLMMs),
which have an individual-level interpretation (Laird and Fitzmaurice 2013, pp. 149–156).
While readers are encouraged to read further on the kinds of models available,
their estimation, and their interpretation, for now we focus on the process of
estimating a GLMM.
In this example, we return to our example from Sect. 7.1.1 from the last chapter,
on whether a survey respondent reported voting for the incumbent party as a
function of the ideological distance from the party. As Singh (2014a) observes,
voters making their choice in the same country-year are going to face many features
of the choice that are unique to that election. Hence, intraclass correlation among
voters within the same election is likely. Further, the effect of ideology itself may be
stronger in some elections than it is in others: Multilevel methods including GLMMs
allow us to evaluate whether there is variation in the effect of a predictor across
groups, which is a feature that we will use.
Turning to the specifics of code, if the lme4 library is not loaded, we need that
again. Also, if the data from before are not loaded, then we need to load the foreign
library and the data set itself. All of this is accomplished as follows:4
library(lme4)
library(foreign)
voting<-read.dta("SinghJTP.dta")
Building on the model from Table 7.1, we first simply add a random intercept to
our model. The syntax for estimating the model and printing the results is:
inc.linear.ml<-glmer(votedinc~distanceinc+(1|cntryyear),
family=binomial(link="logit"),data=voting)
summary(inc.linear.ml)
Notice that we now use the glmer command (generalized linear mixed effects
regression). By using the family option, we can use any of the common link
functions available for the glm command. A glance at the output shows that, in
addition to the traditional fixed effects that reflect logistic regression coefficients,
we also are presented with the variance of the random intercept for country and year
of the election:
Generalized linear mixed model fit by the Laplace
approximation
Formula: votedinc ~ distanceinc + (1 | cntryyear)
Data: voting
AIC BIC logLik deviance
41998.96 42024.62 -20996.48 41992.96
Random effects:
Groups Name Variance Std.Dev.
cntryyear (Intercept) 0.20663 0.45457
Number of obs: 38211, groups: cntryyear, 30
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.161788717 0.085578393 1.89053 0.058687 .
distanceinc -0.501250136 0.008875997 -56.47254 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To replicate a model more in line with Singh's (2014a) results, we now fit a model
that includes a random intercept and a random coefficient on ideological distance,
both contingent on the country and year of the election. The syntax for estimating
this model and printing the output is:
4
If you do not have these data from before, you can download the file SinghJTP.dta from the
Dataverse on page vii or the chapter content on page 125.
inc.linear.ml.2<-glmer(votedinc~distanceinc+
(distanceinc|cntryyear),family=binomial(link="logit"),
data=voting)
summary(inc.linear.ml.2)
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26223 0.14531 1.805 0.0711 .
distanceinc -0.53061 0.05816 -9.124 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To estimate our Bayesian linear regression model, we must reload the data from the
National Survey of High School Biology Teachers, if they are not already loaded:
rm(list=ls())
library(foreign)
evolution<-read.dta("BPchap7.dta")
evolution$female[evolution$female==9]<-NA
evolution<-subset(evolution,!is.na(female))
With the data loaded, we must install MCMCpack if this is the first use of the
package on the computer. Once the program is installed, we then must load the
library:
install.packages("MCMCpack")
library(MCMCpack)
Now we can use MCMC to fit our Bayesian linear regression model with the
MCMCregress command:
mcmc.hours<-MCMCregress(hrs_allev~phase1+senior_c+ph_senior+
notest_p+ph_notest_p+female+biocred3+degr3+
evol_course+certified+idsci_trans+confident,data=evolution)
Be prepared that estimation with MCMC usually takes a longer time computation-
ally, though simple models such as this one usually finish fairly quickly. Also,
because MCMC is a simulation-based technique, it is normal for summaries of
the results to differ slightly across replications. To this end, if you find differences
between your results and those printed here after using the same code, you need not
worry unless the results are markedly different.
While the above code relies on the defaults of the MCMCregress command,
a few of this commands options are essential to highlight: A central feature of
Bayesian methods is that the user must specify priors for all of the parameters that
are estimated. The defaults for MCMCregress are vague conjugate priors for the
coefficients and the variance of the disturbances. However, the user has the option
of specifying his or her own priors on these quantities.5 Users are encouraged to
review these options and other resources about how to set priors (Carlin and Louis
2009; Gelman et al. 2004; Gill 2008; Robert 2001). Users also have the option to
change the number of iterations in the MCMC sample with the mcmc option and
the burn-in period (i.e., number of starting iterations that are discarded) with the
burnin option. Users should always assess model convergence after estimating a
model with MCMC (to be discussed shortly) and consider if either the burn-in or
number of iterations should be changed if there is evidence of nonconvergence.
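To make these choices concrete, a call that spells out the chain settings and priors might look like the sketch below; the particular values are illustrative rather than recommendations, and the exact defaults should be checked in the MCMCpack documentation:
mcmc.hours.2<-MCMCregress(hrs_allev~phase1+senior_c+ph_senior+
    notest_p+ph_notest_p+female+biocred3+degr3+
    evol_course+certified+idsci_trans+confident,data=evolution,
    burnin=1000,mcmc=10000,   # discard 1,000 iterations, then keep 10,000
    b0=0,B0=0.001,            # diffuse Gaussian prior on the coefficients (B0 is a precision)
    c0=2,d0=2)                # shape and scale of the inverse Gamma prior on the error variance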
After estimating the model, typing summary(mcmc.hours) will offer a quick
summary of the posterior sample:
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
5
For priors on the coefficients, the option b0 sets the vector of means of a multivariate Gaussian
prior, and B0 sets the precision (the inverse of the variance-covariance) matrix of that Gaussian prior. The prior
distribution of the error variance of regression is inverse Gamma, and this distribution can be
manipulated by setting its shape parameter with option c0 and scale parameter with option d0.
Alternatively, the inverse Gamma distribution can be manipulated by changing its mean with the
option sigma.mu and its variance with the option sigma.var.
Table 8.2 Linear model of hours of class time spent teaching evolution by high
school biology teachers (MCMC Estimates)
Predictor                         Mean      Std. Dev.  [95 % Cred. Int.]
Intercept                         10.2353    1.1922    [  7.9236: 12.5921 ]
Standards index 2007               0.6346    0.3382    [ -0.0279:  1.3008 ]
Seniority (centered)              -0.5894    0.3203    [ -1.2253:  0.0425 ]
Standards × seniority             -0.5121    0.2713    [ -1.0439:  0.0315 ]
Believes there is no test          0.4828    0.7214    [ -0.9272:  1.8887 ]
Standards × believes no test      -0.5483    0.6182    [ -1.7505:  0.6323 ]
Teacher is female                 -1.3613    0.5997    [ -2.5231: -0.1804 ]
Credits earned in biology (0–2)    0.5612    0.5100    [ -0.4282:  1.5589 ]
Science degrees (0–2)             -0.4071    0.3973    [ -1.1956:  0.3828 ]
Completed evolution class          2.5014    0.6299    [  1.2617:  3.7350 ]
Has normal certification          -0.4525    0.7194    [ -1.8483:  0.9506 ]
Identifies as scientist            1.8658    1.1230    [ -0.3320:  4.0902 ]
Self-rated expertise (−1 to +2)    2.6302    0.4523    [  1.7357:  3.4894 ]
Error variance of regression      70.6874    3.5029    [ 64.1275: 77.8410 ]
Notes: N = 841. Data from Berkman and Plutzer (2010)
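The output discussed next comes from the Geweke convergence diagnostic; the call itself is not reproduced above, but it presumably takes this form (geweke.diag is supplied by the coda package, which loads along with MCMCpack):
geweke.diag(mcmc.hours, frac1=0.1, frac2=0.5)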
Here we have specified that we want to compare the mean of the first 10 % of the
chain (frac1=0.1) to the mean of the last 50 % of the chain (frac2=0.5).
The resulting output presents a z-ratio for this difference of means test for each
parameter:
Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5
(Intercept) phase1 senior_c ph_senior notest_p
-1.34891 -1.29015 -1.10934 -0.16417 0.95397
ph_notest_p female biocred3 degr3 evol_course
1.13720 -0.57006 0.52718 1.25779 0.62082
certified idsci_trans confident sigma2
1.51121 -0.87436 -0.54549 -0.06741
In this case, none of the test statistics surpass any common significance thresholds
for a normally distributed test statistic, so we find no evidence of nonconvergence.
Based on this, we may be content with our original MCMC sample of 10,000.
One more thing we may wish to do with our MCMC output is plot the overall
estimated density function of our marginal posterior distributions. We can plot
these one at a time using the densplot function, though the analyst will need
to reference the parameter of interest based on its numeric order of appearance in
the summary table. For example, if we wanted to plot the coefficient for the indicator
of whether a teacher completed an evolution class (evol_course), that is the tenth
6
To write out a similar table to Table 8.2 in LaTeX, load the xtable library in R and type the
following into the console:
xtable(cbind(summary(mcmc.hours)$statistics[,1:2],
summary(mcmc.hours)$quantiles[,c(1,5)]),digits=4)
7
This frequently occurs when one package depends on code from another.
Fig. 8.1 Density plots of marginal posterior distribution of coefficients for whether the teacher
completed an evolution class and the teacher's self-rated expertise. Based on an MCMC sample of
10,000 iterations (1000 iteration burn-in). (a) Completed evolution class; (b) Self-rated expertise
parameter estimate reported in the table. Similarly, if we wanted to report the density
plot for the coefficient of the teacher's self-rated expertise (confident), that is the
thirteenth parameter reported in the summary table. Hence we could plot each of
these by typing:
densplot(mcmc.hours[,10])
densplot(mcmc.hours[,13])
The resulting density plots are presented in Fig. 8.1. As the figures show, both of
these marginal posterior distributions have an approximate normal distribution, and
the mode is located near the mean and median reported in our summary output.
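The convergence check that follows is applied to a Bayesian logistic regression of incumbent-party voting from the Singh data. That estimation step is not reprinted here; with default settings it would look something like this sketch (the object name inc.linear.mcmc matches the one used below):
library(foreign)                            # for read.dta, if not already loaded
voting<-read.dta("SinghJTP.dta")            # reload the survey data if needed
inc.linear.mcmc<-MCMClogit(votedinc~distanceinc,data=voting)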
Just as with the MCMCregress command, we have chosen to use the defaults in
this case, but users are encouraged to consider setting their own priors to suit their
needs. As a matter of fact, this is one case where we will need to raise the number of
iterations in our model. We can check convergence of our model using the Geweke
diagnostic:
geweke.diag(inc.linear.mcmc, frac1=0.1, frac2=0.5)
Our output in this case actually shows a significant difference between the means at
the beginning and end of the chain for each parameter:
Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5
(Intercept) distanceinc
2.680 -1.717
The absolute value of both z-ratios exceeds 1.645, so we can say the mean is
significantly different for each parameter at the 90 % confidence level, which is
evidence of nonconvergence.
As a response, we can double both our burn-in period and number of iterations
to 2,000 and 20,000, respectively. The code is:
inc.linear.mcmc.v2<-MCMClogit(votedinc~distanceinc,
data=voting,burnin=2000,mcmc=20000)
We can now check for the convergence of this new sample by typing:
geweke.diag(inc.linear.mcmc.v2, frac1=0.1, frac2=0.5)
Our output now shows nonsignificant z-ratios for each parameter, indicating that
there is no longer evidence of nonconvergence:
Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5
(Intercept) distanceinc
-1.0975 0.2128
Proceeding with this sample of 20,000, then, if we type
summary(inc.linear.mcmc.v2) into the console, the output is:
Iterations = 2001:22000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 20000
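The matching example that follows relies on the cem package and LaLonde's National Supported Work data. Assuming the data are taken from the copy distributed with the package, the setup would be roughly:
install.packages("cem")   # first use only
library(cem)
data(LL)                  # LaLonde's job-training data, bundled with cem
# alternatively: LL<-read.csv("LL.csv"), using the copy from the Dataverse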
An important task when using matching methods is assessing the degree to which
the data are balanced, or the degree to which treated cases have a similar distribution
of covariate values relative to the control group. We can assess the degree to which
our treatment and control groups have differing distributions with the imbalance
command. In the code below, we first increase the penalty for scientific notation
(an option if you prefer decimal notation). Then, we create a vector naming the
variables we do not want to assess balance forthe treatment variable (treated)
and the outcome of interest (re78). All other variables in the dataset are covariates
that we believe can shape income in 1978, so we would like to have balance on
them. In the last line, we actually call the imbalance command.
options(scipen=8)
todrop <- c("treated","re78")
imbalance(group=LL$treated, data=LL, drop=todrop)
Within the imbalance command, the group argument is our treatment variable
that defines the two groups for which we want to compare the covariate distributions.
The data argument names the dataset we are using, and the drop option allows us to
omit certain variables from the dataset when assessing covariate balance. Our output
from this command is as follows:
8
LaLonde's data is also available in the file LL.csv, available in the Dataverse (see page vii) or
the chapter content (see page 125).
indices or irrelevant variables. We could use a vector to list all of the variables we
want to be ignored, as we did with the imbalance command before, but in this
case, only the outcome re78 needs to be skipped. We type:
cem.match.1 <- cem(treatment="treated", data=LL, drop="re78")
cem.match.1
$education
[1] 3.0 4.3 5.6 6.9 8.2 9.5 10.8 12.1 13.4 14.7 16.0
$black
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
$married
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
$nodegree
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
$re74
[1] 0.000 3957.068 7914.136 11871.204 15828.272 19785.340
[7] 23742.408 27699.476 31656.544 35613.612 39570.680
$re75
[1] 0.000 3743.166 7486.332 11229.498 14972.664 18715.830
[7] 22458.996 26202.162 29945.328 33688.494 37431.660
$hispanic
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
$u74
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
$u75
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
To illustrate what this means, consider age. The lowest category of coarsened
age lumps together everyone aged 17–20.8, the second category lumps everyone
aged 20.8–24.6, and so forth. Variables like black, married, nodegree,
hispanic, u74, and u75 are actually binary, so most of the categories being
created are unnecessary, though empty bins will not hurt our analysis. Of course,
users are not required to use software defaults, and Iacus et al. urge researchers to
use substantive knowledge of each variable's measurement to set the ranges of the
coarsening bins (2012, p. 9). Section 8.3.2 offers details on doing this.
Now we can assess imbalance in the new matched sample by typing:
imbalance(LL$treated[cem.match.1$matched],
LL[cem.match.1$matched,], drop=todrop)
Compare this to the original data. We now have L1 = 0.592, which is less than
our score of 0.735 for the raw data, indicating that multivariate balance is better in
the matched sample. Turning to the individual covariates, you can see something
of a mixed bag, but overall the balance looks better. For instance, with age the
difference in means is actually a bit larger in absolute value with the matched sample
(0.42) than the raw data (0.18). However, the L1 statistic for age is now minuscule in the matched
sample, and less than the 0.0047 value for the raw data. This is likely on account of
the fact that the raw data has a larger discrepancy at the high end than the matched
sample has. The user now must decide whether the treated and control cases are
sufficiently balanced or whether to try other coarsenings to improve balance.
If the user is satisfied with the level of balance, he or she can proceed to estimate
the Average Treatment effect on the Treated (ATT) using the command att. This
quantity represents the causal effect on the kind of individual who received the
treatment. In this command we specify what our matched sample is using the obj
argument, the outcome variable (re78) and treatment (treated) using the formula
argument, and our data using the data argument. This gives us:
est.att.1 <- att(obj=cem.match.1, formula=re78~treated, data=LL)
est.att.1
G0 G1
All 425 297
Matched 222 163
Unmatched 203 134
As a final point, if a researcher is not happy with the level of balance or the sample
size in the matched sample, then a tool for finding better balance is the cemspace
command. This command randomly produces several different coarsenings for
the control variables (250 different coarsenings by default). The command then
plots the level of balance against the number of treated observations included in
the matched sample. The following code calls this command:
cem.explore<-cemspace(treatment="treated",data=LL,drop="re78")
The syntax of cemspace is similar to cem, though two more options are important:
minimal and maximal. These establish what the minimum and maximum
allowed number of coarsened intervals is for the variables. The command above
uses the defaults of 1 and 5, which means that no more than five intervals may
be included for a variable. Hence, all matched samples from this command will
be coarser than what we used in Sect. 8.3.1, and therefore less balanced. The user
could, however, increase maximal to 12 or even a higher number to create finer
intervals and potentially improve the balance over our prior result.
Our output from cemspace is shown in Fig. 8.2. On account of the random
element in choosing coarsenings, your results will not exactly match this figure.
Figure 8.2a shows the interactive figure that opens up. The horizontal axis of this
figure shows the number of matched treatment units in descending order, while
the vertical axis shows the level of imbalance. In general, a matched sample at
the bottom-left corner of the graph would be ideal as that would indicate the best
balance (reducing bias) and the largest sample (increasing efficiency). Normally,
though, we have to make a choice on this tradeoff, generally putting a bit more
weight on minimizing imbalance. By clicking on different points on the graph, the
second window that cemspace creates, shown in Fig. 8.2b will show the intervals
used in that particular coarsening. The user can copy the vectors of the interval
cutpoints and paste them into his or her own code. Note: R will not proceed with
new commands until these two windows are closed.
In Fig. 8.2a, one of the potential coarsenings has been chosen, and it is highlighted
in yellow. If we want to implement this coarsening, we can copy the vectors
shown in the second window illustrated in Fig. 8.2b. Pasting these into our own R
script produces the following code:
Fig. 8.2 Plot of balance statistics for 250 matched samples from random coarsenings against
number of treated observations included in the respective matched sample. (a) Plot window;
(b) X11 window
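Because the coarsenings are generated randomly, the vectors shown in your own X11 window will differ from any printed example. The pasted code takes roughly the following form; every numeric value below is a placeholder for illustration, and variables left off the list simply keep their default coarsening:
age.cut<-c(17,25,33,41,49,55)               # hypothetical interval cutpoints
education.cut<-c(3,7,10,13,16)
re74.cut<-c(0,10000,20000,30000,39570.68)
re75.cut<-c(0,9500,19000,28500,37431.66)
new.cuts<-list(age=age.cut,education=education.cut,
    re74=re74.cut,re75=re75.cut)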
We end this code by creating a list of all of these vectors. While our vectors here
have been created using a coarsening created by cemspace, this is the procedure
a programmer would use to create his or her own cutpoints for the intervals.
By substituting the vectors above with user-created cutpoint vectors, a researcher
can use his or her own knowledge of the variables' measurement to coarsen.
Once we have defined our own cutpoints, either by using cemspace or
substantive knowledge, we can now apply CEM with the following code:
cem.match.2 <- cem(treatment="treated", data=LL, drop="re78",
cutpoints=new.cuts)
Our key addition here is the use of the cutpoints option, where we input our list
of intervals. Just as in Sect. 8.3.1, we can now assess the qualities of the matched
sample, imbalance levels, and compute the ATT if we wish:
cem.match.2
imbalance(LL$treated[cem.match.2$matched],
LL[cem.match.2$matched,], drop=todrop)
est.att.2 <- att(obj=cem.match.2, formula=re78~treated, data=LL)
est.att.2
In this case, in part because of the coarser bins we are using, the balance is worse
than what we found in the previous section. Hence, we would be better off in this
case sticking with our first result. The reader is encouraged to try to find a coarsening
that produces better balance than the software defaults.
for both the US House of Representatives and Senate for every term of Congress.
Countless authors have used these data, typically interpreting the first dimension
score as a scale of liberal-conservative ideology.
In brief, the basic logic of the model draws from the spatial proximity model
of politics, which essentially states that both individuals' ideological preferences
and available policy alternatives can be represented in geometric space of one or
more dimensions. An individual generally will vote for the policy choice that is
closest in space to his or her own ideological ideal point (Black 1958; Downs
1957; Hotelling 1929). The NOMINATE model is based on these assumptions, and
places legislators and policy options in ideological space based on how legislators'
votes divide over the course of many roll call votes and when legislators behave
unpredictably (producing errors in the model). For example, in the US Congress,
liberal members typically vote differently from conservative members, and the
extreme ideologues are the most likely to be in a small minority whenever there
is wide consensus on an issue. Before applying the NOMINATE method to your
own data, and even before downloading pre-measured DW-NOMINATE data to
include in a model you estimate, be sure to read more about the method and its
assumptions because thoroughly understanding how a method works is essential
before using it. In particular, Chap. 2 and Appendix A from Poole and Rosenthal
(1997) and Appendix A from McCarty et al. (1997) offer detailed, yet intuitive,
descriptions of how the method works.
In R, the wnominate package implements W-NOMINATE, which is a version
of the NOMINATE algorithm that is intended only to be applied to a single
legislature. W-NOMINATE scores are internally valid, so it is fair to compare
legislators' scores within a single dataset. However, the scores cannot be externally
compared to scores when W-NOMINATE is applied to a different term of the
legislature or a different body of actors altogether. Hence, it is a good method for
trying to make cross-sectional comparisons among legislators of the same body.
While the most common application for W-NOMINATE has been the
US Congress, the method could be applied to any legislative body. To that end,
the working example in this section focuses on roll call votes cast in the United
Nations. This UN dataset is available in the wnominate package, and it pools 237
roll call votes cast in the first three sessions of the UN (1946–1949) by 59 member
nations. The variables are labeled V1 to V239. V1 is the name of the member
nation, and V2 is a categorical variable either coded as WP for a member of the
Warsaw Pact, or Other for all other nations. The remaining variables sequentially
identify roll call votes.
To begin, we clean up, install the wnominate package the first time we use it,
load the library, and load the UN data:9
rm(list=ls())
install.packages("wnominate")
library(wnominate)
data(UN)
9
The UN data is also available in the file UN.csv, available in the Dataverse (see page vii) or the
chapter content (see page 125).
Once the data are loaded, they can be viewed with the standard commands such
as fix, but for a quick view of what the data look like, we could simply type:
head(UN[,1:15]). This will show the structure of the data through the first 13
roll call votes.
Before we can apply W-NOMINATE, we have to reformat the data to an object
of class rollcall. To do this, we first need to redefine our UN dataset as a matrix,
and split the names of the countries, whether the country was in the Warsaw Pact,
and the set of roll calls into three separate parts:
UN<-as.matrix(UN)
UN.2<-UN[,-c(1,2)]
UNnames<-UN[,1]
legData<-matrix(UN[,2],length(UN[,2]),1)
The first line turned the UN data frame into a matrix. (For more on matrix commands
in R, see Chap. 10.) The second line created a new matrix, which we have named
UN.2, which has eliminated the first two columns (country name and member of
Warsaw Pact) to leave only the roll calls. The third line exported the names of the
nations into the vector UNnames. (In many other settings, this would instead be the
name of the legislator or an identification variable.) Lastly, our variable of whether
a nation was in the Warsaw Pact has been saved as a one-column matrix named
legData. (In many other settings, this would be a legislators political party.) Once
we have these components together, we can use the rollcall command to define
a rollcall-class object that we name rc as follows:
rc<-rollcall(data=UN.2,yea=c(1,2,3),nay=c(4,5,6),
missing=c(7,8,9),notInLegis=0,legis.names=UNnames,
legis.data=legData,desc="UN Votes",source="voteview.com")
We specify our matrix of roll call votes with the data argument. Based on how
the data in the roll call matrix are coded, we use the yea, nay, and missing
arguments to translate numeric codes into their substantive meaning. Additionally,
notInLegis allows us to specify a code that specifically means that the legislator
was not a member at the time of the vote (e.g., a legislator died or resigned). We
have no such case in these data, but the default value is notInLegis=9, and
9 means something else to us, so we need to specify an unused code of 0. With
legis.names we specify the names of the voters, and with legis.data we
specify additional variables about our voters. Lastly, desc and source allow us
to record additional information about our data.
With our data formatted properly, we can now apply the W-NOMINATE model.
The command is simply called wnominate:
result<-wnominate(rcObject=rc,polarity=c(1,1))
With the rcObject argument, we simply name our properly formatted data. The
polarity argument, by contrast, requires substantive input from the researcher:
The user should specify a vector that lists which observation should clearly fall
on the positive side of each dimension estimated. Given the politics of the Cold
War, we use observation #1, the USA, as the anchor on both dimensions we
estimate. By default, wnominate places voters in two-dimensional ideological
space (though this could be changed by specifying the dims option).
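For instance, a strictly one-dimensional fit, which is not estimated in this chapter, would change dims and supply a single polarity anchor; a minimal sketch:
result.1d<-wnominate(rcObject=rc,dims=1,polarity=1)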
To view the results of our estimation, we can start by typing:
summary(result). This prints the following output.
SUMMARY OF W-NOMINATE OBJECT
----------------------------
Fig. 8.3 Output plot from estimating W-NOMINATE scores from the first three sessions of the
United Nations
proceedings, this is where each nation falls. The bottom-left panel shows a scree
plot, which lists the eigenvalue associated with each dimension. Larger eigenvalues
indicate that a dimension has more explanatory power. As in all scree plots, each
additional dimension has lower explanatory value than the previous one.10 The top-
right panel shows the distribution of the angles of the cutting lines. The cutting lines
divide yea from nay votes on a given issue. The fact that so many cutting lines are
near the 90° mark indicates that the first dimension is important for many of the
10
When choosing how many dimensions to include in a measurement model, many scholars use
the elbow rule, meaning they do not include any dimensions past a visual elbow in the scree plot.
In this case, a scholar certainly would not include more than three dimensions, and may be content
with two. Another common cutoff is to include any dimension for which the eigenvalue exceeds 1,
which would have us stop at two dimensions.
votes. Finally, the bottom-right panel shows the Coombs Mesh from this model, a
visualization of how all of the cutting lines on the 237 votes come together in a
single space.
If the user is satisfied with the results of this measurement model, then it is
straightforward to write the scores into a useable data format. Within
our wnominate output named result we can call the attribute named
legislators, which saves our ideal points for all countries, any non-roll call
variables we specified (e.g., Warsaw Pact or not), and a variety of other measures.
We save this as a new data frame named scores and then write that to a CSV file:
scores<-result$legislators
write.csv(scores,"UNscores.csv")
Just remember to use the setwd command to specify the folder in which you wish
to save the output file.
Once we have our W-NOMINATE ideal points in a separate data frame, we can
do anything we normally would with data in R, such as draw our own graphs.
Suppose we wanted to reproduce our own graph of the ideal points, but we wanted
to mark which nations were members of the Warsaw Pact versus those that were not.
We could easily do this using our scores data. The easiest way to do this might
be to use the subset command to create separate data frames of our two groups:
wp.scores<-subset(scores, V1=="WP")
other.scores<-subset(scores, V1=="Other")
Once we have these subsets in hand, we can create the relevant graph in three lines
of code.
plot(x=other.scores$coord1D, y=other.scores$coord2D,
xlab="First Dimension", ylab="Second Dimension",
xlim=c(-1,1), ylim=c(-1,1), asp=1)
points(x=wp.scores$coord1D, y=wp.scores$coord2D,
pch=3,col="red")
legend(x=-1,y=-.75,legend=c("Other","Warsaw Pact"),
col=c("black","red"),pch=c(1,3))
In the call to plot, we graph the 53 nations that were not members of the Warsaw
Pact, putting the first dimension on the horizontal axis, and the second on the vertical
axis. We label our axes appropriately using xlab and ylab. We also set the bounds
of our graph as running from −1 to 1 on both dimensions, as scores are constrained
to fall in these ranges. Importantly, we guarantee that the scale of the two dimensions
is the same, as we generally should for this kind of measurement model, by setting
the aspect ratio to 1 (asp=1). On the second line of code, we use the points
command to add the six observations that were in the Warsaw Pact, coloring these
observations red and using a different plotting character. Lastly, we add a legend.
Figure 8.4 presents the output from our code. This graph immediately conveys
that the first dimension is capturing the Cold War cleavage between the USA and
its allies versus the Soviet Union and its allies. We specified that the USA would
take positive coordinates on both dimensions, so we can see that the Soviet allies
Fig. 8.4 Plot of first and second dimensions of W-NOMINATE scores from the first three sessions
of the United Nations. A red cross indicates a member of the Warsaw Pact, and a black circle
indicates all other UN members (Color figure online)
(represented with red crosses) are at the extremes of negative values on the first
dimension.
To recap, this chapter has illustrated four examples of how to use R packages
to implement advanced methods. The fact that these packages are freely available
makes cutting-edge work in political methodology and from a variety of disciplines
readily available to any R user. No book could possibly showcase all of the
researcher-contributed packages that are available, not least because new packages
are being made available on a regular basis. The next time you find yourself facing
a taxing methodological problem, you may want to check the CRAN servers to see
if someone has already written a program that provides what you need.
This set of practice problems considers each of the example libraries in turn, and
then suggests you try using a brand new package that has not been discussed in this
chapter. Each question calls for a unique data set.
1. Multilevel Logistic Regression: Revisit Singh's (2015) data on voter turnout as
a function of compulsory voting rules and several other predictors. If you do not
have the file stdSingh.dta, please download it from the Dataverse (see page
vii) or the chapter content (see page 125). (These data were first introduced in
Sect. 7.4.) Refit this logistic regression model using the glmer command, and
include a random intercept by country-year (cntryyear). Recall that the outcome
is turnout (voted). The severity of compulsory voting rules (severity) is inter-
acted with the first five predictors: age (age), political knowledge (polinfrel),
income (income), efficacy (efficacy), and partisanship (partyID). Five more
predictors should be included only for additive effects: district magnitude
(dist_magnitude), number of parties (enep), victory margin (vicmarg_dist),
parliamentary system (parliamentary), and per capita GDP (development).
Again, all of the predictor variables have been standardized. What do you learn
from this multilevel logistic regression model estimated with glmer that you do
not learn from a pooled logistic regression model estimated with glm?
2. Bayesian Poisson model with MCMC: Determine how to estimate a Poisson
model with MCMC using MCMCpack. Reload Peake and Eshbaugh-Soha's
(2008) data on energy policy news coverage, last discussed in Sect. 7.3. If you
do not have the file PESenergy.csv, you may download it from the Dataverse
(see page vii) or the chapter content (see page 125). Estimate a Bayesian Poisson
model in which the outcome is energy coverage (Energy), and the inputs are
six indicators for presidential speeches (rmn1173, grf0175, grf575, jec477,
jec1177, and jec479), an indicator for the Arab oil embargo (embargo), an
indicator for the Iran hostage crisis (hostages), the price of oil (oilc), presidential
approval (Approval), and the unemployment rate (Unemploy). Use a Geweke
test to determine whether there is any evidence of nonconvergence. How should
you change your code in R if nonconvergence is an issue? Summarize your
results in a table, and show a density plot of the partial coefficient on Richard
Nixon's November 1973 speech (rmn1173).
3. Coarsened Exact Matching: In Chap. 5, the practice problems introduced Alvarez
et al.'s (2013) data from a field experiment in Salta, Argentina in which some
voters cast ballots through e-voting, and others voted in the traditional setting.
Load the foreign library and open the data in Stata format. If you do not have
the file alpl2013.dta, you may download it from the Dataverse (see page vii)
or the chapter content (see page 125). In this example, the treatment variable is
whether the voter used e-voting or traditional voting (EV). The covariates are age
group (age_group), education (educ), white collar worker (white_collar), not a
full-time worker (not_full_time), male (male), a count variable for number of
six possible technological devices used (tech), and an ordinal scale for political
knowledge (pol_info). Use the cem library to answer the following:
a. How balanced are the treatment and control observations in the raw data?
b. Conduct coarsened exact matching with the cem command. How much has
the balance improved as a result?
c. Consider three possible response variables: whether the voter evaluated the
voting experience positively (eval_voting), whether the voter evaluated the
speed of voting as quick (speed), and whether the voter is sure his or her vote
is being counted (sure_counted). What is the average treatment effect on the
treated (ATT) on your matched dataset for each of these three responses?
d. How do your estimates of the average treatment effects on the treated differ
from simple difference-of-means tests?
4. W-NOMINATE: Back in Sect. 2.1, we introduced Lewis and Poole's roll call data
for the 113th US Senate. Consult the code there to read these data, which are in
fixed width format. The file name is sen113kh.ord, and it is available from
the Dataverse (see page vii) and the chapter content (see page 125).
a. Format the data as a matrix and create the following: a separate matrix just of
the 657 roll calls, a vector of the ICPSR identification numbers, and a matrix
of the non-roll call variables. Use all of these to create a rollcall-class
object. The roll call votes are coded as follows: 1 = Yea; 6 = Nay; 7 &
9 = missing; and 0 = not a member.
b. Estimate a two-dimensional W-NOMINATE model for this roll call object.
From the summary of your results, report the following: How many legislators
were deleted? How many votes were deleted? What was the overall correct
classification?
c. Examine the output plot of your estimated model, including the
W-NOMINATE coordinates and the scree plot. Based on the scree plot,
how many dimensions do you believe are sufficient to characterize voting
behavior in the 113th Senate? Why?
5. Bonus: Try learning how to use a package you have never used before. Install the
Amelia package, which conducts multiple imputation for missing data. Have a
look at Honaker et al.'s (2011) article in the Journal of Statistical Software to get
a feel for the logic of multiple imputation and to learn how to do this in R. Fit a
linear model on imputed datasets using the freetrade data from the Amelia
library. What do you find?
Chapter 9
Time Series Analysis
Most of the methods described so far in this book are oriented primarily at
cross-sectional analysis, or the study of a sample of data taken at the same point in
time. In this chapter, we turn to methods for modeling a time series, or a variable that
is observed sequentially at regular intervals over time (e.g., daily, weekly, monthly,
quarterly, or annually). Time series data frequently have trends and complex error
processes, so failing to account for these features can produce spurious results
(Granger and Newbold 1974). Several approaches for time series analysis have
emerged to address these problems and prevent false inferences. Within Political
Science, scholars of public opinion, political economy, international conflict, and
several other subjects regularly work with time-referenced data, so adequate tools
for time series analysis are important in political analysis.
Many researchers do not think of R as a program for time series analysis, instead
using specialty software such as Autobox, EViews, or RATS. Even SAS and Stata
tend to get more attention for time series analysis than R does. However, R actually
has a wide array of commands for time series models, particularly through the TSA
and vars packages. In this chapter, we will illustrate three approaches to time series
analysis in R: the Box–Jenkins approach, extensions to linear models estimated with
least squares, and vector autoregression. This is not an exhaustive list of the tools
available for studying time series, but is just meant to introduce a few prominent
methods. See Sect. 9.4 for some further reading on time series.
Both the Box–Jenkins approach and extensions to linear models are examples
of single equation time series models. Both approaches treat one time series as
an outcome variable and fit a model of that outcome that can be stated in one
equation, much like the regression models of previous chapters can be stated in
a single equation. Since both approaches fall into this broad category, the working
dataset we use for both the Box–Jenkins approach and extensions to linear models
will be Peake and Eshbaugh-Soha's (2008) monthly data on television coverage of
energy policy that was first introduced in Chap. 3. By contrast, vector autoregression
is a multiple equation time series model (for further details on this distinction, see
Brandt and Williams 2007 or Lütkepohl 2005). With a vector autoregression model,
two or more time series are considered endogenous, so multiple equations are
required to fully specify the model. This is important because endogenous variables
may affect each other, and to interpret an input variable's effect, the broader context
of the full system must be considered. Since multiple equation models have such a
different specification, when discussing vector autoregression the working example
will be Brandt and Freeman's (2006) analysis of weekly political actions in the
Israeli–Palestinian conflict; more details will be raised once we get to that section.
1
Many use ARIMA models for forecasting future values of a series. ARIMA models themselves are
atheoretical, but often can be effective for prediction. Since most Political Science work involves
testing theoretically motivated hypotheses, this section focuses more on the role ARIMA models
can serve to set up inferential models.
2
If you do not have the data file PESenergy.csv already, you can download it from the
Dataverse (see page vii) or the online chapter content (see page 155).
3
In addition to examining the original series or the autocorrelation function, an Augmented
DickeyFuller test also serves to diagnose whether a time series has a unit root. By loading the
tseries package, the command adf.test will conduct this test in R.
Fig. 9.1 Autocorrelation function and partial autocorrelation function of monthly TV coverage of
energy policy through 24 lags. (a) ACF. (b) PACF
acf(pres.energy$Energy,lag.max=24)
pacf(pres.energy$Energy,lag.max=24)
The acf and pacf functions are available without loading a package, though the
code changes slightly if users load the TSA package.4 Notice that within the acf
and pacf functions, we first list the series we are diagnosing. Second, we designate
lag.max, which is the number of lags of autocorrelation we wish to consider.
Since these are monthly data, 24 lags gives us 2 years' worth of lags. In some series,
seasonality will emerge, in which we see evidence of similar values at the same time
each year. This would be seen with significant lags around 12 and 24 with monthly
data (or around 4, 8, and 12, by contrast, with quarterly data). No such evidence
appears in this case.
The graphs of our ACF and PACF are shown in Fig. 9.1. In each of these
figures, the horizontal axis represents the lag length, and the vertical axis represents
the correlation (or partial correlation) value. At each lag, the autocorrelation
for that lag is shown with a solid histogram-like line from zero to the value of
the correlation. The blue dashed lines represent the threshold for a significant
correlation. Specifically, the blue bands represent the 95 % confidence interval based
on an uncorrelated series.5 All of this means that if the histogram-like line does not
4
The primary noticeable change is that the default version of acf graphs the zero-lag correlation,
ACF(0), which is always 1.0. The TSA version eliminates this and starts with the first lag
autocorrelation, ACF(1).
5
The formula for these error bands is: $0 \pm 1.96 \cdot se_r$. The standard error for a correlation
coefficient is: $se_r = \sqrt{(1-r^2)/(n-2)}$. So in this case, we set r = 0 under the null hypothesis,
and n is the sample size (or series length).
cross the dashed line, the level of autocorrelation is not discernibly different from
zero, but correlation spikes outside of the error band are statistically significant.
The ACF, in Fig. 9.1a starts by showing the zero-lag autocorrelation, which is just
how much current values correlate with themselvesalways exactly 1.0. Afterward,
we see the more informative values of autocorrelation at each lag. At one lag out,
for instance, the serial correlation is 0.823. For two lags, it drops to 0.682. The sixth
lag is the last lag to show discernible autocorrelation, so we can say that the ACF
decays rapidly. The PACF, in Fig. 9.1b, skips the zero lag and starts with first-order
serial correlation. As expected, this is 0.823, just like the ACF showed. However,
once we account for first-order serial correlation, the partial autocorrelation terms
at later lags are not statistically discernible.6
At this point, we determine which ARIMA error process would leave an
empirical footprint such as the one this ACF and PACF show. For more details on
common footprints, see Enders (2009, p. 68). Notationally, we call the error process
ARIMA(p,d,q), where p is the number of autoregressive terms, d is how many times
the series needs to be differenced, and q is the number of moving average terms.
Functionally, a general ARMA model, which includes autoregressive and moving
average components, is written as follows:
$$y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i y_{t-i} + \sum_{i=1}^{q} \beta_i \epsilon_{t-i} + \epsilon_t \qquad (9.1)$$
Here, yt is our series of interest, and this ARMA model becomes an ARIMA model
when we decide whether we need to difference yt or not. Notice that yt is lagged p
times for the autoregressive terms, and the disturbance term (ε_t) is lagged q times
for the moving average terms. In the case of energy policy coverage, the ACF shows
a rapid decay and we see one significant spike in the PACF, so we can say we are
dealing with a first-order autoregressive process, denoted AR(1) or ARIMA(1,0,0).
Once we have identified our autoregressive integrated moving average model,
we can estimate it using the arima function:
ar1.mod<-arima(pres.energy$Energy,order=c(1,0,0))
The first input is the series we are modeling, and the order option allows us to
specify p, d, and q (in order) for our ARIMA(p,d,q) process. By typing ar1.mod,
we see the output of our results:
Call:
arima(x = pres.energy$Energy, order = c(1, 0, 0))
Coefficients:
ar1 intercept
0.8235 32.9020
s.e. 0.0416 9.2403
6
Technically, PACF at the third lag is negative and significant, but the common patterns of error
processes suggest that this is unlikely to be a critical part of the ARIMA process.
As with many other models, we can call our residuals using the model name and
a dollar sign (ar1.mod$residuals). The resulting graphs are presented in
Fig. 9.2. As the ACF and PACF both show, the second and fourth lags barely cross
the significance threshold, but there is no clear pattern or evidence of an overlooked
feature of the error process. Most (but not all) analysts would be content with this
pattern in these figures.
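The residual diagnostics shown in Fig. 9.2 can be generated with the same functions we used on the raw series; a minimal sketch, assuming the same 24-lag window:
acf(ar1.mod$residuals,lag.max=24)
pacf(ar1.mod$residuals,lag.max=24)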
As a second step of diagnosing whether we have sufficiently filtered the data, we
compute the Ljung–Box Q-test. This is a joint test across several lags of whether
there is evidence of serial correlation in any of the lags. The null hypothesis is that
Fig. 9.2 Autocorrelation function and partial autocorrelation function for residuals of AR(1)
model through 24 lags. Blue dashed lines represent a 95 % confidence interval for an uncorrelated
series. (a) ACF. (b) PACF (Color figure online)
7
Here we show in the main text how to gather one diagnostic at a time, but the reader also may want
to try typing tsdiag(ar1.mod,24) to gather graphical representations of a few diagnostics all
at once.
the data are independent, so a significant result serves as evidence of a problem. The
syntax for this test is:
Box.test(ar1.mod$residuals,lag=24,type="Ljung-Box")
We first specify the series of interest, then with the lag option state how many lags
should go into the test, and lastly with type specify Ljung-Box (as opposed to
the Box–Pierce test, which does not perform as well). Our results from this test are:
Box-Ljung test
data: ar1.mod$residuals
X-squared = 20.1121, df = 24, p-value = 0.6904
Our test statistic is not significant (p = 0.6904), so this test shows no evidence of
serial correlation over 2 years' worth of lags. If we are satisfied that AR(1) characterizes our
error process, we can proceed to actual modeling in the next section.
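The static model whose output appears next regresses energy coverage on the full set of predictors while allowing AR(1) errors. The setup code is not reproduced here, so the construction of the predictors matrix and the object name static.mod below are assumptions based on the Call line and the variable names in the dataset:
# Collect the inputs in the order they appear in the printed coefficients
predictors<-as.matrix(pres.energy[,c("rmn1173","grf0175","grf575",
    "jec477","jec1177","jec479","embargo","hostages","oilc",
    "Approval","Unemploy")])
static.mod<-arima(pres.energy$Energy,order=c(1,0,0),xreg=predictors)
static.mod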
Call:
arima(x=pres.energy$Energy,order=c(1,0,0),xreg=
predictors)
Coefficients:
ar1 intercept rmn1173 grf0175 grf575
0.8222 5.8822 91.3265 31.8761 -8.2280
s.e. 0.0481 52.9008 15.0884 15.4643 15.2025
jec477 jec1177 jec479 embargo hostages
29.6446 -6.6967 -20.1624 35.3247 -16.5001
s.e. 15.0831 15.0844 15.2238 15.1200 13.7619
oilc Approval Unemploy
0.8855 -0.2479 1.0080
s.e. 1.0192 0.2816 3.8909
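The dynamic intervention model discussed next treats Nixon's November 1973 speech as a transfer function input. Its estimation call is not reprinted in the text, but judging from the Call line in the output below it takes this form (arimax comes from the TSA package):
library(TSA)
dynamic.mod<-arimax(pres.energy$Energy,order=c(1,0,0),
    xreg=predictors[,-1],xtransf=predictors[,1],
    transfer=list(c(1,0)))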
The syntax to arimax is similar to arima, but we are now allowed a few
more options for transfer functions. Notice in this case that we use the code
xreg=predictors[,-1] to remove the indicator for Nixon's November 1973
speech from the static predictors. We instead place this predictor with the xtransf
option. The last thing we need to do is specify the order of our transfer function,
which we do with the option transfer. The transfer option accepts a list of
vectors, one vector per transfer function predictor. For our one transfer function, we
specify c(1,0): The first term refers to the order of the dynamic decay term (so a
0 here actually reverts back to a static model), and the second term refers to the lag
length of the predictor's effect (so if we expected an effect to grow, we might put a
higher number than 0 here). With these settings, we say that Nixon's speech had an
effect in the month he gave it, and then the effect spilled over to subsequent months
at a decaying rate.
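Mirroring the Call reproduced in the output below, the model can be estimated with a command along these lines (a sketch, assuming the TSA package, which supplies arimax, and the predictors matrix from Sect. 9.5):
library(TSA)
dynamic.mod<-arimax(pres.energy$Energy,order=c(1,0,0),
     xreg=predictors[,-1],xtransf=predictors[,1],
     transfer=list(c(1,0)))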
By typing dynamic.mod, we get our output:
Call:
arimax(x=pres.energy$Energy,order=c(1,0,0),
xreg=predictors[,-1],
xtransf = predictors[, 1], transfer = list(c(1, 0)))
Coefficients:
ar1 intercept grf0175 grf575 jec477
0.8262 20.2787 31.5282 -7.9725 29.9820
s.e. 0.0476 46.6870 13.8530 13.6104 13.5013
jec1177 jec479 embargo hostages oilc
-6.3304 -19.8179 25.9388 -16.9015 0.5927
s.e. 13.5011 13.6345 13.2305 12.4422 0.9205
Approval Unemploy T1-AR1 T1-MA0
-0.2074 0.1660 0.6087 160.6241
s.e. 0.2495 3.5472 0.0230 17.0388
8 In this case, we have a pulse input, so we can say that in November 1973, the effect of the speech was an expected increase of 161 news stories, holding all else equal. In December 1973, the carryover effect is that we expect 98 more stories, holding all else equal, because 161 × 0.61 ≈ 98. In January 1974, the effect of the intervention is that we expect 60 more stories, ceteris paribus, because 161 × 0.61 × 0.61 ≈ 60. The effect of the intervention continues forward in a similar decaying pattern. By contrast, if we had gotten these results with a step intervention instead of a pulse intervention, then these effects would accumulate rather than decay. Under this hypothetical, the effects would be 161 in November 1973, 259 in December 1973 (because 161+98=259), and 319 in January 1974 (because 161+98+60=319).
we have fitted to these data. Hence, the dynamic model has a better fit than the static
model or the atheoretical AR(1) model with no predictors.
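One way to check this kind of comparison is to pull the AIC from each fitted object, where lower values indicate a better fit (a sketch using the model names from above; static.mod is the hypothetical name used earlier for the static regression):
ar1.mod$aic        # atheoretical AR(1) with no predictors
static.mod$aic     # static model, if estimated as sketched earlier
dynamic.mod$aic    # dynamic transfer function model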
To get a real sense of the effect of an intervention analysis, though, an analyst
should always try to draw the effect that they modeled. (Again, it is key to study
the functional form behind the chosen intervention specification, as described by
Enders 2009, Sect. 5.1.) To draw the effect of our intervention for Nixon's 1973
speech, we type:
months<-c(1:180)
y.pred<-dynamic.mod$coef[2:12]%*%c(1,predictors[58,-1])+
160.6241*predictors[,1]+
160.6241*(.6087^(months-59))*as.numeric(months>59)
plot(y=pres.energy$Energy,x=months,xlab="Month",
ylab="Energy Policy Stories",type="l",axes=F)
axis(1,at=c(1,37,73,109,145),labels=c("Jan. 1969",
"Jan. 1972","Jan. 1975","Jan. 1978","Jan. 1981"))
axis(2)
box()
lines(y=y.pred,x=months,lty=2,col="blue",lwd=2)
On the first line, we simply create a time index for the 180 months in the study. In
the second line, we create predicted values for the effect of the intervention holding
everything else equal. A critical assumption that we make is that we hold all other
predictors equal by setting them to their values from October 1973, the 58th month
of the series (hence, predictors[58,-1]). So considering the components of
this second line, the first term multiplies the coefficients for the static predictors by
their last values before the intervention, the second term captures the effect of the
intervention in the month of the speech, and the third term captures the spillover
effect of the intervention based on the number of months since the speech. The next
four lines simply draw a plot of our original series values and manage some of the
graph's features. Lastly, we add a dashed line showing the effect of the intervention
holding all else constant. The result is shown in Fig. 9.3. As the figure shows, the
result of this intervention is a large and positive jump in the expected number of
news stories that carries over for a few months but eventually decays back to
pre-intervention levels. This kind of graph is essential for understanding how the
dynamic intervention actually affects the model.
As a final graph to supplement our view of the dynamic intervention effect, we
could draw a plot that shows how well predictions from the full model align with
true values from the series. We could do this with the following code:
months<-c(1:180)
full.pred<-pres.energy$Energy-dynamic.mod$residuals
plot(y=full.pred,x=months,xlab="Month",
ylab="Energy Policy Stories",type="l",
ylim=c(0,225),axes=F)
points(y=pres.energy$Energy,x=months,pch=20)
legend(x=0,y=200,legend=c("Predicted","True"),
pch=c(NA,20),lty=c(1,NA))
axis(1,at=c(1,37,73,109,145),labels=c("Jan. 1969",
"Jan. 1972","Jan. 1975","Jan. 1978","Jan. 1981"))
axis(2)
box()
Fig. 9.3 The dashed line shows the dynamic effect of Nixon's Nov. 1973 speech, holding all else equal. The solid line shows observed values of the series
Again, we start by creating a time index, months. In the second line, we create our
predicted values by subtracting the residuals from the true values. In the third line
of code, we draw a line graph of the predicted values from the model. In the fourth
line, we add points showing the true values from the observed series. The remaining
lines complete the graph formatting. The resulting graph is shown in Fig. 9.4. As
can be seen, the in-sample fit is good, with the predicted values tracking the true
values closely.
As a final point here, the reader is encouraged to consult the code in Sect. 9.5
for alternative syntax for producing Figs. 9.3 and 9.4. The tradeoff is that the
alternative way of drawing these figures requires more lines of code, but it is more
generalizable and easier to apply to your own research. Plus, the alternative code
introduces how the ts command lets analysts convert a variable to a time series
object. Seeing both approaches is worthwhile for illustrating that, in general, many
tasks can be performed in many ways in R.
Fig. 9.4 Predicted values from a full transfer function model on a line, with actual observed values as points

9.2 Extensions to Least Squares Linear Regression Models

A second approach to time series analysis draws more from the econometric
literature and looks for ways to extend linear regression models to account for
the unique issues associated with time-referenced data. Since we already discussed
visualization with these data extensively in Sect. 9.1, we will not revisit graphing
issues here. As with Box–Jenkins-type models, though, the analyst should always
begin by drawing a line plot of the series of interest, and ideally a few key predictors
as well. Even diagnostic plots such as the ACF and PACF would be appropriate, in
addition to residual diagnostics such as we will discuss shortly.
When modeling data from an econometric approach, researchers again have to
decide whether to use a static or a dynamic specification of the model. For static
models in which current values of predictors affect current values of the outcome,
researchers may estimate the model with ordinary least squares (OLS) in the rare
case of no serial correlation. For efficiency gains on a static model, however, feasible
generalized least squares (FGLS) is a better estimator. By contrast, in the case of a
dynamic functional form, a lag structure can be introduced into the linear model's
specification.
Starting with static models, the simplest kind of model (though rarely appropriate)
would be to estimate the model using simple OLS. Returning to our energy
policy data, our model specification here would be:
static.ols<-lm(Energy~rmn1173+grf0175+grf575+jec477+
jec1177+jec479+embargo+hostages+oilc+
Approval+Unemploy,data=pres.energy)
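The output reproduced below can then be viewed with the standard summary call for linear models:
summary(static.ols)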
Call:
lm(formula=Energy~rmn1173+grf0175+grf575+jec477+
jec1177+
jec479 + embargo + hostages + oilc + Approval +
Unemploy,
data = pres.energy)
Residuals:
Min 1Q Median 3Q Max
-104.995 -12.921 -3.448 8.973 111.744
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 319.7442 46.8358 6.827 1.51e-10 ***
rmn1173 78.8261 28.8012 2.737 0.00687 **
grf0175 60.7905 26.7006 2.277 0.02406 *
grf575 -4.2676 26.5315 -0.161 0.87240
jec477 47.0388 26.6760 1.763 0.07966 .
jec1177 15.4427 26.3786 0.585 0.55905
jec479 72.0519 26.5027 2.719 0.00724 **
embargo 96.3760 13.3105 7.241 1.53e-11 ***
hostages -4.5289 7.3945 -0.612 0.54106
oilc -5.8765 1.0848 -5.417 2.07e-07 ***
Approval -1.0693 0.2147 -4.980 1.57e-06 ***
Unemploy -3.7018 1.3861 -2.671 0.00831 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
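The serial correlation diagnostics discussed next rely on the lmtest package; a sketch of the calls that generate them (assuming lmtest is installed) is:
install.packages("lmtest")
library(lmtest)
dwtest(static.ols)
bgtest(static.ols)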
For both the dwtest and bgtest commands, we simply provide the name of the
model as the main argument. The Durbin-Watson test (computed with dwtest)
tests for first-order serial correlation and is not valid for a model that includes a
lagged dependent variable. Our test produces the following output:
Durbin-Watson test
data: static.ols
DW = 1.1649, p-value = 1.313e-09
alternative hypothesis: true autocorrelation is greater than 0
The Durbin–Watson d statistic (1.1649 in this case) does not have a parametric
distribution. Traditionally the value of d has been checked against tables based
on Monte Carlo results to determine significance. R, however, does provide an
approximate p-value with the statistic. For a DurbinWatson test, the null hypothesis
is that there is no autocorrelation, so our significant value of d suggests that
autocorrelation is a problem.
Meanwhile, the results of our Breusch–Godfrey test (computed with bgtest)
offer a similar conclusion. The Breusch–Godfrey test has a χ² distribution and
can be used to test autocorrelation in a model with a lagged dependent variable.
By default, the bgtest command checks for first-order serial correlation, though
higher-order serial correlation can be tested with the order option. Our output in
this case is:
Breusch-Godfrey test for serial correlation of order up to 1
data: static.ols
LM test = 38.6394, df = 1, p-value = 5.098e-10
Again, the null hypothesis is that there is no autocorrelation, so our significant χ²
value shows that serial correlation is a concern, and we need to do something to
account for this.
At this point, we can draw one of two conclusions: The first possibility is that
our static model specification is correct, and we need to find an estimator that is
efficient in the presence of error autocorrelation. The second possibility is that we
have overlooked a dynamic effect and need to respecify our model. (In other words,
if there is a true spillover effect, and we have not modeled it, then the errors will
appear to be serially correlated.) We will consider each possibility.
First, if we are confident that our static specification is correct, then our functional
form is right, but under the Gauss–Markov theorem OLS is inefficient with error
autocorrelation, and the standard errors are biased. As an alternative, we can
use feasible generalized least squares (FGLS), which estimates the level of error
correlation and incorporates this into the estimator. There are a variety of estimation
techniques here, including the Prais–Winsten and Cochrane–Orcutt estimators. We
proceed by illustrating the Cochrane–Orcutt estimator, though users should be wary
Once we have loaded this package, we insert the name of a linear model we have
estimated with OLS into the cochrane.orcutt command. This then iteratively
reestimates the model and produces our FGLS results as follows:
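A minimal sketch of that call, assuming the orcutt package (which supplies the cochrane.orcutt command) has been installed and loaded, is:
install.packages("orcutt")
library(orcutt)
cochrane.orcutt(static.ols)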
$Cochrane.Orcutt
Call:
lm(formula = YB ~ XB - 1)
Residuals:
Min 1Q Median 3Q Max
-58.404 -9.352 -3.658 8.451 100.524
Coefficients:
Estimate Std. Error t value Pr(>|t|)
XB(Intercept) 16.8306 55.2297 0.305 0.7609
XBrmn1173 91.3691 15.6119 5.853 2.5e-08 ***
XBgrf0175 32.2003 16.0153 2.011 0.0460 *
XBgrf575 -7.9916 15.7288 -0.508 0.6121
XBjec477 29.6881 15.6159 1.901 0.0590 .
XBjec1177 -6.4608 15.6174 -0.414 0.6796
XBjec479 -20.0677 15.6705 -1.281 0.2021
XBembargo 34.5797 15.0877 2.292 0.0232 *
XBhostages -16.9183 14.1135 -1.199 0.2323
XBoilc 0.8240 1.0328 0.798 0.4261
XBApproval -0.2399 0.2742 -0.875 0.3829
XBUnemploy -0.1332 4.3786 -0.030 0.9758
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9 In particular, at each stage of the iterative process, the linear model is estimated by regressing $y_t^* = y_t - \hat{\rho}y_{t-1}$ on $x_t^* = x_t - \hat{\rho}x_{t-1}$ (Hamilton 1994, p. 223). This procedure assumes that the dynamic adjustment process is the same for the outcome and the input variables, which is unlikely. Hence, a dynamic specification such as an autoregressive distributed lag model would be more flexible.
$rho
[1] 0.8247688
$number.interaction
[1] 15
The first portion of the table looks like the familiar linear regression output (though the
letters XB appear before the name of each predictor). All of these coefficients,
standard errors, and inferential statistics have the exact same interpretation as in a model
estimated with OLS, but our estimates now should be efficient because they were
computed with FGLS. Near the bottom of the output, we see $rho, which shows us
our final estimate of error autocorrelation. We also see $number.interaction,
which informs us that the model was reestimated in 15 iterations before it converged
to the final result. FGLS is intended to produce efficient estimates if a static
specification is correct.
By contrast, if we believe a dynamic specification is correct, we need to work
to respecify our linear model to capture that functional form. In fact, if we get the
functional form wrong, our results are biased, so getting this right is critical. Adding
a lag specification to our model can be made considerably easier if we install and
load the dyn package. We name our model koyck.ols for reasons that will be
apparent shortly:
install.packages("dyn")
library(dyn)
pres.energy<-ts(pres.energy)
koyck.ols<-dyn$lm(Energy~lag(Energy,-1)+rmn1173+
grf0175+grf575+jec477+jec1177+jec479+embargo+
hostages+oilc+Approval+Unemploy,data=pres.energy)
After loading dyn, the second line uses the ts command to declare that our data are
time series data. In the third line, notice that we changed the linear model command
to read, dyn$lm. This modification allows us to include lagged variables within our
model. In particular, we now have added lag(Energy,-1), which is the lagged
value of our dependent variable. With the lag command, we specify the variable
being lagged and how many times to lag it. By specifying -1, we are looking at the
immediately prior value. (Positive values represent future values.) The default lag is
0, which just returns current values.
We can see the results of this model by typing summary(koyck.ols):
Call:
lm(formula = dyn(Energy ~ lag(Energy, -1) + rmn1173 + grf0175 +
    grf575 + jec477 + jec1177 + jec479 + embargo + hostages +
    oilc + Approval + Unemploy), data = pres.energy)
Residuals:
Min 1Q Median 3Q Max
-51.282 -8.638 -1.825 7.085 70.472
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.11485 36.96818 1.680 0.09479 .
lag(Energy, -1) 0.73923 0.05113 14.458 < 2e-16 ***
rmn1173 171.62701 20.14847 8.518 9.39e-15 ***
grf0175 51.70224 17.72677 2.917 0.00403 **
grf575 7.05534 17.61928 0.400 0.68935
jec477 39.01949 17.70976 2.203 0.02895 *
jec1177 -10.78300 17.59184 -0.613 0.54075
jec479 28.68463 17.83063 1.609 0.10958
embargo 10.54061 10.61288 0.993 0.32206
hostages -2.51412 4.91156 -0.512 0.60942
oilc -1.14171 0.81415 -1.402 0.16268
Approval -0.15438 0.15566 -0.992 0.32278
Unemploy -0.88655 0.96781 -0.916 0.36098
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our specification here is often called a Koyck model. This is because Koyck (1954)
observed that when a lagged dependent variable is included as a predictor, each
predictor will have spillover effects in subsequent months.
Consider two examples of predictor spillover effects. First, our coefficient for
Nixon's speech is approximately 172. Here, we are interested in an impulse effect
whereby the predictor increased to 1 in the month the speech was given, and then
went back to 0. Therefore, in the month of November 1973 when the speech
was given, the expected effect of this speech holding all else equal is a 172-story
increase in energy policy coverage. However, in December 1973, November's level
of coverage is a predictor, and November's coverage was shaped by the speech.
Since our coefficient on the lagged dependent variable is approximately 0.74, and
since 0.74 × 172 ≈ 128, we therefore expect that the speech increased energy
coverage in December by 128, ceteris paribus. Yet the effect would persist into
January as well because December's value predicts January 1974's value. Since
$\hat{\beta}_{13} / (1 - \hat{\beta}_{2})$   (9.2)
Our output is −3.399746, which means that a persistent 1 percentage point rise in
unemployment would reduce TV news coverage of energy policy in the long term by
3.4 stories, on average and all else equal. Again, this kind of long-term effect could
occur for any variable that is not limited to a pulse input.
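A sketch of how this long-run effect can be computed from the fitted Koyck model, assuming the coefficient ordering shown in the summary above (Unemploy is the 13th coefficient and the lagged dependent variable is the 2nd):
koyck.ols$coefficients[13]/(1-koyck.ols$coefficients[2])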
As a final strategy, we could include one or more lags of one or more predictors
without including a lagged dependent variable. In this case, any spillover will
be limited to whatever we directly incorporate into the model specification. For
example, if we only wanted a dynamic effect of Nixon's speech and a static
specification for everything else, we could specify this model:
udl.mod<-dyn$lm(Energy~rmn1173+lag(rmn1173,-1)+
lag(rmn1173,-2)+lag(rmn1173,-3)+lag(rmn1173,-4)+
grf0175+grf575+jec477+jec1177+jec479+embargo+
hostages+oilc+Approval+Unemploy,data=pres.energy)
In this situation, we have included the current value of the indicator for Nixon's speech,
as well as four lags. For an intervention, that means that this predictor will have an
effect in November 1973 and for 4 months afterwards. (In April 1974, however, the
effect abruptly drops to 0, where it stays.) We see the results of this model by typing
summary(udl.mod):
Call:
lm(formula = dyn(Energy ~ rmn1173 + lag(rmn1173,-1) +
    lag(rmn1173,-2) + lag(rmn1173,-3) + lag(rmn1173,-4) +
    grf0175 + grf575 + jec477 + jec1177 + jec479 + embargo +
    hostages + oilc + Approval + Unemploy), data = pres.energy)
Residuals:
Min 1Q Median 3Q Max
-43.654 -13.236 -2.931 7.033 111.035
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 334.9988 44.2887 7.564 2.89e-12 ***
rmn1173 184.3602 34.1463 5.399 2.38e-07 ***
lag(rmn1173, -1) 181.1571 34.1308 5.308 3.65e-07 ***
lag(rmn1173, -2) 154.0519 34.2151 4.502 1.29e-05 ***
lag(rmn1173, -3) 115.6949 34.1447 3.388 0.000885 ***
lag(rmn1173, -4) 75.1312 34.1391 2.201 0.029187 *
grf0175 60.5376 24.5440 2.466 0.014699 *
grf575 -3.4512 24.3845 -0.142 0.887629
jec477 45.5446 24.5256 1.857 0.065146 .
jec1177 14.5728 24.2440 0.601 0.548633
jec479 71.0933 24.3605 2.918 0.004026 **
embargo -9.7692 24.7696 -0.394 0.693808
hostages -4.8323 6.8007 -0.711 0.478392
oilc -6.1930 1.0232 -6.053 9.78e-09 ***
Approval -1.0341 0.1983 -5.216 5.58e-07 ***
Unemploy -4.4445 1.3326 -3.335 0.001060 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9.3 Vector Autoregression

The final approach that we will describe in this chapter is vector autoregression
(VAR). The VAR approach is useful when studying several variables that are
endogenous to each other because there is reciprocal causation among them. The
basic framework is to estimate a linear regression model for each of the endogenous
variables. In each linear model, include several lagged values of the outcome
variable itself (say p lags of the variable) as well as p lags of all of the other
endogenous variables. So for the simple case of two endogenous variables, x and
y, in which we set our lag length to p = 3, we would estimate two equations that
could be represented as follows:
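In generic notation (a sketch of the standard bivariate form; the particular coefficient symbols used here are an assumption), the two equations would be:
y_t = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \alpha_3 y_{t-3} + \beta_1 x_{t-1} + \beta_2 x_{t-2} + \beta_3 x_{t-3} + \epsilon_{1t}
x_t = \gamma_0 + \gamma_1 x_{t-1} + \gamma_2 x_{t-2} + \gamma_3 x_{t-3} + \delta_1 y_{t-1} + \delta_2 y_{t-2} + \delta_3 y_{t-3} + \epsilon_{2t}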
A fuller treatment on this methodology, including notation for models with more
endogenous variables and examples of models that include exogenous variables as
well, can be found in Brandt and Williams (2007). With this model, tests such as
Granger causality tests can be used to assess if there is a causal effect from one
endogenous variable to another (Granger 1969).
As an example of how to implement a VAR model in R, we turn to work
by Brandt and Freeman (2006), who analyze weekly data regarding the Israeli–
Palestinian conflict. Their data are drawn from the Kansas Event Data System,
which automatically codes English-language news reports to measure political
events, with a goal of using this information as an early warning to predict political
change. The endogenous variables in these data are scaled political actions taken
by either the USA, Israel, or Palestine, and directed to one of the other actors. This
produces six variables: a2i, a2p, i2a, p2a, i2p, and p2i. The abbreviations are a
for American, i for Israeli, and p for Palestinian. As an example, i2p measures
the scaled value of Israeli actions directed toward the Palestinians. The weekly data
we will use run from April 15, 1979 to October 26, 2003, for a total of 1278 weeks.
To proceed, we need to install and load the vars package to make the relevant
estimation commands available. We also need the foreign package because our
data are in Stata format:10
install.packages("vars")
library(vars)
library(foreign)
levant.0 <- read.dta("levant.dta")
levant<- subset(levant.0,
select=c("a2i","a2p","i2p","i2a","p2i","p2a"))
After loading the packages, we load our data on the third line, naming it levant.0.
These data also contain three date-related indices, so for analysis purposes we
actually need to create a second copy of the data that only includes our six
endogenous variables without the indices. We do this using the subset command
to create the dataset named levant.
A key step at this point is to choose the appropriate lag length, p. The lag
length needs to capture all error processes and causal dynamics. One approach to
10 This example requires the file levant.dta. Please download this file from the Dataverse (see page vii) or this chapter's online content (see page 155).
determining the appropriate lag length is to fit several models, choose the model
with the best fit, and then see if the residuals show any evidence of serial correlation.
The command VARselect automatically estimates several vector autoregression
models and selects which lag length has the best fit. Since our data are weekly, we
consider up to 104 weeks of lags, or 2 years' worth of data, to consider all possible
options. The following command can take a minute to run because 104 models are
being estimated:
levant.select<-VARselect(levant,type="const",lag.max=104)
To find out which of the lag lengths fits best, we can type
levant.select$selection. Our output is simply the following:
AIC(n) HQ(n) SC(n) FPE(n)
60 4 1 47
This reports the chosen lag length for the Akaike information criterion, Hannan–
Quinn information criterion, Schwarz criterion, and forecast prediction error. All
four of these indices are coded so that lower values are better. To contrast extremes,
the lowest value of the Schwarz criterion, which has a heavy penalty for additional
parameters, is for the model with only one lag of the endogenous variables. By
contrast, the best fit on the AIC comes from the model that requires 60 lags,
perhaps indicating that annual seasonality is present. A much longer printout
perhaps indicating that annual seasonality is present. A much longer printout
giving the value of each of the four fit indices for all 104 models can be seen by
typing: levant.select$criteria. Ideally, our fit indices would have settled
on models with similar lag lengths. Since they did not, and since we have 1278
observations, we will take the safer route with the long lag length suggested by the
AIC.11
To estimate the vector autoregression model with p D 60 lags of each
endogenous variable, we type:
levant.AIC<-VAR(levant,type="const",p=60)
The VAR command requires the name of a dataset containing all of the endogenous
variables. With the type option, we have chosen "const" in this case (the
default). This means that each of our linear models includes a constant. (Other
options include specifying a "trend," "both" a constant and a trend, or
"none" which includes neither.) With the option p we choose the lag length.
An option we do not use here is the exogen option, which allows us to specify
exogenous variables.
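For reference, exogenous predictors would enter through a call of roughly this form (a sketch only; exog.data stands in for a hypothetical matrix of exogenous variables and is not part of this example):
levant.exog<-VAR(levant,type="const",p=60,exogen=exog.data)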
Once we have estimated our model using VAR, the next thing we should do is
diagnose the model using a special call to the plot function:
plot(levant.AIC,lag.acf=104,lag.pacf=104)
By using the name of our VAR model, levant.AIC, as the only argument in
plot, R will automatically provide a diagnostic plot for each of the endogenous
11 You are encouraged to examine the models that would have been chosen by the Hannan–Quinn criterion (4 lags) or the Schwarz criterion (1 lag) on your own. How do these models perform in terms of diagnostics? How would inferences change?
variables. With the options of lag.acf and lag.pacf, we specify that the ACF
and PACF plots that this command reports should show us 2 years' worth (104
weeks) of autocorrelation patterns. For our data, R produces six plots. R displays
these plots one at a time, and between each, the console will pose the following
prompt:
Hit <Return> to see next plot:
R will keep a plot on the viewer until you press the Return key, at which point it
moves on to the next graph. This process repeats until the graph for each outcome
has been shown.
Alternatively, if we wanted to see the diagnostic plot for one endogenous variable
in particular, we could type:
plot(levant.AIC,lag.acf=104,lag.pacf=104,names="i2p")
Here, the names option has let us specify that we want to see the diagnostics for i2p
(Israeli actions directed towards Palestine). The resulting diagnostic plot is shown
in Fig. 9.5. The graph has four parts: At the top is a line graph that shows the true
values in a solid black line and the fitted values in a blue dashed line. Directly
beneath this is a line plot of the residuals against a line at zero. These first two
graphs can illustrate whether the model consistently makes unbiased predictions of
the outcome and whether the residuals are homoscedastic over time. In general, the
residuals consistently hover around zero, though for the last 200 observations the
error variance does seem to increase slightly. The third graph, in the bottom left,
is the ACF for the residuals on i2p.12 Lastly, in the bottom right of the panel, we
see the PACF for i2p's residuals. No spikes are significant in the ACF and only one
spike is significant in the PACF over 2 years, so we conclude that our lag structure
has sufficiently filtered out any serial correlation in this variable.
When interpreting a VAR model, we turn to two tools to draw inferences
and interpretations from these models. First, we use Granger causality testing to
determine if one endogenous variable causes the others. This test is simply a block
F-test of whether all of the lags of a variable can be excluded from the model. For
a joint test of whether one variable affects the other variables in the system, we can
use the causality command. For example, if we wanted to test whether Israeli
actions towards Palestine caused actions by the other five directed dyads, we would
type:
causality(levant.AIC, cause="i2p")$Granger
The command here calls for the name of the model first (levant.AIC), and then
with the cause option, we specify which of the endogenous variables we wish to
test the effect of. Our output is as follows:
12 Note that, by default, the graph R presents actually includes the zero-lag perfect correlation. If you would like to eliminate that, given our long lag length and the size of the panel, simply load the TSA package before drawing the graph to change the default.
Fig. 9.5 Predicted values, residual autocorrelation function, and residual partial autocorrelation function for the Israel-to-Palestine series in a six-variable vector autoregression model
We can proceed to test whether each of the other five variables Granger-cause
the other predictors in the system by considering each one with the causality
command:
causality(levant.AIC, cause="a2i")$Granger
causality(levant.AIC, cause="a2p")$Granger
causality(levant.AIC, cause="i2a")$Granger
causality(levant.AIC, cause="p2i")$Granger
causality(levant.AIC, cause="p2a")$Granger
The results are not reprinted here to preserve space. At the 95 % confidence level,
though, you will see that each variable significantly causes the others, except for
American actions directed towards Israel (a2i).
Finally, to get a sense of the substantive impact of each predictor, we turn to
impulse response analysis. The logic here is somewhat similar to the intervention
analysis we graphed back in Fig. 9.3. An impulse response function considers a
one-unit increase in one of the endogenous variables and computes how such an
exogenous shock would dynamically influence all of the endogenous variables,
given autoregression and dependency patterns.
When interpreting an impulse response function, a key consideration is the fact
that shocks to one endogenous variable are nearly always correlated with shocks to
other variables. We therefore need to consider that a one-unit shock to one variable's
residual term is likely to create contemporaneous movement in the residuals of the
other variables. The off-diagonal terms of the variance–covariance matrix of the
endogenous variables' residuals, $\hat{\Sigma}$, tell us how much the shocks covary.
There are several ways to deal with this issue. One is to use theory to determine
the ordering of the endogenous variables. In this approach, the researcher assumes
that a shock to one endogenous variable is not affected by shocks to any other
variable. Then a second variable's shocks are affected only by the first variable's,
and no others. The researcher recursively repeats this process to identify a system in
which all variables can be ordered in a causal chain. With a theoretically designed
system like this, the researcher can determine how contemporaneous shocks affect
each other with a structured Cholesky decomposition of $\hat{\Sigma}$ (Enders 2009, p. 309).
A second option, which is the default option in R's irf command, is to assume that
there is no theoretical knowledge of the causal ordering of the contemporaneous
shocks and apply the method of orthogonalization of the residuals. This involves
another Cholesky decomposition, in which we find $A_0^{-1}$ by solving
$A_0^{-1}A_0^{-1\prime} = \hat{\Sigma}$.
For more details about response ordering or orthogonalizing residuals, see Brandt
and Williams (2007, pp. 36–41 & 66–70), Enders (2009, pp. 307–315), or Hamilton
(1994, pp. 318–324).
As an example of an impulse response function, we will graph the effect of
one extra political event from Israel directed towards Palestine. R will compute
our impulse response function with the irf command, using the default orthog-
onalization of residuals. One of the key options in this command is boot, which
determines whether to construct a confidence interval with bootstraps. Generally,
it is advisable to report uncertainty in predictions, but the process can take several
minutes.13 So if the reader wants a quick result, set boot=FALSE. To get the result
with confidence intervals type:
levant.irf<-irf(levant.AIC,impulse="i2p",n.ahead=12,boot=TRUE)
We name our impulse response function levant.irf. The command requires us
to state the name of our model (levant.AIC). The option impulse asks us to
name the variable we want the effect of, the n.ahead option sets how far ahead
we wish to forecast (we say 12 weeks, or 3 months), and lastly boot determines
whether to create confidence intervals based on a bootstrap sample.
Once we have computed this, we can graph the impulse response function with a
special call to plot:
plot(levant.irf)
The resulting graph is shown in Fig. 9.6. There are six panels, one for each of
the six endogenous variables. The vertical axis on each panel lists the name of
the endogenous variable and represents the expected change in that variable. The
horizontal axis on each panel represents the number of weeks that have elapsed
since the shock. Each panel shows a solid red line at zero, representing where a
non-effect falls on the graph. The solid black line represents the expected effect
in each week, and the red dashed lines represent the bounds of each confidence
interval. As the figure shows, the impact of a shock in Israel-to-Palestine actions is
most dramatic for the Israel-to-Palestine series itself (i2p), on account of autoregression
and feedback from its effects on the other series. We also see a significant jump in
Palestine-to-Israel actions (p2i) over the ensuing 3 months. With the other four series,
the effects pale in comparison. We easily could produce plots similar to Fig. 9.6 by
computing the impulse response function for a shock in each of the six inputs. This
is generally a good idea, but is omitted here for space.
9.4 Further Reading About Time Series Analysis

With these tools in hand, the reader should have some sense of how to estimate and
interpret time series models in R using three approaches: Box–Jenkins modeling,
econometric modeling, and vector autoregression. Be aware that, while this chapter
uses simple examples, time series analysis is generally challenging. It can be
difficult to find a good model that properly accounts for all of the trend and error
processes, but the analyst must carefully work through all of these issues or the
13 Beware that bootstrap-based confidence intervals do not always give the correct coverages because they confound information about how well the model fits with uncertainty of parameters. For this reason, Bayesian approaches are often the best way to represent uncertainty (Brandt and Freeman 2006; Sims and Zha 1999).
Fig. 9.6 Impulse response function for a one-unit shock in the Israel-to-Palestine series in a six-variable vector autoregression model. 95 % bootstrap CI, 100 runs
inferences will be biased. (See again, Granger and Newbold 1974.) So be sure to
recognize that it often takes several tries to find a model that properly accounts for
all issues.
It is also worth bearing in mind that this chapter cannot comprehensively address
any of the three approaches we consider, much less touch on all types of time
series analysis. (Spectral analysis, wavelet analysis, state space models, and error-
correction models are just a handful of topics not addressed here.) Therefore, the
interested reader is encouraged to consult other resources for further important
details on various topics in time series modeling. Good books that cover a range
of time series topics and include R code are: Cowpertwait and Metcalfe (2009),
Cryer and Chan (2008), and Shumway and Stoffer (2006). For more depth on
the theory behind time series, good volumes by Political Scientists include Box-
Steffensmeier et al. (2014) and Brandt and Williams (2007). Other good books
that cover time series theory include: Box et al. (2008), Enders (2009), Wei (2006),
Lütkepohl (2005), and, for an advanced take, Hamilton (1994). Several books
focus on specific topics and applying the methods in R: For instance, Petris et al.
(2009) covers dynamic linear models, also called state space models, and explains
how to use the corresponding dlm package in R to apply these methods. Pfaff
(2008) discusses cointegrated data models such as error-correction models and
vector error-correction models, using the R package vars. Additionally, readers
who are interested in time series cross-section models, or panel data, should consult
the plm package, which facilitates the estimation of models appropriate for those
methods. Meanwhile, Mátyás and Sevestre (2008) offers a theoretical background
on panel data methods.
9.5 Alternative Time Series Code

As mentioned in Sect. 9.1, we will now show alternative syntax for producing
Figs. 9.3 and 9.4. This code is a little longer, but is more generalizable.14 First off,
if you do not have all of the packages, data objects, and models loaded from before,
be sure to reload a few of them so that we can draw the figure:
library(TSA)
pres.energy<-read.csv("PESenergy.csv")
predictors<-as.matrix(subset(pres.energy,select=c(rmn1173,
grf0175,grf575,jec477,jec1177,jec479,embargo,hostages,
oilc,Approval,Unemploy)))
All three of the previous lines of code were run earlier in the chapter. By way of
reminder, the first line loads the TSA package, the second loads our energy policy
coverage data, and the third creates a matrix of predictors.
Now, to redraw either figure, we need to engage in a bit of data management:
months <- 1:180
static.predictors <- predictors[,-1]
dynamic.predictors <- predictors[,1, drop=FALSE]
y <- ts(pres.energy$Energy, frequency=12, start=c(1972, 1))
14 My thanks to Dave Armstrong for writing and suggesting this alternative code.
Next, we need to actually estimate our transfer function. Once we have done this,
we can save several outputs from the model:
dynamic.mod<-arimax(y,order=c(1,0,0),xreg=static.predictors,
xtransf=dynamic.predictors,transfer=list(c(1,0)))
b <- coef(dynamic.mod)
static.coefs <- b[match(colnames(static.predictors), names(b))]
ma.coefs <- b[grep("MA0", names(b))]
ar.coefs <- b[grep("AR1", names(b))]
The first line refits our transfer function. The second uses the coef command to
extract the coefficients from the model and save them in a vector named b. The last
three lines separate our coefficients into static effects (static.coefs), initial
dynamic effects (ma.coefs), and decay terms (ar.coefs). In each line, we
carefully reference the names of our coefficient vector, using the match command
to find coefficients for the static predictors, and then the grep command to search
for terms that contain MA0 and AR1, respectively, just as output terms of a transfer
function do.
With all of these elements extracted, we now turn specifically to redrawing
Fig. 9.3, which shows the effect of the Nixon speech intervention against the real
data. Our intervention effect consists of two parts: the expected value from holding
all of the static predictors at their values for the 58th month, and the dynamic effect
of the transfer function. We create this as follows:
xreg.pred<-b["intercept"]+static.coefs%*%static.predictors[58,]
transf.pred <- as.numeric(dynamic.predictors%*%ma.coefs+
ma.coefs*(ar.coefs^(months-59))*(months>59))
y.pred<-ts(xreg.pred+transf.pred,frequency=12,start=c(1972,1))
The first line simply makes the static prediction from a linear equation. The
second uses our initial effects and decay terms to predict the dynamic effect of
the intervention. Third, we add the two pieces together and save them as a time
series with the same frequency and start date as the original series. With both y and
y.pred now coded as time series of the same frequency over the same time span,
it is now easy to recreate Fig. 9.3:
plot(y,xlab="Month", ylab="Energy Policy Stories",type="l")
lines(y.pred, lty=2, col="blue", lwd=2)
The first line simply plots the original time series, and the second line adds the
intervention effect itself.
With all of the setup work we have done, reproducing Fig. 9.4 now only requires
three lines of code:
full.pred<-fitted(dynamic.mod)
plot(full.pred,ylab="Energy Policy Stories",type="l",
ylim=c(0,225))
points(y, pch=20)
The first line simply uses the fitted command to extract fitted values from the
transfer function model. The second line plots these fitted values, and the third adds
the points that represent the original series.
9.6 Practice Problems
This set of practice problems reviews each of the three approaches to time series
modeling introduced in the chapter, and then poses a bonus question about the
Peake and Eshbaugh-Soha energy data that asks you to learn about a new method.
Questions #1–3 relate to single-equation models, so all of these questions use a
dataset about electricity consumption in Japan. Meanwhile, question #4 uses US
economic data for a multiple equation model.
1. Time series visualization: Wakiyama et al. (2014) study electricity consump-
tion in Japan, assessing whether the March 11, 2011, Fukushima nuclear
accident affected electricity consumption in various sectors. They do this by
conducting an intervention analysis on monthly measures of electricity con-
sumption in megawatts (MW), from January 2008 to December 2012. Load
the foreign package and open these data in Stata format from the file
comprehensiveJapanEnergy.dta. This data file is available from the
Dataverse (see page vii) or this chapters online content (see page 155). We will
focus on household electricity consumption (variable name: house). Take the
logarithm of this variable and draw a line plot of logged household electricity
consumption from month-to-month. What patterns are apparent in these data?
2. BoxJenkins modeling:
a. Plot the autocorrelation function and partial autocorrelation function for
logged household electricity consumption in Japan. What are the most
apparent features from these figures?
b. Wakiyama et al. (2014) argue that an ARIMA(1,0,1), with a seasonal
ARIMA(1,0,0) component, fits this series. Estimate this model and report
your results. (Hint: For this model, you will want to include the option
seasonal=list(order=c(1,0,0), period=12) in the arima
command.)
c. How well does this ARIMA model fit? What do the ACF and PACF look like
for the residuals from this model? What is the result of a Ljung–Box Q-test?
d. Use the arimax command from the TSA package. Estimate a model that uses
the ARIMA error process from before, the static predictors of temperature
(temp) and squared temperature (temp2), and a transfer function for the
Fukushima intervention (dummy).
e. Bonus: The Fukushima indicator is actually a step intervention, rather than
a pulse. This means that the effect cumulates rather than decays. Footnote 8
describes how these effects cumulate. Draw a picture of the cumulating effect
of the Fukushima intervention on logged household electricity consumption.
3. Econometric modeling:
a. Fit a static linear model using OLS for logged household electricity consump-
tion in Japan. Use temperature (temp), squared temperature (temp2), and
the Fukushima indicator (dummy) as predictors. Load the lmtest package,
and compute both a Durbin–Watson and a Breusch–Godfrey test for this linear
model. What conclusions would you draw from each? Why do you think you
get this result?
b. Reestimate the static linear model of logged household electricity con-
sumption using FGLS with the Cochrane–Orcutt algorithm. How similar or
different are your results from the OLS results? Why do you think this is?
c. Load the dyn package, and add a lagged dependent variable to this model.
Which of the three econometric models do you think is the most appropriate
and why? Do you think your preferred econometric model or the Box–Jenkins
intervention analysis is more appropriate? Why?
4. Vector autoregression:
a. Enders (2009, p. 315) presents quarterly data on the US economy, which
runs from the second quarter of 1959 to the first quarter of 2001. Load the
vars and foreign packages, and then open the data in Stata format from
the file moneyDem.dta. This file is available from the Dataverse (see page
vii) or this chapters online content (see page 155). Subset the data to only
include three variables: change in logged real GDP (dlrgdp), change in the
real M2 money supply (dlrm2), and change in the 3-month interest rate on
US Treasury bills (drs). Using the VARselect command, determine the
best-fitting lag length for a VAR model of these three variables, according
to the AIC.
b. Estimate the model you determined to be the best fit according to the AIC.
Examine the diagnostic plots. Do you believe these series are clear of serial
correlation and that the functional form is correct?
c. For each of the three variables, test whether the variable Granger-causes the
other two.
d. In monetary policy, the interest rate is an important policy tool for the Federal
Reserve Bank. Compute an impulse response function for a percentage point
increase in the interest rate (drs). Draw a plot of the expected changes in
logged money supply (dlrm2) and logged real GDP (dlrgdp). (Hint: Include
the option response=c("dlrgdp","dlrm2") in the irf function.)
Be clear about whether you are orthogonalizing the residuals or making a
theoretical assumption about response ordering.
5. Bonus: You may have noticed that Peake and Eshbaugh-Soha's (2008) data on
monthly television coverage of the energy issue was used both as an example
for count regression in Chap. 7 and as an example time series in this chapter.
Brandt and Williams (2001) develop a Poisson autoregressive (PAR) model for
time series count data, and Fogarty and Monogan (2014) apply this model to
these energy policy data. Replicate this PAR model on these data. For replication
information see: http://hdl.handle.net/1902.1/16677.
Chapter 10
Linear Algebra with Programming Applications
The R language has several built-in matrix algebra commands. This proves useful
for analysts who wish to write their own estimators or have other problems in
linear algebra that they wish to compute using software. In some instances, it is
easier to apply a formula for predictions, standard errors, or some other quantity
directly rather than searching for a canned program to compute the quantity, if one
exists. Matrix algebra makes it straightforward to compute these quantities yourself.
This chapter introduces the syntax and available commands for conducting matrix
algebra in R.
The chapter proceeds by first describing how a user can input original data by
hand, as a means of creating vectors, matrices, and data frames. Then it presents
several of the commands associated with linear algebra. Finally, we work through
an applied example in which we estimate a linear model with ordinary least squares
by programming our own estimator.
As working data throughout the chapter, we consider a simple example from the
2010 US congressional election. We model the Republican candidate's share of the
two-party vote in elections for the House of Representatives (y) in 2010. The input
variables are a constant (x1), Barack Obama's share of the two-party presidential
vote in 2008 (x2), and the Republican candidate's financial standing relative to the
Democrat in hundreds of thousands of dollars (x3). For simplicity, we model the
nine House races in the state of Tennessee. The data are presented in Table 10.1.
As a first task, when assigning values to a vector or matrix, we must use the
traditional assignment command (<-). The command c combines several component
elements into a vector object.1 So to create a vector a with the specific values of 3,
4, and 5:
a <- c(3,4,5)
Within c, all we need to do is separate each element of the vector with a comma.
As a more interesting example, suppose we wanted to input three of the variables
from Table 10.1: Republican share of the two-party vote in 2010, Obamas share of
the two-party vote in 2008, and Republican financial advantage, all as vectors. We
would type:
Y<-c(.808,.817,.568,.571,.421,.673,.724,.590,.251)
X2<-c(.29,.34,.37,.34,.56,.37,.34,.43,.77)
X3<-c(4.984,5.073,12.620,-6.443,-5.758,15.603,14.148,0.502,
-9.048)
Whenever entering data in vector form, we should generally make sure we have
entered the correct number of observations. The length command returns how
many observations are in a vector. To check all three of our vector entries, we type:
length(Y); length(X2); length(X3)
All three of our vectors should have length 9 if they were entered correctly. Observe
here the use of the semicolon (;). If a user prefers to place multiple commands on
a single line of text, the user can separate each command with a semicolon instead
of putting each command on a separate line. This allows simple commands to be
stored more compactly.
1 Alternatively, when the combined elements are complex objects, c instead creates a list object.
If we want to create a vector that follows a repeating pattern, we can use the rep
command. For instance, in our model of Tennessee election returns, our constant
term is simply a 9 × 1 vector of 1s:
X1 <- rep(1, 9)
The first term within rep is the term to be repeated, and the second term is the
number of times it should be repeated.
Sequential vectors also are simple to create. A colon (:) prompts R to list
sequential integers from the starting to the stopping point. For instance, to create
an index for the congressional district of each observation, we would need a vector
that contains values counting from 1 to 9:
index <- c(1:9)
Any vector we create can be printed to the screen simply by typing the name of
the vector. For instance, if we simply type Y into the command prompt, our vector
of Republican vote share is printed in the output:
[1] 0.808 0.817 0.568 0.571 0.421 0.673 0.724
[8] 0.590 0.251
In this case, the nine values of the vector Y are printed. The number printed in the
square braces at the start of each row offers the index of the first element in that row.
For instance, the eighth observation of Y is the value 0.590. For longer vectors, this
helps the user keep track of the indices of the elements within the vector.
Turning to matrices, we can create an object of class matrix in several possible ways.
First, we could use the matrix command: In the simplest possible case, suppose
we wanted to create a matrix with all of the values being the same. To create a 4 × 4
matrix b with every value equaling 3:
b <- matrix(3, ncol=4, nrow=4, byrow=FALSE)
The syntax of the matrix command first calls for the elements that will define
the matrix: In this case, we listed a single scalar, so this number was repeated for
all matrix cells. Alternatively, we could list a vector instead that includes enough
entries to fill the entire matrix. We then need to specify the number of columns
(ncol) and rows (nrow). Lastly, the byrow argument is set to FALSE by default.
With the FALSE setting, R fills the matrix column-by-column. A TRUE setting fills
the matrix row-by-row instead. As a rule, whenever creating a matrix, type the name
of the matrix (b in this case) into the command console to see if the way R input the
data matches what you intended.
A second simple option is to create a matrix from vectors that have already been
entered. We can bind the vectors together as column vectors using the cbind
command and as row vectors using the rbind command. For example, in our
model of Tennessee election returns, we will need to create a matrix of all input
variables in which variables define the columns and observations define the rows.
Since we have defined our three variable vectors (and each vector is ordered by
observation), we can simply create such a matrix using cbind:
X<-cbind(1,X2,X3)
X
The cbind command treats our vectors as columns in a matrix. This is what we
want since a predictor matrix defines rows with observations and columns with
variables. The 1 in the cbind command ensures that all elements of the first column
are equal to the constant 1. (Of course, the way we designed X1, we also could have
included that vector.) When typing X in the console, we get the printout:
X2 X3
[1,] 1 0.29 4.984
[2,] 1 0.34 5.073
[3,] 1 0.37 12.620
[4,] 1 0.34 -6.443
[5,] 1 0.56 -5.758
[6,] 1 0.37 15.603
[7,] 1 0.34 14.148
[8,] 1 0.43 0.502
[9,] 1 0.77 -9.048
These results match the data we have in Table 10.1, so our covariate matrix should
be ready when we are ready to estimate our model.
Just to illustrate the rbind command, R easily would combine the vectors as
rows as follows:
T<-rbind(1,X2,X3)
T
We do not format data like this, but to see how the results look, typing T in the
console results in the printout:
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
1.000 1.000 1.00 1.000 1.000 1.000 1.000
X2 0.290 0.340 0.37 0.340 0.560 0.370 0.340
X3 4.984 5.073 12.62 -6.443 -5.758 15.603 14.148
[,8] [,9]
1.000 1.000
X2 0.430 0.770
X3 0.502 -9.048
Hence, we see that each variable vector makes a row and each observation makes a
column. In the printout, when R lacks the space to print an entire row on a single
line, it wraps all rows at once, thus presenting columns 8 and 9 later.
Third, we could create a matrix by using subscripting. (Additional details
on vector and matrix subscripts are presented in Sect. 10.1.3.) Sometimes when
creating a matrix, the user will know all of the values up front, but on other occasions
a user must create a matrix and then fill in the values later. In the latter case, it is a
good idea to create blanks to fill in by designating every cell as missing (or NA).
The nice feature of this is that a user can easily identify a cell that was never filled
in. By contrast, if a matrix is created with some default numeric value, say 0, then
later on it is impossible to distinguish a cell that has a default 0 from one with a true
value of 0. So if we wanted to create a 3 × 3 matrix named blank to fill in later, we
would write:
blank <- matrix(NA, ncol=3, nrow=3)
If we then wanted to assign the value of 8 to the first row, third column element, we
would write:
blank[1,3] <- 8
If we then wanted to insert the value π (= 3.141592...) into the second row, first
column entry, we would write:
blank[2,1] <- pi
If we wanted to use our previously defined vector a = (3, 4, 5)′ to define the second
column, we would write:
blank[,2] <- a
We then could check our progress simply by typing blank into the command
prompt, which would print:
[,1] [,2] [,3]
[1,] NA 3 8
[2,] 3.141593 4 NA
[3,] NA 5 NA
To the left of the matrix, the row terms are defined. At the top of the matrix, the
column terms are defined. Notice that four elements are still coded NA because a
replacement value was never offered.
Fourth, in contrast to filling in matrices after creation, we also may know the
values we want at the time of creating the matrix. As a simple example, to create a
2 × 2 matrix W in which we list the value of each cell column-by-column:
W <- matrix(c(1,2,3,4), ncol=2, nrow=2)
Notice that our first argument is now a vector because we want to provide unique
elements for each of the cells in the matrix. With four elements in the vector, we
have the correct number of entries for a 2 × 2 matrix. Also, in this case, we ignored
the byrow argument because the default is to fill in by columns. By contrast, if
we wanted to list the cell elements row-by-row in matrix Z, we would simply set
byrow to TRUE:
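A sketch of that call, reusing the same cell entries as W but filling the matrix row-by-row:
Z <- matrix(c(1,2,3,4), ncol=2, nrow=2, byrow=TRUE)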
Type W and Z into the console and observe how the same vector of cell entries has
been reorganized from one matrix to the other.
Alternatively, suppose we wanted to create a 10 × 10 matrix N where every cell
entry was a random draw from a normal distribution with mean 10 and standard
deviation 2, or N(10, 4):
N <- matrix(rnorm(100, mean=10, sd=2), nrow=10, ncol=10)
Because rnorm returns an object of the vector class, we are again listing a vector
to create our cell entries. The rnorm command is drawing from the normal
distribution with our specifications 100 times, which provides the number of cell
entries we need for a 10 × 10 matrix.
Fifth, on many occasions, we will want to create a diagonal matrix that contains
only elements on its main diagonal (from the top left to the bottom right), with zeros
in all other cells. The command diag makes it easy to create such a matrix:
D <- diag(c(1:4))
By typing D into the console, we now see how this diagonal matrix appears:
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 2 0 0
[3,] 0 0 3 0
[4,] 0 0 0 4
Additionally, if one inserts a square matrix into the diag command, it will return a
vector from the matrix's diagonal elements. For example, in a variance-covariance
matrix, the diagonal elements are variances and can be extracted quickly in this way.
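For instance, applying diag to the diagonal matrix D created above returns its main diagonal as a vector:
diag(D)    # returns the vector 1 2 3 4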
A final means of creating a matrix object is with the as.matrix command. This
command can take an object of the data frame class and convert it to an object
of the matrix class. For a data frame called mydata, for example, the command
mydata.2 <- as.matrix(mydata) would coerce the data into matrix form,
making all of the ensuing matrix operations applicable to the data. Additionally,
typing mydata.3 <- as.matrix(mydata[,4:8]) would only take columns
four through eight of the data frame and coerce those into a matrix, allowing the
user to subset the data while creating a matrix object.
Similarly, suppose we wanted to take an object of the matrix class and create an
object of the data frame class, perhaps our Tennessee electoral data from Table 10.1.
In this case, the as.data.frame command will work. If we simply wanted to
create a data frame from our covariate matrix X, we could type:
X.df <- as.data.frame(X)
If we wanted to create a data frame using all of the variables from Table 10.1,
including the dependent variable and index, we could type:
tennessee <- as.data.frame(cbind(index,Y,X1,X2,X3))
10.1.3 Subscripting
To index an n × k matrix X for the value Xij, where i represents the row and j
represents the column, use the syntax X[i,j]. If we want to select all values of the
jth column of X, we can use X[,j]. For example, to return the second column of
matrix X, type:
X[,2]
Alternatively, if a column has a name, then the name (in quotations) can be used to
call the column as well. For example:
X[,"X2"]
Similarly, if we wish to select all of the elements of the ith row of X, we can use
X[i,]. For the first row of matrix X, type:
X[1,]
Alternatively, we could use a row name as well, though the matrix X does not have
names assigned to the rows.
If we wish, though, we can create row names for matrices. For example:
rownames(X)<-c("Dist. 1","Dist. 2","Dist. 3","Dist. 4",
"Dist. 5","Dist. 6","Dist. 7","Dist. 8","Dist. 9")
Similarly, the command colnames allows the user to define column names. If we
simply type rownames(X) or colnames(X) without making an assignment, R
will print the row or column names saved for a matrix.
10.2 Vector and Matrix Commands
Now that we have a sense of creating vectors and matrices, we turn to commands
that either extract information from these objects or allow us to conduct linear
algebra with them. As was mentioned before, after entering a vector or matrix into
R, in addition to printing the object onto the screen for visual inspection, it is also
a good idea to check and make sure the dimensions of the object are correct. For
instance, to obtain the length of a vector a:
length(a)
Similarly, to obtain the dimensions of a matrix, type:
dim(X)
The dim command first prints the number of rows, then the number of columns.
Checking these dimensions offers an extra assurance that the data in a matrix have
been entered correctly.
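For instance, with the vector a = (3, 4, 5)′ and the Tennessee covariate matrix X used in this chapter, these checks would print the following (shown as comments; this added illustration assumes those objects are already in memory):
length(a)   # [1] 3
dim(X)      # [1] 9 3, nine districts and three predictors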
For vectors, the elements can be treated as data, and summary quantities can be
extracted. For instance, to add up the sum of a vector's elements:
sum(X2)
Similarly, to take the mean of a vector's elements (in this example, the mean of
Obama's 2008 vote by district):
mean(X2)
And to take the variance of a vector:
var(X2)
Another option we have for matrices and vectors is that we can sample from a
given object with the command sample. Suppose we want a sample of ten numbers
from N, our 10 × 10 matrix of random draws from a normal distribution:
set.seed(271828183)
N <- matrix(rnorm(100, mean=10, sd=2), nrow=10, ncol=10)
s <- sample(N,10)
The first command, set.seed, makes this simulation more replicable. (See
Chap. 11 for more detail about this command.) This gives us a vector named s of
ten random elements from N. We also have the option of applying the sample
command to vectors.2
The apply command is often the most efficient way to do vectorized calcula-
tions. For example, to calculate the means for all the columns in our Tennessee data
matrix X:
apply(X, 2, mean)
2
For readers interested in bootstrapping, which is one of the most common applications of sampling
from data, the most efficient approach will be to install the boot package and try some of the
examples that library offers.
In this case, the first argument lists the matrix to analyze. The second argument, 2,
tells R to apply a mathematical function along the columns of the matrix. The third
argument, mean, is the function we want to apply to each column. If we wanted the
mean of the rows instead of the columns, we could use a 1 as the second argument
instead of a 2. Any function defined in R can be used with the apply command.
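As a brief illustration of these options (an added example, again assuming the Tennessee matrix X is in memory):
apply(X, 1, sum)                           # apply a function across rows: one sum per district
apply(X, 2, sd)                            # standard deviation of each column
apply(X, 2, function(x) max(x) - min(x))   # any user-defined function works as well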
Commands that are more specific to matrix algebra are also available. Whenever
the user wants to save the output of a vector or matrix operation, the assignment
command must be used. For example, if for some reason we needed the difference
between each Republican share of the two-party vote and Obama's share of the two-
party vote, we could assign vector m to the difference of the two vectors by typing:
m <- Y - X2
With arithmetic operations, R is actually pretty flexible, but a word on how the
commands work may avoid future confusion. For addition (+) and subtraction (-): If
the arguments are two vectors of the same length (as is the case in computing vector
m), then R computes regular vector addition or subtraction where each element is
added to or subtracted from its corresponding element in the other vector. If the
arguments are two matrices of the same size, matrix addition or subtraction applies
where each entry is combined with its corresponding entry in the other matrix. If
one argument is a scalar, then the scalar will be added to each element of the vector
or matrix.
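A brief illustration of these rules with small stand-alone objects (not drawn from the original text):
v <- c(1, 2, 3)
w <- c(4, 5, 6)
v + w                              # vector addition: 5 7 9
M <- matrix(1:4, nrow=2, ncol=2)
M - 1                              # scalar subtraction: subtracts 1 from every cell
M + M                              # matrix addition, cell by cell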
Note: If two vectors of uneven length or two matrices of different size are added,
an error message results, as these arguments are non-conformable. To illustrate this,
if we attempted to add our sample of ten numbers from before, s, to our vector of
Obama's 2008 share of the vote, X2, as follows:
s + X2
In this case we would receive an output that we should ignore, followed by a warning
message in red:
[1] 11.743450 8.307068 11.438161 14.251645 10.828459
[6] 10.336895 9.900118 10.092051 12.556688 9.775185
Warning message:
In s + X2 : longer object length is not a multiple of shorter
object length
This warning message tells us the output is nonsense. Since the vector s has length
10, while the vector X2 has length 9, what R has done here is added the tenth
element of s to the first element of X2 and used this as the tenth element of the
output. The warning message that follows serves as a reminder that a garbage input that
breaks the rules of linear algebra produces a garbage output that no one should use.
Exponents also are applied element by element: for example, typing X^2 would
return a matrix that squares each separate cell in the matrix X. By
a similar token, the simple multiplication operation (*) performs multiplication
element by element. Hence, if two vectors are of the same length or two matrices
of the same size, the output will be a vector or matrix of the same size where each
element is the product of the corresponding elements. Just as an illustration, let us
multiply Obama's share of the two-party vote by the Republican financial advantage in
a few ways. (The quantities may be silly, but this offers an example of how the code
works.) Try for example:
x2.x3 <- X2*X3
This returns a vector in which each element of X2 is multiplied by the corresponding
element of X3. By contrast, if we wanted the inner product, or dot product (x2′x3),
we would use the matrix multiplication operator and type X2%*%X3. This equals
6.25681 in our example. Another useful quantity for vector multiplication is the outer
product, or tensor product (x2x3′). We can obtain this by transposing the second
vector in our code:
x2.x3.outer <- X2%*%t(X3)
The output of this command is a 9 × 9 matrix. To obtain this quantity, the transpose
command (t) was used.3 In matrix algebra, we may need to turn row vectors
into column vectors or vice versa, and the transpose operation accomplishes this.
Similarly, when applied to a matrix, every row becomes a column and every column
becomes a row.
As one more example of matrix multiplication, the input variable matrix X has
size 9 × 3, and the matrix T that lets variables define rows is a 3 × 9 matrix. In
this case, we could create the matrix P = TX because the number of columns of
T is the same as the number of rows of X. We therefore can say that the matrices
are conformable for multiplication and will result in a matrix of size 3 × 3. We can
compute this as:
3
An alternate syntax would have been X2%o%X3.
P <- T%*%X
Note that if our matrices are not conformable for multiplication, then R will return
an error message.4
Beyond matrix multiplication and the transpose operation, another important
quantity that is unique to matrix algebra is the determinant of a square matrix (or
a matrix that has the same number of rows as columns). Our matrix P is square
because it has three rows and three columns. One reason to calculate the determinant
is that a square matrix has an inverse only if the determinant is nonzero.5 To compute
the determinant of P, we type:
det(P)
We also will need the inverse of P, which we can obtain with the solve command
and then verify by multiplying the result by the original matrix:
invP <- solve(P)
invP%*%P
On the first line, we created P⁻¹, which is the inverse of P. On the second line, we
multiply P⁻¹P and the printout is:
              X1            X2            X3
X1  1.000000e+00 -6.106227e-16 -6.394885e-14
X2  1.421085e-14  1.000000e+00  1.278977e-13
X3  2.775558e-17 -2.255141e-17  1.000000e+00
This is the basic form of the identity matrix, with values of 1 along the main diagonal
(running from the top left to bottom right), and values of 0 off the diagonal. While
the off-diagonal elements are not listed exactly as zero, this can be attributed to
rounding error on R's part. The scientific notation for the second row, first column
element, for example, means that the first 13 digits after the decimal place are zero,
followed by a 1 in the 14th digit.
4
A unique variety of matrix multiplication is called the Kronecker product (H ⊗ L). The Kronecker
product has useful applications in the analysis of panel data. See the kronecker command in R
for more information.
5
As another application in statistics, the likelihood function for a multivariate normal distribution
also calls on the determinant of the covariance matrix.
10.3 Applied Example: Programming OLS Regression
To illustrate the various matrix algebra operations that R has available, in this section
we will work an applied example by computing the ordinary least squares (OLS)
estimator with our own program using real data.
First, to motivate the background to the problem, consider the formulation of the
model and how we would estimate it by hand. Our population linear regression
model for vote shares in Tennessee is: y = Xβ + u. In this model, y is a vector of
the Republican vote share in each district, X is a matrix of predictors (including
a constant, Obama's share of the vote in 2008, and the Republican's financial
advantage relative to the Democrat), β consists of the partial coefficient for each
predictor, and u is a vector of disturbances. We estimate β̂ = (X′X)⁻¹X′y, yielding
the sample regression function ŷ = Xβ̂.
To start computing by hand, we have to define X. Note that we must include a
vector of 1s in order to estimate an intercept term. In scalar form, our population
model is: yi = β1x1i + β2x2i + β3x3i + ui, where x1i = 1 for all i. This gives us the
predictor matrix:
\[
X = \begin{bmatrix}
1 & 0.29 & 4.984 \\
1 & 0.34 & 5.073 \\
1 & 0.37 & 12.620 \\
1 & 0.34 & -6.443 \\
1 & 0.56 & -5.758 \\
1 & 0.37 & 15.603 \\
1 & 0.34 & 14.148 \\
1 & 0.43 & 0.502 \\
1 & 0.77 & -9.048
\end{bmatrix}
\]
We also need X′y:
\[
X'y = \begin{bmatrix}
1.000 & 1.000 & 1.000 & 1.000 & 1.000 & 1.000 & 1.000 & 1.000 & 1.000 \\
0.290 & 0.340 & 0.370 & 0.340 & 0.560 & 0.370 & 0.340 & 0.430 & 0.770 \\
4.984 & 5.073 & 12.620 & -6.443 & -5.758 & 15.603 & 14.148 & 0.502 & -9.048
\end{bmatrix}
\begin{bmatrix}
0.808 \\ 0.817 \\ 0.568 \\ 0.571 \\ 0.421 \\ 0.673 \\ 0.724 \\ 0.590 \\ 0.251
\end{bmatrix}
\]
Inverse by Hand
The last quantity we need is the inverse of X′X. Doing this by hand, we can solve
by Gauss-Jordan elimination:
\[
[\,X'X \mid I\,] = \left[\begin{array}{rrr|rrr}
9.00000 & 3.81000 & 31.68100 & 1.00000 & 0.00000 & 0.00000 \\
3.81000 & 1.79610 & 6.25681 & 0.00000 & 1.00000 & 0.00000 \\
31.68100 & 6.25681 & 810.24462 & 0.00000 & 0.00000 & 1.00000
\end{array}\right]
\]
Divide row 1 by 9:
\[
\left[\begin{array}{rrr|rrr}
1.00000 & 0.42333 & 3.52011 & 0.11111 & 0.00000 & 0.00000 \\
3.81000 & 1.79610 & 6.25681 & 0.00000 & 1.00000 & 0.00000 \\
31.68100 & 6.25681 & 810.24462 & 0.00000 & 0.00000 & 1.00000
\end{array}\right]
\]
Continuing these row operations to completion eventually isolates (X′X)⁻¹ on the
right-hand side of the augmented matrix. As a slight wrinkle in these hand calculations,
we can see that (X′X)⁻¹ is a little off due to rounding error: it should actually be a
symmetric matrix.
Since inverting the matrix took so many steps and even ran into some rounding error,
it will be easier to have R do some of the heavy lifting for us. (As the number of
observations or variables rises, we will find the computational assistance even more
valuable.) In order to program our own OLS estimator in R, the key commands we
require are:
Matrix multiplication: %*%
Transpose: t
Matrix inverse: solve
Knowing this, it is easy to program an estimator for β̂ = (X′X)⁻¹X′y.
First we must enter our variable vectors using the data from Table 10.1 and
combine the input variables into a matrix. If you have not yet entered these, type:
Y<-c(.808,.817,.568,.571,.421,.673,.724,.590,.251)
X1 <- rep(1, 9)
X2<-c(.29,.34,.37,.34,.56,.37,.34,.43,.77)
X3<-c(4.984,5.073,12.620,-6.443,-5.758,15.603,14.148,0.502,
-9.048)
X<-cbind(X1,X2,X3)
To estimate OLS with our own program, we simply need to translate the estimator
(X′X)⁻¹X′y into R syntax:
beta.hat<-solve(t(X)%*%X)%*%t(X)%*%Y
beta.hat
Breaking this down a bit: The solve command leads off because the first quantity
is an inverse, .X0 X/1 . Within the call to solve, the first argument must be
transposed (hence the use of t) and then it is postmultiplied by the non-transposed
covariate matrix (hence the use of %*%). We follow up by postmultiplying the
transpose of X, then postmultiplying the vector of outcomes (y). When we print
our results, we get:
            [,1]
X1  1.017845630
X2 -1.001809341
X3  0.002502538
These are the same results we obtained by hand, despite the rounding discrepancies
we encountered when inverting on our own. Writing the program in R, however,
simultaneously gave us full control over the estimator, while being much quicker
than the hand operations.
Of course an even faster option would be to use the canned lm command we used
in Chap. 6:
tennessee <- as.data.frame(cbind(Y,X2,X3))
lm(Y~X2+X3, data=tennessee)
This also yields the exact same results. In practice, whenever computing OLS
estimates, this is almost always the approach we will want to take. However, this
procedure has allowed us to verify the usefulness of Rs matrix algebra commands.
If the user finds the need to program a more complex estimator for which there is
no canned command, these tools should offer a relevant foundation for accomplishing
the task.
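As one illustration of how the same commands extend further (a sketch that goes beyond the original program, assuming X, Y, and beta.hat are still in memory), we could also compute the usual OLS standard errors with matrix algebra:
e <- Y - X%*%beta.hat                    # residuals
sigma2 <- sum(e^2)/(nrow(X) - ncol(X))   # estimated error variance
vcov.beta <- sigma2*solve(t(X)%*%X)      # variance-covariance matrix of the estimates
sqrt(diag(vcov.beta))                    # standard errors for each coefficient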
With these basic tools in hand, users should now be able to begin programming with
matrices and vectors. Some intermediate applications of this include: computing
standard errors from models estimated with optim (see Chap. 11) and predicted
values for complicated functional forms (see the code that produced Fig. 9.3). Some
of the more advanced applications that can benefit from using these tools include:
feasible generalized least squares (including weighted least squares), optimizing
likelihood functions for multivariate normal distributions, and generating correlated
data using Cholesky decomposition (see the chol command). Very few programs
offer the estimation and programming flexibility that R does whenever matrix
algebra is essential to the process.
10.4 Practice Problems
For these practice problems, consider the data on congressional elections in Arizona
in 2010, presented in Table 10.2.
11 Additional Programming Tools
As the last several chapters have shown, R offers users flexibility and opportunities
for advanced data analysis that cannot be found in many programs. In this final
chapter, we will explore Rs programming tools, which allow the user to create
code that addresses any unique problem he or she faces.
Bear in mind that many of the tools relevant to programming have already been
introduced earlier in the book. In Chap. 10, the tools for matrix algebra in R were
introduced, and many programs require matrix processing. Additionally, logical
(or Boolean) statements are essential to programming. R's logical operators were
introduced in Chap. 2, so see Table 2.1 for a reminder of what each operator's
function is.
The subsequent sections will introduce several other tools that are important for
programming: probability distributions, new function definition, loops, branching,
and optimization (which is particularly useful for maximum likelihood estimation).
The chapter will end with two large applied examples. The first, drawing from
Monogan (2013b), introduces object-oriented programming in R and applies several
programming tools from this chapter to finding solutions to an insoluble game
theoretic problem. The second, drawing from Signorino (1999), offers an example
of Monte Carlo analysis and more advanced maximum likelihood estimation in R.1
Together, the two applications should showcase how all of the programming tools
can come together to solve a complex problem.
R allows you to use a wide variety of distributions for four purposes. For each
distribution, R allows you to call the cumulative distribution function (CDF),
probability density function (PDF), quantile function, and random draws from the
distribution. All probability distribution commands consist of a prefix and a suffix.
Table 11.1 presents the four prefixes, and their usage, as well as the suffixes for
some commonly used probability distributions. Each distribution's functions take
arguments unique to that probability distribution's parameters. To see how these are
specified, use help files (e.g., ?punif, ?pexp, or ?pnorm).2
If you wanted to know the probability that a standard normal observation will be
less than 1.645, use the cumulative distribution function (CDF) command pnorm:
pnorm(1.645)
Suppose you want to draw a scalar from the standard normal distribution: to draw
a N(0, 1), use the random draw command rnorm:
a <- rnorm(1)
To draw a vector with ten values from a χ² distribution with four degrees of freedom,
use the random draw command:
c <- rchisq(10,df=4)
Recall from Chap. 10 that the sample command also allows us to simulate values,
whenever we provide a vector of possible values. Hence, R offers a wide array of
data simulation commands.
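For instance, a quick stand-alone use of sample as a simulation tool (an added illustration, not from the original text):
set.seed(1234)
rolls <- sample(1:6, size=10, replace=TRUE)   # ten rolls of a fair six-sided die
table(rolls)                                  # frequency of each face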
2
Other distributions can be loaded through various packages. For example, another useful
distribution is the multivariate normal. By loading the MASS library, a user can sample from the
multivariate normal distribution with the mvrnorm command. Even more generally, the mvtnorm
package allows the user to compute multivariate normal and multivariate t probabilities, quantiles,
random deviates, and densities.
Suppose we have a given probability, 0.9, and we want to know the value of a χ²
distribution with four degrees of freedom at which the probability of being less than
or equal to that value is 0.9. This calls for the quantile function:
qchisq(.9,df=4)
We can calculate the probability of a certain value from the probability mass
function (PMF) for a discrete distribution. Of a Poisson distribution with intensity
parameter 9, what is the probability of a count of 5?
dpois(5,lambda=9)
Although usually of less interest, for a continuous distribution, we can calculate the
value of the probability density function (PDF) of a particular value. This has no
inherent meaning, but occasionally is required. For a normal distribution with mean
4 and standard deviation 2, the density at the value 1 is given by:
dnorm(1,mean=4,sd=2)
11.2 Functions
R allows you to create your own functions with the function command. The
function command uses the basic following syntax:
function.name <- function(INPUTS){BODY}
Notice that the function command first expects the inputs to be listed in round
parentheses, while the body of the function is listed in curly braces. As can be
seen, there is little constraint in what the user chooses to put into the function, so a
function can be crafted to accomplish whatever the user would like.
For example, suppose we were interested in the equation y = 2 + 1/x². We could
define this function easily in R. All we have to do is specify that our input variable
is x and our body is the right-hand side of the equation. This will create a function
that returns values for y. We define our function, named first.fun, as follows:
first.fun<-function(x){
y<-2+x^{-2}
return(y)
}
Although broken-up across a few lines, this is all one big command, as the curly
braces ({}) span multiple lines. With the function command, we begin by
declaring that x is the name of our one input, based on the fact that our equation
of interest has x as the name of the input variable. Next, inside the curly braces,
we assign the output y as being the exact function of x that we are interested in.
As a last step before closing our curly braces, we use the return command,
which tells the function what the output result is after it is called. The return
command is useful when determining function output for a couple of reasons.
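Once first.fun is defined, we can call it like any built-in command; for example (an added illustration, with output shown in comments):
first.fun(2)          # returns 2.25, since 2 + 1/2^2 = 2.25
first.fun(c(-1, 1))   # vector input works too, returning 3 3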
Fig. 11.1 Plot of the function y = 2 + 1/x²
As a more complicated example, consider the expected utility of
a political party in a model of how parties will choose an issue position when
competing in sequential elections (Monogan 2013b, Eq. (6)):
\[
EU_A(\theta_A, \theta_D) = \Lambda\!\left(-(m_1-\theta_A)^2 + (m_1-\theta_D)^2 + V\right) + \delta\,\Lambda\!\left(-(m_2-\theta_A)^2 + (m_2-\theta_D)^2\right) \qquad (11.1)
\]
In this equation, the expected utility to a political party (party A) is the sum of
two cumulative logistic distribution functions (the functions Λ), with the second
downweighted by a discount term (0 ≤ δ ≤ 1). Utility depends on the positions
taken by party A and party D (θA and θD), a valence advantage to party A (V), and
the issue position of the median voter in the first and second election (m1 and m2).
This is now a function of several variables, and it requires us to use the CDF of the
logistic distribution. To input this function in R, we type:
Quadratic.A<-function(m.1,m.2,p,delta,theta.A,theta.D){
util.a<-plogis(-(m.1-theta.A)^2+
(m.1-theta.D)^2+p)+
delta*plogis(-(m.2-theta.A)^2+
(m.2-theta.D)^2)
return(util.a)
}
The plogis command computes each relevant probability from the logistic CDF,
and all of the other terms are named self-evidently from Eq. (11.1) (except that p
refers to the valence term V). Although this function was more complicated, all we
had to do is make sure we named every input and copied Eq. (11.1) entirely into the
body of the function.
This more complex function, Quadratic.A, still behaves like our simple
function. For instance, we could go ahead and supply a numeric value for every
argument of the function like this:
Quadratic.A(m.1=.7,m.2=-.1,p=.1,delta=0,theta.A=.7,theta.D=.7)
R returns an expected utility of approximately 0.52 for these values.
Fig. 11.2 Plot of an advantaged party's expected utility over two elections contingent on issue position
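The code that produced Fig. 11.2 does not survive in this excerpt. A minimal sketch consistent with the description that follows (the object names positions and util.a and the exact range of candidate positions are assumptions, not the book's original code):
positions <- seq(-1, 1, by=.01)        # candidate issue positions for party A
util.a <- Quadratic.A(m.1=.7, m.2=-.1, p=.1, delta=0,
     theta.A=positions, theta.D=.7)    # expected utility at each candidate position
plot(positions, util.a, type="l",
     xlab="Party A's Issue Position", ylab="Party A's Expected Utility")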
In the first line, we generate a range of positions party A may choose for θA. On the
second line, we calculate a vector of expected utilities to party A at each of the issue
positions we consider, setting the other parameters at their previously mentioned
values. Lastly, we use the plot function to draw a line graph of these utilities.
It turns out that the maximum of this function is at θA = 0.7, so our previously
mentioned utility of 0.52 is the best party A can do after all.
11.3 Loops
Loops are easy to write in R and can be used to repeat calculations that are either
identical or vary only by a few parameters. The basic structure for a loop using the
for command is:
for (i in 1:M) {COMMANDS}
In this case, M is the number of times the commands will be executed. R also
supports loops using the while command which follow a similar command
structure:
j <- 1
while(j < M) {
COMMANDS
j <- j + 1
}
Whether a for loop or a while loop works best can vary by situation, so the
user must use his or her own judgment when choosing a setup. In general, while
loops tend to be better for problems where you would have to do something until
a criterion is met (such as a convergence criterion), and for loops are generally
better for things you want to repeat a fixed number of times. Under either structure,
R will allow a wide array of commands to be included in a loop; the user's task is
to manage the loop's input and output efficiently.
As a simple demonstration of how loops work, consider an example that illus-
trates the law of large numbers. To do this, we can simulate random observations
from the standard normal distribution (easy with the rnorm command). Since the
standard normal has a population mean of zero, we would expect the sample mean
of our simulated values to be near zero. As our sample size gets larger, the sample
mean should generally be closer to the population mean of zero. A loop is perfect
for this exercise: We want to repeat the calculations of simulating from the standard
normal distribution and then taking the mean of the simulations. What differs from
iteration to iteration is that we want our sample size to increase.
We can set this up easily with the following code:
set.seed(271828183)
store <- matrix(NA,1000,1)
for (i in 1:1000){
a <- rnorm(i)
store[i] <- mean(a)
}
plot(store, type="h",ylab="Sample Mean",
xlab="Number of Observations")
abline(h=0,col="red",lwd=2)
In the first line of this code, we call the command set.seed, in order to make our
simulation experiment replicable. When R randomly draws numbers in any way,
it uses a pseudo-random number generator, which is a list of 2.1 billion numbers
that resemble random draws.3 By choosing any number from 1 to 2,147,483,647,
others should be able to reproduce our results by using the same numbers in their
simulation. Our choice of 271,828,183 was largely arbitrary. In the second line of
code, we create a blank vector named store of length 1000. This vector is where
we will store our output from the loop. In the next four lines of code, we define our
loop. The loop runs from 1 to 1000, with the index of each iteration being named
i. In each iteration, our sample size is simply the value of i, so the first iteration
simulates one observation, and the thousandth iteration simulates 1000 observations.
Hence, the sample size increases with each pass of the loop. In each pass of the
program, R samples from a N .0; 1/ distribution and then takes the mean of that
sample. Each mean is recorded in the ith cell of store. After the loop closes, we
plot our sample means against the sample size and use abline to draw a red line
3
Formally, this list makes up draws from a standard uniform distribution, which are then converted
to whatever distribution we want using a quantile function.
at the population mean of zero. The result is shown in Fig. 11.3. Indeed, our plot
shows the mean of the sample converging to the true mean of zero as the sample
size increases.
Loops are necessary for many types of programs, such as if you want to do a
Monte Carlo analysis. In some (but not all) cases, however, loops can be slower than
vectorized versions of commands. It may be worth trying the apply command, for
instance, if it can accomplish the same goal as a loop. Which is faster often depends
on the complexity of the function and the memory overhead for calculating and
storing results. So if one approach is too slow or too complicated for your computer
to handle, it may be worth trying the other.
11.4 Branching
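The introductory code block for this section does not survive in this excerpt. A minimal reconstruction consistent with the description that follows (and with the extended version later in this section) is:
even.count <- 0
for (i in 1:100){
  a <- sample(c(1:10), 2, replace=TRUE)   # draw two numbers from 1 to 10
  if (sum(a%%2)==0){                      # both remainders zero: both numbers even
    even.count <- even.count + 1
  }
}
even.count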
The first line creates a scalar named even.count and sets it at zero. The next
line starts the loop that gives us 100 trials. The third line creates our sample of
two numbers from 1 to 10, and names the sample a. The fourth line defines our
if statement: We use the modulo function to find the remainder when each term
in our sample is divided by two.4 If the remainder for every term is zero, then all
numbers are even and the sum is zero. In that case, we have drawn a sample in which
both numbers are even. Hence, when sum(a%%2)==0 is true, then we want to add one
to a running tally of how many samples of two even numbers we have. Therefore
the fifth line of our code adds to our tally, but only in cases that meet our condition.
(Notice that a recursive assignment for even.count is acceptable. R will take the
old value, add one to it, and update the value that is saved.) Try this experiment
yourself. As a hint, probability theory would say that 25 % of trials will yield two
even numbers, on average.
Users also may make if. . . else branching statements such that one set of
operations applies whenever an expression is true, and another set of operations
applies when the expression is false. This basic structure looks like this:
if (logical_expression) {
expression_1
...
} else {
expression_2
...
}
4
Recall from Chap. 1 that the modulo function gives us the remainder from division.
In this case, expression_1 will only be applied in cases when the logical
expression is true, and expression_2 will only be applied in cases when the
logical expression is false.
Finally, users have the option of branching even further. With cases for which
the first logical expression is false, thereby calling the expressions following else,
these cases can be branched again with another if. . . else statement. In fact, the
programmer can nest as many if. . . else statements as he or she would like. To
illustrate this, consider again the case when we simulate two random numbers from
1 to 10. Imagine this time we want to know not only how often we draw two even
numbers, but also how often we draw two odd numbers and how often we draw
one even number and one odd. One way we could address this is by keeping three
running tallies (below named even.count, odd.count, and split.count)
and adding additional branching statements:
even.count<-0
odd.count<-0
split.count<-0
for (i in 1:100){
  a<-sample(c(1:10),2,replace=TRUE)
  if (sum(a%%2)==0){
    even.count<-even.count+1
  } else if (sum(a%%2)==2){
    odd.count<-odd.count+1
  } else{
    split.count<-split.count+1
  }
}
even.count
odd.count
split.count
Our for loop starts the same as before, but after our first if statement, we follow
with an else statement. Any sample that did not consist of two even numbers is
now subject to the commands under else, and the first command under else
is. . . another if statement. This next if statement observes that if both terms in the
sample are odd, then the sum of the remainders after dividing by two will be two.
Hence, all samples with two odd entries now are subject to the commands of this
new if statement, where we see that our odd.count index will be increased by
one. Lastly, we have a final else statement: all samples that did not consist of two
even or two odd numbers will now be subject to this final set of commands. Since
these samples consist of one even and one odd number, the split.count index
will be increased by one. Try this out yourself. Again, as a hint, probability theory
indicates that on average 50 % of samples should consist of one even and one odd
number, 25 % should consist of two even numbers, and 25 % should consist of two
odd numbers.
11.5 Optimization and Maximum Likelihood Estimation
5
See Nocedal and Wright (1999) for a review of how these techniques work.
6
When using maximum likelihood estimation, an alternative to optim is the maxLik
package. This can be useful for maximum likelihood specifically, though optim has
the advantage of being able to maximize or minimize other kinds of functions as well, when
appropriate.
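The definition of the log-likelihood function used in this running example (estimating the probability of success from 43 successes in 100 trials) does not survive in this excerpt. A minimal sketch consistent with how it is called below (the name binomial.loglikelihood and the arguments y and n come from those calls; the body is the standard binomial log-likelihood with the combinatorial constant omitted, which is an assumption):
binomial.loglikelihood <- function(prob, y, n) {
  # log-likelihood of y successes in n trials with success probability prob;
  # the constant term does not depend on prob and so does not affect the maximum
  loglik <- y*log(prob) + (n - y)*log(1 - prob)
  return(loglik)
}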
Now to estimate π̂, we need R to find the value of π that maximizes the log-
likelihood function given the data. (We call this term prob in the code to avoid
confusion with R's use of the command pi to store the geometric constant.) We can
do this with the optim command. We compute:
test <- optim(c(.5), # starting value for prob
binomial.loglikelihood, # the log-likelihood function
method="BFGS", # optimization method
hessian=TRUE, # return numerical Hessian
control=list(fnscale=-1), # maximize instead of minimize
y=43, n=100) # the data
print(test)
Remember that everything after a pound sign (#) is a comment that R ignores,
so these notes serve to describe each line of code. We always start with a vector
of starting values with all parameters to estimate (just π in this case), name the
log-likelihood function we defined elsewhere, choose our optimization method
(Broyden-Fletcher-Goldfarb-Shanno is often a good choice), and indicate that we
want R to return the numerical Hessian so we can compute standard errors later.
The fifth line of code is vitally important: By default optim is a minimizer, so we
have to specify fnscale=-1 to make it a maximizer. Any time you use optim
for maximum likelihood estimation, this line will have to be included. On the sixth
line, we list our data. Often, we will call a matrix or data frame here, but in this case
we need only list the values of y and n.
Our output from print(test) looks like the following:
$par
[1] 0.4300015
$value
[1] -68.33149
$counts
function gradient
13 4
$convergence
[1] 0
$message
NULL
$hessian
[,1]
[1,] -407.9996
To interpret our output: The par term lists the estimates of the parameters, so our
estimate is π̂ = 0.43 (as we anticipated). The log-likelihood for our final solution
is −68.33149, and is presented under value. The term counts tells us how often
optim had to call the function and the gradient. The term convergence will
be coded 0 if the optimization was successfully completed; any other value is an
error code. The message item may return other necessary information from the
optimizer. Lastly, the hessian is our matrix of second derivatives of the log-
likelihood function. With only one parameter here, it is a simple 1 × 1 matrix. In
general, if the user wants standard errors from an estimated maximum likelihood
model, the following line will return them:
sqrt(diag(solve(-test$hessian)))
This line is based on the formula for standard errors in maximum likelihood
estimation. All the user will ever need to change is to replace the word test with
the name associated with the call to optim. In this case, R reports that the standard
error is SE(π̂) = 0.0495074.
Finally, in this case in which we have a single parameter of interest, we have
the option of using our defined log-likelihood function to draw a picture of the
optimization problem. Consider the following code:
ruler <- seq(0,1,0.01)
loglikelihood <- binomial.loglikelihood(ruler, y=43, n=100)
plot(ruler, loglikelihood, type="l", lwd=2, col="blue",
xlab=expression(pi),ylab="Log-Likelihood",ylim=c(-300,-70),
main="Log-Likelihood for Binomial Model")
abline(v=.43)
The first line defines all values that π possibly can take. The second line inserts into
the log-likelihood function the vector of possible values of π, plus the true values of
y and n. The third through fifth lines are a call to plot that gives us a line graph of
the log-likelihood function. The last line draws a vertical line at our estimated value
for π̂. The result of this is shown in Fig. 11.4. While log-likelihood functions with
many parameters usually cannot be visualized, this does remind us that the function
we have defined can still be used for purposes other than optimization, if need be.
Fig. 11.4 Binomial log-likelihood across all possible values of the probability parameter when the
data consist of 43 successes in 100 trials
7
To see the full original version of the program, consult http://hdl.handle.net/1902.1/16781. Note
that here we use the S3 object family, but the original code uses the S4 family. The original
program also considers alternative voter utility functions besides the quadratic proximity function
listed here, such as an absolute proximity function and the directional model of voter utility
(Rabinowitz and Macdonald 1989).
Quadratic.A<-function(m.1,m.2,p,delta,theta.A,theta.D){
  util.a<-plogis(-(m.1-theta.A)^2+(m.1-theta.D)^2+p)+
    delta*plogis(-(m.2-theta.A)^2+(m.2-theta.D)^2)
  return(util.a)
}
Quadratic.D<-function(m.1,m.2,p,delta,theta.A,theta.D){
util.d<-(1-plogis(-(m.1-theta.A)^2+
(m.1-theta.D)^2+p))+
delta*(1-plogis(-(m.2-theta.A)^2+
(m.2-theta.D)^2))
return(util.d)
}
#matrix attributes
rownames(outcomeA)<-colnames(outcomeA)<-rownames(outcomeD)<-
colnames(outcomeD)<-names(bestResponseA)<-
names(bestResponseD)<-theta
8
If you preferred to use the S4 object system for this exercise instead, the next thing we
would need to do is define the object class before defining the function. If we wanted
to call our object class simulation, we would type: setClass("simulation",
representation(outcomeA="matrix", outcomeD="matrix",
bestResponseA="numeric", bestResponseD="numeric",
equilibriumA="character", equilibriumD="character")). The S3 object
system does not require this step, as we will see when we create objects of our self-defined
game.simulation class.
As a general tip, when writing a long function, it is best to test code outside
of the function wrapper. For bonus points and to get a full sense of this code,
you may want to break this function into its components and see how each piece
works. As the comments throughout show, each set of commands does something
we normally might do in a function. Notice that when the arguments are defined,
both m.1 and theta are given default values. This means that in future uses, if we
do not specify the values of these inputs, the function will use these defaults, but if
we do specify them, then the defaults will be overridden. Turning to the arguments
within the function, it starts off by defining internal parameters and setting their
attributes. Every set of commands thereafter is a loop within a loop to repeat certain
commands and fill output vectors and matrices.
The key addition here that is unique to anything discussed before is in the
definition of the result term at the end of the function. Notice that in doing this,
we first define the output as a list, in which each component is named. In this
case, we already named our objects the same as the way we want to label them
in the output list, but that may not always be the case.9 We then use the class
command to declare that our output is of the game.simulation class, a concept
we are now creating. The class command formats this as an object in the S3
family. By typing invisible(result) at the end of the function, we know that
our function will return this game.simulation-class object.10
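The same S3 pattern can be illustrated with a tiny generic example; the point class here is purely illustrative and is not part of the book's program:
make.point <- function(x, y){
  result <- list(x=x, y=y)     # named list gathers the output
  class(result) <- "point"     # declare the S3 class of the output
  invisible(result)            # return the object without printing it
}
p <- make.point(1, 2)
class(p)   # prints "point"
p$x        # prints 1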
Now that this function is defined, we can put it to use. Referring to the terms
in Eq. (11.1), suppose that V = 0.1, δ = 0, m1 = 0.7, and m2 = −0.1. In the
code below, we create a new object called treatment.1 using the simulate
function when the parameters take these values, and then we ask R to print the
output:
treatment.1<-simulate(v=0.1,delta=0.0,m.2=-0.1)
treatment.1
Notice that we did not specify m.1 because 0.7 already is the default value for
that parameter. The output from the printout of treatment.1 is too lengthy to
reproduce here, but on your own screen you will see that treatment.1 is an
object of the game.simulation class, and the printout reports the values for
each attribute.
If we are only interested in a particular feature from this result, we can ask R to
return only the value of a specific slot. For an S3 object, which is how we saved our
result, we can call an attribute by naming the object, using the $ symbol, and then
naming the attribute we want to use. (This contrasts from S4 objects, which use the
@ symbol to call slots.) For example, if we only wanted to see what the equilibrium
choices for parties A and D were, we could simply type:
treatment.1$equilibriumA
treatment.1$equilibriumD
9
For example, in the phrase outcomeA=outcomeA, the term to the left of the equals sign states
that this term in the list will be named outcomeA, and the term to the right calls the object of this
name from the function to fill this spot.
10
The code would differ slightly for the S4 object system. If we defined our simulation
class as the earlier footnote describes, here we would replace the definition of
result as follows: result<-new("simulation",outcomeA=outcomeA,
outcomeD=outcomeD,bestResponseA=bestResponseA,
bestResponseD=bestResponseD, equilibriumA=equilibriumA,
equilibriumD=equilibriumD). Replacing the list command with the new command
assigns the output to our simulation class and fills the various slots all in one step. This means
we can skip the extra step of using the class command, which we needed when using the S3
system.
We accomplished the finer precision by substituting our own vector in for theta.
Our output values of the two equilibrium slots are again the same:
[1] "0.63"
So we know under this second treatment, the equilibrium for the game is θA = θD =
0.63. Substantively, what happened here is we increased the value of winning the
second election to the parties. As a result, the parties moved their issue positions a
little closer to the median voter's issue preference in the second election.
11
If we had instead saved treatment.1 as an S4 simulation object as the
prior footnotes described, the command to call a specific attribute would instead be:
treatment.1@equilibriumA.
In this setup, we use observable predictors to model the utility, or benefit, each
player gets from each possible outcome. This kind of model is interesting
because the population data-generating process typically has nonlinearities in it that
are not captured by standard approaches. Hence, we will have to program our own
likelihood function and use optim to estimate it.12
To motivate how the strategic multinomial probit is set up, consider a substantive
model of behavior between two nations, an aggressor (1) and a target (2). This
is a two-stage game: First, the aggressor decides whether to attack (A) or not
(¬A); then, the target decides whether to defend (D) or not (¬D). Three observable
consequences could occur: if the aggressor decides not to attack, the status quo
holds; if the aggressor attacks, the target may choose to capitulate; finally, if
the target defends, we have war. The game tree is presented in Fig. 11.5. At the
end of each branch are the utilities to players (1) and (2) for status quo (SQ),
capitulation (C), and war (W), respectively. We make the assumption that these
players are rational and therefore choose the actions that maximize the resulting
payoff. For example, if the target is weak, the payoff of war U2(W) would be terribly
negative, and probably lower than the payoff for capitulation U2(C). If the aggressor
knows this, it can deduce that in the case of attack, the target would capitulate. The
aggressor therefore would know that if it chooses to attack, it will receive the payoff
U1(C) (and not U1(W)). If the payoff of capitulation U1(C) is bigger than the payoff
of the status quo U1(SQ) for the aggressor, the rational decision is to attack (and the
target's rational decision is to capitulate). Of course, the goal is to get a sense of
what will happen as circumstances change across different dyads of nations.
In general, we assume the utility functions respond to observable specific
circumstances (the value of the disputed resources, the expected diplomatic price
due to sanctions in case of aggression, the military might, and so forth). We
12
Readers interested in doing research like this in their own work should read about the games
package, which was developed to estimate this kind of model.
will also assume that the utilities for each country's choice are stochastic. This
introduces uncertainty into our payoff model, representing the fact that this model is
incomplete, and probably excludes some relevant parameters from the evaluation. In
this exercise, we will consider only four predictors Xi to model the utility functions,
and assume that the payoffs are linear in these variables.
In particular, we consider the following parametric form for the payoffs:
\[
\begin{aligned}
U_1(SQ) &= 0 \\
U_1(C) &= X_1\beta_1 \\
U_1(W) &= X_2\beta_2 \qquad (11.4) \\
U_2(C) &= 0 \\
U_2(W) &= X_3\beta_3 + X_4\beta_4 \\
\epsilon &\sim N(0, 0.5)
\end{aligned}
\]
Each nation makes its choice based on which decision will give it a bigger utility.
This is based on the known information, plus an unknown private disturbance (ε) for
each choice. This private disturbance adds a random element to actors' decisions.
We as researchers can look at past conflicts, measure the predictors (X), observe
the outcome, and would like to infer: How important are X1 , X2 , X3 , and X4 in the
behavior of nation-dyads? We have to determine this based on whether each data
point resulted in status quo, capitulation, or war.
Notice that p is in the equation for q. This nonlinear feature of the model is not
accommodated by standard canned models.
Knowing the formulae for p and q, we know the probabilities of status quo,
capitulation, and war. Therefore, the likelihood function is simply the product of
the probabilities of each event, raised to a dummy of whether the event happened,
multiplied across all observations:
\[
L(\beta \mid y, X) = \prod_{i=1}^{n} (1-q)^{D_{SQ}} \left(q(1-p)\right)^{D_C} (pq)^{D_W} \qquad (11.7)
\]
where D_SQ, D_C, and D_W are dummy variables equal to 1 if the case is status quo,
capitulation, or war, respectively, and 0 otherwise. The log-likelihood function is:
\[
\ell(\beta \mid y, X) = \sum_{i=1}^{n} \left[ D_{SQ}\log(1-q) + D_C\log\!\left(q(1-p)\right) + D_W\log(pq) \right] \qquad (11.8)
\]
We are now ready to begin programming this into R. We begin by cleaning up and
then defining our log-likelihood function as llik, which again includes comments
to describe parts of the function's body:
rm(list=ls())
llik=function(B,X,Y) {
#Separate data matrices to individual variables:
sq=as.matrix(Y[,1])
cap=as.matrix(Y[,2])
war=as.matrix(Y[,3])
X13=as.matrix(X[,1])
X14=as.matrix(X[,2])
X24=as.matrix(X[,3:4])
lnpwar=log(P2*P4)
llik=sq*lnpsq+cap*lnpcap+war*lnpwar
return(sum(llik))
}
While substantially longer than the likelihood function we defined in Sect. 11.5, the
idea is the same. The function still accepts parameter values, independent variables,
and dependent variables, and it still returns a log-likelihood value. With a more
complex model, though, the function needs to be broken into component steps.
First, our data are now matrices, so the first batch of code separates the variables
by equation from the model. Second, our parameters are all coefficients stored in
the argument B, so we need to separate the coefficients by equation from the model.
Third, we matrix multiply the variables by the coefficients to create the three utility
terms. Fourth, we use that information to compute the probabilities that the target
will defend or not. Fifth, we use the utilities, plus the probability of the targets
actions, to determine the probability the aggressor will attack or not. Lastly, the
probabilities of all outcomes are used to create the log-likelihood function.
Now that we have defined our likelihood function, we can use it to simulate data and
fit the model over our simulated data. With any Monte Carlo experiment, we need
to start by defining the number of experiments we will run for a given treatment
and the number of simulated data points in each experiment. We also need to define
empty spaces for the outputs of each experiment. We type:
set.seed(3141593)
i<-100 #number of experiments
n<-1000 #number of cases per experiment
beta.qre<-matrix(NA,i,4)
stder.qre<-matrix(NA,i,4)
We start by using set.seed to make our Monte Carlo results more replicable.
Here we let i be our number of experiments, which we set to be 100. (Though
normally we might prefer a bigger number.) We use n as our number of cases. This
allows us to define beta.qre and stder.qre as the output matrices for our
estimates of the coefficients and standard errors from our models, respectively.
With this in place, we can now run a big loop that will repeatedly simulate a
dataset, estimate our strategic multinomial probit model, and then record the results.
The loop is as follows:
for(j in 1:i){
#Simulate Causal Variables
x1<-rnorm(n)
x2<-rnorm(n)
x3<-rnorm(n)
x4<-rnorm(n)
#Fit Model
strat.mle<-optim(stval,llik,hessian=TRUE,method="BFGS",
control=list(maxit=2000,fnscale=-1,trace=1),
X=indvar,Y=depvar)
#Save Results
beta.qre[j,]<-strat.mle$par
stder.qre[j,]<-sqrt(diag(solve(-strat.mle$hessian)))
}
In this model, we set β = 1 for every coefficient in the population model. Otherwise,
Eq. (11.4) completely defines our population model. In the loop the first three
batches of code all generate data according to this model. First, we generate four
independent variables, each with a standard normal distribution. Second, we define
the utilities according to Eq. (11.4), adding in the random disturbance terms ()
as indicated in Fig. 11.5. Third, we create the values of the dependent variables
based on the utilities and disturbances. After this, the fourth step is to clean our
simulated data; we define starting values for optim and bind the dependent and
independent variables into matrices. Fifth, we use optim to actually estimate our
model, naming the within-iteration output strat.mle. Lastly, the results from the
model are written to the matrices beta.qre and stder.qre.
After running this loop, we can get a quick look at the average value of our
estimated coefficients (β̂) by typing:
apply(beta.qre,2,mean)
In this call to apply, we study our matrix of regression coefficients in which each
row represents one of the 100 Monte Carlo experiments, and each column represents
one of the four regression coefficients. We are interested in the coefficients, so we
type 2 to study the columns, and then take the mean of the columns. While your
results may differ somewhat from what is printed here, particularly if you did not
set the seed to be the same, the output should look something like this:
[1] 1.0037491 1.0115165 1.0069188 0.9985754
In short, all four estimated coefficients are close to 1, which is good because 1 is the
population value for each. If we wanted to automate our results a little more to tell
us the bias in the parameters, we could type something like this:
deviate <- sweep(beta.qre, 2, c(1,1,1,1))
colMeans(deviate)
On the first line, we use the sweep command, which sweeps out (or subtracts)
the summary statistic of our choice from a matrix. In our case, beta.qre is our
matrix of coefficient estimates across 100 simulations. The 2 argument we enter
separately indicates to apply the statistic by column (e.g., by coefficient) instead
of by row (which would have been by experiment). Lastly, instead of listing an
empirical statistic we want to subtract away, we simply list the true population
parameters. On the second line, we calculate the bias by taking the mean deviation
by column, or by parameter. Again, the output can vary with
different seeds and number generators, but in general it should indicate that the
average values are not far from the true population values. All of the differences are
small:
[1] 0.003749060 0.011516459 0.006918824 -0.001424579
Another worthwhile quantity is the mean absolute error, which tells us how much
an estimate differs from the population value on average. This is a bit different in
that overestimates and underestimates cannot wash out (as they could with the bias
calculation). An unbiased estimator still can have a large error variance and a large
mean absolute error. Since we already defined deviate before, now we need only
type:
colMeans(abs(deviate))
that takes two arguments, x and n, and returns h(x, n). Using the function,
determine the values of h(0.8, 30), h(0.7, 50), and h(0.95, 20).
4. Maximum likelihood estimation. Consider an applied example of Signorino's
(1999) strategic multinomial probit method. Download a subset of nineteenth
century militarized interstate disputes, the Stata-formatted file war1800.dta,
from the Dataverse (see page vii) or this chapters online content (see page 205).
These data draw from sources such as EUGene (Bueno de Mesquita and Lalman
1992) and the Correlates of War Project (Jones et al. 1996). Program a likelihood
function for a model like the one shown in Fig. 11.5 and estimate the model
for these real data. The three outcomes are: war (coded 1 if the countries went
to war), sq (coded 1 if the status quo held), and capit (coded 1 if the target
country capitulated). Assume that U1(SQ) is driven by peaceyrs (number of
years since the dyad was last in conflict) and s_wt_re1 (S score for political
similarity of states, weighted for the aggressor's region). U1(W) is a function of
balanc (the aggressor's military capacity relative to the combined capacity of
References
Alvarez RM, Levin I, Pomares J, Leiras M (2013) Voting made safe and easy: the impact of
e-voting on citizen perceptions. Polit Sci Res Methods 1(1):117137
Bates D, Maechler M, Bolker B, Walker S (2014) lme4: linear mixed-effects models using Eigen
and S4. R package version 1.1-7. http://www.CRAN.R-project.org/package=lme4
Becker RA, Cleveland WS, Shyu M-J (1996) The visual design and control of Trellis display.
J Comput Graph Stat 5(2):123155
Beniger JR, Robyn DL (1978) Quantitative graphics in statistics: a brief history. Am Stat
32(1):111
Berkman M, Plutzer E (2010) Evolution, creationism, and the battle to control Americas
classrooms. Cambridge University Press, New York
Black D (1948) On the rationale of group decision-making. J Polit Econ 56(1):2334
Black D (1958) The theory of committees and elections. Cambridge University Press, London
Box GEP, Tiao GC (1975) Intervention analysis with applications to economic and environmental
problems. J Am Stat Assoc 70:7079
Box GEP, Jenkins GM, Reinsel GC (2008) Time series analysis: forecasting and control, 4th edn.
Wiley, Hoboken, NJ
Box-Steffensmeier JM, Freeman JR, Hitt MP, Pevehouse JCW (2014) Time series analysis for the
social sciences. Cambridge University Press, New York
Brambor T, Clark WR, Golder M (2006) Understanding interaction models: improving empirical
analyses. Polit Anal 14(1):6382
Brandt PT, Freeman JR (2006) Advances in Bayesian time series modeling and the study of
politics: theory testing, forecasting, and policy analysis. Polit Anal 14(1):136
Brandt PT, Williams JT (2001) A linear Poisson autoregressive model: the Poisson AR(p) model.
Polit Anal 9(2):164184
Brandt PT, Williams JT (2007) Multiple time series models. Sage, Thousand Oaks, CA
Bueno de Mesquita B, Lalman D (1992) War and reason. Yale University Press, New Haven
Carlin BP, Louis TA (2009) Bayesian methods for data analysis. Chapman & Hall/CRC, Boca
Raton, FL
Chang W (2013) R graphics cookbook. OReilly, Sebastopol, CA
Cleveland WS (1993) Visualizing data. Hobart Press, Sebastopol, CA
Cowpertwait PSP, Metcalfe AV (2009) Introductory time series with R. Springer, New York
Cryer JD, Chan K-S (2008) Time series analysis with applications in R, 2nd edn. Springer,
New York
Downs A (1957) An economic theory of democracy. Harper and Row, New York
Eliason SR (1993) Maximum likelihood estimation: logic and practice. Sage, Thousand Oaks, CA
Enders W (2009) Applied econometric time series, 3rd edn. Wiley, New York
Fitzmaurice GM, Laird NM, Ware JH (2004) Applied longitudinal analysis. Wiley-Interscience,
Hoboken, NJ
Fogarty BJ, Monogan JE III (2014) Modeling time-series count data: the unique challenges facing
political communication studies. Soc Sci Res 45:7388
Gelman A, Hill J (2007) Data analysis using regression and multilevel/hierarchical models.
Cambridge University Press, New York
Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian data analysis, 2nd edn. Chapman &
Hall/CRC, Boca Raton, FL
Gibney M, Cornett L, Wood R, Haschke P (2013) Political terror scale, 19762012. Retrieved
December 27, 2013 from the political terror scale web site: http://www.politicalterrorscale.org
Gill J (2001) Generalized linear models: a unified approach. Sage, Thousand Oaks, CA
Gill J (2008) Bayesian methods: a social and behavioral sciences approach, 2nd edn. Chapman &
Hall/CRC, Boca Raton, FL
Granger CWJ (1969) Investigating causal relations by econometric models and cross spectral
methods. Econometrica 37:424438
Granger CWJ, Newbold P (1974) Spurious regressions in econometrics. J Econ 26:10451066
Gujarati DN, Porter DC (2009) Basic econometrics, 5th edn. McGraw-Hill/Irwin, New York
Halley E (1686) An historical account of the trade winds, and monsoons, observable in the seas
between and near the tropicks, with an attempt to assign the phisical cause of the said winds.
Philos Trans 16(183):153168
Hamilton JD (1994) Time series analysis. Princeton University Press, Princeton, NJ
Hanmer MJ, Kalkan KO (2013) Behind the curve: clarifying the best approach to calculating
predicted probabilities and marginal effects from limited dependent variable models. Am J Polit
Sci 57(1):263–277
Honaker J, King G, Blackwell M (2011) Amelia II: a program for missing data. J Stat Softw
45(7):1–47
Hotelling H (1929) Stability in competition. Econ J 39(153):41–57
Huber PJ (1967) The behavior of maximum likelihood estimates under nonstandard conditions.
In: LeCam LM, Neyman J (eds) Proceedings of the 5th Berkeley symposium on mathematical
statistics and probability, volume 1: statistics. University of California Press, Berkeley, CA
Iacus SM, King G, Porro G (2009) cem: software for coarsened exact matching. J Stat Softw
30(9):1–27
Iacus SM, King G, Porro G (2011) Multivariate matching methods that are monotonic imbalance
bounding. J Am Stat Assoc 106(493):345–361
Iacus SM, King G, Porro G (2012) Causal inference without balance checking: coarsened exact
matching. Polit Anal 20(1):1–24
Imai K, van Dyk DA (2004) Causal inference with general treatment regimes: generalizing the
propensity score. J Am Stat Assoc 99(467):854–866
Jones DM, Bremer SA, Singer JD (1996) Militarized interstate disputes, 1816–1992: rationale,
coding rules, and empirical patterns. Confl Manag Peace Sci 15(2):163–213
Kastellec JP, Leoni EL (2007) Using graphs instead of tables in political science. Perspect Polit
5(4):755–771
Keele L, Kelly NJ (2006) Dynamic models for dynamic theories: the ins and outs of lagged
dependent variables. Polit Anal 14(2):186–205
King G (1989) Unifying political methodology. Cambridge University Press, New York
King G, Honaker J, Joseph A, Scheve K (2001) Analyzing incomplete political science data: an
alternative algorithm for multiple imputation. Am Polit Sci Rev 95(1):49–69
Koyck LM (1954) Distributed lags and investment analysis. North-Holland, Amsterdam
Laird NM, Fitzmaurice GM (2013) Longitudinal data modeling. In: Scott MA, Simonoff JS, Marx
BD (eds) The Sage handbook of multilevel modeling. Sage, Thousand Oaks, CA
LaLonde RJ (1986) Evaluating the econometric evaluations of training programs with experimental
data. Am Econ Rev 76(4):604–620
Little RJA, Rubin DB (1987) Statistical analysis with missing data, 2nd edn. Wiley, New York
Long JS (1997) Regression models for categorical and limited dependent variables. Sage,
Thousand Oaks, CA
Lowery D, Gray V, Monogan JE III (2008) The construction of interest communities: distinguishing
bottom-up and top-down models. J Polit 70(4):1160–1176
Lütkepohl H (2005) New introduction to multiple time series analysis. Springer, New York
Martin AD, Quinn KM, Park JH (2011) MCMCpack: Markov chain Monte Carlo in R. J Stat Softw
42(9):1–21
Mátyás L, Sevestre P (eds) (2008) The econometrics of panel data: fundamentals and recent
developments in theory and practice, 3rd edn. Springer, New York
McCarty NM, Poole KT, Rosenthal H (1997) Income redistribution and the realignment of
American politics. American enterprise institute studies on understanding economic inequality.
AEI Press, Washington, DC
Monogan JE III (2011) Panel data analysis. In: Badie B, Berg-Schlosser D, Morlino L (eds)
International encyclopedia of political science. Sage, Thousand Oaks, CA
Monogan JE III (2013a) A case for registering studies of political outcomes: an application in the
2010 House elections. Polit Anal 21(1):21–37
Monogan JE III (2013b) Strategic party placement with a dynamic electorate. J Theor Polit
25(2):284–298
Nocedal J, Wright SJ (1999) Numerical optimization. Springer, New York
Owsiak AP (2013) Democratization and international border agreements. J Polit 75(3):717–729
Peake JS, Eshbaugh-Soha M (2008) The agenda-setting impact of major presidential TV addresses.
Polit Commun 25:113–137
Petris G, Petrone S, Campagnoli P (2009) Dynamic linear models with R. Springer, New York
Pfaff B (2008) Analysis of integrated and cointegrated time series with R, 2nd edn. Springer,
New York
Playfair W (1786/2005) In: Wainer H, Spence I (eds) Commercial and political atlas and statistical
breviary. Cambridge University Press, New York
Poe SC, Tate CN (1994) Repression of human rights to personal integrity in the 1980s: a global
analysis. Am Polit Sci Rev 88(4):853–872
Poe SC, Tate CN, Keith LC (1999) Repression of the human right to personal integrity revisited: a
global cross-national study covering the years 1976–1993. Int Stud Q 43(2):291–313
Poole KT, Rosenthal H (1997) Congress: a political-economic history of roll call voting. Oxford
University Press, New York
Poole KT, Lewis J, Lo J, Carroll R (2011) Scaling roll call votes with wnominate in R. J Stat
Softw 42(14):1–21
Rabinowitz G, Macdonald SE (1989) A directional theory of issue voting. Am Polit Sci Rev
83:93–121
Robert CP (2001) The Bayesian choice: from decision-theoretic foundations to computational
implementation, 2nd edn. Springer, New York
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
Rubin DB (2006) Matched sampling for causal effects. Cambridge University Press, New York
Scott MA, Simonoff JS, Marx BD (eds) (2013) The Sage handbook of multilevel modeling. Sage,
Thousand Oaks, CA
Sekhon JS, Grieve RD (2012) A matching method for improving covariate balance in cost-
effectiveness analyses. Health Econ 21(6):695–714
Shumway RH, Stoffer DS (2006) Time series analysis and its applications with R examples, 2nd
edn. Springer, New York
Signorino CS (1999) Strategic interaction and the statistical analysis of international conflict. Am
Polit Sci Rev 93:279–297
Signorino CS (2002) Strategy and selection in international relations. Int Interact 28:93–115
Signorino CS, Yilmaz K (2003) Strategic misspecification in regression models. Am J Polit Sci
47:551–566
Sims CA, Zha T (1999) Error bands for impulse responses. Econometrica 67(5):1113–1155
Singh SP (2014a) Linear and quadratic utility loss functions in voting behavior research. J Theor
Polit 26(1):35–58
Singh SP (2014b) Not all election winners are equal: satisfaction with democracy and the nature
of the vote. Eur J Polit Res 53(2):308–327
Singh SP (2015) Compulsory voting and the turnout decision calculus. Polit Stud 63(3):548–568
Tufte ER (2001) The visual display of quantitative information, 2nd edn. Graphics Press, Cheshire,
CT
Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading, MA
Wakiyama T, Zusman E, Monogan JE III (2014) Can a low-carbon-energy transition be sustained
in post-Fukushima Japan? Assessing the varying impacts of exogenous shocks. Energy Policy
73:654–666
Wei WWS (2006) Time series analysis: univariate and multivariate methods, 2nd edn. Pearson,
New York
White H (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for
heteroskedasticity. Econometrica 48(4):817–838
Yau N (2011) Visualize this: the FlowingData guide to design, visualization, and statistics. Wiley,
Indianapolis
Index
cor, 75
correlation coefficient (r), 74, 160
cross-tabulations, 71
CrossTable, 71

D
data
  merging, 26
  missing, 14, 22
  panel data formatting, 29
  reading in, 14
  recoding, 21, 24
  reshaping, 29
  subsetting, 23
  viewing, 19
  writing out, 18
data frame, 17, 192
democratization and international border settlements data, 96
densityplot, 48
densplot, 138
descriptive statistics
  central tendency, 54
  dispersion measures, 60
  frequency tables, 57
  inter-quartile range, 56
  mean, 54, 194
  median, 56
  median absolute deviation, 60
  mode, 58
  percentiles, 61
  quantiles, 61
  standard deviation, 60
  variance, 60, 194
det, 197
detach, 18
dev.off, 49
diag, 192, 217, 227
dim, 28, 194
dotplot, 48
drug policy coverage data, 50, 62, 125
Durbin-Watson test, 88, 169
dwtest, 88, 169

E
energy policy coverage data, 34, 117, 158, 186
error message, 4, 195

F
file.choose, 17
fitted, 184
fix, 19
for, 210, 213, 214, 220, 227
foreign, 138
function, 207, 209, 215, 219, 220, 226
  default values, 221

G
generalized linear model, 99
  for binary outcomes, 100
  for count outcomes, 116
  for ordinal outcomes, 110
  link function, 99, 105, 112, 114, 115, 117, 118
getwd, 7, 19
geweke.diag, 137, 139
glm, 99
  family argument, 99
  for logit model, 101, 103
  for Poisson model, 117
  for probit model, 104
glmer, 132
graphs
  bar graph, 38, 39, 59
  box-and-whisker plot, 36, 37
  density plot, 48, 55, 137, 138
  dot plot, 47
  histogram, 35, 36, 48, 91
  jittering, 86
  line plot, 43, 108, 110, 123, 167, 168, 208, 210, 218
  quantile-quantile plot, 91
  saving, 49
  scatterplot, 41, 47, 76, 86, 87, 153
  spike plot, 45, 212
  univariate, 35
grep, 184

H
hat values, 95
head, 19, 26, 28, 29, 149
health lobbying state data, 34
help files, 10
hist, 35, 91
histogram, 48

I
I, 88
if, 212, 213
if...else, 213, 214
imbalance, 141, 144, 147
influenceIndexPlot, 94
t.test
  for a single mean, 65
  paired, 69
  two sample, 67
table, 22, 57, 102
text, 40, 46
tiff, 50
time series, 157
  AIC, 165, 177
  ARIMA model, 158, 161, 163, 164
  autocorrelation function, 159, 160, 162, 178, 179
  Box-Jenkins approach, 158
  Cochrane-Orcutt estimator, 171
  dynamic model, 163, 172–174
  econometric approaches, 167
  Feasible Generalized Least Squares, 170
  further resources, 181
  Granger causality test, 176, 178, 180
  impulse response analysis, 180–182
  integrated (unit root), 159
  Koyck model, 172, 173
  lag, 159, 172, 174, 176
  orthogonalization of residuals, 180
  single and multiple equation models, 157
  static model, 163, 168, 170
  stationary and nonstationary, 158
  transfer function, 164
  trending, 159
  unrestricted distributed lag, 174
  variable ordering, 180
  vector autoregression, 175, 177, 179
trellis.device, 50
ts, 167, 172, 183, 184
tsdiag, 162

U
U.N. roll call vote data, 148
U.S. congressional election 2010 data
  Arizona, 202, 203
  Tennessee, 187, 188
U.S. monetary and economic data, 186

V
VAR, 177
var, 60, 194
variable or attribute referencing ($), 18, 35, 143, 152, 162, 166, 177, 183, 217, 222, 223
variance inflation factor, 93
VARselect, 177
vcovHC, 89
vector, 17
  addition, 195
  character, 25
  commands for, 194
  creating, 188
  factor, 25
  indices, 184, 193
  inner product (%*%), 196
  numeric, 17, 21, 146
  omitting a term, 106, 118
  outer product (%o%), 196
  sequential count (:), 166, 183, 189
vector algebra, 195
View, 19
vif, 93

W
which.max, 58, 220
while, 210
wnominate, 149
working directory, 7, 19
write.csv, 18, 152
write.dta, 18
write.table, 18

X
xtable, 83, 102
xyplot, 47

Z
z-ratio
  for logit model, 102
  for ordered logit model, 113
  for Poisson model, 118